Mar 29, 2026

VRAM Requirements for Every Major LLM: Complete Reference Table (2026)

Exact VRAM numbers for 60+ LLMs at every quantization level — including Qwen 3, Gemma 4, Llama 4, and gpt-oss. Model weights, KV cache, and total memory broken down so you know what GPU to buy before you download anything.

By Andre
1.0 Why VRAM is the only spec that matters

When you are shopping for a GPU to run local LLMs, the only number that determines whether a model loads or crashes is VRAM. Not CUDA cores. Not clock speed. Not tensor TFLOPS. VRAM sets a hard ceiling on which models you can run, which quantization levels you can afford, and how much context you can maintain before the system starts swapping to system RAM and your token speed tanks.

The relationship is straightforward: each parameter in a model occupies memory, and the total memory required depends on how those parameters are stored (quantization level), how much conversation history the model keeps (KV cache), and framework overhead. For a deeper explanation of the math behind these numbers, see Tim Dettmers' GPU analysis, which remains the most cited reference for VRAM budgeting.

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters × bytes_per_weight
KV_cache ≈ 2 × layers × kv_heads × head_dim × 2 × context_length

(In the KV cache formula, the first 2 accounts for the separate K and V tensors; the second 2 is bytes per element at FP16 precision.)
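As a worked example, here is that formula in Python, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via GQA, head dimension 128) at Q4_K_M with a 4K context. The 0.5 GB overhead figure is a rough assumption, not a measured value:

```python
def estimate_vram_gb(params, bytes_per_weight, layers, kv_heads, head_dim,
                     context_len, kv_bytes=2, overhead_gb=0.5):
    """Rough VRAM estimate: model weights + KV cache + framework overhead."""
    weights = params * bytes_per_weight                           # stored weights
    # Two tensors (K and V) per layer, one entry per token of context, at FP16
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9 + overhead_gb

# Llama 3.1 8B at Q4_K_M (~0.5 bytes/param), 4K context
print(round(estimate_vram_gb(8e9, 0.5, 32, 8, 128, 4096), 1))  # ≈ 5.0 GB
```

That lines up with the ~5 GB figure in the table below: 4 GB of weights, about 0.5 GB of KV cache at 4K context, and the rest is overhead.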
2.0 Tiny and small models (1B - 8B parameters)

Small models are where most people start. They run on budget GPUs, load in seconds, and still deliver useful output for coding assistance, summarization, and general chat. The 2026 generation brought major quality gains to this tier — Qwen 3 8B and Gemma 4 E4B punch well above their weight class, often matching older 14B models on benchmarks.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Qwen 3 0.6B | 0.6B | ~0.4 GB | ~0.7 GB | ~1.2 GB | Any GPU |
| Qwen 3.5 2B | 2B | ~1.2 GB | ~2.2 GB | ~4 GB | Any GPU |
| Phi-4 Mini | 3.8B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Qwen 3 4B | 4B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | ~5.5 GB | ~10 GB | 8 GB (any) |
| Qwen 2.5 7B | 7B | ~4.5 GB | ~8 GB | ~14 GB | 8 GB (Q4) |
| Qwen 3 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Llama 3.1 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |

All of these models fit comfortably on a single consumer GPU. An RTX 4060 with 8 GB or a used RTX 3060 12 GB is sufficient for Q4 inference on any model in this tier. The Gemma 4 MoE variants are notable because only 2-4B parameters are active per token despite loading 5-8B total, giving you better quality per GB of VRAM than dense models.

3.0 Mid-range models (9B - 35B parameters)

This is the sweet spot for quality versus hardware cost in 2026. Models in the 9B-35B range rival GPT-4-class output quality for many tasks, and several fit on a single 24 GB GPU at Q4 quantization. Qwen 3 32B and Gemma 4 31B are the standout daily drivers for 24 GB GPU owners.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Nemotron-Nano 9B | 9B | ~5.5 GB | ~10 GB | ~18 GB | 12 GB (Q4) |
| Mistral NeMo 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Gemma 3 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Qwen 3 14B | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| Phi-4 | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| gpt-oss 20B (MoE) | 20B | ~12 GB | ~22 GB | ~40 GB | 16 GB (Q4) |
| Mistral Small 3.1 24B | 24B | ~14 GB | ~26 GB | ~48 GB | 16 GB (Q4) |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB | ~28 GB | ~52 GB | 24 GB (Q4) |
| Qwen 3.5 27B | 27B | ~16 GB | ~29 GB | ~54 GB | 24 GB (Q4) |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB | ~33 GB | ~60 GB | 24 GB (Q4) |
| Gemma 4 31B | 31B | ~19 GB | ~34 GB | ~62 GB | 24 GB (Q4) |
| Qwen 3 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 2.5 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 3.5 35B-A3B (MoE) | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |
| Command R | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |

The MoE models in this tier are game-changers. Qwen 3 30B-A3B only activates 3B parameters per token despite loading 30B total — meaning you get 30B-level knowledge with near-instant inference speed. Similarly, gpt-oss 20B MoE only uses 3.6B active params. These models make 16 GB GPUs far more capable than they were a year ago.
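The MoE tradeoff is easy to see in numbers: VRAM scales with total parameters (every expert must be loaded), while per-token compute scales with active parameters. A minimal sketch, using the flat 0.5 bytes/param approximation for Q4 (real Q4_K_M files run a bit larger because some tensors stay at higher precision):

```python
def moe_profile(total_params_b, active_params_b, bytes_per_weight=0.5):
    """MoE tradeoff: VRAM follows TOTAL params (all experts loaded),
    per-token compute follows ACTIVE params."""
    vram_gb = total_params_b * bytes_per_weight        # must load every expert
    compute_ratio = active_params_b / total_params_b   # fraction used per token
    return vram_gb, compute_ratio

# Qwen 3 30B-A3B: 30B parameters loaded, only 3B active per token
vram, ratio = moe_profile(30, 3)
print(f"{vram:.0f} GB of weights at Q4, {ratio:.0%} of params active per token")
```

This is why a 30B MoE can decode at roughly the speed of a 3B dense model while still needing a 24 GB card to hold its weights.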

For 24 GB GPU owners, Qwen 3 32B at Q4_K_M fits in 19 GB with 5 GB left for context. This is the model that makes an RTX 4090 or used RTX 3090 feel like the right purchase.

4.0 Large models (70B - 150B parameters)

Large models are where local inference starts pushing against consumer hardware limits. The 2026 generation added several MoE models here that change the calculus — Llama 4 Scout and gpt-oss 120B offer flagship-level reasoning by only activating a fraction of their total parameters during inference.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Llama 3.3 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Qwen 2.5 72B | 72B | ~40 GB | ~74 GB | ~144 GB | 2×24 GB or 5090 |
| Command R+ 104B | 104B | ~57 GB | ~110 GB | ~208 GB | 3×24 GB |
| Llama 4 Scout (MoE) | 109B | ~60 GB | ~115 GB | ~218 GB | 3×24 GB |
| gpt-oss 120B (MoE) | 120B | ~65 GB | ~125 GB | ~240 GB | 3×24 GB |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB | ~128 GB | ~244 GB | 3×24 GB |
| Mixtral 8x22B (MoE) | 141B | ~78 GB | ~150 GB | ~282 GB | 4×24 GB |

**MoE changes the hardware equation.** Llama 4 Scout (109B) and gpt-oss 120B are MoE models where only a small fraction of parameters are active per token. While you still need enough VRAM to load all experts, inference speed is much faster than a dense 109B model. The Qwen 3.5 122B-A10B only activates 10B per token — near 8B speed with 122B knowledge.

For dense 70B models on a single GPU, the RTX 5090 (32 GB) can run them at Q3 or with partial offloading. For full Q4_K_M speed, you need dual 24 GB GPUs. See our RTX 5090 vs RTX 4090 comparison for the exact tradeoffs.
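How much offloading? A back-of-the-envelope sketch (the equal-sized-layer assumption and the 4 GB reservation for KV cache and overhead are illustrative simplifications, not any framework's exact accounting):

```python
import math

def layers_on_gpu(model_gb, n_layers, vram_gb, reserved_gb=4.0):
    """Estimate how many transformer layers fit in VRAM when the
    remainder is offloaded to system RAM."""
    per_layer_gb = model_gb / n_layers   # assume layers are equal-sized
    budget = vram_gb - reserved_gb       # leave room for KV cache + overhead
    return max(0, min(n_layers, math.floor(budget / per_layer_gb)))

# Llama 3.3 70B at Q4_K_M (~38 GB, 80 layers) on an RTX 5090 (32 GB)
print(layers_on_gpu(38, 80, 32))  # → 58 of 80 layers on the GPU
```

With roughly three quarters of the layers on the GPU, token speed drops well below full-GPU inference but stays usable; the same sketch shows a 2×24 GB setup holding all 80 layers.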

5.0 Flagship models (200B+)

The 2026 flagship tier has exploded with MoE architectures that pack enormous knowledge into models that only activate a fraction of their parameters. Qwen 3 235B, DeepSeek V4, Llama 4 Maverick, and Mistral Large 3 all compete at the frontier — but none fit on a single consumer GPU, even at Q2_K.

| Model | Total Params | Q2_K | Q4_K_M | Active/Token | Reality |
|---|---|---|---|---|---|
| Qwen 3 235B-A22B (MoE) | 235B | ~60 GB | ~128 GB | 22B | 3-4×24 GB at Q2 |
| DeepSeek V4-Flash (MoE) | 284B | ~73 GB | ~145 GB | varies | 4×24 GB at Q2 |
| GLM-4.5 | 355B | ~90 GB | ~185 GB | dense | 5-6×24 GB at Q2 |
| Qwen 3.5 397B-A17B (MoE) | 397B | ~100 GB | ~200 GB | 17B | 5×24 GB at Q2 |
| Llama 4 Maverick (MoE) | 400B | ~100 GB | ~200 GB | varies | 5×24 GB at Q2 |
| DeepSeek R1 (MoE) | 671B | ~170 GB | ~350 GB | 37B | No consumer setup |
| Mistral Large 3 (MoE) | 675B | ~170 GB | ~350 GB | varies | No consumer setup |
| Kimi K2.6 | ~1T | ~250 GB | ~500 GB | dense | Server only |
| DeepSeek V4-Pro (MoE) | 1.6T | ~400 GB | ~800 GB | varies | Server cluster only |

You can run distilled versions of these large models instead. DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-32B carry much of the reasoning capability at a fraction of the VRAM cost. The MoE models with low active parameter counts (Qwen 3 235B-A22B activates only 22B) are particularly interesting — if you can load all experts into VRAM, inference speed approaches much smaller models. Find distilled versions on Hugging Face.

6.0 How quantization changes the equation

Quantization is the single biggest lever you have for fitting models into limited VRAM. Moving from FP16 (2 bytes per parameter) to Q4_K_M (~0.5 bytes per parameter) cuts model weight memory by ~75%. The quality loss at Q4_K_M is typically 1-3% on standard benchmarks — imperceptible for most use cases. For the full quantization-by-quantization breakdown, see our dedicated quantization guide.

- FP16 (2.0 bytes/param): Full precision, maximum quality, double the VRAM of Q4
- Q8 (1.0 bytes/param): Near-native quality, good middle ground if you have spare VRAM
- Q5_K_M (0.625 bytes/param): Slightly better than Q4, moderate extra cost
- Q4_K_M (0.5 bytes/param): Best balance of quality and VRAM savings — most popular choice
- Q3_K_M (0.375 bytes/param): Noticeable quality drop but fits larger models on smaller GPUs
- Q2_K (0.25 bytes/param): Maximum compression, significant quality degradation
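To turn those rates into gigabytes, multiply parameter count by bytes per parameter. A sketch for a 70B model (this flat-rate math slightly undershoots real GGUF file sizes, since the K_M formats keep embedding and output tensors at higher precision):

```python
# Approximate average bytes per parameter for common GGUF quantization levels
BYTES_PER_PARAM = {
    "FP16": 2.0, "Q8": 1.0, "Q5_K_M": 0.625,
    "Q4_K_M": 0.5, "Q3_K_M": 0.375, "Q2_K": 0.25,
}

def weight_size_gb(params_b, quant):
    """Model-weight footprint only; KV cache and overhead come on top."""
    return params_b * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"{quant:>7}: {weight_size_gb(70, quant):6.1f} GB")
```

Running this shows the spread for a 70B model: 140 GB at FP16 down to 17.5 GB at Q2_K, which is why quantization, not GPU choice, is the first decision to make.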
7.0 Calculate your exact VRAM needs

These tables give you ballpark numbers for planning. But model architectures vary, context length requirements differ, and overhead depends on your inference framework. For a precise calculation tailored to your specific setup, use the VRAM calculator to enter your model, quantization, and context length and get an exact memory breakdown with GPU recommendations.

Frequently Asked Questions

**How much VRAM do I need for Llama 3.3 70B?**
Llama 3.3 70B requires approximately 38-40 GB VRAM at Q4_K_M quantization with 4K context. This means you need either an RTX 5090 (32 GB) with some CPU offloading or two 24 GB GPUs via tensor parallelism.

**Can I run Qwen 3 32B on a single 24 GB GPU?**
Yes. Qwen 3 32B at Q4_K_M needs approximately 19-20 GB VRAM with 4K context, which fits comfortably on an RTX 4090 or RTX 3090 with room for longer contexts.

**What GPU do I need for Llama 4 Scout?**
Llama 4 Scout is a 109B MoE model. At Q4_K_M it needs roughly 60 GB VRAM for the full model. You need at least 2-3×24 GB GPUs or significant CPU offloading. The MoE architecture means only a subset of experts are active per token, but all must be loaded into memory.

**What is the cheapest GPU that can run 14B models?**
A used RTX 3060 12 GB (~$180) can run 14B models like Qwen 3 14B or Phi-4 at Q4_K_M with 4K context. For more headroom, a used RTX 3090 24 GB (~$450) handles them at higher precision with longer context.
