Mar 29, 2026

VRAM Requirements for Every Major LLM: Complete Reference Table (2026)

Exact VRAM numbers for 60+ LLMs at every quantization level — including Qwen 3, Gemma 4, Llama 4, and gpt-oss. Model weights, KV cache, and total memory broken down so you know what GPU to buy before you download anything.

By Andre
1.0 Why VRAM is the only spec that matters

When you are shopping for a GPU to run local LLMs, the only number that determines whether a model loads or crashes is VRAM. Not CUDA cores. Not clock speed. Not tensor TFLOPS. VRAM sets a hard ceiling on which models you can run, which quantization levels you can afford, and how much context you can maintain before the system starts swapping to system RAM and your token speed tanks.

The relationship is straightforward: each parameter in a model occupies memory, and the total memory required depends on how those parameters are stored (quantization level), how much conversation history the model keeps (KV cache), and framework overhead. For a deeper explanation of the math behind these numbers, see Tim Dettmers' GPU analysis, which remains the most cited reference for VRAM budgeting.

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters × bytes_per_weight
KV_cache ≈ 2 × layers × kv_heads × head_dim × 2 × context_length

(In the KV cache formula, the first 2 accounts for the separate K and V tensors; the second 2 is bytes per element at FP16 precision.)
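As a worked example, here is that formula in Python, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via GQA, head dimension 128) at Q4_K_M with a 4K context. The 0.5 GB overhead figure is a rough assumption, not a measured value:

```python
def estimate_vram_gb(params, bytes_per_weight, layers, kv_heads, head_dim,
                     context_len, kv_bytes=2, overhead_gb=0.5):
    """Rough VRAM estimate: model weights + KV cache + framework overhead."""
    weights = params * bytes_per_weight                           # stored weights
    # Two tensors (K and V) per layer, one entry per token of context, at FP16
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9 + overhead_gb

# Llama 3.1 8B at Q4_K_M (~0.5 bytes/param), 4K context
print(round(estimate_vram_gb(8e9, 0.5, 32, 8, 128, 4096), 1))  # ≈ 5.0 GB
```

That lines up with the ~5 GB figure in the table below: 4 GB of weights, about 0.5 GB of KV cache at 4K context, and the rest is overhead.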
2.0 Tiny and small models (1B - 8B parameters)

Small models are where most people start. They run on budget GPUs, load in seconds, and still deliver useful output for coding assistance, summarization, and general chat. The 2026 generation brought major quality gains to this tier — Qwen 3 8B and Gemma 4 E4B punch well above their weight class, often matching older 14B models on benchmarks.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Qwen 3 0.6B | 0.6B | ~0.4 GB | ~0.7 GB | ~1.2 GB | Any GPU |
| Qwen 3.5 2B | 2B | ~1.2 GB | ~2.2 GB | ~4 GB | Any GPU |
| Phi-4 Mini | 3.8B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Qwen 3 4B | 4B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | ~5.5 GB | ~10 GB | 8 GB (any) |
| Qwen 2.5 7B | 7B | ~4.5 GB | ~8 GB | ~14 GB | 8 GB (Q4) |
| Qwen 3 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Llama 3.1 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |

All of these models fit comfortably on a single consumer GPU. An RTX 4060 with 8 GB or a used RTX 3060 12 GB is sufficient for Q4 inference on any model in this tier. The Gemma 4 MoE variants are notable because only 2-4B parameters are active per token despite loading 5-8B total, giving you better quality per GB of VRAM than dense models.

3.0 Mid-range models (9B - 35B parameters)

This is the sweet spot for quality versus hardware cost in 2026. Models in the 9B-35B range rival GPT-4-class output quality for many tasks, and several fit on a single 24 GB GPU at Q4 quantization. Qwen 3 32B and Gemma 4 31B are the standout daily drivers for 24 GB GPU owners.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Nemotron-Nano 9B | 9B | ~5.5 GB | ~10 GB | ~18 GB | 12 GB (Q4) |
| Mistral NeMo 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Gemma 3 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Qwen 3 14B | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| Phi-4 | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| gpt-oss 20B (MoE) | 20B | ~12 GB | ~22 GB | ~40 GB | 16 GB (Q4) |
| Mistral Small 3.1 24B | 24B | ~14 GB | ~26 GB | ~48 GB | 16 GB (Q4) |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB | ~28 GB | ~52 GB | 24 GB (Q4) |
| Qwen 3.5 27B | 27B | ~16 GB | ~29 GB | ~54 GB | 24 GB (Q4) |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB | ~33 GB | ~60 GB | 24 GB (Q4) |
| Gemma 4 31B | 31B | ~19 GB | ~34 GB | ~62 GB | 24 GB (Q4) |
| Qwen 3 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 2.5 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 3.5 35B-A3B (MoE) | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |
| Command R | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |

The MoE models in this tier are game-changers. Qwen 3 30B-A3B only activates 3B parameters per token despite loading 30B total — meaning you get 30B-level knowledge with near-instant inference speed. Similarly, gpt-oss 20B MoE only uses 3.6B active params. These models make 16 GB GPUs far more capable than they were a year ago.
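The MoE tradeoff is easy to see in numbers: VRAM scales with total parameters (every expert must be loaded), while per-token compute scales with active parameters. A minimal sketch, using the flat 0.5 bytes/param approximation for Q4 (real Q4_K_M files run a bit larger because some tensors stay at higher precision):

```python
def moe_profile(total_params_b, active_params_b, bytes_per_weight=0.5):
    """MoE tradeoff: VRAM follows TOTAL params (all experts loaded),
    per-token compute follows ACTIVE params."""
    vram_gb = total_params_b * bytes_per_weight        # must load every expert
    compute_ratio = active_params_b / total_params_b   # fraction used per token
    return vram_gb, compute_ratio

# Qwen 3 30B-A3B: 30B parameters loaded, only 3B active per token
vram, ratio = moe_profile(30, 3)
print(f"{vram:.0f} GB of weights at Q4, {ratio:.0%} of params active per token")
```

This is why a 30B MoE can decode at roughly the speed of a 3B dense model while still needing a 24 GB card to hold its weights.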

For 24 GB GPU owners, Qwen 3 32B at Q4_K_M fits in 19 GB with 5 GB left for context. This is the model that makes an RTX 4090 or used RTX 3090 feel like the right purchase.

4.0 Large models (70B - 150B parameters)

Large models are where local inference starts pushing against consumer hardware limits. The 2026 generation added several MoE models here that change the calculus — Llama 4 Scout and gpt-oss 120B offer flagship-level reasoning by only activating a fraction of their total parameters during inference.

| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Llama 3.3 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Qwen 2.5 72B | 72B | ~40 GB | ~74 GB | ~144 GB | 2×24 GB or 5090 |
| Command R+ 104B | 104B | ~57 GB | ~110 GB | ~208 GB | 3×24 GB |
| Llama 4 Scout (MoE) | 109B | ~60 GB | ~115 GB | ~218 GB | 3×24 GB |
| gpt-oss 120B (MoE) | 120B | ~65 GB | ~125 GB | ~240 GB | 3×24 GB |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB | ~128 GB | ~244 GB | 3×24 GB |
| Mixtral 8x22B (MoE) | 141B | ~78 GB | ~150 GB | ~282 GB | 4×24 GB |

**MoE changes the hardware equation.** Llama 4 Scout (109B) and gpt-oss 120B are MoE models where only a small fraction of parameters are active per token. While you still need enough VRAM to load all experts, inference speed is much faster than a dense 109B model. The Qwen 3.5 122B-A10B only activates 10B per token — near 8B speed with 122B knowledge.

For dense 70B models on a single GPU, the RTX 5090 (32 GB) can run them at Q3 or with partial offloading. For full Q4_K_M speed, you need dual 24 GB GPUs. See our RTX 5090 vs RTX 4090 comparison for the exact tradeoffs.
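How much offloading? A back-of-the-envelope sketch (the equal-sized-layer assumption and the 4 GB reservation for KV cache and overhead are illustrative simplifications, not any framework's exact accounting):

```python
import math

def layers_on_gpu(model_gb, n_layers, vram_gb, reserved_gb=4.0):
    """Estimate how many transformer layers fit in VRAM when the
    remainder is offloaded to system RAM."""
    per_layer_gb = model_gb / n_layers   # assume layers are equal-sized
    budget = vram_gb - reserved_gb       # leave room for KV cache + overhead
    return max(0, min(n_layers, math.floor(budget / per_layer_gb)))

# Llama 3.3 70B at Q4_K_M (~38 GB, 80 layers) on an RTX 5090 (32 GB)
print(layers_on_gpu(38, 80, 32))  # → 58 of 80 layers on the GPU
```

With roughly three quarters of the layers on the GPU, token speed drops well below full-GPU inference but stays usable; the same sketch shows a 2×24 GB setup holding all 80 layers.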

5.0 Flagship models (200B+)

The 2026 flagship tier has exploded with MoE architectures that pack enormous knowledge into models that only activate a fraction of their parameters. Qwen 3 235B, DeepSeek V4, Llama 4 Maverick, and Mistral Large 3 all compete at the frontier — but none fit on a single consumer GPU, even at Q2_K.

| Model | Total Params | Q2_K | Q4_K_M | Active/Token | Reality |
|---|---|---|---|---|---|
| Qwen 3 235B-A22B (MoE) | 235B | ~60 GB | ~128 GB | 22B | 3-4×24 GB at Q2 |
| DeepSeek V4-Flash (MoE) | 284B | ~73 GB | ~145 GB | varies | 4×24 GB at Q2 |
| GLM-4.5 | 355B | ~90 GB | ~185 GB | dense | 5-6×24 GB at Q2 |
| Qwen 3.5 397B-A17B (MoE) | 397B | ~100 GB | ~200 GB | 17B | 5×24 GB at Q2 |
| Llama 4 Maverick (MoE) | 400B | ~100 GB | ~200 GB | varies | 5×24 GB at Q2 |
| DeepSeek R1 (MoE) | 671B | ~170 GB | ~350 GB | 37B | No consumer setup |
| Mistral Large 3 (MoE) | 675B | ~170 GB | ~350 GB | varies | No consumer setup |
| Kimi K2.6 | ~1T | ~250 GB | ~500 GB | dense | Server only |
| DeepSeek V4-Pro (MoE) | 1.6T | ~400 GB | ~800 GB | varies | Server cluster only |

You can run distilled versions of these large models instead. DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-32B carry much of the reasoning capability at a fraction of the VRAM cost. The MoE models with low active parameter counts (Qwen 3 235B-A22B activates only 22B) are particularly interesting — if you can load all experts into VRAM, inference speed approaches much smaller models. Find distilled versions on Hugging Face.

6.0 How quantization changes the equation

Quantization is the single biggest lever you have for fitting models into limited VRAM. Moving from FP16 (2 bytes per parameter) to Q4_K_M (~0.5 bytes per parameter) cuts model weight memory by ~75%. The quality loss at Q4_K_M is typically 1-3% on standard benchmarks — imperceptible for most use cases. For the full quantization-by-quantization breakdown, see our dedicated quantization guide.

- FP16 (2.0 bytes/param): Full precision, maximum quality, double the VRAM of Q4
- Q8 (1.0 bytes/param): Near-native quality, good middle ground if you have spare VRAM
- Q5_K_M (0.625 bytes/param): Slightly better than Q4, moderate extra cost
- Q4_K_M (0.5 bytes/param): Best balance of quality and VRAM savings — most popular choice
- Q3_K_M (0.375 bytes/param): Noticeable quality drop but fits larger models on smaller GPUs
- Q2_K (0.25 bytes/param): Maximum compression, significant quality degradation
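To turn those rates into gigabytes, multiply parameter count by bytes per parameter. A sketch for a 70B model (this flat-rate math slightly undershoots real GGUF file sizes, since the K_M formats keep embedding and output tensors at higher precision):

```python
# Approximate average bytes per parameter for common GGUF quantization levels
BYTES_PER_PARAM = {
    "FP16": 2.0, "Q8": 1.0, "Q5_K_M": 0.625,
    "Q4_K_M": 0.5, "Q3_K_M": 0.375, "Q2_K": 0.25,
}

def weight_size_gb(params_b, quant):
    """Model-weight footprint only; KV cache and overhead come on top."""
    return params_b * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"{quant:>7}: {weight_size_gb(70, quant):6.1f} GB")
```

Running this shows the spread for a 70B model: 140 GB at FP16 down to 17.5 GB at Q2_K, which is why quantization, not GPU choice, is the first decision to make.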
7.0 Calculate your exact VRAM needs

These tables give you ballpark numbers for planning. But model architectures vary, context length requirements differ, and overhead depends on your inference framework. For a precise calculation tailored to your specific setup, use the VRAM calculator to enter your model, quantization, and context length and get an exact memory breakdown with GPU recommendations.

Frequently Asked Questions

**How much VRAM do I need for Llama 3.3 70B?**
Llama 3.3 70B requires approximately 38-40 GB VRAM at Q4_K_M quantization with 4K context. This means you need either an RTX 5090 (32 GB) with some CPU offloading or two 24 GB GPUs via tensor parallelism.

**Can I run Qwen 3 32B on a single 24 GB GPU?**
Yes. Qwen 3 32B at Q4_K_M needs approximately 19-20 GB VRAM with 4K context, which fits comfortably on an RTX 4090 or RTX 3090 with room for longer contexts.

**What GPU do I need for Llama 4 Scout?**
Llama 4 Scout is a 109B MoE model. At Q4_K_M it needs roughly 60 GB VRAM for the full model. You need at least 2-3×24 GB GPUs or significant CPU offloading. The MoE architecture means only a subset of experts are active per token, but all must be loaded into memory.

**What is the cheapest GPU that can run 14B models?**
A used RTX 3060 12 GB (~$180) can run 14B models like Qwen 3 14B or Phi-4 at Q4_K_M with 4K context. For more headroom, a used RTX 3090 24 GB (~$450) handles them at higher precision with longer context.
