VRAM Requirements for Every Major LLM: Complete Reference Table (2026)
Exact VRAM numbers for 60+ LLMs at every quantization level — including Qwen 3, Gemma 4, Llama 4, and gpt-oss. Model weights, KV cache, and total memory broken down so you know what GPU to buy before you download anything.

Why VRAM is the only spec that matters
When you are shopping for a GPU to run local LLMs, the only number that determines whether a model loads or crashes is VRAM. Not CUDA cores. Not clock speed. Not tensor TFLOPS. VRAM sets a hard ceiling on which models you can run, which quantization levels you can afford, and how much context you can maintain before the system starts swapping to system RAM and your token speed tanks.
The relationship is straightforward: each parameter in a model occupies memory, and the total memory required depends on how those parameters are stored (quantization level), how much conversation history the model keeps (KV cache), and framework overhead. For a deeper explanation of the math behind these numbers, see Tim Dettmers' GPU analysis, which remains the most cited reference for VRAM budgeting.
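To make that arithmetic concrete, here is a minimal Python sketch of the same calculation. The bytes-per-parameter value, the grouped-query-attention style KV cache formula, and the flat 1 GB overhead figure are rough assumptions chosen to land in the same ballpark as the tables below, not exact numbers for any specific runtime.

```python
# Rough VRAM estimate for a dense transformer: weights + KV cache + overhead.
# All constants are illustrative assumptions, not exact figures for any runtime.

def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int, kv_bytes: float = 2.0,
                     overhead_gb: float = 1.0) -> float:
    """Return an approximate total VRAM requirement in GB."""
    # Model weights: parameter count times the approximate bytes per parameter.
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9

    # KV cache: 2 (K and V) * layers * KV heads * head dim * context * bytes per element.
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9

    # Flat guess for framework overhead (CUDA context, activation buffers).
    return weights_gb + kv_cache_gb + overhead_gb

# Example: an 8B model with a Llama 3.1-like shape (32 layers, 8 KV heads,
# head dim 128) at Q4_K_M (~0.5 bytes/param) with an 8k context window.
print(round(estimate_vram_gb(8, 0.5, 32, 8, 128, 8192), 1), "GB")  # ~6.1 GB
```

The KV cache term is the one that grows with context length, which is why a model that fits comfortably at 4k context can spill out of VRAM at 32k.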
Tiny and small models (1B - 8B parameters)
Small models are where most people start. They run on budget GPUs, load in seconds, and still deliver useful output for coding assistance, summarization, and general chat. The 2026 generation brought major quality gains to this tier — Qwen 3 8B and Gemma 4 E4B punch well above their weight class, often matching older 14B models on benchmarks.
| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Qwen 3 0.6B | 0.6B | ~0.4 GB | ~0.7 GB | ~1.2 GB | Any GPU |
| Qwen 3.5 2B | 2B | ~1.2 GB | ~2.2 GB | ~4 GB | Any GPU |
| Phi-4 Mini | 3.8B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Qwen 3 4B | 4B | ~2.5 GB | ~4.5 GB | ~8 GB | 8 GB (any) |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | ~5.5 GB | ~10 GB | 8 GB (any) |
| Qwen 2.5 7B | 7B | ~4.5 GB | ~8 GB | ~14 GB | 8 GB (Q4) |
| Qwen 3 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Llama 3.1 8B | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | ~9 GB | ~16 GB | 8 GB (Q4) |
All of these models fit comfortably on a single consumer GPU. An RTX 4060 with 8 GB or a used RTX 3060 12 GB is sufficient for Q4 inference on any model in this tier. The Gemma 4 MoE variants are notable because only 2-4B parameters are active per token despite loading 5-8B total, giving you better quality per GB of VRAM than dense models.
Mid-range models (9B - 35B parameters)
This is the sweet spot for quality versus hardware cost in 2026. Models in the 9B-35B range rival GPT-4-class output quality for many tasks, and several fit on a single 24 GB GPU at Q4 quantization. Qwen 3 32B and Gemma 4 31B are the standout daily drivers for 24 GB GPU owners.
| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Nemotron-Nano 9B | 9B | ~5.5 GB | ~10 GB | ~18 GB | 12 GB (Q4) |
| Mistral NeMo 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Gemma 3 12B | 12B | ~7 GB | ~13 GB | ~24 GB | 12 GB (Q4) |
| Qwen 3 14B | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| Phi-4 | 14B | ~8.5 GB | ~15 GB | ~28 GB | 12 GB (Q4) |
| gpt-oss 20B (MoE) | 20B | ~12 GB | ~22 GB | ~40 GB | 16 GB (Q4) |
| Mistral Small 3.1 24B | 24B | ~14 GB | ~26 GB | ~48 GB | 16 GB (Q4) |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB | ~28 GB | ~52 GB | 24 GB (Q4) |
| Qwen 3.5 27B | 27B | ~16 GB | ~29 GB | ~54 GB | 24 GB (Q4) |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB | ~33 GB | ~60 GB | 24 GB (Q4) |
| Gemma 4 31B | 31B | ~19 GB | ~34 GB | ~62 GB | 24 GB (Q4) |
| Qwen 3 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 2.5 32B | 32B | ~19 GB | ~35 GB | ~64 GB | 24 GB (Q4) |
| Qwen 3.5 35B-A3B (MoE) | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |
| Command R | 35B | ~21 GB | ~38 GB | ~70 GB | 24 GB (Q4) |
The MoE models in this tier are game-changers. Qwen 3 30B-A3B activates only 3B parameters per token despite loading 30B total, so you get 30B-scale knowledge at roughly the decode speed of a 3B dense model. Similarly, gpt-oss 20B runs with only 3.6B active parameters. These models make 16 GB GPUs far more capable than they were a year ago.
For 24 GB GPU owners, Qwen 3 32B at Q4_K_M fits in 19 GB with 5 GB left for context. This is the model that makes an RTX 4090 or used RTX 3090 feel like the right purchase.
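A back-of-the-envelope calculation shows why those low active-parameter counts matter so much for speed: during decoding, each new token has to read roughly the active weights from VRAM once, so tokens per second is capped at memory bandwidth divided by active-weight bytes. The sketch below applies that heuristic; the 1008 GB/s figure is the published RTX 4090 memory bandwidth, and treating active-expert weights as the only per-token traffic is a simplification, so read the outputs as upper bounds rather than measured throughput.

```python
# Memory-bandwidth-bound decode ceiling:
# tokens/sec ~= GPU memory bandwidth / bytes of weights read per token.
# Dense models read all weights per token; MoE roughly only the active experts.

def max_tokens_per_sec(active_params_billion: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW_RTX_4090 = 1008  # GB/s, published spec

# Dense Qwen 3 32B vs MoE Qwen 3 30B-A3B (3B active), both at Q4_K_M.
print(round(max_tokens_per_sec(32, 0.5, BW_RTX_4090)))  # ~63 tok/s ceiling
print(round(max_tokens_per_sec(3, 0.5, BW_RTX_4090)))   # ~672 tok/s ceiling
```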
Large models (70B - 150B parameters)
Large models are where local inference starts pushing against consumer hardware limits. The 2026 generation added several MoE models here that change the calculus — Llama 4 Scout and gpt-oss 120B offer flagship-level reasoning by only activating a fraction of their total parameters during inference.
| Model | Params | Q4_K_M | Q8 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Llama 3.3 70B | 70B | ~38 GB | ~70 GB | ~140 GB | 2×24 GB or 5090 |
| Qwen 2.5 72B | 72B | ~40 GB | ~74 GB | ~144 GB | 2×24 GB or 5090 |
| Command R+ 104B | 104B | ~57 GB | ~110 GB | ~208 GB | 3×24 GB |
| Llama 4 Scout (MoE) | 109B | ~60 GB | ~115 GB | ~218 GB | 3×24 GB |
| gpt-oss 120B (MoE) | 120B | ~65 GB | ~125 GB | ~240 GB | 3×24 GB |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB | ~128 GB | ~244 GB | 3×24 GB |
| Mixtral 8x22B (MoE) | 141B | ~78 GB | ~150 GB | ~282 GB | 4×24 GB |
For dense 70B models on a single GPU, the RTX 5090 (32 GB) can run them at Q3 or with partial offloading. For full Q4_K_M speed, you need dual 24 GB GPUs. See our RTX 5090 vs RTX 4090 comparison for the exact tradeoffs.
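For the partial-offloading case, the practical question is how many transformer layers you can keep on the card. Here is a crude estimate that assumes weights are spread evenly across layers; real GGUF files also carry embedding and output tensors, so the layer count your runtime reports will differ a little.

```python
# Estimate how many of a model's layers fit in a given VRAM budget,
# assuming weights are spread roughly evenly across layers.

def layers_on_gpu(weights_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))

# Llama 3.3 70B at Q4_K_M (~38 GB of weights, 80 layers) on a 24 GB card,
# reserving ~4 GB for KV cache and overhead.
print(layers_on_gpu(38, 80, 24 - 4))  # ~42 of 80 layers stay on the GPU
```

Everything that does not fit stays in system RAM, and decode speed falls off quickly as more layers have to be streamed from it on every token.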
Flagship models (200B+)
The 2026 flagship tier has exploded with MoE architectures that pack enormous knowledge into models that only activate a fraction of their parameters. Qwen 3 235B, DeepSeek V4, Llama 4 Maverick, and Mistral Large 3 all compete at the frontier — but none fit on a single consumer GPU, even at Q2_K.
| Model | Total Params | Q2_K | Q4_K_M | Active Params/Token | Realistic Setup |
|---|---|---|---|---|---|
| Qwen 3 235B-A22B (MoE) | 235B | ~60 GB | ~128 GB | 22B | 3-4×24 GB at Q2 |
| DeepSeek V4-Flash (MoE) | 284B | ~73 GB | ~145 GB | varies | 4×24 GB at Q2 |
| GLM-4.5 | 355B | ~90 GB | ~185 GB | dense | 5-6×24 GB at Q2 |
| Qwen 3.5 397B-A17B (MoE) | 397B | ~100 GB | ~200 GB | 17B | 5×24 GB at Q2 |
| Llama 4 Maverick (MoE) | 400B | ~100 GB | ~200 GB | varies | 5×24 GB at Q2 |
| DeepSeek R1 (MoE) | 671B | ~170 GB | ~350 GB | 37B | No consumer setup |
| Mistral Large 3 (MoE) | 675B | ~170 GB | ~350 GB | varies | No consumer setup |
| Kimi K2.6 | ~1T | ~250 GB | ~500 GB | dense | Server only |
| DeepSeek V4-Pro (MoE) | 1.6T | ~400 GB | ~800 GB | varies | Server cluster only |
You can run distilled versions of these large models instead. DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-32B carry much of the reasoning capability at a fraction of the VRAM cost. The MoE models with low active parameter counts (Qwen 3 235B-A22B activates only 22B) are particularly interesting — if you can load all experts into VRAM, inference speed approaches much smaller models. Find distilled versions on Hugging Face.
How quantization changes the equation
Quantization is the single biggest lever you have for fitting models into limited VRAM. Moving from FP16 (2 bytes per parameter) to Q4_K_M (~0.5 bytes per parameter) cuts model weight memory by ~75%. The quality loss at Q4_K_M is typically 1-3% on standard benchmarks — imperceptible for most use cases. For the full quantization-by-quantization breakdown, see our dedicated quantization guide.
- FP16 (2.0 bytes/param): Full precision, maximum quality, four times the VRAM of Q4_K_M
- Q8 (1.0 bytes/param): Near-native quality, good middle ground if you have spare VRAM
- Q5_K_M (0.625 bytes/param): Slightly better than Q4, moderate extra cost
- Q4_K_M (0.5 bytes/param): Best balance of quality and VRAM savings — most popular choice
- Q3_K_M (0.375 bytes/param): Noticeable quality drop but fits larger models on smaller GPUs
- Q2_K (0.25 bytes/param): Maximum compression, significant quality degradation
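Those bytes-per-parameter figures map directly onto the weight columns in the tables above. A quick sketch that reproduces the weight-only numbers (real quantized files typically land a few GB higher because some tensors, such as embeddings, are kept at higher precision, and the tables also fold in a little overhead):

```python
# Weight-only memory for one model at every quantization level listed above.
BYTES_PER_PARAM = {
    "FP16":   2.0,
    "Q8":     1.0,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K":   0.25,
}

params_billion = 32  # e.g. a Qwen 3 32B-class dense model
for quant, bpp in BYTES_PER_PARAM.items():
    print(f"{quant:>7}: ~{params_billion * bpp:.1f} GB")
# Q4_K_M -> ~16 GB of raw weights; the table's ~19 GB also counts
# higher-precision tensors and runtime overhead.
```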
Calculate your exact VRAM needs
These tables give you ballpark numbers for planning. But model architectures vary, context length requirements differ, and overhead depends on your inference framework. For a precise calculation tailored to your specific setup, use the VRAM calculator to enter your model, quantization, and context length and get an exact memory breakdown with GPU recommendations.
Frequently Asked Questions
How much VRAM do I need for Llama 3.3 70B?
About 38 GB for the weights at Q4_K_M, plus room for KV cache, so plan on two 24 GB GPUs. A single RTX 5090 (32 GB) can run it at Q3 or with partial offloading to system RAM.
Can I run Qwen 3 32B on a single 24 GB GPU?
Yes. At Q4_K_M the weights take about 19 GB, leaving roughly 5 GB for KV cache and overhead on an RTX 3090 or 4090.
What GPU do I need for Llama 4 Scout?
Around 60 GB at Q4_K_M means a 3×24 GB multi-GPU setup; it does not fit on any single consumer card.
What is the cheapest GPU that can run 14B models?
A 12 GB card. Qwen 3 14B and Phi-4 need about 8.5 GB at Q4_K_M, so a used RTX 3060 12 GB runs them with context to spare.