Quantization vs VRAM: Exactly How Much Memory Each Level Saves
Q2_K through FP16 — every quantization level compared with real VRAM numbers for 2026 models including Qwen 3, Gemma 4, and Llama 4. See the exact savings for the models you want to run, so you can pick the right GPU and the right quantization in one decision.

What quantization actually does
Quantization reduces the precision of model weights — the numbers that define what the model learned during training. A model trained at FP16 (16-bit floating point, 2 bytes per weight) can have its weights stored at lower precision: 8-bit, 4-bit, 3-bit, or even 2-bit. Each reduction cuts memory usage proportionally. The tradeoff is that lower precision means each weight carries less information, which degrades output quality.
The key insight is that model quality does not degrade linearly. Moving from FP16 to Q4_K_M (4-bit) loses only 1-3% on benchmarks — your outputs will look nearly identical. But moving from Q4 to Q2 loses another 5-10%, and the degradation becomes visible in longer outputs, factual accuracy, and reasoning quality. For a rigorous technical analysis, see the GPTQ paper, a foundational work on post-training quantization of LLMs.
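The memory math behind this is simple: parameter count times bits per weight. Here is a minimal Python sketch using nominal bits per level, chosen to match the tables below; real GGUF files run slightly higher because k-quants store per-block scale factors and keep some tensors at higher precision.

```python
# Back-of-envelope weight memory: parameter count x bits per weight.
# Nominal bits per level; actual GGUF files are slightly larger.
NOMINAL_BITS = {"FP16": 16, "Q8_0": 8.5, "Q6_K": 6,
                "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

def weight_gb(params_b: float, quant: str) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * NOMINAL_BITS[quant] / 8

for quant in NOMINAL_BITS:
    print(f"{quant:7} ~{weight_gb(8, quant):5.1f} GB")  # an 8B model
```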
Qwen 3 8B: VRAM at every quantization level
Qwen 3 8B is one of the most popular small models for local inference in 2026. Here is exactly how much VRAM it needs at each quantization level with 4K context, including model weights, KV cache (~0.5 GB at 4K), and ~1.3 GB of framework overhead.
| Quantization | Weights | Total (4K ctx) | Quality vs FP16 | Token Speed Impact |
|---|---|---|---|---|
| FP16 | ~16.0 GB | ~18.0 GB | Baseline (100%) | Slowest (bandwidth bound) |
| Q8_0 | ~8.5 GB | ~10.2 GB | ~99.5% | ~15-20% faster than FP16 |
| Q6_K | ~6.0 GB | ~7.8 GB | ~98.5% | ~20-25% faster |
| Q5_K_M | ~5.0 GB | ~6.8 GB | ~98% | ~25-30% faster |
| Q4_K_M | ~4.0 GB | ~5.8 GB | ~97% | ~30-40% faster |
| Q3_K_M | ~3.0 GB | ~4.8 GB | ~93% | ~35-45% faster |
| Q2_K | ~2.0 GB | ~3.8 GB | ~85% | ~40-50% faster |
The practical takeaway: Q4_K_M is the sweet spot. It uses 68% less VRAM than FP16 while retaining 97% quality. Q5_K_M is worth considering if you have a few extra gigabytes — it gets you to 98% quality for only 1 GB more.
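If you want to reproduce the totals above, a rough sketch looks like the following. The KV cache figure (~0.5 GB at 4K for an 8B model) comes from the table; the ~1.3 GB overhead allowance is an assumption that makes the totals line up and will vary by inference engine.

```python
# Minimal total-VRAM estimate: weights + KV cache + framework overhead.
# kv_gb and overhead_gb are assumptions per the 8B table above.
def total_vram_gb(weights_gb: float, kv_gb: float = 0.5,
                  overhead_gb: float = 1.3) -> float:
    """Weights plus KV cache plus a fixed framework allowance."""
    return weights_gb + kv_gb + overhead_gb

def fits(weights_gb: float, gpu_gb: float) -> bool:
    return total_vram_gb(weights_gb) <= gpu_gb

print(total_vram_gb(4.0))   # Q4_K_M -> 5.8 GB, matching the table
print(fits(4.0, 8.0))       # True: 8B at Q4_K_M fits an 8 GB card
```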
Qwen 3 32B: the model that makes 24 GB GPUs shine
Qwen 3 32B is arguably the best model for single 24 GB GPUs in 2026. At Q4_K_M it fits in ~21 GB total with 4K context, leaving comfortable headroom. At Q5_K_M it lands at roughly ~24 GB, consuming essentially the whole VRAM budget in exchange for better quality. This is the model that makes an RTX 3090 or RTX 4090 feel like the right purchase.
| Quantization | Weights | Total (4K ctx) | Total (16K ctx) | Fits 24 GB? |
|---|---|---|---|---|
| FP16 | ~64 GB | ~66 GB | ~72 GB | No |
| Q8_0 | ~34 GB | ~36 GB | ~42 GB | No |
| Q5_K_M | ~22 GB | ~24 GB | ~30 GB | 4K only, tight |
| Q4_K_M | ~19 GB | ~21 GB | ~27 GB | 4K comfortable |
| Q3_K_M | ~12 GB | ~14 GB | ~20 GB | Yes, lots of headroom |
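The gap between the 4K and 16K columns is pure KV cache, which grows linearly with context length. Here is a sketch of that growth; the layer count, KV head count, and head dimension below are illustrative values chosen to roughly reproduce the ~6 GB delta in the table, not necessarily Qwen 3 32B's real config (check the model's config.json for actual values).

```python
# FP16 KV cache grows linearly with context length.
# Architecture numbers here are illustrative assumptions.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_val: int = 2) -> float:
    # 2 tensors (K and V) stored per layer, per token
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return per_token * context / 1e9

for ctx in (4096, 16384):
    print(ctx, f"~{kv_cache_gb(64, 16, 128, ctx):.1f} GB")
# 4096 -> ~2.1 GB, 16384 -> ~8.6 GB: roughly the +6 GB in the table
```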
Llama 3.3 70B: where quantization decides your GPU
For dense 70B models, quantization is not about optimization — it determines whether you can run the model at all. FP16 requires 140 GB, which is far beyond any consumer GPU. Even Q4_K_M needs ~38 GB, pushing past a single 24 GB GPU.
| Quantization | Weights | Total (4K ctx) | Fits on? | Quality vs FP16 |
|---|---|---|---|---|
| FP16 | ~140 GB | ~146 GB | No consumer GPU | Baseline |
| Q8_0 | ~74 GB | ~80 GB | 3-4×24 GB | ~99.5% |
| Q6_K | ~53 GB | ~59 GB | 3×24 GB | ~98.5% |
| Q5_K_M | ~44 GB | ~50 GB | 2×24 GB (tight) | ~98% |
| Q4_K_M | ~38 GB | ~44 GB | 2×24 GB or 5090+offload | ~97% |
| Q3_K_M | ~30 GB | ~36 GB | 5090 (32 GB) w/ offload | ~93% |
| Q2_K | ~20 GB | ~26 GB | 1×24 GB (tight) | ~85% |
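A crude way to read the "Fits on?" column: divide the quantized weights across 24 GB cards and round up. This sketch ignores the KV cache and per-GPU buffers that still need room on top, which is why borderline cases like Q5_K_M on 2×24 GB are tight and Q2_K on a single card leaves almost no context headroom.

```python
# Minimum 24 GB card count per 70B quant, weights only.
# KV cache and overhead still need room on top of these splits.
import math

WEIGHTS_GB = {"Q8_0": 74, "Q6_K": 53, "Q5_K_M": 44,
              "Q4_K_M": 38, "Q3_K_M": 30, "Q2_K": 20}

for quant, w in WEIGHTS_GB.items():
    print(f"{quant:7} -> {math.ceil(w / 24)}x 24 GB minimum")
```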
MoE models: quantization on the 2026 generation
MoE (Mixture of Experts) models like Qwen 3 235B-A22B and Llama 4 Scout 109B have changed the quantization calculus. These models store many experts but only activate a few per token. All experts must be quantized and loaded into VRAM, but inference speed depends only on the active experts. This means quantization is even more critical for MoE models — you are compressing parameters that are loaded but rarely used.
| Model | Total Params | Active | Q4_K_M Weights | Q2_K Weights | Fits on? |
|---|---|---|---|---|---|
| gpt-oss 20B (MoE) | 20B | 3.6B | ~12 GB | ~6 GB | 16 GB at Q4 |
| Qwen 3 30B-A3B (MoE) | 30B | 3B | ~18 GB | ~9 GB | 24 GB at Q4 |
| Gemma 4 26B-A4B (MoE) | 26B | 4B | ~15 GB | ~7.5 GB | 24 GB at Q4 |
| Llama 4 Scout (MoE) | 109B | 17B | ~60 GB | ~30 GB | 2×24 GB at Q2 |
| Qwen 3 235B-A22B (MoE) | 235B | 22B | ~128 GB | ~64 GB | 3-4×24 GB at Q2 |
The key insight: MoE models at Q2_K often deliver better quality than dense models at Q4_K_M in the same VRAM footprint, because total parameter count preserves more capability than the extra bits per weight do. If you are choosing between a dense 14B at Q4 and an MoE 30B at Q2, both around 8-9 GB of weights, the MoE model usually wins.
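The asymmetry is that VRAM scales with total parameters while per-token compute scales with active parameters. A quick sketch of that comparison; the bits-per-weight values are approximate effective rates backed out from the table above, not exact file sizes.

```python
# MoE memory vs compute: VRAM scales with *total* params,
# per-token weight reads scale with *active* params.
def moe_profile(total_b: float, active_b: float, bits: float):
    weights_gb = total_b * bits / 8   # all experts live in VRAM
    read_gb = active_b * bits / 8     # weights touched per token, roughly
    return weights_gb, read_gb

# Qwen 3 30B-A3B at ~Q2_K vs a dense 14B at ~Q4_K_M: similar footprint,
# but the MoE model only reads ~3B params' worth of weights per token.
print(moe_profile(30, 3, bits=2.4))   # ~(9.0, 0.9) GB
print(moe_profile(14, 14, bits=4.8))  # ~(8.4, 8.4) GB
```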
What quantization does NOT compress
Quantization only reduces model weight memory. It does not affect three other VRAM consumers that you must budget for separately. This is a common source of confusion — people see that Q4 cuts weights to 25% and assume their total VRAM drops by the same factor.
- KV cache: The attention cache stores past keys and values in FP16 regardless of weight quantization. At long context lengths, this can exceed the model weights. See our KV cache guide for the full breakdown.
- Framework overhead: Inference engines allocate memory for the computation graph, tokenizer buffers, and CUDA workspace. This adds roughly 10-20% on top of weights + cache.
- Batch size: Serving multiple users simultaneously multiplies the KV cache. Local inference at batch size 1 avoids this, but API-style deployments need to budget for it.
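A concrete example of why the total does not shrink by the quantization factor. The numbers below reuse the 8B table's figures, extended to a 16K context where the FP16 KV cache grows to roughly 2 GB.

```python
# Total VRAM doesn't drop 4x when weights do, because KV cache and
# overhead are unaffected by weight quantization. 8B model, 16K context.
weights_fp16, weights_q4 = 16.0, 4.0   # GB, per the 8B table
kv_cache = 2.0                          # GB at 16K ctx; stays FP16
overhead = 1.3                          # GB framework allowance (assumed)

for name, w in [("FP16", weights_fp16), ("Q4_K_M", weights_q4)]:
    print(name, w + kv_cache + overhead)
# FP16 -> 19.3 GB, Q4_K_M -> 7.3 GB: a 2.6x drop, not 4x
```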
Picking your quantization strategy
The right quantization depends on your GPU, your model target, and your quality tolerance. Here is a decision framework based on common hardware configurations in 2026.
| GPU VRAM | Target Model | Recommended Quant | Why |
|---|---|---|---|
| 8 GB | 7B-8B models | Q4_K_M | Fits with context headroom |
| 12 GB | 9B-14B models | Q5_K_M or Q4_K_M | Q5 fits 9B, Q4 fits 14B |
| 16 GB | 14B-24B models, MoE 20B | Q4_K_M | Fits dense 14B or MoE 20B |
| 24 GB | 32B models | Q5_K_M | Best quality on single GPU |
| 24 GB | MoE 30B models | Q4_K_M | Fits with headroom for context |
| 24 GB | 70B dense models | Q2_K (single GPU) | Minimum viable, use dual GPU for Q4 |
| 32 GB | 70B models | Q3_K_M or Q4_K_M | Q3 fully fits, Q4 needs partial offload |
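The table reduces to a greedy rule: walk down the quant ladder until the estimate fits your card. Here is a hypothetical helper along those lines, using the same nominal bits and overhead assumptions as the earlier sketches; the table above is deliberately more conservative in places to leave headroom for longer contexts.

```python
# Sketch of the decision table as code: highest-quality quant that fits.
# Nominal bits per weight; real files run slightly larger.
LADDER = [("Q5_K_M", 5), ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]

def pick_quant(gpu_gb: float, params_b: float,
               kv_gb: float = 0.5, overhead_gb: float = 1.3):
    for name, bits in LADDER:
        if params_b * bits / 8 + kv_gb + overhead_gb <= gpu_gb:
            return name
    return None  # needs a bigger GPU, multi-GPU, or CPU offload

print(pick_quant(24, 32))  # Q5_K_M: 32B on a 24 GB card, as above
print(pick_quant(24, 70))  # Q2_K: minimum viable 70B on 24 GB
```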
Calculate your exact quantization-to-VRAM mapping
These numbers are accurate for the models listed, but every model has slightly different architecture parameters that affect exact memory usage. To get your exact numbers, pick any model and quantization level, set your context length, and get a VRAM breakdown showing model weights, KV cache, overhead, and which GPUs can run it.
Frequently Asked Questions
What is the best quantization for local LLMs?
Q4_K_M is the sweet spot for most setups: roughly a quarter of FP16's weight memory while retaining ~97% of benchmark quality. Step up to Q5_K_M or Q6_K if you have spare VRAM, and drop to Q3 or Q2 only when nothing else fits.
How much VRAM does Q4_K_M save vs FP16?
Weights shrink to about 25% of FP16 size. Total VRAM savings are somewhat smaller because the KV cache and framework overhead are not compressed: for Qwen 3 8B at 4K context, total usage drops from ~18 GB to ~5.8 GB, about 68%.
Does quantization affect MoE models differently?
All experts must be quantized and held in VRAM even though only a few activate per token, so quantization matters even more for MoE models. In practice, a large MoE at Q2_K often beats a dense model at Q4_K_M in the same VRAM footprint.
Does quantization affect inference speed?
Yes. Token generation is memory-bandwidth bound, so smaller weights mean faster inference: Q4_K_M typically generates tokens 30-40% faster than FP16 on the same hardware.