Apr 14, 2026

Quantization vs VRAM: Exactly How Much Memory Each Level Saves

Q2_K through FP16 — every quantization level compared with real VRAM numbers for 2026 models including Qwen 3, Gemma 4, and Llama 4. See the exact savings for the models you want to run, so you can pick the right GPU and the right quantization in one decision.

Andre
GPU · AI · LLMs
1.0

What quantization actually does

Quantization reduces the precision of model weights — the numbers that define what the model learned during training. A model trained at FP16 (16-bit floating point, 2 bytes per weight) can have its weights stored at lower precision: 8-bit, 4-bit, 3-bit, or even 2-bit. Each reduction cuts memory usage proportionally. The tradeoff is that lower precision means each weight carries less information, which degrades output quality.

The key insight is that model quality does not degrade linearly. Moving from FP16 to Q4_K_M (4-bit) loses only 1-3% on benchmarks — your outputs will look nearly identical. But moving from Q4 to Q2 loses another 5-10%, and the degradation becomes visible in longer outputs, factual accuracy, and reasoning quality. For a rigorous technical analysis, see the GPTQ paper, one of the foundational works on post-training quantization for LLMs.

Bytes per parameter by quantization level:

| Quantization | Bytes per parameter | Notes |
| --- | --- | --- |
| FP16 | 2.000 | 16-bit float |
| Q8_0 | 1.062 | 8-bit integer + scale |
| Q6_K | 0.750 | 6-bit with block quantization |
| Q5_K_M | 0.625 | 5-bit, medium quality mix |
| Q4_K_M | 0.500 | 4-bit, medium quality mix |
| Q3_K_M | 0.375 | 3-bit, medium quality mix |
| Q2_K | 0.250 | 2-bit, maximum compression |
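
To turn these figures into actual sizes, multiply the parameter count by the bytes per parameter. Here is a minimal Python sketch of that arithmetic (the helper name and the 1 GB = 10^9 bytes convention are my own assumptions):

```python
# Approximate bytes per parameter for common GGUF quantization levels
BYTES_PER_PARAM = {
    "FP16": 2.000, "Q8_0": 1.062, "Q6_K": 0.750, "Q5_K_M": 0.625,
    "Q4_K_M": 0.500, "Q3_K_M": 0.375, "Q2_K": 0.250,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate model weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"8B model at {quant:7}: ~{weight_gb(8, quant):.1f} GB of weights")
```
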
2.0

Qwen 3 8B: VRAM at every quantization level

Qwen 3 8B is one of the most popular small models for local inference in 2026. Here is exactly how much VRAM it needs at each quantization level with 4K context, including model weights, KV cache (~0.5 GB at 4K), and 10% overhead.

| Quantization | Weights | Total (4K ctx) | Quality vs FP16 | Token speed impact |
| --- | --- | --- | --- | --- |
| FP16 | ~16.0 GB | ~18.0 GB | Baseline (100%) | Slowest (bandwidth bound) |
| Q8_0 | ~8.5 GB | ~10.2 GB | ~99.5% | ~15-20% faster than FP16 |
| Q6_K | ~6.0 GB | ~7.8 GB | ~98.5% | ~20-25% faster |
| Q5_K_M | ~5.0 GB | ~6.8 GB | ~98% | ~25-30% faster |
| Q4_K_M | ~4.0 GB | ~5.8 GB | ~97% | ~30-40% faster |
| Q3_K_M | ~3.0 GB | ~4.8 GB | ~93% | ~35-45% faster |
| Q2_K | ~2.0 GB | ~3.8 GB | ~85% | ~40-50% faster |

The practical takeaway: Q4_K_M is the sweet spot. It uses 68% less VRAM than FP16 while retaining 97% quality. Q5_K_M is worth considering if you have a few extra gigabytes — it gets you to 98% quality for only 1 GB more.
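
To reproduce totals like these yourself, add the KV cache and a percentage for overhead on top of the weights. A minimal sketch, assuming the ~0.5 GB KV cache at 4K context and the 10% overhead figure quoted above (this simple formula lands a little under some rows of the table, which appear to also budget a roughly constant runtime allocation, so treat the result as a floor rather than an exact match):

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float = 0.5,
                  overhead_frac: float = 0.10) -> float:
    """Rough total footprint: weights + KV cache, plus framework overhead."""
    return (weights_gb + kv_cache_gb) * (1.0 + overhead_frac)

# Qwen 3 8B at 4K context, using weight sizes from the table above
for quant, weights in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.0), ("Q2_K", 2.0)]:
    print(f"{quant:7}: ~{total_vram_gb(weights):.1f} GB total")
```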

3.0

Qwen 3 32B: the model that makes 24 GB GPUs shine

Qwen 3 32B is arguably the best model for a single 24 GB GPU in 2026. At Q4_K_M it needs roughly 21 GB total with 4K context, leaving comfortable headroom. At Q5_K_M it still fits at 4K, using most of the 24 GB budget for better quality. This is the model that makes an RTX 3090 or RTX 4090 feel like the right purchase.

| Quantization | Weights | Total (4K ctx) | Total (16K ctx) | Fits 24 GB? |
| --- | --- | --- | --- | --- |
| FP16 | ~64 GB | ~66 GB | ~72 GB | No |
| Q8_0 | ~34 GB | ~36 GB | ~42 GB | No |
| Q5_K_M | ~20 GB | ~22 GB | ~28 GB | 4K only |
| Q4_K_M | ~19 GB | ~21 GB | ~27 GB | 4K, comfortable |
| Q3_K_M | ~12 GB | ~14 GB | ~20 GB | Yes, lots of headroom |
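
The "Fits 24 GB?" column is just a comparison against the card's capacity with a little headroom. A minimal sketch of that check (the 1 GB headroom margin is my own assumption):

```python
def fits(total_gb: float, gpu_vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the model's total footprint leaves some headroom on the card."""
    return total_gb + headroom_gb <= gpu_vram_gb

# Qwen 3 32B totals from the table above, checked against a 24 GB card
for quant, at_4k, at_16k in [("Q5_K_M", 22, 28), ("Q4_K_M", 21, 27), ("Q3_K_M", 14, 20)]:
    verdict_4k = "fits" if fits(at_4k, 24) else "does not fit"
    verdict_16k = "fits" if fits(at_16k, 24) else "does not fit"
    print(f"{quant}: 4K {verdict_4k}, 16K {verdict_16k}")
```
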
4.0

Llama 3.3 70B: where quantization decides your GPU

For dense 70B models, quantization is not about optimization — it determines whether you can run the model at all. FP16 requires 140 GB, which is far beyond any consumer GPU. Even Q4_K_M needs ~38 GB, pushing past a single 24 GB GPU.

| Quantization | Weights | Total (4K ctx) | Fits on? | Quality vs FP16 |
| --- | --- | --- | --- | --- |
| FP16 | ~140 GB | ~146 GB | No consumer GPU | Baseline |
| Q8_0 | ~74 GB | ~80 GB | 3-4× 24 GB | ~99.5% |
| Q6_K | ~53 GB | ~59 GB | 3× 24 GB | ~98.5% |
| Q5_K_M | ~44 GB | ~50 GB | 2× 24 GB | ~98% |
| Q4_K_M | ~38 GB | ~44 GB | 2× 24 GB, or 5090 + offload | ~97% |
| Q3_K_M | ~30 GB | ~36 GB | 5090 (32 GB) w/ offload | ~93% |
| Q2_K | ~20 GB | ~26 GB | 1× 24 GB (tight) | ~85% |
GPU decision framework
If you want to run 70B models, your GPU choice is determined by quantization tolerance. Q4_K_M on dual 24 GB GPUs gives you the best balance. An RTX 5090 can handle Q3 at 32 GB with partial offloading. See our 24 GB GPU guide for specific hardware picks.
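
For multi-GPU planning, the first-order question is simply how many cards the total footprint spans. A rough sketch, assuming layers split cleanly across identical cards and ignoring per-GPU duplication of buffers (real setups need a little extra headroom):

```python
import math

def gpus_needed(total_vram_gb: float, per_gpu_gb: float = 24.0) -> int:
    """Minimum number of identical GPUs, assuming an even split of layers."""
    return math.ceil(total_vram_gb / per_gpu_gb)

# Llama 3.3 70B totals at 4K context, from the table above
for quant, total in [("Q6_K", 59), ("Q4_K_M", 44)]:
    print(f"{quant}: at least {gpus_needed(total)} x 24 GB GPUs")
```
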
5.0

MoE models: quantization on the 2026 generation

MoE (Mixture of Experts) models like Qwen 3 235B-A22B and Llama 4 Scout 109B have changed the quantization calculus. These models store many experts but only activate a few per token. All experts must be quantized and loaded into VRAM, but inference speed depends only on the active experts. This means quantization is even more critical for MoE models — you are compressing parameters that are loaded but rarely used.

| Model | Total params | Active params | Q4_K_M weights | Q2_K weights | Fits on? |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B (MoE) | 20B | 3.6B | ~12 GB | ~6 GB | 16 GB at Q4 |
| Qwen 3 30B-A3B (MoE) | 30B | 3B | ~18 GB | ~9 GB | 24 GB at Q4 |
| Gemma 4 26B-A4B (MoE) | 26B | 4B | ~15 GB | ~7.5 GB | 24 GB at Q4 |
| Llama 4 Scout (MoE) | 109B | 17B | ~60 GB | ~30 GB | 2× 24 GB at Q2 |
| Qwen 3 235B-A22B (MoE) | 235B | 22B | ~128 GB | ~64 GB | 3-4× 24 GB at Q2 |

The key insight: an MoE model at Q2_K often delivers better quality than a dense model at Q4_K_M with the same VRAM footprint, because the MoE's much larger total parameter count still carries more capability even when each weight holds less precision. If you are choosing between a dense 14B at Q4 and an MoE 30B at Q2 in the same VRAM budget, the MoE model usually wins.
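
The asymmetry is easy to express in code: the memory footprint follows the total parameter count, while the weights actually read per token follow only the active count. A minimal sketch using the idealized bytes-per-parameter figures from section 1 (published GGUF files typically run a few GB larger because some tensors are kept at higher precision):

```python
BYTES_PER_PARAM = {"Q4_K_M": 0.500, "Q2_K": 0.250}

def resident_weights_gb(total_params_b: float, quant: str) -> float:
    """VRAM for weights follows TOTAL parameters - every expert stays loaded."""
    return total_params_b * BYTES_PER_PARAM[quant]

def weights_read_per_token_gb(active_params_b: float, quant: str) -> float:
    """Bandwidth per token follows ACTIVE parameters - only a few experts fire."""
    return active_params_b * BYTES_PER_PARAM[quant]

# Qwen 3 235B-A22B: 235B parameters stored, ~22B active per token
print(f"Resident at Q2_K:       ~{resident_weights_gb(235, 'Q2_K'):.0f} GB")
print(f"Read per token at Q2_K: ~{weights_read_per_token_gb(22, 'Q2_K'):.1f} GB")
```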

6.0

What quantization does NOT compress

Quantization only reduces model weight memory. It does not affect three other VRAM consumers that you must budget for separately. This is a common source of confusion — people see that Q4 cuts weights to 25% and assume their total VRAM drops by the same factor.

  • KV cache: The attention cache stores activations in FP16 regardless of model quantization. At long context lengths, this can exceed model weights (see the sketch after this list). See our KV cache guide for the full breakdown.
  • Framework overhead: Inference engines allocate memory for the computation graph, tokenizer buffers, and CUDA workspace. This adds 10-20% on top of weights + cache.
  • Batch size: Serving multiple users simultaneously multiplies the KV cache. Local inference at batch size 1 avoids this, but API-style deployments need to budget for it.
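
For the KV cache specifically, the size is set by the model's attention geometry and the context length, not by weight quantization. A minimal sketch, using a hypothetical 8B-class configuration (36 layers, 8 grouped-query KV heads, head dimension 128; these exact numbers are assumptions for illustration):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical 8B-class model: 36 layers, 8 KV heads, head_dim 128
print(f"4K context:  ~{kv_cache_gb(36, 8, 128, 4_096):.2f} GB")
print(f"32K context: ~{kv_cache_gb(36, 8, 128, 32_768):.2f} GB")
```
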
7.0

Picking your quantization strategy

The right quantization depends on your GPU, your model target, and your quality tolerance. Here is a decision framework based on common hardware configurations in 2026.

| GPU VRAM | Target model | Recommended quant | Why |
| --- | --- | --- | --- |
| 8 GB | 7B-8B models | Q4_K_M | Fits with context headroom |
| 12 GB | 9B-14B models | Q5_K_M or Q4_K_M | Q5 fits 9B, Q4 fits 14B |
| 16 GB | 14B-24B models, MoE 20B | Q4_K_M | Fits dense 14B or MoE 20B |
| 24 GB | 32B models | Q5_K_M | Best quality on a single GPU |
| 24 GB | MoE 30B models | Q4_K_M | Fits with headroom for context |
| 24 GB | 70B dense models | Q2_K (single GPU) | Minimum viable; use dual GPUs for Q4 |
| 32 GB | 70B models | Q3_K_M or Q4_K_M | Q3 nearly fits (light offload), Q4 needs more offload |
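
If you want the table as a rule of thumb in code, the logic is: reserve room for KV cache and overhead, then take the highest-precision quant whose weights still fit. A rough sketch for dense models (the 3.5 GB reserve is my own assumption and will not reproduce every row above):

```python
QUANT_LEVELS = [("Q8_0", 1.062), ("Q6_K", 0.750), ("Q5_K_M", 0.625),
                ("Q4_K_M", 0.500), ("Q3_K_M", 0.375), ("Q2_K", 0.250)]

def pick_quant(gpu_vram_gb: float, dense_params_b: float,
               reserve_gb: float = 3.5) -> str:
    """Highest-quality quant whose weights fit after reserving space for
    KV cache and framework overhead. For dense models only."""
    budget = gpu_vram_gb - reserve_gb
    for quant, bytes_per_param in QUANT_LEVELS:
        if dense_params_b * bytes_per_param <= budget:
            return quant
    return "no single-GPU fit: use a smaller model or add a GPU"

print(pick_quant(24, 32))  # Q5_K_M, matching the 24 GB / 32B row above
print(pick_quant(24, 70))  # Q2_K, the minimum-viable single-GPU option
print(pick_quant(8, 8))    # Q4_K_M
```
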
8.0

Calculate your exact quantization-to-VRAM mapping

These numbers are accurate for the models listed, but every model has slightly different architecture parameters that affect exact memory usage. To get your exact numbers, pick any model and quantization level, set your context length, and get a VRAM breakdown showing model weights, KV cache, overhead, and which GPUs can run it.


Frequently Asked Questions

What is the best quantization for local LLMs?
Q4_K_M is the sweet spot for most users. It reduces VRAM by 70-75% compared to FP16 with minimal quality loss. Q5_K_M offers slightly better quality for 10-15% more VRAM. Q2_K saves the most memory but with noticeable quality degradation.
How much VRAM does Q4_K_M save vs FP16?
Q4_K_M uses approximately 0.5 bytes per parameter vs 2 bytes for FP16 — a 75% reduction in model weight memory. For Qwen 3 8B, that is roughly 4 GB vs 16 GB for weights alone. However, KV cache and overhead are not affected by quantization.
Does quantization affect MoE models differently?
Quantization applies to all parameters including MoE expert weights. Since MoE models have more total parameters to store, quantization savings are even more impactful. Qwen 3 235B-A22B at Q4_K_M uses ~128 GB vs ~470 GB at FP16 for weights alone — a 342 GB saving.
Does quantization affect inference speed?
Quantized models are often faster than FP16 because less data needs to move through memory bandwidth, which is the bottleneck for LLM inference. Q4 models can be 20-40% faster than FP16 on the same hardware.
