Jan 22, 2026

How Much VRAM Do You Need for Local LLMs: The Tier-by-Tier Reality Check

VRAM is the single bottleneck that decides which models you can run. Here are the formula, the numbers, and the tiers, so you buy once and don't overspend.

Andre
GPU · AI · LLMs
1.0 The VRAM formula

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters x bytes_per_weight
| Quantization | Bytes/Param | 7B Model | 70B Model |
|--------------|-------------|----------|-----------|
| FP16         | 2.0         | ~14 GB   | ~140 GB   |
| Q8           | 1.0         | ~7 GB    | ~70 GB    |
| Q4_K_M       | 0.5         | ~4 GB    | ~38 GB    |
| Q3_K_M       | 0.375       | ~3 GB    | ~30 GB    |
| Q2_K         | 0.25        | ~2 GB    | ~20 GB    |

Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights. For a deeper dive into how GPU architecture affects these numbers, see Tim Dettmers' GPU guide — still the most-cited reference for VRAM and bandwidth math.
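
To make the arithmetic concrete, here is a minimal Python sketch of the weights-plus-overhead estimate. The bytes-per-weight values mirror the table above; the 15% overhead factor is an assumed midpoint of the 10-20% range, not a measured constant, and the KV cache is excluded.

```python
# Minimal sketch: weights + framework overhead (KV cache excluded).
# Bytes-per-weight values come from the quantization table above;
# the 15% overhead is an assumed midpoint of the 10-20% range.

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def vram_estimate_gb(params_billions: float, quant: str, overhead: float = 0.15):
    """Return (raw weights in GB, weights + framework overhead in GB)."""
    weights = params_billions * BYTES_PER_WEIGHT[quant]  # 1B params at 1 byte/param ~ 1 GB
    return weights, weights * (1 + overhead)

if __name__ == "__main__":
    for size, quant in [(7, "Q4_K_M"), (7, "FP16"), (70, "Q4_K_M")]:
        weights, total = vram_estimate_gb(size, quant)
        print(f"{size}B {quant:7s}: ~{weights:.1f} GB weights, ~{total:.1f} GB with overhead")
```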

2.0 The hidden cost: KV cache

Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.

| Context Length  | 7B at Q4 Total | 70B at Q4 Total |
|-----------------|----------------|-----------------|
| 2,048 tokens    | ~4.5 GB        | ~40 GB          |
| 8,192 tokens    | ~5.5 GB        | ~46 GB          |
| 32,768 tokens   | ~8 GB          | ~62 GB          |
| 128,000 tokens  | ~14 GB         | ~110 GB         |
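
The totals above fold the KV cache into the Q4 weight figures. To estimate the cache on its own, a rough sketch is below. The layer count, KV-head count, and head dimension are assumed, Mistral-7B-style values used purely for illustration; real figures shift with the architecture (grouped-query attention shrinks the cache considerably) and with KV-cache quantization.

```python
# KV-cache size grows linearly with context length. Per-token cost depends
# on the model architecture (layers, KV heads, head dim) and cache precision.
# The defaults below are assumed, Mistral-7B-like values, not measured ones.

def kv_cache_gb(context_tokens: int,
                n_layers: int = 32,
                n_kv_heads: int = 8,
                head_dim: int = 128,
                bytes_per_elem: float = 2.0) -> float:
    """Estimate KV-cache size in GB: keys + values for every layer and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * bytes_per_token / 1024**3

if __name__ == "__main__":
    for ctx in (2_048, 8_192, 32_768, 128_000):
        print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```
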
3.0 Bandwidth determines token speed

During autoregressive generation, the GPU reads the entire model for each token. The theoretical maximum token speed is bandwidth divided by model size. Real-world throughput is typically 60-80% of theoretical due to overhead and KV cache access. llama.cpp and Ollama both report real token generation speeds so you can verify these numbers against your own hardware.

tokens/second = bandwidth_GB_per_s / model_size_GB x efficiency
RTX 4090 (1,008 GB/s), 70B at Q4 (38 GB): 1,008 / 38 x 0.7 = ~18.5 t/s
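
The same arithmetic in code, with the 0.7 efficiency factor as an assumed midpoint of the 60-80% range:

```python
# Back-of-the-envelope decode speed: each generated token re-reads the full
# set of weights, so memory bandwidth sets the ceiling. The 0.7 efficiency
# factor is an assumption drawn from the 60-80% range mentioned above.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                      efficiency: float = 0.7) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

if __name__ == "__main__":
    # RTX 4090 (1,008 GB/s) running a 70B model at Q4 (~38 GB of weights)
    print(f"~{tokens_per_second(1008, 38):.1f} t/s")  # prints ~18.6 t/s
```
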
4.0 Quick tier summary

| VRAM    | Models You Can Run                    | Verdict              |
|---------|---------------------------------------|----------------------|
| 8 GB    | 7B at Q4                              | Experimentation only |
| 12 GB   | 7B at Q8, 13B at Q4                   | Basic inference      |
| 16 GB   | 13B at Q8, 34B at Q3                  | Solid 7B-13B usage   |
| 24 GB   | 35B at Q4, Mixtral 8x7B, 70B at Q3    | Sweet spot for most  |
| 32 GB   | 70B at Q4, long context 35B+          | Fewest compromises   |
| 48 GB+  | 70B at FP16, Mixtral 8x22B            | Multi-GPU territory  |
5.0 Calculate your exact VRAM

These tables give you ballpark numbers. For a precise figure for your specific model, quantization, and context length, use the LLM VRAM Calculator to get an exact memory breakdown and see which GPUs can run your setup.


Frequently Asked Questions

How does quantization affect VRAM usage?
Linearly. A 7B model at FP16 needs ~14 GB. At Q8, ~7 GB. At Q4, ~4 GB. Q4_K_M is the most popular quantization because it retains ~97% of FP16 quality at roughly 25% of the size. The relationship is simply: VRAM = parameters x bytes_per_weight.
How does context length affect VRAM?
The KV cache grows linearly with context length. For a 7B model at Q4, 2,048 tokens of context might use 4.5 GB total, while 32,768 tokens could push that to 8+ GB. Budget extra VRAM for long context - roughly 30% more than model weights alone.
Can I offload part of a model to system RAM?
Yes, llama.cpp supports GPU/CPU split. You can run a model larger than your VRAM by keeping some layers on the GPU and the rest in system RAM. The downside is speed: CPU layers run 3-5x slower. Keep as many layers on GPU as your VRAM allows.
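
As an illustration, here is a sketch of the split using the llama-cpp-python bindings; the model path is a placeholder, and the right n_gpu_layers value depends on your model and VRAM (-1 offloads every layer to the GPU).

```python
# GPU/CPU split with llama-cpp-python: keep n_gpu_layers layers on the GPU
# and run the remainder on the CPU. Path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-70b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,  # as many layers as your VRAM allows; -1 = all on GPU
    n_ctx=8192,       # context window; larger values reserve more KV-cache memory
)

out = llm("How much VRAM does a 70B model need?", max_tokens=64)
print(out["choices"][0]["text"])
```
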
Do I need ECC memory for local LLMs?
No. Consumer GPUs without ECC work fine for inference. ECC matters more for training where bit flips can corrupt weights over many iterations. For inference, a random bit flip barely affects output quality.
