How Much VRAM Do You Need for Local LLMs: The Tier-by-Tier Reality Check
VRAM is the single bottleneck that decides which models you can run. Here are the formula, the numbers, and the tiers, so you buy once and do not overspend.

The VRAM formula
At its core, weight memory is parameter count × bytes per parameter: quantization shrinks the bytes, not the parameter count.
| Quantization | Bytes/Param | 7B Model | 70B Model |
|---|---|---|---|
| FP16 | 2.0 | ~14 GB | ~140 GB |
| Q8 | 1.0 | ~7 GB | ~70 GB |
| Q4_K_M | 0.5 | ~4 GB | ~38 GB |
| Q3_K_M | 0.375 | ~3 GB | ~30 GB |
| Q2_K | 0.25 | ~2 GB | ~20 GB |
Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights, and K-quants mix precisions across tensors, so real GGUF files often run a little larger than the nominal bits-per-weight suggest. For a deeper dive into how GPU architecture affects these numbers, see Tim Dettmers' GPU guide, still the most-cited reference for VRAM and bandwidth math.
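As a minimal sketch of that arithmetic in Python (the bytes-per-parameter values mirror the table above; the 15% overhead default is an assumed midpoint of the 10-20% range):

```python
# Rough VRAM estimate for model weights plus runtime overhead.
# Bytes-per-parameter values mirror the quantization table above;
# the 15% overhead default is an assumption (middle of the 10-20% range).
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8": 1.0,
    "q4_k_m": 0.5,
    "q3_k_m": 0.375,
    "q2_k": 0.25,
}

def estimate_weight_vram_gb(params_billions: float, quant: str, overhead: float = 0.15) -> float:
    """Approximate GB for weights at a given quantization, including runtime overhead."""
    raw_gb = params_billions * BYTES_PER_PARAM[quant]  # 1B params at 1 byte/param ~= 1 GB
    return raw_gb * (1 + overhead)

if __name__ == "__main__":
    print(f"7B  @ Q4_K_M: ~{estimate_weight_vram_gb(7, 'q4_k_m'):.1f} GB")   # ~4 GB
    print(f"70B @ Q4_K_M: ~{estimate_weight_vram_gb(70, 'q4_k_m'):.1f} GB")  # ~40 GB
```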
The hidden cost: KV cache
Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.
| Context Length | 7B at Q4 (total VRAM) | 70B at Q4 (total VRAM) |
|---|---|---|
| 2,048 tokens | ~4.5 GB | ~40 GB |
| 8,192 tokens | ~5.5 GB | ~46 GB |
| 32,768 tokens | ~8 GB | ~62 GB |
| 128,000 tokens | ~14 GB | ~110 GB |
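To estimate the cache yourself, the per-token cost is 2 (K and V) × layers × KV heads × head dimension × bytes per element. Here is a sketch assuming a Mistral-7B-style layout (32 layers, 8 KV heads, head dim 128) with FP16 and Q8 cache options; those architecture numbers are illustrative assumptions, so check your model's config for the real values.

```python
# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes/element.
# The 32-layer / 8-KV-head / 128-dim layout below is an assumed 7B-class model
# with grouped-query attention; real values live in the model's config.json.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Approximate KV cache size in GB for a given context length."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return bytes_per_token * context_tokens / 1e9

if __name__ == "__main__":
    for ctx in (1_000, 8_192, 32_768, 128_000):
        fp16 = kv_cache_gb(32, 8, 128, ctx)                        # FP16 cache
        q8 = kv_cache_gb(32, 8, 128, ctx, bytes_per_element=1.0)   # Q8 cache
        print(f"{ctx:>7} tokens: ~{fp16:.2f} GB (FP16 KV) / ~{q8:.2f} GB (Q8 KV)")
```

Exact figures also depend on whether the runtime quantizes the cache or uses sliding-window attention, which is why published per-token numbers vary.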
Bandwidth determines token speed
During autoregressive generation, the GPU re-reads every model weight from VRAM for each token, so the theoretical ceiling on token speed is memory bandwidth divided by model size in memory. Real-world throughput is typically 60-80% of that ceiling due to overhead and KV cache access. llama.cpp and Ollama both report real token generation speeds, so you can verify these numbers against your own hardware.
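A back-of-the-envelope check of that ceiling (the 0.7 efficiency default and the 4090-plus-7B pairing are illustrative assumptions):

```python
# Decode speed ceiling for a memory-bandwidth-bound model:
# tokens/s <= bandwidth / model size in memory, scaled by an efficiency factor
# (0.7 is an assumption in the middle of the 60-80% range quoted above).
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                          efficiency: float = 0.7) -> float:
    """Approximate sustained generation speed in tokens per second."""
    return (bandwidth_gb_s / model_size_gb) * efficiency

if __name__ == "__main__":
    # Example: RTX 4090 (~1,008 GB/s) running a 7B model at Q4 (~4.5 GB in VRAM)
    print(f"~{max_tokens_per_second(1008, 4.5):.0f} tokens/s")  # roughly 150-160
```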
Quick tier summary
| VRAM | Models You Can Run | Verdict |
|---|---|---|
| 8 GB | 7B at Q4 | Experimentation only |
| 12 GB | 7B at Q8, 13B at Q4 | Basic inference |
| 16 GB | 13B at Q8, 34B at Q3 | Solid 7B-13B usage |
| 24 GB | 35B at Q4, Mixtral 8x7B, 70B at Q2 | Sweet spot for most |
| 32 GB | 70B at Q3, long-context 35B+ | Fewest compromises |
| 48 GB+ | 70B at Q4 with long context, Mixtral 8x22B (quantized) | Multi-GPU territory |
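To reproduce the tier logic programmatically, a small helper can walk down the quantization ladder until something fits. The 15% overhead and the 1 GB KV-cache allowance are assumptions kept consistent with the tables above, not measured values.

```python
# Find the highest-precision quantization of a model that fits a VRAM budget.
# Quantization sizes mirror the formula table; the 15% overhead and 1 GB
# KV-cache allowance are assumptions, not measurements.
from typing import Optional

QUANTS = [("FP16", 2.0), ("Q8", 1.0), ("Q4_K_M", 0.5), ("Q3_K_M", 0.375), ("Q2_K", 0.25)]

def best_quant_for(vram_gb: float, params_billions: float,
                   kv_allowance_gb: float = 1.0, overhead: float = 0.15) -> Optional[str]:
    """Return the heaviest quantization that fits the budget, or None if nothing fits."""
    for name, bytes_per_param in QUANTS:
        needed_gb = params_billions * bytes_per_param * (1 + overhead) + kv_allowance_gb
        if needed_gb <= vram_gb:
            return name
    return None

if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 32, 48):
        print(f"{vram:>2} GB -> 13B: {best_quant_for(vram, 13)}, 70B: {best_quant_for(vram, 70)}")
```

Where the helper's output is more generous than the tier table, it is because the table bakes in extra headroom for context and buffers.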
Calculate your exact VRAM
These tables give you ballpark numbers. For a precise calculation with your specific model, quantization, and context length, use the LLM VRAM Calculator to get an exact memory breakdown and see which GPUs can run your setup.
Related Guides
Best GPU for Local LLMs
Every GPU ranked by VRAM tier — apply the numbers from this guide.
Is 16 GB VRAM Enough?
Why 12 GB GPUs are a trap and 16 GB is the real floor.
Is 24 GB VRAM Enough?
The sweet spot most builders miss when sizing their workstation.
24 GB vs 32 GB GPU
The price of stepping up from enthusiast to no-compromise VRAM.
Frequently Asked Questions
Does quantization quality affect VRAM usage?
Yes. Lower-bit quantizations shrink the weights roughly in proportion to bits per parameter, so a 7B model drops from ~14 GB at FP16 to ~4 GB at Q4, at some cost in output quality.
How does context length affect VRAM?
Every token in the context occupies KV cache on top of the weights, so long contexts can add several gigabytes, as the context-length table above shows.
Can I offload part of a model to system RAM?
Yes. Runtimes such as llama.cpp and Ollama can keep some layers in system RAM and run them on the CPU, but generation slows sharply because system RAM bandwidth is far lower than VRAM.
Do I need ECC memory for local LLMs?
No. ECC guards against rare bit flips and matters most for servers and long training runs; standard consumer GPU memory is fine for local inference.