Jan 22, 2026

How Much VRAM Do You Need for Local LLMs: The Tier-by-Tier Reality Check

VRAM is the single bottleneck that decides which models you can run. Here are the formula, the numbers, and the tiers, so you buy once and don't overspend.

Andre
GPU · AI · LLMs
1.0 The VRAM formula

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters x bytes_per_weight
| Quantization | Bytes/Param | 7B Model | 70B Model |
|--------------|-------------|----------|-----------|
| FP16         | 2.0         | ~14 GB   | ~140 GB   |
| Q8           | 1.0         | ~7 GB    | ~70 GB    |
| Q4_K_M       | 0.5         | ~4 GB    | ~38 GB    |
| Q3_K_M       | 0.375       | ~3 GB    | ~30 GB    |
| Q2_K         | 0.25        | ~2 GB    | ~20 GB    |

Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights. For a deeper dive into how GPU architecture affects these numbers, see Tim Dettmers' GPU guide — still the most-cited reference for VRAM and bandwidth math.
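
To make the arithmetic concrete, here is a minimal Python sketch of the weights-plus-overhead estimate. The bytes-per-weight values mirror the table above; the 15% overhead factor is an assumed midpoint of the 10-20% range, not a measured constant, and the KV cache is excluded.

```python
# Minimal sketch: weights + framework overhead (KV cache excluded).
# Bytes-per-weight values come from the quantization table above;
# the 15% overhead is an assumed midpoint of the 10-20% range.

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def vram_estimate_gb(params_billions: float, quant: str, overhead: float = 0.15):
    """Return (raw weights in GB, weights + framework overhead in GB)."""
    weights = params_billions * BYTES_PER_WEIGHT[quant]  # 1B params at 1 byte/param ~ 1 GB
    return weights, weights * (1 + overhead)

if __name__ == "__main__":
    for size, quant in [(7, "Q4_K_M"), (7, "FP16"), (70, "Q4_K_M")]:
        weights, total = vram_estimate_gb(size, quant)
        print(f"{size}B {quant:7s}: ~{weights:.1f} GB weights, ~{total:.1f} GB with overhead")
```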

2.0 The hidden cost: KV cache

Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.

| Context Length  | 7B at Q4 Total | 70B at Q4 Total |
|-----------------|----------------|-----------------|
| 2,048 tokens    | ~4.5 GB        | ~40 GB          |
| 8,192 tokens    | ~5.5 GB        | ~46 GB          |
| 32,768 tokens   | ~8 GB          | ~62 GB          |
| 128,000 tokens  | ~14 GB         | ~110 GB         |
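
The totals above fold the KV cache into the Q4 weight figures. To estimate the cache on its own, a rough sketch is below. The layer count, KV-head count, and head dimension are assumed, Mistral-7B-style values used purely for illustration; real figures shift with the architecture (grouped-query attention shrinks the cache considerably) and with KV-cache quantization.

```python
# KV-cache size grows linearly with context length. Per-token cost depends
# on the model architecture (layers, KV heads, head dim) and cache precision.
# The defaults below are assumed, Mistral-7B-like values, not measured ones.

def kv_cache_gb(context_tokens: int,
                n_layers: int = 32,
                n_kv_heads: int = 8,
                head_dim: int = 128,
                bytes_per_elem: float = 2.0) -> float:
    """Estimate KV-cache size in GB: keys + values for every layer and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * bytes_per_token / 1024**3

if __name__ == "__main__":
    for ctx in (2_048, 8_192, 32_768, 128_000):
        print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```
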
3.0 Bandwidth determines token speed

During autoregressive generation, the GPU reads the entire model for each token. The theoretical maximum token speed is bandwidth divided by model size. Real-world throughput is typically 60-80% of theoretical due to overhead and KV cache access. llama.cpp and Ollama both report real token generation speeds so you can verify these numbers against your own hardware.

tokens/second = bandwidth_GB_per_s / model_size_GB x efficiency
RTX 4090 (1,008 GB/s), 70B at Q4 (38 GB): 1,008 / 38 x 0.7 = ~18.5 t/s
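
The same arithmetic in code, with the 0.7 efficiency factor as an assumed midpoint of the 60-80% range:

```python
# Back-of-the-envelope decode speed: each generated token re-reads the full
# set of weights, so memory bandwidth sets the ceiling. The 0.7 efficiency
# factor is an assumption drawn from the 60-80% range mentioned above.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                      efficiency: float = 0.7) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

if __name__ == "__main__":
    # RTX 4090 (1,008 GB/s) running a 70B model at Q4 (~38 GB of weights)
    print(f"~{tokens_per_second(1008, 38):.1f} t/s")  # prints ~18.6 t/s
```
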
4.0 Quick tier summary

| VRAM    | Models You Can Run                    | Verdict              |
|---------|---------------------------------------|----------------------|
| 8 GB    | 7B at Q4                              | Experimentation only |
| 12 GB   | 7B at Q8, 13B at Q4                   | Basic inference      |
| 16 GB   | 13B at Q8, 34B at Q3                  | Solid 7B-13B usage   |
| 24 GB   | 35B at Q4, Mixtral 8x7B, 70B at Q3    | Sweet spot for most  |
| 32 GB   | 70B at Q4, long context 35B+          | Fewest compromises   |
| 48 GB+  | 70B at FP16, Mixtral 8x22B            | Multi-GPU territory  |
5.0 Calculate your exact VRAM

These tables give you ballpark numbers. For a precise figure for your specific model, quantization, and context length, use the LLM VRAM Calculator to get an exact memory breakdown and see which GPUs can run your setup.


Frequently Asked Questions

How does quantization affect VRAM usage?
Linearly. A 7B model at FP16 needs ~14 GB. At Q8, ~7 GB. At Q4, ~4 GB. Q4_K_M is the most popular quantization because it retains ~97% of FP16 quality at roughly 25% of the size. The relationship is simply: VRAM = parameters x bytes_per_weight.
How does context length affect VRAM?
The KV cache grows linearly with context length. For a 7B model at Q4, 2,048 tokens of context might use 4.5 GB total, while 32,768 tokens could push that to 8+ GB. Budget extra VRAM for long context - roughly 30% more than model weights alone.
Can I offload part of a model to system RAM?
Yes, llama.cpp supports GPU/CPU split. You can run a model larger than your VRAM by keeping some layers on the GPU and the rest in system RAM. The downside is speed: CPU layers run 3-5x slower. Keep as many layers on GPU as your VRAM allows.
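
As an illustration, here is a sketch of the split using the llama-cpp-python bindings; the model path is a placeholder, and the right n_gpu_layers value depends on your model and VRAM (-1 offloads every layer to the GPU).

```python
# GPU/CPU split with llama-cpp-python: keep n_gpu_layers layers on the GPU
# and run the remainder on the CPU. Path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-70b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,  # as many layers as your VRAM allows; -1 = all on GPU
    n_ctx=8192,       # context window; larger values reserve more KV-cache memory
)

out = llm("How much VRAM does a 70B model need?", max_tokens=64)
print(out["choices"][0]["text"])
```
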
Do I need ECC memory for local LLMs?
No. Consumer GPUs without ECC work fine for inference. ECC matters more for training where bit flips can corrupt weights over many iterations. For inference, a random bit flip barely affects output quality.
