How Much VRAM Do You Need for Local LLMs?
A complete breakdown of VRAM requirements for local LLMs by model size, quantization level, and context length. Covers 8 GB to 48 GB tiers with specific model recommendations.
Why VRAM matters most
VRAM is the single most important specification when choosing a GPU for local LLMs. It determines which models you can run, how fast they generate tokens, and how much context you can use. This guide breaks down exactly how much VRAM you need based on the models you want to run, the quantization you use, and your performance expectations.
VRAM Required by Model Size and Quantization
The table below shows approximate VRAM requirements for popular model sizes at different quantization levels. These figures include model weights but not the KV cache (context memory), which adds 0.5-4 GB depending on context length.
| Model Size | FP16 | 8-bit (Q8) | 4-bit (Q4) | 3-bit (Q3) | Example Models |
|---|---|---|---|---|---|
| 3B | ~6 GB | ~3 GB | ~2 GB | ~1.5 GB | Phi-3 Mini |
| 7B | ~14 GB | ~7 GB | ~4 GB | ~3 GB | Llama 3.1 8B, Mistral 7B |
| 13B | ~26 GB | ~13 GB | ~8 GB | ~6 GB | Llama 2 13B, Vicuna 13B |
| 34B | ~68 GB | ~34 GB | ~20 GB | ~15 GB | Command R, CodeLlama 34B |
| 70B | ~140 GB | ~70 GB | ~38 GB | ~30 GB | Llama 3.1 70B, Qwen2.5 72B |
* 4-bit (Q4) is the most common quantization level for local inference. Values are approximate and vary with the specific model architecture.
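You can sanity-check these figures yourself by multiplying parameter count by bits per weight. A minimal Python sketch follows; the bits-per-weight averages are approximations, since quantized GGUF-style formats store per-block scales alongside the weights and exact sizes vary by quant variant:

```python
# Rough VRAM needed for model weights alone; the KV cache is extra.
# Bits-per-weight values are approximate averages: e.g. GGUF Q4_0 is
# ~4.5 effective bits and Q4_K_M closer to ~4.8 once scales are counted.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8": 8.5, "Q4": 4.5, "Q3": 3.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB for a parameter count and quant level."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for size_b in (7, 13, 34, 70):
    print(f"{size_b:>2}B at Q4: ~{weights_gb(size_b, 'Q4'):.1f} GB")
# 7B ~3.9 GB, 13B ~7.3 GB, 34B ~19.1 GB, 70B ~39.4 GB
```

The output lands within a gigabyte or two of the Q4 column above; the spread is down to per-model architecture differences and which Q4 variant a given file uses.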
VRAM Tiers at a Glance
| VRAM | Tier | What You Can Run | Best For |
|---|---|---|---|
| 8 GB | Entry-level | 7B at Q4, 3B at Q8 | Experimentation only |
| 12 GB | Budget | 7B at Q8, 13B at Q4 | Basic local inference |
| 16 GB | Mid-range | 13B at Q8, 34B at Q3 | Comfortable 7B-13B usage |
| 24 GB | Enthusiast | 34B at Q4, Mixtral 8x7B at Q3 | Sweet spot for most users |
| 32 GB | Premium | 70B at Q3, 34B at Q6 with long context | Fewest compromises |
| 48 GB+ (multi-GPU) | Power user | 70B at Q4 with long context, Mixtral 8x22B at low quant | Maximum flexibility |
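Whether a particular model and context length fit a tier comes down to weights plus KV cache plus runtime overhead. Here is a minimal fit check in Python, assuming ~4.5 bits per weight for Q4, roughly 0.13 GB of FP16 KV cache per 1,000 tokens for a Llama-style 7B-8B model, and about 1.5 GB of framework overhead; all three are rough assumptions, not measured values:

```python
# Hedged total-VRAM estimate: weights + KV cache + runtime overhead.
def estimate_total_gb(params_billion: float,
                      bits_per_weight: float = 4.5,  # ~Q4 average
                      context_k: float = 8.0,        # context, thousands of tokens
                      kv_gb_per_k: float = 0.13,     # FP16 KV cache, Llama-style 7B-8B
                      overhead_gb: float = 1.5) -> float:  # CUDA context, buffers
    weights = params_billion * bits_per_weight / 8   # 1e9 params * bits / 8 bytes ~ GB
    return weights + context_k * kv_gb_per_k + overhead_gb

total = estimate_total_gb(params_billion=7, context_k=32)
print(f"7B at Q4 with 32K context: ~{total:.1f} GB")  # ~9.6 GB
```

By this estimate, a full 32K context already pushes a 7B model past an 8 GB card, which is why the tier table treats 8 GB as experimentation only.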
Recommended GPUs by VRAM Tier
Quick picks for each VRAM tier. Click through to the full guide for detailed analysis.

- GeForce RTX 5080: 16 GB GDDR7, best 16 GB pick
- Radeon RX 7900 XTX: 24 GB GDDR6, cheapest new 24 GB card
How Context Length Affects VRAM
Context length is the hidden VRAM cost. The KV cache stores the attention keys and values for every token in the conversation. Longer conversations and documents mean more tokens in the cache, which means more VRAM consumed beyond the model weights themselves.
For a 7B model at 4-bit quantization, the model weights use roughly 4 GB. At 2,048 tokens of context, the KV cache might add 0.5 GB. At 32,768 tokens, that can balloon to 4+ GB — doubling your total VRAM usage. Larger models scale even more aggressively.
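You can estimate the KV cache yourself from a model's architecture. A back-of-the-envelope sketch, using Llama 3.1 8B's published shape (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and an FP16 cache:

```python
# KV cache size: keys + values for every layer, KV head, and cached token.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:  # 2 bytes = FP16
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128
for ctx in (2_048, 8_192, 32_768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.2f} GB")
# 2,048 -> ~0.27 GB; 32,768 -> ~4.29 GB, matching the "4+ GB" figure above
```

Models without grouped-query attention (Llama 2 7B caches all 32 heads, for example) store four times as much per token, which is why published estimates vary so widely. Some runtimes can also quantize the KV cache to 8-bit, roughly halving these figures.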
If you regularly work with long documents (research papers, codebases, books), budget at least 30% more VRAM than the model weights alone require. For short conversations and prompts, the KV cache overhead is negligible.
Frequently Asked Questions
Does quantization quality affect VRAM usage?
Yes, directly. Weight memory scales with bits per weight, so a Q4 model needs roughly a quarter of the VRAM of the same model at FP16, at the cost of a modest quality loss. See the table at the top of this guide.

How does context length affect VRAM?
The KV cache grows linearly with the number of tokens in context, on top of the model weights. See the context length section above for worked numbers.

Can I offload part of a model to system RAM?
Yes. llama.cpp-based runtimes, including Ollama, can keep some layers on the GPU and run the rest on the CPU. It works, but token generation slows sharply as more layers spill into system RAM.

Do I need ECC memory for local LLMs?
No. ECC matters for long training runs and servers; for local inference, ordinary consumer GDDR memory is fine.

Is VRAM speed (GDDR6 vs GDDR6X vs GDDR7) important?
Yes, for speed rather than capability. Token generation is largely memory-bandwidth-bound, so faster VRAM means more tokens per second, but capacity still determines which models you can load at all.

Should I buy two smaller GPUs instead of one large one?
Most runtimes can split a model across GPUs, and two 24 GB cards are often the cheapest route to 48 GB. A single large card is simpler and usually faster, so prefer one GPU when the budget allows.
Looking for specific GPU recommendations? Our main guide covers every budget and VRAM tier.
Best GPU for Local LLMs →
