
How Much VRAM Do You Need for Local LLMs?

A complete breakdown of VRAM requirements for local LLMs by model size, quantization level, and context length. Covers 8 GB to 48 GB tiers with specific model recommendations.


PC Part Guide

April 24, 2026

PC Part Guide is supported by its audience. We may earn commissions from qualifying purchases through affiliate links on this page. See our full disclosure.

Why VRAM matters most

VRAM is the single most important specification when choosing a GPU for local LLMs. It determines which models you can run, how fast they generate tokens, and how much context you can use. This guide breaks down exactly how much VRAM you need based on the models you want to run, the quantization you use, and your performance expectations.

VRAM Required by Model Size and Quantization

The table below shows approximate VRAM requirements for popular model sizes at different quantization levels. These figures include model weights but not the KV cache (context memory), which adds 0.5-4 GB depending on context length.

Model Size | FP16    | 8-bit (Q8) | 4-bit (Q4) | 3-bit (Q3) | Example Models
3B         | ~6 GB   | ~3 GB      | ~2 GB      | ~1.5 GB    | Phi-3 Mini
7B         | ~14 GB  | ~7 GB      | ~4 GB      | ~3 GB      | Llama 3.1 8B, Mistral 7B
13B        | ~26 GB  | ~13 GB     | ~8 GB      | ~6 GB      | Llama 3.1 8B (long ctx)
34B        | ~68 GB  | ~34 GB     | ~20 GB     | ~15 GB     | Command R, CodeLlama 34B
70B        | ~140 GB | ~70 GB     | ~38 GB     | ~30 GB     | Llama 3.1 70B, Mixtral 8x7B

* 4-bit (Q4) is the most common quantization level for local inference. Values are approximate and vary by specific model architecture.
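The arithmetic behind these figures is simple: parameters times bits per weight, divided by 8 to get bytes. The sketch below is illustrative and not tied to any particular runtime; the effective bits-per-weight values (e.g. ~4.5 bits for Q4, since quantization scales add overhead beyond the nominal 4 bits) are our assumptions.

```python
# Rough VRAM estimate for model weights at a given quantization level.
# Effective bits per weight include quantization overhead (scales,
# zero-points), so Q4 is modeled as ~4.5 bits rather than exactly 4.
BITS_PER_WEIGHT = {"fp16": 16, "q8": 8.5, "q4": 4.5, "q3": 3.5}

def weights_vram_gb(params_billions: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    # params (billions) * bits / 8 bits-per-byte = gigabytes of weights
    return params_billions * bits / 8

print(round(weights_vram_gb(7, "fp16"), 1))  # ~14 GB, matching the table
print(round(weights_vram_gb(70, "q4"), 1))   # ~39 GB
```

Remember these are weights only; add KV-cache headroom on top (see the context-length section below).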

VRAM Tiers at a Glance

VRAM               | Tier        | Runs Comfortably                    | Best For
8 GB               | Entry-level | 7B at Q4, 3B at Q8                  | Experimentation only
12 GB              | Budget      | 7B at Q8, 13B at Q4                 | Basic local inference
16 GB              | Mainstream  | 13B at Q8, 34B at Q3                | Comfortable 7B-13B usage
24 GB              | Enthusiast  | 34B at Q4, 70B at Q2, Mixtral 8x7B  | Sweet spot for most users
32 GB              | Premium     | 70B at Q3, 34B with long context    | Fewest compromises
48 GB+ (multi-GPU) | Power user  | 70B at Q4-Q5, Mixtral 8x22B at Q2   | Maximum flexibility

Recommended GPUs by VRAM Tier

Quick picks for each VRAM tier. Click through to the full guide for detailed analysis.

GeForce RTX 5080

16 GB GDDR7 — Best 16 GB pick

VRAM: 16 GB GDDR7
Bandwidth: 960 GB/s
Price: ~$999
Best for: 7B-13B models

Radeon RX 7900 XTX

24 GB GDDR6 — Cheapest new 24 GB

VRAM: 24 GB GDDR6
Bandwidth: 960 GB/s
Price: ~$900
Best for: 34B at Q4, 70B at Q2

Editor's Pick: GeForce RTX 5090

32 GB GDDR7 — Maximum flexibility

VRAM: 32 GB GDDR7
Bandwidth: 1,792 GB/s
Price: ~$1,999
Best for: 70B at Q3, long contexts

How Context Length Affects VRAM

Context length is the hidden VRAM cost. The KV cache stores the attention keys and values for every token in the conversation. Longer conversations and documents mean more tokens in the cache, which means more VRAM consumed beyond the model weights themselves.

For a 7B model at 4-bit quantization, the model weights use roughly 4 GB. At 2,048 tokens of context, the KV cache might add 0.5 GB. At 32,768 tokens, that can balloon to 4+ GB — doubling your total VRAM usage. Larger models scale even more aggressively.

If you regularly work with long documents (research papers, codebases, books), budget at least 30% more VRAM than the model weights alone require. For short conversations and prompts, the KV cache overhead is negligible.
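The KV-cache math above can be made concrete. The cache holds a key and a value vector per layer per token, so its size is 2 × layers × KV heads × head dimension × context length × bytes per element. The configuration values below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) are Llama 3.1 8B's published architecture, with the cache assumed to be stored in FP16:

```python
# KV-cache size: 2 (keys + values) * layers * kv_heads * head_dim
#                * context_len * bytes per element.
# Config: Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128),
# cache in FP16 (2 bytes per element).
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

print(kv_cache_gb(32, 8, 128, 2_048))   # 0.25 GB at 2K context
print(kv_cache_gb(32, 8, 128, 32_768))  # 4.0 GB at 32K context
```

Older models without grouped-query attention (e.g. Llama 2 7B, with 32 KV heads) use four times as much cache per token, which is why KV-cache overhead varies so widely between architectures.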


Frequently Asked Questions

Does quantization quality affect VRAM usage?
Yes, significantly. A 7B model at FP16 needs ~14 GB. At 8-bit (Q8), it needs ~7 GB. At 4-bit (Q4), it needs ~4 GB. The trade-off is a small loss in output quality for a large reduction in VRAM. For most use cases, Q4_K_M quantization provides excellent quality at roughly 25% of the FP16 size.
How does context length affect VRAM?
The KV cache grows linearly with context length. For a 7B model at 4-bit, running at 2,048 tokens of context might use 4.5 GB total, while 32,768 tokens could push that to 8+ GB. Larger models scale even faster. If you need long context windows, budget extra VRAM beyond what the model weights alone require.
Can I offload part of a model to system RAM?
Yes, llama.cpp supports GPU/CPU split offloading. You can run a model larger than your VRAM by keeping some layers on the GPU and the rest in system RAM. The downside is significantly slower inference speed for the layers running on CPU — often 3-5x slower overall.
Do I need ECC memory for local LLMs?
No. Consumer GPUs without ECC (RTX 4090, 5080, etc.) work fine for inference. ECC matters more for training where bit flips can corrupt model weights over many iterations. For inference, a random bit flip in a weight barely affects output quality.
Is VRAM speed (GDDR6 vs GDDR6X vs GDDR7) important?
Yes for inference speed, no for capacity. The memory type affects bandwidth, which determines how fast tokens are generated. GDDR7 (RTX 5090/5080) is fastest, followed by GDDR6X (RTX 4090/3090), then GDDR6 (RX 7900 XTX). All provide the same capacity per GB — the difference is how quickly the GPU can read that data.
Should I buy two smaller GPUs instead of one large one?
Two GPUs with tensor parallelism can work, but it adds complexity. Two used RTX 3090s (48 GB total) cost less than one RTX 5090 and run models that no single consumer GPU can. However, multi-GPU setups have communication overhead and require more power, cooling, and troubleshooting. Start with one GPU and add a second only if you need more VRAM.

Looking for specific GPU recommendations? Our main guide covers every budget and VRAM tier.

Best GPU for Local LLMs →