
LLM VRAM Calculator

Estimate GPU memory requirements for running large language models locally. Select a model, choose your quantization, and dial in your context length to see exactly how much VRAM you need — plus which GPUs can handle it.


Example configuration (calculator defaults)

  • Model: 8.0B params · 32 layers · d=4096 · 8 KV heads · max 131,072 ctx
  • Quantization: 4-bit (Medium), roughly 97% of FP16 quality
  • Context length: 4,096 tokens (max: 131,072)
  • Batch size: batch > 1 multiplies the KV cache for parallel sequences, and prompt processing at batch > 1 requires additional scratch memory.
  • Overhead: the default 10% covers cuBLAS buffers and workspace; increase it for multi-GPU or unusual setups.

VRAM Estimate
Total VRAM Required
4.65 GB
Base: 4.15 GB  +  KV Cache: 0.50 GB
Per-token KV cost: 0.12 GB/1K tokens
Memory Usage vs Top GPU: 4.65 GB of 16 GB
Memory Breakdown
Model Weights: 3.74 GB
8.0B params × 0.5 bytes/weight
KV Cache: 0.50 GB
4,096 tokens · 125.00 MB per 1K tokens
System Overhead: 0.37 GB
10% of model weights (cuBLAS + workspace)
Scratchpad: 0.04 GB
~1% of weights (temporary tensors)
Formula (Standard): per-token KV = 2 × L × (d / g) × b_kv
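
To see where these numbers come from, the breakdown above can be reproduced in a few lines. A minimal sketch, assuming ~0.5 bytes per weight for the 4-bit quant, the default 10% overhead and ~1% scratchpad, and the 8 KV heads × 128 head dimension layout of Llama 3.1 8B (the formulas themselves are explained further down the page):

```python
# Reproduce the example estimate: 8.0B params, 4-bit (~0.5 bytes/weight),
# 32 layers, 8 KV heads, head_dim 128, 4,096-token context.
GIB = 1024**3

weights  = 8.0e9 * 0.5 / GIB                     # ~3.73 GiB of model weights
overhead = 0.10 * weights                        # cuBLAS buffers + workspace
scratch  = 0.01 * weights                        # temporary tensors
kv_cache = 2 * 32 * 8 * 128 * 2 * 4096 / GIB     # 2 * L * kv_heads * head_dim * b_kv * C = 0.5 GiB

total = weights + overhead + scratch + kv_cache
print(f"{total:.2f} GiB")                        # ~4.64, matching the ~4.65 GB shown above
```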
Recommended GPUs

GPUs with ≥ 5 GB VRAM

Radeon RX 7600 (AMD): 8 GB GDDR6 · Boost Clock 2,655 MHz · $259.99
GeForce RTX 5060 (NVIDIA): 8 GB GDDR7 · Boost Clock 2,497 MHz · $299.99
GeForce RTX 4060 (NVIDIA): 8 GB GDDR6 · Boost Clock 2,460 MHz · $299.99
Radeon RX 9060 XT (AMD): 8 GB GDDR6 · Boost Clock 2,690 MHz · $349.99
GeForce RTX 4060 Ti (NVIDIA): 8 GB GDDR6 · Boost Clock 2,535 MHz · $399.99
Radeon RX 7800 XT (AMD): 16 GB GDDR6 · Boost Clock 2,430 MHz · $499.99
GeForce RTX 5070 (NVIDIA): 12 GB GDDR7 · Boost Clock 2,512 MHz · $549.99
Radeon RX 9070 (AMD): 16 GB GDDR6 · Boost Clock 2,700 MHz · $549.99

How Does Local LLM VRAM Calculation Work?

This calculator uses an empirical formula derived from real-world testing with the llama.cpp inference backend. VRAM usage consists of two distinct components: a fixed cost (model weights, CUDA overhead, scratchpad) and a variable cost that grows linearly with context length (KV cache).

Learn more: What is a Large Language Model? (Wikipedia) · GPU Memory & LLM Inference Explained (BentoML)

1. Fixed Cost (Constant)

  • Model Weights: The quantized parameters stored in GPU memory.
  • CUDA Overhead: ~10% of model weights for cuBLAS buffers, workspace, and driver allocations.
  • Scratchpad Memory: ~1% of model weights for temporary tensors and activations during inference.

2. Variable Cost (Linear with Context)

  • KV Cache: Grows linearly with every token in your prompt + generated output.
  • Per-token cost: Depends on the model's layer count, KV head count, and head dimension — not on total parameters.
  • Context Impact: Long conversations or documents can consume more VRAM than the model weights themselves.

Key Insight
The KV cache, not model size, often becomes the limiting factor for long contexts. Understanding this helps you optimize your hardware choices and usage patterns — a 7B model at 128K context can require more VRAM than a 32B model at 4K context.
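
To make that concrete, here is a rough sketch of the KV cache for an 8B Llama-style model at full 128K context (32 layers, 8 KV heads and a head dimension of 128 are typical values for this size class and are assumptions here, not read from a specific checkpoint); the ~18.5 GB figure for a 32B model at 4K context comes from the reference table further down the page:

```python
GIB = 1024**3

def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):
    """KV cache in GiB: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / GIB

# ~8B Llama-style model at 128K context: the KV cache alone is ~16 GiB,
# plus ~4-5 GB of 4-bit weights, so ~20 GB or more in total.
print(f"{kv_cache_gib(32, 8, 128, 131_072):.1f} GiB")

# A 32B model at 4K context needs roughly 18.5 GB total (see the reference table),
# so the small model with the huge context really is the bigger VRAM problem.
```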

Precise VRAM Formulas for Local LLMs

1. Standard Architecture (Llama, Mistral, Qwen 2.5, Gemma, Phi)

VRAM [GB] = (P × 10⁹ × B) ⁄ (1024³) × (1 + O + S)  +  (2 × L × KV_heads × head_dim × b_kv × C) ⁄ (1024³)
P = params in billions
B = bytes per weight (≈0.5 for 4-bit quants)
O = overhead fraction (default 0.10, i.e. 10%)
S = scratchpad fraction (≈0.01, i.e. ~1%)
L = transformer layers
KV_heads = GQA KV heads
head_dim = KV head dimension
b_kv = 2 bytes (FP16)
C = context length in tokens

KV_heads × head_dim is equivalent to d/g, where g = attention_heads / kv_heads (the GQA factor). For Llama 3.1 8B: g = 32/8 = 4, so d/g = 4096/4 = 1024, matching kv_heads × head_dim = 8 × 128 = 1024.
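
Written out as code, the standard formula is only a few lines. This is an illustrative sketch, not the calculator's actual implementation; the ~0.5 bytes-per-weight figure for 4-bit quants and the function name are assumptions:

```python
GIB = 1024**3

def vram_estimate_gib(params_b, bytes_per_weight, layers, kv_heads, head_dim,
                      ctx, overhead=0.10, scratch=0.01, kv_bytes=2):
    """Standard-architecture estimate (Llama, Mistral, Qwen 2.5, Gemma, Phi), in GiB."""
    weights = params_b * 1e9 * bytes_per_weight / GIB
    fixed = weights * (1 + overhead + scratch)                      # weights + overhead + scratchpad
    kv = 2 * layers * kv_heads * head_dim * kv_bytes * ctx / GIB    # grows linearly with ctx
    return fixed + kv

# Llama 3.1 8B, ~4-bit weights, 4,096-token context -> ~4.6 GiB
print(round(vram_estimate_gib(8.0, 0.5, 32, 8, 128, 4096), 2))
```

Because KV_heads and head_dim are passed in directly, the same sketch covers any GQA model without computing d/g explicitly.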

2. Decoupled Head Dimension (Qwen 3 32B)

VRAM [GB] = (P × 10⁹ × B) ⁄ (1024³) × (1 + O + S)  +  (2 × L × n_kv × head_dim_kv × b_kv × C) ⁄ (1024³)

Used for architectures where the KV head dimension is set independently of the model dimension, so head_dim cannot be derived as d / attention_heads and the d/g shortcut does not apply. Qwen 3 32B uses this kind of configuration, so its KV cache calculation differs from the standard d/g formula; Qwen 3 8B and other models in the family use the standard formula.
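
Since the sketch above already takes the KV head count and head dimension as explicit inputs, the decoupled case needs no separate code path: read n_kv and head_dim_kv from the model's published configuration instead of deriving the head dimension from d. The values below are placeholders for illustration, not Qwen 3 32B's actual configuration:

```python
# Hypothetical decoupled-head model: head_dim_kv is set independently of d / attention_heads.
cfg = {"layers": 64, "n_kv": 8, "head_dim_kv": 128}   # placeholder values
print(round(vram_estimate_gib(32.0, 0.5,
                              cfg["layers"], cfg["n_kv"], cfg["head_dim_kv"],
                              ctx=4096), 1))
```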

3. MoE Models (Mixtral, DeepSeek, Llama 4)

All expert parameters must be loaded into VRAM, not just the active subset. For example, Mixtral 8×7B has 46.7B total parameters (all 8 experts), even though only 12.9B are active per token. DeepSeek R1/V3 loads all 671B parameters across 256 experts (37B active). The calculator uses total parameters for the model weights calculation.
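
In code terms, the only MoE-specific rule is to feed total parameters, not active parameters, into the weights term. A rough sketch (~0.5 bytes per weight is an approximation; the reference table below uses a slightly higher effective bytes-per-weight for Q4_K_M, hence its 23.4 GB figure):

```python
GIB = 1024**3

# Mixtral 8x7B: all 8 experts (46.7B total params) must be resident,
# even though only ~12.9B parameters are active for any given token.
total_params  = 46.7e9
active_params = 12.9e9
bytes_per_weight = 0.5                                   # rough Q4_K_M average

resident = total_params * bytes_per_weight / GIB         # ~21.7 GiB actually loaded
active_only = active_params * bytes_per_weight / GIB     # ~6 GiB -- NOT what must fit in VRAM
print(f"loaded: {resident:.1f} GiB, active-only (misleading): {active_only:.1f} GiB")
```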

Reference Examples (Verified)

Model | Quant | Context | Weights | KV Cache | Total | Real-World
Llama 3.1 8B | Q4_K_M | 4,096 | 4.01 GB | 0.50 GB | 4.96 GB | ~4.8 GB
Qwen 2.5 7B | Q4_K_M | 4,096 | 3.80 GB | 0.22 GB | 4.42 GB | ~4.2 GB
Llama 3.3 70B | Q4_K_M | 4,096 | 35.3 GB | 1.25 GB | 40.6 GB | ~39 GB
Mixtral 8×7B | Q4_K_M | 4,096 | 23.4 GB | 0.50 GB | 26.5 GB | ~25 GB
Qwen 2.5 32B | Q4_K_M | 4,096 | 16.3 GB | 0.50 GB | 18.5 GB | ~18 GB
DeepSeek R1 | Q4_K_M | 4,096 | 335.5 GB | 0.48 GB | 370.6 GB | N/A*

* DeepSeek R1 uses Multi-head Latent Attention (MLA) which compresses the KV cache significantly. The listed KV cache for DeepSeek uses the standard formula; actual MLA KV cache is much smaller (~95% compression). Real-world measurement not available due to extreme hardware requirements.

Frequently Asked Questions

What is the KV cache and why does it grow with context?
During auto-regressive generation, the model stores Key (K) and Value (V) tensors for every processed token to avoid re-computing them. For each new token, only the new K/V pair is computed and appended to the cache. The cache grows linearly at a fixed rate per token, determined by: 2 × layers × kv_heads × head_dim × 2 bytes. At 128K context for a 70B model, the KV cache alone can exceed 40 GB.
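Plugging in the usual published values for a Llama-70B-class model (80 layers, 8 KV heads, head dimension 128, treated here as assumptions) shows how quickly that adds up:

```python
# Per-token KV cost: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (FP16)
per_token_bytes = 2 * 80 * 8 * 128 * 2            # 327,680 bytes (~320 KiB) per token
at_128k_gib = per_token_bytes * 131_072 / 1024**3
print(f"{at_128k_gib:.1f} GiB")                   # 40.0 GiB of KV cache alone at 128K context
```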
Why does quantization reduce VRAM for weights but not the KV cache?
Quantization compresses model weights because they are static and loaded once. The KV cache, however, is dynamic: new entries are added at every token and are typically computed in FP16 (or FP8 on newer hardware) for numerical stability during attention. By default, the KV cache is stored at full precision regardless of the weight quantization format, though some backends offer optional KV-cache quantization at a small quality cost.
Can I offload layers to system RAM (CPU) to use a larger model?
Yes. GGUF/llama.cpp supports partial GPU offloading with the --n-gpu-layers flag. This calculator shows the total memory required — if you only offload N layers to GPU, the GPU VRAM usage scales proportionally. The remaining layers run on CPU at reduced speed (typically 10-50% of GPU throughput). Use a GPU with enough VRAM for the best experience.
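A rough way to estimate the GPU-resident share when only some layers are offloaded, assuming weights and KV cache split in proportion to the number of layers on the GPU (a simplification; the exact split depends on the backend and which tensors stay on the CPU):

```python
def gpu_share_gib(weights_gib, kv_gib, n_gpu_layers, n_layers, overhead=0.10):
    """Approximate GPU VRAM when n_gpu_layers of n_layers run on the GPU."""
    frac = n_gpu_layers / n_layers
    return weights_gib * frac * (1 + overhead) + kv_gib * frac

# Half the layers of the 8B example from the top of the page on the GPU:
print(round(gpu_share_gib(3.73, 0.50, n_gpu_layers=16, n_layers=32), 2))   # ~2.3 GiB
```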
Why do MoE models like DeepSeek R1 require so much VRAM?
Mixture of Experts (MoE) models have multiple "expert" sub-networks. Only a subset of experts activates per token, but all experts must be loaded into memory. DeepSeek R1 has 256 experts × ~2.4B params each + shared parameters = 671B total. Even though only 37B are active per token, the full 671B must fit in VRAM.
Does batch size affect VRAM requirements?
Yes. Batch size > 1 multiplies the KV cache requirement (each sequence has its own KV cache) and increases scratchpad memory for parallel prompt processing. For llama.cpp, the KV cache portion is multiplied by batch size. For production serving (vLLM, TGI), PagedAttention reduces this overhead significantly.
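In terms of the formula above, the fixed cost stays constant while the KV term is multiplied by the number of parallel sequences. A minimal sketch, reusing the base and KV figures from the 8B example at the top of the page:

```python
def total_with_batch(fixed_gib, kv_per_seq_gib, batch_size):
    """Fixed cost (weights + overhead + scratchpad) plus one KV cache per sequence."""
    return fixed_gib + kv_per_seq_gib * batch_size

# 8B example: 4.15 GB base, 0.50 GB KV per 4K-token sequence, 4 parallel sequences
print(round(total_with_batch(4.15, 0.50, 4), 2))   # ~6.15 GiB
```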

Related Guides

Turn these VRAM numbers into a buying decision. Our in-depth GPU guides walk you through the best hardware for every budget and model size.


Last updated: April 28, 2026