Apr 5, 2026

KV Cache Explained: Why Context Length Eats Your VRAM

The KV cache is the hidden memory cost that catches everyone off guard. It can consume more VRAM than the model itself at long context lengths. Here is exactly how it works, how to calculate it, and how to keep it under control.

Andre
1.0

What is the KV cache and why does it exist

Transformer models use an attention mechanism where every new token must look at all previous tokens to decide what comes next. Without caching, generating the 1,000th token would require recomputing attention for tokens 1 through 999 — a cost that grows quadratically with context length. The KV cache solves this by storing the key (K) and value (V) projections for every token that has already been processed.

When the model generates token 1,001, it only needs to compute K and V for that single new token and append them to the cache. The attention query then reads the full cache in linear time. This trades memory for compute: instead of recomputing everything, you store it. For a visual walkthrough of the attention mechanism, Jay Alammar's Illustrated Transformer remains the clearest introduction.
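The append-and-reuse pattern can be sketched as a toy single-head decode loop (illustrative shapes only — real models stack many layers and heads, and the weight matrices here are random stand-ins, not a real model):

```python
import numpy as np

# Toy single-head attention decode loop with a KV cache.
rng = np.random.default_rng(0)
d = 16                       # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []    # grows by one entry per generated token

def decode_step(x):
    """Process one new token embedding x, reusing cached K/V."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # compute K/V for the NEW token only
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)    # (t, d): every key seen so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # linear in cache length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over past tokens
    return weights @ V       # attention output for the new token

for _ in range(5):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))  # 5 — one K (and one V) entry per processed token
```

Each step touches the projection matrices once for the new token; the only thing that grows with context is the stack of cached K/V rows.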

Key insight
The KV cache is the reason inference VRAM grows with context length. Model weights are fixed — the cache is what makes a 4 GB model balloon to 14 GB at 128K context.
2.0

The KV cache formula

The memory used by the KV cache follows a deterministic formula based on the model architecture. You can calculate it for any model if you know the number of layers, key-value heads, head dimension, and your target context length.

KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × bytes_per_elem × context_length × batch_size

Where:
• 2 = one K tensor + one V tensor per layer
• num_layers = transformer layers (e.g. 64 for Qwen 3 32B)
• num_kv_heads = KV heads after GQA sharing (e.g. 4 for Gemma 4)
• head_dim = dimension per head (e.g. 128)
• bytes_per_elem = 2 for FP16, 1 for Q8, 0.5 for Q4
• context_length = max tokens you plan to use
• batch_size = 1 for single-user inference

For Qwen 3 8B at FP16 with 8K context and batch size 1, this works out to roughly 1 GB. Scale that to 128K context on Qwen 3 32B and the cache alone can exceed 32 GB — more than the Q4 model weights themselves.
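The formula translates directly into a few lines of Python. The architecture numbers below (36 layers, 8 KV heads, head dim 128 for Qwen 3 8B) are illustrative — check the model's config file for the exact values:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   bytes_per_elem, context_length, batch_size=1):
    """KV cache size: one K and one V tensor per layer, each
    (num_kv_heads * head_dim) elements wide per cached token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * context_length * batch_size)

# Qwen 3 8B-style architecture at FP16 with 8K context (assumed
# values: 36 layers, 8 KV heads, head_dim 128)
size = kv_cache_bytes(36, 8, 128, bytes_per_elem=2, context_length=8192)
print(f"{size / 1e9:.2f} GB")  # ~1.21 GB
```

Swap in `bytes_per_elem=1` for a Q8 cache or `0.5` for Q4 to see the quantization savings discussed later.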

3.0

Real KV cache sizes by model and context length

The table below shows KV cache sizes for popular 2026 models at FP16 precision with batch size 1. These are the numbers you need to add on top of model weights and overhead when budgeting your VRAM. For the full model weight VRAM numbers, see our complete VRAM reference table.

Model                        4K Context   8K Context   32K Context   128K Context
Qwen 3 4B                    ~0.3 GB      ~0.6 GB      ~2.4 GB       ~10 GB
Qwen 3 8B                    ~0.5 GB      ~1.0 GB      ~4.0 GB       ~16 GB
Qwen 3.5 9B                  ~0.5 GB      ~1.0 GB      ~4.0 GB       ~16 GB
Gemma 4 31B                  ~1.0 GB      ~2.0 GB      ~8.0 GB       ~32 GB
Qwen 3 32B                   ~1.0 GB      ~2.0 GB      ~8.0 GB       ~32 GB
Llama 3.3 70B                ~2.0 GB      ~4.0 GB      ~16 GB        ~64 GB
Qwen 3 235B-A22B (MoE)       ~0.6 GB      ~1.2 GB      ~4.8 GB       ~19 GB

Notice the MoE advantage in the last row: Qwen 3 235B-A22B routes each token through only 22B active parameters, and its shared attention layers — the part that generates K and V — are sized to match that active path. The KV cache therefore scales with the active architecture, not the full 235B of expert weights. This is why MoE models are so memory-efficient at inference despite their massive knowledge base. Compare this to Llama 3.3 70B (dense), where the cache at 128K context is 64 GB — more than many users have in total VRAM.

4.0

Grouped Query Attention changes the math

Not all models use the same KV cache size per parameter. Models with Grouped Query Attention (GQA) or Multi-Query Attention (MQA) share KV heads across query heads, dramatically reducing cache size. Most 2026 models use GQA — this is now the standard rather than the exception.

Model                          Query Heads   KV Heads   Sharing Ratio   Cache Savings
Qwen 3 8B                      32            8          4:1             ~75% vs MHA
Qwen 3 32B                     40            8          5:1             ~80% vs MHA
Gemma 4 31B                    16            4          4:1             ~75% vs MHA
Llama 3.3 70B                  64            8          8:1             ~87.5% vs MHA
Qwen 3.5 122B-A10B (MoE)       40            8          5:1             ~80% vs MHA
Mistral Small 3.1 24B          32            8          4:1             ~75% vs MHA

Without GQA, a 70B model would need 8× more KV cache memory. If you are comparing older models (pre-2024) that use full Multi-Head Attention against newer GQA models, expect the older ones to have significantly larger cache requirements. For a detailed breakdown, the GQA paper (Ainslie et al.) provides the architectural details.

5.0

Strategies to reduce KV cache memory

If your model fits in VRAM but the KV cache pushes you over the limit, you have several options that do not require buying a new GPU. Each trades some quality or speed for reduced memory consumption.

• Reduce context length: The most direct lever. If you do not need 128K context, cutting to 8K or 32K dramatically reduces cache size. Most tasks work well at 8K.
• KV cache quantization: Some inference engines (vLLM, TensorRT-LLM) support FP8 or INT4 KV cache, cutting cache memory by 50-75% with minimal quality loss.
• Sliding window attention: Mistral and some other models support fixed-size attention windows where older cache entries are discarded. The model retains recent context but forgets early tokens.
• Flash Attention: Does not reduce cache size, but reduces the compute cost of reading the cache. Speeds up inference at long contexts. Supported by Ollama and llama.cpp.
• MoE models: Models like Qwen 3 235B-A22B only activate a fraction of parameters per token, meaning the KV cache scales with active params, not total. This is a major architectural advantage for long-context inference.
• Multi-GPU tensor parallelism: Distributes the KV cache across multiple GPUs. If one 24 GB GPU cannot hold model + cache, two GPUs each hold half.
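Three of these levers compose cleanly in the size formula: quantization changes the bytes per element, a sliding window caps the number of cached tokens, and tensor parallelism divides the result across GPUs. A minimal sketch (architecture numbers are illustrative, not tied to a specific model):

```python
def effective_kv_bytes(num_layers, num_kv_heads, head_dim,
                       context_length, bytes_per_elem=2.0,
                       window=None, num_gpus=1):
    """Per-GPU KV cache after applying the levers above:
    cache quantization (bytes_per_elem), sliding-window attention
    (cache capped at `window` tokens), and tensor parallelism."""
    tokens = min(context_length, window) if window else context_length
    total = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens
    return total / num_gpus   # tensor parallelism splits the cache

base = effective_kv_bytes(64, 8, 128, 131072)      # FP16, full 128K context
trimmed = effective_kv_bytes(64, 8, 128, 131072,
                             bytes_per_elem=1,     # Q8 cache
                             window=32768)         # 32K sliding window
print(base / trimmed)  # 8.0 -> an 8x smaller cache
```

Note the quality trade-offs still apply — the formula only tells you the memory side of the bargain.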
6.0

Calculate your exact KV cache and total VRAM

The tables above give you estimates for planning. But every model has different architecture parameters, and your actual context length needs vary by task. Use the calculator to compute your exact numbers — enter your specific model, quantization, and context length to get both model weights and KV cache, plus which GPUs can handle your workload.

FAQ

Frequently Asked Questions

What is the KV cache in LLMs?
The KV cache stores the key and value tensors from the attention mechanism for all previous tokens. During autoregressive generation, the model reuses these cached values instead of recomputing attention for the entire context, trading memory for compute.
How much VRAM does the KV cache use?
It depends on model architecture, context length, and batch size. For Qwen 3 8B at FP16 with 8K context, the KV cache uses roughly 1 GB. For Qwen 3 32B at 128K context, it can exceed 32 GB — more than the model weights themselves.
Does quantization reduce KV cache size?
Not by default. Most inference frameworks keep the KV cache in FP16 regardless of model quantization. Some implementations support KV cache quantization (Q8 or Q4), which can reduce cache size by 50-75% with minimal quality impact.
How do MoE models affect KV cache?
An MoE model's KV cache comes from its shared attention layers, which every expert reuses — so it roughly matches the cache of a dense model with equivalent active parameters, regardless of expert count. Qwen 3 235B-A22B has a KV cache proportional to its 22B active params, not the full 235B.

