KV Cache Explained: Why Context Length Eats Your VRAM
The KV cache is the hidden memory cost that catches everyone off guard. It can consume more VRAM than the model itself at long context lengths. Here is exactly how it works, how to calculate it, and how to keep it under control.

What is the KV cache and why does it exist
Transformer models use an attention mechanism where every new token must look at all previous tokens to decide what comes next. Without caching, generating the 1,000th token would require recomputing attention for tokens 1 through 999 — a cost that grows quadratically with context length. The KV cache solves this by storing the key (K) and value (V) projections for every token that has already been processed.
When the model generates token 1,001, it only needs to compute K and V for that single new token and append them to the cache. The attention query then reads the full cache in linear time. This trades memory for compute: instead of recomputing everything, you store it. For a visual walkthrough of the attention mechanism, Jay Alammar's Illustrated Transformer remains the clearest introduction.
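To make the append-then-attend pattern concrete, here is a minimal NumPy sketch of single-head attention decoding with a growing KV cache. The dimensions and weight matrices are toy placeholders rather than values from any real model; production engines run this per layer and per KV head, but the shape of the loop is the same.

```python
import numpy as np

d = 64                                    # toy head dimension (placeholder, not from a real model)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per processed token

def decode_step(x):
    """Attend over everything cached so far, given the newest token's hidden state x."""
    q = x @ W_q                           # query is computed for the new token only
    k_cache.append(x @ W_k)               # append this token's K projection to the cache
    v_cache.append(x @ W_v)               # append this token's V projection to the cache
    K, V = np.stack(k_cache), np.stack(v_cache)   # (tokens_so_far, d)
    scores = K @ q / np.sqrt(d)           # one dot product per cached token: linear in sequence length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over all cached positions
    return weights @ V                    # attention output for the new token

for _ in range(5):
    decode_step(rng.standard_normal(d))
print(f"cache holds K and V for {len(k_cache)} tokens, {d} values each")
```

Each step adds one K row and one V row and does a dot product against every cached position, which is the linear-time read (and the steadily growing memory bill) described above.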
The KV cache formula
The memory used by the KV cache follows a deterministic formula based on the model architecture. You can calculate it for any model if you know the number of layers, key-value heads, head dimension, and your target context length.
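At a given precision, the per-request cost is:

KV cache bytes ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element × batch_size

where the leading 2 accounts for storing both K and V. The helper below is a small sketch of that arithmetic; the architecture values in the example call (36 layers, 8 KV heads, head dimension 128 for Qwen 3 8B) are representative figures, so treat the output as an estimate rather than a measurement.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float = 2.0,
                batch_size: int = 1) -> float:
    """Estimate KV cache size in GB.

    bytes_per_elem: 2.0 for an FP16/BF16 cache, 1.0 for FP8, 0.5 for INT4.
    The leading factor of 2 accounts for storing both K and V.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * bytes_per_elem * batch_size)
    return total_bytes / 1e9


# Qwen 3 8B at 8K context, FP16 cache, batch size 1
# (representative architecture: 36 layers, 8 KV heads, head_dim 128).
print(f"{kv_cache_gb(36, 8, 128, 8_192):.2f} GB")   # ~1.2 GB
```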
For Qwen 3 8B at FP16 with 8K context and batch size 1, this works out to roughly 1.2 GB. Scale that to 128K context on Qwen 3 32B and the cache alone can exceed 32 GB — more than the Q4 model weights themselves.
Real KV cache sizes by model and context length
The table below shows KV cache sizes for popular 2026 models at FP16 precision with batch size 1. These are the numbers you need to add on top of model weights and overhead when budgeting your VRAM. For the full model weight VRAM numbers, see our complete VRAM reference table.
| Model | 4K Context | 8K Context | 32K Context | 128K Context |
|---|---|---|---|---|
| Qwen 3 4B | ~0.3 GB | ~0.6 GB | ~2.4 GB | ~10 GB |
| Qwen 3 8B | ~0.5 GB | ~1.0 GB | ~4.0 GB | ~16 GB |
| Qwen 3.5 9B | ~0.5 GB | ~1.0 GB | ~4.0 GB | ~16 GB |
| Gemma 4 31B | ~1.0 GB | ~2.0 GB | ~8.0 GB | ~32 GB |
| Qwen 3 32B | ~1.0 GB | ~2.0 GB | ~8.0 GB | ~32 GB |
| Llama 3.3 70B | ~2.0 GB | ~4.0 GB | ~16 GB | ~64 GB |
| Qwen 3 235B-A22B (MoE) | ~0.6 GB | ~1.2 GB | ~4.8 GB | ~19 GB |
Notice the MoE advantage in the last row: Qwen 3 235B-A22B activates only 22B parameters per token, and the KV cache stores only the K and V projections from its shared attention layers, which are configured like those of a much smaller dense model rather than a full 235B one. This is why MoE models are so memory-efficient at inference despite their massive knowledge base. Compare this to Llama 3.3 70B (dense), where the cache at 128K context is 64 GB — more than many users have in total VRAM.
Grouped Query Attention changes the math
Not all models use the same KV cache size per parameter. Models with Grouped Query Attention (GQA) or Multi-Query Attention (MQA) share KV heads across query heads, dramatically reducing cache size. Most 2026 models use GQA — this is now the standard rather than the exception.
| Model | Query Heads | KV Heads | Sharing Ratio | Cache Savings |
|---|---|---|---|---|
| Qwen 3 8B | 32 | 8 | 4:1 | ~75% vs MHA |
| Qwen 3 32B | 40 | 8 | 5:1 | ~80% vs MHA |
| Gemma 4 31B | 16 | 4 | 4:1 | ~75% vs MHA |
| Llama 3.3 70B | 64 | 8 | 8:1 | ~87.5% vs MHA |
| Qwen 3.5 122B-A10B (MoE) | 40 | 8 | 5:1 | ~80% vs MHA |
| Mistral Small 3.1 24B | 32 | 8 | 4:1 | ~75% vs MHA |
Without GQA, a 70B model would need 8× more KV cache memory. If you are comparing older models (pre-2024) that use full Multi-Head Attention against newer GQA models, expect the older ones to have significantly larger cache requirements. For a detailed breakdown, the GQA paper (Ainslie et al.) provides the architectural details.
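The arithmetic behind those savings is easy to check by hand: the only term in the KV cache formula that changes between full multi-head attention and GQA is the number of KV heads. The configuration below (80 layers, 64 query heads, head dimension 128) is an illustrative 70B-class setup, not the published spec of any particular model.

```python
# Illustrative 70B-class attention config: 80 layers, 64 query heads, head_dim 128, FP16 cache.
layers, head_dim, fp16_bytes = 80, 128, 2

per_token_mha_kb = 2 * layers * 64 * head_dim * fp16_bytes / 1024  # full MHA: one KV head per query head
per_token_gqa_kb = 2 * layers * 8 * head_dim * fp16_bytes / 1024   # GQA: 8 KV heads shared across 64 query heads

print(per_token_mha_kb, per_token_gqa_kb)        # 2560.0 KB vs 320.0 KB of cache per token
print(1 - per_token_gqa_kb / per_token_mha_kb)   # 0.875 -> the ~87.5% savings shown in the table
```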
Strategies to reduce KV cache memory
If your model fits in VRAM but the KV cache pushes you over the limit, you have several options that do not require buying a new GPU. Each trades some quality or speed for reduced memory consumption.
- Reduce context length: The most direct lever. If you do not need 128K context, cutting to 8K or 32K dramatically reduces cache size. Most tasks work well at 8K.
- KV cache quantization: Some inference engines (vLLM, TensorRT-LLM) support FP8 or INT4 KV cache, cutting cache memory by 50-75% with minimal quality loss; see the sketch after this list.
- Sliding window attention: Mistral and some other models support fixed-size attention windows where older cache entries are discarded. The model retains recent context but forgets early tokens.
- Flash Attention: Does not reduce cache size, but avoids materializing the full attention matrix, which speeds up inference at long contexts. Supported by Ollama and llama.cpp.
- MoE models: Models like Qwen 3 235B-A22B activate only a fraction of their parameters per token, and the KV cache is sized by the shared attention layers rather than the total parameter count. This is a major architectural advantage for long-context inference.
- Multi-GPU tensor parallelism: Distributes the KV cache across multiple GPUs. If one 24 GB GPU cannot hold model + cache, two GPUs each hold half.
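As a concrete example of the KV cache quantization lever, here is a minimal sketch of how an FP8 cache is typically requested in vLLM. The model name and context limit are placeholders, and supported cache dtypes depend on your vLLM version and GPU, so check the engine's documentation before relying on it.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",           # placeholder model; substitute your own
    max_model_len=32_768,            # cap the context window to bound cache growth
    kv_cache_dtype="fp8",            # store K/V in FP8: roughly half the cache memory of FP16
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM is allowed to claim
)

outputs = llm.generate(
    ["Explain the KV cache in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```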
Calculate your exact KV cache and total VRAM
The tables above give you estimates for planning. But every model has different architecture parameters, and your actual context length needs vary by task. Use the calculator to run your exact numbers — enter your specific model, quantization, and context length to get both model weights and KV cache, plus which GPUs can handle your workload.
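If you want a quick offline estimate before opening the calculator, the rule-of-thumb sketch below adds a rough weights figure (parameter count times bits per weight) to the KV cache formula and a flat allowance for runtime overhead. The 4.5 bits-per-weight value for Q4-class quantization and the 1.5 GB overhead are assumptions, not measurements, so expect the calculator to be more precise.

```python
def total_vram_gb(params_b: float, bits_per_weight: float,
                  num_layers: int, num_kv_heads: int, head_dim: int,
                  context_len: int, kv_bytes_per_elem: float = 2.0,
                  overhead_gb: float = 1.5) -> float:
    """Rule-of-thumb VRAM budget in GB: weights + KV cache + runtime overhead.

    params_b is the parameter count in billions; bits_per_weight depends on
    the quantization (about 4.5 for Q4-class formats, 16 for FP16).
    """
    weights_gb = params_b * bits_per_weight / 8        # rough weights footprint
    kv_gb = (2 * num_layers * num_kv_heads * head_dim
             * context_len * kv_bytes_per_elem) / 1e9  # same formula as the helper above
    return weights_gb + kv_gb + overhead_gb


# Example: a Q4-quantized 8B model (36 layers, 8 KV heads, head_dim 128) at 32K context.
print(f"{total_vram_gb(8, 4.5, 36, 8, 128, 32_768):.1f} GB")   # roughly 11 GB all in
```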
Frequently Asked Questions
What is the KV cache in LLMs?
It is the stored key and value projections for every token the model has already processed, so generating each new token only requires computing K and V for that token instead of re-running attention over the entire prefix.

How much VRAM does the KV cache use?
It depends on the model's layer count, KV heads, head dimension, your context length, and the cache precision. As a reference point, an 8B-class GQA model needs roughly 1 GB at 8K context in FP16, while a dense 70B-class model at 128K context can need tens of gigabytes.

Does quantization reduce KV cache size?
Quantizing the model weights does not; the cache is stored separately at its own precision. KV cache quantization (FP8 or INT4), supported by engines such as vLLM and TensorRT-LLM, cuts cache memory by roughly 50-75%.

How do MoE models affect KV cache?
MoE models route each token through only a fraction of their parameters, and the cache is determined by the shared attention layers rather than the expert weights, so models like Qwen 3 235B-A22B need far less cache than their total parameter count would suggest.