
LLM VRAM Calculator

Estimate GPU memory requirements for running large language models locally. Select a model, choose your quantization, and dial in your context length to see exactly how much VRAM you need — plus which GPUs can handle it.


Example configuration (calculator defaults)

  • Model: 8.0B params · 32 layers · d=4096 · 8 KV heads · max 131,072 ctx
  • Quantization: 4-bit (Medium), roughly 97% of FP16 quality
  • Context length: 4,096 tokens (max: 131,072)
  • Batch size: batch > 1 multiplies the KV cache for parallel sequences, and prompt processing at batch > 1 requires additional scratch memory.
  • Overhead: the default 10% covers cuBLAS buffers and workspace; increase it for multi-GPU or unusual setups.

VRAM Estimate
Total VRAM Required
4.65 GB
Base: 4.15 GB  +  KV Cache: 0.50 GB
Per-token KV cost: 0.12 GB/1K tokens
Memory Usage vs Top GPU: 4.65 GB of 16 GB
Memory Breakdown
Model Weights: 3.74 GB
8.0B params × 0.5 bytes/weight
KV Cache: 0.50 GB
4,096 tokens · 125.00 MB per 1K tokens
System Overhead: 0.37 GB
10% of model weights (cuBLAS + workspace)
Scratchpad: 0.04 GB
~1% of weights (temporary tensors)
Formula (Standard): per-token KV = 2 × L × (d / g) × b_kv
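
To see where these numbers come from, the breakdown above can be reproduced in a few lines. A minimal sketch, assuming ~0.5 bytes per weight for the 4-bit quant, the default 10% overhead and ~1% scratchpad, and the 8 KV heads × 128 head dimension layout of Llama 3.1 8B (the formulas themselves are explained further down the page):

```python
# Reproduce the example estimate: 8.0B params, 4-bit (~0.5 bytes/weight),
# 32 layers, 8 KV heads, head_dim 128, 4,096-token context.
GIB = 1024**3

weights  = 8.0e9 * 0.5 / GIB                     # ~3.73 GiB of model weights
overhead = 0.10 * weights                        # cuBLAS buffers + workspace
scratch  = 0.01 * weights                        # temporary tensors
kv_cache = 2 * 32 * 8 * 128 * 2 * 4096 / GIB     # 2 * L * kv_heads * head_dim * b_kv * C = 0.5 GiB

total = weights + overhead + scratch + kv_cache
print(f"{total:.2f} GiB")                        # ~4.64, matching the ~4.65 GB shown above
```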
Recommended GPUs

GPUs with ≥ 5 GB VRAM

Radeon RX 7600 (AMD): 8 GB GDDR6 · Boost Clock 2,655 MHz · $259.99
GeForce RTX 5060 (NVIDIA): 8 GB GDDR7 · Boost Clock 2,497 MHz · $299.99
GeForce RTX 4060 (NVIDIA): 8 GB GDDR6 · Boost Clock 2,460 MHz · $299.99
Radeon RX 9060 XT (AMD): 8 GB GDDR6 · Boost Clock 2,690 MHz · $349.99
GeForce RTX 4060 Ti (NVIDIA): 8 GB GDDR6 · Boost Clock 2,535 MHz · $399.99
Radeon RX 7800 XT (AMD): 16 GB GDDR6 · Boost Clock 2,430 MHz · $499.99
GeForce RTX 5070 (NVIDIA): 12 GB GDDR7 · Boost Clock 2,512 MHz · $549.99
Radeon RX 9070 (AMD): 16 GB GDDR6 · Boost Clock 2,700 MHz · $549.99

How Does Local LLM VRAM Calculation Work?

This calculator uses an empirical formula derived from real-world testing with the llama.cpp inference backend. VRAM usage consists of two distinct components: a fixed cost (model weights, CUDA overhead, scratchpad) and a variable cost that grows linearly with context length (KV cache).

Learn more: What is a Large Language Model? (Wikipedia) · GPU Memory & LLM Inference Explained (BentoML)

1. Fixed Cost (Constant)

  • Model Weights: The quantized parameters stored in GPU memory.
  • CUDA Overhead: ~10% of model weights for cuBLAS buffers, workspace, and driver allocations.
  • Scratchpad Memory: ~1% of model weights for temporary tensors and activations during inference.

2. Variable Cost (Linear with Context)

  • KV Cache: Grows linearly with every token in your prompt + generated output.
  • Per-token cost: Depends on the model's layer count, KV head count, and head dimension — not on total parameters.
  • Context Impact: Long conversations or documents can consume more VRAM than the model weights themselves.

Key Insight
The KV cache, not model size, often becomes the limiting factor for long contexts. Understanding this helps you optimize your hardware choices and usage patterns — a 7B model at 128K context can require more VRAM than a 32B model at 4K context.
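
To make that concrete, here is a rough sketch of the KV cache for an 8B Llama-style model at full 128K context (32 layers, 8 KV heads and a head dimension of 128 are typical values for this size class and are assumptions here, not read from a specific checkpoint); the ~18.5 GB figure for a 32B model at 4K context comes from the reference table further down the page:

```python
GIB = 1024**3

def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):
    """KV cache in GiB: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / GIB

# ~8B Llama-style model at 128K context: the KV cache alone is ~16 GiB,
# plus ~4-5 GB of 4-bit weights, so ~20 GB or more in total.
print(f"{kv_cache_gib(32, 8, 128, 131_072):.1f} GiB")

# A 32B model at 4K context needs roughly 18.5 GB total (see the reference table),
# so the small model with the huge context really is the bigger VRAM problem.
```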

Precise VRAM Formulas for Local LLMs

1. Standard Architecture (Llama, Mistral, Qwen 2.5, Gemma, Phi)

VRAM [GB] = (P × 10⁹ × B) ⁄ (1024³) × (1 + O + S)  +  (2 × L × KV_heads × head_dim × b_kv × C) ⁄ (1024³)
P = params in billions
B = bytes per weight (≈0.5 for 4-bit quants)
O = overhead fraction (default 0.10, i.e. 10%)
S = scratchpad fraction (≈0.01, i.e. ~1%)
L = transformer layers
KV_heads = GQA KV heads
head_dim = KV head dimension
b_kv = 2 bytes (FP16)
C = context length in tokens

KV_heads × head_dim is equivalent to d/g, where g = attention_heads / kv_heads (the GQA factor). For Llama 3.1 8B: g = 32/8 = 4, so d/g = 4096/4 = 1024, matching kv_heads × head_dim = 8 × 128 = 1024.
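
Written out as code, the standard formula is only a few lines. This is an illustrative sketch, not the calculator's actual implementation; the ~0.5 bytes-per-weight figure for 4-bit quants and the function name are assumptions:

```python
GIB = 1024**3

def vram_estimate_gib(params_b, bytes_per_weight, layers, kv_heads, head_dim,
                      ctx, overhead=0.10, scratch=0.01, kv_bytes=2):
    """Standard-architecture estimate (Llama, Mistral, Qwen 2.5, Gemma, Phi), in GiB."""
    weights = params_b * 1e9 * bytes_per_weight / GIB
    fixed = weights * (1 + overhead + scratch)                      # weights + overhead + scratchpad
    kv = 2 * layers * kv_heads * head_dim * kv_bytes * ctx / GIB    # grows linearly with ctx
    return fixed + kv

# Llama 3.1 8B, ~4-bit weights, 4,096-token context -> ~4.6 GiB
print(round(vram_estimate_gib(8.0, 0.5, 32, 8, 128, 4096), 2))
```

Because KV_heads and head_dim are passed in directly, the same sketch covers any GQA model without computing d/g explicitly.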

2. Decoupled Head Dimension (Qwen 3 32B)

VRAM [GB] = (P × 10⁹ × B) ⁄ (1024³) × (1 + O + S)  +  (2 × L × n_kv × head_dim_kv × b_kv × C) ⁄ (1024³)

Used for architectures where the KV head dimension is set independently of the model dimension, so head_dim cannot be derived as d / attention_heads and the d/g shortcut does not apply. Qwen 3 32B uses this kind of configuration, so its KV cache calculation differs from the standard d/g formula; Qwen 3 8B and other models in the family use the standard formula.
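
Since the sketch above already takes the KV head count and head dimension as explicit inputs, the decoupled case needs no separate code path: read n_kv and head_dim_kv from the model's published configuration instead of deriving the head dimension from d. The values below are placeholders for illustration, not Qwen 3 32B's actual configuration:

```python
# Hypothetical decoupled-head model: head_dim_kv is set independently of d / attention_heads.
cfg = {"layers": 64, "n_kv": 8, "head_dim_kv": 128}   # placeholder values
print(round(vram_estimate_gib(32.0, 0.5,
                              cfg["layers"], cfg["n_kv"], cfg["head_dim_kv"],
                              ctx=4096), 1))
```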

3. MoE Models (Mixtral, DeepSeek, Llama 4)

All expert parameters must be loaded into VRAM, not just the active subset. For example, Mixtral 8×7B has 46.7B total parameters (all 8 experts), even though only 12.9B are active per token. DeepSeek R1/V3 loads all 671B parameters across 256 experts (37B active). The calculator uses total parameters for the model weights calculation.
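
In code terms, the only MoE-specific rule is to feed total parameters, not active parameters, into the weights term. A rough sketch (~0.5 bytes per weight is an approximation; the reference table below uses a slightly higher effective bytes-per-weight for Q4_K_M, hence its 23.4 GB figure):

```python
GIB = 1024**3

# Mixtral 8x7B: all 8 experts (46.7B total params) must be resident,
# even though only ~12.9B parameters are active for any given token.
total_params  = 46.7e9
active_params = 12.9e9
bytes_per_weight = 0.5                                   # rough Q4_K_M average

resident = total_params * bytes_per_weight / GIB         # ~21.7 GiB actually loaded
active_only = active_params * bytes_per_weight / GIB     # ~6 GiB -- NOT what must fit in VRAM
print(f"loaded: {resident:.1f} GiB, active-only (misleading): {active_only:.1f} GiB")
```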

Reference Examples (Verified)

Model | Quant | Context | Weights | KV Cache | Total | Real-World
Llama 3.1 8B | Q4_K_M | 4,096 | 4.01 GB | 0.50 GB | 4.96 GB | ~4.8 GB
Qwen 2.5 7B | Q4_K_M | 4,096 | 3.80 GB | 0.22 GB | 4.42 GB | ~4.2 GB
Llama 3.3 70B | Q4_K_M | 4,096 | 35.3 GB | 1.25 GB | 40.6 GB | ~39 GB
Mixtral 8×7B | Q4_K_M | 4,096 | 23.4 GB | 0.50 GB | 26.5 GB | ~25 GB
Qwen 2.5 32B | Q4_K_M | 4,096 | 16.3 GB | 0.50 GB | 18.5 GB | ~18 GB
DeepSeek R1 | Q4_K_M | 4,096 | 335.5 GB | 0.48 GB | 370.6 GB | N/A*

* DeepSeek R1 uses Multi-head Latent Attention (MLA) which compresses the KV cache significantly. The listed KV cache for DeepSeek uses the standard formula; actual MLA KV cache is much smaller (~95% compression). Real-world measurement not available due to extreme hardware requirements.

Frequently Asked Questions

What is the KV cache and why does it grow with context?
During auto-regressive generation, the model stores Key (K) and Value (V) tensors for every processed token to avoid re-computing them. For each new token, only the new K/V pair is computed and appended to the cache. The cache grows linearly at a fixed rate per token, determined by: 2 × layers × kv_heads × head_dim × 2 bytes. At 128K context for a 70B model, the KV cache alone can exceed 40 GB.
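Plugging in the usual published values for a Llama-70B-class model (80 layers, 8 KV heads, head dimension 128, treated here as assumptions) shows how quickly that adds up:

```python
# Per-token KV cost: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (FP16)
per_token_bytes = 2 * 80 * 8 * 128 * 2            # 327,680 bytes (~320 KiB) per token
at_128k_gib = per_token_bytes * 131_072 / 1024**3
print(f"{at_128k_gib:.1f} GiB")                   # 40.0 GiB of KV cache alone at 128K context
```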
Why does quantization reduce VRAM for weights but not the KV cache?
Quantization compresses model weights because they are static and loaded once. The KV cache, however, is dynamic: new entries are added at every token and are typically computed in FP16 (or FP8 on newer hardware) for numerical stability during attention. By default, the KV cache is stored at full precision regardless of the weight quantization format, though some backends offer optional KV-cache quantization at a small quality cost.
Can I offload layers to system RAM (CPU) to use a larger model?
Yes. GGUF/llama.cpp supports partial GPU offloading with the --n-gpu-layers flag. This calculator shows the total memory required — if you only offload N layers to GPU, the GPU VRAM usage scales proportionally. The remaining layers run on CPU at reduced speed (typically 10-50% of GPU throughput). Use a GPU with enough VRAM for the best experience.
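A rough way to estimate the GPU-resident share when only some layers are offloaded, assuming weights and KV cache split in proportion to the number of layers on the GPU (a simplification; the exact split depends on the backend and which tensors stay on the CPU):

```python
def gpu_share_gib(weights_gib, kv_gib, n_gpu_layers, n_layers, overhead=0.10):
    """Approximate GPU VRAM when n_gpu_layers of n_layers run on the GPU."""
    frac = n_gpu_layers / n_layers
    return weights_gib * frac * (1 + overhead) + kv_gib * frac

# Half the layers of the 8B example from the top of the page on the GPU:
print(round(gpu_share_gib(3.73, 0.50, n_gpu_layers=16, n_layers=32), 2))   # ~2.3 GiB
```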
Why do MoE models like DeepSeek R1 require so much VRAM?
Mixture of Experts (MoE) models have multiple "expert" sub-networks. Only a subset of experts activates per token, but all experts must be loaded into memory. DeepSeek R1 has 256 experts × ~2.4B params each + shared parameters = 671B total. Even though only 37B are active per token, the full 671B must fit in VRAM.
Does batch size affect VRAM requirements?
Yes. Batch size > 1 multiplies the KV cache requirement (each sequence has its own KV cache) and increases scratchpad memory for parallel prompt processing. For llama.cpp, the KV cache portion is multiplied by batch size. For production serving (vLLM, TGI), PagedAttention reduces this overhead significantly.
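In terms of the formula above, the fixed cost stays constant while the KV term is multiplied by the number of parallel sequences. A minimal sketch, reusing the base and KV figures from the 8B example at the top of the page:

```python
def total_with_batch(fixed_gib, kv_per_seq_gib, batch_size):
    """Fixed cost (weights + overhead + scratchpad) plus one KV cache per sequence."""
    return fixed_gib + kv_per_seq_gib * batch_size

# 8B example: 4.15 GB base, 0.50 GB KV per 4K-token sequence, 4 parallel sequences
print(round(total_with_batch(4.15, 0.50, 4), 2))   # ~6.15 GiB
```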

Related Guides

Turn these VRAM numbers into a buying decision. Our in-depth GPU guides walk you through the best hardware for every budget and model size.


Last updated: April 28, 2026