Apr 22, 2026

Can Your GPU Run It? VRAM Compatibility Checker for 80+ LLMs

Check whether your GPU has enough VRAM to run any major 2026 LLM. Cross-reference RTX 5090, RTX 4090, RTX 3090, RX 7900 XTX, and other popular cards against model requirements including Qwen 3, Gemma 4, Llama 4, and MoE models.

Andre
GPU · AI · LLMs

How to read these compatibility tables

The tables below show which models fit on each GPU at Q4_K_M quantization with a 4K context length. Fits means the model loads comfortably with headroom to spare. Tight means it technically fits but leaves less than 2 GB free for longer contexts. Offload means it exceeds the GPU's VRAM and requires partial CPU offloading, which slows inference significantly. For the exact VRAM numbers behind these compatibility ratings, see our complete VRAM reference table.

Methodology
All numbers assume Q4_K_M quantization, 4K context length, batch size 1, and llama.cpp/vLLM framework overhead of ~10-15%. Your actual mileage may vary by 1-2 GB depending on inference engine and settings. Use the VRAM Calculator for precise estimates.
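If you want to reproduce these estimates yourself, the sketch below shows the rough arithmetic: weights at a given quantization plus a flat allowance for 4K context and runtime buffers, then the same Fits / Tight / Offload rule the tables use. The bits-per-weight averages and the 1.5 GB context allowance are approximations chosen for illustration, not exact properties of any particular model or engine.

```python
# Rough sketch of the estimate behind the tables: weights at a given
# quantization plus a flat allowance for 4K context and runtime buffers.
# The bits-per-weight averages and the context allowance are approximations;
# real numbers vary by model architecture and inference engine.

BITS_PER_WEIGHT = {  # approximate averages for common llama.cpp quant types
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}

def estimate_vram_gb(params_billion: float, quant: str = "Q4_K_M",
                     context_overhead_gb: float = 1.5) -> float:
    """Weights plus a flat ~1.5 GB for 4K context and framework overhead."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + context_overhead_gb

def fit_status(required_gb: float, gpu_gb: float) -> str:
    """Mirror the table labels: Fits / Tight / Offload (2 GB headroom rule)."""
    if required_gb > gpu_gb:
        return "Offload"
    return "Fits" if gpu_gb - required_gb >= 2.0 else "Tight"

need = estimate_vram_gb(32)  # a dense 32B model at Q4_K_M
print(f"~{need:.0f} GB needed -> {fit_status(need, gpu_gb=24)} on a 24 GB card")
```

For Qwen 3 32B this lands within a couple of GB of the table's ~19 GB figure; anything that close to a card's limit is worth double-checking in the calculator.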

8 GB GPUs: RTX 4060, RTX 3060 Ti, RX 7600

8 GB is the entry point for local LLM inference. You can run 7B models comfortably and 8B models at Q4, which covers capable models like Qwen 3 8B and Llama 3.1 8B. The 2026 generation of MoE models also helps here — Gemma 4 E2B loads 5.1B total but only activates 2B, giving you excellent quality per GB. If you are buying a GPU specifically for LLMs, 8 GB should be considered the absolute minimum.

| Model | Params | VRAM Needed | 8 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3.5 2B | 2B | ~1.2 GB | Fits easily | Tiny but capable |
| Phi-4 Mini | 3.8B | ~2.5 GB | Fits easily | Plenty of context headroom |
| Qwen 3 4B | 4B | ~2.5 GB | Fits easily | Great for basic tasks |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | Fits easily | Only 2B active per token |
| Qwen 3 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Llama 3.1 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | Fits | Only 4B active per token |
| Phi-4 | 14B | ~8.5 GB | Tight / Offload | Needs Q3 or offloading |

12 GB GPUs: RTX 3060 12 GB, RTX 4070

The RTX 3060 12 GB is the best value entry point for local LLMs. For around $180 used, you get enough VRAM to run 14B models at Q4 comfortably. This is the tier where local LLMs start feeling genuinely useful for coding, writing, and analysis tasks. The 2026 MoE models also shine here — gpt-oss 20B MoE fits with room to spare despite packing flagship-class knowledge.

| Model | Params | VRAM Needed | 12 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB | Fits easily | Room for Q8 or long context |
| Nemotron-Nano 9B | 9B | ~5.5 GB | Fits | Strong reasoning at this size |
| Gemma 3 12B | 12B | ~7 GB | Fits | Can use Q5 for better quality |
| Qwen 3 14B | 14B | ~8.5 GB | Fits | ~3 GB context headroom |
| Phi-4 | 14B | ~8.5 GB | Fits | Strong reasoning at this size |
| gpt-oss 20B (MoE) | 20B | ~12 GB | Tight | Only 3.6B active, Q4 tight fit |
| Qwen 3 32B | 32B | ~19 GB | Offload | Need 24 GB GPU for Q4 |

16 GB GPUs: RTX 4070 Ti Super, RTX 5080, RX 9070 XT

16 GB is the tier where MoE models really start to flex. gpt-oss 20B MoE fits at Q4 with 4 GB headroom despite 20B total parameters. You can also run dense 14B models at Q5 for near-native quality. This is also where AMD GPUs become competitive — the RX 9070 XT and RX 7800 XT both offer 16 GB at lower prices than NVIDIA equivalents, though with software ecosystem tradeoffs covered in our AMD vs NVIDIA comparison.

| Model | Params | VRAM Needed | 16 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB (Q8: ~9 GB) | Fits Q8 | Near-native quality |
| Qwen 3 14B | 14B | ~8.5 GB (Q5: ~10 GB) | Fits Q5 | Great quality/VRAM balance |
| gpt-oss 20B (MoE) | 20B | ~12 GB (Q4) | Fits | Only 3.6B active per token |
| Mistral Small 3.1 24B | 24B | ~14 GB (Q4) | Fits tight | Short context only |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits tight | Only 4B active, Q4 tight |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Offload | Need Q3 (~12 GB) to fit |

24 GB GPUs: RTX 4090, RTX 3090, RX 7900 XTX

24 GB is the sweet spot for single-GPU local LLM inference. This is the tier where Qwen 3 32B and Gemma 4 31B at Q4 run with comfortable context headroom. MoE models like Qwen 3 30B-A3B fit easily with lots of room for long context. Both the RTX 4090 and the used RTX 3090 sit here — see our 3090 vs midrange comparison for why the used 3090 at $450 is often the smarter buy for LLM workloads.

| Model | Params | VRAM Needed | 24 GB Status | Notes |
|---|---|---|---|---|
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits easily | Only 4B active, lots of headroom |
| Qwen 3.5 27B | 27B | ~16 GB (Q4) | Fits comfortably | ~7 GB context headroom |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB (Q4) | Fits | Only 3B active, ~5 GB headroom |
| Gemma 4 31B | 31B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits tight | Great quality, short context |
| Command R | 35B | ~21 GB (Q4) | Fits | ~2 GB context headroom |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | Need Q2 (~20 GB) for partial fit |
| Qwen 2.5 72B | 72B | ~40 GB (Q4) | Offload | Same as 70B; dual GPU recommended |

The standout here is Qwen 3 32B at Q4_K_M. It fits in 19 GB, runs fast on a single GPU, and benchmarks close to GPT-4 class output for many tasks. If you have a 24 GB GPU, this should be your daily driver model. The MoE models (Qwen 3 30B-A3B, Gemma 4 26B-A4B) are even more compelling if you prioritize inference speed — they activate only 3-4B parameters per token.
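To put numbers on that tradeoff, here is a back-of-the-envelope comparison using the figures from the table above. It assumes per-token compute scales roughly with active parameters while VRAM scales with total parameters; that ignores attention cost, routing overhead, and memory-bandwidth limits, so treat the ratio as directional rather than a measured speedup.

```python
# Back-of-the-envelope: why a 30B MoE can be much cheaper per token than a
# 32B dense model. VRAM tracks total parameters (all experts stay resident);
# per-token compute roughly tracks active parameters (only routed experts run).
# Figures are taken from the 24 GB table above.

models = {
    # name:                  (total params B, active params B, ~VRAM GB at Q4)
    "Qwen 3 32B (dense)":    (32, 32, 19),
    "Qwen 3 30B-A3B (MoE)":  (30, 3, 18),
}

_, dense_active, _ = models["Qwen 3 32B (dense)"]
for name, (total, active, vram) in models.items():
    ratio = active / dense_active  # crude: per-token FLOPs scale with active params
    print(f"{name}: {total}B total, ~{vram} GB VRAM, "
          f"~{ratio:.2f}x the dense model's per-token compute")
```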


32 GB GPUs: RTX 5090

The RTX 5090 is the first consumer GPU to break the 24 GB barrier. The extra 8 GB opens the door to running 70B models at Q3 (which fits in ~30 GB) and running 32B models at Q5 or Q8 with full context headroom. For the full breakdown of whether the premium over an RTX 4090 is worth it for LLM workloads, see our RTX 5090 vs 4090 comparison.
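Plugging the table's VRAM figures into the same 2 GB headroom rule used throughout this article shows concretely what the extra 8 GB buys over a 24 GB card; the `fit_status()` helper below repeats the rule from the methodology sketch so the snippet runs on its own.

```python
# What the extra 8 GB buys: the same models checked against a 24 GB and a
# 32 GB card, using the VRAM figures from the tables above and the same
# 2 GB headroom rule as the methodology sketch.

def fit_status(required_gb: float, gpu_gb: float) -> str:
    if required_gb > gpu_gb:
        return "Offload"
    return "Fits" if gpu_gb - required_gb >= 2.0 else "Tight"

checks = [
    ("Llama 3.3 70B (Q3)", 30),
    ("Qwen 3 32B (Q5)", 24),
    ("Qwen 3 32B (Q8)", 34),
]
for name, need_gb in checks:
    print(f"{name}: 24 GB -> {fit_status(need_gb, 24)}, "
          f"32 GB -> {fit_status(need_gb, 32)}")
```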

| Model | Params | VRAM Needed | 32 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits | 8 GB context headroom |
| Qwen 3 32B | 32B | ~34 GB (Q8) | Offload | Nearly fits at Q8 |
| Llama 3.3 70B | 70B | ~30 GB (Q3) | Fits tight | First single-GPU 70B option |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | ~6 GB short of full Q4 |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB (Q4) | Offload | Need multi-GPU |
| Command R | 35B | ~21 GB (Q4) | Fits easily | 11 GB headroom for context |

AMD GPU compatibility notes

AMD GPUs with ROCm support can run most LLMs via Ollama and llama.cpp. The RX 7900 XTX (24 GB) has the same VRAM capacity as an RTX 4090, so the model compatibility tables above apply identically. However, there are ecosystem differences to be aware of when choosing between AMD and NVIDIA for LLM workloads.

- ROCm vs CUDA: Most inference frameworks support both, but CUDA gets updates first and has better documentation. See our AMD GPU guide for the current state of ROCm support.
- Flash Attention: AMD supports Flash Attention on RDNA3 GPUs via ROCm, but performance is typically 10-20% behind NVIDIA's implementation.
- Multi-GPU: AMD's multi-GPU support for LLM inference is less mature than NVIDIA's NCCL-based tensor parallelism. If you plan to run dual GPUs, NVIDIA is the safer choice.
- vLLM support: vLLM supports AMD GPUs but may lag behind CUDA versions by weeks for new features. llama.cpp and Ollama have more parity between AMD and NVIDIA.
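A quick way to confirm which backend your local PyTorch build is actually using, and which GPU it sees, is a check like the minimal sketch below; it assumes a PyTorch build with either CUDA or ROCm support installed.

```python
# Quick check: is this PyTorch build using CUDA or ROCm, and which GPU does it
# see? torch.version.hip is a version string on ROCm builds and None on CUDA builds.
import torch

if not torch.cuda.is_available():
    print("No supported GPU visible to PyTorch")
else:
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    props = torch.cuda.get_device_properties(0)
    print(f"{backend} build, device: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GB VRAM")
```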

Check your exact GPU-model compatibility

The tables above cover the most common GPU-model combinations at Q4_K_M. But your specific setup may use a different quantization level, context length, or a less common model. The VRAM Calculator lets you check your exact setup: enter your GPU, pick your model, set quantization and context length, and get a VRAM breakdown with a pass/fail verdict for your hardware.
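If you prefer to script the check instead, the sketch below compares your card's currently free VRAM against a model's rough requirement. It assumes a working PyTorch install, and the bits-per-weight figure and context allowance are the same approximations used in the methodology sketch, so leave a margin before trusting a borderline result.

```python
# Compare the GPU's currently free VRAM against a model's rough requirement.
# Assumes PyTorch is installed; the bits-per-weight figure is an approximation
# for Q4_K_M-style quantization, and ~1.5 GB is allowed for 4K context/overhead.
import torch

def can_run(params_billion: float, bits_per_weight: float = 4.8,
            context_overhead_gb: float = 1.5) -> bool:
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # device 0 by default
    free_gb = free_bytes / 1024**3
    need_gb = params_billion * bits_per_weight / 8 + context_overhead_gb
    print(f"Need ~{need_gb:.0f} GB, {free_gb:.1f} GB free "
          f"of {total_bytes / 1024**3:.1f} GB total")
    return need_gb <= free_gb

can_run(14)  # e.g. a dense 14B model at roughly Q4_K_M
```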


Frequently Asked Questions

Can an RTX 4090 run Llama 3.3 70B?
Not fully in VRAM at Q4_K_M. Llama 3.3 70B at Q4 needs ~38-40 GB VRAM, and the RTX 4090 has 24 GB. You can run it with CPU offloading via llama.cpp or Ollama, but inference speed drops significantly. Two RTX 4090s, or an RTX 5090 running a Q3 quant, are the practical alternatives.
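If you do want to try offloading rather than waiting on more VRAM, here is a minimal sketch using the ollama Python package. The model tag and the num_gpu value are placeholders, not recommendations; lower num_gpu until the model stops running out of memory, and expect a large slowdown compared with a full-VRAM fit.

```python
# Partial CPU offload with Ollama: cap how many layers go to the GPU and let the
# rest run on the CPU. The model tag and num_gpu value are placeholders; adjust
# both for whatever model you actually have pulled and how much VRAM you have.
import ollama

response = ollama.generate(
    model="llama3.3:70b",  # placeholder tag; use whatever you have pulled locally
    prompt="Summarize the tradeoffs of CPU offloading in one paragraph.",
    options={"num_gpu": 40},  # number of layers kept on the GPU
)
print(response["response"])
```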
Can an RTX 3090 run Qwen 3 32B?
Yes. Qwen 3 32B at Q4_K_M needs approximately 19-20 GB VRAM with 4K context, which fits within the RTX 3090's 24 GB. You have room for longer contexts or a higher quantization level.
Can an RX 7900 XTX run Llama 4 Scout?
Llama 4 Scout is a 109B MoE model. At Q4_K_M it needs ~60 GB, and at Q2_K ~30 GB. The RX 7900 XTX (24 GB) cannot fit even Q2 without significant CPU offloading. Consider smaller MoE models like Qwen 3 30B-A3B instead.
How do I check if a specific model fits my GPU?
Use the calculator at pcpartguide.com/tools/vram-calculator — enter your model, quantization, and context length to get an exact VRAM breakdown and see which GPUs match your requirements.

