Can Your GPU Run It? VRAM Compatibility Checker for 80+ LLMs
Check whether your GPU has enough VRAM to run any major 2026 LLM. Cross-reference RTX 5090, RTX 4090, RTX 3090, RX 7900 XTX, and other popular cards against model requirements including Qwen 3, Gemma 4, Llama 4, and MoE models.

How to read these compatibility tables
The tables below show which models fit on each GPU at Q4_K_M quantization with 4K context length. Fits means the model loads with comfortable headroom. Tight means it technically fits but leaves less than 2 GB free for longer contexts. Offload means it exceeds the GPU's VRAM and requires partial CPU offloading, which reduces speed significantly. For the exact VRAM numbers behind these compatibility ratings, see our complete VRAM reference table.
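These three status buckets follow directly from the headroom rule stated above (less than 2 GB free means Tight, negative headroom means Offload). A minimal sketch of that classification, using the same thresholds:

```python
def fit_status(model_vram_gb: float, gpu_vram_gb: float) -> str:
    """Classify a model/GPU pairing into the article's three buckets."""
    headroom = gpu_vram_gb - model_vram_gb
    if headroom < 0:
        return "Offload"   # exceeds VRAM: partial CPU offloading needed
    if headroom < 2:
        return "Tight"     # fits, but < 2 GB left for longer contexts
    return "Fits"          # comfortable headroom

# Example: Qwen 3 32B at Q4 (~19 GB) on a 24 GB card
print(fit_status(19, 24))  # Fits
print(fit_status(12, 12))  # Tight
print(fit_status(38, 24))  # Offload
```

The same cutoffs are applied consistently in every table below, so you can reuse this check for any GPU/model pair not listed.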
8 GB GPUs: RTX 4060, RTX 3060 Ti, RX 7600
8 GB is the entry point for local LLM inference. You can run 7B models comfortably and 8B models at Q4, which covers capable models like Qwen 3 8B and Llama 3.1 8B. The 2026 generation of MoE models also helps here — Gemma 4 E2B loads 5.1B total but only activates 2B, giving you excellent quality per GB. If you are buying a GPU specifically for LLMs, 8 GB should be considered the absolute minimum.
| Model | Params | VRAM Needed | 8 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3.5 2B | 2B | ~1.2 GB | Fits easily | Tiny but capable |
| Phi-4 Mini | 3.8B | ~2.5 GB | Fits easily | Plenty of context headroom |
| Qwen 3 4B | 4B | ~2.5 GB | Fits easily | Great for basic tasks |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | Fits easily | Only 2B active per token |
| Qwen 3 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Llama 3.1 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | Fits | Only 4B active per token |
| Phi-4 | 14B | ~8.5 GB | Tight / Offload | Needs Q3 or offloading |
12 GB GPUs: RTX 3060 12 GB, RTX 4070
The RTX 3060 12 GB is the best value entry point for local LLMs. For around $180 used, you get enough VRAM to run 14B models at Q4 comfortably. This is the tier where local LLMs start feeling genuinely useful for coding, writing, and analysis tasks. The 2026 MoE models also shine here — gpt-oss 20B MoE fits with room to spare despite packing flagship-class knowledge.
| Model | Params | VRAM Needed | 12 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB | Fits easily | Room for Q8 or long context |
| Nemotron-Nano 9B | 9B | ~5.5 GB | Fits | Strong reasoning at this size |
| Gemma 3 12B | 12B | ~7 GB | Fits | Can use Q5 for better quality |
| Qwen 3 14B | 14B | ~8.5 GB | Fits | ~3 GB context headroom |
| Phi-4 | 14B | ~8.5 GB | Fits | Strong reasoning at this size |
| gpt-oss 20B (MoE) | 20B | ~12 GB | Tight | Only 3.6B active, Q4 tight fit |
| Qwen 3 32B | 32B | ~19 GB | Offload | Need 24 GB GPU for Q4 |
16 GB GPUs: RTX 4070 Ti Super, RTX 5080, RX 9070 XT
16 GB is the tier where MoE models really start to flex. gpt-oss 20B MoE fits at Q4 with 4 GB headroom despite 20B total parameters. You can also run dense 14B models at Q5 for near-native quality. This is also where AMD GPUs become competitive — the RX 9070 XT and RX 7800 XT both offer 16 GB at lower prices than NVIDIA equivalents, though with software ecosystem tradeoffs covered in our AMD vs NVIDIA comparison.
| Model | Params | VRAM Needed | 16 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB (Q8: ~9 GB) | Fits Q8 | Near-native quality |
| Qwen 3 14B | 14B | ~8.5 GB (Q5: ~10 GB) | Fits Q5 | Great quality/VRAM balance |
| gpt-oss 20B (MoE) | 20B | ~12 GB (Q4) | Fits | Only 3.6B active per token |
| Mistral Small 3.1 24B | 24B | ~14 GB (Q4) | Fits tight | Short context only |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits tight | Only 4B active, Q4 tight |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Offload | Need Q3 (~12 GB) to fit |
24 GB GPUs: RTX 4090, RTX 3090, RX 7900 XTX
24 GB is the sweet spot for single-GPU local LLM inference. This is the tier where Qwen 3 32B and Gemma 4 31B at Q4 run with comfortable context headroom. MoE models like Qwen 3 30B-A3B fit easily with lots of room for long context. Both the RTX 4090 and the used RTX 3090 sit here — see our 3090 vs midrange comparison for why the used 3090 at $450 is often the smarter buy for LLM workloads.
| Model | Params | VRAM Needed | 24 GB Status | Notes |
|---|---|---|---|---|
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits easily | Only 4B active, lots of headroom |
| Qwen 3.5 27B | 27B | ~16 GB (Q4) | Fits comfortably | ~7 GB context headroom |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB (Q4) | Fits | Only 3B active, ~5 GB headroom |
| Gemma 4 31B | 31B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits tight | Great quality, short context |
| Command R | 35B | ~21 GB (Q4) | Fits | ~2 GB context headroom |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | Need Q2 (~20 GB) for partial fit |
| Qwen 2.5 72B | 72B | ~40 GB (Q4) | Offload | Same as 70B — dual GPU recommended |
The standout here is Qwen 3 32B at Q4_K_M. It fits in 19 GB, runs fast on a single GPU, and benchmarks close to GPT-4 class output for many tasks. If you have a 24 GB GPU, this should be your daily driver model. The MoE models (Qwen 3 30B-A3B, Gemma 4 26B-A4B) are even more compelling if you prioritize inference speed — they activate only 3-4B parameters per token.
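The MoE tradeoff above can be made concrete: VRAM is driven by total parameters (every expert must be resident), while per-token compute is driven by active parameters. A rough sketch, using the ~0.6 GB per billion parameters at Q4_K_M implied by the tables above (e.g. 32B → ~19 GB); the exact ratio varies by model:

```python
def q4_weight_gb(total_params_b: float) -> float:
    """Rough Q4_K_M weight footprint: ~0.6 GB per billion parameters
    (illustrative ratio inferred from the tables above)."""
    return total_params_b * 0.6

def moe_compare(total_b: float, active_b: float) -> dict:
    # VRAM scales with TOTAL params; per-token compute with ACTIVE params.
    return {
        "vram_gb": round(q4_weight_gb(total_b), 1),
        "compute_ratio_vs_dense": round(active_b / total_b, 2),
    }

# Qwen 3 30B-A3B: 30B total, 3B active per token
print(moe_compare(30, 3))  # {'vram_gb': 18.0, 'compute_ratio_vs_dense': 0.1}
```

This is why a 30B MoE can decode markedly faster than a dense 32B at the same quantization: it pays the full memory bill but only a tenth of the per-token compute.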
32 GB GPUs: RTX 5090
The RTX 5090 is the first consumer GPU to break the 24 GB barrier. The extra 8 GB opens the door to running 70B models at Q3 (which fits in ~30 GB) and running 32B models at Q5 or Q8 with full context headroom. For the full breakdown of whether the premium over an RTX 4090 is worth it for LLM workloads, see our RTX 5090 vs 4090 comparison.
| Model | Params | VRAM Needed | 32 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits | 8 GB context headroom |
| Qwen 3 32B | 32B | ~34 GB (Q8) | Offload | Nearly fits at Q8 |
| Command R | 35B | ~21 GB (Q4) | Fits easily | 11 GB headroom for context |
| Llama 3.3 70B | 70B | ~30 GB (Q3) | Fits tight | First single-GPU 70B option |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | ~6 GB short of full Q4 |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB (Q4) | Offload | Need multi-GPU |
AMD GPU compatibility notes
AMD GPUs with ROCm support can run most LLMs via Ollama and llama.cpp. The RX 7900 XTX (24 GB) has the same VRAM capacity as an RTX 4090, so the model compatibility tables above apply identically. However, there are ecosystem differences to be aware of when choosing between AMD and NVIDIA for LLM workloads.
- ROCm vs CUDA: Most inference frameworks support both, but CUDA gets updates first and has better documentation. See our AMD GPU guide for the current state of ROCm support.
- Flash Attention: AMD supports Flash Attention on RDNA3 GPUs via ROCm, but performance is typically 10-20% behind NVIDIA's implementation.
- Multi-GPU: AMD's multi-GPU support for LLM inference is less mature than NVIDIA's NCCL-based tensor parallelism. If you plan to run dual GPUs, NVIDIA is the safer choice.
- vLLM support: vLLM supports AMD GPUs but may lag behind CUDA versions by weeks for new features. llama.cpp and Ollama have more parity between AMD and NVIDIA.
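Whichever vendor you choose, the first step is confirming how much VRAM you actually have. On NVIDIA, `nvidia-smi --query-gpu=memory.total --format=csv,noheader` prints one line per GPU; on AMD, `rocm-smi --showmeminfo vram` reports the same information in a different format. A small sketch that shells out to nvidia-smi and parses its CSV output (the parser is separated out so it can run without a GPU present):

```python
import subprocess

def total_vram_mib(csv_text: str) -> int:
    """Parse `nvidia-smi --query-gpu=memory.total --format=csv,noheader`
    output, e.g. "24564 MiB". Returns the first GPU's total VRAM in MiB."""
    first_line = csv_text.strip().splitlines()[0]
    return int(first_line.split()[0])

def query_gpu_vram_mib() -> int:
    """Ask nvidia-smi for total VRAM (NVIDIA only; AMD users would parse
    `rocm-smi --showmeminfo vram` output instead)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return total_vram_mib(out)

# The parser also works on captured output, no GPU required:
print(total_vram_mib("24564 MiB\n"))  # 24564
```

Note that the OS and display compositor typically reserve some VRAM, so the usable amount is slightly below the reported total.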
Check your exact GPU-model compatibility
The tables above cover the most common GPU-model combinations at Q4_K_M. But your specific setup may use different quantization levels, context lengths, or less common models. To check your exact setup, enter your GPU, pick your model, set quantization and context length, and get a VRAM breakdown with a pass/fail verdict for your hardware.
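The breakdown such a check performs can be sketched in a few lines: quantized weights plus the FP16 KV cache for your chosen context length. The architecture numbers below (layer count, KV heads, head dimension) and the ~4.5 effective bits per weight for Q4_K_M are illustrative assumptions, not exact figures for any specific model:

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Quantized weight footprint: params x bits/8 bytes (small overheads ignored)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def verdict(params_b, bits, gpu_gb, n_layers, n_kv_heads, head_dim, ctx_len):
    need = weights_gb(params_b, bits) + kv_cache_gb(n_layers, n_kv_heads,
                                                    head_dim, ctx_len)
    return {"needed_gb": round(need, 1), "fits": need <= gpu_gb}

# Hypothetical 8B config (32 layers, 8 KV heads, head_dim 128) at ~4.5 bits
# per weight, 4K context, on an 8 GB card:
print(verdict(8, 4.5, 8, 32, 8, 128, 4096))  # {'needed_gb': 5.0, 'fits': True}
```

The ~5 GB result lines up with the 8B rows in the tables above; doubling the context to 8K roughly doubles only the KV cache term, which is why headroom matters more at long context.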
Frequently Asked Questions
Can an RTX 4090 run Llama 3.3 70B?
Not fully in VRAM. Llama 3.3 70B needs roughly 38 GB at Q4, well beyond the 4090's 24 GB. You can partially fit it at Q2 (~20 GB) or offload layers to CPU, but expect a significant quality or speed penalty either way.
Can an RTX 3090 run Qwen 3 32B?
Yes. At Q4_K_M, Qwen 3 32B needs about 19 GB, which fits in the 3090's 24 GB with roughly 4 GB of headroom for context. It is the standout daily-driver model for this tier.
Can an RX 7900 XTX run Llama 4 Scout?
The RX 7900 XTX has 24 GB, the same capacity as an RTX 4090, so the 24 GB tier table above applies. Check Llama 4 Scout's quantized footprint against that budget in our complete VRAM reference table.
How do I check if a specific model fits my GPU?
As a rule of thumb from the tables above, a Q4_K_M model needs roughly 0.6 GB per billion parameters, plus 1-2 GB for context. For an exact answer, check your setup with your GPU, model, quantization, and context length as described above.