Can Your GPU Run It? VRAM Compatibility Checker for 80+ LLMs
Check whether your GPU has enough VRAM to run any major 2026 LLM. Cross-reference RTX 5090, RTX 4090, RTX 3090, RX 7900 XTX, and other popular cards against model requirements including Qwen 3, Gemma 4, Llama 4, and MoE models.

How to read these compatibility tables
The tables below show which models fit on each GPU at Q4_K_M quantization with 4K context length. Fits means the model loads with comfortable headroom. Tight means it technically fits but leaves less than 2 GB free for longer contexts. Offload means it exceeds the GPU's VRAM and requires partial CPU offloading, which reduces speed significantly. For the exact VRAM numbers behind these compatibility ratings, see our complete VRAM reference table.
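These three status buckets follow directly from the headroom rule stated above (less than 2 GB free means Tight, negative headroom means Offload). A minimal sketch of that classification, using the same thresholds:

```python
def fit_status(model_vram_gb: float, gpu_vram_gb: float) -> str:
    """Classify a model/GPU pairing into the article's three buckets."""
    headroom = gpu_vram_gb - model_vram_gb
    if headroom < 0:
        return "Offload"   # exceeds VRAM: partial CPU offloading needed
    if headroom < 2:
        return "Tight"     # fits, but < 2 GB left for longer contexts
    return "Fits"          # comfortable headroom

# Example: Qwen 3 32B at Q4 (~19 GB) on a 24 GB card
print(fit_status(19, 24))  # Fits
print(fit_status(12, 12))  # Tight
print(fit_status(38, 24))  # Offload
```

The same cutoffs are applied consistently in every table below, so you can reuse this check for any GPU/model pair not listed.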
8 GB GPUs: RTX 4060, RTX 3060 Ti, RX 7600
8 GB is the entry point for local LLM inference. You can run 7B models comfortably and 8B models at Q4, which covers capable models like Qwen 3 8B and Llama 3.1 8B. The 2026 generation of MoE models also helps here — Gemma 4 E2B loads 5.1B total but only activates 2B, giving you excellent quality per GB. If you are buying a GPU specifically for LLMs, 8 GB should be considered the absolute minimum.
| Model | Params | VRAM Needed | 8 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3.5 2B | 2B | ~1.2 GB | Fits easily | Tiny but capable |
| Phi-4 Mini | 3.8B | ~2.5 GB | Fits easily | Plenty of context headroom |
| Qwen 3 4B | 4B | ~2.5 GB | Fits easily | Great for basic tasks |
| Gemma 4 E2B (MoE) | 5.1B | ~3 GB | Fits easily | Only 2B active per token |
| Qwen 3 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Llama 3.1 8B | 8B | ~5 GB | Fits | ~2.5 GB headroom, Q4 only |
| Gemma 4 E4B (MoE) | 8B | ~5 GB | Fits | Only 4B active per token |
| Phi-4 | 14B | ~8.5 GB | Tight / Offload | Needs Q3 or offloading |
12 GB GPUs: RTX 3060 12 GB, RTX 4070
The RTX 3060 12 GB is the best value entry point for local LLMs. For around $180 used, you get enough VRAM to run 14B models at Q4 comfortably. This is the tier where local LLMs start feeling genuinely useful for coding, writing, and analysis tasks. The 2026 MoE models also shine here — gpt-oss 20B MoE fits with room to spare despite packing flagship-class knowledge.
| Model | Params | VRAM Needed | 12 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB | Fits easily | Room for Q8 or long context |
| Nemotron-Nano 9B | 9B | ~5.5 GB | Fits | Strong reasoning at this size |
| Gemma 3 12B | 12B | ~7 GB | Fits | Can use Q5 for better quality |
| Qwen 3 14B | 14B | ~8.5 GB | Fits | ~3 GB context headroom |
| Phi-4 | 14B | ~8.5 GB | Fits | Strong reasoning at this size |
| gpt-oss 20B (MoE) | 20B | ~12 GB | Tight | Only 3.6B active, Q4 tight fit |
| Qwen 3 32B | 32B | ~19 GB | Offload | Need 24 GB GPU for Q4 |
16 GB GPUs: RTX 4070 Ti Super, RTX 5080, RX 9070 XT
16 GB is the tier where MoE models really start to flex. gpt-oss 20B MoE fits at Q4 with 4 GB headroom despite 20B total parameters. You can also run dense 14B models at Q5 for near-native quality. This is also where AMD GPUs become competitive — the RX 9070 XT and RX 7800 XT both offer 16 GB at lower prices than NVIDIA equivalents, though with software ecosystem tradeoffs covered in our AMD vs NVIDIA comparison.
| Model | Params | VRAM Needed | 16 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 8B | 8B | ~5 GB (Q8: ~9 GB) | Fits Q8 | Near-native quality |
| Qwen 3 14B | 14B | ~8.5 GB (Q5: ~10 GB) | Fits Q5 | Great quality/VRAM balance |
| gpt-oss 20B (MoE) | 20B | ~12 GB (Q4) | Fits | Only 3.6B active per token |
| Mistral Small 3.1 24B | 24B | ~14 GB (Q4) | Fits tight | Short context only |
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits tight | Only 4B active, Q4 tight |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Offload | Need Q3 (~12 GB) to fit |
24 GB GPUs: RTX 4090, RTX 3090, RX 7900 XTX
24 GB is the sweet spot for single-GPU local LLM inference. This is the tier where Qwen 3 32B and Gemma 4 31B at Q4 run with comfortable context headroom. MoE models like Qwen 3 30B-A3B fit easily with lots of room for long context. Both the RTX 4090 and the used RTX 3090 sit here — see our 3090 vs midrange comparison for why the used 3090 at $450 is often the smarter buy for LLM workloads.
| Model | Params | VRAM Needed | 24 GB Status | Notes |
|---|---|---|---|---|
| Gemma 4 26B-A4B (MoE) | 26B | ~15 GB (Q4) | Fits easily | Only 4B active, lots of headroom |
| Qwen 3.5 27B | 27B | ~16 GB (Q4) | Fits comfortably | ~7 GB context headroom |
| Qwen 3 30B-A3B (MoE) | 30B | ~18 GB (Q4) | Fits | Only 3B active, ~5 GB headroom |
| Gemma 4 31B | 31B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~19 GB (Q4) | Fits comfortably | ~4 GB context headroom |
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits tight | Great quality, short context |
| Command R | 35B | ~21 GB (Q4) | Fits | ~2 GB context headroom |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | Need Q2 (~20 GB) for partial fit |
| Qwen 2.5 72B | 72B | ~40 GB (Q4) | Offload | Same as 70B — dual GPU recommended |
The standout here is Qwen 3 32B at Q4_K_M. It fits in 19 GB, runs fast on a single GPU, and benchmarks close to GPT-4 class output for many tasks. If you have a 24 GB GPU, this should be your daily driver model. The MoE models (Qwen 3 30B-A3B, Gemma 4 26B-A4B) are even more compelling if you prioritize inference speed — they activate only 3-4B parameters per token.
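The MoE tradeoff above can be made concrete: VRAM is driven by total parameters (every expert must be resident), while per-token compute is driven by active parameters. A rough sketch, using the ~0.6 GB per billion parameters at Q4_K_M implied by the tables above (e.g. 32B → ~19 GB); the exact ratio varies by model:

```python
def q4_weight_gb(total_params_b: float) -> float:
    """Rough Q4_K_M weight footprint: ~0.6 GB per billion parameters
    (illustrative ratio inferred from the tables above)."""
    return total_params_b * 0.6

def moe_compare(total_b: float, active_b: float) -> dict:
    # VRAM scales with TOTAL params; per-token compute with ACTIVE params.
    return {
        "vram_gb": round(q4_weight_gb(total_b), 1),
        "compute_ratio_vs_dense": round(active_b / total_b, 2),
    }

# Qwen 3 30B-A3B: 30B total, 3B active per token
print(moe_compare(30, 3))  # {'vram_gb': 18.0, 'compute_ratio_vs_dense': 0.1}
```

This is why a 30B MoE can decode markedly faster than a dense 32B at the same quantization: it pays the full memory bill but only a tenth of the per-token compute.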
32 GB GPUs: RTX 5090
The RTX 5090 is the first consumer GPU to break the 24 GB barrier. The extra 8 GB opens the door to running 70B models at Q3 (which fits in ~30 GB) and running 32B models at Q5 or Q8 with full context headroom. For the full breakdown of whether the premium over an RTX 4090 is worth it for LLM workloads, see our RTX 5090 vs 4090 comparison.
| Model | Params | VRAM Needed | 32 GB Status | Notes |
|---|---|---|---|---|
| Qwen 3 32B | 32B | ~24 GB (Q5) | Fits | 8 GB context headroom |
| Qwen 3 32B | 32B | ~34 GB (Q8) | Offload | Nearly fits at Q8 |
| Command R | 35B | ~21 GB (Q4) | Fits easily | 11 GB headroom for context |
| Llama 3.3 70B | 70B | ~30 GB (Q3) | Fits tight | First single-GPU 70B option |
| Llama 3.3 70B | 70B | ~38 GB (Q4) | Offload | ~6 GB short of full Q4 |
| Qwen 3.5 122B-A10B (MoE) | 122B | ~67 GB (Q4) | Offload | Need multi-GPU |
AMD GPU compatibility notes
AMD GPUs with ROCm support can run most LLMs via Ollama and llama.cpp. The RX 7900 XTX (24 GB) has the same VRAM capacity as an RTX 4090, so the model compatibility tables above apply identically. However, there are ecosystem differences to be aware of when choosing between AMD and NVIDIA for LLM workloads.
- ROCm vs CUDA: Most inference frameworks support both, but CUDA gets updates first and has better documentation. See our AMD GPU guide for the current state of ROCm support.
- Flash Attention: AMD supports Flash Attention on RDNA3 GPUs via ROCm, but performance is typically 10-20% behind NVIDIA's implementation.
- Multi-GPU: AMD's multi-GPU support for LLM inference is less mature than NVIDIA's NCCL-based tensor parallelism. If you plan to run dual GPUs, NVIDIA is the safer choice.
- vLLM support: vLLM supports AMD GPUs but may lag behind CUDA versions by weeks for new features. llama.cpp and Ollama have more parity between AMD and NVIDIA.
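Whichever vendor you choose, the first step is confirming how much VRAM you actually have. On NVIDIA, `nvidia-smi --query-gpu=memory.total --format=csv,noheader` prints one line per GPU; on AMD, `rocm-smi --showmeminfo vram` reports the same information in a different format. A small sketch that shells out to nvidia-smi and parses its CSV output (the parser is separated out so it can run without a GPU present):

```python
import subprocess

def total_vram_mib(csv_text: str) -> int:
    """Parse `nvidia-smi --query-gpu=memory.total --format=csv,noheader`
    output, e.g. "24564 MiB". Returns the first GPU's total VRAM in MiB."""
    first_line = csv_text.strip().splitlines()[0]
    return int(first_line.split()[0])

def query_gpu_vram_mib() -> int:
    """Ask nvidia-smi for total VRAM (NVIDIA only; AMD users would parse
    `rocm-smi --showmeminfo vram` output instead)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return total_vram_mib(out)

# The parser also works on captured output, no GPU required:
print(total_vram_mib("24564 MiB\n"))  # 24564
```

Note that the OS and display compositor typically reserve some VRAM, so the usable amount is slightly below the reported total.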
Check your exact GPU-model compatibility
The tables above cover the most common GPU-model combinations at Q4_K_M. But your specific setup may use different quantization levels, context lengths, or less common models. To check your exact setup, enter your GPU, pick your model, set quantization and context length, and get a VRAM breakdown with a pass/fail verdict for your hardware.
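The breakdown such a check performs can be sketched in a few lines: quantized weights plus the FP16 KV cache for your chosen context length. The architecture numbers below (layer count, KV heads, head dimension) and the ~4.5 effective bits per weight for Q4_K_M are illustrative assumptions, not exact figures for any specific model:

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Quantized weight footprint: params x bits/8 bytes (small overheads ignored)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def verdict(params_b, bits, gpu_gb, n_layers, n_kv_heads, head_dim, ctx_len):
    need = weights_gb(params_b, bits) + kv_cache_gb(n_layers, n_kv_heads,
                                                    head_dim, ctx_len)
    return {"needed_gb": round(need, 1), "fits": need <= gpu_gb}

# Hypothetical 8B config (32 layers, 8 KV heads, head_dim 128) at ~4.5 bits
# per weight, 4K context, on an 8 GB card:
print(verdict(8, 4.5, 8, 32, 8, 128, 4096))  # {'needed_gb': 5.0, 'fits': True}
```

The ~5 GB result lines up with the 8B rows in the tables above; doubling the context to 8K roughly doubles only the KV cache term, which is why headroom matters more at long context.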
Frequently Asked Questions
Can an RTX 4090 run Llama 3.3 70B?
Not fully in VRAM. Llama 3.3 70B needs roughly 38 GB at Q4, well beyond the 4090's 24 GB. You can partially fit it at Q2 (~20 GB) or offload layers to CPU, but expect a significant quality or speed penalty either way.
Can an RTX 3090 run Qwen 3 32B?
Yes. At Q4_K_M, Qwen 3 32B needs about 19 GB, which fits in the 3090's 24 GB with roughly 4 GB of headroom for context. It is the standout daily-driver model for this tier.
Can an RX 7900 XTX run Llama 4 Scout?
The RX 7900 XTX has 24 GB, the same capacity as an RTX 4090, so the 24 GB tier table above applies. Check Llama 4 Scout's quantized footprint against that budget in our complete VRAM reference table.
How do I check if a specific model fits my GPU?
As a rule of thumb from the tables above, a Q4_K_M model needs roughly 0.6 GB per billion parameters, plus 1-2 GB for context. For an exact answer, check your setup with your GPU, model, quantization, and context length as described above.