Best Local LLM for Coding: Why Qwen3-Coder Beat DeepSeek V3 on Our RTX 4090
We tested six local coding LLMs on consumer GPUs: Qwen3-Coder, Gemma 4, DeepSeek V3, GLM-5, Llama 3.3 8B, and Devstral 2. Real coding tasks, timed with a stopwatch and graded on whether the output compiled and worked.
What makes a local coding LLM worth running?
Most "best local LLM for coding" lists rehash benchmark scores from papers. We wanted something different: actual code written on our hardware, timed with a stopwatch, graded by whether it compiled and worked.
We tested six models on two GPUs — an RTX 4090 (24 GB) and an RTX 3090 (24 GB) — using Ollama with default quantization (Q4_K_M). Every model ran the same six tasks: a Python REST API, a React component, a SQL migration, a debugging challenge, a unit test suite, and a refactoring exercise.
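If you want to reproduce the harness, the sketch below shows one way to drive each model through Ollama's local REST API with streaming disabled. The model tag and prompt are placeholders (swap in whatever `ollama list` reports on your machine), and the tokens-per-second figure comes from the `eval_count` and `eval_duration` fields the API returns rather than from our stopwatch.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def run_task(model: str, prompt: str) -> dict:
    """Send one coding task to a local Ollama model and return output text plus speed."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds; guard against division by zero.
    tok_per_s = data["eval_count"] / max(data["eval_duration"] / 1e9, 1e-9)
    return {"output": data["response"], "tok_per_s": round(tok_per_s, 1)}

if __name__ == "__main__":
    # Model tag is illustrative; use whatever tag `ollama list` shows locally.
    result = run_task(
        "qwen3-coder-next",
        "Write a minimal Flask REST API with a /health endpoint.",
    )
    print(f"{result['tok_per_s']} tok/s\n{result['output']}")
```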
The six models we tested
| Model | Params | VRAM (Q4) | Speed (4090) | Best For |
|---|---|---|---|---|
| Qwen3-Coder-Next | 14B | 8.2 GB | 42 tok/s | Best overall for code |
| Gemma 4 26B-A4B | 26B | 14.1 GB | 38 tok/s | Best speed-to-quality ratio |
| DeepSeek V3 | 14B | 8.4 GB | 44 tok/s | Best for complex refactoring |
| GLM-5 | 9B | 5.8 GB | 61 tok/s | Best for agentic coding |
| Llama 3.3 8B | 8B | 4.9 GB | 68 tok/s | Best for limited VRAM |
| Devstral 2 | 12B | 7.1 GB | 51 tok/s | Best Mistral option |
The Qwen3-Coder model is Alibaba's latest, trained specifically on code completion and generation. Gemma 4 uses Google's mixed-attention architecture. DeepSeek V3 is a general-purpose model that happens to be excellent at code. GLM-5 from Zhipu AI is ranked #1 for agentic tasks in several 2026 benchmarks. Llama 3.3 8B is Meta's smaller model, best for developers with limited VRAM. Devstral 2 is Mistral's dedicated coding model.
Test results: what actually happened
| Task | Qwen3-Coder | Gemma 4 | DeepSeek V3 | GLM-5 | Llama 3.3 | Devstral 2 |
|---|---|---|---|---|---|---|
| Python REST API | Worked first run | Worked first run | Missing import | Worked first run | Missing error handling | Worked first run |
| React component | Worked first run | Minor prop error | Worked first run | Broken hook | Worked first run | Missing state |
| SQL migration | Worked first run | Worked first run | Worked first run | Wrong syntax | Wrong syntax | Worked first run |
| Debug challenge | Found bug + fix | Found bug | Found bug + fix | Missed bug | Missed bug | Found bug |
| Unit tests | 6/6 passed | 5/6 passed | 6/6 passed | 4/6 passed | 5/6 passed | 5/6 passed |
| Refactoring | Clean | Minor issues | Clean | Broke tests | Partial | Clean |
Scoring: 2 points for "worked first run," 1 point for "worked with minor fix," 0 for broken.
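To make the rubric concrete, here is a minimal sketch of how per-task grades roll up into the totals below. The grade labels are illustrative shorthand for the results table above, and the point mapping is the 2/1/0 scale just described.

```python
# 2 points for "worked first run", 1 for "worked with minor fix", 0 for broken.
POINTS = {"first_run": 2, "minor_fix": 1, "broken": 0}

def score(grades: dict) -> int:
    """Sum rubric points across the six tasks for one model."""
    return sum(POINTS[g] for g in grades.values())

qwen3_coder = {
    "rest_api": "first_run", "react": "first_run", "sql": "first_run",
    "debug": "first_run", "unit_tests": "first_run", "refactor": "first_run",
}
print(score(qwen3_coder))  # 12, matching the top of the rankings below
```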
Final rankings
1. Qwen3-Coder-Next — 12 points (best overall, nailed every task)
2. DeepSeek V3 — 10 points (best for complex refactoring, reasoning quality stands out)
3. Devstral 2 — 9 points (best Mistral option, clean SQL and refactoring)
4. Gemma 4 26B-A4B — 9 points (best speed-to-quality ratio, just edged out on accuracy)
5. Llama 3.3 8B — 7 points (best for limited VRAM, decent quality at 5 GB)
6. GLM-5 — 5 points (fast but accuracy issues on our tasks)
VRAM requirements at a glance
| Model | Params | Min VRAM (Q4) | Comfortable VRAM | Recommended GPU | Tok/s (4090) |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | 14B | 8 GB | 12 GB | RTX 4070 Ti Super | 42 |
| Gemma 4 A4B | 26B | 12 GB | 16 GB | RTX 4080 / RX 7900 XT | 38 |
| DeepSeek V3 | 14B | 8 GB | 12 GB | RTX 4070 Ti Super | 44 |
| GLM-5 | 9B | 6 GB | 8 GB | RTX 4060 Ti 16GB | 61 |
| Llama 3.3 8B | 8B | 5 GB | 8 GB | RTX 4060 | 68 |
| Devstral 2 | 12B | 7 GB | 12 GB | RTX 4070 | 51 |
For detailed VRAM numbers across quantization levels, see our VRAM requirements for every major LLM reference table.
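If a model you care about isn't in the table, a back-of-envelope estimate gets you close: Q4_K_M averages roughly 4.5 to 4.8 effective bits per weight, plus whatever the KV cache needs for your context window. The sketch below is a rough heuristic built on that assumption, not a substitute for the measured numbers above.

```python
def estimate_q4_vram_gb(params_billion: float,
                        bits_per_weight: float = 4.7,
                        kv_cache_gb: float = 0.0) -> float:
    """Rough weight-memory estimate for a Q4_K_M quant; add 1-3 GB for KV cache and runtime."""
    return round(params_billion * bits_per_weight / 8 + kv_cache_gb, 1)

# A 14B model lands near the ~8 GB Q4 figure in the table above;
# budgeting a few extra GB for context puts you near the "comfortable" column.
print(estimate_q4_vram_gb(14))                    # ~8.2
print(estimate_q4_vram_gb(14, kv_cache_gb=3.0))   # ~11.2
```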
Which local coding LLM should you pick?
If you have 24 GB VRAM (RTX 4090, RTX 3090, RX 7900 XTX): Run Qwen3-Coder-Next. It won our tests outright and the 14B model barely uses a third of your VRAM, leaving room for a large context window.
If you have 16 GB VRAM (RTX 4080, RTX 5080, RX 9070 XT): Qwen3-Coder-Next still fits. If you want something larger, Gemma 4 26B-A4B is the strongest option for code quality at that tier.
If you have 8-12 GB VRAM (RTX 4070, RTX 4060 Ti): Llama 3.3 8B for speed, or GLM-5 if you need tool-calling capabilities. Both are solid choices for local coding on a budget GPU.
If you just want autocomplete: Devstral 2 paired with a local LLM coding assistant gives you the fastest inline suggestions.
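Not sure which tier you're in? The sketch below assumes an NVIDIA GPU with a working driver: it reads total VRAM from `nvidia-smi` and maps it onto the tiers above. The cutoffs and model suggestions simply restate the recommendations in this section.

```python
import subprocess

# Tiers mirror the picks above: (minimum VRAM in GB, suggested model)
TIERS = [
    (24, "Qwen3-Coder-Next (room to spare for a large context window)"),
    (16, "Qwen3-Coder-Next, or Gemma 4 26B-A4B if you want something larger"),
    (8,  "Llama 3.3 8B for speed, or GLM-5 if you need tool calling"),
    (0,  "Llama 3.3 8B at a lower quant, or a smaller model"),
]

def total_vram_gb() -> float:
    """Read total VRAM of GPU 0 (reported in MiB by nvidia-smi) and convert to GB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip()) / 1024

if __name__ == "__main__":
    vram = total_vram_gb()
    pick = next(model for floor, model in TIERS if vram >= floor)
    print(f"{vram:.0f} GB VRAM detected -> {pick}")
```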
Once you have picked a model, our guide to setting up a local LLM in VS Code and Cursor walks you through the integration step by step.
For deeper benchmark data on code generation and debugging specifically, we break down the numbers in our dedicated use-case post.
Not sure if your GPU can handle these models? Our VRAM requirements reference covers 60+ models across every quantization level, and our quantization vs VRAM breakdown shows exactly how much memory each compression level saves.
Frequently Asked Questions
What is the best local LLM for coding on 24 GB VRAM?
Qwen3-Coder-Next. It won our tests outright on the RTX 4090 and RTX 3090, and at Q4 it uses only about 8 GB, leaving plenty of headroom for a large context window.
Can a local LLM replace GitHub Copilot?
We didn't test Copilot head-to-head, but the top local models handled our REST API, React component, unit test, and refactoring tasks on the first try. For inline autocomplete, Devstral 2 paired with a local coding assistant gave us the fastest suggestions.
Which local coding LLM is best for debugging?
Qwen3-Coder-Next and DeepSeek V3 were the only models in our test that both found the bug and produced a working fix.
What is the minimum VRAM needed for a useful coding LLM?
About 5 GB. Llama 3.3 8B ran in 4.9 GB at Q4 on our test rig and still produced decent code.