Local LLM for VS Code: The Setup That Finally Made Me Drop Copilot
Step-by-step instructions for running a local LLM inside VS Code with Continue, connecting Cursor to a local model via BYOK, and using Claude Code with a local API proxy. Tested on Windows, Mac, and Linux.
Prerequisites — pick your model runner
Before connecting a local LLM for VS Code, you need a model server running on your machine. Three options:
- Ollama — The fastest path. One command to install, one command to pull a model. Works on Mac, Linux, and Windows (native). This is what we recommend for most developers.
- LM Studio — Best GUI experience. Download models through a searchable interface, tweak parameters with sliders, see real-time performance metrics.
- llama.cpp server — Maximum control and performance. Build from source, tune every parameter. Overkill for most people.
For this guide, we use Ollama. Install it from ollama.com, then pull a coding model:
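For example (this is the same tag we run against in Step 1 below):

```bash
# Download the coding model used throughout this guide (~8 GB)
ollama pull qwen3-coder
```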
This downloads the Qwen3-Coder model (about 8 GB). For the best local LLM for VS Code experience on a 24 GB GPU, you could also try a larger coding model.
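A couple of illustrative alternatives, with the caveat that these tags are my assumptions rather than a vetted list; confirm the exact names and quantizations in the Ollama library before pulling:

```bash
# Larger coders that typically fit on a 24 GB GPU at Q4 quantization
ollama pull qwen2.5-coder:32b
ollama pull deepseek-coder-v2:16b
```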
See our best local LLM for coding guide for model recommendations by GPU.
Local LLM in VS Code (with Continue extension)
Continue is the most popular open-source VS Code local LLM extension, with over 500K installs. It supports autocomplete, chat, and agentic mode — all pointing to your local model.
Step 1: Verify Ollama is working:

ollama run qwen3-coder "Write a Python hello world"

You should see a response in your terminal.
Step 2: Open VS Code, go to Extensions (Ctrl+Shift+X), search for "Continue", and install it.
Step 3: When you first open Continue, select Ollama as your provider and choose your model. For manual configuration, the config file lives at ~/.continue/config.json:
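A minimal sketch of that file, assuming the JSON schema used by recent Continue releases (newer builds may prefer config.yaml, but the fields map one-to-one):

```json
{
  "models": [
    {
      "title": "Qwen3 Coder (local)",
      "provider": "ollama",
      "model": "qwen3-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3 Coder (local)",
    "provider": "ollama",
    "model": "qwen3-coder"
  }
}
```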
This gives you both chat (Cmd+L on Mac, Ctrl+L on Windows/Linux) and tab autocomplete from the same model.
Step 4: If autocomplete feels slow, try a smaller model for tab completion and keep the larger model for chat:
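One way to wire that up. I am assuming qwen2.5-coder:1.5b as the small completion model here; any ~1.5B coder from the Ollama library slots in the same way:

```bash
# Pull a small model dedicated to tab completion (assumed tag; verify in the Ollama library)
ollama pull qwen2.5-coder:1.5b
```

Then split the roles in ~/.continue/config.json:

```json
{
  "models": [
    { "title": "Qwen3 Coder (chat)", "provider": "ollama", "model": "qwen3-coder" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 1.5B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}
```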
The 1.5B model responds in under 50ms — almost indistinguishable from Copilot. You can also check our local LLM coding assistant comparison for tool-specific tuning tips.
Local LLM in Cursor (workarounds)
As of May 2026, Cursor does not natively support direct localhost connections to local models. All BYOK (Bring Your Own Key) requests route through Cursor servers. However, there are two workarounds.
Method 1: Override the Base URL. In Cursor's model settings, add a custom OpenAI API key and set the Base URL override to your local endpoint:

http://localhost:11434/v1

This works with Ollama's default port. Select your model in the model dropdown. Some Cursor features (agent mode, apply) may not work perfectly with local models.
Method 2: ngrok tunnel. If Cursor rejects localhost URLs, create a tunnel:
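With Ollama on its default port, the tunnel looks like this (ngrok must be installed and authenticated first):

```bash
# Expose the local Ollama API through a public HTTPS tunnel
ngrok http 11434
```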
Copy the ngrok URL (e.g. https://abc123.ngrok.io) and use that as your Base URL.
Limitations: Cursor agent mode sometimes struggles with local model responses. Inline edits work better than multi-file agent operations. The Cursor local LLM experience is improving but still behind VS Code + Continue.
Local LLM with Claude Code
Claude Code is Anthropic's CLI-based coding tool. It does not support local models natively, but you can use a local API proxy to intercept and redirect requests.
The most common approach: run an OpenAI-compatible proxy (like LiteLLM) that forwards to your local Ollama instance, then configure Claude Code to use that proxy endpoint.
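A rough sketch of that setup, with several assumptions: LiteLLM's proxy is installed (pip install 'litellm[proxy]'), the version is recent enough to expose an Anthropic-compatible /v1/messages endpoint, and your Claude Code build honors the ANTHROPIC_BASE_URL and ANTHROPIC_MODEL environment variables. The claude-local alias is purely illustrative.

```yaml
# litellm-config.yaml: map an alias onto the local Ollama model
model_list:
  - model_name: claude-local
    litellm_params:
      model: ollama/qwen3-coder
      api_base: http://localhost:11434
```

```bash
# Start the proxy on port 4000
litellm --config litellm-config.yaml --port 4000

# In another shell: point Claude Code at the proxy instead of api.anthropic.com
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=local-proxy   # placeholder; the proxy ignores it unless a master key is set
export ANTHROPIC_MODEL=claude-local       # request the alias defined above
claude
```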
Performance tips
- GPU offloading — Make sure Ollama is using your GPU, not CPU. Run nvidia-smi while generating to confirm GPU utilization. On Mac, Ollama uses Metal automatically.
- Quantization — Q4_K_M is the sweet spot for coding tasks. Lower (Q2, Q3) saves VRAM but hurts code accuracy. Higher (Q6, Q8) improves quality marginally while doubling VRAM usage.
- Context window — Coding benefits from large context. Set num_ctx: 8192 or higher in your Ollama Modelfile (see the sketch after this list). This uses more VRAM.
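A minimal Modelfile sketch for a larger-context variant of the model from this guide (the derived name qwen3-coder-8k is just an example):

```
# Modelfile: derive a larger-context variant of qwen3-coder
FROM qwen3-coder
PARAMETER num_ctx 8192
```

```bash
# Build and use the new variant
ollama create qwen3-coder-8k -f Modelfile
ollama run qwen3-coder-8k
```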
For the exact memory cost of larger contexts, see our KV cache explained post. For VRAM savings at each quantization level, our quantization vs VRAM guide has the numbers.
For more on which model pairs best with your hardware, our best local LLM for coding rankings cover VRAM requirements and speed benchmarks.
Frequently Asked Questions
Can I use a local LLM in VS Code?
Yes. Run a model server such as Ollama, install the Continue extension, and point it at your local model. You get chat, tab autocomplete, and agentic mode without sending code to a cloud API.
Does Cursor support local LLMs?
Not natively as of May 2026: BYOK requests still route through Cursor's servers. The workarounds are overriding the OpenAI Base URL with your local endpoint or exposing it through an ngrok tunnel, though agent mode and apply can be unreliable with local models.
Which is better for local LLM coding: VS Code or Cursor?
VS Code with Continue is currently the smoother experience for local models: it talks to localhost directly and handles chat and autocomplete from the same model. Cursor depends on workarounds, and its agent features often struggle with local model responses.