Does Claude Code work with Ollama instead of llama.cpp?

Yes. Ollama exposes an OpenAI-compatible API on port 11434 by default. Set ANTHROPIC_BASE_URL to http://localhost:11434/v1 and use the same dummy key approach. The main difference is that Ollama handles model management for you, so you won't need to point to a specific GGUF file. However, llama.cpp gives you finer control over context length and quantization settings. For a deeper comparison of local model managers, see our Ollama vs LM Studio guide at /comparisons/ollama-vs-lm-studio-7-differences-that-matter.

What VRAM do I need to run Claude Code with local coding models?

For a usable experience, aim for at least 12GB of VRAM with a 20B parameter model at Q4 quantization. A 35B model at Q4 needs roughly 20GB, and a 70B model at Q4 needs around 40GB. If you're running multiple model slots with llama-swap, you'll need enough VRAM for only the largest model since llama-swap loads one at a time. Consumer GPUs like the RTX 4090 (24GB) handle 35B models well.

Can I use Claude Code with a remote llama.cpp server on another machine?

Absolutely. Replace localhost in ANTHROPIC_BASE_URL with the IP address or hostname of your remote machine (e.g., http://192.168.1.100:8080). Make sure llama-server was started with --host 0.0.0.0 and that port 8080 is open in your firewall. For remote access over the internet, put the server behind a reverse proxy with TLS — never expose llama.cpp directly to the public internet.

Will Anthropic block local model usage in future Claude Code updates?

There's no guarantee either way. The environment variable overrides that make this work are undocumented but have persisted across multiple Claude Code releases. The community treats them as stable since they're also used for enterprise proxy setups. That said, a future update could change the API contract. Pin your Claude Code version if stability matters to your workflow.

Which local models work best for Claude Code's tool calling features?

Models specifically trained on tool-calling and function-calling datasets perform best. As of April 2026, Qwen Coder variants and DeepSeek Coder models are among the most reliable choices. Generic chat models often fail at tool use because Claude Code relies on structured tool calls for file operations and bash commands. Check the model card for explicit mention of tool-calling or function-calling support before downloading.

Ditch the API Bill: Run Claude Code on Local LLMs

What if you could run one of the highest-rated AI coding agents available — without paying a single cent in API fees?

Claude Code is, as of April 2026, one of the top-rated CLI coding agents in community rankings. But running it against Anthropic's cloud models gets expensive. Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens. If you're using it as a daily driver, that bill climbs fast.

Here's the workaround: you can point Claude Code directly at a local llama.cpp server, replacing Anthropic's cloud API with whatever open-weight model you're running on your own hardware. Free inference. No API key needed. And connecting Claude Code to llama.cpp takes about ten minutes.

This tutorial walks you through the complete process — from terminal config to VS Code integration to the performance tweaks that make local models actually usable with Claude Code.

Running Claude Code against local models won't match Opus 4.6 quality, but for routine coding tasks, a solid local model is surprisingly capable — and the price is right.

What You'll Build

By the end of this guide, you'll have:

Claude Code CLI connected to a local llama.cpp server
VS Code configured to use local models through the Claude Code extension
Performance optimizations that account for smaller context windows
A multi-model setup that routes different task types to different local models

Prerequisites

Before you start, make sure you have:

llama.cpp installed with the llama-server binary ready to go
Claude Code CLI installed (via npm or the native installer)
VS Code with the Claude Code extension (optional — only needed for IDE integration)
A local coding model — something in the 20B–70B parameter range works best. Community favorites include Qwen coding variants, DeepSeek Coder models, and CodeLlama — see our guide to open-source LLMs for local use for model recommendations
Enough VRAM — a 35B model at Q4 quantization needs roughly 20GB of VRAM. A 70B model at Q4 needs around 40GB

No Anthropic account or API key is required. That's the whole point.

Step 1: Start Your llama.cpp Server

First, get llama.cpp serving your model. If you've already got a server running, skip ahead.

Bash

./llama-server -m ./models/your-model.gguf -c 8192 --host 0.0.0.0 --port 8080

The -c 8192 flag sets context length to 8,192 tokens. You can go higher if your hardware allows it, but (more on context limits in Step 4) you'll want to keep this modest for local models.

Verify it's working:

Bash

curl http://localhost:8080/health

You should get a JSON response confirming the server is ready. If not, check that the model file path is correct and you have enough VRAM.

Step 2: Configure Claude Code CLI for llama.cpp

Claude Code uses environment variables to decide which API endpoint to hit. By overriding three variables, you redirect everything to your local server.

Add these lines to your .bashrc (or .zshrc):

Bash

export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_BASE_URL="http://localhost:8080"

Reload your shell:

Bash

source ~/.bashrc

Now launch Claude Code with the --model flag pointing to whatever model your llama.cpp server is hosting:

Bash

claude --model Qwen3.5-35B-Thinking

That's the basic CLI setup. The ANTHROPIC_AUTH_TOKEN and ANTHROPIC_API_KEY values are dummy strings — llama.cpp doesn't check authentication, but Claude Code expects these variables to exist before it'll start.

The dummy API key trick feels hacky, but it's the cleanest workaround available. Claude Code validates that these env vars are set, not that they contain real credentials.

Step 3: Set Up VS Code Integration

If you use VS Code with the Claude Code extension, you can configure the same redirect there — plus set up automatic model routing so different task types use different models.

VS Code showing Claude Code extension with local server configuration

Edit your VS Code user settings at $HOME/.config/Code/User/settings.json and add:

JSON

{
  "claudeCode.environmentVariables": [
    { "name": "ANTHROPIC_BASE_URL", "value": "http://localhost:8080" },
    { "name": "ANTHROPIC_AUTH_TOKEN", "value": "dummy" },
    { "name": "ANTHROPIC_API_KEY", "value": "sk-no-key-required" },
    { "name": "ANTHROPIC_MODEL", "value": "your-default-model" },
    { "name": "ANTHROPIC_DEFAULT_SONNET_MODEL", "value": "Qwen3.5-35B-Thinking-Coding" },
    { "name": "ANTHROPIC_DEFAULT_OPUS_MODEL", "value": "Qwen3.5-27B-Thinking-Coding" },
    { "name": "ANTHROPIC_DEFAULT_HAIKU_MODEL", "value": "gpt-oss-20b" },
    { "name": "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC", "value": "1" }
  ],
  "claudeCode.disableLoginPrompt": true
}

What Each Variable Does

Variable	Purpose
`ANTHROPIC_BASE_URL`	Points Claude Code to your local server instead of Anthropic's API
`ANTHROPIC_MODEL`	Default model for general tasks
`ANTHROPIC_DEFAULT_SONNET_MODEL`	Model used when Claude Code internally requests "Sonnet" (medium tasks)
`ANTHROPIC_DEFAULT_OPUS_MODEL`	Model used for "Opus" requests (complex reasoning)
`ANTHROPIC_DEFAULT_HAIKU_MODEL`	Model used for "Haiku" requests (quick, lightweight tasks)
`CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`	Stops background telemetry calls to Anthropic's servers

The model names must exactly match what you've configured in your llama.cpp server or llama-swap setup. One typo and you'll get cryptic 404 errors.

And don't forget to set claudeCode.disableLoginPrompt: true — without it, VS Code will nag you to sign in to Anthropic every time you launch.

Step 4: Optimize for Local Model Limitations

Here's where a lot of people hit a wall. Local models — even good 70B ones — don't have the same context window or reasoning depth as Claude Opus 4.6, which supports up to 1,000,000 tokens natively. (For more on llama.cpp performance, see our Krasis vs llama.cpp benchmark comparison.) Think of it like swapping a V8 engine for a four-cylinder: you'll get where you need to go, but you need to drive differently.

Based on community testing shared on r/LocalLLaMA, the original poster noted that CLI performance was underwhelming out of the box. The likely culprit? Context length mismatches.

Add these environment variables:

Bash

export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096

The first prevents Claude Code from trying to use Opus 4.6's full 1M context window with your local model, which likely supports far less. The second caps output generation, keeping local models from drifting into incoherent territory on long responses.

Choosing the Right Model for Each Slot

Think of the three model slots as a tiered system:

Haiku slot → Your smallest, fastest model (7B–20B). Quick autocomplete, simple explanations
Sonnet slot → Your workhorse (20B–35B). Most coding tasks live here
Opus slot → Your biggest model (35B–70B+). Complex reasoning and multi-file work

If you only have one GPU, just assign the same model to all three slots. No shame in that — it simplifies everything and avoids model-swapping latency.

Step 5: Test and Verify

Verify the setup works. Launch Claude Code:

Bash

claude --model your-model-name

Give it a simple task:

Write a Python function that checks if a string is a palindrome

Watch your llama.cpp server logs — you should see incoming requests being processed. If Claude Code hangs or throws errors, check these three things:

Is the server running? Hit curl http://localhost:8080/health and confirm
Do model names match exactly? The --model argument must match what llama.cpp is serving — case-sensitive
Are env vars loaded? Run echo $ANTHROPIC_BASE_URL to verify

For VS Code, open the Claude Code panel and try the same test. Check the Output panel (View → Output → Claude Code) for connection errors.

Common Pitfalls and How to Dodge Them

"Connection refused" errors: Your llama.cpp server is either not running or bound to 127.0.0.1. If you're connecting from another machine, make sure you started the server with --host 0.0.0.0.

Gibberish after long conversations: You've blown past the model's real context window. Lower CLAUDE_CODE_MAX_OUTPUT_TOKENS, keep conversations shorter, and restart Claude Code sessions frequently.

Claude Code tries to phone home: Set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1. Without this, Claude Code makes periodic requests to Anthropic's servers that will fail (since your dummy key isn't real) and slow things down.

Tool use breaks entirely: Claude Code relies heavily on tool calling — file reads, writes, bash commands. Not all local models support the tool-calling format Claude Code expects. As of April 2026, models with explicit tool-calling training (like Qwen coding variants and DeepSeek Coder models) tend to work best here. If you prefer Ollama over raw llama.cpp, the same approach works — see the FAQ below.

VS Code keeps asking you to log in: Double-check that "claudeCode.disableLoginPrompt": true is in your settings.json. Easy to miss, annoying to debug.

Is It Actually Worth the Effort?

Bar chart comparing Claude Opus 4.6 cloud vs typical 35B local model on coding benchmarks

On the plus side: zero API costs, full data privacy (nothing leaves your machine), and the ability to code completely offline. For routine tasks — generating boilerplate, writing tests, explaining unfamiliar code — a solid 35B coding model handles things pretty well.

But you're giving up real capability. As of April 2026, Claude Opus 4.6 ranks among the top performers on SWE-bench Verified, scoring around 75% on the independent leaderboard. Your local 35B model won't touch those numbers. Complex multi-file refactors, subtle bug hunts, and architectural decisions are where the gap becomes painfully obvious.

The sweet spot? Use local models for everyday coding grunt work, and keep your Anthropic API key ready for the tasks that genuinely need Opus-level reasoning.

So the real answer depends on your workflow. If you're cost-sensitive and mostly doing straightforward coding, this setup is a solid win. If you need top-tier reasoning on hard problems, you'll still want the cloud models for those moments. But having both options available — that's the real advantage of this setup.

Sources