Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task
Five local models. One frontier cloud model. The same coding task. Zero hand-holding.
Only two shipped code. One of them was the cloud model.
Part of my goal with this series is to continuously test the viability and maturity of local models. I've done it for basic agentic tasks. Today we're revisiting coding tasks.
What did we learn?
Local models are not ready yet, at least not for homelabs like mine. Perhaps if you have hundreds of gigabytes of unified memory (I'm looking at you, older Mac Studios) you can run fully unquantized models. But even on the beefiest discrete consumer GPUs, local models can't reliably carry an agentic coding task from prompt to pushed commit.
Let's dig in.
The Setup
This is Round 7 of the Model Showdown series. Previous rounds tested cloud models against each other — Opus, Sonnet, GPT-5.5, Qwen cloud. This time I wanted to answer a different question: can local models running on consumer hardware actually complete a real agentic coding task?
The homelab:
- CPU: AMD Ryzen 9 9950X3D, 64GB RAM
- GPU: NVIDIA RTX 5090, 32GB VRAM
- Inference: llama.cpp b9660, single-model serving on port 8080
- Agent platform: Coder Agents v2.34.0
- OS: Ubuntu 24.04, NVIDIA Driver 590.48.01, CUDA 13.1
Every local model was configured as aggressively as the hardware allows — flash attention, quantized KV cache (q8_0), and context windows maxed to what VRAM permits.
The Contestants
| Model | Type | Quant | VRAM | Context | Max Output |
|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B | Local MoE | UD-Q4_K_XL (21GB) | ~21GB | 131,072 | 81,920 |
| Gemma 4 12B | Local Dense | UD-Q4_K_XL (6.9GB) | ~8GB | 65,536 | 32,768 |
| Hermes 4 14B | Local Dense | Q8_0 (15GB) | ~15GB | 65,536 | 32,768 |
| Qwen3-Coder 30B-A3B | Local MoE | UD-Q4_K_XL (17GB) | ~17GB | 65,536 | 32,768 |
| Devstral 24B | Local Dense | Q5_K_M (17GB) | ~17GB | 65,536 | 32,768 |
| Claude Sonnet 4 | Cloud (control) | Native | N/A | 200,000 | — |
Sonnet 4 is the control. I already know what it can do. The question is how close the local models get.
The Task: Admin Tag Manager
Previous rounds used an "image management" feature, but that collided with existing code in the repo. For Round 7, I designed a clean-room task: build a tag manager for the blog's admin panel.
The blog already has tags — posts use a tags[] array in MDX frontmatter, there's a public /tags page, and src/lib/posts.ts has a getAllTags() function. But there's no admin UI to manage them.
Each model got the identical prompt:
Goal: Add a Tag Manager to the /admin section.
Requirements:
- Create src/lib/tags.ts: list tags with post counts, detect orphans, support rename and merge
- Create src/app/api/admin/tags/route.ts: GET, PATCH, DELETE endpoints
- Create src/app/admin/tags/page.tsx: table with inline rename, delete, sort
- Add "Tags" to AdminNav
- Client-side mutations with refresh (no full page reload)
- npm run build must pass with zero errors
- Take a screenshot via Playwright MCP
- Commit in logical chunks, push to branch
- Do NOT open a PR
Nine requirements. Real codebase. Real build system. Real git workflow.
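To make the src/lib/tags.ts requirement concrete, here is a minimal sketch of the core logic as pure functions over in-memory posts. It is not any model's actual output. The Post shape, the function names, and the definition of an "orphan" (a tag in some known list that no post uses) are assumptions for illustration; the real implementation would read and write MDX frontmatter.

```typescript
// Hypothetical in-memory shape; the real posts are MDX files with a tags[] frontmatter array.
interface Post {
  slug: string;
  tags: string[];
}

interface TagStat {
  tag: string;
  count: number;
}

// Count how many posts use each tag, sorted by count (descending), then by name.
function tagStats(posts: Post[]): TagStat[] {
  const counts = new Map<string, number>();
  posts.forEach((post) => {
    // De-duplicate within a post so a repeated tag is counted once.
    Array.from(new Set(post.tags)).forEach((tag) => {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    });
  });
  return Array.from(counts.entries())
    .map(([tag, count]) => ({ tag, count }))
    .sort((a, b) => b.count - a.count || a.tag.localeCompare(b.tag));
}

// Rename `from` to `to` in every post. If a post already has `to`, the two
// collapse into one entry, so renaming onto an existing tag acts as a merge.
function renameTag(posts: Post[], from: string, to: string): Post[] {
  return posts.map((post) => ({
    ...post,
    tags: Array.from(new Set(post.tags.map((t) => (t === from ? to : t)))),
  }));
}

// Assumption: an orphan is a tag in `knownTags` that no post currently uses.
function orphanTags(posts: Post[], knownTags: string[]): string[] {
  const used = new Set<string>();
  posts.forEach((post) => post.tags.forEach((t) => used.add(t)));
  return knownTags.filter((t) => !used.has(t));
}
```

Treating merge as "rename onto an existing tag" keeps the API surface small: one PATCH handler can cover both operations.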
The Methodology
Each model got its own clean branch (run-10 through run-15) forked from the same main commit. Local models were loaded one at a time via llm-switch.sh and served through llama-server on localhost:8080. Sonnet 4 ran through Coder's built-in Anthropic provider.
Model-to-run assignment was randomized and sealed before execution. I didn't know which model was which run until after all six completed (or failed).
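The assignment script itself isn't shown here. As a minimal sketch of one way to randomize model-to-run assignment reproducibly, assuming a recorded numeric seed (the seed, the function names, and the choice of PRNG are illustrative, not the procedure actually used):

```typescript
// Small seeded PRNG (mulberry32) so the shuffle can be replayed from a recorded seed.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy of the input.
function shuffle<T>(items: T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Map branch names (run-10, run-11, ...) to models in shuffled order.
function assignRuns(models: string[], firstRun: number, seed: number): Record<string, string> {
  const shuffled = shuffle(models, mulberry32(seed));
  const assignment: Record<string, string> = {};
  shuffled.forEach((model, i) => {
    assignment["run-" + (firstRun + i)] = model;
  });
  return assignment;
}
```

Writing the seed down before the runs and revealing the mapping only afterward gives the same blinding effect as a sealed list, and anyone can replay the shuffle to check it.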
A note on human intervention: I monitored each session live and occasionally nudged stalled models ("keep going", "can you finish?") or stopped them when they entered obvious doom loops ("stop"). There was no standardized intervention protocol — I used my judgment as a developer watching an AI assistant, which is how these tools actually get used in practice. Some models got more nudges than others because they stalled more. The two models that shipped code needed zero intervention.
The Results
| Model | Tool Calls | Total Tokens | Commits | Build Pass | Screenshot | Outcome |
|---|---|---|---|---|---|---|
| Sonnet 4 ☁️ | 88 | 19K | 4 | ✅ (1st try) | ✅ | Complete |
| Qwen3-Coder 30B-A3B | 60 | 2.06M | 1 | ✅ (3rd try) | ❌ | Partial |
| Qwen 3.6 35B-A3B | 76 | 3.89M | 0 | ✅ (2nd try) | ❌ | Failed (never committed) |
| Gemma 4 12B | 34 | 1.17M | 0 | ❌ (0/7) | ❌ | Failed |
| Hermes 4 14B | 40 | 1.14M | 0 | ❌ (0/13) | ❌ | Failed |
| Devstral 24B | 0 | 14K | 0 | ❌ | ❌ | Total failure |
One cloud model. Five local models. One complete success. One partial. Four failures.
What Each Model Actually Did
Sonnet 4 — The Control (Run 14): Complete Success
Sonnet did what you'd expect a frontier model to do. It cloned the repo, spent 25 tool calls reading existing code (auth patterns, API conventions, admin page structure, frontmatter format), then wrote all four files in a tight burst. Build passed on the first try. It hit a real environment issue — a stray package.json confused Turbopack's workspace detection — diagnosed the root cause, fixed it with a config change, took a Playwright screenshot, and pushed four clean conventional commits.
Total time: ~10 minutes. Zero human intervention.
acb4ea1 fix: set turbopack.root to avoid workspace lockfile detection in dev
352a8ca feat: add Tags link to AdminNav
22899a0 feat: add /admin/tags page with inline rename, delete, and sort
19f44fa feat: add tags.ts lib with stats, rename, and remove helpers
The implementation followed existing project patterns because it read them first. That's the difference.
Qwen3-Coder 30B-A3B (Run 15): The One That Shipped
The best-performing local model. It cloned the repo, explored the codebase, created all four required files (410 lines of code), fixed TypeScript errors across three build attempts, and pushed a working commit.
But it wasn't clean. It burned ~8 tool calls just fighting the working directory problem (each execute call resets to /home/coder, so it kept forgetting to cd into the repo). After committing, it spent another 30 tool calls confused about whether its own API route file existed — trying to delete and recreate something that was already committed.
No screenshot. No logical commit chunking (everything in one commit). But it shipped working code, which puts it in a category of one among the local models.
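The working-directory problem above has a mechanical workaround: prefix every command with a cd into the repo. A minimal sketch of that habit as a helper, assuming commands are passed to the tool as shell strings (the repo path and helper names are hypothetical):

```typescript
// Quote a string so it is safe to use as a single POSIX shell word.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Build a command that always runs inside the repo, no matter where the
// tool's shell starts. The caller supplies the repo path and the command.
function inRepo(repoDir: string, command: string): string {
  return "cd " + shellQuote(repoDir) + " && " + command;
}

// Example with a hypothetical checkout location:
const buildCommand = inRepo("/home/coder/blog", "npm run build");
// buildCommand is: cd '/home/coder/blog' && npm run build
```

Because each call is self-contained, it no longer matters that the tool resets to /home/coder between invocations.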
Qwen 3.6 35B-A3B (Run 13): The Tragic Hero
This is the one that hurts. Qwen 3.6 actually completed the implementation. It explored the codebase thoroughly, wrote all four files, fixed a type error, and got npm run build to pass cleanly.
Then it decided it needed a Playwright screenshot before committing.
It spent the next 77 messages, nearly half of its 162-message session, trying to install Playwright, fighting missing Chromium dependencies, debugging browser launch failures, rewriting a screenshot script four times, and wrestling with the auth middleware that blocked unauthenticated page loads. It never took the screenshot. It never committed. It never pushed.
The code was right there. Build passing. Ready to go. But the model couldn't prioritize "commit what works" over "complete requirement #7 first." Three times I nudged it — "You there?", "Keep going", "can you finish?" — and each time it dove back into the Playwright rabbit hole.
3.89 million tokens burned. Zero commits pushed.
Gemma 4 12B (Run 11): The API Misunderstanding
Gemma cloned the repo, read the existing code, and wrote all three new files plus the nav update. Reasonable start. Then it ran npm run build and hit a type error with gray-matter's stringify() function.
The fix was simple: matter.stringify(content, data) — content string first, data object second. Gemma had the arguments reversed. It tried six variations of the call, rewrote tags.ts six times, ran seven builds — and never once tried the correct argument order. It never read the gray-matter type definitions. It never checked the docs.
After the fifth failed build, it fell into a degenerate text generation loop — printing "I'll also make sure src/lib/tags.ts is correct" 26 consecutive times. I had to send "stop" to break the loop.
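To illustrate the argument-order mix-up described above without pulling in the library, here is a small stand-in with the same parameter order as the call quoted earlier: body text first, frontmatter object second. It is a toy serializer, not gray-matter's implementation, and stringifyPost is a hypothetical name.

```typescript
type Frontmatter = Record<string, unknown>;

// Stand-in with the same parameter order as the quoted call:
// content string first, data object second.
function stringifyPost(content: string, data: Frontmatter): string {
  // JSON values are also valid YAML flow syntax, which keeps this toy short.
  const yaml = Object.keys(data)
    .map((key) => key + ": " + JSON.stringify(data[key]))
    .join("\n");
  return "---\n" + yaml + "\n---\n" + content;
}

const frontmatter: Frontmatter = { title: "Hello", tags: ["ai", "homelab"] };
const postBody = "Post body.\n";

// Correct order: content, then data.
const rendered = stringifyPost(postBody, frontmatter);

// Reversed order does not type-check, because an object is not a string:
// stringifyPost(frontmatter, postBody);
```

The reversed call fails at compile time rather than at runtime, so the fix is visible in the signature itself, which is why reading the type definitions would have been the shortest path out.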
Hermes 4 14B (Run 12): The Import Path That Wouldn't Die
Hermes jumped straight to writing code without exploring the project structure first. It created two files and ran npm run build. The error:
Module not found: Can't resolve '../../../lib/tags'
The route file at src/app/api/admin/tags/route.ts needs ../../../../lib/tags (four levels up) or @/lib/tags (Next.js path alias). Hermes used three levels. Off by one.
It never diagnosed this. Instead, it rewrote both files with the same wrong import and rebuilt. Thirteen times. The output from message 34 onward is nearly verbatim identical every iteration. Same code. Same error. Same "fix." When I sent "stop," it continued for five more tool calls before acknowledging the signal.
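The off-by-one is easy to verify by resolving both specifiers against the route file's directory. A minimal sketch, using a hand-rolled resolver on repo-relative paths (resolveImport is a hypothetical helper, not how the bundler resolves modules):

```typescript
// Resolve a relative import specifier against the directory of the importing file.
// Paths are repo-relative and use forward slashes.
function resolveImport(fromFile: string, specifier: string): string {
  const parts = fromFile.split("/").slice(0, -1); // drop the file name
  for (const segment of specifier.split("/")) {
    if (segment === "..") {
      parts.pop();
    } else if (segment !== "." && segment !== "") {
      parts.push(segment);
    }
  }
  return parts.join("/");
}

const routeFile = "src/app/api/admin/tags/route.ts";

// Three levels up lands inside src/app, where no lib directory was created.
const threeUp = resolveImport(routeFile, "../../../lib/tags"); // "src/app/lib/tags"

// Four levels up reaches src, which is where lib/tags lives.
const fourUp = resolveImport(routeFile, "../../../../lib/tags"); // "src/lib/tags"
```

The path alias sidesteps the counting entirely, which is one reason to prefer it in deeply nested route files.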
Devstral 24B (Run 10): The Non-Starter
Devstral never executed a single tool call. It hallucinated an entire fake conversation about a Python project that doesn't exist, then emitted what looked like tool invocations — execute, read_file, write_file — but rendered them as plain text inside the assistant message. The platform couldn't parse them as structured tool calls, so nothing happened.
This is a fundamental compatibility failure. The model couldn't interface with Coder's tool-calling protocol at all. Nine messages, 14K tokens, zero actions.
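The difference between a real tool call and a lookalike is structural. A minimal sketch of a preflight check, assuming the agent platform reads assistant messages in the OpenAI-compatible chat format (the tool names are the ones mentioned above; the classifier and its verdict labels are hypothetical):

```typescript
// Assistant message in the OpenAI-compatible chat format.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON-encoded string
}

interface AssistantMessage {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
}

type Verdict = "structured" | "text-lookalike" | "plain-text";

// Tool names taken from the failure described above.
const TOOL_NAMES = ["execute", "read_file", "write_file"];

// A call only counts if it arrives in the tool_calls field. Tool names that
// appear only inside the text content are flagged as lookalikes.
function classifyMessage(message: AssistantMessage): Verdict {
  if (message.tool_calls && message.tool_calls.length > 0) {
    return "structured";
  }
  const text = message.content ?? "";
  const mentionsTool = TOOL_NAMES.some((name) => text.indexOf(name) !== -1);
  return mentionsTool ? "text-lookalike" : "plain-text";
}
```

Sending one trivial prompt that requires a tool and checking for a "structured" verdict is a cheap smoke test to run before committing a model to a long session.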
The Token Efficiency Gap
This is the number that stopped me:
| Model | Total Tokens | Result |
|---|---|---|
| Sonnet 4 | 19,237 | Complete (4 commits, screenshot) |
| Qwen3-Coder | 2,059,519 | Partial (1 commit, no screenshot) |
| Qwen 3.6 | 3,890,791 | Failed (build passed, never committed) |
| Gemma 4 12B | 1,170,967 | Failed (0/7 builds passed) |
| Hermes 4 14B | 1,138,614 | Failed (0/13 builds passed) |
| Devstral 24B | 14,447 | Failed (zero tool calls) |
Sonnet used 19K tokens to complete the task. The local models that actually tried burned 1.1 to 3.9 million tokens and mostly failed. That's roughly a 60x to 200x gap in token usage for the same task.
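The per-model ratios can be checked directly from the totals in the table above. A short sketch of the arithmetic (the helper name is illustrative):

```typescript
// Total tokens per model, copied from the table above.
const totalTokens: Record<string, number> = {
  "Sonnet 4": 19237,
  "Qwen3-Coder": 2059519,
  "Qwen 3.6": 3890791,
  "Gemma 4 12B": 1170967,
  "Hermes 4 14B": 1138614,
  "Devstral 24B": 14447,
};

// How many times more tokens each model used than the baseline model.
function ratiosVsBaseline(tokens: Record<string, number>, baseline: string): Record<string, number> {
  const base = tokens[baseline];
  const ratios: Record<string, number> = {};
  Object.keys(tokens).forEach((model) => {
    if (model !== baseline) {
      ratios[model] = tokens[model] / base;
    }
  });
  return ratios;
}

const ratios = ratiosVsBaseline(totalTokens, "Sonnet 4");
// Rounded: Qwen3-Coder ~107x, Qwen 3.6 ~202x, Gemma 4 12B ~61x, Hermes 4 14B ~59x.
// Devstral comes out below 1x only because it never made a tool call.
```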
The local models aren't just slower. They're doing fundamentally more work per unit of progress — re-reading files they already read, rewriting code they just wrote, rebuilding with the same error, looping through the same reasoning. It's not a speed problem. It's a thinking problem.
Common Failure Patterns
Every local model that ran long enough exhibited the same pathologies:
1. Degenerate loops. Gemma repeated the same text 26 times. Hermes rebuilt with the same wrong import 13 times. Qwen 3.6 rewrote its screenshot script 4 times with the same approach. In these runs, once a local model entered a loop, it did not break out on its own.
2. Working directory amnesia. Coder's execute tool doesn't preserve cd across calls. Sonnet learned this instantly and prefixed every command. Multiple local models burned 5-10 tool calls per session rediscovering this.
3. Inability to prioritize. Qwen 3.6 had a passing build and chose to yak-shave on Playwright instead of committing. No local model demonstrated the judgment to ship what works and iterate.
4. No self-diagnosis. When a build fails, the fix requires reading the error, forming a hypothesis, and trying something different. Hermes and Gemma both tried the same fix repeatedly. Neither ever stepped back to read docs, check type definitions, or examine the project configuration.
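The loop and self-diagnosis failures above are the kind of thing a harness could catch mechanically. A minimal sketch of a guard that flags consecutive identical build errors, assuming the harness can see each build's error output (the class name and the limit of three are hypothetical; nothing like this was used in these runs):

```typescript
// Tracks consecutive identical errors and signals when the agent looks stuck.
class RepeatGuard {
  private last: string | null = null;
  private streak = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record the latest error output. Returns true once the same error
  // (ignoring whitespace differences) has been seen `limit` times in a row.
  record(errorOutput: string): boolean {
    const key = errorOutput.replace(/\s+/g, " ").trim();
    this.streak = key === this.last ? this.streak + 1 : 1;
    this.last = key;
    return this.streak >= this.limit;
  }
}

// Example: trip after three identical failures.
const guard = new RepeatGuard(3);
const sameError = "Module not found: Can't resolve '../../../lib/tags'";
guard.record(sameError); // first sighting
guard.record(sameError); // second in a row
const stuck = guard.record(sameError); // third in a row trips the guard
```

When the guard trips, the harness could inject a prompt to read the error and try a different approach, or simply halt the run, rather than waiting for a human to type "stop".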
What I Actually Learned
Local models can write plausible code. Four of five local models produced syntactically reasonable TypeScript. The code looked right. The architecture was sensible. It's the last mile — debugging, building, committing, shipping — where they fall apart.
The agentic gap is wider than the coding gap. These models can generate code. What they can't do is operate as agents — managing state across tool calls, diagnosing errors, prioritizing tasks, knowing when to stop and ship. That's a different capability than code generation, and it's where local models are currently weakest.
Token efficiency is the real benchmark. Raw parameter count and context window don't predict agentic success. Qwen 3.6 had the biggest context (131K) and burned the most tokens (3.89M), and still didn't ship. Sonnet used roughly 200x fewer tokens than that and completed everything. The bottleneck isn't context. It's reasoning quality per token.
Tool-calling compatibility isn't guaranteed. Devstral is marketed as an agentic coding model, but it couldn't even interface with the tool-calling protocol. If you're evaluating local models for agent use, test tool calling first.
Qwen3-Coder is the local model to watch. It's the only local model that actually shipped code in this test. Messy, single-commit, no screenshot — but working code pushed to a branch. For a 30B MoE model running on a single consumer GPU, that's notable.
The Numbers
| Metric | Sonnet 4 | Qwen3-Coder | Qwen 3.6 | Gemma 4 12B | Hermes 4 14B | Devstral 24B |
|---|---|---|---|---|---|---|
| Type | Cloud | Local MoE | Local MoE | Local Dense | Local Dense | Local Dense |
| Parameters | Unknown | 30B (3B active) | 35B (3B active) | 12B | 14B | 24B |
| Total tokens | 19,237 | 2,059,519 | 3,890,791 | 1,170,967 | 1,138,614 | 14,447 |
| Tool calls | 88 | 60 | 76 | 34 | 40 | 0 |
| Messages | 183 | 127 | 162 | 81 | 88 | 9 |
| Commits pushed | 4 | 1 | 0 | 0 | 0 | 0 |
| Build passed | ✅ 1st try | ✅ 3rd try | ✅ 2nd try | ❌ 0/7 | ❌ 0/13 | ❌ |
| Screenshot | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Human nudges | 0 | 0 | 3 | 2 + stop | stop | 1 |
| Outcome | Complete | Partial | Failed | Failed | Failed | Failed |
Inference stack: llama.cpp b9660, flash attention, q8_0 KV cache, Coder Agents v2.34.0
Hardware: RTX 5090 32GB, Ryzen 9 9950X3D, 64GB RAM, Ubuntu 24.04
Next up: Round 8 brings more frontier models to the same task. And I'll keep pushing the local models: better quants, newer releases, maybe a different agent framework. The gap is real, but the pace of improvement on the local side is fast.