Frontier Bakeoff: We Benchmarked Fable 5 Hours Before the Shutdown

Fable 5 didn’t win.

I need to say that up front because the timing of this post is going to make it sound like a very different story. Yes, we benchmarked Claude Fable 5 on our homelab harness. Yes, the US government suspended it about three hours later. But the actual result? Fable 5 scored 89.3. Opus 4.8 scored 91.9. The model everyone’s eulogizing right now lost to a model you can still use today.

That’s the real story. The suspension is just what makes it weird.

What We Tested

This is Round 6 of our homelab bakeoff series — but with a twist. Rounds 1 through 5 tested quantized local models on an RTX 5090 via llama.cpp. This time we pointed the same task suite at four frontier cloud models:

Model	Provider	Key
Claude Opus 4.8	Anthropic	`opus48`
Claude Fable 5	Anthropic	`fable5`
Claude Sonnet 4.6	Anthropic	`sonnet46`
GPT-5.5	OpenAI	`gpt55`

Same 10 quality tasks. Same 3 speed tasks. Same scoring rubrics, same fixture files, same composite formula. The only things that changed were the transport layer (Anthropic/OpenAI SDKs instead of llama.cpp HTTP) and two bug fixes that made scoring more accurate. I’ll get into those.

The Results

Rank	Model	Coding	Reasoning	Tool Use	Speed	Total
1	Opus 4.8	84.8	90.0	100.0	100.0	91.9
2	Fable 5	86.7	93.3	100.0	79.9	89.3
3	Sonnet 4.6	75.2	93.3	100.0	78.6	84.5
4	GPT-5.5	86.7	66.7	100.0	60.1	80.0

A few things jump out.

Fable 5 was the best at the hard stuff. It scored highest on coding (86.7, tied with GPT-5.5) and highest on reasoning (93.3, tied with Sonnet 4.6). Its architecture analysis for Task 3.2 — designing a collaborative editor with CRDTs at scale — was the cleanest answer in the field. It opened by decomposing the 100ms latency budget across the full request path before even discussing algorithms. That’s the kind of structured thinking you want from a senior engineer, not a chatbot.

But speed killed it. Opus 4.8 was meaningfully faster on every speed benchmark, and speed is 20% of the weighted total. Fable 5’s TTFT hovered around 3.4–4.0 seconds per request, likely the cost of whatever reasoning depth Anthropic tuned into it. Opus came in consistently under that. The weighting does the rest: Opus’s 20.1-point edge in the speed category is worth about 4.0 points on the total, which more than covers the roughly 1.4 weighted points Fable 5 gained on coding and reasoning and leaves Opus 2.6 points ahead.

Tool use was a wash. Every model scored 5/5 on both tool-use tasks. At the frontier level, structured output and function calling are solved problems. This category no longer differentiates.

GPT-5.5: The Token Limit Trap

GPT-5.5 tied for the best coding score (86.7) and nailed Bayes’ theorem, database debugging, and both tool-use tasks. But its reasoning score is 66.7 — way behind the pack — and the reason is a single task failure.

On Task 3.2 (architecture analysis), GPT-5.5 hit the 4,096 completion token limit and returned a truncated response: `finish_reason: "length"`, empty captured content, 0/10 on all rubric items. It spent 85 seconds generating 4,096 tokens of thinking and never actually delivered an answer. The scoring harness captured nothing because there was nothing to capture.

Was the task too hard? No. Fable 5 scored 10/10 on the same prompt in roughly the same token budget. GPT-5.5 just allocated its budget differently (or the `max_tokens` cap was too low for its reasoning style). Either way, one truncated response cost it all 10 rubric points on that task. With reasoning weighted at 20%, even a perfect reasoning score would have lifted its total to about 86.7 at most: third place, ahead of Sonnet 4.6, instead of a distant fourth.

The lesson: benchmark harnesses that don’t account for provider-specific token limits will produce misleading results. I could have set max_tokens higher, but the point of a bakeoff is equal conditions. Every model got the same parameters.
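One way to keep equal conditions and still make truncation visible is to label each response before it reaches the rubric. This is a minimal sketch, not the harness's actual code: the `TaskResponse` record and the `classify` helper are hypothetical names, and it assumes each SDK response has already been normalized into a stop reason plus captured text.

```python
from dataclasses import dataclass

# Stop reasons meaning "output hit the token cap". "length" is the value quoted
# in this post; "max_tokens" is the equivalent stop_reason on Anthropic's API.
TRUNCATION_REASONS = frozenset({"length", "max_tokens"})


@dataclass
class TaskResponse:
    """Hypothetical normalized record; not the harness's actual type."""
    model_key: str
    task_id: str
    stop_reason: str
    content: str


def classify(resp: TaskResponse) -> str:
    """Return 'truncated', 'empty', or 'ok' so a zero score can be explained."""
    if resp.stop_reason in TRUNCATION_REASONS:
        return "truncated"
    if not resp.content.strip():
        return "empty"
    return "ok"


gpt = TaskResponse("gpt55", "3.2", "length", "")
print(classify(gpt))  # truncated
```

The score stays the same either way. The label only records why a task came back as 0/10, so a truncation is not mistaken for a wrong answer.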

The Sonnet Surprise

Sonnet 4.6 deserves attention. It matched Fable 5 on reasoning (93.3), ran at roughly the same speed, and costs about a third as much. Its coding score (75.2) is the only weak spot — it missed some feature-detection checks on the Express bug-fix task that the others caught.

For most production workloads, Sonnet 4.6 at 84.5 overall is probably the right choice. The 4.8-point gap to Fable 5 is almost entirely coding quality, and the price difference is substantial.

What Changed From Round 5

I adapted the Round 5 homelab harness into a standalone cloud benchmark. For full transparency, there’s a CHANGES.md documenting every delta, but here are the ones that affect scores:

Bayes fix (Task 3.3). Round 5 expected 41.67% as the correct answer. It’s actually 40.54%. The old harness had a rounding error in the denominator — P(E) = 0.0185, not 0.018. Every Round 5 model got this “wrong” because the rubric was wrong. Fixed. All four frontier models computed 40.54% correctly.
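The size of that rounding error is easy to check. The sketch below assumes a numerator P(E|H)·P(H) of 0.0075; that value is not stated anywhere in the task write-up and is inferred here only because it reproduces both quoted percentages.

```python
def posterior(numerator: float, evidence: float) -> float:
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return numerator / evidence


# ASSUMPTION: 0.0075 is inferred from the two percentages quoted in the post.
NUMERATOR = 0.0075

old_rubric = posterior(NUMERATOR, 0.018)   # denominator rounded too early
fixed = posterior(NUMERATOR, 0.0185)       # full-precision denominator

print(f"{old_rubric:.2%}")  # 41.67%
print(f"{fixed:.2%}")       # 40.54%
```

Dropping one digit from the denominator moves the answer by more than a full percentage point, which is enough to fail a rubric that checks the second decimal place.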

TypeScript tests wired up (Task 1.3). Round 5 couldn’t run the TypeScript functional tests because `npx tsx` wasn’t available on the homelab, so scores were capped at 60/100. This environment has `tsx`, so the full test suite runs. Fable 5 and GPT-5.5 passed all assertions, and Opus 4.8 matched their score of 100.

Speed methodology. Round 5 pulled `timings.predicted_per_second` from llama.cpp’s response body. Cloud APIs don’t expose that, so we measure wall-clock `output_tokens / elapsed_time` and streaming TTFT. The absolute numbers aren’t comparable to Round 5, but relative rankings between the four cloud models are valid.
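Roughly, the wall-clock measurement looks like the sketch below. It is an illustration rather than the harness code: `measure_stream` is a hypothetical helper, the stream is modeled as an iterable of per-chunk token counts, and the clock is injectable so the arithmetic can be checked without a network call.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[int],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    `chunks` yields the number of output tokens in each streamed chunk.
    """
    start = clock()
    ttft = None
    total_tokens = 0
    for n_tokens in chunks:
        # TTFT is the delay until the first chunk that carries any tokens.
        if ttft is None and n_tokens > 0:
            ttft = clock() - start
        total_tokens += n_tokens
    elapsed = clock() - start
    if ttft is None:  # nothing was ever streamed
        ttft = elapsed
    tok_per_s = total_tokens / elapsed if elapsed > 0 else 0.0
    return ttft, tok_per_s


# Fake clock: request sent at t=0.0, first tokens at t=2.0, stream ends at t=4.0.
ticks = iter([0.0, 2.0, 4.0])
ttft, tps = measure_stream([10, 30], clock=lambda: next(ticks))
print(ttft, tps)  # 2.0 10.0
```

Note that this version divides by total elapsed time, TTFT included, which matches `output_tokens / elapsed_time` as described. A model with a long thinking pause therefore pays for it twice: once in TTFT and again in tok/s.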

Everything else is identical. Same prompts, same fixtures, same scoring weights (Coding 40%, Reasoning 20%, Tool Use 20%, Speed 20%), same composite formula.
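The composite formula itself isn’t reproduced in this post, but a plain weighted sum of the four category scores matches every total in the results table. A minimal sketch under that assumption, using the model keys from the table above:

```python
# Category weights as stated in the post.
WEIGHTS = {"coding": 0.40, "reasoning": 0.20, "tool_use": 0.20, "speed": 0.20}

# Category scores copied from the results table.
RESULTS = {
    "opus48":   {"coding": 84.8, "reasoning": 90.0, "tool_use": 100.0, "speed": 100.0},
    "fable5":   {"coding": 86.7, "reasoning": 93.3, "tool_use": 100.0, "speed": 79.9},
    "sonnet46": {"coding": 75.2, "reasoning": 93.3, "tool_use": 100.0, "speed": 78.6},
    "gpt55":    {"coding": 86.7, "reasoning": 66.7, "tool_use": 100.0, "speed": 60.1},
}


def composite(scores: dict) -> float:
    """Weighted sum of category scores (ASSUMED form of the composite formula)."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)


for key, scores in RESULTS.items():
    print(key, round(composite(scores), 1))
# opus48 91.9
# fable5 89.3
# sonnet46 84.5
# gpt55 80.0
```

Because coding carries twice the weight of any other category, a one-point coding gain is worth as much as a two-point gain in speed.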

About That Shutdown

On June 12, 2026, at approximately 5:21 PM Eastern, the US government issued an export control directive targeting Anthropic’s most capable models. Anthropic disabled Fable 5 and Mythos 5 for all customers. No restoration timeline has been provided.

Our benchmark run completed around 2:00 PM Eastern — roughly three hours before the shutdown. I didn’t know it was coming. Nobody outside the government and Anthropic’s leadership did.

I’m not going to speculate about the policy. What I will say is that the benchmark data is real, the run completed cleanly, and the results are reproducible right up until the moment the model stopped existing. We have the full result JSONs, the harness code, and the fixture files. If Fable 5 comes back — or if it doesn’t — this is what it could do.

What I Actually Learned

The frontier is tighter than I expected. 11.9 points separate first from last. In Round 5, the gap between the best and worst local model was over 40 points. At the frontier, everyone can code, everyone can reason, everyone can use tools. The differentiation is in speed, price, and edge-case reliability.

Speed is a legitimate quality axis. I initially weighted speed at 20% because I thought it would be a tiebreaker. It ended up being the deciding factor. Opus 4.8 won this bakeoff on speed, not intelligence. Whether that’s the “right” ranking depends on your use case, but for human-in-the-loop coding — where you’re waiting on the model 50 times per session — I think speed matters more than most benchmarks acknowledge.

Benchmarks need bug fixes too. The Bayes theorem error in Round 5 went unnoticed for five rounds because every local model got it wrong anyway. It took a frontier model computing the right answer to surface the bug in my own scoring rubric. That’s humbling and also kind of the point of running these.

One truncated response can tank a ranking. GPT-5.5 went from a plausible third place to a distant fourth because of a single `finish_reason: "length"` on one task. Benchmark design that doesn’t account for this is fragile. I’m noting it but not adjusting the score: equal conditions means equal conditions.

By the Numbers

Task	Opus 4.8	Fable 5	Sonnet 4.6	GPT-5.5
Task 1.1 (Todo CLI)	100.0	100.0	80.0	100.0
Task 1.2 (Pagination API)	60.0	60.0	60.0	60.0
Task 1.3 (TS Config)	100.0	100.0	80.0	100.0
Task 3.1 (DB Debug)	10/10	8/10	10/10	10/10
Task 3.2 (Architecture)	8/10	10/10	10/10	0/10
Task 3.3 (Bayes)	5/5	5/5	5/5	5/5
Task 4.1 (Tool Use)	5/5	5/5	5/5	5/5
Task 4.2 (Tool Use)	5/5	5/5	5/5	5/5

Raw speed (composite tok/s score): Opus 95.9, Fable 76.6, Sonnet 75.4, GPT-5.5 57.6.
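These raw figures line up with the Speed column in the results table if each one is scaled so the fastest model gets 100. That scaling is an inference from the numbers, not something the harness documentation is quoted on here, so treat this as a sketch:

```python
# Raw composite tok/s scores as listed above.
RAW_SPEED = {"opus48": 95.9, "fable5": 76.6, "sonnet46": 75.4, "gpt55": 57.6}


def normalize_to_best(raw: dict) -> dict:
    """Scale scores so the fastest model gets 100 (ASSUMED normalization)."""
    best = max(raw.values())
    return {key: round(100.0 * value / best, 1) for key, value in raw.items()}


print(normalize_to_best(RAW_SPEED))
# {'opus48': 100.0, 'fable5': 79.9, 'sonnet46': 78.6, 'gpt55': 60.1}
```

One consequence of scaling to the best model in the field is that every speed score is relative: adding a faster fifth model would lower all four of these numbers without any of the models getting slower.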

All result data, the benchmark harness, and fixture files are in the benchmarks repo.

This is post 46 on Vibes Coder. The benchmark harness is open source. If Fable 5 comes back, I’ll run it again.