Claude vs Local Models: Is the Frontier Worth It?
The Question
Yesterday we compared qwen-3.5-35b and gemma-4-26b on chicken coop observation tasks and found that neither local model dominates -- each has strengths the other lacks. But we never established a ceiling. How much better does a frontier model actually do on these same tasks?
We ran Claude Sonnet 4 through ChookBench with the same scenario, same tasks, same judge. Here's what happened.
The Scorecard
The scores are closer than you'd expect. Sonnet leads on scene description (80% vs qwen's 70%), ties on chicken identification (all three score 40%), and actually trails slightly on timeline building (90% vs 100% for both local models). It matches qwen on state updates at 100%.
If you just look at the numbers, you might wonder why anyone would pay for API calls.
Everyone Hallucinates Chickens
Here's the frame all three models were asked to analyse:
Frame 30s from clip 16-27 (4:27pm, March 28). Low-angle Pi camera, mostly grass. Can you spot a chicken?
The most striking result: every model scores exactly 40% on chicken identification. One chicken is visible in the frame. All three models confidently report four.
This isn't a model capability gap -- it's universal overconfidence. When asked "which chickens are visible?", models treat it as "which chickens might plausibly be here?" and fill in the roster. Sonnet's description of the scene is more precise than the local models', but when forced to commit to a list of individuals, it hallucinates just as enthusiastically.
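The post doesn't say how ChookBench scores identification, so treat this as a guess: one rule that happens to reproduce the 40% figure is set-based F1. Reporting four chickens when one is visible gives precision 0.25 and recall 1.0, which works out to 0.4. A minimal sketch under that assumption (the function name and all chicken names are placeholders, not benchmark data):

```python
def identification_f1(reported: set[str], visible: set[str]) -> float:
    """Set-based F1 between the chickens a model reports and those actually visible."""
    true_positives = len(reported & visible)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported)  # how much of the report is real
    recall = true_positives / len(visible)      # how much of the flock in frame was found
    return 2 * precision * recall / (precision + recall)


# Placeholder names: one chicken actually in frame, four reported by the model.
visible = {"Henrietta"}
reported = {"Henrietta", "Hen B", "Hen C", "Hen D"}

score = identification_f1(reported, visible)
print(f"{score:.0%}")  # 40%
```

If the benchmark does score this way, padding the roster costs little because recall stays perfect, and a metric that punishes false positives harder would spread the models out more.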
Single-frame identification might be fundamentally the wrong task. Movement across frames is probably the signal that separates "a brown shape in the background" from "that's definitely Henrietta."
The Real Gap: Reliability
The scores tell one story. The experience of getting those scores tells a completely different one.
Sonnet's scores came from a single run with default prompts. No resolution tuning, no max_tokens debugging, no blank-streak tolerance, no explicit instructions about how many frames to sample or whether to use write_file vs edit_file.
The local models needed all of that. Qwen required max_tokens=8192 (the default 4096 silently truncated its chain-of-thought reasoning), frame_max_height=512 for optimal detection, and explicit prompt guidance. Gemma needed all that plus tolerance for frequent tool-calling failures and JSON corruption.
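To make the tuning burden concrete, here is a rough sketch of what the per-model settings might look like side by side. The RunConfig class and its field names are hypothetical, not the harness's real config; only the values 8192 and 512 come from the text above, and gemma is assumed to reuse qwen's values because it "needed all that".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical per-model settings; None/False means 'left at the default'."""
    max_tokens: Optional[int] = None
    frame_max_height: Optional[int] = None
    explicit_prompt_guidance: bool = False
    tolerate_tool_failures: bool = False


CONFIGS = {
    "sonnet": RunConfig(),  # single run, default prompts, nothing tuned
    "qwen": RunConfig(max_tokens=8192, frame_max_height=512,
                      explicit_prompt_guidance=True),
    "gemma": RunConfig(max_tokens=8192, frame_max_height=512,
                       explicit_prompt_guidance=True,
                       tolerate_tool_failures=True),
}


def overrides(config: RunConfig) -> dict:
    """Return only the settings that differ from the untouched defaults."""
    baseline = RunConfig()
    return {
        f.name: getattr(config, f.name)
        for f in fields(config)
        if getattr(config, f.name) != getattr(baseline, f.name)
    }


for name, config in CONFIGS.items():
    print(name, overrides(config))
```

Each non-empty override is something a human had to discover by trial and error before the scores became comparable.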
The frontier advantage isn't that the model scores higher. It's that it scores well the first time you try, with no fiddling.
What This Means for the Project
The Coop Chronicle agent needs to run autonomously -- checking footage, updating state, writing journal entries -- without a human tuning parameters between runs. That reliability requirement changes the calculus entirely.
A model that scores 70% reliably is more useful than one that scores 100% sometimes and 0% other times (qwen's L2 timeline across three identical runs: 100%, 0%, 100%). For an autonomous agent, variance is the enemy, not average performance.
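A quick way to see the point is to compare the mean, spread, and worst case of the two score profiles. In this sketch the erratic runs are qwen's three L2 timeline scores quoted above; the steady 70% model is hypothetical.

```python
from statistics import mean, pstdev

erratic = [100, 0, 100]  # qwen's L2 timeline across three identical runs
steady = [70, 70, 70]    # hypothetical model that scores 70% every time

for label, runs in (("erratic", erratic), ("steady", steady)):
    print(f"{label}: mean={mean(runs):.1f} stdev={pstdev(runs):.1f} worst={min(runs)}")
```

The erratic profile averages about 66.7, so it loses even on the mean. For an unattended agent the worst case of 0 matters more still.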
That said, local models have real advantages:
- Cost: Zero marginal cost per run. The API costs add up when you're processing footage daily.
- Privacy: Footage never leaves the local network.
- Latency: No network round-trip, though local inference isn't exactly fast on consumer hardware.
- Availability: No rate limits, no outages, no API changes.
The practical answer might be a hybrid: use a local model for routine footage processing (where the occasional failure can be retried cheaply) and a frontier model for the complex tasks -- journal writing, cross-clip tracking, anomaly detection -- where reliability matters most.
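One possible shape for that hybrid, sketched with plain callables standing in for the models. Nothing here is the project's actual agent code: the task labels, the run_task function, and the "raises on failure" model interface are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical labels for the complex jobs listed above.
FRONTIER_TASKS = {"journal_writing", "cross_clip_tracking", "anomaly_detection"}

# Assumed interface: takes a prompt, returns text, raises on failure.
Model = Callable[[str], str]


def run_task(task: str, prompt: str, local: Model, frontier: Model,
             max_local_attempts: int = 3) -> str:
    """Send complex tasks to the frontier model; retry routine ones locally."""
    if task in FRONTIER_TASKS:
        return frontier(prompt)
    last_error = None
    for _ in range(max_local_attempts):
        try:
            return local(prompt)
        except Exception as exc:  # local retries are cheap, so just try again
            last_error = exc
    raise RuntimeError(
        f"local model failed {max_local_attempts} times on {task!r}"
    ) from last_error


# Stub models for demonstration: the local one fails on its first call only.
calls = {"local": 0}


def flaky_local(prompt: str) -> str:
    calls["local"] += 1
    if calls["local"] == 1:
        raise ValueError("corrupted JSON")
    return f"local: {prompt}"


def stub_frontier(prompt: str) -> str:
    return f"frontier: {prompt}"


routine = run_task("footage_summary", "clip 16-27", flaky_local, stub_frontier)
journal = run_task("journal_writing", "today's entry", flaky_local, stub_frontier)
print(routine)
print(journal)
```

Raising after the retry budget is exhausted, rather than silently escalating to the API, keeps costs predictable. Falling back to the frontier model at that point would be an equally reasonable design.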
Key Takeaways
- Frontier models don't solve every task. Chicken identification is 40% across the board. Some problems need better task design, not better models.
- The real cost of going local is engineering time, not the API bill you avoid. The hours spent debugging qwen's blank outputs, tuning gemma's tool calling, and finding optimal resolutions dwarf the cost of a few API calls.
- Reliability compounds. In a 10-step autonomous workflow, a model that succeeds 95% per step completes the full chain 60% of the time. A model at 80% per step completes it 11% of the time.
- Local models are getting close. A 10-percentage-point gap on scene description, with scores on the structured tasks that match or beat Sonnet's (100% vs 90% on timeline building, and qwen level at 100% on state updates) -- that's remarkably competitive for models running on consumer hardware.
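The compounding figures in the takeaways come from multiplying the per-step success rate by itself once per step. A small sketch, assuming the steps fail independently of one another (real agent failures may be correlated, so read the numbers as rough guides):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step ** steps


for p in (0.95, 0.80):
    print(f"{p:.0%} per step over 10 steps -> {chain_success(p, 10):.0%}")
```

Under that assumption, 0.95 to the tenth power is about 0.60 and 0.80 to the tenth power is about 0.11, matching the 60% and 11% quoted above.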
What's Next
The chicken identification problem is begging for a multi-frame approach -- show the model a sequence and let movement be the signal. We're also curious about the newer purpose-built vision models (qwen2.5-vl, gemma-4 vision variants) that might close the remaining gap on scene understanding.
And the hybrid agent architecture? That's the real experiment. More soon.
All run data is in the benchmark results.