April 6, 2026 · 9 min read

Local Vision Showdown: Can Open-Source Models Watch Chickens?

chookbench local-llm vision experiments

The Experiment

The Coop Chronicle runs on AI -- an agent watches camera footage, identifies chickens, updates state files, and writes journal entries. But which local vision model should power it?

We ran qwen-3.5-35b and gemma-4-26b through ChookBench, our benchmark for chicken coop management tasks. Four tasks across two difficulty levels: simple frame description (L1) and multi-step observation with state updates (L2). Both models ran locally via LM Studio, evaluated by a GPT-4o-mini judge.

Head-to-Head: Task Scores

Neither model dominates. Qwen nails structured tasks (100% on state updates -- valid JSON, correct timestamps, memory observations) but struggles with free-form narration. Gemma produces beautiful timelines (100%) but cannot reliably write JSON state files -- its tool calling breaks down, sometimes emitting 350+ malformed calls in a single turn.

Both models score identically on chicken identification (40%) -- and for the same reason: they hallucinate all four chickens when only one is visible. Confidently wrong, equally.

The benchmark frame: a low-angle shot from inside the chicken run, mostly grass with the coop structure visible in the background.

The frame both models analysed. One chicken is (arguably) visible. Both reported four.

The max_tokens Breakthrough

The most important finding had nothing to do with model quality. At the default max_tokens=4096, qwen produced 20 turns of pure whitespace on L2 tasks. Zero tool calls. Zero text. It looked completely broken.

Bumping to max_tokens=8192 instantly fixed it. The cause: qwen's chain-of-thought reasoning in <think> tags was consuming the entire 4096-token budget before the model could emit a single tool call. The model was thinking hard -- just silently truncated before it could act.

This is invisible from the outside. The API returns a valid response with empty content. No error, no warning. You'd assume the model can't do the task, when really it just needs more room to think.

Does Resolution Matter?

We tested the same frame at three resolutions to see if downscaling helps the models spot chickens. The source frames are 1920x1080 -- a wide shot of the yard where chickens are relatively small subjects.

Counterintuitively, smaller is better for qwen. At 512p the base64 payload shrinks by 4x (880KB to 205KB) and accuracy increases from 60% to 70%. The chickens are small subjects in a wide frame -- downscaling concentrates the visual information into the model's native processing resolution rather than wasting it on background detail.

Gemma peaks at 768p but the gains are marginal. Its main limitation is tool use, not vision.

Key Takeaways

No single model wins. Qwen handles structured data; gemma handles narrative. A production agent might need both.
max_tokens=8192 is non-negotiable for local models doing agentic tasks. CoT reasoning silently consumes the default 4096 budget.
Downscale your frames. 512p (910x512) beats 1080p for small-subject detection and cuts payload by 4x.
Local models need explicit prompts. "Sample frames" is too vague. "Sample 5-8 frames spread evenly across the clip" works. Frontier models infer intent; local models need it spelled out.
Both models hallucinate chickens. When asked to identify individuals, both confidently report all four chickens present when only one is visible. The vision is there; the restraint isn't.

What's Next

The obvious next step is testing the purpose-built vision models -- qwen2.5-vl and the newer gemma-4 variants that are specifically designed for image understanding rather than having vision bolted on. We're also curious whether batching 2-3 frames per turn (showing the model a before/after sequence) helps with chicken identification, since movement is the strongest visual signal.

All experiment data is in the ChookBench runs if you want to poke around.