April 11, 2026 · 9 min read

You Can Just Run the Experiment

ai agents experimentation multi-agent

A hunch

Here's a question. Do two AI agents comparing notes catch more chicken activity in a video clip than one agent working alone? I had a hunch, yeah, but I didn't actually know.

A few years back I'd have picked a side and argued about it online. This time I'm just running it.

The question, specifically

My chicken-coop agent scans camera footage and builds a timeline of what the flock got up to -- scratching, drinking, dust bathing, the occasional squabble. Call it an activity timeline. The task's fiddly because chickens move constantly and the events worth catching are brief.

Here was the hunch. One agent working alone will miss stuff. Give the same clips to two agents independently and they'll miss different stuff, so the union of their answers should be more complete. And if you let them actually talk to each other -- one drafts, the other critiques, the first revises -- maybe you get something better still.

Or maybe not. Maybe the second agent just agrees with everything the first one said and you've doubled your API bill for nothing. That's the whole reason I wanted to run it.

The gap between wondering and knowing

Fifteen years ago, running an experiment like this properly meant hiring annotators, writing a protocol, getting ethics review, and reporting the results six months later. At a guess, only a few thousand people on Earth could answer a question like this, and most of them were doing it for a PhD.

Now it takes an afternoon. I already had the agent harness. I already had the camera footage. I already had a benchmark task defined. Running all four configurations, the single-agent baseline plus three multi-agent setups, costs roughly the price of a coffee.

Nothing about my setup is special. If you've wired up an API key and written a prompt, you've got everything you need to do the same. The people who could run experiments like this used to number in the thousands. My rough guess is that it's now in the tens of millions, and most of them haven't realised it yet.

Four ways to solve the same problem

Here are the configurations I lined up. Same clips, same task, four different shapes.

One agent. The baseline. Feed the clips in, get a timeline out. Whatever it catches, it catches.

Two agents in isolation. Same clips, two separate runs, no communication between them. Take the union of their event lists. Two chances to spot each thing, at double the cost.

Two agents collaborating. Agent A drafts a timeline. Agent B reads it and tries to poke holes -- what did A miss, what did A get wrong. A reads the critique and produces a revised timeline.

Four agents voting. All four write their own timeline independently. A candidate event only makes it into the final answer if at least two agents spotted it. A noise filter.

Four shapes. One question. Same task for each.
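
Here's a minimal sketch of how the four shapes could be wired up, not the actual harness. Everything in it is an assumption for illustration: an event is a hypothetical `(label, time_bucket)` pair, a timeline is a set of those, and `draft_fn`, `critique_fn` and `revise_fn` stand in for whatever model calls you'd really make.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical event representation: (activity label, time-bucket index).
# Bucketing timestamps lets two agents "agree" on an event without
# reporting exactly the same second.
Event = tuple[str, int]
Timeline = set[Event]


def union_merge(timelines: Iterable[Timeline]) -> Timeline:
    """Two agents in isolation: keep every event that any run reported."""
    merged: Timeline = set()
    for timeline in timelines:
        merged |= timeline
    return merged


def vote_merge(timelines: Iterable[Timeline], min_votes: int = 2) -> Timeline:
    """Four agents voting: keep events reported by at least min_votes runs."""
    # Each timeline is a set, so one agent contributes at most one vote per event.
    votes = Counter(event for timeline in timelines for event in timeline)
    return {event for event, count in votes.items() if count >= min_votes}


def critique_and_revise(
    draft_fn: Callable, critique_fn: Callable, revise_fn: Callable, clips
) -> Timeline:
    """Two agents collaborating: A drafts, B critiques, A revises."""
    draft = draft_fn(clips)                   # agent A, first pass
    critique = critique_fn(clips, draft)      # agent B pokes holes
    return revise_fn(clips, draft, critique)  # agent A, second pass
```

The single-agent baseline is just `draft_fn(clips)` on its own. Sets keep the merge logic to a line or two each; a real harness would probably need fuzzier matching on timestamps than exact bucket equality.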

What I reckon will happen

I don't know the answers yet -- that's the whole point -- but here's where my money is, so I can check it against reality when the run finishes.

One versus two isolated. I reckon two isolated will beat one, and meaningfully. Chickens move too much for one agent to catch everything in a single pass, and the two agents' mistakes should be uncorrelated enough that the union helps. Call it a 15 to 25 percent lift in events caught, at twice the cost.
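
A quick sanity check on that number, under an assumption that's stronger than the hunch itself: suppose each agent catches any given event with the same probability and the two agents' misses are fully independent. Then the union misses an event only when both agents miss it. The function names below are made up for this sketch.

```python
def union_recall(single_recall: float, n_agents: int = 2) -> float:
    """Recall of the union of n independent runs, each with the same recall."""
    # Assumption: misses are independent, so all n agents miss an event
    # with probability (1 - single_recall) ** n_agents.
    miss_all = (1.0 - single_recall) ** n_agents
    return 1.0 - miss_all


def relative_lift(single_recall: float, n_agents: int = 2) -> float:
    """Relative gain in events caught versus a single agent."""
    return union_recall(single_recall, n_agents) / single_recall - 1.0


for p in (0.75, 0.80, 0.85):
    print(f"single recall {p:.2f} -> union {union_recall(p):.4f}, "
          f"lift {relative_lift(p):.1%}")
```

For two agents the lift works out to exactly `1 - p`, so a 15 to 25 percent lift corresponds to a single agent that already catches 75 to 85 percent of events. If the agents' misses are correlated, which seems likely on the hard clips, the real lift would come in below this.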

Two isolated versus two collaborating. Less sure on this one. Critique-then-revise should help, because the critic sees the draft as well as the clips, which is more context than an independent run gets. But collaborative setups also tend to converge: the critic gets anchored by the draft, the reviser gets anchored by the critique, and both sand off the disagreements that would have been useful. My guess is collaborating beats isolated on precision (the share of reported events that really happened) and ties or loses on recall (the share of real events that got caught).

Four voting. Voting is brutal on recall. Half the real events are probably spotted by only one of the four agents, and those get filtered out. I expect voting to land with the highest precision and the lowest recall of the four setups. Good for "summarise the day in five events". Bad for "tell me everything that happened".

Pre-run ranking. Two-isolated for recall, two-collaborating for overall quality, four-voting for precision, one-agent for cost. I'm prepared to be wrong on all of it. That's the fun part.
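
To check those predictions, each configuration's timeline has to be scored against something. A minimal sketch, assuming a hand-labelled reference timeline exists for each clip and that events can be compared by exact equality (both are assumptions for this example, and the events below are made-up toy data, not results):

```python
def precision_recall(predicted: set, reference: set) -> tuple[float, float]:
    """Score a predicted timeline against a reference timeline.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference events
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data only: hypothetical (label, time-bucket) events.
reference = {("drinking", 3), ("scratching", 7), ("dust bathing", 12), ("squabble", 20)}
predicted = {("drinking", 3), ("scratching", 7), ("squabble", 15)}

precision, recall = precision_recall(predicted, reference)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

In this toy case two of the three predicted events are real, and two of the four real events were caught. Running the same function over all four configurations is what would turn the ranking above into something that can be checked.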

Results

Coming soon. I'll run the four configurations once the current benchmark slot is free and update this section with the numbers. The predictions above stay exactly as they are, so we can see where I was right and where I wasn't.

Your turn

None of this needed a research lab. It needed a question I actually wanted answered, a benchmark task I already had lying around, and an afternoon.

What's your hunch? Does your coding agent do better with a code-reviewer sub-agent, or does the reviewer just slow it down? Does splitting a long document across five agents produce a better summary than feeding the whole thing to one? Does a second agent catch the hallucinations of the first?

You already know how to guess. So did I. The shift is that you don't have to stop at guessing anymore.

The follow-up post takes the same question to the longer l3-write-journal task -- more clips, more nuance, more room for coordination costs to bite. Different regime, possibly different answer. If the predictions on the simple task hold up, the real question is: when do they stop holding up?

One experiment at a time.