Field Notes
Twice-weekly observations on building agents that survive contact with reality.
Thursdays · Sundays
Loop engineering is a 30-year-old loop with a new hashtag
A two-sentence post got 8.4 million views and the field spent the next week defining it. The definition is a control loop computer science named in 1995, and it quietly hands you back the one part that was always hard.
Anthropic just told you the harness is the product
Three Anthropic moves in April and May 2026, an independent essay from January, and a late-2025 preprint converge on the same point. The unit of evaluation is harness plus model plus task, not model.
Your eval rubric needs failure buckets, not just scores
Two agents can score identically on a benchmark and fail differently in production. The aggregate is not a rubric. It is what falls out of one.
The multi-agent papers nobody is citing
Three studies published since March 2025 put multi-agent failure rates between 41 and 86.7 percent and error amplification up to 17.2x over single-agent baselines. The vendor decks have not caught up.