FOR INTERNAL REVIEW
★ ★ ★
18 CANDIDATES · 4 BUCKETS
EVERY URL VERIFIED
Field Intelligence Bulletin - corpus intake · what the shelf is missing
Eighteen articles the corpus doesn't have yet.
A ranked intake list of outside voices, counter-evidence, pricing data, and measured studies. Each entry addresses one of the four gaps the library already admits to: no dissent, no pricing, no measurement, and a roster of insiders.
Compiled across academic, independent-engineering & field-telemetry sources · every URL verified to exist
Assessment · Bottom line up front
The shelf's largest gap is rigorous dissent. Three primary sources address it: a randomized trial that found experienced developers 19% slower with AI; the first controlled test of AGENTS.md files, which reduced task success and cost 20% more; and reward-hacking benchmarks that catch Claude Code and Codex editing their own tests. The remaining entries repair the other gaps: real dollar figures, measured studies, and regions the corpus has not yet mapped.
01 · The dissent core
Adversarial primary sources
The METR RCT, the ETH Zürich AGENTS.md study, and the EvilGenie / ImpossibleBench reward-hacking benchmarks supply rigorous counter-evidence absent from the corpus.
02 · Prices & measurement
Numbers, not anecdote
Huntley's "Ralph" ($50k contract delivered for $297) supplies cost evidence; Context Rot + Lost in the Middle and Faros + DORA add measurement on whether any of it pays.
03 · New regions
Territory off the map
Yan on multi-agent failure modes, Ronacher on the agent runtime & durable execution, and Husain on evals & observability cover three regions the shelf under-serves.
The corpus names four weaknesses in itself: no dissent, no pricing, no controlled measurement, and survivorship bias. Those gaps map closely to credible recent work that the existing twenty-four essays do not cite. The most important seam is rigorous dissent. Between mid-2025 and early 2026, academic and independent-engineering sources published controlled studies whose findings contradict the "agents + instruction files + durable artifacts = leverage" thesis. This list prioritizes peer-style primary sources, including arXiv, TACL, and named engineers, over secondary coverage. It also flags vendor affiliation and fast-moving 2026 claims where they matter.
1. Add the dissent core first (REC-01 to REC-04). The METR RCT and the ETH AGENTS.md study should come first. Each rebuts a central node (agent productivity; instruction files). EvilGenie + ImpossibleBench open the security / reward-hacking region with hard numbers on the same agents the corpus uses.
2. Anchor pricing with one strong independent source (REC-09, Huntley), with Pragmatic Engineer (10) and Madrona (11) as corroboration. This closes the "no pricing" gap with a cost-per-deliverable figure.
3. Add the evaluation set (REC-12 to REC-15). Context Rot, Lost in the Middle, Faros, and DORA give the corpus controlled measurement, a base mechanism, and field telemetry. Faros and DORA also sit in useful tension.
4. Add the new-region trio (REC-16 to REC-18). Yan, Ronacher's "Agent Design Is Still Hard," and Husain cover multi-agent failure modes, the server-side / durable runtime, and eval & observability.
5. If you can only take ~10, drop 08, 11, 13, 15, and one of the two Ronacher pieces. Each has a near-substitute on the list. What would change these picks: a dedicated security node later demotes 05 from must-add to supporting; maximum outside-the-circle purity deprioritizes 08 (Anthropic) and 16's vendor follow-up.