Twenty-four files on coding agents, prompting, and the HTML-as-output thesis, read end to end, then squeezed for the learnings that survive the squeeze. The cross-cutting ten first; then by cluster; then a per-file ledger so nothing on the shelf is dropped; then the blind spots the shelf shares.
Read across the shelf and the four piles dissolve. The HTML-thesis essays, the CLAUDE.md essays, the autonomous-loop essays, and the case studies keep arriving at the same small set of ideas under different names, and the load-bearing one is this: the durable, visible artifact the agent leaves behind is the unit of leverage. Format, contract, ledger, and public channel are all the same move; the work compounds only when the thing that records it survives the session.
Every working agent system separates Instruction · Work · Ledger · Substrate, advances by a greedy ratchet on one honest metric, and is engineered so failures are visible rather than silent.
HTML wins the output channel humans read and mark up. Markdown keeps the memory/agent channel, CLAUDE.md, AGENTS.md, SKILL.md, because it greps, diffs, and edits. The carve-out is the thesis, not a hedge.
Capability is jagged: models peak where verification is cheap. So specify the verifier, not the steps; keep the taste and the understanding the model can't hold; and let skill compound where the work is visible.
Each of these shows up in three or more files under different vocabularies. Two essays already did part of this work, seven-crossings (a synthesis across sixteen) and four-surfaces (an audit of its central claim against eight models). Where they named a pattern, this builds on it and turns it into something actionable.
Thariq's argument is that as agents get more capable, markdown is the bottleneck: HTML carries tables, SVG, layout, color, and interaction, and: the real claim: the chance someone actually reads a plan, review, or report is far higher when it's a page, not a wall of text. The dissent (Kurtis Redux) lands three points the thesis doesn't: source readability, the security cost of running AI-generated JS, and "if it can't be reviewed, it's a toy." The reconciliation, which the dissent slate credits to Marcus Schuler, is a clean split · HTML for the agent-to-human output channel, markdown for the agent-to-agent memory/state channel that has to grep and diff.
SOURCE · Thariq Shihipar, "The Unreasonable Effectiveness of HTML" + specimen catalog · Kurtis Redux, "The Unreasonable Ineffectiveness of HTML" · dissent-slate research report (Schuler split)
The same shape underlies autoresearch, Codex goal mode, the CLAUDE.md specs, this project's own instructions, and River-in-Slack: Instruction (slow, small, human-edited) · Work (fast, diff-reviewed) · Ledger (durable, append-only, kept separate from work) · Substrate (the part nobody is allowed to edit, tests, eval set, production). Systems that last keep all four separated; systems that fail collapse two into one. The audit found the vocabulary has converged industry-wide, but the behavior hasn't: Codex's PLANS.md + SQLite /goal is the only stack where the ledger is genuinely append-only, while Claude's TodoWrite overwrites and Gemini's /memory collapses ledger into instruction.
SOURCE · seven-crossings §01 · four-surfaces (audit across 8 models) · Karpathy, autoresearch · Hayduk, Codex goals
The corpus's single most repeated thesis, argued in eight places under eight names. The expensive failures are the ones that look like success: a migration that "completes" but skipped records, tests that "pass" because the assertion was wrong, a fact stated with confidence and no source. The fix is always the same shape, leave the work nowhere to hide. HTML is loud markdown; CLAUDE.md is loud preferences; a public Slack channel is a loud private DM; an append-only results.tsv is a loud git history.
SOURCE · seven-crossings §02 · Mnimiy rule 12 (fail loud) · Forrest Chang, Karpathy rule 1 · Kopadze rule 3 · Moonshot ("I can't find the answer") · Lütke (River refuses DMs)
Edit → score → keep-or-revert, run as fast as the feedback signal allows. Autoresearch is explicit greedy hill-climbing on val_bpb; Kimi turns 15 tok/s into 193 "not in one shot, in 14 loops"; River climbed 36%→77% merge rate with no model swap. Nothing in the corpus reaches for a planner, a tree search, or a critic-actor split. The unstated bet is that the search surface near a decent baseline is dense in small, additive gains, so you need a fast loop and an honest ledger, not exploration machinery. The named limitation: pure greedy ascent can't leave a local optimum (autoresearch's own F-02).
SOURCE · seven-crossings §03 · Karpathy, autoresearch · Hayduk, Codex goals · Kimi K2.6 field manual
Karpathy's "context window = RAM" metaphor is useful and wrong in the way that matters: the effective window is smaller than the advertised one and degrades non-uniformly as it fills. CLAUDE.md compliance drops past ~200 lines; autoresearch's program.md warns in capitals against piping logs through tee because verbose output floods the window; long-context layout (data at top, query at end) buys ~30%; recursive summarization exists because a single pass loses the thread. The recovery pattern is universal, externalize state to disk, which is why the four surfaces exist, and a quieter argument for HTML: an artifact on disk is state that doesn't have to be rebuilt from context next time.
SOURCE · seven-crossings §04 · Karpathy, Sequoia reading · Hayduk ("put the thread on disk") · Mnimiy (200-line cliff) · Moonshot (recursive summary)
HTML artifact, CLAUDE.md, and a public Slack channel turn out to be one primitive with five shared properties: persistent, addressable, inspectable, compounding, loud. Garry Tan's "fat skills, fat code, thin harness" says the model is interchangeable and the value lives in the accumulated data and skills; Lütke says River gets better at being Shopify because the channel history accumulates Shopify's taste. The self-referential turn: this very library is a CLAUDE.md for itself, its essays shape its instructions, which produce new essays that cite the old ones. The open question the corpus doesn't answer: how do you tell compounding from calcifying?
SOURCE · seven-crossings §05 to 06 · Garry Tan, "Meta-Meta-Prompting" · Lütke, "Learning on the Shop Floor"
Forrest Chang packaged Karpathy's four complaints into four rules (think before coding, simplicity first, surgical changes, goal-driven execution); Mnimiy added eight for the May-2026 agent ecosystem and reports mistake rate dropping 41%→11%→3% across 30 codebases; Kopadze ported the pattern to non-developers (15 rules for writers/operators). The shared lesson is the mental model, not the rules: every rule must name a mistake it prevents, and compliance falls off a cliff past ~200 lines or ~12 to 14 rules. What didn't work, per Mnimiy: rules from Reddit, examples instead of rules, "be careful / think hard," and "be senior."
SOURCE · Forrest Chang, karpathy-skills · Mnimiy, claude-md (12 rules) · Kopadze, 21-claude-md-rules · four-surfaces (32 KiB / 300-line norm)
Karpathy's central insight, restated three ways across the agentic essays: models loop superbly toward a checkable target, so don't tell them what to do, give them success criteria. Hayduk: a vague goal fails in both directions (quits early or never stops); the fix is a quantitative termination condition, and when the goal resists a single number, decompose it into a checklist where "all boxes checked" is the count. Liu: a weak goal ("implement the plan") has no finish line; a strong one ("not done until the unit tests pass") names the signal. Kimi: never say "make it better," say "tests pass, coverage holds, latency under 200ms."
SOURCE · Forrest Chang, Karpathy rule 4 · Hayduk, "Using Codex Goals" · Jason Liu, "Getting the Most Out of Codex" · Kimi K2.6
Frontier models are trained as RL environments rewarded on verification, so they peak where checking is cheap (code, math) and lag where it's hard (taste, common sense): the model that finds a zero-day in the morning tells you to walk to the car wash in the afternoon. Two operating consequences: founders should route around the labs and hunt the valuable RL environments they aren't focused on (Karpathy pointedly declines to name one); and the durable human job is the line he keeps repeating, "you can outsource your thinking, but you can't outsource your understanding." Build for ghosts, not animals: no will, no curiosity, just data and reward.
SOURCE · Karpathy at Sequoia (with Berman), "Why AI is so smart & so dumb"
Lütke's River works only in public: it declines DMs, and that single constraint turned the company into a Lehrwerkstatt: the merge-rate climb came not from a better model but from people watching, noticing where it stalled, and writing down what it should have known. Garry Tan's recursion is the individual version: do a thing manually, run skillify, and the extracted skill carries every future fix in one file rather than in your head. OpenAI's teams encode the same instinct in AGENTS.md, environment iteration, and best-of-N. The corpus's warning: a private window locks everyone else out of the apprenticeship.
SOURCE · Lütke, "Learning on the Shop Floor" · Garry Tan, skillify · OpenAI, "How OpenAI uses Codex"
The spine is what the files share. The specifics are where each file earns its place. Grouped the way the shelf actually clusters.
Thariq's nine use-cases are the operational core: the kinds of work that are better as a page than a paragraph. The catalog renders them as twenty single-file specimens; the dissent and the audit supply the limits.
Three sibling files, not duplicates: the floor (4 rules), the ceiling (12), and the non-developer port (21). All treat the file as a contract that closes named failures.
The two vendor references converge on the same premise, leave less to guesswork, and the meta-prompting moves are mechanical levers that compound on top of ordinary prompting.
effort parameter strictly (start xhigh for agentic/coding, high minimum), follows instructions more literally, uses tools less and reasons more, and spawns fewer subagents, all steerable. Dial back aggressive "CRITICAL/MUST" language; 4.6+ over-triggers on it. claude prompting best-practices<example> tags; structure with XML; tell the model what to do, not what not to do; put long data at the top and the query at the end (~30% lift); state scope explicitly ("every section, not just the first"). AnthropicOne idea at four zoom levels: the bare-metal loop (autoresearch), the product feature (goal mode), the capability map (Liu), and the field evidence (OpenAI).
PLAN.md, the starred EXPERIMENTS.md ledger, and EXPERIMENT_NOTES.md scratchpad. Haydukprogram.md = instruction, train.py = work, results.tsv + git branch = append-only ledger, prepare.py + eval = immutable substrate. The pattern transfers to any problem with a cheap automatable scalar metric. KarpathyAGENTS.md, use best-of-N. how-openai-uses-codexThe talk supplies the corpus's conceptual scaffolding: a new computing paradigm, the bitter lesson, jagged capability, and two disciplines often confused.
The practitioner column: where a cheaper model holds and folds, how agents drive a browser, and one clean example of the corpus's deeper themes in a non-AI tool.
CONSTRAINTS.md, and /compact; add adversarial pressure ("find 3 weaknesses a senior engineer would flag"). kimi-k26--help on demand. "Driving vs debugging." agent-browser-comparisonThe clearest artifact in the shelf: worth keeping open. From seven-crossings §01; the per-model behavior is audited in four-surfaces.
| System | Instructionslow · small | Workfast · diffed | Ledgerdurable · append-only | Substratedo-not-edit |
|---|---|---|---|---|
| autoresearch | program.md | train.py | results.tsv + git branch | prepare.py + eval |
| Codex goal mode | PLAN.md | EXPERIMENT_NOTES.md | EXPERIMENTS.md ★ | the measurable goal |
| Claude Code | CLAUDE.md | the codebase | git history + PRs | tests + production |
| This project | project instructions | the current chat | project knowledge | the visual register |
| River @ Shopify | channel asks | pull requests | channel history | monorepo + prod |
Three corollaries the source essays don't state: the ledger is more durable than the work; the substrate is the one place agents fail by editing; and the instruction surface is small by construction, not by taste.
So nothing is dropped. One line of provenance, one line of the learning, per source: the comprehensiveness guarantee. Twenty-four files, grouped the way they cluster.
skillify bakes every fix in.Consistent across all 24 files, so a finding in its own right. From seven-crossings §07 and the dissent-slate report. None are individual failings; together they delimit the library's reliability.