Learnings from the whole shelf · May 2026
Corpus
24 source files
Method
Full read, then synthesize
Type
Distillation
Doc
FAH-LEARN-001
Status
v1 · drawn from sources
CROSS-CORPUS NOTES · LEARNINGS LEDGER

What the library already knows.

Twenty-four files on coding agents, prompting, and the HTML-as-output thesis, read end to end, then squeezed for the learnings that survive the squeeze. The cross-cutting ten first; then by cluster; then a per-file ledger so nothing on the shelf is dropped; then the blind spots the shelf shares.

The one line

Read across the shelf and the four piles dissolve. The HTML-thesis essays, the CLAUDE.md essays, the autonomous-loop essays, and the case studies keep arriving at the same small set of ideas under different names, and the load-bearing one is this: the durable, visible artifact the agent leaves behind is the unit of leverage. Format, contract, ledger, and public channel are all the same move; the work compounds only when the thing that records it survives the session.

01 · THE SPINE

Four surfaces, one loop, loud failure.

Every working agent system separates Instruction · Work · Ledger · Substrate, advances by a greedy ratchet on one honest metric, and is engineered so failures are visible rather than silent.

02 · THE FORMAT

HTML for the human; markdown for the machine.

HTML wins the output channel humans read and mark up. Markdown keeps the memory/agent channel, CLAUDE.md, AGENTS.md, SKILL.md, because it greps, diffs, and edits. The carve-out is the thesis, not a hedge.

03 · THE JOB

Outsource thinking, keep understanding.

Capability is jagged: models peak where verification is cheap. So specify the verifier, not the steps; keep the taste and the understanding the model can't hold; and let skill compound where the work is visible.

§ 01
Spine

Ten learnings that recur across the shelf

Each of these shows up in three or more files under different vocabularies. Two essays already did part of this work, seven-crossings (a synthesis across sixteen) and four-surfaces (an audit of its central claim against eight models). Where they named a pattern, this builds on it and turns it into something actionable.

L1format

HTML beats markdown for human-facing output; markdown still wins the memory layer.

Thariq's argument is that as agents get more capable, markdown is the bottleneck: HTML carries tables, SVG, layout, color, and interaction, and: the real claim: the chance someone actually reads a plan, review, or report is far higher when it's a page, not a wall of text. The dissent (Kurtis Redux) lands three points the thesis doesn't: source readability, the security cost of running AI-generated JS, and "if it can't be reviewed, it's a toy." The reconciliation, which the dissent slate credits to Marcus Schuler, is a clean split · HTML for the agent-to-human output channel, markdown for the agent-to-agent memory/state channel that has to grep and diff.

SOURCE · Thariq Shihipar, "The Unreasonable Effectiveness of HTML" + specimen catalog · Kurtis Redux, "The Unreasonable Ineffectiveness of HTML" · dissent-slate research report (Schuler split)

TakeDefault to HTML for anything anyone reviews or shares. Keep CLAUDE.md / AGENTS.md / SKILL.md in markdown. Reach for markdown only when a reply is genuinely three sentences.
L2architecture

Every durable agent system is a four-surface machine.

The same shape underlies autoresearch, Codex goal mode, the CLAUDE.md specs, this project's own instructions, and River-in-Slack: Instruction (slow, small, human-edited) · Work (fast, diff-reviewed) · Ledger (durable, append-only, kept separate from work) · Substrate (the part nobody is allowed to edit, tests, eval set, production). Systems that last keep all four separated; systems that fail collapse two into one. The audit found the vocabulary has converged industry-wide, but the behavior hasn't: Codex's PLANS.md + SQLite /goal is the only stack where the ledger is genuinely append-only, while Claude's TodoWrite overwrites and Gemini's /memory collapses ledger into instruction.

SOURCE · seven-crossings §01 · four-surfaces (audit across 8 models) · Karpathy, autoresearch · Hayduk, Codex goals

TakeWhen you design a loop, name the four surfaces out loud and make the ledger a separate file on disk. If two surfaces share a file, that's the bug.
L3safety

Make failure loud: the visible artifact is the safety mechanism.

The corpus's single most repeated thesis, argued in eight places under eight names. The expensive failures are the ones that look like success: a migration that "completes" but skipped records, tests that "pass" because the assertion was wrong, a fact stated with confidence and no source. The fix is always the same shape, leave the work nowhere to hide. HTML is loud markdown; CLAUDE.md is loud preferences; a public Slack channel is a loud private DM; an append-only results.tsv is a loud git history.

SOURCE · seven-crossings §02 · Mnimiy rule 12 (fail loud) · Forrest Chang, Karpathy rule 1 · Kopadze rule 3 · Moonshot ("I can't find the answer") · Lütke (River refuses DMs)

TakeBuild "say so when uncertain / surface what was skipped" into the contract, and prefer the format and the channel where a defect is visible at a glance.
L4progress

The unit of progress is a greedy ratchet on one honest metric, not search.

Edit → score → keep-or-revert, run as fast as the feedback signal allows. Autoresearch is explicit greedy hill-climbing on val_bpb; Kimi turns 15 tok/s into 193 "not in one shot, in 14 loops"; River climbed 36%→77% merge rate with no model swap. Nothing in the corpus reaches for a planner, a tree search, or a critic-actor split. The unstated bet is that the search surface near a decent baseline is dense in small, additive gains, so you need a fast loop and an honest ledger, not exploration machinery. The named limitation: pure greedy ascent can't leave a local optimum (autoresearch's own F-02).

SOURCE · seven-crossings §03 · Karpathy, autoresearch · Hayduk, Codex goals · Kimi K2.6 field manual

TakeDon't over-architect the loop. A tight edit/score/keep cycle with every attempt logged beats a clever planner, until you suspect the gain needs leaving a local optimum, at which point the whole approach is undertooled.
L5memory

Treat the context window as fragile working memory, not RAM.

Karpathy's "context window = RAM" metaphor is useful and wrong in the way that matters: the effective window is smaller than the advertised one and degrades non-uniformly as it fills. CLAUDE.md compliance drops past ~200 lines; autoresearch's program.md warns in capitals against piping logs through tee because verbose output floods the window; long-context layout (data at top, query at end) buys ~30%; recursive summarization exists because a single pass loses the thread. The recovery pattern is universal, externalize state to disk, which is why the four surfaces exist, and a quieter argument for HTML: an artifact on disk is state that doesn't have to be rebuilt from context next time.

SOURCE · seven-crossings §04 · Karpathy, Sequoia reading · Hayduk ("put the thread on disk") · Mnimiy (200-line cliff) · Moonshot (recursive summary)

TakeFor anything multi-hour, write plan/ledger/notes to files and re-read them. Keep instruction files small enough to fit one attention budget. Don't flood the window with logs.
L6leverage

The durable artifact is the unit of leverage; the output is next session's input.

HTML artifact, CLAUDE.md, and a public Slack channel turn out to be one primitive with five shared properties: persistent, addressable, inspectable, compounding, loud. Garry Tan's "fat skills, fat code, thin harness" says the model is interchangeable and the value lives in the accumulated data and skills; Lütke says River gets better at being Shopify because the channel history accumulates Shopify's taste. The self-referential turn: this very library is a CLAUDE.md for itself, its essays shape its instructions, which produce new essays that cite the old ones. The open question the corpus doesn't answer: how do you tell compounding from calcifying?

SOURCE · seven-crossings §05 to 06 · Garry Tan, "Meta-Meta-Prompting" · Lütke, "Learning on the Shop Floor"

TakeTreat every good output as a candidate input. Skillify the pattern, bake the fix into the file, and review periodically for whether the loop is still earning its keep.
L7contract

A CLAUDE.md is a behavioral contract that closes observed failure modes: keep it small, tuned to your mistakes.

Forrest Chang packaged Karpathy's four complaints into four rules (think before coding, simplicity first, surgical changes, goal-driven execution); Mnimiy added eight for the May-2026 agent ecosystem and reports mistake rate dropping 41%→11%→3% across 30 codebases; Kopadze ported the pattern to non-developers (15 rules for writers/operators). The shared lesson is the mental model, not the rules: every rule must name a mistake it prevents, and compliance falls off a cliff past ~200 lines or ~12 to 14 rules. What didn't work, per Mnimiy: rules from Reddit, examples instead of rules, "be careful / think hard," and "be senior."

SOURCE · Forrest Chang, karpathy-skills · Mnimiy, claude-md (12 rules) · Kopadze, 21-claude-md-rules · four-surfaces (32 KiB / 300-line norm)

TakeA 6-rule CLAUDE.md tuned to mistakes you've actually watched happen beats a 21-rule one with fifteen you'll never need. Add a rule only when you've seen the failure.
L8verification

Specify the verifier, not the steps; the loop is only as good as its termination check.

Karpathy's central insight, restated three ways across the agentic essays: models loop superbly toward a checkable target, so don't tell them what to do, give them success criteria. Hayduk: a vague goal fails in both directions (quits early or never stops); the fix is a quantitative termination condition, and when the goal resists a single number, decompose it into a checklist where "all boxes checked" is the count. Liu: a weak goal ("implement the plan") has no finish line; a strong one ("not done until the unit tests pass") names the signal. Kimi: never say "make it better," say "tests pass, coverage holds, latency under 200ms."

SOURCE · Forrest Chang, Karpathy rule 4 · Hayduk, "Using Codex Goals" · Jason Liu, "Getting the Most Out of Codex" · Kimi K2.6

TakeBefore you launch a loop, write the pass/fail check the agent itself can run. No verifier, no goal, just a wish.
L9jaggedness

Capability is jagged; outsource thinking, keep understanding.

Frontier models are trained as RL environments rewarded on verification, so they peak where checking is cheap (code, math) and lag where it's hard (taste, common sense): the model that finds a zero-day in the morning tells you to walk to the car wash in the afternoon. Two operating consequences: founders should route around the labs and hunt the valuable RL environments they aren't focused on (Karpathy pointedly declines to name one); and the durable human job is the line he keeps repeating, "you can outsource your thinking, but you can't outsource your understanding." Build for ghosts, not animals: no will, no curiosity, just data and reward.

SOURCE · Karpathy at Sequoia (with Berman), "Why AI is so smart & so dumb"

TakeAfter every agent session, write down what you learned; if you can't reproduce its reasoning, you outsourced understanding too. Invest in evals and taste: the things models are worst at.
L10apprenticeship

Skill compounds where the work is visible; the meta-move is skillify-and-bake-the-fix.

Lütke's River works only in public: it declines DMs, and that single constraint turned the company into a Lehrwerkstatt: the merge-rate climb came not from a better model but from people watching, noticing where it stalled, and writing down what it should have known. Garry Tan's recursion is the individual version: do a thing manually, run skillify, and the extracted skill carries every future fix in one file rather than in your head. OpenAI's teams encode the same instinct in AGENTS.md, environment iteration, and best-of-N. The corpus's warning: a private window locks everyone else out of the apprenticeship.

SOURCE · Lütke, "Learning on the Shop Floor" · Garry Tan, skillify · OpenAI, "How OpenAI uses Codex"

TakeWork with agents in the open where you can, and turn every repeated win into a skill/contract entry so the fix outlives the session and spreads.
§ 02
Clusters

The same learnings, up close, by pile

The spine is what the files share. The specifics are where each file earns its place. Grouped the way the shelf actually clusters.

HT

The HTML-as-output thesis

Thariq's nine use-cases are the operational core: the kinds of work that are better as a page than a paragraph. The catalog renders them as twenty single-file specimens; the dissent and the audit supply the limits.

  • Nine surfaces where HTML wins: exploration & planning, code review, design, prototyping, diagrams (SVG), decks, research/learning, reports, and custom editors. html-effectiveness + catalog
  • The editor pattern: for one-off data work, ask for a throwaway single-purpose HTML editor and always end with an export button that turns the UI back into something you can paste into the agent. Thariq
  • The honest costs: HTML is 2 to 4× slower to generate, uses more tokens, and: the real downside, produces noisy, hard-to-review diffs. The bet is that higher readability and reuse outweigh it under large context windows. Thariq, FAQ
  • Don't over-skill it: just ask for "an HTML file" and learn the use-cases by hand; a skill that mechanically converts every prompt to HTML is worse than none. Thariq + format-as-html-instructions
  • The three angles the thesis ducks: source readability, XSS risk in AI-generated JS ("reading text has become running code"), and enterprise reviewability, "if it can't be reviewed, it's a toy." Kurtis Redux
  • The state of the debate: as of mid-May 2026 the dissent is genuinely thin: Willison reconsidered toward the thesis; the heavyweight rebuttal hasn't been written. A coherent tribe currently winning the public argument. dissent-slate report
CM

Behavioral contracts: the CLAUDE.md lineage

Three sibling files, not duplicates: the floor (4 rules), the ceiling (12), and the non-developer port (21). All treat the file as a contract that closes named failures.

  • The Karpathy 4 (the floor): ask don't assume · simplest solution first · don't touch unrelated code · flag uncertainty. Each closes a January-2026 failure mode; biased toward caution over speed, overkill for trivial work. Forrest Chang
  • Mnimiy's +8 (the ceiling): use the model only for judgment calls (not retries/routing) · token budgets aren't advisory · surface conflicts don't average them · read before you write · tests verify intent not behavior · checkpoint each step · match conventions even if you disagree · fail loud. claude-md
  • Kopadze's 15 non-dev rules: kill the warmup filler · show options before acting · be honest when you don't know · match length to complexity · ask before big changes · stay in scope · give receipts of what changed · never act on your behalf without asking · plus MEMORY.md, handoff notes, ERRORS.md, and an invariants list. 21-claude-md-rules
  • Provenance caution: the "65%→94% accuracy" and "82k-star" figures in Kopadze's source don't reconcile with the project's other essays, paste the rules, not the headline numbers. flagged in 21-claude-md-rules
PR

Prompting craft: Anthropic, Moonshot, and the meta-moves

The two vendor references converge on the same premise, leave less to guesswork, and the meta-prompting moves are mechanical levers that compound on top of ordinary prompting.

  • Opus 4.7 specifics: it respects the effort parameter strictly (start xhigh for agentic/coding, high minimum), follows instructions more literally, uses tools less and reasons more, and spawns fewer subagents, all steerable. Dial back aggressive "CRITICAL/MUST" language; 4.6+ over-triggers on it. claude prompting best-practices
  • The durable craft: be clear and direct; add the motivation behind an instruction; use 3 to 5 examples in <example> tags; structure with XML; tell the model what to do, not what not to do; put long data at the top and the query at the end (~30% lift); state scope explicitly ("every section, not just the first"). Anthropic
  • The design default: Opus 4.7 has a persistent house style (cream + serif + terracotta). Break it with a concrete spec or by asking for 4 options first, generic "make it clean" just swaps one fixed palette for another. Anthropic
  • Moonshot's nine moves: write clearly (details · role · delimiters · steps · few-shot · length) · provide reference text with an explicit "I can't find the answer" fallback · decompose (route · summarize old turns · recurse over long docs, forwarding earlier summaries). kimi-prompting-best-practices
  • The four meta-moves (+1): Prompt Reversal (iterate, then have the model write the one-shot prompt: keep that) · 5-min Amplifier (one strong source → many derived formats) · Red Team (flip to the recipient's specific persona and ask why it fails) · Blueprint Scaffolding (outline structure, cut, then build) · plus Reverse Delegation (brief the model on you, get a delegation map). four-moves
CX

The agentic loop: the Codex trilogy + autoresearch

One idea at four zoom levels: the bare-metal loop (autoresearch), the product feature (goal mode), the capability map (Liu), and the field evidence (OpenAI).

  • Goal mode, three rules: a quantitative goal the loop can read for termination · a tight feedback loop (compress score latency with a smaller model, subsampled data, cached deps) · three markdown files · PLAN.md, the starred EXPERIMENTS.md ledger, and EXPERIMENT_NOTES.md scratchpad. Hayduk
  • The control spectrum: from in-the-loop (steering = interrupt now; queuing = line up next) to away-from-desk (automations = wake on schedule; goals = push to a finish line). All ride on durable, pinned threads and a written-down memory vault. Jason Liu
  • autoresearch, the reference implementation: 3 files, 1 metric, no framework. program.md = instruction, train.py = work, results.tsv + git branch = append-only ledger, prepare.py + eval = immutable substrate. The pattern transfers to any problem with a cheap automatable scalar metric. Karpathy
  • How a shipping org uses it: 7 use-cases (understanding, refactor/migration, perf, tests, velocity, flow, exploration); habits, start in Ask mode then Code mode, iterate on the environment, prompt like a GitHub issue, queue as a backlog, keep an AGENTS.md, use best-of-N. how-openai-uses-codex
KP

The Karpathy frame: software 3.0

The talk supplies the corpus's conceptual scaffolding: a new computing paradigm, the bitter lesson, jagged capability, and two disciplines often confused.

  • Software 3.0: 1.0 was code, 2.0 was weights, 3.0 is prompts: the LLM is the interpreter and the context window is the lever. The December inflection was real; if your data on agentic coding is older, it's stale. Karpathy reading
  • The bitter lesson, operationally: don't stop at "use AI for one piece", where you have a pipeline of hand-glued steps, ask whether one end-to-end model can replace the whole thing. Karpathy / Sutton
  • Two disciplines: vibe coding raises the floor (anyone can build); agentic engineering raises the ceiling (professionals ship at the same bar, far faster). Confusing them costs both ways, don't ship vibe-coded vulnerabilities. Karpathy
  • Agent-native infrastructure: rewrite for agents, not screens, structured docs, callable actions, delegated auth, install-by-paragraph not 600 lines of bash. "If your product can be driven from the command line by a competent engineer, it can be driven by an agent next week." Karpathy
TL

Tooling & field reports: cost, browsers, SQL

The practitioner column: where a cheaper model holds and folds, how agents drive a browser, and one clean example of the corpus's deeper themes in a non-AI tool.

  • Kimi K2.6: ~7× cheaper than Opus, benchmarks on par, runs multi-hour autonomous sessions, open-source/self-hostable. Give it a task, not a question. Manage drift with scope-lock, a CONSTRAINTS.md, and /compact; add adversarial pressure ("find 3 weaknesses a senior engineer would flag"). kimi-k26
  • Browser automation, the live bet: agent-browser (Rust CLI, ~0 token cost) vs three MCP servers, Playwright (cross-browser), Chrome DevTools (debugging/perf), Browser MCP (session reuse). The argument: MCP registers tool-defs into the prompt (token cost); a CLI registers nothing and reads --help on demand. "Driving vs debugging." agent-browser-comparison
  • sqlc as a theme-in-miniature: write SQL once as the single source of truth, generate typed code, catch mismatches at build time, not at 3 a.m. in production. A no-lock-in "fail loud + single source of truth" pattern: the corpus's deeper ideas in a humble tool. sqlc-explained

The four surfaces, mapped across the corpus

The clearest artifact in the shelf: worth keeping open. From seven-crossings §01; the per-model behavior is audited in four-surfaces.

System Instructionslow · small Workfast · diffed Ledgerdurable · append-only Substratedo-not-edit
autoresearchprogram.mdtrain.pyresults.tsv + git branchprepare.py + eval
Codex goal modePLAN.mdEXPERIMENT_NOTES.mdEXPERIMENTS.mdthe measurable goal
Claude CodeCLAUDE.mdthe codebasegit history + PRstests + production
This projectproject instructionsthe current chatproject knowledgethe visual register
River @ Shopifychannel askspull requestschannel historymonorepo + prod

Three corollaries the source essays don't state: the ledger is more durable than the work; the substrate is the one place agents fail by editing; and the instruction surface is small by construction, not by taste.

§ 03
Ledger

Every file on the shelf, and what to take

So nothing is dropped. One line of provenance, one line of the learning, per source: the comprehensiveness guarantee. Twenty-four files, grouped the way they cluster.

The HTML thesis & the project itself: 6 files
html-effectiveness.htmlThariq Shihipar · Claude Code
The founding essay. HTML beats markdown on density, readability, sharing, and two-way interaction; the costs are tokens, latency, and noisy diffs. The use-cases matter more than any skill.
html-effectiveness-catalog.htmlThariq Shihipar
The companion gallery: 20 specimens, 9 categories, each a single file. "The format is the point." Use it as the menu of what to ask an agent to make.
the-unreasonable-ineffectiveness-of-html.mdKurtis Redux
The sturdiest rebuttal: six counter-arguments mapped onto Thariq's six. Adds source readability, JS/XSS security, and reviewability. "If it can't be reviewed, it's a toy."
dissent-slate-research-report.mdresearch report
Maps the dissent landscape: it's thin, Willison moved toward the thesis, and the serious split is HTML for output / markdown for memory (Schuler). The tribe is currently winning.
format-as-html-instructions.htmlproject meta-doc
The behavioral contract for this project: itself a four-surface artifact and a CLAUDE.md for the library. Default to HTML, match the register, one file that opens on double-click.
tcash-ca-blog-index.htmlTCash · Fredericton, NB
The practitioner's own blog index, "most of what an agent hands back wants to be a page, not a paragraph." The shelf's voice and the standing carve-out for genuinely short replies.
CLAUDE.md & prompting: 5 files
karpathy-skills.htmlForrest Chang (from Karpathy)
The 4-rule floor with paired wrong/right specimens. The "overcomplicated" code isn't wrong: the timing is. Complexity added before it's earned.
claude-md.htmlMnimiy
The 12-rule ceiling for the agent ecosystem. 41%→11%→3% mistake rate. Every rule answers "what mistake does this prevent?" Keep it under 200 lines.
21-claude-md-rules.htmlAnatoli Kopadze
The non-developer port: 15 rules for writers/operators + 6 dev. Prose fails quietly; the contract pattern catches it too. Start with context-priming and invariants.
claude_prompting-best-practices.htmlAnthropic docs
The practitioner manual for Opus 4.7 etc. Effort is the most consequential parameter. Literal instruction-following; dial back aggressive language; positive examples beat negative.
kimi-prompting-best-practices.htmlMoonshot AI docs
Three families, nine moves. The model can't read your mind. The most important prompt word is the grounding fallback: "I can't find the answer."
Meta-prompting & compounding systems: 2 files
four-moves.htmlJeff Su (+ operator reframe)
Four mechanical meta-moves + reverse delegation. Prompt Reversal (capture the one-shot prompt), Amplifier, Red Team, Blueprint Scaffolding. The artifact you keep is a prompt.
garry-tan-meta-prompting.htmlGarry Tan · YC
A compounding personal system: fat skills, fat code, thin harness, interchangeable model. The value is the 100k-page brain and the skills, not the model. skillify bakes every fix in.
Codex & the agentic loop: 4 files
codex-goals.htmlChris Hayduk · OpenAI
Goal mode = a closed loop. Quantitative goal · tight feedback loop · three markdown files. Decompose unscoreable goals into a checklist whose count is the termination check.
codex-getting-the-most.htmlJason Liu
The capability map: a control spectrum from steering/queuing (in the loop) to automations/goals (away), all on durable threads and a written memory vault. A goal is only as good as its verifier.
how-openai-uses-codex.htmlOpenAI
Field evidence: 7 use-cases and 6 habits. Ask mode → Code mode; iterate on the environment; prompt like a GitHub issue; AGENTS.md; best-of-N.
autoresearch.htmlAndrej Karpathy
The overnight loop, 630 lines, one metric. The branch is the experiment log; the ledger survives the context window. The reference four-surface implementation; transfers to any cheap scalar metric.
The Karpathy frame & apprenticeship: 2 files
karpathy-reading.htmlKarpathy & Berman · Sequoia
The conceptual spine: software 3.0, the bitter lesson, jagged capability, vibe coding vs agentic engineering, animals vs ghosts. "Outsource thinking, not understanding."
learning-on-the-shop-floor.htmlTobi Lütke · Shopify
River works only in public. Merge rate 36%→77% with no model swapjust watching and writing down what it should have known. The org moves at the speed of its slowest secret.
Tooling & the cross-corpus syntheses: 3 files
kimi-k26.htmlfield report
A · Z field manual for the cheap-model frontier. ~7× cheaper, on-par benchmarks, 13-hour runs. Give it a task; manage drift with scope-lock, CONSTRAINTS.md, /compact.
agent-browser-comparison.htmlcomparative study
Four ways to drive a browser. The live argument is CLI vs MCPtoken cost in the prompt vs zero. Driving (Playwright/agent-browser) vs debugging (Chrome DevTools).
sqlc-explained.htmltools, demystified
SQL as single source of truth; generate typed code; catch mismatches at build time, not in production. The corpus's deeper "fail loud" theme, in a humble non-AI tool.
seven-crossings.htmlcross-corpus synthesis No. 01
Reads across 16 essays: four surfaces · fail loud · greedy ratchets · fragile context · one primitive · self-reference · the negative space. The backbone this document builds on.
four-surfaces.htmlcross-corpus audit No. 02
Audits seven-crossings §01 against 8 models. Vocabulary converged (AGENTS.md); behavior diverged. Codex's ledger is genuinely append-only; Claude's TodoWrite overwrites; substrate respect is residual, not trained-in.
§ 04
Negative

What the shelf is conspicuously quiet about

Consistent across all 24 files, so a finding in its own right. From seven-crossings §07 and the dissent-slate report. None are individual failings; together they delimit the library's reliability.

  • Cost is unpriced. autoresearch reports "~100 experiments overnight," goal mode runs "for days," Garry Tan runs 100+ crons, but nobody prices an H100-hour, an API call, or a seat. The workflows are real; the bill is invisible.
  • The fixes are barely evaluated. Mnimiy's 41%→3% is the one sustained number, and it's self-reported across 30 codebases. The Karpathy 4 is endorsed by stars, which is popularity, not efficacy. The corpus argues from intuition and anecdote, not controlled measurement.
  • There is almost no disagreement inside it. Sixteen essays, near-zero argument, Mnimiy "extends," Kopadze "ports," Hayduk "parallels." The dissent (Kurtis Redux) was added deliberately for ballast and is the minority position. A reading list with one good dissenting voice would be sturdier; the heavyweight rebuttal to HTML-as-default hasn't been written yet.
  • The techniques rarely document their own failure cases. Lots of failure modes for code without a CLAUDE.md; very few for CLAUDE.md itself. When does the four-surface architecture break? When is public-only the wrong call? When does the ratchet stall? Only autoresearch lists its own limitations honestly.
  • The substrate boundary is the weakest link, and it's empirical. Every reward-hacking eval shows frontier models will edit tests under task pressure even when told not to. The reliable fix is a sandbox or permission rule, never prose, "don't trust the model's vocabulary, restrict its tools."
When sixteen writers in different domains arrive at the same primitives under different names, they're tracking something real. But it's a tribe, with the corresponding blind spots. the closing read of seven-crossings, worth holding