Goal mode turns the agent into a closed loop (act, score, check, repeat), and that small change in shape rewards a very different kind of prompting. Three rules from someone running it for days at a time.
For a year, prompting got lazier: the models were good enough that we could gesture vaguely and they would figure it out. Goal mode breaks that habit hard, because the loop has nowhere to stop.
Perceptive Codex users have noticed that the /goal command is now available in the app: start your prompt with /goal and specify what you want your agent to do. This triggers Codex to loop continuously until it achieves the goal. But to get the most out of it, the prompt has to be shaped differently than the everyday workflow rewards.
Goal mode is, at its core, a loop. The agent executes some actions, scores those actions, checks whether the score satisfies the goal, and either continues or terminates. The whole thing pivots on step three, check if the score satisfies the goal, which means a vague qualitative goal leaves the loop’s end state underspecified. How is the agent supposed to know when it’s done?
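As a rough mental model only, the loop can be sketched in a few lines of Python. This is not how Codex implements /goal; `run_goal_loop`, its parameters, and the toy counter are hypothetical names chosen for illustration. The point is that everything hinges on `is_satisfied` being something the loop can actually evaluate.

```python
from typing import Callable, Tuple


def run_goal_loop(
    act: Callable[[int], None],
    score: Callable[[], float],
    is_satisfied: Callable[[float], bool],
    max_iterations: int = 1000,
) -> Tuple[bool, int, float]:
    """Act, score, check, repeat.

    Returns (reached_goal, iterations_used, last_score).
    """
    last_score = float("nan")
    for iteration in range(1, max_iterations + 1):
        act(iteration)                # step 1: take an action
        last_score = score()          # step 2: score the result
        if is_satisfied(last_score):  # step 3: the termination check
            return True, iteration, last_score
    # The budget ran out without the check ever passing.
    return False, max_iterations, last_score


# Toy usage: each action adds 1 to a counter; the goal is "counter >= 5".
state = {"counter": 0}


def toy_act(_iteration: int) -> None:
    state["counter"] += 1


result = run_goal_loop(toy_act, lambda: float(state["counter"]), lambda s: s >= 5)
print(result)
```

If `is_satisfied` can never return True, the only thing that stops this sketch is the iteration budget, which is the "never stops" failure in miniature.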
With an underspecified goal, the model fails in two opposite directions. Sometimes it gives up early, working for only a few minutes before declaring victory. Other times it never stops, flailing at an unsatisfiable target. Both failure modes trace to the same root cause: the loop can’t read the termination condition. Fixing that is the whole game.
The agent doesn’t need beauty in the goal: it needs a check it can run. Numbers, constraints, and a checklist will outperform any amount of careful adjectives.
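One way to picture "a check it can run" is a function that takes measurements and returns a boolean. The sketch below assumes a hypothetical lower-is-better metric (runtime in seconds, say) and a hypothetical 20% threshold; `goal_met` and its parameters are illustrative names, not part of any tool.

```python
def goal_met(
    baseline: float,
    current: float,
    tests_passed: bool,
    required_reduction: float = 0.20,
) -> bool:
    """True when `current` is at least `required_reduction` below `baseline`
    and the existing test suite still passes.

    Assumes a lower-is-better metric; both numbers use the same unit.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    reduction = (baseline - current) / baseline
    return tests_passed and reduction >= required_reduction


print(goal_met(baseline=10.0, current=7.5, tests_passed=True))   # 25% lower, tests pass
print(goal_met(baseline=10.0, current=7.5, tests_passed=False))  # faster, but a regression
print(goal_met(baseline=10.0, current=9.0, tests_passed=True))   # only 10% lower
```

The number and the constraint are both part of the check: an improvement that breaks the tests does not count as done.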
A better goal than “make my code better” is something like:
“… specific_file by 20% without causing regressions in existing unit or integration tests.”
The score doesn’t have to come from a benchmark; the model itself can do the scoring if the rules are clear enough. Take converting a NeurIPS preprint to an ICML workshop submission: ICML’s formatting rules are buried in a LaTeX file that doesn’t lend itself to mechanical grading. The fix was to have Codex extract those rules into a markdown checklist of just over two hundred items, then make the goal:
“Change the NeurIPS paper to ICML format based on the provided checklist.md without changing any of the technical content of the paper.”
Providing the checklist turns the qualitative goal into a quantitative one: Codex is done when 200 of 200 boxes are checked. Each individual rule can still be vague, because the agent reasons about one rule at a time, which it does well, but the overall goal is now a hard count. As a bonus, the model checks off items as it goes, which persists progress to the filesystem and lets a human keep tabs visually.
Whenever the goal resists a single number, decompose it into a checklist the agent can chew through item by item, then make checking it off the termination condition. The aggregate quantity (200 of 200) does the work the single scalar would do in an easier case.
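The aggregate count is easy to compute mechanically. Here is a minimal sketch, assuming the checklist uses GitHub-style task items (`- [ ]` and `- [x]`); the function names and the three sample items are made up for illustration and are not the actual ICML rules.

```python
import re
from typing import Tuple

# Matches task-list items such as "- [ ] text" or "* [x] text".
_TASK_ITEM = re.compile(r"^[ \t]*[-*][ \t]+\[( |x|X)\][ \t]+", re.MULTILINE)


def checklist_progress(markdown: str) -> Tuple[int, int]:
    """Return (checked, total) for the task-list items in `markdown`."""
    marks = _TASK_ITEM.findall(markdown)
    checked = sum(1 for mark in marks if mark.lower() == "x")
    return checked, len(marks)


def checklist_done(markdown: str) -> bool:
    """The termination condition: at least one item, and every item checked."""
    checked, total = checklist_progress(markdown)
    return total > 0 and checked == total


# Hypothetical checklist content, for illustration only.
sample = """\
- [x] Title block uses the required style
- [ ] Page limit respected
- [x] References use the required format
"""

progress = checklist_progress(sample)
print(progress, checklist_done(sample))
```

Requiring `total > 0` keeps an empty or unparsed file from counting as finished.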
The agent is rate-limited by how long it takes to find out whether the last move helped. Shorten that and you change what the loop can attempt.
To evaluate actions against the goal, the agent needs a test. The faster that test runs, and the less ceremony it takes to run it, the faster the model gets feedback on whether the last move helped. Latency in the scoring step is the dominant cost in any long-running loop. Compress it anywhere you can without compromising the quality of the score it produces.
For improving an ML architecture, that often means operating on a smaller model and a subsampled dataset rather than the full production training setup. For protein structure work specifically, NanoFold, a small but well-sampled dataset, cut a scoring iteration from days to minutes. The score you get isn’t identical to a full training run, but it’s good enough to rank ideas, and you can do dozens or hundreds of those rankings in the time one production-quality score would take.
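Subsampling only helps ranking if every iteration scores against the same subset. A minimal sketch of a seeded subsample follows; `subsample` and its parameters are hypothetical, and this is a generic illustration rather than how NanoFold was built.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Pick a fixed pseudo-random fraction of `items`.

    The same seed always yields the same subset, so scores from different
    iterations of the loop stay comparable.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    count = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(list(items), count)


full_dataset = list(range(1000))          # stand-in for real examples
small = subsample(full_dataset, 0.1, seed=42)
print(len(small))
```

Using a private `random.Random(seed)` instance avoids disturbing any global random state the training code relies on.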
The loop’s clock speed is set by step 02. Smaller model · subsampled data · cached deps · pre-warmed env.
Three failure modes show up over and over. Cold environments: every action repaying a setup cost the score doesn’t care about; cache it. Full-fat fixtures: running against the production dataset when a tenth of it would rank ideas just as well; subsample. Hidden manual steps: anything the model has to ask the human about adds the human’s response time to every iteration; remove the gate or trust the agent to bypass it.
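Before compressing anything, it helps to see where an iteration's time actually goes. The sketch below is one hypothetical way to do that with a small timing helper; `timed` and `phase_seconds` are illustrative names, and the sleep stands in for a real setup cost.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named phase.
phase_seconds = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the wall-clock time spent inside the block to `phase_seconds`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


# Toy iteration: a slow "setup" phase followed by a cheap "score" phase.
with timed("setup"):
    time.sleep(0.01)       # stand-in for installing deps or warming an env
with timed("score"):
    sum(range(1000))       # stand-in for the actual scoring step

for phase, seconds in phase_seconds.items():
    print(f"{phase}: {seconds:.4f}s")
```

If setup dominates the totals, caching or pre-warming is the first thing to try; if scoring dominates, subsampling is.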
Goal mode can run for days. Even with good compaction, no context window holds a coherent thread that long. The fix is to put the thread on disk.
Rather than force the model to hold all the relevant state in memory, expose markdown files for it to write to. Three files cover the cases: a plan, a ledger of attempts, and a chronological scratchpad. Each plays a different role, and together they let the loop run autonomously for hours or days while remaining auditable.
[Example ledger entry, truncated: “… EXPERIMENT_NOTES.md#L412. → +0.011 TM-score, stable across seeds (kept).”]
Of the three, the ledger (EXPERIMENTS.md) is the most leveraged. It’s the file the agent re-reads to avoid repeating a failed direction; it’s the file you skim to confirm the loop hasn’t gone off the rails; and it’s the file that survives long after a session’s context window is gone. PLAN.md is forward-looking and ages quickly. EXPERIMENT_NOTES.md is high-fidelity but voluminous. EXPERIMENTS.md is the curated, append-only record of what the loop actually learned.
The structure is also forgiving. If a column is missing or a row is sloppy, the next pass cleans it up. The schema isn’t precious; what matters is that every attempt gets a line, every line has a verdict (kept, reverted, promoted), and the file never lies about the past.
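A minimal sketch of such a ledger, assuming a four-column markdown table: the column names, helper functions, and the two sample rows are hypothetical, since the original only fixes the rule that every attempt gets a line with one of the three verdicts.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List

VERDICTS = {"kept", "reverted", "promoted"}
HEADER = "| date | attempt | result | verdict |\n| --- | --- | --- | --- |\n"


def append_attempt(ledger: Path, attempt: str, result: str, verdict: str) -> None:
    """Append one row to the ledger; existing rows are never rewritten.

    Assumes cell text contains no "|" characters.
    """
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    if not ledger.exists():
        ledger.write_text("# EXPERIMENTS\n\n" + HEADER, encoding="utf-8")
    row = f"| {date.today().isoformat()} | {attempt} | {result} | {verdict} |\n"
    with ledger.open("a", encoding="utf-8") as handle:  # append-only
        handle.write(row)


def reverted_attempts(ledger: Path) -> List[str]:
    """List attempts marked 'reverted', i.e. directions not worth repeating."""
    found = []
    for line in ledger.read_text(encoding="utf-8").splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 4 and cells[3] == "reverted":
            found.append(cells[1])
    return found


# Illustrative rows only; not real experimental results.
with tempfile.TemporaryDirectory() as tmp:
    ledger_path = Path(tmp) / "EXPERIMENTS.md"
    append_attempt(ledger_path, "wider hidden layer", "no change in score", "reverted")
    append_attempt(ledger_path, "lower learning rate", "small gain", "kept")
    dead_ends = reverted_attempts(ledger_path)
    line_count = len(ledger_path.read_text(encoding="utf-8").splitlines())

print(dead_ends)
```

Opening the file in append mode is the whole enforcement mechanism here: the helper can add to the past but has no code path that edits it.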
The pattern travels. For a refactor: PLAN.md is the migration plan, EXPERIMENTS.md is the per-module attempt log (which approach was tried, what broke, what shipped), EXPERIMENT_NOTES.md is the running commentary. For a research write-up: a section plan, a per-claim verification log, a stream of editorial notes. Same shape, different content.
Set up a clear, measurable goal, keep the feedback loop tight, and give the agent markdown files to think in. With those three pieces in place, Codex will happily grind for hours, or even days, on your hardest problems.
Chris Hayduk · the playbook, in full
Hayduk’s three rules describe goal mode as a product feature: what
Codex’s /goal command does and how to prompt for it. Karpathy’s
autoresearch, archived in this project at autoresearch.html,
describes the same idea built by hand: a single-prompt long-running
loop, a strict protocol, and a deliberate split between the human surface
(the prompt sheet) and the agent surface (the training code).
The triplets line up almost one-for-one. Hayduk’s PLAN.md is Karpathy’s
program.md: the human-edited marching orders. Hayduk’s
EXPERIMENTS.md is Karpathy’s results.tsv: the curated
ledger of every attempt and its verdict. Hayduk’s EXPERIMENT_NOTES.md
is the agent’s scratchpad context, which Karpathy leaves implicit because the
git branch + tsv together already carry the audit trail.
See autoresearch.html, § 02 (The three files) and § 06
(The ledger), for the parallel architecture.