A Reading
On autonomous loops & markdown ledgers
Vol. III · No. 04 · 12 min read
Codex · Goal mode · Field notes

Using Codex Goals Effectively

Goal mode turns the agent into a closed loop (act, score, check, repeat), and that small change in shape rewards a very different kind of prompting. Three rules from someone running it for days at a time.

Cover · ToC
Specify a clear, quantitative goal
Tight feedback loops
Three markdown ledgers
Ed. note · Colophon
Written by
Chris Hayduk
@ChrisHayduk · OpenAI
“Some tips I’ve picked up from using goal mode
both internally and in side projects.”
Filed under
Explain / Reference
Surfaces 03 + 04 from html-effectiveness
(explain a topic · structured reference)
Companion to · autoresearch.html
Contents · No. 04

The whole playbook,
in three rules.

  1. Specify a clear, quantitative goal
    The loop’s termination condition has to be readable by the agent. Vague goals fail in two opposite directions.
  2. Keep the feedback loop tight
    Score quality and score speed are different things. Sacrifice neither, but compress the latter ruthlessly.
  3. Give the agent markdown files to think in
    PLAN.md, EXPERIMENTS.md, EXPERIMENT_NOTES.md: the agent’s plan, ledger, and scratchpad. Cheap, durable, auditable.
★ Editor’s pick · The EXPERIMENTS.md ledger: the single most leveraged of the three files
§
00
Opening · Why goal mode is different

The agent is a loop now.
That changes the prompt.

For a year, prompting got lazier: the models were good enough that we could gesture vaguely and they would figure it out. Goal mode breaks that habit hard, because the loop has nowhere to stop.

Perceptive Codex users have noticed that the /goal command is now available in the app: start your prompt with /goal and specify what you want your agent to do. This triggers Codex to loop continuously until it achieves the goal. But to get the most out of it, the prompt has to be shaped differently from the loose prompts that everyday use rewards.

Goal mode is, at its core, a loop. The agent executes some actions, scores those actions, checks whether the score satisfies the goal, and either continues or terminates. The whole thing pivots on step three (checking whether the score satisfies the goal), which means a vague qualitative goal leaves the loop’s end state underspecified. How is the agent supposed to know when it’s done?

With an underspecified goal, the model fails in two opposite directions. Sometimes it gives up early, working for only a few minutes before declaring victory. Other times it never stops, flailing at an unsatisfiable target. Both failure modes trace to the same root cause: the loop can’t read the termination condition. Fixing that is the whole game.
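The loop described above can be sketched in a few lines. This is a toy illustration of the act, score, check shape, not how Codex implements goal mode: the function name, the `target` threshold, and the `max_iterations` budget are all assumptions made for the example.

```python
from typing import Callable, Optional


def run_goal_loop(
    act: Callable[[], None],
    score: Callable[[], float],
    target: float,
    max_iterations: int = 1000,  # hypothetical safety budget, not a Codex setting
) -> Optional[int]:
    """Repeat act -> score -> check until the score reaches the target.

    Returns the number of iterations used, or None if the budget ran out.
    """
    for iteration in range(1, max_iterations + 1):
        act()                  # step 1: execute an action
        current = score()      # step 2: score the result
        if current >= target:  # step 3: the termination check
            return iteration   # step 4b: terminate
        # step 4a: otherwise fall through and take the next action
    return None


# Toy usage: every action nudges a counter up by 0.25.
state = {"value": 0.0}


def nudge() -> None:
    state["value"] += 0.25


def read_value() -> float:
    return state["value"]


iterations_used = run_goal_loop(nudge, read_value, target=1.0)
```

Everything hinges on the `current >= target` line. If the goal gives the agent nothing to put on the right-hand side of that comparison, the loop either exits on a guess or never exits at all.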

§
01
Rule 01 · The termination condition

Specify a clear, quantitative goal.

The agent doesn’t need beauty in the goal; it needs a check it can run. Numbers, constraints, and a checklist will outperform any number of carefully chosen adjectives.

A better goal than “make my code better” is something like:

Comparison · the same intent, two framings · Goal mode
Underspecified
“Make my code better.”
Two failure modes. The agent either declares victory after a few minutes (no termination check to fail) or never stops (no termination check to pass). Either way, the loop is broken.
Termination · unreadable | Score · qualitative | Constraints · none
Tightly specified
“Reduce the runtime of the code in specific_file by 20% without causing regressions in existing unit or integration tests.”
The loop is now closeable. 20% is a number the agent can check. The test suite is a constraint the agent can re-run. Done means done.
Termination · readable | Score · % runtime reduction | Constraints · tests pass
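To make "a check the agent can run" concrete, here is a minimal sketch of the termination check implied by the tightly specified goal. The function and parameter names are hypothetical, and it assumes the baseline and current runtimes were measured the same way.

```python
def runtime_goal_met(
    baseline_seconds: float,
    current_seconds: float,
    tests_passed: bool,
    required_reduction: float = 0.20,  # the 20% from the goal text
) -> bool:
    """Return True only if runtime dropped enough AND the test suite is green."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    # Fractional reduction relative to the baseline, e.g. 0.25 for 10s -> 7.5s.
    reduction = (baseline_seconds - current_seconds) / baseline_seconds
    return bool(tests_passed and reduction >= required_reduction)
```

Both halves of the goal appear in the return expression: the number and the constraint. A faster build with a red test suite does not close the loop.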

Quantitative doesn’t mean simple

The score doesn’t have to come from a benchmark; the model itself can do the scoring if the rules are clear enough. Take converting a NeurIPS preprint into an ICML workshop submission: ICML’s formatting rules are buried in a LaTeX file that doesn’t lend itself to mechanical grading. The fix was to have Codex extract those rules into a markdown checklist of just over two hundred items, then make the goal:

“Change the NeurIPS paper to ICML format based on the provided checklist.md without changing any of the technical content of the paper.”

By providing the checklist, the qualitative goal becomes quantitative: Codex is done when every box is checked (200 of 200, in round numbers). Each individual rule can still be vague, because the agent reasons about one rule at a time, which it does well, while the overall goal is now a hard count. As a bonus, the model checks off items as it goes, which persists progress to the filesystem and lets a human keep tabs visually.

The pattern, generalized

Whenever the goal resists a single number, decompose it into a checklist the agent can chew through item by item, then make checking it off the termination condition. The aggregate quantity (200 of 200) does the work the single scalar would do in an easier case.
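The checklist-as-termination-condition idea can be sketched as a small counter. This assumes the checklist uses GitHub-style task-list syntax (`- [ ]` and `- [x]`); the source only says "markdown checklist", so treat the syntax, the function names, and the placeholder items as assumptions.

```python
import re

# Matches "- [ ] item" or "* [x] item" at the start of a line (assumed syntax).
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+", re.MULTILINE)


def checklist_progress(markdown: str) -> tuple[int, int]:
    """Return (checked, total) for every task-list item in the text."""
    marks = CHECKBOX.findall(markdown)
    checked = sum(1 for mark in marks if mark.lower() == "x")
    return checked, len(marks)


def checklist_complete(markdown: str) -> bool:
    """The termination condition: at least one item, and all of them checked."""
    checked, total = checklist_progress(markdown)
    return total > 0 and checked == total


# Placeholder items, not real ICML rules.
sample = """\
- [x] Rule A
- [ ] Rule B
- [X] Rule C
"""
progress = checklist_progress(sample)
```

Here `progress` comes out as two checked of three, so the loop keeps going. The same pair doubles as the progress readout a human can glance at mid-run.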

§
02
Rule 02 · Latency of the score

Make sure the feedback loop is tight.

The agent is rate-limited by how long it takes to find out whether the last move helped. Shorten that and you change what the loop can attempt.

To evaluate actions against the goal, the agent needs a test. The faster that test runs, and the less ceremony it takes to run it, the faster the model gets feedback on whether the last move helped. Latency in the scoring step is the dominant cost in any long-running loop. Compress it anywhere you can without compromising the quality of the score it produces.

For improving an ML architecture, that often means operating on a smaller model and a subsampled dataset rather than the full production training setup. For protein structure work specifically, NanoFold, a small but well-sampled dataset, cut a scoring iteration from days to minutes. The score you get isn’t identical to a full training run, but it’s good enough to rank ideas, and you can do dozens or hundreds of those rankings in the time one production-quality score would take.
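One way to get a subsampled dataset is a seeded random draw, sketched below. This is a generic illustration, not how NanoFold was built; the function name and the fixed-seed choice are assumptions. The seed is held constant so that every iteration scores against the same subset and the resulting scores stay comparable.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Return a reproducible random subset containing about `fraction` of items."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    # Keep at least one item so the scoring step always has something to run on.
    k = max(1, round(len(items) * fraction))
    rng = random.Random(seed)  # local generator: does not disturb global random state
    return rng.sample(list(items), k)


# Toy usage: a tenth of a 1,000-example dataset.
picked = subsample(range(1000), 0.1, seed=7)
```

A plain random draw is the simplest option. If the data has rare classes, a stratified draw may rank ideas more faithfully, at the cost of a little more code.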

The goal-mode loop · what Codex runs continuously · greedy on a single goal
Step 01
act
execute action
Step 02 · the hot path
score
run the test
Step 03
check goal
does the score satisfy the rule?
NO → continue | YES → terminate
Step 04a · loop back
continue
next action → return to step 01
Step 04b · done
terminate
return to human

The loop’s clock speed is set by step 02. Smaller model · subsampled data · cached deps · pre-warmed env.

Where the time actually goes

Three time sinks show up over and over. Cold environments: every action repays a setup cost the score doesn’t care about, so cache it. Full-fat fixtures: the loop runs against the production dataset when a tenth of it would rank ideas just as well, so subsample. Hidden manual steps: anything the model has to ask the human about adds the human’s response time to every iteration, so remove the gate or trust the agent to bypass it.
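Before compressing anything, it helps to measure where an iteration's time goes. The sketch below is one hypothetical way to do that: a small timer that accumulates seconds per named phase. The class and phase names are invented for the example, and the clock is injectable so the usage can run on fake timestamps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate elapsed time per named phase of a loop iteration."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the phase raised.
            self.totals[name] += self._clock() - start

    def share(self, name):
        """Fraction of all recorded time spent in the given phase."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0


# Toy usage with a fake clock: setup takes 9 "seconds", scoring takes 1.
ticks = iter([0.0, 9.0, 9.0, 10.0])
timer = PhaseTimer(clock=lambda: next(ticks))
with timer.phase("setup"):
    pass
with timer.phase("score"):
    pass
setup_share = timer.share("setup")
```

In this made-up run, nine tenths of the iteration is setup. That is the cold-environment case: the expensive part is not the score at all, and caching the environment is the obvious first move.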

§
03
Rule 03 · The agent’s memory

Give the agent markdown files to think in.

Goal mode can run for days. Even with good compaction, no context window holds a coherent thread that long. The fix is to put the thread on disk.

Rather than force the model to hold all the relevant state in memory, expose markdown files for it to write to. Three files cover the cases: a plan, a ledger of attempts, and a chronological scratchpad. Each plays a different role, and together they let the loop run autonomously for hours or days while remaining auditable.

The three files · what the agent writes to disk
Forward-looking
PLAN.md
The strategy.
The high-level plan the agent intends to follow on its way to the goal. Seed it with your own initial ideas about direction; let the agent revise it as the picture clarifies.
★ Most leveraged
EXPERIMENTS.md
The ledger.
A clean, curated list of every experiment run: title, brief description of what was tried, and the result. The agent consults this to avoid repeating failed paths; you skim it to see where it’s been.
Backward-looking
EXPERIMENT_NOTES.md
The scratchpad.
A chronologically ordered stream of real-time thoughts as the agent executes. Audit it when you need to know why the agent did something, or whether to nudge the loop in a different direction.
EXPERIMENTS.md · excerpt · protein-structure run · goal: improve TM-score on NanoFold
#01
Add rotary positional embeddings to MSA block.
Replaces sinusoidal PE in the MSA self-attention path. Re-ran NanoFold eval. → +0.018 TM-score (kept).
#02
Increase MSA depth from 4 → 6 blocks.
Hypothesis: under-parameterised vs reference. → OOM on eval batch; reverted.
#03
Same as #02, with gradient checkpointing on MSA.
Re-ran cleanly. → −0.004 TM-score (no improvement); reverted.
#04
Triangle multiplicative update gating.
Per literature note in EXPERIMENT_NOTES.md#L412. → +0.011 TM-score, stable across seeds (kept).
#05
Combine #01 + #04 with shared LayerNorm.
Suspect interaction; verifying. → +0.034 TM-score cumulative; promoted to baseline.
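A ledger like the excerpt above needs only two operations: append an entry, and check whether a title has been tried before. The sketch below assumes a three-line plain-text layout modeled on the excerpt (number, title, notes plus result); the actual on-disk format of the file isn't shown in the source, and every name here is hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Experiment:
    number: int
    title: str
    notes: str
    result: str  # should end with a verdict, e.g. "(kept)."


def format_entry(exp: Experiment) -> str:
    """Render one ledger entry in the assumed three-line layout."""
    return f"#{exp.number:02d}\n{exp.title}\n{exp.notes} → {exp.result}\n"


def append_entry(ledger: Path, exp: Experiment) -> None:
    """Append only: earlier entries are never rewritten."""
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(format_entry(exp))


def already_tried(ledger: Path, title: str) -> bool:
    """True if some line in the ledger equals the title (case-insensitive)."""
    if not ledger.exists():
        return False
    wanted = title.strip().lower()
    lines = ledger.read_text(encoding="utf-8").splitlines()
    return any(line.strip().lower() == wanted for line in lines)


# Toy usage in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ledger_path = Path(tmp) / "EXPERIMENTS.md"
    first = Experiment(
        1,
        "Add rotary positional embeddings to MSA block.",
        "Re-ran NanoFold eval.",
        "+0.018 TM-score (kept).",
    )
    seen_before = already_tried(ledger_path, first.title)
    append_entry(ledger_path, first)
    seen_after = already_tried(ledger_path, first.title)
    ledger_text = ledger_path.read_text(encoding="utf-8")
```

Exact title matching is deliberately crude. In practice the agent reads the whole ledger and judges similarity itself; the point of the sketch is that the file stays cheap to write and cheap to scan.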

Why EXPERIMENTS.md earns the asterisk

Of the three, the ledger is the most leveraged. It’s the file the agent re-reads to avoid repeating a failed direction; it’s the file you skim to confirm the loop hasn’t gone off the rails; and it’s the file that survives long after a session’s context window is gone. PLAN.md is forward-looking and ages quickly. EXPERIMENT_NOTES.md is high-fidelity but voluminous. EXPERIMENTS.md is the curated, append-only record of what the loop actually learned.

The structure is also forgiving. If a column is missing or a row is sloppy, the next pass cleans it up. The schema isn’t precious; what matters is that every attempt gets a line, every line has a verdict (kept, reverted, promoted), and the file never lies about the past.
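The "every line has a verdict" rule is easy to check mechanically. The sketch below assumes entries begin with a line like `#01`, as in the excerpt, and simply looks for one of the three verdict words in each entry's body. The function names are hypothetical, and the second sample entry has had its verdict removed on purpose to show a failure.

```python
import re

VERDICTS = ("kept", "reverted", "promoted")
ENTRY_HEADER = re.compile(r"^#(\d+)\s*$")  # a line that is only "#NN"


def split_entries(ledger_text: str) -> dict:
    """Map each entry number to the text that follows its header line."""
    entries = {}
    current = None
    for line in ledger_text.splitlines():
        match = ENTRY_HEADER.match(line.strip())
        if match:
            current = match.group(1)
            entries[current] = []
        elif current is not None:
            entries[current].append(line)
    return {number: "\n".join(body) for number, body in entries.items()}


def entries_missing_verdict(ledger_text: str) -> list:
    """Return the numbers of entries whose body contains no verdict word."""
    return [
        number
        for number, body in split_entries(ledger_text).items()
        if not any(re.search(rf"\b{word}\b", body, re.IGNORECASE) for word in VERDICTS)
    ]


sample_ledger = """\
#01
Add rotary positional embeddings to MSA block.
Re-ran NanoFold eval. → +0.018 TM-score (kept).
#02
Increase MSA depth from 4 → 6 blocks.
→ OOM on eval batch.
"""
missing = entries_missing_verdict(sample_ledger)
```

A check like this is a lint, not a schema. It flags entry 02 for a follow-up pass without rejecting the file, which fits the forgiving structure described above.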

Repurposing the triplet outside ML

The pattern travels. For a refactor: PLAN.md is the migration plan, EXPERIMENTS.md is the per-module attempt log (which approach was tried, what broke, what shipped), EXPERIMENT_NOTES.md is the running commentary. For a research write-up: a section plan, a per-claim verification log, a stream of editorial notes. Same shape, different content.

“Set up a clear, measurable goal, keep the feedback loop tight, and give the agent markdown files to think in. With those three pieces in place, Codex will happily grind for hours, or even days, on your hardest problems.”
Chris Hayduk · the playbook, in full
The playbook, restated

That’s the whole playbook.
Now go run some loops.

C.H.
OpenAI · Codex team · 2026
Editor’s note

Sits directly next to autoresearch in this library.

Hayduk’s three rules describe goal mode as a product feature: what Codex’s /goal command does and how to prompt for it. Karpathy’s autoresearch, archived in this project at autoresearch.html, describes the same idea built by hand: a single-prompt long-running loop, a strict protocol, and a deliberate split between the human surface (the prompt sheet) and the agent surface (the training code).

The triplets line up almost one-for-one. Hayduk’s PLAN.md is Karpathy’s program.md: the human-edited marching orders. Hayduk’s EXPERIMENTS.md is Karpathy’s results.tsv: the curated ledger of every attempt and its verdict. Hayduk’s EXPERIMENT_NOTES.md is the agent’s scratchpad context, which Karpathy leaves implicit because the git branch + tsv together already carry the audit trail.

Cross-reference · Karpathy, “Autoresearch.” In project knowledge as autoresearch.html. See § 02 (The three files) and § 06 (The ledger) for the parallel architecture.