Codex · Goal mode · Field notes

Using Codex Goals Effectively

Goal mode turns the agent into a closed loop (act, score, check, repeat), and that small change in shape rewards a very different kind of prompting. Three rules from someone running it for days at a time.

Cover · ToC

Specify a clear, quantitative goal

Tight feedback loops

Three markdown ledgers

Ed. note · Colophon

Written by

Chris Hayduk

@ChrisHayduk · OpenAI
“Some tips I’ve picked up from using goal mode both internally and in side projects.”

Filed under

Explain / Reference

Surfaces 03 + 04 from html-effectiveness
(explain a topic · structured reference)
Companion to · autoresearch.html

Contents · No. 04

The whole playbook,
in three rules.

I
Specify a clear, quantitative goal
The loop’s termination condition has to be readable by the agent. Vague goals fail in two opposite directions.

II
Keep the feedback loop tight
Score quality and score speed are different things. Sacrifice neither, but cut the time it takes to produce a score ruthlessly.

III
Give the agent markdown files to think in
PLAN.md, EXPERIMENTS.md, EXPERIMENT_NOTES.md: the agent’s plan, ledger, and scratchpad. Cheap, durable, auditable.


★ Editor’s pick · The EXPERIMENTS.md ledger: the single most leveraged of the three files

§ 00 · The agent is a loop now. That changes the prompt.

Perceptive Codex users have noticed that the /goal command is now available in the app: start your prompt with /goal and specify what you want your agent to do. Codex then loops continuously until it achieves the goal. To get the most out of it, though, the prompt has to be shaped differently from the prompts that everyday workflows reward.

Goal mode is, at its core, a loop. The agent executes some actions, scores those actions, checks whether the score satisfies the goal, and either continues or terminates. The whole thing pivots on step three (check whether the score satisfies the goal), which means a vague, qualitative goal leaves the loop’s end state underspecified. How is the agent supposed to know when it’s done?

With an underspecified goal, the model fails in two opposite directions. Sometimes it gives up early, working for only a few minutes before declaring victory. Other times it never stops, flailing at an unsatisfiable target. Both failure modes trace to the same root cause: the loop can’t read the termination condition. Fixing that is the whole game.
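The loop described above can be sketched in a few lines. This is an illustration of the shape only, not Codex's actual implementation: the function names, the toy workload, and the `max_iterations` budget (a guard against the never-stops failure mode) are all assumptions made for the example.

```python
from typing import Callable


def run_goal_loop(
    act: Callable[[], None],
    score: Callable[[], float],
    satisfied: Callable[[float], bool],
    max_iterations: int = 1000,  # assumption: a budget so an unsatisfiable goal still halts
) -> tuple[bool, int, float]:
    """Act, score, check, repeat. Returns (reached_goal, iterations, last_score)."""
    last = float("nan")
    for i in range(1, max_iterations + 1):
        act()                # step 1: execute some actions
        last = score()       # step 2: score them
        if satisfied(last):  # step 3: check the termination condition
            return True, i, last
    return False, max_iterations, last


# Toy usage: each "action" shaves 5% off a pretend runtime of 10.0 seconds.
state = {"runtime": 10.0}


def toy_act() -> None:
    state["runtime"] *= 0.95


def toy_score() -> float:
    return state["runtime"]


# A readable termination condition: stop once the runtime is at most 8.0 seconds.
result = run_goal_loop(toy_act, toy_score, lambda s: s <= 8.0)
```

Everything the loop knows about being done lives in `satisfied`. If that predicate cannot be written down, the loop either exits on a guess or runs until the budget is spent.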

§ 01 · Specify a clear, quantitative goal.

A better goal than “make my code better” is something like:

Comparison · the same intent, two framings · Goal mode

Underspecified

“Make my code better.”

Two failure modes. The agent either declares victory after a few minutes (no termination check to fail) or never stops (no termination check to pass). Either way, the loop is broken.

Termination · unreadable | Score · qualitative | Constraints · none

Tightly specified

“Reduce the runtime of the code in specific_file by 20% without causing regressions in existing unit or integration tests.”

The loop is now closeable. 20% is a number the agent can check. The test suite is a constraint the agent can re-run. Done means done.

Termination · readable | Score · %-runtime | Constraints · tests pass

Quantitative doesn’t mean simple

The score doesn’t have to come from a benchmark; the model itself can do scoring if the rules are clear enough. When converting a NeurIPS preprint to an ICML workshop submission, ICML’s formatting rules are buried in a LaTeX file that doesn’t lend itself to mechanical grading. The fix was to have Codex extract those rules into a markdown checklist of just over two hundred items, then make the goal:

“Change the NeurIPS paper to ICML format based on the provided checklist.md without changing any of the technical content of the paper.”

With the checklist in hand, the qualitative goal becomes quantitative: Codex is done when every box is checked, 200 of 200 in round numbers. Each individual rule can still be vague, because the agent reasons about one rule at a time, which it does well, but the overall goal is now a hard count. As a bonus, the model checks off items as it goes, which persists progress to the filesystem and lets a human keep tabs at a glance.

The pattern, generalized

Whenever the goal resists a single number, decompose it into a checklist the agent can chew through item by item, then make checking it off the termination condition. The aggregate quantity (200 of 200) does the work the single scalar would do in an easier case.
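The checklist-as-termination-condition idea reduces to counting boxes. The sketch below assumes the checklist uses GitHub-style task items (`- [ ]` and `- [x]`); the source does not say which syntax was used, and the sample items are placeholders rather than actual ICML rules.

```python
import re

# Matches task items such as "- [ ] rule" or "* [x] rule" at the start of a line.
TASK_ITEM = re.compile(r"^\s*[-*+]\s+\[([ xX])\]\s+\S", re.MULTILINE)


def checklist_progress(markdown: str) -> tuple[int, int]:
    """Return (checked, total) for the task-list items in a markdown string."""
    marks = TASK_ITEM.findall(markdown)
    checked = sum(1 for m in marks if m in "xX")
    return checked, len(marks)


def checklist_done(markdown: str) -> bool:
    """Termination check: at least one item exists and every item is checked."""
    checked, total = checklist_progress(markdown)
    return total > 0 and checked == total


# Placeholder items for illustration only.
sample = """\
- [x] Title uses the required font size
- [ ] Page margins match the style file
- [x] References follow the required format
"""
print(checklist_progress(sample))  # (2, 3)
print(checklist_done(sample))      # False
```

Requiring `total > 0` is a deliberate choice: an empty or unparseable checklist should read as "not done" rather than as vacuous success.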

§ 02 · Make sure the feedback loop is tight.

To evaluate actions against the goal, the agent needs a test. The faster that test runs, and the less ceremony it takes to run, the faster the model gets feedback on whether the last move helped. Latency in the scoring step is the dominant cost in any long-running loop. Compress it wherever you can without compromising the quality of the score it produces.

For improving an ML architecture, that often means operating on a smaller model and a subsampled dataset rather than the full production training setup. For protein structure work specifically, NanoFold, a small but well-sampled dataset, cut a scoring iteration from days to minutes. The score you get isn’t identical to the one from a full training run, but it’s good enough to rank ideas, and you can do dozens or hundreds of those rankings in the time one production-quality score would take.
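One way to get a cheap proxy score is to evaluate on a fixed subsample. The sketch below is generic and is not how NanoFold was built; the function name and the seed convention are assumptions. The point it illustrates is that the subset should be identical on every iteration, so that score differences reflect the change under test and not a different draw of data.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def fixed_subsample(items: Sequence[T], fraction: float, seed: int = 0) -> list[T]:
    """Pick the same random subset every time so scores stay comparable across iterations."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    k = max(1, round(len(items) * fraction))
    rng = random.Random(seed)  # private RNG: leaves the global random state untouched
    return rng.sample(list(items), k)


# Usage: score against a tenth of a 1,000-example evaluation set.
eval_ids = list(range(1000))
proxy_ids = fixed_subsample(eval_ids, fraction=0.1)
```

Before trusting the proxy, it is worth confirming on a handful of known changes that it ranks them in the same order as the full evaluation does.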

The goal-mode loop · what Codex runs continuously · greedy on a single goal

Step 01 · act: execute action
Step 02 · score (the hot path): run the test
Step 03 · check goal: does the score satisfy the rule? NO → continue · YES → terminate
Step 04a · continue (loop back): next action → return to step 01
Step 04b · terminate (done): return to human

The loop’s clock speed is set by step 02. Levers: smaller model · subsampled data · cached deps · pre-warmed env.

Where the time actually goes

Three failure modes show up over and over.

Cold environments. Every action repays a setup cost the score doesn’t care about; cache it.

Full-fat fixtures. The loop runs against the production dataset when a tenth of it would rank ideas just as well; subsample.

Hidden manual steps. Anything the model has to ask the human about adds the human’s response time to every iteration; remove the gate or authorize the agent to proceed without it.
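Before compressing anything, it helps to measure which phase of the scoring step actually dominates. The following is a minimal sketch of per-phase timing; the phase names and the stand-in workload are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named phase.
phase_seconds: dict[str, float] = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the wall-clock time spent inside the block to the named phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


def slowest_phase() -> str:
    """Name of the phase that has consumed the most time so far."""
    # Raises ValueError if nothing has been timed yet.
    return max(phase_seconds, key=phase_seconds.get)


# Stand-in workload: the sleep is a placeholder for environment setup,
# and the sum is a placeholder for the evaluation itself.
with timed("setup"):
    time.sleep(0.05)
with timed("eval"):
    sum(range(10_000))
```

If `slowest_phase()` keeps returning the setup phase, the loop is paying the cold-environment tax and caching is the first thing to try.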

§ 03 · Give the agent markdown files to think in.

Rather than force the model to hold all the relevant state in memory, expose markdown files for it to write to. Three files cover the cases: a plan, a ledger of attempts, and a chronological scratchpad. Each plays a different role, and together they let the loop run autonomously for hours or days while remaining auditable.

The three files · what the agent writes to disk

Forward-looking

PLAN.md

The strategy.

The high-level plan the agent intends to follow on its way to the goal. Seed it with your own initial ideas about direction; let the agent revise it as the picture clarifies.

★ Most leveraged

EXPERIMENTS.md

The ledger.

A clean, curated list of every experiment run: title, brief description of what was tried, and the result. The agent consults this to avoid repeating failed paths; you skim it to see where it’s been.

Backward-looking

EXPERIMENT_NOTES.md

The scratchpad.

A chronologically ordered stream of real-time thoughts as the agent executes. Audit it when you need to know why the agent did something, or whether to nudge the loop in a different direction.
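Setting up the three files before a run can be scripted. This sketch creates any that are missing and leaves existing ones alone; the seed text inside each file is illustrative, since the source names the files but does not prescribe their initial contents.

```python
import tempfile
from pathlib import Path

# Assumption: seed contents are placeholders, not a prescribed template.
SEEDS = {
    "PLAN.md": "# Plan\n\nGoal:\n\nInitial ideas:\n",
    "EXPERIMENTS.md": "# Experiments\n\nOne entry per attempt: title, what was tried, result, verdict.\n",
    "EXPERIMENT_NOTES.md": "# Experiment notes\n\nChronological scratchpad; newest entries at the bottom.\n",
}


def scaffold_ledgers(root: Path) -> list[Path]:
    """Create any of the three files that do not exist yet; never overwrite existing ones."""
    created = []
    root.mkdir(parents=True, exist_ok=True)
    for name, seed in SEEDS.items():
        path = root / name
        if not path.exists():
            path.write_text(seed, encoding="utf-8")
            created.append(path)
    return created


# Usage in a throwaway directory: the second call finds all three files and creates nothing.
with tempfile.TemporaryDirectory() as tmp:
    first = scaffold_ledgers(Path(tmp))
    second = scaffold_ledgers(Path(tmp))
```

Refusing to overwrite matters when a run is resumed: the files are the loop's memory, so a setup script must not reset them.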

EXPERIMENTS.md · excerpt · protein-structure run · goal: improve TM-score on NanoFold

#01

Add rotary positional embeddings to MSA block.
Replaces sinusoidal PE in the MSA self-attention path. Re-ran NanoFold eval. → +0.018 TM-score (kept).

#02

Increase MSA depth from 4 → 6 blocks.
Hypothesis: under-parameterised vs reference. → OOM on eval batch; reverted.

#03

Same as #02, with gradient checkpointing on MSA.
Re-ran cleanly. → −0.004 TM-score (no improvement); reverted.

#04

Triangle multiplicative update gating.
Per literature note in EXPERIMENT_NOTES.md#L412. → +0.011 TM-score, stable across seeds (kept).

#05

Combine #01 + #04 with shared LayerNorm.
Suspect interaction; verifying. → +0.034 TM-score cumulative; promoted to baseline.

Why EXPERIMENTS.md earns the asterisk

Of the three, the ledger is the most leveraged. It’s the file the agent re-reads to avoid repeating a failed direction; it’s the file you skim to confirm the loop hasn’t gone off the rails; and it’s the file that survives long after a session’s context window is gone. PLAN.md is forward-looking and ages quickly. EXPERIMENT_NOTES.md is high-fidelity but voluminous. EXPERIMENTS.md is the curated, append-only record of what the loop actually learned.

The structure is also forgiving. If a column is missing or a row is sloppy, the next pass cleans it up. The schema isn’t precious; what matters is that every attempt gets a line, every line has a verdict (kept, reverted, promoted), and the file never lies about the past.
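The two invariants named above (every attempt gets a line, every line has a verdict) are easy to enforce mechanically. Here is a minimal sketch, with hypothetical helper names and an entry layout modeled loosely on the excerpt; in practice the agent writes these entries itself, so treat this as a picture of the contract, not a required tool.

```python
import tempfile
from pathlib import Path

VERDICTS = {"kept", "reverted", "promoted"}


def format_entry(number: int, title: str, description: str,
                 result: str, verdict: str) -> str:
    """Render one ledger entry; refuse entries that lack a recognized verdict."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}, got {verdict!r}")
    return f"#{number:02d}\n\n{title}\n{description} → {result} ({verdict}).\n\n"


def append_entry(ledger: Path, entry: str) -> None:
    """Append only: mode 'a' never rewrites what is already in the file."""
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(entry)


# Usage, reusing the first entry from the excerpt.
entry = format_entry(
    1,
    "Add rotary positional embeddings to MSA block.",
    "Replaces sinusoidal PE in the MSA self-attention path. Re-ran NanoFold eval.",
    "+0.018 TM-score",
    "kept",
)
with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "EXPERIMENTS.md"
    append_entry(ledger, entry)
    append_entry(ledger, entry.replace("#01", "#02"))
    written = ledger.read_text(encoding="utf-8")
```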

Repurposing the triplet outside ML

The pattern travels. For a refactor: PLAN.md is the migration plan, EXPERIMENTS.md is the per-module attempt log (which approach was tried, what broke, what shipped), EXPERIMENT_NOTES.md is the running commentary. For a research write-up: a section plan, a per-claim verification log, a stream of editorial notes. Same shape, different content.

“Set up a clear, measurable goal, keep the feedback loop tight, and give the agent markdown files to think in. With those three pieces in place, Codex will happily grind for hours, or even days, on your hardest problems.”

Chris Hayduk · the playbook, in full

The playbook, restated

That’s the whole playbook.
Now go run some loops.

C.H.

OpenAI · Codex team · 2026

Editor’s note

Sits directly next to autoresearch in this library.

Hayduk’s three rules describe goal mode as a product feature: what Codex’s /goal command does and how to prompt for it. Karpathy’s autoresearch, archived in this project at autoresearch.html, describes the same idea built by hand: a single-prompt long-running loop, a strict protocol, and a deliberate split between the human surface (the prompt sheet) and the agent surface (the training code).

The triplets line up almost one-for-one. Hayduk’s PLAN.md is Karpathy’s program.md: the human-edited marching orders. Hayduk’s EXPERIMENTS.md is Karpathy’s results.tsv: the curated ledger of every attempt and its verdict. Hayduk’s EXPERIMENT_NOTES.md is the agent’s scratchpad context, which Karpathy leaves implicit because the git branch + tsv together already carry the audit trail.

Cross-reference · Karpathy, “Autoresearch.” In project knowledge as autoresearch.html. See § 02 (The three files) and § 06 (The ledger) for the parallel architecture.
