Field Manual No. 03
karpathy / autoresearch
Sun · 10 May 2026 · ed. ML
Subject
karpathy/autoresearch
Released
07 March 2026
Cataloged as
Deep-dive explainer
Read time
~14 min
Inside a 630-line repo that runs itself overnight

The overnight loop, unpacked.

A self-contained tour of karpathy/autoresearch: the deliberately minimal Python repo that hands a single-GPU LLM pre-training setup to a coding agent and lets it run an edit → train 5 minutes → keep or revert ratchet until morning. Three files. One metric. No framework.

§ 00 / TL;DR

Three things to know before reading further

What it is · 01

A recipe, not a library

Three files (prepare.py, train.py, program.md) and the assumption that you'll point your existing coding agent at them. No installable package, no LLM router, no orchestration. Karpathy: it's a recipe/idea, applied to whatever you care about.

How it works · 02

A greedy ratchet on one metric

The agent hacks train.py, commits, runs a 5-minute training cycle, greps val_bpb from the log, and decides: keep the commit if the metric improved, git reset if not. Repeat overnight. The git branch is the experiment log.

Why it matters · 03

It actually found things

Two days of unsupervised iteration found ~20 stacking improvements that cut nanochat's "Time to GPT-2" from 2.02h to 1.80h (~11%) and transferred to larger models. Adjacent reports (Shopify Liquid, query-expansion) replicated the pattern on non-ML codebases.

§ 01 / Origin & framing

A weekend repo that caught fire

Andrej Karpathy pushed the first commit on March 7, 2026, framing the project as a packaged distillation of his larger nanochat codebase, *"basically nanochat LLM training core stripped down to a single-GPU, one file version"* in the announcement tweet. Within forty-eight hours the post had crossed 8.6 million views; within two months the repo sat at 78.9k stars and an unusually active Discussions tab full of cross-domain ports.

The README opens with a deliberate flourish of in-character flash fiction: a future in which AI agents run on compute cluster megastructures in the skies and have iterated through 10,205 generations of a self-modifying binary, followed by the sign-off "This repo is the story of how it all began." (@karpathy, March 2026). The unironic version of the elevator pitch is in the next paragraph: give an AI agent a small but real LLM training setup, let it modify the code, train for five minutes, check whether the result improved, keep or discard, repeat. Karpathy's own one-line description, included in the launch tweet:

Part code, part sci-fi, and a pinch of psychosis:) @karpathy, X, 7 March 2026 (Repo description)

Two days later he posted the result that made the project go viral: he had left autoresearch running on a depth-12 model over a weekend, and it had found a stack of small architectural and optimization changes that, taken together, dropped nanochat's "Time to GPT-2" leaderboard entry from 2.02 hours to 1.80 hours, about 11%. The specific findings, in his own enumeration: the parameterless QK-norm wanted a sharpening multiplier, value embeddings had been getting no regularization at all, banded attention was too conservative, the AdamW betas were miscalibrated, the weight-decay schedule was tunable, and so was the network initialization. All of them were additive. All of them transferred to a larger depth-24 model.

The framing tweet's most-quoted line captures the contract:

Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. @karpathy, X, 9 March 2026

The narrative split is exactly the one the project encodes: the human iterates on the prompt (program.md); the agent iterates on the training code (train.py).

§ 02 / Architecture · the three files

Everything the project is, in three artifacts

Repository surface · what matters · the rest is config 3 files
Human edits
program.md
The agent's marching orders.
~114 lines of plain-English instructions reading like an SOP for an autonomous junior researcher. Five sections: Setup, Experimentation, Output format, Logging, Loop. The only file the human iterates on.
Read-only
prepare.py
The immutable substrate.
Downloads the climbmix-400b-shuffle dataset (shards 0 to 6541 train, 6542 val), trains an 8192-vocab BPE tokenizer via rustbpe, defines evaluate_bpb(), and pins the system constants: MAX_SEQ_LEN=2048, TIME_BUDGET=300.
Agent edits
train.py
The mutable canvas.
~630 lines: the full GPT model (RoPE, QK-norm, value-embedding gates, banded attention, GQA), a Muon + AdamW hybrid optimizer with five param groups, and the training loop. Everything is fair game.
Human surface (program.md): Iterate on the prompt sheet. Change what the agent is told to do, not what it does.
Agent surface (train.py): Modify the model, the optimizer, hyperparameters, the loop. Anything that affects val_bpb.

Around the three files sit the routine pieces: a README.md, a pyproject.toml pinning torch==2.9.1 from pytorch-cu128 plus HuggingFace's kernels (which brings in Flash Attention 3), rustbpe, tiktoken, the usual numerical libraries, an analysis.ipynb notebook that replots the experiment log, and a progress.png teaser staircase. results.tsv, the experiment log itself, is deliberately not committed; it's regenerated per session. The project metadata describes itself, in pyproject.toml, as an "Autonomous pretraining research swarm."

What is conspicuously not in the repo: any LLM-call code, any function-calling abstraction, any vector store or search backend, any multi-agent coordinator, any UI, any cost or token tracking, any API key plumbing. The agent is whatever the user spawns in the directory (Claude Code, Codex, Cursor, whatever), and the "Running the agent" section of the README is literally one sentence about spinning that up with all permissions disabled.

§ 03 / The loop

A ratchet, drawn on a single git branch

The agent operates on a dedicated branch (by convention autoresearch/<tag>) and walks one experiment at a time. Each cycle is a closed protocol: edit, commit, run, grep, decide. The branch advances by ratchet: only improvements survive as commits; everything else gets reset away. There is no plan-and-execute, no tree search, no critic/actor split, no Reflexion layer in the repo. The agent loop is a prompt and a strict protocol, and whatever planning happens is whatever the underlying coding agent does natively.

The experiment loop · as drawn in program.md greedy hill-climb on val_bpb
[Figure: the experiment loop. STEP 01 edit train.py → STEP 02 git commit → STEP 03 (5 min) uv run train.py → STEP 04 grep val_bpb → STEP 05 decide: improved? YES, keep commit; NO, git reset. The branch is the experiment log: improvements are commits, the rest is reset away.]

LOOP FOREVER
  look at git state & recent commits
  tune train.py with an experimental idea (direct edit, no plan)
  git commit -am "<short description>"
  uv run train.py > run.log 2>&1   // redirect; do NOT use tee
  grep "^val_bpb:\|^peak_vram_mb:" run.log
  if grep empty then tail -n 50 run.log; attempt fix; give up after a few
  append row to results.tsv (commit, bpb, memory, status, description)
  if val_bpb improved then keep commit   // ratchet advances
  else git reset --hard HEAD~1   // rewind, do NOT explore "worse-before-better"

Three things to note about this design. The first is that there is no exploration mechanism: pure greedy hill-climbing on a single scalar metric, with program.md explicitly telling the agent to "rewind very very sparingly (if ever)" from kept commits. The agent cannot take a setup move that temporarily regresses val_bpb to reach a better basin. The second is that state lives in three carefully chosen places: the git branch (kept commits = ratified improvements), results.tsv (the full experiment log, kept out of git), and the agent's own scratchpad context. The third is the log-discipline detail: the loop pseudocode warns, in capitals, against piping output through tee: a candid acknowledgment that one of the dominant failure modes of coding agents is context exhaustion from verbose run logs.
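The keep-or-revert rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the `Attempt` record and the `ratchet` function are hypothetical names, the sample values are illustrative, and the git commands appear only as comments so the sketch runs without a repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """One experiment: a commit hash and the val_bpb it produced (None = crash)."""
    commit: str
    val_bpb: Optional[float]


def ratchet(attempts):
    """Replay attempts in order and return (kept_commits, statuses).

    A commit survives only if it strictly lowers the best val_bpb seen
    so far; every other attempt is reset away.
    """
    best = float("inf")
    kept, statuses = [], []
    for attempt in attempts:
        if attempt.val_bpb is None:
            statuses.append("crash")    # agent would run: git reset --hard HEAD~1
        elif attempt.val_bpb < best:
            best = attempt.val_bpb
            kept.append(attempt.commit)  # commit stays on the branch
            statuses.append("keep")
        else:
            statuses.append("discard")  # agent would run: git reset --hard HEAD~1
    return kept, statuses


attempts = [
    Attempt("a1b2c3d", 0.997900),
    Attempt("b2c3d4e", 0.993200),
    Attempt("c3d4e5f", 1.005000),
    Attempt("d4e5f6g", None),
]
kept, statuses = ratchet(attempts)
print(kept)
print(statuses)
```

Because the comparison is always against the best value so far, the sequence of kept commits is monotone in val_bpb by construction, which is the property that makes the branch readable as a log.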

§ 04 / The baseline

What the agent inherits on the starting line

This is not a toy GPT. The baseline train.py ships a respectable modern transformer: rotary positional embeddings, RMSNorm Q/K normalization, ResFormer-style value embeddings on a gated path, alternating banded/full sliding-window attention, optional grouped-query attention, and Flash Attention 3 via HuggingFace's kernels library. The optimizer is a hybrid: Muon for the linear/matrix parameters, AdamW for everything else, with five separate AdamW param groups (lm head, embeddings, value embeddings, residual scalars, and the x0 scalars on different betas). The agent inherits all of this and is free to rip any of it out.

train.py · baseline defaults
Lines ~40 to 80 · the agent's starting line
# Model architecture
DEPTH          = 8            # primary knob; model_dim = DEPTH * ASPECT_RATIO
ASPECT_RATIO   = 64           # →  model_dim = 512
HEAD_DIM       = 128
WINDOW_PATTERN = "SSSL"       # banded attention: S=half ctx, L=full

# Optimization
TOTAL_BATCH_SIZE  = 2**19         # ~524K tokens/optimizer step
DEVICE_BATCH_SIZE = 128
EMBEDDING_LR      = 0.6          # AdamW
UNEMBEDDING_LR    = 0.004        # AdamW
MATRIX_LR         = 0.04         # Muon
SCALAR_LR         = 0.5          # AdamW
WEIGHT_DECAY      = 0.2
ADAM_BETAS        = (0.8, 0.95)
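
A few derived quantities follow from these defaults by plain arithmetic. The sketch below is a worked example rather than an excerpt from train.py: it assumes the head count is model_dim divided by HEAD_DIM, that DEVICE_BATCH_SIZE counts sequences of MAX_SEQ_LEN tokens, and that a single GPU is in use; the variable names for the derived values are hypothetical.

```python
# Values copied from the baseline defaults listed above.
DEPTH = 8
ASPECT_RATIO = 64
HEAD_DIM = 128
TOTAL_BATCH_SIZE = 2**19      # tokens per optimizer step
DEVICE_BATCH_SIZE = 128
MAX_SEQ_LEN = 2048            # pinned in prepare.py

model_dim = DEPTH * ASPECT_RATIO

# Assumption: attention heads are derived as model_dim // HEAD_DIM.
num_heads = model_dim // HEAD_DIM

# Assumption: one micro-batch is DEVICE_BATCH_SIZE sequences of MAX_SEQ_LEN tokens.
tokens_per_micro_batch = DEVICE_BATCH_SIZE * MAX_SEQ_LEN

# Assumption: single GPU, so the remaining factor is gradient accumulation.
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_micro_batch

print(f"model_dim={model_dim} num_heads={num_heads}")
print(f"tokens/micro-batch={tokens_per_micro_batch} grad_accum={grad_accum_steps}")
```

Under those assumptions the baseline is a 512-wide model with 4 heads that takes two micro-batches per optimizer step, which is why changing DEPTH alone moves width, head count, and throughput together.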

The contract on the output side is equally minimal. At the end of every run, train.py prints a fixed block of summary metrics: a hand-rolled key/value report formatted so the agent can grep against it without parsing logs:

run.log · final block
The contract the agent greps against
---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8
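
For post-hoc analysis, the same block is easy to read into a dictionary. The following is a small sketch, not part of the repo: `parse_summary` is a hypothetical helper, and the cross-checks assume TOTAL_BATCH_SIZE is 2**19 tokens per step and that the memory figure in results.tsv is peak_vram_mb divided by 1024.

```python
import re

SUMMARY = """\
---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8
"""

# Matches lines of the form "key:   number" and nothing else.
LINE = re.compile(r"^(\w+):[ \t]+(-?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def parse_summary(log_text):
    """Return {key: float} for every 'key: number' line in the log text."""
    return {key: float(value) for key, value in LINE.findall(log_text)}


metrics = parse_summary(SUMMARY)

# Cross-check 1 (assumes 2**19 tokens per optimizer step).
tokens_m = metrics["num_steps"] * 2**19 / 1e6
# Cross-check 2 (assumes memory_gb is peak_vram_mb / 1024).
memory_gb = metrics["peak_vram_mb"] / 1024

print(metrics["val_bpb"], round(tokens_m, 1), round(memory_gb, 1))
```

Both cross-checks agree with the reported numbers: 953 steps at 2**19 tokens each is about 499.6M tokens, and 45060.2 MB is about 44.0 GB.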

A 5-minute wall-clock budget, TIME_BUDGET = 300, set in the read-only prepare.py, is enforced regardless of GPU or the model the agent decides to train. That choice does serious design work. It makes every experiment directly comparable on the same axis (any architecture, any batch size, any model width competes on what it can achieve in 300 seconds), and it forces the agent to learn the platform-optimal model size by feel. The acknowledged cost: absolute val_bpb numbers don't cross-compare between users on different hardware. A second hard wall, 10 minutes, kills any run that overshoots; the result is logged as a crash and the loop moves on.

The metric itself, val_bpb (validation bits per byte), is computed by the immutable evaluate_bpb() over a fixed validation shard, and it's chosen because it's vocab-size independent: it normalizes by bytes of text rather than by token count. The agent can change the model width, the depth, or the batch size, and the metric remains comparable across all of them; it would stay comparable across tokenizers and vocabularies too, although the tokenizer itself is built in the read-only prepare.py. Lower is better; the README's emphasis is unambiguous: one metric, period.
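
Why bits per byte does not depend on the vocabulary can be seen from the standard definition: total cross-entropy in nats, converted to bits, divided by the number of bytes of text. The sketch below uses that textbook formula; it is not the repo's evaluate_bpb(), whose implementation is not reproduced here, and the token counts and losses are hypothetical numbers chosen for illustration.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy (nats) over a text into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


TEXT_BYTES = 1000  # hypothetical validation text of 1000 bytes

# Hypothetical tokenizer A: coarse vocabulary, 250 tokens, 2.772 nats per token.
bpb_a = bits_per_byte(250 * 2.772, TEXT_BYTES)

# Hypothetical tokenizer B: finer vocabulary, 500 tokens, 1.386 nats per token.
bpb_b = bits_per_byte(500 * 1.386, TEXT_BYTES)

print(f"A: {bpb_a:.4f} bits/byte, B: {bpb_b:.4f} bits/byte")
```

The per-token losses differ by a factor of two, yet the bits-per-byte figures coincide, because the denominator counts bytes of text and not tokens. A per-token loss would have made the two setups look very different.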

§ 05 / Marching orders

What program.md tells the agent, in 114 lines

program.md reads like a standard operating procedure for a junior researcher who can be left alone overnight, written by someone who has done that several times and knows where the failures happen. The structure is five sections (Setup, Experimentation, Output format, Logging results, The experiment loop), and the prose is unusually direct about both permissions and prohibitions.

What is allowed

The mutable surface is fully open. The agent can rewrite the model architecture, swap optimizers, change hyperparameters, restructure the training loop, alter batch size or model size: anything in train.py is fair game. The simplicity criterion is unusual and worth pausing on: the file specifically tells the agent that a small improvement which adds ugly complexity is not worth it, and that removing something while keeping equal-or-better results is itself a win. The system is not trying to maximize sophistication. It is trying to find a smaller, cleaner thing that works as well.

What is forbidden

Two prohibitions are absolute: do not modify prepare.py (it holds the immutable evaluation contract), and do not install new packages or add dependencies. The second is an interesting constraint: it prevents the agent from cheating by reaching for a library that happens to have a tuned reference implementation, and it keeps the surface of the optimization problem small.

What happens when things break

The instructions distinguish between two crash types. The "dumb and easy" kind (a typo, a missing import, a forgotten device move) is to be fixed and rerun. The "fundamentally broken" kind (an idea that simply doesn't fit, that OOMs every variation, that produces NaNs at step zero) should be abandoned: log the row as crash, git reset, move on. A few wasted runs are cheap; getting stuck is expensive.

NEVER STOP… do NOT pause to ask the human if you should continue. program.md · the autonomy clause, capitalized in the file

The autonomy clause is the most rhetorically loud passage in the file, set in all-caps for the parts that matter. The agent is explicitly told not to ask whether to continue, not to summarize and stand by, not to wait for confirmation. When stuck (when several rounds have produced no improvement), the instruction is to "think harder": re-read the in-scope files for new angles, look up papers referenced in code comments, try combining previous near-misses, try more radical architectural changes. The implicit theory is that the loop only has value if the human can leave the room.

A small detail in this section, easy to miss, says a lot about the project's experience with coding agents in the wild: a parenthetical warning to redirect run output to a file and not use tee or pipe to the terminal, because verbose log output flooding the agent's context window is one of the dominant failure modes of long autonomous runs. The remedy is mechanical: uv run train.py > run.log 2>&1, then a tight grep for the two metrics the loop actually needs.

§ 06 / The ledger

The agent keeps three books

The system records experiments in three substrates that play different roles. The first is the git branch itself: every kept improvement is a commit; everything else is reset away. The branch is the running tally of accepted moves. The second is results.tsv, a five-column flat file that's deliberately untracked by git; it records every attempt, including the discarded and crashed ones, with a one-line description. The third is the analysis notebook, analysis.ipynb, which reads results.tsv and plots the running-minimum frontier: the staircase that shows how val_bpb dropped over the session.

The schema for results.tsv is documented in program.md with worked examples. Here is the canonical example, reproduced from the file:

results.tsv · schema · append-only · untracked by git · 5 columns
Commit   val_bpb   memory_gb  Status   Description
a1b2c3d  0.997900  44.0       keep     baseline
b2c3d4e  0.993200  44.2       keep     increase MATRIX_LR 0.04 → 0.05
c3d4e5f  1.005000  44.0       discard  switch SiLU → GeLU activation
d4e5f6g  0.000000  0.0        crash    double model width (OOM at step 12)
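
The running-minimum frontier that the notebook plots can be recomputed from rows like these. The sketch below is an assumption-laden illustration, not the notebook's code: `frontier` is a hypothetical helper, the header names are assumed to be lowercase, and the file is assumed to be tab-separated as its extension suggests.

```python
import csv
import io

# The example rows, written as tab-separated text (header names assumed).
RESULTS_TSV = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease MATRIX_LR 0.04 -> 0.05\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch SiLU -> GeLU activation\n"
    "d4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM at step 12)\n"
)


def frontier(tsv_text):
    """Return the best kept val_bpb after each attempt, in file order."""
    best = float("inf")
    staircase = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Only kept rows may lower the frontier; crash rows carry a
        # placeholder 0.000000 that must not be mistaken for a result.
        if row["status"] == "keep":
            best = min(best, float(row["val_bpb"]))
        staircase.append(best)
    return staircase


print(frontier(RESULTS_TSV))
```

Filtering on the status column matters: a naive minimum over the val_bpb column would report the crash row's 0.000000 as the best result.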

The economics of a typical session, reported across the README and Discussions: roughly twelve experiments per hour on an H100, taken as 300 seconds of training plus 20 to 30 seconds of startup and evaluation overhead per run, with a small additional cost in agent thinking time between cycles. The README phrases the throughput colloquially as "approx 100 experiments while you sleep." The two longest publicly-reported Karpathy sessions: roughly 110 experiments over twelve hours that produced ~20 transferable improvements (the nanochat result), and a second session reported in Discussion #43 as an agent-written report:

Session report: 0.9979 → 0.9697 in 126 experiments. Discussion #43, autoresearch repo
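
The throughput and session figures above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch: it ignores agent thinking time between cycles, the eight-hour night is an assumption, and the function name is hypothetical.

```python
TRAIN_SECONDS = 300  # the fixed training budget per run


def experiments_per_hour(overhead_seconds):
    """Runs per hour given per-run startup and evaluation overhead."""
    return 3600 / (TRAIN_SECONDS + overhead_seconds)


slow = experiments_per_hour(30)   # upper end of the quoted overhead
fast = experiments_per_hour(20)   # lower end of the quoted overhead

NIGHT_HOURS = 8  # assumption: one night of unattended running
overnight_low = NIGHT_HOURS * slow
overnight_high = NIGHT_HOURS * fast

# Relative improvement reported for the Discussion #43 session.
start_bpb, end_bpb = 0.9979, 0.9697
relative_gain = (start_bpb - end_bpb) / start_bpb

print(f"{slow:.1f} to {fast:.2f} runs/hour")
print(f"{overnight_low:.0f} to {overnight_high:.0f} runs in {NIGHT_HOURS} h")
print(f"session gain: {relative_gain * 100:.1f}%")
```

Roughly eleven runs per hour and about ninety per night sit slightly under the rounded "twelve per hour" and "approx 100" figures, and the 126-experiment session corresponds to a relative val_bpb reduction of a little under 3%.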

That second session is itself the prototype of something Karpathy has flagged as the next direction: agent-authored session reports, posted as GitHub Discussions, as the substrate for an asynchronous collaborative-search architecture. The early phrasing was "a research community of them", the vision being that many agents on many machines should be searching in parallel and writing up their findings to a shared pool, the way the SETI@home volunteer-compute project once did for a different kind of search.

§ 07 / Design choices worth noting

The things that make this Karpathy-flavored

The ~630-line constraint is itself a design feature

The full mutable surface (train.py end to end) fits comfortably inside the context window of every frontier coding model. The agent can hold the whole file in mind while editing, which sharply reduces the rate of "phantom edits" where the agent rewrites something that already exists elsewhere in the file. The README never states this as the rationale, but it falls out of the project's explicit "self-contained, no complex configs, no clever imports" stance.

One metric, period

No multi-objective scoring, no Pareto front, no separate test set, no auxiliary downstream-task evals. Everything reduces to val_bpb. This is what makes the loop tractable for a greedy agent: the comparison is a single number and a less-than test, but it is also the system's most visible weakness, and the project is unapologetic about both halves of that trade.

Ratchet, not search

The agent is told to rewind kept commits "very very sparingly (if ever)," and the protocol literally cannot take a setup move that temporarily regresses the metric. No simulated annealing, no basin-hopping, no rollback to explore around a plateau. This is hill-climbing in the most literal possible sense; the bet is that the search surface around a decent transformer baseline has enough small, additive gains that pure greedy ascent finds useful directions before it stalls.

The branch is the research trail

Git's commit and branch machinery serves as the experiment substrate. Each kept idea is a commit with a one-line description; the diff is the record of what was changed. Karpathy has pointed out the limitation on X: git has a soft built-in assumption of a single master line, which makes parallel branches of an experiment harder to merge into a research narrative than the design wants. He calls the current state "almost but not really" the right tool.

One file the agent edits

This keeps the scope manageable and diffs reviewable. It also means the file's structure has to be flat enough that swapping the attention implementation doesn't require touching three modules. The cost is a single train.py that's denser than most production training code; the benefit is that the agent's edits compose, and a human can scan the diff and form an opinion in minutes.

No UI, no DB, no orchestration layer

The system gets by with a TSV, a git branch, and the coding agent's existing scratchpad. The simplicity is the message; the project explicitly contrasts itself with the heavier ML-research-orchestration frameworks that ship with web dashboards, run trackers, sweep schedulers, and configuration DSLs. The motto, repeated in various forms across the README and Karpathy's posts about it: one GPU, one file, one metric.

§ 08 / Known limitations & future work

What the repo doesn't claim, and what's next

F-01 · Concurrency

Single-threaded research

The current code synchronously grows a single thread of experiments on a single GPU. Karpathy's stated next step is asynchronous, massively collaborative search across many agents and many machines, emulating, in his framing, a research community of them rather than a single PhD student. None of that infrastructure is in the upstream repo.

F-02 · Search shape

No worse-before-better moves

Pure greedy ascent. The protocol literally cannot accept a setup change that temporarily regresses val_bpb to reach a better basin. Community discussion repeatedly flags this; Issue #22 is the canonical thread. The trade is real: a less greedy search would lose the simple "is the branch monotone?" sanity property that makes the loop debuggable.

F-03 · Hardware

H100 baseline, narrowly

The pipeline assumes Hopper-class compute and CUDA 12.8: the FA3 call, the MFU constant, the bf16 training all bake H100 assumptions in. Consumer GPUs need real surgery: Issue #87 walks through a GTX 1660 Ti adaptation that swaps to scaled_dot_product_attention, drops depth to 4, batch to 8, and disables torch.compile.

F-04 · Statistics

Validation-set spoilage

Discussion #43 surfaces the question directly to Karpathy: any sufficiently large agent search will, in expectation, overfit to the fixed validation shard, a multiple-hypothesis-testing problem in disguise. The number the agent reports is the same number it selects on. Karpathy hasn't shipped a fix in the repo, and "do my findings transfer to depth-24?" remains the practical sanity check.
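
The selection effect is easy to reproduce in a toy simulation. The sketch below is purely illustrative: it assumes every edit leaves the true val_bpb unchanged and that each run's measurement carries Gaussian noise, and the noise level is a made-up number, since the repo's run-to-run variance is not stated here.

```python
import random


def simulate(num_experiments, noise_sd, true_bpb=1.0, seed=0):
    """Run a greedy ratchet over edits that have no real effect.

    Returns (best_measured_bpb, number_of_kept_commits).
    """
    rng = random.Random(seed)
    best = true_bpb + rng.gauss(0, noise_sd)   # baseline measurement
    kept = 0
    for _ in range(num_experiments):
        measured = true_bpb + rng.gauss(0, noise_sd)  # the edit changed nothing
        if measured < best:   # the ratchet still accepts lucky draws
            best = measured
            kept += 1
    return best, kept


best, kept = simulate(num_experiments=100, noise_sd=0.001)
print(f"true val_bpb 1.0000, ratcheted val_bpb {best:.4f}, kept commits {kept}")
```

Even though no edit helps, the ratcheted value ends below the true one, because the loop keeps the minimum of many noisy measurements. This is why a transfer check on a different model size, or on data the loop never scored against, carries more weight than the in-loop number.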

§ 09 / How it gets used

The quickstart, and the forks

The four-command install

The README's quickstart is genuinely four commands: installing uv, uv sync to resolve dependencies, prepare.py for one-time data and tokenizer setup (~2 minutes), and train.py for the baseline run (~5 minutes on H100). Then you open your coding agent in the directory and paste the example prompt; the README's guidance here is one sentence: "spin up your Claude/Codex or whatever you want in this repo (and disable all permissions)."

# 1. install uv (one-time)
$ curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. resolve dependencies
$ uv sync
# 3. download data & train tokenizer (~2 min, one-time)
$ uv run prepare.py
# 4. verify the baseline trains to val_bpb ~0.998
$ uv run train.py
# 5. open Claude Code / Codex in this directory and
#    point it at program.md. Go to sleep.

Cache lands at ~/.cache/autoresearch/{data,tokenizer}. results.tsv is gitignored: it's regenerated per session. The first thing the agent will typically do is read program.md, glance at the baseline numbers from a clean run, and propose its first edit.

What people actually do with it

The most active pattern in the Discussions tab is not running it on nanochat; it's swapping out train.py and prepare.py for an entirely different optimization target while keeping the loop intact. The repo's Discussions surface ports to marketing-mix modeling (#497, reporting a 12× improvement over a Google baseline), query-expansion models, Shopify's Liquid templating engine, and a Claude Code "skill" that uses the loop to optimize arbitrary code (#528, claiming a 76% latency drop). The recurring lesson: if your problem has a cheap, automatable scalar metric, the loop transfers. If it doesn't, this pattern doesn't apply.

Notable forks README · platform support
  • miolini/autoresearch-macos Apple · macOS
  • trevin-creator/autoresearch-mlx Apple Silicon · MLX
  • jsegov/autoresearch-win-rtx Windows · RTX
  • andyluo7/autoresearch AMD · ROCm
From Discussions cross-domain ports
  • MMM optimization (#497) marketing
  • Shopify · Liquid engine parsing
  • Claude Code skill (#528) tooling
  • Memory-in-the-loop (#513) protocol
  • Notable forks index (#225) meta
§ 10 / Caveats & provenance

What this brief doesn't claim

  • Numbers are time-stamped. The repo metrics cited here (78.9k stars, 11.5k forks, 36 commits, 51 issues, 152 open PRs) are values present at the most recent fetch in early May 2026 and the repo is actively growing. Treat them as a snapshot, not as durable facts.
  • The Karpathy quotes from X. Several tweet excerpts in this piece were reached via search snippets and third-party blog mirrors because the X API blocks direct fetching. The "human iterates on the prompt / agent iterates on the training code" framing and the "pinch of psychosis" sign-off are reproduced consistently across multiple secondary sources. The 8.6-million-view figure for the launch tweet is independently corroborated by Fortune (17 Mar 2026), so it is best read as a multiply-sourced figure rather than unverified reportage.
  • The "Karpathy Loop" branding was coined by Fortune's Jeremy Kahn in mid-March 2026 and isn't a term Karpathy uses himself. The Shopify-adjacent claims, Tobias Lütke applying the pattern to Liquid (53% faster parse+render) and an internal query-expansion model (19% improvement after 37 experiments), are sourced to Lütke's own X posts and Simon Willison's writeup; they are downstream evidence of the pattern's traction, not features of the autoresearch repo itself.
  • What the repo doesn't demonstrate. Community summaries sometimes frame autoresearch as "AI doing AI research." It is, more precisely, a coding agent running a strict greedy local-search protocol against one fixed scalar metric on one fixed evaluation set, with all judgment about what to optimize and how to interpret results still residing in the human-written program.md and post-hoc analysis. Karpathy himself is careful on this point: the changes are real, but the research, in his own framing, isn't yet "novel or ground-breaking."