# AutoResearch Repo: How It Works & How To Build Your Own

Source: [karpathy/autoresearch](https://github.com/karpathy/autoresearch) (MIT). "AI agents running research on single-GPU nanochat training automatically."

*"The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model."* — Karpathy

## What's in the repo

Only \~10 files. The three that matter:

| File | Role | Who edits it |
| :---- | :---- | :---- |
| `prepare.py` | Fixed constants, data download, BPE tokenizer training, dataloader, `evaluate_bpb` — the ground-truth metric harness. | **Never modified** (by anyone). |
| `train.py` | Full GPT model, optimizer (Muon \+ AdamW), training loop. All hyperparams and architecture live here. | **Agent modifies this**, every experiment. |
| `program.md` | Lightweight "skill" / system prompt that tells the agent how to run the loop, what it may/may not touch, how to log results. | **Human modifies this** to tune research-org behavior. |

Supporting: `pyproject.toml` (deps), `uv.lock`, `analysis.ipynb` (plot `results.tsv`), `progress.png`.

## Quick start (the original setup)

Requirements: a single NVIDIA GPU (tested H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh     # 1. install uv
uv sync                                              # 2. deps
uv run prepare.py                                    # 3. data + tokenizer (~2 min, one-time)
uv run train.py                                      # 4. sanity-check a single 5-min run
```

Then run your coding agent in the repo with permissions disabled, and prompt: *"Have a look at program.md and let's kick off a new experiment."*

## The 9-step ratchet loop (from `program.md`)

Everything the agent does, autonomously, on a dedicated `autoresearch/<tag>` branch:

1. Inspect current git state (branch, HEAD commit).  
2. Modify `train.py` with one experimental idea.  
3. `git commit`.  
4. Run: `uv run train.py > run.log 2>&1` (redirect — never `tee`, never flood the agent's context).  
5. Extract metric: `grep "^val_bpb:\|^peak_vram_mb:" run.log`.  
6. If empty → crashed. `tail -n 50 run.log`, attempt fix, or mark crash and move on.  
7. Append a row to `results.tsv` (tab-separated: commit, val\_bpb, memory\_gb, status, description). **Do not commit `results.tsv`.**  
8. If `val_bpb` improved → keep the commit (ratchet advances).  
9. Else → `git reset --hard` back to the prior commit. Loop forever. Never ask the human whether to continue.

Hard rules enforced by `program.md`:

- Only `train.py` may change. `prepare.py` and the eval harness are read-only.  
- No new dependencies.  
- 5-minute training wall-clock budget; if a run exceeds 10 min, kill and mark as failure.  
- VRAM is a soft constraint; simplicity is a tiebreaker (deleting code for equal results \= keep).

## Building an AutoResearch on *anything* — the generalized recipe

The framework generalizes to any domain that satisfies three conditions:

1. **One editable artifact.** A single file (or tightly scoped directory) the agent mutates. Bigger scope \= unreviewable diffs \+ ratchet noise.  
2. **An automatic, numeric, hard-to-game metric.** One scalar, higher-or-lower-is-better. If a human has to judge quality, the loop breaks.  
3. **A fixed per-experiment time budget.** Makes runs comparable regardless of what the agent changed, and bounds iteration cost.

### Step-by-step build

**1\. Pick the artifact and the metric.**

- Artifact examples: a training script, a SQL query, a prompt/skill file, a trading strategy `.py`, a CSS/HTML landing page, a RAG retriever config, a physics-formula file.  
- Metric examples: validation loss, Sharpe ratio, conversion rate (from a scoring harness), MRR, RMSE, pass-rate on a held-out test set, p95 latency.  
- If your "metric" needs human taste — stop. Either build an LLM-as-judge harness with a frozen rubric and calibrate it against human labels first, or pick a different target.

**2\. Write the read-only evaluation harness (your `prepare.py` analog).**

- Downloads/loads fixed data.  
- Produces one number deterministically (seed everything that can be seeded).  
- Prints the metric in a grep-friendly format: `metric_name: 1.234567`.  
- Enforces the time budget (e.g. `signal.alarm`, or check wall-clock inside the loop).  
- **Commit to never editing it.** If the harness moves, historical results become incomparable — the whole ratchet is invalidated.

**3\. Write the baseline editable artifact (your `train.py` analog).**

- Must run end-to-end producing the metric.  
- Keep it one file if at all possible. Inline the knobs the agent should tune.  
- No hidden config files; the agent should see everything by reading this one file.

**4\. Write `program.md` — the agent's skill.** Structure it in three sections (mirror Karpathy's):

- **Setup**: agree a run tag, create `autoresearch/<tag>` branch, read in-scope files, verify data/harness works, initialize `results.tsv` with a header.  
- **Experimentation**: what the agent can/cannot touch, the metric, the time budget, the simplicity tiebreaker, how to handle crashes.  
- **The loop**: the 9 steps verbatim, with a `NEVER STOP` clause. Include the exact shell commands (redirect to `run.log`, grep pattern, `git reset` command).

**5\. Create `results.tsv`** with columns `commit\tmetric\tresource\tstatus\tdescription`. Tab-separated (commas break on free-text descriptions). Keep it untracked.

**6\. Lock the environment.** `uv` \+ `pyproject.toml` \+ lockfile so runs are reproducible across the 100+ experiments in a session.

**7\. Launch.** Start Claude Code / Codex in the repo, disable permission prompts, prompt: *"Read program.md and begin."* Walk away.

**8\. Morning review.** Read `results.tsv`, plot it (`analysis.ipynb`\-style), pick surviving commits, decide whether to edit `program.md` (add hints, rule out dead ends, add new in-scope references) and relaunch.

### Domain-specific adaptations

- **Smaller GPU / MacBook:** shrink the problem, not the loop. Use TinyStories, cut `MAX_SEQ_LEN` to 256, drop `DEPTH` to 4, use `WINDOW_PATTERN="L"`, shrink `TOTAL_BATCH_SIZE` to `2**14`. See the notable forks (`autoresearch-macos`, `autoresearch-mlx`, `autoresearch-win-rtx`, AMD fork) for concrete patches.  
- **Non-ML domains:** replace `train.py` with the mutable artifact (a strategy file, a prompt, a page), replace the GPU run with your evaluator (backtest, A/B scorer, eval suite), keep the ratchet unchanged.  
- **Multiple metrics:** collapse to one scalar (weighted sum, or hard-constraint \+ single objective). Multi-objective breaks the ratchet.  
- **Expensive evaluations:** shrink the per-run budget ruthlessly — 12 runs/hour is the design target. A 30-min eval kills the loop.

### Common pitfalls

- Letting the agent edit the harness "just to fix a bug" → invalidates every prior result.  
- Non-deterministic metric (seed leaks, wall-clock dependencies) → the ratchet accepts noise.  
- Metric game-ability (e.g. early-exit, caching, hardcoding the test set) → agent will find it. Assume adversarial optimization.  
- Too-broad editable surface → diffs become unreviewable and crashes multiply.  
- Letting `run.log` stream into agent context → blows the context window in minutes. Always redirect to a file and grep.  
- Asking the agent "should I continue?" prompts in `program.md` → defeats the whole point. Make `NEVER STOP` explicit.

### Minimum viable AutoResearch checklist

- [ ] One file the agent edits.  
- [ ] One read-only harness that prints one number.  
- [ ] One `program.md` with setup \+ rules \+ 9-step loop \+ `NEVER STOP`.  
- [ ] One `results.tsv` for the ratchet history.  
- [ ] Fixed per-run time budget enforced by the harness.  
- [ ] Git branch per run, `git reset --hard` on regressions.  
- [ ] Agent launched with permissions disabled, context redirected to log files.

That's the whole framework. Everything else — the GPU, the transformer, the BPE tokenizer — is domain detail.  
