AutoResearch on REST-bench
Now you build the loop yourself
In Exercise 5 the loop was already wired — you ran /autoresearch and watched. This time you build it. Same shape, different domain: a restaurant simulator instead of a propensity model.
Stuck?
Don't see the /autoresearch command?
Two fallbacks:
- Fully restart the Claude desktop application — quit it, don't just close the window. Project-scoped slash commands only refresh on a full restart.
- If it still doesn't appear, paste this prompt instead:
Read .claude/commands/autoresearch.md and follow those instructions
What you're optimizing
You're managing The Rotterdam Table — a casual European bistro in Rotterdam — for 30 days. Your only handle on the world is python -m rest_sim <command>: read state, decide, advance, repeat.
The daily loop
Each day is 2–5 CLI calls. The brain that lives inside decide is .claude/agents/restaurant-manager.md — that's the file your loop will mutate.
What's modeled
Every distribution is calibrated against published service-industry research — citations live inline in rest_sim/distributions.py. The agent reads a 7-day manager_view, never the raw engine state.
The levers you control
The agent's whole job is choosing which of these to fire on which day.
How you're scored
Goal: beat +€24,129 net profit while keeping reputation ≥ 4.0 and satisfaction ≥ 0.75.
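The goal combines one target and two guardrails, so a run only counts as a win when all three hold at once. A minimal sketch of that check follows; the thresholds come from the goal above, while the `MonthResult` class and its field names are hypothetical and not part of REST-bench.

```python
from dataclasses import dataclass

# Thresholds quoted in the goal above.
BASELINE_NET_PROFIT_EUR = 24_129
MIN_REPUTATION = 4.0
MIN_SATISFACTION = 0.75


@dataclass
class MonthResult:
    # Hypothetical field names; map them to whatever your run reports.
    net_profit_eur: float
    reputation: float
    satisfaction: float


def meets_goal(result: MonthResult) -> bool:
    """True only if profit beats the baseline AND both guardrails hold."""
    return (
        result.net_profit_eur > BASELINE_NET_PROFIT_EUR
        and result.reputation >= MIN_REPUTATION
        and result.satisfaction >= MIN_SATISFACTION
    )


# A high-profit month that burned reputation does not count as a win.
print(meets_goal(MonthResult(30_000, 3.8, 0.80)))  # False
print(meets_goal(MonthResult(25_000, 4.2, 0.78)))  # True
```

Treating the guardrails as hard constraints, rather than folding them into a single blended score, keeps a loop from trading reputation for short-term profit.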
Setup
1. Clone the repo.
In your current Claude Code session:
Prompt:
Clone this repo: https://github.com/fjfok/REST-bench
Place it in ~/Documents/github/
This copies the code from GitHub onto your laptop. A new REST-bench/ folder appears inside ~/Documents/github/.
2. Open that folder in a fresh Claude Code session — manually.
Claude Code can't switch its own working directory mid-session, so this step has to be done by hand:
- Quit Claude Code (or open a new window).
- Start a fresh session.
- When it asks which folder to open, double-click the REST-bench folder to open it as the project.

Verify you're in the right place.
Prompt:
What files are in this folder
You should see roughly:
.claude/
.gitignore
README.md
data/
pyproject.toml
rest_sim/
tests/
The exact list may shift over time as the repo evolves — the key signal is that rest_sim/ and .claude/ are present. If you see your other GitHub folders instead (autoresearch-edu, etc.), Claude Code is rooted one level too high — quit and reopen, double-clicking REST-bench this time.
From here on, everything happens inside this new session.
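If you prefer a check you can run yourself, the "key signal" above (both rest_sim/ and .claude/ present) can be tested from a terminal opened in the project folder. This is a small sketch, not a script that ships with the repo; the `missing_dirs` helper is a hypothetical name.

```python
from pathlib import Path

# The two folders the text names as the key signal.
REQUIRED_DIRS = ("rest_sim", ".claude")


def missing_dirs(root: Path) -> list[str]:
    """Return the required folder names that are absent under root."""
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


missing = missing_dirs(Path.cwd())
if missing:
    print("Not the REST-bench root; missing:", ", ".join(missing))
else:
    print("Looks like the REST-bench root.")
```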
3. Check the install.
Prompt:
Check if you have everything installed for running this repo's code
Two errors might happen on this step:
1) Python is not installed (Windows)
Go to python.org/downloads and download the latest Python 3 installer for Windows.
Run the installer. On the first screen, tick the box labelled "Add python.exe to PATH" before clicking Install Now. This is the most common step people miss.
When the installer finishes, open a new Command Prompt (search "cmd" in the Start menu) and verify:
python --version
pip --version
Both commands should print a version number. If you get an error instead (on Windows, typically "'python' is not recognized as an internal or external command"), close and reopen the Command Prompt, or re-run the installer and make sure the PATH box was ticked.
2) numpy missing (Claude will warn you about this)
You can ask Claude to install it, or run:
python3 -m pip install --user numpy
Reproduce the baseline (3 min)
Before optimising anything, run the baseline yourself so you have ground truth.
Prompt:
/play-month 30 20260423
Same args as the baseline above (30 = days, 20260423 = seed). Your run should land near +€24,129 net profit — that's the floor your loop has to beat.
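The seed is what makes the baseline reproducible: a pseudo-random generator started from the same seed yields the same sequence every time. The toy function below illustrates this with Python's standard `random` module. It is a stand-in for illustration only and says nothing about how rest_sim draws its numbers internally; `simulate_arrivals` and the 40 to 120 guest range are made up.

```python
import random


def simulate_arrivals(seed: int, days: int = 5) -> list[int]:
    """Toy stand-in for a simulator: draw one guest count per day."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    return [rng.randint(40, 120) for _ in range(days)]


run_a = simulate_arrivals(20260423)
run_b = simulate_arrivals(20260423)

print(run_a == run_b)  # True: same seed, same sequence
```

Because the randomness is pinned, any difference between two runs with the same seed comes from the manager's decisions, which is exactly what the loop is trying to measure.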
Watch it live
While /play-month runs, REST-bench serves a live dashboard at http://localhost:8765. It auto-launches the first time you init, but you can re-open it any time the sim is running — it streams init, advance, and decision events over SSE, so you can watch the agent's day-by-day choices, KPIs, and arrivals as they happen. If your browser didn't open it automatically, click that link.
If port 8765 is taken, run the dashboard yourself on another port:
python -m rest_sim dashboard --port 8766
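To find out whether a port is already taken before picking another one, you can try connecting to it. The sketch below uses only Python's standard `socket` module; `port_in_use` is a hypothetical helper, not part of rest_sim.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


print("8765 in use:", port_in_use(8765))
```

Note that a `True` result for 8765 may simply mean the dashboard is already running, in which case opening the link is enough.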
The task: Convert this to AutoResearch
Download the AutoResearch primer: AutoResearch Repo: How It Works & How To Build Your Own.
Drag the downloaded .md into Claude Code.
Prompt:
Read the "autoresearch-build-your-own.md" attached.
.claude/agents/restaurant-manager.md is the only file the loop can edit (like the train.py)
Our goal is to maximize end-of-month net profit while keeping reputation healthy (≥ 4.0) and satisfaction ≥ 0.75.
Create a command (.claude/commands/auto-research) that: edits restaurant-manager.md; runs /play-month; writes results; learns; repeats within the limits!
Write results to .tsv
Stuck? Debrief: what should have happened
- A working .claude/commands/auto-research.md you wrote with the agent.
- A .tsv results file with one row per iteration: which mutations were tried, the resulting metrics, kept-or-reverted.
- A new high score that beats the baseline.
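For reference, a results file of that shape could be written with Python's standard `csv` module, as in the sketch below. The column names are assumptions; the exercise only asks for the mutation tried, the resulting metrics, and whether the change was kept or reverted, and your command may lay the file out differently.

```python
import csv
from pathlib import Path

# Hypothetical column names; the exercise only specifies one row per
# iteration with the mutation, the metrics, and kept-or-reverted.
COLUMNS = ["iteration", "mutation", "net_profit_eur", "reputation",
           "satisfaction", "kept"]


def append_result(path: Path, row: dict) -> None:
    """Append one iteration to a tab-separated file, writing the header once."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Calling `append_result(Path("results.tsv"), {...})` once per iteration appends rather than overwrites, so an interrupted loop still leaves a usable history.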
Notice the mapping from Exercise 5:
- train.py → restaurant-manager.md (the only file the loop edits).
- prepare.py → /play-month (the scorer, locked).
- program.md → your new auto-research command (the loop logic).
This mapping is the skill. Once you can do it on REST-bench, you can do it on any process at your company that satisfies the three conditions.
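Stripped of the domain, the three roles reduce to one editable artifact, one locked scorer, and a loop that keeps a change only when the score improves. The sketch below shows that shape with toy stand-ins so it runs anywhere. It is an illustration under assumptions, not the AutoResearch implementation: `hill_climb`, `toy_mutate`, and `toy_score` are hypothetical names, and a real loop would also enforce the reputation and satisfaction guardrails.

```python
import random
from typing import Callable


def hill_climb(
    start: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    iterations: int,
) -> tuple[str, float, list[tuple[int, float, bool]]]:
    """Mutate one artifact, score it with a fixed scorer, keep only improvements."""
    best, best_score = start, score(start)
    history = []
    for i in range(1, iterations + 1):
        candidate = mutate(best)
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        if kept:
            best, best_score = candidate, candidate_score
        # When not kept, `best` is left untouched: that is the "revert".
        history.append((i, candidate_score, kept))
    return best, best_score, history


# Toy stand-ins: the "file" is a string and the "scorer" counts a letter.
# In the exercise these roles are played by restaurant-manager.md and a
# /play-month run.
rng = random.Random(0)


def toy_mutate(text: str) -> str:
    return text + rng.choice("ab")


def toy_score(text: str) -> float:
    return float(text.count("a"))


best, best_score, history = hill_climb("", toy_mutate, toy_score, iterations=10)
print(best, best_score)
```

The scorer never changes inside the loop, which is what makes successive scores comparable.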
Stuck? FAQ for this exercise
Q: Can you explain the problem before we start? What variables, what are we selling, what's the price? A: See What you're optimizing at the top of this page — the venue, the daily loop, what's modeled, the levers, and the score are all there in one snapshot. The short version: we simulate a Rotterdam bistro with tables, menu, staff, and stock. Probabilities come from real service-industry data. We don't model weather or seasonality.
Q: Where does the baseline come from? What's the absolute bottom line? A: If the manager does nothing, the restaurant runs out of food in 6 days and ends the month around -€3,000 to -€5,000. Plain Haiku as manager (no instructions) keeps it afloat at +€16,000. Opus reaches close to €100,000.
Q: What does the seed do? A: It initializes the pseudo-random number generator, which guarantees the same "random" numbers on each run so the experiment is reproducible. Without a fixed seed, a generator is typically seeded from something unpredictable, such as operating-system entropy or the clock, and you can't replicate a run. That's why the baseline pins --seed 20260423: so your numbers match ours.
Q: So the manager is the equivalent of train.py? A: Yes. train.py decides how to train the model in the previous example; restaurant-manager.md decides how to run the business in this one. program.md corresponds to the AutoResearch command (the loop). Mapping the same loop pattern to your domain is the challenging part.
Q: How can I see what day the simulation is on? A: REST-bench ships a live dashboard at http://localhost:8765 — it auto-launches when you init and streams every advance and decision as it happens. If you'd rather watch via files, open results.tsv to see all iterations.
Q: Could a digital twin be built from real operational data instead of random data? A: (Experimental research direction.) Appropriately governed, aggregated operational data across large-scale platforms could, in principle, enable differentiated AI-driven simulations and decision-support systems. The illustrative direction: move from random distributions toward aggregated learnings that better reflect real-world variability, always within the privacy, consent, and competition-law frameworks that apply to any such data.
Q: Is it feasible to use simulated personas to stress-test a new product overnight? A: (Experimental research direction.) That's the conceptual AutoPMF idea. Early prototypes suggested simulated personas can be too "polite": they propose small additions where real humans want whole redesigns, which would call for a more critical "innovation QA" persona. Future research could explore privacy-preserving simulations based on approved, aggregated behavioural patterns (not real user records) to stress-test product variants safely.
Optional: Go deeper
(Illustrative scenario.) Try YC bench as a second benchmark. Same loop scaffolding, completely different domain: the YC startup-evaluation benchmark instead of REST-bench's restaurant simulator.
Clone https://github.com/FlorisFok/AutoResearchYC (built on the yc-bench dataset: https://huggingface.co/datasets/collinear-ai/yc-bench).
Map the three roles from REST-bench:
- restaurant-manager.md → the editable agent file (what does it become here?)
- /play-month → the locked scorer (what's the YC-bench equivalent?)
- auto-research command → the same shape, pointed at the new harness.
Beat the published baseline.
Compare the trajectories side-by-side. If autoresearch finds analogous patterns across two unrelated domains, you've felt the generality of the loop. If it stalls on YC bench, the gap tells you exactly what makes a domain "loop-able" — and what doesn't.