Act II · 11 of 11 · 75 min

AutoResearch on REST-bench

Now you build the loop yourself

In Exercise 5 the loop was already wired — you ran /autoresearch and watched. This time you build it. Same shape, different domain: a restaurant simulator instead of a propensity model.

Stuck? Don't see the /autoresearch command?

Two fallbacks:

Fully restart the Claude desktop application — quit it, don't just close the window. Project-scoped slash commands only refresh on a full restart.
If it still doesn't appear, paste this prompt instead:
Read .claude/commands/autoresearch.md and follow those instructions

What you're optimizing

You're managing The Rotterdam Table — a casual European bistro in Rotterdam — for 30 days. Your only handle on the world is python -m rest_sim <command>: read state, decide, advance, repeat.

The daily loop

Each day is 2–5 CLI calls. The brain that lives inside decide is .claude/agents/restaurant-manager.md — that's the file your loop will mutate.

What's modeled

Every distribution is calibrated against published service-industry research — citations live inline in rest_sim/distributions.py. The agent reads a 7-day manager_view, never the raw engine state.

The levers you control

The agent's whole job is choosing which of these to fire on which day.

How you're scored

Goal: beat +€24,129 net profit while keeping reputation ≥ 4.0 and satisfaction ≥ 0.75.
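The goal amounts to one objective (net profit) and two guardrails (reputation and satisfaction). A minimal sketch of that check, assuming the three metrics arrive as plain numbers; the function, constant, and parameter names are illustrative and are not part of rest_sim:

```python
# Thresholds copied from the goal stated above; names are hypothetical.
BASELINE_NET_PROFIT = 24_129  # euros
MIN_REPUTATION = 4.0
MIN_SATISFACTION = 0.75


def beats_baseline(net_profit: float, reputation: float, satisfaction: float) -> bool:
    """Return True only if profit beats the baseline and both guardrails hold."""
    guardrails_ok = reputation >= MIN_REPUTATION and satisfaction >= MIN_SATISFACTION
    return guardrails_ok and net_profit > BASELINE_NET_PROFIT
```

A run that earns more but drops reputation below 4.0 does not count as a win under this reading, which is why a loop should record all three numbers and not only profit.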

Setup

1. Clone the repo.

In your current Claude Code session:

Prompt:

Clone this repo: https://github.com/fjfok/REST-bench
Place it in ~/Documents/github/

This copies the code from GitHub onto your laptop. A new REST-bench/ folder appears inside ~/Documents/github/.

2. Open that folder in a fresh Claude Code session — manually.

Claude Code can't switch its own working directory mid-session, so this step has to be done by hand:

Quit Claude Code (or open a new window).
Start a fresh session.
When it asks which folder to open, double-click the REST-bench folder to open it as the project.

Verify you're in the right place.

Prompt:

What files are in this folder

You should see roughly:

.claude/
.gitignore
README.md
data/
pyproject.toml
rest_sim/
tests/

The exact list may shift over time as the repo evolves — the key signal is that rest_sim/ and .claude/ are present. If you see your other GitHub folders instead (autoresearch-edu, etc.), Claude Code is rooted one level too high — quit and reopen, double-clicking REST-bench this time.
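If you prefer a scripted check over reading the listing by eye, the sketch below tests for the two marker folders named above. It is an illustration only; the function name is made up and the script is not shipped with the repo:

```python
from pathlib import Path


def looks_like_rest_bench(root: Path) -> bool:
    """True if both marker directories named in the text are present."""
    return (root / "rest_sim").is_dir() and (root / ".claude").is_dir()


if __name__ == "__main__":
    # Checks the directory the script is run from.
    print(looks_like_rest_bench(Path.cwd()))
```

A result of False from inside what you believe is REST-bench usually means the session is rooted one level too high.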

From here on, everything happens inside this new session.

3. Check the install.

Prompt:

Check if you have everything installed for running this repo's code

Two errors might happen on this step:

1) Python is not installed (Windows)

Go to python.org/downloads and download the latest Python 3 installer for Windows.
Run the installer. On the first screen, tick the box labelled "Add python.exe to PATH" before clicking Install Now. This is the most common step people miss.
When the installer finishes, open a new Command Prompt (search "cmd" in the Start menu) and verify:
```
python --version
pip --version
```
Both commands should print a version number. If you instead see "'python' is not recognized as an internal or external command", close and reopen the Command Prompt, or re-run the installer and make sure the PATH box was ticked.

2) numpy missing (Claude will warn you about this)

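You can ask Claude to install it, or run the command below (on Windows, the python.org installer provides python, not python3, so substitute python):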

python3 -m pip install --user numpy
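To confirm both prerequisites from Python itself, a small check like the following can help. It is a sketch using only the standard library; the helper name has_module is an assumption for illustration, and the repo may need more packages than numpy:

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """True if a top-level module `name` is importable by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    print("numpy installed:", has_module("numpy"))
```

Run it with the same interpreter you will use for python -m rest_sim, since packages installed for one interpreter are not visible to another.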

Reproduce the baseline (3 min)

Before optimising anything, run the baseline yourself so you have ground truth.

Prompt:

/play-month 30 20260423

These are the same arguments as the reference baseline run (30 = days, 20260423 = seed). Your run should land near +€24,129 net profit, which is the floor your loop has to beat.

Watch it live

While /play-month runs, REST-bench serves a live dashboard at http://localhost:8765. It auto-launches the first time you init, but you can re-open it any time the sim is running — it streams init, advance, and decision events over SSE, so you can watch the agent's day-by-day choices, KPIs, and arrivals as they happen. If your browser didn't open it automatically, click that link.

If port 8765 is taken, run the dashboard yourself on another port:

python -m rest_sim dashboard --port 8766
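To find out whether a port is already taken before picking another one, you can probe it from Python. This is a sketch under the assumption that the dashboard listens on the IPv4 loopback address; port_in_use is a hypothetical helper, not part of rest_sim:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (8765, 8766):
        print(candidate, "in use" if port_in_use(candidate) else "free")
```

If 8765 reports as in use before you have started the sim, another process holds it, and the --port option shown above is the way around that.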

The task: Convert this to AutoResearch

Download the AutoResearch primer: AutoResearch Repo: How It Works & How To Build Your Own.

Drag the downloaded .md into Claude Code.

Prompt:

Read the "autoresearch-build-your-own.md" attached.

.claude/agents/restaurant-manager.md is the only file the loop can edit (like the train.py)

Our goal is to maximize end-of-month net profit while keeping reputation healthy (≥ 4.0) and satisfaction ≥ 0.75.

Create a command (.claude/commands/auto-research) that: edits restaurant-manager.md; runs /play-month; writes results; learns; repeats within the limits!

Write results to .tsv

Stuck? Debrief: what should have happened

A working .claude/commands/auto-research.md you wrote with the agent.
A .tsv results file with one row per iteration: which mutations were tried, the resulting metrics, kept-or-reverted.
A new high score that beats the baseline.
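For the results file, one possible layout is a header plus one tab-separated row per iteration. The sketch below assumes a column set that matches the description above (mutation, metrics, kept or reverted); the column names and the append_result helper are illustrative, and your own command may choose a different schema:

```python
import csv
from pathlib import Path

# Hypothetical column set: one row per loop iteration.
FIELDS = ["iteration", "mutation", "net_profit", "reputation", "satisfaction", "kept"]


def append_result(path: Path, row: dict) -> None:
    """Append one iteration to a tab-separated file, writing the header once."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending, as opposed to rewriting the file each time, keeps reverted attempts in the log, so later iterations can see what has already been tried.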

Notice the mapping from Exercise 5:

train.py → restaurant-manager.md (the only file the loop edits).
prepare.py → /play-month (the scorer, locked).
program.md → your new auto-research command (the loop logic).

This mapping is the skill. Once you can do it on REST-bench, you can do it on any process at your company that satisfies the three conditions.
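Stripped of the domain, the loop is a greedy keep-or-revert search over one editable artifact with a locked scorer. The sketch below shows that shape with stand-in callables: propose plays the role of editing the one mutable file, and score plays the role of the locked scorer. All names are hypothetical, and the real loop would also enforce the reputation and satisfaction guardrails, which this toy omits:

```python
from typing import Callable, List, Tuple


def keep_or_revert_loop(
    start: str,
    propose: Callable[[str, int], str],
    score: Callable[[str], float],
    iterations: int,
) -> Tuple[str, float, List[Tuple[int, float, bool]]]:
    """Greedy loop: try a mutation, keep it only if the score improves."""
    best, best_score = start, score(start)
    history = []
    for i in range(1, iterations + 1):
        candidate = propose(best, i)         # stand-in for editing the mutable file
        candidate_score = score(candidate)   # stand-in for the locked scorer
        kept = candidate_score > best_score
        if kept:
            best, best_score = candidate, candidate_score
        # When not kept, `best` is left untouched, which is the "revert".
        history.append((i, candidate_score, kept))
    return best, best_score, history
```

The history list is what you would write out row by row to the results file; swapping in a different propose and score is all it takes to point the same loop at another domain.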

Stuck? FAQ for this exercise

Q: Can you explain the problem before we start? What variables, what are we selling, what's the price? A: See "What you're optimizing" at the top of this page: the venue, the daily loop, what's modeled, the levers, and the score are all there in one snapshot. The short version: we simulate a Rotterdam bistro with tables, menu, staff, and stock. Probabilities come from real service-industry data. We don't model weather or seasonality.

Q: Where does the baseline come from? What's the absolute bottom line? A: If the manager does nothing, the restaurant runs out of food in 6 days and ends the month around -€3,000 to -€5,000. Plain Haiku as manager (no instructions) keeps it afloat at +€16,000. Opus reaches close to €100,000.

Q: What does the seed do? A: It fixes the starting point of the pseudo-random number generator, so every run draws the same "random" numbers and the experiment is reproducible. Without a fixed seed, a generator is typically initialised from an unpredictable source such as operating-system entropy or the clock, and you can't replicate a run. That's why the baseline pins --seed 20260423: so your numbers match ours.
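The effect of a seed is easy to see in isolation. The sketch below uses Python's standard random module as a stand-in; rest_sim may well use a different generator internally, and the guest-count range is invented for the demonstration:

```python
import random


def sample_arrivals(seed: int, days: int = 5) -> list:
    """Draw a toy sequence of daily guest counts from a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randint(40, 120) for _ in range(days)]


first = sample_arrivals(20260423)
second = sample_arrivals(20260423)
print(first == second)  # the same seed reproduces the same sequence
```

Because both calls start from the same seed, any difference between two scored runs can be attributed to the manager's decisions and not to luck in the draws.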

Q: So the manager is the equivalent of train.py? A: Yes. train.py decides how to train the model in the previous example; restaurant-manager.md decides how to run the business in this one. program.md corresponds to the AutoResearch command (the loop). Mapping the same loop pattern onto your own domain is the challenging part.

Q: How can I see what day the simulation is on? A: REST-bench ships a live dashboard at http://localhost:8765 — it auto-launches when you init and streams every advance and decision as it happens. If you'd rather watch via files, open results.tsv to see all iterations.

Q: Could a digital twin be built from real operational data instead of random data? A: (Experimental research direction.) Appropriately governed, aggregated operational data across large-scale platforms could, in principle, enable differentiated AI-driven simulations and decision-support systems. The illustrative direction: move from random distributions toward aggregated learnings that better reflect real-world variability, always within the privacy, consent, and competition-law frameworks that apply to any such data.

Q: Is it feasible to use simulated personas to stress-test a new product overnight? A: (Experimental research direction.) That's the conceptual AutoPMF idea. Early prototypes suggested simulated personas can be too "polite": they propose small additions where real humans want whole redesigns, which would call for a more critical "innovation QA" persona. Future research could explore privacy-preserving simulations based on approved, aggregated behavioural patterns, not real user records, to stress-test product variants safely.

Optional: Go deeper

(Illustrative scenario.) Try YC-bench as a second benchmark. Same loop scaffolding, completely different domain: the YC startup-evaluation benchmark instead of REST-bench's restaurant simulator.

Clone https://github.com/FlorisFok/AutoResearchYC (built on the
yc-bench dataset: https://huggingface.co/datasets/collinear-ai/yc-bench).

Map the three roles from REST-bench:
- restaurant-manager.md → the editable agent file (what does it become here?)
- /play-month → the locked scorer (what's the YC-bench equivalent?)
- auto-research command → the same shape, pointed at the new harness.

Beat the published baseline.

Compare the trajectories side by side. If autoresearch finds analogous patterns across two unrelated domains, you've felt the generality of the loop. If it stalls on YC-bench, the gap tells you exactly what makes a domain "loop-able" and what doesn't.
