Act II·11 of 11·75 min
6

AutoResearch on REST-bench

Now you build the loop yourself

In Exercise 5 the loop was already wired: you ran /autoresearch and watched. (A loop is what makes Claude an agent and not a chatbot. Instead of one ask-and-answer turn, a loop runs Claude over and over: act, observe what changed, decide the next step, repeat.) This time you build it. Same shape, different domain: a restaurant simulator instead of a propensity model.

Stuck? Don't see the /autoresearch command?

Two fallbacks:

  1. Fully restart the Claude desktop application: quit it, don't just close the window. Project-scoped slash commands only refresh on a full restart. (A slash command is a shortcut you type as /something to launch a workflow Claude already knows. Type /check-mail and Claude does the whole routine without re-prompting.)
  2. If it still doesn't appear, paste this prompt instead:
    Read .claude/commands/autoresearch.md and follow those instructions

What you're optimizing

You're managing The Rotterdam Table — a casual European bistro in Rotterdam — for 30 days. Your only handle on the world is python -m rest_sim <command>: read state, decide, advance, repeat.

The Rotterdam Table at a glance:

  • Venue: European bistro
  • Capacity: 22 tables, 78 seats
  • Horizon: 30 days, one shift per day
  • Starting cash: €10,000
  • Seed: 20260423 (deterministic RNG)

The daily loop

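  1. status: read state.
  2. decide: 0–N actions (the LLM's only step).
  3. advance: +1 day.
  4. done? If yes, go to the scorecard; if no, start the next day.

The loop repeats 30×. In the original diagram a pink box (the decide step) marks the only file your AutoResearch loop edits.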

Each day is 2–5 CLI calls. The brain that lives inside decide is .claude/agents/restaurant-manager.md — that's the file your loop will mutate.
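
The daily loop above can be sketched as a small driver. This is a minimal sketch under stated assumptions, not the repo's own code: only the `python -m rest_sim <command>` form and the `status` and `advance` command names come from this page. The helper names (`rest_sim_argv`, `run_cli`, `run_day`), the `decide` callback, and the injectable `run` function are hypothetical, and the real commands may need arguments that are not shown here.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def rest_sim_argv(command: str, *args: str) -> List[str]:
    """Build the argv for `python -m rest_sim <command> [args...]`."""
    return [sys.executable, "-m", "rest_sim", command, *args]


def run_cli(argv: Sequence[str]) -> str:
    """Run one CLI call and return its stdout (raises if the call fails)."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout


def run_day(decide: Callable[[str], List[List[str]]],
            run: Callable[[Sequence[str]], str] = run_cli) -> List[List[str]]:
    """One simulated day: read state, fire 0..N actions, advance one day."""
    calls = [rest_sim_argv("status")]
    state = run(calls[0])                      # 1. read state
    for action in decide(state):               # 2. decide: 0..N actions
        calls.append(rest_sim_argv(*action))
        run(calls[-1])
    calls.append(rest_sim_argv("advance"))     # 3. advance +1 day
    run(calls[-1])
    return calls
```

Passing `run` as a parameter keeps the sketch testable without the simulator installed: swap in a fake runner and inspect the list of calls that `run_day` returns.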

What's modeled

  • Arrivals: bimodal lunch / dinner peaks (NHPP · thinning)
  • Party size: ~45% pairs · skew to small (empirical discrete)
  • Menu popularity: 80/20, top items dominate (Zipf · α = 1.16)
  • Wait tolerance: ~15 min before walking (exponential)
  • Satisfaction: degraded by waits and outages (Beta(8, 2))
  • Customer cohorts: regulars · occasionals · prospects (slow demand multiplier)
  • Delayed reviews: a bad day bleeds for a week (geometric lag · EWMA)
  • Supplier shocks: price moves 10–25% (on alert 1–3 days ahead)

Every distribution is calibrated against published service-industry research — citations live inline in rest_sim/distributions.py. The agent reads a 7-day manager_view, never the raw engine state.
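
To make the "NHPP · thinning" entry concrete, here is a minimal sketch of sampling arrivals from a non-homogeneous Poisson process by thinning. It is an illustration only, not the code in rest_sim/distributions.py: the rate function `bimodal_rate`, its peak times and heights, and the 11:00 to 23:00 opening window are all hypothetical values chosen to produce a lunch peak and a dinner peak.

```python
import math
import random
from typing import Callable, List


def bimodal_rate(hour: float) -> float:
    """Hypothetical arrivals per hour, peaking near 12:30 and 19:30."""
    lunch = 12.0 * math.exp(-((hour - 12.5) ** 2) / (2 * 1.0 ** 2))
    dinner = 18.0 * math.exp(-((hour - 19.5) ** 2) / (2 * 1.5 ** 2))
    return 0.5 + lunch + dinner


def sample_arrivals(rate: Callable[[float], float], start: float, end: float,
                    rate_max: float, rng: random.Random) -> List[float]:
    """Thinning: draw candidates at the constant rate `rate_max`,
    then keep each one with probability rate(t) / rate_max.
    `rate_max` must be an upper bound on rate(t) over [start, end)."""
    arrivals = []
    t = start
    while True:
        t += rng.expovariate(rate_max)          # next candidate arrival
        if t >= end:
            return arrivals
        if rng.random() < rate(t) / rate_max:   # accept, or thin out
            arrivals.append(t)


# bimodal_rate never exceeds 0.5 + 12 + 18 = 30.5, so 31.0 is a safe bound.
arrivals = sample_arrivals(bimodal_rate, 11.0, 23.0, 31.0, random.Random(20260423))
```

Because the generator is seeded, the same call returns the same arrival times on every run, which is the property the benchmark relies on for reproducible scores.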

The levers you control

  • INVENTORY: restock · inventory
  • MENU: set-price · add-item · remove-item
  • STAFFING: set-staff · staffing
  • RESERVATIONS: set-cap · reservations
  • MARKETING: marketing · promo · loyalty · happy-hour
  • LAYOUT: convert-table

Plus read-only views: status · kpis · pnl · heatmap · news

The agent's whole job is choosing which of these to fire on which day.

How you're scored

score_eur = profit − 200k × Δsat² − 80k × Δrep² − 8 × walkouts

  • profit: total net profit.
  • 200k × Δsat²: applied if sat < 0.78.
  • 80k × Δrep²: applied if rep < 4.10.
  • 8 × walkouts: always applied (per walkout).
  • Cliff: if cash drops below −€5,000, the run halts and the score floors at −€150,000.

Where you stand on day 30:

  • Do nothing: −€5,577
  • Default Haiku: +€24,129 (the floor to beat)
  • Opus (reported): ~€100k

Goal: beat +€24,129 net profit while keeping reputation ≥ 4.0 and satisfaction ≥ 0.75.
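
The scorecard can be written out as a small function, which is handy for sanity-checking a run by hand. This is a sketch under assumptions the page does not confirm: the penalty terms are subtracted, Δsat and Δrep are read as the shortfall below the 0.78 and 4.10 thresholds, and `min_cash` (a hypothetical parameter) is the lowest cash balance seen during the run.

```python
def score_eur(profit: float, sat: float, rep: float, walkouts: int,
              min_cash: float) -> float:
    """Score one 30-day run, following the scorecard on this page.

    Assumptions: penalties are subtracted; delta_sat = 0.78 - sat and
    delta_rep = 4.10 - rep; min_cash is the lowest balance during the run.
    """
    if min_cash < -5_000:                     # cliff: the run halts
        return -150_000.0
    score = profit
    if sat < 0.78:                            # satisfaction penalty
        score -= 200_000 * (0.78 - sat) ** 2
    if rep < 4.10:                            # reputation penalty
        score -= 80_000 * (4.10 - rep) ** 2
    score -= 8 * walkouts                     # always applied, per walkout
    return score
```

Under this reading, a run that stays above both thresholds with no walkouts scores exactly its net profit, and every walkout costs a flat €8.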

Setup

1. Clone the repo. (git clone makes a full local copy of a repo: files, full history, the lot. After cloning you can read it offline, edit it, and commit your own changes. A repo is a folder that git is tracking. It looks normal, but inside is a hidden .git/ directory holding the entire history: every file, every change, every commit, every branch.)

In your current Claude Code session (a single ongoing conversation with Claude Code, stored locally on your machine, that holds every message, every file Claude has read, and every tool result):

Prompt:

Clone this repo: https://github.com/fjfok/REST-bench
Place it in ~/Documents/github/

This copies the code from GitHub onto your laptop. (GitHub is the website where most people host git repositories. Git is the tool; GitHub is one popular place that stores the result. It is owned by Microsoft and free for public projects.) A new REST-bench/ folder appears inside ~/Documents/github/.

2. Open that folder in a fresh Claude Code session — manually.

Claude Code can't switch its own working directory mid-session, so this step has to be done by hand:

  • Quit Claude Code (or open a new window).
  • Start a fresh session.
  • When it asks which folder to open, double-click the REST-bench folder to open it as the project.

Verify you're in the right place.

Prompt:

What files are in this folder

You should see roughly:

  • .claude/
  • .gitignore
  • README.md
  • data/
  • pyproject.toml
  • rest_sim/
  • tests/

The exact list may shift over time as the repo evolves — the key signal is that rest_sim/ and .claude/ are present. If you see your other GitHub folders instead (autoresearch-edu, etc.), Claude Code is rooted one level too high — quit and reopen, double-clicking REST-bench this time.
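
If you prefer a scripted check over reading the listing, the two key signals can be tested directly. A minimal sketch: the helper name `missing_markers` is hypothetical, and it checks only for the rest_sim/ and .claude/ folders named above.

```python
from pathlib import Path
from typing import List


def missing_markers(root: Path) -> List[str]:
    """Return the expected folders that are absent from `root`."""
    expected = ["rest_sim", ".claude"]   # the two key signals named above
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_markers(Path.cwd())
    print("OK" if not missing else f"Missing: {missing}")
```

An empty result means the session is rooted in the cloned folder. If both folders are reported missing, the session is most likely rooted one level too high.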

From here on, everything happens inside this new session.

3. Check the install.

Prompt:

Check if you have everything installed for running this repo's code

Two errors might happen on this step:

1) Python is not installed (Windows)

Python is a programming language: the lingua franca of data science, machine learning, and scripting. It is a separate ecosystem from Node and npm, with its own runtime and package installer (pip).

  1. Go to python.org/downloads and download the latest Python 3 installer for Windows.

  2. Run the installer. On the first screen, tick the box labelled "Add python.exe to PATH" before clicking Install Now. This is the most common step people miss. (PATH is a list of folders your terminal searches when you type a command. Type 'git' and the shell looks through every folder in PATH, in order, for a program named git. No match means 'command not found'.)

  3. When the installer finishes, open a new Command Prompt (search "cmd" in the Start menu) and verify:

    python --version
    pip --version
    
  4. Both commands should print a version number. If Command Prompt instead reports that 'python' is not recognized as an internal or external command, close and reopen the Command Prompt, or re-run the installer and make sure the PATH box was ticked.

2) numpy missing (Claude will warn you about this)

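You can ask Claude to install it, or run the command below. If python3 is not recognized on Windows, use python instead, as in the version check above.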

python3 -m pip install --user numpy
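
To confirm the install from Python itself, a short check like the following works. It is a minimal sketch: `is_installed` is a hypothetical helper, and it only tests whether a module can be found by the interpreter that runs it, not whether the version is the one the repo expects.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """True if `module_name` can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("python:", sys.version.split()[0])
    print("numpy installed:", is_installed("numpy"))
```

Run it with the same interpreter you will use for the simulator. A package installed for one Python does not show up in another.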

Reproduce the baseline (3 min)

Before optimising anything, run the baseline yourself so you have ground truth. (A baseline is the score a dumb-but-honest approach gets on your problem: the bar you need to clear to claim your fancy approach actually does anything.)

Prompt:

/play-month 30 20260423

Same args as the baseline above (30 = days, 20260423 = seed). Your run should land near +€24,129 net profit — that's the floor your loop has to beat.

Watch it live

While /play-month runs, REST-bench serves a live dashboard at http://localhost:8765. It auto-launches the first time you init, but you can re-open it any time the sim is running — it streams init, advance, and decision events over SSE, so you can watch the agent's day-by-day choices, KPIs, and arrivals as they happen. If your browser didn't open it automatically, click that link.

If port 8765 is taken, run the dashboard yourself on another port:

python -m rest_sim dashboard --port 8766
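
If you want to consume the event stream from a script instead of the browser, the SSE wire format is simple to parse. The sketch below works on lines of text only. The event names init, advance, and decision come from this page, but the dashboard's endpoint path and payload shape are not stated here, so the parser makes no assumption about either. `parse_sse` is a hypothetical helper.

```python
from typing import Dict, Iterable, Iterator


def _field_value(line: str, prefix: str) -> str:
    """Return the text after `prefix`, dropping one leading space if present."""
    value = line[len(prefix):]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {'event': ..., 'data': ...} dict per SSE message.

    `event:` and `data:` lines accumulate, a blank line ends the message,
    and lines starting with ':' are comments (often used as keep-alives).
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                        # blank line: dispatch the message
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue                          # comment line, ignore
        elif line.startswith("event:"):
            event = _field_value(line, "event:")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data:"))
```

Feeding it the lines of an HTTP response body yields one dictionary per event, which you can filter by the `event` key.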

The task: Convert this to AutoResearch

Download the AutoResearch primer: AutoResearch Repo: How It Works & How To Build Your Own. (AutoResearch is a loop that turns Claude into a tireless ML researcher. Give it a dataset and a metric to beat; it tries an approach, scores it, journals what it learned, and keeps going overnight.)

Drag the downloaded .md into Claude Code.

Prompt:

Read the "autoresearch-build-your-own.md" attached.

.claude/agents/restaurant-manager.md is the only file the loop can edit (like the train.py)

Our goal is to maximize end-of-month net profit while keeping reputation healthy (≥ 4.0) and satisfaction ≥ 0.75.

Create a command (.claude/commands/auto-research) that: edits restaurant-manager.md; runs /play-month; writes results; learns; repeats within the limits!

Write results to .tsv

Stuck? Debrief: what should have happened

  • A working .claude/commands/auto-research.md you wrote with the agent.
  • A .tsv results file with one row per iteration: which mutations were tried, the resulting metrics, kept-or-reverted.
  • A new high score that beats the baseline.
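
For reference, one row per iteration can be appended to a tab-separated file with the standard library. This is a sketch, not the format your command has to use: the column names in `FIELDS` and the helper `append_result` are hypothetical, and the exercise only requires that each row records the mutation, the resulting metrics, and whether the change was kept or reverted.

```python
import csv
from pathlib import Path
from typing import Dict

# Hypothetical column names; adapt them to whatever your loop records.
FIELDS = ["iteration", "mutation", "net_profit", "reputation",
          "satisfaction", "decision"]


def append_result(path: Path, row: Dict[str, object]) -> None:
    """Append one iteration to a tab-separated file, writing the header once."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending instead of rewriting means an interrupted loop still leaves a complete log of every iteration that finished.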

Notice the mapping from Exercise 5:

  • train.py → restaurant-manager.md (the only file the loop edits).
  • prepare.py → /play-month (the scorer, locked).
  • program.md → your new auto-research command (the loop logic).

This mapping is the skill. Once you can do it on REST-bench, you can do it on any process at your company that satisfies the three conditions.
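
The shape shared by both exercises can be sketched as a keep-or-revert loop. This shows the structure only, under stated assumptions: in the real exercise the mutation is Claude editing restaurant-manager.md and the evaluation is a /play-month run, whereas here `mutate` and `evaluate` are hypothetical callbacks so the loop can be read and tested in isolation.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float, str]]


def auto_research(initial: str,
                  mutate: Callable[[str, int], str],
                  evaluate: Callable[[str], float],
                  iterations: int) -> Tuple[str, float, History]:
    """Keep-or-revert loop: mutate the one editable file, score it, keep gains."""
    best_text, best_score = initial, evaluate(initial)   # baseline first
    history: History = []
    for i in range(1, iterations + 1):
        candidate = mutate(best_text, i)    # the only editable file
        score = evaluate(candidate)         # the locked scorer
        if score > best_score:
            best_text, best_score = candidate, score
            history.append((i, score, "kept"))
        else:                               # discard the candidate
            history.append((i, score, "reverted"))
    return best_text, best_score, history
```

Each mutation starts from the best version so far, so a reverted attempt never contaminates later iterations. The returned history is the data a results file would record.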

Stuck? FAQ for this exercise

Q: Can you explain the problem before we start? What variables, what are we selling, what's the price? A: See What you're optimizing at the top of this page — the venue, the daily loop, what's modeled, the levers, and the score are all there in one snapshot. The short version: we simulate a Rotterdam bistro with tables, menu, staff, and stock. Probabilities come from real service-industry data. We don't model weather or seasonality.

Q: Where does the baseline come from? What's the absolute bottom line? A: If the manager does nothing, the restaurant runs out of food in 6 days and ends the month at about −€5,577. Plain Haiku as manager (no instructions) keeps it afloat at +€24,129. Opus reaches close to €100,000.

Q: What does the seed do? A: It is a pseudo-random seed: it guarantees the same "random" numbers on every run, so the experiment is reproducible. Without a fixed seed, the generator is seeded from an unpredictable source (typically operating-system entropy), so no two runs match and you can't replicate a result. That's why the baseline pins --seed 20260423: so your numbers match ours.
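
A quick way to see this for yourself, using Python's standard `random` module as a stand-in. The simulator's own generator may be a different one, and the helper `draws` is hypothetical.

```python
import random

SEED = 20260423  # the seed pinned by the baseline on this page


def draws(seed: int, n: int = 5) -> list:
    """Return n pseudo-random floats from a generator with the given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_seed_matches = draws(SEED) == draws(SEED)        # same seed, same numbers
other_seed_matches = draws(SEED) == draws(SEED + 1)   # different seed
```

The first comparison is true on every machine and every run. Changing the seed by one gives an unrelated sequence.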

Q: So the manager is the equivalent of train.py? A: Yes. train.py decides how to train the model in the previous example; restaurant-manager.md decides how to run the business in this one. program.md corresponds to the AutoResearch command (the loop). Mapping the same loop pattern to your domain is the challenging part.

Q: How can I see what day the simulation is on? A: REST-bench ships a live dashboard at http://localhost:8765 — it auto-launches when you init and streams every advance and decision as it happens. If you'd rather watch via files, open results.tsv to see all iterations.

Q: Could a digital twin be built from real operational data instead of random data? A: (Experimental research direction.) Appropriately governed, aggregated operational data across large-scale platforms could, in principle, enable differentiated AI-driven simulations and decision-support systems. The illustrative direction: move from random distributions toward aggregated learnings that better reflect real-world variability, always within the privacy, consent, and competition-law frameworks that apply to any such data.

Q: Is it feasible to use simulated personas to stress-test a new product overnight? A: (Experimental research direction.) That's the conceptual AutoPMF idea. Early prototypes suggested simulated personas can be too "polite": they propose small additions where real humans want whole redesigns, which would call for a more critical "innovation QA" persona. Future research could explore privacy-preserving simulations based on approved, aggregated behavioural patterns, not real user records, to stress-test product variants safely.

Optional: Go deeper

Illustrative scenario: try YC bench as a second benchmark. Same loop scaffolding, completely different domain: the YC startup-evaluation benchmark instead of REST-bench's restaurant simulator.

Clone https://github.com/FlorisFok/AutoResearchYC (built on the
yc-bench dataset: https://huggingface.co/datasets/collinear-ai/yc-bench).

Map the three roles from REST-bench:
- restaurant-manager.md → the editable agent file (what does it become here?)
- /play-month → the locked scorer (what's the YC-bench equivalent?)
- auto-research command → the same shape, pointed at the new harness.

Beat the published baseline.

Compare the trajectories side-by-side. If autoresearch finds analogous patterns across two unrelated domains, you've felt the generality of the loop. If it stalls on YC bench, the gap tells you exactly what makes a domain "loop-able" — and what doesn't.

