Karpathy's AutoResearch

The Minimalist Architecture for Automated AI Experiments


The best agent framework is no framework at all. Just a markdown file and a clear objective function.


What Is AutoResearch?

Andrej Karpathy released autoresearch — a system that lets Claude Code autonomously run machine learning experiments, evaluate results, keep improvements, discard regressions, and loop forever. It ran ~100 experiments overnight on a single H100 GPU, continuously improving a GPT language model's validation loss.

The remarkable thing is not what it does. It's how little code it takes to do it.

The entire system is 4 files. There is no custom orchestration code. No Python loop runner. No bash script coordinator. No agent framework. Claude Code itself is the orchestration engine, guided by a single markdown file.

This article breaks down exactly how it works and why this architecture is more robust than it looks.


The 4-File Architecture

The entire system consists of exactly four files. Each has a clearly defined role, and the boundaries between them are strict.

| File | Role | Size | Mutable by AI? |
| --- | --- | --- | --- |
| prepare.py | Read-only infrastructure | | No |
| train.py | Experiment surface | ~26 KB | Yes (only file) |
| program.md | Agent instruction manual | | No |
| pyproject.toml | Dependencies | | No |

Let's look at each one.


prepare.py — The Immutable Foundation

This file is the evaluation infrastructure. The AI never touches it.

It handles:

  • Constants: MAX_SEQ_LEN=2048, TIME_BUDGET=300 (5-minute training budget per experiment)
  • Data pipeline: Downloads data from HuggingFace, trains a BPE tokenizer via rustbpe, builds a dataloader
  • Evaluation: The evaluate_bpb() function — computes bits-per-byte on the validation set

The critical design decision: by making prepare.py immutable, the AI cannot game the evaluation metric. It can only improve the model, not redefine what "improvement" means.


train.py — The Single Mutable Surface

This is the only file Claude Code is allowed to modify. At ~26 KB, it contains:

  • A full GPT model implementation
  • Optimizer configuration
  • The training loop
  • All hyperparameters

Every experiment the AI runs is an edit to this file. Want to try a different learning rate schedule? Edit train.py. Want to experiment with a different attention mechanism? Edit train.py. Want to add gradient clipping? Edit train.py.

This single-file constraint is what makes the system tractable. The AI doesn't need to reason about file dependencies, import chains, or multi-module architectures. There is exactly one file to read, one file to modify, and one metric to optimize.


program.md — The Only "Framework"

This is where it gets interesting. program.md is a structured markdown document that gets loaded as CLAUDE.md — the instruction file that Claude Code reads at startup.

It is the agent's instruction manual. It tells Claude Code:

  1. How to set up the environment
  2. How to run experiments
  3. How to log results
  4. How to handle success and failure
  5. How to loop forever

There is no Python code orchestrating the loop. There is no bash script calling Claude Code repeatedly. The markdown file simply tells Claude Code to never stop, and Claude Code obeys.

This is the zero-orchestration insight: Claude Code is already a loop. You don't need to build a loop around it. You just need to tell it not to stop.


pyproject.toml — Dependencies

The project dependencies are straightforward:

  • PyTorch 2.9.1 — The ML framework
  • numpy — Numerical operations
  • pandas — Data handling
  • tiktoken — OpenAI's tokenizer library
  • rustbpe — Fast BPE tokenizer training
  • kernels — Custom CUDA kernels
  • matplotlib — Plotting results

Everything is managed via uv, making environment setup reproducible and fast.


The Experiment Loop

The core of the system is a loop defined entirely in program.md. Here's what Claude Code does on every iteration:

```text
LOOP FOREVER:
1. Check git state
2. Edit train.py with an experimental idea
3. git commit
4. Run: uv run train.py > run.log 2>&1
5. Extract results: grep for val_bpb and peak_vram_mb
6. If grep empty → crash → read tail -n 50 for traceback
7. Log to results.tsv
8. If val_bpb improved → keep commit
9. If val_bpb worse → git reset
10. NEVER STOP.
```

Let's walk through each step.


Step 1–3: Hypothesis and Commit

Claude Code starts each iteration by checking the git state to understand where it is. Then it comes up with an experimental idea — maybe a different learning rate, a modified architecture component, or an optimization trick — and edits train.py accordingly.

Before running anything, it commits the change. This is critical: the commit happens before the experiment runs, not after. This creates an atomic checkpoint that can be cleanly reverted if the experiment fails.


Step 4: Run the Experiment

The experiment runs as a simple subprocess:

```bash
uv run train.py > run.log 2>&1
```

Output goes to run.log. Training is bounded by the TIME_BUDGET=300 (5-minute) budget defined in prepare.py, plus a hard 10-minute per-experiment timeout. This prevents any single experiment from hogging the GPU indefinitely.
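The timeout behavior is easy to sketch with coreutils' `timeout`. This is an illustrative stand-in, not the repo's actual code: `sleep 5` plays the role of `uv run train.py`, and a 2-second cap plays the role of the 10-minute ceiling.

```shell
# A run that exceeds its budget is killed; `timeout` then exits with
# status 124, which a caller can use to detect a hung experiment.
timeout 2 sleep 5 > run.log 2>&1
status=$?
echo "exit status: $status"   # 124 means the time limit was hit
```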


Step 5–6: Result Extraction and Crash Detection

After the run completes, Claude Code extracts the results by grepping the log for val_bpb and peak_vram_mb.

Here's the elegant part: crash detection is just "did grep return empty?"

  • If grep returns results → the experiment completed; extract the numbers
  • If grep returns empty → the experiment crashed; read tail -n 50 of run.log for the traceback

No try-catch blocks. No error handling framework. Just grep.
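The whole check fits in a few lines of shell. The file name `run.log` and the metric names come from the article; the exact format of the log line below is an assumption for illustration.

```shell
# Simulate a successful run by writing a plausible metrics line.
printf 'step 100 | val_bpb 0.8123 | peak_vram_mb 1024\n' > run.log

if grep -q 'val_bpb' run.log; then
    # Success path: the metric printed, so pull the number out of the log.
    val=$(grep -oE 'val_bpb [0-9.]+' run.log | awk '{print $2}')
    echo "completed: val_bpb=$val"
else
    # Crash path: no metric line means the run died; show the traceback.
    tail -n 50 run.log
fi
```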


Step 7: Logging

Results are appended to results.tsv — a simple tab-separated file that serves as the running experiment log. This file is gitignored, so it persists across git resets but doesn't pollute the commit history.

The TSV format makes it easy for both Claude Code and humans to read the experiment history.
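A minimal sketch of the append-only log, with hypothetical column names (the article only specifies that the file is tab-separated):

```shell
# Write a header once, then append one row per experiment.
printf 'experiment\tval_bpb\tpeak_vram_mb\n' > results.tsv
printf 'cosine lr schedule\t0.8123\t1024\n' >> results.tsv
cat results.tsv
```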


Step 8–9: Keep or Revert

This is the selection mechanism:

  • val_bpb improved → The commit stays. The branch advances. This configuration is now the new baseline.
  • val_bpb got worsegit reset back to the previous commit. The branch tip remains at the last successful experiment.

This creates a monotonically improving sequence of commits. The branch tip always represents the best-so-far configuration.
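The keep-or-revert step can be sketched in a throwaway repo. The metric values, file contents, and commit messages here are illustrative, not taken from an actual run.

```shell
# Set up a toy repo with a baseline and one experimental commit.
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "baseline"
echo "lr = 3e-4" > train.py
git add train.py
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "experiment: cosine lr schedule"

best_bpb=0.820   # best val_bpb so far (would come from results.tsv)
new_bpb=0.812    # this experiment's val_bpb (grepped from run.log)

# Lower bits-per-byte is better: keep the commit only on improvement.
if awk -v a="$new_bpb" -v b="$best_bpb" 'BEGIN { exit !(a < b) }'; then
    echo "improved: keep commit"
else
    git reset -q --hard HEAD~1   # roll back to the previous baseline
    echo "regressed: reverted"
fi
```

Because the commit was made before the run, `git reset --hard HEAD~1` is all it takes to discard a failed experiment completely.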


Step 10: Never Stop

The instruction is explicit: NEVER STOP. Claude Code keeps looping until it runs out of context window, hits an API rate limit, or the user manually terminates it.

At ~12 experiments per hour, an overnight run produces roughly 100 experiments. Each one is a genuine attempt to improve the model — a different architectural tweak, a different hyperparameter, a different optimization strategy.


State Management via Git

One of the most clever aspects of the design is using git as the state management layer.

Branch Strategy

Each autoresearch session runs on its own branch: autoresearch/{tag}. This means:

  • Multiple experiments can run in parallel on different branches
  • The main branch is never touched
  • Each branch tells a complete story of incremental improvements

The Branch Tip Is the Best Model

Because failed experiments are reverted and only improvements are kept, the branch tip always represents the best configuration found so far. You can check out any autoresearch branch and immediately have a working, optimized train.py.

results.tsv Is Gitignored

The experiment log lives outside of git. This is important because:

  1. It survives git reset — you don't lose the log of failed experiments
  2. It doesn't clutter the commit history
  3. Claude Code can reference it to avoid repeating failed experiments

One Commit = One Experiment

Each commit represents exactly one experimental change. The commit message describes what was tried. The diff shows exactly what was changed. The git log becomes a research notebook.


Error Handling: Pragmatic, Not Defensive

The error handling strategy is refreshingly simple. No exception hierarchies, no retry decorators, no circuit breakers. Just common sense encoded in markdown.

| Scenario | What Claude Code does |
| --- | --- |
| Experiment crashes | grep returns empty; read tail -n 50 for the traceback |
| Trivial fix (typo, import) | Fix it and re-run the same experiment |
| Multiple consecutive failures | Give up on that idea and move on to something else |
| Ugly improvement (messy code, marginal gain) | Discard it, even if val_bpb improved |
| Simplification that maintains performance | Keep it, even if val_bpb doesn't improve |
| Experiment exceeds time limit | Kill it at the 10-minute per-experiment timeout |

The last two points reveal a subtle sophistication. The system isn't just hill-climbing on a metric. The program.md instructions encode aesthetic preferences: prefer clean code over marginal gains, prefer simplicity over complexity. This prevents the kind of code rot you'd get from a purely metric-driven optimization loop.


Why This Architecture Works

It's tempting to dismiss this as "too simple to be real." But the simplicity is precisely why it works. Let's examine the design principles.

1. Single Mutable Surface

Only train.py can be modified. This eliminates an entire class of failure modes: the AI can't accidentally break the evaluation pipeline, corrupt the data loading, or modify the success criteria. The attack surface for bugs is exactly one file.

2. Objective Metric

val_bpb (validation bits-per-byte) is a clean, unambiguous metric. Lower is better. There's no subjective judgment, no "looks good to me," no multi-objective trade-off. The AI has a single number to optimize, and it can't redefine how that number is computed.

3. Atomic Experiments

One commit = one experiment. This means every experiment is:

  • Reversible: git reset undoes it completely
  • Inspectable: git diff shows exactly what changed
  • Reproducible: check out the commit, run train.py, get the same result

4. Immutable Evaluation

Because prepare.py is read-only, the evaluation function evaluate_bpb() is a fixed reference point. The AI optimizes against a stable target, not a moving one.

5. Stateless Iterations

Each loop iteration, Claude Code re-reads the files from scratch. It doesn't accumulate stale state in memory. If something gets corrupted, the next iteration starts fresh from the filesystem.

6. Git as Database

Git provides ACID-like properties for experiment management:

  • Atomicity: Each commit is all-or-nothing
  • Consistency: Branch tip is always a valid, improved configuration
  • Isolation: Different branches don't interfere with each other
  • Durability: Commits are persistent

No custom database needed. No experiment tracking service. Just git.


The Zero-Orchestration Insight

Most AI agent systems look something like this:

[Orchestrator Script]
    ├── calls Agent API
    ├── parses response
    ├── decides next action
    ├── manages state
    ├── handles errors
    └── loops

AutoResearch looks like this:

[Claude Code]
    └── reads program.md
        └── does everything

There is no orchestrator. Claude Code reads program.md, which says "do this loop forever," and Claude Code does it. The "framework" is a markdown file. The "orchestration engine" is the LLM itself.

This works because Claude Code already has everything an orchestrator needs:

  • It can read and write files
  • It can run shell commands
  • It can parse output
  • It can make decisions
  • It can loop

Building a custom orchestrator on top of Claude Code would add complexity without adding capability. It would be a worse version of what Claude Code already is.


System Requirements

AutoResearch is designed to be run with minimal setup:

| Requirement | Specification |
| --- | --- |
| GPU | Single NVIDIA GPU (tested on an H100) |
| Python | >= 3.10 |
| Package manager | uv |
| AI agent | Claude Code (vanilla, no special flags) |
| OS | Linux (CUDA required) |

Note that Claude Code is used as-is — no special configuration, no custom flags, no plugins. The program.md file loaded as CLAUDE.md is the only customization.


Performance Characteristics

The system achieves roughly:

  • ~12 experiments per hour — each experiment takes about 5 minutes to train plus overhead for Claude Code to think, commit, and evaluate
  • ~100 experiments overnight — an 8-hour unattended run
  • Monotonic improvement — every surviving commit is at least as good as its predecessor
  • 5-minute training budget — enforced by TIME_BUDGET=300 in prepare.py
  • 10-minute hard timeout — per-experiment ceiling to catch hanging processes

The throughput is limited by GPU training time, not by Claude Code's reasoning speed. Most of each 5-minute cycle is spent actually training the model.


Lessons for Your Own Projects

AutoResearch isn't just about ML experiments. The design patterns generalize to any scenario where you want an AI agent to iterate autonomously.

The Recipe

  1. Define a single mutable surface. One file, one config, one thing the AI is allowed to change.
  2. Define an objective metric. Something computable, unambiguous, and impossible for the AI to game.
  3. Make evaluation immutable. The AI optimizes the thing, not the measurement of the thing.
  4. Use git for state. Commits for checkpoints, branches for isolation, reset for rollback.
  5. Write instructions in markdown. No framework needed. Tell the AI what to do in plain language.
  6. Set a time budget. Prevent runaway experiments from blocking progress.
  7. Tell it to never stop. The AI already knows how to loop. Just tell it to.
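The whole recipe fits in one instruction file. A hypothetical, heavily abridged sketch of what such a CLAUDE.md could contain (the real program.md is longer and more specific):

```
# Objective
Minimize val_bpb as reported by `uv run train.py`. You may edit train.py ONLY.

# Loop (repeat forever)
1. Pick one experimental change to train.py and commit it.
2. Run: uv run train.py > run.log 2>&1 (10-minute maximum per run).
3. grep run.log for val_bpb; if empty, read the tail for the traceback.
4. Append the result to results.tsv.
5. If val_bpb improved, keep the commit; otherwise `git reset --hard HEAD~1`.
6. NEVER STOP.
```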

Where This Pattern Applies

  • Hyperparameter tuning: Obvious direct application
  • Prompt engineering: Modify a prompt file, evaluate against a test set, keep or revert
  • Configuration optimization: Tune server configs against a benchmark
  • Code optimization: Modify a function, run benchmarks, keep if faster
  • CSS/UI iteration: Modify styles, run visual regression tests, keep if passing

The common thread: one file changes, one metric judges, git remembers.


Conclusion

Karpathy's autoresearch is a masterclass in minimalist system design. It proves that you don't need an agent framework to build an effective AI agent. You need:

  • A clear objective
  • A constrained action space
  • An immutable evaluation function
  • A version control system
  • A markdown file

Four files. Zero orchestration code. One hundred experiments overnight.

The next time you reach for LangChain, CrewAI, or AutoGen to build an AI agent workflow, ask yourself: could I just write a CLAUDE.md instead?
