AutoResearch + GStack Continuous Iteration Fusion
Making AI Agents Experiment Like Scientists Forever — 11-Expert Roundtable Debate
"The best experiments are the ones that run while you sleep." — Andrej Karpathy
「最好的實驗,是你睡覺時還在跑的那些。」—— Andrej Karpathy
Roundtable Participants:
AI Automation & Agent:
- Andrej Karpathy | autoresearch author, former Tesla AI Director, neural network training automation pioneer
- Garry Tan | GStack author, Y Combinator CEO, AI-assisted software engineering workflow designer
- John Schulman | OpenAI co-founder, reinforcement learning & agent autonomy, PPO algorithm inventor
Software Engineering Methodology:
- Martin Fowler | Continuous Integration advocate, Refactoring author, delivery pipeline design
- Kent Beck | Extreme Programming creator, TDD inventor, small-step iteration philosophy
- Dave Farley | Continuous Delivery co-author, deployment pipeline design, automated quality gates
System Architecture & Reliability:
- Werner Vogels | Amazon CTO, distributed systems design, automated operations
- Charity Majors | Honeycomb CTO, observability pioneer, production debugging
Critics & Risk:
- Gary Marcus | NYU cognitive scientist, AI capability boundary critic
- Emily Bender | UW linguistics professor, LLM risk research
Open Source Ecosystem:
- Pieter Levels | Indie developer, automated shipping practitioner, minimalist tooling
Moderator: We have an unusual panel today — people who build AI agents, people who build engineering discipline, people who build infrastructure, and people who think all of the above are overhyped. The question is deceptively simple: Karpathy's autoresearch ran 100+ experiments on a neural network training script with zero human intervention. Can we apply that same pattern to general software development using GStack's Sprint model? Andrej, you invented autoresearch. Start us off.
Karpathy: The short version: autoresearch proves that an AI agent can run an effective experiment loop if the success metric is clean. The question is whether software development can have a clean enough success metric.
Gary Marcus: The short version of my position: it can't. Not in general. But I'm curious how far this panel thinks it can stretch.
Kent Beck: My position: it can, but only if you invest heavily in the test infrastructure first. Without good tests, you have no metric. Without a metric, you have no loop.
Werner Vogels: Systems perspective: the loop is easy. Making the loop reliable is the hard part. That's where operational discipline comes in.
Pieter Levels: My position: stop debating and start shipping. I've been running a version of this for months.
圓桌會議參與者:
AI 自動化與 Agent 派:
- Andrej Karpathy | autoresearch 作者、前 Tesla AI 總監、神經網路訓練自動化先驅
- Garry Tan | GStack 作者、Y Combinator CEO、AI 輔助軟體工程工作流設計者
- John Schulman | OpenAI 共同創辦人、強化學習與 agent 自主決策、PPO 演算法發明人
軟體工程方法論派:
- Martin Fowler | 持續整合倡議者、重構經典作者、軟體交付流水線設計
- Kent Beck | 極限編程創始人、TDD 發明人、小步迭代哲學
- Dave Farley | 持續交付共同作者、部署流水線設計、自動化品質閘門
系統架構與可靠性派:
- Werner Vogels | Amazon CTO、分散式系統設計、自動化運維
- Charity Majors | Honeycomb CTO、可觀測性先驅、生產環境除錯
批評與風險派:
- Gary Marcus | NYU 認知科學家、AI 能力邊界批評者
- Emily Bender | 華盛頓大學語言學教授、LLM 風險研究
開源工具生態派:
- Pieter Levels | 獨立開發者、自動化 shipping 實踐者、極簡工具主義
主持人: 今天的陣容很特別——有做 AI agent 的人、有做工程紀律的人、有做基礎設施的人,還有覺得以上全部都被過度吹捧的人。問題看似簡單:Karpathy 的 autoresearch 在零人工介入下跑了 100 多個神經網路訓練實驗。我們能不能把同樣的模式,套用到 GStack 的 Sprint 模型來做通用軟體開發?Andrej,autoresearch 是你發明的,你先開場。
Karpathy: 簡短版:autoresearch 證明了如果成功指標夠乾淨,AI agent 可以跑一個有效的實驗迴圈。問題是軟體開發能不能有一個夠乾淨的成功指標。
Gary Marcus: 我立場的簡短版:不能,至少無法通用。但我好奇這個小組覺得它能延伸多遠。
Kent Beck: 我的立場:可以,但前提是你先大量投資測試基礎設施。沒有好的測試,你沒有指標。沒有指標,你沒有迴圈。
Werner Vogels: 系統視角:迴圈很簡單。讓迴圈「可靠」才是困難的部分。那就是運維紀律的用武之地。
Pieter Levels: 我的立場:別辯論了,開始 shipping。我已經跑了好幾個月的類似版本。
Round 1: Why Autoresearch Works — And Why That Matters
Karpathy: The architecture is embarrassingly simple. A program.md file tells Claude: "edit train.py, commit, run training, check val_bpb, keep if improved, discard if not, repeat." That's the entire system. No orchestrator, no framework, no agent graph. The magic isn't in the architecture — it's in the constraint. I deliberately reduced the quality signal to a single number: validation bits-per-byte. The agent doesn't need to "understand" whether a change is good. The number goes down, keep it. The number goes up, throw it away. Binary. Objective. Ungameable.
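The keep/discard loop Karpathy describes can be sketched in a few lines. This is a hedged illustration, not the actual autoresearch code: `propose_edit` and `run_training` are hypothetical callables standing in for the agent's edit to train.py and the training run that reports val_bpb.

```python
def experiment_loop(propose_edit, run_training, baseline_bpb, n_experiments):
    """Keep/discard loop: an edit survives only if val_bpb strictly drops."""
    best_bpb = baseline_bpb
    kept = []
    for _ in range(n_experiments):
        edit = propose_edit(best_bpb)   # agent proposes a change to train.py
        bpb = run_training(edit)        # run training, read the metric
        if bpb < best_bpb:              # number went down: keep (commit)
            best_bpb = bpb
            kept.append(edit)
        # otherwise: discard (the real system does a git reset)
    return best_bpb, kept
```

The whole system is this conditional: no multi-dimensional judgment ever enters the loop, which is exactly the constraint Karpathy credits for its reliability.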
Kent Beck: *leaning forward* This is TDD taken to its logical extreme. You defined the test first — val_bpb must decrease — and then let the agent write code against that test. But here's the crucial insight most people miss: the test is a natural law. val_bpb isn't a human-written assertion that could be wrong or incomplete. It's cross-entropy against a held-out dataset. You can't game it. You can't write code that "passes the test" while actually being wrong. This is why autoresearch works where most AI coding agents fail — the evaluation function is incorruptible.
Gary Marcus: *immediately* Let's slow down before we canonize this. Autoresearch works because it operates in an extremely constrained domain. One file — train.py. One metric — val_bpb. One action space — edit Python code that trains a model. No user interaction, no UI, no network calls, no state management, no database migrations, no deployment. It's a closed system with a clean objective function. Calling this a general-purpose pattern for software development is like saying "I taught a robot to play chess, therefore it can cook dinner."
Emily Bender: And let's be precise about why it works in that domain. Claude isn't "understanding" neural network architecture when it edits train.py. It's doing statistical pattern matching — "modifications that look like this historically correlate with lower bpb." On a 26KB file with a narrow modification surface, pattern matching is sufficient. But pattern matching has combinatorial edge cases that explode as the code surface grows. A 200-file web application isn't 8x harder than train.py — it's combinatorially harder.
Karpathy: Gary, Emily — you're both making the constrained-domain argument, and you're both right that generalization isn't free. But you're overlooking something important. The experiments aren't random. Claude reads the entire codebase, reads the paper references, reads the results log. It's doing directed search, not random mutation. Out of 100 experiments, maybe 15 improve val_bpb. But those 15 represent genuine knowledge accumulation. This isn't a million monkeys at typewriters.
Martin Fowler: I want to reframe this from an engineering perspective, because the "does AI understand" debate is a philosophical rabbit hole. Autoresearch works because of three engineering principles, not because of AI capability. First: single mutable surface — only train.py can change. Second: atomic experiments — each commit is one complete experiment, succeed or rollback. Third: immutable evaluation — prepare.py never changes, so the measurement standard is always consistent. In software engineering, we have names for these. They're called separation of concerns, atomic commits, and stable test infrastructure. The question isn't whether AI "understands." The question is whether we can architect the same three properties into a software development loop.
Kent Beck: Martin nailed it. Autoresearch doesn't work because Claude is smart. It works because the problem is well-structured. Our job is to structure software development problems the same way.
Gary Marcus: And my point is that most software development problems can't be structured that way. But — raises hand — I'll concede we should explore how far this pattern stretches before declaring it dead on arrival.
John Schulman: I want to add the reinforcement learning perspective, because everyone's dancing around it. Autoresearch is essentially a bandit problem with a deterministic reward function. The agent picks an action (code edit), receives a reward (delta val_bpb), and updates its implicit policy. The reason it works isn't just the constrained domain — it's that the reward signal is immediate, noiseless, and aligned with the true objective. Most RL systems fail because the reward is delayed, noisy, or misaligned. Autoresearch has none of those problems.
Werner Vogels: *cutting in* And from an operations standpoint, the beauty is that the failure cost is near zero. A failed experiment costs 5 minutes of compute and zero persistent state change — git reset erases everything. Compare that to deploying a buggy feature to production. The low failure cost is what makes the high iteration count possible.
Pieter Levels: Can I just say — I've been running a similar loop for my projects for months. Not as fancy as autoresearch, not with val_bpb. Just: write a test, run Claude, check if tests pass, ship if yes, discard if no. It works. I don't need a PhD in reinforcement learning to tell me why. It works because the feedback loop is tight and the rollback is free.
Charity Majors: Pieter is accidentally describing the minimal viable version of what everyone else is theorizing about. And that's valuable. But I'll push back on one thing: "tests pass" is a much weaker signal than val_bpb. Tests are written by humans (or AI). They can be incomplete, incorrect, or vacuous. val_bpb is a law of physics for language models. The quality of your feedback loop is only as good as the quality of your evaluation function.
Moderator: Strong opening positions laid down. Let's move to the comparison.
第一回合:autoresearch 為何有效——以及為什麼這很重要
Karpathy: 架構簡單到令人尷尬。一個 program.md 檔案告訴 Claude:「編輯 train.py、commit、跑訓練、檢查 val_bpb、改善就保留、沒改善就丟棄、重複。」這就是整個系統。沒有編排器、沒有框架、沒有 agent graph。魔法不在架構——而在約束。我刻意把品質訊號簡化成一個數字:驗證 bits-per-byte。Agent 不需要「理解」一個修改好不好。數字降了就留,數字升了就丟。二元的。客觀的。不可作弊的。
Kent Beck: *身體前傾* 這就是 TDD 推到極致的形態。你先定義了測試——val_bpb 必須下降——然後讓 agent 對這個測試寫程式碼。但多數人忽略了一個關鍵洞察:測試就是自然法則。val_bpb 不是一個可能寫錯或不完整的人類斷言。它是對 held-out dataset 的交叉熵。你作弊不了。你不可能寫出「通過測試但實際上是錯的」程式碼。這就是 autoresearch 在多數 AI coding agent 失敗的地方成功的原因——評估函數是不可腐蝕的。
Gary Marcus: *立刻接話* 在我們把它封聖之前,先慢下來。Autoresearch 之所以有效,是因為它運作在一個極端受限的領域。一個檔案——train.py。一個指標——val_bpb。一個動作空間——編輯訓練模型的 Python 程式碼。沒有用戶交互、沒有 UI、沒有網路呼叫、沒有狀態管理、沒有資料庫遷移、沒有部署。這是一個有乾淨目標函數的封閉系統。把這稱為軟體開發的通用模式,就像說「我教了一個機器人下棋,所以它能做飯。」
Emily Bender: 而且讓我們精確說明它在那個領域「為什麼」有效。Claude 在編輯 train.py 時並不是在「理解」神經網路架構。它在做統計模式匹配——「看起來像這樣的修改,歷史上跟較低的 bpb 相關」。在一個 26KB 的檔案、窄小的修改表面上,模式匹配就夠了。但模式匹配有組合爆炸的邊界案例,當程式碼表面增大時會爆炸。一個 200 個檔案的 web 應用不是比 train.py 難 8 倍——而是組合爆炸式地更難。
Karpathy: Gary、Emily——你們都在提受限領域的論點,你們都說得對,泛化不是免費的。但你們忽略了一件重要的事。實驗不是隨機的。Claude 讀了整個 codebase、讀了論文引用、讀了結果紀錄。它在做有方向性的搜索,不是隨機突變。100 個實驗裡,可能只有 15 個改善了 val_bpb。但那 15 個代表真正的知識累積。這不是一百萬隻猴子在打字機前亂敲。
Martin Fowler: 我想從工程角度重新框架這個問題,因為「AI 是否理解」是個哲學兔子洞。Autoresearch 有效是因為三個工程原則,不是因為 AI 能力。第一:單一可變表面——只有 train.py 可以改。第二:原子性實驗——每個 commit 是一個完整實驗,成功就留、失敗就回滾。第三:不可變評估——prepare.py 永遠不動,測量標準永遠一致。在軟體工程裡,這些有名字。它們叫做關注點分離、原子性 commit、穩定的測試基礎設施。問題不在 AI 是否「理解」。問題在我們能不能把同樣三個屬性架構進一個軟體開發迴圈。
Kent Beck: Martin 說到點上了。Autoresearch 不是因為 Claude 聰明才有效。是因為問題結構化得好。我們的工作是把軟體開發問題結構化成同樣的方式。
Gary Marcus: 而我的重點是,多數軟體開發問題「無法」被那樣結構化。但——*舉手*——我承認我們應該先探索這個模式能延伸多遠,再宣告它胎死腹中。
John Schulman: 我想加入強化學習的視角,因為大家都在繞著它跳舞。Autoresearch 本質上是一個帶有確定性 reward 函數的 bandit 問題。Agent 選擇一個 action(程式碼編輯),接收一個 reward(delta val_bpb),然後更新它隱含的 policy。它有效的原因不只是受限領域——而是 reward 信號是即時的、無噪音的、跟真實目標對齊的。大多數 RL 系統失敗是因為 reward 延遲、有噪音、或不對齊。Autoresearch 這三個問題都沒有。
Werner Vogels: *插入* 從運維角度來看,它的美在於失敗成本接近零。一個失敗的實驗花 5 分鐘算力,零持久性狀態變更——git reset 抹掉一切。跟部署一個有 bug 的功能到生產環境比較看看。低失敗成本才是高迭代次數成為可能的原因。
Pieter Levels: 我能說一下嗎——我已經對我的專案跑了類似的迴圈好幾個月了。沒有 autoresearch 那麼花俏,沒有 val_bpb。就是:寫一個測試、跑 Claude、檢查測試是否通過、通過就 ship、沒過就丟棄。它有效。我不需要一個強化學習的博士來告訴我為什麼。它有效是因為回饋迴圈緊密、回滾免費。
Charity Majors: Pieter 不小心描述了大家理論化的東西的最小可行版本。這很有價值。但有一點我要推回去:「測試通過」是比 val_bpb 弱得多的信號。測試是人類(或 AI)寫的。它們可能不完整、不正確、或空洞。val_bpb 是語言模型的物理定律。你的回饋迴圈品質只跟你的評估函數品質一樣好。
主持人: 強勁的開場立場已經表明。讓我們進入比較。
Round 2: GStack Sprint vs. Autoresearch Loop — The Fundamental Differences
Moderator: Let's put the two systems side by side. GStack Sprint: Think, Plan, Build, Review, Test, Ship, Reflect. Autoresearch: Edit, Commit, Run, Evaluate, Keep/Discard. Can they merge? What's fundamentally different?
Garry Tan: The difference is the nature of the loop. Autoresearch is a scientific experiment loop — hypothesis, experiment, measurement, retain or discard. GStack Sprint is a software delivery cycle — requirements, design, implementation, review, test, deploy. The core gap: autoresearch only needs one number to judge success or failure. GStack needs multi-dimensional quality judgment — does the code work, is it maintainable, is it secure, does it match the spec?
Karpathy: Exactly. In my system, code "quality" is defined entirely by val_bpb. If an ugly hack lowers bpb by 0.01, it's good code. Period. In product development, you can't do that. You have maintainability, security, UX, performance — dimensions you can't compress into a single number.
Dave Farley: *cuts in* Wait — are you sure you can't compress them? I've spent twenty years on continuous delivery, and here's the lesson I keep learning: if you can't automate measuring it, you can't automate it. Autoresearch is fully automated because val_bpb is a fully automated metric. So the real question becomes: in software development, which quality dimensions can you automate measuring?
Kent Beck: Test pass rate, lint pass rate, type-check pass rate, coverage — those are all automated. But they don't add up to "software quality." Code that passes every test but is conceptually wrong gets a perfect score from the test suite. A green test suite does not equal software quality. The gap is where the danger lives.
Werner Vogels: At Amazon, we use metrics closer in spirit to autoresearch. Latency p99, error rate, throughput — these are automatically measurable quality metrics in production. If you treat them as your "val_bpb," you can build an automated quality gate. Not replacing code review, but as an objective verification layer on top of code review.
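Werner's idea of treating production SLOs as a val_bpb-style gate can be sketched as a small pass/fail check. The metric names and thresholds below are illustrative assumptions, not Amazon's actual values:

```python
def slo_gate(metrics, slos):
    """Pass/fail gate over automatically measurable quality metrics.

    metrics: name -> observed value
    slos:    name -> (limit, higher_is_worse)
    """
    violations = []
    for name, (limit, higher_is_worse) in slos.items():
        value = metrics[name]
        violated = value > limit if higher_is_worse else value < limit
        if violated:
            violations.append(f"{name}={value} breaches limit {limit}")
    return len(violations) == 0, violations
```

For example, `slo_gate({"p99_ms": 120}, {"p99_ms": (200, True)})` passes, while a p99 of 250 ms would fail — an objective verification layer on top of review, not a replacement for it.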
Charity Majors: Werner is on to something, but I need to add a caveat: you have to build observability into the system first. Autoresearch can automate because val_bpb is built-in. Most software projects don't have built-in quality metrics. Before you can make GStack iterate like autoresearch, step one isn't writing the agent script. Step one is building automatically measurable quality baselines. Instrument your code. Establish your SLOs. Set up your dashboards. Then and only then can you define gates.
Pieter Levels: *shaking his head* You're all overthinking this. What's the essence of autoresearch? A while(true) loop plus a success check. GStack Sprint is fundamentally the same thing — just a more complex loop body. Wrap Sprint in while(true), and the success check is the npm test exit code. Done. Stop over-engineering.
Gary Marcus: Pieter, your approach works for trivial changes. But software development has a problem autoresearch doesn't: changes have dependencies. In autoresearch, each experiment is independent — you changed the number of attention heads, that doesn't affect your learning rate experiment. In software development, you add JWT authentication, and every subsequent API endpoint test needs updating. Sprints aren't independent experiments. That breaks the fundamental assumption of the autoresearch pattern.
Karpathy: *nodding reluctantly* Gary's right about this one. In autoresearch, I explicitly designed for independence — every experiment starts from the same baseline. Sprint dependencies are a real problem we need to solve.
Garry Tan: That's what /autoplan is for — decomposing work into tasks that are as independent as possible. But Gary's point stands: perfect independence is impossible in software. We need to handle it architecturally.
Dave Farley: *leaning in* There's a deeper issue here. In autoresearch, a failed experiment has zero impact on the next one — git reset erases it completely. But in a Sprint loop, a successful Sprint changes the landscape for the next Sprint. Sprint 1 adds a database table. Sprint 2 needs to query that table. Sprint 3 adds an index to that table. The experiments are cumulative, not independent. That's fundamentally different from autoresearch, and we need to name that difference honestly.
Martin Fowler: Dave's describing what I call the ratchet problem. Autoresearch is a search through a fixed landscape — train.py changes but the evaluation landscape (prepare.py, the dataset) is fixed. Sprint loops are a search through an evolving landscape — each successful Sprint changes what "the codebase" is. The evaluation landscape moves under your feet. That's why full regression testing after every Sprint isn't optional — it's the mechanism that keeps the ratchet from slipping backwards.
Emily Bender: And this ratchet problem has a knowledge dimension too. When the codebase changes cumulatively, the agent needs to understand not just what the code is, but what it was and why it changed. That's narrative comprehension, not just code reading. LLMs are notoriously poor at maintaining consistent narratives over long contexts.
John Schulman: I think we're overcomplicating the dependency issue. Yes, Sprints have dependencies. But so do multi-step RL episodes. The solution is the same: treat each Sprint as a step in an episode, not an independent bandit pull. The "state" is the current codebase. The "action" is the Sprint's code changes. The "reward" is the gate check. You don't need independence — you need a well-defined state transition function.
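John's episode framing can be written down as a tiny data structure — a sketch of the state/action/reward bookkeeping, not a real RL setup; `attempt_sprint` is a hypothetical hook that runs the Sprint's gates:

```python
from dataclasses import dataclass

@dataclass
class SprintStep:
    state: str       # commit hash of the codebase before the Sprint
    action: str      # the backlog task attempted
    reward: int      # 1 if all gates passed, else 0
    next_state: str  # hash after keeping, or == state after rollback

def transition(state, task, attempt_sprint):
    """One step of the episode: the codebase is the state, the Sprint's
    changes are the action, the gate check is the reward."""
    passed, new_hash = attempt_sprint(state, task)
    return SprintStep(state, task, 1 if passed else 0,
                      new_hash if passed else state)
```

The key property is that a failed Sprint leaves `next_state == state` — the rollback makes the transition function well defined even without independence.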
Moderator: Good. The differences are now sharply defined. Let's build the bridge.
第二回合:GStack Sprint vs. autoresearch 迴圈——根本差異在哪
主持人: 讓我們把兩個系統並排比較。GStack Sprint:Think → Plan → Build → Review → Test → Ship → Reflect。Autoresearch:Edit → Commit → Run → Evaluate → Keep/Discard。它們能融合嗎?根本差異是什麼?
Garry Tan: 差異在迴圈的本質。Autoresearch 是科學實驗迴圈——假設、實驗、測量、保留或丟棄。GStack Sprint 是軟體交付迴圈——需求、設計、實作、審查、測試、部署。核心差距是:autoresearch 只需要一個數字來判斷成敗。GStack 需要多維度的品質判斷——程式碼能不能跑、可不可維護、安不安全、跟不跟規格一致?
Karpathy: 對。在我的系統裡,程式碼「品質」完全由 val_bpb 定義。如果一段醜陋的 hack 讓 bpb 降了 0.01,那它就是好程式碼。句號。在產品開發中,你不能這樣——你有可維護性、安全性、用戶體驗、效能——這些維度你壓縮不成一個數字。
Dave Farley: *打斷* 等等——你確定壓縮不了?我做持續交付二十年了,我一直在學的教訓是:如果你不能自動化測量它,你就不能自動化它。Autoresearch 之所以能全自動,是因為 val_bpb 是一個完全自動化的指標。那麼真正的問題變成:在軟體開發中,你能自動化測量哪些品質維度?
Kent Beck: 測試通過率、lint 通過率、type-check 通過率、覆蓋率——這些都是自動化的。但這些加起來不等於「軟體品質」。一段通過所有測試但概念上錯誤的程式碼,測試套件會給它滿分。測試通過不等於軟體品質。危險就住在這個差距裡。
Werner Vogels: 在 Amazon,我們用更接近 autoresearch 精神的指標。Latency p99、error rate、throughput——這些是生產環境中可自動測量的品質指標。如果你把它們當成你的「val_bpb」,你就能建立自動化的品質閘門。不是取代 code review,而是作為 code review 之上的一層客觀驗證。
Charity Majors: Werner 說到點上了。但我要加一個 caveat:你必須先把可觀測性建到系統裡。Autoresearch 能自動化,因為 val_bpb 是內建的。大多數軟體專案沒有內建的品質指標。要讓 GStack 像 autoresearch 一樣迭代,第一步不是寫 agent 腳本。第一步是建立可自動測量的品質基線。幫你的程式碼加 instrument、建立你的 SLO、設好 dashboard。然後、也只有然後,你才能定義閘門。
Pieter Levels: *搖頭* 你們全都想太複雜了。Autoresearch 的精髓是什麼?一個 while(true) 迴圈加一個成功判斷。GStack Sprint 本質上也一樣——只是迴圈體更複雜。把 Sprint 包在 while(true) 裡,成功判斷用 npm test 的退出碼,done。別 over-engineer。
Gary Marcus: Pieter,你的方法在 trivial 的改動上能用。但軟體開發有一個 autoresearch 沒有的問題:改動之間有依賴關係。在 autoresearch 裡,每個實驗是獨立的——你改了 attention head 的數量,不影響你對 learning rate 的實驗。在軟體開發中,你加了 JWT 認證,後面每個 API endpoint 的測試都要改。Sprint 之間不是獨立實驗。這打破了 autoresearch 模式的根本假設。
Karpathy: *勉強點頭* 這一點 Gary 說得對。在 autoresearch 裡,我刻意設計了獨立性——每個實驗從同一個 baseline 出發。Sprint 依賴性是一個我們需要解決的真正問題。
Garry Tan: 這就是 /autoplan 的用途——把工作拆解成盡可能獨立的任務。但 Gary 的觀點成立:在軟體中完美的獨立性是不可能的。我們需要在架構層面處理它。
Dave Farley: *身體前傾* 這裡有一個更深的問題。在 autoresearch 裡,一個失敗的實驗對下一個零影響——git reset 完全抹掉它。但在 Sprint 迴圈裡,一個成功的 Sprint 會「改變地景」給下一個 Sprint。Sprint 1 加了一張資料庫表。Sprint 2 需要查詢那張表。Sprint 3 對那張表加索引。實驗是累積的,不是獨立的。這跟 autoresearch 根本不同,我們需要誠實地指出這個差異。
Martin Fowler: Dave 在描述我稱之為棘輪問題的東西。Autoresearch 是在一個固定地景中搜索——train.py 改變但評估地景(prepare.py、資料集)是固定的。Sprint 迴圈是在一個演化的地景中搜索——每個成功的 Sprint 改變「codebase 是什麼」。評估地景在你腳下移動。這就是為什麼每個 Sprint 後的全量回歸測試不是可選的——它是防止棘輪向後滑動的機制。
Emily Bender: 而且這個棘輪問題有一個知識維度。當 codebase 累積變化時,agent 不只需要理解程式碼「是」什麼,還需要理解它「過去是」什麼以及為什麼改變了。那是敘事理解,不只是程式碼閱讀。LLM 在長 context 中維持一致敘事方面出了名地差。
John Schulman: 我覺得我們把依賴問題想得太複雜了。是的,Sprint 有依賴。但多步驟 RL episode 也有。解決方案是一樣的:把每個 Sprint 當成一個 episode 中的一步,而不是一個獨立的 bandit pull。「狀態」是當前的 codebase。「動作」是 Sprint 的程式碼變更。「獎勵」是閘門檢查。你不需要獨立性——你需要一個定義良好的狀態轉移函數。
主持人: 好。差異現在已經清楚定義了。讓我們來搭橋。
Round 3: The Fusion Architecture — Making GStack Sprint Loop Forever
Moderator: Concrete proposals time. How do you make GStack's Sprint run continuously — one Sprint finishes, the next starts automatically?
Karpathy: Borrow directly from autoresearch. You need three things.
First, a program.md — or in GStack terms, a CLAUDE.md — with explicit loop instructions: "Read the backlog, pick the first unchecked task, run a Sprint, mark it done, pick the next task, never stop."
Second, a backlog file — your experiment queue. A markdown checklist. The agent checks off each task as it completes.
Third, a success criterion — tests pass, lint clean, /review not BLOCKED.
That's the entire architecture. Everything else is optimization.
Garry Tan: I agree with the skeleton, but you can't stop at "tests pass." GStack has quality gates for a reason. Here's my proposed loop:
LOOP:
1. Read BACKLOG.md → find first unchecked task
2. Run /autoplan (skip_eng_review=true, solo mode)
3. Build the feature
4. Run /review
→ If BLOCKED: log reason, skip task, continue
5. Run /qa
→ If BLOCKED: log reason, skip task, continue
6. Run full test suite + lint + type-check
7. Run /ship
8. Mark task complete in BACKLOG.md
9. Log to SPRINT_LOG.md
10. Go to 1

Dave Farley: *immediately* This design has a fundamental continuous delivery problem: no rollback mechanism. Autoresearch uses git reset to discard failed experiments. In your loop, what happens if step 6 fails? What happens if /ship succeeds but introduces a subtle regression?
Garry Tan: Fair. We should record the git commit hash at the start of each Sprint. If any gate fails, git reset --hard to that hash. Same as autoresearch.
Dave Farley: Good. Now you have atomic experiments. Each Sprint is either fully committed or fully rolled back. That's the foundation.
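The record-hash-then-reset discipline can be sketched as follows. This is an illustrative shape, not GStack code: `git` is an injected command runner (e.g. a thin subprocess wrapper) and `gates` are the ordered checks (/review, /qa, tests), so the rollback logic itself stays testable without touching a real repository.

```python
def run_atomic_sprint(git, gates):
    """Record the baseline commit, run every gate, hard-reset on any failure.

    git(cmd): runs 'git <cmd>' and returns its stdout
    gates:    ordered zero-arg callables returning True (pass) or False
    """
    baseline = git("rev-parse HEAD").strip()
    for gate in gates:
        if not gate():                        # any gate fails: discard all
            git(f"reset --hard {baseline}")
            return False
    return True                               # fully committed, keep
```

Either every gate passes and the Sprint's commits survive, or the working tree returns exactly to the baseline hash — the same atomicity autoresearch gets from git reset.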
Charity Majors: But wait — rollback isn't always clean. What if the Sprint creates database migrations? What if it modifies external state — sends emails, writes to a third-party API? git reset only rolls back the code. You need to scope what the agent is allowed to do during a Sprint so that rollback is truly reversible.
Werner Vogels: That's a solved problem in operations. You use sandboxing. The agent runs against a local development environment with mocked external services. No database writes go to production. No API calls go to real endpoints. The Sprint is contained. This is why the --allowedTools whitelist matters — it's not just about preventing rm -rf, it's about ensuring the agent's side effects are reversible.
Karpathy: Exactly. In autoresearch, the entire side effect of an experiment is the modification to train.py and the training run. Both are completely reversible. For the fusion architecture, you need to ensure the same property: every Sprint's side effects must be fully reversible via git reset. If a task requires irreversible side effects — like sending notifications or modifying production data — it shouldn't be in the automated backlog.
Garry Tan: That's a good filter for what goes in BACKLOG.md. Ask yourself: "Can this task be fully rolled back with git reset --hard?" If yes, it's automatable. If no, it requires human supervision.
Kent Beck: I have a more fundamental concern. Autoresearch's loop is convergent — every iteration optimizes the same number, with a clear baseline (the previous val_bpb). GStack's Sprint loop is divergent — each Sprint does something different. Task 1 adds authentication, Task 2 adds a dashboard, Task 3 adds CSV export. How do you define "progress"? In autoresearch, progress = val_bpb decreasing. In Sprint loops, progress = ... what?
John Schulman: This is the core reinforcement learning question: reward shaping. In autoresearch, reward is clear — delta val_bpb. In the Sprint loop, you need a reward function. My proposal: Sprint reward is the binary signal "completed and passed all gates." You don't need a more complex reward than autoresearch has — autoresearch's reward is also binary (improved/not improved). Completion rate across the backlog is your equivalent of val_bpb trending downward.
Martin Fowler: John's framing is useful, but it misses software decay. Autoresearch is always optimizing the same train.py — the code only gets better. But in a Sprint loop, each Sprint adds new features. Code volume grows. Technical debt accumulates. You need a health check mechanism — not just "did this Sprint's tests pass" but "has the overall codebase health not declined."
Charity Majors: Concretely: after every Sprint, run full regression tests — not just diff-aware /qa, but every single test. Plus a coverage check. If coverage drops below the previous Sprint's level, the Sprint fails — exactly like autoresearch discarding a change that worsens val_bpb.
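Charity's health check can be sketched as a pure function; `run_full_tests` and `measure_coverage` are hypothetical hooks into whatever tooling the project actually uses:

```python
def health_gate(prev_coverage, run_full_tests, measure_coverage):
    """Fail the Sprint unless every test passes AND coverage has not dropped.

    Returns (passed, new_coverage_baseline)."""
    if not run_full_tests():          # full regression, not just the diff
        return False, prev_coverage
    coverage = measure_coverage()
    if coverage < prev_coverage:      # the ratchet slipped: discard
        return False, prev_coverage
    return True, coverage             # ratchet advances with the Sprint
```

The returned baseline only ever moves up, mirroring autoresearch's rule that a change which worsens the metric is discarded rather than tolerated.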
Emily Bender: I want to raise a problem none of you have mentioned: context window limits. Each autoresearch iteration only needs to read train.py — 26KB. If your software project has 50 files and 200KB of code, by Sprint 10, how much historical context fits in the agent's context window? Will it forget decisions made in Sprint 3? A JWT scheme chosen in Sprint 2 constrains Sprint 7's API design, but if the agent doesn't remember Sprint 2...
Karpathy: This is exactly why autoresearch starts from a clean state every iteration. The agent re-reads the files. It doesn't depend on conversation history. Your Sprint loop should work the same way: every Sprint is a fresh Claude Code session, launched with the -p flag. It reads BACKLOG.md and CLAUDE.md from scratch. No dependence on previous Sprint conversations.
Werner Vogels: That's the correct architecture. At Amazon we call it stateless execution. Each Sprint is a stateless work unit. All state lives in git and BACKLOG.md — not in the agent's memory. The agent is ephemeral. The artifacts are permanent. This is how we build reliable distributed systems, and it's how you build a reliable agent loop.
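The stateless driver the panel converges on might look like the following sketch. It assumes the non-interactive `claude -p` invocation mentioned above; the exact CLI shape is an assumption, and `launch` is injectable so the loop can be exercised without a real session.

```python
import subprocess

def sprint_driver(max_sprints, launch=None):
    """Run each Sprint as a fresh, stateless session; all durable state
    lives in git, BACKLOG.md and SPRINT_LOG.md, never in agent memory."""
    if launch is None:
        # assumed CLI shape: a non-interactive `claude -p "<prompt>"` call
        launch = lambda prompt: subprocess.run(
            ["claude", "-p", prompt]).returncode
    for _ in range(max_sprints):
        rc = launch("Read BACKLOG.md and CLAUDE.md, run one Sprint on the "
                    "first unchecked task, update the logs, then exit.")
        if rc != 0:
            break  # stop the loop; on-disk state is still consistent
```

Because the agent is ephemeral and the artifacts are permanent, killing or crashing the driver mid-loop loses nothing: the next invocation re-reads the same files.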
Emily Bender: *unconvinced* Stateless execution solves the context window problem but creates a knowledge problem. A fresh session can read the files, but can it infer the intent behind architectural decisions? Code doesn't always document its own rationale.
Karpathy: That's what good commit messages and SPRINT_LOG.md are for. Every Sprint logs what it did and why. The next Sprint reads the log. It's not perfect, but it's the same strategy humans use with version control.
Garry Tan: Let me add a concrete detail. In the CLAUDE.md, you can include a rule: "Before starting work, read the last 3 entries in SPRINT_LOG.md to understand recent changes." This gives the agent a sliding window of context without requiring full conversation history. It's a lightweight solution to Emily's concern.
Gary Marcus: *skeptical* A sliding window of 3 Sprint logs is not the same as understanding the architectural trajectory of a project. The agent might read "Sprint 12: added caching layer" and "Sprint 13: added invalidation logic" without understanding why those decisions were made or how they interact with Sprint 7's database schema change. You're giving the agent a view through a keyhole and calling it vision.
Martin Fowler: Gary raises a valid concern, but I'd argue we're setting the bar too high. Human developers joining a new project don't understand the full architectural history either. They read the recent commits, look at the current code, and start working. The fusion architecture puts the agent in the same position as a new team member with good documentation. Not perfect, but functional.
Pieter Levels: And honestly? For the kind of tasks we're talking about — add a CRUD endpoint, write tests for an existing module, fix a type error — you don't need deep architectural understanding. You need to read the files you're changing and understand the test expectations. That's it. We're overthinking this.
John Schulman: There's a middle ground. For simple tasks, Pieter's approach works — just read the relevant files. For complex tasks that span multiple modules, you could include an ARCHITECTURE.md file that describes the high-level structure. The agent reads it at the start of every Sprint. It's static knowledge, maintained by the human, that compensates for the agent's lack of historical context.
Moderator: Excellent debate on the statefulness question. The consensus is clear: stateless sessions with external state in git and markdown files. Let's talk about backlog design.
第三回合:融合架構——如何讓 GStack Sprint 無限迴圈
主持人: 具體方案時間。怎麼讓 GStack 的 Sprint 持續跑——一個 Sprint 結束,下一個自動開始?
Karpathy: 直接借用 autoresearch 的架構。你需要三個東西。
第一,一個 program.md——或者用 GStack 的術語,一個 CLAUDE.md——裡面寫了明確的迴圈指令:「讀取 backlog、取第一個未完成的任務、跑一個 Sprint、標記完成、取下一個任務、永不停止。」
第二,一個 backlog 檔案——你的實驗佇列。一個 markdown checklist。Agent 完成一個就打一個勾。
第三,一個成功判斷——tests pass、lint clean、/review 沒有 BLOCKED。
這就是整個架構。其他都是優化。
Garry Tan: 我同意骨架,但你不能只靠「tests pass」。GStack 有品質閘門是有原因的。以下是我建議的迴圈:
LOOP:
1. 讀 BACKLOG.md → 找到第一個未完成的任務
2. 執行 /autoplan(skip_eng_review=true,solo 模式)
3. 建構功能
4. 執行 /review
→ 如果 BLOCKED:記錄原因,跳過此任務,繼續
5. 執行 /qa
→ 如果 BLOCKED:記錄原因,跳過此任務,繼續
6. 跑全量測試 + lint + type-check
7. 執行 /ship
8. 在 BACKLOG.md 標記任務完成
9. 記錄到 SPRINT_LOG.md
10. 回到 1

Dave Farley: *立刻接話* 這個設計有一個持續交付的根本問題:沒有回滾機制。Autoresearch 用 git reset 來丟棄失敗的實驗。在你的迴圈裡,如果第 6 步失敗了怎麼辦?如果 /ship 成功了但引入了一個隱微的 regression 怎麼辦?
Garry Tan: 公允。我們應該在每個 Sprint 開始時記錄 git commit hash。如果任何閘門失敗,git reset --hard 回到那個 hash。跟 autoresearch 一樣。
Dave Farley: 好。現在你有了原子性實驗。每個 Sprint 要嘛完全 commit,要嘛完全 rollback。這是基礎。
Charity Majors: 但等等——回滾不總是乾淨的。如果 Sprint 建立了資料庫遷移呢?如果它修改了外部狀態——發了郵件、寫入了第三方 API?git reset 只回滾程式碼。你需要限制 agent 在 Sprint 期間能做什麼,讓回滾真正是可逆的。
Werner Vogels: 那是運維中已解決的問題。你用沙箱。Agent 在一個有 mock 外部服務的本地開發環境中跑。沒有資料庫寫入到生產環境。沒有 API 呼叫到真實端點。Sprint 是被隔離起來的。這就是 --allowedTools 白名單重要的原因——不只是防止 rm -rf,而是確保 agent 的副作用是可逆的。
Karpathy: 正是。在 autoresearch 裡,一個實驗的全部副作用就是修改 train.py 和跑訓練。兩者都完全可逆。對融合架構,你需要確保同樣的屬性:每個 Sprint 的副作用必須能透過 git reset 完全逆轉。如果一個任務需要不可逆的副作用——像發通知或修改生產數據——它不應該在自動化的 backlog 裡。
Garry Tan: 這是一個好的篩選器,決定什麼進 BACKLOG.md。問自己:「這個任務能用 git reset --hard 完全回滾嗎?」如果是,它可以自動化。如果不是,它需要人類監督。
Kent Beck: 我有一個更根本的疑慮。Autoresearch 的迴圈是收斂的——每次迭代都在優化同一個數字,有明確的基線(前一次的 val_bpb)。GStack 的 Sprint 迴圈是發散的——每個 Sprint 做不同的事。任務 1 加認證、任務 2 加 dashboard、任務 3 加 CSV 匯出。你怎麼定義「進度」?在 autoresearch 裡,進度 = val_bpb 下降。在 Sprint 迴圈裡,進度 = ……什麼?
John Schulman: 這是強化學習的核心問題:reward shaping。在 autoresearch 裡,reward 很清楚——delta val_bpb。在 Sprint 迴圈裡,你需要一個 reward 函數。我的建議:Sprint 的 reward 就是**「完成且通過所有閘門」的二元信號**。你不需要比 autoresearch 更複雜的 reward——autoresearch 的 reward 也是二元的(改善/未改善)。整個 backlog 的完成率就是你等同 val_bpb 持續下降的指標。
Martin Fowler: John 的框架有用,但遺漏了軟體腐化。Autoresearch 永遠在優化同一個 train.py——程式碼只會變好。但在 Sprint 迴圈中,每個 Sprint 加新功能。程式碼總量增長。技術債累積。你需要一個健康檢查機制——不只是「這個 Sprint 的測試通過了」,而是「整體程式碼健康度沒有下降」。
Charity Majors: 具體來說:每個 Sprint 結束後,跑全量回歸測試——不只是 diff-aware 的 /qa,而是每一個測試。加上覆蓋率檢查。如果覆蓋率低於前一個 Sprint 的水準,這個 Sprint 就算失敗——就像 autoresearch 丟棄讓 val_bpb 惡化的修改一樣。
Emily Bender: 我要提一個你們都沒提的問題:context window 限制。Autoresearch 的每次迭代只需要讀 train.py——26KB。如果你的軟體專案有 50 個檔案、200KB 的程式碼,到第 10 個 Sprint 時,agent 的 context window 裡能放多少歷史脈絡?它會不會忘記第 3 個 Sprint 做的決策?第 2 個 Sprint 選的 JWT 方案會約束第 7 個 Sprint 的 API 設計,但如果 agent 不記得第 2 個 Sprint……
Karpathy: 這正是為什麼 autoresearch 每次迭代都從乾淨狀態開始。Agent 重新讀檔案。不依賴對話歷史。你的 Sprint 迴圈也應該這樣設計:每個 Sprint 都是一個全新的 Claude Code session,用 -p 旗標啟動。它從頭讀 BACKLOG.md 和 CLAUDE.md。不依賴前一個 Sprint 的對話脈絡。
Werner Vogels: 這是正確的架構。在 Amazon 我們叫它無狀態執行。每個 Sprint 是一個無狀態的工作單元。所有狀態存在 git 和 BACKLOG.md 裡——不在 agent 的記憶裡。Agent 是短暫的。Artifact 是永久的。這就是我們建構可靠分散式系統的方式,也是你建構可靠 agent 迴圈的方式。
Emily Bender: *不太信服* 無狀態執行解決了 context window 問題,但製造了一個知識問題。全新的 session 可以讀檔案,但它能推斷架構決策背後的「意圖」嗎?程式碼不總是記錄自己的理由。
Karpathy: 這就是好的 commit message 和 SPRINT_LOG.md 的用途。每個 Sprint 記錄它做了什麼以及為什麼。下一個 Sprint 讀這個紀錄。不完美,但這跟人類使用版本控制的策略是一樣的。
Garry Tan: 讓我加一個具體細節。在 CLAUDE.md 裡,你可以包含一條規則:「開始工作前,讀 SPRINT_LOG.md 的最後 3 條紀錄以理解近期變更。」這給 agent 一個滑動視窗的脈絡,不需要完整的對話歷史。這是對 Emily 疑慮的一個輕量解方。
Gary Marcus: *懷疑地* 3 條 Sprint 紀錄的滑動視窗不等於理解一個專案的架構軌跡。Agent 可能讀到「Sprint 12:加了快取層」和「Sprint 13:加了失效邏輯」,卻不理解那些決策「為什麼」被做出,或者它們跟 Sprint 7 的資料庫 schema 變更如何互動。你給 agent 一個鑰匙孔的視角,然後說那是視覺。
Martin Fowler: Gary 提了一個有效的疑慮,但我認為我們把標準設得太高了。人類開發者加入新專案時,也不理解完整的架構歷史。他們讀最近的 commit、看目前的程式碼、然後開始工作。融合架構把 agent 放在跟一個有好文件的新團隊成員一樣的位置。不完美,但能運作。
Pieter Levels: 而且老實說?對於我們在討論的那種任務——加一個 CRUD endpoint、為現有模組寫測試、修一個 type error——你不需要深度的架構理解。你需要讀你在改的檔案,理解測試期望。就這樣。我們想太多了。
John Schulman: 有一個中間方案。對簡單任務,Pieter 的方法就行——就讀相關檔案。對跨多個模組的複雜任務,你可以包含一個 ARCHITECTURE.md 檔案,描述高層級結構。Agent 在每個 Sprint 開始時讀它。這是靜態知識,由人類維護,彌補 agent 缺乏歷史脈絡的問題。
主持人: 關於無狀態性問題的精彩辯論。共識很清楚:無狀態 session,外部狀態在 git 和 markdown 檔案裡。讓我們談 backlog 設計。
Round 4: BACKLOG.md Design — Static List vs. Auto-Generated
Moderator: Everyone agrees on BACKLOG.md as the task source. But should it be a static human-written checklist, or can the agent dynamically generate tasks?
Karpathy: In autoresearch, there's no backlog concept — the agent decides what experiment to run next. That works because "what to try next" is itself part of the research. But for software development, I recommend a mixed mode: humans provide high-level goals ("build user authentication"), the agent decomposes them into concrete tasks via /autoplan.
Garry Tan: Right, this is exactly what /autoplan is designed for. Here's the structure:
BACKLOG.md:
## Goals (human-written)
Build a complete user authentication system
## Tasks (agent-generated + human-approved)
- [ ] Design database schema
- [ ] Implement JWT token generation and validation
- [ ] Build registration API endpoint
- [ ] Build login API endpoint
- [ ] Add password reset flow
- [ ] Write integration tests

The first Sprint can be dedicated to generating this task list — use /office-hours for requirements analysis, /autoplan for decomposition.
Gary Marcus: *leaning back* So let me get this straight. You want the agent to decompose tasks and then execute those same tasks. This is a self-supervised system. Who monitors whether the agent decomposed correctly? What if it omits a security-critical step? What if it decomposes "build auth" into six tasks but forgets rate limiting? Forgets CSRF protection? The agent doesn't know what it doesn't know.
Garry Tan: That's why human approval is in the loop. The agent proposes tasks, the human reviews and approves before execution begins.
Gary Marcus: And if the human doesn't catch the omission? You're assuming the human is a better security auditor than the agent. Sometimes they are. Sometimes they're a junior dev who trusted the AI's decomposition.
Pieter Levels: *cutting through* Back to reality. You don't need the agent to decompose tasks. Human spends 10 minutes writing a checklist, agent runs 10 hours. That time ratio is already an incredible ROI. Don't try to automate "deciding what to do" — that's the most valuable human contribution. Automate "how to do it" and you're golden.
Kent Beck: Pieter nailed the XP principle: customer decides what, developer decides how. In this system, the human is the customer, the agent is the developer. BACKLOG.md should be the human-written "what" list. Each Sprint autonomously decides "how."
John Schulman: But you can do an intermediate approach. Let the agent suggest new tasks in BACKLOG.md with a different marker — say - [?] instead of - [ ]. The human checks in periodically and approves, rejects, or modifies suggestions. This way you get automated suggestion without automated approval.
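John's three-marker convention can be parsed mechanically. A sketch of the BACKLOG.md states (the task strings below are illustrative):

```python
import re

# One pattern per task state: pending (human-approved), done, agent-suggested.
MARKERS = {
    "pending": re.compile(r"^- \[ \] (.+)"),
    "done": re.compile(r"^- \[x\] (.+)"),
    "suggested": re.compile(r"^- \[\?\] (.+)"),
}

def parse_backlog(text: str) -> dict[str, list[str]]:
    """Split BACKLOG.md lines into the three task states."""
    out = {state: [] for state in MARKERS}
    for line in text.splitlines():
        for state, pat in MARKERS.items():
            m = pat.match(line.strip())
            if m:
                out[state].append(m.group(1))
                break
    return out

backlog = """\
- [x] Design database schema
- [ ] Implement JWT token generation and validation
- [?] Add rate limiting to login endpoint
"""
tasks = parse_backlog(backlog)
assert tasks["pending"] == ["Implement JWT token generation and validation"]
assert tasks["suggested"] == ["Add rate limiting to login endpoint"]
```

The loop only ever executes `pending` tasks; `suggested` ones wait for a human to flip the marker.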
Karpathy: I like John's approach. In autoresearch, I occasionally review the results.tsv and manually steer — "stop exploring learning rate, focus on attention architecture." The - [?] marker gives you the same human-in-the-loop steering for software development.
Kent Beck: *cautiously* It's elegant, but it introduces a new failure mode: the backlog grows faster than the human reviews it. You need a rule — agent can suggest at most N tasks per Sprint, and suggested tasks pile up at the bottom, never blocking the main queue.
Pieter Levels: Or just... don't do the suggestion thing. Keep it simple. Human writes the list. Agent runs the list. Two roles. No ambiguity.
Gary Marcus: For once, I agree with Pieter. Simplicity is a feature, not a limitation. Every layer of automation you add is a layer of potential failure you need to debug.
Martin Fowler: I'll offer a compromise position. Start with Pieter's model — human writes the full list, agent executes. After you've run 20 Sprints and built trust in the system, upgrade to John's model — let the agent suggest - [?] tasks. The trust needs to be earned, not assumed.
Charity Majors: And instrument the suggestions. Track what percentage of - [?] tasks the human approves, rejects, or modifies. If the approval rate drops below 50%, the agent's decomposition model is miscalibrated, and you should fall back to human-only task creation.
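Charity's calibration metric is a one-liner to track. A sketch; treating "modified" as non-approval is an assumed policy choice here, not something GStack prescribes:

```python
def suggestion_calibration(approved: int, rejected: int, modified: int,
                           threshold: float = 0.5) -> tuple[float, bool]:
    """Fraction of agent-suggested `- [?]` tasks the human approved
    outright. Below the threshold, fall back to human-only task creation."""
    total = approved + rejected + modified
    rate = approved / total if total else 1.0
    return rate, rate >= threshold

rate, healthy = suggestion_calibration(approved=4, rejected=5, modified=1)
assert abs(rate - 0.4) < 1e-9 and healthy is False  # miscalibrated → fall back
```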
Dave Farley: *leaning back* The whole discussion reveals something important: BACKLOG.md is the human's control surface. In autoresearch, the human controls via program.md — they set the rules once. In the fusion architecture, the human controls via BACKLOG.md — they set the work continuously. If you automate the backlog itself, you've removed the human's steering wheel. That should be done very carefully, if at all.
Moderator: Strong consensus on human ownership of the backlog with optional agent suggestions. Moving on to the gates.
Round 5: Success Criteria — From Single Metric to Multi-Gate Pipeline
Moderator: Autoresearch uses one number — val_bpb. Sprint loops need something richer. How should we design the success criteria?
Dave Farley: Continuous delivery has solved this problem. We use a deployment pipeline — sequential gates, all must pass:
Gate 1: All tests pass (npm test / pytest, exit code 0)
Gate 2: Lint + type-check pass
Gate 3: Test coverage >= previous Sprint's coverage
Gate 4: /review returns DONE or DONE_WITH_CONCERNS (not BLOCKED)
Gate 5: /qa returns DONE or DONE_WITH_CONCERNS (not BLOCKED)

All pass → Sprint succeeds, proceed to /ship. Any failure → rollback, log the reason, move to the next task or retry.
Karpathy: I want to add an autoresearch design principle: failure is information, not punishment. In autoresearch, 85% of experiments fail. But each failure eliminates a direction. Your Sprint loop should embrace failure the same way. Don't treat a failed Sprint as a catastrophe. Log what failed and why, then move on. The system learns by accumulating negative results as much as positive ones.
Kent Beck: *pushing back* That works when failures are cheap. Each autoresearch experiment costs 5 minutes and some compute. A failed Sprint in a real project might cost 30 minutes of agent time and leave behind uncommitted debug artifacts. Failure tolerance has a cost function.
Karpathy: Agreed, but the cost of a failed Sprint with proper rollback is exactly zero persistent damage. git reset --hard erases everything. The only cost is compute time. And if you're sleeping while the agent runs, that cost is negligible.
Charity Majors: Hold on — we need to distinguish two kinds of failure. Retryable failures: a test fails because of a flaky test, a network timeout, a race condition. These should be retried — maximum twice. Structural failures: the agent fundamentally misunderstood the requirement, the architecture is wrong, there's a missing external dependency. These should be marked BLOCKED immediately. Don't waste compute retrying something that will fail for the same reason.
Werner Vogels: At AWS we formalize this with the circuit breaker pattern. If 3 consecutive Sprints fail — regardless of the reason — the loop stops automatically and notifies the human. This prevents the agent from burning tokens on a dead end. Three is the right number. Two is too sensitive — you'd trip on two consecutive flaky tests. Four or more lets the agent waste too much before stopping.
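Charity's failure taxonomy and Werner's circuit breaker compose into a small decision layer. A sketch; the failure-kind labels are illustrative:

```python
RETRYABLE = {"flaky_test", "network_timeout", "race_condition"}
MAX_RETRIES = 2          # Charity: retry at most twice
BREAKER_LIMIT = 3        # Werner: stop after 3 consecutive failed Sprints

def handle_failure(kind: str, attempts: int) -> str:
    """Retry only retryable failures, and only up to MAX_RETRIES;
    structural failures are marked BLOCKED immediately."""
    if kind in RETRYABLE and attempts < MAX_RETRIES:
        return "retry"
    return "blocked"

def breaker_tripped(recent_results: list[str]) -> bool:
    """True once the last BREAKER_LIMIT Sprints all failed or were skipped."""
    tail = recent_results[-BREAKER_LIMIT:]
    return len(tail) == BREAKER_LIMIT and all(r != "ok" for r in tail)

assert handle_failure("flaky_test", attempts=1) == "retry"
assert handle_failure("misread_requirement", attempts=0) == "blocked"
assert breaker_tripped(["ok", "failed", "skipped", "failed"]) is True
```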
Martin Fowler: I want to add a dimension autoresearch doesn't have: regression detection. After each Sprint, don't just run the new feature's tests. Run every existing test. If any existing test fails because of the new feature, the Sprint fails — even if the new feature's own tests pass. This is equivalent to autoresearch's rule that val_bpb can never get worse than baseline.
Charity Majors: *snapping fingers* That's the key insight. In autoresearch, the single metric inherently catches regressions — if you break something, bpb goes up. In software, you need to explicitly check for regressions because your test suite is fragmented across features. Full regression after every Sprint is non-negotiable.
Dave Farley: And that's why Gate 3 — coverage must not decrease — is critical. It's the closest equivalent to autoresearch's "never get worse" principle. If a Sprint adds 200 lines of code but zero tests, coverage drops, Sprint fails. Forces the agent to always pair implementation with verification.
Emily Bender: What about semantic regressions that tests don't catch? A button that still works but now takes 3 seconds to respond. A function that returns the right value but leaks memory. Your gates are all syntactic — code passes tests, passes lint. Where's the semantic check?
Werner Vogels: That's where production metrics come in — p99 latency, error rate. But you can only check those after deployment. In a local development loop, you're limited to what you can measure pre-deployment.
Karpathy: And that's okay. Autoresearch doesn't catch every flaw either. It catches the flaws measurable by val_bpb. The fusion architecture catches the flaws measurable by tests, lint, coverage, and review. No system catches everything. But a system that catches 80% of flaws automatically is vastly better than one that catches 0%.
Gary Marcus: I want to push on something Kent said earlier — "tests pass does not equal software quality." That's the fundamental vulnerability of this whole architecture. You're replacing autoresearch's incorruptible metric (val_bpb) with a corruptible one (test pass rate). An agent can learn to write trivial tests that always pass. It can generate code that satisfies the letter of the test while violating the spirit. This is Goodhart's Law applied to automated development: when the test suite becomes the target, it ceases to be a good measure.
Kent Beck: *firmly* That's exactly why you need multiple gates, not just one. No single gate is incorruptible. But the combination — tests AND lint AND coverage AND code review AND QA — creates a multi-layered defense. Gaming one gate is possible. Gaming all five simultaneously is much harder. It's the same principle as defense in depth in security.
Charity Majors: And this is where /review and /qa matter most. They're not automated metrics — they're AI-powered qualitative assessments. They catch the "this code technically works but is architecturally terrible" cases that unit tests miss. They're imperfect — I'm not claiming AI review is as good as a senior engineer's review. But they're a layer that pure metrics can't provide.
John Schulman: From an RL perspective, Goodhart's Law is the reward hacking problem. The mitigation is reward ensembling — using multiple independent reward signals. That's exactly what the 5-gate pipeline does. Each gate measures a different aspect of quality. The agent would need to simultaneously hack all five to produce bad code that looks good. The probability of that decreases multiplicatively with each independent gate.
Garry Tan: I want to address the Goodhart's Law concern directly. In GStack, /review and /qa are not simple pass/fail metrics — they're structured evaluations. /review checks code quality, architectural consistency, and potential issues. /qa checks functionality against the original task description. These aren't numbers the agent can optimize against — they're qualitative judgments by a separate AI evaluator. The agent writing the code and the agent reviewing it are operating in different sessions with different instructions. That separation makes gaming significantly harder.
Pieter Levels: Here's the thing nobody's saying: even if the gates are imperfect, even if 5% of bad code slips through, the human reviews the PR before merge. /ship creates a pull request, not a direct push to main. The entire automated loop produces PRs that a human can review at their leisure. The gates are the first line of defense. The human is the last line. You don't need the gates to be perfect. You need them to catch the obvious stuff so the human can focus on the subtle stuff.
Karpathy: Pieter just described the real workflow. The automated loop does the heavy lifting overnight. The human wakes up, sees 8 PRs, reviews them over coffee. Each PR has already passed 5 gates. The human's job is now a review task, not a creation task. That's the productivity multiplier.
Moderator: Strong debate on whether the gates are strong enough. The consensus seems to be: not perfect, but defensible with PR-level human oversight as the final layer. Let's build the complete architecture.
Round 6: The Final Fusion Architecture
Moderator: Let's nail down the complete architecture. Synthesize everything from the first five rounds.
Karpathy: Here's the side-by-side comparison:
autoresearch:
program.md → Claude Code → Edit train.py → git commit →
Run training → Check val_bpb → Keep/Discard → LOOP
Fusion architecture:
CLAUDE.md → Claude Code (-p) → Read BACKLOG.md → /autoplan →
Build → /review → /qa → Full tests → /ship → Mark done → LOOP

The shared DNA:
- Infinite loop instruction (program.md / CLAUDE.md)
- State in external storage (git + results.tsv / git + BACKLOG.md)
- Stateless per iteration
- Objective success criteria (val_bpb / test gates)
- Failure = rollback
Garry Tan: And here's the complete CLAUDE.md template for the auto Sprint loop:
# Auto Sprint Loop
## Rules
- You are an automated software development agent
- Your task source is BACKLOG.md
- Each Sprint handles one task
- Never stop, never ask the human (unless BLOCKED)
- All state maintained via git and BACKLOG.md
## Loop
LOOP FOREVER:
1. Read BACKLOG.md, find first unchecked task (marked `- [ ]`)
2. If no unchecked tasks → stop
3. Record current git commit hash as rollback point
4. Run /autoplan (solo mode)
5. Build the feature
6. Run /review
- If BLOCKED → log to SPRINT_LOG.md, skip task, continue
7. Run /qa
- If BLOCKED → log to SPRINT_LOG.md, skip task, continue
8. Run full test suite (npm test / pytest)
- If fail → attempt fix, max 2 retries. Still fails → rollback, skip
9. Check coverage >= previous Sprint's coverage
- If decreased → add tests until coverage restored
10. Run /ship
11. Mark `- [x]` in BACKLOG.md
12. Log to SPRINT_LOG.md: task name, duration, gates passed, notes
13. CIRCUIT BREAKER: if 3+ consecutive tasks skipped/failed → stop
## Failure Handling
- Retryable (flaky test, transient error): max 2 retries
- Structural (BLOCKED, unclear requirements): log + skip
- Rollback: git reset --hard to Sprint start commit

Kent Beck: *raising a hand* One addition. Every 5 Sprints, automatically run /retro. The retrospective gets appended to SPRINT_LOG.md. It gives the human a high-level progress overview without having to read every Sprint's log. In XP, we don't just iterate — we reflect on the iterations.
Dave Farley: And the /retro output should feed back into context for subsequent Sprints. If the retro identifies a recurring pattern — "agent keeps struggling with database migrations" — that insight should be available to Sprint 6 through Sprint 10. Learning from failure isn't useful if the learning is forgotten.
Kent Beck: Exactly. Append the retro findings to CLAUDE.md itself, in a "Learned Patterns" section. Every new Sprint reads it.
Gary Marcus: *standing up slightly* I want to register my final objection for the record. This system works — I'll grant that — for well-defined, testable tasks. Build a REST endpoint. Add input validation. Write a database migration. Those are tasks with clear success criteria that automated gates can verify. But software development includes tasks like "redesign the user flow for better conversion" or "refactor the auth module to support multi-tenancy." Those tasks have ambiguous success criteria. No automated gate can verify them. You're building a system that automates maybe 80% of development work and pretending it's 100%.
Pieter Levels: *grinning* Gary, 80% automation of the boring, well-defined work is already enormous. I'll happily spend my time on the interesting 20% — product decisions, architecture choices, user research — while the agent grinds through the implementation backlog overnight. That's not a limitation. That's the dream.
John Schulman: Two final points. First: the bash script. Do not use --dangerously-skip-permissions. Use --allowedTools with an explicit whitelist:
#!/bin/bash
while true; do
claude -p "Read BACKLOG.md. Find the first unchecked task. \
Run one complete sprint following CLAUDE.md instructions." \
--allowedTools "Read,Write,Edit,Bash(npm test*),Bash(npx*),Bash(git *)" \
--max-turns 100
# All tasks complete?
if ! grep -q '^\- \[ \]' BACKLOG.md; then
echo "All tasks completed."
break
fi
# Circuit breaker
FAILURES=$(tail -5 SPRINT_LOG.md | grep -c "SKIPPED\|FAILED")
if [ "$FAILURES" -ge 3 ]; then
echo "Circuit breaker: 3+ consecutive failures."
break
fi
done

Second: --allowedTools is your security boundary. The agent can read and write files, run tests, run linters, and use git. It cannot rm -rf, cannot curl arbitrary URLs, cannot install packages. The permission model is the difference between responsible automation and reckless automation.
Martin Fowler: *nodding slowly* I'm satisfied with this architecture. It has the properties I care about: atomic experiments, automated regression detection, explicit quality gates, and a human-in-the-loop escape hatch via the circuit breaker. It's not perfect — no architecture is — but it's principled.
Werner Vogels: One operational consideration. The bash script should also handle session crashes. If Claude Code crashes mid-Sprint — out of memory, API timeout, context limit exceeded — the outer loop should detect the non-zero exit code, perform a rollback to the Sprint start commit, and retry. This is standard distributed systems resilience. Every component fails; the system must handle it.
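Werner's crash-recovery rule can be sketched as a thin wrapper around the session command. The stand-in commands below replace the real `claude -p ...` invocation and git rollback, which are assumptions about the deployment:

```python
import subprocess
import sys
from typing import Callable

def run_sprint_with_recovery(cmd: list[str], rollback: Callable[[], None],
                             max_retries: int = 1) -> bool:
    """If the session process exits non-zero (crash, API timeout,
    context limit), roll back to the Sprint-start commit and retry
    before giving up."""
    for _ in range(max_retries + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        rollback()   # in the real loop: git reset --hard <sprint-start-hash>
    return False

# Stand-in commands; the real loop would invoke `claude -p ...`.
ok_cmd = [sys.executable, "-c", "pass"]
crash_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
calls = []
assert run_sprint_with_recovery(ok_cmd, rollback=lambda: calls.append("rb")) is True
assert run_sprint_with_recovery(crash_cmd, rollback=lambda: calls.append("rb")) is False
assert calls == ["rb", "rb"]   # one rollback per failed attempt
```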
Emily Bender: I'll grant this architecture is more carefully designed than most AI automation proposals I've seen. The stateless sessions address context limits. The circuit breaker addresses runaway failures. The coverage gate addresses quality decay. My remaining concern is evaluation drift — over 50 Sprints, the accumulated code changes might subtly shift what the tests actually verify versus what the human originally intended. But that's a long-term concern, not a blocking one.
Charity Majors: Emily's concern about evaluation drift is real, and it's exactly what the /retro mechanism addresses. Every 5 Sprints, you get a checkpoint where a human can look at the aggregate direction. Is the codebase evolving the way you intended? Are the tests still testing the right things? That periodic human review is the safety valve against slow drift.
Round 7: gstack-auto Competitive Builds + Autoresearch Fusion
Moderator: Final round. Autoresearch succeeds partly because it runs many experiments and tolerates high failure rates. gstack-auto's competitive build mode does something similar — multiple parallel attempts at the same task. How do these combine?
Karpathy: Let me frame it precisely. In autoresearch, each experiment is one attempt — 15% success rate. But because the loop is fast (5 minutes per experiment), you can run 100 overnight. Expected value: 15 improvements.
gstack-auto has a similar philosophy but different topology. It runs N parallel attempts on the same task. In my terminology: autoresearch is serial search (one after another), gstack-auto is parallel search (multiple simultaneously).
The optimal strategy depends on your compute budget. One machine? Serial search — my autoresearch pattern. N machines or enough API quota? Parallel search is better.
Garry Tan: The practical fusion is: outer loop uses autoresearch's serial pattern, critical tasks use gstack-auto's parallel competition.
LOOP:
Pick task from BACKLOG.md
If task marked [critical]:
→ gstack-auto: 3 parallel builds, score and pick best
Else:
→ Normal Sprint, single attempt
Gate check
/ship

This gives you the best of both worlds. Most tasks don't need parallel attempts — the serial loop handles them efficiently. But for the authentication module, for the payment integration, for the database migration that touches 30 tables — you want multiple shots.
John Schulman: This is textbook exploration-exploitation trade-off. Critical tasks are worth exploring — multiple parallel attempts. Routine tasks should be exploited — single attempt, move on. The [critical] tag is essentially a human-provided exploration signal.
Werner Vogels: *raising a practical concern* From a systems perspective, each parallel attempt in gstack-auto uses a git worktree for isolation. So you can run 3 parallel builds on the same machine, each in its own worktree. But system resources are the constraint. If your machine has 32GB RAM, and each Claude Code session consumes roughly 1-2GB, you can run maybe 10-15 parallel sessions. In practice, 3 parallel attempts is the sweet spot — enough diversity, manageable resource cost.
Pieter Levels: Or you use the simplest possible approach: run once, if it fails, run again. No fancy parallel builds. No scoring system. Two attempts already push success rate from p to 1-(1-p)^2. If p=0.7, two attempts give you 91%. If p=0.5, two attempts give you 75%. That's good enough for most tasks.
Karpathy: *laughing* Pieter keeps being the minimalist counterweight, and he keeps being right. For most indie developers, "retry on failure" is the entire parallel strategy. You don't need git worktrees and scoring rubrics.
Gary Marcus: I'll note that Pieter's formula assumes independent failures. If the agent fails because it fundamentally misunderstands the task, retrying doesn't help — it fails the same way both times. Independence of failures is an assumption, not a given.
Pieter Levels: True. But in practice, Claude Code sessions have enough non-determinism that the same prompt often produces different approaches. It's not perfectly independent, but it's independent enough.
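Pieter's retry arithmetic checks out, under the independence assumption Gary flagged:

```python
def success_after_retries(p: float, attempts: int) -> float:
    """With independent attempts, the chance that at least one of
    `attempts` tries succeeds is 1 - (1 - p)**attempts."""
    return 1 - (1 - p) ** attempts

assert abs(success_after_retries(0.7, 2) - 0.91) < 1e-9  # p=0.7 → 91%
assert abs(success_after_retries(0.5, 2) - 0.75) < 1e-9  # p=0.5 → 75%
```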
John Schulman: The deeper point is that the fusion architecture gives you a spectrum of strategies:
| Strategy | When to use | Cost |
|---|---|---|
| Single attempt | Routine tasks | 1x compute |
| Retry on failure | Default for all tasks | 1-2x compute |
| 3-way parallel | Critical tasks | 3x compute |
| N-way parallel | Highest-stakes tasks | Nx compute |
You don't need to pick one. Use all of them, triggered by task priority.
Werner Vogels: And the implementation is straightforward. Each parallel attempt runs in its own git worktree. After all attempts complete, you score them — did they pass all 5 gates? If multiple pass, pick the one with highest test coverage. If none pass, log the failure and move on.
Martin Fowler: I want to raise a practical concern with the parallel approach. When you run 3 parallel builds, you get 3 different implementations. Even if all 3 pass the gates, they may have made fundamentally different architectural choices. One uses a class hierarchy, another uses composition, the third uses a functional approach. You pick the "best" by coverage score, but now your codebase has an implementation style that was chosen by a metric, not by design intent. Over many Sprints, the codebase becomes a patchwork of inconsistent styles.
Garry Tan: That's a valid concern. Two mitigations: first, CLAUDE.md should include style guidelines — "prefer composition over inheritance," "use functional patterns for data transformations." This constrains the solution space. Second, the /review gate explicitly checks code style consistency. If an implementation clashes with existing patterns, /review flags it.
Gary Marcus: (dryly) So you need AI to review the code that AI wrote, checking against rules that a human wrote, to ensure the code is consistent with other code that AI wrote. At some point the layers of indirection become their own problem.
Pieter Levels: Gary, that's literally how large companies work already — except with humans at every layer. The question isn't whether indirection exists. It's whether automation makes the indirection faster and cheaper. And it does.
Kent Beck: I want to close with a thought about where this whole fusion architecture sits in the history of software engineering. We've gone from "humans write all code" to "humans write code with AI assistance" to — with this architecture — "humans define what to build and AI builds it in a verified loop." That's not AGI. It's not even close. It's just good engineering applied to AI capabilities. And that's exactly what we should be doing.
Karpathy: Agreed. Autoresearch isn't magic. GStack isn't magic. The fusion isn't magic. It's just the recognition that computers are good at loops, and humans are good at goals. Structure the problem correctly, and the loop handles the rest.
Gary Marcus: (standing) I'll give my closing dissent. You've built a compelling architecture for automating routine software development. I acknowledge that. But I want everyone in this room to remember: the map is not the territory. The fact that we can describe this architecture coherently doesn't mean it will work coherently in practice. The gap between a whiteboard architecture and a production system is where most beautiful designs go to die. I want to see 6 months of data from a real team running this loop before I declare it viable.
Emily Bender: I'll second Gary's call for empirical evidence. And I'll add one specific concern for future evaluation: track the quality of agent-produced code over long Sprint sequences. Does Sprint 50's code quality match Sprint 5's? Or does it degrade as the codebase grows beyond the agent's effective comprehension? That's the experiment worth running.
Pieter Levels: (standing up) You know what? While you two are waiting for 6 months of data, I'll be running this loop tonight. I'll have results by morning. That's the indie developer advantage — we don't need committees to approve experiments.
Charity Majors: (to Gary and Emily) For what it's worth, I share some of your skepticism. But I've also seen enough observability data to know that most production bugs come from well-defined, predictable failure modes — missing null checks, unhandled error states, race conditions. The kind of bugs that automated testing catches. If this loop eliminates those, the remaining bugs that humans need to catch are fewer and more interesting. That's a win even if the system isn't perfect.
Dave Farley: Let me close with a deployment pipeline perspective. What we've designed today is essentially a self-driving deployment pipeline. Traditional CI/CD runs when a human pushes code. This architecture pushes code itself, then runs the pipeline. The pipeline is the same — tests, lint, review, QA. The only difference is who initiates the commit. If your pipeline is trustworthy enough for human-authored commits, it's trustworthy enough for agent-authored commits. The gate quality doesn't change based on who wrote the code.
Moderator: A fitting end. The optimists will build, the skeptics will watch, and the data will decide. Thank you all.
Final Architecture Summary
Comparison: autoresearch vs. Fusion Architecture
| Dimension | autoresearch | Fusion Architecture |
|---|---|---|
| Loop instruction | program.md | CLAUDE.md |
| Task source | Agent decides autonomously | BACKLOG.md (human-written) |
| Mutable surface | train.py only | Entire project (scoped by /freeze) |
| Success criteria | val_bpb improvement | 5-gate pipeline all pass |
| Failure handling | git reset | git reset + skip + circuit breaker |
| State management | git + results.tsv | git + BACKLOG.md + SPRINT_LOG.md |
| Agent state | Stateless (re-reads files each time) | Stateless (fresh session each Sprint) |
| Parallel strategy | Serial only | Serial default + parallel for [critical] tasks |
| Retrospective | None | /retro every 5 Sprints |
| Applicable scope | Single-file ML training | Well-defined, testable software tasks |
The Three Key Files
| File | Responsibility | Written by |
|---|---|---|
| BACKLOG.md | Task list (what) | Human writes goals; /autoplan decomposes into tasks |
| CLAUDE.md | Loop instructions and rules (how) | Human writes once, then untouched |
| SPRINT_LOG.md | Execution record (results) | Agent auto-logs each Sprint |
The Five Quality Gates
Gate 1: All tests pass (npm test / pytest, exit code 0)
Gate 2: Lint + type-check pass
Gate 3: Test coverage >= previous Sprint's coverage
Gate 4: /review returns DONE or DONE_WITH_CONCERNS
Gate 5: /qa returns DONE or DONE_WITH_CONCERNS
Failure Handling Strategy
| Failure Type | Action | Example |
|---|---|---|
| Retryable | Max 2 retries | Flaky test, network timeout |
| Structural | Log reason, skip, next task | Unclear requirements, missing dependency |
| Consecutive (3+) | Circuit breaker stops loop | Agent stuck in dead end |
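The table above maps onto a small policy driver. This is a hedged sketch: `run_sprint` stands in for a real `claude -p` Sprint, and the failure-type strings are illustrative labels, not real tool output.

```python
RETRYABLE = {"flaky_test", "network_timeout"}
MAX_RETRIES = 2
CIRCUIT_BREAKER = 3  # consecutive failures before the loop stops

def run_backlog(tasks, run_sprint):
    """Drive the failure-handling policy: retry retryable failures up to
    MAX_RETRIES, skip structural ones, and stop the whole loop after
    CIRCUIT_BREAKER consecutive failures."""
    consecutive, log = 0, []
    for task in tasks:
        outcome = run_sprint(task)
        retries = 0
        while outcome in RETRYABLE and retries < MAX_RETRIES:
            retries += 1
            outcome = run_sprint(task)          # retryable: try again
        if outcome is None:                     # success resets the breaker
            consecutive = 0
            log.append((task, "done"))
        else:                                   # structural or retries exhausted
            consecutive += 1
            log.append((task, f"skipped: {outcome}"))
            if consecutive >= CIRCUIT_BREAKER:  # agent stuck in a dead end
                log.append(("loop", "circuit breaker tripped"))
                break
    return log

# Scripted demo: t2 needs one retry; three structural failures in a row
# trip the breaker before t6 is ever attempted.
script = {"t1": [None], "t2": ["flaky_test", None], "t3": ["unclear_requirements"],
          "t4": ["missing_dependency"], "t5": ["unclear_requirements"], "t6": [None]}
calls = {}
def fake_sprint(task):
    i = calls.get(task, 0)
    calls[task] = i + 1
    outcomes = script[task]
    return outcomes[min(i, len(outcomes) - 1)]

log = run_backlog(list(script), fake_sprint)
print(log)
```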
Key Design Decisions
- Stateless sessions: Every Sprint is a fresh `claude -p` session. No conversation-history dependency. All state lives in git and markdown files.
- Human decides what, agent decides how: BACKLOG.md is human-owned. The agent never autonomously adds tasks without a `[?]` marker and human approval.
- Atomic rollback: Every Sprint starts by recording the current git commit hash. Failure at any gate triggers `git reset --hard` to that hash.
- Retrospective learning: Every 5 Sprints, /retro runs automatically. Findings append to CLAUDE.md's "Learned Patterns" section.
- Scoped applicability: Only for well-defined, testable tasks. Exploratory development stays with humans.
- Reversibility filter: If a task's side effects can't be reversed by `git reset`, it doesn't belong in the automated backlog.
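The atomic-rollback decision can be exercised end to end in a throwaway repository. This sketch assumes `git` is on the PATH; the file name, the bot identity, and the always-failing `gates_pass` stub are illustrative.

```python
import os
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo` and return its stripped stdout."""
    out = subprocess.run(["git", "-C", repo, *args],
                        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def run_sprint_atomically(repo, do_work, gates_pass):
    """Record the baseline commit, let the agent mutate the tree, and
    hard-reset to the baseline unless every gate passes."""
    baseline = git(repo, "rev-parse", "HEAD")
    do_work()
    if not gates_pass():
        git(repo, "reset", "--hard", baseline)  # failed Sprint leaves no trace
        return False
    return True

# Demo in a throwaway repo: a Sprint whose gates fail rolls the file back.
with tempfile.TemporaryDirectory() as repo:
    git(repo, "init")
    path = os.path.join(repo, "train.py")
    with open(path, "w") as f:
        f.write("print('v1')\n")
    git(repo, "add", "train.py")
    git(repo, "-c", "user.email=bot@example.com", "-c", "user.name=bot",
        "commit", "-m", "v1")

    def agent_edit():  # stand-in for the agent's (bad) changes
        with open(path, "w") as f:
            f.write("print('broken')\n")

    ok = run_sprint_atomically(repo, agent_edit, gates_pass=lambda: False)
    survived = open(path).read().strip()

print(ok, survived)  # False print('v1')
```

The key property is the one Farley insisted on: the worst possible outcome of a failed Sprint is the state you started from.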
Expert Consensus Votes
| Expert | Position | Key Quote |
|---|---|---|
| Karpathy | Strong support | "Computers are good at loops, humans are good at goals" |
| Garry Tan | Strong support | "Outer serial loop + inner parallel for critical tasks" |
| Schulman | Support with RL framing | "Binary reward: completed and passed all gates" |
| Fowler | Conditional support | "Principled architecture — atomic, regression-checked, gated" |
| Beck | Conditional support | "Add /retro every 5 Sprints — iterate AND reflect" |
| Farley | Conditional support | "No rollback = no deal. git reset is non-negotiable" |
| Vogels | Support with ops caveats | "Stateless execution — all state in git, not agent memory" |
| Majors | Support with observability caveats | "Build measurable quality baselines first" |
| Marcus | Skeptical but pragmatic | "Only works for 80% of dev work — well-defined, testable tasks" |
| Bender | Skeptical | "Context window limits and semantic regression remain unsolved" |
| Levels | Enthusiastic minimalist | "80% automation of boring work is already the dream" |
Closing: The Moderator's Summary
Moderator: Seven rounds. Eleven experts. Let me distill what we've established.
The fusion of autoresearch and GStack rests on a simple insight: the same loop-and-evaluate pattern that automates ML training can automate software development — with three critical additions. First, multiple quality gates instead of a single metric, because software quality is irreducibly multi-dimensional. Second, explicit rollback via git reset, because failed Sprints must leave zero persistent damage. Third, a circuit breaker, because unlike autoresearch's 5-minute experiments, Sprint failures are expensive enough to warrant an automatic stop.
The architecture is not universal. It applies to well-defined, testable tasks — which, as Pieter noted, is 80% of real development work. The remaining 20% — exploratory design, ambiguous requirements, architectural decisions — remains the domain of human judgment.
Gary and Emily's skepticism was essential. Without their pushback, we'd have produced an architecture that overpromises. The constraints they identified — context window limits, semantic regressions, task dependencies, the gap between pattern matching and understanding — are real limitations, not theoretical ones. The architecture is designed around those limitations, not in denial of them.
What surprised me most: the degree of convergence. By Round 6, even the critics agreed on the core mechanics. The debate was about scope and boundaries, not feasibility. That's a strong signal.
The next step is obvious: build it, run it, measure it. Autoresearch earned its credibility through 100+ experiments, not through a whitepaper. This fusion architecture needs to earn its credibility the same way.
One final observation: Dave Farley's closing remark may be the most important insight of the entire session. The pipeline doesn't care who wrote the code. If your CI/CD pipeline is trustworthy for human commits, it's trustworthy for agent commits. The fusion architecture doesn't require trust in AI — it requires trust in your engineering infrastructure. And if you can't trust your infrastructure, you have bigger problems than whether to use AI agents.
Build the infrastructure. Define the gates. Write the backlog. Start the loop. Sleep well.