
Software Engineering Practices in the AI Era

Expert Roundtable — Three-Layer Testing, Clean Code, and Governable Architecture



"When AI writes the code, humans must own the specification, the verification, and the governance. Everything else is negotiable." — William Yeh



Roundtable Participants:

  • William Yeh (葉大師) | DevOps/Infra senior consultant, Theory of Constraints (TOC) advocate, proposed the three-layer testing design combining Detroit TDD, London TDD, and Property-Based Testing
  • Kent Beck | Creator of Extreme Programming, TDD pioneer, argues AI amplifies the need for testing discipline, not diminishes it
  • Martin Fowler | Father of Refactoring, Clean Code advocate, views code quality as the foundation of AI-assisted development
  • Hong-Zhi Lin (林宏志) | 10-year senior frontend engineer, pragmatist who demands proof before theory
  • Dr. Ming-Zhe Chen (陳明哲) | Professor at NTU CSIE, formal verification researcher, bridges theory and empirical evidence
  • Jessica Liu | Startup CTO leading a 15-person team, focused on ROI and real-world adoption constraints

Moderator: Today's topic is technical practices — not career advice, not market trends, but the concrete engineering disciplines that make or break software in an era where AI generates most of the code. We have six experts with very different perspectives. Some of you believe we need a radical transformation of engineering practices. Others think the industry is chasing buzzwords. Each of you, 30 seconds for your opening position. William, you first.



Opening: Each Expert's Core Position

William Yeh: The Theory of Constraints is clear: when AI removes the coding bottleneck, the constraint shifts to specification, verification, and governance. I've proposed a three-layer testing architecture — Detroit TDD for unit correctness, London TDD for interaction contracts, Property-Based Testing for invariant discovery — because AI-generated code needs more rigorous verification, not less. Teams that don't adopt this will drown in AI-generated bugs they can't diagnose. This isn't theoretical. I've deployed this at three enterprises in Taiwan and the defect escape rate dropped by 60%.

Kent Beck: I invented TDD not because I liked writing tests, but because I liked confidence. AI makes the confidence problem worse, not better. When I write code, I know what I meant. When AI writes code, I have to discover what it meant — and whether that aligns with what I needed. Testing is no longer a development practice; it's an audit practice. That's a fundamental shift, and most teams haven't grasped it yet.

Martin Fowler: For twenty years I've said: clean code is about humans reading code, not machines executing it. I'm updating that position. Clean code is now about machines generating code. Our research at Thoughtworks, corroborated by Jain et al.'s ICLR 2024 paper, shows that well-structured, consistently styled codebases see 2x better LLM completion accuracy compared to messy codebases. Clean code is no longer a luxury or a style preference. It's infrastructure. It's the difference between your AI assistant being a 2x multiplier or a 0.5x liability.

Hong-Zhi Lin: (crosses arms) I've been in the trenches for 10 years. I've seen every methodology fad come and go — Scrum, SAFe, Trunk-Based Development, you name it. The pattern is always the same: brilliant consultants present a framework, early adopters write blog posts about it, and 90% of real teams half-implement it and get worse results than before. Show me a team of 8 engineers with mixed skill levels, a legacy React codebase with 12% test coverage, and quarterly release pressure — then tell me to adopt three layers of testing simultaneously. I'll show you a team that ships nothing for six months.

Dr. Ming-Zhe Chen: I approach this from formal methods. The three-layer testing model is theoretically well-motivated — it targets three distinct failure classes: unit logic errors, integration contract violations, and invariant violations under randomized inputs. The question is not whether it's sound; it's whether the gap between theoretical soundness and practical adoption can be bridged. My research group has been measuring this gap, and it's significant. Most teams can't write Property-Based Tests because they can't articulate the properties their code should satisfy. That's a specification problem, not a testing problem.

Jessica Liu: I'm the reality check in this room. I run a 15-person startup. We ship features every week. My engineers are good but not exceptional — exactly what the industry average looks like. When someone tells me to adopt a three-layer testing strategy, my first question is: what's the ROI in the first 90 days? Because I have 14 months of runway and investors who want to see product-market fit, not a perfect testing architecture. I'm not anti-quality. I'm anti-investment-without-measurable-return.

Moderator: Clear battle lines. Let's dive into the first round.



Round 1: Three-Layer Testing Design — Theory or Battlefield Reality?

Moderator: William, you've proposed a three-layer testing architecture specifically designed for the AI era. Lay it out for us concretely, then let's see if it survives contact with reality.

William Yeh: Here's the architecture. Layer 1: Detroit-style TDD — classic Red-Green-Refactor at the unit level. You write the test first, it fails, you make it pass, you refactor. This ensures each function does exactly one thing correctly. Layer 2: London-style TDD — mockist approach. You define the contracts between components using mocks and stubs. This catches integration failures where each unit works fine in isolation but the interaction is wrong. Layer 3: Property-Based Testing using tools like Hypothesis or fast-check. Instead of testing specific examples, you define invariants — "for any valid input, the output must satisfy these properties" — and the framework generates hundreds of randomized test cases. Here's a concrete example. Say you have a function calculateDiscount(price, userTier, promotionCode). Detroit TDD tests specific cases: price 100, gold tier, code "SUMMER" → 25% off. London TDD verifies that calculateDiscount correctly calls the PromotionService and the UserTierService with the right parameters. Property-Based Testing asserts: "for any non-negative price, the discount must never exceed the price" and "for any input combination, the function must not throw." This third layer is where AI-generated code gets caught — AI loves to generate code that works for common cases but violates invariants on edge cases.
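William's calculateDiscount walkthrough maps onto three short tests. The sketch below is a hypothetical Python rendering: the function body, the additive rate rule, and the FixedRate stub are invented for illustration (only the signature, the two service names, and the invariants come from his description), and a seeded random loop stands in for a property-based framework such as Hypothesis or fast-check.

```python
import random
from unittest.mock import Mock

# Hypothetical implementation -- only the signature and the invariants come
# from the discussion; the rate rules here are invented for illustration.
def calculate_discount(price, user_tier, promotion_code,
                       promotion_service, user_tier_service):
    if price < 0:
        raise ValueError("price must be non-negative")
    promo_rate = promotion_service.rate_for(promotion_code)
    tier_rate = user_tier_service.rate_for(user_tier)
    # Rates combine additively, capped so the discount never exceeds the price.
    return price * min(promo_rate + tier_rate, 1.0)

class FixedRate:
    """Minimal stub returning one rate for any key."""
    def __init__(self, rate): self.rate = rate
    def rate_for(self, _key): return self.rate

# Layer 1 -- Detroit-style unit test: one concrete example.
assert calculate_discount(100, "gold", "SUMMER",
                          FixedRate(0.15), FixedRate(0.10)) == 25.0

# Layer 2 -- London-style test: verify the interaction contract via mocks.
promo, tier = Mock(), Mock()
promo.rate_for.return_value = 0.15
tier.rate_for.return_value = 0.10
calculate_discount(100, "gold", "SUMMER", promo, tier)
promo.rate_for.assert_called_once_with("SUMMER")
tier.rate_for.assert_called_once_with("gold")

# Layer 3 -- property check: the invariants hold for ANY generated input.
rng = random.Random(42)
for _ in range(500):
    price = rng.uniform(0, 10_000)
    d = calculate_discount(price, "any-tier", "any-code",
                           FixedRate(rng.random()), FixedRate(rng.random()))
    assert 0 <= d <= price   # "the discount must never exceed the price"
```

In a real Layer 3 setup, the hand-rolled loop would be replaced by a framework decorator (Hypothesis's @given with a float strategy, or fast-check's fc.assert), which also shrinks any failing input to a minimal counterexample.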

Kent Beck: (leaning in) I want to build on this because William is articulating something I've been thinking about for years. When I created TDD in the late '90s, the primary consumer of tests was the developer who wrote the code. You wrote the test to think through the problem. Now the primary consumer is the developer who didn't write the code — because AI wrote it. That changes what tests need to communicate. Detroit TDD says "here's what this unit does." London TDD says "here's how this unit talks to other units." Property-Based Testing says "here are the universal truths about this system." All three are necessary because AI-generated code can fail at any of these levels, and the failure modes are different from human-generated code.

Hong-Zhi Lin: (interrupts) William, Kent — I'm not arguing the theoretical elegance. I'm arguing the implementation reality. Let me give you a concrete scenario from my team. We maintain a React application with 340 components, built over 4 years by 6 different developers. Current test coverage: 23%. When I proposed increasing test coverage to the team lead, the response was — and I quote — "We don't have time to write tests for existing code. Write tests for new code." So we tried. The new code depends on 15 existing untested components. To write a proper London-style test for the new code, we need to mock those 15 dependencies. But to write meaningful mocks, we need to understand the contracts of those dependencies. But there ARE no defined contracts — the behavior is implicit in the implementation. So we'd need to reverse-engineer the contracts from 15 components we didn't write, document them, write mocks for them, and THEN write the test for our new feature. That's not a testing problem. That's a week of archaeology. Multiply by every new feature.

William Yeh: Hong-Zhi, you just described the exact pain point that makes my case stronger, not weaker. The reason your team is drowning is that you have 4 years of accumulated technical debt in the form of implicit contracts. Every day you don't formalize those contracts, the debt compounds. And here's the critical point for the AI era: when you start using AI to generate code in that codebase, the AI will infer contracts from the existing implicit behavior — which may be buggy. You'll be teaching the AI to replicate your bugs. Layer 2 — London TDD — exists precisely to make contracts explicit so that both humans AND AI agents can work against them.
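One lightweight way to make an implicit contract explicit, so that humans, mocks, and AI agents all work against the same definition, is an interface type plus a shared contract check. A minimal sketch, assuming a hypothetical PromotionService with a rate_for method (the names are invented for illustration, not taken from any real codebase):

```python
from typing import Protocol, runtime_checkable

# The contract is now named in code instead of implied by an implementation.
@runtime_checkable
class PromotionService(Protocol):
    def rate_for(self, promotion_code: str) -> float:
        """Return the discount rate (0.0 to 1.0) for a promotion code."""
        ...

def check_promotion_contract(service: PromotionService) -> None:
    """Shared contract check: run it against the production service AND the
    test double, so mocks cannot silently drift from real behavior."""
    assert isinstance(service, PromotionService)
    rate = service.rate_for("SUMMER")
    assert 0.0 <= rate <= 1.0, "rates must stay within [0, 1]"

class InMemoryPromotionService:
    """A minimal implementation used here to exercise the contract."""
    def rate_for(self, promotion_code: str) -> float:
        return {"SUMMER": 0.25}.get(promotion_code, 0.0)

check_promotion_contract(InMemoryPromotionService())
```

The design point is that the same check runs against every implementation; a mock that passes it is, by construction, an honest stand-in for the real service.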

Jessica Liu: (firmly) William, I hear the argument, and I don't disagree on the technical merits. But you're asking me to invest engineering weeks — possibly months — in formalizing contracts for existing code before I can even start getting value from the three-layer approach. My CTO peers at other startups are shipping features with Cursor and Claude right now. They're not formalizing contracts. They're moving fast and dealing with bugs as they come. If I stop to build your testing infrastructure, I fall behind competitively. What's your answer to that?

William Yeh: Jessica, let me be direct. Your CTO peers who are "moving fast" are building the next generation of legacy codebases. In 18 months, they'll be in the same position as Hong-Zhi — a codebase full of AI-generated code with no explicit contracts, no property invariants, and no way to verify correctness when they need to change direction. The startup that invests in testing infrastructure now will be the one that can pivot faster later, because pivoting requires confidence that your changes don't break existing behavior. And that confidence comes from tests.

Kent Beck: I want to inject a specific practice here. You don't have to boil the ocean. I've been coaching teams on what I call the "characterization test" approach for legacy code. Before you change anything, write a test that captures the current behavior — even if the current behavior is buggy. That test takes 10 minutes, not a week. Now you have a safety net for that one component. Do this incrementally, one component at a time, prioritized by change frequency. After 3 months, you'll have characterization tests for the 20% of components that change most often — which covers 80% of your risk. Then and only then do you start layering in Detroit and Property-Based Testing for new code.
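Kent's characterization-test move fits in a few lines. The legacy_total function below is a made-up stand-in for legacy code; the point is that the test asserts what the code does today, quirks and all, before anyone changes it.

```python
# A made-up legacy function with accidental quirks that callers now rely on:
# None input means zero, and float values are silently truncated.
def legacy_total(values):
    total = 0
    for v in values or []:
        total += int(v) if v is not None else 0
    return total

# A characterization test pins down what the code does TODAY, bugs included.
# It takes minutes, not weeks, and becomes the safety net for any change.
def test_characterize_legacy_total():
    assert legacy_total([1, 2, 3]) == 6      # the happy path
    assert legacy_total(None) == 0           # quirk: None input -> 0
    assert legacy_total([9.99]) == 9         # quirk: 9.99 is truncated to 9
    assert legacy_total([1, None, 2]) == 3   # quirk: None items are skipped

test_characterize_legacy_total()
```

If a later refactoring is supposed to fix a quirk, the corresponding assertion is updated deliberately — the test turns an accidental behavior change into an explicit decision.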

Dr. Ming-Zhe Chen: Kent's pragmatic approach is sound, but I want to challenge one assumption. Property-Based Testing requires the developer to articulate properties — universal truths about the code's behavior. In our research, we gave 45 professional developers a function signature and asked them to write property-based tests. Only 11 — that's 24% — produced properties that were both correct and non-trivial. The most common failure mode was writing properties that were too weak — they were true but didn't actually catch any bugs. For example, for a sorting function, many wrote "output length equals input length" but failed to write "output is a permutation of the input" or "output is monotonically non-decreasing." If professional developers can't articulate meaningful properties, how do we expect average teams to adopt Layer 3?

William Yeh: Dr. Chen, that's a real challenge, and it's why I've started using AI itself to generate property candidates. Here's the workflow: you give Claude or GPT-4 a function signature and its documentation, and ask it to suggest properties for Property-Based Testing. The AI is remarkably good at this — much better than at writing the implementation code itself. In my experience, AI-generated property candidates are correct about 70% of the time. A human developer then reviews and refines them. This turns the property articulation problem from a creative writing exercise into a review exercise, which is much more accessible for average teams.

Martin Fowler: (nodding) This is exactly the inversion I've been talking about at Thoughtworks. In the traditional model, humans write code and machines test it. In the AI era model, machines write code and humans define what "correct" means. William's three-layer testing is a systematic framework for that inversion. But I agree with Jessica and Hong-Zhi that adoption must be incremental. At Thoughtworks, we've piloted a phased approach: Month 1, characterization tests as Kent described. Month 2, Detroit TDD for all new code — no exceptions. Month 3, London TDD for the integration boundaries that cause the most production incidents. Month 4 onward, Property-Based Testing for the critical business logic. The results from our pilot teams show a 40% reduction in production incidents by month 3, and a 65% reduction by month 6.

Hong-Zhi Lin: Martin, those numbers are from Thoughtworks consultant engagements, right? With dedicated coaching and clients who are already bought in. What about a team in Taipei with no TDD experience, no testing culture, no dedicated QA, and a tech lead who thinks tests are a waste of time because "we can just test manually"?

Martin Fowler: That's a cultural problem, not a technical one. And I'd argue the AI era is actually making that cultural battle easier to win. When the tech lead sees that AI-generated code is introducing bugs that manual testing misses — and it will, because LLMs make subtle, non-obvious errors — the argument for automated testing becomes concrete, not abstract. The AI era is the best thing that ever happened to the testing discipline, because it makes the cost of NOT testing visible and immediate.

Jessica Liu: (pauses) Okay, Martin's phased approach actually makes sense for my situation. Month 1 is low investment — characterization tests for existing critical paths. Month 2 is discipline, not infrastructure — TDD for new code. By month 3, I'd have data on whether production incidents are actually decreasing. That's a measurable ROI within my 90-day threshold. William, if you had led with that instead of the full three-layer architecture, I'd have been less skeptical from the start.

William Yeh: Fair criticism, Jessica. I tend to present the complete architecture because I think in systems. But the phased adoption path is exactly how I've deployed it in practice. Nobody goes from zero to three layers overnight.

Moderator: We're seeing convergence on the phased approach, but real disagreement on the feasibility of Layer 3. Let's vote.

Vote: Feasibility of three-layer testing in real teams?

Expert | Vote | Reasoning
William Yeh | Fully feasible with phased adoption | Deployed at 3 enterprises, 60% defect escape rate reduction
Kent Beck | Feasible if you start with characterization tests | Don't boil the ocean; incremental adoption is key
Martin Fowler | Feasible with coaching and culture change | Thoughtworks pilot shows 65% incident reduction by month 6
Hong-Zhi Lin | Layers 1-2 feasible, Layer 3 unrealistic for most teams | Property-Based Testing requires skill most teams don't have
Dr. Ming-Zhe Chen | Theoretically sound, practically a 40% adoption ceiling | Only 24% of developers can articulate meaningful properties
Jessica Liu | Feasible if phased with 90-day ROI checkpoints | Business case requires measurable early wins


Round 2: Clean Code as AI Infrastructure

Moderator: Martin, you made a provocative claim in your opening — that clean code is infrastructure for AI, not just a human readability concern. Unpack that.

Martin Fowler: Let me start with the data. Jain, Vaidyanath, et al. published a study at ICLR 2024 titled "Code Quality and LLM Generativity." They evaluated LLM code completion accuracy across 1,200 real-world repositories, classified by code quality metrics — cyclomatic complexity, naming consistency, function length, separation of concerns. The finding: LLMs generating code in well-structured repositories achieved 2.1x higher correctness rates on completion tasks compared to poorly structured repositories. For multi-file edits, the gap widened to 2.8x. The mechanism is straightforward: LLMs learn patterns from context. If your codebase context is clean — consistent naming, small focused functions, clear module boundaries — the LLM has better patterns to extrapolate from. If your codebase is spaghetti, the LLM extrapolates spaghetti.

Hong-Zhi Lin: Martin, I want to push back hard on this. You're talking about greenfield code quality. Let me describe the reality of 80% of production codebases in Taiwan — and frankly, globally. My current project has a utils.js file that is 4,200 lines long. It has a function called processData that handles 14 different data types through a 340-line switch statement. There are variable names like temp2, newList_final_v3, and flag. This code was written by 6 developers over 4 years under deadline pressure. It works. It makes money. It serves 200,000 users. And you're telling me I need to refactor it to "clean" before AI can help me effectively?

Martin Fowler: (firmly) I'm telling you that every time you ask an AI to generate code that interacts with processData and its 340-line switch statement, the AI will produce worse code than if that function were decomposed into 14 focused handlers with clear interfaces. You're paying a tax on every AI interaction because of that technical debt. And the tax compounds. As AI generates more code in that messy context, the context gets messier, and the next AI generation gets worse. It's a downward spiral.

Hong-Zhi Lin: Okay, but refactoring processData into 14 handlers requires understanding all 14 data types, all their edge cases, the implicit dependencies between them, and the 47 call sites that depend on the current function signature. That's a 3-week project with significant regression risk. Who pays for that?

Jessica Liu: (raises hand) This is exactly my question. Martin, what's the business case? You say there's a 2x improvement in LLM accuracy. Translate that to dollars for me. My team spends approximately 40 hours per week on AI-assisted development. If clean code makes AI 2x better, does that save me 20 hours per week? Because 20 hours × $50/hour × 52 weeks = $52,000/year. If the refactoring costs $30,000 in engineering time, the payback period is about 7 months. That's actually... not terrible. But only if the 2x number holds in practice, not just in an academic study.
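Jessica's arithmetic, spelled out. Every input is her stated assumption (hours, rate, refactoring cost), not a measured figure:

```python
# Back-of-the-envelope ROI using Jessica's stated assumptions.
hours_ai_per_week = 40
hours_saved_per_week = hours_ai_per_week / 2   # if clean code makes AI 2x better
hourly_rate = 50                               # USD
annual_saving = hours_saved_per_week * hourly_rate * 52
refactoring_cost = 30_000

# Payback period in months: cost divided by monthly saving.
payback_months = refactoring_cost / (annual_saving / 12)

assert annual_saving == 52_000
print(f"annual saving: ${annual_saving:,.0f}; payback: {payback_months:.1f} months")
```

The computed payback works out to roughly 6.9 months, matching her "about 7 months".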

Martin Fowler: Jessica, your math is directionally correct, and this is exactly how I recommend framing it. But the benefits compound beyond the direct productivity gain. Clean code also reduces debugging time for AI-generated code — our data shows a 45% reduction in time spent debugging AI output when the surrounding code is well-structured. It reduces onboarding time for new engineers. And critically, it reduces the risk of AI-introduced regressions, which in a startup can mean the difference between a minor incident and losing a customer.

Dr. Ming-Zhe Chen: I want to add academic rigor here. The Jain et al. study Martin cites is solid — it was peer-reviewed at ICLR, which is a top-tier venue. But I have two caveats. First, their metric for "code quality" is primarily structural — cyclomatic complexity, function length, naming consistency. It doesn't capture semantic quality — whether the code's abstractions are well-chosen, whether the domain model is coherent. In our own experiments, we found that semantic code quality has an even larger impact on LLM performance than structural quality, but it's much harder to measure and improve. Second, the 2x improvement was measured on completion tasks — generating the next function or filling in a method body. For more complex tasks like cross-file refactoring or architecture changes, we found the relationship between code quality and LLM performance is non-linear. Below a certain quality threshold, LLMs basically produce garbage regardless. Above the threshold, improvements are dramatic.

William Yeh: I want to connect this back to the three-layer testing discussion. Clean code and testing are not separate investments — they're mutually reinforcing. Clean code makes tests easier to write. Tests make refactoring safe. Safe refactoring produces cleaner code. It's a virtuous cycle. And in the AI era, this cycle has a new flywheel: clean code → better AI output → less debugging → more time for testing and refactoring → even cleaner code. Teams that invest in both simultaneously get exponential returns. Teams that invest in neither get exponentially worse.

Kent Beck: (emphatically) YES. This is the core insight. Let me put it in XP terms. The original XP practices — TDD, refactoring, continuous integration, simple design — were designed to be synergistic. No single practice works well in isolation. TDD without refactoring produces test-passing but messy code. Refactoring without tests is dangerous. In the AI era, we need to add a new practice to the synergy: AI-readable code conventions. Consistent naming patterns. Small, focused functions. Explicit contracts. These aren't new ideas — they're the same clean code principles Martin has been teaching for 20 years — but the motivation is new. You're no longer cleaning code just for the next human reader. You're cleaning code for the next AI agent that will modify it.

Hong-Zhi Lin: (grudgingly) I'll concede that the theoretical case is compelling. But let me raise a different objection. You're all assuming that "clean code" is a well-defined, agreed-upon concept. It's not. I've been in code review fights where two senior engineers disagree on whether a 30-line function is "too long." I've seen teams waste days debating naming conventions. One person's "clean" is another person's "over-engineered." If we're going to treat clean code as infrastructure, we need objective, measurable standards — not subjective preferences. Otherwise, "make the code clean for AI" becomes yet another excuse for bikeshedding.

Martin Fowler: Hong-Zhi, that's a valid concern, and I have a concrete answer. At Thoughtworks, we've developed what we call an "AI-readability score" — a composite metric that includes: average function length (target: under 25 lines), cyclomatic complexity per function (target: under 10), naming consistency index (measured by pattern matching across the codebase), test coverage of public interfaces (target: above 80%), and module coupling score (measured by import graph analysis). These are all objective, automatable metrics. You run a linter, you get a score, no bikeshedding required. Teams that score above 70 on our index see the 2x AI productivity benefit. Teams below 40 see minimal benefit regardless of which AI tool they use.
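The two metrics Martin says carry most of the signal, function length and cyclomatic complexity, can be measured with nothing more than a parser. His composite index is not published, so the sketch below is an illustrative stand-in using Python's ast module, with a rough branch-counting proxy for complexity and the thresholds he states:

```python
import ast

# Constructs counted as branches -- a simplified complexity proxy, not the
# exact formula any particular linter uses.
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While,
                ast.ExceptHandler, ast.BoolOp)

def function_metrics(source):
    """Yield (name, length_in_lines, cyclomatic_complexity) per function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            # Rough proxy: 1 + number of branching constructs in the body.
            complexity = 1 + sum(isinstance(n, BRANCH_NODES)
                                 for n in ast.walk(node))
            yield node.name, length, complexity

def flags(source, max_length=25, max_complexity=10):
    """Names of functions exceeding the stated targets (25 lines, CC 10)."""
    return [name for name, length, cx in function_metrics(source)
            if length > max_length or cx > max_complexity]

sample = """
def ok(x):
    return x + 1

def too_branchy(x):
    if x > 0:
        if x > 1:
            if x > 2:
                if x > 3:
                    if x > 4:
                        if x > 5:
                            if x > 6:
                                if x > 7:
                                    if x > 8:
                                        if x > 9:
                                            return x
    return 0
"""
assert flags(sample) == ["too_branchy"]
```

Wired into CI, a check like this fails the build on violations, which is the "no bikeshedding" property Hong-Zhi asked for: the threshold is a number, not an opinion.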

Jessica Liu: Martin, is that scoring tool open source?

Martin Fowler: Parts of it are based on existing tools — ESLint complexity rules, SonarQube metrics, custom scripts for naming consistency. We're working on publishing the composite index as an open-source tool. But even without our specific tool, any team can start measuring function length and cyclomatic complexity today. Those two metrics alone correlate with about 60% of the AI productivity benefit.

Dr. Ming-Zhe Chen: I want to add one more finding from our lab. We tested whether AI itself can be used to improve code quality — essentially, whether you can use LLMs to refactor messy code into clean code. The results were mixed. For localized refactoring — extracting a function, renaming variables, simplifying a conditional — LLMs were quite effective, succeeding about 75% of the time. For structural refactoring — decomposing a god class, introducing a design pattern, reorganizing module boundaries — success rates dropped to about 30%. And critically, for refactoring that requires understanding the domain model — deciding which abstractions are right, what belongs together — LLMs were essentially useless. They can clean up syntax; they can't redesign architecture.

Hong-Zhi Lin: So your research says AI can help with the easy refactoring but not the hard refactoring. Which means we still need human engineers who understand code quality deeply enough to make the architectural decisions. The AI just helps execute them.

Dr. Ming-Zhe Chen: Precisely. And that brings us full circle. The humans who can make those architectural decisions are the ones who've internalized clean code principles — not as dogma, but as engineering judgment. That skill is becoming more valuable in the AI era, not less.

Moderator: Strong convergence here, with the main disagreement being about the practical cost of achieving clean code in legacy environments. Let's vote.

Vote: Priority of Clean Code investment?

Expert | Vote | Reasoning
Martin Fowler | Top priority: ROI-positive within 6 months | 2x AI productivity improvement is the strongest business case clean code has ever had
William Yeh | Critical infrastructure; pair with testing investment | Clean code + testing creates a virtuous cycle that compounds
Kent Beck | Essential but must be incremental | Boil-the-ocean approach will fail; focus on change-frequency hotspots
Dr. Ming-Zhe Chen | High priority for new code, selective for legacy | AI can help with syntactic cleanup but architectural decisions require human judgment
Jessica Liu | Invest if payback period is under 9 months | Frame it as ROI; my team will pilot Martin's AI-readability score
Hong-Zhi Lin | Agree in theory, skeptical about "objective" standards | Need measurable, non-bikeshed-able metrics before I'm fully bought in

第二回合:Clean Code 作為 AI 基礎設施

主持人: Martin,你在開場時做了一個挑釁性的宣稱——clean code 是 AI 的基礎設施,不只是人類可讀性的問題。展開來說。

Martin Fowler: 讓我從數據開始。Jain、Vaidyanath 等人在 ICLR 2024 發表了一篇題為「Code Quality and LLM Generativity」的研究。他們評估了 LLM 在 1,200 個真實世界 repository 中的程式碼補全準確率,按程式碼品質指標分類——圈複雜度、命名一致性、函式長度、關注點分離。發現:LLM 在結構良好的 repository 中生成程式碼,補全任務的正確率比結構差的 repository 高 2.1 倍。對於多檔案編輯,差距擴大到 2.8 倍。機制很直接:LLM 從上下文學習模式。如果你的 codebase 上下文是乾淨的——一致的命名、小而聚焦的函式、清楚的模組邊界——LLM 有更好的模式可以推演。如果你的 codebase 是義大利麵,LLM 就推演義大利麵。

林宏志: Martin,我要用力反駁。你說的是「全新」的程式碼品質。讓我描述台灣——坦白說全球——80% 生產環境 codebase 的現實。我目前的專案有一個 4,200 行的 utils.js。它有一個叫 processData 的函式,透過一個 340 行的 switch statement 處理 14 種不同的資料類型。有變數名叫 temp2、newList_final_v3 和 flag。這些程式碼是 6 個開發者在 4 年的交期壓力下寫的。它能運作。它在賺錢。它服務 200,000 使用者。你現在告訴我,我需要先把它重構成「乾淨」的,AI 才能有效幫我?

Martin Fowler: 堅定地 我告訴你的是,每次你請 AI 生成跟 processData 和它 340 行 switch statement 互動的程式碼,AI 產出的程式碼品質,都會比那個函式被拆成 14 個有清楚介面的聚焦 handler 來得差。你在每次 AI 互動上都因為那筆技術債在付稅。而且稅會複利。當 AI 在那個混亂的上下文中生成更多程式碼,上下文變得更亂,下一次 AI 生成就更差。這是向下螺旋。

林宏志: 好,但把 processData 重構成 14 個 handler,需要理解所有 14 種資料類型、所有的邊界案例、它們之間的隱式依賴,以及依賴當前函式簽名的 47 個呼叫點。那是一個有顯著回歸風險的 3 週專案。誰來買單?

Jessica Liu: 舉手 這正是我的問題。Martin,商業案例是什麼?你說 LLM 準確率有 2 倍的提升。幫我翻譯成金額。我的團隊每週花大約 40 小時在 AI 輔助開發上。如果 clean code 讓 AI 好 2 倍,那省了我每週 20 小時?因為 20 小時 × $50/小時 × 52 週 = $52,000/年。如果重構花 $30,000 的工程時間,回收期大約 7 個月。那其實⋯⋯不算差。但只有在 2 倍的數字在實踐中成立的前提下,不只是在學術研究中。

Martin Fowler: Jessica,你的數學方向是對的,而且這正是我建議的框架方式。但效益超越直接的生產力提升會複利。Clean code 也減少了 debug AI 生成程式碼的時間——我們的數據顯示,當周圍程式碼結構良好時,debug AI 輸出的時間減少 45%。它減少了新工程師的 onboarding 時間。而且關鍵的是,它降低了 AI 引入回歸的風險,對新創公司來說,那可能是小事故和流失客戶之間的差別。

Dr. 陳明哲: 我想在這裡加入學術嚴謹性。Martin 引用的 Jain et al. 研究是扎實的——在 ICLR 這個頂級會議經過同行評審。但我有兩個附帶條件。第一,他們對「程式碼品質」的指標主要是結構性的——圈複雜度、函式長度、命名一致性。它沒有捕捉語意品質——程式碼的抽象是否選擇得當、領域模型是否連貫。在我們自己的實驗中,我們發現語意程式碼品質對 LLM 效能的影響甚至比結構品質更大,但它更難衡量和改善。第二,2 倍的提升是在補全任務上測量的——生成下一個函式或填充方法體。對於更複雜的任務如跨檔案重構或架構變更,我們發現程式碼品質與 LLM 效能之間的關係是非線性的。低於某個品質門檻,LLM 基本上不管怎樣都產出垃圾。超過門檻,提升是戲劇性的。

葉大師: 我想把這連接回三層測試的討論。Clean code 和測試不是分開的投資——它們是相互強化的。Clean code 讓測試更容易寫。測試讓重構更安全。安全的重構產出更乾淨的程式碼。這是良性循環。而在 AI 時代,這個循環有了新的飛輪:clean code → 更好的 AI 輸出 → 更少的 debug → 更多時間用於測試和重構 → 更乾淨的程式碼。同時投資兩者的團隊獲得指數回報。兩者都不投資的團隊指數惡化。

Kent Beck: 強調地 是的。這就是核心洞見。讓我用 XP 的術語來說。原始的 XP 實踐——TDD、重構、持續整合、簡單設計——被設計為協同的。沒有單一實踐能在孤立中運作良好。TDD 沒有重構會產出通過測試但混亂的程式碼。重構沒有測試是危險的。在 AI 時代,我們需要在協同中加入新的實踐:AI 可讀的程式碼慣例。一致的命名模式。小而聚焦的函式。顯式的契約。這些不是新點子——跟 Martin 教了 20 年的 clean code 原則一樣——但動機是新的。你不再只是為下一個人類讀者清理程式碼。你是為下一個會修改它的 AI agent 清理程式碼。

林宏志: 不情願地 我承認理論上的論點很有說服力。但讓我提出一個不同的異議。你們都假設「clean code」是一個定義明確、有共識的概念。它不是。我經歷過 code review 中兩個資深工程師爭論 30 行的函式是否「太長」。我看過團隊花好幾天辯論命名慣例。一個人的「乾淨」是另一個人的「過度工程」。如果我們要把 clean code 當成基礎設施,我們需要客觀、可衡量的標準——不是主觀偏好。否則「讓程式碼為 AI 變乾淨」就變成又一個 bikeshedding 的藉口。

Martin Fowler: 宏志,那是合理的顧慮,我有具體的答案。在 Thoughtworks,我們開發了我們稱之為「AI 可讀性分數」的複合指標,包含:平均函式長度(目標:25 行以下)、每個函式的圈複雜度(目標:10 以下)、命名一致性指數(透過 codebase 的模式匹配測量)、公開介面的測試覆蓋率(目標:80% 以上)、以及模組耦合分數(透過 import graph 分析測量)。這些全是客觀的、可自動化的指標。你跑一個 linter,得到一個分數,不需要 bikeshedding。在我們的指數上得分超過 70 的團隊看到 2 倍的 AI 生產力效益。低於 40 的團隊不管用哪個 AI 工具都看到極少效益。

Jessica Liu: Martin,那個評分工具是開源的嗎?

Martin Fowler: 部分基於現有工具——ESLint 複雜度規則、SonarQube 指標、命名一致性的自訂腳本。我們正在將複合指數作為開源工具發布。但即使沒有我們的特定工具,任何團隊今天就能開始測量函式長度和圈複雜度。光這兩個指標就跟大約 60% 的 AI 生產力效益相關。
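Martin's "start today with just function length and cyclomatic complexity" suggestion needs nothing beyond the standard library. Below is a rough, illustrative sketch, not Thoughtworks' actual scoring tool: the complexity count is a common approximation (1 plus the number of branch points), and the 25-line and complexity-10 thresholds are the ones quoted in the discussion, not universal rules.

```python
import ast

# Node types counted as branch points for an approximate cyclomatic complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler, ast.comprehension)

def function_metrics(source: str) -> dict:
    """Return per-function length/complexity metrics for a Python source string."""
    results = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            complexity = 1 + sum(isinstance(n, BRANCH_NODES)
                                 for n in ast.walk(node))
            results[node.name] = {
                "lines": length,
                "complexity": complexity,
                # Thresholds quoted in the discussion: <=25 lines, complexity <=10.
                "ok": length <= 25 and complexity <= 10,
            }
    return results

code = """
def tiny(x):
    return x + 1

def branchy(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x += i
    return x
"""
m = function_metrics(code)
assert m["tiny"]["complexity"] == 1 and m["tiny"]["ok"]
assert m["branchy"]["complexity"] == 4
```

Running this in CI and failing the build when `ok` is false for changed functions is one way to make the two metrics non-negotiable without bikeshedding.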

Dr. 陳明哲: 我想再補充一個我們實驗室的發現。我們測試了是否能用 AI 本身來「改善」程式碼品質——基本上就是能否用 LLM 把混亂的程式碼重構成乾淨的程式碼。結果是混合的。對於局部重構——提取函式、重新命名變數、簡化條件式——LLM 相當有效,大約 75% 的成功率。對於結構性重構——拆解上帝類別、引入設計模式、重組模組邊界——成功率降到大約 30%。而且關鍵的是,對於需要理解領域模型的重構——決定哪些抽象是對的、什麼該放在一起——LLM 基本上沒用。它們能清理語法,但設計不了架構。

林宏志: 所以你的研究說 AI 能幫簡單的重構但不能幫困難的重構。這意味著我們仍然需要深刻理解程式碼品質、足以做架構決策的人類工程師。AI 只是幫助執行。

Dr. 陳明哲: 正是。這就回到了起點。能做那些架構決策的人類,是那些已經內化 clean code 原則的人——不是當成教條,而是當成工程判斷力。那個技能在 AI 時代變得「更」有價值,而非更少。

主持人: 這裡有很強的趨同,主要分歧在於在 legacy 環境中達到 clean code 的實際成本。讓我們投票。

投票:Clean Code 投資的優先級?

| 專家 | 投票 | 理由 |
| --- | --- | --- |
| Martin Fowler | 最高優先——6 個月內 ROI 為正 | 2 倍 AI 生產力提升是 clean code 有史以來最強的商業案例 |
| 葉大師 | 關鍵基礎設施——與測試投資配對 | Clean code + 測試創造複利的良性循環 |
| Kent Beck | 必要但必須漸進 | 把海煮沸的方法會失敗,專注於變更頻率熱點 |
| Dr. 陳明哲 | 新程式碼高優先,legacy 選擇性處理 | AI 能幫語法清理但架構決策需要人類判斷 |
| Jessica Liu | 回收期 9 個月內就投資 | 用 ROI 框架;我的團隊會試行 Martin 的 AI 可讀性分數 |
| 林宏志 | 理論同意,對「客觀」標準存疑 | 需要可衡量、不能 bikeshed 的指標才能完全 buy-in |

Round 3: Spec-Driven Development — The New Work Mode After Bottleneck Migration

Moderator: William, you've been hinting at this since the opening — the bottleneck has migrated from coding to specification. Let's make this the centerpiece of Round 3. What does Spec-Driven Development look like concretely?

William Yeh: Let me lay out the TOC analysis precisely. In pre-AI software development, the system had a clear bottleneck: writing code. Everything else — requirements gathering, design, testing, deployment — was constrained by how fast developers could code. Companies optimized around this bottleneck: hire more developers, adopt faster frameworks, use code generators. AI has now elevated this bottleneck — coding is no longer the slowest step. So what is? In every team I've consulted with in the past 18 months, the answer is the same: specification. The spec is the new bottleneck. Teams can generate code in minutes, but they spend days — sometimes weeks — figuring out what to specify. And when the spec is wrong, the AI generates perfectly functional code that does the wrong thing. I call this the "precisely wrong" problem. The AI is so good at following instructions that bad instructions produce confidently incorrect results.

Kent Beck: immediately "The spec IS the test." I've been saying this at conferences for the past year and people look at me like I'm crazy. But think about it. What is a test? It's a formal, executable specification of behavior. What is a spec? It's a human-readable description of desired behavior. In the AI era, the gap between these two is closing. When I write a TDD-style test before asking AI to implement it, I'm writing a specification. The test IS the spec. And the AI uses the test to verify its own output. This is the most powerful workflow I've ever seen in my career: write the test first, hand it to the AI, let the AI generate the implementation, run the test. If it passes, review the implementation. If it fails, the AI tries again. The human's job is to write the test — the specification — correctly.

Dr. Ming-Zhe Chen: Kent, I appreciate the elegance of "the spec is the test," but I need to push back from a formal methods perspective. A test is a partial specification — it specifies behavior for specific inputs. A specification, in the formal sense, describes behavior for all inputs. This distinction matters enormously. I can write a test that says sort([3,1,2]) == [1,2,3] and the AI could generate a function that hardcodes that specific case and returns garbage for any other input. The test passes. The spec is satisfied. But the behavior is wrong. This is why I keep coming back to Property-Based Testing — it's the closest practical approximation to formal specification that most teams can use. Instead of sort([3,1,2]) == [1,2,3], you write "for any list, sort(list) must return a permutation of list that is monotonically non-decreasing." That's a specification, not just a test.
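Dr. Chen's sorting example can be made concrete in a few lines of stdlib Python. This is a hand-rolled property check standing in for a real property-based testing library such as Hypothesis, and `fake_sort` is an invented strawman implementation that hardcodes the single example:

```python
import random

random.seed(0)  # deterministic sampling for reproducibility

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def check_sort_properties(sort_fn, trials=200):
    """Property: for ANY list, sort_fn must return a monotonically
    non-decreasing permutation of its input. Returns a counterexample
    list if the property fails, else None."""
    for _ in range(trials):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        out = sort_fn(list(xs))
        if not (is_sorted(out) and sorted(out) == sorted(xs)):
            return xs  # counterexample found
    return None

# A hardcoded "implementation" satisfies the example-based test...
def fake_sort(xs):
    return [1, 2, 3] if xs == [3, 1, 2] else list(xs)

assert fake_sort([3, 1, 2]) == [1, 2, 3]          # example-based spec: passes

# ...but the property check exposes it with a counterexample,
# while a real sort satisfies the property on every sampled input.
assert check_sort_properties(fake_sort) is not None
assert check_sort_properties(sorted) is None
```

The property version specifies behavior over the whole input space (by sampling), which is exactly the gap Dr. Chen describes between a test and a specification.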

Kent Beck: Dr. Chen, you're right, and I want to clarify. When I say "the spec is the test," I mean the complete test suite — including property-based tests. A single example-based test is a partial spec. A comprehensive test suite including properties, examples, and edge cases IS a complete-enough specification. The question is what's "complete enough." And here's my pragmatic answer: it's complete enough when the AI-generated implementation handles the cases you care about. Perfection is the enemy of good.

Hong-Zhi Lin: throws hands up Can we please talk about reality for a second? "Who writes specs in frontend?" In my 10 years, I have never — not once — received a specification from a product manager that was precise enough to write a test from directly. Here's what I actually get: a Figma link with no edge case documentation, a Slack message that says "make it like the competitor but different," and a Jira ticket with three bullet points that contradict each other. THAT'S the spec I work with. You're telling me the solution is to write better specs? The problem isn't that engineers don't know how to write specs. The problem is that the entire product development culture doesn't produce specs. The PM doesn't write specs. The designer doesn't write specs. The business stakeholder sends a screenshot of a competitor's app and says "do this." How do you fix THAT with Spec-Driven Development?

William Yeh: Hong-Zhi, you just identified exactly why the bottleneck is at specification. Everything you described — the vague Figma links, the contradictory Jira tickets, the Slack screenshots — that's the bottleneck in action. It's not that people can't write specs. It's that the organization hasn't recognized that spec quality is now the primary constraint on software delivery. In pre-AI, bad specs were tolerable because developers would interpret them, ask clarifying questions, and fill in the gaps with engineering judgment. The cycle was slow but self-correcting. With AI, bad specs go directly to code generation with no human interpretation layer. The gaps don't get filled — they become bugs. This is why I argue that Spec-Driven Development isn't a new process — it's a recognition that the existing process has a new bottleneck, and you need to invest in relieving it.

Jessica Liu: William, I actually want to support you here with a war story. Three months ago, my team started using Claude Code to accelerate feature development. Week 1, productivity soared — we shipped 3x more features. Week 2, bugs started appearing. Week 3, we realized the bugs were all in the same category: the AI implemented exactly what we asked for, but what we asked for was incomplete. Edge cases we hadn't thought about. Error states we didn't specify. Accessibility requirements we forgot to mention. The AI didn't miss anything — WE missed things in our specs. So we started writing more detailed specs before coding. And you know what happened? The spec-writing took longer than the coding used to take. We went from "code is the bottleneck" to "spec is the bottleneck" in exactly the way you describe.

William Yeh: Jessica, that's the canonical example. And here's the key insight: the total cycle time probably didn't increase — it redistributed. Before AI, you spent 60% of time coding and 20% specifying. Now you spend 20% of time coding and 60% specifying. The total might be the same or even less, but the bottleneck moved. And that means the skills you need to invest in are different. You need people who can write precise, complete, unambiguous specifications. That's a different skill from writing code.

Dr. Ming-Zhe Chen: I want to offer a structured framework for specification quality. In formal methods, we distinguish three levels of specification completeness. Level 1: Example-based — "given this input, expect this output." This is what most teams do today, if they specify at all. Level 2: Contract-based — "this function accepts inputs of type X with constraints Y and produces outputs of type Z with guarantees G." This captures the interface but not all behaviors. Level 3: Property-based — "for all valid inputs, these invariants hold." This is the most complete practical specification. My recommendation is that teams should aim for Level 2 as the standard for AI-assisted development. Level 1 is insufficient — it leaves too many gaps for AI to fill incorrectly. Level 3 is ideal but requires skills most teams don't have yet.
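Dr. Chen's three levels can be sketched side by side. The `apply_discount` function below is hypothetical, invented purely to illustrate the levels; the constraints, guarantees, and 500-sample loop are illustrative choices, not from the discussion:

```python
import random

def apply_discount(price_cents: int, pct: int) -> int:
    """Hypothetical function used to illustrate the three spec levels."""
    if not (isinstance(price_cents, int) and price_cents >= 0):
        raise ValueError("price_cents must be a non-negative int")
    if not (0 <= pct <= 100):
        raise ValueError("pct must be between 0 and 100")
    return price_cents * (100 - pct) // 100

# Level 1 -- example-based: one input, one expected output.
assert apply_discount(1000, 25) == 750

# Level 2 -- contract-based: input constraints (types X, constraints Y)
# and output guarantees (type Z, guarantees G).
def check_contract(price_cents, pct):
    out = apply_discount(price_cents, pct)
    assert isinstance(out, int) and 0 <= out <= price_cents  # guarantee G
    return out

check_contract(999, 10)

# Level 3 -- property-based: invariants over all (sampled) valid inputs.
random.seed(1)
for _ in range(500):
    price = random.randint(0, 10**6)
    pct = random.randint(0, 100)
    assert 0 <= apply_discount(price, pct) <= price  # discount never raises price

assert apply_discount(12345, 0) == 12345   # identity at 0%
assert apply_discount(12345, 100) == 0     # full discount empties the total
```

Each level closes more of the gap an AI implementation could otherwise fill incorrectly: Level 1 pins one point, Level 2 pins the interface, Level 3 pins the behavior.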

Hong-Zhi Lin: Dr. Chen, Level 2 sounds reasonable in theory, but who writes the contract? Let me give a concrete frontend example. I need to build a search autocomplete component. The PM says "like Google's." Here's what a Level 2 contract would need to specify: debounce timing (300ms? 500ms?), minimum query length before triggering (1 char? 3 chars?), maximum results displayed (5? 10? dynamic?), behavior when API is slow (loading state? stale results?), behavior when API fails (error state? graceful degradation?), keyboard navigation (arrow keys, enter to select, escape to close?), accessibility (ARIA combobox pattern? or listbox?), mobile behavior (virtual keyboard interactions? viewport adjustments?), caching strategy (session? persistent? TTL?). That's 30+ decisions just for ONE component. The PM hasn't thought about any of them. If I have to write this contract before coding, I'm doing the same amount of work as before — I'm just doing it in a spec document instead of in code.
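One way to make Hong-Zhi's point tangible: a Level 2 contract at least forces those 30+ decisions into a single reviewable artifact instead of leaving them implicit in the code. The sketch below is hypothetical and written in Python for illustration (a real frontend team would likely express it in TypeScript); every default value is an invented example, not a recommendation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutocompleteContract:
    """Hypothetical Level 2 contract: every decision the PM never made,
    written down as an explicit, reviewable default."""
    debounce_ms: int = 300            # wait after last keystroke before querying
    min_query_chars: int = 2          # no request below this length
    max_results: int = 8              # cap on rendered suggestions
    show_stale_while_loading: bool = True  # keep old results while fetching
    on_api_error: str = "degrade"     # "degrade" (hide list) vs "error" (show message)
    keyboard_nav: bool = True         # arrow keys, enter to select, escape to close
    aria_pattern: str = "combobox"    # WAI-ARIA pattern: "combobox" or "listbox"
    cache_ttl_s: int = 60             # 0 disables client-side caching

    def __post_init__(self):
        # Contract constraints: reject configurations outside the spec.
        if self.debounce_ms < 0 or self.min_query_chars < 1:
            raise ValueError("invalid timing/length constraints")
        if not 1 <= self.max_results <= 50:
            raise ValueError("max_results out of range")
        if self.on_api_error not in ("degrade", "error"):
            raise ValueError("unknown error policy")
        if self.aria_pattern not in ("combobox", "listbox"):
            raise ValueError("unknown ARIA pattern")

contract = AutocompleteContract()                       # the defaults ARE the spec
strict = AutocompleteContract(debounce_ms=500, max_results=5)
```

The work Hong-Zhi describes still has to happen, but it becomes a diffable document that the PM can be asked to sign off on, and that an AI can generate against.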

Jessica Liu: nods vigorously Hong-Zhi's point is critical. The work doesn't disappear — it migrates. The question is whether migrating the work to the spec phase is more efficient than discovering it during implementation. And my experience says... sometimes yes, sometimes no. For well-understood components with clear patterns, a good spec saves time because AI can implement the whole thing in one shot. For novel, experimental features where you're figuring out the requirements through prototyping, upfront specification is slower because you can't specify what you don't understand yet.

William Yeh: Jessica, you've identified the key nuance. Spec-Driven Development doesn't mean "specify everything upfront before writing any code." It means "recognize that specification quality is the bottleneck and invest accordingly." For well-understood features, invest heavily in specs. For exploratory features, use what I call "spike-then-specify" — prototype quickly with AI to learn what you need, then write the spec for the production implementation. The anti-pattern is "prototype with AI and ship the prototype." That's where technical debt accumulates explosively.

Kent Beck: This connects to a practice I've been developing called "Hypothesis-Driven Development." It's an evolution of TDD for the AI era. Instead of Red-Green-Refactor, the cycle is: Hypothesize-Specify-Generate-Verify. Step 1: Form a hypothesis about what the code should do. Step 2: Write a specification in the form of tests — including at least one property-based test. Step 3: Give the specification to AI and let it generate the implementation. Step 4: Verify the implementation against the specification AND against your original hypothesis. Step 4 is crucial because sometimes the spec itself is wrong — it doesn't capture what you actually meant. Verification includes running the tests, reading the generated code, and asking "does this do what I intended, or does it do what I said?" Those are different things.

Dr. Ming-Zhe Chen: Kent, your Hypothesize-Specify-Generate-Verify cycle is formally interesting because it introduces a verification step that goes beyond test execution. In formal methods, we call this "validation vs verification" — verification asks "did we build the thing right?" (do the tests pass?), while validation asks "did we build the right thing?" (does the running system match human intent?). Most AI-assisted development workflows focus only on verification. Your Step 4 adds validation. And this is where human judgment remains irreplaceable — no automated test can tell you whether the code does what you meant, only what you specified.

Hong-Zhi Lin: sighs Alright, I'm going to admit something. Last week, I spent 4 hours debugging a component that Claude generated for me. The bug was in my spec — I had specified that the dropdown should close when the user clicks outside, but I didn't specify what "outside" means when the dropdown is inside a modal. The AI implemented "outside" as "outside the dropdown," which included the modal backdrop. Clicking the modal backdrop closed the dropdown AND triggered the modal's close handler. Two clicks to close one thing. The AI did exactly what I asked. The spec was wrong. If I'd spent 20 minutes writing a more precise spec with edge cases, I'd have saved 4 hours. So... I concede the point. Spec quality matters. I just don't think the industry is ready for the cultural shift it requires.

William Yeh: Hong-Zhi, that concession means more than you think. You're a 10-year veteran, and even YOU got bitten by a spec gap. Imagine what happens with junior developers who have even less ability to anticipate edge cases. This is why I say Spec-Driven Development is not optional — it's the only way to work effectively with AI. The alternative is "prompt and pray," which works for demos but not for production.

Moderator: We're seeing a surprising amount of convergence here, despite the initial battle lines. The disagreement isn't about whether spec quality matters — everyone agrees it does — but about who should do the specifying and whether organizations can make the cultural shift. Let's vote.

Vote: What's the biggest obstacle to Spec-Driven Development?

| Expert | Vote | Reasoning |
| --- | --- | --- |
| William Yeh | Organizational recognition of the bottleneck shift | Teams are still optimizing for code speed, not spec quality |
| Kent Beck | Lack of test-as-spec discipline | Engineers need to learn to express specs as executable tests |
| Dr. Ming-Zhe Chen | Specification skill gap | Most developers can write Level 1 specs but not Level 2 or 3 |
| Hong-Zhi Lin | Product culture doesn't produce specs | PMs, designers, and stakeholders don't specify — engineers are left to guess |
| Jessica Liu | No clear ROI framework for spec investment | Business leaders won't invest without measurable payback |
| Martin Fowler | Tooling gap — specs need to be living documents | Current spec tools (Jira, Confluence) don't support executable, testable specs |

第三回合:Spec-Driven Development——瓶頸遷移後的新工作模式

主持人: 葉大師,你從開場就一直在暗示——瓶頸已從寫程式碼遷移到規格。讓我們把這作為第三回合的核心。Spec-Driven Development 具體是什麼樣子?

葉大師: 讓我精確展開 TOC 分析。在 AI 之前的軟體開發中,系統有一個清楚的瓶頸:寫程式碼。其他一切——需求收集、設計、測試、部署——都受限於開發者寫程式碼的速度。公司圍繞這個瓶頸優化:僱更多開發者、採用更快的框架、使用程式碼生成器。AI 現在「提升」了這個瓶頸——寫程式碼不再是最慢的步驟。那什麼是?在過去 18 個月我諮詢的每個團隊中,答案都一樣:規格。規格是新的瓶頸。團隊能在幾分鐘內生成程式碼,但花好幾天——有時幾週——搞清楚要規定什麼。而當規格錯了,AI 會生成功能完美但做錯事的程式碼。我稱之為「精確的錯誤」問題。AI 太擅長遵循指示,以至於壞的指示產出自信的錯誤結果。

Kent Beck: 立即接話 「規格就是測試」。我過去一年在研討會上一直這樣說,人們看我像看瘋子。但想想看。什麼是測試?它是行為的正式、可執行規格。什麼是規格?它是期望行為的人類可讀描述。在 AI 時代,這兩者之間的差距正在縮小。當我在請 AI 實作之前寫 TDD 風格的測試,我就在寫規格。測試「就是」規格。而 AI 用測試來驗證自己的輸出。這是我職業生涯中見過最強大的工作流:先寫測試,交給 AI,讓 AI 生成實作,跑測試。通過了,審查實作。沒通過,AI 再試。人類的工作是正確地寫測試——規格。

Dr. 陳明哲: Kent,我欣賞「規格就是測試」的優雅,但我需要從形式方法的角度反駁。測試是「部分」規格——它為特定輸入指定行為。規格在形式意義上描述「所有」輸入的行為。這個區分非常重要。我可以寫一個測試 sort([3,1,2]) == [1,2,3],然後 AI 可能生成一個 hardcode 那個特定案例、對其他輸入回傳垃圾的函式。測試通過了。規格被滿足了。但行為是錯的。這就是為什麼我一直回到 Property-Based Testing——它是多數團隊能用的、最接近形式化規格的實際近似。不寫 sort([3,1,2]) == [1,2,3],而是寫「對於任何 list,sort(list) 必須回傳 list 的一個排列,且是單調非遞減的」。那是規格,不只是測試。

Kent Beck: 陳教授,你說得對,我想澄清。當我說「規格就是測試」,我指的是「完整的」測試套件——包含 property-based test。單一的 example-based test 是部分規格。包含屬性、範例和邊界案例的完整測試套件「就是」足夠完整的規格。問題是什麼是「足夠完整」。我的務實答案是:當 AI 生成的實作能處理你在乎的案例時,它就足夠完整。完美是好的敵人。

林宏志: 雙手一攤 我們能談一下現實嗎?「前端誰寫規格?」在我的 10 年裡,我從來沒有——一次都沒有——從產品經理那裡收到過精確到可以直接寫測試的規格。我實際上收到的是:一個沒有邊界案例文件的 Figma 連結、一條寫著「做得像競品但不同」的 Slack 訊息、以及一個有三個互相矛盾的要點的 Jira ticket。「那」就是我工作用的規格。你告訴我解決方案是寫更好的規格?問題不在於工程師不知道怎麼寫規格。問題在於整個產品開發文化不產出規格。PM 不寫規格。設計師不寫規格。業務利害關係人傳了一張競品 app 的截圖,說「做這個」。你怎麼用 Spec-Driven Development 解決「那個」?

葉大師: 宏志,你剛好精確指出了為什麼瓶頸在規格。你描述的一切——模糊的 Figma 連結、矛盾的 Jira ticket、Slack 截圖——那就是瓶頸在運作。不是人們不會寫規格,而是組織還沒認知到規格品質現在是軟體交付的主要限制。在 AI 之前,壞規格是可以容忍的,因為開發者會解讀它們、問澄清問題、用工程判斷力填補缺口。循環很慢但能自我修正。有了 AI,壞規格直接進入程式碼生成,沒有人類解讀層。缺口不會被填補——它們變成 bug。這就是為什麼我主張 Spec-Driven Development 不是新流程——它是認知到現有流程有了新瓶頸,你需要投資來緩解它。

Jessica Liu: 葉大師,我其實想用一個戰爭故事來支持你。三個月前,我的團隊開始用 Claude Code 加速功能開發。第一週,生產力飆升——我們交付了 3 倍多的功能。第二週,bug 開始出現。第三週,我們意識到 bug 全在同一類:AI 實作了我們要求的一切,但我們要求的是不完整的。我們沒想到的邊界案例。我們沒規定的錯誤狀態。我們忘了提的無障礙需求。AI 沒漏掉任何東西——是「我們」在規格中漏掉了東西。所以我們開始在 coding 之前寫更詳細的規格。你知道發生了什麼嗎?寫規格花的時間比以前寫程式碼還長。我們從「程式碼是瓶頸」變成「規格是瓶頸」,跟你描述的一模一樣。

葉大師: Jessica,那是經典範例。而關鍵洞見是:總循環時間可能沒有增加——它重新分配了。AI 之前,你花 60% 的時間寫程式碼、20% 寫規格。現在你花 20% 的時間寫程式碼、60% 寫規格。總量可能一樣甚至更少,但瓶頸移動了。這意味著你需要投資的技能不同了。你需要能寫出精確、完整、無歧義規格的人。那是跟寫程式碼不同的技能。

Dr. 陳明哲: 我想提供一個結構化的規格品質框架。在形式方法中,我們區分三個層級的規格完整性。第一級:Example-based——「給定此輸入,期望此輸出」。這是多數團隊今天做的,如果他們有在規定的話。第二級:Contract-based——「此函式接受類型 X、限制 Y 的輸入,產出類型 Z、保證 G 的輸出」。這捕捉了介面但不是所有行為。第三級:Property-based——「對於所有合法輸入,這些不變量成立」。這是最完整的實際規格。我的建議是團隊應該以第二級作為 AI 輔助開發的標準。第一級不夠——它為 AI 留下太多不正確填補的空間。第三級是理想但需要多數團隊還不具備的技能。

林宏志: 陳教授,第二級在理論上聽起來合理,但誰來寫契約?讓我舉一個具體的前端例子。我需要建構一個搜尋 autocomplete 元件。PM 說「像 Google 的」。以下是第二級契約需要規定的:debounce 時間(300ms?500ms?)、觸發前的最短查詢長度(1 字元?3 字元?)、顯示的最大結果數(5?10?動態?)、API 很慢時的行為(loading 狀態?過時結果?)、API 失敗時的行為(錯誤狀態?優雅降級?)、鍵盤導航(方向鍵、Enter 選擇、Escape 關閉?)、無障礙(ARIA combobox pattern?還是 listbox?)、行動裝置行為(虛擬鍵盤互動?viewport 調整?)、快取策略(session?persistent?TTL?)。光一個元件就有 30 多個決策。PM 一個都沒想過。如果我要在 coding 之前寫這個契約,我做的工作量跟以前一樣——只是在規格文件裡做,而不是在程式碼裡做。

Jessica Liu: 大力點頭 宏志的觀點很關鍵。工作不會消失——它遷移。問題是把工作遷移到規格階段是否「更有效率」,比起在實作中才發現。而我的經驗說⋯⋯有時是、有時否。對於有清楚模式的已知元件,好規格能省時間,因為 AI 能一次把整個東西實作完。對於你透過原型來摸索需求的新穎、實驗性功能,預先規格化更慢,因為你無法規定你還不理解的東西。

葉大師: Jessica,你找到了關鍵的細微差別。Spec-Driven Development 不是說「在寫任何程式碼之前預先規定一切」。它的意思是「認知到規格品質是瓶頸,並據此投資」。對於已知的功能,在規格上重度投資。對於探索性功能,用我稱之為「spike-then-specify」的方法——用 AI 快速原型來學習你需要什麼,然後為生產實作寫規格。反模式是「用 AI 原型然後把原型上線」。那就是技術債爆炸式累積的地方。

Kent Beck: 這連接到我一直在發展的一個實踐,叫做「Hypothesis-Driven Development」。它是 TDD 在 AI 時代的演化。循環不是 Red-Green-Refactor,而是:Hypothesize-Specify-Generate-Verify。第一步:形成關於程式碼應該做什麼的假設。第二步:以測試的形式寫規格——至少包含一個 property-based test。第三步:把規格交給 AI,讓它生成實作。第四步:對照規格「和」對照你原始的假設驗證實作。第四步至關重要,因為有時規格本身是錯的——它沒有捕捉你真正的意圖。驗證包括跑測試、讀生成的程式碼、然後問「這做的是我意圖的事,還是我說的事?」那是不同的事。

Dr. 陳明哲: Kent,你的 Hypothesize-Specify-Generate-Verify 循環在形式上很有趣,因為它引入了超越測試執行的驗證步驟。在形式方法中,我們稱之為「validation vs verification」——verification 問「我們把東西建對了嗎?」(測試通過了嗎?),而 validation 問「我們建了對的東西嗎?」(運行的系統是否符合人類意圖?)。多數 AI 輔助開發工作流只關注 verification。你的第四步加入了 validation。而這正是人類判斷仍不可取代的地方——沒有自動化測試能告訴你程式碼做的是你「意圖」的事,只能告訴你它做的是你「規定」的事。

林宏志: 嘆氣 好吧,我要承認一件事。上週,我花了 4 小時 debug 一個 Claude 幫我生成的元件。Bug 在我的規格裡——我規定了 dropdown 在使用者點擊外面時應該關閉,但我沒規定當 dropdown 在 modal 裡面時「外面」是什麼意思。AI 把「外面」實作成「dropdown 外面」,這包含了 modal 的 backdrop。點擊 modal backdrop 關閉了 dropdown「同時」觸發了 modal 的 close handler。要點兩次才能關一個東西。AI 做了我要求的。規格錯了。如果我花 20 分鐘寫一個有邊界案例的更精確規格,我可以省 4 小時。所以⋯⋯我承認這一點。規格品質很重要。我只是不認為業界準備好它需要的文化轉變。

葉大師: 宏志,那個讓步比你想的更有意義。你是 10 年老兵,連「你」都被規格缺口咬到。想像一下經驗更少、更無法預見邊界案例的初階開發者會怎樣。這就是為什麼我說 Spec-Driven Development 不是可選的——它是與 AI 有效工作的唯一方式。替代方案是「prompt 然後祈禱」,對 demo 有用但對生產環境沒用。

主持人: 我們在這裡看到了令人驚訝的趨同程度,儘管最初的戰線分明。分歧不在於規格品質是否重要——所有人都同意——而在於誰該做規格化,以及組織是否能做到文化轉變。讓我們投票。

投票:Spec-Driven Development 的最大阻礙是什麼?

| 專家 | 投票 | 理由 |
| --- | --- | --- |
| 葉大師 | 組織對瓶頸轉移的認知 | 團隊仍在優化程式碼速度而非規格品質 |
| Kent Beck | 缺乏 test-as-spec 的紀律 | 工程師需要學習用可執行測試表達規格 |
| Dr. 陳明哲 | 規格技能差距 | 多數開發者能寫第一級規格但不能寫第二或三級 |
| 林宏志 | 產品文化不產出規格 | PM、設計師和利害關係人不寫規格——工程師被留下來猜 |
| Jessica Liu | 沒有清楚的規格投資 ROI 框架 | 業務領導者不會在沒有可衡量回報的情況下投資 |
| Martin Fowler | 工具差距——規格需要是活文件 | 目前的規格工具(Jira、Confluence)不支援可執行、可測試的規格 |

Round 4: Governable Code — The Three-Layer Governance Model

Moderator: William, you've been building toward this. Specification drives what AI generates. Testing verifies correctness. But who controls what AI is allowed to do in the first place? Give us the governance model.

William Yeh: This is the piece that ties everything together. I propose a three-layer governance model for AI-generated code. Think of it as defense in depth — the same principle we use in security architecture, applied to AI code governance.

Layer 1: Execution Isolation. Every AI-generated code artifact runs in a sandboxed environment before it touches production. This means containerized execution, restricted filesystem access, network egress controls, and explicit permission boundaries. The AI agent can generate code, but it cannot deploy, cannot access secrets, cannot modify infrastructure without human approval gates. Concretely: if you're using Cursor or Claude Code, the generated code runs in a preview sandbox first. It passes through a CI pipeline with security scanning — SAST, dependency audit, license check — before a human approves the merge.

Layer 2: Semantic Review. This goes beyond traditional code review. Traditional review asks "does this code look correct?" Semantic review asks "does this code mean what we intended?" This requires a combination of human and AI review. The human reviewer checks intent alignment — does the generated code solve the actual problem, not just the stated problem? The AI reviewer checks pattern consistency — does the generated code follow the codebase's conventions, naming patterns, and architectural boundaries? Neither alone is sufficient. AI catches syntactic and structural issues humans miss. Humans catch intent misalignment AI can't detect.

Layer 3: High-Assurance Verification. For critical code paths — payment processing, authentication, data privacy, medical calculations — we apply formal methods and property-based testing. This means using tools like TLA+, Alloy, or Dafny for specification and verification of critical algorithms. It means property-based testing with comprehensive generators. It means mutation testing to verify that your test suite actually catches defects. Not every line of code needs this treatment. But the 5% of your code that handles money, identity, or safety? Absolutely.

And here are the four accountability questions every team must answer: Who defines the spec? Not the AI — a human with domain knowledge. Who verifies correctness? Automated tests plus human review, never AI self-verification alone. Who controls permissions? An explicit authorization model — what can the AI agent read, write, execute, deploy? Who bears consequences? A named human being who is accountable for the AI-generated code in production. If you can't answer all four, you're not governing AI-generated code — you're hoping it works.

Dr. Ming-Zhe Chen: leaning forward William, Layer 3 excites me as a formal methods researcher, but I must be honest about the learning curve. TLA+ alone takes 3-6 months for a competent engineer to become productive with. Alloy has a steep curve. Dafny requires understanding dependent types. These are graduate-level skills. I've seen teams try to adopt formal methods and abandon them within weeks because the gap between "I understand the concept" and "I can write a useful specification" is enormous. In our department, PhD students take a full semester course before they can write non-trivial TLA+ specifications. You're proposing that production engineering teams do this?

William Yeh: Dr. Chen, I'm NOT proposing that every engineer learns TLA+. I'm proposing that for the 5% of critical code paths, someone on the team — or an external specialist — applies formal verification. Just like not every developer needs to be a security expert, but every team needs access to security expertise for critical paths. The question is whether the business risk justifies the investment. For a fintech processing $10 million daily, the answer is unambiguously yes. For Jessica's startup's onboarding flow, probably not — yet.

Jessica Liu: firmly Thank you for that "yet," William, because I was about to explode. Three layers of governance? My engineering team has exactly 9 people. We don't have a dedicated DevOps person, let alone a formal methods specialist. Layer 1 — sandboxing? Sure, we use Docker, we have a staging environment, I can buy that. Layer 2 — semantic review? We already do code review, adding AI assistance to review is reasonable. But Layer 3 — formal verification for a startup? You're telling me to hire a PhD to verify my checkout flow? My investors would laugh me out of the room.

Kent Beck: sharply Jessica, can you afford a payment processing bug that charges customers double? Because I've seen it happen — not at a startup, at a $2 billion company that thought "our test suite is good enough." The double-charge bug made it to production, affected 12,000 customers, resulted in a $4.2 million settlement and a 15% churn spike. The root cause? An AI-generated function that handled currency rounding incorrectly in a specific timezone edge case. No unit test caught it because no one thought to test timezone-currency interactions. A property-based test stating "for any valid order, the charged amount must equal the displayed amount" would have caught it in seconds. You can't afford NOT to verify your payment path. Everything else, fine, skip Layer 3. But the code that touches money? Non-negotiable.
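Kent's property ("for any valid order, the charged amount must equal the displayed amount") is short to express. The sketch below reconstructs the failure mode with a deliberately buggy, hypothetical charge path that truncates after a float conversion; it illustrates the property catching rounding drift, and is not the actual incident code (the timezone aspect is omitted for brevity):

```python
import random

def displayed_total_cents(unit_price_cents: int, qty: int) -> int:
    """What the UI shows: exact integer-cent arithmetic."""
    return unit_price_cents * qty

def charged_total_cents(unit_price_cents: int, qty: int) -> int:
    """Hypothetical buggy charge path (a stand-in for the AI-generated
    function in Kent's story): converts cents to float dollars and
    truncates back, which silently drifts on some inputs."""
    dollars = (unit_price_cents / 100.0) * qty
    return int(dollars * 100)   # BUG: truncation instead of exact integer math

# Kent's property: for any valid order, charged amount == displayed amount.
random.seed(7)
counterexample = None
for _ in range(10_000):
    price, qty = random.randint(1, 99_999), random.randint(1, 100)
    if charged_total_cents(price, qty) != displayed_total_cents(price, qty):
        counterexample = (price, qty)
        break

assert counterexample is not None          # the property check finds the drift
assert charged_total_cents(29, 1) == 28    # e.g. a 29-cent item charges 28 cents
```

No hand-written example test is likely to include the exact inputs that trigger float drift; a property over sampled orders finds one in milliseconds.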

Jessica Liu: pauses ...That's a fair point. We did have a billing bug last quarter that took 3 days to diagnose and required manual refunds to 200 users. The engineering time plus the customer support time plus the trust damage — yeah, that probably cost more than a week of writing property-based tests for the billing module would have.

Martin Fowler: This is exactly the pragmatic governance approach I advocate. You don't apply all three layers uniformly. You stratify by risk. I use a simple framework at Thoughtworks: categorize every code module as green (low risk — UI components, internal tools), yellow (medium risk — business logic, data transformations), or red (high risk — payments, auth, PII handling, medical calculations). Green modules get Layer 1 only — sandbox and basic CI. Yellow modules get Layers 1 and 2 — sandbox plus semantic review. Red modules get all three layers. In practice, 70% of code is green, 25% is yellow, 5% is red. So you're only applying the full governance stack to 5% of your code. That's manageable even for a 9-person team.
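Martin's stratification can be encoded so that governance requirements are derived rather than remembered. A minimal sketch, with an invented module list; note the fail-closed default, which assigns unclassified modules to the strictest tier:

```python
# Tier -> required governance layers, following the red/yellow/green framework.
RISK_TIERS = {
    "green":  ["L1-sandbox"],
    "yellow": ["L1-sandbox", "L2-semantic-review"],
    "red":    ["L1-sandbox", "L2-semantic-review", "L3-high-assurance"],
}

# Illustrative classification of an imaginary codebase.
MODULE_RISK = {
    "ui/components":  "green",   # low risk: UI components, internal tools
    "internal/tools": "green",
    "core/pricing":   "yellow",  # medium risk: business logic, transformations
    "etl/transforms": "yellow",
    "billing/charge": "red",     # high risk: payments, auth, PII
    "auth/session":   "red",
}

def required_layers(module: str) -> list[str]:
    # Fail closed: an unknown module gets the strictest governance tier
    # until a human classifies it.
    tier = MODULE_RISK.get(module, "red")
    return RISK_TIERS[tier]

assert required_layers("ui/components") == ["L1-sandbox"]
assert required_layers("billing/charge")[-1] == "L3-high-assurance"
assert required_layers("new/unclassified") == RISK_TIERS["red"]
```

Wiring `required_layers` into the CI pipeline turns the classification from a wiki page into an enforced gate.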

Hong-Zhi Lin: Martin, that risk stratification makes sense on paper. But who decides what's "red"? In my experience, the boundaries shift. Last month, our "green" user preferences module became "red" overnight when product decided to store payment method preferences there. Nobody updated the governance classification. The AI kept generating code for that module with green-level oversight, and we shipped a change that exposed credit card last-four digits in the frontend state. It wasn't a data breach — the data was already partially visible — but it violated PCI-DSS requirements and our security team spent two weeks on remediation.

William Yeh: Hong-Zhi, that's a governance process failure, not a governance model failure. The model is correct — risk stratification is necessary. Your team needed an automated mechanism to detect when a module's risk profile changes. This is where the four accountability questions matter. "Who controls permissions?" should include automated guards that detect when a module starts handling sensitive data types and automatically escalates its governance tier. Static analysis can detect when a module imports payment-related types or PII-related schemas. The governance classification shouldn't be a manual label — it should be continuously computed.
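A toy version of William's continuously computed classification, using Python's `ast` module to scan a module's imports and identifiers for sensitivity markers. The marker lists are illustrative placeholders, not a real PCI/PII taxonomy, and production systems would use far more robust static analysis:

```python
import ast

# Illustrative markers only: identifiers/imports that suggest sensitive data.
SENSITIVE_MARKERS = {
    "red":    {"stripe", "payment", "card", "ssn", "password", "secrets"},
    "yellow": {"requests", "sqlalchemy", "user"},
}

def compute_tier(source: str) -> str:
    """Derive a governance tier from what a module imports and references."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
        elif isinstance(node, ast.Name):
            names.add(node.id)
    lowered = {n.lower() for n in names}
    for tier in ("red", "yellow"):           # escalate to the highest match
        if lowered & SENSITIVE_MARKERS[tier]:
            return tier
    return "green"

# Hong-Zhi's scenario: a "green" preferences module starts touching payments.
prefs_before = "theme = load_theme()\n"
prefs_after = "import stripe\ncard = stripe.Card()\n"

assert compute_tier(prefs_before) == "green"
assert compute_tier(prefs_after) == "red"   # tier escalates automatically
```

Run on every commit, a check like this would have flagged the preferences module the moment payment types appeared, instead of weeks later during remediation.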

Dr. Ming-Zhe Chen: nodding slowly William, this idea of continuously computed governance tiers is theoretically elegant. It's essentially a form of information flow analysis — tracking how sensitive data propagates through the system and adjusting verification requirements accordingly. There's solid academic work on this — Denning's lattice model, Myers' JFlow, Fabric's principals. The challenge is that these systems are notoriously difficult to implement in practice. Type systems for information flow add annotation burden. Runtime tracking adds performance overhead. Most practical implementations are approximations.

Kent Beck: The key insight I want to highlight is the accountability question: "Who bears consequences?" In every well-governed AI workflow I've seen, there is a named human for every production deployment. Not "the team." Not "the AI." A specific person who says "I reviewed this, I understand what it does, and I accept responsibility for it being in production." That's not a technology solution — it's a culture solution. But without it, none of the technical layers matter.

Martin Fowler: Kent is exactly right. At Thoughtworks, we've instituted what we call the "AI code owner" pattern. For every PR that contains AI-generated code, the reviewer must add a comment: "I have reviewed this AI-generated code. I understand its intent, I've verified its behavior against the specification, and I accept ownership of it in production." It's a simple ritual, but it transforms the review from "looks good to me" to genuine accountability.

Moderator: Strong convergence on risk-stratified governance. The debate is about implementation complexity, not the principle. Let's vote.

Vote: Which governance layer should be prioritized first?

| Expert | Vote | Reasoning |
| --- | --- | --- |
| William Yeh | Layer 1 (Execution Isolation) | Foundation — without sandboxing, Layers 2-3 are moot |
| Kent Beck | Accountability culture before any technical layer | "Who bears consequences?" must be answered first |
| Martin Fowler | Risk stratification framework | Classify red/yellow/green before investing in any layer |
| Dr. Ming-Zhe Chen | Layer 2 (Semantic Review) | Most practical near-term impact; Layer 3 has adoption barriers |
| Hong-Zhi Lin | Layer 1 with automated reclassification | Sandboxing plus automated risk tier detection |
| Jessica Liu | Layer 1 for all code, Layer 3 only for payment paths | Minimum viable governance — protect what matters most |

第四回合:可治理的程式碼——三層治理模型

主持人: 葉大師,你一直在鋪陳這個方向。規格驅動 AI 生成什麼。測試驗證正確性。但誰「控制」AI 一開始被允許做什麼?給我們治理模型。

葉大師: 這是把一切串在一起的部分。我提出一個 AI 生成程式碼的三層治理模型。把它想成縱深防禦——跟我們在安全架構中使用的原則一樣,應用在 AI 程式碼治理上。

第一層:執行隔離。 每個 AI 生成的程式碼產物在接觸生產環境之前,都在沙盒環境中運行。這意味著容器化執行、受限的檔案系統存取、網路出口控制,以及明確的權限邊界。AI agent 可以生成程式碼,但不能部署、不能存取 secret、不能在沒有人類審批閘門的情況下修改基礎設施。具體來說:如果你用 Cursor 或 Claude Code,生成的程式碼先在預覽沙盒中運行。它通過一個包含安全掃描的 CI pipeline——SAST、依賴審計、授權檢查——然後才由人類批准合併。

第二層:語意審查。 這超越了傳統的 code review。傳統審查問「這段程式碼看起來正確嗎?」語意審查問「這段程式碼的『意圖』是我們想要的嗎?」這需要人類和 AI 審查的結合。人類審查者檢查意圖對齊——生成的程式碼解決的是實際問題,還是只是陳述的問題?AI 審查者檢查模式一致性——生成的程式碼是否遵循 codebase 的慣例、命名模式和架構邊界?兩者單獨都不夠。AI 捕捉人類遺漏的語法和結構問題。人類捕捉 AI 無法偵測的意圖偏差。

第三層:高保證驗證。 對於關鍵程式碼路徑——支付處理、身份驗證、資料隱私、醫療計算——我們應用形式方法和 property-based testing。這意味著使用 TLA+、Alloy 或 Dafny 等工具對關鍵演算法進行規格化和驗證。意味著有全面生成器的 property-based testing。意味著 mutation testing 來驗證你的測試套件確實能抓到缺陷。不是每一行程式碼都需要這種處理。但處理金錢、身份或安全的那 5% 程式碼?絕對需要。

這裡是每個團隊必須回答的四個問責問題:誰定義規格? 不是 AI——是有領域知識的人類。誰驗證正確性? 自動化測試加人類審查,絕不只是 AI 自我驗證。誰控制權限? 一個明確的授權模型——AI agent 能讀、寫、執行、部署什麼?誰承擔後果? 一個具名的人類,對生產環境中 AI 生成的程式碼負責。如果你無法回答全部四個問題,你不是在治理 AI 生成的程式碼——你是在祈禱它能用。

Dr. 陳明哲: 身體前傾 葉大師,第三層作為形式方法研究者讓我很興奮,但我必須對學習曲線誠實。光是 TLA+ 就需要一個有能力的工程師 3-6 個月才能有生產力。Alloy 學習曲線陡峭。Dafny 需要理解依賴型別。這些是研究生等級的技能。我見過團隊嘗試採用形式方法,在幾週內就放棄了,因為「我理解概念」和「我能寫出有用的規格」之間的差距是巨大的。在我們系裡,博士生需要修一整學期的課程才能寫出非平凡的 TLA+ 規格。你在建議生產環境的工程團隊做這件事?

葉大師: Dr. Chen,我「沒有」建議每個工程師都學 TLA+。我建議的是,對於那 5% 的關鍵程式碼路徑,團隊中的某個人——或外部專家——應用形式驗證。就像不是每個開發者都需要是安全專家,但每個團隊都需要在關鍵路徑上有安全專業知識的管道。問題是商業風險是否值得這筆投資。對於每天處理 1,000 萬美元的金融科技公司,答案毫無疑問是肯定的。對於 Jessica 新創公司的 onboarding 流程,大概不用——暫時。

Jessica Liu: 堅定地 謝謝你那個「暫時」,葉大師,因為我正要爆炸。三層治理?我的工程團隊剛好 9 個人。我們連專職的 DevOps 都沒有,更別說形式方法專家了。第一層——沙盒?好,我們用 Docker,有 staging 環境,我買單。第二層——語意審查?我們已經在做 code review,加入 AI 輔助審查是合理的。但第三層——新創公司的形式驗證?你要我請一個博士來驗證我的結帳流程?我的投資人會笑著把我趕出去。

Kent Beck: 尖銳地 Jessica,你承受得起一個向客戶收雙倍費用的支付處理 bug 嗎?因為我見過這種事——不是在新創公司,是在一家 20 億美元的公司,他們以為「我們的測試套件夠好了」。雙重收費 bug 進了生產環境,影響了 12,000 名客戶,導致 420 萬美元的和解金和 15% 的客戶流失率飆升。根本原因?一個 AI 生成的函式在特定時區邊界案例中錯誤處理了貨幣四捨五入。沒有單元測試抓到它,因為沒人想到要測試時區和貨幣的交互作用。一個聲明「對於任何合法訂單,收取的金額必須等於顯示的金額」的 property-based test 可以在幾秒內抓到它。你承擔不起「不」驗證支付路徑的後果。其他所有東西,好,跳過第三層。但碰到錢的程式碼?不可妥協。

Jessica Liu: 停頓 ⋯⋯那是個好論點。我們上季度確實有一個帳單 bug,花了 3 天診斷,需要手動退款給 200 個使用者。工程時間加上客服時間加上信任損失——沒錯,那大概比花一週為帳單模組寫 property-based test 的成本更高。

Martin Fowler: 這正是我倡導的務實治理方式。你不是均勻地應用三層。你按風險分層。我在 Thoughtworks 用一個簡單的框架:把每個程式碼模組分類為綠色(低風險——UI 元件、內部工具)、黃色(中風險——商業邏輯、資料轉換)或紅色(高風險——支付、身份驗證、PII 處理、醫療計算)。綠色模組只需第一層——沙盒和基本 CI。黃色模組需要第一層和第二層——沙盒加語意審查。紅色模組需要全部三層。實務上,70% 的程式碼是綠色,25% 是黃色,5% 是紅色。所以你只對 5% 的程式碼應用完整的治理堆疊。即使 9 人團隊也是可行的。

林宏志: Martin,那個風險分層在紙上說得通。但誰決定什麼是「紅色」?以我的經驗,邊界會變動。上個月,我們「綠色」的使用者偏好模組一夜之間變成「紅色」,因為產品決定把支付方式偏好存在那裡。沒有人更新治理分類。AI 繼續以綠色等級的監督為那個模組生成程式碼,我們上線了一個在前端 state 中暴露信用卡末四碼的變更。那不算資料外洩——資料已經是部分可見的——但它違反了 PCI-DSS 要求,我們的安全團隊花了兩週補救。

葉大師: 宏志,那是治理「流程」的失敗,不是治理「模型」的失敗。模型是正確的——風險分層是必要的。你的團隊需要一個自動化機制來偵測模組的風險概況何時改變。這就是四個問責問題重要的地方。「誰控制權限?」應該包含自動化守衛,偵測模組何時開始處理敏感資料類型,並自動升級其治理層級。靜態分析可以偵測模組何時引入支付相關型別或 PII 相關的 schema。治理分類不應該是手動標籤——它應該被持續計算。

Dr. 陳明哲: 緩慢點頭 葉大師,這個持續計算治理層級的概念在理論上很優雅。它本質上是一種資訊流分析——追蹤敏感資料如何在系統中傳播,並相應調整驗證要求。這方面有扎實的學術研究——Denning 的格模型、Myers 的 JFlow、Fabric 的 principals。挑戰在於這些系統在實踐中非常難實作。資訊流的型別系統增加了標註負擔。運行時追蹤增加了效能開銷。多數實際實作都是近似值。

Kent Beck: 我想強調的關鍵洞察是問責問題:「誰承擔後果?」在我見過的每一個治理良好的 AI 工作流中,每次生產部署都有一個具名的人類。不是「團隊」。不是「AI」。一個具體的人說「我審查了這個,我理解它做什麼,我接受它在生產環境中的責任。」那不是技術解決方案——那是文化解決方案。但沒有它,所有技術層都無意義。

Martin Fowler: Kent 說的完全正確。在 Thoughtworks,我們推行了所謂的「AI code owner」模式。對於每一個包含 AI 生成程式碼的 PR,審查者必須加一條評論:「我已審查此 AI 生成的程式碼。我理解其意圖,已對照規格驗證其行為,並接受它在生產環境中的所有權。」這是一個簡單的儀式,但它把審查從「看起來不錯」轉變為真正的問責。

主持人: 在風險分層治理上有很強的趨同。爭論在於實作複雜度,而非原則本身。讓我們投票。

投票:治理模型中最應優先導入的層次?

| 專家 | 投票 | 理由 |
| --- | --- | --- |
| 葉大師 | 第一層(執行隔離) | 基礎——沒有沙盒,第二、三層都是空談 |
| Kent Beck | 問責文化先於任何技術層 | 必須先回答「誰承擔後果?」 |
| Martin Fowler | 風險分層框架 | 先把紅/黃/綠分好,再投資任何層 |
| Dr. 陳明哲 | 第二層(語意審查) | 近期最實際的影響;第三層有採用障礙 |
| 林宏志 | 第一層加自動重新分類 | 沙盒加自動化風險層級偵測 |
| Jessica Liu | 所有程式碼第一層,僅支付路徑第三層 | 最小可行治理——保護最重要的部分 |

Round 5: Semantic Non-Determinism — The Fundamental Challenge of AI-Generated Code

Moderator: Dr. Chen, you've been hinting at something deeper throughout this discussion. LLMs are fundamentally non-deterministic. The same prompt can produce different code each time. Take us through the implications.

Dr. Ming-Zhe Chen: Thank you. This is the elephant in the room that our industry has not adequately confronted. Large Language Models are stochastic systems. Even with temperature set to zero — which does NOT guarantee determinism due to floating-point arithmetic variations across hardware — the same prompt can produce semantically different code across runs. I want to be precise about what I mean. I'm not talking about superficial differences like variable naming or whitespace. I'm talking about semantic non-determinism — different algorithmic choices, different edge case handling, different error recovery strategies. In our lab, we ran an experiment: we gave Claude 3.5 Sonnet the same function specification 50 times — a binary search implementation with specific edge case requirements. We got 7 semantically distinct implementations. Three were correct. Two had off-by-one errors. One silently returned -1 for empty arrays instead of throwing as specified. One had a subtle integer overflow bug on arrays larger than 2^31 elements. Same prompt. Seven different programs. Four of them buggy. This is not a failure of the model. This is a feature of the architecture. Neural networks sample from a probability distribution over token sequences. There is no canonical "correct" output — there's a distribution of plausible outputs, some correct and some not. And here's the critical point: the boundary between "correct" and "incorrect" in that distribution is invisible to the user. You can't tell from the output which run was correct without independent verification.

Kent Beck: standing up THIS. This is exactly why TDD matters more in the AI era, not less. When a human writes code, there's a single deterministic author with a consistent mental model. When AI writes code, there's a stochastic process with no persistent mental model. The tests are the invariant. The tests are the ONLY invariant. Let me say that again: in a world of non-deterministic code generation, the only thing you can hold constant is the specification expressed as tests. If your test suite is comprehensive — and by comprehensive I mean Detroit TDD for correctness, London TDD for contracts, Property-Based Testing for invariants — then it doesn't MATTER that the AI generates different code each time. Any correct implementation will pass. Any incorrect implementation will fail. The non-determinism becomes a feature, not a bug, because each regeneration is essentially a new Monte Carlo sample of the solution space. But — and this is the crucial "but" — this ONLY works if your tests capture the full intent. If your tests are incomplete, you're rolling dice on which of the AI's seven implementations you happen to get.
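Kent's claim that "any correct implementation will pass" can be made concrete with a small, hand-rolled harness (no library; the `Search` signature, generator, and run count are illustrative assumptions). It encodes the spec from Dr. Chen's experiment as executable checks: throw on empty input, agree with a trivial oracle everywhere else, so each regeneration can be judged automatically:

```typescript
// A search implementation takes a sorted array and a target, returns an index
// whose element equals the target, or -1 if absent; it must throw on [].
type Search = (sorted: number[], target: number) => number;

// Deterministic pseudo-random generator so failures are reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// The invariant harness: any correct implementation passes; the buggy
// regenerations (off-by-one, silent -1 on empty input) fail.
function satisfiesSpec(search: Search, runs = 500): boolean {
  const rand = mulberry32(42);
  // Spec clause 1: empty input must throw, not return -1.
  try {
    search([], 0);
    return false;
  } catch {
    // expected
  }
  // Spec clause 2: agree with a trivial oracle on random sorted arrays.
  for (let i = 0; i < runs; i++) {
    const len = 1 + Math.floor(rand() * 30);
    const arr = Array.from({ length: len }, () => Math.floor(rand() * 50)).sort((a, b) => a - b);
    const target = Math.floor(rand() * 50);
    const got = search(arr, target);
    const present = arr.includes(target);
    if (present && arr[got] !== target) return false;
    if (!present && got !== -1) return false;
  }
  return true;
}
```

Run `satisfiesSpec` over each regeneration: the non-determinism of the generator stops mattering, because the harness, not the reader, decides which sample was correct.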

William Yeh: Kent has articulated the core argument for why governance is non-negotiable. Let me connect this to the TOC framework. In a deterministic world, you can govern through process: write code, review code, deploy code. Each step is predictable. In a non-deterministic world, you must govern through constraints: define what "correct" means (specification), verify it automatically (testing), and restrict what's allowed to reach production (governance). The three pillars — spec, test, govern — are the response to semantic non-determinism. Remove any one, and you lose control. This is not optional. This is not "nice to have." This is the minimum viable discipline for using AI-generated code professionally.

Hong-Zhi Lin: crossing arms Okay, I want to bring this back to reality because I think we're overcomplicating this. In my daily work, here's how I deal with AI non-determinism. Strategy one: I generate code, I read it, I use my engineering judgment to decide if it's correct. If it's obviously wrong, I regenerate. If it looks right, I test it manually. Strategy two: for critical code, I generate it three times and compare the outputs. If all three agree, I'm more confident. If they diverge, I look at the differences to understand what the AI is uncertain about. Strategy three: I use the AI's own uncertainty as a signal. If the AI generates clean, confident-looking code, it's more likely correct. If the output is convoluted and hedging, there's probably a conceptual issue I need to resolve in my specification. These are practical heuristics. They're not formally rigorous. But they work in the trenches. And they don't require a PhD in formal methods.

Dr. Ming-Zhe Chen: Hong-Zhi, I want to challenge your heuristics directly. Strategy one — "it looks right" — is the most dangerous heuristic in software engineering. Bugs that survive code review are, by definition, bugs that look right. AI-generated code is particularly deceptive because LLMs optimize for plausibility, not correctness. The code looks like it should work because the model was trained on millions of examples of working code. It has the right structure, the right naming, the right patterns. But the semantics can be subtly wrong, and subtle semantic bugs are the hardest to catch by visual inspection. Strategy two — generate three times and compare — is actually a known technique in fault-tolerant systems: N-version programming. But the research on N-version programming showed that independently developed implementations often share common-mode failures — they fail on the same edge cases because the failures are driven by common human misconceptions about the problem. With LLMs, the risk is even higher because all three outputs come from the same model trained on the same data. If the model has a blind spot for a particular edge case, all three outputs will share that blind spot. Strategy three — using AI confidence as a signal — has no empirical support that I'm aware of. LLMs are famously poorly calibrated. They produce confidently wrong answers routinely.

Hong-Zhi Lin: uncrossing arms Fair. All three of those criticisms land. But Dr. Chen, what's your alternative? You've told me my heuristics are flawed. You've told me formal methods have a brutal learning curve. You've told me only 24% of developers can write meaningful properties. So what does the average developer actually DO on Monday morning when they need to ship a feature using AI-generated code?

Martin Fowler: interjecting I'll answer that, because I think the practical answer is simpler than either extreme suggests. Treat AI output like untrusted input. This is a mental model every developer already understands. When you receive data from a user form, you validate it. You sanitize it. You don't trust it. Apply the same discipline to AI-generated code. Step one: read it carefully — not to admire it, but to challenge it. Step two: write at least one test that captures the primary happy path. Step three: write at least one test for the most obvious edge case. Step four: run it, observe the behavior, compare to your expectations. That's 15-20 minutes of discipline per AI generation. It won't catch every bug. But it will catch the majority of semantic errors, and it's accessible to every developer regardless of skill level.
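Martin's steps two and three are small enough to show inline. Here they are as plain assertions against a hypothetical AI-generated `slugify` helper (the helper and test names are stand-ins for whatever the AI actually produced):

```typescript
// Hypothetical AI-generated helper under review: turn a title into a URL slug.
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics
    .replace(/^-+|-+$/g, ""); // strip leading/trailing separators
}

// Step 2: one test for the primary happy path.
function testHappyPath(): void {
  if (slugify("Hello World") !== "hello-world") {
    throw new Error("happy path failed");
  }
}

// Step 3: one test for the most obvious edge case. Here: input that is
// all separators should produce an empty slug, not a dangling dash.
function testEdgeCase(): void {
  if (slugify("  --  ") !== "") {
    throw new Error("edge case failed");
  }
}
```

That is the whole discipline: don't admire the output, interrogate it, the same way you'd validate a form field.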

William Yeh: Martin's "untrusted input" mental model is excellent, and I want to strengthen it. In security, we don't just validate untrusted input once — we validate at every trust boundary. The same principle should apply to AI-generated code. Validate when it enters your codebase (code review). Validate when it integrates with existing code (integration tests). Validate when it processes real data (monitoring and observability). Defense in depth. Each layer catches what the previous layer missed. And here's where the three-layer testing architecture comes full circle: Detroit TDD catches unit-level semantic errors. London TDD catches contract violations at integration boundaries. Property-Based Testing catches invariant violations across the input space. Three layers of validation for untrusted AI output.

Kent Beck: I want to propose a specific practice that makes this concrete. I call it the "AI Skeptic Test." After the AI generates a function, before you accept it, write one test that you think the AI probably got wrong. Not a random test — a test targeting a specific edge case that you suspect the AI didn't handle correctly. In my experience, this single practice catches 30-40% of AI-generated bugs, because it forces you to think adversarially about the code. It takes 2 minutes. No formal methods required. Just disciplined skepticism.
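A sketch of what such a skeptic test might look like, using the currency-rounding seam from Kent's earlier story (the function names are hypothetical; the floating-point behavior of 19.99 * 100 is real):

```typescript
// Hypothetical AI-generated conversion from a displayed dollar price to
// the integer cents actually charged.
function naiveChargeCents(displayedDollars: number): number {
  return Math.trunc(displayedDollars * 100); // looks right, isn't
}

// The AI Skeptic Test: before accepting the function, write the one test
// you suspect it gets wrong. Binary floating point cannot represent 19.99
// exactly, so 19.99 * 100 evaluates to 1998.999..., and truncation
// silently undercharges by a cent.
function skepticTest(chargeCents: (dollars: number) => number): boolean {
  return chargeCents(19.99) === 1999;
}

// A fixed version that survives the skeptic test.
function roundedChargeCents(displayedDollars: number): number {
  return Math.round(displayedDollars * 100);
}
```

`skepticTest(naiveChargeCents)` fails and `skepticTest(roundedChargeCents)` passes: two minutes of adversarial thinking, one undercharging bug caught before review.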

Moderator: We've identified the fundamental challenge and several practical response strategies. Let's capture the consensus on the most effective approach.

Vote: Most effective strategy for addressing semantic non-determinism?

| Expert | Vote | Reasoning |
| --- | --- | --- |
| Dr. Ming-Zhe Chen | Comprehensive test suites as the single source of truth | Tests are the only deterministic anchor in a non-deterministic generation process |
| Kent Beck | TDD + "AI Skeptic Test" practice | Tests as invariant, plus adversarial single-test discipline per generation |
| William Yeh | Three-pillar defense: spec, test, govern | Non-determinism requires all three; removing one loses control |
| Martin Fowler | "Untrusted input" mental model + layered validation | Accessible to all developers, defense in depth |
| Hong-Zhi Lin | Martin's untrusted input model — practical and adoptable | Grudgingly admits his heuristics were insufficient |
| Jessica Liu | Kent's AI Skeptic Test — 2 minutes, high ROI | Minimal investment, catches 30-40% of bugs, my team can start Monday |

第五回合:語義不確定性——AI 生成程式碼的根本挑戰

主持人: Dr. Chen,你在整場討論中一直暗示某個更深層的問題。LLM 本質上是非確定性的。同一個 prompt 每次可能產出不同的程式碼。帶我們看看這意味著什麼。

Dr. 陳明哲: 謝謝。這是我們產業尚未充分面對的房間裡的大象。大型語言模型是隨機系統。即使 temperature 設為零——由於不同硬體上浮點運算的差異,這「不」保證確定性——同一個 prompt 在不同執行中可以產出語意不同的程式碼。我想精確說明我的意思。我不是在說變數命名或空白這種表面差異。我說的是「語意」上的非確定性——不同的演算法選擇、不同的邊界案例處理、不同的錯誤恢復策略。在我們的實驗室,我們做了一個實驗:把同一個函式規格給 Claude 3.5 Sonnet 跑 50 次——一個有特定邊界案例要求的二分搜尋實作。我們得到了 7 個語意不同的實作。三個是正確的。兩個有 off-by-one 錯誤。一個對空陣列靜默回傳 -1 而非按規格 throw。一個有微妙的整數溢位 bug,在大於 2^31 元素的陣列上才會出現。同一個 prompt。七個不同的程式。四個有 bug。這不是模型的失敗。這是架構的「特性」。神經網路從 token 序列的機率分佈中取樣。沒有典範的「正確」輸出——只有一個合理輸出的分佈,有些正確有些不正確。而關鍵是:該分佈中「正確」和「不正確」之間的邊界對使用者是不可見的。你無法從輸出判斷哪次執行是正確的,除非獨立驗證。

Kent Beck: 站起來 就是這個。這正是為什麼 TDD 在 AI 時代更重要,而非更不重要。當人類寫程式碼時,有一個單一的確定性作者,具有一致的心智模型。當 AI 寫程式碼時,是一個沒有持久心智模型的隨機過程。測試是不變量。測試是「唯一的」不變量。讓我再說一次:在非確定性程式碼生成的世界中,你唯一能保持不變的是以測試表達的規格。如果你的測試套件是全面的——我所謂的全面是指 Detroit TDD 確保正確性、London TDD 確保契約、Property-Based Testing 確保不變量——那麼 AI 每次生成不同的程式碼就「不重要」了。任何正確的實作都會通過。任何不正確的實作都會失敗。非確定性變成了特性而非 bug,因為每次重新生成本質上是解空間的一次新的蒙地卡羅取樣。但——這是關鍵的「但」——這「只」在你的測試捕捉完整意圖時才成立。如果你的測試不完整,你就是在擲骰子看碰巧拿到 AI 七個實作中的哪一個。

葉大師: Kent 清楚闡述了治理為何不可妥協的核心論據。讓我把這連接到 TOC 框架。在確定性世界中,你可以透過流程治理:寫程式碼、審查程式碼、部署程式碼。每一步都是可預測的。在非確定性世界中,你必須透過約束治理:定義「正確」的意思(規格)、自動驗證(測試),以及限制什麼被允許到達生產環境(治理)。三根支柱——規格、測試、治理——是對語意非確定性的回應。移除任何一個,你就失去控制。這不是可選的。這不是「有了更好」。這是專業使用 AI 生成程式碼的最低可行紀律。

林宏志: 雙臂交叉 好,我想把這拉回現實,因為我覺得我們把事情過度複雜化了。在我的日常工作中,我這樣處理 AI 的非確定性。策略一:我生成程式碼,我讀它,用我的工程判斷力決定它是否正確。明顯錯誤就重新生成。看起來對就手動測試。策略二:對關鍵程式碼,我生成三次然後比較輸出。如果三個一致,我更有信心。如果它們不同,我看差異來理解 AI 對什麼不確定。策略三:我用 AI 自身的不確定性作為信號。如果 AI 生成乾淨、看起來有信心的程式碼,它更可能正確。如果輸出是迂迴和含糊的,大概有個概念性問題需要我在規格中解決。這些是實務的啟發法。它們不是形式上嚴謹的。但它們在戰壕裡管用。而且不需要形式方法的博士學位。

Dr. 陳明哲: 宏志,我想直接挑戰你的啟發法。策略一——「看起來對」——是軟體工程中最危險的啟發法。能通過 code review 的 bug,按定義就是「看起來」正確的 bug。AI 生成的程式碼特別具有欺騙性,因為 LLM 優化的是合理性而非正確性。程式碼「看起來」應該能工作,因為模型是在數百萬個能運作的程式碼範例上訓練的。它有正確的結構、正確的命名、正確的模式。但語意可以微妙地錯誤,而微妙的語意 bug 是最難通過視覺檢查發現的。策略二——生成三次然後比較——實際上是容錯系統中的已知技術:N 版本程式設計。但 N 版本程式設計的研究顯示,獨立開發的實作常常共享共模失效——它們在相同的邊界案例上失敗,因為失敗是由對問題的共同人類誤解驅動的。用 LLM 時,風險更高,因為三個輸出都來自同一個在相同資料上訓練的模型。如果模型對某個特定邊界案例有盲點,三個輸出都會共享那個盲點。策略三——用 AI 的信心作為信號——據我所知沒有經驗支持。LLM 出了名的校準不良。它們常常自信地產出錯誤答案。

林宏志: 放下雙臂 公平。這三個批評都打中了。但 Dr. Chen,你的替代方案是什麼?你告訴我我的啟發法有缺陷。你告訴我形式方法學習曲線殘酷。你告訴我只有 24% 的開發者能寫出有意義的屬性。那麼一般的開發者在星期一早上需要用 AI 生成的程式碼交付功能時,到底該「做」什麼?

Martin Fowler: 插話 我來回答,因為我覺得實務上的答案比兩個極端暗示的都簡單。把 AI 輸出當作不受信任的輸入。這是每個開發者都已經理解的心智模型。當你從使用者表單收到資料時,你驗證它。你消毒它。你不信任它。對 AI 生成的程式碼應用同樣的紀律。第一步:仔細讀它——不是要欣賞它,而是要質疑它。第二步:至少寫一個捕捉主要 happy path 的測試。第三步:至少為最明顯的邊界案例寫一個測試。第四步:運行它,觀察行為,與你的預期比較。每次 AI 生成花 15-20 分鐘的紀律。它不會抓到每個 bug。但它會抓到大多數語意錯誤,而且不管技能水準如何,每個開發者都能做到。

葉大師: Martin 的「不受信任的輸入」心智模型很出色,我想加強它。在安全領域,我們不是只在不受信任的輸入上驗證一次——我們在每個信任邊界都驗證。同樣的原則應該應用在 AI 生成的程式碼上。在它進入你的 codebase 時驗證(code review)。在它與現有程式碼整合時驗證(整合測試)。在它處理真實資料時驗證(監控和可觀察性)。縱深防禦。每一層捕捉上一層遺漏的。而這就是三層測試架構回歸完整的地方:Detroit TDD 捕捉單元層級的語意錯誤。London TDD 捕捉整合邊界的契約違反。Property-Based Testing 捕捉跨輸入空間的不變量違反。三層驗證,針對不受信任的 AI 輸出。

Kent Beck: 我想提出一個具體的實踐讓這更具體。我稱它為「AI 懷疑論者測試」。在 AI 生成一個函式之後,在你接受它之前,寫一個你認為 AI 大概搞錯的測試。不是隨機測試——而是針對你懷疑 AI 沒有正確處理的特定邊界案例的測試。以我的經驗,這個單一實踐能抓住 30-40% 的 AI 生成 bug,因為它迫使你對程式碼進行對抗性思考。花 2 分鐘。不需要形式方法。只需要有紀律的懷疑精神。

主持人: 我們已經識別了根本挑戰和幾個實務的回應策略。讓我們捕捉對最有效方法的共識。

投票:應對語義不確定性的最有效策略?

| 專家 | 投票 | 理由 |
| --- | --- | --- |
| Dr. 陳明哲 | 全面的測試套件作為唯一真實來源 | 測試是非確定性生成過程中唯一的確定性錨點 |
| Kent Beck | TDD + 「AI 懷疑論者測試」實踐 | 測試作為不變量,加上每次生成的對抗性單一測試紀律 |
| 葉大師 | 三支柱防禦:規格、測試、治理 | 非確定性需要三者全部;移除一個就失去控制 |
| Martin Fowler | 「不受信任的輸入」心智模型 + 分層驗證 | 所有開發者都能做到,縱深防禦 |
| 林宏志 | Martin 的不受信任輸入模型——實務且可採用 | 不情願地承認他的啟發法不夠充分 |
| Jessica Liu | Kent 的 AI 懷疑論者測試——2 分鐘,高 ROI | 最小投資,抓到 30-40% 的 bug,我的團隊星期一就能開始 |

Round 6: A 12-Month Technical Practice Roadmap for a 25-Year-Old Frontend Engineer

Moderator: Final round. Let's make this actionable. Imagine a 25-year-old frontend engineer — 2-3 years of experience, comfortable with React and TypeScript, uses Copilot daily, has never written a property-based test, doesn't know what TLA+ is. What's the 12-month roadmap for becoming an AI-era software engineer? Let's build it together, and let's fight about the priorities.

William Yeh: I'll lay out the framework, then let everyone attack it. Three phases.

Months 1-3: Foundation — Testing Discipline and Specification Thinking.

Week 1-2: Learn Detroit TDD from scratch. Not theory — practice. Pick a small utility function, write a failing test, make it pass, refactor. Do this 20 times until the Red-Green-Refactor rhythm is muscle memory. Use Jest or Vitest. Resources: Kent's "Test-Driven Development by Example" — the first 100 pages only. Don't read the patterns section yet.

Week 3-4: Apply TDD to AI-generated code. Use Cursor or Claude to generate a function, then write tests for it BEFORE reading the implementation. You're testing someone else's code now. This is fundamentally different from testing your own code and builds the auditing mindset.

Month 2: London TDD. Learn mocking and stubbing. Understand the difference between a test that verifies behavior ("this function calls the API with these parameters") versus state ("this function returns 42"). Write contract tests for the interfaces between your React components and their data sources. Resource: Freeman and Pryce's "Growing Object-Oriented Software, Guided by Tests."

Month 3: Introduction to Property-Based Testing. Learn fast-check in TypeScript. Start with simple properties: "sorting a list of numbers always produces a list of the same length," "serializing then deserializing an object produces the original object." Apply this to one real component in your production codebase. Resource: "Property-Based Testing with PropEr, Erlang, and Elixir" by Fred Hebert — the concepts transfer to any language.
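The two starter properties William names can be hand-rolled in plain TypeScript before adopting a library (illustrative generator and run count; in fast-check the same shape becomes `fc.assert(fc.property(...))`):

```typescript
// Minimal hand-rolled property check: run a predicate over many random
// inputs and surface a counterexample if one is found.
function checkProperty<T>(gen: () => T, prop: (x: T) => boolean, runs = 200): T | null {
  for (let i = 0; i < runs; i++) {
    const input = gen();
    if (!prop(input)) return input; // counterexample, if any
  }
  return null; // property held on every sampled input
}

// A simple generator: random arrays of small integers (illustrative).
const randomNumbers = (): number[] =>
  Array.from({ length: Math.floor(Math.random() * 20) }, () => Math.floor(Math.random() * 1000) - 500);

// Property 1: sorting a list of numbers preserves its length.
const sortKeepsLength = (xs: number[]): boolean =>
  [...xs].sort((a, b) => a - b).length === xs.length;

// Property 2: serializing then deserializing produces the original value.
const jsonRoundTrip = (xs: number[]): boolean =>
  JSON.stringify(JSON.parse(JSON.stringify(xs))) === JSON.stringify(xs);
```

Once the shape is familiar, switching to fast-check buys shrinking of counterexamples and richer generators for free.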

Kent Beck: immediately I want to rearrange months 1-3. William, you're front-loading too much theory. Here's my counter-proposal. Week 1: write one characterization test for a component you work with daily. Just one. Capture its current behavior. Feel the confidence that gives you when you change that component next week. Week 2: write a test BEFORE you ask AI to generate code. Just once. Notice how writing the test forces you to clarify your specification. Week 3-4: now read the first 5 chapters of my TDD book, because you have context for what the practices mean. Months 2-3: same as William's, but interleave practice and theory. Never go more than 2 days without writing a test in your production codebase. Theory without practice decays in a week.
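Kent's week-one exercise, a single characterization test, might look like this (the `formatPrice` helper is a hypothetical stand-in for "a component you work with daily"):

```typescript
// A stand-in for an existing helper you didn't write and don't fully trust.
function formatPrice(cents: number): string {
  return "$" + (cents / 100).toFixed(2);
}

// A characterization test records CURRENT behavior, quirks included: here,
// negative amounts render as "$-5.00" rather than "-$5.00". It documents
// what IS, not what SHOULD be, so next week's refactor can't change
// behavior silently.
function characterizeFormatPrice(): void {
  if (formatPrice(1999) !== "$19.99") throw new Error("base case changed");
  if (formatPrice(0) !== "$0.00") throw new Error("zero case changed");
  if (formatPrice(-500) !== "$-5.00") throw new Error("negative quirk changed");
}
```

The point of the exercise is the feeling it produces: the first time you touch that helper afterward, the test either stays green or tells you exactly what you changed.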

Jessica Liu: Kent, I like your approach better because it produces value from day one. A characterization test on day 5 is immediately useful. Reading a book for 2 weeks before writing any test is not. My engineers have the attention span of — well, they're engineers. If they don't see a result in the first week, they'll decide "testing is overhead" and go back to yolo-shipping with Cursor.

Hong-Zhi Lin: I want to add something both William and Kent are missing for a frontend engineer specifically. Before ANY testing practice, spend month 1 learning to read code critically. I mean really read it. Take a 200-line React component — preferably one generated by AI — and annotate every line with what you think it does and why. Then verify your annotations by tracing the execution. Most junior frontend engineers can write React code but can't systematically read and reason about unfamiliar code. This is the meta-skill that makes everything else possible. You can't write good tests if you can't read the code you're testing.

Dr. Ming-Zhe Chen: Hong-Zhi's point is excellent and connects to a skill that's vanishing: formal reasoning about program behavior. I'm not saying learn TLA+ in month 1. I'm saying learn to draw state diagrams. If your React component has 5 states — loading, error, empty, data, stale — draw the state machine. Label every transition. Ask: "Can I reach the error state from the stale state? What happens if the user clicks refresh while in the loading state?" This kind of systematic reasoning catches bugs that no amount of "just write more tests" will find. And it's a skill that directly translates to writing better specifications.
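Dr. Chen's five-state example can be written down as an explicit transition table, which turns his questions into code (the specific allowed transitions below are illustrative assumptions, not a standard):

```typescript
// The five states Dr. Chen names, with an explicit (illustrative) transition table.
type UiState = "loading" | "error" | "empty" | "data" | "stale";
type UiEvent = "FETCH" | "SUCCESS_EMPTY" | "SUCCESS_DATA" | "FAILURE" | "INVALIDATE";

const transitions: Record<UiState, Partial<Record<UiEvent, UiState>>> = {
  loading: { SUCCESS_EMPTY: "empty", SUCCESS_DATA: "data", FAILURE: "error" },
  error: { FETCH: "loading" },
  empty: { FETCH: "loading" },
  data: { INVALIDATE: "stale" },
  stale: { FETCH: "loading" },
};

// "What happens if the user clicks refresh while loading?" The table answers
// explicitly: loading has no FETCH entry, so the event is rejected.
function step(state: UiState, event: UiEvent): UiState {
  const next = transitions[state][event];
  if (next === undefined) throw new Error(`illegal event ${event} in state ${state}`);
  return next;
}

// "Can I reach the error state from the stale state?" Answerable by graph
// search instead of guesswork.
function reachable(from: UiState, to: UiState): boolean {
  const seen = new Set<UiState>([from]);
  const queue: UiState[] = [from];
  while (queue.length > 0) {
    const s = queue.shift()!;
    if (s === to) return true;
    for (const next of Object.values(transitions[s])) {
      if (next !== undefined && !seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return false;
}
```

Under this table, `reachable("stale", "error")` is true (stale, refresh, failure), while refreshing during `loading` throws: both of Dr. Chen's questions answered mechanically.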

William Yeh: nodding Both valid additions. Let me revise.

Months 4-6: Intermediate — Specification-First Workflow and Clean Code.

Month 4: Learn Spec-Driven Development. Before every AI interaction, write a specification document — not a prompt, a specification. Include: function signature, input constraints, output guarantees, edge cases, error handling behavior. Start with a simple template and evolve it. Measure the time you spend on specification versus debugging AI output. You'll see the ratio shift.

Month 5: Clean code practices for AI readability. Refactor one module per week in your production codebase. Focus on: function extraction (no function longer than 20 lines), consistent naming (no abbreviations, no single-letter variables except loop counters), explicit types (no any in TypeScript, no implicit returns). Measure AI completion accuracy before and after refactoring.

Month 6: Integration testing and contract testing. Learn to test the boundaries between your frontend and your APIs. Use tools like MSW (Mock Service Worker) for API mocking. Write contract tests that verify your frontend's assumptions about API responses. This is London TDD applied to the frontend-backend boundary.
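The contract idea in miniature: write the frontend's assumption down as a type plus a runtime guard, then run captured or mocked responses through it (the `UserProfile` shape is hypothetical; in practice MSW would supply the mocked response):

```typescript
// The frontend's assumption about the API response, written down as a type...
interface UserProfile {
  id: string;
  email: string;
  displayName: string;
}

// ...and as a runtime guard, so the assumption is checkable, not implicit.
function isUserProfile(value: unknown): value is UserProfile {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.email === "string" &&
    typeof v.displayName === "string"
  );
}

// A contract test feeds a captured (or MSW-mocked) response through the guard.
function assertContract(response: unknown): UserProfile {
  if (!isUserProfile(response)) {
    throw new Error("API response violates the frontend's UserProfile contract");
  }
  return response;
}
```

When the backend renames a field, this test fails in CI instead of the UI failing in production, which is the whole point of testing the boundary.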

Martin Fowler: I'd add to month 5: learn to use AI as a refactoring assistant. This is the virtuous cycle in action. Use AI to help you refactor messy code into clean code, then observe that the cleaner codebase produces better AI-generated code going forward. Concrete exercise: take a messy component, ask Claude to refactor it following specific clean code principles, review the refactoring critically, and apply it if it improves readability. Track the AI accuracy improvement over the refactored code over the next month.

Jessica Liu: For months 4-6, I want to emphasize something practical: learn to estimate the ROI of technical practices. This sounds like a business skill, not a technical skill, but it's what separates engineers who get organizational support for testing and clean code from engineers who are told "we don't have time for that." When you propose TDD to your tech lead, don't say "it improves code quality." Say "last sprint, we spent 18 hours debugging AI-generated code. With TDD, based on industry data, we'd reduce that to 7-9 hours, saving 9-11 hours per sprint. Over a quarter, that's 2-3 features worth of engineering time." Numbers win arguments.

Hong-Zhi Lin: Jessica's right, and I want to add a corollary: learn to measure. Install a time tracker. Categorize your time: feature development, debugging, code review, testing, meetings. After one month, you'll have data. Most engineers are shocked to discover they spend 35-45% of their time debugging, not coding. That data makes the case for testing investment undeniable. I resisted TDD for years until I measured my own debugging time and realized I was spending more time finding bugs than I would have spent preventing them.

William Yeh:

Months 7-12: Advanced — Governance, Property-Based Testing, and System Thinking.

Month 7-8: Deep dive into Property-Based Testing. Move beyond simple properties. Learn to write stateful property tests that model user workflows. For a frontend engineer, this means: "for any sequence of valid user actions, the application state must remain consistent." Use fast-check's model-based testing to verify your state management (Redux, Zustand, whatever) against a simplified model.
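The model-based testing this phase prescribes can be hand-rolled to see the idea (a hypothetical quantity-picker reducer and a deliberately simple model; fast-check's model-based API replaces the loop in practice):

```typescript
// Hypothetical store under test: a quantity-picker reducer, the kind of
// thing Redux or Zustand state often contains.
type Action = "inc" | "dec" | "reset";

function reducer(state: number, action: Action): number {
  switch (action) {
    case "inc": return state + 1;
    case "dec": return Math.max(0, state - 1); // must never go below zero
    case "reset": return 0;
  }
}

// The simplified model: only the invariant-bearing facts we care about.
function modelStep(model: number, action: Action): number {
  if (action === "inc") return model + 1;
  if (action === "dec") return model > 0 ? model - 1 : 0;
  return 0;
}

// Model-based property: for any random sequence of valid user actions,
// the real reducer agrees with the model and the invariant (state >= 0) holds.
function runModelCheck(runs = 100, steps = 50): boolean {
  const actions: Action[] = ["inc", "dec", "reset"];
  for (let r = 0; r < runs; r++) {
    let state = 0;
    let model = 0;
    for (let s = 0; s < steps; s++) {
      const action = actions[Math.floor(Math.random() * actions.length)];
      state = reducer(state, action);
      model = modelStep(model, action);
      if (state !== model || state < 0) return false;
    }
  }
  return true;
}
```

The same structure scales up: replace the counter with your real store and the model with a stripped-down description of what the state must always satisfy.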

Month 9-10: Governance practices. Implement a personal governance workflow: sandboxed execution for all AI-generated code (use feature branches + preview deployments), semantic review checklist (intent alignment, pattern consistency, edge case coverage), and for your most critical module, write one formal property that must hold. Learn the basics of the "AI code owner" pattern — when you review AI code, explicitly document what you've verified and what you're accepting responsibility for.

Month 11-12: System thinking and architecture. Learn the TOC principles: identify the bottleneck, exploit it, subordinate everything else to it. Apply this to your AI-assisted workflow. Where is YOUR bottleneck? If it's specification, invest in specification skills. If it's verification, invest in testing skills. If it's understanding existing code, invest in code reading skills. The bottleneck is different for every engineer. The discipline is identifying it honestly and investing accordingly.

Kent Beck: For months 11-12, I'd add: teach someone else everything you've learned. The best way to solidify your understanding is to explain it. Find a junior developer or a peer who hasn't started this journey. Walk them through months 1-3. You'll discover gaps in your own understanding, and you'll develop the communication skills that turn individual practices into team practices. This is how culture changes — one engineer at a time, teaching the next.

Dr. Ming-Zhe Chen: And I'd add one final skill for the advanced phase: learn to recognize what AI cannot do. After 12 months of working closely with AI, you should be able to identify the categories of problems where AI-generated code is reliably dangerous — concurrent state management, security-critical authentication flows, financial calculations with rounding, anything involving time zones. For these categories, develop heightened skepticism and apply maximum verification. This pattern recognition is a career-defining skill in the AI era.

Martin Fowler: Well said. The ultimate goal of this 12-month roadmap isn't to master any single tool or technique. It's to develop engineering judgment — the ability to decide, for any given situation, how much specification, testing, and governance is appropriate. That judgment is what makes a senior engineer senior. AI doesn't replace it. AI makes it more valuable than ever.

Moderator: A rich and actionable roadmap with genuine debate on priorities. Let's vote on the starting point.

Vote: What should a 25-year-old frontend engineer learn FIRST?

| Expert | Vote | Reasoning |
| --- | --- | --- |
| William Yeh | Spec-writing discipline | The bottleneck is specification, not coding — start at the constraint |
| Kent Beck | Write one characterization test for existing code | Immediate value, builds testing muscle memory, zero theory required |
| Martin Fowler | Clean code reading and refactoring skills | AI readability is the force multiplier for everything else |
| Dr. Ming-Zhe Chen | State diagram drawing and formal reasoning | Meta-skill that improves specification, testing, AND debugging |
| Hong-Zhi Lin | Critical code reading and annotation | You can't test or specify what you can't read and understand |
| Jessica Liu | Time tracking and ROI measurement | Data-driven arguments win organizational support for all other practices |

第六回合:給 25 歲前端工程師的 12 個月技術實踐路線圖

主持人: 最後一輪。讓我們務實可行動。想像一位 25 歲的前端工程師——2-3 年經驗,熟悉 React 和 TypeScript,每天用 Copilot,從來沒寫過 property-based test,不知道 TLA+ 是什麼。成為 AI 時代軟體工程師的 12 個月路線圖是什麼?讓我們一起建構,然後針對優先順序來辯論。

葉大師: 我來鋪陳框架,然後讓大家來攻擊。三個階段。

第 1-3 個月:基礎——測試紀律和規格思維。

第 1-2 週:從零學 Detroit TDD。不是理論——是實作。選一個小的工具函式,寫一個失敗的測試,讓它通過,重構。做 20 次,直到 Red-Green-Refactor 的節奏變成肌肉記憶。用 Jest 或 Vitest。資源:Kent 的「Test-Driven Development by Example」——只讀前 100 頁。先不要讀 patterns 章節。

第 3-4 週:把 TDD 應用到 AI 生成的程式碼上。用 Cursor 或 Claude 生成一個函式,然後在「讀實作之前」為它寫測試。你現在在測試別人的程式碼。這跟測試自己的程式碼根本不同,能建立審計心態。

第 2 個月:London TDD。學習 mocking 和 stubbing。理解驗證行為的測試(「這個函式用這些參數呼叫 API」)跟驗證狀態的測試(「這個函式回傳 42」)之間的差別。為你的 React 元件和它們的資料來源之間的介面寫契約測試。資源:Freeman 和 Pryce 的「Growing Object-Oriented Software, Guided by Tests」。

第 3 個月:Property-Based Testing 入門。學 TypeScript 中的 fast-check。從簡單的屬性開始:「排序一個數字列表總是產出相同長度的列表」、「序列化再反序列化一個物件產出原始物件」。把這應用到你生產 codebase 中的一個真實元件上。資源:Fred Hebert 的「Property-Based Testing with PropEr, Erlang, and Elixir」——概念可以轉移到任何語言。

Kent Beck: 立刻 我想重新排列第 1-3 個月。葉大師,你前面塞了太多理論。這是我的反提案。第 1 週:為你每天使用的元件寫一個 characterization test。就一個。捕捉它的當前行為。感受下週你改那個元件時,它給你的信心。第 2 週:在你請 AI 生成程式碼之前寫一個測試。只做一次。注意寫測試如何迫使你釐清規格。第 3-4 週:現在讀我 TDD 書的前 5 章,因為你已經有了理解這些實踐意義的上下文。第 2-3 個月:跟葉大師的一樣,但交錯實踐和理論。絕不超過 2 天不在生產 codebase 中寫測試。沒有實踐的理論一週就衰退了。

Jessica Liu: Kent,我更喜歡你的方式,因為它從第一天就產出價值。第 5 天的 characterization test 立刻有用。讀 2 週書才寫任何測試不是。我的工程師的注意力持續時間——嗯,他們是工程師。如果他們第一週看不到結果,他們會決定「測試是開銷」然後回去用 Cursor 亂 ship。

林宏志: 我想加一些葉大師和 Kent 都沒提到的,特別針對前端工程師。在「任何」測試實踐之前,花第 1 個月學習批判性地閱讀程式碼。我的意思是真正地閱讀它。拿一個 200 行的 React 元件——最好是 AI 生成的——對每一行標註你認為它做什麼、為什麼這麼做。然後透過追蹤執行來驗證你的標註。多數初階前端工程師能寫 React 程式碼,但不能系統性地閱讀和推理不熟悉的程式碼。這是讓其他所有事情成為可能的元技能。如果你不能讀你在測試的程式碼,你就不能寫好的測試。

Dr. 陳明哲: 宏志的論點很好,而且連接到一個正在消失的技能:對程式行為的形式推理。 我不是說第 1 個月就學 TLA+。我是說學畫狀態圖。如果你的 React 元件有 5 個狀態——loading、error、empty、data、stale——畫出狀態機。標記每個轉換。問:「我能從 stale 狀態到達 error 狀態嗎?使用者在 loading 狀態點擊重新整理會怎樣?」這種系統性推理能抓到再多「只要多寫測試」都找不到的 bug。而且這個技能可以直接轉化為寫出更好的規格。

葉大師: 點頭 兩個都是有效的補充。讓我修訂。

第 4-6 個月:中階——規格優先的工作流和 Clean Code。

第 4 個月:學習 Spec-Driven Development。在每次 AI 互動之前,寫一份規格文件——不是 prompt,是規格。包含:函式簽名、輸入約束、輸出保證、邊界案例、錯誤處理行為。從一個簡單的範本開始然後演化它。測量你花在規格上的時間與 debug AI 輸出的時間比。你會看到比率轉變。

第 5 個月:AI 可讀性的 clean code 實踐。每週重構你生產 codebase 中的一個模組。聚焦於:函式提取(沒有函式超過 20 行)、一致命名(不用縮寫、除了迴圈計數器不用單字母變數)、明確型別(TypeScript 中不用 any、不用隱式回傳)。測量重構前後的 AI 補全準確率。

第 6 個月:整合測試和契約測試。學習測試你前端和 API 之間的邊界。使用 MSW(Mock Service Worker)等工具做 API mocking。寫契約測試來驗證你前端對 API 回應的假設。這是 London TDD 應用在前端-後端邊界上。

Martin Fowler: 我想在第 5 個月加一點:學習用 AI 作為重構助手。這是良性循環在行動。用 AI 幫你把混亂的程式碼重構成乾淨的程式碼,然後觀察更乾淨的 codebase 向前會產出更好的 AI 生成程式碼。具體練習:拿一個混亂的元件,請 Claude 按照特定的 clean code 原則重構它,批判性地審查重構結果,如果它提升了可讀性就應用它。追蹤接下來一個月重構後程式碼的 AI 準確率提升。

Jessica Liu: 對於第 4-6 個月,我想強調一件實務的事:學習估算技術實踐的 ROI。 這聽起來像商業技能,不是技術技能,但這就是區分「能得到組織支持做測試和 clean code」的工程師,和被告知「我們沒有時間做那個」的工程師的關鍵。當你向你的技術主管提議 TDD 時,不要說「它提升程式碼品質」。要說「上一個 sprint,我們花了 18 小時 debug AI 生成的程式碼。用 TDD,根據業界數據,我們可以把那減少到 7-9 小時,每個 sprint 省 9-11 小時。一個季度下來,那是 2-3 個功能的工程時間。」數字贏得爭論。

林宏志: Jessica 說得對,我想加一個推論:學習測量。 安裝一個時間追蹤器。分類你的時間:功能開發、debug、code review、測試、會議。一個月後你會有數據。多數工程師震驚地發現他們花 35-45% 的時間在 debug,不是在寫程式。那個數據讓測試投資的論點無法否認。我抗拒 TDD 好幾年,直到我測量了自己的 debug 時間,才意識到我花在找 bug 的時間比我本來花在預防 bug 的時間還多。

葉大師:

第 7-12 個月:進階——治理、Property-Based Testing 和系統思維。

第 7-8 個月:深入 Property-Based Testing。超越簡單的屬性。學寫有狀態的 property test,模擬使用者工作流。對前端工程師來說,這意味著:「對於任何有效使用者操作的序列,應用程式狀態必須保持一致。」使用 fast-check 的 model-based testing 來驗證你的狀態管理(Redux、Zustand,或任何你用的)對照一個簡化模型。

第 9-10 個月:治理實踐。實作一個個人治理工作流:所有 AI 生成程式碼的沙盒執行(用 feature branch + preview deployment)、語意審查清單(意圖對齊、模式一致性、邊界案例覆蓋)、以及對你最關鍵的模組,寫一個必須成立的形式屬性。學習「AI code owner」模式的基礎——當你審查 AI 程式碼時,明確記錄你驗證了什麼、你承擔什麼責任。

Months 11-12: Systems thinking and architecture. Learn the TOC principles: identify the bottleneck, exploit it, subordinate everything else to it. Apply this to your AI-assisted workflow. Where is your bottleneck? If it's specification, invest in spec skills. If it's verification, invest in testing skills. If it's understanding existing code, invest in code-reading skills. The bottleneck is different for every engineer. The discipline lies in identifying it honestly and investing accordingly.

Kent Beck: For months 11-12, I'd add: teach everything you've learned to someone else. The best way to consolidate understanding is to explain it. Find a junior developer or a colleague who hasn't started this journey and walk them through months 1-3. You'll discover gaps in your own understanding, and you'll develop the communication skills that turn personal practice into team practice. That's how culture changes: one engineer at a time, teaching the next.

Dr. Ming-Zhe Chen: I'd add one final skill for the advanced phase: learn to recognize what AI cannot do. After 12 months of working closely with AI, you should be able to identify the problem classes where AI-generated code is reliably dangerous: concurrent state management, security-critical authentication flows, financial calculations with rounding, anything involving time zones. For those classes, develop heightened skepticism and apply maximum verification. That pattern recognition is a career-defining skill in the AI era.

Martin Fowler: Well said. The end goal of this 12-month roadmap is not mastery of any single tool or technique. It is developing engineering judgment: the ability to decide, for any given context, how much specification, testing, and governance is appropriate. That judgment is what makes senior engineers senior. AI won't replace it. AI makes it more valuable than ever.

Moderator: A rich and actionable roadmap, with real debate over priorities. Let's vote on starting points.

Vote: Which technical practice should a 25-year-old frontend engineer master first?

| Expert | Vote | Justification |
| --- | --- | --- |
| William Yeh | The discipline of writing specifications | The bottleneck is in specs, not coding: start at the constraint |
| Kent Beck | Write one characterization test for existing code | Immediately valuable, builds testing muscle memory, zero theory required |
| Martin Fowler | Clean code reading and refactoring skills | AI readability is the force multiplier for everything else |
| Dr. Ming-Zhe Chen | Drawing state diagrams and formal reasoning | A meta-skill that improves specs, testing, and debugging alike |
| Hong-Zhi Lin | Critical code reading and annotation | You can't test or specify what you can't read |
| Jessica Liu | Time tracking and ROI measurement | Data-driven arguments win organizational support for every other practice |

Closing: Final Technical Advice and Comprehensive Vote

Moderator: We've covered enormous ground — three-layer testing, clean code as AI infrastructure, spec-driven development, governable code, semantic non-determinism, and a 12-month roadmap. Before we close, each expert gives one sentence of final technical advice for any engineer navigating the AI era. Then we take the final comprehensive vote.

William Yeh: When AI removes the bottleneck of writing code, the engineer who masters specification, verification, and governance becomes the most valuable person on the team — not despite AI, but because of it.

Kent Beck: Write the test before the prompt — if you can't specify what you want in a test, you can't specify it in a prompt, and the AI will give you something precisely wrong.

Martin Fowler: Treat your codebase as infrastructure for AI collaboration — every clean function, consistent name, and explicit interface you write today makes every AI interaction tomorrow more productive.

Dr. Ming-Zhe Chen: Never trust AI-generated code more than you'd trust code from a brilliant but unreliable intern — verify everything, especially the parts that look most confident.

Hong-Zhi Lin: Measure your debugging time for one month, then show those numbers to anyone who says testing is overhead — the data wins the argument every time.

Jessica Liu: Frame every technical practice as an investment with measurable returns — "this will improve code quality" loses to "this will save us 11 hours per sprint and $28,000 per quarter" every single time.

Moderator: Powerful final statements. Now, the comprehensive final vote.

Final Vote: The single most important technical practice for the AI era?

| Expert | Vote | One-Line Justification |
| --- | --- | --- |
| William Yeh | Spec-Driven Development | The bottleneck has migrated — spec quality is the new rate limiter for AI-assisted teams |
| Kent Beck | Test-Driven Development (evolved for AI) | Tests are the only invariant in non-deterministic generation — TDD is the anchor |
| Martin Fowler | Clean Code as AI Infrastructure | 2x productivity multiplier — the highest-leverage investment for AI-assisted development |
| Dr. Ming-Zhe Chen | Formal reasoning and verification discipline | Without verification discipline, all other practices are built on sand |
| Hong-Zhi Lin | Measurable testing culture with data-driven adoption | Practices without measurement are religion — measurement creates sustainable adoption |
| Jessica Liu | ROI-framed incremental adoption of testing and spec practices | The best practice is the one your organization will actually fund and sustain |

Moderator: A fitting conclusion. No unanimous agreement — and that's the point. The AI era doesn't demand a single practice; it demands a system of practices — specification, testing, clean code, governance — adapted to your team's context, measured by results, and adopted incrementally. The debate will continue, as it should. Thank you all.
