
Building a YouTube-to-Podcast Platform: Technical Research

Google's NotebookLM Audio Overview feature took the world by storm — upload documents, get a natural two-host podcast discussing your content. But what if you want this capability for YouTube videos? And what if you want to run it yourself, with your own voice profiles and complete control?

This research covers the complete technical stack for building a personal YouTube-to-Podcast platform: extracting audio, transcribing speech, generating engaging scripts, and synthesizing natural multi-speaker audio.


System Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    YouTube to Podcast Pipeline                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ YouTube  │───▶│  Audio   │───▶│ Transcript│───▶│  Script  │  │
│  │   URL    │    │Extraction│    │   (STT)   │    │Generation│  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                                                       │         │
│                                                       ▼         │
│                                              ┌──────────────┐   │
│                                              │   Podcast    │   │
│                                              │    Audio     │   │
│                                              │    (TTS)     │   │
│                                              └──────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The pipeline consists of four main stages:

  1. Audio Extraction — Download audio from YouTube
  2. Speech-to-Text (STT) — Convert audio to transcript
  3. Script Generation — Transform transcript into podcast dialogue
  4. Text-to-Speech (TTS) — Synthesize natural multi-speaker audio
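
The four stages compose linearly, so the whole platform can be sketched as one function chain. The stage functions below are placeholder stubs standing in for the concrete tools discussed in the rest of this article:

```python
# Minimal sketch of the four-stage pipeline as pluggable functions.
# All stage implementations here are illustrative placeholders.
def run_pipeline(url, extract, transcribe, write_script, synthesize):
    audio_path = extract(url)            # 1. Audio extraction (yt-dlp)
    transcript = transcribe(audio_path)  # 2. Speech-to-text (Whisper)
    script = write_script(transcript)    # 3. Script generation (LLM)
    return synthesize(script)            # 4. Text-to-speech (TTS)

# Stub run showing the data flow end to end:
result = run_pipeline(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    extract=lambda url: "audio.mp3",
    transcribe=lambda path: "raw transcript text",
    write_script=lambda t: "ALEX: ...\nSAM: ...",
    synthesize=lambda script: "podcast.mp3",
)
print(result)  # podcast.mp3
```
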

Part 1: Audio Extraction from YouTube

yt-dlp: The Standard Tool

yt-dlp is the de facto standard for extracting audio from YouTube. It's a feature-rich fork of youtube-dl with faster adaptation to YouTube's changes.

Installation:

bash
# macOS
brew install yt-dlp ffmpeg

# pip
pip install yt-dlp

# Windows
winget install yt-dlp

Basic Usage:

bash
# Extract audio only (best quality)
yt-dlp -x "https://www.youtube.com/watch?v=VIDEO_ID"

# Convert to MP3
yt-dlp -x --audio-format mp3 "URL"

# Best quality M4A with metadata
yt-dlp -x -f "bestaudio[ext=m4a]" --add-metadata --embed-thumbnail "URL"

# Download to specific directory
yt-dlp -x -o "~/podcasts/%(title)s.%(ext)s" "URL"

Requirements: FFmpeg must be installed for format conversion.
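
yt-dlp also exposes a Python API, which is convenient once this step lives inside a larger pipeline. A sketch of the equivalent of `yt-dlp -x --audio-format mp3` (option keys follow the documented `YoutubeDL` interface; the output template is an arbitrary example):

```python
# Equivalent of: yt-dlp -x --audio-format mp3 -o TEMPLATE URL
def build_audio_opts(output_template="%(title)s.%(ext)s"):
    return {
        "format": "bestaudio/best",
        "outtmpl": output_template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",  # still requires FFmpeg on PATH
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }],
    }

def download_audio(url, output_template="%(title)s.%(ext)s"):
    import yt_dlp  # pip install yt-dlp
    with yt_dlp.YoutubeDL(build_audio_opts(output_template)) as ydl:
        ydl.download([url])
```
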

Legal Considerations

| Aspect | Status |
| --- | --- |
| YouTube ToS | Explicitly prohibits third-party downloads |
| Fair Use | Case-by-case evaluation; not blanket permission |
| Personal Use | Lower legal risk, but still a ToS violation |
| Consequences | Possible account termination; DMCA exposure if redistributing |

Recommendation: For personal use and learning purposes, the risk is minimal. Avoid redistributing copyrighted content.



Part 2: Speech-to-Text (Transcription)

Service Comparison

| Service | Price/min | Accuracy | Languages | Best For |
| --- | --- | --- | --- | --- |
| OpenAI Whisper API | $0.006 | 95-99% | 100+ | Best value |
| Whisper (Local) | Free | 95-99% | 100+ | Privacy, high volume |
| faster-whisper | Free | 95-99% | 100+ | 4x faster than Whisper |
| Deepgram | $0.0043 | 30% lower WER (claimed) | 36 | Real-time |
| AssemblyAI | $0.0025 | 85-92% | 20+ | Add-on features |
| Google Cloud STT | $0.024 | High | 125+ | Google ecosystem |

Cost for 1,000 Minutes

| Service | Cost |
| --- | --- |
| OpenAI Whisper API | $6 |
| Deepgram | ~$4.30 |
| Local Whisper | Free (hardware only) |
| Google Cloud | $24 |
| AssemblyAI (base) | ~$15 |
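
These figures follow directly from the per-minute rates; a one-line helper makes the arithmetic explicit:

```python
# Sanity check of the 1,000-minute costs from the per-minute prices above.
def stt_cost(minutes, price_per_minute):
    return round(minutes * price_per_minute, 2)

print(stt_cost(1000, 0.006))   # 6.0  (OpenAI Whisper API)
print(stt_cost(1000, 0.0043))  # 4.3  (Deepgram)
```
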

Local Whisper Setup

faster-whisper is the recommended local solution — 4x faster than standard Whisper with lower memory usage.

bash
pip install faster-whisper

python
from faster_whisper import WhisperModel

# GPU with FP16; for CPU-only machines use device="cpu", compute_type="int8"
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Hardware Requirements:

| Model | VRAM | Speed (RTX 3090) |
| --- | --- | --- |
| tiny | ~1GB | ~50x real-time |
| base | ~1GB | ~30x real-time |
| small | ~2GB | ~15x real-time |
| medium | ~5GB | ~8x real-time |
| large-v3 | ~10GB | ~4x real-time |
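
Given those footprints, model choice mostly reduces to "the largest model that fits in VRAM". A tiny helper encoding the table (thresholds copied from above):

```python
# Pick the largest Whisper model that fits in the available VRAM (GB),
# using the approximate requirements from the table above.
MODEL_VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10)]

def pick_model(vram_gb):
    fitting = [name for name, need in MODEL_VRAM_GB if need <= vram_gb]
    return fitting[-1] if fitting else None

print(pick_model(6))   # medium
print(pick_model(12))  # large-v3
```
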

OpenAI Whisper API

python
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"]
    )

print(transcript.text)

Advantages:

  • No hardware requirements
  • Automatic language detection
  • High accuracy across accents
  • $0.006/minute is very affordable
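
One caveat worth planning for: the Whisper API rejects uploads over 25 MB, so long episodes need splitting first. A sketch using FFmpeg's segment muxer (the 10-minute default chunk length is an assumption that keeps typical MP3s under the limit):

```python
import math
import subprocess

def chunks_needed(duration_s, chunk_s=600):
    """How many chunks a recording needs at the given chunk length."""
    return math.ceil(duration_s / chunk_s)

def split_audio(path, chunk_s=600, pattern="chunk_%03d.mp3"):
    """Split an audio file into fixed-length chunks without re-encoding."""
    subprocess.run([
        "ffmpeg", "-i", path, "-f", "segment",
        "-segment_time", str(chunk_s), "-c", "copy", pattern,
    ], check=True)

print(chunks_needed(3600))  # 6 chunks for a 1-hour video
```
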


Part 3: NotebookLM Analysis — What Makes It Work?

How NotebookLM Audio Overview Works

Google's NotebookLM uses:

  • Gemini 1.5 Pro for content understanding and script generation
  • SoundStorm (likely) for realistic audio synthesis
  • RAG system for processing up to 50 documents

What Makes It Feel Natural

  1. Two distinct AI hosts with different personalities
  2. Thematic connections — linking ideas across the content like real podcasters
  3. Natural conversation flow — interruptions, agreements, clarifications
  4. Casual language — "Oh, that's interesting!" "Right, and..."
  5. Explaining jargon — making technical content accessible

Generation Parameters

| Aspect | Value |
| --- | --- |
| Generation time | ~2-5 minutes |
| Output length | 6-30 minutes (typically ~10) |
| Daily limit (free) | 3 per day |
| Daily limit (Plus) | 20 per day |
| Languages | 80+ for generation |

Limitations to Consider

  • Cannot make minor edits — must regenerate entire audio
  • May contain inaccuracies
  • Audio glitches possible
  • Quality varies by source material


Part 4: Script Generation with LLMs

The Art of Podcast Script Generation

Transforming a transcript into an engaging podcast requires more than summarization: it needs conversation design.

Recommended LLMs

| LLM | Strength | Best For |
| --- | --- | --- |
| Claude | Character development, natural dialogue, "show don't tell" | Two-host conversations |
| GPT-4 | Dynamic writing, warmer tone, versatility | Creative storytelling |
| Gemini | Long context, research accuracy | Factual synthesis |
Recommendation: Claude excels at dialogue and character work. GPT-4 offers versatility. Many pipelines use both.

Prompt Engineering for Natural Dialogue

System Prompt Example:

You are a podcast scriptwriter creating engaging two-host dialogue.

HOST PROFILES:
- Alex: Curious, asks clarifying questions, uses analogies
- Sam: Expert, explains concepts, shares insights

STYLE GUIDELINES:
- Use short sentences suitable for speech synthesis
- Include natural filler words: "uh", "well", "you know"
- Add reactions: "Oh interesting!", "Right, that makes sense"
- Create back-and-forth flow, not monologues
- Explain jargon in conversational terms
- Target 3,000-5,000 words for a 10-minute podcast

OUTPUT FORMAT:
ALEX: [dialogue]
SAM: [dialogue]

User Prompt Example:

Transform the following transcript into an engaging podcast conversation.

Key requirements:
1. Cover the main ideas, not every detail
2. Make it accessible to a general audience
3. Include at least 3 "aha moments" where a concept clicks
4. End with actionable takeaways

TRANSCRIPT:
[Your transcript here]

Advanced Techniques

1. Scratchpad Method (from Together.ai):

First, use a <scratchpad> to brainstorm:
- Key themes to cover
- Interesting angles
- Potential questions listeners might have
- Natural transitions between topics

Then generate the dialogue.

2. Chunking for Long Content:

For transcripts over 10,000 words:

  1. Break into thematic chunks
  2. Generate dialogue for each chunk
  3. Add contextual linking between chunks
  4. Merge with transition phrases
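
Step 1 can be as simple as accumulating sentences until a word budget is hit. A naive sketch (splitting on `". "` is an assumption; real transcripts deserve a proper sentence tokenizer):

```python
def chunk_transcript(text, target_words=1500):
    """Greedily pack sentences into chunks of roughly target_words words."""
    chunks, current, count = [], [], 0
    for sentence in text.split(". "):
        words = len(sentence.split())
        if current and count + words > target_words:
            chunks.append(". ".join(current) + ".")
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(". ".join(current))
    return chunks
```
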

3. Emotional Markers:

SAM: [excited] Oh, this is where it gets really interesting!
ALEX: [curious] Wait, so you're saying...
SAM: [thoughtful pause] Exactly. And here's why that matters...

These markers help TTS systems add appropriate intonation.
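
Downstream, the TTS stage has to parse the script back into structured turns. A sketch parser for the `SPEAKER: [marker] text` format above (the two host names are the ones used throughout this article):

```python
import re

# Matches lines like: ALEX: [curious] Wait, so you're saying...
LINE_RE = re.compile(r"^(ALEX|SAM):\s*(?:\[([^\]]+)\]\s*)?(.*)$")

def parse_script(script):
    """Return a list of (speaker, emotion_or_None, text) turns."""
    turns = []
    for raw in script.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            turns.append(match.groups())
    return turns

parse_script("SAM: [excited] Oh, this is where it gets really interesting!")
# -> [('SAM', 'excited', 'Oh, this is where it gets really interesting!')]
```
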



Part 5: Text-to-Speech (TTS) for Podcast Audio

TTS Service Comparison

| Provider | Price/1M chars | Voice Cloning | Multi-Speaker | Latency | Quality |
| --- | --- | --- | --- | --- | --- |
| ElevenLabs | $165-180 | Yes (instant + pro) | Yes (v3) | 75ms | Excellent |
| OpenAI TTS | $15-30 | No | Limited | ~200ms | Good |
| Google Cloud | $16 | No | Yes | Medium | Good |
| Azure Speech | $15 | Yes | Yes | Medium | Good |
| Play.ht | $39.60 | Yes | Yes | Low | Good |
| Kokoro (Local) | Free | No | Yes | ~20s/3min | Good |
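
To translate per-character prices into per-episode costs: a 4,000-word script is roughly 25,000 characters (assuming ~6 characters per word including spaces), so:

```python
# Rough per-episode TTS cost from the table's per-million-character prices.
PRICE_PER_M_CHARS = {"elevenlabs": 165, "openai_tts_1": 15, "openai_tts_1_hd": 30}

def episode_cost(chars, provider):
    return round(chars / 1_000_000 * PRICE_PER_M_CHARS[provider], 2)

print(episode_cost(25_000, "openai_tts_1_hd"))  # 0.75
print(episode_cost(25_000, "elevenlabs"))       # ~4.1
```
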

ElevenLabs: The Quality Leader

Pricing Tiers:

  • Free: 10,000 chars/month
  • Starter ($5/mo): 30,000 chars
  • Creator ($22/mo): 100,000 chars
  • Pro ($99/mo): 500,000 chars

Voice Cloning:

  • Instant Clone: 1-5 minutes of sample audio
  • Professional Clone: 30 min minimum, 3 hours optimal

Multi-Speaker Dialogue (v3):

python
# Illustrative only: the multi-speaker API surface changes between
# elevenlabs SDK releases, so treat the names below as a sketch and
# check the current SDK docs before relying on them.
from elevenlabs import generate, Voice

# Generate dialogue with multiple speakers
script = """
<speaker name="Alex">Hey, welcome to the show!</speaker>
<speaker name="Sam">Thanks for having me. Let's dive in.</speaker>
"""

audio = generate(
    text=script,
    voice=Voice(voice_id="multi_speaker"),
    model="eleven_turbo_v3"
)

OpenAI TTS: Best Value

Pricing:

  • tts-1: $15/1M characters
  • tts-1-hd: $30/1M characters

Usage:

python
from openai import OpenAI

client = OpenAI()

# Available voices: alloy, echo, fable, onyx, nova, shimmer
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Hello, welcome to the podcast!"
)

response.stream_to_file("output.mp3")

Multi-Speaker Approach: Generate each speaker's lines separately, then merge audio files.
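
A sketch of that approach, reusing the `SPEAKER: text` script format from Part 4 (the voice assignments are arbitrary choices, and the synthesis loop assumes an `OPENAI_API_KEY` is set):

```python
# Map each host to an OpenAI voice and build per-line synthesis jobs.
VOICE_MAP = {"ALEX": "nova", "SAM": "onyx"}

def tts_jobs(script_lines):
    """Turn 'SPEAKER: text' lines into ordered (voice, text) jobs."""
    jobs = []
    for line in script_lines:
        speaker, _, text = line.partition(":")
        voice = VOICE_MAP.get(speaker.strip().upper())
        if voice and text.strip():
            jobs.append((voice, text.strip()))
    return jobs

def synthesize_all(jobs, out_dir="segments"):
    from pathlib import Path
    from openai import OpenAI  # requires OPENAI_API_KEY
    client = OpenAI()
    Path(out_dir).mkdir(exist_ok=True)
    for i, (voice, text) in enumerate(jobs):
        response = client.audio.speech.create(model="tts-1-hd", voice=voice, input=text)
        response.stream_to_file(f"{out_dir}/{i:04d}_{voice}.mp3")
    # Concatenate the numbered MP3s afterwards (e.g. with ffmpeg's concat demuxer).
```
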

Local TTS: Kokoro

For complete privacy, Kokoro offers local TTS:

  • ~20 seconds to generate 3 minutes of audio
  • Multiple voice options
  • No API costs
  • Requires decent GPU

Making TTS Sound Natural

1. Add Filler Words in Script:

"Well, uh, that's a great point. You know, I hadn't thought about it that way."

2. Use SSML for Control:

xml
<speak>
  <prosody rate="medium" pitch="+5%">
    That's really interesting!
  </prosody>
  <break time="500ms"/>
  <prosody rate="slow">
    Let me think about that...
  </prosody>
</speak>

3. Emotional Markers:

[laughs] Oh, that's a good one!
[thoughtful] Hmm, that's a tricky question...
[excited] Yes! Exactly!


Part 6: Open Source Alternatives

Podcastfy (Recommended Starting Point)

Podcastfy is the most feature-complete open-source alternative to NotebookLM's Audio Overview.

Features:

  • 100+ LLM models (OpenAI, Anthropic, Google, local via Ollama)
  • Multiple TTS integrations (OpenAI, Google, ElevenLabs, Edge)
  • Multi-language support (156+ HuggingFace models)
  • Shortform (2-5 min) and longform (30+ min)
  • "Content Chunking with Contextual Linking" for long content

Installation:

bash
pip install podcastfy

# Set API keys
export OPENAI_API_KEY="your-key"
export ELEVENLABS_API_KEY="your-key"  # optional

Usage:

python
from podcastfy import generate_podcast

# From YouTube URL
podcast = generate_podcast(
    urls=["https://youtube.com/watch?v=..."],
    tts_model="openai",
    conversation_style="casual",
    output_language="en"
)

PDF2Audio (MIT)

PDF2Audio from MIT researchers:

  • Multiple PDF support
  • Template selection (podcasts, lectures, summaries)
  • Draft transcript editing before generation
  • Uses OpenAI GPT + TTS

Open Notebook

Open Notebook:

  • Multi-speaker podcast generation
  • 16+ LLM providers
  • Full-text and vector search
  • AI conversations powered by research

SurfSense (Self-Hosted)

SurfSense:

  • Completely self-hosted
  • 150+ LLMs, 6,000+ embedding models
  • Local TTS via Kokoro
  • ~20 seconds for 3-minute audio

Comparison Table

| Project | LLM Support | TTS Options | Local Option | Complexity |
| --- | --- | --- | --- | --- |
| Podcastfy | 100+ | OpenAI, ElevenLabs, Google | Yes (Ollama) | Low |
| PDF2Audio | OpenAI | OpenAI | No | Low |
| Open Notebook | 16+ | Multiple | Yes | Medium |
| SurfSense | 150+ | Kokoro, Cloud | Fully local | High |


Part 7: Architecture Recommendations

Option A: Cloud-Based (Easiest)

YouTube URL
     │
     ▼
┌──────────┐
│  yt-dlp  │  (local, free)
└────┬─────┘
     │
     ▼
┌──────────────────┐
│ OpenAI Whisper   │  ($0.006/min)
│      API         │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Claude API      │  (~$0.01-0.05/script)
│ Script Generation│
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  ElevenLabs or   │  ($5-22/month)
│  OpenAI TTS      │
└────────┬─────────┘
         │
         ▼
    Podcast MP3

Monthly Cost (10 videos, 30 min avg):

  • Transcription: ~$2
  • Script generation: ~$5-10
  • TTS: ~$15-30
  • Total: $22-42/month
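
As a quick back-of-envelope check of those numbers (assumed usage: 10 videos at 30 minutes each, with ~4,000-word scripts and ~25,000 TTS characters per episode):

```python
# Reproduce the cloud-option transcription estimate from the per-minute rate.
minutes_per_month = 10 * 30
transcription_cost = minutes_per_month * 0.006  # OpenAI Whisper API
print(round(transcription_cost, 2))  # 1.8 -> the "~$2" line above
```
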
Option B: Hybrid (Recommended)

YouTube URL
     │
     ▼
┌──────────┐
│  yt-dlp  │  (local, free)
└────┬─────┘
     │
     ▼
┌──────────────────┐
│ faster-whisper   │  (local, free)
│    (local)       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Claude API      │  (~$5-10/month)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  OpenAI TTS      │  (~$5-15/month)
└────────┬─────────┘
         │
         ▼
    Podcast MP3

Monthly Cost: $10-25/month

Requirements:

  • GPU with 5-10GB VRAM for local Whisper
  • 16GB+ RAM

Option C: Fully Local (Privacy-Focused)

YouTube URL
     │
     ▼
┌──────────┐
│  yt-dlp  │
└────┬─────┘
     │
     ▼
┌──────────────────┐
│ faster-whisper   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Ollama + Llama   │
│    or Mistral    │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Kokoro       │
│  (local TTS)     │
└────────┬─────────┘
         │
         ▼
    Podcast MP3

Monthly Cost: $0 (hardware only)

Requirements:

  • Modern GPU (RTX 3080+ recommended)
  • 32GB+ RAM
  • 7-10GB VRAM minimum
  • More setup complexity


Part 8: Quick Start Guide

Fastest Path: Podcastfy + Cloud APIs

Step 1: Install Dependencies

bash
# Install yt-dlp and ffmpeg
brew install yt-dlp ffmpeg  # macOS
# or: pip install yt-dlp

# Install Podcastfy
pip install podcastfy

Step 2: Set API Keys

bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."  # optional, for Claude
export ELEVENLABS_API_KEY="..."  # optional, for better TTS

Step 3: Create Your First Podcast

python
from podcastfy import generate_podcast

# Generate from YouTube
podcast = generate_podcast(
    urls=["https://www.youtube.com/watch?v=YOUR_VIDEO_ID"],
    tts_model="openai",  # or "elevenlabs"
    conversation_style="casual",
    creativity=0.7,
    output_language="en"
)

print(f"Podcast saved to: {podcast.audio_path}")

Step 4: Customize (Optional)

python
# Custom conversation config
config = {
    "word_count": 4000,  # ~10 min podcast
    "conversation_style": ["casual", "educational"],
    "roles_person1": "curious host who asks clarifying questions",
    "roles_person2": "expert who explains concepts with analogies",
    "dialogue_structure": [
        "Introduction",
        "Main discussion with examples",
        "Key takeaways",
        "Closing thoughts"
    ]
}

podcast = generate_podcast(
    urls=["https://youtube.com/..."],
    conversation_config=config
)

Building Custom Pipeline

For more control, build your own pipeline:

python
import subprocess
from faster_whisper import WhisperModel
from anthropic import Anthropic
from openai import OpenAI

# Step 1: Extract audio
def extract_audio(youtube_url, output_path):
    subprocess.run([
        "yt-dlp", "-x", "--audio-format", "mp3",
        "-o", output_path, youtube_url
    ], check=True)

# Step 2: Transcribe
def transcribe(audio_path):
    model = WhisperModel("large-v3", device="cuda")
    segments, _ = model.transcribe(audio_path)
    return " ".join([s.text for s in segments])

# Step 3: Generate script
def generate_script(transcript):
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8000,
        system="You are a podcast scriptwriter...",
        messages=[{"role": "user", "content": f"Transform this transcript into a podcast dialogue:\n\n{transcript}"}]
    )
    return response.content[0].text

# Step 4: Generate audio
def generate_audio(script, output_path):
    client = OpenAI()
    # Parse script into speaker segments and generate each
    # Then merge audio files
    ...

# Full pipeline
extract_audio("https://youtube.com/...", "audio.mp3")
transcript = transcribe("audio.mp3")
script = generate_script(transcript)
generate_audio(script, "podcast.mp3")


Summary: Technology Stack Decision Matrix

| Need | Recommendation |
| --- | --- |
| Fastest setup | Podcastfy + OpenAI APIs |
| Best quality TTS | ElevenLabs |
| Best value TTS | OpenAI TTS |
| Best dialogue quality | Claude for script generation |
| Complete privacy | faster-whisper + Ollama + Kokoro |
| Lowest cost | Local pipeline (hardware only) |
| Best balance | Hybrid: local transcription + cloud LLM/TTS |

Recommended Starting Stack

  1. Audio Extraction: yt-dlp (free)
  2. Transcription: faster-whisper locally, or the OpenAI Whisper API
  3. Script Generation: Claude API
  4. TTS: OpenAI TTS (budget) or ElevenLabs (quality)
  5. Framework: Start with Podcastfy, customize as needed

Monthly Cost Estimates

| Usage | Cloud | Hybrid | Local |
| --- | --- | --- | --- |
| 5 videos | $10-20 | $5-12 | $0 |
| 15 videos | $30-60 | $15-30 | $0 |
| 30 videos | $60-120 | $30-50 | $0 |

The technology is mature. The tools exist. You can have your own personal NotebookLM-style podcast generator running this weekend.
