Building a YouTube-to-Podcast Platform: Technical Research
Google's NotebookLM Audio Overview feature took the world by storm — upload documents, get a natural two-host podcast discussing your content. But what if you want this capability for YouTube videos? And what if you want to run it yourself, with your own voice profiles and complete control?
This research covers the complete technical stack for building a personal YouTube-to-Podcast platform: extracting audio, transcribing speech, generating engaging scripts, and synthesizing natural multi-speaker audio.
System Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ YouTube to Podcast Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ YouTube │───▶│ Audio │───▶│ Transcript│───▶│ Script │ │
│ │ URL │ │Extraction│ │ (STT) │ │Generation│ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Podcast │ │
│ │ Audio │ │
│ │ (TTS) │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

The pipeline consists of four main stages:
- Audio Extraction — Download audio from YouTube
- Speech-to-Text (STT) — Convert audio to transcript
- Script Generation — Transform transcript into podcast dialogue
- Text-to-Speech (TTS) — Synthesize natural multi-speaker audio
Part 1: Audio Extraction from YouTube
yt-dlp: The Standard Tool
yt-dlp is the de facto standard for extracting audio from YouTube. It's a feature-rich fork of youtube-dl with faster adaptation to YouTube's changes.
Installation:
```bash
# macOS
brew install yt-dlp ffmpeg

# pip
pip install yt-dlp

# Windows
winget install yt-dlp
```

Basic Usage:
```bash
# Extract audio only (best quality)
yt-dlp -x "https://www.youtube.com/watch?v=VIDEO_ID"

# Convert to MP3
yt-dlp -x --audio-format mp3 "URL"

# Best-quality M4A with metadata and thumbnail (quote the format selector
# so the shell doesn't treat the brackets as a glob pattern)
yt-dlp -x -f "bestaudio[ext=m4a]" --add-metadata --embed-thumbnail "URL"

# Download to a specific directory
yt-dlp -x -o "~/podcasts/%(title)s.%(ext)s" "URL"
```

Requirements: FFmpeg must be installed for format conversion.
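The same flags can be driven from Python via subprocess; a minimal sketch (the helper names and defaults are illustrative, not part of yt-dlp itself):

```python
import subprocess

def build_download_cmd(url: str, out_template: str = "%(title)s.%(ext)s",
                       audio_format: str = "mp3") -> list[str]:
    """Build the yt-dlp argv for audio-only extraction (mirrors the flags above)."""
    return [
        "yt-dlp",
        "-x",                            # extract audio only
        "--audio-format", audio_format,  # convert via FFmpeg
        "--add-metadata",
        "-o", out_template,
        url,
    ]

def download_audio(url: str, **kwargs) -> None:
    # check=True raises CalledProcessError if yt-dlp exits non-zero
    subprocess.run(build_download_cmd(url, **kwargs), check=True)
```

Calling `download_audio("https://www.youtube.com/watch?v=VIDEO_ID")` then drops an MP3 in the working directory.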
Legal Considerations
| Aspect | Status |
|---|---|
| YouTube ToS | Explicitly prohibits third-party downloads |
| Fair Use | Case-by-case evaluation; not blanket permission |
| Personal Use | Lower legal risk, but technically violates ToS |
| Consequences | Account termination possible; DMCA issues if redistributing |
Recommendation: For personal use and learning purposes, the risk is minimal. Avoid redistributing copyrighted content.
Part 2: Speech-to-Text (Transcription)
Service Comparison
| Service | Price/min | Accuracy | Languages | Best For |
|---|---|---|---|---|
| OpenAI Whisper API | $0.006 | 95-99% | 100+ | Best value |
| Whisper (Local) | Free | 95-99% | 100+ | Privacy, high volume |
| faster-whisper | Free | 95-99% | 100+ | 4x faster than Whisper |
| Deepgram | $0.0043 | ~30% lower WER (vendor claim) | 36 | Real-time |
| AssemblyAI | $0.0025 | 85-92% | 20+ | Add-on features |
| Google Cloud STT | $0.024 | High | 125+ | Google ecosystem |
Cost for 1,000 Minutes
| Service | Cost |
|---|---|
| OpenAI Whisper API | $6 |
| Deepgram | ~$4.30 |
| Local Whisper | Free (hardware only) |
| Google Cloud | $24 |
| AssemblyAI (base) | ~$2.50 |
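The figures above follow directly from the per-minute rates; a quick sanity check (rates copied from the comparison table):

```python
# USD per audio minute, from the service comparison table above
RATES_PER_MIN = {
    "openai_whisper_api": 0.006,
    "deepgram": 0.0043,
    "google_cloud": 0.024,
}

def transcription_cost(minutes: int, service: str) -> float:
    """Estimated cost in USD for transcribing `minutes` of audio."""
    return round(RATES_PER_MIN[service] * minutes, 2)

print(transcription_cost(1000, "openai_whisper_api"))  # → 6.0
print(transcription_cost(1000, "google_cloud"))        # → 24.0
```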
Local Whisper Setup
faster-whisper is the recommended local solution — 4x faster than standard Whisper with lower memory usage.
```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

# float16 on GPU; without CUDA, use device="cpu", compute_type="int8"
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

Hardware Requirements:
| Model | VRAM | Speed (RTX 3090) |
|---|---|---|
| tiny | ~1GB | ~50× real-time |
| base | ~1GB | ~30× real-time |
| small | ~2GB | ~15× real-time |
| medium | ~5GB | ~8× real-time |
| large-v3 | ~10GB | ~4× real-time |
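The VRAM column suggests a simple selection rule; an illustrative helper (thresholds taken from the table above, helper name is ours):

```python
# (approx. VRAM needed in GB, model name), largest first, per the table above
MODELS = [(10, "large-v3"), (5, "medium"), (2, "small"), (1, "base")]

def pick_whisper_model(vram_gb: float) -> str:
    """Return the largest Whisper model that fits the available VRAM."""
    for needed, name in MODELS:
        if vram_gb >= needed:
            return name
    return "tiny"

print(pick_whisper_model(12))   # → large-v3
print(pick_whisper_model(6))    # → medium
print(pick_whisper_model(0.5))  # → tiny
```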
OpenAI Whisper API
```python
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(transcript.text)
```

Advantages:
- No hardware requirements
- Automatic language detection
- High accuracy across accents
- $0.006/minute is very affordable
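One practical caveat: the Whisper API rejects uploads over 25 MB, so long videos must be split first. A sketch using FFmpeg's segment muxer (the 10-minute default is a judgment call that keeps typical MP3s well under the limit; helper names are ours):

```python
import subprocess

def split_audio_cmd(path: str, chunk_seconds: int = 600) -> list[str]:
    """Build an ffmpeg argv that splits audio into fixed-length segments."""
    return [
        "ffmpeg", "-i", path,
        "-f", "segment",                      # segment muxer
        "-segment_time", str(chunk_seconds),  # seconds per chunk
        "-c", "copy",                         # no re-encode, just split
        "chunk_%03d.mp3",
    ]

def split_audio(path: str) -> None:
    subprocess.run(split_audio_cmd(path), check=True)
```

Each resulting `chunk_NNN.mp3` is then transcribed separately and the texts concatenated in order.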
Part 3: NotebookLM Analysis — What Makes It Work?
How NotebookLM Audio Overview Works
Google's NotebookLM uses:
- Gemini 1.5 Pro for content understanding and script generation
- SoundStorm (likely) for realistic audio synthesis
- RAG system for processing up to 50 documents
What Makes It Feel Natural
- Two distinct AI hosts with different personalities
- Thematic connections — linking ideas across the content like real podcasters
- Natural conversation flow — interruptions, agreements, clarifications
- Casual language — "Oh, that's interesting!" "Right, and..."
- Explaining jargon — making technical content accessible
Generation Parameters
| Aspect | Value |
|---|---|
| Generation time | ~2-5 minutes |
| Output length | 6-30 minutes (typically ~10) |
| Daily limits (free) | 3 per day |
| Daily limits (Plus) | 20 per day |
| Languages | 80+ for generation |
Limitations to Consider
- Cannot make minor edits — must regenerate entire audio
- May contain inaccuracies
- Audio glitches possible
- Quality varies by source material
Part 4: Script Generation with LLMs
The Art of Podcast Script Generation
Transforming a transcript into an engaging podcast requires more than summarization — it needs conversation design.
Recommended LLMs
| LLM | Strength | Best For |
|---|---|---|
| Claude | Character development, natural dialogue, "show don't tell" | Two-host conversations |
| GPT-4 | Dynamic writing, warmer tone, versatility | Creative storytelling |
| Gemini | Long context, research accuracy | Factual synthesis |
Recommendation: Claude excels at dialogue and character work. GPT-4 offers versatility. Many pipelines use both.
Prompt Engineering for Natural Dialogue
System Prompt Example:
```text
You are a podcast scriptwriter creating engaging two-host dialogue.

HOST PROFILES:
- Alex: Curious, asks clarifying questions, uses analogies
- Sam: Expert, explains concepts, shares insights

STYLE GUIDELINES:
- Use short sentences suitable for speech synthesis
- Include natural filler words: "uh", "well", "you know"
- Add reactions: "Oh interesting!", "Right, that makes sense"
- Create back-and-forth flow, not monologues
- Explain jargon in conversational terms
- Target 3,000-5,000 words for a 10-minute podcast

OUTPUT FORMAT:
ALEX: [dialogue]
SAM: [dialogue]
```

User Prompt Example:
```text
Transform the following transcript into an engaging podcast conversation.

Key requirements:
1. Cover the main ideas, not every detail
2. Make it accessible to a general audience
3. Include at least 3 "aha moments" where a concept clicks
4. End with actionable takeaways

TRANSCRIPT:
[Your transcript here]
```

Advanced Techniques
1. Scratchpad Method (from Together.ai):
```text
First, use a <scratchpad> to brainstorm:
- Key themes to cover
- Interesting angles
- Potential questions listeners might have
- Natural transitions between topics

Then generate the dialogue.
```

2. Chunking for Long Content:
For transcripts over 10,000 words:
- Break into thematic chunks
- Generate dialogue for each chunk
- Add contextual linking between chunks
- Merge with transition phrases
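The split step above can be sketched as a word-based splitter with overlap, so adjacent chunks share context (sizes are illustrative; a real pipeline would split on topic boundaries rather than fixed word counts):

```python
def chunk_transcript(text: str, chunk_words: int = 2000, overlap: int = 100) -> list[str]:
    """Split text into overlapping word chunks so adjacent chunks share context."""
    words = text.split()
    if len(words) <= chunk_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_words]))
        start += chunk_words - overlap  # step back by `overlap` words for continuity
    return chunks
```

Each chunk is then sent to the LLM with a short summary of the previous chunk prepended, and the generated dialogues are merged with transition phrases.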
3. Emotional Markers:
```text
SAM: [excited] Oh, this is where it gets really interesting!
ALEX: [curious] Wait, so you're saying...
SAM: [thoughtful pause] Exactly. And here's why that matters...
```

These markers help TTS systems add appropriate intonation.
Part 5: Text-to-Speech (TTS) for Podcast Audio
TTS Service Comparison
| Provider | Price/1M chars | Voice Cloning | Multi-Speaker | Latency | Quality |
|---|---|---|---|---|---|
| ElevenLabs | $165-180 | Yes (instant + pro) | Yes (v3) | 75ms | Excellent |
| OpenAI TTS | $15-30 | No | Limited | ~200ms | Good |
| Google Cloud | $16 | No | Yes | Medium | Good |
| Azure Speech | $15 | Yes | Yes | Medium | Good |
| Play.ht | $39.60 | Yes | Yes | Low | Good |
| Kokoro (Local) | Free | No | Yes | ~20s/3min | Good |
ElevenLabs: The Quality Leader
Pricing Tiers:
- Free: 10,000 chars/month
- Starter ($5/mo): 30,000 chars
- Creator ($22/mo): 100,000 chars
- Pro ($99/mo): 500,000 chars
Voice Cloning:
- Instant Clone: 1-5 minutes of sample audio
- Professional Clone: 30 min minimum, 3 hours optimal
Multi-Speaker Dialogue (v3):
```python
# NOTE: illustrative only. The elevenlabs SDK surface and the multi-speaker
# dialogue syntax change between versions; check the current ElevenLabs docs.
from elevenlabs import generate, Voice

# Generate dialogue with multiple speakers
script = """
<speaker name="Alex">Hey, welcome to the show!</speaker>
<speaker name="Sam">Thanks for having me. Let's dive in.</speaker>
"""

audio = generate(
    text=script,
    voice=Voice(voice_id="multi_speaker"),
    model="eleven_turbo_v3",
)
```

OpenAI TTS: Best Value
Pricing:
- tts-1: $15/1M characters
- tts-1-hd: $30/1M characters
Usage:
```python
from openai import OpenAI

client = OpenAI()

# Available voices: alloy, echo, fable, onyx, nova, shimmer
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Hello, welcome to the podcast!",
)
response.stream_to_file("output.mp3")
```

Multi-Speaker Approach: Generate each speaker's lines separately, then merge the audio files.
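The merge-by-speaker approach can be sketched as follows: assign each script speaker an OpenAI voice, then coalesce consecutive lines by the same speaker into one synthesis job so each API call carries as much contiguous text as possible (the voice mapping and function name are illustrative choices):

```python
# Map script speakers to OpenAI voice names (the mapping itself is a free choice)
VOICE_MAP = {"ALEX": "nova", "SAM": "onyx"}

def batch_by_speaker(lines: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive (speaker, text) lines into per-voice synthesis jobs."""
    jobs: list[tuple[str, str]] = []
    for speaker, text in lines:
        voice = VOICE_MAP[speaker]
        if jobs and jobs[-1][0] == voice:
            # same voice as previous job: extend it instead of a new API call
            jobs[-1] = (voice, jobs[-1][1] + " " + text)
        else:
            jobs.append((voice, text))
    return jobs

print(batch_by_speaker([("ALEX", "Hi!"), ("ALEX", "Welcome."), ("SAM", "Thanks.")]))
# → [('nova', 'Hi! Welcome.'), ('onyx', 'Thanks.')]
```

Each job then goes through `client.audio.speech.create(...)`, and the resulting clips are concatenated in order (e.g. with FFmpeg's concat demuxer).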
Local TTS: Kokoro
For complete privacy, Kokoro offers local TTS:
- ~20 seconds to generate 3 minutes of audio
- Multiple voice options
- No API costs
- Requires decent GPU
Making TTS Sound Natural
1. Add Filler Words in Script:
"Well, uh, that's a great point. You know, I hadn't thought about it that way."

2. Use SSML for Control:
```xml
<speak>
  <prosody rate="medium" pitch="+5%">
    That's really interesting!
  </prosody>
  <break time="500ms"/>
  <prosody rate="slow">
    Let me think about that...
  </prosody>
</speak>
```

3. Emotional Markers:
```text
[laughs] Oh, that's a good one!
[thoughtful] Hmm, that's a tricky question...
[excited] Yes! Exactly!
```

Part 6: Open Source Alternatives
Podcastfy (Recommended Starting Point)
Podcastfy is the most feature-complete open-source alternative to NotebookLM's Audio Overview.
Features:
- 100+ LLM models (OpenAI, Anthropic, Google, local via Ollama)
- Multiple TTS integrations (OpenAI, Google, ElevenLabs, Edge)
- Multi-language support (156+ HuggingFace models)
- Shortform (2-5 min) and longform (30+ min)
- "Content Chunking with Contextual Linking" for long content
Installation:
```bash
pip install podcastfy

# Set API keys
export OPENAI_API_KEY="your-key"
export ELEVENLABS_API_KEY="your-key"  # optional
```

Usage:
```python
from podcastfy.client import generate_podcast

# From a YouTube URL; generate_podcast returns the path to the generated audio
audio_file = generate_podcast(
    urls=["https://youtube.com/watch?v=..."],
    tts_model="openai",
    conversation_style="casual",
    output_language="en",
)
```

PDF2Audio (MIT)
PDF2Audio from MIT researchers:
- Multiple PDF support
- Template selection (podcasts, lectures, summaries)
- Draft transcript editing before generation
- Uses OpenAI GPT + TTS
Open Notebook
- Multi-speaker podcast generation
- 16+ LLM providers
- Full-text and vector search
- AI conversations powered by research
SurfSense (Self-Hosted)
- Completely self-hosted
- 150+ LLMs, 6,000+ embedding models
- Local TTS via Kokoro
- ~20 seconds for 3-minute audio
Comparison Table
| Project | LLM Support | TTS Options | Local Option | Complexity |
|---|---|---|---|---|
| Podcastfy | 100+ | OpenAI, ElevenLabs, Google | Yes (Ollama) | Low |
| PDF2Audio | OpenAI | OpenAI | No | Low |
| Open Notebook | 16+ | Multiple | Yes | Medium |
| SurfSense | 150+ | Kokoro, Cloud | Fully local | High |
Part 7: Architecture Recommendations
Option A: Cloud-Based (Easiest)
YouTube URL
│
▼
┌──────────┐
│ yt-dlp │ (local, free)
└────┬─────┘
│
▼
┌──────────────────┐
│ OpenAI Whisper │ ($0.006/min)
│ API │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Claude API │ (~$0.01-0.05/script)
│ Script Generation│
└────────┬─────────┘
│
▼
┌──────────────────┐
│ ElevenLabs or │ ($5-22/month)
│ OpenAI TTS │
└────────┬─────────┘
│
▼
Podcast MP3

Monthly Cost (10 videos, 30 min avg):
- Transcription: ~$2
- Script generation: ~$5-10
- TTS: ~$15-30
- Total: $22-42/month
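The Option A figures are plain arithmetic; a quick reproduction (the script-generation and TTS numbers are the quoted ranges above, not measured values):

```python
def option_a_cost(videos: int = 10, minutes_each: int = 30) -> dict[str, float]:
    """Reproduce the Option A monthly estimate from the per-unit rates above."""
    total_min = videos * minutes_each
    transcription = total_min * 0.006   # OpenAI Whisper API rate
    script_lo, script_hi = 5.0, 10.0    # Claude script generation, quoted range
    tts_lo, tts_hi = 15.0, 30.0         # TTS, quoted range
    return {
        "transcription": round(transcription, 2),
        "total_low": round(transcription + script_lo + tts_lo, 2),
        "total_high": round(transcription + script_hi + tts_hi, 2),
    }

print(option_a_cost())
# → {'transcription': 1.8, 'total_low': 21.8, 'total_high': 41.8}
```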
Option B: Hybrid (Recommended)
YouTube URL
│
▼
┌──────────┐
│ yt-dlp │ (local, free)
└────┬─────┘
│
▼
┌──────────────────┐
│ faster-whisper │ (local, free)
│ (local) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Claude API │ (~$5-10/month)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ OpenAI TTS │ (~$5-15/month)
└────────┬─────────┘
│
▼
Podcast MP3

Monthly Cost: $10-25/month
Requirements:
- GPU with 5-10GB VRAM for local Whisper
- 16GB+ RAM
Option C: Fully Local (Privacy-Focused)
YouTube URL
│
▼
┌──────────┐
│ yt-dlp │
└────┬─────┘
│
▼
┌──────────────────┐
│ faster-whisper │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Ollama + Llama │
│ or Mistral │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Kokoro │
│ (local TTS) │
└────────┬─────────┘
│
▼
Podcast MP3

Monthly Cost: $0 (hardware only)
Requirements:
- Modern GPU (RTX 3080+ recommended)
- 32GB+ RAM
- 7-10GB VRAM minimum
- More setup complexity
Part 8: Quick Start Guide
Fastest Path: Podcastfy + Cloud APIs
Step 1: Install Dependencies
```bash
# Install yt-dlp and ffmpeg
brew install yt-dlp ffmpeg  # macOS
# or: pip install yt-dlp

# Install Podcastfy
pip install podcastfy
```

Step 2: Set API Keys
```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."  # optional, for Claude
export ELEVENLABS_API_KEY="..."        # optional, for better TTS
```

Step 3: Create Your First Podcast
```python
from podcastfy.client import generate_podcast

# Generate from YouTube; generate_podcast returns the path to the audio file
audio_file = generate_podcast(
    urls=["https://www.youtube.com/watch?v=YOUR_VIDEO_ID"],
    tts_model="openai",  # or "elevenlabs"
    conversation_style="casual",
    creativity=0.7,
    output_language="en",
)
print(f"Podcast saved to: {audio_file}")
```

Step 4: Customize (Optional)
```python
from podcastfy.client import generate_podcast

# Custom conversation config
config = {
    "word_count": 4000,  # ~10 min podcast
    "conversation_style": ["casual", "educational"],
    "roles_person1": "curious host who asks clarifying questions",
    "roles_person2": "expert who explains concepts with analogies",
    "dialogue_structure": [
        "Introduction",
        "Main discussion with examples",
        "Key takeaways",
        "Closing thoughts",
    ],
}

podcast = generate_podcast(
    urls=["https://youtube.com/..."],
    conversation_config=config,
)
```

Building Custom Pipeline
For more control, build your own pipeline:
```python
import subprocess

from anthropic import Anthropic
from faster_whisper import WhisperModel
from openai import OpenAI

# Step 1: Extract audio
def extract_audio(youtube_url, output_path):
    subprocess.run([
        "yt-dlp", "-x", "--audio-format", "mp3",
        "-o", output_path, youtube_url,
    ], check=True)

# Step 2: Transcribe
def transcribe(audio_path):
    model = WhisperModel("large-v3", device="cuda")
    segments, _ = model.transcribe(audio_path)
    return " ".join(s.text for s in segments)

# Step 3: Generate script
def generate_script(transcript):
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8000,
        system="You are a podcast scriptwriter...",
        messages=[{
            "role": "user",
            "content": f"Transform this transcript into a podcast dialogue:\n\n{transcript}",
        }],
    )
    return response.content[0].text

# Step 4: Generate audio
def generate_audio(script, output_path):
    client = OpenAI()
    # Parse script into speaker segments, synthesize each, then merge audio files
    ...

# Full pipeline
extract_audio("https://youtube.com/...", "audio.mp3")
transcript = transcribe("audio.mp3")
script = generate_script(transcript)
generate_audio(script, "podcast.mp3")
```

Summary: Technology Stack Decision Matrix
| Need | Recommendation |
|---|---|
| Fastest setup | Podcastfy + OpenAI APIs |
| Best quality TTS | ElevenLabs |
| Best value TTS | OpenAI TTS |
| Best dialogue quality | Claude for script generation |
| Complete privacy | faster-whisper + Ollama + Kokoro |
| Lowest cost | Local pipeline (hardware only) |
| Best balance | Hybrid: local transcription + cloud LLM/TTS |
Recommended Starting Stack
- Audio Extraction: yt-dlp (free)
- Transcription: faster-whisper local OR OpenAI API
- Script Generation: Claude API
- TTS: OpenAI TTS (budget) or ElevenLabs (quality)
- Framework: Start with Podcastfy, customize as needed
Monthly Cost Estimates
| Usage | Cloud | Hybrid | Local |
|---|---|---|---|
| 5 videos | $10-20 | $5-12 | $0 |
| 15 videos | $30-60 | $15-30 | $0 |
| 30 videos | $60-120 | $30-50 | $0 |
The technology is mature. The tools exist. You can have your own personal NotebookLM-style podcast generator running this weekend.