Inspiration
Reading a full research paper takes time, and most “AI summaries” either skip citations or make things up. We wanted a way to listen to accurate, engaging conversations about new papers - on the bus, at the gym, anywhere, without losing the fidelity of the original source. NVIDIA NIM gave us fast, reliable model access, so we built an agentic pipeline around it.
What it does
It turns any academic PDF into a polished two-host podcast episode. The system:
- Ingests the PDF and builds a semantic index
- Plans a six-segment episode (Intro → Background → Methods → Results → Discussion → Conclusions)
- Writes a conversational script with citations
- Fact-checks lines against the paper and fixes issues
- Generates natural TTS for two distinct hosts
- Packages everything into an MP3 with chapters, plus a transcript and a short report
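The six-segment plan above can be sketched as a tiny data structure (the function name and fields are illustrative, not our actual API):

```python
# Minimal sketch of the episode planning step; names are illustrative.
SEGMENTS = ["Intro", "Background", "Methods", "Results", "Discussion", "Conclusions"]

def plan_episode(paper_title):
    """Return a six-segment outline; each segment later gets script lines."""
    return [{"segment": s, "paper": paper_title, "lines": []} for s in SEGMENTS]

outline = plan_episode("Some Paper Title")
```

Each downstream agent fills in and checks the `lines` of one segment at a time.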
How we built it
Models (NVIDIA NIM): We use llama-3.1-nemotron-nano-8b-v1 to plan and write the script, and nv-embedqa-e5-v5 to understand the paper and power retrieval.
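NIM exposes an OpenAI-compatible chat endpoint, so a planning call is just a JSON payload. A hedged sketch of how such a request is assembled (the helper and prompt are illustrative; only the model ID comes from our stack):

```python
def build_plan_request(paper_summary):
    """Build an OpenAI-style chat payload for the NIM planning call.
    The prompt text here is a stand-in, not our production prompt."""
    return {
        "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
        "messages": [
            {"role": "system", "content": "Plan a six-segment podcast episode."},
            {"role": "user", "content": paper_summary},
        ],
        "temperature": 0.6,
    }
```

The same shape, with `nv-embedqa-e5-v5` as the model, is used for the embedding calls that power retrieval.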
Agentic workflow: Think of it as a relay team—Planning Agent → Content Agent → Verification Agent → Audio Agent. If something’s off, the loop sends it back for fixes before moving on.
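The relay with its repair loop looks roughly like this (stub agents with a stand-in check, purely illustrative):

```python
def content_agent(plan):
    # Draft one script line per planned segment (stand-in for the real writer).
    return [f"HOST A: Let's talk about {seg}." for seg in plan]

def verification_agent(script, max_len=80):
    # Return indices of lines that fail a check (length is a stand-in
    # for the real citation/fact check).
    return [i for i, line in enumerate(script) if len(line) > max_len]

def run_relay(plan, max_rounds=3):
    """Draft, verify, and loop back for fixes before moving on."""
    script = content_agent(plan)
    for _ in range(max_rounds):
        bad = verification_agent(script)
        if not bad:
            break
        for i in bad:
            script[i] = script[i][:77] + "..."  # stand-in "fix"
    return script
```

The real Verification Agent checks claims against the paper; only lines that pass move on to the Audio Agent.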
RAG: We keep two indexes—one focused on facts, the other on style—so the episode stays accurate and sounds natural.
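A minimal sketch of dual-index retrieval, with keyword overlap standing in for the embedding similarity the real system computes:

```python
def score(query, doc):
    # Keyword-overlap stand-in for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, fact_index, style_index, k=1):
    """Pull top-k passages from each index separately, so grounding
    and tone can be injected into the prompt as distinct context."""
    def top(index):
        return sorted(index, key=lambda doc: score(query, doc), reverse=True)[:k]
    return {"facts": top(fact_index), "style": top(style_index)}
```

Keeping the two indexes separate means a stylistic passage can never crowd out a factual one in the context window, and vice versa.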
Audio: Two distinct voices, clean chapter markers, even volume, and proper metadata so it feels like a real show.
Cloud: It runs on AWS EKS behind a public load balancer, with Docker images in ECR, files in S3, and secrets in AWS Secrets Manager. We keep things simple with one Gunicorn worker (shared in-memory queue), and if ffmpeg ever refuses to stitch audio, we fall back to a NumPy combiner so the MP3 still finishes.
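The NumPy fallback combiner is roughly this shape (simplified: the real path writes an MP3, here we just concatenate int16 PCM):

```python
import numpy as np

def combine_segments(segments):
    """Concatenate PCM segments and clip to the int16 range.
    Fallback path used when ffmpeg refuses to stitch the audio."""
    joined = np.concatenate([np.asarray(s, dtype=np.int32) for s in segments])
    return np.clip(joined, -32768, 32767).astype(np.int16)
```

It loses ffmpeg's niceties (resampling, crossfades), but it guarantees the episode still finishes.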
Challenges we ran into
PDF extraction quality: Messy layouts broke context windows. We improved parsing and chunking to preserve headings, captions, and tables.
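A simplified version of the heading-aware chunking (the real parser also keeps captions and tables with their sections; the ALL-CAPS heuristic is a stand-in for real layout cues):

```python
def chunk_by_heading(lines):
    """Group extracted lines into chunks, starting a new chunk at each heading.
    A 'heading' here is any short ALL-CAPS line -- an illustrative heuristic."""
    chunks, current = [], []
    for line in lines:
        if line.isupper() and len(line) < 60 and current:
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Chunking on section boundaries keeps each retrieval unit self-contained, so a Methods detail never bleeds into a Results chunk.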
Audio continuity: Some TTS segments clipped or felt “robotic.” We tuned pacing and normalized loudness across segments.
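Normalizing loudness across segments boils down to matching RMS levels; a simplified sketch, assuming float samples in [-1, 1]:

```python
import math

def normalize_rms(samples, target_rms=0.1):
    """Scale a segment so its RMS matches the target level.
    Silent input is returned unchanged to avoid dividing by zero."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]
```

Applying the same target to every segment keeps hosts at a consistent level across TTS calls that drift in output volume.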
Accomplishments that we're proud of
- A working, end-to-end research-to-podcast pipeline with live UI and APIs.
- Two realistic hosts with smooth pacing and chapter markers.
What we learned
- Dual-index RAG helps keep tone human while staying faithful to the paper.
- TTS quality isn’t just voices—it’s timing, phrasing, and post-processing.
What's next for Podcast Episode Generation from Research
We plan to integrate richer voices with more natural pauses and expressive emphasis, and to add multi-paper episodes that compare and debate results.