Inspiration
Reading a full research paper takes time, and most “AI summaries” either skip citations or make things up. We wanted a way to listen to accurate, engaging conversations about new papers - on the bus, at the gym, anywhere, without losing the fidelity of the original source. NVIDIA NIM gave us fast, reliable model access, so we built an agentic pipeline around it.
What it does
It turns any academic PDF into a polished two-host podcast episode. The system:
- Ingests the PDF and builds a semantic index
- Plans a six-segment episode (Intro → Background → Methods → Results → Discussion → Conclusions)
- Writes a conversational script with citations
- Fact-checks lines against the paper and fixes issues
- Generates natural TTS for two distinct hosts
- Packages everything into an MP3 with chapters, plus a transcript and a short report
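The six-segment plan above can be sketched as a tiny data structure (the function name and fields are illustrative, not our actual API):

```python
# Minimal sketch of the episode planning step; names are illustrative.
SEGMENTS = ["Intro", "Background", "Methods", "Results", "Discussion", "Conclusions"]

def plan_episode(paper_title):
    """Return a six-segment outline; each segment later gets script lines."""
    return [{"segment": s, "paper": paper_title, "lines": []} for s in SEGMENTS]

outline = plan_episode("Some Paper Title")
```

Each downstream agent fills in and checks the `lines` of one segment at a time.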
How we built it
Models (NVIDIA NIM): We use llama-3.1-nemotron-nano-8b-v1 to plan and write the script, and nv-embedqa-e5-v5 to understand the paper and power retrieval.
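NIM exposes an OpenAI-compatible chat endpoint, so a planning call is just a JSON payload. A hedged sketch of how such a request is assembled (the helper and prompt are illustrative; only the model ID comes from our stack):

```python
def build_plan_request(paper_summary):
    """Build an OpenAI-style chat payload for the NIM planning call.
    The prompt text here is a stand-in, not our production prompt."""
    return {
        "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
        "messages": [
            {"role": "system", "content": "Plan a six-segment podcast episode."},
            {"role": "user", "content": paper_summary},
        ],
        "temperature": 0.6,
    }
```

The same shape, with `nv-embedqa-e5-v5` as the model, is used for the embedding calls that power retrieval.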
Agentic workflow: Think of it as a relay team—Planning Agent → Content Agent → Verification Agent → Audio Agent. If something’s off, the loop sends it back for fixes before moving on.
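The relay with its repair loop looks roughly like this (stub agents with a stand-in check, purely illustrative):

```python
def content_agent(plan):
    # Draft one script line per planned segment (stand-in for the real writer).
    return [f"HOST A: Let's talk about {seg}." for seg in plan]

def verification_agent(script, max_len=80):
    # Return indices of lines that fail a check (length is a stand-in
    # for the real citation/fact check).
    return [i for i, line in enumerate(script) if len(line) > max_len]

def run_relay(plan, max_rounds=3):
    """Draft, verify, and loop back for fixes before moving on."""
    script = content_agent(plan)
    for _ in range(max_rounds):
        bad = verification_agent(script)
        if not bad:
            break
        for i in bad:
            script[i] = script[i][:77] + "..."  # stand-in "fix"
    return script
```

The real Verification Agent checks claims against the paper; only lines that pass move on to the Audio Agent.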
RAG: We keep two indexes—one focused on facts, the other on style—so the episode stays accurate and sounds natural.
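A minimal sketch of dual-index retrieval, with keyword overlap standing in for the embedding similarity the real system computes:

```python
def score(query, doc):
    # Keyword-overlap stand-in for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, fact_index, style_index, k=1):
    """Pull top-k passages from each index separately, so grounding
    and tone can be injected into the prompt as distinct context."""
    def top(index):
        return sorted(index, key=lambda doc: score(query, doc), reverse=True)[:k]
    return {"facts": top(fact_index), "style": top(style_index)}
```

Keeping the two indexes separate means a stylistic passage can never crowd out a factual one in the context window, and vice versa.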
Audio: Two distinct voices, clean chapter markers, even volume, and proper metadata so it feels like a real show.
Cloud: It runs on AWS EKS behind a public load balancer, with Docker images in ECR, files in S3, and secrets in AWS Secrets Manager. We keep things simple with one Gunicorn worker (shared in-memory queue), and if ffmpeg ever refuses to stitch audio, we fall back to a NumPy combiner so the MP3 still finishes.
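The NumPy fallback combiner is roughly this shape (simplified: the real path writes an MP3, here we just concatenate int16 PCM):

```python
import numpy as np

def combine_segments(segments):
    """Concatenate PCM segments and clip to the int16 range.
    Fallback path used when ffmpeg refuses to stitch the audio."""
    joined = np.concatenate([np.asarray(s, dtype=np.int32) for s in segments])
    return np.clip(joined, -32768, 32767).astype(np.int16)
```

It loses ffmpeg's niceties (resampling, crossfades), but it guarantees the episode still finishes.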
Challenges we ran into
PDF extraction quality: Messy layouts broke context windows. We improved parsing and chunking to preserve headings, captions, and tables.
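A simplified version of the heading-aware chunking (the real parser also keeps captions and tables with their sections; the ALL-CAPS heuristic is a stand-in for real layout cues):

```python
def chunk_by_heading(lines):
    """Group extracted lines into chunks, starting a new chunk at each heading.
    A 'heading' here is any short ALL-CAPS line -- an illustrative heuristic."""
    chunks, current = [], []
    for line in lines:
        if line.isupper() and len(line) < 60 and current:
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Chunking on section boundaries keeps each retrieval unit self-contained, so a Methods detail never bleeds into a Results chunk.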
Audio continuity: Some TTS segments clipped or felt “robotic.” We tuned pacing and normalized loudness across segments.
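Normalizing loudness across segments boils down to matching RMS levels; a simplified sketch, assuming float samples in [-1, 1]:

```python
import math

def normalize_rms(samples, target_rms=0.1):
    """Scale a segment so its RMS matches the target level.
    Silent input is returned unchanged to avoid dividing by zero."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]
```

Applying the same target to every segment keeps hosts at a consistent level across TTS calls that drift in output volume.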
Accomplishments that we're proud of
- A working, end-to-end research-to-podcast pipeline with live UI and APIs.
- Two realistic hosts with smooth pacing and chapter markers.
What we learned
- Dual-index RAG helps keep tone human while staying faithful to the paper.
- TTS quality isn’t just voices—it’s timing, phrasing, and post-processing.
What's next for Podcast Episode Generation from Research
We plan to integrate richer voices with more natural pauses and expressive emphasis, and to add multi-paper episodes that compare and debate results.