Inspiration

We are big fans of documentary-style storytelling: channels like Fern make history, culture, and science deeply engaging. We initially wanted to build a tool to democratize this kind of high-quality video production, but we quickly realized that video generation is computationally heavy and complex.

We pivoted to the next best medium: audio. We set out to build an intelligent "Research Agent" and "Podcast Host" in one, capable of turning a simple topic into a fully produced, conversational deep dive.

What it does

Recapsule is an AI-powered podcast generator that transforms a single keyword into a highly engaging, multi-speaker audio episode.

It acts as an automated historian and producer:

  1. Researches complex topics to ensure accuracy.
  2. Writes a natural, conversational script between two distinct personalities.
  3. Produces a studio-quality audio file with distinct voices and pacing.

How we built it

We built a modern full-stack application leveraging an agentic workflow to handle the pipeline from research to audio synthesis.

The Tech Stack

  • Backend: FastAPI (Python) using async/await patterns for non-blocking tasks.
  • Frontend: React (Vite) for a responsive Single Page Application (SPA).
  • AI Logic: Google Gemini (Research & Scripting) + ElevenLabs (Voice Synthesis).
  • Infrastructure: MongoDB (Data persistence) + Google Cloud Storage (Audio hosting).
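The async/await backend pattern can be sketched with plain asyncio. This is an illustrative stand-in, not our actual endpoint code: the function and speaker names are hypothetical, and `asyncio.sleep` stands in for real API latency. The key idea is that per-speaker synthesis calls don't block each other.

```python
import asyncio

async def synthesize(speaker: str, line: str) -> str:
    """Hypothetical stand-in for a non-blocking ElevenLabs TTS call."""
    await asyncio.sleep(0.01)  # simulates network I/O without blocking the event loop
    return f"<audio:{speaker}:{line}>"

async def produce(script: list[tuple[str, str]]) -> list[str]:
    """Fire all per-turn TTS calls concurrently; gather preserves turn order."""
    tasks = [synthesize(speaker, line) for speaker, line in script]
    return await asyncio.gather(*tasks)

segments = asyncio.run(produce([
    ("Ava", "Welcome back!"),
    ("Ben", "Today: the Silk Road."),
]))
```

Because FastAPI runs on an asyncio event loop, the same pattern lets one worker overlap many slow upstream calls instead of handling them one at a time.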

The Agentic Workflow

  1. Ingestion: The user submits a topic via the React frontend.
  2. Deep Research: The backend triggers Gemini, which browses the web to compile factual information and citations, mitigating hallucinations.
  3. Scripting: A second Gemini instance converts the research notes into a dynamic dialogue script between two hosts.
  4. Voice Synthesis: We pipe the script into ElevenLabs, generating distinct audio streams for each character.
  5. Audio Engineering: We use pydub and FFmpeg to stitch the audio segments together, inserting natural pauses and timing to mimic real conversation.
  6. Delivery: The final MP3 is stored in Google Cloud Storage (GCS) and streamed back to the user's browser.
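The stitching step (5) can be illustrated with a minimal stdlib sketch. Our real pipeline uses pydub and FFmpeg on MP3 streams; this simplified version works on uncompressed WAV blobs with Python's built-in `wave` module, but the idea is the same: concatenate each speaker's segment and insert a short silence between turns to mimic conversational pacing.

```python
import io
import wave

def stitch_segments(wav_blobs: list[bytes], pause_ms: int = 400) -> bytes:
    """Concatenate WAV byte blobs into one file, inserting a silent gap
    between consecutive turns. All blobs must share the same format."""
    out_buf = io.BytesIO()
    out = wave.open(out_buf, "wb")
    params = None
    for i, blob in enumerate(wav_blobs):
        with wave.open(io.BytesIO(blob), "rb") as seg:
            if params is None:
                params = seg.getparams()
                out.setparams(params)
            if i > 0:
                # PCM silence: zero bytes for pause_ms worth of frames
                n_frames = int(params.framerate * pause_ms / 1000)
                out.writeframes(b"\x00" * (n_frames * params.sampwidth * params.nchannels))
            out.writeframes(seg.readframes(seg.getnframes()))
    out.close()
    return out_buf.getvalue()
```

In production, pydub's `AudioSegment` plays this role and FFmpeg handles the MP3 encoding and decoding; the pause length is the main knob for making the dialogue sound natural rather than spliced.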

Challenges we ran into

  • Hallucination Mitigation: Getting the LLM to stick strictly to facts was difficult. We had to engineer robust prompts that forced Gemini to cite sources and verify data before scriptwriting.
  • Media Retrieval: We faced rate limits and scraping hurdles when trying to programmatically fetch context images from Wikipedia and Google Images to support the research phase.
  • Cloud Collaboration: Configuring Google Cloud Storage buckets for secure but shared access during development was trickier than expected; in particular, managing service account permissions across our team's different environments.
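The shared-access setup we landed on boils down to two pieces: one IAM binding on the bucket and one environment variable per developer. The bucket, project, and service account names below are hypothetical placeholders, not our real resources.

```shell
# Grant the team's shared service account read/write on objects in the
# audio bucket (bucket and account names are illustrative).
gcloud storage buckets add-iam-policy-binding gs://recapsule-audio \
    --member="serviceAccount:dev-team@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# Each developer points their local environment at the shared key file,
# which the Google client libraries pick up automatically.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/recapsule-dev.json"
```

Scoping the role to `objectAdmin` on a single bucket (rather than a project-wide role) kept the blast radius small while still letting every environment upload and stream episodes.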

Accomplishments that we're proud of

High-Fidelity Audio: We are proud of how we tuned the ElevenLabs integration. The voices don't just read text; they interact, producing a podcast that feels genuinely human and tailored to the topic its creator chose.

What we learned

  • API Economics: We learned the hard way that high-quality voice synthesis is expensive! Managing API credits for ElevenLabs required us to be very efficient with our testing.

What's next for Recapsule

  • Visuals: Moving back to our original vision by adding AI-generated video or slideshows to accompany the audio.
  • Cost Optimization: Implementing local hosting for TTS (Text-to-Speech) and LLM models (like Llama 3) to reduce reliance on paid APIs.
  • Customization: Allowing users to configure podcast length and depth of research.
