About the Project
Inspiration
I was inspired by the magic of audiobooks and theater, and wanted to create a tool that could bring stories to life automatically. Reading a story is one thing, but hearing characters speak in unique, expressive voices makes it immersive and engaging. I asked myself: “What if a PDF story could become a full audio performance, with each character voiced differently?” That idea sparked this project.
I also noticed that most existing text-to-speech tools use a single voice for all characters, which makes stories feel flat and hard to follow. I wanted to explore whether AI could understand the story context and generate distinct voices for each character, creating a richer, more human-like experience.
How I Built It
Story.ai takes a PDF story and produces a multi-character narrated audio file, all accessible through a web interface. Here’s how it works:
- I used Python and PyPDF2 to extract clean, structured text from PDFs.
- I integrated the Google Gemini API to analyze the story and identify who is speaking in each line. It distinguishes dialogue from narration and groups consecutive lines by the same speaker, so the audio is not broken into sentence-sized clips.
- I used ElevenLabs’ Text-to-Speech API to generate realistic, expressive voices for each character and the narrator.
- I stitched all the generated audio clips together using Python libraries like pydub, producing a seamless audiobook.
- The web interface allows users to upload PDFs, generate audio, and play it directly in the browser.
- I stored uploaded PDFs and generated audio files in MongoDB Atlas, so stories can be replayed without regeneration.
This combination of tools allows Story.ai to turn static story text into a dynamic, multi-character audio experience that users can access instantly online.
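As a rough illustration of the voice-assignment step, here is a minimal sketch that maps each speaker to a voice in order of first appearance. The voice IDs, the `assign_voices` helper, and the segment format (a list of dicts with `speaker` and `text`) are assumptions for illustration, not the project's actual code or ElevenLabs identifiers.

```python
from itertools import cycle

# Hypothetical voice IDs; real ones would come from the TTS provider's voice list.
NARRATOR_VOICE = "voice_narrator"
CHARACTER_VOICES = ["voice_a", "voice_b", "voice_c"]


def assign_voices(segments, narrator="Narrator"):
    """Map each speaker to a voice ID in order of first appearance."""
    # cycle() reuses voices if there are more characters than voices in the pool.
    pool = cycle(CHARACTER_VOICES)
    voices = {narrator: NARRATOR_VOICE}
    for seg in segments:
        speaker = seg["speaker"]
        if speaker not in voices:
            voices[speaker] = next(pool)
    return voices


# Assumed shape of the speaker-labelled output from the analysis step.
segments = [
    {"speaker": "Narrator", "text": "The door creaked open."},
    {"speaker": "Alice", "text": "Who's there?"},
    {"speaker": "Bob", "text": "Only me."},
    {"speaker": "Alice", "text": "You scared me!"},
]
voices = assign_voices(segments)
```

Because the mapping is built once per story, a character keeps the same voice every time they speak, and the narrator always has a dedicated voice.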
What I Learned
- How to extract structured text from PDFs and prepare it for AI processing.
- How to prompt an LLM (Google Gemini) effectively to identify speakers while grouping consecutive lines together.
- How to generate multiple AI voices with ElevenLabs and merge them into a coherent audio narrative.
- How to think about user experience, including pacing, dialogue flow, and listener immersion.
- How to use MongoDB Atlas for storage and retrieval, so generated audio can be reused instead of regenerated.
- How to integrate all components into a web interface, allowing users to interact with the system directly.
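One way to think about replaying a story without regenerating it is to key stored audio by a hash of the PDF's contents. The sketch below is an assumption about how such a lookup could work, not the project's actual storage code: `story_key` and `get_or_generate` are hypothetical helpers, and a plain dict stands in for the database collection.

```python
import hashlib


def story_key(pdf_bytes: bytes) -> str:
    """Return a stable identifier derived from the PDF's contents."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def get_or_generate(pdf_bytes, cache, generate):
    """Return cached audio for this PDF, generating it only on a cache miss."""
    key = story_key(pdf_bytes)
    if key not in cache:
        cache[key] = generate(pdf_bytes)
    return cache[key]


# Demo with a stand-in generator that records how often it is called.
calls = []


def fake_generate(data):
    calls.append(data)
    return b"audio-for-" + data


cache = {}  # stands in for a database collection in this sketch
first = get_or_generate(b"%PDF-demo", cache, fake_generate)
second = get_or_generate(b"%PDF-demo", cache, fake_generate)
```

In this sketch the second request is served from the cache, so the expensive generation step runs only once per distinct PDF.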
Challenges
- Over-segmentation: Initially, the AI split every sentence into a separate chunk, making the audio choppy. I solved this by updating the prompt so that consecutive lines from the same speaker are grouped together.
- Voice assignment: Ensuring each character had a distinct, expressive voice while keeping the sequence natural.
- Audio stitching: Combining multiple audio files into a seamless story without gaps or mismatched timing.
- PDF variability: Handling different PDF formats and layouts while extracting clean text.
- Web integration: Coordinating API calls and audio playback in a user-friendly web interface.
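The over-segmentation fix described above was made in the prompt. As a complementary safeguard, the same grouping can also be applied in code after the model responds. The following is a minimal sketch under that assumption; `merge_consecutive` and the segment format are hypothetical and not taken from the project.

```python
def merge_consecutive(segments):
    """Join adjacent segments that share a speaker into a single segment."""
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous segment: extend its text.
            merged[-1]["text"] += " " + seg["text"]
        else:
            # New speaker: start a new segment (copy so the input is not mutated).
            merged.append(dict(seg))
    return merged


choppy = [
    {"speaker": "Narrator", "text": "It was late."},
    {"speaker": "Narrator", "text": "The house was quiet."},
    {"speaker": "Alice", "text": "Hello?"},
    {"speaker": "Narrator", "text": "No one answered."},
]
smooth = merge_consecutive(choppy)
```

Fewer, longer segments mean fewer text-to-speech requests and fewer joins between clips, which is where choppiness tends to appear.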
Despite these challenges, Story.ai successfully transforms static stories into immersive, multi-character audio experiences, built with Python, Google Gemini, ElevenLabs, MongoDB Atlas, and a web interface.
Future Improvements
- Scene-level visuals: Adding illustrations for each scene to enhance immersion.
- Subtitle syncing: Highlighting dialogue in real-time for accessibility and engagement.
- Character consistency across stories: Maintaining the same voices for recurring characters.
- Enhanced web features: Letting users share generated stories directly from the web interface.