Inspiration

The inspiration came from the realization that traditional audiobooks, while convenient, often lack the immersive quality of visual media. We wanted to bridge this gap by creating audiobooks that feel like audio dramas - with distinct character voices, emotional delivery, and atmospheric sound effects. The goal was to transform reading into a cinematic audio experience, making classic literature more accessible and engaging for modern audiences.

What it does

AudioNovel AI transforms written novels into immersive audiobooks with:

  • Intelligent Character Recognition: Automatically identifies all characters in the text and assigns a unique AI voice to each one
  • Emotion-Aware Narration: Analyzes dialogue context to adjust voice tone, pace, and style based on the emotional content
  • Sequential Audio Generation: Preserves the natural flow of the story by processing dialogue and narration in the correct order
  • Atmospheric Sound Effects: Generates contextual ambient sounds (footsteps, doors, weather) to enhance immersion
  • Production-Ready Output: Combines all segments into a complete audiobook file ready for listening
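The "Sequential Audio Generation" feature above can be illustrated with a minimal sketch: walk the text in order, tagging quoted spans as dialogue and everything between them as narration. (In the real pipeline, GPT-4 handles the harder job of attributing each line to a speaker; this only shows the ordering idea, and the function name is ours, not the project's.)

```javascript
// Minimal sketch of ordered dialogue/narration segmentation.
// Assumes straight double quotes delimit dialogue; real prose needs
// smarter handling (curly quotes, nested quotes, etc.).
function segment(text) {
  const parts = [];
  const re = /"([^"]*)"/g;
  let last = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    // Everything since the previous quote is narration.
    if (m.index > last) {
      parts.push({ type: 'narration', text: text.slice(last, m.index).trim() });
    }
    parts.push({ type: 'dialogue', text: m[1] });
    last = re.lastIndex;
  }
  const tail = text.slice(last).trim();
  if (tail) parts.push({ type: 'narration', text: tail });
  // Drop empty narration fragments (e.g. back-to-back quotes).
  return parts.filter((p) => p.text.length > 0);
}
```

Because segments come out in reading order, the synthesis step can simply render them one after another and the story's flow is preserved.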

How we built it

We built a modular 8-step pipeline using:

  • OpenAI GPT-4 for intelligent text analysis, character extraction, and emotion detection
  • ElevenLabs API for high-quality voice synthesis and sound effect generation
  • Node.js as the runtime environment
  • FFmpeg for audio processing and seamless concatenation
  • Custom algorithms for dialogue separation, ensuring characters only speak their own lines
  • Caching system that allows resuming from any step, making the process fault-tolerant
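The resumable-pipeline idea behind the last bullet can be sketched like this: each step writes its output to a cache keyed by step name, so a rerun skips anything already completed. The step names and stand-in functions below are illustrative, not the project's actual eight modules.

```javascript
// Hedged sketch of a cacheable, resumable step pipeline.
// Each step's output is memoized by name; a crash mid-run loses
// nothing that was already cached.
const steps = [
  ['extractText', (input) => input.toUpperCase()],                 // stand-in work
  ['detectCharacters', (text) => text.split(' ')],                 // stand-in work
  ['assignVoices', (chars) => chars.map((c) => ({ name: c, voice: 'v-' + c }))],
];

function runPipeline(input, cache = new Map()) {
  let result = input;
  for (const [name, fn] of steps) {
    if (cache.has(name)) {
      // Resume path: reuse the cached output instead of recomputing.
      result = cache.get(name);
      continue;
    }
    result = fn(result);
    cache.set(name, result);
  }
  return result;
}
```

In the real system the cache would live on disk (so a rerun after a crash skips paid API calls), but the control flow is the same.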

Challenges we ran into

  1. PDF Text Extraction: Different PDF formats required multiple extraction libraries (pdf-parse, pdfjs-dist, pdf2json) as fallbacks
  2. Character Voice Consistency: Ensuring the same voice was used throughout for each character required careful state management
  3. Dialogue Attribution: Complex narrative structures made it challenging to accurately identify who was speaking
  4. Rate Limiting: Managing API calls to avoid hitting ElevenLabs' rate limits while maintaining reasonable processing speed
  5. Audio Synchronization: Ensuring narration and dialogue segments flowed naturally without awkward pauses or overlaps
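The fallback chain from challenge 1 is straightforward to sketch: try each extractor in turn and accept the first non-empty result. The extractor functions here are stand-ins for pdf-parse, pdfjs-dist, and pdf2json, whose real APIs differ.

```javascript
// Hedged sketch of the multi-library fallback for PDF text extraction.
// Each extractor is a function (buffer) => string; a throw or an
// empty result moves on to the next one.
function extractWithFallback(buffer, extractors) {
  const errors = [];
  for (const extract of extractors) {
    try {
      const text = extract(buffer);
      if (text && text.trim().length > 0) return text;
      errors.push(new Error('empty result'));
    } catch (err) {
      errors.push(err);
    }
  }
  throw new Error(
    'all extractors failed: ' + errors.map((e) => e.message).join('; ')
  );
}
```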

Accomplishments that we're proud of

  • Emotional Intelligence: Successfully implemented emotion detection that adjusts voice parameters for more natural delivery
  • Modular Architecture: Created a fault-tolerant system that can resume from any step, saving time and API costs
  • Character Voice Mapping: Developed an intelligent system that assigns appropriate voices based on character traits
  • Production Quality: Generated a professional-sounding audiobook of "The Caves of Steel" Chapter 1 with distinct character voices
  • Sound Design Integration: Added atmospheric sound effects that enhance the listening experience without overwhelming the narration
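The emotion-to-voice mapping from the first bullet can be sketched as a lookup table. `stability` and `style` are real ElevenLabs voice settings, but the emotion labels and specific values below are illustrative assumptions, not the project's tuned presets.

```javascript
// Hedged sketch: map a detected emotion to voice-synthesis settings.
// Lower stability -> more expressive, varied delivery; higher style ->
// more stylized performance. Values here are assumptions for illustration.
const EMOTION_PRESETS = {
  angry:   { stability: 0.3, style: 0.8 },
  fearful: { stability: 0.4, style: 0.6 },
  calm:    { stability: 0.8, style: 0.2 },
};

const DEFAULT_PRESET = { stability: 0.6, style: 0.4 };

function voiceSettingsFor(emotion) {
  return EMOTION_PRESETS[emotion] ?? DEFAULT_PRESET;
}
```

The payoff is that the same sentence can be rendered tense or relaxed just by changing which preset accompanies the synthesis request.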

What we learned

  • AI Context Matters: Providing proper context to GPT-4 dramatically improved character and emotion detection accuracy
  • Voice Synthesis Nuances: Small adjustments in stability and style parameters can significantly impact the naturalness of speech
  • Pipeline Design: Breaking complex processes into discrete, cacheable steps is crucial for reliability and debugging
  • User Experience: Progress indicators and clear error messages are essential when processing takes a long time
  • API Economics: Implementing smart caching and batch processing can significantly reduce API costs

What's next for Immersive Audio Generator

  1. Multi-Chapter Support: Extend the system to process entire novels with consistent character voices across chapters
  2. Real-Time Processing: Optimize for streaming generation to reduce wait times for longer texts
  3. Voice Cloning: Allow users to upload voice samples for custom character voices
  4. Interactive Features: Add bookmarks, chapter navigation, and playback speed controls
  5. Language Support: Expand to support multiple languages with appropriate voice selections
  6. Music Integration: Add background music that adapts to scene emotions
  7. Community Platform: Create a marketplace where users can share their generated audiobooks and voice configurations
  8. Fine-Tuning Models: Train custom models for better genre-specific dialogue understanding

Built With

  • Node.js
  • OpenAI GPT-4
  • ElevenLabs API
  • FFmpeg
