Inspiration
Storytelling through audio has always been powerful, but high-quality, multi-voice narration remains out of reach for most creators. Professional audiobooks and cinematic audio dramas require actors, directors, editors, and weeks of iteration, often costing thousands of dollars.
At the same time, most modern text-to-speech tools—even the advanced ones—still behave like readers, not performers. They speak the text accurately, but they miss the subtext: the panic, the hesitation, the emotional buildup, and the rhythm that makes a scene feel alive.
We wanted to answer a simple question: "What if AI didn’t just read stories, but actually directed a performance?"
That question became EchoRead.
What it does
EchoRead is a real-time AI narrative performance engine that transforms raw story text into a cinematic, emotionally aware, multi-voice audio experience.
Instead of treating text-to-speech as a single monolithic operation, EchoRead introduces a "Reasoning Layer" between the text and the audio engine.
- It Analyzes: The system reads the subtext to understand character motivations, emotional shifts, and scene intensity.
- It Directs: It assigns specific roles (Narrator, Male Character, Female Character) and dictates how a line should be spoken (e.g., "Anxious whisper," "Angry shout").
- It Performs: It generates audio that follows these directorial constraints, resulting in a cohesive audio drama rather than a flat reading.
How we built it
The architecture relies on a synergy between a reasoning engine and an audio engine.
- The Director (Google Gemini 3 Flash): We use Gemini via Vertex AI to act as the "Director Agent." It ingests the raw text and outputs a structured JSON script. This script contains segmentation data, mapping every line to a specific speaker, an emotion (calm, tense, happy), and an intensity level (1-5); a sketch of this schema appears after this list.
- The Performers (ElevenLabs v3): We feed the structured script into the ElevenLabs API. Crucially, we don't just send text; we dynamically tune the voice settings (Stability, Similarity Boost, Style Exaggeration) based on the emotion and intensity data provided by Gemini (see the performer sketch after this list).
- The Stage (Next.js & FastAPI): The backend (FastAPI) handles the orchestration and caching to ensure low latency. The frontend (Next.js) parses the timestamped audio data to provide a karaoke-style visualization, highlighting words in perfect sync with the emotional performance.
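To make the Director step concrete, here is a minimal sketch of the structured script, assuming the google-genai SDK's structured-output support and Pydantic for the schema. The field names are illustrative, based on the description above rather than our exact production schema.

```python
# Director sketch: Gemini turns raw story text into a typed script.
# Assumes the google-genai SDK (Vertex AI mode); field names are illustrative.
from google import genai
from pydantic import BaseModel


class Segment(BaseModel):
    speaker: str    # "Narrator", "Male Character", or "Female Character"
    text: str       # the exact line to perform
    emotion: str    # e.g. "calm", "tense", "happy"
    intensity: int  # 1 (subdued) to 5 (explosive)
    delivery: str   # directorial note, e.g. "Anxious whisper"


class Script(BaseModel):
    segments: list[Segment]


client = genai.Client(vertexai=True, project="my-project", location="us-central1")

raw_story_text = '"Take it off!" she hissed, glancing back at the door.'

response = client.models.generate_content(
    model="gemini-3-flash",  # model name as listed under Built With
    contents=f"Direct this scene as a multi-voice audio drama:\n\n{raw_story_text}",
    config={
        "response_mime_type": "application/json",
        "response_schema": Script,  # constrains Gemini to valid, typed JSON
    },
)
script = Script.model_validate_json(response.text)
```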
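The Performer step then consumes that script segment by segment. The sketch below reuses the Segment model above and assumes the ElevenLabs Python SDK's text_to_speech.convert_with_timestamps endpoint (base64 audio plus character-level timings); the voice IDs and the eleven_v3 model ID are placeholders.

```python
# Performer sketch: render one directed line with tuned voice settings.
# Voice IDs and the model ID below are placeholders, not production values.
import base64

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

eleven = ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

VOICES = {  # hypothetical speaker-to-voice mapping
    "Narrator": "voice_id_narrator",
    "Male Character": "voice_id_male",
    "Female Character": "voice_id_female",
}


def perform(segment: Segment, settings: VoiceSettings):
    """Render a segment and return raw audio plus its timing alignment."""
    result = eleven.text_to_speech.convert_with_timestamps(
        voice_id=VOICES[segment.speaker],
        text=segment.text,
        model_id="eleven_v3",     # assumed ID for the v3 model
        voice_settings=settings,  # tuned per emotion/intensity (see Challenges)
    )
    audio = base64.b64decode(result.audio_base_64)
    # result.alignment carries per-character start/end times in seconds;
    # the frontend groups them into words for the karaoke-style highlight.
    return audio, result.alignment
```

The FastAPI layer orchestrates these calls and caches rendered segments for low latency, while the alignment data feeds the word-level highlighting in the Next.js frontend.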
Challenges we ran into
Our biggest challenge was learning that text manipulation destroys performance.
Initially, we tried to control pacing by modifying the text itself: adding ellipses (`...`), dashes, or `<break />` tags to force pauses. We found that this confused the model, causing it to lose the natural flow and emotional punch of the dialogue. A line like "Take it off!" lost all urgency when artificially slowed down.
We realized that Voice Parameters > Text Hacks. We pivoted to controlling the performance strictly through ElevenLabs' voice settings (specifically stability and style). However, mapping the fine-grained intensity scale from our reasoning engine onto the handful of discrete stability steps the model supports took significant trial and error to get right.
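For the curious, here is the shape of the mapping we landed on. It is a simplified sketch using the 1-5 intensity scale from the architecture section, and it assumes v3's three discrete stability presets (roughly Creative, Natural, and Robust); the exact thresholds are where the trial and error went.

```python
# Intensity-to-settings sketch: continuous direction in, discrete knobs out.
# The three stability values assume v3's Creative/Natural/Robust presets.
from elevenlabs import VoiceSettings


def settings_for(emotion: str, intensity: int) -> VoiceSettings:
    """Map the Director's emotion/intensity to discrete voice parameters."""
    # Counterintuitively, high intensity wants LOW stability: letting the
    # model fluctuate is what reads as urgency. Calm lines get high stability.
    if intensity >= 4:
        stability = 0.0  # "Creative": maximum expressive range
    elif intensity >= 2:
        stability = 0.5  # "Natural": balanced delivery
    else:
        stability = 1.0  # "Robust": steady, consistent narration

    # Style exaggeration scales with how emotionally charged the line is.
    style = 0.8 if emotion in ("tense", "angry", "afraid") else 0.3

    return VoiceSettings(
        stability=stability,
        similarity_boost=0.75,  # keep each character's voice recognizable
        style=style,
    )
```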
Accomplishments that we're proud of
- The "Reasoning Layer": We built a pipeline where Gemini understands the context of a scene (e.g., knowing that a character is whispering because they are hiding) and reliably passes that instruction to the audio engine.
- Cinema-Quality Output: The difference between the raw TTS output and EchoRead's directed output is night and day. Hearing the AI genuinely sound "nervous" or "relieved" based on the story arc is a huge win.
- Real-Time Sync: Achieving word-level synchronization on the frontend while managing multiple dynamic audio segments was a complex engineering hurdle that we solved effectively.
What we learned
We learned that the future of AI audio isn't just about better voice quality—it's about better direction.
The current generation of voice models is incredibly capable, but they need guidance. By inserting a reasoning layer (Gemini) before the generation layer (ElevenLabs), we unlocked a level of quality that neither tool could easily achieve on its own. We also learned that "perfect" speech isn't always "good" storytelling; sometimes you need the instability and raw fluctuation of a lower stability setting to convey real emotion.
What's next for EchoRead
EchoRead is just the beginning of AI-directed audio. Our roadmap includes:
- Soundscape Generation: Using the scene analysis to automatically generate background ambient noise (rain, city traffic, silence).
- Long-Form Support: Optimizing the architecture to handle full chapters or books while maintaining consistent character voices.
- Director Controls: Allowing the user to manually override the AI director's choices (e.g., "Make this line angrier").
- Export Options: Allowing creators to download the full, mixed audio file for use in podcasts or videos.
Built With
- elevenlabs-v3
- fastapi
- gemini-3-flash
- nextjs
- tailwind-v4