Inspiration
We love storytelling — but turning an idea into a polished visual story requires writing scripts, generating art, recording voiceover, editing video, and stitching it all together. That's a dozen tools and hours of work for even a short piece. We wondered: what if you could just talk about your story and watch it come to life? Voice is the most natural way humans tell stories — campfire tales, bedtime stories, movie pitches. We wanted to make the creation process feel just as natural as the storytelling itself.
What it does
SayCut is a voice-first storybook and short film maker. You speak your idea into the mic, and an AI voice agent guides you through the entire creation pipeline — scripting, illustration, narration, and video — all through conversation.
It supports two modes:
- Story mode — A single-narrator storybook with illustrated scenes and voiceover, like an animated picture book.
- Movie mode — A two-character short film with multi-voice dialogue (including a Morgan Freeman-style narrator), creating cinematic scenes with distinct voices for each character.
You can edit anything by voice after the fact: "make the dragon bigger in scene 2," "add a scene between 1 and 3," or "remove the last scene." The final result plays back in a cinematic player with crossfade transitions and synced audio.
How we built it
- Voice agent: BosonAI's HiggsAudioM3 v3.5 model handles speech understanding and tool calling in a single streaming pass — the user speaks, and the model both interprets the audio and decides which tools to invoke, with no separate speech-to-text step.
- Tool orchestration: The voice agent uses structured tool calls (generate_script, generate_scene_image, generate_scene_audio, generate_scene_video, edit_scene_image, etc.) to drive the entire pipeline. Each tool call triggers real asset generation and pushes live updates to the frontend over WebSocket.
- Multi-model pipeline: Script generation (Kimi K2.5), image generation (EigenImage), image editing (Qwen), image-to-video (Wan2.2), and TTS (Higgs 2.5) — all orchestrated through a single voice conversation loop.
- Frontend: Next.js with a real-time storyboard editor, Zustand state management, and a cinematic video player with audio-synced scene transitions.
- Backend: FastAPI with async WebSocket sessions, SQLite persistence, and local asset storage. Each session gets its own VoiceAgent instance configured for the chosen mode.
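The tool-orchestration loop can be sketched roughly as follows. This is an illustrative outline, not SayCut's actual code: `TOOL_HANDLERS`, the `generate_scene_image` stub, and the WebSocket payload shape are all assumptions.

```python
# Hypothetical sketch of the tool-dispatch loop: handler names and
# the "asset_ready" payload shape are illustrative, not the real code.
import asyncio
import json


async def generate_scene_image(scene_id: int, prompt: str) -> dict:
    """Stand-in for a real image-generation call."""
    await asyncio.sleep(0)  # pretend an image model was called here
    return {"scene_id": scene_id, "url": f"/assets/scene_{scene_id}.png"}


TOOL_HANDLERS = {
    "generate_scene_image": generate_scene_image,
}


async def dispatch_tool_call(call: dict, send) -> dict:
    """Run one structured tool call and stream the result to the frontend."""
    handler = TOOL_HANDLERS[call["name"]]
    result = await handler(**call["arguments"])
    # Push a live update so the storyboard refreshes as assets land.
    await send(json.dumps({"type": "asset_ready", "data": result}))
    return result
```

Each real tool would follow the same shape: run the generation, then emit a progress event over the session's WebSocket before returning the result to the model.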
Challenges we ran into
- Voice + tool calling in one model: The HiggsAudioM3 model processes raw audio and emits structured tool calls in the same response stream. Parsing tool-call tags out of a streaming text response — especially when the model sometimes generates malformed or truncated JSON — required robust parsing with fallbacks.
- Multi-voice audio concatenation: Movie mode needs three distinct voices (narrator + two characters) stitched into a single audio track per scene. Getting the silence gaps, sample rate normalization (24kHz/16-bit/mono), and voice-to-character mapping right across varying TTS output formats was tricky.
- Async video generation: Image-to-video is a long-running job (submit → poll → download). Coordinating this with the voice agent's tool call loop without blocking the conversation required careful async orchestration and live progress events to the frontend.
- Context management: Resuming an existing storybook mid-conversation means injecting scene context into the voice agent without replaying the full message history (which would pollute the model's context). We built a LOAD_STORYBOOK protocol that summarizes the existing state for the agent.
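The fallback parsing for streamed tool calls might look something like the sketch below. The `<tool_call>` tag format and the brace-repair heuristic are assumptions for illustration, not the parser we shipped.

```python
# Illustrative fallback parser for tool-call JSON that may arrive
# truncated mid-stream; the <tool_call> tag format is an assumption.
import json
import re


def parse_tool_call(buffer: str):
    """Extract the first <tool_call>...</tool_call> payload, tolerating truncation."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", buffer, re.DOTALL)
    raw = match.group(1) if match else None
    if raw is None:
        # Fall back to an unterminated tag: take everything after it.
        open_tag = buffer.find("<tool_call>")
        if open_tag == -1:
            return None
        raw = buffer[open_tag + len("<tool_call>"):]
    raw = raw.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Truncated JSON: close any dangling braces and retry once.
        repaired = raw + "}" * (raw.count("{") - raw.count("}"))
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None
```

Returning `None` instead of raising lets the conversation loop ask the model to retry rather than crashing the session.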
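The multi-voice stitching reduces to concatenating PCM clips with zero-filled gaps once everything is normalized to 24 kHz / 16-bit / mono. A minimal sketch, assuming the clips are already decoded to that format and using an illustrative character-to-voice mapping:

```python
# Rough sketch of stitching per-line TTS clips into one scene track.
# Assumes clips are already 24 kHz / 16-bit / mono PCM; the
# voice-mapping dict is illustrative.
SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit mono


def silence(seconds: float) -> bytes:
    """Zero-filled PCM gap inserted between speakers."""
    n = int(seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE
    return b"\x00" * n


def stitch_scene(lines, voices, gap=0.3) -> bytes:
    """lines: [(character, pcm_bytes)]; voices maps character -> voice id."""
    track = bytearray()
    for i, (character, pcm) in enumerate(lines):
        assert character in voices, f"no voice mapped for {character!r}"
        if i > 0:
            track += silence(gap)
        track += pcm
    return bytes(track)
```

In practice the hard part was upstream of this: resampling and format-converting each TTS engine's output so the naive byte concatenation above is actually valid.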
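The submit → poll → download pattern for video jobs can be expressed as a small async loop. `submit_job`, `job_status`, `download`, and the `notify` progress callback below are stand-ins, not the real provider's API:

```python
# Hypothetical submit -> poll -> download loop for image-to-video jobs.
import asyncio


async def run_video_job(submit_job, job_status, download, notify, interval=0.01):
    """Kick off a render, poll until done, and stream progress events."""
    job_id = await submit_job()
    while True:
        status = await job_status(job_id)
        await notify({"job": job_id, "state": status["state"]})
        if status["state"] == "done":
            return await download(job_id)
        if status["state"] == "failed":
            raise RuntimeError(f"video job {job_id} failed")
        await asyncio.sleep(interval)  # yield so the voice loop keeps running
```

Because the loop awaits between polls, the voice agent's conversation keeps flowing while renders run, and each poll doubles as a progress event for the frontend.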
Accomplishments that we're proud of
- Fully voice-driven creation: From "I want a story about a space cat" to a playable cinematic storybook — entirely through conversation, no typing or clicking required.
- Two distinct creative modes with shared infrastructure: Story and Movie mode reuse the same image/video pipeline but produce fundamentally different outputs (narrated storybook vs. multi-voice film).
- Live storyboard updates: Scenes appear and update in real-time as the AI generates them — you watch your story materialize as you talk.
- Edit-in-place by voice: Insert, remove, or modify individual scenes without regenerating the whole storybook. The backend handles index shifting and asset cleanup automatically.
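The index shifting behind edit-in-place is simple once scenes live in an ordered store. A simplified sketch, where the dict-based scene records and 1-based positions are assumptions rather than the real schema:

```python
# Simplified sketch of scene insertion/removal with index shifting.
def insert_scene(scenes, position, new_scene):
    """Insert at a 1-based position; later scenes shift up by one."""
    scenes.insert(position - 1, new_scene)
    for i, scene in enumerate(scenes, start=1):
        scene["index"] = i
    return scenes


def remove_scene(scenes, position):
    """Remove the scene at a 1-based position and return its assets for cleanup."""
    removed = scenes.pop(position - 1)
    for i, scene in enumerate(scenes, start=1):
        scene["index"] = i
    return removed.get("assets", [])
```

Returning the removed scene's asset list lets the caller delete orphaned images, audio, and video files without touching the surviving scenes.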
What we learned
- Voice-first interfaces change how people interact with creative tools. Users naturally iterate ("actually, make it more dramatic") in ways they wouldn't with a form-based UI.
- Orchestrating 5+ AI models through a single conversation loop is a systems design challenge more than an ML challenge — the hard part is state management, error recovery, and keeping the user informed of progress.
- Streaming tool calls from audio models is still rough — models sometimes narrate their intent ("I'll now generate the images") instead of calling the tool. We solved this with an auto-nudge mechanism that re-prompts when it detects narration after a tool response.
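The auto-nudge idea boils down to a check after each model turn: if the reply narrates an intended action but contains no tool call, re-prompt. The phrase list and nudge message below are assumptions, sketched for illustration:

```python
# Sketch of the auto-nudge heuristic; the hint phrases and nudge
# wording are assumptions, not the production values.
NARRATION_HINTS = ("i'll now", "i will now", "let me generate", "next, i'll")


def needs_nudge(reply: str, has_tool_call: bool) -> bool:
    """True when the model talked about acting but didn't actually call a tool."""
    text = reply.lower()
    return not has_tool_call and any(h in text for h in NARRATION_HINTS)


def maybe_nudge(reply: str, has_tool_call: bool):
    """Return a re-prompt string when a nudge is warranted, else None."""
    if needs_nudge(reply, has_tool_call):
        return "Please call the appropriate tool now instead of describing it."
    return None
```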
What's next for SayCut
- More voices and characters: Support for larger casts with distinct voice profiles, and let users clone their own voice for narration.
- Style transfer and consistency: Use reference images across scenes to maintain character and art style consistency throughout the storybook.
- Collaborative editing: Multiple users contributing to the same storybook in real-time, each speaking different characters.
- Export formats: Render final storybooks as MP4 videos, PDF picture books, or shareable web links.
- Music and sound effects: Auto-generate background music and foley based on scene descriptions to complete the cinematic experience.
Built With
- eigenai
- fastapi
- higgs-audio-v3.52
- higgs2p5
- kimi-k2-53
- qwen-image-edit
- react
- wan2p2-i2v-14b
