Inspiration

We love storytelling — but turning an idea into a polished visual story requires writing scripts, generating art, recording voiceover, editing video, and stitching it all together. That's a dozen tools and hours of work for even a short piece. We wondered: what if you could just talk about your story and watch it come to life? Voice is the most natural way humans tell stories — campfire tales, bedtime stories, movie pitches. We wanted to make the creation process feel just as natural as the storytelling itself.

What it does

SayCut is a voice-first storybook and short film maker. You speak your idea into the mic, and an AI voice agent guides you through the entire creation pipeline — scripting, illustration, narration, and video — all through conversation.

It supports two modes:

  • Story mode — A single-narrator storybook with illustrated scenes and voiceover, like an animated picture book.
  • Movie mode — A two-character short film with multi-voice dialogue (including a Morgan Freeman-style narrator), creating cinematic scenes with distinct voices for each character.

You can edit anything by voice after the fact: "make the dragon bigger in scene 2," "add a scene between 1 and 3," or "remove the last scene." The final result plays back in a cinematic player with crossfade transitions and synced audio.

How we built it

  • Voice agent: BosonAI's HiggsAudioM3 v3.5 model handles speech understanding and tool calling in a single streaming pass: the user speaks, and the model both interprets the audio and decides which tools to invoke, with no separate STT step.
  • Tool orchestration: The voice agent uses structured tool calls (generate_script, generate_scene_image, generate_scene_audio, generate_scene_video, edit_scene_image, etc.) to drive the entire pipeline. Each tool call triggers real asset generation and pushes live updates to the frontend over WebSocket (see the dispatch sketch after this list).
  • Multi-model pipeline: Script generation (Kimi K2.5), image generation (EigenImage), image editing (Qwen), image-to-video (Wan2.2), and TTS (Higgs 2.5) — all orchestrated through a single voice conversation loop.
  • Frontend: Next.js with a real-time storyboard editor, Zustand state management, and a cinematic video player with audio-synced scene transitions.
  • Backend: FastAPI with async WebSocket sessions, SQLite persistence, and local asset storage. Each session gets its own VoiceAgent instance configured for the chosen mode.
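
To make the orchestration concrete, here's a minimal sketch of the dispatch layer. The tool name matches ours, but the registry decorator, handler body, and WebSocket payload shape are illustrative, not our exact code:

```python
import json
from typing import Awaitable, Callable

ToolHandler = Callable[[dict], Awaitable[dict]]
TOOLS: dict[str, ToolHandler] = {}

def tool(name: str):
    """Register an async handler under its tool-call name."""
    def register(fn: ToolHandler) -> ToolHandler:
        TOOLS[name] = fn
        return fn
    return register

@tool("generate_scene_image")
async def generate_scene_image(args: dict) -> dict:
    # Call the image model, store the asset locally, return its URL.
    ...

async def dispatch(call: dict, ws) -> dict:
    """Run one structured tool call and push the result to the storyboard UI."""
    result = await TOOLS[call["name"]](call["arguments"])
    await ws.send_text(json.dumps({"event": "tool_result",
                                   "tool": call["name"],
                                   "data": result}))
    return result
```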

Challenges we ran into

  • Voice + tool calling in one model: The HiggsAudioM3 model processes raw audio and emits structured tool calls in the same response stream. Parsing tool-call tags out of streaming text, especially when the model generates malformed or truncated JSON, required robust parsing with fallbacks (first sketch below).
  • Multi-voice audio concatenation: Movie mode needs three distinct voices (narrator plus two characters) stitched into a single audio track per scene. Getting the silence gaps, sample-rate normalization (24 kHz / 16-bit / mono), and voice-to-character mapping right across varying TTS output formats was tricky (second sketch below).
  • Async video generation: Image-to-video is a long-running job (submit → poll → download). Coordinating it with the voice agent's tool-call loop without blocking the conversation required careful async orchestration and live progress events to the frontend (third sketch below).
  • Context management: Resuming an existing storybook mid-conversation means injecting scene context into the voice agent without replaying the full message history, which would pollute the model's context. We built a LOAD_STORYBOOK protocol that summarizes the existing state for the agent (fourth sketch below).
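
Here's roughly how the fallback parsing works; the `<tool_call>` tag name and the brace-repair heuristic are assumptions for illustration, not our exact format:

```python
# Illustrative fallback parser for tool-call payloads streamed as text.
import json
import re

# Assumed tag format; the real stream format may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)(?:</tool_call>|$)", re.DOTALL)

def parse_tool_call(stream_text: str) -> dict | None:
    match = TOOL_CALL_RE.search(stream_text)
    if not match:
        return None
    payload = match.group(1).strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # Fallback: close any braces the model truncated mid-stream.
        repaired = payload + "}" * (payload.count("{") - payload.count("}"))
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None  # give up; the agent re-prompts the model
```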
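
The audio stitching, in miniature, assuming clips arrive as (samples, sample_rate) NumPy pairs; the gap length is illustrative:

```python
import numpy as np

TARGET_SR = 24_000                                     # 24 kHz mono, 16-bit PCM
GAP = np.zeros(int(0.35 * TARGET_SR), dtype=np.int16)  # ~350 ms pause (illustrative)

def normalize(samples: np.ndarray, sr: int) -> np.ndarray:
    """Coerce one TTS clip to the target format, whatever it arrived as."""
    samples = samples.astype(np.float64)
    if samples.ndim == 2:                              # stereo -> mono
        samples = samples.mean(axis=1)
    if np.abs(samples).max() > 1.0:                    # int PCM -> float [-1, 1]
        samples /= 32768.0
    if sr != TARGET_SR:                                # naive linear resample
        n_out = int(len(samples) * TARGET_SR / sr)
        samples = np.interp(np.linspace(0, len(samples) - 1, n_out),
                            np.arange(len(samples)), samples)
    return (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)

def stitch_scene(clips: list[tuple[np.ndarray, int]]) -> np.ndarray:
    """Join narrator and character lines with silence gaps in between."""
    parts = []
    for samples, sr in clips:
        parts += [normalize(samples, sr), GAP]
    return np.concatenate(parts[:-1])                  # drop the trailing gap
```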
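
The video step, sketched with placeholder endpoints (the real provider's API differs), shows the submit → poll → download shape; because everything awaits, the conversation loop keeps running while the job cooks:

```python
import asyncio
import httpx

async def generate_scene_video(image_url: str, ws) -> bytes:
    """Submit an image-to-video job, poll it, and stream progress to the UI."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post("https://video.example/jobs",
                                 json={"image": image_url})
        job_id = resp.json()["id"]
        while True:
            await asyncio.sleep(5)                 # other coroutines keep running
            status = (await client.get(
                f"https://video.example/jobs/{job_id}")).json()
            await ws.send_json({"event": "video_progress",
                                "progress": status.get("progress", 0)})
            if status["state"] == "done":
                break
        video = await client.get(status["result_url"])
        return video.content
```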
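
And the LOAD_STORYBOOK idea: one compact system message summarizing current state instead of a replayed transcript. Field names here are illustrative:

```python
def load_storybook_context(storybook: dict) -> dict:
    """Build one system message that summarizes existing scenes for the agent."""
    lines = [f"LOAD_STORYBOOK: '{storybook['title']}' "
             f"({storybook['mode']} mode, {len(storybook['scenes'])} scenes)"]
    for i, scene in enumerate(storybook["scenes"], start=1):
        assets = [k for k in ("image", "audio", "video") if scene.get(k)]
        lines.append(f"Scene {i}: {scene['summary']} "
                     f"[assets: {', '.join(assets) or 'none'}]")
    return {"role": "system", "content": "\n".join(lines)}
```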

Accomplishments that we're proud of

  • Fully voice-driven creation: From "I want a story about a space cat" to a playable cinematic storybook — entirely through conversation, no typing or clicking required.
  • Two distinct creative modes with shared infrastructure: Story and Movie mode reuse the same image/video pipeline but produce fundamentally different outputs (narrated storybook vs. multi-voice film).
  • Live storyboard updates: Scenes appear and update in real-time as the AI generates them — you watch your story materialize as you talk.
  • Edit-in-place by voice: Insert, remove, or modify individual scenes without regenerating the whole storybook. The backend handles index shifting and asset cleanup automatically (sketched after this list).
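
A sketch of that bookkeeping, with an illustrative scene schema:

```python
from pathlib import Path

def remove_scene(scenes: list[dict], index: int, asset_dir: Path) -> None:
    """Drop one scene, delete its generated files, and re-number the rest."""
    removed = scenes.pop(index)
    for asset in removed.get("assets", []):
        (asset_dir / asset).unlink(missing_ok=True)  # clean up orphaned files
    for i, scene in enumerate(scenes):
        scene["index"] = i                           # shift everything after

def insert_scene(scenes: list[dict], index: int, scene: dict) -> None:
    """Add a scene between existing ones; later scenes shift up by one."""
    scenes.insert(index, scene)
    for i, s in enumerate(scenes):
        s["index"] = i
```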

What we learned

  • Voice-first interfaces change how people interact with creative tools. Users naturally iterate ("actually, make it more dramatic") in ways they wouldn't with a form-based UI.
  • Orchestrating 5+ AI models through a single conversation loop is a systems design challenge more than an ML challenge — the hard part is state management, error recovery, and keeping the user informed of progress.
  • Streaming tool calls from audio models is still rough: models sometimes narrate their intent ("I'll now generate the images") instead of calling the tool. We solved this with an auto-nudge mechanism that re-prompts when it detects narration after a tool response (sketched below).
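
In outline, the nudge loop looks like this; `agent.complete` and the message shapes are assumed stand-ins for our actual agent interface:

```python
NUDGE = {"role": "system",
         "content": "Do not describe the action; call the tool directly."}

def awaiting_tool(messages: list[dict]) -> bool:
    """True when the last message is a tool result, so a tool call should follow."""
    return bool(messages) and messages[-1].get("role") == "tool"

async def run_turn(agent, messages: list[dict], max_nudges: int = 2) -> dict:
    must_call_tool = awaiting_tool(messages)  # a tool result was just injected
    reply = await agent.complete(messages)    # assumed agent interface
    for _ in range(max_nudges):
        if reply.get("tool_calls") or not must_call_tool:
            break                             # real call, or prose is fine here
        messages += [reply, NUDGE]            # nudge and retry the turn
        reply = await agent.complete(messages)
    return reply
```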

What's next for SayCut

  • More voices and characters: Support for larger casts with distinct voice profiles, and let users clone their own voice for narration.
  • Style transfer and consistency: Use reference images across scenes to maintain character and art style consistency throughout the storybook.
  • Collaborative editing: Multiple users contributing to the same storybook in real-time, each speaking different characters.
  • Export formats: Render final storybooks as MP4 videos, PDF picture books, or shareable web links.
  • Music and sound effects: Auto-generate background music and foley based on scene descriptions to complete the cinematic experience.

Built With

  • eigenai
  • fastapi
  • higgs-audio-v3.5
  • higgs2p5
  • kimi-k2-5
  • qwen-image-edit
  • react
  • wan2p2-i2v-14b