Inspiration

Most AI story tools treat storytelling as a single prompt-response operation. But real stories require layered intelligence — psychological depth, dramatic structure, visual consistency, and an understanding of what the reader will feel at each moment. We wanted to build a system that thinks like a writer, not a chatbot. Story Engine was designed from narrative theory first principles, then encoded as a multi-agent architecture where each agent owns a distinct layer of the creative process.

What it does

Story Engine is a multi-agent AI system that constructs emotionally resonant stories through a pipeline of six specialized Gemini agents. Users enter a story seed via text, voice, or image upload. A constitution agent crystallizes the story's identity — its core tension, theme, emotional signature, and protagonist psychology. A possibility agent proposes three dramatically distinct arc directions. After the user chooses, a skeleton agent generates six dramatic beats following the universal arc: equilibrium → disruption → escalation → false peak → crisis → transformation. For each beat, a scene agent writes literary prose, an image prompt agent ensures visual consistency, and an image agent generates an illustration via Gemini 2.5 Flash Image. At any point the user can intervene and redirect the story. Images can be regenerated in eight art styles including Renaissance, Noir, Anime, and Watercolor.
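The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `agents` dictionary, the `choose_arc` callback, and the `Scene` and `Story` containers are hypothetical names, and each agent is reduced to a plain function so the control flow is visible without any model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The six beats of the arc, in the order given in the text.
BEATS = [
    "equilibrium",
    "disruption",
    "escalation",
    "false peak",
    "crisis",
    "transformation",
]


@dataclass
class Scene:
    beat: str
    prose: str
    image_prompt: str
    image_ref: str  # hypothetical: a URL or ID for the generated illustration


@dataclass
class Story:
    constitution: str
    arc: str
    scenes: List[Scene] = field(default_factory=list)


def run_pipeline(
    seed: str,
    agents: Dict[str, Callable],
    choose_arc: Callable[[List[str]], str],
) -> Story:
    """Run the six agents in order; `agents` maps a role name to a callable."""
    constitution = agents["constitution"](seed)
    arcs = agents["possibility"](constitution)  # expected: three candidate arcs
    arc = choose_arc(arcs)                      # the user's choice
    outlines = agents["skeleton"](constitution, arc)
    if len(outlines) != len(BEATS):
        raise ValueError("skeleton agent must return one outline per beat")

    story = Story(constitution=constitution, arc=arc)
    for beat, outline in zip(BEATS, outlines):
        prose = agents["scene"](constitution, beat, outline)
        # Earlier image prompts are passed along so the visuals stay consistent.
        previous = [s.image_prompt for s in story.scenes]
        image_prompt = agents["image_prompt"](constitution, prose, previous)
        image_ref = agents["image"](image_prompt)
        story.scenes.append(Scene(beat, prose, image_prompt, image_ref))
    return story
```

In a real deployment each entry in `agents` would wrap a Gemini call, and user intervention could be modeled by re-running the loop from a chosen beat with a revised outline.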

How we built it

We derived the architecture from narrative theory first principles — working through story structure, character psychology, dramatic arc, world-building, and style as distinct conceptual layers before writing a single line of code. Each layer became an agent with a specific role and scope.

The backend is a FastAPI application running on Google Cloud Run. It uses the Google GenAI SDK with Gemini 2.5 Flash for all text generation and Gemini 2.5 Flash Image for scene illustrations. Voice input is transcribed using Gemini's multimodal capability, and image seeds are analyzed by Gemini vision to extract a complete story constitution from the visual composition. The frontend is a single-page app with a typewriter prose effect, a real-time sidebar showing the agents' reasoning, and style-selectable image regeneration.

Challenges we ran into

Gemini 2.5 Flash's thinking mode intermittently produced empty responses, so we built a retry loop and custom response-part extraction to handle this reliably. The SDK version available to us did not expose SpeechConfig for TTS, so we used the Web Speech API with voice parameters matched to the story's emotional signature. Getting a consistent visual identity across generated scene images required a dedicated image prompt agent that explicitly references previous scene descriptions and maintains character continuity.
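A retry loop with part-level extraction might look like the sketch below. It is an assumption-laden illustration rather than our exact code: `extract_text` and `generate_with_retry` are hypothetical helper names, the attempt count and delays are arbitrary, and the response is treated as a duck-typed object with `candidates`, `content`, `parts`, `text`, and `thought` attributes. The model call itself is passed in as `call`, so nothing here depends on a specific SDK version.

```python
import time
from typing import Any, Callable


def extract_text(response: Any) -> str:
    """Join the text of all non-thought parts of the first candidate.

    Returns "" when the response has no candidates, no content, or only
    thought parts, which is the "empty response" case to retry on.
    """
    chunks = []
    candidates = getattr(response, "candidates", None) or []
    for candidate in candidates[:1]:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            if getattr(part, "thought", False):
                continue  # skip reasoning parts, keep only the answer text
            text = getattr(part, "text", None)
            if text:
                chunks.append(text)
    return "".join(chunks).strip()


def generate_with_retry(
    call: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call the model until it yields non-empty text, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        text = extract_text(call())
        if text:
            return text
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"empty response after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.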

Accomplishments that we're proud of

The quality of generated stories genuinely surprised us — the hierarchical pipeline produces prose and arcs that feel crafted rather than generated. The image generation with art style selection produces stunning results, particularly in Renaissance and Cinematic styles. The multimodal image seed feature — where users upload any photo and Gemini extracts a complete story constitution from the visual composition — works remarkably well and produces stories with strong visual identity throughout.

What we learned

Narrative quality requires separating concerns that most systems conflate. Consistency checking is different from quality evaluation, which is different from dramatic judgment. Encoding narrative theory as agent architecture produced noticeably better output than single-prompt approaches. Treating the constitution as a shared north star document was the most critical architectural decision: without it, each agent drifts toward generic output.
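One way to picture the shared constitution is as a small immutable record that is prepended to every agent's prompt. The sketch below is illustrative only: the field names follow the four elements listed earlier (core tension, theme, emotional signature, protagonist psychology), while `Constitution`, `as_preamble`, `build_prompt`, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    """The story's identity, written once and then read by every agent."""
    core_tension: str
    theme: str
    emotional_signature: str
    protagonist_psychology: str

    def as_preamble(self) -> str:
        # Rendered identically for every agent so none of them drifts.
        return "\n".join([
            "STORY CONSTITUTION (do not contradict):",
            f"- Core tension: {self.core_tension}",
            f"- Theme: {self.theme}",
            f"- Emotional signature: {self.emotional_signature}",
            f"- Protagonist psychology: {self.protagonist_psychology}",
        ])


def build_prompt(constitution: Constitution, role: str, task: str) -> str:
    """Prefix an agent's role and task with the shared constitution."""
    return f"{constitution.as_preamble()}\n\nROLE: {role}\nTASK: {task}"
```

Because the record is frozen, a downstream agent cannot quietly rewrite the story's identity; any change has to go through whatever step produces a new constitution.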

What's next for Story Engine

The full theoretical architecture includes evaluation agents, a critic/judge agent, a psychological interpreter, a contradiction agent for character depth, a reader awareness agent, and a branching tree structure for non-linear story exploration. We also plan to add TTS with contextually matched voices using Gemini TTS models, and multi-protagonist arc graphs for ensemble stories.

Built With

  • docker
  • fastapi
  • gemini-2.5-flash
  • gemini-2.5-flash-image
  • google-cloud-run
  • google-genai-sdk
  • javascript
  • python
  • web-speech-api