Inspiration

Every parent wants to give their child a story that feels truly theirs — starring their name, their pet, their city, their lesson. But creating one takes a writer, an illustrator, a narrator, and hours of work across five different tools. We asked a simple question: what if one voice brief could replace all of that? The Gemini Live Agent Challenge gave us the perfect forcing function to find out.


What it does

Google Stories is a Creative Director AI agent that transforms a spoken or typed brief into a complete, personalized, 6-page illustrated storybook in under 2 minutes. A parent speaks — "Make a story about Priya and her dog Bruno in Hyderabad, learning to ask for help" — and the agent streams back Gemini 2.5 Flash story text, Imagen 3 watercolor illustrations, and Cloud TTS narration audio simultaneously as one interleaved output. Not sequentially. Not separately. One coherent creative stream, live in the browser. The agent makes deliberate creative decisions — locking a visual style guide, placing each illustration at the scene's emotional peak, and keeping every character visually consistent across all 6 pages. Stories are saved automatically in the browser and persist across refreshes and restarts. A global MCP server exposes the entire agent as callable tools for any AI client.
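To make "one interleaved output" concrete, here is a minimal sketch of what a client sees on the wire and how it can be parsed. The event names (`text`, `image`, `audio`) and payload fields are illustrative assumptions, not the production schema; the framing itself is standard Server-Sent Events.

```python
import json

def parse_sse(raw: str):
    """Parse a Server-Sent Events stream into (event, payload) tuples.

    Event names ("text", "image", "audio") are our illustration; the wire
    format is plain SSE: `event:` / `data:` fields, blank line dispatches.
    """
    events = []
    event, data_lines = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":  # blank line ends one event
            if data_lines:
                events.append((event, json.loads("\n".join(data_lines))))
            event, data_lines = "message", []
    return events

# A hypothetical slice of the interleaved stream: text, then the
# matching illustration, then a narration chunk for the same page.
stream = (
    'event: text\ndata: {"page": 1, "words": "Priya woke early"}\n\n'
    'event: image\ndata: {"page": 1, "url": "/img/p1.png"}\n\n'
    'event: audio\ndata: {"page": 1, "chunk": "UklGRg=="}\n\n'
)
for kind, payload in parse_sse(stream):
    print(kind, payload["page"])
```

Because text, image, and audio events carry a shared page index, the frontend can render each one the moment it arrives rather than waiting for a full page to finish.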


How we built it

The backend is a FastAPI service deployed on Cloud Run. Gemini 2.5 Flash generates the story text using a structured [PARAGRAPH] / [ILLUSTRATION] output format. Before any image is generated, we make a second Gemini call to build a locked Visual Bible — a detailed style guide with character descriptions, color palette, and lighting — that is prepended to every Imagen 3 prompt to enforce visual consistency. Imagen 3 on Vertex AI renders each illustration. Cloud Text-to-Speech narrates each paragraph.

All three streams are multiplexed into a single SSE pipeline and delivered to the React frontend word by word, image by image, audio chunk by audio chunk. We built a 4-key API rotation system that health-checks all keys on startup and automatically fails over on rate limits. A 4-level image fallback system — full prompt, simplified prompt, generic safe scene, PIL placeholder — guarantees stories always have 6 illustrations regardless of safety filter decisions.

The frontend is deployed on Firebase Hosting with a claymorphism dark UI, skeleton loaders, and localStorage story persistence. The MCP server is built with FastMCP and deployed on Cloud Run, exposing 5 tools globally over Streamable HTTP.
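The 4-level fallback can be sketched as an ordered chain where only the final level is guaranteed. Everything here is a simplification: `render` stands in for the Imagen 3 call, the prompt-simplification rule is invented for illustration, and the real level 4 draws a PIL placeholder image rather than returning a marker.

```python
def generate_with_fallback(scene: str, render):
    """Try progressively safer prompts; the last level cannot fail.

    `render` is a stand-in for the image-model call and is expected to
    raise when the safety filter blocks a prompt. Level names mirror the
    chain described above; in production, level 4 renders a PIL
    placeholder image instead of returning a byte tag.
    """
    levels = [
        ("full", scene),                                     # full scene prompt
        ("simplified", scene.split(",")[0]),                 # drop risky detail
        ("generic", "a gentle watercolor storybook scene"),  # safe stock scene
    ]
    for name, prompt in levels:
        try:
            return name, render(prompt)
        except Exception:
            continue  # safety block or API error: fall through a level
    return "placeholder", b"PLACEHOLDER_PNG"  # level 4: always succeeds

# Simulate a filter that rejects everything except the generic prompt.
def strict_render(prompt: str) -> bytes:
    if "watercolor" not in prompt:
        raise RuntimeError("safety block")
    return b"IMAGE_BYTES"

print(generate_with_fallback("Priya crossing a busy street, dusk", strict_render))
# → ('generic', b'IMAGE_BYTES')
```

The invariant this buys is the one the writeup leans on: the function always returns *something*, so a story can never end up with fewer than 6 illustrations.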


Challenges we ran into

The hardest problem was Gemini's responseModalities: ["TEXT", "IMAGE"] interleaved output not being available on Vertex AI for our model version. We pivoted to a two-model architecture — Gemini 2.5 Flash for text via the REST API, Imagen 3 for illustrations via Vertex AI — and built the Visual Bible system to compensate for the loss of native interleaving by enforcing character consistency at the prompt level. Imagen 3's safety filter blocked scenes unpredictably, requiring the 4-level fallback system so generation never silently fails. The MCP Streamable HTTP spec requires a session initialization handshake that PowerShell's curl alias couldn't complete — we debugged it with a Node.js script that traces the full initialize → notifications/initialized → tools/list flow. Windows PowerShell's lack of \ line continuation and Cloud Storage's rejection of bucket names containing "google" added friction throughout the deployment process.
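For reference, the three JSON-RPC messages in that handshake look roughly like this (our debugging used Node.js; the shapes are language-agnostic). The protocol version string and client name below are illustrative, and per the Streamable HTTP transport, each POST should send `Accept: application/json, text/event-stream` and echo the `Mcp-Session-Id` header the server returns from initialize.

```python
import json

PROTOCOL_VERSION = "2024-11-05"  # illustrative; use the version your server speaks

def initialize_request(req_id: int = 1) -> dict:
    """Step 1: open a session. The server's response carries a session id."""
    return {
        "jsonrpc": "2.0", "id": req_id, "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "debug-client", "version": "0.1"},
        },
    }

def initialized_notification() -> dict:
    """Step 2: a notification has no "id", so the server sends no reply."""
    return {"jsonrpc": "2.0", "method": "notifications/initialized"}

def tools_list_request(req_id: int = 2) -> dict:
    """Step 3: only now may the client enumerate the exposed tools."""
    return {"jsonrpc": "2.0", "id": req_id, "method": "tools/list"}

print(json.dumps(initialize_request(), indent=2))
```

Skipping step 2 is exactly the failure mode a bare curl-style one-shot request produces: the server considers the session uninitialized and rejects tools/list.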


Accomplishments that we're proud of

We are proud of the Visual Bible system — a second Gemini call that generates a locked character and style guide before any illustration is rendered, giving Imagen 3 enough context to keep Priya's braids, Bruno's collar, and the Hyderabad street scenes consistent across all 6 pages. We are proud of the 4-level image fallback that makes generation bulletproof — stories always complete with 6 illustrations. We are proud of the MCP server deployed globally on Cloud Run, making Google Stories the first personalized storybook generator accessible as a live MCP tool to any AI agent. And we are proud that the entire stack — Gemini, Imagen 3, Cloud Run, Vertex AI, Firebase, Firestore, Cloud TTS — is GCP-native end to end.


What we learned

Prompt architecture matters as much as model selection. The Visual Bible approach — extracting a style guide from the story before generating any image — dramatically improved illustration consistency compared to passing style instructions inline. We learned that MCP's Streamable HTTP transport requires a full session lifecycle that most CLI tools don't support out of the box, so testing MCP servers takes a real client or a Node.js script that can handle SSE responses. We also learned that building for a deadline forces architectural clarity — every component that survived the build exists because it was genuinely necessary, not because it seemed like a good idea.
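The Visual Bible pattern reduces to two small functions: one extra model call to distill a locked style guide, and a prompt builder that prepends that guide to every scene. This is a sketch under assumptions: `ask_model` stands in for a Gemini text call, and the prompt wording is illustrative rather than the exact production prompt.

```python
def build_visual_bible(story_text: str, ask_model) -> str:
    """Second model call: distill a locked style guide before any image.

    `ask_model` is a stand-in for a Gemini text-generation call; the
    instruction wording here is illustrative, not the production prompt.
    """
    instruction = (
        "From the story below, write a concise visual style guide: each "
        "character's appearance (hair, clothing, colors), the setting, "
        "the color palette, and the lighting. Be specific and reusable.\n\n"
        + story_text
    )
    return ask_model(instruction)

def illustration_prompt(visual_bible: str, scene: str) -> str:
    # The locked guide is prepended to every per-scene prompt, so the
    # image model sees identical character descriptions on all 6 pages.
    return f"{visual_bible}\n\nScene to illustrate: {scene}\n\nStyle: soft watercolor."

# A canned model response stands in for the real Gemini call.
bible = build_visual_bible(
    "Priya and her dog Bruno explore Hyderabad...",
    lambda prompt: "Priya: girl with two braids, yellow kurta. Bruno: brown dog, red collar.",
)
print(illustration_prompt(bible, "Bruno tugs Priya toward the market"))
```

The inline-style alternative regenerates the character description per image, which is where drift creeps in; prepending one frozen guide is what keeps Bruno's collar red on page 6.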


What's next for Google Stories

The immediate next step is PDF export — a downloadable storybook with the child's name on the cover, ready to print or share as a gift. After that, a character memory system where parents define a character once and reuse them across multiple stories, so Priya and Bruno can go on new adventures without re-describing them every time. We want to add real-time voice input using the Gemini Live API so the brief can be a natural conversation rather than a single spoken sentence. Longer term, Google Stories becomes a platform — schools use it to generate personalized lesson stories for each student, publishers use it to create interactive editions of existing books, and the gift market unlocks a new category of meaningful personalized presents. The personalized children's content market is over a billion dollars. We have built the infrastructure. The stories are just beginning.

Built With

  • artifact-registry
  • cloud-build
  • cloud-logging
  • cloud-monitoring
  • cloud-storage
  • cloud-text-to-speech
  • fastapi
  • fastmcp
  • firebase-hosting
  • firestore
  • gemini-2.5-flash
  • google-cloud-run
  • httpx
  • imagen-3
  • model-context-protocol-mcp
  • pillow
  • python
  • react
  • server-sent-events-sse
  • vertex-ai
  • vite
  • web-speech-api