Inspiration
As a child, I possessed a "cinema of the mind." Every sentence I read triggered a vivid, high-definition projection in my head—I didn't just see words; I saw the refraction of light on a character's armor and felt the atmosphere of a fictional world. However, I soon realized that for many—children with learning barriers, language learners, or those with different cognitive styles—text remains a flat, silent obstacle. My inspiration was to build a bridge: a tool that externalizes that rich, internal imagination, projecting it onto the screen so that the "soul" of a story is accessible to everyone, regardless of their reading level or native tongue.
What it does
Vivido AI is an autonomous "Director Agent" that transforms text into an immersive, page-by-page 'Vibe Experience' that synchronizes visual aesthetics, emotional lighting, and adaptive audio.
- Personalization: Create your own CAST for an existing story or for an imaginary story of your own. Use up to 14 reference images (via Nano Banana Pro) to ground the CAST, ensuring your protagonist looks exactly like your reference on every page. You can also personalize individual pages with your own actors.
- Self-Explanatory Visuals: It converts any thought into a series of high-fidelity images and videos that bring the narrative to life.
- Interactive Translation: It provides audio and text overlays in any language, using Gemini Live to narrate with emotive, context-aware voices.
- Story State: Being able to save a "Story State" is crucial for collaborative storytelling. You can save and upload your complete story state: pages of text, images, audio, videos, the CAST, and a PDF of the whole story in your preferred language (a sketch of this structure follows below).
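
For illustration, here is a minimal sketch of what a serialized story state could contain. The field names are my assumptions for this writeup, not Vivido's actual schema:

```python
# Minimal sketch of a saved "Story State"; field names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Page:
    text: str                         # the page's narrative text
    image_path: str                   # rendered illustration for this page
    audio_path: str                   # narration clip in the chosen language
    video_path: Optional[str] = None  # optional Veo transition into the next page

@dataclass
class StoryState:
    title: str
    language: str                    # preferred narration/overlay language
    cast: dict[str, list[str]]       # character name -> reference image paths
    pages: list[Page] = field(default_factory=list)
    pdf_path: Optional[str] = None   # compiled PDF of the whole story
```

Serializing everything into one bundle is what makes the upload/share workflow possible: a collaborator loads the state and continues from the exact same CAST and page history.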
How I built it
The project is built on the Gemini 3 Pro ecosystem, leveraging a "Rolling Context" architecture:
- The State Engine: I utilized Thought Signatures to maintain visual continuity. By computing a "Visual State" S_n for each page n, I ensured characters and settings remained consistent from page to page (see the sketch after this list).
- Image & Video Generation: I integrated Nano Banana Pro with localized "Paint-to-Edit" controls and used Veo for creating fluid video transitions between story beats.
- Google Antigravity: I used this agent-first IDE to orchestrate a "workforce" of agents for storyboarding, translation, and per-page brainstorming.
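
To make the "Rolling Context" idea concrete, here is a hedged sketch of how a per-page Visual State S_n could be carried forward. `extract_visual_facts` is a stub standing in for the actual Gemini extraction call, and none of these names come from Vivido's codebase:

```python
# Sketch of the rolling Visual State: S_n = merge(S_{n-1}, facts from page n).
from dataclasses import dataclass, field

@dataclass
class VisualState:
    characters: dict[str, str] = field(default_factory=dict)  # name -> locked look
    setting: str = ""                                         # current scene

def extract_visual_facts(page_text: str) -> VisualState:
    # Stub: in practice this would be a Gemini call that parses the page text
    # into character descriptions and a setting. Stubbed so the sketch runs.
    return VisualState()

def update_state(prev: VisualState, page_text: str) -> VisualState:
    """Produce S_n from S_{n-1}: new facts are added, locked looks never change."""
    new = extract_visual_facts(page_text)
    merged = VisualState(characters=dict(prev.characters),
                         setting=new.setting or prev.setting)
    for name, look in new.characters.items():
        merged.characters.setdefault(name, look)  # first-seen description wins
    return merged

def image_prompt(state: VisualState, page_text: str) -> str:
    """Ground the image request in the locked cast so page n matches page n-1."""
    cast_rules = "; ".join(f"{n}: {d}" for n, d in state.characters.items())
    return f"Scene: {state.setting}. Cast (immutable): {cast_rules}. Action: {page_text}"
```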
Challenges I ran into
The primary hurdle was Narrative Drift: in early iterations, a character might have a beard on page one and be clean-shaven on page two. Vivido solves this with a "Physical Manifesto" held inside the 1M-token context window, a set of immutable visual rules the model must follow on every generation. The other major challenge was latency: generating 4K assets in real time is computationally heavy. Moving to a streaming architecture with Context Caching means the model doesn't have to "re-read" the whole book for every new image, which reduced response times drastically (a caching sketch appears below).
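
The caching half of that fix follows a standard pattern. Here is a sketch assuming the google-genai Python SDK; the model name, TTL, and manifesto text are placeholders, and explicit caching generally requires the cached content to exceed a minimum token count:

```python
# Sketch: cache the book + Physical Manifesto once, reuse it per page.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

manifesto = (
    "Immutable visual rules: the protagonist has a scar over her left "
    "eyebrow and auburn hair; the lighthouse is red-and-white striped."
)
full_book_text = open("book.txt").read()

# Create the cache once per story session...
cache = client.caches.create(
    model="gemini-1.5-pro-002",  # placeholder; use a cache-capable model
    config=types.CreateCachedContentConfig(
        display_name="vivido-book-context",
        system_instruction=manifesto,
        contents=[full_book_text],
        ttl="3600s",
    ),
)

# ...then every per-page request references it instead of re-sending the book.
response = client.models.generate_content(
    model="gemini-1.5-pro-002",
    contents="Write the illustration brief for page 12.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```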
Accomplishments that we're proud of
I'm most proud of the Semantic Fidelity. My agent doesn't just draw "a man crying"; it understands the subtext. If a character is crying out of joy, the lighting, color palette, and facial micro-expressions generated by Nano Banana Pro reflect that specific emotion. I also successfully achieved Real-Time Style Pivoting, allowing a user to change the entire aesthetic of a book (e.g., from "Watercolor" to "Cyberpunk") instantly without losing character identity.
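
One plausible way to pivot style without losing identity (a sketch of the idea, not Vivido's actual prompt template) is to repeat the character manifesto verbatim and swap only the style token:

```python
# Sketch: only the style token changes; every identity rule repeats verbatim.
MANIFESTO = "Elena: scar over left eyebrow, auburn hair, green raincoat"

def pivot_prompt(page_action: str, style: str) -> str:
    return (f"Art style: {style}. "
            f"Cast (do NOT alter appearances): {MANIFESTO}. "
            f"Scene: {page_action}.")

watercolor = pivot_prompt("Elena climbs the lighthouse stairs", "soft watercolor")
cyberpunk = pivot_prompt("Elena climbs the lighthouse stairs", "neon cyberpunk")
```

Because the identity constraints dominate the prompt, the style becomes the only degree of freedom in the re-render.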
What I learned
Building Vivido AI taught me that Context is King. I learned that for an AI to be a true "Director," it needs more than just image generation; it needs a deep, persistent memory of the story's past. I also discovered the power of Affective Dialogue: when the AI's narration voice matches the "vibe" of the visual, the user's emotional engagement increases, suggesting that multimodal harmony is essential for learning.
What's next for Vivido
The journey is just beginning. My next steps involve:
- Interactive "What If" Modes: Allowing users to change a character's choice and seeing the "Vivido" branch into a new, AI-generated storyline.
- AR Integration: Bringing the images off the screen and into the user’s room via augmented reality glasses.
- Universal Literacy Architecture: Collaborating with educators to tailor the engine for neurodivergent students and help bridge the literacy gap globally.
Built With
- gemini
- google-antigravity
- nano-banana