The Architecture Diagram
Introduction
Starting the agent
Agent data collection via Live API: Age group 3-7
Story Loading Screen
Gemini 2.5 Flash Live generates interleaved story text and audio, while Gemini 3.1 Flash Image simultaneously creates custom illustrations.
The user discusses the story with the live agent.
Create New Story
Agent data collection via Live API: Age group 18-29
More advanced, extended story with corresponding high-detail visuals and narrative depth.

Tales of Wonder

Inspiration

As a parent and educator, I've always been fascinated by the power of storytelling to spark imagination and teach valuable life lessons. However, I noticed that most digital storytelling tools either lack personalization or fail to adapt to different age groups effectively. When I discovered Gemini 2.5 Flash's Live API with interleaved multimodal capabilities, I saw an opportunity to create something truly magical: an AI storyteller that could generate personalized, age-adaptive stories with synchronized text, images, and live narration - all streaming together in real time.

The inspiration came from watching children of different ages react to the same story. A 5-year-old needs simple vocabulary and playful tones, while a teenager craves complex narratives with dramatic themes. Traditional storytelling apps treat all users the same, but I envisioned an autonomous agent that could intelligently adapt every aspect of the story - from vocabulary complexity to illustration style - based on the user's age.

What it does

Tales of Wonder is an AI-powered storytelling agent that creates personalized, age-adaptive stories through natural voice interaction. Here's how it works:

Voice-First Experience:

Users simply speak their name, age, and story theme
No typing required - the entire experience is conversational
Gemini Live API handles natural language understanding

Autonomous Age Adaptation:

The agent automatically adapts 6 key parameters based on age (3-7, 8-12, 13-17, 18-29, 30-59, 60+):
- Vocabulary level (basic to sophisticated)
- Sentence complexity (simple to layered)
- Tone (playful to reflective)
- Illustration style (cartoon to artistic)
- Chapter length (20-50 words)
- Narrative pacing
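
The age-to-parameter mapping described above can be pictured as a small lookup table. The following is a minimal sketch, not the project's actual code: the six age brackets and the 20-50 word range come from the list above, while the class name, function name, and every individual parameter value are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StoryParams:
    """The six adaptive parameters (field names are hypothetical)."""
    vocabulary: str
    sentence_complexity: str
    tone: str
    illustration_style: str
    chapter_words: int
    pacing: str


# (min_age, max_age, params). The brackets follow the six age groups;
# the values inside each StoryParams are illustrative guesses only.
AGE_TABLE: List[Tuple[int, Optional[int], StoryParams]] = [
    (3, 7, StoryParams("basic", "simple", "playful", "cartoon", 20, "brisk")),
    (8, 12, StoryParams("everyday", "simple", "adventurous", "cartoon", 25, "brisk")),
    (13, 17, StoryParams("varied", "compound", "dramatic", "stylized", 30, "moderate")),
    (18, 29, StoryParams("rich", "compound", "engaging", "detailed", 40, "moderate")),
    (30, 59, StoryParams("sophisticated", "layered", "thoughtful", "artistic", 45, "measured")),
    (60, None, StoryParams("sophisticated", "layered", "reflective", "artistic", 50, "measured")),
]


def params_for_age(age: int) -> StoryParams:
    """Return the parameter set for the bracket that contains `age`."""
    for low, high, params in AGE_TABLE:
        if age >= low and (high is None or age <= high):
            return params
    raise ValueError(f"no age bracket defined for age {age}")
```

For example, params_for_age(5).tone gives "playful" in this sketch, and every age from 18 to 29 resolves to the same parameter set. Keeping the rules in one table makes the mapping easy to review and to test bracket by bracket.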

Interleaved Multimodal Output:

Text, images, and live audio narration stream together in a single, fluid output
Not sequential - all modalities are truly interleaved
Gemini 3.1 Flash Image (Nano Banana 2) generates AI illustrations inline with the story
Real-time audio narration using Gemini's native audio capabilities
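
To make the difference between sequential and interleaved delivery concrete, here is a small simulation. It does not call Gemini at all: three stand-in producers (hypothetical names, made-up delays) push tagged chunks onto one shared queue, and the consumer receives them in arrival order rather than one modality at a time.

```python
import asyncio
from typing import List, Optional, Sequence, Tuple

Event = Tuple[str, Optional[str]]  # (modality, payload); None marks "done"


async def produce(modality: str, items: Sequence[str], delay: float,
                  queue: "asyncio.Queue[Event]") -> None:
    """Simulated producer: emits one chunk every `delay` seconds."""
    for item in items:
        await asyncio.sleep(delay)
        await queue.put((modality, item))
    await queue.put((modality, None))  # sentinel: this modality is finished


async def interleave(sources: Sequence[Tuple[str, Sequence[str], float]]) -> List[Event]:
    """Run all producers concurrently and collect chunks in arrival order."""
    queue: "asyncio.Queue[Event]" = asyncio.Queue()
    tasks = [asyncio.create_task(produce(m, items, d, queue))
             for m, items, d in sources]
    remaining = len(tasks)
    received: List[Event] = []
    while remaining:
        modality, payload = await queue.get()
        if payload is None:
            remaining -= 1  # one producer has finished
        else:
            received.append((modality, payload))
    await asyncio.gather(*tasks)
    return received


if __name__ == "__main__":
    demo = [
        ("text", ["Once upon a time", "a fox set out"], 0.01),
        ("image", ["<illustration 1>"], 0.02),
        ("audio", ["<audio chunk 1>", "<audio chunk 2>"], 0.015),
    ]
    for modality, payload in asyncio.run(interleave(demo)):
        print(modality, payload)
```

The order within each modality is preserved, while the order across modalities depends on when each chunk becomes available, which is the property the list above calls interleaved.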

Interactive Discussion Mode:

After the story ends, users can engage in natural voice conversations about the tale
Ask questions about characters, themes, or plot
AI provides thoughtful responses and encourages deeper thinking

Complete Story Structure:

Every story includes 3 chapters with inline illustrations
Age-appropriate moral or lesson at the end
Consistent narrative arc with beginning, middle, and end

How we built it

Architecture:

We built Tales of Wonder using a modern, cloud-native architecture:

Backend (Python + FastAPI):

Story Generation Agent: Autonomous decision-making system with 5 components:
- Input Processor: Extracts name, age, and theme from voice input
- Decision Engine: Maps age to adaptive parameters using rule-based logic
- Stream Orchestrator: Manages Gemini API streaming and interleaved output
- Output Handler: Formats content for frontend rendering
- TTS Processor: Handles audio stream processing
Voice Proxy Handler: WebSocket server managing Gemini Live API connections
- Bidirectional audio streaming
- Session state management
- Mode switching (data collection, narration, discussion)
Story Discussion: Post-story conversation handler using Gemini Live API
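
The mode switching handled by the Voice Proxy Handler can be sketched as a small state machine. The three mode names come from the list above; the allowed transitions, the Session class, and its method names are assumptions made for illustration and are not taken from the project's code.

```python
from enum import Enum
from typing import Dict, Set


class Mode(Enum):
    DATA_COLLECTION = "data_collection"
    NARRATION = "narration"
    DISCUSSION = "discussion"


# Assumed flow: collect name/age/theme, narrate the story, then discuss it.
# Returning to data collection models starting a new story.
ALLOWED: Dict[Mode, Set[Mode]] = {
    Mode.DATA_COLLECTION: {Mode.NARRATION},
    Mode.NARRATION: {Mode.DISCUSSION},
    Mode.DISCUSSION: {Mode.DATA_COLLECTION},
}


class Session:
    """Per-connection state (hypothetical structure)."""

    def __init__(self) -> None:
        self.mode = Mode.DATA_COLLECTION

    def switch(self, target: Mode) -> None:
        """Move to `target`, rejecting transitions the flow does not allow."""
        if target not in ALLOWED[self.mode]:
            raise ValueError(f"cannot switch from {self.mode.value} to {target.value}")
        self.mode = target
```

Making the transitions explicit means an out-of-order message, such as a discussion request arriving before narration has started, fails loudly instead of leaving the session in an undefined state.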

Frontend (Vanilla JavaScript):

Voice Activation Controller: Microphone access and audio capture
WebSocket Client: Real-time bidirectional communication
Stream Renderer: Progressive rendering with typewriter effects and fade-in animations
Audio Playback: Synchronized audio output
Glass Morphism UI: Modern, accessible design

Google Cloud Integration:

Gemini 2.5 Flash Live API: Voice input/output, text generation, live narration
Gemini 3.1 Flash Image (Nano Banana 2): AI-generated illustrations
Cloud Run: Serverless backend hosting with auto-scaling
Firebase Hosting: Frontend hosting with global CDN
Cloud Firestore: Story metadata storage
Cloud Storage: Generated image storage
Cloud Build: CI/CD pipeline for automated deployments

Development Methodology:

We used spec-driven development with property-based testing:

Created formal specifications for each feature
Defined correctness properties that must hold
Implemented property-based tests using Hypothesis (Python) and fast-check (JavaScript)
Validated behavior across all age groups and edge cases

Key Technologies:

Python 3.11+ with FastAPI and Pydantic
Google GenAI SDK for Gemini integration
WebSocket for real-time communication
Web Audio API for audio capture/playback
pytest + Hypothesis for backend testing
Jest + fast-check for frontend testing

Challenges we ran into

1. Interleaved Output Synchronization:

The biggest challenge was achieving true interleaved multimodal output. Initially, we tried sequential generation (text → images → audio), but this felt disjointed. Gemini 2.5 Flash's interleaved capabilities were key, but we had to:

Handle markdown pattern splitting across chunks (e.g., ** for bold text)
Implement text buffering to prevent incomplete markdown from rendering
Synchronize audio narration with text streaming
Manage image generation timing to maintain narrative flow

Solution: We built a buffering layer that detects incomplete markdown patterns (for example, an opening ** whose closing ** has not arrived yet) and holds that text back until the pattern is complete. Partially formed markup never reaches the screen, so the output renders smoothly.

2. Age-Adaptive Parameter Tuning:

Determining the right parameters for each age group required extensive research and testing. We had to balance:

Vocabulary complexity vs. comprehension
Story length vs. attention span
Illustration style vs. age preferences
Tone appropriateness vs. engagement

Solution: We created a decision matrix based on educational psychology research and iteratively refined it through testing with users across different age groups.

3. WebSocket Connection Stability:

Managing WebSocket connections for voice streaming proved challenging:

Connection drops during long stories
Audio buffer management
Session state persistence
Error recovery without disrupting the experience

Solution: We implemented robust error handling, automatic reconnection logic, and session state management to ensure seamless experiences even with network issues.
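
Automatic reconnection is commonly built on exponential backoff with jitter. The sketch below assumes that approach; the writeup does not say which strategy the project uses, and the function name, parameters, and default values are all hypothetical. The connect argument stands in for whatever coroutine opens the real WebSocket.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def connect_with_backoff(
    connect: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Call `connect` until it succeeds, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await connect()
        except OSError:  # includes ConnectionError and its subclasses
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from many clients reconnecting at once.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("unreachable")
```

Session state, such as the current mode and the collected name, age, and theme, would be kept outside the connection object so that a successful reconnect can resume where the story left off.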

4. Gemini API Rate Limits and Costs:

During development, we hit rate limits and had to optimize:

API call frequency
Prompt engineering for efficiency
Caching strategies
Cost management

Solution: We implemented request batching, optimized prompts to reduce token usage, and added intelligent caching for repeated requests.
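
One simple form of caching for repeated requests is to key each response by a hash of the prompt and its parameters. This is a sketch under that assumption: the writeup does not specify which requests are cached or how, and PromptCache and its generate callback are hypothetical names standing in for the real model call.

```python
import hashlib
import json
from typing import Callable, Dict


class PromptCache:
    """In-memory cache that avoids repeating identical generation requests."""

    def __init__(self, generate: Callable[[str, dict], str]) -> None:
        self._generate = generate  # stand-in for the actual API call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt: str, params: dict) -> str:
        """Return a cached response, or generate and store a new one."""
        key = self._key(prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._generate(prompt, params)
        self._store[key] = result
        return result
```

A cache like this suits deterministic helper requests better than story text itself, since returning an identical story for a repeated theme may not be the desired behavior.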

5. Property-Based Testing Complexity:

Writing property-based tests for an AI system was challenging because:

AI outputs are non-deterministic
Universal properties are hard to define for creative content
Comprehensive coverage takes a long time to execute

Solution: We focused on structural properties (story has 3 chapters, age-appropriate parameters are selected) rather than content properties, and used Hypothesis/fast-check to generate diverse test cases efficiently.
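
Structural properties of this kind can be written as plain predicate functions that a Hypothesis or fast-check test then drives with generated inputs. The sketch below shows only the predicate side using the standard library; the Story and Chapter types and their field names are assumptions, while the rules themselves (3 chapters, inline illustrations, a closing moral, 20-50 words per chapter) come from the feature description earlier in this writeup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chapter:
    text: str
    illustration_url: Optional[str]  # hypothetical field name


@dataclass
class Story:
    chapters: List[Chapter]
    moral: str


def structural_violations(story: Story, min_words: int = 20,
                          max_words: int = 50) -> List[str]:
    """Return a list of broken structural rules; empty means the story passes."""
    problems: List[str] = []
    if len(story.chapters) != 3:
        problems.append(f"expected 3 chapters, got {len(story.chapters)}")
    for number, chapter in enumerate(story.chapters, start=1):
        words = len(chapter.text.split())
        if not min_words <= words <= max_words:
            problems.append(f"chapter {number} has {words} words")
        if not chapter.illustration_url:
            problems.append(f"chapter {number} has no illustration")
    if not story.moral.strip():
        problems.append("story has no moral")
    return problems
```

Because the check never looks at what the story says, it gives the same verdict no matter how the model phrases the content, which is what makes it usable against non-deterministic output.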

Accomplishments that we're proud of

1. True Interleaved Multimodal Output:

We achieved genuine interleaved streaming where text, images, and audio flow together naturally - not sequentially. This creates a magical experience that feels like a professional audiobook with live illustrations.

2. Autonomous Age Adaptation:

The agent makes intelligent decisions without manual intervention. Users simply provide their age, and the system automatically adapts 6 parameters to create age-appropriate content. This demonstrates true AI autonomy.

3. Comprehensive Testing:

We implemented property-based testing across the entire stack:

50+ property-based tests validating universal behaviors
100+ unit tests for specific scenarios
Integration tests for end-to-end flows
Together, these layers give us confidence in the system's correctness and reliability

4. Production-Ready Deployment:

We built a fully automated CI/CD pipeline:

Infrastructure-as-code with Cloud Build
Automated deployment scripts
Zero-downtime deployments
Monitoring and logging

5. Accessibility and UX:

We prioritized accessibility:

Voice-first design (no typing required)
Glass morphism UI with high contrast
Responsive design for all devices
Clear error messages and guidance

6. Complete Documentation:

We created comprehensive documentation:

Architecture diagrams with data flow
Reproducible testing instructions
GCP setup automation scripts
Code examples demonstrating GCP integration

What we learned

1. Interleaved Output is the Future:

Working with Gemini 2.5 Flash's interleaved capabilities showed us that the future of AI interaction isn't sequential (text, then images, then audio) - it's truly multimodal and simultaneous. This creates more natural, engaging experiences.

2. Age Adaptation Requires Deep Understanding:

Building age-adaptive systems taught us that it's not just about vocabulary - it's about tone, pacing, complexity, visual style, and narrative structure. True adaptation requires holistic consideration of all these factors.

3. Property-Based Testing for AI:

We learned that property-based testing is invaluable for AI systems. Instead of testing specific outputs (which are non-deterministic), we test structural properties and invariants. Because each property is checked against many generated inputs rather than a few hand-picked examples, this provides stronger guarantees of correctness.

4. WebSocket Management is Complex:

Real-time bidirectional communication requires careful state management, error handling, and recovery strategies. We learned to design for failure and implement graceful degradation.

5. Prompt Engineering is an Art:

Crafting prompts that consistently produce desired outputs across different age groups and themes required iteration and experimentation. We learned to be specific, provide examples, and set clear constraints.

6. Cloud-Native Architecture Scales:

Using Google Cloud services (Cloud Run, Firebase, Firestore) allowed us to build a scalable, reliable system without managing infrastructure. Serverless is powerful for AI applications.

7. Voice UX is Different:

Designing for voice-first interaction taught us that traditional UI patterns don't apply. We had to think about conversation flow, error recovery, and providing audio feedback.

What's next for Tales of Wonder

1. Multi-Language Support:

Expand to support storytelling in multiple languages, leveraging Gemini's multilingual capabilities. This would make Tales of Wonder accessible to users of all ages worldwide.

2. Story Customization:

Allow users to specify additional preferences:

Character names and traits
Story settings (fantasy, sci-fi, historical)
Moral lessons to emphasize
Story length preferences

3. Story Library and Sharing:

Build a library where users can:

Save their favorite stories
Share stories with friends and family
Rate and review stories
Discover popular themes

4. Educational Integration:

Partner with schools and educators to:

Align stories with curriculum standards
Generate stories for specific learning objectives
Track reading comprehension and engagement
Provide teacher dashboards

5. Advanced Personalization:

Use machine learning to:

Learn user preferences over time
Recommend themes based on past stories
Adapt difficulty dynamically based on engagement
Personalize illustration styles

6. Collaborative Storytelling:

Enable multiple users to:

Co-create stories together
Take turns adding to the narrative
Vote on story directions
Create branching narratives

7. Accessibility Enhancements:

Add features for users with disabilities:

Screen reader optimization
Adjustable narration speed
Visual customization (font size, contrast)
Closed captions for audio

8. Mobile Apps:

Develop native iOS and Android apps with:

Offline story playback
Download stories for later
Push notifications for new features
Better mobile UX

9. Analytics and Insights:

Provide parents and educators with:

Reading time tracking
Engagement metrics
Vocabulary exposure reports
Learning progress insights

10. Community Features:

Build a community around storytelling:

User-generated themes
Story contests and challenges
Creator profiles
Social sharing

Tales of Wonder demonstrates the power of combining Gemini's multimodal capabilities with thoughtful design and autonomous decision-making. We're excited to continue evolving this platform and bringing magical storytelling experiences to users of all ages.

Built With

cloud-build
cloud-firestore
cloud-storage
css3
docker
fast-check
fastapi
firebase-hosting
gemini-2.5-flash-live-api
gemini-3.1-flash-image
git
google-cloud-run
google-genai-sdk
html5
hypothesis
javascript
jest
pydantic
pytest
python
web-audio-api
websocket

Updates

Mehmet Akif Acar started this project — Mar 10, 2026 08:17 AM EDT
