Inspiration

We wanted to let parents generate educational, inspiring stories for their kids. Storytelling is one of the most natural ways children learn.

The rise of multimodal AI — models that can reason about text and generate images in the same context — made us believe this was finally possible. We were also inspired by how children's books, marketing campaigns, and educational content all share the same core need: a coherent narrative paired with visuals that reinforce the story. Future Artist was born from that idea.


What it does

Future Artist is a multimodal AI storytelling platform. You give it a topic, a tone, and a target audience — it generates a complete, illustrated, narrated story streamed to you in real time.

  • 4 story types: Storybook, Marketing Campaign, Educational, Social Media
  • 4 tone settings: Playful, Inspiring, Professional, Neutral — each affecting both writing style and speech delivery
  • AI-generated illustrations per scene, styled to match your chosen visual style (Cartoon, Realistic, Minimalist, Modern)
  • Reading Mode for a distraction-free, full-width reading experience
  • Real-time streaming — content appears scene by scene as it's generated, not all at once

🔗 Live demo: https://futureartist-frontend-226638196775.us-central1.run.app 💻 GitHub: https://github.com/stevenchendan/futureArtist


How we built it

We built Future Artist on an end-to-end Google stack.

AI & Agents

  • Google Gemini 2.5 Flash handles both text and image generation
  • Google ADK (Agent Development Kit) powers a multi-agent pipeline with 5 specialized agents: Story Planner, Style Director, Text Generator, Image Generator, and Audio Generator
  • An Orchestrator agent coordinates the pipeline and manages the streaming output
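The sequential hand-off between agents can be sketched as below. This is a minimal illustration of the pattern, not the actual ADK API; the class and field names (`StoryContext`, `run`, `style_rules`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StoryContext:
    """Shared state threaded from agent to agent."""
    topic: str
    tone: str
    scenes: list = field(default_factory=list)
    style_rules: str = ""

class StoryPlanner:
    def run(self, ctx: StoryContext) -> StoryContext:
        # In the real system this calls Gemini; here we stub a 3-scene plan.
        ctx.scenes = [f"Scene {i + 1} about {ctx.topic}" for i in range(3)]
        return ctx

class StyleDirector:
    def run(self, ctx: StoryContext) -> StoryContext:
        # Style rules written here flow into every downstream image prompt.
        ctx.style_rules = f"Keep a {ctx.tone} tone; characters look identical in every scene."
        return ctx

def orchestrate(ctx: StoryContext, agents) -> StoryContext:
    """The Orchestrator: run each agent in order, passing the shared context through."""
    for agent in agents:
        ctx = agent.run(ctx)
    return ctx

ctx = orchestrate(StoryContext(topic="the ocean", tone="Playful"),
                  [StoryPlanner(), StyleDirector()])
print(len(ctx.scenes))  # 3
```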

Backend

  • Python 3.11 + FastAPI + Uvicorn
  • WebSocket endpoint streams typed chunks (text, image, audio) to the frontend as each scene completes
  • Agents pass context between each other — the Style Director's character consistency rules flow into every Image Generator prompt
  • Greatly reduced generation time for multi-format content (text, image, audio) by using ADK to call multiple model API endpoints asynchronously
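The per-scene streaming loop can be sketched with plain asyncio. This is an illustrative sketch, not our actual endpoint: `send` stands in for the framework's WebSocket send coroutine (FastAPI's `websocket.send_json` in our case), and `generate_scene` stubs the Gemini calls.

```python
import asyncio

async def generate_scene(n: int) -> dict:
    await asyncio.sleep(0)  # stand-in for the real Gemini text/image calls
    return {"text": f"scene {n} text", "image": f"img{n}.png"}

async def stream_story(send, num_scenes: int = 3) -> None:
    """Emit typed chunks as each scene completes, not after the whole story."""
    for n in range(1, num_scenes + 1):
        scene = await generate_scene(n)
        await send({"type": "text", "scene": n, "data": scene["text"]})
        await send({"type": "image", "scene": n, "data": scene["image"]})
    await send({"type": "done"})

async def main() -> list:
    chunks = []
    async def send(chunk):  # capture chunks instead of a real socket
        chunks.append(chunk)
    await stream_story(send)
    return chunks

chunks = asyncio.run(main())
print(chunks[0]["type"], chunks[-1]["type"])  # text done
```

Because each chunk is sent the moment its scene finishes, the browser starts rendering scene 1 while scene 2 is still generating.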

Frontend

  • Next.js 14 + TypeScript + Tailwind CSS
  • WebSocket client renders each chunk inline as it arrives — the story literally builds itself on screen
  • AudioPlayer component uses Web Speech API with per-tone rate and pitch settings

Infrastructure

  • Deployed on Google Cloud Run (us-central1) for both frontend and backend
  • CI/CD via Google Cloud Build and GitHub Actions

Challenges we ran into

Character consistency across scenes — Gemini generates each image independently, so characters would change appearance between scenes. We solved this by having the Style Director agent build an explicit character description sheet (skin tone, hair, clothing, expression) that gets injected into every image prompt.
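The injection step can be sketched as follows. The sheet fields mirror the attributes mentioned above, but the concrete values and the prompt wording here are illustrative, not our production prompt.

```python
# Hypothetical character sheet produced by the Style Director agent.
CHARACTER_SHEET = {
    "skin tone": "warm brown",
    "hair": "short curly black hair",
    "clothing": "yellow raincoat",
    "expression": "curious smile",
}

def build_image_prompt(scene_description: str, sheet: dict) -> str:
    """Prepend the character sheet to every image prompt so the model
    draws the same character in every scene."""
    traits = ", ".join(f"{k}: {v}" for k, v in sheet.items())
    return (f"Character sheet ({traits}). "
            f"Keep this character identical to previous scenes. "
            f"Scene: {scene_description}")

prompt = build_image_prompt("Mia finds a glowing shell on the beach", CHARACTER_SHEET)
```

Because the sheet is rebuilt once and reused verbatim, every scene's prompt carries byte-identical character constraints.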

Hex color codes appearing in images — Our image prompts originally included raw hex color values like #6C5B7B. Gemini rendered these literally as text labels in the images. We fixed this by mapping hex values to descriptive color names ("muted purple") and adding an explicit instruction to never render text or labels.
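The fix can be sketched as a lookup plus a regex pass over the prompt. The palette entries besides `#6C5B7B` are made-up examples, and the fallback wording is illustrative.

```python
import re

# Example palette mapping; #6C5B7B -> "muted purple" is from the actual bug.
HEX_TO_NAME = {
    "#6C5B7B": "muted purple",
    "#F8B195": "soft peach",
    "#355C7D": "deep slate blue",
}

def humanize_palette(prompt: str) -> str:
    """Replace raw hex codes with descriptive names and forbid rendered text,
    so the model stops painting literal '#6C5B7B' labels into the image."""
    def sub(match: re.Match) -> str:
        return HEX_TO_NAME.get(match.group(0).upper(), "a matching accent color")
    prompt = re.sub(r"#[0-9A-Fa-f]{6}", sub, prompt)
    return prompt + " Do not render any text, labels, or color codes in the image."

out = humanize_palette("Background in #6C5B7B with #F8B195 highlights.")
print(out)
```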

Streaming architecture — Building a pipeline where 5 agents run sequentially but results stream incrementally to the browser required careful WebSocket chunk typing and frontend state management. Each chunk carries enough metadata for the frontend to know where to render it.
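The routing side of that chunk typing can be sketched as below. The wire-format field names (`type`, `scene_number`, `data`) are illustrative, not the exact schema.

```python
import json

def route_chunk(raw: str, story_state: dict) -> dict:
    """Place an incoming chunk into per-scene state keyed by its metadata,
    so the renderer knows exactly where each piece belongs."""
    chunk = json.loads(raw)
    scene = story_state.setdefault(chunk["scene_number"], {})
    scene[chunk["type"]] = chunk["data"]
    return story_state

state = {}
route_chunk('{"type": "text", "scene_number": 1, "data": "Once upon a time..."}', state)
route_chunk('{"type": "image", "scene_number": 1, "data": "scene1.png"}', state)
print(state[1]["image"])  # scene1.png
```

The same keying logic lives in our TypeScript client; arriving chunks slot into the scene they belong to regardless of arrival order within a scene.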

ReactMarkdown compatibility — The backend returns structured objects { text, scene_number, metadata } but ReactMarkdown requires plain strings. We had to normalize content at the rendering layer.

Deployment issues — Deploying via CI/CD surfaced several GCP issues: insufficient service account permissions, exposing Secret Manager secrets to the Cloud Run service, and passing extra build-time arguments to the Docker build. We resolved all of them and learned a great deal in the process.


Accomplishments that we're proud of

  • A true multi-agent system built with Google ADK where each agent has a distinct role and passes structured context to the next
  • Real-time interleaved streaming — text, images, and audio controls appear progressively in one unified reading experience, not as separate outputs
  • Tone coherence end-to-end — the same tone setting affects the story writing style, the image mood, and the speech delivery simultaneously
  • A fully deployed, publicly accessible application on Google Cloud Run that anyone can use without any setup

What we learned

  • Google ADK is a powerful framework for building multi-agent systems — defining agent roles, passing structured state between them, and hooking into Gemini's API cleanly made the pipeline far more maintainable than a monolithic prompt
  • Prompt engineering for visual consistency is its own discipline — small changes to how you describe characters and colors have a large impact on image output
  • Streaming UX changes the feel of AI generation — watching a story build itself scene by scene feels alive in a way that waiting for a complete response never does
  • Multimodal generation has real constraints — image generation is slower and more unpredictable than text; designing a system that handles partial failures gracefully matters

What's next for Future Artist

  • Native Gemini TTS — replace Web Speech API with Gemini's audio generation for richer, more expressive narration
  • Video generation — animate scenes using generated images as storyboard frames
  • Export options — download the full story as a PDF illustrated book or slideshow
  • Character persistence — allow users to define named characters with fixed appearances that stay consistent across different stories; currently consistency only lasts for the current session
  • Collaborative mode — multiple users co-creating a story in real time, each controlling different agents
  • Browse mode — kids can browse storybooks created by other kids, with all content reviewed by moderators first
  • Background music — give each storybook its own unique background score
