Inspiration
Every creative idea deserves more than a wall of text.
We've all used AI tools that generate a paragraph here, an image there — but never together. Never as a single, breathing, cinematic experience. A screenwriter pitching a film doesn't hand you a Word document — they paint the scene with words, sketches, and emotion simultaneously.
That gap is what inspired OmniSence.
We asked ourselves: what if AI could think like a creative director, not a typewriter? What if you could speak a single idea — "a girl who discovers she can paint the future" — and watch a living story unfold in real-time, with narration flowing word by word, illustrations appearing inline mid-sentence, and a voice reading it back to you — all in one fluid, uninterrupted stream?
The "text box" paradigm has been the ceiling of AI interaction for too long. OmniSence is our answer to breaking it.
What it does
OmniSence is an elite Creative Director AI that transforms a single idea into a rich, multimodal creative experience — streaming text, images, and audio together in real-time as one cohesive output.
It operates across 4 creative modes:
| Mode | What It Creates |
|---|---|
| 📚 Storybook | Illustrated narratives with warm prose and watercolor artwork inline |
| 📣 Marketing | Campaigns with hero visuals, conversion copy, and CTAs in one go |
| 🎓 Educational | Explainers with Google Search-grounded facts woven with diagrams |
| 📱 Social | Platform-native posts with paired visuals and hashtags |
Key capabilities:
- 🎤 Voice Input — speak your idea using Web Speech API
- 🔊 AI Narration — Google Cloud TTS reads your story with Studio voices
- ⚡ Live SSE Streaming — text appears word-by-word, images pop in inline
- 🔍 Grounded Generation — Google Search grounding for educational mode
- 🔄 Context-Aware Sessions — maintains memory across turns
- 🛑 Stream Interruption — stop and redirect generation mid-stream
- ☁️ Cloud Persistence — all assets saved to Google Cloud Storage
Unlike tools that generate text and images separately, OmniSence streams everything simultaneously on a single pipeline — the agent acts as a creative director coordinating text, visuals, and voice in perfect sync.
How we built it
Architecture: Orchestrated Interleaved Streaming
The core innovation is what we call Orchestrated Interleaved Streaming — an ADK agent pipeline where multiple specialized models are coordinated in real-time on a single SSE stream.
```
User Prompt
    ↓
Google ADK Agent (OmniSence)
    ↓
Gemini 3.1 Flash — streams text with [IMAGE_DIRECTIVE: ...] markers
    ↓ (on marker detection)
Imagen 4 (async)   ←→   Cloud TTS (async)
    ↓                        ↓
GCS Upload               GCS Upload
    ↓                        ↓
SSE: {type:"image"}      SSE: {type:"audio"}
    ↓
React Frontend — renders text + images + audio inline, live
```
The math behind the streaming interleave timing:
$$T_{\text{total}} = T_{\text{text stream}} + \max(T_{\text{imagen}}, T_{\text{tts}}) - T_{\text{overlap}}$$
Because image and audio generation run in parallel while text continues streaming, perceived latency is far lower than sequential generation: the first text tokens appear almost immediately after submitting, and images and audio slot in as they complete.
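The overlap term is exactly what parallel task scheduling buys. A minimal asyncio sketch of the idea — all model calls below are stand-ins with invented names and latencies, not the project's real clients:

```python
import asyncio

# Stand-ins for the real model calls; names and latencies are illustrative.
async def stream_text(chunks):
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated per-token latency
        yield chunk

async def generate_image():
    await asyncio.sleep(0.2)  # simulated Imagen call
    return {"type": "image", "url": "gs://bucket/scene.png"}

async def synthesize_audio():
    await asyncio.sleep(0.15)  # simulated Cloud TTS call
    return {"type": "audio", "url": "gs://bucket/clip.mp3"}

async def interleaved_stream(chunks):
    """Fire image + TTS in parallel and keep streaming text meanwhile."""
    side_tasks = [
        asyncio.create_task(generate_image()),
        asyncio.create_task(synthesize_audio()),
    ]
    emitted = []
    async for chunk in stream_text(chunks):
        emitted.append({"type": "text", "data": chunk})
        # Flush any side task that finished while text was streaming.
        done = [t for t in side_tasks if t.done()]
        emitted.extend(t.result() for t in done)
        side_tasks = [t for t in side_tasks if not t.done()]
    for task in side_tasks:  # drain whatever is still in flight
        emitted.append(await task)
    return emitted

stream = asyncio.run(interleaved_stream(["Once", "upon", "a", "time"]))
```

With these toy latencies the total wall time is roughly max(text, image) rather than their sum, which is the overlap term in the formula.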
Tech Stack
Backend: Python 3.12 + FastAPI + Google ADK
ADK Agent Tools (5 real async tools):
- `generate_scene_image` — Imagen 4 via Vertex AI → GCS
- `narrate_text` — Cloud TTS Studio voices → GCS
- `search_creative_references` — Google Search grounding
- `save_session_asset` — GCS persistence for session continuity
- `get_style_constraints` — structured creative-mode framework
Google Cloud Services:
- Gemini 3.1 Flash (interleaved text generation)
- Imagen 4 Standard (scene illustration)
- Cloud Text-to-Speech Studio voices (narration)
- Cloud Run (serverless hosting)
- Cloud Storage (asset persistence)
- Cloud Build (CI/CD pipeline)
Frontend: React 18 + TypeScript + Vite + Tailwind CSS
The frontend renders a 3-panel cinematic canvas — session memory on the left, live streaming canvas in the center, creative controls on the right — with a dark film-grain aesthetic designed to feel like a director's suite, not a chat window.
Challenges we ran into
1. True Interleaved Streaming
The hardest technical challenge was making text and images feel simultaneous rather than sequential. Gemini doesn't natively emit image bytes mid-text stream — we had to design the [IMAGE_DIRECTIVE: ...] interception pattern, where the ADK agent acts as a real-time orchestrator, firing async Imagen 4 calls the moment a visual cue appears in the text stream, then merging both outputs back onto the SSE channel.
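As a sketch, the interception loop scans the accumulated stream for the marker, launches an async image task, and splices the result back in. `fake_imagen` stands in for the real Imagen call, and the buffering is deliberately simplified — a production version would flush text eagerly rather than hold it:

```python
import asyncio
import re

# Marker pattern the model is prompted to emit mid-stream.
DIRECTIVE = re.compile(r"\[IMAGE_DIRECTIVE:\s*(.+?)\]")

async def fake_imagen(prompt):  # placeholder for the real Imagen 4 call
    await asyncio.sleep(0)
    return {"type": "image", "prompt": prompt}

async def intercept(text_chunks):
    """Strip directives out of the text stream and fire image tasks for them."""
    pending, events, buffer = [], [], ""
    for chunk in text_chunks:
        buffer += chunk  # markers may be split across chunks
        while (m := DIRECTIVE.search(buffer)):
            if m.start():  # emit the text before the marker
                events.append({"type": "text", "data": buffer[:m.start()]})
            # Launch the image call the moment the cue appears.
            pending.append(asyncio.create_task(fake_imagen(m.group(1))))
            buffer = buffer[m.end():]
    if buffer:
        events.append({"type": "text", "data": buffer})
    events += [await t for t in pending]  # merge image results back in
    return events

merged = asyncio.run(intercept([
    "She lifted her brush. [IMAGE_DIR",
    "ECTIVE: girl painting a glowing sky] The canvas shimmered.",
]))
```

Note that the directive survives being split across two chunks: the buffer only matches once the closing bracket arrives.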
2. SSE Stream Cancellation
Building true mid-stream interruption required maintaining per-session cancellation flags in FastAPI's async context, ensuring that when a user hits "Stop & Redirect", the stream cleanly exits without leaving dangling Imagen API calls or GCS uploads in flight.
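A stripped-down version of that pattern — one asyncio.Event per session, checked by the generator on every frame. Session bookkeeping and the cleanup of in-flight tasks are simplified here:

```python
import asyncio

# One cancellation event per session; the "Stop & Redirect" endpoint sets it.
cancel_flags: dict[str, asyncio.Event] = {}

async def sse_stream(session_id: str, chunks):
    """Async generator backing the SSE response; exits cleanly on cancel."""
    flag = cancel_flags.setdefault(session_id, asyncio.Event())
    try:
        for chunk in chunks:
            if flag.is_set():
                yield "event: cancelled\ndata: stream stopped\n\n"
                return
            yield f"data: {chunk}\n\n"
            await asyncio.sleep(0)  # keep the generator non-blocking
    finally:
        # Real code would also cancel in-flight Imagen tasks / GCS uploads here.
        flag.clear()

async def demo():
    frames = []
    async for frame in sse_stream("s1", ["a", "b", "c", "d"]):
        frames.append(frame)
        if len(frames) == 2:  # user hits "Stop & Redirect" mid-stream
            cancel_flags["s1"].set()
    return frames

frames = asyncio.run(demo())
```

The generator's `finally` block is the key design choice: whatever path ends the stream, per-session state is reset and side effects get a single cleanup point.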
3. Grounding Without Breaking Flow
For educational mode, integrating Google Search grounding had to happen before the creative stream began — but the search latency couldn't be perceived by the user. We solved this by pre-fetching grounding data in a parallel async task during the 200ms UI transition animation.
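The trick is ordinary task scheduling: start the search before awaiting the animation, so the two run concurrently. A minimal sketch with invented placeholder functions and latencies:

```python
import asyncio

async def fetch_grounding(topic):  # placeholder for the Google Search grounding call
    await asyncio.sleep(0.1)       # simulated search latency
    return {"topic": topic, "facts": ["fact one", "fact two"]}

async def ui_transition():
    await asyncio.sleep(0.2)  # the ~200ms mode-switch animation

async def start_educational_stream(topic):
    # Kick off grounding the moment the user switches modes,
    # so it resolves while the transition animation plays.
    grounding_task = asyncio.create_task(fetch_grounding(topic))
    await ui_transition()
    grounding = await grounding_task  # already resolved: zero perceived wait
    return grounding

result = asyncio.run(start_educational_stream("photosynthesis"))
```

As long as search latency stays under the animation duration, the second `await` returns instantly and the user never perceives the fetch.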
4. Persona Consistency Across Modes
Getting OmniSence to sound like a warm, bold creative director across 4 completely different output formats — from bedtime storybook to B2B marketing campaign — required carefully layered system prompts with mode-specific tone injection while preserving the core OmniSence persona throughout.
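Conceptually the layering looks like this — the persona and tone strings below are invented placeholders, not the project's actual prompts:

```python
# A fixed persona core plus a per-mode tone block injected into the
# system prompt. All strings here are illustrative stand-ins.
BASE_PERSONA = (
    "You are OmniSence, a warm, bold creative director. "
    "You decide when to illustrate, narrate, or explain."
)

MODE_TONES = {
    "storybook": "Write warm, lyrical prose suitable for bedtime reading.",
    "marketing": "Write punchy conversion copy with clear CTAs.",
    "educational": "Explain clearly; ground every factual claim.",
    "social": "Write platform-native posts with paired hashtags.",
}

def build_system_prompt(mode: str) -> str:
    """Layer the mode-specific tone on top of the shared persona core."""
    return f"{BASE_PERSONA}\n\n[MODE: {mode}]\n{MODE_TONES[mode]}"

prompt = build_system_prompt("storybook")
```

Keeping the persona in one shared constant is what makes the voice consistent: only the tone block varies per mode.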
5. Cloud TTS Latency
Studio voice generation takes 2–4 seconds per paragraph. Streaming audio while text is still generating required chunking narratives into sentence groups, generating audio for completed sentences while later sentences still stream — a rolling audio generation pipeline.
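A sketch of that rolling pipeline: split off fully terminated sentences as text chunks arrive, and fire a TTS task for every completed sentence group while later text is still streaming. `tts` is a stand-in for the real Cloud TTS call:

```python
import asyncio
import re

async def tts(sentence_group):  # placeholder for the Cloud TTS Studio call
    await asyncio.sleep(0.01)
    return {"type": "audio", "text": sentence_group}

async def rolling_narration(text_chunks, group_size=2):
    """Queue TTS for each completed sentence group while text keeps streaming."""
    buffer, sentences, audio_tasks = "", [], []
    for chunk in text_chunks:
        buffer += chunk
        # Split off fully terminated sentences; keep the partial tail.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        sentences += parts[:-1]
        buffer = parts[-1]
        while len(sentences) >= group_size:
            group, sentences = sentences[:group_size], sentences[group_size:]
            audio_tasks.append(asyncio.create_task(tts(" ".join(group))))
    # Narrate whatever is left once the text stream ends.
    leftover = " ".join(sentences + ([buffer] if buffer.strip() else []))
    if leftover:
        audio_tasks.append(asyncio.create_task(tts(leftover)))
    return [await t for t in audio_tasks]

clips = asyncio.run(rolling_narration(
    ["She dipped the brush. The sky ", "rippled. Tomorrow appeared. She smiled."]
))
```

Because each group's audio starts generating as soon as its sentences complete, the 2–4 second Studio latency overlaps with the remaining text stream instead of stacking after it.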
Accomplishments that we're proud of
- 🏆 Zero perceived latency between text streaming and image appearance — the experience feels genuinely live, not assembled
- 🔍 Google Search grounding integrated seamlessly into educational mode without breaking the creative flow
- 🎭 Distinct AI persona — OmniSence has a consistent voice, vocabulary, and creative philosophy across all 4 modes
- 🛑 Real stream interruption — one of the few hackathon projects with true mid-generation redirection, not just cancellation
- ☁️ Full GCP deployment with automated CI/CD via Cloud Build and one-command deploy script
- 🎙️ Voice-to-story pipeline — speak an idea, receive a fully illustrated, narrated story in under 30 seconds
- 🎨 Used OmniSence to generate its own thumbnail — the cover art for this submission was created using the product itself
What we learned
Technically:
- ADK's tool system is extraordinarily powerful for orchestrating multi-model pipelines — the agent naturally decides when to generate images vs. when to keep writing, creating organic pacing
- SSE streaming in FastAPI requires careful async generator design — `asyncio.create_task()` for parallel side effects while the main generator stays non-blocking
- Google Search grounding via the GenAI SDK is a one-line addition that dramatically improves factual accuracy in educational content
- Imagen 4's prompt sensitivity is high — a 10-word style suffix ("watercolor, warm light, children's book illustration") changes output quality dramatically
About multimodal UX:
- Users don't want to choose between text, image, and audio — they want the AI to make that decision for them, seamlessly
- The moment images appear inline rather than below text, the experience shifts from "AI output" to "living document" — this single UX change had the largest impact on how the product felt
- Voice input removes the "blank page" problem entirely — speaking feels more creative than typing
What's next for OmniSence
Immediate (next sprint):
- [ ] Veo 3.1 integration — animated scene clips woven inline with storybooks
- [ ] Collaborative mode — two users co-directing the same story in real-time
- [ ] Export to PDF/EPUB — download illustrated storybooks as formatted files
Medium term:
- [ ] Custom persona training — brands upload their style guide, OmniSence generates on-brand content automatically
- [ ] OmniSence for Education — teachers create grounded, illustrated lesson plans with a single sentence prompt
- [ ] Mobile app with persistent creative portfolio
Vision:
The next generation of creative tools won't have a text box at all.
You'll think out loud, and the canvas will keep up.
OmniSence is the first step toward that world.
Built for the Gemini Live Agent Challenge · #GeminiLiveAgentChallenge
Built With
- asyncio
- cloud-build
- cloud-text-to-speech
- docker
- fastapi
- gemini-3.1-flash
- google-adk
- google-cloud-run
- google-cloud-storage
- google-genai-sdk
- google-search-grounding
- imagen-4
- python
- react
- server-sent-events
- tailwindcss
- typescript
- vertex-ai
- vite
- web-speech-api