Inspiration
Every creative idea deserves more than a wall of text.
We've all used AI tools that generate a paragraph here, an image there — but never together. Never as a single, breathing, cinematic experience. A screenwriter pitching a film doesn't hand you a Word document — they paint the scene with words, sketches, and emotion simultaneously.
That gap is what inspired OmniSence.
We asked ourselves: what if AI could think like a creative director, not a typewriter? What if you could speak a single idea — "a girl who discovers she can paint the future" — and watch a living story unfold in real-time, with narration flowing word by word, illustrations appearing inline mid-sentence, and a voice reading it back to you — all in one fluid, uninterrupted stream?
The "text box" paradigm has been the ceiling of AI interaction for too long. OmniSence is our answer to breaking it.
What it does
OmniSence is an elite Creative Director AI that transforms a single idea into a rich, multimodal creative experience — streaming text, images, and audio together in real-time as one cohesive output.
It operates across 4 creative modes:
| Mode | What It Creates |
|---|---|
| 📚 Storybook | Illustrated narratives with warm prose and watercolor artwork inline |
| 📣 Marketing | Campaigns with hero visuals, conversion copy, and CTAs in one go |
| 🎓 Educational | Explainers with Google Search-grounded facts woven with diagrams |
| 📱 Social | Platform-native posts with paired visuals and hashtags |
Key capabilities:
- 🎤 Voice Input — speak your idea using Web Speech API
- 🔊 AI Narration — Google Cloud TTS reads your story with Studio voices
- ⚡ Live SSE Streaming — text appears word-by-word, images pop in inline
- 🔍 Grounded Generation — Google Search grounding for educational mode
- 🔄 Context-Aware Sessions — maintains memory across turns
- 🛑 Stream Interruption — stop and redirect generation mid-stream
- ☁️ Cloud Persistence — all assets saved to Google Cloud Storage
Unlike tools that generate text and images separately, OmniSence streams everything simultaneously on a single pipeline — the agent acts as a creative director coordinating text, visuals, and voice in perfect sync.
How we built it
Architecture: Orchestrated Interleaved Streaming
The core innovation is what we call Orchestrated Interleaved Streaming — an ADK agent pipeline where multiple specialized models are coordinated in real-time on a single SSE stream.
```
User Prompt
    ↓
Google ADK Agent (OmniSence)
    ↓
Gemini 3.1 Flash — streams text with [IMAGE_DIRECTIVE: ...] markers
    ↓ (on marker detection)
Imagen 4 (async)   ←→   Cloud TTS (async)
    ↓                        ↓
GCS Upload               GCS Upload
    ↓                        ↓
SSE: {type:"image"}      SSE: {type:"audio"}
    ↓
React Frontend — renders text + images + audio inline, live
```
The math behind the streaming interleave timing:
$$T_{\text{total}} = T_{\text{text stream}} + \max(T_{\text{imagen}}, T_{\text{tts}}) - T_{\text{overlap}}$$
Because image and audio generation run in parallel while text continues streaming, perceived latency is far lower than sequential generation: the first text tokens appear almost immediately after submitting, and images and audio slot in as they complete.
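The overlap term is exactly what parallel task scheduling buys. A minimal asyncio sketch of the idea — all model calls below are stand-ins with invented names and latencies, not the project's real clients:

```python
import asyncio

# Stand-ins for the real model calls; names and latencies are illustrative.
async def stream_text(chunks):
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated per-token latency
        yield chunk

async def generate_image():
    await asyncio.sleep(0.2)  # simulated Imagen call
    return {"type": "image", "url": "gs://bucket/scene.png"}

async def synthesize_audio():
    await asyncio.sleep(0.15)  # simulated Cloud TTS call
    return {"type": "audio", "url": "gs://bucket/clip.mp3"}

async def interleaved_stream(chunks):
    """Fire image + TTS in parallel and keep streaming text meanwhile."""
    side_tasks = [
        asyncio.create_task(generate_image()),
        asyncio.create_task(synthesize_audio()),
    ]
    emitted = []
    async for chunk in stream_text(chunks):
        emitted.append({"type": "text", "data": chunk})
        # Flush any side task that finished while text was streaming.
        done = [t for t in side_tasks if t.done()]
        emitted.extend(t.result() for t in done)
        side_tasks = [t for t in side_tasks if not t.done()]
    for task in side_tasks:  # drain whatever is still in flight
        emitted.append(await task)
    return emitted

stream = asyncio.run(interleaved_stream(["Once", "upon", "a", "time"]))
```

With these toy latencies the total wall time is roughly max(text, image) rather than their sum, which is the overlap term in the formula.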
Tech Stack
Backend: Python 3.12 + FastAPI + Google ADK
ADK Agent Tools (5 real async tools):
- `generate_scene_image` — Imagen 4 via Vertex AI → GCS
- `narrate_text` — Cloud TTS Studio voices → GCS
- `search_creative_references` — Google Search grounding
- `save_session_asset` — GCS persistence for session continuity
- `get_style_constraints` — structured creative-mode framework
Google Cloud Services:
- Gemini 3.1 Flash (interleaved text generation)
- Imagen 4 Standard (scene illustration)
- Cloud Text-to-Speech Studio voices (narration)
- Cloud Run (serverless hosting)
- Cloud Storage (asset persistence)
- Cloud Build (CI/CD pipeline)
Frontend: React 18 + TypeScript + Vite + Tailwind CSS
The frontend renders a 3-panel cinematic canvas — session memory on the left, live streaming canvas in the center, creative controls on the right — with a dark film-grain aesthetic designed to feel like a director's suite, not a chat window.
Challenges we ran into
1. True Interleaved Streaming
The hardest technical challenge was making text and images feel simultaneous rather than sequential. Gemini doesn't natively emit image bytes mid-text stream — we had to design the [IMAGE_DIRECTIVE: ...] interception pattern, where the ADK agent acts as a real-time orchestrator, firing async Imagen 4 calls the moment a visual cue appears in the text stream, then merging both outputs back onto the SSE channel.
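As a sketch, the interception loop scans the accumulated stream for the marker, launches an async image task, and splices the result back in. `fake_imagen` stands in for the real Imagen call, and the buffering is deliberately simplified — a production version would flush text eagerly rather than hold it:

```python
import asyncio
import re

# Marker pattern the model is prompted to emit mid-stream.
DIRECTIVE = re.compile(r"\[IMAGE_DIRECTIVE:\s*(.+?)\]")

async def fake_imagen(prompt):  # placeholder for the real Imagen 4 call
    await asyncio.sleep(0)
    return {"type": "image", "prompt": prompt}

async def intercept(text_chunks):
    """Strip directives out of the text stream and fire image tasks for them."""
    pending, events, buffer = [], [], ""
    for chunk in text_chunks:
        buffer += chunk  # markers may be split across chunks
        while (m := DIRECTIVE.search(buffer)):
            if m.start():  # emit the text before the marker
                events.append({"type": "text", "data": buffer[:m.start()]})
            # Launch the image call the moment the cue appears.
            pending.append(asyncio.create_task(fake_imagen(m.group(1))))
            buffer = buffer[m.end():]
    if buffer:
        events.append({"type": "text", "data": buffer})
    events += [await t for t in pending]  # merge image results back in
    return events

merged = asyncio.run(intercept([
    "She lifted her brush. [IMAGE_DIR",
    "ECTIVE: girl painting a glowing sky] The canvas shimmered.",
]))
```

Note that the directive survives being split across two chunks: the buffer only matches once the closing bracket arrives.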
2. SSE Stream Cancellation
Building true mid-stream interruption required maintaining per-session cancellation flags in FastAPI's async context, ensuring that when a user hits "Stop & Redirect", the stream cleanly exits without leaving dangling Imagen API calls or GCS uploads in flight.
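A stripped-down version of that pattern — one asyncio.Event per session, checked by the generator on every frame. Session bookkeeping and the cleanup of in-flight tasks are simplified here:

```python
import asyncio

# One cancellation event per session; the "Stop & Redirect" endpoint sets it.
cancel_flags: dict[str, asyncio.Event] = {}

async def sse_stream(session_id: str, chunks):
    """Async generator backing the SSE response; exits cleanly on cancel."""
    flag = cancel_flags.setdefault(session_id, asyncio.Event())
    try:
        for chunk in chunks:
            if flag.is_set():
                yield "event: cancelled\ndata: stream stopped\n\n"
                return
            yield f"data: {chunk}\n\n"
            await asyncio.sleep(0)  # keep the generator non-blocking
    finally:
        # Real code would also cancel in-flight Imagen tasks / GCS uploads here.
        flag.clear()

async def demo():
    frames = []
    async for frame in sse_stream("s1", ["a", "b", "c", "d"]):
        frames.append(frame)
        if len(frames) == 2:  # user hits "Stop & Redirect" mid-stream
            cancel_flags["s1"].set()
    return frames

frames = asyncio.run(demo())
```

The generator's `finally` block is the key design choice: whatever path ends the stream, per-session state is reset and side effects get a single cleanup point.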
3. Grounding Without Breaking Flow
For educational mode, integrating Google Search grounding had to happen before the creative stream began — but the search latency couldn't be perceived by the user. We solved this by pre-fetching grounding data in a parallel async task during the 200ms UI transition animation.
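The trick is ordinary task scheduling: start the search before awaiting the animation, so the two run concurrently. A minimal sketch with invented placeholder functions and latencies:

```python
import asyncio

async def fetch_grounding(topic):  # placeholder for the Google Search grounding call
    await asyncio.sleep(0.1)       # simulated search latency
    return {"topic": topic, "facts": ["fact one", "fact two"]}

async def ui_transition():
    await asyncio.sleep(0.2)  # the ~200ms mode-switch animation

async def start_educational_stream(topic):
    # Kick off grounding the moment the user switches modes,
    # so it resolves while the transition animation plays.
    grounding_task = asyncio.create_task(fetch_grounding(topic))
    await ui_transition()
    grounding = await grounding_task  # already resolved: zero perceived wait
    return grounding

result = asyncio.run(start_educational_stream("photosynthesis"))
```

As long as search latency stays under the animation duration, the second `await` returns instantly and the user never perceives the fetch.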
4. Persona Consistency Across Modes
Getting OmniSence to sound like a warm, bold creative director across 4 completely different output formats — from bedtime storybook to B2B marketing campaign — required carefully layered system prompts with mode-specific tone injection while preserving the core OmniSence persona throughout.
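Conceptually the layering looks like this — the persona and tone strings below are invented placeholders, not the project's actual prompts:

```python
# A fixed persona core plus a per-mode tone block injected into the
# system prompt. All strings here are illustrative stand-ins.
BASE_PERSONA = (
    "You are OmniSence, a warm, bold creative director. "
    "You decide when to illustrate, narrate, or explain."
)

MODE_TONES = {
    "storybook": "Write warm, lyrical prose suitable for bedtime reading.",
    "marketing": "Write punchy conversion copy with clear CTAs.",
    "educational": "Explain clearly; ground every factual claim.",
    "social": "Write platform-native posts with paired hashtags.",
}

def build_system_prompt(mode: str) -> str:
    """Layer the mode-specific tone on top of the shared persona core."""
    return f"{BASE_PERSONA}\n\n[MODE: {mode}]\n{MODE_TONES[mode]}"

prompt = build_system_prompt("storybook")
```

Keeping the persona in one shared constant is what makes the voice consistent: only the tone block varies per mode.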
5. Cloud TTS Latency
Studio voice generation takes 2–4 seconds per paragraph. Streaming audio while text is still generating required chunking narratives into sentence groups, generating audio for completed sentences while later sentences still stream — a rolling audio generation pipeline.
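A sketch of that rolling pipeline: split off fully terminated sentences as text chunks arrive, and fire a TTS task for every completed sentence group while later text is still streaming. `tts` is a stand-in for the real Cloud TTS call:

```python
import asyncio
import re

async def tts(sentence_group):  # placeholder for the Cloud TTS Studio call
    await asyncio.sleep(0.01)
    return {"type": "audio", "text": sentence_group}

async def rolling_narration(text_chunks, group_size=2):
    """Queue TTS for each completed sentence group while text keeps streaming."""
    buffer, sentences, audio_tasks = "", [], []
    for chunk in text_chunks:
        buffer += chunk
        # Split off fully terminated sentences; keep the partial tail.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        sentences += parts[:-1]
        buffer = parts[-1]
        while len(sentences) >= group_size:
            group, sentences = sentences[:group_size], sentences[group_size:]
            audio_tasks.append(asyncio.create_task(tts(" ".join(group))))
    # Narrate whatever is left once the text stream ends.
    leftover = " ".join(sentences + ([buffer] if buffer.strip() else []))
    if leftover:
        audio_tasks.append(asyncio.create_task(tts(leftover)))
    return [await t for t in audio_tasks]

clips = asyncio.run(rolling_narration(
    ["She dipped the brush. The sky ", "rippled. Tomorrow appeared. She smiled."]
))
```

Because each group's audio starts generating as soon as its sentences complete, the 2–4 second Studio latency overlaps with the remaining text stream instead of stacking after it.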
Accomplishments that we're proud of
- 🏆 Zero perceived latency between text streaming and image appearance — the experience feels genuinely live, not assembled
- 🔍 Google Search grounding integrated seamlessly into educational mode without breaking the creative flow
- 🎭 Distinct AI persona — OmniSence has a consistent voice, vocabulary, and creative philosophy across all 4 modes
- 🛑 Real stream interruption — one of the few hackathon projects with true mid-generation redirection, not just cancellation
- ☁️ Full GCP deployment with automated CI/CD via Cloud Build and one-command deploy script
- 🎙️ Voice-to-story pipeline — speak an idea, receive a fully illustrated, narrated story in under 30 seconds
- 🎨 Used OmniSence to generate its own thumbnail — the cover art for this submission was created using the product itself
What we learned
Technically:
- ADK's tool system is extraordinarily powerful for orchestrating multi-model pipelines — the agent naturally decides when to generate images vs. when to keep writing, creating organic pacing
- SSE streaming in FastAPI requires careful async generator design — `asyncio.create_task()` for parallel side effects while the main generator stays non-blocking
- Google Search grounding via the GenAI SDK is a one-line addition that dramatically improves factual accuracy in educational content
- Imagen 4's prompt sensitivity is high — a 10-word style suffix ("watercolor, warm light, children's book illustration") changes output quality dramatically
About multimodal UX:
- Users don't want to choose between text, image, and audio — they want the AI to make that decision for them, seamlessly
- The moment images appear inline rather than below text, the experience shifts from "AI output" to "living document" — this single UX change had the largest impact on how the product felt
- Voice input removes the "blank page" problem entirely — speaking feels more creative than typing
What's next for OmniSence
Immediate (next sprint):
- [ ] Veo 3.1 integration — animated scene clips woven inline with storybooks
- [ ] Collaborative mode — two users co-directing the same story in real-time
- [ ] Export to PDF/EPUB — download illustrated storybooks as formatted files
Medium term:
- [ ] Custom persona training — brands upload their style guide, OmniSence generates on-brand content automatically
- [ ] OmniSence for Education — teachers create grounded, illustrated lesson plans with a single sentence prompt
- [ ] Mobile app with persistent creative portfolio
Vision:
The next generation of creative tools won't have a text box at all.
You'll think out loud, and the canvas will keep up.
OmniSence is the first step toward that world.
Built for the Gemini Live Agent Challenge · #GeminiLiveAgentChallenge
Built With
- asyncio
- cloud-build
- cloud-text-to-speech
- docker
- fastapi
- gemini-3.1-flash
- google-adk
- google-cloud-run
- google-cloud-storage
- google-genai-sdk
- google-search-grounding
- imagen-4
- python
- react
- server-sent-events
- tailwindcss
- typescript
- vertex-ai
- vite
- web-speech-api