ARIA — Adaptive Reality Intelligence Agent

Creative Storyteller — Gemini Live Agent Challenge 2026
Category: Creative Storyteller · Hosted on Google Cloud Run · Submitted February–March 2026

Inspiration

Every great story deserves a great storyteller.

We were inspired by the gap between the story people imagine and the one they can actually produce. Many people have vivid ideas but lack the tools, time, or technical skill to bring them to life cinematically.

Google’s Gemini Live API introduced a new possibility: an AI you can talk to in real time that can see, hear, and act.

This sparked a simple question:

What if telling a story was as easy as having a conversation?

From that idea, ARIA was born — an AI creative director that transforms spoken ideas into cinematic stories.

What It Does

ARIA is a multimodal AI creative storyteller that lets anyone produce cinematic stories through natural voice and text interaction.

ARIA was built entirely during the contest period, which began in February 2026.

Core Capabilities

Live Voice Control
Talk to ARIA naturally in real time. Say “create a scene of a sunset over Lagos” and watch it generate. Commands like “split screen” or “start recording” take effect immediately.

Multimodal Story Generation
ARIA combines AI-generated images, videos, narration audio, and title cards into a coherent story timeline.

Cinematic Presenter
A full-screen presentation mode plays your story with synchronized narration, scene transitions, and progress tracking.

Browser & Screen Intelligence
ARIA can see your screen, control browser tabs, take screenshots, and describe what it observes in real time.

Camera View Modes
Stealth, Picture-in-Picture, and Split Screen modes give users full control over how ARIA sees them.

Project Management
Users can save, load, duplicate, and manage multiple story projects stored in Firestore.

Email Notifications
Users receive email alerts when long-running video generation tasks finish.

Session Recording
Entire ARIA sessions can be recorded and saved to Cloud Storage for later playback or download.

How We Built It

Frontend

A single-file HTML/CSS/Vanilla JavaScript interface with a cinematic dark theme.
No frameworks — designed to be lightweight, responsive, and visually immersive.

Features include:

Landing animation loop (logo → templates → logo)
Draggable Picture-in-Picture camera
Minimisable voice control panel
Full presentation stage for cinematic playback

Backend

Python + Flask + Flask-Sock, deployed on Google Cloud Run.

The architecture is fully serverless, and request handlers are stateless (generation job state lives in Firestore):

API endpoints handle real-time interactions
Long-running tasks are dispatched via Cloud Tasks
The frontend polls task status for results
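
The dispatch-and-poll flow above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not ARIA's actual code: the in-memory `JOBS` dict stands in for the Firestore job collection, the `QUEUE` list stands in for a Cloud Tasks queue, and the function names `submit_job`, `run_next_job`, and `get_status` are hypothetical.

```python
import uuid

# Stand-in for the Firestore collection that holds generation job state.
JOBS = {}
# Stand-in for a Cloud Tasks queue: job IDs waiting for a worker.
QUEUE = []


def submit_job(kind, prompt):
    """API endpoint side: record the job, enqueue it, and return at once."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"kind": kind, "prompt": prompt, "status": "queued", "result": None}
    QUEUE.append(job_id)  # in production this would create a Cloud Tasks task
    return job_id


def run_next_job(generate):
    """Worker side: the handler a queued task would invoke for one job."""
    job_id = QUEUE.pop(0)
    job = JOBS[job_id]
    job["status"] = "running"
    try:
        job["result"] = generate(job["kind"], job["prompt"])
        job["status"] = "done"
    except Exception as exc:
        # Record the failure so the polling client can display it.
        job["status"] = "failed"
        job["result"] = str(exc)
    return job_id


def get_status(job_id):
    """Polling endpoint side: the frontend calls this until the status is final."""
    job = JOBS.get(job_id)
    if job is None:
        return {"status": "not_found"}
    return {"status": job["status"], "result": job["result"]}
```

Because the submit call only writes state and enqueues work, it returns quickly, and any instance can answer a later status poll from the shared store.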

AI Models on Vertex AI

Model	Purpose
`gemini-3.1-flash-lite-preview`	Story director, chat reasoning, scene planning
`gemini-3.1-flash-image-preview`	AI scene image generation
`veo-3.1-generate-preview`	Cinematic video generation
`gemini-2.5-flash-preview-tts`	Narration voiceover
`gemini-live-2.5-flash-native-audio`	Real-time live voice intelligence

Google Search Grounding
Gemini’s built-in Google Search grounding tool provides real-time knowledge retrieval. Responses that use grounding are marked ✦ Web-grounded in the interface.
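
How ARIA decides when to show the badge is not described here. One plausible approach, sketched below, is to check whether the model response carries grounding metadata. The helper names are hypothetical, and the attribute path (`candidates[0].grounding_metadata.grounding_chunks`) is an assumption about the response shape returned by the Google GenAI Python SDK.

```python
WEB_GROUNDED_BADGE = "✦ Web-grounded"


def is_web_grounded(response):
    """Return True if the first candidate carries at least one grounding chunk."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    # Assumed attribute names; missing attributes are treated as "not grounded".
    metadata = getattr(candidates[0], "grounding_metadata", None)
    chunks = getattr(metadata, "grounding_chunks", None) or []
    return len(chunks) > 0


def label_reply(text, response):
    """Append the badge to the reply text only when the response was grounded."""
    if is_web_grounded(response):
        return f"{text}\n{WEB_GROUNDED_BADGE}"
    return text
```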

Google Cloud Services

Service	Role
Cloud Run	Serverless hosting with automatic scaling
Firestore	Stores users, projects, and generation job state
Cloud Storage	Stores generated images, videos, and recordings
Cloud Tasks	Manages asynchronous long-running generation jobs
Vertex AI	Runs all Gemini and Veo model inference

Third-Party Tools

FFmpeg (LGPL) — video compilation with narration audio
Playwright (Apache 2.0) — headless browser automation
MSS (MIT) — screen capture integration
Unsplash — template reference images
Gmail SMTP — generation completion notifications
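
As an illustration of the FFmpeg step, the sketch below builds a command that combines a generated video with a narration track. The exact flags ARIA uses are not given in this write-up, so the command, the function names, and the file paths are assumptions.

```python
import subprocess


def build_mux_command(video_path, narration_path, output_path):
    """Build an ffmpeg argv that pairs a video stream with a narration track."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", video_path,        # input 0: generated video
        "-i", narration_path,    # input 1: narration audio
        "-map", "0:v:0",         # take the video stream from input 0
        "-map", "1:a:0",         # take the audio stream from input 1
        "-c:v", "copy",          # copy video without re-encoding
        "-c:a", "aac",           # encode narration as AAC
        "-shortest",             # stop at the end of the shorter input
        output_path,
    ]


def mux(video_path, narration_path, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mux_command(video_path, narration_path, output_path), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names.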

Challenges We Ran Into

WebSocket + Cloud Run
Maintaining persistent WebSocket connections on a stateless serverless platform required careful tuning of worker configuration and timeouts.

Live Audio Synchronisation
Streaming PCM audio without drift required a precise Web Audio buffer scheduling system with look-ahead timing.
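
The real scheduler runs in the browser on the Web Audio clock. The Python sketch below only models the timing arithmetic of look-ahead scheduling, under assumed values: a 24 kHz sample rate and a 100 ms look-ahead. `ChunkScheduler` is a hypothetical name.

```python
class ChunkScheduler:
    """Compute gap-free start times for consecutive PCM chunks."""

    def __init__(self, sample_rate=24000, lookahead=0.1):
        self.sample_rate = sample_rate  # samples per second (assumed)
        self.lookahead = lookahead      # seconds of safety margin (assumed)
        self.next_start = 0.0           # when the next chunk should begin

    def schedule(self, num_samples, now):
        """Return the start time for a chunk of num_samples arriving at time now."""
        if self.next_start <= now:
            # Starting up or recovering from an underrun: restart a little
            # in the future so the chunk can be queued before it must play.
            start = now + self.lookahead
        else:
            # Still ahead of the clock: butt the chunk against the previous one.
            start = self.next_start
        # Advance by the chunk's exact duration, so timing follows the sample
        # count rather than arrival time and does not drift.
        self.next_start = start + num_samples / self.sample_rate
        return start
```

Each start time is derived from the previous chunk's end rather than from when the chunk arrived, which is what keeps network jitter from accumulating into drift.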

Async Job Architecture
Replacing local threading with Cloud Tasks while preserving the same frontend polling API required careful endpoint design.

Veo Video Polling
Veo video generation can run for minutes. Running the polling loop through Cloud Tasks let it survive Cloud Run's request time limits.
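
One way to keep polling within request limits is to make each task check the operation once and then ask to be re-enqueued after a delay. The sketch below shows that decision step only. It is an assumption about the approach, not ARIA's code, and `poll_step`, its parameters, and the delay values are hypothetical.

```python
def poll_step(check_operation, attempt, max_attempts=40, base_delay=10, max_delay=60):
    """Check a long-running operation once and decide what happens next.

    check_operation: callable returning (done, result).
    attempt: zero-based count of checks already made.
    Returns ("done", result), ("retry", delay_seconds), or ("failed", reason).
    """
    done, result = check_operation()
    if done:
        return ("done", result)
    if attempt + 1 >= max_attempts:
        # Give up so a stuck operation cannot re-enqueue itself forever.
        return ("failed", "timed out")
    # Linear backoff, capped so checks stay reasonably frequent.
    delay = min(max_delay, base_delay * (attempt + 1))
    return ("retry", delay)
```

On a "retry" outcome the handler would enqueue a new task scheduled `delay` seconds ahead and return, so no single request stays open for the whole generation.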

Screen Capture in the Cloud
Server-side screen capture works locally but not on Cloud Run, where the container has no display to capture. We switched to the browser's getDisplayMedia API.

Gunicorn Worker Compatibility
Finding a Gunicorn worker configuration that supports both HTTP and WebSockets took several iterations; we settled on gthread.
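
For reference, a Gunicorn configuration file along these lines would select the gthread worker. Only `worker_class = "gthread"` comes from this write-up; the worker count, thread count, and timeout are illustrative assumptions, not the deployed values.

```python
# gunicorn.conf.py (hypothetical values apart from worker_class)
import os

# Cloud Run provides the listening port in the PORT environment variable.
bind = "0.0.0.0:" + os.environ.get("PORT", "8080")

# Threaded worker: handles plain HTTP requests and long-lived WebSocket
# connections, with each open connection occupying one thread.
worker_class = "gthread"
workers = 1
threads = 32

# 0 disables Gunicorn's worker timeout so long-lived connections are not killed.
timeout = 0
```

With a threaded worker the thread count caps the number of concurrent WebSocket sessions per instance, so it needs to be sized with the Cloud Run concurrency setting in mind.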

GCS Signed URLs
Signed URL behaviour differs between local development and Cloud Run. IAM-based signing with token fallback solved the issue.
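
The sketch below shows one way to handle both environments. It assumes the `google-cloud-storage` `Blob.generate_signed_url` method, which accepts `service_account_email` and `access_token` for IAM-based signing, and it assumes the key-less case surfaces as an `AttributeError`. The order of attempts ARIA actually uses is not stated, and `make_signed_url` is a hypothetical helper.

```python
from datetime import timedelta


def make_signed_url(blob, credentials, minutes=15):
    """Return a V4 signed GET URL for blob, with or without a private key.

    blob: a google.cloud.storage Blob (or any object with the same method).
    credentials: already-refreshed credentials exposing
        service_account_email and token.
    """
    expiration = timedelta(minutes=minutes)
    try:
        # Works when the credentials include a private key, for example a
        # service account key file used in local development.
        return blob.generate_signed_url(
            version="v4", expiration=expiration, method="GET"
        )
    except AttributeError:
        # No private key available (typical on Cloud Run): pass the service
        # account email and an access token so signing goes through IAM.
        return blob.generate_signed_url(
            version="v4",
            expiration=expiration,
            method="GET",
            service_account_email=credentials.service_account_email,
            access_token=credentials.token,
        )
```

For the IAM path the runtime service account also needs permission to sign on its own behalf, which is usually granted through the Service Account Token Creator role.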

Accomplishments We're Proud Of

Full multimodal storytelling pipeline — voice input becomes a fully narrated story with images and video
Live voice that actually works — ARIA can respond to interruptions and control the interface
Zero frontend frameworks — the entire rich interface runs in pure HTML/CSS/JS
Fully GCP-native architecture
Wholesome storytelling guardrails designed into the system prompt
Deployed and live application, not just a prototype
Grounded responses visible in the UI via ✦ Web-grounded indicators

What We Learned

The Gemini Live API enables entirely new interaction patterns beyond traditional chat interfaces.
Cloud Tasks is ideal for long-running AI jobs in serverless environments.
Firestore is well suited for AI content workflows due to its schemaless flexibility.
Building a voice-first interface requires a different design philosophy than building a traditional visual UI.
Hackathon constraints force strong architectural prioritisation.

What's Next for ARIA

Enhanced GCP Security
Advanced IAM roles and per-user quota management for enterprise deployments.

Monetisation
Subscription tiers (Basic / Pro / Studio) with usage limits per plan.

Collaboration
Real-time multi-user story editing using Firestore listeners.

Custom Voice Cloning
Users will be able to record their own narrator voice.

Mobile App
React Native wrapper with native camera and microphone access.

YouTube Direct Publishing
OAuth-based direct uploads to YouTube Studio.

Story Template Marketplace
Community-created templates discoverable by interest.

Offline Mode
Service worker caching so users can browse saved stories without internet access.

Tech Stack Summary

Layer	Technology
Hosting	Google Cloud Run
AI Inference	Vertex AI (Gemini Live, Image, Video, TTS)
Database	Cloud Firestore
Storage	Google Cloud Storage
Async Jobs	Cloud Tasks
Backend	Python · Flask · Flask-Sock
Frontend	Vanilla HTML / CSS / JavaScript
Video	FFmpeg
Browser Automation	Playwright
Email	Gmail SMTP
SDK	Google GenAI Python SDK