ARIA — Adaptive Reality Intelligence Agent

Creative Storyteller — Gemini Live Agent Challenge 2026
Category: Creative Storyteller · Hosted on Google Cloud Run · Submitted February–March 2026


Inspiration

Every great story deserves a great storyteller.

We were inspired by the gap between the story people imagine and the one they can actually produce. Many people have vivid ideas but lack the tools, time, or technical skill to bring them to life cinematically.

Google’s Gemini Live API introduced a new possibility: an AI you can talk to in real time that can see, hear, and act.

This sparked a simple question:

What if telling a story was as easy as having a conversation?

From that idea, ARIA was born — an AI creative director that transforms spoken ideas into cinematic stories.


What It Does

ARIA is a multimodal AI creative storyteller that lets anyone produce cinematic stories through natural voice and text interaction.

ARIA was built entirely during the contest period beginning February 2026.

Core Capabilities

Live Voice Control
Talk to ARIA naturally in real time. Say “create a scene of a sunset over Lagos” and watch ARIA generate it. Commands like “split screen” or “start recording” take effect immediately.

Multimodal Story Generation
ARIA combines AI-generated images, videos, narration audio, and title cards into a coherent story timeline.

Cinematic Presenter
A full-screen presentation mode plays your story with synchronized narration, scene transitions, and progress tracking.
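
One way to picture the timeline behind the presenter is as an ordered list of scenes, each with a narration length, from which start offsets and a progress value can be derived. This is a minimal sketch only: the `Scene` fields and the `start_offsets` and `scene_at` helpers are hypothetical names for illustration, not ARIA's actual data model.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Scene:
    # Hypothetical fields; the real project schema is not shown in this write-up.
    title: str
    media_url: str
    narration_seconds: float


def start_offsets(scenes):
    """Return the playback start time (in seconds) of each scene."""
    durations = [s.narration_seconds for s in scenes]
    return [0.0] + list(accumulate(durations))[:-1]


def scene_at(scenes, elapsed):
    """Return (scene index, overall progress in [0, 1]) for an elapsed time."""
    total = sum(s.narration_seconds for s in scenes)
    elapsed = min(max(elapsed, 0.0), total)  # clamp to the timeline
    index = 0
    for i, start in enumerate(start_offsets(scenes)):
        if elapsed >= start:
            index = i
    return index, (elapsed / total if total else 0.0)
```

With a 4-second scene followed by a 6-second scene, an elapsed time of 5 seconds falls in the second scene at 50% overall progress.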

Browser & Screen Intelligence
ARIA can see your screen, control browser tabs, take screenshots, and describe what it observes in real time.

Camera View Modes
Stealth, Picture-in-Picture, and Split Screen modes give users full control over how ARIA sees them.

Project Management
Users can save, load, duplicate, and manage multiple story projects stored in Firestore.

Email Notifications
Users receive email alerts when long-running video generation tasks finish.
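
A completion alert of this kind can be sketched with Python's standard library alone. The function names, subject line, and message wording below are illustrative assumptions; the sketch also assumes Gmail's implicit-TLS SMTP endpoint on port 465 and an app password for login.

```python
import smtplib
from email.message import EmailMessage


def build_completion_email(sender, recipient, project_name, video_url):
    """Build the notification message; all argument names are illustrative."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Your ARIA video for '{project_name}' is ready"
    msg.set_content(
        f"Your video has finished generating.\n\nWatch it here: {video_url}\n"
    )
    return msg


def send_via_gmail(msg, username, app_password):
    """Send over Gmail SMTP with implicit TLS (port 465)."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(username, app_password)
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes the content easy to check without touching the network.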

Session Recording
Entire ARIA sessions can be recorded and saved to Cloud Storage for later playback or download.


How We Built It

Frontend

A single-file HTML/CSS/Vanilla JavaScript interface with a cinematic dark theme.
No frameworks — designed to be lightweight, responsive, and visually immersive.

Features include:

  • Landing animation loop (logo → templates → logo)
  • Draggable Picture-in-Picture camera
  • Minimisable voice control panel
  • Full presentation stage for cinematic playback

Backend

Python + Flask + Flask-Sock, deployed on Google Cloud Run.

The architecture is fully serverless and stateless:

  • API endpoints handle real-time interactions
  • Long-running tasks are dispatched via Cloud Tasks
  • The frontend polls task status for results

AI Models on Vertex AI

Model                               Purpose
gemini-3.1-flash-lite-preview       Story director, chat reasoning, scene planning
gemini-3.1-flash-image-preview      AI scene image generation
veo-3.1-generate-preview            Cinematic video generation
gemini-2.5-flash-preview-tts        Narration voiceover
gemini-live-2.5-flash-native-audio  Real-time live voice intelligence

Google Search Grounding
Gemini’s built-in Google Search grounding tool provides real-time knowledge retrieval. Responses that use grounding are marked ✦ Web-grounded in the interface.
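
How a reply might earn the badge can be sketched as a check on the response's grounding metadata. The sketch assumes the response object exposes `candidates[i].grounding_metadata.grounding_chunks`; `is_web_grounded` and `label_reply` are hypothetical helpers, and prefixing the badge to the text is only one possible way to surface it.

```python
from types import SimpleNamespace


def is_web_grounded(response):
    """True when any candidate carries grounding chunks from a search tool."""
    for candidate in getattr(response, "candidates", None) or []:
        metadata = getattr(candidate, "grounding_metadata", None)
        if metadata is not None and getattr(metadata, "grounding_chunks", None):
            return True
    return False


def label_reply(text, response):
    """Prefix the UI badge to replies that drew on search results."""
    return f"✦ Web-grounded\n{text}" if is_web_grounded(response) else text


# Stand-in response objects for illustration; real ones come from the SDK.
grounded = SimpleNamespace(candidates=[
    SimpleNamespace(grounding_metadata=SimpleNamespace(grounding_chunks=[object()]))
])
plain = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])
```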


Google Cloud Services

Service         Role
Cloud Run       Serverless hosting with automatic scaling
Firestore       Stores users, projects, and generation job state
Cloud Storage   Stores generated images, videos, and recordings
Cloud Tasks     Manages asynchronous long-running generation jobs
Vertex AI       Runs all Gemini model inference

Third-Party Tools

  • FFmpeg (LGPL) — video compilation with narration audio
  • Playwright (Apache 2.0) — headless browser automation
  • MSS (MIT) — screen capture integration
  • Unsplash — template reference images
  • Gmail SMTP — generation completion notifications
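
For the FFmpeg step listed above, pairing a generated clip with its narration track could look roughly like this. The exact flags ARIA uses are not stated here, so treat the stream mapping, the AAC audio codec, and the function names as assumptions.

```python
import subprocess


def mux_command(video_path, narration_path, output_path):
    """Build an ffmpeg argv that pairs a video stream with a narration track."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", narration_path,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # keep the video stream as-is
        "-c:a", "aac",     # encode narration to AAC for MP4
        "-shortest",       # stop at the shorter of the two streams
        output_path,
    ]


def mux(video_path, narration_path, output_path):
    """Run ffmpeg; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(mux_command(video_path, narration_path, output_path), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.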

Challenges We Ran Into

WebSocket + Cloud Run
Maintaining persistent WebSocket connections on a stateless serverless platform required careful tuning of worker configuration and timeouts.

Live Audio Synchronisation
Streaming PCM audio without drift required a precise Web Audio buffer scheduling system with look-ahead timing.
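
The scheduling arithmetic can be shown independently of the browser. The sketch below is written in Python for illustration, whereas the real scheduler runs in JavaScript against the Web Audio clock; the 100 ms look-ahead and the function name are assumptions. Each chunk starts exactly where the previous one ends, and the schedule re-anchors only when playback has fallen behind.

```python
def schedule_chunks(chunk_sample_counts, sample_rate, arrival_times, lookahead=0.1):
    """Return the start time (seconds) assigned to each PCM chunk.

    `arrival_times` is the audio-clock time at which each chunk is received.
    """
    starts = []
    next_start = None
    for samples, now in zip(chunk_sample_counts, arrival_times):
        if next_start is None or next_start < now:
            # First chunk, or an underrun: restart slightly in the future.
            next_start = now + lookahead
        starts.append(next_start)
        # Advance by the chunk's exact duration so chunks play back to back.
        next_start += samples / sample_rate
    return starts
```

Deriving each start time from sample counts rather than from the arrival time is what prevents drift: network jitter shifts when chunks arrive, not when they are scheduled to play.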

Async Job Architecture
Replacing local threading with Cloud Tasks while preserving the same frontend polling API required careful endpoint design.

Veo Video Polling
Veo video generation can run for minutes. Cloud Tasks allowed the polling loop to survive Cloud Run request limits.
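
One way a polling loop can outlive a single request is for each task invocation to perform one short check and, if the video is not ready, schedule a follow-up task with a delay. Whether ARIA re-enqueues in exactly this way is an assumption; `poll_step`, its callables, and the default delay and attempt limits are hypothetical.

```python
def poll_step(job, check_operation, enqueue_again, delay_seconds=15, max_attempts=40):
    """Run one short polling step for a long-running generation job.

    `job` is a dict holding an "operation" handle; it is updated in place.
    `check_operation` returns a (done, result) pair for that handle.
    """
    job["attempts"] = job.get("attempts", 0) + 1
    done, result = check_operation(job["operation"])
    if done:
        job.update(status="done", result=result)
    elif job["attempts"] >= max_attempts:
        job.update(status="failed", result="timed out")
    else:
        job["status"] = "running"
        enqueue_again(job, delay_seconds)  # e.g. a task scheduled in the future
    return job["status"]
```

Each invocation returns quickly, so no single request has to stay open for the full length of the generation.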

Screen Capture in the Cloud
Server-side capture (MSS) works on a local machine but fails on Cloud Run, where the container has no display to capture. We switched to the browser's getDisplayMedia API.

Gunicorn Worker Compatibility
Finding the correct worker configuration (gthread) that supports both HTTP and WebSockets took several iterations.
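
A Gunicorn configuration along these lines is a reasonable starting point for a threaded worker serving both HTTP and WebSocket traffic. The values are illustrative assumptions, not the project's actual settings.

```python
# gunicorn.conf.py (illustrative values, not the project's actual settings)
import os

bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"  # Cloud Run supplies PORT
worker_class = "gthread"  # threaded worker; each open WebSocket occupies a thread
workers = 1               # one process per container instance
threads = 32              # rough upper bound on concurrent connections
timeout = 0               # disable Gunicorn's worker timeout; Cloud Run enforces its own
```

With a threaded worker, the thread count effectively caps the number of simultaneous WebSocket sessions per instance, so it is worth sizing alongside Cloud Run's concurrency setting.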

GCS Signed URLs
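Signed URL behaviour differs between local development and Cloud Run. Locally, a service account key file can sign URLs directly; on Cloud Run, the default credentials carry no private key. IAM-based signing with token fallback solved the issue.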
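
One plausible reading of that fix is sketched below: try local signing first, and if the credentials cannot sign, pass the service account email and an access token so the storage library signs through the IAM API instead. The helper name, the 15-minute lifetime, and catching `AttributeError` as the "no private key" signal are assumptions. `blob` is expected to behave like a `google.cloud.storage` Blob, and `refresh` is a caller-supplied function that refreshes the credentials.

```python
from datetime import timedelta


def make_signed_url(blob, credentials, refresh, ttl_minutes=15):
    """Return a V4 signed GET URL, signing locally if possible, else via IAM."""
    common = {
        "version": "v4",
        "expiration": timedelta(minutes=ttl_minutes),
        "method": "GET",
    }
    try:
        # Works when the credentials include a private key (e.g. a local key file).
        return blob.generate_signed_url(**common)
    except AttributeError:
        # Assumed failure mode when no private key is available: refresh to get
        # an access token, then let the library sign through the IAM API.
        refresh(credentials)
        return blob.generate_signed_url(
            service_account_email=credentials.service_account_email,
            access_token=credentials.token,
            **common,
        )
```

For the IAM path to work, the runtime service account also needs permission to sign on its own behalf (the Service Account Token Creator role).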


Accomplishments We're Proud Of

  • Full multimodal storytelling pipeline — voice input becomes a fully narrated story with images and video
  • Live voice that actually works — ARIA can respond to interruptions and control the interface
  • Zero frontend frameworks — the entire rich interface runs in pure HTML/CSS/JS
  • Fully GCP-native architecture
  • Wholesome storytelling guardrails designed into the system prompt
  • Deployed and live application, not just a prototype
  • Grounded responses visible in the UI via ✦ Web-grounded indicators

What We Learned

  • The Gemini Live API enables entirely new interaction patterns beyond traditional chat interfaces.
  • Cloud Tasks is ideal for long-running AI jobs in serverless environments.
  • Firestore is well suited for AI content workflows due to its schemaless flexibility.
  • Building a voice-first interface requires a different design philosophy from that of a traditional UI.
  • Hackathon constraints force strong architectural prioritisation.

What's Next for ARIA

Enhanced GCP Security
Advanced IAM roles and per-user quota management for enterprise deployments.

Monetisation
Subscription tiers (Basic / Pro / Studio) with usage limits per plan.

Collaboration
Real-time multi-user story editing using Firestore listeners.

Custom Voice Cloning
Users will be able to record their own narrator voice.

Mobile App
React Native wrapper with native camera and microphone access.

YouTube Direct Publishing
OAuth-based direct uploads to YouTube Studio.

Story Template Marketplace
Community-created templates discoverable by interest.

Offline Mode
Service worker caching so users can browse saved stories without internet access.


Tech Stack Summary

Layer                Technology
Hosting              Google Cloud Run
AI Inference         Vertex AI (Gemini Live, Image, Video, TTS)
Database             Cloud Firestore
Storage              Google Cloud Storage
Async Jobs           Cloud Tasks
Backend              Python · Flask · Flask-Sock
Frontend             Vanilla HTML / CSS / JavaScript
Video                FFmpeg
Browser Automation   Playwright
Email                Gmail SMTP
SDK                  Google GenAI Python SDK


Built With

  • cloud-storage
  • cloudrun
  • firestore
  • python
  • vertexai