Inspiration
Every year, millions of people watch Bharatanatyam and see only beauty — the colour, the precision, the athleticism. What they cannot see is the conversation happening in plain sight.
Each hand gesture — a Mudra — is a word in a 2,000-year-old vocabulary. A single performance encodes mythology, emotion, and narrative that takes scholars years to fully decode. The dancer works for a decade to master this language. The audience sits in respectful silence, understanding almost none of it.
To our knowledge, no real-time translation of this vocabulary has existed. Pre-written subtitles break the fourth wall. Human interpreters require the dancer to choreograph around them. The audience has always been locked out.
The question that launched Nritya: what if Gemini could sit in the audience and whisper?
What the Gemini Live Agent Does
Nritya deploys Gemini as a Creative Storyteller — not a chatbot, not a question-answering system, but a live theater companion that watches the performance through a camera and narrates meaning to the audience in real time.
The Gemini Live API's real-time multimodal architecture lets three things happen simultaneously within a single session:
1. Vision perception → Mudra identification
Gemini receives a JPEG frame and structured skeletal metadata the moment the dancer's wrist velocity drops below 3 px/frame, which signals a held pose or a completed gesture. The image provides visual confirmation. The skeletal JSON provides physical context: wrist velocity, stance depth, mirror-corrected hand positions. Gemini identifies the Mudra primarily from the image and treats the metadata as supporting evidence. This distinction matters: the system degrades gracefully when pose detection is imprecise.
2. Interleaved tool calls + live audio narration
The interleaved output stream is where the Gemini Live API demonstrates its unique capability. Gemini calls trigger_mudra_lock — snapping sacred geometry to the dancer's wrist — and update_story_card — pushing a Sanskrit title and poetic translation to the AR HUD — while simultaneously beginning to narrate. The NON_BLOCKING tool behavior means the AI never pauses speech waiting for tool responses. The audience hears the poetry before the card finishes animating.
3. Symbolic artwork generation on demand
For climactic poses and Rasa transitions, Gemini requests a Tanjore-style illustration via update_story_card with request_image: true. Vertex AI Imagen generates a symbolic metaphor — a lotus emerging from dark water, a flame consuming a battlefield — that appears on the story card as the narration plays. The dancer's gesture, the AI's words, and the generated image arrive in the same moment.
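The pose-lock trigger from step 1 can be sketched as a small per-frame gate. This is a minimal sketch, not the project's code: the `Point` shape, the `PoseLockDetector` class, and the `holdFrames` debounce are assumptions introduced for illustration; only the 3 px/frame threshold comes from the description above.

```typescript
// Hypothetical landmark shape; the source does not show the real format.
interface Point {
  x: number;
  y: number;
}

interface PoseLockOptions {
  velocityThresholdPx: number; // 3 px/frame in the description above
  holdFrames: number; // assumed debounce: consecutive slow frames before firing
}

class PoseLockDetector {
  private readonly options: PoseLockOptions;
  private previous: Point | null = null;
  private stillFrames = 0;
  private locked = false;

  constructor(options: PoseLockOptions) {
    this.options = options;
  }

  /** Feed one wrist position per frame; returns true only on the frame the pose locks. */
  update(wrist: Point): boolean {
    const prev = this.previous;
    this.previous = wrist;
    if (prev === null) return false; // a single sample has no velocity

    const velocity = Math.hypot(wrist.x - prev.x, wrist.y - prev.y);
    if (velocity >= this.options.velocityThresholdPx) {
      // Movement resumed: reset the counter and re-arm the trigger.
      this.stillFrames = 0;
      this.locked = false;
      return false;
    }

    this.stillFrames += 1;
    if (!this.locked && this.stillFrames >= this.options.holdFrames) {
      this.locked = true; // fire once per held pose
      return true;
    }
    return false;
  }
}
```

Firing once per hold, rather than on every slow frame, would keep a long-held gesture from sending a burst of duplicate frames to the model.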
sequenceDiagram
participant D as 💃 Dancer
participant G as Gemini Live Agent
participant HUD as AR HUD
participant V as Vertex AI Imagen
D->>G: Holds Alapadma (lotus) mudra
Note over G: Multimodal inference<br/>Image + skeletal JSON
G->>HUD: trigger_mudra_lock(LOTUS_OUTLINE)
G->>HUD: update_story_card("Alapadma",<br/>"A thousand petals unfurl...",<br/>request_image: true)
G-->>D: 🔊 Narration begins immediately
Note over G: NON_BLOCKING — speech<br/>never pauses for tools
G->>V: image_gen_prompt →<br/>"lotus emerging from dark water,<br/>Tanjore painting style"
V-->>HUD: Tanjore artwork
G->>HUD: set_atmosphere(shringara, 0.9)
G->>HUD: trigger_sfx(TEMPLE_BELL)
Note over HUD: Vignette shifts emerald<br/>Bell plays · narration ducks 400ms
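On the client, the tool calls in the sequence above could be routed to HUD handlers with a small dispatcher. This is a sketch under assumptions: the `functionCalls` and `functionResponses` field names mirror the Live API's tool-call messages, while the `hud` object, the handler bodies, and the `geometry` and `title` argument names are hypothetical.

```typescript
interface FunctionCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponse {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => void;

/** Runs each tool call against its handler and builds one response per call. */
function dispatchToolCalls(
  calls: FunctionCall[],
  handlers: Map<string, ToolHandler>,
): FunctionResponse[] {
  return calls.map((call) => {
    const handler = handlers.get(call.name);
    if (handler === undefined) {
      // Report unknown tools back to the model instead of throwing.
      return { id: call.id, name: call.name, response: { error: `unknown tool: ${call.name}` } };
    }
    try {
      handler(call.args);
      return { id: call.id, name: call.name, response: { ok: true } };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { id: call.id, name: call.name, response: { error: message } };
    }
  });
}

// Hypothetical HUD state and handlers, for illustration only.
const hud = { lockedGeometry: "", cardTitle: "" };
const handlers = new Map<string, ToolHandler>([
  ["trigger_mudra_lock", (args) => { hud.lockedGeometry = String(args["geometry"]); }],
  ["update_story_card", (args) => { hud.cardTitle = String(args["title"]); }],
]);
```

The returned array would then be sent back over the session as the tool response. Because the declarations are NON_BLOCKING, narration does not wait on that reply.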
How It's Built — Google Cloud Native
Nritya runs exclusively on Google Cloud. Zero third-party AI services.
graph TB
subgraph GCP["Google Cloud Platform"]
GL["Gemini Live API\ngemini-2.5-flash-native-audio\nBidirectional WebSocket"]
VI["Vertex AI · Imagen 3\nSymbolic artwork generation\nTanjore painting style"]
FF["Firebase Functions\nEphemeral token authority\nImagen proxy · API key vault"]
FH["Firebase Hosting\nGlobal CDN · Auto HTTPS"]
end
CICD["GitHub Actions\nZero-touch CI/CD pipeline"]
subgraph Browser["Browser Client"]
CAM["Camera feed\nMediaPipe gesture trigger"]
GC["GeminiClient\nWebSocket session"]
IC["ImagenClient\nfirebase/ai lazy import"]
HUD["AR HUD\nStory Card · Vignette\nSamaapti summary"]
end
CAM -->|"Pose lock → JPEG + JSON"| GC
Browser -->|"POST /generateLiveToken"| FF
FF -->|"ephemeral token · uses:1"| Browser
GC -->|"wss://?access_token={token}"| GL
GL -->|"PCM16 audio + tool calls"| GC
GC -->|"requestImage: true"| IC
IC -->|"firebase/ai · getImagenModel()"| FF
FF --> VI
VI -->|"base64 PNG"| HUD
FH -->|"Serves SPA"| Browser
CICD --> FH
The ephemeral token architecture is the security decision that separates Nritya from demo-quality AI projects. Firebase Functions mints a one-use token — uses: 1, 30-minute TTL — via authTokens.create(). The Gemini API key never exists in the browser, never appears in the network tab, never ships in the client bundle. The browser opens the Gemini Live WebSocket directly using this token, which is consumed on connection and permanently invalidated. This is production security architecture, not hackathon scaffolding.
sequenceDiagram
participant B as Browser
participant F as Firebase Functions
participant G as Gemini API
B->>F: POST /generateLiveToken
Note over F: API key: server-side only<br/>Never in client bundle
F->>G: authTokens.create()<br/>uses: 1 · TTL: 30min
G-->>F: ephemeral access_token
F-->>B: { access_token }
B->>G: WebSocket connect<br/>?access_token={token}
Note over G: Token consumed · invalidated<br/>Replay attacks: impossible
G-->>B: Session open
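The two values that travel through this flow, the token constraints on the server and the WebSocket URL on the client, can be sketched as pure helpers. This is an illustration, not the project's code: `buildLiveTokenConfig` and `buildLiveSocketUrl` are hypothetical names, the config object is assumed to be what gets passed to `authTokens.create()`, and the WebSocket base URL is deliberately left as a caller-supplied parameter because the source does not spell it out.

```typescript
interface LiveTokenConfig {
  uses: number; // how many sessions the token may open
  expireTime: string; // ISO 8601 timestamp after which the token is rejected
}

/** Builds the single-use, 30-minute constraints described above. */
function buildLiveTokenConfig(now: Date, ttlMinutes = 30): LiveTokenConfig {
  const expiresAt = new Date(now.getTime() + ttlMinutes * 60_000);
  return { uses: 1, expireTime: expiresAt.toISOString() };
}

/** Appends the ephemeral token as the access_token query parameter. */
function buildLiveSocketUrl(baseUrl: string, token: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set("access_token", token); // URL-encodes the token
  return url.toString();
}

// Example with a placeholder endpoint (not the real Gemini Live URL).
const config = buildLiveTokenConfig(new Date("2025-01-01T00:00:00.000Z"));
const socketUrl = buildLiveSocketUrl("wss://example.invalid/live", "abc123");
```

Keeping both helpers free of SDK calls makes the security-relevant values easy to unit-test without a network connection.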
Firebase AI Logic handles the Vertex AI Imagen path in production through a lazily imported firebase/ai SDK. The SDK never loads in development, and in production it loads only when the first image is requested, which keeps the initial page load fast. Firebase Functions holds the project credentials server-side and proxies the generation request, ensuring the Vertex AI endpoint is never directly callable from the browser.
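The lazy-import pattern can be sketched as a memoized loader. The `lazy` helper below is a hypothetical illustration rather than the project's code; in the real client the loader passed to it would presumably be a dynamic `import("firebase/ai")`.

```typescript
/**
 * Wraps an async loader so the underlying work starts on first use
 * and every later call reuses the same promise.
 */
function lazy<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader().catch((err) => {
        cached = undefined; // allow a retry after a failed load
        throw err;
      });
    }
    return cached;
  };
}

// Assumed usage in the client (shown as a comment because the module
// is only resolvable inside the app):
//   const loadFirebaseAi = lazy(() => import("firebase/ai"));
//   const { getImagenModel } = await loadFirebaseAi();
```

Clearing the cache on failure means a transient network error on the first image request would not disable artwork for the rest of the session.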
What Gemini Live Unlocked
Before the Gemini Live API, this product was not practical to build. Three constraints blocked prior approaches:
Latency. Classical turn-based vision APIs typically introduce 2–5 seconds of round-trip latency. A dancer can complete three gestures in that window. The Gemini Live WebSocket session maintains a persistent connection: the JPEG frame is sent the moment the pose locks, and narration begins before the dancer exhales.
Simultaneity. Turn-based alternatives cannot stream audio output while dispatching tool calls; they force a choice: speak or act. The Gemini Live API's interleaved output stream does both in the same response turn. The sacred geometry snaps to the wrist while the poetry plays in the audience's ears.
Creative voice. The Gemini Live API's systemInstruction field accepts a full persona — the Sutradhara, the ancient thread-holder of Indian theater. The agent speaks in embodied, rhythmic language: "Her arm cuts the air like a blade — this is Pataka, the flag that commands armies to halt." This quality of output requires a model that understands narrative, metaphor, and cultural context at the level Gemini delivers.
Accomplishments
- Six live tool declarations wired to the AR layer: trigger_mudra_lock, update_story_card, set_atmosphere, trigger_sfx, end_session, request_clarification, all dispatched and responded to within a single Gemini turn
- Production security architecture: ephemeral token pattern, zero API key client exposure
- Vertex AI Imagen 3 generating Tanjore painting-style artwork on demand, per gesture, per performance
- Fully automated deployment: GitHub Actions → Firebase Hosting, every commit ships to production at nritya-ardance.web.app
- Google Cloud native end-to-end: Gemini, Vertex AI, Firebase Functions, Firebase Hosting, Firebase AI Logic, with no third-party AI services anywhere in the stack
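The six tool declarations might look roughly like the following. Treat this as a sketch under stated assumptions: the tool names, the `request_image` flag, and the NON_BLOCKING behavior come from the text above, but every description, the other parameter names (`geometry`, `title`, `translation`, `rasa`, `intensity`, `sfx`), and the choice to mark all six tools NON_BLOCKING are guesses made for illustration.

```typescript
// Minimal subset of a Gemini function declaration, limited to the fields used here.
interface ToolDeclaration {
  name: string;
  description: string;
  behavior: "NON_BLOCKING";
  parameters?: {
    type: "OBJECT";
    properties: Record<string, { type: "STRING" | "NUMBER" | "BOOLEAN" }>;
    required: string[];
  };
}

const TOOL_DECLARATIONS: ToolDeclaration[] = [
  {
    name: "trigger_mudra_lock",
    description: "Snap an overlay shape to the dancer's wrist.",
    behavior: "NON_BLOCKING",
    parameters: { type: "OBJECT", properties: { geometry: { type: "STRING" } }, required: ["geometry"] },
  },
  {
    name: "update_story_card",
    description: "Show a Sanskrit title and translation on the HUD.",
    behavior: "NON_BLOCKING",
    parameters: {
      type: "OBJECT",
      properties: {
        title: { type: "STRING" },
        translation: { type: "STRING" },
        request_image: { type: "BOOLEAN" },
      },
      required: ["title", "translation"],
    },
  },
  {
    name: "set_atmosphere",
    description: "Shift the vignette to match the current Rasa.",
    behavior: "NON_BLOCKING",
    parameters: {
      type: "OBJECT",
      properties: { rasa: { type: "STRING" }, intensity: { type: "NUMBER" } },
      required: ["rasa", "intensity"],
    },
  },
  {
    name: "trigger_sfx",
    description: "Play a named sound effect.",
    behavior: "NON_BLOCKING",
    parameters: { type: "OBJECT", properties: { sfx: { type: "STRING" } }, required: ["sfx"] },
  },
  { name: "end_session", description: "Close the performance.", behavior: "NON_BLOCKING" },
  { name: "request_clarification", description: "Ask for a clearer view.", behavior: "NON_BLOCKING" },
];

/** Lists required arguments that a tool call failed to supply. */
function missingRequiredArgs(toolName: string, args: Record<string, unknown>): string[] {
  const tool = TOOL_DECLARATIONS.find((t) => t.name === toolName);
  if (tool === undefined) throw new Error(`unknown tool: ${toolName}`);
  const required = tool.parameters?.required ?? [];
  return required.filter((key) => !(key in args));
}
```

A check like `missingRequiredArgs` could run before a handler touches the HUD, so a malformed call produces an error response instead of a half-rendered card.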
What's Next
Nritya launched as a hackathon project and quickly showed something we did not expect: Gemini's real-time multimodal capability can make a 2,000-year-old art form legible to a modern global audience.
The roadmap extends across three horizons:
Depth: Expand from mudra identification to full Abhinaya (facial expression) recognition, Nritta (rhythmic footwork) patterns, and complete Rasa arc tracking across a full performance. Bharatanatyam's canonical vocabulary (28 single-hand and roughly two dozen double-hand mudras in the Abhinaya Darpana) combines into thousands of choreographic sequences; the Gemini Live Agent's context window makes comprehensive coverage achievable.
Breadth: Extend the Gemini Live Agent framework from Bharatanatyam to the other seven classical Indian dance forms: Kuchipudi, Odissi, Kathak, Manipuri, Mohiniyattam, Kathakali, and Sattriya. The underlying architecture is dance-form agnostic: the system prompt and tool schema adapt; the GCP infrastructure does not change.
Institution: Present Nritya to India's Ministry of Culture and state cultural ministries as a technology platform for the preservation and global accessibility of intangible cultural heritage. Making ancient artistic languages comprehensible to modern audiences in real time represents a new category of cultural technology, one with genuine policy implications for how India presents its classical arts to the world.
The dancer has always been speaking. Gemini is finally listening.
Built With
Gemini Live API · Vertex AI Imagen 3 · Firebase Functions · Firebase AI Logic · Firebase Hosting · GitHub Actions · Svelte 5 · TypeScript · MediaPipe · GSAP · WebAudio API
- firebase-ai-logic
- firebase-cloud-functions
- firebase-hosting
- gemini-live-api
- svelte5
- typescript
- vertex-ai
- zod