Cortana - Multimodal Conversational AI Agent

🌌 Inspiration: Beyond the Chatbot The inspiration for Cortana stems from a core dissatisfaction with the current "command-response" paradigm of LLMs. Most AI interactions today feel like filling out a form—you type, you wait, you receive text. We wanted to build something that felt ambient and immersive, similar to the AI interfaces seen in The Expanse or Halo.

We envisioned an agent that doesn't just "chat," but co-creates. An agent that listens to your voice, understands the emotional and descriptive context of your words, and proactively generates visuals to match—all while maintaining a low-latency voice conversation.

🏗️ How We Built Cortana Cortana is a distributed system designed for high concurrency and multimodal throughput.

  1. The Orchestration Layer The core of Cortana is a custom WorkflowCoordinator. Unlike a traditional synchronous request, we had to manage three parallel lifecycles:

Gemini Live (WebSockets): Real-time, full-duplex audio stream for voice interaction. Media Pipeline (REST): Triggered by the user to generate high-fidelity images and video based on conversation context.

Cloud Persistence: Background pushing of generated artifacts to Google Cloud Storage. The latency $L_{total}$ of our multimodal response can be modeled as: $$L_{total} = \max(L_{audio}, L_{gen}) + L_{network}$$ where $L_{audio}$ is the time to generate the voice response and $L_{gen}$ is the time to generate the visual artifact.

  1. Tech Stack Core AI: Gemini 2.5 Flash (Native Audio), Gemini Live API, Google GenAI SDK. Frontend: React (Vite) for the shell, and Lit Elements for the high-performance 3D Audio Orb visuals. Compute: Google Cloud Run (containerized backend). Storage: Google Cloud Storage ($GCS$) for artifact persistence.

🧠 What We Learned Native Audio is a Game Changer: Using Gemini 2.5 Flash's native audio capabilities allowed us to bypass the traditional Whisper (STT) $\rightarrow$ LLM $\rightarrow$ TTS pipeline. This reduced latency by nearly $40%$.

Contextual Nuance: We learned that the "feeling" of an AI is dictated by its peripheral actions. Adding the Manual Generation trigger (the Sparkle button) made the AI feel like a professional tool rather than a toy that just guesses what you want.

🚧 Challenges Faced GCP Organization Policies: We encountered strict allUsers invoker restrictions on Cloud Run. We had to pivot our architecture to use authenticated signed URLs for artifact retrieval, ensuring security while maintaining submission visibility.

Vite Browser Security: Managing API keys in a client-side Vite application required careful handling of VITE_ prefixes and import.meta.env to ensure the GenAI SDK could authenticate directly from the browser for low-latency generation.

Barge-in Logic: Implementing stable "interruption" (barge-in) required complex state management between the browser's AudioContext and the Gemini Live socket to ensure Cortana stops speaking the millisecond you do.

🚀 Deployment Details (GCP Proof) Our solution is live and hosted on Google Cloud.

Built With

Share this project:

Updates