Inspiration

Watching Inside Out highlighted how compelling it is to see emotions interact as independent forces. Joy arguing with Sadness or Anger losing control felt surreal and concrete at the same time. The goal was to build a system where those internal dynamics could be spoken to directly.

What It Does

Inside Inside Out is a voice-first AI console where five emotions (Joy, Sadness, Anger, Fear, Disgust) operate as separate agents. They do not simply reply to the user. They interrupt one another, debate in real time, and react dynamically to what is said. Users can hold live voice conversations with the emotions or prompt them to perform improvised comedy scenes based on provided scenarios.

How It Was Built

Brain
Google Gemini 2.5 Flash Lite via Vertex AI handles emotional reasoning and selects which emotion speaks next.
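
For illustration, a minimal sketch of that speaker-selection step, assuming the vertexai Python SDK; the function name, prompt wording, and the exact model ID string are assumptions, not the project's actual code:

```python
# Hypothetical sketch: pick_next_speaker, the prompt text, and the
# model ID string are assumptions based on the description above.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = GenerativeModel("gemini-2.5-flash-lite")

EMOTIONS = ["Joy", "Sadness", "Anger", "Fear", "Disgust"]

def pick_next_speaker(transcript: str) -> str:
    """Ask the model which emotion should react to the latest turn."""
    prompt = (
        f"You are routing dialogue between five emotions: {', '.join(EMOTIONS)}.\n"
        f"Conversation so far:\n{transcript}\n"
        "Reply with only the name of the emotion that should speak next."
    )
    response = model.generate_content(prompt)
    choice = response.text.strip()
    return choice if choice in EMOTIONS else "Joy"  # fall back to a default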

Voice
ElevenLabs Turbo v2.5 is used for low-latency text-to-speech streaming.
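
A sketch of what that streaming call might look like against the ElevenLabs HTTP streaming endpoint; the voice-ID mapping and chunk size are assumptions:

```python
# Hypothetical low-latency TTS streaming; VOICE_IDS is an assumed
# mapping from emotion name to an ElevenLabs voice ID.
import os
import requests

VOICE_IDS = {"Joy": "voice-id-for-joy"}  # hypothetical mapping

def stream_tts(emotion: str, text: str):
    """Yield audio chunks as ElevenLabs produces them."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_IDS[emotion]}/stream"
    resp = requests.post(
        url,
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_turbo_v2_5"},
        stream=True,
    )
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            yield chunk
```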

Frontend
React and Vite with Framer Motion for animation, plus WebSocket streaming for real-time audio delivery.

Backend
FastAPI orchestrates the system with a streaming architecture that pipes audio chunks directly to the browser.
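
One way that pipeline could look, as a sketch: pick_next_speaker and stream_tts are the hypothetical helpers sketched above, and generate_reply stands in for the per-emotion Gemini dialogue call. The browser consumes the same WebSocket from the React frontend.

```python
# Sketch of the streaming pipeline: route the turn, generate a reply,
# and forward TTS audio chunks to the browser as they arrive.
from fastapi import FastAPI, WebSocket

app = FastAPI()

def generate_reply(emotion: str, user_text: str) -> str:
    """Placeholder for the per-emotion Gemini dialogue call."""
    return f"({emotion} responds to: {user_text})"

@app.websocket("/ws/audio")
async def audio_socket(ws: WebSocket):
    await ws.accept()
    while True:
        user_text = await ws.receive_text()       # latest user utterance
        emotion = pick_next_speaker(user_text)    # Gemini routes the turn
        reply = generate_reply(emotion, user_text)
        for chunk in stream_tts(emotion, reply):  # pipe audio as it arrives
            await ws.send_bytes(chunk)
```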

Significant time was spent manually reviewing the ElevenLabs voice library to match voices to emotional profiles: Joy needed sustained optimism, Anger needed sharpness, and Sadness needed emotional weight.

Each emotion has a dedicated knowledge base defining personality traits, reaction patterns, and relationships with the other emotions.
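
A hypothetical shape for one of those knowledge bases; the field names and example values below are illustrative, not the project's actual data:

```python
# Assumed structure of a per-emotion knowledge base.
from dataclasses import dataclass, field

@dataclass
class EmotionProfile:
    name: str
    traits: list[str]       # core personality traits
    triggers: list[str]     # topics that provoke a reaction
    relationships: dict[str, str] = field(default_factory=dict)  # stance toward the others

ANGER = EmotionProfile(
    name="Anger",
    traits=["blunt", "protective", "quick to escalate"],
    triggers=["unfairness", "being ignored"],
    relationships={
        "Joy": "finds her optimism exhausting",
        "Fear": "dismisses his caution",
    },
)
```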

Challenges

Maintaining multi-agent coherence was the primary difficulty: five distinct personas had to stay consistent while remaining aware of each other's dialogue. Context management became the core constraint, and preventing personality drift while preserving short-term conversational memory required extensive iteration.
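
One plausible shape for that short-term memory, sketched under assumptions: a single rolling transcript every agent reads, with fixed per-agent persona prompts kept separate so personas don't bleed into each other. The window size and prompt layout here are illustrative.

```python
# Assumed shared-context scheme: one rolling transcript, per-agent personas.
from collections import deque

MAX_TURNS = 20  # assumed window size; tune to the model's context budget

shared_transcript: deque[str] = deque(maxlen=MAX_TURNS)

def record_turn(speaker: str, text: str) -> None:
    shared_transcript.append(f"{speaker}: {text}")

def build_prompt(emotion: str, persona_prompt: str) -> str:
    """Combine the fixed persona with the shared recent history."""
    history = "\n".join(shared_transcript)
    return f"{persona_prompt}\n\nRecent conversation:\n{history}\n{emotion}:"
```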

Accomplishments

The multi-agent system keeps shared context intact across agents, so the emotions behave as if they are in the same room: they reference prior statements, interrupt when triggered, and retain distinct voices across extended conversations.

What Was Learned

  • Advanced use of ElevenLabs, including voice selection, expressive synthesis, and streaming-latency optimization
  • Real-time inference at scale using GCP and Vertex AI
  • Designing shared-context systems where multiple AI agents coexist without interference

What’s Next

  • Expanding the emotional roster
  • Refining interruption logic for more natural conversation flow
  • Implementing memory persistence so emotions retain user-specific context across sessions
