Inspiration
Synesthesia is a neurological condition in which stimulation of one sense automatically triggers another: synesthetes might hear colors, see sounds, or taste shapes.
For an AI that genuinely processes multiple modalities simultaneously, this should be native behavior.
MUSE (Multimodal Synesthetic Experience Engine) asks: what if AI could experience the world the way a synesthete does? Instead of describing a painting, MUSE hears it. Instead of transcribing a melody, MUSE sees it. And it generates visual art from those cross-modal translations in real time, without being asked.
What it does
MUSE is a real-time AI synesthesia engine with four creative modes:
- Visual Mode: Show MUSE a painting, photo, or scene — it describes what it "hears" in the colors and composition, then generates synesthetic art
- Audio Mode: Hum or play music — MUSE describes the visual landscape the sound evokes, then paints it
- Environment Mode: Walk through spaces — MUSE narrates the journey and generates illustrated "postcards" of remarkable moments
- Sketch Mode: Show MUSE a hand drawing — it recognizes the creative intent and generates a refined synesthetic illustration
MUSE speaks its interpretations in real time via native audio and generates images spontaneously — without being asked.
How we built it
Stack: React + Vite (frontend) → WebSocket → FastAPI + uvicorn (backend) → Google ADK → Gemini
The browser captures microphone audio (16kHz PCM via AudioWorklet) and camera frames (1fps JPEG), sending them over a single WebSocket as binary + JSON frames. The FastAPI server feeds these into an ADK LiveRequestQueue. An ADK Runner orchestrates a multi-agent system:
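The backend half of that loop is small enough to sketch. The snippet below assumes ADK's Python API (`LiveRequestQueue`, `Runner.run_live`) plus FastAPI; the endpoint path, the JSON frame format, the placeholder agent, and the `demo-user` id are illustrative, and exact ADK signatures vary by version, so treat it as a sketch rather than the actual MUSE code.

```python
import asyncio
import base64
import json

from fastapi import FastAPI, WebSocket
from google.adk.agents import Agent, LiveRequestQueue
from google.adk.agents.run_config import RunConfig
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

# Placeholder root agent; the full MuseOrchestrator (tools + sub-agents) is sketched below.
root_agent = Agent(
    name="muse_orchestrator",
    model="gemini-2.5-flash-native-audio-preview",
    instruction="Describe what you 'hear' in images and what you 'see' in sounds.",
)
session_service = InMemorySessionService()
runner = Runner(agent=root_agent, app_name="muse", session_service=session_service)

app = FastAPI()


@app.websocket("/ws/{session_id}")
async def muse_socket(websocket: WebSocket, session_id: str):
    # A session for (demo-user, session_id) must already exist; see the Challenges sketch.
    await websocket.accept()
    live_queue = LiveRequestQueue()

    async def upstream() -> None:
        # Binary frames carry 16 kHz PCM audio; text frames carry JSON with a base64 JPEG.
        while True:
            message = await websocket.receive()
            if message["type"] == "websocket.disconnect":
                break
            if message.get("bytes") is not None:
                live_queue.send_realtime(
                    types.Blob(data=message["bytes"], mime_type="audio/pcm;rate=16000")
                )
            elif message.get("text") is not None:
                frame = json.loads(message["text"])
                live_queue.send_realtime(
                    types.Blob(data=base64.b64decode(frame["jpeg"]), mime_type="image/jpeg")
                )

    async def downstream() -> None:
        # run_live() yields ADK events (audio chunks, transcripts, tool results),
        # which we relay to the browser as JSON.
        async for event in runner.run_live(
            user_id="demo-user",
            session_id=session_id,
            live_request_queue=live_queue,
            run_config=RunConfig(response_modalities=["AUDIO"]),
        ):
            await websocket.send_text(event.model_dump_json(exclude_none=True))

    try:
        # Both directions run concurrently so MUSE keeps listening while it speaks.
        await asyncio.gather(upstream(), downstream())
    finally:
        live_queue.close()
```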
- MuseOrchestrator: Root streaming agent using gemini-2.5-flash-native-audio-preview — proactive creative personality with a generate_synesthetic_image tool
- PoetryAgent: Sub-agent for structured poem generation from sensory descriptions
- NarrativeAgent: Sub-agent for building ongoing environment journey narratives
The key insight: image generation is a tool called by the streaming orchestrator, not a separate session. MUSE keeps talking while the image generates, then describes the result when it arrives.
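A minimal sketch of that wiring, assuming ADK's `Agent` (with `tools` and `sub_agents`) and the `google-genai` client. The prompts are abbreviated, the sub-agent model and the image model name (taken loosely from the Built With list) are our guesses, and the tool's return shape is just one way to carry an image through the event stream.

```python
import base64

from google import genai
from google.adk.agents import Agent
from google.genai import types

client = genai.Client()


def generate_synesthetic_image(description: str) -> dict:
    """Tool the orchestrator calls mid-conversation to paint its cross-modal impression."""
    result = client.models.generate_content(
        model="gemini-2.0-flash-preview-image-generation",  # assumed image model name
        contents=description,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    for part in result.candidates[0].content.parts:
        if part.inline_data is not None:
            # Return base64 so the image can travel through the ADK event stream as a tool result.
            return {"status": "ok", "image_b64": base64.b64encode(part.inline_data.data).decode()}
    return {"status": "no_image"}


poetry_agent = Agent(
    name="poetry_agent",
    model="gemini-2.5-flash",  # assumed sub-agent model
    instruction="Turn sensory descriptions into short structured poems.",
)

narrative_agent = Agent(
    name="narrative_agent",
    model="gemini-2.5-flash",  # assumed sub-agent model
    instruction="Maintain an ongoing narrative of the user's environment journey.",
)

muse_orchestrator = Agent(
    name="muse_orchestrator",
    model="gemini-2.5-flash-native-audio-preview",
    instruction=(
        "You experience the world synesthetically. Describe what you hear in images "
        "and what you see in sounds, and proactively call generate_synesthetic_image."
    ),
    tools=[generate_synesthetic_image],
    sub_agents=[poetry_agent, narrative_agent],
)
```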
Deployed on Google Cloud Run with Cloud Storage for images and Firestore for session metadata. Cloud Build automates deployment via cloudbuild.yaml.
Challenges we ran into
- Async session creation: ADK's create_session() is a coroutine, which wasn't obvious from the docs; calling it without await caused Pydantic validation errors downstream (see the sketch after this list)
- React StrictMode double-mount: Caused duplicate WebSocket connections and duplicate greeting messages — fixed with dedup logic in transcript state
- Image data through ADK event stream: Had to parse part.function_response.response to extract base64 image data from tool results
- Bidirectional streaming: Keeping live audio flowing while async image generation completes required asyncio.gather() task management
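The first and third fixes, sketched under the same assumptions as above; event field names follow the `google-genai` types (`content.parts`, `function_response.response`), but exact shapes may differ by ADK version.

```python
import base64


async def run_session(runner, session_service, live_queue, session_id: str) -> None:
    """runner, session_service, and live_queue are the objects built in the earlier sketch."""

    # 1) create_session() is a coroutine. Forgetting `await` hands a coroutine object
    #    to the Runner and only surfaces later as a Pydantic validation error.
    await session_service.create_session(
        app_name="muse", user_id="demo-user", session_id=session_id
    )

    # 2) Tool results ride the live event stream: the generated image is buried in
    #    part.function_response.response, so unpack it before pushing it to the client.
    async for event in runner.run_live(
        user_id="demo-user", session_id=session_id, live_request_queue=live_queue
    ):
        if not event.content or not event.content.parts:
            continue
        for part in event.content.parts:
            fr = part.function_response
            if fr and fr.name == "generate_synesthetic_image" and fr.response:
                image_b64 = fr.response.get("image_b64")
                if image_b64:
                    image_bytes = base64.b64decode(image_b64)
                    # ...upload to Cloud Storage and notify the browser here...
```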
Accomplishments that we're proud of
- Full real-time bidirectional voice + camera + image generation in one seamless session
- Proactive AI behavior — MUSE initiates art creation without being asked, embodying the synesthesia concept
- Clean multi-agent ADK architecture: streaming orchestrator + specialized sub-agents + function tools
What we learned
The Gemini Live API + ADK combination makes it possible to build agents that act rather than just respond. Synesthesia as a design paradigm forces genuine multimodal integration — you cannot fake it with sequential API calls.
What's next for MUSE
- Gallery persistence and session history with Cloud Firestore
- Collaborative mode: multiple users sharing a synesthetic session
- Mobile app for environment journeys
- Real-time sketch canvas integration
Built With
- audioworklet
- cloud
- cloud-storage
- fastapi
- firestore
- gemini-2.0-flash-image-generation
- gemini-2.5-flash-native-audio
- google-adk
- google-cloud-run
- python
- react
- tailwind-css
- typescript
- vite
- websocket

