Inspiration

Over 1 billion people worldwide live with disabilities that make traditional keyboard-and-mouse interfaces difficult or impossible to use. We asked: What if anyone could create stunning AI art using just their hands and voice?

Muse was born from the belief that creative expression should be accessible to everyone — regardless of physical ability.

What it does

Muse is a hands-free AI art studio that turns hand gestures and voice commands into AI-generated images and videos, powered entirely by Google's Gemini API ecosystem.

Core Interaction Model

  • ✊ Fist → Start voice input (describe what you want)
  • 👌 OK → Generate AI image from your description
  • ✌️ Peace → Generate video from the image
  • 🖐️ Open Palm → Get creative inspiration
  • 🤙 Shaka → Start real-time voice conversation with AI
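
A single noisy camera frame shouldn't trigger one of these actions. Below is a minimal sketch (hypothetical names, not Muse's actual code) of the kind of persistence filter described in the challenges section, which only fires a gesture after it has been detected in 3 consecutive frames:

```typescript
// Hypothetical sketch: require the same gesture for N consecutive
// frames before dispatching its action, to suppress single-frame noise.
type Gesture = "fist" | "ok" | "peace" | "open_palm" | "shaka" | null;

class GestureFilter {
  private last: Gesture = null;
  private streak = 0;

  constructor(private readonly required = 3) {}

  // Feed one classified frame; returns the gesture exactly once, on the
  // frame where it has persisted for `required` consecutive frames.
  update(gesture: Gesture): Gesture {
    if (gesture !== null && gesture === this.last) {
      this.streak++;
    } else {
      this.last = gesture;
      this.streak = gesture === null ? 0 : 1;
    }
    return this.streak === this.required ? gesture : null;
  }
}
```

Returning the gesture only on the exact frame the streak is reached means holding a pose fires its action once, rather than on every subsequent frame.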

Gemini Models Used

  • Gemini 3 Flash Preview: AI Art Director that refines prompts with streaming thought chains
  • Gemini 2.5 Flash Image: image generation from refined prompts
  • Veo 3.1 Fast: video generation from images with motion
  • Gemini 2.5 Flash Native Audio: real-time bidirectional voice conversations (Live API)
  • Gemini 2.0 Flash: social copy generation and image analysis

Accessibility Features

  • 8 hand gesture controls via camera
  • 15+ voice commands in 7 languages
  • Full keyboard shortcut support
  • Audio announcements for screen readers
  • Haptic feedback on mobile
  • Button text labels toggle

How we built it

Frontend-only architecture — no backend server needed:

  • React + TypeScript + Vite for the UI
  • @google/genai SDK for ALL AI interactions (no other AI providers)
  • MediaPipe Holistic (WASM) for real-time hand & face tracking in the browser
  • Web Speech API for voice input
  • Tailwind CSS for responsive design across mobile/tablet/desktop
  • Firebase Auth for user authentication
  • Web Audio API for generative ambient music during loading

The entire AI pipeline runs client-side, calling the Gemini API directly with no middleware.

Challenges we ran into

  1. Gesture reliability — Hand gesture detection needed careful tuning of thresholds and a persistence filter (3 consecutive frames) to avoid false positives
  2. Fist→OK transition — When switching from recording (fist) to generating (OK), we had to implement stopAndCapture() to ensure the voice transcript was fully delivered before triggering generation
  3. API overload handling — Gemini 503/UNAVAILABLE errors during streaming required wrapping the entire stream read (not just connection) in retry logic with exponential backoff
  4. Accessible UX — Balancing a visually rich interface with true accessibility required aria-labels on every button, keyboard shortcuts for all actions, and i18n across 7 languages
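
Challenge 3 can be sketched generically. The key point is that a 503 can arrive mid-stream, so the retry must wrap the entire read loop, not just the call that opens the stream. The function below is a simplified, hypothetical illustration that works over any async-iterable producer (e.g. a streaming Gemini call), not Muse's actual implementation:

```typescript
// Hypothetical sketch: retry with exponential backoff wrapping the whole
// stream read, since 503/UNAVAILABLE errors can occur after the first chunk.
async function readStreamWithRetry<T>(
  openStream: () => AsyncIterable<T>, // e.g. opens a streaming API call
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T[]> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const chunks: T[] = []; // discard partial output from a failed attempt
    try {
      for await (const chunk of openStream()) chunks.push(chunk);
      return chunks; // full stream consumed: success
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}
```

Discarding the partial chunks and reopening the stream on each attempt keeps the caller from ever seeing a half-delivered response.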

Accomplishments that we're proud of

  • Zero-keyboard creative workflow: Users can go from idea to AI-generated art to video using only hand gestures and voice
  • 6 Gemini models integrated into a single coherent experience
  • Real-time AR effects on the camera feed during generation (constellation particles, energy rings, pulsing vignette)
  • Generative ambient music using Web Audio API synthesis — evolving chord pads, random melodic notes, and filtered noise textures that change per generation stage

What we learned

  • MediaPipe's WASM-based hand tracking is remarkably accurate but requires careful frame-rate management to avoid UI jank
  • The Gemini Live API (bidiGenerateContent) requires dated model variants — non-dated aliases don't work
  • React state updaters inside async chains can cause timing bugs — using refs (studioStateRef) for cross-async-boundary state reads solved this
  • Web Audio API can create surprisingly musical ambient soundscapes with just oscillators, filters, and delay nodes
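
The stale-closure lesson can be shown without React. In the framework-free sketch below (hypothetical names, only loosely modeled on studioStateRef), a value captured before an await is frozen at call time, while a mutable ref object yields the latest value when the async step resumes:

```typescript
// Simplified sketch of the stale-closure problem: reading state through a
// mutable ref survives an async boundary; a captured snapshot does not.
type Ref<T> = { current: T };

async function captureVsRef(
  snapshot: string, // value captured when the async chain started
  ref: Ref<string>, // mutable ref read again after the await
): Promise<[string, string]> {
  // Simulate an in-flight API call during which state changes elsewhere.
  await new Promise((r) => setTimeout(r, 10));
  return [snapshot, ref.current]; // snapshot is stale; ref is current
}
```

This mirrors why a React state value closed over before an await can be outdated by the time generation finishes, while a ref read at that moment reflects the update.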

What's next for Muse

  • 3D Model Generation and Genie 3 World Generation (UI ready, awaiting API access)
  • Personalized AI Memory for Pro users — Muse remembers your creative style
  • Cloud deployment with ephemeral tokens for secure client-side Gemini access
  • Community gallery for sharing creations

Built With

  • firebase-auth
  • gemini-2.5-flash-image
  • gemini-3-flash
  • gemini-live-api
  • google-genai-sdk
  • mediapipe
  • react
  • tailwind-css
  • typescript
  • veo-3.1
  • vite
  • web-audio-api
  • web-speech-api