Inspiration

Over 1 billion people worldwide live with disabilities that make traditional keyboard-and-mouse interfaces difficult or impossible to use. We asked: What if anyone could create stunning AI art using just their hands and voice?

Muse was born from the belief that creative expression should be accessible to everyone — regardless of physical ability.

What it does

Muse is a hands-free AI art studio that turns hand gestures and voice commands into AI-generated images and videos, powered entirely by Google's Gemini API ecosystem.

Core Interaction Model

  • ✊ Fist → Start voice input (describe what you want)
  • 👌 OK → Generate AI image from your description
  • ✌️ Peace → Generate video from the image
  • 🖐️ Open Palm → Get creative inspiration
  • 🤙 Shaka → Start real-time voice conversation with AI
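
A single noisy camera frame shouldn't trigger one of these actions. Below is a minimal sketch (hypothetical names, not Muse's actual code) of the kind of persistence filter described in the challenges section, which only fires a gesture after it has been detected in 3 consecutive frames:

```typescript
// Hypothetical sketch: require the same gesture for N consecutive
// frames before dispatching its action, to suppress single-frame noise.
type Gesture = "fist" | "ok" | "peace" | "open_palm" | "shaka" | null;

class GestureFilter {
  private last: Gesture = null;
  private streak = 0;

  constructor(private readonly required = 3) {}

  // Feed one classified frame; returns the gesture exactly once, on the
  // frame where it has persisted for `required` consecutive frames.
  update(gesture: Gesture): Gesture {
    if (gesture !== null && gesture === this.last) {
      this.streak++;
    } else {
      this.last = gesture;
      this.streak = gesture === null ? 0 : 1;
    }
    return this.streak === this.required ? gesture : null;
  }
}
```

Returning the gesture only on the exact frame the streak is reached means holding a pose fires its action once, rather than on every subsequent frame.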

Gemini Models Used

  • Gemini 3 Flash Preview: AI Art Director that refines prompts with streaming thought chains
  • Gemini 2.5 Flash Image: image generation from refined prompts
  • Veo 3.1 Fast: video generation from images with motion
  • Gemini 2.5 Flash Native Audio: real-time bidirectional voice conversations (Live API)
  • Gemini 2.0 Flash: social copy generation and image analysis

Accessibility Features

  • 8 hand gesture controls via camera
  • 15+ voice commands in 7 languages
  • Full keyboard shortcut support
  • Audio announcements for screen readers
  • Haptic feedback on mobile
  • Button text labels toggle

How we built it

Frontend-only architecture — no backend server needed:

  • React + TypeScript + Vite for the UI
  • @google/genai SDK for ALL AI interactions (no other AI providers)
  • MediaPipe Holistic (WASM) for real-time hand & face tracking in the browser
  • Web Speech API for voice input
  • Tailwind CSS for responsive design across mobile/tablet/desktop
  • Firebase Auth for user authentication
  • Web Audio API for generative ambient music during loading

The entire AI pipeline runs client-side, calling the Gemini API directly with no middleware.

Challenges we ran into

  1. Gesture reliability — Hand gesture detection needed careful tuning of thresholds and a persistence filter (3 consecutive frames) to avoid false positives
  2. Fist→OK transition — When switching from recording (fist) to generating (OK), we had to implement stopAndCapture() to ensure the voice transcript was fully delivered before triggering generation
  3. API overload handling — Gemini 503/UNAVAILABLE errors during streaming required wrapping the entire stream read (not just connection) in retry logic with exponential backoff
  4. Accessible UX — Balancing a visually rich interface with true accessibility required aria-labels on every button, keyboard shortcuts for all actions, and i18n across 7 languages
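
Challenge 3 can be sketched generically. The key point is that a 503 can arrive mid-stream, so the retry must wrap the entire read loop, not just the call that opens the stream. The function below is a simplified, hypothetical illustration that works over any async-iterable producer (e.g. a streaming Gemini call), not Muse's actual implementation:

```typescript
// Hypothetical sketch: retry with exponential backoff wrapping the whole
// stream read, since 503/UNAVAILABLE errors can occur after the first chunk.
async function readStreamWithRetry<T>(
  openStream: () => AsyncIterable<T>, // e.g. opens a streaming API call
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T[]> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const chunks: T[] = []; // discard partial output from a failed attempt
    try {
      for await (const chunk of openStream()) chunks.push(chunk);
      return chunks; // full stream consumed: success
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}
```

Discarding the partial chunks and reopening the stream on each attempt keeps the caller from ever seeing a half-delivered response.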

Accomplishments that we're proud of

  • Zero-keyboard creative workflow: Users can go from idea to AI-generated art to video using only hand gestures and voice
  • 6 Gemini models integrated into a single coherent experience
  • Real-time AR effects on the camera feed during generation (constellation particles, energy rings, pulsing vignette)
  • Generative ambient music using Web Audio API synthesis — evolving chord pads, random melodic notes, and filtered noise textures that change per generation stage

What we learned

  • MediaPipe's WASM-based hand tracking is remarkably accurate but requires careful frame-rate management to avoid UI jank
  • The Gemini Live API (bidiGenerateContent) requires dated model variants — non-dated aliases don't work
  • React state updaters inside async chains can cause timing bugs — using refs (studioStateRef) for cross-async-boundary state reads solved this
  • Web Audio API can create surprisingly musical ambient soundscapes with just oscillators, filters, and delay nodes
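
The stale-closure lesson can be shown without React. In the framework-free sketch below (hypothetical names, only loosely modeled on studioStateRef), a value captured before an await is frozen at call time, while a mutable ref object yields the latest value when the async step resumes:

```typescript
// Simplified sketch of the stale-closure problem: reading state through a
// mutable ref survives an async boundary; a captured snapshot does not.
type Ref<T> = { current: T };

async function captureVsRef(
  snapshot: string, // value captured when the async chain started
  ref: Ref<string>, // mutable ref read again after the await
): Promise<[string, string]> {
  // Simulate an in-flight API call during which state changes elsewhere.
  await new Promise((r) => setTimeout(r, 10));
  return [snapshot, ref.current]; // snapshot is stale; ref is current
}
```

This mirrors why a React state value closed over before an await can be outdated by the time generation finishes, while a ref read at that moment reflects the update.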

What's next for Muse

  • 3D Model Generation and Genie 3 World Generation (UI ready, awaiting API access)
  • Personalized AI Memory for Pro users — Muse remembers your creative style
  • Cloud deployment with ephemeral tokens for secure client-side Gemini access
  • Community gallery for sharing creations

Built With

  • firebase-auth
  • gemini-2.5-flash-image
  • gemini-3-flash
  • gemini-live-api
  • google-genai-sdk
  • mediapipe
  • react
  • tailwind-css
  • typescript
  • veo-3.1
  • vite
  • web-audio-api
  • web-speech-api