Inspiration

Content creation has become increasingly demanding. Creators, professionals and students need to produce videos quickly while managing complex editing software. We asked: what if editing a video was as natural as talking and waving your hands?

What it does

WaveformStudio is a fully hands-free video editor designed for creating short-form vertical and long-form videos. Users can:

  • Search for media with their voice: Say "Hey Waveform Studio, find videos of beach sunsets" and AI extracts keywords, searches Pexels, and returns matching clips instantly
  • Edit with hand gestures: Make a ✌️ peace sign to select a clip, swipe to reorder, show a fist to pause, open palm to play
  • Use natural language commands: "Splice video 2 at 3 seconds", "Move image 1 to the right", "Delete the selected clip"...
  • Auto-generate metadata: Gemini AI analyzes your video and generates optimized titles and descriptions for social media
  • Export production-ready videos: Render MP4 files in 1080p, 720p, or 480p directly in the browser.

The entire editing workflow can be completed without touching a keyboard or mouse.
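
The voice-search step can be sketched as pure functions. Note the keyword extraction here is a simple stopword filter standing in for the real GPT-3.5-turbo call, and the stopword list is illustrative; only the Pexels endpoint shape is taken from the actual API.

```javascript
// Illustrative stopword list -- the real app uses an LLM for extraction.
const STOPWORDS = new Set([
  'hey', 'waveform', 'studio', 'find', 'videos', 'of', 'me', 'some', 'a', 'the',
]);

function extractKeywords(utterance) {
  return utterance
    .toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter((w) => w && !STOPWORDS.has(w))
    .join(' ');
}

// Builds the Pexels video-search URL; the API key goes in an
// Authorization header when the request is actually fetched.
function pexelsSearchUrl(keywords, perPage = 12) {
  const params = new URLSearchParams({ query: keywords, per_page: String(perPage) });
  return `https://api.pexels.com/videos/search?${params}`;
}
```

So "Hey Waveform Studio, find videos of beach sunsets" reduces to the query `beach sunsets` before hitting the stock-media API.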

How we built it

  • Frontend: React + Vite
  • Voice Recognition: Web Speech API captures voice input, with a custom wake word detector ("Hey Waveform Studio") that opens an 8-second listening window. Commands are parsed using OpenAI GPT-3.5-turbo for natural language understanding.
  • Gesture Recognition: MediaPipe Hands tracks 21 hand landmarks via webcam. We built a custom gesture classifier that recognizes 8 distinct gestures (fist, open palm, peace sign, OK sign, point up, shaka, swipe left/right) with cooldown logic to prevent false triggers.
  • Media Search: Natural language queries go through OpenAI for keyword extraction, then hit the Pexels API for stock photos and videos. We also built a local music library with 23 curated tracks searchable by mood tags.
  • Video Rendering: FFmpeg runs entirely in the browser via a five-stage pipeline: it downloads the media, processes each clip (scaling, padding, converting images to video), concatenates everything, mixes in audio, and encodes the final output.
  • AI Metadata: Google Gemini's vision API analyzes video frames to generate contextually relevant titles and descriptions.
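
To make the gesture classifier concrete, here is a minimal sketch over MediaPipe's 21-landmark hand model (tip indices 8/12/16/20, PIP joints 6/10/14/18, y growing downward in image coordinates). It covers only three of the eight gestures, ignores the thumb, and the heuristics are our assumptions, not MediaPipe API:

```javascript
// Fingertip and PIP-joint indices from the MediaPipe Hands landmark spec
// (index, middle, ring, pinky).
const TIPS = [8, 12, 16, 20];
const PIPS = [6, 10, 14, 18];

function extendedFingers(landmarks) {
  // A finger counts as extended when its tip sits above its PIP joint
  // (smaller y = higher in the image).
  return TIPS.map((tip, i) => landmarks[tip].y < landmarks[PIPS[i]].y);
}

function classifyGesture(landmarks) {
  const [index, middle, ring, pinky] = extendedFingers(landmarks);
  const count = [index, middle, ring, pinky].filter(Boolean).length;
  if (count === 0) return 'fist';
  if (count === 4) return 'open_palm';
  if (index && middle && !ring && !pinky) return 'peace';
  return 'unknown';
}
```

Because the classifier is a pure function of the landmark array, it can be unit-tested with synthetic hands, no webcam required.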
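
The per-clip processing stage can also be sketched as argument builders for ffmpeg.wasm. The 1080x1920 vertical geometry and the exact filter string are illustrative assumptions, not the project's actual pipeline code:

```javascript
// Scale a clip to fit the target frame, then pad to exact size
// (centered), normalizing the pixel format for concatenation.
function scalePadArgs(input, output, w = 1080, h = 1920) {
  const vf =
    `scale=${w}:${h}:force_original_aspect_ratio=decrease,` +
    `pad=${w}:${h}:(ow-iw)/2:(oh-ih)/2`;
  return ['-i', input, '-vf', vf, '-pix_fmt', 'yuv420p', output];
}

// The concat demuxer takes a list file naming each processed clip.
function concatListFile(clips) {
  return clips.map((c) => `file '${c}'`).join('\n');
}
```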

Challenges we ran into

  • Gesture detection reliability: Early versions triggered false positives constantly. We solved this with a buffer for swipes, a hold delay for static gestures, and cooldowns between all gestures.
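
The hold-delay-plus-cooldown idea can be captured in a small gate function. Timestamps are injected so the logic is testable without a camera; the specific constants are illustrative, not the app's tuned values:

```javascript
const HOLD_MS = 300;      // gesture must be held this long before firing
const COOLDOWN_MS = 1000; // minimum gap between any two fired gestures

function createGestureGate() {
  let current = null;
  let heldSince = 0;
  let lastFired = -Infinity;
  return function update(gesture, now) {
    if (gesture !== current) {
      // New (or lost) gesture: restart the hold timer.
      current = gesture;
      heldSince = now;
      return null;
    }
    const held = now - heldSince >= HOLD_MS;
    const cooledDown = now - lastFired >= COOLDOWN_MS;
    if (gesture && held && cooledDown) {
      lastFired = now;
      return gesture; // fire once, then enter cooldown
    }
    return null;
  };
}
```

Each camera frame calls `update(detectedGesture, timestamp)`; anything it returns non-null is treated as an intentional command.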

  • FFmpeg in the browser: Getting FFmpeg WASM to work was painful. We ended up processing video entirely client-side and loading the binaries from a CDN.

  • Voice command parsing: Simple regex wasn't enough for natural language. We built a two-step system: local parsing for simple commands ("play", "pause") and GPT-3.5-turbo for complex ones ("move the second video to the right"). Added fallback logic when the API fails.
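
The two-step system described above can be sketched as follows. The local command table and the parsed-command shape are illustrative, and the LLM step is stubbed (the real app calls GPT-3.5-turbo):

```javascript
// Step 1: cheap local matching for simple commands -- no API round-trip.
const SIMPLE = {
  play: { action: 'play' },
  pause: { action: 'pause' },
  undo: { action: 'undo' },
};

function parseLocally(text) {
  const key = text.trim().toLowerCase();
  return SIMPLE[key] ?? null;
}

// Step 2: anything unmatched goes to the LLM, with a fallback when the
// API call fails (rate limit, network error, etc.).
async function parseCommand(text, llmParse) {
  const local = parseLocally(text);
  if (local) return local;
  try {
    return await llmParse(text);
  } catch {
    return { action: 'unknown' };
  }
}
```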

Accomplishments that we're proud of

  • It actually works hands-free
  • Real-time gesture feedback: the interface feels responsive
  • In-browser video rendering: No server uploads, no waiting for cloud processing. Your video renders locally using WebAssembly.
  • Natural language pre-processing: AI keyword extraction makes searches feel conversational.

What we learned

  • MediaPipe is awesome but finicky: it needs good lighting, and tuning the gestures took a lot of trial and error.
  • AI APIs need fallbacks: rate limits, network issues, and API errors all happen in practice.
  • Production use would need far more robust persistence and a real database.
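
One way to act on the "AI APIs need fallbacks" lesson is a generic retry-then-fallback wrapper. The function name, retry counts, and backoff schedule here are illustrative, not what the app ships:

```javascript
// Retry a flaky async call with exponential backoff; if every attempt
// fails, return a local fallback value instead of throwing.
async function withFallback(call, fallback, { retries = 2, baseDelayMs = 200 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt === retries) return fallback;
      // Exponential backoff between attempts: base, 2x base, 4x base, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Wrapping each OpenAI or Gemini call this way keeps the editor usable even when the network or the API is having a bad day.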

What's next for WaveformStudio

  • Text overlays and captions
  • Transition effects
  • More gestures
  • Cloud save/load
