Inspiration
Simon Says is a game everyone already loves. I wanted to see if browser-native ML inference was fast enough to make a reaction-based game feel genuinely responsive, with no plugins. The challenge was getting real-time hand tracking quick and reliable enough that a wrong move feels like your fault, not the model's.
What I Learned
MediaPipe Hands outputs 21 keypoints per hand as normalized (x, y) coordinates, one for each joint and fingertip, joined by a defined skeleton of 21 connections. Gesture classification is entirely geometric: there is no second ML model interpreting the pose, just distance comparisons between specific joints. A finger is considered extended when the distance from the wrist to the fingertip exceeds 1.2x the distance from the wrist to the knuckle. The 1.2 multiplier was chosen through testing: values below 1.1 produced too many false positives on slightly curled fingers, while values above 1.3 started missing extended fingers at awkward angles. Each of the four non-thumb fingers uses this same check independently, so a gesture like Peace requires index and middle to pass while ring and pinky must fail. The thumb is a special case because it abducts sideways rather than curling forward, so it uses a separate heuristic: the thumb tip must be further from the index finger's base joint than the thumb's own base is, with an additional vertical check to distinguish Thumbs Up from a generic extended thumb.
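The extension check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `Point`, `FINGERS`, `isExtended`, and `isPeace` are hypothetical, the landmark indices follow the standard MediaPipe Hands numbering (0 is the wrist), and "knuckle" is assumed to mean each finger's base (MCP) joint. The thumb heuristic is left out, so `isPeace` here ignores the thumb entirely.

```typescript
// A point in normalized image coordinates.
interface Point { x: number; y: number; }

type Finger = "index" | "middle" | "ring" | "pinky";

const WRIST = 0;
// [knuckle, fingertip] landmark indices in the MediaPipe Hands layout.
// Assumption: "knuckle" is the finger's base (MCP) joint.
const FINGERS: Record<Finger, [number, number]> = {
  index: [5, 8],
  middle: [9, 12],
  ring: [13, 16],
  pinky: [17, 20],
};
const EXTENSION_RATIO = 1.2; // multiplier quoted in the write-up

function dist(a: Point, b: Point): number {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Extended when wrist-to-tip exceeds 1.2x wrist-to-knuckle.
function isExtended(landmarks: Point[], finger: Finger): boolean {
  const [knuckle, tip] = FINGERS[finger];
  const wrist = landmarks[WRIST];
  return dist(wrist, landmarks[tip]) > EXTENSION_RATIO * dist(wrist, landmarks[knuckle]);
}

// Peace: index and middle pass the check, ring and pinky fail it.
function isPeace(landmarks: Point[]): boolean {
  return (
    isExtended(landmarks, "index") &&
    isExtended(landmarks, "middle") &&
    !isExtended(landmarks, "ring") &&
    !isExtended(landmarks, "pinky")
  );
}
```

Because both distances are measured from the wrist, the ratio does not depend on how far the hand is from the camera.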
How I Built It
TensorFlow.js running on the WebGL backend loads and executes the MediaPipe Hands model directly in the browser, keeping all processing on-device. getUserMedia captures the webcam stream into a video element, and a canvas overlay sits on top at native resolution to render the detected skeleton and joint points each frame. Gesture candidates are evaluated in a specific priority order on every frame: more geometrically distinctive signs like OK Sign are checked before broader ones like Open Palm, because Open Palm's conditions are a superset of several other gestures and would swallow them if checked first. The OK Sign check additionally uses a pinch distance ratio: the distance between the thumb tip and index tip must be less than around 50% of the wrist-to-index-base distance, ensuring the finger pinch is tight enough to be intentional. The game loop runs entirely on requestAnimationFrame, tracking round deadlines, hold durations, time bar progress, and trap state on every tick. Each round a new gesture is picked at random, with a guarantee that it is never the same as the previous round's, so that repetition does not feel like a pattern.
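One way to picture the priority ordering and the pinch ratio is the sketch below. It is illustrative only: `RULES`, `classify`, `isPinching`, and `FingerFlags` are hypothetical names, the per-finger flags are assumed to come from an extension check like the one described earlier, and the extra OK Sign condition (middle, ring, and pinky extended) is my assumption rather than something the write-up specifies. Only three gestures are shown.

```typescript
interface Point { x: number; y: number; }

// Per-finger "extended" flags, assumed to be computed elsewhere.
interface FingerFlags { index: boolean; middle: boolean; ring: boolean; pinky: boolean; }

type Gesture = "OK Sign" | "Peace" | "Open Palm";

const PINCH_RATIO = 0.5; // "around 50%" from the write-up

function dist(a: Point, b: Point): number {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Thumb tip (4) to index tip (8), relative to wrist (0) to index base (5).
function isPinching(lm: Point[]): boolean {
  return dist(lm[4], lm[8]) < PINCH_RATIO * dist(lm[0], lm[5]);
}

interface Rule {
  name: Gesture;
  matches: (lm: Point[], f: FingerFlags) => boolean;
}

// Order matters: distinctive gestures first, broad ones last.
const RULES: Rule[] = [
  // Assumption: OK Sign also needs the other three fingers extended.
  { name: "OK Sign", matches: (lm, f) => isPinching(lm) && f.middle && f.ring && f.pinky },
  { name: "Peace", matches: (_lm, f) => f.index && f.middle && !f.ring && !f.pinky },
  { name: "Open Palm", matches: (_lm, f) => f.index && f.middle && f.ring && f.pinky },
];

// Returns the first rule that matches, or null when nothing does.
function classify(lm: Point[], f: FingerFlags): Gesture | null {
  for (const rule of RULES) {
    if (rule.matches(lm, f)) return rule.name;
  }
  return null;
}
```

If the index finger still registers as extended while pinching, the Open Palm rule would also match that frame, which is why OK Sign has to be tested first in this arrangement.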
Challenges
Inference latency was the first problem. I used a busy flag to ensure only one inference runs at a time. The second challenge was that the model occasionally misclassifies a single frame, which would cause unfair eliminations if taken at face value. To fix this, a gesture must be held continuously for between 420 and 650 milliseconds before it registers. The hold timer resets to zero the moment the detected gesture changes, meaning a flicker to the wrong gesture mid-hold forces the player to start again. The window starts at 650ms. Anything above 700ms felt sluggish, and anything below 400ms let jitter through too often. Trap rounds, where Simon didn't say, use a tighter fixed threshold. This is short enough that accidentally drifting into the forbidden gesture ends the game quickly, but long enough that a single bad frame does not unfairly eliminate the player.
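The hold-and-reset behaviour might look something like this sketch. `HoldTracker` and its method names are hypothetical, and timestamps are passed in explicitly (for example the value requestAnimationFrame hands to its callback) so the logic can be exercised without a browser. How the required duration moves between 420 and 650ms is not modelled; the caller simply supplies one value.

```typescript
// Confirms a gesture only after it has been detected continuously
// for `requiredMs`. Any change in the detected gesture restarts the timer.
class HoldTracker<G> {
  private requiredMs: number;
  private current: G | null = null;
  private heldSince = 0;

  constructor(requiredMs: number) {
    this.requiredMs = requiredMs;
  }

  // Call once per frame. Returns the gesture once the hold is complete,
  // otherwise null.
  update(detected: G | null, nowMs: number): G | null {
    if (detected !== this.current) {
      // Gesture changed (or hand lost): reset the hold timer.
      this.current = detected;
      this.heldSince = nowMs;
      return null;
    }
    if (detected !== null && nowMs - this.heldSince >= this.requiredMs) {
      return detected;
    }
    return null;
  }
}

// Usage: a one-frame flicker at t=400 forces the hold to start over.
const tracker = new HoldTracker<string>(650);
tracker.update("Peace", 0);    // hold starts
tracker.update("Fist", 400);   // flicker: timer resets
tracker.update("Peace", 500);  // hold restarts from 500
```

Keeping the tracker independent of the clock also makes it straightforward to give trap rounds their own instance with a different threshold.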