Angry Signing Llamas is a real-time American Sign Language (ASL) learning platform that uses computer vision and machine learning to actively teach users how to sign. Working directly through a webcam in the browser, it uses pre-trained models to verify that each sign is performed correctly. We built it because existing ASL resources rely on passive learning: flashcard apps show an image and assume you’ll replicate it correctly, while video courses have no way to confirm whether you’re practicing accurately. There’s no true “Duolingo for sign language,” and we believe there should be. Over 500,000 people in the United States use ASL as their primary language, yet most hearing individuals don’t know even a single sign. Our goal was to lower that barrier by making ASL practice interactive, gamified, and accessible to anyone with a laptop.

To achieve this, our app features three core modes designed to make learning engaging and easy. Tutorial mode guides users through 35 signs (all 26 letters of the alphabet plus 9 essential words such as HELLO, GOODBYE, I LOVE YOU, and THANK YOU), displaying a reference image or video and automatically detecting when the user performs the sign correctly before advancing. Challenge mode transforms practice into a fast-paced game where letters or words scroll across the screen over the live webcam feed, and users must sign each one before it passes. There are three difficulty levels: Easy (letters only), Hard (letters and words), and Unlimited (an infinite scoring mode with a leaderboard), along with adjustable speed controls. A stats tab tracks which signs users struggle with most, allowing them to target weak areas and improve more efficiently. Although only 9 words are implemented so far, ASL has signs for a vast range of English words, and over time we would like to grow that vocabulary for future users of Angry Signing Llamas.

One of the biggest technical challenges was supporting two fundamentally different types of signs within a unified detection system. Static letter signs such as A, B, and C are single hand poses, meaning they can be classified from a single frame. However, word signs like HELLO and GOODBYE involve motion over time: HELLO resembles a salute-like wave, and GOODBYE involves repeated finger folding. A simple image classifier cannot distinguish these because it lacks temporal awareness. To solve this, we trained two separate models. A static classification model handles letter signs using hand landmark data from a single frame, while a Long Short-Term Memory (LSTM) sequence model processes 30 consecutive frames of landmark data to capture motion patterns for word signs. Both models operate on top of MediaPipe Hands, which extracts 21 hand keypoints per frame and streams them to our backend via WebSockets.
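To give a rough sense of how the two models fit together, here is a minimal sketch assuming a TensorFlow/Keras backend; the helper names (`extract_landmarks`, `build_letter_model`, `build_word_model`), layer sizes, and shapes are illustrative assumptions, not our exact training code.

```python
# Sketch of the two-model setup: a per-frame letter classifier and an
# LSTM word-sign classifier, both fed by MediaPipe Hands landmarks.
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

mp_hands = mp.solutions.hands

def extract_landmarks(frame_bgr, hands):
    """Return a flat (63,) vector of 21 (x, y, z) hand keypoints, or None."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lm = results.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm]).flatten()

def build_letter_model():
    # Static signs: a single frame of landmarks -> one of 26 letters.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(63,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(26, activation="softmax"),
    ])

def build_word_model():
    # Motion signs: a 30-frame landmark sequence -> one of 9 words.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(30, 63)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(9, activation="softmax"),
    ])
```

In practice the letter model can fire on every frame, while the word model only produces a prediction once a rolling 30-frame buffer of landmarks has filled.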

Real-time performance was another major challenge. Running hand detection and model inference on every webcam frame initially caused noticeable lag, with the tracking overlay falling behind the live video. The issue stemmed from the frontend sending frames faster than the backend could process them, causing a growing queue and increasing delay. We resolved this by implementing flow control so the client only sends a new frame after receiving a response for the previous one, eliminating backlog and restoring responsiveness. We also fine-tuned prediction smoothing to balance stability and responsiveness. The result is a functional, interactive prototype that proves real-time ASL learning with accurate feedback is not only possible but immediately accessible in any modern web browser.
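The actual client runs in the browser, but the send-one, wait-one pattern can be sketched in Python with the `websockets` package; the endpoint URL, JSON message format, and the `get_next_frame_b64` callback below are assumptions for illustration only.

```python
# Minimal sketch of per-frame flow control with prediction smoothing.
# The real client is browser JavaScript; this shows the same pattern.
import asyncio
import json
from collections import Counter, deque

import websockets

async def stream_frames(get_next_frame_b64, uri="ws://localhost:8000/ws"):
    recent = deque(maxlen=5)  # smooth predictions over the last few frames
    async with websockets.connect(uri) as ws:
        while True:
            frame = get_next_frame_b64()          # grab the latest webcam frame
            await ws.send(json.dumps({"frame": frame}))
            reply = json.loads(await ws.recv())   # wait before sending the next
            recent.append(reply["prediction"])
            # Majority vote keeps the on-screen label from flickering.
            stable = Counter(recent).most_common(1)[0][0]
            print("prediction:", stable)
```

Because a new frame is only sent after the previous response arrives, the backlog can never grow, and latency stays bounded by a single round trip.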
