Inspiration
Building an ASL hand-gesture recognizer is inspiring because it brings together computer vision and machine learning to bridge communication gaps, making everyday conversations, education, and digital services more accessible for Deaf and hard-of-hearing individuals. It challenges us to model the rich dynamics of hand shapes, movements, and facial expressions in real-world settings. And it creates opportunities for interactive learning tools, real-time translation services, and broader assistive-tech innovations that foster inclusion and empowerment, ultimately transforming how people connect and ensuring that technology truly serves everyone.
What it does
In this project, we developed a computer vision system that recognizes the hand gestures corresponding to the letters of the American Sign Language (ASL) alphabet. We used the Mediapipe framework to detect hand landmarks and trained a machine learning model to classify the gestures. The system achieved 80.5% accuracy on a held-out test set of 2,000 hand-gesture images.
How we built it
We used the Mediapipe framework to detect hand landmarks in real-time video streams. The framework provides a pre-built hand-detection model that locates the hand in each video frame, and its hand-landmarks module extracts 21 key points on the hand, such as the fingertips, knuckles, and wrist. We collected a dataset of roughly 5,000 hand-gesture images covering the 26 letters of the ASL alphabet (about 200 images per letter), then randomly split it into training, validation, and test sets, holding out 2,000 images for testing. We trained a deep neural network classifier with a sequential model on the training set and evaluated the system on the held-out test images.
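The preprocessing step can be sketched roughly as follows. The helper name `landmarks_to_features` and the wrist-relative normalization are our illustrative choices, not necessarily the project's exact preprocessing; with Mediapipe installed, the input would come from the hands solution rather than the dummy data shown here.

```python
import numpy as np

def landmarks_to_features(landmarks):
    """Flatten 21 (x, y, z) hand landmarks into a 63-dim feature vector,
    translated so the wrist (landmark 0) sits at the origin. This makes
    the features invariant to where the hand appears in the frame."""
    pts = np.asarray(landmarks, dtype=np.float32)  # shape (21, 3)
    pts = pts - pts[0]                             # wrist-relative coords
    return pts.flatten()                           # shape (63,)

# With Mediapipe, `landmarks` would be built from
# hands.process(rgb_frame).multi_hand_landmarks[0].landmark as
# [(lm.x, lm.y, lm.z) for lm in ...]; dummy values are used here.
dummy = [(0.5, 0.5, 0.0)] * 21
features = landmarks_to_features(dummy)
print(features.shape)  # (63,)
```

One vector like this per frame is what the classifier consumes, which keeps the model small compared with training on raw pixels.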
Challenges we ran into
Training the model end-to-end was more time-consuming than expected: feeding thousands of 21-landmark feature vectors through multiple dense layers required careful tuning of batch size and learning rate to avoid GPU memory exhaustion or stalled convergence. Collecting a balanced dataset across all 26 letters was also difficult—some signs look visually similar (e.g., M vs. N), so the classifier struggled to distinguish them without extensive data augmentation (rotations, brightness shifts) to simulate real-world variation in lighting, hand size, and skin tone. Integrating Mediapipe’s landmark detector in a real-time pipeline introduced latency spikes when the camera frame rate dropped, forcing us to optimize preprocessing steps and prune our network for faster inference. Finally, ensuring robust performance across different backgrounds and camera positions required iterative testing and failure-case debugging to refine both our detection thresholds and model architecture.
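A minimal sketch of what the sequential classifier could look like in Keras. The layer sizes, dropout rate, batch size, and learning rate below are placeholder values for illustration, not the tuned hyperparameters from the project.

```python
import tensorflow as tf

NUM_CLASSES = 26   # one class per ASL alphabet letter, A-Z
FEATURE_DIM = 63   # 21 landmarks x (x, y, z)

# Small dense network over landmark features; sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(FEATURE_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Training call, with batch size as one of the knobs we had to tune:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=32, epochs=50)
```

Because the inputs are 63-dimensional landmark vectors rather than images, a network this small trains quickly, which made it feasible to iterate on batch size and learning rate when convergence stalled.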
Accomplishments that we're proud of
High recognition accuracy: Achieved an 80.5% accuracy on a held-out test set of 2,000 ASL alphabet images, demonstrating reliable letter-level classification.
Balanced, robust dataset: Curated and labeled 5,000 hand-gesture images (200 per letter), then applied systematic data augmentation (rotations, lighting shifts, scaling) to improve generalization across diverse users and environments.
Real-time performance: Integrated Mediapipe’s lightweight hand-landmark detector with our DNN pipeline to sustain frame rates above 20 FPS on a mid-range GPU, enabling smooth, interactive feedback.
End-to-end pipeline: Built a seamless workflow from webcam capture → landmark extraction → feature preprocessing → classification → on-screen display, all within a single Python application.
Modular, extensible codebase: Architected the system so that new gestures (e.g., numbers, basic words) can be added by simply collecting additional landmark datasets and retraining the classifier.
Cross-platform usability: Validated performance on Windows, macOS, and Linux, ensuring accessibility for developers and end users regardless of their OS choice.
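The capture → landmark extraction → classification → display pipeline described above can be sketched as follows. The structure and names (`index_to_letter`, the window title, the overlaid placeholder letter) are illustrative, and the call into the trained classifier is elided with a comment.

```python
import string

LETTERS = string.ascii_uppercase  # class index -> ASL letter label

def index_to_letter(class_index: int) -> str:
    """Map the classifier's argmax output to its ASL letter."""
    return LETTERS[class_index]

def main():
    # cv2 and mediapipe are assumed to be installed.
    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(max_num_hands=1)
    cap = cv2.VideoCapture(0)  # default webcam
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB; OpenCV captures BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = hands.process(rgb)
        if result.multi_hand_landmarks:
            lms = result.multi_hand_landmarks[0].landmark
            features = [(lm.x, lm.y, lm.z) for lm in lms]
            # Flatten `features` and run the trained classifier here,
            # then map its argmax through index_to_letter(); "A" is a
            # placeholder for the predicted letter.
            cv2.putText(frame, "A", (30, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
        cv2.imshow("ASL recognizer", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()
```

Keeping each stage behind its own function boundary is what made the pipeline easy to swap pieces in and out of, as noted above.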
What we learned
We discovered that robust ASL recognition hinges on diverse, well-augmented data—balancing hand sizes, skin tones, and lighting conditions and applying rotations and brightness shifts greatly improved generalization—while careful hyperparameter tuning of batch size, learning rate, and network depth was essential to maximize accuracy without exhausting GPU memory. Integrating Mediapipe for landmark detection taught us to balance out-of-the-box accuracy with the need to minimize preprocessing latency and prune unused components for smooth, real-time inference. Structuring our pipeline into modular stages—from capture and landmark extraction through preprocessing, classification, and the UI—made it easy to iterate on individual blocks without overhauling the entire system. Field testing with real users and varied backgrounds revealed edge cases like occlusions and rapid motion that offline tests missed, underscoring the value of iterative, user-centric evaluation. Finally, grappling with dataset biases—such as under-representation of certain skin tones—reminded us that ethical data collection is as critical as technical performance, guiding our plans to broaden and diversify our dataset moving forward.
What's next for ASL Gesture Recognition
Looking ahead, we plan to extend our recognizer beyond the alphabet to include numbers, common words, and short phrases by integrating temporal models like LSTMs or Transformers to capture the fluid motion of signs. We’ll optimize and prune our network for efficient on-device inference on smartphones and embedded hardware, while building a user-friendly app with real-time feedback, progress tracking, and gamified quizzes. To ensure fairness and robustness, we’ll launch a community-driven data-collection campaign to diversify skin tones, hand shapes, and signing styles, and conduct user studies in partnership with the Deaf community to refine our system based on real-world feedback. We’ll also explore multi-modal sensing—such as depth cameras and wearable IMUs—to handle challenging lighting and occlusions, and open-source our codebase and model checkpoints to foster collaboration, drive research, and accelerate innovation in assistive technologies.