We live in a world where verbal communication is the norm, and this means the deaf and hard of hearing community can struggle to have a voice. This can make everything in life a challenge, from social gatherings to employment to health services.
As a developer at UCF, I keep accessibility constantly top of mind. I often ask myself whether I can use what I am learning in my degree to improve accessibility for our students and beyond. While learning computer vision this semester, I posed the question: can I use CV to improve accessibility for the deaf and hard of hearing community by training object detection models to recognize American Sign Language?
Think about simple everyday tasks that you complete with ease: ordering food at your favorite restaurant, withdrawing money from a bank teller, or simply talking with friends and family. Even if we start with just the alphabet, we can take a huge step toward overcoming these challenges for this community.
What It Does
This machine learning model uses object detection to read American Sign Language in real time, currently using your computer's front webcam.
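As a rough illustration of what a real-time webcam loop could look like, here is a minimal sketch. It assumes the trained weights are loaded through the YOLOv5 PyTorch Hub interface and that `opencv-python` and `torch` are installed; the helper `best_letter`, the function `run_webcam`, the 0.5 confidence threshold, and the weights path are hypothetical choices for this example, not details taken from the project.

```python
def best_letter(detections, min_conf=0.5):
    """Return the label of the most confident detection, or None.

    `detections` is a list of (label, confidence) pairs for one frame.
    The 0.5 default threshold is an assumption for this sketch.
    """
    confident = [d for d in detections if d[1] >= min_conf]
    if not confident:
        return None
    return max(confident, key=lambda d: d[1])[0]


def run_webcam(weights_path, camera_index=0, min_conf=0.5):
    """Show live predictions until 'q' is pressed (hypothetical helper)."""
    # Imported lazily so best_letter() can be used without these packages.
    import cv2
    import torch

    # Load custom-trained weights via the YOLOv5 PyTorch Hub entry point.
    model = torch.hub.load("ultralytics/yolov5", "custom", path=weights_path)
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # OpenCV delivers BGR frames; the model expects RGB.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = model(rgb)
            # Each row of results.xyxy[0] is [x1, y1, x2, y2, confidence, class_id].
            detections = [
                (results.names[int(cls)], float(conf))
                for *_, conf, cls in results.xyxy[0].tolist()
            ]
            letter = best_letter(detections, min_conf)
            if letter is not None:
                cv2.putText(frame, letter, (10, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
            cv2.imshow("ASL", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```

Calling `run_webcam("best.pt")` (with a placeholder weights file name) would open the default camera. Keeping only the single most confident detection per frame is a simplification that suits one signing hand in view.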
How I Built It
For the sake of speed in a hackathon, I chose an existing labeled dataset of ASL hand gestures containing 720 images. Because this is very small for training an ML model, I used data augmentation to increase the size of the set. Augmentation applies transformations such as horizontal flips, slight rotations, blur, and brightness changes to each image in the training set, which not only increases the size of the set but also helps the model generalize better.
Again due to time constraints, I did not want to have an overly large training set which would take too long to train. I used 3 augmentations for each image in the training set for a total of ~1700 images. Each image was resized to 416x416.
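The augmentation idea can be sketched in plain Python. This is only an illustration under stated assumptions: images are represented as nested lists of grayscale pixel values, labels are assumed to be in YOLO format (class, x_center, y_center, width, height, all normalized to 0..1), and the function names, probabilities, and brightness range are made up for the example. Rotation and blur are left out to keep it short. One point worth showing is that a horizontal flip must also mirror the bounding boxes.

```python
import random


def hflip(image, boxes):
    """Flip an image (list of pixel rows) left-to-right and mirror its boxes."""
    flipped = [row[::-1] for row in image]
    # In normalized YOLO coordinates only the x center changes: x -> 1 - x.
    new_boxes = [(cls, 1.0 - xc, yc, w, h) for (cls, xc, yc, w, h) in boxes]
    return flipped, new_boxes


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def augment(image, boxes, n=3, seed=0):
    """Produce `n` randomly augmented (image, boxes) variants of one sample."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        img, bxs = image, boxes
        if rng.random() < 0.5:          # flip about half of the variants
            img, bxs = hflip(img, bxs)
        img = adjust_brightness(img, rng.uniform(0.8, 1.2))
        variants.append((img, bxs))
    return variants
```

With `n=3`, each training image yields three variants, matching the three augmentations per image described above.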
To train the ASL model, I used YOLOv5 (You Only Look Once) by Ultralytics, an open source deep learning model designed for fast object detection, running in a Jupyter Notebook on Google Colab. While it is reported to be slightly less accurate than YOLOv4, it is generally faster and more lightweight. I split the data 70-20-10 into training, validation, and test sets, respectively.
I trained for 124 epochs with a batch size of 8; the results are pictured in the slideshow above.
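To make the split and the resulting dataset size concrete, here is a small sketch. The function name and the fixed seed are hypothetical, and the final count assumes that each training image is replaced by three augmented outputs while validation and test images are left untouched; under that assumption the arithmetic lands close to the ~1700 figure.

```python
import random


def split_dataset(items, fractions=(0.7, 0.2, 0.1), seed=42):
    """Shuffle `items` and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * fractions[0])
    n_val = round(len(items) * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


# 720 stands in for the image file names of the original dataset.
train, val, test = split_dataset(range(720))

# Assumption: three augmented outputs per training image, none for val/test.
total_images = len(train) * 3 + len(val) + len(test)
print(len(train), len(val), len(test), total_images)
```

This gives 504 training, 144 validation, and 72 test images before augmentation, and 1728 images in total afterwards.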
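For reference, a training run with these settings could be launched from the YOLOv5 repository roughly as follows. This sketch only builds the command line for the repository's `train.py` script; the dataset path `asl/data.yaml` and the choice of the small `yolov5s.pt` starting weights are assumptions, since the writeup does not say which model size or file layout was used.

```python
def yolov5_train_args(data_yaml, weights="yolov5s.pt",
                      img_size=416, batch_size=8, epochs=124):
    """Build the argument list for YOLOv5's train.py (hypothetical helper)."""
    return [
        "python", "train.py",
        "--img", str(img_size),          # images were resized to 416x416
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--data", data_yaml,             # dataset config: image paths and class names
        "--weights", weights,            # assumed pretrained starting checkpoint
    ]


# Placeholder dataset config path, for illustration only.
args = yolov5_train_args("asl/data.yaml")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, check=True)` from inside a cloned YOLOv5 checkout, or the printed line can be pasted into a Colab cell prefixed with `!`.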
Challenges I Ran Into
- If you know some letters of the ASL alphabet, you'll notice in the video that my model confused similar-looking letters such as A, S, T, M, and N, which all use a closed fist with the thumb in various positions.
- In testing, I noticed that environments with more complex backgrounds lowered the model's confidence and led to more misclassified letters.
- Google Colab imposes an undisclosed limit on the GPU resources I was able to use, so I could not train my model for as many epochs as I would have liked.
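One way to quantify the first challenge is to count which letters get mistaken for which. The sketch below is an assumption about how that could be done, not something from the project: the function name is hypothetical and the sample labels are made up purely to show the call.

```python
from collections import Counter


def confusion_pairs(true_labels, predicted_labels):
    """Count (true, predicted) pairs where the prediction was wrong."""
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )


# Made-up labels for illustration only; these are not results from the model.
true_labels = ["A", "S", "T", "M", "N", "A"]
predicted = ["A", "A", "S", "N", "N", "S"]

pairs = confusion_pairs(true_labels, predicted)
for (truth, guess), count in pairs.items():
    print(f"{truth} predicted as {guess}: {count}")
```

Run over a labeled test set, a tally like this would show whether errors really concentrate in the closed-fist letters and which pairs need more training examples.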
The model works as intended, although some similar-looking letters get misclassified. For the short span of a hackathon, though, the results are something to be proud of!
What's Next For Reading American Sign Language with Object Detection
- I have already begun gathering original images from friends and family to increase the size and diversity of the dataset. I will also try combining other existing open source datasets to do the same.
- I think bringing accessibility to the deaf community means eventually porting the model to mobile devices so that it can be used anywhere. Because YOLOv5 is designed to be lightweight, this should be an achievable goal.
- More training time, larger datasets, and fine-tuned hyperparameters should help address the challenges listed above.