Final Write-up/Reflection

Introduction

The general goal of this project was to improve accessibility to educational resources. ASL (American Sign Language) is the primary language used by the Deaf within the US. As a result, both English speakers and non-American deaf people sometimes wish to learn ASL. When learning a language, having consistent feedback is crucial to developing the skills necessary to become fluent (or at least functional). The purpose of this project is to provide ASL learners with a framework to verify their alphabet signs.

Methodology

We took a two-pronged approach to our project. Initially, we trained a naïve classifier based on the simple CNN architecture we created in class. This simple architecture performed relatively well (metrics in Results) but struggled to achieve translational invariance. To address this, we looked to alternative solutions involving object localization and detection, and settled on the YOLO algorithm to improve the performance of our model. “You Only Look Once” (YOLO) is a deep learning approach to object detection. The basic idea behind YOLO is to divide the image into an NxN grid and have each grid cell predict some number of bounding boxes, a class probability distribution, and the probability that an object exists in that cell. Unnecessary bounding boxes are then removed by thresholding these predictions through a process known as non-maximum suppression (see Figure). Although YOLO has been shown to have a lower mean average precision (mAP) than R-CNNs, it is much faster because it performs object localization and classification in a single forward pass. We chose YOLO because speed was important when detecting hands and classifying signs in real-time video. To evaluate the performance of our approaches, we used average accuracy.
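
As a concrete illustration of the non-maximum suppression step described above, here is a minimal sketch of thresholding low-confidence boxes and greedily pruning overlapping ones. The box format, score threshold, and IoU threshold are illustrative assumptions, not our exact implementation.

```python
import torch

def iou(box, boxes):
    """Intersection-over-union between one box and a set of boxes, all in (x1, y1, x2, y2) format."""
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Drop low-confidence boxes, then greedily keep the highest-scoring box
    and suppress any remaining box that overlaps it too much."""
    keep_mask = scores > score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort(descending=True)
    kept = []
    while order.numel() > 0:
        best = order[0]
        kept.append(boxes[best])
        rest = order[1:]
        if rest.numel() == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]   # keep only boxes that overlap the winner weakly
    return torch.stack(kept) if kept else boxes.new_zeros((0, 4))
```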

The YOLO architecture we implemented consists of 24 convolutional layers and 2 dense layers. Our naïve classifier is based on the CNN architecture we created in class; it includes 3 convolutional layers (each followed by batch normalization, max-pooling, and ReLU) followed by 3 linear layers.
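
For reference, below is a rough sketch of the naïve classifier described above: three conv/batch-norm/max-pool/ReLU blocks followed by three linear layers. The channel widths, kernel sizes, input resolution, and 26-class output are illustrative assumptions rather than our exact hyperparameters.

```python
import torch.nn as nn

class NaiveASLClassifier(nn.Module):
    def __init__(self, num_classes=26):
        super().__init__()

        def block(in_ch, out_ch):
            # One conv block: convolution -> batch norm -> max-pool -> ReLU
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.MaxPool2d(2),
                nn.ReLU(),
            )

        self.features = nn.Sequential(block(3, 16), block(16, 32), block(32, 64))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),   # assumes 64x64 inputs halved three times by pooling
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```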

Results

The two models we developed had different use cases and were evaluated as such. In testing against a real-time video feed supplied by a webcam, the YOLO model had decent success at object detection and was able to draw a bounding box around hands in frame. However, due to the model’s complex architecture, the processed video feed was slightly laggy, and individual hand-sign classification was quite inaccurate. On the other hand, the naïve classifier did a much better job with actual ASL classification, achieving 86% accuracy on the test set after only 10 epochs of training. This classifier required a predefined bounding box to classify within, but this simplification allowed its architecture to be significantly simpler than the one used for YOLO. As a result, both training and forward-pass times were much shorter, so the classifier could run on a live video feed with no lag. Given live video input, the naïve ASL classifier was able to discern most hand signs presented in the predefined bounding box on screen. However, it was somewhat sensitive to hand position and lighting, sometimes confusing similar-looking letters (e.g. ‘m’ and ‘n’). Ultimately, we found reasonable success in our original goal of classifying ASL hand signs, though more work is needed for automatic hand tracking and a more robust model in general.

Challenges

Finding a dataset diverse enough to train our model on was difficult. Many of the datasets we found online did not include images of different people’s hands, different backgrounds, and/or valid bounding-box parameters. Because of this, our model reached a high accuracy on testing data drawn from the same source as the uniform training data, but it could not accurately classify the ASL letters when we ran the model ourselves. To compensate for the lack of variety within each dataset, we combined datasets, as sketched below. Another challenge was translational invariance: our naïve classification model could not detect the hand within the frames of our video input, so we built an object detection model. However, the only dataset we could find that included bounding-box parameters for hand detection was relatively small, which made the object detection model very difficult to train.
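
A minimal sketch of the dataset-combination step, assuming PyTorch's ConcatDataset and ImageFolder-style directory layouts; the directory paths and image size are placeholders, not our actual data locations.

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

# Resize everything to a common size so images from differently sized source
# datasets can be batched together.
common_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

# Both source directories are assumed to use the same class folder names (a-z)
# so that ImageFolder assigns matching label indices.
dataset_a = datasets.ImageFolder("data/asl_alphabet", transform=common_transform)
dataset_b = datasets.ImageFolder("data/asl_extra", transform=common_transform)

# ConcatDataset simply chains the two, so one DataLoader draws samples from both sources.
combined = ConcatDataset([dataset_a, dataset_b])
loader = DataLoader(combined, batch_size=64, shuffle=True)
```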

Reflection

Ultimately, we felt our project was successful. Real-time feedback is very important when learning anything new. Our sign-language translator will provide immediate feedback for people learning ASL letters and ultimately improve accessibility to learning ASL. We reached our target goal of classifying ASL hand signs within a fixed region of a video frame in real time with relatively high accuracy (~86%). We made progress towards our stretch goal by adding object localization to our model using the YOLO algorithm, but our classification results fell short of our predetermined benchmarks.

Over time, a number of changes were made to our project. One pivot we had to make was finding new datasets for training. The original dataset we had chosen for this project did not have sufficient variation and was prone to overfitting. Another pivot we made was deciding to test object detection algorithms like YOLO because we found that our naïve classifier was not as translationally invariant as we hoped.

If we had more time, we would work on improving our YOLO object detection model by generating or acquiring more annotated data and modifying the architecture so that it achieves better classification accuracy. Future steps could also include expanding the model to ASL words and/or translating to other languages and sign languages. Because our model runs in real time, it could also be used in spaces where ASL translation is needed for ease of communication. Other considerations would be to train our model on datasets that include hands of different skin colors, hands in front of different backgrounds, and hands with correct bounding boxes. Acquiring a more heterogeneous dataset would reduce bias toward plain backgrounds and hands of lighter skin tones.

One of our biggest takeaways from this project was a more in-depth understanding of the complete data pipeline and workflow involved in designing, training, and deploying a deep learning model. We also learned how to iteratively improve deep learning models until benchmarks are met.

Check-in 2

Introduction

The general goal of this project is to improve accessibility to educational resources. ASL (American Sign Language) is the primary language used by the Deaf within the US. As a result, both English speakers and non-American deaf people sometimes wish to learn ASL. When learning a language, having consistent feedback is crucial to developing the skills necessary to become fluent (or at least functional). The purpose of this project is to provide ASL learners with a framework to verify their alphabet signs, in addition to hopefully providing some basic translation between different sign languages.

Challenges

In this project so far, we have encountered a number of challenges. One such challenge is training our model to automatically and consistently detect and track hands within the video frame, in order to give our model greater translational invariance. Another challenge has been preventing overfitting to our training data so that performance improves at inference time. Our current plan on this front is to acquire more varied training data and perhaps to implement regularization methods, as sketched below.
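
As a sketch of the kind of regularization we are considering, the snippet below adds dropout between linear layers and L2 weight decay in the optimizer. The layer sizes and rates are placeholders, not settings we have committed to.

```python
import torch.nn as nn
import torch.optim as optim

# Dropout randomly zeroes activations during training; weight decay penalizes large weights.
classifier_head = nn.Sequential(
    nn.Linear(4096, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 26),
)
optimizer = optim.Adam(classifier_head.parameters(), lr=1e-3, weight_decay=1e-4)
```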

Insights

We were able to get a classification model to run on the training data (images of different ASL letters) with an accuracy of >70% after 5 epochs and >90% after 25 epochs. There are a few letters that the model seems to have trouble differentiating each time it is trained (e.g. ‘k’ and ‘v’, ‘h’ and ‘g’, ‘v’ and ‘w’). Our model is performing well on our training and testing datasets, but we are working on improving its performance during inference time with real-time video.
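
One way to pin down exactly which letter pairs are being confused is to accumulate a confusion matrix over the test set. The sketch below assumes a trained `model` and a `test_loader`; it is illustrative rather than part of our current pipeline.

```python
import torch

num_classes = 26
confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)

model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        for t, p in zip(labels, preds):
            confusion[t, p] += 1

# Row t, column p counts how often true letter t was predicted as letter p;
# large off-diagonal entries reveal the confused pairs (e.g. k/v, h/g).
```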

Plan

Progress for our project is currently on track. The following tasks have been completed:

  • Preprocessing of the data.
  • A model for classifying the ASL letters has been trained (using the Kaggle training dataset) with >90% accuracy.
  • A model for classifying hand/not-hand has been trained (using the Kaggle training dataset and CIFAR-10 images for the non-hand class) with >90% accuracy.

Between now and the submission deadline, we plan to dedicate more time to implementing a model to determine the bounding box of a hand given an input frame of a video and improving the performance of classification in live video.

Our current plan for classifying video input of ASL letters is as follows. To determine where the hand is in the input frame, we plan to slide windows of three different sizes (⅛, ¼, and ½ the size of the input video frame) across the image and use the tile with the highest probability of containing a hand as the bounding box. Once the bounding box of the hand is found, we will run a CNN to classify the hand shape as a letter. If the confidence of the predicted letter is higher than a set threshold, we will return the ASL letter as text.
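
The following is a rough sketch of that sliding-window plan. It assumes a hand/not-hand classifier `hand_model` that takes 64x64 crops; the stride, input size, and model interfaces are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def find_hand_box(frame, hand_model, scales=(1/8, 1/4, 1/2), stride_frac=0.5):
    """Score tiles at three window sizes and return the tile most likely to contain a hand."""
    _, H, W = frame.shape                      # frame: (3, H, W) float tensor
    best_score, best_box = -1.0, None
    for scale in scales:
        win_h, win_w = int(H * scale), int(W * scale)
        step_h = max(1, int(win_h * stride_frac))
        step_w = max(1, int(win_w * stride_frac))
        for top in range(0, H - win_h + 1, step_h):
            for left in range(0, W - win_w + 1, step_w):
                tile = frame[:, top:top + win_h, left:left + win_w]
                tile = F.interpolate(tile.unsqueeze(0), size=(64, 64))      # resize to model input
                score = torch.softmax(hand_model(tile), dim=1)[0, 1].item()  # P(hand)
                if score > best_score:
                    best_score, best_box = score, (top, left, win_h, win_w)
    return best_box, best_score
```

The crop at the returned box would then be passed to the letter classifier, and the letter is emitted as text only when its softmax confidence exceeds the chosen threshold.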

Check-in 1

Introduction

The primary motivation behind this project is to improve accessibility for learning ASL (American Sign Language). ASL is the primary language used by the Deaf within the US. As a result, both English speakers and non-American deaf people sometimes wish to learn ASL. When learning a language, having consistent feedback is crucial to developing the skills necessary to become fluent (or at least functional). The purpose of this project is to provide ASL learners with a framework to verify their alphabet signs, in addition to hopefully providing some basic translation between different sign languages. As is common in image processing, we will mainly use CNNs to classify various images of signed letters, with the potential of using RNNs for real-time translation purposes. We are basing our work on a similar project that was implemented in TensorFlow. We will be expanding on that project's scope, using PyTorch, and using a different dataset.

Related Work

Here is a project using Keras and OpenCV to interpret a select number of ASL symbols placed in a bounding box. Our project will be similar to this, but we will implement ours in PyTorch instead of TensorFlow, and we will use a different dataset.

This project also implemented a similar model in Keras and OpenCV to interpret ASL.

Data

We have obtained labeled ASL image data from Kaggle. To augment our data, we will mirror images and change their orientation. Preprocessing will include loading in the images. Another potentially useful dataset is the Kaggle Sign Language MNIST.
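
A minimal sketch of the planned augmentation using torchvision transforms (mirroring and small orientation changes); the dataset path, image size, and rotation range are placeholders.

```python
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomHorizontalFlip(p=0.5),   # mirroring also yields left-hand examples
    transforms.RandomRotation(degrees=15),    # small orientation changes
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("data/asl_alphabet_train", transform=train_transform)
```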

Methodology

We will use a CNN architecture in combination with LSTMs. CNNs will perform classification for video frames while LSTMs will allow us to encode a notion of order in time for signs which require motion. We will train our model on the data mentioned in the Data section using the PyTorch framework. We anticipate that the most difficult part about implementing the model will be encoding the notion of time from video frames for classification.
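
As a rough sketch of this CNN + LSTM combination: a small CNN encodes each frame, and an LSTM aggregates the per-frame features over time for signs that involve motion. All layer sizes below are illustrative assumptions, not a finalized architecture.

```python
import torch
import torch.nn as nn

class SignSequenceModel(nn.Module):
    def __init__(self, num_classes=26, feat_dim=128, hidden_dim=64):
        super().__init__()
        # Per-frame encoder: two conv blocks followed by global pooling and a projection.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                      # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))  # encode every frame independently
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)                   # aggregate features over time
        return self.head(out[:, -1])                # classify from the last time step
```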

Metrics

We will use accuracy as a metric for success, measuring the percentage of ASL hand symbols accurately classified in a video. We plan to achieve a minimum of 70% accuracy. We plan to test sequences of letters as well as individual letters as a way to measure accuracy. We will also experiment to optimize inference time for each letter. Our goals are as follows:

  • Base: Correctly translate images of the ASL alphabet to their corresponding English letter.
  • Target: Real-time translation of ASL signs to their respective letters within a frame.
  • Reach: Automatic hand-tracking (full translation invariance)

Ethics Questions

What broader societal issues are relevant to your chosen problem space?
Accessibility: not everyone has access to learning ASL

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Training data: the images appear not to include different skin tones (some images in the training data were darkened to try to account for the lack of diversity); the data also appears not to include images of the left hand (we will remedy this by mirroring some of the training data)

Why is Deep Learning a good approach to this problem?
Deep learning methods will likely outperform rule-based methods in this case due to the variance in how data is presented in images and videos. A rule-based approach may not provide enough flexibility to classify the broad domain of inputs for sign language translation.

Division of Labor

jrhee8: YOLO Architecture, Model Qualitative Testing
shong41: Data Preprocessing, General CNN Architecture
mshinkar: Data Processing, Naïve Classifier Architecture, Model Quantitative Evaluation
