Real-time ASL Fingerspelling Recognition

Poster presentation for DL Day (5/8/23)

Final Submission Components

Final Write Up/Reflection: https://docs.google.com/document/d/1r8tIa6-5sUd85ZaCLyp-ySj5Kw6ucGzfjyDj8hqiI_M/edit?usp=sharing

Updated Github Repository: https://github.com/plumol/csci1470-final-project-ASL

Poster Presentation PDF: https://drive.google.com/file/d/19xhRAosZ04lt2kllXvVG66Ei1TSsbgYT/view?usp=sharing

Project Check-In #3 Information:

Introduction: Sign language is an essential tool used to communicate within deaf communities. While sign language is widely used in these communities, it is uncommon elsewhere and poses a problem for accessibility. This paper aims to develop a deep learning architecture that can accurately label ASL fingerspelling in real time to bridge the gap between normal-hearing and deaf communication. Almost all fingerspellings can be represented by static images, except for the “J” and “Z” signs. We can leverage this property to use convolutional neural networks to interpret and classify input fingerspelling images.
Challenges: Thus far, the most challenging part of the project has been preprocessing. Though the paper's authors describe the steps they took to preprocess the dataset images, in addition to various errors we have encountered, it was somewhat difficult to determine whether we are preprocessing "correctly" (or at least in a way that is beneficial to our model).
Insights: Our model is able to train, and our testing accuracy is ~99%. This exceeds our expectations! Going forward, our main goal will be to maximize the performance of our real-time classifier, which is still in progress.
Plan: We believe we are on track with our project. Our model's accuracy has already exceeded our stretch goal, and we have started implementing our real-time classifier. The bulk of our time will be devoted to the aforementioned real-time classifier. If needed, we may tweak our preprocessing, depending on how well the classifier does.

Project Check-In #2 Information

** Note that some aspects of this initial check-in (i.e. the dataset used) changed as we implemented the project. The correct and updated information is reflected in our final write-up and poster. **

Title

ASL Fingerspelling Recognition

Who

Claire Yang, Grace Wan, Jennifer Li, Kyle Lam

Introduction

What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.
- Sign language is an essential tool used to communicate within deaf communities. While sign language is widely used in these communities, it is uncommon elsewhere and poses a problem for accessibility. This paper aims to develop a deep learning architecture that can accurately label ASL fingerspelling in real time to bridge the gap between normal-hearing and deaf communication. Almost all fingerspellings can be represented by static images, except for the “J” and “Z” signs. We can leverage this property to use convolutional neural networks to interpret and classify input fingerspelling images.
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.
- This is a supervised classification problem, where we are training our model to correctly label ASL fingerspelling images. We aim to develop a model where we can identify ASL fingerspelling in real time.

Related Work

Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.
- Paper #1: Sign language interpretation is unique as it involves a continuous visual-spatial procedure such that a model is able to recognize letters based on the context of movement. This article explores different deep learning approaches for encoding sign language as inputs using three different sign language datasets (German, American, and Chinese). The input features include body and finger joints, facial points, and vector representations from CNNs. The models examined include baseline sequence-to-sequence approaches, reinforcement learning, more complex attention-using networks, and the transformer model. The study found that the transformer model outperformed the others when applied to the datasets.
- Paper #2: This project aims to make ASL translators portable, still utilizing deep learning methods. Related work discussed in this paper:
  - Hidden Markov Model (HMM) utilized 2 cameras – 1 on desk and other is a wearable cap with an attached camera. Accuracy changed when cam position was changed which shows importance of perspective in classification accuracy
  - PCA analyzer model determines position of hand shape, motion chain code used to understand hand movement/recognize signs that require movement
  - 2 models. 1 effective hierarchical feature characterization scheme + 1 TMDHMM (idk what that means) network to fasten recognition process
  - Utilize parameters including posture of hand, position, motion, and orientation. Use HMM (dont know what that is still tbh)
In this section, also include URLs to any public implementations you find of the paper you’re trying to implement.
Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.
- There is quite a bit of existing material on live ASL fingerspelling classification. While the following are not direct implementations of our paper, they are attempting to tackle the same topic of ASL fingerspelling classification:

Data

What data are you using (if any)? How big is it? Will you need to do significant preprocessing?If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).
- https://www.kaggle.com/datasets/grassknoted/asl-alphabet
  - This dataset is composed of a right hand making a sign in front of a light, mostly solid background. The data is coming from Kaggle; beyond this, we are unsure where the data came from and/or how it was collected. The implication of the dataset’s description is that an individual was inspired to personally create it to better understand users of ASL.
  - The dataset contains 87,000 images, each of which is of size 200 pixels x 200 pixels. For preprocessing, we will have to one-hot encode our classes (one for each letter, and one each for space, nothing, and delete). Because our data consists of images, we will also have to normalize RGB values by dividing them by 255. Something to keep in mind is that the images look quite similar. Most of them have the same lighting and include a right hand specifically. We may have to leverage data augmentation to simulate an image of a left hand and/or to make the model more robust to identifying signs in different environments.

Methodology

What is the architecture of your model? How are you training the model?
- Following the model of this paper, our CNN model architecture will be composed of an input layer, two convolutional layers, a maxpool layer, a flatten layer, and two dense layers. Softmax will be our loss function as it is best suited for classification tasks. The paper trains the model with a batch size of 120 and 30 epochs, though optimal performance was reached around 20 epochs.
If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.
- The hardest part of implementing the paper’s model will be the preprocessing of the data. For the training and testing data, the preprocessing of the training and testing data will be crucial in extracting important features from the fingerspelling images. We also need to consider how we can implement a real-time fingerspelling recognition system. Our model needs to be able to take image frames from our video and preprocess those images for our model to use.

Metrics

What constitutes “success?” What experiments do you plan to run? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance. For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?What are your base, target, and stretch goals?
- Success will mainly be measured in the number predicted labels which match the ground truth labels. The notion of “accuracy” indeed applies for this project, as the main purpose is to recognize fingerspelled letters in ASL, and output a letter as the label accordingly. Given that one of our cited papers achieved 100% test accuracy, with 98.05% test accuracy on a real-time system, a base accuracy of 60% would be reasonable, with a target accuracy of 80% and a stretch accuracy of >90%. Our stretch goal also includes processing real-time webcam input.

Ethics

Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)
- What broader societal issues are relevant to your chosen problem space?
- Why is Deep Learning a good approach to this problem?
- What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
- Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
- How are you planning to quantify or measure error or success? What implications does your quantification have?
- Add your own: if there is an issue about your algorithm you would like to discuss or explain further, feel free to do so.
  - This project space addresses the lack of accessible communication for deaf individuals. Often, these individuals’ needs are overlooked. For example, even large events frequently lack interpreters. According to one of the papers we looked into for this project, nearly 10,000,000 individuals are hard-of-hearing, with 1,000,000 of those people being functionally deaf. However, only 250,000 to 500,000 Americans can understand ASL.
  - Broadly, society today is not particularly sensitive to the wide range of needs that exist, and the lack of interpretation available for deaf individuals is only one example. The major stakeholders are anyone who wants to use the interpreting service, deaf or not. The consequences of mistakes by the program would mainly be incorrect interpretation, which could lead to misunderstandings between two parties. This impact is softened by the fact that fingerspelling is not the main way ASL is communicated - thus, there is less pressure for our model to be completely correct.

Division of labor

Kyle and Jennifer will be working on preprocessing. Claire and Grace will implement the model and make tweaks as needed.

Built With

opencv
python
tensorflow

Updates

Grace Wan posted an update — May 11, 2023 03:44 AM EDT

Check-in #3 update:

Introduction: Sign language is an essential tool used to communicate within deaf communities. While sign language is widely used in these communities, it is uncommon elsewhere and poses a problem for accessibility. This paper aims to develop a deep learning architecture that can accurately label ASL fingerspelling in real time to bridge the gap between normal-hearing and deaf communication. Almost all fingerspellings can be represented by static images, except for the “J” and “Z” signs. We can leverage this property to use convolutional neural networks to interpret and classify input fingerspelling images.
Challenges: Thus far, the most challenging part of the project has been preprocessing. Though the paper's authors describe the steps they took to preprocess the dataset images, in addition to various errors we have encountered, it was somewhat difficult to determine whether we are preprocessing "correctly" (or at least in a way that is beneficial to our model).
Insights: Our model is able to train, and our testing accuracy is ~99%. This exceeds our expectations! Going forward, our main goal will be to maximize the performance of our real-time classifier, which is still in progress.
Plan: We believe we are on track with our project. Our model's accuracy has already exceeded our stretch goal, and we have started implementing our real-time classifier. The bulk of our time will be devoted to the aforementioned real-time classifier. If needed, we may tweak our preprocessing, depending on how well the classifier does.

Log in or sign up for Devpost to join the conversation.

Kyle Lam started this project — Apr 12, 2023 04:23 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.