Inspiration

American Sign Language users can use our ASL-to-text platform as a learning tool, getting instant feedback on signed word arrangements, and as a means of expression to communicate with non-ASL users. Our video-call-based speech-to-text converter also offers transcription capabilities not yet found on popular platforms like WhatsApp, Line, and Discord. The need has been heightened by COVID-19: people who are hard of hearing may have difficulty lip-reading over online platforms or be unable to follow conversations the way they could in real life. We want to increase accessibility over web calls.

We additionally want to highlight our unique algorithm, which relies on relative joint positions rather than raw pixels and so eliminates the hand-color bias of the common, much more computationally intensive Convolutional Neural Network (CNN) approach.

What it does

Uses Deep Learning to (1) convert American Sign Language to text with 97% accuracy and (2) convert real-time live-stream speech to text. Our original, self-made algorithm circumvents racial and ethnic bias in the training set. Implemented with Python, HTML, CSS, JavaScript, and various libraries.

How we built it

We use a total of three Machine Learning models: (1) identifying hand joints to build training data, (2) recognizing American Sign Language words, and (3) converting speech to text.

ASL-to-Text

Early in the project, we had trouble finding a way to identify ASL hand signs efficiently within HackUMass's designated 36 hours. Passing pixelated videos and pictures directly into Tensorflow can take hours or even days to process, because raw colors and exact coordinates are fed straight into the CNN model.

Our solution is to map hand joints, normalize each joint's position relative to a reference point at the base of the hand into the range [-1, 0], and pass in a CSV file of matrices containing only floats. This lets us train the model much faster and load the data within an hour, compared to the many hours a CNN would take. Lighting in a moderately lit classroom has almost no impact on our results, and because no RGB values are parametrized, the model holds people of different racial backgrounds to the same standard.
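The normalization step can be sketched as follows. This is a minimal illustration, not the project's exact code: we assume MediaPipe-style hand landmarks with index 0 at the wrist, and we scale offsets into a unit range by the largest distance from the wrist.

```python
import numpy as np

def normalize_landmarks(landmarks):
    """Make joint positions relative to the wrist and scale them to a unit range."""
    pts = np.asarray(landmarks, dtype=float)   # shape (21, 2): one (x, y) per joint
    rel = pts - pts[0]                         # landmark 0 = wrist, now the origin
    scale = np.abs(rel).max()                  # largest offset from the wrist
    if scale == 0:                             # degenerate frame: all joints overlap
        return rel
    return rel / scale                         # float matrix, ready for a CSV row
```

Because each training row is just these floats, no RGB values ever reach the model, which is what removes lighting and skin-color effects from training.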

Speech-to-Text

We use Tensorflow-based Natural Language Processing (NLP) to handle English. Sockets and Flask run the back-end, OpenCV streams the video, and GCP serves endpoint predictions.
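As an illustration of the back-end wiring, a tiny Flask app can relay recognized phrases from the speech worker to the video page. This is a hedged sketch: the `/caption` route name and the polling design are our assumptions, not the project's actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Latest recognized phrase, shared between the speech worker and the video page.
latest_caption = {"text": ""}

@app.route("/caption", methods=["POST"])
def update_caption():
    # The speech-to-text worker POSTs each phrase it recognizes.
    latest_caption["text"] = request.get_json()["text"]
    return jsonify(ok=True)

@app.route("/caption", methods=["GET"])
def read_caption():
    # The browser polls this and overlays the text on the OpenCV video stream.
    return jsonify(latest_caption)
```

In a real deployment the POST side would be fed by the speech-recognition model's output rather than called by hand.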

Challenges we ran into

Learning how to create and use a large dataset (in excess of 200 images per word) to train the model. Our solution is described under the "How we built it" section.

Creating live video transcription on our website, which required combining our own Python and HTML files.

Configuring versions of Python to be compatible with different libraries. For example, we had issues installing pyaudio (and portaudio, which is needed to build its wheels) on three of our four computers, as well as Mediapipe and Tensorflow. These libraries often require very specific versions, like Python 3.10.8 rather than 3.11, the newest release.
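One way to sidestep those conflicts (a sketch; the exact versions depend on the wheels available for your platform) is to pin the interpreter and dependencies in a dedicated virtual environment:

```shell
# Use an interpreter the libraries support rather than the newest release
python3.10 -m venv .venv
source .venv/bin/activate
# portaudio must be installed system-wide (e.g. via apt/brew) before pyaudio builds
pip install mediapipe tensorflow pyaudio
```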

Syncing different devices to have cross-captioning.

Making the training set from scratch since none of the pre-existing datasets fit our approach. We took 200 images of each hand sign.

Accomplishments that we're proud of

We trained and deployed state-of-the-art, custom-made Machine Learning models and achieved above 97% accuracy for our ASL-to-text converter. For the second part of our project, we successfully launched a live-streaming video client with overlaid captions.

Every member of our team contributed wholeheartedly so we could complete our project in time. We slept an average of less than three hours a night over the course of the weekend, and training three different models took a great deal of patience with bugs.

What we learned

All of us learned about deploying Tensorflow and related Machine Learning libraries. We also got deeply involved in developing the website's back-end, in particular building an ML pipeline.

What's next for AIdio

Expanding our library of ASL words! With our automated data-training methodology (script found in this GitHub), we can continually push new words into our model in 10 minutes or less. We are also looking at other potential applications of our model for small-scale rollout. Our model surpasses most other CNN models in speed and ease of uploading new files, and we are glad to eliminate the color, lighting, and quality bias common to other training datasets.
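A compressed sketch of what such a capture script can look like (the function names and folder layout are our illustration; the real script lives in the repo):

```python
import os

IMAGES_PER_WORD = 200  # dataset size described above: 200 images per hand sign

def frame_paths(word, n=IMAGES_PER_WORD, root="dataset"):
    """Output paths for a new word's training images, e.g. dataset/<word>/000.jpg."""
    return [os.path.join(root, word, f"{i:03d}.jpg") for i in range(n)]

def capture_word(word):
    """Grab webcam frames with OpenCV and save one image per path."""
    import cv2  # imported lazily so frame_paths stays usable without OpenCV
    cap = cv2.VideoCapture(0)
    try:
        for path in frame_paths(word):
            ok, frame = cap.read()
            if not ok:
                break
            os.makedirs(os.path.dirname(path), exist_ok=True)
            cv2.imwrite(path, frame)
    finally:
        cap.release()
```

Each saved image would then be run through the joint-detection model and converted to a normalized float row before training.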

Built With

Python, HTML, CSS, JavaScript, Tensorflow, Mediapipe, OpenCV, Flask, GCP
