All of our team members are deeply passionate about improving students' education. We focused on the underserved community of deaf and hard-of-hearing students, who communicate, understand, and think primarily in ASL. While some of these students have become accustomed to reading English in various contexts, our market research, drawing on studies conducted by Penn State University, indicates that members of the community prefer to communicate and think in ASL, and regard written English as a second language in terms of grammatical structure and syntax.
The majority of deaf people do not have what is commonly referred to as an “inner voice”; instead they often sign ASL in their heads to themselves. For this reason, deaf students are largely disadvantaged in academia, especially when attending live lectures. As a result, we sought to design an app that translates professors’ spoken lectures into ASL in near real time.
What it does
Our app enhances live lectures for members of the ASL-using community by intelligently converting the professor's speech into a sequence of ASL videos for the user to watch during the lecture. To our knowledge, this style of real-time audio-to-ASL conversion has not been done before, and our app helps bridge the educational barrier that exists in the deaf and hard-of-hearing community.
How we built it
We broke down the development of the app into three phases: converting speech to text, converting text to ASL videos, and connecting the two components in an iOS application with an engaging user interface.
Building on existing on-device speech recognition models including Pocketsphinx, Mozilla DeepSpeech, iOS Dictation, and more, we decided to combine them in an ensemble model. We used the Google Cloud Speech-to-Text API to transcribe videos as ground truth, against which we compared our models' transcription error rates by phonemes, lengths, and syllabic features.
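As a rough illustration of the comparison described above, the sketch below computes a word error rate between a ground-truth transcript and one model's output. It is a simplification: it compares whole words only, not the phoneme, length, or syllabic features mentioned above, and the function name `word_error_rate` is our own for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
score = word_error_rate("the cell divides by mitosis",
                        "the cell divide by mitosis")
```

A lower score means the model's transcript is closer to the ground truth, so scores like this can be compared across recognizers on the same recordings.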
Finally, we ran our own tests to ensure that the speech-to-text API was dynamically editing previously spoken words and phrases using the context of neighboring words. The weight assigned to each candidate model was optimized over many iterations of testing using the Weights & Biases API (along with generous amounts of freezing layers and homing in!). Through many grueling rounds of head-to-head comparison, the iOS on-device speech recognizer shone, with superior accuracy and performance compared to the other two, and was assigned the highest weight by far. Based on these results, and to improve performance, we ended up not using the other two models at all.
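To make the weighting idea concrete, here is a minimal sketch of a weighted vote over candidate transcripts. The model keys, the weight values, and the choice to vote on whole transcripts are assumptions made for this example; they are not the tuned values or the exact merging scheme from our experiments.

```python
from collections import defaultdict

# Hypothetical weights, for illustration only.
EXAMPLE_WEIGHTS = {"ios": 0.7, "deepspeech": 0.2, "pocketsphinx": 0.1}


def weighted_vote(candidates: dict[str, str], weights: dict[str, float]) -> str:
    """Return the transcript whose supporting models have the largest total weight."""
    if not candidates:
        raise ValueError("at least one candidate transcript is required")
    scores: dict[str, float] = defaultdict(float)
    for model, transcript in candidates.items():
        # Normalize case and whitespace so identical transcripts pool their weight.
        key = " ".join(transcript.lower().split())
        scores[key] += weights.get(model, 0.0)
    return max(scores, key=scores.get)


example = weighted_vote(
    {
        "ios": "the mitochondria is the powerhouse",
        "deepspeech": "the might of conria is the powerhouse",
        "pocketsphinx": "the might of conria is the powerhouse",
    },
    EXAMPLE_WEIGHTS,
)
```

Whenever one model's weight exceeds the combined weight of all the others, as in this example, that model wins every vote. In that situation the ensemble behaves the same as the single model, so dropping the others costs no accuracy and saves computation.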
Challenges we ran into
When we were designing the solution architecture, we quickly discovered there was no API or database to enable conversion of written English to ASL "gloss" (or even videos). We were therefore forced to make our own database by creating and cropping videos ourselves. While time-consuming, this ensured consistent video quality as well as speed and efficiency in loading the videos on the iOS device. It also inspired our plan to crowdsource video samples for the database from users, in a way that benefits everyone who opts in to the sharing system.
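A lookup against a hand-built clip database like the one described above might be sketched as follows. The clip file names, the `CLIP_INDEX` contents, and the fingerspelling fallback for words without a recorded clip are all assumptions for illustration. The sketch also maps English words one by one, so it does not capture ASL grammar or word order.

```python
import re

# Hypothetical index from English word to recorded clip file.
CLIP_INDEX = {
    "today": "today.mp4",
    "cell": "cell.mp4",
    "divide": "divide.mp4",
}


def clips_for_sentence(sentence: str, clip_index: dict[str, str]) -> list[str]:
    """Turn a transcribed sentence into an ordered playlist of clip files."""
    playlist: list[str] = []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in clip_index:
            playlist.append(clip_index[word])
        else:
            # Assumed fallback: spell out unknown words one letter clip at a time.
            playlist.extend(f"letters/{ch}.mp4" for ch in word if ch.isalpha())
    return playlist


playlist = clips_for_sentence("Today the cell", CLIP_INDEX)
```

Keeping the index as a plain in-memory mapping keeps each lookup cheap, which matters when clips must be queued while the lecturer is still speaking.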
One of the first difficulties we had was reconciling the outputs of the different speech recognition models and adapting them for continuous, lengthy voice samples. Furthermore, we had to ensure our algorithm dynamically adjusted its history and performed backward error correction, since some APIs (especially Apple's iOS Dictation) dynamically alter past text when later words provide more context.
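One simple way to handle a recognizer that rewrites earlier words is to compare each revised transcript with what has already been shown and keep only the shared prefix. The sketch below illustrates that idea; the function name `reconcile` and the word-list representation are assumptions for this example, not the exact data structures in our app.

```python
def reconcile(shown: list[str], revised: list[str]) -> tuple[int, list[str]]:
    """Compare already-shown words with a revised transcript.

    Returns how many trailing shown words must be retracted and which
    words must be appended so that the display matches `revised`.
    """
    common = 0
    for old, new in zip(shown, revised):
        if old != new:
            break
        common += 1
    return len(shown) - common, revised[common:]


# The recognizer first heard "the sell", then corrected itself with more context.
retract, append = reconcile(["the", "sell"], ["the", "cell", "divides"])
```

Here one word ("sell") is retracted and two words ("cell", "divides") are appended, so only the part of the history that actually changed needs to be replayed.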
All of our lexical and syntactic analysis required us to meticulously design finite state machines and data structures around the results of the models and APIs we used, and those results needed significant alteration and massaging before they became useful for our application. This was necessary due to our ambitious goal of achieving real-time ASL delivery to users.
Accomplishments that we're proud of
As a team, we are most proud of our ability to quickly learn new frameworks and use machine learning and reinforcement learning to develop an application that is scalable and modular. While we were subject to a time restriction, we ensured that our user interface was polished, and that our final app integrated several frameworks seamlessly to deliver a usable product to our target audience, sans bugs or errors. We pushed ourselves to learn unfamiliar skills so that our solution would be as comprehensive as we could make it. Additionally, of course, we’re proud of our ability to come together and solve a problem that could truly benefit an entire community.
What we learned
We learned how to brainstorm effectively as a team, develop ideas collaboratively, and parallelize tasks for maximum efficiency. We exercised our literature research and market research skills to recognize that there was a gap we could fill in the ASL community. We also integrated ML techniques into our design and solution process, carefully selecting analysis methods to evaluate candidate options before proceeding on a rigorously defined footing. Finally, we strove to continually analyze data to inform future design decisions and train our models.
What's next for Sign-ify
We want to expand our app to be more robust and extensible. Currently, the greatest limitation of our application is the small set of ASL words for which we recorded videos. In the future, one of our biggest priorities is to dynamically generate animation so that we will have a larger and more accurate database. We also want to improve our speech-to-text component with more training data so that it becomes more accurate in educational settings.
Publishing the app on the iOS App Store will provide the most effective distribution channel and give members of the deaf and hard-of-hearing community easy access to our app.
We are very excited by the prospects of this solution and will continue to update the software to achieve our goal of enhancing the educational experience for users with auditory impairments.
Google Cloud Platform API
Penn State. "Sign language users read words and see signs simultaneously." ScienceDaily. ScienceDaily, 24 March 2011 [www.sciencedaily.com/releases/2011/03/110322105438.htm].