Inspiration

For over eight years, I tried learning multiple languages, from Sanskrit and Spanish to Hindi and French, yet I could barely maintain a fluent conversation in any of them. When I moved to Vancouver in 2021, I joined Burnaby South Secondary School, which shares its campus with the British Columbia Secondary School for the Deaf (BCSD). This gave me the unique opportunity to study a new kind of language – a visual language – in high school.

ASL wasn't like any other language I had attempted to learn before: it wasn't just about words or pronunciation, but about learning how to fully express yourself without the tools you typically rely on. Over the last three years, our ASL class has shown me how much I take communication for granted and helped me notice the many hurdles the Deaf community faces in our hearing-centric society. From my very first week at Burnaby South, I have had many experiences that suddenly remind me why we learn about Deaf culture and accessibility in ASL class. The mission below is ultimately what I hope to achieve with this project.

What it does

Sign Engine has three main components:

  • Sign Engine API: a modular API that produces ASL sign pose sequences for inputted English text
  • Lipsync: a web interface that interacts with the Symphonic Labs lip-reading API and the Sign Engine API to seamlessly translate whispers to ASL signs
  • Chrome Extension: an extension that adds a synced automatic interpreter to all YouTube videos with an accessible English transcript

How we built it

  • Scraped 9000+ videos of signs from various online websites and datasets
  • Used Google Mediapipe to extract Holistic poses and interpolate missing frames and landmarks
  • Fine-tuned a GPT-4o-mini model on a dataset of 2M+ English-ASL gloss pairings (ASL gloss is a written approximation of ASL's grammar structure)
  • Used the all-MiniLM-L6-v2 embedding model on the corresponding English words for each ASL sign, so we can use cosine-similarity semantic search to fetch contextually similar signs when we don't have the exact sign for a word. This matters because we only have ~9,000 signs in the dataset, while there are 150,000+ English words
  • Stored all embeddings and Holistic landmarks in a PostgreSQL database with the pgvector extension
  • Used Flask to create the Sign Engine API, which stitched together signing sequences for inputted English text
  • Used Next.js to create the Lipsync interface, which interacted with the Symphonic Labs lip-reading API and the Sign Engine API to translate words to ASL signs
  • Used the Chrome extension APIs and various YouTube transcript services to create the Chrome extension, which interpreted YouTube videos live
  • Cleaned up all code and added quality-of-life improvements like frame interpolation and smoother animations
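The missing-frame interpolation step can be sketched roughly like this. This is a minimal sketch, not the actual pipeline code: in the real system the per-frame landmarks come from MediaPipe Holistic, while here they are just toy (N, 3) coordinate arrays, with `None` marking a frame where extraction failed:

```python
import numpy as np

def interpolate_missing_frames(frames):
    """Fill in frames where landmark extraction failed.

    `frames` is a list where each element is either an (N, 3) array of
    landmark coordinates or None (no landmarks detected for that frame).
    Interior gaps are filled by linear interpolation between the nearest
    valid frames; leading/trailing gaps copy the closest valid frame.
    """
    valid = [i for i, f in enumerate(frames) if f is not None]
    if not valid:
        raise ValueError("no valid frames to interpolate from")
    out = list(frames)
    for i, f in enumerate(frames):
        if f is not None:
            continue
        prev = max((j for j in valid if j < i), default=None)
        nxt = min((j for j in valid if j > i), default=None)
        if prev is None:          # leading gap: copy first valid frame
            out[i] = frames[nxt].copy()
        elif nxt is None:         # trailing gap: copy last valid frame
            out[i] = frames[prev].copy()
        else:                     # interior gap: linear blend
            t = (i - prev) / (nxt - prev)
            out[i] = (1 - t) * frames[prev] + t * frames[nxt]
    return out
```

For example, if frame 1 of `[a, None, b]` is missing, it is reconstructed as the midpoint of `a` and `b`, which is what makes the stitched animations look smooth instead of jittery.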
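The semantic fallback boils down to nearest-neighbour search over word embeddings. Here is a toy sketch with made-up 3-d vectors standing in for the real 384-d all-MiniLM-L6-v2 embeddings (in production the same lookup can be pushed into pgvector with an `ORDER BY embedding <=> query LIMIT 1` query):

```python
import numpy as np

def nearest_sign(query_vec, sign_vocab):
    """Return the vocabulary word whose embedding is most cosine-similar
    to the query embedding. `sign_vocab` maps word -> embedding vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sign_vocab, key=lambda w: cos(query_vec, sign_vocab[w]))

# Toy vectors standing in for all-MiniLM-L6-v2 embeddings; the assumption
# they illustrate is that semantically close words get close vectors.
vocab = {
    "big":   np.array([0.9, 0.1, 0.0]),
    "small": np.array([-0.9, 0.1, 0.0]),
    "happy": np.array([0.0, 0.1, 0.9]),
}
huge = np.array([0.8, 0.2, 0.1])   # pretend embedding for "huge"
print(nearest_sign(huge, vocab))   # -> "big": no HUGE sign, so use BIG
```

This is why a 9,000-sign vocabulary can still cover arbitrary English input: an unknown word degrades to its closest known sign rather than being dropped.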

Challenges we ran into

  • Fine-tuning GPT-4o-mini took over 9 hours and wasn't very useful in the end. ASL gloss is often used as an intermediary step when translating between English and ASL; however, it still doesn't capture the full nuance of the language. I found that, despite training on an immense amount of data, the fine-tuned model actually degraded translation quality. I ended up sticking with prompting GPT-4o-mini directly, using certain rule-based word-reordering steps
  • Extracting MediaPipe Holistic pose estimations for the 9,000+ signs, interpolating missing frames and landmarks, and embedding each word also took 9+ hours. This worried me more, because I knew that having 4,000 signs instead of 9,000 would significantly degrade translation quality
  • I ran into many bugs with technology I hadn't used before: the Chrome extension APIs, client-side animation, the Symphonic Labs API, etc.
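The kind of rule-based reordering that can replace a fine-tune can be illustrated with a deliberately simplified sketch. These rules are a hypothetical toy, not a faithful model of ASL grammar or of the actual prompt: they front time words, drop articles and copulas, and uppercase the result in gloss convention:

```python
# Toy illustration of gloss-style preprocessing (hypothetical rules,
# not real ASL grammar): front time words, drop articles/copulas,
# uppercase in gloss convention.
TIME_WORDS = {"yesterday", "today", "tomorrow"}
DROPPED = {"a", "an", "the", "is", "am", "are", "was", "were"}

def toy_gloss(sentence):
    words = [w.lower().strip(".,!?") for w in sentence.split()]
    kept = [w for w in words if w not in DROPPED]
    time = [w for w in kept if w in TIME_WORDS]
    rest = [w for w in kept if w not in TIME_WORDS]
    return " ".join(w.upper() for w in time + rest)

print(toy_gloss("Yesterday the dog was happy."))  # -> YESTERDAY DOG HAPPY
```

The appeal of this approach over fine-tuning is that each rule is inspectable and cheap to adjust, which mattered once the fine-tuned model turned out to hurt quality.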

Accomplishments that we're proud of

  • I've been passionate about sign language processing research for quite a while now. It's a beautiful blend of computational linguistics, computer vision, sign language linguistics, and speech recognition. I've been putting off this project for many months now, and my main goal for Hack the North was to commit to this project for the entire time so that I could finally work on what I wanted to. I'm proud that I was able to do that, and also build a tool that can truly lead to consumer-accessible sign language translation after some more iteration!

What we learned

  • Finetuning might not always be the best approach
  • 8 hours of travel prior to a hackathon tends to take a toll on me
  • Pacing myself is extremely important

What's next for Sign Engine

  • I'm really proud of what I've been able to achieve in such a short span of time – but as someone who has been lucky enough to learn ASL for a couple of years now, I know that there's a lot more to go. I worked on Sign Engine because it's a project I want to continue building after HTN, and HTN was my opportunity to get a jump start.

The most important things are:

  • getting Deaf involvement and feedback as I continue iterating and working on the project
  • improving pacing of the interpretations – it struggles to keep up with faster-paced YouTube videos
  • capturing more of ASL's nuance – directional signs, facial expressions, etc
  • using Kalidokit or some kind of VRM avatar for animations so it isn't a creepy green skeleton signing at you
