Sign-opsis

💡 Inspiration

Sign-opsis was born from a desire to improve accessibility and communication for the deaf and hard of hearing community. We recognized the challenges in real-time sign language translation—especially in fast-paced environments like virtual meetings, customer service interactions, and social media. Our goal: create a seamless bridge between spoken content and sign language.

🛠 What It Does

🎧 Input: Audio from videos, podcasts, or live captions

🤟 Output: A pair of animated hands performing American Sign Language (ASL)

🧱 How We Built It

Speech Recognition (ASR): We used OpenAI’s Whisper to convert spoken language into text. Whisper handles various accents and background noise well, and we implemented preprocessing to clean the text output for translation.
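A minimal sketch of this step, assuming the open-source openai-whisper package. The clean_transcript helper, the "base" model size, and the filler-word list are hypothetical choices for illustration, not the project's documented settings.

```python
import re

# Hypothetical filler list; the project's real cleanup rules are not documented.
FILLERS = {"um", "uh", "erm", "hmm"}


def clean_transcript(text: str) -> str:
    """Lowercase, strip punctuation and filler words, collapse whitespace."""
    text = text.lower()
    # Keep letters, digits, apostrophes, and spaces; everything else becomes a space.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    words = [w for w in text.split() if w not in FILLERS]
    return " ".join(words)


def transcribe(audio_path: str, model_size: str = "base") -> str:
    """Run Whisper on one audio file and return cleaned text."""
    import whisper  # imported lazily so the cleanup helper works without it

    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)  # dict with "text" and "segments"
    return clean_transcript(result["text"])
```

Keeping the cleanup in its own function means it can be checked on plain strings without loading a model.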

Text-to-ASL Translation: English grammar does not directly map to ASL. We used spaCy to parse and restructure sentences into ASL-style gloss. Then, each word or letter is mapped to corresponding hand coordinate data.
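The restructuring could look roughly like the sketch below. It assumes spaCy's en_core_web_sm model for tagging and uses two simplified rules that stand in for the project's actual logic: drop articles, auxiliaries, and punctuation, and move time words to the front. The TIME_WORDS list and both function names are hypothetical.

```python
# Part-of-speech tags (spaCy's universal tag set) that this sketch drops.
DROPPED_POS = {"DET", "AUX", "PUNCT"}
# Hypothetical list of time words that get moved to the front of the gloss.
TIME_WORDS = {"yesterday", "today", "tomorrow", "now", "later"}


def parse(text: str) -> list[tuple[str, str]]:
    """Return (lemma, pos) pairs using spaCy's small English model."""
    import spacy  # lazy import; requires en_core_web_sm to be installed

    nlp = spacy.load("en_core_web_sm")
    return [(tok.lemma_.lower(), tok.pos_) for tok in nlp(text)]


def to_gloss(tagged: list[tuple[str, str]]) -> list[str]:
    """Reorder tagged lemmas into a rough ASL-style gloss."""
    kept = [lemma for lemma, pos in tagged if pos not in DROPPED_POS]
    time = [w for w in kept if w in TIME_WORDS]
    rest = [w for w in kept if w not in TIME_WORDS]
    return [w.upper() for w in time + rest]
```

With hand-written tags for "I am going to the store tomorrow.", to_gloss returns TOMORROW, I, GO, TO, STORE. Real ASL grammar involves much more than these two rules, so this only illustrates the general shape of the transformation.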

We used two open datasets:

ASL Citizen: Provides alphabet videos + metadata in CSV format.

WLASL: A larger collection with glosses and metadata in JSON.

To animate the hands, we used MediaPipe to extract 2D hand keypoints (x, y) frame by frame from both sign language videos and static alphabet images. Before that step, we used OpenCV (cv2) to create point locations for each joint of the hand and connect them into a structured hand skeleton. This multi-step approach gave us more control over how the hand movements were visualized. Precisely aligning the cv2 output with MediaPipe’s landmark layout was a challenge, especially with noisy video data or varying hand poses.
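A minimal sketch of per-frame extraction, assuming MediaPipe's legacy Hands solution (mp.solutions.hands) and OpenCV for video decoding. The function names and the confidence threshold are illustrative assumptions; the project's own extractor may be organized differently.

```python
def to_pixels(points, width, height):
    """Convert MediaPipe's normalized (x, y) pairs to integer pixel coordinates."""
    return [(int(x * width), int(y * height)) for x, y in points]


def extract_keypoints(video_path: str, max_hands: int = 2):
    """Return, per frame, a list of hands; each hand is 21 normalized (x, y) pairs."""
    import cv2  # lazy imports so to_pixels can be used without these installed
    import mediapipe as mp

    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.hands.Hands(
        static_image_mode=False,
        max_num_hands=max_hands,
        min_detection_confidence=0.5,
    ) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV decodes frames as BGR.
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            detected = results.multi_hand_landmarks or []
            frames.append(
                [[(lm.x, lm.y) for lm in hand.landmark] for hand in detected]
            )
    cap.release()
    return frames
```

Storing the normalized values and converting to pixels only at draw time keeps the saved data independent of the source video's resolution.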

These hand keypoints were saved in a structured coordinates.json file, which can be played back by a hand model or used to animate a 2D hand.
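The layout of coordinates.json is not described here, so the sketch below assumes a simple hypothetical schema: a dictionary that maps each gloss or letter to a list of frames, where each frame is a list of [x, y] pairs. The fingerspelling fallback in frames_for is also an assumption about how missing words could be handled.

```python
import json
from pathlib import Path


def save_coordinates(signs: dict, path: str) -> None:
    """Write {gloss: [frame, ...]} to disk; each frame is a list of [x, y] pairs."""
    Path(path).write_text(json.dumps(signs, indent=2), encoding="utf-8")


def load_coordinates(path: str) -> dict:
    """Read the gloss-to-frames mapping back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def frames_for(gloss: str, signs: dict) -> list:
    """Return frames for a gloss, fingerspelling letter by letter if it is missing."""
    if gloss in signs:
        return signs[gloss]
    frames = []
    for letter in gloss:
        frames.extend(signs.get(letter, []))  # unknown letters are skipped
    return frames
```

A player would then call frames_for once per gloss word and concatenate the results into one animation sequence.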

🚧 Challenges We Faced

spaCy, MediaPipe, and Whisper all have specific Python and system requirements. We had to carefully manage environments and deal with memory and compatibility issues.

Turning a sign into keypoint coordinates involves careful alignment of frame timing, pose estimation, and meaning. We had to manually validate and fine-tune many mappings.

Whisper’s performance drops for files larger than ~25MB on many free-tier platforms. We split long files and managed memory to make transcription feasible without downgrading quality.
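One way to split long audio is sketched below using only Python's standard wave module, so it handles uncompressed WAV input only. The split_wav name, the chunk naming scheme, and the 60-second default are assumptions; the project may well have used a different tool or chunk size.

```python
import wave
from pathlib import Path


def split_wav(src: str, out_dir: str, chunk_seconds: int = 60) -> list[str]:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = reader.getframerate() * chunk_seconds
        index = 0
        while True:
            data = reader.readframes(frames_per_chunk)
            if not data:
                break
            chunk_path = out / f"chunk_{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                writer.setparams(params)  # header frame count is corrected on close
                writer.writeframes(data)
            paths.append(str(chunk_path))
            index += 1
    return paths
```

Fixed-length cuts can land in the middle of a word, so a fuller version would probably cut at silences or overlap adjacent chunks slightly.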

Ensuring that audio input, sign language output, and subtitle syncing worked seamlessly across the stack (React + Flask) was tough. We had to debug latency, coordinate file passing, and standardize output formats.
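As an illustration of a standardized output format, the sketch below converts Whisper-style segments (dictionaries with "start", "end", and "text") into cues that carry frame ranges for the animation. The field names start_frame and end_frame and the 30 fps default are hypothetical, not the project's actual payload.

```python
def build_timeline(segments: list[dict], fps: int = 30) -> list[dict]:
    """Turn Whisper-style segments into cues with frame ranges for the animation."""
    timeline = []
    for seg in segments:
        timeline.append(
            {
                "text": seg["text"].strip(),
                "start_frame": round(seg["start"] * fps),
                "end_frame": round(seg["end"] * fps),
            }
        )
    return timeline
```

A Flask route could return this list as JSON so the React client drives both the subtitles and the hand animation from the same clock.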

It was also difficult to ensure that the (x, y) keypoints generated by OpenCV correctly aligned with MediaPipe’s model. Any mismatch between point generation and landmark indexing caused animation bugs or misinterpretation of signs, requiring iterative debugging and visual inspection across frames.
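Some of that visual inspection can be narrowed down with an automatic check. The sketch below flags frames of a single tracked hand that do not have 21 points (the count MediaPipe Hands reports per hand), that contain coordinates outside the normalized 0 to 1 range, or where any landmark moves more than a threshold since the last good frame. The find_bad_frames name and the 0.25 threshold are assumptions.

```python
NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 landmarks per hand (0 = wrist)


def find_bad_frames(frames: list, max_jump: float = 0.25) -> list[int]:
    """Return indices of frames with a bad point count, range, or sudden jump."""
    bad = []
    previous = None
    for i, hand in enumerate(frames):
        ok = len(hand) == NUM_LANDMARKS and all(
            0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in hand
        )
        if ok and previous is not None:
            # A large move of any landmark between frames suggests an indexing mix-up.
            ok = all(
                abs(x - px) <= max_jump and abs(y - py) <= max_jump
                for (x, y), (px, py) in zip(hand, previous)
            )
        if ok:
            previous = hand
        else:
            bad.append(i)
    return bad
```

MediaPipe can legitimately report values slightly outside 0 to 1 when a hand is partly off-screen, so the range check here is deliberately strict and may need loosening for real footage.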

🏆 Accomplishments We’re Proud Of

We created a fully working hand joint extractor from videos/images using OpenCV and MediaPipe, saving consistent (x, y) coordinates across frames.

We developed a clean and interactive frontend interface.

We successfully embedded Whisper in our backend and managed to run large-scale transcriptions efficiently, even with audio preprocessing and error handling.

📚 What We Learned

Techniques for extracting and synchronizing 2D hand keypoints from both videos and images.

How to combine audio processing, NLP, and computer vision into a modular backend.

The structural differences between ASL and English, which are essential for accurate translation.

🚀 What’s Next for Sign-opsis

Support for multiple sign languages (e.g., BSL, ISL)

Facial expression generation (critical for expressive accuracy in ASL)

Integration with platforms like Zoom, YouTube, and Google Meet for live translation

An offline version for accessibility in low-connectivity regions

Multilingual input support: convert audio/video in any language → generate an English transcript → summarize the content → render a 3D avatar performing sign language with text/audio subtitles