Audio to American Sign Language (ASL) Video Translation System

1. Introduction

The Audio to American Sign Language (ASL) Video Translation System is an innovative solution designed to bridge the communication gap between spoken language and sign language. By leveraging cutting-edge technologies in Automatic Speech Recognition (ASR), video processing, and Generative AI (GenAI), this research project aims to translate spoken audio into fluid ASL video content.

1.1 Problem Statement

Despite advancements in accessibility technologies, there remains a significant communication barrier between hearing individuals and the deaf and hard-of-hearing community. Traditional text-based captioning systems fall short in conveying the nuances of sign language, including facial expressions and body language. Moreover, events such as keynotes and lectures often lack captioning or an interpreter, making it difficult for deaf or hard-of-hearing attendees to fully grasp the content. This research project addresses that gap by developing a system that translates spoken audio directly into ASL video, providing a more natural and expressive form of communication for ASL users.

1.2 Objectives

  1. Research and develop a robust system for translating spoken audio into ASL video with minimal latency.
  2. Leverage a comprehensive database (source) of ASL videos covering a wide range of sentences and expressions.
  3. Determine efficient algorithms for mapping recognized speech to appropriate ASL video segments (a one-to-many mapping).
  4. Ensure smooth and natural-looking transitions between ASL video segments in the final output (possible through the addition of an LSTM sub-network in the future).
  5. Optimize the system for performance to enable real-time or near real-time translation (ideally without excessive computational resources).

2. System Architecture

The Audio to ASL Video Translation System is composed of two main subsystems:

  1. Automatic Speech Recognition (ASR) Subsystem
  2. Text to ASL Video Mapping Subsystem

2.1 System Overview

[Audio Input] → [ASR Subsystem] → [Text Output] → [Text to ASL Mapping Subsystem] → [ASL Video Output]

2.2 Automatic Speech Recognition (ASR) Subsystem

The ASR subsystem is responsible for converting input audio into text. It uses the state-of-the-art whisperX model to achieve high accuracy across various speakers and acoustic conditions.
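
A minimal sketch of the transcription step with whisperX (the model size and input file name below are illustrative assumptions, not fixed choices of this project):

```python
import whisperx

device = "cuda"  # or "cpu" on machines without a GPU
model = whisperx.load_model("large-v2", device)  # assumed model size

audio = whisperx.load_audio("input.wav")  # hypothetical input file
result = model.transcribe(audio, batch_size=16)
print(result["segments"])  # chunk-level {"text", "start", "end"} entries
```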

2.3 Text to ASL Video Mapping Subsystem

This subsystem takes the text output from the ASR and maps it to corresponding ASL video segments. It combines natural language processing with video processing (splitting videos into individual frames) to produce and store a mapping between frame segments and transcribed text, which is then used to create, store, and retrieve text-image embeddings in a shared vector space. Later, when a relevant frame is retrieved through text similarity search, this mapping is used to fetch the relevant nearby frames, which are patched together into a smoother ASL video.
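
As a rough sketch of the "nearby frames" step, assuming each database row stores a video filename and the index of its matched frame (both field names are hypothetical):

```python
def expand_to_segment(match: dict, context: int = 12) -> dict:
    """Expand a single matched frame into a window of nearby frames.

    `match` is a hypothetical DB row, e.g.
    {"video_file": "hello.mp4", "frame_idx": 40}.
    Pulling `context` frames on each side yields a clip that covers
    the whole sign rather than a single still frame.
    """
    start = max(0, match["frame_idx"] - context)
    end = match["frame_idx"] + context
    return {"video_file": match["video_file"],
            "frames": list(range(start, end + 1))}
```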

3. Methodology

3.1 Audio Preprocessing

  • Audio is captured (wav, mp3) and preprocessed to enhance quality and remove noise (thankfully, whisperX takes care of this).
  • The audio stream is segmented into manageable chunks for processing.
  • The preprocessed audio is fed into the ASR model.
  • The ASR model converts the audio into text, providing word-level timestamps.
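
Continuing the transcription sketch from §2.2, the word-level timestamps come from whisperX's forced-alignment pass (variable names carry over from that sketch):

```python
# Forced alignment refines chunk-level segments into word-level timings.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for word in aligned["word_segments"]:
    print(word["word"], word["start"], word["end"])
```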

3.2 Text Processing

  • The ASR output is tokenized into words or phrases.
  • Natural Language Processing (NLP) techniques are applied to handle grammar structures and idiomatic expressions.
  • Extracted text is stored along with timestamps.
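
A minimal sketch of this step using NLTK; the input format mirrors whisperX's word-level output from §3.1, and the record field names are assumptions:

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer data, one-time download
from nltk.tokenize import word_tokenize

def process_transcript(word_segments):
    """Tokenize ASR output, keeping each token paired with its timestamp.

    `word_segments`: list of {"word": str, "start": float, "end": float},
    as produced by the whisperX alignment step.
    """
    tokens = []
    for seg in word_segments:
        for tok in word_tokenize(seg["word"].lower()):
            tokens.append({"token": tok, "start": seg["start"], "end": seg["end"]})
    return tokens
```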

3.3 ASL Video Segment Retrieval

  • Each processed text token is converted into an embedding vector.
  • The embedding is used to query a database (we're using LanceDB) of ASL video segments.
  • The most relevant ASL video segment is retrieved based on similarity measures.
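
A sketch of the similarity search against LanceDB, assuming a local database at ./asl_db, a table named asl_segments, and an all-MiniLM-L6-v2 text encoder (all three are assumptions for illustration):

```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

db = lancedb.connect("./asl_db")        # assumed database path
table = db.open_table("asl_segments")   # assumed table name

def retrieve_segments(text: str, k: int = 1) -> list[dict]:
    """Embed a token/phrase and return the top-k matching ASL rows."""
    query_vec = encoder.encode(text)
    return table.search(query_vec).limit(k).to_list()
```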

3.4 Video Stitching and Post-processing

  • Retrieved ASL video frames are stitched together.
  • (Future Work, needs LSTM sub-network) Transition smoothing techniques are applied to ensure fluid motion between segments.
  • (Future Work, needs Audio-Text-ASLVideo Dataset synced with timestamps) Final video is rendered with any necessary adjustments for timing and pacing.
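
A minimal stitching sketch with OpenCV's VideoWriter, assuming retrieved segments arrive as lists of frames in output order; no transition smoothing is applied, matching the current state of the system:

```python
import cv2

def stitch_segments(segments, out_path="asl_output.mp4", fps=30):
    """Concatenate retrieved ASL frame segments into a single video.

    `segments`: list of lists of BGR frames (numpy arrays).
    """
    height, width = segments[0][0].shape[:2]
    writer = cv2.VideoWriter(
        out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
    )
    for segment in segments:
        for frame in segment:
            # Normalize resolution in case clips differ in size.
            writer.write(cv2.resize(frame, (width, height)))
    writer.release()
```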

4. Technical Components

4.1 Embedding Models

  • Technology: SentenceTransformer
  • Purpose: Generate embeddings for text.

  • Technology: ResNet
  • Purpose: Generate embeddings for ASL video segments/frames.
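
A sketch of the two encoders, assuming an all-MiniLM-L6-v2 text checkpoint and ResNet-50 frame features. Note that placing text and image embeddings in a genuinely shared vector space (as described in §2.3) would additionally require a learned projection or alignment step not shown here:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sentence_transformers import SentenceTransformer

# Text encoder (assumed checkpoint; any SentenceTransformer model works).
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Frame encoder: ResNet-50 with the classifier head replaced by identity,
# so the 2048-d pooled feature acts as the frame embedding.
frame_encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
frame_encoder.fc = torch.nn.Identity()
frame_encoder.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed_text(text: str):
    return text_encoder.encode(text)  # 384-d vector for this checkpoint

@torch.no_grad()
def embed_frame(frame_rgb):
    """frame_rgb: HxWx3 uint8 RGB array (convert BGR->RGB if from OpenCV)."""
    return frame_encoder(preprocess(frame_rgb).unsqueeze(0)).squeeze(0).numpy()
```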

4.2 Database

  • Technology: LanceDB
  • Purpose: Store and retrieve ASL video segments, English transcribed text, timestamps, and associated metadata (filenames, related frames, etc.)
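
A sketch of how the table might be created, with a hypothetical per-segment record layout (field names and the placeholder vector are illustrative only):

```python
import lancedb

db = lancedb.connect("./asl_db")  # assumed local path

# Hypothetical record layout: one row per indexed frame segment.
records = [
    {
        "vector": [0.0] * 384,      # embedding (dim matches the text encoder)
        "text": "hello",            # transcribed English text
        "timestamp": 0.0,           # start time in the source audio
        "video_file": "hello.mp4",  # source clip
        "frame_start": 0,           # related frames for smoother playback
        "frame_end": 24,
    },
]
table = db.create_table("asl_segments", data=records, mode="overwrite")
```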

4.3 Video Processing

  • Technology: OpenCV
  • Purpose: Handle video frame extraction, processing, and stitching
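
A minimal frame-extraction sketch with OpenCV:

```python
import cv2

def extract_frames(video_path: str, stride: int = 1):
    """Yield (frame_index, BGR frame) pairs from a video file."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            yield idx, frame
        idx += 1
    cap.release()
```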

5. Challenges and Future Work

5.1 Current Challenges

  • Handling ASL concepts with no direct spoken language equivalent
  • Maintaining natural signing speed and rhythm
  • Accurately conveying tone and emotion in the ASL output
  • Scaling the system to cover a wide range of vocabulary and expressions
  • Optimizing for real-time performance on various hardware configurations

5.2 Future Work

  1. Develop techniques for generating novel ASL signs for unseen words or concepts.
  2. Implement user feedback mechanisms to continuously improve translation quality.
  3. Expand the system to support multiple sign languages and spoken languages.
  4. Explore the use of 3D avatars for more flexible and customizable ASL video generation.

6. Conclusion

The Audio to ASL Video Translation System represents a significant step forward in bridging the communication gap between spoken and sign languages. By leveraging advanced technologies in speech recognition, natural language processing, and video processing, this system aims to provide more accessible and expressive communication tools for the deaf and hard-of-hearing community. While challenges remain, ongoing research and development in this area hold great promise for creating more inclusive communication environments in the future.

Built With

  • automatic-speech-recognition
  • intel-tiber-cloud
  • lancedb
  • nltk
  • python
  • pytorch
  • resnet
  • sentence-bert
  • whisperx