Crafting Sign Language From Speech

Nya Haseley-Ayende (nhaseley), Bahar Birsel (bbirsel), Halleluiah Girum (hgirum), Amaris Grondin (agrondin), Javier Fernandez Garcia (jferna35)

Useful links:

Introduction: what we are solving and why

Our group aims to bridge the communication gap between people who are deaf or hard of hearing and those who are not by creating a tool that translates sentences of speech to sequences of sign language.

While there has been significant research into sign language translation (SLT) models that convert sign language into speech, the development of sign language production (SLP) models is relatively new. SLP models are crucial for enabling two-way communication, as they allow for the translation of spoken language into sign language, thereby facilitating interactions between people who are deaf or hard of hearing and those who are not.

Our guiding paper

If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

We are implementing Progressive Transformers for End-to-End Sign Language Production by Saunders et al. The objective of the paper is to have two-way translation of sign language to spoken language and vice versa and improve on past papers that propose Sign Language Production techniques. We choose this paper because it focuses on constructing an architecture that is aligned with the continuous nature of sign sequences through introducing the ‘Progressive Transformer’ architecture.

What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.

Our project could fall under various categories of deep learning, most prominently it involves Sequence-to-Sequence learning since it translates a sequence of English language into a sequence of sign language images using a modified transformer architecture. However, the task could also be viewed through the lens of NLP since it involves language translation, or structured prediction as the model outputs/predicts a sequence of sign language gestures.

Related Work

Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.

Neural Sign Language Synthesis: Words Are Our Glosses

Zelinka et al worked on text-to video sign language synthesis by building a neural network-based translator between text and synthesized skeletal pose. They train their model on available free data from daily TV broadcasting without video segmentation. The two main tasks were to extract high-quality skeletal models from videos and to make a fully trainable end-to-end sign language synthesis system without any explicit translation. For the first task, they used Open-Pose which is a neural-network-based skeleton extraction method. For the second task, they used a state-of-the-art seq2seq translator that uses an RNN-based encoder-decoder technique with LSTM that also utilizes an attention mechanism. They replaced CNNs and tokenization layer with a layer that performs word embedding and replaced the softmax activation function in the output with a linear function. Finally, they evaluated their results using MSE with dynamic time warping (DTW → algorithm for measuring similarity between two temporal sequences, which may vary in speed).

In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list. Neural Machine Translation Methods for Translating Text to Sign Language Glosses Sign Language Production using Neural Machine Translation and Generative Adversarial Networks (same concept - different implementation)

Data

If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).

How2Sign

For our project, we are utilizing the How2Sign dataset, which is distinct from the dataset used in our guiding paper. The How2Sign dataset is a parallel corpus that includes both speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translations, along with annotations. This dataset comprises 80 hours of multiview ASL videos, gloss annotations, and a coarse categorization of the videos. In addition to the videos, the How2Sign dataset includes various modalities such as multiview recordings, transcriptions, gloss annotations, pose information, depth data, and speech tracks.

In case that the dataset above does not suffice, or is too difficult to work with, or another issue which has no suitable work around is found, we can also use the RWTH-PHOENIX-Weather 2014 dataset which contains clips of speech associated with sign language translations. The data preparation and pre-processing for this dataset should be similar, though harder to work with since GLOSS is not offered.

How big is it? Will you need to do significant preprocessing?

The entire dataset is substantial, with the smallest possible set of clips (shortened videos) amounting to 21 GB (training) and another 4 GB in testing and validation data. Aside from the large size of the dataset, its data is also quite complex and diverse, so preprocessing will be necessary to prepare the dataset for use. This may include: Synchronizing the clips and speech/gloss translations. Normalizing the format and resolution of the videos to ensure consistency across them. Extracting relevant features from the video such as the hands which we can use as the inputs for our model.

Methodology

What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

Progressive Transformer Model Diagram

The diagram above shows the model architecture we will be implementing. Specifically, we are planning on first implementing ‘b’ in the diagram which is the Progressive Transformer by inputting GLOSS (sign language annotation) and outputting sign pose sequences. A Progressive Transformer is built on the traditional symbolic transformer which has multi-headed self attention (MHA) employed in the encoder and both MHA and encoder-decoder attention in the decoder alongside residual connections, layer normalization and a feed-forward neural network. In the decoder of the Progressive Transformer, there is a continuous embedding layer which projects the sign pose frame into a continuous 3D joint position and a counter embedding which is analogous to a position embedding as it represents the frame position of the pose and the speed of each pose (between 0,1). The transformer predicts the skeleton pose and the counter value at each time step. The model is trained using Mean Squared Error loss between the predicted sequence and the true sequence. The hardest part will be implementing the progressive transformer model as we would have to augment the vanilla tensorflow implementation of a transformer.

Metrics

What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

We believe that converting the final clip back into GLOSS and comparing that with the actual GLOSS phrase that we started with would be a good way of measuring the accuracy. To do this we can use a premade model to check for the accuracy of the created clip. The existing paper that we are implementing also uses the same approach. They used back-translation to evaluate their model. In this way, they assessed how understandable the clips were and how much content is contained through the translation. They also used BLEU and ROGUE scores which would also be useful for us. We also might take a random sample from the generated clips and compare it ourselves with real examples.

What are your base, target, and stretch goals? Our base goal is to have the GLOSS to Sign language model training well and perform reasonably well on the BLEU and ROUGE metric.

Our target goal is to reach BLEU and ROUGE score levels of other papers that don’t use transformer modeling but GANS (Stoll et al.). This is also the benchmark that the paper we are implementing used. In addition we would also want to train a model that translates from spoken language to GLOSS.

Our stretch goal would be to have our progressive transformer perform the same as the benchmark in the translation of Text to sign language.

Ethics

What broader societal issues are relevant to your chosen problem space?

Sign language translation has significant societal implications as it directly impacts the communication barriers faced by individuals who are deaf or hard of hearing. By developing a tool that facilitates the translation of spoken language into sign language, we are addressing a fundamental need for accessibility and inclusivity. This project contributes to breaking down communication barriers and promoting equal participation and opportunities for individuals with hearing impairments in various aspects of life, including education, employment, and social interactions.

Why is Deep Learning a good approach to this problem?

Deep learning models can learn the intricate mapping between spoken language features and the corresponding visual features of sign language gestures. Also, sign language translation models require vast amounts of data to learn effectively. Deep learning algorithms are adept at handling and processing large datasets, which is crucial for achieving good performance. Lastly, deep learning models can continuously improve with additional data. This allows for ongoing development and refinement of the sign language generation capabilities over time.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our dataset is the How2Sign dataset, which consists of instructional videos with corresponding American Sign Language (ASL) translations, annotations, and other modalities like pose information and speech tracks.

The process of collecting and labeling sign language data can be complex and labor-intensive. There may be variations in how signs are interpreted and labeled, leading to inconsistencies or inaccuracies in the dataset. Especially knowing that sometimes ASL is more about translating by phrases, rather than direct word-to-word translation, this may make the image generation a bit more difficult and the accuracy of the ASL translations and annotations may vary, depending on the expertise and qualifications of the individuals involved in the labeling process. Errors in labeling could propagate through the training process and affect the performance of the model.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The major stakeholders in this problem include individuals who are deaf or hard of hearing, sign language interpreters, educators, and developers of assistive technologies. Mistakes made by the algorithm could have significant consequences for these stakeholders. Inaccurate translations could lead to misunderstandings, miscommunications, and potentially harmful outcomes in various contexts such as educational settings, medical consultations, legal proceedings, and everyday interactions. Therefore, it's crucial to ensure the accuracy, reliability, and cultural sensitivity of the sign language generation model to minimize the risk of errors and promote effective communication.

How are you planning to quantify or measure error or success? What implications does your quantification have?

We will be using MSE loss between the prediction sequence and the ground truth to train the model, and we will be using the BLEU and ROUGE scores to measure success.

Division of labor

Group coordination management: Nya
Data scraping and pre-processing: Amaris and Javier
Model development and coding: Bahar and Halle (All, but mostly Bahar and Halle)
Model evaluation: Nya and Javier
Poster/Written assignments: Bahar, Halle and Amaris

Reflection - Check-in 3

Challenges:

What has been the hardest part of the project you’ve encountered so far?

The hardest part of the project so far has been preprocessing the data. As our data includes both sign skeletons and text, we had to link each skeleton to its corresponding text using an id within a dictionary. In addition, our data differed from the data that the paper used in that it split up the skeleton into the face, right hand, left hand, and body which meant we had to figure out a way to compose it into one skeleton while minimizing the loss of information from reducing the dimensionality of the data and ensuring compatibility with the rest of our code.

In addition, building the transformer model was also difficult as we built a multi-headed attention transformer from scratch instead of using the tensorflow implementation. We had to do this as we need a modified transformer, the progressive transformer, where the decoder predicts both a sequence of skeleton outputs and the counter value for each skeleton.

Lastly, one of the greatest challenges that we have encountered has been understanding the extensive amount of code that the authors from our chosen paper have created. They have over 20 different classes of code, much of which uses techniques which we have never encountered before such as the counter decoding technique, many data augmentation techniques and GPU optimization.

Insights:

Are there any concrete results you can show at this point? At this point, we are confident that our data processing and our transformer model is set up appropriately. However, we have been running into various bugs which keep us from being able to train the model. Thus, we do not have metrics that we can show so far.

How is your model performing compared with expectations? We are not able to report on the performance of our model as we are yet to train the model. Once we have fixed all the bugs and are able to train we will report on model performance.

Plan:

Are you on track with your project? Despite the sheer complexity and length of our code, we are optimistic that we will be able to meet the goals we set for this project by the end of next week. We expected this project to be challenging from the beginning as we purposefully chose a task that would push us, and would let us apply the most important concepts covered in class.

What do you need to dedicate more time to? Since our data processing and our model are implemented, we need to dedicate most of our time to fixing bugs. Part of this involves splitting up the workload to devise more rigorous tests in order to identify what function(s) is causing the current bugs.

What are you thinking of changing, if anything? At the moment, we have implemented many data augmentation techniques and optimizations (such as attempting to use GPU for parallel training) similar to what the paper had done in the hopes that our model would run correctly and efficiently. However, this is part of the source of our bugs, and we are currently stepping back on the optimizations and data augmentation techniques, ensuring that we have a working base model and then building upon it.