Introduction: We are implementing the recently published Linformer paper, which proposes a refined version of the Transformer model. While Transformers have produced plenty of state-of-the-art results in Natural Language Processing, the authors are concerned about the cost of training and deploying them on long sequences. The original BERT-large model, for example, takes four days to train on 16 Cloud TPUs. With training efficiency and environmental sustainability (carbon emissions in particular) in mind, the authors propose a new attention mechanism that reduces the complexity of self-attention from $$O(n^{2})$$ to $$O(n)$$ in both time and space. Our team shares these concerns, so we are implementing the Linformer model and comparing it against the standard Transformer, focusing on perplexity, to further validate the paper's conclusions.
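
To make the mechanism concrete, here is a short sketch of the idea as we understand it from the paper: standard scaled dot-product attention forms an $$n \times n$$ attention matrix, while Linformer first projects the keys and values along the sequence dimension down to a fixed length $$k \ll n$$, so the attention matrix becomes $$n \times k$$.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \qquad \text{costs } O(n^{2})$$

$$\overline{\mathrm{head}}_i = \mathrm{softmax}\!\left(\frac{QW_i^{Q}\,(E_iKW_i^{K})^{\top}}{\sqrt{d_k}}\right)F_iVW_i^{V}, \qquad E_i, F_i \in \mathbb{R}^{k \times n}, \qquad \text{costs } O(nk)$$

Since $$k$$ is chosen independently of the sequence length $$n$$, the cost of each head grows linearly with $$n$$.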

The Linformer model we are building is structurally identical to the Transformer except for the attention mechanism. Our main application will be language translation. Like the Transformer, the Linformer will be trained in a semi-supervised fashion: an unsupervised pre-training step followed by supervised fine-tuning. We will first check the model's functionality on a mocked dataset, and then apply the Linformer to a self-generated dataset as well as BookCorpus.
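
Since perplexity is the metric we plan to compare the two models on, below is a minimal sketch of how we expect to compute it from the per-token cross-entropy loss. The `model` and `data_loader` names and the logit shape are placeholders and assumptions, not our actual evaluation code.

```python
import math
import torch
import torch.nn.functional as F

def evaluate_perplexity(model, data_loader, pad_id, device="cpu"):
    """Compute corpus perplexity as exp(mean per-token cross-entropy).

    Assumes `model(inputs)` returns logits of shape (batch, seq_len, vocab);
    padding tokens are excluded from the average.
    """
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for inputs, targets in data_loader:            # (batch, seq_len) token ids
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                     # (batch, seq_len, vocab)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=pad_id,
                reduction="sum",
            )
            total_loss += loss.item()
            total_tokens += (targets != pad_id).sum().item()
    return math.exp(total_loss / total_tokens)
```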

Challenges: We currently face two challenges. First, understanding the architecture of the linear transformer model. We have discussed how to implement it and shared ideas about how to train it, including which loss function and optimizer to use, but working out the detailed implementation will still take more time. Second, preprocessing real-world data. We need to bring all sentences to the same length, or something similar, so that we have a consistent rule for fitting the data to our transformer; we are still working out how to preprocess the data so that a reasonable window size can be applied (one option we are considering is sketched below).
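
One simple preprocessing option is to truncate long sentences and right-pad short ones to a fixed window size. The sketch below assumes token ids have already been produced by a tokenizer; `window_size` and `pad_id` are placeholder names, not settled choices.

```python
def pad_or_truncate(token_ids, window_size, pad_id=0):
    """Force every sequence to exactly `window_size` tokens.

    Longer sequences are truncated; shorter ones are right-padded
    with `pad_id` so batches can be stacked into a single tensor.
    """
    if len(token_ids) >= window_size:
        return token_ids[:window_size]
    return token_ids + [pad_id] * (window_size - len(token_ids))

# Example with window_size = 8:
# pad_or_truncate([5, 17, 42], 8)      -> [5, 17, 42, 0, 0, 0, 0, 0]
# pad_or_truncate(list(range(12)), 8)  -> [0, 1, 2, 3, 4, 5, 6, 7]
```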

Insights: We have collected real-world data, a roughly 6 GB BookCorpus dump. We have also completed the main architectural piece of the linear transformer: the linear attention head computation. Whether our model's performance meets the expectations set by the paper will depend largely on how we preprocess the data.
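
For reference, here is a minimal sketch of what such a head can look like. This is our own simplified PyTorch version under the assumptions described in the paper, not the exact code in our repository; the layer names and the projection length `k` are our own choices.

```python
import torch
import torch.nn as nn

class LinearAttentionHead(nn.Module):
    """One Linformer-style attention head.

    Keys and values are projected along the sequence dimension from
    length n down to a fixed length k, so the attention matrix is
    (n x k) instead of (n x n).
    """
    def __init__(self, d_model, d_head, seq_len, k):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.E = nn.Linear(seq_len, k, bias=False)  # projects keys:   n -> k
        self.F = nn.Linear(seq_len, k, bias=False)  # projects values: n -> k
        self.scale = d_head ** -0.5

    def forward(self, x):                             # x: (batch, n, d_model)
        q = self.q(x)                                 # (batch, n, d_head)
        k = self.k(x)                                 # (batch, n, d_head)
        v = self.v(x)                                 # (batch, n, d_head)
        k = self.E(k.transpose(1, 2)).transpose(1, 2) # (batch, k, d_head)
        v = self.F(v.transpose(1, 2)).transpose(1, 2) # (batch, k, d_head)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, n, k)
        return attn @ v                               # (batch, n, d_head)
```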

Plan: We are on track with our project. We expect most of our remaining time to go into preprocessing the data and tuning the model during experiments to match the results reported in the paper.
