Title: Testing the Robustness of the Linformer

Who: Tara Amruthur (tamruthu), Anna Arantes (asantan7), Jesse Edelstein (jedelste), Marie Baker (mbaker6)

Link to our GitHub: https://github.com/mbakersf/cs1470-linformer

Our Poster: https://docs.google.com/presentation/d/1yLy4kq2IO7kXAPftGWzfn5nmue0DzAJroLSAQloFbCg/edit#slide=id.p4

Introduction:

What problem are you trying to solve and why?

Paper we are implementing: "Linformer: Self-Attention with Linear Complexity" by Wang et al. (link: https://arxiv.org/pdf/2006.04768)

Transformer models are a popular deep learning architecture used across many natural language processing (NLP) tasks, such as classification and text translation. However, these models are large and require considerable computational resources and time to train. Wang et al. introduce an alternative architecture, the Linformer, that addresses this problem by modifying the self-attention mechanism: whereas traditional Transformer models compute attention with quadratic complexity, i.e. O(n²), the Linformer operates with linear complexity, significantly reducing computational load and processing time.

This benefit is most apparent when the Linformer is applied to datasets with lengthy sequences. The reduction is achieved by approximating the stochastic matrix formed by self-attention with a low-rank matrix: the original scaled dot-product attention is decomposed into several smaller components through linear projections, which collectively approximate the low-rank structure of the original attention matrix. Though this simplification discards some information, the Linformer still performs as well as standard Transformer models when fine-tuned on sentiment classification problems.
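Concretely, the paper adds two learned projection matrices E_i and F_i (each k×n) in front of the keys and values, so that each attention head becomes

    head_i = softmax((Q W_i^Q)(E_i K W_i^K)^T / sqrt(d_k)) · (F_i V W_i^V)

The softmax is now taken over an n×k matrix rather than n×n, so each head costs O(nk) time and memory, which is linear in n for a fixed projected dimension k.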

Though the original Linformer paper highlights the benefits of using the Linformer on different text-based tasks, mainly its efficiency relative to traditional Transformer models, its classification experiments are limited to movie-review data, particularly the IMDB reviews dataset and the SST-2 dataset (a dataset of one-sentence snippets of movie reviews). The domain-specific nature of these datasets leads us to the following questions: How well does the Linformer perform on data from a wider variety of domains? Are its benefits mostly seen on datasets with longer sequences and more context, as opposed to shorter sequences? And which hyperparameters can be altered to increase the Linformer's performance, and what is the trade-off between efficiency and accuracy?

To answer these questions, we applied the Linformer to the Amazon reviews dataset, a collection of Amazon product reviews and the sentiment associated with each review. In evaluating the Linformer's performance on this dataset, we measured its accuracy in predicting the sentiment, making this a classification problem.

Related Work:

Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.

The article “Making Transformer Networks Simpler and More Efficient” by Facebook AI introduces two mechanisms for improving Transformer efficiency and decreasing complexity: the “adaptive attention span” and the “all-attention layer.” The adaptive attention span aims to increase the Transformer's ability to remember longer sequences without drastically increasing computational cost by applying a masking function that differentiates the attention heads. This lets the data determine each attention head's attention span, rather than fixing one span across all heads. The all-attention layer simplifies the Transformer architecture by merging the self-attention and feed-forward layers into one layer: the authors add vectors that behave as weights to the attention mechanism's keys and values, removing the need for a separate feed-forward layer.

The Linformer, adaptive attention span, and all-attention layer are all methods that approach improving Transformer efficiency in different ways. The Linformer aims to reduce computational complexity by introducing a linear attention mechanism, the adaptive attention span aims to improve efficiency by introducing a method for data to dynamically assign varying attention spans across attention heads, and the all-attention layer aims to decrease model complexity by combining the self-attention and feed-forward layers.

The paper we are implementing also surveys additional methods for addressing Transformer efficiency, including “Mixed Precision,” “Sparse Attention,” and “Knowledge Distillation.”

Link: https://ai.meta.com/blog/making-transformer-networks-simpler-and-more-efficient/

Citation: Sukhbaatar, Sainbayar, and Armand Joulin. “Making Transformer Networks Simpler and More Efficient.” AI at Meta, 23 Aug. 2019, ai.meta.com/blog/making-transformer-networks-simpler-and-more-efficient/.

Public implementations we have found:
  • https://github.com/facebookresearch/fairseq
  • https://github.com/tatp22/linformer-pytorch/blob/master/examples/pretrain_tutorial_lm.py

Data: What data are you using (if any)?

If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?

The dataset used for this analysis is the Amazon polarity dataset, hosted on Hugging Face. It consists of 3.6M entries in the training set and 400K entries in the test set. Each split has three columns: sentiment, review title, and review content. The sentiment is a binary label: 0 for negative or 1 for positive. The review title is a short sequence, sometimes as short as one word, while the review content is a much longer sequence, often consisting of several sentences.

In order to feed this data into the Linformer, a vocabulary must be built to tokenize the text. Additionally, due to the sheer volume of data and our limited computational resources, the data must be trimmed down to around 25K entries in each of the training and test sets. This allows for quicker training, which in turn lets us tune the hyperparameters appropriately. Incorporating more of the data would require access to more GPUs.

Link to our data: https://huggingface.co/datasets/amazon_polarity
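As a sketch of this preprocessing (illustrative, not our exact pipeline), the snippet below loads a 25K-entry slice of the dataset with the Hugging Face datasets library and builds the vocabulary with Keras's TextVectorization layer; VOCAB_SIZE and MAX_LEN are placeholder values:

    import tensorflow as tf
    from datasets import load_dataset

    VOCAB_SIZE = 20_000  # placeholder vocabulary cap
    MAX_LEN = 512        # placeholder maximum sequence length
    N_SAMPLES = 25_000   # trimmed subset size, per the constraints above

    # Load only the first N_SAMPLES training entries.
    train = load_dataset("amazon_polarity", split=f"train[:{N_SAMPLES}]")

    # Build the vocabulary from review content and map each review to token ids.
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN)
    vectorizer.adapt(train["content"])

    tokens = vectorizer(train["content"])          # shape (N_SAMPLES, MAX_LEN)
    labels = tf.convert_to_tensor(train["label"])  # 0 = negative, 1 = positive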

Methodology:

What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

The main architectural difference between the Linformer and the typical Transformer is the linear projection of the key (K) and value (V) matrices in the self-attention mechanism. In a standard Transformer, self-attention involves interactions between all tokens in a sequence, leading to a complexity of O(n²) (where n is the sequence length). The Linformer projects the key and value matrices along the sequence dimension from n×d down to k×d, where k is a smaller dimension than n. This projection reduces the number of interactions required, thereby reducing both the computational load and memory usage. Once the dimensions are reduced, the Linformer computes self-attention using these compressed representations. The mechanism remains similar: the query (Q) takes dot products with the projected keys, and the resulting weights are applied to the projected values, but due to the reduced dimension k, the operations within the self-attention layers are significantly faster and less memory-intensive. This results in an overall complexity reduction from O(n²) to O(n).
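To make this concrete, below is a minimal sketch of single-head Linformer attention; the function name and shapes are illustrative, not our exact implementation. E and F are the learned (k, n) matrices that compress keys and values along the sequence axis:

    import tensorflow as tf

    def linformer_attention(Q, K, V, E, F):
        """Q, K, V: (batch, n, d); E, F: (k, n) learned projections."""
        d = tf.cast(tf.shape(Q)[-1], tf.float32)
        K_proj = tf.einsum("kn,bnd->bkd", E, K)  # compress keys:   (batch, k, d)
        V_proj = tf.einsum("kn,bnd->bkd", F, V)  # compress values: (batch, k, d)
        # Attention scores are now (batch, n, k) instead of (batch, n, n).
        scores = tf.matmul(Q, K_proj, transpose_b=True) / tf.sqrt(d)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, V_proj)        # context vectors: (batch, n, d)

For fixed k, every intermediate tensor here grows linearly with n, which is where the O(n) complexity comes from.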

To investigate the robustness of the Linformer, we will analyze the model with three varying parameters: compression, shared KV, and shared layer KV. Compression in this context refers to the reduction of sequence length via linear projections. Our primary objective here is to assess how different compression ratios affect the Linformer's ability to process longer sequences with reduced computational overhead. We hypothesized that higher compression ratios would lead to faster computation times and lower memory usage, which could potentially come at the cost of model accuracy and the ability to capture finer nuances/more complex relationships in the data.

The shared key-value (KV) flag, on the other hand, indicates whether a single compressed projection matrix is utilized for both keys and values within each layer of the transformer. This design significantly reduces the model size and decreases training times by minimizing the number of unique parameters. However, this configuration might also blur the distinct roles of keys and values in the attention mechanism, which could influence the model's ability to effectively differentiate and process input data, impacting its overall performance. Similarly, the shared layer KV parameter specifies whether the compressed matrix should be shared between keys and values across all layers in the Linformer model. We anticipated that layer-wise sharing would significantly decrease the number of trainable parameters, which would not only enhance memory efficiency but also reduce training time. However, this level of sharing could potentially lead to a loss of flexibility in the model, as each layer would no longer be able to learn layer-specific transformations. Each of these configurations will be tested in isolation so we can clearly understand their individual impacts on the model's performance metrics.
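To make the three sharing regimes concrete, the hypothetical helper below (the flag names are ours, not the paper's) shows which projection matrices are created in each case:

    import tensorflow as tf

    def make_projections(n_layers, n, k, share_kv=False, share_layer_kv=False):
        """Return one (E, F) pair of (k, n) projection matrices per layer."""
        def new_proj():
            return tf.Variable(tf.random.normal((k, n), stddev=n ** -0.5))
        if share_layer_kv:
            E = new_proj()                     # single matrix for K and V in every layer
            return [(E, E)] * n_layers
        pairs = []
        for _ in range(n_layers):
            E = new_proj()
            F = E if share_kv else new_proj()  # optionally share K and V within a layer
            pairs.append((E, F))
        return pairs

With no sharing this creates 2·n_layers matrices of shape (k, n); shared KV halves that, and layer-wise sharing reduces it to a single matrix.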

In addition, we will explore how the model performs with varying input sequence lengths. To do this, we will leverage the fact that the Amazon reviews dataset includes both each review's title and its content: we will run the Linformer with only the titles and with only the content to understand how the amount of context affects the model. We believe the hardest part of implementing this model will be finding the optimal balance between accuracy and efficiency, as we are significantly limited by our computational resources and cannot run the model on the Amazon reviews dataset many times.

Metrics:

What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance. What are your base, target, and stretch goals?

We plan to run experiments comparing the results from the IMDb dataset with the results from the Amazon reviews dataset. Doing so will allow us to test the Linformer on data from another domain, as the original classification experiments used only movie-related data. We believe the Amazon reviews dataset represents a wider range of domains and will let us examine the Linformer's performance more thoroughly. We also plan to compare the Linformer's training time across different compression ratios.

The notion of “accuracy” does apply to our project, as we are observing how well the Linformer performs on classification tasks. In this context, accuracy is the fraction of input reviews whose sentiment the Linformer classifies correctly. We will also examine the tradeoff between accuracy and efficiency that arises when adjusting the Linformer's compression ratio.
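As a toy illustration of this metric (the numbers below are made up, not model outputs):

    import tensorflow as tf

    labels = tf.constant([1, 0, 1, 1])         # toy ground-truth sentiments
    probs = tf.constant([0.9, 0.2, 0.4, 0.7])  # toy predicted P(positive)
    preds = tf.cast(probs > 0.5, labels.dtype)
    accuracy = tf.reduce_mean(tf.cast(preds == labels, tf.float32))
    print(accuracy.numpy())  # 0.75: three of the four toy reviews are correct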

In the paper we are implementing, the authors hoped to find that the Linformer produced comparable accuracy results as the Transformer while operating with reduced complexity. The authors tested the Linformer’s performance on a classification task and a masked-language-modeling task, and their results were quantified by comparing the Linformer’s performance against the standard Transformer, specifically looking at the models’ inference times in relation to input sequence lengths and projected dimension sizes.

Our base goal is to fully implement the Linformer model. Our target goal is to adjust the model’s hyperparameters so that we are able to compare its performance across varying compression levels. Our stretch goal is to improve the model architecture in a way that improves its performance on classification tasks.

Ethics:

What broader societal issues are relevant to your chosen problem space?

Popular Large Language Models (LLMs) like GPT, BART, etc. are only increasing in popularity over time. These models use the Transformer architecture and are trained on very large amounts of text, upwards of several billion words. As such, the computational resources required to train these models and use them for inference are substantial, resulting in higher carbon dioxide emissions from the computing facilities where such tasks are performed. These facilities can also affect water usage, soil pollution, and soil sealing, lowering environmental quality. Given the urgency of the climate crisis, it is important to consider the environmental impacts of Transformer models and look for ways to make training more efficient; models such as the Linformer can limit the resources needed to train a model, resulting in lower carbon dioxide emissions.

Link: https://pubs.acs.org/doi/10.1021/acs.est.3c01106

Citation: Rillig, M. C., Ågerstrand, M., Bi, M., Gould, K. A., & Sauerland, U. (2023). Risks and benefits of large language models for the environment. Environmental Science & Technology, 57(9), 3464-3466.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

We are using the Amazon reviews dataset, which is hosted by Hugging Face. The data is labeled with either a positive or negative sentiment. However, a problem with text data is that emotion is not always clear and can be misconstrued without visual or audio cues, so some of these reviews may be mislabeled. Additionally, the data is probably not representative of most people, as most people will not leave a review on Amazon unless they are extremely passionate about a product. This phenomenon, known as response bias, is a common problem first identified in survey studies, but it applies here as well. As a result, these reviews mostly veer toward extreme positivity or extreme negativity, producing a model that can only adequately identify strong emotions, as opposed to slight positivity or negativity. Other biases in the dataset would arise from the demographics of Amazon's users: for example, if individuals in a certain income bracket use Amazon more than individuals in another bracket, those biases will be reflected in the dataset.

Division of labor:

Briefly outline who will be responsible for which part(s) of the project.

We will collaborate on all aspects of the project but will each take the lead on a specific part for organization and progress-tracking purposes:

  • Tara: Visualization
  • Marie: Data processing/model training
  • Anna: Data processing/model training
  • Jesse: Evaluation

Built With

  • tensorflow