Title: Improving Self-Attention Mechanisms for Long Document Classification

Team Members: Cali Rivera (crivera8), Livia Gimenes (lgimenes), Iris Cheng (icheng3)

Team Name: Tkinter Mockingbird

Github Repo

our github repo!

Final Write-up

our final write-up!

Checkpoint 2

our second checkpoint!

Checkpoint 1

our first checkpoint!

Introduction

Transformers, particularly their self-attention mechanism, have revolutionized how sequences of text can be processed for natural language processing tasks. The Transformer architecture solves many of the limitations of traditional RNNs, most notably RNNs’ limited long-term memory. However, Transformers’ very strength, the fact that they are able to maintain near-perfect long-term memory by considering all words in a given sequence at each stage, has also become their weakness. Since the memory requirements of self-attention scale quadratically with the length of the input sequence, a major architectural limitation of Transformer-based models is that they are unable to process sequences longer than a few thousand tokens without exceeding the limits of reasonable computing resources. This prevents Transformers from being used in applications that require the encoding of longer texts like entire books, large images, and other kinds of data that include a very large number of tokens.

In their paper “Longformer: The Long-Document Transformer”, Beltagy et al presented a solution to this limitation. They proposed a novel attention mechanism that allows memory requirements to scale linearly with input sequence length, rather than the quadratic scaling of a traditional Transformer. This mechanism combines a local windowed attention with a task-motivated global attention, and serves as a replacement for the standard self-attention. Their pretrained Longformer model was able to set state-of-the-art results on various long document tasks, easily processing documents with over several thousands tokens.

In this project, we implemented a modified version of this new attention mechanism, namely the sliding window attention pattern, and built a model that employs this mechanism to perform long-document text classification. Particularly, the model was designed to classify truncated book texts by genre. Although this task in particular is relatively trivial, we aimed to use it as an example of the power of this modified attention mechanism in allowing models to work with longer texts. This same mechanism could be applied to numerous more complex tasks, including long document generative tasks such as chapter or book summarization.

Related Work:

As noted in the introduction, we are aiming to reimplement the work detailed by Beltagy et al in “Longformer: The Long-Document Transformer”. Given that this paper already has supporting open-source code that uses Pytorch, we attempted to implement the system in TensorFlow, and also used a unique dataset when attempting to test our model. Furthermore, very late into our work on the project, we discovered a TensorFlow implementation of this same model. It is important to note that this implementation is quite similar to our own implementation, as both aimed to essentially reimplement the same problem in TensorFlow. https://github.com/huggingface/transformers/blob/main/src/transformers/models/longformer/modeling_tf_longformer.py

Data

For our training data, we used a 10,000 Books and their genres Kaggle dataset, a combination of publicly available books from Project Gutenberg and their respective genres scraped from GoodReads. In order to optimize the run time of our model and account for the limited computing resources available to us, we decided to truncate the book data to only include only the first 1000 words of each book.

Methodology

Our overall model architecture follows the details presented by Beltagy et al. Most notably, our model extends the RoBerta architecture, while replacing the traditional attention mechanism with our windowed attention mechanism.

Metrics

*from original proposal Our goal is to generate summaries that not only preserves the most important points of a text, but does so in a way that is grammatically correct, cohesive, and meaningful. In line with Radev et. al (2021), we will be using the ROUGE metric to quantify lexical overlap between n-grams in generated and reference summaries. We recognize that part of assessing our model’s performance would best be captured by a subjective evaluation framework that considers the coherence, meaningfulness, and fluency of the text. Output texts will be selected randomly at the conclusion of each training/validation session. We propose a 15-point scale on which the summary text will be scored on in respect to the aforementioned characteristics: Coherence: Does the summary text follow an ordering analogous to the original text? Meaningfulness: To what degree does the summary text preserve the general idea of the original text and contain intelligible content? Fluency: does the summary text exhibit grammatical and syntactical integrity? Our base goal is to generate summary texts that maintain full fluency while obtaining a ROUGE score of 0.10. Our target goal for our model is to exhibit full fluency and partial meaningfulness with a ROUGE score 0.15. Our stretch goal is producing text that imparts all three characteristics and achieves a ROUGE score of 0.20. Ethics

Why is Deep Learning a good approach to this problem?

*from original proposal Summarizing text takes a lot of labor – someone has to spend time reading the book and then writing a concise summary. Overall, using some kind of machine learning/deep learning technique would allow this task to be less labor intensive and essentially save up resources, still that is not to say that summarizing substitute reading the books. Deep Learning is specifically a good technique because it is able to learn and capture meaning like one would when reading which is something that vanilla machine learning or hard coding rules could not do.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

*from original proposal The stakeholders in this problem are quite broad. Overall, it would be any English speaker who might be reading any of the books. For those stakeholders, a mistake might result in them not wanting to read a book because the summary does not accurately reflect the book or is incomprehensible. Another stakeholder is the author of one of the books and the publishing house of the book. If their books are misrepresented causing people to not read their books, they might lose potential readers and by consequence money. The author might also be misrepresented, making a work seem that it's something that it’s not if the summary is wrong.

Division of labor

Cali Rivera

-Model Architecture

-Visualization of Results

Iris Cheng

-Model Architecture

-Visualization of Results

Livia Gimenes

-Model Architecture

-Data Preprocessing

Built With

  • tensorflow
Share this project:

Updates