Text Simplifier

Poster

Text Simplifier

Who

dtruong7, sdiwan2, wsun28, eholtz1

Final writeup

Project video

2nd checkpoint reflection

1st checkpoint reflection

Initial Proposal

Introduction

Reading and quickly extracting useful information from dense passages of text can often be difficult, especially for those learning a new language; idiomatic sentence structures and uncommon vocabulary can pose barriers to understanding text. Using translation apps to overcome this bypasses the need to understand the target language.

Instead, we believe that having a tool that simplifies English input text by varying levels allows English-language learners to flexibly comprehend essential ideas in texts that may lie beyond their reading level, encouraging further reading and building confidence in the target language. We employ supervised learning using a (recurrent) seq2seq transformer model that learns to convert input text of an arbitrary reading level to that of relatively lower level(s).

Related Works

1) https://aclanthology.org/Q15-1021.pdf: One of the motivations for our work is this paper, which discusses the prior standard dataset of text simplification, Simple Wikipedia, and what flaws it has in its usage. The paper discusses how the Simple Wikipedia suffers from key flaws that make text simplification more difficult, such as that a significant portion of the sentences are not a one-to-one semantic match, and that often, Simple Wikipedia does not actually simplify the original text’s meaning so much as change the wording. The conclusion of this paper is that the Newsela dataset solves these issues in some amount, and serves as a better standard dataset for the text simplification task. 2) https://arxiv.org/pdf/2005.02324v4.pdf: This is one of the most state of the art text simplification architectures, which goes over a method of aligning the dataset and running this through transformer-based architectures. The paper proposes a novel way to calculate alignments for sentences in the Newsela dataset (which we may choose to implement for the data scraping part of the project), and then applies this to 3 transformer-based models to test the effectiveness of this alignment. The paper proves both that these sentence alignments give improvements over other prior sentence alignment methods, and that under these sentence alignments, these transformer-based models give F1 scores in the high 80s. Thus, this proves that transformer-based models are feasible, although our project extends these models to make them generalizable using recurrency.

Data

The dataset we are using is the Newsela dataset, a standard dataset for the text simplification task. The Newsela dataset contains a set of texts, labeled by the lexile level that the text is at. The higher the lexile level, the more complex the text, and the text can be at one of 5 lexile levels: 620L, 880L, 1040L, 1210L, and MAX. This dataset is perfect for our task, as we have a consistent 5 levels of lexile difficulty, making it simple for us to batch the data such that our model is expected to predict a lexile level lower than the input text. In addition, Newsela lexile simplification is such that it simplifies the idea sentence for sentence, keeping the ideas in the same order and keeping the corresponding sentences’ meaning the same between lexile levels.

Methodology

We are using a transformer with multi-headed attention that takes in as input a sequence of text and outputs a relatively simplified sequence containing the same essential meaning. We will train the model to simplify relatively by using training inputs of Lexile reading level n and labels of level (n-1) for every epoch. By using the same model parameters across difficulty pairs, the model should learn to slightly simplify input sequences of arbitrary complexity.

In order to simplify input text by multiple degrees (as determined by a user-controlled slider), we feed the output (or ‘state’) of one forward pass back into the transformer recursively multiple times.

We will test the model using a random sample of 20% of the entire dataset.

Metrics

Our first step wil be to construct a single transformer architecture that should ideally learn not the different lexile levels as if they were two different languages, but instead a relative lexile complexity drop of 1 lexile level as defined by Newsella article curators.

We will train the model on the 4 upper lexile levels provided by Newsella through a multi-headed transformer architecture. To decrease the parplexity, we will attempt to feed in the next lower lexile level text as input to the next transformer block akin to how training occurs in RNNs with word tokens.

If this has a very high perplexity, say, over 1000, we will attempt to retrain the model architecture using different numbers of blocks.

For the transformer that we create, we aim for:

Base goal = perplexity < 100.
Target goal = perplexity < 50
Stretch goal = perplexity < 10 We will not make use of the notion of accuracy, but rather this perplexity as well as an accuracy per symbol as determined from the unpadded tokens within our transformer input text to determine how well our model is training and performing over validation/test sets.

Another way that we hope to assess our model is by checking whether the hyperparameters we have chosen are as optimal as can be. We will do this by visualizing the perplexities and accuracy per symbol over multiple hyperparametrized values, and visualize this using different traces. We will also attempt to inspect tensor attention heatmaps to see how they compare between the 5 transformer architectures and use this to determine if any tweaks in the hyperparameters between the models is needed or has a pattern.

Ethics

Our dataset consists of Newsela articles written at different lexile levels. It is not representative of every text, as these articles are almost exclusively a specific variety of non-fiction. The collection process is simple webscraping, so it should not present an issue. However, the dataset itself can have hidden biases which lead to wrongful extrapolations and convey incorrect implications. For instance, if most Newsela articles which mention programmers predominantly mention male ones, the simplifier may assume that it is appropriate to use a male pronoun to refer to the programmer in a case where their gender is ambiguous. Similarly, the simplifier may interchange ‘conservative’ and ‘Republican’ due to the high number of news articles written about conservative Republicans. However, there is a contingent of moderate Republicans who even started a political movement to oust Trump in the 2020 election: the words ‘conservative’ and ‘Republican’, while often used in the same context, are not necessarily referring to the same group of people. Mistakes of this kind can cause ESL learners to make false assumptions about English and learn the language incorrectly in ways subtle enough that they may not be able to catch the issue.

In terms of other use cases, teachers may use this model to procure ability-appropriate material for their students. In the classroom, past elementary school, (American, and perhaps some other) teachers are often forced to hand out the same exact materials to students for the sake of meeting learning objectives in a particular subject. This leads to some students struggling on certain readings and others being given readings which fail to challenge them at all. This is especially the case in classrooms where students are not tracked by ability. Teachers must be able to cope with the ability disparity in their classes to provide an adequate challenge for every student. Our text simplifier ensures that teachers can choose one text and scale its difficulty to customize it for each student, allowing a class of students with diverse learning rates and backgrounds to engage with the same content at the level appropriate for them.

One potential broader (American) societal problem with this, however, is that studies show that there is a trend in education of underestimating non-Asian minority students, and a teacher could fail to challenge such a student appropriately. On the flip side, due to the “model minority” stereotype, a teacher could hand out a text that is too difficult for a particular Asian student.

Division of Labor

Data preprocessing + web scraping: Will
Simplifier Model (Recurrent Transformer): Everyone - will divide equally
Training/Testing (batching, windowing, clipping, feeding appropriate data into model) & fine-tuning (i.e. assignment.py): Anh
Front-end/visualization (e.g. perplexity/acc per symbol graphs, transformer attention/hyperparam difference visualizations): Siddharth