Reading Comprehension using Language Models

People

Ruihan Huang / Hao Wen / Bingbin Zhou / Enyan Zhang

Check in #2

linked here

Final Results/Writeup/Reflection

linked here

Introduction

Yes, yes, language models can talk to you and do translations, but what if we test them using human metrics? Say, we feed it multiple-choice questions? We find this task interesting because:

Multiple-choice questions can often have options designed to mislead humans
The question often asks for information not directly presented but rather suggested, implied, or requiring summarization.
This is how we test if humans can read! We don't mask up works and compute perplexity/recall/whatever.

We chose this project to give ourselves a chance to reproduce some research in the rapidly growing field of NLP, and also to give ourselves a sense of how these large large (millions and billions of parameters!) models work. We are motivated by the potential outcome that a model can relief future generations of middle school students from their English homework - not really, but you get the idea.

Dual Co-Matching Network for Multi-choice Reading Comprehension

Our main goal is to reproduce the core of this paper. The paper introduces a model based on attention between embedding sequences. For a multiple-choice question with a provided passage, a question, and a list of options, it computes bi-directional attention between each pair of the three inputs, and uses these computed attentions to classify the output label. The paper achieves 67% accuracy when using BERT as the initial embedding model.

Other References:

Teaching Machines to Read and Comprehend
Combining Local Convolution with Global Self-Attention for Reading Comprehension
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding:
- https://huggingface.co/transformers/model_doc/bert.html
- https://huggingface.co/transformers/model_doc/bert.html#bertformultiplechoice
BERT on Tensorflow:
- https://tfhub.dev/google/collections/bert/1
- https://www.tensorflow.org/text/tutorials/classify_text_with_bert

Data

This task is based on the RACE dataset. It is a dataset extracted from Chinese middle school and high school English reading comprehension problems - something most of our group members are familiar with. There are a total of 28,000 passages and 97,000 associated questions, though we have decided only to use a subset of it to speed up training.

With the tears from the past in mind, we increasingly look forward to emancipating future generations of Chinese middle schoolers from these problems.

Methodology

Model Architecture

Our model uses the same pre-trained BERT model to convert the 3 inputs (passage, question, options) into embeddings. We then implement the attention architecture between each set of embeddings as introduced by the paper.

Training Workflow

We train a small number of batches (10) for various strategies we identified (different model architecture, hyperparameters, BERT variants, etc.) and choose the best-performing model to train more extensively. We train iteratively, 20 epochs at a time, and stop when the model shows signs of convergence.

Challenges

We find the hardest part to be disambiguating from the author's methodology. The paper only provides a high-level overview, which leaves a lot of details open to us. We experimented with many different options and ended up using one that seemed to be better. We cannot be sure if our implementation is the exact replicate of the authors' - we might have made different choices along the way. Fortunately the model worked out okay.

Metrics

We use CatergoricalCrossEntropy as the loss function for training since this is a standard classification task from four categories (namely, A, B, C, and D). We also use CategoricalAccuracy for the accuracy metric, this helps us to know how many problems the model actually got right.

Goals

We would aim to achieve somewhat similar results achieved by the original authors (who got 67% accuracy), but at least want to be significantly better than chance. Though 30% or so can convince us that our models have already learned something, we feel that 50% is a better base goal considering we're using a BERT model for embeddings. This means an accuracy of:

~~base: 50%~~
~~target: 60%~~
stretch 67% range.

We ended up getting an accuracy of 0.6138 on a test set of 100 passages (~300 questions). We consider this to be fairly good as we're only using a distilled BERT that is one-fifth the size of the base one, and have not trained too extensively.

Check the code out!

Here's our github repo

Ethics

What broader societal issues are relevant to your chosen problem space?

One important application of machine reading comprehension models (MRC) and Question-Answering models (QA) lies in search engines. For example, when you google “why does water have a higher density than ice,” the first result is an excerpt of texts that the algorithm believes answer this question. For this task, a model has to have similar capabilities of extracting useful information that may not be explicitly presented in the original text, like those required by a reading comprehension problem. Current language models are shown to be bad at reasoning and understanding prompts. The lacking of this skill can be disastrous, as it leads to models "hallucinating", giving answers that seem reasonable but are completely wrong (this, also, has been extensively documented, this is one example from a current state-of-the-art model). Reading comprehension problems are a good first step for testing such abilities.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

If Language Models are to become what optimistic researchers envision it to be. Be it a "google replacement", a "general AI", or anything else, it involves helping ordinary people make day-to-day decisions. Current google search algorithms extract information and highlight it in search results, meaning that it does not get more wrong than its source information, which are written by human. Language models have a much larger problem: it generates information, and the generated information can be completely unfounded and misleading. This is why we think demonstrating the understanding of a language model is extremely important.