From Short Context to Long Passages: Generalizing QA Models to Answer Standardized Test Reading Comprehension Questions
Who
dwu44 / Dustin Wu
mwang102 / Milanca Wang
Github Repo
Final Reflection
Checkpoint 2 Reflection
our reflection writeup for checkpoint 2
Introduction
What problem are you trying to solve and why?
Current state-of-the-art reading comprehension models typically operate on short contexts (reading passages of limited length) due to computational constraints. However, for standardized tests like the SAT or ACT, humans must often draw on information from much longer passages to answer test questions. Our goal is to bridge this gap to generalize QA (question-answer) models to answer standardized test reading comprehension questions with much larger contexts.
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.
This is a supervised learning NLP project. Specifically, we aim to develop a question-answering model that does not generate answers but rather picks from one of four possible choices.
Project Goal
Experimental Setup
See Methodology.
Goal 1: Optimal Context Window
We propose to fine-tune short-context QA models by building a supplemental architecture that identifies the “context window” (section of text) in a long passage that is ideal for using to answer a given question. If we are able to do this, then it stands to reason that a pre-trained short-context QA model could be generalized to answer long-passage questions. The two challenges that we anticipate are 1) the relatively small amount of standardized test passages that are openly available for use as training data and 2) figuring out how to generalize the short-context QA models to answer long-passage questions
Subgoal 1: Assembling a long-passage dataset
We plan to either scrape standardized test passages and questions to use as fine-tuning data, or to use them as test data and source our long-passage data from a dataset like the RACE dataset.
Subgoal 2: Developing a context window architecture
By stringing together downsampling and feature extraction layers, like in a CNN, it is possible that we could develop a context window architecture on top of the pre-trained short-context QA models to aid it in answering standardized test questions
Goal 2: Predicting Answers
Our primary goal is to adapt Question-Answering (QA) models that have been trained on datasets like ReClor or OpenBookQA, in which the context is around a paragraph of text, to answer standardized test reading comprehension questions (e.g. SAT, ACT) where the context is a long passage of several paragraphs of text. We will integrate the context window model with these QA models by sending the output of the context window as the data to fine-tune the pre-trained models with.
Subgoal 1: Integration of both architectures
We will have to determine the optimal pipeline to integrate the inputs and outputs of both models.
Subgoal 2: Fine-Tuning Model
The QA models will have to be fine-tuned across various datasets. Our goal is to determine what learnings do or do not generalize.
Related Work
Are you aware of any, or is there any prior work that you drew on to do your project?
In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”--if you stumble across a new implementation later down the line, add it to this list.
- Paper that used image and natural language processing to solve SAT geometry problems. Achieved a 49% score on official SAT geometry questions. http://geometry.allenai.org/assets/emnlp2015.pdf
- GPT-3 Reading Comprehension: https://arxiv.org/pdf/2005.14165.pdf
- ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning: https://openreview.net/forum?id=HJgJtT4tvB
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering: https://aclanthology.org/D18-1260.pdf
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://huggingface.co/transformers/model_doc/bert.html (BertForMultipleChoice: https://huggingface.co/transformers/model_doc/bert.html#bertformultiplechoice)
- LIAMF Fine-tuned Roberta: https://huggingface.co/LIAMF-USP/roberta-large-finetuned-race
- Constructing Transformers For Longer Sequences with Sparse Attention Methods: https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html
- RACE: Large-scale ReAding Comprehension Dataset From Examinations: https://arxiv.org/abs/1704.04683
Data
What data are you using (if any)?
For our data, we plan to use SAT and ACT reading comprehension questions, which are freely accessible from practice test pdfs that we can download and process into text. While we may use the obtained data as the sole finetuning data, we are also considering using another dataset with long-passage questions as the fine tuning dataset, and then the SAT and ACT reading comprehension questions as a final test dataset to evaluate our model’s performance.
How big is it? Will you need to do significant preprocessing?
Since the size of each individual SAT reading comprehension passage is pretty big, one of our goals is to build a feature extraction model that can be used to process those passages into reasonably-sized contexts. This will be used as the data for the actual QA-prediction model.
Methodology
What is the architecture of your model?
As Transformers are the state of the art for natural language processing tasks, we are inclined to use one or more Transformers as the core of the model. We have also thought about delegating a supplementary model for classifying SAT questions into certain categories, and then choosing an expert model to ask for an answer from based on the category. In addition, since many SAT questions are based off of certain sections of text, it may help to build an additional model that learns how to map question hints like (Lines 33-47) to a left and right index on the tokens to select for the transformer.
How are you training the model?
We will use pre-trained models (ReClor, GPT-3, OpenBookQA, and/or BERT) that have a basic idea of reading comprehension and attempt to fine-tune them on other datasets, like SAT/ACT reading comprehension questions, or within each other like OpenBookQA with ReClor or vice-versa.
Justify your design. Also note some backup ideas you may have to experiment with if you run into issues.
This design is meant to observe how generalizable reading comprehension problems are across different levels. Since transformers are data hungry, using a nonspecific, pre-trained model can be helpful for establishing baseline behavior. Fine-tuning on more complex windows will yield information on how similar cross-level standardized tests are, or not!
The main challenge we anticipate will be correctly classifying the optimal window size for long passages, so our backup plan is to use data with shorter contexts and compare them exclusively with each other (ex: ReClor, which sources mainly from the GRE and LSAT; and OpenBookQA, which sources from elementary science quizzes).
Metrics
What constitutes “success?”
Accuracy (namely the number of correctly answered questions divided by the total number of questions) seems to be an applicable metric for our model.
What experiments do you plan to run?
We plan to evaluate our model’s performance on standardized testing questions from a variety of tests, namely the SAT and the ACT but perhaps also the GRE and other tests.
Accuracy Goals
- [ x ] Base: > 25%
- [ x ] Target: > 33%
- [ ] Stretch: > 50%
Ethics
Why is Deep Learning a good approach to this problem?
Standardized testing is fairly important in the current model of the US education system. It is important to critically evaluate the purpose and efficacy of these important assessments, and using deep learning models to back out reading comprehension from a purely mechanical viewpoint can be valuable towards judging the fairness or complexity of what it takes to “do well”. The goal is not to “solve” reading comprehension with pre-trained, state-of-the-art deep learning models. Instead, our hope is to use deep learning as an evaluation tool to quantify and qualify what exactly standard assessments aim to test with regards to generalization and reading comprehension.
How are you planning to quantify or measure error or success? What implications does your quantification have?
Error and success will be measured by the accuracy score of the model on real tests questions and answers: a straight percentage of (num_correct / num_questions) is how the model’s accuracy will be calculated. This way, it will be simple to easily relate the model’s performance to human performance on the same assessment.
Division of labor
Dustin: Data collection, fine-tuning model, context window architecture
Milanca: Integrating pre-trained models, context window architecture

Log in or sign up for Devpost to join the conversation.