N-ELL-P

Title: N - ELL - P (Natural Language Processing, but applied to English Language Learners)

Who: Alan Chen, Patrick Peng, Ivan Lee, Michael Lu (achen258, ppeng4, ilee52, mlu54)

Introduction

We will tackle the Kaggle “Feedback - English Language Learning” competition. Given an essay, our model will rate it on a scale from 1 to 5 on each of six measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The goal is to relieve the grading burden on K-12 English teachers. We plan to use embeddings, transformers, and possibly LSTMs and other techniques to achieve this. We hope to incorporate what we, as English speakers, understand about cohesion, syntax, etc. into our model.

This is a supervised learning problem, since our existing dataset comes with scores for each essay. Even though there is a finite number of possible labels, we will treat the task as a regression problem and round the model's output to the nearest valid score to produce a prediction. We will also experiment with approaching it as a classification problem.
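As a minimal sketch of the rounding step, the function below clips a raw regression output to the score range and snaps it to the nearest allowed value. The function name and the step size are assumptions for illustration: the 1 to 5 range comes from the task description, while the default step of 0.5 is a guess that should be checked against the labels actually present in the dataset.

```python
import math


def snap_to_scale(pred, low=1.0, high=5.0, step=0.5):
    """Clip a raw regression output to [low, high] and snap it to the score grid.

    `step` is an assumed spacing between valid scores, not a confirmed value.
    """
    clipped = min(max(pred, low), high)
    # floor(x + 0.5) rounds halves up, avoiding Python's round-half-to-even.
    steps = math.floor((clipped - low) / step + 0.5)
    return low + steps * step


raw_outputs = [0.2, 2.74, 3.26, 7.9]
rounded = [snap_to_scale(p) for p in raw_outputs]
```

Clipping before snapping keeps out-of-range outputs from producing scores the rubric cannot contain.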

This study evaluates deep learning models on tasks similar to ours (scoring essays). Additionally, it provides different metrics for evaluating different aspects of writing, along with procedures for determining which of those metrics contribute to the final score.

https://arxiv.org/abs/1907.11692

This study expands on BERT, applying novel techniques for improving performance. Many language processing tasks use embeddings from similar models to assist in their work.

Public implementations:

https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md

Data

We will use a dataset obtained from Kaggle. The dataset contains roughly 2700 essays, with a raw size of 9.3 MB of compressed text. We will do some preprocessing, but it will likely be limited to dividing the text into the respective essays and making small tweaks to make the models more trainable, such as adding s.
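The step of dividing the text into essays could look roughly like the sketch below, which assumes the data arrives as a CSV file with one essay per row. The column names (`text_id`, `full_text`, and one column per score) and the whitespace cleanup are assumptions for illustration, and the two sample rows are made up rather than taken from the dataset.

```python
import csv
import io

# Assumed score columns, one per measure named in the introduction.
SCORE_COLUMNS = ["cohesion", "syntax", "vocabulary",
                 "phraseology", "grammar", "conventions"]


def load_essays(handle, text_column="full_text", id_column="text_id"):
    """Return one (essay_id, text, scores) record per CSV row."""
    records = []
    for row in csv.DictReader(handle):
        # Collapse runs of whitespace, including newlines inside an essay.
        text = " ".join(row[text_column].split())
        scores = {name: float(row[name]) for name in SCORE_COLUMNS}
        records.append((row[id_column], text, scores))
    return records


# Made-up sample standing in for the real file.
SAMPLE = io.StringIO(
    "text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions\n"
    'demo1,"I think school\n\nis good.",3.0,3.5,3.0,3.0,4.0,3.0\n'
    'demo2,"We  learn many things.",2.5,2.5,3.0,2.0,2.0,2.5\n'
)
records = load_essays(SAMPLE)
```

Collapsing whitespace discards paragraph breaks, which may matter for measures such as cohesion, so that tweak is worth testing rather than assuming.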

Methodology

As mentioned above, we will experiment with framing the problem as classification versus regression. We will first embed each essay sentence by sentence using a BERT model. We will then feed these embeddings to our own architecture, which will involve some sort of LSTM and/or transformer architecture that has positional information about the sentences.
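One possible way to attach positional information to sentence embeddings is sketched below, using a naive regex sentence splitter and the standard sinusoidal positional encoding from the transformer literature. This is an illustration under assumptions, not the planned architecture: the helper names are hypothetical, the splitter is deliberately simple, and zero vectors stand in for real BERT sentence embeddings. An LSTM would see sentence order implicitly, so an encoding like this mainly matters for the transformer variant.

```python
import math
import re


def split_sentences(essay):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", essay.strip())
    return [p for p in parts if p]


def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** ((i - i % 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_sentence_positions(embeddings):
    """Add each sentence's positional encoding to its embedding vector."""
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(embeddings)
    ]


essay = "I like my school. It is big! Do you like it?"
sentences = split_sentences(essay)
# Placeholder zero vectors stand in for real BERT sentence embeddings.
embeddings = [[0.0] * 4 for _ in sentences]
encoded = add_sentence_positions(embeddings)
```

The output keeps one vector per sentence, so it can be fed to a downstream sequence model in essay order.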

We believe this model will be effective because of the power of BERT embeddings and because the architecture should reflect the sequential nature of text. Furthermore, it will capture important information such as the relations between sentences. We plan to use a separate model for each metric, trained separately, because each metric has its own predictive features that the model should pay attention to (according to our inductive bias).

The model will probably not be too difficult to train, as the biggest part of the model (BERT) is pretrained. We therefore expect to be able to run the model locally, since we have GPUs on our laptops. If this proves problematic or too slow, we will contact Brown's CS department to arrange access to GPU time.

Metrics

We plan to run experiments with different architectures and different data processing techniques to see whether certain architectures perform better when predicting certain metrics. For example, we might try processing individual sentences and then feeding them into a model to determine cohesion. Success for us is a low error when compared with real-world data, in the form of our own test set and the test data from Kaggle. Accuracy is therefore important: we would like our predicted scores to be as close to the real scores as possible.
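To make "low error" concrete, the sketch below computes a root mean squared error per metric and then averages across metrics. Choosing this particular measure is an assumption on our part, since the text above does not fix an error function, and the function names and the tiny example scores are made up for illustration.

```python
import math


def rmse(true_scores, predicted_scores):
    """Root mean squared error between two equal-length score lists."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(squared) / len(squared))


def mean_columnwise_rmse(true_by_metric, predicted_by_metric):
    """Return (average RMSE over metrics, dict of per-metric RMSE)."""
    per_metric = {
        name: rmse(true_by_metric[name], predicted_by_metric[name])
        for name in true_by_metric
    }
    return sum(per_metric.values()) / len(per_metric), per_metric


# Made-up scores for two essays on two of the six metrics.
true_scores = {"cohesion": [3.0, 4.0], "grammar": [2.0, 2.0]}
predictions = {"cohesion": [3.0, 3.0], "grammar": [2.5, 1.5]}
overall, per_metric = mean_columnwise_rmse(true_scores, predictions)
```

Returning the per-metric errors alongside the average makes it possible to see which metrics a given architecture predicts well, which is the comparison described above.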

Ethics

Our dataset is provided by Kaggle and is composed of essays written by English Language Learners in the 8th-12th grades. We had some privacy concerns about how the data was collected and labeled, but we assume that these are handled by Kaggle. Additionally, we believe that this data is not representative of students as a whole. It comes from a very specific demographic within the very broad education space, and it may contain biases against the groups more likely to be in ELL programs at such ages. The major stakeholders in this problem are the companies in the industry and the schools where this technology is applied, as it would most likely represent a major profit opportunity for the companies and a major cost saving for the schools. The consequences of our mistakes, however, would fall mainly on the students, which is quite unfortunate. The best way we can address this is probably to create the most accurate model we can and then to create metrics that explain the model's decisions.