Final reflection/write-up: https://docs.google.com/document/d/18-awQqp3TYK-FofDp1p43BRvtaeFVpy-kXB_i2CUgwg/edit?usp=sharing

Link to poster: https://docs.google.com/presentation/d/1lZZBEAotcSiVqRKP5SRZloZEfmF94DQoeIacL86cJVk/edit?usp=sharing

Check-in #3 reflection

Introduction

The objective of the original paper is to train a scaled-down RoBERTa-based model on input that is both quantitatively and qualitatively comparable to what a six-year-old English-speaking child hears during language acquisition, and then to probe its grammatical knowledge to examine to what extent grammatical structures can be learned without innate knowledge.

We chose this paper for two reasons. First, given how humans learn their native languages, we believe that training language models on child-directed language is a reasonable and novel approach to language learning in the NLP context, and that it will demonstrate how training data affects model performance. Second, we believe that developing a scaled-down language model that requires fewer resources to train helps promote more efficient and more sustainable NLP practices in the long run. The problem the paper focuses on lies within the domain of natural language processing. The model's hyperparameters are tuned on a masked word prediction task, but the ultimate goal is to examine whether the trained model can correctly classify sentences as grammatically acceptable or not.

Related Work

Previous research has investigated how the size of the input dataset influences language models’ grammatical knowledge. Warstadt et al. (2020) showed that, given a small amount of training data, language models tend to learn only superficial linguistic features rather than making real generalizations. Zhang et al. (2021) trained a set of RoBERTa models and found that they require about 10M to 100M words to represent the theoretically attested syntactic and semantic features they tested.

Data

We will use the raw data files posted on the authors’ GitHub repository and perform the pre-processing ourselves. The child language data is ~5M words (about 0.02 GB). The authors describe their pre-processing pipeline as follows (a rough sketch of these steps appears after the list):

  1. Raw text data is used as the starting point.
  2. Sentences are separated and those that are too short or too long are excluded.
  3. Multiple sentences may be combined (but the default is 1) into a single sequence.
  4. Each sequence is sub-word tokenized with a custom-trained BBPE Tokenizer from the tokenizers library.
  5. Multiple sequences are batched together (default is 16).
  6. Each batch of sequences is input to a custom trained tokenizers Byte-Level BPE Tokenizer, which produces output compatible with the forward() method of BabyBERTa.
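
To make steps 2–5 concrete, here is a minimal sketch using the Hugging Face tokenizers library. The file path, length thresholds, vocabulary size, and special tokens below are our own placeholder assumptions, not the authors' actual settings.

    from tokenizers import ByteLevelBPETokenizer

    # Hypothetical path and thresholds -- not the authors' actual settings.
    CORPUS_FILE = "childes_raw.txt"
    MIN_LEN, MAX_LEN = 3, 128   # sentence length bounds (in whitespace tokens)
    BATCH_SIZE = 16             # default number of sequences per batch

    # Step 2: separate sentences and drop those that are too short or too long.
    with open(CORPUS_FILE, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    sentences = [s for s in sentences if MIN_LEN <= len(s.split()) <= MAX_LEN]

    # Step 4: train a Byte-Level BPE tokenizer on the filtered text.
    tokenizer = ByteLevelBPETokenizer(lowercase=True)
    tokenizer.train_from_iterator(
        sentences, vocab_size=8192, special_tokens=["<pad>", "<unk>", "<mask>"])

    # Steps 3 and 5: one sentence per sequence (the default), batched 16 at a time.
    batches = [sentences[i:i + BATCH_SIZE] for i in range(0, len(sentences), BATCH_SIZE)]
    encoded = [tokenizer.encode_batch(batch) for batch in batches]
    print(encoded[0][0].tokens[:10])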

We may also consider using other child language corpora if time permits. CHILDES (https://childes.talkbank.org/access/) is a database of child-directed language transcripts.

In addition to CHILDES data, the authors used Wikipedia data (provided in their GitHub repo) and Newsela data (which has to be requested at https://newsela.com/data/). Both of them need to go through the pre-processing pipeline described above.

Methodology

Our model will be a transformer with multi-headed self-attention, and we will train it as a masked language model: we will feed the model sentences in which some words are “masked” out and have it predict the masked words. To do so, we first randomly select a proportion of the input words as prediction targets. We then decide whether to actually mask every selected word. For example, we may replace 90 percent of the selected words with a [mask] token and leave the other 10 percent unchanged, or we may replace all of the selected words with [mask]. The model then reads the whole sentence and predicts each selected word (whether it is masked or not) from the information in all unmasked words: if the selected word is masked, the model predicts it from all the other words; if it is not masked, the model predicts it from the whole sentence. Like the transformers we covered in class, the model makes its predictions using self-attention, after which we compute the loss and back-propagate.
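
A minimal sketch of this masking step in PyTorch is given below; the 15 percent selection rate, the 90/10 split, and the mask-token id are illustrative assumptions rather than our final settings.

    import torch

    def mask_tokens(input_ids, mask_token_id, select_prob=0.15, mask_prob=0.9):
        """Randomly select tokens to predict, replacing most of them with [mask].

        input_ids: LongTensor of shape (batch_size, seq_len).
        Returns corrupted inputs and labels (-100 marks unselected positions,
        which are ignored by the loss).
        """
        labels = input_ids.clone()
        # Choose which positions the model must predict.
        selected = torch.rand(input_ids.shape) < select_prob
        labels[~selected] = -100
        # Of the selected positions, replace e.g. 90% with [mask]; keep 10% unchanged.
        replaced = selected & (torch.rand(input_ids.shape) < mask_prob)
        corrupted = input_ids.clone()
        corrupted[replaced] = mask_token_id
        return corrupted, labels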

The hardest part of implementing the model may be organizing its overall structure. Besides the attention and feed-forward layers, we also need to incorporate “add and normalize” (residual connection plus layer normalization) layers, so wiring the blocks together correctly may be tricky. In addition, we believe the data pre-processing can be difficult, since we may need multiple corpora and will have to keep those corpora “balanced”.
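
For concreteness, here is a rough sketch of how one encoder block might organize the attention, feed-forward, and “add and normalize” layers in PyTorch; the hidden size, number of heads, and feed-forward size are placeholders, not our final configuration.

    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """One transformer encoder block: self-attention and a feed-forward
        network, each followed by a residual connection and layer norm."""

        def __init__(self, hidden_size=256, num_heads=8, ff_size=1024, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                              dropout=dropout, batch_first=True)
            self.norm1 = nn.LayerNorm(hidden_size)
            self.ff = nn.Sequential(nn.Linear(hidden_size, ff_size), nn.GELU(),
                                    nn.Linear(ff_size, hidden_size))
            self.norm2 = nn.LayerNorm(hidden_size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, padding_mask=None):
            attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
            x = self.norm1(x + self.dropout(attn_out))    # add & normalize
            x = self.norm2(x + self.dropout(self.ff(x)))  # add & normalize
            return x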

Metrics

We will use accuracy to evaluate our model. Our base goal is to use the “holistic metric” to measure accuracy, and the stretch goal involves also using other metrics. Essentially, the holistic metric presents the model with multiple-choice problems: the model reads groups of sentences that differ from one another only by a word or a short phrase, and only one sentence in each group is grammatically acceptable. We then count how often the model chooses the grammatically correct sentence and use that accuracy as the metric. Note that we will use the test suite (the “multiple-choice problems”) provided by the authors of the BabyBERTa paper. The stretch goal involves other metrics, such as MLM scoring: “each candidate sentence is input to a masked language model multiple times, each time with a mask in a different position. The score is the sum of the log-loss computed at each masked position in the sentence” (Huebner, 2021, p. 631).
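
To illustrate the stretch metric, here is a rough sketch of MLM scoring that masks each position in turn and sums the log-loss of the true token; model and tokenizer stand in for our eventual BabyBERTa-style masked language model and its tokenizer, so the exact calls are assumptions.

    import torch
    import torch.nn.functional as F

    def mlm_score(sentence, model, tokenizer, mask_token_id):
        """Sum of log-losses over the sentence, masking one position at a time
        (lower is better). Assumes model(input_ids) returns an object with .logits."""
        ids = torch.tensor([tokenizer.encode(sentence).ids])      # shape (1, seq_len)
        total_log_loss = 0.0
        for pos in range(ids.size(1)):
            masked = ids.clone()
            masked[0, pos] = mask_token_id
            with torch.no_grad():
                logits = model(masked).logits                     # (1, seq_len, vocab_size)
            log_probs = F.log_softmax(logits[0, pos], dim=-1)
            total_log_loss += -log_probs[ids[0, pos]].item()
        return total_log_loss

    # Forced choice on a minimal pair: pick the sentence with the lower score.
    # accepted = min(pair, key=lambda s: mlm_score(s, model, tokenizer, mask_id))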

For our base goal, we will train our model on the CHILDES data provided by the authors of the BabyBERTa paper. For the target goal, we will obtain child language data from other sources and train our model on that data as well. For the stretch goal, we will also obtain adult language data (for example, Wikipedia or Twitter), train our model on it, and compare the model’s performance across the different datasets.

In addition, as mentioned in the Methodology section, we may have the model mask all of the selected words or only part of them. We will mask all of the selected words for our base goal and try other masking strategies for the stretch goal.

The authors of the BabyBERTa paper want to determine whether a relatively simple deep learning model can learn the grammatical knowledge of English from relatively little data. They are also interested in whether the context of the data (i.e., adult versus child language) affects the model’s performance. These are our goals as well.

Ethics

What broader societal issues are relevant to your chosen problem space?

Our problem is related to the broader field of child language acquisition. This opens up the question of learning environments, since language acquisition takes place in the household, in a child’s immediate surroundings, and at school. As a result, it involves socio-economic and cultural factors such as familial structure, income, and ethnic and racial background. Since BabyBERTa attempts to imitate child language acquisition, taking all these factors into account is essential.

Why is Deep Learning a good approach to this problem?

Deep learning is a good approach to this problem because it involves natural language processing. Human language is recursive, with sentences and words following a set of grammatical rules. Child language researchers have long studied whether human languages can be learned exclusively from environmental input or whether some form of innate knowledge is hardwired in our brains. Deep learning gives us a tool to examine this learnability problem, since language models appear to learn linguistic features purely from data, unlike rule-based approaches.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

The dataset includes CHILDES American English corpora, which were collected by researchers from children and their caregivers. We have limited access to information about how the data were collected and have no way to verify that the children and caregivers consented to making their data public. At the same time, we recognize that the model is trained exclusively on American English data, which is not a representative sample of English dialects. Nor is there a particular reason to focus on English rather than other world languages. This reflects the existing lack of diversity in language-related research, which tends to marginalize less-studied languages.

Division of labor

Our goal is to work as a team on most aspects of the project, and we plan to collaborate on each part. That said, we want to stay organized by splitting up our tasks as follows:

  • Yanwan: data pre-processing, testing (the part related to linguistics domain knowledge), writing
  • Mingxuan: model implementation, data pre-processing
  • Grâce: model implementation, writing
