FINAL WRITEUP: https://docs.google.com/document/d/1TMRQT9kbuCvAqjX-ckGDgZ0BDu44d18HbgRZ6KIu-QI/edit?usp=sharing

Title: GPT-0.3

Who: Chris Avalos (cavalos1), John Fay (jfay1), John Zhou (jzhou43)

Introduction: We plan to re-implement the latest version of GPT (GPT-3) based on the three GPT papers, on a much smaller scale in terms of weights and datasets, but with the same model architecture.

The main goal of GPT is to carry out semi-supervised learning of natural language. OpenAI's larger models show that, with sufficient scale, these models can even carry out simple math.

GPT also relies on unsupervised pre-training, i.e., training the model on large amounts of unlabeled text before adapting it to specific tasks, which significantly improves the performance of the model.

Related Work: Attention Is All You Need – describes self-attention and multi-headed attention in transformers. https://arxiv.org/pdf/1706.03762.pdf

Generating Long Sequences with Sparse Transformers – describes sparse transformers, which are used in GPT-3's architecture. https://arxiv.org/pdf/1904.10509.pdf

Data: We will scrape Reddit for our training/testing set using the BeautifulSoup Python library (a rough scraping sketch is below). For pre-processing, we will use byte-pair encoding (as in GPT-2) instead of word-based encoding. We will also need to batch the sentences per Reddit post, since we want to predict comments for a given post.
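A minimal sketch of how the scraping step could look, assuming we pull post titles from old.reddit.com's server-rendered HTML; the URL, CSS selector, and user-agent string here are placeholders and will need adjusting to whatever page structure we actually target:

```python
# Minimal scraping sketch (assumptions: old.reddit.com serves static HTML and
# post titles live in <a> tags with class "title"; both may need adjusting).
import requests
from bs4 import BeautifulSoup

def fetch_post_titles(subreddit="AskReddit", limit=25):
    url = f"https://old.reddit.com/r/{subreddit}/top/"
    headers = {"User-Agent": "gpt-0.3-course-project"}  # placeholder UA string
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, "html.parser")
    # "a.title" is an assumed selector for post-title links on old.reddit.com
    titles = [a.get_text() for a in soup.select("a.title")]
    return titles[:limit]

if __name__ == "__main__":
    for title in fetch_post_titles():
        print(title)
```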

Methodology: Taking architecture choices from their respective papers:

GPT-3: Alternating dense and locally banded sparse attention patterns in the layers of the transformer (sparse transformers); see the attention-mask sketch after this list.

GPT-2: Byte-level byte-pair encoding (instead of word-based encoding)
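As a rough sketch of what "locally banded" attention means in practice: each position attends only to a fixed window of previous positions, on top of the usual causal mask. The window size and this NumPy-based mask are illustrative, not values taken from the papers:

```python
import numpy as np

def banded_causal_mask(seq_len, window=4):
    """Boolean mask: position i may attend to position j only if
    j <= i (causal) and i - j < window (local band)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# A dense causal layer would instead use just (j <= i); alternating the two
# mask types across transformer layers gives the GPT-3-style pattern.
mask = banded_causal_mask(8, window=3)
print(mask.astype(int))
```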

We are going to try to predict comments for a given Reddit post, so we will treat each post and its comments as a single stream. We will most likely use the most popular Reddit posts.

The model will be as follows:

We will focus mainly on the entailment section, where we start with a premise (the original post) and generate a hypothesis (a comment). We hope to expand the model to include the other sections if we find good sources of data for them.
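As a sketch of the single-stream idea: the post (premise) and one of its comments (hypothesis) are concatenated with delimiter tokens into one sequence that the language model is trained on left-to-right. The delimiter strings below are placeholders we would add to the tokenizer's vocabulary, not tokens defined in the GPT papers:

```python
# Hypothetical special tokens; any reserved strings added to the vocabulary would work.
POST_SEP = "<|comment|>"
END = "<|endoftext|>"

def build_stream(post_text, comment_text):
    """Join a Reddit post and one of its comments into a single training stream."""
    return f"{post_text} {POST_SEP} {comment_text} {END}"

example = build_stream(
    "What is a small thing that made your day better?",
    "A stranger complimented my dog on the way to work.",
)
print(example)
```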

Metrics: We’ll measure the perplexity of our model’s output; perplexity is a more appropriate measure of success than accuracy for our model. The original GPT performed quite well on language understanding benchmarks, which we will try to use as well: Stories Cloze Test (8.9% improvement), RACE (5.7% improvement), and MultiNLI (1.3% improvement). While we do not know at the moment how well our model will perform, we are aiming for, at a minimum, a perplexity of 50.
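For reference, perplexity is the exponential of the average per-token negative log-likelihood, so a perplexity of 50 roughly means the model is, on average, as uncertain as a uniform choice over 50 tokens. A minimal sketch of computing it from per-token probabilities (the probabilities below are made-up numbers):

```python
import numpy as np

def perplexity(target_token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    nll = -np.log(np.asarray(target_token_probs))
    return float(np.exp(nll.mean()))

# Toy example with made-up model probabilities for five target tokens.
print(perplexity([0.10, 0.02, 0.30, 0.05, 0.08]))
```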

Ethics: What broader societal issues are relevant to your chosen problem space? This program, although it may be weak, could, if trained well, generate inflammatory bot messages on Reddit, which is already an issue.

What is your dataset? Are there any concerns about how it was collected or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? We are going to be using Reddit as our dataset, and we want to make sure we don’t use any sensitive information. We will try to avoid this by using the most popular Reddit posts, since those usually have sensitive information filtered out.

Division of Labor: We’ll all put equal amounts of work into each aspect of the project. We’ll be in contact with each other to make sure that everyone is doing their part.

Parts:

  • Preprocessing/gathering of the data
  • Making the model
  • Testing
  • Poster/Presentation
  • Written reflection

Links:

  • GPT-3: https://arxiv.org/abs/2005.14165
  • GPT-2: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  • GPT: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  • Sparse transformers: https://arxiv.org/abs/1904.10509

Built With

  • tensorflow

Updates



Challenges: Getting the data from Reddit was easy for the most part, but formatting the data has proven to be a challenge. Because the structure is different from the one we used for hw4, parsing through the Reddit posts was a little difficult. We’ll also have to determine a “sentence length” for byte-pair encodings, which we’re not sure how to go about doing as of now. Because our byte-pair encoding chops up our words (a toy illustration of this is below), we’re also worried about how this representation will affect training.
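To illustrate the “chopping” we are worried about, here is a toy byte-pair-encoding style merge loop (Sennrich-style, not GPT-2’s actual byte-level implementation): words start as character sequences and the most frequent adjacent pair is merged repeatedly, so rarer words end up split into several subword pieces.

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Toy BPE: words -> symbol lists; repeatedly merge the most common adjacent pair."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for symbols in corpus:
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                else:
                    i += 1
    return corpus, merges

tokens, merges = bpe_merges(["lowest", "lower", "newest", "widest"], num_merges=6)
print(tokens)   # words remain chopped into subword pieces
print(merges)
```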

Results: We have not started training the model, so we do not have concrete results. So far, we’ve found preprocessing to be somewhat easier than expected.

Plans: Going forward, we want to finish preprocessing by mid-week and begin working on the model by the end of this week or over the weekend.
