FINAL WRITEUP: https://docs.google.com/document/d/1TMRQT9kbuCvAqjX-ckGDgZ0BDu44d18HbgRZ6KIu-QI/edit?usp=sharing
Title: GPT-0.3
Who: Chris Avalos (cavalos1), John Fay (jfay1), John Zhou (jzhou43)
Introduction: We plan to re-implement the latest version of GPT (GPT-3), drawing on the three GPT papers, at a much smaller scale in terms of weights and dataset size, but with the same model architecture.
The main goal of GPT is semi-supervised learning of natural language. OpenAI's larger models show that, at sufficient scale, these models can even carry out simple arithmetic.
GPT also relies on unsupervised pre-training: the model is first trained on a large unlabeled corpus before being adapted to downstream tasks, which significantly improves its performance.
Related Work: Attention Is All You Need – Describes self-attention and multi-headed attention in transformers. https://arxiv.org/pdf/1706.03762.pdf
Generating Long Sequences with Sparse Transformers – Describes sparse transformers, which are used in the network in GPT-3. https://arxiv.org/pdf/1904.10509.pdf
Data: We will scrape Reddit for our training/testing set using the BeautifulSoup Python library. For pre-processing, we will use byte-pair encoding instead of word-based encoding (as in GPT-2). We will also need to batch the sentences per Reddit post, since we want to predict comments for a given post.
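As a rough illustration of the byte-pair encoding step, here is a minimal sketch of learning BPE merge rules on a toy corpus. The function names and the corpus are our own placeholders; GPT-2's actual tokenizer operates on raw bytes and has additional rules we omit here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its merged symbol."""
    merged = " ".join(pair)
    new_symbol = "".join(pair)
    return {w.replace(merged, new_symbol): f for w, f in words.items()}

def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from a whitespace-split corpus."""
    # Start with each word as a sequence of single characters.
    words = Counter(" ".join(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges
```

For example, on the corpus "low low lower" the first two merges learned are ('l', 'o') and then ('lo', 'w').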
Methodology: Taking architecture choices from their respective papers:
GPT-3: Alternating dense and locally banded sparse attention patterns in the layers of the transformer (Sparse transformers)
GPT-2: Byte-pair level encoding (instead of word-based encoding)
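The locally banded attention pattern can be pictured as a boolean mask over query/key positions. This is a sketch only; the window size is an assumed parameter, and the full sparse-transformer scheme also alternates dense layers and strided patterns that we omit here.

```python
import numpy as np

def banded_causal_mask(seq_len, window):
    """Boolean mask where position i may attend to positions j
    with i - window < j <= i (causal and locally banded)."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    return (j <= i) & (j > i - window)
```

Row i of the mask is True only on the last `window` positions up to and including i, so attention cost grows linearly in the window rather than quadratically in the sequence length.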
We will try to predict comments for a given Reddit post, treating the post and its comments as a single stream. We will most likely use the most popular Reddit posts.
The model will be as follows:
We will focus mainly on the entailment setup, where we start with a premise (the original post) and generate a hypothesis (a comment). We hope to expand the model to include the other task formats if we find good sources of data for them.
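The premise/hypothesis stream could be formed roughly as below. The boundary and delimiter tokens here are placeholders of our own; the GPT paper learns dedicated start, delimiter, and extract token embeddings.

```python
def entailment_stream(premise, hypothesis, start="<s>", delim="$", extract="<e>"):
    """Concatenate a premise (Reddit post) and hypothesis (comment)
    into a single token stream, GPT input-transformation style."""
    return [start] + premise.split() + [delim] + hypothesis.split() + [extract]
```

For example, a post "good post" with comment "nice" becomes the stream <s> good post $ nice <e>, which the transformer consumes as one ordered sequence.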
Metrics: We’ll measure the perplexity of our model’s output; perplexity is a more appropriate measure of success than accuracy for a language model. The original GPT performed quite well on language-modeling benchmarks, which we will also try to use: the Stories Cloze Test (8.9% improvement), the RACE test (5.7% improvement), and MultiNLI (1.3% improvement). While we do not yet know how well our model will perform, we are aiming for, at a minimum, a perplexity of 50.
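Perplexity, as we intend to measure it, is the exponential of the average negative log-likelihood the model assigns to the target tokens. A minimal sketch (the function name and inputs are our own):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood, given the
    probability the model assigned to each target token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

Our target of 50 corresponds to the model assigning, on average, probability 1/50 = 0.02 to each correct next token: perplexity([0.02] * 10) evaluates to 50.0.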
Ethics: What broader societal issues are relevant to your chosen problem space? Although it may be weak, this model, if trained well, could generate inflammatory bot messages on Reddit, which is already an issue. What is your dataset? Are there any concerns about how it was collected or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Since we are using Reddit as our dataset, we want to make sure we don’t use any sensitive information (we will try to avoid this by using the most popular Reddit posts, since those usually have sensitive information filtered out).
Division of Labor: We’ll all put equal amounts of work into each aspect of the project, and we’ll stay in contact with each other to make sure that everyone is doing their part. Parts:
Preprocessing/gathering of the data
Making the model
Testing
Poster/Presentation
Written reflection
Links:
GPT-3 = https://arxiv.org/abs/2005.14165
GPT-2 = https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
GPT = https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Sparse transformers = https://arxiv.org/abs/1904.10509
Built With
- tensorflow