Communication with GPT-2

Who

Yuanqi Li (yli322), Yifan Jiang (yjiang79), Weike Dai (wdai8), Zhe Hu (zhu24)

Introduction

From what we have learned in class, we know that GPT-2 can be used to generate conversations. However, there is a significant problem: GPT-2 cannot generate good responses unless we provide sufficient context. When a user's input contains no context, GPT-2 usually produces a vague, generic answer. Our solution is a "pre-processing" step that transforms the input sentences given by human users into more structured and detailed paragraphs, which can enhance the quality of conversation generation if the fine-tuning is done correctly. Our project is an application of transfer learning in NLP.

Related Work

One of the papers introduces a new approach to generative data-driven dialogue systems (e.g., chatbots) called TransferTransfo, which combines a transfer-learning-based training scheme with a high-capacity Transformer model. The authors take a step toward more consistent and relevant data-driven conversational agents by proposing a model architecture, together with associated training and generation algorithms, that significantly improves over traditional seq2seq and information-retrieval baselines in terms of (i) relevance of the answer, (ii) coherence with a predefined personality and dialog history, and (iii) grammaticality and fluency as evaluated by automatic metrics. Link: TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents

Data

We use the TCS dataset, an open-source dataset from Amazon Alexa. It enriches entities by adding knowledge to them, drawn from Wikipedia. We chose this dataset because Alexa focuses heavily on chatbot tasks, and the dataset is comprehensive enough to provide the information we need. We do not yet have a precise figure for how much of the data we will use; we will settle on that once we begin training and testing.

Methodology

The main task is transfer learning: we use a pre-trained GPT-2 model to generate dialogue. After receiving input from a user, we pre-process the input sentences: we use traditional methods to extract keywords from the sentences, then retrieve and assemble relevant information from our system. Afterward, we feed the processed data to GPT-2 and let it generate paragraphs. In short, we aim to find a good way to generate context for a given input, and to fine-tune the GPT-2 model so that it works better on our processed inputs.
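The pre-processing idea above can be sketched as follows: extract keywords from the user's input and prepend retrieved facts before handing the text to GPT-2. This is a minimal illustration rather than our final pipeline; the stop-word list and the `knowledge_lookup` mapping are hypothetical stand-ins for the real retrieval step.

```python
import re
from collections import Counter

# Tiny stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "i", "you", "to", "of",
              "and", "in", "it", "that", "what", "do", "about", "me", "tell"}

def extract_keywords(sentence, top_k=3):
    """Return the top_k most frequent non-stop-word tokens in the input."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(content).most_common(top_k)]

def build_context(sentence, knowledge_lookup):
    """Prepend retrieved knowledge snippets to the raw user input before it
    is passed to GPT-2. knowledge_lookup is a hypothetical keyword -> fact
    mapping standing in for the real knowledge-retrieval component."""
    facts = [knowledge_lookup[k] for k in extract_keywords(sentence)
             if k in knowledge_lookup]
    return " ".join(facts) + " " + sentence if facts else sentence
```

The enriched string, rather than the bare user utterance, becomes the conditioning text for generation.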

Metrics

The target goal of this project is to improve GPT-2's language-modeling performance. We will evaluate our model by comparing language-modeling benchmarks before and after applying this transfer-learning approach. The test data is the conversation portion of the TCS dataset, and the corresponding metrics are perplexity, Hits@1, and F1. Beyond the target goal, we have a stretch goal of improving performance on other NLP sub-tasks, so we will also measure these metrics on tasks such as reading comprehension, summarization, and translation to see whether there are any improvements.
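As a reminder of what the main metric measures: perplexity is the exponential of the average negative log-likelihood the model assigns to the gold tokens. A minimal sketch, with the per-token log-probabilities supplied by hand for illustration (in practice they come from the GPT-2 output distribution):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the test tokens.
    token_log_probs: natural-log probabilities the model assigned to each
    gold token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that puts probability 0.25 on every gold token has perplexity 4:
# it is as uncertain as a uniform choice among four options per token.
```

Lower perplexity means the fine-tuned model finds the held-out conversations less surprising.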

Ethics

Why is deep learning a good approach to this problem? Deep learning is broadly used in NLP tasks, and GPT-2 has been shown to outperform other existing models, which makes it well suited to conversation generation. Moreover, deep learning offers a fine-tuning mechanism for improving models, so we apply fine-tuning to further improve the results we get from GPT-2.

What is your dataset? Are there any concerns about how it was collected or labeled? Is it representative? What kinds of underlying historical or societal biases might it contain? Our dataset comes from Amazon Alexa and is already open source. Since Alexa is a cutting-edge product in the chatbot area, the dataset should be convincing: many researchers have already filtered and cleaned the data to create a high-quality dataset.

Final Report

link

Updates

Challenges

What has been the hardest part of the project you've encountered so far? So far we have run into difficulties with model architecture design, fine-tuning GPT-2, and metrics selection.

After researching related knowledge-grounded conversation-generation approaches, we proposed two ways to incorporate world knowledge into conversational response generation. They are hard to compare, and each requires a different loss-function design; we are analyzing both and will finalize the model architecture this week.

In the first approach, the input to the self-attention blocks of the GPT-2 model is the sum of three types of embedding: word, dialog-state, and positional. The word embedding is constructed by concatenating the knowledge sentences with the history of the dialog's previous utterances; together with the positional embedding, it is learned during the pre-training phase. The dialog-state embedding, learned during fine-tuning, indicates whether the current token is part of (i) a knowledge sentence, (ii) an utterance from PERSON1, or (iii) an utterance from PERSON2.

In the second approach, we first compute TF-IDF vectors for the ground-truth response at the last turn and for the candidate knowledge sentences, then select the knowledge sentence that maximizes the TF-IDF similarity score. The selected sentence and the dialog-history sentences are each passed through a model (e.g., a Transformer encoder); the outputs are concatenated and passed to GPT-2.

Apart from the architecture, we have not found a way to make GPT-2 act as a chatbot directly. We likely still need to fine-tune it and handle the interaction process manually, so there is work we must do ourselves.

The other challenging part is metrics selection. There are various automated metrics to choose from: perplexity, F1, n-gram diversity, Hits@1, and so on. We are also unsure whether human evaluation is practical in this course project, considering the large size of the test data and its subjectivity.
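The TF-IDF selection step in the second approach can be sketched as follows. This is a minimal stand-in, assuming plain cosine similarity over sparse TF-IDF dictionaries, not our actual implementation:

```python
import math
import re
from collections import Counter

def tf_idf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))  # document freq
    n = len(docs)
    return [{t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_knowledge(response, candidates):
    """Pick the knowledge sentence whose TF-IDF vector is most similar to
    the ground-truth response (the argmax step described above)."""
    vecs = tf_idf_vectors([response] + candidates)
    resp_vec, cand_vecs = vecs[0], vecs[1:]
    scores = [cosine(resp_vec, v) for v in cand_vecs]
    return candidates[scores.index(max(scores))]
```

The selected sentence would then be encoded alongside the dialog history before being passed to GPT-2.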

Insights

Are there any concrete results you can show at this point? We have preprocessed the data and generated five different splits: train, validate_freq, validate_rare, test_freq, and test_rare. Each split consists of three files: the dialog histories (of varying lengths), the corresponding ground-truth response for each dialog history, and the most relevant knowledge sentence for each ground-truth response. We are also working on a basic GPT-2 chatbot that does not use knowledge, which will serve as our baseline.
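Since each split is stored as three line-aligned files, loading it back into (history, response, knowledge) triples is straightforward. A sketch, assuming hypothetical file suffixes (the real preprocessing output may use different names):

```python
def load_split(prefix):
    """Read the three line-aligned files of one split (e.g. 'train') into
    (history, response, knowledge) triples. Line i of each file refers to
    the same example, so zipping the files recovers the triples."""
    def lines(suffix):
        with open(f"{prefix}.{suffix}", encoding="utf-8") as f:
            return [ln.rstrip("\n") for ln in f]
    return list(zip(lines("history"), lines("response"), lines("knowledge")))
```

Keeping the files line-aligned makes it easy to shuffle or subsample the split while preserving the pairing between history, response, and knowledge.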

Plan

Are you on track with your project? Yes. We have extracted the conversation messages and knowledge data from the original JSON-format TCS dataset into txt-format data, discussed different model architectures, and worked on the implementation of the GPT-2 chatbot.

What do you need to dedicate more time to? We expect to need more time to settle how we process our data and to finalize the design of the input representation. We may also need more time to gain GPT-2 experience: the model is not as easy to use as we originally thought. As mentioned in the Challenges section, we will dedicate more time to implementing a basic GPT-2 chatbot as the basis for further improvements.

What are you thinking of changing, if anything? Depending on the final model architecture design, we may change how we handle pre-processing and input representation. For example, we may need to provide extra information based not only on the keywords of the input but also on personality information and the like, especially for very short inputs.
