Link to most recent check-in: https://docs.google.com/document/d/1LA_cRR84TV9kVynyZ2nwObNfAc3aQEvbx-TjSfneOtM/edit

Dear TA - apologies for the formatting. We had a difficult time getting it to work. We went to TA hours but the TA wasn't sure. Maybe we could discuss it at our first mentor meeting - that way, for the next check-in, you can have an easier time reading it!

Title: Reimplementing “Extractive Summarization as Text Matching” by Ming Zhong et al. in TensorFlow, with a modified transformer architecture and dataset in order to reduce the GPU resources required

Who: Names and logins of all your group members. Fill in your names and logins!

Braxton Morrison - braxton_morrison@alumni.brown.edu
Jacob Bartolomei - jbartol2
Calvin Bauer - cbauer4

Introduction: What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

We are reimplementing MatchSum, a model that performs extractive summarization using BERT transformers. The paper aims to improve on existing extractive summarization architectures by framing the task as semantic text matching. Ming Zhong et al. succeeded in increasing performance on datasets with medium-length summaries, such as CNN/DM and wikiHow (~60 words). Their model is currently the top-ranked model for extractive summarization on the CNN/DM dataset. We are trying to solve the same problem, but implementing the model in TensorFlow and applying it to the Newsroom dataset.

We chose this paper because it is currently ranked #1 by ROUGE on the CNN/DM dataset, which makes it a substantial improvement on existing architectures. It also uses techniques that we discussed in class, such as semantic text matching, transformers, and similarity metrics like cosine similarity, while eliminating the need for techniques such as Trigram Blocking. It builds on what we learned in class about using recurrent neural networks for translation and using embeddings to integrate the meaning of a word into a sentence, and takes that a step further to perform document summarization.

Additionally, one of our group members was particularly interested in this topic because the pharmaceutical industry is shifting towards document summarization techniques. These would be used to extract meaningful data from government regulatory documents, which can be hundreds of pages long and therefore require intensive manual effort to review.

If you are doing something new, detail how you arrived at this topic and what motivated you. What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.

This is a supervised learning model using transformers.

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.

The use of a transformer is central to this paper. The authors use BERT, variations of which seem to be the standard for tasks like ours, but BERT is also a major computational burden. Thus, we hope to use DistilBERT, a lighter version introduced in https://arxiv.org/pdf/1910.01108.pdf. The authors of that paper demonstrate that their version is 40% smaller and 60% faster than BERT while still retaining 97% of its natural language understanding performance. DistilBERT relies on knowledge distillation, meaning a smaller model is trained to reproduce the behavior of a larger model, in this case BERT. The code for DistilBERT is available at https://github.com/huggingface/swift-coreml-transformers.
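To make the idea of knowledge distillation concrete, here is a minimal sketch in plain Python of a soft-target loss, where a student is penalized for diverging from a teacher's temperature-softened output distribution. The function names, the temperature value, and the logits are illustrative assumptions, and this shows only the soft-target term rather than DistilBERT's full training objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and student's softened distributions.

    The temperature of 2.0 is an illustrative value, not one taken from the paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Made-up logits: one student that mimics the teacher and one that does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)  # the student that mimics the teacher gets the lower loss
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the less likely classes.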

In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”--if you stumble across a new implementation later down the line, add it to this list.

Official code: https://github.com/maszhongming/MatchSum
https://github.com/HHousen/TransformerSum
https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert
https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22

Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).

We are using the CNN/Daily Mail dataset to verify the paper's results and will then test the model on the Newsroom dataset. The CNN/Daily Mail dataset contains about 300k articles, each with a summary. The Newsroom dataset contains about 1.3 million articles, each with a summary.

How big is it? Will you need to do significant preprocessing?

The datasets require significant preprocessing before they can be used in the model, because we need a pre-trained BERT model to create the candidate summaries for each document.
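As a rough illustration of this preprocessing step, the sketch below builds candidate summaries by keeping the highest-scoring sentences of a document and enumerating small combinations of them. The per-sentence scores are hypothetical stand-ins for the output of a pre-trained extractive model, and `generate_candidates`, `keep_top`, and `summary_sizes` are names and values we made up for the example.

```python
from itertools import combinations

def generate_candidates(sentence_scores, keep_top=5, summary_sizes=(2, 3)):
    """Return candidate summaries as tuples of sentence indices in document order.

    sentence_scores: one relevance score per sentence (assumed to come from a
    pre-trained extractive model; here they are simply given).
    """
    # Rank sentence indices by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentence_scores)),
                    key=lambda i: sentence_scores[i], reverse=True)
    kept = sorted(ranked[:keep_top])

    candidates = []
    for size in summary_sizes:
        # Every combination of `size` kept sentences is one candidate summary.
        candidates.extend(combinations(kept, size))
    return candidates

# Made-up scores for a seven-sentence document.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6]
candidates = generate_candidates(scores, keep_top=4, summary_sizes=(2, 3))
print(len(candidates), candidates[:3])
```

Pruning to a small number of sentences first keeps the number of combinations manageable, which matters given our limited computing resources.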

Methodology: What is the architecture of your model?

How are you training the model?

We will write and run the training ourselves using the TensorFlow library, with the pretrained DistilBERT taken from the Hugging Face Transformers git repository.

If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

Since computing resources will most likely be a large bottleneck for us, the hardest part of implementing this model will be settling on a number of transformers that is feasible to train and choosing a lighter-weight alternative to BERT that still performs well enough while increasing the training speed.

If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Key terms needed to follow the architecture description include:

  1.  Candidate summary: one of the multiple summaries that the model considers before it ultimately decides which summary is closest to the summary provided in the original dataset.
    
  2.  Gold summary: the summary of a particular document provided in the original data set, analogous to that document's label.
    

The paper describes an implementation of a summary-level framework, which the authors call MATCHSUM. The overall structure is predicated on the idea that a good summary should, as a whole, be more semantically similar to the source document than a summary that poorly represents the document's central ideas. To achieve this, they compute the similarity between the source document and each candidate summary. That is, the model learns a vector representation for each text fragment, then applies a cosine-similarity metric to compute matching scores.

This approach is implemented with a Siamese-BERT architecture, which follows the siamese network structure: it consists of two BERTs with tied weights and a cosine-similarity layer used during the inference phase. The model updates its weights with a margin-based triplet loss, together with a pairwise margin loss calculated over the candidate summaries. For the pairwise loss, the candidate summaries are sorted in descending order of their ROUGE scores against the gold summary. The authors' argument is that a candidate with a higher ROUGE score against the gold summary should receive a larger matching score than any of the candidates ranked below it.

The authors implemented a summary-level framework to address what they viewed as flaws in sentence-level summarizers. Sentence-level summarizers typically extract either a sentence or some smaller semantic unit from the original text, then determine the relationships between sentences and make an independent binary decision for each of them. The selected sentences are then combined to form a summary. The authors argue that this type of architecture results in high redundancy because it fails to account sufficiently for whether the extracted sentences are similar to each other.
(As an aside, some sentence-level summarizers have implemented techniques to reduce redundancy, such as an autoregressive decoder that scores the relationship between different sentences, or Trigram Blocking. When Trigram Blocking is used, a sentence that shares a trigram with the sentences already in the summary is skipped.)

Loss: The paper uses a triplet Margin Ranking Loss, a ranking loss that compares the score of a candidate summary with that of the gold (true) summary. Scores are measured with the cosine similarity between the summary and the document, then passed to the ranking loss. Pairs that are already ranked correctly by at least the margin have 0 loss, and incorrectly ranked pairs have a loss no smaller than the chosen margin value. The authors use a default margin value of 0.01, but allow it to be set as a command-line argument. They use and modify PyTorch’s MarginRankingLoss function, which appears to mostly correspond to TensorFlow’s contrastive_loss function.
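The scoring and loss described above can be sketched in plain Python as follows. This is our reading of the approach rather than the authors' code: the vectors are hand-written stand-ins for encoder outputs, `matching_loss` and `best_candidate` are hypothetical names, the candidate margin of 0.01 follows the default mentioned above, and the gold margin of 0 is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matching_loss(doc_vec, gold_vec, candidate_vecs,
                  gold_margin=0.0, candidate_margin=0.01):
    """Margin ranking loss over one document.

    candidate_vecs must already be sorted by descending ROUGE against the gold summary.
    """
    gold_score = cosine_similarity(doc_vec, gold_vec)
    scores = [cosine_similarity(doc_vec, c) for c in candidate_vecs]

    # Term 1: the gold summary should score at least as high as every candidate.
    loss = sum(max(0.0, s - gold_score + gold_margin) for s in scores)

    # Term 2: a better-ranked candidate i should beat a worse-ranked candidate j
    # by a margin that grows with the rank gap (j - i).
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * candidate_margin)
    return loss

def best_candidate(doc_vec, candidate_vecs):
    """Inference: index of the candidate most similar to the document."""
    scores = [cosine_similarity(doc_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 2-D "embeddings": candidates drift further from the document as rank drops.
doc = [1.0, 0.0]
gold = [1.0, 0.1]
ordered = [[1.0, 0.2], [1.0, 0.5], [1.0, 1.0]]

loss_ordered = matching_loss(doc, gold, ordered)
loss_reversed = matching_loss(doc, gold, ordered[::-1])
print(loss_ordered, loss_reversed, best_candidate(doc, ordered))
```

When the similarity scores agree with the ROUGE ordering the loss is zero, and it becomes positive once the ordering is violated. In our TensorFlow version the same arithmetic would run on batched tensors produced by the DistilBERT encoder.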

Metrics: What constitutes “success?”

Success would be achieving our target goal of reimplementing the model and achieving a score of 10 ROUGE.

What experiments do you plan to run?

We intend to implement the model using DistilBERT, train it on different datasets with varying summary lengths, and look at how the ROUGE score changes depending on those lengths.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

The metric that replaces accuracy for this model and problem is the ROUGE score.
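For intuition, here is a simplified sketch of ROUGE-1, which measures unigram overlap between a candidate summary and the gold summary. It uses only whitespace tokenization and lowercasing, so its numbers will not match an official ROUGE package, which we would use for reported results. The function name and example sentences are our own.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Simplified ROUGE-1: unigram overlap precision, recall, and F1."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Unlike accuracy, this rewards partial matches: a candidate that shares most of its words with the gold summary scores highly even if it is not identical.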

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

The authors of this paper were seeking to show that framing extractive summarization as a semantic text matching problem would increase the ROUGE score. They quantified their results by comparing their model's ROUGE scores on different datasets to those of other models.

If you are doing something new, explain how you will assess your model’s performance.

We will train the models and use test data to get the ROUGE score to quantify the performance of the model. To contextualize it, we will look at the performance of other models on the same dataset.

What are your base, target, and stretch goals?

Our base goal is to reimplement the model using DistilBERT in TensorFlow and show that it can produce extractive summaries. Our target goal is to reimplement the model and achieve a score of 10 ROUGE. Our stretch goal is to implement and train the model on a variety of datasets with different summary sizes, then compare the model's success across those summary sizes.

Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

What broader societal issues are relevant to your chosen problem space?

Text summarization could be applied to many spaces to make huge chunks of text more digestible. However, if this technology were applied across the internet, it could simply serve to propagate harmful content.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

The dataset used by the authors is CNN/Daily Mail, and we will also be testing on the Cornell Newsroom dataset, which contains articles from 38 publications. This certainly could lead to bias in what kind of summaries our model learns to extract. Since these are news-based summaries, the model could learn to extract what is newsworthy, which may not apply well to other fields. It may also be possible that the algorithm would learn to pay more attention to demographics that are talked about more in the articles.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

Presenters: everyone

Coding:

We will split coding assignments equally, to be determined when we are further along in our implementation.

Braxton: Introduction + pre-processing part of write-up; ⅓ of results section; Challenges section

Jacob: Model architecture describing original paper; ⅓ of results section; Discussion and future work

Calvin: Describe original model architecture before we modified it; ⅓ of results section; Reflection
