Who: Josue Cruz (jcruz14) and Varun Kasibhatla (vkasibha)
Introduction: With social media continuing to grow into the cultural juggernaut it has become over the past two decades, the average person has far more information at their fingertips than ever before. As such, it is essential for individuals to have a quick way to verify the information they read online and a check to ensure they are not spreading false information. This is especially necessary in the field of public health, where misinformation about COVID-19 could put individuals' health at risk. The Fax&Knowledge FactChecker will act as a convenient tool for assessing the validity of claims online, querying the relevant contradictory information from a reputable source. The paper we will be implementing focuses on creating an effective end-to-end fact checker using solely a language model, without any external knowledge or explicit retrieval components. We chose this paper because it shows how, with a good masking mechanism and verification model, language models can be used for fact checking.
Related Work: Our model is similar to https://aclanthology.org/2020.fever-1.5.pdf, an implementation of a similar FactChecker using the FEVER dataset, a dataset of claims verified against Wikipedia articles. It follows a workflow similar to our model's: masking the last token using NER, predicting the token using BERT, extracting features using AllenNLP, and outputting classification probabilities using a multi-layer perceptron.
Data: Our model will use the COVID-19 Rumors database, which can be found here: https://github.com/MickeysClubhouse/COVID-19-rumor-dataset/blob/master/Data/news/news.csv . This is a .csv file storing 4,200 rumors about COVID-19, labeled "T", "F", or "U" for true, false, and unverifiable. Here are examples from the database: T: "Washing your hands decreases the number of microbes on your hands and helps prevent the spread of infectious diseases"; F: "Using a hair dryer to breathe in hot air can cure COVID-19 and stop its spread"; U: "The Coronavirus and China and Quarantines May Cause a Global Recession". Some preprocessing will be necessary, particularly in storing the reputable source attached to each correct or incorrect claim.
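To make the label scheme concrete, here is a minimal sketch of reading rows in the assumed news.csv layout and mapping the single-letter labels to integer classes. The column names ("content", "label") and the inline toy rows are assumptions for illustration; the real file may differ.

```python
import csv
from io import StringIO

# Toy rows mimicking the assumed news.csv layout (column names are
# assumptions; the real file should be inspected before preprocessing).
raw = StringIO(
    "content,label\n"
    "Washing your hands helps prevent the spread of infectious diseases,T\n"
    "Breathing hot air from a hair dryer cures COVID-19,F\n"
    "The coronavirus may cause a global recession,U\n"
)

label_map = {"T": 0, "F": 1, "U": 2}  # integer classes for the classifier
rows = [(r["content"], label_map[r["label"]]) for r in csv.DictReader(raw)]
print(rows[1][1])  # the mapped label of the second (false) claim -> 1
```

For the real dataset, the same loop would run over `open("news.csv")` instead of the in-memory string.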
Methodology: The flow of data through our model will proceed as follows: the input data will undergo preprocessing before being entered into the learning model. Preprocessing will involve taking the token attached to each claim and linking it with its source, which is stored in a different .csv file. Next, our model will pass the inputs through a Named-Entity-Recognition (NER) layer, which will mask the last named entity in the claim. This captures the relationship between the entities in the statement and mitigates the effect of wording and syntax differences in how the claim reports its substance. Next, the model will predict the token of the masked entity using a Bidirectional Encoder Representations from Transformers (BERT) layer. From here, the BERT-transformed output will be put into an entailment layer, a pre-trained entailment model that will produce entailment features for the multi-layer perceptron; these input features will be returned by the AllenNLP library, which provides pre-trained models that assist in identifying entailment features of an input text. Finally, the features returned from the entailment layer will be fed into a multi-layer perceptron that returns the probability of each classification, and the highest-probability classification will be output by the model.
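The NER-masking step above can be sketched as follows. For simplicity the entity spans are supplied by hand here; in the real pipeline they would come from an NER model (e.g. spaCy or AllenNLP), and the `[MASK]` token would then be filled in by BERT.

```python
def mask_last_entity(claim: str, entities: list[str]) -> str:
    """Replace the last named entity occurring in the claim with BERT's
    [MASK] token. `entities` would come from an NER model in practice."""
    last, last_pos = None, -1
    for ent in entities:
        pos = claim.rfind(ent)  # rightmost occurrence of this entity
        if pos > last_pos:
            last_pos, last = pos, ent
    if last is None:
        return claim  # no entity found; leave the claim unmasked
    return claim[:last_pos] + "[MASK]" + claim[last_pos + len(last):]

claim = "Using a hair dryer to breathe in hot air can cure COVID-19"
print(mask_last_entity(claim, ["hair dryer", "COVID-19"]))
# -> "Using a hair dryer to breathe in hot air can cure [MASK]"
```

BERT's fill-in prediction for `[MASK]` can then be compared against the original entity as an implicit verification signal, which is the intuition behind masking the last entity rather than an arbitrary token.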
Metrics: Our experimental metric will be recall, which, unlike plain accuracy, assigns a particularly high cost to false negatives (marking a false claim as true). We chose this metric because our aim in this project is to identify misinformation about COVID-19, which can have harmful effects on public health. We believe the highest weight in the model should be placed on identifying false claims as false, so the accuracy metric used will be recall. The experiment we plan to run will evaluate the model on a dataset of classified tweets and report the recall metric. Our goal is to achieve a recall above 0.55, the accuracy achieved by two average human participants manually performing basic language detection tasks. The change from accuracy to recall is also significant here because it means that our model must perform around or above the 0.55 threshold while ensuring that its misidentified statements are more likely to be false positives than false negatives.
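The recall computation described above, treating "F" (false claim) as the positive class, can be sketched as recall = TP / (TP + FN), where a false negative is a false claim the model labels as anything other than F. The toy labels below are illustrative only.

```python
def recall_on_false(y_true: list[str], y_pred: list[str]) -> float:
    """Recall with 'F' (false claim) as the positive class."""
    tp = sum(t == "F" and p == "F" for t, p in zip(y_true, y_pred))
    fn = sum(t == "F" and p != "F" for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 3 of 4 false claims are caught, so recall = 0.75.
y_true = ["F", "F", "T", "U", "F", "F"]
y_pred = ["F", "T", "T", "U", "F", "F"]
print(recall_on_false(y_true, y_pred))  # -> 0.75
```

In practice the same number can be obtained from `sklearn.metrics.recall_score` with the false class as `pos_label`; the hand-rolled version is shown only to make the false-negative definition explicit.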
Ethics: The issues of political polarization and public health lie at the heart of this problem. With different stances on masks, vaccines, and treatments for COVID-19 dividing largely along party lines, we hope our model will provide a means to remove false information from the conversation entirely and to ground political and medical discourse in facts rather than rumors. The stakeholders in this problem are everyone, but particularly vulnerable populations, such as those who are less affluent, elderly, or immunocompromised, who are at increased risk of mortality or morbidity from COVID-19. False information is extremely harmful to these populations because it often leads to irresponsible public exposure to the virus and increased risk for individuals who may be permanently affected. A mistake by our algorithm could increase the risk of spread among individuals, which is more likely to result in prolonged illness or death for those in vulnerable populations.
Division of Labor:
- Preprocessing: Varun
- NER Masking: Josue
- BERT Token Prediction: Josue
- Entailment Feature Extraction: Varun
- Multi-layer Perceptron: Varun
- Verification: Josue
- Report: Both
- Poster: Both
- Recording: Both
Built With
- keras