Emotion Detection in COVID-related Reddit Posts

Who

Gabrielle Shieh (gshieh2)

Nolan Serbent (nserbent)

Jonathan Goshu (jgoshu)

Hao Wen (hwen5)

Our GitHub

Our code can be found here

Introduction

Using Reddit posts from the peak of the COVID-19 pandemic, we aim to create a model that can detect perceived emotions in text and identify the triggers related to them. Our model will detect which emotions are conveyed in users' posts and generate automated summaries of the specific text that corresponds to each detected emotion. Previous models for detecting emotion in text struggle with multi-emotion detection and lack automated summarization of emotionally charged text. Our project addresses these issues by taking a deeper look at emotional triggers and analyzing correlations between emotion and language.

Related Work

The paper we are re-implementing in TensorFlow can be found here.

The GitHub for the paper we are re-implementing can be found here.

Other potential datasets we may use can be found here.

Data

The data for our project can be found here.

Our data comes from the dataset released with the original paper, in which Reddit posts about COVID were individually annotated and labeled with sentiment.

Our dataset has 2200 episodes and is already split into 1199 posts for training, 284 posts for validation, and 397 posts for testing. We are unsure what preprocessing will look like. In past homework assignments we removed unusual words as part of preprocessing, but we do not yet know how many unusual words should be removed from these posts. Other preprocessing steps may include removing stop words, tokenization, stemming, and lowercasing.
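
As a rough illustration of the candidate steps listed above, here is a minimal Python sketch that keeps stop-word removal and stemming optional, since it is not yet decided which steps will be used. The stop-word list, the crude_stem helper, and the preprocess function are hypothetical names made up for this example; a real run would likely use a fuller stop-word list and an established stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "i", "it"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(post, remove_stop_words=True, stem=True):
    """Lowercase and tokenize a post, then optionally drop stop words and stem."""
    tokens = re.findall(r"[a-z0-9']+", post.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    print(preprocess("The cases are rising in my town"))
```

Keeping each step behind a flag makes it easy to compare validation results with and without it before committing to a pipeline.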

Methodology

General architecture:

Step one: use transformers.
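
Since the goal is multi-emotion detection, one possible way to turn the transformer's output into labels is to score each emotion independently and keep every emotion above a threshold. The sketch below assumes the model emits one logit per emotion; the label list, the 0.5 threshold, and the detect_emotions function are assumptions for illustration, not the original paper's method. Only "fear" and "anticipation" come from this proposal; the other label names are placeholders.

```python
import math

# "fear" and "anticipation" appear in the proposal; the rest are placeholder labels.
EMOTIONS = ["fear", "anticipation", "anger", "joy"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def detect_emotions(logits, threshold=0.5):
    """Return every emotion whose independent sigmoid probability meets the threshold."""
    if len(logits) != len(EMOTIONS):
        raise ValueError("expected one logit per emotion")
    return [name for name, z in zip(EMOTIONS, logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    # Hypothetical logits for a single post.
    print(detect_emotions([2.0, 0.3, -1.5, -3.0]))
```

Unlike a softmax over all labels, independent sigmoids allow a post to receive zero, one, or several emotions.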

Metrics

Accuracy for sentiment analysis will be measured the same way as in the classification tasks from previous homework assignments, since our model will categorize Reddit posts into different sentiments, such as fear or anticipation. Perplexity will be the evaluation metric for summarizing triggers, because it describes how well the model predicts the next token. For our project, the predicted tokens will be the triggers. (This part will probably have to be changed.)

We will assess the quality of the model based on the accuracy and the perplexity it returns. If the classification accuracy is high, then the model does a good job of classifying sentiment; if it is low, the model will need to be modified. If the perplexity is low, then the model does a good job of predicting the next token; if it is high, the model needs to be modified.
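
To make the two metrics concrete, here is a small Python sketch. It assumes single-label predictions for accuracy and, for perplexity, a list of the probabilities the model assigned to each reference token; perplexity is then the exponential of the average negative log-probability. The function names and sample values are made up for illustration.

```python
import math


def accuracy(predicted, gold):
    """Fraction of posts whose predicted label matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Hypothetical predictions and token probabilities.
    print(accuracy(["fear", "joy", "fear"], ["fear", "fear", "fear"]))
    print(perplexity([0.5, 0.25, 0.125]))
```

A model that assigns probability 1 to every reference token reaches the minimum perplexity of 1, while a model that guesses uniformly over k tokens has perplexity k.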

Our target goal is to generate correct sentiments. Our stretch goal is to generate statements regarding the triggers.

Ethics

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

In using the pre-existing dataset gathered for the original paper, we risk not having enough posts with positive sentiment, because the dataset is about COVID. It is representative of the general sentiment surrounding COVID, but it does not contain many posts with positive sentiment. We will check the effect of this by testing our model on a different pre-labeled dataset, such as sentiment in Twitter posts or Amazon reviews.

Who are the major stakeholders in this problem, and what are the consequences of mistakes made by the algorithm?

Stakeholders may include Reddit users and mental health organizations. Mistakes made by the algorithm pose no threat and have no societal implications. Our algorithm is intended to predict sentiment and is not meant to have real-world application, but instead to be cohesive with the existing plot.

What broader societal issues are relevant to your chosen problem space?

Broader societal issues relevant to our chosen problem space include social media usage and its effects, the effect of the COVID pandemic on society, and mental health. Our algorithm is intended to predict sentiment, and analysis of these posts can provide insight into the effects COVID has on mental health and the effects social media has on individuals.

Division of Labor

We plan to complete most of this project together and will not split up many of the coding tasks. This way, we can fully collaborate and decide as a group how different problems will be solved and what modifications we may need to make to our original plan. Smaller tasks, such as debugging, may be assigned to individuals to complete between group meetings. If time becomes an issue, we may split into pairs to work on certain tasks, but we will still work on the project as a group.