Final Write-Up Link: https://docs.google.com/document/d/16SOkOwfUA-wuvmECFWu3KKF-nfqQ-gjApSYHD-uW0uA/edit?usp=sharing

~~~~~~~~~~~~~~~~~~~~~~~~ Checkpoint 1 ~~~~~~~~~~~~~~~~~~~~~~~~

Introduction

The paper we chose to replicate seeks to develop a structural prediction model that can detect early onset of depression from users' posts on Reddit. Early diagnosis of depression is critical, considering that suicide, to which depression is a major contributor, is the second leading cause of death among young adults. Mental health cues, especially on an anonymous forum like Reddit, can shed light on users' emotions. We chose to implement this paper for several reasons. All of us were interested in pursuing an application of NLP in our model, and this paper seemed especially relevant given the rise of social media. The model in the paper uses many topics we've learned about in class, including bigram and trigram extraction and word embeddings.

A feature of the model that goes beyond the scope of the CS1470 course is that it trains on chronological data: it is fed a user's earlier posts, collected over a span of 2-3 years. As the number of submissions grows, the model must decide as early as possible, with a learned confidence.

Related Work

A related work by Nalabandian et al. (2019) indicated that people going through depression were more likely to use negative words and "self-focused language" when writing about interactions with significant others (as opposed to writing about other people around them).

Another related work, by Loveys et al., examines how different cultures experience depression in different ways. Specifically, this paper looks at the differences in how Caucasian, African American, and Asian and Pacific Islander people use different types of language to express negative emotions (and in the case of Asian and Pacific Islander people, these negative emotions are often inhibited). The paper dug further and found that Hispanic users were more likely to express both positive and negative emotion. It is crucial to acknowledge and try to account for these cultural differences, because depression affects all walks of life, and to ignore them would be to imbue our model with bias.

Citations: [Nalabandian and Ireland2019] Taleen Nalabandian and Molly Ireland. 2019. Depressed individuals use negative self-focused language when recalling recent interactions with close romantic partners but not family or friends. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 62–73. https://www.aclweb.org/anthology/W19-3008/

[Loveys et al.2018] Kate Loveys, Jonathan Torrez, Alex Fine, Glen Moriarty, and Glen Coppersmith. 2018. Cross-cultural differences in language markers of depression online. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 78–87. https://www.aclweb.org/anthology/W18-0608.pdf

Data

We will be using the eRisk (Early Risk Prediction on the Internet) dataset, developed for an annual workshop held by CLEF (Conference and Labs of the Evaluation Forum). We've received access to the 2018 and 2020 eRisk datasets from the CLEF organizers. eRisk has two tasks: early detection of depression and of anorexia. The paper we are replicating focuses on early detection of depression in the 2018 data; we plan to replicate its methodology on the 2020 dataset. The 2018 dataset contains 125 depressed users and 752 non-depressed users as training data, and 79 depressed users and 741 non-depressed users as test data. There are 531,349 submissions in total, only 49,557 of which come from users with depression, and the submissions span a 2-3 year range. The 2020 dataset that we will use has similar statistics, both for number of users and number of submissions.

We will need to do significant preprocessing to build the NLP pipeline. The users' text must be cleaned: transformed to lowercase, with punctuation and stopwords removed. Numbers and URLs in the text are replaced with special tokens, and stemming is then done with the Porter stemmer. Given the large size of the data, the paper reduces the dimensionality of the dictionary by using collocation to extract bigrams/trigrams. Since non-depressed users contribute far more submissions than depressed users, the authors balance the classes by downsampling the majority class to a 2:1 ratio. We will follow a similar preprocessing method.
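The cleaning steps above can be sketched as follows. This is a minimal, self-contained illustration: the stopword set and the suffix-stripping stemmer are toy stand-ins for a full stopword list and the Porter stemmer (e.g. NLTK's), and the `<num>`/`<url>` token names are our own choice. Collocation-based bigram/trigram extraction (e.g. with gensim's Phrases) would then run over the resulting token lists.

```python
import re

# A tiny illustrative stopword set; the real pipeline would use a full
# list such as NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "i", "to", "and", "of", "my"}

NUM_TOKEN = "<num>"
URL_TOKEN = "<url>"

def naive_stem(word):
    # Stand-in for the Porter stemmer; here we only strip a few common
    # suffixes for illustration.
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_post(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", URL_TOKEN, text)      # URLs -> token
    text = re.sub(r"\b\d+(\.\d+)?\b", NUM_TOKEN, text)   # numbers -> token
    # Keep the placeholder tokens, drop remaining punctuation.
    tokens = re.findall(r"<num>|<url>|[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(clean_post("I slept 14 hours today, see https://example.com"))
# -> ['slept', '<num>', 'hour', 'today', 'see', '<url>']
```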

Methodology

Although we will be using a different dataset than the one used in the paper, we plan to use the architecture defined in the paper: three hidden layers of sizes 512, 256, and 256 respectively, with Leaky ReLU activations and dropout. In addition, the paper downsamples the data so that the ratio of posts and comments from non-depressed users to depressed users is 2:1, which we plan to do as well. We expect the hardest part of implementing the model to be Latent Semantic Indexing (LSI), a method the paper uses to analyze the topics of users' posts. We will have to figure out how to create and use an LSI model to extract topic-modelling embeddings, which are then fed into our neural network architecture. In addition, if we choose to implement confidence learning, which is a second output of the model in the paper, this will also be a new area for us to explore.
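As a sanity check on the architecture, here is a minimal NumPy forward-pass sketch of a 512/256/256 network with Leaky ReLU, inverted dropout, and a sigmoid output. The 300-dimensional input width is our assumption for the LSI embedding size, and the random weights are placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Hypothetical input width: the LSI topic-embedding dimension (300 is an
# assumption; the paper's exact dimensionality may differ).
LAYER_SIZES = [300, 512, 256, 256, 1]

# Random weights stand in for trained parameters.
weights = [rng.normal(0, 0.05, (m, n)) for m, n in zip(LAYER_SIZES, LAYER_SIZES[1:])]
biases = [np.zeros(n) for n in LAYER_SIZES[1:]]

def forward(x, train=False, drop_rate=0.5):
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:              # hidden layers only
            h = leaky_relu(h)
            if train:                          # inverted dropout at train time
                mask = rng.random(h.shape) > drop_rate
                h = h * mask / (1 - drop_rate)
    return 1 / (1 + np.exp(-h))               # sigmoid: P(depressed)

x = rng.normal(size=(4, 300))                  # batch of 4 user embeddings
probs = forward(x)
print(probs.shape)                             # (4, 1)
```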

Metrics

For our base goal, we plan to train on the t1-2020 dataset and test on the t2-2020 dataset. For our target goal, we plan to include the timeframe in the model to account for early detection of depression. We will experiment with Early Risk Detection Error (ERDE), the primary metric used by the paper's authors: it adds a penalty for time delays, evaluated with deadlines of 5 and 50 submissions (ERDE_5 and ERDE_50). The authors chose this metric because it captures the time component of the early detection task, quantifying how early the model can detect depression.
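Our reading of ERDE can be sketched as a per-user cost function: a false positive costs c_fp, a false negative costs c_fn, and a true positive is penalized by a latency factor that grows as the decision is delayed past the deadline o (5 or 50). The cost constants below are illustrative defaults, not the official eRisk values.

```python
import math

def latency_cost(k, o):
    # Grows from ~0 to ~1 as the decision delay k passes the deadline o.
    return 1 - 1 / (1 + math.exp(k - o))

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Early Risk Detection Error for one user.

    decision/truth: 1 = depressed, 0 = not depressed
    k: number of submissions seen before the decision
    o: deadline parameter (5 or 50 in the eRisk evaluation)
    Cost constants are illustrative defaults, not the official values.
    """
    if decision == 1 and truth == 0:
        return c_fp
    if decision == 0 and truth == 1:
        return c_fn
    if decision == 1 and truth == 1:
        return latency_cost(k, o) * c_tp
    return 0.0

# A correct positive decision after 3 posts is penalized far less
# (deadline o=50) than the same decision made after 80 posts.
print(erde(1, 1, k=3, o=50), erde(1, 1, k=80, o=50))
```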

Based on our base goal (detecting whether a user is depressed given the entire post history of that user), we will assess our model's performance by its accuracy (i.e., the percentage of users it classifies correctly).

Our base goal is for our model to determine whether a user is depressed, given the entire post history of that user. Our target goal is for our model to detect depression as early as possible while viewing a given user's posts in sequence. Our stretch goal is for our model to perform better than the paper's, where we define better as detecting depression earlier.

Ethics

Some of the broader social issues relevant to our problem space are the positive correlation between increasing social media presence and deteriorating mental health, especially among younger individuals. In light of the COVID-19 pandemic, it is even more challenging for individuals to take care of their mental health while being forced to increase their online presence. In recent years, AI and machine learning techniques have been used more frequently to detect and treat mental health conditions; similar techniques can also be used to detect other social cues such as misogyny or racism. There are potential concerns about whether the data is representative across different populations. The data includes no information about the gender, race, ethnicity, age, or other relevant attributes of the users, so there could be large discrepancies between our sample and the overall population distribution. Reddit users are predominantly young men (1), which could skew the data distribution. Representative data is important, considering that different cultures experience and express depression in different ways (see Related Work), and signs of depression can also differ across genders.

(1) https://www.theatlantic.com/technology/archive/2013/07/reddit-demographics-in-one-chart/277513/

Division of Labor

Preprocessing tasks (we will each take an active role in preprocessing, considering it is a significant step):
- Parsing each XML file - Sarah
- Compiling XML file data in bulk - Andrew
- Transforming text into lowercase; removing punctuation and stopwords; replacing numbers/URLs with tokens
- Performing stemming with the Porter stemmer
- Using collocation to extract bigrams/trigrams

Implementing the model - all of us

~~~~~~~~~~~~~~~~~~~~~~~~ Checkpoint 2 ~~~~~~~~~~~~~~~~~~~~~~~~

Challenges

The hardest part of the project so far has been fully understanding how to implement the paper's preprocessing methods. The initial challenge was determining how to parse and stem the XML data files: we had to decide as a group how best to store the data and what class structure would fit the model. From there, the hardest part is incorporating a method the paper uses called Latent Semantic Indexing (LSI), which extracts embeddings that capture information about the text of users' posts. We emailed the researchers of the study to clarify a couple of questions we had about parsing and storing users' text, as well as about the implementation of LSI, which was very helpful. We plan to incorporate this method into our preprocessing this upcoming week before beginning the model architecture.
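At its core, LSI is a truncated SVD of the (typically TF-IDF-weighted) term-document matrix, so per-document topic embeddings can be sketched without any special library. The toy count matrix and the choice of k = 2 topics below are our own; in practice a library such as gensim's LsiModel would be applied to the real vocabulary.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = user documents.
# (A real pipeline would build this, usually TF-IDF weighted, from the
# stemmed post tokens.)
terms = ["sad", "tired", "alone", "happy", "game", "win"]
X = np.array([
    [3, 2, 0],   # sad
    [2, 3, 0],   # tired
    [1, 2, 0],   # alone
    [0, 0, 2],   # happy
    [0, 1, 3],   # game
    [0, 0, 2],   # win
], dtype=float)

# Truncated SVD: keep only the top k latent topics.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

print(doc_embeddings.shape)  # (3, 2): 3 documents, 2 topic dimensions
```

In the truncated topic space, the first two documents (dominated by "sad"/"tired"/"alone") end up close together, while the third sits apart, which is exactly the signal the classifier consumes.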

Insights

We were able to preprocess the XML files that contain the users' posts. We created a class hierarchy to store the data: the User class contains the user ID and posts of a given user, and the Post class represents a single user post, containing the date, text, title, and info fields. We've created a method get_data that extracts all the data for each user and their posts, and creates a User instance for each subject XML file. With this list, we've completed the initial preprocessing: tokenizing and stemming the words in text and titles, and removing stopwords.
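The class hierarchy described above might look like the following sketch. The field names mirror our description, while the parsed-input format that get_data consumes here is a simplification standing in for the actual XML parsing.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    # A single user submission, mirroring the fields described above.
    date: str
    title: str
    text: str
    info: str = ""

@dataclass
class User:
    user_id: str
    posts: List[Post] = field(default_factory=list)

def get_data(parsed_files):
    # parsed_files: iterable of (user_id, [(date, title, text, info), ...])
    # tuples, e.g. produced by an XML parser over the subject files.
    users = []
    for user_id, raw_posts in parsed_files:
        posts = [Post(*p) for p in raw_posts]
        users.append(User(user_id, posts))
    return users

users = get_data([("subject1", [("2018-01-02", "day one", "feeling ok", "")])])
print(users[0].user_id, len(users[0].posts))  # subject1 1
```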

Plan

We are on track with our project. We have collected our data and preprocessed it. We believe, moving forward, we should dedicate more time to conceptually understanding and implementing the Latent Semantic Indexing portion of the paper. As of right now, we are not thinking of changing anything.
