Ukrainian War Sentiment Analysis + Topic Modeling

**Final Write-up:** https://docs.google.com/document/d/1voQNx_bnMcm3u1YVwHkbXYIAc24YSoAN0s70FzopMYg/edit?usp=sharing

Introduction:

The ongoing conflict between Russia and Ukraine has affected people and groups around the world in different ways and to different degrees. To better understand Twitter users’ discourse and psychological reactions to the conflict, we use deep learning methods to analyze related tweets, and we track how discussions and attitudes towards the conflict change over time. The results have several practical uses, such as improving donation campaigns and helping local communities establish emotional support organizations. More generally, we want to see how people outside of Russia and Ukraine are affected by the conflict. Topic modeling gives us a view of the subtopics people are most concerned about, which reflects the conflict’s effects on their daily lives and mental states. The sentiment analysis part is a classification problem: we predict each post’s sentiment from 11 different categories. The topic modeling part is an unsupervised learning problem: we train a model to cluster the posts into groups and then examine the features of each group.

Related Work:

In “Public discourse and sentiment during the COVID 19 pandemic”, Xue et al. used Latent Dirichlet Allocation for topic modeling on Twitter to explore public discourse and psychological reactions during the early stage of COVID-19. They analyzed 1.9 million English-language tweets related to coronavirus, collected from January 23 to March 7, 2020, and identified 11 salient topics, which they then grouped into ten themes. They also tracked how sentiments changed over time.

Xue, Jia, et al. "Public discourse and sentiment during the COVID 19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter." PloS one 15.9 (2020): e0239441. https://arxiv.org/abs/2005.08817

Data:

Using the Russia vs Ukraine Tweets Dataset from Kaggle, we collected about 10,000 tweets regarding Russia and Ukraine from 21st February to date. During the pre-processing stage, we removed noisy elements such as stop words, hyperlinks, and non-ASCII characters, and applied lemmatization to clean the data. For our training and validation dataset, we used the SemEval 2018 (Task EI-oc) data [https://competitions.codalab.org/competitions/17751#learn_the_details-datasets]. Each tweet in this dataset is labeled with one or more of eleven sentiments: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust.
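The cleaning steps listed above could look roughly like the sketch below. It is only an illustration, not the project's actual code: the function name `clean_tweet`, the regular expressions, and the tiny stop-word list are all assumptions, and lemmatization is left as a comment because it needs an external library.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the ones shipped with NLTK or spaCy.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")     # emoji, non-Latin scripts, etc.
TOKEN_RE = re.compile(r"[a-z]+")                # runs of lowercase ASCII letters


def clean_tweet(text):
    """Return lowercase tokens with links, non-ASCII text and stop words removed."""
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    # Lemmatization would be applied to each token here; it is omitted in
    # this sketch because it depends on an external library.
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_tweet("Refugees are arriving in Poland 🇵🇱 https://example.com/news"))
```

Keeping only letter runs also drops digits and the `#` and `@` symbols, which may or may not match what the original pipeline did.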

Methodology:

The main question for our sentiment analysis is: how has the overall sentiment of Twitter users evolved since the outbreak of the conflict between Russia and Ukraine? To address this question, we collected and sampled about 10,000 tweets from 21st February to the present and preprocessed them. We then built a bidirectional LSTM multi-label sentiment classification model and trained it on the SemEval 2018 dataset. By applying the trained model to the cleaned Twitter data for each time period, we created graphs of how sentiment evolved over that period.

The second question is: what are the main topics discussed on Twitter regarding the Russian-Ukrainian conflict? To discover the topics, we use the Latent Dirichlet Allocation (LDA) algorithm. We fit LDA Mallet models to the tweets over a range of values for the total number of topics. The model that produces the best coherence score is selected as the best model, and its topic count is taken as the optimal number of topics. We then rebuilt an LDA Mallet model using this optimal topic count. Finally, we used the visualization tool pyLDAvis to make the topic modeling results easier to understand intuitively.

As a backup plan, if the accuracy of our modeling approach on the training dataset is not satisfactory, we can consider using other models or finding a more comprehensive training dataset.
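The topic-count selection step can be sketched as follows. The write-up does not say which score was computed, so this sketch assumes UMass coherence, one common choice, and implements it in plain Python rather than calling the LDA Mallet tooling. The names `umass_coherence`, `mean_coherence`, and `best_topic_count`, and the toy documents and topics, are hypothetical; in practice the top words per topic would come from the trained models.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum of log((D(wi, wj) + 1) / D(wi))
    over word pairs, where D counts the documents containing the words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` is the higher-ranked word.
    # Assumes every top word occurs in at least one document.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


def mean_coherence(topics, documents):
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


def best_topic_count(candidates, documents):
    """candidates maps a topic count to that model's topics (lists of top words)."""
    return max(candidates, key=lambda k: mean_coherence(candidates[k], documents))


# Toy corpus and hand-written "topics" standing in for trained models.
documents = [
    ["sanction", "oil", "price"],
    ["sanction", "oil", "gas"],
    ["refugee", "border", "poland"],
    ["refugee", "border", "aid"],
]
candidates = {
    2: [["sanction", "oil"], ["refugee", "border"]],
    3: [["sanction", "refugee"], ["oil", "border"], ["gas", "aid"]],
}
print(best_topic_count(candidates, documents))
```

Here the two-topic candidate wins because its top words actually co-occur in the same tweets, which is the property a coherence score is meant to reward.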

Metrics:

When training our sentiment analysis model on the SemEval 2018 dataset, we use accuracy to measure performance. When we apply the model to the collected tweets, we have no labels, so we check whether the visualized results look meaningful and whether the changes in sentiment are reasonable given real-world events. For the topic modeling part, we check whether the identified topics have significantly different features.
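For a multi-label classifier, "accuracy" can be computed in more than one way, and the write-up does not say which variant was used. The sketch below shows two common options, assuming the model outputs one score per label and that a label is predicted when its score reaches 0.5; the function names, the threshold, and the shortened label list are all assumptions made for illustration.

```python
def decode_labels(scores, labels, threshold=0.5):
    """Turn per-label scores into the set of predicted label names."""
    return {label for label, s in zip(labels, scores) if s >= threshold}


def jaccard_accuracy(true_sets, predicted_sets):
    """Mean over tweets of (size of intersection) / (size of union)."""
    total = 0.0
    for true, pred in zip(true_sets, predicted_sets):
        union = true | pred
        # A tweet with no true and no predicted labels counts as a perfect match.
        total += len(true & pred) / len(union) if union else 1.0
    return total / len(true_sets)


def exact_match_accuracy(true_sets, predicted_sets):
    """Fraction of tweets whose predicted label set equals the true set."""
    matches = sum(1 for t, p in zip(true_sets, predicted_sets) if t == p)
    return matches / len(true_sets)


# Three labels only, to keep the example short.
LABELS = ["anger", "fear", "sadness"]
true_sets = [{"anger", "fear"}, {"sadness"}]
predicted_sets = [
    decode_labels([0.9, 0.2, 0.1], LABELS),  # {"anger"}
    decode_labels([0.1, 0.3, 0.8], LABELS),  # {"sadness"}
]
print(jaccard_accuracy(true_sets, predicted_sets))      # 0.75
print(exact_match_accuracy(true_sets, predicted_sets))  # 0.5
```

Exact match is the stricter of the two, since a tweet with one missed label out of two scores zero rather than one half.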

Ethics:

What broader societal issues are relevant to your chosen problem space?

The broader issue in our project is understanding the potential negative effects of regional conflict on global society. By evaluating people’s emotions regarding the war between Russia and Ukraine, we want to further investigate the harmful consequences of violence and how it affects the global community in the 21st century.

Why is Deep Learning a good approach to this problem?

In our project, we evaluate people’s emotions regarding the war between Russia and Ukraine through their tweets. This task is not as simple as it sounds. Because emotions are expressed through widely varying vocabulary, one specific emotion can take on many forms, so we cannot complete the task by simply extracting certain keywords. Instead, by using deep learning for sentiment analysis, we can train a model that evaluates a sentence’s emotion on its own.

Division of labor:

- 1 Yifei Song & Zhirui Li - The first part of our project uses a training dataset (SemEval 2018) to build a model that classifies individual tweets into the eleven sentiment categories.
- 2 Yifei Song & Zhirui Li - The second part of our project assigns newly collected tweets to the eleven sentiment categories (prediction task).
- 3 Keying Gong & Hanjun Wei - The third part of our project visualizes the predictions made in part 2 as a scatter plot. The x-axis represents the date and the y-axis represents the frequency of each sentiment category. The color of each point represents the sentiment class.
- 4 Keying Gong & Hanjun Wei - The final part of our project is topic modeling (unsupervised learning) of the newly collected tweets. We can also visualize the different topics on each date using the same method described in part 3.
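
The per-date frequencies behind the plot in part 3 could be assembled as in the sketch below. It assumes each prediction is a (date, set of sentiments) pair with ISO-formatted date strings; the function names and the sample data are made up for illustration, and the plotting call itself is left out.

```python
from collections import Counter, defaultdict


def daily_sentiment_counts(predictions):
    """predictions: iterable of (date_string, set_of_sentiments) pairs."""
    counts = defaultdict(Counter)
    for date, sentiments in predictions:
        counts[date].update(sentiments)
    return counts


def series_for(sentiment, counts):
    """Return (dates, frequencies) sorted by date, one series per sentiment."""
    dates = sorted(counts)  # ISO date strings sort chronologically
    return dates, [counts[d][sentiment] for d in dates]


# Made-up predictions, only to show the data shape.
predictions = [
    ("2022-02-24", {"fear", "anger"}),
    ("2022-02-24", {"fear"}),
    ("2022-02-25", {"sadness"}),
]
counts = daily_sentiment_counts(predictions)
print(series_for("fear", counts))  # (['2022-02-24', '2022-02-25'], [2, 0])
```

Each sentiment then becomes one colored series with dates on the x-axis and frequencies on the y-axis, and the same aggregation works for topics if the sets hold topic labels instead.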

Built With

lda
lstm
python