Hate Speech Detection

Who:

Chujun Chen (cs login: cchujun), Sicheng Yang (cs login: ysicheng), Yijia Xue (cs login: yxue45)

Introduction:

In the modern era, characterized by a vast increase in social media use and online forums, there is a corresponding rise in the visibility of hate speech. This trend poses significant risks, particularly to minority groups, by potentially inciting societal polarization and real-life crimes. To help create an inclusive and friendly online communication environment, our project aims to identify an effective model for classifying and detecting hate speech by evaluating several deep learning models – DistilBERT, LSTM, BiLSTM, and CNN-LSTM – through both qualitative (visualizations) and quantitative (performance metrics) measures. Numerous studies have explored the use of deep neural networks for detecting hate speech on social media. Yet most of these models function as ‘black boxes’, obscuring their internal decision-making processes from users. Our project addresses this issue by enhancing the interpretability of these models, offering a transparent, visual representation of the decision-making process, particularly in identifying sentiments and classifying texts with hateful or extremist content.

Methodology:

Dataset:

We utilized multiple datasets to ensure a comprehensive approach to hate speech detection. These include:

  • The Offensive Language Identification Dataset (OLID) [1], which features over 10,000 English tweets annotated for offensive content.
  • A German dataset consisting of 2,000+ Facebook posts focused on xenophobic statements [2].
  • A French dataset with 2,856 tweets annotated for racist speech [3].

Each dataset was chosen for its relevance to the linguistic and cultural contexts it represents, aiming to encompass a diverse range of expressions.

Data Preprocessing:

Our preprocessing steps were designed to standardize and clean the data, ensuring that it was suitable for model training. These steps, sketched in code after the list, included:

  • Removing or replacing hashtags and URLs.
  • Converting emojis to text to preserve emotional expressions.
  • Normalizing text by reducing multiple spaces to a single space and stripping newline characters.
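
A minimal Python sketch of these steps, assuming the third-party `emoji` package for emoji-to-text conversion and a simple `URL` placeholder (both illustrative choices, not a record of our exact pipeline):

```python
import re

import emoji  # assumed dependency (pip install emoji) for emoji-to-text conversion


def preprocess(text: str) -> str:
    """Standardize and clean a raw post along the lines described above."""
    text = re.sub(r"https?://\S+", "URL", text)         # replace URLs with a placeholder
    text = re.sub(r"#(\w+)", r"\1", text)               # drop the '#' but keep the hashtag word
    text = emoji.demojize(text, delimiters=(" ", " "))  # e.g. a smiley face becomes 'grinning_face'
    text = text.replace("\n", " ")                      # strip newline characters
    text = re.sub(r" +", " ", text).strip()             # reduce multiple spaces to a single space
    return text
```
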
Models:

We experimented with several model architectures to find the most effective approach for our needs:

  • DistilBERT [4], [5]: A lighter version of BERT that maintains most of the original model's performance. This model was chosen for its efficiency and effectiveness in processing English text.
  • LSTM and BiLSTM [6]: We employed LSTM models to capture the temporal dynamics and dependencies in text data, with BiLSTM processing the sequence in both directions to capture context on either side of each token.
  • CNN-LSTM: A hybrid model combining convolutional neural networks (CNN) with LSTM to leverage spatial hierarchies and temporal sequence data for enhanced feature extraction [7], [9] (see the sketch after this list).
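
As a concrete illustration of the hybrid architecture, here is a minimal PyTorch sketch of a CNN-LSTM classifier; the layer sizes and the two-class output are illustrative assumptions rather than our exact configuration:

```python
import torch
import torch.nn as nn


class CNNLSTMClassifier(nn.Module):
    """Conv1d extracts local n-gram features; an LSTM then models their order."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 num_filters: int = 64, hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(num_filters, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = self.conv(x.transpose(1, 2))   # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)  # back to (batch, seq_len, num_filters)
        _, (h_n, _) = self.lstm(x)         # final hidden state summarizes the sequence
        return self.fc(h_n[-1])            # logits over {neutral, hate}
```

Setting `bidirectional=True` on the LSTM (and feeding the concatenation of the two final hidden states into a correspondingly wider linear layer) gives the BiLSTM variant of the recurrent half.
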
Evaluation Metrics:

  • F1-Score: This metric helped us balance the precision and recall of our models, providing a more insightful performance indicator for the imbalanced classes that are typical of hate speech detection.
  • Accuracy Score: While less nuanced than the F1-score, accuracy provided a straightforward indicator of overall model performance across all classes.
  • Integrated Gradients-based Visualization [8]: Integrated Gradients interprets model predictions by attributing importance to each word in the input. It computes the integral of gradients along a straight path from a predefined baseline to the input; by measuring how the prediction changes as each word's influence varies along this path, it quantifies each word's contribution to the final prediction (see the sketch after this list).
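
Formally, for an input x, a baseline x′, and a model F, Integrated Gradients assigns the i-th word the attribution IG_i(x) = (x_i − x′_i) · ∫₀¹ ∂F(x′ + α(x − x′))/∂x_i dα. The sketch below shows how such attributions can be computed for a DistilBERT classifier with Captum's LayerIntegratedGradients; the checkpoint name, the all-[PAD] baseline, and the target class index are illustrative assumptions:

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Placeholder checkpoint; substitute a model fine-tuned for hate speech detection.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()


def forward_fn(input_ids, attention_mask):
    return model(input_ids, attention_mask=attention_mask).logits


enc = tokenizer("example tweet to attribute", return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

# Baseline: a sequence of the same length made entirely of [PAD] tokens,
# a common 'neutral' reference point for text models.
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_fn, model.distilbert.embeddings)
attributions = lig.attribute(input_ids,
                             baselines=baseline_ids,
                             additional_forward_args=(attention_mask,),
                             target=1)  # attribute w.r.t. the assumed 'hate' class index

# Sum over the embedding dimension to get one score per token.
scores = attributions.sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids.squeeze(0)), scores.tolist()):
    print(f"{tok:>12s} {s:+.3f}")
```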

Results:

Please refer to the Results section in the final report (here) for tables and figures.

Quantitative Results

We compared the performance of four models – Basic LSTM, BiLSTM, CNN-LSTM, and DistilBERT – across three languages: English, German, and French. DistilBERT achieves the highest accuracy overall in English and German, demonstrating its robust language understanding capabilities. In French, however, BiLSTM outperforms the other models, suggesting that the bidirectional architecture may be particularly effective at capturing the contextual patterns of this language. The CNN-LSTM model shows mixed performance: its accuracy is higher than BiLSTM's in English but significantly lower than both BiLSTM's and DistilBERT's in German and French.

Qualitative Results

For the basic ‘basicLSTM’ model, word importance appears evenly distributed between hate and neutral attributions, suggesting that the model may not engage deeply with the text or fully capture the context of the task.

For the DistilBERT model, most words in the sentences contribute to the prediction of hate speech, indicating that the model understands the context of words within a sentence more effectively.

Among the false positive samples, most contain swear words used merely to express tone, making it difficult for the model to distinguish whether these expressions are hateful.

Among the false negative samples, we find that even humans struggle to recognize the hateful content without specific domain knowledge, such as political terms and internet abbreviations. This suggests that users who aim to avoid detection can do so by increasing the complexity required to understand their posts.

Ethics:

One of the main concerns with datasets in natural language processing is their representativeness. For hate speech detection, it is essential that the dataset reflects the diversity of languages, dialects, cultural contexts, and forms of communication used on social media. A dataset that lacks this diversity can lead to models biased towards the language and speech patterns of the majority group represented in the dataset, producing higher false positive or false negative rates for underrepresented groups and perpetuating existing societal biases. To mitigate this representativeness bias, we evaluated our models on multiple languages, including German and French, and examined their performance in each.

Challenges & Limitations:

One of the foremost challenges in detecting hate speech is accurately distinguishing it from other forms of speech, such as sarcasm or legitimate criticism. This task is complicated by the subtleties and nuances of language, including slang and colloquial expressions, which can differ significantly across communities and evolve over time. Furthermore, ensuring that our detection algorithms remain unbiased and fair for all groups presents a complex challenge, given the diverse manifestations of hate speech. Compiling a comprehensive and accurately annotated dataset is also difficult, requiring meticulous manual review and being prone to interpretation biases.

Our project has notable limitations. Currently, our models do not consider external context such as user profiles, gender, or posting history, which could enhance classification accuracy. Additionally, although we evaluated models on English, German, and French data, each model handles a single language at a time; we do not address cross-lingual or code-mixed hate speech, which limits the applicability of our findings across linguistic contexts.

We use false positives and false negatives to quantify error, but both kinds of error have profound implications. On the one hand, false positives mean that we incorrectly label content as hate speech when it is not, which could infringe on individuals’ freedom of speech. This could lead to unjust censorship, affecting users’ ability to express themselves and engage in meaningful discussions. On the other hand, failing to identify actual hate speech (false negatives) can have detrimental consequences, allowing harmful content to spread unchecked. This can contribute to a hostile online environment, marginalize vulnerable groups, and, in extreme cases, incite violence and radicalization.
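
As a concrete illustration, both error counts fall directly out of a confusion matrix; the labels below are toy values, not our experimental results:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration (1 = hate, 0 = neutral).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"false positives (over-censorship risk): {fp}")
print(f"false negatives (missed hate speech):   {fn}")
```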

To address these ethical considerations, our stretch goal aims to strive for a dataset that encompasses a wide range of linguistic and cultural contexts, potentially by including annotators from diverse backgrounds to label the data. In addition, we will discuss the limitations of our dataset and model, and establish mechanisms for accountability, such as regular audits and updates to the model based on feedback.

Reflection

Project Outcomes and Goals

Our project has shown promising results and has largely met our expectations. We found that DistilBERT was the most effective model in our assessments, confirming our hypothesis about its suitability for hate speech detection. We successfully achieved our base and target goals, which centered on implementing and evaluating the DistilBERT, LSTM, BiLSTM, and CNN-LSTM models. We are also halfway towards achieving our stretch goals, which include the addition of interpretation tools to enhance the models' transparency.

Evolution and Adaptations of the Project

Initially, our project followed a straightforward path, but as we delved deeper, we recognized the importance of model interpretability, especially given our use of multiple languages. This led us to incorporate the Captum package, which allowed us to add interpretation visualizations for each of the three languages we focused on. This pivot not only enriched our project but also expanded our understanding and application of AI in multilingual contexts.
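
For reference, Captum renders per-word attributions as color-highlighted text via its visualization utilities; the sketch below uses placeholder values for every field of the record (scores, probabilities, class names, convergence delta):

```python
from captum.attr import visualization as viz

# Placeholder tokens and attribution scores standing in for real model output.
tokens = ["[CLS]", "example", "tweet", "[SEP]"]
word_attributions = [0.0, 0.62, -0.15, 0.0]

record = viz.VisualizationDataRecord(
    word_attributions,       # per-token attribution scores
    0.91,                    # predicted probability (placeholder)
    "hate",                  # predicted class
    "hate",                  # true class
    "hate",                  # class the attributions target
    sum(word_attributions),  # total attribution score
    tokens,                  # raw tokens to display
    0.01,                    # convergence delta from attribute() (placeholder)
)

# Produces an HTML table of color-coded tokens (best viewed in a notebook).
viz.visualize_text([record])
```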

Areas for Improvement

Given more time, we would like to extend our project to include datasets from low-resource regions. This expansion would test the robustness of our model across more diverse linguistic datasets and potentially improve its utility in underrepresented regions.

Key Takeaways

This project has deepened our familiarity with the deep learning pipeline and has provided valuable practical experience in applying deep learning models to real-world problems. This hands-on experience was crucial in understanding model training and evaluation nuances, highlighting the importance of interpretability in AI, and fostering valuable teamwork to achieve complex project goals.

Repository

  • The GitHub repository of our project can be found here
  • The presentation slide of our project can be found here
  • The final report of our project can be found here

Reference

[1] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, “Predicting the Type and Target of Offensive Posts in Social Media,” arXiv preprint arXiv:1902.09666, 2019.

[2] U. Bretschneider and R. Peters, “Detecting Offensive Statements Towards Foreigners in Social Media,” in Proc. 50th Hawaii Int. Conf. Syst. Sci., 2017.

[3] N. Vanetik, "FTR Dataset," GitHub repository, 2020. [Online]. Available: https://github.com/NataliaVanetik/FTR-dataset

[4] M. S. Jahan and M. Oussalah, "A Systematic Review of Hate Speech Automatic Detection Using Natural Language Processing," Neurocomputing, vol. 526, Art. no. 126232, 2023.

[5] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, "Predicting the Type and Target of Offensive Posts in Social Media," Proceedings of the NAACL-HLT, vol. 1, pp. 1415-1421, 2019.

[6] S. Khan, M. Fazil, V. K. Sejwal, M. A. Alshara, R. M. Alotaibi, A. Kamal, and A. R. Baig, "BiCHAT: BiLSTM with Deep CNN and Hierarchical Attention for Hate Speech Detection," Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 4, pp. 4335-4344, 2022.

[7] P. K. Roy, A. K. Tripathy, T. K. Das, and X. Gao, "A Framework for Hate Speech Detection Using Deep Convolutional Neural Network," IEEE Access, vol. 8, pp. 204951-204962, 2020. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9253658

[8] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic Attribution for Deep Networks," arXiv preprint arXiv:1703.01365, 2017. [Online]. Available: https://arxiv.org/abs/1703.01365

[9] A. Ruderman, N. C. Rabinowitz, A. S. Morcos, and D. Zoran, "Pooling is Neither Necessary nor Sufficient for Appropriate Deformation Stability in CNNs," arXiv preprint arXiv:1804.03367, 2018.
