Previously: https://devpost.com/software/cobra-covid-brings-racism
**IMPORTANT NOTE: Please check out our website on Chrome or Firefox for best viewing experience. Thank you!**
Inspiration
The CRANE project stems from Gianluca Stringhini's proposed challenge for the Resiliency Challenge. Challenge #14, Real-time monitoring of racist rhetoric linked to the Covid19 pandemic, part of the Collecting, Visualizing or Disseminating Information track targeted at vulnerable and underserved populations, is an extension of Leonard Schild et al.'s early look at the emergence of online sinophobic behavior in the face of the Covid19 pandemic.
"The anonymity and ubiquity of the social media provides a breeding ground for hate speech and makes combating it seems like a lost battle." Udanor (2019)
As early as February, national media and research bodies raised the alarm about a rise in both ambient sinophobia and sinophobic aggressions: Time, the Center for Research on Globalization, the BBC, the Mail & Guardian, the New York Times (March 23rd), CNN, TheNextWeb, ABC News, USA Today, the New York Times (April 18th), the ADL, and others. The phenomenon has grown to such magnitude that a dedicated page has been created on Wikipedia.
Social media platforms allow any and all users to communicate with each other and reach a large audience, bringing a "transformative revolution in our society" (Mondal et al., 2017), with both wonderful opportunities for connection and devastating ramifications. While the design of online social media drives people to express more violent opinions than they would otherwise, hate speech represents only a fraction of the messages shared on social media. However, the sheer volume of content involved means that even this small percentage has an impact. Among other findings, hate speech has been shown to have adverse psychological effects in online college communities, and it can have consequences similar to those of hate crimes, such as psychological trauma and communal fear (Gerstenfeld, 2017, Hate Crimes: Causes, Controls, and Controversies, Sage Publications).

In addition, there is evidence that online hate speech predicts hate crime, and human rights groups have argued that exposure to online hate speech normalizes such hatred for majority groups. Though measures are being taken to combat such rhetoric, it is difficult to enforce restrictions on the internet. Legislation has been passed in many countries in an attempt to prosecute those behind the spread of inflammatory and harmful language, but there is a continual tension between holding people accountable and encroaching on users' right to free speech, as illustrated by the Irish Hate Track project. Moreover, the vast and ever-growing digital spaces make such content more and more difficult to manage.
In this context, tracking the changes in online sinophobic hate speech as the Covid19 pandemic progresses is a logical contribution to the broader pandemic response. As with many data analysis endeavours, two approaches are available for our research. A quantitative analysis can study the prevalence of sinophobic hate speech on online social media as a proxy for the evolution of sinophobic opinions in the population. A qualitative analysis can focus on the content of sinophobic hate speech to understand the racist narratives emerging from the current crisis. Both approaches are of interest to characterize and respond to the current increase in sinophobia.
Goals
We aim to:
- Look for trends in the content of sinophobic hate speech on Twitter in the context of the COVID-19 pandemic.
- Raise awareness of this emerging phenomenon of hatred.
- Provide anti-racist organizations with contextualized information to support their actions.
What it does
Research approach
Research question: Has the Covid19 pandemic impacted the vocabulary or stereotypes found in sinophobic hate speech on Twitter?
Using daily archives of tweets shared by our mentor Gianluca Stringhini, we carried out analyses on nearly 100M original English tweets between 1st November 2019 and 30th April 2020.
Conscious that the time-frame of the Resiliency Challenge did not allow for the development of new machine-learning methods, we chose to implement existing pipelines and evaluate their value to answer our research question. We implemented four analysis pipelines:
- to compute the daily frequency of known sinophobic slurs;
- to classify tweets as sinophobic hate speech and compute the relative volume of those classified tweets;
- to identify words used in a similar context to keywords from a list;
- to identify the main topics for a given period of time.
We give a general outline of the methodology for each of these pipelines in the How we built it section.
Public web app
Our web application transforms our complex data into comprehensible, engaging data visualizations. Users are able to interact with the graphs and analyze trends surrounding sinophobic hate speech on Twitter, specifically in relation to the Covid19 pandemic. While an important purpose of the webpage is to display our data, it also serves to raise awareness of how the pandemic has influenced hate speech. By including data showing the reach of sinophobic hate speech alongside testimonials from user interviews (anonymized for users' privacy), website visitors get a thorough understanding of the current impact of sinophobia. Finally, our website will also provide anti-racist organizations with contextualized information to support their outreach efforts.
The purpose of our website is to inform visitors of the effects the pandemic has had on a specific portion of our population, and the importance of supporting those affected.
How we built it
All our preprocessing and analyses are performed in Python (encapsulated in a pipenv environment) and C. _Additional methodological details are available in our GitHub repository._
Data collection
We are working with daily archives provided by Gianluca Stringhini, from 1st November 2019 to 30th April 2020, with a few missing days. We filtered the dataset to keep only tweets labelled as English by Twitter, and among those only original tweets (i.e. not retweets, whether made via the Twitter interface or manually). To optimize storage, we dropped most of the information provided for each tweet, keeping only the id, publication date, and full text. Once filtered, each archive corresponds to about 600k original English tweets.
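A minimal sketch of this filtering step is shown below. It assumes line-delimited, gzipped JSON archives using the standard Twitter v1.1 field names (`lang`, `full_text`, `retweeted_status`, etc.); the exact archive layout is an assumption, not a description of the shared dataset.

```python
import gzip
import json

def filter_archive(path):
    """Yield minimal records for original English tweets in one daily archive (assumed format)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            # Keep only tweets labelled as English by Twitter.
            if tweet.get("lang") != "en":
                continue
            # Drop native retweets and manual "RT @..." copies.
            text = tweet.get("full_text") or tweet.get("text", "")
            if "retweeted_status" in tweet or text.startswith("RT @"):
                continue
            # Keep only the fields we need: id, publication date, full text.
            yield {"id": tweet["id_str"], "created_at": tweet["created_at"], "full_text": text}
```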
Preprocessing
Our preprocessing was adapted from Effrosynidis et al. (2017). It deals with unicode, URLs, mentions, hashtags, punctuation, contractions, numbers and newlines. We have added variants as options for URLs, mentions, hashtags, punctuation and numbers. In particular, the preprocessing function can segment hashtags and replace numbers by their text version.
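For illustration, a simplified version of this preprocessing could look like the sketch below; the actual options, ordering, and hashtag segmentation in our repository may differ.

```python
import re
import unicodedata

def preprocess(text, keep_hashtag_text=True):
    """Simplified cleaning: unicode, newlines, URLs, mentions, hashtags, contractions, punctuation."""
    text = unicodedata.normalize("NFKC", text).replace("\n", " ")        # unicode + newlines
    text = re.sub(r"https?://\S+", " ", text)                            # URLs
    text = re.sub(r"@\w+", " ", text)                                    # mentions
    text = re.sub(r"#(\w+)", r"\1" if keep_hashtag_text else " ", text)  # hashtags
    text = re.sub(r"won't", "will not", text)                            # a couple of contractions
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"[^\w\s]", " ", text)                                 # punctuation
    return re.sub(r"\s+", " ", text).strip().lower()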
Analysis
Frequency of known slurs
This pipeline performs a simple quantitative analysis over the dataset. Given a list of slurs (with variants), it computes the daily and weekly frequency of these slurs in the dataset.
We tracked the following slurs: "chink", "bugland", "chankaro", "chinazi", "gook", "insectoid", "bugmen", "chingchong"/"ching-chong"/"ching chong".
Those results can be found on our web app, and the pipeline is fully documented and can be run on other datasets and keywords.
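As an illustration of the methodology, the counting step can be sketched as follows; the per-day file handling is assumed, and the documented pipeline in the repository remains the reference.

```python
import re
from collections import Counter

SLURS = ["chink", "bugland", "chankaro", "chinazi", "gook",
         "insectoid", "bugmen", "chingchong", "ching-chong", "ching chong"]
# Match whole words/phrases, case-insensitively.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SLURS)) + r")\b", re.IGNORECASE)

def daily_frequency(tweets):
    """Return the number of occurrences of each slur per tweet for one day of tweet texts."""
    counts, n_tweets = Counter(), 0
    for text in tweets:
        n_tweets += 1
        counts.update(m.lower() for m in PATTERN.findall(text))
    return {slur: n / n_tweets for slur, n in counts.items()} if n_tweets else {}
```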
Classification of anti-black and anti-asian hate speech
This pipeline performs a classification task to detect anti-asian hate speech tweets (src/classifier.py), then computes the daily frequency of the classified tweets in the dataset (src/statsForClassifier.py).
!!! The accuracy of this classifier has not been tested yet; we are working on obtaining a test set. As such, its results are not presented in our web app. !!!
The classifier (src/classifier.py) relies on a two-step approach: it flags tweets containing slurs against different ethnicities, then evaluates the toxicity of those tweets using Google's Perspective API.
The list of slurs to detect was compiled from a survey of hate-speech detection papers, hatebase.org and Wikipedia's list of slurs.
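A condensed sketch of this two-step approach is given below. The endpoint and attribute name follow the public Perspective API documentation as we understand it; the API key, the toxicity threshold and the `slur_pattern` regex (assumed to be built from the compiled slur list) are placeholders. See src/classifier.py for the actual implementation.

```python
import requests

# Placeholder key; Perspective is queried over REST.
PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                   "comments:analyze?key=YOUR_API_KEY")

def toxicity(text):
    """Return the Perspective TOXICITY summary score (0-1) for one tweet."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, json=body, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def classify(tweet, slur_pattern, threshold=0.8):
    """Step 1: flag tweets containing a slur; step 2: keep only the highly toxic ones."""
    return bool(slur_pattern.search(tweet)) and toxicity(tweet) >= threshold
```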
Word embeddings
The word embedding analysis pipeline is based on Gianluca Stringhini's paper; it makes use of the word2vec package, which is included as a submodule of our repo.
For each keyword and each time period, the pipeline yields the 40 words appearing in the most similar context to the keyword over the time period.
Our current keyword list is: [virus, corona, coronavirus, covid, kungflu, kung, wuflu, infection, infecting, disease, bat, bats, pangolin, pangolins, quarantine, china, chinese, asian, asia, chingchong, wuhan, chink, chinaman, jap, slant, immigration, immigrant, immigrants, country, disgusting, alien]. We have topic-coded the words obtained for each keyword, highlighting trends in the topics linked to a number of keywords. Those results can be found on our web app.
This pipeline is fully documented and can be run on other datasets and keyword lists.
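For readers who prefer Python end to end, an equivalent sketch using gensim (4.x) instead of the C word2vec submodule could look like this; it illustrates the method rather than the pipeline itself, and the hyperparameters are placeholders.

```python
from gensim.models import Word2Vec

def context_words(tokenized_tweets, keywords, topn=40):
    """Train one model per time period and list the closest context words per keyword."""
    model = Word2Vec(sentences=tokenized_tweets, vector_size=100,
                     window=5, min_count=5, workers=4)
    neighbours = {}
    for kw in keywords:
        if kw in model.wv:  # skip keywords too rare to have a vector
            neighbours[kw] = [w for w, _ in model.wv.most_similar(kw, topn=topn)]
    return neighbours
```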
Topic modelling with LDA
This pipeline performs a rather standard Latent Dirichlet Allocation analysis for topic modelling, using gensim's implementation.
!!! It appears that standard LDA under-performs on short texts like tweets, as evidenced by the omnipresence of "beer" in our results for Covid19-related tweets. This pipeline will be reworked with known adaptations of LDA for Twitter analysis. As such, its results are not presented in our web app. !!!
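The current pipeline amounts to something like the following gensim sketch (parameter values are placeholders); the planned rework would swap in an LDA variant adapted to short texts.

```python
from gensim import corpora
from gensim.models import LdaModel

def topics_for_period(tokenized_tweets, num_topics=10, num_words=10):
    """Standard LDA over one period of tokenized tweets."""
    dictionary = corpora.Dictionary(tokenized_tweets)
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare/ubiquitous tokens
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_tweets]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=0)
    return lda.print_topics(num_topics=num_topics, num_words=num_words)
```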
Public web app
When designing the website, we first conducted user research to gain a better understanding of public perception of the Covid19 Pandemic and its effects on hate speech and racism. To do so, we held ethnographic interviews with individual stakeholders and shared an online survey.
The goal of this user research was to determine the best way to design the website's information architecture, data visualization, and takeaways. We created a user research guide that includes inquiry templates and questionnaires for specific target audiences. We reached out to anti-racist/advocacy organizations and Asian workshops, as well as individuals who would be interested in engaging with sites like ours. We received some interesting findings, but while we would have liked to spend the last week hosting more interviews, due to recent events regarding race relations in the United States, we decided to put such user research methods on hold out of respect for the current situation. Instead, we continued researching the sociological implications of our research and wrote a framework backing our data. We also conducted market research to become familiar with similar products on the web and gauge how successfully they achieve their desired outcomes.
Our final product was designed with the user in mind. Every design consideration, from layout to copywriting to visual elements, was made based on the research we conducted, ensuring that every component of the product was made for the user. We kept our design flexible so the website and our research can evolve following the completion of the hackathon.
Taking into account the sensitivity of the topic, we have chosen not to display examples of tweets classified as sinophobic/xenophobic hate speech by our analysis process. If users are so inclined, they can easily search Twitter for keywords and hashtags associated with Covid19-related hate speech, but we do not wish to expose users to violent content nor shame the authors of this content.

The web app is a static React app bootstrapped using Create React App. We are also using Nivo.rocks as a data visualization library and AntD as a basic design framework. The site is hosted on the Cloudflare Workers Sites CDN, and the data for the graphs is statically embedded. We chose to embed the data statically since we have not yet fleshed out the pipeline for providing automatic updates with the latest data from Twitter. In the future, we would like to move the serving of this data to something such as Firestore to enable dynamic updating of the graphs.
Challenges we ran into
The recent death of George Floyd and the associated protests made it imperative that we tread carefully around our topic. We had to make the difficult decision to put some of our user research on hold to ensure that we were respectful of the current climate.
Over the course of the Resiliency Challenge, we encountered several challenges which can be split into two categories: organizational and technical challenges.
Organizational challenges:
- Deciding on a use case to maximize our impact, as there were several possibilities, from victims of sinophobia to anti-racist organizations, to public stakeholders, to the general public.
- Mobilizing an audience not accustomed to hackathons, in particular recruiting team members from the sociology field. Contacting representatives of anti-racist movements has also proven difficult in this context where many offices are closed and "non-essential" activities are curtailed.
- Coordinating a remote team spread over disparate time zones.
- Coordinating a team with very different backgrounds and experience without a shared physical space. In particular, interfacing the different parts of the project and finding common ground between the academic culture, the programming/development culture and the design culture.
Technical challenges:
- Obtaining historical Twitter data. Twitter offers an API to acquire real-time data (about 1% of all data) but because of their terms of service sharing large databases is difficult. Researchers tend to collect data filtered for their specific research question, while we needed unfiltered data for our results to be representative of the prevalence of sinophobia on Twitter.
- Adapting the design process, and especially user research tests, to the very strict time constraints of the Resiliency Challenge, and to remote working. Interviews cannot be conducted the same when not in-person.
- Code optimization for computing time, memory usage and storage, due to the large amount of data involved.
Accomplishments that we are proud of
We are proud of the team we have built, a "talented, driven, compassionate" one, in the words of one of our members.
As a team, we are proud of having:
- Experimented and deployed efficient internal communication and task management strategies suitable for our remote global team.
- Produced a proof-of-concept deliverable inside a six-week time frame.
- Gathered a strong theoretical background for our research approach.
- Mobilized an interdisciplinary team and senior supports on a project that is not a typical technology-oriented hackathon project.
As individuals, some of us are proud of having:
- Spoken up.
- Managed to support teammates in their own tasks.
- Made new connections.
- Conducted the initial user research that the rest of the project stands on.
What we learned
Human knowledge and skills we learned:
- The depth and complexity of studying historically and culturally charged topics like racism, and the less commonly known issues among the many shapes racism can take.
- Importance of communication, especially when working remotely and working with people from very distinct fields.
- How to interface design and development, what each sub-team expects and needs from the other.
- Tools and good practice for working with remote global teams.
- Discovering other cultures, in a real-life setting that defies stereotypes.
Technical knowledge and skills we learned:
- Branding and logo design.
- Exploring and designing wireframes.
- New machine learning methods for natural language processing (hate speech detection, word embedding, topic modelling).
- Tools and good practices for remote programming and computation (SSH, tmux, bash loops, ...).
- Searching the scientific literature (Google Scholar, Pubmed-NCBI, Sci-Hub, ArXiV, ...).
What's next for CRANE
Following the end of the hackathon, and after internal discussion within our team and with Gianluca and the Resiliency Challenge organizers, we believe this web-based analysis and resource hub can expand beyond the research conducted on Covid19's effects on sinophobia and xenophobia. The past week has shown how prevalent and significant the issue of race is around the world, and this project has the potential to analyze discourse online and house resources for those in need.
Our project will be deployed and serve as a resource for future research. We plan to transform our work into a toolbox with data analysis pipelines and good-practice recommendations for data visualizations and outreach deliverables. This toolbox would allow non-computer-science researchers and anti-racist organizations to continue our research on the short-term and lasting effects of racial hate speech online.
In addition, we are considering expanding this project ourselves into an all-encompassing research project, in which we would collaborate with a sociology team to combine their expertise with our NLP tools and create a framework to build our current project upon. We have also considered transforming the website into a well-rounded resource to aid in the fight against Covid19-related xenophobia/sinophobia, working with existing digital platforms such as Heartmob and nationally renowned nonprofits like the Asian Pacific Network. In the last 48 hours, Boston University has announced the creation of the Center for Antiracist Research. We believe that this center and the resources it provides, such as the COVID Racial Data Tracker (found here: https://covidtracking.com/race), offer great potential as resources for our project. The goal of their tracker is to gather the most complete race and ethnicity data on Covid19 in the United States. Such a resource would not only allow us to sustainably expand from our current focus on xenophobia in the wake of Covid19, but also give us an expansive look at the effects the virus has had on race and ethnicity issues at large, and provide us with the proper framework to expand past Covid19 to future crises.
Our GitHub
Built With
- d3.js
- google-perspective-api
- machine-learning
- natural-language-processing
- nivo
- pipenv
- python
- react
- word2vec