The CoBRa project stems from a challenge proposed by Gianluca Stringhini for the Resiliency Challenge. Challenge #14, Real-time monitoring of racist rhetoric linked to the Covid19 pandemic, is part of the Collecting, Visualizing or Disseminating Information track targeted at vulnerable and underserved populations. It extends Leonard Schild et al.'s early look at the emergence of online sinophobic behaviour in the face of the Covid19 pandemic.
As early as February, national media and research bodies raised the alarm about a rise in both ambient sinophobia and sinophobic aggressions: Time, Center for Research on Globalization, BBC, Mail & Guardian, New York Times - March 23rd, CNN, TheNextWeb, ABC News, USA Today, New York Times - April 18th, ADL... The phenomenon has grown to such a magnitude that a dedicated page has been created on Wikipedia.
While the design of online social media drives people to express more violent opinions than they would otherwise, hate speech represents only a fraction of the messages shared on social media. However, the sheer volume of content involved means that even this small percentage has an impact. Among other findings, hate speech has been shown to have adverse psychological effects in online college communities, and it can have consequences similar to those of hate crimes, such as psychological trauma and communal fear (Gerstenfeld, Phyllis B. 2017. Hate crimes: Causes, controls, and controversies. Sage Publications.). In addition, there is evidence that online hate speech predicts hate crime, and human rights groups have argued that exposure to online hate speech normalises such hatred for majority groups.
In this context, tracking the changes in online sinophobic hate speech as the Covid19 pandemic progresses is a logical contribution to the response to the consequences of the pandemic. As in many other data analysis projects, two approaches are available for our research. A quantitative analysis can study the prevalence of sinophobic hate speech on online social media as a proxy for the evolution of sinophobic opinions in the population. A qualitative analysis can focus on the content of sinophobic hate speech to understand the racist narratives emerging from the current crisis. Both approaches are of interest to characterise and respond to the current increase in sinophobia.
We aim to:
- Look for trends in the content of sinophobic hate speech on Twitter in the context of the Covid19 pandemic.
- Raise awareness of this emergent hatred phenomenon.
- Provide anti-racist organisations with contextualised information to support their actions.
What it does
Research question: Has the Covid19 pandemic impacted the vocabulary or stereotypes found in sinophobic hate speech on Twitter?
Using a public dataset of Covid19-related tweets shared by the Panacea Lab, we carried out analyses on 22M original English tweets related to Covid19 between March 11th and April 18th 2020.
Our first analysis pipeline classifies the tweets to detect anti-Black and anti-Asian hate speech. The data is then aggregated by week to study how the prevalence of online sinophobic hate speech evolves in relation to Covid19.
Our second analysis pipeline combines word embedding and thematic coding to measure trends in the narratives linked to either China/Chinese or kungflu/wuflu (sinophobic alternate names for Covid19).
Public web app
Our web application presents our complex data as comprehensible, engaging visualisations. Users can interact with the graphs and analyse trends in sinophobic hate speech on Twitter, specifically in relation to the Covid19 pandemic. Beyond displaying our data, the webpage also serves to raise awareness of how the pandemic has influenced hate speech. By including data showing the reach of sinophobic hate speech, together with testimonials from user interviews (anonymised to protect users’ privacy), the website gives visitors a thorough understanding of the current impact of sinophobia. Finally, our website will also provide anti-racist organisations with contextualised information to support their outreach efforts.
The purpose of our website is to inform visitors of the effects the pandemic has had on a specific portion of our population, and of the importance of supporting those affected.
How we built it
We downloaded version 6 of a public research dataset containing a large number of Covid19-related tweets. The data is shared as tweet IDs, which must be hydrated to retrieve the full tweet information; we did this using the Hydrator GUI.
We performed the preprocessing in Python, encapsulated in a pipenv virtual environment. We filtered the dataset to keep only tweets labelled as English by Twitter and, among those, only original tweets (excluding retweets, whether made via the Twitter interface or manually). To optimise storage, we dropped most of the information provided for each tweet, keeping only the id, publication date, full text, user id, user location, user verified status, coordinates, favourites count, and hashtag entities.
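The filtering step above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes hydrated tweets follow the Twitter API v1.1 JSON layout (`lang`, `retweeted_status`, `full_text`, `user`, `entities`), and the `slim_tweet` helper and the "RT @user" heuristic for manual retweets are assumptions made for the example.

```python
import re
from typing import Any, Dict, Optional

# Assumed heuristic: a manual retweet starts with "RT @username".
MANUAL_RT = re.compile(r"^\s*RT\s+@\w+", re.IGNORECASE)


def slim_tweet(tweet: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a reduced record for an original English tweet, else None."""
    if tweet.get("lang") != "en":
        return None  # not labelled as English by Twitter
    if "retweeted_status" in tweet:
        return None  # retweet made via the Twitter interface
    text = tweet.get("full_text") or tweet.get("text", "")
    if MANUAL_RT.match(text):
        return None  # manual retweet (assumed heuristic)
    user = tweet.get("user", {})
    hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    # Keep only the fields listed in the text above.
    return {
        "id": tweet["id"],
        "created_at": tweet.get("created_at"),
        "full_text": text,
        "user_id": user.get("id"),
        "user_location": user.get("location"),
        "user_verified": user.get("verified"),
        "coordinates": tweet.get("coordinates"),
        "favorite_count": tweet.get("favorite_count"),
        "hashtags": hashtags,
    }
```

Applying `slim_tweet` to each hydrated tweet and discarding the `None` results yields the reduced dataset.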
The text of the tweets was prepared for further analysis by cleaning the URLs and special characters specific to Twitter, following recommendations by Dimitrios Effrosynidis et al. (Techniques 0, 1, 3, 4, 5 + lowercase).
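As a rough illustration of this kind of cleaning, the sketch below removes URLs and user mentions, keeps the word behind each hashtag sign, collapses whitespace and lowercases the text. The exact rules are assumptions chosen for the example; it does not claim to reproduce the numbered techniques from Effrosynidis et al.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASH_SIGN = re.compile(r"#(\w+)")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip Twitter-specific noise and lowercase (assumed rules)."""
    text = URL.sub(" ", text)          # drop links
    text = MENTION.sub(" ", text)      # drop @user mentions
    text = HASH_SIGN.sub(r"\1", text)  # keep the hashtag word, drop '#'
    text = WHITESPACE.sub(" ", text).strip()
    return text.lower()
```

For instance, `clean_text("Check THIS https://t.co/abc @user #StaySafe  now")` returns `"check this staysafe now"`.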
We performed the classification in Python, encapsulated in a pipenv virtual environment.
Our hate speech classifier relies on a home-made vocabulary list of slurs against different ethnicities and religions, and on the Perspective API built by the Google group Jigsaw. The vocabulary list was compiled from our survey of hate-speech detection papers, hatebase.org and Wikipedia's list of slurs.
The classifier first searches the tweet text for slurs belonging to each category, which results in a dictionary with (key, value) pairs following this format: (“ethnicity”: 1 if corresponding slur else 0). It then sends the text of every tweet with at least one slur category detected to the Google Perspective API. Tweets that contain at least one slur and have a Severe Toxicity score above 0.3 are saved, together with the slurs detected and the Severe Toxicity score. The final threshold for the Severe Toxicity score was selected by inspecting the text of tweets in different score ranges to determine the best pivot point.
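The two-stage logic above can be sketched as follows. This is an illustrative outline under stated assumptions: the vocabulary uses neutral placeholder tokens instead of real slurs, the category names are invented for the example, and `score_fn` is a hypothetical stand-in for a call that returns the Perspective API's Severe Toxicity score for a text. No real API call is made here.

```python
import re
from typing import Callable, Dict, List, Optional

# Placeholder vocabulary: neutral tokens stand in for the real slur lists.
SLUR_VOCAB: Dict[str, List[str]] = {
    "anti_asian": ["slur_a1", "slur_a2"],
    "anti_black": ["slur_b1"],
}
THRESHOLD = 0.3  # Severe Toxicity threshold reported in the text


def detect_slurs(text: str) -> Dict[str, int]:
    """Map each category to 1 if one of its terms appears in the text, else 0."""
    tokens = set(re.findall(r"[\w-]+", text.lower()))
    return {
        category: int(any(word in tokens for word in words))
        for category, words in SLUR_VOCAB.items()
    }


def classify(text: str, score_fn: Callable[[str], float]) -> Optional[dict]:
    """Return a record for tweets flagged as hate speech, else None."""
    flags = detect_slurs(text)
    if not any(flags.values()):
        return None  # no slur found: the scoring service is never queried
    score = score_fn(text)
    if score <= THRESHOLD:
        return None  # slur present but not toxic enough
    return {"text": text, "slurs": flags, "severe_toxicity": score}
```

Running the cheap vocabulary check first means the rate-limited scoring service is only queried for the small share of tweets that contain a listed term.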
From there, time series of frequencies for anti-Black and anti-Asian hate speech in our corpus are easily generated and sent to the public web app.
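A weekly frequency series of this kind could be computed as sketched below. The sketch assumes that "frequency" means the share of tweets flagged as hate speech among all tweets of the week, and that weeks are counted from March 11th 2020; the function names are illustrative.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

START = date(2020, 3, 11)  # first day of the study period


def week_index(day: date, start: date = START) -> int:
    """Number of full weeks elapsed since the start date."""
    return (day - start).days // 7


def weekly_frequency(records: Iterable[Tuple[date, bool]]) -> Dict[int, float]:
    """Share of flagged tweets per week, from (publication day, flagged) pairs."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for day, is_hate in records:
        week = week_index(day)
        totals[week] += 1
        flagged[week] += int(is_hate)
    return {week: flagged[week] / totals[week] for week in sorted(totals)}
```

Normalising by the weekly total keeps the series comparable across weeks with different tweet volumes.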
Due to the collection process, the daily number of tweets before March 11th is too low for the type of analyses we wished to perform, so we restricted the dataset to tweets published between March 11th and April 18th. In addition to the preprocessing common to both pipelines, we removed the punctuation (except for hyphens) from the text of the tweets.
We performed the word embedding calculations in C, using the word2vec package, which can compute the cosine similarity between the word embedding vectors of two given words. This method is particularly suited to our problem because we want to identify words close in usage to given keywords and have no prior knowledge of the pairs we wish to study. It is also well suited to large vocabularies (the set of distinct words in the corpus) and to large corpora of short documents.
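The punctuation step can be illustrated in a few lines. This sketch assumes punctuation is deleted rather than replaced by a space, and it only covers ASCII punctuation.

```python
import string

# Every ASCII punctuation character except the hyphen.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("-", ""))


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation while keeping hyphens (e.g. in compound words)."""
    return text.translate(PUNCT_TABLE)
```

Keeping hyphens preserves hyphenated compounds as single tokens for the word embedding step.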
We ran the word2vec pipeline on weekly splits of the dataset, starting on March 11th.
We recorded the 40 closest words to each keyword of the following list, with corresponding similarity score: virus, corona, coronavirus, covid, kungflu, kung, wuflu, infection, infecting, disease, bat, bats, pangolin, pangolins, quarantine, china, chinese, asian, asia, wuhan, chink, slant, foreign, dogs, immigration, immigrants, home, country, citizenship, disgusting, alien, bitch, hoe.
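To make the similarity lookup concrete, here is a small pure-Python sketch of cosine similarity and a nearest-words query. The project itself used the C word2vec package; this sketch does not reproduce that tool, and the `nearest` helper and toy vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(
    keyword: str, vectors: Dict[str, Sequence[float]], n: int = 40
) -> List[Tuple[str, float]]:
    """Return the n words most similar to the keyword, with their scores."""
    target = vectors[keyword]
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word != keyword
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy two-dimensional vectors, invented for the example.
toy = {"china": [1.0, 0.0], "wuhan": [0.9, 0.1], "home": [0.0, 1.0]}
closest = nearest("china", toy, n=1)  # the first entry is "wuhan"
```

Running such a query on embeddings trained separately for each week gives one neighbour list per keyword and per week, which is what the thematic coding then works from.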
For a first analysis, we selected the four keywords for which the results appeared the most promising, that is, those whose neighbours referred both to the virus or its consequences and to China or Chinese people, especially words conveying a sentiment or an opinion on the situation: China, Chinese, Kungflu and Wuflu. Using a thematic coding process, we classified the words similar to those keywords into seven topics: Chinese government-related, Locations, Biowar, Cover-up/lies, Political calls for action, Alternative names for Covid19, China as other/dangerous.
From there, time series for each topic (either by keyword or aggregated over the four keywords) are easily generated and sent to the public web app.
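One possible way to turn coded neighbour words into topic time series is sketched below. It assumes the series counts how many of a keyword's neighbours fall under each topic per week, which the text does not specify. The topic names come from the list above, but the word-to-topic entries in `TOPIC_OF` are invented placeholders, not results from the study.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical hand-coded mapping from neighbour word to topic.
TOPIC_OF: Dict[str, str] = {
    "ccp": "Chinese government-related",
    "wuhan": "Locations",
    "bioweapon": "Biowar",
}


def topic_counts(neighbours_by_week: Dict[int, List[str]]) -> Dict[int, Counter]:
    """Count coded neighbour words per topic for each week (uncoded words are skipped)."""
    return {
        week: Counter(TOPIC_OF[word] for word in words if word in TOPIC_OF)
        for week, words in sorted(neighbours_by_week.items())
    }


def aggregate(per_keyword: Dict[str, Dict[int, Counter]]) -> Dict[int, Counter]:
    """Sum the weekly topic counts over several keywords."""
    total: Dict[int, Counter] = {}
    for series in per_keyword.values():
        for week, counts in series.items():
            total.setdefault(week, Counter()).update(counts)
    return total
```

`topic_counts` gives the per-keyword series and `aggregate` the series combined over the four keywords.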
Public web app
When designing the website, we first conducted user research to determine the best way to design the website’s layout. We created a user research guide that includes inquiry templates and questionnaires for specific target audiences. We’ve reached out to anti-racist organisations and Asian workshops as well as individuals who would be interested in engaging with sites like ours. We are currently waiting for the results.
We based the current layout of the web app on our survey of published sociology works on the topic. We kept our design flexible so that it can evolve as we gather our interviewees’ responses.
Taking into account the sensitivity of the topic, we have chosen not to display examples of tweets classified as sinophobic hate speech by our analysis process. If users are so inclined, they can easily search Twitter for keywords and hashtags associated with Covid19-related sinophobic hate speech, but we do not wish to expose users to violent content or to shame the authors of this content.
The web app is a static React site bootstrapped with Create React App in TypeScript. We use nivo (nivo.rocks), an easy-to-use data visualisation library that also supports TypeScript. The site is hosted via Netlify, and data will either be statically embedded or pulled from a Firestore Realtime database.
Challenges we ran into
Challenges we encountered can be split into two categories: technical challenges and organisational challenges.
- Obtaining historical Twitter data. Twitter offers an API to acquire real-time data (about 1% of all data), but because of their terms of service, sharing large databases is difficult. Researchers tend to collect data filtered for their specific research question, while we needed unfiltered data for our results to be representative of the prevalence of sinophobia on Twitter.
- Optimising code for computing time, memory usage and storage, due to the large amount of data involved. In addition, some processing steps, such as the classification of tweets as hate speech or not, rely on APIs whose query rate is capped, limiting our analysis speed.
- Getting to grips with the nivo library for interactive data visualisation.
- Mobilising an audience not accustomed to hackathons, in particular recruiting team members from the sociology field. Contacting representatives of anti-racist movements has also proven difficult in this context, where many offices are closed and "non-essential" activities are curtailed.
- Adapting the design process, and especially user research tests, to the very strict time constraints of the Resiliency Challenge.
- Coordinating a remote team spread over disparate time zones.
- Coordinating a team with very different backgrounds and experience without a shared physical space.
Accomplishments that we are proud of
- Experimenting with and deploying efficient internal communication strategies suitable for our remote global team.
- Producing a proof-of-concept deliverable within a three-week time frame.
- Gathering a strong theoretical background for our research approach.
- Mobilising an interdisciplinary team and senior supporters on a project that is not a typical technology-oriented hackathon project.
What we learned
- Importance of communication, especially when working remotely and working with people from very distinct fields.
- Tools and good practice for working with remote global teams.
- New machine learning methods for natural language processing (hate speech detection, word embedding, topic modelling).
- Tools and good practice for remote programming and computer calculation (SSH protocols, tmux, bash loops, ...).
- Searching the scientific literature (Google Scholar, PubMed-NCBI, Sci-Hub, arXiv, ...).
What's next for CoBRa
We plan to continue the project during Sprint 3 of the Resiliency Challenge. In those additional three weeks, we will focus on:
- Running our analysis pipelines on a wider dataset provided by Gianluca Stringhini. We will analyse original English tweets from November 1st 2019 to May 31st 2020 to be able to compare results from different stages of awareness and spread of the pandemic.
- Adding a topic modelling analysis pipeline to bring more insights on the sinophobic narratives born from Covid19. The icing on the cake would be to also add a pipeline to study the diffusion patterns of those narratives (either with Hawkes processes or through network methods).
- Working with anti-racist organisations to deliver a web app as helpful as possible for them. By reaching out to stakeholders, we want to ensure that the way we present our results will be helpful to support anti-racist actions and that we provide the necessary contextual information, from (anonymous) testimonials from user interviews to sharing scientific resources and information on outreach actions by anti-racist movements.
- Documenting and packaging both the research process and the web app so that others can take our work over either for research or action-oriented purposes. We will be particularly careful to clearly outline the working hypotheses and limits of our methods.
- Ensuring that our web app is accessible to as many people as possible, by addressing common web accessibility issues such as responsiveness, readability for visually impaired and colour-blind people, alternative text descriptions, ...