We want to:
- Provide resources to survivors of sexual assault
- Contextualize the movement with a sexual assault reporting database
- Analyze #MeToo tweets in aggregate
- Show our users that they are not alone
What it does
The Front Page
The front page of the website welcomes the user with critical information and statistics about sexual assault and outlines the vision behind the dashboard.
There are three primary data sources integrated into You're Not Alone. The first is an open source database about sexual assault in academia called Sexual Harassment In the Academy: A Crowdsource Survey, which is curated by Dr. Karen Kelsky. We visualize the gender breakdown of the survivors in the database as well as their academic backgrounds with interactive bar graphs.
The second data source is Twitter, where the #MeToo hashtag is widely used.
The third data source is our live-updating survey database, populated by our custom-designed survey.
How I built it
The entire online data dashboard is built using R Shiny.
The Twitter data is ingested using an API wrapper called Tweepy, which has built-in functionality for querying the API by keyword. In this case, we queried tweets with #MeToo in the text body. Additional variables, including retweet status, gender, and ethnicity, were inferred from the raw data.
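As a rough illustration of the inference step, the sketch below derives a hashtag flag and a retweet flag from raw tweet records. It assumes the tweets have already been fetched and flattened into dictionaries; the field names (`text`, `user_name`), the `derive_fields` helper, and the sample records are all hypothetical and are not taken from the project's code.

```python
import re

# Case-insensitive match for the hashtag as a whole token.
HASHTAG = re.compile(r"#metoo\b", re.IGNORECASE)

def derive_fields(tweet):
    """Return a copy of a raw tweet record with inferred columns added."""
    text = tweet["text"]
    return {
        **tweet,
        "has_metoo": bool(HASHTAG.search(text)),
        # Assumption: classic retweets begin with "RT @user:" in the text body.
        "is_retweet": text.startswith("RT @"),
    }

# Made-up records standing in for the API response.
raw = [
    {"user_name": "Jane Doe", "text": "#MeToo. You are not alone."},
    {"user_name": "Sam Lee", "text": "RT @janedoe: #MeToo. You are not alone."},
    {"user_name": "Alex Kim", "text": "Good morning!"},
]
rows = [derive_fields(t) for t in raw]
metoo_rows = [r for r in rows if r["has_metoo"]]
```

Keeping the derivation in a pure function like this makes it easy to rerun over a stored tweet history without touching the API again.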
To infer each user's gender, we used a Python package called SexMachine, which employs a gender classification model trained on over 40,000 first names. The model classifies each first name as unknown, male, female, mostly male, or mostly female. Ethnicity was inferred using the ethnicolr package, which draws its predictions from a combination of US Census data, Florida voter registration data, and Wikipedia data. Unfortunately, the model classifiers and diagnostics for these packages were not listed on the packages' websites.
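The name-based inference could be wired up roughly as follows. This is a minimal sketch, not the project's actual pipeline: `first_name`, `tally_genders`, and the toy lookup table are hypothetical, and the classifier is passed in as a plain callable so that a real model (for example, a SexMachine detector's lookup method, which is an assumption here) could be swapped in.

```python
from collections import Counter

def first_name(display_name):
    """Take the first whitespace-separated token, capitalized, as the first name."""
    parts = display_name.strip().split()
    return parts[0].capitalize() if parts else ""

def tally_genders(display_names, classify):
    """Count predicted labels; `classify` maps a first name to a label string."""
    return Counter(classify(first_name(n)) for n in display_names)

# Toy stand-in for the real classifier, for illustration only.
TOY_LOOKUP = {"Jane": "female", "John": "male"}

def toy_classify(name):
    return TOY_LOOKUP.get(name, "unknown")

counts = tally_genders(["jane doe", "John Smith", "xX_anon_Xx", ""], toy_classify)
```

Twitter display names are often handles or pseudonyms, so a large "unknown" bucket should be expected with any first-name-based approach.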
We used Google Sheets as the backend database for our survey data collected via Google Forms.
We used dialog and a pre-trained model to build our chatbot.
Challenges I ran into
- Twitter API limits on the number of tweets that can be queried per day, as well as the history limit (the API only provides tweets posted within the past week)
- Keeping a consistent look and feel across graphics produced in different languages, and hence by different plotting packages
- Writing code efficient enough to handle more than 60,000 tweets in real time for the interactive analysis feature
- Reactive programming with a huge dataset: I needed to program an action button to delay the analysis of the text input until the input was complete. In other words, the app should run one analysis of "metoo" instead of running analyses of "m", "me", "met", "meto", and "metoo" as the user types the string
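The action-button idea from the last challenge can be illustrated outside of Shiny. The dashboard itself is written in R Shiny, so the Python sketch below only mirrors the concept: keystrokes are buffered without triggering any work, and the expensive scan over the tweets runs once, on an explicit submit. The `KeywordSearch` class and the sample tweets are hypothetical.

```python
class KeywordSearch:
    """Buffer typed characters and analyze only when submit() is called."""

    def __init__(self, tweets):
        # Normalize once up front rather than on every query.
        self.tweets = [t.lower() for t in tweets]
        self.pending = ""
        self.runs = 0  # number of full analyses performed

    def type_char(self, ch):
        # Typing only updates the buffer; no analysis is triggered here.
        self.pending += ch

    def submit(self):
        # The explicit trigger, playing the role of the action button.
        self.runs += 1
        keyword = self.pending.lower()
        return sum(keyword in t for t in self.tweets)

search = KeywordSearch(["#MeToo matters", "hello world", "metoo stories"])
for ch in "metoo":
    search.type_char(ch)
matches = search.submit()
```

Here the scan runs once for the full keyword instead of five times, once per keystroke, which matters when each run touches tens of thousands of tweets.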
Accomplishments that I'm proud of
- Integrating multiple data sources (public survey data, live-update survey data and Twitter tweet history)
- Integrating outputs from multiple languages (R, Python, and the back-end of the chatbot which includes node.js)
- Built a working website and product
- Consistent look and feel across the site
- Interactive data dashboard that allows users to explore the data themselves (survey data and tweet keyword search)
- Crowdsourced live-update database on sexual assault to address the problem of underreported sexual assault
What I learned
- How to build a chatbot
- How to efficiently clean text output from an API
- UX design
- Reactive programming
- Parallel programming
What's next for You-are-not-alone
We don't have a firm direction for future work yet, but there are a few things we want to add:
Historical Twitter Data
- Due to the Twitter API limit, we are unable to query more data than what we have. The Twitter API only provides tweets posted within the past week.
Live update of the survey data shared by Dr. Kelsky
Integration of other sexual assault databases
- We are interested in building a more comprehensive database of reported sexual assault.
- We are interested in building a Twitter bot that replies to tweets containing the #MeToo hashtag with helpful resources.