The primary inspiration for ReproRepo was a series of pieces by John Oliver on crisis pregnancy centers and sex education from his show Last Week Tonight. In these pieces, he detailed the deliberate and sophisticated techniques that anti-abortion groups and groups opposed to sex education use to push their own agenda onto people seeking reproductive help. Masquerading as credible sources, these groups use intentionally ambiguous language to hide their true intentions until they have the person in their clinic, where they can further pressure or misinform them into making decisions that they would not have made with truly credible counsel. The tactics and language used by these groups can be difficult for the untrained eye to catch. Further exacerbating the problem, these groups can often outnumber credible services! Knowing these facts, we began thinking: Can we create an application that not only links users to credible sources but also helps them evaluate whether a given source is fake?
What it does
ReproRepo is a web application that lets users input the URL of a sexual health resource and then analyzes that resource by comparing it to known fake and real resources. The user receives the output of the analysis, including whether the resource is most likely fake or credible, which words and phrases on the site are potentially dubious, and which known fake or real resources their URL most closely matched. In addition, users can access visualizations of the density of fake crisis pregnancy centers in their state, along with other resources.
How I built it
The front end of ReproRepo was built in R using the Shiny package. Website data from both fake and real pregnancy centers was scraped using the Python package Beautiful Soup. This data made up the training and testing sets used in a term frequency-inverse document frequency (TF-IDF) text similarity analysis.
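As a rough illustration of the TF-IDF similarity step described above, here is a minimal sketch written from scratch with only the Python standard library. It is not our actual script: the function names (`fit_idf`, `tfidf_vector`, `classify`), the smoothed IDF formula, the nearest-document decision rule, and the tiny training snippets are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every training term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Weight each known term's count by its IDF; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, labeled_docs):
    """Return (label, score) of the most similar labeled training document."""
    idf = fit_idf([doc for doc, _ in labeled_docs])
    query = tfidf_vector(text, idf)
    scored = [(cosine_similarity(query, tfidf_vector(doc, idf)), label)
              for doc, label in labeled_docs]
    best_score, best_label = max(scored)
    return best_label, best_score


# Made-up training snippets for illustration only (not the project's data).
TRAINING = [
    ("free pregnancy test and confidential options counseling", "fake"),
    ("considering abortion learn the risks before you decide", "fake"),
    ("licensed clinicians provide birth control and abortion care", "credible"),
    ("sti testing contraception and medical abortion services", "credible"),
]

label, score = classify("we offer contraception and sti testing", TRAINING)
print(label, round(score, 3))
```

Returning the label of the single closest training document keeps the sketch short and also yields the "closest match" that the app reports back to the user. A real pipeline would more likely rely on a library vectorizer and a larger corpus.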
Challenges I ran into
Getting the URLs for the fake and real websites was difficult because the databases we got the URLs from were built for individual lookups, not for pulling all the data at once (no features like CSV export, etc.). Additionally, we ran into difficulties with the web scraping when dealing with broken links, websites with multiple tabs, and cleaning up the data collected after scraping. Lastly, integrating the R Shiny application with the TF-IDF script in Python proved challenging.
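To give a sense of the broken-link and cleanup problems mentioned above, the sketch below shows one way to fetch a page defensively and reduce it to visible text. It deliberately swaps in the standard library's `html.parser` and `urllib` in place of Beautiful Soup so that it has no dependencies. The names `VisibleTextParser`, `html_to_text`, and `fetch_text` are hypothetical, and the choice to return `None` for a broken link is an assumption about how such pages might be skipped.

```python
import urllib.request
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one cleaned-up line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, timeout=10):
    """Return the page's visible text, or None if the link is broken."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers HTTP errors, DNS failures, and timeouts;
        # ValueError covers malformed URLs.
        return None
    return html_to_text(html)
```

Callers can then drop any URL for which `fetch_text` returns `None` before building the training and testing sets.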
Accomplishments that I'm proud of
Last year was the first time we were at Technica, and we mostly entered just for fun. We really loved the experience, so this year we decided to up the ante and take on a much more complex and ambitious project. We are so proud of how all the pieces came together, particularly the visualizations in R Shiny and the TF-IDF analysis!
What I learned
We learned more about creating visualizations in R Shiny, specifically heatmaps, and about reactive programming in R Shiny. We also learned about web scraping with Beautiful Soup and text similarity analysis using TF-IDF. Although it was not used in the final product, we also investigated semantic similarity as another potential way to do the analysis.
What's next for ReproRepo
For the scope of the hackathon, we decided to focus primarily on crisis pregnancy centers as sources of reproductive health misinformation. However, this just scratches the surface of the bad actors in this space. Ideally, ReproRepo would be able to tackle a much wider range of misinformation about sexual education. Furthermore, we are currently doing our similarity analysis with simple term frequency comparisons across documents. While we are getting reasonably high accuracy with this model (real resources predicted at 95%, fake resources predicted at 84%), we would want to improve it further with analysis that considers semantics and gleans meaning from the text. Lastly, we would want ReproRepo to grow into a platform that curates these fake and credible sources, essentially providing the research-level database that we did not have when working on this project.
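For readers who want to reproduce per-class figures like the ones above, one plausible reading is that each number is the share of test resources in a class that the model labeled correctly. The sketch below computes that under this assumption. The function name `per_class_accuracy` and the sample labels are hypothetical and are not our test data.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of examples of each true class that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical labels for illustration; these are not the project's test data.
truth = ["credible", "credible", "credible", "credible", "fake", "fake", "fake", "fake"]
preds = ["credible", "credible", "credible", "fake", "fake", "fake", "fake", "credible"]
print(per_class_accuracy(truth, preds))
```

Reporting the two classes separately matters here because missing a fake resource and mislabeling a credible one carry different costs for the user.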
Reproaction Education Fund: Fake Clinic Database, accessed October 25, 2020 from https://reproaction.org/fakeclinicdatabase
https://stackoverflow.com/questions/53479963/highlighting-text-on-shiny
https://towardsdatascience.com/how-to-rank-text-content-by-semantic-similarity-4d2419a84c32
https://prochoice.org/patients/find-a-provider/
Logo design plays off of the Planned Parenthood logo!