What we did
We ran sentiment analysis on various Reddit and StackOverflow communities to determine which are the friendliest and which are the most toxic.
How we built it
We gathered comment data from Reddit using PRAW and from StackOverflow using its SQL endpoint, tagging each comment with the subreddit or SO community it came from. We then ran the Reddit comments through Microsoft's Cognitive Services API and the StackOverflow comments through IBM Watson.
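The aggregation step can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: `mean_score_by_community` and `toy_score` are hypothetical names, and `toy_score` is a stand-in word-list scorer used only so the example runs without API credentials. In practice the `score` argument would wrap a call to the sentiment or tone service.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def mean_score_by_community(
    comments: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Average a per-comment score for each community.

    `comments` yields (community, text) pairs, i.e. each comment is
    already tagged with the subreddit or SO community it came from.
    """
    scores = defaultdict(list)
    for community, text in comments:
        scores[community].append(score(text))
    return {community: mean(values) for community, values in scores.items()}


def toy_score(text: str) -> float:
    """Placeholder scorer (assumption): fraction of words NOT in a tiny
    'unfriendly' word list. A real run would call an external API here."""
    unfriendly = {"stupid", "useless", "idiot"}
    words = text.lower().split()
    if not words:
        return 1.0
    return 1.0 - sum(w in unfriendly for w in words) / len(words)


if __name__ == "__main__":
    sample = [
        ("r/example_a", "thanks this was really helpful"),
        ("r/example_a", "great answer"),
        ("r/example_b", "useless question"),
    ]
    print(mean_score_by_community(sample, toy_score))
```

Keeping the scorer as a plain function argument means the same aggregation code works whichever service produces the scores.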
Challenges we ran into
Microsoft's API, which only measures sentiment, did not reveal any trends in the StackOverflow dataset. IBM Watson's tone analysis API provides a deeper analysis of the emotional connotation of text and gave more significant results on the SO dataset. We also had to manage our API calls carefully, because the trial accounts allow only a limited number of them.
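One way to stretch a limited call quota is to drop duplicate comments, send texts in batches, and stop once a call budget is used up. The sketch below illustrates that idea under stated assumptions: `score_with_budget` and `score_batch` are hypothetical names, and `score_batch` stands in for a single API request that scores several texts at once. Whether a given service accepts batches, and of what size, depends on that service.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_with_budget(
    texts: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[float]],
    batch_size: int = 10,
    max_calls: int = 100,
) -> Tuple[Dict[str, float], int]:
    """Score unique texts in batches without exceeding `max_calls` requests.

    Returns the scores gathered so far and the number of calls made.
    Texts left unscored when the budget runs out are simply absent
    from the result.
    """
    unique = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    results: Dict[str, float] = {}
    calls = 0
    for chunk in batches(unique, batch_size):
        if calls >= max_calls:
            break
        for text, value in zip(chunk, score_batch(chunk)):
            results[text] = value
        calls += 1
    return results, calls


if __name__ == "__main__":
    # Stand-in for a real API request (assumption): score = text length.
    fake_api = lambda chunk: [float(len(t)) for t in chunk]
    scores, used = score_with_budget(["a", "bb", "a", "ccc"], fake_api, batch_size=2)
    print(scores, used)
```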
Accomplishments that we're proud of
The results matched our intuitions about certain communities, and we think the ability to analyze the toxicity of online subcommunities has real potential to help prevent cyber-bullying.
What we learned
We learned efficient procedures for web scraping and simple yet effective procedures for data analysis. We also gained valuable experience with the Microsoft and IBM Watson APIs.
What's next for What is the least friendly community on StackOverflow?
A model trained specifically to detect cyber-bullying, rather than for general sentiment analysis, would be even more effective at detecting toxic behavior online. The approach could also be applied to any number of other online communities; however, comparing communities across domains would require some method of standardization to be created and tested.
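As one possible starting point for such standardization, scores could be converted to z-scores within each platform before comparing across platforms. This is only a sketch of a candidate method that we have not tested; `standardize` is a hypothetical helper, and the community names and numbers in the usage example are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict


def standardize(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert one platform's per-community scores to z-scores.

    Each value becomes (score - platform mean) / platform std. dev.,
    so communities are expressed relative to their own platform
    rather than on the raw scale of whichever API produced them.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All communities scored the same; nothing to distinguish.
        return {name: 0.0 for name in scores}
    return {name: (value - mu) / sigma for name, value in scores.items()}


if __name__ == "__main__":
    # Illustrative numbers only, not results from the project.
    platform_a = {"community_1": 0.2, "community_2": 0.6, "community_3": 0.7}
    platform_b = {"community_x": 10.0, "community_y": 30.0}
    print(standardize(platform_a))
    print(standardize(platform_b))
```

A z-score only says how a community compares with others on the same platform, so it would still need to be validated before drawing cross-platform conclusions.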