Berkeley Reddit Toxicity Tracker

r/berkeley

Inspiration

We often browse the Berkeley subreddit, and we believed that many of its posts were quite toxic compared to other universities' subreddits. More specifically, we thought that many of the EECS/CS posts were worse than the general posts.

What it does

This uses the Reddit API to pull r/berkeley's top posts, then runs each post through Cohere's pre-trained toxic vs. non-toxic language classifier to determine what percentage of posts are considered toxic.
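As a rough illustration of the percentage step, here is a minimal sketch. It assumes the classifier can be wrapped as a function that takes one post and returns a label string; the `classify` parameter, the function name `percent_toxic`, and the label value "Toxic" are placeholders for illustration, not the actual Cohere call or the code we wrote at the hackathon.

```python
from typing import Callable, Iterable


def percent_toxic(
    posts: Iterable[str],
    classify: Callable[[str], str],  # hypothetical wrapper around the real classifier
    toxic_label: str = "Toxic",      # assumed label name
) -> float:
    """Return the percentage (0-100) of posts that the classifier labels toxic."""
    labels = [classify(post) for post in posts]
    if not labels:
        # No posts means nothing was flagged; avoid dividing by zero.
        return 0.0
    toxic_count = sum(1 for label in labels if label == toxic_label)
    return 100.0 * toxic_count / len(labels)
```

Passing the classifier in as an argument keeps the counting logic separate from whichever model is used, so a different model could be swapped in without touching this function.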

How we built it

We read up on the Reddit API and used it to turn all of the posts into strings that we could feed into the Cohere models.
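One way the post-to-string step could look is sketched below. It assumes Reddit's public JSON listing format (a "data" object holding "children", each with a "title" and "selftext") and the unauthenticated `top.json` endpoint; our actual code may have used a different client or an authenticated route, and the function names and User-Agent string here are made up for the example.

```python
import json
import urllib.request
from typing import List


def listing_to_strings(listing: dict) -> List[str]:
    """Flatten a Reddit listing (parsed JSON) into one text string per post."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        title = (post.get("title") or "").strip()
        body = (post.get("selftext") or "").strip()
        # Link posts have no body, so fall back to the title alone.
        text = f"{title}\n{body}" if body else title
        if text:
            posts.append(text)
    return posts


def fetch_top_listing(subreddit: str = "berkeley", limit: int = 25) -> dict:
    """Fetch top posts as parsed JSON (assumes the public endpoint is reachable)."""
    url = f"https://www.reddit.com/r/{subreddit}/top.json?limit={limit}&t=all"
    # Reddit expects a descriptive User-Agent; this value is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "toxicity-tracker-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `listing_to_strings(fetch_top_listing())` would then give a list of strings ready to hand to a classifier, subject to Reddit's rate limits and access rules.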

Challenges we ran into

We struggled to pull the pertinent information from Reddit correctly. We also struggled to get the model working and producing the output we actually wanted.

Accomplishments that we're proud of

We are proud of having learned how to pull data from Reddit and run it through machine learning models, even if we weren't able to create our own specialized models.

What we learned

This was the first hackathon for most of us, and we learned some of the basics of machine learning and APIs.

What's next for Berkeley Reddit Toxicity Tracker

We think it would be good to use it to compare r/berkeley against other university subreddits, such as Stanford's or those of other UCs. It could also use more specialized models that are designed to work on Reddit posts.