The anonymity of the web has led to an avalanche of consequence-free online harassment. Mob behavior, death threats and sexist abuse have become a daily occurrence for thousands of people. Tech companies have so far been unable to respond to the problem with any success. It has gotten so bad that multiple NGOs, such as Amnesty International, have been forced to take a direct part in safeguarding the rights of individuals on the web. We have witnessed this issue firsthand on various social media platforms and in online video games. There has to be a better way to effect change.
What it does
Our classifier model accurately distinguishes toxic comments from non-toxic comments online.
Our associated website provides an interactive interface to access and demo our Toximeter API. It also provides additional information such as emotional analysis and confidence metrics.
Overall, our project gives social platform moderators a more efficient way to detect harmful behavior and disrespectful speech.
How we built it
We used a neural network following a pooled GRU-RNN architecture.
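As a rough illustration of what a pooled GRU does, the sketch below runs a GRU over a sequence of word vectors, keeps every hidden state, and pools over them before a final sigmoid output. It is written in plain Python so the mechanics are visible; the real model was built with Keras. The choice of concatenated max and average pooling, the layer sizes, the random weights, the omitted biases and the names GRUCell, pooled_gru_features and toxicity_score are all assumptions made for this example, not details of our trained network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell without biases (illustrative only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # Input (W) and recurrent (U) weights for the update gate z,
        # the reset gate r and the candidate state n.
        self.W = {g: init_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: init_matrix(hidden_size, hidden_size, rng) for g in "zrn"}
        self.hidden_size = hidden_size

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Blend the previous state with the candidate state.
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]


def pooled_gru_features(cell, sequence):
    """Run the GRU over a non-empty sequence and pool over all hidden states."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in sequence:
        h = cell.step(x, h)
        states.append(h)
    max_pool = [max(column) for column in zip(*states)]
    avg_pool = [sum(column) / len(column) for column in zip(*states)]
    return max_pool + avg_pool  # concatenation of both pooled vectors


def toxicity_score(features, weights, bias=0.0):
    # Single sigmoid output unit on top of the pooled features.
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


cell = GRUCell(input_size=3, hidden_size=4)
sentence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]  # two made-up word vectors
features = pooled_gru_features(cell, sentence)
score = toxicity_score(features, weights=[0.5] * len(features))
```

Pooling over every time step, rather than using only the last hidden state, lets a toxic word early in a long comment still influence the final score.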
We preprocessed the data by cleaning it, tokenizing it (separating the words), lemmatizing it (reducing words to their base form: "walking" becomes "walk") and embedding it (turning words into numbers in a matrix). We then split it into training, validation and test sets.
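The preprocessing steps can be sketched with the standard library alone. This is a simplified stand-in, not our actual pipeline: the toy lemma table replaces spaCy's lemmatizer, the integer ids are only the lookup keys that would index rows of an embedding matrix, and the split fractions, the seed and all function names are assumptions chosen for the example.

```python
import random
import re

# Toy lemma table standing in for spaCy's lemmatizer (hypothetical entries).
TOY_LEMMAS = {"walking": "walk", "walked": "walk", "comments": "comment"}


def clean(text):
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    return text.split()


def lemmatize(tokens):
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def build_vocab(token_lists):
    # Ids 0 and 1 are reserved for padding and unknown words.
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    # Map tokens to ids, then truncate or pad to a fixed length.
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    # Shuffle once, then carve out test and validation sets.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]


tokens = lemmatize(tokenize(clean("She was WALKING, quickly!")))
vocab = build_vocab([tokens])
encoded = encode(tokens, vocab, max_len=6)
train, val, test = split(range(10))
```

Padding every comment to the same length is what allows the encoded comments to be stacked into a single matrix for the network.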
We used grid search to find the best batch size and learning rate. In the end, we trained the network for 2 epochs using the Adam optimizer with a learning rate of 0.005. We then fine-tuned the model by dividing the learning rate by 10 and training the network for another epoch.
The libraries we used are scikit-learn and Keras for ML, as well as pandas and spaCy for preprocessing. We also used the IBM Watson Tone Analyzer and the AYLIEN API to provide additional metrics.
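A minimal sketch of the search and the training schedule, in plain Python: grid_search tries every combination of batch size and learning rate and keeps the best score, and learning_rate_schedule lists the learning rate used in each epoch (two epochs at 0.005, then one at a tenth of that). The candidate values, the toy_validation_score function and all names here are hypothetical; in practice the scoring function would train the network and return a validation metric.

```python
from itertools import product


def grid_search(batch_sizes, learning_rates, evaluate):
    """Return the best-scoring (batch size, learning rate) combination."""
    best = None
    for batch_size, lr in product(batch_sizes, learning_rates):
        score = evaluate(batch_size, lr)
        if best is None or score > best[0]:
            best = (score, batch_size, lr)
    return {"score": best[0], "batch_size": best[1], "learning_rate": best[2]}


def learning_rate_schedule(base_lr=0.005, main_epochs=2, fine_tune_epochs=1, factor=10):
    """Learning rate per epoch: main training, then fine-tuning at base_lr / factor."""
    return [base_lr] * main_epochs + [base_lr / factor] * fine_tune_epochs


def toy_validation_score(batch_size, lr):
    # Hypothetical stand-in for "train the model and measure validation accuracy".
    return -abs(lr - 0.005) - abs(batch_size - 128) / 1000


best = grid_search([64, 128, 256], [0.05, 0.005, 0.0005], toy_validation_score)
schedule = learning_rate_schedule()
```

Because every combination means a full training run, the grid has to stay small; three values per parameter already means nine runs.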
Challenges we ran into
Preprocessing the data took a very long time, since lemmatizing is slow. This meant that we couldn't afford trial and error: restarting would mean waiting several hours for the data to be ready.
Creating the web app and the ML model required very different skill sets and technologies, which we weren't familiar with at first.
Accomplishments that we're proud of
We finished the project on time! The model runs and is accurate, and the website successfully calls the Toximeter API and displays the results nicely in the browser.
What we learned
A lot of ML and Python knowledge: the pooled GRU architecture, many Keras functions, scikit-learn and NLP in general (spaCy, lemmatizing...)
Also some web development in React and Node.js.
What's next for NLPure
There are many ways the model can be improved. First, it needs to be adapted to multiple languages. This can be done by training the model on more data, including comments in different languages.
Second, the model needs to be improved to better understand negated sentences. Since it's rare for anyone to type comments such as "You are not stupid", the model has learned to associate the words "you" and "stupid" with toxic behavior. Creating a decision tree that lets the model accurately reflect the nature of such comments (by keying on the presence of words such as "not") would be a good improvement.
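One way to prototype the idea is a small rule that runs after the classifier, sketched below. This is a hand-written heuristic rather than a learned decision tree, and everything in it is an assumption made for illustration: the word lists, the two-word window, the damping factor and the function names.

```python
import re

# Hypothetical word lists for illustration only.
NEGATIONS = {"not", "never", "no", "isn't", "aren't"}
TOXIC_TERMS = {"stupid", "idiot"}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def negated_toxic_terms(text, window=2):
    """Pair each flagged term with whether a negation appears just before it."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token in TOXIC_TERMS:
            preceding = tokens[max(0, i - window):i]
            hits.append((token, any(p in NEGATIONS for p in preceding)))
    return hits


def adjust_score(model_score, text, damping=0.5):
    """Lower the classifier's score when every flagged term is negated."""
    hits = negated_toxic_terms(text)
    if hits and all(negated for _, negated in hits):
        return model_score * damping
    return model_score


plain = adjust_score(0.9, "You are stupid")        # score left unchanged
negated = adjust_score(0.9, "You are not stupid")  # score is damped
```

A fixed window is crude: in "You are not at all stupid" the negation falls outside it, which is one reason a learned approach with more negated training examples would likely work better.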
Finally, the model could be specialized for insurance companies to detect fraud or exaggerated claims, using a slightly different training set.