With the vast amount of toxicity online, our team wanted to see whether NLP models could help reduce it. The problem is especially visible on large public Discord servers, where public text channels can quickly become toxic cesspools filled with spam and hate. Currently, Discord doesn’t have any first-party solution for this and instead leaves it to third-party bots such as the popular MEE6. However, the auto-moderation features of these bots are largely based on traditional spam filters and word blacklists, which help only to a limited extent and cannot make more intelligent decisions about what kind of text should be flagged for moderation. For example, identity hate or similar hateful speech (like calling someone gay as an insult) wouldn’t be detected by any of the current methods. Thus, we wanted to use NLP, which has the potential to evaluate text much more holistically, to score the toxicity of a message.
What it does
AutoBot automatically evaluates messages sent in Discord text channels and assigns each message a score between 0 and 1 in six categories: toxicity, severe toxicity, obscenity, identity hate, insults, and threats. Once a message’s scores pass a certain threshold (which only the server admin can set), the message is deleted and AutoBot warns the author via direct message. The warning contains the number of infractions the author now has, and if the author has more than 5 infractions, they are kicked from the server. AutoBot evaluates regular messages, edited messages, and server member nicknames/usernames.
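The thresholding and infraction rules described above can be sketched roughly as follows. This is a minimal illustration, not AutoBot's actual code: the label names follow the Kaggle dataset's columns, while `DEFAULT_THRESHOLDS`, `flagged_labels`, and `InfractionTracker` are hypothetical names, the threshold values are placeholders, and the sketch assumes a message is flagged when any single category exceeds its threshold.

```python
from collections import defaultdict

# Hypothetical per-category thresholds; in AutoBot the server admin sets these.
DEFAULT_THRESHOLDS = {
    "toxic": 0.5,
    "severe_toxic": 0.5,
    "obscene": 0.5,
    "threat": 0.5,
    "insult": 0.5,
    "identity_hate": 0.5,
}
KICK_LIMIT = 5  # from the text: more than 5 infractions leads to a kick


def flagged_labels(scores, thresholds=DEFAULT_THRESHOLDS):
    """Return the categories whose score exceeds that category's threshold."""
    return [
        label
        for label, score in scores.items()
        if score > thresholds.get(label, 1.0)  # unknown labels never trigger
    ]


class InfractionTracker:
    """Counts infractions per user and reports when the kick limit is passed."""

    def __init__(self, kick_limit=KICK_LIMIT):
        self.kick_limit = kick_limit
        self.counts = defaultdict(int)

    def record(self, user_id):
        """Add one infraction; return (new_count, should_kick)."""
        self.counts[user_id] += 1
        count = self.counts[user_id]
        return count, count > self.kick_limit
```

A message would then be deleted whenever `flagged_labels(scores)` is non-empty, and `record()` supplies the infraction count for the warning.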
How we built it
We used a pre-trained BERT model in combination with a CNN we trained ourselves to create a text classifier that predicts scores in the 6 categories mentioned above for a given text. The model was trained on data from the Toxic Comment Classification Challenge on Kaggle, which consists of labeled comments from Wikipedia’s talk page edits. The model and the bot are set up and running on a GPU-powered virtual machine on GCP.
For the bot itself, we primarily used the discord.py library. Once we had the model and the evaluate function built and tested, it was just a matter of defining a checkMessage() function that takes the user and the string to check as parameters. Inside this function, we call the evaluate function on the input string to judge its toxicity. This function is also where we implemented the warnings and thresholds, as well as the disciplinary actions taken against users past the threshold. We then used the @bot.event decorator from discord.py to define the different on_action handler functions that call checkMessage(). When a user sends a message, edits a message, joins the server, or changes their name, the respective on_action function is triggered and the appropriate string is evaluated.
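The flow of checkMessage() can be sketched as below. This is an illustration under assumptions rather than the bot's real code: `evaluate` is a keyword-based stub standing in for the model, `THRESHOLD` and the in-memory `infractions` dict are hypothetical, and the optional `message` parameter is an addition to the two parameters described above. The `author` and `message` arguments are duck-typed; with discord.py they would be a `discord.Member` and a `discord.Message`, whose `send()`, `kick()`, and `delete()` coroutines are the calls used here.

```python
THRESHOLD = 0.5   # hypothetical; AutoBot lets the server admin set this
KICK_LIMIT = 5    # from the text: more than 5 infractions leads to a kick
infractions = {}  # hypothetical in-memory store: user id -> infraction count


def evaluate(text):
    """Stand-in for the model: returns a score in [0, 1] per category."""
    toxic = 1.0 if "idiot" in text.lower() else 0.0
    return {"toxic": toxic, "insult": toxic}


async def check_message(author, text, message=None):
    """Score `text`; on a violation delete the message, warn, and maybe kick.

    Returns True when the text was flagged, False otherwise.
    """
    scores = evaluate(text)
    if max(scores.values()) <= THRESHOLD:
        return False
    if message is not None:  # nickname checks have no message to delete
        await message.delete()
    count = infractions.get(author.id, 0) + 1
    infractions[author.id] = count
    await author.send(f"Warning: toxic content detected. Infractions: {count}")
    if count > KICK_LIMIT:
        await author.kick(reason="Too many infractions")
    return True
```

With discord.py, an `on_message` handler registered through `@bot.event` could call `await check_message(message.author, message.content, message)`, and the handlers for edits, joins, and name changes would pass the relevant string in the same way.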
Challenges we ran into
The dataset we used for training caused a number of problems in our end product. The data was made up exclusively of Wikipedia talk page comments, which are not very representative of the language typically used on Discord. For example, the word meme shows up only a few times in the entire dataset, and the slang typical of Discord was barely present. This made the quality of detections quite low for many kinds of language that are typical on a Discord server and would be considered toxic.
We also did not get a chance to train on a more up-to-date dataset on Kaggle that takes unintended model bias into account. As a result, our model incorrectly learned to associate the names of frequently attacked identities with toxicity. For example, the phrase “I am a gay black woman” would be flagged as toxic with a high degree of certainty, because the training data contained many comments where identities were used in offensive ways. To correct this, we would need metrics that measure this unintended bias so that the model can be optimized against it as well.
We also did not have enough time to clean the dataset, which would require a much more thorough and time-consuming analysis. For example, a number of comments in the dataset contain URLs, and the vast majority of these were labelled as non-toxic. As a result of this imbalance, we found in our testing that a message containing http or .com would not be flagged as toxic.
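One possible mitigation for the URL artifact described above, which the project did not implement, is to replace links with a neutral placeholder in both the training data and incoming messages, so the model cannot treat `http` or `.com` as a shortcut for "non-toxic". The sketch below is a hypothetical pre-processing helper: the `mask_urls` name, the `[URL]` placeholder, and the regular expression are all assumptions, and the pattern is deliberately simple rather than a complete URL matcher.

```python
import re

# Matches http(s):// and www. links, plus bare domains ending in a few common TLDs.
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)\S+"
    r"|\b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net)\b(?:/\S*)?",
    re.IGNORECASE,
)


def mask_urls(text, placeholder="[URL]"):
    """Replace every URL-like substring with a fixed placeholder token."""
    return URL_PATTERN.sub(placeholder, text)
```

For this to help, the same function would have to be applied to the training comments before fitting and to each Discord message before evaluation.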
Accomplishments that we’re proud of
Overall, we succeeded in creating an auto-moderation Discord bot that is more capable than other bots at detecting toxic language. While nowhere near production level due to the limitations of our training dataset, our model was capable of detecting toxic messages that didn’t use any of the words that would typically be on a blacklist, which other bots would need in order to detect them. We also achieved strong results on our validation dataset, with the per-label ROC AUC scores listed below. This means that our model generalizes well to data similar to what it was trained on. Therefore, given more representative training data containing more typical Discord language, our model would likely achieve much better real-world results than it does currently.
ROC AUC per label:
- toxic : 0.9786185661012127
- severe_toxic : 0.9862638391262774
- obscene : 0.9859656107908958
- threat : 0.9698006579864755
- insult : 0.9791061267388688
- identity_hate : 0.962972848616481
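For reference, ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half), computed separately for each label. The dependency-free sketch below illustrates that definition. It is not the code that produced the numbers above, which the writeup does not show, and `roc_auc`, `roc_auc_per_label`, and any data passed to them are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def roc_auc_per_label(data):
    """`data` maps each label name to a (true_labels, predicted_scores) pair."""
    return {name: roc_auc(y_true, y_score) for name, (y_true, y_score) in data.items()}
```

The pairwise loop is quadratic in the number of examples, so on a full validation set a library implementation based on sorting would be the practical choice.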
What we learned
Getting to a potentially production-ready bot that does a better job than traditional spam filtering and blacklisting would require a great deal more work. While we initially hoped that a model trained on toxic comments would generalize well to toxic messages on Discord, a model trained on data collected from actual Discord servers or other culturally similar sources like Twitch chat would likely be far more accurate. We also learned about the limitations of the Discord bot API. We initially wanted a feature to temporarily hide a person’s flagged message until it was reviewed by a human moderator, but this is not possible because a Discord bot can only delete a person’s message from a channel.
What's next for AutoBot
- Store the number of times a user has violated server rules in a database and keep a record of such users (blacklisting)
- Integrate email and website notifications so that server owners could get monthly emails listing users who have violated the rules, and store those records in a database so that if a user wants to join another server, its owner knows their track record
- Add a settable threshold for infractions and more admin-customizable actions to take other than just kicking (muting from server, etc.)
- Create a website for more intuitive control over the settings of the bot such as the thresholds for the different classification categories
- Apply a spell check algorithm and filters to messages as pre-processing, so we are able to detect a greater amount of obfuscated toxicity
- Use NLP models to detect underage language to ban underage users from 18+ channels and servers as well as to combat child grooming
- Bootstrap more tailored data by using the bot to collect, and then train on, human moderator flagged text that the bot missed
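As a rough idea of what the pre-processing filter in the list above might look like, the sketch below undoes two common obfuscation tricks before text reaches the model: character substitutions and stretched letters. It is only an assumption about one possible approach, not a planned design; `LEET_MAP` and `normalize` are hypothetical names, and a real spell checker would be a separate step.

```python
import re

# Hypothetical mapping of common character substitutions back to letters.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text):
    """Lowercase, undo simple character substitutions, and shorten stretched letters."""
    text = text.lower().translate(LEET_MAP)
    # Collapse runs of three or more identical characters down to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)
```

Note that this naive mapping also rewrites genuine numbers (for example "10" becomes "io"), so a practical version would need to decide when a digit is really standing in for a letter.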