Inspiration
The internet connects billions of people, but online harassment and toxic behavior threaten free and open discussions. We were deeply inspired by the Kaggle Jigsaw challenges to build a system that not only identifies hate speech but does so fairly, without penalizing vulnerable minority groups through unintended biases.
What it does
Detoxify classifies text into specific toxicity categories: severe toxicity, obscenity, threats, insults, and identity-based attacks. Beyond the English models, it offers a Multilingual model that supports 7 languages and an Unbiased model designed to detect identity-based hate without penalizing comments that merely mention identity terms.
How we built it
We built on PyTorch and PyTorch Lightning for a clean, scalable training loop. For the models themselves, we used HuggingFace Transformers, fine-tuning pretrained architectures such as bert-base-uncased, roberta-base, albert-base-v2, and xlm-roberta-base on millions of annotated comments.
Challenges we ran into
One of the hardest parts of detecting toxicity is unintended bias. Often, machine learning models associate perfectly innocent identity words (like "gay", "muslim", or "black") with toxicity simply because those words appear frequently in hateful contexts. Balancing our datasets and computing specialized bias metrics to ensure our models didn't unfairly target marginalized communities was a massive, but rewarding, technical hurdle.
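This write-up does not name the bias metrics we computed. As a minimal sketch, assuming per-identity AUC metrics in the style of the Jigsaw Unintended Bias challenge (subgroup AUC, BPSN AUC, and BNSP AUC), the idea can be illustrated in plain Python. The record layout, the function names, and the toy scores below are all hypothetical and chosen only for illustration.

```python
# Each record is (label, score, mentions_identity):
#   label             1 for toxic, 0 for non-toxic
#   score             model's predicted toxicity in [0, 1]
#   mentions_identity True if the comment mentions the identity term
# This layout is an assumption made for the sketch.


def roc_auc(labels, scores):
    """Probability that a random toxic example outscores a random non-toxic one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _auc_where(records, keep):
    """AUC restricted to the records selected by the predicate `keep`."""
    kept = [r for r in records if keep(r)]
    return roc_auc([r[0] for r in kept], [r[1] for r in kept])


def subgroup_auc(records):
    """AUC over comments that mention the identity term."""
    return _auc_where(records, lambda r: r[2])


def bpsn_auc(records):
    """Background Positive, Subgroup Negative: toxic comments without the
    identity term versus non-toxic comments that mention it."""
    return _auc_where(
        records, lambda r: (r[2] and r[0] == 0) or (not r[2] and r[0] == 1)
    )


def bnsp_auc(records):
    """Background Negative, Subgroup Positive: non-toxic comments without the
    identity term versus toxic comments that mention it."""
    return _auc_where(
        records, lambda r: (r[2] and r[0] == 1) or (not r[2] and r[0] == 0)
    )


# Toy data: the second record is an innocent identity mention scored too high.
records = [
    (1, 0.9, True),
    (0, 0.8, True),
    (0, 0.2, True),
    (1, 0.7, False),
    (0, 0.1, False),
    (1, 0.6, False),
]

print(subgroup_auc(records), bpsn_auc(records), bnsp_auc(records))
```

On this toy data the BPSN AUC is much lower than the other two, which is the pattern described above: a harmless comment that mentions an identity term is scored above genuinely toxic comments that do not.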
Accomplishments that we're proud of
We are incredibly proud of achieving near state-of-the-art results (such as an AUC score of 98.64% on the original dataset) while keeping the library highly accessible. Packaging this complex research into a simple pip install detoxify command that anyone can run in just three lines of Python code is a huge win for developers everywhere.
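For readers who want to see what that looks like, here is a minimal sketch. It assumes the package exposes a `Detoxify` class whose `predict` method returns a dictionary of label scores for a single string, and that `"original"` is a valid model name. `classify` and `flagged_labels` are hypothetical helpers added for illustration, not part of the library, and the 0.5 threshold is arbitrary.

```python
def classify(text, model_name="original"):
    """Return a dict mapping each toxicity label to a score between 0 and 1."""
    # Imported lazily so flagged_labels can be used without the package installed.
    from detoxify import Detoxify

    return Detoxify(model_name).predict(text)


def flagged_labels(scores, threshold=0.5):
    """Hypothetical helper: sorted names of labels whose score meets the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)
```

Calling `flagged_labels(classify("example text"))` would then list the categories that cross the threshold. The first call to `classify` may need network access to fetch the model weights.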
What we learned
We learned a tremendous amount about the nuances of human language and the ethical responsibilities of AI. We saw firsthand how easily models can amplify historical prejudices if left unchecked, reinforcing the importance of rigorous, ethical AI evaluation.
What's next for Detoxify
We plan to expand Detoxify to cover even more languages, especially low-resource languages that currently lack robust moderation tools. We also aim to release lighter, heavily quantized models that can run on edge devices, making real-time, on-device toxicity moderation a reality for mobile apps and games.