Detoxify

Inspiration

The internet connects billions of people, but online harassment and toxic behavior threaten free and open discussions. We were deeply inspired by the Kaggle Jigsaw challenges to build a system that not only identifies hate speech but does so fairly, without penalizing vulnerable minority groups through unintended biases.

What it does

Detoxify classifies text into specific toxicity categories, including severe toxicity, obscenity, threats, insults, and identity-based attacks. Beyond the English-only model, it offers a Multilingual model that supports 7 languages and an Unbiased model designed to detect identity-based hate without penalizing text that merely mentions identity terms.
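To make the per-category output concrete, here is a minimal sketch of turning category scores into a list of flagged categories. The label names, the score values, and the 0.5 threshold are all illustrative assumptions rather than output from a real model run.

```python
# Hypothetical scores for a single comment. The label names and values are
# assumptions chosen for illustration; they are not real model output.
example_scores = {
    "toxicity": 0.91,
    "severe_toxicity": 0.12,
    "obscene": 0.67,
    "threat": 0.03,
    "insult": 0.74,
    "identity_attack": 0.05,
}


def flagged_categories(scores, threshold=0.5):
    """Return the categories whose score meets the threshold, sorted by name."""
    return sorted(label for label, score in scores.items() if score >= threshold)


print(flagged_categories(example_scores))
```

A single global threshold is the simplest choice. A real moderation pipeline would likely tune a separate threshold per category.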

How we built it

We built on PyTorch and PyTorch Lightning for a clean, scalable training loop. For the models themselves, we used HuggingFace Transformers, fine-tuning pretrained architectures such as bert-base-uncased, roberta-base, albert-base-v2, and xlm-roberta-base on millions of annotated comments.
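This write-up does not state the training objective. Assuming a multi-label setup in which each toxicity category gets its own independent sigmoid output (an assumption, not something confirmed here), the loss for one comment could be sketched in plain Python as follows. The function names are illustrative.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form.

    Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))], but avoids
    overflow for large |x| by using max(x, 0) - x*t + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multilabel_loss(logits, targets):
    """Mean binary cross-entropy over all categories of one comment (assumed objective)."""
    return sum(bce_with_logits(x, t) for x, t in zip(logits, targets)) / len(logits)


# One comment, three hypothetical categories: the first and third are labeled toxic.
print(multilabel_loss([2.0, -1.5, 0.3], [1.0, 0.0, 1.0]))
```

In PyTorch the same computation is available as torch.nn.BCEWithLogitsLoss, which is the usual way to train this kind of head on top of a pretrained transformer.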

Challenges we ran into

One of the hardest parts of detecting toxicity is unintended bias. Often, machine learning models associate perfectly innocent identity words (like "gay", "muslim", or "black") with toxicity simply because those words appear frequently in hateful contexts. Balancing our datasets and computing specialized bias metrics to ensure our models didn't unfairly target marginalized communities was a massive, but rewarding, technical hurdle.
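As an illustration of what such bias metrics can look like, the sketch below computes two metrics used in the Jigsaw Unintended Bias challenge: subgroup AUC and BPSN (Background Positive, Subgroup Negative) AUC. Whether these are the exact metrics computed for this project is an assumption. The toy labels, scores, and subgroup flags are invented for the example.

```python
def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of examples, which is
    fine for a sketch but too slow for a real evaluation set.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc(labels, scores, in_subgroup):
    """AUC restricted to comments that mention the identity subgroup."""
    idx = [i for i, g in enumerate(in_subgroup) if g]
    return auc([labels[i] for i in idx], [scores[i] for i in idx])


def bpsn_auc(labels, scores, in_subgroup):
    """AUC over non-toxic subgroup comments plus toxic background comments.

    A low value means innocent comments that mention the identity are scored
    higher than toxic comments that do not mention it.
    """
    idx = [
        i
        for i, (y, g) in enumerate(zip(labels, in_subgroup))
        if (g and y == 0) or (not g and y == 1)
    ]
    return auc([labels[i] for i in idx], [scores[i] for i in idx])


# Toy data (invented): 1 = toxic, 0 = non-toxic.
labels = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.3, 0.7, 0.1, 0.6]
in_subgroup = [True, True, True, False, False, False]

print(subgroup_auc(labels, scores, in_subgroup))
print(bpsn_auc(labels, scores, in_subgroup))
```

In this toy data the model separates toxic from non-toxic comments perfectly within the subgroup, yet the BPSN AUC is much lower because one innocent identity-mentioning comment outscores every toxic background comment. That gap is the kind of unintended bias described above.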

Accomplishments that we're proud of

We are incredibly proud of achieving near state-of-the-art results (such as an AUC score of 98.64% on the original dataset) while keeping the library highly accessible. Packaging this research so that anyone can install it with pip install detoxify and run it in just three lines of Python code is a huge win for developers.
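A minimal sketch of that usage is shown below. It assumes the package exposes a Detoxify class whose predict method returns a dictionary of label scores for a single string, and that "original" is a valid model name, as described in the package README. The helper names score_text and format_scores are hypothetical and exist only to make the example testable without downloading a model.

```python
# Install first (shell): pip install detoxify


def score_text(text, model_name="original"):
    """Return a dict mapping each toxicity label to a score between 0 and 1."""
    # Imported lazily so this module loads even where detoxify is not installed.
    from detoxify import Detoxify

    # The first call downloads the checkpoint for the chosen model.
    return Detoxify(model_name).predict(text)


def format_scores(scores):
    """Render scores as 'label: 0.123' lines, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(f"{label}: {value:.3f}" for label, value in ranked)


# Typical use, once the package is installed:
#   print(format_scores(score_text("example text")))
```

Swapping the model name for the unbiased or multilingual variant should be the only change needed to try the other models, assuming the names match those in the README.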

What we learned

We learned a tremendous amount about the nuances of human language and the ethical responsibilities of AI. We saw firsthand how easily models can amplify historical prejudices if left unchecked, reinforcing the importance of rigorous, ethical AI evaluation.

What's next for Detoxify

We plan to expand Detoxify to cover even more languages, especially low-resource languages that currently lack robust moderation tools. We also aim to release lighter, highly quantized models that can run on edge devices, making real-time, on-device toxicity moderation a reality for mobile apps and games.