Inspiration

Most toxicity classifiers fail the moment someone writes "bhai tu bahut stupid hai" (roughly, "brother, you are so stupid") — a single sentence mixing Hindi and English the way millions of Indian users naturally communicate.

We tested multiple moderation tools on code-mixed data. Over 40% of these inputs were misclassified — not because the models were weak, but because they were not designed for multilingual, mixed-language text.

SafeSpeak was built to close this gap: a toxicity detector that works when language boundaries don’t exist.


What it does

SafeSpeak classifies user-generated comments as toxic or non-toxic across:

  • English ("you are stupid")
  • Hindi ("तुम बहुत बेकार हो", roughly "you are very useless")
  • Code-mixed inputs blending both languages

It also handles borderline phrasing where toxicity is implied.

The system outputs:

  • Binary label (0 = non-toxic, 1 = toxic)
  • Confidence score for moderation decisions
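At inference time, these outputs map naturally onto a thin wrapper around any text-classification model. A minimal sketch, assuming a `classify` callable that returns P(toxic) and an illustrative 0.5 threshold (neither is the exact SafeSpeak code; in deployment `classify` would wrap textdetox/bert-multilingual-toxicity-classifier):

```python
def moderate(text, classify, threshold=0.5):
    """Return (label, confidence) for one comment.

    `classify` is any callable mapping text to a toxicity
    probability in [0, 1]. The 0.5 threshold is an assumption,
    tunable per moderation policy.
    """
    score = classify(text)
    label = 1 if score >= threshold else 0  # 1 = toxic, 0 = non-toxic
    return label, score


# Stand-in classifier for illustration only:
fake_classify = lambda t: 0.91 if "stupid" in t.lower() else 0.07
print(moderate("bhai tu bahut stupid hai", fake_classify))  # (1, 0.91)
print(moderate("you did a great job", fake_classify))       # (0, 0.07)
```

Keeping the threshold outside the model lets a moderation team trade false positives against false negatives without retraining.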

How we built it

Model: textdetox/bert-multilingual-toxicity-classifier — pretrained on multilingual toxicity corpora.

Pipeline:

  • Data cleaning (nulls, duplicates, preserving Devanagari text)
  • Tokenization using Hugging Face tokenizer
  • Stratified 80/20 train-validation split
  • Zero-shot evaluation before training
  • Fine-tuning (learning rate 2e-5, warmup_steps=270, weight decay=0.01)
  • Evaluation using ROC-AUC and classification metrics
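The stratified 80/20 split in the steps above can be sketched in plain Python (in practice `sklearn.model_selection.train_test_split(..., stratify=labels)` does the same job; the helper name and seed here are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=42):
    """Split example indices 80/20 while preserving the class ratio
    in both halves, so a toxic/non-toxic imbalance in the data is
    mirrored in train and validation sets alike."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)

    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * train_frac))
        train_idx.extend(idxs[:cut])
        val_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(val_idx)


labels = [0] * 80 + [1] * 20  # imbalanced toy dataset
train, val = stratified_split(labels)
# both splits keep the original 80:20 class ratio
```

Stratifying matters here because toxic comments are usually the minority class; a naive random split can leave the validation set with too few toxic examples to measure recall reliably.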

A key design choice was evaluating the pretrained model zero-shot before fine-tuning, so any change from training is reported against a transparent baseline.
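The ROC-AUC used for both the zero-shot and fine-tuned evaluations can be computed with the standard rank-based formula, equivalent to `sklearn.metrics.roc_auc_score`. A minimal pure-Python sketch:

```python
def roc_auc(y_true, y_score):
    """Rank-based ROC-AUC: the probability that a randomly chosen
    toxic example scores higher than a randomly chosen non-toxic
    one (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Perfect separation of toxic from non-toxic scores -> AUC of 1.0
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

Because AUC is threshold-independent, it is a fair way to compare the zero-shot and fine-tuned checkpoints before any moderation cutoff is chosen.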


Results

  • ROC-AUC (zero-shot): 0.9989
  • ROC-AUC (fine-tuned): 0.9966
  • Accuracy: 97%
  • Macro F1-score: 0.97

Balanced precision and recall ensure reliable moderation without excessive false positives.


Challenges we ran into

  • Handling code-mixed Hindi + English text without breaking linguistic meaning
  • Designing preprocessing that preserves Devanagari characters
  • Debugging dependency conflicts in Colab (transformers version mismatch)
  • Maintaining performance when starting from an already strong pretrained model

Accomplishments that we're proud of

  • Built a pipeline that handles mixed-script (Latin + Devanagari) input
  • Achieved ROC-AUC 0.9966 with 0.97 macro F1
  • Verified results with reproducible outputs
  • Implemented zero-shot vs fine-tuned comparison for transparent evaluation

What we learned

Starting from a domain-matched pretrained model changes what fine-tuning can realistically achieve, and reporting the zero-shot baseline honestly is what makes the result credible.


What's next for SafeSpeak

  • Multi-label toxicity classification (threats, hate speech, harassment)
  • Expand to models like XLM-R for broader language coverage
  • Deploy as a Telegram/Discord moderation bot
  • Build a real-time API for production use
