Inspiration

Toxic content crosses language barriers — hate speech, threats, and abuse exist in every language. Building a scalable multilingual moderation system that works across Hindi and English was the core motivation for this challenge.

What I Built

A binary toxic comment classifier using XLM-RoBERTa-base — a cross-lingual transformer model supporting 100 languages — fine-tuned to classify comments as Toxic (1) or Non-Toxic (0).

How I Built It

  • Used XLM-RoBERTa-base for native multilingual support across Hindi and English
  • Fine-tuned for 3 epochs on 9,000 labeled comments
  • 90/10 train-validation split with seed=42 for reproducibility
  • ROC-AUC computed using softmax probabilities for reliable ranking
  • Training config: max_length=128 tokens, batch_size=16, fp16=True (mixed precision)
  • Platform: Google Colab T4 GPU
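The reproducible 90/10 split from the steps above can be sketched as a small helper. This is a minimal illustration, not the project's actual data-loading code; the `comment_` placeholder strings stand in for the real labeled comments.

```python
import random

def train_val_split(examples, val_frac=0.1, seed=42):
    """Deterministically shuffle, then hold out val_frac of the data for validation."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]

comments = [f"comment_{i}" for i in range(9000)]   # stands in for the 9,000 labeled comments
train, val = train_val_split(comments)
print(len(train), len(val))  # → 8100 900
```

Fixing the seed means every rerun fine-tunes on exactly the same 8,100 examples and validates on the same 900, so metric changes reflect modeling changes rather than split noise.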

Results

Non-Toxic (0) → Precision: 0.97 | Recall: 0.96 | F1: 0.96

Toxic (1) → Precision: 0.96 | Recall: 0.97 | F1: 0.96

✅ ROC-AUC Score: 0.9922
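The ROC-AUC above is computed from softmax probabilities rather than hard 0/1 predictions, since AUC measures how well the model ranks toxic comments above non-toxic ones. A self-contained sketch of that computation (the four toy logit pairs are invented for illustration; AUC is implemented directly via its rank-statistic definition rather than a library call):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one comment's [non-toxic, toxic] logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def roc_auc(labels, scores):
    """AUC = probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model outputs: one [non-toxic, toxic] logit pair per comment
logits = [[2.0, -1.0], [0.5, 1.5], [-1.0, 3.0], [1.0, 0.2]]
labels = [0, 1, 1, 0]

probs_toxic = [softmax(pair)[1] for pair in logits]  # P(toxic) per comment
print(roc_auc(labels, probs_toxic))  # → 1.0 (both toxic comments outrank both non-toxic ones)
```

Using the class-1 probability as the ranking score is what makes AUC threshold-independent: the 0.9922 score says the model orders comments by toxicity almost perfectly, regardless of where the decision cutoff is placed.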

Challenges Faced

  • Handling multilingual text (Hindi + English) without any language-specific preprocessing
  • Perfectly balanced dataset (50-50) meant accuracy alone was uninformative; ROC-AUC over the softmax probabilities was needed to measure ranking quality, not just hard predictions
  • Excel file format (.xlsx) required special handling compared to standard CSV
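For the .xlsx challenge above, the fix is to read the file with pandas instead of a CSV parser. A minimal round-trip sketch, assuming pandas with the openpyxl engine is available; the column names (`comment_text`, `label`) are hypothetical and may differ from the actual competition file:

```python
import pandas as pd

# Hypothetical layout standing in for the real labeled dataset
df = pd.DataFrame({"comment_text": ["sample one", "sample two"], "label": [0, 1]})
df.to_excel("comments.xlsx", index=False)   # .xlsx writing requires openpyxl

# read_excel (not read_csv) handles the Excel format; engine inferred from the extension
loaded = pd.read_excel("comments.xlsx")
print(loaded.shape)  # → (2, 2)
```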

What I Learned

XLM-RoBERTa handles multilingual text natively without any translation or language detection preprocessing. Cross-lingual transfer learning is extremely powerful for low-resource language tasks like Hindi toxic comment detection.

Built With

  • XLM-RoBERTa-base
  • Google Colab (T4 GPU)
