[MangoDB] Filtering the Noise

📌 Project Overview

This project uses a BART-based zero-shot classification model to automatically classify Google reviews into four categories:

Relevant – Genuine reviews about the business
Irrelevant – Off-topic reviews not related to the business
Advertisement – Promotional content or links
Rant_No_Visit – Complaints from users who never visited the business

We built this as part of TikTok Tech Jam, where the focus is on solving real-world problems with creative AI solutions. The same approach could apply to TikTok comments, TikTok Shop reviews, and brand campaign feedback to separate meaningful content from noise.

⚙️ Methodology

Model
We used facebook/bart-large-mnli through Hugging Face’s pipeline for zero-shot classification. Instead of training on labeled data, we crafted prompts with candidate labels to guide the model.
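The zero-shot setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the label descriptions in LABEL_DESCRIPTIONS, the HYPOTHESIS_TEMPLATE wording, and the helper names build_classifier and classify_reviews are all assumptions made for the example. Only the model name and the use of the Hugging Face zero-shot pipeline come from the project.

```python
from typing import Callable, Dict, List

# Assumed prompt wording: natural-language descriptions mapped to the four
# project categories. The project's real candidate labels are not documented here.
LABEL_DESCRIPTIONS = {
    "a genuine review of the business": "relevant",
    "off-topic and unrelated to the business": "irrelevant",
    "an advertisement or promotional content": "advertisement",
    "a complaint from someone who never visited the business": "rant_no_visit",
}
# Assumed template; the pipeline fills {} with each candidate label.
HYPOTHESIS_TEMPLATE = "This text is {}."


def build_classifier():
    """Load the zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify_reviews(reviews: List[str], classifier: Callable) -> List[Dict]:
    """Return the top-scoring category and its score for each review."""
    results = classifier(
        reviews,
        candidate_labels=list(LABEL_DESCRIPTIONS),
        hypothesis_template=HYPOTHESIS_TEMPLATE,
    )
    # Some versions return a single dict when only one review is passed.
    if isinstance(results, dict):
        results = [results]
    # Labels come back sorted by descending score, so index 0 is the prediction.
    return [
        {
            "text": r["sequence"],
            "label": LABEL_DESCRIPTIONS[r["labels"][0]],
            "score": r["scores"][0],
        }
        for r in results
    ]
```

To try it, call classify_reviews(["Great pho, friendly staff."], build_classifier()). Keeping the descriptions and template in one place makes it easy to compare different prompt wordings, since zero-shot results depend on them.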
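Evaluation Metrics
Because the classes are imbalanced, we report two F1 averages:
Macro F1 → the unweighted mean of per-class F1, so every class counts equally regardless of its size.
Weighted F1 → the mean of per-class F1 weighted by each class's number of true examples, so frequent classes contribute more.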
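The difference between the two averages is easiest to see on a small example. The sketch below computes both in plain Python; the function names are illustrative and the toy labels are made up for the demonstration, not taken from the project's results. In practice the same numbers can be obtained from scikit-learn's f1_score with average="macro" and average="weighted".

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for every class that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy data: 8 majority-class reviews, 2 minority-class reviews, one of which is missed.
y_true = ["irrelevant"] * 8 + ["advertisement"] * 2
y_pred = ["irrelevant"] * 8 + ["advertisement", "irrelevant"]

print(round(macro_f1(y_true, y_pred), 3))     # 0.804
print(round(weighted_f1(y_true, y_pred), 3))  # 0.886
```

The missed minority example pulls macro F1 down more than weighted F1, which is why both are worth reporting on an imbalanced dataset.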
Findings
Strong performance on frequent classes (e.g., irrelevant).
Weaker performance on minority classes (advertisement, rant_no_visit).
Common confusion: irrelevant ↔ rant_no_visit, and subtle ads misclassified as relevant.
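One simple way to surface confusions like these is to count the (true, predicted) pairs among the errors. The helper below is a hypothetical sketch and the toy labels are invented for illustration; they are not the project's predictions.

```python
from collections import Counter
from typing import List, Tuple


def confusion_pairs(y_true: List[str], y_pred: List[str]) -> List[Tuple[Tuple[str, str], int]]:
    """Count (true, predicted) pairs for misclassified examples, most frequent first."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common()


# Toy example with three errors, two of them the same kind.
y_true = ["rant_no_visit", "rant_no_visit", "advertisement", "relevant"]
y_pred = ["irrelevant", "irrelevant", "relevant", "relevant"]

for (true_label, predicted), count in confusion_pairs(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```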
Limitations
Imbalanced dataset.
Zero-shot models are sensitive to prompt design.
facebook/bart-large-mnli is trained for natural language inference (MNLI), not fine-tuned on review or comment text.
Improvements
Fine-tune on labeled review/comment datasets.
Balance the dataset with oversampling or class-weighted training.
Experiment with newer instruction-tuned LLMs for better adaptation.
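The two balancing options mentioned above can be sketched in plain Python. This is an illustration under assumptions, not part of the project: the function names are hypothetical, the weights use the common inverse-frequency heuristic n_samples / (n_classes * class_count), and the oversampler simply duplicates minority examples at random.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: rare classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def random_oversample(
    examples: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Duplicate minority-class (text, label) pairs until all classes match the largest."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Tuple[str, str]]] = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


# Toy data: 6 majority examples and 2 minority examples.
data = [(f"review {i}", "irrelevant") for i in range(6)]
data += [(f"ad {i}", "advertisement") for i in range(2)]

print(balanced_class_weights([label for _, label in data]))
print(Counter(label for _, label in random_oversample(data)))
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into evaluation.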