Inspiration

Local reviews drive real decisions, but noise (ads, off-topic posts, non-visit rants, spam) erodes trust. We built a practical pipeline that (1) auto-labels reviews against clear policies and (2) trains a lightweight classifier that platforms can run cheaply at scale.

What it does

  • Classifies reviews into ADS, IRRELEVANT, NON_VISIT_RANT, LOW_QUALITY, plus an overall USEFUL/Relevant flag.
  • Bootstraps labels with an LLM (Gemma-3-12B-IT, 4-bit), using JSON-only prompts and guardrails for empty/emoji/gibberish text.
  • Trains a fast baseline (TF-IDF + multi-label Logistic Regression) with per-label threshold calibration from PR curves.
  • Exports FP/FN CSVs for targeted error analysis.
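
The baseline in the list above can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn, which the write-up does not name; the n-gram ranges, `max_iter`, the helper name `build_baseline`, and the toy reviews are placeholder choices for the example, not the project's actual settings or data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

LABELS = ["ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY"]


def build_baseline():
    """Word + char TF-IDF feeding one class-weighted LR per label (assumed setup)."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ])
    # One binary classifier per label; "balanced" reweights rare positives.
    clf = OneVsRestClassifier(
        LogisticRegression(class_weight="balanced", max_iter=1000)
    )
    return Pipeline([("tfidf", features), ("clf", clf)])


# Toy data for illustration only: two invented reviews per label.
texts = [
    "Visit www.cheap-deals.example for 50% off today",
    "Call now for discount coupons and promo codes",
    "My phone battery died on the bus",
    "The weather was terrible all week",
    "Never been there but I heard it is awful",
    "I have not visited, my cousin says avoid it",
    "ok",
    "good",
]
Y = np.array([
    [1, 0, 0, 0], [1, 0, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 0, 1, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [0, 0, 0, 1],
])

model = build_baseline()
model.fit(texts, Y)

# One probability per label and review; thresholds are applied afterwards.
proba = model.predict_proba(["Huge sale, visit our website now", "nice"])
```

Keeping the model's output as per-label probabilities, rather than calling `predict`, is what makes per-label threshold calibration possible later.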

How we built it

  • Data prep: merge Google Local Reviews with metadata → de-dupe → normalize/clean text.
  • LLM labeling: strict system rules, compact template, deterministic decoding, brace-based JSON parse, short-circuit spam guard; ~1k pseudo-labels with chunked checkpoints.
  • Modeling: TF-IDF (word + char n-grams) → class-weighted LR; 70/15/15 split.
  • Calibration & eval: per-label PR-curve thresholds; report P/R/F1; export misclassifications.
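
The labeling guard and the brace-based JSON parse described above might look something like this. It is a sketch under stated assumptions: `needs_llm`, `extract_json`, `label_review`, and the `call_llm` callable are hypothetical names, the guard heuristic is deliberately crude (it does not catch keyboard-mash gibberish), and mapping guarded text to LOW_QUALITY is our illustrative choice rather than a confirmed policy.

```python
import json

LABELS = ("ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY")


def needs_llm(text, min_letters=3):
    """Return False for empty, emoji-only, or single-letter-repeat text."""
    letters = [ch.lower() for ch in text if ch.isalpha()]
    if len(letters) < min_letters:
        return False  # empty, punctuation-only, or emoji-only
    if len(set(letters)) < 2:
        return False  # e.g. "aaaaaa"
    return True


def extract_json(raw):
    """Return the first balanced {...} span that parses as JSON, else None."""
    start = raw.find("{")
    while start != -1:
        depth, in_string, escaped = 0, False, False
        for i in range(start, len(raw)):
            ch = raw[i]
            if in_string:
                # Braces inside string values must not affect the depth count.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(raw[start:i + 1])
                    except json.JSONDecodeError:
                        break  # not valid JSON; try the next "{"
        start = raw.find("{", start + 1)
    return None


def label_review(text, call_llm):
    """Short-circuit obvious junk; otherwise parse the model's JSON reply."""
    if not needs_llm(text):
        # Assumption for this sketch: guarded text is marked LOW_QUALITY.
        return {**{k: 0 for k in LABELS}, "LOW_QUALITY": 1}
    parsed = extract_json(call_llm(text))
    if parsed is None:
        return None  # caller can retry or skip this review
    return {k: int(parsed.get(k) == 1) for k in LABELS}
```

Returning `None` on a failed parse, instead of an all-zero label, keeps unparseable replies out of the training set so they can be retried from the next checkpoint.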

Challenges

  • Gated/large checkpoints on Colab: solved via real file copies + shard size verification.
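  • Class imbalance: the rare ADS, IRRELEVANT, and NON_VISIT_RANT classes had too few positives in small samples, which hurt their F1.
  • Pseudo-label noise: short or ambiguous texts produce unreliable labels; mitigated with input guards and rationale caps.
  • Time limits: results are strong on Relevance, but the rarer classes need more positive examples than we could label in the time available.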

Accomplishments

  • Clean, end-to-end, resumable Colab pipeline (mount→cache→label→train→calibrate→analyze).
  • High F1 on Relevance (~0.94) despite noisy short texts.
  • Robust engineering: Drive caching, JSON-safe prompting, error-tolerant parsing.

What we learned

  • Data beats model swaps for rare classes: targeted positives matter most.
  • Thresholds matter: PR-based calibration beats a naive 0.5 cutoff.
  • Guardrails reduce noise: normalize text + short-circuit obvious spam.
  • Ops hygiene (checkpoints/caching) saves hackathon hours.
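
To illustrate the threshold point above, here is a small dependency-free sketch that sweeps every distinct score as a cutoff and keeps the one with the best F1. Picking the max-F1 point is an assumption for this example (the write-up only says thresholds come from PR curves), and `best_f1_threshold`, `f1_at`, and the toy scores are hypothetical. In practice the sweep would be run once per label on the validation split, not the test split.

```python
def f1_at(y_true, scores, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(y_true, scores, default=0.5):
    """Return (threshold, precision, recall, f1) for the max-F1 cutoff."""
    total_pos = sum(y_true)
    best = (default, 0.0, 0.0, 0.0)
    if total_pos == 0:
        return best  # no positives to calibrate on; keep the default
    pairs = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Evaluate only once per group of tied scores.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        if f1 > best[3]:
            best = (score, precision, recall, f1)
    return best


# Toy scores for one rare label: both positives score below 0.5.
y_val = [1, 0, 1, 0, 0, 0]
s_val = [0.45, 0.40, 0.30, 0.20, 0.10, 0.05]

threshold, precision, recall, f1 = best_f1_threshold(y_val, s_val)
naive_f1 = f1_at(y_val, s_val, 0.5)
```

On this toy label the naive 0.5 cutoff predicts no positives at all, while the swept cutoff recovers both, which is the typical failure mode for rare classes whose scores rarely reach 0.5.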

What’s next

  • Grow rare classes: regex mining + focused LLM labeling; active learning on uncertain/hard negatives.
  • Human-in-the-loop: 300–500 verified examples to anchor thresholds.
  • Richer features: simple policy signals (URL, phone, %digits, length, link count) + optional metadata (user history, time).
  • Model upgrades: calibrated Linear SVM; then compact transformers (e.g., DeBERTa-v3-small with LoRA) as labels scale.
  • Multilingual & UI: expand prompts/locales; add a small dashboard for triage and policy audits.
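
The policy signals proposed above could start as a handful of regex and counting features. This is only a sketch of planned work: the function name `policy_signals`, the feature names, and both regular expressions are illustrative assumptions, and the phone pattern in particular is a loose heuristic (8 to 15 digits with optional separators), not a validated matcher.

```python
import re

# Loose patterns for illustration; real policies would need tuning per locale.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"(?<!\d)\+?\d(?:[\s().-]{0,2}\d){7,14}(?!\d)")


def policy_signals(text):
    """Return simple numeric signals that can sit alongside TF-IDF features."""
    length = len(text)
    links = URL_RE.findall(text)
    digits = sum(ch.isdigit() for ch in text)
    return {
        "has_url": int(bool(links)),
        "link_count": len(links),
        "has_phone": int(bool(PHONE_RE.search(text))),
        # Stored as a fraction in [0, 1] rather than a percentage.
        "digit_fraction": digits / length if length else 0.0,
        "length": length,
    }


ad_like = policy_signals(
    "Visit www.shop.example or http://deals.example/x call 555-123-4567"
)
plain = policy_signals("Great food, 2 visits")
```

Features like these are cheap to compute and easy to audit, which fits the goal of a classifier that platforms can run at scale.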
