Reviews Classifier

Inspiration

Local reviews drive real decisions, but noise (ads, off-topic posts, non-visit rants, spam) erodes trust. We built a practical pipeline that (1) auto-labels against clear policies and (2) trains a lightweight classifier platforms can run cheaply at scale.

What it does

Classifies reviews into four policy labels (ADS, IRRELEVANT, NON_VISIT_RANT, LOW_QUALITY), plus an overall USEFUL/Relevant flag.
Bootstraps labels with an LLM (Gemma-3-12B-IT, 4-bit), using JSON-only prompts and guardrails for empty/emoji/gibberish text.
Trains a fast baseline (TF-IDF + multi-label Logistic Regression) with per-label threshold calibration from PR curves.
Exports FP/FN CSVs for targeted error analysis.
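
The FP/FN export in the last item could look roughly like the sketch below. It uses only the Python standard library; the function names, the column layout (`label`, `error`, `text`), and the toy reviews are assumptions made for illustration, not the project's actual schema.

```python
import csv
import io

LABELS = ["ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY"]

def collect_errors(texts, y_true, y_pred, labels=LABELS):
    """Return one row per (review, label) disagreement, tagged FP or FN."""
    rows = []
    for text, truth, pred in zip(texts, y_true, y_pred):
        for j, label in enumerate(labels):
            if pred[j] == 1 and truth[j] == 0:
                rows.append({"label": label, "error": "FP", "text": text})
            elif pred[j] == 0 and truth[j] == 1:
                rows.append({"label": label, "error": "FN", "text": text})
    return rows

def write_errors_csv(rows, handle):
    """Write the error rows to any file-like object as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["label", "error", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Toy data (hypothetical): two reviews, one missed ad and one false LOW_QUALITY.
texts = ["Visit www.cheap-pizza.example for deals", "Never went, heard it is bad"]
y_true = [[1, 0, 0, 0], [0, 0, 1, 0]]
y_pred = [[0, 0, 0, 0], [0, 0, 1, 1]]

rows = collect_errors(texts, y_true, y_pred)
buffer = io.StringIO()  # swap for open("errors.csv", "w", newline="") to write a file
write_errors_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping one row per label disagreement, rather than one row per review, makes it easy to filter the CSV to a single rare class during error analysis.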

How we built it

Data prep: merge Google Local Reviews with metadata → de-dupe → normalize/clean text.
LLM labeling: strict system rules, compact template, deterministic decoding, brace-based JSON parse, short-circuit spam guard; ~1k pseudo-labels with chunked checkpoints.
Modeling: TF-IDF (word + char n-grams) → class-weighted LR; 70/15/15 split.
Calibration & eval: per-label PR-curve thresholds; report P/R/F1; export misclassifications.
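
As an illustration of the LLM labeling step, the guard and the brace-based JSON parse might be sketched as below. The model call is a stand-in (`fake_llm`), and the helper names, the `min_alnum` cut-off, the crude repeated-character check, and the choice to map degenerate text straight to LOW_QUALITY are all assumptions rather than the project's actual rules.

```python
import json

LABELS = ("ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY")

def is_degenerate(text, min_alnum=3):
    """Guard: True for empty, emoji-only, punctuation-only, or trivially repetitive text."""
    # Emoji and punctuation are not alphanumeric, so they are dropped here.
    chars = [ch.lower() for ch in text if ch.isalnum()]
    if len(chars) < min_alnum:
        return True
    # Crude gibberish check: at most two distinct characters, e.g. "aaaaaa".
    return len(chars) >= 6 and len(set(chars)) <= 2

def parse_first_json_object(raw):
    """Slice from the first '{' to the last '}' and try to decode it."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def label_review(text, llm_call):
    """Return a {label: 0/1} dict, or None when the model output cannot be parsed."""
    if is_degenerate(text):
        # Assumption: degenerate text is marked LOW_QUALITY without calling the model.
        return {label: int(label == "LOW_QUALITY") for label in LABELS}
    parsed = parse_first_json_object(llm_call(text))
    if parsed is None:
        return None  # the caller can retry or skip this review
    return {label: int(parsed.get(label) == 1) for label in LABELS}

def fake_llm(text):
    # Stand-in for the real model: JSON wrapped in chatter, as LLMs often produce.
    return 'Sure! {"ADS": 1, "IRRELEVANT": 0, "NON_VISIT_RANT": 0, "LOW_QUALITY": 0} Done.'

print(label_review("🔥🔥🔥", fake_llm))
print(label_review("Call 555-0100 for 50% off at our store", fake_llm))
```

Returning `None` on a failed parse, instead of raising, lets a chunked labeling loop keep going and record the failure for a later retry.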

Challenges

Gated/large checkpoints on Colab: solved via real file copies + shard size verification.
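Class imbalance: the rare ADS, IRRELEVANT, and NON_VISIT_RANT labels have too few positives in small samples, which hurts F1.
Pseudo-label noise: short/ambiguous texts yield unreliable labels; mitigated with guards and rationale caps.
Time limits: results are strong on Relevance, but the rarer classes need more positives than we had time to label.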
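
For the class-imbalance point above, a class-weighted baseline along the lines of the one described might be sketched as follows. This assumes scikit-learn and NumPy are installed; the n-gram ranges, `sublinear_tf`, `max_iter`, and the toy reviews are illustrative guesses, not the settings used in the project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

LABELS = ["ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY"]

def build_baseline():
    """Word + character TF-IDF features feeding one class-weighted LR per label."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), sublinear_tf=True)),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), sublinear_tf=True)),
    ])
    # class_weight="balanced" up-weights the rare positive class of each label.
    clf = OneVsRestClassifier(LogisticRegression(class_weight="balanced", max_iter=1000))
    return Pipeline([("tfidf", features), ("clf", clf)])

# Toy training set (hypothetical): every label has positives and negatives.
texts = [
    "Visit www.deals.example for cheap phones",
    "Call now for 50% off car insurance",
    "My cat learned to open doors this week",
    "The election results were surprising",
    "Never been here but I heard the owner is rude",
    "I have not visited, my cousin says it is awful",
    "ok",
    "good",
    "Friendly staff and the ramen was rich and hot",
    "Clean rooms, quick check-in, fair price",
]
Y = np.array([
    [1, 0, 0, 0], [1, 0, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 0, 1, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [0, 0, 0, 1],
    [0, 0, 0, 0], [0, 0, 0, 0],
])

model = build_baseline()
model.fit(texts, Y)
probs = model.predict_proba(texts)  # one probability column per label
print(probs.shape)
```

Class weighting only rebalances the loss; it does not create new information, so rare labels still benefit most from additional verified positives.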

Accomplishments

Clean, end-to-end, resumable Colab pipeline (mount→cache→label→train→calibrate→analyze).
High F1 on Relevance (~0.94) despite noisy short texts.
Robust engineering: Drive caching, JSON-safe prompting, error-tolerant parsing.

What we learned

Data beats model swaps for rare classes: targeted positives matter most.
Thresholds matter: PR-based calibration > naive 0.5.
Guardrails reduce noise: normalize text + short-circuit obvious spam.
Ops hygiene (checkpoints/caching) saves hackathon hours.
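
To make the threshold point concrete, here is a small standard-library sketch of picking a per-label cut-off by sweeping the precision/recall trade-off on validation scores. Choosing the threshold that maximizes F1 is an assumed criterion (the writeup only says thresholds come from PR curves), and the scores and labels are made up to show a rare class whose positives mostly score below 0.5.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    """Try each distinct score as a cut-off; ties go to the lowest threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))

# Hypothetical validation data for one rare label (3 positives out of 10).
scores = [0.05, 0.08, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.41, 0.62]
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

# Per-label calibration: repeat the sweep for each label's validation column.
validation = {"ADS": (y_true, scores)}
thresholds = {label: best_threshold(y, s) for label, (y, s) in validation.items()}

print(thresholds["ADS"])                                            # 0.3
print(round(f1_at_threshold(y_true, scores, 0.5), 3))               # 0.5
print(round(f1_at_threshold(y_true, scores, thresholds["ADS"]), 3)) # 0.857
```

On this toy label the naive 0.5 cut-off catches only one of three positives, while the calibrated cut-off recovers all three at the cost of one false positive. Thresholds chosen this way should be fit on the validation split and reported on the held-out test split.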

What’s next

Grow rare classes: regex mining + focused LLM labeling; active learning on uncertain/hard negatives.
Human-in-the-loop: 300–500 verified examples to anchor thresholds.
Richer features: simple policy signals (URL, phone, %digits, length, link count) + optional metadata (user history, time).
Model upgrades: calibrated Linear SVM; then compact transformers (e.g., DeBERTa-v3-small with LoRA) as labels scale.
Multilingual & UI: expand prompts/locales; add a small dashboard for triage and policy audits.
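
The simple policy signals listed above (URL, phone, %digits, length, link count) could be extracted along these lines. This is a sketch using only the standard library; the regular expressions are deliberately crude assumptions (for example, any run of 7 to 15 digits counts as a phone number) and the feature names are hypothetical.

```python
import re

# Assumed patterns: loose on purpose, meant as cheap signals rather than validators.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d(?:[\s().-]?\d){6,14}")

def policy_signals(text):
    """Return a small dict of hand-crafted features for one review."""
    urls = URL_RE.findall(text)
    non_space = [ch for ch in text if not ch.isspace()]
    digits = sum(ch.isdigit() for ch in non_space)
    return {
        "has_url": int(bool(urls)),
        "link_count": len(urls),
        "has_phone": int(PHONE_RE.search(text) is not None),
        "pct_digits": digits / len(non_space) if non_space else 0.0,
        "length": len(text),
    }

spam = policy_signals(
    "Best deals at www.shop.example and https://promo.example/now call 555-010-0199"
)
clean = policy_signals("Lovely patio, friendly staff.")
print(spam)
print(clean)
```

These features can be appended to the TF-IDF matrix as extra columns, or used on their own as regex-mined candidates for focused labeling of the rare ADS class.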