Inspiration
Local reviews drive real decisions, but noise (ads, off-topic posts, non-visit rants, spam) erodes trust. We built a practical pipeline that (1) auto-labels against clear policies and (2) trains a lightweight classifier platforms can run cheaply at scale.
What it does
- Classifies reviews into ADS, IRRELEVANT, NON_VISIT_RANT, LOW_QUALITY, plus an overall USEFUL/Relevant flag.
- Bootstraps labels with an LLM (Gemma-3-12B-IT, 4-bit), using JSON-only prompts and guardrails for empty/emoji/gibberish text.
- Trains a fast baseline (TF-IDF + multi-label Logistic Regression) with per-label threshold calibration from PR curves.
- Exports FP/FN CSVs for targeted error analysis.
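The FP/FN export can be sketched roughly as follows. This is a minimal, standard-library illustration and not the project's actual code: the function names, the CSV column layout, and the dict-per-review label format are all assumptions made for the example.

```python
import csv

# Label names come from the write-up; everything else here is illustrative.
LABELS = ["ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY"]

def misclassified_rows(texts, y_true, y_pred, label):
    """Yield (error_type, text) pairs for one label.

    FP = predicted positive but labeled negative; FN = the reverse.
    y_true and y_pred are lists of {label: 0/1} dicts (assumed format).
    """
    for text, truth, pred in zip(texts, y_true, y_pred):
        if pred[label] and not truth[label]:
            yield ("FP", text)
        elif truth[label] and not pred[label]:
            yield ("FN", text)

def write_errors_csv(fh, texts, y_true, y_pred, labels=LABELS):
    """Write every FP/FN to an open text file; return the number of rows."""
    writer = csv.writer(fh)
    writer.writerow(["label", "error_type", "text"])
    count = 0
    for label in labels:
        for error_type, text in misclassified_rows(texts, y_true, y_pred, label):
            writer.writerow([label, error_type, text])
            count += 1
    return count
```

When writing to disk, open the file with `newline=""` as the `csv` module recommends, for example `open("errors.csv", "w", newline="", encoding="utf-8")`.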
How we built it
- Data prep: merge Google Local Reviews with metadata → de-dupe → normalize/clean text.
- LLM labeling: strict system rules, compact template, deterministic decoding, brace-based JSON parse, short-circuit spam guard; ~1k pseudo-labels with chunked checkpoints.
- Modeling: TF-IDF (word + char n-grams) → class-weighted LR; 70/15/15 split.
- Calibration & eval: per-label PR-curve thresholds; report P/R/F1; export misclassifications.
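To make the LLM-labeling step concrete, here is a hedged sketch of a short-circuit guard plus a brace-based JSON parse. The guard rule (fewer than three distinct alphanumeric characters), the fallback label it assigns, and the `ask_llm` callable are assumptions for illustration; the real prompts and rules are not shown in this write-up.

```python
import json

LABELS = ("ADS", "IRRELEVANT", "NON_VISIT_RANT", "LOW_QUALITY")

def is_degenerate(text):
    """Crude guard (assumed rule): empty, emoji-only, or repeated-character
    text has fewer than three distinct alphanumeric characters."""
    distinct = {ch.lower() for ch in text if ch.isalnum()}
    return len(distinct) < 3

def parse_json_reply(reply):
    """Brace-based parse: take the span from the first '{' to the last '}'
    and try to decode it. Returns a dict, or None if that fails."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def label_review(text, ask_llm):
    """Label one review. ask_llm is a hypothetical callable: text -> reply string."""
    if is_degenerate(text):
        # Short-circuit: skip the model call entirely (assumed fallback label).
        out = {k: 0 for k in LABELS}
        out["LOW_QUALITY"] = 1
        out["source"] = "guard"
        return out
    parsed = parse_json_reply(ask_llm(text))
    if parsed is None:
        return None  # caller decides whether to retry or drop the review
    # True and 1 both compare equal to 1; any other value counts as 0.
    out = {k: int(parsed.get(k) == 1) for k in LABELS}
    out["source"] = "llm"
    return out
```

Taking the outermost braces tolerates chatty replies that wrap the JSON in extra prose, at the cost of failing when the reply contains more than one JSON object.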
Challenges
- Gated/large checkpoints on Colab: solved via real file copies + shard size verification.
- Class imbalance: positives for ADS, IRRELEVANT, and NON_VISIT_RANT are rare, which hurt F1 in small samples.
- Pseudo-label noise from short/ambiguous texts: mitigated with guards and rationale caps.
- Time limits: results on Relevance are strong, but the rarer classes need more positive examples than we could collect in the available time.
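The checkpoint fix (real file copies plus shard size verification) might look something like the sketch below. The `*.safetensors` pattern, the directory arguments, and the function name are assumptions; the actual cache layout is not described here.

```python
import shutil
from pathlib import Path

def copy_and_verify(src_dir, dst_dir, pattern="*.safetensors"):
    """Copy shard files as real files and return names whose size still differs.

    shutil.copyfile follows symlinks by default, so the destination receives
    the actual bytes even when the source entry is a link.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    mismatched = []
    for src in sorted(src_dir.glob(pattern)):
        dst = dst_dir / src.name
        expected = src.stat().st_size  # stat() also follows symlinks
        # Re-copy when the shard is missing or truncated.
        if not dst.exists() or dst.stat().st_size != expected:
            shutil.copyfile(src, dst)
        if dst.stat().st_size != expected:
            mismatched.append(src.name)
    return mismatched
```

An empty return value means every shard matches its source size. A size check is cheaper than hashing multi-gigabyte shards, but it only catches truncated copies, not corrupted bytes.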
Accomplishments
- Clean, end-to-end, resumable Colab pipeline (mount→cache→label→train→calibrate→analyze).
- High F1 on Relevance (~0.94) despite noisy short texts.
- Robust engineering: Drive caching, JSON-safe prompting, error-tolerant parsing.
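As an illustration of the resumable part of the pipeline, the sketch below labels reviews in chunks and appends each chunk to a JSONL checkpoint, so a restarted session skips finished work. The file format, `chunk_size`, and the `label_fn` callable are assumptions, not the project's actual implementation.

```python
import json
from pathlib import Path

def label_in_chunks(reviews, label_fn, ckpt_path, chunk_size=50):
    """Label (review_id, text) pairs, checkpointing after every chunk.

    label_fn is a hypothetical callable: text -> JSON-serializable labels.
    Returns a dict of review_id -> record, including previously saved ones.
    """
    ckpt = Path(ckpt_path)
    done = {}
    if ckpt.exists():
        # Resume: load every record written by earlier runs.
        with ckpt.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    rec = json.loads(line)
                    done[rec["id"]] = rec
    pending = [(rid, text) for rid, text in reviews if rid not in done]
    for i in range(0, len(pending), chunk_size):
        chunk = pending[i:i + chunk_size]
        records = [{"id": rid, "labels": label_fn(text)} for rid, text in chunk]
        # Append only after the whole chunk is labeled, so a crash loses
        # at most one chunk of work.
        with ckpt.open("a", encoding="utf-8") as fh:
            for rec in records:
                fh.write(json.dumps(rec) + "\n")
        done.update((rec["id"], rec) for rec in records)
    return done
```

Pointing `ckpt_path` at a mounted Drive folder would let the checkpoint survive a Colab runtime reset.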
What we learned
- Data beats model swaps for rare classes: targeted positives matter most.
- Thresholds matter: PR-based calibration > naive 0.5.
- Guardrails reduce noise: normalize text + short-circuit obvious spam.
- Ops hygiene (checkpoints/caching) saves hackathon hours.
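The threshold point can be illustrated with a small, dependency-free sketch that sweeps candidate cut-offs on a validation split and keeps the one with the best F1 per label. With scikit-learn one would more likely start from `sklearn.metrics.precision_recall_curve`; the function names and the max-F1 selection rule here are assumptions, since the exact selection criterion is not stated above.

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores):
    """Try each observed score as a cut-off; return (threshold, F1) with max F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        f1 = prf(y_true, [s >= t for s in scores])[2]
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

def calibrate(val_true, val_scores):
    """Per-label thresholds from validation data: {label: threshold}."""
    return {label: best_threshold(val_true[label], val_scores[label])[0]
            for label in val_true}

# Toy rare-class example (made-up numbers): no score reaches 0.5.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.42, 0.35, 0.30, 0.10, 0.05, 0.02]
naive_f1 = prf(y_true, [s >= 0.5 for s in scores])[2]
tuned_t, tuned_f1 = best_threshold(y_true, scores)
```

In this toy case the naive 0.5 cut-off predicts no positives at all, while the tuned threshold recovers both. Thresholds should be chosen on the validation split and reported on the held-out test split to avoid optimistic estimates.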
What’s next
- Grow rare classes: regex mining + focused LLM labeling; active learning on uncertain/hard negatives.
- Human-in-the-loop: 300–500 verified examples to anchor thresholds.
- Richer features: simple policy signals (URL, phone, %digits, length, link count) + optional metadata (user history, time).
- Model upgrades: calibrated Linear SVM; then compact transformers (e.g., DeBERTa-v3-small with LoRA) as labels scale.
- Multilingual & UI: expand prompts/locales; add a small dashboard for triage and policy audits.
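The simple policy signals listed above could be extracted along these lines. This is a sketch under stated assumptions: the regular expressions are deliberately crude (the phone pattern will also match other long digit runs such as dates), and the feature names are hypothetical.

```python
import re

# Illustrative patterns only; real URL and phone detection needs more care.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d(?:[\s().-]{0,2}\d){6,14}")

def policy_signals(text):
    """Return a small dict of hand-crafted features for one review."""
    length = len(text)
    digits = sum(ch.isdigit() for ch in text)
    urls = URL_RE.findall(text)
    return {
        "has_url": bool(urls),
        "link_count": len(urls),
        "has_phone": bool(PHONE_RE.search(text)),
        "pct_digits": digits / length if length else 0.0,
        "length": length,
    }
```

These values could be appended to the TF-IDF features as extra columns, giving the classifier direct evidence for ADS-style content that sparse n-grams can miss.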
Built With
- gemma
- google-colab
- python