# PetaByteSized — Trustworthy Location Reviews
Noise filtering for local reviews. Detect ads, irrelevant content, and rants from non‑visitors with a YAML‑driven policy engine, classical ML baselines, and a transformer fine‑tuned for relevancy — all wrapped in a reproducible pipeline and Streamlit app.
## Elevator pitch
Keep location reviews trustworthy by automatically flagging content that breaks policy or is off-topic for the place. PetaByteSized combines transparent rules (with human‑readable reasons) and ML (TF‑IDF + a transformer) into a precision‑first ensemble suitable for moderation workflows.
## About the project
### Problem we address
Online reviews shape where people eat, shop, and visit, but promotions, off‑topic rants, and reviews from people who never visited distort reality. We built an ML system that:
- Gauges quality: flags spam/ads, promotion links, phone “call now” content.
- Assesses relevancy: checks if a review genuinely relates to the business.
- Enforces policy: detects rant‑without‑visit and other violations, returning clear reasons.
### What inspired us
We wanted a pragmatic pipeline that a real platform could deploy quickly: explicit policies for accountability + ML for generalization. We optimized for explainability, reproducibility, and speed to value in a 72‑hour window.
### What we built
- Policy rules engine (`configs/policy.yaml`, `src/rules_engine.py`)
  - Ads: URL/phone/keyword regexes.
  - Irrelevant: semantic mismatch vs. business text, plus category‑aware off‑topic lists.
  - Rant‑no‑visit: cue phrases with a small negation window.
  - Outputs human‑readable reasons (e.g., `ads: url/phone/promo`, `irrelevant: cosine<thr`, `rant_no_visit: cue`).
- Baselines
  - Rules‑only baseline.
  - TF‑IDF + Logistic Regression (per‑label thresholds).
- Transformer
  - DistilBERT fine‑tuned for `label_irrelevant` only (class imbalance handled with pos‑weighted BCE).
- Ensemble
  - Precision‑oriented: `rules OR transformer_above_threshold`.
- Interactive app
  - Streamlit UI to upload CSVs, run predictions, view reasons, and export results.
- Reproducible CLIs
  - Ingest, clean, pseudo‑label, train, predict, evaluate, and tune thresholds.
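The rules side of the pipeline can be sketched as below. The pattern lists are illustrative stand‑ins for what would live in `configs/policy.yaml`, not the project's actual rules, and `apply_rules` is a hypothetical name for the kind of function `src/rules_engine.py` would expose:

```python
import re

# Hypothetical policy table: in the real project these patterns are
# loaded from configs/policy.yaml; the regexes here are illustrative only.
POLICY = {
    "ads": {
        "url": re.compile(r"https?://|www\.", re.I),
        "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
        "promo": re.compile(r"\b(discount|promo code|call now|limited offer)\b", re.I),
    },
    "rant_no_visit": {
        "cue": re.compile(r"\b(never been|didn't go|heard from a friend)\b", re.I),
    },
}

def apply_rules(text: str) -> list[str]:
    """Return a human-readable reason for every rule the review trips."""
    reasons = []
    for label, patterns in POLICY.items():
        hits = [name for name, rx in patterns.items() if rx.search(text)]
        if hits:
            reasons.append(f"{label}: {'/'.join(hits)}")
    return reasons
```

Because every flag carries the names of the sub‑patterns that fired (e.g., `ads: url/phone/promo`), a moderator can see at a glance why a review was caught rather than trusting an opaque score.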
### Challenges & constraints we acknowledge
- Only “irrelevant” is model‑trained. Given time/compute limits, we focused the transformer on relevancy; `label_ads` and `label_rant_no_visit` are rules‑only in this version.
- Weak labels via pseudo‑labeling. For large datasets we generated labels using our rules to scale quickly. This improves iteration speed but introduces label bias: evaluations on these sets can over‑estimate performance for rule‑like patterns.
- Limited human gold. We used a small, hand‑curated sample for sanity checks; a larger human‑labeled set would further validate generalization.
- Compute/logistics. We optimized training for Windows/CPU and modest GPUs, and added knobs for dataset limiting, workers, and sequence length.
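The pseudo‑labeling step can be sketched as follows: the rules vote on unlabeled reviews, and a classical model learns to generalize from those weak labels. The rule function, sample texts, and pipeline here are illustrative assumptions, not the project's exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rule_is_ad(text: str) -> bool:
    # Stand-in for the YAML-driven rules; any URL counts as an ad here.
    return "http://" in text or "www." in text

reviews = [
    "Visit www.best-deals.example for 50% off!!",
    "The ramen was rich and the staff were friendly.",
    "Buy now at http://spam.example",
    "Waited 40 minutes, but the tacos were worth it.",
]

# Weak labels generated by the rules. As noted above, evaluating on these
# same labels over-estimates performance for rule-like patterns.
y_weak = [int(rule_is_ad(t)) for t in reviews]

# TF-IDF + Logistic Regression baseline trained on the pseudo-labels.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, y_weak)
```

The model can then pick up ad‑like vocabulary beyond the literal URL patterns, which is the point of bootstrapping from rules when gold labels are scarce.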
### What we learned
We went through the end‑to‑end ML lifecycle under hackathon constraints:
- Data: acquire (Kaggle + McAuley Google Local), clean, standardize schemas.
- Labeling: bootstrap with pseudo‑labels when gold is scarce.
- Modeling: start with explainable rules → add TF‑IDF → fine‑tune a transformer on the hardest task (`irrelevant`).
- Thresholds/ensembles: pick operating points for precision/recall and compose with rules.
- Evaluation: report metrics, inspect error dumps, and state limitations clearly.
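Picking a precision‑first operating point, then composing it with the rules, can be sketched like this. The 0.9 precision floor and the toy scores are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision=0.9):
    """Highest-recall threshold whose precision stays above the floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    best_thr, best_rec = 1.0, -1.0
    # precision/recall have len(thresholds)+1 entries; drop the final
    # (precision=1, recall=0) point, which has no threshold.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and r > best_rec:
            best_thr, best_rec = t, r
    return best_thr

def ensemble_flag(rule_hit: bool, model_score: float, thr: float) -> bool:
    # The precision-oriented ensemble from the text:
    # flag when rules fire OR the transformer clears its threshold.
    return rule_hit or model_score >= thr

y_true = np.array([0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7])
thr = pick_threshold(y_true, scores)
```

Tuning per‑label thresholds on a validation set like this keeps false positives rare, which matters when every flag may trigger a human moderation action.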
## Built With
- chatgpt
- huggingface
- numpy
- pandas
- powershell
- python
- pytorch
- pyyaml
- scikit-learn
- streamlit
- transformers
- vscode

Models: DistilBERT (base), with optional mini BERTs for CPU. Data: Kaggle (Google Maps restaurant reviews). Dev/platform: VS Code, Windows PowerShell.