# PetaByteSized — Trustworthy Location Reviews
Noise filtering for local reviews. Detect ads, irrelevant content, and rants from non‑visitors with a YAML‑driven policy engine, classical ML baselines, and a transformer fine‑tuned for relevancy — all wrapped in a reproducible pipeline and Streamlit app.
## Elevator pitch
Keep location reviews trustworthy by automatically flagging content that breaks policy or is off-topic for the place. PetaByteSized combines transparent rules (with human‑readable reasons) and ML (TF‑IDF + a transformer) into a precision‑first ensemble suitable for moderation workflows.
## About the project
### Problem we address
Online reviews shape where people eat, shop, and visit, but promotions, off‑topic rants, and reviews from people who never visited distort reality. We built an ML system that:
- Gauges quality: flags spam/ads, promotion links, phone “call now” content.
- Assesses relevancy: checks if a review genuinely relates to the business.
- Enforces policy: detects rant‑without‑visit and other violations, returning clear reasons.
### What inspired us
We wanted a pragmatic pipeline that a real platform could deploy quickly: explicit policies for accountability + ML for generalization. We optimized for explainability, reproducibility, and speed to value in a 72‑hour window.
### What we built
- Policy rules engine (`configs/policy.yaml`, `src/rules_engine.py`)
  - Ads: URL/phone/keyword regexes.
  - Irrelevant: semantic mismatch vs. business text, plus category‑aware off‑topic lists.
  - Rant‑no‑visit: cue phrases with a small negation window.
  - Outputs human‑readable reasons (e.g., `ads: url/phone/promo`, `irrelevant: cosine<thr`, `rant_no_visit: cue`).
- Baselines
  - Rules‑only baseline.
  - TF‑IDF + Logistic Regression (per‑label thresholds).
- Transformer
  - DistilBERT fine‑tuned for `label_irrelevant` only (class imbalance handled with pos‑weighted BCE).
- Ensemble
  - Precision‑oriented: `rules OR transformer_above_threshold`.
- Interactive app
  - Streamlit UI to upload CSVs, run predictions, view reasons, and export results.
- Reproducible CLIs
  - Ingest, clean, pseudo‑label, train, predict, evaluate, and tune thresholds.
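The rules side of the pipeline can be sketched as below. The pattern lists are illustrative stand‑ins for what would live in `configs/policy.yaml`, not the project's actual rules, and `apply_rules` is a hypothetical name for the kind of function `src/rules_engine.py` would expose:

```python
import re

# Hypothetical policy table: in the real project these patterns are
# loaded from configs/policy.yaml; the regexes here are illustrative only.
POLICY = {
    "ads": {
        "url": re.compile(r"https?://|www\.", re.I),
        "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
        "promo": re.compile(r"\b(discount|promo code|call now|limited offer)\b", re.I),
    },
    "rant_no_visit": {
        "cue": re.compile(r"\b(never been|didn't go|heard from a friend)\b", re.I),
    },
}

def apply_rules(text: str) -> list[str]:
    """Return a human-readable reason for every rule the review trips."""
    reasons = []
    for label, patterns in POLICY.items():
        hits = [name for name, rx in patterns.items() if rx.search(text)]
        if hits:
            reasons.append(f"{label}: {'/'.join(hits)}")
    return reasons
```

Because every flag carries the names of the sub‑patterns that fired (e.g., `ads: url/phone/promo`), a moderator can see at a glance why a review was caught rather than trusting an opaque score.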
### Challenges & constraints we acknowledge
- Only “irrelevant” is model‑trained. Given time/compute limits, we focused the transformer on relevancy; `label_ads` and `label_rant_no_visit` are rules‑only in this version.
- Weak labels via pseudo‑labeling. For large datasets we generated labels using our rules to scale quickly. This improves iteration speed but introduces label bias: evaluations on these sets can over‑estimate performance for rule‑like patterns.
- Limited human gold. We used a small, hand‑curated sample for sanity checks; a larger human‑labeled set would further validate generalization.
- Compute/logistics. We optimized training for Windows/CPU and modest GPUs, and added knobs for dataset limiting, workers, and sequence length.
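The pseudo‑labeling step can be sketched as follows: the rules vote on unlabeled reviews, and a classical model learns to generalize from those weak labels. The rule function, sample texts, and pipeline here are illustrative assumptions, not the project's exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rule_is_ad(text: str) -> bool:
    # Stand-in for the YAML-driven rules; any URL counts as an ad here.
    return "http://" in text or "www." in text

reviews = [
    "Visit www.best-deals.example for 50% off!!",
    "The ramen was rich and the staff were friendly.",
    "Buy now at http://spam.example",
    "Waited 40 minutes, but the tacos were worth it.",
]

# Weak labels generated by the rules. As noted above, evaluating on these
# same labels over-estimates performance for rule-like patterns.
y_weak = [int(rule_is_ad(t)) for t in reviews]

# TF-IDF + Logistic Regression baseline trained on the pseudo-labels.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, y_weak)
```

The model can then pick up ad‑like vocabulary beyond the literal URL patterns, which is the point of bootstrapping from rules when gold labels are scarce.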
### What we learned
We went through the end‑to‑end ML lifecycle under hackathon constraints:
- Data: acquire (Kaggle + McAuley Google Local), clean, standardize schemas.
- Labeling: bootstrap with pseudo‑labels when gold is scarce.
- Modeling: start with explainable rules → add TF‑IDF → fine‑tune a transformer on the hardest task (`irrelevant`).
- Thresholds/ensembles: pick operating points for precision/recall and compose with rules.
- Evaluation: report metrics, inspect error dumps, and state limitations clearly.
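Picking a precision‑first operating point, then composing it with the rules, can be sketched like this. The 0.9 precision floor and the toy scores are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision=0.9):
    """Highest-recall threshold whose precision stays above the floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    best_thr, best_rec = 1.0, -1.0
    # precision/recall have len(thresholds)+1 entries; drop the final
    # (precision=1, recall=0) point, which has no threshold.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and r > best_rec:
            best_thr, best_rec = t, r
    return best_thr

def ensemble_flag(rule_hit: bool, model_score: float, thr: float) -> bool:
    # The precision-oriented ensemble from the text:
    # flag when rules fire OR the transformer clears its threshold.
    return rule_hit or model_score >= thr

y_true = np.array([0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7])
thr = pick_threshold(y_true, scores)
```

Tuning per‑label thresholds on a validation set like this keeps false positives rare, which matters when every flag may trigger a human moderation action.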
## Built With
- chatgpt
- huggingface
- numpy
- pandas
- powershell
- python
- pytorch
- pyyaml
- scikit-learn
- streamlit
- transformers
- vscode

Models: DistilBERT (base), with optional mini BERTs for CPU. Data: Kaggle (Google Maps restaurant reviews). Dev/platform: VS Code, Windows PowerShell.