Inspiration
Local choices—where to eat, fix a phone, or book a clinic—often hinge on online reviews. But we kept running into junk: promo links, off-topic rants, and “never been here” complaints that distort ratings and hurt honest businesses. Manual moderation can’t keep up, and generative spam makes it worse. We built TechJem Review Moderation to restore trust: a lightweight, reproducible ML pipeline that flags ads, irrelevancies, and likely non-visitor rants so real experiences rise to the top.
What it does
Flags Google-style location reviews as VALID, ADVERTISEMENT, IRRELEVANT, or RANT_NO_VISIT to protect users and businesses from misleading content.
How we built it
UCSD Google Local “10-core” data → cleaning → rule seeds → TF-IDF + metadata (length, punctuation, caps/digits, URL flag, hour/weekday) → Logistic Regression with class-balanced training and tuned thresholds.
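The pipeline above can be sketched as a scikit-learn `FeatureUnion` that stacks TF-IDF n-grams with hand-built metadata features, feeding a class-balanced Logistic Regression. This is a minimal illustration, not the actual project code: the tiny training set, feature choices, and hyperparameters here are assumptions for demonstration.

```python
# Hypothetical sketch of the described pipeline: TF-IDF text features plus
# simple metadata features (length, punctuation, caps/digit ratios, URL flag),
# fed to a class-balanced Logistic Regression. All data below is illustrative.
import re
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

def metadata_features(texts):
    """One row per review: length, punctuation count, caps ratio,
    digit ratio, and a binary URL flag."""
    rows = []
    for t in texts:
        n = max(len(t), 1)
        rows.append([
            len(t),                                      # raw length
            sum(c in "!?.," for c in t),                 # punctuation count
            sum(c.isupper() for c in t) / n,             # caps ratio
            sum(c.isdigit() for c in t) / n,             # digit ratio
            1.0 if re.search(r"https?://", t) else 0.0,  # URL flag
        ])
    return np.array(rows)

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("meta", FunctionTransformer(metadata_features, validate=False)),
    ])),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny made-up training set; real labels come from the rule seeds.
X = [
    "Great food and friendly staff, will come back!",
    "BUY CHEAP PHONES NOW http://spam.example.com",
    "I never visited but I heard it's terrible.",
    "The weather today is nice.",
]
y = ["VALID", "ADVERTISEMENT", "RANT_NO_VISIT", "IRRELEVANT"]
pipeline.fit(X, y)
print(pipeline.predict(["Click here http://deal.example.com for discounts"]))
```

The `FeatureUnion` keeps the metadata columns alongside the sparse TF-IDF matrix, so a single classifier sees both signal sources at once.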
Challenges we ran into
We found it difficult to categorise reviews correctly into the four labels, since the boundaries between them are often fuzzy.
Accomplishments that we're proud of
Although we are newcomers to machine learning, we still managed to build a working end-to-end project.
What we learned
We learned that clean engineering beats heavy modeling when you need something reliable fast. Streaming the UCSD JSONL files and adding simple metadata (length, punctuation, caps/digits, URL flag, time) on top of TF-IDF improved macro-F1 and made the classifier far more robust than text alone. Class imbalance was real—class_weight="balanced" plus per-class thresholds helped catch ads/rants without flooding false positives. Finally, a reproducible Colab pipeline, artifacts in Drive, and small rule seeds made iteration smooth and the demo easy to trust.
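The per-class thresholds mentioned above can be sketched as a thin post-processing step over `predict_proba`: each class gets its own cutoff, and minority classes like ads and rants get lower bars so they trigger more readily. The threshold values and the VALID fallback here are assumptions for illustration, not the tuned values from the project.

```python
# Hedged sketch of per-class decision thresholds applied to classifier
# probabilities. Threshold values are illustrative, not the tuned ones.
import numpy as np

LABELS = ["VALID", "ADVERTISEMENT", "IRRELEVANT", "RANT_NO_VISIT"]
# Lower thresholds make minority classes (ads/rants) easier to flag.
THRESHOLDS = np.array([0.50, 0.30, 0.35, 0.30])  # assumed values

def apply_thresholds(proba, labels=LABELS, thresholds=THRESHOLDS):
    """Pick the class with the largest margin over its own threshold;
    fall back to VALID when no class clears its bar."""
    margins = np.asarray(proba) - thresholds
    out = []
    for row in margins:
        if (row > 0).any():
            out.append(labels[int(np.argmax(row))])
        else:
            out.append("VALID")
    return out

# A review that is 40% likely ADVERTISEMENT gets flagged even though
# VALID (45%) has the highest raw probability.
print(apply_thresholds([[0.45, 0.40, 0.10, 0.05]]))  # → ['ADVERTISEMENT']
```

This decouples threshold tuning from model training, so the false-positive/recall trade-off can be adjusted per class without refitting.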
What's next for TechJem Review Moderation
Experiment with more models, such as transformer-based classifiers, and compare them against the current logistic-regression baseline.
Built With
- google-colab
- pandas
- python
- scikit-learn
- transformers