Inspiration
Local choices—where to eat, fix a phone, or book a clinic—often hinge on online reviews. But we kept running into junk: promo links, off-topic rants, and “never been here” complaints that distort ratings and hurt honest businesses. Manual moderation can’t keep up, and generative spam makes it worse. We built TechJem Review Moderation to restore trust: a lightweight, reproducible ML pipeline that flags ads, irrelevancies, and likely non-visitor rants so real experiences rise to the top.
What it does
Flags Google-style location reviews as VALID, ADVERTISEMENT, IRRELEVANT, or RANT_NO_VISIT to protect users and businesses from misleading content.
How we built it
UCSD Google Local “10-core” data → cleaning → rule seeds → TF-IDF + metadata (length, punctuation, caps/digits, URL flag, hour/weekday) → Logistic Regression with class-balanced training and tuned thresholds.
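The pipeline above can be sketched as a scikit-learn `FeatureUnion` that stacks TF-IDF n-grams with hand-built metadata features, feeding a class-balanced Logistic Regression. This is a minimal illustration, not the actual project code: the tiny training set, feature choices, and hyperparameters here are assumptions for demonstration.

```python
# Hypothetical sketch of the described pipeline: TF-IDF text features plus
# simple metadata features (length, punctuation, caps/digit ratios, URL flag),
# fed to a class-balanced Logistic Regression. All data below is illustrative.
import re
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

def metadata_features(texts):
    """One row per review: length, punctuation count, caps ratio,
    digit ratio, and a binary URL flag."""
    rows = []
    for t in texts:
        n = max(len(t), 1)
        rows.append([
            len(t),                                      # raw length
            sum(c in "!?.," for c in t),                 # punctuation count
            sum(c.isupper() for c in t) / n,             # caps ratio
            sum(c.isdigit() for c in t) / n,             # digit ratio
            1.0 if re.search(r"https?://", t) else 0.0,  # URL flag
        ])
    return np.array(rows)

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("meta", FunctionTransformer(metadata_features, validate=False)),
    ])),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny made-up training set; real labels come from the rule seeds.
X = [
    "Great food and friendly staff, will come back!",
    "BUY CHEAP PHONES NOW http://spam.example.com",
    "I never visited but I heard it's terrible.",
    "The weather today is nice.",
]
y = ["VALID", "ADVERTISEMENT", "RANT_NO_VISIT", "IRRELEVANT"]
pipeline.fit(X, y)
print(pipeline.predict(["Click here http://deal.example.com for discounts"]))
```

The `FeatureUnion` keeps the metadata columns alongside the sparse TF-IDF matrix, so a single classifier sees both signal sources at once.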
Challenges we ran into
We found it difficult to categorise reviews correctly into the four labels, since the boundaries between them are often fuzzy.
Accomplishments that we're proud of
Although we are newcomers to machine learning, we still managed to build a working end-to-end project.
What we learned
We learned that clean engineering beats heavy modeling when you need something reliable fast. Streaming the UCSD JSONL files and adding simple metadata (length, punctuation, caps/digits, URL flag, time) on top of TF-IDF improved macro-F1 and made the classifier far more robust than text alone. Class imbalance was real—class_weight="balanced" plus per-class thresholds helped catch ads/rants without flooding false positives. Finally, a reproducible Colab pipeline, artifacts in Drive, and small rule seeds made iteration smooth and the demo easy to trust.
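The per-class thresholds mentioned above can be sketched as a thin post-processing step over `predict_proba`: each class gets its own cutoff, and minority classes like ads and rants get lower bars so they trigger more readily. The threshold values and the VALID fallback here are assumptions for illustration, not the tuned values from the project.

```python
# Hedged sketch of per-class decision thresholds applied to classifier
# probabilities. Threshold values are illustrative, not the tuned ones.
import numpy as np

LABELS = ["VALID", "ADVERTISEMENT", "IRRELEVANT", "RANT_NO_VISIT"]
# Lower thresholds make minority classes (ads/rants) easier to flag.
THRESHOLDS = np.array([0.50, 0.30, 0.35, 0.30])  # assumed values

def apply_thresholds(proba, labels=LABELS, thresholds=THRESHOLDS):
    """Pick the class with the largest margin over its own threshold;
    fall back to VALID when no class clears its bar."""
    margins = np.asarray(proba) - thresholds
    out = []
    for row in margins:
        if (row > 0).any():
            out.append(labels[int(np.argmax(row))])
        else:
            out.append("VALID")
    return out

# A review that is 40% likely ADVERTISEMENT gets flagged even though
# VALID (45%) has the highest raw probability.
print(apply_thresholds([[0.45, 0.40, 0.10, 0.05]]))  # → ['ADVERTISEMENT']
```

This decouples threshold tuning from model training, so the false-positive/recall trade-off can be adjusted per class without refitting.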
What's next for TechJem Review Moderation
Experiment with more models, such as transformer-based classifiers, and compare them against the current logistic-regression baseline.
Built With
- google-colab
- pandas
- python
- scikit-learn
- transformers