Inspiration

Ratings aren’t enough. Location platforms are flooded with noisy, off-topic, or promotional reviews that warp reputation and waste attention. Manual moderation doesn’t scale. Beyond Stars, Towards Trust asks: how do we score trust with little to no labels, and make the decision explainable enough to act on?

What it does

Cleans and normalizes reviews (PII masking, dedup, text normalization).
Generates pseudo-labels via labeling functions (LFs): promo/template cues, entity-density spikes, sentiment–rating conflicts, bursty users, emoji overload, etc.
Runs two complementary prototypes:
- RoBERTa classifier (end-to-end).
- RoBERTa embeddings + XGBoost (tabular-friendly decision boundary).
Outputs a trust score, predicted class, and which LFs fired for transparency.
Provides batch analysis (n-gram/word clouds, length vs rating, sentiment vs rating, inter-review intervals) and a demo with a threshold slider for different operating points.
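
To make the labeling-function idea concrete, here is a minimal sketch of three LFs and a helper that reports which ones fired. The cue list, the emoji ranges, the 0.3 emoji ratio, the sentiment cutoffs, and the review dictionary fields are all assumptions for illustration, not the project's actual rules.

```python
import re

# Hypothetical promo cues and emoji ranges, chosen only for illustration.
PROMO_PATTERN = re.compile(
    r"\b(discount|promo code|call now|visit our website|free delivery)\b",
    re.IGNORECASE,
)
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def lf_promo(review):
    """Fire when the text contains a promotional cue."""
    return bool(PROMO_PATTERN.search(review["text"]))


def lf_emoji_overload(review, max_ratio=0.3):
    """Fire when emoji make up more than max_ratio of the characters."""
    text = review["text"]
    if not text:
        return False
    return len(EMOJI_PATTERN.findall(text)) / len(text) > max_ratio


def lf_sentiment_rating_conflict(review):
    """Fire when sentiment and star rating point in opposite directions.

    Assumes a precomputed sentiment score in [-1, 1] from any sentiment model.
    """
    sentiment, rating = review["sentiment"], review["rating"]
    return (sentiment <= -0.5 and rating >= 4) or (sentiment >= 0.5 and rating <= 2)


LABELING_FUNCTIONS = {
    "promo": lf_promo,
    "emoji_overload": lf_emoji_overload,
    "sentiment_rating_conflict": lf_sentiment_rating_conflict,
}


def fired_lfs(review):
    """Return the names of all LFs that fired, for display next to the score."""
    return [name for name, lf in LABELING_FUNCTIONS.items() if lf(review)]
```

Returning LF names rather than a bare label is what lets the output show rule hits alongside the trust score.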

How we built it

EDA & Prep: Removed empties; replaced emails/phones/URLs with placeholders; deduped by (gmap_id, user_id, text); normalized (lowercasing, stopword removal, punctuation/digit removal, lemmatization).
Weak Supervision: Implemented an LF library and weighted voting to produce high-confidence pseudo-labels.
Models:
- Fine-tunable RoBERTa + FC + softmax.
- Embedding extractor (RoBERTa) feeding XGBoost for a strong non-neural baseline.
Evaluation & Viz: Scripts for AUC/F1/Precision@Untrusted, confusion matrices, top-triggered LFs, and slice metrics.
Demo: Optional Gradio app for single/batch inference, rule hits, and operating-point control.
Stack: VSCode, Jupyter, PyTorch, Hugging Face Transformers, XGBoost, scikit-learn, pandas, numpy, matplotlib, wordcloud, Gradio (optional demo).
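
The prep step could look roughly like the sketch below, which drops empty reviews, masks PII, and dedups on (gmap_id, user_id, text). The regular expressions and the placeholder tokens `<URL>`, `<EMAIL>`, and `<PHONE>` are assumptions; real-world phone and URL formats would likely need broader patterns.

```python
import re

# Illustrative patterns only; they will not cover every email/phone/URL format.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text):
    """Replace URLs, emails, and phone numbers with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)      # URLs first, so emails inside URLs are not split
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


def clean_reviews(reviews):
    """Drop empty reviews, mask PII, and keep the first copy of each duplicate."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = (review.get("text") or "").strip()
        if not text:
            continue  # remove empties
        text = mask_pii(text)
        key = (review["gmap_id"], review["user_id"], text)
        if key in seen:
            continue  # duplicate by (gmap_id, user_id, text)
        seen.add(key)
        cleaned.append({**review, "text": text})
    return cleaned
```

Masking before dedup means two reviews that differ only in a phone number or link collapse into one, which may or may not be desirable depending on the dataset.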

Challenges we ran into

No/low labels: Designing LFs that are broad enough to cover noise but precise enough to avoid over-flagging.
Domain noise: Neutral sentiment often confuses rating alignment; off-topic detection needs signals beyond keywords.
Balance of metrics: Tuning for high precision on the “untrusted” class trades off against overall F1, so the operating point is a pragmatic choice.
Time & compute constraints: Building an end-to-end, reproducible pipeline while keeping training iterations lean.
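
The precision-versus-F1 trade-off can be made explicit by picking an operating point from held-out scores. The sketch below assumes `scores` are predicted probabilities of the untrusted class and `labels` use 1 for untrusted; the function name and the 0.9 precision target are hypothetical.

```python
def pick_threshold(scores, labels, min_precision=0.9):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision on the untrusted class meets min_precision, or None if none does.

    Recall never decreases as the threshold drops, so the lowest qualifying
    threshold is also the one with the highest recall.
    """
    total_untrusted = sum(labels)
    if total_untrusted == 0:
        raise ValueError("need at least one untrusted example")

    best = None
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_positives = sum(flagged)
        precision = true_positives / len(flagged)
        recall = true_positives / total_untrusted
        if precision >= min_precision:
            best = (threshold, precision, recall)
    return best


# Toy held-out set: higher score = more likely untrusted.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 0, 1, 0, 0]
strict = pick_threshold(scores, labels, min_precision=0.9)
relaxed = pick_threshold(scores, labels, min_precision=0.75)
```

On this toy data the stricter target flags fewer reviews and misses one untrusted review, while the relaxed target catches all three at the cost of one false flag.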

Accomplishments that we're proud of

A label-efficient pipeline from raw text → pseudo-labels → dual modeling paths → explainable outputs.
A reusable LF library that captures common abuse patterns (promo/templates, bursts, emoji, entity density).
An interpretable demo that exposes rule hits and lets reviewers choose operating points.

What we learned

Weak supervision works: Good LFs + conservative voting can bootstrap useful training signals fast.
Interpretability drives adoption: Showing why a review is flagged (LF hits, token importances) builds trust with ops teams.
Behavioral context matters: User-level patterns (bursts) add signal that pure text often misses.
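
One way to read "conservative voting" is that a review only gets a pseudo-label when the evidence is clear-cut, and is left unlabeled otherwise. The sketch below is an assumed scheme, not the project's exact rule: the LF weights, the 0.6 threshold, and the choice to treat "no LF fired" as trusted are all hypothetical.

```python
# Hypothetical per-LF weights; higher means the rule is considered more reliable.
LF_WEIGHTS = {"promo": 3, "sentiment_rating_conflict": 2, "emoji_overload": 1}


def weighted_vote(fired, weights=LF_WEIGHTS, flag_threshold=0.6):
    """Turn LF hits into a pseudo-label.

    Returns (label, score) where label is 1 (untrusted), 0 (trusted),
    or None (abstain: the review is left out of the training set).
    """
    total = sum(weights.values())
    score = sum(weights[name] for name in fired) / total
    if score >= flag_threshold:
        return 1, score      # strong evidence of abuse
    if not fired:
        return 0, score      # no rule fired: treated as trusted in this sketch
    return None, score       # weak or mixed evidence: abstain
```

Abstaining on weak evidence shrinks the pseudo-labeled set but keeps the noisiest cases out of training, which is the trade-off the conservative setting accepts.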

What's next for Beyond Stars, Towards Trust

Human-in-the-loop: Active learning on most-uncertain samples to quickly improve LFs and models.
Calibration & fairness: Temperature scaling/ECE, plus category/language slices for consistent thresholds.
Multilingual & domain transfer: Extend beyond English and adapt to new verticals with minimal retuning.
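
For the calibration item, expected calibration error (ECE) can be computed with a few lines of plain Python. This is a minimal sketch using equal-width confidence bins; the function name and the default of 10 bins are assumptions, and `correct` is taken to hold 1 where the predicted class matched the label and 0 otherwise.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: the weighted mean gap between average
    confidence and accuracy within each non-empty bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Computing this per category or language slice, before and after temperature scaling, would show whether a single threshold means the same thing across slices.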