FreddyFazbytes

Surfacing trustworthy reviews from the chaos of Google Maps

Inspiration

We’ve all been there. You’re scrolling through restaurant reviews and suddenly hit a wall of promos, emotional rants, or completely off-topic spam. As food lovers and frequent map-searchers, we asked ourselves: can AI help restore trust in these sections?

Inspired by moderation challenges on platforms like Google Maps and TripAdvisor, we built FreddyFazbytes. It’s a review filtering system powered by large language models that separates signal from noise and elevates authentic, policy-compliant feedback.

What it does

FreddyFazbytes classifies restaurant reviews into four categories:

  • Ad: Promotional spam, brand comparisons, or self-marketing
  • Irr: Irrelevant content with no connection to the location being reviewed
  • Rant: Emotionally charged complaints without substantiated experiences
  • Val: Valuable, constructive, and relevant reviews

The system uses Gemma 3B with a few-shot prompting strategy enhanced by chain-of-thought reasoning. This allows the model to make step-by-step decisions without fine-tuning.

It outputs:

  • A review classification (Ad, Irr, Rant, Val)
  • A relevancy score between 0 and 1
  • A policy violation flag for moderation
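
To make this concrete, here is a rough sketch of what the few-shot, chain-of-thought prompt and its structured output look like. The wording, example reviews, and helper names below are our paraphrase for illustration, not the exact production prompt:

```python
# Illustrative sketch only: the label set matches the project, but the exact
# wording, few-shot examples, and field names are placeholders.
FEW_SHOT_EXAMPLES = """\
Review: "Best pizza deals in town, visit www.pizzapromo.example for coupons!"
Reasoning: The text promotes an external offer rather than describing a visit.
classification: Ad
relevancy_score: 0.1
policy_violation: true

Review: "Waited 40 minutes for cold laksa; staff were apologetic but slow."
Reasoning: A specific, first-hand account of food and service at this location.
classification: Val
relevancy_score: 0.9
policy_violation: false
"""

PROMPT_TEMPLATE = """You are a review moderator for restaurant reviews on Google Maps.
Classify the review as one of: Ad, Irr, Rant, Val.
Think step by step, then answer ONLY in YAML with the keys
classification, relevancy_score (0 to 1), and policy_violation (true/false).

{examples}

Review: "{review_text}"
Reasoning:"""

def build_prompt(review_text: str) -> str:
    """Assemble the few-shot chain-of-thought prompt for a single review."""
    return PROMPT_TEMPLATE.format(examples=FEW_SHOT_EXAMPLES, review_text=review_text)
```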

We evaluate results using confusion matrices, macro F1 scores, and precision/recall breakdowns.

How we built it

Data Collection and Preprocessing

We collected around 2,000 reviews from a Google Maps dataset on Kaggle and paired them with live reviews from Singapore scraped via Apify.
Non-English reviews were translated using the Google Translate API.
Emojis and empty content were removed.
Metadata such as length, links, and sentiment was preserved for analysis.
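
A minimal sketch of this preprocessing step, assuming the raw reviews live in a pandas DataFrame with text and language columns; translate_to_english is a stand-in for the Google Translate API call, not its real client code:

```python
import re
import pandas as pd

EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)

def translate_to_english(text: str) -> str:
    """Placeholder for the Google Translate API call used in the pipeline."""
    raise NotImplementedError

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Translate non-English reviews (a 'language' column is assumed upstream).
    non_english = df["language"] != "en"
    df.loc[non_english, "text"] = df.loc[non_english, "text"].map(translate_to_english)
    # Strip emojis and drop empty reviews.
    df["text"] = df["text"].str.replace(EMOJI_PATTERN, "", regex=True).str.strip()
    df = df[df["text"].str.len() > 0]
    # Keep metadata used later for analysis.
    df["length"] = df["text"].str.len()
    df["has_link"] = df["text"].str.contains(r"https?://|www\.", regex=True)
    return df
```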

Exploratory Data Analysis

We found that star ratings were unreliable indicators of review quality.
Vague five-star compliments were often less helpful than detailed three-star reviews.

Prompt Engineering and Evaluation

We built an evaluation pipeline around a custom script that scores outputs with macro precision, recall, and F1.
Visualizations included confusion matrices to diagnose misclassifications.
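
A condensed version of that scoring step, assuming the gold labels and model predictions are already aligned lists of the four class names:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

LABELS = ["Ad", "Irr", "Rant", "Val"]

def evaluate(y_true: list[str], y_pred: list[str]) -> None:
    """Print macro precision/recall/F1, accuracy, and the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0
    )
    print(f"Macro precision: {precision:.3f}")
    print(f"Macro recall:    {recall:.3f}")
    print(f"Macro F1:        {f1:.3f}")
    print(f"Accuracy:        {accuracy_score(y_true, y_pred):.3f}")
    # Rows are true labels, columns are predicted labels.
    print(confusion_matrix(y_true, y_pred, labels=LABELS))
```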

Baseline (Old Prompt):

  • Macro F1: 0.837
  • Accuracy: 90.5%
  • Ad had perfect recall but moderate precision
  • Irr and Rant had high recall but tended to over-predict
  • Val was the most balanced and stable class

The old prompt was effective at catching spam but occasionally mislabeled valid reviews as rants or ads.

Prompt Refinement and Improvements:

We refined the instructions and added better few-shot examples.
We highlighted borderline cases such as short-but-valid reviews and vague spam.
We prioritized specificity over emotional tone.
We enforced structured YAML outputs for clean parsing.

Improved Prompt Results:

  • Macro F1: 0.830
  • Accuracy: 91.0%
  • Ad classification achieved perfect precision and recall
  • Val maintained high precision
  • Rant recall dipped slightly due to stricter thresholds

The new prompt is more precise and cautious. While Rant recall slightly decreased, misclassifications between Rant, Ad, and Val were significantly reduced. This makes the system more reliable for public moderation.

Challenges we ran into

Prompt Length and Output Consistency

Chain-of-thought reasoning combined with metadata led to long outputs.
This caused token limit issues and formatting errors.
We solved this by trimming input fields and simplifying reasoning steps.

Class Imbalance

Most real-world reviews were labeled as Val.
We manually oversampled edge cases from the Ad, Rant, and Irr classes to balance the dataset.
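
The balancing step itself is simple; a sketch in pandas (the manual curation of edge cases is not shown, and the column names are placeholders):

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Duplicate rows of under-represented classes up to the size of the largest class."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        # Sample with replacement so small classes (Ad, Rant, Irr) reach the target count.
        parts.append(group.sample(n=target, replace=True, random_state=42))
    return pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
```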

Output Parsing Fragility

Small formatting issues like missing colons broke YAML parsing.
We built a validation and cleanup script and enforced strict output formatting.
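
A simplified version of that validation-and-cleanup idea, using PyYAML; the repair rules here are illustrative rather than the exact ones in our script:

```python
import re
import yaml

REQUIRED_KEYS = {"classification", "relevancy_score", "policy_violation"}

def parse_model_output(raw: str) -> dict | None:
    """Try to parse the model's YAML block, applying light repairs before giving up."""
    # Keep only lines that look like 'key: value' pairs; drop reasoning text.
    candidate = "\n".join(
        line for line in raw.splitlines() if re.match(r"^\s*\w+\s*:", line)
    )
    try:
        parsed = yaml.safe_load(candidate)
    except yaml.YAMLError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    return parsed
```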

Inference Latency and Cost

Batch inference was rate-limited.
We batched 10 reviews per call and prioritized 250 for evaluation and 200 for pseudo-labeling.
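
The batching itself is just chunking the review list before each model call; call_model below is a stand-in for the actual Gemma inference request:

```python
def batched(items: list, size: int = 10):
    """Yield successive chunks of `size` reviews, one chunk per model call."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

def classify_all(reviews: list[str], call_model) -> list[dict]:
    results = []
    for batch in batched(reviews, size=10):
        # One prompt per batch keeps us under the rate limit.
        results.extend(call_model(batch))
    return results
```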

Accomplishments that we're proud of

  • Built an end-to-end pipeline from data collection to inference and evaluation
  • Boosted macro F1 from 0.774 to 0.820 through prompt refinements
  • Achieved perfect spam detection with 100 percent precision and recall for the Ad class
  • Created a relevancy scoring system to go beyond rigid labels
  • Used chain-of-thought prompting to mimic a human moderator’s reasoning
  • Handled tricky edge cases including:
    • Short but specific reviews
    • Vague emotional rants
    • Non-English inputs

What we learned

  • Prompt engineering can rival fine-tuning when backed by high-quality examples
  • Chain-of-thought prompting adds nuance, especially when distinguishing rants from valid complaints
  • Precision is just as important as recall in moderation tasks
  • Confusion matrices help reveal blind spots such as ambiguity between Rant and Ad
  • Trustworthiness in reviews depends on multiple factors including length, tone, detail, and relevance

What's next for FreddyFazbytes

Trust Score Aggregator

We plan to build a composite Trust Score for each review based on:

  • Specificity and clarity
  • Depth and emotional tone
  • Comparison to known spam and valuable reviews

This will be visualized through heatmaps and category trends.
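
One way the composite might look; the feature names and weights below are purely illustrative, not a finalized design:

```python
def trust_score(specificity: float, clarity: float, depth: float,
                emotional_tone: float, spam_similarity: float) -> float:
    """Illustrative weighted combination of per-review signals, each in [0, 1].

    Higher emotional_tone and spam_similarity lower the score; the weights are placeholders.
    """
    score = (
        0.3 * specificity
        + 0.2 * clarity
        + 0.2 * depth
        + 0.15 * (1 - emotional_tone)
        + 0.15 * (1 - spam_similarity)
    )
    return max(0.0, min(1.0, score))
```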

Time-Series Moderation Insights

We want to track moderation outcomes over time.
This includes identifying spam spikes after promotions or shifts in review quality following viral events.

Feedback Loop for Moderation

We aim to let users and businesses vote on classifications.
This will help flag controversial edge cases and feed back into prompt tuning.
It also builds transparency and trust in the moderation process.
