FreddyFazbytes

Surfacing trustworthy reviews from the chaos of Google Maps

Inspiration

We’ve all been there. You’re scrolling through restaurant reviews and suddenly hit a wall of promos, emotional rants, or completely off-topic spam. As food lovers and frequent map-searchers, we asked ourselves: can AI help restore trust in these sections?

Inspired by moderation challenges on platforms like Google Maps and TripAdvisor, we built FreddyFazbytes. It’s a review filtering system powered by large language models that separates signal from noise and elevates authentic, policy-compliant feedback.

What it does

FreddyFazbytes classifies restaurant reviews into four categories:

  • Ad: Promotional spam, brand comparisons, or self-marketing
  • Irr: Irrelevant content with no connection to the location being reviewed
  • Rant: Emotionally charged complaints without substantiated experiences
  • Val: Valuable, constructive, and relevant reviews

The system uses Gemma 3B with a few-shot prompting strategy enhanced by chain-of-thought reasoning. This allows the model to make step-by-step decisions without fine-tuning.

It outputs:

  • A review classification (Ad, Irr, Rant, Val)
  • A relevancy score between 0 and 1
  • A policy violation flag for moderation
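
To make this concrete, here is a rough sketch of what the few-shot, chain-of-thought prompt and its structured output look like. The wording, example reviews, and helper names below are our paraphrase for illustration, not the exact production prompt:

```python
# Illustrative sketch only: the label set matches the project, but the exact
# wording, few-shot examples, and field names are placeholders.
FEW_SHOT_EXAMPLES = """\
Review: "Best pizza deals in town, visit www.pizzapromo.example for coupons!"
Reasoning: The text promotes an external offer rather than describing a visit.
classification: Ad
relevancy_score: 0.1
policy_violation: true

Review: "Waited 40 minutes for cold laksa; staff were apologetic but slow."
Reasoning: A specific, first-hand account of food and service at this location.
classification: Val
relevancy_score: 0.9
policy_violation: false
"""

PROMPT_TEMPLATE = """You are a review moderator for restaurant reviews on Google Maps.
Classify the review as one of: Ad, Irr, Rant, Val.
Think step by step, then answer ONLY in YAML with the keys
classification, relevancy_score (0 to 1), and policy_violation (true/false).

{examples}

Review: "{review_text}"
Reasoning:"""

def build_prompt(review_text: str) -> str:
    """Assemble the few-shot chain-of-thought prompt for a single review."""
    return PROMPT_TEMPLATE.format(examples=FEW_SHOT_EXAMPLES, review_text=review_text)
```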

We evaluate results using confusion matrices, macro F1 scores, and precision/recall breakdowns.

How we built it

Data Collection and Preprocessing

We collected around 2,000 reviews from a Google Maps dataset on Kaggle and paired them with live reviews from Singapore scraped via Apify.
Non-English reviews were translated using the Google Translate API.
Emojis and empty content were removed.
Metadata such as length, links, and sentiment was preserved for analysis.
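
A minimal sketch of this preprocessing step, assuming the raw reviews live in a pandas DataFrame with text and language columns; translate_to_english is a stand-in for the Google Translate API call, not its real client code:

```python
import re
import pandas as pd

EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)

def translate_to_english(text: str) -> str:
    """Placeholder for the Google Translate API call used in the pipeline."""
    raise NotImplementedError

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Translate non-English reviews (a 'language' column is assumed upstream).
    non_english = df["language"] != "en"
    df.loc[non_english, "text"] = df.loc[non_english, "text"].map(translate_to_english)
    # Strip emojis and drop empty reviews.
    df["text"] = df["text"].str.replace(EMOJI_PATTERN, "", regex=True).str.strip()
    df = df[df["text"].str.len() > 0]
    # Keep metadata used later for analysis.
    df["length"] = df["text"].str.len()
    df["has_link"] = df["text"].str.contains(r"https?://|www\.", regex=True)
    return df
```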

Exploratory Data Analysis

We found that star ratings were unreliable indicators of review quality.
Vague five-star compliments were often less helpful than detailed three-star reviews.

Prompt Engineering and Evaluation

We built an evaluation pipeline around a custom script that scores outputs with macro precision, recall, and F1.
Visualizations included confusion matrices to diagnose misclassifications.
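
A condensed version of that scoring step, assuming the gold labels and model predictions are already aligned lists of the four class names:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

LABELS = ["Ad", "Irr", "Rant", "Val"]

def evaluate(y_true: list[str], y_pred: list[str]) -> None:
    """Print macro precision/recall/F1, accuracy, and the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0
    )
    print(f"Macro precision: {precision:.3f}")
    print(f"Macro recall:    {recall:.3f}")
    print(f"Macro F1:        {f1:.3f}")
    print(f"Accuracy:        {accuracy_score(y_true, y_pred):.3f}")
    # Rows are true labels, columns are predicted labels.
    print(confusion_matrix(y_true, y_pred, labels=LABELS))
```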

Baseline (Old Prompt):

  • Macro F1: 0.837
  • Accuracy: 90.5%
  • Ad had perfect recall but moderate precision
  • Irr and Rant had high recall but tended to over-predict
  • Val was the most balanced and stable class

The old prompt was effective at catching spam but occasionally mislabeled valid reviews as rants or ads.

Prompt Refinement and Improvements:

We refined the instructions and added better few-shot examples.
We highlighted borderline cases such as short-but-valid reviews and vague spam.
We prioritized specificity over emotional tone.
We enforced structured YAML outputs for clean parsing.

Improved Prompt Results:

  • Macro F1: 0.830
  • Accuracy: 91.0%
  • Ad classification achieved perfect precision and recall
  • Val maintained high precision
  • Rant recall dipped slightly due to stricter thresholds

The new prompt is more precise and cautious. While Rant recall slightly decreased, misclassifications between Rant, Ad, and Val were significantly reduced. This makes the system more reliable for public moderation.

Challenges we ran into

Prompt Length and Output Consistency

Chain-of-thought reasoning combined with metadata led to long outputs.
This caused token limit issues and formatting errors.
We solved this by trimming input fields and simplifying reasoning steps.

Class Imbalance

Most real-world reviews were labeled as Val.
We manually oversampled edge cases from the Ad, Rant, and Irr classes to balance the dataset.
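
The balancing step itself is simple; a sketch in pandas (the manual curation of edge cases is not shown, and the column names are placeholders):

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Duplicate rows of under-represented classes up to the size of the largest class."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        # Sample with replacement so small classes (Ad, Rant, Irr) reach the target count.
        parts.append(group.sample(n=target, replace=True, random_state=42))
    return pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
```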

Output Parsing Fragility

Small formatting issues like missing colons broke YAML parsing.
We built a validation and cleanup script and enforced strict output formatting.
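
A simplified version of that validation-and-cleanup idea, using PyYAML; the repair rules here are illustrative rather than the exact ones in our script:

```python
import re
import yaml

REQUIRED_KEYS = {"classification", "relevancy_score", "policy_violation"}

def parse_model_output(raw: str) -> dict | None:
    """Try to parse the model's YAML block, applying light repairs before giving up."""
    # Keep only lines that look like 'key: value' pairs; drop reasoning text.
    candidate = "\n".join(
        line for line in raw.splitlines() if re.match(r"^\s*\w+\s*:", line)
    )
    try:
        parsed = yaml.safe_load(candidate)
    except yaml.YAMLError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    return parsed
```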

Inference Latency and Cost

Batch inference was rate-limited.
We batched 10 reviews per call and prioritized 250 for evaluation and 200 for pseudo-labeling.
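
The batching itself is just chunking the review list before each model call; call_model below is a stand-in for the actual Gemma inference request:

```python
def batched(items: list, size: int = 10):
    """Yield successive chunks of `size` reviews, one chunk per model call."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

def classify_all(reviews: list[str], call_model) -> list[dict]:
    results = []
    for batch in batched(reviews, size=10):
        # One prompt per batch keeps us under the rate limit.
        results.extend(call_model(batch))
    return results
```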

Accomplishments that we're proud of

  • Built an end-to-end pipeline from data collection to inference and evaluation
  • Boosted macro F1 from 0.774 to 0.820 through prompt refinements
  • Achieved perfect spam detection with 100 percent precision and recall for the Ad class
  • Created a relevancy scoring system to go beyond rigid labels
  • Used chain-of-thought prompting to mimic a human moderator’s reasoning
  • Handled tricky edge cases including:
    • Short but specific reviews
    • Vague emotional rants
    • Non-English inputs

What we learned

  • Prompt engineering can rival fine-tuning when backed by high-quality examples
  • Chain-of-thought prompting adds nuance, especially when distinguishing rants from valid complaints
  • Precision is just as important as recall in moderation tasks
  • Confusion matrices help reveal blind spots such as ambiguity between Rant and Ad
  • Trustworthiness in reviews depends on multiple factors including length, tone, detail, and relevance

What's next for FreddyFazbytes

Trust Score Aggregator

We plan to build a composite Trust Score for each review based on:

  • Specificity and clarity
  • Depth and emotional tone
  • Comparison to known spam and valuable reviews

This will be visualized through heatmaps and category trends.
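
One way the composite might look; the feature names and weights below are purely illustrative, not a finalized design:

```python
def trust_score(specificity: float, clarity: float, depth: float,
                emotional_tone: float, spam_similarity: float) -> float:
    """Illustrative weighted combination of per-review signals, each in [0, 1].

    Higher emotional_tone and spam_similarity lower the score; the weights are placeholders.
    """
    score = (
        0.3 * specificity
        + 0.2 * clarity
        + 0.2 * depth
        + 0.15 * (1 - emotional_tone)
        + 0.15 * (1 - spam_similarity)
    )
    return max(0.0, min(1.0, score))
```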

Time-Series Moderation Insights

We want to track moderation outcomes over time.
This includes identifying spam spikes after promotions or shifts in review quality following viral events.

Feedback Loop for Moderation

We aim to let users and businesses vote on classifications.
This will help flag controversial edge cases and feed back into prompt tuning.
It also builds transparency and trust in the moderation process.
