FreddyFazbytes
Surfacing trustworthy reviews from the chaos of Google Maps
Inspiration
We’ve all been there. You’re scrolling through restaurant reviews and suddenly hit a wall of promos, emotional rants, or completely off-topic spam. As food lovers and frequent map-searchers, we asked ourselves: can AI help restore trust in these sections?
Inspired by moderation challenges on platforms like Google Maps and TripAdvisor, we built FreddyFazbytes. It’s a review filtering system powered by large language models that separates signal from noise and elevates authentic, policy-compliant feedback.
What it does
FreddyFazbytes classifies restaurant reviews into four categories:
- Ad: Promotional spam, brand comparisons, or self-marketing
- Irr: Irrelevant content with no link to the location
- Rant: Emotionally charged complaints without substantiated experiences
- Val: Valuable, constructive, and relevant reviews
The system uses Gemma 3B with a few-shot prompting strategy enhanced by chain-of-thought reasoning. This allows the model to make step-by-step decisions without fine-tuning.
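The actual prompt is not reproduced here, but a minimal sketch of a few-shot, chain-of-thought classification prompt might look like the following. The category names (Ad, Irr, Rant, Val) come from our label set; the example reviews, reasoning text, and `build_prompt` helper are illustrative, not our real prompt.

```python
# Hypothetical sketch of a few-shot, chain-of-thought classification prompt.
# The category names come from our label set; the wording and examples here
# are illustrative only.
FEW_SHOT_EXAMPLES = """\
Review: "Best pizza deals in town, visit www.example-promo.com!"
Reasoning: The review advertises an external site rather than describing a visit.
Category: Ad

Review: "Waited 20 minutes, but the laksa was rich and the staff apologized."
Reasoning: A concrete, first-hand account of food and service.
Category: Val
"""

def build_prompt(review_text: str) -> str:
    """Assemble a few-shot prompt that asks the model to reason step by step."""
    return (
        "Classify the restaurant review as Ad, Irr, Rant, or Val.\n"
        "Think step by step before giving the category.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f'Review: "{review_text}"\n'
        "Reasoning:"
    )
```

Ending the prompt at "Reasoning:" nudges the model to produce its chain of thought before committing to a category.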
It outputs:
- A review classification (Ad, Irr, Rant, Val)
- A relevancy score between 0 and 1
- A policy violation flag for moderation
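A response in this shape might look like the sample below; the field names and reason text are illustrative, not the exact schema we enforce.

```yaml
# Illustrative output shape only; actual field names may differ.
classification: Val
relevancy_score: 0.87
policy_violation: false
reason: "Specific details about food and service; no promotional content."
```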
We evaluate results using confusion matrices, macro F1 scores, and precision/recall breakdowns.
How we built it
Data Collection and Preprocessing
We collected around 2,000 reviews from a Google Maps dataset on Kaggle and paired them with live reviews from Singapore scraped with Apify.
Non-English reviews were translated using the Google Translate API.
Emojis and empty content were removed.
Metadata such as length, links, and sentiment was preserved for analysis.
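A minimal pure-Python sketch of the cleaning steps above (emoji stripping, empty-review filtering, metadata extraction). Translation via the Google Translate API and sentiment scoring are omitted, and the emoji ranges are deliberately rough.

```python
import re

# Rough emoji/pictograph ranges; a production cleaner would use a fuller list.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def preprocess(reviews):
    """Strip emojis, drop empty reviews, and record simple metadata."""
    cleaned = []
    for text in reviews:
        text = EMOJI_RE.sub("", text).strip()
        if not text:
            continue  # drop reviews that were empty or emoji-only
        cleaned.append({
            "text": text,
            "length": len(text),
            "has_link": bool(URL_RE.search(text)),
        })
    return cleaned
```

In the real pipeline this ran over pandas DataFrames, but the per-review logic is the same.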
Exploratory Data Analysis
We found that star ratings were unreliable indicators of review quality.
Vague five-star compliments were often less helpful than detailed three-star reviews.
Prompt Engineering and Evaluation
We built an evaluation pipeline using a custom script to test outputs with macro precision, recall, and F1.
Visualizations included confusion matrices to diagnose misclassifications.
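The evaluation script itself is not reproduced here; a condensed sketch of the same metrics using scikit-learn (which we used) might look like this. The `evaluate` helper name is ours for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

LABELS = ["Ad", "Irr", "Rant", "Val"]

def evaluate(y_true, y_pred):
    """Macro precision/recall/F1 plus a confusion matrix over the four classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0
    )
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    return {"precision": precision, "recall": recall, "f1": f1, "confusion": cm}
```

Passing `labels=LABELS` keeps the confusion matrix rows and columns in a fixed order, which makes misclassification patterns (such as Rant predicted as Val) easy to read off.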
Baseline (Old Prompt):
- Macro F1: 0.837
- Accuracy: 90.5%
- Ad had perfect recall but moderate precision
- Irr and Rant had high recall, but the model tended to over-predict them
- Val was the most balanced and stable class
The old prompt was effective at catching spam but occasionally mislabeled valid reviews as rants or ads.
Prompt Refinement and Improvements:
We refined the instructions and added better few-shot examples.
We highlighted borderline cases such as short-but-valid reviews and vague spam.
We prioritized specificity over emotional tone.
We enforced structured YAML outputs for clean parsing.
Improved Prompt Results:
- Macro F1: 0.830
- Accuracy: 91.0%
- Ad classification achieved perfect precision and recall
- Val maintained high precision
- Rant recall dipped slightly due to stricter thresholds
The new prompt is more precise and cautious. While Rant recall slightly decreased, misclassifications between Rant, Ad, and Val were significantly reduced. This makes the system more reliable for public moderation.
Challenges we ran into
Prompt Length and Output Consistency
Chain-of-thought reasoning combined with metadata led to long outputs.
This caused token limit issues and formatting errors.
We solved this by trimming input fields and simplifying reasoning steps.
Class Imbalance
Most real-world reviews were labeled as Val.
We manually oversampled edge cases such as Ads, Rants, and Irrelevant reviews to balance the dataset.
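Our curation was manual, but the balancing idea can be sketched as random duplication of minority-class examples up to the majority count. The `oversample` helper below is illustrative only.

```python
import random
from collections import defaultdict

def oversample(examples, label_key="label", seed=42):
    """Duplicate minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    target = max(len(v) for v in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to top the class up to the target count.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced
```

Fixing the seed keeps the balanced dataset reproducible across runs.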
Output Parsing Fragility
Small formatting issues like missing colons broke YAML parsing.
We built a validation and cleanup script and enforced strict output formatting.
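A simplified version of that cleanup pass, assuming the three output fields described earlier (field names illustrative); our real script handled more failure modes than a dropped colon.

```python
import re

REQUIRED_FIELDS = ("classification", "relevancy_score", "policy_violation")

def repair_and_parse(raw: str):
    """Best-effort parse of 'key: value' lines, tolerating missing colons."""
    record = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" not in line:
            # Repair lines like "classification Val" where the colon was dropped.
            line = re.sub(r"^(\w+)\s+", r"\1: ", line, count=1)
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"output missing fields: {missing}")
    return record
```

Rejecting outputs with missing fields (rather than guessing defaults) let us re-query the model instead of silently logging bad labels.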
Inference Latency and Cost
Batch inference was rate-limited.
We batched 10 reviews per call and prioritized 250 for evaluation and 200 for pseudo-labeling.
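The batching itself is straightforward; a minimal sketch with the batch size of 10 from the text:

```python
def batched(items, size=10):
    """Yield successive fixed-size batches so each API call carries 10 reviews."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```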
Accomplishments that we're proud of
- Built an end-to-end pipeline from data collection to inference and evaluation
- Boosted macro F1 from 0.774 to 0.820 through prompt refinements
- Achieved perfect spam detection with 100 percent precision and recall for the Ad class
- Created a relevancy scoring system to go beyond rigid labels
- Used chain-of-thought prompting to mimic a human moderator’s reasoning
- Handled tricky edge cases including:
- Short but specific reviews
- Vague emotional rants
- Non-English inputs
What we learned
- Prompt engineering can rival fine-tuning when backed by high-quality examples
- Chain-of-thought prompting adds nuance, especially when distinguishing rants from valid complaints
- Precision is just as important as recall in moderation tasks
- Confusion matrices help reveal blind spots such as ambiguity between Rant and Ad
- Trustworthiness in reviews depends on multiple factors including length, tone, detail, and relevance
What's next for FreddyFazbytes
Trust Score Aggregator
We plan to build a composite Trust Score for each review based on:
- Specificity and clarity
- Depth and emotional tone
- Comparison to known spam and valuable reviews
This will be visualized through heatmaps and category trends.
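As a starting point, the composite could be a weighted aggregate of per-review signals. The signal names and weights below are hypothetical placeholders; the planned Trust Score would tune or learn them rather than hard-code them.

```python
def trust_score(signals, weights=None):
    """Weighted aggregate of per-review signals, each already scaled to [0, 1].

    Signal names and weights are hypothetical; the planned Trust Score would
    tune or learn these rather than hard-code them.
    """
    weights = weights or {
        "specificity": 0.4,
        "depth": 0.3,
        "tone_neutrality": 0.1,
        "spam_distance": 0.2,
    }
    total = sum(weights.values())
    return sum(weights[k] * signals.get(k, 0.0) for k in weights) / total
```

Normalizing by the weight sum keeps the score in [0, 1] even if the weights are later re-tuned.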
Time-Series Moderation Insights
We want to track moderation outcomes over time.
This includes identifying spam spikes after promotions or shifts in review quality following viral events.
Feedback Loop for Moderation
We aim to let users and businesses vote on classifications.
This will help flag controversial edge cases and feed back into prompt tuning.
It also builds transparency and trust in the moderation process.
Built With
- apify
- devpost
- github
- google-translate
- hugging-face
- kaggle
- markdown
- matplotlib
- pandas
- powershell
- python
- scikit-learn
- vs-code