Pour-Over

Inspiration

We were inspired by the rising amount of spam, irrelevant content, and vague reviews on Google Reviews. Our goal was to build an intelligent review classification system to help platforms better moderate and extract useful feedback.

What it does

Our project tackles the challenge of assessing the quality and relevance of location-based reviews by building a robust pipeline that categorises reviews into predefined policy types. These include:

Advertisements
Spam
Irrelevant Comments
Reviews from Non-Visitors
Valid Reviews

The objective is to support platforms in identifying misleading or low-quality content using modern LLM-based techniques.

How we built it

Data Collection & Cleaning

Scraped 50,000+ Google reviews using Apify. The dataset was limited to Singapore and balanced across 5 business categories (Places of Attraction, Restaurants/Cafes/Bars, Retail, Services, and Hotel/Lodging) and across establishments in the heartlands vs the CBD. This gives us a diverse dataset and reduces the risk of bias when validating the models.
Normalized emojis, quotation marks, and Unicode issues.
Filtered out reviews with no textual/contextual content.
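
The cleaning steps above could look roughly like the sketch below. It is illustrative only: the function names, the quote mapping, and the rule that a review needs at least one letter or digit to count as having textual content are assumptions, not the exact rules we used.

```python
import re
import unicodedata

# Assumed mapping of typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

# Invisible characters that often survive scraping: the emoji variation
# selector (U+FE0F) and the zero-width space (U+200B).
INVISIBLE = re.compile("[\ufe0f\u200b]")


def normalize_review(text: str) -> str:
    """Normalize Unicode forms, straighten quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width letters -> ASCII
    text = text.translate(QUOTE_MAP)
    text = INVISIBLE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def has_text_content(text: str) -> bool:
    """Assumed filter rule: keep a review only if it has a letter or digit."""
    return any(ch.isalnum() for ch in text)


def clean_reviews(reviews: list[str]) -> list[str]:
    """Normalize every review and drop those with no textual content."""
    normalized = (normalize_review(r) for r in reviews)
    return [r for r in normalized if has_text_content(r)]
```

With this rule, emoji-only or whitespace-only reviews are dropped, while short reviews in any script (for example Chinese) are kept.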

Labeling & Classification

Used OpenAI GPT-4o to pseudo-label reviews based on strict policy rules and prompt engineering.
Labels included: Advertisement, Spam, Irrelevant, No Visit, Valid.
Handled edge cases (e.g., sarcasm, multilingual reviews, short replies) with GPT's semantic reasoning.
Curated ground truth labels by having multiple people manually verify the GPT-4o-labeled data (double labeling) to ensure a consistent understanding of the policy rules.
Balanced label categories across the ground truth data by oversampling rare categories, so that the system is stress-tested on them.
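
As a minimal sketch of the last two steps, the snippet below flags reviews where two annotators disagree and balances a labeled set by random oversampling. The row format (dicts with a `label` key), the function names, and the choice of sampling with replacement up to the size of the largest category are assumptions made for illustration.

```python
import random
from collections import defaultdict


def disagreements(labels_a: list[str], labels_b: list[str]) -> list[int]:
    """Return the indices where two annotators assigned different labels."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


def oversample_to_balance(rows: list[dict], label_key: str = "label", seed: int = 0) -> list[dict]:
    """Return a shuffled copy in which every label is as frequent as the largest one."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Top up under-represented labels by sampling with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Indices returned by `disagreements` would be the reviews to discuss and relabel before oversampling.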

Modeling & Evaluation

Created a basic Hugging Face pipeline using:
- Gemma 3 12B
- Qwen 3 8B
Used the GPT-4o-labeled data, after manual verification, as our ground truth.
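
A classification step of this kind needs a prompt that lists the policy labels and a parser that maps the model's free-form reply back to one of them. The sketch below shows one way to do that; the prompt wording and the function names are hypothetical, and the model call itself is left out (the prompt would be passed to whichever text-generation model is being evaluated).

```python
import re
from typing import Optional

# The five policy labels used throughout the project.
LABELS = ["Advertisement", "Spam", "Irrelevant", "No Visit", "Valid"]


def build_prompt(review_text: str) -> str:
    """Build a single-review classification prompt (illustrative wording)."""
    options = ", ".join(LABELS)
    return (
        "Classify the following location review into exactly one category.\n"
        f"Categories: {options}\n"
        f"Review: {review_text}\n"
        "Answer with the category name only."
    )


def parse_label(model_output: str) -> Optional[str]:
    """Return the label mentioned first in the output, or None if none appears."""
    text = model_output.lower()
    best_label, best_pos = None, len(text) + 1
    for label in LABELS:
        # Word boundaries stop "invalid" from matching the "Valid" label.
        match = re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
        if match and match.start() < best_pos:
            best_label, best_pos = label, match.start()
    return best_label
```

Returning `None` for unparseable replies makes it possible to count and inspect them separately instead of silently assigning a default label.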
Evaluated model accuracy using standard metrics like:
- Precision: \( \frac{TP}{TP + FP} \)
- Recall: \( \frac{TP}{TP + FN} \)
- F1 Score: \( 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \)

Challenges we ran into

Long runtimes: Using large language models (LLMs) like GPT-4o and Hugging Face models led to high latency, especially when processing large batches of reviews. Optimising for speed while maintaining classification accuracy was a key concern.
Insufficient data: Despite scraping more than 50,000 reviews with Apify across 5 business categories, only a small number of reviews were labeled "Advertisement", "Spam", "Irrelevant", or "No Visit". This left us with limited data for validating the models on those categories.
Multilingual content: Many reviews contained multiple languages or mixed grammar styles, requiring translation and context-aware handling to ensure consistent labelling.
Edge cases in labelling: Some reviews featured sarcasm, ambiguous phrasing, or lacked sufficient context, making it difficult to classify them using rules or simple heuristics.
Model inconsistencies: Outputs from open-source models (e.g., Gemma, Qwen) were not always consistent across categories, requiring iterative prompt refinement and manual validation.

Accomplishments that we're proud of

Built an end-to-end LLM-driven review classification pipeline
Created policy-aligned labels that go beyond keyword detection
Enabled multilingual understanding and sarcasm detection using GPT-4o

What we learned

Prompt engineering is crucial as minor phrasing changes can lead to significant output shifts.
GPT-4o excels at understanding context, especially for edge cases like sarcasm or passive-aggressive language.
Open-source models offer flexibility but still trail behind GPT-4o in nuanced semantic tasks.