Inspiration
We were inspired by the rising amount of spam, irrelevant content, and vague reviews on Google Reviews. Our goal was to build an intelligent review classification system to help platforms better moderate and extract useful feedback.
What it does
Our project tackles the challenge of assessing the quality and relevance of location-based reviews by building a robust pipeline that categorises reviews into predefined policy types. These include:
- Advertisements
- Spam
- Irrelevant Comments
- Reviews from Non-Visitors
- Valid Reviews
The objective is to support platforms in identifying misleading or low-quality content using modern LLM-based techniques.
How we built it
Data Collection & Cleaning
- Scraped 50,000+ Google reviews using Apify. The dataset was limited to Singapore and balanced across five business categories (Places of Attraction, Restaurants/Cafes/Bars, Retail, Services, and Hotel/Lodging) and across establishments in the heartlands and the CBD. This diversity reduces the risk of bias when validating the models.
- Normalized emojis, quotation marks, and Unicode issues.
- Filtered out reviews with no textual/contextual content.
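The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the quote mapping, and the reading of "normalizing emojis" as stripping invisible variation selectors are all assumptions made for the example.

```python
import re
import unicodedata

# Typographic quotes mapped to ASCII equivalents (illustrative subset, assumed).
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

# Emoji variation selector and zero-width space, which are invisible but make
# otherwise identical strings compare as different.
INVISIBLE = re.compile("[\ufe0f\u200b]")


def normalize_review(text: str) -> str:
    """Return a cleaned copy of a raw review string."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    text = text.translate(QUOTE_MAP)            # straighten curly quotes
    text = INVISIBLE.sub("", text)              # drop invisible code points
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def has_text_content(text: str) -> bool:
    """True if the review has at least one letter or digit in any script."""
    return any(ch.isalnum() for ch in text)


def clean_reviews(raw_reviews):
    """Normalize reviews and drop those with no textual content."""
    cleaned = (normalize_review(r) for r in raw_reviews if r)
    return [r for r in cleaned if has_text_content(r)]
```

Under these assumptions, an emoji-only or whitespace-only review is dropped, while a review in any script that contains letters or digits is kept.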
Labeling & Classification
- Used OpenAI GPT-4o to pseudo-label reviews based on strict policy rules and prompt engineering.
- Labels included: Advertisement, Spam, Irrelevant, No Visit, Valid.
- Handled edge cases (e.g., sarcasm, multilingual reviews, short replies) with GPT's semantic reasoning.
- Curated ground-truth labels by having multiple people manually verify the GPT-4o-labeled data (double labeling), so that the policy rules were understood and applied consistently.
- Balanced label categories across the ground-truth data by oversampling rare cases, so that the system is stress-tested on them.
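Two parts of the labeling workflow above lend themselves to a short sketch: mapping a free-text model reply onto the fixed label set, and checking how well two human annotators agree during double labeling. The helper names below are hypothetical, and Cohen's kappa is used here only as one common agreement measure; the write-up does not state which measure, if any, was computed.

```python
from collections import Counter

LABELS = ("Advertisement", "Spam", "Irrelevant", "No Visit", "Valid")
_CANONICAL = {label.lower(): label for label in LABELS}


def parse_label(raw_output: str):
    """Map a raw model reply to one of LABELS, or None if it does not match."""
    # Trim whitespace, surrounding quotes and a trailing period, then
    # compare case-insensitively with single spaces between words.
    key = " ".join(raw_output.strip().strip(".\"'").lower().split())
    return _CANONICAL.get(key)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Returning `None` for anything outside the label set keeps malformed replies visible, so they can be re-prompted or routed to manual review instead of being silently coerced into a category.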
Modeling & Evaluation
- Created a basic Hugging Face pipeline using:
  - Gemma 3 12B
  - Qwen 3 8B
- Used the manually verified GPT-4o labels as our ground-truth data.
- Evaluated model performance using standard metrics:
  - Precision: \( \frac{TP}{TP + FP} \)
  - Recall: \( \frac{TP}{TP + FN} \)
  - F1 Score: \( 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \)
Challenges we ran into
- Long runtimes: Using large language models (LLMs) like GPT-4o and Hugging Face models led to high latency, especially when processing large batches of reviews. Optimising for speed while maintaining classification accuracy was a key concern.
- Insufficient data: Despite scraping more than 50,000 reviews via Apify across five business categories, only a small number were labeled "Spam", "No Visit", "Irrelevant", or "Advertisement". This left us with little data for validating the models on those categories.
- Multilingual content: Many reviews contained multiple languages or mixed grammar styles, requiring translation and context-aware handling to ensure consistent labelling.
- Edge cases in labelling: Some reviews featured sarcasm, ambiguous phrasing, or lacked sufficient context, making it difficult to classify them using rules or simple heuristics.
- Model inconsistencies: Outputs from open-source models (e.g., Gemma, Qwen) were not always consistent across categories, requiring iterative prompt refinement and manual validation.
Accomplishments that we're proud of
- Built an end-to-end LLM-driven review classification pipeline
- Created policy-aligned labels that go beyond keyword detection
- Enabled multilingual understanding and sarcasm detection using GPT-4o
What we learned
- Prompt engineering is crucial, as minor phrasing changes can lead to significant shifts in output.
- GPT-4o excels at understanding context, especially for edge cases like sarcasm or passive-aggressive language.
- Open-source models offer flexibility but still trail behind GPT-4o in nuanced semantic tasks.
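A comparison like the one between the open-source models and GPT-4o can be made concrete with the per-class precision, recall, and F1 scores listed under Modeling & Evaluation. The sketch below computes them one-vs-rest for each label and averages F1 across labels (macro-F1). The function names and the choice of macro averaging are assumptions for illustration, not a description of the project's evaluation script.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Fall back to 0.0 when a denominator is zero (label never
        # predicted or never present in the ground truth).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


def macro_f1(metrics):
    """Unweighted mean of per-class F1, so rare labels count equally."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
```

Macro averaging weights rare categories such as "Spam" or "No Visit" the same as "Valid", which matters when most scraped reviews fall into a single class.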
What's next for Pour-Over
- Deploying as an API service for real-time review filtering
- Expanding to other platforms (e.g., TripAdvisor, Booking.com)
- Creating a feedback loop using human-in-the-loop validation
- Exploring fine-tuning open-source models on custom data