📍 Google Location Review Detector
📖 Introduction
Maintaining the quality and trustworthiness of location-based reviews is crucial for both businesses and consumers. Fake or misleading reviews can distort perceptions, harm businesses, and erode consumer confidence.
The Google Location Review Detector is an automated system designed to efficiently identify and flag suspicious reviews that violate common platform policies, such as:
- 🚫 Spam or advertisements
- ❌ Irrelevant or off-topic content
- 😡 Rants without evidence of a real visit
⚙️ What It Does
Our solution takes a CSV file of Google location reviews as input and processes each review through a hybrid ML pipeline:
Preprocessing & Feature Engineering
- Clean review text
- Extract numerical features (review length, sentiment polarity, subjectivity, suspicion score, etc.)
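The preprocessing step above can be sketched roughly as follows. The function names and the keyword list for the suspicion heuristic are illustrative, not the project's actual ones, and the real pipeline also computes sentiment polarity and subjectivity (e.g., via a sentiment library), which is omitted here to keep the sketch self-contained:

```python
import re

# Hypothetical spam/ad keywords for the rule-based suspicion score.
SUSPICIOUS_TERMS = {"discount", "promo", "visit our website", "call now"}

def clean_text(text: str) -> str:
    """Lowercase, strip URLs, and collapse whitespace (illustrative cleaning)."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def extract_features(raw_review: str) -> dict:
    """Return a simplified subset of the numerical features for one review."""
    text = clean_text(raw_review)
    words = text.split()
    suspicion = sum(term in text for term in SUSPICIOUS_TERMS)
    return {
        "clean_text": text,
        "review_length": len(words),
        "char_length": len(text),
        "suspicion_score": suspicion,
    }

feats = extract_features("AMAZING!! Visit our website http://spam.example for a discount")
```

A review hitting two suspicious terms, as above, would get `suspicion_score = 2`; the real scoring rules are not specified in this write-up.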
Classification Model
- A custom-trained DistilBERT model classifies reviews into:
  - ✅ NONE – No violation
  - 🚫 SPAM – Spam/Advertisement
  - 😡 RANT_WITHOUT_VISIT – Rant without proof of visit
  - ❌ IRRELEVANT_CONTENT – Off-topic/irrelevant content
Output
- Two separate CSV files:
  - Clean reviews
  - Flagged violations
- A summary report detailing the findings
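A minimal sketch of this classify-and-split step with pandas, assuming a `review_text` column (the real column names aren't specified) and stubbing the DistilBERT classifier with a trivial keyword rule:

```python
import pandas as pd

VIOLATIONS = {"SPAM", "RANT_WITHOUT_VISIT", "IRRELEVANT_CONTENT"}

def classify(text: str) -> str:
    """Stand-in for the trained DistilBERT classifier; returns one of the four labels."""
    return "SPAM" if "promo" in text.lower() else "NONE"

def split_reviews(df: pd.DataFrame, text_col: str = "review_text"):
    """Label each review and split the frame into clean vs. flagged reviews."""
    df = df.copy()
    df["label"] = df[text_col].apply(classify)
    clean = df[df["label"] == "NONE"]
    flagged = df[df["label"].isin(VIOLATIONS)]
    return clean, flagged

df = pd.DataFrame({"review_text": ["Great coffee and friendly staff!",
                                   "Huge promo, click here!"]})
clean, flagged = split_reviews(df)
clean.to_csv("clean_reviews.csv", index=False)
flagged.to_csv("flagged_reviews.csv", index=False)
```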
🛠️ How I Built It
I focused on a machine learning approach using a custom DistilBERT-based model. The workflow included:
- Data Preparation – Cleaning review text and extracting numerical features.
- Feature Engineering – Metrics such as sentiment polarity, subjectivity, review length, and a rule-based suspicion score.
- LLM as a Judge – Used GPT-OSS-20B via Groq Cloud to generate pseudo-labels, since the original dataset had no labels.
- Model Development – Training DistilBERT for multi-class classification with both text and numerical features.
- Deployment – Building a Gradio-powered web interface to let users upload CSV files and instantly receive analysis results.
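The "both text and numerical features" fusion could look like the sketch below. The exact architecture isn't described, so this is one plausible design: a small head over the concatenation of a 768-dim text embedding (which in the real pipeline would come from DistilBERT's pooled output via `transformers.DistilBertModel`) and the numeric features. A placeholder tensor stands in for the encoder so the sketch runs without downloading pretrained weights:

```python
import torch
import torch.nn as nn

class HybridReviewClassifier(nn.Module):
    """Fuses a text embedding with numerical features for 4-way classification.

    The 768-dim text vector is assumed to be DistilBERT's pooled output;
    here it is passed in directly so the sketch stays self-contained.
    """

    def __init__(self, text_dim: int = 768, num_feats: int = 4, n_classes: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + num_feats, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, n_classes),
        )

    def forward(self, text_emb: torch.Tensor, num_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate text embedding and numeric features, then score 4 classes.
        return self.head(torch.cat([text_emb, num_feats], dim=-1))

model = HybridReviewClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 4))  # a batch of 8 reviews
```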
🚧 Challenges I Ran Into
- 🔍 Feature Selection – Choosing the right combination of text + numerical features for accurate classification.
- 🎯 Model Training – Balancing generalization vs. overfitting with limited review violation data.
- ⚖️ Class Imbalance – Most Google reviews are clean, so violations were underrepresented. Mitigated with class weights.
- 💻 Compute Limits – Google Colab GPUs often crashed during long training runs.
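The class-weight mitigation can be sketched with scikit-learn's `compute_class_weight`; the label distribution below is made up for illustration (class 0 = NONE dominating, as the write-up describes):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced pseudo-labels: mostly clean reviews (class 0).
y = np.array([0] * 90 + [1] * 5 + [2] * 3 + [3] * 2)
classes = np.array([0, 1, 2, 3])

# 'balanced' weighting: n_samples / (n_classes * count_per_class),
# so rare violation classes get proportionally larger weights.
weights = compute_class_weight("balanced", classes=classes, y=y)
```

These weights can then be handed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))`, so misclassifying a rare violation costs more than misclassifying a clean review.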
🏆 Accomplishments
- ✅ Developed and deployed a full end-to-end review violation detection system.
- 🧠 Created a custom ML model tailored specifically for review policy violations.
- 🌐 Built a user-friendly interface with Gradio for easy adoption.
📚 What I Learned
- Combining textual and numerical metadata significantly improves classification robustness.
- Practical experience in fine-tuning and deploying DistilBERT for a domain-specific NLP task.
- Building interactive ML tools with Gradio for real-world usability.
🚀 What's Next
- Model Enhancement – Explore ensembles or larger LLMs (e.g., Qwen 3, Gemma 3) for deeper review understanding.
- Granular Violation Details – Provide explanations, reasons, and confidence scores for flagged reviews.
- Real-Time Analysis – Adapt the system to flag reviews as they are submitted.
🧰 Libraries & Frameworks Used
- Core Language: Python
- ML Framework: PyTorch
- NLP / Data Processing: Hugging Face Transformers, pandas, NumPy, scikit-learn
- Deployment/UI: Gradio
📦 Assets & Datasets
- Dataset: UCSD Google Local Reviews Dataset
- Model: Custom-trained DistilBERT classifier for review violation detection