🌍 Google Location Review Detector

📝 Introduction

Maintaining the quality and trustworthiness of location-based reviews is crucial for both businesses and consumers. Fake or misleading reviews can distort perceptions, harm businesses, and erode consumer confidence.

The Google Location Review Detector is an automated system designed to efficiently identify and flag suspicious reviews that violate common platform policies, such as:

  • 🚫 Spam or advertisements
  • ❌ Irrelevant or off-topic content
  • 😑 Rants without evidence of a real visit

⚙️ What It Does

The system takes a CSV file of Google location reviews as input and runs each review through a hybrid ML pipeline:

  1. Preprocessing & Feature Engineering

    • Clean review text
    • Extract numerical features (review length, sentiment polarity, subjectivity, suspicion score, etc.)
  2. Classification Model

    • A custom-trained DistilBERT model classifies reviews into:
      • ✅ NONE – No violation
      • 🚫 SPAM – Spam/Advertisement
      • 😑 RANT_WITHOUT_VISIT – Rant without proof of visit
      • ❌ IRRELEVANT_CONTENT – Off-topic/irrelevant content
  3. Output

    • Two separate CSV files:
      • Clean reviews
      • Flagged violations
    • A summary report detailing the findings
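
The preprocessing and output steps above could look roughly like the following. This is a minimal standard-library sketch, not the project's actual code: the `text` column name, the helper names, and the suspicion rules (links, phone numbers, promotional keywords) are all assumptions made for illustration. Sentiment polarity and subjectivity are left out because they need a sentiment library, and `predict_label` is a stand-in for the DistilBERT classifier.

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical promotional keywords; the project's real rule set is not documented.
PROMO_WORDS = ("discount", "promo code", "buy now", "click here", "free shipping")


def clean_text(text):
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_features(text):
    """Return simple numerical features for one review."""
    cleaned = clean_text(text)
    lowered = cleaned.lower()
    suspicion = 0
    suspicion += 1 if URL_RE.search(cleaned) else 0
    suspicion += 1 if PHONE_RE.search(cleaned) else 0
    suspicion += sum(1 for word in PROMO_WORDS if word in lowered)
    return {
        "review_length": len(cleaned),
        "word_count": len(cleaned.split()),
        "suspicion_score": suspicion,
    }


def split_reviews(rows, predict_label):
    """Split review rows into (clean, flagged) lists using predict_label(text)."""
    clean, flagged = [], []
    for row in rows:
        label = predict_label(row["text"])
        out = {**row, **extract_features(row["text"]), "label": label}
        (clean if label == "NONE" else flagged).append(out)
    return clean, flagged


def to_csv(rows):
    """Serialize rows to CSV text (empty string if there are no rows)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `split_reviews(rows, predict_label)` and passing each result to `to_csv` gives the two outputs described above: one for clean reviews and one for flagged violations.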

🛠️ How I Built It

I focused on a machine learning approach using a custom DistilBERT-based model. The workflow included:

  • Data Preparation – Cleaning review text and extracting numerical features.
  • Feature Engineering – Metrics such as sentiment polarity, subjectivity, review length, and a rule-based suspicion score.
  • LLM as a Judge – Used GPT-OSS-20B via Groq Cloud to generate pseudo-labels, because the original data had no labels.
  • Model Development – Training DistilBERT for multi-class classification with both text and numerical features.
  • Deployment – Building a Gradio-powered web interface to let users upload CSV files and instantly receive analysis results.
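
For the LLM-as-a-judge step, the prompt and the response handling might be structured as below. This is a sketch under assumptions: the prompt wording and the helper names `build_judge_prompt` and `parse_judge_label` are hypothetical, and the actual call to GPT-OSS-20B on Groq Cloud is deliberately omitted, so `raw_response` is simply whatever text the model returns.

```python
import re

# The four classes named in this write-up.
LABELS = ("NONE", "SPAM", "RANT_WITHOUT_VISIT", "IRRELEVANT_CONTENT")


def build_judge_prompt(review_text):
    """Build a single-label classification prompt (wording is illustrative)."""
    options = ", ".join(LABELS)
    return (
        "You are moderating reviews of a physical location.\n"
        f"Classify the review into exactly one of: {options}.\n"
        "Answer with the label only.\n\n"
        f"Review: {review_text.strip()}"
    )


def parse_judge_label(raw_response):
    """Map a free-form model reply to a known label, or None if unclear."""
    # Upper-case and turn every non-letter run into "_" so that
    # "rant without visit." and "RANT_WITHOUT_VISIT" compare equal.
    normalized = re.sub(r"[^A-Z]+", "_", raw_response.upper()).strip("_")
    padded = f"_{normalized}_"
    matches = [label for label in LABELS if f"_{label}_" in padded]
    # Accept only unambiguous replies; zero or several labels yield None.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` for ambiguous replies keeps uncertain pseudo-labels out of the training set; such reviews could be re-queried or dropped rather than guessed.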

🚧 Challenges I Ran Into

  • 🔎 Feature Selection – Choosing the right combination of text + numerical features for accurate classification.
  • 🎯 Model Training – Balancing generalization vs. overfitting with limited review violation data.
  • ⚖️ Class Imbalance – Most Google reviews are clean, so violations were underrepresented. Mitigated with class weights.
  • 💻 Compute Limits – Google Colab GPUs often crashed during long training runs.
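
The class-weight mitigation mentioned above can be illustrated with inverse-frequency weighting. The sketch below assumes the common "balanced" heuristic, weight = n_samples / (n_classes * class_count); the write-up does not say which weighting scheme was actually used, and the toy label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight}; rarer labels get proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy example (invented counts): 8 clean reviews and 2 spam reviews.
toy_labels = ["NONE"] * 8 + ["SPAM"] * 2
toy_weights = balanced_class_weights(toy_labels)  # NONE -> 0.625, SPAM -> 2.5
```

In PyTorch, weights like these are typically passed, ordered by class index, as the `weight` tensor of `torch.nn.CrossEntropyLoss`, so that mistakes on rare violation classes cost more during training.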

🏆 Accomplishments

  • ✅ Developed and deployed a full end-to-end review violation detection system.
  • 🧠 Created a custom ML model tailored specifically for review policy violations.
  • 🌐 Built a user-friendly interface with Gradio for easy adoption.

📚 What I Learned

  • Combining textual and numerical metadata significantly improves classification robustness.
  • Practical experience in fine-tuning and deploying DistilBERT for a domain-specific NLP task.
  • Building interactive ML tools with Gradio for real-world usability.

🚀 What's Next

  • Model Enhancement – Explore ensembles or larger LLMs (e.g., Qwen 3, Gemma 3) for deeper review understanding.
  • Granular Violation Details – Provide explanations, reasons, and confidence scores for flagged reviews.
  • Real-Time Analysis – Adapt the system to flag reviews as they are submitted.
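
As one possible starting point for the confidence scores above, the classifier's raw output scores (logits) can be turned into probabilities with a softmax, and the top probability reported as the confidence. This is a sketch only: the label order, the function names, and the 0.6 review threshold are assumptions, and softmax probabilities from a fine-tuned model are not guaranteed to be well calibrated.

```python
import math

# Assumed label order; it must match the classifier's output indices.
LABELS = ("NONE", "SPAM", "RANT_WITHOUT_VISIT", "IRRELEVANT_CONTENT")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def explain_prediction(logits, threshold=0.6):
    """Return the predicted label, its confidence, and a low-confidence flag."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": LABELS[best],
        "confidence": probs[best],
        "needs_review": probs[best] < threshold,
    }
```

Predictions with `needs_review` set could be routed to a human moderator instead of being flagged automatically.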

🧰 Libraries & Frameworks Used

  • Core Language: Python
  • ML Framework: PyTorch
  • NLP / Data Processing: Hugging Face Transformers, pandas, NumPy, scikit-learn
  • Deployment/UI: Gradio

📂 Assets & Datasets
