Google Local Review Quality & Relevancy Classification

Overview

Our project addresses the challenge of automatically assessing the quality and relevancy of Google location reviews using Machine Learning (ML) and Natural Language Processing (NLP). Online reviews play a vital role in shaping perceptions of businesses, but irrelevant, spammy, or misleading reviews can distort reputation.

To solve this, we developed and compared two ML-based approaches:

  1. RoBERTa-based Transformer Model for deep contextual text classification.
  2. Random Forest Classifier leveraging engineered linguistic and semantic features.

Together, these models aim to detect low-quality reviews, enforce content policies, and improve review reliability.


Problem Statement Tackled

We specifically address the hackathon’s requirement to:

  • Gauge review quality: Detect spam, ads, irrelevant content, and rants from non-visitors.
  • Assess relevancy: Ensure review text is genuinely tied to the location being reviewed.
  • Enforce policies: Automatically assign each review to one of the following categories:
    • Advertisement
    • Irrelevant
    • Rant
    • Spam
    • Good content

Solution Approach

1. RoBERTa Transformer Model

  • Architecture: Built on Hugging Face’s roberta-base with a custom classification head.
  • Key Features:
    • Tokenization and contextual embeddings via RoBERTa.
    • Class-weighted loss to address label imbalance.
    • Early stopping for efficient training.
  • Training:
    • Dataset cleaned by dropping rows with missing values and removing irrelevant columns.
    • Train/val/test split with label encoding.
    • Metrics: Accuracy and Macro F1-score.
  • Output: Predicts review categories (e.g., Good Review, Spam, Advertisement, Irrelevant Content).
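
The label encoding and class-weighted loss listed above can be sketched in plain Python. This is a minimal illustration, not the project's training code: the weighting formula (total samples divided by number of classes times class count) is the common "balanced" heuristic and is an assumption here, the function names are hypothetical, and the label counts are made up. In a PyTorch setup the same weights would typically be passed to torch.nn.CrossEntropyLoss through its weight argument.

```python
import math
from collections import Counter


def encode_labels(labels):
    """Map each distinct label string to an integer id (sorted for stability)."""
    classes = sorted(set(labels))
    to_id = {name: i for i, name in enumerate(classes)}
    return [to_id[name] for name in labels], classes


def balanced_class_weights(label_ids, num_classes):
    """Assumed heuristic: weight_c = N / (K * count_c), so rare classes weigh more."""
    counts = Counter(label_ids)
    total = len(label_ids)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(logits, target, weights):
    """Per-example loss: -weights[target] * log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[target] * (log_sum_exp - logits[target])


# Made-up, imbalanced label list for illustration only.
labels = ["Good review"] * 6 + ["Spam"] * 2 + ["Ad"] * 2
label_ids, classes = encode_labels(labels)
weights = balanced_class_weights(label_ids, len(classes))

for name, weight in zip(classes, weights):
    print(f"{name}: {weight:.3f}")
```

With these weights, a mistake on a rare class such as Spam contributes about three times as much to the loss as a mistake on the majority class.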

This model excels at capturing nuanced semantics and context, which makes it well suited to detecting subtle irrelevance or disguised promotional language.
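
The macro F1 metric and early stopping mentioned in the training setup can be sketched as follows. This is an illustrative version under stated assumptions: the EarlyStopping class, its patience value, and the validation scores are hypothetical, and the project may well have relied on library utilities instead (for example the EarlyStoppingCallback in Hugging Face Transformers).

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


class EarlyStopping:
    """Hypothetical helper: stop once the score fails to improve `patience` times in a row."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, score):
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation macro F1 per epoch, for illustration only.
val_scores = [0.50, 0.60, 0.58, 0.59, 0.70]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.should_stop(score):
        stopped_at = epoch
        break

print("stopped at epoch index:", stopped_at)
```

Macro F1 is a natural companion to class-weighted loss: plain accuracy can look good on an imbalanced dataset even when minority categories are mostly misclassified.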


2. Random Forest Classifier

  • Architecture: Traditional ensemble method using engineered features.
  • Feature Engineering:
    • Linguistic indicators: word count, sentiment words, sensory words, superlatives.
    • Behavioral cues: business-related terms, promotional language, authenticity score.
  • Training & Validation:
    • Label encoding + feature scaling.
    • Trained with balanced class weights to handle skew.
    • Comprehensive validation: cross-validation, learning curves, feature importance analysis.
  • Output: Multi-class classification into review categories.
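
The feature engineering step above might look like the following sketch. The word lists, regular expressions, and feature names are hypothetical placeholders rather than the project's actual lexicons, and the authenticity score is left out because the write-up does not define how it is computed.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE_WORDS = {"great", "friendly", "delicious", "clean"}
NEGATIVE_WORDS = {"rude", "dirty", "terrible", "slow"}
SENSORY_WORDS = {"crispy", "loud", "warm", "fresh", "smelled"}
SUPERLATIVES = {"best", "worst", "greatest", "cheapest"}
PROMO_PATTERNS = [r"https?://", r"\bdiscount\b", r"\bpromo code\b", r"\bcall now\b"]


def extract_features(text):
    """Turn one review into a small dictionary of numeric features."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    return {
        "word_count": len(words),
        "sentiment_words": sum(
            w in POSITIVE_WORDS or w in NEGATIVE_WORDS for w in words
        ),
        "sensory_words": sum(w in SENSORY_WORDS for w in words),
        "superlatives": sum(w in SUPERLATIVES for w in words),
        # Number of distinct promotional patterns found in the text.
        "promo_hits": sum(bool(re.search(p, lowered)) for p in PROMO_PATTERNS),
    }


review_features = extract_features(
    "Best tacos in town, crispy and fresh. Staff was friendly!"
)
ad_features = extract_features("Call now for a discount! Visit http://example.com")

print(review_features)
print(ad_features)
```

Each dictionary becomes one row of the feature matrix. Because every column has a readable name, the forest's feature importances map directly back to cues such as promotional phrasing.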

This approach offers interpretability through feature importance, helping explain why a review is flagged.


Tools & Technologies Used


Development Tools

  • Google Colab
  • Jupyter Notebooks
  • VS Code


APIs Used

  • OpenAI GPT-4o-mini (for exploratory data analysis & augmentation)
  • Google Maps API (for location-based review metadata, if needed)


Libraries & Frameworks

  • Hugging Face Transformers
  • PyTorch
  • scikit-learn
  • pandas
  • numpy
  • seaborn
  • matplotlib
  • openai


Datasets

  • Google Local Reviews Dataset (cleaned & preprocessed)
  • Manually labeled reviews:
    • Spam
    • Ad
    • Rant
    • Irrelevant
    • Good review

Results & Insights

  • RoBERTa: Achieved higher accuracy and macro F1 than the Random Forest, and was particularly effective at nuanced semantic detection (e.g., disguised ads, off-topic rants).
  • Random Forest: Provided strong interpretability and decent accuracy, highlighting which textual features most influenced classification.
  • Combined Value: While RoBERTa outperforms on raw text understanding, Random Forest aids in explainability and quick validation, making the two complementary.

Future Improvements

  • Integrate both models in a hybrid pipeline (RoBERTa for semantic analysis + Random Forest for interpretable signals).
  • Expand dataset with active learning to capture more edge cases.
  • Deploy as an API endpoint for real-time review moderation.
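
One possible shape for the hybrid pipeline is a weighted average of the two models' class probabilities. The sketch below is only an assumption about how such a combination could work: the function names, the 0.7 weight, and the probability values are all hypothetical and not results from this project.

```python
def combine_probabilities(transformer_probs, forest_probs, transformer_weight=0.7):
    """Blend two probability vectors defined over the same ordered class list."""
    if len(transformer_probs) != len(forest_probs):
        raise ValueError("both models must score the same set of classes")
    w = transformer_weight
    return [w * a + (1 - w) * b for a, b in zip(transformer_probs, forest_probs)]


def predict_label(probs, classes):
    """Return the class name with the highest combined probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best]


classes = ["Ad", "Good review", "Irrelevant", "Rant", "Spam"]

# Made-up model outputs for a single review, for illustration only.
roberta_probs = [0.10, 0.45, 0.05, 0.05, 0.35]
forest_probs = [0.05, 0.20, 0.05, 0.05, 0.65]

combined = combine_probabilities(roberta_probs, forest_probs)
label = predict_label(combined, classes)

print(label, [round(p, 3) for p in combined])
```

In this made-up case the transformer alone would choose "Good review", while the blended score favours "Spam" because the feature-based model is confident. Keeping both probability vectors also makes it easy to report which model drove a decision.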

Built With

  • csv-data-processing
  • googlecolab
  • json
  • jupyter
  • numpy
  • openai-api
  • pandas
  • python
  • regex
  • scikit-learn