Inspiration

Online reviews are a quick way for users to make decisions. However, these reviews are often polluted with advertisements, spam, or irrelevant content, which reduces trust for both users and businesses.

What it does

We worked with the Google Local Reviews dataset. Since most reviews are text-heavy, we focused on text-based natural language processing (NLP). We first explored sentiment analysis as a baseline, then moved towards classification. We decided to create a model that runs through reviews and filters out those identified as spam or ads, leaving only 'normal' reviews. To obtain training labels, we implemented pseudo-labelling: advertisements are reviews that contain 'www' or 'http', spam is any review matching patterns like 'never been here', and everything else is treated as a normal review. We then balanced the classes and fine-tuned BERT (bert-base-uncased) with a custom classification head, enabling the model to distinguish between the three types of reviews.
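As a rough illustration of the pseudo-labelling rules described above, a minimal Python sketch might look like the following. Only the 'www', 'http', and 'never been here' patterns come from our description; the function name pseudo_label, the case-insensitive matching, and the rule order (advertisement check before spam check) are assumptions made for this example.

```python
import re

# Patterns named in the write-up. Case-insensitive matching is an assumption.
AD_PATTERN = re.compile(r"www|http", re.IGNORECASE)
SPAM_PATTERN = re.compile(r"never been here", re.IGNORECASE)


def pseudo_label(text: str) -> str:
    """Assign a weak label to one review using simple keyword rules.

    Hypothetical helper: the rule order (ads first, then spam) is assumed.
    """
    if AD_PATTERN.search(text):
        return "advertisement"
    if SPAM_PATTERN.search(text):
        return "spam"
    return "normal"


reviews = [
    "Great coffee and friendly staff.",
    "Visit www.example.com for cheap deals!",
    "Never been here but I heard it is terrible.",
]
labels = [pseudo_label(r) for r in reviews]
```

Rules like these are cheap to apply to a large unlabelled dataset, but they are noisy, which is why the labels are only a starting point for training rather than ground truth.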

For evaluation, we took a two-step approach. First, since the dataset was dominated by normal reviews, we oversampled the spam and advertisement classes to help the model learn their features more effectively. Once the model was trained and applied, we manually reviewed the filtered results in the CSV file to refine and improve our labelling policies. Second, to test generalizability, we ran the model on reviews from two other datasets (Hawaii and New York) available on the same site. In both cases the filtering was successful, which suggests that our approach transfers well across locations.
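The oversampling step can be sketched as plain random oversampling, where minority-class rows are duplicated until every class matches the largest one. This is a minimal sketch under assumptions: the function name oversample, the (text, label) row format, and the fixed seed are hypothetical, and the exact resampling method we used may differ.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size.

    rows: list of (text, label) tuples. Hypothetical helper for illustration.
    """
    if not rows:
        return []
    rng = random.Random(seed)

    # Group rows by their label.
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k == 0 adds nothing).
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


rows = (
    [("nice place", "normal")] * 5
    + [("never been here", "spam")] * 2
    + [("see www.example.com", "advertisement")]
)
balanced = oversample(rows)
counts = Counter(label for _, label in balanced)
```

Oversampling of this kind is usually applied to the training split only, so that duplicated rows do not leak into the data used to judge the model.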

Specific problem statement tackled from the challenge prompt

Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews. The system should:

  • Gauge review quality: Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.
  • Enforce policies: Automatically flag or filter out reviews that violate the following example policies:
    • No advertisements or promotional content.
    • No irrelevant content (e.g., reviews about unrelated topics).
    • No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).

Development tools, APIs, libraries, frameworks and datasets used

Google Colab, Jupyter, BERT (bert-base-uncased), Hugging Face Transformers, PyTorch, scikit-learn, pandas, Google Local Reviews dataset
