Project Veritas: Enhancing the Reliability of Location-Based Reviews
🔍 Problem Statement
Online review platforms face a critical challenge: maintaining the integrity and usefulness of user-generated content. Location-based reviews are particularly susceptible to:
- Low-quality content: Irrelevant, nonsensical, or extremely brief reviews
- Policy violations: Spam, advertisements, offensive language, or inappropriate content
- Biased information: Fake reviews that mislead users and businesses
These issues undermine the trustworthiness of review platforms and diminish user experience. Project Veritas addresses this challenge by creating an ML-powered system to evaluate the quality and relevancy of Google location reviews, helping identify and filter out problematic content while preserving genuine user feedback.
💡 Our Solution
Project Veritas uses a multi-stage approach to review quality assessment through an iterative, day-by-day workflow:
Day 1: Data Collection and Integration
We collected and integrated the UCSD Google Local Reviews dataset, focusing on Delaware locations. This involved:
- Downloading the review-Delaware_10.json.gz and meta-Delaware.json.gz datasets from UCSD's repository
- Data cleaning, integration, and exploratory data analysis
- Creating a combined dataset with reviews and their metadata
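The loading-and-merging step above can be sketched as follows. This is a minimal illustration, assuming the published UCSD Google Local schema (both files are gzipped newline-delimited JSON keyed by `gmap_id`); the exact cleaning rules used in the project may differ.

```python
import pandas as pd

def load_ndjson_gz(path):
    # The UCSD files are gzipped newline-delimited JSON records.
    return pd.read_json(path, lines=True, compression="gzip")

def combine_reviews(reviews, meta):
    # Basic cleaning: drop reviews with no text and exact duplicates,
    # then attach business metadata via the shared gmap_id key.
    # Column names (user_id, gmap_id, time, text, name, category) are
    # assumptions based on the dataset's published schema.
    reviews = reviews.dropna(subset=["text"]).drop_duplicates(
        subset=["user_id", "gmap_id", "time"]
    )
    return reviews.merge(
        meta[["gmap_id", "name", "category"]], on="gmap_id", how="left"
    )
```

Usage would look like `combine_reviews(load_ndjson_gz("review-Delaware_10.json.gz"), load_ndjson_gz("meta-Delaware.json.gz"))`.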
Day 2: Feature Engineering and Model Development
We enhanced the dataset with sophisticated features to detect policy violations:
- Sentiment analysis using NLTK's VADER
- Extraction of linguistic features and user behavior patterns
- Implementation of both rule-based policy modules and machine learning classifiers
- Topic modeling with LDA to identify irrelevant content
Day 3: Model Refinement and Evaluation
We optimized our approach through:
- Fine-tuning the model with advanced features and hyperparameters
- Implementing SMOTE for handling class imbalance
- Rigorous evaluation with metrics tailored to policy enforcement
- Analysis of real-world examples to understand model behavior
Our final solution can accurately identify three key classes of problematic reviews:
- Rants without visits: Strong complaints with no evidence of actual visits
- Irrelevant content: Off-topic or uninformative reviews
- Advertisements: Reviews containing promotional content or external links
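A rule-based module for the advertisement class might look like the sketch below. The patterns shown (URLs and a few promo phrases) are hypothetical examples, not the project's actual rule set.

```python
import re

# Hypothetical promo/URL patterns for the "Advertisements" class.
AD_PATTERNS = [
    r"https?://\S+",
    r"www\.\S+",
    r"\b(promo code|discount code|visit our website|follow us)\b",
]
_ad_re = re.compile("|".join(AD_PATTERNS), re.IGNORECASE)

def looks_like_ad(text: str) -> bool:
    # Flag reviews containing promotional content or external links.
    return bool(_ad_re.search(text))
```

In practice such rules complement the ML classifier: high-precision rules catch obvious violations cheaply, and the model handles the ambiguous remainder.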
🛠️ Technologies Used
Development Tools
- Jupyter Notebooks: Primary development environment for all stages
- Google Colab: Used for executing notebook code with GPU acceleration
Libraries and Frameworks
Data Processing & Analysis:
- Pandas: Data manipulation and cleaning
- NumPy: Numerical computing operations
- Matplotlib/Seaborn: Data visualization
Natural Language Processing:
- NLTK: Text preprocessing and sentiment analysis (VADER)
- Gensim: Topic modeling with Latent Dirichlet Allocation (LDA)
Machine Learning:
- Scikit-learn: Feature extraction (TF-IDF), preprocessing, and ML pipelines
- LightGBM: Gradient boosting framework for classification
- Imbalanced-learn: SMOTE implementation for class imbalance
Datasets
- UCSD Google Local Reviews Dataset:
  - Source: https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/
  - Files used: review-Delaware_10.json.gz, meta-Delaware.json.gz
  - Contains user reviews and business metadata for locations in Delaware
📊 Model Performance
Our final optimized LightGBM model achieved strong results:
- Overall accuracy: 80.5%
- F1-Score for rant detection: 98.3%
- F1-Score for irrelevant content: 73.4%
The model particularly excels at identifying baseless negative reviews (rant_no_visit class), which is crucial for protecting businesses from unwarranted negative attacks. It also demonstrates strong performance in filtering irrelevant content with high precision (86.6%).
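Per-class precision, recall, and F1 scores like those above can be produced with scikit-learn's `classification_report`. The labels below are tiny illustrative examples, not the project's actual predictions.

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only; class names mirror the project's taxonomy.
y_true = ["ok", "rant_no_visit", "irrelevant", "ok", "ads", "irrelevant"]
y_pred = ["ok", "rant_no_visit", "ok", "ok", "ads", "irrelevant"]

print(classification_report(y_true, y_pred, digits=3))
```

The report breaks accuracy down per class, which matters here because aggregate accuracy can hide weak performance on rare classes such as `ads`.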
🌟 Impact and Applications
Project Veritas helps:
- Users find reliable and useful information about locations
- Businesses receive genuine feedback to improve their services
- Platform owners maintain content quality without excessive manual moderation
- Community members enjoy a more trustworthy review ecosystem
By automating the detection of low-quality and policy-violating reviews, our solution enhances the overall health of review platforms while saving significant resources that would otherwise be spent on manual content moderation.
🚀 Future Directions
Future enhancements could include:
- Real-time review assessment capabilities
- Multilingual support for global platforms
- Integration with more diverse datasets beyond Delaware
- Advanced feature engineering using large language models
- User interface for content moderators to review borderline cases
👥 Team Members
- Nguyen Tran Thanh Lam
- Pham Hong Phuc
- Pham Tran Tuan Minh
- Dinh Quang Anh
- Nguyen Le Tam
Built With
- gensim
- google-colab
- imbalanced-learn
- jupyter-notebooks
- lightgbm
- matplotlib
- nltk
- numpy
- pandas
- python
- scikit-learn
- seaborn