Project Veritas: Enhancing the Reliability of Location-Based Reviews

🔍 Problem Statement

Online review platforms face a critical challenge: maintaining the integrity and usefulness of user-generated content. Location-based reviews are particularly susceptible to:

  • Low-quality content: Irrelevant, nonsensical, or extremely brief reviews
  • Policy violations: Spam, advertisements, offensive language, or inappropriate content
  • Biased information: Fake reviews that mislead users and businesses

These issues undermine the trustworthiness of review platforms and diminish user experience. Project Veritas addresses this challenge by creating an ML-powered system to evaluate the quality and relevancy of Google location reviews, helping identify and filter out problematic content while preserving genuine user feedback.

💡 Our Solution

Project Veritas takes a multi-stage approach to review quality assessment, developed through an iterative, day-by-day workflow:

Day 1: Data Collection and Integration

We collected, merged, and analyzed the UCSD Google Local Reviews data with a focus on Delaware locations. This involved:

  • Downloading the review-Delaware_10.json.gz and meta-Delaware.json.gz datasets from UCSD's repository
  • Data cleaning, integration, and exploratory data analysis
  • Creating a combined dataset with reviews and their metadata
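
The load-and-merge step above can be sketched with pandas. The two one-row samples below are stand-ins for the real review-Delaware_10.json.gz and meta-Delaware.json.gz downloads, so the sketch runs without them; joining on `gmap_id` reflects the UCSD Google Local schema, but verify the key against your copy of the data:

```python
import gzip
import json

import pandas as pd

def load_jsonl_gz(path):
    """Load one gzipped JSON-lines file (the UCSD release format) into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return pd.DataFrame(json.loads(line) for line in f)

# Tiny stand-ins for review-Delaware_10.json.gz / meta-Delaware.json.gz.
sample_reviews = [{"gmap_id": "0x1", "user_id": "u1", "rating": 5, "text": "Great pizza"}]
sample_meta = [{"gmap_id": "0x1", "name": "Tony's Pizzeria", "category": ["Pizza restaurant"]}]
for path, rows in [("reviews.json.gz", sample_reviews), ("meta.json.gz", sample_meta)]:
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)

reviews = load_jsonl_gz("reviews.json.gz")
meta = load_jsonl_gz("meta.json.gz")

# Join each review to its location's metadata on the shared gmap_id key.
combined = reviews.merge(meta, on="gmap_id", how="left")
```

A left join keeps every review even when its location is missing from the metadata file, which is the safer default during exploratory analysis.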

Day 2: Feature Engineering and Model Development

We enhanced the dataset with sophisticated features to detect policy violations:

  • Sentiment analysis using NLTK's VADER
  • Extraction of linguistic features and user behavior patterns
  • Implementation of both rule-based policy modules and machine learning classifiers
  • Topic modeling with LDA to identify irrelevant content

Day 3: Model Refinement and Evaluation

We optimized our approach through:

  • Fine-tuning the model with advanced features and hyperparameters
  • Implementing SMOTE for handling class imbalance
  • Rigorous evaluation with metrics tailored to policy enforcement
  • Analysis of real-world examples to understand model behavior

Our final solution can accurately identify three key classes of problematic reviews:

  1. Rants without visits: Strong complaints with no evidence of actual visits
  2. Irrelevant content: Off-topic or uninformative reviews
  3. Advertisements: Reviews containing promotional content or external links
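
A rule-based policy module along these lines might look as follows; the regex patterns, thresholds, and class names are illustrative stand-ins for the project's actual rules:

```python
import re

# Illustrative patterns only -- the real policy modules would be richer.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
PROMO_RE = re.compile(r"discount|promo code|\d+% off|visit our", re.IGNORECASE)
NO_VISIT_RE = re.compile(r"never been|haven'?t (?:been|visited)|didn'?t go", re.IGNORECASE)

def flag_review(text, vader_compound):
    """Return the first policy class a review trips, or None if it looks clean."""
    if URL_RE.search(text) or PROMO_RE.search(text):
        return "advertisement"
    if vader_compound < -0.5 and NO_VISIT_RE.search(text):
        return "rant_no_visit"
    if len(text.split()) < 3:  # very short reviews carry little information
        return "irrelevant"
    return None
```

Rules like these give interpretable, high-precision flags that can seed labels for, or run alongside, the learned classifier.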

🛠️ Technologies Used

Development Tools

  • Jupyter Notebooks: Primary development environment for all stages
  • Google Colab: Used for executing notebook code with GPU acceleration

Libraries and Frameworks

  • Data Processing & Analysis:

    • Pandas: Data manipulation and cleaning
    • NumPy: Numerical computing operations
    • Matplotlib/Seaborn: Data visualization
  • Natural Language Processing:

    • NLTK: Text preprocessing and sentiment analysis (VADER)
    • Gensim: Topic modeling with Latent Dirichlet Allocation (LDA)
  • Machine Learning:

    • Scikit-learn: Feature extraction (TF-IDF), preprocessing, and ML pipelines
    • LightGBM: Gradient boosting framework for classification
    • Imbalanced-learn: SMOTE implementation for class imbalance

Datasets

  • UCSD Google Local Reviews: review-Delaware_10.json.gz and meta-Delaware.json.gz (Delaware reviews and location metadata)

📊 Model Performance

Our final optimized LightGBM model achieved impressive results:

  • Overall accuracy: 80.5%
  • F1-Score for rant detection: 98.3%
  • F1-Score for irrelevant content: 73.4%

The model particularly excels at identifying baseless negative reviews (rant_no_visit class), which is crucial for protecting businesses from unwarranted negative attacks. It also demonstrates strong performance in filtering irrelevant content with high precision (86.6%).

🌟 Impact and Applications

Project Veritas helps:

  • Users find reliable and useful information about locations
  • Businesses receive genuine feedback to improve their services
  • Platform owners maintain content quality without excessive manual moderation
  • Community members enjoy a more trustworthy review ecosystem

By automating the detection of low-quality and policy-violating reviews, our solution enhances the overall health of review platforms while saving significant resources that would otherwise be spent on manual content moderation.

🚀 Future Directions

Future enhancements could include:

  • Real-time review assessment capabilities
  • Multilingual support for global platforms
  • Integration with more diverse datasets beyond Delaware
  • Advanced feature engineering using large language models
  • User interface for content moderators to review borderline cases

👥 Team Members

  • Nguyen Tran Thanh Lam
  • Pham Hong Phuc
  • Pham Tran Tuan Minh
  • Dinh Quang Anh
  • Nguyen Le Tam
