Project Veritas: Enhancing the Reliability of Location-Based Reviews
🔍 Problem Statement
Online review platforms face a critical challenge: maintaining the integrity and usefulness of user-generated content. Location-based reviews are particularly susceptible to:
- Low-quality content: Irrelevant, nonsensical, or extremely brief reviews
- Policy violations: Spam, advertisements, offensive language, or inappropriate content
- Biased information: Fake reviews that mislead users and businesses
These issues undermine the trustworthiness of review platforms and diminish user experience. Project Veritas addresses this challenge by creating an ML-powered system to evaluate the quality and relevancy of Google location reviews, helping identify and filter out problematic content while preserving genuine user feedback.
💡 Our Solution
Project Veritas uses a multi-stage approach to review quality assessment through an iterative, day-by-day workflow:
Day 1: Data Collection and Integration
We collected and integrated the UCSD Google Local Reviews dataset, focusing on Delaware locations. This involved:
- Downloading the review-Delaware_10.json.gz and meta-Delaware.json.gz datasets from UCSD's repository
- Data cleaning, integration, and exploratory data analysis
- Creating a combined dataset with reviews and their metadata
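The loading-and-merging step above can be sketched as follows. This is a minimal illustration, assuming the published UCSD Google Local schema (both files are gzipped newline-delimited JSON keyed by `gmap_id`); the exact cleaning rules used in the project may differ.

```python
import pandas as pd

def load_ndjson_gz(path):
    # The UCSD files are gzipped newline-delimited JSON records.
    return pd.read_json(path, lines=True, compression="gzip")

def combine_reviews(reviews, meta):
    # Basic cleaning: drop reviews with no text and exact duplicates,
    # then attach business metadata via the shared gmap_id key.
    # Column names (user_id, gmap_id, time, text, name, category) are
    # assumptions based on the dataset's published schema.
    reviews = reviews.dropna(subset=["text"]).drop_duplicates(
        subset=["user_id", "gmap_id", "time"]
    )
    return reviews.merge(
        meta[["gmap_id", "name", "category"]], on="gmap_id", how="left"
    )
```

Usage would look like `combine_reviews(load_ndjson_gz("review-Delaware_10.json.gz"), load_ndjson_gz("meta-Delaware.json.gz"))`.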
Day 2: Feature Engineering and Model Development
We enhanced the dataset with sophisticated features to detect policy violations:
- Sentiment analysis using NLTK's VADER
- Extraction of linguistic features and user behavior patterns
- Implementation of both rule-based policy modules and machine learning classifiers
- Topic modeling with LDA to identify irrelevant content
Day 3: Model Refinement and Evaluation
We optimized our approach through:
- Fine-tuning the model with advanced features and hyperparameters
- Implementing SMOTE for handling class imbalance
- Rigorous evaluation with metrics tailored to policy enforcement
- Analysis of real-world examples to understand model behavior
Our final solution can accurately identify three key classes of problematic reviews:
- Rants without visits: Strong complaints with no evidence of actual visits
- Irrelevant content: Off-topic or uninformative reviews
- Advertisements: Reviews containing promotional content or external links
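A rule-based module for the advertisement class might look like the sketch below. The patterns shown (URLs and a few promo phrases) are hypothetical examples, not the project's actual rule set.

```python
import re

# Hypothetical promo/URL patterns for the "Advertisements" class.
AD_PATTERNS = [
    r"https?://\S+",
    r"www\.\S+",
    r"\b(promo code|discount code|visit our website|follow us)\b",
]
_ad_re = re.compile("|".join(AD_PATTERNS), re.IGNORECASE)

def looks_like_ad(text: str) -> bool:
    # Flag reviews containing promotional content or external links.
    return bool(_ad_re.search(text))
```

In practice such rules complement the ML classifier: high-precision rules catch obvious violations cheaply, and the model handles the ambiguous remainder.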
🛠️ Technologies Used
Development Tools
- Jupyter Notebooks: Primary development environment for all stages
- Google Colab: Used for executing notebook code with GPU acceleration
Libraries and Frameworks
Data Processing & Analysis:
- Pandas: Data manipulation and cleaning
- NumPy: Numerical computing operations
- Matplotlib/Seaborn: Data visualization
Natural Language Processing:
- NLTK: Text preprocessing and sentiment analysis (VADER)
- Gensim: Topic modeling with Latent Dirichlet Allocation (LDA)
Machine Learning:
- Scikit-learn: Feature extraction (TF-IDF), preprocessing, and ML pipelines
- LightGBM: Gradient boosting framework for classification
- Imbalanced-learn: SMOTE implementation for class imbalance
Datasets
- UCSD Google Local Reviews Dataset:
  - Source: https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/
  - Files used: review-Delaware_10.json.gz, meta-Delaware.json.gz
  - Contains user reviews and business metadata for locations in Delaware
📊 Model Performance
Our final optimized LightGBM model achieved strong results:
- Overall accuracy: 80.5%
- F1-Score for rant detection: 98.3%
- F1-Score for irrelevant content: 73.4%
The model particularly excels at identifying baseless negative reviews (rant_no_visit class), which is crucial for protecting businesses from unwarranted negative attacks. It also demonstrates strong performance in filtering irrelevant content with high precision (86.6%).
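Per-class precision, recall, and F1 scores like those above can be produced with scikit-learn's `classification_report`. The labels below are tiny illustrative examples, not the project's actual predictions.

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only; class names mirror the project's taxonomy.
y_true = ["ok", "rant_no_visit", "irrelevant", "ok", "ads", "irrelevant"]
y_pred = ["ok", "rant_no_visit", "ok", "ok", "ads", "irrelevant"]

print(classification_report(y_true, y_pred, digits=3))
```

The report breaks accuracy down per class, which matters here because aggregate accuracy can hide weak performance on rare classes such as `ads`.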
🌟 Impact and Applications
Project Veritas helps:
- Users find reliable and useful information about locations
- Businesses receive genuine feedback to improve their services
- Platform owners maintain content quality without excessive manual moderation
- Community members enjoy a more trustworthy review ecosystem
By automating the detection of low-quality and policy-violating reviews, our solution enhances the overall health of review platforms while saving significant resources that would otherwise be spent on manual content moderation.
🚀 Future Directions
Future enhancements could include:
- Real-time review assessment capabilities
- Multilingual support for global platforms
- Integration with more diverse datasets beyond Delaware
- Advanced feature engineering using large language models
- User interface for content moderators to review borderline cases
👥 Team Members
- Nguyen Tran Thanh Lam
- Pham Hong Phuc
- Pham Tran Tuan Minh
- Dinh Quang Anh
- Nguyen Le Tam
Built With
- gensim
- google-colab
- imbalanced-learn
- jupyter-notebooks
- lightgbm
- matplotlib
- nltk
- numpy
- pandas
- python
- scikit-learn
- seaborn