About the project

Inspiration

Our inspiration came from a common problem plaguing online platforms like Google Maps: the prevalence of fake or misleading business reviews. These reviews can range from irrelevant spam and advertisements to comments that aren't based on actual customer experiences. Such content degrades the quality of information, misleads potential customers, and ultimately erodes trust in the platform. We saw an opportunity to leverage machine learning to automatically identify and flag these policy-violating reviews, creating a more reliable and trustworthy environment for both consumers and businesses.

What it does

This project, the Google Maps Business Ad Violation Detector, is an intelligent system designed to automatically analyze and classify user reviews against a predefined set of rules. It identifies three primary types of violations:

- Advertisements (if_ad): reviews that are not genuine feedback but are intended to promote other products or services.
- Irrelevant Content (if_irrelevant): comments that are completely unrelated to the business being reviewed, such as personal chatter or social commentary.
- Not Based on Experience (if_not_experience): reviews where the user admits they haven't visited the place but comments based on hearsay or appearance (e.g., "I didn't go, but it looks nice").

Our final product is an interactive web application where a user can input a review and its associated metadata (star rating, reviewer history, etc.). The system runs this input through our trained machine learning model and returns a real-time prediction, flagging any potential violations and showing a confidence score for each violation type.
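The app's output format can be sketched as follows. This is a minimal illustration with stubbed probabilities; a real run would take the per-label confidence scores from the trained model rather than hard-coded numbers, and `flag_violations` and the 0.5 threshold are our illustrative assumptions:

```python
VIOLATION_LABELS = ["if_ad", "if_irrelevant", "if_not_experience"]

def flag_violations(probabilities, threshold=0.5):
    """Turn per-label confidence scores into the flags the web app displays."""
    return {
        label: {"confidence": round(p, 3), "violation": p >= threshold}
        for label, p in zip(VIOLATION_LABELS, probabilities)
    }

# Example: a review the model considers likely to be an advertisement
print(flag_violations([0.91, 0.12, 0.05]))
```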

How we built it

We followed a comprehensive, iterative machine learning pipeline to build this detector:

1. Data Collection & Initial Cleaning: We started with a raw dataset of Google Maps reviews, translated all non-English text into English, and removed entries with empty review text to ensure a consistent language base.
2. Data Labeling & Augmentation: The most critical challenge was severe class imbalance: violating reviews were extremely rare. We employed a hybrid strategy. We manually labeled a subset of the data, then used a Large Language Model (LLM) to perform data synthesis, generating hundreds of new, diverse, and realistic-looking violating reviews based on the patterns in our initial labeled set. This significantly enriched the training data for the minority classes.
3. Feature Engineering: We transformed the raw data into a format suitable for machine learning models.
   - Numerical features: handled missing values with median imputation, treated extreme outliers with 99th-percentile capping, then standardized all values using StandardScaler.
   - Categorical features: applied One-Hot Encoding to place_title and class_both.
   - Text features: the core of our feature engineering. We used a Word2Vec model to generate 100-dimensional vector embeddings for each review, capturing the semantic meaning of the text.
   - Engineered features: strong signals extracted directly from the raw text, such as contains_url and contains_phone, giving the model direct clues for spam detection.
4. Model Training & Selection: We framed the problem as a multi-label classification task and trained and rigorously compared several models: Logistic Regression, Random Forest, XGBoost, LightGBM, and CatBoost.
5. Deployment: We chose the best-performing model (LightGBM, based on our evaluation metrics) and built an interactive web application with Streamlit to demonstrate its real-world capabilities.
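The engineered text features mentioned above can be sketched with two simple regular expressions. The exact patterns in our project differ; the ones below are illustrative assumptions:

```python
import re

# Illustrative patterns for the contains_url / contains_phone signals
# (the production regexes may be stricter or broader than these).
URL_RE = re.compile(r"(https?://\S+|www\.\S+|\b\S+\.(com|net|org)\b)", re.I)
PHONE_RE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")

def engineered_features(text: str) -> dict:
    """Binary spam signals extracted directly from the raw review text."""
    return {
        "contains_url": int(bool(URL_RE.search(text))),
        "contains_phone": int(bool(PHONE_RE.search(text))),
    }

print(engineered_features("Great deals at www.example.com, call +1 555-123-4567!"))
```

Keeping these as explicit binary columns lets a tree-based model pick them up directly instead of hoping the signal survives inside a dense text embedding.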
All the preprocessing steps were saved into a pipeline using joblib to ensure consistency between training and live prediction.
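A minimal sketch of persisting preprocessing with joblib, under assumed structure: the real pipeline also includes the percentile capping, one-hot encoding, and Word2Vec steps, which are omitted here for brevity:

```python
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Numerical branch only: median imputation followed by standardization.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Fit on training data (columns hypothetical: star rating, reviewer history count).
X_train = np.array([[4.0, 120.0], [np.nan, 80.0], [5.0, np.nan]])
numeric_pipeline.fit(X_train)

joblib.dump(numeric_pipeline, "preprocessing.joblib")   # at training time
pipeline = joblib.load("preprocessing.joblib")          # inside the live app
print(pipeline.transform(np.array([[4.0, 100.0]])).shape)
```

Because the fitted medians, caps, and scaling statistics are serialized together, a live review is guaranteed to pass through exactly the same transforms as the training data.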

Challenges we ran into

- Extreme Class Imbalance: This was our biggest hurdle. Initially, violating reviews made up less than 0.5% of the data, and our first models failed completely, predicting every review as "compliant." We solved this by using AI-powered data synthesis to create a more balanced and robust training set.
- Feature Representation for Text: Simply cleaning text wasn't enough. Our initial models failed to detect subtle violations like URLs (www.github.com) because our text cleaning process was too aggressive. We overcame this by refining our text preprocessing and engineering explicit features like contains_url.
- Environment and Dependency Issues: We encountered several technical issues, such as UnicodeDecodeError from inconsistent file encodings, and NameError or library-specific errors (CatBoostError, LightGBMError) from misconfigured environments or incompatible feature names. We solved these through systematic debugging, careful environment management, and more robust code (e.g., cleaning feature names).
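The feature-name fix can be sketched in a few lines. LightGBM rejects column names containing certain special characters; replacing every non-alphanumeric run with an underscore is one way to sanitize them (the exact rule we used is an assumption here):

```python
import re

def clean_feature_names(columns):
    """Replace characters LightGBM cannot accept in feature names."""
    return [re.sub(r"[^0-9a-zA-Z_]+", "_", c) for c in columns]

print(clean_feature_names(["review text", "stars (1-5)", "w2v[0]"]))
```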

Accomplishments that we're proud of

- Successfully Overcoming Extreme Data Imbalance: Moving from a model with zero predictive power to one that can effectively identify rare violation classes is our biggest accomplishment. Our use of AI for data synthesis was a game-changer.
- Building an End-to-End ML System: We completed the entire lifecycle of a machine learning project, from raw data cleaning and labeling through feature engineering and model training to deploying a live, interactive web application.
- Developing a High-Performing Model: Our final LightGBM model demonstrated strong performance across all violation categories, proving the effectiveness of our data-centric approach.

What we learned

- Data is King: We learned firsthand that no amount of model tuning can compensate for a lack of quality, representative data. Our biggest performance gains came from improving the data, not from changing the model algorithm.
- The Power of Iteration: Building a machine learning model is not a linear process. We had to move back and forth between data analysis, feature engineering, and model evaluation to diagnose problems and improve results.
- AI as a Tool for AI: Using an LLM to generate training data for a smaller, more specialized model is a powerful and efficient modern workflow, and a prime example of combining different AI strengths to solve a complex problem.
- The Importance of Robust Preprocessing: Seemingly small details, like how to handle special characters in text or feature names, can make or break a model's ability to train and perform correctly.

What's next for Google Maps Business Ad Violation Detection

- Expand Violation Categories: Add more nuanced violation types, such as hate speech, personal attacks, or fake engagement reviews.
- Incorporate Advanced NLP Models: While Word2Vec worked well, we plan to experiment with more advanced text vectorization techniques like Sentence-BERT (which we also coded), which could better capture the contextual meaning of entire reviews.
- Real-time Feedback Loop: Deploy the model in a live environment and collect user feedback on its predictions. This feedback can be used to continuously retrain and improve the model over time (Human-in-the-Loop).
- Scalability: Optimize the entire pipeline so it can process millions of reviews in near real-time.
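One simple way the planned human-in-the-loop step could work (a hypothetical design, not something built yet): predictions the model is unsure about are queued for manual review, and the confirmed labels feed the next retraining run.

```python
def select_for_review(predictions, low=0.3, high=0.7):
    """Queue reviews whose highest violation confidence falls in the uncertain band."""
    return [p for p in predictions if low <= max(p["scores"].values()) <= high]

batch = [
    {"id": 1, "scores": {"if_ad": 0.95, "if_irrelevant": 0.02, "if_not_experience": 0.01}},
    {"id": 2, "scores": {"if_ad": 0.55, "if_irrelevant": 0.40, "if_not_experience": 0.10}},
    {"id": 3, "scores": {"if_ad": 0.05, "if_irrelevant": 0.08, "if_not_experience": 0.04}},
]
# Only the borderline review goes to a human; confident predictions pass through.
print([p["id"] for p in select_for_review(batch)])
```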
