Danny and his Butlers

Problem statement

Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews. The system should:

Gauge review quality: Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.
Assess relevancy: Determine whether the content of a review is genuinely related to the location being reviewed.
Enforce policies: Automatically flag or filter out reviews that violate the following example policies:
- No advertisements or promotional content.
- No irrelevant content (e.g., reviews about unrelated topics).
- No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).

GitHub repository with README link:

https://github.com/Capacesse/danny-and-his-butlers

Inspiration

Every day, TikTok receives millions of merchant reviews. While reviews are essential for trust and sales, many are irrelevant, spammy, or misleading. This hurts both merchants and buyers. We were inspired to solve this challenge because we saw the real business impact:

Buyers lose confidence if reviews are low quality.
Merchants cannot act on feedback that is unclear or irrelevant.
TikTok risks reduced engagement and GMV.

We wanted to build a system that could automatically filter and classify reviews in a way that is scalable, transparent, and cost-effective.

What it does

Our solution is a Review Quality & Relevance Classifier that:

Flags reviews as Useful, Irrelevant, or Policy-Violating.
Provides explainability with confidence scores and highlighted keywords.
Works in real time and is designed to be scalable across millions of reviews.
In simple terms, it separates valid reviews from invalid ones so that consumers are not misled by irrelevant or policy-violating content, which helps build consumer trust in the shops. Merchants can also use the valid reviews to understand public opinion and make better decisions.

How We Built It

Our solution is a multi-stage pipeline that uses a combination of rule-based logic and machine learning to achieve high accuracy.

Image Analysis & Preprocessing: The pipeline begins by analysing review photos using a BLIP model to generate a text caption. This caption is merged with the original, cleaned review text to create a single, unified text field.
Advanced Feature Engineering: We convert the unified text into a rich numerical representation using TF-IDF vectors (to capture keyword importance) and Sentence Embeddings (to capture contextual meaning).
Pseudo-labelling: Our dataset came without labels, so we used a hybrid method to create them. We first defined a set of high-confidence business rules, encapsulated in the generate_pseudo_labels function; for instance, some rules link sentiment analysis scores to rating values. Only the instances that satisfy these rules are labelled at this stage, which gives us a small, high-quality labelled set covering the most obvious cases. Because the rules are written by hand, we expect these labels to be more reliable than fully automatic ones. We then train a simple logistic regression model on this set and use it to label the remaining instances. We chose logistic regression because a simple model is easy to interpret and easy to adjust if the policies change. We also use try-except blocks to handle edge cases and raise a warning for the user.
Model Building: We use a stacking ensemble for classification, with three base models: (a) a Logistic Regression model trained on our tabular features, such as word_count and sentiment_score, which captures linear relationships in the structured data; (b) a LightGBM model trained on the sparse, high-dimensional TF-IDF features, where gradient boosting picks up textual patterns; and (c) a Random Forest model trained only on the sentence embeddings, whose tree-based structure can capture non-linear relationships in the embedding space that the other models might miss. To train the meta-model without data leakage, we use StratifiedKFold cross-validation to generate Out-of-Fold (OOF) predictions: each row's prediction comes from base models that were fitted on the other folds, so no base model ever predicts on a row it was trained on. The OOF class probabilities from the three base models are then passed, together with the labels, to the meta-model as its training data. This ensemble is the main source of the solution's accuracy and robustness.
Production-Grade Pipeline: We wrapped the prediction steps in a single function, predict_reviews, which runs data preprocessing, feature engineering, and modelling, and returns a dataframe with the predicted label for each review. A separate pipeline with clearly defined steps handles model training.
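The out-of-fold stacking step described under Model Building can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic, the split of columns into three feature blocks is hypothetical, and scikit-learn's GradientBoostingClassifier stands in for LightGBM so that the sketch needs only scikit-learn and NumPy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real review features (assumption): 300 rows, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=12, n_classes=3, random_state=0
)

# Hypothetical column split that mimics the three feature sets.
feature_blocks = {
    "tabular": X[:, :5],
    "tfidf": X[:, 5:20],
    "embeddings": X[:, 20:],
}

# One base model per feature block. GradientBoostingClassifier is a stand-in
# for LightGBM here (assumption made to avoid an extra dependency).
base_models = {
    "tabular": LogisticRegression(max_iter=1000),
    "tfidf": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "embeddings": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# For every row, the probabilities come from a model fitted on the other folds.
oof_parts = [
    cross_val_predict(model, feature_blocks[name], y, cv=cv, method="predict_proba")
    for name, model in base_models.items()
]
meta_features = np.hstack(oof_parts)  # 3 models x 3 class probabilities per row

# The meta-model is trained only on out-of-fold probabilities and the labels.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)

# For inference, refit each base model on all the training data.
for name, model in base_models.items():
    model.fit(feature_blocks[name], y)

# Predicting: stack base-model probabilities in the same order, then ask the meta-model.
# The first five training rows are reused here purely as a shape demonstration.
sample_meta = np.hstack(
    [base_models[name].predict_proba(feature_blocks[name][:5]) for name in base_models]
)
predictions = meta_model.predict(sample_meta)
```

The key design point is that the meta-model's training features are produced by cross_val_predict rather than by base models that have already seen those rows, which is what keeps the stacked probabilities from being over-optimistic.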

Technical Deep Dive

The Specific Problem Statement Tackled: To design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews, with a focus on enforcing policies against advertisements, irrelevant content, and fraudulent rants without a real visit.
Development Tools: Google Colab, Git & GitHub, Google Drive.
APIs: Hugging Face API (for authenticating and downloading pre-trained models).
Libraries & Frameworks: PyTorch, Hugging Face Transformers, Scikit-learn, LightGBM, Pandas, Sentence-Transformers, Joblib, Pillow.
Assets & Datasets: The primary "Google Local Reviews" dataset from Kaggle, augmented with additional online sources for class balance. We also used pre-trained models from Hugging Face, including Salesforce/blip-image-captioning-base.

Challenges we ran into

Data quality: Many of the reviews in our dataset were machine-translated, which introduced subtle errors and inconsistencies. This made it harder to train a reliable model, since the language didn’t always reflect how real users write reviews. We had to carefully clean and expand the dataset to make sure the model wasn’t biased by translation artifacts.
Policy ambiguity: TikTok’s content guidelines are written broadly, but for our project, we needed precise categories (spam, irrelevant, ad-like, or genuine). Translating those high-level rules into concrete, machine-readable policies was a challenge. We spent time aligning on definitions so that both humans and models could apply them consistently.
TF-IDF feature mismatch between training and prediction: The training data and the review CSV file that a user submits for prediction usually differ in size and vocabulary. If the TF-IDF vectorizer is fitted again at prediction time, it learns a different vocabulary and therefore produces a different number of features than it did during training. This causes an error, because a model trained on a numerical matrix expects the same fixed number of input features at prediction time. We solved this by writing two separate feature engineering functions. The training function fits the TF-IDF vectorizer on the training data and saves it as a .pkl file in a Google Drive folder. The prediction function loads the saved vectorizer and only transforms the new data, without refitting. Both functions therefore return the same number of features. This keeps the pipeline stable, but in a real-world deployment the vectorizer and the model would need to be retrained regularly so that the vocabulary stays up to date and accuracy stays high.
Explainability: A black-box “accept or reject” output wouldn’t be helpful for merchants or platform moderators. They need to know why a review was flagged. To address this, we designed outputs that are interpretable and provide human-readable reasons, so the system can support transparency and trust rather than hide behind complex algorithms.
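The TF-IDF fix described in the challenges above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names fit_and_save_vectorizer and load_and_transform, the file name, and the example texts are hypothetical, and a temporary directory stands in for the Google Drive folder.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_and_save_vectorizer(train_texts, path):
    """Training side: learn the vocabulary, persist the fitted vectorizer."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)
    joblib.dump(vectorizer, path)
    return train_matrix


def load_and_transform(new_texts, path):
    """Prediction side: reuse the saved vocabulary, never refit."""
    vectorizer = joblib.load(path)
    return vectorizer.transform(new_texts)


# Hypothetical example texts.
train_texts = [
    "great food and friendly staff",
    "visit my website for cheap phones",
    "the staff were rude",
]
new_texts = ["the food was amazing"]

# A temporary directory stands in for the Google Drive folder (assumption).
path = os.path.join(tempfile.mkdtemp(), "tfidf_vectorizer.pkl")

train_matrix = fit_and_save_vectorizer(train_texts, path)
new_matrix = load_and_transform(new_texts, path)

# Both matrices have one column per word learnt during training.
print(train_matrix.shape[1] == new_matrix.shape[1])
```

Words that appear only in the new text are ignored by transform, which is why the column count stays fixed and also why periodic refitting is needed to pick up new vocabulary.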

Accomplishments that we're proud of

Designed a scalable teacher-student pipeline that balances cost and accuracy.
Delivered an interactive demo in under 72 hours.
Improved dataset balance by augmenting irrelevant review types.
Created a system that could realistically scale within TikTok’s ecosystem.

What we learned

The importance of high-quality labels for downstream models.
Trade-offs between precision vs. recall when filtering user-generated content.
Building explainability into ML systems increases trust and adoption.
We must always apply the exact same preprocessing steps to our new, unseen data as we did to our training data.
We must save our feature engineering objects, like the TF-IDF vectorizer, after they are fitted on the training data. This allows us to load and use the same exact object to .transform() new data, preventing errors related to feature mismatches.
We found a practical way to generate pseudo-labels by combining rule-based logic with ML.
We learnt to build a stacking ensemble that pairs each specialised base model with the feature set it learns from best, to maximise performance.
We used out-of-fold (OOF) predictions to prevent data leakage.
We designed distinct functions for each stage of the pipeline, which makes the code reusable, maintainable, and easy to debug.
We incorporated error handling, such as try … except blocks in train_and_save_baseline_model, to handle cases where the data distribution might prevent stratification.
We connected all the pieces into one cohesive function that takes raw data, processes it, and returns useful output that is ready for deployment.
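The stratification fallback mentioned above can be sketched as follows. This is a minimal illustration, not the body of train_and_save_baseline_model: the helper name split_with_stratify_fallback, its parameters, and the toy labels are hypothetical. It relies on the fact that scikit-learn's train_test_split raises a ValueError when a class has too few members to stratify.

```python
import warnings

from sklearn.model_selection import train_test_split


def split_with_stratify_fallback(features, labels, test_size=0.25, random_state=0):
    """Try a stratified split; warn and fall back to a plain random split."""
    try:
        return train_test_split(
            features, labels,
            test_size=test_size, random_state=random_state, stratify=labels,
        )
    except ValueError as err:
        # Raised, for example, when a class has a single member.
        warnings.warn(f"Stratified split failed ({err}); falling back to a random split.")
        return train_test_split(
            features, labels, test_size=test_size, random_state=random_state
        )


# Toy data (assumption): the "ad" class has one member, so stratification fails.
features = list(range(8))
labels = ["useful"] * 7 + ["ad"]

X_train, X_test, y_train, y_test = split_with_stratify_fallback(features, labels)
```

Catching only ValueError keeps unrelated failures visible, and the warning tells the user that class proportions in the split are no longer guaranteed.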

What’s next for Danny and his Butlers

Our goal is to expand this project into a scalable solution by supporting multiple languages, improving multimodal analysis, and deploying it as an API that can directly help TikTok merchants and users. We also want to build a dashboard with more features that presents our findings more clearly, so that merchants can make better business decisions and consumers can use reviews to find shops they are likely to enjoy.