BERTBERT: Metadata-Enhanced Review Classification
Inspiration
The challenge we tackled was designing an AI system that outperforms current models in detecting the quality and relevancy of location-based reviews. Our inspiration came from the growing difficulty in identifying genuine reviews on platforms such as Google Maps and Yelp. Spam, misleading rants, and advertisements frequently distort the credibility of these platforms, which harms both users and businesses. We wanted to build a system that could accurately classify reviews as relevant, spam, rant, or advertisement, thereby ensuring more reliable insights for all stakeholders.
What it does
Our solution integrates text-based analysis with metadata-driven feature engineering, moving beyond models that rely solely on text. The system classifies reviews into four categories:
- Relevant: Reviews that are related to the location.
- Spam: Reviews lacking sufficient relevant information about the location.
  - Example violation: “Great place!”
- Rant: Complaints or negative feedback not necessarily tied to actual visitors.
  - Example violation: “The menu does not look that great, I would never try this place.”
- Advertisement: Reviews that contain promotional material, links, or mention other establishments.
  - Example violation: “check out @nomninjas on ig, lemon8, tiktok, facebook for more credits when photos or videos used ... The promo is available for dinners from 1 May to 31 Aug 2025 Citibank: 15% off weekday dinner or all-day weekend with $100 spend OCBC: 5% cashback + 50% off Mapo Tofu with $50 spend”
By combining embeddings from text with metadata features such as reviewer behaviour and posting time, the system provides more accurate detection of spam, rants, and advertisements.
How we built it
Data Collection and Cleaning
We collected reviews from Google Reviews (US market), the Yelp Open Dataset, and self-scraped Google Maps reviews in compliance with Google’s Terms of Service. The cleaning process involved removing duplicates, filtering out non-English characters, and discarding truncated reviews that ended with an ellipsis. We then manually labelled the self-scraped reviews using strict classification policies to ensure consistency. In total, the dataset consisted of around 100,000 reviews for training and 4,000 manually labelled reviews for initial testing. Each data point contained the author’s name, rating, text, and classification label.
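The cleaning steps above can be sketched with pandas. The column names and the tiny dataframe are illustrative, not our actual schema:

```python
import pandas as pd

# Toy data mirroring the fields described above (author, rating, text).
df = pd.DataFrame({
    "author": ["a", "a", "b", "c"],
    "rating": [5, 5, 3, 4],
    "text": ["Great food", "Great food", "Nice spot but...", None],
})

df = df.drop_duplicates(subset=["author", "text"])   # remove duplicate reviews
df = df.dropna(subset=["text"])                      # drop reviews with empty text
df = df[~df["text"].str.endswith(("...", "…"))]      # discard truncated reviews
df = df[df["text"].str.match(r"^[\x00-\x7F]+$")]     # keep ASCII-only (English) text
print(len(df))  # → 1 (only "Great food" survives)
```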
Model Development
We used supervised learning on our labelled data and tested several approaches. Baseline experiments were carried out with mainstream Large Language Models (LLMs) such as Qwen3 8B and Gemma 3 12B, and with pre-trained transformer models such as BERT (Bidirectional Encoder Representations from Transformers) and its variants DistilBERT and RoBERTa. We then experimented with ensembles of pre-trained models and settled on a final solution built from four components:
- A "text tower", based on a pre-trained BERT variant, bert-base-uncased, which converts the "review text" and "response text" features into a holistic 768-dimensional vector capturing semantic meaning and context.
- A "user tower", which uses PyTorch's embedding module to learn a unique vector representation for each user.
- A "business tower", a sequential neural network that consolidates all business-related features.
- A cross-attention layer, in which the "text tower" output queries the "user tower" and "business tower" outputs. This allows the model to weigh the importance of user behaviour and business context based on the specific language used in a review.

The fused output is then passed to a classifier for the final prediction. To prevent overfitting, we applied techniques such as dropout, regularisation, normalisation, and early stopping. Iterating on larger datasets and more advanced models helped to improve generalisation.
Feature Engineering
Unlike most models that rely only on text, we engineered metadata-based features to provide additional context. These included the time of posting for each review, the time gaps between reviews by the same user, the reviewer’s history such as frequency, consistency, or anomalies, and the time difference between a review and the business’s reply.
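These metadata features can be derived with a few pandas group-by operations. The column names below are assumptions for illustration:

```python
import pandas as pd

# Illustrative review metadata; column names are hypothetical.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "posted_at": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 09:05",
                                 "2024-01-02 20:00"]),
    "replied_at": pd.to_datetime(["2024-01-03 10:00", None, "2024-01-02 21:00"]),
})

df = df.sort_values(["user_id", "posted_at"])
df["post_hour"] = df["posted_at"].dt.hour                       # time of posting
df["gap_s"] = (df.groupby("user_id")["posted_at"]               # gap between a
               .diff().dt.total_seconds())                      # user's reviews
df["user_review_count"] = (df.groupby("user_id")["posted_at"]   # reviewer frequency
                           .transform("count"))
df["reply_delay_h"] = ((df["replied_at"] - df["posted_at"])     # review-to-reply
                       .dt.total_seconds() / 3600)              # time difference
```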
Evaluation and Iteration
We benchmarked BERTBERT against OpenAI ChatGPT and other baseline models. The inclusion of metadata provided measurable improvements in detecting spam, advertisements, and rants. Evaluation metrics included precision, recall, F1-score, accuracy, and Cohen's Kappa, which together gave a more reliable picture of model performance.
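All of these metrics are available in scikit-learn; a minimal sketch with toy labels (the label ordering is an assumption):

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

# Toy labels over the four classes (0=relevant, 1=spam, 2=rant, 3=advertisement).
y_true = [0, 0, 1, 2, 3, 0]
y_pred = [0, 0, 1, 2, 0, 0]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)  # agreement corrected for chance
print(f"acc={acc:.3f} kappa={kappa:.3f} macro_f1={f1:.3f}")
```

Cohen's Kappa matters here because with a heavily imbalanced label distribution, plain accuracy can look high even when minority classes are missed.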
Challenges we ran into
One of the challenges we faced was data quality. Initially, we followed the instructions given and tried labelling the provided dataset with GPT-4o. However, we realised that the LLM's output was unreliable for labelling. As such, we fine-tuned our own labelling model, based on DebertaV2ForSequenceClassification, on a small manually labelled dataset, and then used it to label a much larger dataset of about 1.5 million instances.
However, while training our "labelling model", we ran into several problems caused by the heavily skewed class distribution. In our initial manually labelled dataset of about 1,500 instances, fewer than 50 were labelled "spam", "advertisement", or "rant" combined. This left the model unable to generalise to these classes. We tried a few solutions:
SMOTE (Synthetic Minority Oversampling Technique)
- We considered this, but were concerned that synthetic data would introduce too much noise. Our experimentation confirmed this: F1 scores were consistently low with this method.
Class weights inversely proportional to class frequency
- We use this in some of our models, but tuning the weights was itself a challenge, and we spent a large amount of time finding appropriate values.
Supplementing the dataset with more instances of “spam”, “advertisement” and “rant”
- This was what we ultimately turned to, as we felt the Google reviews dataset had too few “advertisement”, “spam”, and “rant” instances (which makes sense, given that Google likely already has a filtering algorithm in place). As such, we looked to alternative datasets, landing on Yelp reviews to supplement our data.
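The class-weighting approach from the second option above can be sketched with PyTorch's weighted cross-entropy. The class counts here are hypothetical, chosen only to mimic our imbalance:

```python
import torch
import torch.nn as nn

# Hypothetical class counts: "relevant" dwarfs spam/rant/advertisement.
counts = torch.tensor([1400., 20., 15., 15.])    # relevant, spam, rant, ad
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
loss_fn = nn.CrossEntropyLoss(weight=weights)    # misclassifying rare classes costs more

logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = loss_fn(logits, labels)
```

In practice we still had to tune these weights by hand, since pure inverse frequency over-penalised the majority class.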
Data quality was also an issue: many reviews ended with an ellipsis, which meant the content was truncated, and others had empty fields that we had to handle carefully during preprocessing. Pandas proved particularly helpful here, letting us clean the data quickly and carefully.
Computational resources were also a huge challenge. As none of our team had access to GPUs beyond the free runtime on Google Colab, we were unable to train our models on very large datasets. We had prepared and cleaned a dataset of 1.5 million instances with our labelling model, but due to lack of time and GPU access, we had to settle for a much smaller dataset of about 50,000 instances to train our final model. We also used dimensionality-reduction methods such as Principal Component Analysis (PCA) to decrease the computational load, even though we would lose some signal. We judged this worthwhile: it increased the rate at which we could test new models, while the signal lost was relatively little.
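A minimal sketch of the PCA step with scikit-learn, using random data as a stand-in for our 768-dimensional text embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # stand-in for 768-dim text embeddings

pca = PCA(n_components=0.95)       # keep components explaining 95% of variance
X_red = pca.fit_transform(X)
print(X_red.shape, pca.explained_variance_ratio_.sum())
```

Passing a float to `n_components` lets scikit-learn pick the smallest number of components whose cumulative explained variance reaches that threshold, which is how we traded a small amount of signal for speed.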
Finally, the metadata sometimes introduced noise, as some behavioural signals produced spurious correlations. Normalisation and careful validation were required to reduce such effects. We approached the metadata hypothesis-first: we came up with theories on what we thought would create a signal, then tested them. We deliberately avoided a data-mining approach of scanning for correlations between features, as such relationships have a higher chance of being coincidental and idiosyncratic to this dataset.
Accomplishments
We are proud of several key accomplishments.
- First, we trained a labelling model that beat OpenAI’s GPT-4o at labelling our data.
- Second, we worked together as a team to engineer solutions to the many problems we faced, especially in maintaining data quality.
- Third, we navigated the lack of computational resources and still built a model with respectable metrics.
- Lastly, we learned about and implemented machine learning concepts that most of us had never used before.
| Epoch | Train Loss | Val Loss | Kappa | Accuracy | Notes |
|---|---|---|---|---|---|
| 1 | 0.0071 | 0.0051 | 0.719 | 0.918 | Saved (best) |
| 2 | 0.0047 | 0.0044 | 0.677 | 0.899 | Saved (best) |
| 3 | 0.0038 | 0.0042 | 0.739 | 0.924 | Saved (best) |
| 4 | 0.0032 | 0.0046 | 0.678 | 0.898 | No improve |
| 5 | 0.0026 | 0.0049 | 0.742 | 0.924 | No improve |
| 6 | 0.0022 | 0.0055 | 0.781 | 0.939 | Early stop |
Best Performance
Validation Accuracy: 93.9%
Cohen's Kappa: 0.781
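The checkpoint-and-early-stop pattern visible in the table (save whenever validation loss improves, stop after a patience of three non-improving epochs) can be sketched as follows; replaying the table's validation losses reproduces its "Notes" column:

```python
best_val, patience, bad_epochs = float("inf"), 3, 0

# Validation losses from the training table above.
for epoch, val_loss in enumerate(
        [0.0051, 0.0044, 0.0042, 0.0046, 0.0049, 0.0055], start=1):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        print(f"epoch {epoch}: saved (best)")   # checkpoint the model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"epoch {epoch}: early stop")
            break
        print(f"epoch {epoch}: no improve")
```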
What we learned
From this project, we learned that real-life machine learning applications extend far beyond experimenting with models. In real applications it is very hard to find the kind of clean, readily labelled dataset seen in Kaggle competitions. We realised that roughly 80% of the problems are data-related, and that good-quality data collection is what makes a good model.
What’s next for BERTBERT
Looking forward, we first plan to solve our lack of GPUs and computational power. This would allow us to realistically scale training up to the full 1.5 million data instances we originally prepared. We would also aim to build a larger manually labelled dataset, either by hiring paid annotators or by engaging a data-labelling vendor such as Scale AI, so that our labelling model could produce better pseudo-labels for training the final model. With better data we could fine-tune our models further and experiment with different architectures. We also want to integrate explainability tools such as SHAP or LIME so that users and businesses can understand why a review was flagged. Another goal is to build a real-time API that businesses can use to filter reviews as they are posted. In addition, we aim to extend the model to other domains such as Amazon or TripAdvisor reviews. Finally, with experimentation, we will continue to refine the optimisation process, particularly loss functions, to further improve handling of class imbalance.
Built With
- google-colab
- google-maps
- hugging-face-transformers
- pandas
- python
- pytorch
- scikit-learn