Inspiration

Online reviews shape where we eat, shop, and travel, but they're messy: great places get buried under spam, ads, and rants from people who likely never visited. So we set out to build an ML system that scores review quality and relevance.

What it does

The system ingests Google location reviews, classifies each into one of five clear policy buckets (relevant, spam, ad, irrelevant, or rant/no-visit), and outputs the label. This lets platforms surface trustworthy reviews and build cleaner review feeds.
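For reference, a minimal sketch of how that label schema might map to class ids for the classifier; the integer assignment here is our illustration, not a fixed part of the design:

```python
# The five policy buckets used by the classifier.
# Note: the specific integer ids below are illustrative; any stable
# assignment works as long as training and inference agree on it.
LABELS = ["relevant", "spam", "ad", "irrelevant", "rant_no_visit"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}
```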

How we built it

We fine-tuned a pretrained Transformer (DistilBERT) for multi-class sequence classification. Reviews are loaded from a scraper we built ourselves using the Apify data-scraping client. To acquire a labelled dataset, we applied prompt engineering with Google Gemini to assist in the labelling process, matching each row to a specific class. After a stratified train/validation split, texts are tokenized and the model is fine-tuned with Hugging Face's Trainer. We report accuracy, precision, recall, and F1 score, then save the model and tokenizer for batch scoring.
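To make the scraping step concrete, here is a minimal sketch of pulling reviews through the Apify Python client; the actor id, run_input fields, and item schema shown are assumptions for illustration, not our exact scraper configuration:

```python
from apify_client import ApifyClient

client = ApifyClient("APIFY_API_TOKEN")  # replace with a real token

# Run a Google place-reviews actor and wait for it to finish.
# The actor id and run_input fields below are illustrative.
run = client.actor("compass/crawler-google-places").call(
    run_input={
        "searchStringsArray": ["coffee shops in Singapore"],
        "maxReviews": 100,
    }
)

# Stream scraped items out of the actor's default dataset.
# The exact item shape depends on the actor you run.
reviews = [
    item.get("text", "")
    for item in client.dataset(run["defaultDatasetId"]).iterate_items()
]
```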
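The Gemini-assisted labelling loop looked roughly like the sketch below; the prompt wording and model name are illustrative, and the key idea is constraining the reply to exactly one class name:

```python
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # replace with a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative

CLASSES = {"relevant", "spam", "ad", "irrelevant", "rant_no_visit"}
PROMPT = (
    "Classify the Google review below into exactly one of: "
    "relevant, spam, ad, irrelevant, rant_no_visit. "
    "Reply with the class name only.\n\nReview: {review}"
)

def label_review(text: str) -> str | None:
    """Ask Gemini for one class name; drop rows where the reply is off-format."""
    reply = model.generate_content(PROMPT.format(review=text)).text.strip().lower()
    return reply if reply in CLASSES else None
```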
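And a condensed sketch of the split, tokenization, and fine-tuning itself; the checkpoint, the hyperparameter values, and the load_labelled_reviews helper are stand-ins, not our exact settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Hypothetical loader returning review texts and integer class ids (0-4).
texts, labels = load_labelled_reviews()

# Stratified split so every class keeps its proportion in both splits.
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train_ds = Dataset.from_dict({"text": train_x, "label": train_y}).map(tokenize, batched=True)
val_ds = Dataset.from_dict({"text": val_x, "label": val_y}).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5)

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    y_pred = np.argmax(logits, axis=-1)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": prec, "recall": rec, "f1": f1}

args = TrainingArguments(
    output_dir="review-filter",
    num_train_epochs=3,              # illustrative values: these three
    per_device_train_batch_size=16,  # knobs interacted heavily on our
    learning_rate=2e-5,              # limited hardware
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())

# Persist both pieces for batch scoring later.
trainer.save_model("review-filter/final")
tokenizer.save_pretrained("review-filter/final")
```

Batch scoring then simply reloads the saved directory, for example with transformers' text-classification pipeline.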

Challenges we ran into

Our team had limited NLP experience, which meant a steep ramp-up on adapting a pretrained Transformer to multi-class labels. We also had limited computational resources, and batch size, learning rate, and epoch count interacted in non-obvious ways, so we had to iterate carefully to avoid over- or underfitting.

Accomplishments that we're proud of

We delivered an end-to-end pipeline that goes from raw reviews to a trained, evaluated, and saved model in a reproducible way. The label schema is simple enough for moderators and product managers to reason about, yet expressive enough to capture the most common failure modes in review quality.

What we learned

Clear label definitions are as important as model choice. By tightening those definitions, we reduced confusion and improved metrics more than expected. We also realized the value of simple preprocessing steps—like combining category and description—which gave the model better context. Lastly, experimenting with batch size, learning rate, and epochs taught us how sensitive NLP models can be, and how small adjustments can make a big difference.
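As a concrete example of that preprocessing step, a small sketch of joining category and description into one model input (the field names are assumptions about our scraped schema):

```python
def build_input(row: dict) -> str:
    """Prefix the review description with the place category so the model
    gets context; e.g. a laptop ad posted under 'Restaurant' reads as an
    ad once the category is attached. Field names are illustrative."""
    category = (row.get("category") or "").strip()
    description = (row.get("description") or "").strip()
    return f"{category}. {description}" if category else description
```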

What's next for Review Filtering AI

While the model performs relatively well overall, it sometimes misclassifies legitimate reviews as spam, rant, or ad, typically ones that are short, direct, and relevant. We aim to handle these edge cases better by refining the training data, improving the label definitions, and introducing additional context features.

Another issue is language bias: our training dataset is entirely in English, so the model has only learned English-language patterns and misclassifies, or fails to classify, reviews in languages it has never seen. To address this, we plan to train on a multilingual dataset so the model can learn language-agnostic patterns, making the system more robust and adaptable to a global audience.
