Inspiration
With the ever-increasing number of bots, spammers, and trolls, Google reviews are becoming less and less reliable. While many reviews are genuine, others are advertisements, irrelevant content, or rants that add no value and do not help people gauge how good a place is. Left as is, useless reviews will keep piling up, making it harder for users to find helpful ones. This erodes users' trust in Google reviews as a platform where they can share and read other people's experiences of different places.
What it does
Our solution is a fine-tuned RoBERTa model that reads a Google review: it takes the text, location, and type of the place (e.g. restaurant), then classifies the review as either 0) advertisement, 1) irrelevant, 2) rant, or 3) valid.
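To make the input/output concrete, here is a minimal sketch of how a review plus its metadata could be packed into one sequence for the classifier, and how the predicted class id maps back to a label. The separator format and helper names are our own illustration, not a fixed spec:

```python
# Class ids match the order used by our classifier head.
LABELS = ["advertisement", "irrelevant", "rant", "valid"]

def build_model_input(text: str, location: str, place_type: str) -> str:
    """Combine the review text with its metadata into a single sequence
    that a fine-tuned RoBERTa classifier can consume."""
    return f"[{place_type}] [{location}] {text}"

def decode_prediction(class_id: int) -> str:
    """Map the argmax class id (0-3) back to a human-readable label."""
    return LABELS[class_id]
```

For example, `build_model_input("Great food", "NYC", "restaurant")` yields `"[restaurant] [NYC] Great food"`, which the model would then score over the four classes.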
How we built it
We used VSCode as our IDE and Jupyter Notebook as the training ground for the model, since notebooks make it easy to run individual chunks of Python interactively. We also used the OpenAI and Gemini APIs to generate ground-truth labels for the public Google review dataset, as well as synthetic data to create more training examples for the rare classes.
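Labelling with an LLM mostly comes down to a careful prompt plus a strict parser for the reply. A hedged sketch of that pattern (the prompt wording and both function names are illustrative, not our exact calls):

```python
import re

# Same class ids as the classifier uses.
CLASSES = {0: "advertisement", 1: "irrelevant", 2: "rant", 3: "valid"}

def build_label_prompt(review_text: str, place_type: str) -> str:
    """Ask the model to answer with a single digit so parsing stays trivial."""
    options = ", ".join(f"{i}) {name}" for i, name in CLASSES.items())
    return (
        f"Classify this Google review of a {place_type} as one of: {options}.\n"
        f"Review: {review_text}\n"
        "Answer with the digit only."
    )

def parse_label(reply: str):
    """Extract the first digit 0-3 from the model's reply; None if absent."""
    match = re.search(r"[0-3]", reply)
    return int(match.group()) if match else None
```

Replies that fail to parse (or, as we describe below, that two models disagree on) are the ones worth routing to manual review.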
Challenges we ran into -- and how we overcame them.
We spent our first 1.5 days figuring out how to improve our training datasets -- a very real-world struggle! Our data wasn't labelled, so to treat this as a supervised learning problem we first had to label it with the help of ChatGPT and Gemini through their APIs. We then realised those labels weren't foolproof, so we resorted to other means: manually sifting through results, bootstrapping, creating synthetic data, and so on. Eventually we got to about 37k rows of decently high-quality data -- that's something we're proud of.
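One cheap trick in the spirit of that manual sifting: since we had two labellers, keep only the rows where they agree and send the rest to a human. A sketch with made-up row data (our real pipeline was messier):

```python
def keep_agreed(rows):
    """Split rows into (agreed, disputed). Each row is a tuple of
    (review_text, gpt_label, gemini_label); labels are class ids 0-3."""
    agreed, disputed = [], []
    for text, gpt_label, gemini_label in rows:
        if gpt_label == gemini_label:
            agreed.append((text, gpt_label))
        else:
            disputed.append((text, gpt_label, gemini_label))
    return agreed, disputed

rows = [
    ("Buy now at 50% off!!!", 0, 0),  # both labellers say advertisement
    ("Service was slow.", 3, 2),      # valid vs rant: needs a human look
]
agreed, disputed = keep_agreed(rows)
```

Agreement filtering trades dataset size for label quality, which mattered more for our rare classes.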
The other challenge was time. We were pressed for it -- especially near the end, while waiting for ML models to finish training.
What we learned
Complex is not always better. Sometimes a simple, well-tuned logistic regression/softmax regression model can outperform state-of-the-art models and resist overfitting better.
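For intuition, here is a minimal sketch of that baseline: multinomial logistic (softmax) regression over embedding vectors, using scikit-learn. The synthetic clusters below stand in for real pre-trained sentence embeddings, which we cannot reproduce here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for pre-trained sentence embeddings: four well-separated
# Gaussian clusters, one per class (advertisement/irrelevant/rant/valid).
dim, per_class = 16, 50
centers = rng.normal(scale=10.0, size=(4, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(4), per_class)

# Plain softmax regression: scikit-learn's LogisticRegression fits a
# multinomial model when there are more than two classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)
```

On embeddings this good, the linear model has few parameters to overfit with, which is exactly why it held up well against fancier architectures in our experiments.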
But beyond this, we also realised that much of the heavy lifting had been done for us: the embeddings we used (all our models used embeddings in one way or another) were pre-trained by others and distributed through platforms like Hugging Face. It is to them that we (partially) credit the success of our models.
What's next for lunchlylitter
We will continue diving into the world of ML and NLP. Huzzah.
Built With
- gemini
- openai
- python
- roberta
- scikit-learn
- sentence-transformer
- tensorflow