Inspiration
With the ever-increasing number of bots, spammers, and trolls, Google reviews are becoming less and less reliable. While many reviews are genuine, others are advertisements, irrelevant content, or rants that add no value and do not help people gauge how good a place is. Left as is, useless reviews will keep piling up, making it harder for users to find helpful ones. This erodes users' trust in Google reviews as a platform where they can share and read other people's experiences of different places.
What it does
Our solution is a fine-tuned RoBERTa model that reads a Google review: it takes the text, location, and type of the place (e.g. restaurant), then classifies the review as either 0) advertisement, 1) irrelevant, 2) rant, or 3) valid.
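To make the input/output concrete, here is a minimal sketch of how a review plus its metadata could be packed into one sequence for the classifier, and how the predicted class id maps back to a label. The separator format and helper names are our own illustration, not a fixed spec:

```python
# Class ids match the order used by our classifier head.
LABELS = ["advertisement", "irrelevant", "rant", "valid"]

def build_model_input(text: str, location: str, place_type: str) -> str:
    """Combine the review text with its metadata into a single sequence
    that a fine-tuned RoBERTa classifier can consume."""
    return f"[{place_type}] [{location}] {text}"

def decode_prediction(class_id: int) -> str:
    """Map the argmax class id (0-3) back to a human-readable label."""
    return LABELS[class_id]
```

For example, `build_model_input("Great food", "NYC", "restaurant")` yields `"[restaurant] [NYC] Great food"`, which the model would then score over the four classes.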
How we built it
We used VSCode as our IDE and Jupyter Notebook as the training ground for the model, since notebooks make it easy to run individual chunks of Python interactively. We also used the OpenAI and Gemini APIs to generate ground-truth labels for the public Google review dataset, as well as synthetic data to create more training examples for the rare classes.
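Labelling with an LLM mostly comes down to a careful prompt plus a strict parser for the reply. A hedged sketch of that pattern (the prompt wording and both function names are illustrative, not our exact calls):

```python
import re

# Same class ids as the classifier uses.
CLASSES = {0: "advertisement", 1: "irrelevant", 2: "rant", 3: "valid"}

def build_label_prompt(review_text: str, place_type: str) -> str:
    """Ask the model to answer with a single digit so parsing stays trivial."""
    options = ", ".join(f"{i}) {name}" for i, name in CLASSES.items())
    return (
        f"Classify this Google review of a {place_type} as one of: {options}.\n"
        f"Review: {review_text}\n"
        "Answer with the digit only."
    )

def parse_label(reply: str):
    """Extract the first digit 0-3 from the model's reply; None if absent."""
    match = re.search(r"[0-3]", reply)
    return int(match.group()) if match else None
```

Replies that fail to parse (or, as we describe below, that two models disagree on) are the ones worth routing to manual review.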
Challenges we ran into -- and how we overcame them.
We spent our first 1.5 days figuring out how to improve our training datasets -- a very real-world struggle! Our data wasn't labelled, so to treat this as a supervised learning problem we first had to label it with the help of ChatGPT and Gemini through their APIs. We then realised those labels weren't foolproof, so we resorted to other means: manually sifting through results, bootstrapping, creating synthetic data, and so on. Eventually we got to about 37k rows of decently high-quality data -- that's something we're proud of.
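One cheap trick in the spirit of that manual sifting: since we had two labellers, keep only the rows where they agree and send the rest to a human. A sketch with made-up row data (our real pipeline was messier):

```python
def keep_agreed(rows):
    """Split rows into (agreed, disputed). Each row is a tuple of
    (review_text, gpt_label, gemini_label); labels are class ids 0-3."""
    agreed, disputed = [], []
    for text, gpt_label, gemini_label in rows:
        if gpt_label == gemini_label:
            agreed.append((text, gpt_label))
        else:
            disputed.append((text, gpt_label, gemini_label))
    return agreed, disputed

rows = [
    ("Buy now at 50% off!!!", 0, 0),  # both labellers say advertisement
    ("Service was slow.", 3, 2),      # valid vs rant: needs a human look
]
agreed, disputed = keep_agreed(rows)
```

Agreement filtering trades dataset size for label quality, which mattered more for our rare classes.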
The other challenge was time. We were pressed for it -- especially near the end, while waiting for ML models to finish training.
What we learned
Complex is not always better. Sometimes a simple, well-tuned logistic regression/softmax regression model can outperform state-of-the-art models and resist overfitting better.
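For intuition, here is a minimal sketch of that baseline: multinomial logistic (softmax) regression over embedding vectors, using scikit-learn. The synthetic clusters below stand in for real pre-trained sentence embeddings, which we cannot reproduce here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for pre-trained sentence embeddings: four well-separated
# Gaussian clusters, one per class (advertisement/irrelevant/rant/valid).
dim, per_class = 16, 50
centers = rng.normal(scale=10.0, size=(4, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(4), per_class)

# Plain softmax regression: scikit-learn's LogisticRegression fits a
# multinomial model when there are more than two classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)
```

On embeddings this good, the linear model has few parameters to overfit with, which is exactly why it held up well against fancier architectures in our experiments.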
But beyond this, we also realised that much of the heavy lifting had been done for us: the embeddings we used (all our models used embeddings in one way or another) were pre-trained by others and distributed through platforms like Hugging Face. It is to them that we (partially) credit the success of our models.
What's next for lunchlylitter
We will continue diving into the world of ML and NLP. Huzzah.
Built With
- gemini
- openai
- python
- roberta
- scikit-learn
- sentence-transformer
- tensorflow