Inspiration
Google reviews are critical for businesses and consumers alike, but they're often flooded with irrelevant posts and spam advertisements. This makes it hard for consumers to trust the reviews, which in turn can hurt businesses. I wanted to build a model that attempts to separate genuine reviews from unhelpful ones to give consumers greater insight into a location.
What it does
ReviewRadar classifies reviews into 4 categories:
- Normal - genuine reviews
- Advertisement - promotional or spammy reviews
- Irrelevant - off-topic reviews
- Non-visitor - reviews from people who did not visit the location
How we built it
- Collected and labelled review data (with a synthetic generator)
- Extracted features from reviews using TF-IDF and custom features such as review length and link detection
- Trained an XGBoost Classifier
- Validated performance on test data and new unseen reviews
- Created a review generator to generate new reviews efficiently
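The custom features mentioned above (review length and link detection) could look roughly like the sketch below. This is an illustration only, not the project's actual code: the function name `custom_features`, the feature names, and the URL pattern are all assumptions. In a full pipeline, values like these would sit alongside the TF-IDF vectors as input to the classifier.

```python
import re

# Hypothetical pattern (an assumption, not taken from the project):
# matches http(s) URLs and bare "www." links.
LINK_PATTERN = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)


def custom_features(review: str) -> dict:
    """Return simple hand-crafted features for a single review."""
    words = review.split()
    return {
        "char_length": len(review),   # review length in characters
        "word_count": len(words),     # review length in whitespace-separated words
        "has_link": int(bool(LINK_PATTERN.search(review))),  # 1 if a link is present
    }


if __name__ == "__main__":
    print(custom_features("Great deals at www.example.com today"))
```

A regex check like this is cheap and easy to inspect, but it will miss obfuscated links (for example "example dot com"), so it works best as one signal among several.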
Challenges we ran into
Creating synthetic data: I could not find a dataset with a substantial number of reviews that violate review policies, so I had to use advanced models such as GPT to create synthetic data. Initially, my prompts were not engineered well enough to produce varied data, which caused the XGBoost model to overfit and perform poorly when I tested it on external data.
Large datasets: The datasets were large (big JSON files), which made them hard to handle and to upload to GitHub.
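One way to catch low-variety synthetic data before training is to measure how repetitive the generated reviews are. The sketch below is an assumption about how such a check could be done, not something the project is known to include. It computes an exact-duplicate rate and a distinct-n ratio (unique n-grams divided by total n-grams). The function names are hypothetical.

```python
from collections import Counter


def distinct_n(reviews: list[str], n: int = 2) -> float:
    """Fraction of word n-grams across all reviews that are unique."""
    counts: Counter = Counter()
    for review in reviews:
        tokens = review.lower().split()
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def duplicate_rate(reviews: list[str]) -> float:
    """Fraction of reviews that repeat another review after normalization."""
    normalized = [" ".join(r.lower().split()) for r in reviews]
    if not normalized:
        return 0.0
    return 1 - len(set(normalized)) / len(normalized)


if __name__ == "__main__":
    sample = ["great food here", "Great food  here", "terrible service today"]
    print(distinct_n(sample), duplicate_rate(sample))
```

A low distinct-n ratio or a high duplicate rate suggests the prompts need more variation (different tones, lengths, and topics) before the data is used for training.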
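GitHub rejects individual files larger than 100 MB in a regular push, so large JSON files usually need to be compressed, split, or stored with Git LFS. The sketch below shows one possible approach, which is not necessarily what this project did: store reviews as gzip-compressed JSON Lines and read them back one record at a time. The function names and the one-record-per-line layout are assumptions.

```python
import gzip
import json
from typing import Iterable, Iterator


def write_jsonl_gz(records: Iterable[dict], path: str) -> None:
    """Write records as gzip-compressed JSON Lines (one JSON object per line)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Lazily yield records so the whole file never has to fit in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because `read_jsonl_gz` is a generator, records can be processed in a loop or in batches without loading the full dataset at once.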
Accomplishments that we're proud of
- Built a working multi-class classification pipeline that flags ads with links
- Achieved rather high performance on test data
- Created a data generator script so that the whole workflow can be reproduced easily
What we learned
Even a complex or advanced model will produce poor results if the data used to fit it is poor.
What's next for ReviewRadar
- Integrate modern embeddings (e.g. BERT) for better generalization
- Deploy as an API
- Expand categories (e.g. duplicate reviews)
- Create a dashboard to show review insights and trends
Built With
- csv
- json
- jupyter-notebook
- matplotlib
- numpy
- pandas
- python
- scikit-learn
- seaborn
- vscode
- xgboost