Problem Statement

Filtering the Noise: ML for Trustworthy Location Reviews

Inspiration

Many Google Reviews entries are spam or irrelevant, which makes it harder for users to quickly survey a place. A reliable detector that shows how trustworthy each review is would improve the user experience. It could also make it easier and more convenient for business owners to respond to reviews, improving their overall business performance.

What it does

The algorithm classifies each review as valid or invalid, where invalid means spam, irrelevant, or misleading.

How we built it

We trained the model on the public Google Local Review data (https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/), which is distributed as 51 state-level review zip files plus 1 "other" reviews zip file. From each state file, we sampled 1000 random rows that had a non-empty text review and joined them with the metadata of the location being reviewed. We then combined the per-state samples into a single CSV file. Since the data has no validity label, we labelled it with the DeepSeek LLM using tailored prompts and rules, and saved the result as labelled_data.csv.
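The sampling and combining step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not our exact script: the field names ("text", "rating", "gmap_id", "name"), the output columns, and both function names are assumptions made for the example.

```python
import csv
import random


def sample_text_reviews(reviews, meta_by_id, state, n=1000, seed=0):
    """Pick up to n random reviews with non-empty text and attach the place name.

    `reviews` is a list of dicts and `meta_by_id` maps a place id to its
    metadata dict. All field names here are assumptions for illustration.
    """
    # Keep only rows whose text review is actually filled in.
    with_text = [r for r in reviews if (r.get("text") or "").strip()]
    rng = random.Random(seed)
    picked = rng.sample(with_text, min(n, len(with_text)))

    rows = []
    for r in picked:
        # Join the review with the metadata of the place it refers to.
        place = meta_by_id.get(r.get("gmap_id"), {})
        rows.append({
            "state": state,
            "place_name": place.get("name", ""),
            "rating": r.get("rating"),
            "text": r["text"].strip(),
        })
    return rows


def write_combined_csv(rows_per_state, path):
    """Write the per-state samples into one CSV file."""
    fieldnames = ["state", "place_name", "rating", "text"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for rows in rows_per_state:
            writer.writerows(rows)
```

Fixing the random seed keeps the sample reproducible, which helps when the same rows need to be re-labelled or re-inspected later.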

With this labelled data, we performed Exploratory Data Analysis (EDA), observing the distribution of the features that contribute to the labelling. We then proceeded with feature engineering and building our pipeline, which consists of a rule-based system and the trained model. The rule-based system filters some reviews directly, which speeds up classification before and during processing in the model pipeline. The models we trained include BERT with structure and LightGBM. After comparing the relevant metrics, we found that the LightGBM model classifies labels more accurately and has better metrics overall.
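The idea of placing a rule-based filter in front of the model can be sketched as follows. The specific rules (empty text, links, long runs of a repeated character), the 0.5 threshold, and the `model_predict` callable are hypothetical stand-ins for illustration; they are not the rules or the model interface we actually used.

```python
import re

# Hypothetical rules, chosen only to illustrate the pattern.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
REPEAT_RE = re.compile(r"(.)\1{5,}")  # same character six or more times in a row


def rule_based_label(text):
    """Return "invalid" when a rule fires, otherwise None (no decision)."""
    stripped = text.strip()
    if not stripped:
        return "invalid"
    if URL_RE.search(stripped):
        return "invalid"
    if REPEAT_RE.search(stripped):
        return "invalid"
    return None


def classify(text, model_predict, threshold=0.5):
    """Apply the rules first and call the model only when no rule fires.

    `model_predict` is any callable that maps text to the probability that
    the review is valid (for example a wrapper around a trained classifier).
    Returns a (label, probability_of_valid) pair.
    """
    label = rule_based_label(text)
    if label is not None:
        # A rule decided the case, so the model call is skipped entirely.
        return label, 0.0
    p_valid = model_predict(text)
    return ("valid" if p_valid >= threshold else "invalid"), p_valid
```

Because obvious cases are resolved by cheap string checks, the more expensive model only runs on reviews that the rules cannot decide.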

To deploy the product as a browser extension, we used FastAPI for the backend and HTML, JavaScript, and CSS for the front end. We chose a browser extension because it is more convenient for daily use than a standalone application or website. The extension marks each review with a translucent block that shows its validity as a percentage, along with a color indicator ranging from green for valid reviews to red for invalid reviews. This helps users judge the usefulness and trustworthiness of the reviews. The main development tool used in our system is VSCode.
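The green-to-red indicator described above can be expressed as a simple mapping from a validity probability to a translucent color. The sketch below is in Python for consistency with the backend; the linear interpolation, the alpha value, and the function names are assumptions for illustration, and the extension itself may compute the color differently.

```python
def validity_color(p_valid, alpha=0.35):
    """Map a validity probability in [0, 1] to a translucent CSS rgba() color.

    0.0 maps to pure red, 1.0 maps to pure green, and values in between are
    linearly interpolated. Out-of-range inputs are clamped.
    """
    p = min(1.0, max(0.0, p_valid))
    red = round(255 * (1 - p))
    green = round(255 * p)
    return f"rgba({red}, {green}, 0, {alpha})"


def review_badge(p_valid):
    """Build the label and color that an overlay could display for one review."""
    p = min(1.0, max(0.0, p_valid))
    return {
        "label": f"{round(p * 100)}% valid",
        "color": validity_color(p),
    }
```

For example, `review_badge(1.0)` yields the label "100% valid" with the color `rgba(0, 255, 0, 0.35)`, which a front end could apply directly as a CSS background.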

Challenges encountered

When building the browser extension frontend, we were unable to scrape only the Google Reviews from the page: the scan also went over other components of the website. As a result, components other than the reviews were detected as well.

Accomplishments that we're proud of

We completed our first milestone: deploying a user-friendly, easy-to-use product that helps users avoid spam reviews without reading through everything or navigating through many buttons. With a single click, all reviews are immediately analyzed and displayed clearly.

What we learned

We learned how to work with unlabelled data, build and compare NLP models for this specific use case, and put a machine learning model into practical use through full-stack development of a browser extension.

What's next for Filter Reviews

We could improve validity classification performance with models that take longer to train, such as ensembles, and present attributes such as the validity percentage in a better format.
