THKS | Devpost

Inspiration

Online platforms are flooded with location-based reviews, but many are noisy: ads, irrelevant rants, or generic one-liners. This dilutes trust and makes it harder for businesses and customers to find reliable insights. We wanted to solve this by creating a pipeline that identifies which reviews are genuinely helpful.

What it does

Our system preprocesses raw review text and fine-tunes RoBERTa to classify each review into one of four classes: relevant, irrelevant, advert, or rant_no_visit. Flagging low-quality reviews this way is meant to improve fairness, trust, and user experience on review platforms.
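To make the four-way output concrete, here is a minimal sketch of how a classification head's raw scores could be turned into one of the class names. The label order, the `LABEL2ID`/`ID2LABEL` mappings, and the `predict_label` helper are assumptions for illustration, not taken from our code.

```python
import math

# Assumed label order; the real project may index the classes differently.
LABELS = ["relevant", "irrelevant", "advert", "rant_no_visit"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

With Hugging Face Transformers, mappings like these are typically passed to the model config so that predictions come back with readable names.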

How we built it

Developed a custom preprocessing tool (review_preprocess.py) to normalize text: lowercasing, URL/character cleanup, and stopword removal with negation handling.

Designed a PyTorch ReviewDataset class and used Hugging Face Transformers (FacebookAI/roberta-base).

Fine-tuned RoBERTa on labeled review datasets to detect quality and relevance.

Evaluated using scikit-learn metrics (accuracy, precision, recall, F1) and visualized results with seaborn/matplotlib.
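The normalization steps above can be sketched roughly as follows. This is not the actual review_preprocess.py: the stopword list is a tiny stand-in, and the `preprocess` function name, the regexes, and the contraction handling are all assumptions. The point it illustrates is that negation words survive stopword removal, so "wasn't good" does not collapse into "good".

```python
import re

# Assumed, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "is", "was", "were", "are", "it", "this", "that",
    "to", "of", "and", "in", "at", "for", "on", "with", "i", "we", "my",
    "no", "not", "nor",  # standard lists often include negations
}

# Negations are kept even when the stopword list contains them.
NEGATIONS = {"no", "not", "nor", "never", "cannot"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text: str) -> str:
    """Lowercase, strip URLs and symbols, drop stopwords, keep negations."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Expand contractions first so the negation survives character cleanup.
    text = re.sub(r"\bcan't\b", "cannot", text)
    text = re.sub(r"\bwon't\b", "will not", text)
    text = re.sub(r"n't\b", " not", text)
    text = NON_TEXT_RE.sub(" ", text)
    tokens = text.split()
    kept = [t for t in tokens if t in NEGATIONS or t not in STOPWORDS]
    return " ".join(kept)
```

This sketch only handles straight apostrophes; typographic ones would need to be normalized before the contraction step.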

Challenges we ran into

Designing preprocessing steps that removed noise without stripping away important context.

Starting with no labeled data for fine-tuning, so we had to label reviews by hand.

Training RoBERTa efficiently under GPU memory constraints.

Accomplishments that we're proud of

Built a complete, scalable pipeline from raw text -> clean dataset -> fine-tuned transformer -> evaluation.

Successfully fine-tuned RoBERTa to detect low-quality reviews with strong performance.

Manually labeled training and test data, giving us a reliable ground truth for evaluating performance and checking for overfitting.

Produced results that are directly applicable to real-world review platforms.
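For the evaluation stage of the pipeline, the sketch below shows what accuracy and per-class precision, recall, and F1 measure when predictions are compared against hand-labeled ground truth. We used scikit-learn for this; the dependency-free version here only illustrates the arithmetic, and the function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted label matches the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label, one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Report 0.0 when a ratio is undefined (no predictions or no examples).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(report):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(m["f1"] for m in report.values()) / len(report)
```

Per-class numbers matter here because classes such as advert or rant_no_visit may be much rarer than relevant, and overall accuracy alone could hide poor performance on them.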

What we learned

How to design and implement a review quality classifier end to end.

The importance of thoughtful preprocessing, especially for noisy, user-generated text.

Practical experience fine-tuning transformer models with Hugging Face and PyTorch.

The trade-offs between model accuracy, dataset size, and computational efficiency.