Inspiration
We’ve all looked up a place on Google Maps only to find the reviews cluttered with spam, irrelevant rants, or even blatant advertisements. It’s frustrating: you just want to know whether the place is good, but the signal is buried under noise. As a group, we wanted to tackle this problem directly and build something that helps keep online reviews trustworthy, relevant, and useful.
What it does
Our system automatically evaluates location-based reviews and flags content that doesn’t belong, whether that’s spam, irrelevant tangents, angry rants, or policy violations. By training a custom transformer model, we’re able to classify each review across multiple categories and decide whether it should be trusted, filtered, or flagged for moderation.
How we built it
We start by preprocessing the raw review data into a higher-quality dataset for downstream tasks. preprocessing/main.py first detects the language of each review and filters out anything that isn’t written in English. It then cleans the text, replacing URLs with placeholders and removing unnecessary whitespace while preserving punctuation, capitalization, and emojis, since those often carry important meaning. It also extracts useful metadata features: the length of the review, how many exclamation marks it contains, how many all-caps words it uses, and whether or not it includes a URL. These features help identify low-quality or potentially suspicious reviews later on. To capture the tone of each review, the program uses spaCy integrated with TextBlob to calculate both sentiment polarity (negative to positive) and subjectivity (objective versus opinion-driven). Finally, it uses scikit-learn to flag near-duplicate reviews within the same business, applying TF-IDF vectorization and cosine similarity to detect text that is too similar. The output is a .csv file.
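The sketch below illustrates the kind of feature extraction and near-duplicate detection described above. It is a simplified stand-in for preprocessing/main.py, not the actual script: it calls TextBlob directly rather than through the spaCy pipeline, and the column names and the 0.9 similarity threshold are assumptions.

```python
# Simplified sketch of the preprocessing features (not the exact preprocessing/main.py).
import re

import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

URL_RE = re.compile(r"https?://\S+")

def extract_features(text: str) -> dict:
    """Clean the text and pull out the simple metadata and sentiment features."""
    cleaned = URL_RE.sub("<URL>", text)              # replace URLs with a placeholder
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse extra whitespace only
    blob = TextBlob(cleaned)
    return {
        "clean_text": cleaned,
        "length": len(cleaned),
        "exclamations": cleaned.count("!"),
        "caps_words": sum(w.isupper() and len(w) > 1 for w in cleaned.split()),
        "has_url": bool(URL_RE.search(text)),
        "polarity": blob.sentiment.polarity,          # -1 (negative) .. 1 (positive)
        "subjectivity": blob.sentiment.subjectivity,  # 0 (objective) .. 1 (opinionated)
    }

def flag_near_duplicates(reviews: pd.Series, threshold: float = 0.9) -> list[bool]:
    """Flag reviews within one business that are near-identical by TF-IDF cosine similarity."""
    tfidf = TfidfVectorizer().fit_transform(reviews)
    sims = cosine_similarity(tfidf)
    flags = []
    for i in range(len(reviews)):
        sims[i, i] = 0.0  # ignore self-similarity; only compare against other reviews
        flags.append(bool(sims[i].max() >= threshold))
    return flags
```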
Next, we chose the Gemma 3 LLM, run locally via Ollama, to flag policy violations in the dataset. The compile_reviews function first loads the preprocessed data (.csv) and iterates through the rows, i.e. each review. For each row it calls evaluate_review, which builds a prompt with generate_prompt; the prompt is tailored to each review by including its preprocessed features, which help the LLM decide whether the review contains any policy violations. Once the response is received, parse_response extracts the required information and organises it into a dictionary for easier access and readability. Lastly, once the LLM has evaluated every review, compile_reviews collates the results into a single .json file saved to the specified output location.
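Below is a hedged sketch of that labelling loop using the Ollama Python client. The function names mirror the ones described above, but the prompt wording, the gemma3 model tag, the expected JSON fields, and the column names are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch of the LLM labelling loop (prompt, model tag, and schema are assumptions).
import json

import ollama  # Ollama Python client talking to a local Ollama server
import pandas as pd

MODEL = "gemma3"  # assumed local model tag

def generate_prompt(row: pd.Series) -> str:
    # Tailor the prompt with the review text plus its preprocessed metadata.
    return (
        "You are a review moderator. Decide whether this review violates platform policy "
        '(advertising, irrelevant content, rant). Answer only with JSON like '
        '{"advertisement": true/false, "irrelevant": true/false, "rant": true/false}.\n'
        f"Review: {row['clean_text']}\n"
        f"Length: {row['length']}, exclamations: {row['exclamations']}, has_url: {row['has_url']}"
    )

def parse_response(text: str) -> dict:
    # Pull the first JSON object out of the model output; fall back to an empty label set.
    try:
        start, end = text.index("{"), text.rindex("}") + 1
        return json.loads(text[start:end])
    except (ValueError, json.JSONDecodeError):
        return {}

def evaluate_review(row: pd.Series) -> dict:
    response = ollama.generate(model=MODEL, prompt=generate_prompt(row))
    return parse_response(response["response"])

def compile_reviews(csv_path: str, out_path: str) -> None:
    df = pd.read_csv(csv_path)
    results = []
    for _, row in df.iterrows():
        labels = evaluate_review(row)
        results.append({"clean_text": row["clean_text"], **labels})
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```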
We trained a DistilBERT-based transformer model for multi-label classification, configured to predict across five categories: inauthentic, advertisement, irrelevant, rant, and violations. The pipeline starts by loading JSON-formatted review data into pandas, converting it into a Hugging Face DatasetDict, and automatically creating train/validation splits. From there, we use Hugging Face’s AutoTokenizer to preprocess text with padding and special tokens, and AutoModelForSequenceClassification to initialize a classification head. Fine-tuning is managed through Hugging Face’s Trainer API, paired with a DataCollatorWithPadding for efficient batching. For experimentation with efficient training, we also integrated PEFT (Parameter-Efficient Fine-Tuning) with LoRA configurations, which can be toggled in the pipeline to reduce computational overhead while maintaining accuracy. Evaluation metrics (e.g., F1-score, accuracy) are calculated with the evaluate library. Development was carried out in VS Code and Jupyter, with PyTorch as the training backend. Supporting libraries like pandas and NumPy handled preprocessing and numerical operations, while the Hugging Face datasets library streamlined dataset transformations. By grounding the solution in our own labeled dataset and a carefully structured pipeline, this project directly addresses the problem of noisy, unhelpful location reviews. The result is a practical and extensible review moderation tool that enforces platform policies and improves the overall user experience.
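A condensed sketch of the fine-tuning setup is shown below. The file name, label set and order, split ratio, and hyperparameters are illustrative assumptions, and the LoRA/PEFT toggle and metric computation are omitted for brevity.

```python
# Condensed sketch of the multi-label fine-tuning pipeline (values are illustrative).
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

LABELS = ["inauthentic", "advertisement", "irrelevant", "rant", "violations"]
CHECKPOINT = "distilbert-base-uncased"

# Assumed output of the LLM labelling step: one object per review with text + 0/1 labels.
df = pd.read_json("labelled_reviews.json")
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def preprocess(batch):
    enc = tokenizer(batch["clean_text"], truncation=True)
    # Multi-label targets must be float vectors for the BCE-with-logits loss.
    enc["labels"] = [[float(batch[label][i]) for label in LABELS]
                     for i in range(len(batch["clean_text"]))]
    return enc

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # switches the loss to BCEWithLogits
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="review-moderator",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```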
Challenges we ran into
There were undoubtedly many, many challenges over the few days, but here are just a few highlights. Balancing multiple labels (spam vs. irrelevant vs. rant vs. policy violations) was tricky, since some categories appeared much less frequently than others and some were initially not detected accurately. Getting the LLM prompt for labelling the data right also took constant adjustment, especially since each test run had a long runtime. Formatting the data appropriately for Hugging Face’s pipeline, and having to update it after each tweak and iteration, took more effort than expected. Tuning the model without overfitting also required countless rounds of testing and iteration.
Accomplishments that we're proud of
We achieved over 98% micro F1 score on our validation set, showing that the model can indeed reliably detect and classify reviews. It was satisfying to successfully build an end-to-end pipeline that loads raw review data, processes it, trains a transformer model, and outputs meaningful evaluation metrics. Ultimately, we hope to contribute to creating a solution that’s not just theoretical but practical and extensible for real-world review moderation.
What we learned
We learned how to fine-tune transformer language models for multi-label classification, how to preprocess real-world review data into usable formats, and how to use prompt engineering to semantically analyse large sets of text reviews. We also gained hands-on experience with Hugging Face’s ecosystem (Trainer, datasets, transformers, and evaluate), as well as the importance of clean data for training.
What's next for Track 1: Filtering the Noise with peanut butter
Ultimately, we want to take this beyond a proof of concept and turn it into a tool that genuinely improves the trustworthiness of online review ecosystems. In our own time, we hope to explore live detection on review platforms, expanding the labelled dataset, improving the interpretability of flagging decisions, scaling up with larger datasets, and possibly adding multilingual support for non-English reviews.