Inspiration

Online reviews heavily influence public perception of businesses. However, low-quality, irrelevant, or misleading reviews distort trust. We set out to automate the assessment of review quality and relevancy using AI models, so that platforms can provide reliable feedback for users and fair representation for businesses.

What it does

CharWHIJard-AI is a full machine learning pipeline that:

  1. Downloads and preprocesses raw Google Local Reviews.
  2. Extracts textual and metadata features (review length, images, response status).
  3. Labels reviews automatically with two approaches:
     • a GPT-4.1-nano baseline
     • a local Qwen3-8B model
  4. Evaluates Qwen predictions against the GPT baseline using precision, recall, and F1-score.
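The feature-extraction stage (step 2) can be sketched as follows. This is a minimal illustration, assuming hypothetical column names (`text`, `pics`, `resp`) rather than the project's actual schema:

```python
import pandas as pd

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple textual/metadata features from raw reviews.
    Column names here are illustrative assumptions."""
    df = df.copy()
    df["review_length"] = df["text"].fillna("").str.len()  # 0 for empty reviews
    df["has_images"] = df["pics"].notna()
    df["has_response"] = df["resp"].notna()  # did the business reply?
    return df

# Tiny example input standing in for the real review data
reviews = pd.DataFrame({
    "text": ["Great food!", None],
    "pics": [None, "photo.jpg"],
    "resp": ["Thanks!", None],
})
features = extract_features(reviews)
print(features[["review_length", "has_images", "has_response"]])
```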

How we built it

We downloaded public Google review datasets and preprocessed them by merging the review and metadata JSONs into a structured CSV file. We then cleaned this CSV and kept only the columns relevant to our prompt. For modelling, we used GPT-4.1-nano as our baseline and ran a local Qwen3-8B model. We compared the Qwen labels against the GPT labels using sklearn metrics. Finally, we created a pipeline that orchestrates all of these steps.
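The merge-and-clean step above could look roughly like this. The join key (`gmap_id`), file names, and column names are illustrative assumptions, not the project's exact identifiers:

```python
import pandas as pd

# Stand-ins for the parsed review and metadata JSONs
reviews = pd.DataFrame([
    {"gmap_id": "g1", "text": "Best pizza in town", "rating": 5},
    {"gmap_id": "g2", "text": None, "rating": 3},
])
metadata = pd.DataFrame([
    {"gmap_id": "g1", "name": "Tony's Pizza", "category": "Restaurant"},
    {"gmap_id": "g2", "name": "Quick Lube", "category": "Auto"},
])

# Join each review to its business metadata
merged = reviews.merge(metadata, on="gmap_id", how="left")

# Keep only the columns the labelling prompt needs; drop text-less reviews
clean = merged.dropna(subset=["text"])[["text", "rating", "name", "category"]]
clean.to_csv("clean_reviews.csv", index=False)
print(len(clean))  # 1 row survives the text filter
```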

Challenges we ran into

  • Local model runtime: Running the Qwen3-8B model locally on the full dataset was very slow, so we had to optimise and sometimes limit the input size.
  • API limitations: GPT-4.1-nano calls were fast, but we ran out of API credits when trying to process all reviews, forcing us to run only a sample for the baseline.
  • Data quality: Some reviews lacked text or metadata, making filtering and labelling difficult.
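Sampling a fixed-size subset for the GPT baseline might be sketched like this; the sample size and seed are assumptions chosen for illustration, not the values we actually used:

```python
import random

# Stand-in for the full cleaned review list
reviews = [f"review {i}" for i in range(1000)]

SAMPLE_SIZE = 200  # hypothetical size chosen to fit the remaining API budget
rng = random.Random(42)  # fixed seed so the baseline subset is reproducible
baseline_sample = rng.sample(reviews, SAMPLE_SIZE)

print(len(baseline_sample))  # 200
```

A fixed seed matters here: the same subset must be sent to both the GPT baseline and the local model for the comparison to be fair.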

Accomplishments that we're proud of

  • Built an AI-driven review classification system capable of identifying ads, rants, irrelevant content, and other categories.
  • Integrated large language models (LLMs) like GPT-4.1-nano to generate high-quality labels for a baseline/ground truth for evaluation.
  • Designed an evaluation pipeline measuring precision, recall, and F1 across multiple categories, highlighting areas for improvement in less frequent classes.
  • Developed custom prompts for handling low-representation labels, improving the model’s awareness of edge cases.
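The per-category evaluation can be sketched with sklearn, as in the pipeline described above. The label names and toy predictions below are illustrative, not our actual data:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy example: GPT baseline labels vs. local Qwen labels
gpt_labels  = ["ad", "rant", "ok", "ok", "irrelevant", "ok"]
qwen_labels = ["ad", "ok",   "ok", "ok", "irrelevant", "rant"]

categories = ["ad", "rant", "irrelevant", "ok"]
precision, recall, f1, support = precision_recall_fscore_support(
    gpt_labels, qwen_labels, labels=categories, zero_division=0,
)

# Per-class scores expose weak spots in low-frequency categories
for lbl, p, r, f in zip(categories, precision, recall, f1):
    print(f"{lbl}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```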

What we learned

We learned that LLMs can effectively generate high-quality labels, but low-frequency or nuanced categories remain challenging for models like Qwen. We also learned that data preprocessing and sampling strategies significantly affect model performance, especially on imbalanced datasets. Finally, rigorous testing and evaluation are key to improving any model's reliability and trustworthiness.

What's next for CharWHIJard-AI

  • Improve performance on is_rant_without_visit and is_advertisement using advanced prompt engineering and fine-tuning techniques.
  • Expand the system to handle multilingual reviews and diverse review formats.
  • Explore automated summarization, sentiment analysis and utilise more metadata to provide richer insights alongside classification.

Built With
