3Pandas

Inspiration

We were inspired by the growing reliance on user reviews for decisions about restaurants, healthcare, tourism, and other services. Many reviews, however, are noisy: they contain advertisements, irrelevant content, or rants from users who may never have visited the location. We wanted to build a system that automatically filters out such low-quality reviews, improving trust for users, increasing fairness for businesses, and reducing moderation effort for platforms.

What it does

Our system detects and flags noisy reviews by predicting multiple policy labels simultaneously: advertisements (is_ad), irrelevant content (tracked through the is_relevant label), and rants (is_rant). It combines metadata features (such as URL counts, phone counts, capitalization ratios, ratings, user activity, and sentiment scores) with optional text-based embeddings and similarity features. The model outputs a multi-label prediction per review, indicating which policies the review violates.
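As an illustration of the metadata features mentioned above, here is a minimal sketch of how URL counts, phone counts, and a capitalization ratio could be derived from raw review text. The function name `metadata_features`, the regular expressions, and the example review are assumptions made for this sketch, not the project's actual code.

```python
import re

# Assumed patterns for illustration only; a production system would need
# stricter rules (the phone pattern also matches other long digit runs).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d(?:[\s-]?\d){7,}")


def metadata_features(text: str, rating: int) -> dict:
    """Compute a few simple, text-derived metadata features for one review."""
    letters = [c for c in text if c.isalpha()]
    # Share of alphabetic characters that are uppercase (0.0 for empty text).
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "url_count": len(URL_RE.findall(text)),
        "phone_count": len(PHONE_RE.findall(text)),
        "caps_ratio": caps_ratio,
        "rating": rating,
    }


# Hypothetical ad-like review versus an ordinary one.
ad_like = metadata_features("BEST DEALS at www.example.com call +65 6123 4567 now", 5)
ordinary = metadata_features("Great food and friendly staff.", 4)
```

Features like these are cheap to compute and easy to inspect, which is one reason they pair well with a tree-based model.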

Our project addresses the core problem stated in the challenge prompt: "How can we assess the quality and relevancy of location-based reviews?" We interpreted this as the need to automatically identify and filter out reviews that violate common platform policies, separating helpful, genuine user feedback from noisy, malicious, or irrelevant content that misleads consumers. Doing so enhances user trust, ensures fair representation for businesses, and reduces moderation overhead, which aligns with TikTok's goal of fostering a reliable and engaging community around local discovery.

How we built it

Data Collection & Preprocessing
- Combined Yelp, Kaggle restaurant reviews, and crawled Google reviews across categories like tourism, food, healthcare, fitness, and retail.
- Cleaned text, removed duplicates and non-English reviews, normalized emojis, and generated policy-aligned labels using the OpenAI API.
Feature Engineering
- Metadata features, sentiment scores, and cosine similarity between review text and business category.
- Text embeddings using transformer models for semantic understanding.
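The similarity feature can be sketched as follows. To keep the example dependency-free, it uses bag-of-words count vectors as a stand-in for transformer embeddings; the helper names and the category description string are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical category description for a restaurant listing.
category = bag_of_words("restaurant food dining meal menu")
on_topic = cosine_similarity(bag_of_words("The food was great and the menu is large"), category)
off_topic = cosine_similarity(bag_of_words("My phone battery died on the way home"), category)
```

With dense transformer embeddings the same formula applies to the embedding vectors, and paraphrases that share no words with the category can still score well.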
Modeling
- MultiOutputClassifier wrapped around RandomForestClassifier for multi-label classification.
- Compared with MLP and transformer-based models to balance efficiency, interpretability, and semantic capture.
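The modeling setup above can be sketched with scikit-learn as shown below. The data here is synthetic and the feature dimensions, label rules, and hyperparameters are assumptions chosen only to make the example run; it also shows one simple way to combine metadata and embedding features by concatenating them column-wise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n_reviews = 400

# Synthetic stand-ins for real features (assumed shapes, not project data).
metadata = rng.random((n_reviews, 4))               # e.g. url_count, caps_ratio, ...
embeddings = rng.normal(size=(n_reviews, 16))       # e.g. sentence embeddings
X = np.hstack([metadata, embeddings])               # one row per review

# Toy labels loosely tied to the features so the model has signal to learn.
Y = np.column_stack([
    (metadata[:, 0] > 0.7).astype(int),    # is_ad
    (embeddings[:, 0] > 0.0).astype(int),  # is_relevant
    (metadata[:, 2] > 0.8).astype(int),    # is_rant
])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# MultiOutputClassifier fits one independent random forest per label column.
model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0)
)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)  # shape: (n_test_reviews, 3)
```

Because each label gets its own forest, feature importances can be inspected per policy through `model.estimators_`, which supports the interpretability goal mentioned above.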

Development Tools:

Jupyter Notebook: The primary environment for orchestrating the entire data science workflow, from data collection to model evaluation.
Google Colab: Utilized for accessing GPU resources to fine-tune large language models efficiently.

APIs Used:

OpenAI API (GPT-4o): Used to generate the initial policy violation labels (is_ad, is_rant, is_relevant) for our dataset, ensuring a consistent labeling strategy at scale.
OpenStreetMap (OSM) API: Used to gather initial points of interest (name, coordinates, address) across multiple categories in Singapore.
Apify Google Maps Reviews Scraper: Used as a scalable cloud service to crawl Google Maps and extract reviews from the list of URLs obtained via OSM.

Libraries and Frameworks:

Hugging Face Transformers: The core library for downloading, customizing, and fine-tuning the pre-trained transformer models (Gemma3-1B and Qwen1.5-0.5B).
PyTorch: The underlying deep learning framework for building and training neural network architectures.
scikit-learn: Used for traditional machine learning models (MultiOutputClassifier with RandomForestClassifier), metrics, train-test splits, and utilities.
pandas & NumPy: The foundation for all data manipulation, cleaning, and numerical computation.
Matplotlib & Seaborn: Employed for all exploratory data analysis (EDA) and visualization.

Assets and Datasets Used:

The project employed a multi-source data strategy to ensure robustness and generalizability:

Proprietary Crawled Dataset: Data collected via Apify from Google Maps, based on locations in Singapore identified via the OpenStreetMap API.
Google Maps Restaurant Reviews (Kaggle): A public dataset from Kaggle used to add volume and diversity.
GoogleLocal (UCSD McAuley Lab): A large-scale academic dataset; a random sample (10 states, 1k reviews each) was used.
Yelp Open Dataset: A sample of 10,000 reviews from Yelp's public dataset to incorporate another platform's characteristics.
Synthetically Generated Data: Data generated with GPT-4o to add thousands of examples of the underrepresented violation classes (is_ad, is_rant), which was essential for mitigating the severe class imbalance.

Challenges we ran into

Class Imbalance: Ads and rants were much less frequent than normal reviews.
Threshold Tuning: Decision thresholds needed careful calibration to avoid over-flagging legitimate reviews.
Subtle Irrelevance: Some reviews were context-dependent and challenging even for humans.
Integrating Text and Metadata: Ensuring numeric embeddings and metadata could be combined consistently for ML models.
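To illustrate the threshold-tuning challenge, here is a minimal sketch that picks a separate decision threshold for each label by maximizing F1 over a grid, then compares macro F1 against a fixed 0.5 cutoff. The function `tune_thresholds`, the grid, and the synthetic scores are assumptions for this example; in practice the thresholds would be chosen on a held-out validation set rather than on the data used for the final evaluation.

```python
import numpy as np
from sklearn.metrics import f1_score


def tune_thresholds(y_true, y_prob, grid=None):
    """Return, for each label column, the grid threshold with the best F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    thresholds = []
    for j in range(y_true.shape[1]):
        scores = [
            f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0)
            for t in grid
        ]
        thresholds.append(float(grid[int(np.argmax(scores))]))
    return np.array(thresholds)


rng = np.random.default_rng(0)
# Synthetic labels with assumed base rates: rare ads, common relevance, rare rants.
y_true = (rng.random((500, 3)) < [0.05, 0.8, 0.1]).astype(int)
# Synthetic, noisy scores that are higher on average for positive examples.
y_prob = np.clip(0.3 * y_true + rng.normal(0.35, 0.15, size=y_true.shape), 0, 1)

thresholds = tune_thresholds(y_true, y_prob)
y_pred = (y_prob >= thresholds).astype(int)  # thresholds broadcast per column

macro_f1_tuned = f1_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1_default = f1_score(
    y_true, (y_prob >= 0.5).astype(int), average="macro", zero_division=0
)
```

Per-label thresholds matter most for rare labels such as ads and rants, where a single global cutoff tends either to miss most positives or to flag too many legitimate reviews.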

Accomplishments that we're proud of

End-to-end ML Pipeline: Built a complete workflow from data collection → preprocessing → model training → evaluation → reporting within the hackathon timeframe.
Diverse Model Suite: Successfully trained and compared Random Forest, MLP, Gemma-3-1B, and Qwen 0.5–0.8B models, showcasing both classical ML and modern LLM approaches.
Efficient Labeling: Used the OpenAI API to generate high-quality labels aligned with platform policies, saving time and ensuring consistency.
Balanced Tradeoffs: Demonstrated that Qwen 0.5–0.8B offers a strong accuracy-efficiency balance, making it practical for real-world deployment.

What we learned

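Multi-label classification requires carefully chosen evaluation metrics, such as macro F1-score, which weights every label equally so that rare labels are not drowned out by frequent ones.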
Metadata features are strong indicators of noise and can work effectively even without deep semantic modeling.
Transformer embeddings improve performance but come at a higher computational cost.
Feature engineering, preprocessing, and model selection are key to balancing accuracy, interpretability, and efficiency.

What's next for 3Pandas

Address class imbalance with advanced sampling or cost-sensitive learning.
Extend the system to new categories and languages for broader coverage.
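
As one possible starting point for the cost-sensitive learning direction, the sketch below uses scikit-learn's built-in "balanced" class weighting, which scales each class inversely to its frequency. The 5% positive rate, the random features, and the hyperparameters are assumptions for illustration; this is not something the current pipeline is stated to do.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic, imbalanced single-label problem (assumed 5% positives, e.g. is_ad).
X = rng.normal(size=(1000, 8))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# The same weighting can be applied inside the classifier, so that errors on
# the rare class cost more during training.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf.fit(X, y)
```

The same `class_weight` argument can be passed to the base estimator inside a `MultiOutputClassifier`, in which case each label's forest computes its own weights from that label's distribution.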