Inspiration
Online reviews shape the reputation of local businesses and influence customer decisions, but not all reviews are trustworthy. Spam, irrelevant content, and misleading rants create noise that makes it hard for users to trust platforms. We wanted to create a system that helps people make confident decisions while also ensuring fair representation for businesses.
What We Learned
Through this project, we gained hands-on experience in:
- Natural Language Processing (NLP) with VADER and large language models.
- Machine learning pipelines, including feature engineering, training, and inference.
- Data cleaning and preprocessing, learning how messy real-world data impacts model performance.
- Combining AI approaches, integrating rule-based filters with LLMs and CatBoost for robust predictions.
We also learned practical considerations like API rate limits, handling large datasets efficiently, and ensuring reproducibility in ML workflows.
How We Built It
1. Data Ingestion & Standardization
- Accepted JSON, CSV, and TXT datasets.
- Standardized key fields: user_id, user_name, business_name, rating, text, and sentiment_category.
- Used the OpenAI API to assist with parsing and labeling non-standard datasets.
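As a rough sketch, the column-renaming part of this step could look like the following. `COLUMN_ALIASES` is a hypothetical alias table for illustration; the real pipeline's mapping (and the OpenAI-assisted parsing of non-standard datasets) is more involved.

```python
import csv
import io

# Hypothetical alias table: maps source column names onto the standard schema.
COLUMN_ALIASES = {
    "userid": "user_id", "user": "user_name", "business": "business_name",
    "stars": "rating", "review": "text", "sentiment": "sentiment_category",
}
STANDARD_FIELDS = ["user_id", "user_name", "business_name",
                   "rating", "text", "sentiment_category"]

def standardize_rows(raw_csv: str) -> list:
    """Rename known column aliases and keep only the standard fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        renamed = {COLUMN_ALIASES.get(k, k): v for k, v in row.items()}
        rows.append({f: renamed.get(f, "") for f in STANDARD_FIELDS})
    return rows
```

Missing fields default to an empty string so downstream steps can detect and drop null-heavy rows.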
2. Preprocessing
- Removed NaN or null-heavy rows to reduce noise.
- Applied text cleaning (removing emojis, HTML tags, and extra spaces).
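A minimal sketch of the cleaning step, using only the standard library (the emoji ranges below cover the common blocks and are illustrative, not exhaustive):

```python
import re

# Common emoji code-point ranges (illustrative subset).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
HTML_RE = re.compile(r"<[^>]+>")

def clean_text(text: str) -> str:
    """Strip HTML tags and emojis, then collapse extra whitespace."""
    text = HTML_RE.sub(" ", text)
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```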
3. Sentiment & Relevancy Analysis
- Used VADER for basic sentiment scoring.
- Flagged likely untrustworthy reviews (1) vs trustworthy (0).
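One way such a flag could be derived, as a hedged sketch: VADER's `SentimentIntensityAnalyzer` returns a compound score in [-1, 1], and a review whose text sentiment sharply contradicts its star rating is a suspicious signal. The thresholds below are illustrative only; the pipeline's actual rules may differ.

```python
def flag_review(rating: int, compound: float) -> int:
    """Return 1 (likely untrustworthy) when the star rating and the VADER
    compound sentiment score point in opposite directions, else 0.
    Thresholds are illustrative, not the pipeline's tuned values."""
    if rating >= 4 and compound <= -0.5:   # glowing stars, hostile text
        return 1
    if rating <= 2 and compound >= 0.5:    # harsh stars, glowing text
        return 1
    return 0
```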
4. Machine Learning Pipeline
- CatBoost classifier with K-fold cross-validation (K = 10) for optimal performance.
- Evaluated models with F1 score and AverageGain.
- Saved trained models and metadata for inference on new data.
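The K-fold loop can be sketched without any ML dependencies; inside each fold the pipeline would fit a `CatBoostClassifier` on the train indices and score F1 on the held-out fold (CatBoost calls are omitted here to keep the sketch dependency-free):

```python
def kfold_indices(n_samples: int, k: int = 10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Early folds absorb the remainder so all samples are covered.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop
```

Averaging the per-fold F1 scores gives the robustness estimate reported above.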
5. Inference
- Preprocess new review datasets.
- Apply VADER sentiment and load trained CatBoost models.
- Output CSV labeling each review as trustworthy or untrustworthy.
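The final output step can be sketched with the standard library's `csv` module; the column name `untrustworthy` is a hypothetical choice for illustration:

```python
import csv

def write_predictions(rows, labels, path):
    """Append an 'untrustworthy' column (1 = untrustworthy, 0 = trustworthy)
    to each review row and save the result as CSV."""
    fieldnames = list(rows[0].keys()) + ["untrustworthy"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row, label in zip(rows, labels):
            writer.writerow({**row, "untrustworthy": label})
```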
Challenges
- Noisy datasets: Many reviews were incomplete or poorly formatted, requiring robust cleaning and standardization.
- API limitations: OpenAI API calls needed batching to avoid rate limits.
- Balancing accuracy vs speed: Ensuring the pipeline processed large datasets efficiently without sacrificing model performance.
- Multi-step integration: Combining rule-based filters, LLM pseudo-labeling, NLTK VADER sentiment analysis and CatBoost modeling into a single coherent pipeline.
Reflection
This project showed us the power of combining AI techniques to solve real-world problems. We built a scalable, interpretable system that helps users navigate online reviews with confidence and ensures businesses are fairly represented. It also reinforced best practices in ML pipelines, preprocessing, and evaluation metrics.
What it does
We built a pipeline that takes in messy, real-world review data and transforms it into reliable insights.
On the training side, we standardize raw datasets, clean and preprocess the text, then use the OpenAI API and VADER sentiment analysis to generate pseudo-labels. These enriched datasets feed into a CatBoost classifier, trained with rigorous 10-fold cross-validation for robustness.
On the inference side, anyone can feed in new reviews. The system preprocesses the data, adds sentiment signals, loads our pre-trained models, and instantly predicts whether each review is spam or genuine. Results are neatly saved for analysis.
In short: we’ve built a scalable, end-to-end system that combines classical ML, modern NLP, and sentiment analysis to make online reviews trustworthy again.
Accomplishments that we're proud of
End-to-end pipeline: We successfully built a complete system that goes from raw, messy datasets all the way through preprocessing, pseudo-labeling, training, and inference.
Smart model design: By combining VADER sentiment analysis with a CatBoost classifier, we created a lightweight yet accurate solution that balances scalability and performance.
Robust training setup: We implemented 10-fold cross-validation to ensure that our results were reliable and not overfitted to any one dataset.
Practical outputs: Our inference pipeline can take in brand-new review data, process it quickly, and output clear predictions that separate spam from genuine reviews.
Team adaptability: We learned to evaluate tradeoffs between state-of-the-art models and scalable ones, making thoughtful design choices to match the resources available.
What we learned
During this project, we explored and compared multiple sentiment analysis approaches to enhance our spam review detection pipeline. In particular, we evaluated lightweight lexicon-based methods such as VADER against large transformer models like DeBERTa.
We ultimately chose VADER because it is extremely fast and scalable — capable of handling thousands of reviews in real time on standard hardware without requiring GPUs. This made it a practical choice for ensuring our tool could run reliably in constrained environments.
However, we also recognized that DeBERTa consistently outperforms rule-based models in capturing nuanced sentiment, sarcasm, and context. If we had access to more powerful computational resources (e.g., GPUs), integrating DeBERTa would likely improve accuracy and robustness, particularly for edge cases.
This comparison taught us the importance of balancing scalability and performance with accuracy and complexity, and to design our system so it can adapt as resources and requirements evolve.
What's next for Compile and Conquer
Looking ahead, we aim to push our system beyond conventional ensemble models. While our current CatBoost + sentiment approach is efficient and scalable, ensemble methods have limitations — particularly when they place too much emphasis on optimizing supervised learning alone, sometimes at the expense of capturing deeper contextual signals.
With access to more powerful hardware such as GPUs or TPUs, we hope to integrate deep learning architectures like transformer-based models. These models can learn richer language representations, handle subtleties like sarcasm or adversarial spam, and provide greater robustness in real-world deployment.
Our next step is to transition from a resource-conscious prototype to a scalable, deep learning–driven solution that can adapt to new domains of review data and deliver even more accurate authenticity detection.