Inspiration

Building things that work in the real world has always been our goal. Past experiences, like Project Harmonics in high school (an HCI and accessibility project built around adaptive audio visualization), have driven us to design systems that are both impactful and well-structured. For this project, our team at ChabaCrunch drew on our core interests in Machine Learning, Natural Language Understanding, and Data Visualization. We saw an opportunity in the TouchBistro dataset to explore how tipping behaviors vary across geographic locations, a challenge that blended our passion for data science with real-world restaurant operations.

What It Does

ChabaCrunch analyzes restaurant transactional data to reveal tipping trends by city, venue type, and order type. By merging detailed bills data with venue information, the project:

  • Computes tip percentages and absolute tip amounts (a minimal sketch follows this list).
  • Identifies key differences between dining formats (dine-in vs. takeout/delivery).
  • Reveals how local culture and venue concepts (e.g., bars vs. cafés) influence tipping behavior.
  • Provides actionable insights for restaurants to optimize service, staffing, and digital tip prompts.
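
Here is a minimal sketch of that core computation. The column names `tip_amount` (the tip) and `net_sales` (the pre-tip bill total) are our assumptions for illustration; the real TouchBistro schema may differ:

```python
import pandas as pd

# Hypothetical column names; the actual TouchBistro schema may differ.
bills = pd.DataFrame({
    "tip_amount": [3.00, 0.00, 5.50],
    "net_sales": [20.00, 12.00, 44.00],
})

# Tip percentage relative to the pre-tip bill total.
bills["tip_pct"] = 100 * bills["tip_amount"] / bills["net_sales"]
print(bills)
```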

How We Built It

We approached the project iteratively:

  • Data Integration: Merged bills.csv and venues.csv on venue_xref_id to create a comprehensive dataset (see the merge sketch after this list).
  • Data Cleaning: Handled outliers, spurious zero values, negative amounts, and missing values. For example, unknown venue concepts (up to 22% of rows) were imputed with a Random Forest model after extensive hyperparameter tuning.
  • Exploratory Data Analysis (EDA): Used statistical methods and visualizations to uncover patterns in tip percentages and amounts by city and concept.
  • Memory Optimization: Moved from Jupyter Notebook to Google Colab for its collaborative workflow, and tamed the roughly 8-million-row, ~3.3 GB bills dataset by downcasting numeric data types and managing memory carefully (see the downcasting sketch after this list).
  • Iterative Refinement: We repeatedly revisited each stage, from cleaning through analysis, refining our methods until the data quality could support robust conclusions.
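
A minimal sketch of the integration and EDA steps: the file names and the `venue_xref_id` key come from our pipeline, while `city`, `concept`, `tip_amount`, and `net_sales` are assumed column names.

```python
import pandas as pd

# Join the two sources on the shared venue key. A left join keeps every
# bill even when venue metadata is missing; validate="m:1" guards
# against duplicate venue records.
bills = pd.read_csv("bills.csv")
venues = pd.read_csv("venues.csv")
df = bills.merge(venues, on="venue_xref_id", how="left", validate="m:1")

# The kind of grouped summary we leaned on during EDA.
df["tip_pct"] = 100 * df["tip_amount"] / df["net_sales"]
summary = (
    df.groupby(["city", "concept"])["tip_pct"]
      .agg(["median", "mean", "count"])
      .sort_values("median", ascending=False)
)
print(summary.head(10))
```

We report the median alongside the mean because a handful of extreme tips can pull the mean well away from typical behavior.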
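
For the memory ceiling, a sketch of the downcasting idea: pandas loads numbers as 64-bit by default, so shrinking dtypes and converting repetitive strings to categoricals cuts the footprint substantially.

```python
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality string
    columns to categoricals to reduce the in-memory footprint."""
    for col in df.select_dtypes("integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes("float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes("object").columns:
        # Mostly-repeated values compress well as categoricals.
        if df[col].nunique() / max(len(df), 1) < 0.05:
            df[col] = df[col].astype("category")
    return df

bills = shrink(pd.read_csv("bills.csv"))
print(f"{bills.memory_usage(deep=True).sum() / 1e9:.2f} GB")
```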

Challenges We Ran Into

  • Data Cleaning: The dataset contained numerous outliers, spurious zero values, and missing data. Each anomaly required a tailored approach:
    • Handling unknown venue concepts (22% missing) and missing waiter UUIDs.
    • Labeling negative values to distinguish refunds from sales (a labeling sketch follows this list).
  • Memory Management: Google Colab’s 12 GB RAM cap forced us into aggressive downcasting and careful memory usage, especially when processing the large bills files.
  • Technical Learning Curve: Moving from limited prior experience with Jupyter Notebook to a collaborative workflow in Google Colab, and mastering techniques for large-scale data processing.
  • Modeling: Imputing unknown venue concepts with a Random Forest model was particularly challenging; it took extensive trial and error with hyperparameters to reach satisfactory accuracy (a sketch follows this list).
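
A sketch of the negative-value labeling, continuing from the merged `df` in the earlier sketch (`net_sales` is still an assumed column name):

```python
import numpy as np

# Flag refunds instead of dropping them, so they can be excluded from
# tip analysis but still audited separately.
df["txn_type"] = np.where(df["net_sales"] < 0, "refund", "sale")
sales = df[(df["txn_type"] == "sale") & (df["net_sales"] > 0)]
```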
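
And a hedged sketch of the Random Forest imputation. The feature list is illustrative (and assumes the categorical inputs were already numerically encoded), not our exact pipeline:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative, pre-encoded numeric features; placeholders, not our exact set.
features = ["tip_pct", "net_sales", "order_type_code", "city_code"]

# Train on venues whose concept is known; predict the rest.
labeled = df.dropna(subset=features + ["concept"])
missing = df[df["concept"].isna()].dropna(subset=features)

clf = RandomForestClassifier(
    n_estimators=300,      # placeholder values; ours came from extensive tuning
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=42,
)
clf.fit(labeled[features], labeled["concept"])
df.loc[missing.index, "concept"] = clf.predict(missing[features])
```

Holding out a validation slice of the labeled venues is what let us judge whether the imputed concepts were accurate enough to analyze.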

Accomplishments That We're Proud Of

  • Successfully merging and cleaning two complex datasets to produce a unified, analysis-ready dataset.
  • Developing a robust data cleaning pipeline that effectively handled a variety of data quality issues.
  • Implementing a predictive model to impute missing venue concepts, which significantly enhanced our analysis.
  • Extracting meaningful insights on tipping behavior that can directly inform restaurant operations.
  • Overcoming memory and computational challenges in Google Colab, ensuring our analysis could scale with the data.

What We Learned

  • Iterative Development: There is no single script that solves all problems. Iterative refinement—from cleaning to modeling—proved essential.
  • Collaboration is Key: Sharing responsibilities and learning from one another’s expertise greatly enhanced our problem-solving capabilities.
  • Importance of Data Quality: Robust data cleaning is foundational; addressing anomalies early on leads to more reliable analyses.
  • Resource Management: Efficient data handling and memory optimization are critical when dealing with large datasets.
  • Modeling Nuances: Even with advanced models like Random Forests, tuning and validation are crucial, especially when imputing high-impact features like venue concepts.

What's Next for ChabaCrunch

Moving forward, we plan to expand our analysis by:

  • Integrating external data sources (e.g., local economic indicators, weather data) to deepen our understanding of tipping behaviors.
  • Enhancing our predictive models to cover other facets of restaurant operations, such as sales forecasting or staffing recommendations.
  • Developing interactive dashboards to allow real-time exploration of tipping trends.
  • Exploring new projects that merge Machine Learning, HCI, and Data Visualization, such as real-time collaborative coding platforms or AI-powered course schedulers.
  • Continuing to innovate together, applying what we learned to even more technically ambitious challenges.

ChabaCrunch is not just a project—it’s a stepping stone to building scalable, impactful systems in the real world, driven by data, collaboration, and a passion for solving complex problems.
