Supermarket Data Analysis and Sales Prediction

Inspiration

The idea for Supermarket Data Analysis and Sales Prediction came from the everyday challenges supermarkets face in managing inventory, staffing, and promotions. Working with a dataset of 1,000 real supermarket transactions sparked our interest in how machine learning can extract hidden patterns from simple data such as product categories, customer demographics, time of purchase, and payment methods. We wanted to build a tool that turns raw transactional data into accurate sales forecasts, helping businesses boost revenue, reduce waste, and make smarter decisions.

What it does

This project is an AI-powered sales predictor that forecasts future supermarket sales based on current transaction details. Users can input single transactions manually or upload CSV files with multiple entries. The system processes features like branch, product line, customer gender, time of day, and quantity to predict sales for a target month. It then compares current sales against predictions, visualizes differences by product line, calculates total projected revenue changes, and provides downloadable results. The core model uses a Random Forest Regressor trained on historical data, deployed via an interactive Streamlit web app.
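The comparison step can be pictured with a small pandas sketch. This is not the app's actual code: the column names ("Product line", "Sales", "Predicted Sales"), the helper name, and the numbers are all illustrative assumptions.

```python
import pandas as pd


def summarize_by_product_line(df: pd.DataFrame) -> pd.DataFrame:
    """Sum current and predicted sales per product line and compute the change."""
    summary = df.groupby("Product line", as_index=False)[
        ["Sales", "Predicted Sales"]
    ].sum()
    summary["Change"] = summary["Predicted Sales"] - summary["Sales"]
    summary["Change %"] = 100 * summary["Change"] / summary["Sales"]
    return summary


# Made-up entries standing in for user input plus model predictions.
entries = pd.DataFrame(
    {
        "Product line": ["Food and beverages", "Health and beauty", "Food and beverages"],
        "Sales": [100.0, 50.0, 60.0],
        "Predicted Sales": [120.0, 45.0, 72.0],
    }
)

summary = summarize_by_product_line(entries)
total_change = summary["Change"].sum()  # total projected revenue change
print(summary)
print(f"Total projected change: {total_change:.2f}")
```

A table like `summary` could feed both the per-product-line bar chart and the downloadable results file.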

How we built it

We started with thorough exploratory data analysis on the supermarket dataset, using Pandas, Matplotlib, and Seaborn to understand distributions and relationships. Key steps included converting the date/time columns and extracting features such as day of week, hour, and time-of-day category (Morning, Afternoon, Evening, Night). We engineered interaction features such as ProductLine_TimeOfDay, ProductLine_Gender, and Branch_TimeOfDay to capture behavioral patterns. We removed leaky columns (e.g., tax, COGS) that are derived directly from the target sales value. Preprocessing consisted of scaling numerical features with StandardScaler and one-hot encoding categorical features, all wrapped in a scikit-learn Pipeline. After comparing models, we selected a Random Forest Regressor for its robustness. We saved the trained pipeline with joblib and built a full Streamlit app for user interaction, including CSV upload, session-state entry management, predictions, Matplotlib visualizations, and result downloads.
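The steps above might look roughly like the following sketch. The raw column names, the hour boundaries for each time-of-day bucket, the toy data, and the hyperparameters are assumptions chosen for illustration; only the interaction feature names and the overall StandardScaler, one-hot encoding, and Random Forest structure come from the description.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to a bucket; the boundaries are an assumption."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive date/time features and the interaction features."""
    out = df.copy()
    ts = pd.to_datetime(out["Date"] + " " + out["Time"], format="%Y-%m-%d %H:%M")
    out["DayOfWeek"] = ts.dt.day_name()
    out["Hour"] = ts.dt.hour
    out["TimeOfDay"] = out["Hour"].map(time_of_day)
    out["ProductLine_TimeOfDay"] = out["Product line"] + "_" + out["TimeOfDay"]
    out["ProductLine_Gender"] = out["Product line"] + "_" + out["Gender"]
    out["Branch_TimeOfDay"] = out["Branch"] + "_" + out["TimeOfDay"]
    return out.drop(columns=["Date", "Time"])


# Toy rows with made-up values; "Tax" and "COGS" stand in for the leaky columns.
raw = pd.DataFrame(
    {
        "Branch": ["A", "B", "C", "A", "B", "C"],
        "Product line": ["Food", "Health", "Food", "Health", "Food", "Health"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Date": ["2019-01-05", "2019-01-06", "2019-01-07",
                 "2019-01-08", "2019-01-09", "2019-01-10"],
        "Time": ["10:30", "13:08", "18:45", "21:10", "11:00", "15:20"],
        "Quantity": [7, 5, 8, 2, 4, 6],
        "COGS": [70.0, 50.0, 80.0, 20.0, 40.0, 60.0],
        "Tax": [3.5, 2.5, 4.0, 1.0, 2.0, 3.0],
        "Sales": [73.5, 52.5, 84.0, 21.0, 42.0, 63.0],
    }
)

features = add_features(raw)
# Drop the target and the columns computed directly from it.
X = features.drop(columns=["Sales", "Tax", "COGS"])
y = features["Sales"]

numeric = ["Quantity", "Hour"]
categorical = [
    "Branch", "Product line", "Gender", "DayOfWeek", "TimeOfDay",
    "ProductLine_TimeOfDay", "ProductLine_Gender", "Branch_TimeOfDay",
]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        # handle_unknown="ignore" lets unseen categories pass at prediction time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ]
)
model = Pipeline(
    [
        ("preprocess", preprocess),
        ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the preprocessing inside the Pipeline means a single saved object applies the same scaling and encoding at prediction time as it did during training.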

Challenges we ran into

One major challenge was data leakage from columns mathematically tied to the target, which initially inflated performance metrics. Removing them caused a realistic drop in scores, forcing deeper feature engineering. Interaction features increased dimensionality, raising overfitting concerns, though Random Forest handled it well. Handling varied date/time formats in user uploads required robust parsing logic with fallbacks. Initial tree-based models underperformed until proper preprocessing and interactions were added. Deploying in Streamlit involved managing session state for multiple entries and ensuring the model reloads or retrains gracefully.
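As an illustration of date parsing with fallbacks, here is a minimal standard-library sketch. The list of accepted formats and the function name are assumptions, not the app's actual parsing logic.

```python
from datetime import datetime
from typing import Optional

# Candidate formats tried in order; this list is an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def parse_date(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


print(parse_date("2019-01-05"))
print(parse_date(" 1/5/2019 "))
print(parse_date("not a date"))
```

Format order matters for ambiguous inputs: with this list, a value like 03/04/2019 is read month-first. Returning None instead of raising lets the caller decide whether to skip the row or show a warning.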

Accomplishments that we're proud of

We built an end-to-end machine learning solution, from raw data to an interactive web app. The feature engineering significantly improved prediction accuracy over our initial models. We are especially proud of the user-friendly interface, which supports bulk CSV uploads, real-time calculations, and clear visualizations (including product-line bar charts and monthly projections). An auto-retraining fallback and detailed result exports make the tool practical for business use.
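One way an auto-retraining fallback could work is to load the saved model if possible and retrain otherwise. The sketch below is a guess at the pattern, not the project's code: the function name, file name, toy data, and hyperparameters are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestRegressor


def load_or_train(model_path, X, y):
    """Load a saved model if possible; otherwise train a new one and save it."""
    path = Path(model_path)
    if path.exists():
        try:
            return joblib.load(path)
        except Exception:
            # A corrupt or incompatible file falls through to retraining.
            pass
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X, y)
    joblib.dump(model, path)
    return model


# Toy data with made-up values, stored in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.joblib"
    X = [[1], [2], [3], [4]]
    y = [10.0, 20.0, 30.0, 40.0]
    first = load_or_train(model_file, X, y)   # no file yet, so this trains
    trained_file_exists = model_file.exists()
    second = load_or_train(model_file, X, y)  # file exists, so this loads
    same_predictions = list(first.predict(X)) == list(second.predict(X))
```

In a Streamlit app, a function like this would typically be wrapped in a cache so the model is not reloaded on every rerun.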

What we learned

We learned that feature engineering can matter more than algorithm choice, especially in how interaction terms reveal customer behaviors. We learned to avoid data leakage and to build reproducible pipelines. The project reinforced the importance of thoughtful EDA, outlier handling, and skewness analysis. Deploying with Streamlit taught us about state management, custom styling, and user experience in ML apps. Overall, it highlighted how machine learning can drive tangible retail improvements.

What's next for Supermarket Data Analysis and Sales Prediction

Future enhancements include integrating time-series models (e.g., Prophet or LSTM) for trend-based forecasting. Adding demand prediction per product line, inventory optimization suggestions, and promotion impact simulation would add value. We plan to support larger datasets, incorporate external factors like holidays or economic indicators, and deploy on cloud platforms for broader access. Exploring explainable AI (e.g., SHAP values) to show why predictions change will make insights even more actionable.

Updates

Snehit Parajuli started this project — Dec 19, 2025 02:01 PM EST
