Customer Purchase Prediction with Class Imbalance Handling

Project Overview:

This project predicts customer behavior, specifically whether a customer will make a next purchase, using machine learning. Because the dataset is imbalanced, the project evaluates models with precision, recall, and F1-score rather than accuracy alone, so that performance on the minority class is visible. The solution combines data cleaning, feature engineering, model optimization, and in-depth evaluation to improve predictions, especially for the rare class (customers who go on to make a next purchase).

Key Steps and Methodology:

Data Preprocessing:
- Handling Missing Values: The dataset is cleaned by addressing missing values, such as filling missing annual income with the mean.
- Date-Time Conversion: The Reg_Date column, initially an object type, is converted into a datetime format to enable time-based analysis.
- Encoding Categorical Data: The Edu_Level (education level) column is encoded using an ordinal approach to handle the ordered nature of education levels.
- Family Status Handling: Categories in the Family_Status column, including rare ones, are replaced with their frequency counts, and the original text column is then dropped to streamline the model.
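
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy rows, the education ordering in edu_order, and the Family_Status_Freq column name are all assumptions made for the example. Only the column names Annual_Income, Reg_Date, Edu_Level, and Family_Status come from the write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the write-up.
df = pd.DataFrame({
    "Annual_Income": [52000.0, np.nan, 61000.0, 48000.0],
    "Reg_Date": ["2023-01-15", "2022-11-03", "2024-02-20", "2023-07-09"],
    "Edu_Level": ["Bachelor", "High School", "Master", "Bachelor"],
    "Family_Status": ["Married", "Single", "Married", "Divorced"],
})

# Fill missing income values with the column mean.
df["Annual_Income"] = df["Annual_Income"].fillna(df["Annual_Income"].mean())

# Convert the registration date from object (string) to datetime.
df["Reg_Date"] = pd.to_datetime(df["Reg_Date"])

# Ordinal encoding; this ordering is an assumption for illustration.
edu_order = {"High School": 0, "Bachelor": 1, "Master": 2}
df["Edu_Level"] = df["Edu_Level"].map(edu_order)

# Frequency encoding: map each category to its count, then drop the
# original text column. "Family_Status_Freq" is a hypothetical name.
counts = df["Family_Status"].value_counts()
df["Family_Status_Freq"] = df["Family_Status"].map(counts)
df = df.drop(columns=["Family_Status"])

print(df)
```

If the mean is later computed for a train/test workflow, it would normally be taken from the training rows only so that no test information leaks into the imputation.
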
Exploratory Data Analysis (EDA):
- Data Distribution: Visualizations like histograms and bar plots highlight class distribution and reveal imbalances in the target variable (Next_Purchase).
- Correlation Analysis: A heatmap is used to analyze the correlations between numeric features, aiding in understanding how features interact.
- Outlier Detection: Box plots are created to identify outliers in the dataset. Special attention is given to handling these outliers in the Annual_Income column.
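
The numbers behind these plots can be computed directly, as in the sketch below. It uses synthetic data, a hypothetical Age column, and the common 1.5 x IQR rule with clipping for outliers. The write-up does not say which outlier rule or treatment was actually used, so that part is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Annual_Income": rng.normal(50000, 8000, n),
    "Age": rng.integers(18, 70, n),  # hypothetical extra numeric feature
    "Next_Purchase": (rng.random(n) < 0.1).astype(int),
})
df.loc[0, "Annual_Income"] = 500000.0  # planted outlier for the demo

# Class distribution: the data behind a bar plot of the target.
class_share = df["Next_Purchase"].value_counts(normalize=True)
print(class_share)

# Correlation matrix: the data behind a heatmap.
corr = df.corr()
print(corr.round(2))

# IQR rule: the same fences a box plot draws its whiskers from.
q1, q3 = df["Annual_Income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["Annual_Income"] < lower) | (df["Annual_Income"] > upper)]
print(f"{len(outliers)} outlier rows outside [{lower:.0f}, {upper:.0f}]")

# One possible treatment (an assumption): clip values to the fences.
df["Annual_Income"] = df["Annual_Income"].clip(lower, upper)
```

Plotting is left out to keep the example short; seaborn's heatmap and boxplot functions can be given corr and the Annual_Income column respectively.
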
Model Training and Evaluation:
- Splitting Data: The dataset is split into training and testing sets, ensuring that the models are evaluated on unseen data.
- Feature Scaling: Numerical features are scaled using StandardScaler to standardize the data and improve model performance.
- Model Selection: Two classification models, XGBoost and Random Forest, are trained and evaluated. XGBoost is a gradient-boosted tree method known for its efficiency on large datasets. Random Forest provides feature importance estimates and can cope with imbalanced data, for example through class weighting.
- Handling Class Imbalance: Given the imbalance in the target variable, focus is placed on evaluating models using precision, recall, and F1-score rather than accuracy to ensure that both majority and minority classes are treated fairly.
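
A minimal sketch of the split, scale, train, and evaluate flow is shown below, using only Random Forest on synthetic imbalanced data. The stratified split, the class_weight="balanced" setting, and all hyperparameter values are assumptions for illustration, not settings confirmed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly a 90/10 class split (stand-in for the real set).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Stratify so train and test keep the same class ratio (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply it to both splits.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# Per-class precision, recall, and F1 instead of a single accuracy number.
print(classification_report(y_test, y_pred, digits=3))
minority_f1 = f1_score(y_test, y_pred, pos_label=1)
print(f"Minority-class F1: {minority_f1:.3f}")
```

XGBoost's XGBClassifier follows the same fit and predict interface, so it could be swapped in for the Random Forest in this sketch. Tree-based models do not strictly need scaled inputs; the scaler is kept here only to mirror the steps described above.
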
Model Optimization:
- Hyperparameter Tuning: Both models are tuned to improve predictive performance, with a particular focus on recall for the minority class.
- Evaluation Metrics: Classification reports and confusion matrices are used to understand each model's performance on both classes, with emphasis on recall for the minority class.
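
One way to tune for minority-class recall is a grid search scored on recall, sketched below. The write-up does not state which search method or parameter grid was used, so GridSearchCV, the grid values, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (stand-in for the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; the values are not taken from the project.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 5]}

# scoring="recall" selects the configuration with the best recall on
# the positive class (label 1), which is the minority class here.
search = GridSearchCV(
    RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Confusion matrix on held-out data: rows are true labels, columns predicted.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
minority_recall = recall_score(y_test, y_pred, pos_label=1)
print(cm)
print(f"Minority recall: {minority_recall:.3f} ({tp} of {tp + fn} found)")
```

Optimizing recall alone can push precision down, so in practice F1 or a precision-recall trade-off is often checked alongside it.
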
Final Prediction and Output:
- The final predictions for the test dataset are made using the trained models, and the results are stored in a CSV file. The User_Key and predicted Next_Purchase values are saved for submission, ensuring that each customer's prediction is tied to a unique identifier.
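
Writing the submission file might look like the sketch below. The User_Key values and predictions are made up, and the output goes to an in-memory buffer so the example runs without touching disk. In a real run, a file path (for example a hypothetical "submission.csv") would be passed to to_csv instead.

```python
import io

import pandas as pd

# Hypothetical test-set identifiers and model predictions.
test_df = pd.DataFrame({"User_Key": [101, 102, 103]})
predictions = [0, 1, 0]

# Pair each prediction with its customer identifier.
submission = pd.DataFrame({
    "User_Key": test_df["User_Key"],
    "Next_Purchase": predictions,
})

# Guard against duplicated identifiers before writing.
if not submission["User_Key"].is_unique:
    raise ValueError("User_Key values must be unique")

# index=False keeps the pandas row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```
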

Key Insights:

- The project demonstrates how to handle class imbalance in a predictive modeling task, ensuring that the minority class (next purchase prediction) is effectively predicted.
- Feature engineering and proper handling of categorical variables like Edu_Level and Family_Status contribute to improved model performance.
- By using advanced machine learning algorithms like XGBoost and Random Forest, the project achieves robust predictions that are critical in customer behavior analysis, applicable to fields such as marketing, customer retention, and churn prediction.

Project Impact:

This solution can help businesses predict customer purchasing behavior and take proactive steps to engage high-potential customers. By evaluating performance on both the majority and minority classes, the project aims for predictions that are fair and accurate across classes, which can support better decisions in targeted marketing, promotions, and personalized customer outreach.

Built With

matplotlib
numpy
pandas
scikit-learn
seaborn

Updates

Waris Hayat started this project — Feb 01, 2025 09:12 AM EST
