Inspiration
The inspiration for this project came from the critical challenge businesses face in retaining customers. Acquiring a new customer can cost 5–7 times more than keeping an existing one. I realized that if companies could predict customer churn in advance, they could take proactive steps to improve satisfaction and loyalty.
I wanted to explore ensemble machine learning techniques to see if combining multiple models could outperform single-model predictions.
What it does
- Takes in customer data (like tenure, contract type, monthly charges, payment method).
- Predicts whether a customer is likely to leave (churn) or stay.
- Uses ensemble techniques (combining multiple models like Random Forest, Gradient Boosting, Logistic Regression) to improve prediction accuracy.
- Helps businesses take proactive retention actions — such as targeted discounts, better customer support, or loyalty programs — before the customer leaves.
Think of it like an early warning system for customer loss, so companies can prevent revenue leakage.
How we built it
Data Collection & Cleaning
- Used a telecom customer dataset with features like tenure, monthly charges, contract type, and payment method.
- Handled missing values and encoded categorical variables using One-Hot Encoding.
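The cleaning step can be sketched roughly as follows; the column names and fill strategy here are illustrative assumptions, not the project's exact code:

```python
import pandas as pd

# Toy stand-in for the telecom dataset (column names are illustrative).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, None, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Bank transfer"],
})

# Handle missing values: fill numeric gaps with the column median.
df["MonthlyCharges"] = df["MonthlyCharges"].fillna(df["MonthlyCharges"].median())

# One-Hot Encode the categorical variables.
df_encoded = pd.get_dummies(df, columns=["Contract", "PaymentMethod"])
print(df_encoded.columns.tolist())
```

In a full pipeline, `sklearn.preprocessing.OneHotEncoder` inside a `ColumnTransformer` is an alternative that handles unseen categories at prediction time.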
Exploratory Data Analysis (EDA)
- Plotted the churn distribution.
- Analyzed feature correlations with churn using heatmaps.
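A minimal sketch of that EDA, assuming a numeric binary `Churn` column (the data below is synthetic, and the heatmap call is mentioned in a comment rather than drawn):

```python
import pandas as pd

# Illustrative frame with a binary churn label (column names assumed).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [70.0, 56.9, 89.1, 42.3, 99.6],
    "Churn": [1, 0, 1, 0, 1],
})

# Churn distribution (the tabular equivalent of the distribution plot).
print(df["Churn"].value_counts())

# Feature correlations with churn; in the project this was visualised
# as a heatmap, e.g. seaborn.heatmap(df.corr(), annot=True).
print(df.corr()["Churn"].sort_values(ascending=False))
```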
Feature Engineering
- Created derived features like tenure_group to capture customer lifecycle stages.
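One common way to build a `tenure_group` feature is binning with `pd.cut`; the bin edges below are illustrative assumptions, not the project's actual cut-offs:

```python
import pandas as pd

df = pd.DataFrame({"tenure": [1, 8, 20, 40, 70]})

# Bucket tenure (in months) into lifecycle stages; bin edges are assumed.
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
)
print(df)
```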
Model Building
- Trained multiple classifiers: Logistic Regression, Random Forest, Gradient Boosting, and XGBoost.
Model Evaluation
- Compared precision, recall, F1-score, and ROC-AUC.
- Selected the ensemble model with the highest ROC-AUC for balanced performance.
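The train-and-compare loop can be sketched like this (synthetic data stands in for the telecom set, and XGBoost is omitted to keep the sketch scikit-learn only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the churn data.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and score it on held-out data by ROC-AUC.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC-AUC = {scores[name]:.3f}")
```

ROC-AUC is a sensible headline metric here because it is threshold-independent, which matters when the churn/no-churn classes are imbalanced.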
Challenges we ran into
- Imbalanced Data: churners were far fewer than non-churners, which biased predictions toward the majority class. Solved using SMOTE oversampling.
- Overfitting: boosting models tended to overfit the training data; mitigated with regularization parameters.
- Feature Importance Interpretation: understanding which features drove churn required careful SHAP value analysis.
- Model Selection: balancing recall (catching churners) against precision (avoiding false alarms) was tricky.
What we learned
During this project, I learned:
- How imbalanced datasets can hurt model performance, and how techniques like SMOTE can help.
- The strengths of different ensemble algorithms:
  - Bagging (e.g., Random Forest) for variance reduction.
  - Boosting (e.g., Gradient Boosting, XGBoost) for bias reduction.
  - Voting Classifiers for combining multiple models.
- How to evaluate models beyond accuracy, using metrics such as precision, recall, F1-score, and ROC-AUC.
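A minimal sketch of the voting idea, combining a linear model (low variance) with a bagging ensemble; soft voting averages the models' predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft voting averages the predicted probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),       # linear baseline
        ("rf", RandomForestClassifier(random_state=0)),  # bagging: variance reduction
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
print(f"Voting ensemble ROC-AUC: {auc:.3f}")
```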
What's next for customer churn prediction in banking
- Integrate real-time churn prediction into a web dashboard.
- Experiment with Stacking Classifiers for potentially better performance.
- Incorporate customer sentiment analysis from feedback text data.
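The stacking experiment could start from scikit-learn's built-in `StackingClassifier`; the base models and meta-learner below are illustrative choices, not a finalized design:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=1)

# Stacking trains a meta-learner (here logistic regression) on the
# cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=1)),
        ("gb", GradientBoostingClassifier(random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
print(f"Training accuracy: {stack.score(X, y):.3f}")
```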
Built With
- matplotlib
- numpy
- pandas
- pycharm
- python
- scikit-learn
- seaborn
- xgboost