Inspiration
The inspiration for this project came from the critical challenge businesses face in retaining customers. Acquiring a new customer can cost 5–7 times more than keeping an existing one. I realized that if companies could predict customer churn in advance, they could take proactive steps to improve satisfaction and loyalty.
I wanted to explore ensemble machine learning techniques to see if combining multiple models could outperform single-model predictions.
What it does
- Takes in customer data (like tenure, contract type, monthly charges, payment method).
- Predicts whether a customer is likely to leave (churn) or stay.
- Uses ensemble techniques (combining multiple models like Random Forest, Gradient Boosting, Logistic Regression) to improve prediction accuracy.
- Helps businesses take proactive retention actions — such as targeted discounts, better customer support, or loyalty programs — before the customer leaves.
Think of it like an early warning system for customer loss, so companies can prevent revenue leakage.
How we built it
Data Collection & Cleaning
- Used a telecom customer dataset with features like tenure, monthly charges, contract type, and payment method.
- Handled missing values and encoded categorical variables using One-Hot Encoding.
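The cleaning step can be sketched roughly as follows; the column names and fill strategy here are illustrative assumptions, not the project's exact code:

```python
import pandas as pd

# Toy stand-in for the telecom dataset (column names are illustrative).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, None, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Bank transfer"],
})

# Handle missing values: fill numeric gaps with the column median.
df["MonthlyCharges"] = df["MonthlyCharges"].fillna(df["MonthlyCharges"].median())

# One-Hot Encode the categorical variables.
df_encoded = pd.get_dummies(df, columns=["Contract", "PaymentMethod"])
print(df_encoded.columns.tolist())
```

In a full pipeline, `sklearn.preprocessing.OneHotEncoder` inside a `ColumnTransformer` is an alternative that handles unseen categories at prediction time.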
Exploratory Data Analysis (EDA)
- Plotted the churn distribution.
- Analyzed feature correlations with churn using heatmaps.
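A minimal sketch of that EDA, assuming a numeric binary `Churn` column (the data below is synthetic, and the heatmap call is mentioned in a comment rather than drawn):

```python
import pandas as pd

# Illustrative frame with a binary churn label (column names assumed).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [70.0, 56.9, 89.1, 42.3, 99.6],
    "Churn": [1, 0, 1, 0, 1],
})

# Churn distribution (the tabular equivalent of the distribution plot).
print(df["Churn"].value_counts())

# Feature correlations with churn; in the project this was visualised
# as a heatmap, e.g. seaborn.heatmap(df.corr(), annot=True).
print(df.corr()["Churn"].sort_values(ascending=False))
```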
Feature Engineering
- Created derived features like tenure_group to capture customer lifecycle stages.
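One common way to build a `tenure_group` feature is binning with `pd.cut`; the bin edges below are illustrative assumptions, not the project's actual cut-offs:

```python
import pandas as pd

df = pd.DataFrame({"tenure": [1, 8, 20, 40, 70]})

# Bucket tenure (in months) into lifecycle stages; bin edges are assumed.
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
)
print(df)
```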
Model Building
- Trained multiple classifiers: Logistic Regression, Random Forest, Gradient Boosting, and XGBoost.
Model Evaluation
- Compared precision, recall, F1-score, and ROC-AUC.
- Selected the ensemble model with the highest ROC-AUC for balanced performance.
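The train-and-compare loop can be sketched like this (synthetic data stands in for the telecom set, and XGBoost is omitted to keep the sketch scikit-learn only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the churn data.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and score it on held-out data by ROC-AUC.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC-AUC = {scores[name]:.3f}")
```

ROC-AUC is a sensible headline metric here because it is threshold-independent, which matters when the churn/no-churn classes are imbalanced.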
Challenges we ran into
- Imbalanced Data: churners were far fewer than non-churners, which biased predictions toward the majority class. Solved using SMOTE oversampling.
- Overfitting: boosting models tended to overfit the training data; mitigated with regularization parameters.
- Feature Importance Interpretation: understanding which features drove churn required careful SHAP value analysis.
- Model Selection: balancing recall (catching churners) against precision (avoiding false alarms) was tricky.
What we learned
During this project, I learned:
- How imbalanced datasets can hurt model performance, and how techniques like SMOTE can help.
- The strengths of different ensemble algorithms:
  - Bagging (e.g., Random Forest) for variance reduction.
  - Boosting (e.g., Gradient Boosting, XGBoost) for bias reduction.
  - Voting Classifiers for combining multiple models.
- How to evaluate models beyond accuracy, using metrics such as precision, recall, F1-score, and ROC-AUC.
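A minimal sketch of the voting idea, combining a linear model (low variance) with a bagging ensemble; soft voting averages the models' predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft voting averages the predicted probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),       # linear baseline
        ("rf", RandomForestClassifier(random_state=0)),  # bagging: variance reduction
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
print(f"Voting ensemble ROC-AUC: {auc:.3f}")
```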
What's next for customer churn prediction in banking
- Integrate real-time churn prediction into a web dashboard.
- Experiment with Stacking Classifiers for potentially better performance.
- Incorporate customer sentiment analysis from feedback text data.
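The stacking experiment could start from scikit-learn's built-in `StackingClassifier`; the base models and meta-learner below are illustrative choices, not a finalized design:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=1)

# Stacking trains a meta-learner (here logistic regression) on the
# cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=1)),
        ("gb", GradientBoostingClassifier(random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
print(f"Training accuracy: {stack.score(X, y):.3f}")
```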
Built With
- matplotlib
- numpy
- pandas
- pycharm
- python
- scikit-learn
- seaborn
- xgboost