This project focuses on predicting customer churn and segmenting customers using machine learning. The goal is to identify which customers are likely to leave and understand customer behavior to improve retention strategies.
The project is structured into three key steps:
- EDA, Data Cleaning & Feature Engineering
- Classification - Churn Prediction
- Segmentation - Customer Clustering
The dataset consists of 10,000 bank customers, with information on credit scores, account balances, estimated salaries, number of products, geography, and customer activity status. The target variable is `Exited`, where:
- `Exited = 1` means the customer has churned.
- `Exited = 0` means the customer has remained with the bank.
I started by importing and merging the customer and account information datasets using `CustomerID` as the common key.
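In code, this step looks roughly like the following sketch (the filenames `customers.csv` and `accounts.csv` are placeholders for the actual source files):

```python
import pandas as pd

# Load the two source files (filenames are illustrative)
customers = pd.read_csv("customers.csv")
accounts = pd.read_csv("accounts.csv")

# Merge on the shared CustomerID key, keeping only customers present in both tables
df = customers.merge(accounts, on="CustomerID", how="inner")
print(df.shape)
```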
- Removed duplicate records and inconsistent labels.
- Converted currency fields to numeric values by stripping symbols.
- Handled missing values by replacing missing categorical data with `"MISSING"` and filling missing numerical values with the median.
- Standardized geographical values (e.g., `"FRA"`, `"France"`, `"French"` → `"France"`).
- Replaced clearly incorrect salary values (e.g., the sentinel `-999999`) with the median.
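The cleaning logic follows roughly this pattern (a sketch, assuming the merged DataFrame is named `df`; column names and the exact rules may differ from the notebook):

```python
import numpy as np

# Drop exact duplicate rows
df = df.drop_duplicates()

# Strip currency symbols / separators from the balance column and convert to float
df["Balance"] = (
    df["Balance"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True).astype(float)
)

# Fill missing values: "MISSING" for categoricals, the median for numericals
for col in df.select_dtypes(include="object"):
    df[col] = df[col].fillna("MISSING")
for col in df.select_dtypes(include=np.number):
    df[col] = df[col].fillna(df[col].median())

# Standardize geography labels
df["Geography"] = df["Geography"].replace({"FRA": "France", "French": "France"})

# Treat sentinel salary values such as -999999 as missing, then impute with the median
df["EstimatedSalary"] = df["EstimatedSalary"].mask(df["EstimatedSalary"] < 0)
df["EstimatedSalary"] = df["EstimatedSalary"].fillna(df["EstimatedSalary"].median())
```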
To improve the model’s predictive power, I created two new features:
- Balance-to-Income Ratio: `Balance_v_income = Balance / EstimatedSalary`
- Income per Product: `income_v_product = EstimatedSalary / NumOfProducts`
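In pandas these two features are one-liners (a sketch; the small epsilon guarding against division by zero is my addition, not necessarily part of the original notebook):

```python
# Ratio of account balance to estimated salary
df["Balance_v_income"] = df["Balance"] / (df["EstimatedSalary"] + 1e-9)

# Estimated salary spread over the number of products held
df["income_v_product"] = df["EstimatedSalary"] / df["NumOfProducts"]
```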
To understand customer churn behavior, I visualized relationships between variables:
- Churn rates across geography and gender using bar plots.
- Numerical features vs. churn using box plots and histograms.
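For example, the churn-rate bar plot by geography can be reproduced with a few lines (a sketch, assuming the columns `Geography` and `Exited`):

```python
import matplotlib.pyplot as plt

# The mean of the binary Exited flag per geography is the churn rate
churn_by_geo = df.groupby("Geography")["Exited"].mean()
churn_by_geo.plot(kind="bar")
plt.ylabel("Churn rate")
plt.title("Churn rate by geography")
plt.tight_layout()
plt.show()
```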
Key insights from the EDA:
- German customers have nearly twice the churn rate of French and Spanish customers.
- Inactive members and those without credit cards are more likely to churn.
- Customers with a higher balance-to-income ratio tend to churn more often.
With the data cleaned and new features added, I built machine learning models to predict whether a customer would churn.
- Converted categorical variables into dummy variables.
- Split the data into training and test sets (80% training, 20% testing).
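A sketch of that preparation step (the stratified split and the dropped ID column are assumptions on my part):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode categoricals; drop_first avoids redundant dummy columns
X = pd.get_dummies(df.drop(columns=["Exited", "CustomerID"]), drop_first=True)
y = df["Exited"]

# 80/20 split; stratify keeps the churn ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```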
I trained and evaluated two models:
- Logistic Regression
- Random Forest Classifier
Performance was assessed using:
- Accuracy, Precision, Recall, F1-score, and ROC-AUC
- Confusion matrices to compare predicted vs. actual churn
- Precision-recall curves to adjust model thresholds
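The training and evaluation loop looks roughly like this (a sketch; the model settings shown are illustrative defaults, not the tuned values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]   # churn probabilities for ROC-AUC
    preds = model.predict(X_test)               # hard labels for the other metrics
    print(f"--- {name} ---")
    print("ROC-AUC:", round(roc_auc_score(y_test, proba), 3))
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
```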
- Logistic Regression achieved an AUC of 0.77, providing a simple and interpretable model.
- Random Forest performed better with an AUC of 0.85, but showed slight overfitting.
To reduce overfitting, I fine-tuned the Random Forest with `RandomizedSearchCV` and `GridSearchCV` to optimize its hyperparameters.
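A sketch of the randomized-search stage (the parameter ranges shown are illustrative, not the exact grid used):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=30,
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

A narrower `GridSearchCV` around the best parameters found here can then refine them further.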
The most influential features in predicting churn were:
- Age (strongest predictor)
- Number of Products, Balance, and Credit Score
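The importances come straight from the fitted forest (a sketch, assuming `search.best_estimator_` holds the tuned model):

```python
import pandas as pd

best_rf = search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```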
Beyond churn prediction, I wanted to understand different types of customers by applying unsupervised learning.
- Standardized numerical features for better clustering.
- Used the elbow method to determine the optimal number of clusters.
- Analyzed the characteristics of each cluster.
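A sketch of that workflow, assuming K-Means behind the elbow method (the write-up above does not name the algorithm explicitly):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale numerical features so no single feature dominates the distance metric
X_num = df.select_dtypes(include="number").drop(columns=["Exited", "CustomerID"], errors="ignore")
X_scaled = StandardScaler().fit_transform(X_num)

# Elbow method: plot inertia for a range of k values and look for the bend
inertias = []
for k in range(1, 11):
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_)
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()

# Fit the chosen model (three clusters, matching the segments described below)
df["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
print(df.groupby("Cluster").mean(numeric_only=True))
```

Comparing the per-cluster feature means is what surfaced the three segments below.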
- Cluster 1: High-churn risk, low income → Customers in this group may need personalized offers.
- Cluster 2: Financially stable, high balance → Upselling premium financial products could be effective.
- Cluster 3: Young and active customers → Engaging these customers early could improve retention.
To replicate this analysis, follow these steps:
- Clone the repository:

```bash
git clone https://github.com/yourusername/Bank-Churn-ML.git
cd Bank-Churn-ML
```