This project leverages machine learning to predict whether a customer will subscribe to a bank's term deposit based on data collected from direct marketing campaigns. By analyzing features such as customer demographics, previous interactions, and financial data, we aim to optimize marketing strategies for future campaigns.
This repository contains the complete pipeline: data preprocessing, feature engineering, model building, hyperparameter tuning, and model evaluation.
The dataset used is from a Portuguese banking institution, consisting of 41,188 instances and 20 features. It contains customer data and outcomes from direct marketing campaigns involving phone calls. Key features include:
- Customer Attributes: `age`, `job`, `marital`, `education`, `balance`, `housing`, `loan`, etc.
- Contact Attributes: contact type (telephone, cellular), last contact day, `duration`, etc.
- Previous Campaign Data: `pdays`, `previous`, `poutcome` (outcome of the previous campaign).
- Target Variable: `subscribed` (whether the customer subscribed to a term deposit).
- Handled missing values using median imputation and default values for categorical features.
- Encoded categorical variables using One-Hot Encoding.
- Applied Min-Max scaling to normalize continuous features.
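
The preprocessing steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's actual code: the toy data frame, the choice of columns, and the `"unknown"` default value for categorical features are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame standing in for the real data (values are made up).
df = pd.DataFrame({
    "age": [30, 45, np.nan, 52],
    "balance": [1200.0, np.nan, 300.0, 5000.0],
    "job": ["admin.", np.nan, "technician", "admin."],
    "marital": ["single", "married", "married", np.nan],
}).astype({"job": object, "marital": object})

numeric_cols = ["age", "balance"]
categorical_cols = ["job", "marital"]

# Numeric features: median imputation, then Min-Max scaling to [0, 1].
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical features: fill with a default value (assumed "unknown"),
# then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Wrapping the steps in a single `ColumnTransformer` means the imputation and scaling statistics are learned only from the data passed to `fit`, which helps avoid leaking information from a test split.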
Objective: Identify key patterns and relationships between features and the target variable.
- Correlation Matrix: Assessed correlations between numerical features and the target variable.
- Univariate and Bivariate Analysis: Visualized distributions of important features (e.g., age, balance) and their relationships with the target.
- Class Imbalance: The dataset is highly imbalanced, with only ~11% positive class (i.e., subscribed). Addressed class imbalance using SMOTE (Synthetic Minority Over-sampling Technique).
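
As a rough sketch of the checks described above, the snippet below measures the class balance and ranks numerical features by their correlation with the target. The data is synthetic and generated only for illustration; the ~11% positive rate is taken from the text, while the other distributions are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real dataset: about 11% positives.
subscribed = (rng.random(n) < 0.11).astype(int)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "balance": rng.normal(1500, 800, size=n),
    # duration is made to depend on the target so a correlation is visible
    "duration": rng.normal(200, 50, size=n) + 150 * subscribed,
    "subscribed": subscribed,
})

# Share of each class in the target variable.
class_share = df["subscribed"].value_counts(normalize=True)

# Pearson correlation of each numerical feature with the target.
corr_with_target = (
    df.corr()["subscribed"].drop("subscribed").sort_values(ascending=False)
)

print(class_share)
print(corr_with_target)
```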
- Interaction Features: Created new interaction terms between `balance` and `pdays` to capture potential non-linear relationships.
- Domain-Specific Features: Developed features such as `contact rate per campaign` and `balance-duration ratio`.
- Temporal Features: Derived features based on the day of the week and time of contact to account for possible temporal effects on subscription likelihood.
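
One possible way to build features of this kind is sketched below. The README does not give the exact formulas, so every definition here is an assumption: the function name `add_engineered_features`, the output column names, the `+ 1` in the ratio denominator, the `day_of_week` column, and the treatment of `pdays == 999` as "never contacted before".

```python
import pandas as pd

# Hypothetical mapping from weekday labels to an ordinal index.
DAY_INDEX = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4}


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative engineered columns added."""
    out = df.copy()

    # Interaction term between balance and pdays. The sentinel 999
    # (assumed to mean "never contacted") is replaced by 0 first.
    pdays_clean = df["pdays"].where(df["pdays"] != 999, 0)
    out["balance_x_pdays"] = df["balance"] * pdays_clean

    # Assumed reading of "contact rate per campaign": the share of all
    # contacts with this client that happened in the current campaign.
    out["current_campaign_share"] = df["campaign"] / (
        df["campaign"] + df["previous"]
    )

    # Balance-duration ratio; +1 avoids division by zero for 0-second calls.
    out["balance_duration_ratio"] = df["balance"] / (df["duration"] + 1)

    # Simple temporal feature: weekday as an ordinal index.
    out["day_index"] = df["day_of_week"].map(DAY_INDEX)
    return out


demo = pd.DataFrame({
    "balance": [1000, 200],
    "pdays": [999, 5],
    "campaign": [2, 1],
    "previous": [0, 3],
    "duration": [99, 0],
    "day_of_week": ["mon", "fri"],
})
features = add_engineered_features(demo)
print(features)
```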
- Logistic Regression: As a baseline for comparison.
- Decision Trees: For interpretable predictions.
- Random Forest: Robust model for handling non-linear relationships and feature importance analysis.
- XGBoost: Gradient boosting for better generalization and handling of imbalanced classes.
- CatBoost: Evaluated due to its efficiency in handling categorical features without explicit encoding.
Utilized GridSearchCV and RandomizedSearchCV for hyperparameter optimization:
- Random Forest: Tuned `n_estimators`, `max_depth`, and `min_samples_split`.
- XGBoost: Tuned `learning_rate`, `max_depth`, `n_estimators`, and `subsample`.
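
A grid search over the Random Forest parameters named above might look like the sketch below. The candidate values, the F1 scoring choice, the 3-fold split, and the synthetic data are assumptions for illustration; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the real training data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.89, 0.11], random_state=0
)

# Candidate values are illustrative only.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # accuracy would be misleading on imbalanced classes
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` takes the same estimator and a `param_distributions` argument instead of `param_grid`, and samples a fixed number of combinations (`n_iter`), which is usually cheaper for larger search spaces such as the XGBoost one.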
- Accuracy: Simple baseline comparison.
- Precision: Focus on minimizing false positives in this business context.
- Recall: Important to avoid missing potential customers likely to subscribe.
- F1-Score: Harmonic mean of precision and recall to balance both.
- ROC-AUC Score: Evaluated the model's ability to discriminate between the classes.
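
To make the metric definitions concrete, here is a small worked example using scikit-learn on hand-made labels and scores (the arrays are invented for illustration). The last lines also check that the F1-score reported in this README is consistent with the reported precision and recall.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth and predicted probabilities for ten customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.2, 0.1, 0.9, 0.7, 0.4, 0.8]

# Threshold the probabilities at 0.5 to obtain hard predictions.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not labels
}
print(metrics)

# Consistency check on the reported results: F1 from precision and recall.
reported_f1 = 2 * 0.756 * 0.683 / (0.756 + 0.683)
print(round(reported_f1, 3))
```

Note that ROC-AUC is computed from the predicted probabilities, while the other four metrics depend on the chosen decision threshold.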
- Implemented SMOTE to oversample the minority class and improve recall.
- Tested class weights adjustment to further balance precision and recall.
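
The two approaches above can be illustrated as follows. The `smote_sketch` function is a simplified, hypothetical re-implementation of the SMOTE interpolation idea written for this example; it is not the implementation used in this project. In practice the `imbalanced-learn` package provides `SMOTE` with a `fit_resample` method. The data is synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.class_weight import compute_class_weight


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is its own nearest one.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(90, 2))  # majority class (label 0)
X_min = rng.normal(2.0, 1.0, size=(10, 2))  # minority class (label 1)
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)

# Option 1: oversample the minority class until both classes have 90 rows.
X_new = smote_sketch(X_min, n_new=80)
X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(80, dtype=int)])

# Option 2: keep the data as is and weight classes inversely to frequency.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(np.bincount(y_bal), weights)
```

Oversampling should be applied to the training split only; synthetic rows in the validation or test data would inflate the measured recall.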
- The best-performing model was XGBoost, achieving:
  - Accuracy: 90.5%
  - Precision: 75.6%
  - Recall: 68.3%
  - F1-Score: 71.8%
  - ROC-AUC: 92.2%
- Feature Importance (from XGBoost):
  - `duration`: The duration of the last contact.
  - `pdays`: Number of days since the client was last contacted.
  - `balance`: Customer's account balance.
  - `campaign`: Number of contacts during the current campaign.
  - `job`: Customer's occupation.
- The duration of the last contact was the most influential predictor, indicating the importance of engagement time in a successful subscription.
- The final model is deployed via a Flask API. It accepts customer data as input and returns the likelihood of subscription.
- Dockerized the API for easy integration with other banking systems.
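
A prediction endpoint of this kind could be sketched as below. The route name `/predict`, the required fields, the response key, and the `StubModel` class are all hypothetical placeholders for illustration; the actual API in `app/` may differ. The example is exercised with Flask's built-in test client rather than a running server.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Placeholder for the trained classifier (hypothetical)."""

    def predict_proba(self, rows):
        # Fixed [P(no), P(yes)] per row; a real model would be loaded from disk.
        return [[0.7, 0.3] for _ in rows]


model = StubModel()
REQUIRED_FIELDS = ("age", "balance", "duration")  # hypothetical subset


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="expected a JSON object"), 400
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        return jsonify(error=f"missing fields: {missing}"), 400
    row = [payload[f] for f in REQUIRED_FIELDS]
    probability = model.predict_proba([row])[0][1]
    return jsonify(subscription_probability=probability)


# Exercise the endpoint in-process with the test client.
client = app.test_client()
response = client.post(
    "/predict", json={"age": 35, "balance": 1200, "duration": 180}
)
print(response.status_code, response.get_json())
```

Validating the payload before calling the model keeps malformed requests from surfacing as server errors, which matters when other systems integrate with the API.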
- Experiment with neural networks to capture more complex patterns in high-dimensional data.
- Integrate real-time data to make the model adaptive to changing customer behaviors and market trends.
- Implement an A/B testing framework to continuously validate and improve the model in production.
├── data/ # Dataset and data processing scripts
├── notebooks/ # Jupyter notebooks for EDA and model building
├── models/ # Saved models and model training scripts
├── app/ # Flask app for deployment
├── Dockerfile # Docker configuration
├── README.md # Project documentation
└── requirements.txt # List of dependencies
Clone the repository:
git clone https://github.com/Gourav052003/Predicting-Customer-Engagement-in-Financial-Products-Insights-from-Marketing-Campaigns.git
cd Predicting-Customer-Engagement-in-Financial-Products-Insights-from-Marketing-Campaigns
Install dependencies:
pip install -r requirements.txt
Run the Jupyter notebook to train models:
jupyter notebook notebooks/Bank_Term_Deposit_Prediction.ipynb
Run the Flask API for predictions:
cd app
python app.py
- Detailed descriptions of models, algorithms, and hyperparameter tuning techniques.
- Emphasis on dealing with class imbalance using methods like SMOTE and class weight adjustments.
- Feature engineering techniques that demonstrate data-driven decision-making.
- Comprehensive evaluation metrics showing performance beyond just accuracy, including precision, recall, F1-score, and ROC-AUC.
- Future work that hints at more complex methods (e.g., neural networks, real-time predictions) and production considerations (e.g., Dockerization, API deployment).