Skip to content

A Machine Learning classification project that predicts customer churn for a bank.

License

Notifications You must be signed in to change notification settings

KOrfanakis/Predicting_Customer_Churn_Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation



A Machine Learning classification project that predicts customer churn for a bank.


Nbviewer Made With Jupyter Licence Star if useful


Table of Contents:


Motivation

Imagine that we are hired as data scientists by a major commercial bank with several branches across Europe. Its finance division decided that the bank could benefit from reducing the number of customers that churn, i.e. stop using its services. Therefore, they compiled a dataset containing information on the bank’s current and past customers, such as their credit score, age, account balance, etc. Most importantly, the dataset contains a column stating whether the customer has churned or not.

We are assigned to use the dataset to gain insights into why customers are leaving and build a Machine Learning (ML) model that predicts whether a customer is likely to churn or not.

For more information on customer churn and why it is important, please refer to the introductory section of the notebook.


Skills: Exploratory Data Analysis, Data Visualisation, Data Pre-processing (Feature Selection, Encoding Categorical Features, Feature Scaling), Addressing Class Imbalance, Model Tuning, Ensemble Learning.

Models Used: Logistic Regression, Support Vector Machines, Random Forests, Gradient Boosting, XGBoost, and Light Gradient Boosting Machine.


Business Objective

The first question we need to ask our employers is how the bank expects to benefit from our model. Our employers answer that they will use our model’s output and create a list of customers more likely to churn. Then, the bank will offer additional services and special offers to those customers in an effort to increase customer satisfaction. The total amount will be equal to £1,000 per customer. The bank estimates that the gain from each customer retained will be approximately equal to £5,000 per customer, i.e. five times the initial investment.

The next question to ask our employers is what the current solution looks like (if any). Our employers answer that a customer’s likelihood of churning is estimated manually by experts: a team gathers up-to-date information about a client and uses complex rules to come up with a prediction. This strategy is costly and time-consuming, and their estimates are not always excellent. Therefore, apart from the financial benefit, building a successful model will free up some time for the experts to work on more interesting and productive tasks.


Data

The dataset can be retrieved from Kaggle, following this link. The dataset consists of 14 columns (synonyms: features or attributes) and 10K rows (synonyms: instances or samples). The last column, ‘Exited’, is the target variable and indicates whether the customer has churned (0 = No, 1 = Yes). The meaning of the rest of the features can be easily inferred from their name.


Results

We used eight classification algorithms to build our models. The algorithms are:

  1. Logistic Regression,
  2. Support Vector Classifier,
  3. Random Forest Classifier,
  4. Balanced Random Forest Classifier,
  5. Gradient Boosting Classifier,
  6. Xtreme Gradient Boosting Classifier,
  7. Light Gradient Boosting Machine, and
  8. An ensemble of the above classifiers.

After establishing a baseline, hyperparameter tuning was used to calculate their optimal performance on the training set (consisting of 8,000 customers). The best-performing model with an AUC score of ~0.865 is the optimised model based on the LGBM classifier.

The model was tested on unseen data, using customer instances not used for training. The test sample consists of 2,000 customers, maintaining the ratio of churned/remained customers of the training set. Using the assumptions mentioned in the Business Objective section, the bank would make a total profit of £900,000! As a comparison, our baseline model (based on the Gaussian Naïve Bayes algorithm) yields a total profit of £456,000.


References

A complete list of references is provided at the end of the notebook. I used the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.) to construct the Business Objective section of this file.


Feedback

If you have any feedback or ideas to improve this project, feel free to contact me via:

Twitter LinkedIn