- Introduction
- Datasets
- Exploratory Data Analysis
- Machine Learning
- Explainability with SHAP
- Requirements
## Introduction

This project aims to build a robust model for predicting customer churn using the Telco Customer Dataset from IBM. Churn prediction is crucial for businesses to identify customers who are likely to leave, enabling them to take proactive measures to retain these customers. The focus of this project is not only on achieving accurate predictions but also on ensuring that these predictions are interpretable and actionable.
To achieve this, the project integrates the SHAP (SHapley Additive exPlanations) algorithm, a powerful tool for model explainability. SHAP provides both global and local explanations, helping data scientists and business stakeholders understand which factors contribute most to customer churn. This dual focus on prediction accuracy and explainability is essential in making the model’s insights both trustworthy and actionable for business decisions.
The analysis provides valuable insights into the risk factors for customer churn, guiding companies on where to focus their retention efforts. This combination of predictive modeling and explainability offers a comprehensive tool for businesses looking to reduce churn and improve customer loyalty.
## Datasets

These datasets were taken from here.

These are Telecommunications Industry sample data provided by IBM. The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service. Multiple important demographics are included for each customer, as well as a Satisfaction Score, a Churn Score, and a Customer Lifetime Value (CLTV) index.
I analyzed the following datasets:
- Demographics information
- Location information
- Population information
- Services information
- Status information
- A complete Customer Churn dataset, which is the one I used for the predictions and explanations
## Exploratory Data Analysis

In this project, I applied Object-Oriented Programming (OOP) principles by creating a custom EDA class here. This class encapsulates all the methods needed for statistical analysis and data visualization across the different datasets (a minimal sketch is shown after the list below). The use of OOP provided several advantages:
- Code Reusability: By defining the EDA class, I was able to reuse the same code structure across various datasets, eliminating redundancy and ensuring consistency in the analysis process.
- Scalability: The class-based approach allowed me to easily extend and adapt the analysis to new datasets with minimal changes, making the code highly scalable.
- Modularity: Encapsulating the EDA functionality within a class enhanced the modularity of the code, simplifying maintenance and future enhancements.
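As an illustration only (the actual class in the repository may be organised differently, and the method names below are hypothetical), such a reusable EDA helper could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


class EDA:
    """Minimal sketch of a reusable EDA helper (the real class in the repo may differ)."""

    def __init__(self, df: pd.DataFrame, name: str = "dataset"):
        self.df = df
        self.name = name

    def summary(self) -> pd.DataFrame:
        # Descriptive statistics for every column, including non-numeric ones.
        return self.df.describe(include="all").T

    def missing_values(self) -> pd.Series:
        # Count of missing values per column, sorted descending.
        return self.df.isna().sum().sort_values(ascending=False)

    def plot_distribution(self, column: str) -> None:
        # Histogram for numeric columns, bar chart of counts for categorical ones.
        if pd.api.types.is_numeric_dtype(self.df[column]):
            sns.histplot(self.df[column], kde=True)
        else:
            self.df[column].value_counts().plot(kind="bar")
        plt.title(f"{self.name}: distribution of {column}")
        plt.tight_layout()
        plt.show()


# The same class can be instantiated for any of the Telco datasets, e.g.:
# eda = EDA(pd.read_csv("Telco_customer_churn.csv"), name="churn")
# eda.summary(); eda.missing_values(); eda.plot_distribution("Churn Value")
```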
The results and insights obtained from the EDA are comprehensively documented within the notebook's markdown cells. Each step is clearly explained, commented on, and supported by visualizations that aid in understanding the underlying data patterns.
One of the highlights of the EDA was the visualization of geographical data using GeoPandas. By mapping the latitude and longitude points of the dataset onto a map of California, sourced from Natural Earth Data, I was able to confirm that all spatial points indeed fell within California's borders. This spatial analysis provided valuable context and ensured the geographical integrity of the dataset. The use of GeoPandas not only facilitated the accurate plotting of points but also allowed for seamless integration with base maps, offering a clear and informative geographical visualization. Here, you can see the plot:
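As a rough sketch of how such a map can be produced (the file paths, Natural Earth layer, and column names below are assumptions, not necessarily the ones used in the notebook):

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical paths; the IBM location dataset exposes Latitude/Longitude columns.
locations = pd.read_csv("Telco_customer_churn_location.csv")
states = gpd.read_file("ne_110m_admin_1_states_provinces.shp")  # Natural Earth states layer
california = states[states["name"] == "California"]

# Convert the customer coordinates into a GeoDataFrame of points (WGS84).
points = gpd.GeoDataFrame(
    locations,
    geometry=gpd.points_from_xy(locations["Longitude"], locations["Latitude"]),
    crs="EPSG:4326",
)

# Plot California as the base map and overlay the customer locations.
ax = california.plot(color="lightgrey", edgecolor="black", figsize=(8, 8))
points.plot(ax=ax, markersize=2, color="tab:blue")
ax.set_title("Customer locations within California")
plt.show()
```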
## Machine Learning

In this project, I implemented a robust and scalable machine learning pipeline by creating a custom ModelPipeline class, located in the ml_model.py file. This object-oriented approach provides several key advantages, such as code reusability, scalability, and the flexibility to easily extend the pipeline with new models and techniques.
The ModelPipeline class is designed with several default machine learning models, including:
- Logistic Regression
- Random Forest
- XGBoost

Additionally, I integrated CatBoost, which is particularly effective for handling categorical features and imbalanced datasets. The modular design of the ModelPipeline class allows for seamless integration of these models and makes it easy to switch between them, depending on the specific needs of the analysis.
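A minimal sketch of this design, with hypothetical names (the real ml_model.py may be organised quite differently):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


class ModelPipeline:
    """Sketch of a pipeline holding interchangeable models (illustrative only)."""

    def __init__(self, models=None, random_state=42):
        # Default models; any estimator with fit/predict can be registered.
        self.models = models or {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "random_forest": RandomForestClassifier(random_state=random_state),
            "xgboost": XGBClassifier(eval_metric="logloss", random_state=random_state),
        }

    def add_model(self, name, estimator):
        self.models[name] = estimator


# CatBoost can be registered alongside the defaults and swapped in at will.
# pipeline = ModelPipeline()
# pipeline.add_model("catboost", CatBoostClassifier(verbose=0, random_state=42))
```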
A significant challenge in churn prediction is dealing with imbalanced classes. To address this, I incorporated methods such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) within the ModelPipeline. These techniques were applied during the stratified cross-validation process to balance the classes and improve model performance.
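A minimal sketch of this idea using imbalanced-learn, so that SMOTE is fitted only on the training folds of each split and the validation folds keep their original class distribution (names and parameters are illustrative, not the exact project code):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Putting the resampler inside the pipeline avoids leakage: SMOTE only sees training folds.
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", XGBClassifier(eval_metric="logloss", random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# X, y: the preprocessed feature matrix and the churn labels prepared in the notebook.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```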
The ModelPipeline class also includes methods for evaluating model performance, both during training and on held-out test data. I used a range of metrics to assess the models' effectiveness, ensuring a comprehensive understanding of their predictive capabilities.
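As an illustration (the specific metric choices below are assumptions, not necessarily the ones reported in the notebooks), a held-out evaluation reusing the pipeline and the X, y data sketched above might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Hypothetical stratified hold-out split for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))       # precision, recall, F1 per class
print("ROC AUC:", roc_auc_score(y_test, y_proba))  # threshold-independent ranking quality
```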
All results, including model performance metrics, are fully documented and discussed in the Jupyter notebooks. The Markdown cells within these notebooks provide detailed explanations and insights, making it easy to understand the outcomes and implications of each experiment. For a deeper understanding, I recommend reviewing the notebooks directly.
## Explainability with SHAP

SHAP (SHapley Additive exPlanations) is a powerful tool for model interpretability, based on Shapley values from cooperative game theory. It provides a unified measure of feature importance by attributing the contribution of each feature to every individual prediction. SHAP's key advantage lies in its ability to deliver both global and local interpretability, offering insights into model behavior across the entire dataset and for individual predictions.
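For reference, the Shapley value that SHAP estimates for a feature $i$ in a single prediction is the classical game-theoretic quantity:

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f_{S\cup\{i\}}\big(x_{S\cup\{i\}}\big) - f_S\big(x_S\big)\right]
$$

where $F$ is the full feature set and $f_S$ denotes the model output when only the features in subset $S$ are used; SHAP approximates this efficiently, in particular for tree-based models.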
In this project, I utilized SHAP with models that performed well with SMOTE, specifically XGBoost and CatBoost. XGBoost was chosen for its high performance with balanced datasets, while CatBoost was included to evaluate its behavior with categorical features and its integration with SHAP. All analyses were conducted using stratified cross-validation to ensure robust results.
I performed a range of global and local analyses to understand model predictions better:
- Global Analysis: Beeswarm plots and bar plots were used to visualize overall feature importance and the distribution of Shapley values across features.
- Local Analysis: Detailed examination of individual predictions was carried out using force plots, decision plots, and waterfall plots.
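A minimal sketch of how these plots can be produced with the modern SHAP API, assuming a fitted tree-based model (`model`) and a test feature matrix `X_test`; the exact calls in the notebooks may differ:

```python
import shap

# Explain a fitted tree model (e.g., the XGBoost or CatBoost model from the pipeline).
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)  # returns a shap.Explanation object

# Global view: overall feature importance and the distribution of SHAP values.
shap.plots.beeswarm(shap_values)
shap.plots.bar(shap_values)

# Local view: how each feature pushes one individual prediction up or down.
shap.plots.waterfall(shap_values[0])
```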
A notable example is the dependence plot from CatBoost, which illustrates the interaction between Tenure and Contract. The plot reveals that clients with contracts longer than one or two years are more likely to stay, whereas clients with month-to-month contracts tend to churn.
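Continuing from the sketch above, a comparable dependence view can be produced with `shap.plots.scatter` (the column names "Tenure" and "Contract" are assumptions that depend on how the data were preprocessed):

```python
# Dependence view: SHAP values of Tenure, coloured by Contract, to expose their interaction.
shap.plots.scatter(shap_values[:, "Tenure"], color=shap_values[:, "Contract"])
```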
For a detailed understanding of the results and visualizations, I encourage you to explore the Jupyter notebooks. They contain comprehensive explanations and comments on all the graphs, helping to interpret the findings effectively.
## Requirements

To run this project, ensure you have the following installed:

- Python 3.11.x: the project is developed and tested with Python 3.11.x, so make sure this version is installed.
- Dependencies: using requirements.txt, you can install the necessary packages by running the following command:
pip install -r requirements.txt