- Introduction
- Datasets
- Exploratory Data Analysis
- Machine Learning
- Explainability with SHAP
- Requirements
## Introduction

This project aims to build a robust model for predicting customer churn using the Telco Customer Dataset from IBM. Churn prediction is crucial for businesses to identify customers who are likely to leave, enabling them to take proactive measures to retain these customers. The focus of this project is not only on achieving accurate predictions but also on ensuring that these predictions are interpretable and actionable.
To achieve this, the project integrates the SHAP (SHapley Additive exPlanations) algorithm, a powerful tool for model explainability. SHAP provides both global and local explanations, helping data scientists and business stakeholders understand which factors contribute most to customer churn. This dual focus on prediction accuracy and explainability is essential in making the model’s insights both trustworthy and actionable for business decisions.
The analysis provides valuable insights into the risk factors for customer churn, guiding companies on where to focus their retention efforts. This combination of predictive modeling and explainability offers a comprehensive tool for businesses looking to reduce churn and improve customer loyalty.
## Datasets

These datasets were taken from here.

These are Telecommunications Industry sample data provided by IBM. The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service. Multiple important demographics are included for each customer, as well as a Satisfaction Score, a Churn Score, and a Customer Lifetime Value (CLTV) index.
I analyzed the following datasets:
- Demographics information
- Location information
- Population information
- Services information
- Status information
- A complete Customer Churn dataset, which is the one I used for the predictions and explanations
## Exploratory Data Analysis

In this project, I applied Object-Oriented Programming (OOP) principles by creating a custom EDA class here. This class encapsulates all the methods needed for statistical analysis and data visualization across the different datasets (a minimal sketch is shown after the list below). The use of OOP provided several advantages:
- Code Reusability: By defining the EDA class, I was able to reuse the same code structure across various datasets, eliminating redundancy and ensuring consistency in the analysis process.
- Scalability: The class-based approach allowed me to easily extend and adapt the analysis to new datasets with minimal changes, making the code highly scalable.
- Modularity: Encapsulating the EDA functionality within a class enhanced the modularity of the code, simplifying maintenance and future enhancements.
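As an illustration only (the actual class in the repository may be organised differently, and the method names below are hypothetical), such a reusable EDA helper could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


class EDA:
    """Minimal sketch of a reusable EDA helper (the real class in the repo may differ)."""

    def __init__(self, df: pd.DataFrame, name: str = "dataset"):
        self.df = df
        self.name = name

    def summary(self) -> pd.DataFrame:
        # Descriptive statistics for every column, including non-numeric ones.
        return self.df.describe(include="all").T

    def missing_values(self) -> pd.Series:
        # Count of missing values per column, sorted descending.
        return self.df.isna().sum().sort_values(ascending=False)

    def plot_distribution(self, column: str) -> None:
        # Histogram for numeric columns, bar chart of counts for categorical ones.
        if pd.api.types.is_numeric_dtype(self.df[column]):
            sns.histplot(self.df[column], kde=True)
        else:
            self.df[column].value_counts().plot(kind="bar")
        plt.title(f"{self.name}: distribution of {column}")
        plt.tight_layout()
        plt.show()


# The same class can be instantiated for any of the Telco datasets, e.g.:
# eda = EDA(pd.read_csv("Telco_customer_churn.csv"), name="churn")
# eda.summary(); eda.missing_values(); eda.plot_distribution("Churn Value")
```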
The results and insights obtained from the EDA are comprehensively documented within the notebook's markdown cells. Each step is clearly explained, commented on, and supported by visualizations that aid in understanding the underlying data patterns.
One of the highlights of the EDA was the visualization of geographical data using GeoPandas. By mapping the latitude and longitude points of the dataset onto a map of California, sourced from Natural Earth Data, I was able to confirm that all spatial points indeed fell within California's borders. This spatial analysis provided valuable context and ensured the geographical integrity of the dataset. The use of GeoPandas not only facilitated the accurate plotting of points but also allowed for seamless integration with base maps, offering a clear and informative geographical visualization. Here, you can see the plot:
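As a rough sketch of how such a map can be produced (the file paths, Natural Earth layer, and column names below are assumptions, not necessarily the ones used in the notebook):

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical paths; the IBM location dataset exposes Latitude/Longitude columns.
locations = pd.read_csv("Telco_customer_churn_location.csv")
states = gpd.read_file("ne_110m_admin_1_states_provinces.shp")  # Natural Earth states layer
california = states[states["name"] == "California"]

# Convert the customer coordinates into a GeoDataFrame of points (WGS84).
points = gpd.GeoDataFrame(
    locations,
    geometry=gpd.points_from_xy(locations["Longitude"], locations["Latitude"]),
    crs="EPSG:4326",
)

# Plot California as the base map and overlay the customer locations.
ax = california.plot(color="lightgrey", edgecolor="black", figsize=(8, 8))
points.plot(ax=ax, markersize=2, color="tab:blue")
ax.set_title("Customer locations within California")
plt.show()
```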
## Machine Learning

In this project, I implemented a robust and scalable machine learning pipeline by creating a custom ModelPipeline class, located in the ml_model.py file. This object-oriented approach provides several key advantages, such as code reusability, scalability, and the flexibility to easily extend the pipeline with new models and techniques.
The ModelPipeline class is designed with several default machine learning models, including:
- Logistic Regression
- Random Forest
- XGBoost

Additionally, I integrated CatBoost, which is particularly effective for handling categorical features and imbalanced datasets. The modular design of the ModelPipeline class allows for seamless integration of these models and makes it easy to switch between them, depending on the specific needs of the analysis.
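A minimal sketch of this design, with hypothetical names (the real ml_model.py may be organised quite differently):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


class ModelPipeline:
    """Sketch of a pipeline holding interchangeable models (illustrative only)."""

    def __init__(self, models=None, random_state=42):
        # Default models; any estimator with fit/predict can be registered.
        self.models = models or {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "random_forest": RandomForestClassifier(random_state=random_state),
            "xgboost": XGBClassifier(eval_metric="logloss", random_state=random_state),
        }

    def add_model(self, name, estimator):
        self.models[name] = estimator


# CatBoost can be registered alongside the defaults and swapped in at will.
# pipeline = ModelPipeline()
# pipeline.add_model("catboost", CatBoostClassifier(verbose=0, random_state=42))
```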
A significant challenge in churn prediction is dealing with imbalanced classes. To address this, I incorporated methods such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) within the ModelPipeline. These techniques were applied during the stratified cross-validation process to balance the classes and improve model performance.
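A minimal sketch of this idea using imbalanced-learn, so that SMOTE is fitted only on the training folds of each split and the validation folds keep their original class distribution (names and parameters are illustrative, not the exact project code):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Putting the resampler inside the pipeline avoids leakage: SMOTE only sees training folds.
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", XGBClassifier(eval_metric="logloss", random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# X, y: the preprocessed feature matrix and the churn labels prepared in the notebook.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```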
The ModelPipeline class also includes methods for evaluating model performance, both during training and on held-out test data. I used a range of metrics to assess the models' effectiveness, ensuring a comprehensive understanding of their predictive capabilities.
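As an illustration (the specific metric choices below are assumptions, not necessarily the ones reported in the notebooks), a held-out evaluation reusing the pipeline and the X, y data sketched above might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Hypothetical stratified hold-out split for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))       # precision, recall, F1 per class
print("ROC AUC:", roc_auc_score(y_test, y_proba))  # threshold-independent ranking quality
```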
All results, including model performance metrics, are fully documented and discussed in the Jupyter notebooks. The Markdown cells within these notebooks provide detailed explanations and insights, making it easy to understand the outcomes and implications of each experiment. For a deeper understanding, I recommend reviewing the notebooks directly.
## Explainability with SHAP

SHAP (SHapley Additive exPlanations) is a powerful tool for model interpretability, based on Shapley values from cooperative game theory. It provides a unified measure of feature importance by attributing the contribution of each feature to every individual prediction. SHAP's key advantage lies in its ability to deliver both global and local interpretability, offering insights into model behavior across the entire dataset and for individual predictions.
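For reference, the Shapley value that SHAP estimates for a feature $i$ in a single prediction is the classical game-theoretic quantity:

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f_{S\cup\{i\}}\big(x_{S\cup\{i\}}\big) - f_S\big(x_S\big)\right]
$$

where $F$ is the full feature set and $f_S$ denotes the model output when only the features in subset $S$ are used; SHAP approximates this efficiently, in particular for tree-based models.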
In this project, I utilized SHAP with models that performed well with SMOTE, specifically XGBoost and CatBoost. XGBoost was chosen for its high performance with balanced datasets, while CatBoost was included to evaluate its behavior with categorical features and its integration with SHAP. All analyses were conducted using stratified cross-validation to ensure robust results.
I performed a range of global and local analyses to understand model predictions better:
- Global Analysis: Beeswarm plots and bar plots were used to visualize overall feature importance and the distribution of Shapley values across features.
- Local Analysis: Detailed examination of individual predictions was carried out using force plots, decision plots, and waterfall plots.
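A minimal sketch of how these plots can be produced with the modern SHAP API, assuming a fitted tree-based model (`model`) and a test feature matrix `X_test`; the exact calls in the notebooks may differ:

```python
import shap

# Explain a fitted tree model (e.g., the XGBoost or CatBoost model from the pipeline).
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)  # returns a shap.Explanation object

# Global view: overall feature importance and the distribution of SHAP values.
shap.plots.beeswarm(shap_values)
shap.plots.bar(shap_values)

# Local view: how each feature pushes one individual prediction up or down.
shap.plots.waterfall(shap_values[0])
```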
A notable example is the dependence plot from CatBoost, which illustrates the interaction between Tenure and Contract. The plot reveals that clients with contracts longer than one or two years are more likely to stay, whereas clients with month-to-month contracts tend to churn.
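Continuing from the sketch above, a comparable dependence view can be produced with `shap.plots.scatter` (the column names "Tenure" and "Contract" are assumptions that depend on how the data were preprocessed):

```python
# Dependence view: SHAP values of Tenure, coloured by Contract, to expose their interaction.
shap.plots.scatter(shap_values[:, "Tenure"], color=shap_values[:, "Contract"])
```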
For a detailed understanding of the results and visualizations, I encourage you to explore the Jupyter notebooks. They contain comprehensive explanations and comments on all the graphs, helping to interpret the findings effectively.
## Requirements

To run this project, ensure you have the following installed:

- Python 3.11.x: the project is developed and tested with Python 3.11.x, so make sure this version is installed.
- Dependencies: using requirements.txt, you can install the necessary packages by running the following command:
pip install -r requirements.txt