Run histogram-based gradient boosted trees binary classifier on generated data and interpret results with standard metrics, SHAP, and supervised clustering

afairless/binary_classification_shap

Binary classification workflow

The code in this repository trains a binary classification model on a generated data set and then runs standard classification metrics, SHAP, and supervised clustering to evaluate and interpret the results.

As the data set is synthetically generated, the purpose here is not to glean any particular insights about the data set or even about the utility of the modeling approach. The purpose is to store code that represents a fairly typical modeling workflow that I might use and that can be readily adapted to a particular problem.

Generally speaking, the steps of the workflow are:

  • Generate multivariate Gaussian-distributed data with correlations among the variables
  • Dichotomize the response variable based on several different thresholds
  • For each version of the response variable:
    • Split the data into training and test sets
    • Choose a modeling algorithm (currently histogram-based gradient-boosted trees)
    • Define a hyperparameter space for the modeling algorithm
    • Use random search and cross-validation to optimize the hyperparameters on the training data set
    • Refit the modeling algorithm with the optimized hyperparameters to the entire training data set
    • Choose the classification threshold that maximizes the chosen performance metric for the training and test data sets
    • Calculate and plot standard binary classification metrics (e.g., confusion matrix, ROC curve)
    • Calculate and plot feature importances based on SHAP
    • Calculate and plot supervised clusters