This repository tackles an imbalanced supervised binary classification task: predicting which patients will have a stroke from sociological and biological factors, using the Kaggle Stroke Prediction Dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).
The main code is organized into three notebooks:
In the notebook 1-eda.ipynb, I perform exploratory data analysis on the data:
- Check for missing values and data types
- Draw the blueprint for the data pipeline
- Perform univariate and bivariate data analysis
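The checks listed above can be sketched with pandas roughly as follows. This is an illustration only, not the notebook's actual code: the tiny DataFrame is synthetic, and the column names (`age`, `bmi`, `smoking_status`, `stroke`) are assumptions about how the Kaggle data is laid out.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the stroke data; column names are assumptions.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 33.0, 59.0, 71.0],
    "bmi": [36.6, np.nan, 32.5, 24.1, np.nan, 28.9],
    "smoking_status": ["smokes", "never smoked", "formerly smoked",
                       "never smoked", "smokes", "never smoked"],
    "stroke": [1, 0, 1, 0, 0, 0],
})

# Missing values and data types per column.
missing_counts = df.isna().sum()
dtypes = df.dtypes

# Univariate view: class balance of the target.
class_balance = df["stroke"].value_counts(normalize=True)

# Bivariate view: stroke rate within each category of one feature.
stroke_rate_by_smoking = df.groupby("smoking_status")["stroke"].mean()

print(missing_counts, dtypes, class_balance, stroke_rate_by_smoking, sep="\n\n")
```

On the real data, the same calls would run on the DataFrame loaded from the Kaggle CSV instead of the hand-built one.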
The notebook 2-model.ipynb compares many different classifiers and tunes the best-performing ones using cross-validation.
The following approach is used:
- Creating a data pipeline
- Selecting the best models using cross-validation
- Performing cross-validated hyperparameter tuning on the best models using the optuna package
- Saving the best model pipelines for later evaluation
Notebook 3-eval.ipynb evaluates the tuned models from the previous notebook and benchmarks them on the test set across several metrics.
The evaluation consists of the following steps:
- Accuracy, ROC AUC and F1 score
- Confusion matrix
- ROC curve
- Precision-recall curve
- True vs predicted distributions
- Threshold tuning using F1 score
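As an illustration of the last step, threshold tuning by F1 score can be sketched with scikit-learn's `precision_recall_curve`. The labels and predicted probabilities below are synthetic assumptions; in the notebook they would come from the tuned pipelines and the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve, roc_auc_score

# Synthetic labels (about 10% positives) and hypothetical predicted probabilities,
# with positives scoring higher on average.
rng = np.random.default_rng(0)
y_true = (rng.random(500) < 0.1).astype(int)
y_score = np.clip(rng.normal(0.25 + 0.3 * y_true, 0.15), 0.0, 1.0)

auc = roc_auc_score(y_true, y_score)

# precision and recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the arrays line up with the thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
denom = precision[:-1] + recall[:-1]
f1 = np.divide(2 * precision[:-1] * recall[:-1], denom,
               out=np.zeros_like(denom), where=denom > 0)
best_threshold = thresholds[np.argmax(f1)]

# Compare the default 0.5 cut-off with the F1-optimal one.
y_pred_default = (y_score >= 0.5).astype(int)
y_pred_tuned = (y_score >= best_threshold).astype(int)
default_f1 = f1_score(y_true, y_pred_default)
tuned_f1 = f1_score(y_true, y_pred_tuned)
cm_tuned = confusion_matrix(y_true, y_pred_tuned)

print(f"ROC AUC: {auc:.3f}")
print(f"F1 at 0.5: {default_f1:.3f}, F1 at {best_threshold:.3f}: {tuned_f1:.3f}")
print(cm_tuned)
```

With imbalanced classes the F1-optimal threshold often differs from 0.5. A threshold chosen on the same data used for the final evaluation gives an optimistic F1 estimate, so a separate validation split may be preferable for picking it.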