Stroke prediction

This repository tackles an imbalanced supervised binary classification task: predicting which patients will have a stroke given sociological and biological factors, using the Kaggle dataset available at "https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset".

Tools used

The main code is split across three notebooks:

In the notebook 1-eda.ipynb, I perform exploratory data analysis on the data:

The notebook 2-model.ipynb compares many different classifiers and tunes the best-performing ones using cross-validation.

The following approach is used:

Creating a data pipeline
Selecting the best models using cross-validation
Performing cross-validated hyperparameter tuning on the best models using the optuna package
Saving the best model pipelines for later evaluation

The notebook 3-eval.ipynb evaluates the tuned models from the previous notebook and benchmarks them on the test set across several metrics.
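
As an illustration of what benchmarking on an imbalanced test set can look like, here is a small sketch. The metrics chosen here (ROC AUC, average precision, precision, recall, F1, confusion matrix) are an assumption, since accuracy alone is misleading when positives are rare; the synthetic data and the logistic regression stand-in for a saved tuned pipeline are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic, imbalanced stand-in for the stroke dataset.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.95], random_state=0)
# Stratify so the test set keeps the same class ratio as the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for a tuned pipeline loaded from disk.
model = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
).fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]  # scores for threshold-free metrics
y_pred = model.predict(X_test)              # hard labels at the default threshold

metrics = {
    "roc_auc": roc_auc_score(y_test, y_prob),
    "average_precision": average_precision_score(y_test, y_prob),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

For a screening-style task such as stroke prediction, recall and the false-negative count are often the figures to watch, because a missed positive is usually costlier than a false alarm.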

The evaluation consists of the following steps:
