Heart failure is a common event caused by cardiovascular disease. Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
We used a dataset of 12 features that contribute to CVDs to predict mortality by heart failure.
To view this project, click here: Heart Failure Prediction
- age in years from 40 to 95
- anaemia - boolean
- creatinine phosphokinase levels from 970 to 7,861
- diabetes - boolean
- ejection fraction percentage from 14 to 80
- high blood pressure - boolean
- platelet count from 97,804 to 850,000
- serum creatinine levels from 0.5 to 9.4
- serum sodium levels from 113 to 148
- sex - boolean - male/female
- smoking - boolean
- follow up time in days from first appointment
Death Event - boolean - whether or not the patient died during the course of the study.
Our dataset was very clean and well organized. There were no missing values and all values were either int64 or float64.
The dataset contains 299 patient records, each with 12 features and 1 target variable.
Of the 299 patients, 96 died of heart failure during the study.
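A minimal sketch of how these checks could be reproduced with pandas. The CSV file name and the `DEATH_EVENT` column name are assumptions (they are not given in this README) and may differ in your copy of the data; `summarize` is a hypothetical helper, not part of the project.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "DEATH_EVENT") -> dict:
    """Return basic sanity checks: size, missing values, dtypes, class balance."""
    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,  # every column except the target
        "n_missing": int(df.isna().sum().sum()),
        "dtypes": sorted({str(t) for t in df.dtypes}),
        "n_deaths": int(df[target].sum()),
        "death_rate": float(df[target].mean()),
    }


# Assumed file name; uncomment once the CSV is available locally.
# df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
# print(summarize(df))
```

If the data matches the description above, this should report 299 rows, 12 features, no missing values and 96 deaths, which is roughly 32% of patients. That imbalance is one reason to look at Balanced Accuracy and F1 Score rather than plain accuracy.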
We used a combination of violin and box plots to get an idea of the distribution of our data and to look for outliers.
We also plotted linear regressions of individual feature correlations so that we could understand feature interaction.
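As an illustration of the violin and box plots, here is a small sketch using plain matplotlib. The project's own plotting code is not shown in this README, so the function name, labels and layout are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_distribution(values, by_group, name):
    """Draw a violin plot and a box plot of `values`, split by a 0/1 group."""
    values = np.asarray(values, dtype=float)
    by_group = np.asarray(by_group)
    groups = [values[by_group == g] for g in (0, 1)]

    fig, (ax_violin, ax_box) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
    ax_violin.violinplot(groups, positions=[0, 1], showmedians=True)
    # With the default whiskers, points beyond 1.5 * IQR are drawn as outliers.
    ax_box.boxplot(groups, positions=[0, 1])
    for ax, kind in ((ax_violin, "violin"), (ax_box, "box")):
        ax.set_xticks([0, 1])
        ax.set_xticklabels(["survived", "died"])
        ax.set_title(f"{name} ({kind})")
    return fig
```

Calling `plot_distribution(df["ejection_fraction"], df["DEATH_EVENT"], "ejection_fraction")` (column names assumed) would show one feature split by outcome; the box plot makes outliers visible, while the violin plot shows the shape of each distribution.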
We ran the following machine learning models and used metrics such as F1 Score and Balanced Accuracy to compare results. We also plotted Confusion Matrices of our classifiers.
- Decision Tree Entropy Model
- Decision Tree Gini Model
- Random Forest Classifier
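The comparison of these three models can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's actual code: the data is a synthetic stand-in, the split is stratified, and the 0.33 test fraction is a guess (chosen because it gives 99 test rows out of 299).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=0):
    """Fit the three classifiers on one split and score them on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=seed
    )
    models = {
        "tree_entropy": DecisionTreeClassifier(criterion="entropy", random_state=seed),
        "tree_gini": DecisionTreeClassifier(criterion="gini", random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "f1": f1_score(y_test, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
            # Rows are true classes, columns are predicted classes.
            "confusion_matrix": confusion_matrix(y_test, y_pred),
        }
    return results


# Synthetic stand-in data with roughly the same size and class balance.
X, y = make_classification(n_samples=299, n_features=12, weights=[0.68], random_state=0)
for name, scores in compare_models(X, y).items():
    print(name, round(scores["f1"], 3), round(scores["balanced_accuracy"], 3))
```

Scoring every model on the same held-out split keeps the comparison fair; the scores printed for the synthetic data say nothing about the real dataset.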
We used Recursive Feature Selection, Hyperparameter Tuning and Principal Component Analysis to identify the most important features.
In the end we used 2 important features, Ejection Fraction and Serum Creatinine.
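One way to arrive at a two-feature subset is recursive feature elimination. The sketch below assumes that "Recursive Feature Selection" refers to scikit-learn's `RFE` with a random forest as the ranking estimator; the README does not confirm this, and the data and feature names here are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def top_features(X, y, names, n_keep=2, seed=0):
    """Recursively drop the least important feature until `n_keep` remain."""
    estimator = RandomForestClassifier(n_estimators=100, random_state=seed)
    selector = RFE(estimator, n_features_to_select=n_keep, step=1).fit(X, y)
    return [name for name, kept in zip(names, selector.support_) if kept]


# Synthetic stand-in data: with shuffle=False the 2 informative columns come first.
X, y = make_classification(
    n_samples=299, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(6)]
print(top_features(X, y, names))
```

On the real data, `names` would be the 12 column names, and the expectation from the text above is that the two surviving features are Ejection Fraction and Serum Creatinine.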
After extensive testing we settled on the Random Forest model with 2 Prime Features and Tuned Hyperparameters.
- n_estimators: 1600
- min_samples_split: 10
- min_samples_leaf: 4
- max_features: sqrt
- max_depth: 30
- bootstrap: True
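For reference, the hyperparameters above map directly onto scikit-learn's `RandomForestClassifier` arguments. This is a sketch of how the model could be constructed, assuming scikit-learn was the library used:

```python
from sklearn.ensemble import RandomForestClassifier

# Values copied from the list above.
best_params = {
    "n_estimators": 1600,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "max_features": "sqrt",
    "max_depth": 30,
    "bootstrap": True,
}

model = RandomForestClassifier(**best_params)
```

The model would then be fitted on the two selected columns only, for example `model.fit(df[["ejection_fraction", "serum_creatinine"]], df["DEATH_EVENT"])`, where the column names are assumptions.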
We achieved the following results by averaging 100 iterations of the Random Forest Model with the above hyperparameters.
- Accuracy: 0.762
- Balanced Accuracy: 0.703
- F1 Score: 0.584
- Precision Score: 0.666
- Precision Score for Positive: 0.666
- Precision Score for Negative: 0.800
- Recall Score: 0.534
- Recall Score for Positive: 0.534
- Recall Score for Negative: 0.872
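The averaging procedure can be sketched as below. The README does not say how each iteration differed, so this version assumes a fresh stratified train/test split and a fresh random seed per run; `average_metrics` is a hypothetical helper, and the demo uses synthetic data and a small forest to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score,
    precision_score, recall_score,
)
from sklearn.model_selection import train_test_split


def average_metrics(X, y, params, n_runs=100, test_size=0.33):
    """Average test-set metrics over `n_runs` random train/test splits."""
    runs = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = RandomForestClassifier(random_state=seed, **params)
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        runs.append({
            "accuracy": accuracy_score(y_te, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
            "f1": f1_score(y_te, y_pred, zero_division=0),
            "precision_pos": precision_score(y_te, y_pred, pos_label=1, zero_division=0),
            "precision_neg": precision_score(y_te, y_pred, pos_label=0, zero_division=0),
            "recall_pos": recall_score(y_te, y_pred, pos_label=1),
            "recall_neg": recall_score(y_te, y_pred, pos_label=0),
        })
    # Mean of each metric across all runs.
    return {key: float(np.mean([run[key] for run in runs])) for key in runs[0]}


# Synthetic two-feature stand-in; a small forest and few runs keep the demo quick.
X, y = make_classification(
    n_samples=299, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.68], random_state=0,
)
print(average_metrics(X, y, {"n_estimators": 50, "min_samples_leaf": 4}, n_runs=5))
```

Balanced accuracy is the mean of the two per-class recalls, and the reported numbers are consistent with that: (0.534 + 0.872) / 2 = 0.703. The plain Precision and Recall scores equal their "Positive" variants because the positive class (death) is the default class being scored.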
We have found that it is possible to predict heart failure in patients given the right information, specifically Ejection Fraction and Serum Creatinine.
- In the above graph you can see a clear negative correlation between ejection fraction and mortality, meaning that the lower the ejection fraction, the more likely the patient is to suffer heart failure.
- In the above graph you can see a clear positive correlation between serum creatinine and mortality, meaning that the higher the serum creatinine, the more likely the patient is to suffer heart failure.
- In the above graph the clustering shows that death events coincide with low ejection fraction and low serum creatinine, suggesting that ejection fraction is the more powerful indicator of possible heart failure. However, the death events that coincide with high serum creatinine and high ejection fraction suggest that high serum creatinine is also an indicator of possible heart failure.
We have also realized that it is important to focus on the elimination of False Negatives in our modeling, following the logic that it is better to generate a false positive and continue follow-up with a healthy patient than to generate a false negative and not follow up with an unhealthy patient.
The above confusion matrix shows that out of 99 predictions only 8 were false negatives, so false negatives made up about 8% of all predictions.
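Note that 8 out of 99 is the share of all predictions that were false negatives. The false negative rate in the usual sense divides by the number of actual positives instead, FN / (FN + TP). The sketch below shows both calculations. Only FN = 8 and the total of 99 come from the text; the other three counts are hypothetical placeholders, since the matrix itself is not reproduced here.

```python
def false_negative_summary(tn, fp, fn, tp):
    """Compute false-negative statistics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        # Fraction of all predictions that were false negatives.
        "fn_share_of_predictions": fn / total,
        # Fraction of actual positives that the model missed (1 - recall).
        "false_negative_rate": fn / (fn + tp),
        "recall": tp / (fn + tp),
    }


# tn, fp and tp are HYPOTHETICAL; only fn=8 and the total of 99 are from the text.
summary = false_negative_summary(tn=60, fp=7, fn=8, tp=24)
print(summary)
```

Substituting the real counts from the confusion matrix gives the actual false negative rate, which is the figure to watch when the goal is to avoid missing unhealthy patients.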
The following similar study was conducted using the same data set, and reached the same conclusions regarding the prediction of heart failure.
Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone