
Credit_Risk_Analysis

8-Credit Risk

Challenge Overview

Purpose:

The purpose of this analysis is to predict credit risk by training and evaluating machine learning models on a dataset with unbalanced classes, using several different resampling and ensemble techniques.

  • Resampling Models

    • Over-sampling method: using the RandomOverSampler & SMOTE algorithms
    • Under-sampling method: using the ClusterCentroids algorithm
    • Combination Sampling (a combinatorial approach of oversampling and undersampling): using the SMOTEENN algorithm
  • Ensemble Classifier methods

    • using the BalancedRandomForestClassifier & EasyEnsembleClassifier algorithms

Resources

  • Software:

    • Jupyter Notebook 6.4.6
    • Python
      • scikit-learn library
      • imbalanced-learn library
  • Data source:

Results
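
All of the models below are trained on the same train/test split and scored with the same metrics (balanced accuracy, confusion matrix, and the imbalanced classification report). The sketch below shows that shared setup; since the loan data is not bundled with this repo, a synthetic imbalanced dataset stands in for it, and names such as evaluate are illustrative rather than taken from the original notebooks.

```python
# Shared setup (illustrative): every model in this section is scored the same way.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.metrics import classification_report_imbalanced

# Synthetic stand-in for the loan data: roughly 1% minority (high-risk) class.
X, y = make_classification(n_samples=10_000, n_features=20, n_informative=5,
                           weights=[0.99], random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

def evaluate(model):
    """Print balanced accuracy, confusion matrix, and the imbalanced report."""
    y_pred = model.predict(X_test)
    print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report_imbalanced(y_test, y_pred))
```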

1 Random_Oversampling

  • Random Oversampling (code sketch below)
    • Balanced accuracy score: 66%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (71%) is far higher than precision (1%)
      • The very low precision means many false positives (loans predicted high risk that are actually low risk)
      • This makes the model a poor choice for this dataset
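
A minimal sketch of this step, reusing the shared setup above; the LogisticRegression classifier is used here for illustration and may differ from the model in the original notebook.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression

# Randomly duplicate minority-class rows until both classes are the same size.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

model = LogisticRegression(random_state=1, max_iter=1000)
model.fit(X_resampled, y_resampled)
evaluate(model)  # balanced accuracy, confusion matrix, imbalanced report
```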

2 SMOTE

  • Synthetic Minority Oversampling Technique (SMOTE) (code sketch below)
    • Balanced accuracy score: 66%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (63%) is far higher than precision (1%)
      • The very low precision means many false positives (loans predicted high risk that are actually low risk)
      • This makes the model a poor choice for this dataset
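
A minimal sketch, again reusing the shared setup and an illustrative LogisticRegression classifier.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# SMOTE creates synthetic minority samples by interpolating between
# existing minority points and their nearest neighbours.
smote = SMOTE(random_state=1, sampling_strategy="auto")
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

model = LogisticRegression(random_state=1, max_iter=1000)
model.fit(X_resampled, y_resampled)
evaluate(model)
```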

3 Undersampling_Cluster_Centroid

  • Cluster Centroid Undersampling (code sketch below)
    • Balanced accuracy score: 54%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (69%) is far higher than precision (1%)
      • The very low precision means many false positives (loans predicted high risk that are actually low risk)
      • This makes the model a poor choice for this dataset
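
A minimal sketch of the undersampling step, under the same assumptions as the sketches above.

```python
from imblearn.under_sampling import ClusterCentroids
from sklearn.linear_model import LogisticRegression

# Instead of growing the minority class, shrink the majority class by
# replacing it with KMeans cluster centroids (can be slow on large data).
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)

model = LogisticRegression(random_state=1, max_iter=1000)
model.fit(X_resampled, y_resampled)
evaluate(model)
```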

4 Combination

  • Combination Sampling with SMOTEENN (code sketch below)
    • Balanced accuracy score: 64%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (72%) is far higher than precision (1%)
      • The very low precision means many false positives (loans predicted high risk that are actually low risk)
      • This makes the model a poor choice for this dataset
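
A minimal sketch of the combination sampling step, under the same assumptions as the sketches above.

```python
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression

# SMOTEENN first oversamples with SMOTE, then removes noisy/ambiguous
# samples with Edited Nearest Neighbours undersampling.
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

model = LogisticRegression(random_state=1, max_iter=1000)
model.fit(X_resampled, y_resampled)
evaluate(model)
```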

5 Balanced_Random_Forest_Classifier

  • BalancedRandomForestClassifier (code sketch below)
    • Balanced accuracy score: 79%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (70%) is far higher than precision (3%)
      • The very low precision means many false positives (loans predicted high risk that are actually low risk)
      • This still makes the model a poor choice for this dataset
    • total_rec_prncp and total_pymnt are the most important features (columns) in the credit dataset
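
A minimal sketch of the ensemble approach, reusing the shared setup above. Because the synthetic stand-in data has no column names, feature importances are printed by index; with the real loan data this is where columns such as total_rec_prncp and total_pymnt surface near the top.

```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is grown on a balanced bootstrap sample, so no separate
# resampling step is needed before fitting.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
evaluate(brf)

# Rank features by importance (importance, feature index), highest first.
ranked = sorted(zip(brf.feature_importances_, range(X_train.shape[1])),
                reverse=True)
print(ranked[:5])
```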

6 Easy_Ensemble_AdaBoost_Classifier

  • EasyEnsembleClassifier (code sketch below)
    • Balanced accuracy score: 93%
    • Precision and recall for the high_risk class:
      • Recall/sensitivity (92%) is far higher than precision (9%)
      • The low precision still means many false positives (loans predicted high risk that are actually low risk)
      • This makes the model a poor choice for this dataset
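
A minimal sketch of the boosted ensemble, reusing the shared setup above.

```python
from imblearn.ensemble import EasyEnsembleClassifier

# An ensemble of AdaBoost learners, each trained on a balanced bootstrap
# sample of the training data.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
evaluate(eec)
```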

Summary:

Even though the EasyEnsembleClassifier algorithm has the highest balanced accuracy score (93%), neither it nor any of the other algorithms is good enough to determine whether a loan is high risk: in every case the sensitivity/recall is high while the precision is very low, which indicates many false positives (loans predicted high risk that are actually low risk). Clearly, these are not useful algorithms for this dataset, so I would not recommend using them to predict credit risk. Perhaps a dataset with more observations would produce a better result.