Celine Ng - December 2024
- Improve the accuracy of default-risk evaluation for retail banks. In practice, this means classifying the target variable (whether a client defaults).
- Evaluate feature importance to understand which features influence home credit default.
The data comes as 10 separate CSV files from the Kaggle competition Home Credit Default Risk, which is now closed. The original CSV files can be found in the folder "data_csv".
- Python
- Pandas
- Seaborn
- Matplotlib
- scikit-learn
- xgboost
- lightgbm
- optuna
- SHAP
This project involves 8 tables containing a large amount of data. My initial approach was to aggregate these tables after briefly understanding the data types and information provided. I focused on delving deeper only into the aspects that had significant relevance to the problem at hand.
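The aggregation step can be sketched as follows. This is a minimal illustration on toy data, not the notebooks' actual code: only `SK_ID_CURR` appears in this README, while `TARGET`, `AMT_CREDIT`, the `prev` prefix, and the helper `aggregate_child` are assumed names used for the example.

```python
import pandas as pd

# Toy stand-ins for the main table and one child table.
# Only SK_ID_CURR comes from this README; other names are illustrative.
applications = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0, 1, 0],
})
previous_loans = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100013],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0],
})


def aggregate_child(child: pd.DataFrame, key: str, prefix: str) -> pd.DataFrame:
    """Collapse a one-to-many table to one row per key using numeric summaries."""
    numeric = child.select_dtypes("number").drop(columns=[key])
    agg = numeric.groupby(child[key]).agg(["mean", "max", "sum"])
    # Flatten the (column, statistic) MultiIndex into single column names.
    agg.columns = [f"{prefix}_{col}_{stat}" for col, stat in agg.columns]
    # The row count per client also records how much history exists.
    agg[f"{prefix}_count"] = child.groupby(key).size()
    return agg.reset_index()


prev_agg = aggregate_child(previous_loans, "SK_ID_CURR", "prev")

# A left join keeps clients without any history; their aggregates stay NaN.
merged = applications.merge(prev_agg, on="SK_ID_CURR", how="left")

# Explicit flag for the presence or absence of prior information.
merged["prev_has_history"] = merged["prev_count"].notna().astype(int)
```

The left join is the important design choice here: clients with no rows in a child table are kept, and the resulting missing values can themselves be turned into a feature.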
Once I aggregated the data into a single table, I followed a structured methodology of data exploration, preprocessing, feature engineering, model selection, modeling, and evaluation:
- Data Exploration: I identified columns that needed encoding and assessed the information available in each feature. Most features had weak correlations with the target variable, but hypothesis testing showed that the presence or absence of prior information was significant for predicting defaults.
- Preprocessing: Since only tree-based models were used, this step focused solely on encoding categorical features.
- Feature Engineering: New features were created based on domain knowledge, generated with the help of an AI tool. These features were tested using cross-validation with ROC AUC (primary metric) and F1 score (secondary metric). Based on the results, all features were retained for modeling.
- Model Selection: I compared several models, including Random Forest, Logistic Regression, XGBoost, LightGBM, and ensemble methods. After hyperparameter tuning, models were evaluated using ROC AUC and F1 score. The best decision threshold was then selected for the chosen model.
- Modeling: The final model was retrained using the full dataset and tested on the test data.
- Evaluation: The final model was interpreted using a confusion matrix, feature importance, and SHAP values.
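The cross-validation and threshold steps above could look roughly like the sketch below. It runs on synthetic data and uses a Random Forest with balanced class weights as a stand-in for the chosen model. The README does not state which criterion defined the "best" threshold, so maximising F1 on out-of-fold predictions is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_validate

# Synthetic imbalanced data (about 10% positives) stands in for the aggregated table.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# ROC AUC as primary metric, F1 (at the default 0.5 threshold) as secondary.
scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1"])
mean_auc = scores["test_roc_auc"].mean()
mean_f1 = scores["test_f1"].mean()

# Out-of-fold probabilities: each row is scored by a model that never saw it.
oof_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

# precision/recall have one more entry than thresholds; drop the final pair.
precision, recall, thresholds = precision_recall_curve(y, oof_proba)
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_per_threshold = np.divide(
    2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0
)

# Assumed criterion: the threshold that maximises F1.
best_threshold = thresholds[np.argmax(f1_per_threshold)]
print(f"AUC={mean_auc:.3f}  F1@0.5={mean_f1:.3f}  best threshold={best_threshold:.2f}")
```

Choosing the threshold on out-of-fold predictions rather than on the test set keeps the test score an honest estimate.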
The results showed that the model was still influenced by class imbalance: it predicted the majority class correctly far more often than the minority class. The most important feature identified was 'external source mean', an aggregate feature derived from external rating data.
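One way to make this imbalance visible is to read per-class recall off the confusion matrix. The labels below are made-up toy values chosen only to show the calculation; they are not the project's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (not project results): 8 non-defaults, 4 defaults.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Recall per class = correct predictions / number of true members of the class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(dict(zip(["majority (0)", "minority (1)"], recall_per_class)))
```

A large gap between the two recall values, as in this toy case, is the pattern described above.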
The largest challenges in this project stemmed mainly from the large amount of data, which brought many missing values and many features. Another, more surprising, challenge was that LightGBM did not improve after hyperparameter tuning.
Future Work:
- Another approach to the large amount of data would involve first thoroughly understanding the data and selecting only the most relevant features for aggregation and further processing.
- The model would benefit from feature selection to reduce noise: most validation and test scores were very similar, hyperparameter tuning did not help much, and feature engineering showed that 52% of the features were not important for the models.
- Try a simpler product that uses only the most important features (perhaps 20) to quickly predict whether a client has any potential before the full data collection takes place.
- To reduce the effect of class imbalance, collecting more data from the minority class would likely improve the model's performance. Alternatively, the data collection process could be improved to include features that better distinguish the classes. In this project, several strategies were applied to reduce the effect: class weights, cross-validation with stratified k-fold, and a stratified train/test split.
- It would also be valuable to investigate the nature of the 'external source score', as this feature had a significant impact on the final model.
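The feature selection idea could be tried along the lines of the sketch below, which keeps the 20 features with the highest importance and compares cross-validated ROC AUC with and without the selection step. It uses synthetic data and a Random Forest as a stand-in model; the cut-off of 20 comes from the list above, everything else is an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 40 features, only 5 of them informative.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)


def make_model() -> RandomForestClassifier:
    return RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    )


# threshold=-inf disables the importance cut-off, so exactly the
# max_features most important columns are kept.
selector = SelectFromModel(make_model(), threshold=-np.inf, max_features=20)

# Selection sits inside the pipeline so it is refit on each training fold.
reduced = make_pipeline(selector, make_model())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_all = cross_val_score(make_model(), X, y, cv=cv, scoring="roc_auc").mean()
auc_top = cross_val_score(reduced, X, y, cv=cv, scoring="roc_auc").mean()

n_kept = int(reduced.fit(X, y)[0].get_support().sum())
print(f"all features: {auc_all:.3f}  top {n_kept}: {auc_top:.3f}")
```

Keeping the selector inside the pipeline avoids leaking validation-fold information into the choice of features.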
- Run the notebooks to create all the necessary files, including the final model.
- Run the Flask application with: python app.py
- Once the server is running, the app will be accessible at: http://127.0.0.1:5000
- The model will return a CSV file with the following content:
SK_ID_CURR,predictions
100001,0
100005,0
100013,0
100028,0
100038,1
100042,0
100057,0
...
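The returned file can be inspected with pandas. The sketch below parses the rows shown above from an in-memory string; in practice the string would be replaced by the path of the downloaded file, whose name is not specified here.

```python
import io

import pandas as pd

# The rows listed above; the elided remainder of the file is left out.
sample = """SK_ID_CURR,predictions
100001,0
100005,0
100013,0
100028,0
100038,1
100042,0
100057,0
"""

preds = pd.read_csv(io.StringIO(sample))

# Client IDs predicted to default (predictions == 1).
flagged = preds.loc[preds["predictions"] == 1, "SK_ID_CURR"].tolist()

# Share of clients predicted to default in this sample.
predicted_default_share = preds["predictions"].mean()

print(flagged, round(predicted_default_share, 3))
```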