My model's name is EAA (Ensemble is Always Answer).
Architecture (diagram)
Among the boosting models, the LightGBM DART model performed the best. Because the competition metric is very noisy, I decided to use a DART model: it is slow, but its performance is more reliable. For CatBoost, categorical features were trained with the first-statement features added. TabNet is a type of neural network; to add diversity beyond the GBDT models, it was included in the ensemble with a small weight even though its performance was weaker. Finally, the LightGBM models trained with the various feature sets were all stacked with XGBoost, as sketched below.
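A minimal sketch of that stacking step, assuming each base model's out-of-fold (OOF) predictions are available; the placeholder data and parameter values are assumptions, not the repo's actual code:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n_train, n_test = 1000, 500

# Placeholder OOF/test predictions from three base LightGBM models;
# in the real pipeline these come from the trained models above.
oof_stack = rng.uniform(size=(n_train, 3))
test_stack = rng.uniform(size=(n_test, 3))
target = rng.integers(0, 2, size=n_train).astype(float)

stacked_pred = np.zeros(n_test)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, _ in kf.split(oof_stack):
    # Meta-model: XGBoost regression on the base models' predictions
    meta = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
    meta.fit(oof_stack[train_idx], target[train_idx])
    stacked_pred += meta.predict(test_stack) / kf.get_n_splits()
```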
- Create lag features from the time features (see the sketch after this list)
- Statement features: check the customer's statement history (SDist)
- The first and last statement features are important features
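A minimal sketch of the lag / first / last feature construction, assuming the AMEX data layout with a customer_ID column and a statement date S_2; the helper name and derived column names are illustrative:

```python
from typing import List

import pandas as pd


def build_lag_features(df: pd.DataFrame, num_cols: List[str]) -> pd.DataFrame:
    """Per-customer first/last aggregates plus last-vs-previous diffs."""
    df = df.sort_values(["customer_ID", "S_2"]).copy()  # order statements by date

    # Previous-statement value of each numeric column (lag-1 within a customer)
    for col in num_cols:
        df[f"{col}_lag1"] = df.groupby("customer_ID")[col].shift(1)

    grouped = df.groupby("customer_ID")
    first = grouped[num_cols].first().add_suffix("_first")
    last = grouped[num_cols + [f"{c}_lag1" for c in num_cols]].last().add_suffix("_last")

    feats = first.join(last)
    for col in num_cols:
        feats[f"{col}_diff"] = feats[f"{col}_last"] - feats[f"{col}_lag1_last"]    # change vs. previous statement
        feats[f"{col}_last_first"] = feats[f"{col}_last"] - feats[f"{col}_first"]  # overall drift
    return feats
```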
The performance difference between runs was severe depending on the seed value, due to the noise of the metric. Therefore, after training with several seeds, I look for the optimal CV value and then find an ensemble weight for all models through numerical optimization (Nelder-Mead):
```python
import logging
from typing import List

import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import KFold


def get_best_weights(oofs: List[np.ndarray], target: np.ndarray) -> np.ndarray:
    """
    Get best blending weights
    Args:
        oofs: out-of-fold predictions of the models
        target: target of train data
    Returns:
        best weights
    """
    weight_list = []
    # Optimize n-1 free weights; the last weight is fixed to 1 - sum(others)
    weights = np.array([1 / len(oofs) for _ in range(len(oofs) - 1)])

    logging.info("Blending Start")
    kf = KFold(n_splits=5)
    for fold, (train_idx, _) in enumerate(kf.split(oofs[0]), 1):
        # get_score (defined elsewhere in the repo) returns the negative
        # competition metric of the blended OOF predictions on train_idx
        res = minimize(
            get_score,
            weights,
            args=(train_idx, oofs, target),
            method="Nelder-Mead",
            tol=1e-06,
        )
        logging.info(f"fold: {fold} res.x: {res.x}")
        weight_list.append(res.x)

    mean_weight = np.mean(weight_list, axis=0)
    # Append the implied last weight so all weights sum to 1
    mean_weight = np.insert(mean_weight, len(mean_weight), 1 - np.sum(mean_weight))
    logging.info(f"optimized weight: {mean_weight}\n")
    return mean_weight
```
For the ensemble, I use weight optimization: within each KFold split, the Nelder-Mead method finds the blending weights, and the fold-wise results are averaged to give the final weight. The last weight is set to 1 minus the sum of the others so the weights sum to one. A sketch of the objective used by get_best_weights follows.
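get_score itself is not shown above; a minimal sketch of the objective it is assumed to implement, with roc_auc_score standing in for the actual AMEX competition metric:

```python
from typing import List

import numpy as np
from sklearn.metrics import roc_auc_score


def get_score(weights: np.ndarray, train_idx: np.ndarray,
              oofs: List[np.ndarray], target: np.ndarray) -> float:
    """Negative score of the blended OOF predictions (minimized by Nelder-Mead)."""
    blended = np.zeros(len(train_idx))
    for oof, weight in zip(oofs[:-1], weights):
        blended += weight * oof[train_idx]
    # The last model's weight is implied so that all weights sum to 1
    blended += (1 - np.sum(weights)) * oofs[-1][train_idx]
    # AUC stands in here; the real code would use the AMEX competition metric
    return -roc_auc_score(target[train_idx], blended)
```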
Model | CV | Public LB | Private LB |
---|---|---|---|
XGBoost(10-KFold - gbdt) | 0.792 | 0.793 | - |
TabNet(10-StratifiedKFold) | 0.789 | 0.790 | - |
CatBoost(5-StratifiedKFold sdist-lag-features - dart) - seed22 | 0.7953 | 0.797 | - |
CatBoost(5-StratifiedKFold sdist-lag-features - dart) - seed42 | 0.7954 | 0.797 | - |
CatBoost(5-StratifiedKFold sdist-lag-features - dart) - seed99 | 0.7958 | 0.797 | - |
CatBoost(5-StratifiedKFold sdist-lag-features - dart) - seed3407 | 0.7948 | 0.797 | - |
LightGBM(5-StratifiedKFold time-features - shap - dart) - trick | 0.7970 | 0.797 | - |
LightGBM(5-StratifiedKFold time-lag-features - dart) - trick | 0.7973 | 0.797 | - |
LightGBM(5-StratifiedKFold diff-features - dart) - trick | 0.7973 | 0.799 | - |
LightGBM(5-StratifiedKFold trick-features - dart) - seed42 | 0.7977 | 0.798 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed22 | 0.7981 | 0.798 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed42 | 0.7979 | 0.798 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed88 | 0.7977 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed94 | 0.7972 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed99 | 0.7979 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed2020 | 0.7978 | 0.798 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed2222 | 0.7976 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-features - dart) - seed3407 | 0.7977 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-lag-features - dart) - seed3407 | 0.7977 | 0.799 | - |
LightGBM(5-StratifiedKFold bruteforce-features - dart) - seed22 | 0.7978 | 0.799 | - |
LightGBM(5-StratifiedKFold bruteforce-features - dart) - seed42 | 0.7981 | 0.799 | - |
LightGBM(5-StratifiedKFold bruteforce-features - dart) - seed99 | 0.7979 | 0.799 | - |
LightGBM(5-StratifiedKFold bruteforce-features - dart) - seed3407 | 0.7978 | 0.799 | - |
LightGBM(5-StratifiedKFold sdist-lag-features - dart) - seed5230 | 0.7963 | 0.799 | - |
XGBoost(10-KFold - stacking regression) | 0.7985 | 0.799 | - |
Ensemble is Always Answer | 0.79952 | 0.799 | - |
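Applying the optimized weights to the per-model test predictions might then look like this (a sketch; test_preds is assumed to be ordered like oofs):

```python
import numpy as np

# test_preds: per-model test predictions, in the same order as `oofs`
weights = get_best_weights(oofs, target)
final_pred = np.zeros_like(test_preds[0])
for pred, weight in zip(test_preds, weights):
    final_pred += weight * pred
```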
```
├── LICENSE
├── README.md
├── config               <- config yaml files
│
├── res
│   ├── data             <- encoding pickle files
│   └── models           <- trained and serialized models
│
├── notebooks            <- ipykernel notebooks
│
└── src                  <- source code for use in this project
    │
    ├── data             <- scripts to preprocess data
    │   └── dataset.py
    │
    ├── features         <- scripts for feature engineering
    │   ├── build.py
    │   └── select.py
    │
    ├── models           <- scripts to build and train models
    │   ├── base.py
    │   ├── boosting.py
    │   ├── callbacks.py
    │   ├── infer.py
    │   └── network.py
    │
    ├── tuning           <- model tuning with Optuna
    │   ├── base.py
    │   └── boosting.py
    │
    └── utils            <- utility files
        └── utils.py
```
```sh
conda env create -f environment.yaml  # might be optional
conda activate amex
```
- DART: Dropouts meet Multiple Additive Regression Trees (a LightGBM parameter sketch follows this list)
- rounding-trick
- time-features
- pay-features
- Ensemble model
- feature-engineering
- statement-features
- seed-number
- clean-data
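For reference, a minimal sketch of enabling DART boosting in LightGBM; the placeholder data and parameter values are assumptions, not the tuned parameters from this repo:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 20))    # placeholder features
y_train = rng.integers(0, 2, size=1000)  # placeholder binary target

params = {
    "objective": "binary",
    "boosting": "dart",  # Dropouts meet Multiple Additive Regression Trees
    "drop_rate": 0.1,    # fraction of trees dropped each iteration (assumed value)
    "skip_drop": 0.5,    # probability of skipping dropout on an iteration (assumed value)
    "learning_rate": 0.03,
    "num_leaves": 127,
    "seed": 42,
}
model = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=500)
```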
Project based on the cookiecutter data science project template & Microsoft Recommenders.