---
title: "lightGBM"
author: "liuc"
date: "1/17/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## lightGBM
> https://lightgbm.readthedocs.io/en/latest/
> https://www.kaggle.com/code/athosdamiani/lightgbm-with-tidymodels/notebook
> https://sefiks.com/2020/05/13/xgboost-vs-lightgbm/
> https://stackoverflow.com/questions/70450755/shap-xgboost-and-lightgbm-difference-in-shap-values-calculation
> https://zhuanlan.zhihu.com/p/99069186
LightGBM (Light Gradient Boosting Machine) is a popular gradient boosting ensemble framework based on decision trees. It was designed to be fast, efficient, memory-light, and accurate, and to support parallel training and large-scale data. LightGBM reduces memory usage and communication cost, improves the efficiency of multi-machine parallel training, and achieves close to linear speedup in computation.

The main difference between LightGBM and `xgboost` is the way the trees grow: XGBoost applies `level-wise` tree growth, whereas LightGBM applies `leaf-wise` tree growth. The level-wise approach grows the tree horizontally; the leaf-wise approach grows it vertically.

- Faster training, with support for large-scale data.
  LightGBM speeds up training by combining two techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS samples the most informative (large-gradient) observations for training; EFB reduces dimensionality by bundling mutually exclusive features.
- Lower memory usage.
  LightGBM uses a histogram algorithm to summarise continuous features, which greatly reduces memory use. When building a tree, feature values are binned into buckets and each bucket is treated as a discrete value.
- Higher accuracy.
  LightGBM uses a leaf-wise best-split strategy, which can reduce loss faster than level-wise growth, and it supports efficient parallel learning.
- Support for multiclass problems.
  LightGBM handles multiclass classification directly (for example by combining binary one-vs-all models), without extra retraining, while keeping training efficient.
- Parallel learning.
  LightGBM supports shared-memory parallel learning and can make full use of multi-core machines.
- Robustness to outliers and missing values.
  LightGBM limits the influence of outliers through gradient caps and sampling, and it can handle missing values directly without prior deletion or imputation.

*Data input requirements:* LightGBM supports categorical features directly, so categorical predictors do not need to be one-hot encoded.

When using LightGBM, several *hyperparameters* need to be chosen, such as tree depth, learning rate, and the number of leaves. They are typically tuned with grid search, random search, or Bayesian optimization.
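As a rough illustration, the sketch below uses the plain `lightgbm` interface to declare categorical features directly and to set a few of the hyperparameters mentioned above. The objects `x_mat` (a numeric feature matrix with integer-coded categorical columns) and `y` (the label vector) are hypothetical.

```{r, eval=FALSE}
library(lightgbm)

# categorical columns are stored as integer codes and declared by name,
# so no one-hot encoding is needed (x_mat and y are assumed to exist)
dtrain <- lgb.Dataset(
  data = x_mat,
  label = y,
  categorical_feature = c("workclass", "occupation")
)

# a few commonly tuned hyperparameters
params <- list(
  objective = "binary",
  learning_rate = 0.05,   # shrinkage applied at each boosting step
  num_leaves = 31,        # main complexity control for leaf-wise growth
  max_depth = -1,         # -1 = no depth limit
  min_data_in_leaf = 20,  # regularises very small leaves
  max_bin = 255           # number of histogram bins per feature
)

fit <- lgb.train(params = params, data = dtrain, nrounds = 100)
```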
```{r, include=FALSE}
library(doParallel)
library(tidyverse)
library(tidymodels)
library(bonsai)
# library(treesnip) # superseded by bonsai
show_engines('boost_tree')
all_cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)
```
### Understand the data

We use a binary classification problem as the running example.
```{r}
adult <- read_csv("./datasets/adult.csv")

adult <- adult %>%
  janitor::clean_names() %>%
  # parsnip expects a factor outcome for classification
  mutate(income = as.factor(income))

glimpse(adult)
```
```{r}
paint::paint(adult)
```
### Step 1: train/test split ----------------------------------------
Split the data into training and test sets; detailed data wrangling is not covered here. (On the Python side, the data structures and tools of the Pandas/NumPy ecosystem play the same wrangling role.)
```{r}
set.seed(42)
adult_initial_split <- initial_split(adult, strata = "income", prop = 0.75)
adult_train <- training(adult_initial_split)
adult_test <- testing(adult_initial_split)
adult_initial_split
```
### Step 3: dataprep --------------------------------------------------------
Preprocessing before the data enters the model. Because this workflow is adapted from an existing notebook, detailed cleaning is not covered.

Is `step_dummy` actually necessary? LightGBM handles categorical features natively, so in principle the dummy step can be dropped; see the alternative sketch after the chunk below.
```{r}
adult_recipe <- recipe(income ~ ., data = adult_train) %>%
  step_impute_mode(workclass, occupation, native_country) %>%
  step_zv(all_predictors()) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

head(juice(prep(adult_recipe)))
```
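If we instead rely on LightGBM's native handling of categorical predictors (the `bonsai` lightgbm engine passes factor predictors on to LightGBM as categorical features), a leaner recipe might look like the following sketch, which simply drops `step_dummy()`:

```{r, eval=FALSE}
# alternative: let the lightgbm engine handle categorical predictors natively
adult_recipe_nodummy <- recipe(income ~ ., data = adult_train) %>%
  step_impute_mode(workclass, occupation, native_country) %>%
  step_zv(all_predictors()) %>%
  step_novel(all_nominal(), -all_outcomes())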
### Step 4: model definition -----------------------------------

The hyperparameters are the part that deserves the most attention: which ones to tune, over what ranges, and whether they need tuning at all.
```{r}
adult_model <-
  boost_tree(
    mtry = 5,
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    loss_reduction = tune(),
    learn_rate = tune(),
    sample_size = 0.75
  ) %>%
  set_mode("classification") %>%
  set_engine("lightgbm")

# workflow
adult_wf <- workflow() %>%
  add_model(adult_model) %>%
  add_recipe(adult_recipe)

adult_wf
```
### Step 5: hyperparameter tuning ------------------------------------------
```{r}
# resamples
adult_resamples <- vfold_cv(adult_train, v = 5)

# grid
# build a table of candidate hyperparameter values
adult_grid <- hardhat::extract_parameter_set_dials(adult_model) %>%
  finalize(adult_train) %>%
  grid_random(size = 200)

head(adult_grid)
```
```{r}
# grid search
# compared with xgboost, lightgbm is noticeably faster here: roughly ten-odd minutes
adult_tune_grid <- adult_wf %>%
  tune_grid(
    resamples = adult_resamples,
    grid = adult_grid,
    control = control_grid(verbose = FALSE),
    metrics = metric_set(roc_auc)
  )
autoplot(adult_tune_grid)
```
*Interpretation:* the plot shows the four tuned parameters. Performance appears to improve as the learning rate increases; minimal node size and loss reduction have little effect; and beyond a tree depth of roughly 8 the performance barely changes.
```{r}
# top 5 hyperparameter sets
show_best(adult_tune_grid, metric = "roc_auc")
```
Plugging the parameters found by grid search back into the workflow gives the finalized model, which is then fit on the training data.
### Step 6: last fit performance ------------------------------------------
Now train the final model.
```{r}
# select the best hyperparameters found
adult_best_params <- select_best(adult_tune_grid, metric = "roc_auc")
adult_wf <- adult_wf %>% finalize_workflow(adult_best_params)

# last fit
# fit on the training set and evaluate on the test set
adult_last_fit <- last_fit(
  adult_wf,
  adult_initial_split
)

# metrics
collect_metrics(adult_last_fit)
```
As an aside, instead of `last_fit()` we can fit the finalized workflow on the training data directly:

```{r, eval=FALSE}
# equivalent by hand (not run): fit the finalized workflow on the training set,
# then predict on the held-out test set
adult_fit <- fit(adult_wf, data = adult_train)
predict(adult_fit, adult_test, type = "prob") %>% head()
```
```{r}
# roc curve
adult_test_preds <- collect_predictions(adult_last_fit)
adult_roc_curve <- adult_test_preds %>% roc_curve(income, `.pred_<=50K`)
autoplot(adult_roc_curve)

# confusion matrix
adult_test_preds %>%
  mutate(
    income_class = factor(if_else(`.pred_<=50K` > 0.6, "<=50K", ">50K"))
  ) %>%
  conf_mat(income, income_class)
```
```{r}
# AUC on the test-set predictions
adult_test_preds %>%
  yardstick::roc_auc(income, `.pred_<=50K`)
```
### SHAP
```{r}
library(shapr)
```
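The `shapr` package computes Shapley-value explanations for tabular models. For a LightGBM booster, SHAP-style contribution values can also be obtained straight from `predict()`. A minimal sketch, assuming a fitted booster `lgb_fit` and a numeric feature matrix `x_mat` (both hypothetical names); the exact argument depends on the lightgbm version:

```{r, eval=FALSE}
# SHAP-style feature contributions from the booster itself (lightgbm >= 4.0);
# older versions use `predcontrib = TRUE` instead of `type = "contrib"`
shap_vals <- predict(lgb_fit, x_mat, type = "contrib")

# one row per observation, one column per feature plus a final bias column
dim(shap_vals)
```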
### Saving the model

Only after validation is the fitted model the final result we need; it then has to be saved and deployed.
```{r}
adult_wf_model <- adult_last_fit$.workflow[[1]]
predict(adult_wf_model, adult_test[10, ])
saveRDS(adult_wf_model, here::here("final_model.rds"))
```
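One caveat: a workflow that wraps a LightGBM booster may not survive a plain `saveRDS()`/`readRDS()` round trip in every lightgbm version, because the booster holds an external pointer. The `bundle` package exists for this case; a sketch, assuming `bundle` is installed:

```{r, eval=FALSE}
library(bundle)

# serialize the workflow in a transport-safe form
bundled_model <- bundle(adult_wf_model)
saveRDS(bundled_model, here::here("final_model_bundle.rds"))

# later / elsewhere: restore the workflow and predict as usual
restored <- unbundle(readRDS(here::here("final_model_bundle.rds")))
predict(restored, adult_test[10, ])
```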
Read the saved model back in and make predictions:

```{r}
final_model <- readRDS(here::here("final_model.rds"))
predict(final_model, adult_test[1:5, ])
```
## lightGBM by lightgbm
boosting_type: the type of boosting used for the weak learners. Possible values (see the sketch after this list):

- gbdt (default): traditional gradient boosted decision trees.
- rf: random forest mode.
- dart: drops a fraction of trees via dropout, which helps prevent overfitting.
- goss: Gradient-based One-Side Sampling; faster, but may underfit.
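The boosting mode is set through the parameter list passed to `lgb.train()`. A small sketch (parameter values are illustrative only):

```{r, eval=FALSE}
# example: DART boosting instead of the default GBDT
dart_params <- list(
  objective = "binary",
  boosting = "dart",    # alias: boosting_type
  drop_rate = 0.1,      # fraction of trees dropped per iteration (dart only)
  learning_rate = 0.05,
  num_leaves = 31
)
```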
```{r}
library(lightgbm)
```
```{r}
# two example parameter sets: a simple regression configuration and a more
# heavily regularised binary-classification configuration
params <- list(
  objective = "regression",
  metric = "rmse",
  num_leaves = 30,
  learning_rate = 0.1
)

lgb.grid <- list(
  objective = "binary",
  metric = "auc",
  min_sum_hessian_in_leaf = 1,
  feature_fraction = 0.7,
  bagging_fraction = 0.7,
  bagging_freq = 5,
  min_data = 100,
  max_bin = 50,
  lambda_l1 = 8,
  lambda_l2 = 1.3,
  min_data_in_bin = 100,
  min_gain_to_split = 10,
  min_data_in_leaf = 30,
  is_unbalance = TRUE
)

# custom evaluation function: normalized Gini
# (NormalizedGini() comes from the MLmetrics package)
lgb.normalizedgini <- function(preds, dtrain) {
  actual <- getinfo(dtrain, "label")
  score <- MLmetrics::NormalizedGini(preds, actual)
  return(list(name = "gini", value = score, higher_better = TRUE))
}
```
```{r, eval=FALSE}
# NOTE: this chunk assumes a training `lgb.Dataset` stored in `lgb.train`
# (which shadows the function of the same name) and `categoricals.vec`, a
# vector of categorical feature names, have been created beforehand.

# cross-validation to pick the number of boosting rounds
lgb.model.cv = lgb.cv(params = lgb.grid, data = lgb.train, learning_rate = 0.02, num_leaves = 25,
                      num_threads = 2, nrounds = 7000, early_stopping_rounds = 50,
                      eval_freq = 20, eval = lgb.normalizedgini,
                      categorical_feature = categoricals.vec, nfold = 5, stratified = TRUE)
best.iter = lgb.model.cv$best_iter

# final model trained at the best iteration found by CV
lgb.model = lgb.train(params = lgb.grid, data = lgb.train, learning_rate = 0.02,
                      num_leaves = 25, num_threads = 2, nrounds = best.iter,
                      eval_freq = 20, eval = lgb.normalizedgini,
                      categorical_feature = categoricals.vec)
```
```{r, eval=FALSE}
# NOTE: assumes `train` is an lgb.Dataset built with the regression `params`
# above, and `test` is a data frame with the same predictors plus a `target` column

# train the model
model <- lgb.train(params,
                   data = train,
                   nrounds = 100,
                   verbose = 1)

# make predictions on the test set (predictors only, as a numeric matrix)
predictions <- predict(model, as.matrix(test[, setdiff(names(test), "target")]))

# evaluate the model: root mean squared error
rmse <- sqrt(mean((predictions - test$target)^2))
print(rmse)
```