---
title: "random_forest_tidymodels"
author: "liuc"
date: '2022-05-23'
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## random forest by tidymodels
> http://www.ehbio.com/ML/randomForest.html
> https://towardsdatascience.com/understanding-random-forest-58381e0602d2
> https://github.com/stevenpawley/recipeselectors/issues/1
A random forest is a model built from an ensemble of decision trees trained on random subsamples of the data. It belongs to the bagging (Bootstrap AGgregation) branch of ensemble learning.
A `bagging model` is the same as a random forest where mtry is equal to the number of predictors.
As an ensemble method, a random forest first resamples the data at random to produce multiple bootstrap samples and fits a decision tree to each; the subset of features considered is also random, which is the key difference between a random forest and a single decision tree. For regression, the outputs of all trees are averaged; for classification, the class receiving the most votes (the most frequent predicted class) becomes the final prediction. About one third of the training samples are left out of each bootstrap sample; these out-of-bag (OOB) samples act as an internal validation set used to assess the predictions.
Roughly, the procedure is: given an `n x p` matrix, draw n samples with replacement to form a bootstrap data set and train a decision tree on it, with each tree choosing from only m randomly selected features (m << p) at every split; trees are grown fully without pruning. Repeat until t trees have been built, then aggregate the outputs of the t trees for the final decision.
*OOB (out-of-bag) error rate*: because every tree is built on a bootstrap sample drawn with replacement, roughly one third of the samples are not used by that tree, and these left-out samples are used to test its classification performance. The OOB mechanism distinguishes random forests from many other machine learning methods: it effectively provides multiple built-in training/test splits, so model accuracy can be estimated internally from the OOB error (via the bootstrap) without a separate test set.
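A tiny, self-contained illustration of this built-in OOB estimate (a hedged sketch assuming the `ranger` package and the built-in `iris` data; nothing here depends on the data used later, and the chunk is not evaluated when knitting):
```{r, eval=FALSE}
library(ranger)

set.seed(42)
fit <- ranger(Species ~ ., data = iris, num.trees = 500)

# OOB misclassification rate, estimated without a separate test set
fit$prediction.error
```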
*Data it suits well*: random forests handle very high-dimensional data (many features), such as microbiome data, without requiring feature engineering. They also work on data sets that mix continuous and categorical variables, so they are applicable to nearly any data and can even be used purely for variable screening; the number of samples, of course, still should not be too small.
*Key hyperparameters*: the random forest algorithm has three main hyperparameters to set before training: the node size (min_n), the number of trees (ntree/trees), and the number of features sampled at each split (mtry). Other engine-specific hyperparameters can be passed through `set_engine()`.
For mtry, `randomForest()` defaults to `p / 3` variables when building a random forest of regression trees and `sqrt(p)` variables when building a random forest of classification trees.
The number of trees is usually kept at or below 500; some even claim one should never use more than 500 trees because it is a waste of time, although in practice many more trees are often used.
The right values ultimately depend on the data; the family of parameter functions provided by the `dials` package is a useful reference (see the sketch after the hyperparameter list below).
Other tunable hyperparameters include the split criterion (Gini impurity, MSE, etc.). The scikit-learn hyperparameters below describe the same ideas:
`n_estimators`: a random forest is nothing but a group of decision trees, and n_estimators controls how many trees the ensemble contains. Using more trees does not always give a more general result; it will not cause overfitting, but it does increase the computational cost. The scikit-learn default is 100.
`max_depth`: the maximum depth to which the trees in the forest may grow. It is one of the most important hyperparameters for accuracy: accuracy improves with depth up to a point and then gradually degrades as the model overfits, so it must be set carefully. The default is None, which means nodes keep splitting until every leaf is pure or holds fewer than min_samples_split samples.
`min_samples_split`: the minimum number of samples an internal node must hold in order to be split further. With a very small value the trees keep growing and overfit; increasing it reduces the number of splits and hence the model's complexity, which helps against overfitting, but a value that is too large makes the model underfit. Values between 2 and 6 are typical; the default is 2.
`min_samples_leaf`: the minimum number of samples a node must hold after a split. It also helps reduce overfitting in highly flexible models, but increasing it too far limits the model's capacity and can lead to underfitting. The default is 1.
`max_features`: random forests take random subsets of features when searching for the best split; max_features sets how many features are considered. It can take four values: "auto", "sqrt", "log2", and None.
auto: max_features = sqrt(n_features)
sqrt: max_features = sqrt(n_features), the same as auto
log2: max_features = log2(n_features)
None: max_features = n_features
`max_leaf_nodes`: an upper limit on the number of leaf nodes, which restricts splitting, reduces tree depth, and thus helps control overfitting. When set to None, trees grow without this limit.
`max_samples`: the maximum number of samples drawn from the training data to train each individual tree.
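As a hedged illustration of the dials helpers mentioned above (assumptions: the `dials` package is installed and the built-in `mtcars` data is used only for the `finalize()` example; default ranges may differ across dials versions):
```{r, eval=FALSE}
library(dials)

mtry()     # data-dependent upper bound ([1, ?] until finalized)
trees()    # number of trees
min_n()    # minimal node size

# fill in the mtry upper bound from the number of predictor columns
finalize(mtry(), x = mtcars[, -1])
```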
*Variable importance in random forests*: Gini importance / mean decrease in impurity (MDI) is commonly used to measure how much the model degrades when a given variable is excluded. In addition, permutation importance (also called MDA, mean decrease in accuracy) measures the average drop in accuracy when the values of a feature are randomly permuted in the OOB samples. An overall permutation-based importance score (ACS) and its Z-score are a relatively reliable way to assess variable importance, but such Z-scores can only be used to rank which variables matter more; they cannot be used to declare a set of important variables on the basis of statistical significance.
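A minimal sketch of how these two importance flavours are requested from the ranger engine in parsnip (the object names here are illustrative, and the chunk is not evaluated):
```{r, eval=FALSE}
library(tidymodels)

# Gini importance / mean decrease in impurity (MDI)
rf_impurity <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

# permutation importance (MDA-style, computed on the OOB samples)
rf_permutation <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")
```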
*Advantages and disadvantages:*
1. It handles very high-dimensional data (many features) without dimensionality reduction or explicit feature selection.
2. It can estimate the importance of each feature.
3. It can capture interactions between features.
4. It is not prone to overfitting.
5. Training is fairly fast and easy to parallelize.
6. It is simple to implement.
7. For imbalanced data sets it can balance the error.
8. It can maintain accuracy even when a large fraction of feature values are missing.
Disadvantages
1. Random forests have been shown to overfit on some noisy classification or regression problems.
2. For categorical predictors with differing numbers of levels, variables with more levels exert a larger influence on the forest, so the variable importance values produced on such data are not trustworthy.
*Preprocessing the input data:*
Although random forests are not very sensitive to outliers and often need little preprocessing, missing values, categorical variables, and class imbalance still have to be handled according to the specific problem, and clearly unreasonable outliers identified during basic data checks should of course be removed. So data preprocessing still matters even for random forests; a hedged recipe sketch follows.
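A minimal preprocessing sketch, assuming the `class` outcome and `df_train` training data that are created later in this document (the downsampling step is optional and only relevant for imbalanced outcomes; not evaluated here):
```{r, eval=FALSE}
library(tidymodels)
library(themis)

# typical preprocessing for a random forest, mirroring the recipe used further below
rf_prep_rec <- recipe(class ~ ., data = df_train) %>%
  step_impute_knn(all_predictors()) %>%      # impute missing values
  step_nzv(all_predictors()) %>%             # drop near-zero-variance predictors
  step_dummy(all_nominal_predictors()) %>%   # encode categorical predictors
  step_downsample(class, under_ratio = 1)    # optional: address class imbalance
```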
Multi-class problems: random forests handle multi-class classification natively and usually outperform a single decision tree.
High-dimensional data: random forests cope with high-dimensional data, are not very prone to overfitting, and include built-in variable importance measures.
Small samples: random forests can learn from relatively few samples without overfitting easily.
Messy data with missing values or unlabeled records: random forests are not very sensitive to missing data.
Manifold or nonlinear data: random forests are well suited to learning complex nonlinear patterns.
Noisy data: by averaging many weak learners, random forests mitigate overfitting and the influence of noise.
```{r, include=FALSE}
library(tidyverse)
library(tidymodels)
library(vip)
library(usemodels)
library(themis)
tidymodels_prefer()
# register a parallel cluster
cl <- parallel::makePSOCKcluster(4)
doParallel::registerDoParallel(cl)
```
*prepare input data*:
The following link hosts a number of cancer tissue expression data sets: https://file.biolab.si/biolab/supp/bi-cancer/projections/
They can be used as test data.
The prostate data set (Singh et al.) includes the gene expression measurements for 52 prostate tumors and 50 adjacent normal prostate tissue samples.
Although this test data set does not have many samples, it has a great many variables, so a single decision tree model would not be a good strategy.
```{r, include=FALSE}
# exploratory data analysis is skipped for now
expr_file <- "datasets/prostat.expr.symbol.txt"
metadata_file <- "datasets/prostat.metadata.txt"
expr_mat <- read_delim(expr_file, delim = "\t") %>%
janitor::clean_names()
metadata <- read_delim(metadata_file, delim = "\t") %>%
janitor::clean_names()
# the `Boruta` results could be added here to do some variable screening
input_data <- expr_mat %>%
column_to_rownames("symbol") %>%
t() %>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble() %>%
janitor::clean_names() %>%
left_join(metadata, by = c("rowname" = "sample")) %>%
mutate(class = if_else(class == "tumor", 1, 0)) %>%
mutate(class = as_factor(class))
# alternatively, keep the rowname column and handle it below via `update_role()` in the recipe
set.seed(42)
df_split <- initial_split(input_data %>% select(-rowname))
df_train <- training(df_split)
df_test <- testing(df_split)
```
##### A small RF regression example from the web
A regression problem.
```{r}
# a small example from the web
# https://juliasilge.com/blog/ikea-prices/
ikea <- read_csv("~/Downloads/ikea.csv")
ikea_df <- ikea %>%
select(price, name, category, depth, height, width) %>%
mutate(price = log10(price)) %>%
mutate_if(is.character, factor)
ikea_df
```
```{r}
set.seed(123)
# use ikea_* names so the prostate df_split/df_train/df_test created above are not overwritten
ikea_split <- initial_split(ikea_df, strata = price)
ikea_train <- training(ikea_split)
ikea_test <- testing(ikea_split)
set.seed(234)
ikea_folds <- bootstraps(ikea_train, strata = price, times = 10)
ikea_folds
```
```{r}
library(textrecipes)
ranger_recipe <-
recipe(formula = price ~ ., data = ikea_train) %>%
step_other(name, category, threshold = 0.01) %>%
step_clean_levels(name, category) %>%
step_impute_knn(depth, height, width)
ranger_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_mode("regression") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
```
```{r}
set.seed(8577)
# doParallel::registerDoParallel()
ranger_tune <-
tune_grid(ranger_workflow,
resamples = ikea_folds,
grid = 11
)
```
```{r}
show_best(ranger_tune, metric = "rmse")
```
```{r}
final_rf <- ranger_workflow %>%
finalize_workflow(select_best(ranger_tune, metric = 'rmse'))
final_rf
```
```{r}
final_fit <- last_fit(final_rf, ikea_split)
final_fit
```
```{r}
attributes(final_fit)
```
#### Random forests: a classification problem
Commonly used R packages for random forests include `ranger` and `randomForest`; here we use the ranger package, which seems to be faster.
Whereas the `randomForest` package provides forests based on traditional decision trees, the `cforest()` function in the `party` package can be used to generate random forests based on conditional inference trees. If predictor variables are highly correlated, a random forest using conditional inference trees may provide better predictions.
Unlike decision trees, logistic regression, and similar models, random forests and their relatives are often described as black-box models when it comes to interpreting variables. Although random forests do provide variable importance measures, after building the forest we will also show some model-explanation examples using packages from the XAI (explainable AI) ecosystem.
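As a sketch of the conditional-inference alternative just mentioned (assumptions: the `party` package is installed, the `class` outcome and `df_train` defined below are reused, and the mtry value is only illustrative; with thousands of predictors this is slow, so the chunk is not evaluated):
```{r, eval=FALSE}
library(party)

# a random forest of conditional inference trees
set.seed(42)
cf_fit <- cforest(
  class ~ .,
  data = df_train,
  controls = cforest_unbiased(ntree = 500, mtry = 95)
)

# conditional permutation importance (can be very slow with many predictors)
# varimp(cf_fit, conditional = TRUE)
```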
*With this many features, some feature engineering is of course indispensable.*
For testing purposes, however, the example below uses all of the features.
```{r, eval=FALSE}
# Boruta feature selection to identify the key discriminating variables
# the Boruta package in R
library(Boruta)
set.seed(42)
boruta <- Boruta(x=df_train %>% select(-class),
y=df_train %>% pull(class),
pValue=0.01,
mcAdj=T,
maxRuns=100
)
boruta
boruta.finalVarsWithTentative <- data.frame(Item=getSelectedAttributes(boruta, withTentative = T),
Type="Boruta_with_tentative")
# boruta_train_data <- train_data[, boruta.finalVarsWithTentative$Item]
# boruta_test_data <- test_data[, boruta.finalVarsWithTentative$Item]
```
```{r}
show_engines('rand_forest')
```
```{r}
# extra arguments set via set_engine() still need to be looked up in the underlying package's documentation
rf_spec <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_engine("ranger",
importance = 'impurity',
# num.threads = 4
) %>%
set_mode("classification")
# with some 9000 variables and only ~100 samples, do we need any recipe preprocessing steps?
# a random forest arguably does not, but the steps below do no harm
rf_rec <- recipe(class ~ ., data = df_train) %>%
step_impute_knn(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal(), -all_outcomes())
# themis::step_downsample(class, under_ratio = 1, skip = TRUE)
class_rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(rf_rec)
```
Some operations from the `tidymodels` family of packages:
```{r}
# inspect all arguments of the parsnip model and the tunable parameters of the workflow
args(rand_forest)
extract_parameter_set_dials(class_rf_wf)
```
```{r}
# tune the hyperparameters
set.seed(42)
trees_folds <- vfold_cv(df_train,
v = 5
)
metricSet <- metric_set(accuracy, yardstick::sens, yardstick::spec, ppv)
# note: metricSet is not passed to tune_grid() below, so the default metrics (including roc_auc) are used
# choosing sensible ranges for the hyperparameters is something worth consulting references for
# dials provides helpful defaults and is a nice piece of design
# e.g. for this example (76 x 9022, classification), sqrt(p) suggests an mtry of roughly 95
# a manual grid, shown for illustration only (the dials grid below is the one actually used);
# min_n needs concrete values here, otherwise expand.grid() returns a zero-row grid
grid <- expand.grid(
  mtry = c(50, 100, 200),
  min_n = c(10, 20, 30),
  trees = c(100, 300, 500)
)
# use dials to create the grid
# the finalized mtry upper bound (the number of columns) looks rather large here
dials_grid <- dials::grid_random(
finalize(mtry(), x = df_train),
min_n(),
trees(),
size = 10
)
# the object produced by the grid search
tune_res <- tune_grid(
class_rf_wf,
resamples = trees_folds,
control = control_grid(save_pred = TRUE),
grid = dials_grid
)
# save(tune_res, file = './datasets/tune_res_rf.rda')
```
When tuning hyperparameters over a grid, you can start with a few values, look at how the models perform, tune a few more times, and only then settle on the final hyperparameters.
Below we explore how the `roc_auc` metric varies across the parameter values; the grid was kept small for demonstration.
Automatic hyperparameter tuning, for example iterative search guided by resampling or a validation set, is also a good option; a hedged sketch follows.
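A minimal sketch of such an automated search (assumptions: the `class_rf_wf` workflow and `trees_folds` resamples from above; the mtry range is illustrative; the chunk is not evaluated here):
```{r, eval=FALSE}
# Bayesian (iterative) search instead of a fixed grid
rf_params <- extract_parameter_set_dials(class_rf_wf) %>%
  update(mtry = mtry(c(10, 200)))   # give mtry an explicit, data-informed range

set.seed(42)
bayes_res <- tune_bayes(
  class_rf_wf,
  resamples = trees_folds,
  param_info = rf_params,
  initial = 5,    # start from a few space-filling candidates
  iter = 15,
  control = control_bayes(no_improve = 10)
)

show_best(bayes_res, metric = "roc_auc")
```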
```{r}
load('./datasets/tune_res_rf.rda')
tune_res %>%
collect_metrics()
show_best(tune_res, metric = "roc_auc", n = 3)
# manually inspect how AUC relates to a couple of the parameters
tune_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, min_n, mtry) %>%
pivot_longer(min_n:mtry,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC") +
theme_bw()
```
*Interpreting the result:* the plot shows how AUC changes as mtry increases, although the sampled mtry values are not distributed very evenly; in practice it is worth searching over more parameter values.
```{r}
# similar to the result above, but using the built-in autoplot()
autoplot(tune_res)
```
*Choosing the best model*
Having obtained the hyperparameters, fit the candidate model. Note that what we have so far are only hyperparameters; the model's parameters are learned during the fitting process itself.
```{r, eval=FALSE}
# this chunk performs a standalone fit
best_auc <- select_best(tune_res, metric = "roc_auc")
final_rf <- finalize_model(
rf_spec,
best_auc
)
final_rf
# the main difference between final_rf and final_wf: the former is just the model, the latter also contains the data preprocessing
final_wf <- finalize_workflow(class_rf_wf, best_auc)
# final_fit applies the recipe to the data while fitting
final_fit <- final_wf %>% fit(df_train)
```
*A closer look at the final_fit object*
The `final_fit` object is a workflow, so when predicting on new data later you can pass it the raw data directly.
Isn't the `final_fit` object rather large, though, something like 10 GB?
```{r}
final_fit
# check that the training data did go through the recipe
extract_recipe(final_fit)
```
```{r}
final_fit %>% lobstr::obj_sizes()
```
AUC and ROC curve for the best parameters; `tune_res` holds a lot of metrics.
The results below indicate good performance on the training resamples.
*The model's standard metrics should be reported as well.*
```{r}
# predictions from the resamples at the best-AUC parameters
# consider annotating the plot with the AUC value
rf_train_auc <-
tune_res %>%
collect_predictions(parameters = best_auc) %>%
roc_curve(., truth = class, .pred_0) %>%
mutate(model = "Random Forest")
rf_train_auc %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) +
geom_path(linewidth = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal() +
scale_color_viridis_d(option = "plasma", end = .6) +
theme_bw()
```
`final_res` is the final model fitted on the entire training set and evaluated on the test data; the `extract_` family of functions can operate on it.
Running either this or the `final_fit` chunk above is sufficient.
```{r}
best_auc <- select_best(tune_res, metric = "roc_auc")
final_rf <- finalize_model(
rf_spec,
best_auc
) %>%
set_engine("ranger",
importance = 'impurity' # 'permutation'
)
final_wf <- final_wf %>%
update_model(final_rf)
# This function fits a final model on the entire training set and evaluates on the testing set
final_res <- final_wf %>%
last_fit(split = df_split)
# performance on the test set
final_res %>%
collect_metrics()
```
*A series of operations on final_res:*
Why are operations on the final_res object always so slow, and why is the object as large as ~10 GB?
```{r, eval=FALSE}
# do not run this!!!!!
# why does merely printing the object take so long, even though it has already been computed?
final_res
```
```{r, eval=FALSE}
# do not run this cell!!!!!
extract_fit_engine(final_res)
```
*Performance on the test data.*
`final_fit` is the model fitted directly on the training data;
`final_res` is the result of fitting on the training data and then evaluating on the test set, and it contains the same fitted workflow as `final_fit`.
```{r}
test_aug <- augment(final_fit, new_data = df_test) %>%
select(class, starts_with('.'))
# the subtle difference between test_aug and final_res$.predictions has the same cause as in the xgboost example
final_res$.predictions
```
```{r}
# confusion matrix for test dataset
final_res %>%
collect_predictions() %>%
conf_mat(class, .pred_class)
```
Plot the ROC curve on the test data:
```{r}
collect_predictions(final_res) %>%
roc_curve(class, .pred_0) %>%
autoplot()
```
A polished version of the ROC curve, showing performance on the test data:
```{r}
roc_d <- collect_predictions(final_res) %>%
roc_curve(class, .pred_0)
roc_d %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path(linewidth = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal() +
scale_color_viridis_d(option = "plasma", end = .6) +
theme_bw()
```
*Predicting on new data:*
The purpose of a fitted model is prediction, and newly supplied data must go through the same recipe steps as the training data.
`final_fit` is the model obtained by fitting the training data directly; in this example `class(final_fit)` shows it is a workflow.
`final_res` is likewise a model obtained through a workflow.
```{r}
# if the data went through a recipe, new data passed to a bare predict() should in principle receive the same processing
# it is oddly slow
# (the recipe should be prepped on the training data, then baked on the test data)
test_processed <- prep(rf_rec, training = df_train) %>%
  bake(new_data = df_test)
# feeding the original raw data directly to final_fit (the workflow applies its own recipe)
cancer_pred <- predict(final_fit,
new_data = df_test) %>%
bind_cols(predict(final_fit, df_test, type = "prob")) %>%
bind_cols(df_test %>% select(class))
cancer_pred2 <- predict(final_fit,
new_data = test_processed) %>%
bind_cols(predict(final_fit, test_processed, type = "prob")) %>%
bind_cols(df_test %>% select(class))
# consistent with the final_res results above
cancer_pred %>% roc_auc(truth = class, .pred_1, event_level="second")
cancer_pred %>% accuracy(truth = class, .pred_class)
```
Variable importance via the `vip` package:
Variable importance is a quantity we often need to examine; in a random forest it is used to interpret each variable's importance and contribution.
The main division of labour is:
`vip` functions when we want to use model-based methods that take advantage of model structure (and are often faster);
DALEX functions when we want to use model-agnostic methods that can be applied to any model (a hedged DALEX sketch follows the next chunk).
```{r}
# refitting here would only be for demonstration; the importance argument can be set when the model is built
# vip_res <- final_wf %>%
# update_engine_parameters(importance = "permutation") %>%
# fit(class ~ .,
# data = df_train
# )
#
# vip_res %>%
# vip::vip(geom = "col", num_features = 20)
# or we can run this
final_res %>%
extract_fit_parsnip() %>%
vip::vip(num_features = 20)
```
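For the model-agnostic route mentioned above, a minimal sketch with `DALEX`/`DALEXtra` (assumptions: those packages are installed and `final_fit`, `df_train`, and `class` are the objects defined above; permutation importance over ~9000 predictors is slow, so the chunk is not evaluated):
```{r, eval=FALSE}
library(DALEXtra)

# wrap the fitted tidymodels workflow in an explainer
rf_explainer <- explain_tidymodels(
  final_fit,
  data = df_train %>% select(-class),
  y = as.integer(df_train$class == "1"),
  label = "random forest"
)

# model-agnostic permutation importance
rf_parts <- model_parts(rf_explainer, B = 5)
plot(rf_parts, max_vars = 20)
```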
```{r}
# the workflow to be saved
save_wf_model <- extract_workflow(final_res)
# ranger stores importances in `variable.importance`; turn that named vector into a tibble for plotting
extract_fit_engine(save_wf_model) %>%
  pluck("variable.importance") %>%
  enframe(name = "term", value = "value") %>%
  slice_max(value, n = 10) %>%
  ggplot(aes(value, fct_reorder(term, value))) +
  geom_col(alpha = 0.8, fill = "midnightblue") +
  labs(x = "Variable importance score", y = NULL)
```
*save model*
```{r}
# crash_wf_model is the final model, though oddly it differs from final_fit in some respects?
crash_wf_model <- final_res$.workflow[[1]]
predict(crash_wf_model, df_test[10, ])
saveRDS(crash_wf_model, here::here("crash-api", "crash-wf-model.rds"))
collect_metrics(final_res) %>%
write_csv(here::here("crash-api", "crash-model-metrics.csv"))
```
```{r}
library(vetiver)
v <- vetiver_model(
save_wf_model,
"traffic-crash-model",
metadata = list(metrics = collect_metrics(final_res) %>% dplyr::select(-.config))
)
v
```
```{r}
library(pins)
b <- board_rsconnect()
vetiver_pin_write(b, v)
```
#### Random forest SHAP values
```{r}
library(shapr)
library(fastshap)
```
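No SHAP code has been filled in yet; below is a minimal, hedged sketch with `fastshap` (assumptions: the fitted `final_fit` workflow, `df_train`, and `class` from above, with "1" as the positive class; Monte Carlo SHAP over ~9000 predictors is very slow, so `nsim` is tiny, only a few rows are explained, and the chunk is not evaluated):
```{r, eval=FALSE}
# prediction wrapper: probability of the positive class ("1" = tumor)
pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_1
}

X_train <- as.data.frame(df_train %>% select(-class))

set.seed(42)
shap_vals <- fastshap::explain(
  final_fit,                  # the fitted workflow (recipe + ranger model)
  X = X_train,                # background data for the Monte Carlo sampling
  newdata = X_train[1:5, ],   # explain only a few observations
  pred_wrapper = pred_fun,
  nsim = 10                   # increase for more stable estimates
)

# largest absolute contributions for the first explained observation
tibble(
  feature = colnames(shap_vals),
  shap = unlist(shap_vals[1, ], use.names = FALSE)
) %>%
  arrange(desc(abs(shap))) %>%
  slice_head(n = 10)
```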