rf_pipeline.Rmd

---
title: "rf_pipeline"
author: "liuc"
date: "4/25/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 随机森林的一些笔记

> http://www.ehbio.com/ML/feature-selection.html

一个细致的随机森林分析建模过程，基于`Randomforest`包。

有一些算法如随机森林等在构建模型时可以自己选择有最合适的分类或预测能力的特征集合。前面提到，基于随机置换的变量的整体重要性得分(ACS)是评估变量重要性的一个比较可靠的方式。但这种方式获得的ACS Z-score只能用于排序确定哪些变量更重要， 但不能根据统计显著性获得所有重要的变量。因此如果我们的目的还希望鉴定出哪些特征对分类或预测有贡献时，还需要借助其他的方式。

特征递归消除(RFE, recursive feature elimination)是一个常用的方法；它基于模型预测性能评估指标如准确性等选择 一个性能较佳的最小集合的特征变量。它首先应用所有的特征变量构建模型，然后移除对模型最不重要的一定比例的变量再次构建模型；持续迭代直至获得准确性最高的模型。

在后续应用预测模型时，只需要使用数目比较少的变量，这样的好处是便于新样品的变量检测。但在临床应用时，不同的数据集包含的变量可能相差较大，共有的变量较少，但却可以获得相似的预测性能。其中一个主要原因是组学数据中，很多变量之间是相关性比较强的。因此不同的变量集合可能会获得同样的预测性能。而在构建模型时保留这些相关的冗余变量可以获得更稳定、准确的预测结果，同时对病理机制也能提供更全面的解释。

其它变量选择方法不去除冗余变量，而是通过估计不相干变量的重要性值的分布，选择具有显著更高重要性值的变量作为预测变量。这些方法越来越受欢迎。包括permutation (Perm)和它的参数化版本Altmann,Boruta, r2VIM (recurrent relative variable importance)和Vita。在2019年的一篇基于组学数据的评估文章中得出结论：Boruta和Vita算法的稳定性最好，适合分析高维数据。Vita的计算速度比Boruta快很多更适合大数据集，但只有Boruta可用于低维数据。 (https://doi.org/10.1093/bib/bbx124).


## use caret

```{r message=FALSE, include=FALSE}
library(caret)
library(glmnet)

library(Boruta) # 特征选择

```

read data, 一个摘抄自网路上的示例数据。
```{r}
expr_file <- "./datasets/prostat.expr.symbol.txt"
metadata_file <- "./datasets/prostat.metadata.txt"
expr_mat <- read.table(expr_file, row.names = 1, header = T, sep="\t", 
                       stringsAsFactors =T)
# 处理异常的基因名字
rownames(expr_mat) <- make.names(rownames(expr_mat))
metadata <- read.table(metadata_file, row.names=1, header=T, sep="\t", stringsAsFactors =T)
table(metadata)


expr_mat <- t(expr_mat)
expr_mat_sampleL <- rownames(expr_mat)
metadata_sampleL <- rownames(metadata)
common_sampleL <- intersect(expr_mat_sampleL, metadata_sampleL)

expr_mat <- expr_mat[common_sampleL,,drop=F]
metadata <- metadata[common_sampleL,,drop=F]

```

102例样本的9千多个特征。数据为特征维度远多于样本数量的数据。

```{r}
set.seed(42)

train_index <- createDataPartition(metadata$class, p=0.75, list = F)

train_data <- expr_mat[train_index,]
train_data_group <- metadata[["class"]][train_index]

test_data <- expr_mat[-train_index,]
test_data_group <- metadata[["class"]][-train_index]

dim(train_data)
dim(test_data)
```


### 建模

*特征选择*

随机森林模型可以自身对变量特征进行选择。

`Boruta`法在特征变量筛选是既可以用于分类问题，也可以用于回归问题，其是通过随机森林建模并计算每个变量的重要性得分，通过和影子变量进行对比，得到重要的变量。
Python中有人实现了除了随机森林外的其他的树模型的包，以及整合SHAP变量特征重要性的包。详细见对应的Python笔记。

`Boruta`

```{r}
# boruta 的数据输入 需要对数据进行前处理吗？
boruta <- Boruta::Boruta(x = train_data, y = train_data_group, 
                         pValue=0.01, 
                         mcAdj=T, 
                         maxRuns=100)

table(boruta$finalDecision)
# boruta$finalDecision[boruta$finalDecision=="Confirmed"]

Boruta::plotImpHistory(boruta)
```
*Tentative, Confirmed, Rejected*,
其中图片的结果为：绿色是重要的变量，红色是不重要的变量，蓝色是影子变量，黄色是Tentative变量.


提取重要变量, 其结果为Tentative+Confirmed, 如下将提取99个变量.

```{r}
boruta.finalVarsWithTentative <- data.frame(Item=getSelectedAttributes(boruta, withTentative = T), 
                                            Type="Boruta_with_tentative")

caret::featurePlot(train_data[,boruta.finalVarsWithTentative$Item], train_data_group, plot="box")

boruta_train_data <- train_data[, boruta.finalVarsWithTentative$Item]
boruta_test_data <- test_data[, boruta.finalVarsWithTentative$Item]
```

此处摘抄至网络上的一个小函数，主要为了实现基于Boruta 选择的变量拟合模型。
```{r}
# 定义一个函数生成一些列用来测试的mtry (一系列不大于总变量数的数值)。
generateTestVariableSet <- function(num_toal_variable){
  max_power <- ceiling(log10(num_toal_variable))
  tmp_subset <- c(unlist(sapply(1:max_power, function(x) (1:10)^x, simplify = F)), ceiling(max_power/3))
  #return(tmp_subset)
  base::unique(sort(tmp_subset[tmp_subset<num_toal_variable]))
}

# 交叉验证
trControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)

# 在设置mtry时，可以随机选择，也可以Random generate 15 mtry values with tuneLength = 15
mtry <- generateTestVariableSet(ncol(boruta_train_data))
tuneGrid <- expand.grid(.mtry=mtry)

rf <- caret::train(x = boruta_train_data, 
            y = train_data_group,
            method = 'rf',
            tuneGrid = tuneGrid,
            trControl = trControl,
            metric = 'Accuracy' # metric='Kappa'
            )

print(rf)
plot(rf)

dotPlot(varImp(borutaConfirmed_rf_default))
```


对于mtry亦可通过tuneRF筛选最佳mtry，以下为按照OOBEError进行筛选.
不过似乎不应该这么跑，报warning
```{r}
# dont run this
bestMtry <- randomForest::tuneRF(boruta_train_data,train_data_group, 
                                 stepFactor = 1.5, 
                   improve = 1e-5, ntree = 500)

trControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)

tuneGrid <- expand.grid(.mtry=bestMtry)

rf <- train(x = boruta_train_data, 
            y = train_data_group,
            method = 'rf',
            tuneGrid = tuneGrid,
            trControl = trControl,
            metric = 'Accuracy'
            )

print(rf)
plot(rf)
```

#### 可视化top20特征，并提取final model
```{r}
dotPlot(varImp(rf), top = 20)

rf_finalModel <- rf$finalModel
```

此处有一个小trick，不建议提取finalModel，而是建议直接使用`rf`)去做后续预测。尤其是训练时指定了数据预处理参数时，使用rf会对测试集进行同样的预处理(等同于predict.train)，而使用borutaConfirmed_rf_default$finalModel则不会对测试集进行预处理（等同于predict.randomForest）
```{r}

```


#### 模型评价
```{r}
library(pROC)
predictions_train <- predict(rf_finalModel, newdata=boruta_train_data)
confusionMatrix(predictions_train, train_data_group)

# 在测试集上的数据表现
prediction_prob <- predict(rf_finalModel, 
                           newdata = boruta_test_data, type = 'prob')

predictions <- predict(rf_finalModel, newdata = boruta_test_data)
confusionMatrix(predictions, test_data_group)

```


```{r}
# ROC curve
roc_curve <- roc(test_data_group, prediction_prob[,1])
roc_curve
roc <- roc(test_data_group, factor(predictions, ordered=T))
plot(roc)

best_thresh <- data.frame(coords(roc=roc_curve, x = "best", input="threshold", 
                                 transpose = F, best.method = "youden"))
# youden指数求roc曲线的最佳点
best_thresh$best <- apply(best_thresh, 1, function (x) 
  paste0('threshold: ', x[1], ' (', round(1-x[2],3), ", ", round(x[3],3), ")"))

print(best_thresh)

ROC_data <- data.frame(FPR = 1- roc_curve$specificities, TPR=roc_curve$sensitivities)
ROC_data <- ROC_data[with(ROC_data, order(FPR,TPR)),]

```


手动绘图：
```{r}
p <- ggplot(data=ROC_data, mapping=aes(x=FPR, y=TPR)) +
  geom_step(color="red", size=1, direction = "vh") +
  geom_segment(aes(x=0, xend=1, y=0, yend=1))  + theme_classic() + 
  xlab("False positive rate") + 
  ylab("True positive rate") + coord_fixed(1) + xlim(0,1) + ylim(0,1) +
  annotate('text', x=0.5, y=0.25, label=paste('AUC=', round(roc_curve$auc,2))) +
  geom_point(data=best_thresh, mapping=aes(x=1-specificity, y=sensitivity), color='blue', size=2) + 
  ggrepel::geom_text_repel(data=best_thresh, mapping=aes(x=1.05-specificity, y=sensitivity ,label=best))
p
```


基于选定的最优阈值制作混淆矩阵并评估模型预测准确度显著性,结果显著P-Value [Acc > NIR]<0.05。
```{r}
predict_result <- data.frame(Predict_status=c(T,F), Predict_class=colnames(prediction_prob))

head(predict_result)
predictions2 <- plyr::join(data.frame(Predict_status=prediction_prob[,1] > best_thresh[1,1]), predict_result)

predictions2 <- as.factor(predictions2$Predict_class)

confusionMatrix(predictions2, test_data_group)
```

## use ranger

ranger 包拟合随机森林的笔记：

```{r}
library(ranger)
```