---
title: "SVM_tidymodels"
author: "liuc"
date: '2022-05-26'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## SVM_tidymodels

> https://scikit-learn.org/stable/modules/svm.html#svm-regression
> https://github.com/thebioengineer/TidyX
> https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf


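Support vector machines (SVMs) can handle complex classification, regression, and outlier detection problems. Several hyperplanes may separate a data set correctly; the one with the maximum margin is the optimal solution an SVM looks for. The samples closest to the hyperplane are the support vectors, and the distance from the support vectors to the hyperplane is the margin. When searching for the hyperplane, an SVM only needs to push the points nearest the dividing boundary as far away as possible, so only the support vectors are actually used. The kernel function also plays a central role.
Put simply, an SVM is a binary classifier whose basic model is the maximum-margin linear classifier defined on the feature space; its learning strategy is margin maximization.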

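1. Linearly separable SVM: when the training data are linearly separable, maximizing the hard margin yields a linear classifier, the hard-margin SVM.
2. Linear SVM: when the training data are not linearly separable but approximately so, maximizing the soft margin also yields a linear classifier, the soft-margin SVM.
3. Nonlinear SVM: when the training data are not linearly separable, combining the kernel trick with soft-margin maximization yields a nonlinear SVM.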
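
As a reference for the three cases above, the standard textbook soft-margin primal problem can be written as follows. The notation ($w$, $b$, $\xi_i$, $C$, $\phi$) is introduced here for illustration and does not come from the original notes.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\left(w^\top \phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
$$

Forcing every $\xi_i = 0$ gives the hard-margin case, $\phi(x) = x$ gives the linear case, and a nonlinear $\phi$ (used implicitly through a kernel) gives the nonlinear case. The margin width is $2 / \lVert w\rVert$, so minimizing $\lVert w\rVert^2$ maximizes the margin.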


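In the real world, the original feature space may contain no hyperplane that separates the two classes correctly; that is, the data are often not linearly separable. Two remedies are the soft-margin classifier and the kernel trick. Whereas a hard margin requires every sample to be classified correctly, a soft margin tolerates a limited number of misclassified samples and suits approximately linearly separable data. The kernel trick maps the data points into a higher-dimensional feature space where they become linearly separable, and the kernel function lets the hyperplane in that space be found without computing the mapping explicitly.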
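
The kernel trick can be checked numerically on a tiny case. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D points and its explicit 3-D feature map; `phi` and `poly2_kernel` are illustrative names, not functions from any package.

```{r}
# Explicit degree-2 feature map for a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# Homogeneous polynomial kernel of degree 2, computed in the original 2-D space
poly2_kernel <- function(x, y) sum(x * y)^2

x <- c(1, 2)
y <- c(3, -1)

explicit <- sum(phi(x) * phi(y))   # inner product after mapping to 3-D
implicit <- poly2_kernel(x, y)     # same value, no mapping needed

all.equal(explicit, implicit)
```

Both routes give the same inner product, which is why the optimizer never needs the high-dimensional coordinates themselves.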


_The disadvantages of support vector machines include:_

- If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see "Scores and probabilities" in the scikit-learn user guide).

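*Advantages of support vector machines:*

- Because training an SVM is a convex optimization problem, the solution found is the global optimum rather than a local one.
- They apply to both linear and nonlinear problems (via the kernel trick).
- They work on data with a high-dimensional sample space, because the complexity of the model depends on the support vectors rather than on the dimensionality of the data, which in a sense sidesteps the "curse of dimensionality".
- Their theoretical foundation is comparatively complete (a neural network, by contrast, is more of a black box).

*Disadvantages of support vector machines:*

- Solving the quadratic programming problem involves an m × m matrix (m is the number of samples), so SVMs do not suit very large data sets (the SMO algorithm alleviates this).
- The basic model handles only binary classification (the SVR extension covers regression, and several SVMs can be combined for multi-class problems).
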
_by ChatGPT:_

Support Vector Machines (SVMs) are a popular machine learning algorithm for both classification and regression tasks. Some advantages and disadvantages:

_Advantages:_

- Effective in high-dimensional spaces: SVMs perform well even when the number of features is large.
- Good generalization performance: margin maximization acts as a form of regularization, which makes SVMs suitable for small and medium-sized datasets.
- Versatile: SVMs can be used for both linear and non-linear problems.
- Effective with unstructured and semi-structured data: SVMs can work with data such as text, images, and genomic data.
- Robust: the fitted boundary depends only on the support vectors, so samples far from the margin do not affect it.

_Disadvantages:_

- Computationally intensive: SVMs can be computationally expensive, especially when dealing with large datasets.
- Difficult to interpret: it may not be clear which features contribute most to the decision boundary.
- Sensitive to the choice of kernel function and hyperparameters: both can significantly affect performance, which makes the optimal configuration hard to find.
- Memory-intensive: the kernel matrix grows with the square of the number of samples, so memory requirements can be high for large datasets.
- Can be sensitive to class imbalance, where one class has far more examples than the others.
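
To put the computational and memory costs above in perspective, here is a back-of-the-envelope sketch of the size of a dense kernel matrix stored as 8-byte doubles. `kernel_matrix_gb` is a hypothetical helper, and real solvers such as SMO usually cache only part of the matrix, so this is an upper bound rather than a measurement.

```{r}
# Memory needed for a dense n x n kernel matrix of doubles (8 bytes each), in GB
kernel_matrix_gb <- function(n) 8 * n^2 / 1e9

n_samples <- c(102, 1e4, 1e5)
data.frame(n = n_samples, gb = kernel_matrix_gb(n_samples))
```

At the scale of the 102-sample data set used later the matrix is negligible, while at 100,000 samples it already reaches 80 GB.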


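*Data that suit SVMs:* No data set is an absolute fit for a given machine learning algorithm, but SVMs suit gene expression data and OTU data reasonably well. Such data sets usually have few samples, which matters because SVM run time depends heavily on the number of samples, but many features, which SVMs handle comfortably. SVMs also tolerate outliers that lie far from the decision boundary, since only the support vectors enter the solution.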


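*Important SVM hyperparameters:* `cost` (C, the penalty on the error term) determines how heavily the classifier is penalized for misclassification; `margin` applies to regression only. The choice of kernel matters, and some kernels bring parameters of their own.
For the `poly` kernel there is `degree`, usually between 1 and 10; `svm_linear` often needs only `cost`. The radial kernel, widely used and successful, is available in `tidymodels` as `svm_rbf`, with `cost` and `rbf_sigma` to tune. The recommendation is to search C in the range [2^−5, 2^15] and γ in the range [2^−15, 2^3]. γ defaults to the reciprocal of the number of features (as in LIBSVM). A larger γ lets the model fit the training data more closely but makes over-fitting more likely (weaker generalization). C sets the penalty for misclassified samples and thereby influences the distance between the support vectors and the decision boundary; its default is 1. A larger C raises accuracy on the training data but lowers the tolerance for errors and can hurt generalization; a smaller C generalizes better at the cost of training accuracy.
The hyperparameter options that `tidymodels` exposes look less extensive than those of `sklearn`, and engine-specific hyperparameters passed through `set_engine` still have to be looked up in the engine's own documentation.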
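
The search ranges quoted above can be turned into a coarse grid. A minimal sketch in base R, assuming exponents stepped by 2 as the LIBSVM practical guide suggests for a first pass; the object names are illustrative only.

```{r}
# Exponentially spaced candidates: C = 2^-5, 2^-3, ..., 2^15; gamma = 2^-15, 2^-13, ..., 2^3
cost_values  <- 2^seq(-5, 15, by = 2)
gamma_values <- 2^seq(-15, 3, by = 2)

# Every (cost, gamma) pair to be evaluated by cross-validation
coarse_grid <- expand.grid(cost = cost_values, gamma = gamma_values)

dim(coarse_grid)
```

The guide then recommends a finer grid around the best coarse cell. In `kernlab`, the `sigma` parameter of the RBF kernel plays the role of γ here.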

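_from ChatGPT_

Tuning the hyperparameters of a support vector machine (SVM) is a key step toward the best performance. The general steps are:

1. Choose a suitable kernel function: the kernel is an important SVM hyperparameter. Common kernels are linear, polynomial, radial basis function (RBF), and sigmoid. Choose according to the nature of the data.
2. Set the regularization parameter C: C controls the trade-off between maximizing the margin and minimizing classification errors. A high C gives a narrower margin and may over-fit; a low C gives a wider margin and may under-fit.
3. Set kernel-specific hyperparameters: a non-linear kernel needs additional hyperparameters, for example the degree for a polynomial kernel or gamma for an RBF kernel.
4. Use cross-validation: to find the best hyperparameters, split the data into training and validation sets, try different hyperparameter combinations on the training set, and evaluate each combination on the validation set.
5. Evaluate performance: finally, evaluate the SVM with the best hyperparameters on the test set to confirm that it generalizes well to new data.
6. Libraries for tuning SVM hyperparameters exist in several programming languages, for example `GridSearchCV` or `RandomizedSearchCV` in Python's scikit-learn.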
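
The cross-validation split in step 4 can be sketched in base R as below. The sample size and object names are hypothetical, and the assignment is not stratified by class, unlike `vfold_cv(strata = ...)`.

```{r}
set.seed(1)

n <- 100   # hypothetical number of training samples
v <- 10    # number of folds

# Assign each sample to one of v folds at random, keeping fold sizes balanced
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k, the validation set is fold k and the training set is everything else
k <- 1
validation_idx <- which(fold_id == k)
training_idx   <- which(fold_id != k)

table(fold_id)
```

Each hyperparameter combination is fitted on `training_idx` and scored on `validation_idx`, repeated for every fold, and the scores are averaged.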


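*What does machine learning mean for someone who mostly calls packages?* It means knowing which algorithm suits which data, what the input data must look like, which feature engineering is needed, how each model is interpreted, which hyperparameters deserve attention, how to set their ranges, and whether those ranges should depend on the data set at hand. The rest is optimizing, again and again, while actually using the models.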


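The Python package `sklearn` supports many kinds of SVM: `svm.LinearSVC`, `svm.SVC`, `svm.NuSVC`, `svm.OneClassSVM`, and so on. For classification there are three interfaces, LinearSVC, SVC, and NuSVC, of which SVC is the most commonly used. Apart from the two explicitly linear classes, LinearSVC and LinearSVR, all classes support both linear and nonlinear kernels. NuSVC and NuSVR allow control over the number of support vectors; their other parameters match those of SVC and SVR. OneClassSVM is an unsupervised class. Fit time for `svm.SVC` scales at least quadratically with the number of samples, so for large datasets consider using LinearSVC or SGDClassifier instead.
The biggest difference between `tidymodels` and `sklearn` for SVMs is probably how the kernel is chosen: the former provides a separate model function for each kernel, whereas the latter takes the kernel as an argument.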


*Preprocessing steps before fitting an SVM:*

Data scaling: SVMs are sensitive to the scale of the input data. Features on different scales can cause problems for the optimization, so `scale the input data` so that each feature has a similar range, using normalization, standardization, or scaling to a specific range.

Handling NA values in features: a data set with many features brings a major problem at prediction time, because not every new sample will contain all of them. With gene expression data, for instance, a given gene may well go undetected in a particular run; how should such a sample be classified?

Categorical variables: one-hot encoding.

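SVMs rely on distance calculations, so continuous variables need to be put on a common scale to remove the effect of differing units; normalization and standardization are the most common methods. Because the data are split into training and test sets, a question arises: should standardization use the mean and standard deviation of the whole data set, or of the training set only? Using the whole data set leaks information about the future samples to be predicted into training. A predictive model is applied to unseen data that are unknown when the model is built, and the test set simulates exactly that. To get a sound estimate of model quality and generalization, the scaling parameters (mean and standard deviation) must be computed on the training set alone and then applied unchanged to both the training and test sets.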
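
The rule above, estimating the scaling parameters on the training set and reusing them on the test set, can be sketched in base R. The toy data and column names are made up for illustration.

```{r}
set.seed(42)

# Hypothetical toy data: 8 samples, 2 features on very different scales
x <- data.frame(gene_a = rnorm(8, mean = 100, sd = 20),
                gene_b = rnorm(8, mean = 0.5, sd = 0.1))
train <- x[1:6, ]
test  <- x[7:8, ]

# Estimate the scaling parameters on the training set only
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)

# Apply the same parameters to both sets
train_scaled <- scale(train, center = train_mean, scale = train_sd)
test_scaled  <- scale(test,  center = train_mean, scale = train_sd)

round(colMeans(train_scaled), 10)  # training columns are centred at 0
```

Inside a workflow, `step_normalize()` follows the same logic: its means and standard deviations are estimated when the recipe is prepped on the training data and are then reused for new data.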


### A simple example

```{r, include=FALSE}
library(tidyverse)
library(tidymodels)
library(usemodels)
library(kernlab)

```


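Before building a model, some of the helper tools that `tidymodels` provides can take care of preliminary work. Suppose the plan is to select important variables (genes) in a prostate cancer expression data set and to build an SVM model that separates tumor from adjacent normal tissue. `usemodels` can then generate a basic model skeleton to be adapted to the actual data, although the set of models `usemodels` supports is still limited, and `parsnip::parsnip_addin()` has not been updated for some time.
The `show_engines()` function lists the supported engines.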
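```{r, eval=FALSE}
# Prints template code; run it interactively once `prostat_train` exists
# (it is created in the data-preparation chunk)
usemodels::use_kernlab_svm_rbf(formula = class ~ ., data = prostat_train,
                               prefix = 'kernlab', verbose = FALSE,
                               tune = TRUE, colors = TRUE, clipboard = FALSE
                               )
```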


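Prepare the data: normalized RNA-Seq data from 102 prostate cancer patients, with 52 cancer and 50 normal samples, so the groups are balanced.
The data were downloaded from the web.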
```{r, include=FALSE}
expr_file <- "datasets/prostat.expr.symbol.txt"
metadata_file <- "datasets/prostat.metadata.txt"

expr_mat <- read_delim(expr_file, delim = '\t') %>% 
  janitor::clean_names()
metadata <- read_delim(metadata_file, delim = '\t') %>% 
  janitor::clean_names()

# Variable selection could be added here, e.g. based on `Boruta` results
input_data <- expr_mat %>% column_to_rownames('symbol') %>% t() %>% 
  as.data.frame() %>% rownames_to_column() %>% as_tibble() %>% 
  janitor::clean_names() %>% 
  left_join(metadata, by = c('rowname'='sample')) %>% 
  mutate(class = if_else(class=='tumor', 1, 0)) %>% 
  mutate(class = as_factor(class))


# Alternatively, keep the rowname column and give it an ID role later via `update_role` in the recipe
set.seed(42)
df_split <- initial_split(input_data %>% select(-rowname))

prostat_train <- training(df_split)
prostat_test <- testing(df_split)

# Print the number of cases in each split
cat("Training cases: ", nrow(prostat_train), "\n",
    "Test cases: ", nrow(prostat_test), sep = "")
```


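`svm_linear` is backed by two engines, as `show_engines('svm_linear')` shows: `kernlab` and `LiblineaR`;
`svm_poly` is currently provided by `kernlab`;
`svm_rbf` is currently provided by `kernlab` and `liquidSVM`.

`svm_linear`, `svm_poly`, and `svm_rbf` correspond to kernels in `scikit-learn` as follows: `linear` is the linear kernel, `poly` the polynomial kernel, and `rbf` the radial basis function (Gaussian) kernel; `sklearn` additionally offers `sigmoid`, the sigmoid kernel.
In `sklearn` the `kernel` argument selects the kernel function, each with its own parameters to tune, whereas in `tidymodels` the choice is made by picking one of the model functions above.

`?kernlab::ksvm` lists many kernel choices, but `parsnip` exposes only three.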
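
For orientation, the three kernels can be written out by hand. This is a sketch in base R that assumes the `kernlab` parameterization (`vanilladot`, `polydot`, `rbfdot`); the function names and default values below are illustrative, not the package's own.

```{r}
# Kernel functions written out by hand (kernlab-style parameterization, assumed here)
linear_kernel <- function(x, y) sum(x * y)
poly_kernel   <- function(x, y, degree = 2, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}
rbf_kernel    <- function(x, y, sigma = 0.1) exp(-sigma * sum((x - y)^2))

x <- c(1, 2, 3)
y <- c(1, 0, 3)

linear_kernel(x, y)
poly_kernel(x, y)
rbf_kernel(x, y)
rbf_kernel(x, x)   # a point compared with itself always gives 1
```

In `parsnip`, `rbf_sigma` corresponds to `sigma`, and `degree` and `scale_factor` of `svm_poly` correspond to `degree` and `scale`.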

```{r}
prostat_folds <- vfold_cv(prostat_train, strata = class)


# svm_spec <- svm_linear(mode = "classification") %>% 
#   set_engine('kernlab')

svm_spec <- svm_rbf(mode = 'classification',
                    cost = tune(), rbf_sigma = tune()) %>%
  set_engine('kernlab')


prostat_rec <-
  recipe(class ~ ., data = prostat_train) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())

## just to see how it is working:
# prep(prostat_rec) %>% bake(new_data = NULL) %>% glimpse()

prostat_wf <- workflow(prostat_rec, svm_spec)
prostat_wf
```

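Tuning the hyperparameters:
for the `rbf` kernel, tune `cost` and `rbf_sigma`.

Because no feature selection was done, this step runs very slowly. Instead of a regular grid, consider passing a random grid (`grid_random`) to `tune_grid`, or tuning with Bayesian optimization (`tune_bayes`). The `tune` and `finetune` packages provide further tuning functions.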
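
What a random grid amounts to can be sketched in base R: draw exponents uniformly and back-transform, so candidates spread evenly on a log scale. The ranges are assumed to mirror the `dials` defaults for `cost()` and `rbf_sigma()`; treat them as an assumption and check them against the installed version.

```{r}
set.seed(2022)
n_candidates <- 10

# Assumed ranges: cost in 2^-10..2^5, rbf_sigma in 10^-10..10^0
random_grid <- data.frame(
  cost      = 2^runif(n_candidates, min = -10, max = 5),
  rbf_sigma = 10^runif(n_candidates, min = -10, max = 0)
)

random_grid
```

A data frame like this, with columns named after the tuned parameters, can be passed to `tune_grid()` through its `grid` argument in place of a regular grid.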

```{r}
doParallel::registerDoParallel()

# svm_grid <- grid_max_entropy()

# value ranges for the two parameters
param_grid <- grid_regular(dials::cost(), 
                           rbf_sigma(),
                           levels = 3
                           )

tune_res <- tune_grid(
  prostat_wf, 
  resamples = prostat_folds, 
  grid = param_grid
)

# save(tune_res, file = './datasets/svm_tune_res.rda')
```


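*Bayesian optimization*
`tune_bayes` fits a Gaussian process model to the results collected so far and uses it to propose the next hyperparameter combination.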
```{r, eval=FALSE}
ctrl <- control_bayes(verbose = TRUE) # can also add more arguments, like no_improve
svm_initial <- tune_res
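# parameter set of the workflow, with a narrower search range for rbf_sigma (log10 scale)
svm_param <- 
  prostat_wf %>% 
  extract_parameter_set_dials() %>% 
  update(rbf_sigma = rbf_sigma(c(-7, -1)))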

set.seed(420)

svm_bo <-
  prostat_wf %>%
  tune_bayes(
    resamples = prostat_folds, 
    # metrics = roc_res, 
    initial = svm_initial, # tune_grid object produced earlier
    param_info = svm_param, # specified earlier too, with our new bounds for rbf_sigma
    iter = 25, # maximum number of search iterations
    control = ctrl
  )

show_best(svm_bo)
```

```{r, eval=FALSE}
library(patchwork)  # provides `+` for placing ggplots side by side

p1 <- autoplot(svm_bo, type = "performance")
p2 <- autoplot(svm_bo, type = "parameters")

p1 + p2
```


```{r}
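# Reload cached tuning results if they were saved in a previous run
if (file.exists('./datasets/svm_tune_res.rda')) load('./datasets/svm_tune_res.rda')

# performance across the cost and rbf_sigma grid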
autoplot(tune_res)
```

Select the best hyperparameters and finalize the workflow:

The accuracy reported in the metrics is rather poor.
```{r}
collect_metrics(tune_res)

best_cost <- select_best(tune_res, metric = "roc_auc")

svm_final_wf <- finalize_workflow(prostat_wf, best_cost)
```

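*Obtaining the final model with `last_fit`:*

First fit the final model by hand for comparison with the one `last_fit` produces. The two should be identical, although they turned out to differ when building a random forest.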
```{r}
svm_C_fit <- svm_final_wf %>% fit(prostat_train)
# 
# svm_C_fit %>%
#   extract_fit_engine() %>%
#   plot()

# Performance on the test set ("0" is the first factor level, so `.pred_0` is the event probability)
augment(svm_C_fit,
        new_data = prostat_test) %>%
  roc_curve(truth = class, .pred_0) %>%
  autoplot()
```
*Skipping the evaluation of the model itself is not acceptable; the AUC, discrimination, calibration curves, and so on are all needed:*
```{r}

```
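
As a reminder of what the AUC measures, here is a dependency-free sketch based on the rank-sum (Mann-Whitney) identity. `auc_rank` and the toy values are hypothetical; in the actual workflow `yardstick::roc_auc()` does this job.

```{r}
# AUC via the rank-sum identity: the probability that a randomly chosen
# event scores higher than a randomly chosen non-event
auc_rank <- function(score, is_event) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical predicted probabilities for the event class and the true labels
score    <- c(0.9, 0.8, 0.3, 0.2)
is_event <- c(TRUE, FALSE, TRUE, FALSE)

auc_rank(score, is_event)   # 3 of the 4 event/non-event pairs are ordered correctly: 0.75
```

Discrimination is only one aspect; calibration still needs a separate check, for example by comparing binned predicted probabilities with observed event rates.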


Without tuning the hyperparameters, run 10-fold cross-validation directly:
```{r}
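# Grid search is too slow, so first fit a model with default hyperparameters.
set.seed(123)
prostat_metrics <- metric_set(accuracy, sens, spec)

# `fit_resamples` does no tuning (no grid search); it evaluates one fixed model
# across the resamples, so the spec must not contain `tune()` placeholders
svm_default_wf <- workflow(prostat_rec,
                           svm_rbf(mode = 'classification') %>% set_engine('kernlab'))

prostat_rs <- fit_resamples(svm_default_wf, 
                            resamples = prostat_folds, 
                            metrics = prostat_metrics)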

collect_metrics(prostat_rs)
```


`last_fit()` fits one final time on the training data and evaluates one final time on the testing data.
The results below show decent performance on the test set.
```{r}
# performance of the finalized workflow on the test set
final_rs <- last_fit(svm_final_wf, 
                     df_split, 
                     metrics = prostat_metrics)


collect_metrics(final_rs)
```


Confusion matrix on the test set:
```{r}
collect_predictions(final_rs) %>%
  conf_mat(class, .pred_class)

# ROC curve of the manually fitted model, for comparison with the `last_fit` result
aug_res <- augment(svm_C_fit,
        new_data = prostat_test) 

aug_res %>%
  roc_auc(truth = class, .pred_0)

aug_res %>% 
  roc_curve(truth = class, .pred_0) %>%
  autoplot()
```
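
How accuracy, sensitivity, and specificity follow from a confusion matrix can be sketched with a toy example in base R. The labels and predictions are invented; the first factor level is treated as the event, which mirrors the `yardstick` default.

```{r}
# Hypothetical truth and predictions; levels ordered as in this document ("0" first)
truth <- factor(c(0, 0, 0, 1, 1, 1, 1, 0), levels = c(0, 1))
pred  <- factor(c(0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))

cm <- table(Prediction = pred, Truth = truth)
cm

# Treat the first level ("0") as the event
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # events correctly predicted
specificity <- cm["1", "1"] / sum(cm[, "1"])   # non-events correctly predicted

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With the coding used here (`1` = tumor, `0` = normal), the default makes `sens` describe the normal class; passing `event_level = "second"` to the `yardstick` metrics switches the event to tumor.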


If we decide this model is good to go and we want to use it in the future, we can extract out the fitted workflow. This object can be used for prediction:
The final workflow can serve as the model for later use.
```{r}
# final_fitted is the final fitted model
final_fitted <- extract_workflow(final_rs)

# Check the predictions on a few test-set samples
augment(final_fitted, 
        new_data = slice_sample(prostat_test, n = 3)) %>% 
  dplyr::select(class, starts_with(".pred"))
```


For a linear SVM, which is just a set of coefficients, the model can also be examined to understand what drives its predictions.
`tidy()` supports this only for `svm_linear` with the `LiblineaR` engine; `kernlab` is not supported, so the chunk below does not apply to the RBF model fitted above.

```{r, eval=FALSE}
# Only valid when final_fitted is an `svm_linear` workflow with the LiblineaR engine
tidy(final_fitted) %>%
  slice_max(abs(estimate), n = 20) %>%
  mutate(term = fct_reorder(term, abs(estimate))) %>%
  ggplot(aes(x = abs(estimate), y = term, fill = estimate > 0)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_discrete(labels = c("Negative", "Positive")) +
  labs(x = "Estimate from linear SVM (absolute value)", y = NULL, 
       fill = "Sign of coefficient")
```


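Unlike linear or tree-based models, an SVM does not come with built-in variable importance, but the relative importance of variables can be obtained by permuting them, or assessed with `SHAP` values and similar methods.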
```{r}
library(vip)
```
This step is very slow.
```{r}
set.seed(42)

prostat_prep <- prep(prostat_rec)

prostat_imp <- svm_C_fit %>% 
  extract_fit_parsnip() %>%
  vip::vi(
    method = "permute", nsim = 10,
    target = "class", metric = "auc", reference_class = "0",
    pred_wrapper = kernlab::predict, train = juice(prostat_prep)
  )

prostat_imp %>%
  slice_max(Importance, n = 8) %>%
  mutate(
    # Variable = str_remove(Variable, ""),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(Importance, Variable, color = Variable)) +
  geom_errorbar(aes(xmin = Importance - StDev, xmax = Importance + StDev),
    alpha = 0.5, size = 1.3
  ) +
  geom_point(size = 3) +
  theme(legend.position = "none") +
  tvthemes::scale_color_avatar(palette = "FireNation") +  # palette from the tvthemes package
  labs(y = NULL)
```
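
The idea behind permutation importance can be shown on a small simulated example. This sketch uses logistic regression instead of an SVM only to stay dependency-free; the data, the accuracy metric, and all names are hypothetical.

```{r}
set.seed(42)
n <- 300

# Simulated data: x1 drives the outcome, x2 is pure noise
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(3 * dat$x1))

# Any fitted classifier works here; glm keeps the sketch in base R
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

accuracy <- function(model, data) {
  mean((predict(model, newdata = data, type = "response") > 0.5) == data$y)
}

baseline <- accuracy(fit, dat)

# Importance = average drop in accuracy after shuffling one column
permutation_importance <- function(var, nsim = 10) {
  drops <- replicate(nsim, {
    shuffled <- dat
    shuffled[[var]] <- sample(shuffled[[var]])
    baseline - accuracy(fit, shuffled)
  })
  mean(drops)
}

importance <- sapply(c(x1 = "x1", x2 = "x2"), permutation_importance)
importance
```

Shuffling the informative column costs far more accuracy than shuffling the noise column. With thousands of genes and repeated SVM predictions per gene, the same loop explains why the `vip` call above takes so long.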


### Differences in SHAP-based feature importance between SVM and RandomForest


```{r}

```