Resampling_statistics_bootstrapping.qmd

---
title: "Resampling_statistics_bootstrapping"
format: html
---


## Resampling & bootstrapping


传统假设检验总是假设数据有一个具体的分布，既是参数类的分析方法，而在不符合这些条件的时候，重采样和bootstrap方法就比较适用。对于样本量少、有outlier等情况，permutation和bootstrap方法比较robust，computer-intensive。


Permutation其基本逻辑是：比如在两样本独立ttest中，传统的假设检验会假设两组间的均值相等（更确切的说是两组的差值和总体为0的值相等），并基于这个假设计算出对应的t值，进而和理论分布进行比较得出对应的P值等等。而permutation在计算完抽样数据的t统计值后，会把两组的样本混合在一起，randomly的再把数据分成两组然后计算t-statistic，重复此步只到完成所有的组合。再然后对所有的t-statistic值排序，并得到一个经验分布，然后检验t0第一个t-statistic是否落在95%的区间。*其的一个假设便是如果两种处理方式真的等价，那么测量到A药物或B药物的效果值便是可以随机分配到2组中。*


Permutation其并不是像bootstrap那样进行有放回的重采样，只是通过对样本数据进行组合，把同一个分析做个1000遍，其是把数据re-randomization tests的， 然后通过得到的empirical distribution进行判断。其和传统的参数检验类的方法更为接近，


### permutation, 基于排列的方法

“Permutation tests, also called randomization or re-randomization tests, have been around for decades, but it took the advent of high-speed computers to make them practically available.”


Permutations的一般思路为：
- 对于特定的检验目的，计算样本统计量，比如t统计值
- 然后把样本数据混合在一起，随机进行分配并计算新数据的统计值，比如两独立样本均值的比较，再重新将数据分为两组
- 如此反复，计算所有的组合，并得到全部数据组合的统计量
- 将该统计量进行排序，得到经验分布
- 然后类似参数检验的思路，如果t0落在经验分布中间95%部分的外面，则在0.05的显著性水平下，拒绝null假设


*适用的条件：*
似乎没有什么特别的限制，不用考虑数据的分布、样本量大小、甚至Outlier，直接对数据进行分析即可。
不过独立性假设、同分布假设、以及随机化假设（Permutationtests的核心思想是通过随机重排样本观测值来生成随机分布。因此，假设样本间没有特定的顺序或结构。）还是要准守的。

那么对于传统的参数检验而言，二者间的统计功效谁更可靠一些呢，或者说如果二者的结果不一致，倾向于谁的结论呢？似乎会成为一个仁者见仁的回答。
Permutation tests是一种非参数检验方法，它不依赖于总体参数的假设。在某些情况下，传统的参数检验可能对数据的分布做出了不合适的假设，导致结果的不准确性。
Permutation tests则不对数据分布做出任何假设，因此更加灵活。尤其在样本量较小或数据具有非正态分布的情况下，Permutation tests可能具有更高的统计功效。在实际应用中，我们建议根据具体情况选择合适的假设检验方法，并进行敏感性分析，以确保结果的可靠性。


理论分布(theoretical distribution)是指在统计学中，用于描述特定统计量的概率分布。它是基于某种假设或模型的理论推导得出的分布。理论分布的重要性在于它们提供了对统计量进行推断和假设检验的基础。通过与理论分布进行比较，我们可以计算出统计量的概率或p值，进而进行推断和假设检验。


```{r, include=FALSE}
library(easystats)

# cion 包提供一系列的函数可以代替传统的test的函数
library(coin) # Conditional Inference Procedures in a Permutation Test Framework
library(lmPerm) # provides a permutation approach to linear models, including regression and analysis of variance
```

cion包的一些重点：“If distribution="exact", the distribution under the null hypothesis is computed exactly (that is, from all possible permutations). The distribution can also be approximated by its asymptotic distribution (distribution="asymptotic") or via Monte Carlo resampling (distribution="approximate(nresample=n)"), where n indicates the number of random replications used to approximate the exact distribution. The default is 10,000 replications. At present, distribution="exact" is only available for two-sample problems.”

Excerpt From
R in Action, Third Edition
Robert I. Kabacoff

常见的经典统计中的函数其都有对应的函数，使用起来相对还是很方便的。

```{r}
set.seed(42)

# one-way permutation test
# 在下面书中的例子中，如果传统的ttest结果和permutation结果有出入，选择相信permutation的结果

score <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
treatment <- factor(c(rep("A", 5), rep("B", 5)))
mydata <- data.frame(treatment, score)

t1 <- t.test(score ~ treatment, data = mydata, var.equal = TRUE)
t1

# t1的结果中P值为0.047，对于习惯于0.05阈值的人而言，总是感觉在缺少些说服力
```

```{r}
report(t1)
```


两样本检验：
```{r}
coin::oneway_test(score ~ treatment, data = mydata, distribution = "exact")
```
*Interpret：*Z值为-1.9147，对于标准正态分布在1.96处，其小于概率0.05，所以可以推测其P值大于0.05. 这里并没有实质的计算T统计值，也就是说其没有考虑样本量自由度的问题。


下面是一个多分组的问题：
```{r}
# ANOVA 分析
data(cholesterol, package = 'multcomp')

ff <- coin::oneway_test(response ~ trt,
  data = cholesterol,
  distribution = approximate(nresample = 9999)
)

ff
```

对应的参数检验为：
```{r}
# normal ANOVA

ff2 <- aov(formula = response ~ trt, data = cholesterol, qr = TRUE)

ff2 |> summary()
```
*interpret：*二者间的统计值计算的并不一样，卡方是一个


```{r}
# 对于coin::oneway_test 的 ANOVA 多分组如何进行post-hoc 分析
# 用一些非参数的方法
posthoc_result <- pairwise.wilcox.test(response ~ trt, data = cholesterol, 
                                p.adjust.method = "holm")

# 输出pairwise比较结果
print(posthoc_result)

```


*`lmPerm`包的使用*

provides a permutation approach to linear models, including regression and analysis of variance.
In particular, the lmp() and aovp() functions are the lm() and aov() functions modified to perform permutation tests rather than normal theory tests. The parameters in the lmp() and aovp() functions are similar to those in the lm() and aov() functions, with the addition of a perm= parameter. The perm= option can take the value Exact, Prob, or SPR.


*类似于lm*
```{r}
fit1 <- lmp(weight ~ height, data = women, perm = "Prob")

summary(fit1)
```

*interpret: *和`lm`输入的结果近乎一致


```{r}
# 并不支持
report(fit1)
```


倒是有对应的methods：
```{r}
coefficients(fit1)

confint(fit1)
```


*ANOVA的示例*
```{r}
fit2 <- lmPerm::aovp(weight ~ gesttime + dose, data = litter, perm = "Prob")

summary(fit2)
```


```{r}
anova(fit2)
```


### bootstrap

permutation 主要以提供pvalue以对null假设做判断，其很难给出CI。这在pvalue日益收到批评的研究氛围内愈发显得不太合适，bootstrap在这一点上具有相对的优势。

Bootstrapping generates an empirical distribution of a test statistic or set of test statistics by repeated random sampling with replacement from the original sample. It allows you to generate confidence intervals and test statistical hypotheses without having to assume a specific underlying theoretical distribution.

上文的介绍可以看到，和permutation相比，二者都不用考虑一个理论分布的存在。


In general, bootstrapping involves three main steps:

1. Write a function that returns the statistic or statistics of interest. If there is a single statistic (for example, a median), the function should return a number. If there is a set of statistics (for example, a set of regression coefficients), the function should return a vector.
2. Process this function through the boot() function to generate R bootstrap replications of the statistic(s).
3. Use the boot.ci() function to obtain confidence intervals for the statistic(s) generated in step 2.


`boot::boot`函数的主要用法为：

```{r}
library(boot)
library(bootstrap)
```


```{r}
set.seed(42)

# 得到R2的bootstrap值
rsq <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  return(summary(fit)$r.square)
}

results <- boot(
  data = mtcars, statistic = rsq,
  R = 1000, formula = mpg ~ wt + disp
)
```


```{r}
print(results)

plot(results)
```


可利用百分位法和偏差矫正百分位法进行Bootstrap区间估计。
```{r}
boot.ci(results, type = c("perc", "bca"))
```


`boot`函数的一贯用法，首先对所要bootstrap的系数写一个返回的函数，然后进行bootstrap，bootstrap可以简单的理解为对一批数据反复有放回的采样，得到迭代次数的系数。

```{r}
bs <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  return(coef(fit))
}


results2 <- boot(
  data = mtcars, statistic = bs,
  R = 1000, formula = mpg ~ wt + disp
)

```


对于多个统计指标的boot，在plot和boot.ci中

```{r}
print(results2)

plot(results2, index = 1)
```

```{r}
boot.ci(results2, type="bca", index=3)
```


### infer package

在完成上述内容的笔记后，我们可以看待下`tidymodels`中的`infer`包的使用，其是一个以randomization-based statistical inference。
文档附件中的modern statistic值得一读，下一步打算分享这本书的笔记。

```{r, include=FALSE}
# library(tidymodels)
library(infer)
```


首先和`coin`包分析的两样本独立ttest检验进行对比：

```{r}
coin::oneway_test(score ~ treatment, data = mydata, distribution = "exact")
```

以permutation的方法计算同样数据：
```{r}
# calculate the observed statistic
# 计算的啥统计值？
observed_statistic <- mydata %>%
  specify(score ~ treatment) %>%
  calculate(stat = "diff in means", order = c("A", "B"))

observed_statistic

# generate the null distribution with randomization
null_dist_2_sample <- mydata %>%
  specify(score ~ treatment) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("A", "B"))


null_dist_2_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")
```
*如图所示：*在用随机组合的数据生成多个统计值后，按照顺序排列其的分布。


```{r}
t_hat <- mydata %>% 
  observe(score ~ treatment,
          stat = "t", order = c("A", "B"))

t_hat

null_dist <- mydata %>%
  specify(score ~ treatment) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))

visualize(null_dist) +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")
```


```{r}

null_dist_theory <- mydata %>%
  specify(score ~ treatment) %>%
  assume("t")


visualize(null_dist_theory) +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")
```


```{r}
null_dist_2_sample |>
  get_p_value(observed_statistic,
              direction = "two-sided")
```


*利用随机重采样：*
下面的例子似乎有问题。
```{r}
null_dist_2_sample2 <- mydata %>%
  specify(score ~ treatment) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("A", "B"))


null_dist_2_sample2 %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")
```


```{r}
null_dist_2_sample2 |>
  infer::get_ci()
```


*bootstrap的用法*
```{r}
mydata <- transform(mydata, treatment = factor(treatment))

p_value <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- t.test(formula, data = d, var.equal = TRUE)
  return(fit$p.value)
}


res1 <- boot::boot(
  data = mydata, 
  statistic = p_value,
  R = 1000, 
  formula = score ~ treatment
)

print(res1)
```