ANOVA.qmd

---
title: "ANOVA"
format: html
---

## 方差分析

> R in action
> https://stats.stackexchange.com/questions/60108/how-to-write-the-error-term-in-repeated-measures-anova-in-r-errorsubject-vs-e
> https://stats.stackexchange.com/questions/286280/repeated-measures-anova-vs-factorial-anova-with-subject-factor-understanding


方差(variance)指的是，一组数字和其均值间的度量，描述的是一个变量的离散程度，是随机变量与其总体均值或样本均值的离差的平方的期望值。方差是标准差的平方、分布的二阶矩，以及随机变量与其自身的协方差。

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

在方差分析中，我们比较两个方差：组内方差 (within-group variance) 和组间方差 (between-group variance)。组内方差指的是每个组内数据点与其组均值之间的差异。组间方差指的是不同组的均值之间的差异。这两个方差的比值（如 F 比）用来确定不同组均值之间的差异是否显著。
如果组间方差较大，同时组内方差较小，说明不同组之间的均值差异较大，而组内数据点与各自组均值之间的差异较小。在这种情况下，我们认为不同组的均值具有显著差异。
如果组间方差较小，同时组内方差较大，说明不同组之间的均值差异较小，而组内数据点与各自组均值之间的差异较大。在这种情况下，我们认为不同组的均值差异不显著。
因此，尽管方差分析似乎只是比较了不同组间的方差大小，但通过比较组间和组内方差，我们可以得出关于不同组间均值差异显著性的结论。


*方差分析(analysis of variance)的实验设计：*
- between group
- within group
- one-way/two-way, et.al...
- balanced/unbalanced
- mixed-model ANOVA: When a factorial design includes both between-groups and within-groups factors

在引入了多个因子变量的情况下，interaction是一定需要设计的吗？three-way，four-way呢？这种交互情况就复杂了。

一些基本概念：
one-way ANOVA: 一个因子自变量，一个因变量
混淆变量(confounding factor)：即影响自变量，也影响因变量
ANCOVA: covariate协变量一般是一个连续变量，在对感兴趣因子进行分析前，adjust for 协变量的影响
MANOVA/MANCOVA: 因变量多于两个


一些公式：
One-way ANOVA: $y ~ A$
One-way ANCOVA with 1 covariate: $y ~ x + A$ 注意顺序
two-way factorial ANOVA: $y ~ A * B$
Randomized block: $y ~ B + A$(B is a blocking factor) 注意顺序

For example, in a two-way ANOVA with unequal numbers of observations in the treatment combinations, the model `y ~ A*B` will not produce the same results as the model `y ~ B*A`. R默认采用TypeI（sequential），还有Type II (hierarchical)、Type III (marginal)，SAS and SPSS employ the Type III approach by default. 如果是均衡设计，3种类型的方差分析没有差别。所以很多时候会选择typeIII的结果。


*Inappropriately Designating a Factor as Fixed or Random*
In Analysis of Variance and some other methodologies, there are two types of factors: fixed effect and random effect. Which type is appropriate depends on the context of the problem, the questions of interest, and how the data is gathered. Here are the differences:
_Fixed effect factor:_ Data has been gathered from all the levels of the factor that are of interest.

Example: The purpose of an experiment is to compare the effects of three specific dosages of a drug on the response. "Dosage" is the factor; the three specific dosages in the experiment are the levels; there is no intent to say anything about other dosages.

_Random effect factor:_ The factor has many possible levels, interest is in all possible levels, but only a random sample of levels is included in the data.1

Example: A large manufacturer of widgets is interested in studying the effect of machine operator on the quality final product. The researcher selects a random sample of operators from the large number of operators at the various facilities that manufacture the widgets. The factor is "operator." The analysis will not estimate the effect of each of the operators in the sample, but will instead estimate the variability attributable to the factor "operator".


在本笔记中，重点关注医学统计中的一些用法示例，其中会涉及到医学统计中常用到的概念，不过多也是以上描述的一些不同命名方式罢了。比如随机区组设计资料的方差分析，就是指的two-way ANOVA中的Randomized block。


```{r, include=FALSE}

library(tidyverse)
library(easystats)

library(car)
library(multcomp)
library(statsExpressions)
library(PMCMRplus) # 事后检验
```


### stats 包 aov, oneway.test函数

如上文所述，`aov`函数提供TypeI for partitioning the variance in y among the effects on the right side of this equation.
其是包装的lm函数，不过其的输出更符合常规ANOVA的结构。

同时需要特别注意，`aov`函数需要样本是balance的。
The order in which the effects appear in a formula matters when (a) there’s more than one factor and the design is unbalanced or (b) covariates are present.


#### 完全随机设计资料的方差分析 one-way ANOVA

one-way ANOVA，其假设条件为因变量正态分布、独立、各组间方差相等、线性。

假设检验为：H0 五个分组之间均值相等；H1，5个分组之间至少有一个分组和其他不同。

```{r}
data(cholesterol, package="multcomp")
```


描述性统计，以及数据检查：
```{r}

ggplot(data = cholesterol, aes(x = trt, y = response)) +
  geom_boxplot() +
  theme_abyss()
```


```{r}
fit1 <- aov(response ~ trt, data=cholesterol)

summary(fit1)

report::report_model(fit1)
```

```{r}
print(model.tables(fit1,"means"),digits=3)
```


Type III anova:

```{r}
car::Anova(fit1, type = 'III')
```


*模型检验：*

```{r}
# 和线性模型一样
plot(fit1) # 输出四幅模型诊断的图片

```


```{r}
car::qqPlot(fit1, simulate = T, main = 'QQ Plot')
```


```{r}
performance::check_model(fit1)

# 除了满足模型假设外，还需要控制outlier
car::outlierTest(fit1)
```


对于one-way ANOVA其效应值为eta^2^
```{r}
effectsize::effectsize(fit1)
```


```{r}
ggeffects::ggeffect(fit1) |> plot()
```


*post-hoc*分析：

One-way 方差分析的结果表明五个分组间其均值不等，还需要事后检验分析哪二者间不同，一般采用TukeyHSD方法，此方法和emmeans包的paires有何异同呢？

```{r}
(stats::TukeyHSD(fit1)) |> plot()
```

```{r}
emmeans::emmeans(fit1, specs = 'trt') |> pairs()
```


利用`multcomp`包进行多重检验，此处因为是笔记的缘故，故会多个包实现一个目的：

```{r}
tuk <- multcomp::glht(fit, linfct=mcp(trt="Tukey"))

summary(tuk)

labels1 <- cld(tuk, level=.05)$mcletters$Letters
```


*手动计算方差分析的结果：*
```{r}
# 首先是全部样本的离均差平方和和方差
var(cholesterol$response) * (50 -1)

sum((cholesterol$response - mean(cholesterol$response))^2)


# 然后是组间的方差
# 以下是一个组内均衡设计的例子
foo <- aggregate(cholesterol$response, by = list(cholesterol$trt), FUN = 'mean')
var(foo$x) * 4 * 10


# 然后是组内的方差
# 组内方差

```

```{r}
# 计算组均值
group_means <- cholesterol %>% group_by(trt) %>% summarise(mean_Y = mean(response))

# 计算总均值
overall_mean <- mean(cholesterol$response)

# 计算总样本数量
n_total <- nrow(cholesterol)

# 计算组数量
n_groups <- nrow(group_means)

# 计算组内样本数量
n_obs_per_group <- (cholesterol %>% group_by(trt) %>% summarise(n_obs = n()))$n_obs

# 计算组间方差
between_group_variance <- sum(n_obs_per_group * (group_means$mean_Y - overall_mean)^2) / (n_groups - 1)

# 计算离均差平方和（总平方和）
total_ss <- sum((cholesterol$response - overall_mean)^2)

# 计算组内方差
within_group_variance <- (total_ss - between_group_variance * (n_groups - 1)) / (n_total - 1 - n_groups)


# 计算F值
F_value = between_group_variance / within_group_variance


# 通过F分布得到对应的P值, 查F分布临界值
F_crit <- qf(0.95, (n_groups - 1), (n_total - 1 - n_groups))

# 判断显著性
if(F_value > F_crit) {
  print("Group effect is significant") 
} else {
  print("Group effect is not significant")
}
```


#### One-way ANCOVA

在ANOVA的基础上加一个定量的协变量。协变量在`aov`函数中放在公式的前面。不过在`car::Anova`直接包裹时是不用考虑顺序的，因为其所提供的是typeII和typeIII anova。


```{r}
str(litter)
```


```{r}
# gesttime 是连续性变量
# dose 为分类变量，也是我们主要关注的变量
fit2 <- aov(weight ~ gesttime + dose, data=litter)

summary(fit2)
```

模型假设：
线性模型LINE的要求，还要求每个分组中的regression slope保持一致，一般通过引入交互项进行检验，如果交互项有意义， 一般就是不满足条件。如果条件不满足的话，可以采用`sm::ancova`函数进行，其提供非参数方法。

```{r}
aov(weight ~ gesttime*dose, data=litter) |> summary()
```


```{r}
d_p <- broom::augment(fit2)

ggplot(
  data = d_p,
  aes(gesttime, weight)
) +
  geom_point() +
  facet_wrap(~dose, nrow = 1) +
  geom_line(aes(y = .fitted)) +
  labs(title = "ANCOVA for weight by gesttime and dose") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
```


*post-hoc analysis*

因为有一个协变量，分组均值应该用adjusted后的，既是the group means obtained after partialing out the effects of the covariate. 这里的事后检验通常包括矫正后的均值和contrasts。

```{r}
# 均值
ggeffects::ggeffect(fit2, terms = 'dose')

# same as 
# modelbased::estimate_means(fit2)

litter %>% group_by(dose) %>% summarise(m = mean(weight))
```


contrasts是指在矫正gesttime后，不同dose间weight间的均值差异。`emmeans`提供的便是这种结果。
这和`multcomp`提供的有啥差别呢？

```{r}
# contrsts

modelbased::estimate_contrasts(fit2,
                               method = 'trt.vs.ctrl'
                               )
```


*多重检验：*

The contrast c(3, -1, -1, -1) specifies a comparison of the first group with the average of the other three. 

```{r}
# Deviation coding?
contrast <- rbind("no drug vs. drug" = c(3, -1, -1, -1))

summary(multcomp::glht(fit2, linfct=mcp(dose=contrast)))
```


*Assumptions:*

除了正态性和等方差外，ANCOVA还假设homogeneity of regression slopes. 如果不满足此假设的话，可以采用`sm::sm.ancova`进行分析

```{r}
fit2_2 <- aov(weight ~ gesttime * dose, data=litter)

summary(fit2_2)
```
`gesttime:dose` 无统计学意义，证实了等斜率假设。


#### 随机区组设计资料的方差分析 two-way factorial ANOVA

two-way factorial ANOVA 分析，两个具有交互作用的分类变量和一个响应变量：

supp和dose没有nest效应。都是between间的差别。同一个患者测了两个不同的指标。但每一行都是一个患者。


```{r}
ToothGrowth$dose <- factor(ToothGrowth$dose)

ToothGrowth %>% janitor::tabyl(supp, dose)

# 均衡设计, 此数据集的两个随机分组每个组合都有10例观测

str(ToothGrowth)
```


```{r}
fit3 <- aov(len ~ supp*dose, data=ToothGrowth)

summary(fit3)
```

```{r}
gplots::plotmeans(len ~ interaction(dose, supp, sep = ' : '), data=ToothGrowth)
```

```{r}
interaction.plot()
```


Randomized block(随机区组)的实验设计，其数据格式和多水平模型的数据格式相似。比如group为水平1单位、block为水平2单位。

但每一行为一个患者。block怎么理解。

```{r}
weight <- c(0.82,0.65,0.51,0.73,0.54,0.23,0.43,0.34,0.28,0.41,0.21,
            0.31,0.68,0.43,0.24)
block <- c(rep(c("1","2","3","4","5"),each=3))
group <- c(rep(c("A","B","C"),5))

data4_4 <- data.frame(weight,block,group)

head(data4_4)

data4_4 %>% janitor::tabyl(block, group)
```


```{r}
fit4 <- aov(weight ~ block + group,data = data4_4) #随机区组设计方差分析，注意数据均衡与否，以及term的顺序。

summary(fit4)
```


拉丁方设计方差分析

```{r}
psize <- c(87,75,81,75,84,66,73,81,87,85,64,79,73,73,74,78,73,77,77,68,69,74,76,73,
           64,64,72,76,70,81,75,77,82,61,82,61)
drug <- c("C","B","E","D","A","F","B","A","D","C","F","E","F","E","B","A","D","C",
          "A","F","C","B","E","D","D","C","F","E","B","A","E","D","A","F","C","B")
col_block <- c(rep(1:6,6))
row_block <- c(rep(1:6,each=6))
mydata <- data.frame(psize,drug,col_block,row_block)
mydata$col_block <- factor(mydata$col_block)
mydata$row_block <- factor(mydata$row_block)
str(mydata)
```

```{r}
xtabs(~ drug + col_block + row_block, data = mydata) |> ftable()
```

```{r}
fit5 <- aov(psize ~ drug + row_block + col_block, data = mydata)

summary(fit5)
```


#### 两阶段交叉设计资料方差分析

这里的数据可以理解为一个患者经过了两次处理。和上述的随机区组设计资料都属于多因素方差分析，对多个因素进行配比，但各组间的差异仍然会影响实验的结果。基于“最好的配对就是自身”这一理念，交叉设计采用自身配对的思想，可以很好的解决这一问题。

交叉设计的基本思路为（以2×2交叉设计为例）：首先将研究对象随机分为两组，第1组先接受A处理，经过一段洗脱期（washout period）消除A处理的影响后，再接受B处理，即第1组的实验顺序为AB。第2组先接受B处理，经过一段洗脱期消除B处理的影响后，再接受A处理，即第2组的实验顺序为BA。两种处理在实验过程中交叉进行，所以称为交叉设计。由于交叉设计是对同一人群分别实施多种处理，因此除各处理组的均值和标准差外，还需要考虑同一人群内的相关性。总的来说，两组均值差别越大，两组合并标准差越小，所需样本量越小；前后测量之间的相关性越大，所需样本量越小，反之所需样本量越大。


```{r}
# 怎么感觉这个示例有点问题
contain <- c(760,770,860,855,568,602,780,800,960,958,940,952,635,650,440,450,
             528,530,800,803)
phase <- rep(c("phase_1","phase_2"),10)
type <- c("A","B","B","A","A","B","A","B","B","A","B","A","A","B","B","A",
          "A","B","B","A")
testid <- rep(1:10,each=2)
mydata <- data.frame(testid,phase,type,contain)

str(mydata)
```


```{r}
fit <- aov(contain~phase+type+testid,mydata)

summary(fit)
```


#### Repeated measures ANOVA

对于常见的组间组内设计而言，两个因子间一开始纳入交互项是较为合理的。

重复测量方差分析，不可以有NA值，既是在同一个分组组合中不能有数据缺失。

Unfortunately, there are no non-parametric alternatives to the two-way and the three-way repeated measures ANOVA. one-wat 具有freedman 检验的，其为非参数检验的形式。

- No significant outliers(outlier 对方差等分析的影响确实很大)
- Normality
- Assumption of sphericity
- 还有一点很重要aov需要balanced样本


```{r}
# 一个很重要的数据格式问题
# conc是否需要调整成factor类型，factor类型的实质为整数;
# 那么调整成character呢？结果和number格式的conc也是不一样的，其没有Error:Within，但是和factor的结果一致
# 经过一番资料查询，认为是有必要将水平2单位，往往是时间/浓度，整理成因子的。
w1b1 <- subset(CO2, Treatment=='chilled') %>% 
  mutate(conc = as.factor(conc))

str(w1b1)
```


```{r}
# Plant 既是Subject
# This will split the error into subject error and interaction error. 
# subject/time is notation for time nested in subject, and so expands to two parts, subject and the subject:time interaction.
# 也就是conc这一水平1单位，嵌在Plant这一水平2单位里面
fit6 <- aov(uptake ~ conc*Type + Error(Plant/conc), w1b1)

summary(fit6)
```

*interpret*: In one way repeated measures anova, the residuals splits into subject and interaction residuals as compared to one way anova. This decomposition of error term considerably reduce the residual source of variance for the calculation of F value.


*They are equivalent only when there is exactly 1 measurement per each subject/time combination. If there are more measurements per subject/time combination (either simply because measurements were repeated, or because there is another within-subject factor B, in addition to time), then Error(subject) and Error(subject/time) will yield different F- and p-values for time.*
```{r}
# 虽则fit7和fit6之间的Error不同，但是其summary的结果是一致的
# 这种情况只在Plant和conc的组合只有一个测量值时一致。
fit7 <- aov(uptake ~ conc*Type + Error(Plant), w1b1) 

summary(fit7)

```

```{r}
summary(aov(uptake ~ conc+Type + Error(Plant), w1b1))
```


```{r}
CO2 %>%
  group_by(conc, Type) %>%
  summarise(mean_conc = mean(uptake)) %>%
  ggplot(., aes(
    x = conc, y = mean_conc,
    group = Type, color = Type, linetype = Type
  )) +
  geom_point(linewidth = 2) +
  geom_line(linewidth = 1) +
  theme_bw() +
  theme(legend.position = "top") +
  labs(
    x = "Concentration", y = "Mean Uptake",
    title = "Interaction Plot for Plant Type and Concentration"
  )
```

*模型检验：*
除了线性模型的要求外，还需要球形检验。
it assumes that the variances of the differences between any two levels of the within-groups factor are equal.

```{r}
# 尚不支持
# performance::check_sphericity(fit6)

car::Anova()
```


*post-hoc*:

```{r}
modelbased::estimate_contrasts(fit6)
```

两两对比并矫正P值：
```{r}

```


*one-way repeated measure*

*考虑这样一种情况，只有conc这一分组变量：* 
对于这种数据，相当于每一个患者，在不同时间接受了某种检测，现在查看不同时间的差别，
从数据的角度来看，其似乎就是一个one-way anova，但每一行不是一个独立的case，而是一个重复测量数据。
也即是one-way repeated measure：
```{r}
# 奇怪的问题，conc应该设置为factor还是numberic呢，设置成因子的时候结果似乎有所不同
fit8 <- aov(uptake ~ conc + Error(Plant/conc), w1b1)

summary(fit8)

```
*interpret: *`Error: Within` 为conc的效应，似乎不同浓度间没有差别。`Error(id/time)` is used to divide the error variance into 4 different clusters, which therefore takes into account of the repeated measures. 


```{r}
# 此结果和上述的结果一致
summary(aov(uptake ~ conc + Error(Plant), w1b1))
```


对于one-way repeated measure的事后检验，
```{r}
report(fit8)
```

```{r}
# 很烦
car::outlierTest(fit8)
```


```{r}
# 利用
res <- nlme::lme(uptake ~ conc, random = ~1 | Plant, data = w1b1)

# 查看分析结果
summary(res)
```

```{r}
# 进行事后检验
post_hoc <- glht(res, linfct = mcp(conc = "Tukey"))

# 显示事后检验结果
summary(post_hoc)
```


#### Multivariate analysis of variance (MANOVA)

在统计学中，多元方差分析 (MANOVA, multivariate analysis of variance) 是一种对多个分组中检测了多个指标变量 (这里的变量等同于上面的指标；如每个样本中每个物种的丰度信息、每个样本中每个基因的表达信息)的样本整体均值的检验方法  。作为一个多变量过程，它在有两个或多个因变量时使用，并且通常会分别涉及各个因变量的显着性检验。
用于检测多个因变量之间是否存在群体间总体平均数向量的差异。MANOVA要满足正态性、同质性矩阵、独立性等统计假设。MANOVA适用于因变量间相关的多元实验数据。

*模型假设：*
The two assumptions underlying a one-way MANOVA are multivariate normality and homogeneity of variance-covariance matrices. 

one-way MANOVA，同时纳入多个因变量的意思有种显式变量推导隐变量的感觉。三个因变量都属于营养测量这一大的范畴。

在微生物OTU表格数据中，所有的OTU的丰度值作为因变量，其他的自变量因素如分组变量说明，在这些不同的分组中因变量的不同可以检验因变量之间的关系是否受自变量的影响。当因变量之间存在相关时,MANOVA比多个ANOVA更有力量。


```{r}
# 测试数据
data(UScereal, package="MASS")

str(UScereal)
```

```{r}
# 自变量
shelf <- factor(UScereal$shelf)

# 因变量
y <- cbind(UScereal$calories, UScereal$fat, UScereal$sugars)
colnames(y) <- c("calories", "fat", "sugars")
```


一个简单的示例，只有一个自变量：
```{r}
fit8 <- stats::manova(y ~ shelf)

summary(fit8, test = 'Pillai') # one of “Pillai”, “Wilks”, “Hotelling-Lawley”, “Roy”
```
*interpret:* The significant F value indicates that 分类变量shelf在营养测量上具有差异。也就是多个因变量在整体上具有差异。
Pillai/Wilks(Wilks' lambda、Pillai's trace、Hotelling-Lawley trace等统计量)等为其的统计量。Pillai’s trace系数为0.93869，是MANOVA分析中一个衡量组间差异的指标，其取值范围为0到1，越接近1表示组间差异越大。


Because the multivariate test is significant, you can use the summary.aov() function to obtain the univariate one-way ANOVAs.
```{r}
# 得到每一个因变量的检验值
summary.aov(fit8)
```


*模型检验：*

```{r}
# multivariate normal distribution
check_normality(fit8)

# homogeneity of variance-covariance matrices


```


```{r}
mvnfast::
```


用`car`包做的分析：
```{r}
# same as Manova
fit9 <- car::Manova(lm(y ~ shelf))

# 行Mauchly球形检验,检验方差-协方差矩阵的同质性
mauchly.test(cbind(Y1, Y2, Y3) ~ A*B, data)

# 自定义线性假设检验
LinearHypothesis(model, L = c("A1 - A2 = 0"))
```


*post-hoc*:

```{r}
stats::TukeyHSD()
```


##### *Robust MANOVA*

当假设条件不怎么满足的时候，可以考虑采用robust version of the one-way MANOVA.
```{r}
# robust
# 运行的很慢
rrcov::Wilks.test(y, shelf, method="mcd")

```
*interpret: *


##### *from vegan package: *

在很多生物、生态和环境数据集中，多元方差分析的前提假设通常难以满足。

一些鲁棒性更强、对数据分布依赖更少的检验方法被提出来并且获得广泛应用，如ANOSIM (analysis of similarities), PERMANOVA (permutational multivariate analysis of variance) (也称为NPMANOVA, non-parametric MNOAVA), 和Mantel test。这些方法都通过一个样本间的距离矩阵或相似性矩阵构建ANOVA分析类似的统计量，然后对每组的观测结果进行随机置换来计算显著性P-value。对于单因素分析，对数据唯一的假设条件就是观察指标数据存在可置换性 (exchangeability)。

`PERMANOVA`利用距离矩阵(如欧式距离、Bray-Curtis距离)对总方差进行分解，分析不同分组因素或不同环境因子对样品差异的解释度，并使用置换检验对各个变量解释的统计学意义进行显著性分析。目的是检测不同分组的响应变量如菌群构成是否有显著差异。原始假设 (null hypothesis)是每组样本在其检测指标构成的检测空间中的中心点 (centroid)和离散度dispersion无差别。


*在做PCoA、NMDS的时候通常会采用这些方法来进行统计检验，其主要是*


```{r}
library(vegan)

data(dune)
data("dune.env")
```


```{r}
# or a nonparametric
# 此函数在16S测序中经常用到，可参考对应笔记
# 基于bray-curtis距离进行计算
dune.div <- vegan::adonis2(dune ~ Management, 
                           data = dune.env, 
                           permutations = 999, method="bray")

dune.div
```

PERMANOVA采用数据置换的方式计算pseudo F-值的统计显著性，比较随机置换数据获得的pseudo F-值是否高于或等于实际观测到的值。如果多于5%随机置换计算的pseudo-F值高于实际观测值，则表示不同组的样品之间不存在显著差异 (p-value > 0.05)。


*两两事后检验：*
```{r}
dune.pairwise.adonis <- pairwiseAdonis::pairwise.adonis(
  x = dune,
  factors = dune.env$Management, sim.function = "vegdist",
  sim.method = "bray",
  p.adjust.m = "BH",
  reduce = NULL,
  perm = 999
)
```
ANOSIM和PERMANOVA的pairwise analysis声明：“Pairwise tests are not possible in vegan. My understanding is that the non-R software with such tests makes separate pairwise tests using subsets of data with only two levels of a factor in one test. We don’t provide that in vegan and have no plans to provide this in the future.”