Common_statistical_tests_are_linear_models.qmd

---
title: "Common statistical tests are linear models"
format: html
---

## Common statistical tests are linear models

> https://lindeloev.github.io/tests-as-linear/


线性模型等价common statistic的一些笔记。
线性模型往往需要较多的样本数量才和common statistic的结果趋近。具体的样本数量在上述的链接中有极好的示例，此处不再赘述


### 对于correlation中常用的参数和非参数的Pearson和Spearman

对于correlation中常用的参数和非参数的Pearson和Spearman，其等价于如下的lm模型


```{r}
library(effectsize)

options(es.use_symbols = TRUE)
```


```{r}
# 生成测试数据
rnorm_fixed = function(N, mu = 0, sd = 1)
  scale(rnorm(N)) * sd + mu

y = c(rnorm(15), exp(rnorm(15)), runif(20, min = -3, max = 0))  # Almost zero mean, not normal
x = rnorm_fixed(50, mu = 0, sd = 1)  # Used in correlation where this is on x-axis
y2 = rnorm_fixed(50, mu = 0.5, sd = 1.5) 
```


x, y 都是连续性数字变量

```{r}
a = cor.test(y, x, method = "pearson") # Built-in
b = lm(y ~ 1 + x) # Equivalent linear model: y = Beta0*1 + Beta1*x, P值R2虽一样但系数不一致

# 得到和a一样的系数
c = lm(scale(y) ~ 1 + scale(x)) # On scaled vars to recover r

summary(c) # 的系数既是R， F-statistic的P值既是P值。
```


```{r}
# Spearman correlation
a = cor.test(y, x, method = "spearman") # Built-in
b = lm(rank(y) ~ 1 + rank(x)) # Equivalent linear model

summary(b)
```


### t-test的线性等价：

#### one sample t-test

```{r}
# Built-in t-test
a = t.test(y)

# Equivalent linear model: intercept-only
b = lm(y ~ 1)
```

```{r}
## 以下三者是一致的，都是单样本wilcox检验
# Built-in
signed_rank = function(x) sign(x) * rank(abs(x))

a = wilcox.test(y)

# Equivalent linear model
b = lm(signed_rank(y) ~ 1)  # See? Same model as above, just on signed ranks

# Bonus: of course also works for one-sample t-test
c = t.test(signed_rank(y))

a
summary(b)
c
```


#### two sample t-test

对于两样本ttest，其公式为：\[y_{i}=\beta_{0} + \beta_{1}xi \text{ } H_{0}:\beta_{1}=0 \]

```{r}
# Long format data with indicator
value = c(y, y2)
group = rep(c('y1', 'y2'), each = 50)
```


```{r}
# Built-in independent t-test on wide data
a = t.test(y, y2, var.equal = TRUE)

# Be explicit about the underlying linear model by hand-dummy-coding:
group_y2 = ifelse(group == 'y2', 1, 0)  # 1 if group == y2, 0 otherwise
b = lm(value ~ 1 + group_y2)  # Using our hand-made dummy regressor

# Note: We could also do the dummy-coding in the model
# specification itself. Same result as above.
# 甚至可以看到Intercept值和系数的值分别是样本的均值，一般线性模型的分类变量的处理方式。
tapply(value, group, mean)
c = lm(value ~ 1 + I(group == 'y2')) # summary(lm(value ~ group)) same result.
confint(c)

```

_Mann-Whitney U Test_

```{r}
# Wilcoxon / Mann-Whitney U
a = wilcox.test(y, y2)

# As linear model with our dummy-coded group_y2:
b = lm(rank(value) ~ 1 + group_y2)  # compare to linear model above
```


在上述的示例中，t.test中假设方差相等，方差不相等的情况下we model one variance per group.
既是Welch ttest。


```{r}
# Built-in
a = t.test(y, y2, var.equal=FALSE)

# As linear model with per-group variances
b = nlme::gls(value ~ 1 + group_y2, weights = nlme::varIdent(form=~1|group), 
              method="ML")

summary(b) # 解读同线性模型
```


### ANOVAs

对于分组变量，在模型中纳入的是dummy变量，R中模型如何处理factor真是令人头痛呀。
好在很多函数中直接输入factor变量即可，虽然存在指定reference的问题。

```{r}
# Compare built-in and linear model
a = car::Anova(aov(value ~ group, D))  # Dedicated ANOVA function

b = lm(value ~ 1 + group_b + group_c, data = D)  # As in-your-face linear model
#identical(lm(value ~ 1 + group, data = D), b)
```


```{r}
# Dedicated ANCOVA functions. The order of factors matter in pure-aov (type-I variance).
# Use type-II (default for car::Anova) or type-III (set type=3),
a = car::Anova(aov(value ~ group + age, D))
#a = aov(value ~ group + age, D)  # Predictor order matters. Not nice!

# As dummy-coded linear model.
full = lm(value ~ 1 + group_b + group_c + age, D)

# Testing main effect of age using Likelihood-ratio test
null_age = lm(value ~ 1 + group_b + group_c, D)  # Full without age. One-way ANOVA!
result_age = anova(null_age, full)

# Testing main effect of groupusing Likelihood-ratio test
null_group = lm(value ~ 1 + age, D)  # Full without group. Pearson correlation!
result_group = anova(null_group, full)
```


### Proportion

```{r}

```