glm.Rmd

---
title: "Generalized linear model"
author: "liuc"
date: "11/2/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(glmnet)
library(glm2)
library(logbin)
library(ggeffects)
library(emmeans)

```

## interpret glm results

> https://biologyforfun.wordpress.com/2014/04/08/interpreting-interaction-coefficient-in-r-part1-lm/
> https://online.stat.psu.edu/stat504/lesson/6/6.1


广义线性模型(Generalized linear model, GLIMs)是指一类，通过不同的link function将因变量和自变量的关系变成线性模型。
The term "generalized" linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1982, 2nd edition 1989).其需要满足因变量的条件分布属于指数分布族，且因变量的均值是自变量的线性函数的某个单调可微函数（称为联接函数）。

其的一般形式为：
$$g(ui)=\beta0+b1x1+...+bnxn$$
其中，g是联接函数，μi是第i个观测值对应的因变量的均值，β0, β1, ..., βp是待估计的参数，xi1,...xip是第i个观测值的自变量。


There are three components to any GLM:
- Random Component: specifies the probability distribution of the response variable
- Systematic Component: specifies the explanatory variables  in the model
- Link Function: specifies the link between the random and the systematic components.

因变量本身就是随机的，是GLM思考的一个出发点；而在一般线性模型中，我们通常假设因变量Y是定值,只是存在一个误差项,误差项所引入的随机因素是常考虑的部分。GLM的系统组件只决定了Y的期望值。Y的真实取值会根据随机组件给出的分布形状在期望值附近波动。也就是说,GLM并不试图预测Y的准确值,而是 MODEL 它的分布形状。这与线性回归等传统模型有很大区别。

"Deviance Residuals" 是广义线性模型中特殊的一种残差，反映了模型拟合的好坏。Deviance Residuals 的计算方式与普通的残差不同。它是基于似然函数计算的，更具体地说，Deviance Residuals 是观察值的似然函数与最大可能似然函数的两倍差异的平方根。这使 Deviance Residuals 能够更好地度量模型的拟合程度，特别是对于那些不符合正态假设或者方差不等的数据。所以它更准确地说是模型诊断工具,而不是传统意义上的残差。

离散参数φ反映了响应变量的方差和均值的关系。它有以下几个作用:衡量残差的离散程度。如果φ越大,表示残差越分散,模型拟合效果越差。调整方差,改进模型。可以用φ来改进方差函数,使模型拟合效果更好。构建模型的显著性检验。许多显著性检验都包含了φ,如wald检验和score检验。区分泊松分布和负二项式分布。当φ=1时,GLM退化为泊松分布;当φ不等于1时,GLM为负二项式分布。判断过度离散。如果φ明显大于1,表示存在过度离散的情况。所以φ对判断GLM的拟合效果和构建统计推断非常重要。需要正确估计和解释φ的大小。


*Assumptions:*
- Y值独立，cases are indenpendent
- GLM不需要Y值为正态分布，but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, etc.)
- but it does assume a linear relationship between the transformed expected response in terms of the link function and the explanatory variables
- The homogeneity of variance does NOT need to be satisfied
- Errors need to be independent but NOT normally distributed
- Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS)


Family                            Default link function
binomial                          (link = "logit")
gaussian                          (link = "identity")
gamma                             (link = "inverse")
inverse.gaussian                  (link = "1/mu^2")
poisson                           (link = "log")
quasi                             (link = "identity", variance = "constant")
quasibinomial                     (link = "logit")
quasipoisson                      (link = "log")


### glm实现简单线性模型

以下是一个glm函数实现简单线性回归的示例，和`lm`函数不同的是其summary中不是Residuals，而是Deviance Residuals ，虽然其的值是一样的。


```{r}
df2 <- readRDS('./datasets/df2.rds')

glm_model <- glm(
  formula = score ~ time * class + AGE,
  data = df2,
  family = 'gaussian'
)

summary(glm_model)

```

**interpret the results:** 首先是Coefficients, 可以看到，对于这批次数据而言，lme/gee的结果极为类似.

带有交互作用的解释：Intercept值为532.929，score值在timeDay0和class0这二者分类变量的baseline值。timeDay 1, score 对比baseline增加 94.286 在Day==1以及class==0等在control时。以此类推。timeDay 1:class1，class ==1、在timeDay1时score值 相比于class==1、timeDay0减少了-332，既是timeday==1、class==0时score增加的94.286，在timeday==1，class==1时score值变化为532.929-4.761+ 94.286-332.125.

在多个covariates的多重线性回归中，定量变量的解释是在其他定量变量constant的情况下其每改变一个单位，因变量改变的值，而不用去考虑分类变量是否为control，处于哪个组都可以。


Anova的结果也和在线性回归中不一样：
默认的test是LRT，即似然比test，将参数改成`test='F'`，“Rao”, “LRT”, “Chisq”, “F”, “Cp”
```{r}
anova(glm_model)
```
*结果解读：*Deviance，偏差。


### interpret lm results

对于线性回归还是适合用`lm`函数。

回归(Regression)在统计分析中无疑处于核心的地位。其具有大量的模型和方法，也是我们在现实世界应用最多的方法，通过对数据的采集，来理解变量间的关系、预测未来的事件，这都只能归结为客观世界本身具备可量化的方程。
科学研究中的规律，比如由观察数据所得到的定理，如 $G = m * g$ 其是否具备误差项。

```{r}
# 线性回归需要符合LINE条件，其在临床预测模型和诊断模型. 模型自身的评价标准和对test数据集的评价
# 线性模型里面能用*号吗？按照模型的定义应该是不可以的。因为这样便引入了交互项、共线性？

lm_fit <- lm(
  score ~ class*time + AGE,
  data = df2
)

summary(lm_fit)

```

*interpret results:* Residuals指的是模型的每一个预测值和真实值间的差值，通过`summary(df2$score-lm_fit$fitted.values)`可以得到类似的结果. 在利用QQplot图观察时，可以进一步分析模型在两端的预测情况。Coefficients中Estimate指的是在利用诸如最小二乘法估计最小离差平方和得到的slope和intercept，再加入交互效应时，SE可以用来计算estimate的CI（比如1.96*SE+—）,而且一般而言SE大于estimate值，其P值一般没有意义。t值为Estimate/SE。P值为系数为0的概率，比如我们可以说class0的timeDay7在模型中是有意义的变量, class1没有统计学意义，则样本在两分组中没有差异。Residual standard error指的是残差的均值(the average amount that the actual values of Y (the dots) differ from the predictions (the line) in units of Y. In general, we want the smallest residual standard error possible, because that means our model’s prediction line is very close to the actual values, on average)。对于R2, multiple 一般用于简单线性模型，adjusted一般用于多变量模型。F-statistic和p-value是模型整体假设检验，可以理解为其null假设为模型的所有系数都是0。
在线性回归中，对于简单的一元线性回归，t检验值和F检验值是一样的，在多元线性回归中F检验假设模型整体。


```{r}
anova(lm_fit) # anova的结果为Type II tests

car::Anova(lm_fit, type = 'III')
```

*结果解读：*anova对线性模型的分析是计算，其结果中p值小于0.05的一般认为是对模型有贡献的变量。


Global Validation of Linear Models Assumptions.
和`performance::check_models`做一个对比。
```{r}
gvmodel<-gvlma::gvlma(f)
summary(gvmodel)
```


conditional plot, 在其他因素保持不变的情况下，某一变量在

```{r}
visreg::visreg(lm_fit, xvar = 'class', by = 'time', type = 'conditional')
```

```{r}
modelbased::estimate_contrasts(lm_fit, contrast = c("class", "time"))
```


```{r}
modelbased::estimate_means(lm_fit, at = c("time", 'class'))

ggeffects::ggemmeans(lm_fit, terms = c("time", 'class')) |> plot()
```
对于分类变量一般会计算emmeans，其是指在矫正了其他变量


```{r}
modelbased::estimate_slopes(lm_fit)
```
针对一般线性模型中的连续变量，


标准化回归系数对比不同自变量对因变量的影响：
一般线性回归模型的效应值就是其std.coeff.

```{r}
sjPlot::plot_model(lm_fit,
           type = "std",      # 计算标准化回归系数
           sort.est = TRUE,    # 进行排序
           show.values = TRUE,  # 显示数值
           show.p = TRUE,     # 显示p值
           value.offset = 0.2,  # 调整数值的位置
           vline.color = "gray",    # 修改穿越0的线条颜色
           title = "Summary of the regression model")    # 添加题目

# effectsize::effectsize(lm_fit)
```


逐步选择法：

```{r}
MASS::stepAIC(fit, direction = "backward")
stats::step(fit, direction = "backward")

# 全局择优法
# Cp是越小越好的，从上面这幅图来看，纳入3个自变量（tg/ri/hba）时最好。

library(leaps)
leaps <- regsubsets(fpg ~ cho + tg + ri + hba, data = df15_1, nbest=4)

plot(leaps, scale = "Cp") # 通过Cp判断

```


```{r}
lm_fit2 <- lm(
  score ~ class + time + AGE,
  data = df2
)
```


### by glmnet

```{r}
X <- df2 %>% dplyr::select()
lm_fit2 <- glmnet::glmnet(x = X, 
                          y = y,
                          family = 'gaussian')

summary(lm_fit2)
```


## different family

对于广义线性模型，对不同数据具有不同的模型选择。依据Y变量的不同分布进行选择。


### 负二项分布

负二项分布(Negative Binomial Distribution)是一种离散型概率分布,它描述在Bernoulli试验中观察到r次“成功”之前发生的“失败”次数的概率分布。

$$P(X=k) = \frac{\Gamma(k+r)}{\Gamma(r)k!} p^r (1-p)^k$$
其中:

$k$ - 观察到第$r$次“成功”之前的“失败”次数
$r$ - “成功”的次数
$p$ - 单次Bernoulli试验中“成功”的概率
$\Gamma()$ - Gamma函数


负二项分布的期望和方差为:
E(X) = r(1-p)/p
Var(X) = r(1-p)/p^2
与其他分布的关系:
当r=1时,负二项分布退化为几何分布。
当r→∞时,负二项分布趋向于泊松分布。

负二项分布可以用来描述一些现实问题，比如在一个电话销售活动中，需要拨打多少电话才能完成r次销售；或者在一个广告投放活动中，需要投放多少次广告才能吸引到r次点击等。


*模型假设：*


```{r}
# 负二项分布
# 负二项分布回归一般用于count数据，尤其较为离散over-dispersed分布的count数据，柏松分布假设means和var是same的。
# negative binomial models assume the conditional means are not equal to the conditional variances

library(MASS)


dat <- read_csv('./datasets/Lab_exp.csv')

```


考虑如下一个简单的示例，Diplo_intensity为counts数的计数；Treatment为不同条件。
```{r}
# check assumptions
# 可以看到四个分组之间
with(dat, tapply(Diplo_intensity, Treatment, function(x) {
    sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
```


```{r}
m1 <- MASS::glm.nb(formula = Diplo_intensity~Treatment,
                   data = dat %>% filter(Treatment %in% c('Infected HG', 'Control')))

# 95% CI for estimate
est <- cbind(Estimate = coef(m1), confint(m1))
exp(est)


```

```{r}
summary(m1)

```


```{r}
anova(m1)
```


模型假设还是很方便的。
```{r}
performance::check_model(m1)
```


R包`mvabund`的负二项分布建模：

```{r}
# mvabund 负二项分布建模
library(mvabund)

mm1 <- mvabund::manyglm(
  formula = Diplo_intensity~Treatment,
  data = dat %>% filter(Treatment %in% c('Infected HG', 'Control')),
  family = 'negative.binomial'
)

summary(mm1)
# summary.manyglm(mm1, test = 'wald', nBoot = 9999)

message(rep("#", 50), "\n")

mm1
```

*结果解读: *对于MASS包，其summary后的系数和一般线性模型的解释类似，如连续性变量则是其每改变一个单位对outcome的log count的影响，对于离散变量则是和refer group的对比。


*多个自变量的MASS package & manyglm的对比*

```{r}
m2 <- MASS::glm.nb(Diplo_intensity~Treatment+Fish_family+Fish_sex+Initial_weight_g, 
                   data = dat)
summary(m2)

mm2 <- manyglm(Diplo_intensity~Treatment+Fish_family+Fish_sex+Initial_weight_g, 
                   data = dat,
                   family = 'negative.binomial')

summary.manyglm(mm2, test = 'wald', nBoot = 9999)
```


评估负二项分布, 对于mvabund包performance包不支持，不过mvabund自身有提供一些手段; 对于MASS包的结果是支持的。

```{r}
performance::check_model(m2)
```


```{r}
# 对于两个模型的对比
m3 <- MASS::glm.nb(Diplo_intensity~Treatment + Fish_family, 
                   data = dat)

anova(m1, m3)

# 既然nb对于小数据不是很合适，对于DESeq2中的应用是怎么应用的呢
# To proceed, we need to make distributional assumptions. Essentially, what such assumptions do is that they let us compute the probabilities of rare events in the tails of the distribution – i.e., extraordinarily high or low counts – from a small number of distribution parameters.

```


下面是RNA counts数据的一个例子
https://www.huber.embl.de/msmb/Chap-CountData.html
```{r}

```


### log binomial

log binomial is a cousin to logistic regression, log link rather than logit link. Can be used to estimate adjusted
relative risk. 
log binomial对于common outcome计算时的RR值更为接近真实情况，而罗辑回归在rare outcome时所得到的OR可近似为RR。
Odds ratios are not good effect measures when the outcome of interest is common.

```{r}
# logbin package is a good choice
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# first we fit a logistic regression
l_reg <- glm(
  diabetes ~ glucose,
  data = PimaIndiansDiabetes2, family = 'binomial'
)
performance::performance(l_reg)


# then we fit a log binomial,but an error reporte
log_bin <- glm2(
  diabetes ~ glucose,
  data = PimaIndiansDiabetes2, family=binomial(link=log),
  start = c(log(mean()), rep(0, 1)) # https://stats.stackexchange.com/questions/105633/what-to-do-when-a-log-binomial-models-convergence-fails
  )


# then we use logbin package to fit a model
library(logbin)

lb <- logbin(
  diabetes ~ glucose,
  data = PimaIndiansDiabetes2, model = FALSE, 
  method = 'cem' # method有多种选择，分别对应不同的
)
performance::performance(lb)
```

```{r}
car::influencePlot(l_reg)
```
```{r}
plot(predict(l_reg, type="response"), residuals(l_reg, type= "deviance"))
```


```{r}
plot(hatvalues(l_reg)) 
plot(rstudent(l_reg)) 
plot(cooks.distance(l_reg))
```


### modified Poisson regression & Poisson distribution

Modified Poisson regression, which combines a log Poisson regression model with robust variance estimation, is a useful alternative to log binomial regression for estimating relative risks.

```{r}
# https://charliemarks.com/r-tutorials/modifiedpoissonregression
# https://rstudio-pubs-static.s3.amazonaws.com/5752_fc41dca85dd24539bc99868697de83d0.html

library(lmtest)
library(sandwich)

## Define a function to obtain ratio
..glm.ratio <- function(GLM.RESULT, DIGITS = 2, P.DIGITS = 3, CONF.LEVEL = 0.95) {

    if (GLM.RESULT$family$family == "binomial") {
        LABEL <- "OR"
    } else if (GLM.RESULT$family$family == "poisson") {
        LABEL <- "RR"
    } else {
        stop("Not logistic or Poisson model")
    }

    ## Extract coefficients and confidence interval
    COEF      <- stats::coef(GLM.RESULT)
    CONFINT   <- suppressMessages(stats::confint(GLM.RESULT, level = CONF.LEVEL))
    TABLE     <- cbind(coef = COEF, CONFINT)

    ## Turn them into OR
    TABLE.EXP <- round(exp(TABLE), DIGITS)

    colnames(TABLE.EXP)[1] <- LABEL

    ## Extract p-value
    PVAL <- round(summary(GLM.RESULT)$coef[,4], P.DIGITS)

    TABLE.EXP <- cbind(TABLE.EXP, "P" = PVAL)    
    TABLE.EXP
}

#First, I need to make a new variable, which will be 1 if the state’s life expectancy is at least 71 years and 0 otherwise
state.x77_2 <- as.data.frame(state.x77)
state.x77_2$seventyone <- ifelse(state.x77_2$`Life Exp`>=71, 1, 0)
state.x77_2$HSGrad <- state.x77_2$`HS Grad`
state.x77_2$ID <- 1:nrow(state.x77_2)

mp0 <- glm(seventyone ~ Illiteracy + Murder + HSGrad, 
             data=state.x77_2,
             family = binomial(link = "logit")
             )

# Fit by glm() then test using robust SE estimator
mp <- glm(seventyone ~ Illiteracy + Murder + HSGrad, 
             data=state.x77_2,
             family = poisson(link="log")
             )

res_p <- coeftest(mp, vcov = sandwich)

parameters::parameters(mp)

# this is the results
model_results <- round(cbind(exp(cbind(RR = res_p[,1], 
                                       LCI = res_p[,1] + qnorm(0.05/2)*res_p[,2],
                                       UCI = res_p[,1] - qnorm(0.05/2)*res_p[,2])),
                                       P = res_p[,4]),4)

## same results as model_results
geeglm.log.poisson <- geepack::geeglm(formula = seventyone ~ Illiteracy + Murder + HSGrad,
                             data    = state.x77_2,
                             family  = poisson(link = "log"),
                             id      = ID,
                             corstr  = "exchangeable")
summary(geeglm.log.poisson)


## The intercepts correspond to the baseline risk and baseline odds
cbind(
  RiskRatio = exp(coef(mp)),
  RR2 = exp(coef(geeglm.log.poisson)),
  OddsRatio = exp(coef(mp0))
)
```


#### Poisson distribution

泊松分布，poisson回归主要用于结局资料是计数资料的情况，与线性回归和logistic回归不同。注意计数资料不同于连续资料，不要弄混。比如咳嗽次数、哮喘发作次数，等等变量，这是计数资料，需要考虑poisson回归，不能用线性回归。
logistic回归通常认为服从二项分布，而二项分布一般认为发生率不应该太低。poisson回归服从poisson分布，一般用于发生率特别低的情形。


Poisson分布描述一个固定时间间隔内随机事件的发生次数。


*初步了解R里面的柏松分布*
```{r}
dpois(x, lambda)#：计算 Poisson 分布的概率密度函数在 x 处的取值，其中 lambda 是 Poisson 分布的参数。
ppois(q, lambda)#：计算 Poisson 分布的累积分布函数在 q 处的取值。
qpois(p, lambda)#：计算 Poisson 分布的分位数，即满足 P(X ≤ q) = p 的最小整数 q。
rpois(n, lambda)#：生成 n 个服从 Poisson 分布的随机数，其中 lambda 是 Poisson 分布的参数。
```


```{r}
# Diplo_intensity 为计数资料
dat <- read_csv('./datasets/Lab_exp.csv')
```


初步查看分布的情况：
```{r}
ggplot(dat,aes(x=Diplo_intensity)) +
  geom_histogram() +
  theme_bw()
```


建模：
```{r}
# 
pos_m <- glm(Diplo_intensity~Treatment + Initial_weight_g, 
             data = dat, family = 'poisson')

summary(pos_m)
```


Change factors for rate parameter lambda:
```{r}
exp(coef(pos_m))
```


```{r}
# 重新计算系数，但标准误所使用方法不同
# Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean. 
cov.m1 <- sandwich::vcovHC(pos_m, type="HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(
  Estimate = coef(pos_m), "Robust SE" = std.err,
  "Pr(>|z|)" = 2 * pnorm(abs(coef(pos_m) / std.err), lower.tail = FALSE),
  LL = coef(pos_m) - 1.96 * std.err,
  UL = coef(pos_m) + 1.96 * std.err
)

r.est

# 上面的为手动计算，此处为`lmtest`包进行的计算
lmtest::coeftest(pos_m, 
                 vcov=cov.m1
                 )
```

```{r}
parameters::parameters(pos_m, exponentiate = TRUE)
```


```{r}
check_model(pos_m)
```

```{r}
performance(pos_m)
```


quasipoisson
```{r}
# Adjusted Poisson-regression
# Same parameter estimates as in Poisson model, but different standard errors, hence different p-values
pos_m2 <- glm(Diplo_intensity~Treatment + Initial_weight_g, 
             data = dat, 
             family = quasipoisson(link="log")
             )

summary(pos_m2)
```