-
Notifications
You must be signed in to change notification settings - Fork 0
/
lme.Rmd
executable file
·761 lines (519 loc) · 29.8 KB
/
lme.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
---
title: "linear mixed effect model"
author: "liuc"
date: "10/28/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## linear mixed effect model
> https://yury-zablotski.netlify.app/post/mixed-models/
> https://m-clark.github.io/mixed-models-with-R/random_intercepts.html
> https://poissonisfish.com/2017/12/11/linear-mixed-effect-models-in-r/
> https://bbolker.github.io/mixedmodels-misc/
> https://journals.sagepub.com/doi/full/10.1177/2515245920960351
> https://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet/13173#13173
> https://stackoverflow.com/questions/26465215/random-slope-for-time-in-subject-not-working-in-lme4
> https://stats.stackexchange.com/questions/58745/using-lmer-for-repeated-measures-linear-mixed-effect-model?rq=1
> https://stats.stackexchange.com/questions/65371/mixed-model-with-1-observation-per-level
> https://stats.stackexchange.com/questions/488428/how-do-you-know-the-number-of-random-effects-in-a-mixed-effects-model
> https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html
> https://drive.google.com/file/d/1sOZUAFOc004H4jO8vuUc_4HyYHEgu45b/view
> https://mp.weixin.qq.com/s/jxnwMVB3eNkru60jebC8Zw
线性混合模型用来解释因变量的连续型和分类型自变量的作用称为*固定效应*,个体之间的差别的影响称为*随机效应*,随机效应是用来代表观察对象间的差异以及观察对象内的差异(如由重复测量造成的相关性),这些差异是随机的且不可观察到的。
mixed effects model average trends, random effects model the extent to which these trends vary across levels of some grouping factor.混合效应模型可以量化不同群体间平均趋势的差异程度。这里的grouping factor指的是对数据进行分组的因素, 比如地区、性别等。
混合模型的优点之一就是可以处理不等距时间数据,这对于分析临床回顾性研究中收集的患者不等距时间点上的重复测量数据是非常合适的。
*线性混合效应模型原理:*
线性混合模型也有叫混合线性模型,是在线性模型的基础上发展起来的。LME模型的出现主要是为了解决抽样数据不独立的问题,例如同一被试多次测得的血压。使用线性回归模型方法有一个重要前提,就是要求每个测量数据都是独立的。默认假设组间被试是异质的,而组内被试则是同质的。重复测量会增加一些随机变化,最终导致拟合的模型会偏离实际情况。
假设随机效应满足正态分布(似乎有带商榷),即每个群体的随机效应都是从总体随机效应的正态分布中独立取样的。在线性混合效应模型 (LMM) 中,因变量被视为固定效应和随机效应的线性组合。
*同样是控制其他的因素,混合模型和多元线性模型的差别在哪里呢,或者说为什么线性模型更合适呢?*
随机因子作为需要控制的变量,让人想到协变量以及协方差分析。它利用拟合的协变量与因变量的线性回归关系,把需要控制的变量拉到同一水平(总均数),相应地得到修正后的Y值,再对各组Y值的均值进行比较分析。
多元线性回归里的净回归系数,也是指控制其它预测变量后,当前自变量对因变量的纯粹影响关系。
多元线性回归模型假设测量数据之间的独立性,但是不同患者同一时间的采样,受限于个体内的影响,Y值在个体内会有所关联,并导致错误的模型参数估计。LME模型把组内(各患者)的相关测量(另外如:个体的重复测量)采用随机效应来处理,解决了数据独立性的问题。
线性混合模型的应用较多,此处只针对医学研究中常见的重复测量数据分析。
对于医学中常见的纵向数据,同一个受试者在不同时间的因变量值是相关的,不同受试者的因变量值可以认为不独立。
此时不同的受试者作为分层因素,`(1|PatientID)`,其中1是指截距,竖线右侧是分组因子,因此这里的含义是:分别为每个患者设置一个截距参数。
*一个需要注意点在于怎么引入交互作用:*下面这段摘抄是一个不错的建议,For the so called ‘fixed effects’, one typically specifies effects of time (as a categorical or factor variable), randomised treatment group, and their interaction. This implies a saturated model for the mean, or put another way, there is a separate mean parameter for each time point in each treatment group. Often there are baseline covariates to be adjusted for. One can adjust for these as simple main effects, or additionally with an interaction with time, in order to allow for the association between the baseline variable(s) and outcome to potential vary over time. For a more in depth discussion of the model, see for example Molenberghs et al 2004 (open access). 一般来说需要考虑时间和group的交互,以及时间和base line间的交互。随机项的话,或者一个随机截距,或者随机time系数。过多的需要考虑的分组信息是一个需要小心的事情。
`m2 <- lmer(Obs ~ Treatment * Day + (1+Day|Subject), mydata)`
Essentially, an MMRM is a specific linear mixed effects model that includes (at least) an interaction of treatment arm and categorical visit variables as fixed effects. The covariance structure of the residuals can have different forms, and often an unstructured (i.e. saturated parametrization) covariance matrix is preferred. This structure can be represented by random effects in the mixed model.
*一些总结:*
重复测量方差分析对应的是具有随机截距和随机斜率的混合效应模型。
重复测量方差分析的精妙之处在于,它把被试间的平方和从组内平方和中剥离出来,使得误差平方和变得更小,相比于减少不多的自由度,其误差均方也会变小。F检验中的分母变小,意味着其值将变大,能够提升统计功效,更容易发现组间的实验效应。
LME模型中,因变量的总方差等于固定效应方差、随机效应方差和残差方差之和。模型的解释力是用_条件R2_(conditional R2)来表示的,它是指固定效应方差和随机效应方差之和除以总方差,而_边缘R2_(marginal R2)是指固定效应方差除以总方差。
*如何比较原先的模型与纳入随机效应后的新模型,以便确认新模型的优势。*
首先是似然比检验(likelihood ratio,LR)方法,它是一种基于p值、适用于大样本数据的模型比较方法。而对于小样本数据,可以使用Satterthwaite近似法(对于REML模型)来推导出p值。
REML是指残差最大似然法,它是LME模型默认的参数估计方法。这是因为最大似然(ML)估计是有偏差的,而REML的偏差通常较小。不过,在比较具有不同固定效应的模型时,应该选用ML方法,因为它不依赖于固定效应的系数。
```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(lme4)
library(lmerTest)
library(ggeffects)
library(emmeans)
library(mmrm) # 一个针对临床重复资料设计的R包
# data(package = 'lme4')
```
## use lmer
> https://ase.tufts.edu/bugs/guide/assets/mixed_model_guide.html
首先是检验数据因变量的概率分布,在这一点上似乎和一般线性模型相似,线性混合模型需要因变量满足正态性(目前来看值得商榷).线性混合模型的统计推断如t检验、F检验依赖于残差的正态性和方差的齐性,而不是依赖于因变量的正态性。
如果不符合正态性检验,需选择正确的建模方法,比如penalized quasilikelihood (PQL) 、Laplace approximation and Markov chain Monte Carlo algorithms (MCMC).
PQL is a flexible technique that can deal with non-normal data, unbalanced design, and crossed random effects. However, it produces biased estimates if your response variable fits a discrete count distribution, like Poisson or binomial, and the mean is less than 5 - or if your response variable is binary.`MASS::glmmPQL`
The Laplace approximation can handle up to 3 random effects.Any more than that, and you'll have to use MCMC, which is a Bayesian method that can be somewhat confusing.
the Laplace approximation is a special case of a parameter estimation method called Gauss-Hermite quadrature (GHQ), with one iteration. GHQ is more accurate than Laplace due to repeated iterations, but becomes less flexible after the first iteration, so you can only use it for one random effect.
不过通过performance::check_models似乎更安全 ,不确定此处的正态性检验是对因变量还是残差。
除此之外,*对于outlier等数据的处理、NA值的处理*等,此处所使用的数据暂不考虑这些,只是主要结果的笔记。
*总结一下,线性混合效应模型的假设assumption:*
1. 解释变量和相应变量间线性相关
2. 残差独立、正态分布、并have constant variance
3. Normally distributed random effects - The random effects are usually assumed to be normally distributed.
4. No extreme multicollinearity ;No extreme outliers
```{r}
df2 <- readRDS('./datasets/df2.rds')
car::qqPlot(df2$score)
car::qqp(log(df2$score))
car::qqp(df2$score, 'lnorm')
# 可以看到数据不是很符合正态分布,但符合正态分布在实际中不是一件简单的事情,差不多就可以。
# 像本次数据的分布有些过于不符合了
```
寻找最佳随机效应结构,找到了模型最好的随机效应结构,接下来我们就给模型加入固定效应。
lme模型的构建在先验知识外,可以展开随机效应结构的筛选。
但怎么判定一个变量是固定效应变量还是随机效应变量呢?有人讲尝试每一个变量作为随机效应,但似乎并不可取。通过对项目本身的知识来选择似乎更靠谱。
REML estimation is unbiased but does not allow for comparing models with different fixed structures. Only use the REML estimation on the optimal model.
所以在进行效应选择的时候,用ML算法。
*随机效应的选择*
```{r}
nullmodel1 <- lme4::lmer(score ~ 1 + (1 | PatientID), data = df2, REML = FALSE)
nullmodel2 <- lme4::lmer(score ~ 1 + (1 + class | PatientID), data = df2, REML = FALSE)
# 一个常见的报错,即随机效应的数目大于等于观察数目
# 比如以下这个公式,其time|PatientID的数目和观察的数目相等,即是考虑了随机截距和随机slope后,随机效应的
# 数目为231, 不过这个问题很神奇,因为同一患者在不同时间点测多次,必然是和观察数目相等的。
# sleepstudy这个测试数据集却是没有问题的,不过如果将Days变成factor,同样的问题依旧会出现
# lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy) 没有任何问题, str(sleepstudy)
# 目前来看应该就是样本数目不够的缘故
# 随机效应的数目计算为77+77?,这里应该考虑不同的分类因子变量来计算不同的随机效应的数目,所以是3*77,也就是刚才提到的,不论怎样,观察和随机效应值都会是一样的,报错总会有!!
# time |PatientID的写法中,除了会计算其他的效应,还会计算time在每一个PatientID的效应。
# The | symbol indicates a grouping factor in mixed methods.
# 所以在time combine PatientID时,只有一个观察值的情况下,这种纳入是不正确的。不要纳入time的随机slope。
nullmodel3 <- lme4::lmer(score ~ 1 + (1 + time |PatientID), data = df2, REML=FALSE,
control=lmerControl(check.nobs.vs.nRE="ignore") # wouldn't recommend it.
)
# 以下没有任何问题
nlme_m <- nlme::lme(score ~ time, random = ~ 1 + time | PatientID,
method = "REML", data = df2)
intervals(nlme_m)
# 有问题
nlme_m2 <- nlme::gls(
score ~ time,
# Specify a general correlation structure
correlation = corSymm(form = ~ time | PatientID),
# Specify constant variance for all Days
# weights = varIdent(form = ~1),
# Or let the variance vary by Day
# weights = varIdent(form = ~ 1 | time),
data = df2
)
summary(nlme_m)
anova (nullmodel1, nullmodel2) # 基于LRT比的模型间的对比,同时还考虑模型的AIC/BIC/
```
*报错问题的解决:*首先是考虑随机效应的数目,随机效应数目的计算方式为,1|PatientID,其随机效应值为77,time | PatientID, 因为time是一个分类变量,则其在计算时time*77便是随机效应值,在这个意义上来讲,数据在time上不是repeated measures,不能将time视为随机效应。
This is because the levels of a categorical variable are represented by dummy variables - essentially they are treated as different variables. So in your case, when you fit random slopes only you are asking the software to estimate 5 random slopes for each group. When you fit random intercepts and random slopes there will be 407 random intercepts, but only 4 random slopes for each group (since one level will be treated as a reference group and included in the intercept), so either way you will have 5 x 407 random effects.
The only way to solve this is by either coding the variable as `numeric`, if that is plausible in your study/data, or not fitting random slopes, or having more than 1 observation per treatment per group.
```{r}
test_performance(nullmodel1, nullmodel2)
compare_performance(nullmodel1, nullmodel2) |> plot()
```
上面的结果可以看到,nullmodel1有较好的指标,所以只选择随机intersect即可,不用考虑其他变量的随机slope。
固定效应的纳入,一般结合具体问题和样本量进行取舍。变量的纳入参考10人1️变量。
```{r}
# https://rpubs.com/rslbliss/r_mlm_ws#:~:text=To%20run%20a%20multilevel%20linear,we%20have%20used%20thus%20far.&text=Note%20that%20R%20uses%20restricted%20maximum%20likelihood%20to%20fit%20the%20model.
# same df2 data as in gee.Rmd
# 以入组后的2次测量值为因变量, 以处理效应、时间效应等为固定效应, 以不同病人作为随机效
lme_model <-
lmerTest::lmer(score ~ time + class + AGE + (1 | PatientID),
data = df2
)
summary(lme_model)
# 纳入time的公式有问题,这里用class做一个测试。不过没有必要考虑fixed变量的随机slope吧
lme_model2 <-
lme4::lmer(score ~ time * class + AGE + (0 + class |`PatientID`),
data = df2
)
lme_model3 <- update(lme_model2, .~.-AGE)
# To keep the intercept fixed while keeping the random slope, replace the “1” with a “0”
```
*结果解读:*everything to the left of the | indicates the effects that should be random, and the variable to the right of the | is the grouping variable across which the effects should vary. What is the “1”? It’s the way we refer to the intercept.
lme模型和gee.Rmd中的示例为同一数据,我们在只考虑随机截距的情况下,Fixed effects中Estimate值和gee结果保持一致,说明自变量。random effects为随机效应, Groups列为随机效应因素, 此处只对Intercept的PatientID考虑了随机因素, 通过其Variance的大小可以判断随机截距效应是否应该考虑.
*REML criterion at convergence: *REML(restricted maximum likelihood),
*Random effects: * The “Residual” standard deviation refers to σ. 如果Variance太小的话,则随机效应可能不太重要。
*Fixed Effects: * 随机效应的解释可以为timeDay1对比于(Intercept)中的timeDay0,score值减少了-147.260。AGE变量的话是可以不用考虑的。
```{r}
sjPlot::tab_model(lme_model,
p.val = "kr", # kr 方法计算的p值更为精准
show.df = TRUE
)
gtsummary::tbl_regression(lme_model,
intercept = TRUE
)
```
随机效应的残差方差:
```{r}
d <- lme4::VarCorr(lme_model)
d
# 可重复性
# 如果你得到的重复性小于1.0,那么个体内测量结果之间的变化来源是什么。仅是测量误差吗?
216.0^2/(216.0^2 + 417.9^2)
```
计算固定效应方差:
```{r}
#计算固定效应方差
fm.var <- model.matrix(lme_model) %*% fixef(lme_model)
var(fm.var)
```
*关于线性混合效应模型的R2计算方法:*
`var(fm.var)/(var(fm.var) + d$vcov[1] + d$vcov[2])`
```{r}
# 手动计算R2
MuMIn::r.squaredGLMM(lme_model)
```
Estimates of Covariance Parameters:
对随机截距进行检验, 简单理解为不同观察者Y指标初始值之间的差异具有显著统计学意义。
```{r}
lmerTest::ranova(lme_model)
```
*interpret result: *如果P值小于0.05,可以理解成有必要考虑随机效应。
*绘制固定效应拟合直线:*
```{r}
#绘制固定效应拟合直线
plot(effects::allEffects(lme_model))
```
```{r}
test_performance(lme_model, lme_model2)
compare_performance(lme_model, lme_model2) |> plot()
```
```{r}
check_model(lme_model)
```
*Group-level Effects*
此处的group为患者,注意患者数目对绘图的影响。
```{r}
random <- estimate_grouplevel(lme_model)
random
plot(random) +
theme_lucid()
```
```{r}
# type III
car::Anova(lme_model, type = 'III')
# Satterthwaite's method
anova(lme_model)
# to see all 134 models
# coef(lme_model)
plot(ggemmeans(lme_model, terms = c("time", "class"))) +
ggplot2::ggtitle("GLMER Effect plot")
```
*结果interpre* 在模型构建完成后,应该如何展示分析的结果呢?typeIII结果可以表明变量对模型的贡献度。
同时还可以汇报emmeans,以及不同分组中的emmeans的差值以及对应的95%CI。
*Unit-level predictions*
```{r, include=FALSE}
require(marginaleffects)
```
此图对连续变量更为适用,以下不适用。
```{r}
pred1 <- predictions(lme_model,
newdata = datagrid(PatientID = df2$PatientID,
time = c('Day 0','Day 1','Day 7'),
AGE = 30:40,
class = 1
)
)
p1 <- ggplot(pred1, aes(AGE, estimate, level = PatientID)) +
geom_line() +
labs(y = "Predicted weight", x = "Time", title = "Linear growth model")
p1
```
*Population-level predictions*
To make population-level predictions, we set the PatientID variable to NA, and set re.form=NA. This last argument is offered by the `lme4::predict` function which is used behind the scenes to compute predictions:
```{r}
```
计算emmeans,以及contrasts:
```{r}
# 计算emmeans,以及contrasts
(emm_res <- emmeans(lme_model, specs = c('time', 'class')))
pc <- emmeans::emmeans(lme_model,
specs = trt.vs.ctrl ~ time | class,
type = "response",
ref = 1,
adjust = 'fdr', infer = c(TRUE, TRUE))
pc
# 当然还可以 通过如下方式,求得两两比对的结果,结果同pairs(emm_res)
# contrast(emm_res, method='pairwise', by = 'class')
grafify::posthoc_Levelwise(lme_model,
c("time", 'class'))
```
```{r}
## Pairwise comparisons, 对于lme模型中emmeans值进行两两比较,自然是针对分类变量
(emm_res <- emmeans(lme_model, specs = 'Time'))
pairs(emm_res)
contrast(emm_res)
plot(emm_res, comparisons = TRUE) + theme_bw() +
labs(y = "", x = "Estimated marginal mean")
```
```{r}
library(gtsummary)
tbl_data <- df_p_long %>%
select(Subject, Folder, !!x_input, ACTARM) %>%
distinct() %>%
pivot_wider(names_from = Folder, values_from = !!x_input)
# janitor::clean_names()
# labelled::remove_labels()
gtsummary::tbl_summary(tbl_data %>% select(-Subject),
by = ACTARM,
missing = 'no',
type = all_continuous() ~ 'continuous2',
statistic = list(all_continuous() ~ c("{N_nonmiss}",
"{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}"),
all_categorical() ~ "{n} / {N} ({p}%)")
) %>%
modify_header(all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p)}%)") %>%
add_n() %>%
add_p() %>%
bold_labels() %>%
modify_spanning_header(all_stat_cols() ~ glue::glue("**{x_string}**"))
```
```{r}
gtsummary::tbl_regression(lme_model,
intercept = TRUE
)
```
*use easystats to get model parameter or effectsize or performance*
下面针对模型的参数和performance进行展示
```{r}
# 首先是是否满足线性混合模型的条件
performance::check_model(lme_model) # 可以看到残差不是很满足正态性分布。
# 其次是模型自身的表现,一般用于模型间的比对
performance::performance(lme_model)
# 相对于summary的结果,多了95%CI,不过通过confint函数也可以计算
parameters::parameters(lme_model)
```
对模型计算effectsize时得到的是Standardization的结果?
```{r}
effectsize::effectsize(lme_model)
```
*use glmer 广义混合模型*
glmer可以用于处理因变量为分类的数据,即是*Multilevel logistic models*,以及种种其他不符合正态分布的因变量数据。
对log转换后的数据进行符合正态分布的线性分析并不是一件值得去做的事情。
The reason we want to use a GLMM for this is that if we imagine a stastical method as E(x), E(ln(x)) is not the same as ln(E(x)).
```{r}
glme_model <-
lme4::glmer(
score ~ time + class + AGE + (1|`PatientID`),
data = df2,
family = 'gaussian'
)
summary(glme_model) # same as lme_model
# Run logistic model with random intercept and slope
model <- glmer(white ~ homework + (1 + homework | schid), data=mlmdata,
family=binomial(link="logit"))
summary(model)
```
```{r}
compare_performance(lme_model, glme_model)
plot(compare_performance(lme_model, glme_model))
```
## lme by tidymodels
tidymodels 有提供一系列的统计模型的方法,这是基于R生态的优势之所在。但`tidymodels`一贯的以预测作为模型的主要目的,而在我们的分析中,很多时候模型是作为解释自变量对因变量的影响而展开的。
multilevelmod包支持的模型还是极为广泛的。不过如果只是作为一个模型拟合的过程,似乎没有必要用到tidymodels,毕竟一个模型的拟合所涉及到的各种自变量的问题,是一个需要循环往复的过程,模型的诊断、对比、选择、解释等等才是分析时需要大量考虑的事情。
下面仅在tidymodel的框架下复现一个线性混合模型的应用:
```{r, include=FALSE}
library(tidyverse)
library(tidymodels)
library(multilevelmod)
library(broom.mixed)
tidymodels_prefer()
theme_set(theme_bw())
```
```{r, include=FALSE}
df <- read_csv('~/OneDrive/kintor/Daily_Work/US30001_COVID/lme_data.csv')
```
```{r}
car::qqp(df$LOGCHG, 'norm')
car::qqPlot(df$LOGCHG, 'lnorm')
```
*GEE模型*
```{r}
gee_spec <-
linear_reg() %>%
set_engine("gee", corstr = "exchangeable")
```
*MME模型*
```{r}
lmer_spec <-
linear_reg() %>%
set_engine("lmer")
lmer_fit <-
lmer_spec %>%
fit(LOGCHG ~ VISIT + AGEGR1 + SEX + ACTARM + LOGBASE + (1 | SUBJID), data = df)
lmer_fit
```
`parsnip`类一般包含有模型自身的一些信息
```{r}
lmer_fit %>% extract_fit_engine() %>%
summary()
```
*上图的分布提醒我们终然log后的数据,其也呈现出非正态分布的倾向,那么对原始数据AVAL直接进行广义混合模型似乎是一件更为合理的事情 :*
`MASS::glmmPQL`是对`nlme::lme`的封装,继承`lme`类。
```{r}
car::qqp(df$CHG, 'lnorm')
car::qqp(log10(df$CHG), 'norm')
```
```{r}
PQL_m <- MASS::glmmPQL(fixed = CHG ~ VISIT + AGEGR1 + SEX + ACTARM + LOGBASE,
random = ~ 1| SUBJID,
family = gaussian(link = "log"),
data = df
)
```
```{r}
GHQ <- lme4::glmer(LOGCHG ~ VISIT + AGEGR1 + SEX + ACTARM + LOGBASE + (1 | SUBJID),
data = df,
family = gaussian(link = "log"),
nAGQ = 1
) # Set nAGQ to # of desired iterations, =1就是 Laplace approximation
```
one R structure needs to be specified for each fixed effect and one G structure needs to be specified for each random effect.
```{r}
require(MCMCglmm)
```
```{r}
plot(fitted(lme_model), residuals(lme_model), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(smooth.spline(fitted(lme_model), residuals(lme_model)))
```
## MMRM包的使用
最近发现了一个新包`MMRM`,记录下其的应用。
医学研究中的重复测量数据有其对应的考量之处:
```{r}
# MMRM
mod.gls1 <- nlme::gls(Y_comp ~ Group + Time + Group*Time,
data = dat,
correlation = corSymm(form = ~ 1 | USUBJID), # unstructured correlation
weights = varIdent(form = ~ 1 | Time), # freely estimate variance at subsequent timepoints
na.action = na.exclude)
mod.gls2 <- nlme::gls(Y_mar ~ Group + Time + Group*Time,
data = dat,
correlation = corSymm(form = ~ 1 | USUBJID), # unstructured correlation
weights = varIdent(form = ~ 1 | Time), # freely estimate variance at subsequent timepoints
na.action = na.exclude)
# 倾向于这种用法
mmrm <- gls(y~y0*factor(time)+trt*factor(time),
na.action=na.omit,
data=longData,
correlation=nlme::corSymm(form=~time | id),
weights=nlme::varIdent(form=~1|time))
summary(mmrm)
library(multcomp)
confint(glht(mmrm, linfct = matrix(c(0,0,0,0,1,0,0,0,0), nrow=1)))
confint(glht(mmrm, linfct = matrix(c(0,0,0,0,1,0,0,1,0), nrow=1)))
confint(glht(mmrm, linfct = matrix(c(0,0,0,0,1,0,0,0,1), nrow=1)))
# MMRM -
mod.us1 <- glmmTMB::glmmTMB(Y_comp ~ Group + Time + Group*Time + us(Time + 0 | USUBJID),
data=dat,
REML = T,
dispformula=~0)
mod.us2 <- glmmTMB::glmmTMB(Y_mar ~ Group + Time + Group*Time + us(Time + 0 | USUBJID),
data=dat,
REML = T,
dispformula=~0)
mod.lmer1 <- lme4::lmer(Y_comp ~ Group + Time + Group*Time + ( -1 + Time | USUBJID),
data = dat,
control = lmerControl(check.nobs.vs.nRE = 'ignore'))
mod.lmer2 <- lme4::lmer(Y_mar ~ Group + Time + Group*Time + (0 + Time | USUBJID),
data = dat,
control = lmerControl(check.nobs.vs.nRE = 'ignore'))
# OLS Regression Model
mod.ols1 <- lm(Y_comp ~ Time + Group + Time*Group,
data = dat)
mod.ols2 <- lm(Y_mar ~ Group + Time + Group*Time,
data = dat)
# Compute marginal means
library(emmeans)
mod.emms.ols1 <- emmeans(mod.ols1, pairwise ~ Group | Time, adjust = 'none', mode = 'df.error')
mod.emms.ols2 <- emmeans(mod.ols2, pairwise ~ Group | Time, adjust = 'none', mode = 'df.error')
mod.emms.gls1 <- emmeans(mod.gls1, pairwise ~ Group | Time, adjust = 'none', mode = 'satterthwaite')
mod.emms.gls2 <- emmeans(mod.gls2, pairwise ~ Group | Time, adjust = 'none', mode = 'satterthwaite')
mod.emms.us1 <- emmeans(mod.us1, pairwise ~ Group | Time, adjust = 'none', mode = 'df.error')
mod.emms.us2 <- emmeans(mod.us2, pairwise ~ Group | Time, adjust = 'none', mode = 'df.error')
# Kenward Rogers Degrees of Freedom
mod.emms.lmer1.kr <- emmeans(mod.lmer1, pairwise ~ Group | Time, adjust = 'none', mode = "kenward" )
mod.emms.lmer2.kr <- emmeans(mod.lmer2, pairwise ~ Group | Time, adjust = 'none', mode = "kenward" )
# Does not align with SAS output
# Satterthwaite Degrees of Freedom:
mod.emms.lmer1.sat <- emmeans(mod.lmer1, pairwise ~ Group | Time, adjust = 'none', mode = 'satterthwaite' )
mod.emms.lmer2.sat <- emmeans(mod.lmer2, pairwise ~ Group | Time, adjust = 'none', mode = 'satterthwaite' )
```
```{r}
library(mmrm)
```
以下公式是错误的,其公式的用法和`lme4`不同。正如其文档所说,其不支持个体间随机效应的应用。。
```{r}
# specifying an unstructured covariance matrix.
# 此公式错误
# mmrm_res <- mmrm(formula = score ~ time + class + AGE + us(1|`PatientID`),
# data = df2,
# reml = TRUE
# )
# 只考虑随机截距的模型
mmrm_res <- mmrm(formula = score ~ time + class + AGE + us(time|`PatientID`),
data = df2,
reml = TRUE
)
```
```{r}
mmrm_res
```
The summary() method then provides the coefficients table with Satterthwaite degrees of freedom as well as the covariance matrix estimate:
```{r}
summary(mmrm_res)
```
*此处有个问题需要重视:*
在以前的分析中,我们知道,time,这一变量是不适合纳入到随机截距里的,
```{r}
# ?mmrm_control
mmrm_res2 <- mmrm(formula = score ~ time * class + AGE + us(time |`PatientID`),
data = df2,
method = 'Satterthwaite'
)
```
```{r}
summary(mmrm_res2)
```
```{r}
emm_res <- emmeans::emmeans(mmrm_res2, ~ class | time)
bb <- contrast(emm_res, method = 'trt.vs.ctrl',
type = 'response', adjust = 'fdr',
by = 'time',
infer = c(TRUE, TRUE))
plot(bb) +
theme_bw() +
geom_vline(xintercept = 0, linetype = 2, color = 'red4')
```
```{r}
emm_res <- emmeans::emmeans(mmrm_res2, ~ class)
bb <- contrast(emm_res, method = 'trt.vs.ctrl',
type = 'response', adjust = 'fdr',
# by = 'time',
infer = c(TRUE, TRUE))
plot(bb) +
theme_bw() +
geom_vline(xintercept = 0, linetype = 2, color = 'red4')
```
## brms.mmrm
mixed model for repeated measures (MMRM) run by `brms`.
`brms` package is a package provides an interface to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan.
```{r}
library(brms)
library(brms.mmrm)
```