---
title: "logistic regression"
format: html
date: 2023-05-18
---
## Reorganized notes on logistic regression
> https://cscu.cornell.edu/wp-content/uploads/91_ordlogistic.pdf
> https://cscu.cornell.edu/workshop/logisic_regression_analysis
> https://stats.oarc.ucla.edu/r/faq/ologit-coefficients/
> https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/
> https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/
> https://bookdown.org/chua/ber642_advanced_regression/
> https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Re-recording the key points of logistic regression here.

First, the logistic distribution: its CDF already came up in survival analysis. When working with probabilities, it can be viewed as a two-parameter model whose maximum and minimum values are 1 and 0, respectively.

odds: $\text{odds} = p/(1-p)$

logit: $\text{logit}(p) = \log(\text{odds})$

Logistic regression assumes the dependent variable follows a Bernoulli distribution; the parameters are estimated by maximizing the likelihood (e.g., via gradient descent), which yields a binary classifier.
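A quick base-R sketch of these transforms, using only built-in `stats` functions:

```{r}
p <- seq(0.01, 0.99, by = 0.01)
odds  <- p / (1 - p)
logit <- log(odds)               # identical to qlogis(p)
all.equal(logit, qlogis(p))
curve(plogis(x), -6, 6)          # the logistic CDF is the inverse of the logit
```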
*Statistical Software: What You Need to Know:* There are several ways in which an ordinal regression model can be parameterized, and different statistical software packages use different parameterizations. Thus, great care should be taken when interpreting the output of ordinal regression models.
```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(MASS)
library(car)
```
### Risk-factor analysis, i.e., examining relationships between variables

Used to examine relationships between variables rather than to predict, the model comes with its own statistical-analysis workflow.

#### Binary outcomes

The most common use is the two-class problem.
```{r}
data(bivariate, package = 'modeldata')
head(bivariate_train)
```
```{r}
fit1 <- stats::glm(Class ~ ., data = bivariate_train, family = binomial(link = 'logit'))
summary(fit1)
```
```{r}
# compare against SPSS and JMP
dat <- haven::read_spss('~/Downloads/二分类Logistic回归【简】-SPSS教程-医咖会/Logistic回归.sav') %>%
labelled::unlabelled()
dat
fit1 <- glm(cancer ~ sex + age + BMI + COPD + smoke, data = dat,
family = binomial(link = 'logit')
)
summary(fit1)
```
```{r}
## odds ratios and 95% CI
## CIs using profiled log-likelihood
exp(cbind(OR = coef(fit1), confint(fit1)))
```
*How do we read risk factors off these results?* Start with each variable's OR, interpreted per unit change in the regressor. Note that while R reports it, the odds ratio for the intercept is not generally interpreted.
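As a sketch of the per-unit-change reading (using `age`, one of the fitted regressors):

```{r}
exp(coef(fit1)['age'])        # odds ratio for a one-unit increase in age
exp(10 * coef(fit1)['age'])   # odds ratio for a ten-unit increase
```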
*We can test the overall effect of a multi-level variable with the `wald.test` function from the aod package:*
```{r}
# Wald test for a single term; Terms indexes the coefficients, intercept first
aod::wald.test(b = coef(fit1), Sigma = vcov(fit1), Terms = 3)
```
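To test every term in one pass, one option (a sketch) is to loop `aod::wald.test` over the coefficient indices; `car::Anova(..., test.statistic = 'Wald')`, shown further below, is the more convenient route.

```{r}
# loop aod::wald.test over each non-intercept coefficient
walds <- lapply(seq_along(coef(fit1))[-1], function(k) {
  aod::wald.test(b = coef(fit1), Sigma = vcov(fit1), Terms = k)
})
names(walds) <- names(coef(fit1))[-1]
walds
```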
(From the UCLA worked example: a chi-squared test statistic of 20.9 on three degrees of freedom has p = 0.00011, indicating that the overall effect of rank is statistically significant.)
*Likelihood ratio tests:*
```{r}
# Whole model test
lmtest::lrtest(fit1)
# logLik(fit1)
# Effect Likelihood Ratio Test
# for every term
# same as SAS/JMP
car::Anova(fit1, type = 'III', test.statistic='LR')
```
The likelihood ratio test is based on −2LL. It tests the significance of the difference between −2LL for the researcher's model with predictors (the model chi-square) and −2LL for the baseline model containing only a constant.
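The same statistic can be reproduced by hand from the two log-likelihoods (a sketch; `fit_null` is just the intercept-only refit):

```{r}
fit_null <- update(fit1, . ~ 1)                    # baseline: constant only
lr_stat  <- as.numeric(2 * (logLik(fit1) - logLik(fit_null)))
df_diff  <- attr(logLik(fit1), "df") - attr(logLik(fit_null), "df")
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # matches lmtest::lrtest(fit1) above
```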
*anova: analysis-of-deviance table*
```{r}
anova(fit1, test="Chisq")
car::Anova(fit1, type='III',test.statistic='Wald')
```
The p-value *anova* reports for each variable comes from the drop in deviance as that variable is added: relative to the null model, the larger the drop the better. The difference between the null deviance and the residual deviance shows how the model fares against the null model (intercept only); the wider this gap, the better. Reading down the table shows the deviance falling as each variable is added in turn. A large p-value indicates that the model without the variable explains roughly the same amount of variation.
```{r}
# Box–Tidwell check for linearity of the logit. Note car::boxTidwell() expects
# a formula plus data rather than a fitted glm; a common workaround is to add
# x*log(x) terms and test their significance (shown here for age).
glm(cancer ~ age + I(age * log(age)), data = dat, family = binomial) |> summary()
```
*The easystats ecosystem provides a set of convenient helper functions:*
```{r}
parameters(fit1, exponentiate = TRUE)
```
```{r}
performance::performance(fit1)
pscl::pR2(fit1)
```
Unlike linear regression, logistic regression has no single definitive R²; Tjur's R² or McFadden's R² can be used for interpretation instead.
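Both can be computed by hand as a cross-check (a sketch; the Tjur line assumes the second level of `cancer` codes the event, as `glm` does):

```{r}
# McFadden: 1 - LL(full) / LL(null)
fit_null <- update(fit1, . ~ 1)
1 - as.numeric(logLik(fit1)) / as.numeric(logLik(fit_null))
# Tjur: difference in mean fitted probability between the observed classes
tapply(fitted(fit1), dat$cancer, mean) |> diff()
```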
*Model evaluation and prediction:*
```{r}
# ?predict.glm
# the default type is 'link': log-odds (probabilities on the logit scale)
predict(fit1) |> head()
# type = 'response' returns probabilities P(y=1|X).
# With a decision boundary of 0.5: y = 1 if P(y=1|X) > 0.5, otherwise y = 0.
p <- predict(fit1, type = 'response')
p |> head()
```
```{r}
broom::augment(fit1,type.predict='response')
```
*Matthews correlation coefficient (MCC, the phi coefficient):*
```{r}
Logistic_Predictions <- predict(fit1, type = 'response')
# assign the second factor level (the event glm models) when p > 0.5
pred_class <- ifelse(Logistic_Predictions > 0.5,
                     levels(dat$cancer)[2], levels(dat$cancer)[1])
con_Logistic <- caret::confusionMatrix(factor(pred_class, levels = levels(dat$cancer)),
                                       dat$cancer)
yardstick::mcc(con_Logistic$table)
```
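As a cross-check, MCC can also be computed directly from the 2×2 cell counts. The `tp`/`tn` labels below assume caret's layout (rows = prediction, columns = reference) with the second level as the event; the phi coefficient itself is symmetric in that choice, so the value is unaffected.

```{r}
tab <- con_Logistic$table
tp <- tab[2, 2]; tn <- tab[1, 1]; fp <- tab[2, 1]; fn <- tab[1, 2]
(tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```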
*Variable importance:*
```{r}
vip::vi(fit1)
caret::varImp(fit1)
```
Higher values indicate more importance.
*Tabulate the regression results:*
```{r}
gtsummary::tbl_regression(fit1, exponentiate = TRUE)
```
*ROC*
```{r}
# this package feels less convenient than pROC
require(ROCR)
pr <- prediction(p, dat$cancer, label.ordering=c('非肺癌', '肺癌'))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```
*ROC plot by pROC*
```{r}
# library(pROC)
roc1 <- pROC::roc(dat$cancer, p,
percent = TRUE,
# arguments for auc
# partial.auc = c(100, 90), partial.auc.correct = TRUE,
# partial.auc.focus = "sens",
# arguments for ci
ci = TRUE, boot.n = 100, ci.alpha = 0.9, stratified = FALSE,
# arguments for plot
plot = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE,
print.auc = TRUE, show.thres = TRUE
)
```
```{r}
coords(roc1, "best", ret=c("threshold", "specificity", "1-npv"))
coords(roc1, "local maximas", ret=c("threshold", "sens", "spec", "ppv", "npv"))
```
#### Multinomial Logistic Regression

Besides binary outcomes, logistic regression also handles multi-class problems; the notes here cover the unordered (nominal) case.

Other common approaches to multi-class problems include `log-linear` models and `Discriminant Analysis`, along with a range of machine-learning methods. Detailed notes on `log-linear` models live in `log_linear_model.qmd`; the discriminant-analysis script is in `CA_DA.Rmd`.
```{r}
require(nnet)
```
In nominal multinomial logistic regression, mind how the reference level of the outcome is set, i.e., which level the other levels are compared against.
```{r}
# treat this dataset as nominal multi-class here, even though it is actually ordinal
d <- haven::read_sav('~/Downloads/有序多分类Logistic回归【详】-SPSS教程-医咖会/26 有序Logistic回归.sav') %>%
labelled::unlabelled()
glimpse(d)
```
```{r}
fit_norminal <- nnet::multinom(formula = patient_satisfaction ~ ., data = d, model = TRUE)
summary(fit_norminal)
```
```{r}
# extract the coefficients from the model and exponentiate
exp(coef(fit_norminal))
```
```{r}
# JMP or SAS produce LR & Wald, here we only show the LR
car::Anova(fit_norminal, type = 'III')
```
*Calculate the goodness of fit:*
```{r}
# a chi-squared test of observed vs predicted classes; not ideal for multi-class problems
chisq.test(d$patient_satisfaction,predict(fit_norminal))
```
*Calculate the pseudo R-square:*
```{r}
# DescTools::PseudoR2 occasionally throws a puzzling error with nnet::multinom fits
# (it can fail to recognise the object format)
DescTools::PseudoR2(fit_norminal, which = c("CoxSnell","Nagelkerke","McFadden"))
```
Interpretation of the pseudo R-squares:

These are three pseudo R-squared values. Logistic regression has no equivalent of the R-squared found in OLS regression, although many substitutes have been proposed. These statistics do not mean what R-squared means in OLS regression (the proportion of variance of the response variable explained by the predictors), so interpret them with great caution.

Cox and Snell's R-square imitates the multiple R-square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. Here it indicates a relationship of 31% between the dependent variable and the independent variables; equivalently, 31% of the variation in the dependent variable is explained by the logistic model.

The Nagelkerke modification, which does range from 0 to 1, is a more reliable measure of the relationship and is normally higher than the Cox and Snell measure. In our case it is 0.357, indicating a relationship of 35.7% between the predictors and the prediction.

McFadden = {LL(null) − LL(full)} / LL(null). In our case it is 0.182, indicating a relationship of 18.2% between the predictors and the prediction.
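The McFadden value can be verified by hand from the two log-likelihoods (a sketch; `trace = FALSE` just silences the optimizer's progress output):

```{r}
fit_null_m <- nnet::multinom(patient_satisfaction ~ 1, data = d, trace = FALSE)
1 - as.numeric(logLik(fit_norminal)) / as.numeric(logLik(fit_null_m))
```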
*Likelihood ratio tests*

Whole-model test: this matches the whole-model test reported by JMP/SPSS.
```{r}
lmtest::lrtest(fit_norminal)
```
```{r}
lmtest::lrtest(fit_norminal, 'gender')
```
The results of the likelihood ratio tests can be used to ascertain the significance of predictors to the model.
These likelihood statistics can be seen as sorts of overall statistics that tell us which predictors significantly enable us to predict the outcome category, but they don’t really tell us specifically what the effect is. To see this we have to look at the individual parameter estimates.
```{r}
# we can also use the performance package
performance::performance(fit_norminal)
```
*Calculate p-values from z-values*
```{r}
# Calculate z-values
zvalues <- summary(fit_norminal)$coefficients / summary(fit_norminal)$standard.errors
# Remember that the normal distribution doesn’t depend on degrees of freedom.
pnorm(abs(zvalues), lower.tail=FALSE)*2
# 2-tailed z test
# (1 - pnorm(abs(zvalues), 0, 1)) * 2
```
*Build a classification table*
```{r}
# a simple classification (confusion) table of observed vs predicted classes
ctable <- table(d$patient_satisfaction, predict(fit_norminal))
ctable
```
*ROC curves:*

In multi-class problems, TPR and FPR are only defined once the task is reduced to binary comparisons. There are two reduction schemes:

- the One-vs-Rest scheme compares each class against all the others (pooled as one);
- the One-vs-One scheme compares every unique pairwise combination of classes.

Accuracy and Kappa use the same definitions as their binary counterparts, and the Matthews correlation coefficient (MCC) has a known multiclass generalization as well, sometimes called the $R_K$ statistic.

`Macro averaging` reduces the multiclass predictions to multiple sets of binary predictions, calculates the corresponding metric for each binary case, and then averages the results; it reduces the problem to multiple one-vs-all comparisons.

`Micro averaging` treats the entire data set as one aggregate result and calculates a single metric rather than k metrics that get averaged together.
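A minimal by-hand sketch of macro vs micro averaging for recall, built from the classification table used above:

```{r}
cm <- table(truth = d$patient_satisfaction, pred = predict(fit_norminal))
recall_k <- diag(cm) / rowSums(cm)   # one-vs-rest recall for each class
mean(recall_k)                       # macro average: mean of the k metrics
sum(diag(cm)) / sum(cm)              # micro average: one aggregate metric
```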
*One-vs-One*
```{r}
library(pROC)
# same as fitted(fit_norminal)
p <- predict(fit_norminal, type = 'probs')
roc.multi <- pROC::multiclass.roc(d$patient_satisfaction, p,
plot = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE,
print.auc = TRUE, show.thres = TRUE,
# add = TRUE
)
auc(roc.multi)
```
```{r}
# individual ROC curve info for each pair of classes:
# the four classes give choose(4, 2) = 6 unordered pairs, and each pair is
# evaluated in both directions (control and case swapped), i.e. 12 curves
roc.multi['rocs'] |> lengths()
```
```{r}
pROC::multiclass.roc(d$patient_satisfaction, p, levels=c('不满意', '非常满意'),
plot=T,
direction='>',
print.auc = TRUE, show.thres = TRUE
)
```
```{r}
# plot the first direction of each of the 6 pairwise ROC curves
rs <- roc.multi[['rocs']]
plot.roc(rs[[1]][[1]])
sapply(2:length(rs),function(i) lines.roc(rs[[i]][[1]],col=i))
```
*One-vs-Rest multiclass ROC*
In each step, a given class is regarded as the positive class and the remaining classes are pooled together as the negative class.
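For a single class this is just an ordinary binary ROC; a sketch for '非常满意' against the rest, reusing the probability matrix `p` from above:

```{r}
y_ovr <- factor(d$patient_satisfaction == '非常满意')   # TRUE = positive class
pROC::roc(y_ovr, p[, '非常满意'], plot = TRUE, print.auc = TRUE)
```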
```{r}
# implementing one-vs-rest for all classes in R takes a little work: in
# principle the predictions are re-scored once per class
library(multiROC)
# multiROC automates this, although its input format is somewhat inconvenient
```
```{r}
p1 <- as.data.frame(p)
p1$patient_satisfaction <- d$patient_satisfaction
p2 <- fastDummies::dummy_cols(p1, select_columns = 'patient_satisfaction') %>%
dplyr::select(starts_with('patient_satisfaction'), everything()) %>%
dplyr::select(-patient_satisfaction) %>%
dplyr::rename('不满意_true'='patient_satisfaction_不满意',
'一般_true'='patient_satisfaction_一般',
'满意_true'='patient_satisfaction_满意',
'非常满意_true'='patient_satisfaction_非常满意',
'不满意_pred_multinom'='不满意',
'一般_pred_multinom'='一般',
'满意_pred_multinom'='满意',
'非常满意_pred_multinom'='非常满意'
)
res <- multiROC::multi_roc(p2, force_diag=T)
# same as JMP
unlist(res$AUC)
```
```{r}
n_method <- length(unique(res$Methods))
n_group <- length(unique(res$Groups))
res_df <- data.frame(Specificity = numeric(0), Sensitivity = numeric(0),
Group = character(0), AUC = numeric(0), Method = character(0))
for (i in 1:n_method) {
  for (j in 1:n_group) {
    temp_data_1 <- data.frame(
      Specificity = res$Specificity[[i]][j],
      Sensitivity = res$Sensitivity[[i]][j],
      Group = unique(res$Groups)[j],
      AUC = res$AUC[[i]][j],
      Method = unique(res$Methods)[i]
    )
    colnames(temp_data_1) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
    res_df <- rbind(res_df, temp_data_1)
  }
  temp_data_2 <- data.frame(
    Specificity = res$Specificity[[i]][n_group + 1],
    Sensitivity = res$Sensitivity[[i]][n_group + 1],
    Group = "Macro",
    AUC = res$AUC[[i]][n_group + 1],
    Method = unique(res$Methods)[i]
  )
  temp_data_3 <- data.frame(
    Specificity = res$Specificity[[i]][n_group + 2],
    Sensitivity = res$Sensitivity[[i]][n_group + 2],
    Group = "Micro",
    AUC = res$AUC[[i]][n_group + 2],
    Method = unique(res$Methods)[i]
  )
  colnames(temp_data_2) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
  colnames(temp_data_3) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
  res_df <- rbind(res_df, temp_data_2)
  res_df <- rbind(res_df, temp_data_3)
}
ggplot2::ggplot(res_df, ggplot2::aes(x = 1 - Specificity, y = Sensitivity)) +
ggplot2::geom_path(ggplot2::aes(color = Group, linetype = Method)) +
ggplot2::geom_segment(ggplot2::aes(x = 0, y = 0, xend = 1, yend = 1), colour = "grey", linetype = "dotdash") +
ggplot2::theme_bw() +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5), legend.justification = c(1, 0),
legend.position = c(.95, .05), legend.title = ggplot2::element_blank(),
legend.background = ggplot2::element_rect(fill = NULL, size = 0.5, linetype = "solid",
colour = "black"))
```
```{r}
library(yardstick)
```
```{r}
p1$predict <- predict(fit_norminal, type = 'class')
# overall classification accuracy
yardstick::accuracy(p1, patient_satisfaction, predict)
```
```{r}
# multiclass AUC via the method of Hand & Till (2001)
yardstick::roc_auc(data = p1,
truth = patient_satisfaction,
`不满意`:`非常满意`
)
# how to get a per-class AUC, i.e., the area under each curve in the figure below?
```
```{r}
# If a multiclass truth column is provided, a one-vs-all approach will be taken to calculate multiple curves, one per level.
# p1 %>% roc_curve(truth = patient_satisfaction,
# `不满意`:`非常满意`
# ) %>%
# autoplot()
# a prettier alternative to autoplot()
roc_curve(p1, patient_satisfaction, `不满意`:`非常满意`) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
```
```{r}
yardstick::lift_curve(p1, truth = patient_satisfaction,
`不满意`:`非常满意`
) %>%
autoplot()
```
#### Ordinal (ordered multi-class) outcomes

Ordinal logistic regression rests on four assumptions:

- Assumption 1: there is a single dependent variable and it is ordinal, e.g., blood pressure graded high/medium/low, or treatment outcome graded cured/effective/ineffective.
- Assumption 2: there are one or more independent variables, which may be continuous, ordinal, or nominal.
- Assumption 3: the independent variables show no multicollinearity.
- Assumption 4: the model satisfies the proportional odds assumption: no matter where the outcome is split, each predictor's effect on the outcome is the same, i.e., the regression coefficients do not depend on the cut point. Ordinal logistic regression works by splitting the ordered outcome into a series of binary logistic regressions. With the four satisfaction levels in this example, the split gives three binary comparisons, (0 vs 1+2+3), (0+1 vs 2+3), and (0+1+2 vs 3), each contrasting the lower levels against the higher ones. The model assumes these binary regressions share the same slope coefficients and differ only in their intercepts, and accordingly reports a single set of slopes. This equal-slopes assumption (the proportional odds assumption, checked with the so-called parallel lines test) must therefore be verified; if it fails, fall back to nominal multinomial logistic regression.

Fitting an ordinal logistic regression in software is fairly easy; interpreting the results is the fiddly part, which makes it well worth writing up.
Ordinal logistic regression models use a cumulative logit parameterization, which is why the output shows an Estimate for every level except the reference, e.g., the Threshold values in SPSS.
*Different software parameterizes differently:* SPSS and `MASS::polr` use the first parameterization: the slope parameters they report are positive and must be given a minus sign when the equations are written out. JMP and SAS use the second, so their slopes already carry the minus sign; JMP also codes categorical variables with Effect rather than Dummy coding. In R, `VGAM::vglm` and `rms::lrm` implement the second and third parameterizations, respectively; see reference link 1 above for details.

R, Stata, SPSS, and SAS (using proc genmod) use dummy coding, while JMP and SAS (using proc logistic) use effect coding (see Statnews #72 for more information on these two coding schemes). Both R and Stata use the first level alphanumerically as the reference level, whereas SAS, JMP, and SPSS use the last level. It is, however, possible to customize the reference level in each of these programs. In short, JMP encodes categorical variables differently from SPSS and R.
```{r}
library(ordinal)
```
```{r}
library(labelled)
dd <- haven::read_sav('~/Downloads/有序多分类Logistic回归【详】-SPSS教程-医咖会/26 有序Logistic回归.sav') %>%
labelled::unlabelled()
glimpse(dd)
dd <- dd %>% mutate(gender = factor(gender, levels = c('女性','男性')),
treatment = fct_rev(treatment),
patient_satisfaction = factor(patient_satisfaction,
levels = c('不满意','一般','满意','非常满意'),
ordered = TRUE
)
)
paint::paint(dd)
```
```{r}
# The command name comes from proportional odds logistic regression, highlighting the proportional odds assumption in our model.
fit_ordinal <- MASS::polr(formula = patient_satisfaction ~ ., data = dd, Hess = T)
# same as SPSS
summary(fit_ordinal)
```
*Interpretation:* Call is R reminding us what type of model we ran and what options we specified.

Next is the usual regression coefficient table with each coefficient's value, standard error, and t value, the latter being simply the ratio of the coefficient to its standard error. No significance test is given by default.

Then come the estimates for the intercepts, sometimes called cutpoints. They indicate where the (continuous) latent variable is cut to produce the ordered groups observed in the data; with four satisfaction levels there are three of them. They are generally not used when interpreting the results, and are closely related to the thresholds reported by other statistical packages.

Finally we see the residual deviance (−2 × the log-likelihood of the model) and the AIC; both are useful for model comparison.
For males, the odds of reporting higher treatment satisfaction are 1.976 times those of females (95% CI: 1.112–3.511), χ² = 5.389, P = 0.020.

Given these results and the parameterization SPSS uses, three cumulative-logit equations follow:

$\ln\frac{P(Y \le 0)}{P(Y > 0)} = 9.175 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

$\ln\frac{P(Y \le 1)}{P(Y > 1)} = 10.998 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

$\ln\frac{P(Y \le 2)}{P(Y > 2)} = 13.984 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

Note that in the SPSS equations every slope is the raw value from the Parameter Estimates table with a minus sign attached; only the intercepts are used as reported.
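This sign convention can be read straight off the polr object (a sketch): `zeta` holds the cutpoints, and the slopes are negated when writing the SPSS-style equations.

```{r}
fit_ordinal$zeta      # the cutpoints, i.e. SPSS's "Threshold" estimates
-coef(fit_ordinal)    # the slopes as they appear in the equations above
```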
```{r}
(ci <- confint(fit_ordinal)) # default method gives profiled CIs
```
```{r}
## OR and CI
exp(cbind(OR = coef(fit_ordinal), ci))
```
These coefficients are called proportional odds ratios and we would interpret these pretty much as we would odds ratios from a binary logistic regression.
```{r}
car::Anova(fit_ordinal, type='III')
```
*Goodness of fit*
```{r}
chisq.test(dd$patient_satisfaction, predict(fit_ordinal))
```
```{r}
lmtest::lrtest(fit_ordinal)
```
*Calculating the pseudo R-square (measuring strength of association)*
```{r}
DescTools::PseudoR2(fit_ordinal, which = c("CoxSnell","Nagelkerke","McFadden"))
```
They are not as useful as the R-squared statistic in multiple regression, since their interpretation is not straightforward.
*test proportional odds assumption*
```{r}
# Load the library
library(brant)
# Run the Brant test on the model
brant(fit_ordinal)
```
Each of the probabilities refers to the null hypothesis that the parallel regression assumption holds, or that the value of the coefficients don’t differ across different cutpoints in the outcome variable.
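An informal complement to the Brant test (a sketch): fit each cumulative binary split as its own `glm` and eyeball whether the slope estimates are roughly equal across the three columns.

```{r}
sapply(1:3, function(k) {
  dk <- dd
  dk$y_bin <- as.integer(unclass(dk$patient_satisfaction) > k)  # split: Y > cut k
  coef(glm(y_bin ~ . - patient_satisfaction, data = dk, family = binomial))
})
```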
*ROC curve:*

The same issues arise here as in the nominal multi-class case; using the `rms` package directly is a recommended alternative.
```{r}
library(pROC)
p <- predict(fit_ordinal, type = 'probs')
roc.ordi <- pROC::multiclass.roc(dd$patient_satisfaction, p,
plot = TRUE,
print.auc = TRUE, show.thres = TRUE
)
auc(roc.ordi)
```
```{r}
# rebuild the probability data frame from the ordinal fit; p1 above still
# holds the multinomial model's probabilities
p1 <- as.data.frame(predict(fit_ordinal, type = 'probs'))
p1$patient_satisfaction <- dd$patient_satisfaction
yardstick::roc_curve(p1, patient_satisfaction, `不满意`:`非常满意`) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()
```
### Analysis as a prediction model

The two analyses below aim chiefly at building predictive models.
```{r}
library(tidymodels)
library(Hmisc)
library(rms)
```
#### As a clinical prediction model

See `clinical_prediction_model.Rmd` for the details.

#### As a machine-learning prediction model

Tunable hyperparameters of (penalized) logistic regression: `mixture = tune()`, `penalty = tune()`.
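A minimal parsnip specification, as a sketch (the glmnet engine is assumed; tidymodels is loaded above):

```{r}
# penalty is the total regularization amount; mixture blends lasso (1) and ridge (0)
logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
```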