---
title: "logistic regression"
format: html
date: 2023-05-18
---
## Reorganized notes on logistic regression
> https://cscu.cornell.edu/wp-content/uploads/91_ordlogistic.pdf
> https://cscu.cornell.edu/workshop/logisic_regression_analysis
> https://stats.oarc.ucla.edu/r/faq/ologit-coefficients/
> https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/
> https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/
> https://bookdown.org/chua/ber642_advanced_regression/
> https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Re-recording the key points of logistic regression here.

First, the logistic distribution: its CDF already came up in survival analysis. When working with probabilities, it can be viewed as a two-parameter model whose maximum and minimum values are 1 and 0, respectively.

odds: $\text{odds} = p/(1-p)$

logit: $\text{logit}(p) = \log(\text{odds})$

Logistic regression assumes the dependent variable follows a Bernoulli distribution; the parameters are estimated by maximizing the likelihood (e.g., via gradient descent), which yields a binary classifier.
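A quick base-R sketch of these transforms, using only built-in `stats` functions:

```{r}
p <- seq(0.01, 0.99, by = 0.01)
odds  <- p / (1 - p)
logit <- log(odds)               # identical to qlogis(p)
all.equal(logit, qlogis(p))
curve(plogis(x), -6, 6)          # the logistic CDF is the inverse of the logit
```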
*Statistical Software: What You Need to Know:* There are several ways in which an ordinal regression model can be parameterized, and different statistical software packages use different parameterizations. Thus, great care should be taken when interpreting the output of ordinal regression models.
```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(MASS)
library(car)
```
### Risk-factor analysis, i.e., examining relationships between variables

Used to examine relationships between variables rather than to predict, the model comes with its own statistical-analysis workflow.

#### Binary outcomes

The most common use is the two-class problem.
```{r}
data(bivariate, package = 'modeldata')
head(bivariate_train)
```
```{r}
fit1 <- stats::glm(Class ~ ., data = bivariate_train, family = binomial(link = 'logit'))
summary(fit1)
```
```{r}
# compare against SPSS and JMP
dat <- haven::read_spss('~/Downloads/二分类Logistic回归【简】-SPSS教程-医咖会/Logistic回归.sav') %>%
labelled::unlabelled()
dat
fit1 <- glm(cancer ~ sex + age + BMI + COPD + smoke, data = dat,
family = binomial(link = 'logit')
)
summary(fit1)
```
```{r}
## odds ratios and 95% CI
## CIs using profiled log-likelihood
exp(cbind(OR = coef(fit1), confint(fit1)))
```
*How do we read risk factors off these results?* Start with each variable's OR, interpreted per unit change in the regressor. Note that while R reports it, the odds ratio for the intercept is not generally interpreted.
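As a sketch of the per-unit-change reading (using `age`, one of the fitted regressors):

```{r}
exp(coef(fit1)['age'])        # odds ratio for a one-unit increase in age
exp(10 * coef(fit1)['age'])   # odds ratio for a ten-unit increase
```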
*We can test the overall effect of a multi-level variable with the `wald.test` function from the aod package:*
```{r}
# Wald test for a single term; Terms indexes the coefficients, intercept first
aod::wald.test(b = coef(fit1), Sigma = vcov(fit1), Terms = 3)
```
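To test every term in one pass, one option (a sketch) is to loop `aod::wald.test` over the coefficient indices; `car::Anova(..., test.statistic = 'Wald')`, shown further below, is the more convenient route.

```{r}
# loop aod::wald.test over each non-intercept coefficient
walds <- lapply(seq_along(coef(fit1))[-1], function(k) {
  aod::wald.test(b = coef(fit1), Sigma = vcov(fit1), Terms = k)
})
names(walds) <- names(coef(fit1))[-1]
walds
```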
(From the UCLA worked example: a chi-squared test statistic of 20.9 on three degrees of freedom has p = 0.00011, indicating that the overall effect of rank is statistically significant.)
*Likelihood ratio tests:*
```{r}
# Whole model test
lmtest::lrtest(fit1)
# logLik(fit1)
# Effect Likelihood Ratio Test
# for every term
# same as SAS/JMP
car::Anova(fit1, type = 'III', test.statistic='LR')
```
The likelihood ratio test is based on −2LL. It tests the significance of the difference between −2LL for the researcher's model with predictors (the model chi-square) and −2LL for the baseline model containing only a constant.
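The same statistic can be reproduced by hand from the two log-likelihoods (a sketch; `fit_null` is just the intercept-only refit):

```{r}
fit_null <- update(fit1, . ~ 1)                    # baseline: constant only
lr_stat  <- as.numeric(2 * (logLik(fit1) - logLik(fit_null)))
df_diff  <- attr(logLik(fit1), "df") - attr(logLik(fit_null), "df")
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # matches lmtest::lrtest(fit1) above
```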
*anova: analysis-of-deviance table*
```{r}
anova(fit1, test="Chisq")
car::Anova(fit1, type='III',test.statistic='Wald')
```
The p-value *anova* reports for each variable comes from the drop in deviance as that variable is added: relative to the null model, the larger the drop the better. The difference between the null deviance and the residual deviance shows how the model fares against the null model (intercept only); the wider this gap, the better. Reading down the table shows the deviance falling as each variable is added in turn. A large p-value indicates that the model without the variable explains roughly the same amount of variation.
```{r}
# Box–Tidwell check for linearity of the logit. Note car::boxTidwell() expects
# a formula plus data rather than a fitted glm; a common workaround is to add
# x*log(x) terms and test their significance (shown here for age).
glm(cancer ~ age + I(age * log(age)), data = dat, family = binomial) |> summary()
```
*The easystats ecosystem provides a set of convenient helper functions:*
```{r}
parameters(fit1, exponentiate = TRUE)
```
```{r}
performance::performance(fit1)
pscl::pR2(fit1)
```
Unlike linear regression, logistic regression has no single definitive R²; Tjur's R² or McFadden's R² can be used for interpretation instead.
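Both can be computed by hand as a cross-check (a sketch; the Tjur line assumes the second level of `cancer` codes the event, as `glm` does):

```{r}
# McFadden: 1 - LL(full) / LL(null)
fit_null <- update(fit1, . ~ 1)
1 - as.numeric(logLik(fit1)) / as.numeric(logLik(fit_null))
# Tjur: difference in mean fitted probability between the observed classes
tapply(fitted(fit1), dat$cancer, mean) |> diff()
```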
*Model evaluation and prediction:*
```{r}
# ?predict.glm
# the default type is 'link': log-odds (probabilities on the logit scale)
predict(fit1) |> head()
# type = 'response' returns probabilities P(y=1|X).
# With a decision boundary of 0.5: y = 1 if P(y=1|X) > 0.5, otherwise y = 0.
p <- predict(fit1, type = 'response')
p |> head()
```
```{r}
broom::augment(fit1,type.predict='response')
```
*Matthews correlation coefficient (MCC, the phi coefficient):*
```{r}
Logistic_Predictions <- predict(fit1, type = 'response')
# assign the second factor level (the event glm models) when p > 0.5
pred_class <- ifelse(Logistic_Predictions > 0.5,
                     levels(dat$cancer)[2], levels(dat$cancer)[1])
con_Logistic <- caret::confusionMatrix(factor(pred_class, levels = levels(dat$cancer)),
                                       dat$cancer)
yardstick::mcc(con_Logistic$table)
```
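As a cross-check, MCC can also be computed directly from the 2×2 cell counts. The `tp`/`tn` labels below assume caret's layout (rows = prediction, columns = reference) with the second level as the event; the phi coefficient itself is symmetric in that choice, so the value is unaffected.

```{r}
tab <- con_Logistic$table
tp <- tab[2, 2]; tn <- tab[1, 1]; fp <- tab[2, 1]; fn <- tab[1, 2]
(tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```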
*Variable importance:*
```{r}
vip::vi(fit1)
caret::varImp(fit1)
```
Higher values indicate more importance.
*Tabulate the regression results:*
```{r}
gtsummary::tbl_regression(fit1, exponentiate = TRUE)
```
*ROC*
```{r}
# this package feels less convenient than pROC
require(ROCR)
pr <- prediction(p, dat$cancer, label.ordering=c('非肺癌', '肺癌'))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```
*ROC plot by pROC*
```{r}
# library(pROC)
roc1 <- pROC::roc(dat$cancer, p,
percent = TRUE,
# arguments for auc
# partial.auc = c(100, 90), partial.auc.correct = TRUE,
# partial.auc.focus = "sens",
# arguments for ci
ci = TRUE, boot.n = 100, ci.alpha = 0.9, stratified = FALSE,
# arguments for plot
plot = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE,
print.auc = TRUE, show.thres = TRUE
)
```
```{r}
coords(roc1, "best", ret=c("threshold", "specificity", "1-npv"))
coords(roc1, "local maximas", ret=c("threshold", "sens", "spec", "ppv", "npv"))
```
#### Multinomial Logistic Regression

Besides binary outcomes, logistic regression also handles multi-class problems; the notes here cover the unordered (nominal) case.

Other common approaches to multi-class problems include `log-linear` models and `Discriminant Analysis`, along with a range of machine-learning methods. Detailed notes on `log-linear` models live in `log_linear_model.qmd`; the discriminant-analysis script is in `CA_DA.Rmd`.
```{r}
require(nnet)
```
In nominal multinomial logistic regression, mind how the reference level of the outcome is set, i.e., which level the other levels are compared against.
```{r}
# treat this dataset as nominal multi-class here, even though it is actually ordinal
d <- haven::read_sav('~/Downloads/有序多分类Logistic回归【详】-SPSS教程-医咖会/26 有序Logistic回归.sav') %>%
labelled::unlabelled()
glimpse(d)
```
```{r}
fit_norminal <- nnet::multinom(formula = patient_satisfaction ~ ., data = d, model = TRUE)
summary(fit_norminal)
```
```{r}
# extract the coefficients from the model and exponentiate
exp(coef(fit_norminal))
```
```{r}
# JMP or SAS produce LR & Wald, here we only show the LR
car::Anova(fit_norminal, type = 'III')
```
*Calculate the goodness of fit:*
```{r}
# a chi-squared test of observed vs predicted classes; not ideal for multi-class problems
chisq.test(d$patient_satisfaction,predict(fit_norminal))
```
*Calculate the pseudo R-square:*
```{r}
# DescTools::PseudoR2 occasionally throws a puzzling error with nnet::multinom fits
# (it can fail to recognise the object format)
DescTools::PseudoR2(fit_norminal, which = c("CoxSnell","Nagelkerke","McFadden"))
```
Interpretation of the pseudo R-squares:

These are three pseudo R-squared values. Logistic regression has no equivalent of the R-squared found in OLS regression, although many substitutes have been proposed. These statistics do not mean what R-squared means in OLS regression (the proportion of variance of the response variable explained by the predictors), so interpret them with great caution.

Cox and Snell's R-square imitates the multiple R-square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. Here it indicates a relationship of 31% between the dependent variable and the independent variables; equivalently, 31% of the variation in the dependent variable is explained by the logistic model.

The Nagelkerke modification, which does range from 0 to 1, is a more reliable measure of the relationship and is normally higher than the Cox and Snell measure. In our case it is 0.357, indicating a relationship of 35.7% between the predictors and the prediction.

McFadden = {LL(null) − LL(full)} / LL(null). In our case it is 0.182, indicating a relationship of 18.2% between the predictors and the prediction.
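The McFadden value can be verified by hand from the two log-likelihoods (a sketch; `trace = FALSE` just silences the optimizer's progress output):

```{r}
fit_null_m <- nnet::multinom(patient_satisfaction ~ 1, data = d, trace = FALSE)
1 - as.numeric(logLik(fit_norminal)) / as.numeric(logLik(fit_null_m))
```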
*Likelihood ratio tests*

Whole-model test: this matches the whole-model test reported by JMP/SPSS.
```{r}
lmtest::lrtest(fit_norminal)
```
```{r}
lmtest::lrtest(fit_norminal, 'gender')
```
The results of the likelihood ratio tests can be used to ascertain the significance of predictors to the model.
These likelihood statistics can be seen as sorts of overall statistics that tell us which predictors significantly enable us to predict the outcome category, but they don’t really tell us specifically what the effect is. To see this we have to look at the individual parameter estimates.
```{r}
# we can also use the performance package
performance::performance(fit_norminal)
```
*Calculate p-values from z-values*
```{r}
# Calculate z-values
zvalues <- summary(fit_norminal)$coefficients / summary(fit_norminal)$standard.errors
# Remember that the normal distribution doesn’t depend on degrees of freedom.
pnorm(abs(zvalues), lower.tail=FALSE)*2
# 2-tailed z test
# (1 - pnorm(abs(zvalues), 0, 1)) * 2
```
*Build a classification table*
```{r}
# a simple classification (confusion) table of observed vs predicted classes
ctable <- table(d$patient_satisfaction, predict(fit_norminal))
ctable
```
*ROC curves:*

In multi-class problems, TPR and FPR are only defined once the task is reduced to binary comparisons. There are two reduction schemes:

- the One-vs-Rest scheme compares each class against all the others (pooled as one);
- the One-vs-One scheme compares every unique pairwise combination of classes.

Accuracy and Kappa use the same definitions as their binary counterparts, and the Matthews correlation coefficient (MCC) has a known multiclass generalization as well, sometimes called the $R_K$ statistic.

`Macro averaging` reduces the multiclass predictions to multiple sets of binary predictions, calculates the corresponding metric for each binary case, and then averages the results; it reduces the problem to multiple one-vs-all comparisons.

`Micro averaging` treats the entire data set as one aggregate result and calculates a single metric rather than k metrics that get averaged together.
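A minimal by-hand sketch of macro vs micro averaging for recall, built from the classification table used above:

```{r}
cm <- table(truth = d$patient_satisfaction, pred = predict(fit_norminal))
recall_k <- diag(cm) / rowSums(cm)   # one-vs-rest recall for each class
mean(recall_k)                       # macro average: mean of the k metrics
sum(diag(cm)) / sum(cm)              # micro average: one aggregate metric
```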
*One-vs-One*
```{r}
library(pROC)
# same as fitted(fit_norminal)
p <- predict(fit_norminal, type = 'probs')
roc.multi <- pROC::multiclass.roc(d$patient_satisfaction, p,
plot = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE,
print.auc = TRUE, show.thres = TRUE,
# add = TRUE
)
auc(roc.multi)
```
```{r}
# individual ROC curve info for each pair of classes:
# the four classes give choose(4, 2) = 6 unordered pairs, and each pair is
# evaluated in both directions (control and case swapped), i.e. 12 curves
roc.multi['rocs'] |> lengths()
```
```{r}
pROC::multiclass.roc(d$patient_satisfaction, p, levels=c('不满意', '非常满意'),
plot=T,
direction='>',
print.auc = TRUE, show.thres = TRUE
)
```
```{r}
# plot the first direction of each of the 6 pairwise ROC curves
rs <- roc.multi[['rocs']]
plot.roc(rs[[1]][[1]])
sapply(2:length(rs),function(i) lines.roc(rs[[i]][[1]],col=i))
```
*One-vs-Rest multiclass ROC*
In each step, a given class is regarded as the positive class and the remaining classes are pooled together as the negative class.
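For a single class this is just an ordinary binary ROC; a sketch for '非常满意' against the rest, reusing the probability matrix `p` from above:

```{r}
y_ovr <- factor(d$patient_satisfaction == '非常满意')   # TRUE = positive class
pROC::roc(y_ovr, p[, '非常满意'], plot = TRUE, print.auc = TRUE)
```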
```{r}
# implementing one-vs-rest for all classes in R takes a little work: in
# principle the predictions are re-scored once per class
library(multiROC)
# multiROC automates this, although its input format is somewhat inconvenient
```
```{r}
p1 <- as.data.frame(p)
p1$patient_satisfaction <- d$patient_satisfaction
p2 <- fastDummies::dummy_cols(p1, select_columns = 'patient_satisfaction') %>%
dplyr::select(starts_with('patient_satisfaction'), everything()) %>%
dplyr::select(-patient_satisfaction) %>%
dplyr::rename('不满意_true'='patient_satisfaction_不满意',
'一般_true'='patient_satisfaction_一般',
'满意_true'='patient_satisfaction_满意',
'非常满意_true'='patient_satisfaction_非常满意',
'不满意_pred_multinom'='不满意',
'一般_pred_multinom'='一般',
'满意_pred_multinom'='满意',
'非常满意_pred_multinom'='非常满意'
)
res <- multiROC::multi_roc(p2, force_diag=T)
# same as JMP
unlist(res$AUC)
```
```{r}
n_method <- length(unique(res$Methods))
n_group <- length(unique(res$Groups))
res_df <- data.frame(Specificity = numeric(0), Sensitivity = numeric(0),
Group = character(0), AUC = numeric(0), Method = character(0))
for (i in 1:n_method) {
  for (j in 1:n_group) {
    temp_data_1 <- data.frame(
      Specificity = res$Specificity[[i]][j],
      Sensitivity = res$Sensitivity[[i]][j],
      Group = unique(res$Groups)[j],
      AUC = res$AUC[[i]][j],
      Method = unique(res$Methods)[i]
    )
    colnames(temp_data_1) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
    res_df <- rbind(res_df, temp_data_1)
  }
  temp_data_2 <- data.frame(
    Specificity = res$Specificity[[i]][n_group + 1],
    Sensitivity = res$Sensitivity[[i]][n_group + 1],
    Group = "Macro",
    AUC = res$AUC[[i]][n_group + 1],
    Method = unique(res$Methods)[i]
  )
  temp_data_3 <- data.frame(
    Specificity = res$Specificity[[i]][n_group + 2],
    Sensitivity = res$Sensitivity[[i]][n_group + 2],
    Group = "Micro",
    AUC = res$AUC[[i]][n_group + 2],
    Method = unique(res$Methods)[i]
  )
  colnames(temp_data_2) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
  colnames(temp_data_3) <- c("Specificity", "Sensitivity", "Group", "AUC", "Method")
  res_df <- rbind(res_df, temp_data_2)
  res_df <- rbind(res_df, temp_data_3)
}
ggplot2::ggplot(res_df, ggplot2::aes(x = 1 - Specificity, y = Sensitivity)) +
ggplot2::geom_path(ggplot2::aes(color = Group, linetype = Method)) +
ggplot2::geom_segment(ggplot2::aes(x = 0, y = 0, xend = 1, yend = 1), colour = "grey", linetype = "dotdash") +
ggplot2::theme_bw() +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5), legend.justification = c(1, 0),
legend.position = c(.95, .05), legend.title = ggplot2::element_blank(),
legend.background = ggplot2::element_rect(fill = NULL, size = 0.5, linetype = "solid",
colour = "black"))
```
```{r}
library(yardstick)
```
```{r}
p1$predict <- predict(fit_norminal, type = 'class')
# overall classification accuracy
yardstick::accuracy(p1, patient_satisfaction, predict)
```
```{r}
# multiclass AUC via the method of Hand & Till (2001)
yardstick::roc_auc(data = p1,
truth = patient_satisfaction,
`不满意`:`非常满意`
)
# how to get a per-class AUC, i.e., the area under each curve in the figure below?
```
```{r}
# If a multiclass truth column is provided, a one-vs-all approach will be taken to calculate multiple curves, one per level.
# p1 %>% roc_curve(truth = patient_satisfaction,
# `不满意`:`非常满意`
# ) %>%
# autoplot()
# a prettier alternative to autoplot()
roc_curve(p1, patient_satisfaction, `不满意`:`非常满意`) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
```
```{r}
yardstick::lift_curve(p1, truth = patient_satisfaction,
`不满意`:`非常满意`
) %>%
autoplot()
```
#### Ordinal (ordered multi-class) outcomes

Ordinal logistic regression rests on four assumptions:

- Assumption 1: there is a single dependent variable and it is ordinal, e.g., blood pressure graded high/medium/low, or treatment outcome graded cured/effective/ineffective.
- Assumption 2: there are one or more independent variables, which may be continuous, ordinal, or nominal.
- Assumption 3: the independent variables show no multicollinearity.
- Assumption 4: the model satisfies the proportional odds assumption: no matter where the outcome is split, each predictor's effect on the outcome is the same, i.e., the regression coefficients do not depend on the cut point. Ordinal logistic regression works by splitting the ordered outcome into a series of binary logistic regressions. With the four satisfaction levels in this example, the split gives three binary comparisons, (0 vs 1+2+3), (0+1 vs 2+3), and (0+1+2 vs 3), each contrasting the lower levels against the higher ones. The model assumes these binary regressions share the same slope coefficients and differ only in their intercepts, and accordingly reports a single set of slopes. This equal-slopes assumption (the proportional odds assumption, checked with the so-called parallel lines test) must therefore be verified; if it fails, fall back to nominal multinomial logistic regression.

Fitting an ordinal logistic regression in software is fairly easy; interpreting the results is the fiddly part, which makes it well worth writing up.
Ordinal logistic regression models use a cumulative logit parameterization, which is why the output shows an Estimate for every level except the reference, e.g., the Threshold values in SPSS.
*Different software parameterizes differently:* SPSS and `MASS::polr` use the first parameterization: the slope parameters they report are positive and must be given a minus sign when the equations are written out. JMP and SAS use the second, so their slopes already carry the minus sign; JMP also codes categorical variables with Effect rather than Dummy coding. In R, `VGAM::vglm` and `rms::lrm` implement the second and third parameterizations, respectively; see reference link 1 above for details.

R, Stata, SPSS, and SAS (using proc genmod) use dummy coding, while JMP and SAS (using proc logistic) use effect coding (see Statnews #72 for more information on these two coding schemes). Both R and Stata use the first level alphanumerically as the reference level, whereas SAS, JMP, and SPSS use the last level. It is, however, possible to customize the reference level in each of these programs. In short, JMP encodes categorical variables differently from SPSS and R.
```{r}
library(ordinal)
```
```{r}
library(labelled)
dd <- haven::read_sav('~/Downloads/有序多分类Logistic回归【详】-SPSS教程-医咖会/26 有序Logistic回归.sav') %>%
labelled::unlabelled()
glimpse(dd)
dd <- dd %>% mutate(gender = factor(gender, levels = c('女性','男性')),
treatment = fct_rev(treatment),
patient_satisfaction = factor(patient_satisfaction,
levels = c('不满意','一般','满意','非常满意'),
ordered = TRUE
)
)
paint::paint(dd)
```
```{r}
# The command name comes from proportional odds logistic regression, highlighting the proportional odds assumption in our model.
fit_ordinal <- MASS::polr(formula = patient_satisfaction ~ ., data = dd, Hess = T)
# same as SPSS
summary(fit_ordinal)
```
*Interpretation:* Call is R reminding us what type of model we ran and what options we specified.

Next is the usual regression coefficient table with each coefficient's value, standard error, and t value, the latter being simply the ratio of the coefficient to its standard error. No significance test is given by default.

Then come the estimates for the intercepts, sometimes called cutpoints. They indicate where the (continuous) latent variable is cut to produce the ordered groups observed in the data; with four satisfaction levels there are three of them. They are generally not used when interpreting the results, and are closely related to the thresholds reported by other statistical packages.

Finally we see the residual deviance (−2 × the log-likelihood of the model) and the AIC; both are useful for model comparison.
For males, the odds of reporting higher treatment satisfaction are 1.976 times those of females (95% CI: 1.112–3.511), χ² = 5.389, P = 0.020.

Given these results and the parameterization SPSS uses, three cumulative-logit equations follow:

$\ln\frac{P(Y \le 0)}{P(Y > 0)} = 9.175 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

$\ln\frac{P(Y \le 1)}{P(Y > 1)} = 10.998 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

$\ln\frac{P(Y \le 2)}{P(Y > 2)} = 13.984 - 0.231\,\text{age} - 0.007\,\text{fee} - 0.681\,\text{gender}_{男性} - 0.032\,\text{treatment}_{drug1} - 1.142\,\text{treatment}_{drug2}$

Note that in the SPSS equations every slope is the raw value from the Parameter Estimates table with a minus sign attached; only the intercepts are used as reported.
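This sign convention can be read straight off the polr object (a sketch): `zeta` holds the cutpoints, and the slopes are negated when writing the SPSS-style equations.

```{r}
fit_ordinal$zeta      # the cutpoints, i.e. SPSS's "Threshold" estimates
-coef(fit_ordinal)    # the slopes as they appear in the equations above
```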
```{r}
(ci <- confint(fit_ordinal)) # default method gives profiled CIs
```
```{r}
## OR and CI
exp(cbind(OR = coef(fit_ordinal), ci))
```
These coefficients are called proportional odds ratios and we would interpret these pretty much as we would odds ratios from a binary logistic regression.
```{r}
car::Anova(fit_ordinal, type='III')
```
*Goodness of fit*
```{r}
chisq.test(dd$patient_satisfaction, predict(fit_ordinal))
```
```{r}
lmtest::lrtest(fit_ordinal)
```
*Calculating the pseudo R-square (measuring strength of association)*
```{r}
DescTools::PseudoR2(fit_ordinal, which = c("CoxSnell","Nagelkerke","McFadden"))
```
They are not as useful as the R-squared statistic in multiple regression, since their interpretation is not straightforward.
*test proportional odds assumption*
```{r}
# Load the library
library(brant)
# Run the Brant test on the model
brant(fit_ordinal)
```
Each of the probabilities refers to the null hypothesis that the parallel regression assumption holds, or that the value of the coefficients don’t differ across different cutpoints in the outcome variable.
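An informal complement to the Brant test (a sketch): fit each cumulative binary split as its own `glm` and eyeball whether the slope estimates are roughly equal across the three columns.

```{r}
sapply(1:3, function(k) {
  dk <- dd
  dk$y_bin <- as.integer(unclass(dk$patient_satisfaction) > k)  # split: Y > cut k
  coef(glm(y_bin ~ . - patient_satisfaction, data = dk, family = binomial))
})
```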
*ROC curve:*

The same issues arise here as in the nominal multi-class case; using the `rms` package directly is a recommended alternative.
```{r}
library(pROC)
p <- predict(fit_ordinal, type = 'probs')
roc.ordi <- pROC::multiclass.roc(dd$patient_satisfaction, p,
plot = TRUE,
print.auc = TRUE, show.thres = TRUE
)
auc(roc.ordi)
```
```{r}
# rebuild the probability data frame from the ordinal fit; p1 above still
# holds the multinomial model's probabilities
p1 <- as.data.frame(predict(fit_ordinal, type = 'probs'))
p1$patient_satisfaction <- dd$patient_satisfaction
yardstick::roc_curve(p1, patient_satisfaction, `不满意`:`非常满意`) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()
```
### Analysis as a prediction model

The two analyses below aim chiefly at building predictive models.
```{r}
library(tidymodels)
library(Hmisc)
library(rms)
```
#### As a clinical prediction model

See `clinical_prediction_model.Rmd` for the details.

#### As a machine-learning prediction model

Tunable hyperparameters of (penalized) logistic regression: `mixture = tune()`, `penalty = tune()`.
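A minimal parsnip specification, as a sketch (the glmnet engine is assumed; tidymodels is loaded above):

```{r}
# penalty is the total regularization amount; mixture blends lasso (1) and ridge (0)
logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
```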