IC50_cal.qmd

---
title: "IC50 数据分析"
format: html
---

## IC50 数据分析

> https://rstudio-pubs-static.s3.amazonaws.com/378543_5b5bda32bf0541a485ccc49efbea680a.html
> https://towardsdatascience.com/drug-dose-response-data-analysis-5d7d336ad8e9
> https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_models_built-in_to_prism.htm
> https://www.aatbio.com/tools/ic50-calculator 一个在线小工具
> https://rstats4ag.org/dose-response-curves.html 重要的基础知识内容，主要参考文件
> http://www.darrenkoppel.com/2020/09/04/dose-response-modelling-and-model-selection-in-r/

> https://demo.thunor.net/ 免费基于Python的dose-curve包


复现常见的药物IC50计算和绘图.
针对不同的实验设计，比如激动剂药物和拮抗剂药物，体外还是体内实验，严格来说体内实验是`dose-response curve`, 对于体外实验`concentration-response curve`是一个更为合适的称谓，不过一般宽泛的称为`dose-response`。

CDC官网对四参数最低的data points数目要求为a minimum of five data points is required for parameter estimation。不过 查到的为ELISA数据，此处也是仅供参考。


IC50 is the “half-maximal inhibitory concentration” for an entity (typically a drug) against a biological process or function (eg, enzyme activity, cell number, metabolic activity). 

EC50 is, comparably, the “half-maximal effective concentration”. The effective is in there because not all entities will completely wipe out the biological process that you are assaying. 如果没有一个最大和最小值是不可能计算EC50的。
就只是response最下和最大值的中位值所对应的浓度。在拮抗剂可以让100%的细胞死亡的情况下，EC50=IC50。再比如说一个拮抗剂最多能杀死45%的细胞，那么IC50是不存在的，但是可以有EC50.

Kd


- E0: Response observed in absence of drug
- Einf: Maximum response caused by a drug
- EC50: Concentration of a drug at which 50% of its maximal response is induced. EC50 is normally measured in molar concentrations and is used as a measure of agonist drug potency — the lower the EC50 value, the lower the concentration of drug required to elicit a 50% maximal response and the greater the potency of the drug.
- IC50: Concentration of a drug required for 50% inhibition of cell growth. It is normally measured in molar concentrations and is used as a measure of antagonist drug potency — the lower the IC50 value, the more potent the drug is.
- Hill slope: 从定义可以看出拮抗剂的slope值一般是小于0的。而和1的对比可以大概看出曲线的坡度。


*four-parameter log-logistic model: * 四参数模型是较为常用的，除此之外还有三参数、五参数模型等。

  在对数据进行绘图时，一般会对X轴的浓度数据取log，而Y轴的response一般进行简单的线性转换，主要是保证that all scatter around the curve is Gaussian。
  
  
*模型判断：*如何判断一个curve模型拟合的时候好呢？`drc`包似乎没有提供类似Prism R2的结果。而且在非线性模型中，R2不再是线性模型中x对
y的解释程度的那个数值，For nonlinear regression, the choice of a ‘null’ model is not as simple, and therefore the  
R2 type measure will be different depending on the chosen reduced model.
R2在线性模型中实际上是两个模型的对比，所拟合的模型以及null model。
AIC是个不错的选择。


*模型选择：*除了依据模型评价指标进行选择外，依据数据特点进行选择。比如最为常用到的log-logistic模型，三参数、二参数是把四参数某个参数设置为默认值实现的，Prism中把Hill系数设为1，`drc`包三参数将最小值fit到0、两参数模型还将最大值设为1，两参数模型对于survival数据较为常用。


*symmetric log-logistic model: *

$$y=c+(d-c)/(1+exp(b(log(x)-log(ED50))))$$

*常见的拟合curve的曲线和函数：*
在生存分析的学习部分，曾见过多个分布，logistic分布、weibull分布、
在interpret曲线结果时，可以分别汇报四个参数的意义即可，比如对于b值，即是Hill系数，其是拐点值，则大约在拐点值时，细胞死亡速度达到高峰，


## 示例演示

```{r, include=FALSE}
library(tidyverse)
library(ggprism)
library(pzfx) # read GraphPad data
library(drc) # Dose response curve
library(sandwich)
library(lmtest)

# 解决drc的一个bug
# 在用sandwich包时的一个小错误
registerS3method("estfun", "drc", drc::estfun.drc)
registerS3method("bread", "drc", drc::bread.drc)
```


### 单曲线分析绘制

首先是和Prism软件数据的交互，由国人开发的这个包还是不错的。

```{r}
pzfx::pzfx_tables('~/OneDrive/kintor/Daily_Work/IC50_xiaodan/data.pzfx')

```


```{r}

dd <- pzfx::read_pzfx('~/OneDrive/kintor/Daily_Work/IC50_xiaodan/data.pzfx', 
                      "22RV1-GT19686")

dd
```


此处读取的是经过处理的数据，数据为宽数据，不过需要计算重复的均值，作为函数的因变量。

```{r}
d_p <- dd %>% rename('drug_con' = 'Var.1','rep1' = '_1',
              'rep2' = '_2', 'rep3' = '_3'
              ) %>% 
  mutate(id = row_number()) %>% 
  rowwise() %>% 
  mutate(inhib_mean = mean(c_across(c(rep1, rep2, rep3)))) %>% 
  ungroup() %>% 
  relocate(id, everything())

d_p
```

长数据类型。

```{r}
d_long <- d_p %>% pivot_longer(cols = -c(id, drug_con, inhib_mean),
                               names_to = 'rep',
                               values_to = 'response'
                               )

d_long
```

```{r}
ggplot(d_long, aes(x = drug_con, y = response)) +
  geom_point()

plot(response ~ log(drug_con+0.1), data = d_long, main="Logarithmic Dose Scale")
```


`drc` package 用来拟合`dose-response curve`，`drm` (which fits the regression model) and `LL.4` (which defines the structure of the regression model).

`drc::drm`包有一个不错的地方在于，其对长宽数据都支持，不过*肯定用长数据*，所谓宽数据是需要计算均值的；而且后续绘图不友好，只是作为一个演示放在此笔记中。

```{r}
# fitted_curve <- drc::drm(
#   formula = inhib_mean ~ drug_con,
#   data = d_p,
#   fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50"))
#   # fct = LL.3()
#   )

fitted_curve <- drc::drm(
  formula = response ~ drug_con,
  data = d_long,
  fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50")) # 四参数
  # fct = drc::LL.4(fixed = c(1,NA,NA,NA)) # 三参数
  )


summary(fitted_curve)
```
*interpret: *As a consequence estimated b values often lie in the range 0.5 to 20 in absolute value, regardless of the assay or experiment generating the data.


可以看到，再手动计算均值后得到的结果和直接用长数据得到的结果是一致的。

```{r}
fitted_curve2 <- drc::drm(
  formula = response ~ drug_con,
  data = d_long,
  # fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50")) # 四参数
  fct = drc::LL.4(fixed = c(1,NA,NA,NA)) # 三参数
  )


summary(fitted_curve2)


# 此处所得到的结果和summary结果差异在于其计算方法的不同
lmtest::coeftest(fitted_curve2, vcov = sandwich)
# summary(multcomp::glht(ryegrass.LL.4)) # same result as summary


# 三参数
broom::tidy(fitted_curve2) %>% 
  bind_cols(tibble(rowname = c('E0', 'Einf', 'EC50'))) %>% 
  dplyr::relocate(rowname, everything()) %>% 
  bind_cols(confint(fitted_curve2))
```


*模型验证*

以下模型的lack-of-fit 的P值偏大，模型的整体表现似乎一般。
```{r}
# 和ANOVA模型的差异，P值是越大越好呀
modelFit(fitted_curve)
modelFit(fitted_curve2)

# the correlation of the adjuste of the curve plotted and the data
cor(d_long$response, fitted(fitted_curve2))
```

对于连续变量结局模型的评价，lack-of-fit的P值大于0.05表明模型还不错，不过Prism有提供R2值，目前`drc`包还没有找到计算R2的方法。for lack-of-fit test (against a one-way ANOVA model): the larger (the p-value) the better. 

_但是只利用lack-of-fit_还是不够的，The distribution of the residuals and the hopefully normal distribution of residuals are informative to make sure that the prerequisites for doing a regression in the first place are not violated. The assumptions we must consider include:

Correct regression model
Variance Homogeneity
Normally distributed measurement errors
Mutually independent measurement error  ε

```{r}
#Graphical analysis of residuals
op <- par(mfrow = c(1, 2), mar=c(3.2,3.2,2,.5), mgp=c(2,.7,0)) #put two graphs together

plot(residuals(fitted_curve2) ~ fitted(fitted_curve2), main="Residuals vs Fitted")
abline(h=0)
qqnorm(residuals(fitted_curve2))
qqline(residuals(fitted_curve2))
```
*interpret: *Model esiduals diagnostic plots: Left: distribution of residuals against the fitted value (variance homogeneity); Right: normal q-q plot illustrating whether the residuals are normally distributed (normally distributed measurement errors).

The residual plot indicates that the model was able to catch the variation in data. The distribution of residuals should look like a shotgun shot on a wall. Apparently, there are some systematic violations in the residual plot, At small fitted values the residuals are close to zero than at high fitted values the variation is much more pronounced. This is called a funnel type pattern. We also see it in the graphs with the fit and all the observations (Figure 11.4). The normal Q-Q plot looks all right, no serious deviation from the assumption of normal distribution of residuals. Later we will look at this problem, which is a statistical problem not a biological one.


*计算ED值：*

考虑到各种模型间的不同，用此函数得到ED值更为靠谱一些。*weibull等模型的ED50值需通过此函数得到！*

Delta refers to the asymptotics-based method to calculate confident intervals. 

```{r}
ED(fitted_curve2, c(5, 10, 50, 80), interval = "delta")

```


*计算IC50值:*

此处就有一个较为微妙的地方，Prism计算得到的值实验室一般直接采用，即EC50值，而对于不同的模型、不同的实验设计，似乎
IC50的真正计算需要进一步的探讨。比如Weibull其EC50值就不是一个模型参数。

注意Hill系数的值在抑制剂中，应该为一个负值。

```{r}
coefs <- setNames(
  fitted_curve$coefficients,
  c("hill", "min_value", "max_value", "ec_50")
  )

ic_50 <- with(
  as.list(coefs),
  exp(
    log(ec_50) + (1 / hill) * log(max_value / (max_value - 2 * min_value))
  )
)

ic_50
```


*选择合适的model*

我们所使用的数据是和control组进行对比所得到的比值，是具有浓度为0的数据点，应该在计算IC50时带入，不过并没有必要对此要设置最大最小值的默认值，自己去计算E0，Einf是完全可行的。依据不同的实验设计选择不同的参数。

*因为数据为比例，有理由考虑模型最大设置为100等*

```{r, warning=FALSE}
# 演示一个模型选择的过程
# 依据IC值的大小可以明显的看到4参数是个好模型。
drc::mselect(fitted_curve, fctList = list(W1.3(fixed=c(NA, 100, NA)),
                                     W1.4(), 
                                     W2.4(),
                                     W2.3(fixed=c(NA, 100, NA)), 
                                     LL.3(),
                                     LL.4(fixed = c(1,NA,NA,NA)),
                                     LL.4(fixed = c(NA,0,100,NA))
                                     ),
        linreg=TRUE
        ) 


```
*interpret: *For Akaike’s information criterion and the residual standard error: the smaller the better and for lack-of-fit test (against a one-way ANOVA model): the larger (the p-value) the better. Note that the residual standard error is only available for continuous dose-response data. Log likelihood values cannot be used for comparison unless the models are nested.
The rule of AIC is that the smaller the better and differences of *more than 10* indicates “significantly” better model fits. 


```{r}
AIC(fitted_curve, fitted_curve2)
```

*考虑到上述模型中4参数模型其log-of-fit*也小于0.05，再看待一下Weibull模型。
```{r}
fit_weibull <- drc::drm(
  formula = response ~ drug_con,
  data = d_long,
  fct = W1.4()
  )


modelFit(fit_weibull)
```
```{r}
# ?plot.drc
plot(fit_weibull)
```


*drc包还有提供综合几个模型结果的函数*

就个人而言，期望其的使用，在于多个模型的AIC差别不大的情况下，虽则其会给予比重。
```{r}
drc::maED(fitted_curve,
          fctList = list(LL.3(), LL.4(fixed = c(1,NA,NA,NA))),
          c(10, 50, 80),
          interval="kang"
          )
```


#### 利用ggplot2绘制漂亮的图片：

下面这两段摘抄的代码效果一般，不考虑。

```{r}
# new dose levels as support for the line
newdata <- expand.grid(conc = exp(seq(log(0.01), log(100), length = 100)))
# newdata <- expand(conc = )
# predictions and confidence intervals
pm <- predict(fitted_curve, newdata = newdata, interval = "confidence")
# new data with predictions
newdata$p <- pm[, 1]
newdata$pmin <- pm[, 2]
newdata$pmax <- pm[, 3]
# plot curve
library(ggplot2)
# need to shift conc == 0 a bit up, otherwise there are problems with coord_trans
d_long$conc0 <- d_long$drug_con
# d_p$conc0 <- log(d_p$drug_con)
d_long$conc0[d_long$conc0 == 0] <- 0.001
# plotting the curve
ggplot(d_long, aes(x = conc0, y = response)) +
  geom_point(size =3) +
  geom_ribbon(data = newdata, aes(x = conc, y = p, ymin = pmin, ymax = pmax), alpha = 0.2) +
  geom_line(data = newdata, aes(x = conc, y = p)) +
  scale_x_log10() +
  xlab("[Agonist], M") +
  ylab("Response") +
  scale_colour_manual(
    values = c("#9FA3FE", "#00167B", "#9FA3FE")
  ) +
  ggnewscale::new_scale_colour() +
  scale_colour_prism(
    palette = "winter_bright",
  ) +
  scale_shape_prism() +
  theme_prism(palette = "winter_bright", base_size = 16) +
  theme(
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.position = c(0.05, 0.95),
    legend.justification = c(0.05, 0.95)
  )
```


```{r}
# Note the logged x-axis
plot(fitted_curve, 
     log='x', 
     xlab = 'Drug concentration (microM)', 
     ylab= 'Inhibition',
     type = 'all'
     )
```

此为另一篇推文计算的IC50和上述有所不同。

```{r}
E0 <- fitted_curve$coefficients[2]
Einf <- fitted_curve$coefficients[3]
EC50 <- fitted_curve$coefficients[4]
H <-  -fitted_curve$coefficients[1]
IC50 <- exp(log((((Einf - E0) / (50 - E0) ) - 1) * EC50**(-H))/ (-H))

efficacy_metrics <- data.frame(c(E0, Einf, EC50, H, IC50))
rownames(efficacy_metrics) <- c('E0', 'Einf', 'EC50', 'H', 'IC50')
colnames(efficacy_metrics) <- 'Value'
efficacy_metrics
```


```{r}
# 不是很合适
dose_resp <- y ~ min + ((max - min) / (1 + exp(hill_coefficient * (ec50 - x))))

d_p %>%
  mutate(drug_con_log = log10(drug_con)) %>%
  pivot_longer(
    cols = -c(drug_con, drug_con_log, id, inhib_mean),
    names_pattern = "(.{3})([0-9])",
    names_to = c("treatment", "rep"),
    values_to = "response"
  ) %>%
  ggplot(
    aes(x = drug_con_log, y = response)
  ) +
  geom_smooth(
    #aes(colour = treatment),
    method = "nls", formula = dose_resp, se = FALSE,
    method.args = list(start = list(
      min = E0, max = Einf,
      ec50 = EC50, hill_coefficient = H
    ))
  ) +
  scale_colour_manual(
    labels = c("Inhibitor"),
    values = c("#00167B", "#9FA3FE")
  ) +
  ggnewscale::new_scale_colour() +
  geom_point(size = 3) +
  scale_colour_prism(
    palette = "winter_bright",
    labels = c(
      "Inhibitor"
    )
  ) +
  scale_shape_prism(
    labels = c(
      "Inhibitor"
    )
  ) +
  theme_prism(palette = "winter_bright", base_size = 16) +
  theme(
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.position = c(0.05, 0.95),
    legend.justification = c(0.05, 0.95)
  ) +
  labs(x = "[Agonist], M")
```

*绘制基于`ggplot2`的漂亮图片：*

```{r}
#plot_2
p1 <- ggplot(data = d_long,
       aes(x = drug_con, y = response)) + 
  geom_point(col = "yellowgreen", size = 4) + 
  geom_smooth(method = drm,col = "skyblue", method.args = list(fct = L.4()), se = F) +
  scale_x_log10() +
  labs(title= "GT19637", 
       x = "Dose (mM)",  y = "Inhibition") + 
  # ggprism::theme_prism() +
  theme(legend.position = "",
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black",fill = NA, size=0.5),
        panel.grid = element_blank(),
        axis.text.x   = element_text(size= 12, color = "darkblue",family = "sans",hjust = 0.5),
        axis.text.y   = element_text(size= 12, color = "darkblue",family = "sans",vjust = 0.5,hjust = 0.5),
        axis.title  = element_text(size=12, color = "darkred",family = "sans"),
        axis.ticks =  element_line(size= 0.5),
        axis.ticks.length = unit(3, "pt")
        )
p1
```


绘制基于`ggplot2`的漂亮图片：

```{r}
data_1 <- aggregate(d_long["response"], list(conc = d_long$drug_con), FUN = mean)
data_1$sd <- aggregate(d_long["response"], list(conc = d_long$drug_con), FUN = sd)$response
ic50_value <- round(efficacy_metrics["EC50", ], 3)


p2 <- ggplot(
  data = data_1,
  aes(x = conc, y = response)
) +
  geom_point(col = "red", size = 3, shape = 21, fill = "yellowgreen", position = "identity") +
  geom_errorbar(aes(ymin = response - sd, ymax = response + sd), width = .05, position = "identity") +
  geom_smooth(method = drm, col = "skyblue", method.args = list(fct = L.4()), se = F) +
  scale_x_log10() +
  ggplot2::annotate(
    geom = "text", x = 0.2, y = 0.2, size = 8,
    label = glue::glue("IC50 = {ic50_value}")
  ) +
  labs(
    title = "GT19637",
    x = "Log Drug Concentration(nM)", y = "Relative Cell viability"
  ) +
  theme(
    legend.position = "",
    panel.background = element_blank(),
    # panel.border = element_rect(colour = "black", fill = NA, size = 0.5),
    panel.grid = element_blank(),
    axis.text.x = element_text(size = 12, color = "darkblue", family = "sans", hjust = 0.5),
    axis.text.y = element_text(size = 12, color = "darkblue", family = "sans", vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 16, color = "darkred", family = "sans"),
    axis.ticks = element_line(size = 1),
    axis.line = element_line(size = 1),
    axis.ticks.length = unit(3, "pt"),
    plot.title = element_text(size = 20, colour = "black", hjust = 0.5, vjust = -1, face = "bold")
  )

p2
```


### 多曲线对比

除计算单个化合物的IC50值外，两条或多条曲线间的对比也是经常要用到的功能，现记录一下。

两条曲线的对比，通常是满足一定条件的，即是可比的，比如同样实验设计的不同化合物之类的。

首先绘制两条或多条不同的曲线。

```{r}
plot(fitted_curve, 
     log='x', 
     xlab = 'Drug concentration (microM)', 
     ylab= 'Inhibition',
     type = 'all',
     broken = TRUE
     )

plot(fitted_curve2, add = TRUE, col="orange")
```

```{r}
pp1 <- ggplot(data = d_long,
       aes(x = drug_con, y = response)) + 
  geom_point(col = "yellowgreen", size = 4) + 
  geom_smooth(method = drm,col = "skyblue", method.args = list(fct = L.4()), se = F) +
  geom_smooth(method = drm,col = "red4", 
              method.args = list(fct = L.4(fixed = c(1,NA,NA,NA))), 
              se = F) +
  scale_x_log10() +
  labs(title= "GT19637", 
       x = "Dose (mM)",  y = "Inhibition") + 
  ggprism::theme_prism()


pp1
```


*多曲线绘制示例：*

输入数据中加上对应的分组信息即可。

当然需要选择特定的对比。

```{r}

fit_two <- drc::drm(formula = DryMatter ~ Dose,
                    curveid = Herbicide,
                    data = S.alba,
                    fct = LL.4()
                    )

```

*在多条曲线进行拟合时有什么是需要注意到嚒*

示例中考虑了设置一样的最大最小值。

```{r}
summary(fit_two)
```


```{r}

modelFit(fit_two)
```

```{r}
plot(fit_two, broken = T, col = c('red', 'blue'), )
```

`EDcomp` Comparison of relative potencies between dose-response curves

estimate值是 relative potency

```{r}

EDcomp(fit_two, percVec = c(50, 50))
```

```{r}
EDcomp(fit_two, percVec = c(50, 50), interval="delta", reverse=TRUE)
```


```{r}
par(mfrow=c(1,1), mar=c(3.2,3.2,.5,.5), mgp=c(2,.7,0))

drc::relpot(fit_two, interval = "delta", bty="l")
```


## 拟合log-logistic模型

log-logsitic 是指数据经过log后具有logistic分布的数据。

```{r}
library(glogis)
```