89 add vignette #90
@@ -1,3 +1,4 @@ | ||
interpretable | ||
ADTTE | ||
ADaM | ||
AgD | ||
|
@@ -52,3 +53,5 @@ signorovitch | |
tte | ||
unanchored | ||
unscaled | ||
Liu | ||
Guyot |
@@ -0,0 +1,14 @@ | ||
<?xml version="1.0" encoding="utf-8"?> | ||
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US"> | ||
<!-- Elsevier, generated from "elsevier" metadata at https://github.com/citation-style-language/journals --> | ||
<info> | ||
<title>BioMedicine</title> | ||
<id>http://www.zotero.org/styles/biomedicine</id> | ||
<link href="http://www.zotero.org/styles/biomedicine" rel="self"/> | ||
<link href="http://www.zotero.org/styles/elsevier-vancouver" rel="independent-parent"/> | ||
<category citation-format="numeric"/> | ||
<issn>2211-8020</issn> | ||
<updated>2018-03-09T05:06:46+00:00</updated> | ||
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights> | ||
</info> | ||
</style> |
@@ -0,0 +1,193 @@ | ||
--- | ||
title: "Preprocessing and Calculating Weights" | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_document: default | ||
pdf_document: default | ||
knit: (function(inputFile, encoding) { | ||
rmarkdown::render(inputFile, encoding = encoding, | ||
output_format = "all") }) | ||
bibliography: references.bib | ||
csl: biomedicine.csl | ||
vignette: > | ||
%\VignetteIndexEntry{Calculating Weights} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
<style type="text/css"> | ||
|
||
body{ /* Normal */ | ||
font-size: 14px; | ||
} | ||
td { /* Table */ | ||
font-size: 10px; | ||
} | ||
h1.title { | ||
font-size: 38px; | ||
} | ||
h1 { /* Header 1 */ | ||
font-size: 28px; | ||
} | ||
h2 { /* Header 2 */ | ||
font-size: 22px; | ||
} | ||
h3 { /* Header 3 */ | ||
font-size: 18px; | ||
} | ||
code.r{ /* Code block */ | ||
font-size: 12px; | ||
} | ||
pre { /* Code block - determines code spacing between lines */ | ||
font-size: 14px; | ||
} | ||
</style> | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set(warning = FALSE, message = FALSE) | ||
``` | ||
|
||
|
||
# Loading R packages | ||
|
||
```{r} | ||
# install.packages("maicplus") | ||
library(maicplus) | ||
``` | ||
|
||
Additional R packages for this vignette: | ||
|
||
```{r} | ||
library(dplyr) | ||
``` | ||
|
||
# Preprocessing | ||
|
||
## Preprocessing IPD | ||
|
||
This example reads in and combines data from three standard simulated data sets (adsl, adrs, and adtte), which are saved as '.csv' files.
|
||
In this example scenario, age, sex, Eastern Cooperative Oncology Group (ECOG) performance status, smoking status, and number of previous treatments have been identified as imbalanced prognostic variables/effect modifiers. | ||
|
||
```{r} | ||
adsl <- read.csv(system.file("extdata", "adsl.csv", | ||
package = "maicplus", | ||
mustWork = TRUE | ||
)) | ||
adrs <- read.csv(system.file("extdata", "adrs.csv", | ||
package = "maicplus", | ||
mustWork = TRUE | ||
)) | ||
adtte <- read.csv(system.file("extdata", "adtte.csv", package = "maicplus", mustWork = TRUE)) | ||
|
||
# Data containing the matching variables | ||
adsl <- adsl %>% | ||
mutate(SEX_MALE = ifelse(SEX == "Male", 1, 0)) %>% | ||
mutate(AGE_SQUARED = AGE^2) | ||
|
||
# Could use built-in function for dummizing variables | ||
# adsl <- dummize_ipd(adsl, dummize_cols=c("SEX"), dummize_ref_level=c("Female")) | ||
|
||
# Response data | ||
adrs <- adrs %>% | ||
dplyr::filter(PARAM == "Response") %>% | ||
transmute(USUBJID, ARM, RESPONSE = AVAL, PARAM) | ||
|
||
# Time to event data (overall survival) | ||
adtte <- adtte %>% | ||
dplyr::filter(PARAMCD == "OS") %>% | ||
mutate(EVENT = 1 - CNSR) %>% | ||
transmute(USUBJID, ARM, TIME = AVAL, EVENT) | ||
|
||
# Rename adsl as ipd | ||
ipd <- adsl | ||
head(ipd) | ||
``` | ||
|
||
## Preprocessing aggregate data | ||
|
||
There are two methods for specifying aggregate data. The first method involves importing aggregate data via an Excel spreadsheet. Within the spreadsheet, variable types such as mean, median, or standard deviation are possible for continuous variables, while count or proportion are possible for binary variables. Each variable should be suffixed accordingly: _COUNT, _MEAN, _MEDIAN, _SD, or _PROP. Subsequently, the `process_agd` function will convert counts into proportions. | ||
|
||
The second method entails defining a data frame of aggregate data directly in R. When using this approach, the _COUNT suffix should be omitted, and only proportions are permissible for binary variables. Other suffixes remain consistent with the first method.
|
||
Any missing values in binary variables should be addressed by adjusting the denominator to account for missing counts, i.e., the proportion equals the count divided by (N - missing). | ||
|
||
```{r} | ||
# Through an excel spreadsheet | ||
# target_pop <- read.csv(system.file("extdata","aggregate_data_example_1.csv", package = "maicplus", mustWork = TRUE)) | ||
# agd <- process_agd(target_pop) | ||
|
||
# Second approach by defining a data frame in R | ||
agd <- data.frame( | ||
AGE_MEAN = 51, | ||
AGE_SD = 3.25, | ||
SEX_MALE_PROP = 147 / 300, | ||
ECOG0_PROP = 0.40, | ||
SMOKE_PROP = 58 / (300 - 5), | ||
N_PR_THER_MEDIAN = 2 | ||
) | ||
``` | ||
|
||
### How the _SD suffix is handled
|
||
As outlined in NICE DSU TSD 18 Appendix D [@phillippo2016b], balancing on both mean and standard deviation for continuous variables may be necessary in certain scenarios. When a standard deviation is provided in the comparator population, preprocessing involves calculating $E(X^2)$ in the target population (i.e. aggregate data) using the variance formula $Var(X)=E(X^{2})-E(X)^{2}$. This calculated $E(X^2)$ in the target population is then aligned with $X^{2}$ computed in the internal IPD. | ||
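As a small numeric illustration (not the package's internal code), $E(X^2)$ in the aggregate data can be recovered from the reported mean and standard deviation and then matched against the squared covariate created in the IPD:

```r
# Illustration only: derive E(X^2) in the aggregate data from the
# reported mean and SD via Var(X) = E(X^2) - E(X)^2
age_mean <- 51
age_sd <- 3.25
age_squared_mean <- age_sd^2 + age_mean^2  # E(X^2) in the target population

# In the IPD, the matching covariate is simply the square of the variable,
# e.g. AGE_SQUARED = AGE^2 created during preprocessing; its weighted mean
# is then balanced against age_squared_mean.
age_squared_mean
```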
|
||
### How the _MEDIAN suffix is handled
|
||
When a median is provided, the IPD is preprocessed to dichotomize the variable: values in the IPD exceeding the comparator population median are assigned 1, and the remaining values are assigned 0. The comparator population median is then replaced by 0.5 to match this binary categorization, and the newly formed IPD binary variable is balanced to a proportion of 0.5.
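A minimal sketch of this dichotomization, assuming the IPD contains a count variable `N_PR_THER` (the column name is illustrative) and that the aggregate data report a median of 2 previous therapies:

```r
# Illustration only: dichotomize an IPD variable at the comparator median
npr_ther_median_agd <- 2  # median reported in the aggregate data

# 1 if above the comparator median, 0 otherwise
ipd$N_PR_THER_MEDIAN <- ifelse(ipd$N_PR_THER > npr_ther_median_agd, 1, 0)

# The aggregate-data median is then replaced by 0.5, so this binary
# variable is balanced to a weighted proportion of 0.5
```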
|
||
## Centering IPD | ||
|
||
In the introduction vignette, we explain why centering the IPD variables using aggregate data means is needed when calculating weights. The function `center_ipd` centers the IPD using the aggregate data means. | ||
|
||
```{r} | ||
ipd_centered <- center_ipd(ipd = ipd, agd = agd) | ||
head(ipd_centered) | ||
``` | ||
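Conceptually, centering subtracts the aggregate data value from each matching variable (using $E(X^2)$ for the squared term). A rough sketch of the idea, using the columns defined above; the actual `center_ipd()` implementation and its handling of edge cases may differ:

```r
# Rough manual equivalent of centering (illustration only; use center_ipd())
ipd_manual <- ipd
ipd_manual$AGE_CENTERED <- ipd$AGE - agd$AGE_MEAN
ipd_manual$AGE_SQUARED_CENTERED <- ipd$AGE_SQUARED - (agd$AGE_SD^2 + agd$AGE_MEAN^2)
ipd_manual$SEX_MALE_CENTERED <- ipd$SEX_MALE - agd$SEX_MALE_PROP
```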
|
||
# Calculating weights | ||
|
||
We utilize the centered IPD and employ the `estimate_weights` function to compute the weights. Prior to executing this function, it's essential to specify the centered columns, i.e., the covariates to be utilized in the optimization process. | ||
|
||
```{r} | ||
# list variables that are going to be used to match | ||
centered_colnames <- c("AGE", "AGE_SQUARED", "SEX_MALE", "ECOG0", "SMOKE", "N_PR_THER_MEDIAN") | ||
centered_colnames <- paste0(centered_colnames, "_CENTERED") | ||
|
||
match_res <- estimate_weights( | ||
data = ipd_centered, | ||
centered_colnames = centered_colnames | ||
) | ||
|
||
# Alternatively, you can specify the numeric column locations for centered_colnames | ||
# match_res <- estimate_weights(ipd_centered, centered_colnames = c(14, 16:20)) | ||
``` | ||
|
||
Following the calculation of weights, it is necessary to determine whether the optimization procedure has worked correctly and whether the weights derived are sensible. | ||
|
||
The approximate effective sample size is calculated as: $$ ESS = \frac{(\sum_{i=1}^n \hat{w}_i)^2}{\sum_{i=1}^n \hat{w}^2_i} $$ A small ESS, relative to the original sample size, is an indication that the weights are highly variable and that the estimate may be unstable. This often occurs if there is very limited overlap in the distribution of the matching variables between the populations being compared.
Review comment: Is this how the ESS is calculated in the Signorovitch article? If so, we should mention that the ESS, as defined in Signorovitch et al., is derived from estimates using linear combinations of the observations. Effective sample size cannot be easily calculated when utilizing weighted survival estimates because survival estimates are not a linear function of the observations.
||
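As a quick hand check of this formula, the ESS and the corresponding percentage reduction can be computed directly from the estimated weights (here `match_res$data$weights` is an assumed location of the individual weights; the actual slot name in `maicplus` may differ):

```r
# Hand computation of the approximate ESS from a weight vector
w <- match_res$data$weights        # assumed location of the estimated weights
ess <- sum(w)^2 / sum(w^2)
ess_reduction <- (1 - ess / length(w)) * 100  # % reduction vs original N
c(ESS = ess, reduction_pct = ess_reduction)
```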
|
||
In this example, the reduction in ESS is 66.73% of the total number of patients in the intervention arm (500 patients in total). As this is a considerable reduction, estimates based on this weighted data may be unreliable.
Review comment: Can we put a sentence explaining what this 66.73% means in simple words, i.e., the ESS reflects the fraction of the original sample contributing to the adjusted outcome, and large reductions in ESS may indicate poor overlap between the IPD and AgD studies? (Also a potential advancement for our package: https://doi.org/10.1002/jrsm.1466)
||
|
||
```{r} | ||
match_res$ess | ||
``` | ||
|
||
It is also useful to visualize the weights using a histogram to check for any extreme weights. Scaled weights are weights expressed relative to the original unit weights of each individual and are calculated as $$\tilde{w}_i = \frac{\hat{w}_i}{\sum_{i=1}^n \hat{w}_i} \times n.$$
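For reference, the scaled weights can also be computed by hand from the same weight vector `w` assumed above:

```r
# Scaled weights: relative to the original unit weights of each individual
w_scaled <- w / sum(w) * length(w)
summary(w_scaled)
```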
|
||
```{r} | ||
plot(match_res) | ||
|
||
# ggplot option is also available | ||
# plot(match_res, ggplot = TRUE, bin_col = "black", vline_col = "red") | ||
``` | ||
|
||
Another check is to examine whether the weighted summary of the covariates in the internal IPD matches the external aggregate data summary.
|
||
```{r} | ||
outdata <- check_weights(match_res, agd) | ||
outdata | ||
``` | ||
|
||
# References |
@@ -0,0 +1,121 @@ | ||
--- | ||
title: "Introduction" | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_document: default | ||
pdf_document: default | ||
knit: (function(inputFile, encoding) { | ||
rmarkdown::render(inputFile, encoding = encoding, | ||
output_format = "all") }) | ||
bibliography: references.bib | ||
csl: biomedicine.csl | ||
vignette: > | ||
%\VignetteIndexEntry{Introduction} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
<style type="text/css"> | ||
|
||
body{ /* Normal */ | ||
font-size: 14px; | ||
} | ||
td { /* Table */ | ||
font-size: 10px; | ||
} | ||
h1.title { | ||
font-size: 38px; | ||
} | ||
h1 { /* Header 1 */ | ||
font-size: 28px; | ||
} | ||
h2 { /* Header 2 */ | ||
font-size: 22px; | ||
} | ||
h3 { /* Header 3 */ | ||
font-size: 18px; | ||
} | ||
code.r{ /* Code block */ | ||
font-size: 12px; | ||
} | ||
pre { /* Code block - determines code spacing between lines */ | ||
font-size: 14px; | ||
} | ||
</style> | ||
|
||
# Introduction | ||
Review comment: Suggested text to improve clarity and flow.

Review comment: I would suggest using the word "participant" instead of the words "patient" and "subject".
||
|
||
Health technology assessments and appraisals necessitate dependable estimations of relative treatment effects to guide reimbursement determinations. In instances where direct comparative evidence is lacking, yet both treatments under scrutiny have been separately evaluated against a shared comparator (e.g., placebo or standard care), a conventional indirect comparison can be conducted utilizing published aggregate data from each study. | ||
|
||
This document outlines the procedures for conducting a matching-adjusted indirect comparison (MAIC) analysis using the maicplus package in R. MAIC is suitable when individual patient data from one trial and aggregate data from another are accessible. The analysis focuses on endpoints such as time-to-event (e.g., overall survival) or binary outcomes (e.g., objective tumor response). | ||
|
||
The methodologies detailed herein are based on the original work by Signorovitch et al. (2010) and further elucidated in the National Institute for Health and Care Excellence (NICE) Decision Support Unit (DSU) Technical Support Document (TSD) 18. [@signorovitch2010; @phillippo2016a] | ||
|
||
A clinical trial lacking a common comparator treatment to link it with other trials is termed an unanchored MAIC. Without a common comparator, it becomes challenging to directly compare the outcomes of interest between different treatments or interventions. Conversely, if a common comparator is available, it is termed an anchored MAIC. Anchored MAIC offers certain advantages over unanchored MAIC, as it can provide more reliable and interpretable results by reducing the uncertainty associated with indirect comparisons. | ||
|
||
MAIC methods aim to adjust for between-study differences in patient demographics or disease characteristics at baseline. In scenarios where a common treatment comparator is absent, MAIC assumes that observed differences in absolute outcomes between trials are solely attributable to imbalances in prognostic variables and effect modifiers. This assumption requires that all imbalanced prognostic variables and effect modifiers between the studies are known, which is often challenging to fulfill. [@phillippo2016a] | ||
|
||
Various approaches exist for identifying prognostic variables and effect modifiers for use in MAIC analyses. These include clinical consultation with experts, review of published literature, examination of previous regulatory submissions, and data-driven methods such as regression modeling and subgroup analysis to uncover interactions between baseline characteristics and treatment effects. | ||
|
||
# Statistical theory behind MAIC | ||
|
||
The matching is accomplished by re-weighting patients in the study with individual patient data (IPD) by their odds, or likelihood, of having been enrolled in the study with the aggregate data (AgD). We use the term “likelihood” because we seek the parameter values that maximize the probability of observing the data. The approach is very similar to propensity score weighting, with the difference that IPD is not available for one study, so the usual maximum likelihood approach cannot be used to estimate the parameters. Instead, a method of moments must be used (the method of moments is a statistical method for estimating population parameters). After the matching is complete and weights have been added to the IPD, it is possible to estimate the weighted outcomes and compare the results across the two studies.
|
||
The matching approach can be described as follows: assuming that each trial has one arm, each patient can be characterized by the random triple ($X$, $T$, $Y$), where $X$ represents the baseline characteristics (e.g., age and weight), $T$ represents the treatment of interest (e.g., $T = 0$ for the IPD study and $T = 1$ for the study with AgD), and $Y$ is the outcome of interest (e.g., overall survival).
|
||
Each patient is characterized by a random triple ($x_i$, $t_i$, $y_i$) for $i = 1, \ldots, n$, but this is observed only when IPD is available, i.e., when $t_i = 0$. When $t_i = 1$, only the mean baseline characteristics $\bar{x}_{agg}$ and the mean outcome $\bar{y}_{agg}$ are observed.
|
||
Given the observed data, the causal effect of treatment $T = 0$ versus $T = 1$ on the mean of $Y$ can be estimated as follows: | ||
|
||
\[ | ||
\frac{\sum_{i=1}^{n}y_{i}w_{i}}{\sum_{i=1}^{n}w_{i}}-\bar{y}_{agg} | ||
\] | ||
|
||
where $w_i=\frac{Pr(T_i=1\mid x_i)}{Pr(T_i=0\mid x_i)}$ is the odds that patient $i$ received treatment $T=1$ vs $T=0$ (i.e., enrolled in the aggregate data study vs the IPD study) given baseline characteristics $x_i$. Thus, the patients receiving $T=0$ are re-weighted to match the distribution of patients receiving $T=1$. Note that this form of the causal effect applies when the outcome $Y$ is continuous. If the outcome is binary, $Y$ would be a proportion and we would use a link function such as the logit to express the causal effect on an odds ratio scale. As in propensity score methods, we may assume $w_i$ to follow a logistic regression form
|
||
\[ | ||
w_{i}=exp(x_i^{T}\beta) | ||
\] | ||
|
||
In order to estimate $\beta$, we use the method of moments. We estimate $\beta$ such that the weighted averages of the covariates in the IPD exactly match the aggregate data averages. Mathematically speaking, we estimate $\beta$ such that:
|
||
\[ | ||
0=\frac{\sum_{i=1}^{n}x_{i}exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}exp(x_i^{T}\hat{\beta})}-\bar{x}_{agg} | ||
\] | ||
|
||
This equation is equivalent to | ||
|
||
\[
0=\sum_{i=1}^{n}(x_{i}-\bar{x}_{agg})exp(x_{i}^{T}\hat{\beta})
\]

Review comment: If this formula refers to the transformation of the IPD by subtracting the aggregate data means, I suggest placing it after the text below.
|
||
It is possible to use this estimator since a logistic regression model for the odds of receiving $T = 1$ vs $T = 0$ would, by definition, provide the correct weights for balancing the trial populations. If $x_i$ contains all the confounders and the logistic model for $w_i$ is correctly specified, then $\hat{\theta}$ in the next equation provides a consistent estimate of the causal effect of treatment $T = 0$ vs $T = 1$ on the mean of $Y$ among patients.
|
||
\[ | ||
\hat{\theta}=\frac{\sum_{i=1}^{n}y_{i}exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}exp(x_i^{T}\hat{\beta})}-\bar{y}_{agg} | ||
\] | ||
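In code, this estimator is simply a weighted mean of the IPD outcome minus the aggregate data mean. A sketch with placeholder objects (`y` for the IPD outcome, `w` for the estimated weights $exp(x_i^{T}\hat{\beta})$, and `y_agg` for the reported aggregate mean):

```r
# Sketch of the causal-effect estimator for a continuous outcome
theta_hat <- sum(y * w) / sum(w) - y_agg
# equivalently
theta_hat <- weighted.mean(y, w) - y_agg
```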
|
||
In order to solve the equation set up by the method of moments, we can transform the IPD by subtracting the aggregate data means. Then $\bar{x}_{agg}$ equals 0 and the equation simplifies. This is why the IPD is centered in the preprocessing step.
|
||
\[ | ||
0=\sum_{i=1}^{n}x_{i}exp(x_{i}^{T}\hat{\beta}) | ||
\] | ||
|
||
Note that this is the first derivative of | ||
|
||
\[ | ||
Q(\beta)=\sum_{i=1}^{n}exp(x_{i}^{T}\beta)
\] | ||
|
||
which has second derivative | ||
|
||
\[ | ||
Q''(\beta)=\sum_{i=1}^{n}x_{i}x_{i}^{T}exp(x_{i}^{T}\beta)
\] | ||
|
||
Since $Q''(\beta)$ is positive-definite for all $\beta$, $Q(\beta)$ is convex and any finite solution from | ||
the equation is unique and corresponds to the global minimum of $Q(\beta)$. Thus, we can use optimization | ||
methods to calculate $\beta$. | ||
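A bare-bones sketch of this optimization on centered IPD covariates, purely to illustrate the theory (here `X_centered` is a placeholder for an n-by-p matrix of covariates with the aggregate data means already subtracted; `estimate_weights()` carries out the equivalent steps internally):

```r
# Method-of-moments fit by minimizing the convex objective Q(beta)
objective <- function(beta, X) sum(exp(X %*% beta))

fit <- optim(
  par    = rep(0, ncol(X_centered)),  # starting values
  fn     = objective,
  X      = X_centered,
  method = "BFGS"
)

# Weights are the exponentiated linear predictor at the optimum
w_hat <- exp(X_centered %*% fit$par)

# Check the balancing condition: weighted covariate means should be ~0
colSums(X_centered * as.vector(w_hat)) / sum(w_hat)
```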
|
||
# References |
Review comment: This paragraph should be before the previous one.