89 add vignette #90

Closed: wants to merge 13 commits
5 changes: 2 additions & 3 deletions R/maic_unanchored.R
@@ -7,10 +7,10 @@
#' @param ipd a data frame that meets the format requirements in 'Details', individual patient data (IPD) of internal trial
#' @param pseudo_ipd a data frame, pseudo IPD from digitized KM curve of external trial (for time-to-event endpoint) or
#' from contingency table (for binary endpoint)
#' @param trt_ipd a string, name of the interested investigation arm in internal trial \code{dat_igd} (real IPD)
#' @param trt_ipd a string, name of the interested investigation arm in internal trial \code{ipd} (real IPD)
#' @param trt_agd a string, name of the interested investigation arm in external trial \code{pseudo_ipd} (pseudo IPD)
#' @param trt_var_ipd a string, column name in \code{ipd} that contains the treatment assignment
#' @param trt_var_agd a string, column name in \code{ipd} that contains the treatment assignment
#' @param trt_var_agd a string, column name in \code{pseudo_ipd} that contains the treatment assignment
#' @param endpoint_type a string, one of the following: "binary", "tte" (time to event)
#' @param eff_measure a string, "RD" (risk difference), "OR" (odds ratio), "RR" (relative risk)
#' for a binary endpoint; "HR" for a time-to-event endpoint. By default is \code{NULL}, "OR" is used for binary case,
@@ -161,7 +161,6 @@ maic_unanchored <- function(weights_object,
coxobj_dat <- coxph(Surv(TIME, EVENT) ~ ARM, dat, robust = TRUE)
coxobj_dat_adj <- coxph(Surv(TIME, EVENT) ~ ARM, dat, weights = weights, robust = TRUE)

browser()
res$inferential[["coxph_before"]] <- coxobj_dat
res$inferential[["coxph_after"]] <- coxobj_dat_adj

3 changes: 3 additions & 0 deletions inst/WORDLIST
@@ -1,3 +1,4 @@
interpretable
ADTTE
ADaM
AgD
@@ -52,3 +53,5 @@ signorovitch
tte
unanchored
unscaled
Liu
Guyot
4 changes: 2 additions & 2 deletions man/maic_unanchored.Rd


14 changes: 14 additions & 0 deletions vignettes/biomedicine.csl
@@ -0,0 +1,14 @@
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
<!-- Elsevier, generated from "elsevier" metadata at https://github.com/citation-style-language/journals -->
<info>
<title>BioMedicine</title>
<id>http://www.zotero.org/styles/biomedicine</id>
<link href="http://www.zotero.org/styles/biomedicine" rel="self"/>
<link href="http://www.zotero.org/styles/elsevier-vancouver" rel="independent-parent"/>
<category citation-format="numeric"/>
<issn>2211-8020</issn>
<updated>2018-03-09T05:06:46+00:00</updated>
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
</info>
</style>
193 changes: 193 additions & 0 deletions vignettes/calculating_weights.Rmd
@@ -0,0 +1,193 @@
---
title: "Preprocessing and Calculating Weights"
date: "`r Sys.Date()`"
output:
html_document: default
pdf_document: default
knit: (function(inputFile, encoding) {
rmarkdown::render(inputFile, encoding = encoding,
output_format = "all") })
bibliography: references.bib
csl: biomedicine.csl
vignette: >
%\VignetteIndexEntry{Calculating Weights}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

<style type="text/css">

body{ /* Normal */
font-size: 14px;
}
td { /* Table */
font-size: 10px;
}
h1.title {
font-size: 38px;
}
h1 { /* Header 1 */
font-size: 28px;
}
h2 { /* Header 2 */
font-size: 22px;
}
h3 { /* Header 3 */
font-size: 18px;
}
code.r{ /* Code block */
font-size: 12px;
}
pre { /* Code block - determines code spacing between lines */
font-size: 14px;
}
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```


# Loading R packages

```{r}
# install.packages("maicplus")
library(maicplus)
```

Additional R packages for this vignette:

```{r}
library(dplyr)
```

# Preprocessing

## Preprocessing IPD

This example reads in and combines data from three standard simulated data sets (adsl, adrs and adtte) which are saved as '.csv' files.
> **Reviewer comment:** This paragraph should be before the previous one.


In this example scenario, age, sex, Eastern Cooperative Oncology Group (ECOG) performance status, smoking status, and number of previous treatments have been identified as imbalanced prognostic variables/effect modifiers.

```{r}
adsl <- read.csv(system.file("extdata", "adsl.csv",
package = "maicplus",
mustWork = TRUE
))
adrs <- read.csv(system.file("extdata", "adrs.csv",
package = "maicplus",
mustWork = TRUE
))
adtte <- read.csv(system.file("extdata", "adtte.csv", package = "maicplus", mustWork = TRUE))

# Data containing the matching variables
adsl <- adsl %>%
mutate(SEX_MALE = ifelse(SEX == "Male", 1, 0)) %>%
mutate(AGE_SQUARED = AGE^2)

# Could use built-in function for dummizing variables
# adsl <- dummize_ipd(adsl, dummize_cols=c("SEX"), dummize_ref_level=c("Female"))

# Response data
adrs <- adrs %>%
dplyr::filter(PARAM == "Response") %>%
transmute(USUBJID, ARM, RESPONSE = AVAL, PARAM)

# Time to event data (overall survival)
adtte <- adtte %>%
dplyr::filter(PARAMCD == "OS") %>%
mutate(EVENT = 1 - CNSR) %>%
transmute(USUBJID, ARM, TIME = AVAL, EVENT)

# Rename adsl as ipd
ipd <- adsl
head(ipd)
```

## Preprocessing aggregate data

There are two methods for specifying aggregate data. The first method involves importing aggregate data via an Excel spreadsheet. Within the spreadsheet, variable types such as mean, median, or standard deviation are possible for continuous variables, while count or proportion are possible for binary variables. Each variable should be suffixed accordingly: _COUNT, _MEAN, _MEDIAN, _SD, or _PROP. Subsequently, the `process_agd` function will convert counts into proportions.

The second method entails defining a data frame of aggregate data in R. When using this approach, the _COUNT suffix should be omitted, and only proportions are permissible for binary variables. Other suffixes remain consistent with the first method.

Any missing values in binary variables should be addressed by adjusting the denominator to account for missing counts, i.e., the proportion equals the count divided by (N - missing).

```{r}
# Through an excel spreadsheet
# target_pop <- read.csv(system.file("extdata","aggregate_data_example_1.csv", package = "maicplus", mustWork = TRUE))
# agd <- process_agd(target_pop)

# Second approach by defining a data frame in R
agd <- data.frame(
AGE_MEAN = 51,
AGE_SD = 3.25,
SEX_MALE_PROP = 147 / 300,
ECOG0_PROP = 0.40,
SMOKE_PROP = 58 / (300 - 5),
N_PR_THER_MEDIAN = 2
)
```

### How the _SD suffix is handled

As outlined in NICE DSU TSD 18 Appendix D [@phillippo2016b], balancing on both mean and standard deviation for continuous variables may be necessary in certain scenarios. When a standard deviation is provided in the comparator population, preprocessing involves calculating $E(X^2)$ in the target population (i.e. aggregate data) using the variance formula $Var(X)=E(X^{2})-E(X)^{2}$. This calculated $E(X^2)$ in the target population is then aligned with $X^{2}$ computed in the internal IPD.
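A minimal sketch of this preprocessing step, using the AGE summaries from the aggregate data above (the variable names here are illustrative, not the package's internals):

```r
# Recover E(X^2) in the aggregate population from its mean and SD:
# Var(X) = E(X^2) - E(X)^2  =>  E(X^2) = SD^2 + mean^2
agd_age_mean <- 51
agd_age_sd <- 3.25
agd_age_squared_mean <- agd_age_sd^2 + agd_age_mean^2
agd_age_squared_mean # matched against AGE_SQUARED (= AGE^2) computed in the IPD
```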

### How the _MEDIAN suffix is handled

When a median is provided, the IPD is preprocessed to categorize the variable into a binary form: values in the IPD exceeding the comparator population median are assigned 1, while values at or below the median are assigned 0. The comparator population median is then replaced by 0.5 to match the binary categorization in the IPD. Subsequently, the newly formed binary IPD variable is aligned to ensure a weighted proportion of 0.5.
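As a sketch of the dichotomization step (illustrative values; the median is taken from the aggregate data defined above):

```r
# Dichotomize an IPD variable at the aggregate-data median:
# values above the median become 1, values at or below become 0;
# the weighted proportion of 1s is then matched to 0.5
agd_median <- 2 # e.g. N_PR_THER_MEDIAN from the aggregate data
ipd_values <- c(1, 3, 2, 4, 0, 5) # illustrative IPD values
ipd_binary <- as.numeric(ipd_values > agd_median)
ipd_binary # 0 1 0 1 0 1
```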

## Centering IPD

In the introduction vignette, we explain why centering the IPD variables using aggregate data means is needed when calculating weights. The function `center_ipd` centers the IPD using the aggregate data means.

```{r}
ipd_centered <- center_ipd(ipd = ipd, agd = agd)
head(ipd_centered)
```

# Calculating weights

We utilize the centered IPD and employ the `estimate_weights` function to compute the weights. Prior to executing this function, it's essential to specify the centered columns, i.e., the covariates to be utilized in the optimization process.

```{r}
# list variables that are going to be used to match
centered_colnames <- c("AGE", "AGE_SQUARED", "SEX_MALE", "ECOG0", "SMOKE", "N_PR_THER_MEDIAN")
centered_colnames <- paste0(centered_colnames, "_CENTERED")

match_res <- estimate_weights(
data = ipd_centered,
centered_colnames = centered_colnames
)

# Alternatively, you can specify the numeric column locations for centered_colnames
# match_res <- estimate_weights(ipd_centered, centered_colnames = c(14, 16:20))
```

Following the calculation of weights, it is necessary to determine whether the optimization procedure has worked correctly and whether the weights derived are sensible.

The approximate effective sample size is calculated as: $$ ESS = \frac{({ \sum_{i=1}^n\hat{\omega}_i })^2}{ \sum_{i=1}^n \hat{\omega}^2_i} $$ A small ESS, relative to the original sample size, is an indication that the weights are highly variable and that the estimate may be unstable. This often occurs if there is very limited overlap in the distribution of the matching variables between the populations being compared.
> **Reviewer comment:** Is this how ESS is calculated in the Signorovitch article? If so, we should mention that the ESS, as defined in Signorovitch et al., is derived from the estimates using linear combinations of the observations. Effective sample size cannot be easily calculated when utilizing weighted survival estimates because survival estimates are not a linear function of the observations.


In this example, the ESS is reduced by 66.73% relative to the total number of patients in the intervention arm (500 patients in total). As this is a considerable reduction, estimates using this weighted data may be unreliable.
> **Reviewer comment:** Can we put a sentence explaining what this 66.73% means in simple words, i.e., ESS reflects the fraction of the original sample contributing to the adjusted outcome, and large reductions in ESS may indicate poor overlap between the IPD and AgD studies? (Also a potential advancement for our package: https://doi.org/10.1002/jrsm.1466)


```{r}
match_res$ess
```
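The ESS formula above can also be evaluated directly from a weight vector. A minimal standalone sketch (an illustrative helper, not the package's implementation):

```r
# Effective sample size: (sum of weights)^2 / sum of squared weights
ess_from_weights <- function(w) sum(w)^2 / sum(w^2)

ess_from_weights(rep(1, 500))          # equal weights: ESS equals n = 500
ess_from_weights(c(rep(0.1, 499), 50)) # one dominant weight: ESS collapses
```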

Also, it is useful to visualize the weights using a histogram to check for any extreme weights. Scaled weights are weights expressed relative to the original unit weight of each individual, and are calculated as $$\tilde{w}_i = \frac{\hat{w}_i}{\sum_{i=1}^n \hat{w}_i} \times n.$$
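A small sketch of this rescaling (illustrative weights; not the package's internal code):

```r
# Scaled weights sum to n, so a scaled weight of 1 corresponds to the
# original unit weight of one individual
scale_weights <- function(w) w / sum(w) * length(w)

w_scaled <- scale_weights(c(0.2, 1.0, 1.3, 0.5, 2.0))
mean(w_scaled) # 1 by construction
```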

```{r}
plot(match_res)

# ggplot option is also available
# plot(match_res, ggplot = TRUE, bin_col = "black", vline_col = "red")
```

Another check is to look at whether the weighted summary of covariates in the internal IPD matches the external aggregate data summary.

```{r}
outdata <- check_weights(match_res, agd)
outdata
```

# References
121 changes: 121 additions & 0 deletions vignettes/introduction.Rmd
@@ -0,0 +1,121 @@
---
title: "Introduction"
date: "`r Sys.Date()`"
output:
html_document: default
pdf_document: default
knit: (function(inputFile, encoding) {
rmarkdown::render(inputFile, encoding = encoding,
output_format = "all") })
bibliography: references.bib
csl: biomedicine.csl
vignette: >
%\VignetteIndexEntry{Introduction}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

<style type="text/css">

body{ /* Normal */
font-size: 14px;
}
td { /* Table */
font-size: 10px;
}
h1.title {
font-size: 38px;
}
h1 { /* Header 1 */
font-size: 28px;
}
h2 { /* Header 2 */
font-size: 22px;
}
h3 { /* Header 3 */
font-size: 18px;
}
code.r{ /* Code block */
font-size: 12px;
}
pre { /* Code block - determines code spacing between lines */
font-size: 14px;
}
</style>

# Introduction

> **Reviewer comment:** I would suggest using the word "participant" instead of the words "patient" and "subject".


Health technology assessments and appraisals necessitate dependable estimations of relative treatment effects to guide reimbursement determinations. In instances where direct comparative evidence is lacking, yet both treatments under scrutiny have been separately evaluated against a shared comparator (e.g., placebo or standard care), a conventional indirect comparison can be conducted utilizing published aggregate data from each study.

This document outlines the procedures for conducting a matching-adjusted indirect comparison (MAIC) analysis using the maicplus package in R. MAIC is suitable when individual patient data from one trial and aggregate data from another are accessible. The analysis focuses on endpoints such as time-to-event (e.g., overall survival) or binary outcomes (e.g., objective tumor response).

The methodologies detailed herein are based on the original work by Signorovitch et al. (2010) and further elucidated in the National Institute for Health and Care Excellence (NICE) Decision Support Unit (DSU) Technical Support Document (TSD) 18. [@signorovitch2010; @phillippo2016a]

A clinical trial lacking a common comparator treatment to link it with other trials is termed an unanchored MAIC. Without a common comparator, it becomes challenging to directly compare the outcomes of interest between different treatments or interventions. Conversely, if a common comparator is available, it is termed an anchored MAIC. Anchored MAIC offers certain advantages over unanchored MAIC, as it can provide more reliable and interpretable results by reducing the uncertainty associated with indirect comparisons.

MAIC methods aim to adjust for between-study differences in patient demographics or disease characteristics at baseline. In scenarios where a common treatment comparator is absent, MAIC assumes that observed differences in absolute outcomes between trials are solely attributable to imbalances in prognostic variables and effect modifiers. This assumption requires that all imbalanced prognostic variables and effect modifiers between the studies are known, which is often challenging to fulfill. [@phillippo2016a]

Various approaches exist for identifying prognostic variables and effect modifiers for use in MAIC analyses. These include clinical consultation with experts, review of published literature, examination of previous regulatory submissions, and data-driven methods such as regression modeling and subgroup analysis to uncover interactions between baseline characteristics and treatment effects.

# Statistical theory behind MAIC

The matching is accomplished by re-weighting patients in the study with individual patient data (IPD) by their odds, or likelihood, of having been enrolled in the study with aggregate data (AgD). The approach is very similar to propensity score weighting, with the difference that IPD is not available for one study, so the usual maximum likelihood approach cannot be used to estimate the parameters. Instead, a method of moments must be used (the method of moments estimates population parameters by matching moments of the sample to those of the target population). After the matching is complete and weights have been attached to the IPD, the weighted outcomes can be estimated and compared across the two studies.

The matching approach can be described as follows: assuming that each trial has one arm, each patient can be characterized by the random triple ($X$, $T$, $Y$), where $X$ represents the baseline characteristics (e.g., age and weight), $T$ represents the treatment of interest (e.g., $T = 0$ for the IPD study and $T = 1$ for the study with AgD), and $Y$ is the outcome of interest (e.g., overall survival).

Each patient is thus characterized by an observed triple ($x_i$, $t_i$, $y_i$), $i = 1, \ldots, n$, but only when IPD is available, i.e., when $t_i = 0$. When $t_i = 1$, only the mean baseline characteristics $\bar{x}_{agg}$ and the mean outcome $\bar{y}_{agg}$ are observed.

Given the observed data, the causal effect of treatment $T = 0$ versus $T = 1$ on the mean of $Y$ can be estimated as follows:

\[
\frac{\sum_{i=1}^{n}y_{i}w_{i}}{\sum_{i=1}^{n}w_{i}}-\bar{y}_{agg}
\]

where $w_i=\frac{Pr(T_i=1\mid x_i)}{Pr(T_i=0\mid x_i)}$ is the odds that patient $i$ received treatment $T=1$ vs $T=0$ (i.e. enrolled in the aggregate data study vs the IPD study) given baseline characteristics $x_i$. Thus, the patients receiving $T=0$ are re-weighted to match the distribution of patients receiving $T=1$. Note that this causal effect applies when the outcome $Y$ is continuous; if the outcome is binary, $Y$ would be a proportion and we would use a link function such as the logit to express the causal effect on the odds ratio scale. As in propensity score methods, we may assume $w_i$ follows the logistic regression form

\[
w_{i}=\exp(x_i^{T}\beta)
\]

To estimate $\beta$, we use the method of moments: we estimate $\beta$ such that the weighted averages of the covariates in the IPD exactly match the aggregate data averages. Mathematically, we estimate $\beta$ such that:

\[
0=\frac{\sum_{i=1}^{n}x_{i}\exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}\exp(x_i^{T}\hat{\beta})}-\bar{x}_{agg}
\]

This equation is equivalent to

\[
0=\sum_{i=1}^{n}(x_{i}-\bar{x}_{agg})\exp(x_{i}^{T}\hat{\beta})
\]

> **Reviewer comment:** If this formula refers to the transformation of the IPD by subtracting the aggregate data means, I suggest placing it after the text below.

It is possible to use this estimator because a logistic regression model for the odds of receiving $T = 1$ vs $T = 0$ would, by definition, provide the correct weights for balancing the trial populations. If $x_i$ contains all the confounders and the logistic model for $w_i$ is correctly specified, then $\hat{\theta}$ in the next equation provides a consistent estimate of the causal effect of treatment $T = 0$ vs $T = 1$ on the mean of $Y$.

\[
\hat{\theta}=\frac{\sum_{i=1}^{n}y_{i}\exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}\exp(x_i^{T}\hat{\beta})}-\bar{y}_{agg}
\]
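As a sketch, $\hat{\theta}$ is simply a weighted mean of the IPD outcomes minus the aggregate-data mean (illustrative values, continuous outcome; the helper name is ours, not the package's):

```r
# theta_hat = weighted mean of IPD outcomes - aggregate-data mean outcome
theta_hat <- function(y, w, y_bar_agg) sum(y * w) / sum(w) - y_bar_agg

# With unit weights this reduces to a plain difference in means
theta_hat(y = c(1, 2, 3), w = c(1, 1, 1), y_bar_agg = 2) # 0
```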

To solve the equation set up by the method of moments, we can transform the IPD by subtracting the aggregate data means; then $\bar{x}_{agg}$ equals 0 and the equation simplifies. This is why the IPD is centered in the preprocessing step.

\[
0=\sum_{i=1}^{n}x_{i}\exp(x_{i}^{T}\hat{\beta})
\]

Note that this is the first derivative of

\[
Q(\beta)=\sum_{i=1}^{n}\exp(x_{i}^{T}\beta)
\]

which has second derivative

\[
Q''(\beta)=\sum_{i=1}^{n}x_ix_i^T\exp(x_{i}^{T}\beta)
\]

Since $Q''(\beta)$ is positive-definite for all $\beta$, $Q(\beta)$ is convex and any finite solution from
the equation is unique and corresponds to the global minimum of $Q(\beta)$. Thus, we can use optimization
methods to calculate $\beta$.
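A minimal sketch of this optimization on centered covariates, using simulated data and base R's `optim` with BFGS (not the package's internal routine; the simulated means are arbitrary):

```r
# Estimate beta by minimizing the convex objective
# Q(beta) = sum_i exp(x_i^T beta), where x_i are centered covariates
set.seed(42)
x_centered <- scale(cbind(rnorm(200), rbinom(200, 1, 0.4)),
                    center = c(0.1, 0.5), scale = FALSE) # pretend AgD means

q_objective <- function(beta, x) sum(exp(x %*% beta))

fit <- optim(par = c(0, 0), fn = q_objective,
             # analytic gradient: Q'(beta) = sum_i x_i * exp(x_i^T beta)
             gr = function(beta, x) colSums(x * as.vector(exp(x %*% beta))),
             x = x_centered, method = "BFGS")
w <- as.vector(exp(x_centered %*% fit$par))

# Moment check: weighted means of the centered covariates are ~0,
# i.e. the weighted IPD matches the aggregate-data means
colSums(x_centered * w) / sum(w)
```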

# References