89 add vignette #90
@@ -1,3 +1,4 @@ | ||
interpretable | ||
ADTTE | ||
ADaM | ||
AgD | ||
|
@@ -52,3 +53,5 @@ signorovitch | |
tte | ||
unanchored | ||
unscaled | ||
Liu | ||
Guyot |
@@ -0,0 +1,14 @@ | ||
<?xml version="1.0" encoding="utf-8"?> | ||
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US"> | ||
<!-- Elsevier, generated from "elsevier" metadata at https://github.com/citation-style-language/journals --> | ||
<info> | ||
<title>BioMedicine</title> | ||
<id>http://www.zotero.org/styles/biomedicine</id> | ||
<link href="http://www.zotero.org/styles/biomedicine" rel="self"/> | ||
<link href="http://www.zotero.org/styles/elsevier-vancouver" rel="independent-parent"/> | ||
<category citation-format="numeric"/> | ||
<issn>2211-8020</issn> | ||
<updated>2018-03-09T05:06:46+00:00</updated> | ||
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights> | ||
</info> | ||
</style> |
@@ -0,0 +1,193 @@ | ||
--- | ||
title: "Preprocessing and Calculating Weights" | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_document: default | ||
pdf_document: default | ||
knit: (function(inputFile, encoding) { | ||
rmarkdown::render(inputFile, encoding = encoding, | ||
output_format = "all") }) | ||
bibliography: references.bib | ||
csl: biomedicine.csl | ||
vignette: > | ||
%\VignetteIndexEntry{Calculating Weights} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
<style type="text/css"> | ||
|
||
body{ /* Normal */ | ||
font-size: 14px; | ||
} | ||
td { /* Table */ | ||
font-size: 10px; | ||
} | ||
h1.title { | ||
font-size: 38px; | ||
} | ||
h1 { /* Header 1 */ | ||
font-size: 28px; | ||
} | ||
h2 { /* Header 2 */ | ||
font-size: 22px; | ||
} | ||
h3 { /* Header 3 */ | ||
font-size: 18px; | ||
} | ||
code.r{ /* Code block */ | ||
font-size: 12px; | ||
} | ||
pre { /* Code block - determines code spacing between lines */ | ||
font-size: 14px; | ||
} | ||
</style> | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set(warning = FALSE, message = FALSE) | ||
``` | ||
|
||
|
||
# Loading R packages | ||
|
||
```{r} | ||
# install.packages("maicplus") | ||
library(maicplus) | ||
``` | ||
|
||
Additional R packages for this vignette: | ||
|
||
```{r} | ||
library(dplyr) | ||
``` | ||
|
||
# Preprocessing | ||
|
||
## Preprocessing IPD | ||
|
||
This example reads in and combines data from three standard simulated data sets (adsl, adrs, and adtte), which are saved as '.csv' files.
|
||
In this example scenario, age, sex, Eastern Cooperative Oncology Group (ECOG) performance status, smoking status, and number of previous treatments have been identified as imbalanced prognostic variables/effect modifiers. | ||
|
||
```{r} | ||
adsl <- read.csv(system.file("extdata", "adsl.csv", | ||
package = "maicplus", | ||
mustWork = TRUE | ||
)) | ||
adrs <- read.csv(system.file("extdata", "adrs.csv", | ||
package = "maicplus", | ||
mustWork = TRUE | ||
)) | ||
adtte <- read.csv(system.file("extdata", "adtte.csv", package = "maicplus", mustWork = TRUE)) | ||
|
||
# Data containing the matching variables | ||
adsl <- adsl %>% | ||
mutate(SEX_MALE = ifelse(SEX == "Male", 1, 0)) %>% | ||
mutate(AGE_SQUARED = AGE^2) | ||
|
||
# Could use built-in function for dummizing variables | ||
# adsl <- dummize_ipd(adsl, dummize_cols=c("SEX"), dummize_ref_level=c("Female")) | ||
|
||
# Response data | ||
adrs <- adrs %>% | ||
dplyr::filter(PARAM == "Response") %>% | ||
transmute(USUBJID, ARM, RESPONSE = AVAL, PARAM) | ||
|
||
# Time to event data (overall survival) | ||
adtte <- adtte %>% | ||
dplyr::filter(PARAMCD == "OS") %>% | ||
mutate(EVENT = 1 - CNSR) %>% | ||
transmute(USUBJID, ARM, TIME = AVAL, EVENT) | ||
|
||
# Rename adsl as ipd | ||
ipd <- adsl | ||
head(ipd) | ||
``` | ||
|
||
## Preprocessing aggregate data | ||
|
||
There are two methods for specifying aggregate data. The first method involves importing aggregate data via an Excel spreadsheet. Within the spreadsheet, variable types such as mean, median, or standard deviation are possible for continuous variables, while count or proportion are possible for binary variables. Each variable should be suffixed accordingly: _COUNT, _MEAN, _MEDIAN, _SD, or _PROP. Subsequently, the `process_agd` function will convert counts into proportions. | ||
|
||
The second method entails defining a data frame of aggregate data directly in R. When using this approach, the _COUNT suffix should be omitted, and only proportions are permissible for binary variables. Other suffixes remain consistent with the first method.
|
||
Any missing values in binary variables should be addressed by adjusting the denominator to account for missing counts, i.e., the proportion equals the count divided by (N - missing). | ||
|
||
```{r} | ||
# Through an excel spreadsheet | ||
# target_pop <- read.csv(system.file("extdata","aggregate_data_example_1.csv", package = "maicplus", mustWork = TRUE)) | ||
# agd <- process_agd(target_pop) | ||
|
||
# Second approach by defining a data frame in R | ||
agd <- data.frame( | ||
AGE_MEAN = 51, | ||
AGE_SD = 3.25, | ||
SEX_MALE_PROP = 147 / 300, | ||
ECOG0_PROP = 0.40, | ||
SMOKE_PROP = 58 / (300 - 5), | ||
N_PR_THER_MEDIAN = 2 | ||
) | ||
``` | ||
|
||
### How the _SD suffix is handled
|
||
As outlined in NICE DSU TSD 18 Appendix D [@phillippo2016b], balancing on both mean and standard deviation for continuous variables may be necessary in certain scenarios. When a standard deviation is provided in the comparator population, preprocessing involves calculating $E(X^2)$ in the target population (i.e. aggregate data) using the variance formula $Var(X)=E(X^{2})-E(X)^{2}$. This calculated $E(X^2)$ in the target population is then aligned with $X^{2}$ computed in the internal IPD. | ||
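As a small numeric illustration (not the package's internal code), $E(X^2)$ in the aggregate data can be recovered from the reported mean and standard deviation and then matched against the squared covariate created in the IPD:

```r
# Illustration only: derive E(X^2) in the aggregate data from the
# reported mean and SD via Var(X) = E(X^2) - E(X)^2
age_mean <- 51
age_sd <- 3.25
age_squared_mean <- age_sd^2 + age_mean^2  # E(X^2) in the target population

# In the IPD, the matching covariate is simply the square of the variable,
# e.g. AGE_SQUARED = AGE^2 created during preprocessing; its weighted mean
# is then balanced against age_squared_mean.
age_squared_mean
```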
|
||
### How the _MEDIAN suffix is handled
|
||
When a median is provided, the IPD is preprocessed to dichotomize the variable: values in the IPD exceeding the comparator population median are assigned 1, and the remaining values are assigned 0. The comparator population median is then replaced by 0.5 to match this binary categorization, and the newly formed IPD binary variable is balanced to a proportion of 0.5.
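A minimal sketch of this dichotomization, assuming the IPD contains a count variable `N_PR_THER` (the column name is illustrative) and that the aggregate data report a median of 2 previous therapies:

```r
# Illustration only: dichotomize an IPD variable at the comparator median
npr_ther_median_agd <- 2  # median reported in the aggregate data

# 1 if above the comparator median, 0 otherwise
ipd$N_PR_THER_MEDIAN <- ifelse(ipd$N_PR_THER > npr_ther_median_agd, 1, 0)

# The aggregate-data median is then replaced by 0.5, so this binary
# variable is balanced to a weighted proportion of 0.5
```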
|
||
## Centering IPD | ||
|
||
In the introduction vignette, we explain why centering the IPD variables using aggregate data means is needed when calculating weights. The function `center_ipd` centers the IPD using the aggregate data means. | ||
|
||
```{r} | ||
ipd_centered <- center_ipd(ipd = ipd, agd = agd) | ||
head(ipd_centered) | ||
``` | ||
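Conceptually, centering subtracts the aggregate data value from each matching variable (using $E(X^2)$ for the squared term). A rough sketch of the idea, using the columns defined above; the actual `center_ipd()` implementation and its handling of edge cases may differ:

```r
# Rough manual equivalent of centering (illustration only; use center_ipd())
ipd_manual <- ipd
ipd_manual$AGE_CENTERED <- ipd$AGE - agd$AGE_MEAN
ipd_manual$AGE_SQUARED_CENTERED <- ipd$AGE_SQUARED - (agd$AGE_SD^2 + agd$AGE_MEAN^2)
ipd_manual$SEX_MALE_CENTERED <- ipd$SEX_MALE - agd$SEX_MALE_PROP
```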
|
||
# Calculating weights | ||
|
||
We utilize the centered IPD and employ the `estimate_weights` function to compute the weights. Prior to executing this function, it's essential to specify the centered columns, i.e., the covariates to be utilized in the optimization process. | ||
|
||
```{r} | ||
# list variables that are going to be used to match | ||
centered_colnames <- c("AGE", "AGE_SQUARED", "SEX_MALE", "ECOG0", "SMOKE", "N_PR_THER_MEDIAN") | ||
centered_colnames <- paste0(centered_colnames, "_CENTERED") | ||
|
||
match_res <- estimate_weights( | ||
data = ipd_centered, | ||
centered_colnames = centered_colnames | ||
) | ||
|
||
# Alternatively, you can specify the numeric column locations for centered_colnames | ||
# match_res <- estimate_weights(ipd_centered, centered_colnames = c(14, 16:20)) | ||
``` | ||
|
||
Following the calculation of weights, it is necessary to determine whether the optimization procedure has worked correctly and whether the weights derived are sensible. | ||
|
||
The approximate effective sample size is calculated as: $$ ESS = \frac{(\sum_{i=1}^n \hat{w}_i)^2}{\sum_{i=1}^n \hat{w}^2_i} $$ A small ESS, relative to the original sample size, is an indication that the weights are highly variable and that the estimate may be unstable. This often occurs if there is very limited overlap in the distribution of the matching variables between the populations being compared.
Review comment: Is this how the ESS is calculated in the Signorovitch article? If so, we should mention that the ESS, as defined in Signorovitch et al., is derived from estimates using linear combinations of the observations. Effective sample size cannot be easily calculated when utilizing weighted survival estimates because survival estimates are not a linear function of the observations.
||
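As a quick hand check of this formula, the ESS and the corresponding percentage reduction can be computed directly from the estimated weights (here `match_res$data$weights` is an assumed location of the individual weights; the actual slot name in `maicplus` may differ):

```r
# Hand computation of the approximate ESS from a weight vector
w <- match_res$data$weights        # assumed location of the estimated weights
ess <- sum(w)^2 / sum(w^2)
ess_reduction <- (1 - ess / length(w)) * 100  # % reduction vs original N
c(ESS = ess, reduction_pct = ess_reduction)
```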
|
||
In this example, the reduction in ESS is 66.73% of the total number of patients in the intervention arm (500 patients in total). As this is a considerable reduction, estimates based on this weighted data may be unreliable.
Review comment: Can we put a sentence explaining what this 66.73% means in simple words, i.e., the ESS reflects the fraction of the original sample contributing to the adjusted outcome, and large reductions in ESS may indicate poor overlap between the IPD and AgD studies? (Also a potential advancement for our package: https://doi.org/10.1002/jrsm.1466)
||
|
||
```{r} | ||
match_res$ess | ||
``` | ||
|
||
It is also useful to visualize the weights using a histogram to check for any extreme weights. Scaled weights are weights expressed relative to the original unit weights of each individual and are calculated as $$\tilde{w}_i = \frac{\hat{w}_i}{\sum_{i=1}^n \hat{w}_i} \times n.$$
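For reference, the scaled weights can also be computed by hand from the same weight vector `w` assumed above:

```r
# Scaled weights: relative to the original unit weights of each individual
w_scaled <- w / sum(w) * length(w)
summary(w_scaled)
```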
|
||
```{r} | ||
plot(match_res) | ||
|
||
# ggplot option is also available | ||
# plot(match_res, ggplot = TRUE, bin_col = "black", vline_col = "red") | ||
``` | ||
|
||
Another check is to examine whether the weighted summary of the covariates in the internal IPD matches the external aggregate data summary.
|
||
```{r} | ||
outdata <- check_weights(match_res, agd) | ||
outdata | ||
``` | ||
|
||
# References |
@@ -0,0 +1,121 @@ | ||
--- | ||
title: "Introduction" | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_document: default | ||
pdf_document: default | ||
knit: (function(inputFile, encoding) { | ||
rmarkdown::render(inputFile, encoding = encoding, | ||
output_format = "all") }) | ||
bibliography: references.bib | ||
csl: biomedicine.csl | ||
vignette: > | ||
%\VignetteIndexEntry{Introduction} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
<style type="text/css"> | ||
|
||
body{ /* Normal */ | ||
font-size: 14px; | ||
} | ||
td { /* Table */ | ||
font-size: 10px; | ||
} | ||
h1.title { | ||
font-size: 38px; | ||
} | ||
h1 { /* Header 1 */ | ||
font-size: 28px; | ||
} | ||
h2 { /* Header 2 */ | ||
font-size: 22px; | ||
} | ||
h3 { /* Header 3 */ | ||
font-size: 18px; | ||
} | ||
code.r{ /* Code block */ | ||
font-size: 12px; | ||
} | ||
pre { /* Code block - determines code spacing between lines */ | ||
font-size: 14px; | ||
} | ||
</style> | ||
|
||
# Introduction | ||
Review comment: Suggested text to improve clarity and flow.

Review comment: I would suggest using the word "participant" instead of the words "patient" and "subject".
||
|
||
Health technology assessments and appraisals necessitate dependable estimations of relative treatment effects to guide reimbursement determinations. In instances where direct comparative evidence is lacking, yet both treatments under scrutiny have been separately evaluated against a shared comparator (e.g., placebo or standard care), a conventional indirect comparison can be conducted utilizing published aggregate data from each study. | ||
|
||
This document outlines the procedures for conducting a matching-adjusted indirect comparison (MAIC) analysis using the maicplus package in R. MAIC is suitable when individual patient data from one trial and aggregate data from another are accessible. The analysis focuses on endpoints such as time-to-event (e.g., overall survival) or binary outcomes (e.g., objective tumor response). | ||
|
||
The methodologies detailed herein are based on the original work by Signorovitch et al. (2010) and further elucidated in the National Institute for Health and Care Excellence (NICE) Decision Support Unit (DSU) Technical Support Document (TSD) 18. [@signorovitch2010; @phillippo2016a] | ||
|
||
A clinical trial lacking a common comparator treatment to link it with other trials is termed an unanchored MAIC. Without a common comparator, it becomes challenging to directly compare the outcomes of interest between different treatments or interventions. Conversely, if a common comparator is available, it is termed an anchored MAIC. Anchored MAIC offers certain advantages over unanchored MAIC, as it can provide more reliable and interpretable results by reducing the uncertainty associated with indirect comparisons. | ||
|
||
MAIC methods aim to adjust for between-study differences in patient demographics or disease characteristics at baseline. In scenarios where a common treatment comparator is absent, MAIC assumes that observed differences in absolute outcomes between trials are solely attributable to imbalances in prognostic variables and effect modifiers. This assumption requires that all imbalanced prognostic variables and effect modifiers between the studies are known, which is often challenging to fulfill. [@phillippo2016a] | ||
|
||
Various approaches exist for identifying prognostic variables and effect modifiers for use in MAIC analyses. These include clinical consultation with experts, review of published literature, examination of previous regulatory submissions, and data-driven methods such as regression modeling and subgroup analysis to uncover interactions between baseline characteristics and treatment effects. | ||
|
||
# Statistical theory behind MAIC | ||
|
||
The matching is accomplished by re-weighting patients in the study with individual patient data (IPD) by their odds, or likelihood, of having been enrolled in the study with the aggregate data (AgD). We use the term “likelihood” because we seek the parameter values that maximize the probability of observing the data. The approach is very similar to propensity score weighting, with the difference that IPD is not available for one study, so the usual maximum likelihood approach cannot be used to estimate the parameters. Instead, a method of moments must be used (the method of moments is a statistical method for estimating population parameters). After the matching is complete and weights have been added to the IPD, it is possible to estimate the weighted outcomes and compare the results across the two studies.
|
||
The matching approach can be described as follows: assuming that each trial has one arm, each patient can be characterized by the random triple ($X$, $T$, $Y$), where $X$ represents the baseline characteristics (e.g., age and weight), $T$ represents the treatment of interest (e.g., $T = 0$ for the IPD study and $T = 1$ for the study with AgD), and $Y$ is the outcome of interest (e.g., overall survival).
|
||
Each patient is characterized by a random triple ($x_i$, $t_i$, $y_i$) for $i = 1, \ldots, n$, but this is observed only when IPD is available, i.e., when $t_i = 0$. When $t_i = 1$, only the mean baseline characteristics $\bar{x}_{agg}$ and the mean outcome $\bar{y}_{agg}$ are observed.
|
||
Given the observed data, the causal effect of treatment $T = 0$ versus $T = 1$ on the mean of $Y$ can be estimated as follows: | ||
|
||
\[ | ||
\frac{\sum_{i=1}^{n}y_{i}w_{i}}{\sum_{i=1}^{n}w_{i}}-\bar{y}_{agg} | ||
\] | ||
|
||
where $w_i=\frac{Pr(T_i=1\mid x_i)}{Pr(T_i=0\mid x_i)}$ is the odds that patient $i$ received treatment $T=1$ vs $T=0$ (i.e., enrolled in the aggregate data study vs the IPD study) given baseline characteristics $x_i$. Thus, the patients receiving $T=0$ are re-weighted to match the distribution of patients receiving $T=1$. Note that this form of the causal effect applies when the outcome $Y$ is continuous. If the outcome is binary, $Y$ would be a proportion and we would use a link function such as the logit to express the causal effect on an odds ratio scale. As in propensity score methods, we may assume $w_i$ to follow a logistic regression form
|
||
\[ | ||
w_{i}=exp(x_i^{T}\beta) | ||
\] | ||
|
||
In order to estimate $\beta$, we use the method of moments. We estimate $\beta$ such that the weighted averages of the covariates in the IPD exactly match the aggregate data averages. Mathematically speaking, we estimate $\beta$ such that:
|
||
\[ | ||
0=\frac{\sum_{i=1}^{n}x_{i}exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}exp(x_i^{T}\hat{\beta})}-\bar{x}_{agg} | ||
\] | ||
|
||
This equation is equivalent to | ||
|
||
\[
0=\sum_{i=1}^{n}(x_{i}-\bar{x}_{agg})exp(x_{i}^{T}\hat{\beta})
\]

Review comment: If this formula refers to the transformation of the IPD by subtracting the aggregate data means, I suggest placing it after the text below.
|
||
It is possible to use this estimator since a logistic regression model for the odds of receiving $T = 1$ vs $T = 0$ would, by definition, provide the correct weights for balancing the trial populations. If $x_i$ contains all the confounders and the logistic model for $w_i$ is correctly specified, then $\hat{\theta}$ in the next equation provides a consistent estimate of the causal effect of treatment $T = 0$ vs $T = 1$ on the mean of $Y$ among patients.
|
||
\[ | ||
\hat{\theta}=\frac{\sum_{i=1}^{n}y_{i}exp(x_i^{T}\hat{\beta})}{\sum_{i=1}^{n}exp(x_i^{T}\hat{\beta})}-\bar{y}_{agg} | ||
\] | ||
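In code, this estimator is simply a weighted mean of the IPD outcome minus the aggregate data mean. A sketch with placeholder objects (`y` for the IPD outcome, `w` for the estimated weights $exp(x_i^{T}\hat{\beta})$, and `y_agg` for the reported aggregate mean):

```r
# Sketch of the causal-effect estimator for a continuous outcome
theta_hat <- sum(y * w) / sum(w) - y_agg
# equivalently
theta_hat <- weighted.mean(y, w) - y_agg
```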
|
||
In order to solve the equation set up by the method of moments, we can transform the IPD by subtracting the aggregate data means. Then $\bar{x}_{agg}$ equals 0 and the equation simplifies. This is why the IPD is centered in the preprocessing step.
|
||
\[ | ||
0=\sum_{i=1}^{n}x_{i}exp(x_{i}^{T}\hat{\beta}) | ||
\] | ||
|
||
Note that this is the first derivative of | ||
|
||
\[ | ||
Q(\beta)=\sum_{i=1}^{n}exp(x_{i}^{T}\beta)
\] | ||
|
||
which has second derivative | ||
|
||
\[ | ||
Q''(\beta)=\sum_{i=1}^{n}x_{i}x_{i}^{T}exp(x_{i}^{T}\beta)
\] | ||
|
||
Since $Q''(\beta)$ is positive-definite for all $\beta$, $Q(\beta)$ is convex and any finite solution from | ||
the equation is unique and corresponds to the global minimum of $Q(\beta)$. Thus, we can use optimization | ||
methods to calculate $\beta$. | ||
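A bare-bones sketch of this optimization on centered IPD covariates, purely to illustrate the theory (here `X_centered` is a placeholder for an n-by-p matrix of covariates with the aggregate data means already subtracted; `estimate_weights()` carries out the equivalent steps internally):

```r
# Method-of-moments fit by minimizing the convex objective Q(beta)
objective <- function(beta, X) sum(exp(X %*% beta))

fit <- optim(
  par    = rep(0, ncol(X_centered)),  # starting values
  fn     = objective,
  X      = X_centered,
  method = "BFGS"
)

# Weights are the exponentiated linear predictor at the optimum
w_hat <- exp(X_centered %*% fit$par)

# Check the balancing condition: weighted covariate means should be ~0
colSums(X_centered * as.vector(w_hat)) / sum(w_hat)
```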
|
||
# References |
Review comment: This paragraph should be before the previous one.