vignette.Rmd

---
title: "Estimating Average Treatment Effect with ICDML and Bootstrap Confidence Intervals"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(data.table)
library(xgboost)
```


This vignette demonstrates the process of estimating the Average Treatment Effect (ATE) using Calibrated Debiased Machine Learning (C-DML). We'll start by generating synthetic data, fitting nuisance models using machine learning algorithms, and finally applying the C-DML estimation procedure with ensemble learning.


```{r}
# Check if CDML package is installed; if not, install it from GitHub
if(!require(CDML)) {
  devtools::install_github("Larsvanderlaan/CDML")
  library(CDML)
}

```

# Generate data

To start, we generate synthetic data for a causal inference problem. Our data consists of covariates \(X\), a binary treatment indicator \(A\), and an outcome \(Y\) influenced by both \(X\) and \(A\). This setup allows us to simulate a scenario where we estimate the causal effect of \(A\) on \(Y\).


```{r}
set.seed(123)

# Sample size
n <- 1000

# Generate covariates
X <- matrix(rnorm(n * 5), n, 5)
colnames(X) <- paste0("X", 1:5)

# Generate binary treatment indicator A based on a logistic model of X
pi1_true <- plogis(X %*% rep(0.3, 5))  # Linear predictor for logistic model
A <- rbinom(n, 1, pi1_true)

# True outcome model
Y <- as.vector(2 * A + X %*% rep(0.5, 5) + rnorm(n))

# Predicted outcome regressions (for demonstration, using true model)
mu1_true <- 2 + X %*% rep(0.5, 5)  # Expected outcome if A = 1
mu0_true <- X %*% rep(0.5, 5)       # Expected outcome if A = 0


print(mean(mu1_true - mu0_true))
```

## Fit initial nuisance models 

Next, we fit initial models for nuisance parameters, which include the propensity score (probability of receiving treatment) and outcome regression models (predicted outcomes under treatment and control). We use the \texttt{sl3} package for flexible machine learning modeling with cross-fitting to reduce bias.


```{r}
if(!require(sl3)) {
  devtools::install_github("tlverse/sl3", ref="devel")
  library(sl3)
}


# Define the data as a sl3 task for propensity score
task_A <- sl3::make_sl3_Task(data = data.table(cbind(X, A = A)), covariates = colnames(X), outcome = "A", outcome_type = "binomial")
task_Y <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = A), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")


# Define learners for propensity score and outcome regression
# Stack includes XGBoost and GAM learners
# lrnr_cv performs cross-fitting
lrnr_stack <- Stack$new(list(Lrnr_xgboost$new(max_depth = 4), Lrnr_gam$new()))
lrnr_cv <- Lrnr_cv$new(lrnr_stack)
lrnr_selector <- make_learner(Pipeline, lrnr_cv, Lrnr_cv_selector$new(loss_squared_error))

# Fit propensity score model (predicting A based on X)
propensity_fit <- lrnr_selector$train(task_A)
pi1 <- propensity_fit$predict(task_A)  # Predictions for propensity scores
pi0 = 1 - pi1

# Define two outcome regression tasks: one with A set to 1, one with A set to 0
task_Y1 <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = 1), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")
task_Y0 <- sl3::make_sl3_Task(data = data.table(X, Y = Y, A = 0), covariates = c(colnames(X), "A"), outcome = "Y", outcome_type = "continuous")

# Fit outcome regression model using original task
outcome_fit <- lrnr_selector$train(task_Y)

# Predict outcomes under treatment (A = 1) for all observations
mu1 <- outcome_fit$predict(task_Y1)

# Predict outcomes under control (A = 0) for all observations
mu0 <- outcome_fit$predict(task_Y0)

# Check some of the predictions
head(pi1)
head(mu1)
head(mu0)


```

## CDML with specified nuisance estimates

With the predicted nuisance estimates (propensity scores and outcome regressions), we now apply the CDML estimation procedure to estimate the causal effect of treatment on the outcome.


```{r}

output <- cdml(W=X,A,Y, pi_mat = cbind(pi1, pi0), mu_mat = cbind(mu1, mu0))
print(output)

```


# CDML with ensemble learning

The `cdml` function provides a user-friendly interface that integrates machine learning with the `sl3` ensemble learning framework. `sl3` is part of the **`tlverse`** ecosystem, providing a flexible framework for **Super Learner ensembles**—a method that optimally combines multiple algorithms to create robust predictive models. Leveraging cross-validation to evaluate and ensemble diverse learners, `sl3` is particularly suited for **Causal Doubly Robust Machine Learning (CDML)** applications. Its extensible architecture supports both built-in and custom learners, making it powerful for causal inference and predictive modeling. For more, see [tlverse.org/sl3](https://tlverse.org/sl3/).


```{r}

# default learners
output <- cdml(X,A,Y)
print(output)

# custom learners
library(sl3)
learners = list(
  Lrnr_glmnet$new(),
  Lrnr_earth$new(degree = 2),
  Lrnr_gam$new(),
  Lrnr_ranger$new(max.depth = 8),
  Lrnr_xgboost_early_stopping$new(max_depth = 3),
  Lrnr_xgboost_early_stopping$new(max_depth = 4),
  Lrnr_xgboost_early_stopping$new(max_depth = 5)
)

output <- cdml(W=X,A,Y, learners_treatment = learners, learners_outcome = learners)
print(output)


```


# CDML with internal functions

 
## C-DML for ATE and bootstrap CIs

```{r}
## CDML Example


# Estimate the Average Treatment Effect (ATE) using ICDML estimator
# Inputs: 
# - A: binary treatment indicator
# - Y: observed outcomes
# - mu1, mu0: predicted outcomes for treatment (A = 1) and control (A = 0)
# - pi1, pi0: propensity scores for treatment and control groups, respectively
estimate <- estimate_cdml_ate(A, Y, mu1, mu0, pi1, pi0, weights = NULL) 


# Print the estimated causal effect
print(estimate)

# Estimate the ATE with a bootstrap confidence interval using ICDML
# The function provides both the point estimate and a confidence interval
bootstrap_result <- bootstrap_cdml_ate(A, Y, mu1, mu0, pi1, pi0, weights = NULL) 

# Print the bootstrap results: includes ATE estimate and confidence interval
print(bootstrap_result)


```


# Isotonic calibration of nuisance functions for C-DML


To directly calibrate the inverse propensity score weights and outcome regression using isotonic regression, follow the steps below. This will allow you to construct isotonic calibrated nuisance estimators, which can be used in the C-DML framework to target various causal parameters.


The following code demonstrates how to calibrate the inverse propensity weights and outcome regression using isotonic regression with the CDML package.


```{r}

# Isotonic calibration of inverse propensity score weights and outcome regression
# Assuming that you have already computed the following:
# - propensity_scores: estimated propensity scores
# - mu1: predicted outcomes for the treatment group (A = 1)
# - mu0: predicted outcomes for the control group (A = 0)
# - A: treatment indicator
# - Y: observed outcomes

# Calibrate inverse propensity score weights
calibrated_weights <- calibrate_inverse_weights(A = A, pi1 = pi1, pi0 = pi0)
alpha1_star <- calibrated_weights$alpha1_star  # Calibrated weights for treatment group
alpha0_star <- calibrated_weights$alpha0_star  # Calibrated weights for control group

# Calibrate outcome regression predictions
calibrated_outcomes <- calibrate_outcome_regression(Y = Y, mu1 = mu1, mu0 = mu0, A = A)
mu1_star <- calibrated_outcomes$mu1_star  # Calibrated outcome predictions for treatment group
mu0_star <- calibrated_outcomes$mu0_star  # Calibrated outcome predictions for control group

# Check calibrated values
head(alpha1_star)
head(alpha0_star)
head(mu1_star)
head(mu0_star)

plot(1/pi1_true, alpha1_star)

plot(mu1_true, mu1_star)
```