| Title: | Semi-Supervised Estimation of Average Treatment Effects |
|---|---|
| Description: | Provides a pipeline for estimating the average treatment effect via semi-supervised learning. Outcome regression is fit with cross-fitting using various machine learning method or user customized function. Doubly robust ATE estimation leverages both labeled and unlabeled data under a semi-supervised missing-data framework. For more details see Hou et al. (2021) <doi:10.48550/arxiv.2110.12336>. A detailed vignette is included. |
| Authors: | Jue Hou [aut, cre], Yuming Zhang [aut], Shuheng Kong [aut] |
| Maintainer: | Jue Hou <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.5 |
| Built: | 2026-05-26 07:52:56 UTC |
| Source: | https://github.com/cran/SMMAL |
Estimate Average Treatment Effect (ATE) via Semi-Supervised Learning
ate.SSL( Y, A, R, mu1, mu0, pi1, pi0, imp.A, imp.A1Y1, imp.A0Y1, min.pi = 0.05, max.pi = 0.95 )ate.SSL( Y, A, R, mu1, mu0, pi1, pi0, imp.A, imp.A1Y1, imp.A0Y1, min.pi = 0.05, max.pi = 0.95 )
Y |
Numeric vector. Observed outcomes for labeled data (with missing values for unlabelled). |
A |
Numeric vector. Treatment indicator (1 for treated, 0 for control). |
R |
Logical or binary vector. Indicator for labeled data (1 if labeled, 0 if not). |
mu1 |
Numeric vector. Estimated outcome regression |
mu0 |
Numeric vector. Estimated outcome regression |
pi1 |
Numeric vector. Estimated propensity scores |
pi0 |
Numeric vector. Estimated propensity scores |
imp.A |
Numeric vector. Estimated treatment probabilities using surrogate covariates |
imp.A1Y1 |
Numeric vector. Imputed |
imp.A0Y1 |
Numeric vector. Imputed |
min.pi |
Numeric. Lower bound to truncate estimated propensity scores (default = 0.05). |
max.pi |
Numeric. Upper bound to truncate estimated propensity scores (default = 0.95). |
This function estimates the ATE in a semi-supervised setting, where outcomes are only observed for a subset of the sample. Surrogate variables and imputed models are used to leverage information from unlabelled data.
A list containing:
Estimated ATE.
Estimated standard error of ATE.
set.seed(123) N <- 400 n <- 200 # Number of labeled observations labeled_indices <- sample(1:N, n) # Generate covariates and treatment X <- rnorm(N) A <- rbinom(N, 1, plogis(X)) # True potential outcomes Y0_true <- X + rnorm(N) Y1_true <- X + 1 + rnorm(N) # Observed outcomes Y_full <- ifelse(A == 1, Y1_true, Y0_true) # Only labeled samples have observed Y Y <- rep(NA, N) Y[labeled_indices] <- Y_full[labeled_indices] R <- rep(0, N); R[labeled_indices] <- 1 # Nuisance parameter estimates (can be replaced by actual model predictions) mu1 <- X + 0.5 mu0 <- X - 0.5 pi1 <- plogis(X) pi0 <- 1 - pi1 imp.A <- plogis(X) imp.A1Y1 <- plogis(X) * (X + 0.5) imp.A0Y1 <- (1 - plogis(X)) * (X - 0.5) # Estimate ATE result <- ate.SSL( Y = Y, A = A, R = R, mu1 = mu1, mu0 = mu0, pi1 = pi1, pi0 = pi0, imp.A = imp.A, imp.A1Y1 = imp.A1Y1, imp.A0Y1 = imp.A0Y1 ) print(result$est) print(result$se)set.seed(123) N <- 400 n <- 200 # Number of labeled observations labeled_indices <- sample(1:N, n) # Generate covariates and treatment X <- rnorm(N) A <- rbinom(N, 1, plogis(X)) # True potential outcomes Y0_true <- X + rnorm(N) Y1_true <- X + 1 + rnorm(N) # Observed outcomes Y_full <- ifelse(A == 1, Y1_true, Y0_true) # Only labeled samples have observed Y Y <- rep(NA, N) Y[labeled_indices] <- Y_full[labeled_indices] R <- rep(0, N); R[labeled_indices] <- 1 # Nuisance parameter estimates (can be replaced by actual model predictions) mu1 <- X + 0.5 mu0 <- X - 0.5 pi1 <- plogis(X) pi0 <- 1 - pi1 imp.A <- plogis(X) imp.A1Y1 <- plogis(X) * (X + 0.5) imp.A0Y1 <- (1 - plogis(X)) * (X - 0.5) # Estimate ATE result <- ate.SSL( Y = Y, A = A, R = R, mu1 = mu1, mu0 = mu0, pi1 = pi1, pi0 = pi0, imp.A = imp.A, imp.A1Y1 = imp.A1Y1, imp.A0Y1 = imp.A0Y1 ) print(result$est) print(result$se)
Trains and evaluates predictive models using cross-fitting across nfold folds,
supporting multiple learner types. Outputs out-of-fold predictions and computes log_loss
for each hyperparameter tuning round to select the best-performing model.
cf( Y, X, nfold, R, foldid, cf_model, sub_set = rep(TRUE, length(Y)), custom_model_fun = NULL )cf( Y, X, nfold, R, foldid, cf_model, sub_set = rep(TRUE, length(Y)), custom_model_fun = NULL )
Y |
Numeric or factor vector. The response variable, either binary (0/1) or continuous.
Only labelled observations (where |
X |
Matrix or data frame. Predictor variables used for model training. |
nfold |
Integer. Number of cross-fitting folds. |
R |
Binary vector. Indicator of labelled data: 1 = labelled, 0 = unlabelled. |
foldid |
Integer vector. Fold assignments for cross-fitting (length equal to the full dataset). |
cf_model |
Character string. Specifies the model type. Must be one of |
sub_set |
Logical vector. Indicates which labelled samples to include in training. |
custom_model_fun |
A logical or function. If |
The function supports three learner types:
xgboost: Gradient-boosted trees, tuning gamma across rounds.
bspline: Logistic regression using B-spline basis expansions, tuning the number of knots.
randomforest: Random forests, tuning nodesize.
Cross-fitting ensures that model evaluation is based on out-of-fold predictions,
reducing overfitting. log_loss is used as the evaluation metric to identify the
best hyperparameter setting.
A list containing:
(Currently a placeholder) List of trained models per fold and tuning round.
List of out-of-fold predictions for each of the 5 tuning rounds.
Numeric vector of log loss values for each tuning round.
Integer index (1–5) of the round achieving the lowest log_loss.
Minimum log_loss value achieved across rounds.
Vector of out-of-fold predictions from the best tuning round.
set.seed(123) N <- 200 X <- matrix(rnorm(N * 5), nrow = N, ncol = 5) # Simulate treatment assignment A <- rbinom(N, 1, plogis(X[, 1] - 0.5 * X[, 2])) # Simulate outcome Y_full <- rbinom(N, 1, plogis(0.5 * X[, 1] - 0.25 * X[, 3])) # Introduce some missingness to simulate semi-supervised data Y <- Y_full Y[sample(1:N, size = N/4)] <- NA # 25% missing # Create R vector (labelled = 1, unlabelled = 0) R <- ifelse(!is.na(Y), 1, 0) # Cross-validation fold assignment foldid <- sample(rep(1:5, length.out = N)) # Run cf with glm model result <- cf(Y = Y, X = X, nfold = 5, R = R, foldid = foldid, cf_model = "glm") # Examine output print(result$log_losses) print(result$best_rounds_index)set.seed(123) N <- 200 X <- matrix(rnorm(N * 5), nrow = N, ncol = 5) # Simulate treatment assignment A <- rbinom(N, 1, plogis(X[, 1] - 0.5 * X[, 2])) # Simulate outcome Y_full <- rbinom(N, 1, plogis(0.5 * X[, 1] - 0.25 * X[, 3])) # Introduce some missingness to simulate semi-supervised data Y <- Y_full Y[sample(1:N, size = N/4)] <- NA # 25% missing # Create R vector (labelled = 1, unlabelled = 0) R <- ifelse(!is.na(Y), 1, 0) # Cross-validation fold assignment foldid <- sample(rep(1:5, length.out = N)) # Run cf with glm model result <- cf(Y = Y, X = X, nfold = 5, R = R, foldid = foldid, cf_model = "glm") # Examine output print(result$log_losses) print(result$best_rounds_index)
Computes nuisance functions including conditional expectations and propensity scores using cross-fitting, separately for labelled and unlabelled data. These estimates are essential inputs for doubly robust or semi-supervised average treatment effect (ATE) estimators.
compute_parameter(nfold, Y, A, X, S, W, foldid, R, cf_model, custom_model_fun)compute_parameter(nfold, Y, A, X, S, W, foldid, R, cf_model, custom_model_fun)
nfold |
Integer. Number of cross-fitting folds. |
Y |
Numeric vector. Outcome variable. Can contain |
A |
Numeric vector. Treatment assignment indicator (0 or 1). Can contain |
X |
Matrix or data frame. Covariates used for outcome and propensity score models. |
S |
Matrix or data frame. Additional covariates used only in imputation models. |
W |
Matrix or data frame. Combined set of covariates (typically |
foldid |
Integer vector. Fold assignments for cross-fitting. |
R |
Binary vector. Label indicator: 1 = labelled (observed |
cf_model |
Function. A user-supplied cross-fitting wrapper function (e.g., based on Super Learner or other learners). |
custom_model_fun |
A logical or function. If |
This function applies cross-fitting to estimate all required nuisance functions for semi-supervised or doubly robust ATE estimators. Separate models are fit for the labelled dataset and the full dataset (for imputation).
A named list of estimated nuisance parameters (each a numeric vector):
Estimated propensity score .
Estimated propensity score (computed as 1 - pi1.bs).
Estimated outcome regression .
Estimated outcome regression .
Estimated imputed propensity score .
Estimated imputed propensity score (computed as 1 - cap_pi1.bs).
Estimated imputed outcome regression .
Estimated imputed outcome regression .
Creates fold assignments for both labelled and unlabelled data using stratified random sampling, ensuring an approximately equal number of samples per fold within each group.
cross_validation(N, nfold, A, Y)cross_validation(N, nfold, A, Y)
N |
Integer. Total number of observations in the dataset. |
nfold |
Integer. Number of folds to assign for cross-validation. |
A |
Numeric vector. Treatment assignment indicator (may contain |
Y |
Numeric vector. Outcome variable (may contain |
The function first separates observations into labelled and unlabelled groups based on the
availability of both treatment (A) and outcome (Y). Within each group,
fold assignments are randomly assigned to ensure approximately balanced sample sizes
across folds. This setup supports semi-supervised learning workflows by maintaining
structure between labelled and unlabelled data during cross-fitting.
A list containing:
Binary vector of length N, where 1 indicates labelled observations (non-missing A and Y),
and 0 indicates unlabelled observations.
Integer vector of length N. Fold assignments (from 1 to nfold) for use in cross-validation.
set.seed(123) N <- 100 A <- sample(c(0, 1, NA), size = N, replace = TRUE, prob = c(0.45, 0.45, 0.10)) Y <- sample(c(0, 1, NA), size = N, replace = TRUE, prob = c(0.45, 0.45, 0.10)) # Assign 5 folds for cross-fitting result <- cross_validation(N = N, nfold = 5, A = A, Y = Y) table(result$R) # Check number of labelled vs unlabelled table(result$foldid) # Check how folds are distributedset.seed(123) N <- 100 A <- sample(c(0, 1, NA), size = N, replace = TRUE, prob = c(0.45, 0.45, 0.10)) Y <- sample(c(0, 1, NA), size = N, replace = TRUE, prob = c(0.45, 0.45, 0.10)) # Assign 5 folds for cross-fitting result <- cross_validation(N = N, nfold = 5, A = A, Y = Y) table(result$R) # Check number of labelled vs unlabelled table(result$foldid) # Check how folds are distributed
Returns two list of hyperparameters for model tuning.
param_fun()param_fun()
A list, e.g., list(ridge = ..., lambda = ...)
Executes a full semi-supervised ATE estimation pipeline. This includes cross-validation fold assignment, feature selection via adaptive LASSO, model fitting using a specified learner (e.g., bspline, xgboost, or random forest), and doubly robust ATE estimation that leverages both labelled and unlabelled data.
SMMAL(Y, A, S, X, nfold = 5, cf_model = "bspline", custom_model_fun = NULL)SMMAL(Y, A, S, X, nfold = 5, cf_model = "bspline", custom_model_fun = NULL)
Y |
Numeric vector. Outcome variable (may contain |
A |
Numeric vector. Treatment indicator (1 = treated, 0 = control). May contain |
S |
Matrix or data frame. Surrogate variables used only in imputation models. |
X |
Matrix or data frame. Main covariates used for outcome and propensity score modeling. |
nfold |
Integer. Number of cross-validation folds. Default is 5. |
cf_model |
Character string. Modeling method to use in cross-fitting. One of |
custom_model_fun |
A logical or function. If |
The pipeline first selects important covariates via adaptive LASSO. Then, it fits nuisance functions (outcome regressions and propensity scores) using cross-fitting with the specified learner. Finally, it applies a doubly robust estimator that integrates information from both labelled and unlabelled observations to estimate the ATE.
A list containing:
Estimated Average Treatment Effect (ATE).
Estimated standard error of the ATE.
cf, compute_parameter, cross_validation, ate.SSL
Performs adaptive LASSO for binary outcomes by first fitting a ridge regression to compute penalty factors, and then running cross-validated lasso fits over a grid of lambda values.
SMMAL_ada_lasso( X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss )SMMAL_ada_lasso( X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss )
X |
A numeric matrix of predictors (n observations × p features). |
Y |
A numeric or integer vector of binary outcomes (length n). |
X_full |
The full matrix of predictors for all observations. |
foldid |
A vector assigning each observation (labelled or unlabelled) to a fold. |
foldid_labelled |
An integer vector (length n) of fold assignments for labeled observations.
Values should run from 1 to |
sub_set |
A logical or integer vector indicating which rows of |
labeled_indices |
An integer or logical vector indicating which rows have non-missing outcomes. |
nfold |
A single integer specifying the number of CV folds (e.g., 5 or 10). |
log_loss |
A function of the form |
This function expects that a parameter-generating function param_fun() is available in the package,
returning a list with elements $ridge (a vector of ridge penalty values) and $lambda (a vector
of lasso penalty values). Internally, it:
Fits a ridge-penalized logistic regression on all data to obtain coefficients.
Computes penalty factors as 1 / (abs(coef) + 1e-4).
For each ridge value, runs n-fold CV over lambda values with glmnet(..., alpha=1).
Records predictions on held-out folds, computes log-loss for each lambda, and selects the
lambda with minimum log-loss.
Returns a list of CV-predicted probability vectors (one vector per ridge value).
A list of length equal to the number of ridge penalty values provided by param_fun().
Each element is a numeric vector (length = n) containing cross-validated predicted probabilities
for the best lambda under that ridge penalty.
# Assume param_fun() is defined elsewhere and returns: # list(ridge = c(0.01, 0.1, 1), lambda = exp(seq(log(0.001), log(1), length = 50))) # Simulate small data: set.seed(123) n <- 100; p <- 10 X <- matrix(rnorm(n * p), nrow = n) true_beta <- c(rep(1.5, 3), rep(0, p - 3)) lin <- X %*% true_beta probs <- 1 / (1 + exp(-lin)) Y <- rbinom(n, 1, probs) # Create fold assignments for labeled observations: labeled <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2)) foldid_labelled <- rep(NA_integer_, n) foldid_labelled[labeled] <- sample(1:5, sum(labeled), replace = TRUE) sub_set <- labeled labeled_indices <- which(labeled) # For simplicity, assign foldid to all observations (labeled & unlabeled) foldid <- sample(1:5, n, replace = TRUE) # Define a simple log-loss function: log_loss_fn <- function(true, pred) { eps <- 1e-15 pred_clipped <- pmin(pmax(pred, eps), 1 - eps) -mean(true * log(pred_clipped) + (1 - true) * log(1 - pred_clipped)) } # Call SMMAL_ada_lasso with all required args: results <- SMMAL_ada_lasso( X = X, Y = Y, X_full = X, # Here full data same as X for example foldid = foldid, foldid_labelled = foldid_labelled, sub_set = sub_set, labeled_indices = labeled_indices, nfold = 5, log_loss = log_loss_fn ) # 'results' is a list (one element per ridge value), each a numeric vector of CV predictions.# Assume param_fun() is defined elsewhere and returns: # list(ridge = c(0.01, 0.1, 1), lambda = exp(seq(log(0.001), log(1), length = 50))) # Simulate small data: set.seed(123) n <- 100; p <- 10 X <- matrix(rnorm(n * p), nrow = n) true_beta <- c(rep(1.5, 3), rep(0, p - 3)) lin <- X %*% true_beta probs <- 1 / (1 + exp(-lin)) Y <- rbinom(n, 1, probs) # Create fold assignments for labeled observations: labeled <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2)) foldid_labelled <- rep(NA_integer_, n) foldid_labelled[labeled] <- sample(1:5, sum(labeled), replace = TRUE) sub_set <- labeled labeled_indices <- which(labeled) # For simplicity, assign foldid to all observations (labeled & unlabeled) foldid <- sample(1:5, n, replace = TRUE) # Define a simple log-loss function: log_loss_fn <- function(true, pred) { eps <- 1e-15 pred_clipped <- pmin(pmax(pred, eps), 1 - eps) -mean(true * log(pred_clipped) + (1 - true) * log(1 - pred_clipped)) } # Call SMMAL_ada_lasso with all required args: results <- SMMAL_ada_lasso( X = X, Y = Y, X_full = X, # Here full data same as X for example foldid = foldid, foldid_labelled = foldid_labelled, sub_set = sub_set, labeled_indices = labeled_indices, nfold = 5, log_loss = log_loss_fn ) # 'results' is a list (one element per ridge value), each a numeric vector of CV predictions.