Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2023 Aug 11.
Published in final edited form as: Proc Mach Learn Res. 2023 Jul;202:34831–34854.

Causal isotonic calibration for heterogeneous treatment effects

Lars van der Laan 1,*, Ernesto Ulloa-Pérez 2,*, Marco Carone 3,1, Alex Luedtke 1,3
PMCID: PMC10416780  NIHMSID: NIHMS1900331  PMID: 37575467

Abstract

We propose causal isotonic calibration, a novel nonparametric method for calibrating predictors of heterogeneous treatment effects. In addition, we introduce a novel data-efficient variant of calibration that avoids the need for hold-out calibration sets, which we refer to as cross-calibration. Causal isotonic cross-calibration takes cross-fitted predictors and outputs a single calibrated predictor obtained using all available data. We establish under weak conditions that causal isotonic calibration and cross-calibration both achieve fast doubly-robust calibration rates so long as either the propensity score or outcome regression is estimated well in an appropriate sense. The proposed causal isotonic calibrator can be wrapped around any black-box learning algorithm to provide strong distribution-free calibration guarantees while preserving predictive performance.

1. Introduction

Estimation of causal effects via both randomized experiments and observational studies is critical to understanding the effects of interventions and informing policy. Moreover, it is often the case that understanding treatment effect heterogeneity can provide more insights than overall population effects (Obermeyer & Emanuel, 2016; Athey, 2017). For instance, a study of treatment effect heterogeneity can help elucidate the mechanism of an intervention, design policies targeted to subpopulations who can most benefit (Imbens & Wooldridge, 2009), and predict the effect of interventions in populations other than the ones in which they were developed. These necessities have arisen in a wide range of fields, such as marketing (Devriendt et al., 2018), the social sciences (Imbens & Wooldridge, 2009), and the health sciences (Kent et al., 2018). For example, in the health sciences, heterogeneous treatment effects (HTEs) are of high importance to understanding and quantifying how certain exposures or interventions affect the health of various subpopulations (Dahabreh et al., 2016; Lee et al., 2020). Potential applications include prioritizing treatment to certain sub-populations when treatment resources are scarce, or individualizing treatment assignments when the treatment can have no effect (or even be harmful) in certain subpopulations (Dahabreh et al., 2016). As an example, treatment assignment based on risk scores has been used to provide clinical guidance in cardiovascular disease prevention (Lloyd-Jones et al., 2019) and to improve decision-making in oncology (Collins & Varmus, 2015; Cucchiara et al., 2018).

A wide range of statistical methods are available for assessing HTEs, with recent examples including Wager & Athey (2018), Carnegie et al. (2019), Lee et al. (2020), Yadlowsky et al. (2021), and Nie & Wager (2021), among others. In particular, many methods, including Imbens & Wooldridge (2009) and Dominici et al. (2020), scrutinize HTEs via conditional average treatment effects (CATEs). The CATE is the difference in the conditional mean of the counterfactual outcome corresponding to treatment versus control given covariates, which can be defined at a group or individual level. When interest lies in predicting treatment effect, the CATE can be viewed as the oracle predictor of the individual treatment effect (ITE) that can feasibly be learned from data. Optimal treatment rules have been derived based on the sign of the CATE estimator (Murphy, 2003; Robins, 2004), with more recent works incorporating the use of flexible CATE estimators (Luedtke & van der Laan, 2016). Thus, due to its wide applicability and scientific relevance, CATE estimation has been of great interest in statistics and data science.

Regardless of its quality as a proxy for the true CATE, it is generally accepted that predictions from a given treatment effect predictor can still be useful for decision-making. However, theoretical guarantees for rational decision-making using a given treatment effect predictor typically hinge on the predictor being a good approximation of the true CATE. Accurate CATE estimation can be challenging because the nuisance parameters involved can be non-smooth, high-dimensional, or otherwise difficult to model correctly. Additionally, a CATE estimator obtained from samples of one population, regardless of its quality, may not generalize well to different target populations (Frangakis, 2009). Usually, CATE estimators (often referred to as learners) build upon estimators of the conditional mean outcome given covariates and treatment level (i.e., outcome regression), the probability of treatment given covariates (i.e., propensity score), or both. For instance, plug-in estimators such as those studied in Künzel et al. (2019) — so-called T-learners — are obtained by taking the difference between estimators of the outcome regression obtained separately for each treatment level. T-learners can suffer in performance because they rely on estimation of nuisance parameters that are at least as non-smooth or high-dimensional as the CATE, and are prone to the misspecification of involved outcome regression models; these issues can result in slow convergence or inconsistency of the CATE estimator. Doubly-robust and Neyman-orthogonal CATE estimation strategies like the DR-learner and R-learner (Wager & Athey, 2018; Foster & Syrgkanis, 2019; Nie & Wager, 2021; Kennedy, 2020) mitigate some of these issues by allowing for comparatively fast CATE estimation rates even when nuisance parameters are estimated at slow rates. However, while less sensitive to the learning complexity of the nuisance parameters, their predictive accuracy in finite-samples still relies on potentially strong smoothness assumptions on the CATE. Even when the CATE is estimated consistently, predictions based on statistical learning methods often produce biased predictions that overestimate or underestimate the true CATE in the extremes of the predicted values (van Klaveren et al., 2019; Dwivedi et al., 2020). For example, the ‘pooled cohort equations’ (Goff et al., 2014) risk model used to predict cardiovascular disease has been found to underestimate risk in patients with lower socioeconomic status or chronic inflammatory diseases (Lloyd-Jones et al., 2019). The implications of biased treatment effect predictors are profound when used to guide treatment decisions and can range from harmful use to withholding of treatment (van Calster et al., 2019).

Due to the consequence of treatment decision-making, it is essential to guarantee, under minimal assumptions, that treatment effect predictions are representative in magnitude and sign of the actual effects, even when the predictor is a poor approximation of the CATE. In prediction settings, the aim of bestowing these properties on a given predictor is commonly called calibration. A calibrated treatment effect predictor has the property that the average treatment effect among individuals with identical predictions is close to their shared prediction value. Such a predictor is more robust against over-or-under estimation of the CATE in extremes of predicted values. It also has the property that the best predictor of the ITE given the predictor is the predictor itself, which facilitates transparent treatment decision-making. In particular, the optimal treatment rule (Murphy, 2003) given only information provided by the predictor is the one that assigns the treatment predicted to be most beneficial. Consequently, the rule implied by a perfectly calibrated predictor is at least as favorable as the best possible static treatment rule that ignores HTEs. While complementing one another, the aims of calibration and prediction are fundamentally different. For instance, a constant treatment effect predictor can be well-calibrated even though it is a poor predictor of treatment effect heterogeneity (Gupta et al., 2020). In view of this, calibration methods are typically designed to be wrapped around a given black-box prediction pipeline to provide strong calibration guarantees while preserving predictive performance, thereby mitigating several prediction challenges mentioned previously.

In the machine learning literature, calibration has been widely used to enhance prediction models for classification and regression (Bella et al., 2010). However, due to the comparatively little research on calibration of treatment effect predictors, such benefits have not been realized to the same extent in the context of heterogeneous treatment effect prediction. Several works have contributed to addressing this gap in the literature. Brooks et al. (2012) propose a targeted (or debiased) machine learning framework (van der Laan & Rose, 2011) for within-bins calibration that could be applied to the CATE setting. Zhang et al. (2016) and Josey et al. (2022) consider calibration of marginal treatment effect estimates for new populations but do not consider CATEs. Dwivedi et al. (2020) consider estimating calibration error of CATE predictors for subgroup discovery using randomized experimental data. Chernozhukov et al. (2018) and Leng & Dimmery (2021) propose CATE methods for linear calibration, a weaker form of calibration, in randomized experiments. For causal forests, Athey & Wager (2019) evaluate model calibration using a doubly-robust estimator of the ATE among observations above or below the median predicted CATE. Lei & Candès (2021) propose conformal inference methods for constructing calibrated prediction intervals for the ITE from a given predictor but do not consider calibration of the predictor itself. Xu & Yadlowsky (2022) propose a nonparametric doubly-robust estimator of the calibration error of a given treatment effect predictor, which could be used to detect uncalibrated predictors. Our work builds upon the above works by providing a nonparametric doubly-robust method for calibrating treatment effect predictors in general settings.

This paper is organized as follows. In Section 2, we introduce our notation and formally define calibration. There we also provide an overview of traditional calibration methods. In Section 3, we outline our proposed approach, and we describe its theoretical properties in Section 4. In Section 5, we examine the performance of our method in simulation studies. We conclude with a discussion of our proposed approach in Section 6.

2. Statistical setup

2.1. Notation and definitions

Suppose we observe n independent and identically distributed realizations of data unit O(W,A,Y) drawn from a distribution P, where W𝒲Rd is a vector of baseline covariates, A{0,1} is a binary indicator of treatment, and Y𝒴R is an outcome. For instance, W can include a patient’s demographic characteristics and medical history, A can indicate whether an individual is treated (1) or not (0), and Y could be a binary indicator of a successful clinical outcome. We denote by 𝒟n{O1,O2,,On} the observed dataset, with Oi(Wi,Ai,Yi) representing the observation on the ith study unit.

For covariate value w𝒲 and treatment level a{0,1}, we denote by π0(w)P(A=1W=w) the propensity score and by μ0(a,w)E(YA=a,W=w) the outcome regression. The individual treatment effect is Y1Y0, where Ya represents the potential outcome obtained by setting A=a. Without loss of generality, we assume that higher values of Y1Y0 are desirable. We also assume that the contrast τ0(w)μ0(1,w)μ0(0,w) equals the true CATE, E(Y1Y0W=w), which holds under certain causal assumptions (Rubin, 1974). Throughout, we denote by the L2(P) norm, that is, f2=[f(w)]2dPW(w) for any given PW-square integrable function f:𝒲R, where PW is the marginal distribution of W implied by P. We deliberately take as convention that the median median {x1,x2,,xk} of a set {x1,x2,,xk} equals the k2th order statistic of this set, where k2max{zN:zk2}.

Let τ:𝒲R be a treatment effect predictor, that is, a function that maps a realization w of W to a treatment effect prediction τ(w). In practice, τ can be obtained using any black-box algorithm. Below, we first consider τ to be fixed, though we later address situations in which τ is learned from the data used for subsequent calibration. We define the calibration function γ0(τ,w)E[Y1Y0τ(W)=τ(w)] as the conditional mean of the individual treatment effect given treatment effect score value τ(w). By the tower property, γ0(τ,w)E[τ0(W)τ(W)=τ(w)], and so, expectations only involving γ0(τ,W) and other functions of W can be taken with respect to PW.

The solution to an isotonic regression problem is typically nonunique. Throughout this text, we follow Groeneboom & Lopuhaa (1993) in taking the unique càdlàg piece-wise constant solution of the isotonic regression problem that can only take jumps at observed values of the predictor.

2.2. Measuring calibration and the calibration-distortion decomposition

Various definitions of risk predictor calibration have been proposed in the literature — see Gupta & Ramdas (2021) and Gupta et al. (2020) for a review. Here, we outline our definition of calibration and its rationale. Given a treatment effect predictor τ, the best predictor of the individual treatment effect in terms of MSE is wγ0(τ,w)E[Y1Y0τ(W)=τ(w)]. By the law of total expectation, this predictor has the property that, for any interval [a,b),

E{[τ0(W)γ0(τ,W)]I(γ0(τ,W)[a,b))}=0. (1)

Equation 1 indicates that γ0(τ,) is perfectly calibrated on [a,b). Therefore, when a given predictor τ is such that τ(W)=γ0(τ,W) with P-probability one, τ is said to be perfectly calibrated (Gupta et al., 2020) for the CATE — for brevity, we omit “for the CATE” hereafter when the type of calibration being referred to is clear from context.

In general, perfect calibration cannot realistically be achieved in finite samples. A more modest goal is for the predictor τ to be approximately calibrated in that τ(w) is close to γ0(τ,w) across all covariate values w𝒲. This naturally suggests the calibration measure:

CAL(τ)[γ0(τ,w)τ(w)]2dPW(w). (2)

This measure, referred to as the 2-expected calibration error, arises both in prediction (Gupta et al., 2020) and in assessment of treatment effect heterogeneity (Xu & Yadlowsky, 2022). We note that CAL(τ) is zero if τ(w) is perfectly calibrated. Additionally, averaging in CAL(τ) with respect to measures other than PW could be more relevant in certain applications; such cases can occur, for instance, when there is a change of population that results in covariate shift and we are interested in measuring how well τ is calibrated in the new population.

Interestingly, the above calibration measure plays a role in a decomposition of the mean squared error (MSE) between the treatment predictor and the true CATE, in that

MSE(τ)τ0τ2=CAL(τ)+DIS(τ), (3)

with DIS(τ)E{var[τ0(W)τ(W)]} a quantity we term the distortion of τ. We refer to the above as a calibration-distortion decomposition of the MSE. A consequence of the calibration-distortion decomposition is that MSE-consistent CATE estimators are also calibrated asymptotically. However, particularly in settings where the covariates are high-dimensional or the CATE is nonsmooth, the calibration error rate for such predictors can be arbitrarily slow — this is discussed further after Theorem 4.6.

To interpret DIS(τ), we find it helpful to envision a scenario in which a distorted message is passed between two persons. The goal is for Person 2 to discern the value of τ0(w), where the value of w𝒲 is only known to Person 1. Person 1 transmits w, which is then distorted through a function τ and received by Person 2. Person 2 knows the functions τ and τ0, and may use this information to try to discern τ0(w). If τ is one-to-one, τ0(w) can be discerned by simply applying τ0τ1 to the received message τ(w). More generally, whenever there exists a function f such that τ0=fτ, Person 2 can recover the value of τ0(w). For example, if τ=τ0 then f is the identity function. If no such function f exists, it may not be possible for Person 2 to recover the value of τ0(w). Instead, they may predict τ0(w) based on τ(w) via γ0(τ,w). Averaged over WPW, the MSE of this approach is precisely DIS(τ). See Equation 3 in Kuleshov & Liang (2015) for a related decomposition of E[{Yτ(X)}2]=MSE(τ)+E[{Yτ0(X)}2] derived in the context of probability forecasting.

The calibration-distortion decomposition shows that, at a given level of distortion, better-calibrated treatment effect predictors have lower MSE for the true CATE function. We will explore this fact later in this work when showing that, in addition to improving calibration, our proposed calibration procedure can improve the MSE of CATE predictors.

2.3. Calibrating predictors: desiderata and classical methods

In most calibration methods, the key goal is to find a function θ:RR of a given predictor τ such that CAL(θτ)<CAL(τ), where θτ refers to the composed predictor wθ(τ(w)). A mapping θ that pursues this objective is referred to as a calibrator. Ideally, a calibrator θn for τ constructed from the dataset 𝒟n should satisfy the following desiderata:

Property 1:

CAL(θnτ) tends to zero quickly as n grows;

Property 2:

θnτ and τ are comparably predictive of τ0.

Property 1 states the primary objective of a calibrator, that is, to yield a well-calibrated predictor. Property 2 requires that the calibrator not destroy the predictive power of the initial predictor in the pursuit of Property 1, which would occur if the calibration term in decomposition (3) were made small at the cost of dramatic inflation of the distortion term.

In the traditional setting of classification and regression, a natural aim is to learn, for a{0,1}, a predictor wνa(w) of the outcome Y among individuals with treatment A=a. The best possible such predictor is given by the treatment-specific outcome regression wμ0(a,w). For a{0,1}, ν(a) is said to be calibrated for the outcome regression if ν(a)(w)E(Yν(a)(W)=ν(a)(w)), A=a for P0-almost every w. Such a calibrated predictor can be obtained using existing calibration methods for regression (Huang et al., 2020), which we review in the next paragraph. It is natural to wonder, then, whether existing calibration approaches can be directly used to calibrate for the CATE. As a concrete example, given predictors ν(1) and ν(0) of μ0(1,) and μ0(0,), a natural CATE predictor is the T-learner τν(1)ν(0). However, even if ν(1) and ν(0) are calibrated for their respective outcome regressions, the predictor τ can still be poorly calibrated for the CATE. Indeed, in settings with treatment-outcome confounding, T-learners can be poorly calibrated when the calibrated predictors ν(1) and ν(0) are poor approximations of their respective outcome regressions. As an extreme example, suppose that ν(a) equals the constant predictor wE(YA=a) for a{0,1}, which is perfectly calibrated for the outcome regression. Then, the corresponding T-learner τ()=E(YA=1)E(YA=0) typically has poor calibration for the CATE in observational settings.

In classification and regression settings (Huang et al., 2020), the most commonly used calibration methods include Platt’s scaling (Platt et al., 1999), histogram binning (Zadrozny & Elkan, 2001), Bayesian binning into quantiles (Naeini et al., 2015), and isotonic calibration (Zadrozny & Elkan, 2002; Niculescu-Mizil & Caruana, 2005). Broadly, Platt’s scaling is designed for binary outcomes and uses the estimated values of the predictor to fit the logistic regression model

logitP(Y=1τ(W)=t)=α+βt

with α, βR. While it typically satisfies Property 2, Platt’s scaling is based on strong parametric assumptions and, as a consequence, may lead to predictions with significant calibration error, even asymptotically (Gupta et al., 2020). Nevertheless, Platt’s scaling may be preferred when limited data is available. Histogram binning, also known as quantile binning, involves partitioning the sorted values of the predictor into a fixed number of bins. Given an initial prediction, the calibrated prediction is given by the empirical mean of the observed outcome values within the corresponding prediction bin. A significant limitation of histogram binning is that it requires a priori specification of the number of bins. Selecting too few bins can significantly degrade the predictive power of the calibrated predictor, whereas selecting too many bins can lead to poor calibration. Bayesian binning improves upon histogram binning by considering multiple binning models and their combinations; nevertheless, it still suffers from the need to pre-specify binning models and prior distributions.

Isotonic calibration is a histogram binning method that learns the bins from data using isotonic regression, a nonparametric method traditionally used for estimating monotone functions (Barlow & Brunk, 1972; Martino et al., 2019; Huang et al., 2020). Specifically, the bins are selected by minimizing an empirical MSE criterion under the constraint that the calibrated predictor is a nondecreasing monotone transformation of the original predictor. Isotonic calibration is motivated by the heuristic that, for a good predictor τ, the calibration function γ0(τ,) should be approximately monotone as a function of τ. For instance, when τ=τ0, the map τ0γ0(τ0,)=τ0 is the identity function. Despite its popularity and strong performance in practice (Zadrozny & Elkan, 2002; Niculescu-Mizil & Caruana, 2005; Gupta & Ramdas, 2021), to date, whether isotonic calibration satisfies distribution-free calibration guarantees remains an open question (Gupta, 2022). In this work, we will show that isotonic calibration satisfies a distribution-free calibration guarantee in the sense of Property 1. We further establish that Property 2 holds, in that the isotonic selection criterion ensures that the calibrated predictor is at least as predictive as the original predictor up to negligible error.

3. Causal isotonic calibration

In real-world experiments, Dwivedi et al. (2020) found empirically that state-of-the-art CATE estimators tend to be poorly calibrated. However, strikingly, the authors found that such CATE predictors can often still correctly rank the average treatment effect among subgroups defined by bins of the predicted effects. These findings support the heuristic that the calibration function γ0(τ0,) is often approximately monotone as a function of the predictor τ. This heuristic makes extending isotonic calibration to the CATE setting especially appealing since the monotonicity constraint ensures that the calibrated predictions preserve the (non-strict) ranking of the original predictions.

Inspired by isotonic calibration, we propose a doubly-robust calibration method for treatment effects, which we refer to as causal isotonic calibration. Causal isotonic calibration takes a given predictor trained on some dataset and performs calibration using an independent (or hold-out) dataset. Mechanistically, causal isotonic calibration first automatically learns uncalibrated regions of the given predictor. Calibrated predictions are then obtained by consolidating individual predictions within each region into a single value using a doubly-robust estimator of the ATE. In addition, we introduce a novel data-efficient variant of calibration which we refer to as cross-calibration. In contrast with the standard calibration approach, causal isotonic cross-calibration takes cross-fitted predictors and outputs a single calibrated predictor obtained using all available data. Our methods can be implemented using standard isotonic regression software.

Let τ be a given treatment effect predictor assumed, for now, to have been built using an external dataset, and suppose that 𝒟n is the available calibration dataset. In general, we can calibrate the predictor τ using regression-based calibration methods by employing an appropriate surrogate outcome for the CATE. For both experimental and observational settings, a surrogate outcome with favorable efficiency and robustness properties is the pseudo-outcome χ0(O) defined via the mapping

χ0:oτ0(w)+aπ0(w)π0(w)[1π0(w)][yμ0(a,w)], (4)

with o(w,a,y) representing a realization of the data unit. This pseudo-outcome has been used as surrogate for the CATE in previous methods for estimating τ0, including the DR-learner (Luedtke & van der Laan, 2016; Kennedy, 2020). If χ0 were known, an external predictor τ could be calibrated using 𝒟n by isotonic regression of the pseudo-outcomes χ0(O1),χ0(O2),,χ0(On) onto the calibration sample predictions τ(W1),τ(W2),,τ(Wn). However, χ0 depends on π0 and μ0, which are usually unknown and must be estimated.

A natural approach for calibrating treatment effect predictors using isotonic regression is as follows. First, define χn as the estimated pseudo-outcome function based on estimates μn and πn derived from 𝒟n. Then, a calibrated predictor is given by θnτ, where the calibrator θn is found via isotonic regression as a minimizer over iso{θ:RR;θis monotone nondecreasing} of the empirical least-squares risk function

θ1ni=1n[χn(Oi)θτ(Wi)]2.

However, this optimization problem requires a double use of 𝒟n: once, for creating the pseudo-outcomes χn(Oi), and a second time, in the calibration step. This double usage could lead to over-fitting (Kennedy, 2020), and so we recommend obtaining pseudo-outcomes via sample splitting or cross-fitting. Sample splitting involves randomly partitioning 𝒟n into m𝒞, with m used to estimate μ0 and π0, and 𝒞 used to carry out the calibration step — see Algorithm 1 for details. Cross-fitting improves upon sample splitting by using all available data to estimate μ0 and π0 as well as to carry out the calibration step. Algorithm 4, outlined in Appendix B, is the cross-fitted variant of Algorithm 1.

Algorithm 1 Causal isotonic calibration
Require:predictorτ,training datam,calibration data𝒞1:obtain estimateχmofχ0usingm;2:perform isotonic regression to findθn=argminθisoiell[χm(Oi)θτ(Wi)]2withellthe set of indices for observations in𝒞𝒟n;3:setτn=θnτ.Ensure:τn

In practice, the external dataset used to construct τ for input into Algorithm 1 is likely to arise from a sample splitting approach wherein a large dataset is split in two, with one half used to estimate τ and the other to calibrate it. This naturally leads to the question of whether there is an approach that fully utilizes the entire dataset for both fitting an initial estimate of τ0 and calibration. Algorithm 2 describes causal isotonic cross-calibration, which provides a means to accomplish precisely this. In brief, this approach applies Algorithm 1 a total of k times on different splits of the data, where for each split an initial predictor of τ0 is fitted based on the first subset of the data and this predictor is calibrated using the second subset. These k calibrated predictors are then aggregated via a pointwise median. Interestingly, other aggregation strategies, such as pointwise averaging, can lead to uncalibrated predictions (Gneiting & Ranjan, 2013; Rahaman & Thiery, 2020). A computationally simpler variant of Algorithm 2 is given by Algorithm 3. In this implementation, a single isotonic regression is performed using the pooled out-of-fold predictions; this variant may also yield more stable performance in finite-samples than Algorithm 2 — see Section 2.1.2 of Xu & Yadlowsky (2022) for a related discussion in the context of debiased machine learning.

Algorithm 2 Causal isotonic cross-calibration (unpooled)
Require:dataset𝒟n,#of cross-fitting splitsk1:partition𝒟ninto datasets𝒞(1),𝒞(2),,𝒞(k);2:fors=1,2,,kdo3:set(s)𝒟n\𝒞(s);4:get initial predictorτn,sofτ0using(s);5:get calibrated predictorτn,svia Alg.1using predictorτn,s,training data(s),and calibration data𝒞(s);6:endfor7:setτn:wmedian{τn,1(w),τn,2(w),,τn,k(w)}.Ensure:τn

4. Large-sample theoretical properties

We now present theoretical results for causal isotonic calibration. We first obtain results for causal isotonic calibration described by Algorithm 1 applied to a fixed predictor τ. We also establish MSE guarantees for the calibrated predictor and argue that the proposed calibrator satisfies Properties 1 and 2. We then extend our results to the procedure described by Algorithm 2.

For ease of presentation, we only establish theoretical results for the case where the nuisance estimators are obtained using sample splitting. With minor modifications, our results can be readily extended to cross-fitting by arguing along the lines of Newey & Robins (2018). In that spirit, we assume that the available data 𝒟n is the union of a training dataset m and a calibration dataset 𝒞 of sizes m and , respectively, with n=m+ and min{m,} as

Algorithm 3 Causal isotonic cross-calibration (pooled)
Require:dataset𝒟n,#of cross-fitting splitsk1:partition𝒟ninto datasets𝒞(1),𝒞(2),,𝒞(k);2:fors=1,2,,kdo3:letj(i)=sfor eachi𝒞(s);4:set(s)𝒟n\𝒞(s);5:get estimate predictorχn,sofχ0from(s);6:get initial predictorτn,sofτ0using(s);7:endfor8:perform isotonic regression using pooled out-of-fold predictions to findθn=argminθisoi=1n[χn,j(i)(Oi)θτn,j(i)(Wi)]29:setτn,sθnθτn,sfors=1,2,,k;10:setτn:wmedian{τn,1(w),τn,2(w),,τn,k(w)}.Ensure:τn

n. Let τn be the calibrated predictor obtained from Algorithm 1 using τ, m and 𝒞 where the estimated pseudo-outcome χm is obtained by substituting estimates πm and μm of π0 and μ0 into (4).

Condition 4.1 (bounded outcome support). The P-support 𝒴 of Y is a uniformly bounded subset of R.

Condition 4.2 (positivity). There exists ϵ>0 such that P(ϵ<π0W)<1ϵ)=1.

Condition 4.3 (independence). Estimators πm and μm do not use any data in 𝒞.

Condition 4.4 (bounded range of πm, μm, τ). There exist 0<η, α< such that P(η<πm(W)<1η)=P(μm(A,W)<α)=P(τ(W)<α)=1 for m=1,2,

Condition 4.5 (bounded variation of best predictor). The function θ0:RR such that θ0τ=γ0(τ,) is of bounded total variation.

It is worth noting that the initial predictor and its best monotone transformation can be arbitrarily poor CATE predictors. Condition 4.1 holds trivially when outcomes are binary, but even continuous outcomes are often known to satisfy fixed bounds (e.g., physiologic bound, limit of detection of instrument) in applications. Condition 4.2 is standard in causal inference and requires that all individuals have a positive probability of being assigned to either treatment or control. Condition 4.3 follows as a direct consequence of the sample splitting approach, because the estimators are obtained from an independent sample from the data used to carry the calibration step. Condition 4.4 requires that the estimators of the outcome regression and propensity score be bounded; this can be enforced, for example, by threshholding when estimating these regression functions. Condition 4.5 excludes cases in which the best possible predictor of the CATE given only the initial predictor τ has pathological behavior, in the sense that it has infinite variation norm as a (univariate) mapping of τ. We stress here that isotonic regression is used only as a tool for calibration, and our theoretical guarantees do not require any monotonicity on components of the data-generating mechanism — for example, γ0(τ,w) need not be monotone as a function of τ(w).

The following theorem establishes the calibration rate of the predictor τn obtained using causal isotonic calibration.

Theorem 4.6 (τn is well-calibrated). Under Conditions 4.14.5, as n, it holds that

CAL(τn)=OP(23+(πmπ0)(μmμ0)2).

The calibration rate can be expressed as the sum of an oracle calibration rate and the rate of a second-order cross-product bias term involving nuisance estimators. Notably, the causal isotonic calibrator rate can satisfy Property 1 at the oracle rate 23 so long as (πmπ0)(μmμ0) shrinks no slower than 13, which requires that one or both of π0 and μ0 is estimated well in an appropriate sense. If π0 is known, as in most randomized experiments, the fast calibration rate of 23 can be achieved even when μm is inconsistent, thereby providing distribution-free calibration guarantees irrespective of the smoothness of the outcome regression or dimension of the covariate vector. When π0 is unknown, the oracle rate of 23 may not be achievable if the propensity score and outcome regression are insufficiently smooth relative to the dimension of the covariate vector (Kennedy, 2020; Kennedy et al., 2022).

It is interesting to contrast the calibration guarantee in Theorem 4.6 with existing MSE guarantees for DR-learners (Kennedy, 2020) since, in view of (3), they also provide calibration guarantees. While the MSE estimation rates for the CATE depend on the dimension and smoothness of τ0, the curse of dimensionality for our calibration rates only manifests itself in the doubly-robust cross-remainder term that involves nuisance estimation rates. For instance, when =m=n2, if π0 and μ0 are known to be Hölder smooth with exponent α1, the calibration rate implied by Theorem 4.6 with minimax optimal nuisance estimators is, up to logarithmic factors, 23+4α(2α+d). In contrast, if τ0 is known to be Hölder smooth with exponent β1, a minimax optimal estimator of τ0 is only guaranteed to achieve an MSE, and therefore calibration, rate of 2β(2β+d)+4α(2α+d) (Kennedy et al., 2022). Moreover, when the nuisance smoothness satisfies αd4, causal isotonic calibration can achieve the oracle calibration rate of 23, whereas a minimax optimal CATE estimator is only guaranteed to achieve the same calibration rate under the stringent condition that the smoothness of τ0 satisfies βd.

The following theorem states that the predictor obtained by taking pointwise medians of calibrated predictors is also calibrated.

Theorem 4.7 (Pointwise median preserves calibration). Let τn,1,τn,2,,τn,k be predictors, and define pointwise τn,k(w)median{τn,1(w),τn,2(w),,τn,k(w)}. Then

CAL(τn)ks=1kCAL(τn,s),

where the median operation is defined as in Section 2.1.

Under similar conditions, Theorem 4.7 combined with a generalization of Theorem 4.6 that handles random τ (see Theorem C.5 in Appendix C.4) establishes that a predictor τn,k obtained using causal isotonic cross-calibration (Algorithm 2) has calibration error CAL(τn,k) of order

OP(n23+max1sk(πn,sπ0)(μn,sμ0)2)

as n, where μn,s and πn,s are the outcome regression and propensity score estimators obtained after excluding the sth fold of the full dataset. In fact, Theorem 4.7 is valid for any calibrator of the form τn:wτn,sn(w)(w), where sn(w) is any random selector that may depend on the covariate value w. This suggests that the calibration rate for the median-aggregated calibrator implied by Theorem 4.7 is conservative as it also holds for the worst-case oracle selector that maximizes calibration error.

We now establish that causal isotonic calibration satisfies Property 2, that is, it maintains the predictive accuracy of the initial predictor τ. In what follows, predictive accuracy is quantified in terms of MSE. At first glance, the calibration-distortion decomposition appears to raise concerns that causal isotonic calibration may distort τ so much that the predictive accuracy of τn may be worse than that of τ. This possibility may seem especially concerning given that the ouput of isotonic regression is a step function, so that there could be many w, w𝒲 such that τ(w)τ(w) but τn(w)=τn(w). The following theorem alleviates this concern by establishing that, up to a remainder term that decays with sample size, the MSE of τn is no larger than the MSE of the initial CATE predictor τ. A consequence of this theorem is that causal isotonic calibration does not distort τ so much as to destroy its predictive performance. To derive this result, we leverage that τn is in fact a misspecified DR-learner of the univariate CATE function γ0(τ,). While isotonic calibrated predictors are calibrated even when γ0(τ,) is not a monotone function of τ, we stress that misspecified DR-learners for γ0(τ,) are typically uncalibrated.

In the theorem below, we define the best isotonic approximation of the CATE given the initial predictor τ as

τ0argminθτ:θisoτ0θτ.

Theorem 4.8 (Causal isotonic calibration does not inflate MSE much). Under Conditions 4.14.5,

τnτ0=OP(13+(πmπ0)(μmμ0))

as n. As such, as n, the inflation in root MSE from causal isotonic calibration satisfies

MSEτnMSE(τ)OP(13+(πmπ0)(μmμ0)).

A similar MSE bound can be established for causal isotonic cross-calibration as defined in Algorithm 2.

5. Simulation studies

5.1. Data-generating mechanisms

We examined the behavior of our proposal under two data-generating mechanisms. The first mechanism (Scenario 1) includes a binary outcome whose conditional mean is an additive function (on the logit scale) of non-linear transformations of four confounders with treatment interactions. The second mechanism (Scenario 2) includes instead a continuous outcome with conditional mean linear on covariates and treatment interactions, with more than 100 covariates of which only 20 are true confounders. In both scenarios, the propensity score follows a logistic regression model. All covariates were independent and uniformly distributed on (−1, +1). Sample sizes 1,000, 2,000, 5,000 and 10,000 were considered. Further details are given in Appendix D.1.

5.2. CATE estimation

In Scenario 1, to estimate the CATE, we implemented gradient-boosted regression trees (GBRT) with maximum depths equal to 2, 5, and 8 (Chen & Guestrin, 2016), random forests (RF) (Breiman, 2001), generalized linear models with lasso regularization (GLMnet) (Friedman et al., 2010), generalized additive models (GAM) (Wood, 2017), and multivariate adaptive regression splines (MARS) (Friedman, 1991). In Scenario 2, we implemented RF, GLMnet, and a combination of variable screening with lasso regularization followed by GBRT with maximum depth determined via cross-validation. We used the implementation of these estimators found in R package sl3 (Coyle et al., 2021). Causal isotonic cross-calibration was implemented using the variant outlined in Algorithm 3. Further details are given in Appendix D.2.

5.3. Performance metrics

We compared the performance of the causal isotonic calibrator to its uncalibrated version in terms of three metrics: the calibration measure defined in (1), MSE, and the calibration bias within bins defined by the first and last prediction deciles. The calibration bias within bins is given by the measure in (2) standardized by the probability of falling within each bin. For each simulation iteration, the metric was estimated empirically using an independent sample 𝒱 of size n𝒱=104. These metric estimates were then averaged across 500 simulations. Additional details on these metrics is provided in Appendix D.3.

5.4. Simulation results

Results from Scenario 1 are summarized in Figure 3. The predictors based on GLMnet, RF, GAM, and MARS happened to be well-calibrated, and so, causal isotonic calibration did not lead to noticeable improvements in calibration error. In contrast, causal isotonic calibration of GBRT substantially decreased its calibration error, regardless of tree depth and sample size. In terms of MSE, calibration improved the predictive performance of GBRT and GAM, and preserved the performance of GLMnet and MARS. The calibration bias within bins of prediction was generally smaller after calibration, with a more notable improvement on GBRT — see Table 2 in Appendix E.

Results from Scenario 2 are summarized in Figure 2. The predictors based on RF and GBRT with GLMnet screening were poorly calibrated, and causal isotonic calibration substantially reduced their calibration error. Calibration did not noticeably change the already small calibration error of the GLMnet predictions; however, calibration substantially reduced the calibration error within quantile bins of its predictions — see Table 3 in Appendix E. Finally, with respect to MSE, causal isotonic calibration improved the performance of RF and GBRT with variable screening, and yielded similar performance to GLMnet.

Figure 2.

Figure 2.

Calibration error and MSE in Scenario 2. The panels show the calibration error (top) and MSE (bottom) using the calibrated (left) and uncalibrated (right) predictors as a function of sample size. The y-axes are on a log scale.

In Figure 4 of Appendix E, we compared calibration performance using hold-out sets to cross-calibration. We found substantial improvements in MSE and calibration by using cross-calibration over conventional calibration.

6. Conclusion

In this work, we proposed causal isotonic calibration as a novel method to calibrate treatment effect predictors. In addition, we established that the pointwise median of calibrated predictors is also calibrated. This allowed us to develop a data-efficient variant of causal isotonic calibration using cross-fitted predictors, thereby avoiding the need for a hold-out calibration dataset. Our proposed methods guarantee that, under minimal assumptions, the calibration error defined in (2) vanishes at a fast rate of 23 with little or no loss in predictive power, where denotes the number of observations used for calibration. This property holds regardless of how well the initial predictor τ approximates the true CATE function. To our knowledge, our method is the first in the literature to directly calibrate CATE predictors without requiring trial data or parametric assumptions. Potential applications of our method include data-driven decision-making with strong robustness guarantees. In future work, it would be interesting to study whether pairing causal isotonic cross-calibration with conformal inference (Lei & Candès, 2021) leads to improved ITE prediction intervals, and whether causal isotonic calibration and shape-constrained inference methods (Westling & Carone, 2020) can be used to construct confidence intervals for the calibration function γ0(τn,).

Our method has limitations. Its calibration guarantees require that either μ0 or π0 be estimated sufficiently well. Flexible learning methods can be used to satisfy this condition. If π0 is known, this condition can be trivially met. Hence, our method can be readily used to calibrate CATE predictors and characterize HTEs in clinical trials. For proper calibration, our method requires all confounders to be measured and adjusted for. In future work, it will be important to study CATE calibration in the context of unmeasured confounding. Our strategy could be adapted to construct calibrators for general learning tasks, including E-learning of the conditional relative risk (Jiang et al., 2019; Qiu et al., 2019), proximal causal learning (Tchetgen et al., 2020; Sverdrup & Cui, 2023), and instrumental variable-based learning (Okui et al., 2012; Syrgkanis et al., 2019).

In simulations, we found that causal isotonic cross-calibration led to well-calibrated predictors without sacrificing predictive performance; benefits were especially prominent in high-dimensional settings and for tree-based methods. This is of particularly high relevance given that regression trees have become popular for CATE estimation due to their flexibility (Athey & Imbens, 2016) and interpretability (Lee et al., 2020). We also found that cross-calibration substantially improved the MSE of the calibrated predictor relative to hold-out set calibration approaches and can even improve the MSE of the original predictor.

Though our focus was on treatment effect estimation, our theoretical arguments can be readily adapted to provide guarantees for isotonic calibration in regression and classification problems. Hence, we have provided an affirmative answer to the open question of whether it is possible to establish distribution-free calibration guarantees for isotonic calibration (Gupta, 2022).

Supplementary Material

Supplement

Figure 1.

Figure 1.

Calibration error and MSE in Scenario 1. The panels show the calibration error (top) and MSE (bottom) using the calibrated (left) and uncalibrated (right) predictors as a function of sample size. The y-axes are on a log scale.

Acknowledgements.

Research reported in this publication was supported by NIH grants DP2-LM013340 and R01-HL137808. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Footnotes

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

References

  1. Athey S Beyond prediction: Using big data for policy problems. Science, 355(6324):483–485, 2017. [DOI] [PubMed] [Google Scholar]
  2. Athey S and Imbens G Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Athey S and Wager S Estimating treatment effects with causal forests: An application. Observational Studies, 5 (2):37–51, 2019. [Google Scholar]
  4. Barlow RE and Brunk HD The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972. [Google Scholar]
  5. Bella A, Ferri C, Hernández-Orallo J, and Ramírez-Quintana MJ Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 128–146. IGI Global, 2010. [Google Scholar]
  6. Breiman L Random forests. Machine learning, 45(1):5–32, 2001. [Google Scholar]
  7. Brooks J, van der Laan MJ, and Go AS Targeted maximum likelihood estimation for prediction calibration. The international journal of biostatistics, 8(1), 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Carnegie N, Dorie V, and Hill JL Examining treatment effect heterogeneity using bart. Observational Studies, 5 (2):52–70, 2019. [Google Scholar]
  9. Chen T and Guestrin C Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794, 2016. [Google Scholar]
  10. Chernozhukov V, Demirer M, Duflo E, and Fernandez-Val I Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india. Technical report, National Bureau of Economic Research, 2018. [Google Scholar]
  11. Collins FS and Varmus H A new initiative on precision medicine. New England journal of medicine, 372(9):793–795, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Coyle JR, Hejazi NS, Malenica I, Phillips RV, and Sofrygin O sl3: Modern pipelines for machine learning and Super Learning. https://github.com/tlverse/sl3, 2021. URL 10.5281/zenodo.1342293. R package version 1.4.2. [DOI] [Google Scholar]
  13. Cucchiara V, Cooperberg MR, Dall’Era M, Lin DW, Montorsi F, Schalken JA, and Evans CP Genomic markers in prostate cancer decision making. European urology, 73(4):572–582, 2018. [DOI] [PubMed] [Google Scholar]
  14. Dahabreh IJ, Hayward R, and Kent DM Using group data to treat individuals: understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence. International journal of epidemiology, 45(6):2184–2193, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Devriendt F, Moldovan D, and Verbeke W A literature survey and experimental evaluation of the state-of-the-art in uplift modeling: A stepping stone toward the development of prescriptive analytics. Big data, 6(1):13–41, 2018. [DOI] [PubMed] [Google Scholar]
  16. Dominici F, Bargagli-Stoffi FJ, and Mealli F From controlled to undisciplined data: estimating causal effects in the era of data science using a potential outcome framework. arXiv preprint arXiv:2012.06865, 2020. [Google Scholar]
  17. Dwivedi R, Tan YS, Park B, Wei M, Horgan K, Madigan D, and Yu B Stable discovery of interpretable subgroups via calibration in causal studies. International Statistical Review, 88:S135–S178, 2020. [Google Scholar]
  18. Foster DJ and Syrgkanis V Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019. [Google Scholar]
  19. Frangakis C The calibration of treatment effects from clinical trials to target populations, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Friedman J, Hastie T, and Tibshirani R Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010. [PMC free article] [PubMed] [Google Scholar]
  21. Friedman JH Multivariate adaptive regression splines. The annals of statistics, 19(1):1–67, 1991. [Google Scholar]
  22. Gneiting T and Ranjan R Combining predictive distributions. Electronic Journal of Statistics, 7:1747–1782, 2013. [Google Scholar]
  23. Goff DC, Lloyd-Jones DM, Bennett G, Coady S, D’agostino RB, Gibbons R, Greenland P, Lackland DT, Levy D, O’donnell CJ, et al. 2013 acc/aha guideline on the assessment of cardiovascular risk: a report of the american college of cardiology/american heart association task force on practice guidelines. Journal of the American College of Cardiology, 63(25 Part B):2935–2959, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Groeneboom P and Lopuhaa H Isotonic estimators of monotone densities and distribution functions: basic facts. Statistica Neerlandica, 47(3):175–183, 1993. [Google Scholar]
  25. Gupta C Post-hoc calibration without distributional assumption, 2022. (Ph.D. proposal). [Google Scholar]
  26. Gupta C and Ramdas AK Distribution-free calibration guarantees for histogram binning without sample splitting. arXiv preprint arXiv:2105.04656, 2021. [Google Scholar]
  27. Gupta C, Podkopaev A, and Ramdas A Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3711–3723. Curran Associates, Inc., 2020. [Google Scholar]
  28. Huang Y, Li W, Macheret F, Gabriel RA, and Ohno-Machado L A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27(4):621–633, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Imbens GW and Wooldridge JM Recent developments in the econometrics of program evaluation. Journal of economic literature, 47(1):5–86, 2009. [Google Scholar]
  30. Jiang B, Song R, Li J, and Zeng D Entropy learning for dynamic treatment regimes. Statistica Sinica, 29:1633–1710, 01 2019. doi: 10.5705/ss.202018.0076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Josey KP, Yang F, Ghosh D, and Raghavan S A calibration approach to transportability and data-fusion with observational data. Statistics in Medicine, 41(23):4511–4531, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kennedy E, Balakrishnan S, and Wasserman L Minimax rates for heterogeneous causal effect estimation. 03 2022. [Google Scholar]
  33. Kennedy EH Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020. [Google Scholar]
  34. Kent DM, Steyerberg E, and van Klaveren D Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. Bmj, 363, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Kuleshov V and Liang PS Calibrated structured prediction. Advances in Neural Information Processing Systems, 28, 2015. [Google Scholar]
  36. Künzel SR, Sekhon JS, Bickel PJ, and Yu B Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Lee K, Bargagli-Stoffi FJ, and Dominici F Causal rule ensemble: Interpretable inference of heterogeneous treatment effects. arXiv preprint arXiv:2009.09036, 2020. [Google Scholar]
  38. Lei L and Candès EJ Conformal inference of counter-factuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938, 2021. [Google Scholar]
  39. Leng Y and Dimmery D Calibration of heterogeneous treatment effects in random experiments. Available at SSRN 3875850, 2021. [Google Scholar]
  40. Lloyd-Jones DM, Braun LT, Ndumele CE, Smith SC Jr, Sperling LS, Virani SS, and Blumenthal RS Use of risk assessment tools to guide decision-making in the primary prevention of atherosclerotic cardiovascular disease: a special report from the american heart association and american college of cardiology. Circulation, 139(25):e1162–e1177, 2019. [DOI] [PubMed] [Google Scholar]
  41. Luedtke AR and van der Laan MJ Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of statistics, 44(2):713, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Martino A, De Santis E, Baldini L, Rizzi A, et al. Calibration techniques for binary classification problems: A comparative analysis. In IJCCI, pp. 487–495, 2019. [Google Scholar]
  43. Murphy SA Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003. [Google Scholar]
  44. Naeini MP, Cooper G, and Hauskrecht M Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [PMC free article] [PubMed] [Google Scholar]
  45. Newey WK and Robins JR Cross-fitting and fast remainder rates for semiparametric estimation. arXiv preprint arXiv:1801.09138, 2018. [Google Scholar]
  46. Niculescu-Mizil A and Caruana R Obtaining calibrated probabilities from boosting. In UAI, volume 5, pp. 413–20, 2005. [Google Scholar]
  47. Nie X and Wager S Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021. [Google Scholar]
  48. Obermeyer Z and Emanuel EJ Predicting the future—big data, machine learning, and clinical medicine. The New England journal of medicine, 375(13):1216, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Okui R, Small DS, Tan Z, and Robins JM Doubly robust instrumental variable regression. Statistica Sinica, pp. 173–205, 2012. [Google Scholar]
  50. Platt J et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999. [Google Scholar]
  51. Qiu H, Luedtke A, and van der Laan M Contribution to discussion of “entropy learning for dynamic treatment regimes” by jiang b, song r, li j, zeng d. statistica sinica. Statistica Sinica, 29(4):1666–1678, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Rahaman R and Thiery AH Uncertainty quantification and deep ensembles. arXiv preprint arXiv:2007.08792, 2020. [Google Scholar]
  53. Robins JM Optimal structural nested models for optimal sequential decisions. In Proceedings of the second seattle Symposium in Biostatistics, pp. 189–326. Springer, 2004. [Google Scholar]
  54. Royden H Real analysis, the macmillan company. New York, 1963. [Google Scholar]
  55. Rubin DB Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974. [Google Scholar]
  56. Sverdrup E and Cui Y Proximal causal learning of heterogeneous treatment effects. arXiv preprint arXiv:2301.10913, 2023. [Google Scholar]
  57. Syrgkanis V, Lei V, Oprescu M, Hei M, Battocchi K, and Lewis G Machine learning estimation of heterogeneous treatment effects with instruments. Advances in Neural Information Processing Systems, 32, 2019. [Google Scholar]
  58. Tchetgen EJT, Ying A, Cui Y, Shi X, and Miao W An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982, 2020. [Google Scholar]
  59. van Calster B, McLernon DJ, van Smeden M, Wynants L, and Steyerberg EW Calibration: the achilles heel of predictive analytics. BMC medicine, 17(1):1–7, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. van der Laan MJ and Rose S Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York, 2011. [Google Scholar]
  61. van der Laan MJ, Polley EC, and Hubbard AE Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007. [DOI] [PubMed] [Google Scholar]
  62. van der Vaart AW and Wellner J Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 1996. [Google Scholar]
  63. van Klaveren D, Balan TA, Steyerberg EW, and Kent DM Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting. Journal of clinical epidemiology, 114:72–83, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Wager S and Athey S Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018. [Google Scholar]
  65. Westling T and Carone M A unified study of nonparametric inference for monotone functions. Annals of statistics, 48(2):1001, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Wood SN Introducing gams. In Generalized additive models, pp. 161–194. Chapman and Hall/CRC, 2017. [Google Scholar]
  67. Xu Y and Yadlowsky S Calibration error for heterogeneous treatment effects. arXiv preprint arXiv:2203.13364, 2022. [Google Scholar]
  68. Yadlowsky S, Fleming S, Shah N, Brunskill E, and Wager S Evaluating treatment prioritization rules via rank-weighted average treatment effects. arXiv preprint arXiv:2111.07966, 2021. [Google Scholar]
  69. Zadrozny B and Elkan C Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Icml, volume 1, pp. 609–616. Citeseer, 2001. [Google Scholar]
  70. Zadrozny B and Elkan C Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002. [Google Scholar]
  71. Zhang Z, Nie L, Soon G, and Hu Z New methods for treatment effect calibration, with applications to noninferiority trials. Biometrics, 72(1):20–29, 2016. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement

RESOURCES