Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2016 Jul 5;113(27):7353–7360. doi: 10.1073/pnas.1510489113

Recursive partitioning for heterogeneous causal effects

Susan Athey a,1, Guido Imbens a
PMCID: PMC4941430  PMID: 27382149

Abstract

In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.

Keywords: heterogeneous treatment effects, causal inference, cross-validation, supervised machine learning, potential outcomes


In this paper we study two closely related problems: first, estimating heterogeneity by covariates or features in causal effects in experimental or observational studies, and second, conducting inference about the magnitude of the differences in treatment effects across subsets of the population. Causal effects, in the Rubin causal model or potential outcome framework we use here (13), are comparisons between outcomes we observe and counterfactual outcomes we would have observed under a different regime or treatment. We introduce data-driven methods that select subpopulations to estimate treatment effect heterogeneity and to test hypotheses about the differences between the effects in different subpopulations. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a preanalysis plan, without concern about invalidating inference due to searching over many possible partitions.

Our approach is tailored for applications where there may be many attributes of a unit relative to the number of units observed, and where the functional form of the relationship between treatment effects and the attributes of units is not known. The supervised machine learning literature (e.g., ref. 4) has developed a variety of effective methods for a closely related problem, the problem of predicting outcomes as a function of covariates in similar environments. The most popular approaches [e.g., regression trees (5), random forests (6), LASSO (7), support vector machines (8), etc.] entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity. Cross-validation is often used to select the optimal level of complexity (the one that maximizes predictive power without “overfitting”).

Within the prediction-based machine learning literature, regression trees differ from most other methods in that they produce a partition of the population according to covariates, whereby all units in a partition receive the same prediction. In this paper, we focus on the analogous goal of deriving a partition of the population according to treatment effect heterogeneity, building on standard regression trees (5, 6). Whether the ultimate goal in an application is to derive a partition or fully personalized treatment effect estimates depends on the setting; settings where partitions may be desirable include those where decision rules must be remembered, applied, or interpreted by human beings or computers with limited processing power or memory. Examples include treatment guidelines to be used by physicians or even online personalization applications where having a simple lookup table reduces latency for the user. We show that an attractive feature of focusing on partitions is that we can achieve nominal coverage of confidence intervals for estimated treatment effects even in settings with a modest number of observations and many covariates. Our approach has applicability even for settings such as clinical trials of drugs with only a few hundred patients, where the number of patient characteristics is potentially quite large. Our method may also be viewed as a complement to the use of “preanalysis plans” where the researcher must commit in advance to the subgroups that will be considered. It enables researchers to let the data discover relevant subgroups while preserving the validity of confidence intervals constructed on treatment effects within subgroups.

A first challenge for our goal of finding a partition and then testing hypotheses about treatment effects is that many existing machine learning methods cannot be used directly for constructing confidence intervals. This is because the methods are “adaptive”: They use the training data for model selection, so that spurious correlations between covariates and outcomes affect the selected model, leading to biases that disappear only slowly as the sample size grows. In some contexts, additional assumptions such as “sparsity” (only a few covariates affect the outcomes) can be applied to guarantee consistency or asymptotic (large sample) normality of predictions (9). In this paper, we use an alternative approach that places no restrictions on model complexity, which we refer to as “honesty.” We say that a model is “honest” if it does not use the same information for selecting the model structure (in our case, the partition of the covariate space) as for estimation given a model structure. We accomplish this by splitting the training sample into two parts, one for constructing the tree (including the cross-validation step) and a second for estimating treatment effects within leaves of the tree. Honesty has the implication that the asymptotic properties of treatment effect estimates within the partitions are the same as if the partition had been exogenously given. Although there is a loss of precision due to sample splitting (which reduces sample size in each step of estimation), there is a benefit in terms of eliminating bias that offsets at least part of the cost.

A key contribution of this paper is to show that criteria for both constructing the partition and cross-validation change when we anticipate honest estimation. In the first stage of estimation, the criterion is the expectation of the mean squared error (MSE) when treatment effects are reestimated in the second stage. Crucially, we anticipate that second-stage estimates of treatment effects will be unbiased in each leaf, because they will be performed on an independent sample. In that case, splitting and cross-validation criteria are adjusted to ignore systematic bias in estimation and focus instead on the tradeoff between more tailored prediction (smaller leaf size) and the variance that will arise in the second (honest estimation) stage due to noisy estimation within small leaves.

A second and perhaps more fundamental challenge to applying machine learning methods such as regression trees (5) off-the-shelf to the problem of causal inference is that regularization approaches based on cross-validation typically rely on observing the “ground truth,” that is, actual outcomes in a cross-validation sample. However, if our goal is to minimize the MSE of treatment effects, we encounter what Holland (2) calls the “fundamental problem of causal inference”: The causal effect is not observed for any individual unit, and so we do not directly have a ground truth. We address this by proposing approaches for constructing unbiased estimates of the MSE of the causal effect of the treatment.

Using theoretical arguments and a simulation exercise, we compare our approach with previously proposed ones. Relative to approaches that focus on goodness of fit in model selection, our approach yields substantial improvements in the MSE of treatment effects (ranging from 43% to 210%). We also examine the costs and benefits of honest estimation relative to adaptive estimation. In the settings we consider, honest estimation leads to approximately nominal coverage of confidence intervals across estimation methods and settings, whereas for adaptive estimation approaches coverage can be as low as 69%. The cost of honest estimation in terms of MSE of treatment effects (where for adaptive estimation, we have a larger sample size available for training) ranges from 7% to 22% for our preferred model.

The Problem

Setup.

We consider a setup where there are N units, indexed by i=1,,N. We postulate the existence of a pair of potential outcomes for each unit, (Yi(0),Yi(1)), following the potential outcome or Rubin causal model (13), with the unit-level causal effect defined as the difference in potential outcomes, τi=Yi(1)Yi(0). Let Wi{0,1} be the binary indicator for the treatment, with Wi=0 indicating that unit i received the control treatment and Wi=1 indicating that unit i received the active treatment. The realized outcome for unit i is the potential outcome corresponding to the treatment received:

Yiobs=Yi(Wi)={Yi(0)ifWi=0,Yi(1)ifWi=1.

Let Xi be a K-component vector of features, covariates, or pretreatment variables, known not to be affected by the treatment. Our data consist of the triple (Yiobs,Wi,Xi), for i=1,,N, which are regarded as an independent and identically distributed sample drawn from a large population. Expectations and probabilities will refer to the distribution induced by the random sampling, or by the (conditional) random assignment of the treatment. We assume that observations are exchangeable, and that there is no interference [the stable unit treatment value assumption (10)]. This assumption may be violated in settings where some units are connected through networks. Let p=pr(Wi=1) be the marginal treatment probability, and let e(x)=pr(Wi=1|Xi=x) be the conditional treatment probability (the “propensity score” as defined by ref. 11). In a randomized experiment with constant treatment assignment probabilities e(x)=p for all values of x.

Unconfoundedness.

Throughout the paper, we maintain the assumption of randomization conditional on the covariates, or “unconfoundedness” (11), formalized as given below.

Assumption 1 (Unconfoundedness).

Wi(Yi(0),Yi(1))|Xi,

using the symbol to denote (conditional) independence of two random variables. This assumption is satisfied in a randomized experiment without conditioning on covariates but also may be justified in observational studies if the researcher is able to observe all of the variables that affect the unit’s receipt of treatment and are associated with the potential outcomes.

To simplify exposition, in the main body of the paper we maintain the stronger assumption of complete randomization, whereby Wi(Yi(0),Yi(1),Xi). Later, we show that by using propensity score weighting (1) we can adapt all of the methods to that case.

Conditional Average Treatment Effects and Partitioning.

Define the conditional average treatment effect

τ(x)E[Yi(1)Yi(0)|Xi=x].

A large part of the causal inference literature (e.g., refs. 3 and 1214) is focused on estimating the population (marginal) average treatment effect E[Yi(1)Yi(0)]. The main focus of the current paper is on obtaining accurate estimates of and inferences for the conditional average treatment effect τ(x). We are interested in estimators τ^() [in general we use the ^ symbol to denote estimators for a population quantity—in this case τ(x)] that are based on partitioning the feature space and do not vary within the partitions.

Honest Inference for Population Averages

Our approach departs from conventional classification and regression trees (CART) in two fundamental ways. First, we focus on estimating conditional average treatment effects rather than predicting outcomes. Conventional regression tree methods are therefore not directly applicable because we do not observe unit-level causal effects for any unit. Second, we impose a separation between constructing the partition and estimating effects within leaves of the partition, using separate samples for the two tasks, in what we refer to as honest estimation. We contrast honest estimation with adaptive estimation used in conventional CART, where the same data are used to build the partition and estimate leaf effects. In this section we introduce the changes induced by honest estimation in the context of the conventional prediction setting; in the next section we consider causal effects. In the discussion in this section we observe for each unit i a pair of variables (Yi,Xi), with the interest in the conditional expectation μ(x)E[Yi|Xi=x].

Setup.

We begin by defining key concepts and functions. First, a tree or partitioning Π corresponds to a partitioning of the feature space X, with #(Π) the number of elements in the partition. We write

Π={1,,#(Π)},withj=1#(Π)j=X.

Let denote the space of partitions. Let (x;Π) denote the leaf Π such that x. Let S be the space of data samples from a population. Let π:S be an algorithm that on the basis of a sample SS constructs a partition. As a very simple example, suppose the feature space is X={L,R}. In this case there are two possible partitions, ΠN={L,R} (no split) or ΠS={{L},{R}} (full split), and so the space of trees is ={ΠN,ΠS}={{L,R},{{L},{R}}}. Given a sample S, the average outcomes in the two subsamples are Y¯L and Y¯R. A simple example of an algorithm is one that splits if the difference in average outcomes exceeds a threshold c:

π(S)={{{L,R}}ifY¯LY¯Rc,{{L},{R}}ifY¯LY¯R>c.

The potential bias in leaf estimates from adaptive estimation can be seen in this simple example. Whereas Y¯LY¯R is in general an unbiased estimator for the difference in the population conditional means μ(L)μ(R), if we condition on finding that Y¯LY¯Rc in a particular sample, we expect that Y¯LY¯R is larger than the population analog.

Given a partition Π, define the conditional mean function μ(x;Π) as

μ(x;Π)E[Yi|Xi(x;Π)]=E[μ(Xi)|Xi(x;Π)],

which can be viewed as a step-function approximation to μ(x). Given a sample S the estimated counterpart is

μ^(x;S,Π)1#(iS:Xi(x;Π))iS:Xi(x;Π)Yi,

which is unbiased for μ(x;Π). We index this estimator by the sample because we need to be precise about which sample is used for estimation of the regression function.

The Honest Target.

A central concern in this paper is the criterion used to compare alternative estimators; following much of the literature, we focus on MSE criteria, but we will modify these criteria in a variety of ways.

For the prediction case, we adjust the MSE by E[Yi2]; because this does not depend on an estimator, subtracting it does not affect how the criterion ranks estimators. Given a partition Π, define the MSE, where we average over a test sample Ste and the conditional mean is estimated on an estimation sample Sest, as

MSEμ(Ste,Sest,Π)1#(Ste)iSte{(Yiμ^(Xi;Sest,Π))2Yi2}.

The (adjusted) expected MSE is the expectation of MSEμ(Ste,Sest,Π) over test and estimation samples:

EMSEμ(Π)ESte,Sest[MSEμ(Ste,Sest,Π)],

where the test and estimation samples are independent. In the algorithms we consider, we will consider a variety of estimators for the (adjusted) EMSE, all of which take the form of MSE estimators MSEμ(Ste,Sest,Π), evaluated at the units in sample Ste, with the estimates based on sample Sest and the tree Π. For brevity in this paper we will henceforth omit the term “adjusted” and abuse terminology slightly by referring to these objects as MSE functions.

Our ultimate goal is to construct and assess algorithms π() that maximize the honest criterion

QH(π)ESest,Sest,Str[MSEμ(Ste,Sest,π(Str))].

Note that throughout the paper we focus on maximizing criterion functions, which typically involve the negative of MSE expressions.

The Adaptive Target.

In the conventional CART approach the target is slightly different:

QC(π)ESte,Str[MSEμ(Ste,Str,π(Str))],

where the same training sample is used to construct and estimate the tree. Compared with our target QH(π) the difference is that in our approach different samples Str and Sest are used for construction of the tree and estimation of the conditional means, respectively.

We refer to the conventional CART approach as adaptive and to our approach as honest. In practice there will be costs and benefits of the honest approach relative to the adaptive approach. The cost is sample size; given a dataset, putting some data in the estimation sample leaves fewer units for the training dataset, leading to higher expected MSE. The advantage of honest estimation is that it avoids a problem of adaptive estimation, which is that spurious extreme values of Yi are likely to be placed into the same leaf as other extreme values by the algorithm π(), and thus the sample means (in sample Str) of the elements of π(Str) are more extreme than they would be in an independent sample. This shows up in the poor coverage properties of confidence intervals for adaptive estimation methods relative to the honest methods.

The Implementation of CART.

There are two distinct parts of the conventional CART algorithm, initial tree building and cross-validation to select a complexity parameter used for pruning. Each part of the algorithm relies on a criterion function based on MSE. In this paper we will take as given the overall structure of the CART algorithm (e.g., refs. 4 and 5), and our focus will be on modifying the criteria.

In the tree-building phase, CART recursively partitions the observations of the training sample. For each leaf, the algorithm evaluates all candidate splits of that leaf (which induce alternative partitions Π) using a “splitting” criterion that we refer to as the “in-sample” goodness-of-fit criterion MSEμ(Str,Str,Π).

It is well understood that the conventional criterion leads to overfitting, a problem that is solved by cross-validation to select a penalty on tree depth. The in-sample goodness-of-fit criterion will always improve with additional splits, even though additional refinements of a partition Π might in fact increase the expected MSE, especially when the leaf sizes become small. The reason is that the criterion ignores the fact that smaller leaves lead to higher-variance estimates of leaf means. To account for this factor, the conventional approach to avoiding overfitting is to add a penalty term to the criterion that is equal to a constant times the number of splits, so that essentially we only consider splits where the improvement in a goodness-of-fit criterion is above some threshold. The penalty term is chosen to maximize a goodness-of-fit criterion in cross-validation samples. In the conventional cross-validation the training sample is repeatedly split into two subsamples, the Str,tr sample that is used to build a new tree as well as estimate the conditional means and the Str,cv sample that is used to evaluate the estimates. We “prune” the tree using a penalty parameter that represents the cost of a leaf. We choose the optimal penalty parameter by evaluating the trees associated with each value of the penalty parameter. The goodness-of-fit criterion for cross-validation can be written as MSEμ(Str,cv,Str,tr,Π). Note that the cross-validation criterion directly addresses the issue we highlighted with the in-sample goodness-of-fit criterion, because Str,cv is independent of Str,tr, and thus too-extreme estimates of leaf means will be penalized. The issue that smaller leaves lead to noisier estimates of leaf means is implicitly incorporated by the fact that a smaller leaf penalty will lead to deeper trees and thus smaller leaves, and the noisier estimates will lead to larger average MSE across the cross-validation samples.

Honest Splitting.

In our honest estimation algorithm, we modify CART in two ways. First, we use an independent sample Sest instead of Str to estimate leaf means. Second (and closely related), we modify our splitting and cross-validation criteria to incorporate the fact that we will generate unbiased estimates using Sest for leaf estimation (eliminating one aspect of over-fitting), where Sest is treated as a random variable in the tree building phase. We explicitly incorporate the fact that finer partitions generate greater variance in leaf estimates.

To begin developing our criteria, let us expand EMSEμ(Π):

EMSEμ(Π)=E(Yi,Xi),Sest[(Yiμ(Xi;Π))2Yi2]EXi,Sest[(μ^(Xi;Sest,Π)μ(Xi;Π))2]=EXi[μ2(Xi;Π)]ESest,Xi[V(μ^2(Xi;Sest,Π)],

where we exploit the equality ES[μ^(x;S,Π)]=μ(x;Π).

We wish to estimate EMSEμ(Π) on the basis of the training sample Str and knowledge of the sample size of the estimation sample Nest. To construct an estimator for the second term, observe that within each leaf of the tree there is an unbiased estimator for the variance of the estimated mean in that leaf. Specifically, to estimate the variance of μ^(x;Sest,Π) on the training sample we can use

V^(μ^(x;Sest,Π))SStr2((x;Π))Nest((x;Π)),

where SStr2() is the within-leaf variance, to estimate the variance. We then weight this by the leaf shares p to estimate the expected variance. Assuming the leaf shares are approximately equal in the estimation and training samples, we can approximate this variance estimator by

E^[V(μ^2(Xi;Sest,Π)|iSte]1NestΠSStr2().

To estimate the average of the squared outcome μ2(x;Π) (the first term of the target criterion), we can use the square of the estimated means in the training sample μ^2(x;Str,Π), minus an estimate of its variance,

E^[μ2(x;Π)]=μ^2(x;Str,Π)SStr2((x;Π))Ntr((x;Π)).

Combining these estimators leads to the following unbiased estimator for EMSEμ(Π):

EMSEμ^(Str,Nest,Π)1NtriStrμ^2(Xi;Str,Π)(1Ntr+1Nest)ΠSStr2((x;Π)).

Comparing this to the criterion used in the conventional CART algorithm, which can be written as

MSEμ(Str,Str,Π)=1NtriStrμ^2(Xi;Str,Π),

the difference comes from the terms involving the variance. For a given x, SStr2((x;Π)) is proportional to the MSE within the associated leaf; thus, the difference between the adaptive and honest criteria is how the within-leaf MSE is weighted, where the honest criterion penalizes small leaf size.

Honest Cross-Validation.

Even though EMSEμ^(Str,Nest,Π) is approximately unbiased as an estimator of our ideal criterion EMSEμ(Π) for a fixed Π, it is not unbiased when we use it repeatedly to evaluate splits using recursive partitioning on the training data Str. The reason is that initial splits tend to group together observations with similar, extreme outcomes. So, after the training data have been divided once, the sample variance of observations in the training data within a given leaf is on average lower than the sample variance would be in a new, independent sample. Thus, EMSEμ^(Str,Nest,Π) is likely to overstate goodness of fit as we grow a deeper and deeper tree, implying that cross-validation can still play an important role with our honest estimation approach, although perhaps less so than in the conventional CART.

Because the conventional CART cross-validation criterion does not account for honest estimation we consider the analog of our unbiased estimate of the criterion, which accounts for honest estimation by evaluating a partition Π using only outcomes for units from the cross-validation sample Str,cv:

EMSEμ^(Str,cv,Nest,Π).

This estimator for the honest criterion is unbiased for fixed Π, although it may have higher variance than MSEμ(Str,cv,Str,tr,Π) due to the small sample size of the cross-validation sample. Note that when we apply the formula for EMSEμ^ in this case, we replace Ntr with Ntr,cv.

Honest Inference for Treatment Effects

In this section we change the focus to estimating conditional average treatment effects instead of estimating conditional population means. We refer to the estimators developed in this section as “causal tree” (CT) estimators.

The setting with treatment effects creates some specific problems because we do not observe the value of the treatment effect whose conditional mean we wish to estimate. This complicates the calculation of the criteria we introduced in the previous section. However, a key point of this paper is that we can estimate these criteria and use those estimates for splitting and cross-validation.

We now observe in each sample the triple (Yiobs,Xi,Wi). For a sample S let Streat and Scontrol denote the subsamples of treated and control units, respectively, with cardinality Ntreat and Ncontrol, respectively, and let p=Ntreat/N be the share of treated units. The concept of a tree remains the same as in the previous section. Given a tree Π, define for all x and both treatment levels w the population average outcome

μ(w,x;Π)E[Yi(w)|Xi(x;Π)],

and the average causal effect

τ(x;Π)E[Yi(1)Yi(0)|Xi(x;Π)]=μ(1,x;Π)μ(0,x;Π).

The estimated counterparts are

μ^(w,x;S,Π)1#({iSw:Xi(x;Π)})iSw:Xi(x;Π)Yiobs,
τ^(x;S,Π)μ^(1,x;S,Π)μ^(0,x;S,Π).

Define the MSE for treatment effects as

MSEτ(Ste,Sest,Π)1#(Ste)iSte{(τiτ^(Xi;Sest,Π))2τi2},

and define EMSEτ(Π) to be its expectation over the estimation and test samples,

EMSEτ(Π)ESte,Sest[MSEτ(Ste,Sest,Π)].

A key challenge is that, in contrast to MSEμ(Ste,Sest,Π), the workhorse MSE function MSEτ(Ste,Sest,Π) is infeasible, because we do not observe τi. However, we show below that we can estimate it.

Modifying Conventional CART for Treatment Effects.

Consider first modifying conventional (adaptive) CART to estimate heterogeneous treatment effects. Note that in the prediction case, using the fact that μ^ is constant within each leaf, we can write

MSEμ(Ste,Str,Π)=2NtriSteμ^(Xi;Ste,Π)μ^(Xi;Str,Π)+1NtriSμ^2(Xi;Str,Π).

In the treatment effect case we can use the fact that

ESte[τi|iSte:i(x,Π)]=ESte[τ^(x;Ste,Π)]

to construct an unbiased estimator of MSEτ(Ste,Str,Π):

MSE^τ(Ste,Str,Π)2NtriSteτ^(Xi;Ste,Π)τ^(Xi;Str,Π)+1NtriSteτ^2(Xi;Str,Π).

This leads us to propose, by analogy to CART’s in-sample MSE criterion MSEμ(Str,Str,Π),

MSE^τ(Str,Str,Π)=1NtriStrτ^2(Xi;Str,Π),

as an estimator for the infeasible in-sample goodness-of-fit criterion. For cross-validation we used in the prediction case MSEμ(Str,cv,Str,tr,Π). Again, the treatment effect analog is infeasible, but we can use an unbiased estimate of it, which leads to MSE^τ(Str,cv,Str,tr,Π).

Modifying the Honest Approach.

The honest approach described in the previous section for prediction problems also needs to be modified for the treatment effect setting. Using the same expansion as before, now applied to the treatment effect setting, we find

EMSEτ(Π)=EXi[τ2(Xi;Π)]ESest,Xi[V(τ^2(Xi;Sest,Π)].

For splitting we can estimate both components of this expectation using only the training sample, yielding an estimator for the infeasible criterion that depends only on Str and Nest:

EMSE^τ(Str,Nest,Π)1NtriStrτ^2(Xi;Str,Π)
(1Ntr+1Nest)Π(SStreattr2()p+SScontroltr2()1p).

For cross-validation we use the same expression, now with the cross-validation sample: EMSE^τ(Str,cv,Nest,Π).

These expressions are directly analogous to the criteria we proposed for the honest version of CART in the prediction case. The criteria reward a partition for finding strong heterogeneity in treatment effects and penalize a partition that creates variance in leaf estimates. One difference is that in the prediction case the two terms both tend to select features that predict heterogeneity in outcomes, whereas for the treatment effect case the two terms reward different types of features. It is possible to reduce the variance of a treatment effect estimator by introducing a split, even if both child leaves have the same average treatment effect, if a covariate affects the mean outcome but not treatment effects. In such a case, the split results in more homogeneous leaves, and thus lower-variance estimates of the means of the treatment group and control group outcomes. Thus, the distinction between adaptive and honest splitting criterion will be more pronounced for treatment effect estimation. As in the prediction case, the cross-validation criterion estimates treatment effects within leaves using the Str,cv sample rather than Str,tr.

Four Partitioning Estimators for Causal Effects

In this section we briefly summarize our CT estimator and then describe three alternative types of estimators. We compare CT to the alternatives theoretically and through simulations. For each of the four types there is an adaptive version and an honest version, where the latter takes into account that estimation will be done on a sample separate from the sample used for constructing the partition, leading to a total of eight estimators. Note that further variations are possible; one could use adaptive splitting and cross-validation methods to construct a tree but still perform honest estimation on a separate sample. We do not consider such variations.

CTs.

The discussion above developed our preferred estimator, CTs. To summarize, for the adaptive version of CTs, denoted CT-A, we use for splitting the objective MSE^τ(Str,Str,Π). For cross-validation we use the same objective function, but evaluated at the samples Str,cv and Str,tr, namely MSE^τ(Str,cv,Str,tr,Π). For the honest version, CT-H, the splitting objective function is EMSE^τ(Str,Nest,Π). For cross-validation we use the same objective function, but now evaluated at the cross-validation sample, EMSE^τ(Str,cv,Nest,Π).

Transformed Outcome Trees.

Our first alternative method is based on the insight that by using a transformed version of the outcome Yi=Yi(Wip)/(p(1p)) it is possible to use off-the-shelf regression tree methods to focus splitting and cross-validation on treatment effects rather than outcomes. Similar approaches are used in refs. 1518. Because E[Yi|Xi=x]=τ(x), off-the-shelf CART methods can be used directly, where estimates of the sample average of Yi within each leaf can be interpreted as estimates of treatment effects. This ease of application is the key attraction of this method. The main drawback (relative to CT-A) is that in general it is not efficient because it does not use the information in the treatment indicator beyond the construction of the transformed outcome. For example, the sample average in S of Yi within a given leaf (x;Π) will only be equal to τ^(x;Π,S) if the fraction of treated observations within the leaf is exactly equal to p. Because this method is primarily considered as a benchmark, in simulations we focus only on an adaptive version that can use existing learning methods entirely off-the-shelf. The adaptive version of the transformed outcome tree (TOT) estimator we consider, TOT-A, uses the conventional CART algorithm with the transformed outcome replacing the original outcome. The honest version, TOT-H, uses the same splitting and cross-validation criteria, so that it builds the same trees; it differs only in that a separate estimation sample is used to construct the leaf estimates. The treatment effect estimator within a leaf is the same as the adaptive method, that is, the sample mean of Yi within the leaf.

Fit-Based Trees.

We consider two additional alternative methods for constructing trees, based on suggestions in the literature. In the first of these alternatives the choice of which feature to split on, and at what value of the feature to split, is based on comparisons of the goodness of fit (F) of the outcome rather than the treatment effect. In standard CART of course goodness of fit of outcomes is also the split criterion, but here we estimate a model for treatment effects within each leaf. Specifically, we have a linear model with an intercept and an indicator for the treatment as the regressors, rather than only an intercept as in standard CART. This approach is used in Zeileis et al. (19), who consider building general models at the leaves of the trees. Treatment effect estimation is a special case of their framework. Zeileis et al. (19) propose using statistical tests based on improvements in goodness of fit to determine when to stop growing the tree, rather than relying on cross-validation, but for ease of comparison with CART, in this paper we will stay closer to traditional CART in terms of growing deep trees and pruning them. We modify the MSE function:

MSEμ,W(Ste,Sest,Π)iSte((Yiobsμ^w(Wi,Xi;Sest,Π))2Yi2).

For the adaptive version F-A we follow conventional CART, using the criterion MSEμ,W in place of MSEμ (that is, using MSEμ,W(Str,Str,Π) for splitting and MSEμ,W(Str,cv,Str,tr,Π) for cross-validation). For the honest version we use the analog of EMSE^μ(Str,Nest,Π), with μ^w in place of μ^, for training, and the same function evaluated at (Str,cv,Nest,Π) for cross-validation. To highlight the disadvantages of the F approach, consider a case where two splits improve the fit to an equal degree. In one case, the split leads to variation in average treatment effects, and in the other case it does not. The first split would be better from the perspective of estimating heterogeneous treatment effects, but the fit criterion would view the two splits as equally attractive.

Squared T-Statistic Trees.

For the last estimator we look for splits with the largest value for the square of the t-statistic (TS) for testing the null hypothesis that the average treatment effect is the same in the two potential leaves. This estimator was proposed by Su et al. (20). If the two leaves are denoted L (Left) and R (Right), the square of the t-statistic is

T2N(Y¯LY¯R)2S2/NL+S2/NR,

where S2 is the conditional sample variance given the split. At each leaf, successive splits are determined by selecting the split that maximizes T2. The concern with this criterion is that it places no value on splits that improve the fit, even though our characterization of EMSEτ shows that improving fit has value through reduction of the variance of leaf estimates. Both the adaptive and honest versions of the TS approach use T2 as the splitting criterion. For cross-validation and pruning, it is less obvious how to proceed. Zeileis et al. (19) suggest that when using a statistical test for splitting, if it is desirable to grow deep trees and then cross-validate to determine depth, then one can use a standard goodness-of-fit measure for pruning and cross-validation. However, this could undermine the key advantage of TS, to focus on heterogeneous treatment effects. For this reason, we instead propose to use the CT-A and CT-H criteria for cross-validation for TS-A and TS-H, respectively.

Comparison of the CTs, the F Criterion, and the TS Criterion.

It is useful to compare our proposed criterion to the F and TS criteria in a simple setting to gain insight into the relative merits of the three approaches. We do so here focusing on a decision whether to proceed with a single possible split, based on a binary covariate Xi{L,R}. Let ΠN and ΠS denote the trees without and with the split, and let Y¯w, Y¯Lw and Y¯Rw denote the average outcomes for units with treatment status Wi=w. Let Nw, NLw, and NRw be the sample sizes for the corresponding subsamples. Let S2 be the sample variance of the outcomes given a split, and let S˜2 be the sample variance without a split. Define the squared t-statistics for testing that the average outcomes for control (treated) units in both leaves are identical:

T02(Y¯L0Y¯R0)2S2/NL0+S2/NR0,T12(Y¯L1Y¯R1)2S2/NL1+S2/NR1.

Then, we can write the improvement in goodness of fit from splitting the single leaf into two leaves as

F=S˜22(T02+T12)1+2(T02+T12)/N.

Ignoring degrees-of-freedom corrections, the change in our proposed criterion for the honest version of the CT in this simple setting can be written as a combination of the F and TS criteria:

EMSE^τ(S,ΠN)EMSE^τ(S,ΠS)=(T24)(S˜2F/N)+2S˜2p(1p).

The CT-H criterion focuses primarily on T2. Unlike TS, however, it incorporates the benefits of improving fit.

Inference

Given the estimated conditional average treatment effect we also would like to do inference. Once constructed, the tree is a function of covariates, and if we use a distinct sample to conduct inference, then the problem reduces to that of estimating treatment effects in each member of a partition of the covariate space. For this problem, standard approaches are therefore valid for the estimates obtained via honest estimation and, in particular, no assumptions about model complexity are required. As our simulations below illustrate, for the adaptive methods standard approaches to confidence intervals are not generally valid for the reasons discussed above.

A Simulation Study

To assess the relative performance of the proposed algorithms we carried out a small simulation study with three distinct designs. In Table 1 we report a number of summary statistics from the simulations. We report averages; results for medians are similar. We report results for Ntr=Nest with either 500 or 1,000 observations. When comparing adaptive to honest approaches, we report the ratio of the MSEτ for adaptive estimation with Ntr=1,000 to MSEτ for honest estimation with Ntr=Nest=500, to highlight the tradeoff between sample size and bias reduction that arises with honest estimation. We evaluate MSEτ using a test sample with Nte=8,000 observations to minimize the sampling variance.

Table 1.

Simulation study

Ntr=Nest Design 1 Design 2 Design 3
Estimator 500 1,000 500 1,000 500 1,000
No. of leaves
TOT 2.9 3.2 2.9 3.5 3.6 5.4
F-A 6.1 13.1 6.3 13.0 6.2 13.0
TS-A 4.0 5.4 3.4 5.1 3.4 6.6
CT-A 4.0 5.5 3.2 3.7 3.5 5.4
F-H 6.0 12.9 6.3 13.0 6.3 13.1
TS-H 4.3 7.8 5.6 11.4 5.9 12.4
CT-H 4.2 7.6 5.6 11.4 6.1 12.5
Infeasible MSE divided by infeasible MSE for CT-H*
TOT-H 1.554 1.938 1.089 1.069 1.081 1.042
F-H 1.790 1.427 1.983 2.709 1.502 2.085
TS-H 0.971 0.963 1.183 1.145 1.178 1.338
Ratio of infeasible MSE: Adaptive to honest
TOT-A/TOT-H 1.021 0.754 0.717
F-A/F-H 0.491 0.985 0.993
T-A/T-H 0.935 0.841 0.918
CT-A/CT-H 0.929 0.851 0.785
Coverage of 90% confidence intervals – adaptive
TOT-A 0.82 0.85 0.78 0.81 0.69 0.74
F-A 0.89 0.89 0.83 0.84 0.82 0.82
TS-A 0.84 0.84 0.78 0.82 0.75 0.75
CT-A 0.83 0.84 0.78 0.82 0.76 0.79
Coverage of 90% confidence intervals – honest
TOT-H 0.90 0.90 0.90 0.89 0.89 0.90
F-H 0.90 0.90 0.90 0.90 0.90 0.90
TS-H 0.90 0.90 0.91 0.91 0.89 0.90
CT-H 0.89 0.90 0.90 0.90 0.89 0.90
*

MSEτ(Ste,Sest,πEstimator(Str))/MSEτ(Ste,Sest,πCTH(Str)).

MSEτ(Ste,SestStr,πEstimatorA(SestStr))/MSEτ(Ste,Sest,πEstimatorH(Str).

In all designs the marginal treatment probability is P=0.5, K denotes the number of features, we have a model η(x) for the mean effect and κ(x) for the treatment effect, and the potential outcomes are written, for w=0,1,

Yi(w)=η(Xi)+12(2w1)κ(Xi)+ϵi,

where ϵiN(0,.01), and the Xi are independent of ϵi and one another, and XiN(0,1). The designs follow:

1:K=2;η(x)=12x1+x2;κ(x)=12x1.
2:K=10;η(x)=12k=12xk+k=36xk;κ(x)=k=121{xk>0}xk
3:K=20;η(x)=12k=14xk+k=58xk;κ(x)=k=141{xk>0}xk.

In each design, there are some covariates that affect treatment effects (κ) and mean outcomes (η), some covariates that enter η but not κ; and some covariates that do not affect outcomes at all (“noise” covariates). Design 1 does not have noise covariates. In designs 2 and 3, the first few covariates enter κ, but only when their signs are positive, whereas they affect η throughout their range. Different criterion will thus lead to different optimal splits, even within a covariate; F will focus more on splits when the covariates are negative.

The first section of Table 1 compares the number of leaves in different designs and different values of Ntr=Nest. Recalling that TOT-A and TOT-H have the same splitting method, we see that it tends to build shallow trees. The failure to control for the realized value of Wi leads to additional noise in estimates, which tends to lead to aggressive pruning. For the TS and CT estimators, the adaptive versions lead to shallower trees than the honest versions, because the honest versions anticipate correcting for bias in leaf estimates and thus prune less; the main cost of small leaf size is high variance in leaf estimates. F-A and F-H are very similar; the splitting criteria are similar, and further, the F estimators are less prone to overfitting treatment effects, because they split based upon overall model fit. We also observe that the F estimators build the deepest trees; they reward splitting on covariates that affect mean outcomes as well as treatment effects.

The second section of Table 1 examines the performance of the alternative honest estimators, as evaluated by the infeasible criterion MSEτ. We report the ratio of the average of MSEτ for a given estimator to MSEτ for our preferred estimator, CT-H. The TOT-H estimator performance is within 10% of CT in designs 2 and 3 but suffers in design 1. In design 1, the variance of Yi conditional on (Wi,Xi) is very low at 0.01, and so the failure of TOT to account for the realization of Wi results in a noticeable loss of performance. The F-H estimator suffers in all three designs; all designs give the F-H criterion attractive opportunities to split based on covariates that do not enter κ. F-H would perform better in alternative designs where η(x)=κ(x); F-H also does well at avoiding splits on noise covariates. The TS-H estimator performs well in design 1, where x1 affects η and κ the same way, so that the CT-H criterion is aligned with TS-H. Designs 2 and 3 are more complex, and the ideal splits from the perspective of balancing overall MSE of treatment effects (including variance reduction) are different from those favored by TS-H. Thus, TS performs worse, and the difference is exacerbated with larger sample size in design 3, where there are more opportunities for the estimators to build deeper trees and thus to make different choices. We also calculate comparisons based on a feasible criterion, the average squared difference between the transformed outcome Yi and the estimated treatment effect τ^i. For details see SI Appendix. The results are consistent with those from the infeasible criterion, but the feasible criterion compresses the performance differences.

The third section of Table 1 explores the costs and benefits to honest estimation. The table reports the ratio of MSEτ(Ste,SestStr,πEstimatorA(SestStr)) to MSEτ(Ste,Sest,πEstimatorH(Str) for each estimator. The adaptive version uses the union of the training and estimation samples for tree building, cross-validation, and leaf estimation, yielding double the sample size (1,000 observations) at each step. The honest version uses 500 of the observations in training and cross-validation, with the complement used for estimating treatment effects within leaves. The results show that in most cases there is a cost to honest estimation in terms of MSEτ, varying by design and estimator. The cost is large for the fit estimator in design 1; with a smaller sample size it largely ignores treatment effect heterogeneity in splitting. For CT, the cost ranges from 6.8 to 21.5%.

The final two sections of Table 1 show the coverage rate for 90% confidence intervals. We achieve nominal coverage rates for honest methods in all designs, where, in contrast, the adaptive methods have coverage rates substantially below nominal rates. The fit estimator has the highest adaptive coverage rates; it does not focus on treatment effects and thus is less prone to overstating that heterogeneity through adaptive estimation. Thus, our simulations bear out the tradeoff that honest estimation sacrifices some goodness of fit (of treatment effects) in exchange for valid confidence intervals.

Observational Studies with Unconfoundedness

The discussion so far has focused on the setting where the assignment to treatment is randomized. The proposed methods can be adapted to observational studies under the assumption of unconfoundedness. In that case we need to modify the estimates within leaves to remove the bias from simple comparisons of treated and control units. There is a large literature on methods for doing so (e.g., ref. 3). For example, as in ref. 21 we can do so by propensity score weighting. Efficiency will improve if we renormalize the weights within each leaf and within the treatment and control group when estimating treatment effects. Crump et al. (22) propose approaches to trimming observations with extreme values for the propensity score to improve robustnesses. Note that there are some additional conditions required to establish asymptotic normality of treatment effect estimates when propensity score weighting is used (see, e.g., ref. 21); these results apply without modification to the estimation phase of honest partitioning algorithms.

The Literature

A small but growing literature seeks to apply supervised machine learning techniques to the problem of estimating heterogeneous treatment effects. Beyond those previously discussed, Tian et al. (23) transform the features rather than the outcomes and then apply LASSO to the model with the original outcome and the transformed features. Foster et al. (24) estimate μ(w,x)=E[Yi(w)|Xi=x] for w=0,1 using random forests, then calculate τ^i=μ^(1,Xi)μ^(0,Xi). They use machine learning algorithms to estimate τ^i as a function of the units’ attributes, Xi. Imai and Ratkovic (25) use LASSO to estimate the effects of both treatments and attributes, but with different penalty terms for the two types of features to allow for the possibility that the treatment effects are present but the magnitudes of the interactions are small. Their approach is similar to ours in that they distinguish between the estimation of treatment effects and the estimation of the impact of other attributes of units. Green and Kern (26) use Bayesian additive regression trees to model treatment effect heterogeneity. Taddy et al. (27) consider a model with the outcome linear in the covariates and the interaction with the treatment variable. Using Bayesian nonparametric methods, they project estimates of heterogeneous treatment effects onto the feature space using LASSO-type regularization methods to get low-dimensional summaries of heterogeneity. Dudik et al. (16) and Beygelzimer and Langford (15) propose a related approach for finding the optimal treatment policy that combines inverse propensity score methods with “direct methods” [e.g., directly estimating μ(w,x)] that predict the outcome as a function of the treatment and the unit attributes. The methods can be used to evaluate the average difference in outcomes from any two policies that map attributes to treatments, as well as to select the optimal policy function. They do not focus on hypothesis testing for heterogeneous treatment effects, and they use conventional approaches for cross-validation. Also related is targeted learning (28), which modifies the loss function to increase the weight on the parts of the likelihood that concern parameters of interest, and work on experimental design optimized to find subpopulations with positive treatment effects (29). Finally, Wager and Walther (30) adjust confidence intervals to account for adaptive estimation, and List et al. (31) adjust for exhaustively searching the space of simple partitions.

Conclusion

In this paper we introduce methods for constructing trees for causal effects that allow us to do valid inference for the causal effects in randomized experiments and in observational studies satisfying unconfoundedness. These methods provide valid confidence intervals without restrictions on the number of covariates or the complexity of the data-generating process. Our methods partition the feature space into subspaces. The output of our method is a set of treatment effects and confidence intervals for each subspace.

A potentially important application of the techniques is to “data mining” in randomized experiments. Our method can be used to explore any previously conducted randomized controlled trial, for example, medical studies or field experiments in development economics. Our methods can discover subpopulations with lower-than-average or higher-than-average treatment effects while producing confidence intervals for these estimates with nominal coverage, despite having searched over many possible subpopulations.

Supplementary Material

Supplementary File
pnas.1510489113.sapp.pdf (142.8KB, pdf)

Acknowledgments

We are grateful for comments provided at seminars at the National Academy of Sciences Sackler Colloquium, the Southern Economics Association, the Stanford Conference on Causality in the Social Sciences, the MIT Conference in Digital Experimentation, Harvard University, University of Washington, Microsoft Research, Facebook, KDD, the AAAI Embedded Machine Learning Conference, the University of Pennsylvania, the California Econometrics Conference, the Collective Intelligence Conference, the University of Arizona, the Paris DataLead conference, Cornell University, Carnegie Mellon University, University of Bonn, University of California, Berkeley, the DARPA conference on Machine Learning and Causal Inference, and the NYC Data Science Seminar Series. Part of this research was conducted while the authors were visiting Microsoft Research.

Footnotes

Conflict of interest statement: The authors received funding from Microsoft Research.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Drawing Causal Inference from Big Data,” held March 26–27, 2015, at the National Academies of Sciences in Washington, DC. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/Big-data.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510489113/-/DCSupplemental.

References

  • 1.Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. Educ Psychol. 1974;66(5):688–701. [Google Scholar]
  • 2.Holland P. Statistics and causal inference (with discussion) J Am Stat Assoc. 1986;81(396):945–970. [Google Scholar]
  • 3.Imbens G, Rubin D. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge Univ Press; Cambridge, UK: 2015. p. 159. [Google Scholar]
  • 4.Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Ed Springer; New York: 2011. [Google Scholar]
  • 5.Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Wadsworth; Belmont, CA: 1984. [Google Scholar]
  • 6.Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. [Google Scholar]
  • 7.Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58(1):267–288. [Google Scholar]
  • 8.Vapnik V. Statistical Learning Theory. Wiley; New York: 1998. [Google Scholar]
  • 9.Wager S, Athey S. 2015 Estimation and inference of heterogeneous treatment effects using random forests. Available at arxiv.org/abs/1510.04342.
  • 10.Rubin D. Bayesian inference for causal effects: The role of randomization. Ann Stat. 1978;6(1):34–58. [Google Scholar]
  • 11.Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. [Google Scholar]
  • 12.Abadie A, Imbens G. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74(1):235–267. [Google Scholar]
  • 13.Pearl J. Causality: Models, Reasoning and Inference. Cambridge Univ Press; Cambridge, UK: 2000. [Google Scholar]
  • 14.Rosenbaum P. Observational Studies. Springer; New York: 2002. [Google Scholar]
  • 15.Beygelzimer A, Langford J. 2009. The offset tree for learning with partial labels. arxiv:0812.4044.
  • 16.Dudik M, Langford J, Li L. 2011. Doubly robust policy evaluation and learning. Proceedings of the 28th International Conference on Machine Learning (International Machine Learning Society)
  • 17.Sigovitch J. 2007. Identifying informative biological markers in high-dimensional genomic data and clinical trials. PhD thesis (Harvard Univ, Cambridge, MA)
  • 18.Weisberg HI, Pontes VP. Post hoc subgroups in clinical trials: Anathema or analytics? Clin Trials. 2015;12(4):357–364. doi: 10.1177/1740774515588096. [DOI] [PubMed] [Google Scholar]
  • 19.Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. J Comput Graph Stat. 2008;17(2):492–514. [Google Scholar]
  • 20.Su X, Tsai C, Wang H, Nickerson D, Li B. Subgroup analysis via recursive partitioning. J Mach Learn Res. 2009;10:141–158. [Google Scholar]
  • 21.Hirano K, Imbens G, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161–1189. [Google Scholar]
  • 22.Crump R, Hotz J, Imbens G, Mitnik O. Nonparametric tests for treatment effect heterogeneity. Rev Econ Stat. 2008;90(3):389–405. [Google Scholar]
  • 23.Tian L, Alizadeh AA, Gentles AJ, Tibshirani R. A simple method for estimating interactions between a treatment and a large number of covariates. J Am Stat Assoc. 2014;109(508):1517–1532. doi: 10.1080/01621459.2014.951443. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Foster JC, Taylor JM, Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med. 2011;30(24):2867–2880. doi: 10.1002/sim.4322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Imai K, Ratkovic M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat. 2013;7(1):443–470. [Google Scholar]
  • 26.Green D, Kern H. Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opin Q. 2012;76(3):491–511. [Google Scholar]
  • 27.Taddy M, Gardner M, Chen L, Draper D. 2015. Heterogeneous treatment effects in digital experimentation. arXiv:1412.8563.
  • 28.Van Der Laan M, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer; New York: 2011. [Google Scholar]
  • 29.Rosenblum M, Van der Laan MJ. Optimizing randomized trial designs to distinguish which subpopulations benefit from treatment. Biometrika. 2011;98(4):845–860. doi: 10.1093/biomet/asr055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wager S, Walther G. 2015. Uniform convergence of random forests via adaptive concentration. arxiv:1503.06388.
  • 31.List J, Shaikh A, Xu Y. 2016. Multiple hypothesis testing in experimental economics. NBER Working Paper No. 21875 (National Bureau of Economic Research, Cambridge, MA)

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary File
pnas.1510489113.sapp.pdf (142.8KB, pdf)

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES