Summary
Assessing heterogeneous treatment effects is of growing interest in advancing precision medicine, and individualized treatment effects (ITE) play a critical role in this endeavor. For experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees.1 To speed up tree construction, we propose a smooth sigmoid surrogate (SSS) method as an alternative to greedy search. RFIT outperforms the ‘separate regression’ approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT are obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial.
Keywords: individualized treatment effects, infinitesimal jackknife, precision medicine, random forests, treatment-by-covariate interaction
1 | INTRODUCTION
Precision medicine aims to optimize the delivery of stratified or individualized therapies by integrating comprehensive patient data. This emerging approach has attracted growing interest in biomedical applications, but it faces many statistical challenges before it can be broadly deployed in clinical practice. Many available statistical methods, driven by ‘one-size-fits-all’ conventional medicine, are primarily concerned with the overall main effect of a treatment over the entire population and rely heavily on traditional significance testing. To advance precision medicine, one critical statistical challenge is to understand and quantify differential treatment effects.
There are many newly proposed approaches in this endeavor; see Lipkovich, Dmitrienko, and D’Agostino 2 for a recent survey. Among them, tree-based methods3 are dominant for several reasons. Built simply on the basis of a two-sample test statistic, trees facilitate a powerful comprehensive modeling scheme by recursively splitting data. Differential treatment effects essentially involve treatment-by-covariate interactions, which may be of nonlinear forms and of high orders. Trees excel in dealing with complex interactions. Furthermore, tree models are capable of handling high-dimensional covariates of mixed types and present as an off-the-shelf tool in the sense that minimal data preparation is required.
Interaction trees (IT)1 extend tree methods to subgroup analysis by explicitly assessing the treatment-by-covariate interactions. In the ‘virtual twins’ 4 approach, subgroups are identified by first estimating the potential outcomes. SIDES5 (Subgroup Identification based on Differential Effect Search) seeks subgroups with enhanced treatment effects, possibly taking into account both efficacy and toxicity. QUINT (QUalitative INteraction Trees6) focuses on qualitative interactions. Loh, He, and Man7 proposed a tree procedure for subgroup identification that is less prone to biased variable selection. The optimal treatment regime,8 which aims to find the recommended treatment based on individual patient information, offers an alternative way of looking at the problem. Along this direction, tree-based approaches are also common.9,10
There are typically two types of precision medicine: stratified medicine and personalized medicine. The aforementioned methods belong to the former type, primarily concerned with stratified treatment effects or regimes, where groups of individuals showing homogeneous treatment effects are sought. Comparatively, individualized treatment effects (ITE) are of key importance in deploying tailored treatment plans as part of personalized medicine. A model for ITE predicts the effect of treatment on a future patient. ITE assessment affords deeper study of treatment efficacy by quantifying how heterogeneous treatment effects are and whether directional or qualitative interactions exist. This information allows for estimation of the proportion of patients who benefit from the treatment and identification of those who may be harmed by it. In addition, ITE models can pinpoint important predictive factors,11 i.e., patient characteristics that moderate or modify the treatment effects, and offer insight into the pharmacological mechanisms of a drug. ITE estimation is necessarily a first step for many methods4,9,10 in stratified medicine and optimal treatment regimes: the optimal choice of treatment is revealed once the ITE is known.
Our focus is on the estimation of ITE with data collected from randomized trials. One available method for this task is separate regression (SR),4,12 in which separate predictive models for the response variable are built using data in the treated group and data in the untreated group, respectively, and applied to each individual. The difference in predicted response from the two models supplies an estimator of ITE. The idea of SR is intuitive within the causal inference framework; we elaborate in the ensuing sections. One major shortcoming is that SR has to deal with both prognostic and predictive factors, although ITE assessment involves predictive factors only. Moreover, no standard error formula is available for the ITE estimated by SR.
To overcome the deficiencies of SR, we examine an ensemble learning approach for ITE estimation using interaction trees.1 We term the proposed method RFIT, for random forests of interaction trees. Our methodological contribution is threefold: first, we implement random forests on the basis of interaction trees, which differ from ordinary random forests13 of classification or regression trees;3 second, a faster alternative splitting method, called smooth sigmoid surrogate (SSS), is introduced to speed up the construction of interaction trees; third, we extend the infinitesimal jackknife method14 to compute standard errors for the ITE estimates. Compared to SR, RFIT is superior in that it focuses exclusively on predictive factors. We investigate the performance of RFIT via extensive numerical experiments.
The remainder of the article is organized as follows. In Section 2, we first introduce the concept of ITE within Rubin’s causal model framework. We then present RFIT with SSS splitting for estimating ITE and the standard error formula for estimated ITE. Section 3 contains simulation experiments that are designed to compare RFIT with other methods and assess the proposed SE formulation. In Section 4, we illustrate our proposed RFIT approach with data from an acupuncture headache trial.
2 | RANDOM FORESTS OF INTERACTION TREES (RFIT)
Consider a randomized trial with data 𝒟 = {(yi, Ti, xi): i = 1,…, n} consisting of n IID copies of (Y,T,X), where yi is the continuous response or outcome for the i-th subject, Ti is the binary treatment assignment indicator: 1 for the treated group and 0 for control, and xi = (xi1,…, xip)T ∈ ℝp is a p-dimensional covariate vector of mixed types.
The Neyman–Rubin causal model15,16,17 provides a way of finely calibrating the causal effect of treatment T on the response via the concept of potential outcomes. Let Y1 and Y0 denote the response values for a subject when assigned to the treated and the control group, respectively. Either Y1 or Y0, but not both, can be observed, which is the so-called ‘fundamental problem of causal inference’.18 The observed outcome is given by Y = T Y1 + (1 − T) Y0. Within this framework, the treatment effect can be evaluated at three levels: the population level δ = E(Y1 − Y0), referred to as the average treatment effect or ATE;18 the subpopulation level δ(A) = E(Y1 − Y0 | X ∈ A) for a subset A ⊂ ℝp; and the unit or subject level Y1 − Y0. These three levels form a hierarchy of causal inference in increasing order of strength, in the sense that ATE can be obtained from knowledge of subpopulation-level inferences, which in turn can be obtained from knowledge of unit-level inferences, but not vice versa. Let δ be a generic notation for treatment effect.
Definition 1
The individualized treatment effect (ITE) is defined as δ(x) = E(Y1 − Y0 | X = x).
Note that δ(x) is different from the (random) unit-level effect Y1 − Y0. Strictly speaking, δ(x) is a subpopulation-level effect among individuals with X = x. Nevertheless, δ(x) is the finest approximation to the unit-level effect that is practically available.
Causal inference is essentially concerned with estimating δ at different levels through the available data 𝒟. The difficulty in causal inference stems primarily from the convoluted roles (e.g., confounder, effect modifier or moderator, or mediator) played by each covariate in X. For experimental data from trials with random treatment assignment mechanisms, T is independent of other variables. As a result, the unconfoundedness condition17 (Y1, Y0) ⫫ T |X, being sufficient for obtaining population-level inference from 𝒟, is trivially met. Randomization renders the confounding issue of little concern; however, covariate modification to the treatment effects may remain at both subpopulation and unit levels, referred to as the treatment-by-covariate interactions.
2.1 | SSS for Identifying the Best Cutoff Point
IT1 seeks subgroups with heterogeneous treatment effects by following the paradigm of CART3; hence IT supplies causal inference at the subpopulation level. Nevertheless, results from IT can be building blocks for inferences at other levels: one has the flexibility to move backward to the ATE estimation by integration and move forward to the ITE estimation via ensemble learning. The main objective of this article is to examine the use of random forests of interaction trees (RFIT) in estimating δ(x). Random forests 13 (RF) are an ensemble learning method, constructing a collection of tree models and integrating results. Among its many merits, RF is an off-the-shelf method and a top performer in predictive modeling.19
To extend random forests on the basis of interaction trees, one essential ingredient is the splitting statistic. In CART, one splits data so that the difference in response between two child nodes is maximized or, equivalently, the within-node impurity or variation is minimized. In IT, data is split so that the difference in treatment effects between two child nodes is maximized. A split on data is induced by a binary variable of general form Δ = Δ(Xj; c) = I(Xj ≤ c) that applies a threshold on covariate Xj at cutoff point c. When Xj is nominal or categorical, one common strategy is to sort the variable levels according to the treatment effect estimate at each level and treat it as if ordinal. A theoretical justification for doing so can be found in Appendix A of Su et al.1
In our setting, any binary split results in the following 2×2 table, where n1L denotes the number of treated subjects in the left child node, ȳ1L denotes the sample mean response for treated subjects in the left child node, and so on for notation in the other cells.
| Treatment | Left Node | Right Node |
|---|---|---|
| 0 | (ȳ0L, n0L) | (ȳ0R, n0R) |
| 1 | (ȳ1L, n1L) | (ȳ1R, n1R) |
The splitting statistic in IT can be based on the Wald test of H0: β3 = 0 in the interaction model

yi = β0 + β1 Ti + β2 Δi + β3 Ti Δi + εi,  with εi ~ 𝒩(0, σ²),   (1)

where Δi = Δ(xij; c). The least squares estimate of β3 is β̂3 = (ȳ1L − ȳ0L) − (ȳ1R − ȳ0R), corresponding to the concept of ‘difference in differences’ (DID).20 The resultant Wald test statistic amounts to
Q(c) = β̂3² / {σ̂² (1/n1L + 1/n0L + 1/n1R + 1/n0R)},   (2)

where

σ̂² = {Σl∈{0,1} Σt∈{L,R} Σi∈cell(l,t) (yi − ȳlt)²} / (n − 4)   (3)

is the pooled estimator of σ². Q(c) measures the difference in treatment effects between the two child nodes. With the conventional greedy search (GS) approach, the best cutoff point ĉ for Xj is ĉ = argmaxc Q(c). It is worth noting that minimizing the least squares (LS) criterion under Model (1) does not serve well in IT: a cutoff point can yield the minimum LS criterion merely because of a strong additive effect associated with β2.
GS evaluates the splitting measure at every possible cutoff point for Xj. This can be slow when the number of cutoff points to be evaluated is large, even though GS can be implemented by updating the computation of Q(c) for neighboring c values. Furthermore, this discrete optimization procedure yields erratic fluctuations, as exemplified by the orange line in Figure 1(b). As a result, GS may mistakenly select a local spike due to large variation. These deficiencies motivate us to consider a smooth alternative to GS. Our idea is to approximate the threshold indicator function Δi, involved in many components of the splitting statistic, with a smooth sigmoid function. For this reason, we call the method ‘smooth sigmoid surrogate’ or SSS in short. While many sigmoid functions can be used, it is natural to consider the logistic or expit function
FIGURE 1.
Illustration of Smooth Sigmoid Surrogate (SSS) for Splitting Data: (a) the discrete threshold function Δ(x; c) = I(x ≥ c) with c = 0 and its expit approximation s(x; c) = expit{a(x − c)}; (b) the splitting statistic Q(c) computed at each cutoff point c in greedy search (GS) and its SSS approximations with a ∈ {1, 2,…, 100}. In panel (b), data of size n = 500 are generated from the model y = 0.5 + 0.5T + 0.5Δ + 0.5TΔ + ε, where Δ = Δ(x; c0) with true cutoff point c0 = 0 (indicated by the dashed vertical line) and both x and ε are from N(0, 1). The best cutoff point found by GS is marked by a triangle, while diamond dots indicate the best cutoff points found by SSS with different a values.
s(x; a, c) = expit{a(x − c)} = 1/[1 + exp{−a(x − c)}],   (4)

with shape or scale parameter a > 0. Figure 1(a) depicts the expit function for different a values, where c = 0 coincides with the mean of a standardized covariate.
To approximate Q(c), we start by approximating nlt with ñlt for l = 0, 1 and t ∈ {L, R} as follows:

ñ1L = Σi Ti si,  ñ0L = Σi (1 − Ti) si,  ñ1R = n1 − ñ1L,  and  ñ0R = n0 − ñ0L,

where si = s(xij; a, c) approximates Δi, n1 = Σi Ti is the total number of treated individuals, and n0 = n − n1 is the total number of untreated individuals. Let Slt denote the associated sum of observed response values in each cell. These can be approximated in a similar manner:

S̃1L = Σi Ti si yi,  S̃0L = Σi (1 − Ti) si yi,  S̃1R = S1 − S̃1L,  and  S̃0R = S0 − S̃0L,

where S1 = Σi Ti yi is the sum of response values for all treated individuals and similarly S0 for the untreated. Note that the quantities n1, n0 = n − n1, S1, and S0 = Σi yi − S1 do not involve the split variable Δi and can be computed beforehand. It follows that ȳlt = Slt/nlt ≈ S̃lt/ñlt = ỹlt for l = 0, 1 and t ∈ {L, R}. Next, substituting (ñlt, ỹlt) into (3) gives its approximation σ̃². Finally, plugging all the approximated quantities into Q(c) in (2) yields
Q̃(c) = β̃3² / {σ̃² (1/ñ1L + 1/ñ0L + 1/ñ1R + 1/ñ0R)},  where β̃3 = (ỹ1L − ỹ0L) − (ỹ1R − ỹ0R).   (5)

Now Q̃(c) is a smooth objective function of c alone and can be directly maximized to obtain the best cutoff point ĉ.
Besides c, there is a scale parameter a involved in Q̃(c) given by (5). As shown by simulation in Section 3, the performance of the SSS method is quite robust to the choice of a over a wide range of values; thus a can be fixed a priori. To do so, we standardize the predictor xij := (xij − x̄j)/σ̂j, where (x̄j, σ̂j) denote the sample mean and standard deviation of variable Xj, respectively. For standardized covariates, we recommend fixing a at a value in [10, 50]. With a fixed, the best cutoff point ĉ can be obtained by maximizing Q̃(c) with respect to c and then transformed back to the original data scale for interpretability. This one-dimensional smooth optimization problem can be conveniently solved by many standard optimization routines; we use the Brent21 method available in the R22 function optimize in our implementation. Given the nonconcave nature of the maximization problem, techniques such as multi-start or partitioning the search range may be combined with Brent's method to help locate the global optimum. However, as shown in our numerical studies, a plain application of Brent's method works quite effectively in estimating c.
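A corresponding sketch of SSS splitting, again in hypothetical Python rather than the authors' R code, replaces the indicator with the expit surrogate of Eq. (4) and hands the smooth objective of Eq. (5) to a bounded Brent-type optimizer (scipy's analogue of R's optimize). The orientation s ≈ I(x ≤ c), the soft pooled variance, and the inner-quantile search bounds are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def soft_indicator(x, c, a=10.0):
    """Expit surrogate of Eq. (4), oriented so s ~ I(x <= c); the
    paper's Figure 1 uses I(x >= c), which only swaps child labels."""
    return 1.0 / (1.0 + np.exp(a * (x - c)))

def q_tilde(c, y, t, x, a=10.0):
    """Smoothed splitting statistic Q~(c) of Eq. (5): plug the soft
    counts n~ and soft sums S~ into the Wald statistic of Eq. (2)."""
    s = soft_indicator(x, c, a)
    n1 = t.sum()
    n1L, n0L = (t * s).sum(), ((1 - t) * s).sum()
    n1R, n0R = n1 - n1L, (len(t) - n1) - n0L
    counts = np.array([n1L, n0L, n1R, n0R])
    if counts.min() < 2.0:
        return 0.0                              # near-empty soft cell
    S1L, S0L = (t * s * y).sum(), ((1 - t) * s * y).sum()
    y1L, y0L = S1L / n1L, S0L / n0L
    y1R = ((t * y).sum() - S1L) / n1R
    y0R = (((1 - t) * y).sum() - S0L) / n0R
    # soft pooled variance: weight squared residuals by cell membership
    resid2 = (t * s * (y - y1L) ** 2 + (1 - t) * s * (y - y0L) ** 2
              + t * (1 - s) * (y - y1R) ** 2
              + (1 - t) * (1 - s) * (y - y0R) ** 2).sum()
    sigma2 = resid2 / (len(y) - 4)
    b3 = (y1L - y0L) - (y1R - y0R)
    return b3 ** 2 / (sigma2 * (1.0 / counts).sum())

def sss_best_cut(y, t, x, a=10.0):
    """Standardize x, maximize Q~(c) with a bounded Brent-type routine,
    and map the cutoff back to the original scale."""
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd
    lo, hi = np.quantile(z, [0.05, 0.95])       # avoid near-empty nodes
    res = minimize_scalar(lambda c: -q_tilde(c, y, t, z, a),
                          bounds=(lo, hi), method="bounded")
    return float(res.x * sd + mu)
```

On the same kind of simulated split data used to illustrate greedy search, the smooth maximizer recovers a cutoff near the true change point with a handful of objective evaluations.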
SSS smooths out local spikes in GS splitting measures and hence helps identify the true cutoff point; see Figure 1(b) for an example. Additional simulation studies in the Supporting Web Materials show that SSS outperforms GS in estimating c when a true cutoff point exists. Another main advantage of SSS over GS is computational efficiency. The following proposition quantifies the asymptotic computational complexity of GS and SSS splitting.
Proposition 1
Consider a typical data set of size n in the interaction tree setting, where both GS and SSS are used to find the best cutoff point ĉ for a continuous predictor X with O(n) distinct values. In terms of computational complexity, GS is at best O{n ln(n)} with the updating scheme and O(n²) without it. Comparatively, SSS is O(kn), with k being the number of iterations in Brent's method.
A proof of Proposition 1 is relegated to the Supporting Web Materials. Implementation of tree methods benefits from incremental updating.3,23 We note that GS splitting with updating is commonly mistaken to be of order O(n). Updating the IT splitting statistic entails sorting the Y values according to the X values within each treatment group, and this sorting step dominates the algorithm asymptotically at a rate of O{n ln(n)}. Comparatively, SSS depends on k, the number of iterations in Brent's method. Although k is affected by the convergence criterion and the desired accuracy, it is generally small since Brent's method has guaranteed convergence at a superlinear rate. Based on our numerical experience, k rarely exceeds 15 even for large n. In other words, the O(kn) rate for SSS essentially amounts to the linear rate O(n). An empirical comparison of computing time between SSS and GS can be found in the Supporting Web Materials.
2.2 | Estimating ITE via RFIT
RFIT follows the standard paradigm of RF.13 Take a bootstrap sample 𝒟b from data 𝒟 and construct an IT 𝒯b using 𝒟b. To split a node, a subset of m covariates is randomly selected; the optimal split for each covariate is identified, and the candidates are compared to determine the best split of the data. This step is iterated until a large tree 𝒯b is grown. Each terminal node τ in 𝒯b is summarized by an estimated treatment effect δ̂τ, which is simply the difference in mean response between treated and untreated individuals falling into τ, i.e.,

δ̂τ = (1/n1τ) Σi: xi∈𝒟b∩τ Ti yi − (1/n0τ) Σi: xi∈𝒟b∩τ (1 − Ti) yi,

where n1τ = Σi: xi∈𝒟b∩τ Ti is the number of treated individuals in 𝒟b falling into τ, and similarly n0τ for the untreated.
The entire tree construction procedure is then repeated on B bootstrap samples, resulting in a sequence of bootstrap trees {𝒯b: b = 1, 2,…, B}. An individual with covariate vector x falls into one and only one terminal node τb(x) of 𝒯b. Denoting δ̂b(x) = δ̂τb(x), the ITE for this individual can then be estimated as

δ̂(x) = (1/B) Σb=1B δ̂b(x).   (6)
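To make Eq. (6) concrete, here is a deliberately simplified sketch in which each bootstrap ‘tree’ is a one-split stump on a single covariate, grown by maximizing the squared difference-in-differences over a fixed grid of quantile cutoffs; a real RFIT grows full interaction trees with random feature subsets at each node. All names and tuning choices here are illustrative.

```python
import numpy as np

def best_stump(y, t, x):
    """One bootstrap interaction 'tree' reduced to a single split: over
    a grid of quantile cutoffs, pick the one maximizing the squared
    difference-in-differences; return (cutoff, left effect, right effect)."""
    best, best_score = None, -np.inf
    for c in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left = x <= c
        cells = [left & (t == 1), left & (t == 0),
                 ~left & (t == 1), ~left & (t == 0)]
        if min(m.sum() for m in cells) < 5:     # keep all cells populated
            continue
        dL = y[cells[0]].mean() - y[cells[1]].mean()
        dR = y[cells[2]].mean() - y[cells[3]].mean()
        if (dL - dR) ** 2 > best_score:
            best_score, best = (dL - dR) ** 2, (c, dL, dR)
    return best

def rfit_ite(y, t, x, x_new, B=200, seed=0):
    """Eq. (6): average terminal-node effect estimates over B bootstrap
    stumps; a real RFIT averages over fully grown interaction trees."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(x_new)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap resample
        c, dL, dR = best_stump(y[idx], t[idx], x[idx])
        preds[b] = np.where(x_new <= c, dL, dR)
    return preds.mean(axis=0)
```

Even this stripped-down ensemble recovers a step-shaped δ(x) reasonably well, which is the behavior Eq. (6) relies on.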
Efron14 discusses methods for computing standard errors for bootstrap-based estimators and advocates the infinitesimal jackknife (IJ) as a general approach. The IJ is found preferable in random forests, as further explored by Wager, Hastie, and Efron.24 Proposition 2 applies the IJ method to obtain a standard error formula for the estimated ITE δ̂(x). Its proof is outlined in the Supporting Web Materials.
Proposition 2
The IJ estimate of the variance of δ̂(x) is given by

V̂ = Σi=1n Z̄i²,   (7)
where Z̄i = (1/B) Σb=1B Zbi and Zbi = (Nbi − 1){δ̂b(x) − δ̂(x)}, with Nbi being the number of times the i-th observation appears in the b-th bootstrap sample. In other words, the quantity Z̄i is the bootstrap covariance between Nbi and δ̂b(x). In practice, V̂ is biased upwards, especially for small or moderate B. A bias-corrected version is given by
V̂c = V̂ − (1/B²) Σi=1n Σb=1B (Zbi − Z̄i)².   (8)
Further assuming approximate independence of Nbi and δ̂b(x), another bias-corrected version is given by

V̂c′ = V̂ − (n/B²) Σb=1B {δ̂b(x) − δ̂(x)}²,   (9)
which is easier to compute than (8).
The validity of these SE formulas will be investigated by simulation in Section 3. The bias-corrected SE formulas in (8) and (9) generally yield very similar results, both outperforming the uncorrected version (7). Note that computing (8) entails evaluation of the matrix Z = (Zbi) at each different x. Therefore, the SE given in (9) is recommended for its enhanced computational efficiency.
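Once the bootstrap count matrix and the per-sample estimates are stored, the IJ computations in (7) and (9) reduce to simple matrix arithmetic. The sketch below is our own illustration for a single query point x; the bagged statistic it is checked against is the plain bootstrap mean, not an interaction-tree forest.

```python
import numpy as np

def ij_variance(N, preds):
    """Infinitesimal-jackknife variance of a bagged estimate at one
    query point. N is the (B, n) bootstrap count matrix with N[b, i]
    the number of times observation i appears in sample b; preds holds
    the B per-sample estimates delta_hat_b(x).
    Returns (V of Eq. (7), bias-corrected V of Eq. (9))."""
    B, n = N.shape
    centered = preds - preds.mean()
    # Zbar_i: bootstrap covariance between N_bi and delta_hat_b(x)
    zbar = (N - 1).T @ centered / B
    v = float((zbar ** 2).sum())                            # Eq. (7)
    v_corr = v - n / B ** 2 * float((centered ** 2).sum())  # Eq. (9)
    return v, v_corr
```

For the bagged sample mean, the corrected estimate should sit near the classical variance of the mean, while the uncorrected version carries the upward Monte Carlo bias discussed above.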
2.3 | Comparison with SR
Under the potential outcome framework, separate regression (SR) is an intuitive approach for estimating δ(x).4,12 SR builds a model for μ1(x) = E(Y1 | X = x) based on data from treated individuals only. This step essentially involves predictive modeling of the observed response Y on the covariates X using the treated-group data; random forests13 can be used for this purpose. Similarly, a model for μ0(x) = E(Y0 | X = x) is built using data from untreated individuals only. For an individual with covariate vector x, both models are applied to predict his or her mean potential outcomes. Let μ̂0(x) and μ̂1(x) denote the resultant estimates of μ0(x) and μ1(x), respectively. The ITE can then be estimated as
δ̃(x) = μ̂1(x) − μ̂0(x).   (10)
It is tempting to use the observed response Y of a treated (untreated) individual directly as an estimate of μ1 (respectively, μ0), but doing so is ill-advised owing to the potentially inflated variance.
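Eq. (10) only requires two arm-specific regressions followed by differencing. The sketch below substitutes a toy k-nearest-neighbour regressor for the random forests the paper uses, purely to keep the example self-contained; the function names and the choice of k are ours.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_new, k=25):
    """Toy stand-in for the per-arm predictive model (the paper uses
    random forests): k-nearest-neighbour mean in Euclidean distance."""
    d2 = ((X_new[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)

def separate_regression_ite(X, y, t, X_new, k=25):
    """Eq. (10): fit mu1 on the treated arm and mu0 on the control arm,
    then difference the two predictions at each query point."""
    mu1 = knn_predict(X[t == 1], y[t == 1], X_new, k)
    mu0 = knn_predict(X[t == 0], y[t == 0], X_new, k)
    return mu1 - mu0
```

Note how both arm models must capture the prognostic part of the mean, even though it cancels in the difference; this is exactly the extra burden on SR discussed next.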
We argue that RFIT is superior to SR, mainly because RFIT works on a simpler problem. To explain, consider the model form Y = μ0(x) + Tδ(x) + ε, where μ1(x) = μ0(x) + δ(x). The functions μ0(x) and δ(x) may involve different sets of covariates. In the clinical setting, covariates appearing in μ0(x) are called prognostic factors, while covariates appearing in δ(x) are called predictive factors.11 In other words, predictive factors interact with the treatment and hence cause differential treatment effects. In SR, both μ1(x) and μ0(x) must be estimated in order to estimate the difference δ(x); thus SR must take both prognostic and predictive factors into consideration. Comparatively, RFIT estimates δ(x) directly by focusing on predictive factors only, because a prognostic factor does not cause a difference in differences (referring to the splitting statistic in (2)). In the following, we introduce a performance measure for RFIT and SR in estimating δ(x) and attempt a theoretical understanding of it.
Both RFIT and SR take the bootstrap-based ensemble learning approach. The ITE estimates δ̂(x) in (6) and δ̃(x) in (10) involve randomness owing to bootstrap resampling, the current data 𝒟, and the point x at which the estimation is made. To compare RFIT with SR, we consider an averaged mean squared error (AMSE) measure defined by
AMSE = EX E𝒟 Eℬ {δ̂(X) − δ(X)}²,   (11)
where the expectation is taken with respect to the bootstrap distribution ℬ given the current data 𝒟, the sampling distribution of data 𝒟, and then the distribution of X.
Define
δ̄(x; 𝒟) = Eℬ{δ̂(x) | 𝒟}  and  δ̄(x) = E𝒟{δ̄(x; 𝒟)},   (12)
where δ̄(x;𝒟) is the RFIT estimate of δ(x) obtained with perfect bootstrap or B → ∞ and δ̄(x) is the perfect bootstrap RFIT estimate if, furthermore, we are allowed to recollect data 𝒟 freely. Similarly, we define {μ̄0(x;𝒟), μ̄0(x)} on the basis of μ̂0(x) and {μ̄1(x;𝒟), μ̄1(x)} on the basis of μ̂1(x) in SR. Proposition 3 provides a decomposition of the AMSE for the ITE estimate δ̂(x) by RFIT and for δ̃(x) by SR.
Proposition 3
For the RFIT estimate δ̂(x) in (6),
AMSE(δ̂) = EX E𝒟 Eℬ {δ̂(X) − δ̄(X; 𝒟)}² + EX E𝒟 {δ̄(X; 𝒟) − δ̄(X)}² + EX {δ̄(X) − δ(X)}².   (13)
For the SR estimate δ̃(x) in (10),
AMSE(δ̃) = Σl∈{0,1} [EX E𝒟 Eℬ {μ̂l(X) − μ̄l(X; 𝒟)}² + EX E𝒟 {μ̄l(X; 𝒟) − μ̄l(X)}² + EX {μ̄l(X) − μl(X)}²] − 2 EX [{μ̄1(X) − μ1(X)}{μ̄0(X) − μ0(X)}].   (14)
The first term of the AMSE in (13) corresponds to Monte Carlo variation resulting from using a finite number of B bootstrap samples. The second term represents the sampling variation owing to the lack of an endless supply of training data in reality. The third term is the bias. An analogous interpretation applies to the terms in (14), yet with an additional covariance term −2EX [{μ̄1(X) − μ1(X)}{μ̄0(X) − μ0(X)}]. It is worth noting that such a decomposition holds true for general bootstrap-based ensemble predictions.
Ensemble learners such as RF and bagging aim for variance reduction by imitating the endless supply of replicate data via bootstrap resampling. This is why we have the additional decomposition

EX E𝒟 Eℬ {δ̂(X) − δ̄(X)}² = EX E𝒟 Eℬ {δ̂(X) − δ̄(X; 𝒟)}² + EX E𝒟 {δ̄(X; 𝒟) − δ̄(X)}²

in (13); similarly for μ̂1(X) and μ̂0(X) in (14). However, ensemble learning has little effect on the bias term EX{δ̄(X) − δ(X)}² in (13); similarly for the two bias terms in (14), as well as the covariance term −2EX [{μ̄1(X) − μ1(X)}{μ̄0(X) − μ0(X)}]. The bias problem for ensemble learners such as random forests has been noted by Breiman25 and others. From another perspective, RF facilitates a smoothing procedure by averaging data over an adaptive neighborhood; as a result, it cuts the hills and fills the valleys.
While both RFIT and SR would suffer from certain bias, the AMSE in SR tends to be larger than that of RFIT in general as we shall demonstrate numerically in Section 3. Numerical evidence shows that SR is more prone to the bias problem because it tends to underestimate a large ITE and overestimate a small ITE. In fact, such a bias also has an effect on the last covariance term in (14). A large ITE δ(x) occurs when μ1(x) is large and/or μ0(x) is small. The smoothing effect yields μ̄1(X) − μ1(X) < 0 with cut hills and μ̄0(X) − μ0(X) > 0 with filled valleys. Thus {μ̄1(X) − μ1(X)}{μ̄0(X) − μ0(X)} tends to be negative. A similar observation holds for a small ITE, which occurs when μ1(X) is small and/or μ0(X) is large. As a result, the last term in (14) tends to be negative, leading to a more inflated AMSE for SR.
3 | SIMULATION STUDIES
This section presents results from simulation studies designed to compare RFIT with other methods in estimating the individualized treatment effects (ITE). We also investigate the standard error (SE) formulas for the ITE estimates by RFIT.
3.1 | Comparison in Estimating ITE
To compare RFIT with other methods, we generate data by the following scheme. First, simulate five (p = 5) predictors xj ~ uniform[0, 1] for j = 1,…, 5 with a common correlation ρ. This is achieved by simulating multivariate normal vectors with common correlation ρ′ = 2 sin(ρπ/6) and applying the probability integral transform.26 Two correlations ρ ∈ {0, 0.5} are considered. Then we generate Y0 = μ0(x) + α + ε0, where μ0(x) is a nonlinear polynomial mean function and α and ε0 independently follow a 𝒩(0, 1) distribution. Next, we generate Y1 = μ1(x) + α + ε1, where μ1(x) = μ0(x) + δ(x) and ε1 ~ 𝒩(0, 1) is independent of both α and ε0. The random effect term α is introduced to mimic common characteristics shared by repeated measures taken from the same subject. The unit-level effect equals δ(x) + (ε1 − ε0), where (ε1 − ε0) represents additional random error that cannot be accounted for by the covariates x. Six models (I)–(VI) are considered for the ITE δ(x), as tabulated below:
| Model | Form | var(μ0(X)) | var(δ(X)) | var(α + ε) |
|---|---|---|---|---|
| I | δ(x) = 5 | 1.009 | 0.000 | 1.996 |
| II | δ(x) = −5 + 5x1 + 5x2 | 1.017 | 4.183 | 2.002 |
| III | δ(x) = −5 + 5x4 + 5x5 | 1.002 | 4.201 | 1.998 |
| IV | δ(x) = −2 + 2I(x1 ≤ 0.5) + 2I(x2 ≤ 0.5) I(x3 ≤ 0.5) | 1.014 | 1.764 | 1.996 |
| V | δ(x) = −6 + 0.1 exp(4x1) + 4/[1 + exp{−20(x2 − 0.5)}] + 3x3 + 2x4 + x5 | 1.012 | 6.316 | 1.999 |
| VI | δ(x) = −10 + 10 sin(πx1x2) + 20(x3 − 0.5)² + 10x4 + 5x5 | 1.009 | 23.837 | 1.990 |
Model I is a null model in which the treatment has no heterogeneous effect. Models II and III exemplify a linear ITE; X1 and X2 are both prognostic and predictive in Model II, while Model III involves different sets of covariates as prognostic factors (X1, X2, X3) and predictive factors (X4, X5). Model IV represents a tree-structured model. Models V and VI are two nonlinear models derived from Friedman.27 Finally, we simulate the randomized treatment assignment variable T independently from a Bernoulli(0.5) distribution and hence obtain the observed response Y = T Y1 + (1 − T) Y0. Also provided in the above table are the empirical variances (based on 100,000 realizations) of the additive effect μ0(X), the ITE δ(X), and the error term α + ε. These variance values indicate the signal-to-noise ratio in each model.
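The covariate-generation step above, a Gaussian copula with latent correlation ρ′ = 2 sin(ρπ/6) followed by the probability integral transform, can be sketched as follows; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def correlated_uniforms(n, p, rho, rng):
    """Gaussian copula: draw latent normals with pairwise correlation
    rho' = 2*sin(rho*pi/6), then apply the probability integral
    transform Phi(z) so the uniforms have Pearson correlation rho."""
    rho_prime = 2.0 * np.sin(rho * np.pi / 6.0)
    cov = np.full((p, p), rho_prime)
    np.fill_diagonal(cov, 1.0)
    z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return norm.cdf(z)
```

The sine adjustment compensates for the attenuation of Pearson correlation under the normal-to-uniform transform, so the resulting uniforms carry the target correlation ρ.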
For each training data set 𝒟, four methods are used to learn a model on ITE: single interaction tree (IT) analysis, SIDES, RFIT, and separate regression (SR). In IT, 1 B = 30 bootstrap samples are used to determine the final tree structure. The default setting is used in SIDES. A total of B = 500 bootstrap samples are taken in RFIT and SR. In RFIT, we set a = 10 in SSS splitting. The number m of randomly chosen covariates to examine at each node splitting is set as m = 2 in both RFIT and SR. In order to evaluate performance, a test sample 𝒟′ of size n′ = 2000 is generated beforehand. The ITE models trained with different methods in each simulation are applied to estimate the ITE for 𝒟′ and a mean squared error (MSE) measure is computed. Two sample sizes n ∈ {100, 500} are considered for the training data 𝒟 and a total of 100 simulation runs are used for each simulation setting.
Figure 2 presents parallel boxplots of the MSE measures when the covariates {X1,…,X5} are independent (ρ = 0). The MSE averaged over 100 simulation runs is highlighted with blue bars in each boxplot, corresponding to estimates of the AMSE in (11). It can be seen that SIDES performs poorly in all scenarios: it is not suited to ITE estimation since it essentially splits the data into at most two groups, one subgroup of individuals with enhanced treatment effects and another formed by the remaining individuals. Both RFIT and SR outperform IT by a great deal except in the null case, Model I, indicating the advantage of ensemble learning over single-tree analysis. In comparison with SR, RFIT consistently tends to have smaller MSE values in nearly all scenarios except Model I, where SR slightly outperforms RFIT; in this null model, RFIT forces superfluous partitions while SR mainly accounts for the effects of prognostic factors. Again, the superiority of RFIT in non-null cases can be explained by the fact that it works on an easier task than SR by examining predictive factors only. The amount of outperformance varies, depending on factors such as the sample size, the signal strength, and the degree of nonlinearity. Additional results are relegated to the Supporting Web Materials, where numerical insight into the bias problem is provided by plotting the averaged ITE estimates δ̂(x) against the actual ITE δ(x). Having correlated covariates (ρ = 0.5) does not seem to affect the results much.
FIGURE 2.
Comparison of IT, SIDES, RFIT, and SR in Estimating ITE: The Independent (ρ = 0) Case. Parallel boxplots of MSE values are based on a test sample of size n′ = 2000 with 100 simulation runs. The middle bar indicates the average of the MSE measures.
3.2 | Standard Error Formulas
To investigate the validity and performance of the standard error (SE) formulas, we generate training data sets of size n = 500 from Model III and one test data set 𝒟′ of size n′ = 50. For each training data set 𝒟, RFIT is trained with B = 2,000 bootstrap samples and applied to estimate the ITE for each observation in 𝒟′, together with standard errors. We repeat the experiment for 200 simulation runs, yielding, for each observation in 𝒟′, 200 predicted ITE values δ̂ and 200 SEs. We compute the standard deviation (SD) of the ITE estimates δ̂ and average the SE values. If an SE formula works well, the averaged SE values should be close to their corresponding SD values.
Figure 3 plots the average SE versus SD for each observation in the test sample 𝒟′. It can be seen that the uncorrected standard errors are overly conservative. After bias correction, the average standard errors become reasonably close to the SD values. The bias-corrected SE presented here is computed from (9). The other version (8) that is somewhat harder to compute provides very similar results, which have been omitted from the plot.
FIGURE 3.
Plot of averaged standard errors (SE) versus sample standard deviation (SD) of predicted ITE δ̂(x) for n′ = 50 observations in a test sample. The SDs are computed based on 200 simulation runs, while the SEs are averaged over the 200 runs. In each simulation run, a training sample of size n = 500 is generated from Model III and a bootstrap size B = 2,000 is used to build RFIT. The bias-corrected and uncorrected SE averages for the same observation are connected by a line segment. The reference line is y = x.
We experimented with the other models in Section 3.1 and obtained similar results. One practical issue is the number B of bootstrap samples needed. According to Efron,14 a large B, e.g., B = 2000, is needed to guarantee the validity of IJ-based standard errors. We experimented with different B values. Generally speaking, the ITE estimates stabilize quickly even with a small B, e.g., B = 100; however, the bias-corrected variance estimates in both (8) and (9) can frequently turn negative when B is small or moderate, e.g., B = 500. Thus a large number of bootstrap samples is needed for the SE formulas to yield sensible results.
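For readers implementing such a check, a generic IJ variance with the Monte Carlo bias correction of Wager, Hastie, and Efron24 can be sketched as follows; the exact corrections in our (8) and (9) differ in detail, so this is only an illustrative stand-in:

```python
import numpy as np

def ij_variance(N, t):
    """Infinitesimal-jackknife variance of a bagged estimate at one query point.

    N : (B, n) matrix of bootstrap counts; N[b, i] = times observation i
        appears in bootstrap sample b
    t : (B,) per-tree predictions at the query point
    Returns (uncorrected, bias-corrected) variance estimates. The correction
    subtracts the Monte Carlo inflation term n/B^2 * sum_b (t_b - tbar)^2
    and can turn negative when B is small.
    """
    B, n = N.shape
    tc = t - t.mean()
    cov = (N - N.mean(axis=0)).T @ tc / B   # Cov_b(N_bi, t_b) for each i
    v_ij = float(np.sum(cov ** 2))
    v_bc = v_ij - n / B ** 2 * float(np.sum(tc ** 2))
    return v_ij, v_bc

# sanity check on the bagged *mean*, whose limiting IJ variance has the
# closed form sum_i (y_i - ybar)^2 / n^2
rng = np.random.default_rng(7)
n, B = 50, 2000
y = rng.normal(size=n)
N = rng.multinomial(n, np.full(n, 1.0 / n), size=B)  # bootstrap count vectors
t = N @ y / n                                        # "tree" b predicts its bootstrap mean
v_raw, v_bc = ij_variance(N, t)
```

With B = 2000 the corrected estimate sits close to the closed-form target; rerunning with B = 100 makes the negative-variance problem noted above easy to reproduce.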
4 | APPLICATION: ACUPUNCTURE TRIAL
For further illustration of RFIT, we consider data collected from an acupuncture headache trial,28,34 available at https://trialsjournal.biomedcentral.com/articles/10.1186/1745-6215-7-15. In this randomized study, 401 patients with chronic headache, predominantly migraine, were randomly assigned either to receive up to 12 acupuncture treatments over three months or to a control intervention offering usual care. Among many other measurements, the primary endpoint of the trial is the change in headache severity score from baseline to 12 months after study entry. Overall, the acupuncture treatment was concluded to be effective, lowering the headache score significantly more than the control intervention. More details of the trial and its results are reported in Vickers et al.28
To apply RFIT, we consider only the 301 participants who completed the trial. The response variable is the difference in headache severity score between baseline and 12 months, while the baseline score itself is treated as a covariate. Three subjects have some missing data, which are imputed with random forests (see the R package missForest 33). A total of 18 covariates are included in the analysis; these are demographic, medical, or treatment variables measured at baseline. See Table 1 for a brief variable description.
TABLE 1.
Variable Description for the Headache Data.
Name | Description |
---|---|
id | Patient ID code |
diff | Difference in headache severity score between one-year follow-up and baseline, i.e., (pk5 − pk1) |
group | Randomized treatment assignment: 0 control; 1 acupuncture |
age | Age |
sex | Sex: 0 male; 1 female |
migraine | Migraine: 0 No; 1 Yes |
chronicity | Chronicity |
pk1 | Severity score at baseline |
f1 | Headache frequency at baseline |
pf1 | Baseline SF36 (36-Item Short Form Health Survey) physical functioning |
rlp1 | Baseline SF36 role limitation physical |
rle1 | Baseline SF36 role limitation emotional |
ef1 | Baseline SF36 energy/fatigue |
ewb1 | Baseline SF36 emotional well-being |
sf1 | Baseline SF36 social functioning |
p1 | Baseline SF36 pain |
gen1 | Baseline SF36 general health |
hc1 | Baseline SF36 health change |
painmedspk1 | Medication Quantification Scale (MQS) at baseline |
prophmqs1 | MQS of prophylactic medication at baseline |
allmedsbaseline | Total MQS at baseline |
A total of B = 5000 trees are used to build RFIT, with the scale parameter set as a = 10 in SSS splitting. The ITE is estimated for each individual in the same data set, along with the bias-corrected IJ-based standard error (SE). Figure 4(a) provides a bar plot of the estimated ITE, plus and minus one SE, sorted by ITE. A majority (76.85%) of the estimated ITEs are above 0, indicating the effectiveness of acupuncture in achieving a greater reduction in headache severity score from baseline to Month 12 than the control. Overall, the treatment effects in this trial show some heterogeneity, but not much. It is interesting to note that the average ITE is 3.9; comparatively, the unadjusted effect of acupuncture (i.e., the mean difference between the acupuncture and control groups in headache severity score change from baseline to Month 12) is estimated as 6.5, while the adjusted effect from ANCOVA is 4.6, as reported in Table 2 of Vickers et al.28 Figure 4 also shows many individuals for whom the acupuncture treatment did not help much. Two individuals, the 44th (patient ID 222) and the 224th (patient ID 630), are noteworthy. Both are female migraine patients, aged 60 and 58, who were assigned to the control group yet achieved surprisingly large reductions of 36 and 29.75 in headache severity score, respectively; their baseline severity scores are also similar, 44.25 and 37. Their estimated ITEs turn out to be −14.81 and −9.09, suggesting a detrimental effect of acupuncture for them. Although these two patients are quite unusual relative to the rest, they may indicate a small subgroup worth further investigation. Figure 4(b) plots the estimated ITE by SR versus the estimated ITE by RFIT.
The least-squares fitted (red dashed) line almost overlaps with the reference (solid green) line y = x, indicating that the two methods provide similar ITE estimates in this example.
FIGURE 4.
RFIT Analysis of the Headache Data: (a) error bar plot of the estimated ITE ± SE and (b) estimated ITE by SR versus estimated ITE by RFIT. In Panel (a), individuals are ranked by estimated ITE; the horizontal line indicates the unadjusted average treatment effect of 6.5, i.e., the mean difference in headache severity score change from baseline to Month 12. In Panel (b), the solid line is the reference line y = x while the dashed line corresponds to the least-squares regression line.
5 | DISCUSSION
We have implemented random forests of interaction trees (RFIT) to tackle the problem of estimating individualized treatment effects (ITE). To this end, we have introduced smooth sigmoid surrogate (SSS) splitting to speed up RFIT and possibly improve its performance. We have also derived a standard error for the estimated ITE by applying the infinitesimal jackknife method. Altogether, RFIT provides enlightening results for deploying personalized medicine by informing a new patient of the potential effect of the treatment on him or her.
According to our numerical experiments, RFIT outperforms the separate regression (SR) approach for estimating ITE. SR estimates the two potential outcomes separately and then takes their difference. RFIT instead groups individuals so that those with similar treatment effects are placed together and estimates the treatment effect by taking differences within each group. Comparatively, RFIT focuses on predictive covariates and estimates the ITE directly, while SR has to deal with both prognostic and predictive covariates. Since SR serves as an intermediate step in other causal inference procedures, our method may contribute to their improvement as well.
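The SR baseline itself is easy to sketch with off-the-shelf forests; the snippet below (scikit-learn used for concreteness, not part of our implementation) fits the two potential-outcome surfaces separately and differences them, whereas RFIT splits on the treatment-by-covariate interaction directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def separate_regression_ite(X, y, trt, X_new):
    """SR baseline: fit each potential-outcome surface with its own forest,
    then take the difference of the two fits at the query points."""
    f1 = RandomForestRegressor(n_estimators=200, random_state=0)
    f0 = RandomForestRegressor(n_estimators=200, random_state=0)
    f1.fit(X[trt == 1], y[trt == 1])   # treated arm only
    f0.fit(X[trt == 0], y[trt == 0])   # control arm only
    return f1.predict(X_new) - f0.predict(X_new)

# toy check: one prognostic covariate, constant true treatment effect of 2
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(1000, 3))
trt = rng.integers(0, 2, size=1000)
y = X[:, 0] + 2.0 * trt + rng.normal(scale=0.5, size=1000)
ite_hat = separate_regression_ite(X, y, trt, rng.uniform(-1, 1, size=(200, 3)))
```

Note that each forest must model the prognostic term `X[:, 0]` before the difference cancels it, which illustrates why SR faces a harder estimation task than a method targeting the treatment contrast directly.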
To conclude, we identify several avenues for future research. First, our discussion has been restricted to data from randomized experiments; assessing treatment effects with observational data can be very different, entailing adjustment for potential confounders.29,30,31 Second, the standard error formula provides some assessment of the precision in estimating ITE; however, issues such as the consistency of RFIT, the asymptotic normality of the estimated ITE (see the comments in Efron14), and multiplicity have not yet been thoroughly addressed. Third, the current version of RFIT is not free of variable selection bias,7 and how to address this problem within the SSS approach awaits further investigation. Fourth, several other features of random forests, including variable importance ranking, partial dependence plots, and the proximity matrix,19 have yet to be explored for RFIT. Like random forests, RFIT is essentially a black-box tool for predicting ITE, although the IJ-based standard error supplies an additional reliability measure. This last direction of future research is closely related to extracting meaningful interpretations from RFIT. Specifically, the variable importance measure can sort out important effect modifiers of the treatment; the partial dependence plot can depict how a covariate modifies the treatment effect under the intertwined influences of other covariates; and the proximity matrix can identify a neighborhood of a future patient in terms of how similarly patients react to the treatment. Making these additional features of RF suitable for ITE assessment requires major modifications, which warrant future research.
Supplementary Material
Acknowledgments
XS was partially supported by NIMHD grant 2G12MD007592 from NIH; LL was partially supported by AHRQ grant HS 020263; RL was supported in part by NSF grant 1633130. The authors wish to thank the editor (Dr. Joel Greenhouse), the associate editor, and two anonymous reviewers, whose insightful and constructive comments played an important role in improving the paper.
Abbreviations
- IT: interaction trees
- ITE: individualized treatment effects
- RF: random forests
- SR: separate regression
- SSS: smooth sigmoid surrogate
References
- 1. Su X, Tsai CL, Wang H, Nickerson DM, Li B. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research. 2009;10:141–158.
- 2. Lipkovich I, Dmitrienko A, D’Agostino RB. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine. 2017;36(1):136–196.
- 3. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Belmont, CA: Wadsworth International Group; 1984.
- 4. Foster JC, Taylor JMC, Ruberg SJ. Subgroup identification from randomized clinical trial data. Statistics in Medicine. 2011;30(24):2867–2880.
- 5. Lipkovich I, Dmitrienko A, Denne J, Enas G. Subgroup identification based on differential effect search (SIDES): a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine. 2011;30(21):2601–2621.
- 6. Dusseldorp E, van Mechelen I. Qualitative interaction trees: a tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine. 2014;33(2):219–237.
- 7. Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine. 2015;34(11):1818–1833.
- 8. Murphy SA. Optimal dynamic treatment regimes (with discussion). Journal of the Royal Statistical Society, Series B. 2003;65(2):331–366.
- 9. Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E. Estimating optimal treatment regimes from a classification perspective. STAT. 2012;1(1):103–114.
- 10. Laber EB, Zhao Y. Tree-based methods for individualized treatment regimes. Biometrika. 2015;102(3):501–514.
- 11. Ballman KV. Biomarker: predictive or prognostic? Journal of Clinical Oncology. 2015;33(33):3968–3971.
- 12. van der Laan M, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1).
- 13. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
- 14. Efron B. Estimation and accuracy after model selection (with discussion). Journal of the American Statistical Association. 2014;109(507):991–1007.
- 15. Neyman J. On the application of probability theory to agricultural experiments. Essay on principles, Section 9 (1923). Statistical Science. 1990;5(4):465–472. Translated by Dabrowska DM and Speed TP.
- 16. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66(5):688–701.
- 17. Rubin DB. Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association. 2005;100(469):322–331.
- 18. Holland PW. Statistics and causal inference. Journal of the American Statistical Association. 1986;81(396):945–960.
- 19. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
- 20. Abadie A. Semiparametric difference-in-differences estimators. Review of Economic Studies. 2005;72(1):1–19.
- 21. Brent R. Algorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice-Hall; 1973.
- 22. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2017. https://www.R-project.org/
- 23. LeBlanc M, Crowley J. Survival trees by goodness of split. Journal of the American Statistical Association. 1993;88(422):457–467.
- 24. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. Journal of Machine Learning Research. 2014;15:1625–1651.
- 25. Breiman L. Using adaptive bagging to debias regressions. Technical Report 547, Department of Statistics, University of California at Berkeley; 1999.
- 26. Gilli M, Maringer D, Schumann E. Numerical Methods and Optimization in Finance. Elsevier; 2011.
- 27. Friedman JH. Multivariate adaptive regression splines. Annals of Statistics. 1991;19(1):1–67.
- 28. Vickers AJ, Rees RW, Zollman CE, McCarney R, Smith C, Ellis N, Fisher P, Van Haselen R. Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. British Medical Journal. 2004;328(7442):744.
- 29. Su X, Kang J, Fan J, Levine R, Yan X. Facilitating score and causal inference trees for large observational data. Journal of Machine Learning Research. 2012;13:2955–2994.
- 30. Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. 2017; published online.
- 31. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
- 32. Efron B. The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38. Philadelphia, PA: SIAM; 1982.
- 33. Stekhoven DJ, Buehlmann P. missForest – nonparametric missing value imputation for mixed-type data. Bioinformatics. 2012;28:112–118.
- 34. Vickers AJ. Whose data set is it anyway? Sharing raw data from randomized trials. Trials. 2006;7:15.