Author manuscript; available in PMC: 2016 Feb 26.
Published in final edited form as: Biometrics. 2014 Sep 16; 71(1):267–271. doi: 10.1111/biom.12228

Reader Reaction to “A Robust Method for Estimating Optimal Treatment Regimes” by Zhang et al. (2012)

Jeremy M G Taylor 1,*, Wenting Cheng 1, Jared C Foster 1
PMCID: PMC4768908  NIHMSID: NIHMS759889  PMID: 25228049

Summary

A recent paper (Zhang et al., 2012) compares regression-based and inverse probability weighted methods of estimating an optimal treatment regime and shows, for a small number of covariates, that the inverse probability weighted methods are more robust to model misspecification than the regression methods. We demonstrate that using models that fit the data better reduces the concern about non-robustness of the regression methods. We extend the simulation study of Zhang et al. (2012), also considering the situation of a larger number of covariates, and show that incorporating random forests into both the regression-based and inverse probability weighted methods improves their properties.

Keywords: Optimal Treatment Regime, Random Forests

1. Introduction

In an excellent article on estimating an optimal treatment regime (Zhang et al., 2012), the authors consider the following situation: n subjects in a study, each of whom is in either the treatment (A = 1) or the control (A = 0) group. Each subject has p baseline covariates X = (X1, …, Xp), and higher values of the continuous outcome measure Y are better. A treatment regime g(X) is a function from X to {0, 1}, such that a patient should receive A = 1 if g(X) = 1 and A = 0 if g(X) = 0. The value of g(X) is determined by whether $\eta_0 + \sum_{j=1}^{p} \eta_j X_j$ is positive or not. The goal is to find the optimal treatment regime. Both a randomized trial and an observational study setting were considered. The authors develop and compare different approaches. One is a regression approach (RG), which requires a model for μ(A, X) = E(Y|A, X). Other approaches are based on inverse probability weighted estimators (IPWE). The standard IPWE does not require a model for μ(A, X), but does require a model for P(A = 1|X). The authors extend the IPWE to an Augmented Inverse Probability Weighted Estimator (AIPWE), which requires models for both μ(A, X) and P(A = 1|X). The AIPWE gives a gain in efficiency relative to the IPWE and has a double robustness property. In their simulation study, the RG method was the best if the model for μ(A, X) was correctly specified, but was not robust to misspecification of μ(A, X). With correct specification of μ(A, X), the AIPWE method was not quite as efficient as RG.

For the misspecified model for μ(A, X), residual plots would immediately reveal that the model provides a poor fit to the data. In this paper we examine the relative merits of RG, IPWE and AIPWE when one uses a model for μ(A, X) that better fits the data.

2. Review Of Methods

Let Y(g) be the response for a patient who follows regime g. For a randomly chosen patient from a population, the expected response if regime g(X) is followed is given by $E\{Y(g)\} = E_X[\mu(1, X)g(X) + \mu(0, X)\{1 - g(X)\}]$. The optimal treatment regime is gopt(X) = I{μ(1, X) > μ(0, X)}. Let ĝ denote an estimated regime that is derived from a dataset.

Denote by Q(g) the average value of the expected response for subjects in a future population of very large size N if regime g were to be used, where Q(g) is given by

$$Q(g) = \frac{1}{N} \sum_{i=1}^{N} \left[ \mu(1, X_i)\, g(X_i) + \mu(0, X_i)\{1 - g(X_i)\} \right] \qquad (1)$$

Larger values of Q(g) are better. Thus, when μ(A, X) is known, the success of different methods for estimating g can be based on Q(ĝ) and also compared to Q(gopt).
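To make this concrete, the following is a minimal R sketch (not from the paper) of how Q(g) in equation (1) could be evaluated when μ(A, X) is known; the arguments mu and g are hypothetical stand-ins for the true outcome model and a candidate regime, each supplied as a function.

    ## Evaluate Q(g) of equation (1) for a known mu(A, X) and a candidate regime g,
    ## using the covariates X (an N x p matrix) of a large future population.
    Q_value <- function(X, mu, g) {
      gx <- g(X)                                  # 0/1 treatment decisions for each subject
      mean(mu(1, X) * gx + mu(0, X) * (1 - gx))   # average expected response under regime g
    }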

2.1 Regression Method

The RG method is to posit a parametric regression model μ(A, X) = μ(A, X; β), estimate β from the data, and then set $\hat{g}^{opt}_{reg}(X) = I\{\mu(1, X; \hat\beta) > \mu(0, X; \hat\beta)\}$. Below we will also consider alternative nonparametric regression models for μ(A, X).
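As an illustration, here is a minimal R sketch of the RG rule with a posited linear working model; the data frame dat, with outcome Y, treatment A and covariates X1, X2, X3, is an assumption of this sketch and is not part of the original paper.

    ## RG method with a posited linear working model for mu(A, X):
    ## fit the model, predict under A = 1 and A = 0, and treat when mu(1, X) > mu(0, X).
    fit <- lm(Y ~ (X1 + X2 + X3) * A, data = dat)
    mu1 <- predict(fit, transform(dat, A = 1))    # fitted mu(1, X; beta-hat)
    mu0 <- predict(fit, transform(dat, A = 0))    # fitted mu(0, X; beta-hat)
    g_reg <- as.numeric(mu1 > mu0)                # estimated rule I{mu(1, X) > mu(0, X)}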

2.2 Inverse Probability Weighted Estimators

For the IPWE method, a parametric form for g(X) = g(X; η) is specified. For fixed η, define $C_{\eta,i} = A_i g(X_i; \eta) + (1 - A_i)\{1 - g(X_i; \eta)\}$ and π(X) = P(A = 1|X). The estimated population average outcome is $\frac{1}{n}\sum_{i=1}^{n} C_{\eta,i} Y_i / \pi_C(X_i)$, where $\pi_C(X_i) = \pi(X_i)^{A_i}\{1 - \pi(X_i)\}^{1 - A_i}$; this is maximized over η to give $\hat{g}^{opt}_{IPW}(X) = g(X; \hat\eta)$. For a randomized study, the propensity score π(X) is estimated by the sample proportion assigned to treatment 1, which will be close to 0.5. For a non-randomized study, logistic regression is used to estimate π(X).

For the AIPWE method, η is obtained by maximizing

$$AIPWE(\eta) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{C_{\eta,i} Y_i}{\pi_C(X_i)} - \frac{C_{\eta,i} - \pi_C(X_i)}{\pi_C(X_i)}\, m(X_i; \eta, \hat\beta)\right] \qquad (2)$$

over η, where $m(X; \eta, \beta) = \mu(1, X; \beta)g(X; \eta) + \mu(0, X; \beta)\{1 - g(X; \eta)\}$.
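To make the objective concrete, here is a minimal R sketch of the IPWE and AIPWE criteria for a linear rule g(X; η) = I(η0 + η'X > 0); the inputs pihat, mu1hat and mu0hat (fitted values from the propensity and outcome models) are assumptions of this sketch.

    ## IPWE/AIPWE objective for a linear rule g(X; eta) = I(eta[1] + X %*% eta[-1] > 0).
    ## Y, A: outcome and treatment; X: n x p covariate matrix;
    ## pihat: fitted P(A = 1 | X); mu1hat, mu0hat: fitted mu(1, X) and mu(0, X).
    aipwe_obj <- function(eta, Y, A, X, pihat, mu1hat, mu0hat, augment = TRUE) {
      g   <- as.numeric(eta[1] + X %*% eta[-1] > 0)   # candidate regime
      C   <- A * g + (1 - A) * (1 - g)                # C_eta: treatment received agrees with regime
      piC <- pihat^A * (1 - pihat)^(1 - A)            # pi_C(X)
      ipw <- C * Y / piC                              # IPWE terms
      if (!augment) return(mean(ipw))                 # plain IPWE criterion
      m <- mu1hat * g + mu0hat * (1 - g)              # m(X; eta, beta-hat)
      mean(ipw - (C - piC) / piC * m)                 # AIPWE(eta) of equation (2)
    }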

Zhang et al. (2012) also considered the consistency properties of, and the calculation of standard errors for, η̂; we will not consider these in the current paper.

3. Simulation Study

In the simulation study in Zhang et al (2012), in Table 8 of the Supplementary Materials, data were generated from a true model Yi = μ(Ai,Xi) + ei, where ei~N(0, 1) and

$$\mu(A, X) = \exp\{2.0 - 1.5X_1^2 - 1.5X_2^2 + 3.0X_1X_2 + A(0.1 - X_1 + X_2 + 0.2X_3)\},$$

where Xi1 and Xi2 were U(−1.5, 1.5) and Xi3 and Ai were Bern(0.5). For this model gopt(X) = I(0.1 − X1 + X2 + 0.2X3 > 0). They considered two parametric regression models for μ(A, X), a correctly specified model of the form

$$\mu_t(A, X; \beta) = \exp\{\beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + \beta_3 X_1X_2 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2 + \beta_7 X_3)\} \qquad (3)$$

and a misspecified simple linear model of the form

$$\mu_{msl}(A, X; \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2 + \beta_7 X_3). \qquad (4)$$

From standard residual plots it is obvious that the misspecified model gives a very bad fit to the data, and it would not be seriously entertained, particularly for the RG method. Inspection of the data suggests that some transformation of the response Y may lead to an improved fit. Although log(Y) might appear to be a natural choice, it is not possible because a small fraction of the Y's are negative; we therefore choose Y^{1/3} as an approximation. The question is then whether the results would improve if one used a better fitting model for Y in both the RG and AIPWE methods. We consider two parametric models, as well as a non-parametric estimator. The first misspecified parametric model recognizes the benefit of a transformation, and the second also recognizes the need for quadratic terms and interactions. In these models we develop predictions for Z = Y^{1/3}, and then cube these predictions of Z to obtain predictions of Y. The simple misspecified cube root model is given by

$$E(Z) = \mu_{ms33}(A, X; \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2 + \beta_7 X_3), \qquad (5)$$

and the misspecified complex cube root model is given by

$$\mu_{mc33}(A, X; \beta) = \mu_{ms33}(A, X; \beta) + \sum_{j=1}^{2}\beta_{j+7} X_j^2 + \beta_{10} X_1X_2 + \beta_{11} X_1X_3 + \beta_{12} X_2X_3. \qquad (6)$$

Standard model assessment methods would still detect some lack of fit for μmc33, although it is a noticeable improvement over μmsl and μms33.
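As an illustration of this step, here is a minimal R sketch of fitting the cube-root models (5) and (6) and converting predictions back to the Y scale; the data frame dat with columns Y, A, X1, X2, X3 is an assumption of this sketch.

    ## Fit the cube-root working models (5) and (6) for Z = Y^(1/3) and cube the
    ## predictions to return to the Y scale.
    dat$Z    <- sign(dat$Y) * abs(dat$Y)^(1/3)       # real cube root, valid for negative Y
    fit_ms33 <- lm(Z ~ (X1 + X2 + X3) * A, data = dat)                                  # model (5)
    fit_mc33 <- update(fit_ms33, . ~ . + I(X1^2) + I(X2^2) + X1:X2 + X1:X3 + X2:X3)     # model (6)
    mu_mc33  <- function(a, newdat) predict(fit_mc33, transform(newdat, A = a))^3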

In the other approach, we used random forests as a non-parametric estimator of μ(A, X), and denote the estimate by μ̂(A, X). The RGrf method consists of maximizing

$$\frac{1}{n}\sum_{i=1}^{n}\left[\hat\mu(1, X_i)\, g(X_i; \eta) + \hat\mu(0, X_i)\{1 - g(X_i; \eta)\}\right] \qquad (7)$$

with respect to η. While we present results for random forests, other non-parametric estimators could be considered. To implement random forests, with Y^{1/3} as the response, we used the function randomForest in R, with default settings except that the number of trees was 1000. Similar to previous work (Foster et al., 2011), we found that the performance of random forests was improved by using A, Xk, Xk^2, Xk I(A = 1) and Xk I(A = 0) for k = 1 to p as input covariates. We note that random forests with Y^{1/3} as the response gave a very mild improvement over random forests with Y as the response.
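As an illustration, here is a minimal R sketch of this fit using the randomForest package with the augmented inputs described above and Z = Y^{1/3} as the response; the data frame dat and the helper design() are assumptions of this sketch, and settings other than ntree = 1000 are left at their defaults.

    ## Random forest estimate of mu(A, X) on the cube-root scale, with augmented
    ## input covariates A, X_k, X_k^2, X_k I(A = 1) and X_k I(A = 0).
    library(randomForest)
    design <- function(A, X) {
      D <- cbind(A, X, X^2, X * (A == 1), X * (A == 0))
      colnames(D) <- c("A", colnames(X), paste0(colnames(X), "_sq"),
                       paste0(colnames(X), "_trt"), paste0(colnames(X), "_ctl"))
      D
    }
    X  <- as.matrix(dat[, c("X1", "X2", "X3")])
    Z  <- sign(dat$Y) * abs(dat$Y)^(1/3)                        # real cube root, valid for negative Y
    rf <- randomForest(x = design(dat$A, X), y = Z, ntree = 1000)
    ## Predictions on the cube-root scale, cubed back to the Y scale:
    mu_hat <- function(a, Xnew) predict(rf, design(rep(a, nrow(Xnew)), Xnew))^3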

To fit the linear models in equations 4, 5 and 6, the R function lm was used. To fit the non-linear model in equation 3, the R function nlsLM (from the minpack.lm package) was used. To maximize the criteria in equations 2 and 7, we used the R function genoud (from the rgenoud package), as described in Zhang et al. (2012).
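A minimal sketch of this maximization step with genoud is given below, reusing the aipwe_obj function sketched in Section 2.2; the objects Y, A, X, pihat, mu1hat and mu0hat are assumed to be available as in that sketch, and the settings shown (bounds, population size) are illustrative assumptions, not the values used in the paper.

    ## Maximize the AIPWE criterion (2) over eta with the genetic optimizer genoud.
    library(rgenoud)
    obj <- function(eta) aipwe_obj(eta, Y, A, X, pihat, mu1hat, mu0hat)
    p   <- ncol(X)
    out <- genoud(fn = obj, nvars = p + 1, max = TRUE,
                  Domains = cbind(rep(-1, p + 1), rep(1, p + 1)),  # eta is identified only up to scale
                  pop.size = 1000, print.level = 0)
    eta_hat <- out$par
    g_hat   <- function(Xnew) as.numeric(eta_hat[1] + Xnew %*% eta_hat[-1] > 0)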

3.1 Results For Three Covariates

In our simulation study, 1000 datasets, each of size 500, were generated. We report in Table 1 two quantities: the average of the ratio Q(ĝ)/Q(gopt) and the average fraction who would be treated if, following each trial, ĝ were to be used. For each of the 1000 datasets, Q(ĝ) and Q(gopt) were calculated from equation 1, with N = 1,000,000.
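For one replicate, this evaluation could proceed as in the following sketch, which follows the Case A data-generating model described above and reuses the Q_value function sketched in Section 2; g_hat is any estimated regime expressed as a function of the covariate matrix (for example the rule returned by the genoud sketch above), and the population size is reduced here only to keep the example light.

    ## Generate one Case A dataset (n = 500) and evaluate an estimated rule g_hat
    ## against the optimal rule, using an independent large population of covariates.
    mu_true <- function(a, X) {
      exp(2.0 - 1.5 * X[, 1]^2 - 1.5 * X[, 2]^2 + 3.0 * X[, 1] * X[, 2] +
            a * (0.1 - X[, 1] + X[, 2] + 0.2 * X[, 3]))
    }
    gen_case_A <- function(n) {
      X <- cbind(X1 = runif(n, -1.5, 1.5), X2 = runif(n, -1.5, 1.5), X3 = rbinom(n, 1, 0.5))
      A <- rbinom(n, 1, 0.5)
      data.frame(X, A = A, Y = mu_true(A, X) + rnorm(n))
    }
    dat   <- gen_case_A(500)                                     # one simulated trial
    g_opt <- function(X) as.numeric(0.1 - X[, 1] + X[, 2] + 0.2 * X[, 3] > 0)
    Xpop  <- as.matrix(gen_case_A(1e5)[, c("X1", "X2", "X3")])   # future population (paper uses 10^6)
    ratio <- Q_value(Xpop, mu_true, g_hat) / Q_value(Xpop, mu_true, g_opt)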

Table 1.

Simulation results, randomized studies. Ratio to optimal is Q(ĝ)/Q(gopt). Fraction treated denotes the fraction that would be treated in a future population if regime ĝ was followed. Case A: 3 independent covariates. Optimal fraction treated = 0.5. Case B: 3 independent covariates. Optimal fraction treated = 0.902. Case C: 3 independent covariates. Optimal g includes interaction. Optimal fraction treated = 0.5. Case D: 15 independent covariates. Optimal fraction treated = 0.500. The subscripts denote the model that was used for estimating μ(A, X): t = true, msl = misspecified simple linear, ms33 = misspecified simple with Y^{1/3}, mc33 = misspecified complex with Y^{1/3}, rf = random forest.

              Case A (p=3, indep)      Case B (p=3, indep)      Case C (p=3, interaction)   Case D (p=15, indep)
Method        Ratio to    Fraction     Ratio to    Fraction     Ratio to    Fraction        Ratio to    Fraction
              optimal     treated      optimal     treated      optimal     treated         optimal     treated

Assuming form of true model is known
RGt           1.000       0.50         0.999       0.90         1.000       0.50            1.000       0.50
AIPWEt        0.997       0.50         0.995       0.90         0.996       0.51            0.997       0.50

Form of model is unknown
RGmsl         0.925       0.67         0.946       0.93         0.922       0.63            0.893       0.62
RGms33        0.927       0.58         0.936       0.91         0.924       0.56            0.883       0.55
RGmc33        0.948       0.58         0.936       0.91         0.942       0.57            --          --
RGrf          0.990       0.53         0.990       0.92         0.977       0.53            0.904       0.68
IPWE          0.971       0.49         0.970       0.89         0.956       0.50            0.879       0.58
AIPWEmsl      0.984       0.50         0.977       0.87         0.965       0.52            0.884       0.62
AIPWEms33     0.978       0.49         0.970       0.90         0.963       0.50            0.885       0.59
AIPWEmc33     0.989       0.49         0.966       0.91         0.976       0.50            --          --
AIPWErf       0.990       0.50         0.992       0.89         0.976       0.50            0.896       0.60

(--: μmc33 was not fitted for Case D; see Section 3.2.)

The 2nd and 3rd columns show the results for three covariates, labeled as Case A, matching the situation considered in Zhang et al. (2012). For this setting Q(g = 0) = 3.02, Q(g = 1) = 3.48, and Q(gopt) = 3.95. The first two rows are for ideal methods, not applicable in practice, in which the structure of the true model for μ(A, X) is known. Of these two, RGt is slightly better than AIPWEt, achieving the desired values of 1 and 0.5 for the ratio to optimal and the fraction treated, respectively. Amongst the applicable methods, RGmc33 generally improves on RGmsl and RGms33, and RGrf is much better than both. The AIPWE methods are all preferable to IPWE, with AIPWErf being the best.

Also of note is that the inverse probability methods tend to recommend treating a fraction of patients closer to the true 50% than the regression methods do. The regression methods tend to include too many subjects in the region ĝ(X) = 1.

In the 4th and 5th columns of the table we show results for Case B, a situation in which the optimal regime is that 90% of the subjects should be treated. The data were generated from the model $\mu(A, X) = \exp\{2.0 - 1.5X_1^2 - 1.5X_2^2 + 3.0X_1X_2 + A(0.1 - X_1 + X_2 + 0.2X_3)\}$, where Xi1 was U(−1.5, 1.5), Xi2 was U(0.2, 3), Xi3 was Bern(0.6) and Ai was Bern(0.5). For this setting Q(g = 0) = 1.66, Q(g = 1) = 2.51, and Q(gopt) = 2.64.

We considered four parametric outcome regression models. The first one is μt(A, X; β), the correct nonlinear regression model, as given in equation 3; the other three were the misspecified models μmsl(A, X; β), μms33(A, X; β) and μmc33(A, X; β), as given in equations 4, 5 and 6. The results again show the benefit of using random forests to estimate μ(A, X) in both the RG and AIPWE methods, and that again RGrf has similar performance to AIPWErf.

We also considered a case similar to Case A, but in which the covariates were correlated. The results were very similar to the uncorrelated case and are not presented here.

In the 6th and 7th columns of the table we show results for Case C, a situation where the optimal g is not determined by a linear combination of the covariates. The data were generated from the model $\mu(A, X) = \exp\{2.0 - 1.5X_1^2 - 1.5X_2^2 + 3.0X_1X_2 + A(0.1 - X_1 + X_2 + 0.2X_3 + 0.5X_1X_3)\}$, where Xi1 and Xi2 were U(−1.5, 1.5), and Xi3 and Ai were Bern(0.5). For this setting Q(g = 0) = 3.02, Q(g = 1) = 3.49, and Q(gopt) = 3.99. The optimal g(X) is I(0.1 − X1 + X2 + 0.2X3 + 0.5X1X3 > 0).

We considered four parametric outcome regression models. The first one is μt(A,X; β)

$$\mu_t(A, X; \beta) = \exp\{\beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + \beta_3 X_1X_2 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2 + \beta_7 X_3 + \beta_8 X_1X_3)\} \qquad (8)$$

corresponding to the correct nonlinear regression model; the other three were μmsl(A, X; β), μms33(A, X; β) and μmc33(A, X; β), as given in equations 4, 5 and 6.

The results are similar to those for Case A: for the RG methods there is a mild improvement from using the complex parametric model and a substantial improvement from using random forests. The results again show the benefit of using random forests to estimate μ(A, X) in the AIPWE methods. The fact that the optimal g is not within the class of regimes being estimated does not seem to have negatively impacted the performance of the methods.

3.2 Results For Fifteen Covariates

The results above are for a small number of covariates (three). With a larger number of covariates, the task of building models for μ(A, X) is more challenging. Fitting parametric models with many quadratic terms and interactions is not feasible, or would require variable selection. The ability of non-parametric regression methods, such as random forests, to give reliable predictions decreases with increasing p. The performance of the AIPWE methods is also likely to deteriorate with larger p, because the maximization in equation 2 will give poorer estimates of η. To investigate this, we considered a situation with 15 covariates, where the true model for μ(A, X) was

$$\mu(A, X) = \exp\{2.0 - 1.5X_1^2 - 1.5X_2^2 + 3.0X_1X_2 + A(0.1 - X_1 + X_2 + 0.2X_3)\},$$

with corresponding gopt(X) = I(0.1 − X1 + X2 + 0.2X3 > 0), and where X1 and X2 ~ U(−1.5, 1.5), X3 ~ Bern(0.5), X4, X5, X7, X8, X10, X11, X13, X14 ~ U(−1.5, 1.5) and X6, X9, X12, X15 ~ Bern(0.5). For this setting Q(g = 0) = 3.02, Q(g = 1) = 3.48, and Q(gopt) = 3.95. The linear combination that determines the estimated g could include all 15 variables.

We considered three possible parametric outcome regression models. The first one was $\mu_t(A, X; \beta) = \exp\{\beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + \beta_3 X_1X_2 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2 + \beta_7 X_3)\}$, which corresponds to the correct nonlinear regression model; the second one was $\mu_{msl}(A, X; \beta) = \beta_0 + \sum_{j=1}^{15}\beta_j X_j + A(\beta_{16} + \sum_{j=1}^{15}\beta_{j+16} X_j)$. The third one was μms33(A, X; β), which is the same as μmsl(A, X; β) except that the response is Y^{1/3}. Fitting μmc33(A, X; β) was not feasible in this case. The RGrf and AIPWE methods were also implemented.

The results for Case D, given in the 8th and 9th columns, differ from those of Case A. Here the RG method with the simple misspecified linear model has properties as good as those of AIPWE using this misspecified model, and better than those of the IPWE method. Again we see that both the RG and AIPWE methods are improved by the use of random forests. The general performance of all the methods is clearly worse when there are more covariates.

3.3 Results For Non-randomized Trial Setting

For this situation the RG methods are unchanged, but the IPWE and AIPWE methods require formulating and fitting an additional model for P(A = 1|X).

In the first simulation scenario presented by Zhang et al. (2012), data were generated from a true model of the form Yi = μ(Ai, Xi) + εi, where εi ~ N(0, 1) and

$$\mu(A, X) = \exp\{2.0 - 1.5X_1^2 - 1.5X_2^2 + 3.0X_1X_2 + A(0.1 - X_1 + X_2)\} \qquad (9)$$

where X1 and X2 were independent U(−1.5, 1.5). The treatment group indicator Ai was determined by the model $\text{logit}\{P(A = 1|X)\} = 0.1 + 0.8X_1^2 + 0.8X_2^2$.

For model (9) gopt(X) = I(0.1 − X1 + X2 > 0), and E{Y(gopt)} = 3.71. Two regression models for μ(A, X) were considered, a correctly specified model of the form

$$\mu_t(A, X; \beta) = \exp\{\beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + \beta_3 X_1X_2 + A(\beta_4 + \beta_5 X_1 + \beta_6 X_2)\}$$

and a misspecified simple linear model of the form

$$\mu_{msl}(A, X; \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + A(\beta_3 + \beta_4 X_1 + \beta_5 X_2).$$

The misspecified model μms33 is E(Z) = β0 + β1X1 + β2X2 + A(β3 + β4X1 + β5X2), and the misspecified model μmc33 is

$$E(Z) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + A(\beta_3 + \beta_4 X_1 + \beta_5 X_2) + \sum_{j=1}^{2}\beta_{j+5} X_j^2 + \beta_8 X_1X_2,$$

where Z = Y^{1/3}. The rf method applies the random forest approach; the input covariates used were A, Xk, Xk^2, Xk I(A = 1) and Xk I(A = 0) for k = 1, 2, and the response variable was Z.

The propensity score model for P(A = 1|X), required for IPWE and AIPWE, is either correctly specified as $\text{logit}\{P(A = 1|X)\} = \gamma_0 + \gamma_1 X_1^2 + \gamma_2 X_2^2$ or incorrectly specified as $\text{logit}\{P(A = 1|X)\} = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2$.
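As a concrete illustration, here is a minimal R sketch of fitting the two propensity score models with glm; the data frame dat with columns A, X1 and X2 is an assumption of this sketch.

    ## Correctly specified and misspecified propensity score models for P(A = 1 | X).
    ps_correct <- glm(A ~ I(X1^2) + I(X2^2), family = binomial, data = dat)
    ps_wrong   <- glm(A ~ X1 + X2,           family = binomial, data = dat)
    pihat      <- fitted(ps_correct)          # fitted probabilities, plugged into the IPWE/AIPWE criterion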

The results, shown in Table 2, indicate that the RG method using random forests is as good as any other method, and that poorly fitting parametric models, for RG and also for AIPWE when the propensity score model is incorrect, can lead to regimes with noticeably worse properties.

4. Discussion

The Zhang et al. (2012) paper illustrates that regression methods may not be robust to model misspecification, and that AIPWE methods do have an appealing robustness property. However, this robustness property should not be an excuse for not seeking reasonably fitting models for the data. We demonstrate in a small simulation study that modeling the response for the regression method with a better fitting parametric model leads to some improvement, while using a readily available non-parametric method removes concerns about non-robustness of the regression method. Furthermore, the properties of both the regression and augmented inverse probability weighted methods are improved by using the non-parametric method for the response rather than parametric models, and the two approaches then perform quite similarly. Thus the extra modeling needed for the AIPWE is not doing any harm, but also may not be necessary.

Table 2.

Simulation results, nonrandomized studies. Ratio to optimal is Q(ĝ)/Q(gopt). Fraction treated denotes the fraction that would be treated in a future population if regime ĝ was followed. Two independent covariates. Optimal fraction treated = 0.5. The subscripts denote the model that was used for estimating μ(A, X): t = true, msl = misspecified simple linear, ms33 = misspecified simple with Y^{1/3}, mc33 = misspecified complex with Y^{1/3}, rf = random forest.

Method        Ratio to optimal   Fraction treated
RGμt          1.000              0.47
RGμmsl        0.878              0.25
RGμms33       0.861              0.18
RGμmc33       0.936              0.50
RGrf          0.994              0.49

Propensity score model correct
IPWE          0.979              0.47
AIPWEμt       0.998              0.47
AIPWEμmsl     0.988              0.47
AIPWEμms33    0.984              0.47
AIPWEμmc33    0.991              0.46
AIPWErf       0.995              0.47

Propensity score model incorrect
IPWE          0.921              0.33
AIPWEμt       0.998              0.47
AIPWEμmsl     0.961              0.39
AIPWEμms33    0.934              0.35
AIPWEμmc33    0.982              0.42
AIPWErf       0.995              0.47

Acknowledgments

This work was partially funded by US National Institutes of Health grant CA 129102 and by Eli Lilly.

References

  1. Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. Statistics in Medicine. 2011;30:2867–2880. doi: 10.1002/sim.4322.
  2. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
