Summary
A salient feature of data from clinical trials and medical studies is inhomogeneity. Patients differ not only in baseline characteristics, but also in the way they respond to treatment. Optimal individualized treatment regimes are developed to select effective treatments based on patients’ heterogeneity. However, the optimal treatment regime might also vary for patients across different subgroups. In this paper, we mainly consider patients’ heterogeneity caused by groupwise individualized treatment effects, assuming the same marginal treatment effects for all groups. We propose a new maximin-projection learning method for estimating a single treatment decision rule that works reliably for a group of future patients from a possibly new subpopulation. Based on estimated optimal treatment regimes for all subgroups, the proposed maximin treatment regime is obtained by solving a quadratically constrained linear programming (QCLP) problem, which can be efficiently computed by interior-point methods. Consistency and asymptotic normality of the estimator are established. Numerical examples show the reliability of the proposed methodology.
Keywords: Heterogeneity, Maximin-projection learning, Optimal treatment regime, Quadratically constrained linear programming
1. Introduction
Data from clinical trials and medical studies are often characterized by some degree of inhomogeneity. Patients differ not only in baseline characteristics, but also in the way they respond to treatment. There has been increasing interest in developing individualized optimal treatment regimes (OTRs) to account for patients’ heterogeneity in response to treatment and to achieve the best treatment effect for individual patients. Some common methods for estimating OTRs include Q-learning (Watkins and Dayan, 1992; Chakraborty et al., 2010), A-learning (Robins et al., 2000; Murphy, 2003) and value search methods, which directly search for OTRs by maximizing the estimated value function (Zhang et al., 2012; Zhao et al., 2012). However, the OTRs may also vary for patients from different subpopulations. This is typically the case in meta-analysis, where we combine the results of multiple studies conducted at different locations or times. One motivating example is a multicenter randomised controlled trial studied in Tarrier et al. (2004). The goal is to examine the effectiveness of cognitive-behavioural therapy for patients with early schizophrenia. Patients can be classified into three groups according to their treatment centres (Manchester, Liverpool and North Nottinghamshire). As shown in Section D of the supplementary article, the group-wise OTRs can vary across different centres. Another example is an observational study investigating the influence of early disease modifying antirheumatic drug (DMARD) treatment on patients with recent onset inflammatory polyarthritis (Farragher et al., 2010). Patients can be classified into three groups according to their enrollment time. As studied in Section 6, the group-wise OTRs can vary across different enrollment periods. The heterogeneity in OTRs may be explained by differences in the characteristics of the treatment setting across subgroups.
For instance, in the schizophrenia example, the strength of therapeutic alliance between therapist and patient, the adherence to treatment protocols and the quality of treatment provided can vary from one treatment centre to another (Dunn and Bentall, 2007); in the inflammatory polyarthritis example, there was more use of hydroxychloroquine for the methotrexate combination strategy in recruitment time group 3 (1997–2000) than in group 1 (1990–1992) or group 2 (1993–1996), as hydroxychloroquine was increasingly used in the UK before anti-tumour necrosis factor therapy was introduced to treat rheumatoid arthritis in 2001. Moreover, these characteristics are often unobserved or partially observed, and they may explain the interaction between subgroups and OTRs.
The aim of this paper is to propose a reliable OTR for new patients based on the observed data from different groups with heterogeneity in optimal treatment decision. The group of new patients may differ from any of the currently observed groups in terms of optimal treatment decision. For example, compared with existing data, the group of new patients, who come from a new treatment centre, may have a different OTR because of a different strength of therapeutic alliance or a different quality of treatment provided in the new treatment centre. Therefore, the true OTR for the group of new patients is not estimable at all based on the observed data, and none of the group-wise OTRs may be the best choice. The challenge becomes how to derive a meaningful and reliable treatment regime that can take into account the heterogeneity in optimal treatment decision for different groups of patients. One simple approach is to pool the data of different groups together and obtain the “pooled” OTR based on the pooled data. Another method is to first obtain the OTR for each group, and then aggregate the group-wise OTRs in certain ways. Random effects meta-analysis (DerSimonian and Laird, 1986) is commonly used to combine subject-specific studies. Using its multivariate extensions (cf. Jackson et al., 2010; Chen et al., 2012), we can aggregate the groupwise OTRs based on random effects models. The resulting OTR is similar to the “pooled” OTR when we have large numbers of subgroup patients. These OTRs may be reasonable choices when the OTRs for different groups do not vary much. However, when there is a certain degree of heterogeneity in OTRs across different groups, as demonstrated in the toy example given in the next section, these OTRs are uniformly worse than the proposed OTR for any of the groups.
One possible reason is that these OTRs for different groups may assign the same patient to different treatments and thus their effects are averaged out when pooling the data from different groups.
Bühlmann and Meinshausen (2016) and Meinshausen and Bühlmann (2015) considered a maximin criterion which has a nice characterization in linear models and proposed to use maximin aggregation (magging) to obtain the maximin estimator. Their proposed estimator is shown to be more robust than the pooled estimator in linear regression. The key idea of the maximin criterion is to find an estimator that works the best under the worst-case scenario. In optimal treatment decision, the percentage of making the correct decision (PCD) and the value function are two commonly used measures to evaluate the effectiveness of a treatment regime. A natural maximin criterion for optimal treatment decision is to find an OTR that maximizes the minimum PCD or the minimum value function of all groups. Such a maximin OTR is appealing due to its nice interpretation and robustness. However, it is hard to implement in practice for the following reasons. First, the PCD of a treatment regime is generally not estimable from data since the true OTR is unknown. Second, the empirical estimator of the value function as studied in Zhang et al. (2012) is non-smooth and non-concave, thus the estimation of the associated maximin OTR is not feasible.
In this paper, we propose a novel maximin-projection learning (MPL) method to aggregate linear OTRs across different groups. Specifically, the proposed maximin-projection learning finds a linear decision rule that maximizes the minimum “inner product” between the vector of regression parameters in the linear rule and those of the group-wise linear OTRs. We show that under certain model assumptions, the OTR obtained by the maximin-projection learning maximizes the minimum percentage of making the correct decision and the minimum value function of different groups, i.e., it achieves the desired maximin properties. In addition, the corresponding estimation procedure can be represented as a linear programming problem with a quadratic constraint (Lee et al., 2016), which can be efficiently solved in O(Gs² + s³) flops. Here G denotes the number of groups and s the dimension of baseline covariates. Consistency and the asymptotic distribution of the corresponding maximin-projection estimators are established. Asymptotic results of this kind are rarely studied in the literature. To derive such asymptotic properties, we establish a necessary and sufficient condition for the existence and uniqueness of the population maximin-projection parameters and obtain a closed-form expression for the resulting estimator.
The rest of the paper is organized as follows. We introduce the model, notation and assumptions in Section 2. We also provide a heuristic comparison between the maximin OTR, the pooled OTR and the OTR based on random effects models with a toy example. In Section 3, we formally introduce the proposed maximin-projection learning, including its statistical interpretation and geometrical characterization. Section 4 presents the estimating procedure of the maximin-projection estimator and the associated asymptotic properties. Simulation studies to evaluate the empirical performance of the proposed maximin OTR are conducted in Section 5. We apply our method to a real example in Section 6, followed by a concluding section. The proof of Theorem 3.2 is provided in Section A. Other proofs and additional numerical studies are given in the supplementary article.
2. Preliminaries and a toy example
2.1. Preliminaries
For simplicity, we consider a single stage study with two treatments. Let Y denote a patient’s response of interest, the larger the better by convention, A ∈ 𝒜 = {0, 1} the treatment received by the patient and X the associated s-dimensional vector of baseline covariates. In addition, let Y*(0) and Y*(1) denote the potential outcomes that a patient would get if he or she was given treatment 0 and 1, respectively. A treatment regime d is a deterministic function that maps a patient’s covariates to {0, 1}. Define the potential outcome

$$Y^*(d) = d(X)\,Y^*(1) + \{1 - d(X)\}\,Y^*(0),$$

representing the response that a patient would get if treated according to the regime d. The optimal treatment regime is defined as the regime dopt that maximizes E{Y*(d)}. Under the stable unit treatment value assumption (SUTVA) and the no unmeasured confounders assumption (Rubin, 1974), the optimal treatment regime can be written as dopt(x) = I{C(x) > 0} where

$$C(x) = E(Y \mid X = x, A = 1) - E(Y \mid X = x, A = 0).$$

The function C(·) is referred to as the contrast function. In practice, for simplicity, we may assume the contrast takes a linear form, i.e., C(x) = βT x + c. To take population heterogeneity into account, we assume that the contrast function varies for patients from different groups. Specifically, we assume there are G groups of patients and consider the following semiparametric model:

$$Y_g = h_g(X_g) + A_g(\beta_g^T X_g + c_g) + e_g, \qquad g = 1, \dots, G, \tag{1}$$

where E(eg|Xg, Ag) = 0. In Model (1), Yg, Ag and Xg ∈ ℝs stand for the response, the treatment and the covariates of patients in Group g, respectively, and hg denotes the unspecified baseline function in Group g. Without loss of generality, we further assume all covariates Xg are standardized to have zero mean and identity covariance matrix. Otherwise, we consider the variable transformation $\tilde X_g = \Sigma_g^{-1/2}(X_g - \mu_g)$, where μg = E(Xg) and Σg = cov(Xg). Then Model (1) can be represented as $Y_g = \tilde h_g(\tilde X_g) + A_g(\tilde\beta_g^T \tilde X_g + \tilde c_g) + e_g$ with $\tilde\beta_g = \Sigma_g^{1/2}\beta_g$ and $\tilde c_g = c_g + \beta_g^T\mu_g$, for some function $\tilde h_g$. The parameter cg stands for the marginal treatment effect (average causal effect) after adjusting for covariates. Mathematically, we have

$$c_g = E\{Y_g^*(1) - Y_g^*(0)\}.$$

When cg > 0, treatment 1 is generally better for patients in Group g. The vector βg describes individualized treatment effects. For a patient in Group g with covariates x, the larger $\beta_g^T x$, the more benefit he or she receives if assigned to treatment 1.
Define πg(x) = Pr(Ag = 1|Xg = x) as the propensity score in Group g. Model (1) allows hg and πg to vary across groups, which we refer to as baseline effect heterogeneity and treatment assignment heterogeneity, respectively. These sources of heterogeneity are not related to treatment decisions since they do not appear in the contrast function. The following sources of groupwise heterogeneity will affect decision making: the marginal treatment effects cg and the individualized treatment effects βg. In this paper, we mainly focus on heterogeneity caused by different βg’s. We assume c1 = ··· = cG = c0 for some c0, that is, the same marginal treatment effect for all groups.
To introduce the pooled and the maximin optimal treatment regimes, we need some optimality criterion. Here, we consider the difference in patients’ mean response (value function) between a regime d(x) = I(βTx > −c) and d0(x) = 0, which assigns all patients to treatment 0. Specifically, the difference of value functions is defined as

$$VD_g(\beta, c) = E\{Y_g^*(d)\} - E\{Y_g^*(d_0)\} = E\{(\beta_g^T X_g + c_0)\, I(\beta^T X_g > -c)\}.$$

In this section, for illustrative purposes only, we consider a special case with c0 = c = 0. A general discussion will be given in the next section. When the distributions of Xg’s are the same across groups, we can represent VDg(β, 0) as

$$VD_g(\beta, 0) = VD(\beta, \beta_g) \equiv E\{\beta_g^T X\, I(\beta^T X > 0)\},$$

where X follows the common covariate distribution. We assume the same number of patients across all groups. Then, the pooled optimal treatment regime is defined as $d^P(x) = I(x^T\beta^P > 0)$ where

$$\beta^P = \arg\max_{\|\beta\|_2 = 1} \frac{1}{G}\sum_{g=1}^{G} VD(\beta, \beta_g), \tag{2}$$

and the maximin optimal treatment regime is defined as $d^M(x) = I(x^T\beta^M > 0)$ where

$$\beta^M = \arg\max_{\|\beta\|_2 = 1} \min_{g \in \{1, \dots, G\}} VD(\beta, \beta_g). \tag{3}$$

We add the L2 constraint on β to make βP and βM identifiable. Thus, the pooled optimal treatment regime aims to maximize the average value difference, while the maximin optimal treatment regime aims to maximize the minimum value difference in G groups, i.e., to maximize the reward of the worst-case scenario.
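The random effects meta-analyses assume the following model for βg’s:

$$\beta_g = \beta_0 + \varepsilon_g, \qquad g = 1, \dots, G,$$

where εg’s are independent and satisfy E(εg) = 0, cov(εg) = Ω0 for all g. For any subgroup estimators β̂1, …, β̂G with cov(β̂g) = Ωg, the aggregated estimator is given by

$$\hat\beta^R = \Big\{\sum_{g=1}^{G} (\hat\Omega_g + \hat\Omega_0)^{-1}\Big\}^{-1} \sum_{g=1}^{G} (\hat\Omega_g + \hat\Omega_0)^{-1}\hat\beta_g,$$

where Ω̂g’s and Ω̂0 denote some estimators for Ωg’s and Ω0. Given sufficiently many observations, we have $\hat\Omega_g \to 0$ and $\hat\Omega_0 \to \Omega_0$ in probability. As a result, we have

$$\hat\beta^R \to \beta^R \equiv \frac{1}{G}\sum_{g=1}^{G} \beta_g. \tag{4}$$

The corresponding optimal treatment regime is defined as $d^R(x) = I(x^T\beta^R > 0)$.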
More generally, we can treat the parameters βg in the group-specific contrast function as a multivariate random variable and assume that the parameters βg’s of training groups are generated according to some distribution Fb, either continuous or discrete, and let Hb denote the support of Fb. Then, we define βR, βP and βM as

$$\beta^R = E_{\mathrm{train},b}(b), \qquad \beta^P = \arg\max_{\|\beta\|_2 = 1} E_{\mathrm{train},b}\{VD(\beta, b)\}, \qquad \beta^M = \arg\max_{\|\beta\|_2 = 1} \min_{b \in H_b} VD(\beta, b),$$

where the expectation Etrain,b is taken with respect to Fb. Definitions in (2), (3) and (4) correspond to the special case where Fb takes values only in {β1, …, βG}, each with equal probability. Our objective is to maximize Etest,b{VD(β, b)}, where Etest,b is taken with respect to Gb, the distribution of βg for future groups of patients.
2.2. A toy example
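Recall that s is the dimension of Xg. For illustration, we take s = 2, and assume that patients’ baseline covariates are generated independently from a standard normal distribution. Since ||β||2 = 1, after some calculation, we have

$$\begin{aligned} VD(\beta, \beta_g) &= E\{\beta_g^T X\, I(\beta^T X > 0)\} \\ &= (\beta^T\beta_g)\, E\{\beta^T X\, I(\beta^T X > 0)\} + E\{\beta_g^T(\mathbf{I}_s - \beta\beta^T)X\}\Pr(\beta^T X > 0) \\ &= \frac{\beta^T\beta_g}{\sqrt{2\pi}}, \end{aligned}$$

where $\mathbf{I}_s$ is the s × s identity matrix. The first equality in the second line is due to the independence between $(\mathbf{I}_s - \beta\beta^T)X$ and $\beta^T X$; the last equality holds because $E\{(\mathbf{I}_s - \beta\beta^T)X\} = 0$ and $E\{Z\,I(Z > 0)\} = 1/\sqrt{2\pi}$ for a standard normal Z. Hence, we obtain

$$\beta^P = \arg\max_{\|\beta\|_2 = 1} \beta^T\Big(\frac{1}{G}\sum_{g=1}^{G}\beta_g\Big) = \frac{\sum_{g}\beta_g}{\|\sum_{g}\beta_g\|_2}, \qquad \beta^M = \arg\max_{\|\beta\|_2 = 1} \min_{g} \beta^T\beta_g. \tag{5}$$

Therefore, βP is proportional to βR, which equals a simple average of all subgroup parameters, while βM maximizes the minimum inner product with the βg’s. When all βg’s have the same L2 norm, βM becomes

$$\beta^M = \arg\max_{\|\beta\|_2 = 1} \min_{g} \frac{\beta^T\beta_g}{\|\beta_g\|_2}, \tag{6}$$

or equivalently

$$\beta^M = \arg\min_{\|\beta\|_2 = 1} \max_{g} \angle\Big(\beta, \frac{\beta_g}{\|\beta_g\|_2}\Big), \tag{7}$$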
where ∠(a, b) = arccos(aT b) stands for the angle between two unit vectors. The equivalence between (6) and (7) is due to the monotonicity of the arccos function. In (6) or (7), βM is defined to maximize (minimize) the minimum correlation (maximum angle) with all subgroup coefficients. Such a formulation is referred to as the maximin correlation approach in the classification literature (cf. Avi-Itzhak et al., 1995; Lee et al., 2016). In general, we weight the correlation βT βg/||βg||2 by the L2 norm of βg. The βM defined in (5) is more informative since it takes into consideration not only the heterogeneity due to different directions βg/||βg||2, but also that due to different magnitudes ||βg||2.
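As a quick sanity check of the closed form used in this subsection, namely that the value difference equals βTβg/√(2π) for a unit vector β and standard normal covariates, the following sketch compares it with a Monte Carlo average. The function names, the particular vectors and the sample size are illustrative assumptions, not part of the paper.

```python
import math
import random


def vd_monte_carlo(beta, beta_g, n=200_000, seed=0):
    """Monte Carlo estimate of E{beta_g^T X * I(beta^T X > 0)} for X ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        if sum(b * xi for b, xi in zip(beta, x)) > 0:
            total += sum(b * xi for b, xi in zip(beta_g, x))
    return total / n


def vd_closed_form(beta, beta_g):
    """Closed form beta^T beta_g / sqrt(2*pi); requires ||beta||_2 = 1."""
    return sum(a * b for a, b in zip(beta, beta_g)) / math.sqrt(2 * math.pi)


# Illustrative choice: unit-norm rule at 45 degrees, group vector of norm 1.5 at 70 degrees.
beta = (math.cos(math.radians(45)), math.sin(math.radians(45)))
beta_g = (1.5 * math.cos(math.radians(70)), 1.5 * math.sin(math.radians(70)))

mc = vd_monte_carlo(beta, beta_g)
exact = vd_closed_form(beta, beta_g)
print(f"Monte Carlo: {mc:.4f}, closed form: {exact:.4f}")
```

The two numbers should agree up to simulation error, and the comparison also shows that the magnitude ||βg||2 scales the value difference, which is why (5) weights directions by their norms.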
Since βP is proportional to βR, the VD under the pooled OTR is the same as that under the OTR based on random effects models. Therefore, in the following, we focus on comparing βP with βM. We set G = 4 and assume ||βg||2 = 1, g = 1, 2, 3, 4. Since s = 2, we represent each βg as βg = {cos(ψg), sin(ψg)} with ψg ∈ [0, π). The parameter ψg is the angle between βg and the x-axis in a 2-dimensional coordinate system. In this special case, βM lies on the bisector of the largest angle formed by the βg’s and it can be shown that βM = {cos(ψM), sin(ψM)} where

$$\psi^M = \frac{\psi_{(1)} + \psi_{(4)}}{2},$$

and ψ(1) and ψ(4) denote the smallest and largest of the ψg’s. Similarly define βP = {cos(ψP), sin(ψP)}. We set ψ1 = 0°, ψ2 = 15°, ψ3 = 70° and ψ4 = 90°. Consider the following leave-one-group-out cross validation procedure. For the ith round, we choose the ith group as the testing group, and obtain βP and βM based on the remaining 3 groups. Then we evaluate the value difference of the pooled and maximin OTRs based on the ith group. In other words, we set Fb to be a discrete distribution that takes values in {β1, …, β4} \ {βi} with equal probability, and Gb a degenerate distribution that concentrates on βi. Table 1 summarizes the results.
Table 1.
Different combinations of training groups and the corresponding ψP, ψM, and their value differences on the testing group
| Training groups | ψM | ψP | ψtest | VD(βM, βtest) | VD(βP, βtest) |
|---|---|---|---|---|---|
| (1, 2, 3) | 35° | 27.44° | 90° | 0.23 | 0.18 |
| (1, 2, 4) | 45° | 32.63° | 70° | 0.36 | 0.32 |
| (1, 3, 4) | 45° | 55.32° | 15° | 0.35 | 0.30 |
| (2, 3, 4) | 52.5° | 59.25° | 0° | 0.24 | 0.20 |
From Table 1, we can see that for all four cases, the value differences of the maximin optimal treatment regime are uniformly larger than those of the pooled optimal treatment regime on the testing groups. To illustrate the idea graphically, we plot βP (denoted by the snow symbol), βM (denoted by the circle symbol), and βg of the training (denoted by the square symbol) and testing (denoted by the plus symbol) groups for the second and third cases in Figure 1, where the left panel is for the second case and the right one is for the third case. For both cases, βM is closer to βg of the testing groups, while βP is pulled towards the area where most βg’s of the training groups locate due to the averaging effect.
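The entries of Table 1 can be reproduced with a few lines of arithmetic. The sketch below assumes unit-norm βg in two dimensions with all angles inside a half-plane, so that the maximin direction is the bisector of the two extreme training angles, the pooled direction is the direction of the vector sum, and the value difference is cos(angle difference)/√(2π). The helper names are illustrative.

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)
psi = {1: 0.0, 2: 15.0, 3: 70.0, 4: 90.0}  # group angles in degrees, as in the text


def maximin_angle(angles):
    """Bisector of the smallest and largest training angles."""
    return (min(angles) + max(angles)) / 2


def pooled_angle(angles):
    """Direction of the sum (equivalently the average) of the unit vectors."""
    sx = sum(math.cos(math.radians(a)) for a in angles)
    sy = sum(math.sin(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(sy, sx))


def value_difference(angle, test_angle):
    """VD = beta^T beta_test / sqrt(2*pi) for unit vectors at the given angles."""
    return math.cos(math.radians(angle - test_angle)) / SQRT_2PI


rows = []
for test in (4, 3, 2, 1):  # leave one group out, in the order of Table 1
    train = tuple(g for g in psi if g != test)
    angles = [psi[g] for g in train]
    pm, pp = maximin_angle(angles), pooled_angle(angles)
    rows.append((train, pm, pp, psi[test],
                 value_difference(pm, psi[test]),
                 value_difference(pp, psi[test])))

for train, pm, pp, pt, vd_m, vd_p in rows:
    print(f"{train}  psi_M={pm:5.2f}  psi_P={pp:5.2f}  psi_test={pt:4.1f}  "
          f"VD_M={vd_m:.2f}  VD_P={vd_p:.2f}")
```

In every round the maximin direction sits closer in angle to the held-out group than the pooled direction does, which is what produces the larger value differences.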
Fig. 1.
Plots of βP (denoted by the snow symbol), βM (denoted by the circle symbol), and βg of the training (denoted by the square symbol) and testing groups (denoted by the plus symbol) for the second (left panel) and third (right panel) cases.
3. Maximin-projection learning
We now formally introduce our maximin projection treatment regime. Based on model (1) and the common marginal treatment effect assumption, the optimal treatment regime for the gth subgroup is $d_g^{opt}(x) = I(x^T\beta_g > -c_0)$. Here, our goal is to find a single treatment regime with ||βM||2 = 1 that performs uniformly well for heterogeneous data. Motivated by the toy example in the previous section, our proposed maximin-projection learning aims to find

$$\beta^M = \arg\max_{\|\beta\|_2 = 1} \min_{g \in \{1, \dots, G\}} \beta^T\beta_g.$$
3.1. Statistical interpretation
In this subsection, we show that the maximin projection, represented by βM, has two nice statistical interpretations in terms of maximizing the minimum PCD and the minimum value difference (VD). Specifically, in group g, the PCD of a treatment regime d(x) = I(xTβ > −c) is defined as

$$PCD_g(\beta, c) = 1 - E\,\big|I(X_g^T\beta > -c) - I(X_g^T\beta_g > -c_0)\big|,$$

and the VD is defined as

$$VD_g(\beta, c) = E\{(X_g^T\beta_g + c_0)\, I(X_g^T\beta > -c)\}.$$

Here, the larger the PCD and VD values, the better the treatment regime d(x) approximates the groupwise optimal treatment regime $d_g^{opt}(x) = I(x^T\beta_g > -c_0)$.
Based on the defined PCD and VD, for any fixed constant c, we consider the following maximin treatment regimes: $d_c^{PCD}(x) = I(x^T\beta_c^{PCD} > -c)$ where

$$\beta_c^{PCD} = \arg\max_{\|\beta\|_2 = 1} \min_{g} PCD_g(\beta, c), \tag{8}$$

and $d_c^{VD}(x) = I(x^T\beta_c^{VD} > -c)$ where

$$\beta_c^{VD} = \arg\max_{\|\beta\|_2 = 1} \min_{g} VD_g(\beta, c). \tag{9}$$

Remark 3.1
The two maximin treatment regimes, defined by $\beta_c^{PCD}$ and $\beta_c^{VD}$, are appealing for their nice statistical interpretations. However, we note that the definition of PCD involves unknown parameters, and the empirical estimators of VD are non-smooth and non-concave functions of β. Therefore, their estimation is not feasible and they may not be practically useful.
Remark 3.2
It is worth noting that $\beta_c^{PCD}$ would be meaningless when not all ||βg||2’s are the same. This is because PCD only measures the similarity between the overall and groupwise optimal treatment decisions, but does not account for the magnitude of the groupwise contrast function. When ||βg||2’s are not the same, the L2 norms of the groupwise contrast functions differ. This implies that PCDs are not comparable across different groups. In comparison, VD is a better criterion since it takes both the sign and the magnitude of the contrast function into consideration. Below, under some conditions, we establish the equivalence between these two maximin treatment regimes and our proposed maximin-projection treatment regime.
Theorem 3.1 (Equivalence of βM and $\beta_c^{PCD}$)
Assume that Xg’s are i.i.d. spherically distributed, and all ||βg||2’s are the same. Then, for any fixed c,

$$\beta^M = \beta_c^{PCD}.$$

Theorem 3.2 (Equivalence of βM and $\beta_c^{VD}$)
Assume Xg’s are i.i.d. spherically distributed. Then, for any fixed c,

$$\beta^M = \beta_c^{VD}.$$
Remark 3.3
Theorems 3.1 and 3.2 require Xg to have a spherical distribution (see Definition F.1), which is a rich class of symmetric multivariate distributions (see Fang et al., 1990).
The definition of βM has nice statistical interpretations. However, it has two drawbacks. First, when F0 ≡ max||β||2=1 ming βT βg < 0, the uniqueness of βM is not guaranteed. This may cause identifiability issues when we establish properties of the corresponding estimators. In addition, the optimization problem in (5) is not concave, because its feasible set ||β||2 = 1 is not convex. This can make the implementation of the estimating procedure infeasible.
To address these concerns, we define

$$\tilde\beta^M = \arg\max_{\|\beta\|_2 \le 1} \min_{g \in \{1, \dots, G\}} \beta^T\beta_g. \tag{10}$$

Compared to βM, it replaces the feasible set ||β||2 = 1 with the closed convex set ||β||2 ≤ 1. Lemma 3.1 below states that $\tilde\beta^M$ is well defined when F0 ≠ 0. Moreover, the optimization problem (10) maximizes a concave function over a convex set, and can therefore be easily solved.
Lemma 3.1
The maximin-projection estimator $\tilde\beta^M$ always exists. Moreover, when F0 ≠ 0, $\tilde\beta^M$ is unique.
Remark 3.4
The existence of $\tilde\beta^M$ is guaranteed by the continuity of the objective function F(β) = ming∈{1,…,G} βT βg, and the boundedness and closedness of the feasible set {β : ||β||2 ≤ 1}. Its uniqueness is a byproduct of Lemma 3.3, which is stated in the next subsection. When F0 = 0, $\tilde\beta^M$ is not unique and the set of solutions is given by

$$\{\beta : \|\beta\|_2 \le 1,\ \min_{g} \beta^T\beta_g = 0\}.$$

The problem of estimating $\tilde\beta^M$ then becomes non-regular and all the large sample theories about the maximin estimator fail (see Section 4).
Define G0 = max||β||2≤1 ming βT βg. It is obvious that G0 ≥ 0, since β = 0 is feasible. In addition, G0 > 0 if and only if F0 > 0. When G0 = 0, we can set $\tilde\beta^M = 0$, which leads to a trivial regime by assigning the same treatment to all patients. From now on, we focus on the situation when G0 > 0. In this case, we have $\tilde\beta^M = \beta^M$. Define

$$c^M = c_0 / G_0.$$

Note that $c^M$ and c0 are sign equivalent. Our maximin-projection OTR is given by

$$d^M(x) = I(x^T\tilde\beta^M > -c^M).$$
Theorem 3.3
Under conditions of Theorem 3.1, if G0 > 0, we have
Theorem 3.4
Under conditions of Theorem 3.2, if G0 > 0, we have
Together with Theorems 3.1 and 3.2, Theorems 3.3 and 3.4 suggest that the proposed maximin-projection treatment regime maximizes the minimum PCD and the minimum VD among different groups.
3.2. Geometrical characterization
In this subsection we give a geometrical view of the maximin projection when G0 > 0. Findings in this subsection are similar in rationale to the results in Avi-Itzhak et al. (1995). However, we generalize their results by dropping the unit L2-norm condition ||βg||2 = 1 and allowing the set of vectors {β1, …, βG} to be linearly dependent, which is necessarily the case when G > s.
We first introduce some notation. For an arbitrary s × G matrix Ψ and a set K ⊆ {1, …, G}, let ΨK denote the submatrix of Ψ formed by the columns in K. Define the equicorrelated points set
and the optimal equicorrelated point
where Ψi refers to the ith column vector of matrix Ψ. When |K| = 1 and ΨK = ψ, . Readers can refer to Section B of the supplementary article for a detailed discussion on the equicorrelated points set and the optimal equicorrelated point.
For any matrix Ω, let Ω+ denote the Moore-Penrose pseudoinverse of Ω and C(Ω) the column space of Ω. Let e denote a vector of ones. We have the following result.
Lemma 3.2
For any Ψ and K ⊆ {1, …, G}, when $e \in C(\Psi_K^T)$, the optimal equicorrelated point of ΨK exists and is unique. Moreover, it takes the form

$$\frac{(\Psi_K^T)^+ e}{\|(\Psi_K^T)^+ e\|_2}. \tag{11}$$
Define matrix B = (β1, β2, …, βG) whose gth column is the subgroup parameter βg.
Lemma 3.3
Assume G0 > 0. Then there exists a unique nonempty set K0 ⊆ [1, …, G] such that and , where . Moreover, if the set of vectors βg, g ∈ K0 are linearly independent, then a necessary and sufficient condition for is that each element in the vector is nonnegative.
We refer to K0 as the maximin optimal equicorrelated points set when G0 > 0. In Lemma 3.2, the condition $e \in C(\Psi_K^T)$ automatically holds when $\Psi_K^T$ has full row rank. In Lemma 3.3, we assume the vectors βg, g ∈ K0, are linearly independent. This implies that the matrix $B_{K_0}^T$ has full row rank. As a result, we have $e \in C(B_{K_0}^T)$.
In Lemma 3.3, the non-negativity of is sufficient and necessary for . Together with Lemma 3.2, Lemma 3.3 implies that is uniquely defined by
This implies is proportional to and can be represented as a linear combination of the column vectors in BK0. Geometrically, the non-negativity of requires to lie in the convex cone of βg, g ∈ K0, i.e, {Σg∈K0 agβg : ag ≥ 0, ∀g ∈ K0}. To better understand Lemma 3.3, in Figure 2, we take s = 3, G = 3 and B = (β1, β2, β3) where β1 = (1, 1, 0), β2 = (1,−1, 0) and β3 = (1.2, 0, 0.5). Both and satisfy the necessary conditions of Lemma 3.3. While lies in the convex cone of β1 and β2, appears outside the convex cone of β1, β2 and β3. Therefore, satisfies the sufficient conditions of Lemma 3.3 and doesn’t. As a result, we have .
Fig. 2.

Plots of βg (denoted by the square symbol), (denoted by the snow symbol) and (denoted by the circle symbol)
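To make the Figure 2 example concrete, the sketch below computes, for a candidate index set K with linearly independent columns, the unit vector that has equal inner products with every βg, g ∈ K. It assumes the closed form BK w / ||BK w||2 with w = (BKT BK)−1 e, which is the normalized minimum-norm solution of BKT β = e; this formula and the helper names `solve`, `dot` and `equicorrelated_point` are assumptions made for illustration rather than the paper's own code.

```python
import math


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def equicorrelated_point(betas, K):
    """Return (unit vector, weights w) with equal inner products to betas[g], g in K."""
    cols = [betas[g] for g in K]
    gram = [[dot(u, v) for v in cols] for u in cols]   # B_K^T B_K
    w = solve(gram, [1.0] * len(K))                     # (B_K^T B_K)^{-1} e
    x = [sum(wi * c[i] for wi, c in zip(w, cols)) for i in range(len(cols[0]))]
    norm = math.sqrt(dot(x, x))
    return [xi / norm for xi in x], w


# Vectors from the Figure 2 example (0-based indices in the code).
betas = [(1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (1.2, 0.0, 0.5)]

for K in [(0, 1), (0, 1, 2)]:
    point, w = equicorrelated_point(betas, K)
    worst = min(dot(point, b) for b in betas)
    print(K, [round(p, 3) for p in point],
          "weights:", [round(x, 3) for x in w],
          "min inner product:", round(worst, 3))
```

For K = {1, 2} the weights are both 0.5 and the point (1, 0, 0) has minimum inner product 1 over all three groups, whereas using all three vectors gives a weight of −0.8 on β3 and a smaller common inner product of about 0.928. This matches the discussion above: only the first candidate lies in the convex cone of its generating vectors.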
4. Estimation procedure
The data are summarized as (Ygj,Agj,Xgj), for g = 1, …, G, j = 1, …, mg, where mg is the number of patients in Group g. We assume that the data are independent across g = 1, …, G and j = 1, …, mg. Based on the data, the parameters β1, …, βG and c0 in model (1) can be estimated with existing methods. In this paper, we use the popular Q-learning and A-learning, and give a brief discussion on estimating these parameters in Section 4.2. Let β̂1, …, β̂G and ĉ0 be the corresponding estimators. We propose to estimate the maximin projection defined in (10) by solving the following optimization problem:

$$\hat\beta^M = \arg\max_{\|\beta\|_2 \le 1} \min_{g \in \{1, \dots, G\}} \beta^T\hat\beta_g. \tag{12}$$

Note that the objective function ming βTβ̂g is concave in β and the region ||β||2 ≤ 1 is convex. Therefore, (12) is a tractable convex optimization problem. It can be further cast as a quadratically constrained linear programming (QCLP) problem; specifically, β̂M is equivalent to the solution of

$$\max_{\beta,\, t}\ t \quad \text{subject to} \quad \beta^T\hat\beta_g \ge t,\ g = 1, \dots, G, \quad \|\beta\|_2^2 \le 1.$$

The above optimization problem can be efficiently solved using existing software. Define ĉM = ĉ0/Ĝ0, where $\hat G_0 = \min_g \hat\beta^{M\,T}\hat\beta_g$.
Given a group of future patients with baseline covariates $x_1, \dots, x_n$, we calculate β̂M and ĉM from the observed groups. The recommended treatment for the jth patient is given by

$$I(x_j^T\hat\beta^M > -\hat c^M).$$
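The QCLP formulation and the resulting decision rule can be sketched as follows. This is only an illustration under stated assumptions: it uses SciPy's general-purpose SLSQP routine (a sequential quadratic programming method) instead of an interior-point solver, the subgroup estimates are the four unit vectors of the toy example in Section 2.2, and the intercept estimate `c_hat_0` and the new covariates `x_new` are made-up values.

```python
import numpy as np
from scipy.optimize import minimize


def maximin_projection(beta_hats):
    """Solve max t s.t. beta^T beta_g >= t for all g and ||beta||_2^2 <= 1."""
    B = np.asarray(beta_hats, dtype=float)       # G x s, one row per group
    s = B.shape[1]
    start = B.mean(axis=0)
    start = start / max(np.linalg.norm(start), 1e-12)
    z0 = np.append(start, (B @ start).min())     # feasible start for z = (beta, t)
    constraints = [
        {"type": "ineq", "fun": lambda z: B @ z[:s] - z[s]},      # beta^T beta_g - t >= 0
        {"type": "ineq", "fun": lambda z: 1.0 - z[:s] @ z[:s]},   # 1 - ||beta||^2 >= 0
    ]
    res = minimize(lambda z: -z[s], z0, method="SLSQP", constraints=constraints)
    return res.x[:s], res.x[s]                    # (beta_M_hat, G0_hat)


# Toy subgroup estimates: unit vectors at 0, 15, 70 and 90 degrees.
angles = np.radians([0.0, 15.0, 70.0, 90.0])
beta_hats = np.column_stack([np.cos(angles), np.sin(angles)])

beta_m, g0 = maximin_projection(beta_hats)
c_hat_0 = 0.5                                     # hypothetical common intercept estimate
c_m = c_hat_0 / g0                                # c_M_hat = c_0_hat / G0_hat

x_new = np.array([[0.2, -0.1], [-1.0, -0.3]])     # hypothetical future patients
treatment = (x_new @ beta_m > -c_m).astype(int)
print(beta_m, g0, treatment)
```

For this input the solver should return a direction close to the 45° bisector of the two extreme groups, in line with the geometric description of the toy example. For large G or s a dedicated conic or interior-point solver would be the more natural choice.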
4.1. Statistical properties
In this subsection we investigate the asymptotic properties of the maximin-projection estimator β̂M obtained by solving the optimization problem (12). We first study the consistency of the estimator by assuming the following two conditions.
(C1) Assume that β̂1, …, β̂G and ĉ0 converge in probability to β1, …, βG and c0, respectively.
(C2) Assume that F0 ≠ 0. When F0 > 0, assume that the column vectors in BK0 are linearly independent and all elements in the vector $(B_{K_0}^T B_{K_0})^{-1} e$ are nonzero, where K0 is the maximin optimal equicorrelated points set as defined previously.
Remark 4.1
Condition (C1) requires each subgroup estimator to be consistent. The condition F0 ≠ 0 in (C2) ensures the existence and uniqueness of the population maximin projection. The maximin projection is not stable when F0 approaches 0, since its L2 norm will change from 1 to 0. To ensure its stability, in the sense that it will not deviate too much when there are minor changes in the set of vectors β1, …, βG, we would expect
| (13) |
as B̃K0 → BK0, where B̃ = (β̃1, …, β̃G) represents the coefficient matrix with some disturbance. A sufficient condition to establish (13) is that BK0 is of full column rank, as assumed in Condition (C2). Lemma 3.2 suggests can be represented as , for some weight vector ω0 proportional to . Condition (C2) further assumes the weights are nonzero. Such a condition guarantees that for any coefficient matrix B̃ → B, K0 is the optimal equicorrelated points set of B̃ as well.
Theorem 4.1 (Consistency)
Define B̂ = (β̂1, …, β̂G). Assume Conditions C1 and C2 are satisfied. Then with probability tending to 1, the estimator β̂M is equal to
In addition, assume and for some . When F0 > 0, we have .
Remark 4.2
Theorem 4.1 implies that (β̂M, ĉM) is consistent as long as each subgroup estimator is consistent. The first part of the theorem follows as a consequence of Lemma 3.3.
Next, we study the asymptotic normality of the estimator. For notational simplicity, we assume m1 = ⋯= mG = m and posit the following condition.
(C3) Assume that for all g ∈ K0, $\sqrt{m}(\hat\beta_g - \beta_g)$ and $\sqrt{m}(\hat c_0 - c_0)$ are jointly asymptotically normal with mean zero.
Theorem 4.2 (Asymptotic normality)
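Assume that Conditions C1–C3 hold, and that F0 > 0. Let $\tilde\beta^M$ and $c^M$ denote the population counterparts of β̂M and ĉM. Then $\sqrt{m}(\hat\beta^M - \tilde\beta^M)$ and $\sqrt{m}(\hat c^M - c^M)$ are jointly asymptotically normal with mean zero and some covariance matrix VM. The expression of VM is given in Appendix C.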
Since the expression of the asymptotic covariance matrix VM is quite complicated, we propose to estimate it using a bootstrap method. Here, the bootstrap sampling is done within each subgroup. Specifically, we independently generate B bootstrap samples for each group g = 1, …, G,
j = 1, …, B. For each j, we obtain estimators β̂(j) and ĉ(j) based on the data
Confidence intervals of β̂M and ĉM are calculated based on quantiles of (β̂(1), …, β̂(B)) and (ĉ(1), …, ĉ(B)).
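The within-group resampling scheme can be sketched generically as below. The estimator here is a deliberately simple stand-in (the smallest group mean), not the maximin-projection estimator itself, and the function names, simulated data and number of bootstrap replications are illustrative assumptions; in practice `estimator` would refit the subgroup models and re-solve (12) on each resampled data set.

```python
import random
import statistics


def bootstrap_within_groups(groups, estimator, n_boot=500, seed=1):
    """Resample with replacement inside each group, keeping group sizes fixed."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resampled = [[rng.choice(g) for _ in range(len(g))] for g in groups]
        draws.append(estimator(resampled))
    return draws


def percentile_interval(draws, level=0.95):
    """Confidence interval from empirical quantiles of the bootstrap draws."""
    ordered = sorted(draws)
    k_lo = int(round((1 - level) / 2 * (len(ordered) - 1)))
    k_hi = int(round((1 + level) / 2 * (len(ordered) - 1)))
    return ordered[k_lo], ordered[k_hi]


def worst_group_mean(groups):
    """Stand-in estimator: the smallest group mean (a scalar maximin-type summary)."""
    return min(statistics.fmean(g) for g in groups)


# Simulated data: three groups of 100 observations with different means.
data_rng = random.Random(0)
groups = [[data_rng.gauss(mu, 1.0) for _ in range(100)] for mu in (1.0, 1.5, 2.0)]

point = worst_group_mean(groups)
draws = bootstrap_within_groups(groups, worst_group_mean)
lower, upper = percentile_interval(draws)
print(f"estimate {point:.3f}, 95% interval ({lower:.3f}, {upper:.3f})")
```

Resampling inside each group preserves the group sizes mg and the independence across groups assumed in this section.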
4.2. Estimation of group-specific regimes
In this subsection we discuss two popular approaches to obtain subgroup estimators β̂g and ĉ0.
Example 1 (Q-learning)
We estimate βg and c0 by modeling the Q-functions, which represent the conditional mean of the response given the covariates and the treatment. Specifically, the baseline function is assumed to have some parametric form hg(x, ηg) with parameter ηg. Then,

$$E(Y_g \mid X_g = x, A_g = a) = h_g(x, \eta_g) + a(\beta_g^T x + c_0).$$
Since c0 is common across all subgroups, we propose to estimate β1, …, βG and c0 by jointly solving the following set of estimating equations:
When the parametric models hg(x, ηg)’s are correctly specified, the resulting estimators β̂g’s and ĉ0 are consistent and jointly asymptotically normal.
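A minimal sketch of such a joint fit is given below, assuming a linear baseline hg(x, ηg) = ηg0 + xTηg1 and ordinary least squares, so that the estimating equations reduce to the normal equations of one stacked regression with group-specific baseline and βg columns and a single shared column for c0. The function name `q_learning_joint`, the simulated data and the true parameter values are illustrative assumptions.

```python
import numpy as np


def q_learning_joint(groups):
    """Least-squares fit of Y = eta_g0 + X eta_g1 + A (X beta_g + c0) with c0 shared.

    groups: list of (X, A, Y) arrays, one triple per group.
    Returns (list of beta_g estimates, c0 estimate).
    """
    G = len(groups)
    s = groups[0][0].shape[1]
    p = 2 * s + 1                                   # per group: intercept, eta_g1, beta_g
    blocks, responses = [], []
    for g, (X, A, Y) in enumerate(groups):
        block = np.zeros((X.shape[0], G * p + 1))
        block[:, g * p] = 1.0                                   # baseline intercept
        block[:, g * p + 1: g * p + 1 + s] = X                  # baseline slope
        block[:, g * p + 1 + s: (g + 1) * p] = A[:, None] * X   # treatment interaction
        block[:, -1] = A                                        # shared c0 column
        blocks.append(block)
        responses.append(Y)
    theta, *_ = np.linalg.lstsq(np.vstack(blocks), np.concatenate(responses), rcond=None)
    betas = [theta[g * p + 1 + s: (g + 1) * p] for g in range(G)]
    return betas, theta[-1]


# Simulated example with two groups, randomized treatment and a linear baseline.
rng = np.random.default_rng(0)
true_betas = [np.array([1.0, 0.5]), np.array([0.5, 1.0])]
true_c0 = 0.3
groups = []
for b in true_betas:
    X = rng.standard_normal((2000, 2))
    A = rng.integers(0, 2, size=2000).astype(float)
    Y = 1.0 + X @ np.array([0.5, -0.5]) + A * (X @ b + true_c0) + 0.5 * rng.standard_normal(2000)
    groups.append((X, A, Y))

beta_hats, c0_hat = q_learning_joint(groups)
print(beta_hats, c0_hat)
```

The fitted β̂g and ĉ0 can then be passed to the maximin-projection step in (12).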
Example 2 (A-learning)
Here, we posit some parametric model πg(X, αg) for the propensity score and hg(X, ηg) for the baseline function. The parameters αg’s, ηg’s, βg’s and c0 are estimated by solving the following set of estimating equations:
It can be shown that when either the propensity score or the baseline function for each group is correctly specified, the resulting estimators β̂g’s and ĉ0 are consistent and jointly asymptotically normal. This is the so-called doubly robust property of the A-learning estimation.
5. Simulation studies
We consider four groups of patients. In each group, we generate 200 samples according to the following model
where and . Two baseline models are considered for h, including a linear model and a nonlinear model . We generate treatments from two propensity score models, a constant model, Pr(Agj = 1) = 0.5 and a probit model, , where Φ(·) is the standard normal cumulative distribution function. This yields four simulation settings.
We further consider two scenarios for the subgroup parameters to exhibit different degrees of heterogeneity. In the first scenario, we set . Hence, all βg’s have the same L2 norm and their directions βg/||βg||2 differ. For the second scenario, we choose subgroup parameters to have similar directions but allow their L2 norms to vary. Specifically, . It can be shown that and for all scenarios.
We first obtain the subgroup estimators of βg and c0 using the A-learning estimating equations discussed in Section 4.2. Here, a logistic regression model is fitted for the propensity score and a linear model for the baseline function. As a result, both the propensity score model and the baseline model are correctly specified in the first setting; exactly one of them is misspecified in each of the second and third settings; and both are misspecified in the last setting. We then obtain the estimators β̂M and ĉM using the proposed maximin-projection learning. Confidence intervals for the resulting estimators are obtained based on 600 bootstrap samples.
For each setting, we conduct 600 simulations. The biases and standard deviations (SD) of β̂M and ĉM, and the coverage probabilities (CP) of 95% Wald-type confidence intervals for the corresponding maximin parameters, are reported in Table 2. In all scenarios, the proposed estimators achieve the smallest biases and standard deviations in Setting 1, where the baseline function and the propensity score are both correctly specified. In Settings 2 and 3, the proposed estimators are nearly unbiased, showing the doubly robust property of the subgroup estimators obtained using the A-learning estimating equations. In Setting 4, where the baseline function and the propensity score are both misspecified, the biases and standard deviations of the estimators tend to be larger, but the biases are still reasonably small. In addition, the coverage probabilities of the 95% Wald-type confidence intervals are close to the nominal level in all cases, ranging from 93.7% to 98.3%.
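As a rough illustration of a bootstrap-based Wald-type interval, the sketch below resamples the data with replacement, uses the standard deviation of the bootstrap replicates as the standard error, and forms estimate ± 1.96 × SE. The function name `bootstrap_wald_ci`, the simple i.i.d. resampling scheme and the toy estimator (a sample mean) are assumptions made for illustration; the text does not state how resampling is organized across groups.

```python
import random
import statistics


def bootstrap_wald_ci(data, estimator, n_boot=600, z=1.96, seed=0):
    """Wald-type interval: point estimate +/- z * bootstrap standard error."""
    rng = random.Random(seed)
    n = len(data)
    point = estimator(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]  # draw n items with replacement
        replicates.append(estimator(resample))
    se = statistics.stdev(replicates)
    return point - z * se, point + z * se


# Toy usage: 200 standard normal observations, estimator = sample mean.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)]
lo, hi = bootstrap_wald_ci(data, statistics.fmean)
print(lo, hi)
```

Repeating this over many simulated data sets and recording how often the interval contains the true parameter gives an empirical coverage probability of the kind reported in the simulations.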
Table 2.
Biases and standard deviations (in parentheses) of β̂M and ĉM, and coverage probabilities (CP) of 95% Wald-type confidence intervals for the corresponding maximin parameters.
| Scenario | Setting | β̂M, component 1 | β̂M, component 2 | ĉM | CP, component 1 | CP, component 2 | CP, ĉM |
|---|---|---|---|---|---|---|---|
| Scenario 1 | Setting 1 | −0.002(0.027) | 0.001(0.027) | 0.0003(0.024) | 96.0% | 96.0% | 95.3% |
| Scenario 1 | Setting 2 | −0.003(0.053) | −0.001(0.052) | 0.001(0.045) | 94.7% | 94.7% | 93.8% |
| Scenario 1 | Setting 3 | −0.003(0.036) | 0.001(0.035) | −0.0005(0.035) | 96.2% | 96.2% | 94.5% |
| Scenario 1 | Setting 4 | −0.003(0.068) | −0.004(0.068) | 0.002(0.068) | 96.0% | 96.0% | 95.0% |
| Scenario 2 | Setting 1 | −0.002(0.036) | 0.0002(0.036) | 0.0002(0.023) | 95.5% | 95.5% | 95.3% |
| Scenario 2 | Setting 2 | −0.009(0.061) | 0.003(0.060) | −0.001(0.043) | 96.0% | 96.0% | 93.8% |
| Scenario 2 | Setting 3 | −0.010(0.091) | −0.002(0.089) | −0.001(0.033) | 93.7% | 93.7% | 94.5% |
| Scenario 2 | Setting 4 | −0.029(0.136) | 0.034(0.130) | −0.002(0.056) | 98.3% | 98.3% | 95.0% |
To further assess the performance of the proposed maximin OTR, we compare it with the estimated pooled OTR, d̂P(x) = I(xT β̂P > −ĉP), and the OTR based on random effects models, d̂R(x) = I(xT β̂R > −ĉR). Here, β̂P and ĉP are obtained from the pooled data by solving a single A-learning estimating equation. To obtain β̂R and ĉR, we first obtain β̂g and ĉg for each group by solving the A-learning estimating equations on that group's data. The covariance of these groupwise estimators is estimated by the sandwich estimator. Based on these estimators, we calculate β̂R and ĉR using the R package mvmeta. The between-group covariance matrix is estimated by the method of moments. For both scenarios, we use the following leave-one-group-out cross-validation procedure for evaluation. We first obtain the estimators β̂M, ĉM, β̂P, ĉP, β̂R and ĉR based on the pooled samples of any three groups. Then, we evaluate the PCD and the VD as defined in Section 3.1 under the obtained maximin OTR, pooled OTR and random effects OTR for the remaining testing group, using Monte Carlo simulations based on the true model for the testing group.
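The Monte Carlo evaluation step can be sketched as follows. The code assumes that the PCD is the proportion of covariate draws on which an estimated rule agrees with the true optimal rule of the testing group; Section 3.1 is not reproduced here, so this definition, the standard normal covariates, and the names `regime` and `monte_carlo_pcd` are illustrative assumptions.

```python
import math
import random


def regime(beta, c):
    """Linear decision rule d(x) = I(x'beta > -c)."""
    return lambda x: int(sum(b * xi for b, xi in zip(beta, x)) > -c)


def monte_carlo_pcd(d_hat, d_opt, dim, n_draws=100000, seed=0):
    """Share of simulated covariate vectors on which the two rules agree."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_draws):
        x = [rng.gauss(0, 1) for _ in range(dim)]  # spherical (standard normal) covariates
        agree += d_hat(x) == d_opt(x)
    return agree / n_draws


# Toy usage: the estimated direction is rotated by 45 degrees from the truth, with c = 0.
d_opt = regime((1.0, 0.0), 0.0)
d_hat = regime((math.cos(math.pi / 4), math.sin(math.pi / 4)), 0.0)
pcd = monte_carlo_pcd(d_hat, d_opt, dim=2)
print(pcd)
```

With spherical covariates and c = 0, the two rules disagree on two wedges of total angle 2θ, so the agreement rate is 1 − θ/π, which is 0.75 for θ = π/4. In a leave-one-group-out loop, `d_hat` would be built from estimators fitted on the training groups and `d_opt` from the true parameters of the held-out group.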
Tables 3 and 4 summarize the VD results for Scenario 1 and Scenario 2, respectively. The PCD results are given in Tables 21 and 22 in the supplementary article. The OTR obtained by random effects meta-analyses is close to the estimated pooled OTR in both scenarios. In Scenario 1, both the PCD and the VD under our maximin OTR are much higher than those under the other two OTRs for all the testing groups. Taking the PCD as an example, on average, the PCD under the maximin OTR is approximately 5% to 6% higher than those under the other OTRs. This demonstrates the advantages of the proposed maximin-projection learning when there is relatively large heterogeneity in optimal treatment decision-making across subgroups. In Scenario 2, since the groupwise optimal treatment regimes are “close” to each other in “angles”, the estimated OTRs do not differ much. From Table 4, it can be seen that our maximin OTR performs better than the other OTRs when the first group is taken as the testing group, while its performance is comparable to that of the other OTRs when any of the other groups is taken as the testing group.
Table 3.
VD results (with standard errors in parentheses) for Scenario 1 under the estimated maximin OTR d̂M, the pooled OTR d̂P and the OTR obtained by random effects meta-analyses d̂R.
| Setting | OTR | Testing group: first | Testing group: second | Testing group: third | Testing group: fourth |
|---|---|---|---|---|---|
| Setting 1 | d̂P | 0.407(0.002) | 0.606(0.001) | 0.632(0.002) | 0.368(0.002) |
| Setting 1 | d̂R | 0.408(0.001) | 0.608(0.001) | 0.633(0.001) | 0.367(0.001) |
| Setting 1 | d̂M | 0.486(0.001) | 0.690(0.001) | 0.723(0.001) | 0.458(0.001) |
| Setting 2 | d̂P | 0.406(0.002) | 0.606(0.002) | 0.630(0.002) | 0.366(0.002) |
| Setting 2 | d̂R | 0.407(0.001) | 0.608(0.001) | 0.633(0.001) | 0.366(0.001) |
| Setting 2 | d̂M | 0.483(0.002) | 0.689(0.001) | 0.719(0.001) | 0.452(0.002) |
| Setting 3 | d̂P | 0.407(0.003) | 0.604(0.002) | 0.630(0.002) | 0.367(0.003) |
| Setting 3 | d̂R | 0.405(0.002) | 0.606(0.001) | 0.632(0.001) | 0.367(0.002) |
| Setting 3 | d̂M | 0.483(0.002) | 0.688(0.001) | 0.723(0.001) | 0.454(0.002) |
| Setting 4 | d̂P | 0.406(0.003) | 0.602(0.003) | 0.628(0.003) | 0.365(0.003) |
| Setting 4 | d̂R | 0.406(0.002) | 0.606(0.001) | 0.632(0.001) | 0.366(0.002) |
| Setting 4 | d̂M | 0.473(0.003) | 0.686(0.002) | 0.716(0.001) | 0.439(0.004) |
Table 4.
VD results (with standard errors in parentheses) for Scenario 2 under the estimated maximin OTR d̂M, the pooled OTR d̂P and the OTR obtained by random effects meta-analyses d̂R.
| Setting | OTR | Testing group: first | Testing group: second | Testing group: third | Testing group: fourth |
|---|---|---|---|---|---|
| Setting 1 | d̂P | 0.803(<0.001) | 0.597(<0.001) | 0.865(<0.001) | 0.762(<0.001) |
| Setting 1 | d̂R | 0.803(<0.001) | 0.598(<0.001) | 0.865(<0.001) | 0.761(<0.001) |
| Setting 1 | d̂M | 0.847(<0.001) | 0.588(<0.001) | 0.865(<0.001) | 0.769(<0.001) |
| Setting 2 | d̂P | 0.802(0.001) | 0.597(<0.001) | 0.864(<0.001) | 0.761(<0.001) |
| Setting 2 | d̂R | 0.803(<0.001) | 0.598(<0.001) | 0.865(<0.001) | 0.762(<0.001) |
| Setting 2 | d̂M | 0.843(0.001) | 0.587(<0.001) | 0.863(<0.001) | 0.767(0.001) |
| Setting 3 | d̂P | 0.801(0.001) | 0.597(<0.001) | 0.863(<0.001) | 0.760(0.001) |
| Setting 3 | d̂R | 0.801(0.001) | 0.597(<0.001) | 0.864(<0.001) | 0.761(0.001) |
| Setting 3 | d̂M | 0.841(0.001) | 0.588(<0.001) | 0.861(0.001) | 0.765(0.001) |
| Setting 4 | d̂P | 0.799(0.001) | 0.595(<0.001) | 0.861(0.001) | 0.758(0.001) |
| Setting 4 | d̂R | 0.804(0.001) | 0.597(<0.001) | 0.863(<0.001) | 0.759(0.001) |
| Setting 4 | d̂M | 0.826(0.002) | 0.587(0.001) | 0.853(0.001) | 0.756(0.002) |
In Section C.2 in the supplementary article, we conduct some additional simulation experiments with non-normal covariates. Findings are similar to those with normal covariates.
Although our maximin estimators have better performance for treatment decision making in the above simulation examples, they can have larger variances compared with the random effects models. This is a potential disadvantage of our method.
6. Health assessment questionnaire (HAQ) progression data
The HAQ progression data come from an observational study investigating the influence of early disease-modifying antirheumatic drug (DMARD) treatment and its duration on patients with recent-onset inflammatory polyarthritis (Farragher et al., 2010). Early DMARD treatment was routinely used in the management of rheumatoid arthritis (RA). Among conventional DMARDs, methotrexate is the most widely used and is now considered a benchmark against which new treatments are compared. Previous studies showed that RA patients who have failed to respond to methotrexate may have clinically important improvements if treated with combination DMARDs, such as methotrexate-sulfasalazine-hydroxychloroquine, methotrexate-sulfasalazine-steroids or other methotrexate combinations (Boers et al., 1997). However, methotrexate combinations did not work for all RA patients, and they may not add benefit for some patients who were stable on DMARD monotherapy (Symmons et al., 2005). It is of clinical interest to develop individualized OTRs and to identify which patients will benefit from treatment with methotrexate combinations. The study sample includes 420 patients who were recruited from 1990 to 2000 and were treated with either methotrexate monotherapy or methotrexate combinations. Age, gender, duration of disease, HAQ score, number of swollen joints and number of tender joints were recorded at baseline. We standardize all six covariates such that their sample covariance matrix equals the identity matrix within each group. We compare methotrexate combinations (A = 1) with methotrexate monotherapy (A = 0). The difference in HAQ scores between baseline and year 5 is taken as the response. Here, we classify the 420 patients into three groups according to their recruitment time. Specifically, group 1 includes patients enrolled from 1990 to 1992; group 2 includes those enrolled from 1993 to 1996; and group 3 includes those enrolled from 1997 to 2000.
Sample sizes of the three groups are 265, 78 and 77, respectively.
In our analysis, we use the last two standardized covariates to fit the contrast function, since the regression coefficients of the other variables are not significant. Denote these two covariates by and , respectively. For each group g, we consider the following model
The parameters c0, βg1 and βg2 are estimated using the A-learning estimating equations discussed in Section 4.2. Here, a linear model is fitted for the baseline function and a logistic regression model is fitted for the propensity score. When fitting the propensity score model, all six covariates are included. Table 5 reports the group-wise estimators obtained using the A-learning estimating equations, suggesting that there is some heterogeneity in the optimal treatment regimes across the three groups.
Table 5.
Estimators of the groupwise OTRs (standard errors in parentheses) for the HAQ data.
| | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| β̂g1 | 0.05(0.11) | −0.40(0.17) | 0.07(0.21) |
| β̂g2 | 0.07(0.11) | 0.06(0.21) | 0.32(0.16) |
We use the same leave-one-group-out cross-validation procedure as in the simulations to evaluate the performance of the proposed method. We calculate the maximin OTR d̂M, the pooled OTR d̂P, and the OTR obtained by random effects meta-analyses d̂R based on every two groups of patients, and evaluate them on the remaining group based on the estimated value function. For a given treatment regime d and group g, the estimated value function is given by
which is computed based on the advantage function introduced in Murphy (2003). Results are given in Table 6. Values under the maximin OTR are uniformly better than those under the other OTRs across all three groups, with a large improvement for group 2. In addition, the estimators involved in the regimes d̂P and d̂R are very close.
Table 6.
d̂M, d̂P, d̂R and their value functions
| Testing group | Group 1 | Group 1 | Group 1 | Group 2 | Group 2 | Group 2 | Group 3 | Group 3 | Group 3 |
|---|---|---|---|---|---|---|---|---|---|
| OTR | d̂M | d̂P | d̂R | d̂M | d̂P | d̂R | d̂M | d̂P | d̂R |
| ĉ | −0.87 | −0.14 | −0.12 | −2.38 | −0.21 | −0.11 | −3.08 | −0.31 | −0.32 |
| β̂1 | −0.48 | −0.02 | −0.00 | 0.61 | 0.16 | 0.16 | −0.02 | 0.06 | −0.01 |
| β̂2 | 0.88 | 0.25 | 0.23 | 0.79 | 0.10 | 0.14 | 1.00 | 0.06 | 0.10 |
| Estimated value | −0.08 | −0.09 | −0.09 | −0.09 | −0.19 | −0.22 | −0.12 | −0.13 | −0.12 |
7. Discussion
In this paper, we propose a maximin-projection learning to aggregate OTRs for patients from different populations with heterogeneity. It has appealing statistical interpretations in the sense of maximizing the minimum PCD and the minimum value difference across subgroups. The corresponding estimation procedure is easy to implement via quadratically constrained linear programming, and the asymptotic properties of the resulting estimators are studied.
7.1. Alternative maximin formulation
Our procedure requires scaling the baseline covariates Xg to have mean zero and identity covariance matrix for g = 1, 2, …, G, G + 1. Let Xg,0 be the original variable prior to transformation, and let βg,0 and cg,0 be the corresponding individualized and marginal treatment effects, respectively. The proposed maximin OTR is constructed based on βM = argmax||β||2=1 ming∈{1,…,G} βT βg, or equivalently,
where Σg is the covariance matrix of Xg,0 for g = 1, …, G + 1.
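To make the definition βM = argmax||β||2=1 ming βT βg concrete, the following sketch finds the maximin direction by brute force in two dimensions. It is only an illustration: the paper computes the estimator by quadratically constrained linear programming, whereas this code searches a grid of unit vectors, and the function name `maximin_direction_2d` and the example vectors are hypothetical.

```python
import math


def maximin_direction_2d(betas, n_grid=20000):
    """Grid search over unit vectors for argmax_b min_g b'beta_g (two dimensions only)."""
    best_val, best_beta = -math.inf, None
    for k in range(n_grid):
        theta = 2 * math.pi * k / n_grid
        b = (math.cos(theta), math.sin(theta))
        val = min(b[0] * g[0] + b[1] * g[1] for g in betas)  # worst-case inner product
        if val > best_val:
            best_val, best_beta = val, b
    return best_beta, best_val


# Two groups with equal norms: the maximin direction bisects the angle between them.
beta_equal, val_equal = maximin_direction_2d([(1.0, 0.0), (0.0, 1.0)])

# Unequal norms: the direction tilts towards the group with the smaller norm.
beta_unequal, val_unequal = maximin_direction_2d([(2.0, 0.0), (0.0, 1.0)])

print(beta_equal, val_equal)
print(beta_unequal, val_unequal)
```

In the second example the two inner products are balanced where 2cos θ = sin θ, so the maximizer is (1, 2)/√5 rather than the bisector. This shows how the norms of the groupwise effects, and not only their directions, shape the maximin solution.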
As pointed out by one of the referees, we can also consider the maximin OTR based on βM** where
Assuming EX1,0 = EX2,0 = ⋯ = EXG,0 = EXG+1,0 = 0, c1,0 = c2,0 = ⋯ = cG,0 and XG+1 is spherically distributed, we can show
for any c > 0. This implies that βM** maximizes the minimum groupwise value difference function under the new distribution XG+1,0.
It is worthwhile to investigate the performance of the OTR based on βM**. However, this is beyond the scope of the current paper. Below, we briefly compare the proposed maximin OTR with the maximin OTR based on βM** and discuss their connections. First, βM** maximizes the minimum groupwise value difference function under the new distribution XG+1,0 while βM maximizes the minimum groupwise value difference function under the new distribution XG+1 after scaling. To see this, note that when c1 = ⋯ = cG and XG+1 is spherically distributed, we have
for any c > 0. Second, βM** usually does not coincide with βM*. A sufficient condition for βM* = βM** is that Σ1 = Σ2 = ⋯ = ΣG = ΣG+1. Lastly, the estimator of βM** might have smaller variance than that of βM*, since it does not require the estimation of Σ1, …, ΣG. However, the OTR based on βM** is not scale invariant. To see this, let for some invertible matrix C. The covariance matrix of XG+1,0 is equal to CΣG+1CT. Let
there is no guarantee that βM*** = (CT)−1βM**.
7.2. Extensions
In the current work, we mainly deal with heterogeneity caused by the groupwise individualized treatment effects βg’s, and assume the same marginal treatment effects cg’s for all groups. It is possible to extend the proposed maximin-projection learning to the case where the cg’s vary across groups as well. Specifically, consider
where β̂g and ĉg are subgroup estimators. Statistical properties of β̂M and ĉM can be similarly established. For example, β̂M and ĉM can be shown to converge almost surely to some and , respectively. However, the defined and can no longer preserve the interpretation of maximizing the minimum PCD and the minimum VD, due to the fact that the PCD and the VD are complicated functions of (βg, cg) and (β, c) when cg’s vary across groups. Consequently, the angle interpretation as demonstrated by the toy example given in Section 2.2 does not hold.
To establish the consistency and asymptotic normality of β̂M and ĉM, we require βg, g ∈ K0 to be linearly independent. In Section C.1 in the supplementary article, we conduct some additional simulation studies to examine our methods under settings where some of the βg’s are the same. Results suggest that β̂M and ĉM are still consistent to and , in these settings. We further evaluate the VD and the PCD under the estimated maximin OTR and compare them with those under the estimated pooled OTR. Findings are similar to those in Section 5.
In addition, in our current work, we assume a linear interaction between treatment and baseline covariates. It is interesting to consider a more general model as follows:
| (14) |
where Q is a strictly monotone increasing function with Q(0) = 0. The parameters βg in each group can be consistently estimated using the concordance-assisted learning method by Fan et al. (2017). The properties of the corresponding maximin-projection estimator warrant further investigation.
Supplementary Material
Acknowledgments
We thank the editor, the AE and three referees for providing helpful suggestions that significantly improved the quality of the paper. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.
A. Proof of Theorem 3.2
Before proving Theorem 3.2, we state the following lemma whose proof is given in Section F of the supplementary article.
Lemma A.1
Consider a set of vectors β1, …, βG of dimension s and function h(β, b, t) defined on the domain {(βT, bT, t) ∈ ℝs × ℝs × 𝒯: ||β||2 = 1}. Assume h(β, b, t) = g(βT b, t) for some function g(·, ·). Besides, assume for any fixed t, g(c, t) is monotonically increasing as a function of c. Then, for any random variable T defined on 𝒯, we have
Since Xg’s are identically distributed for g = 1, …, G, we omit the subscript g for brevity. We need to show βM maximizes
We first show that, for any β with ||β||2 = 1 and any c, the probability Pr(XTβ > c) is constant as a function of β. Since ||β||2 = 1, it follows from Lemma F.1 that there exists some orthogonal matrix U such that Uβ = e1 = (1, 0, …, 0)T. Hence
| (15) |
where the last equality is due to the definition of spherical distribution (see Definition F.1). By (15), it suffices to show βM maximizes
| (16) |
Let ρg = βT βg/||βg||2. Since X is spherically distributed, we have
| (17) |
for all βg, c and β such that ||β||2 = 1, where X(1) and X(2) are the first two components of the random vector X. It follows from Theorem 2.6 in Fang et al. (1990) that
| (18) |
with r = ||X||2, d ~ B(1, p/2 − 1), U1 and U2 uniformly distributed on the surface , where B(p, q) stands for the Beta distribution with parameters p, q. The random variables r, d are independent of U1 and U2. Set T = rd. Combining (17) with (18) gives
| (19) |
for any β and c such that ||β||2 = 1.
When c/t ≤ −1, we have I(U1 > −c/t) = I(U1 > 1) = 0 and hence h = 0. When c/t ≥ 1,
Obviously, in these two trivial cases, h is an increasing function of βT βg. Now we consider the case where c/t = cos(ψ1) for some ψ1 ∈ (0, π). Assume ρg = cos(ψ2) for some ψ2 ∈ (0, π). The function h can further be simplified to
This proves h is an increasing function of βT βg. Hence, (16) follows by an application of Lemma A.1.
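The rotation argument used in the first step of the proof can be spelled out as follows. This is a sketch reconstructed from the surrounding text, assuming only that U is orthogonal with Uβ = e1 and that UX has the same distribution as X when X is spherically distributed; it is not a verbatim copy of the displayed equation (15).

```latex
\begin{aligned}
\Pr(X^{T}\beta > c)
  &= \Pr\{(UX)^{T}(U\beta) > c\} && \text{since } U^{T}U = I \\
  &= \Pr\{(UX)^{T} e_{1} > c\}   && \text{since } U\beta = e_{1} \\
  &= \Pr(X^{T} e_{1} > c)        && \text{since } UX \overset{d}{=} X \\
  &= \Pr(X^{(1)} > c).
\end{aligned}
```

The final expression involves only the first component X(1) of X and c, so it does not depend on the unit vector β.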
Contributor Information
Chengchun Shi, North Carolina State University, Raleigh, USA.
Rui Song, North Carolina State University, Raleigh, USA.
Wenbin Lu, North Carolina State University, Raleigh, USA.
Bo Fu, Fudan University, Shanghai, People’s Republic of China.
References
- Avi-Itzhak H, Van Mieghem JA, Rub L, et al. Multiple subclass pattern recognition: a maximin correlation approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1995;17(4):418–431.
- Bühlmann P, Meinshausen N. Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE. 2016;104(1):126–135.
- Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19(3):317–343. doi: 10.1177/0962280209105013.
- Chen H, Manning AK, Dupuis J. A method of moments estimator for random effect multivariate meta-analysis. Biometrics. 2012;68(4):1278–1284. doi: 10.1111/j.1541-0420.2012.01761.x.
- DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled clinical trials. 1986;7(3):177–188. doi: 10.1016/0197-2456(86)90046-2.
- Fan C, Lu W, Song R, Zhou Y. Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2017;79(5):1565–1582. doi: 10.1111/rssb.12216.
- Fang KT, Kotz S, Ng KW. Symmetric multivariate and related distributions, Volume 36 of Monographs on Statistics and Applied Probability. Chapman and Hall, Ltd; London: 1990.
- Farragher TM, Lunt M, Fu B, Bunn D, Symmons DP. Early treatment with, and time receiving, first disease-modifying antirheumatic drug predicts long-term function in patients with inflammatory polyarthritis. Annals of the rheumatic diseases. 2010;69(4):689–695. doi: 10.1136/ard.2009.108639.
- Jackson D, White IR, Thompson SG. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Stat Med. 2010;29(12):1282–1297. doi: 10.1002/sim.3602.
- Lee T, Moon T, Kim SJ, Yoon S. Regularization and kernelization of the maximin correlation approach. IEEE Access. 2016;4:1385–1392.
- Meinshausen N, Bühlmann P. Maximin effects in inhomogeneous large-scale data. Ann Statist. 2015;43(4):1801–1830.
- Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65(2):331–366.
- Robins J, Hernan M, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiol. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
- Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. J Edu Psychol. 1974;66:688–701.
- Tarrier N, Lewis S, Haddock G, Bentall R, Drake R, Kinderman P, Kingdon D, Siddle R, Everitt J, Leadley K, et al. Cognitive-behavioural therapy in first-episode and early schizophrenia. The British Journal of Psychiatry. 2004;184(3):231–239. doi: 10.1192/bjp.184.3.231.
- Watkins C, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
- Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68(4):1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
- Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107(499):1106–1118. doi: 10.1080/01621459.2012.695674.