Maximin Projection Learning for Optimal Treatment Decision with Heterogeneous Individualized Treatment Effects

Chengchun Shi; Rui Song; Wenbin Lu; Bo Fu

doi:10.1111/rssb.12273

Author manuscript; available in PMC: 2019 Sep 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2018 May 10;80(4):681–702. doi: 10.1111/rssb.12273

PMCID: PMC6289536 NIHMSID: NIHMS955724 PMID: 30555269

Summary

A salient feature of data from clinical trials and medical studies is inhomogeneity. Patients differ not only in baseline characteristics, but also in the way they respond to treatment. Optimal individualized treatment regimes are developed to select effective treatments based on patients’ heterogeneity. However, the optimal treatment regime might also vary for patients across different subgroups. In this paper, we mainly consider patients’ heterogeneity caused by groupwise individualized treatment effects, assuming the same marginal treatment effects for all groups. We propose a new maximin-projection learning method for estimating a single treatment decision rule that works reliably for a group of future patients from a possibly new subpopulation. Based on estimated optimal treatment regimes for all subgroups, the proposed maximin treatment regime is obtained by solving a quadratically constrained linear programming (QCLP) problem, which can be efficiently computed by interior-point methods. Consistency and asymptotic normality of the estimator are established. Numerical examples show the reliability of the proposed methodology.

Keywords: Heterogeneity, Maximin-projection learning, Optimal treatment regime, Quadratically constrained linear programming

1. Introduction

Data from clinical trials and medical studies are often characterized by some degree of inhomogeneity. Patients differ not only in baseline characteristics, but also in the way they respond to treatment. There has been increasing interest in developing individualized optimal treatment regimes (OTRs) to account for patients’ heterogeneity in response to treatment and to achieve the best treatment effect for individual patients. Some common methods for estimating OTRs include Q-learning (Watkins and Dayan, 1992; Chakraborty et al., 2010), A-learning (Robins et al., 2000; Murphy, 2003) and value search methods, which directly search for OTRs by maximizing the estimated value function (Zhang et al., 2012; Zhao et al., 2012). However, the OTRs may also vary for patients from different subpopulations. This is typically the case in meta-analysis, where we combine the results of multiple studies conducted at different locations or times. One motivating example is from a multicenter randomised controlled trial studied in Tarrier et al. (2004). The goal is to examine the effectiveness of cognitive-behavioural therapy for patients with early schizophrenia. Patients can be classified into three groups according to their treatment centres (Manchester, Liverpool and North Nottinghamshire). As we can see in Section D of the supplementary article, the group-wise OTRs can vary across different centres. Another example is from an observational study investigating the influence of early disease modifying antirheumatic drug (DMARD) treatment on patients with recent onset inflammatory polyarthritis (Farragher et al., 2010). Patients can be classified into three groups according to their enrollment time. As studied in Section 6, the group-wise OTRs can vary across different enrollment periods. The heterogeneity in OTRs may be explained by differences in the characteristics of the treatment setting across subgroups. For instance, in the schizophrenia example, the strength of therapeutic alliance between therapist and patient, the adherence to treatment protocols and the quality of treatment provided can vary from one treatment centre to another (Dunn and Bentall, 2007); in the inflammatory polyarthritis example, there was more use of hydroxychloroquine for the methotrexate combination strategy in recruitment time group 3 (1997–2000) than in group 1 (1990–1992) or group 2 (1993–1996), as hydroxychloroquine was increasingly used in the UK before anti-tumour necrosis factor therapy was introduced to treat rheumatoid arthritis in 2001. Moreover, these characteristics are often unobserved or only partially observed, and they may explain the interaction between subgroups and OTRs.

The aim of this paper is to propose a reliable OTR for new patients based on the observed data from different groups with heterogeneity in optimal treatment decision. The group of new patients may differ from any of the currently observed groups in terms of optimal treatment decision. For example, compared with existing data, the group of new patients, who come from a new treatment centre, may have a different OTR because of a different strength of therapeutic alliance or a different quality of treatment provided in the new treatment centre. Therefore, the true OTR for the group of new patients is not estimable at all based on the observed data, and none of the group-wise OTRs is guaranteed to be the best choice. The challenge becomes how to derive a meaningful and reliable treatment regime that can take into account the heterogeneity in optimal treatment decision for different groups of patients. One simple approach is to pool the data of different groups together and obtain the “pooled” OTR based on the pooled data. Another method is to first obtain the OTR for each group, and then aggregate the group-wise OTRs in certain ways. Random effects meta-analysis (DerSimonian and Laird, 1986) is commonly used to combine subject-specific studies. Using its multivariate extensions (cf. Jackson et al., 2010; Chen et al., 2012), we can aggregate the groupwise OTRs based on random effects models. The resulting OTR is similar to the “pooled” OTR when we have large numbers of subgroup patients. These OTRs may be reasonable choices when the OTRs for different groups do not vary much. However, when there is a certain degree of heterogeneity in OTRs across different groups, as demonstrated in the toy example given in the next section, these OTRs can be uniformly worse than the proposed OTR for any of the groups. One possible reason is that the OTRs for different groups may assign the same patient to different treatments, and thus their effects are averaged out when pooling the data from different groups.

Bühlmann and Meinshausen (2016) and Meinshausen and Bühlmann (2015) considered a maximin criterion which has a nice characterization in linear models and proposed to use maximin aggregation (magging) to obtain the maximin estimator. Their proposed estimator is shown to be more robust than the pooled estimator in linear regression. The key idea of the maximin criterion is to find an estimator that works the best under the worst-case scenario. In optimal treatment decision, the percentage of making the correct decision (PCD) and the value function are two commonly used measures to evaluate the effectiveness of a treatment regime. A natural maximin criterion for optimal treatment decision is to find an OTR that maximizes the minimum PCD or the minimum value function over all groups. Such a maximin OTR is appealing due to its nice interpretation and robustness. However, it is hard to implement in practice for the following reasons. First, the PCD of a treatment regime is generally not estimable from data since the true OTR is unknown. Second, the empirical estimator of the value function as studied in Zhang et al. (2012) is non-smooth and non-concave, thus the estimation of the associated maximin OTR is not feasible.

In this paper, we propose a novel maximin-projection learning (MPL) method to aggregate linear OTRs across different groups. Specifically, the proposed maximin-projection learning finds a linear decision rule that maximizes the minimum “inner product” between the vector of regression parameters in the linear rule and those of the group-wise linear OTRs. We show that under certain model assumptions, the OTR obtained by the maximin-projection learning maximizes the minimum percentage of making the correct decision and the minimum value function across groups, i.e., it achieves the desired maximin properties. In addition, the corresponding estimation procedure can be represented as a linear programming problem with a quadratic constraint (Lee et al., 2016), which can be efficiently solved in O(Gs² + s³) flops. Here G denotes the number of groups and s the dimension of baseline covariates. Consistency and the asymptotic distribution of the corresponding maximin-projection estimators are established. Asymptotic results of this kind are rarely studied in the literature. To derive such asymptotic properties, we establish a necessary and sufficient condition for the existence and uniqueness of the population maximin-projection parameters and obtain a closed-form expression for the resulting estimator.

The rest of the paper is organized as follows. We introduce the model, notation and assumptions in Section 2. We also provide a heuristic comparison between the maximin OTR, the pooled OTR and the OTR based on random effects models with a toy example. In Section 3, we formally introduce the proposed maximin-projection learning, including its statistical interpretation and geometrical characterization. Section 4 presents the estimating procedure of the maximin-projection estimator and the associated asymptotic properties. Simulation studies to evaluate the empirical performance of the proposed maximin OTR are conducted in Section 5. We apply our method to a real example in Section 6, followed by a concluding section. The proof of Theorem 3.2 is provided in Section A. Other proofs and additional numerical studies are given in the supplementary article.

2. Preliminaries and a toy example

2.1. Preliminaries

For simplicity, we consider a single stage study with two treatments. Let Y denote a patient’s response of interest, the larger the better by convention, A ∈ 𝒜 = {0, 1} the treatment received by the patient and X the associated s-dimensional vector of baseline covariates. In addition, let Y^*(0) and Y^*(1) denote the potential outcomes that a patient would get if he or she was given treatment 0 and 1, respectively. A treatment regime d is a deterministic function that maps a patient’s covariates to {0, 1}. Define the potential outcome

Y^{*} (d) = Y^{*} (1) d (X) + Y^{*} (0) {1 - d (X)},

representing the response that a patient would get if treated according to the regime d. The optimal treatment regime is defined as the regime d^opt that maximizes E{Y^*(d)}. Under the stable unit treatment value assumption (SUTVA) and no unmeasured confounders assumption (Rubin, 1974), the optimal treatment regime can be written as d^opt(x) = I{C(x) > 0} where

C (x) = E (Y ∣ A = 1, X = x) - E (Y ∣ A = 0, X = x) .

The function C(·) is referred to as the contrast function. In practice, for simplicity, we may assume the contrast takes a linear form, i.e., C(x) = β^T x + c. To take population heterogeneity into account, we assume that the contrast function varies for patients from different groups. Specifically, we assume there are G groups of patients and consider the following semiparametric model:

Y_{g} = h_{g} (X_{g}) + A_{g} (β_{g}^{T} X_{g} + c_{g}) + e_{g}, g = 1, \dots, G .

(1)

where E(e_g|X_g, A_g) = 0. In Model (1), Y_g, A_g and X_g ∈ ℝ^s stand for the response, the treatment and the covariates of patients in Group g, respectively, and h_g denotes the unspecified baseline function in Group g. Without loss of generality, we further assume all covariates X_g are standardized to have zero mean and identity covariance matrix. Otherwise, we consider the variable transformation $X_{g}^{*} = Σ_{g}^{- 1 / 2} (X_{g} - μ_{g}), β_{g}^{*} = Σ_{g}^{1 / 2} β_{g}, c_{g}^{*} = c_{g} + μ_{g}^{T} β_{g}$ where μ_g = E(X_g) and Σ_g = cov(X_g). Then Model (1) can be represented as $Y_{g} = h_{g}^{*} (X_{g}^{*}) + A_{g} ({β_{g}^{*}}^{T} X_{g}^{*} + c_{g}^{*}) + e_{g}$ , for some function $h_{g}^{*}$ . The parameter c_g stands for the marginal treatment effects (average causal effects) after adjusting for covariates. Mathematically, we have

c_{g} = E (Y_{g} ∣ A_{g} = 1) - E (Y_{g} ∣ A_{g} = 0) = E {Y_{g}^{*} (1)} - E {Y_{g}^{*} (0)} .

When c_g > 0, treatment 1 is generally better for patients in Group g. The vector β_g describes individualized treatment effects. For patients in Group g with covariates x, the larger $β_{g}^{T} x$ , the more benefits he or she receives if assigned to treatment 1.
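
The standardization used with Model (1) can be checked by direct substitution. The short derivation below is a sketch that uses only the quantities defined above, under the assumption that $Σ_{g}^{1 / 2}$ denotes the symmetric square root of Σ_g, so that $X_{g} = Σ_{g}^{1 / 2} X_{g}^{*} + μ_{g}$ :

```latex
\begin{aligned}
β_{g}^{T} X_{g} + c_{g}
  &= β_{g}^{T} (Σ_{g}^{1/2} X_{g}^{*} + μ_{g}) + c_{g} \\
  &= (Σ_{g}^{1/2} β_{g})^{T} X_{g}^{*} + (c_{g} + μ_{g}^{T} β_{g}) \\
  &= {β_{g}^{*}}^{T} X_{g}^{*} + c_{g}^{*} .
\end{aligned}
```

The baseline term transforms in the same way, with $h_{g}^{*} (x^{*}) = h_{g} (Σ_{g}^{1 / 2} x^{*} + μ_{g})$ , so the contrast function of the transformed model remains linear in the standardized covariates.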

Define π_g(x) = Pr(A_g = 1|X_g = x) as the propensity score in Group g. Model (1) allows h_g and π_g to vary across groups, which we refer to as baseline effect heterogeneity and treatment assignment heterogeneity, respectively. These sources of heterogeneity are not related to treatment decisions since they do not appear in the contrast function. The following sources of groupwise heterogeneity will affect decision making: the marginal treatment effects c_g and the individualized treatment effects β_g. In this paper, we mainly focus on heterogeneity caused by different β_g’s. We assume c₁ = ··· = c_G = c₀ for some c₀, that is, the same marginal treatment effect for all groups.

To introduce the pooled and the maximin optimal treatment regime, we need some optimality criterion. Here, we consider the difference of patient’s mean response (value function) between a regime d(x) = I(β^Tx > −c) and d₀(x) = 0, which assigns all patients to treatment 0. Specifically, the difference of value functions is defined as

{VD}_{g} (β, c) = E {Y_{g}^{*} (d)} - E {Y_{g}^{*} (d_{0})} = E {(X_{g}^{T} β_{g} + c_{0}) I (X_{g}^{T} β > - c)} .

In this section, for illustrative purposes only, we consider a special case with c₀ = c = 0. A general discussion will be given in the next section. When the distributions of X_g’s are the same across groups, we can represent VD_g(β, 0) as

VD (β, β_{g}) = E {(X_{g}^{T} β_{g}) I (X_{g}^{T} β > 0)} .

We assume the same number of patients across all groups. Then, the pooled optimal treatment regime is defined as $d_{P}^{opt} (x) = I (x^{T} β^{P} > 0)$ where

β^{P} = arg max_{{‖ β ‖}_{2} = 1} \frac{1}{G} \sum_{g = 1}^{G} VD (β, β_{g}),

(2)

and the maximin optimal treatment regime is defined as $d_{M}^{opt} (x) = I (x^{T} β^{M} > 0)$ where

β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} VD (β, β_{g}) .

(3)

We add the L₂ constraint on β to make β^P and β^M identifiable. Therefore, the pooled optimal treatment regime aims to maximize the average value difference while the maximin optimal treatment regime aims to maximize the minimum value difference in G groups, i.e. maximize the reward of the worst-case scenario.

The random effects meta-analyses assume the following model for β_g’s:

β_{g} = β_{0} + ε_{g},

where ε_g’s are independent and satisfy E(ε_g) = 0, cov(ε_g) = Ω₀ for all g. For any subgroup estimators β̂₁, …, β̂_G with cov(β̂_g) = Ω_g, the aggregated estimator is given by

{\hat{β}}^{R} = {(\sum_{g = 1}^{G} {({\hat{Ω}}_{g} + {\hat{Ω}}_{0})}^{- 1})}^{- 1} (\sum_{g = 1}^{G} {({\hat{Ω}}_{g} + {\hat{Ω}}_{0})}^{- 1} {\hat{β}}_{g}),

where Ω̂_g’s and Ω̂₀ denote some estimators for Ω_g’s and Ω₀. Given sufficiently many observations, we have ${‖ {\hat{β}}_{g} - β_{g} ‖}_{2} \overset{P}{\to} 0$ and ${‖ {\hat{Ω}}_{g} ‖}_{2} \overset{P}{\to} 0$ . As a result, we have

{\hat{β}}^{R} \overset{P}{\to} {(\sum_{g = 1}^{G} {({\hat{Ω}}_{0})}^{- 1})}^{- 1} (\sum_{g = 1}^{G} {({\hat{Ω}}_{0})}^{- 1} β_{g}) = \frac{1}{G} \sum_{g} β_{g} \equiv β^{R} .

(4)

The corresponding optimal treatment regime is defined as $d_{R}^{opt} (x) = I (x^{T} β^{R} > 0)$ .
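
The aggregation step and its limit in (4) can be illustrated numerically. The following is a minimal sketch, assuming the subgroup estimates and covariance estimates are already available as NumPy arrays; the function name `random_effects_aggregate` and the example inputs are hypothetical and not part of the original method description.

```python
import numpy as np

def random_effects_aggregate(beta_hats, omega_hats, omega0_hat):
    """Inverse-variance weighted average of subgroup estimates.

    beta_hats  : list of length-s arrays (subgroup estimates)
    omega_hats : list of s x s arrays (estimated covariances of the estimates)
    omega0_hat : s x s array (estimated between-group covariance)
    """
    s = beta_hats[0].shape[0]
    weight_sum = np.zeros((s, s))
    weighted_beta = np.zeros(s)
    for beta_g, omega_g in zip(beta_hats, omega_hats):
        weight_g = np.linalg.inv(omega_g + omega0_hat)
        weight_sum += weight_g
        weighted_beta += weight_g @ beta_g
    # Solve weight_sum @ beta_R = weighted_beta instead of forming the inverse.
    return np.linalg.solve(weight_sum, weighted_beta)

# Illustrative inputs: three groups, s = 2, within-group covariances set to zero
# to mimic the large-sample limit in which the estimates have no sampling error.
beta_hats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
omega_hats = [np.zeros((2, 2)) for _ in beta_hats]
omega0_hat = np.array([[2.0, 0.5], [0.5, 1.0]])
beta_R = random_effects_aggregate(beta_hats, omega_hats, omega0_hat)
```

With vanishing within-group covariances every group receives the same weight, so `beta_R` coincides with the simple average of the subgroup parameters, in line with (4).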

More generally, we can treat the parameters β_g in the group-specific contrast function as a multivariate random variable and assume that the parameters β_g’s of training groups are generated according to some distribution F_b, either continuous or discrete, and let H_b denote the support of F_b. Then, we define β^R, β^P and β^M as

β^{R} = E_{train, b} (b), β^{P} = arg max_{{‖ β ‖}_{2} = 1} E_{train, b} {VD (β, b)}, β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{b \in H_{b}} VD (β, b),

where the expectation E_{train,b} is taken with respect to F_b. Definitions in (2), (3) and (4) correspond to the special case where F_b only takes values in {β₁, …, β_G} with an equal probability. Our objective is to maximize E_{test,b}{VD(β, b)}, where E_{test,b} is taken with respect to G_b, the distribution of β_g for future groups of patients.

2.2. A toy example

Recall that s is the dimension of X_g. For illustration, we take s = 2, and assume that patients’ baseline covariates are generated independently from a standard normal distribution. Since ||β||₂ = 1, after some calculation, we have

VD (β, β_{g}) = E {X_{g}^{T} β_{g} I (X_{g}^{T} β > 0)} = E {(X_{g}^{T} β_{g} - β_{g}^{T} β X_{g}^{T} β + β_{g}^{T} β X_{g}^{T} β) I (X_{g}^{T} β > 0)} = β_{g}^{T} β E {X_{g}^{T} β I (X_{g}^{T} β > 0)} = β_{g}^{T} β \frac{1}{\sqrt{2 π}} .

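The third equality is due to the independence between $X_{g}^{T} β_{g} - β_{g}^{T} β X_{g}^{T} β$ and $X_{g}^{T} β$ , together with the fact that the former has mean zero. The last equality holds since $X_{g}^{T} β$ follows a standard normal distribution when ||β||₂ = 1. Hence, we obtain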
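
The closed form for VD(β, β_g) can be checked by simulation. The sketch below is illustrative only: the sample size, the seed and the two parameter vectors are arbitrary choices, and β_g is deliberately given a norm different from one to show that only β needs to be normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400_000, 2))  # rows are draws of X_g ~ N(0, I_2)

beta = np.array([np.cos(0.3), np.sin(0.3)])          # unit-norm decision direction
beta_g = 1.5 * np.array([np.cos(1.1), np.sin(1.1)])  # group parameter, norm 1.5

# Monte Carlo estimate of E{ X^T beta_g * I(X^T beta > 0) }
vd_monte_carlo = np.mean((X @ beta_g) * (X @ beta > 0))

# Closed form derived above: beta_g^T beta / sqrt(2 pi)
vd_closed_form = beta_g @ beta / np.sqrt(2.0 * np.pi)
```

The two quantities should agree up to Monte Carlo error, which is of order n^{-1/2} for n simulated covariate vectors.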

β^{P} = arg max_{{‖ β ‖}_{2} = 1} \frac{1}{G} \sum_{g = 1}^{G} β^{T} β_{g} = \frac{\sum_{g} β_{g}}{{‖ \sum_{g} β_{g} ‖}_{2}}, β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} β^{T} β_{g} .

(5)

Therefore, β^P is proportional to β^R which equals a simple average of all subgroup parameters, while β^M maximizes its minimum inner product across different β_g’s. When all β_g’s have the same L₂ norm, β^M becomes

β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} \frac{β^{T} β_{g}}{{‖ β_{g} ‖}_{2}},

(6)

or equivalently

β^{M} = arg min_{{‖ β ‖}_{2} = 1} max_{g \in {1, \dots, G}} ∠ (β, β_{g}),

(7)

where ∠(a, b) = arccos{a^T b/(||a||₂||b||₂)} stands for the angle between two vectors. The equivalence between (6) and (7) is due to the monotonicity of the arccos function. In (6) or (7), β^M is defined to maximize (minimize) the minimum correlation (maximum angle) between β and all subgroup coefficients. Such a formulation is referred to as the maximin correlation approach in the classification literature (cf. Avi-Itzhak et al., 1995; Lee et al., 2016). In general, we weight the correlation β^T β_g/||β_g||₂ by the L₂ norm of β_g. The β^M defined in (5) is more informative since it takes into consideration not only the heterogeneity due to different directions β_g/||β_g||₂, but also that due to different magnitudes ||β_g||₂.

Since β^P is proportional to β^R, the VD under $d_{P}^{opt}$ is the same as $d_{R}^{opt}$ . Therefore, in the following, we focus on comparing β^P with β^M. We set G = 4 and assume ||β_g||₂ = 1, g = 1, 2, 3, 4. Since s = 2, we represent each β_g as β_g = {cos(ψ_g), sin(ψ_g)} with ψ_g ∈ [0, π). The parameter ψ_g is the angle between β_g and the x-axis in a 2-dimensional coordinate system. In this special case, β^M lies on the bisector of the largest angles formed by all β_g’s and it can be shown that β^M = {cos(ψ^M), sin(ψ^M)} where

ψ^{M} = \frac{1}{2} (ψ_{(1)} + ψ_{(4)}),

where ψ₍₁₎ and ψ₍₄₎ denote the smallest and largest of the angles ψ_g. Similarly define β^P = {cos(ψ^P), sin(ψ^P)}. We set ψ₁ = 0°, ψ₂ = 15°, ψ₃ = 70° and ψ₄ = 90°. Consider the following leave-one-group-out cross validation procedure. In the ith round, we choose the ith group as the testing group, and obtain β^P and β^M based on the remaining 3 groups. Then we evaluate the value difference of the pooled and maximin OTRs on the ith group. In other words, we set F_b to be a discrete distribution that takes values in {β₁, …, β₄}∖{β_i} with equal probability, and G_b a degenerate distribution concentrated on β_i. Table 1 summarizes the results.

Table 1.

Different combinations of training groups and the corresponding ψ^P, ψ^M, and their value differences on the testing group

Training groups	ψ^M	ψ^P	ψ_test	VD(β^M, β_test)	VD(β^P, β_test)
(1, 2, 3)	35°	27.44°	90°	0.23	0.18
(1, 2, 4)	45°	32.63°	70°	0.36	0.32
(1, 3, 4)	45°	55.32°	15°	0.35	0.30
(2, 3, 4)	52.5°	59.25°	0°	0.24	0.20

From Table 1, we can see that in all four cases, the value differences of the maximin optimal treatment regime are larger than those of the pooled optimal treatment regime on the testing groups. To illustrate the idea graphically, we plot β^P (denoted by the snow symbol), β^M (denoted by the circle symbol), and β_g of the training (denoted by the square symbol) and testing (denoted by the plus symbol) groups for the second and third cases in Figure 1, where the left panel is for the second case and the right one is for the third case. In both cases, β^M is closer to β_g of the testing group, while β^P is pulled towards the area where most β_g’s of the training groups are located, due to the averaging effect.
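
The entries of Table 1 can be recomputed from the closed forms in (5) and the bisector formula for ψ^M. The following sketch assumes unit-norm β_g with angles in [0°, 90°], as in the toy example; the helper names `unit_vector`, `value_difference` and `toy_example_row` are hypothetical.

```python
import numpy as np

def unit_vector(angle_deg):
    """Unit vector {cos(psi), sin(psi)} for an angle given in degrees."""
    angle = np.deg2rad(angle_deg)
    return np.array([np.cos(angle), np.sin(angle)])

def value_difference(beta, beta_g):
    """Closed form VD(beta, beta_g) = beta_g^T beta / sqrt(2 pi)."""
    return float(beta_g @ beta) / np.sqrt(2.0 * np.pi)

def toy_example_row(train_angles, test_angle):
    """Return (psi_M, psi_P, VD of maximin rule, VD of pooled rule)."""
    # Maximin direction: bisector of the smallest and largest training angles.
    psi_m = 0.5 * (min(train_angles) + max(train_angles))
    # Pooled direction: proportional to the sum of the training parameters.
    pooled_sum = sum(unit_vector(a) for a in train_angles)
    psi_p = float(np.rad2deg(np.arctan2(pooled_sum[1], pooled_sum[0])))
    beta_test = unit_vector(test_angle)
    return (psi_m, psi_p,
            value_difference(unit_vector(psi_m), beta_test),
            value_difference(unit_vector(psi_p), beta_test))

angles = [0.0, 15.0, 70.0, 90.0]
# Hold out group 4, 3, 2, 1 in turn, matching the row order of Table 1.
rows = [toy_example_row([a for i, a in enumerate(angles) if i != held_out],
                        angles[held_out])
        for held_out in (3, 2, 1, 0)]

for psi_m, psi_p, vd_m, vd_p in rows:
    print(f"{psi_m:5.1f}  {psi_p:6.2f}  {vd_m:.2f}  {vd_p:.2f}")
```

Each printed line corresponds to one row of Table 1, in the order ψ^M, ψ^P, VD(β^M, β_test), VD(β^P, β_test).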

Fig. 1 — Plots of *β^P* (denoted by the snow symbol), *β^M* (denoted by the circle symbol), and *β_g* of the training (denoted by the square symbol) and testing groups (denoted by the plus symbol) for the second (left panel) and third (right panel) cases.

3. Maximin-projection learning

We now formally introduce our maximin projection treatment regime. Based on model (1) and the common marginal treatment effect assumption, the optimal treatment regime for the gth subgroup is $d_{g}^{opt} (x) = I {x^{T} β_{g} > - c_{0}}$ . Here, our goal is to find a single treatment regime $d_{M}^{opt} (x) = I (x^{T} β^{M} > - c^{M})$ with ||β^M||₂ = 1 that performs uniformly well for heterogeneous data. Motivated by the toy example in the previous section, our proposed maximin-projection learning aims to find

β^{M} = arg max_{β : {‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} β^{T} β_{g} .

3.1. Statistical interpretation

In this subsection, we show that the maximin projection, represented by β^M, has two nice statistical interpretations in terms of maximizing the minimum PCD and value difference (VD). Specifically, in group g, the PCD of a treatment regime d(x) = I(x^Tβ > −c) is defined as

{PCD}_{g} (β, c) = 1 - E {∣ I (X_{g}^{T} β > - c) - I (X_{g}^{T} β_{g} > - c_{0}) ∣},

and the VD is defined as

{VD}_{g} (β, c) = E [Y_{g}^{*} {I (X_{g}^{T} β > - c)}] - E {Y_{g}^{*} (0)} = E {(X_{g}^{T} β_{g} + c_{0}) I (X_{g}^{T} β > - c)} .

Here, the larger PCD and VD values, the better the treatment regime d(x) approximates the groupwise optimal treatment regime $d_{g}^{opt} (x)$ .
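
Both criteria are expectations over the covariate distribution, so for known parameters they can be approximated by simulation. The sketch below is illustrative only: it assumes standard normal covariates, and the function name `pcd_and_vd` and the chosen parameter values are hypothetical.

```python
import numpy as np

def pcd_and_vd(beta, c, beta_g, c0, X):
    """Monte Carlo approximations of PCD_g(beta, c) and VD_g(beta, c).

    X holds simulated covariates of Group g, one patient per row.
    """
    recommended = X @ beta > -c      # decisions of the regime d
    optimal = X @ beta_g > -c0       # decisions of the groupwise optimal regime
    pcd = 1.0 - np.mean(recommended != optimal)
    vd = np.mean((X @ beta_g + c0) * recommended)
    return pcd, vd

rng = np.random.default_rng(1)
X = rng.standard_normal((200_000, 3))

beta_g = np.array([1.0, 0.0, 0.0])
theta = np.pi / 4  # angle between beta and beta_g
beta = np.array([np.cos(theta), np.sin(theta), 0.0])

pcd, vd = pcd_and_vd(beta, 0.0, beta_g, 0.0, X)
pcd_same, vd_same = pcd_and_vd(beta_g, 0.0, beta_g, 0.0, X)
```

For standard normal covariates and c = c₀ = 0, the two rules disagree with probability θ/π, so `pcd` should be close to 0.75 here, while `vd` should be close to cos(θ)/√(2π). When the regime coincides with the groupwise optimal regime, the PCD equals one.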

Based on the defined PCD and VD, for any fixed constant c, we consider the following maximin treatment regimes: $d_{1} (x) = I (x^{T} β_{(1)}^{M} > - c)$ where

β_{(1)}^{M} = arg max_{β : {‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} {PCD}_{g} (β, c),

(8)

and $d_{2} (x) = I (x^{T} β_{(2)}^{M} > - c)$ where

β_{(2)}^{M} = arg max_{β : {‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} {VD}_{g} (β, c) .

(9)

Remark 3.1

The two maximin treatment regimes, defined by $β_{(1)}^{M}$ and $β_{(2)}^{M}$ , are appealing for their nice statistical interpretations. However, the definition of $β_{(1)}^{M}$ involves unknown parameters through the PCD, and the empirical estimator of VD is a non-smooth and non-concave function of β. Therefore, direct estimation of $β_{(1)}^{M}$ and $β_{(2)}^{M}$ is not feasible, and these regimes may not be practically useful.

Remark 3.2

It is worth noting that $β_{(1)}^{M}$ would be meaningless when not all ||β_g||₂’s are the same. This is because PCD only measures the similarity between the overall and groupwise optimal treatment decisions, but does not account for the magnitude of the groupwise contrast function. When ||β_g||₂’s are not the same, the L₂ norm of the groupwise contrast function ${[E {(X_{g}^{T} β_{g} + c_{0})}^{2}]}^{1 / 2}$ would be different. This implies that PCDs are not comparable across different groups. In comparison, VD is a better criterion since it takes both the sign and magnitude of the contrast function into consideration. Below, under some conditions, we establish the equivalence between these two maximin treatment regimes and our proposed maximin-projection treatment regime.

Theorem 3.1 (Equivalence of β^M and $β_{(1)}^{M}$ )

Assume that X_g’s are i.i.d. spherically distributed, and all ||β_g||₂’s are the same. Then, for any fixed c,

β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} {PCD}_{g} (β, c) .

Theorem 3.2 (Equivalence of β^M and $β_{(2)}^{M}$ )

Assume X_g’s are i.i.d. spherically distributed. Then, for any fixed c,

β^{M} = arg max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} {VD}_{g} (β, c) .

Remark 3.3

Theorems 3.1 and 3.2 require X_g to have a spherical distribution (see Definition F.1), which is a rich class of symmetric multivariate distributions (see Fang et al., 1990).

The definition of β^M has nice statistical interpretations. However, it has two drawbacks. First, when F₀ ≡ max_{‖β‖₂=1} min_g β^T β_g < 0, the uniqueness of β^M is not guaranteed. This may cause identifiability issues when we establish properties of the corresponding estimators. In addition, the optimization problem in (5) is not a concave maximization problem, since the feasible set ||β||₂ = 1 is not convex. This can make the implementation of the estimating procedure infeasible.

To address these concerns, we define

β_{(0)}^{M} = arg max_{{‖ β ‖}_{2} \leq 1} min_{g \in {1, \dots, G}} β^{T} β_{g} .

(10)

Compared to β^M, it replaces the feasible set ||β||₂ = 1 with the closed convex set ||β||₂ ≤ 1. Lemma 3.1 below states that $β_{(0)}^{M}$ is well defined when F₀ ≠ 0. Moreover, the optimization problem (10) maximizes a concave function over a convex set, so it can be solved easily.

Lemma 3.1

The maximin-projection estimator $β_{(0)}^{M}$ always exists. Moreover, when F₀ ≠ 0, $β_{(0)}^{M}$ is unique.

Remark 3.4

The existence of $β_{(0)}^{M}$ is guaranteed by the continuity of the objective function F(β) = min_{g∈{1,…,G}} β^T β_g, together with the boundedness and closedness of the feasible set {β : ||β||₂ ≤ 1}. Its uniqueness is a byproduct of Lemma 3.3, which is stated in the next subsection. When F₀ = 0, $β_{(0)}^{M}$ is not unique and the set of solutions is given by

{a β : a \in [0, 1], {‖ β ‖}_{2} = 1, max_{{‖ β ‖}_{2} = 1} min_{g \in {1, \dots, G}} β^{T} β_{g} = 0} .

The problem of estimating $β_{(0)}^{M}$ then becomes non-regular and all the large sample theories about the maximin estimator fail (see Section 4).

Define G₀ = max_{‖β‖₂≤1} min_g β^T β_g. It is obvious that G₀ ≥ 0, since β = 0 is feasible. In addition, G₀ > 0 if and only if F₀ > 0. When G₀ = 0, we can set $β_{(0)}^{M} = 0$ , which leads to a trivial regime that assigns the same treatment to all patients. From now on, we focus on the situation when G₀ > 0. In this case, we have $β^{M} = β_{(0)}^{M}$ . Define

c_{(0)}^{M} = c_{0} / G_{0} .

Note that $c_{(0)}^{M}$ and c₀ are sign equivalent. Our maximin-projection OTR is given by

d_{M}^{opt} (x) = I (x^{T} β_{(0)}^{M} > - c_{(0)}^{M}) .

Theorem 3.3

Under conditions of Theorem 3.1, if G₀ > 0, we have

c_{(0)}^{M} = arg max_{c} min_{g \in {1, \dots, G}} {PCD}_{g} (β_{(0)}^{M}, c) .

Theorem 3.4

Under conditions of Theorem 3.2, if G₀ > 0, we have

c_{(0)}^{M} = arg max_{c} min_{g \in {1, \dots, G}} {VD}_{g} (β_{(0)}^{M}, c) .

Together with Theorems 3.1 and 3.2, Theorems 3.3 and 3.4 suggest that the treatment regime $d_{M}^{opt} (x)$ maximizes the minimum PCD and the minimum VD among different groups.

3.2. Geometrical characterization

In this subsection we give a geometrical view of $β_{(0)}^{M}$ when G₀ > 0. Findings in this subsection are similar in rationale to the results in Avi-Itzhak et al. (1995). However, we generalize their results by removing the unit L₂-norm condition ||β_g||₂ = 1 and allowing the set of vectors {β₁, …, β_G} to be linearly dependent, which is necessarily the case when s < G.

We first introduce some notation. For an arbitrary s × G matrix Ψ and a set K ⊆ [1, …, G], let Ψ_K denote the submatrix of Ψ formed by the columns in K. Define the equicorrelated points set

E_{K} (Ψ) = {t \in ℝ^{s} ∣ t^{T} Ψ_{j} = t^{T} Ψ_{i}, \forall i, j \in K},

and the optimal equicorrelated point

E_{K}^{★} (Ψ) = arg max_{\begin{matrix} t \in E_{K} (Ψ) \\ {‖ t ‖}_{2} = 1 \end{matrix}} {t^{T} Ψ_{i}, \forall i \in K},

where Ψ_i refers to the ith column vector of matrix Ψ. When |K| = 1 and Ψ_K = ψ, $E_{K}^{★} (Ψ) = ψ / {‖ ψ ‖}_{2}$ . Readers can refer to Section B of the supplementary article for a detailed discussion on the equicorrelated points set and the optimal equicorrelated point.

For any matrix Ω, let Ω⁺ denote the Moore-Penrose inverse of Ω and C(Ω) the column space of Ω. Let e denote a vector of ones. We have the following result.

Lemma 3.2

For any Ψ and K ⊆ [1, …, G], when $e \in C (Ψ_{K}^{T})$ , the optimal equicorrelated point of Ψ_K exists and is unique. Moreover, it takes the form

E_{K}^{★} (Ψ) = {[e^{T} {(Ψ_{K}^{T} Ψ_{K})}^{+} e]}^{- 1 / 2} Ψ_{K} {(Ψ_{K}^{T} Ψ_{K})}^{+} e .

(11)

Define matrix B = (β₁, β₂, …, β_G) whose gth column is the subgroup parameter β_g.

Lemma 3.3

Assume G₀ > 0. Then there exists a unique nonempty set K₀ ⊆ [1, …, G] such that $β_{(0)}^{M} = E_{K_{0}}^{★} (B)$ and ${min}_{g \in K_{0}^{c}} {β_{(0)}^{M}}^{T} β_{g} > G_{0}$ , where $K_{0}^{c} = [1, \dots, G] - K_{0}$ . Moreover, if the set of vectors β_g, g ∈ K₀ are linearly independent, then a necessary and sufficient condition for $β_{(0)}^{M} = E_{K_{0}}^{★} (B)$ is that each element in the vector ${(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ is nonnegative.

We refer to K₀ as the maximin optimal equicorrelated points set when G₀ > 0. In Lemma 3.2, the condition $e \in C (Ψ_{K}^{T})$ automatically holds when $Ψ_{K}^{T}$ has full row rank. In Lemma 3.3, we assume the set of vectors β_g, g ∈ K₀ are linearly independent. This implies the matrix $B_{K_{0}}^{T}$ has full row rank. As a result, we have $e \in C (B_{K_{0}}^{T})$ .

In Lemma 3.3, the non-negativity of ${(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ is sufficient and necessary for $β_{(0)}^{M} = E_{K_{0}}^{★} (B)$ . Together with Lemma 3.2, Lemma 3.3 implies that $β_{(0)}^{M}$ is uniquely defined by

β_{(0)}^{M} = E_{K_{0}}^{★} (B) = {[e^{T} {(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e]}^{- 1 / 2} B_{K_{0}} {(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e .

This implies $E_{K_{0}}^{★} (B)$ is proportional to $B_{K_{0}} {(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ and can be represented as a linear combination of the column vectors in $B_{K_{0}}$ . Geometrically, the non-negativity of ${(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ requires $E_{K_{0}}^{★} (B)$ to lie in the convex cone of β_g, g ∈ K₀, i.e., $\{\sum_{g \in K_{0}} a_{g} β_{g} : a_{g} \geq 0, \forall g \in K_{0}\}$ . To better understand Lemma 3.3, in Figure 2, we take s = 3, G = 3 and B = (β₁, β₂, β₃) where β₁ = (1, 1, 0), β₂ = (1, −1, 0) and β₃ = (1.2, 0, 0.5). Both $E_{{1, 2, 3}}^{★} (B)$ and $E_{{1, 2}}^{★} (B)$ satisfy the necessary conditions of Lemma 3.3. While $E_{{1, 2}}^{★} (B)$ lies in the convex cone of β₁ and β₂, $E_{{1, 2, 3}}^{★} (B)$ lies outside the convex cone of β₁, β₂ and β₃. Therefore, $E_{{1, 2}}^{★} (B)$ satisfies the sufficient conditions of Lemma 3.3 and $E_{{1, 2, 3}}^{★} (B)$ does not. As a result, we have $β_{(0)}^{M} = E_{{1, 2}}^{★} (B)$ .

Fig. 2 — Plots of *β_g* (denoted by the square symbol), $E_{{1, 2, 3}}^{★} (B)$ (denoted by the snow symbol) and $E_{{1, 2}}^{★} (B)$ (denoted by the circle symbol)
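
The example with β₁ = (1, 1, 0), β₂ = (1, −1, 0) and β₃ = (1.2, 0, 0.5) can be reproduced numerically from the closed form (11). The sketch below is illustrative; the function name `optimal_equicorrelated_point` is hypothetical, column indices are zero-based, and the pseudo-inverse is computed with NumPy's `pinv`.

```python
import numpy as np

def optimal_equicorrelated_point(B, K):
    """Evaluate formula (11) for the columns of B indexed by K.

    Returns the unit-norm point and the weight vector (B_K^T B_K)^+ e.
    Assumes e lies in the column space of B_K^T, as required by Lemma 3.2.
    """
    B_K = B[:, K]
    ones = np.ones(len(K))
    weights = np.linalg.pinv(B_K.T @ B_K) @ ones
    point = B_K @ weights / np.sqrt(ones @ weights)
    return point, weights

# Columns are beta_1, beta_2, beta_3 from the example in the text.
B = np.array([[1.0,  1.0, 1.2],
              [1.0, -1.0, 0.0],
              [0.0,  0.0, 0.5]])

point_12, weights_12 = optimal_equicorrelated_point(B, [0, 1])
point_123, weights_123 = optimal_equicorrelated_point(B, [0, 1, 2])

print(point_12, weights_12, B.T @ point_12)
print(point_123, weights_123, B.T @ point_123)
```

For K = {1, 2} the weights are nonnegative and the resulting point has a strictly larger inner product with β₃ than with β₁ and β₂, whereas for K = {1, 2, 3} the weight attached to β₃ is negative, which is the situation described above.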

4. Estimation procedure

The data are summarized as (Y_gj, A_gj, X_gj), for g = 1, …, G, j = 1, …, m_g, where m_g is the number of patients in Group g. We assume that the data are independent across g = 1, …, G and j = 1, …, m_g. Based on the data, parameters β₁, …, β_G and c₀ in model (1) can be estimated with existing methods. In this paper, we use the popular Q-learning and A-learning methods and give a brief discussion on estimating these parameters in Section 4.2. Let β̂₁, …, β̂_G and ĉ₀ be the corresponding estimators. We propose to estimate $β_{(0)}^{M}$ by solving the following optimization problem:

{\hat{β}}^{M} = arg max_{β : {‖ β ‖}_{2} \leq 1} min_{g \in {1, \dots, G}} β^{T} {\hat{β}}_{g} .

(12)

Note that the objective function min_g β^Tβ̂_g is concave in β and the region ||β||₂ ≤ 1 is convex. Therefore, (12) is a tractable convex optimization problem. It can be further cast as a quadratically constrained linear programming (QCLP) problem. Specifically, β̂^M is the β-component of the solution of

\begin{array}{l} maximize & t \in ℝ \\ subject to & β^{T} {\hat{β}}_{g} \geq t, g = 1, \dots, G \\ β^{T} β \leq 1. \end{array}

The above optimization problem can be solved efficiently using existing software. Define ĉ^M = ĉ₀/Ĝ₀, where ${\hat{G}}_{0} = \min_{g} {\hat{β}}_{g}^{T} {\hat{β}}^{M}$ .
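As an illustration only, the maximin problem (12) can also be approached through its dual: by minimax duality, the maximum over the unit ball of min_g β^T β̂_g equals the smallest Euclidean norm attained on the convex hull of the β̂_g, and the maximizer is that minimum-norm point rescaled to unit length. The sketch below uses a Frank-Wolfe iteration with exact line search on this dual, not the interior-point QCLP solver mentioned in the text. The function name `maximin_direction` and its arguments are hypothetical, and the iteration is only an approximation when it stops at the iteration cap.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maximin_direction(betas, iters=10000, tol=1e-12):
    """Approximate argmax_{||b||_2 <= 1} min_g b^T beta_g.

    Works on the dual problem: find the minimum-norm point x of the
    convex hull of the beta_g, then return x / ||x||.
    Returns (beta_M, G0) with G0 = min_g beta_g^T beta_M.
    """
    x = list(betas[0])                        # start at one vertex of the hull
    for _ in range(iters):
        # hull vertex with the smallest inner product with the current point
        v = min(betas, key=lambda b: dot(x, b))
        gap = dot(x, x) - dot(x, v)           # Frank-Wolfe duality gap
        if gap <= tol:
            break
        d = [vi - xi for vi, xi in zip(v, x)]
        step = min(1.0, gap / dot(d, d))      # exact line search on the segment [x, v]
        x = [xi + step * di for xi, di in zip(x, d)]
    norm = math.sqrt(dot(x, x))
    if norm < 1e-8:                           # hull contains the origin: maximin value is 0
        return [0.0] * len(x), 0.0
    beta_m = [xi / norm for xi in x]
    return beta_m, min(dot(b, beta_m) for b in betas)

# Vectors of the Figure 2 example
beta_m, g0 = maximin_direction([(1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (1.2, 0.0, 0.5)])
print(beta_m, g0)
```

Given a common intercept estimate `c0_hat` (a hypothetical variable name), the scaled intercept would then be `c0_hat / g0`, matching the definition ĉ^M = ĉ₀/Ĝ₀ when Ĝ₀ > 0.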

Given a group of future patients, denote their baseline covariates by $\{X_{G + 1, j}\}_{j = 1}^{n}$. We calculate ${\hat{μ}}_{G + 1} = \sum_{j = 1}^{n} X_{G + 1, j} / n$ and ${\hat{Σ}}_{G + 1} = \sum_{j = 1}^{n} (X_{G + 1, j} - {\hat{μ}}_{G + 1}) {(X_{G + 1, j} - {\hat{μ}}_{G + 1})}^{T} / (n - 1)$ . The recommended treatment for the jth patient is given by

I \{ {(X_{G + 1, j} - {\hat{μ}}_{G + 1})}^{T} {\hat{Σ}}_{G + 1}^{- 1 / 2} {\hat{β}}^{M} > - {\hat{c}}^{M} \} .
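The rule above can be sketched in a few lines for the special case of two covariates, where the symmetric inverse square root of a 2×2 covariance matrix has a closed form. The helper names `inv_sqrt_2x2` and `recommend` are hypothetical, and the restriction to two covariates is an assumption made only to keep the example dependency-free; a general implementation would use an eigendecomposition.

```python
import math

def inv_sqrt_2x2(S):
    """Symmetric inverse square root of a 2x2 positive definite matrix S."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    s = math.sqrt(a * d - b * b)              # sqrt(det S)
    t = math.sqrt(a + d + 2.0 * s)            # sqrt(trace S + 2 sqrt(det S))
    # sqrt(S) = (S + s I) / t; its entries:
    r11, r12, r22 = (a + s) / t, b / t, (d + s) / t
    det_r = r11 * r22 - r12 * r12
    # invert the 2x2 square root
    return [[r22 / det_r, -r12 / det_r], [-r12 / det_r, r11 / det_r]]

def recommend(X_new, beta_m, c_m):
    """Return 0/1 recommendations for future patients with two covariates."""
    n = len(X_new)
    mu = [sum(x[k] for x in X_new) / n for k in range(2)]
    S = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X_new) / (n - 1)
          for j in range(2)] for i in range(2)]
    W = inv_sqrt_2x2(S)
    # W is symmetric, so (x - mu)^T W beta = (x - mu)^T (W beta)
    wb = [W[i][0] * beta_m[0] + W[i][1] * beta_m[1] for i in range(2)]
    return [1 if (x[0] - mu[0]) * wb[0] + (x[1] - mu[1]) * wb[1] > -c_m else 0
            for x in X_new]
```

The standardization uses the future patients' own sample mean and covariance, so no outcome data from the new subpopulation are needed.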

4.1. Statistical properties

In this subsection we investigate the asymptotic properties of the maximin-projection estimator β̂^M obtained by solving the optimization problem (12). We first study the consistency of the estimator by assuming the following two conditions.

(C1.)
Assume that β̂₁, …, β̂_G and ĉ₀ converge in probability to β₁, …, β_G and c₀, respectively.
(C2.)
Assume that F₀ ≠ 0. When F₀ > 0, assume that the column vectors in $B_{K_{0}}$ are linearly independent and all elements in the vector ${(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ are nonzero, where K₀ is the maximin optimal equicorrelated points set as defined previously.

Remark 4.1

Condition (C1) requires each subgroup estimator to be consistent. The condition F₀ ≠ 0 in (C2) ensures the existence and uniqueness of $β_{(0)}^{M}$ . Apparently, $β_{(0)}^{M}$ is not stable when F₀ approaches 0, since its L₂ norm will change from 1 to 0. To ensure the stability of $β_{(0)}^{M}$ in the sense that it will not deviate too much when there are minor changes in the set of vectors β₁, …, β_G, we would expect

{‖ {[{\tilde{B}}_{K_{0}}^{T} {\tilde{B}}_{K_{0}}]}^{+} - {[B_{K_{0}}^{T} B_{K_{0}}]}^{+} ‖}_{2} \to 0,

(13)

as ${\tilde{B}}_{K_{0}} \to B_{K_{0}}$, where $\tilde{B} = ({\tilde{β}}_{1}, \dots, {\tilde{β}}_{G})$ represents the coefficient matrix with some disturbance. A sufficient condition to establish (13) is that $B_{K_{0}}$ is of full column rank, as assumed in Condition (C2). Lemma 3.2 suggests $β_{(0)}^{M}$ can be represented as $ω_{0}^{T} B_{K_{0}}$ , for some weight vector ω₀ proportional to ${(B_{K_{0}}^{T} B_{K_{0}})}^{- 1} e$ . Condition (C2) further assumes the weights are nonzero. Such a condition guarantees that for any coefficient matrix $\tilde{B} \to B$, K₀ is the optimal equicorrelated points set of $\tilde{B}$ as well.

Theorem 4.1 (Consistency)

Define B̂ = (β̂₁, …, β̂_G). Assume Conditions C1 and C2 are satisfied. Then with probability tending to 1, the estimator β̂^M is equal to

\begin{cases} \{e^{T} {({\hat{B}}_{K_{0}}^{T} {\hat{B}}_{K_{0}})}^{- 1} e\}^{- 1 / 2} {\hat{B}}_{K_{0}} {({\hat{B}}_{K_{0}}^{T} {\hat{B}}_{K_{0}})}^{- 1} e & \text{if } F_{0} > 0, \\ 0 & \text{if } F_{0} < 0. \end{cases}

In addition, assume ${max}_{g \in K_{0}} {‖ {\hat{β}}_{g} - β_{g} ‖}_{2} = O_{p} (r_{n}^{(1)})$ and ${\hat{c}}_{0} = c_{0} + O_{p} (r_{n}^{(2)})$ for some $r_{n}^{(1)}, r_{n}^{(2)} \to 0$ . When F₀ > 0, we have ${‖ {\hat{β}}^{M} - β_{(0)}^{M} ‖}_{2} = O_{p} (r_{n}^{(1)}), {\hat{c}}^{M} = c_{(0)}^{M} + O_{p} (r_{n}^{(1)} + r_{n}^{(2)})$ .
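To make the closed form in Theorem 4.1 concrete, the following sketch evaluates $\{e^{T} (B_{K}^{T} B_{K})^{- 1} e\}^{- 1 / 2} B_{K} (B_{K}^{T} B_{K})^{- 1} e$ for a given index set K, using the vectors of the Figure 2 example. The function names `solve` and `equicorrelated_point` are hypothetical, and the code assumes that the Gram matrix is invertible and that $e^{T} (B_{K}^{T} B_{K})^{- 1} e > 0$.

```python
import math

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z

def equicorrelated_point(cols):
    """Return (point, weights) for the matrix whose columns are `cols`.

    weights = (B^T B)^{-1} e, point = B weights / sqrt(e^T weights).
    """
    k = len(cols)
    gram = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    w = solve(gram, [1.0] * k)                 # (B^T B)^{-1} e
    scale = 1.0 / math.sqrt(sum(w))            # {e^T (B^T B)^{-1} e}^{-1/2}
    dim = len(cols[0])
    point = [scale * sum(w[g] * cols[g][d] for g in range(k)) for d in range(dim)]
    return point, w

b1, b2, b3 = (1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (1.2, 0.0, 0.5)
print(equicorrelated_point([b1, b2]))        # candidate for K = {1, 2}
print(equicorrelated_point([b1, b2, b3]))    # candidate for K = {1, 2, 3}
```

By construction the returned point has unit norm and the same inner product with every column, which is the equicorrelation property. For K = {1, 2, 3} the third weight is negative, consistent with the statement that this candidate falls outside the convex cone.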

Remark 4.2

Theorem 4.1 implies that (β̂^M, ĉ^M) is consistent as long as each subgroup estimator is consistent. The first part of the theorem follows as a consequence of Lemma 3.3.

Next, we study the asymptotic normality of the estimator. For notational simplicity, we assume m₁ = ⋯= m_G = m and posit the following condition.

(C3.)
Assume that for all g ∈ K₀, $\sqrt{m} ({\hat{β}}_{g} - β_{g})$ and $\sqrt{m} ({\hat{c}}_{0} - c_{0})$ are jointly asymptotically normal with mean zero.

Theorem 4.2 (Asymptotic normality)

Assume that Conditions C1–C3 hold, and that F₀ > 0. We have that $\sqrt{m} ({\hat{β}}^{M} - β_{(0)}^{M})$ and $\sqrt{m} ({\hat{c}}^{M} - c_{(0)}^{M})$ are jointly asymptotically normal with mean zero and some covariance matrix V^M. The expression of V^M is given in Appendix C.

Since the expression of the asymptotic covariance matrix V^M is quite complicated, we propose to estimate it using a bootstrap method. Here, the bootstrap sampling is done within each subgroup. Specifically, we independently generate B bootstrap samples for each group g = 1, …, G,

\{(Y_{g 1}^{(j)}, A_{g 1}^{(j)}, X_{g 1}^{(j)}), \dots, (Y_{g m}^{(j)}, A_{g m}^{(j)}, X_{g m}^{(j)})\},

j = 1, …, B. For each j, we obtain estimators ${\hat{β}}^{(j)}$ and ${\hat{c}}^{(j)}$ based on the data

\{(Y_{11}^{(j)}, A_{11}^{(j)}, X_{11}^{(j)}), \dots, (Y_{1 m}^{(j)}, A_{1 m}^{(j)}, X_{1 m}^{(j)})\}, \dots, \{(Y_{G 1}^{(j)}, A_{G 1}^{(j)}, X_{G 1}^{(j)}), \dots, (Y_{G m}^{(j)}, A_{G m}^{(j)}, X_{G m}^{(j)})\} .

Confidence intervals for $β_{(0)}^{M}$ and $c_{(0)}^{M}$ are calculated based on quantiles of $({\hat{β}}^{(1)}, \dots, {\hat{β}}^{(B)})$ and $({\hat{c}}^{(1)}, \dots, {\hat{c}}^{(B)})$.
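A minimal sketch of this within-group bootstrap is given below. The callable `estimator` is a hypothetical placeholder for the full pipeline (subgroup estimation followed by the maximin step) and is assumed to return a list of numbers; the quantile rule is a simple order-statistic choice, not necessarily the one used by the authors.

```python
import random

def group_bootstrap(groups, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile intervals from a bootstrap that resamples within each group.

    groups    : list of lists; groups[g] holds the records (Y, A, X) of group g
    estimator : hypothetical callable mapping a list of groups to a list of floats
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # resample with replacement inside each group, keeping group sizes fixed
        resampled = [[rng.choice(grp) for _ in grp] for grp in groups]
        draws.append(estimator(resampled))
    lo_q, hi_q = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    intervals = []
    for k in range(len(draws[0])):
        vals = sorted(d[k] for d in draws)
        lo = vals[int(lo_q * (n_boot - 1))]
        hi = vals[int(round(hi_q * (n_boot - 1)))]
        intervals.append((lo, hi))
    return intervals
```

Resampling separately inside each group preserves the group sizes, so each bootstrap replicate has the same design as the original data.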

4.2. Estimation of group-specific regimes

In this subsection we discuss two popular approaches to obtain subgroup estimators β̂_g and ĉ₀.

Example 1 (Q-learning)

We estimate β_g and c₀ by modeling the Q-functions, which represent the conditional mean of the response given the covariates and the treatment. Specifically, the baseline function is assumed to have some parametric form h_g(x, η_g) with parameter η_g. Then,

Q_{g} (X_{g}, A_{g}; β_{g}, c_{0}, η_{g}) \equiv E (Y_{g} ∣ A_{g}, X_{g}) = h_{g} (X_{g}, η_{g}) + A_{g} (X_{g}^{T} β_{g} + c_{0}), g = 1, \dots, G .

Since c₀ is common across all subgroups, we propose to estimate β₁, …, β_G and c₀ by jointly solving the following set of estimating equations:

\begin{array}{l} \sum_{j} \frac{\partial h_{g} (X_{g j}, η_{g})}{\partial η_{g}} \{Y_{g j} - Q_{g} (X_{g j}, A_{g j}; β_{g}, c_{0}, η_{g})\} = 0, \quad g = 1, \dots, G, \\ \sum_{j} A_{g j} X_{g j} \{Y_{g j} - Q_{g} (X_{g j}, A_{g j}; β_{g}, c_{0}, η_{g})\} = 0, \quad g = 1, \dots, G, \\ \sum_{g} \sum_{j} A_{g j} \{Y_{g j} - Q_{g} (X_{g j}, A_{g j}; β_{g}, c_{0}, η_{g})\} = 0. \end{array}

When the parametric models h_g(x, η_g)’s are correctly specified, the resulting estimators β̂_g’s and ĉ₀ are consistent and jointly asymptotically normal.
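When the baseline is taken to be linear, h_g(x, η_g) = η_g0 + η_g^T x, the estimating equations above are exactly the normal equations of a single least-squares fit with group-specific columns (1, x, A·x) and one shared column A for c₀. The sketch below illustrates this special case; the linear baseline, the function names `solve` and `q_learning`, and the data layout are assumptions made for the example.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z

def q_learning(groups):
    """groups[g] is a list of (y, a, x) records, x being a tuple of covariates.

    Returns (betas, c0): one coefficient vector per group and the shared intercept.
    """
    G, p = len(groups), len(groups[0][0][2])
    width = 2 * p + 1                       # per-group block: 1, x, a * x
    dim = G * width + 1                     # plus one shared column for c0
    XtX = [[0.0] * dim for _ in range(dim)]
    Xty = [0.0] * dim
    for g, data in enumerate(groups):
        off = g * width
        for y, a, x in data:
            row = [0.0] * dim
            row[off] = 1.0
            for k in range(p):
                row[off + 1 + k] = x[k]
                row[off + 1 + p + k] = a * x[k]
            row[-1] = float(a)
            for i in range(dim):
                if row[i] != 0.0:
                    Xty[i] += row[i] * y
                    for j in range(dim):
                        XtX[i][j] += row[i] * row[j]
    theta = solve(XtX, Xty)                 # least-squares solution of the normal equations
    betas = [theta[g * width + 1 + p:(g + 1) * width] for g in range(G)]
    return betas, theta[-1]
```

The design matrix must have full column rank, which requires both treatment arms and sufficient covariate variation in every group.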

Example 2 (A-learning)

Here, we posit some parametric model π_g(X, α_g) for the propensity score and h_g(X, η_g) for the baseline function. The parameters α_g’s, η_g’s, β_g’s and c₀ are estimated by solving the following set of estimating equations:

\begin{array}{l} \sum_{j} \frac{1}{π_{g} (X_{g j}, α_{g}) \{1 - π_{g} (X_{g j}, α_{g})\}} \frac{\partial π_{g} (X_{g j}, α_{g})}{\partial α_{g}} \{A_{g j} - π_{g} (X_{g j}, α_{g})\} = 0, \quad g = 1, \dots, G, \\ \sum_{j} \frac{\partial h_{g} (X_{g j}, η_{g})}{\partial η_{g}} \{Y_{g j} - h_{g} (X_{g j}, η_{g}) - A_{g j} (X_{g j}^{T} β_{g} + c_{0})\} = 0, \quad g = 1, \dots, G, \\ \sum_{j} X_{g j} \{A_{g j} - π_{g} (X_{g j}, α_{g})\} \{Y_{g j} - h_{g} (X_{g j}, η_{g}) - A_{g j} (X_{g j}^{T} β_{g} + c_{0})\} = 0, \quad g = 1, \dots, G, \\ \sum_{g} \sum_{j} \{A_{g j} - π_{g} (X_{g j}, α_{g})\} \{Y_{g j} - h_{g} (X_{g j}, η_{g}) - A_{g j} (X_{g j}^{T} β_{g} + c_{0})\} = 0. \end{array}

It can be shown that when either the propensity score or the baseline function for each group is correctly specified, the resulting estimators β̂_g’s and ĉ₀ are consistent and jointly asymptotically normal. This is the so-called doubly robust property of the A-learning estimation.

5. Simulation studies

We consider four groups of patients. In each group, we generate 200 samples according to the following model

Y_{g j} = h (X_{g j}) + A_{g j} X_{g j}^{T} β_{g} + ε_{g j},

where $X_{g j} = {(X_{g j}^{(1)}, X_{g j}^{(2)})}^{T} \overset{iid}{\sim} N (0, I_{2})$ and $ε_{g j} \overset{iid}{\sim} N (0, 0.25)$ . Two baseline models are considered for h, including a linear model $h (X_{g j}) = 1 + 0.5 X_{g j}^{(1)} + 0.5 X_{g j}^{(2)}$ and a nonlinear model $h (X_{g j}) = 1 + sin (0.5 π X_{g j}^{(1)} + 0.5 π X_{g j}^{(2)})$ . We generate treatments from two propensity score models, a constant model, Pr(A_gj = 1) = 0.5 and a probit model, $Pr (A_{g j} = 1 ∣ X_{g j}) = Φ (X_{g j}^{(1)} - X_{g j}^{(2)})$ , where Φ(·) is the standard normal cumulative distribution function. This yields four simulation settings.
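The data-generating mechanism can be sketched as follows. The function names and argument labels are hypothetical, and N(0, 0.25) is read here as a normal distribution with variance 0.25 (standard deviation 0.5), which is an assumption about the notation.

```python
import math
import random

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_group(beta, m=200, baseline="linear", propensity="constant", rng=None):
    """Draw m records (y, a, (x1, x2)) for one group with coefficient vector beta."""
    rng = rng or random.Random(0)
    data = []
    for _ in range(m):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if baseline == "linear":
            h = 1.0 + 0.5 * x1 + 0.5 * x2
        else:                                    # nonlinear baseline
            h = 1.0 + math.sin(0.5 * math.pi * x1 + 0.5 * math.pi * x2)
        p = 0.5 if propensity == "constant" else std_normal_cdf(x1 - x2)
        a = 1 if rng.random() < p else 0
        eps = rng.gauss(0.0, 0.5)                # variance 0.25 (assumed reading)
        y = h + a * (x1 * beta[0] + x2 * beta[1]) + eps
        data.append((y, a, (x1, x2)))
    return data
```

Crossing the two baseline options with the two propensity options gives the four simulation settings.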

We further consider two scenarios for the subgroup parameters to exhibit different degrees of heterogeneity. In the first scenario, we set $β_{1}^{T} = (2, 0), β_{2}^{T} = (2 cos (15 °), 2 sin (15 °)), β_{3}^{T} = (2 cos (70 °), 2 sin (70 °)), β_{4}^{T} = (0, 2)$ . Hence, all β_g’s have the same L₂ norm and their directions β_g/||β_g||₂ differ. For the second scenario, we choose subgroup parameters to have similar directions but allow their L₂ norms to vary. Specifically, $β_{1}^{T} = (2.2 cos (30 °), 2.2 sin (30 °)), β_{2}^{T} = (1.5 cos (45 °), 1.5 sin (45 °)), β_{3}^{T} = (2.2 cos (54 °), 2.2 sin (54 °)), β_{4}^{T} = (2 cos (60 °), 2 sin (60 °))$ . It can be shown that $β_{(0)}^{M} = (cos (45 °), sin (45 °))$ and $c_{(0)}^{M} = 0$ for both scenarios.
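The claim that the maximin direction is (cos 45°, sin 45°) in both scenarios can be checked with a short duality certificate. For any unit vector β and any weights w on the simplex, min_g β^T β_g ≤ β^T Σ_g w_g β_g ≤ ||Σ_g w_g β_g||₂, so a pair (β, w) attaining equality certifies that β is optimal. The sketch below assumes β₃ in Scenario 1 has norm 2 and direction 70°, in line with the statement that all β_g share the same L₂ norm; the helper names and the chosen weights are illustrative.

```python
import math

def polar(r, angle_deg):
    t = math.radians(angle_deg)
    return (r * math.cos(t), r * math.sin(t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def certify(betas, weights, tol=1e-9):
    """Return the certified maximin direction, or raise if the certificate fails."""
    x = [sum(w * b[k] for w, b in zip(weights, betas)) for k in range(2)]
    norm = math.sqrt(dot(x, x))
    beta = [xi / norm for xi in x]
    worst = min(dot(beta, b) for b in betas)
    if abs(worst - norm) > tol:               # equality in weak duality is required
        raise ValueError("weights do not certify optimality")
    return beta

scenario1 = [polar(2.0, 0), polar(2.0, 15), polar(2.0, 70), polar(2.0, 90)]
scenario2 = [polar(2.2, 30), polar(1.5, 45), polar(2.2, 54), polar(2.0, 60)]

print(certify(scenario1, [0.5, 0.0, 0.0, 0.5]))   # midpoint of the two extreme vectors
print(certify(scenario2, [0.0, 1.0, 0.0, 0.0]))   # the shortest vector alone
```

In Scenario 1 the certificate puts equal weight on the two extreme directions, while in Scenario 2 all weight falls on β₂, the shortest vector.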

We first obtain the subgroup estimators of β_g and c₀ using the A-learning estimating equations discussed in Section 4.2. Here, a logistic regression model is fitted for the propensity score and a linear model for the baseline function. As a result, both the propensity score model and the baseline model are correctly specified in the first setting; exactly one of them is misspecified in the second and third settings; and both are misspecified in the last setting. We then obtain the estimators β̂^M and ĉ^M using the proposed maximin-projection learning. Confidence intervals for the resulting estimators are obtained based on 600 bootstrap samples.

For each setting, we conduct 600 simulations. The biases and standard deviations (SD) of β̂^M and ĉ^M, and the coverage probabilities (CP) of 95% Wald-type confidence intervals for $β_{(0)}^{M}$ and $c_{(0)}^{M}$ are reported in Table 2. In all scenarios, the proposed estimators achieve the smallest biases and standard deviations in Setting 1, where the baseline function and the propensity score are both correctly specified. In Settings 2 and 3, the proposed estimators are nearly unbiased, showing the doubly robust property of the subgroup estimators obtained using the A-learning estimating equations. In Setting 4, where the baseline function and the propensity score are both misspecified, biases and standard deviations of the estimators tend to be larger; however, the biases are still reasonably small. In addition, the coverage probabilities of 95% Wald-type confidence intervals are close to the nominal level for all cases.

Table 2.

Biases, standard deviations (in parentheses) of β̂^M, ĉ^M and coverage probabilities (CP) of 95% Wald-type confidence intervals for $β_{(0)}^{M}$ and $c_{(0)}^{M}$ .

Scenario 1	β̂₁^M	β̂₂^M	ĉ^M	CP for β̂₁^M	CP for β̂₂^M	CP for ĉ^M
Setting 1	−0.002(0.027)	0.001(0.027)	0.0003(0.024)	96.0%	95.3%
Setting 2	−0.003(0.053)	−0.001(0.052)	0.001(0.045)	94.7%	93.8%
Setting 3	−0.003(0.036)	0.001(0.035)	−0.0005(0.035)	96.2%	94.5%
Setting 4	−0.003(0.068)	−0.004(0.068)	0.002(0.068)	96.0%	95.0%

Scenario 2	β̂₁^M	β̂₂^M	ĉ^M	CP for β̂₁^M	CP for β̂₂^M	CP for ĉ^M
Setting 1	−0.002(0.036)	0.0002(0.036)	0.0002(0.023)	95.5%	95.3%
Setting 2	−0.009(0.061)	0.003(0.060)	−0.001(0.043)	96.0%	93.8%
Setting 3	−0.010(0.091)	−0.002(0.089)	−0.001(0.033)	93.7%	94.5%
Setting 4	−0.029(0.136)	0.034(0.130)	−0.002(0.056)	98.3%	95.0%


To further assess the performance of the proposed maximin OTR, we compare it with the estimated pooled OTR, d̂^P (x) = I(x^T β̂^P > − ĉ^P), and the OTR based on random effects models, d̂^R(x) = I(x^T β̂^R > − ĉ^R). Here, β̂^P and ĉ^P are obtained based on pooled data by solving a single A-learning estimating equation. To obtain β̂^R and ĉ^R, we first obtain β̂_g, ĉ_g by solving A-learning estimating equations, based on $\{X_{g j}, A_{g j}, Y_{g j}\}_{j = 1}^{m}$ . The covariance of ${({\hat{β}}_{g}^{T}, {\hat{c}}_{g})}^{T}$ is estimated by the sandwich estimator. Based on these estimators, we calculate β̂^R and ĉ^R using the R package mvmeta. The between-group covariance matrix is estimated by the method of moments. For both scenarios, we consider the following leave-one-group-out cross-validation procedure for evaluation. We first obtain estimators β̂^M, ĉ^M, β̂^P, ĉ^P, β̂^R and ĉ^R based on pooled samples of any three groups. Then, we evaluate the PCD and the VD as defined in Section 3.1 under the obtained maximin OTR and the pooled OTR for the remaining testing group, using Monte Carlo simulations based on the true model for the testing group.

Tables 3 and 4 summarize the results of the VD for Scenario 1 and Scenario 2, respectively. The results of the PCD are given in Tables 21 and 22 in the supplementary article. The OTR obtained by random effects meta-analyses is close to the estimated pooled OTR in both scenarios. In Scenario 1, both the PCD and the VD under our maximin OTR are much higher than those under the other two OTRs for all the testing groups. Taking the PCD as an example, on average, the PCD under the maximin OTR is approximately 5% to 6% higher than those under the other OTRs. This demonstrates the advantages of the proposed maximin-projection learning when there is relatively large heterogeneity in optimal treatment decision-making across subgroups. In Scenario 2, since the groupwise optimal treatment regimes are “close” to each other in “angles”, the estimated OTRs do not differ much. From Table 4, it can be seen that our maximin OTR performs better than the other OTRs when the first group is taken as the testing group, while it has comparable performance with the other OTRs when the other groups are taken as testing groups.
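As a rough sanity check on the magnitudes in Table 3, the population-level VD under the maximin regime can be worked out in closed form for Scenario 1. The sketch reads the VD as E(X^T β_g + c₀) I(X^T β > −c), the expression that appears in Appendix A, with c₀ = c = 0 and X ~ N(0, I₂); in that case the VD reduces to β^T β_g / √(2π) for a unit vector β. It also assumes the Scenario 1 directions are 0°, 15°, 70° and 90° with common norm 2, so that the maximin direction of the three training groups is the bisector of their two extreme angles. Estimation error is ignored, so the numbers are only approximate targets for the d̂_M rows.

```python
import math

ANGLES = [0.0, 15.0, 70.0, 90.0]   # Scenario 1 directions in degrees (assumed), norm 2 each

def vd_left_out(test_idx, radius=2.0):
    """Population VD for the held-out group under the maximin rule of the other three."""
    train = [a for i, a in enumerate(ANGLES) if i != test_idx]
    # equal-norm vectors within a half-plane: maximin direction bisects the extremes
    bisector = 0.5 * (min(train) + max(train))
    inner = radius * math.cos(math.radians(ANGLES[test_idx] - bisector))
    # E[Z 1(Z > 0)] = 1 / sqrt(2 * pi) for Z ~ N(0, 1)
    return inner / math.sqrt(2.0 * math.pi)

for g in range(4):
    print(g + 1, round(vd_left_out(g), 3))
```

Under these assumptions the four values are about 0.486, 0.691, 0.723 and 0.458, close to the Setting 1 entries for d̂_M in Table 3.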

Table 3.

VD results (with standard errors in parentheses) for Scenario 1 under the estimated maximin OTR d̂_M, the pooled OTR d̂_P and the OTR obtained by random effects meta-analyses d̂_R.

Testing group		First group	Second group	Third group	Fourth group
Setting 1	d̂_P	0.407(0.002)	0.606(0.001)	0.632(0.002)	0.368(0.002)
	d̂_R	0.408(0.001)	0.608(0.001)	0.633(0.001)	0.367(0.001)
	d̂_M	0.486(0.001)	0.690(0.001)	0.723(0.001)	0.458(0.001)

Setting 2	d̂_P	0.406(0.002)	0.606(0.002)	0.630(0.002)	0.366(0.002)
	d̂_R	0.407(0.001)	0.608(0.001)	0.633(0.001)	0.366(0.001)
	d̂_M	0.483(0.002)	0.689(0.001)	0.719(0.001)	0.452(0.002)

Setting 3	d̂_P	0.407(0.003)	0.604(0.002)	0.630(0.002)	0.367(0.003)
	d̂_R	0.405(0.002)	0.606(0.001)	0.632(0.001)	0.367(0.002)
	d̂_M	0.483(0.002)	0.688(0.001)	0.723(0.001)	0.454(0.002)

Setting 4	d̂_P	0.406(0.003)	0.602(0.003)	0.628(0.003)	0.365(0.003)
	d̂_R	0.406(0.002)	0.606(0.001)	0.632(0.001)	0.366(0.002)
	d̂_M	0.473(0.003)	0.686(0.002)	0.716(0.001)	0.439(0.004)


Table 4.

VD results (with standard errors in parentheses) for Scenario 2 under the estimated maximin OTR d̂_M, the pooled OTR d̂_P and the OTR obtained by random effects meta-analyses d̂_R.

Testing group		First group	Second group	Third group	Fourth group
Setting 1	d̂_P	0.803(<0.001)	0.597(<0.001)	0.865(<0.001)	0.762(<0.001)
	d̂_R	0.803(<0.001)	0.598(<0.001)	0.865(<0.001)	0.761(<0.001)
	d̂_M	0.847(<0.001)	0.588(<0.001)	0.865(<0.001)	0.769(<0.001)

Setting 2	d̂_P	0.802(0.001)	0.597(<0.001)	0.864(<0.001)	0.761(<0.001)
	d̂_R	0.803(<0.001)	0.598(<0.001)	0.865(<0.001)	0.762(<0.001)
	d̂_M	0.843(0.001)	0.587(<0.001)	0.863(<0.001)	0.767(0.001)

Setting 3	d̂_P	0.801(0.001)	0.597(<0.001)	0.863(<0.001)	0.760(0.001)
	d̂_R	0.801(0.001)	0.597(<0.001)	0.864(<0.001)	0.761(0.001)
	d̂_M	0.841(0.001)	0.588(<0.001)	0.861(0.001)	0.765(0.001)

Setting 4	d̂_P	0.799(0.001)	0.595(<0.001)	0.861(0.001)	0.758(0.001)
	d̂_R	0.804(0.001)	0.597(<0.001)	0.863(<0.001)	0.759(0.001)
	d̂_M	0.826(0.002)	0.587(0.001)	0.853(0.001)	0.756(0.002)


In Section C.2 in the supplementary article, we conduct some additional simulation experiments with non-normal covariates. Findings are similar to those with normal covariates.

Although our maximin estimators have better performance for treatment decision making in the above simulation examples, they can have larger variances compared with the random effects models. This is a potential disadvantage of our method.

6. Health assessment questionnaire (HAQ) progression data

The HAQ progression data come from an observational study investigating the influence of early disease modifying antirheumatic drug (DMARD) treatment and its duration on patients with recent onset inflammatory polyarthritis (Farragher et al., 2010). Early DMARD treatment was routinely used in the management of rheumatoid arthritis (RA). Among conventional DMARDs, methotrexate is the most widely used and is now considered a benchmark against which new treatments are compared. Previous studies showed that RA patients who have failed to respond to methotrexate may have clinically important improvements if treated with combination DMARDs, such as methotrexate-sulfasalazine-hydroxychloroquine, methotrexate-sulfasalazine-steroids or other methotrexate combinations (Boers et al., 1997). However, methotrexate combinations did not work for all RA patients, and they may not add benefits in some patients who were stable on DMARD monotherapy (Symmons et al., 2005). It is of clinical interest to develop individualized OTRs and to know which patients will benefit from treatment with methotrexate combinations. The study sample includes 420 patients who were recruited to the study from 1990 to 2000 and were treated with either methotrexate monotherapy or methotrexate combinations. Age, gender, duration of disease, HAQ score, number of swollen joints and number of tender joints were recorded at baseline. We standardize all six covariates such that their sample covariance matrix equals the identity matrix within each group. We compare methotrexate combinations (A = 1) with methotrexate monotherapy (A = 0). The difference in HAQ scores between baseline and year 5 is set to be the response. Here, we classify the 420 patients into three groups according to their recruitment time. Specifically, group 1 includes patients enrolled from 1990 to 1992; group 2 includes those enrolled from 1993 to 1996; and group 3 includes those enrolled from 1997 to 2000. Sample sizes of the three groups are 265, 78 and 77, respectively.

In our analysis, we use the last two standardized covariates to fit the contrast function, since the regression coefficients of the other variables are not significant. Denote these two covariates by $X_{g j}^{(1)}$ and $X_{g j}^{(2)}$ , respectively. For each group g, we consider the following model

E (Y_{g j} ∣ X_{g j}, A_{g j}) = h_{g} (X_{g j}) + A_{g j} (c_{0} + β_{g 1} X_{g j}^{(1)} + β_{g 2} X_{g j}^{(2)}) .

The parameters c₀, β_g₁, β_g₂ are estimated using the A-learning estimating equations as discussed in Section 4.2. Here, a linear model is fitted for the baseline function and a logistic regression model is fitted for the propensity score. When fitting the propensity score model, all six covariates are included. Table 5 reports the group-wise estimators obtained using the A-learning estimating equations, suggesting there is some heterogeneity in optimal treatment regimes across the three groups.

Table 5.

Estimators of groupwise OTR (standard errors in parentheses) for the HAQ data.

	Group 1	Group 2	Group 3
β̂_g₁	0.05(0.11)	−0.40(0.17)	0.07(0.21)
β̂_g₂	0.07(0.11)	0.06(0.21)	0.32(0.16)


We use the same leave-one-group-out cross validation procedure as done in simulations to evaluate the performance of the proposed method. We calculate the maximin OTR d̂_M, the pooled OTR d̂_P, and the OTR obtained by random effects meta-analyses d̂_R based on every two groups of patients, and evaluate them on the remaining group based on the estimated value function. For a given treatment regime d and group g, the estimated value function is given by

\hat{E} Y_{g}^{★} (d) = \frac{1}{m_{g}} \sum_{j = 1}^{m_{g}} [Y_{g j} + ({\hat{c}}_{0} + {\hat{β}}_{g 1} X_{g j}^{(1)} + {\hat{β}}_{g 2} X_{g j}^{(2)}) \{d (X_{g j}) - A_{g j}\}],

which is computed based on the advantage function as introduced in Murphy (2003). Results are given in Table 6. Values under the maximin OTR are uniformly better than those under the other OTRs across all three groups, showing a big improvement for group 2. Besides, the estimators involved in the regimes d̂_P and d̂_R are very close.
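The estimated value function translates directly into code. In the sketch below, `estimated_value`, the record layout (y, a, x) and the callable `regime` are hypothetical names chosen for illustration.

```python
def estimated_value(data, regime, c0_hat, beta_hat):
    """Average of Y + (c0_hat + beta_hat^T X) * (d(X) - A) over one group.

    data   : list of (y, a, x) records, x being a tuple of covariates
    regime : callable mapping x to a treatment decision in {0, 1}
    """
    total = 0.0
    for y, a, x in data:
        contrast = c0_hat + sum(b * xk for b, xk in zip(beta_hat, x))
        total += y + contrast * (int(regime(x)) - a)
    return total / len(data)

# Toy usage: treat whenever the first covariate is positive
toy = [(1.0, 0, (1.0, 0.0)), (2.0, 1, (-1.0, 0.0))]
print(estimated_value(toy, lambda x: x[0] > 0, 0.0, (1.0, 0.0)))  # 2.5
```

Each observed outcome is adjusted by the estimated contrast whenever the regime's decision differs from the treatment actually received, and is left unchanged otherwise.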

Table 6.

d̂_M, d̂_P, d̂_R and their value functions

Testing group

Group 1

Group 2

Group 3

d̂_M

d̂_P

d̂_R

d̂_M

d̂_P

d̂_R

d̂_M

d̂_P

d̂_R

−0.87

−0.14

−0.12

−2.38

−0.21

−0.11

−3.08

−0.31

−0.32

β̂₁

−0.48

−0.02

−0.00

0.61

0.16

−0.02

0.06

−0.01

β̂₂

0.88

0.25

0.23

0.79

0.10

0.14

1.00

0.06

0.10

\hat{E} Y_{g}^{★} (d)

−0.08

−0.09

−0.19

−0.22

−0.12

−0.13

−0.12


7. Discussion

In this paper, we propose a maximin-projection learning to aggregate OTRs for patients from different populations with heterogeneity. It has appealing statistical interpretations in the sense of maximizing the minimum PCD and the minimum value difference across subgroups. The corresponding estimation procedure is easy to implement via quadratically constrained linear programming, and the asymptotic properties of the resulting estimators are studied.

7.1. Alternative maximin formulation

Our procedure requires scaling the baseline covariates X_g to have mean zero and identity covariance matrix for g = 1, 2, …, G, G + 1. Let $X_{g, 0}$ be the original variable prior to transformation and $β_{g, 0}$, $c_{g, 0}$ the corresponding individualized and marginal treatment effects, respectively. The proposed maximin OTR is constructed based on $β^{M} = \arg\max_{{‖ β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} β^{T} β_{g}$, or equivalently,

β^{M *} = \arg\max_{{‖ Σ_{G + 1}^{1 / 2} β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} β^{T} Σ_{G + 1}^{1 / 2} Σ_{g}^{1 / 2} β_{g, 0},

where Σ_g is the covariance matrix of $X_{g, 0}$ for g = 1, …, G + 1.

As pointed out by one of the referees, we can also consider the maximin OTR based on $β^{M * *}$, where

β^{M * *} = \arg\max_{{‖ Σ_{G + 1}^{1 / 2} β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} β^{T} Σ_{G + 1} β_{g, 0} .

Assuming $E X_{1, 0} = E X_{2, 0} = \cdots = E X_{G, 0} = E X_{G + 1, 0} = 0$, $c_{1, 0} = c_{2, 0} = \cdots = c_{G, 0}$ and $X_{G + 1}$ is spherically distributed, we can show

β^{M * *} = \arg\max_{{‖ Σ_{G + 1}^{1 / 2} β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} E (X_{G + 1, 0}^{T} β_{g, 0} + c_{g, 0}) I (X_{G + 1, 0}^{T} β > - c),

for any c > 0. This implies that $β^{M * *}$ maximizes the minimum groupwise value difference function under the new distribution $X_{G + 1, 0}$.

It is worthwhile to investigate the performance of the OTR based on $β^{M * *}$. However, this is beyond the scope of the current paper. Below, we briefly compare the proposed maximin OTR with the maximin OTR based on $β^{M * *}$ and discuss their connections. First, $β^{M * *}$ maximizes the minimum groupwise value difference function under the new distribution $X_{G + 1, 0}$, while $β^{M}$ maximizes the minimum groupwise value difference function under the new distribution $X_{G + 1}$ after scaling. To see this, note that when c₁ = ⋯ = c_G and $X_{G + 1}$ is spherically distributed, we have

β^{M} = \arg\max_{{‖ β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} E (X_{G + 1}^{T} β_{g} + c_{g}) I (X_{G + 1}^{T} β > - c),

for any c > 0. Second, $β^{M * *}$ usually does not coincide with $β^{M *}$. A sufficient condition for $β^{M *} = β^{M * *}$ is that $Σ_{1} = Σ_{2} = \cdots = Σ_{G} = Σ_{G + 1}$. Lastly, the estimator of $β^{M * *}$ might exhibit smaller variance than that of $β^{M *}$, since it does not require the estimation of Σ₁, …, Σ_G. However, the OTR based on $β^{M * *}$ is not scale invariant. To see this, let $X_{G + 1}^{* *} = C X_{G + 1, 0}$ for some invertible matrix C. The covariance matrix of $X_{G + 1}^{* *}$ is equal to $C Σ_{G + 1} C^{T}$. Let

β^{M * * *} = \arg\max_{{‖ Σ_{G + 1}^{1 / 2} C^{T} β ‖}_{2} = 1} \min_{g \in \{1, \dots, G\}} β^{T} C Σ_{G + 1} C^{T} β_{g, 0} .

Then there is no guarantee that $β^{M * * *} = (C^{T})^{- 1} β^{M * *}$.

7.2. Extensions

In the current work, we mainly deal with heterogeneity caused by groupwise individualized treatment effects β_g’s, and assume the same marginal treatment effects c_g’s for all groups. It is possible to extend our proposed maximin projection learning to the case when c_g’s vary across different groups as well. Specifically, consider

({\hat{β}}^{M}, {\hat{c}}^{M}) = arg max_{{‖ β ‖}_{2}^{2} + c^{2} \leq 1} min_{g \in {1, \dots, G}} ({\hat{β}}_{g}^{T} β + {\hat{c}}_{g} c),

where β̂_g and ĉ_g are subgroup estimators. Statistical properties of β̂^M and ĉ^M can be similarly established. For example, β̂^M and ĉ^M can be shown to converge almost surely to some $β_{(0)}^{M}$ and $c_{(0)}^{M}$ , respectively. However, the defined $β_{(0)}^{M}$ and $c_{(0)}^{M}$ can no longer preserve the interpretation of maximizing the minimum PCD and the minimum VD, due to the fact that the PCD and the VD are complicated functions of (β_g, c_g) and (β, c) when c_g’s vary across groups. Consequently, the angle interpretation as demonstrated by the toy example given in Section 2.2 does not hold.

To establish the consistency and asymptotic normality of β̂^M and ĉ^M, we require β_g, g ∈ K₀ to be linearly independent. In Section C.1 in the supplementary article, we conduct some additional simulation studies to examine our methods under settings where some of the β_g’s are the same. Results suggest that β̂^M and ĉ^M are still consistent for $β_{(0)}^{M}$ and $c_{(0)}^{M}$ in these settings. We further evaluate the VD and the PCD under the estimated maximin OTR and compare them with those under the estimated pooled OTR. Findings are similar to those in Section 5.

In addition, in our current work, we assume a linear interaction between treatment and baseline covariates. It is interesting to consider a more general model as follows:

Y_{g} = h_{g} (X_{g}) + A_{g} Q (β_{g}^{T} X_{g} + c_{g}) + e_{g}, g = 1, \dots, G,

(14)

where Q is a strictly monotone increasing function with Q(0) = 0. The parameters β_g in each group can be consistently estimated using the concordance-assisted learning method by Fan et al. (2017). The properties of the corresponding maximin-projection estimator warrant further investigation.

Supplementary Material

Supp info

NIHMS955724-supplement-Supp_info.pdf^{(150.6KB, pdf)}

Acknowledgments

We thank the editor, the AE and three referees for providing helpful suggestions that significantly improved the quality of the paper. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.

A. Proof of Theorem 3.2

Before proving Theorem 3.2, we state the following lemma whose proof is given in Section F of the supplementary article.

Lemma A.1

Consider a set of vectors β₁, …, β_G of dimension s and function h(β, b, t) defined on the domain {(β^T, b^T, t) ∈ ℝ^s × ℝ^s × 𝒯: ||β||₂ = 1}. Assume h(β, b, t) = g(β^T b, t) for some function g(·, ·). Besides, assume for any fixed t, g(c, t) is monotonically increasing as a function of c. Then, for any random variable T defined on 𝒯, we have

arg max_{\begin{matrix} β \in ℝ^{p} \\ {‖ β ‖}_{2} = 1 \end{matrix}} min_{g} E h (β_{g}, β, T) = arg max_{\begin{matrix} β \in ℝ^{p} \\ {‖ β ‖}_{2} = 1 \end{matrix}} min_{g} {β_{g}}^{T} β .

Since X_g’s are identically distributed for g = 1, …, G, we omit the subscript g for brevity. We need to show β^M maximizes

min_{g} [E {h_{g} (X) + (X^{T} β_{g} + c_{0}) I (X^{T} β > - c)} - E h_{g} (X)] = min_{g} E (X^{T} β_{g} + c_{0}) I (X^{T} β > - c) .

We first show that for any β with ||β||₂ = 1 and any c, the probability Pr(X^Tβ > c) is constant as a function of β. Since ||β||₂ = 1, it follows from Lemma F.1 that there exists some orthogonal matrix U such that Uβ = e₁ = (1, 0, …, 0)^T. Hence

Pr (X^{T} β > c) = Pr (X^{T} U^{T} U β > c) = Pr (X^{T} U^{T} e_{1} > c) = Pr (X^{T} e_{1} > c),

(15)

where the last equality is due to the definition of spherical distribution (see Definition F.1). By (15), it suffices to show β^M maximizes

min_{g} E (X^{T} β_{g}) I (X^{T} β > - c) .

(16)

Let ρ_g = β^T β_g/||β_g||₂. Since X is spherically distributed, we have

E (X^{T} β_{g}) I (X^{T} β > - c) = {‖ β_{g} ‖}_{2} E (ρ_{g} X^{(1)} + \sqrt{1 - ρ_{g}^{2}} X^{(2)}) I (X^{(1)} > - c),

(17)

for all β_g, c and β such that ||β||₂ = 1, where X⁽¹⁾ and X⁽²⁾ are the first two components of the random vector X. It follows from Theorem 2.6 in Fang et al. (1990) that

(X^{(1)}, X^{(2)}) \overset{d}{=} r d (U_{1}, U_{2}),

(18)

with r = ||X||₂, d ~ B(1, p/2 − 1), U₁ and U₂ uniformly distributed on the surface $u_{1}^{2} + u_{2}^{2} = 1$ , where B(p, q) stands for the Beta distribution with parameters p, q. The random variables r, d are independent of U₁ and U₂. Set T = rd. Combining (17) with (18) gives

E (X^{T} β_{g}) I (X^{T} β > - c) = E {‖ β_{g} ‖}_{2} T (ρ_{g} U_{1} + \sqrt{1 - ρ_{g}^{2}} U_{2}) I (T U_{1} > - c) = E [{E {‖ β_{g} ‖}_{2} t (ρ_{g} U_{1} + \sqrt{1 - ρ_{g}^{2}} U_{2}) I (U_{1} > - c / t)} ∣ T = t] \equiv E [h (β, β_{g}, t) ∣ T = t],

(19)

for any β and c such that ||β||₂ = 1.

When c/t ≤ −1, we have I(U₁ > −c/t) = I(U₁ > 1) = 0 and hence h = 0. When c/t ≥ 1,

h (β, β_{g}, t) = t {‖ β_{g} ‖}_{2} E (ρ_{g} U_{1} + \sqrt{1 - ρ_{g}^{2}} U_{2}) = 0.

Obviously, in these two trivial cases, h is an increasing function of β^T β_g. Now we consider the case where c/t = cos(ψ₁) for some ψ₁ ∈ (0, π). Assume ρ_g = cos(ψ₂) for some ψ₂ ∈ (0, π). The function h can further be simplified to

h (β, β_{g}, t) = \frac{1}{2 π} \int_{- ψ_{1}}^{ψ_{1}} {‖ β_{g} ‖}_{2} t {cos (ψ_{2}) cos (ψ) + sin (ψ_{2}) sin (ψ)} d ψ = \frac{1}{2 π} t {‖ β_{g} ‖}_{2} \int_{- ψ_{1}}^{ψ_{1}} cos (ψ - ψ_{2}) d ψ = \frac{1}{π} t {‖ β_{g} ‖}_{2} sin (ψ_{1}) cos (ψ_{2}) = \frac{1}{π} t sin (ψ_{1}) β^{T} β_{g} .

This proves h is an increasing function of β^T β_g. Hence, (16) follows by an application of Lemma A.1.
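The integral identity used in the last display, (1/2π) ∫ from −ψ₁ to ψ₁ of cos(ψ − ψ₂) dψ = (1/π) sin(ψ₁) cos(ψ₂), can be checked numerically. The sketch below uses a plain midpoint rule; the function name and the grid size are arbitrary choices for illustration.

```python
import math

def h_integral(psi1, psi2, n=20000):
    """Midpoint rule for (1 / (2 pi)) * integral of cos(psi - psi2) over [-psi1, psi1]."""
    width = 2.0 * psi1 / n
    total = sum(math.cos(-psi1 + (i + 0.5) * width - psi2) for i in range(n))
    return total * width / (2.0 * math.pi)

psi1, psi2 = 1.0, 0.4
closed_form = math.sin(psi1) * math.cos(psi2) / math.pi
print(h_integral(psi1, psi2), closed_form)
```

Since sin(ψ₁) > 0 for ψ₁ ∈ (0, π) and t > 0, the closed form is increasing in cos(ψ₂) = ρ_g, which is the monotonicity needed to apply Lemma A.1.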

Contributor Information

Chengchun Shi, North Carolina State University, Raleigh, USA.

Rui Song, North Carolina State University, Raleigh, USA.

Wenbin Lu, North Carolina State University, Raleigh, USA.

Bo Fu, Fudan University, Shanghai, People’s Republic of China.

References

Avi-Itzhak H, Van Mieghem JA, Rub L, et al. Multiple subclass pattern recognition: a maximin correlation approach. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1995;17(4):418–431.
Bühlmann P, Meinshausen N. Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE. 2016;104(1):126–135.
Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19(3):317–343. doi: 10.1177/0962280209105013.
Chen H, Manning AK, Dupuis J. A method of moments estimator for random effect multivariate meta-analysis. Biometrics. 2012;68(4):1278–1284. doi: 10.1111/j.1541-0420.2012.01761.x.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7(3):177–188. doi: 10.1016/0197-2456(86)90046-2.
Fan C, Lu W, Song R, Zhou Y. Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2017;79(5):1565–1582. doi: 10.1111/rssb.12216.
Fang KT, Kotz S, Ng KW. Symmetric multivariate and related distributions, Volume 36 of Monographs on Statistics and Applied Probability. Chapman and Hall, Ltd; London: 1990.
Farragher TM, Lunt M, Fu B, Bunn D, Symmons DP. Early treatment with, and time receiving, first disease-modifying antirheumatic drug predicts long-term function in patients with inflammatory polyarthritis. Annals of the Rheumatic Diseases. 2010;69(4):689–695. doi: 10.1136/ard.2009.108639.
Jackson D, White IR, Thompson SG. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Stat Med. 2010;29(12):1282–1297. doi: 10.1002/sim.3602.
Lee T, Moon T, Kim SJ, Yoon S. Regularization and kernelization of the maximin correlation approach. IEEE Access. 2016;4:1385–1392.
Meinshausen N, Bühlmann P. Maximin effects in inhomogeneous large-scale data. Ann Statist. 2015;43(4):1801–1830.
Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65(2):331–366.
Robins J, Hernan M, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiol. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. J Edu Psychol. 1974;66:688–701.
Tarrier N, Lewis S, Haddock G, Bentall R, Drake R, Kinderman P, Kingdon D, Siddle R, Everitt J, Leadley K, et al. Cognitive-behavioural therapy in first-episode and early schizophrenia. The British Journal of Psychiatry. 2004;184(3):231–239. doi: 10.1192/bjp.184.3.231.
Watkins C, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68(4):1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107(499):1106–1118. doi: 10.1080/01621459.2012.695674.

Associated Data

Supplementary Materials

Supp info: NIHMS955724-supplement-Supp_info.pdf (150.6KB, pdf)
