Author manuscript; available in PMC: 2024 Oct 4.
Published in final edited form as: Proc Mach Learn Res. 2024 May;238:712–720.

Fusing Individualized Treatment Rules Using Secondary Outcomes

Daiqi Gao 1, Yuanjia Wang 2, Donglin Zeng 3
PMCID: PMC11450767  NIHMSID: NIHMS1968478  PMID: 39371406

Abstract

An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.

1. INTRODUCTION

An individualized treatment rule (ITR) is a decision rule that recommends treatments to patients based on their pre-treatment covariates, such as demographics, medical history, and health status. Many methods have been proposed to estimate the optimal ITR using data from a randomized controlled trial or an observational study. The goal is to find the ITR that would yield the maximal primary outcome if patients follow the treatment recommendations. Regression-based approaches start with estimating Q-functions, which are the expected outcomes associated with each treatment option, and the optimal treatment is determined by maximizing the Q-functions. Exemplary methods include Q-learning (Qian and Murphy, 2011; Ma et al., 2022) and A-learning (Murphy, 2003; Shi et al., 2018). Alternative methods directly search for the ITR that maximizes the mean reward, also called the value function, using either an inverse probability weighted estimator (Zhao et al., 2012; Chen et al., 2016; Zhang et al., 2020; Gao et al., 2022; Ma et al., 2023) or a doubly robust estimator (Zhang et al., 2012; Zhou et al., 2017; Liu et al., 2018; Zhao et al., 2019). However, all these methods focus on a single primary outcome when estimating the ITRs.

In practical settings, it is important to consider certain secondary outcomes. Non-favorable secondary outcomes potentially indicate worsened overall health or an increased risk of treatment non-compliance. Therefore, when estimating the optimal ITR for the primary outcome, it is crucial to ensure that the ITR also optimizes the secondary outcomes to the greatest extent. For example, for major depressive disorder (MDD; Trivedi et al., 2016), one common outcome to measure depressive symptoms is the Quick Inventory of Depressive Symptomatology (QIDS) score, a rating scale of the patient's emotional state over the past seven days. Another important outcome is the Clinical Global Improvement (CGI) scale, which is often used to assess a patient's symptoms, behavior, and the impact on the patient's ability to function; it is an indicator of overall clinical improvement. Although the primary goal in many studies is to identify the optimal treatment strategy for improving the QIDS score, it is equally important for such a strategy to be effective for the CGI scale.

In this work, we have a twofold objective: to estimate the optimal ITR for the primary outcome and to ensure that the resulting treatment decision closely aligns with the optimal treatment for the secondary outcomes. Specifically, we aim to fuse the treatment rules for different outcomes to provide a unified ITR that performs optimally for the primary outcome and effectively for the secondary outcomes, even if it may not be optimal for the latter. To achieve this goal, we introduce a novel fused learning framework to estimate the optimal ITR, namely the fused individualized treatment rule (FITR). Under this framework, we maximize the value function for the primary outcome. At the same time, we incorporate a fusion penalty to encourage agreement between the estimated ITR and the optimal ITRs for the secondary outcomes. The latter are assumed to be obtained a priori using external data or the data from the same study. The fusion penalty is designed as a weighted sum of the disagreement rates between the treatment rules, with weights determined based on the similarity of treatment effects on these outcomes. Therefore, this penalty encourages consistency between the ITRs for the different outcomes.

Related Work.

Several approaches have been proposed to learn ITRs that handle multiple outcomes simultaneously. Some of these approaches (Wang et al., 2018; Laber et al., 2018; Fang et al., 2022) estimated the optimal ITR for the primary outcome while restricting the secondary outcome to be no less (or no larger, if the outcome is a risk outcome) than a threshold. However, these approaches require the threshold to be pre-specified and only ensure that the average value of the secondary outcome is controlled, not its value for any specific individual. Another class of methods (Thall et al., 2002; Luckett et al., 2021) constructed expert-derived or data-driven patient-specific composite outcomes for learning the ITRs; however, their ITRs may not be optimal for the primary outcome of interest. In addition, Laber et al. (2014) and Lizotte and Laber (2016) proposed constructing a set-valued regime for competing outcomes to generate non-inferior outcome vectors, but such regimes do not recommend a single treatment, limiting their use in practice. Our goal is to learn an optimal ITR for the primary outcome while ensuring closeness to the optimal rules for the secondary outcomes, which is distinct from all existing approaches for learning multi-outcome ITRs.

Our problem is also fundamentally different from the literature that combines multiple studies in analysis. Meta-analysis (Haidich, 2010; Lin and Zeng, 2010; Claggett et al., 2014; Liu et al., 2015) combines the estimators across multiple studies efficiently. Integrative data analysis (Curran and Hussong, 2009; Brown et al., 2018) considers multiple data sources or summary statistics using shared parameter models. Transfer learning (Li et al., 2021; Cai and Wei, 2021; Tian and Feng, 2022) uses existing knowledge from one task to assist in learning a new task. However, these existing methods often require individual-level data or assume specific models or distributions for each study to achieve integration, and so they are not applicable to combining different treatment rules.

Main Contributions.

Our paper introduces several significant contributions. (1) We propose a fusion penalty to encourage the primary outcome ITR to closely agree with the optimal secondary outcome ITRs. The proposed method relies on the secondary outcomes only through their treatment rules; we do not require individual-level data for a secondary outcome when such a rule is already available. In addition, there is no restriction on the form of the decision function, since the fusion penalty is applied directly to the disagreement rate. The proposed method is not affected by the magnitude of the decision function or the function space in which the secondary outcome ITRs are estimated. (2) To efficiently solve for the FITR, we propose two algorithms that use surrogate losses to substitute the noncontinuous, nonconvex value function and fusion penalty in the objective function. (3) Theoretically, we prove that the agreement rate of the estimated FITR learned with the surrogate loss converges to the true agreement rate, and the convergence rate is faster than that of the approach that ignores the secondary outcomes. We also obtain the convergence rates for the value function and misclassification rate of FITR. (4) In the numeric experiments, we show that the agreement rate indeed converges faster to the true rate. Moreover, when the true optimal ITRs of different outcomes are closely aligned, the learned FITR actually yields a higher value function and accuracy for the primary outcome compared to traditional methods. By leveraging the secondary outcomes as useful side information, the proposed method has the capability to improve the treatment decisions for multiple outcomes simultaneously.

2. METHODOLOGY

The vector $X \in \mathcal{X} \subset \mathbb{R}^d$ represents pre-treatment covariates for tailoring treatment decisions, where $d$ is the dimension of $X$. The treatment $A \in \mathcal{A} = \{-1, 1\}$ is assumed to be binary. The primary outcome is denoted by $R^1$. Without loss of generality, we assume that a higher value of this outcome indicates a better health condition. Denote $R^1(a)$ as the potential outcome under treatment $a$. For the primary outcome $R^1$, an ITR $\mathcal{D}_1 : \mathcal{X} \to \mathcal{A}$ maps a patient's covariates to a treatment suggestion. It can also be expressed as $\mathcal{D}_1(X) = \operatorname{sign}\{f_1(X)\}$ for some decision function $f_1 : \mathcal{X} \to \mathbb{R}$. Define $\mathcal{V}_1(f_1) := E[R^1(\mathcal{D}_1(X))]$ as the value function associated with $f_1$, which is the average potential outcome when the treatments follow the ITR $\mathcal{D}_1 = \operatorname{sign}\{f_1\}$ for all $X$. Our goal is to estimate the optimal ITR that maximizes this value function. The optimal ITR $\mathcal{D}_1^*$ is given as $\operatorname{sign}\{f_1^*\}$, where $f_1^*(x) = E[R^1(1) \mid X = x] - E[R^1(-1) \mid X = x]$.

We make standard assumptions in causal inference, including ignorability, consistency, and positivity (see Section B.1), so that the optimal ITR can be estimated from a sample of $n$ i.i.d. observations of $(X, A, R^1)$. Under these conditions, according to Qian and Murphy (2011), we have $\mathcal{V}_1(f_1) = E[R^1 \mathbf{1}\{A f_1(X) > 0\}/\pi(A; X)]$, where $\pi(A; X)$ is the probability of taking treatment $A$ given covariates $X$ in the data. Thus, the optimal ITR can be estimated by solving

$$\min_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{R_i^1 \mathbf{1}\{A_i f_1(X_i) < 0\}}{\pi(A_i; X_i)} + \lambda_{1n} \|f_1\|^2,$$

where $\mathcal{F}$ is some function class, $\|f_1\|$ is the semi-norm for $f_1$ in its function space, and $\lambda_{1n}$ is a tuning parameter that depends on the sample size $n$.

Separate Learning (SepL)

Since minimizing a loss function with the 0–1 loss is computationally challenging, the 0–1 loss can be substituted by some convex surrogate loss (Bartlett et al., 2006), denoted by $\phi(t)$. For example, Zhao et al. (2012) proposed using the hinge loss $\phi(t) = \max(1-t, 0)$. In our implementation, we propose using the logistic loss $\phi(t) = \log(1 + e^{-t})$ due to its differentiability and computational stability. To further reduce the variability of the value function estimator, we replace $R_i^1$ by $R_i^1 - E[R_i^1 \mid X_i]$, which does not change the optimal ITR. The conditional expectation $E[R_i^1 \mid X_i]$ can be estimated by regressing $R_i^1$ on $X_i$ with any regression method, e.g., simple linear regression. Moreover, we simultaneously flip the signs of $R_i^1 - E[R_i^1 \mid X_i]$ and $A_i$ when $R_i^1 - E[R_i^1 \mid X_i]$ is negative (cf. Liu et al., 2018). This allows us to formulate the problem as a convex optimization problem

$$\tilde{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|R_i^1 - E[R_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{R_i^1 - E[R_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2.$$

We refer to this method as separate learning (SepL).
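The SepL recipe above (residualize the reward, flip the signs of residual and treatment for negative residuals, then minimize the weighted logistic surrogate) can be sketched for a linear decision function as follows. This is a minimal illustration, not the authors' implementation; `fit_sepl` and its arguments are hypothetical names:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sepl(X, A, R, prop, lam):
    """Sketch of SepL with a linear decision function f(x) = x @ beta + b."""
    # Variance reduction: remove E[R | X] via a simple linear regression on X.
    Z = np.column_stack([np.ones(len(X)), X])
    resid = R - Z @ np.linalg.lstsq(Z, R, rcond=None)[0]
    w = np.abs(resid) / prop          # nonnegative weights |resid| / pi(A; X)
    A_eff = A * np.sign(resid)        # flip the treatment label when resid < 0

    def objective(theta):
        b, beta = theta[0], theta[1:]
        f = X @ beta + b
        # logistic surrogate phi(t) = log(1 + exp(-t)), computed stably
        return np.mean(w * np.logaddexp(0.0, -A_eff * f)) + lam * beta @ beta

    res = minimize(objective, np.zeros(X.shape[1] + 1), method="BFGS")
    b, beta = res.x[0], res.x[1:]
    return lambda Xnew: np.sign(Xnew @ beta + b)   # the estimated ITR
```

A gradient-based solver suffices here because the logistic surrogate makes the objective smooth and convex in the linear coefficients.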

2.1. Learning Fused ITR Using Optimal Rules for Secondary Outcomes

The secondary outcomes are denoted as $R^2, \ldots, R^K$ for $K \ge 2$. Suppose we have obtained the ITRs $\tilde{f}_2, \ldots, \tilde{f}_K$ for the secondary outcomes either from external data or from the same study. Our goal is to estimate the optimal ITR for the primary outcome while encouraging it to be as consistent as possible with these secondary outcome ITRs. To this end, we propose a fusion penalty on the disagreement rates between $f_1$ and $\tilde{f}_2, \ldots, \tilde{f}_K$. Specifically, we estimate the fused individualized treatment rule (FITR) $f_1$ by minimizing

$$\frac{1}{n} \sum_{i=1}^n \frac{R_i^1}{\pi(A_i; X_i)} \mathbf{1}\{A_i f_1(X_i) < 0\} + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \mathbf{1}\{f_1(X_i) \tilde{f}_k(X_i) < 0\}, \tag{1}$$

where $\Omega_{1k}$ is a nonnegative pre-specified constant that reflects the prior knowledge of the similarity between the primary outcome and each secondary outcome $R^k$. For example, $\Omega_{1k}$ can be expert-specified or defined as the correlation between $R^1$ and $R^k$. If $\Omega_{1k}$ is negative, we can simultaneously flip the signs of $\Omega_{1k}$ and $\tilde{f}_k$ to encourage consistency between $f_1$ and $\tilde{f}_k$. The tuning parameter $\mu_{1n}$ of the fusion penalty, which rescales $\Omega_{1k}$ for all $k$, is determined data-adaptively using cross-validation. It is selected to maximize the estimated value function of $f_1$, thus automatically quantifying the similarity between the outcomes.

The fusion penalty is related to the Laplacian penalty (Huang et al., 2011), which is used to learn multiple models simultaneously while encouraging similarity among them. With $L$ as the Laplacian matrix and $f := (f_1, \ldots, f_K)$ as the vector of decision functions for the $K$ outcomes, the Laplacian penalty can be defined as $\frac{1}{n} \sum_{i=1}^n [\mathbf{1}\{f(X_i) > 0\} L \mathbf{1}\{f(X_i) > 0\}^T + \mathbf{1}\{f(X_i) < 0\} L \mathbf{1}\{f(X_i) < 0\}^T]$, with the indicators applied elementwise. It prompts $f_k(X_i)$ and $f_j(X_i)$ to have the same sign when $L_{kj} < 0$ for $k, j = 1, \ldots, K$, $k \ne j$. This is equivalent to (1) if we use the same data to learn FITRs for all outcomes simultaneously.

FITR-Ramp.

The optimization problem in (1) is known to be NP-hard. To tackle this problem, we substitute the 0–1 losses with smooth loss functions for optimization purposes. The 0–1 loss in the first part of the expression is substituted by the logistic loss as in SepL. For the 0–1 loss in the fusion penalty, we use the ramp loss $\psi_\kappa(t) = \min\{1, \max\{0, 1 - t/\kappa\}\}$, where $\kappa$ is a tuning parameter; the ramp loss converges to the 0–1 loss as $\kappa$ decreases to zero. With the same variance reduction and negative reward handling trick, we propose the FITR-Ramp method, which estimates $f_1$ by

$$\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|R_i^1 - E[R_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{R_i^1 - E[R_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \psi_{\kappa_{1n}}\left(f_1(X_i) \tilde{f}_k(X_i)\right). \tag{2}$$

We allow κ1n to depend on the sample size n.
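To make the smoothing concrete, the ramp loss and the resulting smoothed fusion penalty can be sketched as below. The function names and array layout are illustrative, not from the paper:

```python
import numpy as np

def ramp_loss(t, kappa):
    """Ramp loss psi_kappa(t) = min{1, max{0, 1 - t/kappa}}.

    Outside the band [0, kappa] it equals the 0-1 loss 1{t <= 0}, and it
    converges pointwise to the 0-1 loss as kappa -> 0 (for t != 0).
    """
    return np.minimum(1.0, np.maximum(0.0, 1.0 - t / kappa))

def fusion_penalty(f_vals, rules_sec, omega, mu, kappa):
    """Smoothed penalty (mu/n) * sum_i sum_k Omega_1k * psi(f1(X_i) f_k(X_i)).

    f_vals:    array (n,) of primary decision-function values f1(X_i)
    rules_sec: array (K-1, n) of secondary-rule signs in {-1, +1}
    omega:     array (K-1,) of nonnegative similarity weights Omega_1k
    """
    loss = ramp_loss(f_vals[None, :] * rules_sec, kappa)      # shape (K-1, n)
    return mu * np.mean(omega[:, None] * loss) * rules_sec.shape[0]
```

A small `kappa` makes the penalty track the disagreement rate closely at the cost of a steeper, harder-to-optimize surface.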

FITR-IntL.

Alternatively, we can solve the optimization problem in (1) using the following procedure. Notice that the disagreement between $f_1$ and $\tilde{f}_k$ can be expressed through the disagreement between $A_i$ and $f_1$, since

$$\begin{aligned} \mathbf{1}\{f_1(X_i) \tilde{f}_k(X_i) < 0\} &= \mathbf{1}\{A_i f_1(X_i) < 0\} \mathbf{1}\{A_i \tilde{f}_k(X_i) > 0\} + \left[1 - \mathbf{1}\{A_i f_1(X_i) < 0\}\right] \mathbf{1}\{A_i \tilde{f}_k(X_i) < 0\} \\ &= \mathbf{1}\{A_i f_1(X_i) < 0\} \operatorname{sign}\{A_i \tilde{f}_k(X_i)\} + \mathbf{1}\{A_i \tilde{f}_k(X_i) < 0\}. \end{aligned}$$

The last term does not depend on $f_1$ given the observed data. Therefore, the problem in (1) is equivalent to $\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\tilde{R}_i^1}{\pi(A_i; X_i)} \mathbf{1}\{A_i f_1(X_i) < 0\} + \lambda_{1n} \|f_1\|^2$, where

$$\tilde{R}_i^1 = R_i^1 + \mu_{1n} \pi(A_i; X_i) \sum_{k=2}^K \Omega_{1k} \operatorname{sign}\{A_i \tilde{f}_k(X_i)\}$$

is the pseudo outcome. We then substitute the indicator function by the logistic loss and estimate f^1n by

$$\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|\tilde{R}_i^1 - E[\tilde{R}_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{\tilde{R}_i^1 - E[\tilde{R}_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 \tag{3}$$

with the same variance reduction and negative reward handling trick. We refer to this optimization method as FITR-IntL. It is worth noting that a similar procedure was originally proposed by Qiu et al. (in press) and Qiu (2018) to integrate treatment rules from multiple studies for the same outcome, without theoretical justification, whereas our focus is on integrating multiple outcomes.
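The pseudo-outcome construction is a one-liner in practice. Below is a minimal sketch under the notation above (names are illustrative); after this augmentation, FITR-IntL reduces to running SepL on the pseudo outcomes:

```python
import numpy as np

def pseudo_outcome(R, A, prop, rules_sec, omega, mu):
    """Pseudo outcome for FITR-IntL:
        R_tilde_i = R_i + mu * pi(A_i; X_i) * sum_k Omega_1k * sign(A_i f_k(X_i)).

    rules_sec: array (K-1, n) of secondary-rule recommendations in {-1, +1}.
    The reward is bumped up when treatment A_i agrees with a secondary rule
    and bumped down when it disagrees.
    """
    agree = np.sign(A[None, :] * rules_sec)       # +1 if rule k agrees with A_i
    return R + mu * prop * (omega[:, None] * agree).sum(axis=0)
```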

Implementation.

In SepL, FITR-Ramp, and FITR-IntL, the function space for $f_1$ is usually chosen as a reproducing kernel Hilbert space (RKHS) associated with a real-valued kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The choice of $k$ can be the linear kernel $k(x, x') = x^T x'$, which yields a linear decision function for $f_1$, or the Gaussian kernel $k(x, x') = \exp(-\sigma_{1n}^2 \|x - x'\|_2^2)$, which yields a nonlinear decision function, where $\sigma_{1n}$ is a parameter depending on $n$. By the representer theorem, the minimizer for $f_1$ takes the form $f(X) = \sum_{i=1}^n \alpha_i k(X, X_i)$, so the optimization problems can be restricted to the class $\mathcal{F} := \{f : f(X) = \sum_{i=1}^n \alpha_i k(X, X_i), (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n\}$. The function class is nonparametric when the RKHS with the Gaussian kernel is used. When $\mu_{1n}, \kappa_{1n} \to 0$ and $\sigma_{1n}^2 \to \infty$ as $n \to \infty$, this nonparametric method is asymptotically equivalent to the unconstrained problem without penalties. Therefore, the learned FITR is asymptotically optimal even when the non-constant shift $E(R \mid X)$ and the fusion penalty are introduced.
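The representer-theorem expansion with a Gaussian kernel can be sketched as follows (illustrative helper names; the kernel matches the formula above):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """k(x, x') = exp(-sigma^2 * ||x - x'||_2^2), computed pairwise."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma ** 2 * d2)

def make_decision_function(alpha, X_train, sigma):
    """By the representer theorem, f(x) = sum_i alpha_i k(x, X_i),
    so the learned rule is fully determined by the coefficient vector alpha."""
    return lambda Xnew: gaussian_kernel(Xnew, X_train, sigma) @ alpha
```

In this parameterization the RKHS penalty $\|f_1\|^2$ equals `alpha @ K @ alpha` for the training kernel matrix `K`, so all three methods optimize over the $n$-dimensional vector `alpha`.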

For computation, FITR-Ramp can be solved by the difference of convex algorithm (DCA) (Le Thi and Pham Dinh, 2018) or the Powell algorithm (Powell, 1964; Press et al., 2007). DCA re-expresses the objective function as the difference between two convex functions, whereas the Powell algorithm is suitable for non-differentiable objective functions. In contrast, FITR-IntL has a differentiable convex objective function, so it can be easily solved by gradient-based algorithms, including variations of gradient descent (Ruder, 2016) or the BFGS algorithm (Fletcher, 1987). The tuning parameters $\lambda_{1n}, \mu_{1n}, \kappa_{1n}$ can be selected by cross-validation. The computation time of one replication when $n = 200$ is about 0.003 seconds for SepL, 0.003 seconds for FITR-IntL, and 0.086 seconds for FITR-Ramp when using the linear kernel, and about 0.457 seconds for SepL, 0.356 seconds for FITR-IntL, and 2.744 seconds for FITR-Ramp when using the Gaussian kernel.

3. THEORETICAL RESULTS

In this section, we provide theoretical results about the agreement rate between $\hat{f}_{1n}$ and $f_2^*$ in FITR-Ramp. See Section B.1 in the supplementary material for the convergence rates of the value function $\mathcal{V}_1(\hat{f}_{1n})$ and the misclassification rate $P(\hat{f}_{1n} f_1^* < 0)$.

Without loss of generality, we assume that $\tilde{f}_k : \mathcal{X} \to \{-1, 1\}$ is a binary mapping learned from a dataset of size $N_k$ for all $k = 2, \ldots, K$, since every decision function $f$ can be transformed into a binary function by taking its sign. In this section, we assume that the conditional expectation of $R^1$ has already been removed and that the signs have been flipped if the remaining reward is negative. Then the problem becomes

$$\min_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{R_i^1}{\pi(A_i; X_i)} \phi\left(A_i f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \psi_{\kappa_{1n}}\left(f_1(X_i) \tilde{f}_k(X_i)\right).$$

We introduce the following assumptions to derive the convergence rate of the agreement rate.

Assumption 1.

Suppose that $0 \le R^1 \le c_1$ almost surely for some constant $c_1 > 0$.

Assumption 1 assumes that the primary outcome is nonnegative and bounded. Furthermore, with $r^{1(a)}(X) := E[R^1 \mid A = a, X] \ge 0$, we define

$$\eta(X) := \begin{cases} \dfrac{r^{1(1)}(X)}{\sum_{a \in \mathcal{A}} r^{1(a)}(X)}, & \text{if } \sum_{a \in \mathcal{A}} r^{1(a)}(X) > 0, \\[2ex] \dfrac{1}{2}, & \text{if } r^{1(a)}(X) = 0 \text{ for all } a \in \mathcal{A}. \end{cases}$$

The definition of $\eta(X)$ aligns with the probability $P(y = 1 \mid x)$ in the classification literature, where $x$ is the predictor and $y$ is the label. Denote the regions $\mathcal{X}_{-1} := \{x \in \mathcal{X} : \eta(x) < 1/2\}$, $\mathcal{X}_1 := \{x \in \mathcal{X} : \eta(x) > 1/2\}$, and $\mathcal{X}_0 := \{x \in \mathcal{X} : \eta(x) = 1/2\}$. Finally, we define a distance function $x \mapsto \omega_x$ as

$$\omega_x := \begin{cases} d(x, \mathcal{X}_0 \cup \mathcal{X}_1), & \text{if } x \in \mathcal{X}_{-1}, \\ d(x, \mathcal{X}_0 \cup \mathcal{X}_{-1}), & \text{if } x \in \mathcal{X}_1, \\ 0, & \text{otherwise}, \end{cases}$$

where d(x,𝒮) denotes the distance of x to a set 𝒮 with respect to the Euclidean norm.

Assumption 2.

Assume that there exists a constant q>0 for the distribution of X such that

$$\int_{\mathcal{X}} \exp\left(-\frac{\omega_x^2}{t}\right) P_{\mathcal{X}}(dx) \le C t^{qd/2} \tag{4}$$

for all $t > 0$ and some constant $C > 0$. We say $q = \infty$ if the inequality holds for all $q > 0$.

Note that Assumption 2 is used to bound the approximation error when we estimate ITRs in an RKHS and the logistic loss is used. It is stronger than the geometric noise assumption (Steinwart and Scovel, 2007) in the sense that $|2\eta(x) - 1|$ is not included in the left-hand side of (4), and $|2\eta(x) - 1| \le 1$. However, when $t \to 0$, we can still ensure that the left-hand side of (4) goes to zero.

Assumption 3.

For any $k = 2, \ldots, K$, assume that the estimator of the secondary outcome ITR $\tilde{f}_k$ converges to $f_k^*$ in the sense that $P(\tilde{f}_k f_k^* < 0) \le \tilde{\delta}_{kN_k}(\tau)$ with probability at least $1 - e^{-\tau}$, where $N_k$ is the sample size of the dataset for learning $\tilde{f}_k$ and $\tilde{\delta}_{kN_k}(\tau) = o(1)$ as $N_k \to \infty$.

Assumption 3 is used to quantify the agreement rate between $\tilde{f}_k$ and its corresponding optimal ITR $f_k^*$. It is a weak assumption that only requires $\tilde{f}_k$ to converge to the true optimal ITR, and it can be satisfied by general methods for learning ITRs. We provide the convergence rate $\tilde{\delta}_{kN_k}(\tau)$ in the remark of Corollary B.2 when $\tilde{f}_k$ is learned by SepL.

We first focus on the case when $K = 2$. That is, there exists only one secondary treatment rule $\tilde{f}_2$, which has already been estimated from an external or the same dataset. Define the first and second parts of the loss function as

$$\ell_1(f_1)(Z) := \frac{R^1}{\pi(A; X)} \phi\left(A f_1(X)\right) \quad \text{and} \quad \ell_2(f_1)(Z) := \mu_{1n} \Omega_{12} \psi_{\kappa_{1n}}\left(f_1(X) \tilde{f}_2(X)\right).$$

Let the risk based on the surrogate losses be

$$\mathcal{R}(f_1) := E\left[\ell_1(f_1) + \ell_2(f_1)\right].$$

In the following inequalities, "$\lesssim$" indicates that the left-hand side is no larger than the right-hand side for all $n$ up to a universal constant.

Lemma 3.1.

Under Assumptions B.1–B.3 and 1–2, when $\mu_{1n}, \kappa_{1n} \to 0$, $\sigma_{1n}^2 \to \infty$, and $\kappa_{1n} \le 1$ for all $n$, for any constants $\delta > 0$ and $0 < \nu < 2$ and for all $\tau \ge 1$, we have $P\left(\mathcal{R}(\hat{f}_{1n}) - \mathcal{R}(f_1^*) \le \delta_{1n}(\tau)\right) \ge 1 - e^{-\tau}$, where

$$\delta_{1n}(\tau) := \lambda_{1n}^{-1/2} n^{-1/2} \tau\, \sigma_{1n}^{(1-\nu/2)(1+\delta)d} \gamma_{1n}^{\nu} + \lambda_{1n} \sigma_{1n}^{d} + \gamma_{1n} 2^{(2-d)qd/2} \sigma_{1n}^{-qd}, \tag{5}$$

and $\gamma_{1n} := 1 + \mu_{1n} \kappa_{1n}^{-1}$.

The first term on the right-hand side of (5) is the estimation error, and the sum of the last two terms is the approximation error. A larger function class generally leads to a larger estimation error and a smaller approximation error, which corresponds to small penalty parameters $\lambda_{1n}$ and $\mu_{1n}$ with less restriction on the norm of the space or on the agreement with the secondary outcome ITR. The optimal rate that balances the estimation error and the approximation error is

$$\delta_{1n}(\tau) = \gamma_{1n}^{\frac{2q\nu + 2\Delta + 1}{3q + 2\Delta + 1}} n^{-\frac{q}{3q + 2\Delta + 1}}, \tag{6}$$

where $\Delta = (1 - \nu/2)(1 + \delta)$. Note that when $\mu_{1n} = 0$, FITR degenerates to SepL learned with the logistic loss. In this case, the optimal convergence rate is

$$\delta_{1n}^{(0)}(\tau) = n^{-\frac{q}{3q + 2\Delta + 1}}. \tag{7}$$

The optimal tuning parameters are provided in Section B.2.

Then we can present the convergence rate of the agreement rate $P(\hat{f}_{1n} f_2^* > 0)$.

Theorem 3.2.

Under Assumptions B.1–B.3 and 1–3, the agreement rate between $\hat{f}_{1n}$ and $f_2^*$ satisfies

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \lesssim \delta_{1n}(\tau) \mu_{1n}^{-1} + \tilde{\delta}_{2N_2}(\tau) \tag{8}$$

with probability at least 12eτ.

Now we compare the agreement rates with and without the fusion penalty. Note that for SepL,

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \le P(\hat{f}_{1n} f_1^* < 0) \lesssim n^{-\frac{\alpha}{2-\alpha} \cdot \frac{q}{3q + 2\Delta + 1}} \tag{9}$$

with high probability under the additional Assumptions B.4 and B.5 (see details in Section B.2). On the other hand, for FITR-Ramp, (8) and (6) imply

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \lesssim \mu_{1n}^{-1} \gamma_{1n}^{\frac{2q\nu + 2\Delta + 1}{3q + 2\Delta + 1}} n^{-\frac{q}{3q + 2\Delta + 1}} + \tilde{\delta}_{2N_2}(\tau) \tag{10}$$

with high probability. The inequalities (9) and (10) suggest that the disagreement rate of SepL converges faster than that of FITR-Ramp when the data are fully separated by the decision boundary, i.e., when $\alpha = 1$. However, FITR-Ramp has a faster convergence rate when

$$\begin{cases} \mu_{1n}^{3q - 2q\nu} \kappa_{1n}^{2q\nu + 2\Delta + 1} \gtrsim n^{-\frac{2-2\alpha}{2-\alpha} q}, & \text{if } \mu_{1n} \gtrsim \kappa_{1n}, \\[1ex] \mu_{1n} \gtrsim n^{-\frac{2-2\alpha}{2-\alpha} \cdot \frac{q}{3q + 2\Delta + 1}}, & \text{otherwise}, \end{cases} \tag{11}$$

and $\tilde{f}_2$ is learned using SepL with $N_2 \gg n$. The requirement on the sample size $N_2$ is consistent with the conclusions in the transfer learning literature (Li et al., 2021; Tian and Feng, 2022), which also require the sample size of the auxiliary data to be much larger than that of the target data. The relationship (11) holds when, for example, $\alpha = 1/2$, $\nu = 3/2$, $\delta = 1$, $\Delta = 1/2$, $\min\{\mu_{1n}, \kappa_{1n}\} \gtrsim n^{-2q/(9q+6)}$, and the sample size $N_2 \gtrsim n^3 \min\{\mu_{1n}, \kappa_{1n}\}^{(9q+6)/q} \gg n$. That is, the data are not fully separated, and the tuning parameters $\mu_{1n}, \kappa_{1n}$ are not decreasing too fast. Besides, (10) does not rely on Assumptions B.4 and B.5. The former is Tsybakov's noise assumption, and the latter assumes that, for each patient, at least one treatment can yield a positive mean reward.

When $K \ge 2$, Theorem 3.2 can be generalized as follows.

Theorem 3.3.

Under Assumptions B.1–B.3 and 1–3, the agreement rate between $\hat{f}_{1n}$ and any $f_k^*$ for $k \ge 2$ satisfies

$$P(f_1^* f_k^* > 0) - P(\hat{f}_{1n} f_k^* > 0) \lesssim \delta_{1n}(\tau) \mu_{1n}^{-1} + \sum_{k=2}^K \tilde{\delta}_{kN_k}(\tau)$$

with probability at least $1 - Ke^{-\tau}$.

4. SIMULATION STUDY

In the simulation study, we learn the FITR for the primary outcome using the proposed methods FITR-Ramp and FITR-IntL, comparing them with the baseline method SepL. Suppose the ITR $\tilde{f}_{kN_k}$ for the secondary outcome $R^k$ is estimated with SepL using an external dataset of sample size $N_k$. Even though the parameter dimension of the Gaussian kernel in the RKHS depends on the sample size, the values of $n$ and $N_k$ can differ between $\hat{f}_{1n}$ and $\tilde{f}_{kN_k}$, since the fusion penalty relies on the secondary outcome ITRs only through their treatment suggestions. The weight $\Omega_{1k}$ is chosen to be the Pearson correlation between $R^1$ and $R^k$. Our experiments show that Spearman's rank correlation (Zar, 2014) yields similar results. Further implementation details can be found in Section A.1.
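The correlation-based choice of the weights can be computed directly; below is a sketch with illustrative names (negative weights would later be handled by flipping the sign of the corresponding secondary rule, as described in Section 2.1):

```python
import numpy as np

def similarity_weights(R1, R_secondary):
    """Omega_1k as the Pearson correlation between the primary outcome R1
    and each secondary outcome R_k, k = 2, ..., K."""
    return np.array([np.corrcoef(R1, Rk)[0, 1] for Rk in R_secondary])
```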

Simulation Settings.

We assume the data are collected from a randomized controlled trial with $\pi(1; X_i) = \pi(-1; X_i) = 0.5$ for all $i = 1, \ldots, n$. The $k$th outcome is defined as $R_i^k = m_k(X_i) + T_k(X_i, A_i) + \epsilon_k(X_i, A_i)$, where $m_k$ is the main effect, $T_k$ is the interaction effect between the covariates and the treatment, and $\epsilon_k$ is the noise term. By choosing different $T_k$, we allow both linear and nonlinear treatment rules. We let the dimension of covariates be $d = 10$ (see details in Section A.2). The simulation process is repeated 400 times under each scenario.

The sample size of the external dataset, $N_k$, is set to $rn$ for $k \ge 2$, with $r$ taking values in $\{0, 1, 2, 4, 8, \infty\}$. In the case of $r = 0$, we do not estimate $\tilde{f}_{kN_k}$, and FITR degenerates to SepL. When $r = \infty$, we set $\tilde{f}_{kN_k} = f_k^*$, which represents the scenario where the true ITR $f_k^*$ is known.

We first assess performance with $K = 2$ and $n = 200$. Consider the following four scenarios. In all scenarios, the main effects are defined to be nonlinear functions of $X$:

$$m_1(X) = 1 + 2X_1 + X_2^2 + X_1 X_2,$$
$$m_2(X) = 1 + 2X_1^2 + 1.5X_2 + 0.5X_1 X_2.$$

The interaction terms are defined as follows:

  • In the linear scenarios S1 and S2,
    $T_1(X, A) = 0.5A(0.2 - X_1 - 2X_2)$,
    $T_2(X, A) = 0.8A(0.2 - X_1 - \gamma_1 X_2)$,
    where $\gamma_1 = 1.8$ in S1 and $\gamma_1 = 1.4$ in S2.
  • In the nonlinear scenarios S3 and S4,
    $T_1(X, A) = 1.0A(2.2 - e^{X_1} - e^{X_2})$,
    $T_2(X, A) = 1.5A(\gamma_2 - e^{X_1} - e^{X_2})$,
    where $\gamma_2 = 2.3$ in S3 and $\gamma_2 = 2.4$ in S4.
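For concreteness, a data-generating sketch for scenario S1 is given below. The coefficients follow one plausible reading of the interaction-effect formulas above (the extraction dropped some signs), and the noise scale is illustrative, so treat the specifics as assumptions rather than the paper's exact specification:

```python
import numpy as np

def generate_s1(n, d=10, seed=0):
    """Illustrative data generator in the spirit of scenario S1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, d))
    A = rng.choice([-1, 1], size=n)        # randomized trial, pi(a; x) = 0.5
    # Main effects m_1, m_2 as defined above.
    m1 = 1 + 2 * X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]
    m2 = 1 + 2 * X[:, 0] ** 2 + 1.5 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
    # Interaction effects (one reading of the S1 formulas; gamma_1 = 1.8).
    t1 = 0.5 * A * (0.2 - X[:, 0] - 2.0 * X[:, 1])
    t2 = 0.8 * A * (0.2 - X[:, 0] - 1.8 * X[:, 1])
    R1 = m1 + t1 + rng.normal(scale=0.5, size=n)   # noise scale is illustrative
    R2 = m2 + t2 + rng.normal(scale=0.5, size=n)
    return X, A, R1, R2
```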

The true agreement rate $P(f_1^* f_2^* > 0)$ is 98.6% for S1, 94.5% for S2, 96.4% for S3, and 92.8% for S4. Table A.1 presents the true optimal value functions for each scenario.

Simulation Results.

Our experiments show that the linear kernel performs better in linear scenarios, while the Gaussian kernel is more effective in nonlinear scenarios. We present results using only the most suitable kernel for each scenario.

The average disagreement rate $P(\hat{f}_{1n} f_2^* < 0)$ across all replications, along with its standard deviation, is plotted in Figure 1 for different $N_k$. As expected, the disagreement rate remains unchanged as $N_k$ increases when $\hat{f}_{1n}$ is learned by SepL. In contrast, the fusion penalty's effect on reducing the disagreement rate becomes more significant with increasing $N_k$. Specifically, FITR-Ramp is better in the linear scenarios, while FITR-IntL outperforms it in the nonlinear scenarios. This distinction may be attributed to the fact that FITR-IntL imposes a heavier penalty on the disagreement rate when the parameter dimension is high. The improvement in the agreement rate slows down after $N_k/n \ge 2$, indicating that a moderately small external dataset is sufficient for learning an FITR with performance comparable to using the true $f_2^*$. As the difference between outcomes increases, the disagreement rate increases for all methods, but an improvement with FITR is still observed.

Figure 1: The mean and standard deviation of the disagreement rate between the FITR $\hat{f}_{1n}$, learned using SepL, FITR-Ramp, or FITR-IntL, and the true secondary outcome ITR $f_2^*$. The dashed line represents the true disagreement rate $P(f_1^* f_2^* < 0)$.

In Figure 2, we evaluate FITR's ability to generate treatment suggestions for the primary outcome. We present two metrics: (a) the root mean square error (RMSE) of the value function $\mathcal{V}_1(\hat{f}_{1n})$ relative to the optimal value function $\mathcal{V}_1(f_1^*)$, and (b) the average misclassification rate $P(\hat{f}_{1n} f_1^* < 0)$ along with its standard deviation. As an additional advantage of the proposed method, the primary outcome itself is improved by the FITR with the help of the secondary outcome. Specifically, FITR-Ramp in S1 and S2 and FITR-IntL in S3 and S4 achieve the minimum RMSE and misclassification rate among all methods. When the discrepancy between outcomes increases, the performance of SepL remains unaffected, but the improvement by FITR-Ramp and FITR-IntL is reduced.

Figure 2: (a) The RMSE of the value functions and (b) the mean and standard deviation of the misclassification rate.

Additional results for different values of $K$ and $n$ are available in Section A.3. When $K = 2$ and $n = 100$ in scenarios S3 and S4 and the Gaussian kernel is used, FITR-Ramp rather than FITR-IntL yields the minimum RMSE of the value function. This indicates that FITR-IntL may be unstable when the sample size is small and the parameter dimension is high. Furthermore, in cases where $K = 3$, the disagreement rate, RMSE, and misclassification rate decrease further. The FITR performs better for the primary outcome as long as either $R^2$ or $R^3$ is close to $R^1$, even if $f_2^*$ and $f_3^*$ differ from $f_1^*$ in opposite directions, suggesting the advantage of incorporating multiple secondary outcomes. In practice, cross-validation can be used to choose the best model and kernel.

Sensitivity Analysis.

To demonstrate the influence of the similarity between outcomes on the estimated FITRs, we fix the primary outcome and vary the secondary outcome when $K = 2$. Specifically, we use the same main effect and noise term as in S1 and let the interaction effects be

$$T_1(X, A) = 0.5A(0.2 - X_1 - 2X_2),$$
$$T_2(X, A) = 0.8A(0.2 - X_1 - 2\rho X_2),$$

where $\rho$ controls the similarity between $R^1$ and $R^2$. In Table A.3 we summarize different values of $\rho$ and their corresponding true agreement rates. The performance metrics, including the disagreement rate, RMSE, and misclassification rate, are presented for $n = 200$ and $N_2 = 4n$. The table shows that the fusion penalty consistently increases the agreement rate between $\hat{f}_{1n}$ and $f_2^*$. However, the value function and accuracy of the FITRs exceed those of SepL only when $\rho \ge 0.5$. Although $\mu_{1n}$ tuned by cross-validation should ideally be zero when the outcomes are substantially different, practical constraints such as a small sample size may result in $\mu_{1n} > 0$, leading to decreased value functions. This suggests that the true agreement rate should be no less than 87% in this scenario for the fusion penalty to have a positive impact on the treatment suggestions for the primary outcome.

5. REAL DATA ANALYSIS

We apply our proposed methods to a study of major depressive disorder (MDD) (Trivedi et al., 2016), which randomized MDD patients to the selective serotonin reuptake inhibitor sertraline (SERT) or placebo (PBO).

Description of the Dataset.

The metric for depressive symptoms is the Quick Inventory of Depressive Symptomatology (QIDS) score, with a lower score indicating better relief of symptoms. Seven covariates are used for learning ITRs, including the NEO-Five Factor Inventory score (NEO), Flanker Interference Accuracy score (Flanker), sex, age, education years, Edinburgh Handedness Inventory (EHI) score, and the QIDS score at the baseline (see details in Section A.5).

The primary outcome is the reduction in QIDS at the end of the treatment phase compared to the baseline (QIDS-change), with a larger positive value indicating a more desirable improvement in symptoms. In addition to the primary outcome, we examine two secondary outcomes. One is the clinician-assessed Clinical Global Improvement scale (CGI) as an overall assessment of treatment effect, with a smaller score suggesting better improvement (Busner and Targum, 2007). The other is the Social Adjustment Scale (SAS), which evaluates the impact of a person's mental health difficulties, with a higher score indicating greater impairment. We consider the QIDS-change and the negative values of CGI and SAS as the three outcomes. The dataset contains 186 patients with complete information. Further details can be found in Section A.5.

Analysis Results.

The weights $\Omega_{1k}$ for $k = 2, 3$ are set to the Pearson correlation between the primary and secondary outcomes, which equals 0.72 for CGI and 0.19 for SAS. Our experiments show that the linear kernel outperforms the Gaussian kernel on this dataset, so we present the results of the linear kernel. We examine the agreement rates between the ITRs of the primary outcome and the secondary outcomes in the upper panel of Table 1. Since we do not have access to the true optimal secondary outcome ITRs, the agreement rate is calculated based on the ITRs $\tilde{f}_{CGI}$ and $\tilde{f}_{SAS}$ estimated with SepL. The table suggests that FITR-IntL and FITR-Ramp increase the agreement rate compared to SepL, especially between QIDS-change and CGI.

Table 1:

The upper panel shows the agreement rates between the FITR $\hat{f}_{QIDS}$ of the primary outcome QIDS-change, learned from SepL, FITR-IntL, or FITR-Ramp, and the secondary outcome ITRs $\tilde{f}_{CGI}$ and $\tilde{f}_{SAS}$. The lower panel shows the estimated value functions of the OSFA strategy, SepL, FITR-IntL, and FITR-Ramp for the primary outcome.

Agreement rates between ITRs of primary and secondary outcomes

Secondary ITR f˜CGI f˜SAS

Primary ITR learned from SepL 0.807 0.694
FITR-IntL 0.903 0.737
FITR-Ramp 0.887 0.720

Value functions from 400 replicates

All SERT All PBO SepL FITR-IntL FITR-Ramp

8.000 6.438 8.869 (0.286) 9.082 (0.317) 9.015 (0.327)

We also assess the value functions of the primary outcome for FITR-IntL, FITR-Ramp, SepL, and the one-size-fits-all (OSFA) strategy (see the lower panel of Table 1). The value function 𝒱1(f1) is estimated with the normalized inverse probability weighted estimator

$$\frac{\sum_{i=1}^{n} R_{i1}\, 1\{A_i = \operatorname{sign} f_1(X_i)\} / \pi_i(A_i; X_i)}{\sum_{i=1}^{n} 1\{A_i = \operatorname{sign} f_1(X_i)\} / \pi_i(A_i; X_i)}, \tag{12}$$

which is a consistent estimator. The mean and standard deviation of the value functions across 400 replications are reported. Due to the limited sample size, we use 5-fold cross-validation in each replication, averaging the estimated value function across the 5 validation sets to obtain the estimate for that replication. For OSFA, we directly calculate the mean reward on the subpopulation that received the desired treatment, which is equivalent to (12) since πi = 0.5 for all i. The tuning parameters are selected using 4-fold cross-validation. The results suggest that, compared to OSFA and SepL, FITR-IntL and FITR-Ramp improve the value function for QIDS-change when assisted by the secondary outcomes. FITR-IntL performs best, increasing the value function of SepL by approximately one standard deviation.
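The normalized IPW estimator in (12) is straightforward to implement. The sketch below is illustrative rather than the authors' code; it assumes treatments coded as ±1 and a known constant propensity (0.5 in this randomized trial), and the variable names are hypothetical.

```python
import numpy as np

def value_ipw(rewards, treatments, decision_scores, propensity=0.5):
    """Normalized inverse probability weighted estimate of the value
    function: the weighted mean reward over subjects whose observed
    treatment A_i matches the rule's recommendation sign(f(X_i))."""
    follows = (treatments == np.sign(decision_scores)).astype(float)
    weights = follows / propensity
    # Normalizing by the total weight (rather than by n) stabilizes
    # the estimate when the followed subsample is small.
    return np.sum(rewards * weights) / np.sum(weights)
```

With a constant propensity the weights cancel, so the estimate reduces to the mean reward among subjects whose observed treatment agrees with the recommendation, consistent with the OSFA computation described above.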

To compare the ITRs learned by the different methods, we present the coefficients of each variable in Table A.4. We observe that f˜QIDS and f˜CGI, when both are estimated using SepL without the fusion penalty, share the same signs of their coefficients. This indicates that QIDS-change and negative CGI indeed favor similar treatments. In addition, for QIDS-change, the signs of the coefficients from SepL, FITR-Ramp, and FITR-IntL are all consistent, suggesting that the fusion penalty does not change the ITR substantially relative to SepL.

6. DISCUSSION

In this work, we have proposed a method to optimize the primary outcome while also approximating the optimal secondary outcome ITRs as closely as possible. A fusion penalty is introduced to encourage the ITRs to generate similar treatment recommendations. We demonstrate theoretically and numerically that the agreement rate between the FITR for the primary outcome and the ITR for a secondary outcome converges to its true value faster than under SepL. Moreover, the simulation and real data studies show that the ITRs learned by FITR-Ramp and FITR-IntL have higher value functions and accuracy than SepL when the true optimal ITRs are closely aligned.

The proposed fusion penalty is a general method that can be easily extended to any variant of outcome-weighted learning (Zhao et al., 2015; Chen et al., 2016; Gao et al., 2022; Ma et al., 2023), a classification-based approach that estimates an ITR as the sign of a decision function. There are several promising directions worth further research. One possible extension is to let the tuning parameters λ and μ depend on the covariates X, since the desired fusion level should vary when the optimal ITRs for the primary and secondary outcomes differ substantially for specific patients. Another interesting direction is to derive a minimax bound for the convergence rate of the agreement rate; such a bound might be tighter than the current results presented in Section 3. Nevertheless, the upper bounds for SepL and FITR are currently derived under the same conditions, allowing for a meaningful comparison between them.

Supplementary Material

Supplement

Acknowledgements

The authors express their gratitude to the anonymous reviewers for their insightful comments and valuable suggestions. This research was supported in part by US NIH grants GM124104, NS073671, and MH123487.

Footnotes

Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238. Copyright 2024 by the author(s).

Contributor Information

Daiqi Gao, Harvard University.

Yuanjia Wang, Columbia University.

Donglin Zeng, University of Michigan.

References

  1. Bartlett PL, Jordan MI, and McAuliffe JD (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
  2. Brown CH, Brincks A, Huang S, Perrino T, Cruden G, Pantin H, Howe G, Young JF, Beardslee W, Montag S, et al. (2018). Two-year impact of prevention programs on adolescent depression: An integrative data analysis approach. Prevention Science, 19(1):74–94.
  3. Busner J and Targum SD (2007). The clinical global impressions scale: applying a research tool in clinical practice. Psychiatry (Edgmont), 4(7):28.
  4. Cai TT and Wei H (2021). Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. Annals of Statistics, 49(1):100–128.
  5. Chen G, Zeng D, and Kosorok MR (2016). Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association, 111(516):1509–1521.
  6. Claggett B, Xie M, and Tian L (2014). Meta-analysis with fixed, unknown, study-specific parameters. Journal of the American Statistical Association, 109(508):1660–1671.
  7. Curran PJ and Hussong AM (2009). Integrative data analysis: the simultaneous analysis of multiple data sets. Psychological Methods, 14(2):81.
  8. Fang EX, Wang Z, and Wang L (2022). Fairness-oriented learning for optimal individualized treatment rules. Journal of the American Statistical Association, pages 1–14.
  9. Fletcher R (1987). Practical Methods of Optimization. John Wiley & Sons.
  10. Gao D, Liu Y, and Zeng D (2022). Non-asymptotic properties of individualized treatment rules from sequentially rule-adaptive trials. Journal of Machine Learning Research, 23(1):11362–11403.
  11. Haidich A-B (2010). Meta-analysis in medical research. Hippokratia, 14(Suppl 1):29.
  12. Huang J, Ma S, Li H, and Zhang C-H (2011). The sparse Laplacian shrinkage estimator for high-dimensional regression. Annals of Statistics, 39(4):2021.
  13. Laber EB, Lizotte DJ, and Ferguson B (2014). Set-valued dynamic treatment regimes for competing outcomes. Biometrics, 70(1):53–61.
  14. Laber EB, Wu F, Munera C, Lipkovich I, Colucci S, and Ripa S (2018). Identifying optimal dosage regimes under safety constraints: An application to long term opioid treatment of chronic pain. Statistics in Medicine, 37(9):1407–1418.
  15. Le Thi HA and Pham Dinh T (2018). DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68.
  16. Li S, Cai T, and Li H (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):149–173.
  17. Lin D-Y and Zeng D (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321–332.
  18. Liu D, Liu RY, and Xie M (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association, 110(509):326–340.
  19. Liu Y, Wang Y, Kosorok MR, Zhao Y, and Zeng D (2018). Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Statistics in Medicine, 37(26):3776–3788.
  20. Lizotte DJ and Laber EB (2016). Multi-objective Markov decision processes for data-driven decision support. Journal of Machine Learning Research, 17(1):7378–7405.
  21. Luckett DJ, Laber EB, Kim S, and Kosorok MR (2021). Estimation and optimization of composite outcomes. Journal of Machine Learning Research, 22:167–1.
  22. Ma H, Zeng D, and Liu Y (2022). Learning individualized treatment rules with many treatments: A supervised clustering approach using adaptive fusion. Advances in Neural Information Processing Systems, 35:15956–15969.
  23. Ma H, Zeng D, and Liu Y (2023). Learning optimal group-structured individualized treatment rules with many treatments. Journal of Machine Learning Research, 24(102):1–48.
  24. Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355.
  25. Powell MJ (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155–162.
  26. Press WH, Teukolsky SA, Vetterling WT, and Flannery BP (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
  27. Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180.
  28. Qiu X (2018). Statistical Learning Methods for Personalized Medicine. Columbia University.
  29. Qiu X, Zeng D, and Wang Y (in press). Integrative learning to combine individualized treatment rules from multiple randomized trials. In Precision Medicines: Methods and Applications. Springer.
  30. Ruder S (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  31. Shi C, Fan A, Song R, and Lu W (2018). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3):925.
  32. Steinwart I and Scovel C (2007). Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35(2):575–607.
  33. Thall PF, Sung H-G, and Estey EH (2002). Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association, 97(457):29–39.
  34. Tian Y and Feng Y (2022). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association.
  35. Trivedi MH, McGrath PJ, Fava M, Parsey RV, Kurian BT, Phillips ML, Oquendo MA, Bruder G, Pizzagalli D, Toups M, et al. (2016). Establishing moderators and biosignatures of antidepressant response in clinical care (EMBARC): Rationale and design. Journal of Psychiatric Research, 78:11–23.
  36. Wang Y, Fu H, and Zeng D (2018). Learning optimal personalized treatment rules in consideration of benefit and risk: with an application to treating type 2 diabetes patients with insulin therapies. Journal of the American Statistical Association, 113(521):1–13.
  37. Zar JH (2014). Spearman rank correlation: overview. Wiley StatsRef: Statistics Reference Online.
  38. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018.
  39. Zhang C, Chen J, Fu H, He X, Zhao Y-Q, and Liu Y (2020). Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica Sinica, 30:1857.
  40. Zhao Y, Laber EB, Ning Y, Saha S, and Sands BE (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research, 20(1):1821–1843.
  41. Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118.
  42. Zhao Y-Q, Zeng D, Laber EB, and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598.
  43. Zhou X, Mayer-Hamblett N, Khan U, and Kosorok MR (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187.
