Author manuscript; available in PMC: 2024 Oct 4.
Published in final edited form as: Proc Mach Learn Res. 2024 May;238:712–720.

Fusing Individualized Treatment Rules Using Secondary Outcomes

Daiqi Gao 1, Yuanjia Wang 2, Donglin Zeng 3
PMCID: PMC11450767  NIHMSID: NIHMS1968478  PMID: 39371406

Abstract

An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.

1. INTRODUCTION

An individualized treatment rule (ITR) is a decision rule that recommends treatments to patients based on their pre-treatment covariates, such as demographics, medical history, and health status. Many methods have been proposed to estimate the optimal ITR using data from a randomized controlled trial or an observational study. The goal is to find the ITR that would yield the maximal primary outcome if patients follow the treatment recommendations. Regression-based approaches start with estimating Q-functions, which are the expected outcomes associated with each treatment option, and the optimal treatment is determined by maximizing the Q-functions. Exemplary methods include Q-learning (Qian and Murphy, 2011; Ma et al., 2022) and A-learning (Murphy, 2003; Shi et al., 2018). Alternative methods directly search for the ITR that maximizes the mean reward, also called the value function, using either an inverse probability weighted estimator (Zhao et al., 2012; Chen et al., 2016; Zhang et al., 2020; Gao et al., 2022; Ma et al., 2023) or a doubly robust estimator (Zhang et al., 2012; Zhou et al., 2017; Liu et al., 2018; Zhao et al., 2019). However, all these methods focus on a single primary outcome when estimating the ITRs.

In practical settings, it is important to consider certain secondary outcomes. Non-favorable secondary outcomes potentially indicate worsened overall health or an increased risk of treatment non-compliance. Therefore, when estimating the optimal ITR for the primary outcome, it is crucial to ensure that the ITR also optimizes the secondary outcomes to the greatest extent. For example, for major depressive disorder (MDD; Trivedi et al., 2016), one common outcome to measure depressive symptoms is the Quick Inventory of Depressive Symptomatology (QIDS) score, a rating scale of the patient's emotional state over the past seven days. Another important outcome is the Clinical Global Improvement (CGI) scale, which is often used to assess a patient's symptoms, behavior, and the impact on the patient's ability to function; it is an indicator of overall clinical improvement. Although the primary goal in many studies is to identify the optimal treatment strategy for improving the QIDS score, it is equally important for such a strategy to be effective for the CGI scale.

In this work, we have a twofold objective: to estimate the optimal ITR for the primary outcome and to ensure that the resulting treatment decision closely aligns with the optimal treatment for the secondary outcomes. Specifically, we aim to fuse the treatment rules for different outcomes to provide a unified ITR that performs optimally for the primary outcome and effectively for the secondary outcomes, even if it may not be optimal for the latter. To achieve this goal, we introduce a novel fused learning framework to estimate the optimal ITR, namely the fused individualized treatment rule (FITR). Under this framework, we maximize the value function for the primary outcome. At the same time, we incorporate a fusion penalty to encourage agreement between the estimated ITR and the optimal ITRs for the secondary outcomes. The latter are assumed to be obtained a priori using external data or the data from the same study. The fusion penalty is designed as a weighted sum of the disagreement rates between the treatment rules, with weights determined based on the similarity of treatment effects on these outcomes. Therefore, this penalty encourages consistency between the ITRs for the different outcomes.

Related Work.

Several approaches have been proposed to learn ITRs that handle multiple outcomes simultaneously. Some of these approaches (Wang et al., 2018; Laber et al., 2018; Fang et al., 2022) estimated the optimal ITR for the primary outcome while restricting the secondary outcome to be no less (or no larger, if the outcome is a risk outcome) than a threshold. However, these approaches require the threshold to be pre-specified and only ensure that the average value of the secondary outcome is controlled, not its value for any specific individual. Another class of methods (Thall et al., 2002; Luckett et al., 2021) constructed expert-derived or data-driven patient-specific composite outcomes for learning the ITRs; however, their ITRs may not be optimal for the primary outcome of interest. In addition, Laber et al. (2014) and Lizotte and Laber (2016) proposed constructing a set-valued regime for competing outcomes to generate non-inferior outcome vectors, but such regimes do not recommend a single treatment, limiting their use in practice. Our goal is to learn an optimal ITR for the primary outcome while ensuring closeness to the optimal rules for the secondary outcomes, which is distinct from all existing approaches for learning multi-outcome ITRs.

Our problem is also fundamentally different from the literature that combines multiple studies in analysis. Meta-analysis (Haidich, 2010; Lin and Zeng, 2010; Claggett et al., 2014; Liu et al., 2015) combines the estimators across multiple studies efficiently. Integrative data analysis (Curran and Hussong, 2009; Brown et al., 2018) considers multiple data sources or summary statistics using shared parameter models. Transfer learning (Li et al., 2021; Cai and Wei, 2021; Tian and Feng, 2022) uses existing knowledge from one task to assist in learning a new task. However, these existing methods often require individual-level data or assume specific models or distributions for each study to achieve integration, and so they are not applicable to combining different treatment rules.

Main Contributions.

Our paper introduces several significant contributions. (1) We propose a fusion penalty to encourage the primary outcome ITR to closely agree with the optimal secondary outcome ITRs. The proposed method relies on the secondary outcomes only through their treatment rules; we do not require individual-level data for a secondary outcome when such a rule is already available. In addition, there is no restriction on the form of the decision function, since the fusion penalty is applied directly to the disagreement rate. The proposed method is not affected by the magnitude of the decision function or the function space in which the secondary outcome ITRs are estimated. (2) To efficiently solve for the FITR, we propose two algorithms that use surrogate losses to substitute the noncontinuous, nonconvex value function and fusion penalty in the objective function. (3) Theoretically, we prove that the agreement rate of the estimated FITR learned with the surrogate loss converges to the true agreement rate, and the convergence rate is faster than that of the approach that ignores the secondary outcomes. We also obtain the convergence rates for the value function and misclassification rate of FITR. (4) In the numeric experiments, we show that the agreement rate indeed converges faster to the true rate. Moreover, when the true optimal ITRs of different outcomes are closely aligned, the learned FITR actually yields a higher value function and accuracy for the primary outcome compared to traditional methods. By leveraging the secondary outcomes as useful side information, the proposed method has the capability to improve the treatment decisions for multiple outcomes simultaneously.

2. METHODOLOGY

The vector $X \in \mathcal{X} \subset \mathbb{R}^d$ represents pre-treatment covariates for tailoring treatment decisions, where $d$ is the dimension of $X$. The treatment $A \in \mathcal{A} = \{-1, 1\}$ is assumed to be binary. The primary outcome is denoted by $R^1$. Without loss of generality, we assume that a higher value of this outcome indicates a better health condition. Denote $R^1(a)$ as the potential outcome under treatment $a$. For the primary outcome $R^1$, an ITR $\mathcal{D}_1 : \mathcal{X} \to \mathcal{A}$ maps a patient's covariates to a treatment suggestion. It can also be expressed as $\mathcal{D}_1(X) = \operatorname{sign}\{f_1(X)\}$ for some decision function $f_1 : \mathcal{X} \to \mathbb{R}$. Define $\mathcal{V}_1(f_1) := E[R^1(\mathcal{D}_1(X))]$ as the value function associated with $f_1$, which is the average potential outcome when the treatments follow the ITR $\mathcal{D}_1 = \operatorname{sign}\{f_1\}$ for all $X$. Our goal is to estimate the optimal ITR that maximizes this value function. The optimal ITR $\mathcal{D}_1^*$ is given as $\operatorname{sign}\{f_1^*\}$, where $f_1^*(x) = E[R^1(1) \mid X = x] - E[R^1(-1) \mid X = x]$.

We make standard assumptions in causal inference, including ignorability, consistency, and positivity (see Section B.1), so that the optimal ITR can be estimated from a sample of $n$ i.i.d. observations of $(X, A, R^1)$. Under these conditions, according to Qian and Murphy (2011), we have $\mathcal{V}_1(f_1) = E[R^1 \mathbf{1}\{A f_1(X) > 0\}/\pi(A; X)]$, where $\pi(A; X)$ is the probability of taking treatment $A$ given covariates $X$ in the data. Thus, the optimal ITR can be estimated by solving

$$\min_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{R_i^1 \mathbf{1}\{A_i f_1(X_i) < 0\}}{\pi(A_i; X_i)} + \lambda_{1n} \|f_1\|^2,$$

where $\mathcal{F}$ is some function class, $\|f_1\|$ is the semi-norm for $f_1$ in its function space, and $\lambda_{1n}$ is a tuning parameter that depends on the sample size $n$.

Separate Learning (SepL)

Since minimizing a loss function with the 0–1 loss is computationally challenging, the 0–1 loss can be substituted by some convex surrogate loss (Bartlett et al., 2006), denoted by $\phi(t)$. For example, Zhao et al. (2012) proposed using the hinge loss $\phi(t) = \max(1-t, 0)$. In our implementation, we propose using the logistic loss $\phi(t) = \log(1 + e^{-t})$ due to its differentiability and computational stability. To further reduce the variability of the value function estimator, we replace $R_i^1$ by $R_i^1 - E[R_i^1 \mid X_i]$, which does not change the optimal ITR. The conditional expectation $E[R_i^1 \mid X_i]$ can be estimated by regressing $R_i^1$ on $X_i$ with any regression method, e.g., simple linear regression. Moreover, we simultaneously flip the signs of $R_i^1 - E[R_i^1 \mid X_i]$ and $A_i$ when $R_i^1 - E[R_i^1 \mid X_i]$ is negative (cf. Liu et al., 2018). This allows us to formulate the problem as a convex optimization problem

$$\tilde{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|R_i^1 - E[R_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{R_i^1 - E[R_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2.$$

We refer to this method as separate learning (SepL).
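The SepL recipe above (residualize the reward, flip the signs of residual and treatment for negative residuals, then minimize the weighted logistic surrogate) can be sketched for a linear decision function as follows. This is a minimal illustration, not the authors' implementation; `fit_sepl` and its arguments are hypothetical names:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sepl(X, A, R, prop, lam):
    """Sketch of SepL with a linear decision function f(x) = x @ beta + b."""
    # Variance reduction: remove E[R | X] via a simple linear regression on X.
    Z = np.column_stack([np.ones(len(X)), X])
    resid = R - Z @ np.linalg.lstsq(Z, R, rcond=None)[0]
    w = np.abs(resid) / prop          # nonnegative weights |resid| / pi(A; X)
    A_eff = A * np.sign(resid)        # flip the treatment label when resid < 0

    def objective(theta):
        b, beta = theta[0], theta[1:]
        f = X @ beta + b
        # logistic surrogate phi(t) = log(1 + exp(-t)), computed stably
        return np.mean(w * np.logaddexp(0.0, -A_eff * f)) + lam * beta @ beta

    res = minimize(objective, np.zeros(X.shape[1] + 1), method="BFGS")
    b, beta = res.x[0], res.x[1:]
    return lambda Xnew: np.sign(Xnew @ beta + b)   # the estimated ITR
```

A gradient-based solver suffices here because the logistic surrogate makes the objective smooth and convex in the linear coefficients.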

2.1. Learning Fused ITR Using Optimal Rules for Secondary Outcomes

The secondary outcomes are denoted as $R^2, \ldots, R^K$ for $K \ge 2$. Suppose we have obtained the ITRs $\tilde{f}_2, \ldots, \tilde{f}_K$ for the secondary outcomes either from external data or from the same study. Our goal is to estimate the optimal ITR for the primary outcome while encouraging it to be as consistent as possible with these secondary outcome ITRs. To this end, we propose a fusion penalty on the disagreement rates between $f_1$ and $\tilde{f}_2, \ldots, \tilde{f}_K$. Specifically, we estimate the fused individualized treatment rule (FITR) $f_1$ by minimizing

$$\frac{1}{n} \sum_{i=1}^n \frac{R_i^1}{\pi(A_i; X_i)} \mathbf{1}\{A_i f_1(X_i) < 0\} + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \mathbf{1}\{f_1(X_i) \tilde{f}_k(X_i) < 0\}, \tag{1}$$

where $\Omega_{1k}$ is a nonnegative pre-specified constant that reflects the prior knowledge of the similarity between the primary outcome and each secondary outcome $R^k$. For example, $\Omega_{1k}$ can be expert-specified or defined as the correlation between $R^1$ and $R^k$. If $\Omega_{1k}$ is negative, we can simultaneously flip the signs of $\Omega_{1k}$ and $\tilde{f}_k$ to encourage consistency between $f_1$ and $\tilde{f}_k$. The tuning parameter $\mu_{1n}$ of the fusion penalty, which rescales $\Omega_{1k}$ for all $k$, is determined data-adaptively using cross-validation. It is selected to maximize the estimated value function of $f_1$, thus automatically quantifying the similarity between the outcomes.

The fusion penalty is related to the Laplacian penalty (Huang et al., 2011), which is used to learn multiple models simultaneously while encouraging similarity among them. With $L$ as the Laplacian matrix and $f := (f_1, \ldots, f_K)$ as the vector of decision functions for the $K$ outcomes, the Laplacian penalty can be defined as $\frac{1}{n} \sum_{i=1}^n [\mathbf{1}\{f(X_i) > 0\} L \mathbf{1}\{f(X_i) > 0\}^T + \mathbf{1}\{f(X_i) < 0\} L \mathbf{1}\{f(X_i) < 0\}^T]$, with the indicators applied elementwise. It prompts $f_k(X_i)$ and $f_j(X_i)$ to have the same sign when $L_{kj} < 0$ for $k, j = 1, \ldots, K$, $k \ne j$. This is equivalent to (1) if we use the same data to learn FITRs for all outcomes simultaneously.

FITR-Ramp.

The optimization problem in (1) is known to be NP-hard. To tackle this problem, we substitute the 0–1 losses with smooth loss functions for optimization purposes. The 0–1 loss in the first part of the expression is substituted by the logistic loss as in SepL. For the 0–1 loss in the fusion penalty, we use the ramp loss $\psi_\kappa(t) = \min\{1, \max\{0, 1 - t/\kappa\}\}$, where $\kappa$ is a tuning parameter; the ramp loss converges to the 0–1 loss as $\kappa$ decreases to zero. With the same variance reduction and negative reward handling trick, we propose the FITR-Ramp method, which estimates $f_1$ by

$$\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|R_i^1 - E[R_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{R_i^1 - E[R_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \psi_{\kappa_{1n}}\left(f_1(X_i) \tilde{f}_k(X_i)\right). \tag{2}$$

We allow κ1n to depend on the sample size n.
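To make the smoothing concrete, the ramp loss and the resulting smoothed fusion penalty can be sketched as below. The function names and array layout are illustrative, not from the paper:

```python
import numpy as np

def ramp_loss(t, kappa):
    """Ramp loss psi_kappa(t) = min{1, max{0, 1 - t/kappa}}.

    Outside the band [0, kappa] it equals the 0-1 loss 1{t <= 0}, and it
    converges pointwise to the 0-1 loss as kappa -> 0 (for t != 0).
    """
    return np.minimum(1.0, np.maximum(0.0, 1.0 - t / kappa))

def fusion_penalty(f_vals, rules_sec, omega, mu, kappa):
    """Smoothed penalty (mu/n) * sum_i sum_k Omega_1k * psi(f1(X_i) f_k(X_i)).

    f_vals:    array (n,) of primary decision-function values f1(X_i)
    rules_sec: array (K-1, n) of secondary-rule signs in {-1, +1}
    omega:     array (K-1,) of nonnegative similarity weights Omega_1k
    """
    loss = ramp_loss(f_vals[None, :] * rules_sec, kappa)      # shape (K-1, n)
    return mu * np.mean(omega[:, None] * loss) * rules_sec.shape[0]
```

A small `kappa` makes the penalty track the disagreement rate closely at the cost of a steeper, harder-to-optimize surface.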

FITR-IntL.

Alternatively, we can solve the optimization problem in (1) using the following procedure. Notice that the disagreement between $f_1$ and $\tilde{f}_k$ can be expressed through the disagreement between $A_i$ and $f_1$, since

$$\begin{aligned} \mathbf{1}\{f_1(X_i) \tilde{f}_k(X_i) < 0\} &= \mathbf{1}\{A_i f_1(X_i) < 0\} \mathbf{1}\{A_i \tilde{f}_k(X_i) > 0\} + \left[1 - \mathbf{1}\{A_i f_1(X_i) < 0\}\right] \mathbf{1}\{A_i \tilde{f}_k(X_i) < 0\} \\ &= \mathbf{1}\{A_i f_1(X_i) < 0\} \operatorname{sign}\{A_i \tilde{f}_k(X_i)\} + \mathbf{1}\{A_i \tilde{f}_k(X_i) < 0\}. \end{aligned}$$

The last term does not depend on $f_1$ given the observed data. Therefore, the problem in (1) is equivalent to $\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\tilde{R}_i^1}{\pi(A_i; X_i)} \mathbf{1}\{A_i f_1(X_i) < 0\} + \lambda_{1n} \|f_1\|^2$, where

$$\tilde{R}_i^1 = R_i^1 + \mu_{1n} \pi(A_i; X_i) \sum_{k=2}^K \Omega_{1k} \operatorname{sign}\{A_i \tilde{f}_k(X_i)\}$$

is the pseudo outcome. We then substitute the indicator function by the logistic loss and estimate f^1n by

$$\hat{f}_{1n} = \operatorname*{arg\,min}_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{\left|\tilde{R}_i^1 - E[\tilde{R}_i^1 \mid X_i]\right|}{\pi(A_i; X_i)} \cdot \phi\left(A_i \operatorname{sign}\{\tilde{R}_i^1 - E[\tilde{R}_i^1 \mid X_i]\} f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 \tag{3}$$

with the same variance reduction and negative reward handling trick. We refer to this optimization method as FITR-IntL. It is worth noting that a similar procedure was originally proposed by Qiu et al. (in press) and Qiu (2018) to integrate treatment rules from multiple studies for the same outcome, without theoretical justification, whereas our focus is on integrating multiple outcomes.
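The pseudo-outcome construction is a one-liner in practice. Below is a minimal sketch under the notation above (names are illustrative); after this augmentation, FITR-IntL reduces to running SepL on the pseudo outcomes:

```python
import numpy as np

def pseudo_outcome(R, A, prop, rules_sec, omega, mu):
    """Pseudo outcome for FITR-IntL:
        R_tilde_i = R_i + mu * pi(A_i; X_i) * sum_k Omega_1k * sign(A_i f_k(X_i)).

    rules_sec: array (K-1, n) of secondary-rule recommendations in {-1, +1}.
    The reward is bumped up when treatment A_i agrees with a secondary rule
    and bumped down when it disagrees.
    """
    agree = np.sign(A[None, :] * rules_sec)       # +1 if rule k agrees with A_i
    return R + mu * prop * (omega[:, None] * agree).sum(axis=0)
```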

Implementation.

In SepL, FITR-Ramp, and FITR-IntL, the function space for $f_1$ is usually chosen as a reproducing kernel Hilbert space (RKHS) associated with a real-valued kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The choice of $k$ can be the linear kernel $k(x, x') = x^T x'$, which yields a linear decision function for $f_1$, or the Gaussian kernel $k(x, x') = \exp(-\sigma_{1n}^2 \|x - x'\|_2^2)$, which yields a nonlinear decision function, where $\sigma_{1n}$ is a parameter depending on $n$. By the representer theorem, the minimizer for $f_1$ takes the form $f(X) = \sum_{i=1}^n \alpha_i k(X, X_i)$, so the optimization problems can be restricted to the class $\mathcal{F} := \{f : f(X) = \sum_{i=1}^n \alpha_i k(X, X_i), (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n\}$. The function class is nonparametric when the RKHS with the Gaussian kernel is used. When $\mu_{1n}, \kappa_{1n} \to 0$ and $\sigma_{1n}^2 \to \infty$ as $n \to \infty$, this nonparametric method is asymptotically equivalent to the unconstrained problem without penalties. Therefore, the learned FITR is asymptotically optimal even when the non-constant shift $E(R \mid X)$ and the fusion penalty are introduced.
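The representer-theorem expansion with a Gaussian kernel can be sketched as follows (illustrative helper names; the kernel matches the formula above):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """k(x, x') = exp(-sigma^2 * ||x - x'||_2^2), computed pairwise."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma ** 2 * d2)

def make_decision_function(alpha, X_train, sigma):
    """By the representer theorem, f(x) = sum_i alpha_i k(x, X_i),
    so the learned rule is fully determined by the coefficient vector alpha."""
    return lambda Xnew: gaussian_kernel(Xnew, X_train, sigma) @ alpha
```

In this parameterization the RKHS penalty $\|f_1\|^2$ equals `alpha @ K @ alpha` for the training kernel matrix `K`, so all three methods optimize over the $n$-dimensional vector `alpha`.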

For computation, FITR-Ramp can be solved by the difference of convex algorithm (DCA) (Le Thi and Pham Dinh, 2018) or the Powell algorithm (Powell, 1964; Press et al., 2007). DCA re-expresses the objective function as the difference between two convex functions, whereas the Powell algorithm is suitable for non-differentiable objective functions. In contrast, FITR-IntL has a differentiable convex objective function, so it can be easily solved by gradient-based algorithms, including variations of gradient descent (Ruder, 2016) or the BFGS algorithm (Fletcher, 1987). The tuning parameters $\lambda_{1n}, \mu_{1n}, \kappa_{1n}$ can be selected by cross-validation. The computation time of one replication when $n = 200$ is about 0.003 seconds for SepL, 0.003 seconds for FITR-IntL, and 0.086 seconds for FITR-Ramp when using the linear kernel, and about 0.457 seconds for SepL, 0.356 seconds for FITR-IntL, and 2.744 seconds for FITR-Ramp when using the Gaussian kernel.

3. THEORETICAL RESULTS

In this section, we provide theoretical results about the agreement rate between $\hat{f}_{1n}$ and $f_2^*$ in FITR-Ramp. See Section B.1 in the supplementary material for the convergence rates of the value function $\mathcal{V}_1(\hat{f}_{1n})$ and the misclassification rate $P(\hat{f}_{1n} f_1^* < 0)$.

Without loss of generality, we assume that $\tilde{f}_k : \mathcal{X} \to \{-1, 1\}$ is a binary mapping learned from a dataset of size $N_k$ for all $k = 2, \ldots, K$, since every decision function $f$ can be transformed into a binary function by taking its sign. In this section, we assume that the conditional expectation of $R^1$ has already been removed and that the signs have been flipped if the remaining reward is negative. Then the problem becomes

$$\min_{f_1 \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \frac{R_i^1}{\pi(A_i; X_i)} \phi\left(A_i f_1(X_i)\right) + \lambda_{1n} \|f_1\|^2 + \frac{\mu_{1n}}{n} \sum_{i=1}^n \sum_{k=2}^K \Omega_{1k} \psi_{\kappa_{1n}}\left(f_1(X_i) \tilde{f}_k(X_i)\right).$$

We introduce the following assumptions to derive the convergence rate of the agreement rate.

Assumption 1.

Suppose that $0 \le R^1 \le c_1$ almost surely for some constant $c_1 > 0$.

Assumption 1 assumes that the primary outcome is nonnegative and bounded. Furthermore, with $r^{1(a)}(X) := E[R^1 \mid A = a, X] \ge 0$, we define

$$\eta(X) := \begin{cases} \dfrac{r^{1(1)}(X)}{\sum_{a \in \mathcal{A}} r^{1(a)}(X)}, & \text{if } \sum_{a \in \mathcal{A}} r^{1(a)}(X) > 0, \\[2ex] \dfrac{1}{2}, & \text{if } r^{1(a)}(X) = 0 \text{ for all } a \in \mathcal{A}. \end{cases}$$

The definition of $\eta(X)$ aligns with the probability $P(y = 1 \mid x)$ in the classification literature, where $x$ is the predictor and $y$ is the label. Denote the regions $\mathcal{X}_{-1} := \{x \in \mathcal{X} : \eta(x) < 1/2\}$, $\mathcal{X}_1 := \{x \in \mathcal{X} : \eta(x) > 1/2\}$, and $\mathcal{X}_0 := \{x \in \mathcal{X} : \eta(x) = 1/2\}$. Finally, we define a distance function $x \mapsto \omega_x$ as

$$\omega_x := \begin{cases} d(x, \mathcal{X}_0 \cup \mathcal{X}_1), & \text{if } x \in \mathcal{X}_{-1}, \\ d(x, \mathcal{X}_0 \cup \mathcal{X}_{-1}), & \text{if } x \in \mathcal{X}_1, \\ 0, & \text{otherwise}, \end{cases}$$

where d(x,𝒮) denotes the distance of x to a set 𝒮 with respect to the Euclidean norm.

Assumption 2.

Assume that there exists a constant q>0 for the distribution of X such that

$$\int_{\mathcal{X}} \exp\left(-\frac{\omega_x^2}{t}\right) P_{\mathcal{X}}(dx) \le C t^{qd/2} \tag{4}$$

for all $t > 0$ and some constant $C > 0$. We say $q = \infty$ if the inequality holds for all $q > 0$.

Note that Assumption 2 is used to bound the approximation error when we estimate ITRs in an RKHS and the logistic loss is used. It is stronger than the geometric noise assumption (Steinwart and Scovel, 2007) in the sense that $|2\eta(x) - 1|$ is not included in the left-hand side of (4), and $|2\eta(x) - 1| \le 1$. However, when $t \to 0$, we can still ensure that the left-hand side of (4) goes to zero.

Assumption 3.

For any $k = 2, \ldots, K$, assume that the estimator of the secondary outcome ITR $\tilde{f}_k$ converges to $f_k^*$ in the sense that $P(\tilde{f}_k f_k^* < 0) \le \tilde{\delta}_{kN_k}(\tau)$ with probability at least $1 - e^{-\tau}$, where $N_k$ is the sample size of the dataset for learning $\tilde{f}_k$ and $\tilde{\delta}_{kN_k}(\tau) = o(1)$ as $N_k \to \infty$.

Assumption 3 is used to quantify the agreement rate between $\tilde{f}_k$ and its corresponding optimal ITR $f_k^*$. It is a weak assumption that only requires $\tilde{f}_k$ to converge to the true optimal ITR, and it can be satisfied by general methods for learning ITRs. We provide the convergence rate $\tilde{\delta}_{kN_k}(\tau)$ in the remark of Corollary B.2 when $\tilde{f}_k$ is learned by SepL.

We first focus on the case when $K = 2$. That is, there exists only one secondary treatment rule $\tilde{f}_2$, which has already been estimated from an external or the same dataset. Define the first and second parts of the loss function as

$$\ell_1(f_1)(Z) := \frac{R^1}{\pi(A; X)} \phi\left(A f_1(X)\right) \quad \text{and} \quad \ell_2(f_1)(Z) := \mu_{1n} \Omega_{12} \psi_{\kappa_{1n}}\left(f_1(X) \tilde{f}_2(X)\right).$$

Let the risk based on the surrogate losses be

$$\mathcal{R}(f_1) := E\left[\ell_1(f_1) + \ell_2(f_1)\right].$$

In the following inequalities, "$\lesssim$" indicates that the left-hand side is no larger than the right-hand side for all $n$ up to a universal constant.

Lemma 3.1.

Under Assumptions B.1–B.3 and 1–2, when $\mu_{1n}, \kappa_{1n} \to 0$, $\sigma_{1n}^2 \to \infty$, and $\kappa_{1n} \le 1$ for all $n$, for any constants $\delta > 0$ and $0 < \nu < 2$ and for all $\tau \ge 1$, we have $P\left(\mathcal{R}(\hat{f}_{1n}) - \mathcal{R}(f_1^*) \le \delta_{1n}(\tau)\right) \ge 1 - e^{-\tau}$, where

$$\delta_{1n}(\tau) := \lambda_{1n}^{-1/2} n^{-1/2} \tau\, \sigma_{1n}^{(1-\nu/2)(1+\delta)d} \gamma_{1n}^{\nu} + \lambda_{1n} \sigma_{1n}^{d} + \gamma_{1n} 2^{(2-d)qd/2} \sigma_{1n}^{-qd}, \tag{5}$$

and $\gamma_{1n} := 1 + \mu_{1n} \kappa_{1n}^{-1}$.

The first term on the right-hand side of (5) is the estimation error, and the sum of the last two terms is the approximation error. A larger function class generally leads to a larger estimation error and a smaller approximation error, which corresponds to small penalty parameters $\lambda_{1n}$ and $\mu_{1n}$ with less restriction on the norm of the space or on the agreement with the secondary outcome ITR. The optimal rate that balances the estimation error and the approximation error is

$$\delta_{1n}(\tau) = \gamma_{1n}^{\frac{2q\nu + 2\Delta + 1}{3q + 2\Delta + 1}} n^{-\frac{q}{3q + 2\Delta + 1}}, \tag{6}$$

where $\Delta = (1 - \nu/2)(1 + \delta)$. Note that when $\mu_{1n} = 0$, FITR degenerates to SepL learned with the logistic loss. In this case, the optimal convergence rate is

$$\delta_{1n}^{(0)}(\tau) = n^{-\frac{q}{3q + 2\Delta + 1}}. \tag{7}$$

The optimal tuning parameters are provided in Section B.2.

Then we can present the convergence rate of the agreement rate $P(\hat{f}_{1n} f_2^* > 0)$.

Theorem 3.2.

Under Assumptions B.1–B.3 and 1–3, the agreement rate between $\hat{f}_{1n}$ and $f_2^*$ satisfies

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \lesssim \delta_{1n}(\tau) \mu_{1n}^{-1} + \tilde{\delta}_{2N_2}(\tau) \tag{8}$$

with probability at least 12eτ.

Now we compare the agreement rates with and without the fusion penalty. Note that for SepL,

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \le P(\hat{f}_{1n} f_1^* < 0) \lesssim n^{-\frac{\alpha}{2-\alpha} \cdot \frac{q}{3q + 2\Delta + 1}} \tag{9}$$

with high probability under the additional Assumptions B.4 and B.5 (see details in Section B.2). On the other hand, for FITR-Ramp, (8) and (6) imply

$$P(f_1^* f_2^* > 0) - P(\hat{f}_{1n} f_2^* > 0) \lesssim \mu_{1n}^{-1} \gamma_{1n}^{\frac{2q\nu + 2\Delta + 1}{3q + 2\Delta + 1}} n^{-\frac{q}{3q + 2\Delta + 1}} + \tilde{\delta}_{2N_2}(\tau) \tag{10}$$

with high probability. The inequalities (9) and (10) suggest that the disagreement rate of SepL converges faster than that of FITR-Ramp when the data are fully separated by the decision boundary, i.e., when $\alpha = 1$. However, FITR-Ramp has a faster convergence rate when

$$\begin{cases} \mu_{1n}^{3q - 2q\nu} \kappa_{1n}^{2q\nu + 2\Delta + 1} \gtrsim n^{-\frac{2-2\alpha}{2-\alpha} q}, & \text{if } \mu_{1n} \gtrsim \kappa_{1n}, \\[1ex] \mu_{1n} \gtrsim n^{-\frac{2-2\alpha}{2-\alpha} \cdot \frac{q}{3q + 2\Delta + 1}}, & \text{otherwise}, \end{cases} \tag{11}$$

and $\tilde{f}_2$ is learned using SepL with $N_2 \gg n$. The requirement on the sample size $N_2$ is consistent with the conclusions in the transfer learning literature (Li et al., 2021; Tian and Feng, 2022), which also require the sample size of the auxiliary data to be much larger than that of the target data. The relationship (11) holds when, for example, $\alpha = 1/2$, $\nu = 3/2$, $\delta = 1$, $\Delta = 1/2$, $\min\{\mu_{1n}, \kappa_{1n}\} \gtrsim n^{-2q/(9q+6)}$, and the sample size $N_2 \gtrsim n^3 \min\{\mu_{1n}, \kappa_{1n}\}^{(9q+6)/q} \gg n$. That is, the data are not fully separated, and the tuning parameters $\mu_{1n}, \kappa_{1n}$ are not decreasing too fast. Besides, (10) does not rely on Assumptions B.4 and B.5. The former is Tsybakov's noise assumption, and the latter assumes that, for each patient, at least one treatment can yield a positive mean reward.

When $K \ge 2$, Theorem 3.2 can be generalized as follows.

Theorem 3.3.

Under Assumptions B.1–B.3 and 1–3, the agreement rate between $\hat{f}_{1n}$ and any $f_k^*$ for $k \ge 2$ satisfies

$$P(f_1^* f_k^* > 0) - P(\hat{f}_{1n} f_k^* > 0) \lesssim \delta_{1n}(\tau) \mu_{1n}^{-1} + \sum_{k=2}^K \tilde{\delta}_{kN_k}(\tau)$$

with probability at least $1 - Ke^{-\tau}$.

4. SIMULATION STUDY

In the simulation study, we learn the FITR for the primary outcome using the proposed methods FITR-Ramp and FITR-IntL, comparing them with the baseline method SepL. Suppose the ITR $\tilde{f}_{kN_k}$ for the secondary outcome $R^k$ is estimated with SepL using an external dataset of sample size $N_k$. Even though the parameter dimension of the Gaussian kernel in the RKHS depends on the sample size, the values of $n$ and $N_k$ can differ between $\hat{f}_{1n}$ and $\tilde{f}_{kN_k}$, since the fusion penalty relies on the secondary outcome ITRs only through their treatment suggestions. The weight $\Omega_{1k}$ is chosen to be the Pearson correlation between $R^1$ and $R^k$. Our experiments show that Spearman's rank correlation (Zar, 2014) yields similar results. Further implementation details can be found in Section A.1.
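The correlation-based choice of the weights can be computed directly; below is a sketch with illustrative names (negative weights would later be handled by flipping the sign of the corresponding secondary rule, as described in Section 2.1):

```python
import numpy as np

def similarity_weights(R1, R_secondary):
    """Omega_1k as the Pearson correlation between the primary outcome R1
    and each secondary outcome R_k, k = 2, ..., K."""
    return np.array([np.corrcoef(R1, Rk)[0, 1] for Rk in R_secondary])
```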

Simulation Settings.

We assume the data are collected from a randomized controlled trial with $\pi(1; X_i) = \pi(-1; X_i) = 0.5$ for all $i = 1, \ldots, n$. The $k$th outcome is defined as $R_i^k = m_k(X_i) + T_k(X_i, A_i) + \epsilon_k(X_i, A_i)$, where $m_k$ is the main effect, $T_k$ is the interaction effect between the covariates and the treatment, and $\epsilon_k$ is the noise term. By choosing different $T_k$, we allow both linear and nonlinear treatment rules. We let the dimension of covariates be $d = 10$ (see details in Section A.2). The simulation process is repeated 400 times under each scenario.

The sample size of the external dataset, $N_k$, is set to $rn$ for $k \ge 2$, with $r$ taking values in $\{0, 1, 2, 4, 8, \infty\}$. In the case of $r = 0$, we do not estimate $\tilde{f}_{kN_k}$, and FITR degenerates to SepL. When $r = \infty$, we set $\tilde{f}_{kN_k} = f_k^*$, which represents the scenario where the true ITR $f_k^*$ is known.

We first assess performance with $K = 2$ and $n = 200$. Consider the following four scenarios. In all scenarios, the main effects are defined to be nonlinear functions of $X$:

$$m_1(X) = 1 + 2X_1 + X_2^2 + X_1 X_2,$$
$$m_2(X) = 1 + 2X_1^2 + 1.5X_2 + 0.5X_1 X_2.$$

The interaction terms are defined as follows:

  • In the linear scenarios S1 and S2,
    $T_1(X, A) = 0.5A(0.2 - X_1 - 2X_2)$,
    $T_2(X, A) = 0.8A(0.2 - X_1 - \gamma_1 X_2)$,
    where $\gamma_1 = 1.8$ in S1 and $\gamma_1 = 1.4$ in S2.
  • In the nonlinear scenarios S3 and S4,
    $T_1(X, A) = 1.0A(2.2 - e^{X_1} - e^{X_2})$,
    $T_2(X, A) = 1.5A(\gamma_2 - e^{X_1} - e^{X_2})$,
    where $\gamma_2 = 2.3$ in S3 and $\gamma_2 = 2.4$ in S4.
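For concreteness, a data-generating sketch for scenario S1 is given below. The coefficients follow one plausible reading of the interaction-effect formulas above (the extraction dropped some signs), and the noise scale is illustrative, so treat the specifics as assumptions rather than the paper's exact specification:

```python
import numpy as np

def generate_s1(n, d=10, seed=0):
    """Illustrative data generator in the spirit of scenario S1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, d))
    A = rng.choice([-1, 1], size=n)        # randomized trial, pi(a; x) = 0.5
    # Main effects m_1, m_2 as defined above.
    m1 = 1 + 2 * X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]
    m2 = 1 + 2 * X[:, 0] ** 2 + 1.5 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
    # Interaction effects (one reading of the S1 formulas; gamma_1 = 1.8).
    t1 = 0.5 * A * (0.2 - X[:, 0] - 2.0 * X[:, 1])
    t2 = 0.8 * A * (0.2 - X[:, 0] - 1.8 * X[:, 1])
    R1 = m1 + t1 + rng.normal(scale=0.5, size=n)   # noise scale is illustrative
    R2 = m2 + t2 + rng.normal(scale=0.5, size=n)
    return X, A, R1, R2
```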

The true agreement rate $P(f_1^* f_2^* > 0)$ is 98.6% for S1, 94.5% for S2, 96.4% for S3, and 92.8% for S4. Table A.1 presents the true optimal value functions for each scenario.

Simulation Results.

Our experiments show that the linear kernel performs better in linear scenarios, while the Gaussian kernel is more effective in nonlinear scenarios. We present results using only the most suitable kernel for each scenario.

The average disagreement rate $P(\hat{f}_{1n} f_2^* < 0)$ across all replications, along with its standard deviation, is plotted in Figure 1 for different $N_k$. As expected, the disagreement rate remains unchanged as $N_k$ increases when $\hat{f}_{1n}$ is learned by SepL. In contrast, the fusion penalty's effect on reducing the disagreement rate becomes more significant with increasing $N_k$. Specifically, FITR-Ramp is better in the linear scenarios, while FITR-IntL outperforms it in the nonlinear scenarios. This distinction may be attributed to the fact that FITR-IntL imposes a heavier penalty on the disagreement rate when the parameter dimension is high. The improvement in the agreement rate slows down after $N_k/n \ge 2$, indicating that a moderately small external dataset is sufficient for learning an FITR with performance comparable to using the true $f_2^*$. As the difference between outcomes increases, the disagreement rate increases for all methods, but an improvement with FITR is still observed.

Figure 1: The mean and standard deviation of the disagreement rate between the FITR $\hat{f}_{1n}$, learned using SepL, FITR-Ramp, or FITR-IntL, and the true secondary outcome ITR $f_2^*$. The dashed line represents the true disagreement rate $P(f_1^* f_2^* < 0)$.

In Figure 2, we evaluate FITR's ability to generate treatment suggestions for the primary outcome. We present two metrics: (a) the root mean square error (RMSE) of the value function $\mathcal{V}_1(\hat{f}_{1n})$ relative to the optimal value function $\mathcal{V}_1(f_1^*)$, and (b) the average misclassification rate $P(\hat{f}_{1n} f_1^* < 0)$ along with its standard deviation. As an additional advantage of the proposed method, the primary outcome itself is improved by the FITR with the help of the secondary outcome. Specifically, FITR-Ramp in S1 and S2 and FITR-IntL in S3 and S4 achieve the minimum RMSE and misclassification rate among all methods. When the discrepancy between outcomes increases, the performance of SepL remains unaffected, but the improvement by FITR-Ramp and FITR-IntL is reduced.

Figure 2: (a) The RMSE of the value functions and (b) the mean and standard deviation of the misclassification rate.

Additional results for different values of $K$ and $n$ are available in Section A.3. When $K = 2$ and $n = 100$ in scenarios S3 and S4 and the Gaussian kernel is used, FITR-Ramp rather than FITR-IntL yields the minimum RMSE of the value function. This indicates that FITR-IntL may be unstable when the sample size is small and the parameter dimension is high. Furthermore, in cases where $K = 3$, the disagreement rate, RMSE, and misclassification rate decrease further. The FITR performs better for the primary outcome as long as either $R^2$ or $R^3$ is close to $R^1$, even if $f_2^*$ and $f_3^*$ differ from $f_1^*$ in opposite directions, suggesting the advantage of incorporating multiple secondary outcomes. In practice, cross-validation can be used to choose the best model and kernel.

Sensitivity Analysis.

To demonstrate the influence of the similarity between outcomes on the estimated FITRs, we fix the primary outcome and vary the secondary outcome when $K = 2$. Specifically, we use the same main effect and noise term as in S1 and let the interaction effects be

$$T_1(X, A) = 0.5A(0.2 - X_1 - 2X_2),$$
$$T_2(X, A) = 0.8A(0.2 - X_1 - 2\rho X_2),$$

where $\rho$ controls the similarity between $R^1$ and $R^2$. In Table A.3 we summarize different values of $\rho$ and their corresponding true agreement rates. The performance metrics, including the disagreement rate, RMSE, and misclassification rate, are presented for $n = 200$ and $N_2 = 4n$. The table shows that the fusion penalty consistently increases the agreement rate between $\hat{f}_{1n}$ and $f_2^*$. However, the value function and accuracy of the FITRs exceed those of SepL only when $\rho \ge 0.5$. Although $\mu_{1n}$ tuned by cross-validation should ideally be zero when the outcomes are substantially different, practical constraints such as a small sample size may result in $\mu_{1n} > 0$, leading to decreased value functions. This suggests that the true agreement rate should be no less than 87% in this scenario for the fusion penalty to have a positive impact on the treatment suggestions for the primary outcome.

5. REAL DATA ANALYSIS

We apply our proposed methods to a study of major depressive disorder (MDD) (Trivedi et al., 2016), which randomized MDD patients to the selective serotonin reuptake inhibitor sertraline (SERT) or placebo (PBO).

Description of the Dataset.

The metric for depressive symptoms is the Quick Inventory of Depressive Symptomatology (QIDS) score, with a lower score indicating better relief of symptoms. Seven covariates are used for learning ITRs, including the NEO-Five Factor Inventory score (NEO), Flanker Interference Accuracy score (Flanker), sex, age, education years, Edinburgh Handedness Inventory (EHI) score, and the QIDS score at the baseline (see details in Section A.5).

The primary outcome is the reduction in QIDS at the end of the treatment phase compared to the baseline (QIDS-change), with a larger positive value indicating a more desirable improvement in symptoms. In addition to the primary outcome, we examine two secondary outcomes. One is the clinician-assessed Clinical Global Improvement scale (CGI) as an overall assessment of treatment effect, with a smaller score suggesting better improvement (Busner and Targum, 2007). The other is the Social Adjustment Scale (SAS), which evaluates the impact of a person's mental health difficulties, with a higher score indicating greater impairment. We consider the QIDS-change and the negative values of CGI and SAS as the three outcomes. The dataset contains 186 patients with complete information. Further details can be found in Section A.5.

Analysis Results.

The weights $\Omega_{1k}$ for $k = 2, 3$ are set to the Pearson correlation between the primary and secondary outcomes, which equals 0.72 for CGI and 0.19 for SAS. Our experiments show that the linear kernel outperforms the Gaussian kernel on this dataset, so we present the results of the linear kernel. We examine the agreement rates between the ITRs of the primary outcome and the secondary outcomes in the upper panel of Table 1. Since we do not have access to the true optimal secondary outcome ITRs, the agreement rate is calculated based on the ITRs $\tilde{f}_{CGI}$ and $\tilde{f}_{SAS}$ estimated with SepL. The table suggests that FITR-IntL and FITR-Ramp increase the agreement rate compared to SepL, especially between QIDS-change and CGI.

Table 1:

The upper panel shows the agreement rates between the FITR $\hat{f}_{QIDS}$ of the primary outcome QIDS-change, learned from SepL, FITR-IntL, or FITR-Ramp, and the secondary outcome ITRs $\tilde{f}_{CGI}$ and $\tilde{f}_{SAS}$. The lower panel shows the estimated value functions of the OSFA strategy, SepL, FITR-IntL, and FITR-Ramp for the primary outcome.

Agreement rates between ITRs of primary and secondary outcomes

Secondary ITR f˜CGI f˜SAS

Primary ITR learned from SepL 0.807 0.694
FITR-IntL 0.903 0.737
FITR-Ramp 0.887 0.720

Value functions from 400 replicates

All SERT All PBO SepL FITR-IntL FITR-Ramp

8.000 6.438 8.869 (0.286) 9.082 (0.317) 9.015 (0.327)

We also assess the value functions of the primary outcome for FITR-IntL, FITR-Ramp, SepL, and the one-size-fits-all (OSFA) strategy (see the lower panel of Table 1). The value function 𝒱1(f1) is estimated with the normalized inverse probability weighted estimator

$$\frac{\sum_{i=1}^{n} R_{i1}\, 1\{A_i = \operatorname{sign} f_1(X_i)\} / \pi_i(A_i; X_i)}{\sum_{i=1}^{n} 1\{A_i = \operatorname{sign} f_1(X_i)\} / \pi_i(A_i; X_i)}, \tag{12}$$

which is a consistent estimator. The mean and standard deviation of the value functions across 400 replications are reported. Due to the limited sample size, we use 5-fold cross-validation in each replication, averaging the estimated value function across the 5 validation sets to obtain the estimate for that replication. For OSFA, we directly calculate the mean reward on the subpopulation that received the desired treatment, which is equivalent to (12) since πi = 0.5 for all i. The tuning parameters are selected using 4-fold cross-validation. The results suggest that, compared to OSFA and SepL, FITR-IntL and FITR-Ramp improve the value function for QIDS-change when assisted by the secondary outcomes. FITR-IntL performs best, increasing the value function of SepL by approximately one standard deviation.
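The normalized IPW estimator in (12) is straightforward to implement. The sketch below is illustrative rather than the authors' code; it assumes treatments coded as ±1 and a known constant propensity (0.5 in this randomized trial), and the variable names are hypothetical.

```python
import numpy as np

def value_ipw(rewards, treatments, decision_scores, propensity=0.5):
    """Normalized inverse probability weighted estimate of the value
    function: the weighted mean reward over subjects whose observed
    treatment A_i matches the rule's recommendation sign(f(X_i))."""
    follows = (treatments == np.sign(decision_scores)).astype(float)
    weights = follows / propensity
    # Normalizing by the total weight (rather than by n) stabilizes
    # the estimate when the followed subsample is small.
    return np.sum(rewards * weights) / np.sum(weights)
```

With a constant propensity the weights cancel, so the estimate reduces to the mean reward among subjects whose observed treatment agrees with the recommendation, consistent with the OSFA computation described above.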

To compare the ITRs learned by the different methods, we present the coefficients of each variable in Table A.4. We observe that f˜QIDS and f˜CGI, when both are estimated using SepL without the fusion penalty, share the same signs of their coefficients. This indicates that QIDS-change and negative CGI indeed favor similar treatments. In addition, for QIDS-change, the signs of the coefficients from SepL, FITR-Ramp, and FITR-IntL are all consistent, suggesting that the fusion penalty does not change the ITR substantially relative to SepL.

6. DISCUSSION

In this work, we have proposed a method to optimize the primary outcome while also approximating the optimal secondary outcome ITRs as closely as possible. A fusion penalty is introduced to encourage the ITRs to generate similar treatment recommendations. We demonstrate theoretically and numerically that the agreement rate between the FITR for the primary outcome and the ITR for a secondary outcome converges to its true value faster than under SepL. Moreover, the simulation and real data studies show that the ITRs learned by FITR-Ramp and FITR-IntL have higher value functions and accuracy than SepL when the true optimal ITRs are closely aligned.

The proposed fusion penalty is a general method that can be easily extended to any variant of outcome-weighted learning (Zhao et al., 2015; Chen et al., 2016; Gao et al., 2022; Ma et al., 2023), a classification-based approach that estimates an ITR as the sign of a decision function. There are several promising directions worth further research. One possible extension is to let the tuning parameters λ and μ depend on the covariates X, since the desired fusion level should vary when the optimal ITRs for the primary and secondary outcomes differ substantially for specific patients. Another interesting direction is to derive a minimax bound for the convergence rate of the agreement rate; such a bound might be tighter than the current results presented in Section 3. Nevertheless, the upper bounds for SepL and FITR are currently derived under the same conditions, allowing for a meaningful comparison between them.

Supplementary Material

Supplement

Acknowledgements

The authors express their gratitude to the anonymous reviewers for their insightful comments and valuable suggestions. This research was supported in part by US NIH grants GM124104, NS073671, and MH123487.

Footnotes

Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238. Copyright 2024 by the author(s).

Contributor Information

Daiqi Gao, Harvard University.

Yuanjia Wang, Columbia University.

Donglin Zeng, University of Michigan.

References

  1. Bartlett PL, Jordan MI, and McAuliffe JD (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
  2. Brown CH, Brincks A, Huang S, Perrino T, Cruden G, Pantin H, Howe G, Young JF, Beardslee W, Montag S, et al. (2018). Two-year impact of prevention programs on adolescent depression: An integrative data analysis approach. Prevention Science, 19(1):74–94.
  3. Busner J and Targum SD (2007). The clinical global impressions scale: applying a research tool in clinical practice. Psychiatry (Edgmont), 4(7):28.
  4. Cai TT and Wei H (2021). Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. Annals of Statistics, 49(1):100–128.
  5. Chen G, Zeng D, and Kosorok MR (2016). Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association, 111(516):1509–1521.
  6. Claggett B, Xie M, and Tian L (2014). Meta-analysis with fixed, unknown, study-specific parameters. Journal of the American Statistical Association, 109(508):1660–1671.
  7. Curran PJ and Hussong AM (2009). Integrative data analysis: the simultaneous analysis of multiple data sets. Psychological Methods, 14(2):81.
  8. Fang EX, Wang Z, and Wang L (2022). Fairness-oriented learning for optimal individualized treatment rules. Journal of the American Statistical Association, pages 1–14.
  9. Fletcher R (1987). Practical Methods of Optimization. John Wiley & Sons.
  10. Gao D, Liu Y, and Zeng D (2022). Non-asymptotic properties of individualized treatment rules from sequentially rule-adaptive trials. Journal of Machine Learning Research, 23(1):11362–11403.
  11. Haidich A-B (2010). Meta-analysis in medical research. Hippokratia, 14(Suppl 1):29.
  12. Huang J, Ma S, Li H, and Zhang C-H (2011). The sparse Laplacian shrinkage estimator for high-dimensional regression. Annals of Statistics, 39(4):2021.
  13. Laber EB, Lizotte DJ, and Ferguson B (2014). Set-valued dynamic treatment regimes for competing outcomes. Biometrics, 70(1):53–61.
  14. Laber EB, Wu F, Munera C, Lipkovich I, Colucci S, and Ripa S (2018). Identifying optimal dosage regimes under safety constraints: An application to long term opioid treatment of chronic pain. Statistics in Medicine, 37(9):1407–1418.
  15. Le Thi HA and Pham Dinh T (2018). DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68.
  16. Li S, Cai T, and Li H (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):149–173.
  17. Lin D-Y and Zeng D (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321–332.
  18. Liu D, Liu RY, and Xie M (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association, 110(509):326–340.
  19. Liu Y, Wang Y, Kosorok MR, Zhao Y, and Zeng D (2018). Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Statistics in Medicine, 37(26):3776–3788.
  20. Lizotte DJ and Laber EB (2016). Multi-objective Markov decision processes for data-driven decision support. Journal of Machine Learning Research, 17(1):7378–7405.
  21. Luckett DJ, Laber EB, Kim S, and Kosorok MR (2021). Estimation and optimization of composite outcomes. Journal of Machine Learning Research, 22:167–1.
  22. Ma H, Zeng D, and Liu Y (2022). Learning individualized treatment rules with many treatments: A supervised clustering approach using adaptive fusion. Advances in Neural Information Processing Systems, 35:15956–15969.
  23. Ma H, Zeng D, and Liu Y (2023). Learning optimal group-structured individualized treatment rules with many treatments. Journal of Machine Learning Research, 24(102):1–48.
  24. Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355.
  25. Powell MJ (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155–162.
  26. Press WH, Teukolsky SA, Vetterling WT, and Flannery BP (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
  27. Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180.
  28. Qiu X (2018). Statistical Learning Methods for Personalized Medicine. Columbia University.
  29. Qiu X, Zeng D, and Wang Y (in press). Integrative learning to combine individualized treatment rules from multiple randomized trials. In Precision Medicines: Methods and Applications. Springer.
  30. Ruder S (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  31. Shi C, Fan A, Song R, and Lu W (2018). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3):925.
  32. Steinwart I and Scovel C (2007). Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35(2):575–607.
  33. Thall PF, Sung H-G, and Estey EH (2002). Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association, 97(457):29–39.
  34. Tian Y and Feng Y (2022). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association.
  35. Trivedi MH, McGrath PJ, Fava M, Parsey RV, Kurian BT, Phillips ML, Oquendo MA, Bruder G, Pizzagalli D, Toups M, et al. (2016). Establishing moderators and biosignatures of antidepressant response in clinical care (EMBARC): Rationale and design. Journal of Psychiatric Research, 78:11–23.
  36. Wang Y, Fu H, and Zeng D (2018). Learning optimal personalized treatment rules in consideration of benefit and risk: with an application to treating type 2 diabetes patients with insulin therapies. Journal of the American Statistical Association, 113(521):1–13.
  37. Zar JH (2014). Spearman rank correlation: overview. Wiley StatsRef: Statistics Reference Online.
  38. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018.
  39. Zhang C, Chen J, Fu H, He X, Zhao Y-Q, and Liu Y (2020). Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica Sinica, 30:1857.
  40. Zhao Y, Laber EB, Ning Y, Saha S, and Sands BE (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research, 20(1):1821–1843.
  41. Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118.
  42. Zhao Y-Q, Zeng D, Laber EB, and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598.
  43. Zhou X, Mayer-Hamblett N, Khan U, and Kosorok MR (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187.
