Author manuscript; available in PMC: 2021 Mar 1.
Published in final edited form as: J Stat Plan Inference. 2019 Jul 4;205:115–128. doi: 10.1016/j.jspi.2019.05.008

A single-index model with multiple-links

Hyung Park a,*, Eva Petkova a,b, Thaddeus Tarpey a, R Todd Ogden c
PMCID: PMC7441812  NIHMSID: NIHMS1533650  PMID: 32831459

Abstract

In a regression model for treatment outcome in a randomized clinical trial, a treatment effect modifier is a covariate that has an interaction with the treatment variable, implying that treatment efficacy varies across values of such a covariate. In this paper, we present a method for determining, from a set of baseline covariates, a composite variable that can have a nonlinear association with the treatment outcome and acts as a composite treatment effect modifier. We introduce a parsimonious generalization of single-index models that targets the effect of the interaction between the treatment conditions and the vector of covariates on the outcome: a single-index model with multiple-links (SIMML) that estimates a single linear combination of the covariates (i.e., a single index) with treatment-specific nonparametric link functions. The approach focuses on the treatment-by-covariates interaction effects on the treatment outcome that are relevant for making optimal treatment decisions. Asymptotic results for the estimator are obtained under possible model misspecification. A treatment decision rule based on the derived single index is defined and compared to other methods for estimating optimal treatment decision rules. An application to a clinical trial for the treatment of depression is presented.

Keywords: Single-index models, Treatment effect modifier, Biosignature

1. Introduction

In precision medicine, a critical concern is to identify baseline measures that have distinct relationships with the outcome from different treatments so that patient-specific treatment decisions can be made [1, 2]. Such variables are called treatment effect modifiers, and these can be useful in determining a treatment decision rule that will select a treatment for a patient based on observations made at baseline. There is a growing need to extract treatment effect modifiers from (usually noisy) baseline patient data that, more and more commonly, consist of a large number of clinical and biological characteristics.

Typically, treatment effect modifiers (or, “moderators”) are identified either one by one, using one model for each potential predictor, or from a large model which includes all potential predictors and their (two-way) interactions with treatment, and then testing for significance of the interaction terms, almost exclusively using linear models. In the linear model context, [3] proposed a model using a linear combination (i.e., an index) of patients’ characteristics, termed a generated effect modifier (GEM) constructed to optimize the interaction with a treatment indicator. Such a composite variable approach is especially appealing for complex diseases such as psychiatric diseases, in which each baseline characteristic may only have a small treatment modifying effect. In such settings, it is not common to find variables that are individually strong moderators of treatment effects.

Here we present novel flexible methods for determining composite variables that permit a nonlinear association with the outcome. In particular, the proposed methods allow the conditional expectation of the outcomes to have a flexible treatment-specific link function with an index. We define the index to be a one-dimensional linear combination of the covariates. This approach is related to single-index models [4, 5, 6, 7, 8, 9, 10], as well as to single-index model generalizations such as projection pursuit regression [11] and multiple-index models [12, 13]. We employ a single projection of the covariates (i.e., an index) to summarize the variability of the baseline covariates, and multiple link functions to connect the derived single index to the treatment-specific mean responses; we call these single-index models with multiple-links (SIMML). The single-index model with multiple-links provides a parsimonious extension of the single-index model for modeling the effect of the interaction between a categorical treatment variable and a vector-valued covariate. The dependence of the treatment-specific outcomes on a common single index improves interpretability and helps in determining treatment decision rules. This approach generalizes the notion of a composite "treatment effect modifier" from the linear model setting to a nonparametric context, defining a nonparametric generated effect modifier.

2. A Single-index model with multiple-links (SIMML)

Let $X = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$ denote the vector of covariates. Let $T$ denote the categorical (treatment assignment) variable of interest, taking values in $\{1, \ldots, K\}$ with nonzero probabilities $(\pi_1, \ldots, \pi_K)$ that sum to one. Let $Y$ denote an outcome variable; without loss of generality, we assume that a higher value of $Y$ is preferred. We focus on data arising from a randomized experiment; however, the method can be extended to observational studies.

A common approach to interrogate the effect of the interaction between $X$ and the treatment indicator $T$ on an outcome is to fit a regression model separately for each of the $K$ treatment groups, as functions of $X$. For instance, a single-index model can be fitted separately for each treatment group $t$, resulting in $K$ indices, $\beta_t^\top X$, $t \in \{1, \ldots, K\}$. We refer to this as a $K$-index model; it has the form

$$E(Y \mid T = t, X = x) = g_t(\beta_t^\top x) \qquad (t = 1, \ldots, K), \tag{1}$$

where both the treatment-specific nonparametric link functions $g_t(\cdot)$ and the treatment-specific index vectors $\beta_t \in \mathbb{R}^p$ need to be estimated for each group $t$. (The vectors $\beta_t$ need to satisfy some identifiability condition ([14]).) While this is a reasonable approach, the $K$ indices of model (1) lack useful interpretation as effect modifiers and often lead to over-parametrization.

For parsimony and insight, the SIMML constrains the $\beta_t$ in (1) to be equal, and it requires separate nonparametrically defined curves for each treatment $t$ as a function of a single index $\alpha^\top X$ common for all $t$:

$$E(Y \mid T = t, X = x) = g_t(\alpha^\top x) \qquad (t = 1, \ldots, K), \tag{2}$$

where both the links $g_t$ and the vector $\alpha$ need to be estimated. The SIMML (2) provides a single parsimonious biosignature, $\alpha^\top X$. Due to the nonparametric nature of $g_t$, the scale of $\alpha$ is not identifiable in (2); to address this, we restrict $\alpha$ to be in $\Theta = \{\alpha = (\alpha_1, \ldots, \alpha_p)^\top \mid \sum_{j=1}^p \alpha_j^2 = 1,\ \alpha_p > 0\}$, i.e., to be in the upper hemisphere of the unit sphere.

If the true model for the treatment-specific outcome $Y_t$ is not a SIMML, then the SIMML can be regarded as the $L_2$ projection of the treatment-specific mean outcome $m_t(X) = E(Y_t \mid X)$ onto the single index $u = \alpha^\top X$,

$$g_t(u) = E\big(m_t(X) \mid \alpha^\top X = u\big) \qquad (t = 1, \ldots, K), \tag{3}$$

for each given α. Specifically, suppose the true treatment-specific model can be expressed as

$$Y_t = m_t(X) + \sigma_t(X)\,\epsilon \qquad (t = 1, \ldots, K), \tag{4}$$

in which $E(\epsilon \mid X) = 0$ and $E(\epsilon^2 \mid X) = 1$. Let $R(\alpha) = \sum_{t=1}^K \pi_t\, E\big(Y_t - g_t(\alpha^\top X)\big)^2$, where $g_t$ is defined in (3), and let

$$\alpha_0 := \operatorname*{arg\,min}_{\alpha \in \Theta} R(\alpha). \tag{5}$$

Then $\alpha_0$ can be shown to be the minimizer of the cross-entropy (e.g., [15]) between the SIMML (2) and the general model (4) under the Gaussian noise assumption. Here, the cross-entropy of an arbitrary distribution with probability density $f$, with respect to another reference distribution $P$, is defined as $-E_P(\log f)$, where the expectation is taken with respect to the distribution $P$. Model (3) evaluated at $\alpha_0$ can be viewed as the "projection" (in the sense of the closest point) of the true distribution $P$ in (4) onto the space $\Theta$ of SIMML distributions, using the Kullback–Leibler divergence as a distance measure.
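To make this connection concrete, a brief sketch (under the simplifying assumption of a common, constant noise variance $\sigma^2$ across treatment groups, and with $f_\alpha$ denoting the Gaussian density implied by the SIMML (2) with links (3)) is

$$E_P\big(-\log f_\alpha\big) = \sum_{t=1}^K \pi_t\, E\Big[\tfrac{1}{2}\log(2\pi\sigma^2) + \frac{\big(Y_t - g_t(\alpha^\top X)\big)^2}{2\sigma^2}\Big] = \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{R(\alpha)}{2\sigma^2},$$

so, for fixed $\sigma^2$, minimizing the cross-entropy over $\alpha \in \Theta$ is equivalent to minimizing $R(\alpha)$, and $\alpha_0$ in (5) is the cross-entropy minimizer.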

The SIMML (2) allows a visualization useful for characterizing differential treatment effects, varying with the single index $\alpha^\top X$. As $X \in \mathbb{R}^p$ varies, the mean response of model (2) changes only in the specific direction $\alpha \in \Theta$, and the effect of varying $X$, described by the link functions $g_t$, is different for each treatment condition $t \in \{1, \ldots, K\}$. Therefore, the single index can be viewed as a useful biosignature for describing differential treatment effects, provided $g_t \neq g_{t'}$ for at least one pair $t, t' \in \{1, \ldots, K\}$.

3. Estimation

While any smoothing technique can be used to approximate the unspecified smooth links $g_t(\cdot)$ in (2), in this paper we will focus on cubic B-splines. Specifically, $g_t(u) \approx \eta_t^\top Z_t(u)$ for some coefficients $\eta_t \in \mathbb{R}^{d_t}$. Here, $Z_t(u) = [B_1(u), \ldots, B_{d_t}(u)]^\top \in \mathbb{R}^{d_t}$ consists of a set of $d_t$ normalized cubic B-spline basis functions [16]. Let $n_t$ be the sample size for the $t$th treatment group and $n = \sum_{t=1}^K n_t$ denote the total sample size. Note that $d_t$ depends on $n_t$ (see Assumption 5 and [17]). For a given $\alpha$, let $\mathbf{Z}_{\alpha,t}$ denote the $n_t \times d_t$ B-spline evaluation matrix whose $i$th row is $Z_t(\alpha^\top X_{ti})^\top$, the B-spline evaluation of the $i$th individual from the $t$th treatment group. The subscript $\alpha$ in the matrix $\mathbf{Z}_{\alpha,t}$ highlights its dependence on $\alpha$. Without loss of generality, we assume that the outcome and the covariates are all centered at zero for each treatment group $t$, so that the model does not involve any intercept terms.
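As an illustration of this construction, the following is a minimal sketch (not the authors' implementation) of how the matrix $\mathbf{Z}_{\alpha,t}$ could be formed in Python; the helper name `bspline_design` and the quantile-based knot placement are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(u, n_basis=5, degree=3):
    """Evaluate n_basis cubic B-spline basis functions B_1, ..., B_{d_t} at the
    index values u, returning the (n_t x d_t) matrix Z_{alpha,t} whose ith row
    is Z_t(u_i).  Interior knots are placed at equally spaced quantiles of u
    (an illustrative choice); boundary knots are repeated degree + 1 times."""
    n_interior = n_basis - degree - 1            # N_t = d_t - 4 for cubic splines
    probs = np.linspace(0, 1, n_interior + 2)[1:-1]
    interior = np.quantile(u, probs) if n_interior > 0 else np.array([])
    lo, hi = np.min(u), np.max(u)
    knots = np.concatenate([[lo] * (degree + 1), interior, [hi] * (degree + 1)])
    eye = np.eye(n_basis)
    # kth column = basis function B_k evaluated at u
    return np.column_stack([BSpline(knots, eye[k], degree)(u) for k in range(n_basis)])

# Example: for treatment group t with covariate matrix X_t (n_t x p) and a given
# index vector alpha, the single index is u = X_t @ alpha and Z_{alpha,t} = bspline_design(u).
```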

For sample data, SIMML (2) can be represented by

$$[Y]_{n \times 1} = [\mathbf{Z}_\alpha]_{n \times (\sum_{t=1}^K d_t)}\,[\eta]_{(\sum_{t=1}^K d_t) \times 1} + [\epsilon]_{n \times 1}, \tag{6}$$

where $Y = [Y_1^\top, \ldots, Y_K^\top]^\top$ is the observed response vector with $Y_t \in \mathbb{R}^{n_t}$, $\mathbf{Z}_\alpha$ is the $n \times (\sum_{t=1}^K d_t)$ block-diagonal B-spline design matrix formed from the $\mathbf{Z}_{\alpha,t}$'s, $\eta = [\eta_1^\top, \ldots, \eta_K^\top]^\top$ is the B-spline coefficient vector, and $\epsilon = [\epsilon_1^\top, \ldots, \epsilon_K^\top]^\top$ is a mean-zero noise vector with covariance matrix $\sigma^2 I_n$.

For a given $\alpha$, we define the $n \times n$ single-index projection matrix to be $S_\alpha = \mathbf{Z}_\alpha(\mathbf{Z}_\alpha^\top \mathbf{Z}_\alpha)^{-1}\mathbf{Z}_\alpha^\top$. Assuming Gaussian noise and treating $\eta$ as a nuisance parameter, the negative "profile" log-likelihood of $\alpha$, up to a constant multiplier, is

$$Q(\alpha) = \|Y - S_\alpha Y\|^2. \tag{7}$$

We define the profile likelihood estimator of the index parameter α as

$$\hat\alpha = \operatorname*{arg\,min}_{\alpha \in \Theta} Q(\alpha). \tag{8}$$

Each link function gt(·) in (2) can be estimated by

$$\hat g_t(u) = Z_t(u)^\top \big(\mathbf{Z}_{\hat\alpha,t}^\top \mathbf{Z}_{\hat\alpha,t}\big)^{-1}\mathbf{Z}_{\hat\alpha,t}^\top Y_t \qquad (t = 1, \ldots, K), \tag{9}$$

where $\mathbf{Z}_{\hat\alpha,t}$ is $\mathbf{Z}_{\alpha,t}$ evaluated at $\alpha = \hat\alpha$.

To solve (8), we can perform a procedure that alternates between the following two steps: first, for a fixed $\alpha$, estimate each link function $g_t(\cdot)$ in (2) by (9), with $\hat\alpha$ replaced by the current value of $\alpha$; second, for fixed $\hat g_t(\cdot)$, perform iteratively reweighted least squares (IRLS) to approximately solve (8) for $\alpha$. These two steps can be iterated until convergence.
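To make the procedure concrete, here is a schematic sketch of the profile criterion (7) and its minimization. For simplicity, this sketch profiles out $\eta_t$ by per-group least squares and minimizes $Q(\alpha)$ with a generic derivative-free optimizer over the unit sphere rather than the IRLS update described above; it reuses the hypothetical `bspline_design` helper from the previous sketch.

```python
import numpy as np
from scipy.optimize import minimize

def profile_criterion(alpha, X_by_group, Y_by_group, n_basis=5):
    """Q(alpha) = sum_t || Y_t - S_{alpha,t} Y_t ||^2 (outcomes assumed centered)."""
    alpha = alpha / np.linalg.norm(alpha)              # keep alpha on the unit sphere
    rss = 0.0
    for X_t, Y_t in zip(X_by_group, Y_by_group):
        u = X_t @ alpha
        Z = bspline_design(u, n_basis)                 # n_t x d_t B-spline design matrix
        coef, *_ = np.linalg.lstsq(Z, Y_t, rcond=None)  # profile out eta_t for this group
        rss += np.sum((Y_t - Z @ coef) ** 2)
    return rss

def fit_simml(X_by_group, Y_by_group, n_basis=5, n_starts=10, seed=0):
    """Minimize Q(alpha) from several random starting directions."""
    rng = np.random.default_rng(seed)
    p = X_by_group[0].shape[1]
    best = None
    for _ in range(n_starts):
        a0 = rng.standard_normal(p)
        res = minimize(profile_criterion, a0 / np.linalg.norm(a0),
                       args=(X_by_group, Y_by_group, n_basis), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    alpha_hat = best.x / np.linalg.norm(best.x)
    if alpha_hat[-1] < 0:                              # sign convention: alpha_p > 0
        alpha_hat = -alpha_hat
    return alpha_hat
```

The estimated link functions $\hat g_t$ in (9) can then be recovered from the final per-group least-squares coefficients.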

4. Asymptotic theory

In this section, we establish the asymptotic properties of the profile estimator $\hat\alpha$ in (8) under possible misspecification, when the true model is assumed to be (4). Let us denote the $p$th component of the vector $\alpha_0$ in (5) by $\alpha_{0,p}$ (which is $> 0$, since $\alpha_0 \in \Theta$). By the completeness property of $\mathbb{R}$, we can always find some $c > 0$ such that $\alpha_{0,p} \ge c$, and therefore, without loss of generality, we may assume that $\alpha_0$ is in a compact set $\Theta_c = \{\alpha = (\alpha_1, \ldots, \alpha_p)^\top \in \mathbb{R}^p \mid \sum_{j=1}^p \alpha_j^2 = 1,\ \alpha_p \ge c\}$, with an appropriate choice of small $c > 0$. Further, to avoid complications from the restricted parameter space $\Theta_c$, we can instead consider the "$p$th component removed" version of $R(\alpha)$ in (5), as follows:

$$R(\alpha_{-p}) = R\Big(\alpha_1, \ldots, \alpha_{p-1}, \sqrt{1 - (\alpha_1^2 + \cdots + \alpha_{p-1}^2)}\,\Big), \tag{10}$$

where the vector $\alpha_{-p} = (\alpha_1, \alpha_2, \ldots, \alpha_{p-1})^\top \in \mathbb{R}^{p-1}$ lives inside the unit ball. Let the "$p$th component removed" value of $\alpha_0$ in (5) be denoted by $\alpha_{0,-p} \in \mathbb{R}^{p-1}$.

Similarly, let the "$p$th component removed" value of the corresponding profile estimator $\hat\alpha$ in (8) be denoted by $\hat\alpha_{-p} \in \mathbb{R}^{p-1}$. The following conditions are assumed for the asymptotic results.

Assumption 1. The objective function R(α−p) in (10) is locally convex at α0,−p, and its Hessian function, H(α−p) evaluated at α−p = α0,−p, is positive definite, with bounded eigenvalues.

Assumption 2. The underlying mean functions $m_t(X)$ in (4) are in $C^{(4)}(B_a^p)$, $t \in \{1, \ldots, K\}$, for some finite $a > 0$, where $B_a^p$ is the $p$-dimensional ball with center $0$ and radius $a$, and $C^{(q)}(B_a^p) = \{f : \text{the } q\text{th-order partial derivatives of } f \text{ are continuous in } B_a^p\}$.

Assumption 3. The probability density function of $X$ satisfies $f_X(x) \in C^{(4)}(B_a^p)$, and there exist constants $0 < c_f < C_f$ such that $c_f/\mathrm{Vol}_p(B_a^p) \le f_X(x) \le C_f/\mathrm{Vol}_p(B_a^p)$ if $x \in B_a^p$, and $f_X(x) = 0$ if $x \notin B_a^p$.

Assumption 4. The underlying noise $\epsilon$ in (4) satisfies $E(\epsilon \mid X) = 0$ with $E(\epsilon^2 \mid X) = 1$, and there exists a constant $C_\epsilon > 0$ such that $\sup_{x \in B_a^p} E(|\epsilon|^3 \mid X = x) < C_\epsilon$. For each group $t \in \{1, \ldots, K\}$, the standard deviation function $\sigma_t(x)$ is continuous in $B_a^p$, with $0 < c_{\sigma_t} \le \inf_{x \in B_a^p}\sigma_t(x) \le \sup_{x \in B_a^p}\sigma_t(x) \le C_{\sigma_t} < \infty$, for some constants $0 < c_{\sigma_t} < C_{\sigma_t}$.

Assumption 5. The number of interior knots, $N_t$ ($= d_t - 4$), in the cubic B-spline approximation of the link function $g_t(\cdot)$ for the $t$th treatment group satisfies $n_t^{1/6} \ll N_t \ll n_t^{1/5}(\log n_t)^{-2/5}$, $t \in \{1, \ldots, K\}$.

The first theorem establishes consistency of the estimator (8), and the second theorem establishes asymptotic normality of the estimator $\hat\alpha_{-p}$ for $\alpha_{0,-p}$.

Theorem 1. (Consistency) Under Assumptions 1 to 5, $\hat\alpha \to \alpha_0$ almost surely, where $\alpha_0$ is defined in (5).

Theorem 2. (Asymptotic Normality) Under Assumptions 1 to 5, $\sqrt{n}\,(\hat\alpha_{-p} - \alpha_{0,-p}) \to N(0, \Sigma_{\alpha_{0,-p}})$ in distribution, with asymptotic covariance matrix $\Sigma_{\alpha_{0,-p}} = H_{\alpha_{0,-p}}^{-1} W_{\alpha_{0,-p}} H_{\alpha_{0,-p}}^{-1}$, where the matrix $H_{\alpha_{0,-p}}$ is the Hessian matrix $H(\alpha_{-p}) = \frac{\partial^2}{\partial\alpha_{-p}\,\partial\alpha_{-p}^\top} R(\alpha_{-p})$ evaluated at $\alpha_{-p} = \alpha_{0,-p}$, and the matrix $W_{\alpha_{0,-p}}$ is defined in the Appendix.

The proofs of the theorems are given in the Appendix.

5. Simulation illustrations

5.1. Performance on estimating treatment decision rules

A treatment decision function, $D(X): \mathbb{R}^p \to \{1, \ldots, K\}$, mapping a subject's baseline characteristics $X \in \mathbb{R}^p$ to one of the $K$ available treatments, defines a treatment decision rule for the single decision time point [1, 2, 18, 19, 20]. Given covariates $X$, a treatment decision rule based on SIMML is $D(X) = \operatorname*{arg\,max}_{t \in \{1,\ldots,K\}} g_t(\alpha^\top X)$. We investigate the performance of estimated treatment decision rules of the form $D(X) = \operatorname*{arg\,max}_{t \in \{1,\ldots,K\}} E(Y \mid X, T = t)$, where the conditional expectation is obtained from various modeling procedures.
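A minimal sketch of the resulting SIMML-based rule (assuming the fitted link functions are available as callables, e.g., spline objects built from (9); the function name is illustrative):

```python
import numpy as np

def simml_decision_rule(X_new, alpha_hat, link_fns):
    """D(x) = argmax_t g_t(alpha'x); link_fns is a list of fitted g_t callables
    ordered by treatment label 1, ..., K."""
    u = X_new @ alpha_hat
    preds = np.column_stack([g(u) for g in link_fns])   # n x K predicted outcomes
    return preds.argmax(axis=1) + 1                      # treatments labeled 1..K
```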

In our simulation settings, the baseline covariate vector $X = (x_1, \ldots, x_p)^\top \sim N(0, \Psi_X)$, with $\Psi_X$ having 1's on the diagonal and 0.1 everywhere else. We consider $K = 2$ with different noise levels for the two treatment groups: $\epsilon_1 \sim N(0, 0.4^2)$ and $\epsilon_2 \sim N(0, 0.2^2)$. The outcome data are generated under the following fairly broad model

$$Y_t = \delta\, M(\mu^\top X; \nu) + C_t(\alpha^\top X; \omega) + \epsilon_t \qquad (t = 1, 2). \tag{11}$$

As a function of the index $\mu^\top X$, $M$ is referred to as the "main effect" of $X$. As functions of the other index $\alpha^\top X$, the $C_t$'s are referred to as the "contrast" functions that define the treatment-by-$X$ interaction. Here, we will use the parameters $\nu$ and $\omega$ to control the degree of non-linearity of $M$ and the $C_t$'s, respectively.

An optimal treatment decision rule depends only on the Ct's, not on M or the ϵt's. The parameter δ in (11) controls the relative contribution of the "signal" component Ct's to the variance in the outcomes, and is calibrated to obtain a relative contribution of 0.35. The contrast functions Ct's in (11) are set to

$$C_t(u;\omega) = \begin{cases} C_1(u;\omega) = +1 - \cos(0.5\pi\omega u) + 0.5(u - \omega) \\ C_2(u;\omega) = -1 + \cos(0.5\pi\omega u) - 0.5(u - \omega), \end{cases} \tag{12}$$

where, if ω = 0, then the Ct’s are linear functions; and they are more nonlinear for larger values of ω. We considered three cases, corresponding to linear (ω = 0), moderately nonlinear (ω = 0.5), and highly nonlinear (ω = 1) Ct’s, respectively, illustrated in the first three panels of Figure 1. We set the main effect function M in (11) to be

$$M(u;\nu) = 0.5u - \sin(0.5\pi\nu u),$$

where, as ν increases, the degree of nonlinearity in the main effect function M increases. We considered two cases: ν = 0, corresponding to a linear M, and ν = 1, corresponding to a nonlinear M, illustrated in the fourth and fifth panels of Figure 1. We set p = 5 and p = 10 with α = (1, …, 5) and α = (1, …, 10), respectively, each standardized to have norm one. We set μ to be proportional to a vector of 1's, standardized to have norm one. Two treatment groups were considered, with equal sample sizes n1 = n2 = 40. We used d1 = d2 = 5 B-spline basis functions to approximate the link functions. The treatment decision rules were based on the following regression models: (i) SIMML (2) estimated by maximizing the profile likelihood; (ii) the K-Index model (1) fitted separately for each treatment group by the B-spline approach of [17], denoted as K-Index; (iii) the linear GEM model ([3]) estimated under the criterion of maximizing the difference in the treatment-specific slopes, denoted as linGEM; and (iv) linear regression models fitted separately for each treatment group under the least squares criterion, denoted as K-LR. For each scenario, using the outcome Y from a simulated test set (of size $10^5$), we computed the proportion of correct decisions (PCD) of the treatment decision rules estimated by each method, and the methods were compared in terms of PCD using boxplots across the training datasets.
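For reference, the following sketch generates data from model (11) with the contrasts (12) and main effect M as reconstructed above; the calibration of δ to a 0.35 relative signal contribution is omitted, and δ is instead taken as a user-supplied constant.

```python
import numpy as np

def contrast(u, omega, t):
    """C_t(u; omega) from (12): linear when omega = 0, more nonlinear as omega grows."""
    c1 = 1.0 - np.cos(0.5 * np.pi * omega * u) + 0.5 * (u - omega)
    return c1 if t == 1 else -c1

def main_effect(u, nu):
    """M(u; nu): linear when nu = 0, nonlinear when nu = 1 (as reconstructed)."""
    return 0.5 * u - np.sin(0.5 * np.pi * nu * u)

def simulate(n_per_group=40, p=5, omega=0.5, nu=0, delta=1.0, seed=0):
    """Generate {(X_t, Y_t)} for the two treatment groups under model (11)."""
    rng = np.random.default_rng(seed)
    alpha = np.arange(1.0, p + 1.0); alpha /= np.linalg.norm(alpha)
    mu = np.ones(p) / np.sqrt(p)
    Sigma = np.full((p, p), 0.1); np.fill_diagonal(Sigma, 1.0)
    sd = {1: 0.4, 2: 0.2}                      # treatment-specific noise levels
    data = {}
    for t in (1, 2):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n_per_group)
        Y = (delta * main_effect(X @ mu, nu) + contrast(X @ alpha, omega, t)
             + sd[t] * rng.standard_normal(n_per_group))
        data[t] = (X, Y)
    return data, alpha
```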

Figure 1:


The first panel shows the linear contrast Ct’s (ω = 0), the second panel the moderately nonlinear contrast Ct’s (ω = 0.5), and the third panel displays highly nonlinear contrast Ct’s (ω = 1). Data points are generated from model (11) with δ = 0 and p = 5. The fourth and the fifth panels show the linear (ν = 0) and the nonlinear main effect M (ν = 1), respectively.

Figure 2 shows that SIMML outperforms all other methods, except for the case with linear M and Ct's, in which all four approaches perform well. The K-Index model is clearly second best under the linear M (ν = 0) (the top panels) with the nonlinear Ct's (ω = 0.5 and ω = 1). However, with a more complex M function (ν = 1) (the bottom panels), the performance of the K-Index approach is considerably worse than that of SIMML. Given a relatively small sample size and under the complex main effect, the SIMML, which emphasizes the treatment contrasts through the common single index, is more effective in estimating optimal treatment decisions than the K-Index model. As would be expected, additional complexity in the contrasts Ct's (ω = 0.5 and ω = 1) has a greater effect on the performance of the more restrictive models (linGEM and K-LR) than it does on the flexible models (SIMML and K-Index). The number of covariates, p, also has a clear impact on the performance of all methods. As p changes from 5 (red) to 10 (blue), the deterioration in performance is more pronounced for the K-Index model, which requires separate fits for each treatment and thus involves estimation of more parameters (K(p − 1) + Kd), than for the more parsimonious SIMML, which has fewer parameters (p − 1 + Kd) to estimate.

Figure 2:


Boxplots of the proportion of correct decisions (PCD) of the treatment decision rules obtained from 200 training datasets for each of the four methods. Each panel corresponds to one of the six combinations of ω ∈ {0, 0.5, 1} and ν ∈ {0, 1}: the shape of the contrast functions Ct’s controlled by ω; the shape of the main effect function M controlled by ν; the number of predictors p ∈ {5, 10}. The sample sizes are n1 = n2 = 40.

5.2. Coverage probability of asymptotic 95% confidence intervals

The next simulation experiment assesses the coverage probability of the asymptotic confidence intervals derived from Theorem 2. The data were generated under model (11) with δ = 0 (i.e., no main effect M) and p = 5 covariates. We set the SIMML index vector α (= α0) to be stepwise increasing, (1, …, 5), normalized to have unit L2 norm. The associated contrast functions, the Ct's, are given by (12). As in Section 5.1, we consider three levels of curvature of the contrasts, corresponding to linear (ω = 0), moderately nonlinear (ω = 0.5), and highly nonlinear (ω = 1) contrasts (see Figure 1). In (11), the standard deviations of the noise ϵt were set to 0.5. We set the sample size n = n1 + n2 with n1 = n2. With varying n ∈ {50, 100, 200, 400, 800, 1600, 3200}, the number of interior knots used in the B-spline approximation, $N_t$, was set to $N_t = [n_t^{1/5.5}]$, as recommended by [17] ($[v]$ denotes the integer part of $v$). Two hundred datasets were generated for each combination of n and ω. For each (i.e., the jth) component αj of α, the proportion of times the 95% asymptotic confidence interval contains the true value of αj is recorded in Table C.2 in the Appendix. Notice that in Theorem 2 the 5th (i.e., the pth) element is determined by the constraint α ∈ Θ. To obtain confidence intervals for the 5th component, we applied Theorem 2 with the 4th component removed instead (without loss of generality).

We note that the choice $N_t = [n_t^{1/5.5}]$ is an approximation to the $N_t$ of Assumption 5, which requires $n_t^{1/6} \ll N_t \ll n_t^{1/5}(\log n_t)^{-2/5}$; such an $N_t$ can only be attained for very large $n_t$. Nevertheless, Table C.2 in the Appendix shows that, as the sample size n (= n1 + n2) increases, the actual coverage probability gets closer to the nominal coverage probability, with better coverage for the linear and moderately nonlinear contrasts (ω ∈ {0, 0.5}) than for the highly nonlinear contrasts (ω = 1).

6. Application to data from a randomized clinical trial

Major depressive disorder afflicts millions and, according to the World Health Organization, it is the leading cause of disability worldwide. It is a highly heterogeneous disorder; however, no individual biological or clinical marker has demonstrated sufficient ability to match individuals to efficacious treatments. Here we illustrate the utility of the proposed SIMML method for estimating a composite biomarker and treatment decision rules, with an application to data from a randomized clinical trial comparing an antidepressant and placebo for treating depression.

Of the 166 subjects, 88 were randomized to placebo and 78 to the antidepressant. In addition to standard clinical assessments, patients underwent neuropsychiatric testing prior to treatments. Table 1 summarizes the information on p = 9 baseline patient characteristics, X = (x1, …, x9). These baseline covariates were considered as potential treatment effect modifiers, and standardized to have unit variance. The treatment outcome Y was the improvement in symptom severity from week 0 (baseline) to week 8 and thus larger values of the outcome were better.

Table 1: Depression randomized clinical trial:

Description of the p = 9 baseline covariates (means and SDs); the estimated values (“Indiv. Value”) of treatment decision rules from each individual covariate, using either the B-spline regression (“Nonpar.”, in the third column) or the linear regression (“Linear”, in the fourth column); the estimated single-index coefficients (in the last three columns), and the values of the associated treatment decision rules from the three methods (in the bottom row).

(Label) Baseline patient characteristics   Mean (SD)        Indiv. Value             Coefficients αj, j ∈ {1, …, 9}
                                                            Nonpar.    Linear        SIMML*    SIMML    linGEM
(x1) Age at evaluation                     38.00 (13.84)    8.56       8.24          −0.53     −0.50    −0.43
(x2) Severity of depression                18.80 (4.29)     6.85       7.07          −0.07     −0.13    −0.37
(x3) Dur. MDD (month)                      38.19 (53.17)    7.42       7.33           0.08     −0.18     0.20
(x4) Age at MDD                            16.46 (6.09)     6.29       6.95           0.23      0.05     0.31
(x5) Axis II                                3.92 (1.43)     7.16       7.11           0.23      0.20     0.17
(x6) Word Fluency                          37.42 (11.68)    7.64       7.11           0.11      0.09     0.27
(x7) Flanker RT                            59.51 (26.63)    8.19       8.39           0.12      0.23    −0.18
(x8) Post-conflict adjus.                   0.07 (0.12)     6.73       7.23          −0.30     −0.29    −0.18
(x9) Flanker Accuracy                       0.22 (0.15)     7.89       8.37           0.70      0.70     0.59
Value from single-index model                                                          9.34      8.72     8.22

Figure 3 shows the treatment outcomes Y against each of the 9 baseline covariates, for the placebo group (blue) and the active drug group (red). The estimated B-spline approximated curves for each individual covariate are shown with the associated 95% confidence bands: the solid blue curves for the placebo group and the dotted red curves for the active drug group. In Figure 3, each individual covariate has at most a small treatment modifying effect, as its treatment-specific curves do not differ much.

Figure 3: Depression randomized clinical trial:


For each of the 9 baseline covariates individually, treatment-specific spline-approximated regression curves with 5 basis functions are overlaid onto the data points; the placebo group is shown with a blue solid curve and the active drug group with a red dotted curve. The associated 95% confidence bands of the regression curves are also plotted.

One natural measure of the effectiveness of a treatment decision rule $D$ is its "value" ($V$) [20], defined as the expected outcome if everyone in the population receives treatment according to that rule:

$$V(D) = E_X\big(E_{Y\mid X}\big(Y \mid X, T = D(X)\big)\big). \tag{13}$$

In the third and the fourth columns of Table 1, “Indiv. Value” refers to the estimated “value” of a decision rule D estimated from each of the 9 individual covariates, using the following two approaches for estimating D: the B-spline regressions of the treatment-specific outcome on each individual covariate (“Nonpar.” in the third column of Table 1) as suggested by the overlaid curves in Figure 3, and the linear regressions of the treatment-specific outcome on each individual covariate (“Linear” in the fourth column of Table 1). The value (13) of D can be estimated by the inverse probability weighted estimator [21]:

$$\hat V(D) = \sum_{i=1}^{\tilde n} Y_i\, I_{\{T_i = D(X_i)\}} \Big/ \sum_{i=1}^{\tilde n} I_{\{T_i = D(X_i)\}}, \tag{14}$$

using a testing set, say $\{(Y_i, X_i, T_i),\ i = 1, \ldots, \tilde n\}$, where, if one uses only the $j$th covariate for estimating $D$, then $X_i = x_{ij}$. The data were randomly split into a training set and a testing set with a ratio of 10 to 1. This splitting was performed 500 times, each time estimating $D$ on the training set and computing (14) from the testing set. The values (14) are averaged over the 500 splits.
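A minimal sketch of the value estimate (14) on a testing set (the function and variable names are illustrative):

```python
import numpy as np

def estimated_value(Y_test, T_test, D_test):
    """Value estimate (14): average outcome among test subjects whose randomly
    assigned treatment agrees with the rule's recommendation D(X_i)."""
    agree = (T_test == D_test)
    return Y_test[agree].sum() / agree.sum()   # assumes at least one agreement
```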

The SIMML can be made more efficient by incorporating a main effect component $\beta^\top D(X)$ in the model, i.e., we can consider $E(Y \mid T = t, X = x) = \beta^\top D(x) + g_t(\alpha^\top x)$, for an appropriate vector-valued function $D(X)$. If the $n \times q$ matrix $\mathbf{D}$ denotes the evaluation of $D(X)$ on the sample data, then for each $\alpha$, the negative "profile" log-likelihood (7) under this extended model (with Gaussian outcome), up to constants, is $Q^*(\alpha) = \|\tilde Y - S_\alpha \tilde Y\|^2$, where $\tilde Y = \big(I_n - (I_n - S_\alpha)\mathbf{D}(\mathbf{D}^\top\mathbf{D})^{-1}\mathbf{D}^\top\big)Y$. In this analysis, we took $D(X) = X$. We refer to this approach as the "main effect adjusted" profile likelihood SIMML and denote it by SIMML*.
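A sketch of this "main effect adjusted" criterion, following the reconstructed expression for $\tilde Y$ above (the helper name and the dense-matrix implementation are illustrative assumptions):

```python
import numpy as np

def adjusted_criterion(Y, D_mat, S_alpha):
    """Compute Y-tilde = (I - (I - S_alpha) D (D'D)^{-1} D') Y and
    Q*(alpha) = ||Y-tilde - S_alpha Y-tilde||^2 for a given alpha."""
    n = len(Y)
    P_D = D_mat @ np.linalg.solve(D_mat.T @ D_mat, D_mat.T)   # projection onto col(D)
    Y_tilde = Y - (np.eye(n) - S_alpha) @ (P_D @ Y)
    q_star = np.sum((Y_tilde - S_alpha @ Y_tilde) ** 2)
    return Y_tilde, q_star
```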

In Table 1, the last three columns show the estimated single-index coefficients $\alpha$ obtained by the two different SIMMLs (SIMML* and SIMML) and by the linear GEM (linGEM), which restricts the link $g_t(\cdot)$ to be a linear function. In Figure 4, the estimated pairs of link functions are plotted against the approach-specific single-index $\alpha^\top X$, obtained from applying the two SIMML approaches and the linear GEM approach. From Figures 3 and 4, it appears that the index $\alpha^\top X$ exhibits a stronger moderating effect of treatment than the individual covariates. Also, the shapes of the regression curves from the SIMML approaches appear to capture a nonlinear treatment-by-index interaction effect, especially due to a non-monotone relationship between the index and the outcome in the active drug group.

Figure 4: Depression randomized clinical trial:


Pairs of estimated link functions (g1 and g2) obtained from SIMML with the "main effect adjusted" profile likelihood (first panel), SIMML with the (main effect un-adjusted) profile likelihood (second panel), and the linear GEM model estimated under the criterion of maximizing the difference in the linear regression slopes (third panel), respectively, for the placebo group (blue solid curves) and the active drug group (red dotted curves). The 95% confidence bands were constructed conditioning on the single-index coefficient α. For each treatment group, the observed outcomes are plotted against the estimated single-index.

In Figure 5, we illustrate the single-index coefficient estimates from each of the methods, and the associated 95% confidence intervals obtained from a bias-corrected and accelerated (BCa, [22]) bootstrap with 500 replications. The coverage of the asymptotic confidence intervals at this sample size is not expected to be very good (based on the simulation results in Section 5.2), so we used bootstrap confidence intervals instead. The magnitude of the estimated coefficients α1, …, α9 reflects the relative importance of the covariates x1, …, x9 in determining the composite treatment effect modifier $\alpha^\top X$.

Figure 5: Depression randomized clinical trial:


Top panel: Violin plots of the estimated values of treatment decision rules based on each of the individual covariates x1, …, x9, determined from univariate nonparametric and linear regressions, respectively, obtained from 500 randomly split testing sets (with higher values preferred). Bottom panel: The estimated index coefficients α1, …, α9, associated with the covariates x1, …, x9, and the 95% confidence intervals for each of the three methods, obtained from BCa bootstrap with 500 replications. An estimated significant coefficient is marked with * on top of each confidence interval.

In this analysis, the incorporation of the "main effect" component improved the value of the treatment decision rules determined from the proposed SIMML method, as illustrated in the boxplots in Figure 6. We compared the two SIMML approaches (SIMML* and SIMML), the linear GEM (linGEM), and the two approaches based on separate regression models for each treatment group (K-Index and K-LR), with respect to the estimated values (14) of the treatment decision rules. For comparison, we also included the decision to treat everyone with placebo (All PBO) and the decision to treat everyone with the active drug (All DRG). The results are summarized in Figure 6.

Figure 6: Depression randomized clinical trial:


Boxplots of the estimated values of treatment decision rules, obtained from the 500 randomly split testing sets (higher values are preferred). The estimated values (and standard deviations) are as follows. SIMML*: 9.34 (2.68); SIMML: 8.72 (2.68); K-Index: 8.04 (2.69); K-LR: 8.36 (2.69); linear GEM (linGEM): 8.22 (2.67); All placebo (PBO): 6.17 (2.63); All drug (DRG): 7.57 (2.67).

In Figure 6, in terms of the estimated values (14) averaged over the aforementioned 500 randomly split testing sets, the proposed SIMML approaches outperform all other methods. The visualization (see Figure 4) indicates that the superiority of the active drug over placebo does not decrease linearly with the index; rather, it appears to remain relatively constant to the left of the crossing point, exhibiting some nonlinear patterns. Finally, we note that the value of the treatment decision rule All PBO was lower than the value of the treatment decision rule All DRG, and that all treatment decision rules that took patient characteristics into account outperformed the decision of treating everyone with the drug (which is standard current clinical practice). In particular, the superiority of the treatment decision rule SIMML* over treating everyone with the drug, in terms of value, was of similar magnitude to the superiority of treating everyone with the drug over treating everyone with placebo. This is a clear indication that patient characteristics can help guide treatment decisions for patients with depression, and that the more flexible SIMML methods are well suited for developing treatment decision rules. In particular, the proposed methods show that combining patient characteristics, each with only a small treatment-moderating effect, can yield a strong composite treatment effect modifier with a nonlinear association with the outcome that can help with making treatment decisions.

7. Discussion

The SIMML model (6) can be extended in various ways, for example by allowing treatment-specific noise variances $\sigma_t^2$. Under a Gaussian noise assumption, the B-spline approximated profile log-likelihood of $\alpha$, which profiles out the nuisance parameters $\sigma_t^2$ and $\eta_t$, is, up to constants, $-\sum_{t=1}^K n_t \log Q_t(\alpha)$, in which $Q_t(\alpha) = \|Y_t - S_{\alpha,t} Y_t\|^2/n_t$ and $S_{\alpha,t} = \mathbf{Z}_{\alpha,t}(\mathbf{Z}_{\alpha,t}^\top \mathbf{Z}_{\alpha,t})^{-1}\mathbf{Z}_{\alpha,t}^\top$ is the within-group analogue of $S_\alpha$. The corresponding profile estimator of $\alpha$ is $\operatorname*{arg\,min}_{\alpha \in \Theta} \sum_{t=1}^K n_t \log Q_t(\alpha)$. The estimation can be performed similarly to the estimation of $\hat\alpha$ in (8), but with the criterion function $Q(\alpha)$ replaced by $\sum_{t=1}^K n_t \log Q_t(\alpha)$.
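A sketch of this treatment-specific-variance criterion, reusing the hypothetical `bspline_design` helper from Section 3:

```python
import numpy as np

def hetero_criterion(alpha, X_by_group, Y_by_group, n_basis=5):
    """sum_t n_t log Q_t(alpha), with Q_t(alpha) = ||Y_t - S_{alpha,t} Y_t||^2 / n_t;
    allows treatment-specific noise variances (a sketch, not the authors' code)."""
    alpha = alpha / np.linalg.norm(alpha)
    total = 0.0
    for X_t, Y_t in zip(X_by_group, Y_by_group):
        n_t = len(Y_t)
        Z = bspline_design(X_t @ alpha, n_basis)
        coef, *_ = np.linalg.lstsq(Z, Y_t, rcond=None)   # profile out eta_t
        Q_t = np.sum((Y_t - Z @ coef) ** 2) / n_t
        total += n_t * np.log(Q_t)
    return total
```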

The SIMML can also be extended to generalized linear models (GLMs), in which the outcome variable is a member of the exponential family. The standard form of the density is $f_Y(Y; \theta, \phi) = \exp\{(Y\theta - b(\theta))/a(\phi) + c(Y, \phi)\}$, with a canonical link function $h(\cdot)$. We can extend the SIMML approach to the GLM setting with treatment-specific natural parameters $\theta_t$, $t \in \{1, \ldots, K\}$, by modeling the treatment-specific outcomes as functions of a single index $\alpha^\top X$: $\theta_t(x) = h^{-1}\big(E(Y \mid T = t, X = x)\big) = g_t(\alpha^\top x)$, $t \in \{1, \ldots, K\}$; $g_t(\cdot)$, and hence $\theta_t(x)$, can be approximated, for example, by B-splines. The approximations can be denoted by $\tilde\theta_t(x) = \eta_t^\top Z_t(\alpha^\top x)$ for some $\eta_t \in \mathbb{R}^{d_t}$. As in Section 3, the general strategy of nonlinear maximization of the "profile" likelihood over $\alpha \in \Theta$, where we profile out $\eta_t$ for each value of $\alpha$, can be employed. The dispersion parameter $\phi$ can also be profiled out. Other potential extensions involve incorporating variable selection in high-dimensional covariate settings using a regularization method and incorporating functional-valued data objects (such as images) as patient covariates.

An important extension to the SIMML model is to factor out baseline effects common to all treatment groups, by allowing an unspecified main-effect term μ(X) [e.g., 23] in the model. Generally, this can be handled by an “orthogonalization” approach, and the estimation can be performed under the framework of A-learning [1, 24, 25, 26, 27]. To elaborate, consider the following extension of model (2),

$$E(Y \mid T = t, X = x) = \mu(x) + g_t(\alpha^\top x) \qquad (t = 1, \ldots, K), \tag{15}$$

where we impose the structural constraint $E_T\big(g_T(\alpha^\top X) \mid X\big) = \sum_{t=1}^K \pi_t\, g_t(\alpha^\top X) = 0$, which is a sufficient condition for orthogonality between the SIMML term, $g_T(\alpha^\top X)$, and the unspecified main effect, $\mu(X)$, in (15), as in [27]. Optimization of model (15) can be achieved by constrained least squares under this orthogonality constraint, and A-learning can be employed for estimating an optimal treatment decision rule, focusing on estimating the interactions in the presence of the unspecified main effect $\mu(X)$. The technicalities of this adjustment are treated in a separate work.

Acknowledgement

This work was supported by National Institutes of Health (NIH) grant 5 R01 MH099003.

Appendix A. The asymptotic covariance matrix in Theorem 2

Define $R_t(\alpha) = E_{Y,X \mid T=t}\big(Y - g_t(\alpha^\top X)\big)^2$, $t \in \{1, \ldots, K\}$. In Theorem 2, the asymptotic covariance matrix is given as $\Sigma_{\alpha_{0,-p}} = H_{\alpha_{0,-p}}^{-1} W_{\alpha_{0,-p}} H_{\alpha_{0,-p}}^{-1}$. Here, the Hessian matrix $H_{\alpha_{0,-p}} = [H_{j,q}]_{j,q=1}^{p-1}$, evaluated at $\alpha_{-p} = \alpha_{0,-p}$, has its $(j,q)$th element given by

$$H_{j,q} = \sum_{t=1}^K \pi_t \left[ \frac{\partial^2}{\partial\alpha_j\,\partial\alpha_q} R_t(\alpha) - \frac{\alpha_j}{\alpha_p}\frac{\partial^2}{\partial\alpha_p\,\partial\alpha_q} R_t(\alpha) - \frac{\alpha_q}{\alpha_p}\frac{\partial^2}{\partial\alpha_p\,\partial\alpha_j} R_t(\alpha) - \frac{\alpha_j\alpha_q}{\alpha_p^3}\frac{\partial}{\partial\alpha_p} R_t(\alpha) + \frac{\alpha_j\alpha_q}{\alpha_p^2}\frac{\partial^2}{\partial\alpha_p^2} R_t(\alpha) \right]\Bigg|_{\alpha=\alpha_0}. \tag{A.1}$$

The matrix $W_{\alpha_{0,-p}} = [W_{j,q}]_{j,q=1}^{p-1}$, evaluated at $\alpha_{-p} = \alpha_{0,-p}$, has its $(j,q)$th element given by

$$W_{j,q} = \sum_{t=1}^K \pi_t\, E_{Y,X\mid T=t}\Bigg(\bigg\{2\big(g_t(u_\alpha) - Y\big)\Big(\frac{\partial}{\partial\alpha_j}g_t(u_\alpha) - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}g_t(u_\alpha)\Big) - \frac{\partial}{\partial\alpha_j}R_t(\alpha) + \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}R_t(\alpha)\bigg\} \times \bigg\{2\big(g_t(u_\alpha) - Y\big)\Big(\frac{\partial}{\partial\alpha_q}g_t(u_\alpha) - \frac{\alpha_q}{\alpha_p}\frac{\partial}{\partial\alpha_p}g_t(u_\alpha)\Big) - \frac{\partial}{\partial\alpha_q}R_t(\alpha) + \frac{\alpha_q}{\alpha_p}\frac{\partial}{\partial\alpha_p}R_t(\alpha)\bigg\}\Bigg)\Bigg|_{\alpha=\alpha_0} \tag{A.2}$$

where $u_\alpha = \alpha^\top X$.

Appendix B. Proof

Appendix B.1. Proof of Theorem 1

Proof. Let us write $Q_t(\alpha) = \|Y_t - S_{\alpha,t} Y_t\|^2/n_t$ and $Q(\alpha) = \sum_{t=1}^K n_t Q_t(\alpha)/n$. Under Assumptions 2–4, by the results from A.14 of [28], we have

$$\sup_{\alpha\in\Theta_c}\big|Q_t(\alpha) - R_t(\alpha)\big| \le O\Big(\big(n_t^{-1/2} h_t^{-1/2}\log n_t\big)^2 + \big(h_t^4\big)^2\Big) + O\Big(n_t^{-1/2} h_t^{-1/2}\log n_t + h_t^4\Big)$$

almost surely, where $h_t = 1/(N_t + 1)$ is the distance between knot points and $N_t$ (note, $N_t = d_t - 4$) is the number of interior knots on $[0, 1]$. Since we choose $N_t$ such that $n_t^{1/6} \ll N_t \ll n_t^{1/5}(\log n_t)^{-2/5}$ for all $t \in \{1, \ldots, K\}$, under Assumption 5,

$$\sup_{\alpha\in\Theta_c}\big|Q_t(\alpha) - R_t(\alpha)\big| \to 0, \qquad t \in \{1, \ldots, K\},$$

almost surely. By the continuous mapping theorem,

$$\sup_{\alpha\in\Theta_c}\Big|\sum_{t=1}^K \frac{n_t}{n} Q_t(\alpha) - \sum_{t=1}^K \pi_t R_t(\alpha)\Big| \le \sup_{\alpha\in\Theta_c}\sum_{t=1}^K \Big|\frac{n_t}{n} Q_t(\alpha) - \pi_t R_t(\alpha)\Big| \to 0$$

almost surely, therefore, we have

$$\sup_{\alpha\in\Theta_c}\big|Q(\alpha) - R(\alpha)\big| \to 0, \tag{B.1}$$

almost surely. Denote by $(\Omega, \mathcal{F}, P)$ the probability space on which all $\{Y_i, T_i, X_i\}_{i=1}^\infty$ are defined. By (B.1), for any $\delta > 0$ and $\omega \in \Omega$, there is an integer $n^*(\omega)$ such that $Q(\alpha_0, \omega) - R(\alpha_0) < \delta/2$ whenever $n > n^*(\omega)$. Since $\hat\alpha(\omega)$ is the minimizer of $Q(\alpha, \omega)$, we have $Q(\hat\alpha(\omega), \omega) - R(\alpha_0) < \delta/2$. Also, by (B.1), there exists an integer $n^{**}(\omega)$ such that $R(\hat\alpha(\omega)) - Q(\hat\alpha(\omega), \omega) < \delta/2$ whenever $n > n^{**}(\omega)$. Therefore, whenever $n > \max(n^*(\omega), n^{**}(\omega))$, we have $R(\hat\alpha(\omega)) - R(\alpha_0) < \delta$. The strong consistency $\hat\alpha \to \alpha_0$ follows from the local convexity of Assumption 1. □

Appendix B.2. Proof of Theorem 2

Proof. We first derive the expression (A.1) from the Appendix for the Hessian matrix. We can write $R(\alpha_{-p}) = \sum_{t=1}^K \pi_t R_t(\alpha_{-p})$, where the "$p$th component removed" function corresponding to the $t$th treatment is $R_t(\alpha_{-p}) = R_t\big(\alpha_1, \ldots, \alpha_{p-1}, \sqrt{1 - (\alpha_1^2 + \cdots + \alpha_{p-1}^2)}\big)$. Applying the chain rule for taking the derivative of $R_t(\alpha_{-p})$ with respect to $\alpha_j$, we obtain

$$\frac{\partial}{\partial\alpha_j} R_t(\alpha_{-p}) = \frac{\partial}{\partial\alpha_j} R_t(\alpha) - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p} R_t(\alpha) \tag{B.2}$$

for each j ∈ {1, …, p – 1}. Taking another derivative of (B.2) with respect to αq, for each q ∈ {1, …, p – 1}, again by applications of the chain rule,

$$\frac{\partial^2}{\partial\alpha_q\,\partial\alpha_j} R_t(\alpha_{-p}) = \frac{\partial^2}{\partial\alpha_q\,\partial\alpha_j} R_t(\alpha) - \frac{\alpha_q}{\alpha_p}\frac{\partial^2}{\partial\alpha_p\,\partial\alpha_j} R_t(\alpha) - \frac{\alpha_j}{\alpha_p}\frac{\partial^2}{\partial\alpha_q\,\partial\alpha_p} R_t(\alpha) - \frac{\partial}{\partial\alpha_q}\Big(\frac{\alpha_j}{\alpha_p}\Big)\frac{\partial}{\partial\alpha_p} R_t(\alpha) + \frac{\alpha_q\alpha_j}{\alpha_p^2}\frac{\partial^2}{\partial\alpha_p^2} R_t(\alpha). \tag{B.3}$$

After summing (B.3) over the groups t ∈ {1, …, K}, weighted by the group probabilities π1, …, πK, evaluated at α = α0, we obtain (A.1).

Next, we examine the asymptotics of the profile estimator α^. From A.15 of [28] and under Assumptions 2–5, we have

$$\sup_{\alpha\in\Theta_c}\sup_{1\le j\le p}\Big|\frac{\partial}{\partial\alpha_j}\big\{Q_t(\alpha) - R_t(\alpha)\big\} - \frac{1}{n_t}\sum_{i=1}^{n_t}\xi_{\alpha,i,j,t}\Big| = o\big(n_t^{-1/2}\big) \tag{B.4}$$

almost surely, with $\xi_{\alpha,i,j,t} = 2\big\{g_t(u_{\alpha,ti}) - Y_{ti}\big\}\frac{\partial}{\partial\alpha_j}g_t(u_{\alpha,ti}) - \frac{\partial}{\partial\alpha_j}R_t(\alpha)$, where $u_{\alpha,ti} = \alpha^\top X_{ti}$, and furthermore

$$\sup_{\alpha\in\Theta_c}\sup_{1\le j\le p}\Big|\frac{\partial}{\partial\alpha_j}\big\{Q_t(\alpha) - R_t(\alpha)\big\}\Big| = o(1), \qquad \sup_{\alpha\in\Theta_c}\sup_{1\le q,j\le p}\Big|\frac{\partial^2}{\partial\alpha_q\,\partial\alpha_j}\big\{Q_t(\alpha) - R_t(\alpha)\big\}\Big| = o(1), \tag{B.5}$$

almost surely, for each group t ∈ {1, …, K}.

Now, we will prove that the estimated score of $Q(\alpha_{-p}) = \sum_{t=1}^K \hat\pi_t Q_t(\alpha_{-p})$, where $\hat\pi_t = \sum_{i=1}^n I(T_i = t)/n$, evaluated at $\alpha_{-p} = \alpha_{0,-p}$, is represented, up to $o(n^{-1/2})$ almost surely, by a sum of mean-zero independent random variables, which we denote by $\eta_i \in \mathbb{R}^{p-1}$, $i \in \{1, \ldots, n\}$, where $n = \sum_{t=1}^K n_t$. Let us denote the estimated score function by $\hat\Psi(\alpha_{-p}) = \frac{\partial}{\partial\alpha_{-p}} Q(\alpha_{-p})$, where $\alpha_{-p} \in \mathbb{R}^{p-1}$. We will show

$$\sup_{1\le j\le p-1}\Big|\hat\Psi_j(\alpha_{0,-p}) - \frac{1}{n}\sum_{i=1}^n \eta_{i,j}\Big| = o\big(n^{-1/2}\big), \tag{B.6}$$

almost surely, where $\hat\Psi_j(\alpha_{-p})$ is the $j$th component of the score function $\hat\Psi(\alpha_{-p})$ and $\eta_{i,j}$ is the $j$th component of the random variable $\eta_i$. In order to employ the result (B.4), we first consider the score function defined on the set $\Theta_c$, i.e., the score function $\hat\Psi_j(\alpha)$, instead of the "$p$th component removed" score function defined on $\mathbb{R}^{p-1}$, i.e., $\hat\Psi_j(\alpha_{-p})$. We will show that, for some mean-zero independent random variables, which we denote by $\xi^*_{\alpha,i,j}$, $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, p\}$,

$$\sup_{\alpha\in\Theta_c,\,1\le j\le p}\Big|\frac{\partial}{\partial\alpha_j}\big\{Q(\alpha) - R(\alpha)\big\} - \frac{1}{n}\sum_{i=1}^n \xi^*_{\alpha,i,j}\Big| = o\big(n^{-1/2}\big) \tag{B.7}$$

is satisfied almost surely. Let us set the desired mean-zero independent random variable $\xi^*_{\alpha,i,j}$ to be $\xi^*_{\alpha,i,j} = \sum_{t=1}^K \xi^*_{\alpha,i,j,t}$, where

$$\xi^*_{\alpha,i,j,t} = \Big[2\big\{g_t(\alpha^\top X_i) - Y_i\big\}\frac{\partial}{\partial\alpha_j}g_t(\alpha^\top X_i) - \frac{\partial}{\partial\alpha_j}R_t(\alpha)\Big]\, I(T_i = t),$$

which must satisfy the following:

$$\sup_{\alpha\in\Theta_c,\,1\le j\le p}\Big|\sum_{t=1}^K \pi_t\Big[\frac{\partial}{\partial\alpha_j}Q_t(\alpha) - \frac{\partial}{\partial\alpha_j}R_t(\alpha)\Big] - \frac{1}{n}\sum_{i=1}^n\sum_{t=1}^K \xi^*_{\alpha,i,j,t}\Big| = o\big(n^{-1/2}\big). \tag{B.8}$$

We can write

$$\Big|\sum_{t=1}^K \pi_t\Big[\frac{\partial}{\partial\alpha_j}Q_t(\alpha) - \frac{\partial}{\partial\alpha_j}R_t(\alpha)\Big] - \frac{1}{n}\sum_{t=1}^K\sum_{i=1}^n \xi^*_{\alpha,i,j,t}\Big| = \Big|\sum_{t=1}^K \pi_t\Big[\frac{\partial}{\partial\alpha_j}Q_t(\alpha) - \frac{\partial}{\partial\alpha_j}R_t(\alpha) - \frac{1}{\pi_t}\frac{n_t}{n}\frac{1}{n_t}\sum_{i=1}^{n_t}\xi_{\alpha,i,j,t}\Big]\Big|,$$

where ξα,i,j,t is defined in (B.4). Therefore, applying the continuous mapping theorem and Slutsky’s theorem to (B.4) leads to the desired result (B.8).

Next, we will show (B.6), the result corresponding to the "$p$th component removed" estimated score function $\hat\Psi(\alpha_{-p})$ on $\mathbb{R}^{p-1}$. Considering the linear operator $\frac{\partial}{\partial\alpha_j} - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}$, we note that, by the chain rule,

$$\Big(\frac{\partial}{\partial\alpha_j} - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}\Big)\big\{Q(\alpha) - R(\alpha)\big\} = \hat\Psi_j(\alpha_{-p}) - \Psi_j(\alpha_{-p}),$$

for $j \in \{1, \ldots, p-1\}$, where $\Psi_j(\alpha_{-p})$ denotes the $j$th component of the gradient of $R(\alpha_{-p})$. If we set the approximation variable $\eta_{i,j}$ of (B.6) to be

$$\eta_{i,j} = \xi^*_{\alpha,i,j} - \frac{\alpha_j}{\alpha_p}\xi^*_{\alpha,i,p} = \sum_{t=1}^K\Big[2\big\{g_t(\alpha^\top X_i) - Y_i\big\}\Big\{\frac{\partial}{\partial\alpha_j}g_t(\alpha^\top X_i) - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}g_t(\alpha^\top X_i)\Big\} - \frac{\partial}{\partial\alpha_j}R_t(\alpha) + \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}R_t(\alpha)\Big]\, I(T_i = t), \tag{B.9}$$

then we can show

$$\sup_{\alpha\in\Theta_c}\sup_{1\le j\le p-1}\Big|\Big(\frac{\partial}{\partial\alpha_j} - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}\Big)\big\{Q(\alpha) - R(\alpha)\big\} - \frac{1}{n}\sum_{i=1}^n\eta_{i,j}\Big| \le \sup_{\alpha\in\Theta_c}\sup_{1\le j\le p-1}\Big|\frac{\partial}{\partial\alpha_j}\big(Q(\alpha) - R(\alpha)\big) - \frac{1}{n}\sum_{i=1}^n\xi^*_{\alpha,i,j}\Big| + \sup_{\alpha\in\Theta_c}\frac{\alpha_j}{\alpha_p}\Big|\frac{\partial}{\partial\alpha_p}\big(Q(\alpha) - R(\alpha)\big) - \frac{1}{n}\sum_{i=1}^n\xi^*_{\alpha,i,p}\Big| = o\big(n^{-1/2}\big), \tag{B.10}$$

by the triangle inequality and the result of (B.7). Since Ψj(αp) is evaluated at the minimum α0,−p, we have

$$\Psi_j(\alpha_{0,-p}) = \Big(\frac{\partial}{\partial\alpha_j} - \frac{\alpha_j}{\alpha_p}\frac{\partial}{\partial\alpha_p}\Big) R(\alpha)\Big|_{\alpha=\alpha_0} = 0, \tag{B.11}$$

by the local convexity under Assumption 1. Then we obtain the desired result of (B.6), by (B.10) and (B.11).

The uniform consistency of the observed Hessian, $\hat H(\alpha_{-p}) = \frac{\partial^2}{\partial\alpha_{-p}\,\partial\alpha_{-p}^\top} Q(\alpha_{-p})$, to the population Hessian $H(\alpha_{-p})$ of (A.1) follows directly from the results of (B.5) under Assumptions 2–5, with applications of the continuous mapping theorem.

Finally, we prove the main result. Consider the random variable Ψ^j(α0,p) introduced in (B.6), and the following parametrization: for each component j ∈ {1, …, p – 1}

$$f_j(s) = \hat\Psi_j\big(s\hat\alpha_{-p} + (1-s)\alpha_{0,-p}\big), \qquad s \in [0, 1].$$

Taking the derivative with respect to $s$, we have, by the chain rule,

$$\frac{d}{ds} f_j(s) = \sum_{m=1}^{p-1}\frac{\partial}{\partial\alpha_m}\hat\Psi_j\big(s\hat\alpha_{-p} + (1-s)\alpha_{0,-p}\big)\,\big(\hat\alpha_m - \alpha_{0,m}\big).$$

Since $\hat\Psi_j(\hat\alpha_{-p}) = 0$ by the definition of $\hat\alpha_{-p}$, it follows that $f_j(1) - f_j(0) = \hat\Psi_j(\hat\alpha_{-p}) - \hat\Psi_j(\alpha_{0,-p}) = -\hat\Psi_j(\alpha_{0,-p})$. Therefore, for any particular $j = 1, \ldots, p-1$, there exists $s_j^* \in [0, 1]$, by the mean value theorem, such that

$$-\hat\Psi_j(\alpha_{0,-p}) = \Big[\frac{\partial}{\partial\alpha_1}\hat\Psi_j\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big), \ldots, \frac{\partial}{\partial\alpha_{p-1}}\hat\Psi_j\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big)\Big]\big[\hat\alpha_{-p} - \alpha_{0,-p}\big],$$

which is just

$$\Big[\frac{\partial^2}{\partial\alpha_1\,\partial\alpha_j} Q\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big), \ldots, \frac{\partial^2}{\partial\alpha_{p-1}\,\partial\alpha_j} Q\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big)\Big]\big[\hat\alpha_{-p} - \alpha_{0,-p}\big], \tag{B.12}$$

where $[\hat\alpha_{-p} - \alpha_{0,-p}]$ is a $(p-1)$-dimensional random vector. Writing (B.12) in matrix notation, we have

$$-\hat\Psi(\alpha_{0,-p}) = \Big[\frac{\partial^2}{\partial\alpha_q\,\partial\alpha_j} Q\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big)\Big]_{j,q=1}^{p-1}\big[\hat\alpha_{-p} - \alpha_{0,-p}\big]. \tag{B.13}$$

Then, by (B.13) one can write

$$\sqrt{n}\big(\hat\alpha_{-p} - \alpha_{0,-p}\big) = -\Big\{\Big[\frac{\partial^2}{\partial\alpha_q\,\partial\alpha_j} Q\big(s_j^*\hat\alpha_{-p} + (1-s_j^*)\alpha_{0,-p}\big)\Big]_{j,q=1}^{p-1}\Big\}^{-1}\sqrt{n}\,\hat\Psi(\alpha_{0,-p}). \tag{B.14}$$

Meanwhile, by (B.6), for each component j ∈ {1, …, p−1} of Ψ^(α0,p), we can write

$$\hat\Psi_j(\alpha_{0,-p}) = \frac{1}{n}\sum_{i=1}^n \eta_{i,j} + o\big(n^{-1/2}\big), \tag{B.15}$$

almost surely, with $E(\eta_{i,j}) = 0$. The variance–covariance matrix of the random vector $\eta_i = [\eta_{i,1}, \ldots, \eta_{i,p-1}]^\top \in \mathbb{R}^{p-1}$, evaluated at $\alpha_{-p} = \alpha_{0,-p}$, where the $\eta_{i,j}$ are specified in (B.9), is given in (A.2), where it is denoted by $W_{\alpha_{0,-p}}$. From (B.15), the central limit theorem ensures that $\sqrt{n}\,\hat\Psi(\alpha_{0,-p}) \to N(0, W_{\alpha_{0,-p}})$ in distribution. Now, by the representation (B.14), together with an application of Slutsky's theorem to the observed Hessian, we obtain $\sqrt{n}\big(\hat\alpha_{-p} - \alpha_{0,-p}\big) \to N(0, \Sigma_{\alpha_{0,-p}})$ in distribution, where $\Sigma_{\alpha_{0,-p}} = H_{\alpha_{0,-p}}^{-1} W_{\alpha_{0,-p}} H_{\alpha_{0,-p}}^{-1}$, which is the desired result of Theorem 2. □

Appendix C. Table for Section 5.2 Coverage probability of asymptotic 95% confidence intervals

Table C.2:

The proportion of times ("Coverage") that the asymptotic 95% confidence interval contains the true value of αj, j ∈ {1, …, 5}, for varying ω ∈ {0, 0.5, 1}, corresponding to linear, moderately nonlinear, and highly nonlinear contrasts, respectively, and varying n (= n1 + n2, where n1 = n2).

      ω = 0 (linear)                     ω = 0.5 (moderately nonlinear)       ω = 1 (highly nonlinear)
n α1 α2 α3 α4 α5 α1 α2 α3 α4 α5 α1 α2 α3 α4 α5
50 0.36 0.45 0.43 0.44 0.42 0.49 0.46 0.45 0.46 0.40 0.59 0.58 0.57 0.55 0.52
100 0.64 0.67 0.72 0.68 0.64 0.76 0.75 0.80 0.72 0.76 0.89 0.82 0.84 0.75 0.73
200 0.77 0.77 0.79 0.78 0.73 0.88 0.83 0.82 0.85 0.79 0.92 0.88 0.84 0.78 0.81
400 0.85 0.90 0.87 0.87 0.85 0.88 0.88 0.88 0.82 0.85 0.95 0.88 0.84 0.79 0.78
800 0.95 0.92 0.91 0.89 0.88 0.92 0.89 0.92 0.89 0.87 0.92 0.91 0.83 0.78 0.81
1600 0.93 0.93 0.92 0.93 0.92 0.94 0.94 0.91 0.93 0.91 0.93 0.90 0.87 0.84 0.81
3200 0.94 0.95 0.94 0.94 0.94 0.96 0.94 0.90 0.92 0.90 0.93 0.92 0.87 0.90 0.85


References

[1] Murphy SA, Optimal dynamic treatment regimes, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2003) 331–355.
[2] Robins J, Optimal Structural Nested Models for Optimal Sequential Decisions, Springer, New York, 2004.
[3] Petkova E, Tarpey T, Su Z, Ogden RT, Generated effect modifiers in randomized clinical trials, Biostatistics 18 (2016) 105–118.
[4] Brillinger DR, A generalized linear model with "Gaussian" regressor variables, in: A Festschrift for Erich L. Lehmann (edited by Bickel PJ, Doksum KA, and Hodges JL), Wadsworth, New York, 1982.
[5] Stoker TM, Consistent estimation of scaled coefficients, Econometrica 54 (1986) 1461–1481.
[6] Powell J, Stock J, Stoker T, Semiparametric estimation of index coefficients, Econometrica 57 (1989) 1403–1430.
[7] Hardle W, Hall P, Ichimura H, Optimal smoothing in single-index models, Annals of Statistics 21 (1993) 157–178.
[8] Xia Y, Li W, On single index coefficient regression models, Journal of the American Statistical Association 94 (1999) 1275–1285.
[9] Horowitz JL, Semiparametric and Nonparametric Methods in Econometrics, Springer, 2009.
[10] Antoniadis A, Gregoire G, McKeague I, Bayesian estimation in single-index models, Statistica Sinica 14 (2004) 1147–1164.
[11] Friedman JH, Stuetzle W, Projection pursuit regression, Journal of the American Statistical Association 76 (1981) 817–823.
[12] Xia Y, A multiple-index model and dimension reduction, Journal of the American Statistical Association 103 (2008) 1631–1640.
[13] Yuan M, On the identifiability of additive index models, Statistica Sinica 21 (2011) 1901–1911.
[14] Lin W, Kulasekera KB, Uniqueness of a single index model, Biometrika 94 (2007) 496–501.
[15] MacKay DJC, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
[16] de Boor C, A Practical Guide to Splines, Springer-Verlag, New York, 2001.
[17] Wang L, Yang L, Spline estimation of single-index models, Statistica Sinica 19 (2009) 765–783.
[18] Zhang B, Tsiatis AA, Laber EB, Davidian M, A robust method for estimating optimal treatment regimes, Biometrics 68 (2012) 1010–1018.
[19] Cai T, Tian L, Wong PH, Wei LJ, Analysis of randomized comparative clinical trial data for personalized treatment selections, Biostatistics 12 (2011) 270–282.
[20] Qian M, Murphy SA, Performance guarantees for individualized treatment rules, The Annals of Statistics 39 (2011) 1180–1210.
[21] Murphy SA, A generalization error for Q-learning, Journal of Machine Learning Research 6 (2005) 1073–1097.
[22] DiCiccio TJ, Efron B, Bootstrap confidence intervals, Statistical Science 11 (1996) 189–228.
[23] Tian L, Alizadeh A, Gentles A, Tibshirani R, A simple method for estimating interactions between a treatment and a large number of covariates, Journal of the American Statistical Association 109 (508) (2014) 1517–1532.
[24] Lu W, Zhang H, Zeng D, Variable selection for optimal treatment decision, Statistical Methods in Medical Research 22 (2011) 493–504.
[25] Shi C, Song R, Lu W, Robust learning for optimal treatment decision with NP-dimensionality, Electronic Journal of Statistics 10 (2016) 2894–2921.
[26] Shi C, Fan A, Song R, Lu W, High-dimensional A-learning for optimal dynamic treatment regimes, The Annals of Statistics 46 (3) (2018) 925–957.
[27] Jeng X, Lu W, Peng H, High-dimensional inference for personalized treatment decision, Electronic Journal of Statistics 12 (2018) 2074–2089.
[28] Wang L, Yang L, Spline single-index prediction model, Technical Report, https://arxiv.org/abs/0704.0302.
