LINEAR BIOMARKER COMBINATION FOR CONSTRAINED CLASSIFICATION

Yijian Huang; Martin G Sanda

doi:10.1214/22-aos2210

Author manuscript; available in PMC: 2022 Nov 4.

Published in final edited form as: Ann Stat. 2022 Oct 27;50(5):2793–2815. doi: 10.1214/22-aos2210


PMCID: PMC9635489 NIHMSID: NIHMS1819429 PMID: 36341282

Abstract

Multiple biomarkers are often combined to improve disease diagnosis. The uniformly optimal combination, i.e., with respect to all reasonable performance metrics, unfortunately requires excessive distributional modeling, to which the estimation can be sensitive. An alternative strategy is rather to pursue local optimality with respect to a specific performance metric. Nevertheless, existing methods may not target clinical utility of the intended medical test, which usually needs to operate above a certain sensitivity or specificity level, or do not have their statistical properties well studied and understood. In this article, we develop and investigate a linear combination method to maximize the clinical utility empirically for such a constrained classification. The combination coefficient is shown to have cube root asymptotics. The convergence rate and limiting distribution of the predictive performance are subsequently established, exhibiting robustness of the method in comparison with others. An algorithm with sound statistical justification is devised for efficient and high-quality computation. Simulations corroborate the theoretical results, and demonstrate good statistical and computational performance. Illustration with a clinical study on aggressive prostate cancer detection is provided.

Keywords: Bahadur representation, Cube root asymptotics, Diagnostic test, Sensitivity, Specificity

1. Introduction.

In this era of precision medicine, many biomarkers have been identified for disease diagnosis, as well as for disease prognosis and prediction of therapeutic response. Since a single biomarker often has only limited diagnostic accuracy, combining multiple biomarkers holds the promise for improved discrimination. This is a classical binary classification problem, with a vast relevant literature in statistics, machine learning, and econometrics (cf. Hastie, Tibshirani and Friedman, 2009). Nevertheless, challenges abound.

Intuitively, the conditional probability of the disease status given the biomarkers, i.e., the posterior probability, is a uniformly optimal combination with respect to any reasonable performance metric, including classification error and expected cost. Since an optimal combination remains so under monotone transformation, interestingly such a combination may be determined even when the posterior probability cannot be identified from the observed data with, for example, case-control design as often adopted in biomarker studies. Specifically, as following from Bayes’ rule, the likelihood ratio is such a transformation and optimal; see McIntosh and Pepe (2002) and the references therein. These results lead to a common strategy via estimating the posterior probability or the biomarker distributions of cases and controls. Parametric modeling is often adopted, with popular methods including linear discriminant analysis, logistic regression, and probit regression. In particular, logistic regression is routinely employed to linearly combine biomarkers in practice. However, correct model specification can be difficult in, for example, cancer detection, where cancer biomarkers often have irregular distributions. Obviously, model misspecification could result in sub-optimality. Semiparametric methods, including the monotonic density ratio model of Chen et al. (2016), have been pursued, but they still may not be sufficiently general. On the other hand, nonparametric density estimation with multiple biomarkers is notoriously difficult.

Of course, for classification purpose, a locally optimal combination, i.e., with respect to a given performance metric of interest, is sufficient. Such a combination does not need to be uniformly optimal and thus can be easier for identification and estimation (cf. Elliott and Lieli, 2013), as pursued by many methods. They include machine learning techniques such as random forest, support vector machine, and neural network. For instance, the support vector machine targets the hinge loss function. Another notable example involves the receiver operating characteristic curve (ROC) to use area under the curve (AUC) as the performance metric. Pepe, Cai and Longton (2006) proposed a linear combination to maximize the empirical AUC; see also Ma and Huang (2007) and Lin et al. (2011) which maximized smoothed versions of the empirical AUC, as well as related work by Vexler et al. (2006), Chen, Vexler and Markatou (2015), and Fong, Yin and Huang (2016). However, all these loss functions and performance metrics do not align well with clinical utility of diagnostic tests in practice. As an exception, Elliott and Lieli (2013) incorporated cost function into the classification optimization by extending the maximum score estimator of Manski (1975, 1985). Their empirical risk minimization accommodates differential costs associated with false positives and false negatives. Nevertheless, the cost quantification as required can be extremely difficult in practice, posing a potential limitation.

Considerable efforts have been made on locally optimal combination with respect to clinical utility-relevant performance metrics that are more practical. Note that a medical test typically needs to operate above a certain sensitivity or specificity level. Take aggressive prostate cancer diagnosis as an example. Since a positive non-invasive test would be confirmed with biopsy, a false negative has far more serious clinical consequences than a false positive. Thus, a high sensitivity, say, 95%, needs to be maintained (e.g., Catalona et al., 1998; Sanda et al., 2017). Partial AUC around the high sensitivity level then is a sensible metric, and perhaps more so is specificity at the controlled sensitivity level; indeed the latter may be viewed as a limiting special case. Of course, a different clinical context might rather call for a high specificity, which then needs to be controlled instead. Interestingly, specificity at controlled sensitivity or vice versa is in line with power at controlled type I error in standard statistical hypothesis testing; incidentally, this explains the applicability of the Neyman–Pearson lemma to optimal biomarker combination as discussed in McIntosh and Pepe (2002) and Eguchi and Copas (2002). A number of methods have been developed for optimal linear combination with respect to these metrics, including Pepe and Thompson (2000) and Yan et al. (2018) using partial AUC, and Meisner et al. (2021) targeting sensitivity at a controlled specificity level. Unfortunately, statistical properties of these procedures are not well understood.

Notably, most of these methods targeting a given performance metric are restricted to the class of linear combinations. As such, they may only approach the optimality in the linear class, but not necessarily so in general. Nevertheless, linear combination is of interest in its own right for a few reasons. First, it provides a first-order approximation to the general combination. Second, reliable estimation in a more general class may not be feasible in many applications where data are limited. Finally, nonlinear combinations can be accommodated in a linear combination method through the general linear basis expansion technique as adopted in, e.g., support vector machines (Hastie, Tibshirani and Friedman, 2009); see further discussion in Section 6.

In this article, we investigate and develop an optimal linear combination estimation method aiming at sensitivity- or specificity-constrained classification by maximizing the empirical utility. For simplicity in expressions, we opt to consider sensitivity at a controlled specificity level in the methodology development. The same method also applies to specificity at a controlled sensitivity level as of interest in our motivating application, by transposing the roles of cases and controls. As natural as the empirical estimator appears, its asymptotic analysis and computation are challenging due to noncontinuity in both the objective function and the constraint of the optimization problem. Our approach is different from Meisner et al. (2021) which adopted kernel-smoothed estimates of sensitivity and specificity instead. Most importantly, the limiting distribution of our combination coefficients is established, to have non-standard cube root asymptotics. Subsequently, the convergence rate and limiting distribution of the predictive performance are obtained. Meanwhile, a novel computational algorithm with sound statistical justification is devised.

The rest of the paper is organized as follows. Section 2 will introduce the problem and the empirical utility maximization method, to set the stage. We present the asymptotic study in Section 3, and the computational algorithm in Section 4. Numerical studies are reported in Section 5. Final remarks are provided in Section 6. Technical details are deferred to the Appendices.

2. The problem and empirical utility maximization.

Write a k-vector biomarker under consideration as M_d, for k ≥ 2 and with d = 1, 0 denoting case and control, respectively. With coefficient b, a linear combination is given by b^⊤M_d. Since sensitivity at controlled specificity is scale invariant to the combination, we limit to $\{b^{⊤} M_{d} : ‖ b ‖_{1} = 1, b \in ℝ^{k}\}$ as the linear combination class under consideration, where ∥·∥₁ is the ℓ₁ norm. Other norms for b can serve the same purpose, but our choice has computational advantages as will be seen later.

2.1. Optimal linear combination.

Adopt the convention that a positive diagnosis is associated with a larger combination value. With ρ ∈ (0, 1) as the given control level of specificity, the optimal combination coefficient β to maximize the sensitivity is the solution b of the following optimization problem:

max_{b, t : ‖ b ‖_{1} = 1} Pr (b^{⊤} M_{1} > t) \quad subject to \quad Pr (b^{⊤} M_{0} \leq t) \geq ρ,

(1)

where t corresponds to a test threshold.

Write F_d(t; b) = Pr(b^⊤M_d ≤ t) for the cumulative distribution function, d = 1, 0, and $F_{0}^{- 1} (p; b) \equiv \inf \{t : F_{0} (t; b) \geq p\}$ for the control quantile function. With a given b, the smallest threshold t with the specificity being at least ρ and the resulting constrained sensitivity are given by

τ (b) = F_{0}^{- 1} (ρ; b), ϕ (b) = 1 - F_{1} {τ (b); b},

respectively. Then, we have

β = arg max_{b : ‖ b ‖_{1} = 1} ϕ (b),

(2)

as an identity for the optimal combination coefficient.

2.2. Empirical utility maximization.

Consider a case-control study, with n_d independent replicates of M_d: M_d,[i], i = 1, …, n_d, where d = 1, 0 represents the case and control samples, respectively. As a natural estimator of β, $\hat{β}$ is the solution b of the following problem to optimize the empirical utility:

max_{b, t : ‖ b ‖_{1} = 1} \hat{E} I (b^{⊤} M_{1} > t) \quad subject to \quad \hat{E} I (b^{⊤} M_{0} \leq t) \geq ρ,

(3)

where I(·) is the indicator function and $\hat{E}$ denotes the empirical average, over the case or control sample as appropriate; for example, $\hat{E} I (b^{⊤} M_{1} > t) = n_{1}^{- 1} \sum_{i = 1}^{n_{1}} I (b^{⊤} M_{1, [i]} > t)$ . Let ${\hat{F}}_{d} (t; b)$ denote the empirical counterpart of F_d(t; b). The empirical estimators for the threshold and associated constrained sensitivity are then given by

\hat{τ} (b) = {\hat{F}}_{0}^{- 1} (ρ; b), \hat{ϕ} (b) = 1 - {\hat{F}}_{1} \{\hat{τ} (b); b\},

respectively. Subsequently,

\hat{β} = arg max_{b : ‖ b ‖_{1} = 1} \hat{ϕ} (b),

(4)

which will facilitate later analysis of the estimator.
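As a concrete reading of the empirical threshold and the empirical constrained sensitivity, the following minimal sketch evaluates both for a given coefficient b. It assumes biomarker vectors stored as plain tuples; the function names (`combine`, `empirical_quantile`, `tau_hat`, `phi_hat`) are hypothetical and not taken from the paper.

```python
import math

def combine(b, sample):
    """Return the linear combination b^T m for each biomarker vector m."""
    return [sum(bj * mj for bj, mj in zip(b, m)) for m in sample]

def empirical_quantile(values, p):
    """inf{t : F_hat(t) >= p}, i.e. the ceil(n * p)-th order statistic."""
    s = sorted(values)
    # Small tolerance guards against floating-point noise in n * p.
    idx = max(math.ceil(len(s) * p - 1e-12), 1)
    return s[idx - 1]

def tau_hat(b, controls, rho):
    """Smallest threshold whose empirical specificity is at least rho."""
    return empirical_quantile(combine(b, controls), rho)

def phi_hat(b, cases, controls, rho):
    """Empirical sensitivity at the threshold tau_hat(b)."""
    t = tau_hat(b, controls, rho)
    scores = combine(b, cases)
    return sum(s > t for s in scores) / len(scores)
```

Because the threshold rescales with b, `phi_hat` returns the same value for b and for any positive multiple of b, which is the scale invariance that motivates the unit-norm restriction.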

2.3. A related problem.

Consider the circumstance that a biomarker is known a priori to have a non-zero coefficient with a specific sign in the optimal linear combination. Without loss of generality, suppose that the first biomarker is such a so-called anchor with a positive coefficient, which can always be achieved by biomarker reordering and sign-flipping. Write $M_{d} \equiv {(M_{d, 1}, M_{d, - 1}^{⊤})}^{⊤}$ , $β \equiv {(β_{1}, β_{- 1}^{⊤})}^{⊤}$ , and $\hat{β} \equiv {({\hat{β}}_{1}, {\hat{β}}_{- 1}^{⊤})}^{⊤}$ . Then, instead of $\{b^{⊤} M_{d} : ‖ b ‖_{1} = 1, b \in ℝ^{k}\}$ , one may work with a restricted class $\{M_{d, 1} + h^{⊤} M_{d, - 1} : h \in ℝ^{k - 1}\}$ . With a slight abuse of notation, in this context, write F_d(t; h) ≡ F_d{t; (1, h^⊤)^⊤} and similarly for ${\hat{F}}_{d} (t; h)$ , τ(h), $\hat{τ} (h)$ , ϕ(h), and $\hat{ϕ} (h)$ . With β₁ > 0 and scale invariance of ϕ(b) to b, we have the optimal combination coefficient in the new class,

η = β_{- 1} / β_{1} = arg max_{h} ϕ (h) .

Its empirical estimator is given by

\hat{η} = arg max_{h} \hat{ϕ} (h),

(5)

or, equivalently, the solution h of

max_{h, t} \hat{E} I (M_{1, 1} + h^{⊤} M_{1, - 1} > t) subject to \hat{E} I (M_{0, 1} + h^{⊤} M_{0, - 1} \leq t) \geq ρ .

(6)

The optimization problem (6) is clearly a simplification, and such a restricted class has been commonly adopted for linear combinations in the literature, e.g., Pepe, Cai and Longton (2006). However, the requirement of prior knowledge imposes a restriction, and we do not take it as an equivalent problem in general, especially for computational purposes. Nevertheless, provided that the consistency of $\hat{β}$ is established for β, this restricted problem may be taken as a surrogate for the purposes of weak convergence. Specifically, under the set-up of β₁ > 0, ${\hat{β}}_{1} > 0$ and subsequently $\hat{η} = {\hat{β}}_{- 1} / {\hat{β}}_{1}$ hold with probability tending to 1. In that asymptotic sense, these two estimated combinations differ by a scale and thus are equivalent in terms of performance. Therefore, for weak convergence, we can instead work with $\hat{η}$ in the restricted problem, which is advantageous since the distribution of $\hat{β}$ is concentrated on a (k − 1)-dimensional subspace of $ℝ^{k}$ by definition.

3. Asymptotic theory.

An asymptotic study for this non-standard estimation problem, via optimization (3), is conducted under the circumstance that the number of biomarkers k is held fixed, and the total sample size n ≡ n₁ + n₀ approaches ∞. Recall that the case and control samples are independent of each other, and each of them consists of independent and identically distributed observations. Consistency shall be tackled first. We then establish a uniform Bahadur representation for the estimated thresholds to obtain an approximation for the empirical constrained sensitivities, of nearly optimal combinations. On the basis of this approximation, weak convergence results are subsequently pursued.

3.1. Strong consistency.

Mild regularity conditions are imposed.

Condition 1 (Case and control sizes). The size ratio n₁/n₀ converges to a constant γ ∈ (0, ∞) as n → ∞.

Condition 2 (Threshold uniqueness). For any ε > 0, the threshold τ(b) satisfies ${sup}_{b : ‖ b ‖_{1} = 1} F_{0} {τ (b) - ε; b} < ρ < {inf}_{b : ‖ b ‖_{1} = 1} F_{0} {τ (b) + ε; b}$ .

Condition 3 (Sensitivity continuity). For each b such that ∥b∥₁ = 1, F₁(t; b) is continuous at t = τ(b).

Condition 1 implies that n₁ and n₀ in O(·) and o(·) notation expressions can be equivalently replaced with n, as we shall do consistently. Conditions 2 and 3 concern the biomarker distributions at the thresholds for the combinations, but not elsewhere. Nevertheless, the requirement for each combination might not be necessary especially for b such that ϕ(b) is far away from the optimum. Simplicity is part of the consideration in adopting the current form.

The conditions so far guarantee the existence of an optimal combination. However, the optimal combination is not necessarily unique as in the case, for example, that M₁ and M₀ lie in a linear subspace of dimension less than k with probability 1. Then, β is not well defined and much less the notion of consistency for its estimation. For that reason, we further require the following uniqueness condition.

Condition 4 (Optimal combination uniqueness). The maximizer β of ϕ(b) over $\{b : ‖ b ‖_{1} = 1, b \in ℝ^{k}\}$ is unique.

Theorem 3.1. Under Conditions 1–3, a maximizer β of ϕ(b) over $\{b : ‖ b ‖_{1} = 1, b \in ℝ^{k}\}$ exists. Consider an estimator $\tilde{β}$ such that $‖ \tilde{β} ‖_{1} = 1$ , satisfying

\hat{ϕ} (\tilde{β}) \geq max_{b : ‖ b ‖_{1} = 1} \hat{ϕ} (b) + o (1),

almost surely. Then, both $ϕ (\tilde{β})$ and $\hat{ϕ} (\tilde{β})$ converge to ϕ(β) almost surely. If, in addition, Condition 4 holds, then $\tilde{β}$ converges to β almost surely.

This result concerns a near maximizer $\tilde{β}$ , of which $\hat{β}$ is a special case. While ϕ(β) represents the ideal performance, the predictive performance $ϕ (\tilde{β})$ reflects how the learned combination $\tilde{β}$ performs on future data once implemented. Their difference $ϕ (β) - ϕ (\tilde{β})$ , referred to as the performance deficiency, converges to 0 almost surely. Meanwhile, the empirical estimate $\hat{ϕ} (\tilde{β})$ is consistent for both $ϕ (\tilde{β})$ and ϕ(β). Appealingly, this consistency may hold even when β is not unique and thus $\tilde{β}$ might not converge. Indeed, of most interest is the performance rather than the combination coefficient, which is a distinctive feature of this problem. Nevertheless, the uniqueness of β and consistency of $\hat{β}$ would facilitate a rigorous study on weak convergence of the predictive performance. In some circumstances, non-uniqueness of β may be resolved as with the example that M₁ and M₀ lie in a linear subspace of dimension less than k by means of reducing the biomarker dimension.

3.2. Approximating $\hat{ϕ} (h)$ via uniform Bahadur representation of $\hat{τ} (h)$ .

As discussed in Section 2.3, we may now switch to the simplified problem dealing with η and $\hat{η}$ instead of β and $\hat{β}$ . Recall β₁ > 0. Then, by Theorem 3.1 under Conditions 1–4, $\hat{η}$ converges to η almost surely. Write m⁺ ≡ mI(m ≥ 0) and m⁻ ≡ −mI(m ≤ 0), so that both parts are nonnegative and m = m⁺ − m⁻; in the case of a vector, the operators ⁺ and ⁻ apply componentwise. Additional conditions are adopted.

Condition 5 (Biomarker integrability). Biomarker M_d,−1 is integrable for d = 1, 0.

Condition 6 (Smoothness of control quantile). With the control anchor biomarker M_0,1, (i) the conditional distribution function Pr(M_0,1 + h^⊤M_0,−1 ≤ t | M_0,−1) has a bounded density for (h^⊤, t)^⊤ in a neighborhood of {η^⊤, τ(η)}^⊤; and (ii) possibly upon a location shift for M_0,−1, the density of the marginal distribution function $Pr (M_{0, 1} + h_{+}^{⊤} M_{0, - 1}^{+} + h_{-}^{⊤} M_{0, - 1}^{-} \leq t)$ is bounded away from 0 for ${(h_{+}^{⊤}, h_{-}^{⊤}, t)}^{⊤}$ in a neighborhood of {η^⊤, −η^⊤, τ(η)}^⊤.

Condition 7 (Smoothness of case and control distributions). For d = 1, 0, the conditional distribution function Pr(M_d,1 + h^⊤M_d,−1 ≤ t | M_d,−1) has a bounded second derivative with respect to t, for (h^⊤, t)^⊤ in a neighborhood of {η^⊤, τ(η)}^⊤.

Conditions 6 and 7 require the marginal distribution of the anchor biomarker M_d,1, d = 1, 0, to have certain smoothness, but not so for other biomarkers. However, these non-anchor biomarkers need to have finite expectations by Condition 5.

Remark 1. The biomarker combination in part (ii) of Condition 6 is more general than the linear combination under consideration, since the coefficients for positive or negative M_0,−1, h₊ or −h₋, respectively, may or may not be the same. However, in certain situations, the two are actually equivalent. Notable in the Condition is the possible location shift for M_0,−1, to which both the linear combination coefficient and the constrained sensitivity are invariant. Nevertheless, the split point of a biomarker for different coefficients then may correspond to any finite value before the shift. Therefore, when each component of M_0,−1 has a finite upper or lower bound for the support of its distribution, only one of the corresponding components of h₊ and −h₋ is relevant upon shifting the finite bound to 0, to give rise to a linear combination. The more general combination is only essential when a component of M_0,−1 has a distribution with support on the whole real line as with, e.g., a normal distribution.

Remark 2. When the distribution of M_d,−1 has a bounded support, |(h − η)^⊤M_d,−1| can be made arbitrarily small when h is sufficiently close to η. In that case, Condition 6 is satisfied when the conditional density of the optimal combination M_0,1 + η^⊤M_0,−1 given M_0,−1 is bounded from above and away from 0 in a neighborhood of τ(η). Similarly, Condition 7 is met when the conditional distribution of the optimal combination M_d,1 + η^⊤M_d,−1 given M_d,−1 has a bounded second derivative in a neighborhood of τ(η).

Remark 3. We have made these conditions slightly less general in the interest of clarity in exposition. The first biomarker, i.e., the anchor, serves a role different from the others in Conditions 5–7. Actually, that role may be taken by any biomarker with a non-zero coefficient in the optimal combination. Also, the three such biomarkers, one for Condition 6 and two for Condition 7 corresponding to d = 1, 0 separately, do not need to be the same. These conditions serve to ensure that the marginal distribution of a linear combination has certain smoothness around the threshold and around the optimal combination; see Appendix B. Only existence, but not identification, of these aforementioned biomarkers is required.

We first provide a uniform convergence rate of $\hat{τ} (h)$ , extending the classical result on the empirical quantile (cf. Serfling, 1980, lemma 2.5.4.B). Write ∥ · ∥_∞ as the maximum norm.

Lemma 3.2. Under Conditions 1–6, there exists a constant ε > 0 such that

sup_{‖ h - η ‖_{\infty} \leq ε} | \hat{τ} (h) - τ (h) | = O {n^{- 1 / 2} {(log n)}^{1 / 2}},

almost surely.

Meanwhile, the local behavior result of the empirical distribution function (cf. Serfling, 1980, lemma 2.5.4.E) can also be generalized.

Lemma 3.3. Suppose that Conditions 1–5 and 7 hold and take c₀ > 0 as any given constant. For d = 1, 0 separately, there exists a constant ε > 0 such that

sup_{\begin{matrix} ‖ h - η ‖_{\infty} \leq ε \\ | u | \leq c_{0} n_{d}^{- 1 / 2} {(log n_{d})}^{1 / 2} \end{matrix}} | {\hat{F}}_{d} {τ (h) + u; h} - {\hat{F}}_{d} {τ (h); h} - F_{d} {τ (h) + u; h} + F_{d} {τ (h); h} | = O {n^{- 3 / 4} {(log n)}^{3 / 4}},

almost surely.

Remark 4. The proofs of the uniform results in Lemmas 3.2 and 3.3, given in Appendix B, exploit a monotonicity property of $\hat{τ} (h)$ and ${\hat{F}}_{d} (t; h)$ . That is, in the special case that the non-anchor biomarkers, M_d,−1, d = 1, 0, are nonnegative, $\hat{τ} (h)$ is non-decreasing in each component of h whereas ${\hat{F}}_{d} (t; h)$ is non-decreasing in t and non-increasing in each component of h. Furthermore, the general problem can be so reformulated. This can be easily achieved, by means of biomarker shifting and sign-flipping, if the case and control distributions for each component of M_d,−1 are bounded in the union of their supports from either above or below. More generally, we now develop a novel biomarker splitting technique to extend the problem to a linear combination with 2k − 1 biomarkers, ${(M_{d, 1}, M_{d, - 1}^{+ ⊤}, M_{d, - 1}^{- ⊤})}^{⊤}$ . Now η in the original problem translates to (η^⊤, −η^⊤)^⊤ in the extended one, with all non-anchor biomarkers being nonnegative. Focus on Lemma 3.2 with Condition 6. Note that an extended linear combination can be equivalently written as:

M_{0, 1} + h_{+}^{⊤} M_{0, - 1}^{+} + h_{-}^{⊤} M_{0, - 1}^{-} = M_{0, 1} + {h_{+} ⊙ I (M_{0, - 1} \geq 0) - h_{-} ⊙ I (M_{0, - 1} < 0)}^{⊤} M_{0, - 1},

where the indicator functions apply to M_0,−1 componentwise, and ⊙ is the Hadamard product. Furthermore, conditioning on $M_{0, - 1}^{+}$ and $M_{0, - 1}^{-}$ is the same as that on M_0,−1. Therefore, part (i) of Condition 6 on the conditional distribution automatically holds for the extended problem if it does for the original one. Meanwhile, part (ii) of Condition 6 on the marginal distributions accommodates this extension as discussed in Remark 1. Therefore, the extended problem, with the monotonicity property, can be utilized to establish Lemma 3.2, upon noting that the result for the extended problem is more general. The same approach applies to Lemma 3.3 with Condition 7.
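The biomarker splitting identity can be checked numerically. The sketch below assumes the nonnegative convention for the negative part, m⁻ = −m·I(m < 0), which is the one under which the display above holds and all split biomarkers are nonnegative; the function names are hypothetical.

```python
def pos(m):
    """Positive part m^+ = m * I(m >= 0)."""
    return m if m >= 0 else 0.0

def neg(m):
    """Nonnegative negative part m^- = -m * I(m < 0) (assumed convention)."""
    return -m if m < 0 else 0.0

def linear_combination(anchor, rest, h):
    """Original combination: anchor + h^T rest."""
    return anchor + sum(hj * m for hj, m in zip(h, rest))

def extended_combination(anchor, rest, h_plus, h_minus):
    """Extended combination: anchor + h_plus^T rest^+ + h_minus^T rest^-."""
    return anchor + sum(
        hp * pos(m) + hm * neg(m) for hp, hm, m in zip(h_plus, h_minus, rest)
    )

# With (h_plus, h_minus) = (eta, -eta) the extended combination reduces to the
# original linear one, while every split biomarker is nonnegative.
eta = [0.5, -2.0]
rest = [-1.5, 3.0]
original = linear_combination(1.0, rest, eta)
extended = extended_combination(1.0, rest, eta, [-e for e in eta])
```

Choosing h₊ and −h₋ differently gives the more general, piecewise-linear combination that part (ii) of Condition 6 accommodates.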

Write f_d(t; h) as the probability density function of F_d(t; h), d = 1, 0, if it exists. Let λ ≡ f₁{τ(η); η}f₀{τ(η); η}⁻¹. The preceding lemmas give rise to a uniform Bahadur representation of the empirical threshold $\hat{τ} (h)$ and subsequently to an approximation of the empirical constrained sensitivity $\hat{ϕ} (h)$ by

\bar{ϕ} (h) = 1 - {\hat{F}}_{1} {τ (h); h} + λ [{\hat{F}}_{0} {τ (h); h} - ρ],

which is more amenable to analysis.

Theorem 3.4. Suppose that Conditions 1–7 hold. There exists a constant ε > 0 such that

sup_{‖ h - η ‖_{\infty} \leq ε} | \hat{τ} (h) - τ (h) + f_{0} {τ (h); h}^{- 1} [{\hat{F}}_{0} {τ (h); h} - ρ] | = O {n^{- 3 / 4} {(log n)}^{3 / 4}},

(7)

almost surely. Furthermore, for any h_n → η,

\hat{ϕ} (h_{n}) - \bar{ϕ} (h_{n}) = {‖ h_{n} - η ‖}_{\infty} O {n^{- 1 / 2} {(log n)}^{1 / 2}} + O {n^{- 3 / 4} {(log n)}^{3 / 4}},

(8)

almost surely.

The order of the remainder in (7), O{n^−3/4(log n)^3/4}, may be improved in the light of the sharper bound in the standard Bahadur representation (Bahadur, 1966; Kiefer, 1967). Nevertheless, for our purposes, the remainder needs only to be o_p(n^−2/3). In fact, a weaker version of (8),

\hat{ϕ} (h_{n}) - \bar{ϕ} (h_{n}) = {‖ h_{n} - η ‖}_{\infty} o_{p} (n^{- 1 / 3}) + o_{p} (n^{- 2 / 3}),

(9)

is adequate for the ensuing results.
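The following heuristic chain, a sketch only and not the argument of Appendix B, indicates why $\bar{ϕ} (h)$ tracks $\hat{ϕ} (h)$ and why (8) implies (9). It combines Lemma 3.3 with a first-order Taylor step and then substitutes representation (7).

```latex
\begin{align*}
\hat\phi(h)
  &= 1 - \hat F_1\{\hat\tau(h); h\} \\
  &\approx 1 - \hat F_1\{\tau(h); h\}
     - f_1\{\tau(h); h\}\,\{\hat\tau(h) - \tau(h)\}
     && \text{(Lemma 3.3 with } d = 1\text{, then a Taylor step)} \\
  &\approx 1 - \hat F_1\{\tau(h); h\}
     + \frac{f_1\{\tau(h); h\}}{f_0\{\tau(h); h\}}\,
       [\hat F_0\{\tau(h); h\} - \rho]
     && \text{(representation (7))} \\
  &\approx \bar\phi(h)
     && \text{(density ratio replaced by } \lambda \text{ for } h \text{ near } \eta\text{)}.
\end{align*}
% Rate bookkeeping from (8) to (9):
%   n^{-1/2} (\log n)^{1/2} = o(n^{-1/3})  since 1/2 > 1/3,
%   n^{-3/4} (\log n)^{3/4} = o(n^{-2/3})  since 3/4 > 2/3.
```

The last step is where the factor ∥h_n − η∥_∞ in (8) originates: replacing the density ratio at h_n by its value λ at η costs an amount proportional to the distance between h_n and η.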

3.3. Weak convergence: cube root asymptotics.

In their seminal work, Kim and Pollard (1990) established cube root asymptotics with their main theorem dealing with an unconstrained maximization of a one-sample empirical process. Our objective function $\hat{ϕ} (h)$ , however, is a rather complicated functional of empirical processes involving two independent samples. We shall extend their result to our problem, by exploiting the approximation (9) through $\bar{ϕ} (h)$ as a linear combination of empirical processes from independent samples.

Theorem 3.5. Suppose that Conditions 1–7 hold. Then, the process $n_{1}^{2 / 3} \{\hat{ϕ} (η + n_{1}^{- 1 / 3} a) - \hat{ϕ} (η)\}$ converges weakly to a Gaussian process Z(a) with continuous sample paths, mean a^⊤Ha/2, and covariance kernel V, where H and V are given by (26) and (27), respectively, in Appendix B. Consider $\tilde{η}$ as an estimator of η satisfying

\hat{ϕ} (\tilde{η}) \geq max_{h} \hat{ϕ} (h) + o_{p} (n^{- 2 / 3}) .

If H is negative definite and Z has nondegenerate increments, i.e., V (a, a) ≠ 0 for a ≠ 0, $n_{1}^{1 / 3} (\tilde{η} - η)$ converges in distribution to U ≡ argmax_a Z(a). Meanwhile, $n_{1}^{2 / 3} {ϕ (\tilde{η}) - ϕ (η)}$ converges in distribution to U^⊤HU/2.

Of course, $\hat{η}$ is a special case of $\tilde{η}$ . The estimated combination coefficient $\tilde{η}$ has an n^−1/3 convergence rate, leading to an n^−2/3 convergence rate of the predictive performance $ϕ (\tilde{η})$ to the ideal one ϕ(η). This exhibits a distinctive and more robust convergence profile in comparison with other linear combination methods, with respect to the constrained sensitivity under consideration. Note that a number of methods originally developed for cohort studies, such as support vector classifier and the maximum score estimator (Manski, 1975, 1985), can be applied to case-control studies as well, although their loss functions and thus interpretation may change accordingly. Among existing methods, the maximum score estimator has the same n^−1/3 convergence rate as $\tilde{η}$ but may not converge to the same limit. Thus, its predictive performance does not approach the ideal one ϕ(η) in general. Meanwhile, a large number of methods, including linear discriminant analysis, logistic regression, support vector classifier, and AUC maximization (Pepe, Cai and Longton, 2006), have the faster parametric n^−1/2 convergence rate of their combination coefficients. Accordingly, in the case that their limits coincide with η, their associated predictive performances have a faster n⁻¹ convergence rate; this occurs, for example, for logistic regression when the model is correctly specified. Under other circumstances, however, their predictive performances may not approach the ideal one ϕ(η), and also typically converge at a slower n^−1/2 rate.

4. Computational algorithm.

The optimization problem (3) is computationally challenging. Obviously brute-force grid search can be used, but the computational burden is prohibitive except with very few biomarkers. We have also approached the problem via modern mixed integer linear programming. Algorithmic advances and hardware improvements over the past three decades have dramatically sped up mixed integer optimization. The last few years have seen applications to several statistical problems, related or unrelated to ours, that were once regarded as intractable; see Florios and Skouras (2008) and Bertsimas, King and Mazumder (2016) among others. Most appealingly, this approach permits exact optimization or an approximate one with a definite error bound. Nevertheless, our experience showed that the computation is still too intensive, at least with personal computers, for typical datasets in practice. In the following, we suggest a different approach, not a purely computational one but rather a combined statistical and computational solution to balance efficiency and high quality. We shall devise an asymptotically equivalent optimization problem and then develop a novel computational algorithm, partly inspired by Ou, Zeng and Cai (2016) on a different problem.
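For intuition about the brute-force option, the sketch below scans a grid of directions on the ℓ₁ unit sphere with k = 2 biomarkers and keeps the direction with the largest empirical constrained sensitivity. The simulated normal data, the grid size, and the function names are illustrative assumptions; the cost of such a grid grows exponentially in k, which is why it is impractical beyond very few biomarkers.

```python
import math
import random

def phi_hat(b, cases, controls, rho):
    """Empirical sensitivity at the smallest threshold with specificity >= rho."""
    ctrl = sorted(sum(bj * mj for bj, mj in zip(b, m)) for m in controls)
    t = ctrl[max(math.ceil(len(ctrl) * rho - 1e-12), 1) - 1]
    return sum(sum(bj * mj for bj, mj in zip(b, m)) > t for m in cases) / len(cases)

def grid_search_2d(cases, controls, rho, n_angles=360):
    """Scan directions on the l1 unit sphere in R^2 and keep the best one."""
    best_b, best_val = None, -1.0
    for i in range(n_angles):
        theta = 2.0 * math.pi * i / n_angles
        v = (math.cos(theta), math.sin(theta))
        norm1 = abs(v[0]) + abs(v[1])          # rescale to unit l1 norm
        b = (v[0] / norm1, v[1] / norm1)
        val = phi_hat(b, cases, controls, rho)
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val

# Simulated data for illustration only (not from the paper).
rng = random.Random(0)
controls = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
cases = [(rng.gauss(1, 1), rng.gauss(0.5, 1)) for _ in range(200)]
b_grid, val_grid = grid_search_2d(cases, controls, rho=0.9)
```

The grid maximizer is only an approximation to $\hat{β}$ , since the objective is piecewise constant and the true maximizer may fall between grid points.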

Two features of the optimization problem (3) contribute to its poor computational properties. One is the nonlinear equality constraint, on ∥b∥₁, and the other is the indicator function in the objective function and the inequality constraint. For the former, because of scale invariance of $\hat{ϕ} (b)$ to b,

\hat{β} = arg max_{b : ‖ b ‖_{1} \leq 1} \hat{ϕ} (b) + w ‖ b ‖_{1},

with any constant w > 0. This results in an equivalent optimization problem with inequality constraints only,

min_{b, t} \hat{E} I (b^{⊤} M_{1} \leq t) - w ‖ b ‖_{1} \quad subject to \quad ρ - \hat{E} I (b^{⊤} M_{0} \leq t) \leq 0, \quad ‖ b ‖_{1} - 1 \leq 0;

(10)

such a minimization formulation is more standard in the optimization literature. For the indicator function, approximating I(x ≤ 0) by σ⁻¹{x⁻ − (x + σ)⁻}, for a small σ > 0, gives rise to:

min_{b, t} σ^{- 1} \hat{E} {(b^{⊤} M_{1} - t)}^{-} - \{σ^{- 1} \hat{E} {(b^{⊤} M_{1} - t + σ)}^{-} + w ‖ b ‖_{1}\} \quad subject to \quad \{ρ + σ^{- 1} \hat{E} {(b^{⊤} M_{0} - t + σ)}^{-}\} - σ^{- 1} \hat{E} {(b^{⊤} M_{0} - t)}^{-} \leq 0, \quad ‖ b ‖_{1} - 1 \leq 0.

(11)

However, the solution b of this problem, say, ${\hat{ξ}}_{σ}$ , may no longer have unity ℓ₁ norm. A rescaling leads to ${\hat{β}}_{σ} = {\hat{ξ}}_{σ} / {‖ {\hat{ξ}}_{σ} ‖}_{1}$ .
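The piecewise-linear surrogate for the indicator can be sketched as follows. The code assumes x⁻ denotes the nonnegative negative part, −x·I(x ≤ 0), which is the convention under which the surrogate takes values in [0, 1]; the function names are hypothetical.

```python
def neg_part(x):
    """Nonnegative negative part x^- = -x * I(x <= 0) (assumed convention)."""
    return -x if x <= 0 else 0.0

def ramp(x, sigma):
    """Surrogate for I(x <= 0): equals 1 for x <= -sigma, 0 for x >= 0,
    and is linear in between. It is a difference of two convex functions."""
    return (neg_part(x) - neg_part(x + sigma)) / sigma

def smoothed_cdf(scores, t, sigma):
    """Surrogate for the empirical distribution function at t:
    the average of ramp(score - t, sigma) over the sample."""
    return sum(ramp(s - t, sigma) for s in scores) / len(scores)
```

Each of the two terms of `ramp` is convex in x, so the surrogate objective and the first constraint in (11) are differences of convex functions of (b, t), and the surrogate approaches the indicator as σ decreases to 0.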

This estimation procedure is better elucidated through an analysis. Add subscript σ to ${\hat{F}}_{d}$ , $\hat{τ}$ , and $\hat{ϕ}$ to denote their counterparts after the indicator function approximation. From (11), we can see

{\hat{ξ}}_{σ} = arg max_{b : ‖ b ‖_{1} \leq 1} {\hat{ϕ}}_{σ} (b) + w ‖ b ‖_{1} .

With σ > 0, ${\hat{ϕ}}_{σ} (b)$ is no longer scale invariant to b as ${\hat{ϕ}}_{σ} (s b) = {\hat{ϕ}}_{s^{- 1} σ} (b)$ for s > 0, which explains that ${‖ {\hat{ξ}}_{σ} ‖}_{1}$ may not be 1. Nevertheless, since ${\hat{ϕ}}_{σ} (b) \in [0, 1]$ , we have $1 + w {‖ {\hat{ξ}}_{σ} ‖}_{1} \geq w$ or ${‖ {\hat{ξ}}_{σ} ‖}_{1} \geq 1 - w^{- 1}$ . Suppose w > 1 and we then obtain

{\hat{ξ}}_{σ} = arg max_{b : ‖ b ‖_{1} = s} {\hat{ϕ}}_{σ} (b),

for some data-dependent s ∈ [1−w⁻¹, 1]. Subsequently,

{\hat{β}}_{σ} = arg max_{b : ‖ b ‖_{1} = 1} {\hat{ϕ}}_{s^{- 1} σ} (b) .

(12)
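The scaling identity behind (12) can be checked numerically at the level of the smoothed empirical distribution function: scaling the combination and threshold by s is the same as dividing σ by s. The following is a minimal sketch on synthetic data; the function names `ramp` and `F_sigma` and all numerical values are illustrative assumptions, not part of the authors' software.

```python
import numpy as np

def ramp(x, sigma):
    """Smoothed indicator of x <= 0, i.e. sigma^{-1}{x^- - (x + sigma)^-}."""
    neg = lambda z: np.maximum(-z, 0.0)   # negative part z^- = max(-z, 0)
    return (neg(x) - neg(x + sigma)) / sigma

def F_sigma(t, b, M, sigma):
    """Smoothed empirical distribution function of b'M evaluated at t."""
    return ramp(M @ b - t, sigma).mean()

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 3))             # synthetic biomarkers (illustrative)
b = np.array([0.5, 0.3, -0.2])            # unit l1 norm
t, sigma, s = 0.4, 0.25, 0.6

lhs = F_sigma(s * t, s * b, M, sigma)     # combination s*b with threshold s*t
rhs = F_sigma(t, b, M, sigma / s)         # combination b with sigma replaced by sigma/s
print(lhs, rhs)
```

Because the identity holds for every threshold and for both cases and controls, it carries over to the smoothed threshold and hence to the smoothed objective.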

Meanwhile, in the circumstance of the first biomarker being the anchor as described in Section 2.3, write ${\hat{β}}_{σ} \equiv {({\hat{β}}_{1, σ}, {\hat{β}}_{- 1, σ}^{⊤})}^{⊤}$ and ${\hat{η}}_{σ} \equiv {\hat{β}}_{- 1, σ} / | {\hat{β}}_{1, σ} |$ . When ${\hat{β}}_{1, σ} > 0$ , we have a similar identity,

{\hat{η}}_{σ} = arg max_{h} {\hat{ϕ}}_{{(s | {\hat{β}}_{1, σ} |)}^{- 1} σ} (h) .

(13)

On the other hand, from I(x ≤ −σ) ≤ σ⁻¹{x⁻ − (x + σ)⁻} ≤ I(x ≤ 0), we have ${\hat{F}}_{d} (t - σ; b) \leq {\hat{F}}_{d, σ} (t; b) \leq {\hat{F}}_{d} (t; b)$ , $\hat{τ} (b) \leq {\hat{τ}}_{σ} (b) \leq \hat{τ} (b) + σ$ , and subsequently

1 - {\hat{F}}_{1} {\hat{τ} (b) + σ; b} \leq {\hat{ϕ}}_{σ} (b) \leq 1 - {\hat{F}}_{1} {\hat{τ} (b) - σ; b} .

(14)

Identities (12) and (13) together with (14) suggest that, with σ sufficiently small, impact of the approximation is asymptotically negligible such that the consistency and weak convergence results in Theorems 3.1 and 3.5 hold.
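The sandwich bounds leading to (14) can be illustrated numerically. The sketch below assumes that the empirical threshold is the smallest t with empirical control distribution at least ρ, and that its smoothed counterpart is defined analogously; the bisection routine, function names, and synthetic data are assumptions made for illustration only.

```python
import math
import numpy as np

def ramp(x, sigma):
    """Smoothed indicator of x <= 0; equals sigma^{-1}{x^- - (x + sigma)^-}."""
    return np.clip(-x / sigma, 0.0, 1.0)

def tau_hat(s0, rho):
    """Smallest t with empirical F0(t) >= rho (an order statistic of the control scores)."""
    return np.sort(s0)[math.ceil(rho * len(s0)) - 1]

def tau_hat_sigma(s0, rho, sigma, iters=60):
    """Bisection for the smallest t with smoothed F0(t) >= rho, searched in [tau, tau + sigma]."""
    lo = tau_hat(s0, rho)
    hi = lo + sigma
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ramp(s0 - mid, sigma).mean() >= rho - 1e-12:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(1)
M0 = rng.normal(size=(300, 3))                 # synthetic controls
M1 = rng.normal(loc=0.9, size=(300, 3))        # synthetic cases
b = np.array([0.4, 0.3, 0.3])
rho, sigma = 0.9, 0.05
s0, s1 = M0 @ b, M1 @ b

tau = tau_hat(s0, rho)
tau_s = tau_hat_sigma(s0, rho, sigma)
phi_s = 1.0 - ramp(s1 - tau_s, sigma).mean()   # smoothed objective
lower = 1.0 - np.mean(s1 <= tau + sigma)       # left-hand side of (14)
upper = 1.0 - np.mean(s1 <= tau - sigma)       # right-hand side of (14)
print(tau, tau_s, lower, phi_s, upper)
```

As σ shrinks, the two bounds close in on the unsmoothed objective, which is the mechanism the corollary below formalizes.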

Corollary 4.1. Set the finite constant w > 1 in optimization problem (11). If σ = o(1), then ${\hat{β}}_{σ}$ is a special case of $\tilde{β}$ defined in Theorem 3.1. Furthermore, when the first biomarker is the anchor and σ = o_p(n^−2/3), ${\hat{β}}_{σ}$ and ${(1, {\hat{η}}_{σ}^{⊤})}^{⊤}$ differ by a scale with probability tending to 1 and ${\hat{η}}_{σ}$ is a special case of $\tilde{η}$ defined in Theorem 3.5.

Now, focus on the computation with problem (11). The objective function and the left-hand side of the first constraint are sums of convex and concave components, whereas the left-hand side of the second constraint is convex. We can then adopt the concave-convex procedure (Yuille and Rangarajan, 2003) as extended by Lipp and Boyd (2016) to accommodate constraints, which is the core of Algorithm 1. At each step, the two concave components are replaced by their tangent planes at the current variable value, resulting in a convex optimization problem. In our application, the convex optimization is actually a linear program as given by (15), upon adopting slack variables, thanks partly to the adopted ℓ₁ norm on b. The current variable value is then updated with the optimizer, which always satisfies the original constraints and improves the original objective function. Such steps can be repeated until the original objective function can no longer be improved, and the algorithmic convergence in the objective function of (11) is guaranteed.
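A single concave-convex step for problem (11) can be sketched as a linear program. The code below is an illustrative reconstruction under stated assumptions, not the authors' implementation and not necessarily identical to their linear program (15): it uses `scipy.optimize.linprog` rather than Rmosek, linearizes the two concave components by subgradients at the current value, and introduces slack variables `u`, `v`, and `a` (hypothetical names) for the negative parts and for |b|.

```python
import math
import numpy as np
from scipy.optimize import linprog

def neg(x):
    """Negative part x^- = max(-x, 0), a convex function of x."""
    return np.maximum(-x, 0.0)

def objective(b, t, M1, w, sigma):
    s = M1 @ b - t
    return (neg(s).mean() - neg(s + sigma).mean()) / sigma - w * np.abs(b).sum()

def constraint(b, t, M0, rho, sigma):
    s = M0 @ b - t
    return rho + (neg(s + sigma).mean() - neg(s).mean()) / sigma

def ccp_step(b, t, M1, M0, rho, w, sigma):
    """Linearize the concave parts of (11) at (b, t) and solve the resulting LP."""
    (n1, k), n0 = M1.shape, M0.shape[0]
    # Subgradient of h1 = E(b'M1 - t + sigma)^-/sigma + w||b||_1 (objective contains -h1).
    a1 = (M1 @ b - t + sigma < 0).astype(float)
    g1_b, g1_t = -(a1 @ M1) / (n1 * sigma) + w * np.sign(b), a1.mean() / sigma
    # Value and subgradient of h0 = E(b'M0 - t)^-/sigma (constraint contains -h0).
    s0 = M0 @ b - t
    a0 = (s0 < 0).astype(float)
    g0_b, g0_t, h0 = -(a0 @ M0) / (n0 * sigma), a0.mean() / sigma, neg(s0).mean() / sigma
    # LP variables: b (k), t (1), u (n1), v (n0), a (k).
    nv = 2 * k + 1 + n1 + n0
    ib, it = slice(0, k), k
    iu = slice(k + 1, k + 1 + n1)
    iv = slice(k + 1 + n1, k + 1 + n1 + n0)
    ia = slice(k + 1 + n1 + n0, nv)
    c = np.zeros(nv)
    c[ib], c[it], c[iu] = -g1_b, -g1_t, 1.0 / (n1 * sigma)
    rows = lambda m: np.zeros((m, nv))
    A_u = rows(n1); A_u[:, ib] = -M1; A_u[:, it] = 1.0; A_u[:, iu] = -np.eye(n1)  # u_i >= t - b'M1_i
    A_v = rows(n0); A_v[:, ib] = -M0; A_v[:, it] = 1.0; A_v[:, iv] = -np.eye(n0)  # v_j >= t - sigma - b'M0_j
    A_c = rows(1); A_c[0, ib] = -g0_b; A_c[0, it] = -g0_t; A_c[0, iv] = 1.0 / (n0 * sigma)
    A_p = rows(k); A_p[:, ib] = np.eye(k); A_p[:, ia] = -np.eye(k)                # b_l <= a_l
    A_m = rows(k); A_m[:, ib] = -np.eye(k); A_m[:, ia] = -np.eye(k)               # -b_l <= a_l
    A_s = rows(1); A_s[0, ia] = 1.0                                               # sum_l a_l <= 1
    A = np.vstack([A_u, A_v, A_c, A_p, A_m, A_s])
    rhs = np.concatenate([np.zeros(n1), np.full(n0, sigma),
                          [h0 - rho - g0_b @ b - g0_t * t], np.zeros(2 * k), [1.0]])
    bounds = [(None, None)] * (k + 1) + [(0, None)] * (n1 + n0 + k)
    res = linprog(c, A_ub=A, b_ub=rhs, bounds=bounds, method="highs")
    return res.x[ib], res.x[it]

rng = np.random.default_rng(2)
M0 = rng.normal(size=(100, 3))               # synthetic controls
M1 = rng.normal(loc=0.9, size=(100, 3))      # synthetic cases
rho, w, sigma = 0.9, 2.0, 0.5
b0 = np.full(3, 1.0 / 3.0)
t0 = np.sort(M0 @ b0)[math.ceil(rho * 100) - 1] + sigma   # a feasible starting threshold
b1, t1 = ccp_step(b0, t0, M1, M0, rho, w, sigma)
print(objective(b0, t0, M1, w, sigma), objective(b1, t1, M1, w, sigma))
```

Since a tangent plane lies above a concave function, the linearized constraint is conservative and the linearized objective is a majorizer; up to solver tolerance, the update therefore stays feasible for (11) and does not increase its objective.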

Nevertheless, one issue with the implementation is how to choose σ. Corollary 4.1 suggests a small value, which, however, could lead to a local optimizer near the initial value. Indeed, multiple local optimizers may exist and the concave-convex procedure is not guaranteed to reach the global one (cf. Lipp and Boyd, 2016). In Algorithm 1, rather than a single small value, a sequence of decreasing σ values is taken. Thanks to the specific approximation adopted for the indicator function, the transition from one σ value to the next is seamless since the variable value remains feasible, i.e., satisfying the constraints, as σ decreases. With a larger σ, the objective function and the constraint are conceivably smoother, with fewer local optima. As the σ value decreases gradually, the optimizer might thus be more likely to approach the global one.
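The feasibility-preserving property can be seen directly: for fixed (b, t), the smoothed indicator is non-decreasing as σ decreases, so the left-hand side of the first constraint in (11) can only go down. A minimal sketch on synthetic data follows; the values of b, t, ρ, and the σ sequence are illustrative assumptions.

```python
import numpy as np

def constraint(b, t, M0, rho, sigma):
    """Left-hand side of the first constraint in (11): rho minus the smoothed F0(t; b)."""
    smoothed = np.clip(-(M0 @ b - t) / sigma, 0.0, 1.0)   # equals sigma^{-1}{x^- - (x + sigma)^-}
    return rho - smoothed.mean()

rng = np.random.default_rng(3)
M0 = rng.normal(size=(200, 3))            # synthetic controls
b = np.array([0.5, 0.25, 0.25])
t, rho = 1.6, 0.9
sigmas = 0.5 * 0.8 ** np.arange(15)       # decreasing sequence (illustrative)

values = [constraint(b, t, M0, rho, s) for s in sigmas]
print(values[0], values[-1])
```

A point that is feasible at the largest σ therefore remains feasible along the whole sequence, so each stage can be started from the previous optimizer without repair.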

With shrinking σ in Algorithm 1, the choice of constant w > 1 becomes less essential since w is absorbed into σ in identities (12) and (13). A convenient value, say, 2, suffices. Finally, an initial feasible value is required for (b^⊤, t)^⊤ with the starting σ value. For that purpose, a working method such as logistic regression or a coarse grid search may be adopted.

As a note, this developed algorithm is not limited to the proposed method but rather is generally applicable for other optimization problems of a similar form. The maximum score estimator of Manski (1975, 1985) is one example.

[Algorithm 1 is presented as a graphic in the original manuscript.]

5. Numerical studies.

The proposed empirical utility maximization with Algorithm 1 has been implemented using a linear programming solver from R package Rmosek. In the studies reported below, w was set to 2. The logistic regression coefficient served as the initial value; the results using a coarse grid search were similar. With the initial coefficient, the maximum gap among the cases and that among the controls, between adjacent order statistics of the combinations, were computed. The larger of the two was taken as the starting σ value. A feasible initial value for the threshold was then calculated. Each subsequent σ value was shrunk by a factor of 0.8.
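The choice of the starting σ described above can be sketched as follows. The function names and the number of steps in the sequence are assumptions; the text specifies the starting value and the shrinkage factor of 0.8 but not a stopping rule.

```python
import numpy as np

def starting_sigma(b, M1, M0):
    """Larger of the maximum gaps between adjacent order statistics of b'M
    among the cases and among the controls."""
    gap_cases = np.diff(np.sort(M1 @ b)).max()
    gap_controls = np.diff(np.sort(M0 @ b)).max()
    return max(gap_cases, gap_controls)

def sigma_schedule(sigma0, n_steps, factor=0.8):
    """Geometric sequence of sigma values; n_steps is a hypothetical tuning choice."""
    return sigma0 * factor ** np.arange(n_steps)

# Toy one-marker example with hand-checkable gaps.
M1 = np.array([[0.0], [1.0], [3.0]])      # gaps 1 and 2 among cases
M0 = np.array([[0.0], [0.5], [1.5]])      # gaps 0.5 and 1 among controls
b = np.array([1.0])

sigma0 = starting_sigma(b, M1, M0)
schedule = sigma_schedule(sigma0, n_steps=3)
print(sigma0, schedule)
```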

Seven existing linear combination methods were included in our numerical studies for comparison. Linear discriminant analysis and logistic regression are both standard methods; the latter in particular is routinely adopted in biomarker research. As a semiparametric method, the monotonic density ratio model of Chen et al. (2016) is more general. For the support vector classifier, function svm in R package e1071 was used. To target the AUC, we adopted the smoothed empirical AUC maximizer of Lin et al. (2011) as implemented in R package aucm. As software for the maximum score estimator of Manski (1975, 1985) did not appear to be readily available, we adapted our algorithm in Section 4 for its computation. Finally, the kernel smoothing-based estimator of Meisner et al. (2021) as implemented in R package maxTPR with default parameters was assessed as well.

Specificity at controlled 95% sensitivity is most relevant for our prostate cancer application reported in Section 5.2. That specific performance metric accordingly was adopted for all the numerical studies, and so targeted by both the proposed empirical utility maximization and the kernel smoothing-based estimator of Meisner et al. (2021). The methods formulated for sensitivity at controlled specificity, including our proposal in Sections 2–4, were applied after transposing the roles of cases and controls.
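The role transposition can be made concrete with a small sketch. It assumes that a test is positive when the combination exceeds the threshold and that the empirical threshold is an order statistic; the function names are hypothetical, and ties are handled in the simplest way.

```python
import math
import numpy as np

def sens_at_spec(b, M1, M0, rho):
    """Empirical sensitivity at specificity >= rho; positive when b'M > threshold."""
    s0 = np.sort(M0 @ b)
    tau = s0[math.ceil(rho * len(s0)) - 1]    # smallest t with empirical F0(t) >= rho
    return np.mean(M1 @ b > tau)

def spec_at_sens(b, M1, M0, rho):
    """Empirical specificity at sensitivity >= rho, obtained by swapping the roles
    of cases and controls and negating the biomarkers."""
    return sens_at_spec(b, -M0, -M1, rho)

# Toy one-marker example: 3 of 4 cases lie at or above 2, and 3 of 4 controls lie below 2.
M1 = np.array([[1.0], [2.0], [3.0], [4.0]])
M0 = np.array([[0.0], [0.5], [1.5], [2.5]])
result = spec_at_sens(np.array([1.0]), M1, M0, rho=0.75)
print(result)
```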

The computer code for the proposed empirical utility maximization and the maximum score estimator of Manski (1975, 1985) is available on the first author’s website (http://web1.sph.emory.edu/users/yhuang5).

5.1. Simulations.

Across all set-ups, the control biomarkers were independent and identically distributed with the standard normal distribution, whereas the case biomarkers had different distributions from one set-up to another. Either three or six biomarkers were considered for combination. For the former, all the biomarkers were informative, i.e., having different case and control distributions. Then, in the latter case, three independent and non-informative biomarkers were additionally included; that is, each followed the same standard normal distribution for the cases as for the controls. Four different scenarios were constructed for the three informative biomarkers:

Scenario A. The case biomarkers were independent and normally distributed each with mean 0.9 and variance 1.
Scenario B. The case biomarkers were independent and normally distributed with the same mean of 0.8 but different variances, 0.5, 1, and 2.
Scenario C. The case distribution was jointly normal with the same mean of 1, variances of 0.5, 1, and 2, and pairwise correlation coefficient of 0.5.
Scenario D. The case biomarkers followed a mixture of independent normal distributions. With probability 2/3, they had means of 1.7, 1.7, and 0, and variances of 0.5, 2, and 1, respectively. Then, with probability 1/3, their means became 0, 0, and 1.7, and the variances had the same value of 1.

These scenarios were motivated by cancer detection applications. In Scenarios A, B, and C, all three case biomarkers were elevated in comparison to their control counterparts. However, they might be independent or correlated, with the same or different variability. Scenario D mimicked the heterogeneity common in cancer biology, involving two subtypes. The first two biomarkers were only elevated in one subtype, whereas the last was so in the other. The assumptions for linear discriminant analysis, logistic regression, and the monotonic density ratio model hold under Scenario A, but not under Scenarios B, C, and D. At controlled 95% sensitivity, the optimal specificities with linear combinations were 0.466, 0.452, 0.444, and 0.442 under Scenarios A, B, C, and D, respectively. The case and control sizes, n₁ and n₀, were set equal with values from 100 to 500.
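The four scenarios can be generated along the following lines. This is a sketch based only on the descriptions above; the function name `simulate` and its arguments are hypothetical, and the authors' simulation code may differ in detail.

```python
import numpy as np

def simulate(scenario, n, n_noise=0, rng=None):
    """Return (cases, controls), each n rows, with 3 informative biomarkers
    followed by n_noise non-informative standard normal biomarkers."""
    rng = np.random.default_rng() if rng is None else rng
    if scenario == "A":
        cases = rng.normal(0.9, 1.0, size=(n, 3))
    elif scenario == "B":
        cases = rng.normal(0.8, np.sqrt([0.5, 1.0, 2.0]), size=(n, 3))
    elif scenario == "C":
        sd = np.sqrt([0.5, 1.0, 2.0])
        corr = np.full((3, 3), 0.5) + 0.5 * np.eye(3)      # pairwise correlation 0.5
        cases = rng.multivariate_normal(np.ones(3), corr * np.outer(sd, sd), size=n)
    elif scenario == "D":
        first = rng.random(n) < 2.0 / 3.0                  # subtype membership
        mean = np.where(first[:, None], [1.7, 1.7, 0.0], [0.0, 0.0, 1.7])
        sd = np.where(first[:, None], np.sqrt([0.5, 2.0, 1.0]), [1.0, 1.0, 1.0])
        cases = mean + sd * rng.standard_normal((n, 3))
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    controls = rng.standard_normal((n, 3))
    if n_noise:
        cases = np.hstack([cases, rng.standard_normal((n, n_noise))])
        controls = np.hstack([controls, rng.standard_normal((n, n_noise))])
    return cases, controls

cases_c, controls_c = simulate("C", 20000, n_noise=3, rng=np.random.default_rng(0))
cases_d, _ = simulate("D", 20000, rng=np.random.default_rng(1))
print(cases_c.shape, cases_c[:, :3].var(axis=0), cases_d.mean(axis=0))
```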

Results were obtained from 1000 simulations for each set-up. Table 1 shows summary statistics of performance deficiencies for the proposed empirical utility maximization along with the seven existing methods. The averaged performance deficiencies for a subset of these methods are also displayed in Figure 1. Under each scenario, performance deficiency generally decreased with larger sample size as expected, and also from 6 to 3 biomarkers after the elimination of non-informative ones. In Scenario A, all the estimated combination coefficients converge to the optimal one and so do their predictive performances. Not surprisingly, linear discriminant analysis and logistic regression performed the best. They were followed by the monotonic density ratio model, the support vector classifier, and the empirical AUC maximizer. The maximum score estimator and the kernel smoothing-based estimation also had better performance than the empirical utility maximization, although the differences were fairly small. However, in Scenario B and more so in Scenario C, the empirical utility maximization had the best performance, whereas linear discriminant analysis and logistic regression performed rather poorly, and so did the monotonic density ratio model, the support vector classifier, and the empirical AUC maximizer. Under Scenario D, the relative performance varied with sample size and the number of biomarkers; the empirical utility maximization became the best as the sample size increased. Overall, the proposed method showed a different performance profile, even in comparison with the kernel smoothing-based method. Across all set-ups, on a logarithmic scale for both variables, the averaged performance deficiency and the sample size showed roughly a linear relationship with a slope of −2/3 for the empirical utility maximization. This is consistent with the asymptotic result of n^−2/3 convergence rate.
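The log-log relationship can be examined roughly from the tabulated means. The sketch below fits an unconstrained least-squares slope to the EUM rows of Table 1 for 3 biomarkers; this is a simple check rather than the fixed-slope fit shown in Figure 1, and the rounding of the tabulated values limits its precision.

```python
import numpy as np

n = np.array([100, 200, 300, 400, 500])
# Mean performance deficiency (x1000) of EUM with 3 biomarkers, read from Table 1.
eum = {
    "A": [30, 19, 15, 12, 11],
    "B": [30, 18, 14, 13, 11],
    "C": [37, 22, 18, 15, 13],
    "D": [33, 22, 17, 14, 12],
}
# Least-squares slope of log(deficiency) on log(sample size) for each scenario.
slopes = {s: np.polyfit(np.log(n), np.log(v), 1)[0] for s, v in eum.items()}
for s, slope in slopes.items():
    print(s, round(slope, 2))
```

The fitted slopes can then be compared with the theoretical value of −2/3.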

Table 1.

Simulation results on performance deficiencies of specificity at 95% sensitivity

	3 biomarkers										6 biomarkers
n₁ = n₀ =	100		200		300		400		500		100		200		300		400		500
	M	D	M	D	M	D	M	D	M	D	M	D	M	D	M	D	M	D	M	D
	Scenario A
LDA	8	8	4	4	3	3	2	2	2	2	20	13	10	6	7	4	5	3	4	3
LR	8	8	4	4	3	3	2	2	2	2	21	14	10	7	7	5	5	3	4	3
MDR	10	10	5	5	4	4	3	3	2	3	23	15	12	8	8	5	6	4	5	4
SVC	9	10	5	5	3	3	2	2	2	2	24	15	12	8	8	5	6	4	5	3
AUC	9	15	4	4	3	3	2	2	2	2	24	23	11	7	7	5	6	4	5	3
MSE	22	21	14	13	10	10	9	8	7	7	44	29	27	18	20	13	16	11	14	10
KS	20	26	13	14	9	17	7	9	6	7	46	34	29	26	19	19	17	18	14	11
EUM	30	30	19	18	15	15	12	12	11	10	54	35	36	24	27	18	23	16	20	14
	Scenario B
LDA	37	25	31	17	30	14	30	12	29	11	50	27	39	18	35	14	33	12	32	10
LR	37	26	31	17	30	14	29	12	28	11	50	28	39	19	34	14	33	13	31	10
MDR	41	29	34	21	33	17	31	15	30	13	53	29	42	20	37	16	35	14	33	12
SVC	38	28	31	19	30	16	29	13	28	12	52	30	40	20	35	16	33	14	32	12
AUC	38	30	32	18	31	15	30	13	29	11	57	46	40	20	36	15	34	13	32	11
MSE	43	41	35	31	30	25	27	22	26	20	63	41	47	30	39	25	35	22	33	20
KS	33	33	21	18	20	22	18	21	16	13	56	39	38	29	30	18	26	21	24	14
EUM	30	29	18	17	14	14	13	13	11	11	53	34	34	25	26	18	22	17	19	13
	Scenario C
LDA	94	36	92	26	91	20	90	19	91	17	105	36	98	26	95	22	92	19	93	16
LR	98	39	97	28	96	22	95	21	96	18	110	38	103	28	100	23	98	20	98	18
MDR	106	42	107	32	108	28	108	27	110	25	114	40	109	29	107	25	106	23	107	20
SVC	106	42	105	31	104	25	103	23	104	21	118	42	111	31	107	26	105	23	105	20
AUC	95	40	92	28	92	22	91	20	91	18	110	48	99	29	95	23	94	20	94	18
MSE	106	63	104	53	104	49	108	47	107	45	118	59	112	52	110	49	110	45	111	41
KS	61	57	44	45	39	39	30	35	26	33	88	55	63	40	63	40	46	36	40	32
EUM	37	39	22	24	18	19	15	16	13	13	66	45	42	30	32	24	26	19	23	17
	Scenario D
LDA	34	27	28	18	27	15	26	13	25	11	45	29	34	17	30	15	27	12	27	11
LR	33	25	26	17	25	14	25	12	24	11	44	27	32	16	29	14	26	11	25	10
MDR	37	33	31	23	30	21	29	18	28	16	46	30	35	20	32	17	30	15	29	14
SVC	40	33	30	22	28	17	28	14	26	13	50	32	37	21	32	17	29	14	28	12
AUC	33	26	25	17	24	14	23	12	22	10	47	38	32	17	28	14	25	11	24	10
MSE	54	52	41	39	37	35	35	32	33	29	66	48	50	38	46	34	41	31	36	25
KS	36	42	21	20	19	25	16	20	14	22	56	43	37	23	29	25	25	20	21	13
EUM	33	31	22	21	17	16	14	13	12	12	59	38	39	25	29	20	25	17	20	13


M: empirical mean (×1000); D: empirical standard deviation (×1000).

LDA: linear discriminant analysis; LR: logistic regression; MDR: monotonic density ratio model; SVC: support vector classifier; AUC: smoothed empirical AUC maximizer; MSE: maximum score estimator; KS: kernel smoothing-based method of Meisner et al. (2021); EUM: proposed empirical utility maximization.

Fig 1. Simulation results on linear combination via the proposed empirical utility maximization (●), in comparison with logistic regression (○), smoothed empirical AUC maximizer (Δ), maximum score estimator (+), and kernel smoothing-based method of Meisner et al. (2021) (×). The least-squares fitting lines of −2/3 slope are shown for the empirical utility maximization.

These simulations were performed on a 2020 MacBook Pro laptop with 2.3 GHz Intel Core i7. For the empirical utility maximization, the average CPU time for a single dataset ranged from 0.82 seconds in a case of n₁ = n₀ = 100 and 3 biomarkers to 4.43 seconds with n₁ = n₀ = 500 and 6 biomarkers.

5.2. Application to prostate cancer detection.

The proposed methodology was motivated by prostate cancer research, to improve the detection of aggressive cancer, i.e., Gleason score ≥ 7, using non-invasive biomarkers among men undergoing their first-time biopsy. Among a limited number of commercially available test assays, prostate health index (phi) is an FDA-approved blood test that combines three forms of prostate-specific antigen (PSA) from serum, namely total PSA (tPSA), free PSA (fPSA), and isoform [−2]proPSA (p2PSA):

phi = \frac{p2PSA}{fPSA} \times \sqrt{tPSA},

which is a proprietary calculation developed by Beckman Coulter Inc. The test has been evaluated and adopted to distinguish aggressive cancer from indolent or no cancer (Catalona et al., 2011). However, it was unclear whether the combination of the three PSA forms could be improved to achieve better specificity at a controlled high sensitivity level, in particular, 95%. To address it, we analyzed 156 cases and 358 controls, i.e., with and without aggressive prostate cancer, respectively, per pathology testing on prostate biopsies, enrolled in academic urology groups (Sanda et al., 2017). Serum specimens of these participants, obtained prior to biopsy, were assayed for phi.
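On the logarithmic scale, the phi formula is a linear combination of the three log-transformed PSA forms, and rescaling its coefficients to unit ℓ₁ norm reproduces the phi row of Table 2. A minimal sketch follows; the marker values used in the check are arbitrary illustrative numbers, not study data.

```python
import numpy as np

def phi(tpsa, fpsa, p2psa):
    """Prostate health index: (p2PSA / fPSA) * sqrt(tPSA)."""
    return p2psa / fpsa * np.sqrt(tpsa)

# log(phi) = 0.5*log(tPSA) - 1*log(fPSA) + 1*log(p2PSA)
coef = np.array([0.5, -1.0, 1.0])
coef_l1 = coef / np.abs(coef).sum()       # rescaled to unit l1 norm

markers = np.array([4.0, 1.0, 10.0])      # arbitrary (tPSA, fPSA, p2PSA) values
log_phi = np.log(phi(*markers))
linear = coef @ np.log(markers)
print(coef_l1, log_phi, linear)
```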

Note that phi, on the logarithmic scale, is a linear combination of log-transformed tPSA, fPSA, and p2PSA. We applied our proposed method and the existing ones considered in the earlier simulations to this data set, as reported in Table 2. The combination coefficient of the empirical utility maximization appeared to deviate considerably from those of the existing methods, and even more so from that of phi. Furthermore, the empirical utility maximization had a substantially better empirical estimate of specificity at 95% sensitivity. However, except for phi, all these methods had their combinations trained on the same data set and thus their empirical performance estimates cannot be taken as unbiased for predictive performances. In particular, the empirical performance estimate of the empirical utility maximization method tends to over-estimate, and possibly so does that of the kernel smoothing-based method. For other combination methods, which target different metrics, the empirical performances do not necessarily over-estimate and could actually under-estimate. As an attempt to address such biases, three-fold cross-validation was performed. This specific fold choice was driven by the fact that the validation subset could not be made too small due to the need for threshold estimation. With the cross-validation estimates, not surprisingly, the edge of the combination from the empirical utility maximization shrank considerably. Yet the improvement over phi still appeared clinically meaningful, although it became only marginal over some of the other combinations. Nevertheless, the cross-validation results correspond to learning from a considerably smaller sample size. Thus, the difference in the cross-validation between the proposed method and those targeting different metrics might be conservative, since the former would approach the ideal performance as sample size increases. An independent validation study should provide a more definitive assessment.
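The cross-validation scheme can be sketched as follows. Several details are assumptions rather than statements from the paper: folds are drawn separately for cases and controls, the threshold is estimated within each validation fold, the three fold estimates are averaged within a split, and the median is taken over random splits. The combiner `mean_difference` is a hypothetical stand-in, not one of the methods compared in Table 2.

```python
import math
import numpy as np

def spec_at_sens(score1, score0, rho=0.95):
    """Empirical specificity at sensitivity >= rho; positive when score >= threshold."""
    s1 = np.sort(score1)[::-1]                       # case scores, descending
    thr = s1[math.ceil(rho * len(s1)) - 1]           # largest threshold with sensitivity >= rho
    return np.mean(score0 < thr)

def cv_performance(M1, M0, fit, n_splits=100, n_folds=3, rho=0.95, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_splits):
        f1 = rng.permutation(len(M1)) % n_folds      # random, balanced fold labels for cases
        f0 = rng.permutation(len(M0)) % n_folds      # and for controls
        fold_est = []
        for j in range(n_folds):
            b = fit(M1[f1 != j], M0[f0 != j])        # train on the other folds
            fold_est.append(spec_at_sens(M1[f1 == j] @ b, M0[f0 == j] @ b, rho))
        estimates.append(np.mean(fold_est))
    return float(np.median(estimates))

def mean_difference(M1, M0):
    """Hypothetical stand-in combiner: difference in means, scaled to unit l1 norm."""
    b = M1.mean(axis=0) - M0.mean(axis=0)
    return b / np.abs(b).sum()

rng = np.random.default_rng(4)
M1 = rng.normal(loc=5.0, size=(60, 3))               # well-separated synthetic cases
M0 = rng.normal(size=(60, 3))                        # synthetic controls
cv_est = cv_performance(M1, M0, mean_difference, n_splits=20)
print(cv_est)
```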

Table 2.

Analysis results of the prostate cancer study

	coefficient			performance
	log(tPSA)	log(fPSA)	log(p2PSA)	empirical	CV
phi	0.200	−0.400	0.400	0.246	–
LDA	0.275	−0.441	0.283	0.277	0.296
LR	0.266	−0.434	0.301	0.293	0.285
MDR	0.257	−0.425	0.319	0.268	0.285
SVC	0.270	−0.432	0.299	0.293	0.290
AUC	0.257	−0.439	0.304	0.251	0.277
MSE	0.272	−0.424	0.305	0.288	0.276
KS	0.350	−0.429	0.221	0.299	0.282
EUM	0.427	−0.445	0.128	0.369	0.297


All combination coefficients are scaled to have unity ℓ₁ norm. performance: specificity at 95% sensitivity; CV: median of three-fold cross-validation estimates from 100 random splits.

6. Discussion.

We have developed a linear biomarker combination method that empirically maximizes clinical utility of the intended medical test. The estimated combination coefficient and predictive performance have been rigorously investigated with their limiting distributions established. The proposed empirical utility maximization is shown to be more robust in comparison with several common linear combination methods with respect to our performance metric of interest. Nevertheless, several topics warrant further investigation.

First of all, predictive performance estimation with the training data is of great value, for identification and selection of promising combinations to be validated in future studies. Unfortunately, as discussed in Section 5.2, the apparent empirical estimate tends to over-estimate whereas standard cross-validation might be conservative. The asymptotic theory may need further development to guide this pursuit of more reliable estimation.

Second, optimal biomarker combination without restriction to a specific class is the ultimate goal. Indeed, linear combination may not be effective with, for example, heterogeneous diseases, where various biomarkers are discriminative for certain subtypes but not others. As indicated in Section 1, the proposed linear combination method may accommodate nonlinear combinations of biomarkers via the linear basis expansion technique. However, the number of basis functions under consideration can be large or even infinite. Selection and regularization methods are thus needed; see Hastie, Tibshirani and Friedman (2009, section 5). They are under development.

Finally, a generalization to high-dimensional biomarkers is also being explored. High throughput technology is becoming increasingly available. Combination of such biomarkers also calls for selection and regularization methods. Further efficiency improvement in computation might be critical as well.

Acknowledgments.

The authors thank the reviewers for their helpful comments and suggestions, in particular the Associate Editor for pointing out several mistakes in previous versions of the paper, and Dattatraya H. Patil for assistance in arranging the prostate cancer dataset analyzed in Section 5.2.

Funding.

The authors were supported in part by NIH Grants R01 CA230268, U01 CA113913, and P30 AI050409.

APPENDIX A: PROOFS OF RESULTS IN SECTION 3.1

Proof of Theorem 3.1. We first show the existence of a maximizer β of ϕ(b). Consider a fixed b and an arbitrary b_* such that ∥b∥₁ = ∥b_*∥₁ = 1. For any ε > 0, Condition 2 implies

Pr {b^{⊤} M_{0} \leq τ (b) + ε / 2} > ρ \geq Pr {b_{*}^{⊤} M_{0} < τ (b_{*})} .

Thus, there exists a constant c₁ > 0, independent of b_*, such that Pr(∥M₀∥_∞ > c₁) is sufficiently small to satisfy

Pr {b^{⊤} M_{0} \leq τ (b) + ε / 2} \geq Pr {b_{*}^{⊤} M_{0} < τ (b_{*})} + Pr ({‖ M_{0} ‖}_{\infty} > c_{1}) .

(16)

When ∥M₀∥_∞ ≤ c₁, we can have |(b − b_*)^⊤M₀| ≤ ε/2 so long as b_* is sufficiently close to b. With such a b_*,

Pr \{b_{*}^{⊤} M_{0} < τ (b_{*})\} = Pr \{b^{⊤} M_{0} < τ (b_{*}) + (b - b_{*})^{⊤} M_{0}\}
\geq Pr \{b^{⊤} M_{0} < τ (b_{*}) + (b - b_{*})^{⊤} M_{0}, ‖ M_{0} ‖_{\infty} \leq c_{1}\}
\geq Pr \{b^{⊤} M_{0} < τ (b_{*}) - ε / 2, ‖ M_{0} ‖_{\infty} \leq c_{1}\}
\geq Pr \{b^{⊤} M_{0} < τ (b_{*}) - ε / 2\} - Pr (‖ M_{0} ‖_{\infty} > c_{1}) .

(17)

Combining (16) and (17) gives

Pr {b^{⊤} M_{0} \leq τ (b) + ε / 2} \geq Pr {b^{⊤} M_{0} < τ (b_{*}) - ε / 2}

and subsequently τ(b)−τ(b_*) ≥ −ε. On the other hand, the same arguments lead to

Pr {b^{⊤} M_{0} \leq τ (b) - ε / 2} < ρ \leq Pr {b_{*}^{⊤} M_{0} \leq τ (b_{*})}

and subsequently, for b_* sufficiently close to b, τ(b) − τ(b_*) ≤ ε. Therefore, τ(b) is continuous. With this and Condition 3, so is ϕ(b) by similar arguments. The existence result then follows from the compactness of $\{b : ‖ b ‖_{1} = 1, b \in ℝ^{k}\}$ , by the extreme value theorem (e.g., Rudin, 1976, Theorem 4.16).

Since the class of functions $\{I (b^{⊤} M_{d} \leq t) : b \in ℝ^{k}, t \in ℝ\}$ is Donsker (e.g., Kosorok, 2008, lemma 9.12) and thus Glivenko–Cantelli,

sup_{b, t} | {\hat{F}}_{d} (t; b) - F_{d} (t; b) | = o (1),

almost surely. Then, extending Theorem 2.3.1 of Serfling (1980) under Condition 2, we obtain

sup_{b} | \hat{τ} (b) - τ (b) | = o (1),

almost surely. Meanwhile, Condition 3 in conjunction with the continuity of τ(b) implies that F₁(t; b) is continuous at t = τ(b) uniformly in b such that ∥b∥₁ = 1, by similar arguments given earlier, for the continuity of τ(b), in a proof by contradiction. Subsequently,

sup_{b} | \hat{ϕ} (b) - ϕ (b) | = o (1),

almost surely. Consequently, $\hat{ϕ} (\hat{β}) - ϕ (β) = o (1)$ and furthermore $ϕ (\hat{β}) - ϕ (β) = o (1)$ , almost surely. Then, it follows that $\hat{ϕ} (\tilde{β}) - ϕ (β) = o (1)$ and $ϕ (\tilde{β}) - ϕ (β) = o (1)$ , almost surely.

Finally, under Condition 4, standard arguments can be used to establish the strong convergence of $\tilde{β}$ to β in light of the compact parameter space, continuity of ϕ(b), and uniform strong convergence of $\hat{ϕ} (b)$ as established above. □

APPENDIX B: PROOFS OF RESULTS IN SECTIONS 3.2 AND 3.3

As shown in the proof of Theorem 3.1, τ(h) is continuous at η. Therefore, we can make the ε in Lemma 3.2, Lemma 3.3, and Theorem 3.4 sufficiently small so as to restrict h to a neighborhood of η, as we do implicitly, such that {h^⊤, τ(h)}^⊤ is in the neighborhoods of {η^⊤, τ(η)}^⊤ implicated in Conditions 6 and 7.

Write F_d,1|−1(t) as the conditional distribution of M_d,1 given M_d,−1, and f_d,1|−1(t) as its density if it exists. For d = 0 under Condition 6 or 7 and for d = 1 under Condition 7, f_d,1|−1(t − h^⊤M_d,−1) exists at t = τ(h). Accordingly, F_d(t; h) = EF_d,1|−1(t − h^⊤M_d,−1) has a density

f_{d} (t; h) = E f_{d, 1 ∣ - 1} (t - h^{⊤} M_{d, - 1})

at t = τ(h), which is bounded away from 0 for d = 0 by part (ii) of Condition 6. Thus, F₀{τ(h); h} = ρ. Taking derivatives on both sides yields the gradient

\nabla τ (h) = \frac{E [f_{0, 1 ∣ - 1} {τ (h) - h^{⊤} M_{0, - 1}} M_{0, - 1}]}{f_{0} {τ (h); h}},

(18)

which is bounded under Conditions 5 and 6.
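The step from F₀{τ(h); h} = ρ to the gradient (18) can be spelled out by implicit differentiation. The following is a sketch that assumes differentiation and expectation may be interchanged, which is taken here to be covered by the stated conditions.

```latex
% Differentiate the identity F_0\{\tau(h); h\} = \rho with respect to h.
\begin{align*}
F_0(t; h) &= E\, F_{0,1\mid -1}(t - h^{\top} M_{0,-1}), \\
\frac{\partial}{\partial t} F_0(t; h) &= f_0(t; h), \qquad
\nabla_h F_0(t; h) = -E\big\{ f_{0,1\mid -1}(t - h^{\top} M_{0,-1})\, M_{0,-1} \big\}, \\
0 = \nabla_h \big[ F_0\{\tau(h); h\} \big]
  &= f_0\{\tau(h); h\}\, \nabla\tau(h)
   - E\big[ f_{0,1\mid -1}\{\tau(h) - h^{\top} M_{0,-1}\}\, M_{0,-1} \big].
\end{align*}
```

Solving the last line for ∇τ(h), which is possible because f₀{τ(h); h} is bounded away from 0, gives (18).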

Furthermore, under Condition 7 and for d = 1, 0, the derivative $f_{d, 1 ∣ - 1}^{'}$ of f_d,1|−1 exists and F_d(t; h) has a second derivative:

f_{d}^{'} (t; h) = E f_{d, 1 ∣ - 1}^{'} (t - h^{⊤} M_{d, - 1})

at t = τ(h). Subsequently, f_d{τ(h); h} has a gradient

\nabla f_{d} {τ (h); h} = f_{d}^{'} {τ (h); h} \nabla τ (h) - E [f_{d, 1 ∣ - 1}^{'} {τ (h) - h^{⊤} M_{d, - 1}} M_{d, - 1}],

(19)

which is also bounded under Conditions 5 and 7.

In the following proofs of Lemmas 3.2 and 3.3, but not elsewhere, we restrict to the special case of nonnegative M_d,−1, d = 1, 0; the result for the general case follows subsequently by the argument in Remark 4 via the biomarker splitting technique.

Proof of Lemma 3.2. By part (ii) of Condition 6, f₀{τ(h); h} is bounded away from 0. Set finite constant $c_{2} = 2 {(k - 1)}^{1 / 2} {sup}_{‖ h - η ‖_{\infty} \leq ε} f_{0} {τ (h); h}^{- 1}$ . Following the proof of Serfling (1980, lemma 2.5.4.B), one obtains that, for each fixed h such that ∥h−η∥_∞ ≤ ε,

Pr {| \hat{τ} (h) - τ (h) | > c_{2} n_{0}^{- 1 / 2} {(log n_{0})}^{1 / 2}} \leq 2 n_{0}^{- 2 (k - 1)}

(20)

for sufficiently large n₀. On the other hand, both τ(h) and $\hat{τ} (h)$ are non-decreasing in each component of h as all components of M_0,−1 are nonnegative. We shall exploit this monotonicity property in the extension of the pointwise result (20). Write the floor function as ⌊·⌋. Impose an equally-spaced grid on each component of h with mesh size $ε / ⌊ n_{0}^{1 / 2} ⌋$ , centered at the corresponding component of η. Thus, each h such that ∥h − η∥_∞ ≤ ε can be bracketed by h₋ and h₊ in the sense that, componentwise, h₋ ≤ h ≤ h₊ with h₋ and h₊ being the adjacent grid points. Then,

\hat{τ} (h_{-}) - τ (h_{+}) \leq \hat{τ} (h) - τ (h) \leq \hat{τ} (h_{+}) - τ (h_{-}) .

Therefore, given bounded ∇τ(h) as in (18),

sup_{‖ h - η ‖_{\infty} \leq ε} | \hat{τ} (h) - τ (h) | \leq sup_{h_{1}} | \hat{τ} (h_{1}) - τ (h_{1}) | + sup_{h_{-}, h_{+}} {τ (h_{+}) - τ (h_{-})} = sup_{h_{1}} | \hat{τ} (h_{1}) - τ (h_{1}) | + O (n^{- 1 / 2}),

where h₁ is a grid point componentwise, taking up to ${(2 ⌊ n_{0}^{1 / 2} ⌋ + 1)}^{k - 1}$ different values. For sufficiently large n₀,

Pr {sup_{h_{1}} | \hat{τ} (h_{1}) - τ (h_{1}) | > c_{2} n_{0}^{- 1 / 2} {(log n_{0})}^{1 / 2}} \leq 2 n_{0}^{- 2 (k - 1)} {(2 ⌊ n_{0}^{1 / 2} ⌋ + 1)}^{k - 1},

following (20). By the Borel–Cantelli lemma,

sup_{h_{1}} | \hat{τ} (h_{1}) - τ (h_{1}) | = O {n^{- 1 / 2} {(log n)}^{1 / 2}},

almost surely. So is ${sup}_{‖ h - η ‖_{\infty} \leq ε} | \hat{τ} (h) - τ (h) |$ subsequently. □
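The Borel–Cantelli step relies on the probability bound over the grid being summable in n₀. A short worked check, assuming k ≥ 2 so that h has at least one component:

```latex
% Summability of the grid-wise probability bound, for k >= 2 and n_0 >= 1.
\begin{align*}
2 n_0^{-2(k-1)} \big( 2 \lfloor n_0^{1/2} \rfloor + 1 \big)^{k-1}
  &\leq 2 n_0^{-2(k-1)} \big( 3 n_0^{1/2} \big)^{k-1}
   = 2 \cdot 3^{k-1}\, n_0^{-3(k-1)/2}, \\
\sum_{n_0 \geq 1} n_0^{-3(k-1)/2} &< \infty
  \quad \text{since } 3(k-1)/2 \geq 3/2 > 1 .
\end{align*}
```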

Proof of Lemma 3.3. Consider h_i, i = 1, 2, and u such that ∥h_i − η∥_∞ ≤ ε, ∥h₁ − h₂∥_∞ = O(n^−3/4), and $| u | \leq c_{0} n_{d}^{- 1 / 2} {(log n_{d})}^{1 / 2}$ . Write h_∧ = min(h₁, h₂) and h_∨ = max(h₁, h₂), where the minimization and maximization apply componentwise. Compute the variance,

var [I \{M_{d, 1} + h_{2}^{⊤} M_{d, - 1} \leq τ (h_{1}) + u\} - I \{M_{d, 1} + h_{1}^{⊤} M_{d, - 1} \leq τ (h_{2})\}]
\leq E | I \{M_{d, 1} + h_{2}^{⊤} M_{d, - 1} \leq τ (h_{1}) + u\} - I \{M_{d, 1} + h_{1}^{⊤} M_{d, - 1} \leq τ (h_{2})\} |
\leq E | I \{M_{d, 1} + h_{2}^{⊤} M_{d, - 1} \leq τ (h_{2})\} - I \{M_{d, 1} + h_{1}^{⊤} M_{d, - 1} \leq τ (h_{2})\} | + E | I \{M_{d, 1} + h_{2}^{⊤} M_{d, - 1} \leq τ (h_{1}) + u\} - I \{M_{d, 1} + h_{2}^{⊤} M_{d, - 1} \leq τ (h_{2})\} |
\leq (h_{\lor} - h_{\land})^{⊤} E \int_{0}^{1} M_{d, - 1} f_{d, 1 ∣ - 1} [τ (h_{2}) - \{h_{\land} + r (h_{\lor} - h_{\land})\}^{⊤} M_{d, - 1}] d r + f_{d} (t^{*}; h_{2}) | τ (h_{1}) + u - τ (h_{2}) |,

(21)

for some t* in the line segment between τ(h₁) + u and τ(h₂). Thus, it is clear that this variance is bounded by $c_{3} n_{d}^{- 1 / 2} {(log n_{d})}^{1 / 2}$ for some constant c₃ > 0 with large n_d, in light of bounded density f_d(t; h) and conditional density f_d,1|−1(t − h^⊤M_d,−1) around τ(h), bounded gradient of τ(h), and integrability of M_d,−1 by Condition 5. Then, by Bernstein’s inequality (Serfling, 1980, lemma 2.5.4.A),

Pr [| \hat{F}_{d} \{τ (h_{1}) + u; h_{2}\} - \hat{F}_{d} \{τ (h_{2}); h_{1}\} - F_{d} \{τ (h_{1}) + u; h_{2}\} + F_{d} \{τ (h_{2}); h_{1}\} | \geq 3 c_{3}^{1 / 2} k^{1 / 2} n_{d}^{- 3 / 4} (log n_{d})^{3 / 4}]
\leq 2 exp \{\frac{- 9 c_{3} k n_{d}^{- 1 / 2} (log n_{d})^{3 / 2}}{2 c_{3} n_{d}^{- 1 / 2} (log n_{d})^{1 / 2} + 4 c_{3}^{1 / 2} k^{1 / 2} n_{d}^{- 3 / 4} (log n_{d})^{3 / 4}}\}
\leq 2 n_{d}^{- 4 k},

(22)

with sufficiently large n_d.
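The last inequality in (22) follows because the second term in the denominator of the exponent is of smaller order than the first. A short worked version, with ε_n, v_n, and r_n introduced here only as shorthand:

```latex
\begin{align*}
\varepsilon_n &= 3 c_3^{1/2} k^{1/2} n_d^{-3/4} (\log n_d)^{3/4}, \qquad
v_n = c_3 n_d^{-1/2} (\log n_d)^{1/2}, \\
\frac{n_d\, \varepsilon_n^2}{2 v_n + \tfrac{4}{3} \varepsilon_n}
  &= \frac{9 c_3 k\, n_d^{-1/2} (\log n_d)^{3/2}}
          {2 c_3 n_d^{-1/2} (\log n_d)^{1/2}\, (1 + r_n)}
   = \frac{9 k \log n_d}{2 (1 + r_n)}, \\
r_n &= 2 c_3^{-1/2} k^{1/2} n_d^{-1/4} (\log n_d)^{1/4} \to 0, \\
\frac{9 k \log n_d}{2 (1 + r_n)} &\geq 4 k \log n_d
  \quad \text{once } r_n \leq 1/8 .
\end{align*}
```

Hence the exponential bound is at most 2 n_d^{−4k} for all sufficiently large n_d.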

Now, with all components of M_d,−1 being nonnegative, both F_d{τ(h₁) + u; h₂} and ${\hat{F}}_{d} {τ (h_{1}) + u; h_{2}}$ are non-decreasing in each component of h₁ and u, and are non-increasing in each component of h₂. Impose a grid on each component of h to bracket h by h₋ and h₊ in the same fashion as that in the proof of Lemma 3.2, however, with a different mesh size $ε / ⌊ n_{d}^{3 / 4} ⌋$ . Similarly, impose an equally-spaced grid on u centered at 0 with mesh size $c_{0} n_{d}^{- 1 / 2} {(log n_{d})}^{1 / 2} / ⌊ n_{d}^{1 / 4} {(log n_{d})}^{1 / 2} ⌋$ , to bracket u by the adjacent points u₋ and u₊ on the grid. Therefore,

\hat{F}_{d} \{τ (h_{-}) + u_{-}; h_{+}\} - \hat{F}_{d} \{τ (h_{+}); h_{-}\} - F_{d} \{τ (h_{+}) + u_{+}; h_{-}\} + F_{d} \{τ (h_{-}); h_{+}\}
\leq \hat{F}_{d} \{τ (h) + u; h\} - \hat{F}_{d} \{τ (h); h\} - F_{d} \{τ (h) + u; h\} + F_{d} \{τ (h); h\}
\leq \hat{F}_{d} \{τ (h_{+}) + u_{+}; h_{-}\} - \hat{F}_{d} \{τ (h_{-}); h_{+}\} - F_{d} \{τ (h_{-}) + u_{-}; h_{+}\} + F_{d} \{τ (h_{+}); h_{-}\} .

Let {h₁, h₂} be either {h₋, h₊} or {h₊, h₋}, and {u₁, u₂} be either {u₋, u₊} or {u₊, u₋}. Then,

sup_{| u | \leq c_{0} n_{d}^{- 1 / 2} (log n_{d})^{1 / 2}, ‖ h - η ‖_{\infty} \leq ε} | \hat{F}_{d} \{τ (h) + u; h\} - \hat{F}_{d} \{τ (h); h\} - F_{d} \{τ (h) + u; h\} + F_{d} \{τ (h); h\} |
\leq max_{\{h_{1}, h_{2}\}, u_{1}} | \hat{F}_{d} \{τ (h_{1}) + u_{1}; h_{2}\} - \hat{F}_{d} \{τ (h_{2}); h_{1}\} - F_{d} \{τ (h_{1}) + u_{1}; h_{2}\} + F_{d} \{τ (h_{2}); h_{1}\} |
+ max_{\{h_{1}, h_{2}\}, \{u_{1}, u_{2}\}} | F_{d} \{τ (h_{1}) + u_{1}; h_{2}\} - F_{d} \{τ (h_{2}); h_{1}\} - F_{d} \{τ (h_{2}) + u_{2}; h_{1}\} + F_{d} \{τ (h_{1}); h_{2}\} |

where {h₁, h₂} and u₁ take up to $2 {(2 ⌊ n_{d}^{3 / 4} ⌋)}^{k - 1}$ and $2 ⌊ n_{d}^{1 / 4} {(log n_{d})}^{1 / 2} ⌋ + 1$ different values, respectively. Given (22), the probability that the first maximum above exceeds $3 c_{3}^{1 / 2} k^{1 / 2} n_{d}^{- 3 / 4} {(log n_{d})}^{3 / 4}$ is no larger than $4 n_{d}^{- 4 k} {(2 ⌊ n_{d}^{3 / 4} ⌋)}^{2 (k - 1)} {2 ⌊ n_{d}^{1 / 4} {(log n_{d})}^{1 / 2} ⌋ + 1}$ for sufficiently large n_d. Then, the first maximum is O{n^−3/4(log n)^3/4} almost surely by the Borel–Cantelli lemma. On the other hand, the second maximum is O(n^−3/4) by arguments similar to those for (21). Together, they lead to the assertion. □

Proof of Theorem 3.4. By Lemmas 3.2 and 3.3, uniformly in {h : ∥h − η∥_∞ ≤ ε},

{\hat{F}}_{d} {\hat{τ} (h); h} - {\hat{F}}_{d} {τ (h); h} = F_{d} {\hat{τ} (h); h} - F_{d} {τ (h); h} + O {n^{- 3 / 4} {(log n)}^{3 / 4}},

(23)

almost surely. Meanwhile, under Condition 7, F_d(t; h) has a bounded second partial derivative with respect to t around τ(h). Then, a Taylor expansion along with Lemma 3.2 gives that, uniformly in {h : ∥h−η∥_∞ ≤ ε},

F_{d} {\hat{τ} (h); h} - F_{d} {τ (h); h} = f_{d} {τ (h); h} {\hat{τ} (h) - τ (h)} + O (n^{- 1} log n),

(24)

almost surely.

Almost surely at most k independent observations of M₀ may simultaneously satisfy M_0,1 + h^⊤M_0,−1 = t for (h^⊤, t)^⊤ in a neighborhood of {η^⊤, τ(η)}^⊤ where the conditional density f_0,1|−1(t − h^⊤M_0,−1) is bounded under part (i) of Condition 6. Then, with the consistency of $\hat{τ} (h)$ , given by Lemma 3.2, and the continuity of τ(h), uniformly in $\{h : ‖ h - η ‖_{\infty} \leq ε\}$ ,

| {\hat{F}}_{0} {\hat{τ} (h); h} - ρ | \leq k / n_{0},

(25)

almost surely. Equation (7) then follows from equations (23), (24), and (25), as f₀{τ(h); h} is bounded away from 0 under part (ii) of Condition 6.

Equations (23), (24), and (7) give rise to

\hat{ϕ} (h) = 1 - {\hat{F}}_{1} {τ (h); h} + f_{1} {τ (h); h} f_{0} {τ (h); h}^{- 1} [{\hat{F}}_{0} {τ (h); h} - ρ] + O {n^{- 3 / 4} {(log n)}^{3 / 4}},

almost surely. Following equation (7) and Lemma 3.2, uniformly in {h : ∥h−η∥_∞ ≤ ε},

{\hat{F}}_{0} {τ (h); h} - ρ = O {n^{- 1 / 2} {(log n)}^{1 / 2}},

almost surely. Meanwhile, since the gradient ∇f_d{τ(h); h} given in (19) is bounded, f₁{τ(h); h}f₀{τ(h); h}⁻¹ has a bounded gradient at η. Thus, equation (8) follows. □
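The approximation used in this proof can be checked numerically in a toy setting. The following is a minimal sketch, not the authors' implementation: it assumes k = 2 biomarkers that are independent standard-variance normals in each group, so that τ(h) and f_d{τ(h); h} are available in closed form, and it takes $\hat{τ} (h)$ to be the smallest order statistic at which the empirical distribution function of the non-diseased scores reaches ρ, a definition that is assumed here rather than quoted from this section. All function and parameter names (`ecdf`, `empirical_quantile`, `simulate`, `mu`) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def ecdf(sample, t):
    """Empirical distribution function of `sample` evaluated at t."""
    return sum(x <= t for x in sample) / len(sample)


def empirical_quantile(sample, rho):
    """Smallest order statistic whose empirical CDF is at least rho.

    This is an assumed form of tau-hat(h), used for illustration only.
    """
    ordered = sorted(sample)
    idx = max(math.ceil(rho * len(ordered)) - 1, 0)
    return ordered[idx]


def simulate(n=20000, h=0.5, rho=0.9, mu=(1.0, 0.5), seed=1):
    """Compare phi-hat(h) with its linear approximation in a normal toy model."""
    rng = random.Random(seed)
    # Combined scores M_{d,1} + h * M_{d,2}: d = 0 non-diseased, d = 1 diseased.
    s0 = [rng.gauss(0.0, 1.0) + h * rng.gauss(0.0, 1.0) for _ in range(n)]
    s1 = [rng.gauss(mu[0], 1.0) + h * rng.gauss(mu[1], 1.0) for _ in range(n)]

    # Under the assumed model the combined score is normal in each group.
    sd = math.sqrt(1.0 + h * h)
    dist0 = NormalDist(0.0, sd)
    dist1 = NormalDist(mu[0] + h * mu[1], sd)

    tau = dist0.inv_cdf(rho)               # population threshold tau(h)
    tau_hat = empirical_quantile(s0, rho)  # empirical threshold

    # Empirical sensitivity at the empirical threshold.
    phi_hat = 1.0 - ecdf(s1, tau_hat)
    # Approximation: 1 - F1-hat{tau} + (f1/f0) * [F0-hat{tau} - rho].
    ratio = dist1.pdf(tau) / dist0.pdf(tau)
    phi_bar = 1.0 - ecdf(s1, tau) + ratio * (ecdf(s0, tau) - rho)
    # Deviation of the empirical specificity from rho, as in equation (25).
    spec_gap = abs(ecdf(s0, tau_hat) - rho)
    return phi_hat, phi_bar, spec_gap


if __name__ == "__main__":
    phi_hat, phi_bar, spec_gap = simulate()
    print(f"phi_hat = {phi_hat:.5f}, phi_bar = {phi_bar:.5f}, "
          f"difference = {phi_hat - phi_bar:.2e}, spec_gap = {spec_gap:.2e}")
```

For a single fixed h with continuous data, the specificity gap is at most 1/n₀; the bound k/n₀ in (25) is what allows the statement to hold uniformly over h. The difference between the two sensitivity quantities should be small relative to n^−1/2, in line with the remainder order stated in the proof, although one simulated data set cannot confirm a rate.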

Proof of Theorem 3.5. Our proof follows the general framework of Kim and Pollard (1990), although their main theorem does not apply to our problem. Rather than working with $\hat{ϕ} (h)$ directly, we tackle the problem through its approximation $\bar{ϕ} (h)$ , which linearly combines two independent random components, ${\hat{F}}_{d} {τ (h); h}$ , d = 1, 0. Write ${\hat{F}}_{d} {τ (h); h} - {\hat{F}}_{d} {τ (η); η} = \hat{E} g (M_{d}; h)$ , where

g (m; h) = I {m_{1} + h^{⊤} m_{- 1} \leq τ (h)} - I {m_{1} + η^{⊤} m_{- 1} \leq τ (η)}

with $m \equiv {(m_{1}, m_{- 1}^{⊤})}^{⊤}$ . Consider the class of such functions,

G_{ε} = {g (\cdot; h) : ‖ h - η ‖_{\infty} \leq ε},

with envelope

G_{ε} (m) = sup_{‖ h - η ‖_{\infty} \leq ε} | g (m; h) | = I {min_{‖ h - η ‖_{\infty} \leq ε} τ (h) - h^{⊤} m_{- 1} < m_{1} \leq max_{‖ h - η ‖_{\infty} \leq ε} τ (h) - h^{⊤} m_{- 1}} .

Since the subgraphs of functions in $G_{\infty}$ form a VC class and G_ε(m) is bounded, $G_{ε}$ is uniformly manageable (Kim and Pollard, 1990, section 3). When M_d,−1 is nonnegative, EG_ε(M_d)² = O(ε) as ε ↓ 0 for both d = 1, 0, by arguments similar to those for (21). This result then holds generally, i.e., when M_d,−1 is not necessarily nonnegative, by the biomarker splitting technique described in Remark 4. The same approach can be used to establish E|g(M_d; h₁) − g(M_d; h₂)| = O(∥h₁ − h₂∥_∞) for h₁ and h₂ near η. Furthermore, since G_ε(m) ≤ 1, E[G_ε(M_d)²I{G_ε(M_d) > 1}] = 0. With these properties of $G_{ε}$ , we can now utilize the results in Kim and Pollard (1990).

We first establish $\tilde{η} - η = O_{p} (n^{- 1 / 3})$ . By lemma 4.1 of Kim and Pollard (1990), there exists ε > 0 such that, for ∥h − η∥ ≤ ε, with each δ > 0

| \hat{E} g (M_{d}; h) - E g (M_{d}; h) | \leq δ ‖ h - η ‖_{\infty}^{2} + O_{p} (n^{- 2 / 3}), d = 1, 0.

Given that the Hessian matrix of −Eg(M₁; h) at η is negative definite, we can choose δ such that $E g (M_{1}; h) \geq (2 + λ) δ ‖ h - η ‖_{\infty}^{2}$ in this neighborhood, provided that ε is sufficiently small. When $\tilde{η}$ is in this neighborhood,

\bar{ϕ} (\tilde{η}) - \bar{ϕ} (η) = - \hat{E} g (M_{1}; \tilde{η}) + λ \hat{E} g (M_{0}; \tilde{η}) \leq - δ ‖ \tilde{η} - η ‖_{\infty}^{2} + O_{p} (n^{- 2 / 3}) .

On the other hand,

\bar{ϕ} (\tilde{η}) - \bar{ϕ} (η) = {\hat{ϕ} (\tilde{η}) - \hat{ϕ} (\hat{η})} + {\hat{ϕ} (\hat{η}) - \hat{ϕ} (η)} - {\hat{ϕ} (\tilde{η}) - \bar{ϕ} (\tilde{η})} + {\hat{ϕ} (η) - \bar{ϕ} (η)} \geq ‖ \tilde{η} - η ‖_{\infty} o_{p} (n^{- 1 / 3}) + o_{p} (n^{- 2 / 3}),

where equation (8) or more specifically (9) has been used. Combining the two gives

δ ‖ \tilde{η} - η ‖_{\infty}^{2} \leq ‖ \tilde{η} - η ‖_{\infty} o_{p} (n^{- 1 / 3}) + O_{p} (n^{- 2 / 3}),

which shows the O_p(n^−1/3) convergence rate of $\tilde{η}$ .
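The step from the combined inequality to the rate can be spelled out as follows. This is a sketch under notation introduced only for this illustration: x_n = ∥η̃ − η∥_∞, A_n for the (absolute value of the) o_p(n^−1/3) factor, and B_n ≥ 0 for the O_p(n^−2/3) term. Solving the quadratic inequality in x_n and using (a + b)^{1/2} ≤ a^{1/2} + b^{1/2} for a, b ≥ 0 gives

```latex
\begin{aligned}
\delta x_n^2 \leq A_n x_n + B_n
\;\Longrightarrow\;
x_n &\leq \frac{A_n + (A_n^2 + 4 \delta B_n)^{1/2}}{2 \delta}
  \leq \frac{A_n}{\delta} + \Bigl(\frac{B_n}{\delta}\Bigr)^{1/2} \\
&= o_p(n^{-1/3}) + O_p(n^{-1/3}) = O_p(n^{-1/3}) .
\end{aligned}
```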

Now, we work with the rescaled process

n_{1}^{2 / 3} {\bar{ϕ} (η + n_{1}^{- 1 / 3} a) - \bar{ϕ} (η)} = - n_{1}^{2 / 3} \hat{E} g (M_{1}; η + n_{1}^{- 1 / 3} a) + λ n_{1}^{2 / 3} \hat{E} g (M_{0}; η + n_{1}^{- 1 / 3} a),

which is a linear combination of two independent processes. Weak convergence of each follows from the results of Kim and Pollard (1990, lemmas 4.5 and 4.6 and theorem 4.7). Compute

H = - {\nabla^{2} F_{1} {τ (h); h} |}_{h = η} = - E [f_{1, 1 ∣ - 1}^{'} {τ (η) - η^{⊤} M_{1, - 1}} {\nabla τ (η) - M_{1, - 1}}^{\otimes 2}] + λ E [f_{0, 1 ∣ - 1}^{'} {τ (η) - η^{⊤} M_{0, - 1}} {\nabla τ (η) - M_{0, - 1}}^{\otimes 2}],

(26)

where v^⊗2 ≡ vv^⊤. Let

V_{d} (a_{1}, a_{2}) = lim_{α \to \infty} α E {\prod_{i = 1}^{2} g (M_{d}; η + α^{- 1} a_{i})} = E (min_{i = 1, 2} | {\nabla τ (η) - M_{d, - 1}}^{⊤} a_{i} | f_{d, 1 ∣ - 1} {τ (η) - η^{⊤} M_{d, - 1}} \times I [\prod_{i = 1}^{2} {\nabla τ (η) - M_{d, - 1}}^{⊤} a_{i} > 0]),

and define

V (a_{1}, a_{2}) = V_{1} (a_{1}, a_{2}) + γ λ^{2} V_{0} (a_{1}, a_{2}) .

(27)

Then, $n_{1}^{2 / 3} {\bar{ϕ} (η + n_{1}^{- 1 / 3} a) - \bar{ϕ} (η)}$ converges weakly to the corresponding combination of the two limiting processes, which is a Gaussian process Z(a) with continuous sample paths, mean a^⊤Ha/2, and covariance kernel V. So does $n_{1}^{2 / 3} {\hat{ϕ} (η + n_{1}^{- 1 / 3} a) - \hat{ϕ} (η)}$ , following equation (8) or (9).

As in Kim and Pollard (1990, example 6.2), one can show that the variance of Z(a₁) − Z(a₂) is V(a₁ − a₂, a₁ − a₂). Thus, the Gaussian process Z has nondegenerate increments provided V(a, a) ≠ 0 for a ≠ 0. Following Kim and Pollard (1990, theorem 2.7), $n_{1}^{1 / 3} (\tilde{η} - η)$ then converges in distribution to argmax_a Z(a). Subsequently, the weak convergence of $ϕ (\tilde{η})$ follows with a Taylor expansion argument. □

APPENDIX C: PROOFS OF RESULTS IN SECTION 4

Proof of Corollary 4.1. First, consider σ = o(1). With inequality (14), the same arguments as in the proof of Theorem 3.1 lead to ${sup}_{b} | {\hat{ϕ}}_{s^{- 1} σ} (b) - ϕ (b) | = o (1)$ and subsequently ${sup}_{b} | {\hat{ϕ}}_{s^{- 1} σ} (b) - \hat{ϕ} (b) | = o (1)$ , almost surely. It then follows that $\hat{ϕ} ({\hat{β}}_{σ}) - \hat{ϕ} (\hat{β}) = o (1)$ almost surely.

Now, switch to the circumstance of the first biomarker being the anchor and take σ = o_p(n^−2/3). Since ${\hat{β}}_{1, σ}$ converges in probability to β₁ > 0, ${\hat{β}}_{1, σ} > 0$ and thus identity (13) holds with probability tending to 1. Inequality (14) leads to

{\hat{ϕ}}_{{(s | {\hat{β}}_{1, σ} |)}^{- 1} σ} (h) = 1 - {\hat{F}}_{1} {\hat{τ} (h) + o_{p} (n^{- 2 / 3}); h} = 1 - {\hat{F}}_{1} {τ (h); h} - F_{1} {\hat{τ} (h) + o_{p} (n^{- 2 / 3}); h} + F_{1} {τ (h); h} + O_{p} {n^{- 3 / 4} {(log n)}^{3 / 4}} = \hat{ϕ} (h) + o_{p} (n^{- 2 / 3}),

uniformly in {h : ∥h − η∥_∞ ≤ ε} for some ε > 0, where the second equality follows from Lemmas 3.2 and 3.3 and the last one from (23) after a Taylor expansion. Then, $\hat{ϕ} ({\hat{η}}_{σ}) - \hat{ϕ} (\hat{η}) = o_{p} (n^{- 2 / 3})$ is implied. □

REFERENCES

Bahadur RR (1966). A note on quantiles in large samples. Ann. Math. Statist 37 577–580.
Bertsimas D, King A and Mazumder R (2016). Best subset selection via a modern optimization lens. Ann. Statist 44 813–852.
Catalona WJ, Partin AW, Sanda MG, Wei JT, Klee GG, Bangma CH, Slawin M, Marks LS, Loeb S, Broyles DL, Shin SS, Cruz AB, Chan DW, Sokoll LJ, Roberts WL, van Schaik RH and Mizrahi IA (2011). A multicenter study of [−2]pro-prostate specific antigen combined with prostate specific antigen and free prostate specific antigen for prostate cancer detection in the 2.0 to 10.0 ng/ml prostate specific antigen range. J. Urol 185 1650–1655.
Catalona WJ, Partin AW, Slawin KM, Brawer MK, Flanigan RC, Patel A, Richie JP, DeKernion JB, Walsh PC, Scardino PT, Lange PH, Subong EN, Parson RE, Gasior GH, Loveland KG and Southwick PC (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: a prospective multicenter clinical trial. JAMA 279 1542–1547.
Chen B, Li P, Qin J and Yu T (2016). Using a monotonic density ratio model to find the asymptotically optimal combination of multiple diagnostic tests. J. Am. Statist. Assoc 111 861–874.
Chen X, Vexler A and Markatou M (2015). Empirical likelihood ratio confidence interval estimation of best linear combinations of biomarkers. Comput. Stat. Data Anal 82 186–198.
Eguchi S and Copas J (2002). A class of logistic-type discriminant functions. Biometrika 89 1–22.
Elliott G and Lieli RP (2013). Predicting binary outcomes. J. Econom 174 15–26.
Florios K and Skouras S (2008). Exact computation of max weighted score estimators. J. Econom 146 86–91.
Fong Y, Yin S and Huang Y (2016). Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve. Stat. Med 16 3792–3809.
Hastie T, Tibshirani R and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Kiefer J (1967). On Bahadur’s representation of sample quantiles. Ann. Math. Statist 38 1323–1342.
Kim J and Pollard D (1990). Cube root asymptotics. Ann. Statist 18 191–219.
Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
Lin H, Zhou L, Peng H and Zhou X-H (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Can. J. Stat 39 324–343.
Lipp T and Boyd S (2016). Variations and extension of the convex-concave procedure. Optim. Eng 17 263–287.
Ma S and Huang J (2007). Combining multiple markers for classification using ROC. Biometrics 63 751–757.
Manski CF (1975). Maximum score estimation of the stochastic utility model of choice. J. Econom 3 205–228.
Manski CF (1985). Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator. J. Econom 27 313–333.
McIntosh MW and Pepe MS (2002). Combining several screening tests: optimality of the risk score. Biometrics 58 657–664.
Meisner A, Carone M, Pepe MS and Kerr KF (2021). Combining biomarkers by maximizing the true positive rate for a fixed false positive rate. Biom. J 63 1223–1240.
Ou FS, Zeng D and Cai J (2016). Quantile regression models for current status data. J. Statist. Plann. Inference 178 112–127.
Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford.
Pepe MS, Cai T and Longton G (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62 221–229.
Pepe MS and Thompson ML (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1 123–140.
Rudin W (1976). Principles of Mathematical Analysis. McGraw Hill, New York.
Sanda MG, Feng Z, Howard DH, Tomlins SA, Sokoll LJ, Chan DW, Regan MM, Groskopf J, Chipman J, Patil DH, Salami SS, Scherr DS, Kagan J, Srivastava S, Thompson IM Jr, Siddiqui J, Fan J, Joon AY, Bantis LE, Rubin MA, Chinnayian AM, Wei JT; and the EDRN-PCA3 Study Group, Bidair M, Kibel A, Lin DW, Lotan Y, Partin A and Taneja S (2017). Association between combined TMPRSS2:ERG and PCA3 RNA urinary testing and detection of aggressive prostate cancer. JAMA Oncol 3 1085–1093.
Serfling RJ (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Vexler A, Liu A, Schisterman EF and Wu C (2006). Note on distribution-free estimation of maximum linear separation of two multivariate distributions. J. Nonparametr. Stat 18 145–158.
Yan Q, Bantis LE, Stanford JL and Feng Z (2018). Combining multiple biomarkers linearly to maximize the partial area under the ROC curve. Stat. Med 37 627–642.
Yuille AL and Rangarajan A (2003). The concave-convex procedure. Neural Comput 15 915–936.
