Abstract
Multiple biomarkers are often combined to improve disease diagnosis. The uniformly optimal combination, i.e., with respect to all reasonable performance metrics, unfortunately requires excessive distributional modeling, to which the estimation can be sensitive. An alternative strategy is rather to pursue local optimality with respect to a specific performance metric. Nevertheless, existing methods may not target clinical utility of the intended medical test, which usually needs to operate above a certain sensitivity or specificity level, or do not have their statistical properties well studied and understood. In this article, we develop and investigate a linear combination method to maximize the clinical utility empirically for such a constrained classification. The combination coefficient is shown to have cube root asymptotics. The convergence rate and limiting distribution of the predictive performance are subsequently established, exhibiting robustness of the method in comparison with others. An algorithm with sound statistical justification is devised for efficient and high-quality computation. Simulations corroborate the theoretical results, and demonstrate good statistical and computational performance. Illustration with a clinical study on aggressive prostate cancer detection is provided.
Keywords: Bahadur representation, Cube root asymptotics, Diagnostic test, Sensitivity, Specificity
1. Introduction.
In this era of precision medicine, many biomarkers have been identified for disease diagnosis, as well as for disease prognosis and prediction of therapeutic response. Since a single biomarker often has only limited diagnostic accuracy, combining multiple biomarkers holds the promise for improved discrimination. This is a classical binary classification problem, with a vast relevant literature in statistics, machine learning, and econometrics (cf. Hastie, Tibshirani and Friedman, 2009). Nevertheless, challenges abound.
Intuitively, the conditional probability of the disease status given the biomarkers, i.e., the posterior probability, is a uniformly optimal combination with respect to any reasonable performance metric, including classification error and expected cost. Since an optimal combination remains so under monotone transformation, interestingly such a combination may be determined even when the posterior probability cannot be identified from the observed data under, for example, a case-control design as often adopted in biomarker studies. Specifically, as follows from Bayes’ rule, the likelihood ratio is such a transformation and optimal; see McIntosh and Pepe (2002) and the references therein. These results lead to a common strategy via estimating the posterior probability or the biomarker distributions of cases and controls. Parametric modeling is often adopted, with popular methods including linear discriminant analysis, logistic regression, and probit regression. In particular, logistic regression is routinely employed to linearly combine biomarkers in practice. However, correct model specification can be difficult in, for example, cancer detection, where cancer biomarkers often have irregular distributions. Obviously, model misspecification could result in sub-optimality. Semiparametric methods, including the monotonic density ratio model of Chen et al. (2016), have been pursued, but they still may not be sufficiently general. On the other hand, nonparametric density estimation with multiple biomarkers is notoriously difficult.
Of course, for classification purposes, a locally optimal combination, i.e., with respect to a given performance metric of interest, is sufficient. Such a combination does not need to be uniformly optimal and thus can be easier to identify and estimate (cf. Elliott and Lieli, 2013), as pursued by many methods. They include machine learning techniques such as random forests, support vector machines, and neural networks. For instance, the support vector machine targets the hinge loss function. Another notable example involves the receiver operating characteristic (ROC) curve, with the area under the curve (AUC) as the performance metric. Pepe, Cai and Longton (2006) proposed a linear combination to maximize the empirical AUC; see also Ma and Huang (2007) and Lin et al. (2011), which maximized smoothed versions of the empirical AUC, as well as related work by Vexler et al. (2006), Chen, Vexler and Markatou (2015), and Fong, Yin and Huang (2016). However, none of these loss functions and performance metrics aligns well with the clinical utility of diagnostic tests in practice. As an exception, Elliott and Lieli (2013) incorporated a cost function into the classification optimization by extending the maximum score estimator of Manski (1975, 1985). Their empirical risk minimization accommodates differential costs associated with false positives and false negatives. Nevertheless, the required cost quantification can be extremely difficult in practice, posing a potential limitation.
Considerable efforts have been made on locally optimal combination with respect to clinical utility-relevant performance metrics that are more practical. Note that a medical test typically needs to operate above a certain sensitivity or specificity level. Take aggressive prostate cancer diagnosis as an example. Since a positive non-invasive test would be confirmed with biopsy, a false negative has far more serious clinical consequences than a false positive. Thus, a high sensitivity, say, 95%, needs to be maintained (e.g., Catalona et al., 1998; Sanda et al., 2017). Partial AUC around the high sensitivity level then is a sensible metric, and perhaps more so is specificity at the controlled sensitivity level; indeed the latter may be viewed as a limiting special case. Of course, a different clinical context might rather call for a high specificity, which then needs to be controlled instead. Interestingly, specificity at controlled sensitivity or vice versa is in line with power at controlled type I error in standard statistical hypothesis testing; incidentally, this explains the applicability of the Neyman–Pearson lemma to optimal biomarker combination as discussed in McIntosh and Pepe (2002) and Eguchi and Copas (2002). A number of methods have been developed for optimal linear combination with respect to these metrics, including Pepe and Thompson (2000) and Yan et al. (2018) using partial AUC, and Meisner et al. (2021) targeting sensitivity at a controlled specificity level. Unfortunately, statistical properties of these procedures are not well understood.
Notably, most of these methods targeting a given performance metric are restricted to the class of linear combinations. As such, they may only approach the optimality in the linear class, but not necessarily so in general. Nevertheless, linear combination is of interest in its own right for a few reasons. First, it provides a first-order approximation to the general combination. Second, reliable estimation in a more general class may not be feasible in many applications where data are limited. Finally, nonlinear combinations can be accommodated in a linear combination method through the general linear basis expansion technique as adopted in, e.g., support vector machines (Hastie, Tibshirani and Friedman, 2009); see further discussion in Section 6.
In this article, we investigate and develop an optimal linear combination estimation method aiming at sensitivity- or specificity-constrained classification by maximizing the empirical utility. For simplicity in expressions, we opt to consider sensitivity at a controlled specificity level in the methodology development. The same method also applies to specificity at a controlled sensitivity level as of interest in our motivating application, by transposing the roles of cases and controls. As natural as the empirical estimator appears, its asymptotic analysis and computation are challenging due to noncontinuity in both the objective function and the constraint of the optimization problem. Our approach is different from Meisner et al. (2021) which adopted kernel-smoothed estimates of sensitivity and specificity instead. Most importantly, the limiting distribution of our combination coefficients is established, to have non-standard cube root asymptotics. Subsequently, the convergence rate and limiting distribution of the predictive performance are obtained. Meanwhile, a novel computational algorithm with sound statistical justification is devised.
The rest of the paper is organized as follows. Section 2 will introduce the problem and the empirical utility maximization method, to set the stage. We present the asymptotic study in Section 3, and the computational algorithm in Section 4. Numerical studies are reported in Section 5. Final remarks are provided in Section 6. Technical details are deferred to the Appendices.
2. The problem and empirical utility maximization.
Write a k-vector biomarker under consideration as Md, for k ≥ 2 and with d = 1, 0 denoting case and control, respectively. With coefficient b, a linear combination is given by b⊤Md. Since sensitivity at controlled specificity is scale invariant to the combination, we limit attention to {b : ∥b∥1 = 1} as the linear combination class under consideration, where ∥·∥1 is the ℓ1 norm. Other norms for b can serve the same purpose, but our choice has computational advantages as will be seen later.
2.1. Optimal linear combination.
Adopt the convention that a positive diagnosis is associated with a larger combination value. With ρ ∈ (0, 1) as the given control level of specificity, the optimal combination coefficient β to maximize the sensitivity is the solution b of the following optimization problem:
| (1) |
where t corresponds to a test threshold.
Write Fd(t; b) = Pr(b⊤Md ≤ t) for the cumulative distribution function, d = 1, 0, and for the control quantile function. With a given b, the smallest threshold t with the specificity being at least ρ and the resulting constrained sensitivity are given by
respectively. Then, we have
| (2) |
as an identity for the optimal combination coefficient.
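For reference, the displays of this subsection can be written out from the definitions given in the surrounding text. The following is a reconstruction that is consistent with those definitions rather than a verbatim copy; in particular, the symbol F_0^{-1} for the control quantile function is an assumed notation.

```latex
\begin{align*}
&\max_{b,\,t}\ \Pr(b^\top M_1 > t)
  \quad\text{subject to}\quad
  \Pr(b^\top M_0 \le t) \ge \rho,\ \ \|b\|_1 = 1, \tag{1}\\
&F_0^{-1}(\rho; b) \equiv \inf\{t : F_0(t; b) \ge \rho\},\\
&\tau(b) \equiv F_0^{-1}(\rho; b), \qquad
  \phi(b) \equiv 1 - F_1\{\tau(b); b\},\\
&\beta = \operatorname*{arg\,max}_{b:\ \|b\|_1 = 1} \phi(b). \tag{2}
\end{align*}
```

Here the objective in (1) is the sensitivity and the first constraint is the specificity requirement, so that ϕ(b) is the sensitivity attained at the smallest admissible threshold τ(b).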
2.2. Empirical utility maximization.
Consider a case-control study, with nd independent replicates of Md: Md,[i], i = 1, …, nd, where d = 1, 0 represents the case and control samples, respectively. As a natural estimator of β, is the solution b of the following problem to optimize the empirical utility:
| (3) |
where I(·) is the indicator function and denotes the empirical average, over the case or control sample as appropriate; for example, . Let denote the empirical counterpart of Fd(t; b). The empirical estimators for the threshold and associated constrained sensitivity are then given by
respectively. Subsequently,
| (4) |
which will facilitate later analysis of the estimator.
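As an illustration of the empirical threshold and the associated constrained sensitivity for a fixed coefficient b, the following minimal sketch computes both from order statistics. The function names and the array layout (one row per subject, one column per biomarker) are hypothetical choices for this example and are not taken from the authors' software.

```python
import math
import numpy as np


def empirical_threshold(b, controls, rho):
    """Smallest t whose empirical specificity, mean(controls @ b <= t), is >= rho."""
    scores = np.sort(controls @ b)
    n0 = scores.size
    # The empirical distribution function reaches j / n0 at the j-th order
    # statistic, so the smallest admissible t is the ceil(n0 * rho)-th one.
    # The small tolerance guards against floating-point error in n0 * rho.
    j = max(math.ceil(n0 * rho - 1e-12), 1)
    return float(scores[j - 1])


def empirical_sensitivity(b, cases, controls, rho):
    """Empirical sensitivity of the rule 'b @ M > t' at the empirical threshold."""
    t = empirical_threshold(b, controls, rho)
    return float(np.mean(cases @ b > t))


if __name__ == "__main__":
    b = np.array([1.0, 0.0])  # unit l1 norm
    controls = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
    cases = np.array([[1.0, 0.0], [2.5, 0.0], [3.0, 0.0], [4.0, 0.0]])
    print(empirical_threshold(b, controls, 0.75))
    print(empirical_sensitivity(b, cases, controls, 0.75))
```

Maximizing `empirical_sensitivity` over b with unit ℓ1 norm is the empirical utility maximization problem; the sketch only evaluates the objective and does not attempt the optimization itself.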
2.3. A related problem.
Consider the circumstance that a biomarker is known a priori to have a non-zero coefficient with a specific sign in the optimal linear combination. Without loss of generality, suppose that the first biomarker is such a so-called anchor with a positive coefficient, which can always be achieved by biomarker reordering and sign-flipping. Write , , and . Then, instead of , one may work with a restricted class . With a slight abuse of notation, in this context, write Fd(t; h) ≡ Fd{t; (1, h⊤)⊤} and similarly for , τ(h), , ϕ(h), and . With β1 > 0 and scale invariance of ϕ(b) to b, we have the optimal combination coefficient in the new class,
Its empirical estimator is given by
| (5) |
or, equivalently, the solution h of
| (6) |
The optimization problem (6) is clearly a simplification, and such a restricted class has been commonly adopted for linear combinations in the literature, e.g., Pepe, Cai and Longton (2006). However, the requirement of prior knowledge imposes a restriction, and we do not take it as an equivalent problem in general, especially for computational purposes. Nevertheless, provided that the consistency of is established for β, this restricted problem may be taken as a surrogate for the purposes of weak convergence. Specifically, under the set-up of β1 > 0, and subsequently hold with probability tending to 1. In that asymptotic sense, these two estimated combinations differ by a scale and thus are equivalent in terms of performance. Therefore, for weak convergence, we can instead work with in the restricted problem, which is advantageous since the distribution of is concentrated on a (k − 1)-dimensional subspace of by definition.
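The correspondence between a unit-ℓ1-norm coefficient with a positive first component and its anchor-based counterpart can be sketched as a pair of rescalings. The helper names below are hypothetical and serve only to make the scale equivalence concrete.

```python
import numpy as np


def to_anchor(b):
    """Map b with b[0] > 0 to h such that (1, h) is a positive multiple of b."""
    b = np.asarray(b, dtype=float)
    if b[0] <= 0:
        raise ValueError("the anchor coefficient b[0] must be positive")
    return b[1:] / b[0]


def to_unit_l1(h):
    """Map an anchor-based coefficient h back to (1, h) rescaled to unit l1 norm."""
    full = np.concatenate(([1.0], np.asarray(h, dtype=float)))
    return full / np.abs(full).sum()


if __name__ == "__main__":
    b = np.array([0.5, -0.25, 0.25])
    h = to_anchor(b)
    print(h, to_unit_l1(h))
```

Because the two parametrizations differ only by a positive scale, they rank subjects identically and therefore yield the same constrained sensitivity.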
3. Asymptotic theory.
An asymptotic study for this non-standard estimation problem, via optimization (3), is conducted under the circumstance that the number of biomarkers k is held fixed, and the total sample size n ≡ n1 + n0 approaches ∞. Recall that the case and control samples are independent of each other, and each of them consists of independent and identically distributed observations. Consistency shall be tackled first. We then establish a uniform Bahadur representation for the estimated thresholds to obtain an approximation for the empirical constrained sensitivities, of nearly optimal combinations. On the basis of this approximation, weak convergence results are subsequently pursued.
3.1. Strong consistency.
Mild regularity conditions are imposed.
Condition 1 (Case and control sizes). The size ratio n1/n0 converges to a constant γ ∈ (0, ∞) as n → ∞.
Condition 2 (Threshold uniqueness). For any ε > 0, the threshold τ(b) satisfies .
Condition 3 (Sensitivity continuity). For each b such that ∥b∥1 = 1, F1(t; b) is continuous at t = τ(b).
Condition 1 implies that n1 and n0 in O(·) and o(·) notation expressions can be equivalently replaced with n, as we shall do consistently. Conditions 2 and 3 concern the biomarker distributions at the thresholds for the combinations, but not elsewhere. Nevertheless, the requirement for each combination might not be necessary especially for b such that ϕ(b) is far away from the optimum. Simplicity is part of the consideration in adopting the current form.
The conditions so far guarantee the existence of an optimal combination. However, the optimal combination is not necessarily unique as in the case, for example, that M1 and M0 lie in a linear subspace of dimension less than k with probability 1. Then, β is not well defined and much less the notion of consistency for its estimation. For that reason, we further require the following uniqueness condition.
Condition 4 (Optimal combination uniqueness). The maximizer β of ϕ(b) over is unique.
Theorem 3.1. Under Conditions 1–3, a maximizer β of ϕ(b) over exists. Consider an estimator such that , satisfying
almost surely. Then, both and converge to ϕ(β) almost surely. If, in addition, Condition 4 holds, then converges to β almost surely.
This result concerns a near maximizer , of which is a special case. While ϕ(β) represents the ideal performance, the predictive performance reflects how the learned combination performs on future data once implemented. Their difference , referred to as the performance deficiency, converges to 0 almost surely. Meanwhile, the empirical estimate is consistent for both and ϕ(β). Appealingly, this consistency may hold even when β is not unique and thus might not converge. Indeed, of most interest is the performance rather than the combination coefficient, which is a distinctive feature of this problem. Nevertheless, the uniqueness of β and consistency of would facilitate a rigorous study on weak convergence of the predictive performance. In some circumstances, non-uniqueness of β may be resolved as with the example that M1 and M0 lie in a linear subspace of dimension less than k by means of reducing the biomarker dimension.
3.2. Approximating via uniform Bahadur representation of .
As discussed in Section 2.3, we may now switch to the simplified problem dealing with η and instead of β and . Recall β1 > 0. Then, by Theorem 3.1 under Conditions 1–4, converges to η almost surely. Write m+ ≡ mI(m ≥ 0) and m− ≡ mI(m ≤ 0) and, in the case of a vector, the operators + and − apply componentwise. Additional conditions are adopted.
Condition 5 (Biomarker integrability). Biomarker Md,−1 is integrable for d = 1, 0.
Condition 6 (Smoothness of control quantile). With the control anchor biomarker M0,1, (i) the conditional distribution function Pr(M0,1 + h⊤M0,−1 ≤ t | M0,−1) has a bounded density for (h⊤, t)⊤ in a neighborhood of {η⊤, τ(η)}⊤; and (ii) possibly upon a location shift for M0,−1, the density of the marginal distribution function is bounded away from 0 for in a neighborhood of {η⊤, −η⊤, τ(η)}⊤.
Condition 7 (Smoothness of case and control distributions). For d = 1, 0, the conditional distribution function Pr(Md,1 + h⊤Md,−1 ≤ t | Md,−1) has a bounded second derivative with respect to t, for (h⊤, t)⊤ in a neighborhood of {η⊤, τ(η)}⊤.
Conditions 6 and 7 require the marginal distribution of the anchor biomarker Md,1, d = 1, 0, to have certain smoothness, but not so for other biomarkers. However, these non-anchor biomarkers need to have finite expectations by Condition 5.
Remark 1. The biomarker combination in part (ii) of Condition 6 is more general than the linear combination under consideration, since the coefficients for positive or negative M0,−1, h+ or −h−, respectively, may or may not be the same. However, in certain situations, the two are actually equivalent. Notable in the Condition is the possible location shift for M0,−1, to which both the linear combination coefficient and the constrained sensitivity are invariant. Nevertheless, the split point of a biomarker for different coefficients then may correspond to any finite value before the shift. Therefore, when each component of M0,−1 has a finite upper or lower bound for the support of its distribution, only one of the corresponding components of h+ and −h− is relevant upon shifting the finite bound to 0, to give rise to a linear combination. The more general combination is only essential when a component of M0,−1 has a distribution with support on the whole real line as with, e.g., a normal distribution.
Remark 2. When the distribution of Md,−1 has a bounded support, |(h − η)⊤Md,−1| can be made arbitrarily small when h is sufficiently close to η. In that case, Condition 6 is satisfied when the conditional density of the optimal combination M0,1 + η⊤M0,−1 given M0,−1 is bounded from above and away from 0 in a neighborhood of τ(η). Similarly, Condition 7 is met when the conditional distribution of the optimal combination Md,1 + η⊤Md,−1 given Md,−1 has a bounded second derivative in a neighborhood of τ(η).
Remark 3. We have made these conditions slightly less general in the interest of clarity in exposition. The first biomarker, i.e., the anchor, serves a role different from the others in Conditions 5–7. Actually, that role may be taken by any biomarker with a non-zero coefficient in the optimal combination. Also, the three such biomarkers, one for Condition 6 and two for Condition 7 corresponding to d = 1, 0 separately, do not need to be the same. These conditions serve to ensure that the marginal distribution of a linear combination has certain smoothness around the threshold and around the optimal combination; see Appendix B. Only existence, but not identification, of these aforementioned biomarkers is required.
We first provide a uniform convergence rate of , extending the classical result on the empirical quantile (cf. Serfling, 1980, lemma 2.5.4.B). Write ∥ · ∥∞ as the maximum norm.
Lemma 3.2. Under Conditions 1–6, there exists a constant ε > 0 such that
almost surely.
Meanwhile, the local behavior result of the empirical distribution function (cf. Serfling, 1980, lemma 2.5.4.E) can also be generalized.
Lemma 3.3. Suppose that Conditions 1–5 and 7 hold and take c0 > 0 as any given constant. For d = 1, 0 separately, there exists a constant ε > 0 such that
almost surely.
Remark 4. The proofs of the uniform results in Lemmas 3.2 and 3.3, given in Appendix B, exploit a monotonicity property of and . That is, in the special case that the non-anchor biomarkers, Md,−1, d = 1, 0, are nonnegative, is non-decreasing in each component of h whereas is non-decreasing in t and non-increasing in each component of h. Furthermore, the general problem can be so reformulated. This can be easily achieved, by means of biomarker shifting and sign-flipping, if the case and control distributions for each component of Md,−1 are bounded in the union of their supports from either above or below. More generally, we now develop a novel biomarker splitting technique to extend the problem to a linear combination with 2k − 1 biomarkers, . Now η in the original problem translates to (η⊤, −η⊤)⊤ in the extended one, with all non-anchor biomarkers being nonnegative. Focus on Lemma 3.2 with Condition 6. Note that an extended linear combination can be equivalently written as:
where the indicator functions apply to M0,−1 componentwise, and ⊙ is the Hadamard product. Furthermore, conditioning on and is the same as that on M0,−1. Therefore, part (i) of Condition 6 on the conditional distribution automatically holds for the extended problem if it does for the original one. Meanwhile, part (ii) of Condition 6 on the marginal distributions accommodates this extension as discussed in Remark 1. Therefore, the extended problem, with the monotonicity property, can be utilized to establish Lemma 3.2, upon noting that the result for the extended problem is more general. The same approach applies to Lemma 3.3 with Condition 7.
Write fd(t; h) as the probability density function of Fd(t; h), d = 1, 0, if it exists. Let λ ≡ f1{τ(η); η}f0{τ(η); η}−1. The preceding lemmas give rise to a uniform Bahadur representation of the empirical threshold and subsequently to an approximation of the empirical constrained sensitivity by
which is more amenable to analysis.
Theorem 3.4. Suppose that Conditions 1–7 hold. There exists a constant ε > 0 such that
| (7) |
almost surely. Furthermore, for any hn → η,
| (8) |
almost surely.
The order of the remainder in (7), O{n−3/4(log n)3/4}, may be improved in the light of the sharper bound in the standard Bahadur representation (Bahadur, 1966; Kiefer, 1967). Nevertheless, for our purposes, the remainder needs only to be op(n−2/3). In fact, a weaker version of (8),
| (9) |
is adequate for the ensuing results.
3.3. Weak convergence: cube root asymptotics.
In their seminal work, Kim and Pollard (1990) established cube root asymptotics with their main theorem dealing with an unconstrained maximization of a one-sample empirical process. Our objective function , however, is a rather complicated functional of empirical processes involving two independent samples. We shall extend their result to our problem, by exploiting the approximation (9) through as a linear combination of empirical processes from independent samples.
Theorem 3.5. Suppose that Conditions 1–7 hold. Then, the process converges weakly to a Gaussian process Z(a) with continuous sample paths, mean a⊤Ha/2, and covariance kernel V, where H and V are given by (26) and (27), respectively, in Appendix B. Consider as an estimator of η satisfying
If H is negative definite and Z has nondegenerate increments, i.e., V (a, a) ≠ 0 for a ≠ 0, converges in distribution to U ≡ argmaxa Z(a). Meanwhile, converges in distribution to U⊤HU/2.
Of course, is a special case of . The estimated combination coefficient has an n−1/3 convergence rate, leading to an n−2/3 convergence rate of the predictive performance to the ideal one ϕ(η). This exhibits a distinctive and more robust convergence profile in comparison with other linear combination methods, with respect to the constrained sensitivity under consideration. Note that a number of methods originally developed for cohort studies, such as the support vector classifier and the maximum score estimator (Manski, 1975, 1985), can be applied to case-control studies as well, although their loss functions and thus interpretation may change accordingly. Among existing methods, the maximum score estimator has the same n−1/3 convergence rate as but may not converge to the same limit. Thus, its predictive performance does not approach the ideal one ϕ(η) in general. Meanwhile, a large number of methods, including linear discriminant analysis, logistic regression, the support vector classifier, and AUC maximization (Pepe, Cai and Longton, 2006), have the faster parametric n−1/2 convergence rate of their combination coefficients. Accordingly, in the case that their limits coincide with η, their associated predictive performances have a faster n−1 convergence rate; this occurs, for example, for logistic regression when the model is correctly specified. Under other circumstances, however, their predictive performances may not approach the ideal one ϕ(η), and also typically converge at a slower n−1/2 rate.
4. Computational algorithm.
The optimization problem (3) is computationally challenging. Obviously, a brute-force grid search can be used, but the computational burden is prohibitive except with very few biomarkers. We have also approached the problem via modern mixed integer linear programming. Algorithmic advances and hardware improvements over the past three decades have dramatically sped up mixed integer optimization. The last few years have seen applications to several statistical problems, related or unrelated to ours, that were once regarded as intractable; see Florios and Skouras (2008) and Bertsimas, King and Mazumder (2016) among others. Most appealingly, this approach permits exact optimization or an approximate one with a definite error bound. Nevertheless, our experience showed that the computation is still too intensive, at least with personal computers, for typical datasets in practice. In the following, we suggest a different approach, not a purely computational one but rather a combined statistical and computational solution to balance efficiency and high quality. We shall devise an asymptotically equivalent optimization problem and then develop a novel computational algorithm, partly inspired by Ou, Zeng and Cai (2016) on a different problem.
Two features of the optimization problem (3) contribute to its poor computational properties. One is the nonlinear equality constraint, on ∥b∥1, and the other is the indicator function in the objective function and the inequality constraint. For the former, because of scale invariance of to b,
with any constant w > 0. This results in an equivalent optimization problem with inequality constraints only,
| (10) |
Such a minimization formulation is more standard in the optimization literature. For the indicator function, approximating I(x ≤ 0) by σ−1{x− − (x + σ)−}, for a small σ > 0, gives rise to:
| (11) |
However, the solution b of this problem, say, , may no longer have unity ℓ1 norm. A rescaling leads to .
This estimation procedure is better elucidated through an analysis. Add subscript σ to , , and to denote their counterparts after the indicator function approximation. From (11), we can see
With σ > 0, is no longer scale invariant to b as for s > 0, which explains that may not be 1. Nevertheless, since , we have or . Suppose w > 1 and we then obtain
for some data-dependent s ∈ [1−w−1, 1]. Subsequently,
| (12) |
Meanwhile, in the circumstance of the first biomarker being the anchor as described in Section 2.3, write and . When , we have a similar identity,
| (13) |
On the other hand, from I(x ≤ −σ) ≤ σ−1{x− − (x + σ)−} ≤ I(x ≤ 0), we have , , and subsequently
| (14) |
Identities (12) and (13) together with (14) suggest that, with σ sufficiently small, impact of the approximation is asymptotically negligible such that the consistency and weak convergence results in Theorems 3.1 and 3.5 hold.
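The piecewise-linear surrogate for the indicator function and the bounds I(x ≤ −σ) ≤ σ−1{x− − (x + σ)−} ≤ I(x ≤ 0) can be checked numerically. The sketch below takes x− to mean the magnitude of the negative part, max(−x, 0); this sign convention is an assumption made here so that the stated bounds hold, and the function names are hypothetical.

```python
import numpy as np


def neg_part(x):
    """Magnitude of the negative part, max(-x, 0) (assumed sign convention)."""
    return np.maximum(-np.asarray(x, dtype=float), 0.0)


def ramp(x, sigma):
    """Piecewise-linear surrogate for I(x <= 0) with transition width sigma > 0.

    Equals 1 for x <= -sigma, -x / sigma for -sigma < x <= 0, and 0 for x > 0.
    """
    x = np.asarray(x, dtype=float)
    return (neg_part(x) - neg_part(x + sigma)) / sigma


if __name__ == "__main__":
    x = np.linspace(-2.0, 2.0, 9)
    for sigma in (1.0, 0.5, 0.1):
        r = ramp(x, sigma)
        lower = (x <= -sigma).astype(float)
        upper = (x <= 0).astype(float)
        print(sigma, bool(np.all((lower <= r) & (r <= upper))))
```

As σ shrinks, the surrogate differs from the indicator only on the interval (−σ, 0], which is the intuition behind the approximation being asymptotically negligible for sufficiently small σ.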
Corollary 4.1. Set the finite constant w > 1 in optimization problem (11). If σ = o(1), then is a special case of defined in Theorem 3.1. Furthermore, when the first biomarker is the anchor and σ = op(n−2/3), and differ by a scale with probability tending to 1 and is a special case of defined in Theorem 3.5.
Now, focus on the computation with problem (11). The objective function and the left-hand side of the first constraint are sums of convex and concave components, whereas the left-hand side of the second constraint is convex. We can then adopt the concave-convex procedure (Yuille and Rangarajan, 2003) as extended by Lipp and Boyd (2016) to accommodate constraints, which is the core of Algorithm 1. At each step, the two concave components are replaced by their tangent planes at the current variable value, resulting in a convex optimization problem. With our application, the convex optimization is actually a linear program as given by (15), upon adopting slack variables, thanks partly to the adopted ℓ1 norm on b. The current variable value is then updated with the optimizer, which always satisfies the original constraints and improves the original objective function. Such steps can be repeated until the original objective function can no longer be improved, and the algorithmic convergence in the objective function of (11) is guaranteed.
Nevertheless, one issue with the implementation is how to choose σ. Corollary 4.1 suggests a small value, which, however, could lead to a local optimizer near the initial value. Indeed, multiple local optimizers may exist and the concave-convex procedure is not guaranteed to reach the global one (cf. Lipp and Boyd, 2016). In Algorithm 1, rather than a single small value, a sequence of decreasing σ values is taken. Thanks to the specific approximation adopted for the indicator function, the transition from one σ value to the next is seamless since the variable value remains feasible, i.e., satisfying the constraints, as σ decreases. With a larger σ, conceivably the objective function and the constraint are smoother, with fewer local optima. As the σ value decreases gradually, the optimizer might thus be more likely to approach the global one.
With shrinking σ in Algorithm 1, the choice of constant w > 1 becomes less essential since w is absorbed into σ in identities (12) and (13). A convenient value, say, 2, suffices. Finally, an initial feasible value is required for (b⊤, t)⊤ with the starting σ value. For that purpose, a working method such as logistic regression or a coarse grid search may be adopted.
As a note, this developed algorithm is not limited to the proposed method but rather is generally applicable for other optimization problems of a similar form. The maximum score estimator of Manski (1975, 1985) is one example.

5. Numerical studies.
The proposed empirical utility maximization with Algorithm 1 has been implemented using a linear programming solver from R package Rmosek. In the studies reported below, w was set to 2. The logistic regression coefficient served as the initial value; the results using a coarse grid search were similar. With the initial coefficient, the maximum gap among the cases and that among the controls, between adjacent order statistics of the combinations, were computed. The larger of the two was taken as the starting σ value. A feasible initial value for the threshold was then calculated. Each subsequent σ value was shrunk by a factor of 0.8.
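The choice of the starting σ value and its geometric shrinkage described above can be sketched as follows. The function names are hypothetical, and the number of steps `n_steps` is an assumption introduced for illustration, since the stopping rule for the σ sequence is not stated in this passage.

```python
import numpy as np


def starting_sigma(b, cases, controls):
    """Larger of the maximum adjacent-order-statistic gaps of the combination
    values b @ M among the cases and among the controls."""
    gaps = []
    for sample in (cases, controls):
        scores = np.sort(sample @ b)
        gaps.append(np.max(np.diff(scores)))  # needs at least two subjects
    return float(max(gaps))


def sigma_schedule(sigma0, factor=0.8, n_steps=10):
    """Geometrically decreasing sigma values; n_steps is a hypothetical cutoff."""
    return [sigma0 * factor**i for i in range(n_steps)]


if __name__ == "__main__":
    b = np.array([1.0, 0.0])
    cases = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0]])
    controls = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, -1.0]])
    s0 = starting_sigma(b, cases, controls)
    print(s0, sigma_schedule(s0, n_steps=3))
```

In a full implementation each σ value in the schedule would be passed to the concave-convex iterations, with the optimizer from one σ value serving as the starting point for the next.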
Seven existing linear combination methods were included in our numerical studies for comparison. Linear discriminant analysis and logistic regression are both standard methods; the latter in particular is routinely adopted in biomarker research. As a semiparametric method, the monotonic density ratio model of Chen et al. (2016) is more general. For the support vector classifier, function svm in R package e1071 was used. To target the AUC, we adopted the smoothed empirical AUC maximizer of Lin et al. (2011) as implemented in R package aucm. As software for the maximum score estimator of Manski (1975, 1985) did not appear to be readily available, we adapted our algorithm in Section 4 for its computation. Finally, the kernel smoothing-based estimator of Meisner et al. (2021) as implemented in R package maxTPR with default parameters was assessed as well.
Specificity at controlled 95% sensitivity is most relevant for our prostate cancer application reported in Section 5.2. That performance metric was accordingly adopted for all the numerical studies, and was targeted by both the proposed empirical utility maximization and the kernel smoothing-based estimator of Meisner et al. (2021). Both methods, including our proposal in Sections 2–4, are formulated for sensitivity at controlled specificity; they were applied here after exchanging the roles of cases and controls.
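For concreteness, the empirical version of this metric for a given combination score might be computed as in the sketch below. The conventions are assumptions made for illustration, not taken from the paper: a subject tests positive when the score is at or above the threshold, and the threshold is the largest case score that keeps empirical sensitivity at or above the target. The function name `specificity_at_sensitivity` is hypothetical.

```python
import numpy as np


def specificity_at_sensitivity(case_scores, control_scores, sensitivity=0.95):
    """Empirical specificity at a controlled sensitivity level.

    Assumed convention: positive if score >= threshold. Ties among case
    scores are not treated specially in this sketch.
    """
    case_scores = np.sort(np.asarray(case_scores, dtype=float))
    control_scores = np.asarray(control_scores, dtype=float)
    # Number of cases allowed to fall below the threshold; the small
    # tolerance guards against floating-point error in n * (1 - sensitivity).
    n_missed = int(np.floor(case_scores.size * (1.0 - sensitivity) + 1e-9))
    threshold = case_scores[n_missed]
    specificity = float(np.mean(control_scores < threshold))
    return threshold, specificity


# Toy example: 20 cases with scores 1..20 and five controls.
threshold, spec = specificity_at_sensitivity(
    np.arange(1, 21), [0.0, 1.0, 1.5, 3.0, 5.0]
)
```

With 20 cases, one case may fall below the threshold, so the threshold is the second smallest case score.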
The computer code for the proposed empirical utility maximization and the maximum score estimator of Manski (1975, 1985) is available on the first author’s website (http://web1.sph.emory.edu/users/yhuang5).
5.1. Simulations.
Across all set-ups, the control biomarkers were independent and identically distributed with the standard normal distribution, whereas the case biomarkers had different distributions from one set-up to another. Either three or six biomarkers were considered for combination. For the former, all the biomarkers were informative, i.e., having different case and control distributions. Then, in the latter case, three independent and non-informative biomarkers were additionally included; that is, each followed the same standard normal distribution for the cases as for the controls. Four different scenarios were constructed for the three informative biomarkers:
Scenario A. The case biomarkers were independent and normally distributed each with mean 0.9 and variance 1.
Scenario B. The case biomarkers were independent and normally distributed with the same mean of 0.8 but different variances, 0.5, 1, and 2.
Scenario C. The case distribution was jointly normal with the same mean of 1, variances of 0.5, 1, and 2, and pairwise correlation coefficient of 0.5.
Scenario D. The case biomarkers followed a mixture of independent normal distributions. With probability 2/3, they had means of 1.7, 1.7, and 0, and variances of 0.5, 2, and 1, respectively. Then, with probability 1/3, their means became 0, 0, and 1.7, and the variances had the same value of 1.
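The four scenarios can be generated along the following lines. This is a sketch based only on the description above; the function name `simulate` and its arguments are hypothetical, and the random number generator is NumPy's rather than whatever the authors used.

```python
import numpy as np


def simulate(scenario, n1, n0, n_noise=0, rng=None):
    """Draw (cases, controls) under Scenarios A-D; n_noise is 0 or 3."""
    rng = np.random.default_rng(rng)
    # Controls: independent standard normal biomarkers in every set-up.
    controls = rng.standard_normal((n0, 3 + n_noise))
    if scenario == "A":
        cases = rng.normal(0.9, 1.0, size=(n1, 3))
    elif scenario == "B":
        cases = rng.normal(0.8, np.sqrt([0.5, 1.0, 2.0]), size=(n1, 3))
    elif scenario == "C":
        sd = np.sqrt([0.5, 1.0, 2.0])
        corr = np.full((3, 3), 0.5)
        np.fill_diagonal(corr, 1.0)
        cov = corr * np.outer(sd, sd)  # pairwise correlation 0.5
        cases = rng.multivariate_normal(np.ones(3), cov, size=n1)
    elif scenario == "D":
        # Two subtypes, drawn with probabilities 2/3 and 1/3.
        first = rng.random(n1) < 2.0 / 3.0
        sub1 = rng.normal([1.7, 1.7, 0.0], np.sqrt([0.5, 2.0, 1.0]), size=(n1, 3))
        sub2 = rng.normal([0.0, 0.0, 1.7], 1.0, size=(n1, 3))
        cases = np.where(first[:, None], sub1, sub2)
    else:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if n_noise:
        # Non-informative biomarkers: standard normal for cases as well.
        cases = np.hstack([cases, rng.standard_normal((n1, n_noise))])
    return cases, controls


cases_c, controls_c = simulate("C", n1=200, n0=150, n_noise=3, rng=0)
```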
These scenarios were motivated by cancer detection applications. In Scenarios A, B, and C, all three case biomarkers were elevated in comparison to their control counterparts. However, they might be independent or correlated, with the same or different variability. Scenario D mimicked cancer heterogeneity, as is common in cancer biology, involving two subtypes. The first two biomarkers were only elevated in one subtype, whereas the last was so in the other. The assumptions for linear discriminant analysis, logistic regression, and the monotonic density ratio model hold under Scenario A, but not under Scenarios B, C, and D. At controlled 95% sensitivity, the optimal specificities with linear combinations were 0.466, 0.452, 0.444, and 0.442 under Scenarios A, B, C, and D, respectively. The case and control sizes, n1 and n0, were set equal with values from 100 to 500.
Results were obtained from 1000 simulations for each set-up. Table 1 shows summary statistics of performance deficiencies for the proposed empirical utility maximization along with the seven existing methods. The averaged performance deficiencies for a subset of these methods are also displayed in Figure 1. For all the methods under each scenario, performance deficiency decreased with larger sample size as expected, and also from 6 to 3 biomarkers after the elimination of non-informative ones. In Scenario A, all the estimated combination coefficients converge to the optimal one and so do their predictive performances. Not surprisingly, linear discriminant analysis and logistic regression performed the best. They were followed by the monotonic density ratio model, support vector classifier, and empirical AUC maximizer. The maximum score estimator and the kernel smoothing-based estimator also performed better than the empirical utility maximization, although the differences were fairly small. However, in Scenario B and more so in Scenario C, the empirical utility maximization had the best performance, whereas linear discriminant analysis and logistic regression performed rather poorly, as did the monotonic density ratio model, support vector classifier, and empirical AUC maximizer. Under Scenario D, the relative performance varied with sample size and the number of biomarkers; the empirical utility maximization became the best as the sample size increased. Overall, the proposed method showed a different performance profile, even in comparison with the kernel smoothing-based method. Across all set-ups, on a logarithmic scale for both variables, the averaged performance deficiency and the sample size showed a roughly linear relationship with a slope of −2/3 for the empirical utility maximization. This is consistent with the asymptotic n^{−2/3} convergence rate.
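As a rough check of the log-log relationship, one may regress the logarithm of the averaged deficiency on the logarithm of the sample size. The sketch below uses the rounded EUM means for Scenario A with 3 biomarkers from Table 1; it fits a free slope by least squares, which differs from the fixed −2/3 slope lines drawn in Figure 1, and the helper name `loglog_slope` is hypothetical.

```python
import numpy as np

# Mean performance deficiency (x1000) of EUM, Scenario A, 3 biomarkers,
# as reported (rounded) in Table 1.
sample_sizes = np.array([100, 200, 300, 400, 500], dtype=float)
eum_mean_deficiency = np.array([30, 19, 15, 12, 11], dtype=float) / 1000.0


def loglog_slope(n, deficiency):
    """Least-squares slope of log(deficiency) against log(n)."""
    slope, _intercept = np.polyfit(np.log(n), np.log(deficiency), deg=1)
    return float(slope)


slope = loglog_slope(sample_sizes, eum_mean_deficiency)
print(f"fitted slope: {slope:.3f}; theoretical exponent: {-2 / 3:.3f}")
```

Because the table entries are rounded to three decimals, the fitted slope is only indicative.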
Table 1.
Simulation results on performance deficiencies of specificity at 95% sensitivity
| | 3 biomarkers | | | | | | | | | | 6 biomarkers | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n1 = n0 = | 100 | | 200 | | 300 | | 400 | | 500 | | 100 | | 200 | | 300 | | 400 | | 500 | |
| | M | D | M | D | M | D | M | D | M | D | M | D | M | D | M | D | M | D | M | D |
| Scenario A | ||||||||||||||||||||
| LDA | 8 | 8 | 4 | 4 | 3 | 3 | 2 | 2 | 2 | 2 | 20 | 13 | 10 | 6 | 7 | 4 | 5 | 3 | 4 | 3 |
| LR | 8 | 8 | 4 | 4 | 3 | 3 | 2 | 2 | 2 | 2 | 21 | 14 | 10 | 7 | 7 | 5 | 5 | 3 | 4 | 3 |
| MDR | 10 | 10 | 5 | 5 | 4 | 4 | 3 | 3 | 2 | 3 | 23 | 15 | 12 | 8 | 8 | 5 | 6 | 4 | 5 | 4 |
| SVC | 9 | 10 | 5 | 5 | 3 | 3 | 2 | 2 | 2 | 2 | 24 | 15 | 12 | 8 | 8 | 5 | 6 | 4 | 5 | 3 |
| AUC | 9 | 15 | 4 | 4 | 3 | 3 | 2 | 2 | 2 | 2 | 24 | 23 | 11 | 7 | 7 | 5 | 6 | 4 | 5 | 3 |
| MSE | 22 | 21 | 14 | 13 | 10 | 10 | 9 | 8 | 7 | 7 | 44 | 29 | 27 | 18 | 20 | 13 | 16 | 11 | 14 | 10 |
| KS | 20 | 26 | 13 | 14 | 9 | 17 | 7 | 9 | 6 | 7 | 46 | 34 | 29 | 26 | 19 | 19 | 17 | 18 | 14 | 11 |
| EUM | 30 | 30 | 19 | 18 | 15 | 15 | 12 | 12 | 11 | 10 | 54 | 35 | 36 | 24 | 27 | 18 | 23 | 16 | 20 | 14 |
| Scenario B | ||||||||||||||||||||
| LDA | 37 | 25 | 31 | 17 | 30 | 14 | 30 | 12 | 29 | 11 | 50 | 27 | 39 | 18 | 35 | 14 | 33 | 12 | 32 | 10 |
| LR | 37 | 26 | 31 | 17 | 30 | 14 | 29 | 12 | 28 | 11 | 50 | 28 | 39 | 19 | 34 | 14 | 33 | 13 | 31 | 10 |
| MDR | 41 | 29 | 34 | 21 | 33 | 17 | 31 | 15 | 30 | 13 | 53 | 29 | 42 | 20 | 37 | 16 | 35 | 14 | 33 | 12 |
| SVC | 38 | 28 | 31 | 19 | 30 | 16 | 29 | 13 | 28 | 12 | 52 | 30 | 40 | 20 | 35 | 16 | 33 | 14 | 32 | 12 |
| AUC | 38 | 30 | 32 | 18 | 31 | 15 | 30 | 13 | 29 | 11 | 57 | 46 | 40 | 20 | 36 | 15 | 34 | 13 | 32 | 11 |
| MSE | 43 | 41 | 35 | 31 | 30 | 25 | 27 | 22 | 26 | 20 | 63 | 41 | 47 | 30 | 39 | 25 | 35 | 22 | 33 | 20 |
| KS | 33 | 33 | 21 | 18 | 20 | 22 | 18 | 21 | 16 | 13 | 56 | 39 | 38 | 29 | 30 | 18 | 26 | 21 | 24 | 14 |
| EUM | 30 | 29 | 18 | 17 | 14 | 14 | 13 | 13 | 11 | 11 | 53 | 34 | 34 | 25 | 26 | 18 | 22 | 17 | 19 | 13 |
| Scenario C | ||||||||||||||||||||
| LDA | 94 | 36 | 92 | 26 | 91 | 20 | 90 | 19 | 91 | 17 | 105 | 36 | 98 | 26 | 95 | 22 | 92 | 19 | 93 | 16 |
| LR | 98 | 39 | 97 | 28 | 96 | 22 | 95 | 21 | 96 | 18 | 110 | 38 | 103 | 28 | 100 | 23 | 98 | 20 | 98 | 18 |
| MDR | 106 | 42 | 107 | 32 | 108 | 28 | 108 | 27 | 110 | 25 | 114 | 40 | 109 | 29 | 107 | 25 | 106 | 23 | 107 | 20 |
| SVC | 106 | 42 | 105 | 31 | 104 | 25 | 103 | 23 | 104 | 21 | 118 | 42 | 111 | 31 | 107 | 26 | 105 | 23 | 105 | 20 |
| AUC | 95 | 40 | 92 | 28 | 92 | 22 | 91 | 20 | 91 | 18 | 110 | 48 | 99 | 29 | 95 | 23 | 94 | 20 | 94 | 18 |
| MSE | 106 | 63 | 104 | 53 | 104 | 49 | 108 | 47 | 107 | 45 | 118 | 59 | 112 | 52 | 110 | 49 | 110 | 45 | 111 | 41 |
| KS | 61 | 57 | 44 | 45 | 39 | 39 | 30 | 35 | 26 | 33 | 88 | 55 | 63 | 40 | 63 | 40 | 46 | 36 | 40 | 32 |
| EUM | 37 | 39 | 22 | 24 | 18 | 19 | 15 | 16 | 13 | 13 | 66 | 45 | 42 | 30 | 32 | 24 | 26 | 19 | 23 | 17 |
| Scenario D | ||||||||||||||||||||
| LDA | 34 | 27 | 28 | 18 | 27 | 15 | 26 | 13 | 25 | 11 | 45 | 29 | 34 | 17 | 30 | 15 | 27 | 12 | 27 | 11 |
| LR | 33 | 25 | 26 | 17 | 25 | 14 | 25 | 12 | 24 | 11 | 44 | 27 | 32 | 16 | 29 | 14 | 26 | 11 | 25 | 10 |
| MDR | 37 | 33 | 31 | 23 | 30 | 21 | 29 | 18 | 28 | 16 | 46 | 30 | 35 | 20 | 32 | 17 | 30 | 15 | 29 | 14 |
| SVC | 40 | 33 | 30 | 22 | 28 | 17 | 28 | 14 | 26 | 13 | 50 | 32 | 37 | 21 | 32 | 17 | 29 | 14 | 28 | 12 |
| AUC | 33 | 26 | 25 | 17 | 24 | 14 | 23 | 12 | 22 | 10 | 47 | 38 | 32 | 17 | 28 | 14 | 25 | 11 | 24 | 10 |
| MSE | 54 | 52 | 41 | 39 | 37 | 35 | 35 | 32 | 33 | 29 | 66 | 48 | 50 | 38 | 46 | 34 | 41 | 31 | 36 | 25 |
| KS | 36 | 42 | 21 | 20 | 19 | 25 | 16 | 20 | 14 | 22 | 56 | 43 | 37 | 23 | 29 | 25 | 25 | 20 | 21 | 13 |
| EUM | 33 | 31 | 22 | 21 | 17 | 16 | 14 | 13 | 12 | 12 | 59 | 38 | 39 | 25 | 29 | 20 | 25 | 17 | 20 | 13 |
M: empirical mean (×1000); D: empirical standard deviation (×1000).
LDA: linear discriminant analysis; LR: logistic regression; MDR: monotonic density ratio model; SVC: support vector classifier; AUC: smoothed empirical AUC maximizer; MSE: maximum score estimator; KS: kernel smoothing-based method of Meisner et al. (2021); EUM: proposed empirical utility maximization.
Fig 1.

Simulation results on linear combination via the proposed empirical utility maximization (●), in comparison with logistic regression (○), smoothed empirical AUC maximizer (Δ), maximum score estimator (+), and kernel smoothing-based method of Meisner et al. (2021) (×). The least-squares fitting lines of −2/3 slope are shown for the empirical utility maximization.
These simulations were performed on a 2020 MacBook Pro laptop with 2.3 GHz Intel Core i7. For the empirical utility maximization, the average CPU time for a single dataset ranged from 0.82 seconds in a case of n1 = n0 = 100 and 3 biomarkers to 4.43 seconds with n1 = n0 = 500 and 6 biomarkers.
5.2. Application to prostate cancer detection.
The proposed methodology was motivated by prostate cancer research, to improve the detection of aggressive cancer, i.e., Gleason score ≥ 7, using non-invasive biomarkers among men undergoing their first-time biopsy. Among a limited number of commercially available test assays, the prostate health index (phi) is an FDA-approved blood test that combines three forms of prostate-specific antigen (PSA) from serum, namely total PSA (tPSA), free PSA (fPSA), and isoform [−2]proPSA (p2PSA):
$$\mathrm{phi} = \frac{\mathrm{p2PSA}}{\mathrm{fPSA}} \times \sqrt{\mathrm{tPSA}},$$
which is a proprietary calculation developed by Beckman Coulter Inc. The test has been evaluated and adopted to distinguish aggressive cancer from indolent or no cancer (Catalona et al., 2011). However, it was unclear whether the combination of the three PSA forms could be improved to achieve better specificity at a controlled high sensitivity level, in particular, 95%. To address it, we analyzed 156 cases and 358 controls, i.e., with and without aggressive prostate cancer, respectively, per pathology testing on prostate biopsies, enrolled in academic urology groups (Sanda et al., 2017). Serum specimens of these participants, obtained prior to biopsy, were assayed for phi.
Note that phi is, on the logarithmic scale, a linear combination of log-transformed tPSA, fPSA, and p2PSA. We applied our proposed method and the existing ones considered in the earlier simulations to this data set, as reported in Table 2. The combination coefficient of the empirical utility maximization appeared to deviate considerably from those of the existing methods, and even more so from that of phi. Furthermore, the empirical utility maximization had a substantially better empirical estimate of specificity at 95% sensitivity. However, except for phi, all these methods had their combinations trained in the same data set, and thus their empirical performance estimates cannot be taken as unbiased for predictive performance. In particular, the empirical performance estimate of the empirical utility maximization method tends to over-estimate, and possibly so does that of the kernel smoothing-based method. For the other combination methods, which target different metrics, the empirical performances do not necessarily over-estimate and could actually under-estimate. As an attempt to address such biases, three-fold cross-validation was performed. This specific fold choice was driven by the fact that the validation subset could not be made too small due to the need for threshold estimation. With the cross-validation estimates, not surprisingly, the edge of the combination from the empirical utility maximization shrank considerably. Yet the improvement over phi still appeared clinically meaningful, although it became only marginal over some of the other combinations. Nevertheless, the cross-validation results correspond to learning from a considerably smaller sample size. Thus, the difference in the cross-validation between the proposed method and those targeting different metrics might be conservative, since the former would approach the ideal performance as sample size increases. An independent validation study should provide a more definitive assessment.
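A repeated three-fold cross-validation of this kind might be organized as in the sketch below. Several details are assumptions made for illustration and are not stated in the paper: folds are formed separately for cases and controls, the threshold is estimated within the held-out fold, and each random split is summarized by the average over its three folds before the median is taken. The fitter `mean_difference_fit` is a placeholder and not one of the methods compared here; the data in the usage lines are simulated, not the study data.

```python
import numpy as np


def spec_at_sens(case_scores, control_scores, sensitivity=0.95):
    """Empirical specificity at controlled sensitivity (positive if >= cut)."""
    case_scores = np.sort(np.asarray(case_scores, dtype=float))
    n_missed = int(np.floor(case_scores.size * (1.0 - sensitivity) + 1e-9))
    return float(np.mean(np.asarray(control_scores) < case_scores[n_missed]))


def mean_difference_fit(cases, controls):
    """Placeholder fitter (NOT a method from the paper)."""
    return cases.mean(axis=0) - controls.mean(axis=0)


def repeated_cv(cases, controls, fit, n_folds=3, n_splits=100, seed=0):
    """Median over random splits of the fold-averaged held-out specificity."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_splits):
        case_folds = np.array_split(rng.permutation(len(cases)), n_folds)
        control_folds = np.array_split(rng.permutation(len(controls)), n_folds)
        fold_values = []
        for k in range(n_folds):
            # Train on the other folds, evaluate on fold k.
            coef = fit(
                np.delete(cases, case_folds[k], axis=0),
                np.delete(controls, control_folds[k], axis=0),
            )
            fold_values.append(
                spec_at_sens(cases[case_folds[k]] @ coef,
                             controls[control_folds[k]] @ coef)
            )
        estimates.append(np.mean(fold_values))
    return float(np.median(estimates))


# Simulated stand-in data with the study's group sizes (156 cases, 358 controls).
rng = np.random.default_rng(1)
sim_cases = rng.normal(0.9, 1.0, size=(156, 3))
sim_controls = rng.standard_normal((358, 3))
cv_estimate = repeated_cv(sim_cases, sim_controls, mean_difference_fit, n_splits=20)
```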
Table 2.
Analysis results of the prostate cancer study
| | coefficient | | | performance | |
|---|---|---|---|---|---|
| | log(tPSA) | log(fPSA) | log(p2PSA) | empirical | CV |
| phi | 0.200 | −0.400 | 0.400 | 0.246 | – |
| LDA | 0.275 | −0.441 | 0.283 | 0.277 | 0.296 |
| LR | 0.266 | −0.434 | 0.301 | 0.293 | 0.285 |
| MDR | 0.257 | −0.425 | 0.319 | 0.268 | 0.285 |
| SVC | 0.270 | −0.432 | 0.299 | 0.293 | 0.290 |
| AUC | 0.257 | −0.439 | 0.304 | 0.251 | 0.277 |
| MSE | 0.272 | −0.424 | 0.305 | 0.288 | 0.276 |
| KS | 0.350 | −0.429 | 0.221 | 0.299 | 0.282 |
| EUM | 0.427 | −0.445 | 0.128 | 0.369 | 0.297 |
LDA: linear discriminant analysis; LR: logistic regression; MDR: monotonic density ratio model; SVC: support vector classifier; AUC: smoothed empirical AUC maximizer; MSE: maximum score estimator; KS: kernel smoothing-based method of Meisner et al. (2021); EUM: proposed empirical utility maximization.
All combination coefficients are scaled to have unity ℓ1 norm. performance: specificity at 95% sensitivity; CV: median of three-fold cross-validation estimates from 100 random splits.
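The ℓ1 scaling can be illustrated with the phi row of Table 2. The sketch assumes the commonly published form of the index, phi = (p2PSA/fPSA) × √tPSA, so that log(phi) = 0.5 log(tPSA) − log(fPSA) + log(p2PSA); the helper names `phi_index` and `l1_normalize` and the marker values are hypothetical.

```python
import numpy as np


def phi_index(tpsa, fpsa, p2psa):
    """Assumed form of the index: (p2PSA / fPSA) * sqrt(tPSA)."""
    return (p2psa / fpsa) * np.sqrt(tpsa)


def l1_normalize(coef):
    """Scale a coefficient vector to unit l1 norm."""
    coef = np.asarray(coef, dtype=float)
    return coef / np.abs(coef).sum()


# Coefficients of log(tPSA), log(fPSA), log(p2PSA) in log(phi).
raw_log_coef = np.array([0.5, -1.0, 1.0])
scaled = l1_normalize(raw_log_coef)  # l1 norm of raw_log_coef is 2.5

# Arbitrary illustrative marker values: log(phi) is linear in the logs.
tpsa, fpsa, p2psa = 6.0, 1.2, 15.0
log_markers = np.log([tpsa, fpsa, p2psa])
log_phi = float(np.log(phi_index(tpsa, fpsa, p2psa)))
```

Dividing by the ℓ1 norm of 2.5 gives (0.2, −0.4, 0.4), matching the phi row of Table 2. Rescaling by a positive constant is a monotone transformation of the combination, so it leaves the specificity at a given sensitivity unchanged.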
6. Discussion.
We have developed a linear biomarker combination method that empirically maximizes clinical utility of the intended medical test. The estimated combination coefficient and predictive performance have been rigorously investigated with their limiting distributions established. The proposed empirical utility maximization is shown to be more robust in comparison with several common linear combination methods with respect to our performance metric of interest. Nevertheless, several topics warrant further investigation.
First of all, predictive performance estimation with the training data is of great value, for identification and selection of promising combinations to be validated in future studies. Unfortunately, as discussed in Section 5.2, the apparent empirical estimate tends to over-estimate whereas standard cross-validation might be conservative. The asymptotic theory may need further development to guide this pursuit of more reliable estimation.
Second, optimal biomarker combination without restriction to a specific class is the ultimate goal. Indeed, linear combination may not be effective with, for example, heterogeneous diseases, where various biomarkers are discriminative for certain subtypes but not others. As indicated in Section 1, the proposed linear combination method may accommodate nonlinear combinations of biomarkers via the linear basis expansion technique. However, the number of basis functions under consideration can be large or even infinite. Selection and regularization methods are thus needed; see Hastie, Tibshirani and Friedman (2009, section 5). They are under development.
Finally, a generalization to high-dimensional biomarkers is also being explored. High throughput technology is becoming increasingly available. Combination of such biomarkers also calls for selection and regularization methods. Further efficiency improvement in computation might be critical as well.
Acknowledgments.
The authors thank the reviewers for their helpful comments and suggestions, in particular the Associate Editor for pointing out several mistakes in previous versions of the paper, and Dattatraya H. Patil for assistance in arranging the prostate cancer dataset analyzed in Section 5.2.
Funding.
The authors were supported in part by NIH Grants R01 CA230268, U01 CA113913, and P30 AI050409.
APPENDIX A: PROOFS OF RESULTS IN SECTION 3.1
Proof of Theorem 3.1. We first show the existence of a maximizer β of ϕ(b). Consider a fixed b and an arbitrary b* such that ∥b∥1 = ∥b*∥1 = 1. For any ε > 0, Condition 2 implies
Thus, there exists a constant c1 > 0, independent of b*, such that Pr(∥M0∥∞ > c1) is sufficiently small to satisfy
| (16) |
When ∥M0∥∞ ≤ c1, we can have |(b − b*)⊤M0| ≤ ε/2 so long as b* is sufficiently close to b. With such a b*,
| (17) |
and subsequently τ(b)−τ(b*) ≥ −ε. On the other hand, the same arguments lead to
and subsequently, for b* sufficiently close to b, τ(b) − τ(b*) ≤ ε. Therefore, τ(b) is continuous. With this and Condition 3, so is ϕ(b) by similar arguments. The existence result then follows from the compactness of the parameter space, by the extreme value theorem (e.g., Rudin, 1976, theorem 4.16).
Since the class of functions is Donsker (e.g., Kosorok, 2008, lemma 9.12) and thus Glivenko–Cantelli,
almost surely. Then, extending Theorem 2.3.1 of Serfling (1980) under Condition 2, we obtain
almost surely. Meanwhile, Condition 3 in conjunction with the continuity of τ(b) implies that F1(t; b) is continuous at t = τ(b) uniformly in b such that ∥b∥1 = 1, by similar arguments given earlier, for the continuity of τ(b), in a proof by contradiction. Subsequently,
almost surely. Consequently, and furthermore , almost surely. Then, it follows that and , almost surely.
Finally, under Condition 4, standard arguments can be used to establish the strong convergence of to β in light of the compact parameter space, continuity of ϕ(b), and uniform strong convergence of as established above. □
APPENDIX B: PROOFS OF RESULTS IN SECTIONS 3.2 AND 3.3
As shown in the proof of Theorem 3.1, τ(h) is continuous at η. Therefore, we can make the ε in Lemma 3.2, Lemma 3.3, and Theorem 3.4 sufficiently small so as to restrict h to a neighborhood of η, as we do so implicitly, such that {h⊤, τ(h)}⊤ is in the neighborhoods of {η⊤, τ(η)}⊤ implicated in Conditions 6 and 7.
Write Fd,1|−1(t) as the conditional distribution of Md,1 given Md,−1, and fd,1|−1(t) as its density if it exists. For d = 0 under Condition 6 or 7 and for d = 1 under Condition 7, fd,1|−1(t − h⊤Md,−1) exists at t = τ(h). Accordingly, Fd(t; h) = EFd,1|−1(t − h⊤Md,−1) has a density
at t = τ(h), which is bounded away from 0 for d = 0 by part (ii) of Condition 6. Thus, F0{τ(h); h} = ρ. Taking derivatives on both sides yields the gradient
| (18) |
which is bounded under Conditions 5 and 6.
Furthermore, under Condition 7 and for d = 1, 0, the derivative of fd,1|−1 exists and Fd(t; h) has a second derivative:
at t = τ(h). Subsequently, fd{τ(h); h} has a gradient
| (19) |
which is also bounded under Conditions 5 and 7.
In the following proofs of Lemmas 3.2 and 3.3, but not elsewhere, we restrict to the special case of nonnegative Md,−1, d = 1, 0; the result for the general case follows subsequently by the argument in Remark 4 via the biomarker splitting technique.
Proof of Lemma 3.2. By part (ii) of Condition 6, f0{τ(h); h} is bounded away from 0. Set finite constant . Following the proof of Serfling (1980, lemma 2.5.4.B), one obtains that, for each fixed h such that ∥h−η∥∞ ≤ ε,
| (20) |
for sufficiently large n0. On the other hand, both τ(h) and are non-decreasing in each component of h as all components of M0,−1 are nonnegative. We shall exploit this monotonicity property in the extension of the pointwise result (20). Write the floor function as ⌊·⌋. Impose an equally-spaced grid on each component of h with mesh size , centered at the corresponding component of η. Thus, each h such that ∥h − η∥∞ ≤ ε can be bracketed by h− and h+ in the sense that, componentwise, h− ≤ h ≤ h+ with h− and h+ being the adjacent grid points. Then,
Therefore, given bounded ∇τ(h) as in (18),
where h1 is a grid point componentwise, taking up to different values. For sufficiently large n0,
following (20). By the Borel–Cantelli lemma,
almost surely. So is subsequently. □
Proof of Lemma 3.3. Consider hi, i = 1, 2, and u such that ∥hi − η∥∞ ≤ ε, ∥h1 − h2∥∞ = O(n^{−3/4}), and . Write h∧ = min(h1, h2) and h∨ = max(h1, h2), where the minimization and maximization apply componentwise. Compute the variance,
| (21) |
for some t* in the line segment between τ(h1) + u and τ(h2). Thus, it is clear that this variance is bounded by for some constant c3 > 0 with large nd, in light of bounded density fd(t; h) and conditional density fd,1|−1(t − h⊤Md,−1) around τ(h), bounded gradient of τ(h), and integrability of Md,−1 by Condition 5. Then, by Bernstein’s inequality (Serfling, 1980, lemma 2.5.4.A),
| (22) |
with sufficiently large nd.
Now, with all components of Md,−1 being nonnegative, both Fd{τ(h1) + u; h2} and are non-decreasing in each component of h1 and u, and are non-increasing in each component of h2. Impose a grid on each component of h to bracket h by h− and h+ in the same fashion as that in the proof of Lemma 3.2, however, with a different mesh size . Similarly, impose an equally-spaced grid on u centered at 0 with mesh size , to bracket u by the adjacent points u− and u+ on the grid. Therefore,
Let {h1, h2} be either {h−, h+} or {h+, h−}, and {u1, u2} be either {u−, u+} or {u+, u−}. Then,
where {h1, h2} and u1 take up to and different values, respectively. Given (22), the probability that the first maximum above exceeds is no larger than for sufficiently large nd. Then, the first maximum is O{n^{−3/4}(log n)^{3/4}} almost surely by the Borel–Cantelli lemma. On the other hand, the second maximum is O(n^{−3/4}) by arguments similar to those for (21). Together, they lead to the assertion. □
Proof of Theorem 3.4. By Lemmas 3.2 and 3.3, uniformly in {h : ∥h − η∥∞ ≤ ε},
| (23) |
almost surely. Meanwhile, under Condition 7, Fd(t; h) has a bounded second partial derivative with respect to t around τ(h). Then, a Taylor expansion along with Lemma 3.2 gives that, uniformly in {h : ∥h−η∥∞ ≤ ε},
| (24) |
almost surely.
Almost surely at most k independent observations of M0 may simultaneously satisfy M0,1 + h⊤M0,−1 = t for (h⊤, t)⊤ in a neighborhood of {η⊤, τ(η)}⊤ where the conditional density f0,1|−1(t − h⊤M0,−1) is bounded under part (i) of Condition 6. Then, with the consistency of , given by Lemma 3.2, and the continuity of τ(h), uniformly in ,
| (25) |
almost surely. Equation (7) then follows from equations (23), (24), and (25), as f0{τ(h); h} is bounded away from 0 under part (ii) of Condition 6.
Equations (23), (24), and (7) give rise to
almost surely. Following equation (7) and Lemma 3.2, uniformly in {h : ∥h−η∥∞ ≤ ε},
almost surely. Meanwhile, since the gradient ∇fd{τ(h); h} given in (19) is bounded, f1{τ(h); h}f0{τ(h); h}−1 has a bounded gradient at η. Thus, equation (8) follows. □
Proof of Theorem 3.5. Our proof follows the general framework of Kim and Pollard (1990), although their main theorem does not apply to our problem. Rather than working with directly, we tackle the problem through its approximation , which linearly combines two independent random components, , d = 1, 0. Write , where
with . Consider the class of such functions,
with envelope
Since the subgraphs of functions in form a VC class and Gε(m) is bounded, is uniformly manageable (Kim and Pollard, 1990, section 3). When Md,−1 is nonnegative, E{Gε(Md)^2} = O(ε) as ε ↓ 0 for both d = 1, 0, by arguments similar to those for (21). This result then holds generally, i.e., when Md,−1 is not necessarily nonnegative, by the biomarker splitting technique described in Remark 4. The same approach can be used to establish E|g(Md; h1) − g(Md; h2)| = O(∥h1 − h2∥∞) for h1 and h2 near η. Furthermore, since Gε(m) ≤ 1, E[Gε(Md)^2 I{Gε(Md) > 1}] = 0. With these properties of , we can now utilize the results in Kim and Pollard (1990).
We first establish . By lemma 4.1 of Kim and Pollard (1990), there exists ε > 0 such that, for ∥h − η∥ ≤ ε, with each δ > 0
Given that the Hessian matrix of −Eg(M1; h) at η is negative definite, we can choose δ such that in this neighborhood, provided that ε is sufficiently small. When is in this neighborhood,
On the other hand,
where equation (8) or more specifically (9) has been used. Combining the two gives
which shows the O_p(n^{−1/3}) convergence rate of .
Now, we work with the rescaled process
which is a linear combination of two independent processes. Weak convergence of each follows from the results of Kim and Pollard (1990, lemma 4.5, lemma 4.6, and Theorem 4.7). Compute
| (26) |
where v⊗2 ≡ vv⊤. Let
and define
| (27) |
Then, converges weakly to the corresponding combination of the two limiting processes, which is a Gaussian process Z(a) with continuous sample paths, mean a⊤Ha/2, and covariance kernel V. So does , following equation (8) or (9).
As in Kim and Pollard (1990, example 6.2), one could show that the variance of Z(a1) − Z(a2) is V (a1 − a2, a1 − a2). Thus, the Gaussian process Z has nondegenerate increments provided V (a, a) ≠ 0 for a ≠ 0. Following Kim and Pollard (1990, Theorem 2.7), then converges in distribution to argmaxaZ(a). Subsequently, the weak convergence of follows with a Taylor expansion argument. □
APPENDIX C: PROOFS OF RESULTS IN SECTION 4
Proof of Corollary 4.1. First, consider σ = o(1). With inequality (14), the same arguments in the proof of Theorem 3.1 lead to and subsequently , almost surely. It then follows that almost surely.
Now, switch to the circumstance of the first biomarker being the anchor and take σ = o_p(n^{−2/3}). Since converges in probability to β1 > 0, and thus identity (13) holds with probability tending to 1. Inequality (14) leads to
uniformly in {h : ∥h − η∥∞ ≤ ε} for some ε > 0, where the second equality follows from Lemmas 3.2 and 3.3 and the last one from (23) after a Taylor expansion. Then, is implied. □
REFERENCES
- Bahadur RR (1966). A note on quantiles in large samples. Ann. Math. Statist 37 577–580.
- Bertsimas D, King A and Mazumder R (2016). Best subset selection via a modern optimization lens. Ann. Statist 44 813–852.
- Catalona WJ, Partin AW, Sanda MG, Wei JT, Klee GG, Bangma CH, Slawin M, Marks LS, Loeb S, Broyles DL, Shin SS, Cruz AB, Chan DW, Sokoll LJ, Roberts WL, van Schaik RH and Mizrahi IA (2011). A multicenter study of [−2]pro-prostate specific antigen combined with prostate specific antigen and free prostate specific antigen for prostate cancer detection in the 2.0 to 10.0 ng/ml prostate specific antigen range. J. Urol 185 1650–1655.
- Catalona WJ, Partin AW, Slawin KM, Brawer MK, Flanigan RC, Patel A, Richie JP, DeKernion JB, Walsh PC, Scardino PT, Lange PH, Subong EN, Parson RE, Gasior GH, Loveland KG and Southwick PC (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: a prospective multicenter clinical trial. JAMA 279 1542–1547.
- Chen B, Li P, Qin J and Yu T (2016). Using a monotonic density ratio model to find the asymptotically optimal combination of multiple diagnostic tests. J. Am. Statist. Assoc 111 861–874.
- Chen X, Vexler A and Markatou M (2015). Empirical likelihood ratio confidence interval estimation of best linear combinations of biomarkers. Comput. Stat. Data Anal 82 186–198.
- Eguchi S and Copas J (2002). A class of logistic-type discriminant functions. Biometrika 89 1–22.
- Elliott G and Lieli RP (2013). Predicting binary outcomes. J. Econom 174 15–26.
- Florios K and Skouras S (2008). Exact computation of max weighted score estimators. J. Econom 146 86–91.
- Fong Y, Yin S and Huang Y (2016). Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve. Stat. Med 35 3792–3809.
- Hastie T, Tibshirani R and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- Kiefer J (1967). On Bahadur’s representation of sample quantiles. Ann. Math. Statist 38 1323–1342.
- Kim J and Pollard D (1990). Cube root asymptotics. Ann. Statist 18 191–219.
- Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
- Lin H, Zhou L, Peng H and Zhou X-H (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Can. J. Stat 39 324–343.
- Lipp T and Boyd S (2016). Variations and extension of the convex-concave procedure. Optim. Eng 17 263–287.
- Ma S and Huang J (2007). Combining multiple markers for classification using ROC. Biometrics 63 751–757.
- Manski CF (1975). Maximum score estimation of the stochastic utility model of choice. J. Econom 3 205–228.
- Manski CF (1985). Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator. J. Econom 27 313–333.
- McIntosh MW and Pepe MS (2002). Combining several screening tests: optimality of the risk score. Biometrics 58 657–664.
- Meisner A, Carone M, Pepe MS and Kerr KF (2021). Combining biomarkers by maximizing the true positive rate for a fixed false positive rate. Biom. J 63 1223–1240.
- Ou FS, Zeng D and Cai J (2016). Quantile regression models for current status data. J. Statist. Plann. Inference 178 112–127.
- Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford.
- Pepe MS, Cai T and Longton G (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62 221–229.
- Pepe MS and Thompson ML (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1 123–140.
- Rudin W (1976). Principles of Mathematical Analysis. McGraw Hill, New York.
- Sanda MG, Feng Z, Howard DH, Tomlins SA, Sokoll LJ, Chan DW, Regan MM, Groskopf J, Chipman J, Patil DH, Salami SS, Scherr DS, Kagan J, Srivastava S, Thompson IM Jr, Siddiqui J, Fan J, Joon AY, Bantis LE, Rubin MA, Chinnayian AM, Wei JT; and the EDRN-PCA3 Study Group, Bidair M, Kibel A, Lin DW, Lotan Y, Partin A and Taneja S (2017). Association between combined TMPRSS2:ERG and PCA3 RNA urinary testing and detection of aggressive prostate cancer. JAMA Oncol 3 1085–1093.
- Serfling RJ (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
- Vexler A, Liu A, Schisterman EF and Wu C (2006). Note on distribution-free estimation of maximum linear separation of two multivariate distributions. J. Nonparametr. Stat 18 145–158.
- Yan Q, Bantis LE, Stanford JL and Feng Z (2018). Combining multiple biomarkers linearly to maximize the partial area under the ROC curve. Stat. Med 37 627–642.
- Yuille AL and Rangarajan A (2003). The concave-convex procedure. Neural Comput 15 915–936.
