Abstract
In biomedical practices, multiple biomarkers are often combined using a prespecified classification rule with tree structure for diagnostic decisions. The classification structure and the cutoff point at each node of a tree are usually chosen on an ad hoc basis, depending on decision makers’ experience. There is a lack of analytical approaches that lead to optimal prediction performance and that guide the choice of optimal cutoff points in a prespecified classification tree. In this paper, we propose to search for and estimate the optimal decision rule through rank correlation maximization. The proposed method is flexible, theoretically sound, and computationally feasible when many biomarkers are available for classification or prediction. Using the proposed approach, for a prespecified tree-structured classification rule, we can guide the choice of optimal cutoff points at tree nodes and estimate the optimal prediction performance of the combined biomarkers.
Keywords: biomarkers, classification tree, optimal prediction, rank-based estimation, semi-parametric models
1 ∣. INTRODUCTION
Prognostic biomarkers, or biological markers, refer to measurements of a specific feature that depict a biological state and are used to diagnose biological or pathogenic processes. Tools that investigate a single prognostic biomarker’s performance, such as the receiver operating characteristic (ROC) curve and the area under the curve (AUC), have been well studied. However, in real applications, multiple markers are commonly collected, and it remains an open question how to optimally combine multiple markers for predicting disease outcome. Methods to optimally combine markers linearly into one “marker” have been studied under various model assumptions for predicting binary disease outcomes (Pepe and Thompson, 2000; McIntosh and Pepe, 2002; Pepe et al., 2006). However, a composite marker formed by linear combination lacks flexibility and may not be meaningful when markers from different domains are combined. In contrast, a nonlinear combination using a tree structure is flexible and interpretable. Moreover, tree-based models can handle correlation between markers in a more nonparametric and flexible manner than linear structures.
We conceptualize a classification tree model as composed of three elements: variables, the topological structure consisting of bifurcating internal nodes and leaf nodes with classes assigned, and cutoff values at bifurcating nodes. Any model with these elements is a tree model, but tree model estimation methods have primarily adopted the recursive partitioning search algorithm, which simultaneously chooses variables for tree construction and cutoff values for splitting that are optimal in some sense at each local step. The Classification and Regression Tree (CART) algorithm, the seminal work proposed by Breiman et al. (1984), adopts the recursive partitioning approach and faces criticism despite its influence and wide use. CART has long been recognized as a greedy algorithm that does not consider global optimality (Breiman et al., 1984), and has received criticism for overfitting (Mingers, 1987) and for being biased toward selecting variables with many possible cutoff values (Quinlan, 1986, 1996; Jensen and Cohen, 2000) and variables with missing values (Kim and Loh, 2001). Researchers have made various attempts to address the overfitting and biased variable selection problems, of which Murthy (1998) gave a detailed overview. Notably, Hothorn et al. (2006) proposed the unified framework of conditional inference trees (CIT) for recursive partitioning, which embeds tree-structured regression within well-established conditional inference testing procedures. However, the fundamental problem that mainstream tree estimation is based on step-wise greedy algorithms has yet to be addressed.
This fundamental problem is twofold: the recursive partitioning algorithm is greedy in growing a tree structure and greedy in selecting cutoff values for splitting at each internal node, and neither aspect has been adequately studied. We aim to address the latter part of the problem. We study tree models with known topological structure but with cutoff values left free for optimization, and we refer to these trees as fixed trees. Fixed trees of this type already have broad use in biomedical practice, and practitioners call for methodologies for obtaining optimal cutoff values. For example, prefixed trees are used in HIV screening tests (Brown et al., 2007) and in the clinical diagnosis of Alzheimer’s disease (McKhann et al., 1984; Dubois et al., 2007), where a diagnosis is typically given when both a primary criterion and one of several supportive criteria are met. In these applications, biomedical researchers usually know how a classifying tree structure should be constructed (Querbes et al., 2009; Mattsson et al., 2012). However, it is unclear how to find cutoff values that yield optimal diagnostic accuracy, leaving researchers to choose between using ad hoc cutoff values and retreating to a single biomarker.
Choosing globally optimal cutoff values is closely related to evaluating the optimal performance of a tree classifier. Although evaluating a single biomarker’s performance using ROC or AUC is straightforward, doing so for a tree involves additional difficulty because the true positive rate (TPR) and false positive rate (FPR) are no longer well-defined functions of a single threshold. The statistical literature addressing this important issue is essentially vacant. Baker (1995) and Baker (2000) proposed an algorithm that finds a positivity region by adding observations that maximize the gain in TPR per unit increase in FPR. However, it is at its core a greedy algorithm that does not guarantee global optimality. For continuous markers, Jin and Lu (2009) proposed a bivariate kernel estimator for the upper boundary curve of the ROC band based on two markers, but indicated the unstable performance of their estimator. In general, when multiple markers are used with a tree-based classifier, the quantile function of FPR is not one-to-one, and the area under the upper boundary curve does not possess the interpretation that AUC does in the single marker case. See Figure 1 for a plot of TPR against FPR generated by simulation for combined markers, where a subject is considered diseased if both marker values exceed some threshold. Marker X follows the bivariate standard normal distribution conditional on Y = 0, and the bivariate normal distribution with mean vector (1, 1)⊺, marginal variances 0.5, and correlation 0 conditional on Y = 1. When FPR is at 0.2, the corresponding TPR ranges from approximately 0.55 to 0.8. If a practitioner wants an FPR no greater than 0.2 but chooses cutoff values without being further informed, he or she could end up with a TPR anywhere between 0.55 and 0.8, risking a substantial loss of efficiency and resources.
Therefore, it is desirable to find or approximate “optimality”: the cutoff values that give the highest TPR at any given FPR. Wang and Li (2012) proposed a population-averaged ROC curve together with a weighted AUC as tools to evaluate the performance of multiple markers combined with tree classifiers. However, their work focused on the population-averaged performance of ROC and AUC, which differs from the aim of this work. The work proposed in this paper is the first to address this fundamental issue of the optimality of a fixed classification tree structure, establishing an applicable framework together with a method that is theoretically founded and computationally feasible for the search and evaluation of optimal prediction.
FIGURE 1.

ROC band generated for marker in prediction of binary outcome Y = 0, 1, where X follows bivariate standard normal conditional on Y = 0, and bivariate normal with mean vector (1, 1)⊺, marginal variances 0.5 and correlation 0 conditional on Y = 1. A subject is classified to have outcome Y = 1 if both marker values exceed some threshold
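The band in Figure 1 can be reproduced numerically. The following sketch (in Python, with the distributional parameters taken from the caption; the cutoff grid and sample size are our own choices) evaluates TPR and FPR over a grid of cutoff pairs and reports the spread of TPRs attainable near FPR = 0.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# X | Y=0 ~ bivariate standard normal; X | Y=1 ~ bivariate normal with
# mean (1, 1), marginal variances 0.5, correlation 0 (Figure 1 setting).
n = 5_000
x0 = rng.standard_normal((n, 2))
x1 = rng.standard_normal((n, 2)) * np.sqrt(0.5) + 1.0

# A subject is classified positive iff both markers exceed their cutoffs.
grid = np.linspace(-3, 3, 41)
c1, c2 = np.meshgrid(grid, grid)
cuts = np.column_stack([c1.ravel(), c2.ravel()])

fpr = np.mean((x0[:, None, :] > cuts[None, :, :]).all(axis=2), axis=0)
tpr = np.mean((x1[:, None, :] > cuts[None, :, :]).all(axis=2), axis=0)

# The TPRs attainable near FPR = 0.2 span a wide interval: the "band".
near = np.abs(fpr - 0.2) < 0.02
print(tpr[near].min(), tpr[near].max())
```

Plotting tpr against fpr over all cutoff pairs traces out the ROC band; the printed range illustrates the spread (roughly 0.55 to 0.8) described in the text.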
The rest of this paper is organized as follows. In Section 2, we propose the standard representation of a fixed tree classifier, which makes it possible to generalize existing, and propose new, ROC-related definitions in Section 3. We argue that the newly defined optimality ROC curve captures the optimal prediction performance of a fixed tree classifier. We propose an empirical estimator for it in Section 4, but also point out the theoretical and computational obstacles to using the empirical estimator. In Section 5, we propose a semi-parametric and rank-based estimation, or approximation, approach that exploits properties of the optimality ROC curve. Statistical inferential results are discussed in Section 6. We illustrate the proposed methods through simulation studies with correctly and incorrectly specified models in Section 7, and through analyses of the Prostate Cancer DREAM Challenge data (Synapse organization, 2005) in Section 8. Discussions are given in Section 9.
2 ∣. FIXED TREE CLASSIFIER AND ITS REPRESENTATION
Consider a fixed tree T, built with multiple markers and prefixed nodes, that allows various cutoff values at its nodes to classify a binary outcome Y = 0 or 1. Denote by X = (X1, … , XK)⊺ the markers used as splitting variables, by x = (x1, … , xK)⊺ ∈ SX a generic realization of the corresponding markers, and by c = (c1, … , cK)⊺ ∈ Sc the corresponding varying cutoff values. Part of the difficulty in developing methods for finding optimal cutoff values at prefixed nodes lies in the lack of algebraic representations of trees from which useful tools and properties can be developed. In this section, we propose a standard representation for a fixed tree.
A tree’s classifying behavior is determined by the marker space attributed to Y = 1 (the positive group) when cutoff values are given. Formally, define positivity region R(c;X, T) to be the set of x classified positive by tree T given cutoff values c of marker X. Two topologically different trees, T using markers X and T′ using X′, are considered identical in terms of classification if for every c there exists c′ such that R(c;X, T) = R(c′;X′, T′) and vice versa. Therefore it is sufficient to study the positivity region of a tree.
To find a standard representation, we first consider leaf nodes assigned Y = 1 and index these nodes by ℓ = 1, … , L. Denote by Rℓ(c;X, T) the marker region attributed to Y = 1 according to the ℓth leaf node, so that R(c;X, T) = ∪ℓ Rℓ(c;X, T). Then consider the nodes “traveled” from the root node to the ℓth leaf node, and suppose markers Xk for k ∈ κℓ are used for data splitting at these traveled nodes. We can assume without loss of generality that at each node we retain the marker region satisfying Xk > ck, as we can always reverse the sign of a marker. This implies that Rℓ(c;X, T) = ∩k∈κℓ{x ∈ SX : xk > ck}, which yields what we call the standard representation of the positivity region, R(c;X, T) = ∪ℓ ∩k∈κℓ{x ∈ SX : xk > ck}. For further simplification, we assume that the κℓ’s are disjoint and that ∪ℓ κℓ = {1, … , K}. If any marker is repeated, we can add additional copies of it to the initial marker vector X along with appropriate modifications to SX and Sc. In Web Appendix A, we present a standard representation example.
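As a concrete illustration, the standard representation can be evaluated directly: a subject is positive when, for at least one positive leaf ℓ, all markers indexed by κℓ exceed their cutoffs. The sketch below (Python; the toy tree and all numbers are hypothetical) implements this union-of-intersections rule:

```python
import numpy as np

def classify(x, cutoffs, leaves):
    """Standard-representation tree classifier.

    x       : (n, K) marker matrix (signs already flipped so every split
              reads "marker > cutoff").
    cutoffs : length-K cutoff vector c.
    leaves  : list of index sets kappa_l, one per leaf assigned Y = 1.
    Returns 1 where x falls in the positivity region R(c), else 0.
    """
    x, cutoffs = np.asarray(x), np.asarray(cutoffs)
    in_leaf = [np.all(x[:, k] > cutoffs[k], axis=1) for k in map(list, leaves)]
    return np.any(in_leaf, axis=0).astype(int)

# Toy tree: positive if (X1 > 0 and X2 > 1) or X3 > 2.
x = np.array([[1.0, 2.0, 0.0],    # first leaf fires
              [-1.0, 2.0, 0.0],   # neither leaf fires
              [-1.0, 0.0, 3.0]])  # second leaf fires
print(classify(x, [0.0, 1.0, 2.0], [[0, 1], [2]]))  # -> [1 0 1]
```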
3 ∣. ROC BAND AND OPTIMAL ROC CURVE
Having identified a standard representation of a tree classifier, we are ready to generalize the definitions of TPR and FPR. We consider a continuous marker vector for simplicity of discussion, but all results can be extended to include discrete ordinal markers with minor technical modifications. Let X0 = (X01, … , X0K)⊺ and X1 = (X11, … , X1K)⊺, respectively, represent the marker variable for the negative and positive groups. For positivity region R(c;X, T), we define TPR(c) = P{X1 ∈ R(c;X, T)} and FPR(c) = P{X0 ∈ R(c;X, T)}. We also generalize the inverses of TPR and FPR to set-valued functions TPR−1(t) = {c ∈ Sc : TPR(c) = t} and FPR−1(t) = {c ∈ Sc : FPR(c) = t} for t ∈ [0, 1], which further implies the generalization of the ROC curve to what we call the ROC band, the graph of the set-valued function ROCB(t) = TPR{FPR−1(t)} = {TPR(c) : FPR(c) = t}. It is referred to as a “band,” since for each FPR there may exist multiple TPRs, and overall the ROCB function spans a band over [0, 1] ⊗ [0, 1].
An ROC band captures the range of prediction performance for a fixed tree classifier, but it is of more practical interest to study the ROC band’s upper boundary, which captures the “best” performance possible using a given tree classifier. Due to this optimality implication, we refer to the upper boundary as the optimality ROC curve (OROC curve), formally defined as the graph of the function OROC(t) = sup{TPR(c) : FPR(c) = t} for t ∈ [0, 1]. Intuitively, the area under the optimality ROC curve (AUOROC) can then be used to evaluate the overall optimal prediction performance of a tree with varying TPR and FPR, and we define it to be AUOROC = ∫01 OROC(t) dt.
The ROC band and the OROC curve have interesting properties connected to the ROC curve. Similar to the ROC curve, we have ROCB(0) = 0, ROCB(1) = 1, OROC(0) = 0, and OROC(1) = 1, and both the ROC band and the optimality ROC curve degenerate to the ROC curve in the single marker scenario. We can also show that OROC(t) is continuous and monotonically increasing in t. Through the monotonicity of OROC(t), it is easy to establish the equivalent definition OROC(t) = sup{TPR(c) : FPR(c) ≤ t}.
4 ∣. EMPIRICAL ESTIMATION OF OPTIMAL ROC CURVE
Suppose the observed data consist of independent samples of cases (Y = 1) and controls (Y = 0). The observations include independent and identically distributed (i.i.d.) copies of X0 from the group of controls, {x0i = (x0i1, … , x0iK)⊺ : i = 1, … , n0}, and i.i.d. copies of X1 from the group of cases, {x1j = (x1j1, … , x1jK)⊺ : j = 1, … , n1}. For the positivity region R(c;X, T) of a tree classifier T, we can estimate TPR and FPR empirically as TPRemp(c) = n1−1 Σj=1n1 1I{x1j ∈ R(c;X, T)} and FPRemp(c) = n0−1 Σi=1n0 1I{x0i ∈ R(c;X, T)}. These two estimators can then be plugged into the corresponding definitions to obtain empirical estimators of OROC and AUOROC as OROCemp(t) = sup{TPRemp(c) : FPRemp(c) ≤ t} and AUOROCemp = ∫01 OROCemp(t) dt.
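A brute-force version of these empirical estimators can be sketched as follows (Python; for the single-leaf “all markers exceed their cutoffs” tree, with candidate cutoffs at observed marker values, a deliberately naive search whose cost grows combinatorially with the number of markers):

```python
import numpy as np
from itertools import product

def empirical_oroc(x0, x1, t_grid):
    """Brute-force OROC_emp(t) = sup{TPR_emp(c) : FPR_emp(c) <= t} for the
    'positive iff every marker exceeds its cutoff' tree, with candidate
    cutoff vectors built from the pooled observed marker values."""
    pooled = np.vstack([x0, x1])
    cands = np.array(list(product(*(np.unique(pooled[:, k])
                                    for k in range(pooled.shape[1])))))
    fpr = np.mean((x0[:, None, :] > cands[None]).all(axis=2), axis=0)
    tpr = np.mean((x1[:, None, :] > cands[None]).all(axis=2), axis=0)
    return np.array([tpr[fpr <= t].max() for t in t_grid])

rng = np.random.default_rng(1)
x0 = rng.standard_normal((30, 2))          # controls
x1 = rng.standard_normal((30, 2)) + 1.0    # cases, shifted upward
oroc = empirical_oroc(x0, x1, [0.1, 0.2, 0.5])
```

The optimistic bias and the exploding candidate set discussed next are both visible here: the sup is taken over noisy empirical TPRs, and the candidate grid already has (n0 + n1)^K entries for K markers.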
We prove the following result for OROCemp(t) and AUOROCemp in Web Appendix B.
Theorem 1. Estimator OROCemp(t) is uniformly strongly consistent for OROC(t), that is, supt ∣OROCemp(t) – OROC(t)∣ → 0 with probability 1. As a result, AUOROCemp is strongly consistent for AUOROC.
However, not only is statistical inference difficult to obtain, but the estimators also have severe positive biases, so that the estimated prediction performance of a tree would often be overly optimistic. An intuitive explanation comes from the asymptotic behavior of OROCemp(t) when cutoff values have discrete support. Suppose cr for r = 1, … , R are cutoff value vectors such that TPRemp(cr), r = 1, … , R, are all possible TPRs when FPRemp(cr) is equal to some constant, and TPR(cr) forms an increasing sequence. Each TPRemp(cr) is asymptotically normal, and Jensen’s inequality then implies that the expectation of OROCemp(t) is greater than that of TPRemp(cR), which converges to TPR(cR). For these biases to be small, we would likely need an unrealistically large sample size. Moreover, even if one could collect a sample of sufficient size, one faces computational complexity growing at the rate of n^K, implying the infeasibility of applying the empirical estimators to even moderately complicated tree classifiers.
5 ∣. SEMI-PARAMETRIC AND RANK-BASED ESTIMATION
Taking an optimization perspective, the problem we face is to find those c ∈ Sc such that

FPR(c) = t and TPR(c) = sup{TPR(c′) : c′ ∈ Sc, FPR(c′) = t}, | (1) |

for any given t ∈ [0, 1], and to estimate OROC(t) with the identified optimal cutoff values. Equation (1) implies that TPR(c) – OROC{FPR(c)} = 0, which inspires us to define the function m(c) = TPR(c) – OROC{FPR(c)}, a continuous mapping from Sc to the real line whose solution set {c ∈ Sc : m(c) = 0} forms a hypersurface. Our goal then reduces to identifying this hypersurface, which we refer to as the optimality hypersurface due to its connection to the optimality ROC curve.
In general, the optimality hypersurface is unlikely to have a closed form, especially when we avoid parametric distributional assumptions on markers. Besides being difficult to derive, a general optimality hypersurface might not have a workable structure for estimation. Therefore, we consider a minimally restricted class of marker distributions and tree classifiers such that the optimality hypersurface has the desired properties. Specifically, we impose the following two assumptions.
Assumption 1. There exists an index i*, which we take to be 1 without loss of generality, such that TPR(c) is strictly monotonically decreasing in c1, and there exist unique functions g1, g0 : ℝK−1 × (0, 1) → ℝ such that c1 = g1(c2, … , cK; t1) is equivalent to TPR(c) = t1 and c1 = g0(c2, … , cK; t0) is equivalent to FPR(c) = t0.
Assumption 2. For the vector space of parametric curves

{c ∈ Sc : c1 = hk(ck; θ), k = 2, … , K}, θ ∈ Θ, | (2) |

where the hk(·; θ) : ℝ → ℝ are continuous and monotonically increasing functions indexed by a p-dimensional parameter θ ∈ Θ ⊆ ℝp, there exists θ0 ∈ Θ such that for each a, there exist t0, t1 ∈ (0, 1) and ϵa > 0 such that for every θ ∈ Θ with ∥θ − θ0∥ < ϵa and every c with c1 = hk(ck; θ), k = 2, … , K, writing ca,−1 = {h2(a; θ0), … , hK(a; θ0)}⊺, we have g1(c−1,2, … , c−1,K; t1) ≤ g0(c−1,2, … , c−1,K; t0), where t1 = TPR(ca) and t0 = FPR(ca). Here, the equality holds only at c−1 = ca,−1. □
Assumption 1 states that the equipotential hypersurfaces of TPR and FPR can be defined using explicit functions. By the implicit function theorem, this assumption is satisfied when ∂TPR(c)/∂c1 and ∂FPR(c)/∂c1 exist and are nonzero for all c (Hamilton, 1982), though other sufficient conditions exist. Assumption 2 then requires that the equipotential hypersurfaces have unique tangent points, represented using the parametrized functions hk(ck; θ0), in an open neighborhood of θ0 when a TPR is given. Under these two assumptions, we have the following result.
Theorem 2. Under Assumptions 1 and 2, the curve defined by c1 = hk(ck; θ0), k = 2, … , K, is a locally unique curve from the class defined in (2) that lies in the optimality hypersurface. □
We prove Theorem 2 in Web Appendix C. The role of the functions {hk(·; θ) : θ ∈ Θ, k = 2, … , K} can be conceptualized as the structural part of the model, under which we obtain the following insights. Writing ∨ for the maximum over a set, ∧ for the minimum, and taking h1 to be the identity function, for c ∈ Sc satisfying (2) and FPR(c) = t, the equivalence {xk > ck, k ∈ κℓ} ⟺ {∧k∈κℓ hk(xk; θ) > c1} gives TPR(c) = P{∨ℓ ∧k∈κℓ hk(X1k; θ) > c1} and FPR(c) = P{∨ℓ ∧k∈κℓ hk(X0k; θ) > c1} = t. Therefore, when the class defined in (2) indeed contains a curve indexed by θ0 that lies in the optimality hypersurface, the optimality ROC curve corresponding to positivity region R(c;X, T) is exactly the ROC curve of the random variable H(X; θ0) = ∨ℓ ∧k∈κℓ hk(Xk; θ0).
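To make the construction concrete, a composite variable of the max-over-leaves, min-over-markers form can be sketched as follows (Python; the particular transformations and tree layout are hypothetical stand-ins for the hk(·; θ)):

```python
import numpy as np

def H(x, h_funcs, leaves):
    """Composite marker: max over positive leaves of the min over that
    leaf's transformed markers (h_funcs[0] plays the identity h_1)."""
    transformed = np.column_stack([h(x[:, k]) for k, h in enumerate(h_funcs)])
    leaf_scores = [transformed[:, list(kappa)].min(axis=1) for kappa in leaves]
    return np.max(leaf_scores, axis=0)

# Single-leaf "and" tree with illustrative linear transforms.
h_funcs = [lambda v: v, lambda v: 2.0 * v + 0.5, lambda v: v - 1.0]
x = np.array([[0.5, 1.0, 3.0]])
print(H(x, h_funcs, [[0, 1, 2]]))  # min(0.5, 2.5, 2.0) -> [0.5]
```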
Derivations above can also be used to show that θ indexes a class of random variables H(X; θ) whose ROC curves fall in the ROC band. The definition of OROC(t) further implies that θ0 maximizes the AUC of H(X; θ), which is equal to the concordance probability of correctly ranking two observations (Hanley and McNeil, 1982) on the transformed marker scale. That is, we have θ0 ∈ argmaxθ∈Θ S(θ), where S(θ) = P{H(X1; θ) > H(X0; θ)}. These properties inspire us to consider the estimator θ̂ ∈ argmaxθ∈Θ Sn(θ), where Sn(θ) = (n0n1)−1 Σi=1n0 Σj=1n1 1I{H(x1j; θ) > H(x0i; θ)} is the empirical counterpart of the concordance probability S(θ) based on the observed data {x0i : i = 1, … , n0} and {x1j : j = 1, … , n1}. Asymptotic properties of θ̂ are discussed in Section 6.
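The concordance-maximization step can be sketched for a toy one-parameter family (Python; the family min(x1, θ·x2), the data, and the grid search are purely illustrative, the grid being a crude stand-in for a proper optimizer since the objective is piecewise constant in θ):

```python
import numpy as np

def concordance(theta, x0, x1):
    """Empirical S_n(theta): fraction of case/control pairs ranked
    correctly by the composite marker H(x; theta) = min(x1, theta*x2)."""
    h0 = np.minimum(x0[:, 0], theta * x0[:, 1])
    h1 = np.minimum(x1[:, 0], theta * x1[:, 1])
    return np.mean(h1[:, None] > h0[None, :])

rng = np.random.default_rng(2)
x0 = rng.standard_normal((100, 2))          # controls
x1 = rng.standard_normal((100, 2)) + 1.0    # cases

grid = np.linspace(0.2, 5.0, 100)
theta_hat = grid[np.argmax([concordance(t, x0, x1) for t in grid])]
```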
There is a degree of flexibility in choosing an appropriate function class in applications. We can take hk(·; θ) to be continuous and monotonically increasing piece-wise linear functions with knots at biomarker percentiles, or polynomial functions, to give a few examples. We focus on piece-wise linear function classes with knots at percentiles for the rest of this paper, because they are robust and often flexible enough to approximate continuous functions. The number of knots can be chosen by cross-validation and thus adapted to specific data structures.
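Such a monotone piece-wise linear transform can be sketched as below (Python; the exponential reparametrization that keeps all segment slopes positive is our own device for illustration, not taken from the text):

```python
import numpy as np

def monotone_pwl(x, knots, log_slopes, intercept=0.0):
    """Continuous, monotonically increasing piece-wise linear function.
    Slopes are exp(log_slopes), hence positive, which enforces
    monotonicity; knots would sit at biomarker percentiles."""
    slopes = np.exp(np.asarray(log_slopes, dtype=float))
    y = intercept + slopes[0] * (x - knots[0])
    for j in range(1, len(knots)):
        # Each hinge changes the slope from slopes[j-1] to slopes[j].
        y = y + (slopes[j] - slopes[j - 1]) * np.maximum(x - knots[j], 0.0)
    return y

x = np.linspace(-2, 2, 5)
knots = np.percentile(x, [0, 50])   # knots at marker percentiles
y = monotone_pwl(x, knots, [0.0, 1.0])
```

Monotonicity holds for any parameter values because every segment slope is exp(·) > 0, so the class can be searched over freely during estimation.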
When the class defined in (2) does contain a curve from the optimality hypersurface, we identify the best classification rule given a tree structure. Further, if the tree structure yields the globally optimal rule when cutoff values are chosen appropriately, then H(X; θ0) captures the overall optimal decision rule for classification, and we have P(Y = 1 ∣ X) = g{H(X; θ0)} for a monotonically increasing function g, or equivalently Y = 1I{H(X; θ0) + Ui > 0}, where the Ui’s are i.i.d. errors. This is the binary response model (Horowitz, 1992) and also a special case of the generalized regression model proposed by Han (1987). Note that we do not require θ0 to be unique globally, but allow multiple locally unique parametric curves to exist in the optimality hypersurface. Additional utility functions can be used to choose among the multiple optimal cutoff rules derived.
When the class defined in (2) does not contain any curve from the optimality hypersurface, we have a misspecified model, but θ0 still indexes a random variable of the form H(X; θ) whose AUC is closest to AUOROC. With appropriate tuning, we expect the difference to be small, so that this random variable has an ROC curve close to the optimality ROC curve and can be used to approximate the best prediction performance attainable given the tree structure.
To end this section, we consider the following example as an illustration of the proposed assumptions and the properties that follow. Details are given in Web Appendix D.
Example 1. We consider a tree that classifies a subject as positive (D = 1) if each element of the marker vector M = (M1, … , MK)⊺ exceeds some threshold. Suppose that there exist monotonically increasing functions mk for k = 1, … , K such that {m1(M1), … , mK(MK)}⊺ ~ N(μ1, Σ) conditional on D = 1 and {m1(M1), … , mK(MK)}⊺ ~ N(μ0, Σ) conditional on D = 0, for vectors μ1 = (μ11, … , μ1K)⊺ and μ0 = (μ01, … , μ0K)⊺ and shared variance-covariance matrix Σ, such that μ1 > μ0 (element-wise) and Σ−1/2(μ1 – μ0) = 1 = (1, … , 1)⊺. Also, we assume that the sum of the first-row elements of Σ−1/2 is positive and that the mk’s are parametrized by a vector θ ∈ Θ. Under these conditions, we show that Assumptions 1 and 2 are satisfied, which further implies that a curve of the form c1 = hk(ck; θ0), k = 2, … , K, defines an optimality curve of the given tree, with the functions hk determined by the mk’s and thus parametrized by θ0. □
6 ∣. ASYMPTOTIC PROPERTIES AND STATISTICAL INFERENCE
We study and present asymptotic properties of θ̂ in this section. For a generic vector (x⊺, y)⊺ ∈ SX ⊗ {0, 1} and θ ∈ Θ, we define τ(x, y; θ) = p1y P{H(x; θ) > H(X0; θ)} + p0(1 − y) P{H(X1; θ) > H(x; θ)}, where p0 = n0/(n0 + n1) and p1 = n1/(n0 + n1). For a smooth function f(θ) with proper differentiability, denote by ∇rf(θ) its rth-order derivative with respect to θ for r = 1, 2, and by ∣∇rf(θ)∣ the sum taken over all elements of the resulting vector or matrix. Weak convergence of θ̂ is established in the following theorem, with proof given in Web Appendix E.
Theorem 3. Under regularity conditions given in Web Appendix E, we have (n0 + n1)1/2(θ̂ − θ0) → N(0, A−1ΔA−1) in distribution, where A = ∇2E{τ(X, Y; θ0)} and Δ = E{∇1τ(X, Y; θ0) ∇1τ(X, Y; θ0)⊺}. □
After obtaining an estimate of θ0, we use the bootstrap (Efron and Tibshirani, 1994) to approximate the asymptotic variance of θ̂. We can also estimate AUOROC using a testing sample that is independent of the training sample and is formed by i.i.d. copies of control observations {x̃0i : i = 1, … , ñ0} and i.i.d. copies of case observations {x̃1j : j = 1, … , ñ1}, as (ñ0ñ1)−1 Σi=1ñ0 Σj=1ñ1 1I{H(x̃1j; θ̂) > H(x̃0i; θ̂)}. Confidence intervals for the prediction performance of the obtained classification rule on the testing sample can be calculated through routine procedures treating H(X; θ̂) as a single composite marker, using the stratified bootstrap (Robin et al., 2011) or DeLong’s method (DeLong et al., 1988). Finally, we can derive optimal cutoff values as ck = hk−1(c1; θ̂) for k = 2, … , K, with c1 further determined by requirements on FPR, TPR, or some other measure of loss.
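A stratified bootstrap for the standard error of the estimate can be sketched generically (Python; `estimate_fn` stands in for the rank-based estimator, and the toy mean-difference estimator in the usage is a hypothetical placeholder):

```python
import numpy as np

def bootstrap_se(estimate_fn, x0, x1, n_boot=200, seed=0):
    """Stratified bootstrap standard error: resample cases and controls
    separately, re-estimate on each replicate, take the SD."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_boot):
        b0 = x0[rng.integers(0, len(x0), len(x0))]
        b1 = x1[rng.integers(0, len(x1), len(x1))]
        reps.append(estimate_fn(b0, b1))
    return np.std(reps, ddof=1)

rng = np.random.default_rng(3)
x0 = rng.standard_normal((50, 2))
x1 = rng.standard_normal((50, 2)) + 1.0
# Toy scalar estimator standing in for the concordance maximizer.
se = bootstrap_se(lambda a, b: b[:, 0].mean() - a[:, 0].mean(), x0, x1)
```

Resampling cases and controls separately preserves the case/control ratio, matching the two-sample structure of the estimator.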
7 ∣. SIMULATION STUDIES
7.1 ∣. Simulation with Correctly Specified Models
We conduct simulation studies to evaluate the finite sample performance of the proposed estimator θ̂ when the model is correctly specified. Specifically, we obtain i.i.d. copies of case and control observations x0i for i = 1, … , n0 = n/2 and x1j for j = 1, … , n1 = n/2 from a distribution of (X, Y) satisfying Y = 1I{H(X; θ0) + U > 0}. Here X is a three-dimensional marker vector following a normal distribution with mean (0, 0, 0)⊺, marginal variances 10, and covariances 10ρ. We take H(X; θ0) = min(θ01X1 + θ02, θ03X2 + θ04, X3), where θ0 = (θ01, θ02, θ03, θ04)⊺ = (1, −1, 2, 0.5)⊺, and U ~ N(2, δ2). Various scenarios are considered, varying δ = 1, 3, ρ = 0.2, 0.5, 0.8, and n = 100, 200, 500 (that is, n0 = n1 = 50, 100, 250). We report summary statistics for the estimator θ̂ over 1000 replications. All variance estimates are calculated through bootstrap over 10,000 samples. Simulation results concerning θ̂ for ρ = 0.2 are summarized in Table 1, and the remaining results for ρ = 0.5, 0.8 are summarized in Tables 1 and 2 in Web Appendix F. We also report summary statistics for the difference in estimated AUOROC and the maximum absolute difference in estimated OROC(t) from the truths. We report AUOROC and OROC(t) on the training data as well as on an independent testing set of equal sample size. In addition, we apply CART (as implemented in the R package “rpart,” with pruning parameter chosen according to five-fold cross-validation) and CIT (as implemented in the R package “partykit”) to the simulated data and summarize the difference in AUC and the maximum absolute difference in ROC(t) of class probability compared to the truths. These results are summarized in Table 2. OROC(t) and ROC(t) are estimated using piece-wise linear functions.
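The data-generating mechanism above can be transcribed directly (Python; the oversampling factor and seed are our own choices, made so that enough cases and controls are available to subsample):

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho, delta = 200, 0.2, 1.0

# Marginal variances 10, covariances 10*rho, as in the text.
cov = 10 * (rho * np.ones((3, 3)) + (1 - rho) * np.eye(3))
x = rng.multivariate_normal(np.zeros(3), cov, size=5 * n)

theta0 = (1.0, -1.0, 2.0, 0.5)
h = np.minimum.reduce([theta0[0] * x[:, 0] + theta0[1],
                       theta0[2] * x[:, 1] + theta0[3],
                       x[:, 2]])
y = (h + rng.normal(2.0, delta, size=len(x)) > 0).astype(int)

# Keep n/2 controls and n/2 cases, matching the simulation design.
x0 = x[y == 0][: n // 2]
x1 = x[y == 1][: n // 2]
```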
TABLE 1.
Simulation summary statistics for θ̂ when ρ = 0.2
| δ = 1 |||||
| n = 100 | Bias | 0.066 | 0.042 | 0.149 | 0.130 |
| ESE | 0.635 | 0.871 | 0.820 | 1.014 | |
| MSE | 0.571 | 0.717 | 0.812 | 0.892 | |
| CP | 0.875 | 0.892 | 0.937 | 0.901 | |
| n = 200 | Bias | 0.054 | 0.052 | 0.144 | 0.162 |
| ESE | 0.461 | 0.696 | 0.759 | 1.069 | |
| MSE | 0.477 | 0.663 | 0.700 | 0.939 | |
| CP | 0.933 | 0.939 | 0.914 | 0.896 | |
| n = 500 | Bias | 0.010 | 0.020 | 0.556 | 0.061 |
| ESE | 0.273 | 0.423 | 0.551 | 0.814 | |
| MSE | 0.298 | 0.463 | 0.563 | 0.840 | |
| CP | 0.957 | 0.951 | 0.939 | 0.948 | |
| δ = 3 |||||
| n = 100 | Bias | 0.106 | −0.122 | 0.420 | 0.120 |
| ESE | 0.974 | 1.186 | 1.833 | 1.328 | |
| MSE | 0.879 | 1.039 | 1.520 | 1.156 | |
| CP | 0.884 | 0.893 | 0.944 | 0.902 | |
| n = 200 | Bias | 0.099 | −0.044 | 0.296 | 0.135 |
| ESE | 0.999 | 1.057 | 1.398 | 1.346 | |
| MSE | 0.721 | 1.002 | 1.111 | 1.166 | |
| CP | 0.957 | 0.937 | 0.949 | 0.916 | |
| n = 500 | Bias | 0.022 | 0.007 | 0.101 | 0.100 |
| ESE | 0.331 | 0.720 | 0.642 | 1.125 | |
| MSE | 0.397 | 0.761 | 0.643 | 1.067 | |
| CP | 0.970 | 0.955 | 0.931 | 0.937 | |
Abbreviations: Bias, empirical bias; CP, empirical coverage probability of 95% confidence intervals; ESE, empirical standard error; MSE, empirical mean of standard error estimates.
TABLE 2.
Summary statistics for AUOROC (or AUC) and OROC(t) (or ROC(t)), for simulations with correctly specified models
| ρ | δ | n | Data | Difference in AUOROC or AUC ||| Maximum absolute difference in OROC(t) or ROC(t) |||
|---|---|---|---|---|---|---|---|---|---|
| | | | | Our method | CART | CIT | Our method | CART | CIT |
| 0.2 | 1 | 100 | train | 0.007 (0.012, 0.014) | −0.045 (0.034, 0.056) | −0.031 (0.031, 0.044) | 0.232 (0.074) | 0.404 (0.180) | 0.405 (0.178) |
| test | −0.006 (0.017, 0.018) | −0.108 (0.061, 0.124) | −0.089 (0.047, 0.101) | 0.217 (0.077) | 0.560 (0.147) | 0.548 (0.152) | |||
| 200 | train | 0.004 (0.009, 0.010) | −0.036 (0.021, 0.042) | −0.022 (0.018, 0.029) | 0.172 (0.059) | 0.445 (0.158) | 0.391 (0.173) | ||
| test | −0.003 (0.012, 0.012) | −0.075 (0.030, 0.081) | −0.062 (0.026, 0.068) | 0.175 (0.069) | 0.568 (0.114) | 0.544 (0.129) | |||
| 500 | train | 0.002 (0.006, 0.006) | −0.042 (0.013, 0.044) | −0.011 (0.010, 0.016) | 0.117 (0.047) | 0.540 (0.081) | 0.357 (0.163) | ||
| test | −0.001 (0.006, 0.006) | −0.064 (0.017, 0.066) | −0.036 (0.015, 0.039) | 0.119 (0.053) | 0.582 (0.062) | 0.511 (0.115) | |||
| 3 | 100 | train | 0.019 (0.036, 0.040) | −0.054 (0.128, 0.139) | −0.046 (0.063, 0.078) | 0.251 (0.088) | 0.250 (0.117) | 0.242 (0.073) | |
| test | −0.015 (0.041, 0.043) | −0.164 (0.134, 0.212) | −0.135 (0.062, 0.149) | 0.234 (0.072) | 0.335 (0.134) | 0.308 (0.088) | |||
| 200 | train | 0.009 (0.026, 0.027) | −0.037 (0.079, 0.087) | −0.026 (0.037, 0.045) | 0.177 (0.060) | 0.210 (0.077) | 0.197 (0.055) | ||
| test | −0.009 (0.029, 0.030) | −0.116 (0.097, 0.151) | −0.091 (0.040, 0.100) | 0.175 (0.055) | 0.283 (0.109) | 0.257 (0.075) | |||
| 500 | train | 0.004 (0.017, 0.017) | −0.038 (0.031, 0.049) | −0.011 (0.022, 0.025) | 0.114 (0.037) | 0.205 (0.049) | 0.135 (0.039) | ||
| test | −0.003 (0.018, 0.018) | −0.078 (0.037, 0.087) | −0.051 (0.024, 0.057) | 0.113 (0.036) | 0.249 (0.063) | 0.183 (0.054) | |||
| 0.5 | 1 | 100 | train | 0.007 (0.011, 0.013) | −0.050 (0.033, 0.060) | −0.029 (0.026, 0.039) | 0.234 (0.073) | 0.472 (0.183) | 0.440 (0.192) |
| test | −0.006 (0.016, 0.017) | −0.112 (0.052, 0.124) | −0.086 (0.043, 0.096) | 0.216 (0.077) | 0.601 (0.140) | 0.579 (0.149) | |||
| 200 | train | 0.004 (0.008, 0.009) | −0.040 (0.024, 0.047) | −0.017 (0.015, 0.023) | 0.173 (0.062) | 0.500 (0.149) | 0.404 (0.184) | ||
| test | −0.003 (0.010, 0.010) | −0.081 (0.034, 0.088) | −0.057 (0.026, 0.063) | 0.170 (0.068) | 0.594 (0.115) | 0.559 (0.133) | |||
| 500 | train | 0.002 (0.005, 0.006) | −0.040 (0.016, 0.043) | −0.008 (0.008, 0.011) | 0.117 (0.046) | 0.557 (0.083) | 0.371 (0.169) | ||
| test | −0.001 (0.006, 0.006) | −0.062 (0.021, 0.065) | −0.032 (0.013, 0.034) | 0.119 (0.052) | 0.607 (0.064) | 0.533 (0.105) | |||
| 3 | 100 | train | 0.018 (0.034, 0.038) | −0.055 (0.106, 0.119) | −0.038 (0.052, 0.064) | 0.267 (0.094) | 0.270 (0.101) | 0.256 (0.070) | |
| test | −0.014 (0.039, 0.042) | −0.165 (0.141, 0.217) | −0.120 (0.055, 0.132) | 0.242 (0.074) | 0.357 (0.138) | 0.311 (0.083) | |||
| 200 | train | 0.008 (0.024, 0.026) | −0.042 (0.073, 0.084) | −0.022 (0.033, 0.039) | 0.187 (0.063) | 0.240 (0.079) | 0.209 (0.057) | ||
| test | −0.009 (0.028, 0.029) | −0.118 (0.093, 0.150) | −0.085 (0.036, 0.092) | 0.176 (0.054) | 0.315 (0.105) | 0.270 (0.071) | |||
| 500 | train | 0.004 (0.016, 0.017) | −0.043 (0.029, 0.052) | −0.007 (0.018, 0.020) | 0.121 (0.039) | 0.246 (0.057) | 0.146 (0.040) | ||
| test | −0.003 (0.017, 0.017) | −0.083 (0.039, 0.092) | −0.046 (0.021, 0.051) | 0.120 (0.036) | 0.291 (0.067) | 0.197 (0.058) | |||
| 0.8 | 1 | 100 | train | 0.005 (0.010, 0.011) | −0.055 (0.032, 0.064) | −0.025 (0.020, 0.032) | 0.235 (0.076) | 0.559 (0.149) | 0.466 (0.196) |
| test | −0.006 (0.014, 0.015) | −0.108 (0.051, 0.119) | −0.073 (0.037, 0.082) | 0.213 (0.074) | 0.632 (0.113) | 0.576 (0.156) | |||
| 200 | train | 0.003 (0.007, 0.007) | −0.046 (0.024, 0.052) | −0.014 (0.012, 0.019) | 0.176 (0.062) | 0.564 (0.118) | 0.414 (0.192) | ||
| test | −0.003 (0.009, 0.010) | −0.082 (0.032, 0.088) | −0.048 (0.022, 0.053) | 0.172 (0.065) | 0.634 (0.084) | 0.558 (0.140) | |||
| 500 | train | 0.001 (0.005, 0.005) | −0.044 (0.019, 0.048) | −0.005 (0.006, 0.008) | 0.119 (0.049) | 0.584 (0.093) | 0.352 (0.181) | ||
| test | −0.001 (0.005, 0.005) | −0.065 (0.022, 0.068) | −0.026 (0.011, 0.028) | 0.121 (0.050) | 0.632 (0.065) | 0.521 (0.132) | |||
| 3 | 100 | train | 0.014 (0.032, 0.035) | −0.054 (0.088, 0.104) | −0.030 (0.044, 0.053) | 0.265 (0.097) | 0.296 (0.089) | 0.273 (0.068) | |
| test | −0.011 (0.037, 0.038) | −0.139 (0.115, 0.180) | −0.097 (0.050, 0.109) | 0.249 (0.076) | 0.359 (0.115) | 0.313 (0.073) | |||
| 200 | train | 0.007 (0.023, 0.024) | −0.051 (0.059, 0.078) | −0.016 (0.028, 0.032) | 0.192 (0.065) | 0.283 (0.071) | 0.216 (0.054) | ||
| test | −0.007 (0.025, 0.026) | −0.112 (0.075, 0.135) | −0.064 (0.031, 0.072) | 0.187 (0.060) | 0.341 (0.091) | 0.257 (0.064) | |||
| 500 | train | 0.003 (0.016, 0.016) | −0.051 (0.034, 0.061) | −0.008 (0.018, 0.020) | 0.122 (0.040) | 0.290 (0.058) | 0.162 (0.038) | ||
| test | −0.004 (0.016, 0.016) | −0.086 (0.037, 0.094) | −0.039 (0.019, 0.043) | 0.124 (0.039) | 0.329 (0.064) | 0.200 (0.052) | |||
Results are summarized in the form of “Bias (Empirical standard error, Root mean squared error)” for the difference in AUOROC or AUC from the truth, and in the form of “Bias (Empirical standard error)” for the maximum absolute difference in OROC(t) or ROC(t) from zero.
We observe from Table 1 that the estimators are only slightly biased; the bootstrapped standard error estimates are close to the empirical standard errors, with the difference becoming smaller as sample size increases; and the empirical coverage probability of the 95% confidence intervals converges to 0.95, being generally close to 0.95 once the sample size reaches 200. The performance of the estimators does worsen as the correlation between markers increases, which is due to decreased information and can be remedied with larger sample sizes. In terms of predictive performance, we observe from Table 2 that the proposed rank-based tree estimation consistently outperforms CART and CIT in terms of both AUOROC/AUC and OROC(t)/ROC(t). This is expected, as our proposed method works with a predetermined tree structure, whereas CART and CIT work with no prior knowledge of the tree structure. We also observe that our proposed method suffers far less from overfitting than CART and CIT, both of which are based on step-wise greedy search algorithms prone to overfitting.
We also perform additional simulations with different function classes and tree structures and observe consistent estimator performance and predictability across settings. Additional simulation results for correctly specified models are summarized in Web Appendix G.
7.2 ∣. Simulation with Misspecified Models
We conduct another set of simulation studies to evaluate the finite-sample bias in AUOROC and OROC(t) and the impact of tuning parameters when the model is misspecified. We simulate from the same setting as in Section 7.1 except that H(X; θ0) = min{h1(X1; θ0), h2(X2; θ0), X3}, where h1(x) = I(x ≤ 0) · (−x²/2 + x/20 − 1/2) + I(x > 0) · (x²/5 + x/20 − 1/2) and h2(x) = I(x ≤ 0) · (−x²/10 + x/5 − 2) + I(x > 0) · (x² + x/3 − 2). We simulate for sample size n = 100 while varying ρ = 0.2, 0.5, 0.8 and δ = 1, 3. We perform rank-based estimation using monotonically increasing piece-wise linear functions and vary the number of knots, taking K knots uniformly spaced in the marker range for K = 0, 1, 2. After obtaining the tree estimation, we summarize, in Table 3, the difference in AUOROC and the maximum absolute difference in OROC(t) compared with the truth, on the training set and on a testing set of equal sample size.
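For concreteness, the misspecified data-generating functions above can be transcribed directly into code. The snippet below is an illustrative Python transcription of the formulas (not code from the paper's implementation); note that both branches of h1 and h2 meet at x = 0, so each function is continuous and strictly increasing, consistent with the monotone function class.

```python
def h1(x):
    """h1(x) = I(x <= 0)(-x^2/2 + x/20 - 1/2) + I(x > 0)(x^2/5 + x/20 - 1/2)."""
    return -x**2 / 2 + x / 20 - 0.5 if x <= 0 else x**2 / 5 + x / 20 - 0.5

def h2(x):
    """h2(x) = I(x <= 0)(-x^2/10 + x/5 - 2) + I(x > 0)(x^2 + x/3 - 2)."""
    return -x**2 / 10 + x / 5 - 2.0 if x <= 0 else x**2 + x / 3 - 2.0

def H(x1, x2, x3):
    """Misspecified combination H(X) = min{h1(X1), h2(X2), X3}."""
    return min(h1(x1), h2(x2), x3)
```

Because h1 and h2 are monotone, the min-combination above corresponds to an "and"-type tree rule, which is why a monotone piece-wise linear function class can still approximate it well.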
TABLE 3.
Summary statistics for AUOROC and OROC(t) for the tree that classifies subject as positive if X1 > x1 and X2 > x2 and X3 > x3 when function class is misspecified
| ρ | δ | K | Difference in AUOROC (Train) | Difference in AUOROC (Test) | Max. absolute difference in OROC(t) (Train) | Max. absolute difference in OROC(t) (Test) |
|---|---|---|---|---|---|---|
| 0.2 | 1 | 0 | 0.010 (0.018, 0.020) | −0.012 (0.025, 0.028) | 0.220 (0.079) | 0.217 (0.085) |
| 0.2 | 1 | 1 | 0.010 (0.017, 0.020) | −0.011 (0.024, 0.026) | 0.223 (0.078) | 0.217 (0.080) |
| 0.2 | 1 | 2 | 0.010 (0.018, 0.020) | −0.013 (0.025, 0.028) | 0.224 (0.078) | 0.226 (0.085) |
| 0.2 | 3 | 0 | 0.022 (0.038, 0.044) | −0.020 (0.047, 0.051) | 0.215 (0.074) | 0.211 (0.069) |
| 0.2 | 3 | 1 | 0.024 (0.038, 0.045) | −0.019 (0.046, 0.049) | 0.217 (0.076) | 0.210 (0.068) |
| 0.2 | 3 | 2 | 0.024 (0.038, 0.045) | −0.021 (0.046, 0.051) | 0.216 (0.075) | 0.208 (0.067) |
| 0.5 | 1 | 0 | 0.009 (0.016, 0.019) | −0.011 (0.023, 0.025) | 0.223 (0.076) | 0.213 (0.078) |
| 0.5 | 1 | 1 | 0.010 (0.016, 0.019) | −0.010 (0.022, 0.024) | 0.227 (0.077) | 0.214 (0.080) |
| 0.5 | 1 | 2 | 0.009 (0.016, 0.019) | −0.011 (0.023, 0.026) | 0.224 (0.076) | 0.217 (0.078) |
| 0.5 | 3 | 0 | 0.019 (0.036, 0.041) | −0.021 (0.044, 0.049) | 0.215 (0.073) | 0.212 (0.067) |
| 0.5 | 3 | 1 | 0.021 (0.036, 0.042) | −0.019 (0.043, 0.047) | 0.216 (0.074) | 0.213 (0.068) |
| 0.5 | 3 | 2 | 0.021 (0.036, 0.042) | −0.020 (0.043, 0.048) | 0.218 (0.073) | 0.214 (0.067) |
| 0.8 | 1 | 0 | 0.007 (0.016, 0.018) | −0.010 (0.020, 0.023) | 0.224 (0.076) | 0.208 (0.077) |
| 0.8 | 1 | 1 | 0.007 (0.016, 0.017) | −0.010 (0.020, 0.022) | 0.224 (0.077) | 0.207 (0.076) |
| 0.8 | 1 | 2 | 0.007 (0.016, 0.017) | −0.010 (0.020, 0.023) | 0.224 (0.075) | 0.205 (0.072) |
| 0.8 | 3 | 0 | 0.017 (0.035, 0.039) | −0.016 (0.040, 0.043) | 0.213 (0.074) | 0.207 (0.065) |
| 0.8 | 3 | 1 | 0.018 (0.035, 0.039) | −0.015 (0.040, 0.043) | 0.215 (0.075) | 0.205 (0.065) |
| 0.8 | 3 | 2 | 0.018 (0.035, 0.039) | −0.016 (0.041, 0.044) | 0.213 (0.074) | 0.205 (0.064) |
Note
Results are summarized in the form of “Bias (empirical standard error, root mean squared error)” for the difference in AUOROC from the truth, and in the form of “Bias (empirical standard error)” for the maximum absolute difference of OROC(t) from zero.
We observe from Table 3 that increasing the number of knots of the piece-wise linear functions has little effect on predictive performance, and that the misspecified function class still yields a close approximation of the optimal predictive performance. However, the simulation scenarios are limited, and cross-validation is recommended when feasible. Compared with the results in Web Appendix G, where the correct function class was used, the piece-wise linear function class actually yields comparable results. Again, we observe little overfitting. We also perform additional simulations with different tree structures and observe consistent performance. Additional simulation results for misspecified models are summarized in Web Appendix H.
8 ∣. DATA ANALYSIS: DREAM PROSTATE CANCER STUDY
To illustrate the use of our proposed method, we apply it to the Prostate Cancer DREAM Challenge data (Synapse Organization, 2005). The data set is based on 1600 comparator-arm patients with first-line metastatic hormone-refractory prostate cancer who received docetaxel treatment in a Phase III clinical trial. The data include clinical variables and survival information provided by Celgene, Sanofi, and Memorial Sloan Kettering Cancer Center, and were initially released for a prediction challenge on the Project Data Sphere platform (Guinney et al., 2017).
We investigate the predictability of baseline biomarkers for the outcome of death within 500 days. Among over a hundred biomarkers, we select the top eight biomarkers that have the largest AUC and that are available for at least 900 patients. This is an ad hoc preprocessing step to construct the predetermined tree structures needed for our proposed method to work. After restricting to the eight biomarkers, we include in the analyses 909 subjects with complete biomarker data, among which 393 died within 500 days from baseline (cases) and 516 stayed alive (controls). The data were then split into a training set containing 196 cases and 257 controls, and a testing set containing 197 cases and 259 controls.
We carry out analyses on two types of predetermined tree structures. For the first set of analyses, we examine the “and” and “or” trees for every two-marker combination out of the eight markers, where the sign of each marker at a node bifurcation is based on the sign of its marginal rank correlation (Kendall’s tau). The “and” tree refers to the tree structure that classifies a patient as a case if both marker-classifying criteria are met, and the “or” tree refers to the tree structure that classifies a patient as a case if either criterion is met. We present results for the three top-performing “and” trees and the three top-performing “or” trees. For the second set of analyses, we fit classification trees using CART and CIT (same implementation as in the simulation studies), and then use the constructed tree structures in our proposed rank-based tree estimation for further cutoff value optimization. See Figures 2a and 2d for the fitted classification trees from CART and CIT.
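The two predetermined two-marker tree forms described above can be sketched as follows. The cutoffs c1 and c2 are placeholders to be optimized, and the “>” direction at each node is assumed here for illustration (in the analysis it is set by the sign of the marginal Kendall’s tau):

```python
def and_tree(x1, x2, c1, c2):
    # "and" tree: classify as case only when both marker criteria are met
    return x1 > c1 and x2 > c2

def or_tree(x1, x2, c1, c2):
    # "or" tree: classify as case when either marker criterion is met
    return x1 > c1 or x2 > c2
```

Varying (c1, c2) jointly traces out the attainable TPR–FPR pairs of each tree form, which is what the rank-based estimation optimizes over.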
FIGURE 2.

(a) Classification rule obtained using CART; (b) Classification rule obtained by applying the rank-based tree estimation method to the tree structure built using CART, choosing the cutoff values that maximize Youden’s Index; (c) ROC curve of the class probability estimated from the CART rule, and the OROC curve of the optimal CART rule varying cutoff values; (d) Classification rule obtained using CIT; (e) Classification rule obtained by applying the rank-based tree estimation method to the tree structure built using CIT; (f) ROC curve of the class probability estimated from the CIT rule, and the OROC curve of the optimal CIT rule varying cutoff values. Note that although both the ROC curves for CART and CIT and the OROC curves for optimal CART and optimal CIT are composed of TPR–FPR pairs, they are constructed very differently: the former are ROC curves for the estimated class probability, while the latter are constructed by varying cutoff values at tree nodes according to the estimated optimal rules.
In performing the rank-based estimation, we take the function class to be the class of monotonically increasing piece-wise linear functions with uniform knots over the marker range, and choose the number of knots from 0, 1, and 2 by five-fold cross-validation. After obtaining the optimal classification rule for each tree structure, we estimate the AUOROC on the testing data set and construct 95% confidence intervals using DeLong’s method (DeLong et al., 1988). We also present the TPR and FPR that maximize Youden’s Index (Youden, 1950). We summarize these results in Table 4. We also plot, in Figures 2b and 2e, the “optimal” rules estimated using the CART and CIT tree structures and the rank-based method (referred to as the “Optimal CART” and “Optimal CIT” rules), with cutoff values determined by maximizing Youden’s Index. These two trees are examples of an end product of our method. Finally, we plot ROC and OROC curves of the obtained tree classification rules. From Table 4 and Figures 2c and 2f, we observe that the proposed rank-based tree estimation method often yields classification rules with high predictability when using ad hoc two-marker tree structures, and that it improves the predictive performance of classification trees constructed using CART and CIT. These results show the promise of our proposed method.
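Selecting the operating point that maximizes Youden’s Index J = TPR − FPR can be sketched generically as below. This is a simplified single-score illustration of the criterion (not the paper’s tree-node cutoff search), assuming higher scores indicate cases:

```python
def youden_optimal(scores, labels):
    """Return (J, cutoff, TPR, FPR) maximizing Youden's index J = TPR - FPR.

    scores: classifier scores (higher = more case-like); labels: 0/1 outcomes."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    best = None
    for c in sorted(set(scores)):  # candidate cutoffs: observed score values
        tpr = sum(s >= c for s in cases) / len(cases)
        fpr = sum(s >= c for s in controls) / len(controls)
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, c, tpr, fpr)
    return best
```

In the paper’s setting, the same criterion is applied along the OROC curve, with each point indexed by a set of node cutoff values rather than a single score threshold.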
TABLE 4.
Summary of analysis results
| Tree | AUOROC on testing data (95% CI) | TPR | FPR |
|---|---|---|---|
| First set of analyses | | | |
| ALP and HB | 0.694 (0.646, 0.743) | 0.609 | 0.313 |
| PSA and EC | 0.639 (0.588, 0.690) | 0.497 | 0.297 |
| AST and BMI | 0.621 (0.569, 0.673) | 0.690 | 0.506 |
| ALP or EC | 0.671 (0.623, 0.719) | 0.812 | 0.529 |
| HB or PSA | 0.641 (0.590, 0.692) | 0.563 | 0.297 |
| WT or CA | 0.571 (0.518, 0.625) | 0.452 | 0.309 |
| Second set of analyses | | | |
| CART | 0.642 (0.591, 0.694) | 0.619 | 0.348 |
| Optimal CART | 0.681 (0.631, 0.731) | 0.584 | 0.274 |
| CIT | 0.598 (0.552, 0.643) | 0.726 | 0.533 |
| Optimal CIT | 0.657 (0.607, 0.708) | 0.795 | 0.538 |
Note
For each analyzed tree, we present the area under the OROC estimated on the testing data, its 95% confidence interval, and the TPR and FPR that maximize Youden’s Index. We refer to the rules obtained by applying our proposed approach to the CART and CIT tree structures as the “Optimal CART” and “Optimal CIT” rules, respectively.
Abbreviations: ALP, alkaline phosphatase (U/L); AST, aspartate aminotransferase (U/L); BMI, body mass index (kg/m²); CA, calcium (mmol/L); EC, patient performance status; HB, hemoglobin (g/dL); PSA, prostate-specific antigen (ng/mL); WT, weight (kg).
9 ∣. DISCUSSION
In this paper, we proposed a standard form to represent any fixed tree classifier, and rigorously defined statistical quantities relevant to the optimality of a given tree. These quantities included the ROC band, the optimal ROC curve, and the optimality hypersurface. We pointed out the infeasibility of using the empirical estimator and proposed a novel rank-based method to estimate or approximate the optimal prediction performance of a tree classifier through rank correlation maximization. Our proposed method is theoretically founded, computationally feasible, and potentially applicable to many fields.
Our work faces a few limitations. The proposed method applies only to predetermined tree structures and performs neither variable selection nor tree construction. However, it can be used to improve the cutoff values of classification trees obtained using CART or CIT, as illustrated in the data analysis. Our theory is also restricted to continuous biomarkers, although we believe the method could handle discrete biomarkers as well when some continuous biomarkers are present. However, we do not wish to distract readers with this technicality and defer the details to possible future work.
Despite limitations, our work opens up a series of intriguing research topics. First, it would be interesting to study the performance of more flexible function classes. Second, faster computational algorithms would improve the applicability of our proposed method, especially when tuning is needed. Third, we have not considered covariates’ role in using biomarkers for classification, but covariate-specific tree structures or cutoff values could be highly valuable in practice. Finally, extensions from the binary outcome to continuous or even survival outcome would significantly increase our work’s applicability.
SUPPORTING INFORMATION
Web Appendix A referenced in Section 2, Web Appendix B referenced in Section 4, Web Appendices C and D referenced in Section 5, Web Appendix E referenced in Section 6, Web Appendix F referenced in Section 7, Web Appendix G referenced in Sections 7.1 and 7.2, and Web Appendix H referenced in Section 7.2 are available with this paper at the Biometrics website on Wiley Online Library. R codes related to the simulation studies and data analyses in this paper can be found at https://github.com/dzhuyx/OptimalTree.
DATA AVAILABILITY STATEMENT
The data that support the findings of this paper are available from the prostate cancer DREAM challenge. Restrictions apply to the availability of these data, which were used under license for this paper. Data are available at https://www.synapse.org/#!Synapse:syn2813558/wiki/70844 with the permission of the challenge organizers.
References
- Baker SG (1995) Evaluating multiple diagnostic tests with partial verification. Biometrics, 51, 330–337.
- Baker SG (2000) Identifying combinations of cancer markers for further study as triggers of early intervention. Biometrics, 56, 1082–1087.
- Breiman L, Friedman J, Stone CJ and Olshen RA (1984) Classification and Regression Trees. Boca Raton, FL: CRC Press.
- Brown J, Shesser R, Simon G, Bahn M, Czarnogorski M, Kuo I, Magnus M and Sikka N (2007) Routine HIV screening in the emergency department using the new US Centers for Disease Control and Prevention guidelines: results from a high-prevalence area. JAIDS Journal of Acquired Immune Deficiency Syndromes, 46, 395–401.
- DeLong ER, DeLong DM and Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44, 837–845.
- Dubois B, Feldman HH, Jacova C, DeKosky ST, Barberger-Gateau P, Cummings J, et al. (2007) Research criteria for the diagnosis of Alzheimer’s disease: revising the NINCDS–ADRDA criteria. The Lancet Neurology, 6, 734–746.
- Efron B and Tibshirani RJ (1994) An Introduction to the Bootstrap. Boca Raton, FL: CRC Press.
- Guinney J, Wang T, Laajala TD, Winner KK, Bare JC, Neto EC, et al. (2017) Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data. The Lancet Oncology, 18, 132–142.
- Hamilton RS (1982) The inverse function theorem of Nash and Moser. Bulletin of the American Mathematical Society, 7, 65–222.
- Han AK (1987) Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. Journal of Econometrics, 35, 303–316.
- Hanley JA and McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
- Horowitz JL (1992) A smoothed maximum score estimator for the binary response model. Econometrica, 60, 505–531.
- Hothorn T, Hornik K and Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15, 651–674.
- Jensen DD and Cohen PR (2000) Multiple comparisons in induction algorithms. Machine Learning, 38, 309–338.
- Jin H and Lu Y (2009) The ROC region of a regression tree. Statistics & Probability Letters, 79, 936–942.
- Kim H and Loh W-Y (2001) Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589–604.
- Mattsson N, Rosen E, Hansson O, Andreasen N, Parnetti L, Jonsson M, et al. (2012) Age and diagnostic performance of Alzheimer disease CSF biomarkers. Neurology, 78, 468–476.
- McIntosh MW and Pepe MS (2002) Combining several screening tests: optimality of the risk score. Biometrics, 58, 657–664.
- McKhann G, Drachman D, Folstein M, Katzman R, Price D and Stadlan EM (1984) Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology, 34, 939–944.
- Mingers J (1987) Expert systems–rule induction with statistical data. Journal of the Operational Research Society, 38, 39–47.
- Murthy SK (1998) Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery, 2, 345–389.
- Pepe MS, Cai T and Longton G (2006) Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics, 62, 221–229.
- Pepe MS and Thompson ML (2000) Combining diagnostic test results to increase accuracy. Biostatistics, 1, 123–140.
- Querbes O, Aubry F, Pariente J, Lotterie J-A, Démonet J-F, Duret V, et al. (2009) Early diagnosis of Alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain, 132, 2036–2047.
- Quinlan JR (1986) Induction of decision trees. Machine Learning, 1, 81–106.
- Quinlan JR (1996) Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.
- Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
- Synapse Organization (2005) Prostate Cancer DREAM Challenge. doi:10.7303/syn2813558.
- Wang M-C and Li S (2012) Bivariate marker measurements and ROC analysis. Biometrics, 68, 1207–1218.
- Youden WJ (1950) Index for rating diagnostic tests. Cancer, 3, 32–35.