Abstract
Pancreatic ductal adenocarcinoma (PDAC) is the deadliest of cancers, and there is currently strong clinical interest in novel biomarkers that contribute to its early detection. Appropriately assessing the accuracy of such biomarkers is a crucial issue, and one often needs to take into account that many assays include biospecimens of individuals coming from three groups: healthy, chronic pancreatitis, and PDAC. The ROC surface is an appropriate tool for assessing the overall accuracy of a marker employed under such trichotomous settings. A decision/classification rule is often based on the so-called Youden index and its three-dimensional generalization. However, both the clinical and the statistical literature have not paid the necessary attention to the underlying false classification (FC) rates, which are of equal or even greater importance. In this paper we provide a framework for making inferences around all classification rates, as well as for comparisons between them. We explore the trinormal model, flexible models based on power transformations, and robust non-parametric alternatives. We provide a full framework for the construction of confidence intervals, regions, and spaces for joint inferences or for clinically meaningful points of interest. We further discuss the implications of costs related to different false classifications. We evaluate our approaches through extensive simulations and illustrate them using data from a recent PDAC study conducted at the MD Anderson Cancer Center.
Keywords: 3-class classification, Box-Cox, Costs, False Class Rates, Kernel density estimators, Monotone Likelihood Ratio, Prevalence, ROC surface, Youden
1. Introduction
Pancreatic ductal adenocarcinoma (PDAC) is the most lethal cancer among all organ sites, with an overall five-year survival rate of approximately 8% [1]. It is the third leading cause of cancer-related mortality in both men and women in the United States and is projected to become the second most common cause of cancer-related deaths by 2030 [2]. Screening of the general population for PDAC is not considered cost-effective due to the low incidence of PDAC and the high cost of follow-up procedures following a positive diagnosis. However, it is considered cost-effective for certain higher-risk groups, for whom early detection is critical in improving survival [3]. Two of the guidelines given in [4] strongly suggest screening for: 1) patients who are first-degree relatives of patients with pancreas cancer with at least 2 affected genetically related relatives; and 2) patients with genetic syndromes associated with an increased risk of pancreas cancer, including all patients with Peutz-Jeghers syndrome, hereditary pancreatitis, patients with a CDKN2A gene mutation, and patients with Lynch syndrome or mutations in the BRCA1, BRCA2, PALB2, and ATM genes who have 1 or more first-degree relatives with pancreas cancer.
Even though CA19-9 has shown promising potential as a diagnostic biomarker, its utility is limited in detecting early-stage PDAC. Consequently, there is strong clinical interest in discovering new non-invasive blood-based biomarkers for the early detection of PDAC. Related biomarker studies consider assays that often include individuals coming from three groups: healthy (controls), chronic pancreatitis, and individuals with PDAC. The trichotomous nature of the true disease status needs to be taken into account when evaluating candidate biomarkers. Apart from assessing the accuracy of a candidate biomarker in terms of correctly classifying individuals among all three groups, it is of equal or even greater importance to also make inferences around the false classification rates when a decision rule needs to be established.
Under a two-class setting, where the disease status is defined by the 'healthy' and the 'diseased' class, the ROC curve is typically employed for assessing the accuracy of a continuous biomarker. If we denote by $Y_1$ the scores of the healthy and by $Y_2$ the scores of the diseased, the ROC curve visualizes all possible pairs of sensitivity, $sens(c) = P(Y_2 > c)$, and specificity, $spec(c) = P(Y_1 \le c)$, for a given cutoff $c$. By scanning $c$ over the real line we obtain infinite pairs $(spec(c), sens(c))$, where $-\infty < c < \infty$. The ROC curve is simply the plot of $sens$ against $1 - spec$. An overview of ROC curves is given in [5]. It is easy to see that the ROC curve in the two-class case captures all information in terms of both correct and false classification: the false positive rate is $FPR(c) = 1 - spec(c) = P(Y_1 > c)$, and the false negative rate is $FNR(c) = 1 - sens(c) = P(Y_2 \le c)$. The standard ROC curve is constructed by plotting the true positive rate (i.e. the sensitivity, or TPR) on the y-axis against the FPR on the x-axis. Thus, this plot is simply an equivalent visualization of all true and false classification rates; that is, using the traditional ROC curve one can visualize the tradeoff of any pair of TPR, TNR, FPR, and FNR. Hence, the two-class ROC curve inherently provides all information in terms of classification rates. One way of deriving an optimal cutoff point is the maximization of the Youden index [6].
Under a trichotomous setting we have three groups for discrimination, whose scores are denoted by $Y_1$, $Y_2$, and $Y_3$, generated from distributions $F_1$, $F_2$, and $F_3$, respectively. The ROC analogue in this case is the ROC surface ([7], [8], [9]); for an overview see [10]. Under such a setting we have three TC rates and two ordered cutoffs $c_1 < c_2$. The former are defined by $TC_1 = P(Y_1 \le c_1)$, $TC_2 = P(c_1 < Y_2 \le c_2)$, and $TC_3 = P(Y_3 > c_2)$. After scanning both $c_1$ and $c_2$ over the real line we obtain infinite triplets $(TC_1, TC_2, TC_3)$. By plotting $TC_1$, $TC_3$, and $TC_2$ on the x-, y-, and z-axis respectively, we obtain the ROC surface, which is naturally bounded by the unit cube. However, such a plot does not contain all information with respect to all possible classification rates. A generalization of the Youden index to the three-class case has been studied before (see [11]–[20]). This involves maximizing the sum of the three true class rates with respect to the two ordered cutoffs. As a result, one gets the optimal pair of cutoffs that corresponds to a true class triplet $(TC_1, TC_2, TC_3)$. However, this triplet does not reveal, for example, the false classification rate of classifying to group 2 an individual that truly belongs to group 1. In total there are six such FC rates, and knowledge of a $(TC_1, TC_2, TC_3)$ triplet (or the ROC surface) does not imply knowledge of all underlying false classification rates, as opposed to the two-class case. Even though it is known that there are 6 FC rates, inferences for these FC rates are not currently available in the literature. This is an important scientific gap, since different costs may be associated with each FC rate. When assessing the accuracy of a biomarker in a trichotomous setting, it is important to know whether a potentially poor performance in terms of a particular $TC_i$ arises because our decisions tend to bleed falsely towards group $j$ rather than group $j'$. Furthermore, it is important to have the inferential framework to construct confidence intervals for all true and false classification rates and to be able to compare FC rates with one another. Such comparisons are not trivial, as the estimated FC rates for a given biomarker are correlated. In this paper, we fill in this literature gap.
The paper is organized as follows. In section 2, we investigate the estimation of the two cutpoints in the general setting, utilizing class prevalences and misclassification costs. In section 3, we consider statistical inference under the normality assumption. We provide an overall $100(1-\alpha)\%$ confidence hyperspace for all 6 FC rates. We then provide confidence regions for a given pair of false classification rates that can be visualized by practitioners in their effort to assess the strengths and weaknesses of a biomarker in terms of classification. We provide a general framework for testing hypotheses concerning a set of linear combinations of FC rates; in particular, we provide an appropriate test for comparing the equality of any two FC rates. In section 4, we consider the same points by employing more flexible models based on power transformations. In section 5, we present a non-parametric alternative that addresses the aforementioned points under a more robust framework. In section 6, we evaluate our approaches with an extensive simulation study. In section 7, we apply our approaches to a real data set that refers to the early detection of PDAC. We end with a discussion.
2. The General Framework
We assume the biomarker score, $Y_i$, of an individual in the $i$-th group to be a continuous random variable with cdf $F_i$, survival function $S_i = 1 - F_i$, and pdf $f_i$. The classification fractions that are obtained using the two cutoffs, $c_1$ and $c_2$, are given by:

$$
P_{j|i}(c_1, c_2) = \begin{cases} F_i(c_1), & j = 1,\\ F_i(c_2) - F_i(c_1), & j = 2,\\ S_i(c_2), & j = 3, \end{cases} \qquad i = 1, 2, 3, \tag{1}
$$

where $c_1 < c_2$ and $P_{j|i}$ is the probability of classifying into group $j$ a subject that truly belongs to group $i$. When $j = i$, we obtain the true classification fraction $TC_i = P_{i|i}$. Otherwise we obtain the false classification rates, that is, $FC_{j|i} = P_{j|i}$ for $j \neq i$: $FC_{2|1}$, $FC_{3|1}$, $FC_{1|2}$, $FC_{3|2}$, $FC_{1|3}$, $FC_{2|3}$.
Optimal cutoffs are obtained by considering class prevalences (when available) of the target population and (agreed-upon) costs (see also [16]–[19]). Here, we expand on such a framework and provide further insight and technical details regarding the estimation of the cutoffs, as well as the involved inferences around the FC rates. Let $\pi_i$ be the prevalence of disease status $i$ in the target population under consideration. Denote the misclassification cost of classifying an individual in group $i$ as being in group $j$ by $C_{j|i}$. Without loss of generality (concerning the optimal cutoffs), we assume here a zero cost for correct classification and a zero cost for administering the test itself.
Let $t_1 = FC_{2|1}$ and $t_2 = FC_{3|1}$ be the two misclassification rates of subjects in the 1st disease stage. We easily obtain $c_1 = F_1^{-1}(1 - t_1 - t_2)$ and $c_2 = F_1^{-1}(1 - t_2)$. Thus, the remaining four FC rates can be written as

$$
\begin{aligned}
FC_{1|2} &= F_2\!\left(F_1^{-1}(1 - t_1 - t_2)\right), & FC_{3|2} &= 1 - F_2\!\left(F_1^{-1}(1 - t_2)\right),\\
FC_{1|3} &= F_3\!\left(F_1^{-1}(1 - t_1 - t_2)\right), & FC_{2|3} &= F_3\!\left(F_1^{-1}(1 - t_2)\right) - F_3\!\left(F_1^{-1}(1 - t_1 - t_2)\right),
\end{aligned} \tag{2}
$$

where $F_1^{-1}$ denotes the quantile function of the first group and $0 \le t_1 + t_2 \le 1$. The expected overall cost at $t_1$ and $t_2$ is then equal to

$$
EC(t_1, t_2) = \pi_1\!\left(C_{2|1}\, t_1 + C_{3|1}\, t_2\right) + \pi_2\!\left(C_{1|2}\, FC_{1|2} + C_{3|2}\, FC_{3|2}\right) + \pi_3\!\left(C_{1|3}\, FC_{1|3} + C_{2|3}\, FC_{2|3}\right). \tag{3}
$$
By minimizing (3) one obtains the optimal cutoffs.
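To make this minimization concrete, the sketch below evaluates the expected overall cost (3) over a grid of cutoff pairs and locates its minimum numerically. All normal parameters, costs, and prevalences in it are hypothetical placeholders (they are not the paper's worked example); the analytic treatment via estimating equations follows.

```python
# A brute-force sketch: evaluate the expected overall cost in (3) over a grid
# of cutoff pairs (c1 < c2) and locate its minimum. The normal parameters,
# costs and prevalences below are hypothetical placeholders.
import numpy as np
from scipy.stats import norm

g = np.linspace(-3.0, 8.0, 400)                                # candidate cutoffs
F1, F2, F3 = (norm(m, s).cdf(g) for m, s in [(0, 1), (2, 1.2), (4, 1.5)])
pi1, pi2, pi3 = 1/5, 2/5, 2/5                                  # prevalences
C21, C31, C12, C32, C13, C23 = 1.0, 2.0, 1.0, 1.0, 4.0, 2.0    # costs C_{j|i}

A, B = np.meshgrid(np.arange(g.size), np.arange(g.size), indexing="ij")
EC = (pi1 * (C21 * (F1[B] - F1[A]) + C31 * (1 - F1[B]))        # group 1 errors
      + pi2 * (C12 * F2[A] + C32 * (1 - F2[B]))                # group 2 errors
      + pi3 * (C13 * F3[A] + C23 * (F3[B] - F3[A])))           # group 3 errors
EC = np.where(g[A] < g[B], EC, np.inf)                         # enforce c1 < c2
i, j = np.unravel_index(np.argmin(EC), EC.shape)
print("approx. optimal cutoffs:", g[i], g[j])
```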
Since the derivative of $F_1^{-1}(t)$ equals $\frac{d}{dt}F_1^{-1}(t) = 1/f_1\!\left(F_1^{-1}(t)\right)$, we obtain

$$
\begin{aligned}
\frac{\partial EC}{\partial t_1} &= \pi_1 C_{2|1} - \frac{\pi_2 C_{1|2}\, f_2(c_1) + \pi_3\left(C_{1|3} - C_{2|3}\right) f_3(c_1)}{f_1(c_1)},\\
\frac{\partial EC}{\partial t_2} &= \pi_1 C_{3|1} - \frac{\pi_2 C_{1|2}\, f_2(c_1) + \pi_3\left(C_{1|3} - C_{2|3}\right) f_3(c_1)}{f_1(c_1)} + \frac{\pi_2 C_{3|2}\, f_2(c_2) - \pi_3 C_{2|3}\, f_3(c_2)}{f_1(c_2)},
\end{aligned} \tag{4}
$$

where $c_1 = F_1^{-1}(1 - t_1 - t_2)$ and $c_2 = F_1^{-1}(1 - t_2)$. By setting the derivatives to 0, we obtain the following estimating equations for the two cutoffs $\hat c_1$ and $\hat c_2$:

$$
\begin{aligned}
\pi_1 C_{2|1}\, f_1(\hat c_1) &= \pi_2 C_{1|2}\, f_2(\hat c_1) + \pi_3\left(C_{1|3} - C_{2|3}\right) f_3(\hat c_1),\\
\pi_3 C_{2|3}\, f_3(\hat c_2) &= \pi_2 C_{3|2}\, f_2(\hat c_2) + \pi_1\left(C_{3|1} - C_{2|1}\right) f_1(\hat c_2).
\end{aligned} \tag{5}
$$
The above equations do not, in general, have unique solutions. When solutions exist, they simply yield potential local minima of the cost function. Hence, in general, one should additionally investigate the actual cost function. It may even be the case that the optimal rule is not to perform any testing at all. Unique solutions that yield a global minimum occur in some important situations that are discussed below. The estimating equations simplify when one considers equal overclassification costs for group 1, i.e. $C_{2|1} = C_{3|1}$, and equal underclassification costs for group 3, i.e. $C_{1|3} = C_{2|3}$. We then solve

$$
\pi_1 C_{2|1}\, f_1(c_1) = \pi_2 C_{1|2}\, f_2(c_1), \qquad \pi_3 C_{2|3}\, f_3(c_2) = \pi_2 C_{3|2}\, f_2(c_2). \tag{6}
$$
Youden-based cutoffs result by further assuming

$$
\pi_1 C_{2|1} = \pi_2 C_{1|2}, \qquad \pi_2 C_{3|2} = \pi_3 C_{2|3}, \tag{7}
$$

in which case one simply solves $f_1(c_1) = f_2(c_1)$ and $f_2(c_2) = f_3(c_2)$.
The equations in (6) and (7) imply strong assumptions that are in play when operating under the classical Youden index approach. In practice, it is very likely that these assumptions do not hold. Thus, there is a need to accommodate the implied different costs and prevalences.
When considering misclassification costs, it is most often reasonable to assume that $C_{3|1} \ge C_{2|1}$ and $C_{1|3} \ge C_{2|3}$, that is, misclassifying to a more remote group is at least as costly. By setting $r_2(y) = f_2(y)/f_1(y)$ and $r_{3:2}(y) = f_3(y)/f_2(y)$, the estimating equations in (5) become

$$
\begin{aligned}
\pi_1 C_{2|1} &= \left\{\pi_2 C_{1|2} + \pi_3\left(C_{1|3} - C_{2|3}\right) r_{3:2}(\hat c_1)\right\} r_2(\hat c_1),\\
\pi_3 C_{2|3}\, r_{3:2}(\hat c_2)\, r_2(\hat c_2) &= \pi_2 C_{3|2}\, r_2(\hat c_2) + \pi_1\left(C_{3|1} - C_{2|1}\right),
\end{aligned} \tag{8}
$$

where $r_2(\cdot)$ and $r_{3:2}(\cdot)$ are the likelihood ratios of successive groups. The estimating equations in (8) yield a unique solution for the cutoffs when the three distributions are likelihood ratio ordered, that is, when $r_2(y)$ and $r_{3:2}(y)$ are increasing functions and the likelihood ratios range from 0 to $\infty$. This is the case when one considers 3 Normal distributions with equal variances and increasing means, or 3 Gamma distributions with equal shape parameters and increasing means.
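A minimal numerical sketch of solving these estimating equations with a bracketing root-finder is given below; the equal-variance normal densities, costs, and prevalences are hypothetical placeholders, and under the likelihood-ratio ordering just described each equation has a unique root.

```python
# Sketch: solve the estimating equations in (5)/(8) with scipy's brentq.
# Densities, costs and prevalences are hypothetical placeholders.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

f1, f2, f3 = (norm(m, 1).pdf for m in (0, 2, 4))               # MLR-ordered normals
pi1, pi2, pi3 = 1/5, 2/5, 2/5
C21, C31, C12, C32, C13, C23 = 1.0, 2.0, 1.0, 1.0, 4.0, 2.0    # costs C_{j|i}

def eq_c1(c):   # first equation in (5): derivative w.r.t. the first cutoff
    return pi1*C21*f1(c) - pi2*C12*f2(c) - pi3*(C13 - C23)*f3(c)

def eq_c2(c):   # second equation in (5): derivative w.r.t. the second cutoff
    return pi3*C23*f3(c) - pi2*C32*f2(c) - pi1*(C31 - C21)*f1(c)

c1, c2 = brentq(eq_c1, -5, 10), brentq(eq_c2, -5, 10)
print(c1, c2)   # candidate cutoffs; the cost surface should still be inspected
```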
2.1. Normality Assumption
It can be shown that, if one assumes Normal distributions for the three groups, $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$, $Y_3 \sim N(\mu_3, \sigma_3^2)$, with $\mu_1 < \mu_2 < \mu_3$, and further considers equal overclassification costs for group 1, that is, $C_{2|1} = C_{3|1}$, equal underclassification costs for group 3, that is, $C_{1|3} = C_{2|3}$, and unequal variances, one solves a pair of quadratic equations yielding the following roots for $c_1$ and $c_2$:

$$
\begin{aligned}
c_1^{\pm} &= \frac{\mu_2\sigma_1^2 - \mu_1\sigma_2^2 \pm \sigma_1\sigma_2\sqrt{(\mu_1 - \mu_2)^2 + (\sigma_1^2 - \sigma_2^2)\log\!\left(\frac{\sigma_1^2\,\pi_2^2\,C_{1|2}^2}{\sigma_2^2\,\pi_1^2\,C_{2|1}^2}\right)}}{\sigma_1^2 - \sigma_2^2},\\[4pt]
c_2^{\pm} &= \frac{\mu_3\sigma_2^2 - \mu_2\sigma_3^2 \pm \sigma_2\sigma_3\sqrt{(\mu_2 - \mu_3)^2 + (\sigma_2^2 - \sigma_3^2)\log\!\left(\frac{\sigma_2^2\,\pi_3^2\,C_{2|3}^2}{\sigma_3^2\,\pi_2^2\,C_{3|2}^2}\right)}}{\sigma_2^2 - \sigma_3^2}.
\end{aligned} \tag{9}
$$
The roots are real under the conditions

$$
(\mu_1 - \mu_2)^2 + (\sigma_1^2 - \sigma_2^2)\log\!\left(\frac{\sigma_1^2\,\pi_2^2\,C_{1|2}^2}{\sigma_2^2\,\pi_1^2\,C_{2|1}^2}\right) \ge 0
\quad\text{and}\quad
(\mu_2 - \mu_3)^2 + (\sigma_2^2 - \sigma_3^2)\log\!\left(\frac{\sigma_2^2\,\pi_3^2\,C_{2|3}^2}{\sigma_3^2\,\pi_2^2\,C_{3|2}^2}\right) \ge 0, \tag{10}
$$

in which case it can be shown that a local minimum of the expected cost is achieved at

$$
c_1 = c_1^{-}, \qquad c_2 = c_2^{-}. \tag{11}
$$
At this point one should investigate the behavior of the expected cost function.
When one considers Youden-based estimation of the cutoffs, the conditions in (10) are satisfied and a global minimum of the expected cost function (under the assumptions $C_{2|1} = C_{3|1}$, $C_{1|3} = C_{2|3}$, and (7)) is achieved at

$$
\begin{aligned}
c_1^{*} &= \frac{\mu_2\sigma_1^2 - \mu_1\sigma_2^2 - \sigma_1\sigma_2\sqrt{(\mu_1 - \mu_2)^2 + (\sigma_1^2 - \sigma_2^2)\log\!\left(\sigma_1^2/\sigma_2^2\right)}}{\sigma_1^2 - \sigma_2^2},\\[4pt]
c_2^{*} &= \frac{\mu_3\sigma_2^2 - \mu_2\sigma_3^2 - \sigma_2\sigma_3\sqrt{(\mu_2 - \mu_3)^2 + (\sigma_2^2 - \sigma_3^2)\log\!\left(\sigma_2^2/\sigma_3^2\right)}}{\sigma_2^2 - \sigma_3^2}.
\end{aligned} \tag{12}
$$
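As a quick illustration, the closed form in (12) can be coded directly; the parameter values below are hypothetical, and the only requirement is unequal variances within each pair of adjacent groups.

```python
# Sketch of the closed-form Youden-based cutoffs in (12): c1 uses groups 1-2
# and c2 uses groups 2-3. Parameter values are hypothetical.
import numpy as np

def youden_cut(mu_a, s_a, mu_b, s_b):
    """Root of f_a(c) = f_b(c) lying between mu_a and mu_b (requires s_a != s_b)."""
    disc = (mu_a - mu_b)**2 + (s_a**2 - s_b**2) * np.log(s_a**2 / s_b**2)
    return (mu_b*s_a**2 - mu_a*s_b**2 - s_a*s_b*np.sqrt(disc)) / (s_a**2 - s_b**2)

c1 = youden_cut(0.0, 1.0, 2.0, 1.5)   # cutoff between groups 1 and 2
c2 = youden_cut(2.0, 1.5, 4.0, 2.0)   # cutoff between groups 2 and 3
print(c1, c2)
```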
When the variances are assumed equal ($= \sigma^2$), the three distributions are likelihood ratio ordered. According to (8), the equations to be solved to obtain the cutoffs, $c_1$ and $c_2$, can be succinctly rewritten by first setting

$$
r_2(c) = \exp\!\left\{\frac{(\mu_2 - \mu_1)(2c - \mu_1 - \mu_2)}{2\sigma^2}\right\}, \qquad
r_{3:2}(c) = \exp\!\left\{\frac{(\mu_3 - \mu_2)(2c - \mu_2 - \mu_3)}{2\sigma^2}\right\}.
$$

After some algebra, it can be shown that one needs to solve numerically for $c_1$ and $c_2$ the following:

$$
\begin{aligned}
\pi_1 C_{2|1} &= \left\{\pi_2 C_{1|2} + \pi_3\left(C_{1|3} - C_{2|3}\right) r_{3:2}(c_1)\right\} r_2(c_1),\\
\pi_3 C_{2|3}\, r_{3:2}(c_2)\, r_2(c_2) &= \pi_2 C_{3|2}\, r_2(c_2) + \pi_1\left(C_{3|1} - C_{2|1}\right).
\end{aligned} \tag{13}
$$
2.2. Numerical Examples
2.2.1. Unequal variances example:
To illustrate the theory and equations discussed above, we first consider a trinormal setting in which the three groups have unequal variances; the configuration is depicted in Figure 1. We assume unequal misclassification costs, in some arbitrary units, for the six possible false classifications, and we set the underlying prevalences $\pi_1$, $\pi_2$, and $\pi_3$ equal to 1/5, 2/5, and 2/5, respectively. The classical Youden-index-based cutoffs are the cutoffs obtained if we are willing to assume that the conditions in (7) hold. However, these assumptions are obviously violated in our example. Under our configuration the cutoffs can be derived by solving the equations in (5). The derived cutoffs are 0.5733 and 3.9425 (see also Figure 1). For the same example we further illustrate how the underlying costs may have a severe impact on the decision-making process. In an extreme situation where $C_{3|2} = 40{,}000{,}000$ and all else remains the same, the obtained cutoffs are 0.5733 and 9.1775, which implies that only 3.3% of the third (aggressive) group is correctly classified as such, while 96.6% of the individuals of group 3 are misclassified as group 2. The reason is that a potential misclassification to group 3 of a subject that truly belongs in group 2 is extremely costly relative to the other misclassification costs. The corresponding cost function is visualized in Figure 2. It is obvious that for more extreme values of some of the underlying costs, the derived cutoffs may imply that no testing is necessary at all, either under the notion that all individuals are driven to be classified in group 3, or that all individuals are to be classified as healthy (group 1).
Figure 1.
Configuration of the discussed numerical example, where the three groups follow normal distributions with unequal variances. We assume unequal misclassification costs in some arbitrary units and set the underlying prevalences $\pi_1$, $\pi_2$, and $\pi_3$ equal to 1/5, 2/5, and 2/5, respectively. The cutoffs derived using the traditional Youden index are also indicated. The proposed cutoffs after solving the corresponding estimating equations that account for the costs and the prevalences are 0.5733 and 3.9425.
Figure 2.
Visualization of the cost function that corresponds to the unequal variances example discussed above. We visualize the same cost function from two different angles (left and right panels) for convenience. The global minimum is achieved at $t_1 = 0.6636$ and $t_2 = 0.0016$. These correspond to the pair of cutoffs 0.5733 and 3.9425. The profile of the cost function with respect to $t_1$ and $t_2$ is given in Figure 3.
Note also that, in this example, the approximate cost at the optimal cutoffs is 11348 units, whereas the classical/naive Youden-based cutoffs (assuming the conditions given in (7)) correspond to an approximate cost of 25919 units (about 2.28 times higher). In other words, we achieve a cost reduction of 56.2% ((25919 − 11348)/25919 = 0.5622) when taking into account the assumed costs and prevalences.
2.2.2. Equal variances example:
To illustrate another example that refers to an equal variances setting, we consider three normal distributions with a common variance and increasing means. We assume the same costs and prevalences as before. Utilizing the equations in (13) we obtain the values 1.3065 and 4.3668 for the first and second cutoff, respectively. These are exactly the values we would obtain had we used the equations in (8) or in (5).
2.3. Remark and comments:
In Figure 1 we graphically illustrate one example with some hypothesized costs and prevalences. This plot highlights the different cutoffs of our approach in contrast to the classical Youden index. Note that the effect of the prevalences (or costs) in a given application can be dominated by costs (or prevalences) that are extremely imbalanced. For example, if the costs of misclassifying an individual from group 2 or 3 to group 1 are extremely high, then $c_1$ will be driven towards the right tail of the density of $Y_1$. The reason is that it is extremely costly to have an individual misclassified to group 1, which drives the decision process to prefer misclassifying individuals from group 1 to groups 2 or 3. On the other extreme, if the cost of misclassifying an individual from group 1 or 2 to group 3 is extremely high, then the second cutoff will be driven to the left tail of the density of $Y_3$; the logic is analogous. In the general case, the prevalence enters as shown by equation (5). It reflects a weight that affects the classification rule: if, for example, the prevalence of the first group is very small (and, at the same time, it is very costly to be misclassified to group 1), then there is an even stronger "push" of the first cutoff to the right of the density of $Y_1$. Apart from it being very costly to be misclassified to group 1, only a few individuals will be impacted (due to the low prevalence of the first group), and thus the process selects an even larger $c_1$. The rationale is analogous for other scenarios, and the process boils down to the magnitude of the imbalance in the costs and prevalences.
3. Inference under the Normality assumption
Under the normality assumption, the biomarker scores of group 1, group 2, and group 3 are generated separately from three normal distributions, i.e. $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$, $Y_3 \sim N(\mu_3, \sigma_3^2)$. The true class fractions are given by:

$$
TC_1 = \Phi\!\left(\frac{c_1 - \mu_1}{\sigma_1}\right), \qquad
TC_2 = \Phi\!\left(\frac{c_2 - \mu_2}{\sigma_2}\right) - \Phi\!\left(\frac{c_1 - \mu_2}{\sigma_2}\right), \qquad
TC_3 = 1 - \Phi\!\left(\frac{c_2 - \mu_3}{\sigma_3}\right), \tag{14}
$$

and the triplet $(TC_1, TC_2, TC_3)$ defines the ROC surface. The corresponding parametric generalized Youden index is:

$$
J_3 = \max_{c_1 < c_2}\left\{TC_1(c_1) + TC_2(c_1, c_2) + TC_3(c_2)\right\}. \tag{15}
$$
It is common to select cutoff points by maximizing $J_3$. However, if the class prevalences and misclassification costs are available, then one ought to select the cutoff points by minimizing the expected overall cost. In general, after solving the equations in (8) one obtains cutoffs that correspond to possible local minima of the expected cost function. Further investigation of the expected cost function may reveal that 'degenerate' rules are optimal (such as the diagnostic test classifying everyone as belonging to group 3). As discussed in the previous section, it can be shown that the Youden-based cutoffs minimize the expected overall cost (and thus are optimal) under the assumptions $C_{2|1} = C_{3|1}$, $C_{1|3} = C_{2|3}$, and the conditions in (7). The Youden-based cutoffs are given in (12).
The corresponding estimates are then obtained by substituting in Equations (12) the maximum likelihood estimates of $(\mu_1, \sigma_1, \mu_2, \sigma_2, \mu_3, \sigma_3)$. The associated estimated optimal triplet of true class fractions on the ROC surface is defined by $(\widehat{TC}_1, \widehat{TC}_2, \widehat{TC}_3)$, where the involved $\hat\mu_i$ and $\hat\sigma_i$ are the maximum likelihood estimates of $\mu_i$ and $\sigma_i$. Natural interest lies in knowing how a biomarker performs with respect to the underlying false classification rates, which are defined by:

$$
\begin{aligned}
FC_{2|1} &= \Phi\!\left(\frac{c_2 - \mu_1}{\sigma_1}\right) - \Phi\!\left(\frac{c_1 - \mu_1}{\sigma_1}\right), & FC_{3|1} &= 1 - \Phi\!\left(\frac{c_2 - \mu_1}{\sigma_1}\right),\\
FC_{1|2} &= \Phi\!\left(\frac{c_1 - \mu_2}{\sigma_2}\right), & FC_{3|2} &= 1 - \Phi\!\left(\frac{c_2 - \mu_2}{\sigma_2}\right),\\
FC_{1|3} &= \Phi\!\left(\frac{c_1 - \mu_3}{\sigma_3}\right), & FC_{2|3} &= \Phi\!\left(\frac{c_2 - \mu_3}{\sigma_3}\right) - \Phi\!\left(\frac{c_1 - \mu_3}{\sigma_3}\right),
\end{aligned}
$$

with $c_1$ and $c_2$ as given in (12). Note that $FC_{1|2}$ depends only on $(\mu_1, \sigma_1, \mu_2, \sigma_2)$ and $FC_{3|2}$ depends only on $(\mu_2, \sigma_2, \mu_3, \sigma_3)$, while the remaining FC rates depend on the full parameter vector denoted by $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, \mu_3, \sigma_3)'$.
Let $FC_i$ denote the $2 \times 1$ vector containing the two FC probabilities corresponding to true disease status $i$, i.e. $FC_1 = (FC_{2|1}, FC_{3|1})'$, $FC_2 = (FC_{1|2}, FC_{3|2})'$, and $FC_3 = (FC_{1|3}, FC_{2|3})'$. Denote the full $6 \times 1$ false classification probability vector by $FC = (FC_1', FC_2', FC_3')'$. The full $6 \times 6$ variance-covariance matrix of the maximum likelihood estimate of $\theta$, say $\hat\theta$, can be derived in closed form. We denote it by $\Sigma$, with $\hat\Sigma$ being its estimate.
Note that all FC rates depend either partially or fully on $\theta$, and hence their estimates exhibit covariability. To ensure that the estimates lie in the ROC space, we first transform the false classification rates on the real line through the probit transformation, $Z_{j|i} = \Phi^{-1}(FC_{j|i})$, and then back-transform them to the ROC space through $\Phi(\cdot)$. The $6 \times 6$ estimated covariance matrix of our estimates of the transformed FC rates is denoted by $\hat V$ and is obtained through the delta method. Its diagonal elements are given by

$$
\widehat{\mathrm{Var}}\!\left(\hat Z_{j|i}\right) = \frac{\hat\nabla_{j|i}'\,\hat\Sigma\,\hat\nabla_{j|i}}{\phi^2\!\left(\Phi^{-1}\!\left(\widehat{FC}_{j|i}\right)\right)}, \tag{16}
$$

where $\nabla_{j|i} = \partial FC_{j|i}/\partial\theta$, $\phi(\cdot)$ denotes the standard normal pdf, and $\hat\nabla_{j|i}$ is $\nabla_{j|i}$ evaluated at $\hat\theta$. Its off-diagonal elements are

$$
\widehat{\mathrm{Cov}}\!\left(\hat Z_{j|i}, \hat Z_{j'|i'}\right) = \frac{\hat\nabla_{j|i}'\,\hat\Sigma\,\hat\nabla_{j'|i'}}{\phi\!\left(\Phi^{-1}\!\left(\widehat{FC}_{j|i}\right)\right)\phi\!\left(\Phi^{-1}\!\left(\widehat{FC}_{j'|i'}\right)\right)}, \tag{17}
$$
where $j \neq i$, $j' \neq i'$, and $(j|i) \neq (j'|i')$. Note that in the above formula we allow $i$ to be equal to $i'$, which may actually be of clinical interest, since it refers to making inferences by comparing two different classification rates conditioning on the true class. Namely, such conditioning serves the purpose of exploring the direction towards which a given class state is misclassified. A joint $100(1-\alpha)\%$ confidence hyperspace for all probit-transformed classification rates is:
$$
\left\{Z \in \mathbb{R}^6 : \left(\hat Z - Z\right)'\,\hat V^{-1}\left(\hat Z - Z\right) \le \chi^2_{6,\,1-\alpha}\right\}, \tag{18}
$$

where $Z = (Z_{2|1}, Z_{3|1}, Z_{1|2}, Z_{3|2}, Z_{1|3}, Z_{2|3})'$, $\hat Z$ is its estimate, and $\chi^2_{\nu,\,1-\alpha}$ denotes the $(1-\alpha)$-th percentile of a Chi-square distribution with $\nu$ degrees of freedom.
Particular interest lies in deriving confidence regions for the following pairs of FC rates: $(FC_{2|1}, FC_{3|1})$, $(FC_{1|2}, FC_{3|2})$, and $(FC_{1|3}, FC_{2|3})$. A joint confidence region for the probits in the $2 \times 1$ vector $Z_i$ can be constructed and graphed by the ellipse defined by:

$$
\left(\hat Z_i - Z_i\right)'\,\hat V_i^{-1}\left(\hat Z_i - Z_i\right) \le \chi^2_{2,\,1-\alpha}, \tag{19}
$$

where the $2 \times 2$ matrix $\hat V_i$ corresponds to class $i$ and is extracted from the full $6 \times 6$ covariance matrix $\hat V$. The $100(1-\alpha)\%$ confidence region for the two actual FC rates is found by back-transforming that ellipse into the ROC space using $\Phi(\cdot)$ on its coordinates. Due to the irregular shape of these proposed confidence regions, we will refer to them as egg-shaped confidence regions (see application).
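A sketch of how such an egg-shaped region can be traced is given below: the ellipse of (19) is parameterized on the probit scale via a Cholesky factor and then mapped back through $\Phi(\cdot)$. The point estimates and the $2 \times 2$ covariance block below are hypothetical stand-ins.

```python
# Sketch: build the egg-shaped confidence region of (19) for one pair of FC
# rates -- an ellipse on the probit scale, back-transformed through Phi.
# fc_hat and V2 are hypothetical stand-ins for a pair of FC estimates and
# their 2x2 probit-scale covariance block.
import numpy as np
from scipy.stats import norm, chi2

fc_hat = np.array([0.16, 0.29])                   # assumed FC point estimates
V2 = np.array([[0.020, 0.004], [0.004, 0.030]])   # assumed covariance block

z_hat = norm.ppf(fc_hat)                          # probit-transformed center
radius = np.sqrt(chi2.ppf(0.95, df=2))            # 95% chi-square radius
L = np.linalg.cholesky(V2)

theta = np.linspace(0, 2*np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])
ellipse = z_hat[:, None] + radius * (L @ circle)  # boundary on the probit scale
region = norm.cdf(ellipse)                        # back-transform into (0,1)^2
```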
We consider the general linear hypothesis $H_0: R\,FC = r$, where the $q \times 6$ matrix $R$ has rank $q \le 6$, $r$ is a $q \times 1$ vector, and $FC = (FC_{2|1}, FC_{3|1}, FC_{1|2}, FC_{3|2}, FC_{1|3}, FC_{2|3})'$. For example, by taking

$$
R = \begin{pmatrix} 1 & -1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & -1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & -1 \end{pmatrix}
$$

and $r = 0$ one tests the null hypothesis that states that the misclassification probabilities are equal within each group. By taking

$$
R = \begin{pmatrix} 0 & 0 & 1 & -1 & 0 & 0 \end{pmatrix}
$$

and $r = 0$ one tests the null hypothesis that states that the probability of under-classification equals the probability of over-classification for group 2. By taking

$$
R = \begin{pmatrix} C_{2|1} & C_{3|1} & 0 & 0 & 0 & 0\\ 0 & 0 & C_{1|2} & C_{3|2} & 0 & 0\\ 0 & 0 & 0 & 0 & C_{1|3} & C_{2|3} \end{pmatrix}
$$

one tests the null hypothesis that the expected costs incurred for each group separately equal some pre-specified values in the $3 \times 1$ vector $r$. By taking

$$
R = \begin{pmatrix} \pi_1 & \pi_1 & \pi_2 & \pi_2 & \pi_3 & \pi_3 \end{pmatrix}
$$

one tests the null hypothesis that the overall probability of disagreement with the gold standard in the target population is equal to a pre-specified value $r$.
To test the general linear hypothesis we can employ the Wald test; that is, for a given significance level $\alpha$, we reject the null hypothesis if

$$
\left(R\,\widehat{FC} - r\right)'\left(R\,\hat\Sigma_{FC}\,R'\right)^{-1}\left(R\,\widehat{FC} - r\right) > \chi^2_{q,\,1-\alpha}, \tag{20}
$$

where $\hat\Sigma_{FC}$ equals the estimated $6 \times 6$ covariance matrix of the estimated untransformed FC rates, which can be obtained by the delta method from $\hat V$, that is,

$$
\hat\Sigma_{FC} = \hat D\,\hat V\,\hat D, \qquad \hat D = \operatorname{diag}\left\{\phi\!\left(\hat Z_{2|1}\right), \phi\!\left(\hat Z_{3|1}\right), \phi\!\left(\hat Z_{1|2}\right), \phi\!\left(\hat Z_{3|2}\right), \phi\!\left(\hat Z_{1|3}\right), \phi\!\left(\hat Z_{2|3}\right)\right\}, \tag{21}
$$

and similarly for the population counterpart $\Sigma_{FC}$.
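The Wald test in (20), together with the delta-method conversion in (21), can be sketched as follows; the inputs (the estimated FC vector, its probit-scale covariance $\hat V$, and the matrices $R$ and $r$) are assumed to be supplied by the user.

```python
# Sketch of the Wald test in (20) for H0: R FC = r, with the untransformed
# covariance recovered from the probit-scale covariance V as in (21).
import numpy as np
from scipy.stats import norm, chi2

def wald_test(fc_hat, V, R, r):
    """Return the Wald statistic and its p-value for H0: R FC = r."""
    d = norm.pdf(norm.ppf(fc_hat))            # phi(probit(FC)): delta method
    Sigma_fc = np.diag(d) @ V @ np.diag(d)    # covariance of untransformed FC
    diff = R @ fc_hat - r
    W = diff @ np.linalg.solve(R @ Sigma_fc @ R.T, diff)
    return W, chi2.sf(W, df=R.shape[0])

# example R: equality of under- and over-classification for group 2, with FC
# ordered as (FC21, FC31, FC12, FC32, FC13, FC23); use r = np.zeros(1)
R = np.array([[0., 0., 1., -1., 0., 0.]])
```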
A potentially interesting clinical question within a given disease group is whether one FC rate is significantly higher than the other. In particular, we need to make inference on the direction of the falsely classified individuals that belong to a certain true disease status. For this purpose, we test the null hypothesis $H_0: FC_{j|i} = FC_{j'|i}$, where $j \neq j'$ and $j, j' \neq i$, versus either the two-sided alternative or, if warranted, a one-sided alternative. For the two-sided alternative we can use the Wald test in (20). For this particular hypothesis we prefer to conduct asymptotic Z tests on the probit scale. The probit scale has been recommended before in ROC settings (see Bantis and Feng (2016, 2018)). Hence, we use the test statistic:

$$
Z^{*} = \frac{\Phi^{-1}\!\left(\widehat{FC}_{j|i}\right) - \Phi^{-1}\!\left(\widehat{FC}_{j'|i}\right)}{\sqrt{\widehat{\mathrm{Var}}\!\left(\hat Z_{j|i}\right) + \widehat{\mathrm{Var}}\!\left(\hat Z_{j'|i}\right) - 2\,\widehat{\mathrm{Cov}}\!\left(\hat Z_{j|i}, \hat Z_{j'|i}\right)}}, \tag{22}
$$

which asymptotically follows a $N(0, 1)$ distribution under the null hypothesis. The components involved in $Z^{*}$ have already been derived. The expression above implies that for the marginal 95% confidence interval for each $FC_{j|i}$ we use:

$$
\left[\Phi\!\left(\hat Z_{j|i} - z_{0.975}\sqrt{\widehat{\mathrm{Var}}\!\left(\hat Z_{j|i}\right)}\right),\;
\Phi\!\left(\hat Z_{j|i} + z_{0.975}\sqrt{\widehat{\mathrm{Var}}\!\left(\hat Z_{j|i}\right)}\right)\right]. \tag{23}
$$
Such a strategy of transforming using $\Phi^{-1}(\cdot)$ and back-transforming the endpoints of the confidence intervals using $\Phi(\cdot)$ has also been discussed in Bantis and Feng (2016), and is preferred over traditional marginal confidence intervals that operate directly on the true or false class rates.
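A minimal sketch of (22) and (23) follows; the FC estimates and the variance/covariance inputs, which in practice are the corresponding entries of $\hat V$, are hypothetical values here.

```python
# Sketch of the probit-scale Z* statistic in (22) and the back-transformed
# marginal confidence interval in (23).
import numpy as np
from scipy.stats import norm

def z_star(fc_a, fc_b, var_a, var_b, cov_ab):
    """Z* for H0: FC_a = FC_b, computed on the probit scale."""
    num = norm.ppf(fc_a) - norm.ppf(fc_b)
    return num / np.sqrt(var_a + var_b - 2*cov_ab)

def probit_ci(fc, var, level=0.95):
    """Marginal CI (23): probit-transform, build a z-interval, back-transform."""
    z = norm.ppf(fc)
    half = norm.ppf(1 - (1 - level)/2) * np.sqrt(var)
    return norm.cdf(z - half), norm.cdf(z + half)

z = z_star(0.16, 0.41, 0.045, 0.032, 0.008)   # hypothetical inputs
p_value = 2 * norm.sf(abs(z))                 # two-sided p-value
```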
We note that our methods can be trivially extended to the case of known prevalences, using equations in (11). However, a global minimum of the expected cost function may not be obtained in some scenarios and additional restrictions are imposed on the parameter space that are not accounted for in this paper. These issues will be investigated by the authors in the future.
Finally, we note here that when one assumes a proper Normal biomarker, that is, $\sigma_1 = \sigma_2 = \sigma_3 = \sigma$, maximizing the Youden index yields $c_1 = (\mu_1 + \mu_2)/2$ and $c_2 = (\mu_2 + \mu_3)/2$. Our methods extend easily to this case by applying the delta method using the estimated parameter vector $\hat\theta = (\hat\mu_1, \hat\mu_2, \hat\mu_3, \hat\sigma)'$, where $\hat\sigma$ is the usual pooled sample standard deviation of the classical one-way ANOVA. Furthermore, in the case of a proper Normal biomarker, cost-adjusted and prevalence-adjusted cutoffs are easily obtained as unique solutions of the equations in (13), which are solved numerically. The methods described in this section can then be extended by appealing to the delta method for implicitly defined random variables, as discussed in [21].
4. Box-Cox approach
Even though the normality assumption allows closed-form expressions for the cutoffs and the subsequent inferences, it is most often not justified by the data at hand. However, it might be the case that a monotone transformation can yield measurements that are approximately normal. The so-called Box-Cox transformation [22] has been used in the ROC literature before ([23]–[27]). It is defined by:

$$
Y_1^{(\lambda)} = \begin{cases} \dfrac{Y_1^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt] \log Y_1, & \lambda = 0, \end{cases}
$$

and similarly for $Y_2^{(\lambda)}$ and $Y_3^{(\lambda)}$.
The corresponding log-likelihood function to be maximized is:

$$
\ell(\lambda, \gamma) = \sum_{i=1}^{3}\left\{-\frac{n_i}{2}\log\!\left(2\pi\sigma_i^2\right) - \sum_{k=1}^{n_i}\frac{\left(Y_{ik}^{(\lambda)} - \mu_i\right)^2}{2\sigma_i^2} + (\lambda - 1)\sum_{k=1}^{n_i}\log Y_{ik}\right\}, \tag{24}
$$

where $\gamma = (\mu_1, \sigma_1, \mu_2, \sigma_2, \mu_3, \sigma_3)'$. For a given $\lambda$, the estimates of all parameters in $\gamma$ can be derived in closed form. An estimate of $\lambda$ is derived by maximizing the underlying profile likelihood (for the computational details see Bantis et al. 2014). The $7 \times 7$ observed information matrix $I$ is of the form:

$$
I = -\begin{pmatrix}
\dfrac{\partial^2 \ell}{\partial\gamma\,\partial\gamma'} & \dfrac{\partial^2 \ell}{\partial\gamma\,\partial\lambda}\\[8pt]
\dfrac{\partial^2 \ell}{\partial\lambda\,\partial\gamma'} & \dfrac{\partial^2 \ell}{\partial\lambda^2}
\end{pmatrix},
$$

evaluated at $(\hat\lambda, \hat\gamma)$. By extracting the upper-left $6 \times 6$ part of $I^{-1}$, we can obtain an estimate of the variance-covariance matrix of the vector $\hat\gamma$. Then, statistical inference proceeds as in the previous section.
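A minimal sketch of this profile-likelihood step is given below, assuming a common $\lambda$ across the three groups and plugging the closed-form normal MLEs into (24); the toy lognormal data are used only for illustration.

```python
# Sketch: profile-likelihood estimation of a common Box-Cox lambda across the
# three groups, with the per-group normal MLEs plugged into (24).
import numpy as np
from scipy.optimize import minimize_scalar

def bc(y, lam):
    """Box-Cox transform of a positive sample."""
    return (y**lam - 1) / lam if abs(lam) > 1e-8 else np.log(y)

def neg_profile_loglik(lam, groups):
    ll = 0.0
    for y in groups:                       # each y must be positive
        t = bc(y, lam)
        s2 = np.mean((t - t.mean())**2)    # MLE of sigma_i^2 given lambda
        n = len(y)
        ll += -0.5*n*np.log(2*np.pi*s2) - 0.5*n + (lam - 1)*np.sum(np.log(y))
    return -ll

rng = np.random.default_rng(1)
groups = [rng.lognormal(m, 0.5, 100) for m in (0.0, 0.7, 1.4)]   # toy data
lam_hat = minimize_scalar(neg_profile_loglik, bounds=(-2, 2),
                          args=(groups,), method="bounded").x
```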
5. Non-Parametric Kernel-Based approach
Even though the Box-Cox approach provides a more flexible parametric alternative to the restrictive normality assumption, it may still fail to transform the data to approximate normality. In those cases, a non-parametric approach is preferable. Here we explore a kernel-based approach (see [28] for an overview). A kernel density estimate for the measurements that correspond to the first group, i.e. $Y_{1i}$, is given by:

$$
\hat f_1(y) = \frac{1}{n_1 h_1}\sum_{i=1}^{n_1} K\!\left(\frac{y - Y_{1i}}{h_1}\right), \tag{25}
$$

and the expressions are similar for $\hat f_2$ and $\hat f_3$ that correspond to groups 2 and 3. We employ the normal kernel, i.e. $K(u) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-u^2/2\right)$, and $h_1$, $h_2$, and $h_3$ are the bandwidths for groups 1, 2, and 3, respectively. These bandwidths can be found using Silverman's rule and are given by $h_i = 0.9\,\min\!\left(sd_i,\, iqr_i/1.34\right) n_i^{-1/5}$, where $sd$ and $iqr$ refer to the standard deviation and inter-quartile range, respectively. Silverman [28] presents bandwidths that are optimal in terms of the asymptotic mean integrated squared error (AMISE) when normal kernels are employed.
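The following sketch illustrates Silverman's rule and the resulting smooth cdf estimates, from which all six FC rates at a given pair of cutoffs follow; for simplicity it omits the Box-Cox pre-transformation of Steps 1–2 below and uses toy normal samples.

```python
# Sketch: Silverman rule-of-thumb bandwidths and normal-kernel cdf estimates,
# from which the six FC rates at given cutoffs follow (a minimal version of
# Steps 1-3 below, without the Box-Cox pre-transformation).
import numpy as np
from scipy.stats import norm, iqr

def silverman_h(y):
    return 0.9 * min(np.std(y, ddof=1), iqr(y) / 1.34) * len(y)**(-0.2)

def kernel_cdf(y, h):
    """Smooth cdf estimate: average of normal-kernel cdfs centered at the data."""
    return lambda c: np.mean(norm.cdf((c - y) / h))

rng = np.random.default_rng(2)
y1, y2, y3 = (rng.normal(m, 1, 100) for m in (0, 2, 4))    # toy samples
F1, F2, F3 = (kernel_cdf(y, silverman_h(y)) for y in (y1, y2, y3))

c1, c2 = 1.0, 3.0                                          # assumed cutoffs
fc = {"2|1": F1(c2) - F1(c1), "3|1": 1 - F1(c2),
      "1|2": F2(c1),          "3|2": 1 - F2(c2),
      "1|3": F3(c1),          "2|3": F3(c2) - F3(c1)}
```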
To obtain the kernel-based estimates of all FC rates and their covariance matrix, we proceed with the following steps:

Step 1: Apply the Box-Cox transformation using $Y_1$, $Y_2$, and $Y_3$ to derive the estimate $\hat\lambda$ of the common transformation parameter, and hence the Box-Cox transformed scores $Y_1^{(\hat\lambda)}$, $Y_2^{(\hat\lambda)}$, and $Y_3^{(\hat\lambda)}$.

Step 2: Based on $Y_1^{(\hat\lambda)}$, $Y_2^{(\hat\lambda)}$, and $Y_3^{(\hat\lambda)}$, employ the kernel density estimate in (25) to estimate $f_1$, $f_2$, and $f_3$.

Step 3: Based on the kernel estimates of $f_1$, $f_2$, $f_3$, calculate the kernel-based estimates of all 6 FC rates, i.e. $\widehat{FC}_{2|1}$, $\widehat{FC}_{3|1}$, $\widehat{FC}_{1|2}$, $\widehat{FC}_{3|2}$, $\widehat{FC}_{1|3}$, $\widehat{FC}_{2|3}$.

Step 4: Sample with replacement separately from each of $Y_1$, $Y_2$, and $Y_3$ and obtain the bootstrap sample for the current $b$-th bootstrap iteration, denoted by $Y_1^{(b)}$, $Y_2^{(b)}$, and $Y_3^{(b)}$.

Step 5: Apply the Box-Cox transformation using $Y_1^{(b)}$, $Y_2^{(b)}$, and $Y_3^{(b)}$ in order to derive the current common transformation parameter $\hat\lambda^{(b)}$, and hence the Box-Cox transformed bootstrap samples.

Step 6: Based on the transformed bootstrap samples, apply the kernel density estimate (with the bandwidth calculation based on the current samples) to obtain estimates of the densities, $\hat f_1^{(b)}$, $\hat f_2^{(b)}$, and $\hat f_3^{(b)}$.

Step 7: Using the density estimates of the previous step, calculate all 6 FC rates for the current bootstrap sample: $\widehat{FC}_{2|1}^{(b)}$, $\widehat{FC}_{3|1}^{(b)}$, $\widehat{FC}_{1|2}^{(b)}$, $\widehat{FC}_{3|2}^{(b)}$, $\widehat{FC}_{1|3}^{(b)}$, $\widehat{FC}_{2|3}^{(b)}$.

Step 8: Repeat steps 4–7 $B$ times and, based on the $B$ bootstrap estimates of all 6 FC rates, calculate the $6 \times 6$ empirical covariance matrix that corresponds to the probit-transformed FC rates.
In practice, we employ B = 400 bootstrap samples. Statistical inference then proceeds similarly to section 3.
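Steps 4–8 can be sketched as follows; `fc_rates()` is a hypothetical helper standing in for the Steps 1–3 pipeline (Box-Cox fit followed by kernel estimation), and degenerate estimates of exactly 0 or 1 would need truncation before the probit.

```python
# Sketch of Steps 4-8: bootstrap the probit-transformed kernel FC estimates
# and form their empirical 6x6 covariance matrix.
import numpy as np
from scipy.stats import norm

def bootstrap_cov(y1, y2, y3, fc_rates, B=400, seed=0):
    """fc_rates(s1, s2, s3) is assumed to return the 6 FC estimates."""
    rng = np.random.default_rng(seed)
    Z = np.empty((B, 6))
    for b in range(B):
        s1 = rng.choice(y1, size=len(y1), replace=True)   # resample per group
        s2 = rng.choice(y2, size=len(y2), replace=True)
        s3 = rng.choice(y3, size=len(y3), replace=True)
        # probit scale; FC estimates of exactly 0 or 1 need truncation first
        Z[b] = norm.ppf(fc_rates(s1, s2, s3))
    return np.cov(Z, rowvar=False)                        # empirical 6x6 V-hat
```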
Based on the kernel estimates in Step 3 and the estimated covariance matrix $\hat V$ from Step 8, we can proceed with the construction of the elliptical confidence region given in (19) and back-transform its coordinates using $\Phi(\cdot)$. To do so, we employ the $2 \times 2$ kernel version of $\hat V_i$, which is simply extracted from the full $6 \times 6$ bootstrap-based $\hat V$. In order to compare two FC rates we employ $Z^{*}$. All expressions involved in its denominator can be extracted from $\hat V$, while the numerator expressions are simply the probit transformations of the estimates of Step 3. Note that both the point estimates and their variances take into account the variability of the estimated Box-Cox transformation parameter. For the kernel-based marginal confidence intervals we rely on (23), where all estimates are kernel-based and all variance estimates are derived through the bootstrap described in this section.
6. Simulation Studies
Our simulation studies consist of two parts: (a) investigation of the coverage and width (or area) of our confidence intervals (or regions) for all true and FC rates under different parametric scenarios with all proposed approaches, and (b) investigation of the size and power of our proposed statistic $Z^{*}$ under various scenarios.
6.1. Confidence intervals for all classification rates
We consider generating data based on four settings: i) all three groups are generated from three different normal distributions, ii) all three groups are generated from three different log-normal distributions, iii) all three groups are generated from three different gamma distributions, and iv) groups 1 and 2 are generated from two normal distributions, while group 3 is generated from a two-component normal mixture that exhibits bimodality. For configurations i) and ii) we have considered different volumes under the ROC surface (VUS values) equal to 0.5 and 0.7. The sample sizes we consider for each group are 50, 100, and 200. The parameter values of the distributions for each scenario are given in Tables 1 and 2 of the Web Appendix.
We observe that when the data are generated from normal distributions (configuration (i)), the delta-based approach that relies on normality performs best, as expected. The coverage is close to the targeted one (i.e. 95%) in all cases, yielding systematically smaller widths (or areas) for the proposed marginal CIs and joint egg-shaped confidence regions compared to the Box-Cox based approach or the kernel-based method. We observe that the Box-Cox approach yields slightly inflated widths (or confidence areas) because it takes into account the variability of the extra parameter $\lambda$. Even though the kernel-based method provides good coverage properties as well, these come at the cost of even larger widths and areas. See Table 3 of the Web Appendix for all numerical results.
For configurations (ii) and (iii), where the data are generated from log-normal or gamma distributions, we observe that the Box-Cox approach performs best. Note that the gamma model lies outside the Box-Cox family, and our results indicate that the Box-Cox based approach is flexible enough to capture such violations of normality. It outperforms the non-parametric kernel-based (KernelBC) approach in terms of widths/areas. The flexibility of the Box-Cox approach has been acknowledged before under different settings (see [25], [26]). See Table 4 of the Web Appendix for the corresponding numerical results.
With configuration (iv) we explore a setting that exhibits a severe violation of normality, with the third group exhibiting bimodality (Table 2 of the Web Appendix). In this setting, we observe that the Box-Cox based approach performs poorly for several classification rates, indicating its limited use for more complex distributions. For this scenario, the kernel-based approach performs best, illustrating its robustness (Table 4 of the Web Appendix).
6.2. Power and size simulation
In the second part of our simulations, we explore the size and the power of $Z^{*}$ for testing the null hypothesis $H_0: FC_{1|2} = FC_{3|2}$. We focus on this null hypothesis as it is often of clinical interest to know whether misdiagnosis of the benign stage tends towards the healthy group 1 or the aggressive group 3. We consider this setting more common than misclassifying individuals from group 1 to group 3 and vice versa, even though our proposed statistic accommodates any comparison. To explore the size, we consider sample sizes of 30, 50, 100, and 200. The data are generated based on the configurations discussed in the previous subsection so that the difference $FC_{1|2} - FC_{3|2}$ equals 0, 0.05, 0.10, 0.15, and 0.20. The distributions considered to generate the data are given in Table 2 of the Web Appendix. All power-related numerical results of the simulations are reported in Table 5 of the Web Appendix.
Our findings are analogous to those presented in the previous subsection and are summarized in Figure 4. We note that the plots included in Figure 4 are comparable column-wise, as different operating points are considered for each distributional scenario; this is because we are interested in fixing specific values for the difference $FC_{1|2} - FC_{3|2}$. When the score measurements are generated from normal models, our delta-based approach performs best in terms of power, as expected. For gamma distributed scores, the Box-Cox based approach achieves higher power compared to the kernel approach and is fairly robust in such situations, even though the gamma distribution lies outside the Box-Cox family. However, for more complex models (such as the bimodal mixture of configuration (iv)) the Box-Cox approach collapses and does not attain the nominal size. The kernel approach is the most robust, yielding satisfactory size and power in all scenarios.
Figure 4.
Power attained by $Z^{*}$ when testing the hypothesis $H_0: FC_{1|2} = FC_{3|2}$. The difference $FC_{1|2} - FC_{3|2}$ is set to be 0, 0.05, 0.10, 0.15, and 0.20. Each panel refers to a specific simulation scenario that involves data generated by normal distributions, log-normal distributions, gamma distributions, or a bimodal mixture. The sample sizes considered are (50, 50, 50), (100, 100, 100), and (200, 200, 200).
As requested by a referee, we also considered simulations that assume normality when we generate data from the scenarios that involve gamma distributions, lognormal distributions, and the bimodal mixture. For the gamma and the lognormal related scenarios, the assumption of normality yields a size that is far from the targeted 0.05 level and generally > 0.90 in all cases. For the bimodal mixture scenario, wrongly assuming normality yields sizes of 0.133, 0.242, and 0.442 for sample sizes (50,50,50), (100,100,100) and (200,200,200) respectively (see also Table 5 of the Web Appendix).
In summary, we recommend the following strategy for a given application. Test for normality each one of the three groups under study. If normality is justified, proceed with the delta-based approach. If normality is not justified, test the Box-Cox transformed scores for normality and, if successful, the Box-Cox based approach is recommended. In all other situations the kernel-based approach is preferred. A sketch of this strategy is given below.
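The sketch uses the Anderson-Darling test at the 5% level; `bc_transform()` is a hypothetical helper that applies the fitted common Box-Cox transformation of section 4.

```python
# Sketch of the recommended strategy: Anderson-Darling normality checks decide
# among the delta-based, Box-Cox, and kernel-based approaches.
from scipy.stats import anderson

def choose_approach(groups, bc_transform):
    def all_normal(ys):
        for y in ys:
            res = anderson(y, dist="norm")
            # index 2 of critical_values is the 5% level for dist="norm"
            if res.statistic >= res.critical_values[2]:
                return False
        return True
    if all_normal(groups):
        return "delta-based (normality)"
    if all_normal([bc_transform(y) for y in groups]):
        return "Box-Cox"
    return "kernel-based"
```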
7. Pancreatic Cancer Application
CA19-9 is currently in clinical use as a pancreatic ductal adenocarcinoma (PDAC) biomarker. However, its performance is limited in detecting early-stage disease. Therefore, there is clinical interest in identifying new promising markers in this direction. A study presented in [29] investigates a set of protein-based markers that could potentially outperform CA19-9. The authors carry out sequential validations, starting with 17 protein biomarker candidates, to determine which markers can improve detection of early-stage disease. Candidate biomarkers are subjected to enzyme-linked immunosorbent assay based sequential validation using independent multiple-sample cohorts consisting of PDAC cases ($n_3 = 87$), benign pancreatic disease ($n_2 = 93$), and healthy controls ($n_1 = 169$). A marker considered in this study is insulin-like growth factor binding protein 2 (IGFBP2), the product of a protein-coding gene [29]. The authors of this study provided a pairwise analysis.
Here, we focus on evaluating IGFBP2 in simultaneously discriminating all three groups under study; group 1: healthy (controls), group 2: chronic pancreatitis, and group 3: PDAC. We investigate the biomarker's properties at a Youden-optimized pair of cutoffs and assess any potentially inflated FC rates. For this marker, normality is violated even after the Box-Cox transformation. We use the Anderson-Darling test for normality (see [30] and [31]); the corresponding p-values are 0.0031, 0.0005, and 0.0062 for the Box-Cox transformed groups 1, 2, and 3, respectively. Therefore, we consider our kernel approach in this application. All estimated classification rates along with their marginal confidence intervals are given in Table 1. The corresponding box-plots for all three groups and the three empirically estimated cumulative distribution functions are presented in Figure 1 of the Web Appendix. Both the ROC surface based on IGFBP2 and its performance at the Youden-based optimal operating point are illustrated in Figure 5. On the top panels (panels (a) and (b)) of Figure 5 we visualize the ROC surface along with the estimated triplet of $(TC_1, TC_2, TC_3)$ and its 95% confidence space that corresponds to the optimized cutoffs ($\hat c_1 = 0.3282$ and $\hat c_2 = 1.1987$). At the bottom left panel (panel (c)) we visualize the three 95% confidence regions for: i) $(FC_{2|1}, FC_{3|1})$ in blue, ii) $(FC_{1|2}, FC_{3|2})$ in green, and iii) $(FC_{1|3}, FC_{2|3})$ in red.

For a simultaneous visualization we propose the plot given in panel (d) of Figure 5, which we call the 'clock-plot'. The clock-plot provides an immediate visual representation of all TC and FC rates along with their marginal 95% confidence intervals, derived as described in Section 5, since the proposed kernel-based strategy is utilized for this data set. The light-blue arms (thick lines) represent the TC rates. The more extended they are, the better the biomarker's performance at the estimated cutoffs. For a perfect marker, these arms extend to the perimeter of the largest circle, which has radius equal to 1. To the left and right of each light-blue (TC) arm, we visualize the corresponding 'bleeding', that is, the FC rates that relate to that particular TC rate. This means that if we add the length of a TC arm to the sum of the lengths of the two arms on its left and right side, the result is always 1. Note that an FC rate is allowed to exceed any TC rate; their trade-off is a function of the location of the cutoffs.
Table 1.
Application results: estimated true and false classification rates for biomarker IGFBP2 regarding the pancreatic cancer data set. The optimized cutoffs are 0.3282 and 1.1987.

| Classification Rate | Kernel based point estimate | Kernel based 95% confidence interval |
|---|---|---|
| $TC_1$ | 0.6820 | 0.5211–0.8143 |
| $TC_2$ | 0.4318 | 0.2030–0.6870 |
| $TC_3$ | 0.3466 | 0.1778–0.5537 |
| $FC_{2\mid 1}$ | 0.2765 | 0.1416–0.4549 |
| $FC_{3\mid 1}$ | 0.0414 | 0.0034–0.2217 |
| $FC_{1\mid 2}$ | 0.4054 | 0.2588–0.5667 |
| $FC_{3\mid 2}$ | 0.1628 | 0.0588–0.3443 |
| $FC_{1\mid 3}$ | 0.2932 | 0.1614–0.4603 |
| $FC_{2\mid 3}$ | 0.3602 | 0.1825–0.5754 |
Figure 5.
Panel (a): 95% egg-shaped confidence space for the triplet $(TC_1, TC_2, TC_3)$. Panel (b): corresponding ROC surface for IGFBP2. Panel (c): 95% egg-shaped confidence regions for: i) $(FC_{2|1}, FC_{3|1})$ in blue, ii) $(FC_{1|2}, FC_{3|2})$ in green, and iii) $(FC_{1|3}, FC_{2|3})$ in red. Panel (d): (clock-plot) a visualization of all TC and FC rates along with their marginal confidence intervals, indicated with red stars alongside each clock arm.
In the clock-plot we can visually explore an interesting behavior concerning the misclassification rates for subjects that suffer from chronic pancreatitis (group 2). $FC_{3|2}$ is lower than 0.2, while $FC_{1|2}$ is close to 0.4. That means that, given that an individual belongs to group 2 (chronic pancreatitis), it is about twice as likely to be misclassified as healthy than as having PDAC. We also see that $TC_2$ is approximately 0.4 and approximately equal to $FC_{1|2}$. Hence, without making any formal arguments yet, we can easily observe that the reason $TC_2$ is not higher is that false decisions 'bleed' towards the healthy group as opposed to the PDAC group. That is, based on these cutoffs, it is more likely to misclassify an individual who has chronic pancreatitis as healthy rather than as a PDAC patient. A natural question is whether this difference between the two FC rates is statistically significant. More formally, we are interested in testing the null hypothesis $H_0: FC_{1|2} = FC_{3|2}$. Based on $Z^{*}$, the corresponding p-value is 0.0366, and thus we reject this null hypothesis at a nominal level of 5%.
Related to $TC_1$, which is estimated to be 0.6820, we note the following. Healthy subjects tend to be misclassified as having chronic pancreatitis (group 2) ($\widehat{FC}_{2|1} = 0.2765$) rather than as having PDAC (group 3) ($\widehat{FC}_{3|1} = 0.0414$). If we focus on PDAC patients, we observe an estimated $TC_3$ equal to 0.3466, and false decisions towards groups 1 and 2 ($\widehat{FC}_{1|3} = 0.2932$ and $\widehat{FC}_{2|3} = 0.3602$) are almost equally likely, with their difference not being statistically significant (p-value = 0.6012 based on our proposed $Z^{*}$ statistic).
For the results discussed so far we have assumed equal costs and prevalences. For illustration purposes, we also show how these results would change under a configuration where the misclassification costs and the underlying prevalences are available. Let us assume a set of unequal misclassification costs (in some arbitrary units) together with the prevalences $\pi_1 = 0.56$, $\pi_2 = 0.26$, $\pi_3 = 0.18$. In this case the equations in (5) yield two roots for the first cutoff, namely −0.5254 and 1.6413, and two roots for the second cutoff, namely −0.8633 and 3.4213 (see Figure 6). This illustrates the necessity of exploring the underlying cost function, based on which we observe that the global minimum is at $c_1 = 1.6413$ and $c_2 = 3.4213$. This global minimum is illustrated in Figure 7, and the profiles of the cost function with respect to $t_1$ and $t_2$ are presented in Figure 8. The corresponding true and false classification rates at these cutoffs then follow from (1).
Figure 6.
Visualization of the equations in (5) for the pancreatic cancer data after employing the hypothesized costs and prevalences ($\pi_1 = 0.56$, $\pi_2 = 0.26$, $\pi_3 = 0.18$). We observe that there are two roots for each of the equations. This implies the necessity of exploring the underlying cost function, which is illustrated in Figure 7.
Figure 7.
Visualization of the cost function that corresponds to the hypothesized costs and prevalences for the discussed pancreatic cancer data set. We further illustrate the profile of this function with respect to t1 and t2 in Figure 8
Figure 8.
Illustrating the profile of the cost function that refers to the hypothesized costs and prevalences for the pancreatic cancer data set with respect to t1 and t2.
8. Discussion
There is strong clinical interest in new biomarkers that could contribute to improved diagnosis of early-stage PDAC and pancreatitis. A traditional marker currently used in clinical practice is CA19-9. However, it exhibits limited utility since its accuracy varies with disease stage [32]. This underlines the need to explore new markers that can contribute in that direction. Common statistical techniques fail to reveal the strengths and shortcomings of new biomarker candidates, as these are based on overall accuracy measures such as the AUC (or the VUS in the three-class case), which provide only limited insight within a clinical decision-making framework. False classification rates play a crucial role in understanding the behavior of a biomarker. Knowing the misclassification rates, and being able to make inferences around them, provides a cornerstone for assessing clinical decision-making guidelines. Such a need is clearly reflected through the implied costs, as it is rarely the case that misclassifying towards any direction has the same impact. This heavily depends on the invasiveness of the implied work-up, the stage of the disease, and the cost of the work-up both in resources and in health-related side effects. In this paper, we provide a complete framework for evaluating all classification rates under a three-class setting at the Youden-based optimized cutoffs. In addition, our methods can take into account underlying costs and prevalences, provided some additional restrictions on the parameter space (beyond the ordering of the means) are satisfied. We note that it is challenging to know the underlying costs for a given setting, as these are affected by the cost of the test, the cost of the follow-up and/or how invasive it is, the stage of the disease and how lethal it is, and other factors.
By accommodating the underlying costs and prevalences, we discuss a generalization of the commonly used Youden index, illustrating that all three groups contribute to estimating both underlying cutoffs. This is not the case for the 3-class Youden index discussed in [11]. Even though other alternatives to the Youden index have been studied in the literature on the basis of avoiding the under-utilization of samples (see [14], [33], and [34] among others), here we illustrate that this under-utilization (of the Youden index) is simply a special case of a more general framework that, in fact, allows all three groups to contribute to the estimation of both underlying cutoffs. We study parametric, flexible parametric, and non-parametric kernel-based techniques. The current literature on the statistical evaluation of biomarkers in such settings revolves around the true classification rates, which provides only an incomplete picture of the operating characteristics of a biomarker. The work presented in this paper provides a full framework that bridges this important literature gap.
Supplementary Material
Figure 3.
The profile of the cost function visualized in Figure 2 is provided with respect to $t_1$ and $t_2$. Both profile functions refer to the unequal variances numerical example discussed in section 2.2.1.
Acknowledgements
The research of the first author is supported in part by the COBRE grant (NIH) P20GM130423. The first author is further supported in part by an R01 grant from the National Cancer Institute (R01CA260132), the Department of Defense (OC180414), the Ovarian Cancer Research Alliance, and The Honorable Tina Brozman Foundation. The first author is also supported in part by two Masonic Cancer Alliance (MCA) Partners Advisory Board grants from The University of Kansas Cancer Center (KUCC) and Children’s Mercy (CM). The contents are solely the responsibility of the authors and do not necessarily represent the official views of the MCA, KUCC, or CM. The authors would like to thank Dr. Samir Hanash for providing the data studied in the Application section. We also thank Kate Young for proof-reading the paper and two anonymous referees for their comments that substantially improved the presentation of this work.
Footnotes
Supplementary Material
Supplementary material is provided online on the journal's website. Code implementing our approaches is publicly available at the first author's website: www.leobantis.net. The actual pancreatic cancer data are not provided for reasons of confidentiality. A similar (simulated) data set is analyzed for illustration purposes and provided with our code.
References
1. Siegel RL, Miller KD, Jemal A (2017). Cancer statistics, 2017. CA: A Cancer Journal for Clinicians 67(1), 7–30.
2. Rahib L, Smith BD, Aizenberg R, Rosenzweig AB, Fleshman JM, Matrisian LM (2014). Projecting cancer incidence and deaths to 2030: the unexpected burden of thyroid, liver, and pancreas cancers in the United States. Cancer Research 74(11), 2913–2921.
3. Chari ST, Kelly K, Hollingsworth MA, et al. (2015). Early detection of sporadic pancreatic cancer: summative review. Pancreas 44(5), 693–712.
4. Aslanian HR, Lee JH, Canto MI (2020). AGA clinical practice update on pancreas cancer screening in high-risk individuals: expert review. Gastroenterology 159(1), 358–362.
5. Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.
6. Youden WJ (1950). An index for rating diagnostic tests. Cancer 3, 32–35.
7. Mossman D (1999). Three-way ROCs. Medical Decision Making 19, 78–89.
8. Dreiseitl S, Ohno-Machado L, Binder M (2000). Comparing three-class diagnostic tests by three-way ROC analysis. Medical Decision Making 20, 323–331.
9. Nakas CT, Yiannoutsos CT (2004). Ordered multiple-class ROC analysis with continuous measurements. Statistics in Medicine 23, 3437–3449.
10. Nakas CT (2014). Developments in ROC surface analysis and assessment of diagnostic markers in three-class classification problems. REVSTAT – Statistical Journal 12, 43–65.
11. Nakas CT, Dalrymple-Alford JC, Anderson TJ, Alonzo TA (2013). Generalization of Youden index for multiple-class classification problems applied to the assessment of externally validated cognition in Parkinson disease screening. Statistics in Medicine 32(6), 995–1003.
12. Luo J, Xiong C (2013). Youden index and associated cutpoints for three ordinal diagnostic groups. Communications in Statistics – Simulation and Computation 42, 1213–1234.
13. Luo J, Xiong C (2012). An R package for analyzing diagnostic tests with three ordinal groups. Journal of Statistical Software 51(3), 1–24.
14. Attwood K, Tian L, Xiong C (2014). Diagnostic thresholds with three ordinal groups. Journal of Biopharmaceutical Statistics 24(3), 608–633.
15. Yin J, Nakas CT, Tian L, Reiser B (2018). Confidence intervals for differences between volumes under receiver operating characteristic surfaces (VUS) and generalized Youden indices (GYIs). Statistical Methods in Medical Research 27(3), 675–688.
16. Carvalho VI, Branscum AJ (2018). Bayesian nonparametric inference for the three-class Youden index and its associated optimal cutoff points. Statistical Methods in Medical Research 27(3), 689–700.
17. Skaltsa K, Jover L, Carrasco JL (2010). Estimation of the diagnostic threshold accounting for decision costs and sampling uncertainty. Biometrical Journal 52(5), 676–697.
18. Skaltsa K, Jover L, Fuster D, Carrasco JL (2012). Optimum threshold estimation based on cost function in a multistate diagnostic setting. Statistics in Medicine 31(11–12), 1098–1109.
19. Batterton KA, Schubert K (2014). Confidence intervals around Bayes cost in multi-state diagnostic settings to estimate optimal performance. Statistics in Medicine 33(19), 3280–3299.
20. Hong CS, Jung ES (2018). Optimal thresholds criteria for ROC surfaces. Journal of the Korean Data and Information Science Society 24(6), 1489–1496.
21. Benichou J, Gail MH (1989). A delta method for implicitly defined random variables. The American Statistician 43(1), 41–44.
22. Box GEP, Cox DR (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B 26, 211–252.
23. Faraggi D, Reiser B (2002). Estimation of the area under the ROC curve. Statistics in Medicine 21, 3093–3106.
24. Fluss R, Faraggi D, Reiser B (2005). Estimation of the Youden index and its associated cutoff point. Biometrical Journal 47, 458–472.
25. Molodianovitch K, Faraggi D, Reiser B (2006). Comparing the areas under two correlated ROC curves: parametric and non-parametric approaches. Biometrical Journal 48, 745–757.
26. Bantis LE, Nakas CT, Reiser B (2014). Construction of confidence regions in the ROC space after the estimation of the optimal Youden index-based cutoff point. Biometrics 70, 212–223.
27. Bantis LE, Nakas CT, Reiser B, Myall D, Dalrymple-Alford JC (2017). Construction of joint confidence regions for the optimal true class fractions of receiver operating characteristic (ROC) surfaces and manifolds. Statistical Methods in Medical Research 26(3), 1429–1442.
28. Silverman BW (1998). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC.
29. Capello M, Bantis LE, Scelo G, Zhao Y, Li P, Dhillon DS, Patel NJ, Kundnani DL, Wang H, Abbruzzese JL, Maitra A, Tempero MA, Brand R, Firpo MA, Mulvihill SJ, Katz MH, Brennan P, Feng Z, Taguchi A, Hanash SM (2017). Sequential validation of blood-based protein biomarker candidates for early-stage pancreatic cancer. Journal of the National Cancer Institute 109(4), djw266.
30. Lewis PAW (1961). Distribution of the Anderson–Darling statistic. Annals of Mathematical Statistics 32, 1118–1124.
31. Stephens MA (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69, 730–737.
32. Scara S, Bottoni P, Scatena R (2015). CA 19-9: biochemical and clinical aspects. Advances in Experimental Medicine and Biology 867, 247–260.
33. Hua J, Tian L (2020). A comprehensive and comparative review of optimal cut-points selection methods for diseases with multiple ordinal stages. Journal of Biopharmaceutical Statistics 30(1), 46–68.
34. Mosier BR, Bantis LE (2021). Estimation and construction of confidence intervals for biomarker cutoff-points under the shortest Euclidean distance from the ROC surface to the perfection corner. Statistics in Medicine, in press, doi:10.1002/sim.9077.