Accurate likelihood inference for the volume under the ROC surface

Erlis Ruli; Laura Ventura

Cancer Reports. 2019 Dec 10;3(4):e1206. doi: 10.1002/cnr2.1206

PMCID: PMC7941487 PMID: 32794638

Abstract

Background: With three ordered diagnostic categories, the volume under the receiver operating characteristic (ROC) surface, which is the extension of the area under the ROC curve for binary diagnostic outcomes, is the most commonly used measure for the overall diagnostic accuracy. For a continuous‐scale diagnostic test, classical likelihood‐based inference about the area under the ROC curve can be inaccurate, in particular when the sample size is small, and higher order inferential procedures have been proposed.

Aim: The goal of this paper is to illustrate higher order likelihood procedures for parametric inference in small samples, which provide accurate point estimates and confidence intervals for the volume under the ROC surface.

Methods: Simulation studies are performed in order to illustrate the accuracy of the proposed methodology, and two applications to real data are discussed.

Results: We show that modern likelihood inference provides refinements to classical inferential results. Furthermore, the freely available R package likelihoodAsy now makes its use almost automatic.

Conclusion: Modern likelihood inference based on higher‐order asymptotic methods for the volume under the ROC surface provides refinements to classical inferential results. A possible limitation of higher‐order asymptotic methods for practical use is that their software implementation can be awkward. Nevertheless, use of the freely available R package likelihoodAsy makes such implementation straightforward.

Keywords: AUC, diagnostic accuracy, higher order likelihood inference, small sample size, stress‐strength model, VUS

1. INTRODUCTION

Diagnostic testing is an extremely important issue in medical care. Typically, medical diagnosis involves the classification of patients into two or more categories. In particular, when subjects are categorised in two groups (such as nondiseased vs diseased), receiver operating characteristic (ROC) analysis has been extensively studied as an important statistical tool for evaluating the accuracy of continuous diagnostic tests. The area under the ROC curve (AUC) is one of the most commonly used indices for summarising diagnostic accuracy (see, among others, the previous studies1, 2, 3, 4 and references therein).

In many practical situations, medical diagnosis is not limited to a binary choice but has to deal with ordinal three‐category classification problems (such as nondiseased, intermediate, or diseased). An example (see Section 4) is a study in which it is of interest to test the mitosis‐specific antibody phosphohistone H3 (PHH3) as a diagnostic criterion in meningioma grading, with n = 70 patients with meningiomas classified according to the World Health Organisation (WHO) grades 1, 2, and 3.5 When the disease status belongs to one of three ordered categories, the performance of a diagnostic test is assessed through the ROC surface and the volume under the surface (VUS), which extends the AUC to ordinal three‐category outcomes; see, among others, the previous studies6, 7, 8, 9 and references therein.

Commonly used parametric point and interval estimators of the VUS are based on classical likelihood methods. However, it is well known that first‐order inference can be inaccurate, in particular when the sample size is small or in presence of several nuisance parameters; see, for instance, Cortese and Ventura10 for the AUC. With this motivation, in this paper, we illustrate higher‐order likelihood‐based procedures for parametric inference in small samples (see, e.g., Brazzale et al.11), which provide accurate point estimators and confidence intervals (CIs) for the VUS. Two simulation studies are performed in order to illustrate the accuracy of the proposed methodology, and two applications to real data are discussed.

The rest of the paper is organised as follows. Existing VUS estimators are briefly reviewed in Section 2. Section 3 discusses higher order likelihood‐based procedures for parametric inference on the VUS. Section 4 presents simulation results and applications to real data examples. Section 5 provides a concluding discussion.

2. BACKGROUND

This section introduces the preliminaries on VUS and briefly reviews existing nonparametric and parametric VUS estimators.

Suppose it is of interest to evaluate the predictive ability of a continuous diagnostic test in a context where the disease status of a patient refers to an ordinal‐three category classification problem (such as nondiseased, intermediate, or diseased). In this context, the ROC surface has been proposed to assess the accuracy of tests with three ordinal diagnostic categories.

2.1. Volume under the ROC surface

Let X, Y, and Z denote the scores resulting from a diagnostic test, and let $F_{x}$, $F_{y}$, and $F_{z}$ be the corresponding cumulative distribution functions for nondiseased, intermediate, and diseased subjects, respectively. Assume the results of a diagnostic test are measured on a continuous scale and higher values indicate greater severity of the disease. Given a pair of cut‐off values $k_{1}$ and $k_{2}$ ($k_{1} < k_{2}$), let $δ_{x} = F_{x}(k_{1})$ and $δ_{z} = 1 - F_{z}(k_{2})$ be the true classification rates for the nondiseased and diseased categories, respectively. Then, the probability that a randomly selected patient from the intermediate category has a score between $k_{1}$ and $k_{2}$ is as follows:

δ_{y} = F_{y} (k_{2}) - F_{y} (k_{1}) = F_{y} [F_{z}^{- 1} (1 - δ_{z})] - F_{y} [F_{x}^{- 1} (δ_{x})] .

(1)

The triplet $(δ_{x}, δ_{y}, δ_{z})$, where $δ_{y} = δ_{y}(δ_{x}, δ_{z})$ is a function of $(δ_{x}, δ_{z})$, produces a ROC surface in the three‐dimensional space for all possible $(k_{1}, k_{2}) \in \mathbb{R}^{2}$, ie,

ROC (δ_{x}, δ_{z}) = F_{y} [F_{z}^{- 1} (1 - δ_{z})] - F_{y} [F_{x}^{- 1} (δ_{x})] .

As the ROC curve for a binary diagnosis represents the trade‐off between sensitivity and specificity for the two categories (nondiseased and diseased), the ROC surface represents the three‐way trade‐off among the correct classification probabilities for the three categories.

The VUS has been advanced as an index useful for summarising the overall diagnostic accuracy of the diagnostic test, and it is given by the following:

VUS = \int_{0}^{1} \int_{0}^{1 - F_{z} [F_{x}^{- 1} (δ_{x})]} \{F_{y} [F_{z}^{- 1} (1 - δ_{z})] - F_{y} [F_{x}^{- 1} (δ_{x})]\} d δ_{z} d δ_{x} .

(2)

It is immediate to see that this is a generalisation of the AUC for the ROC curve under a binary classification. The VUS is mathematically equivalent to the probability P(X<Y<Z), when X, Y, and Z are randomly selected from each diagnostic category, respectively. Values of VUS range from 1/6 for a useless test (the test is no better than chance alone) to 1 for a perfect test (the test perfectly discriminates among the three categories); see Figure 1. For more details see Xiong et al.7

Figure 1. Hypothetical distributions of X, Y, and Z in the case of a useless test (left) and in the case of a perfect test (right)

2.2. Nonparametric estimation

Assume the sample sizes for nondiseased, intermediate, and diseased patients are n _x, n _y, and n _z, respectively. The natural nonparametric estimator of VUS is the Mann‐Whitney estimator of the probability P(X<Y<Z), ie,

MW = \frac{1}{n_{x} n_{y} n_{z}} \sum_{i = 1}^{n_{x}} \sum_{j = 1}^{n_{y}} \sum_{k = 1}^{n_{z}} I (X_{i} < Y_{j} < Z_{k}),

(3)

where I(A) is the indicator function for the event A.6 In Li and Zhou8 a nonparametric estimator of ROC surface is proposed by replacing all the cumulative distribution functions in (2) with their empirical counterparts. Thus, the estimated VUS is given by the following:

\hat{VUS} = \int_{0}^{1} \int_{0}^{1 - {\hat{F}}_{z} [{\hat{F}}_{x}^{- 1} (δ_{x})]} \{{\hat{F}}_{y} [{\hat{F}}_{z}^{- 1} (1 - δ_{z})] - {\hat{F}}_{y} [{\hat{F}}_{x}^{- 1} (δ_{x})]\} d δ_{z} d δ_{x},

(4)

where ${\hat{F}}_{x}$, ${\hat{F}}_{y}$, and ${\hat{F}}_{z}$ are the empirical distribution functions for the diagnostic measures from the nondiseased, intermediate, and diseased categories, respectively. In Li and Zhou,8 it is shown that this empirical plug‐in estimator of VUS is asymptotically unbiased.
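As an illustration of the Mann‐Whitney estimator (3), the following minimal sketch counts the correctly ordered triples directly. The function name `mann_whitney_vus` and the toy scores are hypothetical and are not taken from the paper or its Supporting Information; strict inequalities are used, as is appropriate for continuous scores without ties.

```python
from itertools import product


def mann_whitney_vus(x, y, z):
    """Nonparametric VUS estimate: share of triples with X_i < Y_j < Z_k."""
    if not (x and y and z):
        raise ValueError("each group needs at least one observation")
    # Sum of the indicator I(X_i < Y_j < Z_k) over all n_x * n_y * n_z triples.
    ordered = sum(1 for xi, yj, zk in product(x, y, z) if xi < yj < zk)
    return ordered / (len(x) * len(y) * len(z))


# Hypothetical toy scores (not data from the paper).
nondiseased = [0.8, 1.1, 1.4]
intermediate = [1.2, 2.0, 2.3]
diseased = [2.1, 3.5, 4.0]
vus_hat = mann_whitney_vus(nondiseased, intermediate, diseased)
```

The cost grows as the product of the three sample sizes, which is unproblematic for the small samples considered here.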

2.3. Parametric estimation

In a general parametric approach, let $f_{x}(x; θ_{x})$, $f_{y}(y; θ_{y})$, and $f_{z}(z; θ_{z})$ be the probability density functions of X, Y, and Z, respectively, indexed by the unknown parameters $θ_{x} \in Θ_{x} \subseteq \mathbb{R}^{k_{x}}$, $θ_{y} \in Θ_{y} \subseteq \mathbb{R}^{k_{y}}$, and $θ_{z} \in Θ_{z} \subseteq \mathbb{R}^{k_{z}}$, with $k_{x}, k_{y}, k_{z} \geq 1$. Moreover, let $F_{x}(x; θ_{x})$, $F_{y}(y; θ_{y})$, and $F_{z}(z; θ_{z})$ be the corresponding distribution functions. Then, the VUS is given by the following:

VUS = VUS (θ) = \int_{- \infty}^{\infty} F_{x} (y; θ_{x}) [1 - F_{z} (y; θ_{z})] f_{y} (y; θ_{y}) d y,

(5)

which can be expressed as a function of the entire parameter θ=(θ _x,θ _y,θ _z). Note that the VUS can be expressed as a reparametrisation of θ of the form θ=(ψ,λ), where ψ=ψ(θ) is the scalar parameter of interest (5) and λ=λ(θ) is the (k−1)‐dimensional nuisance parameter, with k=k _x+k _y+k _z. The maximum likelihood estimate (MLE) of VUS can then be obtained by substituting the MLE of θ into (5).

As an example, assume that $X \sim N (μ_{x}, σ_{x}^{2})$ , $Y \sim N (μ_{y}, σ_{y}^{2})$ , and $Z \sim N (μ_{z}, σ_{z}^{2})$ . As in Xiong et al.7, the VUS can be expressed as follows:

VUS = \int_{- \infty}^{\infty} Φ (a s - b) Φ (- c s + d) ϕ (s) d s,

(6)

where $a = σ_{y}/σ_{x}$, $b = (μ_{x} - μ_{y})/σ_{x}$, $c = σ_{y}/σ_{z}$, $d = (μ_{z} - μ_{y})/σ_{z}$, and Φ(·) and ϕ(·) are the standard normal distribution and density functions, respectively. The MLE of VUS can then be obtained by substituting the sample means and the sample standard deviations into (6). When the normal assumptions are not satisfied, since the VUS is invariant under monotonic transformations, a Box‐Cox–type transformation can be applied to the data, and the normality‐based method for estimating the VUS is then used on the transformed data (see, e.g., Kang and Tian9).
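The integral (6) has no closed form, but it is one‐dimensional and easy to evaluate numerically. The sketch below is only an illustration, not the authors' code: it uses composite Simpson quadrature on a truncated range, and the function name `vus_normal` and the quadrature settings (`half_width`, `n_steps`) are assumptions made here.

```python
import math


def std_normal_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def std_normal_pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def vus_normal(mu_x, sd_x, mu_y, sd_y, mu_z, sd_z,
               half_width=10.0, n_steps=4000):
    """Evaluate (6) by composite Simpson quadrature on [-half_width, half_width]."""
    if min(sd_x, sd_y, sd_z) <= 0:
        raise ValueError("standard deviations must be positive")
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")
    a = sd_y / sd_x
    b = (mu_x - mu_y) / sd_x
    c = sd_y / sd_z
    d = (mu_z - mu_y) / sd_z
    h = 2.0 * half_width / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        s = -half_width + i * h
        # Simpson weights: 1 at the end points, then 4, 2, 4, ... inside.
        weight = 1 if i in (0, n_steps) else (4 if i % 2 else 2)
        total += (weight * std_normal_cdf(a * s - b)
                  * std_normal_cdf(-c * s + d) * std_normal_pdf(s))
    return total * h / 3.0


# Sanity check: three identical distributions give the "useless test" value 1/6.
useless = vus_normal(0.0, 1.0, 0.0, 1.0, 0.0, 1.0)
```

Plugging the sample means and standard deviations of the three groups into `vus_normal` gives the plug‐in estimate described above.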

3. BEYOND CLASSICAL LIKELIHOOD INFERENCE FOR THE VUS

3.1. First‐order likelihood inference

As shown in (5), the VUS can be expressed as a reparametrisation of θ of the form θ=(ψ,λ), where ψ=ψ(θ)=VUS(θ) is the scalar parameter of interest, and λ=λ(θ) is the (k−1)‐dimensional nuisance parameter.

The MLE of ψ is simply given by $\hat{ψ} = ψ (\hat{θ})$ because of the well‐known invariance property. Parametric inference on ψ, such as CIs and tests, relies on general profile likelihood procedures (see, eg, Severini12). These methods require the elimination of the nuisance parameter λ by replacing it with the constrained MLE ${\hat{λ}}_{ψ}$ obtained by maximising the log‐likelihood function ℓ(θ)=ℓ(ψ,λ) with respect to λ for fixed ψ. Then, inference about ψ may be performed using the profile log‐likelihood $ℓ_{p} (ψ) = ℓ (ψ, {\hat{λ}}_{ψ}) = ℓ ({\hat{θ}}_{ψ})$ , with ${\hat{θ}}_{ψ} = (ψ, {\hat{λ}}_{ψ})$ .

Inference on ψ is usually based on the Wald statistic

w_{p} = w_{p} (ψ) = j_{p} {(\hat{ψ})}^{1 / 2} (\hat{ψ} - ψ),

(7)

or on the signed log‐likelihood ratio statistic (or directed likelihood)

r_{p} = r_{p} (ψ) = sign (\hat{ψ} - ψ) {(2 (ℓ_{p} (\hat{ψ}) - ℓ_{p} (ψ)))}^{1 / 2},

(8)

which have standard normal distributions up to the order $O(n^{-1/2})$. In Equation (7), the profile observed information $j_{p}(ψ) = -\partial^{2} ℓ_{p}(ψ)/\partial ψ^{2}$ can be expressed in terms of the full observed information through the identity:

j_{p} (ψ) = j_{ψ ψ} ({\hat{θ}}_{ψ}) - j_{ψ λ} ({\hat{θ}}_{ψ}) j_{λ λ}^{- 1} ({\hat{θ}}_{ψ}) j_{λ ψ} ({\hat{θ}}_{ψ}),

(9)

where $j_{ψψ}(ψ, λ)$, $j_{ψλ}(ψ, λ)$, $j_{λψ}(ψ, λ)$, and $j_{λλ}(ψ, λ)$ are the sub‐matrices of the full observed information $j(θ) = -\partial^{2} ℓ(θ)/(\partial θ \, \partial θ^{T})$. First‐order CIs for ψ may be based on (7) or (8). For instance, a 100(1−α)% approximate CI for ψ based on the Wald statistic is in practice computed as follows:

(\hat{ψ} - z_{1 - α / 2} j_{p} {(\hat{ψ})}^{- 1 / 2}, \hat{ψ} + z_{1 - α / 2} j_{p} {(\hat{ψ})}^{- 1 / 2}),

(10)

where $z_{α}$ is the α‐quantile of the standard normal distribution. Alternatively, a 100(1−α)% CI for ψ based on $r_{p}$ is $\{ψ : |r_{p}(ψ)| \leq z_{1 - α/2}\}$.

In practice, (10) is often preferred because of its computational simplicity. However, it is well known that Wald procedures can behave poorly even in large samples, are not invariant under reparameterisation, and are in general less accurate than procedures based on the log‐likelihood ratio statistic (see, eg, Severini12 and Brazzale et al11).
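To make the first‐order quantities concrete, the following sketch computes the Wald statistic (7), the signed log‐likelihood ratio (8), and the Wald interval (10) from values that are assumed to be already available: the MLE, the profile observed information at the MLE, and the profile log‐likelihood evaluated at the MLE and at the hypothesised value. All function and argument names are hypothetical, and the numbers in the usage lines are purely illustrative.

```python
import math
from statistics import NormalDist


def wald_statistic(psi_hat, psi, profile_info):
    """Wald statistic (7): j_p(psi_hat)^(1/2) * (psi_hat - psi)."""
    return math.sqrt(profile_info) * (psi_hat - psi)


def signed_root(psi_hat, psi, loglik_at_mle, loglik_at_psi):
    """Signed log-likelihood ratio (8) from two profile log-likelihood values."""
    drop = max(loglik_at_mle - loglik_at_psi, 0.0)  # guard against rounding
    return math.copysign(math.sqrt(2.0 * drop), psi_hat - psi)


def wald_interval(psi_hat, profile_info, level=0.95):
    """Wald interval (10) with nominal coverage `level`."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z / math.sqrt(profile_info)
    return psi_hat - half, psi_hat + half


# Illustrative numbers only (not results from the paper).
w = wald_statistic(psi_hat=1.0, psi=0.0, profile_info=4.0)
r = signed_root(psi_hat=1.0, psi=0.0, loglik_at_mle=-10.0, loglik_at_psi=-12.0)
lower, upper = wald_interval(psi_hat=0.0, profile_info=4.0)
```

An interval based on (8) would instead require evaluating `signed_root` over a grid of ψ values and keeping those with absolute value below the normal quantile.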

Finally, note that, when the sample size is relatively small, in general, first‐order approximations are often inaccurate and can give poor results. In these situations, it may be useful to resort to modern likelihood theory based on higher order asymptotics. See Cortese and Ventura10 for higher order asymptotics for the AUC.

3.2. Higher‐order likelihood inference

The theory of higher‐order asymptotics provides more precise inferences than the standard theory; see, eg, Brazzale et al.11 In this section, we discuss a modified version of the log‐likelihood ratio statistic (8) that is more accurate, having a standard normal distribution up to order $O(n^{-3/2})$.

The modified directed likelihood for ψ is given by the following:

r_{p}^{*} = r_{p}^{*} (ψ) = r_{p} + \frac{1}{r_{p}} \log \frac{q}{r_{p}},

(11)

where q=q(ψ) is a suitable likelihood quantity, depending on likelihood derivatives; see, eg, Severini12, Chap. 7, for a review of possible expressions for q.

The modified directed likelihood $r_{p}^{*}$ is a higher order pivotal quantity whose null distribution is standard normal up to order $O(n^{-3/2})$. Moreover, it satisfies the requirement of parameterisation equivariance. A CI for ψ with approximate level (1−α) based on $r_{p}^{*}$ is given by the following:

\{ψ : | r_{p}^{*} (ψ) | \leq z_{1 - α / 2}\} .

(12)
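Once $r_{p}$ and q are available for a given ψ, the correction (11) is a one‐line computation. The sketch below takes both as inputs and does not attempt to compute q, which depends on the model; the function names are hypothetical, and the two‐sided P value is the usual normal tail area applied to either statistic.

```python
import math
from statistics import NormalDist


def modified_directed_likelihood(r, q):
    """r* = r + (1 / r) * log(q / r), as in (11).

    r and q must be nonzero and share the same sign; the expression is
    numerically delicate when r is very close to zero.
    """
    if r == 0 or q / r <= 0:
        raise ValueError("r must be nonzero and q / r must be positive")
    return r + math.log(q / r) / r


def two_sided_p_value(statistic):
    """Normal two-sided P value, 2 * (1 - Phi(|statistic|))."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))


# Illustrative values only: when q equals r the correction vanishes.
unchanged = modified_directed_likelihood(2.0, 2.0)
shifted = modified_directed_likelihood(1.0, math.e)
```

The interval (12) is then the set of ψ values for which the absolute value of the corrected statistic does not exceed the normal quantile, which in practice is found by evaluating the statistic on a grid and interpolating.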

The modified directed likelihood $r_{p}^{*}$ can also be used to derive a point estimator for ψ that improves the small sample properties of $\hat{ψ}$ , respecting the requirement of parameterisation equivariance. More precisely, following Giummolè and Ventura,13 $r_{p}^{*}$ can be used to define an estimating equation for ψ, of the form:

r_{p}^{*} (ψ) = 0 .

(13)

The estimator ${\hat{ψ}}^{*}$, solution of (13), is a refinement of $\hat{ψ}$, with the estimating equation (13) giving implicitly a higher order correction to the MLE. Indeed, in view of the properties of $r_{p}^{*}$, the estimating equation (13) is mean unbiased as well as median unbiased at the third order of accuracy, and the median unbiasedness property also holds for ${\hat{ψ}}^{*}$. Moreover, since $r_{p}^{*}$ is invariant under interest‐respecting reparameterisations, ${\hat{ψ}}^{*}$ is an equivariant estimator of ψ.

4. EXAMPLES

The proposed procedures can be easily implemented in practice for commonly used parametric models by using the R package likelihoodAsy.14 All computations are done in R, and the code for producing the results below can be found in the Supporting Information. In the following subsections, we assess the performance of the three competing statistics (7), (8), and (11) by means of simulations with N = 10 000 Monte Carlo trials. For each of the three methods outlined above, we compute the empirical coverage of the resulting two‐tailed and (left) one‐tailed CIs at various nominal levels. Furthermore, we also check the uniformity of the P values obtained with the three competing statistics when testing the hypotheses $H_{0}: ψ = ψ_{0}$ vs $H_{1}: ψ < ψ_{0}$.

For computing the coverage, we have built the CIs around logit(VUS); that is, we have reparametrised the VUS by taking $ψ = logit(VUS) = \log (VUS / (1 - VUS))$. This reparametrisation is useful in order to get a fair comparison with the coverage of the Wald‐type CIs, which otherwise would be severely inaccurate and thus of little practical use. Such an inaccuracy is due essentially to two reasons. First, in the somewhat extreme setting that we consider, the VUS is equal to 0.95, which is close to the boundary of the parameter space. Second, it is well known that Wald‐type inference procedures lack invariance to reparametrisations. To compute the expected length, we have used CIs on the VUS scale, since in practice, our interest is in the VUS and not in its logit.
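The effect of the logit reparametrisation on a Wald‐type interval can be seen in a small sketch. It assumes that a standard error for the VUS estimate is available and carries it to the logit scale with the delta method; this is one common way to do it and is not necessarily how the intervals in the paper were computed. The function name and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    return math.log(p / (1.0 - p))


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def wald_ci_on_logit_scale(vus_hat, se_vus, level=0.95):
    """Wald interval built for logit(VUS) and mapped back to the VUS scale.

    Delta method (an assumption of this sketch):
    se(logit(VUS)) is approximately se(VUS) / (VUS * (1 - VUS)).
    """
    if not 0.0 < vus_hat < 1.0:
        raise ValueError("vus_hat must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se_logit = se_vus / (vus_hat * (1.0 - vus_hat))
    centre = logit(vus_hat)
    return expit(centre - z * se_logit), expit(centre + z * se_logit)


# Hypothetical estimate near the boundary, with a hypothetical standard error.
vus_hat, se_vus = 0.95, 0.04
z975 = NormalDist().inv_cdf(0.975)
naive_upper = vus_hat + z975 * se_vus          # exceeds 1 on the VUS scale
lower, upper = wald_ci_on_logit_scale(vus_hat, se_vus)
```

With these inputs the untransformed Wald limit falls outside the parameter space, whereas the back‐transformed interval stays inside (0, 1) by construction.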

4.1. Simulation study 1: Normal model

We consider simulated data of sample sizes $n_{x} = n_{y} = n_{z} = n$, with n = 5 and n = 10, drawn from the normal model with true group means $μ_{x} = 1$, $μ_{y} = 2$, and $μ_{z} = 4$ and variances $σ_{x}^{2} = σ_{y}^{2} = σ_{z}^{2} = 0.4286$. These parameter values correspond to a value for the true VUS roughly equal to 0.95. The results of the simulations are shown in Figures 2 and 3.

Figure 2. Empirical vs nominal coverage levels and expected length of confidence intervals for the volume under the surface (VUS) obtained with N = 10 000 Monte Carlo trials under the normal model. First (second) row refers to data with n=5 (n=10). The first (second) column shows the coverages of the two‐tailed (one‐tailed) intervals. The dashed lines denote the upper and lower limits of the 99% confidence band around the nominal level 1−α, given by $(1 - α) \pm 3 \sqrt{\frac{α (1 - α)}{N}}$. Third column shows the expected length

Figure 3. Distribution of the P values for testing $H_{0}: ψ = ψ_{0}$ vs $H_{1}: ψ < ψ_{0}$, with $ψ_{0} = 0.95$, obtained with N = 10 000 Monte Carlo trials under the normal model. First (second) row refers to data with n=5 (n=10)

From these plots, we can deduce that the proposed method (11) outperforms (8) and (7) in terms of coverage for both two‐tailed and one‐tailed CIs (see Figure 2). As expected, with increasing sample size, the two‐tailed coverage converges to the nominal levels, whereas the convergence of the one‐tailed CIs appears to be slower. Therefore, higher order inference for the VUS, by means of (11), is more accurate than standard first‐order asymptotic inference. This conclusion is also supported by the P values obtained when testing $H_{0}: ψ = ψ_{0}$ vs $H_{1}: ψ < ψ_{0}$, with $ψ_{0} = 0.95$ (see Figure 3), which confirm that with the statistic (11) the resulting P values are approximately uniformly distributed. In terms of expected length, we see that $r_{p}^{*}$‐based CIs are on average slightly wider than those based on $r_{p}$, whereas Wald‐based CIs are too wide.

4.2. Simulation study 2: Exponential model

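Now, we consider the exponential model, according to which $X \sim Exp(λ_{x})$, $Y \sim Exp(λ_{y})$, and $Z \sim Exp(λ_{z})$, with rate parameters $λ_{x}, λ_{y}, λ_{z} > 0$. In this case, substituting $F_{x}(y) = 1 - e^{-λ_{x} y}$, $1 - F_{z}(y) = e^{-λ_{z} y}$, and $f_{y}(y) = λ_{y} e^{-λ_{y} y}$ into (5) gives $λ_{y}/(λ_{y} + λ_{z}) - λ_{y}/(λ_{x} + λ_{y} + λ_{z})$, that is,

VUS = \frac{λ_{x} λ_{y}}{(λ_{x} + λ_{y} + λ_{z}) (λ_{y} + λ_{z})} .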

We consider simulated data of sample sizes $n_{x} = n_{y} = n_{z} = n$, with n = 5 and n = 10, drawn from the exponential model with true rate parameters $λ_{x} = 500$, $λ_{y} = 10$, and $λ_{z} = 0.3136$ (that is, group means of 0.002, 0.1, and about 3.19, respectively). As for the normal case, these parameter values are chosen in such a way that the corresponding value for the true VUS is roughly equal to 0.95. The results of the simulations under the exponential model are shown in Figures 4 and 5.
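The closed form for the exponential model is easy to check numerically. The sketch below assumes the rate parametrisation (mean equal to 1/λ) and treats the values 500, 10, and 0.3136 as rates, an interpretation under which the closed form reproduces the stated true VUS of about 0.95. The function names are hypothetical; the plug‐in estimator uses the standard exponential MLE, rate = 1/sample mean.

```python
def vus_exponential(rate_x, rate_y, rate_z):
    """P(X < Y < Z) for independent exponentials with the given rates.

    Obtained from (5):
    rate_y / (rate_y + rate_z) - rate_y / (rate_x + rate_y + rate_z).
    """
    if min(rate_x, rate_y, rate_z) <= 0:
        raise ValueError("rates must be positive")
    return (rate_x * rate_y) / ((rate_x + rate_y + rate_z) * (rate_y + rate_z))


def vus_exponential_mle(x, y, z):
    """Plug-in MLE of the VUS: each rate is estimated by n / sum(sample)."""
    rates = [len(sample) / sum(sample) for sample in (x, y, z)]
    return vus_exponential(*rates)


# Simulation setting of this section, read as rate parameters (assumption).
true_vus = vus_exponential(500.0, 10.0, 0.3136)   # approximately 0.95
# Equal rates correspond to a useless test, VUS = 1/6.
useless = vus_exponential(2.0, 2.0, 2.0)
```

Because the VUS here depends on the data only through the three sample means, the plug‐in estimate is immediate once the group means are known.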

Figure 4. Empirical vs nominal coverage levels and expected length of confidence intervals for the volume under the surface (VUS) obtained with N = 10 000 Monte Carlo trials under the exponential model. First (second) row refers to data with n=5 (n=10). The first (second) column shows the coverages of the two‐tailed (one‐tailed) intervals. The dashed lines denote the upper and lower limits of the 99% confidence band around the nominal level 1−α, given by $(1 - α) \pm 3 \sqrt{\frac{α (1 - α)}{N}}$. Third column shows the expected length

Figure 5. Distribution of the P values for testing $H_{0}: ψ = ψ_{0}$ vs $H_{1}: ψ < ψ_{0}$, with $ψ_{0} = 0.95$, obtained with N = 10 000 Monte Carlo trials under the exponential model. First (second) row refers to data with n=5 (n=10)

Also from this simulation study, we can deduce that higher order inference for VUS, based on (11), is more accurate than standard first‐order asymptotic inference, since it has better coverage and gives more uniform P values under the null hypothesis.

4.3. Real data application 1: PHH3 mitotic count

In this section, we apply the proposed method to a real data example of 70 patients from a clinical study for the detection of meningioma (see Duregon et al.5). In particular, the aim of the study was to test the mitosis‐specific antibody PHH3 as a diagnostic criterion in meningioma grading, with the 70 meningiomas classified according to the WHO grades 1, 2, and 3. In this data set, 15 patients are graded 1, 41 are graded 2, and 14 are graded 3. The average PHH3 values for grades 1, 2, and 3 are, respectively, 1.93, 8.41, and 32.79; the sample standard deviations are 0.88, 4.5, and 13.22. The left plot in Figure 6 shows, by means of boxplots, the distribution of the diagnostic test PHH3 in the three WHO grades.

Figure 6. Left: boxplot of the PHH3 with respect to the three grades of the World Health Organisation; right: $r_{p}$ and $r_{p}^{*}$ as functions of the volume under the surface (VUS); the latter is computed on the logit scale for numerical stability

Referring to the meningioma groups identified by WHO grades as the gold standard, the nonparametric estimate of the VUS is 0.91 (95% CI, 0.83‐0.97), indicating a very good diagnostic ability of PHH3 scores in discriminating the three meningioma groups.

Under the Gaussian assumption on the distribution of PHH3, the MLE of VUS is 0.89. In this case, the 0.95 CIs based on the Wald and $r_{p}(ψ)$ statistics are almost identical and equal to (0.77‐0.95), whereas the 0.95 CI based on the more accurate $r_{p}^{*}(ψ)$ statistic is (0.75‐0.95). The 0.99 CIs for VUS with the Wald, $r_{p}(ψ)$, and $r_{p}^{*}(ψ)$ statistics are, respectively, (0.72‐0.96), (0.71‐0.96), and (0.68‐0.96). The right plot in Figure 6 shows $r_{p}(ψ)$ and $r_{p}^{*}(ψ)$ as functions of ψ, where ψ is the VUS on the logit scale. From this plot, we can deduce that the quantiles of $r_{p}(ψ)$ and $r_{p}^{*}(ψ)$ differ mainly in the right tail.

Finally, suppose it is of interest to test $H_{0}$: VUS = 0.75 against $H_{1}$: VUS ≠ 0.75. Then, with $r_{p}(ψ)$ and $r_{p}^{*}(ψ)$, we obtain the P values .022 and .052, respectively. Hence, if we fix α=.05, the first‐order result based on $r_{p}(ψ)$ suggests that $H_{0}$ must be rejected, whereas the more accurate third‐order P value based on $r_{p}^{*}(ψ)$ does not.

4.4. Real data application 2: Measure of miRNA in the thyroid cytology smears

Here, we consider data from Fassina et al.15, in which it is of interest to use the expression of a microRNA (miRNA) in thyroid cytology smears. In particular, it is of interest to study differences in the miRNA expression between specimens of anaplastic thyroid carcinoma (ATC), primary thyroid lymphoma (PTL), and multinodular goiter (MNG). In the pilot data set, the patients in the three groups (ATC, PTL, and MNG) are 18, 12, and 13, respectively. The average measures of the miRNA in the three groups are, respectively, 2.23, 0.49, and −0.24; the sample standard deviations are 1.27, 0.47, and 0.30. The left plot in Figure 7 shows, by means of boxplots, the distribution of the miRNA in ATC, PTL, and MNG patients. The nonparametric estimate of the VUS is 0.81 (95% CI, 0.63‐0.99), indicating a good diagnostic ability of the miRNA in discriminating the three thyroid cytology smears.

Figure 7. Left: boxplot of the measure of miRNA in the three thyroid cytology smears; right: $r_{p}$ and $r_{p}^{*}$ as functions of VUS; the latter is computed on the logit scale for numerical stability

Under the Gaussian assumption, the MLE of VUS is 0.784. The 0.95 CIs based on the Wald and $r_{p}(ψ)$ statistics are equal to (0.585‐0.903) and (0.587‐0.906), respectively, whereas the 0.95 CI based on the $r_{p}^{*}(ψ)$ statistic is (0.565‐0.898). The plots of (8) and (11), as functions of ψ, where ψ is the VUS on the logit scale, are given in the right plot in Figure 7, showing that $r_{p}(ψ)$ tends to overestimate the parameter of interest compared to $r_{p}^{*}(ψ)$.

Finally, if it is of interest to test $H_{0}$: VUS = 0.8, which is the value obtained with the nonparametric approach, against $H_{1}$: VUS ≠ 0.8, we obtain a P value = .027 with $r_{p}(ψ)$ and a P value = .0456 with $r_{p}^{*}(ψ)$, which provide different evidence against $H_{0}$.

5. FINAL REMARKS

Despite their widespread use, classical inferential techniques may lead to poor inferential results, especially in studies with small samples or with many nuisance parameters. Likelihood‐based higher order asymptotics provide refinements to many classical inferential results. A possible limitation of higher order asymptotic methods for practical use is that their software implementation can be awkward. However, with the freely available R package likelihoodAsy, this is no longer the case. Indeed, with likelihoodAsy, practitioners are required to provide only their likelihood function and a function to generate data from the assumed model, after which all the complicated quantities involved in the expansion of $r_{p}^{*}(ψ)$ are computed automatically.

Using the likelihoodAsy package, we showed by means of practical examples that important refinements can be obtained for the volume under the ROC surface. A possible extension of our methodology is to incorporate covariates in each group. This could still be handled by likelihoodAsy, although the number of nuisance parameters would be much higher. We leave this extension for future work.

We note that the analyses based on modern likelihood methods also have an interesting Bayesian interpretation, when considering the class of matching priors.16, 17, 18 With this prior, a credible equi‐tailed interval for ψ coincides with the accurate higher order likelihood‐based CI (12), and the posterior median coincides with the frequentist estimator defined as the solution of (13). For the use of tail area approximations for measuring evidence in the Bayesian context, see Ventura and Reid19 and references therein.

Finally, in many applications, it may be of interest to compare different diagnostic tests in terms of their accuracy as measured by the VUS (see Yin et al.20). The extension of the proposed method in this direction is under investigation.

CONFLICT OF INTERESTS

The authors have no conflict of interests to declare.

AUTHORS' CONTRIBUTION

All authors had full access to the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Conceptualization, E.R., L.V.; Methodology, E.R., L.V.; Investigation, E.R., L.V.; Formal Analysis, E.R., L.V.; Resources, E.R., L.V.; Writing ‐ Original Draft, Review & Editing, E.R., L.V.; Visualization, E.R., L.V.; Funding Acquisition, L.V.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study as well as the R code used in the examples are available as Supporting Information.

ACKNOWLEDGEMENTS

This research work was partially supported by University of Padova (BIRD197903) and by PRIN 2015 (grant 2015EASZFS_003).

Ruli E, Ventura L. Accurate likelihood inference for the volume under the ROC surface. Cancer Reports. 2020;3:e1206. 10.1002/cnr2.1206

REFERENCES

1. Shapiro DE. The interpretation of diagnostic tests. Stat Methods Med Res. 1999;8(2):113‐134.
2. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Stat Med. 2002;21(20):3093‐3106.
3. Zhou XH, McClish DK, Obuchowski NA. Statistical Methods in Diagnostic Medicine, 2nd edition. Hoboken, New Jersey: John Wiley & Sons; 2011.
4. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press; 2003.
5. Duregon E, Cassenti A, Pittaro A, et al. Better see to better agree: phosphohistone H3 increases interobserver agreement in mitotic count for meningioma grading and imposes new specific thresholds. Neuro Oncol. 2015;17(5):663‐669.
6. Nakas CT, Yiannoutsos CT. Ordered multiple‐class ROC analysis with continuous measurements. Stat Med. 2004;23(22):3437‐3449.
7. Xiong C, van Belle G, Miller JP, Morris JC. Measuring and estimating diagnostic accuracy when there are three ordinal diagnostic groups. Stat Med. 2006;25(7):1251‐1273.
8. Li J, Zhou XH. Nonparametric and semiparametric estimation of the three way receiver operating characteristic surface. J Stat Plann Inference. 2009;139(12):4133‐4142.
9. Kang L, Tian L. Estimation of the volume under the ROC surface with three ordinal diagnostic categories. Comput Stat Data Anal. 2013;62:39‐51.
10. Cortese G, Ventura L. Accurate higher‐order likelihood inference on P(Y<X). Comput Stat. 2013;28(3):1035‐1059.
11. Brazzale AR, Davison AC, Reid N. Applied Asymptotics: Case Studies in Small‐Sample Statistics. Cambridge University Press; 2007.
12. Severini TA. Likelihood Methods in Statistics. Oxford University Press; 2000.
13. Giummolè F, Ventura L. Practical point estimation from higher‐order pivots. J Stat Comput Simul. 2002;72(5):419‐430.
14. Bellio R, Pierce D. likelihoodAsy: functions for likelihood asymptotics. R package version 0.50. https://CRAN.R-project.org/package=likelihoodAsy; 2018.
15. Fassina A, Cappellesso R, Simonato F, et al. A 4‐microRNA signature can discriminate primary lymphomas from anaplastic carcinomas in thyroid cytology smears. Cancer Cytopathol. 2014;122(4):274‐281.
16. Tibshirani R. Noninformative priors for one parameter of many. Biometrika. 1989;76(3):604‐608.
17. Ventura L, Cabras S, Racugno W. Prior distributions from pseudo‐likelihoods in the presence of nuisance parameters. J Am Stat Assoc. 2009;104(486):768‐774.
18. Ventura L, Sartori N, Racugno W. Objective Bayesian higher‐order asymptotics in models with nuisance parameters. Comput Stat Data Anal. 2013;60:90‐96.
19. Ventura L, Reid N. Approximate Bayesian computation with modified log‐likelihood ratios. Metron. 2014;72(2):231‐245.
20. Yin J, Nakas CT, Tian L, Reiser B. Confidence intervals for differences between volumes under receiver operating characteristic surfaces (VUS) and generalized Youden indices (GYIs). Stat Methods Med Res. 2018;27(3):675‐688.
