Author manuscript; available in PMC: 2010 Mar 4.
Published in final edited form as: J Am Stat Assoc. 2008;103(482):705–713. doi: 10.1198/016214508000000364

A Unified Approach to Nonparametric Comparison of Receiver Operating Characteristic Curves for Longitudinal and Clustered Data

Gang Li 1, Kefei Zhou 2
PMCID: PMC2832229  NIHMSID: NIHMS125548  PMID: 20209021

Abstract

We present a unified approach to nonparametric comparisons of receiver operating characteristic (ROC) curves for a paired design with clustered data. Treating empirical ROC curves as stochastic processes, we derive their asymptotic joint distribution in the presence of both between-marker and within-subject correlations. We develop a Monte Carlo method to approximate this joint distribution without nonparametric density estimation. The theory is applied to derive new inferential procedures for comparing weighted areas under the ROC curves, confidence bands for the difference function of ROC curves, confidence intervals for the set of specificities at which one diagnostic test is more sensitive than the other, and multiple comparison procedures for comparing more than two diagnostic markers. Our methods demonstrate satisfactory small-sample performance in simulations. We illustrate our methods using clustered data from a glaucoma study and repeated-measurement data from a startle response study.

Keywords: Area under the receiver operating characteristic curve, Clustered data, Confidence band, Intersection-union tests, Longitudinal data, Multiple comparison, Paired design, Partial area under the receiver operating characteristic curve, Quantile process, Repeated measurement

1. INTRODUCTION

Receiver operating characteristic (ROC) curves are commonly used to evaluate and compare diagnostic markers in various fields, such as signal detection and medicine (Green and Swets 1966; Zhou, McClish, and Obuchowski 2002; Pepe 2003). The ROC curve of a diagnostic marker is a plot of the true positive rate (sensitivity) against the false-positive rate (1 – specificity) at different threshold values. It has the appealing property of describing the discrimination capacity of a diagnostic marker without linking it to any specific threshold. This enables direct comparisons of diagnostic markers even if they are on different measurement scales.

The most popular nonparametric methods for comparing ROC curves are based on some summary indexes, such as the area under the ROC curve (AUC) and the partial area under the ROC curve (pAUC) (see Hanley and McNeil 1982, 1983; McClish 1987; DeLong, DeLong, and Clarke-Pearson 1988; and Wieand, Gail, Barray, and James 1989 for independent data, and Obuchowski 1997; Emir, Wieand, Su, and Cha 1998; and Emir, Wieand, Jung, and Ying 2000 for clustered data). Although simple to use, these summary indexes are not without limitations; for example, a large value of the AUC or pAUC does not necessarily imply high sensitivity at a prespecified specificity. Many medical applications require comparisons of sensitivities of diagnostic markers at a prespecified specificity that is of clinical relevance. Confidence bands for the difference of two ROC curves also are desirable in practice for comparing sensitivities of two diagnostic markers over a prespecified range of specificities. In addition, interest may lie in estimating the set of specificities with differential sensitivity. Unfortunately, very few nonparametric comparison procedures beyond the use of AUC and pAUC have been rigorously developed in the literature. The problems can become more complicated when a paired design is used in which both markers are performed on the same study subjects and when there are multiple measurements for each marker from repeated-measurement or clustered data studies. In such situations, the between-marker and within-subject correlations must be accounted for. Some examples of a paired design with clustered data are given in Section 5.

The purpose of this article is to develop a unified approach to the analysis of ROC curves that allows a variety of nonparametric comparisons in the presence of between-marker and within-cluster correlations. We first derive the asymptotic distribution theory for correlated empirical ROC processes, taking into account both between-marker and within-cluster correlations. This extends the work of Hsieh and Turnbull (1996) and Li, Tiwari, and Wells (1996b), who derived the asymptotic distribution of a single empirical ROC curve. The variance–covariance functions of the limiting ROC processes are shown to involve unknown densities that are difficult to estimate nonparametrically. By extending an idea of Keaney and Wei (1994), we propose to approximate the joint distribution of the empirical ROC processes by Gaussian random processes whose distribution can be calculated by Monte Carlo simulation without involving density estimation. We provide a theoretical justification for this approach by proving that the Monte Carlo Gaussian processes have the same limiting distribution as the original empirical ROC processes. These results allow us to approximate the distribution of any continuous functional of the empirical ROC curves. This provides a general framework that enables various ROC curve comparisons. As examples, we apply the developed theory to obtain confidence bands for the difference of two ROC curves over a given range of specificities. We also derive nonparametric tests and confidence intervals for comparing weighted areas under the ROC curves that include AUCs, pAUCs, and sensitivities at a fixed specificity as special cases. In addition, we apply the intersection-union test concept (cf. Berger and Boos 1999; Berger and Hsu 1996) to obtain confidence intervals for the set of specificities at which one diagnostic test is more sensitive than the other. We also extend our theory to deal with multiple comparison problems involving more than two diagnostic markers.

We note that Uno, Cai, Tian, and Wei (2007) described a similar perturbation-resampling method to evaluate an ROC curve for censored data that include uncensored binary outcomes as a special case. As pointed out by a referee, it may be possible to extend their method to correlated ROC curves and to clustered data. But our work is the first to clearly adapt these ideas to analyses of correlated ROC curves with clustered data, and we describe additional measures, such as confidence intervals for the set of specificities with differential sensitivity.

The rest of the article is organized as follows. Section 2 gives the general asymptotic distribution theory for empirical ROC curves. Section 3 applies the developed theory to derive a number of inferential procedures for comparing ROC curves. Section 4 presents a simulation study to evaluate our methods’ performance. Section 5 illustrates our methods using a clustered data set from an ophthalmology study and a repeated-measurement data set from a neuroscience study. Section 6 gives some closing remarks.

2. LARGE–SAMPLE DISTRIBUTION THEORY FOR RECEIVER OPERATING CHARACTERISTIC CURVES

For simplicity, we consider only two diagnostic tests in this section.

2.1 The Receiver Operating Characteristic Curve

Let X(v) denote the continuous outcome of the vth diagnostic marker for which a value greater than a cutoff value t indicates a positive test result, v = 1, 2. The sensitivity and specificity are then given by 1 − G(v)(t) and F(v)(t), where F(v) and G(v) are the cumulative distribution functions of X(v) for the healthy and diseased populations, v = 1, 2. The ROC curve for the vth diagnostic test is defined by

$$\mathrm{ROC}^{(v)}(p) = 1 - G^{(v)}\{F^{(v)-1}(1 - p)\}, \quad p \in [0, 1], \tag{1}$$

where F−1(p) = inf(t: F(t) ≥ p) for any function F. Equivalently, this is the plot of sensitivity against 1 − specificity, as the cutoff value t varies. Clearly, the closer to the upper left corner of the unit box, the greater the discriminating power.

2.2 The Data

Assume that there are a total of n subjects in the study. Suppose that we observe $X_{ij}^{(v)} \sim F^{(v)}$, $j = 1, \ldots, m_i^{(v)}$, representing the measurements of the vth marker from $m_i^{(v)}$ healthy units within subject i, and $Y_{ij}^{(v)} \sim G^{(v)}$, $j = 1, \ldots, n_i^{(v)}$, the measurements of the vth marker from $n_i^{(v)}$ diseased units within subject i, i = 1, …, n and v = 1, 2. Assume further that measurements from different subjects are independent and measurements within a subject are possibly correlated. These types of data commonly arise from clustered data or repeated-measurement studies. Our data setting allows for both between-marker and within-subject correlations. We also allow different markers to have different numbers of measurements per subject.

2.3 The Estimators

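Let $M^{(v)} = \sum_{i=1}^{n} m_i^{(v)}$ and $N^{(v)} = \sum_{i=1}^{n} n_i^{(v)}$. Define the empirical ROC curves

$$\widehat{\mathrm{ROC}}^{(v)}(p) = 1 - \hat G^{(v)}\{\hat F^{(v)-1}(1 - p)\}, \quad v = 1, 2, \tag{2}$$

where $\hat F^{(v)}(t) = \sum_{i=1}^{n} \sum_{j=1}^{m_i^{(v)}} I(X_{ij}^{(v)} \le t) / M^{(v)}$ and $\hat G^{(v)}(t) = \sum_{i=1}^{n} \sum_{j=1}^{n_i^{(v)}} I(Y_{ij}^{(v)} \le t) / N^{(v)}$.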
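As a rough illustration of the estimator in (2), the following Python sketch pools the healthy and diseased measurements across clusters and evaluates the empirical ROC curve at a given p. The function names (`ecdf`, `ecdf_inverse`, `empirical_roc`) and the list-of-lists data layout are assumptions made for this example; they are not taken from the authors' software.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF t -> #{x <= t} / len(sample)."""
    xs = sorted(sample)
    return lambda t: bisect_right(xs, t) / len(xs)


def ecdf_inverse(sample):
    """Return the generalized inverse p -> inf{t : F_hat(t) >= p}."""
    xs = sorted(sample)
    n = len(xs)

    def inverse(p):
        if p <= 0:
            return float("-inf")
        # Smallest k with k/n >= p; the tolerance guards against round-off.
        k = min(n, math.ceil(p * n - 1e-12))
        return xs[k - 1]

    return inverse


def empirical_roc(healthy_clusters, diseased_clusters):
    """Empirical ROC curve p -> 1 - G_hat(F_hat^{-1}(1 - p)).

    Each argument is a list of clusters (one list of measurements per
    subject); the estimator simply pools all units, as in equation (2).
    """
    x = [v for cluster in healthy_clusters for v in cluster]
    y = [v for cluster in diseased_clusters for v in cluster]
    f_inv = ecdf_inverse(x)
    g_hat = ecdf(y)
    return lambda p: 1.0 - g_hat(f_inv(1.0 - p))


# Hypothetical toy data: two subjects, each with healthy and diseased units.
roc = empirical_roc([[0.0, 1.0], [2.0, 3.0]], [[2.5, 3.5], [4.0, 1.5]])
sensitivity_at_fpr_25 = roc(0.25)
```

Pooling ignores the cluster structure when estimating the curve itself; the clustering matters only for the variability of the estimate, which is what the remainder of this section addresses.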

2.4 The Joint Limiting Distribution of $(\widehat{\mathrm{ROC}}^{(1)}(p), \widehat{\mathrm{ROC}}^{(2)}(p))$

For any interval I, let D(I) denote the cadlag space of all right-continuous functions on I that have left-side limits, equipped with the supremum norm || · ||. The following lemma is needed to establish the limiting distribution of $(\widehat{\mathrm{ROC}}^{(1)}(p), \widehat{\mathrm{ROC}}^{(2)}(p))$.

Lemma 1

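Assume that as n → ∞, $n^{-1} \sum_{i=1}^{n} (m_i^{(v)})^k \to \lambda_k^{(v)}$ and $n^{-1} \sum_{i=1}^{n} (n_i^{(v)})^k \to \gamma_k^{(v)}$ for some positive constants $\lambda_k^{(v)}$ and $\gamma_k^{(v)}$, v = 1, 2 and k = 1, 2, 3. Then

$$\sqrt{n} \begin{pmatrix} \hat F^{(1)}(t) - F^{(1)}(t) \\ \hat F^{(2)}(t) - F^{(2)}(t) \\ \hat G^{(1)}(t) - G^{(1)}(t) \\ \hat G^{(2)}(t) - G^{(2)}(t) \end{pmatrix} \xrightarrow{d} \begin{pmatrix} W_F^{(1)}(t) \\ W_F^{(2)}(t) \\ W_G^{(1)}(t) \\ W_G^{(2)}(t) \end{pmatrix} \quad \text{as } n \to \infty, \tag{3}$$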

in D(ℝ)4 = D(ℝ) × ··· × D(ℝ), where ℝ = (−∞, ∞), (WF(1)(t), WF(2)(t), WG(1)(t), WG(2)(t))′ is a vector of mean-0 Gaussian processes defined in (A.1) of the Appendix.

The joint limiting distribution of $(\widehat{\mathrm{ROC}}^{(1)}(p), \widehat{\mathrm{ROC}}^{(2)}(p))$ is given next.

Theorem 1

Assume that the assumptions of Lemma 1 hold. Assume further that for v = 1, 2, F(v) and G(v) have derivatives F(v)′ and G(v)′ that are positive and continuous on [F(v)−1(a) − ε, F(v)−1(b) + ε], for some 0 < a < b < 1 and ε > 0. Then, as n → ∞,

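$$\sqrt{n} \begin{pmatrix} \widehat{\mathrm{ROC}}^{(1)}(p) - \mathrm{ROC}^{(1)}(p) \\ \widehat{\mathrm{ROC}}^{(2)}(p) - \mathrm{ROC}^{(2)}(p) \end{pmatrix} \xrightarrow{d} \begin{pmatrix} -Z^{(1)}(1 - p) \\ -Z^{(2)}(1 - p) \end{pmatrix}, \tag{4}$$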

in D[1 − b, 1 − a] × D[1 − b, 1 − a], where for v = 1, 2, Z(v)(p) = − s(v)(p) · WF(v)(F(v)−1(p)) + WG(v)(F(v)−1(p)), s(v)(p) = G(v)′ (F(v)−1(p))/F(v)′ (F(v)−1(p)). Recall that WF(1)(t), WF(2)(t), WG(1)(t), and WG(2)(t) are the limiting Gaussian processes in (3).

The foregoing theorem allows us to obtain the limiting distribution of any continuous functional of $(\widehat{\mathrm{ROC}}^{(1)}(p), \widehat{\mathrm{ROC}}^{(2)}(p))$. For simplicity, hereinafter we focus on the difference function of the two ROC curves.

Corollary 1

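Let $D(p) = \mathrm{ROC}^{(1)}(p) - \mathrm{ROC}^{(2)}(p)$ and $\hat D(p) = \widehat{\mathrm{ROC}}^{(1)}(p) - \widehat{\mathrm{ROC}}^{(2)}(p)$. Then, as n → ∞,

$$\sqrt{n}\,(\hat D(p) - D(p)) \xrightarrow{d} U(p) = Z^{(2)}(1 - p) - Z^{(1)}(1 - p).$$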

2.5 Approximate the Limiting Process U(p)

Note that the result of Corollary 1 cannot be readily used to make inference for D(p). For instance, to construct simultaneous confidence bands for D(p) over [p1, p2], we need to know the distribution of $\sup_{p \in [p_1, p_2]} |U(p)|$, which depends on some unknown quantities and is intractable. Variance estimation also is problematic, because it involves density estimation. Next, we study how to approximate the distribution of U(·) without using density estimation.

As a building block, we first describe a standard Monte Carlo method for approximating the empirical distributions. For v = 1, 2, define

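$$W_F^{(v)*}(t) = \frac{\sqrt{n}}{M^{(v)}} \sum_{i=1}^{n} \eta_i \sum_{j=1}^{m_i^{(v)}} \{ I(X_{ij}^{(v)} \le t) - \hat F^{(v)}(t) \}$$

and

$$W_G^{(v)*}(t) = \frac{\sqrt{n}}{N^{(v)}} \sum_{i=1}^{n} \eta_i \sum_{j=1}^{n_i^{(v)}} \{ I(Y_{ij}^{(v)} \le t) - \hat G^{(v)}(t) \},$$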

where the ηi’s are iid standard normal random variables. Then, conditional on the observed data,

$$(W_F^{(1)*}(t), W_F^{(2)*}(t), W_G^{(1)*}(t), W_G^{(2)*}(t)) \xrightarrow{d} (W_F^{(1)}(t), W_F^{(2)}(t), W_G^{(1)}(t), W_G^{(2)}(t)) \tag{5}$$

in D(ℝ)4 for almost all data realizations, where the limiting processes in (5) are defined in (3). This can be proven by observing that, conditional on the data, the left side is a Gaussian random field whose covariance function converges to that of the right side.
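The multiplier construction can be sketched in a few lines of Python. The sketch below evaluates one perturbed process at a single point t, assuming the data are stored as one list of measurements per subject and that the same multiplier η_i is applied to every unit of subject i. The function name `perturbed_process` and the data layout are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def perturbed_process(clusters, eta, t):
    """Evaluate (sqrt(n) / M) * sum_i eta_i * sum_j {I(X_ij <= t) - F_hat(t)}.

    clusters : list of per-subject lists of measurements (may be empty)
    eta      : one standard normal multiplier per subject
    t        : point at which the process is evaluated
    """
    n = len(clusters)
    total = sum(len(c) for c in clusters)          # M (or N for diseased units)
    f_hat = sum(1 for c in clusters for x in c if x <= t) / total
    s = 0.0
    for e, cluster in zip(eta, clusters):
        # The whole cluster shares one multiplier, which is what preserves
        # the within-subject correlation in the resampled process.
        s += e * sum((1.0 if x <= t else 0.0) - f_hat for x in cluster)
    return math.sqrt(n) / total * s


# Hypothetical toy data: three subjects with unequal cluster sizes.
rng = random.Random(2008)
healthy = [[0.3, 1.2], [0.8], [1.9, 2.4, 0.1]]
eta = [rng.gauss(0.0, 1.0) for _ in healthy]
w_star_at_one = perturbed_process(healthy, eta, 1.0)
```

To keep the between-marker and healthy/diseased correlations intact, the same vector `eta` would be reused for all four processes within one Monte Carlo replicate.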

A naive method for approximating the limiting process U is to replace $(W_F^{(v)}(t), W_G^{(v)}(t))$ with $(W_F^{(v)*}(t), W_G^{(v)*}(t))$ and then replace s(v)(p) and F(v)(t), v = 1, 2, by their corresponding sample estimates. But estimating s(v)(p) nonparametrically is difficult. Keaney and Wei (1994) presented a novel method to approximate the distribution of a median survival time without requiring density estimation. We extend their idea to approximate the process U(p).

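Note that the distribution of $\sqrt{n}\{\hat F^{(v)}(F^{(v)-1}(p)) - p\}$ can be estimated by that of $W_F^{(v)*}(F^{(v)-1}(p))$. Define $\xi^{(v)*}(p)$ by $\sqrt{n}\{\hat F^{(v)}(\xi^{(v)*}(p)) - p\} = W_F^{(v)*}(\hat F^{(v)-1}(p))$, or

$$\xi^{(v)*}(p) = \hat F^{(v)-1}\bigl(p + n^{-1/2}\, W_F^{(v)*}(\hat F^{(v)-1}(p))\bigr). \tag{6}$$

Then it can be shown that the conditional distribution of the process $\sqrt{n}\{\xi^{(v)*}(p) - \hat F^{(v)-1}(p)\}$ given the data is asymptotically equivalent to that of $\sqrt{n}\{\hat F^{(v)-1}(p) - F^{(v)-1}(p)\}$.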

Similarly, define

$$\zeta^{(v)*}(t) = \hat G^{(v)}(t) - n^{-1/2}\, W_G^{(v)*}(t). \tag{7}$$

Then it is easy to see that the distribution of $\sqrt{n}\{\hat G^{(v)}(t) - G^{(v)}(t)\}$ can be estimated by the conditional distribution of $\sqrt{n}\{\zeta^{(v)*}(t) - \hat G^{(v)}(t)\}$ given the data.

Finally, let Q(v)*(p) = ζ(v)*(ξ(v)*(p)), ROC(v)*(p) = 1 − Q(v)*(1 − p), and D*(p) = ROC(1)*(p) − ROC(2)*(p). We then have the following theorem.
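The construction of one perturbed realization D*(p) can be sketched as follows. This is a minimal illustration under stated assumptions: data are stored as one list per subject (healthy and diseased lists aligned by subject), the perturbed distribution function in (7) is taken as the empirical one minus the scaled multiplier process, and quantile arguments outside (0, 1] are clipped. All function names are hypothetical.

```python
import math
import random
from bisect import bisect_right


def pooled(clusters):
    """Sorted pooled sample over all subjects."""
    return sorted(x for c in clusters for x in c)


def cdf(xs, t):
    """Empirical CDF of the sorted sample xs at t."""
    return bisect_right(xs, t) / len(xs)


def quantile(xs, p):
    """inf{t : F_hat(t) >= p}; arguments above 1 are clipped to the maximum."""
    if p <= 0:
        return float("-inf")
    k = min(len(xs), math.ceil(p * len(xs) - 1e-12))
    return xs[k - 1]


def w_star(clusters, xs, eta, t):
    """Multiplier process (sqrt(n)/M) sum_i eta_i sum_j {I(X_ij <= t) - F_hat(t)}."""
    f_hat = cdf(xs, t)
    s = sum(e * sum((x <= t) - f_hat for x in c) for e, c in zip(eta, clusters))
    return math.sqrt(len(clusters)) / len(xs) * s


def roc_star(healthy, diseased, eta, p):
    """One perturbed ROC value: 1 - zeta*(xi*(1 - p))."""
    root_n = math.sqrt(len(healthy))
    xs, ys = pooled(healthy), pooled(diseased)
    q = 1.0 - p
    xi = quantile(xs, q + w_star(healthy, xs, eta, quantile(xs, q)) / root_n)  # (6)
    zeta = cdf(ys, xi) - w_star(diseased, ys, eta, xi) / root_n               # (7)
    return 1.0 - zeta


def d_star(h1, d1, h2, d2, eta, p):
    """Perturbed difference D*(p); the same eta is shared by both markers."""
    return roc_star(h1, d1, eta, p) - roc_star(h2, d2, eta, p)


# Hypothetical toy data for two markers measured on the same two subjects.
rng = random.Random(1)
h1, d1 = [[0.0, 1.0], [2.0, 3.0]], [[2.5, 3.5], [4.0, 1.5]]
h2, d2 = [[0.2, 0.9], [1.7, 3.1]], [[2.0, 3.0], [3.9, 1.0]]
eta = [rng.gauss(0.0, 1.0) for _ in h1]
one_draw = d_star(h1, d1, h2, d2, eta, 0.25)
```

Repeating the last two lines with fresh multipliers gives the Monte Carlo sample of D*(p) used by the inferential procedures; no density estimate is needed at any point.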

Theorem 2

Assume that the conditions of Theorem 1 hold. Then, for every subsequence {nj} of {n}, there is a further subsequence {mk} ⊂ {nj} such that, conditional on the data,

$$\sqrt{n}\,(D^*(p) - \hat D(p)) \xrightarrow{d} U(p)$$

in D[p1, p2] along the subsequence {mk}, for almost all data realizations, where [p1, p2] ⊂ (1 − b, 1 − a) and U(p) is the limiting process defined in Corollary 1.

Although the foregoing result is established along subsequences, it leads to the following corollary for any continuous functional of $\sqrt{n}\,(D^* - \hat D)$ along the entire sequence {n}.

Corollary 2

Assume that the conditions of Theorem 1 hold. Let T: D[p1, p2] → ℝ be a continuous mapping from D[p1, p2] to the real line ℝ. Let $H_n^*(t) = P(T(\sqrt{n}\,(D^* - \hat D)) \le t \mid \text{data})$. Then, as n → ∞,

$$H_n^*(t) \to H(t) \equiv P(T(U) \le t)$$

uniformly in t on any compact interval for almost all data realizations.

Corollary 2 allows us to approximate the distribution of $T(\sqrt{n}\,(\hat D - D))$ by that of $T(\sqrt{n}\,(D^* - \hat D))$ given the data. Some useful examples of T include $T_1(f) = f(p_0)$ for a fixed p0, $T_2(f) = \int_{p_1}^{p_2} f(p)\,dp$, and $T_3(f) = \sup_{p \in [p_1, p_2]} |f(p)|$, which can be used, respectively, to compare sensitivities at a prespecified specificity p0, to compare partial areas under the curve (pAUC), and to construct simultaneous confidence bands for D(p).

3. RECEIVER OPERATING CHARACTERISTIC CURVE ANALYSIS

In this section we apply the theory developed in the previous section to derive a number of inferential procedures for comparing ROC curves. Specifically, we develop statistical tests for comparing weighted areas under ROC curves, construct confidence bands for the difference of two ROC curves, estimate the set of specificities at which one test is more sensitive than the other, and discuss multiple comparison procedures for more than two diagnostic markers.

3.1 Comparing the Weighted Areas Under Receiver Operating Characteristic Curves

Let $\Delta = \int_0^1 D(p)\,dw(p)$ be the difference between the weighted areas under the two ROC curves, where w(p) is a prespecified weight function. Note that Δ is the difference between two sensitivities for a fixed specificity p0 if w(p) is a degenerate distribution function at p0, the difference between the total AUCs if w(p) is the uniform distribution function on [0, 1], and the difference between the pAUCs if w(p) is the uniform distribution function on [p1, p2] ⊂ [0, 1] multiplied by p2 − p1.
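For concreteness, the three special cases can be written out explicitly. The display below is a worked restatement of the weight functions just described, with the pAUC weight written as a clipped linear function; it adds no assumptions beyond the definition of Δ.

```latex
\begin{align*}
w(p) = I(p \ge p_0):
  &\quad \Delta = D(p_0)
   = \mathrm{ROC}^{(1)}(p_0) - \mathrm{ROC}^{(2)}(p_0), \\
w(p) = p,\ p \in [0, 1]:
  &\quad \Delta = \int_0^1 D(p)\,dp
   = \mathrm{AUC}^{(1)} - \mathrm{AUC}^{(2)}, \\
w(p) = \min\{\max(p - p_1, 0),\, p_2 - p_1\}:
  &\quad \Delta = \int_{p_1}^{p_2} D(p)\,dp
   = \mathrm{pAUC}^{(1)} - \mathrm{pAUC}^{(2)}.
\end{align*}
```

In the third case, dw(p) = dp on [p1, p2] and vanishes elsewhere, which is why multiplying the uniform distribution function by p2 − p1 recovers the unnormalized partial area.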

Let $\hat\Delta = \int_0^1 \hat D(p)\,dw(p)$. From Section 2, we see that the distribution of $\sqrt{n}\,(\hat\Delta - \Delta)$ can be approximated by the conditional distribution of $\sqrt{n}\,(\Delta^* - \hat\Delta)$ given the data. Therefore, a confidence interval for Δ can be computed as follows:

Step 1. Generate K (say K = 1,000) independent standard normal samples $\eta_1^{(k)}, \ldots, \eta_n^{(k)}$, k = 1, …, K. For each k, compute $\Delta_k^* = \int_0^1 D_k^*(p)\,dw(p)$, based on $\eta_1^{(k)}, \ldots, \eta_n^{(k)}$. Let $S_\Delta$ be the sample standard deviation of $\{\Delta_k^* - \hat\Delta,\ k = 1, \ldots, K\}$.

Step 2. A 100(1 − α)% confidence interval for Δ is given by

$$\hat\Delta \pm z_{1-\alpha/2}\, S_\Delta,$$

where $z_{1-\alpha/2}$ is the 100(1 − α/2)th percentile of the standard normal distribution.
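The two steps can be sketched in Python once the perturbed weighted areas are available. In this sketch the Monte Carlo draws are passed in as a plain list, and a trapezoidal rule on a grid is used as an assumed numerical approximation to the pAUC-type integral; the function names are illustrative only.

```python
from statistics import NormalDist, stdev


def trapezoid_area(grid, values):
    """Approximate the integral of a curve given on an increasing grid.

    This numerical rule is an assumption of the sketch; the empirical
    difference curve is a step function, so a fine grid is advisable.
    """
    return sum(
        (grid[i + 1] - grid[i]) * (values[i] + values[i + 1]) / 2.0
        for i in range(len(grid) - 1)
    )


def weighted_area_ci(delta_hat, delta_star, alpha=0.05):
    """Step 1 and Step 2: delta_hat +/- z_{1 - alpha/2} * S_Delta.

    delta_hat  : observed difference in weighted areas
    delta_star : list of K perturbed differences (one per Monte Carlo draw)
    """
    s_delta = stdev([d - delta_hat for d in delta_star])
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return delta_hat - z * s_delta, delta_hat + z * s_delta


# Hypothetical numbers standing in for K = 5 Monte Carlo draws.
lower, upper = weighted_area_ci(0.12, [0.05, 0.20, 0.11, 0.16, 0.08])
```

A two-sided test of Δ = 0 at level α then rejects when the interval excludes zero.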

One-sided confidence intervals and hypothesis tests for Δ can be obtained similarly.

3.2 Confidence Bands for D(p) = ROC(1)(p) − ROC(2)(p)

The following theorem is a direct consequence of Theorem 2 and Corollary 2.

Theorem 3

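For 0 < α < 1 and [p1, p2] ⊂ [0, 1], let Cα be determined by

$$\lim_{n \to \infty} P\left\{ \sup_{p \in [p_1, p_2]} \left| \frac{D^*(p) - \hat D(p)}{S_D(p)} \right| \le C_\alpha \,\Big|\, \text{data} \right\} = 1 - \alpha,$$

where $S_D^2(p)$ is a consistent estimate of the variance of $\hat D(p)$. Then, as n → ∞,

$$P\{ \hat D(p) - C_\alpha S_D(p) \le D(p) \le \hat D(p) + C_\alpha S_D(p), \text{ for all } p \in [p_1, p_2] \} \to 1 - \alpha.$$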

The confidence band can be computed as follows:

Step 1. Generate K1 (say K1 = 500) independent standard normal samples $\eta_1^{(k)}, \ldots, \eta_n^{(k)}$, k = 1, …, K1. For each k, compute $D_k^*(p)$, based on $\eta_1^{(k)}, \ldots, \eta_n^{(k)}$. Let $S_D(p)$ be the sample standard deviation of $\{D_k^*(p) - \hat D(p),\ k = 1, \ldots, K_1\}$.

Step 2. Generate another K2 (say K2 = 500) independent realizations $D_1^*(p), \ldots, D_{K_2}^*(p)$. Compute Cα as the 100(1 − α)th percentile of $\sup_{p_1 \le p \le p_2} |D_1^*(p) - \hat D(p)| / S_D(p), \ldots, \sup_{p_1 \le p \le p_2} |D_{K_2}^*(p) - \hat D(p)| / S_D(p)$.

Step 3. A 100(1 − α)% simultaneous confidence band for D(p) over [p1, p2] is

$$\bigl( \hat D(p) - C_\alpha S_D(p),\ \hat D(p) + C_\alpha S_D(p) \bigr), \quad p \in [p_1, p_2].$$
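The three steps translate fairly directly into code. The Python sketch below takes the estimated difference curve on a finite grid of p values together with two independent batches of perturbed curves, and returns the band and the critical value. Evaluating the supremum on a grid, taking absolute standardized deviations, and using an order statistic as the empirical percentile are assumptions of this sketch.

```python
import math
from statistics import stdev


def confidence_band(d_hat, d_star_1, d_star_2, alpha=0.05):
    """Simultaneous band for D(p) on a grid.

    d_hat    : list of D_hat(p) over the grid
    d_star_1 : K1 perturbed curves (each a list over the grid), used for S_D(p)
    d_star_2 : K2 further perturbed curves, used for the critical value
    """
    m = len(d_hat)
    # Step 1: pointwise standard deviation of D*_k(p) - D_hat(p).
    s_d = [stdev([curve[j] - d_hat[j] for curve in d_star_1]) for j in range(m)]
    # Step 2: supremum of the standardized absolute deviation for each curve.
    sups = sorted(
        max(abs(curve[j] - d_hat[j]) / s_d[j] for j in range(m))
        for curve in d_star_2
    )
    k = max(1, math.ceil((1.0 - alpha) * len(sups) - 1e-9))
    c_alpha = sups[k - 1]
    # Step 3: the band D_hat(p) +/- C_alpha * S_D(p).
    band = [(d_hat[j] - c_alpha * s_d[j], d_hat[j] + c_alpha * s_d[j]) for j in range(m)]
    return band, c_alpha


# Hypothetical two-point grid with tiny batches, for illustration only.
band, c_alpha = confidence_band(
    [0.0, 0.0],
    [[-1.0, -2.0], [1.0, 2.0], [0.0, 0.0]],
    [[0.5, 0.0], [0.0, 3.0], [-2.0, 1.0], [0.1, 0.1]],
    alpha=0.25,
)
```

Using separate batches for the standard deviation and for the critical value mirrors the K1 and K2 split in the steps above.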

3.3 Determine the Set of Specificities With Differential Sensitivity

All of the preceding results are focused on comparisons of ROC curves for a prespecified set of specificities. In practice, it often is of interest to find the set of specificities at which one test is more sensitive than the other. Specifically, let R = {p ∈ [0, 1]: ROC(1)(p) > ROC(2)(p)}. We wish to find an estimated set $\hat R$ such that $P(\hat R \subset R) = 1 - \alpha$.

Note that R is an unknown set of specificities, not a parameter. Next we outline a procedure for estimating R based on an idea of Berger and Boos (1999) and Berger and Hsu (1996) who studied the problem of estimating the onset and duration of a treatment effect. Our procedure is as follows:

Step 1. For each p, conduct a one-sided level α/2 test of H0p: D(p) = 0 versus Hap: D(p) > 0, as discussed in Section 3.1.

Step 2. Let p0 be an a priori fixed starting value. If H0p0 is accepted, then no confidence statement is made. If H0p0 is rejected, then test sequentially downward from p0, and let P1 be the first p for which the hypothesis H0p is accepted. Also test sequentially upward from p0, and let P2 be the first p for which the hypothesis H0p is accepted. Consequently, $\hat R$ = [P1, P2] is the largest interval containing p0 for which H0p is rejected for all p ∈ [P1, P2].

Theorem 4

Let $\hat R$ = [P1, P2] be defined by the foregoing algorithm. Then, as n → ∞,

$$P(\hat R \subset R) = P(\mathrm{ROC}^{(1)}(p) > \mathrm{ROC}^{(2)}(p) \text{ for all } P_1 \le p \le P_2) \ge 1 - \alpha.$$

The proof of this theorem is essentially the same as that of Berger and Boos (1999) and thus is omitted. Note that in Step 1, a one-sided hypothesis is tested at level α/2 instead of level α; this is necessary to achieve the overall confidence level of 1 − α for $\hat R$. As argued by Berger and Boos (1999), the starting value p0 should be chosen beforehand and in a region likely to produce significant results. A statistical approach that requires less knowledge is to repeat the method at k different starting points using level α/k for each one.
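The sequential search is simple to express on a finite grid. The Python sketch below assumes that pointwise estimates and Monte Carlo standard deviations of the difference curve are already available on an increasing grid, performs the one-sided level α/2 tests with a normal critical value, and walks outward from the starting index. Reporting the outermost rejected grid points as the interval endpoints is a discretization choice made for this sketch, and the function names are hypothetical.

```python
from statistics import NormalDist


def pointwise_reject(d_hat, s_d, alpha=0.05):
    """One-sided level alpha/2 tests of H0p: D(p) = 0 against D(p) > 0."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return [d / s > z for d, s in zip(d_hat, s_d)]


def differential_sensitivity_interval(grid, reject, start):
    """Largest run of rejected grid points containing grid[start].

    Returns None when H0 is not rejected at the starting value, in which
    case no confidence statement is made.
    """
    if not reject[start]:
        return None
    lo = start
    while lo > 0 and reject[lo - 1]:              # test sequentially downward
        lo -= 1
    hi = start
    while hi < len(grid) - 1 and reject[hi + 1]:  # test sequentially upward
        hi += 1
    return grid[lo], grid[hi]


# Hypothetical grid and estimates, for illustration only.
grid = [0.1, 0.2, 0.3, 0.4, 0.5]
reject = pointwise_reject([0.0, 0.3, 0.4, 0.3, 0.1], [0.1] * 5)
interval = differential_sensitivity_interval(grid, reject, start=2)
```

The starting index plays the role of the a priori value p0 and should be fixed before looking at the data.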

3.4 Multiple Comparisons of More Than Two Diagnostic Tests

In this section we illustrate how to extend our method to multiple comparisons of more than two ROC curves. Without loss of generality, we consider three diagnostic tests. Similar to Theorem 1, it can be shown that

$$\sqrt{n} \begin{pmatrix} \hat D^{(12)}(p) - D^{(12)}(p) \\ \hat D^{(13)}(p) - D^{(13)}(p) \\ \hat D^{(23)}(p) - D^{(23)}(p) \end{pmatrix} \xrightarrow{d} \begin{pmatrix} U^{(12)}(p) \\ U^{(13)}(p) \\ U^{(23)}(p) \end{pmatrix},$$

where $D^{(ij)}(p) = \mathrm{ROC}^{(j)}(p) - \mathrm{ROC}^{(i)}(p)$, $\hat D^{(ij)}(p)$ is the sample estimate of $D^{(ij)}(p)$, $U^{(ij)}(p) = Z^{(i)}(1 - p) - Z^{(j)}(1 - p)$, and $Z^{(v)}(p)$ is as defined in Theorem 1 for v = 1, 2, 3.

As in Section 2.5, we construct $D^{(12)*}(p)$, $D^{(13)*}(p)$, and $D^{(23)*}(p)$ so that the joint distribution of $(U^{(12)}, U^{(13)}, U^{(23)})$ is approximated by that of $(U^{(12)*}, U^{(13)*}, U^{(23)*})$ given the data, where $U^{(ij)*}(p) = \sqrt{n}\,\{D^{(ij)*}(p) - \hat D^{(ij)}(p)\}$. This allows us to perform various multiple comparisons. The following algorithm can be used to construct joint confidence intervals for three pairwise comparisons of weighted areas under the curve, Δ(12), Δ(13), and Δ(23), where $\Delta^{(ij)} = \int_0^1 D^{(ij)}(p)\,dw(p)$:

Step 1. Generate K1(say K1 = 500) realizations of Z(12)* ≡ Δ(12)* − Δ̂(12), Z(13)* ≡ Δ(13)* − Δ̂(13), and Z(23)* ≡ Δ(23)* − Δ̂(23) by generating K1 independent standard normal samples. Compute their sample standard deviations SΔ(12), SΔ(13), and SΔ(23)

Step 2. Generate another K2 (say K2 = 500) independent realizations of Z(12)*, Z(13)*, and Z(23)*, and let Ak be the maximum of $|Z^{(12)*}| / S_{\Delta^{(12)}}$, $|Z^{(13)*}| / S_{\Delta^{(13)}}$, and $|Z^{(23)*}| / S_{\Delta^{(23)}}$ for the kth realization, k = 1, …, K2.

Step 3. Let Cα be the 100(1 − α)th percentile of A1, …, AK2. The 100(1 − α)% joint confidence intervals for Δ(12), Δ(13), and Δ(23) are given by

$$\hat\Delta^{(12)} \pm C_\alpha S_{\Delta^{(12)}}, \quad \hat\Delta^{(13)} \pm C_\alpha S_{\Delta^{(13)}}, \quad \text{and} \quad \hat\Delta^{(23)} \pm C_\alpha S_{\Delta^{(23)}}.$$
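A compact Python sketch of this max-type procedure is given below. It assumes the perturbed deviations for all pairwise comparisons have already been generated and are passed in as tuples (one entry per comparison), uses absolute standardized deviations, and takes an order statistic as the empirical percentile. The function name and data layout are illustrative assumptions.

```python
import math
from statistics import stdev


def joint_intervals(delta_hat, z_star_1, z_star_2, alpha=0.05):
    """Joint confidence intervals for several weighted-area differences.

    delta_hat : list of estimated differences, one per pairwise comparison
    z_star_1  : K1 draws of the perturbed deviations (tuples), used for the SDs
    z_star_2  : K2 further draws, used for the critical value C_alpha
    """
    m = len(delta_hat)
    # Step 1: sample standard deviation for each comparison.
    s = [stdev([z[j] for z in z_star_1]) for j in range(m)]
    # Step 2: maximum standardized absolute deviation for each draw.
    a = sorted(max(abs(z[j]) / s[j] for j in range(m)) for z in z_star_2)
    # Step 3: empirical 100(1 - alpha)th percentile and the intervals.
    k = max(1, math.ceil((1.0 - alpha) * len(a) - 1e-9))
    c_alpha = a[k - 1]
    intervals = [(delta_hat[j] - c_alpha * s[j], delta_hat[j] + c_alpha * s[j]) for j in range(m)]
    return intervals, c_alpha


# Hypothetical draws for three comparisons, for illustration only.
intervals, c_alpha = joint_intervals(
    [0.2, 0.1, -0.1],
    [(-1.0, -2.0, -3.0), (1.0, 2.0, 3.0), (0.0, 0.0, 0.0)],
    [(0.5, 0.0, 0.0), (0.0, 3.0, 0.0), (-2.0, 1.0, 0.0), (0.1, 0.1, 0.3)],
    alpha=0.25,
)
```

Because a single critical value is shared by all comparisons, the intervals hold jointly; a comparison is declared significant when its interval excludes zero.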

Other multiple inferences, such as multiple confidence bands, can be obtained similarly.

4. SIMULATIONS

In this section we report a simulation study conducted to evaluate the finite-sample performance of our methods. We also study the consequence of ignoring the within-subject correlations.

We generated repeated measurements using a setting similar to Emir et al. (2000). A failure time was generated for each subject from an exponential distribution with an expected value of 4. A subject was classified as “healthy” before the failure and as “diseased” after the failure. We then generated $X^{(1)} = Y_1\sqrt{\lambda} + Y_2\sqrt{1 - \lambda}$ and $X^{(2)} = Y_1\sqrt{\lambda} + Y_3\sqrt{1 - \lambda}$, where $Y_i = (Y_{i1}, \ldots, Y_{i4})'$, i = 1, 2, 3, are iid multivariate normal random vectors with mean 0 and $\mathrm{cov}(Y_{ij}, Y_{ik}) = \rho^{|j - k|}$, for j, k = 1, 2, 3, 4. The value for subject i for marker v at visit j is $X_{ij}^{(v)}$ if the subject is “healthy” at the jth visit and $X_{ij}^{(v)} + 1$ if he or she is “diseased.” Note that λ and ρ measure the between-marker and within-marker correlations, respectively.
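A data-generating scheme of this kind can be sketched with the Python standard library. Several details below are assumptions made for illustration rather than facts stated in the text: the four visits are placed at times 1, 2, 3, and 4; the two markers are mixed with square-root weights so that λ equals the between-marker correlation; and the within-vector correlation ρ^|j−k| is produced by a stationary first-order autoregression.

```python
import math
import random


def ar1_vector(rho, length, rng):
    """Standard normal vector with cov(Y_j, Y_k) = rho ** |j - k|."""
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(length - 1):
        y.append(rho * y[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return y


def simulate_subject(rho, lam, rng, visits=4, mean_failure=4.0, shift=1.0):
    """Simulate both markers for one subject over repeated visits."""
    y1, y2, y3 = (ar1_vector(rho, visits, rng) for _ in range(3))
    failure = rng.expovariate(1.0 / mean_failure)      # mean failure time of 4
    # ASSUMPTION: visit j takes place at time j (j = 1, ..., visits).
    diseased = [j + 1 > failure for j in range(visits)]
    a, b = math.sqrt(lam), math.sqrt(1.0 - lam)
    marker1 = [a * u + b * v + shift * d for u, v, d in zip(y1, y2, diseased)]
    marker2 = [a * u + b * v + shift * d for u, v, d in zip(y1, y3, diseased)]
    return {"diseased": diseased, "marker1": marker1, "marker2": marker2}


# One hypothetical data set with 30 subjects.
rng = random.Random(2008)
subjects = [simulate_subject(rho=0.5, lam=0.75, rng=rng) for _ in range(30)]
```

Each subject contributes a cluster of healthy visits and a cluster of diseased visits (either may be empty), which is the data layout assumed in Section 2.2.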

Table 1 reports the achieved significance level of our test for the equality of pAUCs over three intervals (.1,.3), (.1,.5), and (.05,.95) for various sample sizes and between-marker and within-marker correlations. Each entry is based on 500 simulations. It is observed that the achieved significance level agrees with the nominal level fairly well even for small samples (nD = nH = 30).

Table 1.

Achieved significance level of our method for two-sided test of equality of partial areas under the curves (pAUC(p1, p2)) (nominal level = .05)

                          Achieved significance level
(ρ, λ)      (nD, nH)      pAUC(.05,.35)   pAUC(.05,.55)   pAUC(.05,.95)
(0, 0)      (30, 30)      .07             .07             .06
            (50, 50)      .06             .05             .05
            (100, 100)    .05             .05             .06
(.25, .5)   (30, 30)      .05             .05             .04
            (50, 50)      .04             .05             .05
            (100, 100)    .04             .06             .04
(.50, .75)  (30, 30)      .05             .05             .06
            (50, 50)      .05             .04             .04
            (100, 100)    .05             .04             .04
(.75, .75)  (30, 30)      .05             .05             .06
            (50, 50)      .04             .04             .04
            (100, 100)    .04             .05             .05

NOTE: ρ and λ are measures of the within-subject and between-marker correlations, and (nD, nH) are the number of clusters for diseased and healthy subjects.

Next, we illustrate the consequence of ignoring the within-subject correlations. Table 2 reports the achieved significance levels of our method (designated method 1) compared with the method that treats the observations within each cluster as independent (designated method 2) for testing the equality of the AUCs. We set the between-marker correlation λ = .5 and vary the within-marker correlation ρ over .05, .75, and .99. It can be seen that the achieved significance level of our method is always close to the nominal level of .05, whereas ignoring the correlations within the cluster can lead to an unreasonably high probability of type I error (e.g., .39) as the within-subject correlation ρ and cluster size grow. Therefore, ignoring the correlations within the cluster may lead to seriously biased conclusions.

Table 2.

Achieved significance level of the proposed method (method 1) versus the method ignoring clustering (method 2) for testing the equality of the total areas under the two ROC curves [nominal level = .05, (nD, nH) = (100, 100), λ = .5]

                      Achieved significance level
ρ      Cluster size   Method 1   Method 2
.05    4              .05        .04
       8              .06        .05
.75    4              .05        .13
       8              .06        .26
.99    4              .05        .22
       8              .04        .39

NOTE: ρ is a measure of within-subject correlation, and (nD, nH) are the number of diseased and healthy subjects.

5. EXAMPLES

5.1 Detection of Glaucomatous Deterioration

A visual field test is a technique for measuring an individual’s entire scope of vision. It maps the visual fields of each eye individually. Longitudinal visual field image data can be used for early diagnosis of glaucomatous progression. But reliable detection remains one of the most difficult challenges for clinicians, due to the complex structure and high level of noise in the longitudinal visual field image data. Together with researchers at UCLA’s Jules Stein Eye Institute, Jiang (2005) developed some outlier statistics to use as a diagnostic marker for detecting visual field deterioration in glaucoma patients based on Bayesian hierarchical modeling. Here we illustrate how to use our methods to compare the diagnostic power of different models in discriminating between stable and progressive eyes. Our data set comprises a visual field series of 188 eyes of 171 patients over 8 years of follow-up study; it is independent of the training sample used by Jiang (2005) for model building. Apparently these are clustered data, because some data are from both eyes of the same patient. A paired design is appropriate for ROC analysis, because both models are applied to the same set of eyes.

Figure 1 depicts the empirical ROC curves of the diagnostic markers based on two candidate models, referred to as models 1 and 2 here, which correspond to models 1 and 11 of Jiang (2005, p. 45). Figure 1 shows that the diagnostic marker based on model 2 has better power than model 1 for detecting glaucomatous progression. We performed a test for the equality of the total AUCs using the method in Section 3.1; the two-sided p value was .024, confirming a significant difference between the two diagnostic markers.

Figure 1. Empirical ROC curves of two diagnostic markers for detecting glaucoma deterioration (——, model 2; — —, model 1).

Figure 2 shows the confidence band for the difference function of the ROC curves between models 1 and 2 over the specificities .1–.9. Because the confidence band is broad over this range of specificities, it is too conservative to detect any significant difference.

Figure 2. 95% confidence band for the difference of the ROC curves between models 1 and 2 from .1 to .9 for the glaucoma data (——, difference; · · · ·, lower bound; · · · ·, upper bound).

Finally, we applied the method in Section 3.3 to estimate the range of specificities at which the diagnostic marker based on model 2 is more sensitive than that based on model 1. We obtained $\hat R$ = [.34, .92]; therefore, with 95% confidence, model 2 had a higher sensitivity than model 1 at specificities between .34 and .92.

5.2 Acoustic Startle Response Data

The acoustic startle response in human studies is quantified by the startle blink response to a startling stimulus. It is typically measured and summarized by the orbicularis oculi electromyogram (ooEMG), a time series plot. But high variabilities make it difficult to evaluate the ooEMG and distinguish a normal response (healthy) from no response (diseased).

In this example we consider four diagnostic markers developed by researchers at UCLA’s Semel Institute for Neuroscience: the peak magnitude after the stimulus, the duration of the maximum magnitude above a certain threshold, the area under the maximum magnitude, and the ratio representing the amount exceeding a threshold. The data set comprises 37 participants from a total of 229 experiments. The maximum number of repeated experiments per person is 16. A detailed description of the study has been given by Waters and Ornitz (2008). Results of the same subject from different experiments are considered to be highly correlated; thus we adopt the clustered data setting. This is also a paired design or, more accurately, a block design, because each ooEMG yields values for all four markers.

Figure 3 displays the empirical ROC curves for the four markers. We can see that ratio appears to be the best and peak the worst. We used the multiple comparison method outlined in Section 3.4 to construct 95% joint confidence intervals for six pairwise differences of the pAUC, pAUC(.05, .95), for these four markers. The results, summarized in Table 3, show that area, ratio, and duration are all significantly better than peak. No significant difference is found among area, ratio, and duration.

Figure 3. Empirical ROC curves of four diagnostic markers for the acoustic startle response data (——, peak; · – ·, area; · · · ·, ratio; · – ·, duration).

Table 3.

Simultaneous 95% confidence intervals for the pairwise differences of pAUC(.05, .95) of four tests for the acoustic response data

         Area          Ratio          Duration
Peak     (.03, .33)    (.05, .35)     (.01, .30)
Area                   (−.03, .08)    (−.10, .06)
Ratio                                 (−.12, .03)

NOTE: pAUCpeak = .66, pAUCarea = .84, pAUCratio = .86, pAUCduration = .81, and Cα = 2.62.

Figure 4 gives the 95% confidence band for the difference of ROC curves between the peak and area over specificities of .1–.9. The confidence bands again seem to be conservative.

Figure 4. 95% confidence band of the difference of the ROC curves between peak and area from .1 to .9 for the acoustic startle response data (——, difference; · · · ·, lower bound; · · · ·, upper bound).

We also applied the method in Section 3.3 to estimate the set of specificities at which area is more sensitive than peak. We found $\hat R$ = [.38, .95]; therefore, with 95% confidence, area has a higher sensitivity than peak at specificities between .38 and .95.

6. DISCUSSION

The Monte Carlo resampling method is commonly used to approximate an empirical process, such as the empirical distribution process, that can be represented as the sum of iid random variables. But this article is the first to rigorously demonstrate that this method also can be extended to approximate quantile processes and ROC curves without the need to estimate density functions for a paired design. Clustered observations also are allowed. The result allows us to approximate the distribution of any continuous functional of correlated empirical ROC curves and thus provides a unified approach to the nonparametric comparison of ROC curves. Our method demonstrates satisfactory performance for both large and relatively small samples in simulations. We have developed software to implement the proposed procedures using R; this software can be downloaded at http://roc.cluster.googlepages.com/.

Note that much effort has been directed at approximating a quantile process without involving density estimation. For example, Li, Hollander, McKeague, and Yang (1996a) developed a nonparametric likelihood ratio method; Doss and Gill (1992) studied the bootstrap method, which was later adapted for ROC analysis by Li et al. (1996b); and Keaney and Wei (1994) introduced a Monte Carlo method for estimating a median survival time. But all of the earlier methods consider only iid observations and thus do not account for within-subject correlation for clustered data.

An alternative approach is to use the bootstrap method to approximate ROC processes by resampling the independent subjects. Although the bootstrap method has been commonly suggested in practice and is simple to use in principle, its consistency for the paired design with clustered data has not been established in the ROC analysis literature. For the unpaired case, where different markers are performed on different sets of subjects, Li et al. (1996b, 1999) showed that the bootstrap method is consistent in the sense that the bootstrapped ROC process has the same limiting process as the empirical ROC process. Extending their result to the paired design with clustered data does not appear to be trivial. Future work is needed to develop a theoretical justification for the bootstrap method for clustered data under the paired design.

Acknowledgments

The authors thank the joint editors and two referees for their insightful comments and suggestions, Luohua Jiang and Ed M. Ornitz for providing the data in the examples, and Tianxi Cai for helpful discussions. Li’s research was supported in part by a National Institutes of Health grant.

APPENDIX: PROOFS

Proof of Lemma 1

Note that

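$$
\sqrt{n}\begin{pmatrix}
\hat F^{(1)}(t)-F^{(1)}(t)\\
\hat F^{(2)}(t)-F^{(2)}(t)\\
\hat G^{(1)}(t)-G^{(1)}(t)\\
\hat G^{(2)}(t)-G^{(2)}(t)
\end{pmatrix}
=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}V_i(t),
$$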

where

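$$
V_i(t)=\begin{pmatrix}
\dfrac{n}{M^{(1)}}\sum_{j=1}^{m_i^{(1)}}\{I(X_{ij}^{(1)}\le t)-F^{(1)}(t)\}\\[6pt]
\dfrac{n}{M^{(2)}}\sum_{j=1}^{m_i^{(2)}}\{I(X_{ij}^{(2)}\le t)-F^{(2)}(t)\}\\[6pt]
\dfrac{n}{N^{(1)}}\sum_{j=1}^{n_i^{(1)}}\{I(X_{ij}^{(1)}\le t)-G^{(1)}(t)\}\\[6pt]
\dfrac{n}{N^{(2)}}\sum_{j=1}^{n_i^{(2)}}\{I(X_{ij}^{(2)}\le t)-G^{(2)}(t)\}
\end{pmatrix},\qquad i=1,\ldots,n,
$$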

are independent random vectors. Applying the Cramér–Wold device and the Lyapunov central limit theorem, and following the approach of Billingsley (1999) for empirical distribution processes, it can be shown that

$$
\frac{1}{\sqrt{n}}\sum_{i=1}^{n}V_i(t)\xrightarrow{d}W(t)\quad\text{in } D(\mathbb{R})^4, \tag{A.1}
$$

where $W(t)=(W_{F^{(1)}}(t),W_{F^{(2)}}(t),W_{G^{(1)}}(t),W_{G^{(2)}}(t))'$ is a Gaussian process in $D(\mathbb{R})^4$ whose variance–covariance function is the limit of $\frac{1}{n}\sum_{i=1}^{n}\operatorname{cov}(V_i(t),V_i(t))$ as n → ∞.
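The sum representation at the start of this proof comes from regrouping a double sum by subject. As a sketch for the first component only, and assuming (the definitions are not restated in this appendix) that $\hat F^{(1)}$ is the pooled empirical distribution function with $M^{(1)}=\sum_{i=1}^{n}m_i^{(1)}$:

$$
\sqrt{n}\{\hat F^{(1)}(t)-F^{(1)}(t)\}
=\frac{\sqrt{n}}{M^{(1)}}\sum_{i=1}^{n}\sum_{j=1}^{m_i^{(1)}}\{I(X_{ij}^{(1)}\le t)-F^{(1)}(t)\}
=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\frac{n}{M^{(1)}}\sum_{j=1}^{m_i^{(1)}}\{I(X_{ij}^{(1)}\le t)-F^{(1)}(t)\}\right].
$$

The bracketed term is the first component of $V_i(t)$; it depends only on subject i, which is why the summands are independent across subjects even though observations within a subject may be correlated.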

Proof of Theorem 1

By (A.1), the compact differentiability of the inverse function and the functional delta method (see, e.g., Andersen et al. 1993), we have

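$$
\sqrt{n}\begin{pmatrix}
\hat F^{(1)-1}-F^{(1)-1}\\
\hat F^{(2)-1}-F^{(2)-1}\\
\hat G^{(1)}-G^{(1)}\\
\hat G^{(2)}-G^{(2)}
\end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
-\dfrac{W_{F^{(1)}}(F^{(1)-1})}{F^{(1)\prime}(F^{(1)-1})}\\[8pt]
-\dfrac{W_{F^{(2)}}(F^{(2)-1})}{F^{(2)\prime}(F^{(2)-1})}\\[8pt]
W_{G^{(1)}}\\
W_{G^{(2)}}
\end{pmatrix} \tag{A.2}
$$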

in $D[a,b]\times D[a,b]\times D[F^{(1)-1}(a),F^{(1)-1}(b)]\times D[F^{(2)-1}(a),F^{(2)-1}(b)]$, as n → ∞. This, along with lemma A.1 of Li et al. (1996a,b) and the functional delta method, implies that

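$$
\sqrt{n}\begin{pmatrix}
\hat G^{(1)}(\hat F^{(1)-1}(p))-G^{(1)}(F^{(1)-1}(p))\\
\hat G^{(2)}(\hat F^{(2)-1}(p))-G^{(2)}(F^{(2)-1}(p))
\end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
Z^{(1)}(p)\\
Z^{(2)}(p)
\end{pmatrix} \tag{A.3}
$$

in D[a, b], where $Z^{(v)}$, v = 1, 2, are as defined in Theorem 1.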

Finally, combining (A.3) and the continuous mapping theorem proves the theorem.
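The composite estimator $\hat G(\hat F^{-1}(p))$ that appears in (A.3) is straightforward to compute from pooled samples. The Python sketch below is one possible implementation; the function name `q_hat` and the left-continuous convention for the empirical inverse are assumptions made for this example. If, as is conventional, F and G are the marker distributions in nondiseased and diseased subjects, the empirical ROC curve at false-positive rate p would be `1 - q_hat(x, y, 1 - p)`.

```python
import numpy as np

def q_hat(x, y, p):
    """Empirical G_hat(F_hat^{-1}(p)) for samples x (from F) and y (from G)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    p = np.atleast_1d(np.asarray(p, dtype=float))

    # Left-continuous empirical quantile: inf{t : F_hat(t) >= p}
    k = np.clip(np.ceil(p * len(x)).astype(int) - 1, 0, len(x) - 1)
    t = x[k]

    # G_hat(t) = proportion of y-values that are <= t
    return np.searchsorted(y, t, side="right") / len(y)
```

Clustering does not change this point estimate, because the pooled empirical distribution functions ignore subject labels; it matters only for the variability of the estimate.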

Proof of Theorem 2

First, note that

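$$
\sqrt{n}\{D^{*}(p)-\hat D(p)\}
=\sqrt{n}\{Q^{(2)*}(1-p)-\hat Q^{(2)}(1-p)\}
-\sqrt{n}\{Q^{(1)*}(1-p)-\hat Q^{(1)}(1-p)\},
$$

where for v = 1, 2, $Q^{(v)}(p)=G^{(v)}(F^{(v)-1}(p))$, $\hat Q^{(v)}(p)=\hat G^{(v)}(\hat F^{(v)-1}(p))$, and $Q^{(v)*}(p)=\zeta^{(v)*}(\xi^{(v)}(p))$ are as defined before Theorem 2. Moreover, by (A.3),

$$
\sqrt{n}\begin{pmatrix}
\hat Q^{(1)}(p)-Q^{(1)}(p)\\
\hat Q^{(2)}(p)-Q^{(2)}(p)
\end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
Z^{(1)}(p)\\
Z^{(2)}(p)
\end{pmatrix}
\quad\text{in } D[a,b]. \tag{A.4}
$$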

Thus, to prove Theorem 2, it suffices to show that for every subsequence {nj} of {n}, there is a further subsequence {mk} ⊂ {nj} such that, conditional on the data,

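$$
\sqrt{n}\begin{pmatrix}
Q^{(1)*}(p)-\hat Q^{(1)}(p)\\
Q^{(2)*}(p)-\hat Q^{(2)}(p)
\end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
Z^{(1)}(p)\\
Z^{(2)}(p)
\end{pmatrix}
\quad\text{in } D[q_1,q_2], \tag{A.5}
$$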

along the subsequence {mk} for almost all data realizations and a < q1 < q2 < b.

Here we prove only the convergence of a marginal process in (A.5), because the joint convergence is derived along the same lines. For simplicity, we also omit the superscript.

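We first show that for any sequence $\delta_n(p)$ satisfying $\sup_{p\in[q_1,q_2]}|\delta_n(p)|=O(n^{-1/2})$,

$$
\sup_{p\in[q_1,q_2]}\left|\sqrt{n}\{\hat Q(p+\delta_n(p))-\hat Q(p)\}-Q'(p)\sqrt{n}\,\delta_n(p)\right|\xrightarrow{P}0. \tag{A.6}
$$

Let $Z_n(p)=\sqrt{n}(\hat Q(p)-Q(p))$. Then, by (A.4), $Z_n\xrightarrow{d}Z$ in D[a, b] as n → ∞. By the Skorohod–Dudley–Wichura representation theorem (cf. thm. 4 of Shorack and Wellner 1984), there exist random elements $\tilde Z_n$, $\tilde Z$ of D[a, b], defined on a common probability space, such that $\tilde Z_n\overset{d}{=}Z_n$, $\tilde Z\overset{d}{=}Z$, and $\|\tilde Z_n-\tilde Z\|\to 0$ almost surely in D[a, b] as n → ∞, where $\tilde Z$ has continuous sample paths on [a, b] and ||·|| represents the supremum norm. Because the sample paths of $\tilde Z$ are continuous on the compact interval [a, b], they also are uniformly continuous on [a, b]. Thus

$$
\begin{aligned}
\sup_{p\in[q_1,q_2]}\left|\tilde Z_n(p+\delta_n(p))-\tilde Z_n(p)\right|
&\le \sup_{p\in[q_1,q_2]}\left|\tilde Z_n(p)-\tilde Z(p)\right|
+\sup_{p\in[q_1,q_2]}\left|\tilde Z(p+\delta_n(p))-\tilde Z(p)\right|\\
&\quad+\sup_{p\in[q_1,q_2]}\left|\tilde Z_n(p+\delta_n(p))-\tilde Z(p+\delta_n(p))\right|
\xrightarrow{a.s.}0,
\end{aligned} \tag{A.7}
$$

where the first and third terms converge to 0 because $\|\tilde Z_n-\tilde Z\|\to 0$ almost surely in D[a, b] and the second term goes to 0 because $\tilde Z$ has uniformly continuous sample paths on [a, b]. Because $\tilde Z_n\overset{d}{=}Z_n$, (A.7) implies that as n → ∞,

$$
\sup_{p\in[q_1,q_2]}\left|Z_n(p+\delta_n(p))-Z_n(p)\right|\xrightarrow{P}0. \tag{A.8}
$$

Moreover, by the mean value theorem, we have that $\sqrt{n}\{Q(p+\delta_n(p))-Q(p)\}=Q'(\tilde\delta_n)\sqrt{n}\,\delta_n(p)$, where $\tilde\delta_n$ lies between p and p + δn(p). This, together with the fact that Q′(p) is uniformly continuous on a compact interval, implies that as n → ∞,

$$
\sup_{p\in[q_1,q_2]}\left|\sqrt{n}\{Q(p+\delta_n(p))-Q(p)\}-Q'(p)\sqrt{n}\,\delta_n(p)\right|\to 0. \tag{A.9}
$$

Because $Z_n(p+\delta_n(p))-Z_n(p)=\sqrt{n}\{\hat Q(p+\delta_n(p))-\hat Q(p)\}-\sqrt{n}\{Q(p+\delta_n(p))-Q(p)\}$, combining (A.8) and (A.9) proves (A.6).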

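For the foregoing sequence $\delta_n(p)$, we also have

$$
\begin{aligned}
\sup_{p\in[q_1,q_2]}\left|\hat F^{-1}(p+\delta_n(p))-F^{-1}(p)\right|
&\le \sup_{p\in[q_1,q_2]}\left|\hat F^{-1}(p+\delta_n(p))-F^{-1}(p+\delta_n(p))\right|\\
&\quad+\sup_{p\in[q_1,q_2]}\left|F^{-1}(p+\delta_n(p))-F^{-1}(p)\right|
\xrightarrow{P}0,
\end{aligned} \tag{A.10}
$$

where the convergence of the first term follows from the uniform consistency of $\hat F^{-1}$ on [a, b] and the second term goes to 0 because $F^{-1}$ is uniformly continuous on [q1, q2].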

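By (A.6) and (A.10), for any subsequence {nj} of {n}, there is a further subsequence {mk} ⊂ {nj} and a subset $\Omega_0\subset\Omega$ of the sample space such that $P(\Omega_0)=1$ and, for every ω ∈ Ω0,

$$
\sup_{p\in[q_1,q_2]}\left|\sqrt{n}\{\hat Q(p+\delta_n(p))-\hat Q(p)\}-Q'(p)\sqrt{n}\,\delta_n(p)\right|(\omega)\to 0 \tag{A.11}
$$

and

$$
\sup_{p\in[q_1,q_2]}\left|\hat F^{-1}(p+\delta_n(p))-F^{-1}(p)\right|(\omega)\to 0 \tag{A.12}
$$

along the subsequence {mk}.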

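Next, we note from direct calculations that

$$
\sqrt{n}\{Q^{*}(p)-\hat Q(p)\}
=\sqrt{n}\left\{\hat Q\!\left(p+n^{-1/2}W_F^{*}(\hat F^{-1}(p))\right)-\hat Q(p)\right\}
-W_G^{*}\!\left(\hat F^{-1}\!\left(p+n^{-1/2}W_F^{*}(\hat F^{-1}(p))\right)\right).
$$

By (5) and the Skorohod–Dudley–Wichura representation theorem (cf. thm. 4 of Shorack and Wellner 1984), there exist random elements $(\tilde W_F^{*},\tilde W_G^{*})$, $(\tilde W_F,\tilde W_G)$ of D(ℝ), defined on a common probability space, such that, conditional on the data,

$$
(\tilde W_F^{*},\tilde W_G^{*})\overset{d}{=}(W_F^{*},W_G^{*}),\qquad
(\tilde W_F,\tilde W_G)\overset{d}{=}(W_F,W_G), \tag{A.13}
$$

and

$$
\left\|(\tilde W_F^{*},\tilde W_G^{*})-(\tilde W_F,\tilde W_G)\right\|\to 0\quad\text{almost surely in } D(\mathbb{R}) \tag{A.14}
$$

as n → ∞ for almost all data realizations, where $\tilde W_F$ and $\tilde W_G$ have continuous sample paths and ||·|| is the supremum norm. Thus, conditional on the observed data,

$$
\begin{aligned}
\sup_{q_1\le p\le q_2}\left|\tilde W_F^{*}(\hat F^{-1}(p))-\tilde W_F(F^{-1}(p))\right|
&\le \sup_{q_1\le p\le q_2}\left|\tilde W_F^{*}(\hat F^{-1}(p))-\tilde W_F(\hat F^{-1}(p))\right|\\
&\quad+\sup_{q_1\le p\le q_2}\left|\tilde W_F(\hat F^{-1}(p))-\tilde W_F(F^{-1}(p))\right|
\xrightarrow{a.s.}0,
\end{aligned} \tag{A.15}
$$

as n → ∞ for almost all data realizations, where the first term on the right side of the inequality converges to 0 because $\|\tilde W_F^{*}-\tilde W_F\|\to 0$ almost surely in D(ℝ), and the second term goes to 0 because of the uniform consistency of $\hat F^{-1}$ and the uniform continuity of $\tilde W_F(t)$ on any compact interval.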

Finally, by combining (A.11)–(A.15), we can conclude that for every subsequence {nj} of {n}, there is a further subsequence {mk} ⊂ {nj} such that, conditional on the data, $\sqrt{n}(Q^{*}-\hat Q)\xrightarrow{d}Z$ in D[q1, q2], along the subsequence {mk} for almost all data realizations and a < q1 < q2 < b. This concludes the proof of Theorem 2, as argued at the beginning of the proof.

Proof of Corollary 1

By Theorem 2 and the continuity of T(·), for every subsequence {nj} of {n} there is another subsequence {mk} ⊂ {nj} such that, conditional on the data, $T(\sqrt{n}(D^{*}-\hat D))\xrightarrow{d}T(U)$ along the subsequence {mk}, for almost all data realizations.

We first show that as n → ∞,

$$
H_n(t)\to H(t)\quad\text{for every } t, \tag{A.16}
$$

for almost all data realizations. This can be proved by contradiction. If for some t, $H_n(t)$ does not converge to H(t), then there exist an ε > 0 and a subsequence {nj} of {n} such that $|H_{n_j}(t)-H(t)|\ge\varepsilon>0$ for all nj. This implies that no further subsequence of $H_{n_j}(t)$ converges to H(t), which contradicts the result in the first paragraph; thus (A.16) holds. The uniform convergence of $H_n(t)$ to H(t) in t follows from (A.16) and the fact that $H_n(t)$ and H(t) are continuous nondecreasing distribution functions.
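In practice, a result of this kind is what licenses reading critical values off simulated draws. The Python sketch below illustrates the idea for one particular choice of functional, the supremum norm, and turns the resulting critical value into an equal-width band around an estimated difference curve. The function name `sup_band`, its arguments, and the band form are assumptions made for this example and are not taken from the authors' software.

```python
import numpy as np

def sup_band(d_hat, draws, n, alpha=0.05):
    """Equal-width band from Monte Carlo draws.

    d_hat : estimated difference curve on a grid of p values
    draws : array (n_draws, len(grid)) of simulated realizations of
            sqrt(n) * (D* - D_hat) on the same grid
    n     : number of independent subjects
    """
    d_hat = np.asarray(d_hat, dtype=float)
    draws = np.asarray(draws, dtype=float)
    sup_stats = np.max(np.abs(draws), axis=1)   # T(.) taken as the sup norm
    c = np.quantile(sup_stats, 1.0 - alpha)     # estimated critical value
    half = c / np.sqrt(n)
    return d_hat - half, d_hat + half, c
```

Other continuous functionals, such as a weighted integral of the difference curve, can be handled the same way by replacing the line that computes `sup_stats`.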

Contributor Information

Gang Li, Professor, Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA 90095-1772 (E-mail: vli@ucla.edu).

Kefei Zhou, Biostatistics Manager, Amgen Inc., Thousand Oaks, CA 91320 (E-mail: kzhou@amgen.com).

References

  1. Andersen et al. (1993), ???.
  2. Berger R, Boos D. Confidence Limits for the Onset and Duration of Treatment Effect. Biometrical Journal. 1999;41:517–531.
  3. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Statistical Science. 1996;11:283–319.
  4. Billingsley P. Convergence of Probability Measures. 2nd ed. New York: Wiley; 1999.
  5. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44:837–845.
  6. Doss H, Gill RD. An Elementary Approach to Weak Convergence for Quantile Processes With Applications to Censored Survival Data. Journal of the American Statistical Association. 1992;87:869–877.
  7. Emir B, Wieand S, Jung SH, Ying Z. Comparison of Diagnostic Markers With Repeated Measurements: A Nonparametric ROC Curve Approach. Statistics in Medicine. 2000;19:511–523. doi: 10.1002/(sici)1097-0258(20000229)19:4<511::aid-sim353>3.0.co;2-3.
  8. Emir B, Wieand S, Su JQ, Cha S. Analysis of Repeated Markers Used to Predict Progression of Cancer. Statistics in Medicine. 1998;17:2563–2578. doi: 10.1002/(sici)1097-0258(19981130)17:22<2563::aid-sim952>3.0.co;2-o.
  9. Green DM, Swets JA. Signal Detection: Theory and Psychophysics. New York: Wiley; 1966.
  10. Hanley JA, McNeil BJ. The Meaning and Use of the Area Under an ROC Curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
  11. Hanley JA, McNeil BJ. A Method of Comparing the Area Under Two ROC Curves Derived From the Same Cases. Radiology. 1983;148:839–843. doi: 10.1148/radiology.148.3.6878708.
  12. Hsieh F, Turnbull BW. Nonparametric and Semiparametric Estimation of the Receiver Operating Characteristic Curve. The Annals of Statistics. 1996;24:25–40.
  13. Jiang LH. unpublished doctoral dissertation. University of California; Los Angeles: 2005. Bayesian Hierarchical Modelling of Glaucomatous Visual Field Data. ???
  14. Keaney KM, Wei LJ. Interim Analysis Based on Median Survival Times. Biometrika. 1994;81:279–286.
  15. Li G, Hollander M, McKeague I, Yang J. Nonparametric Likelihood Ratio Confidence Bands for Quantile Functions From Incomplete Survival Data. The Annals of Statistics. 1996a;24:628–640.
  16. Li G, Tiwari RC, Wells MT. Quantile Comparison Functions in Two-Sample Problems, With Application to Comparisons of Diagnostic Markers. Journal of the American Statistical Association. 1996b;91:689–698.
  17. Li G, Tiwari RC, Wells MT. Semiparametric Inference for Shift Functions: With Applications to Receiver Operating Characteristic Curves. Biometrika. 1999;86:487–502.
  18. McClish DK. Comparing the Areas Under More Than Two Independent ROC Curves. Medical Decision Making. 1987;7:149–155. doi: 10.1177/0272989X8700700305.
  19. Obuchowski NA. Nonparametric Analysis of Clustered ROC Curve Data. Biometrics. 1997;53:567–578.
  20. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, U.K.: Oxford University Press; 2003.
  21. Shorack GR, Wellner JA. Empirical Processes With Applications to Statistics. New York: Wiley; 1984.
  22. Uno H, Cai T, Tian L, Wei LJ. Evaluating Prediction Rules for t-Year Survivors With Censored Regression Models. Journal of the American Statistical Association. 2007;103:527–537.
  23. Waters AM, Ornitz EM. When the Orbicularis Oculi Response to a Startling Stimulus Is Zero, the Vertical EOG May Reveal That a Blink Has Occurred. Clinical Neurophysiology. 2008;??:??–??. doi: 10.1016/j.clinph.2005.06.012.
  24. Wieand HS, Gail MH, Barray RJ, James KL. A Family of Nonparametric Statistics for Comparing Diagnostic Markers With Paired or Unpaired Data. Biometrika. 1989;76:585–592.
  25. Zhou XH, McClish DK, Obuchowski NA. Statistical Methods in Diagnostic Medicine. New York: Wiley; 2002.
