Abstract
There are several methods available for analyzing multireader ROC studies that generalize results to both the reader and case populations. Two of these methods - the Dorfman-Berbaum-Metz (DBM) method and the Obuchowski-Rockette (OR) method - appear to be quite different in their original formulations. However, recently it has been shown that the DBM and OR procedures yield the same test statistic when based on the same accuracy measure and covariance estimation method, but inferences can vary depending on which denominator degrees of freedom (ddf) method, DBM or OR, is used. I show in simulations that there are problems with both ddf methods: OR is ultraconservative with significance levels considerably below the nominal level, and DBM can result in extremely wide confidence intervals because the ddf can be close to zero. I propose a new ddf method that overcomes both of these problems and can be used with either the OR or DBM procedure.
Keywords: Receiver operating characteristic (ROC) curve, corrected F, diagnostic radiology, degrees of freedom
1 INTRODUCTION
Receiver operating characteristic (ROC) curve analysis is a well-established method for evaluating and comparing the performance of diagnostic tests [1-5]. There are several methods available for analyzing multireader ROC studies that generalize results to both the reader and case populations. Five such methods have recently been described and compared in an analysis of three data sets by Obuchowski et al [6]. The study shows that the methods can lead to different conclusions, illustrating the need for more theoretical and empirical investigation of the methods to provide insight and guidance regarding the appropriateness of each procedure in a given situation.
Two of these methods - the Dorfman-Berbaum-Metz (DBM) method [7, 8] and the Obuchowski-Rockette (OR) method [9, 10] - appear to be quite different in their original formulations. However, recently it has been shown by Hillis et al [11] that the DBM and OR procedures yield the same test statistic when based on the same accuracy measure and covariance estimation method, but inferences depend on which denominator degrees of freedom (ddf) method, DBM or OR, is used.
In simulations I find that there are problems with both ddf methods: OR is ultraconservative with significance levels considerably below the nominal level while DBM, though much closer to the nominal significance level, sometimes results in extremely wide confidence intervals because the ddf can be close to zero. I propose using a new ddf estimator that overcomes these problems and can be used with either the DBM or OR procedure.
The outline of the paper is as follows. I describe the OR method in Section 2. In Section 3 I derive the new ddf from the OR model, discuss the rationale for the original OR ddf, and compare the new ddf with the DBM ddf. In Section 4 I show how the new ddf can be derived from the DBM model and discuss the possibility that the DBM ddf may be based on an inadequate Satterthwaite approximation. In Section 5 I discuss confidence intervals for test differences, inferences for single tests, and fixed readers analysis. In Section 6 I compare the performance of the new ddf method with the OR and DBM ddf methods in a simulation study. I illustrate and compare the three ddf methods using a previously published data set in Section 7 and make concluding remarks in the final section.
2 THE OBUCHOWSKI-ROCKETTE (OR) METHOD
2.1 Design and notation
A commonly used study design in radiology that allows generalization to both the reader and case populations is the test × reader × case factorial design, where each case (i.e., patient) undergoes each diagnostic test (e.g., CT scan, MRI, or X-ray) and the resulting images are evaluated once by each reader (usually a radiologist). Typically the number of cases is 25-200 while the number of readers is 3-15. Throughout I assume that the data have been collected using this factorial design.
Let Z_ijk denote the rating assigned to the kth case by the jth reader using the ith modality. For example, a five-level ordinal integer scale is commonly used, with higher rating values indicating a higher level of confidence that the case is diseased; thus a rating value of 1 might correspond to “definitely not diseased” and 5 to “definitely diseased,” with other values corresponding to intermediate assessments. Alternatively, a quasi-continuous 0% to 100% confidence scale is often used. The observed data consist of the Z_ijk, with i = 1,..., t, j = 1,..., r, k = 1,..., c, where t is the number of tests (or modalities), r the number of readers, and c the number of cases. In addition, each case is classified as diseased or nondiseased according to an available reference standard.
2.2 Model and test statistic
The corrected F test was proposed by Obuchowski and Rockette [9]. Let θ̂_ij denote the AUC estimate (or other accuracy estimate) for the ith test and jth reader. Their approach is to use a test × reader ANOVA model for the AUCs, but unlike a conventional ANOVA model they allow the errors to be correlated to account for correlation in the AUCs due to each reader evaluating the same cases. Thus their two-way ANOVA model corresponds to the three-way study design. For the factorial design with only one replication, the corrected F test model, which I refer to as the OR model, can be written as

θ̂_ij = μ + τ_i + R_j + (τR)_ij + ε_ij,    (1)
i = 1,..., t, j = 1,..., r, where τ_i denotes the fixed effect of test i, R_j denotes the random effect of reader j, (τR)_ij denotes the random test × reader interaction, and ε_ij is the error term. The R_j and (τR)_ij are assumed to be mutually independent and normally distributed with zero means and variances σ²_R, reflecting differences in reader ability, and σ²_TR, reflecting test-by-reader interaction. The ε_ij are assumed to be normally distributed with zero mean and constant variance σ²_ε, which represents variability attributable to cases and within-reader variability that describes how a reader interprets the same image in different ways on different occasions. The ε_ij are independent of the R_j and (τR)_ij. However, since the same cases are read by each reader using each test, the ε_ij are not assumed to be independent. Instead, equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances given by

Cov(ε_ij, ε_i'j) = Cov1 (different tests, same reader; i ≠ i'),
Cov(ε_ij, ε_ij') = Cov2 (same test, different readers; j ≠ j'),
Cov(ε_ij, ε_i'j') = Cov3 (different tests, different readers; i ≠ i', j ≠ j').
It follows from model (1) that σ²_ε, Cov1, Cov2, and Cov3 are also the variance and corresponding covariances of the AUC estimates, conditional on the reader and test×reader effects. Obuchowski and Rockette [9] suggest the following ordering for the covariances:

Cov1 ≥ Cov2 ≥ Cov3 ≥ 0.    (2)
I only consider the factorial design with one replication; however, my results will also apply when there are multiple replications.
The corrected F statistic for testing the null hypothesis of no test effect (H0: τ1 = τ2 = ... = τt) is given by
F* = MS(T) / [MS(T*R) + r(Cov2 − Cov3)],    (3)

where

MS(T) = r(t−1)^{-1} Σ_i (θ̂_i· − θ̂_··)²  and  MS(T*R) = [(t−1)(r−1)]^{-1} Σ_i Σ_j (θ̂_ij − θ̂_i· − θ̂_·j + θ̂_··)².

A subscript replaced by a dot indicates that values are averaged across the missing subscript; for example, θ̂_i· = r^{-1} Σ_j θ̂_ij and θ̂_·· = (tr)^{-1} Σ_i Σ_j θ̂_ij. Obuchowski and Rockette [9] state that F* has an approximate F_{(t−1),ddfO} null distribution, where

ddfO = (t−1)(r−1).    (4)
Throughout, the ddf subscript denotes the first author of the paper that proposed the ddf method: O = Obuchowski, D = Dorfman, H = Hillis.
In practice we do not know Cov2 or Cov3 and have to estimate them from the data or use estimates from previous studies. Thus the statistic actually used is
F_OR = MS(T) / {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)},    (5)

where Ĉov2 and Ĉov3 denote estimates for Cov2 and Cov3, respectively. Note that equation (5) incorporates constraint (2) by setting Ĉov2 − Ĉov3 to zero if it is negative. Since Cov2 and Cov3 are also the corresponding covariances of the AUC estimates conditional on the reader and test×reader effects, they can be estimated using ROC analysis methods that treat cases as random but readers as fixed. For example, for trapezoidal-rule AUC estimates [5], Obuchowski and Rockette [9] suggest estimating σ²_ε, Cov1, Cov2, and Cov3 by averaging corresponding variances and covariances computed using the method of DeLong et al [12], which provides covariance and variance estimates treating only cases as random. More generally, any acceptable method for estimating the AUC variance and covariances that treats readers as fixed can be used, such as jackknifing or bootstrapping. The OR estimates obtained from averaging the corresponding fixed-reader AUC variances and covariances are denoted by σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3.
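As a concrete illustration of this estimation scheme, the sketch below computes trapezoidal-rule AUCs for each test-reader combination and the averaged jackknife estimates σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 from a ratings array. It is an illustrative reimplementation in the notation of this section (the function names and data layout are my own), not the software used for the analyses in this paper.

```python
import numpy as np

def auc_trapezoid(nondiseased, diseased):
    """Trapezoidal-rule (Mann-Whitney) AUC: P(Y > X) + 0.5 P(Y = X)."""
    diff = np.subtract.outer(np.asarray(diseased, float),
                             np.asarray(nondiseased, float))
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def or_jackknife_estimates(Z, truth):
    """Z: ratings of shape (t, r, c); truth: length-c 0/1 disease labels.
    Returns the t x r AUC matrix and the averaged OR jackknife estimates
    (var_e, cov1, cov2, cov3) of sigma^2_eps, Cov1, Cov2, Cov3."""
    t, r, c = Z.shape
    truth = np.asarray(truth)
    auc = np.empty((t, r))
    auc_loo = np.empty((t, r, c))              # leave-one-case-out AUCs
    for i in range(t):
        for j in range(r):
            x = Z[i, j, truth == 0]
            y = Z[i, j, truth == 1]
            auc[i, j] = auc_trapezoid(x, y)
            for k in range(c):
                keep = np.arange(c) != k
                zk, tk = Z[i, j, keep], truth[keep]
                auc_loo[i, j, k] = auc_trapezoid(zk[tk == 0], zk[tk == 1])
    # Jackknife covariance between estimates (i,j) and (i',j'):
    # (c-1)/c times the sum over k of products of leave-one-out deviations.
    d = auc_loo - auc_loo.mean(axis=2, keepdims=True)
    cov = (c - 1) / c * np.einsum('ijk,pqk->ijpq', d, d)
    var_e = np.mean([cov[i, j, i, j] for i in range(t) for j in range(r)])
    cov1 = np.mean([cov[i, j, p, j] for i in range(t) for p in range(t)
                    for j in range(r) if i != p])
    cov2 = np.mean([cov[i, j, i, q] for i in range(t) for j in range(r)
                    for q in range(r) if j != q])
    cov3 = np.mean([cov[i, j, p, q] for i in range(t) for p in range(t)
                    for j in range(r) for q in range(r) if i != p and j != q])
    return auc, var_e, cov1, cov2, cov3
```

The returned Ĉov2 and Ĉov3 can be plugged directly into the denominator of F_OR in equation (5).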
3 THE PROPOSED DDF
In this section I show that F* has an approximate F_{(t−1),df2} null distribution, where

df2 = {E[MS(T*R)] + r(Cov2 − Cov3)}² / ( {E[MS(T*R)]}² / [(t−1)(r−1)] ).    (6)

In deriving (6) I treat Cov2 and Cov3 as known. An estimate of df2 that incorporates constraint (2) is given by

ddfH = {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)}² / ( [MS(T*R)]² / [(t−1)(r−1)] ).    (7)

Thus I propose using ddfH as the ddf for F_OR, since F_OR approximates F*. I also discuss why Obuchowski and Rockette suggested (t−1)(r−1) as the ddf.
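The contrast between the two ddfs can be seen numerically. The following minimal sketch (hypothetical helper names; the numeric values are chosen only for illustration) computes both ddfs and illustrates that ddfH ≥ (t−1)(r−1) = ddfO, with equality when Ĉov2 ≤ Ĉov3.

```python
def ddf_obuchowski(t, r):
    """ddfO, equation (4)."""
    return (t - 1) * (r - 1)

def ddf_hillis(ms_tr, cov2_hat, cov3_hat, t, r):
    """ddfH, equation (7): grows as r(Cov2-Cov3) grows relative to MS(T*R)."""
    num = (ms_tr + r * max(cov2_hat - cov3_hat, 0.0)) ** 2
    return num / (ms_tr ** 2 / ((t - 1) * (r - 1)))
```

For example, with t = 2, r = 5, MS(T*R) = 0.001, and Ĉov2 − Ĉov3 = 0.0004, ddfH = (0.001 + 0.002)²/(0.001²/4) = 36, versus ddfO = 4.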
3.1 Derivation of ddfH
The expected values for MS(T), MS(R), and MS(T*R) are derived in the Appendix and given in Table 1, where σ²_τ = (t−1)^{-1} Σ_i (τ_i − τ̄)². It follows from Table 1 that

E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) + rσ²_τ,    (8)

so that E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) under H0, where H0 is the null hypothesis of no test effect (H0: τ1 = ... = τt = 0). Thus the numerator and denominator of F* in equation (3) have the same expected value under H0, while the numerator has a larger expected value if H0 does not hold; hence F* is an appropriate ANOVA test statistic. [ Table 1 ]
Table 1.
Expected mean squares for the Obuchowski-Rockette model.
Mean square | Expected mean square |
---|---|
MS(T) | σ²_TR + σ²_ε − Cov1 + (r−1)(Cov2 − Cov3) + rσ²_τ |
MS(R) | tσ²_R + σ²_TR + σ²_ε + (t−1)Cov1 − Cov2 − (t−1)Cov3 |
MS(T*R) | σ²_TR + σ²_ε − Cov1 − Cov2 + Cov3 |
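The expected mean squares in Table 1 can be checked by direct Monte Carlo simulation of model (1). The sketch below uses an assumed, purely illustrative set of parameter values (σ²_R = 0.1, σ²_TR = 0.05, σ²_ε = 1, Cov1 = 0.5, Cov2 = 0.3, Cov3 = 0.2) under H0, for which both E[MS(T)] and E[MS(T*R)] + r(Cov2 − Cov3) equal 0.85.

```python
import numpy as np

rng = np.random.default_rng(0)
t, r, n = 2, 4, 20000
s2_r, s2_tr, s2_e = 0.1, 0.05, 1.0
cov1, cov2, cov3 = 0.5, 0.3, 0.2

# Error covariance matrix implied by the equi-covariance assumption,
# with the (i, j) cell flattened to index i*r + j.
a = s2_e - cov1 - cov2 + cov3
Sigma = (a * np.eye(t * r)
         + (cov1 - cov3) * np.kron(np.ones((t, t)), np.eye(r))
         + (cov2 - cov3) * np.kron(np.eye(t), np.ones((r, r)))
         + cov3 * np.ones((t * r, t * r)))

eps = rng.multivariate_normal(np.zeros(t * r), Sigma, size=n).reshape(n, t, r)
theta = (rng.normal(0, np.sqrt(s2_r), (n, 1, r))     # reader effects R_j
         + rng.normal(0, np.sqrt(s2_tr), (n, t, r))  # test x reader effects
         + eps)                                      # all tau_i = 0, i.e., H0

row = theta.mean(axis=2, keepdims=True)
col = theta.mean(axis=1, keepdims=True)
g = theta.mean(axis=(1, 2), keepdims=True)
ms_t = r * ((row - g) ** 2).sum(axis=(1, 2)) / (t - 1)
ms_tr = ((theta - row - col + g) ** 2).sum(axis=(1, 2)) / ((t - 1) * (r - 1))

# Both sides of the H0 identity (8) should be close to 0.85 for these values.
print(ms_t.mean(), ms_tr.mean() + r * (cov2 - cov3))
```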
To show that F* has an approximate F_{(t−1),df2} null distribution with df2 defined by equation (6), I first consider an approximation for the distribution of its denominator, MS(T*R) + r(Cov2 − Cov3). Let Xm denote a random variable with a χ²_m/m distribution that is independent of MS(T*R). Then E(Xm) = 1 and Xm converges to 1 in probability as m → ∞. By Slutsky's theorem the limiting distribution of MS(T*R) + r(Cov2 − Cov3)Xm as m → ∞ is equal to the distribution of MS(T*R) + r(Cov2 − Cov3). My approach is to approximate the distribution of MS(T*R) + r(Cov2 − Cov3) by the limiting distribution of an approximate distribution for MS(T*R) + r(Cov2 − Cov3)Xm as m → ∞.
I show in the Appendix that MS(T) and MS(T*R) are independently distributed, that (t−1)(r−1)MS(T*R)/E[MS(T*R)] ~ χ²_{(t−1)(r−1)}, and that (t−1)MS(T)/{E[MS(T*R)] + r(Cov2 − Cov3)} ~ χ²_{t−1} under H0. Satterthwaite [13, 14] has shown that a linear function of independent random variables, Σ_l a_l·MS_l, where df_l·MS_l/E(MS_l) ~ χ²_{df_l}, is approximately distributed as {Σ_l a_l·E(MS_l)}·χ²_{df}/df, where

df = {Σ_l a_l·E(MS_l)}² / Σ_l {[a_l·E(MS_l)]² / df_l}.    (9)
Using Satterthwaite's procedure with a1 = 1, MS1 = MS(T*R), a2 = r(Cov2 − Cov3), and MS2 = Xm, whose degrees of freedom are (t−1)(r−1) and m, respectively, it follows that MS(T*R) + r(Cov2 − Cov3)Xm is approximately distributed as

{E[MS(T*R)] + r(Cov2 − Cov3)} χ²_{df_S(m)} / df_S(m),    (10)

where df_S(m) is given by

df_S(m) = {E[MS(T*R)] + r(Cov2 − Cov3)}² / ( {E[MS(T*R)]}²/[(t−1)(r−1)] + [r(Cov2 − Cov3)]²/m ).

Since lim_{m→∞} df_S(m) = df2, where df2 is defined by equation (6), it follows that the limiting distribution of distribution (10) as m → ∞ is

{E[MS(T*R)] + r(Cov2 − Cov3)} χ²_{df2} / df2.    (11)
Thus the distribution of the denominator of F *, MS(T*R)+r (Cov2-Cov3), is approximated by distribution (11).
Since MS(T) and MS(T*R) + r(Cov2 − Cov3) are independent, (t−1)MS(T)/{E[MS(T*R)] + r(Cov2 − Cov3)} ~ χ²_{t−1} under H0, MS(T*R) + r(Cov2 − Cov3) is approximately distributed as {E[MS(T*R)] + r(Cov2 − Cov3)}χ²_{df2}/df2, and E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) under H0, it follows that the null distribution of F* = MS(T)/[MS(T*R) + r(Cov2 − Cov3)] is approximated by that of

[U/(t−1)] / [V/df2],

where U ~ χ²_{t−1}, V ~ χ²_{df2}, and U and V are independent; that is, F* has an approximate F_{(t−1),df2} null distribution. Since F_OR approximates F* and ddfH (7) estimates df2, I propose using ddfH as the ddf for F_OR.
In summary, approximating the distribution of F_OR involves three distinct approximations. First, F_OR is approximated by F*. The adequacy of this approximation depends on how well Ĉov2 − Ĉov3 approximates Cov2 − Cov3. Throughout, I assume that the number of cases is moderate or large, as is typical for ROC studies; thus it is reasonable to expect that this approximation will be adequate, since the precision of the covariance estimates increases with the number of cases. Second, the distribution of MS(T*R) + r(Cov2 − Cov3) is approximated by the distribution of MS(T*R) + r(Cov2 − Cov3)Xm for large m, where Xm ~ χ²_m/m, so that the Satterthwaite approximation can be applied to the denominator of F_OR; this approximation is justified by Slutsky's theorem and the law of large numbers. Finally, the distribution of MS(T*R) + r(Cov2 − Cov3)Xm for large m is approximated using the Satterthwaite procedure to give distribution (11), with df2 then estimated by ddfH. This approximation depends on the adequacy of the Satterthwaite approximation, with Cov2 − Cov3 approximated by Ĉov2 − Ĉov3 and E[MS(T*R)] approximated by MS(T*R). I expect this approximation to be adequate since the Satterthwaite approximation has been shown to perform acceptably for sums of mean squares, with expected mean squares estimated by observed mean squares when estimating the degrees of freedom, even when the mean square degrees of freedom are low [13, 14]. In the simulation study I empirically investigate the performance of ddfH.
3.2 Rationale for ddfO
To understand why Obuchowski and Rockette [9] suggested using (t − 1)(r − 1) as the ddf, I now reexamine their argument. Based upon the work of Pavur and Nath [15], they note that under H0

F(1 + ρ)^{−1}    (12)

is distributed as F_{(t−1),(t−1)(r−1)}, where F = MS(T)/MS(T*R) and

ρ = r(Cov2 − Cov3)/E[MS(T*R)].

They then replace ρ in equation (12) by the estimate

ρ̂ = r·max(Ĉov2 − Ĉov3, 0)/V̂,

where

V̂ = MS(T*R),    (13)

to derive the OR F statistic (5).
Although replacing ρ by a high-precision estimate should not appreciably alter the distribution of (12), the variance estimate (13) lacks precision for the typical study with only a few tests and readers, since MS(T*R) has only (t−1)(r−1) degrees of freedom. Furthermore, the estimate of ρ is correlated with F, and this correlation needs to be taken into account because of the lack of precision in the estimate. Thus ddfO does not adequately describe the distribution of the F_OR statistic (5) for typical studies, because it accounts for neither the lack of precision in estimating ρ nor the correlation between the estimate of ρ and F.
4 DDFH AND THE DBM PROCEDURE
4.1 The DBM procedure
The DBM method proposed by Dorfman et al [7] for analyzing multireader ROC studies also generalizes results to both the reader and case populations. For this method AUC pseudovalues are computed using the Quenouille-Tukey jackknife [16-18] separately for each reader-test combination. Let Y_ijk denote the AUC pseudovalue for test i, reader j, and case k; by definition

Y_ijk = c·θ̂_ij − (c − 1)·θ̂_ij(k),

where θ̂_ij denotes the AUC estimate based on all of the data for the ith test and jth reader and θ̂_ij(k) denotes the AUC estimate when data for the kth case are omitted. Using the Y_ijk as the responses, the DBM procedure specifies testing for a test effect using a fully crossed three-factor ANOVA, with test treated as a fixed factor and reader and case as random factors. The DBM accuracy estimate for test i and reader j is Ȳ_ij·, which is the jackknife accuracy estimate corresponding to θ̂_ij.
Recently Hillis et al [11] generalize the DBM method by showing how it can be used with normalized pseudovalues and quasi pseudovalues: normalized pseudovalues are defined by Y′_ijk = Y_ijk + (θ̂_ij − Ȳ_ij·), and quasi pseudovalues are defined as any values such that the resulting test-reader sample means, variances, and covariances are identical to the θ̂_ij and Σ̂, where Σ̂ is the estimated fixed-reader covariance matrix used to compute the OR procedure quantities σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3. For normalized and quasi pseudovalues the DBM accuracy estimate, given by Ȳ_ij·, is equal to θ̂_ij. They show that using DBM with normalized pseudovalues yields the same test statistic for testing for a modality effect as using OR with jackknife covariance estimates, and more generally, using DBM with quasi pseudovalues yields the same test statistic as OR for other covariance estimation methods. However, even when DBM and OR yield the same test statistic, inferences depend on which ddf method, DBM or OR, is used. From here on I assume, when comparing the DBM and OR procedures, that the pseudovalues are the normalized pseudovalues if Ĉov2 and Ĉov3 are based on jackknife covariance estimates, and are quasi pseudovalues if Ĉov2 and Ĉov3 are based on another fixed-reader covariance estimation method such as that of DeLong et al [12].
Let MS(T)pseudo, MS(T*R)pseudo, MS(T*C)pseudo, and MS(T*R*C)pseudo denote the test, test×reader, test×case, and test×reader×case mean squares for the DBM three-way ANOVA of the pseudovalues. The DBM F statistic for testing the null hypothesis of no test effect is

F_DBM = MS(T)pseudo / {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}.    (14)
Hillis et al [11] show that F_DBM = F_OR by showing that

MS(T)pseudo = c·MS(T)  and  MS(T*R)pseudo = c·MS(T*R),    (15)

and

MS(T*C)pseudo − MS(T*R*C)pseudo = c·r(Ĉov2 − Ĉov3).    (16)

The result follows by substitution in (14). They also show that the DBM model implies the OR model, and they give the one-to-one mapping between the OR and DBM model parameters.
This form (14) of the DBM F statistic utilizes the same constraint as F_OR (5), since equation (16) implies that the constraint MS(T*C)pseudo − MS(T*R*C)pseudo ≥ 0, utilized in the denominator of F_DBM (14), is equivalent to the constraint employed in F_OR. Although Dorfman et al [7, 8] suggest also constraining MS(T*R)pseudo − MS(T*R*C)pseudo to be nonnegative, References [11, 19] discuss conceptual reasons for not using this constraint; furthermore, simulations [20] show that performance is better when only MS(T*C)pseudo − MS(T*R*C)pseudo is constrained to be nonnegative. Throughout this paper I use F_DBM as specified by equation (14).
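The OR-DBM equivalence can be checked numerically. The sketch below is illustrative code (my own helper names, with an arbitrary normal array standing in for pseudovalues): it computes F_DBM (14) from a pseudovalue array and F_OR (5) from the corresponding accuracy estimates Ȳ_ij· together with jackknife covariance estimates, which for pseudovalues equal the test-reader sample covariances divided by c; the two statistics agree up to rounding error.

```python
import numpy as np

def dbm_f(Y):
    """F_DBM (14) from a pseudovalue array Y of shape (t, r, c)."""
    t, r, c = Y.shape
    m_ij, m_ik, m_jk = Y.mean(2), Y.mean(1), Y.mean(0)
    m_i, m_j, m_k = Y.mean((1, 2)), Y.mean((0, 2)), Y.mean((0, 1))
    g = Y.mean()
    ms_t = r * c * np.sum((m_i - g) ** 2) / (t - 1)
    ms_tr = c * np.sum((m_ij - m_i[:, None] - m_j[None, :] + g) ** 2) \
        / ((t - 1) * (r - 1))
    ms_tc = r * np.sum((m_ik - m_i[:, None] - m_k[None, :] + g) ** 2) \
        / ((t - 1) * (c - 1))
    resid = (Y - m_ij[:, :, None] - m_ik[:, None, :] - m_jk[None, :, :]
             + m_i[:, None, None] + m_j[None, :, None] + m_k[None, None, :] - g)
    ms_trc = np.sum(resid ** 2) / ((t - 1) * (r - 1) * (c - 1))
    return ms_t / (ms_tr + max(ms_tc - ms_trc, 0.0))

def or_f_from_pseudovalues(Y):
    """F_OR (5) using theta_ij = Ybar_ij. and jackknife covariances, which
    equal the pseudovalue sample covariances divided by c."""
    t, r, c = Y.shape
    theta = Y.mean(2)
    row, col, g = theta.mean(1, keepdims=True), theta.mean(0, keepdims=True), theta.mean()
    ms_t = r * np.sum((row - g) ** 2) / (t - 1)
    ms_tr = np.sum((theta - row - col + g) ** 2) / ((t - 1) * (r - 1))
    e = Y - theta[:, :, None]
    S = np.einsum('ijk,pqk->ijpq', e, e) / (c - 1)  # pseudovalue covariances
    cov2 = np.mean([S[i, j, i, q] for i in range(t)
                    for j in range(r) for q in range(r) if j != q]) / c
    cov3 = np.mean([S[i, j, p, q] for i in range(t) for p in range(t)
                    for j in range(r) for q in range(r)
                    if i != p and j != q]) / c
    return ms_t / (ms_tr + r * max(cov2 - cov3, 0.0))

rng = np.random.default_rng(1)
Y = rng.normal(size=(2, 4, 30))
print(dbm_f(Y), or_f_from_pseudovalues(Y))  # identical up to rounding
```

Because the agreement is an algebraic identity, it holds for any array, not just genuine jackknife pseudovalues.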
4.2 Comparison of ddfH, ddfO, and ddfD
The DBM numerator and denominator degrees of freedom for the null distribution of F_DBM (14) are t−1 and ddfD, respectively, where

ddfD = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]²/[(t−1)(r−1)] + [MS(T*C)pseudo]²/[(t−1)(c−1)] + [MS(T*R*C)pseudo]²/[(t−1)(r−1)(c−1)] }.    (17)
Hillis et al [11] show that ddfD has the following form when expressed in terms of the OR mean squares and covariance estimates:
(18) |
Equations (17) and (18) yield the same value for ddfD.
Comparing equations (7) and (18) we see that

ddfD ≤ ddfH.    (19)
This relationship is intuitive, since in deriving ddfH I treat σ2, Cov1, Cov2, and Cov3 as known while DBM treats them (or equivalently, the corresponding DBM model variance components) as unknown; thus the uncertainty from estimating them is manifested in the lower DBM degrees of freedom.
It follows from equation (7) that

ddfH ≥ (t−1)(r−1) = ddfO,

with equality attained if and only if Ĉov2 − Ĉov3 ≤ 0, showing that the ddfO method is more conservative than the ddfH method. There is not a similar relationship between ddfO and ddfD: ddfD can be larger or smaller than ddfO. Table 2 presents the ddf formulas in terms of the OR model in part (a) and the ddf relationships in part (b). [ Table 2 ]
Table 2.
Denominator degrees of freedom summary.
a) In terms of the OR mean squares and covariance estimates: |
ddfO = (t-1)(r-1) |
ddfH = {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)}² / { [MS(T*R)]² / [(t-1)(r-1)] } |
ddfD: given by equation (18) |
b) Relationships: |
ddfD ≤ ddfH |
ddfH ≥ ddfO, with equality if and only if Ĉov2 − Ĉov3 ≤ 0 |
ddfD can be larger or smaller than ddfO |
c) In terms of the DBM mean squares: |
ddfO = (t-1)(r-1) |
ddfH = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]² / [(t-1)(r-1)] } |
ddfD: given by equation (17) |
4.3 ddfH in terms of the DBM mean squares
I can also express ddfH in terms of the DBM analysis mean squares. It follows from equations (7) and (15)-(16) that

ddfH = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]² / [(t−1)(r−1)] }.    (20)
Part (c) of Table 2 presents the ddf formulas expressed in terms of the DBM model. Note that the relationships in part (b), which we previously derived from the part (a) formulas, can also be derived from the equivalent part (c) formulas.
Alternatively, I can derive ddfH in form (20) directly from the DBM model in the following way. It is shown by Hillis et al [11] that, under the assumptions of the DBM model, MS(T*C)pseudo and MS(T*R*C)pseudo are independently distributed with

(t−1)(c−1)MS(T*C)pseudo/E[MS(T*C)pseudo] ~ χ²_{(t−1)(c−1)}  and  (t−1)(r−1)(c−1)MS(T*R*C)pseudo/E[MS(T*R*C)pseudo] ~ χ²_{(t−1)(r−1)(c−1)},    (21)

with E[MS(T*C)pseudo] − E[MS(T*R*C)pseudo] = rσ²_TC, where σ²_TC is the test×case variance component for the DBM model. Define

F̃ = MS(T)pseudo / {MS(T*R)pseudo + rσ²_TC}.

For typical ROC studies (c > 25) the degrees of freedom will be at least 25 for each mean square in equation (21); thus MS(T*C)pseudo − MS(T*R*C)pseudo should approximate rσ²_TC reasonably well, implying

F_DBM ≈ F̃.

My approach is to show that an approximate ddf for F̃, and hence for F_DBM, is given by equation (20).
Under the DBM model assumptions (conventional three-way ANOVA model assumptions for the pseudovalues) the numerator and denominator of F̃ have the same null expected values, MS(T)pseudo and MS(T*R)pseudo are independently distributed,

(t−1)(r−1)MS(T*R)pseudo/E[MS(T*R)pseudo] ~ χ²_{(t−1)(r−1)},

and

(t−1)MS(T)pseudo/{E[MS(T*R)pseudo] + rσ²_TC} ~ χ²_{t−1} under H0.

Using the same argument given in Section 3.1 but with MS(T) replaced by MS(T)pseudo, MS(T*R) replaced by MS(T*R)pseudo, and r(Cov2 − Cov3) replaced by rσ²_TC, it follows that F̃ has an approximate F_{t−1,df2} null distribution, with

df2 = {E[MS(T*R)pseudo] + rσ²_TC}² / ( {E[MS(T*R)pseudo]}² / [(t−1)(r−1)] ).

Replacing E[MS(T*R)pseudo] and rσ²_TC by their estimates, MS(T*R)pseudo and max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0], respectively, yields equation (20).
4.4 Inadequate DBM Satterthwaite approximation
When MS(T*C)pseudo − MS(T*R*C)pseudo > 0, ddfD in equation (17) is the Satterthwaite approximation (9) for MS(T*R)pseudo + MS(T*C)pseudo − MS(T*R*C)pseudo. Satterthwaite [13] remarks that caution should be used when applying formula (9) to a linear function of mean squares when some of the coefficients are negative, and Gaylor and Hopper [21] provide guidelines for determining if the Satterthwaite approximation is valid in this situation. We will see in the simulation study that an inadequate Satterthwaite approximation can cause ddfD to approach zero, resulting in an extremely wide confidence interval.
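A small numeric illustration, assuming the Satterthwaite form of ddfD in equation (17) and purely hypothetical mean-square values, shows how large but nearly equal MS(T*C)pseudo and MS(T*R*C)pseudo drive ddfD toward zero while ddfH stays at (t−1)(r−1):

```python
def ddf_dbm(ms_tr, ms_tc, ms_trc, t, r, c):
    # Satterthwaite ddf for MS(T*R)+MS(T*C)-MS(T*R*C) (pseudovalue scale),
    # with the nonnegativity constraint of equation (14) applied.
    num = (ms_tr + max(ms_tc - ms_trc, 0.0)) ** 2
    den = (ms_tr ** 2 / ((t - 1) * (r - 1))
           + ms_tc ** 2 / ((t - 1) * (c - 1))
           + ms_trc ** 2 / ((t - 1) * (r - 1) * (c - 1)))
    return num / den

def ddf_hillis_dbm(ms_tr, ms_tc, ms_trc, t, r):
    # Equation (20): same numerator, but only the MS(T*R) term below.
    num = (ms_tr + max(ms_tc - ms_trc, 0.0)) ** 2
    return num / (ms_tr ** 2 / ((t - 1) * (r - 1)))

# Large, nearly equal MS(T*C) and MS(T*R*C) (hypothetical values):
print(ddf_dbm(0.5, 100.0, 100.0, 2, 5, 100))    # close to zero
print(ddf_hillis_dbm(0.5, 100.0, 100.0, 2, 5))  # stays at (t-1)(r-1) = 4
```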
5 SPECIAL CASES
5.1 Confidence interval for the difference of two tests
Define θi to be the expected accuracy estimate for test i; that is, θi = E(θ̂_ij). I have shown that F_OR, and hence also F_DBM, has an approximate F_{(t−1),ddfH} distribution; it follows that for t = 2 an approximate (1−α)100% confidence interval for θi − θj is given by

θ̂_i· − θ̂_j· ± t_{α/2; ddfH} √(2·MSdenOR/r),    (22)

where MSdenOR is the denominator of F_OR (5), or equivalently by

Ȳ_i·· − Ȳ_j·· ± t_{α/2; ddfH} √(2·MSdenDBM/(rc)),    (23)

where MSdenDBM is the denominator of F_DBM (14). It can be shown, using the approach of Section 3.1, that equations (22)-(23) are also valid when t > 2.
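Interval (22) requires a t quantile with the generally noninteger ddfH degrees of freedom. The sketch below computes the interval half-width from the OR quantities, using a small numerical-inversion routine for the t quantile so that only numpy and the standard library are needed; this is illustrative code, and in a real analysis a statistical library's t quantile function would normally be used instead.

```python
import math
import numpy as np

def t_quantile(p, df):
    """Quantile of Student's t (p in (0.5, 1), df > 0.5) by bisection on a
    numerically integrated t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) \
        / math.sqrt(df * math.pi)

    def cdf(x):
        u = np.linspace(0.0, x, 4001)
        f = const * (1.0 + u ** 2 / df) ** (-(df + 1) / 2)
        dx = u[1] - u[0]
        return 0.5 + dx * (f.sum() - 0.5 * (f[0] + f[-1]))  # trapezoid rule

    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def or_ci_halfwidth(ms_den_or, r, ddf_h, alpha=0.05):
    """Half-width of confidence interval (22) for theta_i - theta_j."""
    return t_quantile(1 - alpha / 2, ddf_h) * math.sqrt(2 * ms_den_or / r)
```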
5.2 Single test inference using all of the data
The results in this and the following section are presented without proof, but they can be derived using an approach similar to that used in Section 3.1. Define
An approximate (1-α)100% confidence interval for θi is given by
(24) |
and a test for H0: θi = θ0 can be made by comparing the corresponding t statistic to a t distribution with degrees of freedom ddfH_single. Equivalent expressions can be derived for the DBM procedure, but I omit those since I recommend instead the more robust approach described in the next section.
5.3 Single test inference using only corresponding data
A confidence interval and hypothesis test for θi can alternatively be based only on the data for the ith test. When we consider the data for test i, the OR model (1) reduces to a one-factor random effects ANOVA with the covariance for each pair of errors equal to Cov2. Since this single-test model makes no assumptions about the variances or covariances of the error terms corresponding to other tests, the resulting confidence interval should be more robust than confidence interval (24). Let Ĉov2^(i) denote the average of the fixed-reader covariance estimates for the ith test and define

MS(R)_i = (r−1)^{-1} Σ_j (θ̂_ij − θ̂_i·)²

and

ddfH,i = {MS(R)_i + r·max(Ĉov2^(i), 0)}² / ( [MS(R)_i]² / (r−1) ).

An approximate (1−α)100% confidence interval for θi is given by

θ̂_i· ± t_{α/2; ddfH,i} √{ [MS(R)_i + r·max(Ĉov2^(i), 0)] / r }.    (25)

A similar result is also given by Obuchowski and Rockette [9], but instead of ddfH,i they use r − 1; thus we see that ddfH,i yields a less conservative result, since ddfH,i ≥ r − 1.
Similarly, the DBM procedure reduces to a conventional reader×case ANOVA of pseudovalues when we consider only data for a single test. Let MS(R)_{i,pseudo}, MS(C)_{i,pseudo}, and MS(R*C)_{i,pseudo} denote the reader, case, and reader×case mean squares using only pseudovalues corresponding to the ith test. An equivalent expression for confidence interval (25) is given by

Ȳ_i·· ± t_{α/2; ddfH,i} √{ [MS(R)_{i,pseudo} + max(MS(C)_{i,pseudo} − MS(R*C)_{i,pseudo}, 0)] / (rc) },

where

MS(R)_{i,pseudo} = c·MS(R)_i

and

MS(C)_{i,pseudo} − MS(R*C)_{i,pseudo} = c·r·Ĉov2^(i).

A test for H0: θi = θ0 can be made by comparing either the OR-based or the equivalent DBM-based t statistic to a t distribution with degrees of freedom ddfH,i.
5.4 Fixed readers
If readers are treated as fixed for the OR model (1) then the only random effects are the error terms. Treating the variance and covariances of the error terms as known, the method of generalized least squares can be used, as discussed by Obuchowski and Rockette [9]. The test statistic, given by
has an approximate chi-squared distribution with t-1 degrees of freedom. Hillis et al [11] show that the DBM analysis, treating readers as fixed, will give the same test statistic when suitably normalized and approximately the same p-value for typical ROC studies. I see no problems with these methods in terms of the degrees of freedom.
6 SIMULATION STUDY
In a simulation study I compare ddfH, ddfO, and ddfD with respect to the empirical significance level for testing the null hypothesis of no test effect and with respect to the width of a 95% confidence interval for the difference of the test AUCs. The simulation model of Roe and Metz [22] provides continuous decision-variable outcomes generated from a binormal model that treats both cases and readers as random. I use this simulation model to simulate rating data, taking integer values from one to five, by transforming the continuous outcomes to discrete ratings using the same cutpoints as Dorfman et al [8]; the combinations of reader and case sample sizes, AUC values, and variance components are the same as those used in Roe and Metz [22] and Dorfman et al [8]. Briefly, rating data are simulated for 144 combinations of three reader-sample sizes (readers = 3, 5, and 10), four case-sample sizes (10+/90-, 25+/25-, 50+/50-, and 100+/100-, where “+” indicates a diseased case and “-” indicates a normal case), three AUC values (AUC = .702, .855, and .961), and four combinations of reader and case variance components. Two thousand samples are generated for each of the 144 combinations; within each sample, all Monte Carlo readers read the same cases for each of two tests. Since these are null simulations, the test effect in the model is set to zero. The simulation design is summarized in Table 3. [ Table 3 ]
Table 3.
Simulation study design.
Factor | Number of levels | Description of levels |
---|---|---|
reader-sample size | 3 | readers = 3, 5, 10 |
case-sample size | 4 | cases = 10+/90-, 25+/25-, 50+/50-, 100+/100- |
AUC | 3 | AUC = .702, .855, .961 |
variance components | 4 | HH, HL, LH, LL |
Notes: For each factor combination n = 2000 samples are simulated for t = 2 tests, with each reader reading the same cases using each test. “+”= diseased case, “-”= normal case. See Roe and Metz [22] for the definitions of HH, HL, LH, and LL.
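The structure of such a simulation can be sketched as follows: for test i, reader j, and case k, the continuous score is the sum of a truth-dependent separation parameter and independent normal reader, case, and interaction effects, after which the scores are binned into the five-point rating scale. This is a simplified stand-in for the Roe-Metz model, and the variance-component values below are hypothetical placeholders, not the HH/HL/LH/LL settings of Roe and Metz [22].

```python
import numpy as np

def simulate_roe_metz_like(t=2, r=5, n_pos=50, n_neg=50, delta=1.5,
                           s2_r=0.01, s2_c=0.1, s2_rc=0.2, s2_tc=0.1,
                           s2_e=0.59, cutpoints=(-0.5, 0.5, 1.5, 2.5),
                           rng=None):
    """Simulate ratings Z of shape (t, r, c) and 0/1 truth labels under a
    simplified binormal model with random reader and case effects (H0 holds:
    there is no test effect)."""
    rng = np.random.default_rng(rng)
    c = n_pos + n_neg
    truth = np.concatenate([np.zeros(n_neg, int), np.ones(n_pos, int)])
    mu = delta * truth                      # separation for diseased cases
    R = rng.normal(0, np.sqrt(s2_r), (1, r, 1))    # reader effects
    C = rng.normal(0, np.sqrt(s2_c), (1, 1, c))    # case effects
    RC = rng.normal(0, np.sqrt(s2_rc), (1, r, c))  # reader x case
    TC = rng.normal(0, np.sqrt(s2_tc), (t, 1, c))  # test x case
    E = rng.normal(0, np.sqrt(s2_e), (t, r, c))    # residual error
    score = mu[None, None, :] + R + C + RC + TC + E
    Z = 1 + np.digitize(score, cutpoints)          # discrete ratings 1..5
    return Z, truth
```

Each simulated sample would then be analyzed as described above to obtain F_OR and the three ddfs, with the empirical significance level given by the rejection proportion over the 2000 samples.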
In terms of the OR procedure, each sample is analyzed using two different methods for estimating the AUC and the quantities σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3. For one method I estimate the AUC using maximum likelihood estimation assuming a binormal model [23, 24] and obtain σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 using the jackknife method; for the other method I estimate the AUC using the trapezoidal-rule (trapezoid) method and obtain σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 using the DeLong method. These estimation methods are referred to as the MLE-jackknife and trapezoid-DeLong methods, respectively. For each method I compute F_OR, ddfH, ddfO, and ddfD using equations (5), (7), (4), and (18). The null hypothesis of no test effect is rejected if F_OR > F_{.05;1,ddf}. For each of the 144 combinations the empirical significance level is the proportion of samples for which the null hypothesis is rejected. For each simulation I also compute the width of the 95% confidence interval for the difference of the two test AUCs using equation (22) and report the mean width across the 144·2000 simulations.
Identical results can also be obtained using the DBM procedure, as discussed in Hillis et al [11]. MLE-jackknife results can be obtained by computing the normalized pseudovalues corresponding to the MLE AUC estimates, computing MS(T)pseudo, MS(T*R)pseudo, MS(T*C)pseudo, and MS(T*R*C)pseudo, computing F_DBM (14), which will be the same as F_OR, and then computing ddfO, ddfH, and ddfD using equations (4), (20), and (17). A 95% confidence interval for the difference of the two test AUCs is given by equation (23). Trapezoid-DeLong results can similarly be obtained using the DBM procedure with quasi pseudovalues, as described in Hillis et al [11]. I exploit these OR-DBM equivalence relationships by using the DBM procedure for the MLE-jackknife simulations and the OR procedure for the trapezoid-DeLong simulations. Simulations are performed using the IML procedure in SAS 9.1 [25] running under Windows XP. The MLE AUC pseudovalues are computed using a dynamic linked library (DLL), written in Fortran 90 by Don Dorfman and Kevin Schartz, that is accessed from within the IML procedure; this DLL is available on request.
The empirical significance levels from the simulation study are described in Table 4 and displayed in dot plots in Figure 1. The mean significance levels, with ranges indicated in parentheses, for ddfO, ddfH, and ddfD, respectively, are 0.018 (0.041), 0.051 (0.052), and 0.043 (0.050) for MLE-jackknife estimation, and 0.019 (0.047), 0.054 (0.051), and 0.047 (0.058) for trapezoid-DeLong estimation. Thus we see that the average significance levels for ddfH and ddfD are much closer to the nominal .05 level than that for ddfO, with ddfD slightly more conservative than ddfH in accord with relationship (19). The dot plots show the ultraconservative performance of ddfO: several significance levels are 0.00 and all of the significance levels are less than .05. For ddfH and ddfD the dot plots show bell-shaped distributions without outliers. For ddfH, 93% (134/144) of the significance levels are within the interval [0.03, 0.07] for both estimation methods, and the standard deviations of the significance levels, computed across the 144 combinations of design factors, are 0.011 (MLE-jackknife) and 0.010 (trapezoid-DeLong); for ddfD, 86% (124/144) and 92% (133/144) of the significance levels are within the interval [0.03, 0.07] for MLE-jackknife and trapezoid-DeLong estimation, respectively, and the standard deviations of the significance levels are 0.011 for both estimation methods. I conclude from the closeness of the mean significance level to the nominal level, the relatively small standard deviation, the high proportion of significance levels within .02 of the nominal level, and the absence of outliers that both ddfH and ddfD perform satisfactorily with respect to significance levels. [ Table 4 ] [ Figure 1 ]
Table 4.
Results of the simulation study for the 144 combinations of reader-sample size, case-sample size, AUC, and variance components.
Significance levels |
||||||||
---|---|---|---|---|---|---|---|---|
Estimation method | ddf method | N | Mean | Min | Max | Range | SD | CI width mean |
MLE-jackknife | O | 144 | 0.018 | 0.000 | 0.041 | 0.041 | 0.0144 | 0.231 |
H | 144 | 0.051 | 0.025 | 0.077 | 0.052 | 0.0105 | 0.173 | |
D | 144 | 0.043 | 0.017 | 0.067 | 0.050 | 0.0109 | 2.36E+121 | |
Trapezoid-DeLong | O | 144 | 0.019 | 0.000 | 0.047 | 0.047 | 0.0124 | 0.245 |
H | 144 | 0.054 | 0.032 | 0.082 | 0.051 | 0.0100 | 0.184 | |
D | 144 | 0.047 | 0.023 | 0.080 | 0.058 | 0.0105 | 2.74E+121 |
Notes: Min: minimum; Max: maximum; SD: standard deviation; CI width: width of a 95% confidence interval for the difference of the AUC estimates; MLE-jackknife: binormal maximum likelihood AUC estimation and jackknife covariance estimation; Trapezoid-DeLong: trapezoid AUC estimation and DeLong covariance estimation.
Figure 1.
Dot plots of the 144 empirical significance levels using MLE-jackknife and trapezoid-DeLong estimation with each ddf method. The nominal significance level is .05.
The extremely large mean confidence interval widths corresponding to ddfD in Table 4 (mean CI width: ddfD = 2.36E+121 vs. ddfH = 0.173 for MLE-jackknife estimation; ddfD = 2.74E+121 vs. ddfH = 0.184 for trapezoid-DeLong estimation) can be attributed to a small proportion of samples for which ddfD approaches zero. For example, ddfD ≤ 1 in 0.2% of the samples; when I set ddfD = 1 for these samples, the resulting mean confidence interval widths are comparable to (although somewhat larger than) those for ddfH. Each ddfD value less than one corresponds to an inadequate Satterthwaite ddf approximation, according to the adequacy guidelines provided by Gaylor and Hopper [21]. I conclude that although the ddfH and ddfD methods are comparable with respect to significance levels, ddfH performs better when confidence interval width is also considered.
Table 5 shows the mean empirical significance levels for ddfH by factor level for each of the four simulation study factors. The results suggest that high AUC (.961) is associated with mild conservatism, while low AUC (.702) and a low number of readers (3) are associated with mild liberalism. Accordingly, 17 of the 18 lowest empirical ddfH significance levels in Figure 1(a) and 18 of the lowest 19 empirical ddfH significance levels in Figure 1(b) correspond to AUC = .961; in contrast, 22 of the highest 23 values in Figure 1(a) and 21 of the 22 highest values in Figure 1(b) correspond either to AUC = .702 or readers = 3. [ Table 5 ]
Table 5.
Simulation study mean significance levels for ddfH by factor level.
| factor | factor level | N | MLE-jackknife (mean = 0.051) | Trapezoid-DeLong (mean = 0.054) |
|---|---|---|---|---|
| case-sample size | 10+/90- | 36 | 0.047 | 0.053 |
| | 25+/25- | 36 | 0.049 | 0.053 |
| | 50+/50- | 36 | 0.052 | 0.055 |
| | 100+/100- | 36 | 0.054 | 0.056 |
| reader-sample size | 3 | 48 | 0.056 | 0.060 |
| | 5 | 48 | 0.048 | 0.052 |
| | 10 | 48 | 0.048 | 0.051 |
| AUC | 0.702 | 48 | 0.057 | 0.060 |
| | 0.855 | 48 | 0.052 | 0.056 |
| | 0.961 | 48 | 0.043 | 0.047 |
| variance components | HH | 36 | 0.051 | 0.054 |
| | HL | 36 | 0.055 | 0.057 |
| | LH | 36 | 0.046 | 0.051 |
| | LL | 36 | 0.050 | 0.054 |
7 EXAMPLE
My example comes courtesy of Carolyn Van Dyke, MD. The study [26] compared the relative performance of single spin-echo magnetic resonance imaging (SE MRI) to cinematic presentation of MRI (CINE MRI) for the detection of thoracic aortic dissection. There were 45 patients with an aortic dissection and 69 patients without a dissection imaged with both SE MRI and CINE MRI. Five radiologists independently interpreted all of the images using a five-point ordinal scale: 1 = definitely no aortic dissection, 2 = probably no aortic dissection, 3 = unsure about aortic dissection, 4 = probably aortic dissection, and 5 = definitely aortic dissection.
The analysis of this study using trapezoid AUC estimates and DeLong covariance estimates is displayed in Table 6, and the empirical ROC curves are displayed in Figure 2. Although Table 6 displays only the analysis using the OR procedure, identical results can be obtained with the DBM procedure using quasi-pseudovalues. For testing H0: θ1 = θ2 we have FOR = 4.485, with ddfO = 4 (p = .102), ddfH = 15.07 (p = .051), and ddfD = 13.81 (p = .053). Thus ddfO and ddfH give somewhat different results, while ddfH and ddfD give similar results, with ddfH > ddfD in accord with relationship (19). [ Table 6 ] [ Figure 2 ]
Table 6.
Analysis of Van Dyke et al [26] data using trapezoid AUC estimation and DeLong covariance estimation for t = 2 tests and r = 5 readers.
a) Trapezoid AUCs:

| reader (j) | test 1 (CINE) | test 2 (Spin Echo) |
|---|---|---|
| 1 | 0.9196 | 0.9478 |
| 2 | 0.8588 | 0.9053 |
| 3 | 0.9039 | 0.9217 |
| 4 | 0.9731 | 0.9994 |
| 5 | 0.8298 | 0.9300 |
| average (θ̂i) | .8970 | .9408 |
b) ANOVA table:

| Source | df | Sum of squares | Mean square |
|---|---|---|---|
| T | 1 | 0.00479617 | 0.00479617 |
| R | 4 | 0.01534480 | 0.00383620 |
| T*R | 4 | 0.00220412 | 0.00055103 |
c) Covariance estimates computed from the DeLong covariance matrix:

σ̂ε² = .000792133, Ĉov1 = .000342009, Ĉov2 = .000339526, Ĉov3 = .000235850

d) FOR = MS(T)/[MS(T*R) + r(Ĉov2 − Ĉov3)] = 4.4849

e) Denominator degrees of freedom:

ddfO = (t − 1)(r − 1) = 4

ddfH = [MS(T*R) + r(Ĉov2 − Ĉov3)]² / {MS(T*R)² / [(t − 1)(r − 1)]} = 15.07

ddfD = 13.81 (Satterthwaite approximation computed from the DBM pseudovalue mean squares)
f) P-values for H0: θ1 = θ2:

| ddf method | ddf | p = Pr(F(t−1, ddf) ≥ FOR) |
|---|---|---|
| O | 4 | 0.1016 |
| H | 15.07 | 0.0512 |
| D | 13.81 | 0.0528 |
g) Single test 95% confidence intervals using only corresponding data:

| i | θ̂i | Ĉov2(i) | MS(R)(i) | MS(R)(i) + rĈov2(i) | df(i) | 95% CI for θi |
|---|---|---|---|---|---|---|
| 1 (CINE) | .8970 | .0004785 | .003083 | .005470 | 12.60 | .8970 ± .07169 |
| 2 (Spin Echo) | .9408 | .0002015 | .001305 | .002312 | 12.57 | .9408 ± .0466 |
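The test statistic and ddf values in parts (d)-(f) can be reproduced from the tabled mean squares and DeLong covariance estimates alone. The sketch below uses the explicit ddfH expression [MS(T*R) + r(Ĉov2 − Ĉov3)]² / {MS(T*R)² / [(t − 1)(r − 1)]}, which recovers the reported values to rounding accuracy:

```python
# Quantities from Table 6 (Van Dyke data, trapezoid AUCs, DeLong covariances).
t, r = 2, 5                  # number of tests and readers
ms_t  = 0.00479617           # MS(T)
ms_tr = 0.00055103           # MS(T*R)
cov2  = 0.000339526          # DeLong estimate of Cov2
cov3  = 0.000235850          # DeLong estimate of Cov3

denom = ms_tr + r * (cov2 - cov3)   # denominator of the OR test statistic
f_or  = ms_t / denom                # OR test statistic, part (d): 4.4849

ddf_o = (t - 1) * (r - 1)                                # ddfO = 4
ddf_h = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))  # ddfH: 15.07

print(round(f_or, 4), ddf_o, round(ddf_h, 2))
```

This prints 4.4849 4 15.07, matching parts (d) and (e) of the table.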
Figure 2.
Trapezoid ROC curves of the five readers for SE MRI and CINE MRI in the detection of aortic dissection. The average areas under the empirical curves are .941 (SE) and .897 (CINE).
Single test 95% confidence intervals using only the corresponding data, as given by equation (25), are given in part (g) of Table 6. We see that the confidence interval for CINE (.897 ± .0717) is roughly 50% wider than the interval for SE MRI (.941 ± .0466). When based upon the combined data using equation (24), the single test confidence intervals are instead given by θ̂i ± 0.0591, i = 1, 2. I prefer the first approach since it does not assume that the AUCs have the same variances and covariances for each test, as the combined-data approach does.
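As a numerical check on part (g) of Table 6, the interval ingredients can be recomputed from the tabled reader-level quantities. This is a sketch: the df expression used below, [MS(R) + rĈov2]² / {MS(R)² / (r − 1)}, is the Satterthwaite-style formula that reproduces the tabled 12.60 and 12.57 up to rounding of the inputs. Multiplying each standard error by the t quantile for df(i) then gives the tabled half-widths.

```python
import math

r = 5
# Reader-level quantities read from part (g) of Table 6: the within-test
# reader mean square MS(R)(i) and the DeLong Cov2 estimate for that test.
tests = {
    "CINE":      {"cov2": 0.0004785, "msr": 0.003083},
    "Spin Echo": {"cov2": 0.0002015, "msr": 0.001305},
}

results = {}
for name, q in tests.items():
    v  = q["msr"] + r * q["cov2"]             # MS(R)(i) + r*Cov2(i)
    df = v ** 2 / (q["msr"] ** 2 / (r - 1))   # Satterthwaite-style df(i)
    se = math.sqrt(v / r)                     # standard error of the mean AUC
    results[name] = (v, df, se)
    print(name, round(v, 6), round(df, 2), round(se, 5))
```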
8 DISCUSSION
The motivation for this paper was the recent finding by Hillis et al [11] that the DBM and OR procedures yield identical test statistics when based on the same accuracy measure and covariance estimation method, but their different ddf methods, ddfD and ddfO, can result in considerably different inferences. I proposed a new ddf estimator, ddfH, that overcomes problems with the ddfD and ddfO methods. I derived ddfH by showing that the null distribution of the OR test statistic can be approximated by an F(t − 1, ddfH) distribution. I showed how ddfH can be used with both the OR and DBM procedures and how it can also be derived from the DBM model. The p-value corresponding to ddfH will always be less than or equal to that corresponding to ddfD or ddfO.
Although ddfH can be derived from the DBM model assumptions, the derivation in terms of the OR model is more important because the OR model provides an acceptable conceptual model. In contrast, the DBM model does not provide an acceptable conceptual model because pseudovalues have no intrinsic meaning; hence Hillis et al [11] characterize the DBM model as a “working model” that allows one to fit the OR model using conventional ANOVA software.
In simulations ddfH performed better than ddfO with respect to significance level, with the ddfH mean significance level much closer to the nominal level. The shape and spread of the distributions of the significance levels were similar for the ddfH and ddfD methods, with ddfH closer to the nominal level than ddfD for binormal AUC estimation (mean significance levels: ddfH = .051, ddfD = .043) but deviating by approximately the same amount from the nominal level for trapezoid AUC estimation (mean significance levels: ddfH = .054, ddfD = .047). However, a drawback of ddfD was that confidence intervals were sometimes extremely wide because ddfD can be close to zero. For these reasons I concluded that ddfH performs better than either ddfD or ddfO. A SAS macro implementing the OR-DBM procedure using ddfH is available on request.
A limitation of this study is that I used only typical ROC sample sizes in the simulation study, where the number of readers is small (≤ 10) and the number of cases is moderate (≥ 50). Although the derivation of ddfH involved several approximations, I discussed in Section 3.1 why ddfH should perform adequately for the typical ROC diagnostic study where the number of cases is moderate to large, regardless of the number of readers. However, the performance of the new method may not be acceptable when the number of cases is smaller; certainly, further research is required before using it in that situation. Although the simulations examined only small reader sample sizes, I see no reason why the new method should not work satisfactorily when the number of readers is moderate or large, but again I caution that my simulations covered only typical ROC study sample sizes.
A topic to consider for future research is the robustness of the OR and DBM procedures to violations of the model assumptions. Since the procedures can be viewed as equivalent, only the less restrictive OR model assumptions need to be considered. The OR model assumes that the variances and covariances for the accuracy estimates do not vary by test. I conjecture that, similar to conventional ANOVA models, the test of the null hypothesis of no test effect will be fairly robust to departures from this assumption, but further investigation is required to support this conjecture. Although I would not expect single test confidence intervals based on this model to be robust to violations of this assumption, this problem can be circumvented by basing the confidence interval only on data for the corresponding test, as discussed in Section 5.3.
ACKNOWLEDGEMENTS
This research was supported by the National Institutes of Health, grant R01EB000863. I thank two anonymous referees for their excellent suggestions which greatly improved the paper, and also Nancy Obuchowski and Kevin Berbaum for their helpful suggestions in the final stage of preparing the manuscript. The views expressed in this article are those of the author and do not necessarily represent the views of the Department of Veterans Affairs.
A Appendix
In this section I show that the OR mean squares, MS(T), MS(R), and MS(T*R), have expectations as given in Table 1 and are independently distributed as follows: (t − 1)MS(T)/c1 ∼ χ²_{t−1}(λ), with

c1 = σTR² + σε² − Cov1 + (r − 1)(Cov2 − Cov3) (A1)

and noncentrality parameter λ = (r/c1) Σi (τi − τ̄)²; and (t − 1)(r − 1)MS(T*R)/c2 ∼ χ²_{(t−1)(r−1)}, where c2 = σTR² + σε² − Cov1 − Cov2 + Cov3. Furthermore, from Table 2 we see that E[MS(T) | H0] = c1; hence if H0: τ1 = ... = τt is true, then (t − 1)MS(T)/c1 ∼ χ²_{t−1}.
Let θ̂ = (θ̂11, ..., θ̂1r, ..., θ̂t1, ..., θ̂tr)′ denote the vector of outcomes for the OR model (1) and let θ denote the corresponding mean vector, where θij = μ + τi. Then θ̂ ∼ N(θ, Σ), where Σ = Cov(θ̂). I assume that Σ is positive definite.
I first show that Σ has the following form:

Σ = It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr), (A2)

where ⊗ denotes the Kronecker product operator, Ip denotes the p × p identity matrix, Mp denotes a p × p matrix of ones,

x1 = σR² + σTR² + σε² − Cov2, (A3)

x2 = Cov2, (A4)

and

x3 = σR² + Cov1 − Cov3. (A5)
It follows from the OR model (1) that Var(θ̂ij) = σR² + σTR² + σε² and

Cov(θ̂ij, θ̂i′j′) = σR² + Cov1 if i ≠ i′, j = j′; Cov2 if i = i′, j ≠ j′; and Cov3 if i ≠ i′, j ≠ j′.

Define θ̂ = (θ̂11, θ̂12, θ̂13, θ̂21, θ̂22, θ̂23)′. Then, for example, for t = 2 and r = 3 the covariance matrix of θ̂ is given by

Σ = [ Σ1 Σ2 ; Σ2 Σ1 ],

where Σ1 is the 3 × 3 covariance matrix of the outcomes within a test and Σ2 is the 3 × 3 matrix of covariances between the two tests. We can write

Σ = [ Σ1 0 ; 0 Σ1 ] + [ 0 Σ2 ; Σ2 0 ] = I2 ⊗ Σ1 + (M2 − I2) ⊗ Σ2, (A6)

where 0 denotes a matrix of zeros. Since Σ1 = x1I3 + x2M3 and Σ2 = x3I3 + Cov3M3, where x1, x2, and x3 are defined by equations (A3-A5), then substituting in (A6) gives

Σ = I2 ⊗ (x1I3 + x2M3) + (M2 − I2) ⊗ (x3I3 + Cov3M3),

which is equation (A2) with t = 2, r = 3. It can similarly be shown for arbitrary t and r that the covariance matrix for θ̂ is given by equation (A2).
Define Cp = Ip − p⁻¹Mp. Smith and Lewis [27] give matrix expressions for balanced ANOVA sums of squares; from their results we have (t − 1)MS(T) = θ̂′A1θ̂, (t − 1)(r − 1)MS(T*R) = θ̂′A2θ̂, and (r − 1)MS(R) = θ̂′A3θ̂, where A1 = Ct ⊗ (r⁻¹Mr), A2 = Ct ⊗ Cr, and A3 = (t⁻¹Mt) ⊗ Cr.
Let A and B be any two matrices and let tr(·) denote the trace function. Results that I utilize for the mean square distribution derivations are given below. For properties (a-c) I also assume that A and B are tr × tr symmetric matrices and that θ̂ ∼ N(θ, Σ).

(a) E(θ̂′Aθ̂) = tr(AΣ) + θ′Aθ

(b) θ̂′Aθ̂ ∼ χ²m(λ), where m = rank(AΣ) and λ = θ′Aθ, if and only if AΣ is idempotent

(c) θ̂′Aθ̂ and θ̂′Bθ̂ are distributed independently if and only if AΣB = 0

(d) 0 ⊗ A = A ⊗ 0 = 0

(e) aA ⊗ bB = ab(A ⊗ B)

(f) (A ⊗ C)(B ⊗ D) = (AB) ⊗ (CD)

(g) (A + B) ⊗ C = A ⊗ C + B ⊗ C; C ⊗ (A + B) = C ⊗ A + C ⊗ B

(h) rank(A ⊗ B) = rank(A) rank(B)

(i) if B is nonsingular then rank(AB) = rank(A)

(j) if A is idempotent then rank(A) = tr(A)

(k) CpCp = Cp

(l) MpMp = pMp

(m) MpCp = CpMp = 0

(n) A1, A2, and A3 are idempotent

(o) tr(A1) = t − 1; tr(A2) = (t − 1)(r − 1); tr(A3) = r − 1

(p) A1A2 = A1A3 = A2A3 = 0
Properties (a-c) are well known [28], (d-j) are standard matrix results [29], and properties (k-p) are easily derived.
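Properties (n)-(p) can also be checked numerically. The sketch below (plain Python, no external libraries) builds the quadratic-form matrices for t = 2 and r = 3, taking A1 = Ct ⊗ (Mr/r) and A3 = (Mt/t) ⊗ Cr (the standard balanced-ANOVA forms consistent with A2 = Ct ⊗ Cr and the traces in property (o)), and verifies idempotency, the traces, and the vanishing pairwise products.

```python
t, r = 2, 3   # a small case; the identities are algebraic, so any t, r works

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):   # Kronecker product
    return [[A[i][j] * B[p][q]
             for j in range(len(A[0])) for q in range(len(B[0]))]
            for i in range(len(A)) for p in range(len(B))]

def ones(p):      # Mp, the p x p matrix of ones
    return [[1.0] * p for _ in range(p)]

def centering(p): # Cp = Ip - Mp/p
    return [[(1.0 if i == j else 0.0) - 1.0 / p
             for j in range(p)] for i in range(p)]

def scale(c, A):
    return [[c * x for x in row] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def equal(A, B, eps=1e-12):
    return all(abs(x - y) < eps for ra, rb in zip(A, B) for x, y in zip(ra, rb))

A1 = kron(centering(t), scale(1.0 / r, ones(r)))   # Ct (x) (Mr/r)
A2 = kron(centering(t), centering(r))              # Ct (x) Cr
A3 = kron(scale(1.0 / t, ones(t)), centering(r))   # (Mt/t) (x) Cr

# (n) idempotency
assert all(equal(matmul(A, A), A) for A in (A1, A2, A3))
# (o) traces
assert abs(trace(A1) - (t - 1)) < 1e-12
assert abs(trace(A2) - (t - 1) * (r - 1)) < 1e-12
assert abs(trace(A3) - (r - 1)) < 1e-12
# (p) pairwise products vanish
zero = [[0.0] * (t * r) for _ in range(t * r)]
assert all(equal(matmul(X, Y), zero) for X, Y in ((A1, A2), (A1, A3), (A2, A3)))
print("properties (n)-(p) hold for t = 2, r = 3")
```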
To derive the distribution of MS(T), I show below that A1Σ = c1A1, where c1 is defined by equation (A1). Using property (a) we then have

E(θ̂′A1θ̂) = tr(A1Σ) + θ′A1θ = tr(c1A1) + θ′A1θ.

Since θij = μ + τi, then θ′A1θ = r Σi (τi − τ̄)²; also, from property (o) we have tr(c1A1) = c1(t − 1). Thus

E[MS(T)] = (t − 1)⁻¹E(θ̂′A1θ̂) = c1 + r Σi (τi − τ̄)²/(t − 1),

as given in Table 1. Since A1Σ = c1A1, then from property (n) it follows that (c1⁻¹A1)Σ = A1 is idempotent, and hence properties (b,i,j,o) imply θ̂′(c1⁻¹A1)θ̂ ∼ χ²_{t−1}(λ), where the noncentrality parameter is λ = θ′(c1⁻¹A1)θ = (r/c1) Σi (τi − τ̄)². It follows that (t − 1)MS(T)/c1 ∼ χ²_{t−1}(λ).

To show A1Σ = c1A1, using (A2) we have

A1Σ = [Ct ⊗ (r⁻¹Mr)][It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr)].

Using property (f) it follows that

A1Σ = Ct ⊗ [r⁻¹Mr(x1Ir + x2Mr)] + (CtMt − Ct) ⊗ [r⁻¹Mr(x3Ir + Cov3Mr)].

Using properties (d,e,g,l,m) we have

A1Σ = (x1 + rx2)[Ct ⊗ (r⁻¹Mr)] − (x3 + rCov3)[Ct ⊗ (r⁻¹Mr)] = c1A1,

since CtMt = 0 and c1 = x1 + rx2 − x3 − rCov3.
I similarly derive the distribution of MS(T*R) by showing below that A2Σ = c2A2, where

c2 = x1 − x3 = σTR² + σε² − Cov1 − Cov2 + Cov3.

Since θ′A2θ = 0, then similar to the derivation for the MS(T) distribution it follows that

E[MS(T*R)] = [(t − 1)(r − 1)]⁻¹tr(c2A2) = c2,

as given in Table 1, and θ̂′(c2⁻¹A2)θ̂ ∼ χ²_{(t−1)(r−1)}. Since E[MS(T*R)] = c2, then it follows that (t − 1)(r − 1)MS(T*R)/E[MS(T*R)] ∼ χ²_{(t−1)(r−1)}.

To show A2Σ = c2A2, using (A2) we have

A2Σ = (Ct ⊗ Cr)[It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr)].

Using property (f) it follows that

A2Σ = Ct ⊗ [Cr(x1Ir + x2Mr)] + (CtMt − Ct) ⊗ [Cr(x3Ir + Cov3Mr)].

Using properties (d,e,g,m) we have

A2Σ = x1(Ct ⊗ Cr) − x3(Ct ⊗ Cr) = (x1 − x3)A2 = c2A2.
From property (c), θ̂′A1θ̂ and θ̂′A2θ̂ are independent if A1ΣA2 = 0. To show A1ΣA2 = 0, note that since A1Σ = c1A1, then A1ΣA2 = c1A1A2 = 0 using property (p). It follows that MS(T) and MS(T*R) are independent. Similarly, the distribution of MS(R) can be derived and it can be shown that MS(R) is independent of MS(T) and MS(T*R).
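Finally, the key identities A1Σ = c1A1 and A2Σ = c2A2 can be spot-checked numerically. In the sketch below (plain Python, with hypothetical variance-component values chosen only for illustration), Σ is built entrywise from the OR covariance structure (diagonal σR² + σTR² + σε², Cov2 for same test and different readers, σR² + Cov1 for same reader and different tests, Cov3 otherwise), and c1 and c2 are taken as the expected mean squares E[MS(T) | H0] and E[MS(T*R)] stated above.

```python
import itertools

t, r = 2, 3
# Hypothetical variance components and covariances (illustration only; any
# values giving a valid covariance matrix would do):
var_r, var_tr, var_e = 0.0015, 0.0002, 0.0008
cov1, cov2, cov3 = 0.00035, 0.00030, 0.00020

cells = list(itertools.product(range(t), range(r)))  # (test, reader) pairs
n = len(cells)

def sigma_entry(a, b):
    """Cov(theta-hat_ij, theta-hat_i'j') under the OR model."""
    (i, j), (ip, jp) = a, b
    if i == ip and j == jp:
        return var_r + var_tr + var_e     # variance of a single estimate
    if i == ip:
        return cov2                       # same test, different readers
    if j == jp:
        return var_r + cov1               # same reader, different tests
    return cov3                           # different test and reader

Sigma = [[sigma_entry(a, b) for b in cells] for a in cells]

# A1 = Ct (x) (Mr/r) and A2 = Ct (x) Cr, written entrywise.
d = lambda x, y: 1.0 if x == y else 0.0
A1 = [[(d(i, ip) - 1 / t) * (1 / r) for (ip, jp) in cells] for (i, j) in cells]
A2 = [[(d(i, ip) - 1 / t) * (d(j, jp) - 1 / r) for (ip, jp) in cells]
      for (i, j) in cells]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def close(A, B, eps=1e-12):
    return all(abs(x - y) < eps for ra, rb in zip(A, B) for x, y in zip(ra, rb))

c1 = var_tr + var_e - cov1 + (r - 1) * (cov2 - cov3)  # E[MS(T) | H0]
c2 = var_tr + var_e - cov1 - cov2 + cov3              # E[MS(T*R)]

assert close(matmul(A1, Sigma), [[c1 * x for x in row] for row in A1])
assert close(matmul(A2, Sigma), [[c2 * x for x in row] for row in A2])
print("A1*Sigma = c1*A1 and A2*Sigma = c2*A2 verified numerically")
```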
References
- 1. Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press; New York: 1982.
- 2. Metz CE. ROC methodology in radiologic imaging. Investigative Radiology. 1986;21:720–733. doi:10.1097/00004424-198609000-00009.
- 3. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988;240:1285–1293. doi:10.1126/science.3287615.
- 4. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in Diagnostic Imaging. 1989;29:307–335.
- 5. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi:10.1148/radiology.143.1.7063747.
- 6. Obuchowski NA, Beiden SV, Berbaum KS, Hillis SL, Ishwaran H, Song HH, Wagner RF. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Academic Radiology. 2004;11:980–995. doi:10.1016/j.acra.2004.04.014.
- 7. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
- 8. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi:10.1016/s1076-6332(98)80294-8.
- 9. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Communications in Statistics: Simulation and Computation. 1995;24:285–308.
- 10. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Academic Radiology. 1995;2(Suppl 1):S22–S29.
- 11. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi:10.1002/sim.2024.
- 12. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–844.
- 13. Satterthwaite FE. Synthesis of variance. Psychometrika. 1941;6:309–316.
- 14. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics Bulletin. 1946;2:110–114.
- 15. Pavur R, Nath R. Exact F tests in an ANOVA procedure for dependent observations. Multivariate Behavioral Research. 1984;19:408–420. doi:10.1207/s15327906mbr1904_3.
- 16. Quenouille MH. Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B. 1949;11:68–84.
- 17. Quenouille MH. Notes on bias in estimation. Biometrika. 1956;43:353–360.
- 18. Tukey JW. Bias and confidence in not quite large samples. Annals of Mathematical Statistics. 1958;29:614.
- 19. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Academic Radiology. 2004;11:1260–1273. doi:10.1016/j.acra.2004.08.009.
- 20. Hillis SL, Berbaum KS. Monte Carlo validation of the Dorfman-Berbaum-Metz method using normalized pseudovalues and less data-based model simplification. Academic Radiology. 2005;12:1534–1541. doi:10.1016/j.acra.2005.07.012.
- 21. Gaylor DW, Hopper FN. Estimating the degrees of freedom for linear combinations of mean squares by Satterthwaite's formula. Technometrics. 1969;11:691–706.
- 22. Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Academic Radiology. 1997;4:298–303. doi:10.1016/s1076-6332(97)80032-3.
- 23. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal-detection theory and determination of confidence intervals-rating method data. Journal of Mathematical Psychology. 1969;6:487–496.
- 24. Dorfman DD. RSCORE II. In: Swets JA, Pickett RM, editors. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press; San Diego, CA: 1982. pp. 212–232.
- 25. SAS for Windows, Version 9.1. SAS Institute Inc.; Cary, NC: 2002–2003.
- 26. Van Dyke CW, White RD, Obuchowski NA, Geisinger MA, Lorig RJ, Meziane MA. Cine MRI in the diagnosis of thoracic aortic dissection. 79th RSNA Meetings; Chicago, IL; November 28 - December 3, 1993.
- 27. Smith JH, Lewis TO. Determining the effects of intra-class correlation on factorial experiments. Communications in Statistics, Part A: Theory and Methods. 1980;9:1353–1364.
- 28. Searle SR. Linear Models. Wiley; New York: 1971. pp. 55–59.
- 29. Harville DA. Matrix Algebra From a Statistician's Perspective. Springer; New York: 1997. pp. 335–338.