Abstract
Rationale and Objectives
The Dorfman-Berbaum-Metz (DBM) method has been one of the most popular methods for analyzing multireader receiver operating characteristic (ROC) studies since it was proposed in 1992. Despite its popularity, the original procedure has several drawbacks: it is limited to jackknife accuracy estimates, it is substantially conservative, and it is not based on a satisfactory conceptual or theoretical model. Recently, solutions to these problems have been presented in three papers. Our purpose is to summarize and provide an overview of these recent developments.
Materials and Methods
We present and discuss the recently proposed solutions for the various drawbacks of the original DBM method.
Results
We compare the solutions in a simulation study and find that they result in improved performance for the DBM procedure. We also compare the solutions using two real data studies and find that the modified DBM procedure that incorporates these solutions yields more significant results and clearer interpretations of the variance component parameters than the original DBM procedure.
Conclusions
We recommend using the modified DBM procedure that incorporates the recent developments.
Keywords: receiver operating characteristic (ROC) curve, DBM, diagnostic radiology, jackknife, area under the curve (AUC)
Introduction
There are several different statistical methods for analyzing multireader receiver operating characteristic (ROC) studies, with the Dorfman-Berbaum-Metz (DBM) method [1–3] being one of the most frequently used methods. The DBM method involves an analysis of variance (ANOVA) of pseudovalues computed with the Quenouille-Tukey jackknife [4–6]. The basic data for the analysis are pseudovalues corresponding to test-reader ROC accuracy measures, such as the area under the ROC curve (AUC), computed by jackknifing cases separately for each test-reader combination. Throughout we use the term test to refer to a diagnostic test, modality, or treatment. A mixed-effects ANOVA is performed on the pseudovalues to test the null hypothesis that the average accuracy of readers is the same for all of the diagnostic tests studied. Accuracy can be characterized using any accuracy measure, such as sensitivity, specificity, area under the ROC curve, partial area under the ROC curve, sensitivity at a fixed specificity, or specificity at a fixed sensitivity. Furthermore, these measures of accuracy can be estimated parametrically, semiparametrically or nonparametrically; the DBM method accuracy estimates are the corresponding jackknife estimates.
Although the DBM method may be the most frequently used analysis method for multireader ROC studies since it was proposed in 1992, having been used in over 100 published studies [7], the original procedure has several drawbacks: it requires that the analysis be based on jackknife accuracy estimates, it is substantially conservative, and it is not based on a satisfactory conceptual or theoretical model. Recently, solutions to these problems have been presented in three papers [8–10]. We summarize these recent developments and compare the solutions in a simulation study and in two examples.
Materials and Methods
Original DBM Method
The DBM method is typically used with the test×reader×case factorial study design where each case (i.e., patient) undergoes each of several diagnostic tests and the resulting images are interpreted once by each reader. Throughout this paper, we assume that the data have been collected using this factorial design. The competing modalities can be compared using the DBM method; in particular, the null hypothesis of no test effect can be tested and confidence intervals for test differences can be computed. Results generalize to both the population of cases and the population of readers. To simplify the narration here, we assume that the outcome is AUC.
For the original DBM method, AUC pseudovalues are computed using the Quenouille-Tukey jackknife separately for each test-reader combination as described in Dorfman et al [1]. Let Yijk denote the AUC pseudovalue for test i, reader j, and case k; by definition Yijk = cθ̂ij − (c−1)θ̂ij(k), where c denotes the number of cases, θ̂ij denotes the AUC estimate based on all of the data for the ith test and jth reader, and θ̂ij(k) denotes the AUC estimate based on the same data but with data for the kth case removed. Thus, in effect, Yijk represents the contribution of the kth case to the accuracy estimate for the ith test and jth reader, θ̂ij. Then using the Yijk as the data to be evaluated by conventional statistical analysis, the DBM procedure tests for a test effect using a fully-crossed three-factor ANOVA with test treated as a fixed factor and reader and case as random factors. A “jackknife estimate” of AUC for the ith test and jth reader is given by the mean of the corresponding pseudovalues:

$$\bar{Y}_{ij\cdot} = \frac{1}{c}\sum_{k=1}^{c} Y_{ijk} \tag{1}$$
We refer to θ̂ij as the original AUC estimate, Ȳij· as the jackknife AUC estimate, and the Yijk as the raw pseudovalues.
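The pseudovalue definition above can be illustrated with a minimal sketch for a single test-reader combination. It assumes trapezoidal-rule (Mann-Whitney) AUC estimation; the function names and the rating data are hypothetical and are not taken from the DBM software.

```python
def trapezoid_auc(normal, diseased):
    """Trapezoidal-rule (Mann-Whitney) AUC: P(diseased > normal) + 0.5 * P(tie)."""
    total = 0.0
    for x in normal:
        for y in diseased:
            if y > x:
                total += 1.0
            elif y == x:
                total += 0.5
    return total / (len(normal) * len(diseased))


def jackknife_pseudovalues(normal, diseased):
    """Return Y_k = c * theta_hat - (c - 1) * theta_hat_(k) for every case k.

    Cases are ordered with the normal cases first, then the diseased cases.
    """
    c = len(normal) + len(diseased)
    theta_hat = trapezoid_auc(normal, diseased)
    pseudo = []
    for k in range(len(normal)):
        reduced = normal[:k] + normal[k + 1:]  # drop the kth normal case
        pseudo.append(c * theta_hat - (c - 1) * trapezoid_auc(reduced, diseased))
    for k in range(len(diseased)):
        reduced = diseased[:k] + diseased[k + 1:]  # drop the kth diseased case
        pseudo.append(c * theta_hat - (c - 1) * trapezoid_auc(normal, reduced))
    return pseudo


# Hypothetical 5-point ratings for one test-reader combination.
normal_ratings = [1, 2, 1, 3, 2, 1]
diseased_ratings = [3, 5, 4, 2, 5]

theta_hat = trapezoid_auc(normal_ratings, diseased_ratings)
pseudovalues = jackknife_pseudovalues(normal_ratings, diseased_ratings)
jackknife_auc = sum(pseudovalues) / len(pseudovalues)  # equation (1)
```

For the trapezoid estimator the jackknife AUC estimate coincides with the original estimate, so `jackknife_auc` and `theta_hat` agree up to floating-point error; for other estimators (e.g., semiparametric) the two generally differ.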
The analysis model is expressed by
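$$Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + (\tau RC)_{ijk} + \varepsilon_{ijk} \tag{2}$$

i=1,…,t; j=1,…,r; k=1,…,c; where τi denotes the fixed effect of test i, Rj denotes the random effect of reader j, Ck denotes the random effect of case k, the multiple symbols in parentheses denote interactions, and εijk is the error term. The interaction terms are all random effects. The random effects are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma^2_R$, $\sigma^2_C$, $\sigma^2_{\tau R}$, $\sigma^2_{\tau C}$, $\sigma^2_{RC}$, $\sigma^2_{\tau RC}$, and $\sigma^2_{\varepsilon}$. Since there are no replications, $\sigma^2_{\tau RC}$ and $\sigma^2_{\varepsilon}$ are inseparable.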
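The DBM F statistic for testing for a test effect is the conventional mixed-model ANOVA F statistic based on the pseudovalues. Letting MS(T), MS(T*R), MS(T*C), and MS(T*R*C) denote the mean squares corresponding to the test, test×reader, test×case and test×reader×case effects, respectively, the F statistic for testing for a test effect for model (2) is given by

$$F = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)} \tag{3}$$

Under the null hypothesis of no test effect, F has an approximate $F_{df_1, df_2}$ distribution, where df1 = t−1 and df2 is the Satterthwaite [11, 12] degrees of freedom approximation given by

$$df_2 = \frac{\left[\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)\right]^2}{\dfrac{\mathrm{MS}(T*R)^2}{(t-1)(r-1)} + \dfrac{\mathrm{MS}(T*C)^2}{(t-1)(c-1)} + \dfrac{\mathrm{MS}(T*R*C)^2}{(t-1)(r-1)(c-1)}} \tag{4}$$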
In the original DBM formulation, extensive model-based simplification is performed to prevent the F statistic (3) from becoming negative (due to a negative denominator). Specifically, model (2) is simplified by omitting (or equivalently, setting to zero) the test×reader and the test×case variance components if the corresponding ANOVA estimates are not positive. For the simplified model the appropriate F statistic and denominator degrees of freedom (ddf) are used; the appropriate F statistic for each simplified model contains only one mean square in the denominator and hence cannot be negative. Thus equations (3–4) are used only when both of the variance component estimates are positive.
The test×reader and the test×case variance component ANOVA estimates are
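$$\hat{\sigma}^2_{\tau R} = \frac{\mathrm{MS}(T*R) - \mathrm{MS}(T*R*C)}{c}, \qquad \hat{\sigma}^2_{\tau C} = \frac{\mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)}{r} \tag{5}$$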
Taking into account possible model simplification, the F statistic and ddf for the original DBM method are given by
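$$F_{orig} = \begin{cases} \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)} & \hat{\sigma}^2_{\tau R} > 0,\ \hat{\sigma}^2_{\tau C} > 0 \\[2ex] \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R)} & \hat{\sigma}^2_{\tau R} > 0,\ \hat{\sigma}^2_{\tau C} \le 0 \\[2ex] \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*C)} & \hat{\sigma}^2_{\tau R} \le 0,\ \hat{\sigma}^2_{\tau C} > 0 \\[2ex] \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R*C)} & \hat{\sigma}^2_{\tau R} \le 0,\ \hat{\sigma}^2_{\tau C} \le 0 \end{cases} \tag{6}$$

and

$$ddf_{orig} = \begin{cases} df_2 \text{ from equation (4)} & \hat{\sigma}^2_{\tau R} > 0,\ \hat{\sigma}^2_{\tau C} > 0 \\ (t-1)(r-1) & \hat{\sigma}^2_{\tau R} > 0,\ \hat{\sigma}^2_{\tau C} \le 0 \\ (t-1)(c-1) & \hat{\sigma}^2_{\tau R} \le 0,\ \hat{\sigma}^2_{\tau C} > 0 \\ (t-1)(r-1)(c-1) & \hat{\sigma}^2_{\tau R} \le 0,\ \hat{\sigma}^2_{\tau C} \le 0 \end{cases} \tag{7}$$

The numerator degrees of freedom for F in equation (6) is t−1. We refer to this approach, using Forig and ddforig, as original DBM. Note that the conditions in equations (6) and (7) can also be written in terms of the mean squares; e.g., $\hat{\sigma}^2_{\tau R} > 0$, $\hat{\sigma}^2_{\tau C} > 0$ is equivalent to MS(T*R) > MS(T*R*C), MS(T*C) > MS(T*R*C).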
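The model-simplification logic of original DBM can be sketched as a small function of the pseudovalue mean squares. This is an illustrative sketch only: the function name is hypothetical, and the conditions are written in terms of mean squares, as noted above. The example call uses the normalized-pseudovalue mean squares reported for Example 1 (Table 9), with t = 2 tests, r = 5 readers, and c = 114 cases.

```python
def original_dbm(ms_t, ms_tr, ms_tc, ms_trc, t, r, c):
    """Return (F_orig, ddf_orig) with the original DBM model simplification."""
    tr_positive = ms_tr > ms_trc  # test x reader variance estimate > 0
    tc_positive = ms_tc > ms_trc  # test x case variance estimate > 0
    if tr_positive and tc_positive:
        # Full model: three-term denominator with Satterthwaite ddf.
        denom = ms_tr + ms_tc - ms_trc
        ddf = denom ** 2 / (
            ms_tr ** 2 / ((t - 1) * (r - 1))
            + ms_tc ** 2 / ((t - 1) * (c - 1))
            + ms_trc ** 2 / ((t - 1) * (r - 1) * (c - 1))
        )
    elif tr_positive:
        denom, ddf = ms_tr, (t - 1) * (r - 1)
    elif tc_positive:
        denom, ddf = ms_tc, (t - 1) * (c - 1)
    else:
        denom, ddf = ms_trc, (t - 1) * (r - 1) * (c - 1)
    return ms_t / denom, ddf


# Normalized-pseudovalue mean squares from Table 9 (Example 1).
f_orig, ddf_orig = original_dbm(
    ms_t=0.468996, ms_tr=0.108062, ms_tc=0.143095, ms_trc=0.072068,
    t=2, r=5, c=114,
)
# f_orig is approximately 2.619 and ddf_orig approximately 10.31.
```

The p-value would then be the upper-tail probability of an F distribution with t−1 and `ddf_orig` degrees of freedom; it is omitted here to keep the sketch free of external dependencies.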
Problem 1: DBM is limited to jackknife accuracy estimates
One problem with original DBM is that it requires that the analysis be based on jackknife AUC estimates. Although it is possible for the jackknife AUC estimator to perform better than the corresponding original AUC estimator, clearly it would be preferable to have the flexibility to base the analysis on either the jackknife or original accuracy estimator, especially if (as is typically the case) it has not been shown that the jackknife AUC estimator performs as well as the original AUC estimator. For trapezoidal-rule (trapezoid) AUC estimates [13] this is not a problem, since the trapezoid and corresponding jackknife AUC estimates are equal [8].
Hillis et al [8] provide a solution to this problem by showing that the DBM method can be based on normalized pseudovalues $Y^*_{ijk}$, defined by $Y^*_{ijk} = Y_{ijk} + (\hat{\theta}_{ij} - \bar{Y}_{ij\cdot})$. That is, the normalized pseudovalue for patient k, reader j, and test i is equal to the sum of the raw pseudovalue Yijk and the difference between the ijth test-reader original and jackknife AUC estimates. The estimate for θij based on the normalized pseudovalues, given by $\bar{Y}^*_{ij\cdot}$, is equal to the original AUC estimate θ̂ij. Thus, the DBM procedure with normalized pseudovalues yields single test and test-difference confidence intervals centered on the original accuracy estimates and their differences, averaged across readers.
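The normalization step amounts to shifting the raw pseudovalues for a test-reader combination by a constant. A minimal sketch follows; the function name and the numerical values are hypothetical.

```python
def normalize_pseudovalues(raw, theta_hat):
    """Shift raw pseudovalues so that their mean equals the original estimate.

    raw: raw pseudovalues Y_ijk for one test-reader combination (i, j).
    theta_hat: original accuracy estimate for that combination.
    """
    jackknife_estimate = sum(raw) / len(raw)
    shift = theta_hat - jackknife_estimate
    return [y + shift for y in raw]


raw = [0.95, 1.02, 0.88, 0.97, 0.91]  # hypothetical raw pseudovalues
theta_hat = 0.933                      # hypothetical original AUC estimate
normalized = normalize_pseudovalues(raw, theta_hat)
normalized_mean = sum(normalized) / len(normalized)
```

Because the shift is constant within a test-reader combination, differences between cases are unchanged; only the mean moves, from the jackknife estimate to the original estimate.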
Problem 2: DBM is substantially conservative
Another problem with original DBM is that it is substantially conservative. Dorfman et al [3] conclude from simulations that the DBM method provides a “moderately conservative statistical test of modality differences,” with the degree of conservatism greatest with very large ROC areas and decreasing as the number of cases increases. Using the Roe and Metz [2] simulation structure, Hillis and Berbaum [14] report that, using semiparametric estimation with either normalized or raw pseudovalues, the average type I error across 144 combinations of reader-sample size, case-sample size, AUC, and variance components is 0.036, considerably lower than the nominal 0.05 significance level. The downside of a conservative test is that power is diminished compared to the same test with the critical value adjusted to yield significance levels closer to the nominal level.
In simulations Hillis [10] shows that the DBM procedure attains a type I error much closer to the nominal level when two modifications are incorporated: (1) less data-based model simplification is performed, and (2) a different ddf formula is used. We now discuss these two modifications.
Less data-based model simplification
Hillis et al [8] propose that, similar to original DBM, the test×case variance component be omitted if its ANOVA estimate is not positive; however, they stipulate that the test×reader variance component should never be omitted, even when its estimate is zero or negative. We refer to this approach as new model simplification. Like original DBM, new model simplification ensures that the F test statistic will not be negative. However, an important advantage of new model simplification is that it results in a less conservative test, with the type I error rate considerably closer to the nominal level [9]. Another advantage is that this approach avoids making inferences under the unrealistic assumption that differences between tests are the same for all readers in the population, which is implied when the test×reader variance component is omitted [14].
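Using new model simplification, the F statistic for testing the null hypothesis of no test effect is the same as that given by equation (3) when $\hat{\sigma}^2_{\tau C} > 0$, whereas it is set equal to MS(T)/MS(T*R) when $\hat{\sigma}^2_{\tau C} \le 0$. We denote this F statistic using new model simplification by FDBM. Thus,

$$F_{DBM} = \begin{cases} \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)} & \hat{\sigma}^2_{\tau C} > 0 \\[2ex] \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R)} & \hat{\sigma}^2_{\tau C} \le 0 \end{cases} \tag{8}$$

Since $\hat{\sigma}^2_{\tau C} \le 0$ is equivalent to MS(T*C) − MS(T*R*C) ≤ 0, this F statistic can be succinctly written in the following form that takes model simplification into account:

$$F_{DBM} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \max\left[\mathrm{MS}(T*C) - \mathrm{MS}(T*R*C),\ 0\right]} \tag{9}$$

The corresponding conventional ANOVA ddf is given by

$$ddf_D = \begin{cases} df_2 \text{ from equation (4)} & \hat{\sigma}^2_{\tau C} > 0 \\ (t-1)(r-1) & \hat{\sigma}^2_{\tau C} \le 0 \end{cases} \tag{10}$$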
Thus, new model simplification uses FDBM and ddfD.
In Appendix A we derive the following relationships: (1) if $\hat{\sigma}^2_{\tau R} > 0$ then FDBM = Forig and ddfD = ddforig; and (2) if $\hat{\sigma}^2_{\tau R} \le 0$ then FDBM ≥ Forig but ddfD < ddforig. However, we have found that typically the larger F statistic under new model simplification, when $\hat{\sigma}^2_{\tau R} \le 0$, will result in a more significant conclusion (smaller p-value), compared to that obtained using original DBM, even though the ddf is smaller under new model simplification. In this way new model simplification produces a less conservative test.
New denominator degrees of freedom
Hillis [10] proposes a new ddf given by
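$$ddf_H = \begin{cases} \dfrac{\left[\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)\right]^2}{\mathrm{MS}(T*R)^2 / \left[(t-1)(r-1)\right]} & \hat{\sigma}^2_{\tau C} > 0 \\[2ex] (t-1)(r-1) & \hat{\sigma}^2_{\tau C} \le 0 \end{cases} \tag{11}$$

Equation (11) can be written more compactly in the form

$$ddf_H = \frac{\left\{\mathrm{MS}(T*R) + \max\left[\mathrm{MS}(T*C) - \mathrm{MS}(T*R*C),\ 0\right]\right\}^2}{\mathrm{MS}(T*R)^2 / \left[(t-1)(r-1)\right]} \tag{12}$$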
The quantity ddfH is derived by assuming that new model simplification is used – that is, it is to be used with FDBM (9). We refer to this approach, using FDBM and ddfH, as new model simplification plus ddfH.
In Appendix A we show that ddfH > ddfD if $\hat{\sigma}^2_{\tau C} > 0$, whereas ddfH = ddfD if $\hat{\sigma}^2_{\tau C} \le 0$. Since new model simplification and new model simplification plus ddfH both use FDBM, it follows that new model simplification plus ddfH results in a lower p-value when $\hat{\sigma}^2_{\tau C} > 0$ and the same p-value when $\hat{\sigma}^2_{\tau C} \le 0$; hence, it is less conservative than new model simplification.
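As a rough illustration, FDBM and the two competing denominator degrees of freedom can be computed together from the pseudovalue mean squares. The function name is hypothetical; the formulas assume the max-form of the F statistic, the conventional Satterthwaite ddf when MS(T*C) > MS(T*R*C), and the Hillis ddf whose denominator contains only the test×reader term. The example calls use the mean squares reported for Example 1 (Table 9).

```python
def new_model_simplification(ms_t, ms_tr, ms_tc, ms_trc, t, r, c):
    """Return (F_DBM, ddf_D, ddf_H); the test x reader term is never dropped."""
    df_tr = (t - 1) * (r - 1)
    excess = max(ms_tc - ms_trc, 0.0)  # zero when the test x case estimate <= 0
    denom = ms_tr + excess
    f_dbm = ms_t / denom
    # Hillis ddf: only the test x reader mean square appears in the denominator.
    ddf_h = denom ** 2 / (ms_tr ** 2 / df_tr)
    if excess > 0:
        # Conventional Satterthwaite ddf for the three-term denominator.
        ddf_d = denom ** 2 / (
            ms_tr ** 2 / df_tr
            + ms_tc ** 2 / ((t - 1) * (c - 1))
            + ms_trc ** 2 / (df_tr * (c - 1))
        )
    else:
        ddf_d = float(df_tr)
    return f_dbm, ddf_d, ddf_h


# Mean squares from Table 9 (Example 1): t = 2 tests, r = 5 readers, c = 114 cases.
raw_result = new_model_simplification(
    0.264166, 0.112560, 0.143095, 0.072068, t=2, r=5, c=114)
normalized_result = new_model_simplification(
    0.468996, 0.108062, 0.143095, 0.072068, t=2, r=5, c=114)
# raw_result is approximately (1.439, 10.03, 10.64);
# normalized_result is approximately (2.619, 10.31, 10.99).
```

Both ddf values share the same numerator, so `ddf_h` exceeds `ddf_d` whenever the extra Satterthwaite terms are present, and the two coincide at (t−1)(r−1) otherwise.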
Table 1 presents a summary of the three different DBM approaches – original DBM, new model simplification, and new model simplification plus ddfH – and Table 2 presents their relationships.
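Table 1.

a) Original DBM

| Forig | ddforig | Condition |
|---|---|---|
| MS(T)/[MS(T*R) + MS(T*C) − MS(T*R*C)] (equation 3) | Equation (4) | $\hat{\sigma}^2_{\tau R} > 0$, $\hat{\sigma}^2_{\tau C} > 0$ |
| MS(T)/MS(T*R) | (t−1)(r−1) | $\hat{\sigma}^2_{\tau R} > 0$, $\hat{\sigma}^2_{\tau C} \le 0$ |
| MS(T)/MS(T*C) | (t−1)(c−1) | $\hat{\sigma}^2_{\tau R} \le 0$, $\hat{\sigma}^2_{\tau C} > 0$ |
| MS(T)/MS(T*R*C) | (t−1)(r−1)(c−1) | $\hat{\sigma}^2_{\tau R} \le 0$, $\hat{\sigma}^2_{\tau C} \le 0$ |

b) New model simplification

| Quantity | Formula |
|---|---|
| FDBM | MS(T)/{MS(T*R) + max[MS(T*C) − MS(T*R*C), 0]} (equation 9) |
| ddfD | Equation (4) if $\hat{\sigma}^2_{\tau C} > 0$; (t−1)(r−1) if $\hat{\sigma}^2_{\tau C} \le 0$ (equation 10) |

c) New model simplification plus ddfH

| Quantity | Formula |
|---|---|
| FDBM | MS(T)/{MS(T*R) + max[MS(T*C) − MS(T*R*C), 0]} (equation 9) |
| ddfH | {MS(T*R) + max[MS(T*C) − MS(T*R*C), 0]}² / {MS(T*R)²/[(t−1)(r−1)]} (equation 12) |

These approaches can be used with raw, normalized, or quasi pseudovalues. See Table 6 for computational formulas for $\hat{\sigma}^2_{\tau R}$ and $\hat{\sigma}^2_{\tau C}$.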
Table 2.

| $\hat{\sigma}^2_{\tau R}$ | $\hat{\sigma}^2_{\tau C}$ | F relationship | ddf relationship |
|---|---|---|---|
| >0 | >0 | Forig = FDBM | ddforig = ddfD < ddfH |
| ≤0 | >0 | Forig ≤ FDBM (equality iff $\hat{\sigma}^2_{\tau R} = 0$) | ddfD < ddforig, ddfD < ddfH |
| >0 | ≤0 | Forig = FDBM | ddforig = ddfD = ddfH |
| ≤0 | ≤0 | Forig ≤ FDBM (equality iff $\hat{\sigma}^2_{\tau R} = 0$) | ddfD = ddfH < ddforig |
These relationships are derived in Appendix A. Iff: if and only if.
Problem 3: DBM model is unsatisfactory conceptually and theoretically
The original DBM procedure does not provide a satisfactory conceptual model since the model parameters are expressed in terms of pseudovalues rather than AUC values. The model is also unsatisfactory theoretically since it assumes that the pseudovalues are independent and normally distributed, but they are neither. Thus, desirable statistical properties of the DBM procedure do not directly follow from the model assumptions, since the assumptions are not true; rather, the validity of the model must be determined through simulation studies.
Hillis et al [8] provide a solution to this problem by showing that the DBM procedure is equivalent to another procedure that is based on an acceptable conceptual and theoretical model. Specifically, they show that the DBM model can be viewed as a “working” model that produces the same inferences as obtained using the test×reader ANOVA model with correlated errors proposed by Obuchowski and Rockette (OR) [15, 16]. The OR model is given by
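$$\hat{\theta}_{ij} = \tilde{\mu} + \tilde{\tau}_i + R_j + (\tau R)_{ij} + \varepsilon_{ij} \tag{13}$$

i=1,…,t; j=1,…,r; where θ̂ij is the AUC estimate (or other accuracy estimate) for the ith test and jth reader, τ̃i denotes the fixed effect of test i, Rj denotes the random effect of reader j, (τR)ij denotes the random test×reader interaction, and εij is the error term having mean zero and variance $\tilde{\sigma}^2_{\varepsilon}$. The random effects Rj and (τR)ij are assumed independent and normally distributed with zero means and variances $\tilde{\sigma}^2_{R}$ and $\tilde{\sigma}^2_{\tau R}$, respectively, and are assumed independent of the εij. We use the tilde symbol “~” to distinguish OR model parameters from analogous DBM model parameters. Since the same cases are read by each reader using each test, the error terms are not assumed to be independent. Instead, equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances given by

$$\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = \begin{cases} \mathrm{Cov}_1 & i \ne i',\ j = j' \ \text{(different test, same reader)} \\ \mathrm{Cov}_2 & i = i',\ j \ne j' \ \text{(same test, different reader)} \\ \mathrm{Cov}_3 & i \ne i',\ j \ne j' \ \text{(different test, different reader)} \end{cases} \tag{14}$$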
Obuchowski and Rockette [15] suggest the following ordering: Cov1 ≥ Cov2 ≥ Cov3.
Conditional on the reader and test×reader effects (that is, treating readers as fixed), it follows from model (13) that Cov1, Cov2, and Cov3 are also the corresponding covariances of the AUC estimates; for example, Cov2 is the covariance between the AUCs for two fixed readers using the same test, while Cov3 is the covariance between the AUCs for two fixed readers using different modalities.
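The equi-covariance structure can be made concrete by writing out the error covariance matrix for all test-reader combinations. The sketch below is illustrative only; the function name and the numerical values are hypothetical.

```python
from itertools import product


def error_covariance_matrix(t, r, var_eps, cov1, cov2, cov3):
    """Covariance matrix of the OR error terms, treating readers as fixed.

    Rows and columns are indexed by (test, reader) pairs in the order
    returned as `cells`.
    """
    cells = list(product(range(t), range(r)))
    matrix = []
    for (i, j) in cells:
        row = []
        for (i2, j2) in cells:
            if i == i2 and j == j2:
                row.append(var_eps)  # same test, same reader: error variance
            elif j == j2:
                row.append(cov1)     # different tests, same reader
            elif i == i2:
                row.append(cov2)     # same test, different readers
            else:
                row.append(cov3)     # different tests, different readers
        matrix.append(row)
    return cells, matrix


# Hypothetical values respecting the suggested ordering Cov1 >= Cov2 >= Cov3.
cells, sigma = error_covariance_matrix(
    t=2, r=2, var_eps=1.0e-3, cov1=4.0e-4, cov2=3.0e-4, cov3=2.0e-4)
```

With two tests and two readers, the first row corresponds to (test 0, reader 0) and contains, in order, the error variance, Cov2, Cov1, and Cov3.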
The OR F statistic for testing for a test difference is given by
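$$F_{OR} = \frac{\mathrm{MS}(T)_{\hat{\theta}_{ij}}}{\mathrm{MS}(T*R)_{\hat{\theta}_{ij}} + \max\left[r\left(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3\right),\ 0\right]} \tag{15}$$

where MS(T)θ̂ij and MS(T*R)θ̂ij are the test and test×reader mean squares corresponding to the OR model (13), and where $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ are covariance estimates; the subscript “θ̂ij” is used here to indicate that the mean squares are computed from the AUCs rather than the pseudovalues. The quantities $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ are estimated by averaging corresponding covariance estimates for pairs of AUCs, estimated using covariance estimation methods that treat readers as fixed. For example, $\widehat{\mathrm{Cov}}_2 = \frac{2}{t\,r(r-1)} \sum_{i=1}^{t} \sum_{j<j'} \widehat{\mathrm{Cov}}(\hat{\theta}_{ij}, \hat{\theta}_{ij'})$, where $\widehat{\mathrm{Cov}}(\hat{\theta}_{ij}, \hat{\theta}_{ij'})$ is an estimate of the covariance between AUCs for fixed readers j and j′ using test i, estimated using a fixed reader method such as bootstrapping or jackknifing.

The DBM and OR procedures are related as follows [8]. Note that the jackknife procedure provides both AUC point estimates, defined by equation (1), and covariance and variance estimates for the AUCs, as discussed in Reference [8]. The DBM and OR F statistics, FDBM and FOR defined by equations (9) and (15), are equal if $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ are jackknife covariance estimates and normalized pseudovalues are used with the DBM procedure. This relationship does not require any particular estimation method for the θ̂ij in equation (13). On the other hand, if raw pseudovalues are used, then the relationship still holds if, additionally, the θ̂ij in equation (13) are jackknife estimates. More generally, for any given AUC estimation method and any given method of estimating Cov2 and Cov3, FDBM = FOR if the DBM procedure is used with quasi pseudovalues, as defined in Reference [8]. These conditions which ensure that FDBM = FOR are summarized in Table 3. The appropriate ddf to use with either the DBM or OR procedure is ddfH, given by equation (12) for the DBM procedure. In terms of the OR procedure mean squares, Reference [10] shows that ddfH is given by

$$ddf_H = \frac{\left\{\mathrm{MS}(T*R)_{\hat{\theta}_{ij}} + \max\left[r\left(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3\right),\ 0\right]\right\}^2}{\mathrm{MS}(T*R)_{\hat{\theta}_{ij}}^2 / \left[(t-1)(r-1)\right]} \tag{16}$$

Table 3.

| Condition | DBM pseudovalues | OR accuracy estimates θ̂ij | OR covariance estimates |
|---|---|---|---|
| 1 | Normalized | Any estimation method | Jackknife |
| 2 | Raw | Jackknife | Jackknife |
| 3 | Quasi | Any estimation method | Any estimation method |

Note: any one of the above conditions results in FDBM = FOR.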
Under any of the conditions described above that result in FDBM = FOR, the same value for ddfH is obtained using either equation (12) or (16), and there is a one-to-one correspondence between the DBM and OR computed quantities, as shown in Table 4.
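Table 4.

| OR computed quantity | Equivalent function of DBM computed quantities |
|---|---|
| MS(T)θ̂ij | = MS(T)/c |
| MS(R)θ̂ij | = MS(R)/c |
| MS(T*R)θ̂ij | = MS(T*R)/c |
| $\hat{\tilde{\sigma}}^2_{\varepsilon}$ | = [MS(C) + (t−1)MS(T*C) + (r−1)MS(R*C) + (t−1)(r−1)MS(T*R*C)]/(trc) |
| $\widehat{\mathrm{Cov}}_1$ | = {MS(C) − MS(T*C) + (r−1)[MS(R*C) − MS(T*R*C)]}/(trc) |
| $\widehat{\mathrm{Cov}}_2$ | = {MS(C) − MS(R*C) + (t−1)[MS(T*C) − MS(T*R*C)]}/(trc) |
| $\widehat{\mathrm{Cov}}_3$ | = [MS(C) − MS(T*C) − MS(R*C) + MS(T*R*C)]/(trc) |

| DBM computed quantity | Equivalent function of OR computed quantities |
|---|---|
| MS(T) | = c MS(T)θ̂ij |
| MS(R) | = c MS(R)θ̂ij |
| MS(T*R) | = c MS(T*R)θ̂ij |
| MS(C) | = c[$\hat{\tilde{\sigma}}^2_{\varepsilon}$ + (t−1)$\widehat{\mathrm{Cov}}_1$ + (r−1)$\widehat{\mathrm{Cov}}_2$ + (t−1)(r−1)$\widehat{\mathrm{Cov}}_3$] |
| MS(T*C) | = c[$\hat{\tilde{\sigma}}^2_{\varepsilon}$ − $\widehat{\mathrm{Cov}}_1$ + (r−1)($\widehat{\mathrm{Cov}}_2$ − $\widehat{\mathrm{Cov}}_3$)] |
| MS(R*C) | = c[$\hat{\tilde{\sigma}}^2_{\varepsilon}$ − $\widehat{\mathrm{Cov}}_2$ + (t−1)($\widehat{\mathrm{Cov}}_1$ − $\widehat{\mathrm{Cov}}_3$)] |
| MS(T*R*C) | = c[$\hat{\tilde{\sigma}}^2_{\varepsilon}$ − $\widehat{\mathrm{Cov}}_1$ − $\widehat{\mathrm{Cov}}_2$ + $\widehat{\mathrm{Cov}}_3$] |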
These relationships assume one of the three conditions given in Table 3.
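The correspondence between DBM and OR computed quantities can be checked numerically. The sketch below converts pseudovalue mean squares to OR quantities and then evaluates the OR F statistic and ddfH. The conversion formulas are written out as assumptions: they are the ones implied by combining the variance-component estimates with the DBM-to-OR parameter relationships, and they reproduce the Example 1 covariance estimates, but the function and key names are hypothetical.

```python
def or_quantities_from_dbm(ms, t, r, c):
    """Convert DBM pseudovalue mean squares to OR quantities (assumed mapping).

    ms: dict of pseudovalue mean squares keyed by 'T', 'C', 'TR', 'TC', 'RC', 'TRC'.
    """
    trc = t * r * c
    return {
        "ms_t": ms["T"] / c,
        "ms_tr": ms["TR"] / c,
        "var_eps": (ms["C"] + (t - 1) * ms["TC"] + (r - 1) * ms["RC"]
                    + (t - 1) * (r - 1) * ms["TRC"]) / trc,
        "cov1": (ms["C"] - ms["TC"] + (r - 1) * (ms["RC"] - ms["TRC"])) / trc,
        "cov2": (ms["C"] - ms["RC"] + (t - 1) * (ms["TC"] - ms["TRC"])) / trc,
        "cov3": (ms["C"] - ms["TC"] - ms["RC"] + ms["TRC"]) / trc,
    }


def or_test(q, t, r):
    """Return (F_OR, ddf_H) computed from OR quantities."""
    denom = q["ms_tr"] + max(r * (q["cov2"] - q["cov3"]), 0.0)
    f_or = q["ms_t"] / denom
    ddf_h = denom ** 2 / (q["ms_tr"] ** 2 / ((t - 1) * (r - 1)))
    return f_or, ddf_h


# Normalized-pseudovalue mean squares from Table 9 (Example 1).
ms_normalized = {"T": 0.468996, "C": 0.392538, "TR": 0.108062,
                 "TC": 0.143095, "RC": 0.098771, "TRC": 0.072068}
or_q = or_quantities_from_dbm(ms_normalized, t=2, r=5, c=114)
f_or, ddf_h = or_test(or_q, t=2, r=5)
# f_or is approximately 2.619 and ddf_h approximately 10.99,
# the same values obtained from the DBM mean squares directly.
```

The agreement follows because MS(T*C) − MS(T*R*C) equals rc(Cov2 − Cov3) under this mapping, so the factor c cancels in both the F ratio and the ddf.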
The OR model is a satisfactory conceptual model since it is expressed in terms of meaningful reader-level accuracy outcomes (e.g., AUC values). In addition, the model assumptions are reasonable. The assumed independence of the reader effects follows from the independent selection of readers, and the assumption of independent test×reader interactions and equi-covariant errors allows for a fairly general covariance structure. Normality for the error terms is reasonable since typically there are many cases for each reader, and normality for the reader and test×reader effects is a typical assumption for generalizing from a sample to a population when we do not know the exact population distribution. Of course, these assumptions may not always hold, and topics for future research include the robustness of the DBM and OR procedures to violations of these assumptions and generalization of the procedures to accommodate less restrictive assumptions.
The equivalence of the DBM and OR procedures allows for interpretation of the DBM parameters in terms of the meaningful OR parameters. Table 5 shows the relationships between the DBM and OR parameters. We see that the DBM parameters μ, τi, $\sigma^2_R$, and $\sigma^2_{\tau R}$ have the same interpretation as the analogous OR parameters μ̃, τ̃i, $\tilde{\sigma}^2_R$ and $\tilde{\sigma}^2_{\tau R}$, while $\sigma^2_C$, $\sigma^2_{\tau C}$, $\sigma^2_{RC}$ and $\sigma^2_{\varepsilon}$ are equal to linear functions of $\tilde{\sigma}^2_{\varepsilon}$, Cov1, Cov2 and Cov3, and vice versa. For example, we see from Table 5 that $\sigma^2_{\tau C} = c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$; hence, setting $\sigma^2_{\tau C} = 0$, as is done with new model simplification when $\hat{\sigma}^2_{\tau C} \le 0$, is equivalent to assuming that Cov2 = Cov3, which is a reasonable assumption. On the other hand, we see that setting $\sigma^2_{\tau R} = 0$, as is done with original DBM when $\hat{\sigma}^2_{\tau R} \le 0$, is equivalent to assuming that the test×reader variance component of the OR model ($\tilde{\sigma}^2_{\tau R}$) is zero, implying that differences between tests are the same for all readers in the population. As mentioned earlier, this is an unreasonable assumption and is one reason why we no longer recommend original DBM.
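Table 5.

| OR model parameter | Equivalent function of DBM model parameters |
|---|---|
| μ̃ | = μ |
| τ̃i | = τi |
| $\tilde{\sigma}^2_R$ | = $\sigma^2_R$ |
| $\tilde{\sigma}^2_{\tau R}$ | = $\sigma^2_{\tau R}$ |
| $\tilde{\sigma}^2_{\varepsilon}$ | = ($\sigma^2_C + \sigma^2_{\tau C} + \sigma^2_{RC} + \sigma^2_{\varepsilon}$)/c |
| Cov1 | = ($\sigma^2_C + \sigma^2_{RC}$)/c |
| Cov2 | = ($\sigma^2_C + \sigma^2_{\tau C}$)/c |
| Cov3 | = $\sigma^2_C$/c |

| DBM model parameter | Equivalent function of OR model parameters |
|---|---|
| μ | = μ̃ |
| τi | = τ̃i |
| $\sigma^2_R$ | = $\tilde{\sigma}^2_R$ |
| $\sigma^2_{\tau R}$ | = $\tilde{\sigma}^2_{\tau R}$ |
| $\sigma^2_C$ | = c Cov3 |
| $\sigma^2_{\tau C}$ | = c(Cov2 − Cov3) |
| $\sigma^2_{RC}$ | = c(Cov1 − Cov3) |
| $\sigma^2_{\varepsilon}$ | = c($\tilde{\sigma}^2_{\varepsilon}$ − Cov1 − Cov2 + Cov3) |

These relationships assume that the constraints for the OR model parameters are those implied by the DBM model: $\tilde{\sigma}^2_{\varepsilon} \ge \mathrm{Cov}_1 + \mathrm{Cov}_2 - \mathrm{Cov}_3$, Cov1 ≥ Cov3, Cov2 ≥ Cov3, and Cov3 ≥ 0. They also assume the same linear constraint for the τi (e.g., Στi = 0) for both models and that either (1) normalized or quasi pseudovalues are used; or (2) if raw pseudovalues are used, then the OR model outcome is the jackknife accuracy estimate.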
Note: Adapted and reprinted, with permission, from Reference [8]
Other examples of interpreting functions of OR parameters are the following. The expected accuracy measure across readers for the ith test is given by μ + τi; the variance of the inherent (or latent) reader accuracy measure is given by $\tilde{\sigma}^2_R + \tilde{\sigma}^2_{\tau R}$, with $\tilde{\sigma}^2_R$ denoting the component due to the main effect of readers and $\tilde{\sigma}^2_{\tau R}$ the component due to test×reader interaction; the variance of the reader accuracy measure estimate is given by $\tilde{\sigma}^2_R + \tilde{\sigma}^2_{\tau R} + \tilde{\sigma}^2_{\varepsilon}$; and the measurement error variance that is attributable to cases and within-reader variability that describes how a reader interprets the same image in different ways on different occasions is given by $\tilde{\sigma}^2_{\varepsilon}$. The interpretations of Cov1, Cov2 and Cov3 have been discussed earlier. Various correlations are functions of the parameters. For example, define $\rho_{BR} = \mathrm{Cov}_2 / (\tilde{\sigma}^2_R + \tilde{\sigma}^2_{\tau R} + \tilde{\sigma}^2_{\varepsilon})$ and $\rho_{BR|readers} = \mathrm{Cov}_2 / \tilde{\sigma}^2_{\varepsilon}$; then ρBR is the correlation between AUC estimates for two different readers using the same test, and ρBR|readers is the analogous correlation but treating readers as fixed. See Appendix B for derivations of these last two correlations.
Formulas for computing the DBM variance components are presented in Table 6. Estimates for the OR variance components and covariances result from using Table 5 with the DBM variance components replaced by their estimates.
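Table 6.

| DBM model parameter | Estimate |
|---|---|
| $\sigma^2_R$ | [MS(R) − MS(T*R) − MS(R*C) + MS(T*R*C)]/(tc) |
| $\sigma^2_{\tau R}$ | [MS(T*R) − MS(T*R*C)]/c |
| $\sigma^2_C$ | [MS(C) − MS(T*C) − MS(R*C) + MS(T*R*C)]/(tr) |
| $\sigma^2_{\tau C}$ | [MS(T*C) − MS(T*R*C)]/r |
| $\sigma^2_{RC}$ | [MS(R*C) − MS(T*R*C)]/t |
| $\sigma^2_{\varepsilon}$ | MS(T*R*C) |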
Note: These estimates, except for the last, can be negative.
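As an illustration, the variance-component estimates can be computed with the conventional expected-mean-square (ANOVA) estimators for the fully crossed design with test fixed and reader and case random. The function and key names are hypothetical, and the estimator formulas are stated as assumptions. The example uses the normalized-pseudovalue mean squares from Example 1 (Table 9), and the OR covariances are obtained from the DBM components as described in the text.

```python
def dbm_variance_components(ms, t, r, c):
    """ANOVA estimates of the DBM variance components (all but 'eps' may be negative)."""
    return {
        "R": (ms["R"] - ms["TR"] - ms["RC"] + ms["TRC"]) / (t * c),
        "TR": (ms["TR"] - ms["TRC"]) / c,
        "C": (ms["C"] - ms["TC"] - ms["RC"] + ms["TRC"]) / (t * r),
        "TC": (ms["TC"] - ms["TRC"]) / r,
        "RC": (ms["RC"] - ms["TRC"]) / t,
        "eps": ms["TRC"],
    }


def or_parameters_from_dbm(vc, c):
    """Map DBM variance components to OR variance components and covariances."""
    return {
        "var_R": vc["R"],
        "var_TR": vc["TR"],
        "cov1": (vc["C"] + vc["RC"]) / c,
        "cov2": (vc["C"] + vc["TC"]) / c,
        "cov3": vc["C"] / c,
        "var_eps": (vc["C"] + vc["TC"] + vc["RC"] + vc["eps"]) / c,
    }


# Normalized-pseudovalue mean squares from Table 9 (Example 1).
ms_normalized = {"T": 0.468996, "R": 0.297310, "C": 0.392538, "TR": 0.108062,
                 "TC": 0.143095, "RC": 0.098771, "TRC": 0.072068}
vc = dbm_variance_components(ms_normalized, t=2, r=5, c=114)
or_params = or_parameters_from_dbm(vc, c=114)
```

The resulting values agree, to rounding, with the DBM and OR estimates reported for Example 1 in Table 10.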
Summary of related papers
The relationship between the DBM and OR methods is described by Hillis et al [8]. They generalize the DBM method, using new model simplification, to include the use of normalized and quasi pseudovalues and determine the conditions under which the DBM and OR methods produce equal test statistics. They also show how the DBM method can be used when readers are treated as fixed and show the relationship between the DBM and OR methods for fixed readers. Hillis and Berbaum [9] show empirically that new model simplification performs better than original DBM, as well as showing that use of normalized pseudovalues has little effect on the type I error compared to raw pseudovalues. Hillis [10] derives ddfH for both the DBM and OR procedures and empirically shows that new model simplification plus ddfH performs better than new model simplification. Hillis and Berbaum [14] show how to compute the power for the DBM method using new model simplification; updated power software using new model simplification plus ddfH can be downloaded from http://perception.radiology.uiowa.edu
Results
Simulation Study
In a simulation study we examined the performance of the three DBM approaches –original DBM, new model simplification, and new model simplification plus ddfH – with respect to the empirical type I error rate for testing the null hypothesis of no test effect. The simulation model of Roe and Metz [2] provided continuous decision-variable outcomes generated from a conventional binormal model that treats both cases and readers as random. We used this simulation model to create discrete rating data by computer simulation. The discrete rating data, taking integer values from one to five, were created by transforming the continuous outcomes using the cutpoints reported by Dorfman et al [3]. The combinations of reader and case sample sizes, AUC values, and variance components were the same as those used in Roe and Metz [2] and Dorfman et al [3]. Briefly, rating data were simulated for 144 combinations of three reader-sample sizes (readers = 3, 5, and 10); four case sample sizes (10+/90−, 25+/25−, 50+/50−, and 100+/100−, where “+” indicates a diseased case and “−” indicates a normal case); three AUC values (AUC = 0.702, 0.855, and 0.961) that describe the separation between the normal and diseased case populations, averaged across readers; and four combinations of reader and case variance components. Two thousand samples were generated for each of the 144 combinations; within each simulation, all Monte Carlo readers read the same cases for each of two equal tests.
The data from each simulated sample were analyzed by all three approaches. Both maximum likelihood (semiparametric) estimation assuming a latent binormal model [17, 18] and the trapezoidal-rule (nonparametric) method were used to estimate AUC from the 5-category discrete rating data. Analyses that employed semiparametric AUC estimation were performed using both raw and normalized pseudovalues, while for nonparametric AUC estimation no distinction was made since raw and normalized pseudovalues produce the same AUC estimates. For each of the 144 combinations, the empirical type I error rate was taken as the proportion of samples for which the null hypothesis was rejected at the alpha = 0.05 level. Data simulation was performed using the IML procedure in SAS [19]. The semiparametric AUC pseudovalues were computed using a dynamic link library (DLL), written in Fortran 90 by Don Dorfman and Kevin Schartz, that was accessed from within the IML procedure; this DLL, as well as a SAS macro that performs the different analyses used in this paper, can be downloaded from http://perception.radiology.uiowa.edu.
From the results, summarized in Tables 7 and 8, we draw the following conclusions. (1) New model simplification plus ddfH has the mean empirical type I error rate closest to the nominal 0.05 level: 0.049 (raw pseudovalues) and 0.051 (normalized pseudovalues) for semiparametric estimation, and 0.053 for nonparametric estimation. (2) Original DBM has the most conservative type I error rates: 0.036 (raw and normalized pseudovalues) for semiparametric estimation and 0.041 for nonparametric estimation. (3) New model simplification gives type I error rates midway between those obtained from the other two approaches. (4) With semiparametric estimation, the mean type I error rates for raw and normalized pseudovalues differ only slightly for each approach. (5) New model simplification confidence intervals can be extremely wide, due to a small proportion of samples where ddfD approaches zero [10]. We note that new model simplification plus ddfH does not have this problem, since ddfH is bounded below by (t−1)(r−1). (6) For semiparametric estimation using either original DBM or new model simplification plus ddfH, normalized pseudovalue confidence interval widths are 4% smaller, on average, than those for raw pseudovalues. For new model simplification the mean confidence interval width is about 32% smaller (Table 7), although here outliers are affecting the results as noted above. These results suggest that the original AUC estimator has more precision and power for semiparametric estimation than the jackknife AUC estimator.
Table 7.

Type I error rates (Mean, Min, Max, Range, SD) and mean confidence interval width

| Approach | Pseudovalues | N | Mean | Min | Max | Range | SD | CI width mean |
|---|---|---|---|---|---|---|---|---|
| Original | raw | 144 | 0.036 | 0.009 | 0.063 | 0.054 | 0.0124 | 0.196 |
| Original | normalized | 144 | 0.036 | 0.011 | 0.062 | 0.052 | 0.0111 | 0.188 |
| New | raw | 144 | 0.042 | 0.011 | 0.070 | 0.060 | 0.0123 | 4.05E+121 |
| New | normalized | 144 | 0.043 | 0.017 | 0.067 | 0.050 | 0.0108 | 2.74E+121 |
| New plus ddfH | raw | 144 | 0.049 | 0.016 | 0.075 | 0.060 | 0.0124 | 0.192 |
| New plus ddfH | normalized | 144 | 0.051 | 0.025 | 0.077 | 0.052 | 0.0105 | 0.184 |
Original: original DBM; New: new model simplification; New plus ddfH: new model simplification plus ddfH; Min: minimum; Max: maximum; SD: standard deviation; CI width: width of a 95% confidence interval for the difference of the AUC estimates.
Table 8.

Type I error rates (Mean, Min, Max, Range, SD) and mean confidence interval width

| Approach | N | Mean | Min | Max | Range | SD | CI width mean |
|---|---|---|---|---|---|---|---|
| original | 144 | 0.041 | 0.014 | 0.069 | 0.055 | 0.0098 | 0.177 |
| new | 144 | 0.046 | 0.024 | 0.072 | 0.049 | 0.0100 | 4.55E+121 |
| new plus ddfH | 144 | 0.053 | 0.029 | 0.079 | 0.050 | 0.0097 | 0.174 |
No distinction is made between raw and normalized pseudovalues since the trapezoid estimate is the same for either type of pseudovalues. Original: original DBM; new: new model simplification; new plus ddfH: new model simplification plus ddfH; min: minimum; max: maximum; SD: standard deviation; CI width: width of a 95% confidence interval for the difference of the AUC estimates.
Example 1: Spin-Echo versus CINE MRI for Detection of Aortic Dissection
The data for this example were provided by Carolyn Van Dyke, MD, who had obtained them in a study [20] that compared the relative performance of single Spin-Echo Magnetic Resonance Imaging (SE MRI) and CINE MRI in detecting thoracic aortic dissection. There were 45 patients with an aortic dissection and 69 patients without a dissection imaged with both SE MRI and CINE MRI. Five radiologists independently interpreted all of the images using a 5-point ordinal scale.
Table 9 presents the analysis results for raw and normalized pseudovalues obtained with semiparametric AUC estimation. We note that the jackknife and original semiparametric AUC estimates are similar, so there is little difference in the population estimates: the test AUC estimates based on the raw pseudovalues are 0.920 for CINE and 0.951 for Spin Echo, whereas the estimates based on normalized pseudovalues are 0.911 for CINE and 0.952 for Spin Echo. Since $\hat{\sigma}^2_{\tau R} > 0$ for both types of pseudovalues, original DBM and new model simplification yield the same results. For the normalized pseudovalues, Forig = FDBM = 2.619, ddforig = ddfD = 10.31 and p = 0.1358 in assessing the difference in AUC. (We note that results for this and the following example differ slightly from those in References [8, 9, 14] because we have used an updated AUC algorithm.) From equation (12) we have ddfH = 10.99, resulting in p = 0.1339 with new model simplification plus ddfH. Hence, the latter approach produces a slightly more significant result, illustrating a point made earlier: if $\hat{\sigma}^2_{\tau C} > 0$, then new model simplification plus ddfH will yield a more significant result than new model simplification, since ddfH > ddfD. We note that the raw pseudovalues analysis produced less significant results, with p = 0.2579 for new model simplification and p = 0.2563 for new model simplification plus ddfH.
Table 9.

Semiparametric and corresponding jackknife AUC estimates:

| Reader (j) | Test 1 (CINE): θ̂1j (semiparametric) | Test 1 (CINE): Ȳ1j· (jackknife) | Test 2 (Spin Echo): θ̂2j (semiparametric) | Test 2 (Spin Echo): Ȳ2j· (jackknife) |
|---|---|---|---|---|
| 1 | 0.933 | 0.947 | 0.951 | 0.950 |
| 2 | 0.890 | 0.909 | 0.935 | 0.933 |
| 3 | 0.929 | 0.929 | 0.928 | 0.928 |
| 4 | 0.970 | 0.981 | 1.000 | 0.999 |
| 5 | 0.833 | 0.836 | 0.945 | 0.943 |
| Mean | θ̂1· = 0.911 | Ȳ1·· = 0.920 | θ̂2· = 0.952 | Ȳ2·· = 0.951 |

ANOVA table:

| Source | df | Raw pseudovalue mean square | Normalized pseudovalue mean square |
|---|---|---|---|
| T | 1 | 0.264166 | 0.468996 |
| R | 4 | 0.315637 | 0.297310 |
| C | 113 | 0.392538 | 0.392538 |
| T×R | 4 | 0.112560 | 0.108062 |
| T×C | 113 | 0.143095 | 0.143095 |
| R×C | 452 | 0.098771 | 0.098771 |
| T×R×C | 452 | 0.072068 | 0.072068 |
T: tests; R: readers; C: cases.
Raw pseudovalues results:
Original DBM: Forig = 1.439, ddforig = 10.03, p = 0.2579
New model simplification: FDBM = 1.439, ddfD = 10.03, p = 0.2579
New model simplification plus ddfH: FDBM = 1.439, ddfH = 10.64, p = 0.2563
Normalized pseudovalues results:
Original DBM: Forig = 2.619, ddforig = 10.31, p = 0.1358
New model simplification: FDBM = 2.619, ddfD = 10.31, p = 0.1358
New model simplification plus ddfH: FDBM = 2.619, ddfH = 10.99, p = 0.1339
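The Table 9 results can be checked directly from the ANOVA mean squares. The following is a minimal sketch, not the authors' software. It assumes that, when both the T×R and T×C variance component estimates are positive, the F statistic, the Satterthwaite ddf (ddfD) and ddfH take the forms written in the comments; those equations are not reproduced in this section. The values t = 2, r = 5 and c = 114 are inferred from the degrees of freedom in the ANOVA table, and the function name `dbm_both_positive` is ours.

```python
def dbm_both_positive(ms_t, ms_tr, ms_tc, ms_trc, t, r, c):
    """F statistic, ddfD and ddfH when both the T×R and T×C variance
    component estimates are positive (assumed forms, see lead-in)."""
    # Denominator of the F statistic: MS(T*R) + MS(T*C) - MS(T*R*C)
    denom = ms_tr + ms_tc - ms_trc
    f_stat = ms_t / denom
    # Satterthwaite approximation for the denominator degrees of freedom
    ddf_d = denom ** 2 / (
        ms_tr ** 2 / ((t - 1) * (r - 1))
        + ms_tc ** 2 / ((t - 1) * (c - 1))
        + ms_trc ** 2 / ((t - 1) * (r - 1) * (c - 1))
    )
    # ddfH keeps only the MS(T*R) term in the Satterthwaite denominator
    ddf_h = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))
    return f_stat, ddf_d, ddf_h


# Mean squares copied from the ANOVA table in Table 9
normalized = dbm_both_positive(0.468996, 0.108062, 0.143095, 0.072068, t=2, r=5, c=114)
raw = dbm_both_positive(0.264166, 0.112560, 0.143095, 0.072068, t=2, r=5, c=114)

for label, (f_stat, ddf_d, ddf_h) in (("normalized", normalized), ("raw", raw)):
    print(f"{label}: F = {f_stat:.3f}, ddfD = {ddf_d:.2f}, ddfH = {ddf_h:.2f}")
```

Rounded to the precision of Table 9, this gives F = 2.619, ddfD = 10.31 and ddfH = 10.99 for the normalized pseudovalues, and F = 1.439, ddfD = 10.03 and ddfH = 10.64 for the raw pseudovalues.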
Table 10 presents the DBM and OR variance components obtained on the basis of normalized pseudovalues. The DBM variance components were computed using the equations in Table 6, whereas the OR variance components and covariances were computed by replacing the DBM variance components in Table 5 with their estimates. The OR parameter estimates allow us to make statements such as the following about the variability in the reader-level AUC outcomes. The estimated variance of the inherent reader accuracy measures is $\hat{\sigma}^2_{R} + \hat{\sigma}^2_{TR} = 0.000713 + 0.000316 = 0.001029$; thus, we estimate that, with probability 0.95, the inherent (or latent) AUC of a randomly selected reader lies within $1.96\sqrt{0.001029} \approx 0.063$ of the population test AUC. The estimated variance of the observed reader accuracy measures is $\hat{\sigma}^2_{R} + \hat{\sigma}^2_{TR} + \hat{\sigma}^2_{\varepsilon} = 0.002098$. The estimated measurement error variance due to cases and within-reader variability is $\hat{\sigma}^2_{\varepsilon} = 0.001069$. The estimated correlation between observed AUC values for a randomly selected reader reading the same cases in different modalities is given by $(\hat{\sigma}^2_{R} + \widehat{\mathrm{Cov}}_1)/(\hat{\sigma}^2_{R} + \hat{\sigma}^2_{TR} + \hat{\sigma}^2_{\varepsilon}) \approx 0.49$, and the analogous correlation for a given (or fixed) reader is $\widehat{\mathrm{Cov}}_1/\hat{\sigma}^2_{\varepsilon} \approx 0.29$.
Table 10.
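DBM variance component | DBM estimate | OR variance component | OR estimate |
---|---|---|---|
$\hat{\sigma}^2_{R}$ | 0.000713 | $\hat{\sigma}^2_{R}$ | 0.000713 |
$\hat{\sigma}^2_{TR}$ | 0.000316 | $\hat{\sigma}^2_{TR}$ | 0.000316 |
$\hat{\sigma}^2_{C}$ | 0.022274 | Cov1 | 0.000313 |
$\hat{\sigma}^2_{TC}$ | 0.014205 | Cov2 | 0.000320 |
$\hat{\sigma}^2_{RC}$ | 0.013351 | Cov3 | 0.000195 |
$\hat{\sigma}^2_{\varepsilon}$ | 0.072068 | $\hat{\sigma}^2_{\varepsilon}$ | 0.001069 |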
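The conversion from DBM to OR estimates in Table 10 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the DBM-to-OR relationships are taken to have the forms in the comments (Table 5 is not reproduced in this section, but these forms reproduce the OR column of Table 10), c = 114 cases is inferred from the case degrees of freedom in Table 9, and the dictionary keys and function name are ours.

```python
import math

c = 114  # number of cases (the df for C in Table 9 is 113)

# DBM variance component estimates from Table 10 (normalized pseudovalues)
dbm = {"R": 0.000713, "TR": 0.000316, "C": 0.022274,
       "TC": 0.014205, "RC": 0.013351, "error": 0.072068}


def dbm_to_or(dbm, c):
    """Map DBM variance components to OR variance components and covariances."""
    return {
        "R": dbm["R"],                           # reader variance is unchanged
        "TR": dbm["TR"],                         # test-by-reader variance is unchanged
        "Cov1": (dbm["C"] + dbm["RC"]) / c,      # same reader, different tests
        "Cov2": (dbm["C"] + dbm["TC"]) / c,      # different readers, same test
        "Cov3": dbm["C"] / c,                    # different readers, different tests
        "error": (dbm["C"] + dbm["TC"] + dbm["RC"] + dbm["error"]) / c,
    }


or_est = dbm_to_or(dbm, c)

# Reader-level summaries discussed in the text
latent_var = or_est["R"] + or_est["TR"]            # variance of inherent reader AUCs
half_width = 1.96 * math.sqrt(latent_var)          # 95% range around the test AUC
observed_var = latent_var + or_est["error"]        # variance of observed reader AUCs
corr_random_reader = (or_est["R"] + or_est["Cov1"]) / observed_var
corr_fixed_reader = or_est["Cov1"] / or_est["error"]

for name, value in or_est.items():
    print(f"{name}: {value:.6f}")
print(f"latent variance = {latent_var:.6f}, 95% half-width = {half_width:.3f}")
print(f"correlation (random reader) = {corr_random_reader:.2f}")
print(f"correlation (fixed reader) = {corr_fixed_reader:.2f}")
```

Because the inputs are the rounded values printed in Table 10, the derived quantities agree with the table only to about the last printed digit.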
Example 2: Picture archiving and communication system versus plain film interpretation of neonatal examinations
Franken et al [21] compared the diagnostic accuracy of interpreting clinical neonatal radiographs using a picture archiving and communication system (PACS) workstation versus plain film. The case sample consisted of 100 chest or abdominal radiographs (67 abnormal and 33 normal). The readers were four radiologists with considerable experience in interpreting neonatal examinations. The readers indicated whether each patient had normal or abnormal findings and their degree of confidence in this judgment using a five-point ordinal scale.
Table 11 presents the ANOVA tables for the raw and normalized pseudovalues using semiparametric AUC estimation. For either type of pseudovalue we have MS(T*R) < MS(T*R*C) and MS(T*C) < MS(T*R*C); thus $\hat{\sigma}^2_{TR} < 0$ and $\hat{\sigma}^2_{TC} < 0$ from equation (5). Hence for original DBM we assume $\sigma^2_{TR} = \sigma^2_{TC} = 0$ and use MS(T*R*C) as the denominator for Forig with ddforig = (t−1)(r−1)(c−1) = 297; in contrast, for new model simplification and new model simplification plus ddfH we only assume $\sigma^2_{TC} = 0$ and use MS(T*R) as the denominator for FDBM with ddfD = ddfH = (t−1)(r−1) = 3. Using the normalized pseudovalues with original DBM yields Forig = 0.796, ddforig = 297 and p = 0.3729, while new model simplification and new model simplification plus ddfH yield FDBM = 8.888, ddfD = ddfH = 3 and p = 0.0585. The raw pseudovalues analysis produces less significant results, with p = 0.0647 for both new model simplification and new model simplification plus ddfH.
Table 11.
ANOVA table:
Source | df | Raw pseudovalue mean square | Normalized pseudovalue mean square |
---|---|---|---|
T | 1 | 0.063574 | 0.066606 |
R | 3 | 0.088782 | 0.097686 |
C | 99 | 0.547734 | 0.547734 |
T×R | 3 | 0.007781 | 0.007494 |
T×C | 99 | 0.078071 | 0.078071 |
R×C | 297 | 0.127582 | 0.127582 |
T×R×C | 297 | 0.083643 | 0.083643 |
T: tests; R: readers; C: cases.
Raw pseudovalues results:
Original DBM: Forig = 0.760, ddforig = 297, p = 0.3840
New model simplification: FDBM = 8.171, ddfD = 3, p = 0.0647
New model simplification plus ddfH: FDBM = 8.171, ddfH = 3, p = 0.0647
Normalized pseudovalues results:
Original DBM: Forig = 0.796, ddforig = 297, p = 0.3729
New model simplification: FDBM = 8.888, ddfD = 3, p = 0.0585
New model simplification plus ddfH: FDBM = 8.888, ddfH = 3, p = 0.0585
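The contrast between the two denominators in Example 2 can be reproduced numerically. The sketch below is an illustration only, using the normalized pseudovalue mean squares from Table 11 with t = 2, r = 4 and c = 100. To avoid any external dependency it computes the upper-tail F probability from the regularized incomplete beta function, evaluated with a standard continued fraction; the helper names `_betacf`, `betainc_reg` and `f_upper_tail` are ours and are not part of the DBM software.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered coefficient of the continued fraction
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered coefficient of the continued fraction
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_upper_tail(f_stat, df1, df2):
    """P(F > f_stat) for an F distribution with (df1, df2) degrees of freedom."""
    return betainc_reg(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))


t, r, c = 2, 4, 100
ms_t, ms_tr, ms_trc = 0.066606, 0.007494, 0.083643  # normalized pseudovalues

# Original DBM: both interaction terms dropped, MS(T*R*C) is the denominator
f_orig, ddf_orig = ms_t / ms_trc, (t - 1) * (r - 1) * (c - 1)
# New model simplification: only the T×C term dropped, MS(T*R) is the denominator
f_dbm, ddf_d = ms_t / ms_tr, (t - 1) * (r - 1)

p_orig = f_upper_tail(f_orig, t - 1, ddf_orig)
p_dbm = f_upper_tail(f_dbm, t - 1, ddf_d)
print(f"original DBM: F = {f_orig:.3f}, ddf = {ddf_orig}, p = {p_orig:.4f}")
print(f"new model simplification: F = {f_dbm:.3f}, ddf = {ddf_d}, p = {p_dbm:.4f}")
```

The much smaller denominator MS(T*R) raises the F statistic roughly elevenfold, but the drop from 297 to 3 denominator degrees of freedom keeps the p-value just above 0.05.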
Discussion
We have summarized recently proposed solutions for the various drawbacks of the original DBM method and examined the performance of these solutions in a simulation study. The solutions include using normalized pseudovalues, which allow DBM results to be based on either the original or the jackknife accuracy estimates; using less data-based model reduction and ddfH to make DBM less conservative, with a type I error rate much closer to the nominal level; and showing that the DBM model can be viewed as a “working” model that produces the same inferences as those obtained using the conceptually and theoretically acceptable OR model. This last solution is especially important, since it establishes a solid theoretical justification for using DBM, allows us to make meaningful statements about the variability and covariances of the accuracy estimates by computing OR model parameter estimates from the DBM model parameter estimates, and allows for generalization in future research. Thus we recommend the revised DBM procedure (“new model simplification plus ddfH”) that incorporates these recent developments. Stand-alone software as well as a SAS macro that incorporate these modifications are available to the public [22–24].
The DBM and OR approaches complement each other. We can think of each approach as consisting of a model and a procedure, where procedure denotes the computational algorithm steps and model denotes the statistical model used to motivate the procedure and justify inferences. The OR model is conceptually and theoretically more acceptable. However, the DBM procedure is easier to implement, because after computing the pseudovalues (for each test-reader combination) the F statistic is easily obtained by subjecting the pseudovalues to a conventional 3-way ANOVA analysis. Furthermore, the DBM model, though not statistically acceptable, makes the DBM procedure easier to initially comprehend, especially for users without an extensive statistical background.
Finally, we note that the choice between using the original or corresponding jackknife AUC estimator should depend on which estimator has superior performance properties. For the trapezoidal method AUC this is not an issue, since the original and jackknife estimates are equal; however, for semiparametric estimation our simulation study and examples (both examples had a smaller p value using normalized pseudovalues) suggest that the original estimator has higher precision and power.
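The equality of the original and jackknife estimates for the trapezoidal AUC can be illustrated with a small sketch. The ratings below are made-up illustrative data, not from either example, and the function names are ours. The pseudovalue for case k is taken to be c·θ̂ − (c − 1)·θ̂(k), where θ̂(k) is the AUC with case k removed, and the normalized pseudovalue is assumed to add the difference between the original estimate and the mean raw pseudovalue.

```python
from itertools import product


def trapezoidal_auc(normal, abnormal):
    """Empirical (trapezoidal) AUC: P(abnormal > normal) + 0.5 * P(tie)."""
    score = 0.0
    for x, y in product(normal, abnormal):
        score += 1.0 if y > x else 0.5 if y == x else 0.0
    return score / (len(normal) * len(abnormal))


def jackknife_pseudovalues(normal, abnormal):
    """Return the full-sample AUC and one raw pseudovalue per case."""
    c = len(normal) + len(abnormal)
    full = trapezoidal_auc(normal, abnormal)
    pseudo = []
    # Leave out each normal case in turn
    for k in range(len(normal)):
        loo = trapezoidal_auc(normal[:k] + normal[k + 1:], abnormal)
        pseudo.append(c * full - (c - 1) * loo)
    # Leave out each abnormal case in turn
    for k in range(len(abnormal)):
        loo = trapezoidal_auc(normal, abnormal[:k] + abnormal[k + 1:])
        pseudo.append(c * full - (c - 1) * loo)
    return full, pseudo


# Hypothetical five-point confidence ratings for one test-reader combination
normal = [1, 2, 2, 3, 1, 4]
abnormal = [3, 4, 5, 2, 5, 4, 3]

auc, pseudo = jackknife_pseudovalues(normal, abnormal)
jackknife_auc = sum(pseudo) / len(pseudo)

# Normalized pseudovalues are shifted so that their mean equals the original estimate
normalized = [y + (auc - jackknife_auc) for y in pseudo]

print(f"original AUC  = {auc:.6f}")
print(f"jackknife AUC = {jackknife_auc:.6f}")
```

For the trapezoidal AUC the shift is zero up to rounding error, so raw and normalized pseudovalues coincide; for a semiparametric estimator the two means generally differ, which is why the choice matters there.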
Acknowledgments
The authors thank Carolyn Van Dyke, M.D. for sharing her data set. This research was supported by the National Institutes of Health, grant R01EB000863. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.
APPENDIX A
In this section we derive the relationships given in Table 2 between Forig and FDBM, as defined by equations (6) and (8), respectively, and between ddforig, ddfD, and ddfH, as defined by equations (7), (10), and (11), respectively. We do this for the four possible situations corresponding to the test-by-reader and test-by-case variance component estimates being either positive or nonpositive. We make the reasonable assumptions that none of the mean squares are zero (and hence must be positive) and that the number of cases exceeds two (c>2).
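First we derive the relationship between ddfD and ddfH. If $\hat{\sigma}^2_{TC} > 0$ then

$$\mathrm{ddf}_D = \frac{[\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)]^2}{\dfrac{\mathrm{MS}(T{*}R)^2}{(t-1)(r-1)} + \dfrac{\mathrm{MS}(T{*}C)^2}{(t-1)(c-1)} + \dfrac{\mathrm{MS}(T{*}R{*}C)^2}{(t-1)(r-1)(c-1)}} < \frac{[\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)]^2}{\dfrac{\mathrm{MS}(T{*}R)^2}{(t-1)(r-1)}} = \mathrm{ddf}_H .$$

If $\hat{\sigma}^2_{TC} \le 0$ then

$$\mathrm{ddf}_D = (t-1)(r-1) = \frac{\mathrm{MS}(T{*}R)^2}{\mathrm{MS}(T{*}R)^2 / [(t-1)(r-1)]} = \mathrm{ddf}_H .$$

Thus ddfD < ddfH if $\hat{\sigma}^2_{TC} > 0$ and ddfD = ddfH if $\hat{\sigma}^2_{TC} \le 0$. These relationships hold regardless of the value of $\hat{\sigma}^2_{TR}$. Now we consider each of the four situations separately for the other relationships.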
Situation 1
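$\hat{\sigma}^2_{TR} > 0$, $\hat{\sigma}^2_{TC} > 0$. For this situation we have

$$F_{orig} = F_{DBM} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)}, \qquad \mathrm{ddf}_{orig} = \mathrm{ddf}_D .$$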
Situation 2
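$\hat{\sigma}^2_{TR} \le 0$, $\hat{\sigma}^2_{TC} > 0$. From equation (5) we have $\mathrm{MS}(T{*}R) \le \mathrm{MS}(T{*}R{*}C)$. Hence

$$F_{orig} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}C)} \le \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)} = F_{DBM},$$

with Forig = FDBM if and only if $\hat{\sigma}^2_{TR} = 0$. Also,

$$\mathrm{ddf}_D = \frac{[\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)]^2}{\dfrac{\mathrm{MS}(T{*}R)^2}{(t-1)(r-1)} + \dfrac{\mathrm{MS}(T{*}C)^2}{(t-1)(c-1)} + \dfrac{\mathrm{MS}(T{*}R{*}C)^2}{(t-1)(r-1)(c-1)}} < \frac{[\mathrm{MS}(T{*}R) + \mathrm{MS}(T{*}C) - \mathrm{MS}(T{*}R{*}C)]^2}{\mathrm{MS}(T{*}C)^2 / [(t-1)(c-1)]} \le (t-1)(c-1) = \mathrm{ddf}_{orig} .$$

That is, ddfD < ddforig. In the proof we have utilized the relationship MS(T*R) − MS(T*R*C) + MS(T*C) > 0, since from equation (5) we have MS(T*C) − MS(T*R*C) > 0 when $\hat{\sigma}^2_{TC} > 0$.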
Situation 3
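$\hat{\sigma}^2_{TR} > 0$, $\hat{\sigma}^2_{TC} \le 0$. For this situation we have

$$F_{orig} = F_{DBM} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}R)}, \qquad \mathrm{ddf}_{orig} = \mathrm{ddf}_D = \mathrm{ddf}_H = (t-1)(r-1) .$$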
Situation 4
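$\hat{\sigma}^2_{TR} \le 0$, $\hat{\sigma}^2_{TC} \le 0$. From equation (5) it follows that MS(T*R) ≤ MS(T*R*C), with equality if and only if $\hat{\sigma}^2_{TR} = 0$. Thus

$$F_{orig} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}R{*}C)} \le \frac{\mathrm{MS}(T)}{\mathrm{MS}(T{*}R)} = F_{DBM},$$

with equality if and only if $\hat{\sigma}^2_{TR} = 0$. Also,

$$\mathrm{ddf}_D = \mathrm{ddf}_H = (t-1)(r-1) < (t-1)(r-1)(c-1) = \mathrm{ddf}_{orig} .$$

Note that we require the assumption that c > 2 for this last relationship.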
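The relationships derived above can also be checked numerically. The sketch below is a sanity check under stated assumptions, not a proof and not the authors' software. It assumes that original DBM drops every test interaction term whose variance component estimate is nonpositive, that new model simplification drops only the T×C term, and that the ddf formulas have the forms used in this appendix. The function name `dbm_summary` and the random ranges are ours.

```python
import random


def dbm_summary(ms_t, ms_tr, ms_tc, ms_trc, t, r, c):
    """Return (F_orig, ddf_orig, F_DBM, ddf_D, ddf_H) from the ANOVA mean squares."""
    tr_pos = ms_tr > ms_trc  # T×R variance component estimate is positive
    tc_pos = ms_tc > ms_trc  # T×C variance component estimate is positive
    df_tr = (t - 1) * (r - 1)
    df_tc = (t - 1) * (c - 1)
    df_trc = (t - 1) * (r - 1) * (c - 1)
    full = ms_tr + (ms_tc - ms_trc)

    def satterthwaite():
        return full ** 2 / (ms_tr ** 2 / df_tr + ms_tc ** 2 / df_tc
                            + ms_trc ** 2 / df_trc)

    # Original DBM: data-based reduction on both interaction terms
    if tr_pos and tc_pos:
        f_orig, ddf_orig = ms_t / full, satterthwaite()
    elif tc_pos:
        f_orig, ddf_orig = ms_t / ms_tc, df_tc
    elif tr_pos:
        f_orig, ddf_orig = ms_t / ms_tr, df_tr
    else:
        f_orig, ddf_orig = ms_t / ms_trc, df_trc

    # New model simplification: the T×R term is always retained
    denom = ms_tr + max(ms_tc - ms_trc, 0.0)
    f_dbm = ms_t / denom
    ddf_d = satterthwaite() if tc_pos else df_tr
    ddf_h = denom ** 2 / (ms_tr ** 2 / df_tr)
    return f_orig, ddf_orig, f_dbm, ddf_d, ddf_h


rng = random.Random(0)
tol = 1.0 + 1e-9  # allowance for floating-point rounding
situations_seen = set()
all_hold = True
for _ in range(10000):
    t, r, c = rng.randint(2, 4), rng.randint(2, 10), rng.randint(3, 200)
    ms_t, ms_tr, ms_tc, ms_trc = (rng.uniform(0.01, 1.0) for _ in range(4))
    f_orig, ddf_orig, f_dbm, ddf_d, ddf_h = dbm_summary(ms_t, ms_tr, ms_tc, ms_trc, t, r, c)
    situations_seen.add((ms_tr > ms_trc, ms_tc > ms_trc))
    # Relationships derived in this appendix
    all_hold &= f_orig <= f_dbm * tol
    all_hold &= ddf_d <= ddf_h * tol
    all_hold &= ddf_d <= ddf_orig * tol

print("all relationships hold:", all_hold)
print("situations covered:", len(situations_seen))
```

In every situation the revised procedure has an F statistic at least as large as the original, and ddfD never exceeds either ddfH or ddforig, which matches the four cases worked through above.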
APPENDIX B
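In this section we show how to derive AUC correlations assuming the OR model (13). Let $\hat{\theta}_{ij}$ and $\hat{\theta}_{i'j'}$ denote two AUC estimates, with the first subscript denoting test and the second reader. Their correlation is defined by

$$\mathrm{corr}(\hat{\theta}_{ij}, \hat{\theta}_{i'j'}) = \frac{\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{i'j'})}{\sqrt{\mathrm{var}(\hat{\theta}_{ij})\,\mathrm{var}(\hat{\theta}_{i'j'})}}$$

where $\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{i'j'})$ is the covariance. To find the covariance and variances, we write $\hat{\theta}_{ij}$ and $\hat{\theta}_{i'j'}$ as functions of random and fixed effects using the OR model (13). It follows from well known statistical properties that the variance for each AUC estimate is the sum of the OR model variance components corresponding to the random effects, and that $\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{i'j'})$ is the sum of the variance components corresponding to the reader or test×reader random effects that the AUC estimates have in common (i.e., effects with the same subscript values in both AUC estimates), plus the covariance between the error terms.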
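For example, the between-reader correlation between AUC estimates for two different readers using the same test is given by

$$\rho_{BR} = \mathrm{corr}(\hat{\theta}_{ij}, \hat{\theta}_{ij'}) = \frac{\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{ij'})}{\sqrt{\mathrm{var}(\hat{\theta}_{ij})\,\mathrm{var}(\hat{\theta}_{ij'})}} \tag{17}$$

where j≠j′. From equation (13), applied first to $\hat{\theta}_{ij}$ and then with $\hat{\theta}_{ij'}$ taking the place of $\hat{\theta}_{ij}$, we have

$$\hat{\theta}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \varepsilon_{ij}, \qquad \hat{\theta}_{ij'} = \mu + \tau_i + R_{j'} + (\tau R)_{ij'} + \varepsilon_{ij'} \tag{18}$$

Each AUC estimate has the same variance, equal to the sum of all of the variance components corresponding to the random effects; that is,

$$\mathrm{var}(\hat{\theta}_{ij}) = \mathrm{var}(\hat{\theta}_{ij'}) = \sigma^2_{R} + \sigma^2_{TR} + \sigma^2_{\varepsilon} .$$

Examination of equations (18) shows that the AUCs do not have any reader or test×reader random effects in common since j≠j′. Thus the covariance is equal to Cov2, the covariance between the error terms for different readers using the same test:

$$\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{ij'}) = \mathrm{cov}(\varepsilon_{ij}, \varepsilon_{ij'}) = \mathrm{Cov}_2 \tag{19}$$

It follows from equations (17), (18) and (19) that

$$\rho_{BR} = \frac{\mathrm{Cov}_2}{\sigma^2_{R} + \sigma^2_{TR} + \sigma^2_{\varepsilon}} .$$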
Now we derive the between-reader correlation between AUC estimates for two different readers using the same test, but this time treating readers as fixed. In this case the correlation is a measure of the association between the deviation of one reader’s AUC estimate from that reader’s underlying AUC, due to case variation and reader error, with the deviation of the other reader’s AUC estimate from that reader’s underlying AUC. In contrast, ρBR is a measure of association between deviations of randomly chosen readers’ AUC estimates from the reader population AUC.
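To derive this correlation we treat the reader and test×reader effects as fixed in model (13) by conditioning on them; thus these effects do not have corresponding variance components, but rather are treated like constants. We denote this correlation by ρBR|readers to indicate that it is for two fixed readers. The correlation is defined as before, except now the covariance and variances are conditional on the reader and test×reader random effects:

$$\rho_{BR|readers} = \frac{\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{ij'} \mid R_j, R_{j'}, (\tau R)_{ij}, (\tau R)_{ij'})}{\sqrt{\mathrm{var}(\hat{\theta}_{ij} \mid R_j, (\tau R)_{ij})\,\mathrm{var}(\hat{\theta}_{ij'} \mid R_{j'}, (\tau R)_{ij'})}} \tag{20}$$

When we condition on the reader and test×reader random effects, the only random effects in equations (18) are the error terms. Thus each AUC has the same variance, equal to $\sigma^2_{\varepsilon}$:

$$\mathrm{var}(\hat{\theta}_{ij} \mid R_j, (\tau R)_{ij}) = \mathrm{var}(\hat{\theta}_{ij'} \mid R_{j'}, (\tau R)_{ij'}) = \sigma^2_{\varepsilon} \tag{21}$$

Similarly, the covariance is equal to Cov2, the covariance between the error terms:

$$\mathrm{cov}(\hat{\theta}_{ij}, \hat{\theta}_{ij'} \mid R_j, R_{j'}, (\tau R)_{ij}, (\tau R)_{ij'}) = \mathrm{cov}(\varepsilon_{ij}, \varepsilon_{ij'}) = \mathrm{Cov}_2 \tag{22}$$

It follows from equations (20), (21) and (22) that

$$\rho_{BR|readers} = \frac{\mathrm{Cov}_2}{\sigma^2_{\varepsilon}} .$$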
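These correlations can be written in terms of the DBM model parameters using the relationships in Table 5. For example, since $\sigma^2_{\varepsilon} = (\sigma^2_{C} + \sigma^2_{TC} + \sigma^2_{RC} + \sigma^2_{\varepsilon(\mathrm{DBM})})/c$ and $\mathrm{Cov}_2 = (\sigma^2_{C} + \sigma^2_{TC})/c$, where $\sigma^2_{C}$, $\sigma^2_{TC}$, $\sigma^2_{RC}$ and $\sigma^2_{\varepsilon(\mathrm{DBM})}$ denote the DBM model variance components, then $\rho_{BR|readers} = (\sigma^2_{C} + \sigma^2_{TC})/(\sigma^2_{C} + \sigma^2_{TC} + \sigma^2_{RC} + \sigma^2_{\varepsilon(\mathrm{DBM})})$ in terms of the DBM variance components. This last expression is also given in equation (4) of Reference [2].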
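As a worked illustration of the two between-reader correlations, the sketch below evaluates them with the Example 1 estimates from Table 10. It also checks that the OR form and the DBM form of the fixed-reader correlation agree. It is an illustration only: the variable names are ours, the DBM-to-OR relationships are assumed to have the forms that reproduce Table 10, and the inputs are the rounded values printed in that table, so agreement is approximate.

```python
# OR estimates from Table 10 (normalized pseudovalues)
or_est = {"R": 0.000713, "TR": 0.000316, "Cov2": 0.000320, "error": 0.001069}
# DBM estimates from Table 10
dbm = {"C": 0.022274, "TC": 0.014205, "RC": 0.013351, "error": 0.072068}

# Readers treated as random: the variance includes reader and test-by-reader components
rho_br = or_est["Cov2"] / (or_est["R"] + or_est["TR"] + or_est["error"])

# Readers treated as fixed: only the error variance remains
rho_br_fixed = or_est["Cov2"] / or_est["error"]

# The same fixed-reader correlation written with DBM variance components;
# the number of cases c cancels between numerator and denominator
rho_br_fixed_dbm = (dbm["C"] + dbm["TC"]) / (
    dbm["C"] + dbm["TC"] + dbm["RC"] + dbm["error"]
)

print(f"rho_BR (random readers)          = {rho_br:.3f}")
print(f"rho_BR|readers (OR components)   = {rho_br_fixed:.3f}")
print(f"rho_BR|readers (DBM components)  = {rho_br_fixed_dbm:.3f}")
```

The fixed-reader correlation is larger than the random-reader correlation because conditioning on the reader effects removes the reader and test×reader variance from the denominator while leaving the covariance unchanged.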
Contributor Information
Stephen L. Hillis, Center for Research in the Implementation of Innovative Strategies in Practice (CRIISP) Iowa City VA Medical Center, Iowa City, IA, U.S.A. Department of Biostatistics, University of Iowa, Iowa City, IA, U.S.A
Kevin S. Berbaum, Department of Radiology, University of Iowa, Iowa City, IA, U.S.A
Charles E. Metz, Department of Radiology, University of Chicago Medical Center, Chicago, IL, U.S.A
References
- 1. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
- 2. Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Academic Radiology. 1997;4:298–303. doi: 10.1016/s1076-6332(97)80032-3.
- 3. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi: 10.1016/s1076-6332(98)80294-8.
- 4. Quenouille MH. Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B. 1949;11:68–84.
- 5. Quenouille MH. Notes on bias in estimation. Biometrika. 1956;43:353–360.
- 6. Tukey JW. Bias and confidence in not quite large samples (abstract). Annals of Mathematical Statistics. 1958;29:614.
- 7. Berbaum KS. God, like the devil, is in the details. Academic Radiology. 2006;13:1311–1316. doi: 10.1016/j.acra.2006.09.053.
- 8. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi: 10.1002/sim.2024.
- 9. Hillis SL, Berbaum KS. Monte Carlo validation of the Dorfman-Berbaum-Metz method using normalized pseudovalues and less data-based model simplification. Academic Radiology. 2005;12:1534–1542. doi: 10.1016/j.acra.2005.07.012.
- 10. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine. 2007;26:596–619. doi: 10.1002/sim.2532.
- 11. Satterthwaite FE. Synthesis of variance. Psychometrika. 1941;6:309–316.
- 12. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometric Bulletin. 1946;2:110–114.
- 13. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
- 14. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Academic Radiology. 2004;11:1260–1273. doi: 10.1016/j.acra.2004.08.009.
- 15. Obuchowski NA, Rockette HE. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Communications in Statistics-Simulation and Computation. 1995;24:285–308.
- 16. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Academic Radiology. 1995;2(Suppl 1):S22–S29.
- 17. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal-detection theory and determination of confidence intervals: rating method data. Journal of Mathematical Psychology. 1969;6:487–496.
- 18. Dorfman DD. RSCORE II. In: Swets JA, Pickett RM, editors. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press; San Diego, CA: 1982. pp. 212–232.
- 19. The SAS System for Windows, Version 9.1. SAS Institute Inc; Cary, NC: 2002.
- 20. Van Dyke CW, White RD, Obuchowski NA, Geisinger MA, Lorig RJ, Meziane MA. Cine MRI in the diagnosis of thoracic aortic dissection; 79th RSNA Meetings; Chicago, IL. 1993.
- 21. Franken EA Jr, Berbaum KS, Marley SM, Smith WL, Sato Y, Kao SC, Milam SG. Evaluation of a digital workstation for interpreting neonatal examinations: a receiver operating characteristic study. Investigative Radiology. 1992;27:732–737. doi: 10.1097/00004424-199209000-00016.
- 22. Berbaum KS, Schartz KM, Pesce LL, Hillis SL. DBM MRMC 2.1 (Computer software). 2006. Available for download from http://perception.radiology.uiowa.edu.
- 23. Berbaum KS, Metz CE, Pesce LL, Schartz KM. DBM MRMC 2.1 User’s Guide (Software manual). 2006. Available for download from http://perception.radiology.uiowa.edu.
- 24. Hillis SL, Schartz KM, Pesce LL, Berbaum KS. DBM MRMC 2.1 for SAS (Computer software). 2007. Available for download from http://perception.radiology.uiowa.edu.