Summary
Receiver operating characteristic (ROC) analysis is widely used to evaluate the performance of diagnostic tests with continuous or ordinal responses. A popular study design for assessing the accuracy of diagnostic tests involves multiple readers interpreting multiple diagnostic test results, called the multi-reader, multi-test design. Although several different approaches to analyzing data from this design exist, few methods have discussed the sample size and power issues. In this article, we develop a power formula to compare the correlated areas under the ROC curves (AUC) in a multi-reader, multi-test design. We present a nonparametric approach to estimate and compare the correlated AUCs by extending DeLong et al.’s (1988) approach. A power formula is derived based on the asymptotic distribution of the nonparametric AUCs. Simulation studies are conducted to demonstrate the performance of the proposed power formula and an example is provided to illustrate the proposed procedure.
Keywords: Receiver operating characteristic curve, multi-reader, multi-test design, power, sample size, U-statistics
1. Introduction
The receiver operating characteristic (ROC) curve is a standard tool used to evaluate the performance of a diagnostic test when test results are continuous or ordinal (Metz, 1978; Hanley and McNeil, 1982; Swets and Pickett, 1982). In an ROC curve, the true positive rate is plotted as a function of the false positive rate across all possible cut-points. The area under the ROC curve (AUC) is a commonly used summary measure of diagnostic accuracy. Values of the AUC close to 1.0 indicate that the test has high diagnostic accuracy, and the relative accuracies of diagnostic tests can be compared through their corresponding AUCs.
One important objective in many diagnostic studies is to examine whether new diagnostic tests provide performance superior to that of conventional tests for a certain condition or disease. The comparison of diagnostic accuracies often depends on the subjective interpretation of readers, so studies of such diagnostic tests usually involve multiple readers. The multi-reader, multi-test design is most commonly employed in such settings: multiple readers interpret all test results from a sample of patients who undergo multiple diagnostic tests. This design is efficient for comparing tests because it requires the smallest patient sample and hence fewer interpretations per reader than other study designs (Zhou et al., 2002). Sample size and power calculations are crucial in planning a multi-reader, multi-test ROC study. Although several statistical procedures to analyze data from this design have been developed, few address power considerations.
Most of the existing literature for analyzing multi-reader, multi-test studies has applied mixed-effects analysis of variance (ANOVA) models (e.g., Dorfman et al., 1992; Obuchowski et al., 1995; Beiden et al., 2000; Obuchowski et al., 2004). In particular, the methods proposed by Dorfman et al. (1992) and Obuchowski and Rockette (1995) are widely used and are often referred to as the Dorfman-Berbaum-Metz (DBM) and Obuchowski-Rockette (OR) methods, respectively. The DBM method fits a mixed-effects ANOVA model to the jackknife pseudo-values of the summary measures of the ROC curve. It treats readers and patients as random factors and tests as a fixed factor, and assumes that the random effects and the error term in the model follow independent normal distributions. The DBM approach raises some concerns (Zhou et al., 2002; Hillis et al., 2005; Song and Zhou, 2005). One weakness is that the ANOVA model for pseudo-values does not have a straightforward interpretation, as the jackknife pseudo-values are treated as observed data. Second, the DBM method does not generally satisfy the usual assumptions of standard mixed-effects ANOVA models because the variance of the response variable may vary across tests and subjects, which might lead to erroneous inferences. Furthermore, it is substantially conservative and is not based on a satisfactory conceptual or theoretical model. Recently, solutions to various drawbacks of the original DBM method have been discussed in the literature (Hillis et al., 2005; Hillis, 2007; Hillis et al., 2008).
On the other hand, the OR method applies a mixed-effects ANOVA model to the estimated summary measures of the ROC curve for each combination of reader and test, where tests are considered fixed and readers are considered random. For hypothesis testing, an adjusted ANOVA F-test is used to correct for the correlations between and within readers. The OR approach also relies on strong assumptions. First, the validity of the method depends on the assumptions about the underlying distributions of the random variables. Second, the complicated correlation structure arising from having the same patient sample evaluated by several readers in a set of tests is reduced to only three correlations. Furthermore, it is not clear how well the adjusted F statistic follows an F distribution, especially in small samples.
Hillis et al. (2005) demonstrated that the DBM and OR approaches yield identical test statistics when the same accuracy measure and covariance estimation methods are used, but inferences depend on which denominator degrees of freedom (ddf) method, DBM or OR, is used. Hillis (2007) later pointed out problems with the OR and DBM ddf methods: the original OR method is very conservative, with significance levels considerably below the nominal level, while the DBM method can result in extremely wide confidence intervals because the ddf can be close to zero. He proposed a new ddf estimator that overcomes these problems and described how the new ddf can be used with either the DBM or the OR procedure.
While the OR and DBM methods make use of the mixed-effects ANOVA model, several nonparametric approaches to multi-reader ROC analysis have been developed (e.g., DeLong et al., 1988; Song, 1997; Gallas, 2006). Above all, DeLong et al. (1988) proposed a nonparametric approach to comparing correlated ROC areas using the theory of U-statistics. Their method, however, applies only to cases in which each patient is interpreted using multiple tests or repeatedly examined using a single test; thus, it is not appropriate for data from a multi-reader, multi-test study design. Song (1997) generalized DeLong et al.'s method for the analysis of multi-reader, multi-test ROC data and proposed a jackknife method to estimate the variance of the U-statistics. Song's jackknife variance estimation can be computationally demanding, and the paper does not examine how it performs with unequal numbers of non-diseased and diseased cases. It should be noted that when the DBM and OR methods are used with readers treated as fixed effects and with DeLong et al.'s variance estimation, their test statistics are equivalent to those of DeLong et al.'s and Song's two-sample jackknife approaches. More recently, Li and Zhou (2008) proposed a nonparametric approach to comparing ROC curves for a paired design with repeated or clustered data. They treated nonparametric ROC curves as stochastic processes and derived their asymptotic distribution theory. A Monte Carlo resampling method was used to approximate the empirical ROC processes and compare correlated AUCs. Although their method was not specifically developed for multi-reader diagnostic accuracy studies, it can be applied to such studies when repeated marker measurements from each subject stem from interpretation by multiple readers.
With regard to power calculation for multi-reader diagnostic accuracy studies, a formula based on the OR method is among the most widely used approaches (Obuchowski, 1995a, 1995b, 1998; Zhou et al., 2002). Obuchowski (1995a) used the adjusted ANOVA F statistic of the OR method to determine the power to test the equality of the diagnostic accuracies of multiple tests. Possible ranges for the parameter estimates of the variance components and correlations required for the sample size calculation were introduced in Obuchowski (1995b). The author's nonparametric power calculation was also described in Obuchowski (1998), but that paper discussed several other methods for determining sample sizes that differ by study design (other than multi-reader ROC studies) and diagnostic accuracy measure. Recently, Hillis et al. (2011) described a procedure for estimating power for studies analyzed with either the DBM or the OR method by applying Hillis's (2007) recommended ddf for the F-statistic. In this approach, the ROC summary measure is estimated in a first stage, and this estimate is used as the response in a second stage that fits a mixed-effects ANOVA model. This two-stage approach can be misleading because it may depend on the estimated values of the response used in the ANOVA model and is sensitive to the normality assumptions about the underlying distributions of the random variables. Additionally, the method assumes that the complex correlation structure arising from having the same patient sample evaluated by several readers in a set of tests can be described by only three correlations: the correlation of error terms in diagnostic accuracies of the same reader in different tests, the correlation of error terms in diagnostic accuracies of different readers in the same test, and the correlation of error terms in diagnostic accuracies of different readers in different tests.
In this article, we propose a new power formula to compare correlated AUCs in a multi-reader, multi-test design. Specifically, we present a nonparametric approach to estimate and compare the correlated AUCs in a multi-reader, multi-test design by extending DeLong et al.'s (1988) approach. A power formula is derived based on the asymptotic normality of the nonparametric AUCs. This article is organized as follows: inference procedures for correlated AUCs are described in Section 2, and a formula for power calculation is presented in Section 3. In Section 4, we describe simulation studies evaluating the performance of the proposed power formula. We apply the proposed method to an example from the American College of Radiology Imaging Network (ACRIN) Digital Mammographic Imaging Screening Trial (DMIST) in Section 5, and a discussion follows in Section 6.
2. Inference for Correlated AUCs
Suppose h tests are performed on a sample of N patients (m diseased, n non-diseased, N = m + n), where r readers independently examine the test results from all patients. Let $X_{ikl}$ be the test result for diseased subject i from test l by reader k (i = 1, ⋯, m; k = 1, ⋯, r; l = 1, ⋯, h). Similarly, let $Y_{jkl}$ be the test result for non-diseased subject j from test l by reader k (j = 1, ⋯, n; k = 1, ⋯, r; l = 1, ⋯, h). The test results $X_{ikl}$ and $Y_{jkl}$ can be either continuous or ordinal. Without loss of generality, we assume that higher values of test results are more indicative of disease. Our primary goal is to estimate and compare the correlated AUCs of the h diagnostic tests.
Let $\theta_{kl}$ denote the AUC of diagnostic test l interpreted by reader k. We assume that the ratings from diseased and non-diseased subjects are independent and are identically distributed within the diseased group and within the non-diseased group for a fixed reader and test. A nonparametric estimate of $\theta_{kl}$ is then calculated by the Mann-Whitney U statistic

$$\hat{\theta}_{kl} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\phi(X_{ikl}, Y_{jkl}), \qquad (1)$$

with

$$\phi(X, Y) = \begin{cases} 1, & Y < X, \\ 1/2, & Y = X, \\ 0, & Y > X. \end{cases}$$

To evaluate the diagnostic accuracy of test l, we use the average AUC across the r readers for a fixed test l. Thus, the AUC for diagnostic test l is defined as $\theta_{l} = r^{-1}\sum_{k=1}^{r}\theta_{kl}$ and is estimated by

$$\hat{\theta}_{l} = \frac{1}{r}\sum_{k=1}^{r}\hat{\theta}_{kl}. \qquad (2)$$
In our nonparametric approach, readers are considered fixed effects as indicated in Equation (2).
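To make the estimators in (1) and (2) concrete, the following R sketch computes the Mann-Whitney AUC for each reader-test combination and the reader-averaged AUC for a test. It assumes the ratings are stored as arrays X (m × r × h, diseased) and Y (n × r × h, non-diseased); this array layout and the function names are illustrative choices, not taken from the supplementary R code.

```r
# Minimal R sketch of (1) and (2); the array layout (subjects x readers x tests)
# and the function names are assumptions for illustration.

# Mann-Whitney kernel: 1 if Y < X, 1/2 if tied, 0 otherwise
phi <- function(x, y) (y < x) + 0.5 * (y == x)

# Nonparametric AUC for reader k and test l, as in Equation (1)
auc_kl <- function(X, Y, k, l) {
  mean(outer(X[, k, l], Y[, k, l], phi))
}

# Reader-averaged AUC for test l, as in Equation (2)
auc_l <- function(X, Y, l) {
  r <- dim(X)[2]
  mean(sapply(seq_len(r), function(k) auc_kl(X, Y, k, l)))
}
```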
Let $\hat{\boldsymbol{\theta}} = (\hat{\theta}_{11}, \ldots, \hat{\theta}_{rh})^{T}$ denote the vector of U-statistics in which each element represents the nonparametric AUC of a diagnostic test examined by a specific reader. We use the notation (k, l) to denote the element corresponding to the lth diagnostic test interpreted by the kth reader. Asymptotic normality and variance expressions for $\hat{\boldsymbol{\theta}}$ can be derived using the theory of generalized U-statistics. Define

$$\begin{aligned}
\xi_{10}((k,l),(k',l')) &= \mathrm{Cov}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{ik'l'}, Y_{j'k'l'})\}, \quad j \neq j',\\
\xi_{01}((k,l),(k',l')) &= \mathrm{Cov}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{i'k'l'}, Y_{jk'l'})\}, \quad i \neq i',\\
\xi_{11}((k,l),(k',l')) &= \mathrm{Cov}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{ik'l'}, Y_{jk'l'})\}. \qquad\qquad (3)
\end{aligned}$$

The covariance of the (k, l)th and (k′, l′)th statistics can be written as

$$\mathrm{Cov}(\hat{\theta}_{kl}, \hat{\theta}_{k'l'}) = \frac{(n-1)\,\xi_{10}((k,l),(k',l')) + (m-1)\,\xi_{01}((k,l),(k',l')) + \xi_{11}((k,l),(k',l'))}{mn}. \qquad (4)$$
We extend DeLong et al.'s (1988) nonparametric approach to analyze multi-reader, multi-test ROC data. Specifically, we use the method of structural components proposed by Sen (1960) to obtain consistent estimates of the elements of the variance-covariance matrix of $\hat{\boldsymbol{\theta}}$. For the (k, l)th statistic, $\hat{\theta}_{kl}$, the X-components and Y-components are defined, respectively, as

$$V_{10}^{(k,l)}(X_{ikl}) = \frac{1}{n}\sum_{j=1}^{n}\phi(X_{ikl}, Y_{jkl}), \qquad i = 1, \ldots, m,$$

and

$$V_{01}^{(k,l)}(Y_{jkl}) = \frac{1}{m}\sum_{i=1}^{m}\phi(X_{ikl}, Y_{jkl}), \qquad j = 1, \ldots, n.$$

Additionally, we define the (rh × rh) matrix $S_{10}$ such that its ((k, l), (k′, l′))th element is

$$s_{10}((k,l),(k',l')) = \frac{1}{m-1}\sum_{i=1}^{m}\{V_{10}^{(k,l)}(X_{ikl}) - \hat{\theta}_{kl}\}\{V_{10}^{(k',l')}(X_{ik'l'}) - \hat{\theta}_{k'l'}\},$$

and define the (rh × rh) matrix $S_{01}$ such that its ((k, l), (k′, l′))th element is

$$s_{01}((k,l),(k',l')) = \frac{1}{n-1}\sum_{j=1}^{n}\{V_{01}^{(k,l)}(Y_{jkl}) - \hat{\theta}_{kl}\}\{V_{01}^{(k',l')}(Y_{jk'l'}) - \hat{\theta}_{k'l'}\}.$$

The covariance matrix of $\hat{\boldsymbol{\theta}}$ is then estimated by

$$\hat{S} = \frac{1}{m}S_{10} + \frac{1}{n}S_{01}. \qquad (5)$$

$S_{10}$ and $S_{01}$ are asymptotically unbiased estimates of $\xi_{10}$ and $\xi_{01}$, respectively, and the last term in Equation (4) is negligible, so it is not considered in the covariance estimation (Serfling, 1980, Chapter 5).
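A sketch of the covariance estimator in (5) is given below. It reuses phi() and the X, Y array layout from the previous sketch; the ordering of the rh statistics (readers varying fastest within each test) is an assumption made for illustration.

```r
# Structural-components estimate of the covariance matrix in (5).
# Reuses phi() and the (subjects x readers x tests) arrays from the sketch above.
cov_auc <- function(X, Y) {
  m <- dim(X)[1]; n <- dim(Y)[1]; r <- dim(X)[2]; h <- dim(X)[3]
  V10 <- matrix(NA, m, r * h)   # X-components, one column per (k, l)
  V01 <- matrix(NA, n, r * h)   # Y-components, one column per (k, l)
  theta <- numeric(r * h)       # nonparametric AUCs, readers varying fastest
  col <- 0
  for (l in seq_len(h)) for (k in seq_len(r)) {
    col <- col + 1
    K <- outer(X[, k, l], Y[, k, l], phi)   # m x n matrix of kernel values
    V10[, col] <- rowMeans(K)               # average over non-diseased subjects
    V01[, col] <- colMeans(K)               # average over diseased subjects
    theta[col] <- mean(K)
  }
  S10 <- cov(V10)                           # (rh x rh); cov() uses divisor m - 1
  S01 <- cov(V01)                           # (rh x rh); cov() uses divisor n - 1
  list(theta = theta, S = S10 / m + S01 / n)
}
```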
Theorem 1
Let $\boldsymbol{\theta} = (\theta_{11}, \ldots, \theta_{rh})^{T}$ and $\hat{\boldsymbol{\theta}} = (\hat{\theta}_{11}, \ldots, \hat{\theta}_{rh})^{T}$. If limN→∞ m/N = λ and limN→∞ n/N = 1 − λ with 0 < λ < 1, then under model (1), $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$ is asymptotically normal with zero mean vector and covariance matrix Σ = (σ((k,l),(k′,l′))), where

$$\sigma((k,l),(k',l')) = \frac{\xi_{10}((k,l),(k',l'))}{\lambda} + \frac{\xi_{01}((k,l),(k',l'))}{1-\lambda}.$$
The proof of Theorem 1 uses the central limit theorem for generalized two-sample U-statistics (Lee and Dehling, 2005). More details can be found in the appendix. The covariance matrix Σ in Theorem 1 can be estimated using its empirical counterpart in (5).
Next, we make an inference for the accuracy of each diagnostic test. The following corollary implies that $\hat{\theta}_{l}$ is a consistent estimator for $\theta_{l}$ and follows an asymptotically normal distribution.
Corollary 1
Under the assumptions of Theorem 1, $\sqrt{N}(\hat{\theta}_{l} - \theta_{l})$ converges in distribution to a normal distribution with mean zero and variance

$$\sigma_{l}^{2} = \frac{1}{r^{2}}\sum_{k=1}^{r}\sum_{k'=1}^{r}\sigma((k,l),(k',l)).$$
See Web Appendix for the proof.
Let g be a real-valued function that is continuously differentiable with Jacobian matrix $G \equiv \partial g(\boldsymbol{\theta})/\partial\boldsymbol{\theta}^{T}$ of full rank. If limN→∞ m/N = λ is bounded and nonzero, $\sqrt{N}\{g(\hat{\boldsymbol{\theta}}) - g(\boldsymbol{\theta})\}$ is asymptotically normal with zero mean vector and covariance matrix $G\Sigma G^{T}$. When g is a linear function and C is a (1 × rh) row vector of coefficients, for any contrast $C\boldsymbol{\theta}$, the test statistic

$$Z = \frac{C\hat{\boldsymbol{\theta}} - C\boldsymbol{\theta}}{\sqrt{C\hat{S}C^{T}}} \qquad (6)$$

is asymptotically standard normal. For example, in order to evaluate the difference in the AUCs of the two tests l = 1 and l′ = 2, we can conduct a Z-test by setting C to be the (1 × rh) vector with elements 1/r for the entries (k, 1) and −1/r for the entries (k, 2), k = 1, ⋯, r, and $C\boldsymbol{\theta} = 0$ in the test statistic (6). To compare the AUCs of two or more diagnostic tests, we define C as a (c × rh) matrix of coefficients with full rank c ≤ rh. Then, the Wald statistic

$$W = (C\hat{\boldsymbol{\theta}})^{T}(C\hat{S}C^{T})^{-1}(C\hat{\boldsymbol{\theta}}) \qquad (7)$$

can be used, which has approximately a chi-squared distribution with c degrees of freedom. A confidence interval or region for $C\boldsymbol{\theta}$ can be easily calculated using (6) or (7).
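The test statistics (6) and (7) can be sketched in R as follows, taking the AUC vector and covariance estimate from cov_auc() above; the helper that builds a two-test contrast assumes the same reader-within-test ordering used there.

```r
# Z test of (6) for a single contrast given as a (1 x rh) row vector C
z_test <- function(C, theta, S) {
  z <- as.numeric(C %*% theta) / sqrt(as.numeric(C %*% S %*% t(C)))
  c(z = z, p.value = 2 * pnorm(-abs(z)))
}

# Wald test of (7) for a (c x rh) contrast matrix C of full rank c
wald_test <- function(C, theta, S) {
  d <- C %*% theta
  stat <- as.numeric(t(d) %*% solve(C %*% S %*% t(C)) %*% d)
  c(statistic = stat, p.value = pchisq(stat, df = nrow(C), lower.tail = FALSE))
}

# Contrast comparing tests l = 1 and l = 2 averaged over r readers
# (assumes readers vary fastest within each test, as in cov_auc())
contrast_two_tests <- function(r, h) {
  C <- matrix(0, 1, r * h)
  C[1, 1:r] <- 1 / r
  C[1, (r + 1):(2 * r)] <- -1 / r
  C
}
```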
3. Power Calculations
In this section, we propose a formula for the power calculation of a multi-reader, multi-test study design. Specifically, we aim to determine the power to detect the difference in AUC between two diagnostic tests given a specific number of readers and subjects.
In order to test the AUC difference between any two diagnostic tests l and l′, we state the null and alternative hypotheses as

$$H_{0}: \theta_{l} - \theta_{l'} = 0 \quad \text{versus} \quad H_{1}: \theta_{l} - \theta_{l'} = \delta_{1} \neq 0.$$

According to (6), the test statistic $Z = (\hat{\theta}_{l} - \hat{\theta}_{l'})/\sqrt{\mathrm{Var}(\hat{\theta}_{l} - \hat{\theta}_{l'})}$ possesses approximately a standard normal distribution. Define

$$\begin{aligned}
\rho_{1}((k,l),(k',l')) &= \mathrm{Corr}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{ik'l'}, Y_{j'k'l'})\}, \quad j \neq j',\\
\rho_{2}((k,l),(k',l')) &= \mathrm{Corr}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{i'k'l'}, Y_{jk'l'})\}, \quad i \neq i',\\
\rho_{3}((k,l),(k',l')) &= \mathrm{Corr}\{\phi(X_{ikl}, Y_{jkl}),\ \phi(X_{ik'l'}, Y_{jk'l'})\}. \qquad\qquad (8)
\end{aligned}$$

Here $\rho_{1}$, $\rho_{2}$, and $\rho_{3}$ (k, k′ = 1, ⋯, r; l, l′ = 1, ⋯, h) represent the correlations between the ϕ's when the test results from the two test modalities are obtained, respectively, from the same diseased and different non-diseased subjects, from different diseased and the same non-diseased subjects, and from the same diseased and non-diseased subjects. We derive the explicit expression for the variance of the difference between two nonparametric AUCs in the following theorem, which reflects all the correlation structures specified in (8) that arise from having the same readers or the same tests.
Theorem 2
The variance of the difference between two correlated nonparametric AUCs is

$$\mathrm{Var}(\hat{\theta}_{l} - \hat{\theta}_{l'}) = \frac{1}{r^{2}}\sum_{k=1}^{r}\sum_{k'=1}^{r}\Big\{\mathrm{Cov}(\hat{\theta}_{kl}, \hat{\theta}_{k'l}) + \mathrm{Cov}(\hat{\theta}_{kl'}, \hat{\theta}_{k'l'}) - 2\,\mathrm{Cov}(\hat{\theta}_{kl}, \hat{\theta}_{k'l'})\Big\},$$

with, from (4) and (8),

$$\mathrm{Cov}(\hat{\theta}_{kl}, \hat{\theta}_{k'l'}) = \frac{(n-1)\rho_{1}((k,l),(k',l')) + (m-1)\rho_{2}((k,l),(k',l')) + \rho_{3}((k,l),(k',l'))}{mn}\sqrt{V_{kl}V_{k'l'}},$$

where $V_{kl} = \mathrm{Var}\{\phi(X_{ikl}, Y_{jkl})\}$. (See Web Appendix for details.)
For power determination, the following assumptions are made:

(a) The variance $V_{kl}$ is simplified as $V = \bar{\theta}(1 - \bar{\theta})$, assuming that the variance is the same across readers and modalities. Here $\bar{\theta}$ denotes the average of the two AUCs being compared, and the probability of ties is assumed to be zero.

(b) The correlations specified in (8) are the same either across readers, across tests, or both. Thus, the correlations are simplified to the following 11 representative correlations:

$$\begin{aligned}
\rho_{11} &= \rho_{1}((k,l),(k,l)), & \rho_{12} &= \rho_{1}((k,l),(k',l)), & \rho_{13} &= \rho_{1}((k,l),(k,l')), & \rho_{14} &= \rho_{1}((k,l),(k',l')),\\
\rho_{21} &= \rho_{2}((k,l),(k,l)), & \rho_{22} &= \rho_{2}((k,l),(k',l)), & \rho_{23} &= \rho_{2}((k,l),(k,l')), & \rho_{24} &= \rho_{2}((k,l),(k',l')),\\
& & \rho_{32} &= \rho_{3}((k,l),(k',l)), & \rho_{33} &= \rho_{3}((k,l),(k,l')), & \rho_{34} &= \rho_{3}((k,l),(k',l')), \qquad (9)
\end{aligned}$$

where k ≠ k′ and l ≠ l′. ρ11 and ρ21 are the correlations between ϕ's when the test results are evaluated by the same reader using the same test; ρ12, ρ22, and ρ32 are the correlations between ϕ's when the test results are evaluated by different readers using the same test; ρ13, ρ23, and ρ33 are the correlations between ϕ's when the test results are evaluated by the same reader using different tests; and ρ14, ρ24, and ρ34 are the correlations between ϕ's when the test results are evaluated by different readers using different tests. Note that ρ31 = ρ3((k, l),(k, l)) represents the correlation between ϕ's when the test results are obtained from the same diseased and non-diseased subjects and are read by the same reader using the same test; therefore, ρ31 is always 1 and is not considered in the estimation. Under the above assumptions (a) and (b), the variance in Theorem 2 is simplified as
$$\mathrm{Var}(\hat{\theta}_{l} - \hat{\theta}_{l'}) = \frac{2V}{rmn}\Big[(n-1)(\rho_{11} - \rho_{13}) + (m-1)(\rho_{21} - \rho_{23}) + (1 - \rho_{33}) + (r-1)\big\{(n-1)(\rho_{12} - \rho_{14}) + (m-1)(\rho_{22} - \rho_{24}) + (\rho_{32} - \rho_{34})\big\}\Big], \qquad (10)$$
with $V = \bar{\theta}(1 - \bar{\theta})$ and $\bar{\theta}$ the average of the two AUCs being compared. We suggest determining V and the 11 different types of correlations from a pilot study or a similar study. In a multi-reader, multi-test design, $\rho_{3s} \geq \rho_{1s} \geq \rho_{2s}$ for a fixed s (s = 1, ⋯, 4), and $\rho_{t1} \geq \rho_{t3} \geq \rho_{t2} \geq \rho_{t4}$ for a fixed t (t = 1, 2, 3). The proposed power at the significance level α is then
$$\text{Power} = \Phi\!\left(\frac{\delta_{1}}{\sqrt{\mathrm{Var}(\hat{\theta}_{l} - \hat{\theta}_{l'})}} - z_{1-\alpha/2}\right) + \Phi\!\left(\frac{-\delta_{1}}{\sqrt{\mathrm{Var}(\hat{\theta}_{l} - \hat{\theta}_{l'})}} - z_{1-\alpha/2}\right), \qquad (11)$$

where $\delta_{1}$ is the AUC difference under the alternative hypothesis, $\Phi$ is the standard normal distribution function, and $z_{1-\alpha/2}$ is its (1 − α/2) quantile.
Details are shown in Web Appendix.
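A small R function implementing the power calculation in (10) and (11) is sketched below; the function name and argument conventions are ours. V and the 11 named correlations are supplied by the user, for example from a pilot or prior study as suggested above.

```r
# Power for detecting an AUC difference delta1 between two tests, following the
# simplified variance (10) and the normal-approximation power formula (11).
# rho must be a named vector with elements rho11, rho12, rho13, rho14,
# rho21, rho22, rho23, rho24, rho32, rho33, rho34.
power_mrmt <- function(delta1, m, n, r, V, rho, alpha = 0.05) {
  var_diff <- (2 * V / (r * m * n)) *
    ((n - 1) * (rho["rho11"] - rho["rho13"]) +
     (m - 1) * (rho["rho21"] - rho["rho23"]) +
     (1 - rho["rho33"]) +
     (r - 1) * ((n - 1) * (rho["rho12"] - rho["rho14"]) +
                (m - 1) * (rho["rho22"] - rho["rho24"]) +
                (rho["rho32"] - rho["rho34"])))
  z_a <- qnorm(1 - alpha / 2)
  unname(pnorm(delta1 / sqrt(var_diff) - z_a) +
         pnorm(-delta1 / sqrt(var_diff) - z_a))
}
```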
4. Simulation Studies
We considered the situation where r readers examine the test results of N (m diseased, n non-diseased) subjects who undergo two test modalities (h = 2). We varied the total sample size N from 100 to 200 and set the number of readers to r=4, 8, 12. We calculated the theoretical power based on the power formula in (11) under different scenarios for the number of readers and subjects.
First, the data are generated as follows. Let the test results for diseased subject i and non-diseased subject j be $X_{i} = (X_{i11}, \ldots, X_{irh})^{T}$ and $Y_{j} = (Y_{j11}, \ldots, Y_{jrh})^{T}$, respectively. To generate $X_{i}$ and $Y_{j}$, we expressed each of their elements as the sum of two components, a reader component and a subject component,
where the reader components account for variability due to reader k of modality l in the diseased and non-diseased groups, respectively, and are assumed to be normally distributed for a fixed l. The subject components account for variability due to subject i or j from modality l in the two groups and are generated independently of the reader components. For diseased subject i, the vector of subject components is a multivariate normal random variable with mean determined by (μd1, ⋯, μdh) and covariance matrix Σ1; for non-diseased subject j, it is a N(0, Σ0) random variable. The diagonal elements of the within-modality blocks of Σ1 (or Σ0) are the variances of the diseased (or non-diseased) subject components, and their off-diagonal elements are the covariances between the test results when they are read by different readers using the same modality. The diagonal and off-diagonal elements of the between-modality blocks are the covariances between the components when they are read by the same reader using different modalities and when they are read by different readers using different modalities, respectively.
The assumed values of the reader and subject variance components imply that the correlations between the test results when they are evaluated, respectively, by different readers using the same modality, by the same reader using different tests, and by different readers using different tests are 0.3, 0.8, and 0.25 for diseased subjects and 0.225, 0.6, and 0.1875 for non-diseased subjects. We set μd1 = μd2 = 1.12 to obtain the true AUCs θ1 = θ2 = 0.8, and μd1 = 1.37 and μd2 = 1.12 to obtain θ1 = 0.85 and θ2 = 0.8.
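As a simplified, self-contained stand-in for this data-generating scheme, the sketch below collapses the reader and subject components into a single multivariate normal vector of rh ratings per subject. The common total variance is an assumed value chosen so that μd = 1.12 corresponds to a binormal AUC of about 0.8, and the correlation matrices reproduce the marginal correlations quoted above.

```r
library(MASS)  # for mvrnorm

# Simplified generator: one rh-dimensional normal vector of ratings per subject.
# The total variance s2 is an assumption chosen so that mu_d = 1.12 gives a
# binormal AUC of about 0.8; the correlation values follow the text
# (different reader/same test, same reader/different test, different/different).
gen_mrmt <- function(m, n, r, mu_d = c(1.37, 1.12),
                     rho_d = c(0.3, 0.8, 0.25),
                     rho_nd = c(0.225, 0.6, 0.1875)) {
  h <- length(mu_d)
  s2 <- (1.12 / (qnorm(0.8) * sqrt(2)))^2
  build_sigma <- function(rho) {
    Sig <- matrix(NA, r * h, r * h)
    for (a in seq_len(r * h)) for (b in seq_len(r * h)) {
      ka <- (a - 1) %% r + 1; la <- (a - 1) %/% r + 1
      kb <- (b - 1) %% r + 1; lb <- (b - 1) %/% r + 1
      rho_ab <- if (a == b) 1 else if (la == lb) rho[1] else if (ka == kb) rho[2] else rho[3]
      Sig[a, b] <- s2 * rho_ab
    }
    Sig
  }
  mu_x <- rep(mu_d, each = r)                    # readers vary fastest within test
  X <- mvrnorm(m, mu_x, build_sigma(rho_d))      # m x (r*h) diseased ratings
  Y <- mvrnorm(n, rep(0, r * h), build_sigma(rho_nd))
  list(X = array(X, c(m, r, h)), Y = array(Y, c(n, r, h)))
}
```

A simulated data set, for example dat <- gen_mrmt(50, 50, 4), can then be analyzed with the sketches from Section 2 via cov_auc(dat$X, dat$Y).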
Tables 1 and 2 summarize the results based on 1,000 simulations. The difference of the two AUCs was estimated nonparametrically by Equation (2), and its variance was derived using the structural components in Equation (5). For testing the equality of the two AUCs, the Z-statistic in Equation (6) was used.
Table 1. Estimated AUC difference (Est.), asymptotic standard error (ASE), empirical standard deviation (SE), and empirical and theoretical power when the true AUC difference is 0.05 (θ1 = 0.85, θ2 = 0.8), based on 1,000 simulations.

| N | (m, n) | r | Est. | ASE | SE | Empirical power | Theoretical power |
|---|---|---|---|---|---|---|---|
| 100 | (50, 50) | 4 | 0.048 | 0.018 | 0.018 | 0.788 | 0.807 |
| | | 8 | 0.049 | 0.016 | 0.016 | 0.895 | 0.894 |
| | | 12 | 0.048 | 0.015 | 0.015 | 0.913 | 0.921 |
| | (33, 67) | 4 | 0.049 | 0.019 | 0.020 | 0.749 | 0.726 |
| | | 8 | 0.048 | 0.017 | 0.017 | 0.828 | 0.823 |
| | | 12 | 0.049 | 0.016 | 0.016 | 0.865 | 0.857 |
| | (25, 75) | 4 | 0.049 | 0.021 | 0.022 | 0.634 | 0.640 |
| | | 8 | 0.049 | 0.019 | 0.019 | 0.760 | 0.742 |
| | | 12 | 0.048 | 0.017 | 0.018 | 0.802 | 0.780 |
| 200 | (100, 100) | 4 | 0.048 | 0.012 | 0.014 | 0.969 | 0.981 |
| | | 8 | 0.049 | 0.011 | 0.012 | 0.992 | 0.995 |
| | | 12 | 0.048 | 0.010 | 0.011 | 0.998 | 0.998 |
| | (67, 133) | 4 | 0.048 | 0.014 | 0.014 | 0.942 | 0.956 |
| | | 8 | 0.048 | 0.012 | 0.012 | 0.980 | 0.985 |
| | | 12 | 0.049 | 0.011 | 0.011 | 0.993 | 0.991 |
| | (50, 150) | 4 | 0.048 | 0.015 | 0.016 | 0.893 | 0.910 |
| | | 8 | 0.048 | 0.013 | 0.013 | 0.961 | 0.960 |
| | | 12 | 0.049 | 0.012 | 0.012 | 0.980 | 0.972 |
Table 2. Estimated AUC difference (Est.), asymptotic standard error (ASE), empirical standard deviation (SE), and empirical type I error rate when the two AUCs are equal (θ1 = θ2 = 0.8), based on 1,000 simulations.

| N | (m, n) | r | Est. | ASE | SE | Empirical type I error |
|---|---|---|---|---|---|---|
| 100 | (50, 50) | 4 | 0.0002 | 0.019 | 0.019 | 0.049 |
| | | 8 | 0.0008 | 0.016 | 0.016 | 0.050 |
| | | 12 | 0.0010 | 0.015 | 0.015 | 0.053 |
| | (33, 67) | 4 | 0.0004 | 0.020 | 0.021 | 0.055 |
| | | 8 | −0.0002 | 0.018 | 0.018 | 0.055 |
| | | 12 | −0.0005 | 0.017 | 0.016 | 0.053 |
| | (25, 75) | 4 | −0.0007 | 0.022 | 0.023 | 0.059 |
| | | 8 | −0.0003 | 0.019 | 0.020 | 0.059 |
| | | 12 | −0.0004 | 0.018 | 0.018 | 0.053 |
| 200 | (100, 100) | 4 | 0.0009 | 0.013 | 0.013 | 0.051 |
| | | 8 | −0.0004 | 0.011 | 0.012 | 0.053 |
| | | 12 | −0.0003 | 0.011 | 0.011 | 0.055 |
| | (67, 133) | 4 | 0.0007 | 0.014 | 0.015 | 0.059 |
| | | 8 | −0.0004 | 0.012 | 0.013 | 0.055 |
| | | 12 | −0.0002 | 0.012 | 0.011 | 0.047 |
| | (50, 150) | 4 | −0.0004 | 0.016 | 0.017 | 0.059 |
| | | 8 | < 0.1 × 10−3 | 0.014 | 0.014 | 0.057 |
| | | 12 | −0.0003 | 0.013 | 0.013 | 0.058 |
In Table 1, we present the theoretical powers and compare them with the empirical powers when the true AUC difference is 0.05 (θ1 = 0.85, θ2 = 0.8). The empirical AUC difference is very close to 0.05, and the asymptotic standard errors and the empirical standard deviations are nearly identical. The last column in Table 1 presents the theoretical powers calculated using the power formula (11), in which it was assumed that $V = \bar{\theta}(1-\bar{\theta}) = 0.825 \times 0.175 \approx 0.144$ and the 11 correlations are ρ11 = 0.31, ρ12 = 0.08, ρ13 = 0.24, ρ14 = 0.06, ρ21 = 0.22, ρ22 = 0.06, ρ23 = 0.17, ρ24 = 0.05, ρ32 = 0.15, ρ33 = 0.55, ρ34 = 0.12. These correlations were estimated by the corresponding sample correlations from the simulated data described at the beginning of this section. The theoretical powers are quite close to the empirical powers. Overall, the power increases with the number of readers, and a balanced design with m = n has the most power when the total sample size N is fixed.
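As a check on the power formula sketched at the end of Section 3, the following call plugs in the correlations listed above together with V = 0.825 × 0.175 implied by assumption (a); it returns approximately 0.807, in line with the first row of Table 1.

```r
# Reproducing (approximately) the theoretical power in the first row of Table 1
# with the power_mrmt() sketch from Section 3; V = 0.825 * 0.175 follows from
# assumption (a) with theta-bar = (0.85 + 0.80) / 2.
rho <- c(rho11 = 0.31, rho12 = 0.08, rho13 = 0.24, rho14 = 0.06,
         rho21 = 0.22, rho22 = 0.06, rho23 = 0.17, rho24 = 0.05,
         rho32 = 0.15, rho33 = 0.55, rho34 = 0.12)
power_mrmt(delta1 = 0.05, m = 50, n = 50, r = 4, V = 0.825 * 0.175, rho = rho)
# approximately 0.807
```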
Next, we conducted simulation studies to examine the type I error rate of the proposed Wald test. The results when the difference in AUC is 0 (θ1 = θ2 = 0.8) are shown in Table 2. As expected, the estimated AUC difference shows negligible bias, and the standard errors calculated from the asymptotic theory closely mimic the empirical standard deviations. The empirical type I error rates are close to the nominal 5% level for all scenarios.
5. Practical Implementation
We illustrate the proposed power formula using data from the ACRIN DMIST retrospective multi-reader study (Hendrick et al., 2008). The goal of this study was to compare the accuracy of soft-copy digital mammography with that of screen-film mammography for breast cancer diagnosis. Three digital mammography manufacturers (Fischer, Fuji, and GE) participated in the study; each had 6 to 12 readers and 98 to 120 screened women. We selected the data from the Fuji digital mammography machine. For the Fuji study, each of the 12 radiologists read 98 cases (27 cancer cases and 71 benign or negative cases) on both soft-copy digital and screen-film mammograms. Each radiologist identified suspicious findings and rated the suspicion of breast cancer in identified lesions using a 7-point scale (from 1 = definitely not malignant to 7 = definitely malignant).
We calculated nonparametric AUCs for each reader and modality combination separately. The values were then averaged across readers for each modality. Using the nonparametric method, the average AUCs were 0.756 (SE 0.054) for the screen-film mammography and 0.715 (SE 0.065) for the soft-copy digital mammography. The estimated AUC difference between the two modalities was 0.041 (SE 0.032) with p-value=0.20, indicating no significant difference in the AUCs between Fuji soft-copy digital and screen-film mammography.
In order to compute the power, we suggest obtaining the 11 correlations in (9) from related prior studies or the literature. In the DMIST data example, the 11 correlations between the two tests were estimated from the data. Given these correlation estimates, an assumed value of V, and a difference between the two AUCs under the alternative hypothesis of δ1 = 0.05, the estimated power is 0.452 at α = 0.05 according to the power formula (11) for 27 diseased subjects, 71 non-diseased subjects, and 12 readers.
Table 3 presents the power calculation for different numbers of diseased and non-diseased subjects with a varying number of readers. The total sample size was set to 100 or 200, letting the ratio of diseased to non-diseased subjects be 1 or 0.5. The number of readers was set to 4, 6, 8, 10, or 12. The assumed AUCs for the two modalities (θ1, θ2) were either (0.75, 0.7) or (0.75, 0.69); thus, the effect size δ1 is 0.05 or 0.06. We assumed that the correlation coefficients (ρ11, ρ12, ρ13, ρ14, ρ21, ρ22, ρ23, ρ24, ρ32, ρ33, ρ34) are (0.5, 0.25, 0.25, 0.25, 0.24, 0.1, 0.1, 0.1, 0.4, 0.4, 0.4) (Case I) or (0.5, 0.25, 0.25, 0.2, 0.24, 0.1, 0.1, 0.1, 0.4, 0.4, 0.3) (Case II). As the table indicates, the power increases as the total sample size, effect size, or number of readers increases. When the total sample size is fixed, having equal numbers of diseased and non-diseased subjects is the most powerful. When the correlations ρ14 and ρ34 decreased, the power decreased substantially. Here, ρ14 and ρ34 indicate the correlations between ϕ's when the test results from the two test modalities are obtained from the same diseased and different non-diseased subjects and from the same diseased and non-diseased subjects, respectively, in which the test results are evaluated by different readers using different tests. Suppose we need to design a study with 100 or 200 participants and up to 12 readers, and we want a minimum power of 80% assuming that the effect size is 0.05. In Case I with an equal number of diseased and non-diseased subjects, we need 10 readers for N = 100 but only 6 readers for N = 200. If the ratio of diseased to non-diseased subjects is 1:2, then we need 12 readers for N = 100. However, in Case II, we need at least 200 participants (100 in each group) and 10 readers to achieve a power greater than 80%.
Table 3. Theoretical power for varying numbers of diseased (m) and non-diseased (n) subjects and readers (r), for AUC differences δ1 = 0.05 and 0.06 under two sets of correlation coefficients (Case I and Case II).

| N | (m, n) | r | δ1 = 0.05, Case I | δ1 = 0.05, Case II | δ1 = 0.06, Case I | δ1 = 0.06, Case II |
|---|---|---|---|---|---|---|
| 100 | (50, 50) | 4 | 0.452 | 0.345 | 0.598 | 0.465 |
| | | 6 | 0.615 | 0.419 | 0.771 | 0.558 |
| | | 8 | 0.739 | 0.470 | 0.877 | 0.618 |
| | | 10 | 0.828 | 0.507 | 0.937 | 0.660 |
| | | 12 | 0.890 | 0.534 | 0.969 | 0.690 |
| | (33, 67) | 4 | 0.380 | 0.275 | 0.509 | 0.373 |
| | | 6 | 0.526 | 0.328 | 0.681 | 0.443 |
| | | 8 | 0.647 | 0.364 | 0.801 | 0.490 |
| | | 10 | 0.743 | 0.390 | 0.880 | 0.523 |
| | | 12 | 0.817 | 0.410 | 0.930 | 0.547 |
| 200 | (100, 100) | 4 | 0.741 | 0.601 | 0.879 | 0.757 |
| | | 6 | 0.891 | 0.702 | 0.969 | 0.848 |
| | | 8 | 0.958 | 0.763 | 0.993 | 0.894 |
| | | 10 | 0.985 | 0.801 | 0.999 | 0.920 |
| | | 12 | 0.995 | 0.828 | 1.000 | 0.937 |
| | (67, 133) | 4 | 0.654 | 0.493 | 0.807 | 0.645 |
| | | 6 | 0.822 | 0.580 | 0.933 | 0.737 |
| | | 8 | 0.915 | 0.634 | 0.979 | 0.789 |
| | | 10 | 0.961 | 0.670 | 0.994 | 0.821 |
| | | 12 | 0.983 | 0.696 | 0.998 | 0.843 |
Case I: ρ11 = 0.5, ρ12 = 0.25, ρ13 = 0.25, ρ14 = 0.25, ρ21 = 0.24, ρ22 = 0.1, ρ23 = 0.1, ρ24 = 0.1, ρ32 = 0.4, ρ33 = 0.4, ρ34 = 0.4; Case II: ρ11 = 0.5, ρ12 = 0.25, ρ13 = 0.25, ρ14 = 0.2, ρ21 = 0.24, ρ22 = 0.1, ρ23 = 0.1, ρ24 = 0.1, ρ32 = 0.4, ρ33 = 0.4, ρ34 = 0.3
6. Discussion
The multi-reader, multi-test design is commonly used in radiological studies to compare different diagnostic techniques because it requires the smallest number of subjects. We developed a novel power formula to detect the AUC difference for any two diagnostic tests in a multi-reader, multi-test design based on the theory of generalized two-sample U-statistics. We showed the asymptotic normality of the nonparametric AUC differences and constructed the power formula using the Wald test.
DeLong et al.'s (1988) approach is a special case of the presented nonparametric approach because the former applies only to situations where multiple readers interpret a single test or multiple tests are read by a single reader. Our estimation method and hypothesis testing retain the spirit of Song's (1997) approach, in that the AUC estimates are the same U-statistics with readers treated as fixed effects and both methods use the Wald test for comparing correlated AUCs. However, the proposed method differs from Song's in the variance estimation for the nonparametric AUCs. In contrast to Song's jackknife method, we applied Sen's (1960) method of structural components, similar to DeLong's method. We derived the explicit expression of the variance of the nonparametric AUCs accounting for the correlated data structure of the multi-reader, multi-test design and then introduced 11 representative correlations to simplify the complicated variance structure for the power calculation. In addition, we are not aware of any nonparametric methods, including Song's, that address the sample size and power issues.
In practice, a power formula based on the OR method is one of the most widely used approaches for multi-reader diagnostic accuracy studies. Obuchowski et al. (1995) pointed out that the type I error rate based on the OR method is at the correct level only for eight or more readers. This is likely due to the inaccuracy of the F-statistic approximation used in the mixed-effects ANOVA model. Note that the number of pseudo-observations used as the response in the mixed model equals the number of readers times the number of tests, which is very small in typical applications; in this case, the asymptotic distributional approximation may not be adequate. In our approach, instead, the asymptotic approximation is based on the total number of subjects, so the asymptotic results are more reliable. Our simulation results also demonstrated that the proposed method performs well even with a small number of readers. The major strength of the proposed power formula is that it is easy to implement and provides a useful alternative to the OR method, especially when a study is expected to have a relatively small number of readers.
A major difference between the proposed and the OR (or DBM) methods is that our method treats readers as fixed effects, whereas the latter treat them as random. As Obuchowski et al. (2004) explained in detail, in phase II studies in which readers are selected from a specific institution, the selected sample of readers is often not generalizable to a broad population of readers. In this case, the conclusions of the study pertain only to the particular readers, and treating readers as fixed effects makes sense. Our method is thus suitable for this setup. However, in phase III studies, the readers should represent a general population of radiologists, and it is reasonable to treat readers as random to account for variability across readers. The proposed approach of assuming fixed readers is still applicable in this situation but may result in a loss of power when reader effects are actually random.
Several papers have discussed the comparison of nonparametric partial AUCs (pAUCs) by extending DeLong et al.'s (1988) approach (e.g., Zhang et al., 2002; Dodd and Pepe, 2003; He and Escobar, 2008). Although we focused on the entire AUC as the measure of accuracy, which is most widely used in multi-reader, multi-test studies, the proposed method can be easily extended to compare pAUCs. The only change to our test statistic is that the AUC of a diagnostic test by a specific reader in Equation (1) is replaced with the pAUC. Its variance can be estimated using Sen's (1960) method, as previously used, and the inference will be based on asymptotic normality using trimmed U-statistics theory, as presented by He and Escobar (2008). The corresponding power calculation will also be based on the correlations of the pAUC estimators.
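As an illustration of the kernel change involved, the sketch below computes a nonparametric pAUC over a false positive range (0, t0) by restricting the Mann-Whitney sum to pairs whose non-diseased rating falls at or above its empirical (1 − t0) quantile, in the spirit of Dodd and Pepe (2003). This particular plug-in form, and the reuse of phi() and the array layout from Section 2, are illustrative assumptions rather than the paper's implementation.

```r
# Nonparametric pAUC over FPR in (0, t0) for reader k and test l: only pairs
# whose non-diseased rating is at or above the empirical (1 - t0) quantile
# contribute, so mean(K) estimates P(Y < X, Y >= y_q).
pauc_kl <- function(X, Y, k, l, t0) {
  y_q <- quantile(Y[, k, l], probs = 1 - t0, type = 1)
  K <- outer(X[, k, l], Y[, k, l], phi)
  K[, Y[, k, l] < y_q] <- 0
  mean(K)
}
```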
As a final remark, we note that our power calculation for multi-reader, multi-test ROC data is based on a complete (reader by test) factorial study design. In many applications, however, test result data might be missing. As our statistic is based on pooling AUC estimators across all readers, our method can allow different readers to examine different numbers of diseased or non-diseased cases. We can make a simple correction to the variance estimation using the available data by taking an approach similar to that of Zhou and Gatsonis (1996). The power calculation will then be modified to reflect the different missing proportions for each reader.
Acknowledgments
This research was supported in part by NIH/NCI grant U01-CA 079778. The authors thank Constantine Gatsonis for his support and helpful comments.
Appendix
Proof of Theorem 1
We use the Cramér-Wold device, so consider an arbitrary linear combination of $\hat{\boldsymbol{\theta}}$, say $\sum_{k=1}^{r}\sum_{l=1}^{h} q_{kl}\hat{\theta}_{kl}$, where $q_{kl}$ is any constant. The latter can be expressed as

$$\sum_{k=1}^{r}\sum_{l=1}^{h} q_{kl}\hat{\theta}_{kl} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{r}\sum_{l=1}^{h} q_{kl}\,\phi(X_{ikl}, Y_{jkl}).$$
This is a generalized two-sample U-statistic of the type considered in Lee and Dehling (2005), with kernel function

$$h(X_{i}; Y_{j}) = \sum_{k=1}^{r}\sum_{l=1}^{h} q_{kl}\,\phi(X_{ikl}, Y_{jkl}).$$
Therefore, asymptotic normality holds by the result in Lee and Dehling (2005). In particular, since the constants $q_{kl}$ are arbitrary, we conclude that $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$ converges in distribution to a multivariate normal with zero mean vector and covariance matrix Σ = (σ((k, l),(k′, l′))) as given in Theorem 1. The asymptotic covariance is given by

$$\sigma((k,l),(k',l')) = \frac{\xi_{10}((k,l),(k',l'))}{\lambda} + \frac{\xi_{01}((k,l),(k',l'))}{1-\lambda},$$

where $\lambda = \lim_{N\to\infty} m/N$.
Footnotes
Supplementary Materials
Web Appendix referenced in Sections 2 and 3 and R code to calculate the power are available with this paper at the Biometrics website on Wiley Online Library.
This paper has been submitted for consideration for publication in Biometrics
References
- Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments. Academic Radiology. 2000;7:341–349.
- DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
- Dodd LD, Pepe M. Partial AUC estimation and regression. Biometrics. 2003;59:614–623.
- Dorfman DD, Berbaum KS, Metz CE. ROC rating analysis: generalization to the population of readers and cases with the jackknife method. Investigative Radiology. 1992;27:723–731.
- Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
- He Y, Escobar M. Nonparametric statistical inference method for partial areas under receiver operating characteristic curves, with application to genomic studies. Statistics in Medicine. 2008;27:5291–5308.
- Hendrick RE, Cole EB, Pisano ED, Acharyya S, Marques H, Cohen MA, Jong RA, Mawdsley GE, Kanal KM, D’Orsi CJ, Rebner M, Gatsonis C. Accuracy of soft-copy digital mammography versus that of screen-film mammography according to digital manufacturer: ACRIN DMIST retrospective multireader study. Radiology. 2008;247:38–48.
- Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607.
- Hillis SL, Berbaum KS. Monte Carlo validation of the Dorfman-Berbaum-Metz method using normalized pseudovalues and less data-based model simplification. Academic Radiology. 2005;12:1534–1542.
- Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine. 2007;26:596–619.
- Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Academic Radiology. 2008;15:647–661.
- Hillis SL, Obuchowski NA, Berbaum KS. Power estimation for multireader ROC methods: an updated and unified approach. Academic Radiology. 2011;18:129–142.
- Gallas B. One-shot estimate of MRMC variance: AUC. Academic Radiology. 2006;13:353–362.
- Lee MT, Dehling HG. Generalized two-sample U-statistics for clustered data. Statistica Neerlandica. 2005;59:313–323.
- Li G, Zhou K. A unified approach to nonparametric comparison of receiver operating characteristic curves for longitudinal and clustered data. Journal of the American Statistical Association. 2008;103:705–713.
- Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine. 1978;8:283–298.
- Obuchowski NA, Rockette HE. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Communications in Statistics – Simulation and Computation. 1995;24:285–308.
- Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. Academic Radiology. 1995a;2:522–529.
- Obuchowski NA. Multireader receiver operating characteristic studies: a comparison of study designs. Academic Radiology. 1995b;2:709–716.
- Obuchowski NA. Sample size calculations in studies of test accuracy. Statistical Methods in Medical Research. 1998;7:371–392.
- Obuchowski NA, Beiden SV, Berbaum KS, Hillis SL, Ishwaran H, Song HH, Wagner RF. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Academic Radiology. 2004;11(9):980–995.
- Sen PK. On some convergence properties of U-statistics. Calcutta Statistical Association Bulletin. 1960;10:1–18.
- Serfling RJ. Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
- Song HH. Analysis of correlated ROC areas in diagnostic testing. Biometrics. 1997;53:370–382.
- Song X, Zhou XH. A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics. 2005;6:303–312.
- Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press; 1982.
- Zhang DZ, Zhou XH, Freeman DH, Freeman JL. A non-parametric method for the comparison of partial areas under ROC curves and its application to large health care data sets. Statistics in Medicine. 2002;21:701–715.
- Zhou XH, Gatsonis CA. A simple method for comparing correlated ROC curves using incomplete data. Statistics in Medicine. 1996;15:1687–1693.
- Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: Wiley; 2002.