Abstract
The correlated-error ANOVA method proposed by Obuchowski and Rockette (OR) has been a useful procedure for analyzing reader-performance outcomes, such as the area under the receiver-operating-characteristic curve, resulting from multireader multicase radiological imaging data. This approach, however, has only been formally derived for the test-by-reader-by-case factorial study design. In this paper I show that the OR model can be viewed as a marginal-mean ANOVA model. Viewing the OR model within this marginal-mean ANOVA framework is the basis for the marginal-mean ANOVA approach, the topic of this paper. This approach (1) provides an intuitive motivation for the OR model, including its covariance-parameter constraints; (2) provides easy derivations of OR test statistics and parameter estimates, as well as their distributions and confidence intervals; and (3) allows for easy generalization of the OR procedure to other study designs. In particular, I show how one can easily derive OR-type analysis formulas for any balanced study design by following an algorithm which only requires an understanding of conventional ANOVA methods.
Keywords: Receiver operating characteristic (ROC) curve, correlated ANOVA, diagnostic radiology
1. INTRODUCTION
Receiver operating characteristic (ROC) curve analysis is a well-established method for evaluating and comparing the performance of diagnostic tests. In radiological imaging studies such tests typically involve a human reader (usually a radiologist) evaluating an image or images resulting from an imaging modality (such as mammography for breast cancer) for a case (i.e., subject) with respect to confidence of disease. In such situations it is important that conclusions generalize to both the case and reader populations. A typical design for comparing diagnostic tests is the balanced test×reader×case factorial study design where each image is assigned a disease-confidence rating by each reader using each diagnostic test. Throughout I use test to refer to a diagnostic test, modality, or treatment.
The methods proposed by Obuchowski and Rockette (OR) [1, 2] and Dorfman, Berbaum, and Metz (DBM) [3, 4] are the most commonly used methods for analyzing such multireader multicase studies (often referred to as MRMC studies) and have performed well in simulations. The OR procedure fits a correlated-error test×reader ANOVA to reader-performance outcomes such as the area under the ROC curve (AUC), while the DBM procedure fits a test×reader×case conventional ANOVA to case-specific pseudovalues. Although the two methods have been shown to be equivalent [5, 6] when based on the same procedural parameters, I find the OR procedure more intuitive and its parameters more interpretable because it models observed reader-performance outcomes rather than pseudovalues. For this reason the OR procedure will be the focus of this paper.
Previously published derivations of OR model statistical properties [6] are tedious, do not provide motivation for the model, and apply only to the balanced test×reader×case factorial study design. In this paper I show that the OR model is the same as the model for the marginal mean of a conventional ANOVA model with independent errors, where the mean is computed across cases. Viewing the OR model within this marginal-mean ANOVA framework is the basis for the marginal-mean ANOVA approach (mm-ANOVA approach), the topic of this paper. This approach (1) provides an intuitive motivation for the OR model, including its covariance-parameter constraints; (2) provides easy derivations of OR test statistics and parameter estimates, as well as their distributions and confidence intervals; and (3) allows for easy generalization of the OR procedure to other study designs.
In particular, I show how one can easily derive OR-type analysis formulas for any balanced study design by following an algorithm which only requires an understanding of conventional ANOVA methods. This development is important because for many situations other designs are more suitable than the test×reader×case factorial study design. For example, diagnostic tests may be mutually exclusive for various reasons, such as high radiation dose or invasiveness of the test, and thus cannot be given to each patient; readers may be trained to read under only one of the tests; or power considerations may show that it is advantageous to have replicated readings or to have groups of readers read different cases.
The outline of this paper is as follows. I review the OR method in Section 2. In Sections 3–4 and Appendices A–C I describe and justify steps of an algorithm for motivating the OR model and deriving its properties using the marginal-mean ANOVA approach. Steps are stated in a general form so that analogous OR-type procedures can be formulated for other study designs. In Section 5 I summarize the algorithm and illustrate how the algorithm can be used to develop OR-type procedures for six other study designs. A discussion and concluding remarks are given in Section 6.
2. THE OBUCHOWSKI-ROCKETTE (OR) METHOD
2.1. Design and notation
Throughout this section I assume the data have been collected using a balanced test×reader×case factorial study design. This commonly used diagnostic-radiology study design specifies that each case be subjected to each test, with the resulting images evaluated once by each reader. In addition, each case is classified as diseased or nondiseased according to an available reference standard. Typically the number of cases is 25–200 while the number of readers is 3–15. Let Zijk denote a confidence-of-disease rating assigned to the kth case by the jth reader using the ith test. For example, often a five-level ordinal integer scale or a quasi-continuous 0% to 100% confidence scale is used. The observed rating data consist of the Zijk, with i = 1, …, t, j = 1, …, r, k = 1, …, c, where t is the number of tests, r the number of readers, and c the number of cases.
2.2. Model and test statistic
Let θ̂ij denote the AUC estimate (or other ROC-curve accuracy estimate) for the ith test and jth reader. Obuchowski and Rockette [1] use a test × reader factorial ANOVA model for the AUC estimates, but unlike a conventional ANOVA model they allow the errors to be correlated to account for correlation due to each reader evaluating the same cases. Their model, which I refer to as the OR model, can be written as
$$\hat{\theta}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \varepsilon_{ij}, \tag{1}$$
i = 1, …, t, j = 1, …, r, where τi denotes the fixed effect of test i, Rj denotes the random effect of reader j, (τR)ij denotes the random test × reader interaction, and εij is the error term. Without loss of generality I assume $\sum_{i=1}^{t} \tau_i = 0$. The Rj and (τR)ij are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$ and $\sigma_{TR}^2$. The εij are assumed to be normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are assumed independent of the Rj and (τR)ij. Equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances given by
$$\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = \begin{cases} \mathrm{Cov}_1 & i \ne i',\ j = j' \text{ (different test, same reader)} \\ \mathrm{Cov}_2 & i = i',\ j \ne j' \text{ (same test, different reader)} \\ \mathrm{Cov}_3 & i \ne i',\ j \ne j' \text{ (different test, different reader).} \end{cases}$$
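It follows from model (1) that $\sigma_\varepsilon^2$, Cov1, Cov2, and Cov3 are also the variance and corresponding covariances of the AUC estimates, conditional on the reader and test × reader effects. Based on clinical considerations Obuchowski and Rockette [1] suggest the following ordering for the covariances:
$$\mathrm{Cov}_1 \ge \mathrm{Cov}_2 \ge \mathrm{Cov}_3 \ge 0. \tag{2}$$
In Section 3.4 I show that these constraints can be replaced by the less restrictive constraints
$$\mathrm{Cov}_1 \ge \mathrm{Cov}_3, \quad \mathrm{Cov}_2 \ge \mathrm{Cov}_3, \quad \mathrm{Cov}_3 \ge 0. \tag{3}$$
Alternatively, the model can be described in terms of the error correlations, defined by $r_i = \mathrm{Cov}_i / \sigma_\varepsilon^2$, i = 1, 2, 3.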
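When Cov2 and Cov3 are known, the OR statistic for testing the null hypothesis of no test effect (H0: τi = 0; i = 1, …, t) is given by
$$\tilde{F} = \frac{MS(T)}{MS(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)}, \tag{4}$$
where MS(T) and MS(T * R) are the test and test × reader mean squares; i.e., $MS(T) = \frac{r}{t-1}\sum_{i=1}^{t}(\hat{\theta}_{i\bullet} - \hat{\theta}_{\bullet\bullet})^2$ and $MS(T*R) = \frac{1}{(t-1)(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}(\hat{\theta}_{ij} - \hat{\theta}_{i\bullet} - \hat{\theta}_{\bullet j} + \hat{\theta}_{\bullet\bullet})^2$. A subscript replaced by a dot indicates that values are averaged across the missing subscript index; for example, $\hat{\theta}_{i\bullet} = \frac{1}{r}\sum_{j=1}^{r}\hat{\theta}_{ij}$.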
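In practice the statistic actually used is
$$F_{OR} = \frac{MS(T)}{MS(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)}, \tag{5}$$
where $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ denote estimates for Cov2 and Cov3, respectively. Note that (5) incorporates the constraints specified by (3) by setting $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ to zero if it is negative. Since Cov2 and Cov3 are also the corresponding covariances of the AUC estimates conditional on the reader and test × reader effects, they can be estimated using methods that treat cases as random but readers as fixed, such as jackknifing, bootstrapping, parametric methods, or the method proposed by DeLong et al. [7] for trapezoidal-rule (or empirical) AUC estimates [8]. The OR estimates obtained from averaging corresponding fixed-reader AUC variances and covariances are denoted by $\hat{\sigma}_\varepsilon^2$, $\widehat{\mathrm{Cov}}_1$, $\widehat{\mathrm{Cov}}_2$, and $\widehat{\mathrm{Cov}}_3$. Hillis [6] shows that FOR has an approximate Ft−1;ddfH null distribution, where
$$ddf_H = \frac{\left[MS(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)\right]^2}{\left[MS(T*R)\right]^2 / \left[(t-1)(r-1)\right]}. \tag{6}$$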
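More generally, FOR has an Ft−1,df2;λ distribution where $df_2 = \frac{\left[\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]^2}{\left[\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\right]^2 / \left[(t-1)(r-1)\right]}$ and $\lambda = \frac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)}$.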
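The computation described in this subsection can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the statistic has the form MS(T) divided by MS(T * R) plus r times the truncated difference of the covariance estimates, with the matching Satterthwaite-type denominator degrees of freedom, and the function name `or_f_statistic` and its argument names are hypothetical. The covariance estimates are taken as given (for example, from jackknifing).

```python
def or_f_statistic(auc, cov2_hat, cov3_hat):
    """Sketch of the OR test for a balanced test-by-reader table.

    auc[i][j] is the accuracy estimate for test i and reader j.
    cov2_hat, cov3_hat are externally supplied covariance estimates.
    Returns (F_OR, numerator df, denominator df ddf_H).
    """
    t = len(auc)
    r = len(auc[0])
    test_means = [sum(row) / r for row in auc]
    reader_means = [sum(auc[i][j] for i in range(t)) / t for j in range(r)]
    grand_mean = sum(test_means) / t

    # Test mean square, MS(T)
    ms_t = r * sum((m - grand_mean) ** 2 for m in test_means) / (t - 1)

    # Test-by-reader mean square, MS(T*R)
    ss_tr = sum(
        (auc[i][j] - test_means[i] - reader_means[j] + grand_mean) ** 2
        for i in range(t)
        for j in range(r)
    )
    df_tr = (t - 1) * (r - 1)
    ms_tr = ss_tr / df_tr

    # Truncate the covariance difference at zero (assumed constraint handling)
    denom = ms_tr + r * max(cov2_hat - cov3_hat, 0.0)
    f_or = ms_t / denom
    ddf_h = denom ** 2 / (ms_tr ** 2 / df_tr)
    return f_or, t - 1, ddf_h


# Made-up AUC values for two tests and four readers (not study data)
example_auc = [
    [0.80, 0.85, 0.78, 0.82],
    [0.84, 0.88, 0.83, 0.86],
]
f_or, df1, ddf_h = or_f_statistic(example_auc, cov2_hat=0.0005, cov3_hat=0.0003)
```

When the estimated covariance difference is negative, the truncation makes the denominator equal to MS(T * R), so the denominator degrees of freedom reduce to (t − 1)(r − 1), as in a conventional two-way ANOVA.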
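Letting θi denote the expected reader performance measure for test i (i.e., θi = E(θ̂i•)), an approximate (1 − α) 100% confidence interval for contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat{\theta}_{i\bullet} \pm t_{\alpha/2;\,ddf_H}\sqrt{\hat{V}}$ where $\hat{V} = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left[MS(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)\right]$. An approximate (1 − α) 100% confidence interval for θi, using a standard error computed from all of the data, is given by $\hat{\theta}_{i\bullet} \pm t_{\alpha/2;\,df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{tr}\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{\mathrm{Cov}}_2,\, 0)\right]$ and $df_2 = \frac{\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{\mathrm{Cov}}_2,\, 0)\right]^2}{[MS(R)]^2/(r-1) + [(t-1)MS(T*R)]^2/[(t-1)(r-1)]}$. Alternatively, an approximate (1 − α) 100% confidence interval for test i, using a standard error computed only from data for test i, is given by $\hat{\theta}_{i\bullet} \pm t_{\alpha/2;\,df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}\left[MS(R)_{(i)} + r\max(\widehat{\mathrm{Cov}}_{2(i)},\, 0)\right]$ and $df_2 = \frac{\left[MS(R)_{(i)} + r\max(\widehat{\mathrm{Cov}}_{2(i)},\, 0)\right]^2}{[MS(R)_{(i)}]^2/(r-1)}$; here MS (R)(i) and $\widehat{\mathrm{Cov}}_{2(i)}$ are computed only from test i data. I recommend this latter formula for single AUC confidence intervals, since it does not depend on assuming equal error covariances and variances for each test. All of these results have been previously presented [6].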
Expected mean squares are given in Table 1a; proofs for these results are given by Hillis [6]. Expressions for the variance components, in terms of the expected mean squares and covariances, are presented in Table 1b; these relationships follow directly from Table 1a. Estimated variance components result by replacing expected mean squares by mean squares and covariance parameters by estimates; for example, $\hat{\sigma}_R^2 = \frac{1}{t}\left[MS(R) - MS(T*R)\right] - \widehat{\mathrm{Cov}}_1 + \widehat{\mathrm{Cov}}_3$.
Typically the variance component estimates are changed to zero if the computed values are negative.
Table 1.
2.3. Real-data example
To illustrate the OR method for the factorial design, I compare reader AUCs for hard- and soft-copy computed radiography chest images selected randomly from a medical intensive care unit. In the study [9] four radiologists blindly read both hard- and soft-copy images obtained with computed radiography from the same patients. Six months separated the end of the hard-copy readings and the start of the soft-copy readings. A five-point ordinal scale was used to rate the likelihood of presence of the condition (which I will refer to as “disease”) implied by the reason for requesting the corresponding examination. Ninety-five images, consisting of 29 diseased and 66 nondiseased images, were read under each test condition.
The analysis of this study using empirical AUC estimates and jackknife covariance estimates is displayed in Table 2. The AUCs for soft- and hard-copy images, averaged across the four readers, are 0.804 and 0.841, respectively. The test for the null hypothesis of no test effect (i.e., the population average AUC across readers is the same for soft- and hard-copy images) is not significant (FOR = 6.01, ddfH = 3, p = .092); the 95% confidence interval for the difference of the population AUCs (hard- minus soft-copy) is (−0.011, 0.086). Parts (i) and (j) give 95% confidence intervals for the single-test AUCs, based on all of the data and only on data for the specific test, respectively. The confidence intervals from the two methods are similar; this is expected because the AUCs are similar.
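For readers unfamiliar with the empirical (trapezoidal-rule) AUC used in this analysis, the following sketch shows how one reader's estimate for one test could be computed from ordinal ratings. It uses the standard Mann-Whitney form, in which ties between a diseased and a nondiseased rating count one half. The function name `empirical_auc` and the ratings are hypothetical and are not the study data.

```python
def empirical_auc(diseased, nondiseased):
    """Empirical AUC: proportion of (diseased, nondiseased) rating pairs
    in which the diseased case is rated higher, counting ties as 1/2."""
    total = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                total += 1.0
            elif d == n:
                total += 0.5
    return total / (len(diseased) * len(nondiseased))


# Hypothetical five-point ratings for one reader under one test
diseased_ratings = [3, 4, 5, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]
auc_hat = empirical_auc(diseased_ratings, nondiseased_ratings)
```

Repeating this for every test and reader yields the table of θ̂ij values to which the OR model is fitted.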
Table 2.
Although this study showed a nonsignificant difference between soft- and hard-copy image reader performance, the confidence interval for the difference of the AUCs showed a difference as large as 0.086 to be commensurate with the data. In such a situation, the researcher may decide to design a future study that would produce a more precise estimate of the difference. Increased precision could result from an increase in the number of cases, the number of readers, or from replicated readings where each reader reads each image 2 or more times. If increasing the number of cases and readers is not feasible, then a replicated study is a natural choice for increasing power; however, OR analysis methodology has been developed only for the nonreplicated factorial design. I use the algorithm described in this paper to derive the OR-type procedure for the replicated factorial design, including the test-statistic nonnull distribution, which allows for power and sample size estimation. Using this result, I illustrate efficiency computations comparing the nonreplicated and replicated designs in Section 5.6.
In this study the same radiologists also similarly rated 95 hard-copy chest images obtained with screen-film; these images were from different patients than the computed radiographs. Because the original OR method assumes a factorial study design with readers reading the same cases under each test, it cannot be used to compare the screen-film AUC outcomes with the AUC outcomes from either the soft- or hard-copy computed radiograph images. In Section 5.3 I show how the OR approach can be adapted for this situation, which represents a split-plot study design with cases nested within test, and illustrate the analysis of these data.
2.4. Previous derivations of OR properties
OR-procedure properties have previously been derived starting with the OR model (1, 2). For what is essentially the OR model, Pavur and Nath [10] show that, for testing the null hypothesis of equal tests, the F statistic that is appropriate when the errors are independent can be used if corrected by a multiplicative factor. The multiplicative factor is a function of the correlations, which are assumed known, and the distribution for this corrected F statistic is the same as for the uncorrected F statistic when the errors are independent. The approach taken by Obuchowski and Rockette [1] was to modify this result by replacing the assumed-known correlations by estimated correlations. This approach yielded valid ANOVA statistics but unsatisfactory degrees of freedom, resulting in overly conservative tests [6]. Alternatively, Hillis [6] directly derived properties, but the proofs are tedious and nonintuitive.
3. MM-ANOVA APPROACH – STEP 1: DERIVE THE MM-ANOVA MODEL
In Sections 3–4 and Appendices A–C I show how the properties of the OR model can easily be derived using an algorithm, based on the mm-ANOVA approach, that only requires knowing how to determine conventional ANOVA test statistics and expected mean squares. I describe and illustrate the steps in the algorithm for the typical balanced test×reader×case study design discussed in the previous section. The steps are stated in a general form so that they can be applied to other balanced study designs. The mm-ANOVA approach and corresponding algorithm have not been previously described and are the main contribution of this paper.
3.1. Step 1a: Define the conventional ANOVA model that corresponds to the study design as if each reader-performance measure was the mean of case outcomes
Let Yijk denote a hypothetical outcome for test i, reader j, and case k. For our purposes Yijk is used only to illustrate the marginal-mean ANOVA approach; i.e., it does not represent an actual study outcome and should be distinguished from the observed rating Zijk. I assume that the Yijk follow a three-way conventional ANOVA model that corresponds to the study design.
Thus the distribution of Yijk is given by the following test × reader × case ANOVA model that treats test as a fixed factor and reader and case as random factors:
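$$Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + (\tau RC)_{ijk} + \varepsilon_{ijk}, \tag{7}$$
i = 1, …, t, j = 1, …, r, k = 1, …, c, where τi denotes the fixed effect of test i with $\sum_{i=1}^{t}\tau_i = 0$, Rj denotes the random effect of reader j, Ck denotes the random effect of case k, the multiple symbols in parentheses denote random interactions, and εijk is the error term. The random effects are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$, $\sigma_C^2$, $\sigma_{TR}^2$, $\sigma_{TC}^2$, $\sigma_{RC}^2$, $\sigma_{TRC}^2$, and $\sigma_\varepsilon^2$. Because there are no replications, for estimation purposes $\sigma_{TRC}^2$ and $\sigma_\varepsilon^2$ are inseparable; hence I define
$$\sigma^2 = \sigma_{TRC}^2 + \sigma_\varepsilon^2.$$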
Results for this model, such as mean square distributional properties and ANOVA test statistics, are well known (e.g., [11]) and will be stated without references.
3.2. Step 1b: From the conventional ANOVA model defined in step 1a, derive the mm-ANOVA model by averaging across cases and defining the mm-ANOVA model error term equal to the mean, across cases, of the sum of the conventional ANOVA model error term and random effects involving case
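I say that a random effect “involves case” if it is subscripted according to case. Let Ỹij denote the marginal mean resulting from averaging over cases; i.e.,
$$\tilde{Y}_{ij} = Y_{ij\bullet} = \frac{1}{c}\sum_{k=1}^{c} Y_{ijk}. \tag{8}$$
I use the term marginal-mean ANOVA model (mm-ANOVA model) to refer to the model implied by the conventional 3-way ANOVA model (7) for the marginal mean (8). It follows from (7) that
$$\tilde{Y}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \tilde{\varepsilon}_{ij}, \tag{9}$$
where
$$\tilde{\varepsilon}_{ij} = C_{\bullet} + (\tau C)_{i\bullet} + (RC)_{j\bullet} + (\tau RC)_{ij\bullet} + \varepsilon_{ij\bullet}, \tag{10}$$
the Rj and (τR)ij are mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$ and $\sigma_{TR}^2$, and the ε̃ij are independent of the Rj and (τR)ij.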
3.3. Step 1c: Express the mm-ANOVA model error variance and covariances in terms of the conventional ANOVA model variance components
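From (10) it follows that the ε̃ij are normally distributed with mean 0, variance
$$\sigma_{\tilde{\varepsilon}}^2 \equiv \mathrm{Var}(\tilde{\varepsilon}_{ij}) = \frac{1}{c}\left(\sigma_C^2 + \sigma_{TC}^2 + \sigma_{RC}^2 + \sigma^2\right), \tag{11}$$
and equi-correlated with
$$\mathrm{Cov}_1 \equiv \mathrm{Cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{i'j}) = \frac{1}{c}\left(\sigma_C^2 + \sigma_{RC}^2\right), \tag{12}$$
$$\mathrm{Cov}_2 \equiv \mathrm{Cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{ij'}) = \frac{1}{c}\left(\sigma_C^2 + \sigma_{TC}^2\right), \tag{13}$$
and
$$\mathrm{Cov}_3 \equiv \mathrm{Cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{i'j'}) = \frac{1}{c}\sigma_C^2, \tag{14}$$
where i ≠ i′ and j ≠ j′.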
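As a worked check of step 1c, the covariance for the same test and two different readers can be written out term by term. The sketch below assumes, following the wording of step 1b, that the mm-ANOVA error term is the case average of the conventional error term plus every random effect subscripted by case; the symbols for the variance components are notational assumptions.

```latex
\begin{align*}
\tilde{\varepsilon}_{ij}
  &= C_{\bullet} + (\tau C)_{i\bullet} + (RC)_{j\bullet}
     + (\tau RC)_{ij\bullet} + \varepsilon_{ij\bullet}, \\
\mathrm{Cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{ij'})
  &= \mathrm{Var}(C_{\bullet}) + \mathrm{Var}\bigl((\tau C)_{i\bullet}\bigr)
     \qquad (j \ne j') \\
  &= \frac{\sigma_C^2}{c} + \frac{\sigma_{TC}^2}{c}
   = \mathrm{Cov}_2 .
\end{align*}
```

Only the terms that do not carry the reader subscript are shared by the two error terms; every other pair of terms is independent and contributes zero. The same argument with the roles of test and reader exchanged gives Cov1, and keeping only the case main effect gives Cov3.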
3.4. Step 1d: Determine the mm-ANOVA model covariance constraints implied by step 1c
The covariance constraints given by (3) follow from (12–14). Thus the mm-ANOVA model for Ỹij is defined by (9) and (3). It also follows from (11–14) that $\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3 \ge 0$, but I do not include this constraint as part of the definition of the mm-ANOVA model because this constraint is implied by the relationship Var(ε̃11 − ε̃12 − ε̃21 + ε̃22) ≥ 0.
3.5. Remarks
3.5.1. One-to-one relationship between parameters of the 3-way conventional ANOVA and corresponding mm-ANOVA models
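In terms of the mm-ANOVA model parameters (μ, τi, $\sigma_R^2$, $\sigma_{TR}^2$, $\sigma_{\tilde{\varepsilon}}^2$, Cov1, Cov2, and Cov3), the parameters for the corresponding three-way ANOVA model (7) are given by μ, τi, $\sigma_R^2$, $\sigma_{TR}^2$, $\sigma_C^2 = c\,\mathrm{Cov}_3$, $\sigma_{TC}^2 = c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, $\sigma_{RC}^2 = c(\mathrm{Cov}_1 - \mathrm{Cov}_3)$, and $\sigma^2 = c(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$. Thus there is a one-to-one relationship between the parameters of the two models. Hence for any mm-ANOVA model, defined by (9) and (3), there is a corresponding conventional 3-way ANOVA model (7) that implies that model for the marginal means. These relationships between the two models are presented in Table 3.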
Table 3.
3-way ANOVA parameter | Equivalent function of mm-ANOVA parameters |
---|---|
μ | = μ |
τi | = τi |
$\sigma_C^2$ | = c Cov3 |
$\sigma_{TC}^2$ | = c (Cov2 − Cov3) |
$\sigma_{RC}^2$ | = c (Cov1 − Cov3) |
mm-ANOVA parameter | Equivalent function of 3-way ANOVA parameters |
μ | = μ |
τi | = τi |
Cov1 | = $(\sigma_C^2 + \sigma_{RC}^2)/c$ |
Cov2 | = $(\sigma_C^2 + \sigma_{TC}^2)/c$ |
Cov3 | = $\sigma_C^2/c$ |
These relationships assume covariance constraints (3) for the mm-ANOVA model and the same linear constraints for the τi (i.e., ∑ τi = 0) for both models.
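The one-to-one relationship can be illustrated with a short round-trip computation. This is a sketch under stated assumptions: the forward map takes each mm-ANOVA error covariance to be 1/c times the sum of the case-related variance components shared by the two error terms, and the inverse map solves those equations. The function and argument names are hypothetical, and `var_resid` stands for the combined three-way interaction and error variance.

```python
def mm_from_three_way(var_c, var_tc, var_rc, var_resid, c):
    """Three-way variance components -> (var_err, cov1, cov2, cov3)."""
    var_err = (var_c + var_tc + var_rc + var_resid) / c
    cov1 = (var_c + var_rc) / c   # different test, same reader
    cov2 = (var_c + var_tc) / c   # same test, different reader
    cov3 = var_c / c              # different test, different reader
    return var_err, cov1, cov2, cov3


def three_way_from_mm(var_err, cov1, cov2, cov3, c):
    """mm-ANOVA error parameters -> (var_c, var_tc, var_rc, var_resid).

    Raises ValueError if the covariance constraints are violated, since
    the implied variance components would then be negative.
    """
    if not (cov1 >= cov3 and cov2 >= cov3 and cov3 >= 0):
        raise ValueError("need Cov1 >= Cov3, Cov2 >= Cov3 and Cov3 >= 0")
    var_c = c * cov3
    var_tc = c * (cov2 - cov3)
    var_rc = c * (cov1 - cov3)
    var_resid = c * (var_err - cov1 - cov2 + cov3)
    return var_c, var_tc, var_rc, var_resid


# Round trip with made-up variance components and c = 100 cases
original = (0.05, 0.01, 0.02, 0.20)
mm_params = mm_from_three_way(*original, c=100)
recovered = three_way_from_mm(*mm_params, c=100)
```

Up to floating-point rounding, `recovered` reproduces `original`, which is the sense in which each constrained mm-ANOVA model corresponds to exactly one conventional three-way model.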
3.5.2. Equivalence of the OR and mm-ANOVA models
Note that the mm-ANOVA model (9, 3) has the same form as the OR model (1, 2), with the only difference being that the mm-ANOVA model covariance constraints (3) are less restrictive. Since the OR covariance constraints (2) were suggested by Obuchowski and Rockette [1] based only on clinical considerations, to simplify comparison of the models I now modify the definition of the OR model to include the less restrictive mm-ANOVA model constraints (3); i.e., the OR model is now considered to be defined by equations (1) and (3). With this change the OR and mm-ANOVA models become equivalent.
3.5.3. Definition of the mm-ANOVA approach
Because the OR and mm-ANOVA models are identical, statistical properties for the ROC accuracy estimates, the θ̂ij, are the same as for the marginal means, the Ỹij, for an mm-ANOVA model having the same parameter values as the OR model. The mm-ANOVA approach consists of deriving statistical properties for the OR model (1, 3) by recognizing that it is equivalent to the mm-ANOVA model (9, 3), and then deriving properties of the mm-ANOVA model by utilizing its relationship with the conventional three-way ANOVA model. The advantage of this approach is that properties of the conventional three-way ANOVA model are well known.
3.5.4. Motivation for the OR model
The mm-ANOVA approach provides an intuitive motivation for the OR model (1, 3) as follows. Suppose, hypothetically, that the reader performance outcome θ̂ij is the mean of case-specific outcomes; that is, suppose that θ̂ij = Yij• for some outcome Yijk, with k = 1, …, c. A typical way to account for variation in θ̂ij due to readers and cases would be to assume the three-way ANOVA model (7), which implies the mm-ANOVA model (9, 3) and hence also the equivalent OR model (1, 3) for θ̂ij. Of course, in practice θ̂ij is not a marginal mean, but rather a nonlinear function of the case-specific confidence-of-disease ratings and truth-state (i.e., reference standard) indicator values. However, the mm-ANOVA approach shows that the OR model accounts for reader and case variation using the covariance structure implied by a conventional three-way ANOVA model, as if the accuracy estimate was a marginal mean.
4. MM-ANOVA APPROACH – STEP 2: DERIVE THE MM-ANOVA MODEL TEST STATISTIC AND ITS NULL DISTRIBUTION FOR A HYPOTHESIS EXPRESSED IN TERMS OF TEST ACCURACIES
In this section I show how to derive the mm-ANOVA model test statistic and its null distribution for testing the null hypothesis of equal test accuracies. I define test accuracy as the expected reader-performance measure for a particular test level. However, more generally these steps can be applied to any hypothesis that can be expressed in terms of linear functions of expected reader-performance outcomes.
4.1. Step 2a: State the hypothesis of interest in terms of the mm-ANOVA model
For the mm-ANOVA model (9, 3) let θi denote the test accuracy for test i; i.e., θi = E (Ỹi•) is the expected reader-performance outcome for test i across the population of readers. The hypothesis of interest is the global null hypothesis of equal test accuracies, i.e., H0 : θ1 = … = θt, or equivalently, H0 : τ1 = … = τt = 0.
4.2. Step 2b: Express the hypothesis from step 2a in terms of the conventional ANOVA model
Noting that
$$\theta_i = E(\tilde{Y}_{i\bullet}) = E(Y_{i\bullet\bullet}) = \mu + \tau_i,$$
it follows that H0 : θ1 = … = θt is equivalent to H0 : τ1 = … = τt = 0 for the conventional ANOVA model (7).
4.3. Step 2c: Create the expected-mean-square table for the conventional ANOVA model
Let MS(T), MS(R), and MS(C) denote the conventional ANOVA mean squares due to test, reader, and case, respectively, with interaction mean squares notated in the usual manner. The expected mean squares for the conventional ANOVA model are presented in Table 4. These relationships will be utilized in other steps.
Table 4.
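Mean square | Expected mean square |
---|---|
MS (T) | $\frac{rc}{t-1}\sum_{i=1}^{t}\tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$ |
MS (R) | $tc\sigma_R^2 + c\sigma_{TR}^2 + t\sigma_{RC}^2 + \sigma^2$ |
MS (C) | $tr\sigma_C^2 + r\sigma_{TC}^2 + t\sigma_{RC}^2 + \sigma^2$ |
MS (T * R) | $c\sigma_{TR}^2 + \sigma^2$ |
MS (T * C) | $r\sigma_{TC}^2 + \sigma^2$ |
MS (R * C) | $t\sigma_{RC}^2 + \sigma^2$ |
MS (T * R * C) | $\sigma^2$ |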
4.4. Step 2d: Determine the conventional ANOVA F statistic corresponding to the step 2b hypothesis
The conventional ANOVA test statistic for testing H0 : τ1 = … = τt = 0 is given by
$$F = \frac{MS(T)}{MS(T*R) + MS(T*C) - MS(T*R*C)}. \tag{15}$$
I refer to F as an ANOVA statistic because its numerator and denominator have the same expectation under H0, but the numerator has a larger expectation than the denominator under H1 : τi ≠ τj for some i ≠ j.
4.5. Step 2e: Express mm-ANOVA mean squares in terms of conventional ANOVA mean squares
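For the mm-ANOVA model let $\widetilde{MS}(T)$, $\widetilde{MS}(R)$, and $\widetilde{MS}(T*R)$ denote the test, reader, and test×reader mean squares; i.e., $\widetilde{MS}(T) = \frac{r}{t-1}\sum_{i=1}^{t}(\tilde{Y}_{i\bullet} - \tilde{Y}_{\bullet\bullet})^2$ and $\widetilde{MS}(T*R) = \frac{1}{(t-1)(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}(\tilde{Y}_{ij} - \tilde{Y}_{i\bullet} - \tilde{Y}_{\bullet j} + \tilde{Y}_{\bullet\bullet})^2$. Noting that $\tilde{Y}_{ij} = Y_{ij\bullet}$, it follows that
$$\widetilde{MS}(T) = \frac{1}{c}MS(T), \tag{16}$$
$$\widetilde{MS}(T*R) = \frac{1}{c}MS(T*R). \tag{17}$$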
4.6. Step 2f: Express F from step 2d in terms of mm-ANOVA model mean squares and U, where U is a linear function of conventional ANOVA model mean squares that involve case
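It follows from (16–17) that (15) can be written in the form
$$F = \frac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + U}, \tag{18}$$
where
$$U = \frac{1}{c}\left[MS(T*C) - MS(T*R*C)\right].$$
Note that U is a linear function of conventional ANOVA model mean squares involving case and (18) is an ANOVA statistic.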
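The algebra in steps 2d to 2f can be checked numerically. The sketch below simulates a small balanced test-by-reader-by-case array, computes the conventional mean squares, and confirms that the conventional statistic with denominator MS(T * R) + MS(T * C) − MS(T * R * C) equals the statistic built from the marginal-mean mean squares plus U. The simulated effect sizes, the seed, and all function names are hypothetical and serve only this illustration.

```python
import random
from itertools import product


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def three_way_mean_squares(y):
    """Conventional mean squares for balanced y[i][j][k] (test, reader, case)."""
    t, r, c = len(y), len(y[0]), len(y[0][0])
    cells = list(product(range(t), range(r), range(c)))
    g = mean(y[i][j][k] for i, j, k in cells)
    yi = [mean(y[i][j][k] for j in range(r) for k in range(c)) for i in range(t)]
    yj = [mean(y[i][j][k] for i in range(t) for k in range(c)) for j in range(r)]
    yk = [mean(y[i][j][k] for i in range(t) for j in range(r)) for k in range(c)]
    yij = [[mean(y[i][j]) for j in range(r)] for i in range(t)]
    yik = [[mean(y[i][j][k] for j in range(r)) for k in range(c)] for i in range(t)]
    yjk = [[mean(y[i][j][k] for i in range(t)) for k in range(c)] for j in range(r)]
    ms_t = r * c * sum((yi[i] - g) ** 2 for i in range(t)) / (t - 1)
    ms_tr = c * sum((yij[i][j] - yi[i] - yj[j] + g) ** 2
                    for i in range(t) for j in range(r)) / ((t - 1) * (r - 1))
    ms_tc = r * sum((yik[i][k] - yi[i] - yk[k] + g) ** 2
                    for i in range(t) for k in range(c)) / ((t - 1) * (c - 1))
    ms_trc = sum((y[i][j][k] - yij[i][j] - yik[i][k] - yjk[j][k]
                  + yi[i] + yj[j] + yk[k] - g) ** 2
                 for i, j, k in cells) / ((t - 1) * (r - 1) * (c - 1))
    return ms_t, ms_tr, ms_tc, ms_trc


def mm_mean_squares(y):
    """Mean squares of the marginal means (averages over cases)."""
    t, r = len(y), len(y[0])
    m = [[mean(y[i][j]) for j in range(r)] for i in range(t)]
    mi = [mean(m[i]) for i in range(t)]
    mj = [mean(m[i][j] for i in range(t)) for j in range(r)]
    g = mean(mi)
    mm_t = r * sum((mi[i] - g) ** 2 for i in range(t)) / (t - 1)
    mm_tr = sum((m[i][j] - mi[i] - mj[j] + g) ** 2
                for i in range(t) for j in range(r)) / ((t - 1) * (r - 1))
    return mm_t, mm_tr


def check_equivalence(seed=0, t=2, r=4, c=6):
    """Simulate data and return both forms of the F statistic."""
    rng = random.Random(seed)
    reader = [rng.gauss(0, 0.5) for _ in range(r)]
    case = [rng.gauss(0, 1.0) for _ in range(c)]
    tr = [[rng.gauss(0, 1.0) for _ in range(r)] for _ in range(t)]
    tc = [[rng.gauss(0, 1.0) for _ in range(c)] for _ in range(t)]
    y = [[[0.3 * i + reader[j] + case[k] + tr[i][j] + tc[i][k] + rng.gauss(0, 0.2)
           for k in range(c)] for j in range(r)] for i in range(t)]
    ms_t, ms_tr, ms_tc, ms_trc = three_way_mean_squares(y)
    mm_t, mm_tr = mm_mean_squares(y)
    u = (ms_tc - ms_trc) / c
    return {
        "c": c, "ms_t": ms_t, "ms_tr": ms_tr, "mm_t": mm_t, "mm_tr": mm_tr,
        "f_conventional": ms_t / (ms_tr + ms_tc - ms_trc),
        "f_marginal": mm_t / (mm_tr + u),
    }


result = check_equivalence()
```

The two entries `f_conventional` and `f_marginal` agree up to rounding because each marginal-mean mean square is the corresponding conventional mean square divided by c.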
4.7. Step 2g: Express E (U) in terms of conventional ANOVA model variance components, and then in terms of mm-ANOVA model error covariance parameters using the relationships from step 1c
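From Table 4 we have $E[MS(T*C)] = r\sigma_{TC}^2 + \sigma^2$ and E [MS (T * R * C)] = σ2. It follows that
$$E(U) = \frac{1}{c}\left[r\sigma_{TC}^2 + \sigma^2 - \sigma^2\right] = \frac{r}{c}\sigma_{TC}^2. \tag{19}$$
Using (13) and (14) we can write the right side of (19) in terms of the mm-ANOVA covariances: $\frac{1}{c}\sigma_{TC}^2 = \mathrm{Cov}_2 - \mathrm{Cov}_3$. Hence
$$E(U) = r(\mathrm{Cov}_2 - \mathrm{Cov}_3). \tag{20}$$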
4.8. Step 2h: Modify F (18) from step 2f to produce the mm-ANOVA statistic by replacing U by E (U), expressed as a linear function of mm-ANOVA covariance parameters
Replacing U in equation (18) by its expectation (20) results in
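$$\tilde{F} = \frac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)}, \tag{21}$$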
which is the OR test statistic (4) when we treat the Ỹij as the OR model outcomes θ̂ij. Because (18) is an ANOVA statistic, it follows that (21) is also an ANOVA statistic.
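4.9. Step 2i: Derive FOR by replacing covariance parameters in $\tilde{F}$ by estimates that take into account the constraints from step 1d
An obvious estimate of Cov2−Cov3 that takes into account covariance constraints (3) is given by $\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)$, where $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ are estimates as discussed in Section 2.2. Replacing Cov2−Cov3 in (21) by this estimate results in
$$F_{OR} = \frac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)}, \tag{22}$$
which is the OR statistic FOR (5) when we replace the Ỹij by the OR model outcomes θ̂ij.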
4.10. Step 2j: Determine the approximate null distribution of FOR
Null-distribution result
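Write the denominator of FOR in the form
$$\sum_{i=1}^{I} a_i\,\widetilde{MS}_i + b\,\hat{d}, \tag{23}$$
where the $\widetilde{MS}_i$, i = 1, …, I are mm-ANOVA mean squares, d̂ is a function of the covariance parameter estimates and the ai and b are constants. Then FOR will have an approximate Fdf1,df2 null distribution, where df1 is the numerator degrees of freedom for the conventional ANOVA model test statistic in step 2d and df2 is given by
$$df_2 = \frac{\left[\sum_{i=1}^{I} a_i\,\widetilde{MS}_i + b\,\hat{d}\right]^2}{\sum_{i=1}^{I}\left(a_i\,\widetilde{MS}_i\right)^2 / df_i}, \tag{24}$$
where $df_i$ is the degrees of freedom for $\widetilde{MS}_i$, and hence also for MSi. I have stated this result generally so that it can be easily applied to other designs. See Appendix A for a derivation of this result.
To apply this result to the balanced test×reader×case factorial study design, note that the denominator of FOR (22) is given by (23) with I = 1, a1 = 1, b = 1, $\widetilde{MS}_1 = \widetilde{MS}(T*R)$, and $\hat{d} = r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)$. Using (24), the null-distribution result states that FOR (22) has an approximate Ft−1,df2 null distribution, where
$$df_2 = \frac{\left[\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0)\right]^2}{\left[\widetilde{MS}(T*R)\right]^2 / \left[(t-1)(r-1)\right]}. \tag{25}$$
Note that the equation for df2 (25), with Ỹij replaced by θ̂ij, is the same as the equation for ddfH (6) for the OR model.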
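Because the null-distribution result is stated for a general denominator, it lends itself to a small helper that can be reused across designs. The sketch below assumes a Satterthwaite-type form: the squared denominator divided by the sum of squared mean-square terms, each divided by its degrees of freedom, with the covariance term contributing no degrees-of-freedom component. The function name `denominator_df` and the numerical inputs are hypothetical.

```python
def denominator_df(mean_squares, dfs, coeffs, b, d_hat):
    """Approximate denominator df for a denominator of the form
    sum_i coeffs[i] * mean_squares[i] + b * d_hat.

    mean_squares: mm-ANOVA mean squares appearing in the denominator
    dfs:          their degrees of freedom
    coeffs:       the constants a_i
    b, d_hat:     constant and covariance-based term (treated as fixed)
    """
    denom = sum(a * ms for a, ms in zip(coeffs, mean_squares)) + b * d_hat
    spread = sum((a * ms) ** 2 / df for a, ms, df in zip(coeffs, mean_squares, dfs))
    return denom ** 2 / spread


# Factorial design with t = 2 tests and r = 4 readers (made-up numbers):
t, r = 2, 4
ms_tr = 0.0004
cov2_hat, cov3_hat = 0.0005, 0.0004
d_hat = r * max(cov2_hat - cov3_hat, 0.0)
df2 = denominator_df([ms_tr], [(t - 1) * (r - 1)], [1.0], 1.0, d_hat)
```

With these inputs the covariance term doubles the denominator, so the degrees of freedom are four times the conventional (t − 1)(r − 1) = 3.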
4.11. Remark: Derivation of mm-ANOVA expected mean square and variance component expressions
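For the mm-ANOVA model an expected mean square table, such as Table 1a, can be created as follows. Write the mm-ANOVA expected mean squares in terms of the conventional ANOVA variance components and fixed effects using the relationships given in steps 2c and 2e. For example, for the factorial model we have
$$E[\widetilde{MS}(T)] = \frac{1}{c}E[MS(T)] = \frac{r}{t-1}\sum_{i=1}^{t}\tau_i^2 + \sigma_{TR}^2 + \frac{r}{c}\sigma_{TC}^2 + \frac{1}{c}\sigma^2. \tag{26}$$
From step 1c it follows that the conventional ANOVA variance components in (26) involving case (i.e., the corresponding random effects are subscripted according to case) can be written in terms of the mm-ANOVA covariances: $\frac{1}{c}\sigma_{TC}^2 = \mathrm{Cov}_2 - \mathrm{Cov}_3$ and $\frac{1}{c}\sigma^2 = \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3$. Replacing these variance components in (26) by their corresponding mm-ANOVA covariance expressions yields $E[\widetilde{MS}(T)] = \frac{r}{t-1}\sum_{i=1}^{t}\tau_i^2 + \sigma_{TR}^2 + \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, the first line in Table 1a. Similarly, the other expressions in Table 1a can be derived. A table of mm-ANOVA variance component formulas, such as Table 1b, can then be created from the mm-ANOVA expected mean square table by solving for the variance components.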
5. MM-ANOVA ALGORITHM SUMMARY AND EXAMPLES
In Sections 3–4 steps 1 and 2 of the mm-ANOVA algorithm were presented. These two steps illustrated the essence of the mm-ANOVA approach. Steps 3 and 4, which are presented later in Appendices B and C, extend this approach by showing how to derive confidence intervals and the non-null distribution of the test statistic.
Table 5 presents a succinct summary of the mm-ANOVA algorithm. This summary is intended to make it easy to use the algorithm to determine the properties of OR-type models corresponding to other study designs. Note that Table 5 shows the steps for deriving the confidence interval formula, not only for a linear combination of test accuracy parameters, but also for a single accuracy parameter. Table 6 illustrates the application of Table 5 to the typical test×reader×case study design previously discussed in Sections 3 and 4.
Table 5.
Table 6.
Using the algorithm in Table 5, I derive results for several other study designs and summarize these results in the remainder of this section. For each study design the corresponding algorithm results, in a format similar to Table 6, are presented in the referenced supplementary tables that are available in the online version of this article. Note that in the summaries below the reader performance measure is denoted by θ̂ij instead of Ỹij to make it clear that, although these are mm-ANOVA models, the outcome is not restricted to a marginal mean but can be any reader-performance measure. In addition, I omit the tilde symbol over the mean squares and error term since it is clear that they are for the mm-ANOVA model rather than the corresponding conventional ANOVA model. Standard nesting notation is used; e.g., subscript (i) j denotes that the factor indexed by j is nested within the factor indexed by i, and MS[R (T)] is the mean square for reader nested within test.
5.1. Example 1: Reader×case study design (one test)
In this study design there is only one test and each reader reads each case. Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S1. The derivation begins with a conventional reader×case study-design ANOVA model that treats reader and case as random factors and includes their interaction. Averaging across cases produces the corresponding mm-ANOVA model: a one-way ANOVA model with reader as its only factor.
This mm-ANOVA model is given by θ̂j = μ + Rj + εj, j = 1, …, r, where r is the number of readers. The Rj are mutually independent and normally distributed with zero mean and variance $\sigma_R^2$; the εj are normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are independent of the Rj; and Cov2 ≡ Cov (εj, εj′) ≥ 0, j ≠ j′. Thus reader is a random factor and the covariance between error terms is assumed constant. Because there is only one test, only the formula for computing a confidence interval for the single test accuracy is presented.
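An approximate (1 − α) 100% confidence interval for a single test accuracy, θ = E (θ̂j), is given by $\hat{\theta}_{\bullet} \pm t_{\alpha/2;\,df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}\left[MS(R) + r\max(\widehat{\mathrm{Cov}}_2,\, 0)\right]$, $MS(R) = \frac{1}{r-1}\sum_{j=1}^{r}(\hat{\theta}_j - \hat{\theta}_{\bullet})^2$, and $df_2 = \frac{\left[MS(R) + r\max(\widehat{\mathrm{Cov}}_2,\, 0)\right]^2}{[MS(R)]^2/(r-1)}$. A hypothesis test for the single test accuracy can be based on this confidence interval. Although Hillis [6] discusses this single-test confidence interval formula, he does not provide a derivation of the result.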
This confidence interval result can also be used with the test×reader×case study design to yield single test confidence intervals, each based only on data for the corresponding test, as was illustrated in the analysis of the example data in Section 2.3. Because properties of this confidence interval do not depend on assumptions about the variance components and covariances corresponding to the other tests, we expect these single-test confidence intervals to be more robust than those where the standard error is based on all of the data.
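The ingredients of such a single-test interval can be sketched as follows. The sketch assumes the variance estimate is (1/r) times the sum of the reader mean square and r times the nonnegative part of the Cov2 estimate, with Satterthwaite-type degrees of freedom; these forms and the function name `single_test_summary` are assumptions of this illustration. The t quantile is left to the caller because the Python standard library does not provide one; `scipy.stats.t.ppf` is one option if SciPy is available.

```python
import math


def single_test_summary(aucs, cov2_hat):
    """Return (mean accuracy, standard error, approximate df) for one test.

    aucs:     accuracy estimates for the r readers under this test
    cov2_hat: estimated between-reader error covariance for this test
    """
    r = len(aucs)
    theta_bar = sum(aucs) / r
    # Reader mean square for the one-way model
    ms_r = sum((a - theta_bar) ** 2 for a in aucs) / (r - 1)
    total = ms_r + r * max(cov2_hat, 0.0)
    std_err = math.sqrt(total / r)
    df2 = total ** 2 / (ms_r ** 2 / (r - 1))
    return theta_bar, std_err, df2


# Made-up AUC estimates for four readers and a made-up covariance estimate
theta_bar, std_err, df2 = single_test_summary([0.80, 0.85, 0.78, 0.82], 0.0004)
# An interval would then be theta_bar +/- t_quantile * std_err,
# with t_quantile taken from a t distribution on df2 degrees of freedom.
```

If the covariance estimate is zero or negative, the summary reduces to the usual one-sample result with r − 1 degrees of freedom, which treats readers as the only source of variation.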
5.2. Example 2: Reader-nested-within-test study design
In this study design readers read images from only one test; i.e., readers are nested within test. This study design is natural when readers are trained to read under only one of the tests. The study design is balanced with an equal number of readers reading all cases using each test. Thus reader is nested within test and is crossed with case. Obuchowski [12] discusses this design and refers to this as a paired-case, unpaired-reader design. This can be viewed as a split-plot design with readers being the “whole plots,” case the split-plot (or within-plot) factor, and test the whole-plot (or between-plot) factor. This design is schematically illustrated in Table 7a.
Table 7.
a) Reader nested within test. Yijk = rating for test i from reader j reading cases 1, …, c, with readers nested in test i; i = 1, …, t, j = 1, …, r, k = 1, …, c. | ||||
---|---|---|---|---|
case | ||||
test | reader | 1 | … | c |
1 | (1)1 | Y111 | ⋯ | Y11c |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
1 | (1)r | Y1r1 | ⋯ | Y1rc |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
t | (t)1 | Yt11 | ⋯ | Yt1c |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
t | (t)r | Ytr1 | ⋯ | Ytrc |

b) Case nested within test. Yijk = rating for test i from reader j reading case k, with cases nested in test i; i = 1, …, t, j = 1, …, r, k = 1, …, c.

test | case | reader 1 | ⋯ | reader r |
---|---|---|---|---|
1 | (1)1 | Y111 | ⋯ | Y1r1 |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
1 | (1)c | Y11c | ⋯ | Y1rc |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
t | (t)1 | Yt11 | ⋯ | Ytr1 |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
t | (t)c | Yt1c | ⋯ | Ytrc |

c) Case nested within reader. Yijk = rating for test i from reader j reading case k, with cases nested in reader j; i = 1, …, t, j = 1, …, r, k = 1, …, c.

reader | case | test 1 | ⋯ | test t |
---|---|---|---|---|
1 | (1)1 | Y111 | ⋯ | Yt11 |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
1 | (1)c | Y11c | ⋯ | Yt1c |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
r | (r)1 | Y1r1 | ⋯ | Ytr1 |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
r | (r)c | Y1rc | ⋯ | Ytrc |

d) Reader and case crossed and nested within group. Yhijk = rating assigned by the jth reader in group h to the kth case in group h using test i; h = 1, …, g, i = 1, …, t, j = 1, …, r, k = 1, …, c. Each reader and case is included in only one group.

group | reader | case | test 1 | ⋯ | test t |
---|---|---|---|---|---|
1 | (1)1 | (1)1 | Y1111 | ⋯ | Y1t11 |
⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
1 | (1)1 | (1)c | Y111c | ⋯ | Y1t1c |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1 | (1)r | (1)1 | Y11r1 | ⋯ | Y1tr1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
1 | (1)r | (1)c | Y11rc | ⋯ | Y1trc |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
g | (g)1 | (g)1 | Yg111 | ⋯ | Ygt11 |
⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
g | (g)1 | (g)c | Yg11c | ⋯ | Ygt1c |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
g | (g)r | (g)1 | Yg1r1 | ⋯ | Ygtr1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
g | (g)r | (g)c | Yg1rc | ⋯ | Ygtrc |
Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S2. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design (i.e., with reader nested within test and crossed with case) that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model: a reader-nested-within-test ANOVA model with reader as a random factor.
The mm-ANOVA model is given by θ̂ij = μ + τi + R(i)j + εij, i = 1, …, t, j = 1, …, r, where t is the number of tests, r is the number of readers per test, τi denotes the fixed effect of test i, and $\sum_{i=1}^{t}\tau_i = 0$. The reader effects, the R(i)j, are mutually independent and normally distributed with zero mean and variance $\sigma^2_{R(T)}$, where “R(T)” is read “reader nested within test”. The εij are normally distributed with zero mean and variance $\sigma^2_\varepsilon$ and are independent of the R(i)j. The error covariances are Cov2 ≡ Cov (εij, εij′) with j ≠ j′ and Cov3 ≡ Cov (εij, εi′j′) with i ≠ i′, with Cov2 ≥ Cov3 ≥ 0.
Thus there are two error covariances, Cov2 and Cov3, Cov2 ≥ Cov3 ≥ 0, defined as the covariances between errors for the same test and different readers, and for different tests and different readers, respectively. Note that the definition Cov3 ≡ Cov (εij, εi′j′), i ≠ i′ does not require j ≠ j′ because i ≠ i′ implies different readers. There is no Cov1 parameter because the design does not allow for one reader reading under two tests.
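Let θi ≡ E (θ̂i•) denote the expected reader performance measure for test i. The test statistic for the null hypothesis of equal test accuracies (H0 : θ1 = … = θt) is

$$F_{OR} = \frac{MS(T)}{MS[R(T)] + r\max\left(\widehat{Cov}_2 - \widehat{Cov}_3,\, 0\right)},$$

where MS(T) is defined as for the factorial model and $MS[R(T)] = \frac{1}{t(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}\left(\hat\theta_{ij} - \hat\theta_{i\bullet}\right)^2$. Under H0, FOR ~˙ Ft−1,df2 where

$$df_2 = \frac{\left\{MS[R(T)] + r\max\left(\widehat{Cov}_2 - \widehat{Cov}_3,\, 0\right)\right\}^2}{\left\{MS[R(T)]\right\}^2 / [t(r-1)]}. \tag{27}$$

More generally, FOR ~˙ Ft−1,df2;λ, where $\lambda = \dfrac{r\sum_{i=1}^{t}(\theta_i - \theta_\bullet)^2}{\sigma^2_{R(T)} + \sigma^2_\varepsilon + (r-1)Cov_2 - r\,Cov_3}$ and $df_2 = \dfrac{\left[\sigma^2_{R(T)} + \sigma^2_\varepsilon + (r-1)Cov_2 - r\,Cov_3\right]^2}{\left[\sigma^2_{R(T)} + \sigma^2_\varepsilon - Cov_2\right]^2 / [t(r-1)]}$.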
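An approximate (1 − α) 100% confidence interval for the contrast $\sum_i l_i\theta_i$ is given by $\sum_i l_i\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where $\hat V = \frac{\sum_i l_i^2}{r}\left\{MS[R(T)] + r\max(\widehat{Cov}_2 - \widehat{Cov}_3, 0)\right\}$ and df2 is given by (27). An approximate (1 − α) 100% confidence interval for θi is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_i}$, where $\hat V_i = \frac{1}{r}\left\{MS[R(T)] + r\max(\widehat{Cov}_2, 0)\right\}$ and $df_2 = \left\{MS[R(T)] + r\max(\widehat{Cov}_2, 0)\right\}^2 \big/ \left\{MS[R(T)]^2/[t(r-1)]\right\}$. Alternatively, an approximate (1 − α) 100% confidence interval for θi, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_{(i)}}$, where $\hat V_{(i)} = \frac{1}{r}\left[MS(R)_{(i)} + r\max(\widehat{Cov}_{2(i)}, 0)\right]$ and $df_2 = \left[MS(R)_{(i)} + r\max(\widehat{Cov}_{2(i)}, 0)\right]^2 \big/ \left\{[MS(R)_{(i)}]^2/(r-1)\right\}$, where $MS(R)_{(i)}$ and $\widehat{Cov}_{2(i)}$ are computed only from test i data; note that this is the result from Section 5.1.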
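To make the reader-nested-within-test computations concrete, the following Python sketch computes the test statistic and its denominator degrees of freedom from a t × r array of reader-level estimates. It is a sketch under stated assumptions: the function name and the input layout are hypothetical, the covariance estimates are taken as given, and the statistic is assumed to have the form MS(T) / {MS[R(T)] + r max(Côv2 − Côv3, 0)}, with df2 obtained from the same denominator and the t(r − 1) degrees of freedom of MS[R(T)].

```python
def or_reader_nested_within_test(theta_hat, cov2_hat, cov3_hat):
    """OR-type F statistic for a balanced reader-nested-within-test design.

    theta_hat[i][j]: estimate for reader j nested in test i (t lists of length r).
    cov2_hat, cov3_hat: user-supplied error-covariance estimates.
    Returns (F_OR, df1, df2).
    """
    t = len(theta_hat)
    r = len(theta_hat[0])
    test_means = [sum(row) / r for row in theta_hat]
    grand_mean = sum(test_means) / t

    # Between-test mean square, as for the factorial model.
    ms_t = r * sum((m - grand_mean) ** 2 for m in test_means) / (t - 1)

    # Reader-within-test mean square: pooled within-test variation of readers.
    ss_r_t = sum((x - m) ** 2
                 for row, m in zip(theta_hat, test_means)
                 for x in row)
    ms_r_t = ss_r_t / (t * (r - 1))

    # Denominator with the constraint Cov2 >= Cov3 applied to the estimates.
    denom = ms_r_t + r * max(cov2_hat - cov3_hat, 0.0)
    f_or = ms_t / denom
    df2 = denom ** 2 / (ms_r_t ** 2 / (t * (r - 1)))
    return f_or, t - 1, df2
```

When the estimated covariance difference is zero or negative, df2 reduces to t(r − 1), the conventional nested-ANOVA denominator degrees of freedom.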
5.3. Example 3: Case-nested-within-test split-plot study design
In this study design each case is imaged under only one test, with the same number of cases imaged for each test. Each reader interprets all of the images from each test. This is often called a paired-reader, unpaired-case design. Obuchowski [12] notes that this design is needed when the diagnostic tests are mutually exclusive, e.g., if they are invasive, administer a high radiation dose, or carry a risk of contrast reactions. This can be viewed as a split-plot design with cases being the whole plots, reader the split-plot factor, and test the whole-plot factor. This design is schematically illustrated in Table 7b.
Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S3. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model, which is the same as the factorial mm-ANOVA model but with Cov1 and Cov3 constrained to zero; i.e., the model is defined by equation (1) and constraints Cov2 ≥ 0, Cov1 = Cov3 = 0. It follows that hypothesis-test, confidence-interval, and sample-size formulas can be derived from those for the factorial model by setting Cov1 = Cov3 = 0.
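Thus the test statistic for the null hypothesis of equal test accuracies is

$$F_{OR} = \frac{MS(T)}{MS(T*R) + r\max\left(\widehat{Cov}_2,\, 0\right)}.$$

Under H0, FOR ~˙ Ft−1,df2 where

$$df_2 = \frac{\left[MS(T*R) + r\max\left(\widehat{Cov}_2,\, 0\right)\right]^2}{\left[MS(T*R)\right]^2 / [(t-1)(r-1)]}. \tag{28}$$

More generally, FOR ~˙ Ft−1,df2;λ, where $\lambda = \dfrac{r\sum_{i=1}^{t}(\theta_i - \theta_\bullet)^2}{\sigma^2_{TR} + \sigma^2_\varepsilon + (r-1)Cov_2}$ and $df_2 = \dfrac{\left[\sigma^2_{TR} + \sigma^2_\varepsilon + (r-1)Cov_2\right]^2}{\left[\sigma^2_{TR} + \sigma^2_\varepsilon - Cov_2\right]^2 / [(t-1)(r-1)]}$.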
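Letting θi denote E (θ̂i•), an approximate (1 − α) 100% confidence interval for the contrast $\sum_i l_i\theta_i$ is given by $\sum_i l_i\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where df2 is given by (28) and $\hat V = \frac{\sum_i l_i^2}{r}\left[MS(T*R) + r\max(\widehat{Cov}_2, 0)\right]$. An approximate (1 − α) 100% confidence interval for θi is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_i}$, where $\hat V_i = \frac{1}{tr}\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{Cov}_2, 0)\right]$ and $df_2 = \dfrac{\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{Cov}_2, 0)\right]^2}{[MS(R)]^2/(r-1) + (t-1)[MS(T*R)]^2/(r-1)}$. Alternatively, an approximate (1 − α) 100% confidence interval for θi, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_{(i)}}$, where $\hat V_{(i)} = \frac{1}{r}\left[MS(R)_{(i)} + r\max(\widehat{Cov}_{2(i)}, 0)\right]$ and $df_2 = \left[MS(R)_{(i)} + r\max(\widehat{Cov}_{2(i)}, 0)\right]^2 \big/ \left\{[MS(R)_{(i)}]^2/(r-1)\right\}$, where $MS(R)_{(i)}$ and $\widehat{Cov}_{2(i)}$ are computed only from test i data. Note that these single-test confidence-interval formulas are the same as those for the factorial design.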
5.3.1. Real-data example
Using the Kundel et al [9] data that were discussed in Section 2.3, I now compare soft-copy computed radiographs with screen-film radiographs. The images are from different patients for each type of radiograph, with 95 images in each group (soft-copy computed radiograph: 66 nondiseased, 29 diseased; screen-film radiograph: 68 nondiseased, 27 diseased). Because the images for each method are from different patients, this is an example of a case-nested-within-test study design. The analysis of this study using empirical AUC estimates and jackknife covariance estimates is displayed in Table 8. The AUCs for soft-copy and screen-film images, averaged across the four readers, are 0.804 and 0.829, respectively. The test for the null hypothesis of no AUC difference between soft-copy and screen-film is not significant (FOR = 0.31, df2 = 164.4, p = 0.58); the 95% confidence interval for the difference of the population AUCs (screen-film minus soft-copy) is (−0.064, 0.114). Part (h) gives 95% confidence intervals for the single-test AUCs based only on data for the specific test.
Table 8.
5.4. Example 4: Case-nested-within-reader split-plot study design
In this study design each reader interprets a different set of cases using all of the diagnostic tests. The study design is balanced with each reader reading the same number of cases under each test. This can be viewed as a split-plot design with cases being the whole plots, reader the whole-plot factor, and test the split-plot factor. Obuchowski [12] refers to this as a hybrid design. The advantage of this design is that for equivalent power each reader must interpret fewer cases than for the factorial design, but the disadvantage is that the total number of cases is higher [13]. Thus this design is appropriate when a large number of verified cases are available and reading time per reader is limited or relatively expensive. This design is schematically illustrated in Table 7c.
Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S4. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model, which is the same as the factorial model except with Cov2 and Cov3 constrained to zero; i.e., the model is defined by (1) and constraints Cov1 ≥ 0, Cov2 = Cov3 = 0. It follows that hypothesis-test, confidence-interval, and sample-size formulas can be derived from those for the factorial model by setting Cov2 = Cov3 = 0.
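Thus the test statistic for the null hypothesis of equal test accuracies is

$$F_{OR} = \frac{MS(T)}{MS(T*R)}.$$

Under H0, FOR ~˙ Ft−1,df2 where

$$df_2 = (t-1)(r-1). \tag{29}$$

More generally, FOR ~˙ Ft−1,df2;λ, where $\lambda = \dfrac{r\sum_{i=1}^{t}(\theta_i - \theta_\bullet)^2}{\sigma^2_{TR} + \sigma^2_\varepsilon - Cov_1}$ and df2 is given by (29).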
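Letting θi denote E (θ̂i•), an approximate (1 − α) 100% confidence interval for the contrast $\sum_i l_i\theta_i$ is given by $\sum_i l_i\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where df2 is given by (29) and $\hat V = \frac{\sum_i l_i^2}{r}MS(T*R)$. An approximate (1 − α) 100% confidence interval for θi is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_i}$, where $\hat V_i = \frac{1}{tr}\left[MS(R) + (t-1)MS(T*R)\right]$ and $df_2 = \dfrac{\left[MS(R) + (t-1)MS(T*R)\right]^2}{[MS(R)]^2/(r-1) + (t-1)[MS(T*R)]^2/(r-1)}$. Alternatively, an approximate (1 − α) 100% confidence interval for θi, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_{(i)}}$, where $\hat V_{(i)} = MS(R)_{(i)}/r$ and df2 = r − 1.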
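Because Cov2 and Cov3 are constrained to zero for this design, the analysis reduces to a conventional two-way test-by-reader ANOVA on the reader-level estimates. The Python sketch below illustrates that reduction; the function names and the t × r input layout are hypothetical, and the statistic is assumed to be MS(T)/MS(T*R) with (t − 1)(r − 1) denominator degrees of freedom.

```python
def two_way_mean_squares(theta_hat):
    """Mean squares MS(T), MS(R), MS(T*R) for a balanced t x r layout.

    theta_hat[i][j]: reader-level estimate for test i and reader j.
    """
    t, r = len(theta_hat), len(theta_hat[0])
    test_means = [sum(row) / r for row in theta_hat]
    reader_means = [sum(theta_hat[i][j] for i in range(t)) / t for j in range(r)]
    grand = sum(test_means) / t

    ms_t = r * sum((m - grand) ** 2 for m in test_means) / (t - 1)
    ms_r = t * sum((m - grand) ** 2 for m in reader_means) / (r - 1)
    # Interaction sum of squares from the doubly centered residuals.
    ss_tr = sum((theta_hat[i][j] - test_means[i] - reader_means[j] + grand) ** 2
                for i in range(t) for j in range(r))
    ms_tr = ss_tr / ((t - 1) * (r - 1))
    return ms_t, ms_r, ms_tr


def or_case_nested_within_reader(theta_hat):
    """F statistic and degrees of freedom when Cov2 = Cov3 = 0 is assumed."""
    t, r = len(theta_hat), len(theta_hat[0])
    ms_t, _, ms_tr = two_way_mean_squares(theta_hat)
    return ms_t / ms_tr, t - 1, (t - 1) * (r - 1)
```

With two tests the resulting F statistic equals the square of the paired t statistic computed from the reader-level differences, which gives a quick check on the arithmetic.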
5.5. Example 5: Reader-and-case-crossed-and-nested-within-group split-plot study design
In this study design there are several groups (or blocks) of readers and cases such that (1) each reader and each case belongs to only one group and (2) within each group all readers read all cases under each test. I assume a balanced design where each group has the same number of readers and cases. Obuchowski [13] discusses this design and refers to it as a mixed design; I will refer to it as a mixed split-plot design. The motivation for this study design is to reduce the number of reader interpretations for each reader, compared to the factorial study, without requiring as many cases to be verified as the hybrid design. This design is schematically illustrated in Table 7d. Although not explicitly stated, Obuchowski [13] assumes that there is no group effect for this design; e.g., cases and readers are randomly assigned to the groups (personal communication, Nancy Obuchowski, 2012). In contrast, I allow for a group effect; e.g., readers are assigned to groups according to experience level. Obuchowski et al [14] provide a real-data example that shows how this design can be particularly useful for studying multiple imaging tests.
Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S5. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design (reader and case crossed and nested within group) that treats reader and case as random factors and group and test as fixed factors. All possible interactions are included. Averaging across cases produces the corresponding mm-ANOVA model: a three-way ANOVA model with group, test, and reader as factors.
Let θ̂hij denote the reader-performance estimate for reader j under test i, with both belonging to group h. The mm-ANOVA model is given by θ̂hij = μ + γh + τi + (γτ)hi + R(h)j + (τR)(h)ij + εhij, h = 1, …, g, i = 1, …, t, j = 1, …, r, where g is the number of groups, t is the number of tests, r is the number of readers per group, τi denotes the fixed effect of test i, γh denotes the fixed effect of group h, and (γτ)hi denotes the fixed group-by-test interaction with $\sum_h \gamma_h = \sum_i \tau_i = \sum_h (\gamma\tau)_{hi} = \sum_i (\gamma\tau)_{hi} = 0$. The R(h)j and (τR)(h)ij are random reader and test-by-reader effects, nested within group; they are mutually independent and normally distributed with zero means and respective variances $\sigma^2_{R(G)}$ and $\sigma^2_{TR(G)}$. The εhij are normally distributed with zero mean and variance $\sigma^2_\varepsilon$ and are independent of the R(h)j and (τR)(h)ij. In summary, the mm-ANOVA model contains fixed effects for group, test, and their interaction, and random effects for reader nested within group and the test-by-reader interaction nested within group.
Cov1, Cov2, and Cov3 are defined and constrained similarly to the corresponding covariances for the typical test×reader×case factorial design, with one difference: here they are not defined between errors corresponding to different groups because the covariance of those errors is zero. Specifically, Cov1 ≡ Cov (εhij, εhi′j), Cov2 ≡ Cov (εhij, εhij′), and Cov3 ≡ Cov (εhij, εhi′j′), where i ≠ i′, j ≠ j′, and Cov1 ≥ Cov3, Cov2 ≥ Cov3, and Cov3 ≥ 0.
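The null hypothesis of equal test accuracies is H0 : θ1 = … = θt, where θi = E (θ̂•i•). The corresponding test statistic is

$$F_{OR} = \frac{MS(T)}{MS[T*R(G)] + r\max\left(\widehat{Cov}_2 - \widehat{Cov}_3,\, 0\right)}.$$

Under H0, FOR ~˙ Ft−1,df2 where

$$df_2 = \frac{\left\{MS[T*R(G)] + r\max\left(\widehat{Cov}_2 - \widehat{Cov}_3,\, 0\right)\right\}^2}{\left\{MS[T*R(G)]\right\}^2 / [g(t-1)(r-1)]} \tag{30}$$

and MS[T * R(G)] denotes the mean square for test-by-reader interaction nested within group. More generally, FOR has an approximate Ft−1,df2;λ distribution, where $\lambda = \dfrac{gr\sum_{i=1}^{t}(\theta_i - \theta_\bullet)^2}{\sigma^2_{TR(G)} + \sigma^2_\varepsilon - Cov_1 + (r-1)(Cov_2 - Cov_3)}$ and $df_2 = \dfrac{\left[\sigma^2_{TR(G)} + \sigma^2_\varepsilon - Cov_1 + (r-1)(Cov_2 - Cov_3)\right]^2}{\left[\sigma^2_{TR(G)} + \sigma^2_\varepsilon - Cov_1 - Cov_2 + Cov_3\right]^2 / [g(t-1)(r-1)]}$.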
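An approximate (1 − α) 100% confidence interval for the contrast $\sum_i l_i\theta_i$ is given by $\sum_i l_i\hat\theta_{\bullet i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where df2 is given by (30) and $\hat V = \frac{\sum_i l_i^2}{gr}\left\{MS[T*R(G)] + r\max(\widehat{Cov}_2 - \widehat{Cov}_3, 0)\right\}$. An approximate (1 − α) 100% confidence interval for θi is given by $\hat\theta_{\bullet i\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V_i}$, where $\hat V_i = \frac{1}{grt}\left\{MS[R(G)] + (t-1)MS[T*R(G)] + tr\max(\widehat{Cov}_2, 0)\right\}$ and $df_2 = \dfrac{\left\{MS[R(G)] + (t-1)MS[T*R(G)] + tr\max(\widehat{Cov}_2, 0)\right\}^2}{\{MS[R(G)]\}^2/[g(r-1)] + (t-1)\{MS[T*R(G)]\}^2/[g(r-1)]}$, with MS[R(G)] denoting the mean square for reader nested within group.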
5.6. Example 6: Replicated factorial study design
This study design is the same as the factorial study design except that each reader reads each case n times. Typically sessions corresponding to different readings are separated by a suitable period of time to reduce the probability that the reader will recognize cases from the earlier session. This study design has two advantages over the factorial design with one replication: it allows for estimation of within-reader reliability between two readings of the same cases, and it provides more power for the same number of cases and readers. This last aspect can be important if the number of available cases and readers is limited. In the example later in this section, I show how to estimate the gain in power based on pilot data.
Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S6. The derivation begins with a conventional three-way replicated factorial ANOVA model with reader and case as random factors and test as a fixed factor. There are n replications. All possible interactions are included between reader, case and test. Averaging across cases for each replication produces the corresponding mm-ANOVA model: a two-way replicated factorial ANOVA model with test and reader as factors.
Let θ̂ijm denote the reader-performance estimate for reader j under test i based on the mth reading of the data. The mm-ANOVA model is given by θ̂ijm = μ + τi + Rj + (τR)ij + εijm, i = 1, …, t, j = 1, …, r, m = 1, …, n, where t is the number of tests, r is the number of readers, n is the number of replications, τi denotes the fixed effect of test i, Rj denotes the random effect of reader j, (τR)ij denotes the random test×reader interaction, εijm is the error term, and $\sum_{i=1}^{t}\tau_i = 0$. The Rj and (τR)ij are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma^2_R$ and $\sigma^2_{TR}$. The εijm are assumed to be normally distributed with zero mean and variance $\sigma^2_\varepsilon$ and are assumed independent of the Rj and (τR)ij. The errors are equi-covariant with four possible covariances given by

$$Cov_0 \equiv Cov(\varepsilon_{ijm}, \varepsilon_{ijm'}),\ m \neq m'; \quad Cov_1 \equiv Cov(\varepsilon_{ijm}, \varepsilon_{i'jm'}),\ i \neq i'; \quad Cov_2 \equiv Cov(\varepsilon_{ijm}, \varepsilon_{ij'm'}),\ j \neq j'; \quad Cov_3 \equiv Cov(\varepsilon_{ijm}, \varepsilon_{i'j'm'}),\ i \neq i',\ j \neq j'$$
and subject to the following constraints:
Let θi ≡ E (θ̂i••) denote the expected reader performance measure for test i. The test statistic for the null hypothesis of equal test accuracies (H0 : θ1 =…= θt) is
where and . Under H0, FOR ~˙ Ft−1,df2 where
(31) |
More generally, FOR ~˙ Ft−1,df2;λ, where
(32) |
and
(33) |
An approximate (1 − α) 100% confidence interval for contrast is given by where and df2 is given by (31). An approximate (1 − α) 100% confidence interval for θi is given by , where and .
Consider Cov2 ≡ cov (θ̂ijm, θ̂ij′m′) where j ≠ j′ and either m = m′ or m ≠ m′. It follows that Cov2 can be computed from one set of replications (m = m′) or from different sets of replications (m ≠ m′). For example, for test i and readers j and j′, with n = 2 we have Cov2 = cov (θ̂ij1, θ̂ij′1) = cov (θ̂ij1, θ̂ij′2) = cov (θ̂ij2, θ̂ij′1) = cov (θ̂ij2, θ̂ij′2). Thus an obvious estimate for Cov2 that utilizes all of the data is given by
where is a fixed-reader covariance estimate, as discussed in Section 2.2. Similarly, estimates for Cov1 and Cov3 can be obtained by averaging fixed-reader covariance estimates, computed for each of the n² possible (m, m′) pairs of replications, across corresponding test-reader combinations. Obvious estimates for Cov0 and are and , where .
5.6.1. Real-data example
In Section 2.3 I compared AUCs for hard- and soft-copy computed radiography chest images. Both types of images were obtained for each patient and were read by each of the readers. Thus this was a factorial study design, which could be analyzed by the standard OR procedure. Although there was not a significant difference between the two types of images, the resulting confidence interval showed that an AUC difference as large as 0.086 was commensurate with the data. In such a situation the researcher might want to plan a similar experiment that is sized to have more power.
Increased power can be obtained by increasing the number of readers, the number of cases, or the number of replications. I now compute the number of cases needed to obtain .80 power to detect an AUC difference of .04 with alpha = .05. Because FOR ~˙ Ft−1,df2;λ, power is approximated by Pr (F1,df2,λ > F.95;1,df2) where λ and df2 are defined by (32) and (33) and F.95;1,df2 is the 95th percentile of a central F distribution with degrees of freedom 1 and df2.
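The power approximation just described can be evaluated with standard noncentral F routines. The Python sketch below uses `scipy.stats.f` and `scipy.stats.ncf`; the function names `or_power` and `smallest_case_count` are hypothetical, and the callback `lam_df2_at(c)`, which must return the noncentrality parameter and denominator degrees of freedom for a study with c cases, is assumed to be supplied by the user from pilot or conjectured parameter values.

```python
from scipy.stats import f as f_dist, ncf


def or_power(lam, df1, df2, alpha=0.05):
    """Approximate power Pr(F_{df1,df2;lam} > F_{1-alpha;df1,df2})."""
    crit = f_dist.ppf(1.0 - alpha, df1, df2)    # central-F critical value
    return float(ncf.sf(crit, df1, df2, lam))   # upper tail of the noncentral F


def smallest_case_count(lam_df2_at, target=0.80, df1=1, alpha=0.05, c_max=5000):
    """Smallest c in 2..c_max whose approximate power reaches `target`.

    lam_df2_at(c) is a user-supplied callback returning (lambda, df2) for a
    study with c cases. Returns None if the target is not reached by c_max.
    """
    for c in range(2, c_max + 1):
        lam, df2 = lam_df2_at(c)
        if or_power(lam, df1, df2, alpha) >= target:
            return c
    return None
```

A linear search is used because it does not require power to be monotone in c, although in practice power is expected to increase with the number of cases.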
For the power computations I use the following estimates, obtained from Section 2.3: , and . An estimate of Cov0 is not available from the data because there are no replicated readings; however, the similarity of the two tests (hard- and soft-copy) suggests that the within-reader correlation between replications for the same test and reader, $\rho_0 \equiv Cov_0/\sigma^2_\varepsilon$, should be only slightly higher than the within-reader correlation based on one replication between two tests, given by from Table 2. Thus I set ρ0 = 0.60 for the power computations; it follows that . Following Hillis et al [15] I assume that the covariances are inversely proportional to the number of cases c, and hence multiply and by the factor 95/c (recall that 95 is the number of cases for the example); the resulting values are used in place of , Cov1, Cov2, and Cov3 in (32) and (33) when computing power for c cases.
The numbers of cases needed to achieve 0.80 power for combinations of 4–8 readers and 1–2 replications are presented in Table 9. For example, achieving 0.80 power with 8 readers and one replication requires 173 cases versus 103 cases with two replications. Thus if cases are expensive to obtain or validate and it is difficult to obtain more than 8 readers, then using two replications appears to be an attractive option.
Table 9.
replications (n) | readers (r) | cases (c) | power |
---|---|---|---|
1 | 4 | 585 | 0.800 |
1 | 5 | 366 | 0.801 |
1 | 6 | 266 | 0.800 |
1 | 7 | 210 | 0.802 |
1 | 8 | 173 | 0.801 |
2 | 4 | 348 | 0.800 |
2 | 5 | 218 | 0.801 |
2 | 6 | 158 | 0.800 |
2 | 7 | 125 | 0.802 |
2 | 8 | 103 | 0.802 |
6. Discussion
The mm-ANOVA approach allows for analysis of ROC and other reader-performance outcomes that result from any balanced study design that has reader and case as random factors and any number of fixed factors. In addition, by providing the non-null distribution of the test statistic it allows for sample size estimation for such studies and efficiency comparisons between different types of studies. Although steps were fully justified only for the test×reader×case factorial study design, justification can be similarly established for other designs. Until now researchers have been limited to using the test×reader×case study design with the OR method because analysis methods were not developed for other designs. This work allows researchers to choose designs that are most appropriate for their study. A SAS macro for fitting some of these designs using the mm-ANOVA approach is available on request from the author.
As noted in Section 2.4, Obuchowski and Rockette [1] derived their F statistic by modifying the F statistic described by Pavur and Nath [10]. Although Pavur and Nath [10] give results only for two-factor models, their approach, which is based on results given by Pavur and Lewis [19], could conceivably be applied to other correlated-error ANOVA models; as such it would provide an alternative to the approach described in this paper. However, the results of Pavur and Lewis do not extend beyond specifying the correct form for the F test when correlations are known; in particular, they do not indicate how to implement their approach when the correlations must be estimated, do not discuss derivation of confidence interval formulas for contrasts, give little motivation for the correlated error models, and do not discuss power computations.
Explicit formulas can be derived [20, 21, 22] for the variances of reader-performance outcomes that are U-statistics [23], such as reader empirical-AUC averages and their differences. Replacing parameters in these formulas by sample estimates yields variance estimates with excellent statistical properties. However, this approach is limited to U-statistic estimators, such as the empirical AUC, and presently incorporates an adaptation of the OR degrees of freedom formula. Advantages include explicit variance formulas and applicability to a wide variety of multireader study designs, including unbalanced designs.
Another alternative approach for analyzing multireader data is the marginal model approach proposed by Song and Zhou [24] for empirical AUC estimates. An advantage of their approach is that case-specific covariates can be included; disadvantages are that it is limited to empirical AUC outcomes, is based on large-sample inferences, and thus far has been developed only for the factorial model.
Limitations of the mm-ANOVA approach include the following: (1) It is presently limited to balanced study designs; i.e., the number of levels for each factor does not depend on any other factor. However, because case is treated as one factor it is possible to have different numbers of normal and abnormal cases. I am currently investigating models that are not balanced with regard to case. (2) It assumes that the number of cases is large enough so that covariance estimates can be treated like known values for computing the denominator degrees of freedom. (3) It assumes that the fixed-reader measurement errors, the εij, are normally distributed. This is a reasonable assumption when the number of cases is moderate because most typical reader-performance outcomes, such as AUC, have asymptotic normal distributions for a fixed reader. (4) It assumes that the latent reader-performance outcomes (i.e., Rj + (τR)ij) have a normal distribution. If these normal distribution assumptions do not appear to be reasonable, one possible remedy is to transform the outcome, e.g., using a logarithmic or logit transformation for AUC. (5) It assumes the errors have an equi-covariance structure. I am currently investigating the robustness of the mm-ANOVA approach to this assumption.
Supplementary Material
ACKNOWLEDGEMENTS
This research was supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), grants R01EB000863 and R01EB013667. I thank Dr. Harold Kundel for sharing his data set.
Appendix
A. DERIVATION OF THE NULL-DISTRIBUTION RESULT USED IN STEP 2J
To derive the null-distribution result given in step 2j, I approximate the distribution of FOR (22) by deriving an approximate distribution for (21), where Cov2 and Cov3 are known. Each is equal to its corresponding conventional three-way ANOVA model mean square, denoted by MSi, multiplied by , with under H0 : θ1 = … = θt. It follows that the are mutually independent, each has the same degrees of freedom as its corresponding MSi and .
In general, a chi-squared-distribution approximation [25, 26] for a random variable X is given by
where
It follows that a chi-square approximation for
where the ai, b and d are constants, is given by
(A1) |
where
(A2) |
Replacing E (MSi) by MSi and d by an estimate d̂ in (A2) results in the approximation for df given by df2 (24).
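The degrees-of-freedom approximation can be illustrated numerically. The Python sketch below assumes the usual Satterthwaite-type form for a linear combination of independent mean squares plus a constant, df = (Σ ai MSi + d̂)² / Σ [(ai MSi)² / dfi]; the function name `satterthwaite_df` is hypothetical and the numeric inputs are illustrative rather than taken from the example data.

```python
def satterthwaite_df(mean_squares, coefs, dfs, d=0.0):
    """Approximate df for sum_i a_i * MS_i + d.

    mean_squares: observed mean squares MS_i (assumed mutually independent).
    coefs:        constants a_i.
    dfs:          degrees of freedom of each MS_i.
    d:            constant (or estimate) added to the linear combination,
                  e.g. r * max(Cov2_hat - Cov3_hat, 0) for the factorial design.
    """
    total = sum(a * ms for a, ms in zip(coefs, mean_squares)) + d
    denom = sum((a * ms) ** 2 / df for a, ms, df in zip(coefs, mean_squares, dfs))
    return total ** 2 / denom


# Factorial-design style usage: one mean square, MS(T*R), with (t-1)(r-1) df.
df2 = satterthwaite_df([2.0], [1.0], [6], d=2.0)
```

With a single mean square and d = 0 the function returns that mean square's own degrees of freedom; a positive d inflates the result, which matches the behavior expected of the denominator degrees of freedom of FOR.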
It follows using (A1) with i = 1, a1 = 1, , d = r (Cov2 − Cov3) and (A2) estimated by (24) that a chi-squared approximation for , the denominator of F* (21), is given by
(A3) |
where df2 is given by (25) and “~˙” stands for “is approximately distributed as.” See Reference [6] for a more detailed derivation and justification of df2 (referred to as ddfH in the reference).
Because (21) is an ANOVA statistic, under H0. Combining this result with the chi-squared approximation (A3) for and the independence of and , it follows under H0 that
where , W is approximately , and U and W are independent. Thus has an approximate F(t−1),df2 null distribution, with df2 given by (25). Because FOR (22) approximates (21), it is reasonable to approximate the null distribution of FOR by F(t−1),df2, which is the null distribution derived by Hillis [6] for FOR, discussed in Section 2.2.
B. MM-ANOVA APPROACH STEP 3: DERIVE CONFIDENCE INTERVALS FOR A LINEAR FUNCTION g(θ) OF TEST ACCURACIES
In this section I show how to compute a confidence interval for a linear function of test accuracy parameters. Specifically, for the balanced test×reader×case factorial study design with θi ≡ E(θ̂i•) denoting the expected reader-performance outcome for test i across readers, θ = (θ1, …, θt)′, and l = (l1, …, lt)′ denoting a t-dimensional contrast vector (i.e., $\sum_{i=1}^{t} l_i = 0$), I illustrate how to derive a confidence interval for g (θ) ≡ l′θ. More generally this step can be used to determine a confidence interval for g (θ), where g (·) is any linear function and θ any vector of test accuracy parameters; this general result is given in step 3k.
B.1. Step 3a: Write the test accuracy parameter vector θ in terms of the mm-ANOVA model
In terms of the mm-ANOVA model parameterization, treating Ỹij as θ̂ij, we have θi = E (Ỹi•) = μ + τi.
B.2. Step 3b: Write θ in terms of the conventional ANOVA model
Since θi = E (Ỹi•) = E (Yi••) = μ + τi, then in terms of the conventional ANOVA model we also have θi = μ + τi.
B.3. Step 3c: Determine the conventional ANOVA estimate for θ, denoted by θ̂
The conventional unbiased ANOVA estimate for θ is given by θ̂ = (θ̂1, …, θ̂t)′ with θ̂i = Yi••.
B.4. Step 3d: Determine the variance V of g (θ̂) in terms of conventional ANOVA parameters
From (7) it follows that
Thus
Because θ̂ has a multivariate normal distribution, it follows that
where
B.5. Step 3e: Write V from step 3d in the form V = bE (∑aiMSi) for constants b and ai
Expected values of the conventional ANOVA mean squares are given in Table 4. It follows that
B.6. Step 3f: Write V from step 3e in the form where b̃ and ãi are constants and U is a linear function of conventional ANOVA mean squares that involve case
We have
where
B.7. Step 3g: Express E (U) in terms of conventional ANOVA model variance components and then in terms of mm-ANOVA model error covariance parameters, using the relationships from step 1c; then rewrite V using this expression for E (U)
We did the first part of this step in step 2g where we showed
Using this expression we have
B.8. Step 3h: Derive the variance estimate V̂ from V by replacing expected mean squares by mean squares and replacing covariances by estimates that take into account the constraints from step 1d
We have
B.9. Step 3i: Derive the degrees of freedom df2 for V̂ using the general formula for df2 (24) given in step 2j
It follows that the degrees of freedom is given by (25), which is the same as ddfH (6).
B.10. Step 3j: Write θ̂ from step 3c in terms of the mm-ANOVA model
Since θ̂i = Yi•• = Ỹi•, then in terms of the mm-ANOVA model θ̂i = Ỹi•.
B.11. Step 3k: General confidence-interval result: In terms of the mm-ANOVA model, an approximate (1 − α) 100% confidence interval for g (θ) is given by $g(\hat\theta) \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where V̂ is determined in step 3h, df2 in step 3i and θ̂ in step 3j
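This result yields the following (1 − α) 100% confidence interval for l′θ:

$$\sum_{i=1}^{t} l_i\hat\theta_{i\bullet} \pm t_{\alpha/2;\,ddf_H}\sqrt{\frac{\sum_{i=1}^{t} l_i^2}{r}\left[MS(T*R) + r\max\left(\widehat{Cov}_2 - \widehat{Cov}_3,\, 0\right)\right]}, \tag{B1}$$

where MS(T*R) is computed from the θ̂ij and ddfH is given by (25). Letting “FOR-test denominator” denote the denominator of the FOR statistic (22) for testing H0 : θ1 = … = θt, we can write (B1) as

$$\sum_{i=1}^{t} l_i\hat\theta_{i\bullet} \pm t_{\alpha/2;\,ddf_H}\sqrt{\frac{\sum_{i=1}^{t} l_i^2}{r}\,\left(F_{OR}\text{-test denominator}\right)}.$$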
B.12. Derivation of the general confidence-interval result given in step 3k
I now derive the step 3k result for the test×reader×case factorial study design with g (θ) ≡ l′θ and l = (l1, …, lt)′ denoting a t-dimensional contrast vector (i.e., $\sum_{i=1}^{t} l_i = 0$). We have shown in the previous steps that g (θ̂) ~ N [g (θ), V], where
Define V* by replacing by :
Using the same argument as given in Appendix A and noting that V = E (V*), we can show that a chi-squared-distribution approximation for V* is given by with df2 given by (25). Furthermore, independence of g (θ̂) and for the mm-ANOVA model, and hence independence of g (θ̂) and V*, follows from the independence of g (θ̂) and MS(T * R) for the conventional ANOVA model (7). Thus for the mm-ANOVA model
where Z ~ N (0, 1), W is approximately , and Z and W are independent. Thus
has an approximate tdf2 distribution with df2 given by (25). In practice we replace r (Cov2 − Cov3) by and base tests and confidence intervals on
(B2) |
which we treat as having an approximate tdf2 distribution; the confidence interval result in step 3k follows.
The general result for with g (·) being any linear function can be similarly proved, with the main difference being the formula for V.
C. MM-ANOVA APPROACH – STEP 4: DERIVE THE NON-NULL DISTRIBUTION OF FOR
Power and sample size estimation for the step 2a hypothesis requires specification of the distribution of the FOR statistic, derived in step 2i, when the null hypothesis is not true. A noncentral F distribution approximation for the non-null distribution is specified by steps 4a–d below. These steps are justified in Section C.5.
C.1. Step 4a: Compute the noncentrality parameter in terms of the conventional ANOVA model
Express the noncentrality parameter in terms of the conventional ANOVA model using
(C1) |
where MSnum is the numerator mean square from the conventional ANOVA F statistic given in step 2d, df(MSnum) is its degrees of freedom, E (MSnum |H0) is its expected value under H0, and MSnum|Y= E(Y) is the mean square evaluated with outcomes replaced by their expected values.
For the balanced test×reader×case factorial design we have MSnum = MS (T) from step 2d. From Table 4 we have . Thus under H0:τ1 = … = τt = 0. Noting that E (Yijk) = μ + τi, we have . Noting that df[MS (T)] = t − 1, then from (C1) it follows that
(C2) |
C.2. Step 4b: Express λ in terms of mm-ANOVA parameters
Replace variance components in (C2) corresponding to random effects involving case by mm-ANOVA covariances. From the relationships determined in step 1c and presented in Table 3 we have
(Recall that is the error variance for the mm-ANOVA model.) Thus in terms of mm-ANOVA parameters
(C3) |
C.3. Step 4c: Determine the denominator degrees of freedom in terms of mm-ANOVA parameters
Write the denominator of from step 2h in the form . The denominator degrees of freedom is given by
(C4) |
which is the same as (A2). Note that (C4) contains the expected mean square values and the true value of d, in contrast to approximation (24) that replaces these values by sample estimates. The reason for this difference is that approximation (24) will be used for hypotheses testing and confidence intervals for a study data set; in contrast, (C4) will be used for sample-size and power estimation for a future study and will be based on parameter values that are either conjectured or estimated from pilot data.
Express the expected mean squares in (C4) in terms of mm-ANOVA model parameters by determining their expected values in terms of the conventional ANOVA parameters and then replacing variance components that involve case by mm-ANOVA covariances. For example, for the balanced test×reader×case factorial study design, the denominator of from step 2h is given by . From (17) and Tables 3–4 it follows that
with . Thus
and hence, using (C4),
(C5) |
Hillis et al [15] illustrate how these formulas can be used in practice to estimate power and sample size using pilot-data or conjectured parameter estimates.
C.4. Step 4d: General non-null distribution result
An approximation for the non-null distribution of FOR is given by
where λ is given in step 4b, df1 is the degrees of freedom for the numerator mean square from the conventional ANOVA F statistic given in step 2d, and df2 is given by (C4), expressed in terms of the mm-ANOVA parameters. Thus for the balanced test×reader×case factorial study design, λ is given by (C3), df1 = t − 1, and df2 is given by (C5).
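Given values of λ, df1, and df2, power can be approximated from the noncentral F distribution. The sketch below is a Monte Carlo illustration that uses only the Python standard library, not a definitive implementation: it builds noncentral F variates as the ratio of a noncentral chi-square variate divided by df1 to an independent central chi-square variate divided by df2, estimates the critical value from simulated null draws, and reports the proportion of non-null draws exceeding it. The function names, defaults, and example values are hypothetical; a library routine for the noncentral F distribution would normally replace the simulation.

```python
import math
import random


def simulate_f(df1, df2, lam, n_draws, rng):
    """Draw from a noncentral F(df1, df2; lam) distribution by construction."""
    draws = []
    for _ in range(n_draws):
        # Noncentral chi-square: shift one of df1 standard normals by sqrt(lam).
        u = (rng.gauss(0.0, 1.0) + math.sqrt(lam)) ** 2
        u += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df1 - 1))
        # Central chi-square with (possibly non-integer) df2 as a gamma variate.
        w = rng.gammavariate(df2 / 2.0, 2.0)
        draws.append((u / df1) / (w / df2))
    return draws


def mc_power(df1, df2, lam, alpha=0.05, n_draws=20000, seed=1):
    """Monte Carlo estimate of P(F > critical value) under noncentrality lam."""
    rng = random.Random(seed)
    null_draws = sorted(simulate_f(df1, df2, 0.0, n_draws, rng))
    # Empirical (1 - alpha) quantile of the simulated null distribution.
    crit = null_draws[int(math.ceil((1 - alpha) * n_draws)) - 1]
    alt_draws = simulate_f(df1, df2, lam, n_draws, rng)
    return sum(f > crit for f in alt_draws) / n_draws


# Hypothetical inputs: t = 2 tests, so df1 = 1; lam and df2 are illustrative.
power_example = mc_power(df1=1, df2=10.5625, lam=3.0)
print(power_example)
```

Because both the critical value and the exceedance proportion are simulated, the estimate carries Monte Carlo error that shrinks as n_draws increases.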
C.5. Justification of steps 4a–d
The non-null distribution result given in step 4d can be derived for the test×reader×case study design along the same lines as the derivation of the null distribution result given in Section A. One difference is that MS(T), the numerator mean square in FOR (22), has a noncentral chi-square distribution when appropriately normalized under H1. The distribution for MS(T) is given by
where λ is given by (C3). Because , it follows that
Using the Section A approach but with this one difference, we can show that
where W is approximately with df2 given by (C5), and U and W are independent. Thus has an approximate F(t−1),df2;λ distribution. Because FOR (22) approximates (21), it is reasonable to approximate the non-null distribution of FOR by F(t−1),df2;λ.
References
1. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Communications in Statistics: Simulation and Computation. 1995;24:285–308.
2. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Academic Radiology. 1995;2(Suppl 1):S22–S29.
3. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
4. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi: 10.1016/s1076-6332(98)80294-8.
5. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette Methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi: 10.1002/sim.2024.
6. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine. 2007;26:596–619. doi: 10.1002/sim.2532.
7. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–844.
8. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
9. Kundel HL, Gefter W, Aronchick J, Miller W, Hatabu H, Whitfill CH. Accuracy of bedside chest hard-copy screen-film versus hard-and soft-copy computed radiographs in a medical intensive care unit: receiver operating characteristic analysis. Radiology. 1997;205:859–863. doi: 10.1148/radiology.205.3.9393548.
10. Pavur R, Nath R. Exact F tests in an ANOVA procedure for dependent observations. Multivariate Behavioral Research. 1984;19:408–420. doi: 10.1207/s15327906mbr1904_3.
11. Searle SR. Linear Models. New York: Wiley; 1971. pp. 55–59.
12. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Academic Radiology. 1995;2(Suppl 1):S22–S29.
13. Obuchowski NA. Reducing the number of reader interpretations in MRMC studies. Academic Radiology. 2009;16:209–217. doi: 10.1016/j.acra.2008.05.014.
14. Obuchowski NA, Gallas BD, Hillis SL. Multi-reader ROC studies with split-plot designs: a comparison of statistical methods. Academic Radiology. 2012;19:1508–1517. doi: 10.1016/j.acra.2012.09.012.
15. Hillis SL, Obuchowski NA, Berbaum KS. Power estimation for multireader ROC methods: An updated and unified approach. Academic Radiology. 2011;18:129–142. doi: 10.1016/j.acra.2010.09.007.
16. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Academic Radiology. 2000;7:516–525. doi: 10.1016/s1076-6332(00)80324-4.
17. Chakraborty DP, Berbaum KS. Observer studies involving detection and localization: Modeling, analysis, and validation. Medical Physics. 2004;31:2313–2330. doi: 10.1118/1.1769352.
18. Bunch PC, Hamilton JF, Sanderson GK, Simmons AH. Free-response approach to the measurement and characterization of radiographic-observer performance. Journal of Applied Photographic Engineering. 1978;4:166–171.
19. Pavur RJ, Lewis TO. Unbiased F-tests for factorial-experiments for correlated data. Communications in Statistics-Theory and Methods. 1983;12:829–840.
20. Gallas BD. One-shot estimate of MRMC variance: AUC. Academic Radiology. 2006;13:353–362. doi: 10.1016/j.acra.2005.11.030.
21. Gallas BD, Pennelo GA, Myers KJ. Multireader multicase variance analysis for binary data. JOSA A. 2007;24:B70–B80. doi: 10.1364/josaa.24.000b70.
22. Gallas BD, Bandos A, Samuelson FW, Wagner RF. A framework for random-effects ROC analysis: biases with the bootstrap and other variance estimators. Communications in Statistics-Theory and Methods. 2009;38:2586–2603.
23. Hoeffding W. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics. 1948;19:293–325.
24. Song X, Zhou XH. A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics. 2005;6:303–312. doi: 10.1093/biostatistics/kxi011.
25. Satterthwaite FE. Synthesis of variance. Psychometrika. 1941;6:309–316.
26. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometric Bulletin. 1946;2:110–114.