Author manuscript; available in PMC: 2015 Nov 10.
Published in final edited form as: Stat Med. 2013 Aug 23;33(2):330–360. doi: 10.1002/sim.5926

A Marginal-Mean ANOVA Approach for Analyzing Multireader Multicase Radiological Imaging Data

Stephen L Hillis
PMCID: PMC4640471  NIHMSID: NIHMS733736  PMID: 24038071

Abstract

The correlated-error ANOVA method proposed by Obuchowski and Rockette (OR) has been a useful procedure for analyzing reader-performance outcomes, such as the area under the receiver-operating-characteristic curve, resulting from multireader multicase radiological imaging data. This approach, however, has only been formally derived for the test-by-reader-by-case factorial study design. In this paper I show that the OR model can be viewed as a marginal-mean ANOVA model. Viewing the OR model within this marginal-mean ANOVA framework is the basis for the marginal-mean ANOVA approach, the topic of this paper. This approach (1) provides an intuitive motivation for the OR model, including its covariance-parameter constraints; (2) provides easy derivations of OR test statistics and parameter estimates, as well as their distributions and confidence intervals; and (3) allows for easy generalization of the OR procedure to other study designs. In particular, I show how one can easily derive OR-type analysis formulas for any balanced study design by following an algorithm which only requires an understanding of conventional ANOVA methods.

Keywords: Receiver operating characteristic (ROC) curve, correlated ANOVA, diagnostic radiology

1. INTRODUCTION

Receiver operating characteristic (ROC) curve analysis is a well established method for evaluating and comparing the performance of diagnostic tests. In radiological imaging studies such tests typically involve a human reader (usually a radiologist) evaluating an image or images resulting from an imaging modality (such as mammography for breast cancer) for a case (i.e., subject) with respect to confidence of disease. In such situations it is important that conclusions generalize to both the case and reader populations. A typical design for comparing diagnostic tests is the balanced test×reader×case factorial study design where each image is assigned a disease-confidence rating by each reader using each diagnostic test. Throughout I use test to refer to a diagnostic test, modality, or treatment.

The methods proposed by Obuchowski and Rockette (OR) [1, 2] and Dorfman, Berbaum, and Metz (DBM) [3, 4] are the most commonly used methods for analyzing such multireader multicase studies (often referred to as MRMC studies) and have performed well in simulations. The OR procedure fits a correlated-error test×reader ANOVA to reader-performance outcomes such as the area under the ROC curve (AUC), while the DBM procedure fits a test×reader×case conventional ANOVA to case-specific pseudovalues. Although the two methods have been shown to be equivalent [5, 6] when based on the same procedural parameters, I find the OR procedure more intuitive and its parameters more interpretable because it models observed reader-performance outcomes rather than pseudovalues. For this reason the OR procedure will be the focus of this paper.

Previously published derivations of OR model statistical properties [6] are tedious, do not provide motivation for the model, and apply only to the balanced test×reader×case factorial study design. In this paper I show that the OR model is the same as the model for the marginal mean of a conventional ANOVA model with independent errors, where the mean is computed across cases. Viewing the OR model within this marginal-mean ANOVA framework is the basis for the marginal-mean ANOVA approach (mm-ANOVA approach), the topic of this paper. This approach (1) provides an intuitive motivation for the OR model, including its covariance-parameter constraints; (2) provides easy derivations of OR test statistics and parameter estimates, as well as their distributions and confidence intervals; and (3) allows for easy generalization of the OR procedure to other study designs.

In particular, I show how one can easily derive OR-type analysis formulas for any balanced study design by following an algorithm which only requires an understanding of conventional ANOVA methods. This development is important because for many situations other designs are more suitable than the test×reader×case factorial study design. For example, diagnostic tests may be mutually exclusive for various reasons, such as high radiation dose or invasiveness of the test, and thus cannot be given to each patient; readers may be trained to read under only one of the tests; or power considerations may show that it is advantageous to have replicated readings or to have groups of readers read different cases.

The outline of this paper is as follows. I review the OR method in Section 2. In Sections 3–4 and Appendices A–C I describe and justify steps of an algorithm for motivating the OR model and deriving its properties using the marginal-mean ANOVA approach. Steps are stated in a general form so that analogous OR-type procedures can be formulated for other study designs. In Section 5 I summarize the algorithm and illustrate how the algorithm can be used to develop OR-type procedures for six other study designs. A discussion and concluding remarks are given in Section 6.

2. THE OBUCHOWSKI-ROCKETTE (OR) METHOD

2.1. Design and notation

Throughout this section I assume the data have been collected using a balanced test×reader×case factorial study design. This commonly used diagnostic-radiology study design specifies that each case be subjected to each test, with the resulting images evaluated once by each reader. In addition, each case is classified as diseased or nondiseased according to an available reference standard. Typically the number of cases is 25–200 while the number of readers is 3–15. Let $Z_{ijk}$ denote a confidence-of-disease rating assigned to the $k$th case by the $j$th reader using the $i$th test. For example, often a five-level ordinal integer scale or a quasi-continuous 0% to 100% confidence scale is used. The observed rating data consist of the $Z_{ijk}$, with $i = 1, \ldots, t$, $j = 1, \ldots, r$, $k = 1, \ldots, c$, where $t$ is the number of tests, $r$ the number of readers, and $c$ the number of cases.

2.2. Model and test statistic

Let $\hat{\theta}_{ij}$ denote the AUC estimate (or other ROC-curve accuracy estimate) for the $i$th test and $j$th reader. Obuchowski and Rockette [1] use a test × reader factorial ANOVA model for the AUC estimates, but unlike a conventional ANOVA model they allow the errors to be correlated to account for correlation due to each reader evaluating the same cases. Their model, which I refer to as the OR model, can be written as

$$\hat{\theta}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \varepsilon_{ij} \tag{1}$$

$i = 1, \ldots, t$, $j = 1, \ldots, r$, where $\tau_i$ denotes the fixed effect of test $i$, $R_j$ denotes the random effect of reader $j$, $(\tau R)_{ij}$ denotes the random test × reader interaction, and $\varepsilon_{ij}$ is the error term. Without loss of generality I assume $\sum_{i=1}^{t} \tau_i = 0$. The $R_j$ and $(\tau R)_{ij}$ are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$ and $\sigma_{TR}^2$. The $\varepsilon_{ij}$ are assumed to be normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are assumed independent of the $R_j$ and $(\tau R)_{ij}$. Equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances given by

$$\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = \begin{cases} \mathrm{Cov}_1 & i \ne i',\ j = j' \quad \text{(different test, same reader)} \\ \mathrm{Cov}_2 & i = i',\ j \ne j' \quad \text{(same test, different reader)} \\ \mathrm{Cov}_3 & i \ne i',\ j \ne j' \quad \text{(different test, different reader)} \end{cases}$$

It follows from model (1) that $\sigma_\varepsilon^2$, $\mathrm{Cov}_1$, $\mathrm{Cov}_2$, and $\mathrm{Cov}_3$ are also the variance and corresponding covariances of the AUC estimates, conditional on the reader and test × reader effects. Based on clinical considerations Obuchowski and Rockette [1] suggest the following ordering for the covariances:

$$\mathrm{Cov}_1 \ge \mathrm{Cov}_2 \ge \mathrm{Cov}_3 \ge 0. \tag{2}$$

In Section 3.4 I show that these constraints can be replaced by the less restrictive constraints

$$\mathrm{Cov}_1 \ge \mathrm{Cov}_3, \quad \mathrm{Cov}_2 \ge \mathrm{Cov}_3, \quad \mathrm{Cov}_3 \ge 0 \tag{3}$$

Alternatively, the model can be described in terms of the error correlations, defined by $\rho_i = \mathrm{Cov}_i / \sigma_\varepsilon^2$, $i = 1, 2, 3$.
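The equi-covariance structure can be made concrete with a short sketch that builds the full error covariance matrix for the $t \times r$ accuracy estimates. This is only an illustration, not part of the OR procedure itself: the function name `or_error_covariance` and the row ordering by (test, reader) pairs are assumptions, and the inputs are the jackknife estimates reported later in Table 2.

```python
from itertools import product

def or_error_covariance(t, r, var_eps, cov1, cov2, cov3):
    """Return the (test, reader) cell order and the (t*r) x (t*r) error covariance matrix."""
    cells = list(product(range(t), range(r)))  # (i, j) pairs, test index varies slowest
    matrix = []
    for (i, j) in cells:
        row = []
        for (i2, j2) in cells:
            if i == i2 and j == j2:
                row.append(var_eps)  # same test, same reader: error variance
            elif i != i2 and j == j2:
                row.append(cov1)     # different test, same reader
            elif i == i2 and j != j2:
                row.append(cov2)     # same test, different reader
            else:
                row.append(cov3)     # different test, different reader
        matrix.append(row)
    return cells, matrix

# Illustrative inputs: t = 2 tests, r = 4 readers, estimates taken from Table 2.
cells, sigma = or_error_covariance(
    t=2, r=4,
    var_eps=0.0022034331, cov1=0.0011163046,
    cov2=0.0008438255, cov3=0.0008871752,
)
for cell, row in zip(cells, sigma):
    print(cell, [f"{value:.6f}" for value in row])
```

Only four distinct values appear in the 8 × 8 matrix, which is what makes it feasible to estimate the structure from a single study.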

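When $\mathrm{Cov}_2$ and $\mathrm{Cov}_3$ are known, the OR statistic for testing the null hypothesis of no test effect ($H_0: \tau_i = 0$; $i = 1, \ldots, t$) is given by

$$F_{OR}^{*} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)} \tag{4}$$

where $\mathrm{MS}(T)$ and $\mathrm{MS}(T*R)$ are the test and test × reader mean squares; i.e., $\mathrm{MS}(T) = \frac{r}{t-1} \sum_{i=1}^{t} (\hat{\theta}_{i\cdot} - \hat{\theta}_{\cdot\cdot})^2$ and $\mathrm{MS}(T*R) = \frac{1}{(t-1)(r-1)} \sum_{i=1}^{t} \sum_{j=1}^{r} (\hat{\theta}_{ij} - \hat{\theta}_{i\cdot} - \hat{\theta}_{\cdot j} + \hat{\theta}_{\cdot\cdot})^2$. A subscript replaced by a dot indicates that values are averaged across the missing subscript index; for example, $\hat{\theta}_{\cdot\cdot} = \frac{1}{tr} \sum_{i=1}^{t} \sum_{j=1}^{r} \hat{\theta}_{ij}$.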
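The mean-square definitions translate directly into code. The following is a minimal sketch, assuming the accuracy estimates are held in a nested list `theta[i][j]` (test $i$, reader $j$); the function name `or_mean_squares` is hypothetical. The reader mean square MS(R), which is used later for single-test confidence intervals, is computed analogously. The inputs are the AUCs from the real-data example in Section 2.3, rounded to three decimals as printed in Table 2.

```python
def or_mean_squares(theta):
    """Compute MS(T), MS(R), MS(T*R) from a t x r table of accuracy estimates."""
    t, r = len(theta), len(theta[0])
    test_means = [sum(row) / r for row in theta]                               # theta_hat_{i.}
    reader_means = [sum(theta[i][j] for i in range(t)) / t for j in range(r)]  # theta_hat_{.j}
    grand = sum(test_means) / t                                                # theta_hat_{..}

    ms_t = r / (t - 1) * sum((m - grand) ** 2 for m in test_means)
    ms_r = t / (r - 1) * sum((m - grand) ** 2 for m in reader_means)
    ms_tr = sum(
        (theta[i][j] - test_means[i] - reader_means[j] + grand) ** 2
        for i in range(t) for j in range(r)
    ) / ((t - 1) * (r - 1))
    return ms_t, ms_r, ms_tr

# AUCs as printed in Table 2 (three decimals only).
auc = [
    [0.815, 0.767, 0.831, 0.803],  # test 1 (soft copy), readers 1-4
    [0.854, 0.812, 0.900, 0.798],  # test 2 (hard copy), readers 1-4
]
ms_t, ms_r, ms_tr = or_mean_squares(auc)
print(f"MS(T)={ms_t:.6f}  MS(R)={ms_r:.6f}  MS(T*R)={ms_tr:.6f}")
```

Because the inputs are rounded, the results (about 0.00274, 0.00238, and 0.00048) agree with the mean squares in Table 2 only to roughly two significant digits.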

In practice the statistic actually used is

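$$F_{OR} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]} \tag{5}$$

where $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ denote estimates for $\mathrm{Cov}_2$ and $\mathrm{Cov}_3$, respectively. Note that (5) incorporates the constraints specified by (3) by setting $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ to zero if it is negative. Since $\mathrm{Cov}_2$ and $\mathrm{Cov}_3$ are also the corresponding covariances of the AUC estimates conditional on the reader and test × reader effects, they can be estimated using methods that treat cases as random but readers as fixed, such as jackknifing, bootstrapping, parametric methods, or the method proposed by DeLong et al [7] for trapezoidal-rule (or empirical) AUC estimates [8]. The OR estimates obtained from averaging corresponding fixed-reader AUC variances and covariances are denoted by $\hat{\sigma}_\varepsilon^2$, $\widehat{\mathrm{Cov}}_1$, $\widehat{\mathrm{Cov}}_2$, and $\widehat{\mathrm{Cov}}_3$. Hillis [6] shows that $F_{OR}$ has an approximate $F_{t-1;\,ddf_H}$ null distribution, where

$$ddf_H = \frac{\{\mathrm{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\}^2}{[\mathrm{MS}(T*R)]^2 / [(t-1)(r-1)]} \tag{6}$$

More generally, $F_{OR}$ has a noncentral $F_{t-1,\,df_2}$ distribution with noncentrality parameter $\lambda$, where

$$\lambda = \frac{r \sum_{i=1}^{t} \tau_i^2}{\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)}$$

and

$$df_2 = \frac{[\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)]^2}{[\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3]^2 / [(t-1)(r-1)]}.$$

Letting $\theta_i$ denote the expected reader performance measure for test $i$ (i.e., $\theta_i = E(\hat{\theta}_{i\cdot})$), an approximate $(1 - \alpha)100\%$ confidence interval for the contrast $\sum_{i=1}^{t} l_i \theta_i$ (with $\sum_{i=1}^{t} l_i = 0$) is given by

$$\sum_{i=1}^{t} l_i \hat{\theta}_{i\cdot} \pm t_{\alpha/2;\,ddf_H} \sqrt{\hat{V}}, \qquad \hat{V} = \frac{1}{r} \Bigl( \sum_{i=1}^{t} l_i^2 \Bigr) \{\mathrm{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\}.$$

An approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$, using a standard error computed from all of the data, is given by $\hat{\theta}_{i\cdot} \pm t_{\alpha/2;\,df_2} \sqrt{\hat{V}}$, where

$$\hat{V} = \frac{1}{tr} [\mathrm{MS}(R) + (t-1)\mathrm{MS}(T*R) + tr \max(\widehat{\mathrm{Cov}}_2, 0)]$$

and

$$df_2 = \frac{[\mathrm{MS}(R) + (t-1)\mathrm{MS}(T*R) + tr \max(\widehat{\mathrm{Cov}}_2, 0)]^2}{[\mathrm{MS}(R)]^2/(r-1) + [(t-1)\mathrm{MS}(T*R)]^2 / [(t-1)(r-1)]}.$$

Alternatively, an approximate $(1 - \alpha)100\%$ confidence interval for test $i$, using a standard error computed only from data for test $i$, is given by $\hat{\theta}_{i\cdot} \pm t_{\alpha/2;\,df_2^{(i)}} \sqrt{\hat{V}^{(i)}}$, where

$$\hat{V}^{(i)} = \frac{1}{r} [\mathrm{MS}(R)^{(i)} + r \max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)], \qquad df_2^{(i)} = \frac{[\mathrm{MS}(R)^{(i)} + r \max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)]^2}{[\mathrm{MS}(R)^{(i)}]^2 / (r-1)};$$

here $\mathrm{MS}(R)^{(i)}$ and $\widehat{\mathrm{Cov}}_2^{(i)}$ are computed only from test $i$ data. I recommend this latter formula for single AUC confidence intervals, since it does not depend on assuming equal error covariances and variances for each test. All of these results have been previously presented [6].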

Expected mean squares are given in Table 1a; proofs for these results are given by Hillis [6]. Expressions for the variance components, in terms of the expected mean squares and covariances are presented in Table 1b; these relationships follow directly from Table 1a. Estimated variance components result by replacing expected mean squares by mean squares and covariance parameters by estimates; for example,

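$$\hat{\sigma}_{TR}^2 = \mathrm{MS}(T*R) - \hat{\sigma}_\varepsilon^2 + \widehat{\mathrm{Cov}}_1 + \max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)$$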

Typically the variance component estimates are changed to zero if the computed values are negative.

Table 1.

Expected mean square and variance component formulas for the Obuchowski-Rockette model.

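(a) Expected mean squares

| Mean square | Expected mean square |
|---|---|
| MS(T) | $\frac{r}{t-1} \sum_{i=1}^{t} \tau_i^2 + \sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ |
| MS(R) | $t\sigma_R^2 + \sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_2 + (t-1)(\mathrm{Cov}_1 - \mathrm{Cov}_3)$ |
| MS(T * R) | $\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3$ |

(b) Variance components

| Variance component | Equivalent function of expected mean squares and covariances |
|---|---|
| $\sigma_R^2$ | $\frac{1}{t} E\{\mathrm{MS}(R) - \mathrm{MS}(T*R)\} - \mathrm{Cov}_1 + \mathrm{Cov}_3$ |
| $\sigma_{TR}^2$ | $E[\mathrm{MS}(T*R)] - \sigma_\varepsilon^2 + \mathrm{Cov}_1 + (\mathrm{Cov}_2 - \mathrm{Cov}_3)$ |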

2.3. Real-data example

To illustrate the OR method for the factorial design, I compare reader AUCs for hard- and soft-copy computed radiography chest images selected randomly from a medical intensive care unit. In the study [9] four radiologists blindly read both hard- and soft-copy images obtained with computed radiography from the same patients. Six months separated the end of the hard-copy readings and the start of the soft-copy readings. A five-point ordinal scale was used to rate the likelihood of presence of the condition (which I will refer to as “disease”) implied by the reason for requesting the corresponding examination. Ninety-five images, consisting of 29 diseased and 66 nondiseased images, were read under each test condition.

The analysis of this study using empirical AUC estimates and jackknife covariance estimates is displayed in Table 2. The AUCs for soft- and hard-copy images, averaged across the four readers, are 0.804 and 0.841, respectively. The test for the null hypothesis of no test effect (i.e., the population average AUC across readers is the same for soft- and hard-copy images) is not significant (FOR = 6.01, ddfH = 3, p = .092); the 95% confidence interval for the difference of the population AUCs (hard- minus soft-copy) is (−0.011, 0.086). Parts (i) and (j) give 95% confidence intervals for the single-test AUCs, based on all of the data and only on data for the specific test, respectively. The confidence intervals from the two methods are similar; this is expected because the AUCs are similar.

Table 2.

Obuchowski-Rockette analysis of Kundel et al [9] data for soft- and hard-copy computed radiographs using trapezoid AUC estimation and jackknife covariance estimation for t = 2 tests, r = 4 readers, c = 95 cases (66 nondiseased, 29 diseased).

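(a) Trapezoid AUCs:

| Reader ($j$) | Test 1 (Soft-copy), $\hat{\theta}_{1j}$ | Test 2 (Hard-copy), $\hat{\theta}_{2j}$ |
|---|---|---|
| 1 | 0.815 | 0.854 |
| 2 | 0.767 | 0.812 |
| 3 | 0.831 | 0.900 |
| 4 | 0.803 | 0.798 |
| Mean | $\hat{\theta}_{1\cdot} = .804$ | $\hat{\theta}_{2\cdot} = .841$ |

(b) ANOVA table:

| Source | df | Sum of squares | Mean square |
|---|---|---|---|
| T | 1 | 0.00281054 | 0.00281054 |
| R | 3 | 0.00715054 | 0.00238351 |
| T*R | 3 | 0.00140392 | 0.00046797 |

(c) Fixed-reader covariance and corresponding correlation estimates computed from jackknife covariance matrix:

$\hat{\sigma}_\varepsilon^2 = .0022034331$, $\widehat{\mathrm{Cov}}_1 = .0011163046$, $\widehat{\mathrm{Cov}}_2 = .0008438255$, $\widehat{\mathrm{Cov}}_3 = .0008871752$, $\hat{\rho}_1 = 0.507$, $\hat{\rho}_2 = 0.383$, $\hat{\rho}_3 = 0.403$

(d) Variance component estimates using Table 1b formulas:

$\hat{\sigma}_R^2 = \frac{1}{t}\{\mathrm{MS}(R) - \mathrm{MS}(T*R)\} - \widehat{\mathrm{Cov}}_1 + \widehat{\mathrm{Cov}}_3 = 0.0007286397$

$\hat{\sigma}_{TR}^2 = \mathrm{MS}(T*R) - \hat{\sigma}_\varepsilon^2 + \widehat{\mathrm{Cov}}_1 + \max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0) = -0.000662504$ (typically this would be changed to zero)

(e) $F_{OR} = \dfrac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + r \max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)} = 6.00576$

(f) Denominator degrees of freedom:

$ddf_H = \dfrac{\{\mathrm{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\}^2}{[\mathrm{MS}(T*R)]^2 / [(t-1)(r-1)]} = 3$

(g) P-value for $H_0: \theta_1 = \theta_2$: $p = \Pr(F_{t-1,\,ddf_H} > F_{OR}) = .092$

(h) 95% CI for $\theta_2 - \theta_1$: $\hat{\theta}_{2\cdot} - \hat{\theta}_{1\cdot} \pm t_{\alpha/2;\,ddf_H} \sqrt{\frac{2}{r}\{\mathrm{MS}(T*R) + r \max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\}} = (-0.0111940, 0.086168)$

(i) Single-test 95% confidence intervals based on all of the data. Note: $\mathrm{StdErr} = \sqrt{\frac{1}{tr}[\mathrm{MS}(R) + (t-1)\mathrm{MS}(T*R) + tr \max(\widehat{\mathrm{Cov}}_2, 0)]}$.

| $i$ | $\hat{\theta}_{i\cdot}$ | StdErr | $df_2$ | 95% CI |
|---|---|---|---|---|
| 1 (Soft-copy) | 0.804 | .0346 | 46.9 | 0.734, 0.874 |
| 2 (Hard-copy) | 0.841 | .0346 | 46.9 | 0.772, 0.911 |

(j) Single-test 95% confidence intervals using only corresponding test data. Note: $\mathrm{StdErr}^{(i)} = \sqrt{\frac{1}{r}[\mathrm{MS}(R)^{(i)} + r \max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)]}$.

| $i$ | $\hat{\theta}_{i\cdot}$ | $\widehat{\mathrm{Cov}}_2^{(i)}$ | $\mathrm{MS}(R)^{(i)}$ | $\mathrm{StdErr}^{(i)}$ | $df_2^{(i)}$ | 95% CI |
|---|---|---|---|---|---|---|
| 1 (Soft-copy) | 0.804 | 0.000880 | 0.000735 | 0.0326 | 100.4 | 0.739, 0.867 |
| 2 (Hard-copy) | 0.841 | 0.000808 | 0.002116 | 0.0366 | 19.2 | 0.765, 0.918 |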
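Several of the Table 2 entries can be reproduced from the reported mean squares and covariance estimates alone. The sketch below is an illustration under that assumption; the function names are hypothetical, and it covers only the test statistic (5), its denominator degrees of freedom (6), and the two single-test standard errors with their degrees of freedom. P-values and t quantiles are omitted because they would need a statistics library.

```python
import math

def or_test(ms_t, ms_tr, cov2, cov3, t, r):
    """Return (F_OR, ddf_H) from equations (5) and (6)."""
    d_hat = max(r * (cov2 - cov3), 0.0)  # constraint (3): truncate at zero
    denom = ms_tr + d_hat
    ddf_h = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))
    return ms_t / denom, ddf_h

def se_all_data(ms_r, ms_tr, cov2, t, r):
    """Single-test standard error and df2 based on all of the data."""
    total = ms_r + (t - 1) * ms_tr + t * r * max(cov2, 0.0)
    df2 = total ** 2 / (ms_r ** 2 / (r - 1)
                        + ((t - 1) * ms_tr) ** 2 / ((t - 1) * (r - 1)))
    return math.sqrt(total / (t * r)), df2

def se_own_data(ms_r_i, cov2_i, r):
    """Single-test standard error and df2 using only that test's data."""
    total = ms_r_i + r * max(cov2_i, 0.0)
    return math.sqrt(total / r), total ** 2 / (ms_r_i ** 2 / (r - 1))

t, r = 2, 4
f_or, ddf_h = or_test(0.00281054, 0.00046797, 0.0008438255, 0.0008871752, t, r)
se_all, df_all = se_all_data(0.00238351, 0.00046797, 0.0008438255, t, r)
se_soft, df_soft = se_own_data(0.000735, 0.000880, r)
se_hard, df_hard = se_own_data(0.002116, 0.000808, r)

print(f"F_OR={f_or:.3f}, ddf_H={ddf_h:.1f}")
print(f"all data:  SE={se_all:.4f}, df2={df_all:.1f}")
print(f"soft copy: SE={se_soft:.4f}, df2={df_soft:.1f}")
print(f"hard copy: SE={se_hard:.4f}, df2={df_hard:.1f}")
```

Because the estimated $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ is negative here, the truncation in (5) reduces the denominator to MS(T*R) and $ddf_H$ to $(t-1)(r-1) = 3$. Small differences from the table in the last printed digit reflect rounding of the inputs.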

Although this study showed a nonsignificant difference between soft- and hard-copy image reader performance, the confidence interval for the difference of the AUCs showed a difference as large as 0.086 to be commensurate with the data. In such a situation, the researcher may decide to design a future study that would produce a more precise estimate of the difference. Increased precision could result from an increase in the number of cases, the number of readers, or from replicated readings where each reader reads each image 2 or more times. If increasing the number of cases and readers is not feasible, then a replicated study is a natural choice for increasing power; however, OR analysis methodology has been developed only for the nonreplicated factorial design. I use the algorithm described in this paper to derive the OR-type procedure for the replicated factorial design, including the test-statistic nonnull distribution, which allows for power and sample size estimation. Using this result, I illustrate efficiency computations comparing the nonreplicated and replicated designs in Section 5.6.

In this study the same radiologists also similarly rated 95 hard-copy chest images obtained with screen-film; these images were from different patients than the computed radiographs. Because the original OR method assumes a factorial study design with readers reading the same cases under each test, it cannot be used to compare the screen-film AUC outcomes with the AUC outcomes from either the soft- or hard-copy computed radiograph images. In Section 5.3 I show how the OR approach can be adapted for this situation, which represents a split-plot study design with cases nested within test, and illustrate the analysis of these data.

2.4. Previous derivations of OR properties

Properties of the OR procedure have previously been derived starting with the OR model (1, 2). For what is essentially the OR model, Pavur and Nath [10] show that, for testing the null hypothesis of equal tests, the F statistic that is appropriate when the errors are independent can be used if corrected by a multiplicative factor. The multiplicative factor is a function of the correlations, which are assumed known, and the distribution for this corrected F statistic is the same as for the uncorrected F statistic when the errors are independent. The approach taken by Obuchowski and Rockette [1] was to modify this result by replacing the assumed-known correlations by estimated correlations. This approach yielded valid ANOVA statistics but unsatisfactory degrees of freedom, resulting in overly conservative tests [6]. Alternatively, Hillis [6] directly derived properties, but the proofs are tedious and nonintuitive.

3. MM-ANOVA APPROACH – STEP 1: DERIVE THE MM-ANOVA MODEL

In Sections 3–4 and Appendices A–C I show how the properties of the OR model can easily be derived using an algorithm, based on the mm-ANOVA approach, that only requires knowing how to determine conventional ANOVA test statistics and expected mean squares. I describe and illustrate the steps in the algorithm for the typical balanced test×reader×case study design discussed in the previous section. The steps are stated in a general form so that they can be applied to other balanced study designs. The mm-ANOVA approach and corresponding algorithm have not been previously described and are the main contribution of this paper.

3.1. Step 1a: Define the conventional ANOVA model that corresponds to the study design as if each reader-performance measure was the mean of case outcomes

Let Yijk denote a hypothetical outcome for test i, reader j, and case k. For our purposes Yijk is used only to illustrate the marginal ANOVA model approach; i.e., it does not represent an actual study outcome and should be distinguished from the observed rating Zijk. I assume that the Yijk follow a three-way conventional ANOVA model that corresponds to the study design.

Thus the distribution of Yijk is given by the following test × reader × case ANOVA model that treats test as a fixed factor and reader and case as random factors:

$$Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + (\tau RC)_{ijk} + \varepsilon_{ijk} \tag{7}$$

$i = 1, \ldots, t$, $j = 1, \ldots, r$, $k = 1, \ldots, c$, where $\tau_i$ denotes the fixed effect of test $i$ with $\sum_{i=1}^{t} \tau_i = 0$, $R_j$ denotes the random effect of reader $j$, $C_k$ denotes the random effect of case $k$, the multiple symbols in parentheses denote random interactions, and $\varepsilon_{ijk}$ is the error term. The random effects are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$, $\sigma_C^2$, $\sigma_{TR}^2$, $\sigma_{TC}^2$, $\sigma_{RC}^2$, $\sigma_{TRC}^2$, and $\sigma_\varepsilon^2$. Because there are no replications, for estimation purposes $\sigma_{TRC}^2$ and $\sigma_\varepsilon^2$ are inseparable; hence I define

$$\sigma^2 = \sigma_{TRC}^2 + \sigma_\varepsilon^2$$

Results for this model, such as mean square distributional properties and ANOVA test statistics, are well known (e.g., [11]) and will be stated without references.

3.2. Step 1b: From the conventional ANOVA model defined in step 1a, derive the mm-ANOVA model by averaging across cases and defining the mm-ANOVA model error term equal to the mean, across cases, of the sum of the conventional ANOVA model error term and random effects involving case

I say that a random effect “involves case” if it is subscripted according to case. Let $\tilde{\theta}_{ij}$ denote the marginal mean resulting from averaging over cases; i.e.,

$$\tilde{\theta}_{ij} = Y_{ij\cdot} \tag{8}$$

I use the term marginal-mean ANOVA model (mm-ANOVA model) to refer to the model implied by the conventional 3-way ANOVA model (7) for the marginal mean (8). It follows from (7) that

$$\tilde{\theta}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \tilde{\varepsilon}_{ij} \tag{9}$$

where

$$\tilde{\varepsilon}_{ij} = C_{\cdot} + (\tau C)_{i\cdot} + (RC)_{j\cdot} + (\tau RC)_{ij\cdot} + \varepsilon_{ij\cdot} \tag{10}$$

the $R_j$ and $(\tau R)_{ij}$ are mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$ and $\sigma_{TR}^2$, and the $\tilde{\varepsilon}_{ij}$ are independent of the $R_j$ and $(\tau R)_{ij}$.

3.3. Step 1c: Express the mm-ANOVA model error variance and covariances in terms of the conventional ANOVA model variance components

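From (10) it follows that the $\tilde{\varepsilon}_{ij}$ are normally distributed with mean 0, variance

$$\sigma_{\tilde{\varepsilon}}^2 = \frac{1}{c} (\sigma_C^2 + \sigma_{TC}^2 + \sigma_{RC}^2 + \sigma_{TRC}^2 + \sigma_\varepsilon^2) \tag{11}$$

and equi-correlated with

$$\mathrm{Cov}_1 \equiv \mathrm{cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{i'j}) = \frac{1}{c} (\sigma_C^2 + \sigma_{RC}^2) \tag{12}$$

$$\mathrm{Cov}_2 \equiv \mathrm{cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{ij'}) = \frac{1}{c} (\sigma_C^2 + \sigma_{TC}^2) \tag{13}$$

and

$$\mathrm{Cov}_3 \equiv \mathrm{cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{i'j'}) = \frac{1}{c} \sigma_C^2 \tag{14}$$

where $i \ne i'$ and $j \ne j'$.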
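To make the step from (10) to (12) explicit, the following worked derivation sketches the $\mathrm{Cov}_1$ case; (13) and (14) follow in the same way. It relies only on the mutual independence of the random effects in model (7) and on the fact that the mean of $c$ independent terms with common variance $v$ has variance $v/c$.

```latex
% Errors for different tests (i \ne i') and the same reader j.
% Only C_{\cdot} and (RC)_{j\cdot} appear in both error terms; every other
% component of (10) is independent across the two terms and contributes zero.
\begin{align*}
\operatorname{cov}(\tilde{\varepsilon}_{ij}, \tilde{\varepsilon}_{i'j})
  &= \operatorname{Var}(C_{\cdot}) + \operatorname{Var}\bigl((RC)_{j\cdot}\bigr) \\
  &= \frac{\sigma_C^2}{c} + \frac{\sigma_{RC}^2}{c}
   = \frac{1}{c}\bigl(\sigma_C^2 + \sigma_{RC}^2\bigr).
\end{align*}
```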

3.4. Step 1d: Determine the mm-ANOVA model covariance constraints implied by step 1c

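The covariance constraints given by (3) follow from (12–14). Thus the mm-ANOVA model for $\tilde{\theta}_{ij}$ is defined by (9) and (3). It also follows from (11–14) that $\sigma_{\tilde{\varepsilon}}^2 \ge \mathrm{Cov}_1 + \mathrm{Cov}_2 - \mathrm{Cov}_3$, but I do not include this constraint as part of the definition of the mm-ANOVA model because this constraint is implied by the relationship $\mathrm{Var}(\tilde{\varepsilon}_{11} - \tilde{\varepsilon}_{12} - \tilde{\varepsilon}_{21} + \tilde{\varepsilon}_{22}) \ge 0$.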
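The variance inequality mentioned above can be expanded term by term. The following is a short worked computation using only the error variance and the three covariances of the mm-ANOVA model; each of the six cross terms is classified by whether the two errors share a test, a reader, or neither.

```latex
\begin{align*}
\operatorname{Var}(\tilde{\varepsilon}_{11} - \tilde{\varepsilon}_{12}
                   - \tilde{\varepsilon}_{21} + \tilde{\varepsilon}_{22})
  &= 4\sigma_{\tilde{\varepsilon}}^2
     - 2\,\mathrm{Cov}_2 - 2\,\mathrm{Cov}_1 + 2\,\mathrm{Cov}_3
     + 2\,\mathrm{Cov}_3 - 2\,\mathrm{Cov}_1 - 2\,\mathrm{Cov}_2 \\
  &= 4\bigl(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\bigr) \ge 0 ,
\end{align*}
% which rearranges to
% \sigma_{\tilde{\varepsilon}}^2 \ge \mathrm{Cov}_1 + \mathrm{Cov}_2 - \mathrm{Cov}_3 .
% By (11)-(14) the bracketed quantity equals (\sigma_{TRC}^2 + \sigma_\varepsilon^2)/c.
```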

3.5. Remarks

3.5.1. One-to-one relationship between parameters of the 3-way conventional ANOVA and corresponding mm-ANOVA models

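In terms of the mm-ANOVA model parameters ($\mu$, $\tau_i$, $\sigma_R^2$, $\sigma_{TR}^2$, $\sigma_{\tilde{\varepsilon}}^2$, $\mathrm{Cov}_1$, $\mathrm{Cov}_2$, and $\mathrm{Cov}_3$), the parameters for the corresponding three-way ANOVA model (7) are given by $\mu$, $\tau_i$, $\sigma_R^2$, $\sigma_{TR}^2$, $\sigma^2 = c(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$, $\sigma_C^2 = c\,\mathrm{Cov}_3$, $\sigma_{TC}^2 = c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, and $\sigma_{RC}^2 = c(\mathrm{Cov}_1 - \mathrm{Cov}_3)$, where $\sigma^2 = \sigma_{TRC}^2 + \sigma_\varepsilon^2$. Thus there is a one-to-one relationship between the parameters of the two models. Hence for any mm-ANOVA model, defined by (9) and (3), there is a corresponding conventional 3-way ANOVA model (7) that implies that model for the marginal means. These relationships between the two models are presented in Table 3.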

Table 3.

Relationships between the 3-way ANOVA (7) and corresponding mm-ANOVA (9, 3) model parameters

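| 3-way ANOVA parameter | Equivalent function of mm-ANOVA parameters |
|---|---|
| $\mu$ | $\mu$ |
| $\tau_i$ | $\tau_i$ |
| $\sigma_R^2$ | $\sigma_R^2$ |
| $\sigma_{TR}^2$ | $\sigma_{TR}^2$ |
| $\sigma_C^2$ | $c\,\mathrm{Cov}_3$ |
| $\sigma_{TC}^2$ | $c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ |
| $\sigma_{RC}^2$ | $c(\mathrm{Cov}_1 - \mathrm{Cov}_3)$ |
| $\sigma^2 \equiv \sigma_{TRC}^2 + \sigma_\varepsilon^2$ | $c(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$ |

| mm-ANOVA parameter | Equivalent function of 3-way ANOVA parameters |
|---|---|
| $\mu$ | $\mu$ |
| $\tau_i$ | $\tau_i$ |
| $\sigma_R^2$ | $\sigma_R^2$ |
| $\sigma_{TR}^2$ | $\sigma_{TR}^2$ |
| $\sigma_{\tilde{\varepsilon}}^2$ | $\frac{1}{c}(\sigma_C^2 + \sigma_{TC}^2 + \sigma_{RC}^2 + \sigma^2)$ |
| $\mathrm{Cov}_1$ | $\frac{1}{c}(\sigma_C^2 + \sigma_{RC}^2)$ |
| $\mathrm{Cov}_2$ | $\frac{1}{c}(\sigma_C^2 + \sigma_{TC}^2)$ |
| $\mathrm{Cov}_3$ | $\frac{1}{c}\sigma_C^2$ |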

These relationships assume covariance constraints (3) for the mm-ANOVA model and the same linear constraints for the τi (i.e., ∑ τi = 0) for both models.
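The one-to-one mapping in Table 3 can be checked numerically with a round trip. The sketch below is illustrative only: the function names and the variance-component values are hypothetical, and `var_resid` stands for $\sigma^2 = \sigma_{TRC}^2 + \sigma_\varepsilon^2$. The parameters $\mu$, $\tau_i$, $\sigma_R^2$, and $\sigma_{TR}^2$ map to themselves and are left out.

```python
def three_way_to_mm(c, var_c, var_tc, var_rc, var_resid):
    """Map case-related three-way ANOVA components to mm-ANOVA error parameters."""
    var_err = (var_c + var_tc + var_rc + var_resid) / c  # equation (11)
    cov1 = (var_c + var_rc) / c                          # equation (12)
    cov2 = (var_c + var_tc) / c                          # equation (13)
    cov3 = var_c / c                                     # equation (14)
    return var_err, cov1, cov2, cov3

def mm_to_three_way(c, var_err, cov1, cov2, cov3):
    """Inverse mapping (upper half of Table 3)."""
    var_c = c * cov3
    var_tc = c * (cov2 - cov3)
    var_rc = c * (cov1 - cov3)
    var_resid = c * (var_err - cov1 - cov2 + cov3)
    return var_c, var_tc, var_rc, var_resid

# Hypothetical variance components for c = 100 cases.
c = 100
components = (0.30, 0.05, 0.20, 0.20)  # sigma_C^2, sigma_TC^2, sigma_RC^2, sigma^2
var_err, cov1, cov2, cov3 = three_way_to_mm(c, *components)
recovered = mm_to_three_way(c, var_err, cov1, cov2, cov3)

# Constraints (3) hold automatically because variance components are nonnegative.
constraints_ok = cov1 >= cov3 and cov2 >= cov3 and cov3 >= 0
print("mm-ANOVA parameters:", var_err, cov1, cov2, cov3)
print("round trip:", recovered, "constraints (3) satisfied:", constraints_ok)
```

The round trip recovers the starting components up to floating-point error, which is the numerical counterpart of the one-to-one relationship stated above.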

3.5.2. Equivalence of the OR and mm-ANOVA models

Note that the mm-ANOVA model (9, 3) has the same form as the OR model (1, 2), with the only difference being that the mm-ANOVA model covariance constraints (3) are less restrictive. Since the OR covariance constraints (2) were suggested by Obuchowski and Rockette [1] based only on clinical considerations, to simplify comparison of the models I now modify the definition of the OR model to include the less restrictive mm-ANOVA model constraints (3); i.e., the OR model is now considered to be defined by equations (1) and (3). With this change the OR and the mm-ANOVA models become equivalent.

3.5.3. Definition of the mm-ANOVA approach

Because the OR and mm-ANOVA models are identical, statistical properties for the ROC accuracy estimates, the $\hat{\theta}_{ij}$, are the same as for the marginal means, the $\tilde{\theta}_{ij}$, for an mm-ANOVA model having the same parameter values as the OR model. The mm-ANOVA approach consists of deriving statistical properties for the OR model (1, 3) by recognizing that it is equivalent to the mm-ANOVA model (9, 3), and then deriving properties of the mm-ANOVA model by utilizing its relationship with the conventional three-way ANOVA model. The advantage of this approach is that properties of the conventional three-way ANOVA model are well known.

3.5.4. Motivation for the OR model

The mm-ANOVA approach provides an intuitive motivation for the OR model (1, 3) as follows. Suppose, hypothetically, that the reader performance outcome $\hat{\theta}_{ij}$ is the mean of case-specific outcomes; that is, suppose that $\hat{\theta}_{ij} = Y_{ij\cdot}$ for some outcome $Y_{ijk}$, with $k = 1, \ldots, c$. A typical way to account for variation in $\hat{\theta}_{ij}$ due to readers and cases would be to assume the three-way ANOVA model (7), which implies the mm-ANOVA model (9, 3) and hence also the equivalent OR model (1, 3) for $\hat{\theta}_{ij}$. Of course, in practice $\hat{\theta}_{ij}$ is not a marginal mean, but rather a nonlinear function of the case-specific confidence-of-disease ratings and truth-state (i.e., reference standard) indicator values. However, the mm-ANOVA approach shows that the OR model accounts for reader and case variation using the covariance structure implied by a conventional three-way ANOVA model, as if the accuracy estimate was a marginal mean.

4. MM-ANOVA APPROACH – STEP 2: DERIVE THE MM-ANOVA MODEL TEST STATISTIC AND ITS NULL DISTRIBUTION FOR A HYPOTHESIS EXPRESSED IN TERMS OF TEST ACCURACIES

In this section I show how to derive the mm-ANOVA model test statistic and its null distribution for testing the null hypothesis of equal test accuracies. I define test accuracy as the expected reader-performance measure for a particular test level. However, more generally these steps can be applied to any hypothesis that can be expressed in terms of linear functions of expected reader-performance outcomes.

4.1. Step 2a: State the hypothesis of interest in terms of the mm-ANOVA model

For the mm-ANOVA model (9, 3) let $\theta_i$ denote the test accuracy for test $i$; i.e., $\theta_i = E(\tilde{\theta}_{i\cdot})$ is the expected reader-performance outcome for test $i$ across the population of readers. The hypothesis of interest is the global null hypothesis of equal test accuracies, i.e., $H_0: \theta_1 = \ldots = \theta_t$, or equivalently, $H_0: \tau_1 = \ldots = \tau_t = 0$.

4.2. Step 2b: Express the hypothesis from step 2a in terms of the conventional ANOVA model

Noting that

$$\theta_i = E(\tilde{\theta}_{i\cdot}) = E(Y_{i\cdot\cdot}) = \mu + \tau_i$$

it follows that H0 : θ1 = … = θt is equivalent to H0 : τ1 = … = τt = 0 for the conventional ANOVA model (7).

4.3. Step 2c: Create the expected-mean-square table for the conventional ANOVA model

Let MS(T), MS(R), and MS(C) denote the conventional ANOVA mean squares due to test, reader, and case, respectively, with interaction mean squares notated in the usual manner. The expected mean squares for the conventional ANOVA model are presented in Table 4. These relationships will be utilized in other steps.

Table 4.

Expected mean squares for the conventional test-by-reader-by-case factorial ANOVA model (7).

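| Mean square | Expected mean square |
|---|---|
| MS(T) | $\frac{rc}{t-1} \sum_{i=1}^{t} \tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$ |
| MS(R) | $tc\sigma_R^2 + c\sigma_{TR}^2 + t\sigma_{RC}^2 + \sigma^2$ |
| MS(C) | $tr\sigma_C^2 + r\sigma_{TC}^2 + t\sigma_{RC}^2 + \sigma^2$ |
| MS(T * R) | $c\sigma_{TR}^2 + \sigma^2$ |
| MS(T * C) | $r\sigma_{TC}^2 + \sigma^2$ |
| MS(R * C) | $t\sigma_{RC}^2 + \sigma^2$ |
| MS(T * R * C) | $\sigma^2 \equiv \sigma_{TRC}^2 + \sigma_\varepsilon^2$ |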

4.4. Step 2d: Determine the conventional ANOVA F statistic corresponding to the step 2b hypothesis

The conventional ANOVA test statistic for testing for H0 : τ1 = … = τt = 0 is given by

$$F = \frac{\mathrm{MS}(T)}{\mathrm{MS}(T*R) + \mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)} \tag{15}$$

I refer to $F$ as an ANOVA statistic because its numerator and denominator have the same expectation under $H_0$, but the numerator has a larger expectation than the denominator under $H_1: \tau_i \ne \tau_j$ for some $i \ne j$.

4.5. Step 2e: Express mm-ANOVA mean squares in terms of conventional ANOVA mean squares

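For the mm-ANOVA model let $\widetilde{\mathrm{MS}}(T)$, $\widetilde{\mathrm{MS}}(R)$, and $\widetilde{\mathrm{MS}}(T*R)$ denote the test, reader, and test×reader mean squares; i.e.,

$$\widetilde{\mathrm{MS}}(T) = \frac{r}{t-1} \sum_{i=1}^{t} (\tilde{\theta}_{i\cdot} - \tilde{\theta}_{\cdot\cdot})^2, \quad \widetilde{\mathrm{MS}}(R) = \frac{t}{r-1} \sum_{j=1}^{r} (\tilde{\theta}_{\cdot j} - \tilde{\theta}_{\cdot\cdot})^2, \quad \widetilde{\mathrm{MS}}(T*R) = \frac{1}{(t-1)(r-1)} \sum_{i=1}^{t} \sum_{j=1}^{r} (\tilde{\theta}_{ij} - \tilde{\theta}_{i\cdot} - \tilde{\theta}_{\cdot j} + \tilde{\theta}_{\cdot\cdot})^2.$$

Noting that

$$\mathrm{MS}(T) = \frac{rc}{t-1} \sum_{i=1}^{t} (Y_{i\cdot\cdot} - Y_{\cdot\cdot\cdot})^2, \quad \mathrm{MS}(R) = \frac{tc}{r-1} \sum_{j=1}^{r} (Y_{\cdot j\cdot} - Y_{\cdot\cdot\cdot})^2, \quad \mathrm{MS}(T*R) = \frac{c}{(t-1)(r-1)} \sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij\cdot} - Y_{i\cdot\cdot} - Y_{\cdot j\cdot} + Y_{\cdot\cdot\cdot})^2,$$

it follows that

$$\widetilde{\mathrm{MS}}(T) = \frac{1}{c} \mathrm{MS}(T) \tag{16}$$

$$\widetilde{\mathrm{MS}}(R) = \frac{1}{c} \mathrm{MS}(R)$$

$$\widetilde{\mathrm{MS}}(T*R) = \frac{1}{c} \mathrm{MS}(T*R) \tag{17}$$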

4.6. Step 2f: Express F from step 2d in terms of mm-ANOVA model mean squares and U, where U is a linear function of conventional ANOVA model mean squares that involve case

It follows from (16–17) that (15) can be written in the form

$$F = \frac{\widetilde{\mathrm{MS}}(T)}{\widetilde{\mathrm{MS}}(T*R) + U} \tag{18}$$

where

$$U = \frac{1}{c} [\mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)]$$

Note that $U$ is a linear function of conventional ANOVA model mean squares involving case and (18) is an ANOVA statistic.
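The identities in steps 2e and 2f are purely algebraic, so they can be checked on any balanced data array. The sketch below is a numerical illustration, assuming hypothetical outcomes `Y[i][j][k]` drawn from a seeded random generator; it computes the conventional three-way mean squares, the mm-ANOVA mean squares from the case-averaged values, and the statistic in both forms (15) and (18).

```python
import random
from itertools import product

def mean(values):
    values = list(values)
    return sum(values) / len(values)

random.seed(0)
t, r, c = 2, 3, 5
# Hypothetical outcomes; any numbers work because the identities are algebraic.
Y = [[[random.gauss(0, 1) for _ in range(c)] for _ in range(r)] for _ in range(t)]

# Marginal means of Y (dots in the text correspond to averaged indices).
g = mean(Y[i][j][k] for i, j, k in product(range(t), range(r), range(c)))
y_i = [mean(Y[i][j][k] for j in range(r) for k in range(c)) for i in range(t)]
y_j = [mean(Y[i][j][k] for i in range(t) for k in range(c)) for j in range(r)]
y_k = [mean(Y[i][j][k] for i in range(t) for j in range(r)) for k in range(c)]
y_ij = [[mean(Y[i][j]) for j in range(r)] for i in range(t)]
y_ik = [[mean(Y[i][j][k] for j in range(r)) for k in range(c)] for i in range(t)]
y_jk = [[mean(Y[i][j][k] for i in range(t)) for k in range(c)] for j in range(r)]

# Conventional three-way ANOVA mean squares.
ms_t = r * c / (t - 1) * sum((y_i[i] - g) ** 2 for i in range(t))
ms_tr = c / ((t - 1) * (r - 1)) * sum(
    (y_ij[i][j] - y_i[i] - y_j[j] + g) ** 2 for i in range(t) for j in range(r))
ms_tc = r / ((t - 1) * (c - 1)) * sum(
    (y_ik[i][k] - y_i[i] - y_k[k] + g) ** 2 for i in range(t) for k in range(c))
ms_trc = sum(
    (Y[i][j][k] - y_ij[i][j] - y_ik[i][k] - y_jk[j][k]
     + y_i[i] + y_j[j] + y_k[k] - g) ** 2
    for i, j, k in product(range(t), range(r), range(c))
) / ((t - 1) * (r - 1) * (c - 1))

# mm-ANOVA mean squares computed from the marginal means theta[i][j] = Y_ij.
theta = y_ij
mm_ms_t = r / (t - 1) * sum((mean(theta[i]) - g) ** 2 for i in range(t))
mm_ms_tr = sum(
    (theta[i][j] - y_i[i] - y_j[j] + g) ** 2 for i in range(t) for j in range(r)
) / ((t - 1) * (r - 1))

u = (ms_tc - ms_trc) / c
f_conventional = ms_t / (ms_tr + ms_tc - ms_trc)  # equation (15)
f_mm = mm_ms_t / (mm_ms_tr + u)                   # equation (18)
print(f_conventional, f_mm)
```

The two printed values agree up to floating-point error, and the mm-ANOVA mean squares equal the conventional ones divided by `c`, as stated in (16) and (17).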

4.7. Step 2g: Express E (U) in terms of conventional ANOVA model variance components, and then in terms of mm-ANOVA model error covariance parameters using the relationships from step 1c

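From Table 4 we have $E[\mathrm{MS}(T*C)] = r\sigma_{TC}^2 + \sigma^2$ and $E[\mathrm{MS}(T*R*C)] = \sigma^2$. It follows that

$$E(U) = \frac{1}{c} E[\mathrm{MS}(T*C) - \mathrm{MS}(T*R*C)] = \frac{r}{c} \sigma_{TC}^2 \tag{19}$$

Using (13) and (14) we can write the right side of (19) in terms of the mm-ANOVA covariances: $\frac{r}{c} \sigma_{TC}^2 = r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$. Hence

$$E(U) = r(\mathrm{Cov}_2 - \mathrm{Cov}_3) \tag{20}$$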

4.8. Step 2h: Modify F (18) from step 2f to produce the mm-ANOVA statistic $F_{OR}^{*}$ by replacing U by E (U), expressed as a linear function of mm-ANOVA covariance parameters

Replacing $U$ in equation (18) by its expectation (20) results in

$$F_{OR}^{*} = \frac{\widetilde{\mathrm{MS}}(T)}{\widetilde{\mathrm{MS}}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)} \tag{21}$$

which is the OR test statistic $F_{OR}^{*}$ (4) when we treat the $\tilde{\theta}_{ij}$ as the OR model outcomes $\hat{\theta}_{ij}$. Because (18) is an ANOVA statistic, it follows that $F_{OR}^{*}$ (21) is also an ANOVA statistic.

4.9. Step 2i: Derive $F_{OR}$ by replacing covariance parameters in $F_{OR}^{*}$ by estimates that take into account the constraints from step 1d

An obvious estimate of $\mathrm{Cov}_2 - \mathrm{Cov}_3$ that takes into account covariance constraints (3) is given by $\max[(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]$, where $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ are estimates as discussed in Section 2.2. Replacing $\mathrm{Cov}_2 - \mathrm{Cov}_3$ in (21) by this estimate results in

$$F_{OR} = \frac{\widetilde{\mathrm{MS}}(T)}{\widetilde{\mathrm{MS}}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]} \tag{22}$$

which is the OR statistic $F_{OR}$ (5) when we replace the $\tilde{\theta}_{ij}$ by the OR model outcomes $\hat{\theta}_{ij}$.

4.10. Step 2j: Determine the approximate null distribution of $F_{OR}$

Null-distribution result

Write the denominator of $F_{OR}$ in the form

$$b \Bigl( \sum_{i=1}^{I} a_i \widetilde{\mathrm{MS}}_i + \hat{d} \Bigr) \tag{23}$$

where the $\widetilde{\mathrm{MS}}_i$, $i = 1, \ldots, I$, are mm-ANOVA mean squares, $\hat{d}$ is a function of the covariance parameter estimates, and the $a_i$ and $b$ are constants. Then $F_{OR}$ will have an approximate $F_{df_1, df_2}$ null distribution, where $df_1$ is the numerator degrees of freedom for the conventional ANOVA model test statistic in step 2d and $df_2$ is given by

$$df_2 = \frac{\bigl[ \sum_{i=1}^{I} a_i \widetilde{\mathrm{MS}}_i + \hat{d} \bigr]^2}{\sum_{i=1}^{I} \dfrac{[a_i \widetilde{\mathrm{MS}}_i]^2}{df(\widetilde{\mathrm{MS}}_i)}} \tag{24}$$

where $df(\widetilde{\mathrm{MS}}_i)$ is the degrees of freedom for $\widetilde{\mathrm{MS}}_i$, and hence also for $\mathrm{MS}_i$. I have stated this result generally so that it can be easily applied to other designs. See Appendix A for a derivation of this result.

To apply this result to the balanced test×reader×case factorial study design, note that the denominator of $F_{OR}$ (22) is given by (23) with $I = 1$, $a_1 = 1$, $b = 1$, $\widetilde{\mathrm{MS}}_1 = \widetilde{\mathrm{MS}}(T*R)$, and $\hat{d} = \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]$. Using (24), the null-distribution result states that $F_{OR}$ (22) has an approximate $F_{t-1, df_2}$ null distribution, where

$$df_2 = \frac{\{\widetilde{\mathrm{MS}}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\}^2}{[\widetilde{\mathrm{MS}}(T*R)]^2 / [(t-1)(r-1)]} \tag{25}$$

Note that the equation for $df_2$ (25), with $\tilde{\theta}_{ij}$ replaced by $\hat{\theta}_{ij}$, is the same as the equation for $ddf_H$ (6) for the OR model.
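Because (24) is stated for an arbitrary number of mean squares, it is convenient to code it once and reuse it across designs. The following is a minimal sketch with a hypothetical function name, `denominator_df`. It is applied to the factorial design using the Table 2 estimates, and then to a made-up positive value of $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ to show how a nonzero $\hat{d}$ changes the result.

```python
def denominator_df(a, ms, df, d_hat):
    """Denominator degrees of freedom df2 from equation (24).

    a, ms, df: equal-length lists holding the constants a_i, the mm-ANOVA
    mean squares, and their degrees of freedom; d_hat: the covariance term.
    """
    numerator = (sum(ai * mi for ai, mi in zip(a, ms)) + d_hat) ** 2
    denominator = sum((ai * mi) ** 2 / dfi for ai, mi, dfi in zip(a, ms, df))
    return numerator / denominator

t, r = 2, 4
ms_tr = 0.00046797                      # MS(T*R) from Table 2
cov2, cov3 = 0.0008438255, 0.0008871752  # estimates from Table 2

# Factorial design: I = 1, a_1 = 1, MS_1 = MS(T*R), d_hat = max[r(Cov2 - Cov3), 0].
d_hat = max(r * (cov2 - cov3), 0.0)
df2_table2 = denominator_df([1.0], [ms_tr], [(t - 1) * (r - 1)], d_hat)

# Hypothetical case with Cov2 - Cov3 = 0.0001 > 0, for comparison only.
df2_positive = denominator_df([1.0], [ms_tr], [(t - 1) * (r - 1)], r * 0.0001)

print(df2_table2, df2_positive)
```

With the Table 2 estimates the covariance term is truncated to zero, so `df2_table2` equals $(t-1)(r-1) = 3$; a positive covariance term can only increase $df_2$ above that value.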

4.11. Remark: Derivation of mm-ANOVA expected mean square and variance component expressions

For the mm-ANOVA model an expected mean square table, such as Table 1a, can be created as follows. Write the mm-ANOVA expected mean squares in terms of the conventional ANOVA variance components and fixed effects using the relationships given in steps 2c and 2e. For example, for the factorial model we have

$$E[\widetilde{\mathrm{MS}}(T)] = \frac{1}{c} E[\mathrm{MS}(T)] = \frac{1}{c} \Bigl[ \frac{rc}{t-1} \sum_{i=1}^{t} \tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2 \Bigr] \tag{26}$$

From step 1c it follows that the conventional ANOVA variance components in (26) involving case (i.e., the corresponding random effects are subscripted according to case) can be written in terms of the mm-ANOVA covariances: $\sigma_{TC}^2 = c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ and $\sigma^2 = c(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$. Replacing these variance components in (26) by their corresponding mm-ANOVA covariance expressions yields $E[\widetilde{\mathrm{MS}}(T)] = \frac{r}{t-1} \sum_{i=1}^{t} \tau_i^2 + \sigma_{TR}^2 + \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, the first line in Table 1a. Similarly, the other expressions in Table 1a can be derived. A table of mm-ANOVA variance component formulas, such as Table 1b, can then be created from the mm-ANOVA expected mean square table by solving for the variance components.
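For completeness, the same substitution can be written out for the remaining two rows of Table 1a. This worked sketch uses the Table 4 expected mean squares, the scaling from step 2e, and the Table 3 relationships $\sigma_{RC}^2 = c(\mathrm{Cov}_1 - \mathrm{Cov}_3)$ and $\sigma^2 = c(\sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$.

```latex
\begin{align*}
E[\widetilde{\mathrm{MS}}(R)]
  &= \frac{1}{c}\bigl[tc\,\sigma_R^2 + c\,\sigma_{TR}^2 + t\,\sigma_{RC}^2 + \sigma^2\bigr] \\
  &= t\sigma_R^2 + \sigma_{TR}^2 + t(\mathrm{Cov}_1 - \mathrm{Cov}_3)
     + \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3 \\
  &= t\sigma_R^2 + \sigma_{TR}^2 + \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_2
     + (t-1)(\mathrm{Cov}_1 - \mathrm{Cov}_3), \\[6pt]
E[\widetilde{\mathrm{MS}}(T*R)]
  &= \frac{1}{c}\bigl[c\,\sigma_{TR}^2 + \sigma^2\bigr]
   = \sigma_{TR}^2 + \sigma_{\tilde{\varepsilon}}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3 .
\end{align*}
```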

5. Mm-ANOVA algorithm summary and examples

In Sections 3–4 steps 1 and 2 of the mm-ANOVA algorithm were presented. These two steps illustrated the essence of the mm-ANOVA approach. Steps 3 and 4, which are presented later in Appendices B and C, extend this approach by showing how to derive confidence intervals and the non-null distribution of the test statistic.

Table 5 presents a succinct summary of the mm-ANOVA algorithm. This summary is intended to make it easy to use the algorithm to determine the properties of OR-type models corresponding to other study designs. Note that Table 5 shows the steps for deriving the confidence interval formula, not only for a linear combination of test accuracy parameters, but also for a single accuracy parameter. Table 6 illustrates the application of Table 5 to the typical test×reader×case study design previously discussed in Sections 3 and 4.

Table 5.

Algorithm for deriving mm-ANOVA formulas

  1. Derive the mm-ANOVA model
    (a) Define the conventional ANOVA model that corresponds to the study design as if each reader-performance measure were the mean of case-level outcomes. (Note: Since reader-performance measures are measures of discrimination between diseased and nondiseased cases, disease status should not be included as a factor.)
    (b) From the conventional ANOVA model defined in step 1a, derive the mm-ANOVA model by averaging across cases. Define the mm-ANOVA model error term as the mean, across cases, of the sum of the conventional ANOVA model error term and the random effects involving case.
    (c) Express the mm-ANOVA model error variance and covariances in terms of the conventional ANOVA model variance components.
    (d) Determine the mm-ANOVA model covariance constraints implied by step 1c.
  2. Derive the mm-ANOVA model test statistic and its null distribution for a hypothesis expressed in terms of test accuracies (i.e., expected reader-performance measures)
    (a) State the hypothesis of interest in terms of the mm-ANOVA model.
    (b) Express the hypotheses from step 2a in terms of the conventional ANOVA model.
    (c) Create the expected-mean-square table for the conventional ANOVA model.
    (d) Determine the conventional ANOVA F statistic corresponding to the step 2b hypotheses.
    (e) Express mm-ANOVA mean squares in terms of conventional ANOVA mean squares.
    (f) Express F from step 2d in terms of the mm-ANOVA model mean squares and U, where U is a linear function of conventional ANOVA model mean squares that involve case.
    (g) Express E(U) in terms of conventional ANOVA model variance components, and then in terms of mm-ANOVA model error covariance parameters using the relationships from step 1c.
    (h) Modify F from step 2f to produce the mm-ANOVA statistic $F_{OR}^*$ by replacing U by E(U), expressed as a linear function of mm-ANOVA covariance parameters.
    (i) Derive $F_{OR}$ by replacing covariance parameters in $F_{OR}^*$ by estimates that take into account the constraints from step 1d.
    (j) Determine the approximate null distribution of $F_{OR}$ in the following way: Write the denominator of $F_{OR}$ in the form $b\left(\sum_i a_i \widetilde{MS}_i + \hat{d}\right)$, where the $\widetilde{MS}_i$ are mm-ANOVA model mean squares, $\hat{d}$ is a function of the covariance parameter estimates, and the $a_i$ and $b$ are constants. Then $F_{OR}$ will have an approximate $F_{df_1, df_2}$ null distribution, where $df_1$ is the numerator degrees of freedom for the conventional ANOVA model test statistic in step 2d and $df_2$ is given by
      $$df_2 = \frac{\left[\sum_i a_i \widetilde{MS}_i + \hat{d}\right]^2}{\sum_i \left[a_i \widetilde{MS}_i\right]^2 / df(\widetilde{MS}_i)}$$
      where $df(\widetilde{MS}_i)$ is the degrees of freedom for $\widetilde{MS}_i$, and hence also for $MS_i$.
  3. Derive confidence intervals for a linear function $g(\theta)$ of test accuracy parameters.
    (a) Write the test accuracy parameter vector $\theta$ in terms of the mm-ANOVA model.
    (b) Write $\theta$ in terms of the conventional ANOVA model.
    (c) Determine the conventional ANOVA estimate for $\theta$, denoted by $\hat\theta$.
    (d) Determine the variance V of $g(\hat\theta)$ in terms of conventional ANOVA parameters.
    (e) Write V from step 3d in the form $V = b\,E\left(\sum_i a_i MS_i\right)$ for constants $b$ and $a_i$.
    (f) Write V from step 3e in the form $V = \tilde{b}\,E\left(\sum_i \tilde{a}_i \widetilde{MS}_i + U\right)$, where $\tilde{b}$ and the $\tilde{a}_i$ are constants and U is a linear function of conventional ANOVA mean squares that involve case.
    (g) Express E(U) in terms of conventional ANOVA model variance components and then in terms of mm-ANOVA model error covariance parameters, using the relationships from step 1c; then rewrite V using this expression for E(U).
    (h) Derive the variance estimate $\hat{V}$ from V by replacing expected mean squares by mean squares and replacing covariances by estimates that take into account the constraints from step 1d.
    (i) Derive the degrees of freedom $df_2$ for $\hat{V}$ using the general formula for $df_2$ given in step 2j.
    (j) Write $\hat\theta$ from step 3c in terms of the mm-ANOVA model.
    (k) An approximate $(1 - \alpha)100\%$ confidence interval for $g(\theta)$ is given by $g(\hat\theta) \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V}$ is determined in step 3h, $df_2$ in step 3i, and $\hat\theta$ in step 3j.
  4. Derive the non-null distribution of $F_{OR}$ from step 2i
    (a) Compute the noncentrality parameter in terms of the conventional ANOVA model: $\lambda = \dfrac{df(MS_{num}) \cdot MS_{num}\big|_{Y = E(Y)}}{E(MS_{num} \mid H_0)}$, where $MS_{num}$ is the numerator mean square from the conventional ANOVA F statistic given in step 2d.
    (b) Express $\lambda$ in terms of mm-ANOVA parameters by replacing variance components involving case by mm-ANOVA covariances.
    (c) Determine the denominator degrees of freedom in terms of mm-ANOVA parameters using $df_2 = \dfrac{\left[\sum_i a_i E(\widetilde{MS}_i) + d\right]^2}{\sum_i \left[a_i E(\widetilde{MS}_i)\right]^2 / df(\widetilde{MS}_i)}$, where $b\left(\sum_i a_i \widetilde{MS}_i + d\right)$ is the denominator of $F_{OR}^*$ from step 2h.
    (d) The non-null distribution is given by $F_{df_1, df_2; \lambda}$, where $df_1 = df(MS_{num})$, $df_2$ is determined in step 4c, and $\lambda$ in step 4b.

Table 6.

Mm-ANOVA approach for typical test×reader×case factorial study design

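  1. Derive the mm-ANOVA model
    (a) Conventional ANOVA model: $Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + (\tau RC)_{ijk} + \varepsilon_{ijk}$, $i = 1, \ldots, t$; $j = 1, \ldots, r$; $k = 1, \ldots, c$, with variance components $\sigma_R^2, \sigma_C^2, \sigma_{TR}^2, \sigma_{TC}^2, \sigma_{RC}^2, \sigma_{TRC}^2$, and $\sigma_\varepsilon^2$ and constraint $\sum_{i=1}^{t}\tau_i = 0$. Define $\sigma^2 = \sigma_{TRC}^2 + \sigma_\varepsilon^2$.
    (b) Mm-ANOVA model (note: $\widetilde{Y}_{ij} = Y_{ij\bullet}$):
      $\widetilde{Y}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \tilde\varepsilon_{ij}$, where $\tilde\varepsilon_{ij} = C_\bullet + (\tau C)_{i\bullet} + (RC)_{j\bullet} + (\tau RC)_{ij\bullet} + \varepsilon_{ij\bullet}$ and $\sum_{i=1}^{t}\tau_i = 0$
    (c) Mm-ANOVA error variance and covariances expressed in terms of conventional ANOVA variance components: $\sigma_{\tilde\varepsilon}^2 = \frac{1}{c}(\sigma_C^2 + \sigma_{TC}^2 + \sigma_{RC}^2 + \sigma^2)$, $\mathrm{Cov}_1 \equiv \mathrm{cov}(\tilde\varepsilon_{ij}, \tilde\varepsilon_{i'j}) = \frac{1}{c}(\sigma_C^2 + \sigma_{RC}^2)$, $\mathrm{Cov}_2 \equiv \mathrm{cov}(\tilde\varepsilon_{ij}, \tilde\varepsilon_{ij'}) = \frac{1}{c}(\sigma_C^2 + \sigma_{TC}^2)$, $\mathrm{Cov}_3 \equiv \mathrm{cov}(\tilde\varepsilon_{ij}, \tilde\varepsilon_{i'j'}) = \frac{1}{c}\sigma_C^2$, where $i \ne i'$, $j \ne j'$
    (d) Covariance constraints: $\mathrm{Cov}_1 \ge \mathrm{Cov}_3$; $\mathrm{Cov}_2 \ge \mathrm{Cov}_3$; $\mathrm{Cov}_3 \ge 0$
  2. Derive the mm-ANOVA test statistic and its null distribution
    (a) Mm-ANOVA model hypothesis of equal test accuracies: $H_0: \theta_1 = \cdots = \theta_t$, where $\theta_i = E(\widetilde{Y}_{i\bullet})$
    (b) Conventional ANOVA model hypothesis: $\theta_i = E(Y_{i\bullet\bullet}) = \mu + \tau_i \Rightarrow H_0: \tau_1 = \cdots = \tau_t$
    (c) Conventional ANOVA expected mean squares

      | Mean square | Expected mean square |
      |---|---|
      | MS(T) | $\frac{rc}{t-1}\sum_{i=1}^{t}\tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$ |
      | MS(R) | $tc\sigma_R^2 + c\sigma_{TR}^2 + t\sigma_{RC}^2 + \sigma^2$ |
      | MS(C) | $tr\sigma_C^2 + r\sigma_{TC}^2 + t\sigma_{RC}^2 + \sigma^2$ |
      | MS(T*R) | $c\sigma_{TR}^2 + \sigma^2$ |
      | MS(T*C) | $r\sigma_{TC}^2 + \sigma^2$ |
      | MS(R*C) | $t\sigma_{RC}^2 + \sigma^2$ |
      | MS(T*R*C) | $\sigma^2 \equiv \sigma_{TRC}^2 + \sigma_\varepsilon^2$ |

    (d) Conventional ANOVA test statistic: $F = \dfrac{MS(T)}{MS(T*R) + MS(T*C) - MS(T*R*C)}$
    (e) $\widetilde{MS}(T) = \frac{1}{c}MS(T)$, $\widetilde{MS}(T*R) = \frac{1}{c}MS(T*R)$, $\widetilde{MS}(R) = \frac{1}{c}MS(R)$
    (f) $F = \dfrac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + U}$, where $U = \frac{1}{c}\{MS(T*C) - MS(T*R*C)\}$
    (g) $E\{MS(T*C)\} = r\sigma_{TC}^2 + \sigma^2$, $E\{MS(T*R*C)\} = \sigma^2 \Rightarrow E(U) = \frac{1}{c}(r\sigma_{TC}^2) = r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$
    (h) $F_{OR}^* = \dfrac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)}$
    (i) $F_{OR} = \dfrac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)}$
    (j) Under $H_0$, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2}$, where $df_2 = \dfrac{\left[\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right]^2}{[\widetilde{MS}(T*R)]^2 / [(t-1)(r-1)]}$
  3. Derive confidence intervals
    (a) Mm-ANOVA test accuracy parameters: $\theta = (\theta_1, \ldots, \theta_t)'$, with $\theta_i = E(\widetilde{Y}_{i\bullet})$, $i = 1, \ldots, t$
    (b) Corresponding conventional ANOVA parameters: $\theta_i = E(Y_{i\bullet\bullet}) = \mu + \tau_i$
    (c) Conventional ANOVA estimate: $\hat\theta_i = Y_{i\bullet\bullet}$

    CI for $l'\theta$ with $l = (l_1, \ldots, l_t)'$, $\sum_{i=1}^{t} l_i = 0$:
    (d) $l'\hat\theta = \sum_{i=1}^{t} l_i\hat\theta_i = \sum_{i=1}^{t} l_i Y_{i\bullet\bullet} = \sum_{i=1}^{t} l_i\tau_i + \sum_{i=1}^{t} l_i\left[(\tau R)_{i\bullet} + (\tau C)_{i\bullet} + (\tau RC)_{i\bullet\bullet} + \varepsilon_{i\bullet\bullet}\right] \Rightarrow V = \sum_{i=1}^{t} l_i^2\left[\frac{\sigma_{TR}^2}{r} + \frac{\sigma_{TC}^2}{c} + \frac{\sigma^2}{rc}\right] = \frac{1}{rc}\sum_{i=1}^{t} l_i^2\left[c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2\right]$
    (e) $V = \frac{1}{rc}\sum_{i=1}^{t} l_i^2\, E\left[MS(T*R) + MS(T*C) - MS(T*R*C)\right]$
    (f) $V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\, E\left[\widetilde{MS}(T*R) + U\right]$, where $U = \frac{1}{c}\{MS(T*C) - MS(T*R*C)\}$
    (g) $E(U) = \frac{r\sigma_{TC}^2}{c} = r(\mathrm{Cov}_2 - \mathrm{Cov}_3) \Rightarrow V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}$
    (h) $\hat{V} = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\right\}$
    (i) $df_2 = \dfrac{\left[\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right]^2}{[\widetilde{MS}(T*R)]^2 / [(t-1)(r-1)]}$ (same as $df_2$ in step 2j)
    (j) $\hat\theta_i = \widetilde{Y}_{i\bullet}$
    (k) CI: $\sum_{i=1}^{t} l_i\widetilde{Y}_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\right\}}$

    CI for $\theta_i$:
    (d) $\hat\theta_i = Y_{i\bullet\bullet} = \mu + \tau_i + R_\bullet + C_\bullet + (\tau R)_{i\bullet} + (\tau C)_{i\bullet} + (RC)_{\bullet\bullet} + (\tau RC)_{i\bullet\bullet} + \varepsilon_{i\bullet\bullet} \Rightarrow V = \frac{\sigma_R^2}{r} + \frac{\sigma_C^2}{c} + \frac{\sigma_{TR}^2}{r} + \frac{\sigma_{TC}^2}{c} + \frac{\sigma_{RC}^2}{rc} + \frac{\sigma^2}{rc} = \frac{1}{rc}\left(c\sigma_R^2 + r\sigma_C^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma_{RC}^2 + \sigma^2\right)$
    (e) $V = \frac{1}{trc}E\left[MS(R) + (t-1)MS(T*R) + MS(C) - MS(R*C) + (t-1)MS(T*C) - (t-1)MS(T*R*C)\right]$
    (f) $V = \frac{1}{tr}E\left[\widetilde{MS}(R) + (t-1)\widetilde{MS}(T*R) + U\right]$, where $U = \frac{1}{c}\{MS(C) - MS(R*C) + (t-1)MS(T*C) - (t-1)MS(T*R*C)\}$
    (g) $E(U) = \frac{tr}{c}(\sigma_C^2 + \sigma_{TC}^2) = tr\,\mathrm{Cov}_2 \Rightarrow V = \frac{1}{tr}\left\{E[\widetilde{MS}(R) + (t-1)\widetilde{MS}(T*R)] + tr\,\mathrm{Cov}_2\right\}$
    (h) $\hat{V} = \frac{1}{tr}\left[\widetilde{MS}(R) + (t-1)\widetilde{MS}(T*R) + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]$
    (i) $df_2 = \dfrac{\left[\widetilde{MS}(R) + (t-1)\widetilde{MS}(T*R) + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]^2}{\dfrac{[\widetilde{MS}(R)]^2}{r-1} + \dfrac{[(t-1)\widetilde{MS}(T*R)]^2}{(t-1)(r-1)}}$
    (j) $\hat\theta_i = \widetilde{Y}_{i\bullet}$
    (k) CI: $\widetilde{Y}_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\frac{1}{tr}\left[\widetilde{MS}(R) + (t-1)\widetilde{MS}(T*R) + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]}$
  4. Derive the non-null distribution $F_{df_1, df_2; \lambda}$ of the step-2 F statistic
    (a) Step 2d F numerator: $MS_{num} = MS(T)$, $E[MS(T)] = \frac{rc}{t-1}\sum_{i=1}^{t}\tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$, $df(MS(T)) = t - 1$, $E(Y_{ijk}) = \mu + \tau_i \Rightarrow \lambda = \dfrac{df(MS_{num}) \cdot MS_{num}\big|_{Y = E(Y)}}{E(MS_{num} \mid H_0)} = \dfrac{rc\sum_{i=1}^{t}\tau_i^2}{c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2}$
    (b) $r\sigma_{TC}^2 + \sigma^2 = c\left[\sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right] \Rightarrow \lambda = \dfrac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)}$
    (c) Step 2h $F_{OR}^*$ denominator $= \widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, $E(\widetilde{MS}(T*R)) = \frac{1}{c}E(MS(T*R)) = \frac{1}{c}(c\sigma_{TR}^2 + \sigma^2) = \sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3 \Rightarrow df_2 = \dfrac{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]^2}{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\right]^2 / [(t-1)(r-1)]}$
    (d) $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2; \lambda}$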
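Steps 2e, 2i, and 2j of Table 6 can be sketched in Python as follows. The function names and the reader-performance values are invented for illustration and are not from the paper. The mean squares are computed directly from the t×r table of reader-performance estimates, so they correspond to the mm-ANOVA mean squares (conventional mean squares divided by c). The covariance estimates are taken as given inputs, since how they are estimated is outside this sketch.

```python
from statistics import fmean


def factorial_mean_squares(theta):
    """Return (MS(T), MS(R), MS(T*R)) for a t x r table theta[i][j]."""
    t, r = len(theta), len(theta[0])
    test_means = [fmean(row) for row in theta]
    reader_means = [fmean(theta[i][j] for i in range(t)) for j in range(r)]
    grand = fmean(test_means)
    ms_t = r * sum((m - grand) ** 2 for m in test_means) / (t - 1)
    ms_r = t * sum((m - grand) ** 2 for m in reader_means) / (r - 1)
    ss_tr = sum((theta[i][j] - test_means[i] - reader_means[j] + grand) ** 2
                for i in range(t) for j in range(r))
    return ms_t, ms_r, ss_tr / ((t - 1) * (r - 1))


def f_or_factorial(theta, cov2_hat, cov3_hat):
    """F_OR (step 2i) and its denominator df (step 2j) for the factorial design."""
    t, r = len(theta), len(theta[0])
    ms_t, _, ms_tr = factorial_mean_squares(theta)
    d_hat = r * max(cov2_hat - cov3_hat, 0.0)   # constraint Cov2 >= Cov3
    f_or = ms_t / (ms_tr + d_hat)
    df2 = (ms_tr + d_hat) ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))
    return f_or, df2


# Invented reader-performance estimates: 2 tests (rows) by 4 readers (columns).
theta = [[0.90, 0.85, 0.88, 0.91],
         [0.86, 0.84, 0.87, 0.86]]
f_or, df2 = f_or_factorial(theta, cov2_hat=0.0004, cov3_hat=0.0002)
print(f"F_OR = {f_or:.3f}, df2 = {df2:.1f}")
```

When the estimate of Cov2 − Cov3 is zero or negative, the truncation at zero makes $df_2$ collapse to $(t-1)(r-1)$, the usual test-by-reader interaction degrees of freedom.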

Using the algorithm in Table 5, I derive results for several other study designs and summarize these results in the remainder of this section. For each study design the corresponding algorithm results, in a format similar to Table 6, are presented in the referenced supplementary tables that are available in the online version of this article. Note that in the summaries below the reader performance measure is denoted by $\hat\theta_{ij}$ instead of the marginal mean $\widetilde{Y}_{ij} = Y_{ij\bullet}$ to make it clear that, although these are mm-ANOVA models, the outcome is not restricted to a marginal mean but can be any reader-performance measure. In addition, I omit the tilde symbol over the mean squares and error term since it is clear that they are for the mm-ANOVA model rather than the corresponding conventional ANOVA model. Standard nesting notation is used; e.g., subscript (i)j denotes that the factor indexed by j is nested within the factor indexed by i, and MS[R(T)] is the mean square for reader nested within test.

5.1. Example 1: Reader×case study design (one test)

In this study design there is only one test and each reader reads each case. Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S1. The derivation begins with a conventional reader×case study-design ANOVA model that treats reader and case as random factors and includes their interaction. Averaging across cases produces the corresponding mm-ANOVA model: a one-way ANOVA model with reader as its only factor.

This mm-ANOVA model is given by $\hat\theta_j = \mu + R_j + \varepsilon_j$, $j = 1, \ldots, r$, where r is the number of readers. The $R_j$ are mutually independent and normally distributed with zero mean and variance $\sigma_R^2$; the $\varepsilon_j$ are normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are independent of the $R_j$; and $\mathrm{Cov}_2 \equiv \mathrm{Cov}(\varepsilon_j, \varepsilon_{j'}) \ge 0$, $j \ne j'$. Thus reader is a random factor and the covariance between error terms is assumed constant. Because there is only one test, only the formula for computing a confidence interval for the single test accuracy is presented.

An approximate $(1 - \alpha)100\%$ confidence interval for a single test accuracy, $\theta = E(\hat\theta_j)$, is given by $\hat\theta_\bullet \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}\left[MS(R) + r\max(\widehat{\mathrm{Cov}}_2, 0)\right]$, $df_2 = \dfrac{\left[MS(R) + r\max(\widehat{\mathrm{Cov}}_2, 0)\right]^2}{[MS(R)]^2 / (r-1)}$, and $MS(R) = \frac{1}{r-1}\sum_{j=1}^{r}(\hat\theta_j - \hat\theta_\bullet)^2$. A hypothesis test for the single test accuracy can be based on this confidence interval. Although Hillis [6] discusses this single-test confidence interval formula, he does not provide a derivation of the result.

This confidence interval result can also be used with the test×reader×case study design to yield single test confidence intervals, each based only on data for the corresponding test, as was illustrated in the analysis of the example data in Section 2.3. Because properties of this confidence interval do not depend on assumptions about the variance components and covariances corresponding to the other tests, we expect these single-test confidence intervals to be more robust than those where the standard error is based on all of the data.

5.2. Example 2: Reader-nested-within-test study design

In this study design readers read images from only one test; i.e., readers are nested within test. This study design is natural when readers are trained to read under only one of the tests. The study design is balanced with an equal number of readers reading all cases using each test. Thus reader is nested within test and is crossed with case. Obuchowski [12] discusses this design and refers to this as a paired-case, unpaired-reader design. This can be viewed as a split-plot design with readers being the “whole plots,” case the split-plot (or within-plot) factor, and test the whole-plot (or between-plot) factor. This design is schematically illustrated in Table 7a.

Table 7.

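Split-plot design layouts. For nested factors, the level of the nesting factor is given in parentheses; e.g., reader (t)1 in (a) denotes reader 1 nested within test t.

a) Reader nested within test. $Y_{ijk}$ = rating for test i from reader j for case k, with each reader reading cases 1, …, c and readers nested in test i; i = 1, …, t, j = 1, …, r, k = 1, …, c.

| test | reader | case 1 | ⋯ | case c |
|---|---|---|---|---|
| 1 | (1)1 | $Y_{111}$ | ⋯ | $Y_{11c}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)r | $Y_{1r1}$ | ⋯ | $Y_{1rc}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| t | (t)1 | $Y_{t11}$ | ⋯ | $Y_{t1c}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| t | (t)r | $Y_{tr1}$ | ⋯ | $Y_{trc}$ |

b) Case nested within test. $Y_{ijk}$ = rating for test i from reader j for case k, with cases nested in test i; i = 1, …, t, j = 1, …, r, k = 1, …, c.

| test | case | reader 1 | ⋯ | reader r |
|---|---|---|---|---|
| 1 | (1)1 | $Y_{111}$ | ⋯ | $Y_{1r1}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)c | $Y_{11c}$ | ⋯ | $Y_{1rc}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| t | (t)1 | $Y_{t11}$ | ⋯ | $Y_{tr1}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| t | (t)c | $Y_{t1c}$ | ⋯ | $Y_{trc}$ |

c) Case nested within reader. $Y_{ijk}$ = rating for test i from reader j for case k, with cases nested in reader j; i = 1, …, t, j = 1, …, r, k = 1, …, c.

| reader | case | test 1 | ⋯ | test t |
|---|---|---|---|---|
| 1 | (1)1 | $Y_{111}$ | ⋯ | $Y_{t11}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)c | $Y_{11c}$ | ⋯ | $Y_{t1c}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| r | (r)1 | $Y_{1r1}$ | ⋯ | $Y_{tr1}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ |
| r | (r)c | $Y_{1rc}$ | ⋯ | $Y_{trc}$ |

d) Reader and case crossed and nested within group. $Y_{hijk}$ = rating assigned by the jth reader in group h to the kth case in group h using test i; h = 1, …, g, i = 1, …, t, j = 1, …, r, k = 1, …, c. Each reader and case is included in only one group.

| group | reader | case | test 1 | ⋯ | test t |
|---|---|---|---|---|---|
| 1 | (1)1 | (1)1 | $Y_{1111}$ | ⋯ | $Y_{1t11}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)1 | (1)c | $Y_{111c}$ | ⋯ | $Y_{1t1c}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)r | (1)1 | $Y_{11r1}$ | ⋯ | $Y_{1tr1}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| 1 | (1)r | (1)c | $Y_{11rc}$ | ⋯ | $Y_{1trc}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| g | (g)1 | (g)1 | $Y_{g111}$ | ⋯ | $Y_{gt11}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| g | (g)1 | (g)c | $Y_{g11c}$ | ⋯ | $Y_{gt1c}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| g | (g)r | (g)1 | $Y_{g1r1}$ | ⋯ | $Y_{gtr1}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| g | (g)r | (g)c | $Y_{g1rc}$ | ⋯ | $Y_{gtrc}$ |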

Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S2. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design (i.e., with reader nested within test and crossed with case) that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model: a reader-nested-within-test ANOVA model with reader as a random factor.

The mm-ANOVA model is given by $\hat\theta_{ij} = \mu + \tau_i + R_{(i)j} + \varepsilon_{ij}$, $i = 1, \ldots, t$, $j = 1, \ldots, r$, where t is the number of tests, r is the number of readers per test, $\tau_i$ denotes the fixed effect of test i, and $\sum_{i=1}^{t}\tau_i = 0$. The reader effects, the $R_{(i)j}$, are mutually independent and normally distributed with zero mean and variance $\sigma_{R(T)}^2$, where "R(T)" is read "reader nested within test". The $\varepsilon_{ij}$ are normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are independent of the $R_{(i)j}$.

There are two error covariances: $\mathrm{Cov}_2 \equiv \mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{ij'})$ with $j \ne j'$, the covariance between errors for the same test and different readers, and $\mathrm{Cov}_3 \equiv \mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j'})$ with $i \ne i'$, the covariance between errors for different tests and different readers. They satisfy $\mathrm{Cov}_2 \ge \mathrm{Cov}_3 \ge 0$. Note that the definition of $\mathrm{Cov}_3$ does not require $j \ne j'$ because $i \ne i'$ implies different readers. There is no $\mathrm{Cov}_1$ parameter because the design does not allow for one reader reading under two tests.

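Let $\theta_i \equiv E(\hat\theta_{i\bullet})$ denote the expected reader performance measure for test i. The test statistic for the null hypothesis of equal test accuracies ($H_0: \theta_1 = \cdots = \theta_t$) is

$$F_{OR} = \frac{MS(T)}{MS[R(T)] + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)}$$

where MS(T) is defined as for the factorial model and $MS[R(T)] = \frac{1}{t(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}(\hat\theta_{ij} - \hat\theta_{i\bullet})^2$. Under $H_0$, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2}$, where

$$df_2 = \frac{\left[MS[R(T)] + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right]^2}{\{MS[R(T)]\}^2 / [t(r-1)]} \tag{27}$$

More generally, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2; \lambda}$, where $\lambda = \dfrac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{R(T)}^2 + \sigma_\varepsilon^2 + (r-1)\mathrm{Cov}_2 - r\,\mathrm{Cov}_3}$ and $df_2 = \dfrac{\left[\sigma_{R(T)}^2 + \sigma_\varepsilon^2 + (r-1)\mathrm{Cov}_2 - r\,\mathrm{Cov}_3\right]^2}{\left(\sigma_{R(T)}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_2\right)^2 / [t(r-1)]}$.

An approximate $(1 - \alpha)100\%$ confidence interval for contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}\left(\sum_{i=1}^{t} l_i^2\right)\left\{MS[R(T)] + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right\}$ and $df_2$ is given by (27). An approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$ is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}\left\{MS[R(T)] + \max(r\widehat{\mathrm{Cov}}_2, 0)\right\}$ and $df_2 = \dfrac{\left\{MS[R(T)] + \max(r\widehat{\mathrm{Cov}}_2, 0)\right\}^2}{\{MS[R(T)]\}^2 / [t(r-1)]}$. Alternatively, an approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2^{(i)}}\sqrt{\hat{V}^{(i)}}$, where $\hat{V}^{(i)} = \frac{1}{r}\left[MS(R)^{(i)} + r\max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)\right]$ and $df_2^{(i)} = \dfrac{\left[MS(R)^{(i)} + r\max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)\right]^2}{[MS(R)^{(i)}]^2 / (r-1)}$, where $MS(R)^{(i)}$ and $\widehat{\mathrm{Cov}}_2^{(i)}$ are computed only from test i data; note that this is the result from Section 5.1.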

5.3. Example 3: Case-nested-within-test split-plot study design

In this study design each case is imaged under only one test, with the same number of cases imaged for each test. Each reader interprets all of the images from each test. This is often called a paired-reader, unpaired-case design. Obuchowski [12] notes that this design is needed when the diagnostic tests are mutually exclusive, e.g., if they are invasive, administer a high radiation dose, or carry a risk of contrast reactions. This can be viewed as a split-plot design with cases being the whole plots, reader the split-plot factor, and test the whole-plot factor. This design is schematically illustrated in Table 7b.

Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S3. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model, which is the same as the factorial mm-ANOVA model but with Cov1 and Cov3 constrained to zero; i.e., the model is defined by equation (1) and constraints Cov2 ≥ 0, Cov1 = Cov3 = 0. It follows that hypotheses-test, confidence-interval and sample-size formulas can be derived from those for the factorial model by setting Cov1 =Cov3 = 0.

Thus the test statistic for the null hypothesis of equal test accuracies is

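$$F_{OR} = \frac{MS(T)}{MS(T*R) + \max(r\widehat{\mathrm{Cov}}_2, 0)}$$

Under $H_0$, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2}$, where

$$df_2 = \frac{\left\{MS(T*R) + \max(r\widehat{\mathrm{Cov}}_2, 0)\right\}^2}{[MS(T*R)]^2 / [(t-1)(r-1)]} \tag{28}$$

More generally, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2; \lambda}$, where $\lambda = \dfrac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_\varepsilon^2 + (r-1)\mathrm{Cov}_2}$ and $df_2 = \dfrac{\left[\sigma_{TR}^2 + \sigma_\varepsilon^2 + (r-1)\mathrm{Cov}_2\right]^2}{\left[\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_2\right]^2 / [(t-1)(r-1)]}$.

Letting $\theta_i$ denote $E(\hat\theta_{i\bullet})$, an approximate $(1 - \alpha)100\%$ confidence interval for contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $df_2$ is given by (28) and $\hat{V} = \frac{1}{r}\left(\sum_{i=1}^{t} l_i^2\right)\left\{MS(T*R) + \max[r\widehat{\mathrm{Cov}}_2, 0]\right\}$. An approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$ is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{tr}\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]$ and $df_2 = \dfrac{\left[MS(R) + (t-1)MS(T*R) + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]^2}{[MS(R)]^2/(r-1) + \{(t-1)MS(T*R)\}^2/[(t-1)(r-1)]}$. Alternatively, an approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2^{(i)}}\sqrt{\hat{V}^{(i)}}$, where $\hat{V}^{(i)} = \frac{1}{r}\left[MS(R)^{(i)} + r\max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)\right]$ and $df_2^{(i)} = \dfrac{\left[MS(R)^{(i)} + r\max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)\right]^2}{[MS(R)^{(i)}]^2 / (r-1)}$, where $MS(R)^{(i)}$ and $\widehat{\mathrm{Cov}}_2^{(i)}$ are computed only from test i data. Note that these single-test confidence-interval formulas are the same as those for the factorial design.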

5.3.1. Real-data example

Using the Kundel et al [9] data that were discussed in Section 2.3, I now compare soft-copy computed radiographs with screen-film radiographs. The images are from different patients for each type of radiograph, with 95 images in each group (soft-copy computed radiograph: 66 nondiseased, 29 diseased; screen-film radiograph: 68 nondiseased, 27 diseased). Because the images for each method are from different patients, this is an example of a case-nested-within-test study design. The analysis of this study using empirical AUC estimates and jackknife covariance estimates is displayed in Table 8. The AUCs for soft-copy and screen-film images, averaged across the four readers, are 0.804 and 0.829, respectively. The test for the null hypothesis of no AUC difference between soft-copy and screen-film is not significant (FOR = 0.31, df2 = 164.4, p = 0.58); the 95% confidence interval for the difference of the population AUCs (screen-film minus soft-copy) is (−0.064, 0.114). Part (h) gives 95% confidence intervals for the single-test AUCs based only on data for the specific test.

Table 8.

Obuchowski-Rockette split-plot (cases nested within test) analysis of Kundel et al [9] data for soft-copy computed radiographs and screen-film radiographs using trapezoid AUC estimation and jackknife covariance estimation for t = 2 tests, r = 4 readers. The images were from different patients for each type of radiograph, with 95 images in each group (soft-copy computed radiograph: 66 nondiseased, 29 diseased; screen-film radiograph: 68 nondiseased, 27 diseased).

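  (a) Trapezoid AUCs:

    | Reader (j) | Test 1 (Soft-copy computed radiograph), $\hat\theta_{1j}$ | Test 2 (Screen-film), $\hat\theta_{2j}$ |
    |---|---|---|
    | 1 | 0.815 | 0.818 |
    | 2 | 0.767 | 0.836 |
    | 3 | 0.831 | 0.828 |
    | 4 | 0.803 | 0.834 |
    | Mean | $\hat\theta_{1\bullet}$ = .804 | $\hat\theta_{2\bullet}$ = .829 |

  (b) ANOVA table:

    | Source | df | Sum of squares | Mean square |
    |---|---|---|---|
    | T | 1 | 0.00125969 | 0.00125969 |
    | R | 3 | 0.00076530 | 0.00025510 |
    | T*R | 3 | 0.00164974 | 0.00054991 |

  (c) Fixed-reader covariance estimates computed from jackknife covariance matrix: $\hat\sigma_\varepsilon^2 = 0.0023651313$, $\widehat{\mathrm{Cov}}_2 = 0.0008800774$

  (d) $F_{OR} = \dfrac{MS(T)}{MS(T*R) + \max(r\widehat{\mathrm{Cov}}_2, 0)} = 0.31$

  (e) Denominator degrees of freedom: $df_2 = \dfrac{\left[MS(T*R) + \max(r\widehat{\mathrm{Cov}}_2, 0)\right]^2}{[MS(T*R)]^2 / [(t-1)(r-1)]} = 164.4$

  (f) P-value for $H_0: \theta_1 = \theta_2$: $p = \Pr(F_{t-1, df_2} > F_{OR}) = 0.579$

  (g) 95% CI for $\theta_2 - \theta_1$: $\hat\theta_{2\bullet} - \hat\theta_{1\bullet} \pm t_{0.025; df_2}\sqrt{\frac{2}{r}\left\{MS(T*R) + r\max(\widehat{\mathrm{Cov}}_2, 0)\right\}} = (-0.064, 0.114)$

  (h) Single test 95% confidence intervals using only corresponding data. Note: $StdErr^{(i)} = \sqrt{\frac{1}{r}\left\{MS(R)^{(i)} + r\max(\widehat{\mathrm{Cov}}_2^{(i)}, 0)\right\}}$

    | i | $\hat\theta_{i\bullet}$ | $\widehat{\mathrm{Cov}}_2^{(i)}$ | $MS(R)^{(i)}$ | $StdErr^{(i)}$ | $df_2^{(i)}$ | 95% CI |
    |---|---|---|---|---|---|---|
    | 1 (Soft-copy) | 0.804 | 0.000880 | 0.000735 | 0.0326 | 100.4 | 0.739, 0.867 |
    | 2 (Screen-film) | 0.829 | 0.000881 | 0.000070 | 0.0300 | 7997.2 | 0.770, 0.888 |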
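As a check on the arithmetic in Table 8, the following sketch recomputes $F_{OR}$, $df_2$, and the standard errors from the tabled mean squares and covariance estimates. Variable and function names are illustrative assumptions. The p-value and the t quantile are not recomputed here because they would need an F and t distribution routine from a statistics library, which this standard-library sketch avoids.

```python
import math

t, r = 2, 4                       # tests and readers (Table 8)
ms_t, ms_tr = 0.00125969, 0.00054991
cov2_hat = 0.0008800774

denom = ms_tr + max(r * cov2_hat, 0.0)
f_or = ms_t / denom                                      # part (d)
df2 = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))    # part (e)

# Part (g): standard error of theta_2 - theta_1, contrast l = (-1, 1),
# so the sum of squared contrast coefficients is 2.
se_diff = math.sqrt(2.0 / r * denom)


def single_test_se(ms_r_i: float, cov2_i: float, r: int) -> float:
    """Part (h): standard error using only the data for one test."""
    return math.sqrt((ms_r_i + r * max(cov2_i, 0.0)) / r)


se1 = single_test_se(0.000735, 0.000880, r)   # soft-copy
se2 = single_test_se(0.000070, 0.000881, r)   # screen-film

print(f"F_OR = {f_or:.2f}, df2 = {df2:.1f}, SE(diff) = {se_diff:.4f}")
print(f"StdErr(1) = {se1:.4f}, StdErr(2) = {se2:.4f}")
```

The recomputed $F_{OR}$, $df_2$, and single-test standard errors agree with parts (d), (e), and (h) of the table to the reported precision.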

5.4. Example 4: Case-nested-within-reader split-plot study design

In this study design each reader interprets a different set of cases using all of the diagnostic tests. The study design is balanced with each reader reading the same number of cases under each test. This can be viewed as a split-plot design with cases being the whole plots, reader the whole-plot factor, and test the split-plot factor. Obuchowski [12] refers to this as a hybrid design. The advantage of this design is that for equivalent power each reader must interpret fewer cases than for the factorial design, but the disadvantage is that the total number of cases is higher [13]. Thus this design is appropriate when a large number of verified cases are available and reading time per reader is limited or relatively expensive. This design is schematically illustrated in Table 7c.

Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S4. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design that treats reader and case as random factors and includes all possible interactions. Averaging across cases produces the corresponding mm-ANOVA model, which is the same as the factorial model except with Cov2 and Cov3 constrained to zero; i.e., the model is defined by (1) and constraints: Cov1 ≥ 0, Cov2 = Cov3 = 0. Because this model is the same as the factorial model with Cov2 and Cov3 constrained to zero, hypotheses-test, confidence-interval, and sample-size formulas can be derived from those for the factorial model by setting Cov2 =Cov3 = 0.

Thus the test statistic for the null hypothesis of equal test accuracies is

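$$F_{OR} = \frac{MS(T)}{MS(T*R)}$$

Under $H_0$, $F_{OR} \sim F_{t-1, df_2}$, where

$$df_2 = (t-1)(r-1) \tag{29}$$

More generally, $F_{OR} \sim F_{t-1, df_2; \lambda}$, where $\lambda = \dfrac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1}$ and $df_2$ is given by (29).

Letting $\theta_i$ denote $E(\hat\theta_{i\bullet})$, an approximate $(1 - \alpha)100\%$ confidence interval for contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $df_2$ is given by (29) and $\hat{V} = \frac{1}{r}\left(\sum_{i=1}^{t} l_i^2\right)MS(T*R)$. An approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$ is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{tr}\left[MS(R) + (t-1)MS(T*R)\right]$ and $df_2 = \dfrac{\left[MS(R) + (t-1)MS(T*R)\right]^2}{[MS(R)]^2/(r-1) + [(t-1)MS(T*R)]^2/[(t-1)(r-1)]}$. Alternatively, an approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$, using a standard error computed only from data for test i, is given by $\hat\theta_{i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{r}MS(R)^{(i)}$ and $df_2 = r - 1$.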
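For the special case of t = 2 tests, the statistic $F_{OR} = MS(T)/MS(T*R)$ with $df_2 = r - 1$ equals the square of the paired t statistic computed on the reader-level differences. This is a standard property of two-way ANOVA without replication rather than a result stated in the paper. The sketch below illustrates it with invented reader-performance values.

```python
import math
from statistics import fmean, stdev

# Invented reader-performance estimates: 2 tests (rows) by 4 readers (columns).
theta = [[0.81, 0.77, 0.83, 0.80],
         [0.82, 0.84, 0.83, 0.85]]
t, r = len(theta), len(theta[0])

test_means = [fmean(row) for row in theta]
reader_means = [fmean(theta[i][j] for i in range(t)) for j in range(r)]
grand = fmean(test_means)

# Mean squares defined as for the factorial model.
ms_t = r * sum((m - grand) ** 2 for m in test_means) / (t - 1)
ms_tr = sum((theta[i][j] - test_means[i] - reader_means[j] + grand) ** 2
            for i in range(t) for j in range(r)) / ((t - 1) * (r - 1))

f_or = ms_t / ms_tr
df2 = (t - 1) * (r - 1)            # formula (29)

# Paired t statistic on the per-reader differences between the two tests.
diffs = [theta[1][j] - theta[0][j] for j in range(r)]
t_paired = fmean(diffs) / (stdev(diffs) / math.sqrt(r))

print(f_or, t_paired ** 2, df2)
```

The first two printed values coincide up to rounding, so for two tests the analysis of this design reduces to a paired t test across readers with r − 1 degrees of freedom.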

5.5. Example 5: Reader-and-case-crossed-and-nested-within-group split-plot study design

In this study design there are several groups (or blocks) of readers and cases such that (1) each reader and each case belongs to only one group and (2) within each group all readers read all cases under each test. I assume a balanced design where each group has the same number of readers and cases. Obuchowski [13] discusses this design and refers to it as a mixed design; I will refer to it as a mixed split-plot design. The motivation for this study design is to reduce the number of reader interpretations for each reader, compared to the factorial study, without requiring as many cases to be verified as the hybrid design. This design is schematically illustrated in Table 7d. Although not explicitly stated, Obuchowski [13] assumes that there is no group effect for this design; e.g., cases and readers are randomly assigned to the groups (personal communication, Nancy Obuchowski, 2012). In contrast, I allow for a group effect; e.g., readers are assigned to groups according to experience level. Obuchowski et al [14] provide a real-data example that shows how this design can be particularly useful for studying multiple imaging tests.

Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S5. The derivation begins with a conventional split-plot ANOVA model corresponding to the study design (reader and case crossed and nested within group) that treats reader and case as random factors and group and test as fixed factors. All possible interactions are included. Averaging across cases produces the corresponding mm-ANOVA model: a three-way ANOVA model with group, test, and reader as factors.

Let $\hat\theta_{hij}$ denote the reader-performance estimate for reader j under test i, with both belonging to group h. The mm-ANOVA model is given by $\hat\theta_{hij} = \mu + \gamma_h + \tau_i + (\gamma\tau)_{hi} + R_{(h)j} + (\tau R)_{(h)ij} + \varepsilon_{hij}$, $h = 1, \ldots, g$, $i = 1, \ldots, t$, $j = 1, \ldots, r$, where g is the number of groups, t is the number of tests, r is the number of readers per group, $\tau_i$ denotes the fixed effect of test i, $\gamma_h$ denotes the fixed effect of group h, and $(\gamma\tau)_{hi}$ denotes the fixed group-by-test interaction, with $\sum_{i=1}^{t}\tau_i = \sum_{h=1}^{g}\gamma_h = \sum_{h=1}^{g}(\gamma\tau)_{hi} = \sum_{i=1}^{t}(\gamma\tau)_{hi} = 0$. The $R_{(h)j}$ and $(\tau R)_{(h)ij}$ are random reader and test-by-reader effects, nested within group; they are mutually independent and normally distributed with zero means and respective variances $\sigma_{R(G)}^2$ and $\sigma_{TR(G)}^2$. The $\varepsilon_{hij}$ are normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are independent of the $R_{(h)j}$ and $(\tau R)_{(h)ij}$. In summary, the mm-ANOVA model contains fixed effects for group, test, and their interaction, and random effects for reader nested within group and the test-by-reader interaction nested within group.

$\mathrm{Cov}_1$, $\mathrm{Cov}_2$, and $\mathrm{Cov}_3$ are defined and constrained similarly to the corresponding covariances for the typical test×reader×case factorial design, with one difference: they are not defined between errors corresponding to different groups, because the covariance of those errors is zero. Specifically, $\mathrm{Cov}_1 \equiv \mathrm{Cov}(\varepsilon_{hij}, \varepsilon_{hi'j})$, $\mathrm{Cov}_2 \equiv \mathrm{Cov}(\varepsilon_{hij}, \varepsilon_{hij'})$, and $\mathrm{Cov}_3 \equiv \mathrm{Cov}(\varepsilon_{hij}, \varepsilon_{hi'j'})$, where $i \ne i'$, $j \ne j'$, and $\mathrm{Cov}_1 \ge \mathrm{Cov}_3$, $\mathrm{Cov}_2 \ge \mathrm{Cov}_3$, and $\mathrm{Cov}_3 \ge 0$.

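The null hypothesis of equal test accuracies is $H_0: \theta_1 = \cdots = \theta_t$, where $\theta_i = E(\hat\theta_{\bullet i\bullet})$. The corresponding test statistic is

$$F_{OR} = \frac{MS(T)}{MS[T*R(G)] + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]}$$

Under $H_0$, $F_{OR} \mathrel{\dot\sim} F_{t-1, df_2}$, where

$$df_2 = \frac{\left\{MS[T*R(G)] + \max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]\right\}^2}{\{MS[T*R(G)]\}^2 / [g(t-1)(r-1)]} \tag{30}$$

and MS[T*R(G)] denotes the mean square for test-by-reader interaction nested within group. More generally, $F_{OR}$ has an approximate $F_{t-1, df_2; \lambda}$ distribution, where $\lambda = \dfrac{gr\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR(G)}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)}$ and $df_2 = \dfrac{\left[\sigma_{TR(G)}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]^2}{\left[\sigma_{TR(G)}^2 + \sigma_\varepsilon^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\right]^2 / [g(t-1)(r-1)]}$.

An approximate $(1 - \alpha)100\%$ confidence interval for contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat\theta_{\bullet i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $df_2$ is given by (30) and $\hat{V} = \frac{1}{gr}\sum_{i=1}^{t} l_i^2\left\{MS[T*R(G)] + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right\}$; the mean square and the factor $1/(gr)$ used here are the ones consistent with the test statistic, (30), and the noncentrality parameter above. An approximate $(1 - \alpha)100\%$ confidence interval for $\theta_i$ is given by $\hat\theta_{\bullet i\bullet} \pm t_{\alpha/2; df_2}\sqrt{\hat{V}}$, where $\hat{V} = \frac{1}{gtr}\left[MS[R(G)] + (t-1)MS[T*R(G)] + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right]$ and $df_2 = \dfrac{\left\{MS[R(G)] + (t-1)MS[T*R(G)] + tr\max(\widehat{\mathrm{Cov}}_2, 0)\right\}^2}{\{MS[R(G)]\}^2/[g(r-1)] + \{(t-1)MS[T*R(G)]\}^2/[g(t-1)(r-1)]}$.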

5.6. Example 6: Replicated factorial study design

This study design is the same as the factorial study design except that each reader reads each case n times. Typically sessions corresponding to different readings are separated by a suitable period of time to reduce the probability that the reader will recognize cases from the earlier session. This study design has two advantages over the factorial design with one replication: it allows for estimation of within-reader reliability between two readings of the same cases, and it provides more power for the same number of cases and readers. This last aspect can be important if the number of available cases and readers is limited. In the example later in this section, I show how to estimate the gain in power based on pilot data.

Derivation of results using the mm-ANOVA algorithm is presented in Supplementary Table S6. The derivation begins with a conventional three-way replicated factorial ANOVA model with reader and case as random factors and test as a fixed factor. There are n replications. All possible interactions are included between reader, case and test. Averaging across cases for each replication produces the corresponding mm-ANOVA model: a two-way replicated factorial ANOVA model with test and reader as factors.

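Let $\hat\theta_{ijm}$ denote the reader-performance estimate for reader j under test i based on the mth reading of the data. The mm-ANOVA model is given by $\hat\theta_{ijm} = \mu + \tau_i + R_j + (\tau R)_{ij} + \varepsilon_{ijm}$, $i = 1, \ldots, t$, $j = 1, \ldots, r$, $m = 1, \ldots, n$, where t is the number of tests, r is the number of readers, n is the number of replications, $\tau_i$ denotes the fixed effect of test i, $R_j$ denotes the random effect of reader j, $(\tau R)_{ij}$ denotes the random test×reader interaction, $\varepsilon_{ijm}$ is the error term, and $\sum_{i=1}^{t}\tau_i = 0$. The $R_j$ and $(\tau R)_{ij}$ are assumed to be mutually independent and normally distributed with zero means and respective variances $\sigma_R^2$ and $\sigma_{TR}^2$. The $\varepsilon_{ijm}$ are assumed to be normally distributed with zero mean and variance $\sigma_\varepsilon^2$ and are assumed independent of the $R_j$ and $(\tau R)_{ij}$. The errors are equi-covariant with four possible covariances given by

$$\mathrm{Cov}(\varepsilon_{ijm}, \varepsilon_{i'j'm'}) = \begin{cases} \mathrm{Cov}_0 & i = i',\ j = j',\ m \ne m' \text{ (same test and reader, different replication)} \\ \mathrm{Cov}_1 & i \ne i',\ j = j' \text{ (different test, same reader)} \\ \mathrm{Cov}_2 & i = i',\ j \ne j' \text{ (same test, different reader)} \\ \mathrm{Cov}_3 & i \ne i',\ j \ne j' \text{ (different test, different reader)} \end{cases}$$

and subject to the following constraints:

$$\mathrm{Cov}_0 \ge \mathrm{Cov}_1 \ge \mathrm{Cov}_3; \quad \mathrm{Cov}_0 \ge \mathrm{Cov}_2 \ge \mathrm{Cov}_3; \quad \mathrm{Cov}_3 \ge 0$$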

Let θiE (θ̂i••) denote the expected reader performance measure for test i. The test statistic for the null hypothesis of equal test accuracies (H0 : θ1 =…= θt) is

FOR=MS(T)MS(T*R)+nrmax(Cov^2Cov^3,0)

where MS(T)=nrt1i=1t(θ^iθ^)2 and MS(T*R)=n(t1)(r1)i=1tj=1r(θ^ijθ^iθ^j+θ^)2. Under H0, FORFt−1,df2 where

df2=[MS(T*R)+nrmax(Cov^2Cov^3,0)]2{MS(T*R)}2/[(t1)(r1)] (31)

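More generally, $F_{OR} \,\dot\sim\, F_{t-1,df_2;\lambda}$, where

$$\lambda = \frac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_\varepsilon^2/n - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3) + [(n-1)/n]\,\mathrm{Cov}_0} \tag{32}$$

and

$$df_2 = \frac{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2/n - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3) + [(n-1)/n]\,\mathrm{Cov}_0\right]^2}{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2/n - \mathrm{Cov}_1 - (\mathrm{Cov}_2 - \mathrm{Cov}_3) + [(n-1)/n]\,\mathrm{Cov}_0\right]^2/[(t-1)(r-1)]} \tag{33}$$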

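An approximate (1 − α) 100% confidence interval for the contrast $\sum_{i=1}^{t} l_i\theta_i$ is given by $\sum_{i=1}^{t} l_i\hat\theta_{i\bullet\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where $\hat V = \frac{1}{nr}\left(\sum_{i=1}^{t} l_i^2\right)\{MS(T*R) + nr\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\}$ and $df_2$ is given by (31). An approximate (1 − α) 100% confidence interval for $\theta_i$ is given by $\hat\theta_{i\bullet\bullet} \pm t_{\alpha/2;df_2}\sqrt{\hat V}$, where $\hat V = \frac{1}{ntr}\{MS(R) + (t-1)MS(T*R) + \max(ntr\,\widehat{\mathrm{Cov}}_2, 0)\}$ and

$$df_2 = \frac{[MS(R) + (t-1)MS(T*R) + ntr\max(\widehat{\mathrm{Cov}}_2, 0)]^2}{[MS(R)]^2/(r-1) + [(t-1)MS(T*R)]^2/[(t-1)(r-1)]}.$$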

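Consider $\mathrm{Cov}_2 \equiv \mathrm{cov}(\hat\theta_{ijm}, \hat\theta_{ij'm'})$ where $j \ne j'$ and either $m = m'$ or $m \ne m'$. It follows that Cov2 can be computed from one set of replications ($m = m'$) or from different sets of replications ($m \ne m'$). For example, for test i and readers j and j′, with n = 2 we have $\mathrm{Cov}_2 = \mathrm{cov}(\hat\theta_{ij1}, \hat\theta_{ij'1}) = \mathrm{cov}(\hat\theta_{ij1}, \hat\theta_{ij'2}) = \mathrm{cov}(\hat\theta_{ij2}, \hat\theta_{ij'1}) = \mathrm{cov}(\hat\theta_{ij2}, \hat\theta_{ij'2})$. Thus an obvious estimate for Cov2 that utilizes all of the data is given by

$$\widehat{\mathrm{Cov}}_2 = \frac{2}{n^2 t r (r-1)}\sum_{i=1}^{t}\sum_{j<j'}\ \sum_{1 \le m \le n,\ 1 \le m' \le n}\widehat{\mathrm{cov}}(\hat\theta_{ijm}, \hat\theta_{ij'm'})$$

where $\widehat{\mathrm{cov}}(\hat\theta_{ijm}, \hat\theta_{ij'm'})$ is a fixed-reader covariance estimate, as discussed in Section 2.2. Similarly, estimates for Cov1 and Cov3 can be obtained by averaging fixed-reader covariance estimates, computed for each of the $n^2$ possible (m, m′) pairs of replications, across corresponding test-reader combinations. Obvious estimates for Cov0 and $\sigma_\varepsilon^2$ are $\widehat{\mathrm{Cov}}_0 = \frac{2}{n(n-1)tr}\sum_{i=1}^{t}\sum_{j=1}^{r}\sum_{m<m'}\widehat{\mathrm{cov}}(\hat\theta_{ijm}, \hat\theta_{ijm'})$ and $\hat\sigma_\varepsilon^2 = \frac{1}{ntr}\sum_{i=1}^{t}\sum_{j=1}^{r}\sum_{m=1}^{n}\widehat{\mathrm{var}}(\hat\theta_{ijm})$, where $\widehat{\mathrm{var}}(\hat\theta_{ijm}) = \widehat{\mathrm{cov}}(\hat\theta_{ijm}, \hat\theta_{ijm})$.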
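The averaging just described amounts to sorting every pair of (test, reader, replication) cells into one of five classes and taking the mean within each class. The sketch below illustrates this under stated assumptions: `cov_hat` is a hypothetical, symmetric callable that returns a fixed-reader covariance estimate for any two cells (how those estimates are produced is outside the sketch), and the function and key names are invented for the example.

```python
from itertools import product


def average_covariances(cov_hat, t, r, n):
    """Average fixed-reader (co)variance estimates into the OR parameters.

    cov_hat(a, b) is assumed to return the estimated covariance between
    cells a = (i, j, m) and b = (i2, j2, m2); it must be symmetric.
    """
    cells = list(product(range(t), range(r), range(n)))
    sums = {"var": 0.0, "cov0": 0.0, "cov1": 0.0, "cov2": 0.0, "cov3": 0.0}
    counts = dict.fromkeys(sums, 0)
    for idx, a in enumerate(cells):
        for b in cells[idx:]:          # each unordered pair once
            same_test, same_reader = a[0] == b[0], a[1] == b[1]
            if a == b:
                key = "var"            # error variance
            elif same_test and same_reader:
                key = "cov0"           # different replication only
            elif same_reader:
                key = "cov1"           # different test, same reader
            elif same_test:
                key = "cov2"           # same test, different reader
            else:
                key = "cov3"           # different test and reader
            sums[key] += cov_hat(a, b)
            counts[key] += 1
    # A class with no pairs (e.g. cov0 when n = 1) is reported as None.
    return {k: (sums[k] / counts[k] if counts[k] else None) for k in sums}
```

For Cov2 the loop visits t·r(r − 1)/2 reader pairs times n² replication pairs, so the class mean matches the normalizing constant in the displayed estimator; with n = 1 no Cov0 pairs exist, which mirrors the remark in Section 5.6.1 that Cov0 cannot be estimated without replicated readings.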

5.6.1. Real-data example

In Section 2.3 I compared AUCs for hard- and soft-copy computed radiography chest images. Both types of images were obtained for each patient and were read by each of the readers. Thus this was a factorial study design, which could be analyzed by the standard OR procedure. Although there was not a significant difference between the two types of images, the resulting confidence interval showed that an AUC difference as large as 0.086 was commensurate with the data. In such a situation the researcher might want to plan a similar experiment that is sized to have more power.

Increased power can be obtained by increasing the number of readers, the number of cases, or the number of replications. I now compute the number of cases needed to obtain .80 power to detect an AUC difference of .04 with alpha = .05. Because $F_{OR} \,\dot\sim\, F_{t-1,df_2;\lambda}$, power is approximated by $\Pr(F_{1,df_2;\lambda} > F_{.95;1,df_2})$ where λ and $df_2$ are defined by (32) and (33) and $F_{.95;1,df_2}$ is the 95th percentile of a central F distribution with degrees of freedom 1 and $df_2$.

For the power computations I use the following estimates, obtained from Section 2.3: $\hat\sigma_\varepsilon^2 = .0022034331$, $\widehat{\mathrm{Cov}}_1 = .0011163046$, $\widehat{\mathrm{Cov}}_2 = .0008438255$, $\widehat{\mathrm{Cov}}_3 = .0008871752$, and $\hat\sigma_{TR}^2 = 0$. An estimate of Cov0 is not available from the data because there are no replicated readings; however, the similarity of the two tests (hard- and soft-copy) suggests that the within-reader correlation between replications for the same test and reader, $\rho_0 = \mathrm{Cov}_0/\sigma_\varepsilon^2$, should be only slightly higher than the within-reader correlation based on one replication between two tests, given by $\hat\rho_1 = \widehat{\mathrm{Cov}}_1/\hat\sigma_\varepsilon^2 = 0.507$ from Table 2. Thus I set $\rho_0 = 0.60$ for the power computations; it follows that $\widehat{\mathrm{Cov}}_0 = .6\,\hat\sigma_\varepsilon^2 = 0.00132206$. Following Hillis et al [15] I assume that the covariances are inversely proportional to the number of cases c, and hence multiply $\hat\sigma_\varepsilon^2$, $\widehat{\mathrm{Cov}}_1$, $\widehat{\mathrm{Cov}}_2$ and $\widehat{\mathrm{Cov}}_3$ by the factor 95/c (recall that 95 is the number of cases for the example); the resulting values are used in place of $\sigma_\varepsilon^2$, Cov1, Cov2, and Cov3 in (32) and (33) when computing power for c cases.

The numbers of cases needed to achieve 0.80 power for combinations of 4–8 readers and 1–2 replications are presented in Table 9. For example, achieving 0.80 power with 8 readers and one replication requires 173 cases versus 103 cases with two replications. Thus if cases are expensive to obtain or validate and it is difficult to obtain more than 8 readers, then using two replications appears to be an attractive option.

Table 9.

Number of replications, readers, and cases needed to achieve .80 power to detect a .04 AUC difference between soft- and hard-copy radiographs using a factorial study design, based on estimates from the Kundel et al [9] data, an assumed within-reader within-replication correlation of 0.60, and alpha = .05.

replications (n) readers (r) cases (c) power
1 4 585 0.800
1 5 366 0.801
1 6 266 0.800
1 7 210 0.802
1 8 173 0.801
2 4 348 0.800
2 5 218 0.801
2 6 158 0.800
2 7 125 0.802
2 8 103 0.802
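The inputs to the power approximation, λ from (32) and df2 from (33), can be sketched as follows using the pilot estimates quoted above. Several choices here are assumptions rather than statements from the text: Cov0 is rescaled by 95/c along with the other covariances, the difference Cov2 − Cov3 is truncated at zero by default (mirroring the constraint used in the test statistic; the text does not say whether Table 9 was computed this way), and the function and argument names are invented for the example. Power itself would then be the upper-tail probability of a noncentral F distribution evaluated at the central-F critical value, which needs a statistics library and is left out of this sketch.

```python
def power_parameters(c, r, n, taus, sigma2_eps, cov1, cov2, cov3,
                     rho0=0.60, sigma2_tr=0.0, c_pilot=95, truncate=True):
    """Return (lambda, df2) per equations (32) and (33) for c cases."""
    scale = c_pilot / c                    # covariances assumed ~ 1/c
    var_eps = sigma2_eps * scale
    cov0 = rho0 * var_eps                  # assumption: Cov0 scales too
    diff23 = (cov2 - cov3) * scale
    if truncate:                           # assumption: enforce Cov2 >= Cov3
        diff23 = max(diff23, 0.0)
    t = len(taus)
    base = sigma2_tr + var_eps / n - cov1 * scale + (n - 1) / n * cov0
    lam = r * sum(tau ** 2 for tau in taus) / (base + (r - 1) * diff23)
    df2 = (base + (r - 1) * diff23) ** 2 / (
        (base - diff23) ** 2 / ((t - 1) * (r - 1))
    )
    return lam, df2


pilot = dict(sigma2_eps=0.0022034331, cov1=0.0011163046,
             cov2=0.0008438255, cov3=0.0008871752)
# An AUC difference of .04 between two tests: tau = (+.02, -.02).
lam_1rep, df2_1rep = power_parameters(c=173, r=8, n=1, taus=[0.02, -0.02], **pilot)
lam_2rep, df2_2rep = power_parameters(c=103, r=8, n=2, taus=[0.02, -0.02], **pilot)
```

Under these assumptions both settings give df2 = 7 and a noncentrality parameter close to 10.7, which is at least consistent with the two corresponding rows of Table 9 reporting nearly the same power. Setting `truncate=False` uses the raw negative difference instead and changes both quantities noticeably.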

6. Discussion

The mm-ANOVA approach allows for analysis of ROC and other reader-performance outcomes that result from any balanced study design that has reader and case as random factors and any number of fixed factors. In addition, by providing the non-null distribution of the test statistic it allows for sample size estimation for such studies and efficiency comparisons between different types of studies. Although steps were fully justified only for the test×reader×case factorial study design, justification can be similarly established for other designs. Until now researchers have been limited to using the test×reader×case study design with the OR method because analysis methods were not developed for other designs. This work allows researchers to choose designs that are most appropriate for their study. A SAS macro for fitting some of these designs using the mm-ANOVA approach is available on request from the author.

As noted in Section 2.4, Obuchowski and Rockette [1] derived their F statistic by modifying the F statistic described by Pavur and Nath [10]. Although Pavur and Nath [10] give results only for two-factor models, their approach, which is based on results given by Pavur and Lewis [19], could conceivably be applied to other correlated-error ANOVA models; as such it would provide an alternative to the approach described in this paper. However, the results of Pavur and Lewis do not extend beyond specifying the correct form for the F test when correlations are known; in particular, they do not indicate how to implement their approach when the correlations must be estimated, do not discuss derivation of confidence interval formulas for contrasts, give little motivation for the correlated error models, and do not discuss power computations.

Explicit formulas can be derived [20, 21, 22] for the variances of reader-performance outcomes that are U-statistics [23], such as reader empirical-AUC averages and their differences. Replacing parameters in these formulas by sample estimates yields variance estimates with excellent statistical properties. However, this approach is limited to U-statistic estimators, such as the empirical AUC and presently incorporates an adaptation of the OR degrees of freedom formula. Advantages include explicit variance formulas and applicability to a wide variety of multireader study designs, including unbalanced designs.

Another alternative approach for analyzing multireader data is the marginal model approach proposed by Song and Zhou [24] for empirical AUC estimates. An advantage of their approach is that case-specific covariates can be included; disadvantages include being limited to empirical AUC outcomes, based on large-sample inferences, and thus far developed only for the factorial model.

Limitations of the mm-ANOVA approach include the following: (1) It is presently limited to balanced study designs; i.e., the number of levels for each factor does not depend on any other factor. However, because case is treated as one factor it is possible to have different numbers of normal and abnormal cases. I am currently investigating models that are not balanced with regard to case. (2) It assumes that the number of cases is large enough so that covariance estimates can be treated like known values for computing the denominator degrees of freedom. (3) It assumes that the fixed-reader measurement errors, the εij, are normally distributed. This is a reasonable assumption when the number of cases is moderate because most typical reader-performance outcomes, such as AUC, have asymptotic normal distributions for a fixed reader. (4) It assumes that the latent reader-performance outcomes (i.e., Rj + (τR)ij) have a normal distribution. If these normal distribution assumptions do not appear to be reasonable, one possible remedy is to transform the outcome, e.g., using a logarithmic or logit transformation for AUC. (5) It assumes the errors have an equi-covariance structure. I am currently investigating the robustness of the mm-ANOVA approach to this assumption.

Supplementary Material

Web-based Supporting Materials

ACKNOWLEDGEMENTS

This research was supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), grants R01EB000863 and R01EB013667. I thank Dr. Harold Kundel for sharing his data set.

Appendix

A. DERIVATION OF THE NULL-DISTRIBUTION RESULT USED IN STEP 2J

To derive the null-distribution result given in step 2j, I approximate the distribution of $F_{OR}$ (22) by deriving an approximate distribution for $F_{OR}^*$ (21), where Cov2 and Cov3 are known. Each $\widetilde{MS}_i$ is equal to its corresponding conventional three-way ANOVA model mean square, denoted by $MS_i$, multiplied by $1/c$, with $MS_i \sim E(MS_i)\,\chi^2_{df(MS_i)}/df(MS_i)$ under H0 : θ1 = … = θt. It follows that the $\widetilde{MS}_i$ are mutually independent, each $\widetilde{MS}_i$ has the same degrees of freedom as its corresponding $MS_i$, and $\widetilde{MS}_i \sim E(\widetilde{MS}_i)\,\chi^2_{df(\widetilde{MS}_i)}/df(\widetilde{MS}_i)$.

In general, a chi-squared-distribution approximation [25, 26] for a random variable X is given by

$$E(X)\,\chi^2_{df}/df$$

where

$$df = \frac{2[E(X)]^2}{\mathrm{var}(X)}$$

It follows that a chi-square approximation for

$$X = b\left(\sum_{i=1}^{I} a_i\widetilde{MS}_i + d\right)$$

where the $a_i$, b and d are constants, is given by

$$b\left(\sum_{i=1}^{I} a_i E(\widetilde{MS}_i) + d\right)\chi^2_{df}/df \tag{A1}$$

where

$$df = \frac{\left[\sum_{i=1}^{I} a_i E(\widetilde{MS}_i) + d\right]^2}{\sum_{i=1}^{I}\left[a_i E(\widetilde{MS}_i)\right]^2/df(\widetilde{MS}_i)} \tag{A2}$$

Replacing $E(\widetilde{MS}_i)$ by $\widetilde{MS}_i$ and d by an estimate in (A2) results in the approximation for df given by $df_2$ (24).
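A small numerical sketch may help make the moment-matching step concrete. It evaluates (A2) for a linear combination of independent mean squares plus a constant, and checks that the result agrees with the generic formula df = 2[E(X)]²/var(X) when each mean square is a scaled chi-squared variable. The function names and the numerical inputs are illustrative assumptions only.

```python
def satterthwaite_df(a, expected_ms, df_ms, d=0.0):
    """Degrees of freedom from (A2) for X = b * (sum a_i MS_i + d).

    The multiplier b cancels and is therefore not an argument.
    """
    num = (sum(ai * e for ai, e in zip(a, expected_ms)) + d) ** 2
    den = sum((ai * e) ** 2 / dfi for ai, e, dfi in zip(a, expected_ms, df_ms))
    return num / den


def moment_matched_df(a, expected_ms, df_ms, d=0.0):
    """Same quantity via df = 2 E(X)^2 / var(X), using
    var(MS_i) = 2 E(MS_i)^2 / df_i for a scaled chi-squared mean square."""
    mean_x = sum(ai * e for ai, e in zip(a, expected_ms)) + d
    var_x = sum(ai ** 2 * 2 * e ** 2 / dfi
                for ai, e, dfi in zip(a, expected_ms, df_ms))
    return 2 * mean_x ** 2 / var_x


# One mean square with 2 df plus a positive constant d (made-up values).
df_example = satterthwaite_df([1.0], [0.00125], [2], d=0.0015)
```

With a single mean square and d = 0 the formula returns the mean square's own degrees of freedom; a positive constant d inflates the result because it adds to the mean of X without adding variance.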

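It follows using (A1) with $I = 1$, $a_1 = 1$, $\widetilde{MS}_1 = \widetilde{MS}(T*R)$, $d = r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ and (A2) estimated by (24) that a chi-squared approximation for $\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$, the denominator of $F_{OR}^*$ (21), is given by

$$\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3) \;\dot\sim\; \left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}\chi^2_{df_2}/df_2 \tag{A3}$$

where $df_2$ is given by (25) and "$\dot\sim$" stands for "is approximately distributed as." See Reference [6] for a more detailed derivation and justification of $df_2$ (referred to as $ddf_H$ in the reference).

Because $F_{OR}^*$ (21) is an ANOVA statistic, $E[\widetilde{MS}(T)] = E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ under H0. Combining this result with the chi-squared approximation (A3) for $\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$ and the independence of $\widetilde{MS}(T)$ and $\widetilde{MS}(T*R)$, it follows under H0 that

$$F_{OR}^* = \frac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)} = \frac{\widetilde{MS}(T)\big/E[\widetilde{MS}(T)]}{\left[\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]\big/\left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}} = \frac{U/(t-1)}{W/df_2}$$

where $U \sim \chi^2_{t-1}$, W is approximately $\chi^2_{df_2}$, and U and W are independent. Thus $F_{OR}^*$ has an approximate $F_{t-1,df_2}$ null distribution, with $df_2$ given by (25). Because $F_{OR}$ (22) approximates $F_{OR}^*$ (21), it is reasonable to approximate the null distribution of $F_{OR}$ by $F_{t-1,df_2}$, which is the null distribution derived by Hillis [6] for $F_{OR}$, discussed in Section 2.2.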

B. MM-ANOVA APPROACH STEP 3: DERIVE CONFIDENCE INTERVALS FOR A LINEAR FUNCTION g(θ) OF TEST ACCURACIES

In this section I show how to compute a confidence interval for a linear function of test accuracy parameters. Specifically, for the balanced test×reader×case factorial study design with $\theta_i \equiv E(\hat\theta_{i\bullet})$ denoting the expected reader-performance outcome for test i across readers, θ = (θ1, …, θt)′, and l = (l1, …, lt)′ denoting a t-dimensional contrast vector (i.e., $\sum_{i=1}^{t} l_i = 0$), I illustrate how to derive a confidence interval for g (θ) ≡ l′θ. More generally this step can be used to determine a confidence interval for g (θ), where g (·) is any linear function and θ any vector of test accuracy parameters; this general result is given in step 3k.

B.1. Step 3a: Write the test accuracy parameter vector θ in terms of the mm-ANOVA model

In terms of the mm-ANOVA model parameterization, treating $\tilde Y_{ij}$ as $\hat\theta_{ij}$, we have $\theta_i = E(\tilde Y_{i\bullet}) = \mu + \tau_i$.

B.2. Step 3b: Write θ in terms of the conventional ANOVA model

Since $\theta_i = E(\tilde Y_{i\bullet}) = E(Y_{i\bullet\bullet}) = \mu + \tau_i$, then in terms of the conventional ANOVA model we also have $\theta_i = \mu + \tau_i$.

B.3. Step 3c: Determine the conventional ANOVA estimate for θ, denoted by θ̂

The conventional unbiased ANOVA estimate for θ is given by θ̂ = (θ̂1, …, θ̂t)′ with θ̂i = Yi••.

B.4. Step 3d: Determine the variance V of g (θ̂) in terms of conventional ANOVA parameters

From (7) it follows that

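$$g(\hat\theta) = \sum_{i=1}^{t} l_i\hat\theta_i = \sum_{i=1}^{t} l_i Y_{i\bullet\bullet} = \sum_{i=1}^{t} l_i\tau_i + \sum_{i=1}^{t} l_i\left[(\tau R)_{i\bullet} + (\tau C)_{i\bullet} + (\tau RC)_{i\bullet\bullet} + \varepsilon_{i\bullet\bullet}\right]$$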

Thus

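$$V \equiv \mathrm{Var}(g(\hat\theta)) = \sum_{i=1}^{t} l_i^2\left[\frac{\sigma_{TR}^2}{r} + \frac{\sigma_{TC}^2}{c} + \frac{\sigma^2}{rc}\right] = \frac{1}{rc}\sum_{i=1}^{t} l_i^2\left[c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2\right]$$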

Because θ̂ has a multivariate normal distribution, it follows that

$$g(\hat\theta) \sim N(l'\theta, V)$$

where

$$V = \frac{1}{rc}\sum_{i=1}^{t} l_i^2\left(c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2\right)$$

B.5. Step 3e: Write V from step 3d in the form V = bE (∑aiMSi) for constants b and ai

Expected values of the conventional ANOVA mean squares are given in Table 4. It follows that

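$$V = \frac{1}{rc}\sum_{i=1}^{t} l_i^2\,E\left[MS(T*R) + MS(T*C) - MS(T*R*C)\right]$$

B.6. Step 3f: Write V from step 3e in the form $V = \tilde b\,E\left(\sum \tilde a_i\widetilde{MS}_i + U\right)$ where $\tilde b$ and $\tilde a_i$ are constants and U is a linear function of conventional ANOVA mean squares that involve case

We have

$$V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\,E\left[\widetilde{MS}(T*R) + U\right]$$

where

$$U = \frac{1}{c}\left[MS(T*C) - MS(T*R*C)\right]$$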

B.7. Step 3g: Express E (U) in terms of conventional ANOVA model variance components and then in terms of mm-ANOVA model error covariance parameters, using the relationships from step 1c; then rewrite V using this expression for E (U)

We did the first part of this step in step 2g where we showed

$$E(U) = r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$$

Using this expression we have

$$V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}$$

B.8. Step 3h: Derive the variance estimate V̂ from V by replacing expected mean squares by mean squares and replacing covariances by estimates that take into account the constraints from step 1d

We have

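$$\hat V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + \max\left[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0\right]\right\}$$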

B.9. Step 3i: Derive the degrees of freedom df2 for V̂ using the general formula for df2 (24) given in step 2j

It follows that the degrees of freedom is given by (25), which is the same as ddfH (6).

B.10. Step 3j: Write θ̂ from step 3c in terms of the mm-ANOVA model

Since $\hat\theta_i = Y_{i\bullet\bullet} = \tilde Y_{i\bullet}$, then in terms of the mm-ANOVA model $\hat\theta_i = \tilde Y_{i\bullet}$.

B.11. Step 3k: General confidence-interval result: In terms of the mm-ANOVA model, an approximate (1 − α) 100% confidence interval for g (θ) is given by $g(\hat\theta) \pm t_{\alpha/2;df_2}\sqrt{\hat V}$ where $\hat V$ is determined in step 3h, $df_2$ in step 3i and $\hat\theta$ in step 3j

This result yields the following (1 − α) 100% confidence interval for lθ:

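$$\sum_{i=1}^{t} l_i\tilde Y_{i\bullet} \pm t_{\alpha/2;ddf_H}\sqrt{\frac{1}{r}\left(\sum_{i=1}^{t} l_i^2\right)\left[\widetilde{MS}(T*R) + r\max(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3, 0)\right]} \tag{B1}$$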

where ddfH is given by (25). Letting “FOR-test denominator” denote the denominator of the FOR statistic (22) for testing H0 : θ1 = … = θt, we can write (B1) as

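$$\sum_{i=1}^{t} l_i\tilde Y_{i\bullet} \pm t_{\alpha/2;ddf_H}\sqrt{\frac{1}{r}\left(\sum_{i=1}^{t} l_i^2\right)\left\{F_{OR}\text{-test denominator}\right\}}$$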
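The interval (B1) is simple to evaluate once the mean square, the covariance estimates, and the t critical value are in hand. The following is a minimal sketch under stated assumptions: the function name and arguments are invented for the example, the critical value `t_crit` is supplied by the caller (it would come from a t table or a statistics library using the degrees of freedom from step 3i), and the numbers in the usage lines are made up.

```python
import math


def contrast_ci(test_means, l, ms_tr_tilde, cov2_hat, cov3_hat, r, t_crit):
    """Confidence interval (B1) for the contrast sum_i l_i * theta_i.

    test_means[i] is the reader-averaged estimate for test i and
    ms_tr_tilde is the mm-ANOVA test-by-reader mean square.
    """
    estimate = sum(li * y for li, y in zip(l, test_means))
    # Constrained variance estimate from step 3h.
    v_hat = (sum(li ** 2 for li in l) / r) * (
        ms_tr_tilde + r * max(cov2_hat - cov3_hat, 0.0)
    )
    half_width = t_crit * math.sqrt(v_hat)
    return estimate - half_width, estimate + half_width


# Two tests, three readers; 4.303 is the two-sided 5% t value for 2 df.
lower, upper = contrast_ci([0.85, 0.75], [1, -1], ms_tr_tilde=0.00125,
                           cov2_hat=0.0008, cov3_hat=0.0009, r=3, t_crit=4.303)
```

Passing the critical value in explicitly keeps the sketch free of external dependencies and makes clear that the only design-specific pieces are the variance estimate and its degrees of freedom.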

B.12. Derivation of the general confidence-interval result given in step 3k

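I now derive the step 3k result for the test×reader×case factorial study design with g (θ) ≡ l′θ and l = (l1, …, lt)′ denoting a t-dimensional contrast vector (i.e., $\sum_{i=1}^{t} l_i = 0$). We have shown in the previous steps that $g(\hat\theta) \sim N[g(\theta), V]$, where

$$V = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}$$

Define V* by replacing $E[\widetilde{MS}(T*R)]$ by $\widetilde{MS}(T*R)$:

$$V^* = \frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}$$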

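Using the same argument as given in Appendix A and noting that V = E (V*), we can show that a chi-squared-distribution approximation for V* is given by $V\chi^2_{df_2}/df_2$ with $df_2$ given by (25). Furthermore, independence of $g(\hat\theta)$ and $\widetilde{MS}(T*R)$ for the mm-ANOVA model, and hence independence of $g(\hat\theta)$ and V*, follows from the independence of $g(\hat\theta)$ and MS(T * R) for the conventional ANOVA model (7). Thus for the mm-ANOVA model

$$t = \frac{g(\hat\theta) - g(\theta)}{\sqrt{\frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}}} = \frac{g(\hat\theta) - g(\theta)}{\sqrt{V^*}} = \frac{\left[g(\hat\theta) - g(\theta)\right]/\sqrt{V}}{\sqrt{\left[(V^*)\,df_2/V\right]/df_2}} = \frac{Z}{\sqrt{W/df_2}}$$

where Z ~ N (0, 1), W is approximately $\chi^2_{df_2}$, and Z and W are independent. Thus

$$t = \frac{g(\hat\theta) - g(\theta)}{\sqrt{\frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}}}$$

has an approximate $t_{df_2}$ distribution with $df_2$ given by (25). In practice we replace r (Cov2 − Cov3) by $\max[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0]$ and base tests and confidence intervals on

$$t = \frac{g(\hat\theta) - g(\theta)}{\sqrt{\frac{1}{r}\sum_{i=1}^{t} l_i^2\left\{\widetilde{MS}(T*R) + \max\left[r(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3), 0\right]\right\}}} = \frac{g(\hat\theta) - g(\theta)}{\sqrt{\hat V}} \tag{B2}$$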

which we treat as having an approximate tdf2 distribution; the confidence interval result in step 3k follows.

The general result with g (·) being any linear function can be similarly proved, with the main difference being the formula for V.

C. MM-ANOVA APPROACH – STEP 4: DERIVE THE NON-NULL DISTRIBUTION OF FOR

Power and sample size estimation for the step 2a hypothesis requires specification of the distribution of the FOR statistic, derived in step 2i, when the null hypothesis is not true. A noncentral F distribution approximation for the non-null distribution is specified by steps 4a–d below. These steps are justified in Section C.5.

C.1. Step 4a: Compute the noncentrality parameter in terms of the conventional ANOVA model

Express the noncentrality parameter in terms of the conventional ANOVA model using

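$$\lambda = \frac{df(MS_{num})\; MS_{num}\big|_{Y=E(Y)}}{E(MS_{num} \mid H_0)} \tag{C1}$$

where $MS_{num}$ is the numerator mean square from the conventional ANOVA F statistic given in step 2d, $df(MS_{num})$ is its degrees of freedom, $E(MS_{num} \mid H_0)$ is its expected value under H0, and $MS_{num}\big|_{Y=E(Y)}$ is the mean square evaluated with outcomes replaced by their expected values.

For the balanced test×reader×case factorial design we have $MS_{num} = MS(T)$ from step 2d. From Table 4 we have $E[MS(T)] = \frac{rc}{t-1}\sum_{i=1}^{t}\tau_i^2 + c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$. Thus $E[MS(T) \mid H_0] = c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2$ under $H_0: \tau_1 = \ldots = \tau_t = 0$. Noting that $E(Y_{ijk}) = \mu + \tau_i$, we have $MS(T)\big|_{Y=E(Y)} = \frac{rc}{t-1}\sum_{i=1}^{t}(Y_{i\bullet\bullet} - Y_{\bullet\bullet\bullet})^2\Big|_{Y_{ijk}=\mu+\tau_i} = \frac{rc}{t-1}\sum_{i=1}^{t}\tau_i^2$. Noting that df[MS (T)] = t − 1, then from (C1) it follows that

$$\lambda = \frac{rc\sum_{i=1}^{t}\tau_i^2}{c\sigma_{TR}^2 + r\sigma_{TC}^2 + \sigma^2} \tag{C2}$$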

C.2. Step 4b: Express λ in terms of mm-ANOVA parameters

Replace variance components in (C2) corresponding to random effects involving case by mm-ANOVA covariances. From the relationships determined in step 1c and presented in Table 3 we have

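$$r\sigma_{TC}^2 + \sigma^2 = c\left[\sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]$$

(Recall that $\sigma_{\tilde\varepsilon}^2$ is the error variance for the mm-ANOVA model.) Thus in terms of mm-ANOVA parameters

$$\lambda = \frac{r\sum_{i=1}^{t}\tau_i^2}{\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)} \tag{C3}$$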
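As a numerical sanity check on this reparameterization, the sketch below maps a set of mm-ANOVA parameters to conventional variance components and confirms that (C2) and (C3) give the same noncentrality parameter. The mapping uses the relation displayed above together with $\sigma^2 = c(\sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3)$ from step 4c, which jointly imply $\sigma_{TC}^2 = c(\mathrm{Cov}_2 - \mathrm{Cov}_3)$. Function names and parameter values are illustrative assumptions.

```python
import math


def to_conventional(sigma2_eps_mm, cov1, cov2, cov3, c):
    """Map mm-ANOVA error parameters to (sigma2_TC, sigma2)."""
    sigma2_tc = c * (cov2 - cov3)
    sigma2 = c * (sigma2_eps_mm - cov1 - cov2 + cov3)
    return sigma2_tc, sigma2


def lambda_conventional(taus, r, c, sigma2_tr, sigma2_tc, sigma2):
    """Noncentrality parameter from (C2)."""
    return r * c * sum(x ** 2 for x in taus) / (
        c * sigma2_tr + r * sigma2_tc + sigma2
    )


def lambda_mm(taus, r, sigma2_tr, sigma2_eps_mm, cov1, cov2, cov3):
    """Noncentrality parameter from (C3)."""
    return r * sum(x ** 2 for x in taus) / (
        sigma2_tr + sigma2_eps_mm - cov1 + (r - 1) * (cov2 - cov3)
    )


# Made-up parameter values for illustration.
params = dict(sigma2_eps_mm=0.0022, cov1=0.0011, cov2=0.0009, cov3=0.0008)
s_tc, s2 = to_conventional(c=95, **params)
lam_c2 = lambda_conventional([0.02, -0.02], r=5, c=95,
                             sigma2_tr=0.0002, sigma2_tc=s_tc, sigma2=s2)
lam_c3 = lambda_mm([0.02, -0.02], r=5, sigma2_tr=0.0002, **params)
```

The number of cases c cancels in (C3) because the mm-ANOVA covariances already describe case-averaged outcomes; in practice c enters through the assumed dependence of those covariances on the sample size.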

C.3. Step 4c: Determine the denominator degrees of freedom in terms of mm-ANOVA parameters

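Write the denominator of $F_{OR}^*$ from step 2h in the form $b\left(\sum_{i=1}^{I} a_i\widetilde{MS}_i + d\right)$. The denominator degrees of freedom is given by

$$df_2 = \frac{\left[\sum_{i=1}^{I} a_i E(\widetilde{MS}_i) + d\right]^2}{\sum_{i=1}^{I}\left[a_i E(\widetilde{MS}_i)\right]^2/df(\widetilde{MS}_i)} \tag{C4}$$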

which is the same as (A2). Note that (C4) contains the expected mean square values and the true value of d, in contrast to approximation (24) that replaces these values by sample estimates. The reason for this difference is that approximation (24) will be used for hypotheses testing and confidence intervals for a study data set; in contrast, (C4) will be used for sample-size and power estimation for a future study and will be based on parameter values that are either conjectured or estimated from pilot data.

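Express the expected mean squares in (C4) in terms of mm-ANOVA model parameters by determining their expected values in terms of the conventional ANOVA parameters and then replacing variance components that involve case by mm-ANOVA covariances. For example, for the balanced test×reader×case factorial study design, the denominator of $F_{OR}^*$ from step 2h is given by $\sum_{i=1}^{I} a_i\widetilde{MS}_i + d = \widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)$. From (17) and Tables 3–4 it follows that

$$E(\widetilde{MS}(T*R)) = \frac{1}{c}E(MS(T*R)) = \frac{1}{c}\left(c\sigma_{TR}^2 + \sigma^2\right)$$

with $\sigma^2 = c\left(\sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\right)$. Thus

$$E(\widetilde{MS}(T*R)) = \sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3$$

and hence, using (C4),

$$df_2 = \frac{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 + (r-1)(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]^2}{\left[\sigma_{TR}^2 + \sigma_{\tilde\varepsilon}^2 - \mathrm{Cov}_1 - \mathrm{Cov}_2 + \mathrm{Cov}_3\right]^2/[(t-1)(r-1)]} \tag{C5}$$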

Hillis et al [15] illustrate how these formulas can be used in practice to estimate power and sample size using pilot-data or conjectured parameter estimates.

C.4. Step 4d: General non-null distribution result

An approximation for the non-null distribution of FOR is given by

$$F_{df_1,df_2;\lambda}$$

where λ is given in step 4b, $df_1$ is the degrees of freedom for the numerator mean square from the conventional ANOVA F statistic given in step 2d and $df_2$ is given by (C4), expressed in terms of the mm-ANOVA parameters. Thus for the balanced test×reader×case factorial study design, λ is given by (C3), $df_1 = t - 1$, and $df_2$ is given by (C5).

C.5. Justification of steps 4a–d

The non-null distribution result given in step 4d can be derived for the test×reader×case study design along the same lines as the derivation of the null distribution result given in Section A. One difference is that $\widetilde{MS}(T)$, the numerator mean square in $F_{OR}$ (22), has a noncentral chi-square distribution when appropriately normalized under H1. The distribution for MS(T) is given by

$$\frac{(t-1)\,MS(T)}{E(MS(T) \mid H_0)} \sim \chi^2_{t-1;\lambda}$$

where λ is given by (C3). Because $\widetilde{MS}(T) = \frac{1}{c}MS(T)$, it follows that

$$\frac{(t-1)\,\widetilde{MS}(T)}{E(\widetilde{MS}(T) \mid H_0)} \sim \chi^2_{t-1;\lambda}$$

Using the Section A approach but with this one difference, we can show that

$$F_{OR}^* = \frac{\widetilde{MS}(T)}{\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)} = \frac{\widetilde{MS}(T)\big/E[\widetilde{MS}(T) \mid H_0]}{\left[\widetilde{MS}(T*R) + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right]\big/\left\{E[\widetilde{MS}(T*R)] + r(\mathrm{Cov}_2 - \mathrm{Cov}_3)\right\}} = \frac{U/(t-1)}{W/df_2}$$

where $U \sim \chi^2_{t-1;\lambda}$, W is approximately $\chi^2_{df_2}$ with $df_2$ given by (C5), and U and W are independent. Thus $F_{OR}^*$ has an approximate noncentral $F_{t-1,df_2;\lambda}$ distribution. Because $F_{OR}$ (22) approximates $F_{OR}^*$ (21), it is reasonable to approximate the non-null distribution of $F_{OR}$ by $F_{t-1,df_2;\lambda}$.

References

  • 1. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Communications in Statistics: Simulation and Computation. 1995;24:285–308.
  • 2. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Academic Radiology. 1995;2(Suppl 1):S22–S29.
  • 3. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
  • 4. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi: 10.1016/s1076-6332(98)80294-8.
  • 5. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette Methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi: 10.1002/sim.2024.
  • 6. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine. 2007;26:596–619. doi: 10.1002/sim.2532.
  • 7. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–844.
  • 8. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
  • 9. Kundel HL, Gefter W, Aronchick J, Miller W, Hatabu H, Whitfill CH. Accuracy of bedside chest hard-copy screen-film versus hard-and soft-copy computed radiographs in a medical intensive care unit: receiver operating characteristic analysis. Radiology. 1997;205:859–863. doi: 10.1148/radiology.205.3.9393548.
  • 10. Pavur R, Nath R. Exact F tests in an ANOVA procedure for dependent observations. Multivariate Behavioral Research. 1984;19:408–420. doi: 10.1207/s15327906mbr1904_3.
  • 11. Searle SR. Linear Models. New York: Wiley; 1971. pp. 55–59.
  • 12. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Academic Radiology. 1995;2(Suppl 1):S22–S29.
  • 13. Obuchowski NA. Reducing the number of reader interpretations in MRMC studies. Academic Radiology. 2009;16:209–217. doi: 10.1016/j.acra.2008.05.014.
  • 14. Obuchowski NA, Gallas BD, Hillis SL. Multi-reader ROC studies with split-plot designs: a comparison of statistical methods. Academic Radiology. 2012;19:1508–1517. doi: 10.1016/j.acra.2012.09.012.
  • 15. Hillis SL, Obuchowski NA, Berbaum KS. Power estimation for multireader ROC methods: An updated and unified approach. Academic Radiology. 2011;18:129–142. doi: 10.1016/j.acra.2010.09.007.
  • 16. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Academic Radiology. 2000;7:516–525. doi: 10.1016/s1076-6332(00)80324-4.
  • 17. Chakraborty DP, Berbaum KS. Observer studies involving detection and localization: Modeling, analysis, and validation. Medical Physics. 2004;31:2313–2330. doi: 10.1118/1.1769352.
  • 18. Bunch PC, Hamilton JF, Sanderson GK, Simmons AH. Free-response approach to the measurement and characterization of radiographic-observer performance. Journal of Applied Photographic Engineering. 1978;4:166–171.
  • 19. Pavur RJ, Lewis TO. Unbiased F-tests for factorial-experiments for correlated data. Communications in Statistics-Theory and Methods. 1983;12:829–840.
  • 20. Gallas BD. One-shot estimate of MRMC variance: AUC. Academic Radiology. 2006;13:353–362. doi: 10.1016/j.acra.2005.11.030.
  • 21. Gallas BD, Pennelo GA, Myers KJ. Multireader multicase variance analysis for binary data. JOSA A. 2007;24:B70–B80. doi: 10.1364/josaa.24.000b70.
  • 22. Gallas BD, Bandos A, Samuelson FW, Wagner RF. A framework for random-effects ROC analysis: biases with the bootstrap and other variance estimators. Communications in Statistics-Theory and Methods. 2009;38:2586–2603.
  • 23. Hoeffding W. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics. 1948;19:293–325.
  • 24. Song X, Zhou XH. A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics. 2005;6:303–312. doi: 10.1093/biostatistics/kxi011.
  • 25. Satterthwaite FE. Synthesis of variance. Psychometrika. 1941;6:309–316.
  • 26. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometric Bulletin. 1946;2:110–114.
