Abstract
Rationale and Objectives
We describe a step-by-step procedure for estimating power and sample size for planned multireader receiver operating characteristic (ROC) studies that will be analyzed using either the Dorfman-Berbaum-Metz (DBM) or Obuchowski-Rockette (OR) method. This procedure updates previous approaches by incorporating recent methodological developments and unifies the approaches by allowing inputs to be conjectured parameter values or outputs from either a DBM or OR pilot-study analysis.
Materials and Methods
Power computations are described in a step-by-step procedure and the theoretical basis for the procedure is described. Updates include using the currently recommended denominator degrees of freedom, accounting for different pilot and planned study normal-to-abnormal case ratios, and a new method for computing the OR test-by-reader variance component.
Results
Using a real data set we illustrate how to compute the power for two planned studies, one having the same normal-to-abnormal case ratio as the pilot study and the other having a different ratio. In a simulation study we show that the proposed procedure gives mean power estimates close to the true power.
Conclusions
Application of the updated procedure is straightforward. It is important that pilot data be comparable to the planned study with respect to the modalities, reader expertise, and case selection. Variability of the power estimates warrants further investigation.
Keywords: ROC curve, sample size, power, multireader
1. Introduction
Receiver operating characteristic (ROC) curve analysis is a well-established method for evaluating and comparing the performance of diagnostic tests for radiological imaging studies. Throughout we assume that rating data have been collected using the study design where multiple readers (typically radiologists) assign disease-severity or disease-likelihood ratings, using one or more tests, to the same images using either a discrete (e.g., 1, 2, 3, 4, 5) or a quasi-continuous (e.g., 0–100%) scale. From these ratings, ROC curves and corresponding accuracy estimates are computed for each reader and each test, in order to assess how well a test performs or to compare the performance of tests. In such studies there is variability between cases and between readers. Thus it is important that results generalize to both the corresponding case and reader populations; methods that accomplish this goal are commonly referred to as multireader multicase (MRMC) methods.
Two popular MRMC methods are those proposed by Dorfman, Berbaum, and Metz (DBM) [1, 2] and by Obuchowski and Rockette (OR) [3, 4]. For the OR method, power computation using conjectured parameter estimates is discussed by Obuchowski [4, 5] and Zhou et al [6]; for the DBM method power computation based on pilot-study estimates or conjectured parameter values is discussed by Hillis and Berbaum [7]. Since the publication of these articles, it has been shown [8] that both the DBM and OR methods can be improved by using a common denominator degrees of freedom, ddfH, for the F statistic for testing for equality of tests. When both methods use ddfH, the DBM method can be viewed as an implementation of the OR method using jackknife covariance estimates, with both methods yielding the same conclusions. Furthermore, Reference [9] shows that if the OR method is not based on jackknife covariance estimates, then quasi pseudovalues can be generated that give the same results when analyzed by the DBM method. Thus we can consider the DBM and OR procedures to be equivalent. These developments in the DBM procedure and its relationship with the OR procedure are summarized in Reference [10].
Although equivalent results can be obtained using either method, the DBM model is not statistically acceptable since several of its assumptions are not true [8]. Thus the DBM model should be viewed only as a “working” model; although “pretending” that the DBM model is correct generally leads to valid inferences, parameters for the model are difficult to interpret in terms of the model. For these reasons, theoretical justification for results provided in this paper will be based on the OR model.
Our purpose is to describe a step-by-step procedure for computing power (and hence sample size) for either method. This procedure updates previous approaches by incorporating ddfH, accounting for different pilot- and planned-study normal-to-abnormal case ratios, and incorporating a new method for estimating the OR test-by-reader variance component. The procedure unifies the approaches by allowing both procedures to be based on either pilot data or conjectured parameter values, and yields the same results regardless of whether the inputted values are DBM or OR pilot-data analysis outputs or conjectured parameter values. We describe the procedure for the OR method and then show how this same procedure can also be used with inputs obtained from a DBM analysis. The procedure is illustrated in an example and its performance is evaluated in a simulation study.
2. Materials and Methods
2.1. Design and notation
We assume that rating data have been collected from a test×reader×case factorial study design, where each case undergoes each diagnostic test and the resulting images are evaluated once by each reader. (We use test to refer to a diagnostic test, modality, or treatment.) Letting Zijk denote the rating assigned to the kth case by the jth reader using the ith test, the observed rating data consist of the Zijk, i = 1,…,t, j = 1,…,r, k = 1,…,c, where t is the number of tests, r the number of readers, and c the number of cases. In addition, each case is classified as diseased or nondiseased according to an available reference standard. We let θ̂ij denote the AUC estimate based on all of the data for the ith test and jth reader; however, more generally θ̂ij can be any ROC accuracy estimate, such as the partial AUC, sensitivity for a fixed specificity, or specificity for a fixed sensitivity. We let θi denote the corresponding population AUC, defined statistically by θi = E(θ̂ij) for fixed i. That is, for a given test i and case sample size c, θi is the expected AUC for a randomly selected reader reading c randomly selected cases.
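To make the notation concrete, here is a minimal Python sketch (the function name and ratings are ours, purely illustrative) of the empirical (trapezoidal-rule) AUC estimate θ̂ij for one test-reader combination, computed as the Mann-Whitney statistic:

```python
import numpy as np

def empirical_auc(diseased_ratings, nondiseased_ratings):
    """Trapezoidal-rule (empirical) AUC: the proportion of
    diseased/nondiseased case pairs in which the diseased case
    receives the higher rating, counting ties as 1/2."""
    x = np.asarray(diseased_ratings, dtype=float)
    y = np.asarray(nondiseased_ratings, dtype=float)
    higher = (x[:, None] > y[None, :]).mean()  # diseased rated higher
    tied = (x[:, None] == y[None, :]).mean()   # tied ratings
    return higher + 0.5 * tied

# Hypothetical 5-point ratings from one reader under one test
auc_hat = empirical_auc([5, 4, 4, 3, 5], [1, 2, 3, 1, 2, 2])  # 29.5/30
```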
2.2. The DBM procedure
For the DBM procedure, AUC pseudovalues are computed using the Quenouille-Tukey jackknife [11–13] separately for each reader-test combination. Let Yijk denote the AUC pseudovalue for test i, reader j, and case k; by definition Yijk = cθ̂ij − (c − 1)θ̂ij(k), where θ̂ij(k) denotes the AUC estimate when data for the kth case are omitted. Treating the Yijk as the outcomes, the original DBM procedure specified testing for a test effect using a fully crossed three-factor ANOVA, with test treated as a fixed factor and reader and case as random factors; the original DBM accuracy estimate was Ȳij·, which is the jackknife accuracy estimate corresponding to θ̂ij. (A subscript replaced by a dot indicates that values are averaged across the missing subscript.) Later, Hillis et al [9] recommended that the DBM method be used with normalized pseudovalues, defined by Y′ijk = Yijk + (θ̂ij − Ȳij·). For normalized pseudovalues the DBM accuracy estimate, given by Ȳ′ij·, is equal to θ̂ij and hence the analysis is not restricted to jackknife accuracy estimates.
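The pseudovalue definitions can be sketched as follows (a hypothetical Python illustration assuming the empirical AUC as the accuracy estimate; by construction the normalized pseudovalues average to θ̂ij):

```python
import numpy as np

def empirical_auc(x1, x0):
    # Trapezoidal AUC (Mann-Whitney form, ties counted as 1/2)
    x1, x0 = np.asarray(x1, float), np.asarray(x0, float)
    return ((x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :])).mean()

def normalized_pseudovalues(ratings, truth):
    """Jackknife pseudovalues Y_ijk = c*theta_hat - (c-1)*theta_hat_(k),
    shifted by (theta_hat - mean(Y)) so that their mean equals the
    full-sample AUC estimate (normalized pseudovalues)."""
    ratings = np.asarray(ratings, float)
    truth = np.asarray(truth, int)
    c = len(ratings)
    theta_hat = empirical_auc(ratings[truth == 1], ratings[truth == 0])
    jack = np.empty(c)
    for k in range(c):  # delete-one-case AUC estimates
        keep = np.arange(c) != k
        jack[k] = empirical_auc(ratings[keep & (truth == 1)],
                                ratings[keep & (truth == 0)])
    raw = c * theta_hat - (c - 1) * jack
    return raw + (theta_hat - raw.mean()), theta_hat

# Hypothetical ratings for one test-reader combination
pv, theta_hat = normalized_pseudovalues(
    [5, 4, 3, 5, 2, 1, 2, 3, 1, 2], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```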
Let MS(T)Y, MS(T*R)Y, MS(T*C)Y, and MS(T*R*C)Y denote the test, test×reader, test×case, and test×reader×case mean squares for the DBM three-way ANOVA of the pseudovalues. (Here the Y subscript is used to indicate that these mean squares are computed from pseudovalues, in contrast to the OR mean squares discussed in the next section that are computed from reader-level AUCs.) The DBM F statistic for testing the null hypothesis of no test effect is
FDBM = MS(T)Y / [MS(T*R)Y + H(MS(T*C)Y − MS(T*R*C)Y)]  (1)

where the function H(·) is defined by H(x) = max(x, 0); that is, H truncates its argument below at zero.
Equation 1 is recommended by Hillis et al [9] and differs slightly from the original DBM formulation in that less data-based model reduction is allowed. Hillis and Berbaum [7] used Equation 1 in their power algorithm; although they used raw instead of normalized pseudovalues, the use of normalized pseudovalues does not alter their algorithm.
Hillis [8] showed that the DBM method has improved performance if the following denominator degrees of freedom for FDBM is used:
ddfH = [MS(T*R)Y + H(MS(T*C)Y − MS(T*R*C)Y)]² / {[MS(T*R)Y]² / [(t − 1)(r − 1)]}  (2)
The updated power procedure that we will present incorporates Equation 2, which was not used by Hillis and Berbaum [7] since it had not yet been proposed. We note that since it was proposed in 2007 by Hillis [8], ddfH has been incorporated into freely available DBM analysis software [14–16].
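As a quick sketch, Equations 1 and 2 can be coded directly (Python; the mean-square values below are hypothetical):

```python
def f_dbm(ms_t, ms_tr, ms_tc, ms_trc):
    """DBM F statistic (Eq. 1): the denominator truncates
    MS(T*C)_Y - MS(T*R*C)_Y below at zero."""
    return ms_t / (ms_tr + max(ms_tc - ms_trc, 0.0))

def ddf_h_dbm(ms_tr, ms_tc, ms_trc, t, r):
    """Hillis denominator degrees of freedom (Eq. 2)."""
    den = ms_tr + max(ms_tc - ms_trc, 0.0)
    return den ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))

# Hypothetical pseudovalue mean squares for t = 2 tests, r = 5 readers
F = f_dbm(0.45, 0.07, 0.30, 0.25)            # 0.45 / 0.12 = 3.75
ddf = ddf_h_dbm(0.07, 0.30, 0.25, t=2, r=5)
```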
2.3. The OR procedure
Obuchowski and Rockette [3] analyze AUC estimates using a test × reader factorial ANOVA model, but unlike a conventional ANOVA model they allow the errors to be correlated to account for correlation due to each reader evaluating the same cases for each test. Their model, which we refer to as the OR model, can be written as
θ̂ij = μ + τi + Rj + (TR)ij + ∊ij,  (3)

i = 1,…,t, j = 1,…,r, where τi denotes the fixed effect of test i, Rj denotes the random effect of reader j, (TR)ij denotes the random test × reader interaction, and ∊ij is the error term. The Rj and (TR)ij are assumed to be mutually independent and normally distributed with zero means and respective variances σ²R, reflecting differences in reader ability, and σ²TR, reflecting test-by-reader interaction. The ∊ij are assumed to be normally distributed with zero mean and variance σ²∊, which represents variability attributable to cases and within-reader variability that describes how a reader interprets the same image in different ways on different occasions. The ∊ij are independent of the Rj and (TR)ij. Equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances:

Cov1 = Cov(∊ij, ∊i′j), i ≠ i′ (same reader, different tests);
Cov2 = Cov(∊ij, ∊ij′), j ≠ j′ (different readers, same test);
Cov3 = Cov(∊ij, ∊i′j′), i ≠ i′, j ≠ j′ (different readers, different tests).
It follows from model (3) that σ²∊, Cov1, Cov2, and Cov3 are also the variance and corresponding covariances of the AUC estimates, conditional on the reader and test × reader effects. Based on clinical considerations Obuchowski and Rockette [3] suggest the following ordering for the covariances:
Cov1 ≥ Cov2 ≥ Cov3 ≥ 0.  (4)
The OR statistic for testing the null hypothesis of no test effect is given by
FOR = MS(T) / [MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)]  (5)

where MS(T) and MS(T*R) are the two-way ANOVA test and test-by-reader mean squares; these mean squares are based on the AUC outcomes, in contrast to the DBM mean squares that are based on the case-level pseudovalues. The quantities Ĉov2 and Ĉov3 denote estimates for Cov2 and Cov3, respectively. Note that Equation 5 incorporates the constraint given by Equation 4 by setting Ĉov2 − Ĉov3 to zero if it is negative.
Since Cov2 and Cov3 are also the corresponding covariances of the AUC estimates conditional on the reader and test × reader effects, they can be estimated using ROC analysis methods that treat cases as random but readers as fixed, such as jackknifing, bootstrapping, parametric methods, or the method proposed by DeLong et al [17] for trapezoidal-rule (or empirical) AUC estimates [18]. The OR estimates obtained from averaging corresponding fixed-reader AUC variances and covariances are denoted by σ̂²∊, Ĉov1, Ĉov2, and Ĉov3.
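For instance, the jackknife version of these fixed-reader estimates can be sketched as follows (Python; the helper names and synthetic data are ours, not from any real study). The (t·r) × (t·r) jackknife covariance matrix of the AUCs is computed treating cases as random and readers as fixed, and its entries are then averaged into σ̂²∊, Ĉov1, Ĉov2, and Ĉov3:

```python
import numpy as np

def auc(x1, x0):
    # Empirical AUC (ties counted as 1/2)
    return ((x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :])).mean()

def jackknife_cov(ratings, truth):
    """ratings: array of shape (t, r, c). Returns the (t*r) x (t*r)
    jackknife covariance matrix of the delete-one-case AUC estimates."""
    t, r, c = ratings.shape
    jack = np.empty((t * r, c))
    for k in range(c):
        keep = np.arange(c) != k
        for i in range(t):
            for j in range(r):
                x = ratings[i, j, keep]
                y = truth[keep]
                jack[i * r + j, k] = auc(x[y == 1], x[y == 0])
    dev = jack - jack.mean(axis=1, keepdims=True)
    return (c - 1) / c * dev @ dev.T

def or_estimates(cov, t, r):
    """Average the matrix entries into var_eps, cov1, cov2, cov3."""
    var_eps = np.trace(cov) / (t * r)
    buckets = {1: [], 2: [], 3: []}
    for i in range(t):
        for j in range(r):
            for ip in range(t):
                for jp in range(r):
                    if (i, j) == (ip, jp):
                        continue
                    v = cov[i * r + j, ip * r + jp]
                    if j == jp:            # same reader, different tests
                        buckets[1].append(v)
                    elif i == ip:          # same test, different readers
                        buckets[2].append(v)
                    else:                  # different tests and readers
                        buckets[3].append(v)
    return var_eps, *(np.mean(buckets[m]) for m in (1, 2, 3))

# Synthetic example: 2 tests, 3 readers, 40 cases with a shared case effect
rng = np.random.default_rng(0)
truth = np.array([1] * 15 + [0] * 25)
case_effect = rng.normal(size=40)
ratings = truth * 1.5 + case_effect + rng.normal(scale=0.7, size=(2, 3, 40))
cov = jackknife_cov(ratings, truth)
var_eps, cov1, cov2, cov3 = or_estimates(cov, t=2, r=3)
```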
Hillis [8] shows that FOR has an approximate null Ft−1,df2 distribution, where
df2 = {E[MS(T*R)] + r(Cov2 − Cov3)}² / ({E[MS(T*R)]}² / [(t − 1)(r − 1)])  (6)
and suggests estimating df2 by
ddfH = [MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)]² / {[MS(T*R)]² / [(t − 1)(r − 1)]}  (7)
Note that the estimate ddfH replaces the parameters in df2 by estimates; in particular, the expected test × reader mean square, E[MS(T*R)], is replaced by the observed mean square, MS(T*R), and r(Cov2 − Cov3) is replaced by r·max(Ĉov2 − Ĉov3, 0), which incorporates the model covariance constraints given by Equation 4. He also shows that ddfH results in improved performance compared to the denominator degrees of freedom, ddf0 = (t − 1)(r − 1), originally proposed by Obuchowski and Rockette [3]. A 100(1 − α)% confidence interval for θi − θi′, i ≠ i′, is given by θ̂i· − θ̂i′· ± tα/2;ddfH √(2·MSdenOR/r), where MSdenOR is the denominator of the right-hand side of Equation 5.
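A compact sketch of Equations 5 and 7, along with the confidence-interval half-width, in code (Python; the numerical inputs below are hypothetical, and the function names are ours):

```python
import math

def f_or(ms_t, ms_tr, r, cov2, cov3):
    """OR F statistic (Eq. 5); max(., 0) enforces Cov2 >= Cov3."""
    return ms_t / (ms_tr + r * max(cov2 - cov3, 0.0))

def ddf_h(ms_tr, r, cov2, cov3, t):
    """Hillis denominator degrees of freedom (Eq. 7)."""
    den = ms_tr + r * max(cov2 - cov3, 0.0)
    return den ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))

def ci_halfwidth(ms_tr, r, cov2, cov3, t_quantile):
    """Half-width of the CI for theta_i - theta_i'; t_quantile is the
    upper alpha/2 t quantile with ddfH degrees of freedom."""
    ms_den = ms_tr + r * max(cov2 - cov3, 0.0)
    return t_quantile * math.sqrt(2.0 * ms_den / r)

# Hypothetical OR outputs: t = 2 tests, r = 5 readers
F = f_or(0.004, 0.0006, r=5, cov2=0.00035, cov3=0.00022)
ddf = ddf_h(0.0006, r=5, cov2=0.00035, cov3=0.00022, t=2)
hw = ci_halfwidth(0.0006, r=5, cov2=0.00035, cov3=0.00022, t_quantile=2.11)
```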
If the null hypothesis of equal tests is not true, then FOR has an approximate noncentral Ft−1,df2;Δ distribution, where the noncentrality parameter Δ is given by

Δ = r Σi (θi − θ̄)² / [σ²TR + σ²∊ − Cov1 + (r − 1)(Cov2 − Cov3)]  (8)
and θi = μ + τi is the expected accuracy measure for test i. This result is stated by Obuchowski [4] and a detailed proof is provided in Reference [8].
Power estimation for the OR method has been previously described [4, 5]. However, these references use the originally proposed denominator degrees of freedom, ddf0 = (t − 1)(r − 1), which has been shown to give overly conservative inferences [8]. The updated algorithm for computing power differs from the previously published OR power methods in that it uses ddfH; in addition, it estimates σ²TR, and hence the noncentrality parameter Δ, differently.
2.3.1. OR and DBM relationships
As previously noted, the DBM procedure can be viewed as an implementation of the OR procedure if the OR covariance estimates are based on jackknife covariance estimates and the DBM procedure uses normalized pseudovalues. In this case, there is a one-to-one correspondence between the parameters and outputs for the two procedures, and Equations 2 and 7 yield the same value [9]. The relationships between the OR and DBM outputs are given in Table 1. Thus to use the OR power procedure with DBM output values, we only need to transform the DBM values to their corresponding OR values.
Table 1.

OR Output | Equivalent function of DBM mean squares
---|---|
MS(T) | MS(T)Y/c
MS(T*R) | MS(T*R)Y/c
σ̂²∊ | [MS(C)Y + (t − 1)MS(T*C)Y + (r − 1)MS(R*C)Y + (t − 1)(r − 1)MS(T*R*C)Y]/(trc)
Ĉov1 | [MS(C)Y − MS(T*C)Y + (r − 1){MS(R*C)Y − MS(T*R*C)Y}]/(trc)
Ĉov2 | [MS(C)Y + (t − 1)MS(T*C)Y − MS(R*C)Y − (t − 1)MS(T*R*C)Y]/(trc)
Ĉov3 | [MS(C)Y − MS(T*C)Y − MS(R*C)Y + MS(T*R*C)Y]/(trc)
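These conversions can be written as a small helper (a Python sketch; the dictionary keys are our labels). As a consistency check, mean squares built from known DBM variance components via their expected values should recover the Table 2 relations, e.g. Ĉov3 = σ²C/c:

```python
def dbm_to_or(ms_t, ms_tr, ms_c, ms_tc, ms_rc, ms_trc, t, r, c):
    """Convert DBM pseudovalue mean squares to OR quantities
    (normalized pseudovalues and jackknife covariances assumed)."""
    trc = t * r * c
    return {
        "MS(T)": ms_t / c,
        "MS(T*R)": ms_tr / c,
        "var_eps": (ms_c + (t - 1) * ms_tc + (r - 1) * ms_rc
                    + (t - 1) * (r - 1) * ms_trc) / trc,
        "cov1": (ms_c - ms_tc + (r - 1) * (ms_rc - ms_trc)) / trc,
        "cov2": (ms_c + (t - 1) * ms_tc - ms_rc - (t - 1) * ms_trc) / trc,
        "cov3": (ms_c - ms_tc - ms_rc + ms_trc) / trc,
    }

# Build case-related mean squares from known DBM variance components
# (sigma2_C=1, sigma2_TC=2, sigma2_RC=3, sigma2_TRC=4) via their
# expected values, then check the conversions.
t, r, c = 2, 5, 100
ms_trc = 4.0
ms_tc = ms_trc + r * 2.0              # E[MS(T*C)] = s2_TRC + r*s2_TC
ms_rc = ms_trc + t * 3.0              # E[MS(R*C)] = s2_TRC + t*s2_RC
ms_c = ms_trc + t * 3.0 + r * 2.0 + t * r * 1.0
out = dbm_to_or(0.0, 0.0, ms_c, ms_tc, ms_rc, ms_trc, t, r, c)
```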
2.3.2. OR method in terms of correlations and the σ²c, σ²w parameterization
The OR method can also be notationally described with population correlations ρi = Covi/σ²∊ replacing the corresponding Covi, and estimated correlations ri = Ĉovi/σ̂²∊ replacing the corresponding Ĉovi, i = 1, 2, 3. All of the results cited thus far can be equivalently expressed in terms of correlations; thus the choice of which notation to use is not important. An advantage of using correlations is that their interpretation does not depend on sample size. A disadvantage is possible misunderstanding about the definition of the denominator used to compute them. For example, Obuchowski and Rockette [3] write σ²∊ as σ²∊ = σ²c + σ²w, where σ²c denotes variability attributable to cases and σ²w denotes within-reader variability, and then define ρi = Covi/σ²c. This definition is convenient to use when only σ²c can be estimated from the data (as discussed in the next paragraph); in this case, one can think of the error terms as partitioned into two parts: ∊ij = uij + wij, where Var(uij) = σ²c, Var(wij) = σ²w, the wij are mutually independent and are independent of the uij, but the uij are correlated and have the same covariances as the ∊ij. Practically, either definition will give similar correlations since σ²w is typically negligible compared to σ²c.
We note that formulas for the variance of the AUC or other ROC accuracy measure based on assumed parametric models that ignore within-reader inconsistency will give estimates of σ²c rather than of σ²∊; for example, these methods include the AUC variance estimates proposed by Hanley and McNeil [18] and Obuchowski [19]. Basing power estimates on estimates of σ²c obtained from such methods technically necessitates also estimating σ²w, either separately from repeated readings, as previously suggested by Obuchowski [4, 5], or by using a conjectured value of σ²w. In contrast, resampling methods such as bootstrapping and jackknifing, as well as the method proposed by DeLong et al [17], yield estimates of σ²∊; thus there is no need to estimate σ²w separately from repeated readings for power computation when these methods are used to estimate the error variance. However, as previously noted, since σ²w is typically negligible compared to σ²c, using σ̂²c in place of σ̂²∊ in the power computations will make little difference.
2.4. Updated and unified OR/DBM power computation procedure
In this section we present a step-by-step procedure for computing power for either the OR or DBM procedure. The procedure is described for a two-sided test comparing two modalities, based either on data from a pilot or previous study or on conjectured parameter values. We assume that the ratio of normal to abnormal cases in the planned study is approximately the same as in the pilot or previous study. Later we discuss how the procedure can be modified for a one-sided test and for the situation where the pilot and planned study normal-to-abnormal case ratios differ.
The steps of the procedure are the following: (1) specify the effect size; (2) transform OR or DBM outputs into OR parameter estimates, or use conjectured OR parameter values; (3) transform the OR parameter values into OR noncentrality parameter and denominator degrees of freedom values for specified case and reader sample sizes; and (4) compute the power based on the estimated OR noncentrality parameter and denominator degrees of freedom. Below we describe the steps in detail. Theoretical details are provided in Appendix A (available online at www.academicradiology.org).
1. Specify the effect size. Specify the effect size, denoted by d, that the researcher wants to be able to detect with sufficient power. The effect size is the absolute difference of the two population ROC accuracy measures. For example, if the AUC is the outcome of interest, then d = |AUC1 − AUC2|, where AUC1 and AUC2 are the population AUC values for the two tests. For a given number of cases c, the population AUC is the expected AUC for a randomly selected reader reading c randomly selected cases.

2. Transform OR or DBM outputs into OR parameter estimates, or use conjectured OR parameter values. If using outputs from an analysis of pilot data, let c* denote the number of cases for the pilot study. If using conjectured parameters, let c* denote the number of cases corresponding to the conjectured value of σ²∊. Use step 2a or 2b below, depending on whether an OR or DBM analysis of pilot data was performed, or step 2c if conjectured values are inputted.

(a) Using OR outputs. Let MS(T) and MS(T*R) denote the test and test × reader mean squares resulting from the OR pilot-data analysis, and let σ̂²∊, Ĉov1, Ĉov2, and Ĉov3 denote the fixed-reader variance and covariance estimates. (If correlations are available instead of covariances, then compute the covariances using Ĉovi = ri σ̂²∊ or Ĉovi = ri σ̂²c, i = 1, 2, 3, depending on the definition of the correlation as discussed in Section 2.3.2.) Estimate σ²TR using

σ̂²TR = MS(T*R) − σ̂²∊ + Ĉov1 + max(Ĉov2 − Ĉov3, 0).  (9)

If σ̂²TR ≤ 0, then set σ̂²TR equal to zero or to a positive conjectured value for the remaining steps; see Section 2.5.3 for further discussion of this point.

(b) Using DBM outputs. Compute the OR quantities MS(T), MS(T*R), σ̂²∊, Ĉov1, Ĉov2, and Ĉov3 from the DBM mean squares using Table 1, and then proceed with step 2a.

(c) Using conjectured inputs. This step is similar to step 2a, except that σ²TR, σ²∊, Cov1, Cov2, and Cov3 are conjectured OR parameter values rather than estimates from pilot data. When using conjectured inputs, it is typically conceptually easier to think in terms of the correlations and then compute the corresponding covariances. As previously noted, c* should denote the number of cases corresponding to σ²∊, which represents the AUC variance due to cases for a given test and fixed reader. Zhou, Obuchowski, and McClish [6, pp. 298–304] discuss choosing values for conjectured inputs. One could also first start with conjectured DBM parameter values and then transform them to OR parameter values since there is a one-to-one transformation between the parameters; these relationships are provided in Table 2.

3. Compute the noncentrality parameter and denominator degrees of freedom estimates for specified case and reader sample sizes. Let r and c denote the number of readers and cases, respectively, for which we want to compute power. Compute

Δ̂ = (r d²/2) / {σ̂²TR + (c*/c)[σ̂²∊ − Ĉov1 + (r − 1)·max(Ĉov2 − Ĉov3, 0)]}  (10)

and

df̂2 = (r − 1){σ̂²TR + (c*/c)[σ̂²∊ − Ĉov1 + (r − 1)·max(Ĉov2 − Ĉov3, 0)]}² / {σ̂²TR + (c*/c)[σ̂²∊ − Ĉov1 − max(Ĉov2 − Ĉov3, 0)]}².

Here Δ̂ is the estimated noncentrality parameter and df̂2 is the estimated denominator degrees of freedom for the distribution of FOR (Eq. 5). The above formulas were derived for t = 2 tests. It is easy to show that df̂2 has the same value as ddfH (Eq. 7) for c = c*.

4. Compute the power based on the estimated OR noncentrality parameter and denominator degrees of freedom. Let F1,ν;δ denote a random variable having a noncentral F distribution with degrees of freedom 1 and ν and noncentrality parameter δ, and let F1−α;1,ν denote the 100(1 − α)th percentile of a central F distribution with degrees of freedom 1 and ν. The estimated power for a two-sided test with significance level α is given by

Power = Pr(F1,df̂2;Δ̂ > F1−α;1,df̂2),

treating Δ̂ and df̂2 as fixed.
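The four steps can be collected into one function for t = 2 tests (a Python sketch assuming SciPy is available; the function and argument names are ours). The test-by-reader variance component is obtained from pilot quantities and truncated at zero, the noncentrality parameter and degrees of freedom are computed for the planned reader and case sample sizes, and the power comes from the noncentral F distribution:

```python
from scipy.stats import f as f_dist, ncf

def or_power(d, r, c, c_star, ms_tr_pilot, var_eps, cov1, cov2, cov3,
             alpha=0.05):
    """Estimated power of the two-sided OR test of two modalities."""
    pos = max(cov2 - cov3, 0.0)
    # Step 2 (Eq. 9): test-by-reader variance component, truncated at 0
    var_tr = max(ms_tr_pilot - var_eps + cov1 + pos, 0.0)
    # Step 3 (Eq. 10): noncentrality and ddf for the planned r and c
    scale = c_star / c
    num = var_tr + scale * (var_eps - cov1 + (r - 1) * pos)
    den = var_tr + scale * (var_eps - cov1 - pos)
    delta = r * d ** 2 / 2.0 / num
    df2 = (r - 1) * num ** 2 / den ** 2
    # Step 4: noncentral-F tail area beyond the central-F critical value
    f_crit = f_dist.ppf(1 - alpha, 1, df2)
    return 1.0 - ncf.cdf(f_crit, 1, df2, delta)

# Pilot inputs from the example in Section 3 (c* = 114); planned study
# with 8 readers and 240 cases; should give a power of roughly .89
power = or_power(0.05, r=8, c=240, c_star=114, ms_tr_pilot=0.000623,
                 var_eps=0.001394, cov1=0.000352, cov2=0.000347,
                 cov3=0.000221)
```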
Table 2.

OR parameter | Equivalent function of DBM variance components
---|---|
σ²∊ | (σ²C + σ²TC + σ²RC + σ²TRC)/c
Cov1 | (σ²C + σ²RC)/c
Cov2 | (σ²C + σ²TC)/c
Cov3 | σ²C/c
2.5. Other considerations
2.5.1. Accounting for different pilot and planned study normal-to-abnormal case ratios
The preceding power-computation procedure is based on the assumption that the normal-to-abnormal case ratios are similar for the pilot and planned studies. This assumption is important since the fixed-reader covariances and variance depend on the normal-to-abnormal case ratio. For the situation where the researcher expects or wants the planned study to have a normal-to-abnormal case ratio that differs considerably from that of the pilot data, we suggest the following ad hoc approach. From the group (normals or abnormals) that will be proportionately more represented in the planned study than in the pilot study, sample with replacement enough cases to achieve the desired balance between the two groups. Combine these cases with the cases from the other group to create a data set with the desired ratio of normal to abnormal cases. Repeat this process to create several (e.g., 10) data sets having the desired normal-to-abnormal case balance. For each of these data sets compute the fixed-reader covariance matrix and corresponding OR covariances. Use the averages of the OR covariances for the power computation. Note that for the power procedure c* will not be the number of cases in the original pilot study, but rather the number of cases in each of the “new” pilot data sets.
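The resampling step can be sketched in Python as follows (hypothetical case indices; in the Section 3 example the abnormal group is the one upsampled):

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed, purely for reproducibility

def rebalanced_sets(larger_group_ids, smaller_group_ids, n_sets=10):
    """Create n_sets case-index sets with a 1:1 group ratio by sampling
    the smaller group with replacement up to the larger group's size."""
    target = len(larger_group_ids)
    sets = []
    for _ in range(n_sets):
        resampled = rng.choice(np.asarray(smaller_group_ids), size=target,
                               replace=True)
        sets.append((np.asarray(larger_group_ids), resampled))
    return sets

# 69 normal (ids 0-68) and 45 abnormal (ids 69-113) cases, as in the example
sets = rebalanced_sets(np.arange(69), np.arange(69, 114), n_sets=10)
```

Each generated data set would then be analyzed with readers fixed and cases random to obtain the fixed-reader covariances, and the covariances averaged over the 10 sets.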
For the power procedure we can use the estimate of σ²TR obtained from the original pilot data before doing any resampling. To understand why this is appropriate, define

ηij = μ + τi + Rj + (TR)ij.  (11)

From Equation 3 it follows that for given test i and fixed reader j, ηij = E(θ̂ij | Rj, (TR)ij); this is the expected or mean AUC across the population of cases. Thus ηij is the latent or true AUC for test i and reader j, which can be loosely interpreted as the AUC that would result if reader j read a very large number of cases. It follows that σ²TR can be interpreted as the interaction variance component for the ηij, and hence the value of this parameter does not depend on the ratio or numbers of normals and abnormals in the sample. We note that alternatively estimating σ²TR using Equation 9 from each of the 10 generated data sets would not be valid, since Equation 9 assumes that both readers and cases are random units but our generated data sets only treated cases as random.
2.5.2. Comparison with earlier results
As previously noted, the proposed power procedure updates previous DBM and OR power procedures, as described in References [4–7], by incorporating the new degrees of freedom ddfH suggested by Hillis [8]. In addition, our estimate of σ²TR (Eq. 9) for the OR method updates the estimate previously proposed in References [4–6]; this previous estimate was a function of the sample variances of the AUCs across readers within each test and the between-test sample correlation of the AUCs. In Appendix B (available online at www.academicradiology.org) we show that this previous estimate actually estimates σ²TR + b, with b ≥ 0; thus it is likely that the previous estimator tended to overestimate σ²TR.
Although a previously available SAS macro [20] for computing power based on DBM outputs had taken into account ddfH, we note that the power algorithm presented in this paper gives somewhat different results when the DBM variance component estimates for the test × reader and test × case interactions are both zero, due to the way that the covariance constraints are incorporated.
2.5.3. What to do if σ̂²TR ≤ 0
It has been our experience that often the pilot estimate of the test × reader interaction variance component σ²TR is less than zero, as it is in the Example in the next section. Since the typical radiological imaging study has only a few readers, we expect the precision of the estimate σ̂²TR to be low, and hence it is not surprising that estimates of σ²TR will often not be positive, especially if the true value of σ²TR is close to zero. In such situations one choice is to set the variance component equal to zero in step 2. However, it seems reasonable that in most studies there should be some interaction, suggesting that when the estimate is not positive we may want to conservatively use a positive value for this variance component instead of zero. One way to decide on a reasonable positive value is to consider estimates for σ²TR from similar studies, keeping in mind, however, that estimates computed as proposed in References [4–6] tend to overestimate σ²TR, as previously discussed.
Alternatively, we can specify σ²TR by considering its interpretation in terms of the latent AUCs, the ηij, as defined by Equation 11. Specifically, for fixed tests i and i′, i ≠ i′, and readers j ≠ j′, it is easy to show that

Var[ηij − ηi′j − (ηij′ − ηi′j′)] = 4σ²TR.

For example, if reader 1 has latent AUC values of .95 and .90 for tests 1 and 2, respectively, and reader 2 has corresponding latent AUC values of .93 and .91, then η11 − η21 − (η12 − η22) = (.95 − .90) − (.93 − .91) = .05 − .02 = .03. The quantity ηij − ηi′j − (ηij′ − ηi′j′) can be interpreted as the difference of the two intra-reader latent AUC differences for randomly selected readers j and j′. Thus σ²TR is equal to one-fourth of the variance of the difference of the intra-reader latent AUC differences for two randomly chosen readers.
Suppose it seems reasonable that for a randomly selected pair of readers the absolute difference of their intra-reader latent AUC differences will be bounded by a specified value l (e.g., l = .06) with probability ≥ .95; i.e., Pr(|ηij − ηi′j − (ηij′ − ηi′j′)| ≤ l) ≥ .95. Then since the probability is .95 that a normal random variable is within 1.96 standard deviations of its mean, we have

1.96 √(4σ²TR) ≤ l;

i.e.,

σ²TR ≤ (l/3.92)².

For l = .06 we have σ²TR = (.06/3.92)² = .00023, or equivalently σTR = .0153. Thus it would be reasonable to set σ̂²TR = .00023 in step 2 if l = .06 seems like a reasonable 95% bound. Table 3 presents values of σ²TR corresponding to various values of l.
Table 3.
l | σ²TR
---|---|
0.01 | 0.00001 |
0.02 | 0.00003 |
0.03 | 0.00006 |
0.04 | 0.00010 |
0.05 | 0.00016 |
0.06 | 0.00023 |
0.07 | 0.00032 |
0.08 | 0.00042 |
0.09 | 0.00053 |
0.1 | 0.00065 |
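The Table 3 entries follow directly from the bound: setting 1.96 · 2σTR = l gives σ²TR = (l/3.92)². A one-line Python check reproduces the table:

```python
# sigma2_TR implied by a 95% bound l on the absolute difference of two
# intra-reader latent AUC differences: sigma2_TR = (l / 3.92) ** 2
sigma2_tr = {l: round((l / 3.92) ** 2, 5)
             for l in (0.01, 0.02, 0.03, 0.04, 0.05,
                       0.06, 0.07, 0.08, 0.09, 0.10)}
```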
2.5.4. One-sided tests
To compute power for a one-sided test, the only change that needs to be made in the power procedure is to set the significance level to twice the nominal level for the planned study. Although this approach noticeably overestimates power for very small effect sizes, the overestimate will be negligible for a clinically relevant effect size.
3. Results
Throughout this section we assume a .05 significance level.
3.1. Example: Spin echo versus cine MRI for detection of aortic dissection
Our example study [21] compares the relative performance of single spin-echo magnetic resonance imaging (MRI) to cinematic presentation of MRI for the detection of thoracic aortic dissection. There were 45 patients with an aortic dissection and 69 patients without a dissection imaged with both spin-echo and cinematic MRI. Five radiologists independently interpreted all of the images using a five-point ordinal scale: 1 = definitely no aortic dissection, 2 = probably no aortic dissection, 3 = unsure about aortic dissection, 4 = probably aortic dissection, and 5 = definitely aortic dissection.
Suppose that the researcher would like to know what combinations of reader and case sample sizes for a similar study will have at least .80 power to detect an absolute difference of .05 between the modality AUCs. We first show how to determine the power for 8 readers and 240 cases, based on an OR and DBM analysis of the data. Then we present the smallest case sample size for each of several reader sample sizes that yields .80 power.
Situation 1: Similar normal-to-abnormal ratios. The OR analysis of the data is presented in Table 4. Part (a) presents the AUCs corresponding to ROC curves estimated by the PROPROC procedure [22, 23]; part (b) the ANOVA table; part (c) the jackknife covariance matrix for the AUCs, treating readers as fixed; part (d) the variance and covariance estimates based on the covariance matrix in part (c); part (e) the correlations, computed using ri = Ĉovi/σ̂²∊; part (f) the OR F statistic; part (g) ddfH; part (h) the p-value; and part (i) the confidence interval. From part (h) the p-value for testing the hypothesis of equal modalities is .092, and from part (i) a 95% confidence interval for the difference of the population AUCs (spin-echo − cinematic) is given by (−0.0073, 0.0921). Thus there is not sufficient evidence that the modalities differ (p = .092).
Table 4.
(a) PROPROC AUC estimates for cine and spin-echo MRI

Reader | Cine | Spin-echo
---|---|---|
1 | .934 | .952 |
2 | .891 | .926 |
3 | .908 | .930 |
4 | .977 | 1.000 |
5 | .841 | .943 |
Mean: | .910 | .950 |
(b) ANOVA table based on PROPROC AUCs

Source | df | Mean square
---|---|---|
T | 1 | 0.004003382 |
R | 4 | 0.002834705 |
T*R | 4 | 0.000622731 |
(c) Jackknife covariance matrix corresponding to PROPROC AUC estimates. C1–C5 = readers 1–5, cine; S1–S5 = readers 1–5, spin echo. Values have been multiplied by 10⁴.

 | C1 | C2 | C3 | C4 | C5 | S1 | S2 | S3 | S4 | S5
---|---|---|---|---|---|---|---|---|---|---|
C1 | 9.54 | |||||||||
C2 | 7.47 | 20.35 | ||||||||
C3 | 8.73 | 6.64 | 61.78 | |||||||
C4 | 2.24 | 2.65 | 1.70 | 1.48 | ||||||
C5 | 5.48 | 12.26 | 3.11 | 2.00 | 18.07 | |||||
S1 | 3.93 | 4.26 | 3.67 | 0.37 | 2.62 | 5.19 | ||||
S2 | 3.28 | 5.50 | 3.26 | 1.07 | 4.70 | 2.46 | 4.94 | |||
S3 | 4.74 | 5.59 | 5.53 | 1.23 | 4.40 | 5.03 | 3.95 | 8.03 | ||
S4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
S5 | 0.85 | 0.35 | 3.82 | 0.07 | 2.63 | 0.40 | 2.98 | 2.19 | 0.00 | 10.00 |
(d) Covariance estimates
σ̂²∊ = 0.001394; Ĉov1 = 0.000352; Ĉov2 = 0.000347; Ĉov3 = 0.000221
(e) Correlation estimates (ri = Ĉovi/σ̂²∊)
r1 = 0.25247; r2 = 0.24863; r3 = 0.15890
(f) FOR = MS(T)/[MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)] = 3.21
(g) ddfH = 16.065
(h) p-value = Pr(F1,16.065 > 3.21) = .092
(i) 95% CI for the spin-echo − cine AUC difference: (−0.0073, 0.0921)
Treating this study as a pilot study, the power computation steps are as follows:
1. Specify the effect size. The effect size of interest is d = .05.

2. Transform outputs into OR parameter estimates. For the pilot data c* = 114. From Table 4 we have MS(T) = 0.004003, MS(T*R) = 0.000623, σ̂²∊ = 0.001394, Ĉov1 = 0.000352, Ĉov2 = 0.000347, and Ĉov3 = 0.000221. Substituting these values into Equation 9 yields σ̂²TR = −0.000294. Since σ̂²TR < 0, we have two choices: either set σ̂²TR equal to zero or to a conjectured positive value for the remaining steps. In our computations below we set it to zero.

3. Compute the noncentrality parameter and denominator degrees of freedom estimates. We want to compute the power for a study with r = 8 readers and c = 240 cases. Substituting σ̂²TR = 0 and the estimates from step 2 into Equation 10, we compute

Δ̂ = (8 × .05²/2) / {(114/240)[0.001394 − 0.000352 + 7 × 0.000126]} = 10.94

and

df̂2 = 7 × {(114/240)(0.001394 − 0.000352 + 7 × 0.000126)}² / {(114/240)(0.001394 − 0.000352 − 0.000126)}² = 30.9.

4. Compute the power. The estimated power for r = 8, c = 240, α = .05 is given by

Power = Pr(F1,30.9;10.94 > F.95;1,30.9) = .89.

Recall that this estimate was computed assuming no test × reader interaction, since we have set σ̂²TR = 0. A more conservative approach would be, for example, to set σ̂²TR = .0001, corresponding to the belief that a 95% upper bound on the absolute difference of two intra-reader latent AUC differences is given by l = .04. Using this approach, the power is .86.

Typically the researcher will want to consider different combinations of readers and cases that result in the desired power and then choose the most suitable combination. Reader-case sample size combinations that result in approximately .80 power are presented in the left-hand side of Table 5, using both σ̂²TR = 0 and σ̂²TR = .0001. For example, a few of the reader-case combinations that yield .80 power with σ̂²TR = 0 are 5 readers and 266 cases, 8 readers and 183 cases, or 15 readers and 136 cases. We see that the increase in the number of cases needed based on σ̂²TR = .0001 is most noticeable for r ≤ 5.
Table 5. Reader and case sample size combinations giving approximately .80 power, for the pilot-study normal/abnormal ratio (69/45 = 1.53; first four columns of results) and for a normal/abnormal ratio of 1 (last four columns). Within each ratio, the first Cases/Power pair uses σ̂²TR = 0 and the second uses σ̂²TR = .0001.

Readers | Cases | Power | Cases | Power | Cases | Power | Cases | Power
---|---|---|---|---|---|---|---|---|
3 | 559 | 0.800 | 1898 | 0.800 | 374 | 0.800 | 1282 | 0.800 |
4 | 343 | 0.800 | 491 | 0.800 | 229 | 0.800 | 328 | 0.800 |
5 | 266 | 0.801 | 330 | 0.801 | 177 | 0.801 | 220 | 0.801 |
6 | 225 | 0.800 | 263 | 0.800 | 150 | 0.801 | 176 | 0.802 |
7 | 200 | 0.800 | 227 | 0.801 | 133 | 0.800 | 151 | 0.801 |
8 | 183 | 0.800 | 203 | 0.801 | 122 | 0.801 | 135 | 0.801 |
9 | 171 | 0.801 | 187 | 0.802 | 114 | 0.802 | 124 | 0.801 |
10 | 162 | 0.802 | 174 | 0.800 | 108 | 0.803 | 116 | 0.802 |
11 | 154 | 0.800 | 165 | 0.801 | 103 | 0.803 | 110 | 0.803 |
12 | 148 | 0.800 | 158 | 0.802 | 99 | 0.803 | 105 | 0.803 |
13 | 143 | 0.800 | 151 | 0.800 | 95 | 0.801 | 101 | 0.803 |
14 | 139 | 0.801 | 146 | 0.800 | 93 | 0.804 | 97 | 0.801 |
15 | 136 | 0.802 | 142 | 0.801 | 90 | 0.801 | 94 | 0.800 |
Appendix C (available online at www.academicradiology.org) includes the SAS [24] statements used to compute the power for this example with r = 8 and c = 240, as well as the statements used to create the left-hand side of Table 5. To produce the Table 5 output, the program loops through various combinations of reader and case sample sizes and, for each reader sample size, outputs the smallest number of cases for which the power is at least .80. These statements can easily be modified for another programming language.
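The looped search can likewise be sketched in Python. The helper names below are hypothetical; the default inputs are the Van Dyke pilot estimates with σ̂²_TR = 0:

```python
from scipy.stats import f, ncf

def power_or(r, c, delta=0.05, alpha=0.05, var_tr=0.0, c_star=114,
             var_error=0.001393652, cov1=0.000351859,
             cov2=0.000346505, cov3=0.000221453):
    """OR power for r readers and c cases, given pilot estimates from c_star cases."""
    denom = var_tr + (c_star / c) * (var_error - cov1 + max((r - 1) * (cov2 - cov3), 0))
    nc = r * 0.5 * delta**2 / denom
    df2 = denom**2 / ((var_tr + (c_star / c) * (var_error - cov1 - max(cov2 - cov3, 0)))**2 / (r - 1))
    return 1 - ncf.cdf(f.ppf(1 - alpha, 1, df2), 1, df2, nc)

def min_cases(r, target=0.80, c_range=range(20, 2001)):
    """Smallest case count whose power meets the target, as in the SAS loop."""
    for c in c_range:
        if power_or(r, c) >= target:
            return c
    return None
```

For example, `min_cases(5)` returns 266 and `min_cases(8)` returns 183, matching the left-hand side of Table 5.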
Situation 2: Different normal-to-abnormal ratios. Suppose that the researcher wants to use equal numbers of normal and abnormal images in the planned study in order to increase power. Since there are 45 abnormal and 69 normal cases, we randomly sample with replacement 69 abnormal cases from the original 45. We repeat this process 10 times, combining each generated sample with the 69 normal cases to produce 10 data sets, each containing 69 abnormal and 69 normal cases. Note that the normal cases are the same for each data set, in contrast to the 69 abnormal cases which vary from set to set and which do not necessarily contain all of the original 45 abnormal cases.
For each of these ten data sets we compute the jackknife covariance matrix and then compute σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3. These estimates are shown in Table 6 along with the corresponding means. We use the estimate σ̂²_TR = 0 based on the original pilot data before doing any resampling, as well as the more conservative conjectured estimate σ̂²_TR = .0001. Using σ̂²_TR = 0 and the means from Table 6 as inputs in our power program, with c* = 69 + 69 = 138, we find for r = 8 and c = 240 that the power has now increased from .89 to .98, showing the advantage of using a normal-to-abnormal ratio equal to 1. The right-hand side of Table 5 shows combinations of reader and case sample sizes that yield .80 power for an equal balance of normal and abnormal cases.
Table 6.
| Sample | σ̂²_ε | Ĉov1 | Ĉov2 | Ĉov3 |
---|---|---|---|---|
1 | 0.000512 | 0.000204 | 0.000181 | 0.000125 |
2 | 0.000416 | 0.000019 | 0.000112 | 0.000073 |
3 | 0.001173 | 0.000118 | 0.000169 | 0.000138 |
4 | 0.001121 | 0.000078 | 0.000129 | 0.000074 |
5 | 0.000545 | 0.000153 | 0.000242 | 0.000106 |
6 | 0.000629 | 0.000316 | 0.000224 | 0.000189 |
7 | 0.000634 | 0.000155 | 0.000225 | 0.000107 |
8 | 0.001117 | 0.000135 | 0.000204 | 0.000116 |
9 | 0.000608 | 0.000176 | 0.000222 | 0.000145 |
10 | 0.000470 | 0.000130 | 0.000136 | 0.000089 |
| ||||
mean: | 0.000723 | 0.000148 | 0.000184 | 0.000116 |
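The resampling scheme that produced the ten data sets can be sketched as follows. This is an illustrative Python version; the case identifiers are hypothetical stand-ins for the actual images, and the seed is an arbitrary choice for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)         # fixed seed for reproducibility (an assumption)

normal_ids = np.arange(69)             # the 69 normal cases, reused in every data set
abnormal_ids = np.arange(69, 69 + 45)  # the original 45 abnormal cases

datasets = []
for _ in range(10):
    # draw 69 abnormal cases with replacement from the original 45
    boot_abnormal = rng.choice(abnormal_ids, size=69, replace=True)
    datasets.append(np.concatenate([normal_ids, boot_abnormal]))
```

Each resulting data set has 69 normal and 69 abnormal cases; because the abnormal cases are drawn with replacement, a given draw need not include every one of the original 45 abnormal cases.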
3.1.1. Power computation based on DBM analysis
The DBM analysis of the data based on the PROPROC AUC estimates is presented in Table 7. Part (a) presents the DBM ANOVA table, part (b) the F statistic, part (c) ddfH, and part (d) the p-value. Note that the F statistic, ddfH, and the p-value are the same as for the OR analysis in Table 4; this will always be the case when the OR analysis uses jackknife covariance estimates, as previously discussed. Using the Table 1 relationships, we compute the corresponding OR quantities MS(T), MS(T*R), σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 for step 2a from the DBM mean squares. Otherwise the steps are identical. Appendix D (available online at www.academicradiology.org) includes SAS statements that convert the DBM mean squares to the corresponding OR quantities for this example, based on the Table 1 relationships. The SAS output included in Appendix D shows that the resulting OR quantities are the same as those obtained from the OR analysis; thus power results are identical regardless of whether we use the OR or DBM analysis outputs. This will always be the case when the OR analysis uses jackknife covariance estimates and the DBM analysis uses normalized pseudovalues.
Table 7.
(a) ANOVA table based on normalized jackknife AUC pseudovalues | |||
---|---|---|---|
Source | df | SS | MS |
T | 1 | 0.45638557 | 0.45638557 |
R | 4 | 1.29262569 | 0.32315642 |
T*R | 4 | 0.28396550 | 0.07099138 |
C | 113 | 51.75139760 | 0.45797697 |
T*C | 113 | 19.86406163 | 0.17578816 |
R*C | 452 | 60.67694615 | 0.13424103 |
T*R*C | 452 | 47.23783039 | 0.10450847 |
(b) F = MS(T)/[MS(T*R) + MS(T*C) − MS(T*R*C)] = 0.45638557/0.14227108 = 3.21
(c) ddfH = [MS(T*R) + MS(T*C) − MS(T*R*C)]²/{MS(T*R)²/[(t − 1)(r − 1)]} = 16.065
(d) p-value = Pr(F1,16.065 ≥ 3.21) = .092
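The conversion from DBM mean squares to OR quantities can be sketched in Python; this is an illustrative translation of the Appendix D SAS program, using the Table 7 mean squares:

```python
# DBM mean squares for the example (t = 2 tests, r = 5 readers, c = 114 cases)
t, r, c = 2, 5, 114
msr, mstr = 0.32315642, 0.07099138       # R and T*R mean squares
msc, mstc = 0.45797697, 0.17578816       # C and T*C mean squares
msrc, mstrc = 0.13424103, 0.10450847     # R*C and T*R*C mean squares

# Corresponding OR mean squares and fixed-reader covariance estimates
msr_or = msr / c
mstr_or = mstr / c
var_error = (msc + (t-1)*mstc + (r-1)*msrc + (t-1)*(r-1)*mstrc) / (t*r*c)
cov1 = (msc - mstc + (r-1)*(msrc - mstrc)) / (t*r*c)
cov2 = (msc - msrc + (t-1)*(mstc - mstrc)) / (t*r*c)
cov3 = (msc - mstc - msrc + mstrc) / (t*r*c)
```

The computed values reproduce the OR quantities shown in the Appendix D output (e.g., mstr_or = .000622731, var_error = .001393652).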
3.2. Simulation study
In a simulation study we examine the performance of the proposed power procedure. We use the simulation model of Roe and Metz [25], which provides continuous decision-variable outcomes generated from a binormal model that treats both case and reader as random factors. We use their “HH” model, for which the decision-variable values have relatively high within-reader correlations and reader variability (both pure reader and test × reader interaction variance components). We specify the separation between the normal and abnormal case populations such that for one test the median AUC across readers is .855 and for the other test it is .92, resulting in a nominal effect size of .065. Using this model, we simulate 4000 samples for each of nine combinations of three reader-sample sizes (readers = 3, 5, and 10) and three case-sample sizes (cases = 50, 100, and 200) with equal numbers of normal and abnormal cases. Within each simulation, all Monte Carlo readers read the same cases for each of the two tests. For these simulations we set σ̂²_TR = 0 if it is negative.
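A simplified version of this kind of simulation can be sketched in Python. The sketch below is not the Roe-Metz HH model itself: the variance components and mean separations are illustrative assumptions, and the abnormal-case separation is modeled as a pure mean shift. It only conveys the structure of an MRMC simulation in which every reader rates the same cases under both tests:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_auc(x_normal, x_abnormal):
    """Mann-Whitney form of the empirical AUC."""
    diff = x_abnormal[:, None] - x_normal[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def simulate_sample(n_readers=5, n_cases=100, mu=(1.5, 1.75),
                    var_r=0.01, var_tr=0.01, var_c=0.3,
                    var_tc=0.3, var_rc=0.2, var_err=0.2):
    """One MRMC sample; all variance components here are illustrative assumptions."""
    half = n_cases // 2
    abnormal = np.zeros(n_cases); abnormal[half:] = 1.0   # second half abnormal
    R = rng.normal(0, np.sqrt(var_r), n_readers)          # reader effects
    TR = rng.normal(0, np.sqrt(var_tr), (2, n_readers))   # test x reader
    C = rng.normal(0, np.sqrt(var_c), n_cases)            # case effects
    TC = rng.normal(0, np.sqrt(var_tc), (2, n_cases))     # test x case
    RC = rng.normal(0, np.sqrt(var_rc), (n_readers, n_cases))
    aucs = np.empty((2, n_readers))
    for t in range(2):
        for j in range(n_readers):
            eps = rng.normal(0, np.sqrt(var_err), n_cases)
            x = mu[t] * abnormal + R[j] + TR[t, j] + C + TC[t] + RC[j] + eps
            aucs[t, j] = empirical_auc(x[:half], x[half:])
    return aucs

aucs = simulate_sample()   # aucs[t, j]: reader j's empirical AUC under test t
```

With these assumed components the within-reader AUCs fall roughly in the .85-.90 range, in the spirit of the HH configuration, but the published HH variance components should be used to reproduce the paper's results.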
For each sample we perform an OR analysis using the empirical AUC as the accuracy estimate. The mean values of the parameter estimates and AUC differences are displayed in Table 8. The "Power" column indicates the proportion of samples where the null hypothesis of equal tests was rejected at alpha = .05. We make the following observations: (1) The mean σ̂²_TR values are very similar (range: 1.20–1.33) regardless of the number of readers or cases; this is expected, since σ²_TR can be interpreted as the test × reader interaction variance component for the latent AUCs, as discussed in Section 2.5.1. (2) The correlations are also very similar (e.g., range of r1: .36–.38) across combinations, as expected. (3) The covariances and error variance decrease as the number of cases increases, but are similar for similar case sample sizes regardless of the reader sample size. (4) The mean AUC difference is .066, except for one combination; note that this differs from the .065 median AUC difference for the decision variable. (5) The fact that r2 is roughly a third larger than r1 should not be taken as evidence that the constraint given by Equation 4 is not realistic, but rather that the simulation model does not properly reflect the typical clinical situation.
Table 8.
Mean parameter estimates (variance and covariance entries multiplied by 10³) |

| r | c | AUC2−AUC1 | Power | σ̂²_TR | σ̂²_ε | Cov1 | Cov2 | Cov3 | r1 | r2 | r3 |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 50 | 0.066 | 0.189 | 1.330 | 2.350 | 0.873 | 1.116 | 0.475 | 0.358 | 0.461 | 0.189 |
3 | 100 | 0.066 | 0.276 | 1.282 | 1.123 | 0.427 | 0.549 | 0.234 | 0.372 | 0.480 | 0.200 |
3 | 200 | 0.066 | 0.343 | 1.294 | 0.547 | 0.210 | 0.270 | 0.115 | 0.378 | 0.488 | 0.204 |
5 | 50 | 0.065 | 0.256 | 1.251 | 2.335 | 0.861 | 1.106 | 0.469 | 0.357 | 0.461 | 0.190 |
5 | 100 | 0.066 | 0.407 | 1.257 | 1.126 | 0.426 | 0.551 | 0.233 | 0.372 | 0.482 | 0.200 |
5 | 200 | 0.066 | 0.525 | 1.250 | 0.549 | 0.211 | 0.271 | 0.115 | 0.381 | 0.490 | 0.206 |
10 | 50 | 0.066 | 0.355 | 1.196 | 2.349 | 0.867 | 1.112 | 0.470 | 0.360 | 0.462 | 0.191 |
10 | 100 | 0.066 | 0.571 | 1.260 | 1.119 | 0.423 | 0.543 | 0.229 | 0.373 | 0.479 | 0.200 |
10 | 200 | 0.066 | 0.781 | 1.257 | 0.550 | 0.272 | 0.382 | 0.115 | 0.382 | 0.491 | 0.207 |
We now investigate how well the sample data predict power for a planned study with 10 readers and 200 cases for an effect size of .066. From the last line in Table 8 we estimate the true power to be approximately 0.781, based on 4000 simulated data sets. For the power procedure to be valid, it should give power estimates close to the true power when reliable estimates are available. For each combination we compute the power using the parameter estimates from Table 8. The results are displayed in the "Reliable estimates" column in Table 9. We see that, with the exception of the first combination (3 readers, 50 cases), the power estimated from the reliable estimates is within .039 of the actual power and the mean of these estimates is .744, thus validating the power procedure. Note that this estimate of power is performed only once for each combination using the reliable parameter estimates. The means for the sample power estimates (computed for each sample based on the sample parameter estimates) across the 4000 samples are presented in the "Sample estimates" column; these are closer, within .021 of the actual power, and have an overall mean of .767. The 25th and 75th percentiles and their differences for the sample power estimate distributions are presented in the last three columns. We see, for example, that the middle 50% of the sample power estimates has, on average, a range of .252.
Table 9.
Power estimated from reliable parameter estimates (column 4) and from sample parameter estimates (column 5; mean over the 4000 samples).

| r | c | Actual power | Reliable estimates | Sample estimates | P25 | P75 | P75−P25 |
---|---|---|---|---|---|---|---|
3 | 50 | 0.781 | 0.729 | 0.773 | 0.639 | 0.955 | .316 |
3 | 100 | 0.781 | 0.742 | 0.776 | 0.658 | 0.935 | .277 |
3 | 200 | 0.781 | 0.745 | 0.772 | 0.653 | 0.918 | .265 |
5 | 50 | 0.781 | 0.742 | 0.768 | 0.635 | 0.930 | .295 |
5 | 100 | 0.781 | 0.744 | 0.766 | 0.655 | 0.906 | .251 |
5 | 200 | 0.781 | 0.751 | 0.767 | 0.666 | 0.892 | .226 |
10 | 50 | 0.781 | 0.749 | 0.765 | 0.649 | 0.903 | .254 |
10 | 100 | 0.781 | 0.747 | 0.760 | 0.662 | 0.868 | .206 |
10 | 200 | 0.781 | 0.749 | 0.760 | 0.675 | 0.856 | .181 |
mean: | | | 0.744 | 0.767 | | | .252 |
4. Discussion
We have provided a step-by-step procedure for estimating power for planned multireader ROC studies that will be analyzed using either the DBM or OR methods. This procedure updates previous approaches by using the currently recommended denominator degrees of freedom, accounting for different pilot- and planned-study normal-to-abnormal case ratios, and using a new method for computing the OR test-by-reader variance component.
This procedure, as is true for most power procedures, was derived with the parameter values treated as known. A small simulation study validated the method by showing that power estimates were quite close to the actual power when computed from reliable parameter estimates. In addition, the means of sample power estimates – those based on sample-specific parameter estimates – were even closer to the actual power. However, we emphasize that this was a small simulation study based on only one latent decision-variable model, and that more extensive simulation studies are needed to more fully validate the procedure.
For any power procedure, variability in power estimates increases as the parameter estimates become less precise. Thus it is to be expected that there will be much variability in sample power estimates based on outputs from the typical pilot study that has only a few readers, because the test × reader variance component estimate lacks precision. In our simulation study the middle 50% of the sample power estimates had, on average, a range of .252, which is wider than we would like. A recent simulation investigation [26] of an earlier version of the DBM power method has also noted large variability in sample power estimates. However, variability is probably much less than indicated by simulations when the same readers are used in both the pilot and future study, as is often the case. Nevertheless, the variability issue warrants further investigation. For example, one possible way to reduce the variability would be to use a conjectured value for the test × reader variance component when feasible.
Finally, we note that the pilot study should be comparable to the planned study with respect to modalities, reader expertise, and selection of cases in order that the parameter estimates obtained will accurately estimate those of the planned study.
5. Acknowledgment
We thank Carolyn Van Dyke, MD for sharing her data set for the example. We thank the reviewer for helpful suggestions that clarified the presentation.
Appendix A: Power derivation for the OR procedure
As previously noted, the OR procedure test statistic (Eq. 5) has an approximate noncentral F distribution, F_{t−1, df2; Δ}, with df2 and noncentrality parameter Δ given by Equations 6 and 8, respectively. For t = 2 tests it follows that

Δ = r(δ²/2)/[σ²_TR + σ²_ε − Cov1 + (r − 1)(Cov2 − Cov3)],   (A.1)

where δ denotes the difference between the two test AUCs, and

df2 = [MS(T*R) + max(r(Cov2 − Cov3), 0)]²/[MS(T*R)²/(r − 1)].   (A.2)

It is shown in Reference [8] that

E[MS(T*R)] = σ²_TR + σ²_ε − Cov1 − (Cov2 − Cov3).   (A.3)

It follows from Equations A.2–A.3 that, when Cov2 ≥ Cov3,

E[MS(T*R) + max(r(Cov2 − Cov3), 0)] = σ²_TR + σ²_ε − Cov1 + (r − 1)(Cov2 − Cov3)   (A.4)

and

df2 ≈ [σ²_TR + σ²_ε − Cov1 + (r − 1)(Cov2 − Cov3)]²/{[σ²_TR + σ²_ε − Cov1 − (Cov2 − Cov3)]²/(r − 1)}.   (A.5)
Let r* and c* denote reader and case pilot-study sample sizes from which covariance parameter estimates are obtained and r and c the corresponding sample sizes for which we want to compute power. Based on Equations A.1, A.4 and A.5 we use the following estimates that incorporate the constraint Cov2 ≥ Cov3:

Δ̂ = r(δ²/2)/{σ̂²_TR + (c*/c)[σ̂²_ε − Ĉov1 + max((r − 1)(Ĉov2 − Ĉov3), 0)]}

and

d̂f2 = {σ̂²_TR + (c*/c)[σ̂²_ε − Ĉov1 + max((r − 1)(Ĉov2 − Ĉov3), 0)]}²/({σ̂²_TR + (c*/c)[σ̂²_ε − Ĉov1 − max(Ĉov2 − Ĉov3, 0)]}²/(r − 1)).

In deriving these estimates we make the reasonable assumption that σ²_ε, Cov1, Cov2, and Cov3 are inversely proportional to the number of cases for a specified normal-to-abnormal case ratio. Note that if Ĉov2 ≤ Ĉov3, then d̂f2 = r − 1, with r − 1 being the lower bound on the denominator degrees of freedom.
The power is then estimated by

Power = P(F_{1, d̂f2; Δ̂} ≥ F_{α; 1, d̂f2})

for a two-sided test with significance level α, treating Δ̂ and d̂f2 as constants.
Appendix B: Determination of the parameter estimated by the previously used estimate of σ²_TR for the OR procedure
We assume that the pilot data have two tests (t = 2). The estimate for σ²_TR proposed in References [4, 6] is given by

σ̂²*_TR = (1 − r̂b)(S₁² + S₂²)/2   (B.1)

where S₁² and S₂² are the sample variances of the AUCs θ̂1j and θ̂2j across readers for tests 1 and 2, respectively, and r̂b is the within-reader correlation coefficient for the paired data (θ̂1j, θ̂2j), j = 1,…,r; i.e.,

r̂b = S₁₂/(S₁S₂)

and

S₁₂ = (r − 1)⁻¹ Σj (θ̂1j − θ̂1·)(θ̂2j − θ̂2·).   (B.2)
From the OR model (Eqn. 3) it follows that E(S₁²) = E(S₂²) = σ²_R + σ²_TR + σ²_ε − Cov2 and E(S₁₂) = σ²_R + Cov1 − Cov3; it follows that

E(r̂b) ≈ E(S₁₂)/E(S₁²);

i.e.,

E(r̂b) ≈ (σ²_R + Cov1 − Cov3)/(σ²_R + σ²_TR + σ²_ε − Cov2).   (B.3)
Furthermore, we show at the end of this section that

S₁² + S₂² = MS(R) + MS(T*R)   (B.4)

where MS(R) and MS(T*R) are the reader and test × reader mean squares resulting from fitting the OR model to the pilot data.
Expectations for the OR mean squares are given by Hillis [8, p. 600]. From these it follows that

E[MS(R)] = 2σ²_R + σ²_TR + σ²_ε + Cov1 − Cov2 − Cov3 and E[MS(T*R)] = σ²_TR + σ²_ε − Cov1 − Cov2 + Cov3.   (B.5)

From Equations B.4–B.5 it follows that

E(S₁² + S₂²) = E[MS(R)] + E[MS(T*R)] = 2(σ²_R + σ²_TR + σ²_ε − Cov2).   (B.6)
Replacing estimates by their expected values in Equation B.1 using Equations B.2, B.3, and B.6 shows that σ̂²*_TR estimates the following parameter:

σ²_TR + σ²_ε − Cov1 − Cov2 + Cov3.

The relationship var(ε11 − ε12 − ε21 + ε22) ≥ 0 implies that σ²_ε − Cov1 − Cov2 + Cov3 ≥ 0, and hence that this parameter is greater than or equal to σ²_TR.
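The inequality follows because var(ε11 − ε12 − ε21 + ε22) = 4(σ²_ε − Cov1 − Cov2 + Cov3) under the OR error covariance structure. A quick numerical check of this identity, using illustrative covariance values (not estimates from the example data):

```python
import numpy as np

# Covariance matrix of (eps11, eps12, eps21, eps22), where the first index is
# test and the second is reader: same reader/different test -> Cov1,
# different reader/same test -> Cov2, different reader/different test -> Cov3.
var_e, cov1, cov2, cov3 = 1.0, 0.45, 0.35, 0.25   # illustrative values
S = np.array([[var_e, cov2,  cov1,  cov3],
              [cov2,  var_e, cov3,  cov1],
              [cov1,  cov3,  var_e, cov2],
              [cov3,  cov1,  cov2,  var_e]])
a = np.array([1, -1, -1, 1])   # contrast eps11 - eps12 - eps21 + eps22
v = a @ S @ a                  # variance of the contrast
assert np.isclose(v, 4 * (var_e - cov1 - cov2 + cov3))
```

Because the contrast variance is nonnegative for any valid covariance matrix, σ²_ε − Cov1 − Cov2 + Cov3 ≥ 0.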
Proof of Equation B.4:

Case 1: θ̂1· = θ̂2· = 0 (hence θ̂·· = 0). It follows that

S₁² + S₂² = (r − 1)⁻¹ Σi Σj θ̂ij².

Since t = 2 we have

MS(R) + MS(T*R) = (r − 1)⁻¹[2 Σj θ̂·j² + Σi Σj (θ̂ij − θ̂·j)²] = (r − 1)⁻¹ Σi Σj θ̂ij²,

and thus Equation B.4 holds for the θ̂ij.

Case 2: θ̂1· ≠ 0 or θ̂2· ≠ 0. Define Wij = θ̂ij − θ̂i·. Since W1· = W2· = 0, then Equation B.4 holds for the Wij. Since it can be shown that the quantities S₁², S₂², MS(R), and MS(T*R) computed from the Wij are identical to those computed from the θ̂ij, then Equation B.4 must also hold for the θ̂ij.
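The identity in Equation B.4 (for t = 2, S₁² + S₂² = MS(R) + MS(T*R)) can also be checked numerically with arbitrary values; the AUCs below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
r = 7
theta = rng.normal(0.85, 0.05, (2, r))           # hypothetical AUCs: 2 tests x r readers

s1sq, s2sq = theta.var(axis=1, ddof=1)           # sample variances across readers

test_mean = theta.mean(axis=1, keepdims=True)    # theta_i.
reader_mean = theta.mean(axis=0, keepdims=True)  # theta_.j
grand = theta.mean()                             # theta_..

ms_r = 2 * ((reader_mean - grand) ** 2).sum() / (r - 1)                 # MS(R), t = 2
ms_tr = ((theta - test_mean - reader_mean + grand) ** 2).sum() / (r - 1)  # MS(T*R)

assert np.isclose(s1sq + s2sq, ms_r + ms_tr)     # Equation B.4
```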
Appendix C: SAS statements for computing power for the example
(a) Computation of power for r = 8 readers and c = 240 cases
data data1; **Enter the OR outputs computed from pilot data**; length study $16;
input study $ c_star mstr var_error cov1 cov2 cov3 var_tr;/*Notes:
mstr = MS(test × reader)
var_tr = OR test × reader variance component
var_error = OR fixed-reader error variance component
c_star = number of cases for pilot data
Either mstr or var_tr must be specified--enter a missing value for the one not specified. If var_tr is not missing then the program uses the specified var_tr value, regardless of whether mstr is specified or missing. If var_tr is missing then var_tr is computed as a function of mstr and other inputs, and if the computed value is negative then it will be set to zero.
*/
cards;
VanDyke 114 .000622731 .001393652 .000351859 .000346505 .000221453 . ; /*
NOTE: to obtain result with the test-by-reader variance component set to .0001, just change the missing data value above to .0001. That is, substitute the following line:
VanDyke 114 .000622731 .001393652 .000351859 .000346505 .000221453 .0001 */
proc print; title "Pilot study estimates"; run;
data data2; set data1; **Compute power for r = 8, c = 240**;
/* set the following as desired */
alpha = .05; **significance level**;
AUCdiff = .05; **effect size: difference in population AUCs**;
r = 8; **reader sample size for power estimate**;
c = 240; **case sample size for power estimate**;
/* now estimate var_tr if it was not specified */
if var_tr = . then do;
var_tr = mstr - var_error + cov1 + max(cov2 - cov3, 0);
var_tr = var_tr*(var_tr>0); *constrains var_tr to be nonnegative*; end;
/* now estimate noncentrality parameter (nc) and denominator df (df2) */
denom = var_tr + (c_star/c)*(var_error - cov1 + max((r-1)*(cov2 - cov3), 0));
nc = r*.5*AUCdiff**2/denom;
df2 = denom**2/((var_tr + (c_star/c)*(var_error - cov1 - max(cov2 - cov3, 0)))**2/(r-1));
/* now compute power */
F_critical = finv(1-alpha, 1, df2);
power = 1 - probf(F_critical, 1, df2, nc);
proc print; title "Power results";
var study AUCdiff r c nc df2 power;
run;
Output:
Pilot study estimates | |||||||
study | c_star | mstr | var_error | cov1 | cov2 | cov3 | var_tr |
VanDyke | 114 | .000622731 | .001393652 | .000351859 | .000346505 | .000221453 | . |
Power results | ||||||
study | AUCdiff | r | c | nc | df2 | power |
VanDyke | 0.05 | 8 | 240 | 10.9812 | 30.6140 | 0.89402 |
(b) Computation of reader and case sample sizes needed for power = .80. These results are presented in the left-hand side of Table 5.
***looped version***;
data data2; set data1;
/* set the following as desired */
alpha = .05; **significance level**;
AUCdiff = .05; **effect size: difference in population AUCs**;
power_target = .80; **desired power**;
/* now estimate var_tr if it was not specified */
if var_tr = . then do;
var_tr = mstr - var_error + cov1 + max(cov2 - cov3, 0);
var_tr = var_tr*(var_tr>0); *constrains var_tr to be nonnegative*; end;
do r = 3 to 15; **reader sample size for power estimate**;
flag = 0;
do c = 20 to 2000; **candidate case sample sizes for power estimate -- change as needed**;
/* now estimate noncentrality parameter (nc) and denominator df (df2) */
denom = var_tr + (c_star/c)*(var_error - cov1 + max((r-1)*(cov2 - cov3), 0));
nc = r*.5*AUCdiff**2/denom; **nc = noncentrality parameter**;
df2 = denom**2/((var_tr + (c_star/c)*(var_error - cov1 - max(cov2 - cov3, 0)))**2/(r-1));
/* now compute power */
F_critical = finv(1-alpha, 1, df2); **F_critical = OR critical F value**;
power = 1 - probf(F_critical, 1, df2, nc);
if (flag = 0) and (power ge power_target) then do;
output; flag = 1; GOTO HERE;
end;
end;
HERE:;
end;
proc print;
var study AUCdiff r c power; run;
Output:
Power results | |||||
Obs | study | AUCdiff | r | c | power |
1 | VanDyke | 0.05 | 3 | 559 | 0.80044 |
2 | VanDyke | 0.05 | 4 | 343 | 0.80040 |
3 | VanDyke | 0.05 | 5 | 266 | 0.80142 |
4 | VanDyke | 0.05 | 6 | 225 | 0.80045 |
5 | VanDyke | 0.05 | 7 | 200 | 0.80020 |
6 | VanDyke | 0.05 | 8 | 183 | 0.80007 |
7 | VanDyke | 0.05 | 9 | 171 | 0.80079 |
8 | VanDyke | 0.05 | 10 | 162 | 0.80175 |
9 | VanDyke | 0.05 | 11 | 154 | 0.80028 |
10 | VanDyke | 0.05 | 12 | 148 | 0.80025 |
11 | VanDyke | 0.05 | 13 | 143 | 0.80010 |
12 | VanDyke | 0.05 | 14 | 139 | 0.80055 |
13 | VanDyke | 0.05 | 15 | 136 | 0.80214 |
Appendix D: SAS statements for converting DBM mean squares to OR statistics for the example
data OR_statistics;
input t r c mst msr mstr msc mstc msrc mstrc; **DBM mean squares**;
/* Notes:
t, r, and c are number of tests, readers and cases for the data set mst, msr, mstr, msc, mstc, msrc, and mstrc are the DBM mean squares for test, reader, test × reader, case, test × case, reader × case, and test × reader × case
*/
/*Now compute corresponding OR mean squares and fixed-reader covariances*/
mst_OR = mst / c;
msr_OR = msr / c;
mstr_OR = mstr / c;
var_error = (msc + (t-1)*mstc + (r-1)*msrc + (t-1)*(r-1)*mstrc) / (t*r*c);
cov1 = (msc - mstc + (r-1)*(msrc - mstrc)) / (t*r*c);
cov2 = (msc - msrc + (t-1)*(mstc - mstrc)) / (t*r*c);
cov3 = (msc - mstc - msrc + mstrc) / (t*r*c);
cards;
2 5 114 0.45638557 0.32315642 0.07099138 0.45797697 0.17578816 0.13424103 0.10450847
;
proc print; title "DBM mean squares";
var t r c mst msr mstr msc mstc msrc mstrc;
proc print; title "Corresponding OR mean squares and covariances";
var msr_OR mstr_OR var_error cov1 cov2 cov3;
run;
Output:
DBM mean squares | |||||||||
t | r | c | mst | msr | mstr | msc | mstc | msrc | mstrc |
2 | 5 | 114 | 0.45639 | 0.32316 | 0.070991 | 0.45798 | 0.17579 | 0.13424 | 0.10451 |
Corresponding OR mean squares and covariances | |||||
msr_OR | mstr_OR | var_error | cov1 | cov2 | cov3 |
.002834705 | .000622731 | .001393652 | .000351859 | .000346505 | .000221453 |
Footnotes
Disclaimer The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.
References
- [1]. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
- [2]. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi:10.1016/s1076-6332(98)80294-8.
- [3]. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Communications in Statistics: Simulation and Computation. 1995;24:285–308.
- [4]. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Academic Radiology. 1995;2(Suppl 1):S22–S29.
- [5]. Obuchowski NA, McClish DK. Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices. Statistics in Medicine. 1997;16:1529–1542. doi:10.1002/(sici)1097-0258(19970715)16:13<1529::aid-sim565>3.0.co;2-h.
- [6]. Zhou X-H, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: Wiley; 2002.
- [7]. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Academic Radiology. 2004;11:1260–1273. doi:10.1016/j.acra.2004.08.009.
- [8]. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine. 2007;26:596–619. doi:10.1002/sim.2532.
- [9]. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi:10.1002/sim.2024.
- [10]. Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Academic Radiology. 2008;15:647–661. doi:10.1016/j.acra.2007.12.015.
- [11]. Quenouille MH. Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B. 1949;11:68–84.
- [12]. Quenouille MH. Notes on bias in estimation. Biometrika. 1956;43:353–360.
- [13]. Tukey JW. Bias and confidence in not quite large samples (abstract). Annals of Mathematical Statistics. 1958;29:614.
- [14]. Berbaum KS, Schartz KM, Pesce LL, Hillis SL. DBM MRMC 2.2 (computer software). Available for download from http://perception.radiology.uiowa.edu. Accessed August 1, 2009.
- [15]. Berbaum KS, Metz CE, Pesce LL, Schartz KM. DBM MRMC 2.1 User's Guide (software manual). Available for download from http://perception.radiology.uiowa.edu. Accessed August 1, 2009.
- [16]. Hillis SL, Schartz KM, Pesce LL, Berbaum KS, Metz CE. DBM MRMC procedure for SAS (computer software). Available for download from http://perception.radiology.uiowa.edu. Accessed August 1, 2009.
- [17]. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–844.
- [18]. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi:10.1148/radiology.143.1.7063747.
- [19]. Obuchowski NA. Computing sample size for receiver operating characteristic studies. Investigative Radiology. 1994;29:238–243. doi:10.1097/00004424-199402000-00020.
- [20]. Hillis SL, Berbaum KS. MRMC sample size program user's guide (software manual). Available for download from http://perception.radiology.uiowa.edu. Accessed August 1, 2009.
- [21]. Van Dyke CW, White RD, Obuchowski NA, Geisinger MA, Lorig RJ, Meziane MA. Cine MRI in the diagnosis of thoracic aortic dissection. 79th RSNA Meeting; Chicago, IL; November 28–December 3, 1993.
- [22]. Pan XC, Metz CE. The "proper" binormal model: parametric receiver operating characteristic curve estimation with degenerate data. Academic Radiology. 1997;4:380–389. doi:10.1016/s1076-6332(97)80121-3.
- [23]. Metz CE, Pan XC. "Proper" binormal ROC curves: theory and maximum-likelihood estimation. Journal of Mathematical Psychology. 1999;43:1–33. doi:10.1006/jmps.1998.1218.
- [24]. SAS for Windows, Version 9.2. Cary, NC: SAS Institute Inc.; 2002–2008.
- [25]. Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Academic Radiology. 1997;4:298–303. doi:10.1016/s1076-6332(97)80032-3.
- [26]. Chakraborty DP. Prediction accuracy of a sample-size estimation method for ROC studies. Academic Radiology. 2010;17:628–638. doi:10.1016/j.acra.2010.01.007.