Abstract
There are several methods available for analyzing multireader ROC studies that generalize results to both the reader and case populations. Two of these methods - the Dorfman-Berbaum-Metz (DBM) method and the Obuchowski-Rockette (OR) method - appear to be quite different in their original formulations. However, recently it has been shown that the DBM and OR procedures yield the same test statistic when based on the same accuracy measure and covariance estimation method, but inferences can vary depending on which denominator degrees of freedom (ddf) method, DBM or OR, is used. I show in simulations that there are problems with both ddf methods: OR is ultraconservative with significance levels considerably below the nominal level, and DBM can result in extremely wide confidence intervals because the ddf can be close to zero. I propose a new ddf method that overcomes both of these problems and can be used with either the OR or DBM procedure.
Keywords: Receiver operating characteristic (ROC) curve, corrected F, diagnostic radiology, degrees of freedom
1 INTRODUCTION
Receiver operating characteristic (ROC) curve analysis is a well-established method for evaluating and comparing the performance of diagnostic tests [1-5]. There are several methods available for analyzing multireader ROC studies that generalize results to both the reader and case populations. Five such methods have recently been described and compared in an analysis of three data sets by Obuchowski et al [6]. The study shows that the methods can lead to different conclusions, illustrating the need for more theoretical and empirical investigation of the methods to provide insight and guidance regarding the appropriateness of each procedure in a given situation.
Two of these methods - the Dorfman-Berbaum-Metz (DBM) method [7, 8] and the Obuchowski-Rockette (OR) method [9, 10] - appear to be quite different in their original formulations. However, recently it has been shown by Hillis et al [11] that the DBM and OR procedures yield the same test statistic when based on the same accuracy measure and covariance estimation method, but inferences depend on which denominator degrees of freedom (ddf) method, DBM or OR, is used.
In simulations I find that there are problems with both ddf methods: OR is ultraconservative with significance levels considerably below the nominal level while DBM, though much closer to the nominal significance level, sometimes results in extremely wide confidence intervals because the ddf can be close to zero. I propose using a new ddf estimator that overcomes these problems and can be used with either the DBM or OR procedure.
The outline of the paper is as follows. I describe the OR method in Section 2. In Section 3 I derive the new ddf from the OR model, discuss the rationale for the original OR ddf, and compare the new ddf with the DBM ddf. In Section 4 I show how the new ddf can be derived from the DBM model and discuss the possibility that the DBM ddf may be based on an inadequate Satterthwaite approximation. In Section 5 I discuss confidence intervals for test differences, inferences for single tests, and fixed readers analysis. In Section 6 I compare the performance of the new ddf method with the OR and DBM ddf methods in a simulation study. I illustrate and compare the three ddf methods using a previously published data set in Section 7 and make concluding remarks in the final section.
2 THE OBUCHOWSKI-ROCKETTE (OR) METHOD
2.1 Design and notation
A commonly used study design in radiology that allows generalization to both the reader and case populations is the test × reader × case factorial design, where each case (i.e., patient) undergoes each diagnostic test (e.g., CT scan, MRI, or X-ray) and the resulting images are evaluated once by each reader (usually a radiologist). Typically the number of cases is 25-200 while the number of readers is 3-15. Throughout I assume that the data have been collected using this factorial design.
Let Z_ijk denote the rating assigned to the kth case by the jth reader using the ith modality. For example, a five-level ordinal integer scale is commonly used, with higher rating values indicating a higher level of confidence that the case is diseased; thus a rating value of 1 might correspond to “definitely not diseased” and 5 to “definitely diseased,” with other values corresponding to intermediate assessments. Alternatively, a quasi-continuous 0% to 100% confidence scale is often used. The observed data consist of the Z_ijk, with i = 1,..., t, j = 1,..., r, k = 1,..., c, where t is the number of tests (or modalities), r the number of readers, and c the number of cases. In addition, each case is classified as diseased or nondiseased according to an available reference standard.
2.2 Model and test statistic
The corrected F test was proposed by Obuchowski and Rockette [9]. Let θ̂_ij denote the AUC estimate (or other accuracy estimate) for the ith test and jth reader. Their approach is to use a test × reader ANOVA model for the AUCs, but unlike a conventional ANOVA model they allow the errors to be correlated to account for correlation in the AUCs due to each reader evaluating the same cases. Thus their two-way ANOVA model corresponds to the three-way study design. For the factorial design with only one replication, the corrected F test model, which I refer to as the OR model, can be written as

θ̂_ij = μ + τ_i + R_j + (τR)_ij + ε_ij,    (1)
i = 1,..., t, j = 1,..., r, where τ_i denotes the fixed effect of test i, R_j denotes the random effect of reader j, (τR)_ij denotes the random test × reader interaction, and ε_ij is the error term. The R_j and (τR)_ij are assumed to be mutually independent and normally distributed with zero means and variances σ²_R, reflecting differences in reader ability, and σ²_TR, reflecting test-by-reader interaction. The ε_ij are assumed to be normally distributed with zero mean and constant variance σ²_ε, which represents variability attributable to cases and within-reader variability that describes how a reader interprets the same image in different ways on different occasions. The ε_ij are independent of the R_j and (τR)_ij. However, since the same cases are read by each reader using each test, the ε_ij are not assumed to be independent. Instead, equi-covariance of the errors between readers and tests is assumed, resulting in three possible covariances given by

Cov(ε_ij, ε_i'j) = Cov1 (different tests, same reader; i ≠ i'),
Cov(ε_ij, ε_ij') = Cov2 (same test, different readers; j ≠ j'),
Cov(ε_ij, ε_i'j') = Cov3 (different tests, different readers; i ≠ i', j ≠ j').
It follows from model (1) that σ²_ε, Cov1, Cov2, and Cov3 are also the variance and corresponding covariances of the AUC estimates, conditional on the reader and test×reader effects. Obuchowski and Rockette [9] suggest the following ordering for the covariances:

Cov1 ≥ Cov2 ≥ Cov3 ≥ 0.    (2)
I only consider the factorial design with one replication; however, my results will also apply when there are multiple replications.
The corrected F statistic for testing the null hypothesis of no test effect (H0: τ1 = τ2 = ... = τt) is given by
F* = MS(T) / [MS(T*R) + r(Cov2 − Cov3)],    (3)

where

MS(T) = r(t−1)^{-1} Σ_i (θ̂_i· − θ̂_··)²  and  MS(T*R) = [(t−1)(r−1)]^{-1} Σ_i Σ_j (θ̂_ij − θ̂_i· − θ̂_·j + θ̂_··)².

A subscript replaced by a dot indicates that values are averaged across the missing subscript; for example, θ̂_i· = r^{-1} Σ_j θ̂_ij and θ̂_·· = (tr)^{-1} Σ_i Σ_j θ̂_ij. Obuchowski and Rockette [9] state that F* has an approximate F_{(t−1),ddfO} null distribution, where

ddfO = (t−1)(r−1).    (4)
Throughout, the ddf subscript denotes the first author of the paper that proposed the ddf method: O = Obuchowski, D = Dorfman, H = Hillis.
In practice we do not know Cov2 or Cov3 and have to estimate them from the data or use estimates from previous studies. Thus the statistic actually used is
F_OR = MS(T) / {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)},    (5)

where Ĉov2 and Ĉov3 denote estimates for Cov2 and Cov3, respectively. Note that equation (5) incorporates constraint (2) by setting Ĉov2 − Ĉov3 to zero if it is negative. Since Cov2 and Cov3 are also the corresponding covariances of the AUC estimates conditional on the reader and test×reader effects, they can be estimated using ROC analysis methods that treat cases as random but readers as fixed. For example, for trapezoidal-rule AUC estimates [5], Obuchowski and Rockette [9] suggest estimating σ²_ε, Cov1, Cov2, and Cov3 by averaging corresponding variances and covariances computed using the method of DeLong et al [12], which provides covariance and variance estimates treating only cases as random. More generally, any acceptable method for estimating the AUC variance and covariances that treats readers as fixed can be used, such as jackknifing or bootstrapping. The OR estimates obtained from averaging the corresponding fixed-reader AUC variances and covariances are denoted by σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3.
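As a concrete illustration of this estimation scheme, the sketch below computes trapezoidal-rule AUCs for each test-reader combination and the averaged jackknife estimates σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 from a ratings array. It is an illustrative reimplementation in the notation of this section (the function names and data layout are my own), not the software used for the analyses in this paper.

```python
import numpy as np

def auc_trapezoid(nondiseased, diseased):
    """Trapezoidal-rule (Mann-Whitney) AUC: P(Y > X) + 0.5 P(Y = X)."""
    diff = np.subtract.outer(np.asarray(diseased, float),
                             np.asarray(nondiseased, float))
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def or_jackknife_estimates(Z, truth):
    """Z: ratings of shape (t, r, c); truth: length-c 0/1 disease labels.
    Returns the t x r AUC matrix and the averaged OR jackknife estimates
    (var_e, cov1, cov2, cov3) of sigma^2_eps, Cov1, Cov2, Cov3."""
    t, r, c = Z.shape
    truth = np.asarray(truth)
    auc = np.empty((t, r))
    auc_loo = np.empty((t, r, c))              # leave-one-case-out AUCs
    for i in range(t):
        for j in range(r):
            x = Z[i, j, truth == 0]
            y = Z[i, j, truth == 1]
            auc[i, j] = auc_trapezoid(x, y)
            for k in range(c):
                keep = np.arange(c) != k
                zk, tk = Z[i, j, keep], truth[keep]
                auc_loo[i, j, k] = auc_trapezoid(zk[tk == 0], zk[tk == 1])
    # Jackknife covariance between estimates (i,j) and (i',j'):
    # (c-1)/c times the sum over k of products of leave-one-out deviations.
    d = auc_loo - auc_loo.mean(axis=2, keepdims=True)
    cov = (c - 1) / c * np.einsum('ijk,pqk->ijpq', d, d)
    var_e = np.mean([cov[i, j, i, j] for i in range(t) for j in range(r)])
    cov1 = np.mean([cov[i, j, p, j] for i in range(t) for p in range(t)
                    for j in range(r) if i != p])
    cov2 = np.mean([cov[i, j, i, q] for i in range(t) for j in range(r)
                    for q in range(r) if j != q])
    cov3 = np.mean([cov[i, j, p, q] for i in range(t) for p in range(t)
                    for j in range(r) for q in range(r) if i != p and j != q])
    return auc, var_e, cov1, cov2, cov3
```

The returned Ĉov2 and Ĉov3 can be plugged directly into the denominator of F_OR in equation (5).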
3 THE PROPOSED DDF
In this section I show that F* has an approximate F_{(t−1),df2} null distribution, where

df2 = {E[MS(T*R)] + r(Cov2 − Cov3)}² / ( {E[MS(T*R)]}² / [(t−1)(r−1)] ).    (6)

In deriving (6) I treat Cov2 and Cov3 as known. An estimate of df2 that incorporates constraint (2) is given by

ddfH = {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)}² / ( [MS(T*R)]² / [(t−1)(r−1)] ).    (7)

Thus I propose using ddfH as the ddf for F_OR, since F_OR approximates F*. I also discuss why Obuchowski and Rockette suggested (t−1)(r−1) as the ddf.
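The contrast between the two ddfs can be seen numerically. The following minimal sketch (hypothetical helper names; the numeric values are chosen only for illustration) computes both ddfs and illustrates that ddfH ≥ (t−1)(r−1) = ddfO, with equality when Ĉov2 ≤ Ĉov3.

```python
def ddf_obuchowski(t, r):
    """ddfO, equation (4)."""
    return (t - 1) * (r - 1)

def ddf_hillis(ms_tr, cov2_hat, cov3_hat, t, r):
    """ddfH, equation (7): grows as r(Cov2-Cov3) grows relative to MS(T*R)."""
    num = (ms_tr + r * max(cov2_hat - cov3_hat, 0.0)) ** 2
    return num / (ms_tr ** 2 / ((t - 1) * (r - 1)))
```

For example, with t = 2, r = 5, MS(T*R) = 0.001, and Ĉov2 − Ĉov3 = 0.0004, ddfH = (0.001 + 0.002)²/(0.001²/4) = 36, versus ddfO = 4.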
3.1 Derivation of ddfH
The expected values for MS(T), MS(R), and MS(T*R) are derived in the Appendix and given in Table 1, where σ²_τ = (t−1)^{-1} Σ_i (τ_i − τ̄)². It follows from Table 1 that

E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) + rσ²_τ,    (8)

so that E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) under H0, where H0 is the null hypothesis of no test effect (H0: τ1 = ... = τt = 0). Thus the numerator and denominator of F* in equation (3) have the same expected value under H0, while the numerator has a larger expected value if H0 does not hold; hence F* is an appropriate ANOVA test statistic. [ Table 1 ]
Table 1.
Expected mean squares for the Obuchowski-Rockette model.
Mean square | Expected mean square |
---|---|
MS(T) | σ²_TR + σ²_ε − Cov1 + (r−1)(Cov2 − Cov3) + rσ²_τ |
MS(R) | tσ²_R + σ²_TR + σ²_ε + (t−1)Cov1 − Cov2 − (t−1)Cov3 |
MS(T*R) | σ²_TR + σ²_ε − Cov1 − Cov2 + Cov3 |
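The expected mean squares in Table 1 can be checked by direct Monte Carlo simulation of model (1). The sketch below uses an assumed, purely illustrative set of parameter values (σ²_R = 0.1, σ²_TR = 0.05, σ²_ε = 1, Cov1 = 0.5, Cov2 = 0.3, Cov3 = 0.2) under H0, for which both E[MS(T)] and E[MS(T*R)] + r(Cov2 − Cov3) equal 0.85.

```python
import numpy as np

rng = np.random.default_rng(0)
t, r, n = 2, 4, 20000
s2_r, s2_tr, s2_e = 0.1, 0.05, 1.0
cov1, cov2, cov3 = 0.5, 0.3, 0.2

# Error covariance matrix implied by the equi-covariance assumption,
# with the (i, j) cell flattened to index i*r + j.
a = s2_e - cov1 - cov2 + cov3
Sigma = (a * np.eye(t * r)
         + (cov1 - cov3) * np.kron(np.ones((t, t)), np.eye(r))
         + (cov2 - cov3) * np.kron(np.eye(t), np.ones((r, r)))
         + cov3 * np.ones((t * r, t * r)))

eps = rng.multivariate_normal(np.zeros(t * r), Sigma, size=n).reshape(n, t, r)
theta = (rng.normal(0, np.sqrt(s2_r), (n, 1, r))     # reader effects R_j
         + rng.normal(0, np.sqrt(s2_tr), (n, t, r))  # test x reader effects
         + eps)                                      # all tau_i = 0, i.e., H0

row = theta.mean(axis=2, keepdims=True)
col = theta.mean(axis=1, keepdims=True)
g = theta.mean(axis=(1, 2), keepdims=True)
ms_t = r * ((row - g) ** 2).sum(axis=(1, 2)) / (t - 1)
ms_tr = ((theta - row - col + g) ** 2).sum(axis=(1, 2)) / ((t - 1) * (r - 1))

# Both sides of the H0 identity (8) should be close to 0.85 for these values.
print(ms_t.mean(), ms_tr.mean() + r * (cov2 - cov3))
```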
To show that F* has an approximate F_{(t−1),df2} null distribution with df2 defined by equation (6), I first consider an approximation for the distribution of its denominator, MS(T*R) + r(Cov2 − Cov3). Let Xm denote a random variable with a χ²_m/m distribution that is independent of MS(T*R). Then E(Xm) = 1 and Xm converges to 1 in probability as m → ∞. By Slutsky's theorem the limiting distribution of MS(T*R) + r(Cov2 − Cov3)Xm as m → ∞ is equal to the distribution of MS(T*R) + r(Cov2 − Cov3). My approach is to approximate the distribution of MS(T*R) + r(Cov2 − Cov3) by the limiting distribution of an approximate distribution for MS(T*R) + r(Cov2 − Cov3)Xm as m → ∞.
I show in the Appendix that MS(T) and MS(T*R) are independently distributed, that (t−1)(r−1)MS(T*R)/E[MS(T*R)] ~ χ²_{(t−1)(r−1)}, and that (t−1)MS(T)/{E[MS(T*R)] + r(Cov2 − Cov3)} ~ χ²_{t−1} under H0. Satterthwaite [13, 14] has shown that a linear function of independent random variables, Σ_l a_l·MS_l, where df_l·MS_l/E(MS_l) ~ χ²_{df_l}, is approximately distributed as {Σ_l a_l·E(MS_l)}·χ²_{df}/df, where

df = {Σ_l a_l·E(MS_l)}² / Σ_l {[a_l·E(MS_l)]² / df_l}.    (9)
Using Satterthwaite's procedure with a1 = 1, MS1 = MS(T*R), a2 = r(Cov2 − Cov3), and MS2 = Xm, whose degrees of freedom are (t−1)(r−1) and m, respectively, it follows that MS(T*R) + r(Cov2 − Cov3)Xm is approximately distributed as

{E[MS(T*R)] + r(Cov2 − Cov3)} χ²_{df_S(m)} / df_S(m),    (10)

where df_S(m) is given by

df_S(m) = {E[MS(T*R)] + r(Cov2 − Cov3)}² / ( {E[MS(T*R)]}²/[(t−1)(r−1)] + [r(Cov2 − Cov3)]²/m ).

Since lim_{m→∞} df_S(m) = df2, where df2 is defined by equation (6), it follows that the limiting distribution of distribution (10) as m → ∞ is

{E[MS(T*R)] + r(Cov2 − Cov3)} χ²_{df2} / df2.    (11)
Thus the distribution of the denominator of F *, MS(T*R)+r (Cov2-Cov3), is approximated by distribution (11).
Since MS(T) and MS(T*R) + r(Cov2 − Cov3) are independent, (t−1)MS(T)/{E[MS(T*R)] + r(Cov2 − Cov3)} ~ χ²_{t−1} under H0, MS(T*R) + r(Cov2 − Cov3) is approximately distributed as {E[MS(T*R)] + r(Cov2 − Cov3)}χ²_{df2}/df2, and E[MS(T)] = E[MS(T*R)] + r(Cov2 − Cov3) under H0, it follows that the null distribution of F* = MS(T)/[MS(T*R) + r(Cov2 − Cov3)] is approximated by that of

[U/(t−1)] / [V/df2],

where U ~ χ²_{t−1}, V ~ χ²_{df2}, and U and V are independent; that is, F* has an approximate F_{(t−1),df2} null distribution. Since F_OR approximates F* and ddfH (7) estimates df2, I propose using ddfH as the ddf for F_OR.
In summary, approximating the distribution of F_OR involves three distinct approximations. First, F_OR is approximated by F*. The adequacy of this approximation depends on how well Ĉov2 − Ĉov3 approximates Cov2 − Cov3. Throughout, I assume that the number of cases is moderate or large, as is typical for ROC studies; thus it is reasonable to expect that this approximation will be adequate, since the precision of the covariance estimates increases with the number of cases. Second, the distribution of MS(T*R) + r(Cov2 − Cov3) is approximated by the distribution of MS(T*R) + r(Cov2 − Cov3)Xm for large m, where Xm ~ χ²_m/m, so that the Satterthwaite approximation can be applied to the denominator of F_OR; this approximation is justified by Slutsky's theorem and the law of large numbers. Finally, the distribution of MS(T*R) + r(Cov2 − Cov3)Xm for large m is approximated using the Satterthwaite procedure to give distribution (11), with df2 then estimated by ddfH. This approximation depends on the adequacy of the Satterthwaite approximation, with Cov2 − Cov3 approximated by Ĉov2 − Ĉov3 and E[MS(T*R)] approximated by MS(T*R). I expect this approximation to be adequate since the Satterthwaite approximation has been shown to perform acceptably for sums of mean squares, with expected mean squares estimated by observed mean squares when estimating the degrees of freedom, even when the mean square degrees of freedom are low [13, 14]. In the simulation study I empirically investigate the performance of ddfH.
3.2 Rationale for ddfO
To understand why Obuchowski and Rockette [9] suggested using (t − 1)(r − 1) as the ddf, I now reexamine their argument. Based upon the work of Pavur and Nath [15], they note that under H0

F(1 + ρ)^{−1}    (12)

is distributed as F_{(t−1),(t−1)(r−1)}, where F = MS(T)/MS(T*R) and

ρ = r(Cov2 − Cov3)/E[MS(T*R)].

They then replace ρ in equation (12) by the estimate

ρ̂ = r·max(Ĉov2 − Ĉov3, 0)/V̂,

where

V̂ = MS(T*R),    (13)

to derive the OR F statistic (5).
Although replacing ρ by a high-precision estimate should not appreciably alter the distribution of (12), the variance estimate (13) lacks precision for the typical study with only a few tests and readers, since MS(T*R) has only (t−1)(r−1) degrees of freedom. Furthermore, the estimate of ρ is correlated with F, and this correlation needs to be taken into account because of the lack of precision in the estimate. Thus ddfO does not adequately describe the distribution of the F_OR statistic (5) for typical studies, because it accounts for neither the lack of precision in estimating ρ nor the correlation between the estimate of ρ and F.
4 DDFH AND THE DBM PROCEDURE
4.1 The DBM procedure
The DBM method proposed by Dorfman et al [7] for analyzing multireader ROC studies also generalizes results to both the reader and case populations. For this method AUC pseudovalues are computed using the Quenouille-Tukey jackknife [16-18] separately for each reader-test combination. Let Y_ijk denote the AUC pseudovalue for test i, reader j, and case k; by definition

Y_ijk = c·θ̂_ij − (c − 1)·θ̂_ij(k),

where θ̂_ij denotes the AUC estimate based on all of the data for the ith test and jth reader and θ̂_ij(k) denotes the AUC estimate when data for the kth case are omitted. Using the Y_ijk as the responses, the DBM procedure specifies testing for a test effect using a fully crossed three-factor ANOVA, with test treated as a fixed factor and reader and case as random factors. The DBM accuracy estimate for test i and reader j is Ȳ_ij·, which is the jackknife accuracy estimate corresponding to θ̂_ij.
Recently Hillis et al [11] generalize the DBM method by showing how it can be used with normalized pseudovalues and quasi pseudovalues: normalized pseudovalues are defined by Y′_ijk = Y_ijk + (θ̂_ij − Ȳ_ij·), and quasi pseudovalues are defined as any values such that the resulting test-reader sample means, variances, and covariances are identical to the θ̂_ij and Σ̂, where Σ̂ is the estimated fixed-reader covariance matrix used to compute the OR procedure quantities σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3. For normalized and quasi pseudovalues the DBM accuracy estimate, given by Ȳ_ij·, is equal to θ̂_ij. They show that using DBM with normalized pseudovalues yields the same test statistic for testing for a modality effect as using OR with jackknife covariance estimates, and more generally, using DBM with quasi pseudovalues yields the same test statistic as OR for other covariance estimation methods. However, even when DBM and OR yield the same test statistic, inferences depend on which ddf method, DBM or OR, is used. From here on I assume, when comparing the DBM and OR procedures, that the pseudovalues are the normalized pseudovalues if Ĉov2 and Ĉov3 are based on jackknife covariance estimates, and are quasi pseudovalues if Ĉov2 and Ĉov3 are based on another fixed-reader covariance estimation method such as that of DeLong et al [12].
Let MS(T)pseudo, MS(T*R)pseudo, MS(T*C)pseudo, and MS(T*R*C)pseudo denote the test, test×reader, test×case, and test×reader×case mean squares for the DBM three-way ANOVA of the pseudovalues. The DBM F statistic for testing the null hypothesis of no test effect is

F_DBM = MS(T)pseudo / {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}.    (14)
Hillis et al [11] show that F_DBM = F_OR by showing that

MS(T)pseudo = c·MS(T)  and  MS(T*R)pseudo = c·MS(T*R),    (15)

and

MS(T*C)pseudo − MS(T*R*C)pseudo = c·r(Ĉov2 − Ĉov3).    (16)

The result follows by substitution in (14). They also show that the DBM model implies the OR model, and they give the one-to-one mapping between the OR and DBM model parameters.
This form (14) of the DBM F statistic utilizes the same constraint as F_OR (5), since equation (16) implies that the constraint MS(T*C)pseudo − MS(T*R*C)pseudo ≥ 0, utilized in the denominator of F_DBM (14), is equivalent to the constraint employed in F_OR. Although Dorfman et al [7, 8] suggest also constraining MS(T*R)pseudo − MS(T*R*C)pseudo to be nonnegative, References [11, 19] discuss conceptual reasons for not using this constraint; furthermore, simulations [20] show that performance is better when only MS(T*C)pseudo − MS(T*R*C)pseudo is constrained to be nonnegative. Throughout this paper I use F_DBM as specified by equation (14).
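The OR-DBM equivalence can be checked numerically. The sketch below is illustrative code (my own helper names, with an arbitrary normal array standing in for pseudovalues): it computes F_DBM (14) from a pseudovalue array and F_OR (5) from the corresponding accuracy estimates Ȳ_ij· together with jackknife covariance estimates, which for pseudovalues equal the test-reader sample covariances divided by c; the two statistics agree up to rounding error.

```python
import numpy as np

def dbm_f(Y):
    """F_DBM (14) from a pseudovalue array Y of shape (t, r, c)."""
    t, r, c = Y.shape
    m_ij, m_ik, m_jk = Y.mean(2), Y.mean(1), Y.mean(0)
    m_i, m_j, m_k = Y.mean((1, 2)), Y.mean((0, 2)), Y.mean((0, 1))
    g = Y.mean()
    ms_t = r * c * np.sum((m_i - g) ** 2) / (t - 1)
    ms_tr = c * np.sum((m_ij - m_i[:, None] - m_j[None, :] + g) ** 2) \
        / ((t - 1) * (r - 1))
    ms_tc = r * np.sum((m_ik - m_i[:, None] - m_k[None, :] + g) ** 2) \
        / ((t - 1) * (c - 1))
    resid = (Y - m_ij[:, :, None] - m_ik[:, None, :] - m_jk[None, :, :]
             + m_i[:, None, None] + m_j[None, :, None] + m_k[None, None, :] - g)
    ms_trc = np.sum(resid ** 2) / ((t - 1) * (r - 1) * (c - 1))
    return ms_t / (ms_tr + max(ms_tc - ms_trc, 0.0))

def or_f_from_pseudovalues(Y):
    """F_OR (5) using theta_ij = Ybar_ij. and jackknife covariances, which
    equal the pseudovalue sample covariances divided by c."""
    t, r, c = Y.shape
    theta = Y.mean(2)
    row, col, g = theta.mean(1, keepdims=True), theta.mean(0, keepdims=True), theta.mean()
    ms_t = r * np.sum((row - g) ** 2) / (t - 1)
    ms_tr = np.sum((theta - row - col + g) ** 2) / ((t - 1) * (r - 1))
    e = Y - theta[:, :, None]
    S = np.einsum('ijk,pqk->ijpq', e, e) / (c - 1)  # pseudovalue covariances
    cov2 = np.mean([S[i, j, i, q] for i in range(t)
                    for j in range(r) for q in range(r) if j != q]) / c
    cov3 = np.mean([S[i, j, p, q] for i in range(t) for p in range(t)
                    for j in range(r) for q in range(r)
                    if i != p and j != q]) / c
    return ms_t / (ms_tr + r * max(cov2 - cov3, 0.0))

rng = np.random.default_rng(1)
Y = rng.normal(size=(2, 4, 30))
print(dbm_f(Y), or_f_from_pseudovalues(Y))  # identical up to rounding
```

Because the agreement is an algebraic identity, it holds for any array, not just genuine jackknife pseudovalues.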
4.2 Comparison of ddfH, ddfO, and ddfD
The DBM numerator and denominator degrees of freedom for the null distribution of F_DBM (14) are t−1 and ddfD, respectively, where

ddfD = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]²/[(t−1)(r−1)] + [MS(T*C)pseudo]²/[(t−1)(c−1)] + [MS(T*R*C)pseudo]²/[(t−1)(r−1)(c−1)] }.    (17)
Hillis et al [11] show that ddfD has the following form when expressed in terms of the OR mean squares and covariance estimates:
(18) |
Equations (17) and (18) yield the same value for ddfD.
Comparing equations (7) and (18) we see that

ddfD ≤ ddfH.    (19)
This relationship is intuitive, since in deriving ddfH I treat σ2, Cov1, Cov2, and Cov3 as known while DBM treats them (or equivalently, the corresponding DBM model variance components) as unknown; thus the uncertainty from estimating them is manifested in the lower DBM degrees of freedom.
It follows from equation (7) that

ddfH ≥ (t−1)(r−1) = ddfO,

with equality attained if and only if Ĉov2 − Ĉov3 ≤ 0, showing that the ddfO method is more conservative than the ddfH method. There is not a similar relationship between ddfO and ddfD: ddfD can be larger or smaller than ddfO. Table 2 presents the ddf formulas in terms of the OR model in part (a) and the ddf relationships in part (b). [ Table 2 ]
Table 2.
Denominator degrees of freedom summary.
a) In terms of the OR mean squares and covariance estimates: |
ddfO = (t-1)(r-1) |
ddfH = {MS(T*R) + r·max(Ĉov2 − Ĉov3, 0)}² / { [MS(T*R)]² / [(t-1)(r-1)] } |
ddfD: given by equation (18) |
b) Relationships: |
ddfD ≤ ddfH |
ddfH ≥ ddfO, with equality if and only if Ĉov2 − Ĉov3 ≤ 0 |
ddfD can be larger or smaller than ddfO |
c) In terms of the DBM mean squares: |
ddfO = (t-1)(r-1) |
ddfH = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]² / [(t-1)(r-1)] } |
ddfD: given by equation (17) |
4.3 ddfH in terms of the DBM mean squares
I can also express ddfH in terms of the DBM analysis mean squares. It follows from equations (7) and (15)-(16) that

ddfH = {MS(T*R)pseudo + max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0]}² / { [MS(T*R)pseudo]² / [(t−1)(r−1)] }.    (20)
Part (c) of Table 2 presents the ddf formulas expressed in terms of the DBM model. Note that the relationships in part (b), which we previously derived from the part (a) formulas, can also be derived from the equivalent part (c) formulas.
Alternatively, I can derive ddfH in form (20) directly from the DBM model in the following way. It is shown by Hillis et al [11] that, under the assumptions of the DBM model, MS(T*C)pseudo and MS(T*R*C)pseudo are independently distributed with

(t−1)(c−1)MS(T*C)pseudo/E[MS(T*C)pseudo] ~ χ²_{(t−1)(c−1)}  and  (t−1)(r−1)(c−1)MS(T*R*C)pseudo/E[MS(T*R*C)pseudo] ~ χ²_{(t−1)(r−1)(c−1)},    (21)

with E[MS(T*C)pseudo] − E[MS(T*R*C)pseudo] = rσ²_TC, where σ²_TC is the test×case variance component for the DBM model. Define

F̃ = MS(T)pseudo / {MS(T*R)pseudo + rσ²_TC}.

For typical ROC studies (c > 25) the degrees of freedom will be at least 25 for each mean square in equation (21); thus MS(T*C)pseudo − MS(T*R*C)pseudo should approximate rσ²_TC reasonably well, implying

F_DBM ≈ F̃.

My approach is to show that an approximate ddf for F̃, and hence for F_DBM, is given by equation (20).
Under the DBM model assumptions (conventional three-way ANOVA model assumptions for the pseudovalues) the numerator and denominator of F̃ have the same null expected values, MS(T)pseudo and MS(T*R)pseudo are independently distributed,

(t−1)(r−1)MS(T*R)pseudo/E[MS(T*R)pseudo] ~ χ²_{(t−1)(r−1)},

and

(t−1)MS(T)pseudo/{E[MS(T*R)pseudo] + rσ²_TC} ~ χ²_{t−1} under H0.

Using the same argument given in Section 3.1 but with MS(T) replaced by MS(T)pseudo, MS(T*R) replaced by MS(T*R)pseudo, and r(Cov2 − Cov3) replaced by rσ²_TC, it follows that F̃ has an approximate F_{t−1,df2} null distribution, with

df2 = {E[MS(T*R)pseudo] + rσ²_TC}² / ( {E[MS(T*R)pseudo]}² / [(t−1)(r−1)] ).

Replacing E[MS(T*R)pseudo] and rσ²_TC by their estimates, MS(T*R)pseudo and max[MS(T*C)pseudo − MS(T*R*C)pseudo, 0], respectively, yields equation (20).
4.4 Inadequate DBM Satterthwaite approximation
When MS(T*C)pseudo − MS(T*R*C)pseudo > 0, ddfD in equation (17) is the Satterthwaite approximation (9) for MS(T*R)pseudo + MS(T*C)pseudo − MS(T*R*C)pseudo. Satterthwaite [13] remarks that caution should be used when applying formula (9) to a linear function of mean squares when some of the coefficients are negative, and Gaylor and Hopper [21] provide guidelines for determining if the Satterthwaite approximation is valid in this situation. We will see in the simulation study that an inadequate Satterthwaite approximation can cause ddfD to approach zero, resulting in an extremely wide confidence interval.
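A small numeric illustration, assuming the Satterthwaite form of ddfD in equation (17) and purely hypothetical mean-square values, shows how large but nearly equal MS(T*C)pseudo and MS(T*R*C)pseudo drive ddfD toward zero while ddfH stays at (t−1)(r−1):

```python
def ddf_dbm(ms_tr, ms_tc, ms_trc, t, r, c):
    # Satterthwaite ddf for MS(T*R)+MS(T*C)-MS(T*R*C) (pseudovalue scale),
    # with the nonnegativity constraint of equation (14) applied.
    num = (ms_tr + max(ms_tc - ms_trc, 0.0)) ** 2
    den = (ms_tr ** 2 / ((t - 1) * (r - 1))
           + ms_tc ** 2 / ((t - 1) * (c - 1))
           + ms_trc ** 2 / ((t - 1) * (r - 1) * (c - 1)))
    return num / den

def ddf_hillis_dbm(ms_tr, ms_tc, ms_trc, t, r):
    # Equation (20): same numerator, but only the MS(T*R) term below.
    num = (ms_tr + max(ms_tc - ms_trc, 0.0)) ** 2
    return num / (ms_tr ** 2 / ((t - 1) * (r - 1)))

# Large, nearly equal MS(T*C) and MS(T*R*C) (hypothetical values):
print(ddf_dbm(0.5, 100.0, 100.0, 2, 5, 100))    # close to zero
print(ddf_hillis_dbm(0.5, 100.0, 100.0, 2, 5))  # stays at (t-1)(r-1) = 4
```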
5 SPECIAL CASES
5.1 Confidence interval for the difference of two tests
Define θi to be the expected accuracy estimate for test i; that is, θi = E(θ̂_ij). I have shown that F_OR, and hence also F_DBM, has an approximate F_{(t−1),ddfH} distribution; it follows that for t = 2 an approximate (1−α)100% confidence interval for θi − θj is given by

θ̂_i· − θ̂_j· ± t_{α/2; ddfH} √(2·MSdenOR/r),    (22)

where MSdenOR is the denominator of F_OR (5), or equivalently by

Ȳ_i·· − Ȳ_j·· ± t_{α/2; ddfH} √(2·MSdenDBM/(rc)),    (23)

where MSdenDBM is the denominator of F_DBM (14). It can be shown, using the approach of Section 3.1, that equations (22)-(23) are also valid when t > 2.
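Interval (22) requires a t quantile with the generally noninteger ddfH degrees of freedom. The sketch below computes the interval half-width from the OR quantities, using a small numerical-inversion routine for the t quantile so that only numpy and the standard library are needed; this is illustrative code, and in a real analysis a statistical library's t quantile function would normally be used instead.

```python
import math
import numpy as np

def t_quantile(p, df):
    """Quantile of Student's t (p in (0.5, 1), df > 0.5) by bisection on a
    numerically integrated t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) \
        / math.sqrt(df * math.pi)

    def cdf(x):
        u = np.linspace(0.0, x, 4001)
        f = const * (1.0 + u ** 2 / df) ** (-(df + 1) / 2)
        dx = u[1] - u[0]
        return 0.5 + dx * (f.sum() - 0.5 * (f[0] + f[-1]))  # trapezoid rule

    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def or_ci_halfwidth(ms_den_or, r, ddf_h, alpha=0.05):
    """Half-width of confidence interval (22) for theta_i - theta_j."""
    return t_quantile(1 - alpha / 2, ddf_h) * math.sqrt(2 * ms_den_or / r)
```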
5.2 Single test inference using all of the data
The results in this and the following section are presented without proof, but they can be derived using an approach similar to that used in Section 3.1. Define
An approximate (1-α)100% confidence interval for θi is given by
(24) |
and a test for H0: θi = θ0 can be made by comparing the corresponding t statistic to a t distribution with degrees of freedom ddfH_single. Equivalent expressions can be derived for the DBM procedure, but I omit those since I recommend instead the more robust approach described in the next section.
5.3 Single test inference using only corresponding data
A confidence interval and hypothesis test for θi can alternatively be based only on the data for the ith test. When we consider the data for test i, the OR model (1) reduces to a one-factor random effects ANOVA with the covariance for each pair of errors equal to Cov2. Since this single-test model makes no assumptions about the variances or covariances of the error terms corresponding to other tests, the resulting confidence interval should be more robust than confidence interval (24). Let Ĉov2^(i) denote the average of the fixed-reader covariance estimates for the ith test and define

MS(R)_i = (r−1)^{-1} Σ_j (θ̂_ij − θ̂_i·)²

and

ddfH,i = {MS(R)_i + r·max(Ĉov2^(i), 0)}² / ( [MS(R)_i]² / (r−1) ).

An approximate (1−α)100% confidence interval for θi is given by

θ̂_i· ± t_{α/2; ddfH,i} √{ [MS(R)_i + r·max(Ĉov2^(i), 0)] / r }.    (25)

A similar result is also given by Obuchowski and Rockette [9], but instead of ddfH,i they use r − 1; thus we see that ddfH,i yields a less conservative result, since ddfH,i ≥ r − 1.
Similarly, the DBM procedure reduces to a conventional reader×case ANOVA of pseudovalues when we consider only data for a single test. Let MS(R)_{i,pseudo}, MS(C)_{i,pseudo}, and MS(R*C)_{i,pseudo} denote the reader, case, and reader×case mean squares using only pseudovalues corresponding to the ith test. An equivalent expression for confidence interval (25) is given by

Ȳ_i·· ± t_{α/2; ddfH,i} √{ [MS(R)_{i,pseudo} + max(MS(C)_{i,pseudo} − MS(R*C)_{i,pseudo}, 0)] / (rc) },

where

MS(R)_{i,pseudo} = c·MS(R)_i

and

MS(C)_{i,pseudo} − MS(R*C)_{i,pseudo} = c·r·Ĉov2^(i).

A test for H0: θi = θ0 can be made by comparing either the OR-based or the equivalent DBM-based t statistic to a t distribution with degrees of freedom ddfH,i.
5.4 Fixed readers
If readers are treated as fixed for the OR model (1) then the only random effects are the error terms. Treating the variance and covariances of the error terms as known, the method of generalized least squares can be used, as discussed by Obuchowski and Rockette [9]. The test statistic, given by
has an approximate chi-squared distribution with t-1 degrees of freedom. Hillis et al [11] show that the DBM analysis, treating readers as fixed, will give the same test statistic when suitably normalized and approximately the same p-value for typical ROC studies. I see no problems with these methods in terms of the degrees of freedom.
6 SIMULATION STUDY
In a simulation study I compare ddfH, ddfO, and ddfD with respect to the empirical significance level for testing the null hypothesis of no test effect and with respect to the width of a 95% confidence interval for the difference of the test AUCs. The simulation model of Roe and Metz [22] provides continuous decision-variable outcomes generated from a binormal model that treats both cases and readers as random. I use this simulation model to simulate rating data, taking integer values from one to five, by transforming the continuous outcomes to discrete ratings using the same cutpoints as Dorfman et al [8]; the combinations of reader and case sample sizes, AUC values, and variance components are the same as those used in Roe and Metz [22] and Dorfman et al [8]. Briefly, rating data are simulated for 144 combinations of three reader-sample sizes (readers = 3, 5, and 10), four case-sample sizes (10+/90-, 25+/25-, 50+/50-, and 100+/100-, where “+” indicates a diseased case and “-” indicates a normal case), three AUC values (AUC = .702, .855, and .961), and four combinations of reader and case variance components. Two thousand samples are generated for each of the 144 combinations; within each sample, all Monte Carlo readers read the same cases for each of two tests. Since these are null simulations, the test effect in the model is set to zero. The simulation design is summarized in Table 3. [ Table 3 ]
Table 3.
Simulation study design.
Factor | Number of levels | Description of levels |
---|---|---|
reader-sample size | 3 | readers = 3, 5, 10 |
case-sample size | 4 | cases = 10+/90-, 25+/25-, 50+/50-, 100+/100- |
AUC | 3 | AUC = .702, .855, .961 |
variance components | 4 | HH, HL, LH, LL |
Notes: For each factor combination n = 2000 samples are simulated for t = 2 tests, with each reader reading the same cases using each test. “+”= diseased case, “-”= normal case. See Roe and Metz [22] for the definitions of HH, HL, LH, and LL.
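The structure of such a simulation can be sketched as follows: for test i, reader j, and case k, the continuous score is the sum of a truth-dependent separation parameter and independent normal reader, case, and interaction effects, after which the scores are binned into the five-point rating scale. This is a simplified stand-in for the Roe-Metz model, and the variance-component values below are hypothetical placeholders, not the HH/HL/LH/LL settings of Roe and Metz [22].

```python
import numpy as np

def simulate_roe_metz_like(t=2, r=5, n_pos=50, n_neg=50, delta=1.5,
                           s2_r=0.01, s2_c=0.1, s2_rc=0.2, s2_tc=0.1,
                           s2_e=0.59, cutpoints=(-0.5, 0.5, 1.5, 2.5),
                           rng=None):
    """Simulate ratings Z of shape (t, r, c) and 0/1 truth labels under a
    simplified binormal model with random reader and case effects (H0 holds:
    there is no test effect)."""
    rng = np.random.default_rng(rng)
    c = n_pos + n_neg
    truth = np.concatenate([np.zeros(n_neg, int), np.ones(n_pos, int)])
    mu = delta * truth                      # separation for diseased cases
    R = rng.normal(0, np.sqrt(s2_r), (1, r, 1))    # reader effects
    C = rng.normal(0, np.sqrt(s2_c), (1, 1, c))    # case effects
    RC = rng.normal(0, np.sqrt(s2_rc), (1, r, c))  # reader x case
    TC = rng.normal(0, np.sqrt(s2_tc), (t, 1, c))  # test x case
    E = rng.normal(0, np.sqrt(s2_e), (t, r, c))    # residual error
    score = mu[None, None, :] + R + C + RC + TC + E
    Z = 1 + np.digitize(score, cutpoints)          # discrete ratings 1..5
    return Z, truth
```

Each simulated sample would then be analyzed as described above to obtain F_OR and the three ddfs, with the empirical significance level given by the rejection proportion over the 2000 samples.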
In terms of the OR procedure, each sample is analyzed using two different methods for estimating the AUC and the quantities σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3. For one method I estimate the AUC using maximum likelihood estimation assuming a binormal model [23, 24] and obtain σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 using the jackknife method; for the other method I estimate the AUC using the trapezoidal-rule (trapezoid) method and obtain σ̂²_ε, Ĉov1, Ĉov2, and Ĉov3 using the DeLong method. These estimation methods are referred to as the MLE-jackknife and trapezoid-DeLong methods, respectively. For each method I compute F_OR, ddfH, ddfO, and ddfD using equations (5), (7), (4), and (18). The null hypothesis of no test effect is rejected if F_OR > F_{.05;1,ddf}. For each of the 144 combinations the empirical significance level is the proportion of samples for which the null hypothesis is rejected. For each simulation I also compute the width of the 95% confidence interval for the difference of the two test AUCs using equation (22) and report the mean width across the 144·2000 simulations.
Identical results can also be obtained using the DBM procedure, as discussed in Hillis et al [11]. MLE-jackknife results can be obtained by computing the normalized pseudovalues corresponding to the MLE AUC estimates, computing MS(T)pseudo, MS(T*R)pseudo, MS(T*C)pseudo, and MS(T*R*C)pseudo, computing F_DBM (14), which will be the same as F_OR, and then computing ddfO, ddfH, and ddfD using equations (4), (20), and (17). A 95% confidence interval for the difference of the two test AUCs is given by equation (23). Trapezoid-DeLong results can similarly be obtained using the DBM procedure with quasi pseudovalues, as described in Hillis et al [11]. I exploit these OR-DBM equivalence relationships by using the DBM procedure for the MLE-jackknife simulations and the OR procedure for the trapezoid-DeLong simulations. Simulations are performed using the IML procedure in SAS 9.1 [25] running under Windows XP. The MLE AUC pseudovalues are computed using a dynamic linked library (DLL), written in Fortran 90 by Don Dorfman and Kevin Schartz, that is accessed from within the IML procedure; this DLL is available on request.
The empirical significance levels from the simulation study are described in Table 4 and displayed in dot plots in Figure 1. The mean significance levels, with ranges indicated in parentheses, for ddfO, ddfH, and ddfD, respectively, are 0.018 (0.041), 0.051 (0.052), and 0.043 (0.050) for MLE-jackknife estimation, and 0.019 (0.047), 0.054 (0.051), and 0.047 (0.058) for trapezoid-DeLong estimation. Thus we see that the average significance levels for ddfH and ddfD are much closer to the nominal .05 level than that for ddfO, with ddfD slightly more conservative than ddfH in accord with relationship (19). The dot plots show the ultraconservative performance of ddfO: several significance levels are 0.00 and all of the significance levels are less than .05. For ddfH and ddfD the dot plots show bell-shaped distributions without outliers. For ddfH, 93% (134/144) of the significance levels are within the interval [0.03, 0.07] for both estimation methods, and the standard deviations of the significance levels, computed across the 144 combinations of design factors, are 0.011 (MLE-jackknife) and 0.010 (trapezoid-DeLong); for ddfD, 86% (124/144) and 92% (133/144) of the significance levels are within the interval [0.03, 0.07] for MLE-jackknife and trapezoid-DeLong estimation, respectively, and the standard deviations of the significance levels are 0.011 for both estimation methods. I conclude from the closeness of the mean significance level to the nominal level, the relatively small standard deviation, the high proportion of significance levels within .02 of the nominal level, and the absence of outliers that both ddfH and ddfD perform satisfactorily with respect to significance levels. [ Table 4 ] [ Figure 1 ]
Table 4.
Results of the simulation study for the 144 combinations of reader-sample size, case-sample size, AUC, and variance components.
Significance levels |
||||||||
---|---|---|---|---|---|---|---|---|
Estimation method | ddf method | N | Mean | Min | Max | Range | SD | CI width mean |
MLE-jackknife | O | 144 | 0.018 | 0.000 | 0.041 | 0.041 | 0.0144 | 0.231 |
H | 144 | 0.051 | 0.025 | 0.077 | 0.052 | 0.0105 | 0.173 | |
D | 144 | 0.043 | 0.017 | 0.067 | 0.050 | 0.0109 | 2.36E+121 | |
Trapezoid-DeLong | O | 144 | 0.019 | 0.000 | 0.047 | 0.047 | 0.0124 | 0.245 |
H | 144 | 0.054 | 0.032 | 0.082 | 0.051 | 0.0100 | 0.184 | |
D | 144 | 0.047 | 0.023 | 0.080 | 0.058 | 0.0105 | 2.74E+121 |
Notes: Min: minimum; Max: maximum; SD: standard deviation; CI width: width of a 95% confidence interval for the difference of the AUC estimates; MLE-jackknife: binormal maximum likelihood AUC estimation and jackknife covariance estimation; Trapezoid-DeLong: trapezoid AUC estimation and DeLong covariance estimation.
Figure 1.
Dot plots of the 144 empirical significance levels using MLE-jackknife and trapezoid-DeLong estimation with each ddf method. The nominal significance level is .05.
The extremely large mean confidence interval widths corresponding to ddfD in Table 4 (mean CI width: ddfD = 2.36E+121 vs. ddfH = 0.173 for MLE-jackknife estimation; ddfD = 2.74E+121 vs. ddfH = 0.184 for trapezoid-DeLong estimation) can be attributed to a small proportion of samples for which ddfD approaches zero. For example, ddfD ≤ 1 in 0.2% of the samples; when I set ddfD = 1 for these samples, the resulting mean confidence interval widths are comparable to (although somewhat larger than) those for ddfH. Each ddfD value less than one corresponds to an inadequate Satterthwaite ddf approximation, according to the adequacy guidelines provided by Gaylor and Hopper [21]. I conclude that although the ddfH and ddfD methods are comparable with respect to significance levels, ddfH performs better when confidence interval width is also considered.
Table 5 shows the mean empirical significance levels for ddfH by factor level for each of the four simulation study factors. The results suggest that high AUC (.961) is associated with mild conservatism, while low AUC (.702) and a low number of readers (3) are associated with mild liberalism. Accordingly, 17 of the 18 lowest empirical ddfH significance levels in Figure 1(a) and 18 of the lowest 19 empirical ddfH significance levels in Figure 1(b) correspond to AUC = .961; in contrast, 22 of the highest 23 values in Figure 1(a) and 21 of the 22 highest values in Figure 1(b) correspond either to AUC = .702 or readers = 3. [ Table 5 ]
Table 5.
Simulation study mean significance levels for ddfH by factor level.
| factor | factor level | N | MLE-jackknife (mean = 0.051) | Trapezoid-DeLong (mean = 0.054) |
|---|---|---|---|---|
| case-sample size | 10+/90- | 36 | 0.047 | 0.053 |
| | 25+/25- | 36 | 0.049 | 0.053 |
| | 50+/50- | 36 | 0.052 | 0.055 |
| | 100+/100- | 36 | 0.054 | 0.056 |
| reader-sample size | 3 | 48 | 0.056 | 0.060 |
| | 5 | 48 | 0.048 | 0.052 |
| | 10 | 48 | 0.048 | 0.051 |
| AUC | 0.702 | 48 | 0.057 | 0.060 |
| | 0.855 | 48 | 0.052 | 0.056 |
| | 0.961 | 48 | 0.043 | 0.047 |
| variance components | HH | 36 | 0.051 | 0.054 |
| | HL | 36 | 0.055 | 0.057 |
| | LH | 36 | 0.046 | 0.051 |
| | LL | 36 | 0.050 | 0.054 |
7 EXAMPLE
My example comes courtesy of Carolyn Van Dyke, MD. The study [26] compared the relative performance of single spin-echo magnetic resonance imaging (SE MRI) to cinematic presentation of MRI (CINE MRI) for the detection of thoracic aortic dissection. There were 45 patients with an aortic dissection and 69 patients without a dissection imaged with both SE MRI and CINE MRI. Five radiologists independently interpreted all of the images using a five-point ordinal scale: 1 = definitely no aortic dissection, 2 = probably no aortic dissection, 3 = unsure about aortic dissection, 4 = probably aortic dissection, and 5 = definitely aortic dissection.
The analysis of this study using trapezoid AUC estimates and DeLong covariance estimates is displayed in Table 6, and the empirical ROC curves are displayed in Figure 2. Although Table 6 displays only the analysis using the OR procedure, identical results can be obtained with the DBM procedure using quasi-pseudovalues. For testing H0: θ1 = θ2 we have FOR = 4.485, with ddfO = 4 (p = .102), ddfH = 15.07 (p = .051), and ddfD = 13.81 (p = .053). Thus ddfO and ddfH give somewhat different results, while ddfH and ddfD give similar results, with ddfH > ddfD in accord with relationship (19). [ Table 6 ] [ Figure 2 ]
Table 6.
Analysis of Van Dyke et al [26] data using trapezoid AUC estimation and DeLong covariance estimation for t = 2 tests and r = 5 readers.
a) Trapezoid AUCs:

| reader (j) | test 1 (CINE) | test 2 (Spin Echo) |
|---|---|---|
| 1 | 0.9196 | 0.9478 |
| 2 | 0.8588 | 0.9053 |
| 3 | 0.9039 | 0.9217 |
| 4 | 0.9731 | 0.9994 |
| 5 | 0.8298 | 0.9300 |
| average (θ̂i) | .8970 | .9408 |
b) ANOVA table:

| Source | df | Sum of squares | Mean square |
|---|---|---|---|
| T | 1 | 0.00479617 | 0.00479617 |
| R | 4 | 0.01534480 | 0.00383620 |
| T*R | 4 | 0.00220412 | 0.00055103 |
c) Covariance estimates computed from the DeLong covariance matrix:

σ̂ε² = .000792133, Ĉov1 = .000342009, Ĉov2 = .000339526, Ĉov3 = .000235850

d) FOR = MS(T)/[MS(T*R) + r(Ĉov2 − Ĉov3)] = 4.4849

e) Denominator degrees of freedom:

ddfO = (t − 1)(r − 1) = 4

ddfH = [MS(T*R) + r(Ĉov2 − Ĉov3)]² / {MS(T*R)² / [(t − 1)(r − 1)]} = 15.07

ddfD = 13.81 (Satterthwaite approximation computed from the DBM pseudovalue mean squares)
f) P-values for H0: θ1 = θ2:

| ddf method | ddf | p = Pr(F(t−1, ddf) ≥ FOR) |
|---|---|---|
| O | 4 | 0.1016 |
| H | 15.07 | 0.0512 |
| D | 13.81 | 0.0528 |
g) Single test 95% confidence intervals using only corresponding data:

| i | θ̂i | Ĉov2(i) | MS(R)(i) | MS(R)(i) + rĈov2(i) | df(i) | 95% CI for θi |
|---|---|---|---|---|---|---|
| 1 (CINE) | .8970 | .0004785 | .003083 | .005470 | 12.60 | .8970 ± .07169 |
| 2 (Spin Echo) | .9408 | .0002015 | .001305 | .002312 | 12.57 | .9408 ± .0466 |
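The test statistic and ddf values in parts (d)-(f) can be reproduced from the tabled mean squares and DeLong covariance estimates alone. The sketch below uses the explicit ddfH expression [MS(T*R) + r(Ĉov2 − Ĉov3)]² / {MS(T*R)² / [(t − 1)(r − 1)]}, which recovers the reported values to rounding accuracy:

```python
# Quantities from Table 6 (Van Dyke data, trapezoid AUCs, DeLong covariances).
t, r = 2, 5                  # number of tests and readers
ms_t  = 0.00479617           # MS(T)
ms_tr = 0.00055103           # MS(T*R)
cov2  = 0.000339526          # DeLong estimate of Cov2
cov3  = 0.000235850          # DeLong estimate of Cov3

denom = ms_tr + r * (cov2 - cov3)   # denominator of the OR test statistic
f_or  = ms_t / denom                # OR test statistic, part (d): 4.4849

ddf_o = (t - 1) * (r - 1)                                # ddfO = 4
ddf_h = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))  # ddfH: 15.07

print(round(f_or, 4), ddf_o, round(ddf_h, 2))
```

This prints 4.4849 4 15.07, matching parts (d) and (e) of the table.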
Figure 2.
Trapezoid ROC curves of the five readers for SE MRI and CINE MRI in the detection of aortic dissection. The average areas under the empirical curves are .941 (SE) and .897 (CINE).
Single test 95% confidence intervals using only the corresponding data, as given by equation (25), are given in part (g) of Table 6. We see that the confidence interval for CINE (.897 ± .0717) is roughly 50% wider than the interval for SE MRI (.941 ± .0466). When based upon the combined data using equation (24), the single test confidence intervals are instead given by θ̂i ± 0.0591, i = 1, 2. I prefer the first approach since it does not assume that the AUCs have the same variances and covariances for each test, as the combined-data approach does.
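As a numerical check on part (g) of Table 6, the interval ingredients can be recomputed from the tabled reader-level quantities. This is a sketch: the df expression used below, [MS(R) + rĈov2]² / {MS(R)² / (r − 1)}, is the Satterthwaite-style formula that reproduces the tabled 12.60 and 12.57 up to rounding of the inputs. Multiplying each standard error by the t quantile for df(i) then gives the tabled half-widths.

```python
import math

r = 5
# Reader-level quantities read from part (g) of Table 6: the within-test
# reader mean square MS(R)(i) and the DeLong Cov2 estimate for that test.
tests = {
    "CINE":      {"cov2": 0.0004785, "msr": 0.003083},
    "Spin Echo": {"cov2": 0.0002015, "msr": 0.001305},
}

results = {}
for name, q in tests.items():
    v  = q["msr"] + r * q["cov2"]             # MS(R)(i) + r*Cov2(i)
    df = v ** 2 / (q["msr"] ** 2 / (r - 1))   # Satterthwaite-style df(i)
    se = math.sqrt(v / r)                     # standard error of the mean AUC
    results[name] = (v, df, se)
    print(name, round(v, 6), round(df, 2), round(se, 5))
```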
8 DISCUSSION
The motivation for this paper was the recent finding by Hillis et al [11] that the DBM and OR procedures yield identical test statistics when based on the same accuracy measure and covariance estimation method, but their different ddf methods, ddfD and ddfO, can result in considerably different inferences. I proposed a new ddf estimator, ddfH, that overcomes problems with the ddfD and ddfO methods. I derived ddfH by showing that the null distribution of the OR test statistic can be approximated by an F(t − 1, ddfH) distribution. I showed how ddfH can be used with both the OR and DBM procedures and how it can also be derived from the DBM model. The p-value corresponding to ddfH will always be less than or equal to that corresponding to ddfD or ddfO.
Although ddfH can be derived from the DBM model assumptions, the derivation in terms of the OR model is more important because the OR model provides an acceptable conceptual model. In contrast, the DBM model does not provide an acceptable conceptual model because pseudovalues have no intrinsic meaning; hence Hillis et al [11] characterize the DBM model as a “working model” that allows one to fit the OR model using conventional ANOVA software.
In simulations ddfH performed better than ddfO with respect to significance level, with the ddfH mean significance level much closer to the nominal level. The shape and spread of the distributions of the significance levels were similar for the ddfH and ddfD methods, with ddfH closer to the nominal level than ddfD for binormal AUC estimation (mean significance levels: ddfH = .051, ddfD = .043) but deviating by approximately the same amount from the nominal level for trapezoid AUC estimation (mean significance levels: ddfH = .054, ddfD = .047). However, a drawback of ddfD was that confidence intervals were sometimes extremely wide because ddfD can be close to zero. For these reasons I concluded that ddfH performs better than either ddfD or ddfO. A SAS macro implementing the OR-DBM procedure using ddfH is available on request.
A limitation of this study is that I used only typical ROC sample sizes in the simulation study, where the number of readers is small (≤ 10) and the number of cases is moderate (≥ 50). Although the derivation of ddfH involved several approximations, I discussed in Section 3.1 why ddfH should perform adequately for the typical ROC diagnostic study where the number of cases is moderate to large, regardless of the number of readers. However, the performance of the new method may not be acceptable when the number of cases is smaller; certainly, further research is required before using it in that situation. Although the simulations examined only small reader sample sizes, I see no reason why the new method should not work satisfactorily when the number of readers is moderate or large, but again I caution that my simulations covered only typical ROC study sample sizes.
A topic to consider for future research is the robustness of the OR and DBM procedures to violations of the model assumptions. Since the procedures can be viewed as equivalent, only the less restrictive OR model assumptions need to be considered. The OR model assumes that the variances and covariances for the accuracy estimates do not vary by test. I conjecture that, similar to conventional ANOVA models, the test of the null hypothesis of no test effect will be fairly robust to departures from this assumption, but further investigation is required to support this conjecture. Although I would not expect single test confidence intervals based on this model to be robust to violations of this assumption, this problem can be circumvented by basing the confidence interval only on data for the corresponding test, as discussed in Section 5.3.
ACKNOWLEDGEMENTS
This research was supported by the National Institutes of Health, grant R01EB000863. I thank two anonymous referees for their excellent suggestions which greatly improved the paper, and also Nancy Obuchowski and Kevin Berbaum for their helpful suggestions in the final stage of preparing the manuscript. The views expressed in this article are those of the author and do not necessarily represent the views of the Department of Veterans Affairs.
A Appendix
In this section I show that the OR mean squares, MS(T), MS(R), and MS(T*R), have expectations as given in Table 1 and are independently distributed as follows: (t − 1)MS(T)/c1 ∼ χ²_{t−1}(λ), with

c1 = σTR² + σε² − Cov1 + (r − 1)(Cov2 − Cov3) (A1)

and noncentrality parameter λ = (r/c1) Σi (τi − τ̄)²; and (t − 1)(r − 1)MS(T*R)/c2 ∼ χ²_{(t−1)(r−1)}, where c2 = σTR² + σε² − Cov1 − Cov2 + Cov3. Furthermore, from Table 2 we see that E[MS(T) | H0] = c1; hence if H0: τ1 = ... = τt is true, then (t − 1)MS(T)/c1 ∼ χ²_{t−1}.
Let θ̂ = (θ̂11, ..., θ̂1r, ..., θ̂t1, ..., θ̂tr)′ denote the vector of outcomes for the OR model (1) and let θ denote the corresponding mean vector, where θij = μ + τi. Then θ̂ ∼ N(θ, Σ), where Σ = Cov(θ̂). I assume that Σ is positive definite.
I first show that Σ has the following form:

Σ = It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr), (A2)

where ⊗ denotes the Kronecker product operator, Ip denotes the p × p identity matrix, Mp denotes a p × p matrix of ones,

x1 = σR² + σTR² + σε² − Cov2, (A3)

x2 = Cov2, (A4)

and

x3 = σR² + Cov1 − Cov3. (A5)
It follows from the OR model (1) that Var(θ̂ij) = σR² + σTR² + σε² and

Cov(θ̂ij, θ̂i′j′) = σR² + Cov1 if i ≠ i′, j = j′; Cov2 if i = i′, j ≠ j′; and Cov3 if i ≠ i′, j ≠ j′.

Define θ̂ = (θ̂11, θ̂12, θ̂13, θ̂21, θ̂22, θ̂23)′. Then, for example, for t = 2 and r = 3 the covariance matrix of θ̂ is given by

Σ = [ Σ1 Σ2 ; Σ2 Σ1 ],

where Σ1 is the 3 × 3 covariance matrix of the outcomes within a test and Σ2 is the 3 × 3 matrix of covariances between the two tests. We can write

Σ = [ Σ1 0 ; 0 Σ1 ] + [ 0 Σ2 ; Σ2 0 ] = I2 ⊗ Σ1 + (M2 − I2) ⊗ Σ2, (A6)

where 0 denotes a matrix of zeros. Since Σ1 = x1I3 + x2M3 and Σ2 = x3I3 + Cov3M3, where x1, x2, and x3 are defined by equations (A3-A5), then substituting in (A6) gives

Σ = I2 ⊗ (x1I3 + x2M3) + (M2 − I2) ⊗ (x3I3 + Cov3M3),

which is equation (A2) with t = 2, r = 3. It can similarly be shown for arbitrary t and r that the covariance matrix for θ̂ is given by equation (A2).
Define Cp = Ip − p⁻¹Mp. Smith and Lewis [27] give matrix expressions for balanced ANOVA sums of squares; from their results we have (t − 1)MS(T) = θ̂′A1θ̂, (t − 1)(r − 1)MS(T*R) = θ̂′A2θ̂, and (r − 1)MS(R) = θ̂′A3θ̂, where A1 = Ct ⊗ (r⁻¹Mr), A2 = Ct ⊗ Cr, and A3 = (t⁻¹Mt) ⊗ Cr.
Let A and B be any two matrices and let tr(·) denote the trace function. Results that I utilize for the mean square distribution derivations are given below. For properties (a-c) I also assume that A and B are tr × tr symmetric matrices and that θ̂ ∼ N(θ, Σ).

(a) E(θ̂′Aθ̂) = tr(AΣ) + θ′Aθ

(b) θ̂′Aθ̂ ∼ χ²m(λ), where m = rank(AΣ) and λ = θ′Aθ, if and only if AΣ is idempotent

(c) θ̂′Aθ̂ and θ̂′Bθ̂ are distributed independently if and only if AΣB = 0

(d) 0 ⊗ A = A ⊗ 0 = 0

(e) aA ⊗ bB = ab(A ⊗ B)

(f) (A ⊗ C)(B ⊗ D) = (AB) ⊗ (CD)

(g) (A + B) ⊗ C = A ⊗ C + B ⊗ C; C ⊗ (A + B) = C ⊗ A + C ⊗ B

(h) rank(A ⊗ B) = rank(A) rank(B)

(i) if B is nonsingular then rank(AB) = rank(A)

(j) if A is idempotent then rank(A) = tr(A)

(k) CpCp = Cp

(l) MpMp = pMp

(m) MpCp = CpMp = 0

(n) A1, A2, and A3 are idempotent

(o) tr(A1) = t − 1; tr(A2) = (t − 1)(r − 1); tr(A3) = r − 1

(p) A1A2 = A1A3 = A2A3 = 0
Properties (a-c) are well known [28], (d-j) are standard matrix results [29], and properties (k-p) are easily derived.
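Properties (n)-(p) can also be checked numerically. The sketch below (plain Python, no external libraries) builds the quadratic-form matrices for t = 2 and r = 3, taking A1 = Ct ⊗ (Mr/r) and A3 = (Mt/t) ⊗ Cr (the standard balanced-ANOVA forms consistent with A2 = Ct ⊗ Cr and the traces in property (o)), and verifies idempotency, the traces, and the vanishing pairwise products.

```python
t, r = 2, 3   # a small case; the identities are algebraic, so any t, r works

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):   # Kronecker product
    return [[A[i][j] * B[p][q]
             for j in range(len(A[0])) for q in range(len(B[0]))]
            for i in range(len(A)) for p in range(len(B))]

def ones(p):      # Mp, the p x p matrix of ones
    return [[1.0] * p for _ in range(p)]

def centering(p): # Cp = Ip - Mp/p
    return [[(1.0 if i == j else 0.0) - 1.0 / p
             for j in range(p)] for i in range(p)]

def scale(c, A):
    return [[c * x for x in row] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def equal(A, B, eps=1e-12):
    return all(abs(x - y) < eps for ra, rb in zip(A, B) for x, y in zip(ra, rb))

A1 = kron(centering(t), scale(1.0 / r, ones(r)))   # Ct (x) (Mr/r)
A2 = kron(centering(t), centering(r))              # Ct (x) Cr
A3 = kron(scale(1.0 / t, ones(t)), centering(r))   # (Mt/t) (x) Cr

# (n) idempotency
assert all(equal(matmul(A, A), A) for A in (A1, A2, A3))
# (o) traces
assert abs(trace(A1) - (t - 1)) < 1e-12
assert abs(trace(A2) - (t - 1) * (r - 1)) < 1e-12
assert abs(trace(A3) - (r - 1)) < 1e-12
# (p) pairwise products vanish
zero = [[0.0] * (t * r) for _ in range(t * r)]
assert all(equal(matmul(X, Y), zero) for X, Y in ((A1, A2), (A1, A3), (A2, A3)))
print("properties (n)-(p) hold for t = 2, r = 3")
```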
To derive the distribution of MS(T), I show below that A1Σ = c1A1, where c1 is defined by equation (A1). Using property (a) we then have

E(θ̂′A1θ̂) = tr(A1Σ) + θ′A1θ = tr(c1A1) + θ′A1θ.

Since θij = μ + τi, then θ′A1θ = r Σi (τi − τ̄)²; also, from property (o) we have tr(c1A1) = c1(t − 1). Thus

E[MS(T)] = (t − 1)⁻¹E(θ̂′A1θ̂) = c1 + r Σi (τi − τ̄)²/(t − 1),

as given in Table 1. Since A1Σ = c1A1, then from property (n) it follows that (c1⁻¹A1)Σ = A1 is idempotent, and hence properties (b,i,j,o) imply θ̂′(c1⁻¹A1)θ̂ ∼ χ²_{t−1}(λ), where the noncentrality parameter is λ = θ′(c1⁻¹A1)θ = (r/c1) Σi (τi − τ̄)². It follows that (t − 1)MS(T)/c1 ∼ χ²_{t−1}(λ).

To show A1Σ = c1A1, using (A2) we have

A1Σ = [Ct ⊗ (r⁻¹Mr)][It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr)].

Using property (f) it follows that

A1Σ = Ct ⊗ [r⁻¹Mr(x1Ir + x2Mr)] + (CtMt − Ct) ⊗ [r⁻¹Mr(x3Ir + Cov3Mr)].

Using properties (d,e,g,l,m) we have

A1Σ = (x1 + rx2)[Ct ⊗ (r⁻¹Mr)] − (x3 + rCov3)[Ct ⊗ (r⁻¹Mr)] = c1A1,

since CtMt = 0 and c1 = x1 + rx2 − x3 − rCov3.
I similarly derive the distribution of MS(T*R) by showing below that A2Σ = c2A2, where

c2 = x1 − x3 = σTR² + σε² − Cov1 − Cov2 + Cov3.

Since θ′A2θ = 0, then similar to the derivation for the MS(T) distribution it follows that

E[MS(T*R)] = [(t − 1)(r − 1)]⁻¹tr(c2A2) = c2,

as given in Table 1, and θ̂′(c2⁻¹A2)θ̂ ∼ χ²_{(t−1)(r−1)}. Since E[MS(T*R)] = c2, then it follows that (t − 1)(r − 1)MS(T*R)/E[MS(T*R)] ∼ χ²_{(t−1)(r−1)}.

To show A2Σ = c2A2, using (A2) we have

A2Σ = (Ct ⊗ Cr)[It ⊗ (x1Ir + x2Mr) + (Mt − It) ⊗ (x3Ir + Cov3Mr)].

Using property (f) it follows that

A2Σ = Ct ⊗ [Cr(x1Ir + x2Mr)] + (CtMt − Ct) ⊗ [Cr(x3Ir + Cov3Mr)].

Using properties (d,e,g,m) we have

A2Σ = x1(Ct ⊗ Cr) − x3(Ct ⊗ Cr) = (x1 − x3)A2 = c2A2.
From property (c), θ̂′A1θ̂ and θ̂′A2θ̂ are independent if A1ΣA2 = 0. To show A1ΣA2 = 0, note that since A1Σ = c1A1, then A1ΣA2 = c1A1A2 = 0 using property (p). It follows that MS(T) and MS(T*R) are independent. Similarly, the distribution of MS(R) can be derived and it can be shown that MS(R) is independent of MS(T) and MS(T*R).
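Finally, the key identities A1Σ = c1A1 and A2Σ = c2A2 can be spot-checked numerically. In the sketch below (plain Python, with hypothetical variance-component values chosen only for illustration), Σ is built entrywise from the OR covariance structure (diagonal σR² + σTR² + σε², Cov2 for same test and different readers, σR² + Cov1 for same reader and different tests, Cov3 otherwise), and c1 and c2 are taken as the expected mean squares E[MS(T) | H0] and E[MS(T*R)] stated above.

```python
import itertools

t, r = 2, 3
# Hypothetical variance components and covariances (illustration only; any
# values giving a valid covariance matrix would do):
var_r, var_tr, var_e = 0.0015, 0.0002, 0.0008
cov1, cov2, cov3 = 0.00035, 0.00030, 0.00020

cells = list(itertools.product(range(t), range(r)))  # (test, reader) pairs
n = len(cells)

def sigma_entry(a, b):
    """Cov(theta-hat_ij, theta-hat_i'j') under the OR model."""
    (i, j), (ip, jp) = a, b
    if i == ip and j == jp:
        return var_r + var_tr + var_e     # variance of a single estimate
    if i == ip:
        return cov2                       # same test, different readers
    if j == jp:
        return var_r + cov1               # same reader, different tests
    return cov3                           # different test and reader

Sigma = [[sigma_entry(a, b) for b in cells] for a in cells]

# A1 = Ct (x) (Mr/r) and A2 = Ct (x) Cr, written entrywise.
d = lambda x, y: 1.0 if x == y else 0.0
A1 = [[(d(i, ip) - 1 / t) * (1 / r) for (ip, jp) in cells] for (i, j) in cells]
A2 = [[(d(i, ip) - 1 / t) * (d(j, jp) - 1 / r) for (ip, jp) in cells]
      for (i, j) in cells]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def close(A, B, eps=1e-12):
    return all(abs(x - y) < eps for ra, rb in zip(A, B) for x, y in zip(ra, rb))

c1 = var_tr + var_e - cov1 + (r - 1) * (cov2 - cov3)  # E[MS(T) | H0]
c2 = var_tr + var_e - cov1 - cov2 + cov3              # E[MS(T*R)]

assert close(matmul(A1, Sigma), [[c1 * x for x in row] for row in A1])
assert close(matmul(A2, Sigma), [[c2 * x for x in row] for row in A2])
print("A1*Sigma = c1*A1 and A2*Sigma = c2*A2 verified numerically")
```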
References
- 1. Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press; New York: 1982.
- 2. Metz CE. ROC methodology in radiologic imaging. Investigative Radiology. 1986;21:720–733. doi:10.1097/00004424-198609000-00009.
- 3. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988;240:1285–1293. doi:10.1126/science.3287615.
- 4. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in Diagnostic Imaging. 1989;29:307–335.
- 5. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi:10.1148/radiology.143.1.7063747.
- 6. Obuchowski NA, Beiden SV, Berbaum KS, Hillis SL, Ishwaran H, Song HH, Wagner RF. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Academic Radiology. 2004;11:980–995. doi:10.1016/j.acra.2004.04.014.
- 7. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology. 1992;27:723–731.
- 8. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Academic Radiology. 1998;5:591–602. doi:10.1016/s1076-6332(98)80294-8.
- 9. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Communications in Statistics: Simulation and Computation. 1995;24:285–308.
- 10. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Academic Radiology. 1995;2(Suppl 1):S22–S29.
- 11. Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Statistics in Medicine. 2005;24:1579–1607. doi:10.1002/sim.2024.
- 12. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–844.
- 13. Satterthwaite FE. Synthesis of variance. Psychometrika. 1941;6:309–316.
- 14. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics Bulletin. 1946;2:110–114.
- 15. Pavur R, Nath R. Exact F tests in an ANOVA procedure for dependent observations. Multivariate Behavioral Research. 1984;19:408–420. doi:10.1207/s15327906mbr1904_3.
- 16. Quenouille MH. Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B. 1949;11:68–84.
- 17. Quenouille MH. Notes on bias in estimation. Biometrika. 1956;43:353–360.
- 18. Tukey JW. Bias and confidence in not quite large samples. Annals of Mathematical Statistics. 1958;29:614.
- 19. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Academic Radiology. 2004;11:1260–1273. doi:10.1016/j.acra.2004.08.009.
- 20. Hillis SL, Berbaum KS. Monte Carlo validation of the Dorfman-Berbaum-Metz method using normalized pseudovalues and less data-based model simplification. Academic Radiology. 2005;12:1534–1541. doi:10.1016/j.acra.2005.07.012.
- 21. Gaylor DW, Hopper FN. Estimating the degrees of freedom for linear combinations of mean squares by Satterthwaite's formula. Technometrics. 1969;11:691–706.
- 22. Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Academic Radiology. 1997;4:298–303. doi:10.1016/s1076-6332(97)80032-3.
- 23. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal-detection theory and determination of confidence intervals-rating method data. Journal of Mathematical Psychology. 1969;6:487–496.
- 24. Dorfman DD. RSCORE II. In: Swets JA, Pickett RM, editors. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press; San Diego, CA: 1982. pp. 212–232.
- 25. SAS for Windows, Version 9.1. SAS Institute Inc.; Cary, NC: 2002–2003.
- 26. Van Dyke CW, White RD, Obuchowski NA, Geisinger MA, Lorig RJ, Meziane MA. Cine MRI in the diagnosis of thoracic aortic dissection. 79th RSNA Meetings; Chicago, IL; November 28 - December 3, 1993.
- 27. Smith JH, Lewis TO. Determining the effects of intra-class correlation on factorial experiments. Communications in Statistics, Part A: Theory and Methods. 1980;9:1353–1364.
- 28. Searle SR. Linear Models. Wiley; New York: 1971. pp. 55–59.
- 29. Harville DA. Matrix Algebra From a Statistician's Perspective. Springer; New York: 1997. pp. 335–338.