Medical Image Analysis. 2017 Dec;42:44–59. doi: 10.1016/j.media.2017.07.004

Designing image segmentation studies: Statistical power, sample size and reference standard quality

Eli Gibson a,b,c, Yipeng Hu b, Henkjan J Huisman a, Dean C Barratt b
PMCID: PMC5666910  PMID: 28772163

Highlights

  • A sample size calculation for segmentation accuracy studies is derived.

  • Parameters include accuracy difference, algorithm disagreement and a design factor.

  • A formula is derived to account for errors in the study reference standard.

  • A case study illustrates the application of the theory to a segmentation study design.

Keywords: Image segmentation, Segmentation accuracy, Statistical power, Reference standard


Abstract

Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources.

In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards.

The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.

1. Introduction

Demonstrating an improvement in segmentation algorithm accuracy typically involves comparison with an accepted reference standard, such as manual expert segmentations or other imaging modalities (e.g. histology). In many medical image segmentation problems, such segmentations are challenging due to the variable appearance of anatomical/pathological features, ambiguous anatomical definitions, clinical constraints, and interobserver variability. The resulting errors in the reference standards introduce errors in the performance measures used to compare segmentation algorithms, and can impact the probability of detecting a significant difference between algorithms, referred to as the statistical power (Beiden et al., 2000).

The cost and quality of a reference standard are affected by the time and effort devoted to segmentation accuracy, the sample size, and the number, background, experience and proficiency of the observers. For example, the PROMISE12 prostate MRI segmentation challenge used two reference standards (illustrated in Fig. 1): a high-quality reference standard manually segmented by one experienced clinical reader and verified by another independent clinical reader, and a low-quality reference standard segmented by a less experienced non-clinical observer. An alternative approach is to estimate a high-quality reference standard by combining independent segmentations from multiple observers using algorithms such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010). A third approach is to mitigate the errors in a lower-quality reference standard by increasing the sample size (Konyushkova et al., 2015; Top et al., 2011; Maier-Hein et al., 2014; Irshad et al., 2015). All three of these approaches, however, raise the cost of generating the reference standard, both logistically and economically.

Fig. 1. Left: Illustrative prostate MRI segmentations from the PROMISE12 prostate segmentation challenge (Litjens et al., 2014b) by two algorithms – A (blue) and B (yellow) – and the two manually contoured reference standards – L (red), which is of lower quality, and H (green), which is of higher quality. Compared to H, L oversegmented anteriorly where image information was ambiguous, affecting accuracy measurements of A and B using L. Right: Harder apical segmentations showing regions containing voxels with different combinations of segmentation labels A, B, L and H (an overbar denotes a negative classification). The statistical model underlying the derived sample size formula for segmentation evaluation studies is derived from probability distributions of these voxel-wise segmentation labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

There are clear trade-offs between the sample size of the study, the cost of generating the reference standard, and the reference standard quality. The optimal balance of these trade-offs depends on the relationship between the study design parameters and statistical power. However, standard power calculation formulae do not, in general, account for the quality of reference standard segmentations. Thus, there is a need for new formulae to quantify these relationships. As a first step towards this goal, this paper presents a new sample size calculation relating statistical power to the quality of a reference standard (measured with respect to a higher-quality reference standard). Such a formula can answer key questions in study design:

  • How many validation images are needed to evaluate a segmentation algorithm?

  • How accurate does the reference standard need to be?

In preliminary work (Gibson et al., 2015), we derived a relationship between statistical power and the quality of a reference standard for a simplified model that cannot account for correlation between voxels, and made a strong assumption that the reference and algorithm segmentation labels are conditionally independent given the high-quality reference standard. In the present paper, we build on our initial work to develop a generalized model that takes into account the correlation between voxels and the statistical dependence between algorithms and reference standards observed in segmentation studies.

The remainder of this paper outlines the derivation (Section 2.3), application (Sections 3 and 6) and validation (Sections 4 and 5) of a statistical power formula for image segmentation. Insights and heuristics derived from the formula and its validation, as well as limitations of the work, are discussed in Section 7. Appendix A and Appendix B present mathematical details of the derivations.

2. Sample size calculations in segmentation evaluation studies

The probability of a study correctly detecting a true effect depends in part on the sample size. A study with a sample size that is too small has a higher risk of missing a meaningful underlying difference, while one with a sample size that is too large may be more expensive than necessary. Sample size calculations relate the probability of a study correctly detecting a true effect to specified and estimated parameters of the study design (Mace, 1964). The sample size depends on the probability distribution of the test statistic under the null and alternate hypotheses. This distribution, in turn, depends on the statistical analysis being performed and on an assumed statistical model of the studied population.

We derive a sample size calculation for a specific analysis: comparing the mean segmentation accuracy — i.e. the proportion of voxels in an image that match the reference standard L — of two algorithms A and B that generate binary classifications of v voxels on n images using a paired Student’s t-test (Rosner, 2015) on the per-image accuracies. Specifically, this tests the null hypothesis that the mean segmentation accuracies of A and B (both measured by comparison to L) are equal against the alternative hypothesis that they are unequal. Paired t-test analyses such as this one are frequently performed in comparisons of segmentation accuracy (Caballero et al., 2014).
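As a concrete illustration of this analysis, the following minimal R sketch applies a paired t-test to per-image accuracies; the vectors acc_A and acc_B are made-up values standing in for the per-image proportions of voxels matching L, not data from any study.

```r
# Hypothetical per-image accuracies (proportion of voxels matching L)
# for the same six images segmented by algorithms A and B.
acc_A <- c(0.91, 0.88, 0.93, 0.90, 0.87, 0.92)
acc_B <- c(0.89, 0.86, 0.92, 0.88, 0.85, 0.90)

# Paired Student's t-test of the null hypothesis that the mean accuracies
# of A and B (both measured against L) are equal.
result <- t.test(acc_A, acc_B, paired = TRUE)
result$p.value
```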

2.1. Notation

Throughout this paper, we use the notation given in Table 1. Symbols used in this paper are summarized in Table 2.

Table 1.

Notation for mathematical symbols.

Type notation
Segmentation algorithms X (upper case non-italic)
Random variables and vectors X (upper case)
Realizations of random variables and constants x (lower case)
Vectors x (arrow accent);  〈 x, y 〉  (angle brackets)
Estimates x^ (circumflex accent)
Parameterized distributions X ∼ X(θ) (bold capital with parameters in parentheses)
Expectation of X E[X]
Conditional expectation of X given Z E(X|Z)
Conditional variance of X given Z σX|Z2
Conditional covariance of X and Y given Z cov(X, Y|Z)
Event X=1 x (bold lower case)
Event X=0 x¯ (bold lower case with bar)

Table 2.

Glossary of mathematical symbols.

Symbol  Support  Description

Experimental parameters
n  ℕ  Sample size
v  ℕ  Number of voxels per image
α  ℝ  Significance threshold (acceptable Type I error)
β  ℝ  1 − power (acceptable Type II error)
δ_MDD  [−1, 1]  Minimum difference to detect with specified power

Population parameters
p  [0, 1]³  Population average marginal probability for the per-voxel accuracy difference
δ  [−1, 1]  Population accuracy difference
ψ  [0, 1]  Probability that A and B disagree on a voxel label
δ_H  [−1, 1]  Population accuracy difference measured against high-quality reference standard H
p(a), p(b), p(l), p(h)  [0, 1]  Probabilities of voxel labels being 1 for a randomly selected voxel
ρ_{i,j}  [−1, 1]  Correlation between D_{k,i} and D_{k,j} given O_k
ρ̄_{i,j}  [0, 1]  Average ρ_{i,j} over all voxel pairs i and j
σ²_{O_1−O_{−1}}  [0, ψ − δ²]  Variance of the accuracy difference in the marginal probability prior
ω  ℝ⁺  Precision parameter of the Dirichlet distribution controlling inter-image variability

Random variables
A_{k,i}, B_{k,i}, L_{k,i}, H_{k,i}  {0, 1}  Segmentation label for the ith voxel in the kth image
O_k  [0, 1]³  Per-image prior on the average marginal probability
O_{k,i}  [0, 1]³  Per-voxel prior on the marginal probability
D_k  {−1, 0, 1}^v  Vector of per-voxel accuracy differences for the kth image
D_{k,i}  {−1, 0, 1}  Difference in accuracy for the ith voxel of the kth image
D  {−1, 0, 1}  Difference in accuracy for a random voxel
D̄_k  [−1, 1]  Per-image accuracy difference

Simulation variables
Dist_{i,j}  ℝ⁺  Distance between voxels i and j
σ_ρ  ℝ⁺  Scaling parameter to control spatial correlation in Monte Carlo simulations
d̄_k  [−1, 1]  Per-image accuracy difference of a simulated image
d_{k,i}  {−1, 0, 1}  Per-voxel accuracy difference of a simulated voxel

Other notation
p_1, p_0, p_{−1}  [0, 1]  Elements of p for values 1, 0 and −1
O_{k,1}, O_{k,0}, O_{k,−1}  [0, 1]  Elements of O_k for values 1, 0 and −1
O_{k,i,1}, O_{k,i,0}, O_{k,i,−1}  [0, 1]  Elements of O_{k,i} for values 1, 0 and −1
A, B, L, H    Segmentation sources denoting two algorithms, a low-quality and a high-quality reference standard
f    Design factor
t_p^{1}, t_p^{2}  ℝ  1- and 2-tailed critical values at probability p from a t-distribution
σ_0²  [0, 2]  Per-image accuracy difference variance under the null hypothesis
σ_alt²  [0, 2]  Per-image accuracy difference variance under the alternative hypothesis

[x, y] denotes real numbers between x and y; {x, y, z} denotes a set of possible values; a superscript x denotes a vector with x elements; ℕ denotes natural numbers; ℝ denotes real numbers; ℝ⁺ denotes positive real numbers.

2.2. Statistical model of segmentation

Our stochastic population model represents the joint distribution of possible segmentations by A, B, and L over a population of images. The data for one image from this population comprises binary segmentation labels (encoded as integers 0 or 1) assigned by A, B and L to each of the v voxels: a_{k,1}, …, a_{k,v}, b_{k,1}, …, b_{k,v}, l_{k,1}, …, l_{k,v}, where a_{k,i}, b_{k,i}, and l_{k,i} are the labels for the ith voxel in the kth image. The data for a study comprises n randomly sampled images, which we denote by the set of random variables {A_{k,1}, …, A_{k,v}, B_{k,1}, …, B_{k,v}, L_{k,1}, …, L_{k,v} | k = 1, …, n}, where A_{k,i}, B_{k,i}, and L_{k,i} are the random variables representing the labels for the ith voxel in the kth randomly sampled image.

2.2.1. Accuracy difference measures

We focus on three types of segmentation accuracy differences. First, the per-voxel segmentation accuracy difference for the ith voxel in the kth image is D_{k,i} = |B_{k,i} − L_{k,i}| − |A_{k,i} − L_{k,i}|. D_{k,i} can take on three values: 1 (when A_{k,i} = L_{k,i} ≠ B_{k,i}), 0 (when A_{k,i} = B_{k,i}) and −1 (when A_{k,i} ≠ L_{k,i} = B_{k,i}). Random vector D_k represents all D_{k,i} for the kth image. Second, the per-image accuracy difference is the proportion of correct voxel labels from algorithm A (with respect to reference standard L) minus the proportion of correct voxel labels from algorithm B (with respect to reference standard L): D̄_k = (1/v) Σ_{i=1}^{v} (1 − |A_{k,i} − L_{k,i}|) − (1/v) Σ_{i=1}^{v} (1 − |B_{k,i} − L_{k,i}|) = (1/v) Σ_{i=1}^{v} D_{k,i}. Third, the population average accuracy difference δ is the expected value E[D̄_k] for a randomly selected image in the population and, equivalently, δ = p(D = 1) − p(D = −1) for a randomly selected per-voxel accuracy difference D.
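The following R sketch makes these definitions concrete for a single image, assuming hypothetical binary label vectors a_k, b_k and l_k (the illustrative values below are not taken from any data set).

```r
# Hypothetical binary label vectors (length v) for one image, from
# algorithms A and B and the reference standard L.
a_k <- c(1, 1, 0, 1, 0, 0, 1, 1)
b_k <- c(1, 0, 0, 1, 1, 0, 1, 1)
l_k <- c(1, 1, 0, 1, 0, 0, 0, 1)

# Per-voxel accuracy differences D_{k,i} = |B - L| - |A - L|, in {-1, 0, 1}.
d_ki <- abs(b_k - l_k) - abs(a_k - l_k)

# Per-image accuracy difference: the mean of the per-voxel differences.
d_bar_k <- mean(d_ki)

# Averaging d_bar_k over a random sample of images estimates the
# population accuracy difference delta.
```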

2.2.2. Model distribution

For calculating power, the model (summarized in Table 3 and illustrated in Fig. 2) must encode the distribution of the metric analysed in the statistical analysis: the per-image accuracy difference D¯k. While D¯k depends on all three segmentations A, B and L, it can be expressed more simply as a unary function of Dk. Therefore, we consider the distribution of Dk directly, modeled as a v-dimensional correlated categorical distribution. To model this distribution, we follow the common convention of breaking down complex joint distributions into the mean, and multiple simpler sources of variation about the mean.

Table 3.

Model summary. These expressions summarize the nested model used in our derivations. The motivation and detailed description are given in Section 2.2.2.

O_k ∼ P(p), where E[O_k] = p
∀i: O_{k,i} ∼ O(O_k), where E(O_{k,i} | O_k) = O_k
∀i: D_{k,i} ∼ Categorical(O_{k,i})
∀i ≠ j: cov(D_{k,i}, D_{k,j} | O_k) = ρ_{i,j} √(σ²_{D_{k,i}|O_k} σ²_{D_{k,j}|O_k})
Fig. 2. The illustrated nested model shows, from left to right, (1) the prior distribution of per-image average marginal probabilities P(p) (shown on the triangular (standard 2-simplex) domain with axes O_{k,1} and O_{k,−1} shown and O_{k,0} implicitly defined as 1 − O_{k,1} − O_{k,−1}; darkness represents the probability density), (2) three different samples (i.e. three images) of per-image average marginal probabilities o_k (shown as arrows labelled o_1, o_2 and o_3), (3) three corresponding conditional prior distributions of per-voxel marginal probabilities O(O_k) for the three images (shown as in (1)), (4) nine different samples (i.e. nine voxels from the second image) of per-voxel marginal probabilities o_{k,i} (shown as unlabelled arrows), and (5) the categorical distributions for the nine voxels from the second image (shown as pie charts of the relative probabilities of the per-voxel accuracy differences p(d_{k,i} = 1 | o_{k,i}) [orange], p(d_{k,i} = 0 | o_{k,i}) [blue], and p(d_{k,i} = −1 | o_{k,i}) [red]). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The mean of Dk is defined by the joint distribution of the segmentation labels. Considering the joint distribution is important, because the algorithm and reference standard labels for a randomly selected voxel (A, B and L) may not be independent from each other, as they depend on the same image information and overlapping prior knowledge. The mean of Dk, therefore, encodes the inter-segmentation correlation in the population average marginal probabilities of the per-voxel accuracy difference D (marginalized over combinations of segmentations A, B and L yielding each difference value):

p(D=1) = p(A=1, B=0, L=1) + p(A=0, B=1, L=0); \quad p(D=0) = p(A=B); \quad p(D=-1) = p(A=1, B=0, L=0) + p(A=0, B=1, L=1). \qquad (1)

For example, when A and B are highly correlated, p(D=0) is higher, and when A and L are highly correlated, p(D=1) increases while p(D=−1) decreases. We consider the population average marginal probabilities as a model parameter p = ⟨p_1, p_0, p_{−1}⟩ = ⟨p(D=1), p(D=0), p(D=−1)⟩.

The variation of Dk about the mean is affected by three sources of variation:

  • intra-image inter-voxel correlation – two voxels in the same image may have correlated labels if, for example, they are adjacent or are commonly affected by the same image artifact.

  • inter-image variability – the expected segmentation performance for different images may vary, as one image may have features that are more or less challenging for a particular algorithm or observer than another image.

  • inter-voxel variability – two voxels in the same image may have different marginal probabilities depending on the image content; for example, voxels that are easy to segment would likely have the same labels for any algorithm, whereas more challenging voxels are more likely to show differences.

Both the inter-image variability and the intra-image inter-voxel correlation affect the covariance matrix of D_k. While the covariance matrix could be an explicit model parameter, interpreting the parameter is challenging because it conflates these different sources of correlation. Instead, we construct an over-parameterized nested model that allows us to separately represent inter-image variability and intra-image inter-voxel correlation. The key concept in this nested model is to introduce per-image priors (random variables O_k ∼ P(p)) on the average marginal probability for D_{k,i} within each image, in order to model inter-image variability. P(p) is a distribution of probability vectors (i.e. O_k lies in the open standard 2-simplex) with mean p. Then, for each image, the conditional distribution of D_{k,i} given O_k models the intra-image inter-voxel correlation. Specifically, we define the conditional covariance of D_k given O_k as

\mathrm{cov}(D_{k,i}, D_{k,j} \mid O_k) = \rho_{i,j} \sqrt{\sigma^2_{D_{k,i}|O_k}\, \sigma^2_{D_{k,j}|O_k}}, \qquad (2)

where ρ_{i,j} is a pair-wise Pearson correlation coefficient and σ²_{D_{k,i}|O_k} is the conditional variance of D_{k,i} given O_k.

To model the inter-voxel variability, each Dk, i has per-voxel priors (random variables Ok,i) defining its marginal probabilities. The conditional distribution of Ok,i given Ok is an arbitrary distribution O(Ok) of probability vectors with mean Ok.

2.3. Derivation of the sample size formula for segmentation

The general form of the sample size formula (Connor, 1987),

n = \frac{\left(t_{\alpha}^{\{2\}}\sqrt{\sigma_0^2} + t_{\beta}^{\{1\}}\sqrt{\sigma_{\mathrm{alt}}^2}\right)^2}{\delta_{\mathrm{MDD}}^2}, \qquad (3)

relates the sample size (n) to the variances (σ_0² and σ_alt²) of per-image accuracy differences under the null hypothesis (δ = 0) and alternate hypothesis (δ ≠ 0), acceptable study error rates (α and β), and the minimum detectable difference (δ_MDD) in population accuracy between algorithms A and B to detect with power (1 − β). t_α^{2} and t_β^{1} are two- and one-tailed critical values taken from the inverse cumulative distribution function of the t-distribution with n − 1 degrees of freedom. Of the parameters in Eq. (3), most are selected based on experimental design choices, but the variances of the per-image accuracy difference are derived from the statistical model.
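Because the critical values t_α^{2} and t_β^{1} themselves depend on the degrees of freedom n − 1, Eq. (3) is typically solved iteratively. The R sketch below is one minimal way to do this, assuming the two variances have already been estimated; the function name, starting value and iteration scheme are choices of this sketch, not part of the paper.

```r
# Iteratively solve the generic sample size formula (Eq. 3) for a paired
# t-test, updating the t critical values as the candidate n changes.
sample_size_generic <- function(sigma2_0, sigma2_alt, delta_mdd,
                                alpha = 0.05, beta = 0.20) {
  n <- 2  # smallest sample size for which a t-test is defined
  for (iter in 1:50) {
    t_a <- qt(1 - alpha / 2, df = n - 1)  # two-tailed critical value
    t_b <- qt(1 - beta, df = n - 1)       # one-tailed critical value
    n_new <- max(2, ceiling((t_a * sqrt(sigma2_0) +
                             t_b * sqrt(sigma2_alt))^2 / delta_mdd^2))
    if (n_new == n) break
    n <- n_new
  }
  n
}

# Example with illustrative variances and a 5% minimum detectable difference.
sample_size_generic(sigma2_0 = 0.0023, sigma2_alt = 0.0022, delta_mdd = 0.05)
```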

The variance of the per-image accuracy difference σ²_D̄ can be derived for any prior distribution of per-image average marginal probabilities (O_k ∼ P(p)) in terms of moments of the prior distribution by marginalizing out O_k and O_{k,i} (see Appendix A for a detailed derivation), yielding

\sigma^2_{\bar{D}} = \bar{\rho}_{i,j}\,(\psi - \delta^2) + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1 - O_{-1}}, \qquad (4)

where ψ = p_1 + p_{−1} is the population-wide probability that algorithms A and B disagree on the labeling of a voxel (see Fig. 3), σ²_{O_1−O_{−1}} (the variance of O_{k,1} − O_{k,−1} for the priors O_k) is a linear combination of moments of the prior distribution (σ²_{O_1−O_{−1}} = σ²_{O_1} − 2σ_{O_1,O_{−1}} + σ²_{O_{−1}}), and ρ̄_{i,j} = (Σ_{i,j} ρ_{i,j})/v² is the average of the intra-image inter-voxel correlation coefficients.

Fig. 3. Illustration of the relationship between the proportion of disagreement (ψ) and the accuracy difference (δ). In these four examples, segmentation algorithms A (blue) and B (yellow) both over-contour the circular object taken as the reference standard segmentation L (red), adding different perturbations that lower accuracy. When sets of segmentations have higher ψ and lower δ (as in the lower right), it is harder to detect accuracy differences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Substituting σ_alt² = σ²_{D̄|δ=δ_MDD} and σ_0² = σ²_{D̄|δ=0} (i.e. substituting δ = δ_MDD and δ = 0 into σ²_D̄) yields the segmentation sample size formula for accuracy differences with respect to reference standard L,

n = \left(t_{\alpha}^{\{2\}}\sqrt{\bar{\rho}_{i,j}\,\psi + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1-O_{-1}|\delta=0}} \;+\; t_{\beta}^{\{1\}}\sqrt{\bar{\rho}_{i,j}\,(\psi - \delta_{\mathrm{MDD}}^2) + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1-O_{-1}|\delta=\delta_{\mathrm{MDD}}}}\right)^2 \Big/\, \delta_{\mathrm{MDD}}^2. \qquad (5)

It is interesting to note that when there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 1/v) and no inter-image variability in marginal probabilities (i.e. σ²_{O_1−O_{−1}} = 0), Eq. (5) approaches the sample size formula for McNemar’s two-sample paired proportion test with nv samples (Connor, 1987).

2.3.1. Sample size with the Dirichlet prior distribution

To gain further insight into the sample size relationship, consider the special case where the prior distribution of per-image average marginal probabilities P(p) is a Dirichlet distribution (i.e. O_k ∼ Dirichlet(ω, p)), which represents inter-image variability with a single parameter: the precision ω (Minka, 2000). When ω is large, priors O_k are likely to be near p (i.e. there is little variation between images); when ω is small, priors O_k are distributed more diffusely (i.e. there is more variation between images). The Dirichlet prior distribution has three properties that make interpretation of the sample size relationship easier:

  • It is well-characterised as a model for variability in categorical probabilities, because it is the conjugate prior distribution of the categorical and multinomial distributions and thus commonly adopted in Bayesian analysis (Tu, Mosimann, 1962, Zhu, Zöllei, Wells, 2006)

  • Representing inter-image variability with a single parameter simplifies interpretation and facilitates parameter fitting with small pilot data sets.

  • σ²_{O_1−O_{−1}} for the Dirichlet prior distribution is proportional to ψ − δ², which simplifies the sample size formula.

For the Dirichlet prior distribution, σ²_{O_1} = (p_1 − p_1²)/(ω + 1), σ_{O_1,O_{−1}} = −p_1 p_{−1}/(ω + 1), and σ²_{O_{−1}} = (p_{−1} − p_{−1}²)/(ω + 1); therefore σ²_{O_1−O_{−1}} = (ψ − δ²)/(ω + 1). Substituting σ²_{O_1−O_{−1}} into Eq. (4) and simplifying algebraically gives the variance of the per-image accuracy difference under a Dirichlet prior:

\sigma^2_{\bar{D}} = \frac{1 + \omega\,\bar{\rho}_{i,j}}{\omega + 1}\,(\psi - \delta^2). \qquad (6)

Since σ²_D̄ is expressed in terms of δ, we can readily substitute σ_alt² = σ²_{D̄|δ=δ_MDD} and σ_0² = σ²_{D̄|δ=0} into Eq. (3) to get the sample size formula

n = \frac{1 + \omega\,\bar{\rho}_{i,j}}{\omega + 1}\left(t_{\alpha}^{\{2\}}\sqrt{\psi/\delta_{\mathrm{MDD}}^2} + t_{\beta}^{\{1\}}\sqrt{\psi/\delta_{\mathrm{MDD}}^2 - 1}\right)^2. \qquad (7)

Several aspects of this formula link to previous work. The term (1 + ω ρ̄_{i,j})/(ω + 1) is a type of design factor, denoted hereafter as f (analogous to the design factor in cluster-randomized trials (Kish, 1965)), modelling the inter- vs intra-image variability in accuracy differences (i.e. each image being one correlated cluster of voxel samples). When there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 1/v), Eq. (7) simplifies to the formula found in our preliminary analysis (Gibson et al., 2015). The term ψ/δ² is the squared coefficient of variation of D under the idealized assumption of completely independent voxels (i.e. f = 1/v), or equivalently, the statistical efficiency of estimating δ (Everitt and Skrondal, 2002). We thus refer to ψ/δ² hereafter as the idealized efficiency.
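A sketch of Eq. (7) in R follows, written in terms of the design factor f and the idealized efficiency ψ/δ_MDD²; the parameter values in the example are illustrative only, and the iterative handling of the t critical values is a choice of this sketch.

```r
# Sample size under the Dirichlet prior (Eq. 7), iterating because the
# t critical values depend on the degrees of freedom n - 1.
sample_size_dirichlet <- function(psi, delta_mdd, omega, rho_bar,
                                  alpha = 0.05, beta = 0.20) {
  f   <- (1 + omega * rho_bar) / (omega + 1)  # design factor
  eff <- psi / delta_mdd^2                    # idealized efficiency
  n <- 2
  for (iter in 1:50) {
    t_a <- qt(1 - alpha / 2, df = n - 1)
    t_b <- qt(1 - beta, df = n - 1)
    n_new <- max(2, ceiling(f * (t_a * sqrt(eff) + t_b * sqrt(eff - 1))^2))
    if (n_new == n) break
    n <- n_new
  }
  n
}

# Example: 15% disagreement, 5% minimum detectable difference, moderate
# inter-image variability (omega = 128) and inter-voxel correlation.
sample_size_dirichlet(psi = 0.15, delta_mdd = 0.05, omega = 128, rho_bar = 0.05)
```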

2.4. Incorporating reference standard quality

Conducting segmentation accuracy comparison studies using a lower-quality reference standard introduces an additional challenge: selecting the appropriate minimum detectable difference. On one hand, for the generic sample size formula (Eq. (3)) to be valid, δMDD must be measured with respect to the reference standard used in the study. On the other hand, the selection of δMDD depends on external clinical or technical requirements. Ideally, these requirements would be defined with respect to a high-quality reference standard H (with the MDD denoted δMDD, H), to most closely approximate the true requirement. If the high-quality reference standard can be used for the entire study, there is no conflict and δMDD, H can be used directly. If, however, a lower-quality reference standard is used, an appropriate δMDD needs to be selected. To resolve this dilemma, we have derived a formula to express δMDD for a low-quality reference standard as a function of δMDD, H, by characterizing the differences between the low- and high-quality reference standards (e.g. on a small pilot dataset).

The derivation, detailed in Appendix B, expresses δMDD in terms of the joint probability of segmentation labels of A, B, L and H; isolates the terms of this expression that equate to δMDD, H; and simplifies the remaining terms. This yields an equation for δMDD as a function of δMDD, H and estimable parameters representing deviation of δMDD from δMDD, H:

\delta_{\mathrm{MDD}} = \delta_{\mathrm{MDD,H}} + 2\,(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B,\, L - H), \qquad (8)

where p(x) = p(X = 1) for a randomly selected voxel and cov(A − B, L − H) is the covariance between errors in L (with respect to H) and differences between A and B. The second term of this expression reflects error induced by over- or under-contouring by L (with respect to H). If L tends to over-contour compared to H, algorithms that assign more voxels as foreground will appear more accurate. The third term is the covariance cov(A − B, L − H), reflecting errors in L that are biased in favour of A or B. This expression can be used to estimate the δ_MDD to use for a study using a low-quality reference standard.

3. Applying the sample size formula

The sample size formula derived above supports the design of segmentation accuracy comparison studies by estimating the sample size needed to detect a specified accuracy difference with high probability. As with all sample size calculations, three types of parameters have to be determined to apply the formula: the acceptable study error rates, the minimum detectable difference, and the variance parameters. Some of these parameters are chosen based on experimental, technical or clinical requirements outside the study design, while others are estimated from related literature or pilot data. We denote the estimate of parameter x as x^.

The acceptable error rates are generally set using heuristics by study designers: α=0.05 (i.e. a 5% probability of falsely detecting a difference when there is none) and β=0.2 (i.e. an 80% probability of detecting a true difference).

The minimum detectable difference (δMDD) is typically set by technical or clinical requirements outside the study design to be the smallest difference that is large enough to be important to detect with high probability. Specifically, if the true difference is δMDD or higher, the study should give a true positive with probability 1β or higher. If the study will use a sufficiently high-quality reference standard, δMDD can be chosen directly. If the technical or clinical requirements are expressed with respect to a high-quality reference standard, but the study uses a lower-quality reference standard, then δMDD, H can be chosen and the equivalent δ^MDD can be estimated from the low-quality correction equation (Eq. (8)), using parameter estimation equations (Eqs. (9) and (10)) given in Section 3.1.

The variance parameters depend on the distribution of the data; they are not chosen a priori, but can be estimated using values from related literature, or using pilot data. In the moment-based sample size equation (Eq. (5)), the variance parameters are ψ, ρ̄_{i,j}, σ²_{O_1−O_{−1}|δ=0} and σ²_{O_1−O_{−1}|δ=δ_MDD}. In the Dirichlet-prior-based sample size equation (Eq. (7)), the variance parameters are ψ, ρ̄_{i,j}, and ω. In general, estimating these variance parameters individually can be challenging because the model is parameterized by multiple parameters that affect the intervoxel covariance of per-voxel accuracy differences, and because the moments of the prior for the per-image average marginal probabilities may depend on δ. Under some assumptions, however, we can estimate variance parameters.

  • If we assume σ_0² = σ_alt² = σ̂²_D̄, which may be appropriate when δ and δ_MDD are sufficiently small, we can estimate σ̂²_D̄ from the pilot data (using Eq. (13) in Section 3.1), and apply the generic sample size equation (Eq. (3)) directly.

  • If we assume a parametric distribution for the per-image average marginal probabilities, it may be possible to express σ²_{O_1−O_{−1}} in terms of δ (as shown for the Dirichlet distribution in Eq. (6)) and estimate σ²_{O_1−O_{−1}|δ=0} and σ²_{O_1−O_{−1}|δ=δ_MDD} from σ̂²_D̄. For the Dirichlet distribution, the resulting variance could be characterized by a design factor modeling the combined effect of parameters ρ̄_{i,j} and ω. An estimation equation for the design factor is given in Section 3.1, Eq. (14).

  • If there is a need to estimate the effects of the variance parameters individually (e.g. to explore the effect of increased intra-image inter-voxel correlation on a planned study), and we assume that the intra-image inter-voxel correlation is spatially constrained (e.g. if voxels separated by a specified distance are effectively uncorrelated given O_k), then we can estimate ω̂ using spatially sparse sampling and then estimate ρ̄̂_{i,j} from ω̂ and σ̂²_D̄. This approach is outlined for a Dirichlet prior in Section 3.1.

The optimal size for a pilot study data set has not been well-established in general, and depends on many factors (Hertzog, 2008), including the particular population being studied. In principle, the precision of the estimated sample size depends on the sensitivity of the formula to parameter estimation errors (see supplementary material) and the variances of the parameter estimators (which decrease as the pilot data set grows), both of which vary depending on the population being studied. In practice, formal sample size calculations for such pilot studies are rarely used (Hertzog, 2008); instead, heuristics, such as using 10 samples (Nieswiadomy, 2011), 12 samples (Julious, 2005) or 10% of the anticipated size of the full study (Connelly, 2008; Lackey and Wingate, 1986) for larger studies, can be used. The risk of parameter estimation error can be mitigated using conservative parameter estimates, as described in Section 3.1 for σ̂²_D̄.

3.1. Parameter estimation equations

To estimate parameters from pilot data, a small data set of images must be collected and segmented by algorithms A and B, by the reference standard L to be used for the study, and by the high-quality reference standard H. Given a segmented pilot data set, formula parameters can be estimated as follows.

To estimate δ^MDD in terms of δMDD, H, we first estimate the proportion of positive voxels segmented by A across all images in the pilot data:

\hat{p}(a) = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v} a_{k,i}, \qquad (9)

where n′ is the number of images in the pilot data set. p̂(b), p̂(l), and p̂(h) can be estimated similarly. cov̂(A − B, L − H) can be estimated as

\widehat{\mathrm{cov}}(A - B,\, L - H) = \frac{1}{n'v - 1}\sum_{k=1}^{n'}\sum_{i=1}^{v}\left(a_{k,i} - b_{k,i} - \hat{p}(a) + \hat{p}(b)\right)\left(l_{k,i} - h_{k,i} - \hat{p}(l) + \hat{p}(h)\right). \qquad (10)

Then, from Eq. (8), δ̂_MDD = δ_MDD,H + 2(p̂(a) − p̂(b))(p̂(l) − p̂(h)) + 2 cov̂(A − B, L − H).
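The R sketch below applies Eqs. (9), (10) and (8) to pilot data stored as n′ × v binary matrices. The randomly generated matrices are placeholders for real pilot segmentations and ignore spatial structure; they are used only to show the shape of the computation.

```r
# Hypothetical pilot data: n' x v binary label matrices for algorithms A and
# B, the study reference standard L and the high-quality reference H.
set.seed(1)
n_pilot <- 10; v <- 500
a <- matrix(rbinom(n_pilot * v, 1, 0.25), n_pilot, v)
b <- matrix(rbinom(n_pilot * v, 1, 0.20), n_pilot, v)
l <- matrix(rbinom(n_pilot * v, 1, 0.21), n_pilot, v)
h <- matrix(rbinom(n_pilot * v, 1, 0.21), n_pilot, v)

# Eq. (9): proportions of positive voxels across all pilot images.
p_a <- mean(a); p_b <- mean(b); p_l <- mean(l); p_h <- mean(h)

# Eq. (10): sample covariance between (A - B) and (L - H) over all voxels
# (cov() uses the n'v - 1 denominator, matching Eq. 10).
cov_ab_lh <- cov(as.vector(a - b), as.vector(l - h))

# Eq. (8): translate a requirement stated with respect to H into the
# minimum detectable difference with respect to L.
delta_mdd_h <- 0.05
delta_mdd   <- delta_mdd_h + 2 * (p_a - p_b) * (p_l - p_h) + 2 * cov_ab_lh
```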

The probability of disagreement can be estimated using the sample mean as

\hat{\psi} = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v} |a_{k,i} - b_{k,i}|. \qquad (11)

The population average accuracy difference can be estimated using the sample mean as

\hat{\delta} = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v}\left(|b_{k,i} - l_{k,i}| - |a_{k,i} - l_{k,i}|\right). \qquad (12)

The variance in per-image accuracy differences can be estimated using the unbiased sample variance as

\hat{\sigma}^2_{\bar{D}} = \frac{1}{n' - 1}\sum_{k=1}^{n'}\left(\bar{d}_k - \hat{\delta}\right)^2, \qquad (13)

where d̄_k = (1/v) Σ_{i=1}^{v} (|b_{k,i} − l_{k,i}| − |a_{k,i} − l_{k,i}|). However, sample variance estimates from small pilot studies are imprecise and skewed (Browne, 1995), which inflates the probability of having an underpowered study. To mitigate this effect, Browne (1995) recommended using the upper bound of a γ% confidence interval on the variance to guarantee the specified power with γ% probability. This can be estimated using a double bootstrap method (e.g. Lee and Young, 1995, implemented for Matlab as ibootci (Penn, 2015)).
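Eqs. (11)–(13) translate directly into R. The sketch below assumes the same layout as above (images in rows, voxels in columns); the function name and the small random matrices standing in for real pilot segmentations are assumptions of this sketch.

```r
# Estimate psi (Eq. 11), delta (Eq. 12) and the per-image accuracy
# difference variance (Eq. 13) from pilot label matrices.
estimate_variance_params <- function(a, b, l) {
  psi_hat   <- mean(abs(a - b))                   # Eq. (11)
  d_bar     <- rowMeans(abs(b - l) - abs(a - l))  # per-image differences
  delta_hat <- mean(d_bar)                        # Eq. (12)
  list(psi = psi_hat, delta = delta_hat,
       sigma2_dbar = var(d_bar))                  # Eq. (13), unbiased variance
}

# Example with illustrative random matrices (5 pilot images, 200 voxels).
set.seed(2)
a <- matrix(rbinom(5 * 200, 1, 0.25), 5, 200)
b <- matrix(rbinom(5 * 200, 1, 0.20), 5, 200)
l <- matrix(rbinom(5 * 200, 1, 0.22), 5, 200)
estimate_variance_params(a, b, l)
```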

When modeling the per-image marginal probability prior as a Dirichlet distribution, the design factor encoding the combined effect of parameters ρ̄_{i,j} and ω can be estimated from Eq. (6) using sample estimates:

\hat{f} = \hat{\sigma}^2_{\bar{D}} \,\big/\, (\hat{\psi} - \hat{\delta}^2), \qquad (14)

and the idealized efficiency can be estimated as ψ̂/δ_MDD².

To estimate the effects of the variance parameters individually, we can model the per-image marginal probability prior as a Dirichlet distribution and assume that the intra-image inter-voxel correlation is spatially constrained (i.e. voxels more than x voxels apart are effectively uncorrelated given O_k). Sampling d_{k,i} from voxels spaced x voxels apart gives counts from a Dirichlet-multinomial distribution, and we can estimate the precision parameter ω̂ using an iterative approach described by Minka (2000). The average correlation coefficient can then be estimated from Eq. (6) using sample estimates as

\widehat{\bar{\rho}}_{i,j} = \frac{\hat{\sigma}^2_{\bar{D}}\,(\hat{\omega} + 1) - (\hat{\psi} - \hat{\delta}^2)}{(\hat{\psi} - \hat{\delta}^2)\,\hat{\omega}}. \qquad (15)
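For example, given pilot estimates of ψ, δ and σ²_D̄ and a separately estimated Dirichlet precision ω̂, Eqs. (14) and (15) reduce to simple arithmetic; the numeric values in this R sketch are illustrative only.

```r
# Illustrative pilot estimates (not from any data set).
psi_hat    <- 0.15    # probability of disagreement between A and B
delta_hat  <- 0.03    # accuracy difference
sigma2_hat <- 0.002   # per-image accuracy difference variance
omega_hat  <- 128     # Dirichlet precision (e.g. from sparse voxel sampling)

# Eq. (14): design factor.
f_hat <- sigma2_hat / (psi_hat - delta_hat^2)

# Eq. (15): average intra-image inter-voxel correlation.
rho_bar_hat <- (sigma2_hat * (omega_hat + 1) - (psi_hat - delta_hat^2)) /
               ((psi_hat - delta_hat^2) * omega_hat)
```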

4. Simulations

Three sets of Monte Carlo simulations were used to evaluate the accuracy of the sample size formulae under three different conditions:

  • 1.

    with simulated images and segmentations from the assumed statistical model, to test the validity of the model;

  • 2.

    with real-world data (the PROMISE12 prostate MRI segmentation data set described in Section 4.2.1) using a high-quality reference standard, to test the applicability of the Dirichlet-based sample size formula (Eq. (7)) to real data; and

  • 3.

    with real-world data using a low-quality reference standard while expressing the minimum detectable difference in terms of a high-quality reference standard, to test the applicability of the low-quality correction equation (Eq. (8)) to real data.

4.1. Simulations with simulated data from the assumed statistical model

In order to characterize the validity of the model described in Section 2.2, we performed sets of simulations with controlled variation of a subset of model parameters (hereafter referred to as a simulation set). Recall that Eq. (7) defines the sample size needed to detect a significant accuracy difference with probability 1 − β if the underlying population difference were δ_MDD. To test this, we set δ_MDD to the specified population accuracy difference, and compare the proportion of simulated studies yielding significant accuracy differences to 1 − β. Note that this approach to select δ_MDD is appropriate for validating the sample size formula, but not for designing real segmentation comparison studies: in practice, δ_MDD should be chosen based on clinical or technical requirements.

In each simulation, we repeatedly simulated a segmentation evaluation study by sampling per-voxel accuracy differences for ⌈n⌉ v-voxel segmentations and reference standards (where ⌈n⌉ denotes the smallest integer ≥ n) using the assumed model and testing for an accuracy difference using a Student’s t-test. In each simulation, we compared the observed proportion of positive statistical tests with the predicted probability (i.e. the statistical power 1 − β) for sample size ⌈n⌉. To clarify the impact of this error in power, we also substituted the observed power into the Dirichlet-based sample size formula (Eq. (7)) to calculate the equivalent error in the predicted sample size n and detectable difference δ_MDD. In each simulation, we ran 25,000 repetitions in order to estimate the probability of a positive outcome with a 95% confidence interval with a width of 1%.

Each per-image accuracy difference d¯ was computed by sampling the derived per-voxel accuracy differences dk, i directly as follows:

  • the marginal probability priors of per-voxel accuracy differences were drawn from a Dirichlet prior using the rdirichlet (Warnes et al., 2015) function in R version 3.1.1 (R Core Team, 2013),

  • a correlation matrix ρ_{i,j} = exp(−Dist_{i,j}/σ_ρ²) was constructed, where Dist_{i,j} is the intervoxel distance in a √v × √v voxel image and σ_ρ² is a scale parameter controlling the spatial extent of the correlation,

  • dk, i were sampled using the ordsample (Barbiero and Ferrari, 2015) function in R. While this is equivalent to drawing samples from the algorithm and reference standard segmentations and computing dk, i, it facilitates the direct control of the dk, i correlation matrix needed in these experiments.

The scripts used to generate these samples are available at https://github.com/eligibson/MedIA2016.
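As a simplified illustration of one simulated study, the R sketch below samples per-voxel accuracy differences with no intra-image inter-voxel correlation (the published scripts impose the correlation matrix via ordsample); the parameter values are illustrative, and the one-sample t-test on the per-image differences is equivalent to the paired t-test on per-image accuracies.

```r
library(gtools)  # for rdirichlet

# One simulated study: n images of v voxels, per-image marginal priors drawn
# from a Dirichlet distribution with mean p and precision omega.
simulate_study <- function(n, v, p, omega, alpha = 0.05) {
  d_bar <- numeric(n)
  for (k in 1:n) {
    o_k  <- as.vector(rdirichlet(1, omega * p))           # per-image prior
    d_ki <- sample(c(-1, 0, 1), v, replace = TRUE, prob = o_k)
    d_bar[k] <- mean(d_ki)                                 # per-image difference
  }
  t.test(d_bar)$p.value < alpha  # significant accuracy difference?
}

# The proportion of positive tests over many simulated studies approximates
# the statistical power; this p encodes delta = 6% and psi = 15%.
p <- c(0.045, 0.85, 0.105)   # probabilities of d = -1, 0, 1
mean(replicate(1000, simulate_study(n = 30, v = 36, p = p, omega = 128)))
```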

The baseline parameter values in the simulation sets and the ranges of varied parameters are given in Table 4. Note that the simulations varying v, ω, σ_ρ and ψ were conducted at two baseline δ values. The parameter ranges for these simulations were chosen to balance the applicability of parameter values to medical image segmentation problems against practical constraints. The range of ω encompassed both highly consistent and highly variable prior distributions. Ranges of δ and ψ reflected plausible algorithm differences based on previous experience. Due to limitations of the ordsample algorithm, the ranges of v and σ_ρ were constrained: v was limited to 100 because of the computational complexity of sampling high-dimensional correlated discrete random variables, and σ_ρ was constrained to 0.7 because of algorithmic constraints. The baseline parameter values were chosen to reflect typical sample sizes in segmentation studies (10–200). Because the population parameters derived in Section 2.4 (δ_H, p(a), p(b), p(l), p(h) and cov(A − B, L − H)) are linked to statistical power through their influence on the parameter δ, simulations were run as a function of δ, instead of simulating many combinations of parameters that map to the same δ.

Table 4.

Simulation parameters used to estimate the accuracy of the model. Note that the simulations varying v, ω, σρ and ψ were conducted twice at two baseline δ values.

Parameter: # voxels (v)  population accuracy difference (δ)  Dirichlet precision (ω)  spatial correlation width (σ_ρ)  population probability of disagreement (ψ)
Baseline: 36  3% / 6%  128  0.7  15%
Minimum: 9  2%  64  0  15%
Maximum: 100  10%  1024  0.7  45%
Increment: √v by +1  +1%  ×2  +0.1  +5%

4.2. Simulations with real-world data

To evaluate the applicability of sample size formula (Eq. (7)) and the low-quality correction equation (Eq. (8)) to a real-world data set, we simulated segmentation accuracy comparison studies using bootstrapped samples from the PROMISE12 data set.

The PROMISE12 challenge is an ongoing resource for comparing many state-of-the-art prostate segmentation algorithms against a common reference standard. The challenge images comprise 100 T2W prostate MR images collected from 4 centres, split into 50 training images (with publicly available reference segmentations) and 30 testing images (with reference segmentations withheld). The reference segmentations were manually segmented by an experienced clinical reader, and verified by another independent clinical reader. In order to establish a standardised scoring system for multiple metrics, the challenge had a non-clinical graduate student manually segment the images and her metric scores were used to normalize the metric scores of the algorithms. Although the PROMISE12 challenge principally used the high-quality reference standard for evaluation, the second segmentation is analogous to a presumably lower-quality reference standard that could be considered as a lower cost option. Thus, the clinical manual segmentations will represent the high-quality reference standard H, the graduate student manual segmentations will represent the low-quality reference standard L, and two algorithms from the challenge will represent A and B. Using 10 algorithms from the PROMISE12 challenge, the simulations were repeated for all 45 possible pairs of algorithms.

As in Section 4.1, we set δ_MDD to the population accuracy difference (treating the PROMISE12 test data set as the entire population) and compare the proportion of simulated studies yielding significant accuracy differences to 1 − β.

4.2.1. Simulations with high-quality real-world data

To evaluate the applicability of the Dirichlet-based sample size formula (Eq. (7)) to a real-world data set, each simulated study in this experiment compared two algorithms to the high-quality reference standard. For every pair of algorithms, we estimated the population accuracy difference (δ̂_H) and variance parameters using all 30 test cases from the PROMISE12 test data set. Using α = 0.05, β = 0.20, δ_MDD = δ̂_H, and the estimated variance parameters, we computed the predicted sample size n using Eq. (7). We then simulated 100,000 segmentation accuracy comparison studies using bootstrap sampling by sampling ⌈n⌉ images with replacement from the PROMISE12 images and testing the per-image accuracy differences using a paired Student’s t-test. We compared the proportion of positive tests to the power predicted by the model for ⌈n⌉ samples.
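A minimal R sketch of this bootstrap procedure is shown below, assuming hypothetical vectors acc_A and acc_B holding the per-image accuracies of the two algorithms against the reference standard for the full test set; the function name and defaults are assumptions of this sketch.

```r
# Estimate power by resampling n images with replacement and applying the
# paired t-test, repeated n_rep times.
bootstrap_power <- function(acc_A, acc_B, n, n_rep = 10000, alpha = 0.05) {
  positive <- replicate(n_rep, {
    idx <- sample(seq_along(acc_A), n, replace = TRUE)  # resample n images
    t.test(acc_A[idx], acc_B[idx], paired = TRUE)$p.value < alpha
  })
  mean(positive)  # proportion of significant results = estimated power
}

# Example usage (with per-image accuracy vectors for a pair of algorithms):
# bootstrap_power(acc_A, acc_B, n = 12)
```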

4.2.2. Simulations with low-quality real-world data

To evaluate the applicability of the low-quality correction equation (Eq. (8)) to a real-world data set, each simulated study in this experiment compared two algorithms to the low-quality reference standard, with δ̂_MDD calculated from Eq. (8) and the observed δ̂_H. Simulation using bootstrap sampling and evaluation proceeded as in Section 4.2.1 except that δ̂_MDD = δ̂_H + 2(p̂(a) − p̂(b))(p̂(l) − p̂(h)) + 2 cov̂(A − B, L − H), and the variance parameters were estimated with respect to the low-quality reference standard L.

5. Results

5.1. Simulations under the statistical model

The variance of accuracy differences predicted by the model (σD¯2) was within 2% relative error of the Monte Carlo simulations across all simulation sets (RMS relative error 0.5%). The predicted power was within 4% error (simulated – predicted power) of the Monte Carlo simulations across all simulation sets with 95% confidence.

Fig. 4 shows the absolute error in the predicted power (i.e., simulation - model power) under varying model parameters. The parameter with the largest impact on the accuracy of power prediction was δ. For simulations with baseline δ=3% and δ=6%, the predicted power was within 2% and 3% absolute error, respectively, of the simulations with 95% confidence. A larger positive bias in the power prediction error across all values of v, ω and σρ was observed for simulations with δ=6%, compared to simulations with δ=3%, suggesting that the positive bias can be primarily attributed to the baseline accuracy difference. The simulation with δ=10% had the largest absolute error of 4%.

Fig. 4. Model accuracy (95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the simulated and model power) for each simulation set. For example, with δ = 10%, the model predicted 82% power, 4% below the 86% power observed in the simulation. Each accuracy graph shows a blue line representing the expected error due to the observed skew alone (for the simulation varying δ and the baseline δ = 6%) based on applying the regular t-test sample size formula to a skewed Pearson distribution. The similar shape of this curve to the observed errors suggests that the skew is a considerable contributor to the error. The histogram (lower right) shows the distribution of accuracy differences for the simulation with δ = 10%, illustrating the slight but significant skew in the distribution, which contributes to the observed error. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

A proportion of the observed error can be attributed to skew in the distribution of per-image accuracy differences, deviating from the normality assumption of the t-test used in this work. The largest skew amongst our experiments (corresponding to the largest power prediction error) occurred when δ=10%; this is illustrated in a histogram of the accuracy differences, shown in Fig. 4. The effect of the deviation from normality is exacerbated in the simulations with large δ due to the lower sample size (n=8), for which the t-test is more sensitive to violations of its assumptions. To illustrate the expected impact of skew alone on the error in predicted power, Fig. 4 shows the error of the standard paired t-test power calculation for a correspondingly skewed population (Pearson distribution with skew matching the simulation) overlaid in blue.

The impact of these errors in predicted power on the sample size and minimum detectable difference is illustrated in Figs. 5 and 6.

Fig. 5. The equivalent error in predicted sample size (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the sample size needed to achieve the simulated power and the sample size needed to achieve the modeled power. For example, with δ = 10%, the model would overestimate by 1 the number of subjects needed to achieve the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. The equivalent error in predicted minimum detectable difference (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the minimum difference detectable with the simulated power and the minimum difference detectable with the modeled power. For example, with δ = 10%, the model would predict that a minimum detectable difference of 10.5% would result in the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.2. Simulations with high-quality real-world data

When the minimum detectable difference was defined and tested relative to the high-quality reference standard in the PROMISE12 data set, the simulated power was < 4% higher than the power specified by the model (approximately 80%) for the majority of algorithm comparisons (range 0–20%). The error was strongly correlated with the skew of per-image accuracy differences in the population (Spearman’s ρ = 0.77; p < 1 × 10⁻⁸). The model did not over-estimate the power in any comparison, suggesting that it is conservative (i.e. avoiding predictions that result in underpowered studies) in the presence of skew. The errors for each pair of algorithms are reported in Table 5.

Table 5.

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the high-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

B C D E F G H I J

A 3 (108) 1 (41) 12 (28) 2 (31) 14 (11) 1 (50) 2 (101) 13 (8) 7 (22)
B 10 (15) 1 (163) 1 (26418) 1 (35) 1 (1.8E6) 10 (28) 0 (14) 0 (157)
C 12 (11) 4 (10) 14 (9) 11 (12) 0 (42) 17 (5) 9 (6)
D 4 (102) 2 (50) 3 (115) 13 (14) 1 (15) 2 (3357)
E 7 (19) 1 (14084) 2 (11) 12 (8) 3 (95)
F 5 (23) 12 (10) 1 (312) 5 (48)
G 7 (16) 8 (10) 0 (97)
H 20 (5) 15 (8)
I 2 (17)

5.3. Simulations with low-quality real-world data

When the minimum detectable difference was defined relative to the high-quality reference standard and tested relative to the low-quality reference standard in the PROMISE12 data set, the model predicted the simulated power with a median error of 5% (simulated − predicted power; range −29% to 16%) and a median absolute error of 6% (|simulated − predicted power|). The two algorithm pairs with the smallest δ_MDD (0.1% and 0.2% accuracy differences) and largest sample sizes (5714 and 3721) had the largest errors, overestimating power by 27% and 29%, respectively. The error was correlated with the skew of per-image accuracy differences (Spearman’s ρ = 0.34; p = 0.02), and excluding the 2 cases with the smallest δ_MDD, the correlation was stronger (Spearman’s ρ = 0.67; p < 1 × 10⁻⁶). The errors for each pair of algorithms are reported in Table 6.

Table 6.

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the low-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

B C D E F G H I J

A 6 (43) 2 (167) 12 (22) 5 (133) 2 (11) 5 (25) 27 (5714) 15 (7) 9 (21)
B 8 (14) 6 (403) 8 (71) 5 (67) 12 (3598) 12 (24) 5 (17) 29 (3721)
C 11 (12) 11 (34) 8 (11) 7 (13) 2 (50) 10 (6) 13 (8)
D 6 (31) 2 (87) 4 (165) 13 (16) 0 (17) 6 (508)
E 0 (15) 2 (41) 1 (76) 11 (6) 6 (34)
F 0 (37) 5 (13) 4 (159) 4 (58)
G 4 (17) 6 (12) 8 (466)
H 13 (6) 16 (11)
I 5 (16)

6. Case study

The direct application of the sample size formula to calculate the sample size is described in Section 3. The formula can also be used indirectly to guide other aspects in the design of segmentation comparison studies. In this case study, we illustrate one such application: evaluating the cost (in terms of sample size vs cost per subject) of using a lower-quality reference standard manually segmented by a non-clinical graduate student instead of one generated by clinical collaborators. For illustration, this case study simulates the availability of a pilot data set by using two algorithms and the 30 test data sets from the PROMISE12 challenge.

To evaluate the cost of the two approaches, we can compare the sample sizes under the two reference standard strategies. The error rates and minimum detectable difference δMDD, H will be the same for both scenarios. We use commonly accepted Type I and II error rates: α=0.05 and β=0.20. The appropriate δMDD, H depends on the clinical or technical requirements; for example, in the context of prostate segmentation, the MDD could represent the minimal improvement in prostate segmentation accuracy that would make an automated prostate MRI computer-aided detection (CAD) system (e.g. Litjens et al., 2014a) clinically suitable as a first reader. In this case study, we suppose that an analysis of an existing CAD system suggests an improvement in accuracy of 5% (with respect to a high-quality reference standard) would be sufficient to make the system clinically suitable.

The variance parameters differ between the scenarios. To assess the scenario where the study uses a high-quality reference standard, we can estimate ψ̂, δ̂ and σ̂²_D̄ using A, B and H. Using Eqs. (11)–(13) with h_{k,i} in place of l_{k,i} gives ψ̂ = 13.4%, δ̂ = 4.02% and σ̂²_D̄ = 0.00231. Since δ̂ and δ_MDD are small relative to ψ̂, assuming σ_0² = σ_alt² = σ̂²_D̄ will yield similar results to assuming a Dirichlet prior (σ_0² = 0.00234 and σ_alt² = 0.00229). The resulting sample size to detect a difference δ_MDD,H = 5% was 9 subjects. To assess the scenario where the study uses a low-quality reference standard instead, we first estimate δ̂_MDD using A, B, L and H. The parameter estimation equations (Eqs. (9) and (10)) give p̂(a) = 0.246, p̂(b) = 0.195, p̂(l) = 0.210, p̂(h) = 0.214, and cov̂(A − B, L − H) = 0.29%, yielding δ̂_MDD = 0.0348. Using Eqs. (11)–(13) gives ψ̂ = 13.4%, δ̂ = 3.37%, and σ̂²_D̄ = 0.00253. The resulting sample size to detect a difference δ_MDD,H = 5% was 12 subjects.

Based on this analysis, we estimate that a study using this lower-quality reference standard would require 30% more subjects to detect a 5% improvement in accuracy than one using the high-quality reference standard. Since the cost per subject of generating the lower-quality reference standard is typically much lower, this could be a suitable approach for comparing these algorithms.

7. Discussion

In this work, we derived a sample size formula for studies comparing the segmentation accuracy of two algorithms, and also a relationship describing the effect of using lower-quality reference standards on the minimum detectable difference in segmentation accuracy. The formula accuracy was evaluated using Monte Carlo simulations, yielding errors in predicted power of less than 4% across a range of model parameters. The applicability of the formulae to real-world data was evaluated using bootstrap sampling from the PROMISE12 prostate MRI segmentation data set yielding median errors in predicted power less than 6%, but showed the error to be sensitive to skewed distributions and small sample sizes. A case study was also analyzed to illustrate the use of the formulae in a realistic context.

7.1. Validation in segmentation comparison studies

Improvements in the methodology for the validation and comparison of segmentation algorithms span a wide variety of approaches.

One avenue to improve segmentation validation is to develop improved metrics. Simple segmentation metrics such as accuracy, Dice overlap, Cohen’s Kappa, mean absolute boundary distances and Hausdorff distances compare segmentations to a single reference standard and are commonly used (Taha and Hanbury, 2015). Newer metrics allow comparisons to multiple reference standards (e.g. the validation index (Juneja et al., 2013)) or comparisons that consider application-specific utility (e.g. accuracy of quantitative measurements in segmented ROIs (Jha et al., 2012)). This latter concept can be taken further by validating segmentation through its impact on a larger system, such as the accuracy of a computer-assisted detection pipeline (Guo and Li, 2014). Model observers have also been developed to assess aspects of segmentation quality without a reference standard (Frounchi et al., 2011; Kohlberger et al., 2012), effectively creating a learned, reference-standard-independent segmentation metric.

Another avenue to improve segmentation validation is to improve the reference standard quality. Label fusion algorithms, such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010), enable the generation of higher-quality reference standards that combine information from multiple experts. Improvements in multimodal registration (Shah et al., 2009; Gibson et al., 2012) enable reference standards based on information that is less dependent on the image being segmented.

A third avenue is to increase the size of reference standards by reducing the cost per image, or via data augmentation. Active learning (Konyushkova et al., 2015; Top et al., 2011) and other interactive annotation tools reduce the cost of generating expert segmentations by partially automating the process. Crowdsourcing non-expert segmentations (Maier-Hein et al., 2014; Irshad et al., 2015) can cheaply generate reference standards on many images, using the large numbers to offset the potential loss in quality. For some anatomy, artificial data with reference segmentations can be generated by simulating the imaging process (Cocosco et al., 1997) or perturbing the geometry and image signal of existing images (Hamarneh et al., 2008).

This work, in contrast, aims to improve validation by enabling researchers to design efficient and appropriately powered studies. This work focuses on a particular analysis used in segmentation comparison studies: comparing the proportion of voxels where each of two segmentation algorithms agree with a single reference standard. The presented formulae can be directly applied by researchers developing new segmentation algorithms to facilitate the design of their studies. More broadly, this work has particular importance for work focused on improving reference standard quality and reference standard size by providing a framework for understanding the tradeoffs between quality and quantity in segmentation reference standards.

7.2. Accuracy and applicability of the sample size formulae

In typical study designs, the statistical power, i.e. the probability of detecting an accuracy difference of a specified size, is fixed heuristically at 80%, specifying that a 20% risk of missing a true effect is acceptable. Other study design parameters are optimized under this constraint, balancing costs and effect sizes. A study design with statistical power substantially above the acceptable level uses resources inefficiently, while one with lower power gives an unacceptable risk of false negatives. The largest errors observed in the model were for large accuracy differences. The variance predicted by the model matches the simulations to within 2%, suggesting that model errors are not primarily due to an incorrect variance prediction. Rather, the distribution of the accuracy differences in these simulation sets suggests that the error can be attributed to a combination of two factors: low sample size and skewness. The accuracy difference distribution under our statistical model, when using a Dirichlet prior, generally has non-zero skew when there are accuracy differences (i.e. |δ| > 0) and inter-image variability (ω < ∞), and the simulations show a skew as high as 0.3 in these simulation sets. The t-test, however, assumes samples are drawn from a normal distribution with 0 skew. While the t-test is robust to such deviations from normality at large sample sizes, large accuracy differences are more easily detectable and thus require small sample sizes. This suggests that segmentation comparison studies should be careful in their application of the t-test for studies with small sample sizes; in such cases, a McNemar test adjusted for clustered sampling (Gönen, 2004; Durkalski et al., 2003) may be more appropriate.

When applied to real-world data, the errors were generally larger than those observed under the statistical model. The errors were strongly correlated with the skew of the distribution of per-image accuracy differences, which is consistent with our observations on simulated data. This effect was particularly evident when the predicted sample size was low: five of the six largest observed errors (where the model underestimated power by 13–20%) corresponded to simulated studies with n < 10, which is also consistent with our observations on simulated data. In general, the model underestimated the simulated power, which could lead to inefficient resource usage but would not lead to failed studies caused by insufficient power. When using a low-quality reference standard with δMDD defined with respect to a high-quality reference standard, the error was also correlated with skew. However, in this context, another source of error must be considered: error in the estimation of δMDD. When the estimated minimum detectable difference was very small (|δ̂MDD| < 0.2%), small absolute estimation errors (|δ − δ̂MDD| < 0.06%) led to large relative estimation errors, resulting in large errors in the predicted power. When using a low-quality reference standard, the model over-estimated the simulated power for 10/45 of the algorithm pairs, suggesting that additional subjects may be needed when using this model to avoid underpowered studies.

The proposed approach for using low-quality reference standards presumes that a high-quality data set can be obtained, even if only as a small pilot data set, and that clinical or technical requirements on accuracy differences specified with respect to that reference standard are useful. In some medical segmentation tasks (such as prostate cancer delineation on MRI (Gibson et al., 2016) or mitosis detection on histology images (Chowdhury et al., 2006)), even expert segmentations are highly variable. For some tasks, it may be appropriate to combine segmentations from multiple experts by consensus or using a label fusion algorithm such as STAPLE to generate a high-quality reference standard for a pilot study; however, care should be taken to consider whether requirements specified with respect to the resulting reference standard will be practically useful.

7.3. Model interpretation

Although the sample size relationship is a continuous function of multiple parameters, it can be useful to break the parameters into coarse categories to see emerging trends (see Table 7). In particular, we focus on the special case of modeling the prior as a Dirichlet random variable, and examine the parameters that comprise the idealized efficiency ψ/δ² and the design factor f.

Table 7.

Number of images required to detect a desired segmentation accuracy difference. When compensating for the use of a lower-quality reference standard, use Eq. (8) to estimate the minimum detectable difference (δMDD) first.

                                       Design factor (f)
                                       0.01     0.05     0.1
Small differences (δMDD = 2%)
  ψ = 2%     (ψ/δMDD² = 50)              6*       21       41
  ψ = 11%    (ψ/δMDD² = 275)             24      110      218
  ψ = 20%    (ψ/δMDD² = 500)             41      198      394
Medium differences (δMDD = 5%)
  ψ = 5%     (ψ/δMDD² = 20)              3*       10       17
  ψ = 12.5%  (ψ/δMDD² = 50)              6*       21       41
  ψ = 20%    (ψ/δMDD² = 80)              8*       33       65
Large differences (δMDD = 10%)
  ψ = 10%    (ψ/δMDD² = 10)              3*        6*      10
  ψ = 15%    (ψ/δMDD² = 15)              3*        8*      14
  ψ = 20%    (ψ/δMDD² = 20)              3*       10       17

* Small sample sizes calculated from Eq. (7) are reported here; however, studies with such small sample sizes may be highly sensitive to violations of the assumptions of the t-test and are not recommended.

δMDD can be coarsely categorized into small (δMDD ≤ 2%), medium (2% < δMDD < 10%), and large (δMDD ≥ 10%) differences. Detecting small differences can require large (often infeasible) sample sizes, whereas detecting large differences may be limited not by δMDD but by the assumptions of the statistical analysis.

Within these effect size categories, the likelihood of disagreement between algorithms (ψ) plays an important role. ψ has the range δ ≤ ψ ≤ δ + 2·min(p(A ≠ L), p(B ≠ L)). When ψ ≈ δ, most of the differences between the algorithms correspond to the more accurate algorithm correcting the errors of the less accurate one while making few new errors. When ψ ≫ δ, the more accurate algorithm also makes new errors on voxels where the less accurate algorithm was correct. Table 7 shows three levels of disagreement: minimal disagreement (ψ = δMDD), large disagreement (ψ = 20%) and a midpoint between them. When δMDD is small, the level of disagreement can introduce an order of magnitude difference in required sample sizes.

The idealized efficiency is modulated by the design factor. The design factor ranges from 1/v (denoting that each voxel gives an independent estimate of accuracy differences) to 1 (denoting that each image gives an independent estimate of accuracy differences, but voxel segmentations are perfectly correlated). For realistic medical image segmentation algorithms, however, either of these extremes is unlikely. Table 7 shows three levels of the design factor: low correlation (f=0.01), medium correlation (f=0.05) and high correlation (f=0.1).

Our derivations show that sample sizes for studies comparing the accuracy of segmentation algorithms principally depend on the idealized efficiency ψ/δMDD², which relates the probability of voxel-wise disagreement (ψ) between algorithms to the minimum detectable difference δMDD, and on the design factor f, which reflects increased variability due to inter-voxel correlation and inter-image variability. The sample size is approximately proportional to the idealized efficiency ψ/δMDD². ψ has the range δ ≤ ψ ≤ δ + 2·min(p(A ≠ L), p(B ≠ L)), which suggests that it is easier, in general, to detect a given accuracy difference when at least one of the algorithms is highly accurate (lowering the upper bound on ψ). Furthermore, it is easier to detect a given accuracy improvement when algorithm A principally corrects errors made by algorithm B (where ψ ≈ δ, minimizing the idealized efficiency) than when algorithm A makes errors that are independent of B.
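
As a rough illustration of these relationships (and of how Table 7 can be read), the sketch below computes a back-of-the-envelope sample size from ψ, δMDD and f, assuming the standard normal-approximation form n ≈ (z1−α/2 + z1−β)²·f·ψ/δMDD². This approximation is assumed here only for illustration and is not a substitute for the exact formula (Eq. (7)), which yields slightly larger values at small n.

import math
from scipy import stats

def approx_sample_size(delta_mdd, psi, f, alpha=0.05, power=0.80):
    # Normal-approximation sample size for a paired comparison of voxel-wise
    # segmentation accuracy, assuming n is proportional to the idealized
    # efficiency psi / delta_mdd^2 scaled by the design factor f.
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return math.ceil(z ** 2 * f * psi / delta_mdd ** 2)

# Medium difference, moderate disagreement, medium correlation:
print(approx_sample_size(delta_mdd=0.05, psi=0.125, f=0.05))  # 20; Table 7 reports 21 from the exact formula
# Small difference, large disagreement, high correlation:
print(approx_sample_size(delta_mdd=0.02, psi=0.20, f=0.1))    # 393; Table 7 reports 394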

Although intuition would suggest that using lower-quality reference standards should consistently increase the required sample size, our derivations and simulations suggest a more complex relationship. The impact of errors in the reference standard is reduced by using a paired analysis, which excludes variance due to factors that affect both algorithms in the same way, such as reference standard errors in voxels where the algorithms agree. Reference standard errors in regions of disagreement, however, do affect the variance of per-image accuracy differences (σD̄² = ((1 + ω·ρ̄i,j) / (ω + 1))·(ψ − δ²) from Eq. (6)). In the rightmost factor of this equation, ψ (which does not depend on the reference standard) is generally much larger than δ² (see Table 7), suggesting that the impact of reference standard errors on variance is predominantly via changing the design factor. Reference standard errors also affect the sample size (Eq. (8)) by altering the detectable accuracy difference when the reference standard has errors that are biased in favour of one algorithm, or when it has systematic over- or under-contouring and one algorithm contours more foreground than the other. Relatively speaking, systematic over- or under-contouring will have only a small impact on the detectable accuracy differences unless the algorithms' foreground proportions are very different: for example, if A contours 5% more foreground than B, then 10% over-contouring by L (25× that observed in the PROMISE12 data) will change the measured accuracy difference by only 0.5%, unless the contouring errors are biased towards one algorithm. Furthermore, errors in the reference standard that are biased towards one algorithm do not necessarily decrease power: reference standard errors biased towards the more accurate algorithm will exaggerate the true difference, increasing power at the expense of increased type I error.1

These observations were reflected in our analysis of the PROMISE12 challenge data (see Tables 5 and 6). Comparing the low-quality to the high-quality reference standard, the root-mean-squared relative error in f̂ was 4%, compared to 0.3% for ψ̂ − δ̂². Because the low-quality reference standard had substantial agreement with the high-quality one (96% ± 1% mean ± SD accuracy), the effect of sample biases in reference standard errors was observable: for 17/45 pairs of algorithms, the studies designed to use the low-quality reference standard actually needed fewer subjects than studies using the high-quality reference standard; in all of these cases, there were slight sample biases in the low-quality reference standard towards the more accurate algorithm (primarily, as expected, in the covariance term in Eq. (8)). This increased |δMDD| relative to |δH| (i.e. the underlying differences between the algorithms were exaggerated and thus easier to detect). Because the experimental design for evaluating the model on real data required δMDD = δH, which was very small for some comparisons ( < 2% in 20/45 algorithm pairs and  < 0.5% in 4 algorithm pairs), this effect was magnified. Overall, our analysis of the PROMISE12 data aligns well with our theoretical model. Based on our analysis, using reference standards that are lower quality but unbiased may be a suitable approach for comparing segmentation algorithm accuracy.

7.4. Limitations

The contributions of this work should be considered in the context of its limitations. First, the sample size calculation presented in this work is specific to the statistical analysis (the paired Student’s t-test) and to the accuracy metric (the proportion of voxels matching the reference standard). Further work is needed to develop these formulae for other analyses and metrics. Second, our correlation model is over-parameterized, representing inter-image variability and intra-image inter-voxel correlation separately, when their effects on the covariance of Dk are coupled. This complicates the estimation of parameters, but yields formulae expressed in concepts familiar to the image analysis community. Third, due to constraints on sampling from specified high-dimensional correlated discrete distributions, we were unable to generate Monte Carlo simulations testing the extremes of some parameter ranges (e.g. high numbers of voxels and high inter-voxel correlation). Because the metric analysed in the study, D̄k, is a mean over voxels (which becomes more precise with higher v), and because we did not observe an increase in error as v increased from 9 to 100, we do not anticipate notable differences in model performance at larger v. Fourth, our application of the formulae to real segmentation studies was limited by the public availability of data sets with high- and low-quality reference standards; the PROMISE12 data set used in our case study is a rare example of such data. Finally, the sensitivity of the formula to violations of its underlying assumptions was not estimated; future work in this area could clarify which of these assumptions are critical to the accuracy of the formula and which could be relaxed.

8. Conclusions

In this work, we derived formulae to address two interrelated questions in the design of studies comparing segmentation algorithms: How many validation images are needed to evaluate a segmentation algorithm? and How accurate does the reference standard need to be? The sample size formula predicted the power of simulated segmentation studies to within 4% across a range of model parameters and, when applied to the PROMISE12 prostate segmentation challenge data, predicted the power to within a median error of 6%. In addition to their direct application in calculating sample sizes, the formulae offer several insights for study design. First, it is generally easier to detect a given accuracy difference when at least one algorithm is highly accurate, as this reduces accuracy variability. Second, it is generally easier to detect a given accuracy difference when one algorithm principally corrects the errors of another than when the two algorithms make independent errors. Third, systematic over- or under-contouring by a low-quality reference standard does not substantially affect accuracy measurements unless one algorithm tends to contour more voxels as foreground than the other, but correlation between reference standard errors and algorithm differences can bias accuracy measurements. These formulae, together with the parameter estimation equations and guidelines that facilitate their use, can enable researchers to make statistically motivated decisions about their study design and choice of reference standard, and to make the most efficient use of limited research resources.

Acknowledgements

This work was supported by the UK Medical Research Council, Radboud University Medical Centre and the Canadian Institutes of Health Research. Yipeng Hu is funded by Cancer Research UK and the UK Engineering and Physical Sciences Research Council (EPSRC) as part of the UCL-KCL Comprehensive Cancer Imaging Centre.

Footnotes

1. Because of this, care should be taken when estimates of this bias (Eq. (10)) are not substantially smaller than δH.

Appendix A. Derivation of the variance of the accuracy difference

The variance of the per-image difference in accuracy, $\sigma^2_{\bar D}$, affects the statistical power of segmentation accuracy comparison experiments. This appendix derives an expression for $\sigma^2_{\bar D}$, based on the statistical model described in Section 2, for any prior distribution of the per-image average marginal probabilities ($O_k \sim P(p)$), in terms of moments of that prior distribution.

A1. Statistical segmentation model and notation reiterated

The per-image difference in accuracy is $\bar{D}_k = \frac{1}{v}\sum_{i=1}^{v} D_{k,i}$, where v is the number of voxels and the random variable $D_{k,i}$ is the per-voxel segmentation accuracy difference for the ith voxel in the kth image, defined as $D_{k,i} = |B_{k,i} - L_{k,i}| - |A_{k,i} - L_{k,i}|$. Random variables $A_{k,i}$, $B_{k,i}$ and $L_{k,i}$ are segmentation labels for the ith voxel in the kth image from segmentation algorithms A and B and reference standard L, respectively.
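
For concreteness, a minimal sketch of these quantities for a single image, assuming flattened 0/1 numpy label arrays (the example labels are hypothetical):

import numpy as np

a = np.array([1, 1, 0, 0, 1])  # algorithm A labels for one image (hypothetical)
b = np.array([1, 0, 0, 1, 1])  # algorithm B labels
l = np.array([1, 1, 0, 0, 0])  # reference standard labels

d = np.abs(b - l) - np.abs(a - l)  # per-voxel accuracy differences D_{k,i}, each in {-1, 0, 1}
d_bar = d.mean()                   # per-image accuracy difference \bar{D}_k
print(d, d_bar)                    # [0 1 0 1 0] 0.4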

The statistical model, motivated and described in Section 2, models the distribution of the random vector of per-voxel accuracy differences $\mathbf{D}_k = \langle D_{k,1}, \ldots, D_{k,v} \rangle$ as a v-dimensional correlated categorical distribution with three categories (−1, 0, and 1). The marginal probabilities $O_{k,i} = \langle O_{k,i,-1}, O_{k,i,0}, O_{k,i,1} \rangle$ of the categorical distribution are identically distributed random probability vectors with mean $O_k = \langle O_{k,-1}, O_{k,0}, O_{k,1} \rangle$, but no other constraint on the shape of the distribution. The covariance of the categorical distribution given $O_k$ is defined such that $\mathrm{cov}(D_{k,i}, D_{k,j} | O_k) = \rho_{i,j} \sqrt{\sigma^2_{D_{k,i}|O_k}\, \sigma^2_{D_{k,j}|O_k}}$, where $\sigma^2_{D_{k,i}|O_k}$ is the conditional variance of $D_{k,i}$ given $O_k$. Priors $O_k$ are independently and identically distributed random variables sampled for each image with mean p (the population mean probability vector).

A2. Derivation of $\sigma^2_{\bar D}$ in terms of moments of the priors $O_k$

To derive $\sigma^2_{\bar D}$ under this model, we express the covariance matrix of the variables $D_{k,i}$ in terms of $E(D_{k,i}|O_{k,i}) = O_{k,i,1} - O_{k,i,-1}$ and $E(D_{k,i}^2|O_{k,i}) = O_{k,i,1} + O_{k,i,-1}$, marginalize out the prior parameters $O_{k,i}$ and $O_k$ to give an expression in terms of moments of $O_k$, and express $\sigma^2_{\bar D}$ as the average of the covariance matrix elements.

Because $\bar{D}_k = \frac{1}{v}\sum_{i=1}^{v} D_{k,i}$, $\sigma^2_{\bar D}$ can be expressed as

\sigma^2_{\bar D} = \frac{1}{v^2}\sum_{i,j} \mathrm{cov}(D_{k,i}, D_{k,j}),    (A.1)

for a random image k. By the law of total covariance, $\mathrm{cov}(D_{k,i}, D_{k,j})$ can be expressed in terms of conditional probabilities given $O_k$ as the sum of two components,

\sigma^2_{\bar D} = \frac{1}{v^2}\sum_{i,j} \Big[ \mathrm{cov}\big(E(D_{k,i}|O_k),\, E(D_{k,j}|O_k)\big) + E\big[\mathrm{cov}(D_{k,i}, D_{k,j}|O_k)\big] \Big],    (A.2)

where E(X|Y) denotes the conditional expectation of X given Y, and cov(X, Y|Z) denotes the conditional covariance of X and Y given Z. The two components can be expressed in terms of moments of $O_k$ by

  1. expressing each component in terms of the marginal probabilities $O_{k,i}$,
  2. marginalizing them over $O_{k,i}$ to express them in terms of $O_k$, and
  3. marginalizing them over $O_k$ to express them in terms of moments of $O_k$.

It is helpful to first note that $E(D_{k,j}|O_k) = E(D_{k,i}|O_k)$ and $\sigma^2_{D_{k,i}|O_k} = \sigma^2_{D_{k,j}|O_k}$, since $O_{k,i}$ and $O_{k,j}$ are identically distributed given $O_k$. The first term of Eq. (A.2) represents the covariance due to variability of the prior, and can be simplified following the three steps above (shown in Eqs. (A.3), (A.4) and (A.5)) with details shown below:

\mathrm{cov}\big(E(D_{k,i}|O_k),\, E(D_{k,j}|O_k)\big) = \mathrm{var}\big(E(D_{k,i}|O_k)\big) = \mathrm{var}\left(\int E(D_{k,i}|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i}\right)    (A.3)
= \mathrm{var}\left(\int (O_{k,i,1} - O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i}\right) = \mathrm{var}(O_{k,1} - O_{k,-1})    (A.4)
= \sigma^2_{O_1} + \sigma^2_{O_{-1}} - 2\sigma_{O_1,O_{-1}},    (A.5)

where $\sigma^2_{O_1}$ and $\sigma^2_{O_{-1}}$ are the variances of $O_{k,1}$ and $O_{k,-1}$, and $\sigma_{O_1,O_{-1}}$ is the covariance of $O_{k,1}$ and $O_{k,-1}$.

The second component of Eq. (A.2) represents the covariance due to sampling the marginal probability and per-voxel accuracy difference variables, and can be simplified following the three steps above (shown in Eqs. (A.6), (A.7) and (A.8)) with details shown below:

E\big[\mathrm{cov}(D_{k,i}, D_{k,j}|O_k)\big] = E\big[\rho_{i,j}\,\sigma_{D_{k,i}|O_k}\,\sigma_{D_{k,j}|O_k}\big] = E\big[\rho_{i,j}\,\sigma^2_{D_{k,i}|O_k}\big] = \rho_{i,j}\, E\big[E(D_{k,i}^2|O_k) - E(D_{k,i}|O_k)^2\big]
= \rho_{i,j}\, E\Big[\int E(D_{k,i}^2|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i} - \Big(\int E(D_{k,i}|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i}\Big)^2\Big]    (A.6)
= \rho_{i,j}\, E\Big[\int (O_{k,i,1} + O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i} - \Big(\int (O_{k,i,1} - O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i}\Big)^2\Big] = \rho_{i,j}\, E\big[O_{k,1} + O_{k,-1} - (O_{k,1} - O_{k,-1})^2\big]    (A.7)
= \rho_{i,j}\big(E[O_{k,1}] + E[O_{k,-1}] - E[O_{k,1}^2] + 2E[O_{k,1}O_{k,-1}] - E[O_{k,-1}^2]\big)
= \rho_{i,j}\big(p_1 + p_{-1} - E[O_{k,1}^2] + 2E[O_{k,1}O_{k,-1}] - E[O_{k,-1}^2]\big)
= \rho_{i,j}\big(p_1 + p_{-1} - (\sigma^2_{O_1} + \mu^2_{O_1}) + 2(\sigma_{O_1,O_{-1}} + \mu_{O_1}\mu_{O_{-1}}) - (\sigma^2_{O_{-1}} + \mu^2_{O_{-1}})\big)
= \rho_{i,j}\big(p_1 + p_{-1} - \sigma^2_{O_1} + 2\sigma_{O_1,O_{-1}} - \sigma^2_{O_{-1}} - (p_1 - p_{-1})^2\big).    (A.8)

Substituting Eqs. (A.5) and (A.8) into Eq. (A.2) yields the variance of the per-image accuracy difference in terms of moments of the prior $O_k$:

\sigma^2_{\bar D} = \overline{\rho_{i,j}}\,\big(p_1 + p_{-1} - (p_1 - p_{-1})^2\big) + \big(1 - \overline{\rho_{i,j}}\big)\big(\sigma^2_{O_1} - 2\sigma_{O_1,O_{-1}} + \sigma^2_{O_{-1}}\big),    (A.9)

where $\overline{\rho_{i,j}} = \frac{1}{v^2}\sum_{i,j}\rho_{i,j}$ is the average of the intra-image inter-voxel correlation coefficients. For conciseness, we introduce two terms: $\psi = p_1 + p_{-1}$ is the population-wide probability that algorithms A and B disagree on the labeling of a voxel, and $\sigma^2_{O_1 - O_{-1}} = \sigma^2_{O_1} - 2\sigma_{O_1,O_{-1}} + \sigma^2_{O_{-1}}$ is the variance of $O_{k,1} - O_{k,-1}$ under the prior on $O_k$. Recalling that the population average accuracy difference is $\delta = p_1 - p_{-1}$, this substitution yields a more concise expression (identical to Eq. (4)):

\sigma^2_{\bar D} = \overline{\rho_{i,j}}\,(\psi - \delta^2) + \big(1 - \overline{\rho_{i,j}}\big)\,\sigma^2_{O_1 - O_{-1}}.    (A.10)
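
As a quick numerical sanity check of Eq. (A.10), the sketch below considers the degenerate special case in which every image shares the same marginal probabilities (so $\sigma^2_{O_1 - O_{-1}} = 0$) and voxels are independent (so $\overline{\rho_{i,j}} = 1/v$); Eq. (A.10) then reduces to $\sigma^2_{\bar D} = (\psi - \delta^2)/v$, which can be compared directly against simulation. The parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.03, 0.93, 0.04])  # arbitrary P(D = -1), P(D = 0), P(D = +1)
v, n_images = 100, 50000          # voxels per image, simulated images

psi = p[0] + p[2]                 # probability that A and B disagree on a voxel
delta = p[2] - p[0]               # population mean accuracy difference

# Simulate per-image mean accuracy differences with independent voxels and a fixed prior.
d = rng.choice([-1, 0, 1], size=(n_images, v), p=p)
var_simulated = d.mean(axis=1).var()

# Eq. (A.10) with rho_bar = 1/v and sigma^2_{O_1 - O_-1} = 0.
var_predicted = (psi - delta ** 2) / v
print(var_simulated, var_predicted)  # the two values should agree closely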

Appendix B. Derivation of the accuracy difference in terms of the high-quality reference standard

The minimum detectable difference δMDD must be defined with respect to the study’s reference standard, while clinical or technical requirements may be better defined with respect to a high-quality reference standard (δMDD,H). This appendix derives an equation to express the population average accuracy difference with respect to one reference standard (L) as a function of the population average accuracy difference with respect to another reference standard (H), and uses this to express δMDD as a function of δMDD,H when a low-quality reference standard is used.

B1. Model and notation

As we did for A, B and L, we consider the segmentation labels of the high-quality reference standard H as random variables. We denote the population average accuracy difference with respect to L as $\delta$, and that with respect to H as $\delta_H$. We abbreviate the probability of a particular combination of segmentation labels for a randomly selected voxel as the conjunction of the events $\bar a$, $\bar b$, $\bar l$ and $\bar h$ when the respective labels are 0, and $a$, $b$, $l$ and $h$ when the respective labels are 1. For example, $p(a\bar b l)$ denotes the probability that A gives the label 1, B gives the label 0 and L gives the label 1 for the randomly selected voxel.

B2. Derivation

As described in Section 2.4, the derivation of $\delta$ as a function of $\delta_H$ uses the following approach:

  1. Express $\delta$ in terms of the joint probability of the segmentation labels of A, B, L and H.
  2. Isolate the terms of this expression that equate to $\delta_H$, and simplify the remaining terms.

B3. Express δ in terms of the joint probability of segmentation labels of A, B, L and H

Since events where A = B do not affect the difference in accuracy, $\delta$ is the probability of events where A = L and B ≠ L minus the probability of events where A ≠ L and B = L. $\delta$ can be expressed in terms of the probabilities of specific combinations of segmentation labels for A, B and L for a randomly selected voxel:

\delta = p(a\bar b l) + p(\bar a b \bar l) - p(\bar a b l) - p(a\bar b \bar l).    (B.1)

We then express each term in Eq. (B.1) in terms of H using the substitution $p(xy) = p(xz) - p(x\bar y z) + p(xy\bar z)$, where x represents $a\bar b$ (for terms 1 and 4) or $\bar a b$ (for terms 2 and 3), and y and z represent $l$ and $h$ (for terms 1 and 3) or $\bar l$ and $\bar h$ (for terms 2 and 4):

\delta = \big(p(a\bar b h) - p(a\bar b \bar l h) + p(a\bar b l \bar h)\big) + \big(p(\bar a b \bar h) - p(\bar a b l \bar h) + p(\bar a b \bar l h)\big) - \big(p(\bar a b h) - p(\bar a b \bar l h) + p(\bar a b l \bar h)\big) - \big(p(a\bar b \bar h) - p(a\bar b l \bar h) + p(a\bar b \bar l h)\big).    (B.2)

B4. Isolate the terms of this expression that equate to δH, and simplify the remaining terms

The difference in accuracy with respect to H is $\delta_H = p(a\bar b h) + p(\bar a b \bar h) - p(\bar a b h) - p(a\bar b \bar h)$. Isolating these terms in Eq. (B.2) gives the sum of $\delta_H$ and an error term:

\delta = \delta_H + 2\big(p(a\bar b l \bar h) - p(a\bar b \bar l h) + p(\bar a b \bar l h) - p(\bar a b l \bar h)\big).    (B.3)

To simplify the error term, we first expand each term with the substitution $p(x\bar y \bar z) = p(x) - p(xyz) - p(xy\bar z) - p(x\bar y z)$, where x represents the non-complemented events and $\bar y$ and $\bar z$ represent the complemented events, giving

\delta = \delta_H + 2\big(p(al) - p(ablh) - p(a\bar b lh) - p(abl\bar h)\big) - 2\big(p(ah) - p(ablh) - p(a\bar b lh) - p(ab\bar l h)\big) + 2\big(p(bh) - p(ablh) - p(\bar a blh) - p(ab\bar l h)\big) - 2\big(p(bl) - p(ablh) - p(abl\bar h) - p(\bar a blh)\big),    (B.4)

and then cancel out duplicated terms, giving

\delta = \delta_H + 2\big(p(al) - p(ah) + p(bh) - p(bl)\big).    (B.5)

If A, B, L, and H are encoded as 0 (for background) and 1 (for foreground), this error term can be expressed as $2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H)$ in the full equation:

\delta = \delta_H + 2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H).    (B.6)

By substituting $\delta = \delta_{MDD}$ and $\delta_H = \delta_{MDD,H}$ into Eq. (B.6), we can express $\delta_{MDD}$ as a function of $\delta_{MDD,H}$ when a low-quality reference standard is used (identical to Eq. (8)):

\delta_{MDD} = \delta_{MDD,H} + 2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H).    (B.7)
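
To make the use of Eq. (B.7) concrete, the following sketch estimates the correction terms from co-registered binary label arrays for the two algorithms (A, B) and the two reference standards (L, H), e.g. pooled over a pilot data set; the function name, the pooled-voxel estimators and the toy labels are illustrative assumptions rather than part of the published method.

import numpy as np

def adjusted_mdd(delta_mdd_h, a, b, l, h):
    # Translate a minimum detectable difference defined against a high-quality
    # reference standard H into one defined against a lower-quality reference
    # standard L, following Eq. (B.7):
    #   delta_MDD = delta_MDD_H + 2(p(a) - p(b))(p(l) - p(h)) + 2 cov(A - B, L - H)
    # a, b, l, h: flattened 0/1 label arrays over the pooled voxels.
    a, b, l, h = (np.asarray(x, dtype=float).ravel() for x in (a, b, l, h))
    bias = 2 * (a.mean() - b.mean()) * (l.mean() - h.mean())
    bias += 2 * np.cov(a - b, l - h, bias=True)[0, 1]
    return delta_mdd_h + bias

# Toy example with hypothetical labels for eight pooled voxels.
a = [1, 1, 1, 0, 0, 1, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0]
l = [1, 1, 1, 1, 0, 0, 0, 0]
h = [1, 1, 1, 0, 0, 0, 0, 0]
print(adjusted_mdd(0.02, a, b, l, h))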

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.media.2017.07.004

Appendix C. Supplementary materials

Supplementary Data S1

Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

mmc1.pdf (55.2KB, pdf)

References

  1. Barbiero A., Ferrari P.A., 2015. GenOrd: simulation of discrete random variables with given correlation matrix and marginal distributions. http://CRAN.R-project.org/package=GenOrd. R package version 1.4.0.
  2. Beiden S.V., Campbell G., Meier K.L., Wagner R.F. The problem of ROC analysis without truth: the EM algorithm and the information matrix. SPIE Medical Imaging. 2000:126–134.
  3. Browne R.H. On the use of a pilot sample for sample size determination. Stat. Med. 1995;14(17):1933–1940. doi: 10.1002/sim.4780141709.
  4. Caballero J., Bai W., Price A.N., Rueckert D., Hajnal J.V. Application-driven MRI: joint reconstruction and segmentation from undersampled MRI data. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Vol. 1. 2014:106–118.
  5. Chowdhury N., Pai M.R., Lobo F.D., Kini H., Varghese R. Interobserver variation in breast cancer grading: a statistical modeling approach. Anal. Quant. Cytol. Histol. 2006;28(4):213–218.
  6. Cocosco C.A., Kollokian V., Kwan R.K.-S., Pike G.B., Evans A.C. Brainweb: online interface to a 3D MRI simulated brain database. Proceedings of Functional Mapping of the Human Brain; NeuroImage. Vol. 5. 1997:425.
  7. Connelly L.M. Pilot studies. Medsurg Nursing. 2008;17(6):411–413.
  8. Connor R.J. Sample size for testing differences in proportions for the paired-sample design. Biometrics. 1987;43(1):207–211.
  9. Durkalski V.L., Palesch Y.Y., Lipsitz S.R., Rust P.F. Analysis of clustered matched-pair data. Stat. Med. 2003;22(15):2417–2428. doi: 10.1002/sim.1438.
  10. Everitt B.S., Skrondal A. The Cambridge Dictionary of Statistics. Cambridge University Press; 2002.
  11. Frounchi K., Briand L.C., Grady L., Labiche Y., Subramanyan R. Automating image segmentation verification and validation by learning test oracles. Inf. Softw. Technol. 2011;53(12):1337–1348.
  12. Gibson E., Bauman G.S., Romagnoli C., Cool D.W., Bastian-Jordan M., Kassam Z., Gaed M., Moussa M., Gómez J.A., Pautler S.E., Chin J.L., Crukley C., Haider M.A., Fenster A., Ward A.D., 2016. Toward prostate cancer contouring guidelines on MRI: dominant lesion gross and clinical target volume coverage via accurate histology fusion. (1), 188–196. doi: 10.1016/j.ijrobp.2016.04.018.
  13. Gibson E., Crukley C., Gaed M., Gómez J.A., Moussa M., Chin J.L., Bauman G.S., Fenster A., Ward A.D. Registration of prostate histology images to ex vivo MR images via strand-shaped fiducials. J. Magn. Reson. Imaging. 2012;36(6):1402–1412. doi: 10.1002/jmri.23767.
  14. Gibson E., Huisman H.J., Barratt D.C. Statistical power in image segmentation: relating sample size to reference standard quality. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2015:105–113.
  15. Gönen M. Sample size and power for McNemar’s test with clustered data. Stat. Med. 2004;23(14):2283–2294. doi: 10.1002/sim.1768.
  16. Guo W., Li Q. Effect of segmentation algorithms on the performance of computerized detection of lung nodules in CT. Med. Phys. 2014;41(9):091906. doi: 10.1118/1.4892056.
  17. Hamarneh G., Jassi P., Tang L. Simulation of ground-truth validation data via physically- and statistically-based warps. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2008:459–467.
  18. Hertzog M.A. Considerations in determining sample size for pilot studies. Res. Nurs. Health. 2008;31(2):180–191. doi: 10.1002/nur.20247.
  19. Irshad H., Montaser-Kouhsari L., Waltz G., Bucur O., Nowak J., Dong F., Knoblauch N.W., Beck A.H. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. Pacific Symposium on Biocomputing. 2015:294.
  20. Jha A.K., Kupinski M.A., Rodríguez J.J., Stephen R.M., Stopeck A.T. Task-based evaluation of segmentation algorithms for diffusion-weighted MRI without using a gold standard. Phys. Med. Biol. 2012;57(13):4425. doi: 10.1088/0031-9155/57/13/4425.
  21. Julious S.A. Sample size of 12 per group rule of thumb for a pilot study. Pharm. Stat. 2005;4(4):287–291.
  22. Juneja P., Evans P.M., Harris E.J. The validation index: a new metric for validation of segmentation algorithms using two or more expert outlines with application to radiotherapy planning. IEEE Trans. Med. Imaging. 2013;32(8):1481–1489. doi: 10.1109/TMI.2013.2258031.
  23. Kish L. Survey Sampling. Wiley; New York: 1965.
  24. Kohlberger T., Singh V., Alvino C., Bahlmann C., Grady L. Evaluating segmentation error without ground truth. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2012:528–536.
  25. Konyushkova K., Sznitman R., Fua P. Introducing geometry in active learning for image segmentation. Proceedings of the IEEE International Conference on Computer Vision. 2015:2974–2982.
  26. Lackey N., Wingate A. The pilot study: one key to research success. Kans. Nurse. 1986;61(11):6–7.
  27. Langerak T.R., van der Heide U.A., Kotte A.N., Viergever M.A., van Vulpen M., Pluim J.P. Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (SIMPLE). IEEE Trans. Med. Imag. 2010;29(12):2000–2008. doi: 10.1109/TMI.2010.2057442.
  28. Lee S.M., Young G.A. Asymptotic iterated bootstrap confidence intervals. Ann. Stat. 1995:1301–1330.
  29. Litjens G., Debats O., Barentsz J., Karssemeijer N., Huisman H. Computer-aided detection of prostate cancer in MRI. IEEE Trans. Med. Imaging. 2014;33(5):1083–1092. doi: 10.1109/TMI.2014.2303821.
  30. Litjens G., Toth R., van de Ven W., Hoeks C., Kerkstra S., van Ginneken B., Vincent G., Guillard G. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med. Image Anal. 2014;18(2):359–373. doi: 10.1016/j.media.2013.12.002.
  31. Mace A.E. Sample-Size Determination. Reinhold; New York: 1964.
  32. Maier-Hein L., Mersmann S., Kondermann D., Bodenstedt S., Sanchez A., Stock C., Kenngott H.G., Eisenmann M., Speidel S. Can masses of non-experts train highly accurate image classifiers? Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2014:438–445.
  33. Minka T.P. Estimating a Dirichlet distribution. Technical Report. M.I.T.; 2000.
  34. Mosimann J.E. On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika. 1962;49(1/2):65–82.
  35. Nieswiadomy R.M. Foundations in Nursing Research. Pearson Higher Ed; 2011.
  36. Penn A., 2015. ibootci. https://www.mathworks.com/matlabcentral/fileexchange/52741.
  37. R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. ISBN 3-900051-07-0.
  38. Rosner B. Fundamentals of Biostatistics. Nelson Education; 2015.
  39. Shah V., Pohida T., Turkbey B., Mani H., Merino M., Pinto P.A., Choyke P., Bernardo M. A method for correlating in vivo prostate magnetic resonance imaging and histopathology using individualized magnetic resonance-based molds. Rev. Sci. Instrum. 2009;80(10):104301. doi: 10.1063/1.3242697.
  40. Taha A.A., Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging. 2015;15(1):29. doi: 10.1186/s12880-015-0068-x.
  41. Top A., Hamarneh G., Abugharbieh R. Active learning for interactive 3D image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2011:603–610.
  42. Tu S., 2014. The Dirichlet-multinomial and Dirichlet-categorical models for Bayesian inference. Computer Science Division, UC Berkeley, Tech. Rep. http://www.cs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf.
  43. Warfield S.K., Zou K.H., Wells W.M. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imag. 2004;23(7):903–921. doi: 10.1109/TMI.2004.828354.
  44. Warnes G.R., Bolker B., Lumley T., 2015. gtools: Various R Programming Tools. R package version 3.5.0.
  45. Zhu Y., 2002. Correlated multinomial data. Encyclopedia of Environmetrics.
  46. Zöllei L., Wells W. Multi-modal image registration using Dirichlet-encoded prior information. International Workshop on Biomedical Image Registration. Springer; 2006:34–42.
