Medical Image Analysis. 2017 Dec;42:44–59. doi: 10.1016/j.media.2017.07.004

Designing image segmentation studies: Statistical power, sample size and reference standard quality

Eli Gibson a,b,c, Yipeng Hu b, Henkjan J Huisman a, Dean C Barratt b
PMCID: PMC5666910  PMID: 28772163

Highlights

  • A sample size calculation for segmentation accuracy studies is derived.

  • Parameters include accuracy difference, algorithm disagreement and a design factor.

  • A formula is derived to account for errors in the study reference standard.

  • A case study illustrates the application of the theory to a segmentation study design.

Keywords: Image segmentation, Segmentation accuracy, Statistical power, Reference standard


Abstract

Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources.

In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards.

The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.

1. Introduction

Demonstrating an improvement in segmentation algorithm accuracy typically involves comparison with an accepted reference standard, such as manual expert segmentations or other imaging modalities (e.g. histology). In many medical image segmentation problems, such segmentations are challenging due to the variable appearance of anatomical/pathological features, ambiguous anatomical definitions, clinical constraints, and interobserver variability. The resulting errors in the reference standards introduce errors in the performance measures used to compare segmentation algorithms, and can impact the probability of detecting a significant difference between algorithms, referred to as the statistical power (Beiden et al., 2000).

The cost and quality of a reference standard are affected by the time and effort devoted to segmentation accuracy, the sample size, and the number, background, experience and proficiency of the observers. For example, the PROMISE12 prostate MRI segmentation challenge used two reference standards (illustrated in Fig. 1): a high-quality reference standard manually segmented by one experienced clinical reader and verified by another independent clinical reader, and a low-quality reference standard segmented by a less experienced non-clinical observer. An alternative approach is to estimate a high-quality reference standard by combining independent segmentations from multiple observers using algorithms such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010). A third approach is to mitigate the errors in a lower-quality reference standard by increasing the sample size (Konyushkova et al., 2015; Top et al., 2011; Maier-Hein et al., 2014; Irshad et al., 2015). All three of these approaches, however, raise the cost of generating the reference standard, both logistically and economically.

Fig. 1. Left: Illustrative prostate MRI segmentations from the PROMISE12 prostate segmentation challenge (Litjens et al., 2014b) by two algorithms – A (blue) and B (yellow) – and the two manually contoured reference standards – L (red), which is of lower quality, and H (green), which is of higher quality. Compared to H, L oversegmented anteriorly where image information was ambiguous, affecting accuracy measurements of A and B using L. Right: Harder apical segmentations showing regions containing voxels with different combinations of segmentation labels A, B, L and H (an overbar denotes a negative classification). The statistical model underlying the derived sample size formula for segmentation evaluation studies is derived from probability distributions of these voxel-wise segmentation labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

There are clear trade-offs between the sample size of the study, the cost of generating the reference standard, and the reference standard quality. The optimal balance of these trade-offs depends on the relationship between the study design parameters and statistical power. However, standard power calculation formulae do not, in general, account for the quality of reference standard segmentations. Thus, there is a need for new formulae to quantify these relationships. As a first step towards this goal, this paper presents a new sample size calculation relating statistical power to the quality of a reference standard (measured with respect to a higher-quality reference standard). Such a formula can answer key questions in study design:

  • How many validation images are needed to evaluate a segmentation algorithm?

  • How accurate does the reference standard need to be?

In preliminary work (Gibson et al., 2015), we derived a relationship between statistical power and the quality of a reference standard for a simplified model that cannot account for correlation between voxels, and made a strong assumption that the reference and algorithm segmentation labels are conditionally independent given the high-quality reference standard. In the present paper, we build on our initial work to develop a generalized model that takes into account the correlation between voxels and the statistical dependence between algorithms and reference standards observed in segmentation studies.

The remainder of this paper outlines the derivation (Section 2.3), application (Sections 3 and 6) and validation (Sections 4 and 5) of a statistical power formula for image segmentation. Insights and heuristics derived from the formula and its validation, as well as limitations of the work, are discussed in Section 7. Appendix A and Appendix B present mathematical details of the derivations.

2. Sample size calculations in segmentation evaluation studies

The probability of a study correctly detecting a true effect depends in part on the sample size. A study with a sample size that is too small has a higher risk of missing a meaningful underlying difference, while one with a sample size that is too large may be more expensive than necessary. Sample size calculations relate the probability of a study correctly detecting a true effect to specified and estimated parameters of the study design (Mace, 1964). The sample size depends on the probability distribution of the test statistic under the null and alternate hypotheses. This distribution, in turn, depends on the statistical analysis being performed and on an assumed statistical model of the studied population.

We derive a sample size calculation for a specific analysis: comparing the mean segmentation accuracy — i.e. the proportion of voxels in an image that match the reference standard L — of two algorithms A and B that generate binary classifications of v voxels on n images using a paired Student’s t-test (Rosner, 2015) on the per-image accuracies. Specifically, this tests the null hypothesis that the mean segmentation accuracies of A and B (both measured by comparison to L) are equal against the alternative hypothesis that they are unequal. Paired t-test analyses such as this one are frequently performed in comparisons of segmentation accuracy (Caballero et al., 2014).
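As a concrete illustration of this analysis, the following minimal R sketch applies a paired t-test to per-image accuracies; the vectors acc_A and acc_B are made-up values standing in for the per-image proportions of voxels matching L, not data from any study.

```r
# Hypothetical per-image accuracies (proportion of voxels matching L)
# for the same six images segmented by algorithms A and B.
acc_A <- c(0.91, 0.88, 0.93, 0.90, 0.87, 0.92)
acc_B <- c(0.89, 0.86, 0.92, 0.88, 0.85, 0.90)

# Paired Student's t-test of the null hypothesis that the mean accuracies
# of A and B (both measured against L) are equal.
result <- t.test(acc_A, acc_B, paired = TRUE)
result$p.value
```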

2.1. Notation

Throughout this paper, we use the notation given in Table 1. Symbols used in this paper are summarized in Table 2.

Table 1.

Notation for mathematical symbols.

Type notation
Segmentation algorithms X (upper case non-italic)
Random variables and vectors X (upper case)
Realizations of random variables and constants x (lower case)
Vectors x (arrow accent);  〈 x, y 〉  (angle brackets)
Estimates x^ (circumflex accent)
Parameterized distributions X ∼ X(θ) (bold capital with parameters in parentheses)
Expectation of X E[X]
Conditional expectation of X given Z E(X|Z)
Conditional variance of X given Z σX|Z2
Conditional covariance of X and Y given Z cov(X, Y|Z)
Event X=1 x (bold lower case)
Event X=0 x¯ (bold lower case with bar)

Table 2.

Glossary of mathematical symbols.

Symbol  Support  Description

Experimental parameters
n  ℕ  Sample size
v  ℕ  Number of voxels per image
α  ℝ  Significance threshold (acceptable Type I error)
β  ℝ  1 − power (acceptable Type II error)
δ_MDD  [−1, 1]  Minimum difference to detect with specified power

Population parameters
p  [0, 1]³  Population average marginal probability for the per-voxel accuracy difference
δ  [−1, 1]  Population accuracy difference
ψ  [0, 1]  Probability that A and B disagree on a voxel label
δ_H  [−1, 1]  Population accuracy difference measured against high-quality reference standard H
p(a), p(b), p(l), p(h)  [0, 1]  Probabilities of voxel labels being 1 for a randomly selected voxel
ρ_{i,j}  [−1, 1]  Correlation between D_{k,i} and D_{k,j} given O_k
ρ̄_{i,j}  [0, 1]  Average ρ_{i,j} over all voxel pairs i and j
σ²_{O_1−O_{−1}}  [0, ψ − δ²]  Variance of the accuracy difference in the marginal probability prior
ω  ℝ⁺  Precision parameter of the Dirichlet distribution controlling inter-image variability

Random variables
A_{k,i}, B_{k,i}, L_{k,i}, H_{k,i}  {0, 1}  Segmentation label for the ith voxel in the kth image
O_k  [0, 1]³  Per-image prior on the average marginal probability
O_{k,i}  [0, 1]³  Per-voxel prior on the marginal probability
D_k  {−1, 0, 1}^v  Vector of per-voxel accuracy differences for the kth image
D_{k,i}  {−1, 0, 1}  Difference in accuracy for the ith voxel of the kth image
D  {−1, 0, 1}  Difference in accuracy for a random voxel
D̄_k  [−1, 1]  Per-image accuracy difference

Simulation variables
Dist_{i,j}  ℝ⁺  Distance between voxels i and j
σ_ρ  ℝ⁺  Scaling parameter to control spatial correlation in Monte Carlo simulations
d̄_k  [−1, 1]  Per-image accuracy difference of a simulated image
d_{k,i}  {−1, 0, 1}  Per-voxel accuracy difference of a simulated voxel

Other notation
p_1, p_0, p_{−1}  [0, 1]  Elements of p for values 1, 0 and −1
O_{k,1}, O_{k,0}, O_{k,−1}  [0, 1]  Elements of O_k for values 1, 0 and −1
O_{k,i,1}, O_{k,i,0}, O_{k,i,−1}  [0, 1]  Elements of O_{k,i} for values 1, 0 and −1
A, B, L, H    Segmentation sources denoting two algorithms, a low-quality and a high-quality reference standard
f    Design factor
t_p^{1}, t_p^{2}  ℝ  1- and 2-tailed critical values at probability p from a t-distribution
σ_0²  [0, 2]  Per-image accuracy difference variance under the null hypothesis
σ_alt²  [0, 2]  Per-image accuracy difference variance under the alternative hypothesis

[x, y] denotes real numbers between x and y; {x, y, z} denotes a set of possible values; a superscript x denotes a vector with x elements; ℕ denotes natural numbers; ℝ denotes real numbers; ℝ⁺ denotes positive real numbers.

2.2. Statistical model of segmentation

Our stochastic population model represents the joint distribution of possible segmentations by A, B, and L over a population of images. The data for one image from this population comprises binary segmentation labels (encoded as integers 0 or 1) assigned by A, B and L to each of the v voxels: a_{k,1}, …, a_{k,v}, b_{k,1}, …, b_{k,v}, l_{k,1}, …, l_{k,v}, where a_{k,i}, b_{k,i}, and l_{k,i} are the labels for the ith voxel in the kth image. The data for a study comprises n randomly sampled images, which we denote by the set of random variables {A_{k,1}, …, A_{k,v}, B_{k,1}, …, B_{k,v}, L_{k,1}, …, L_{k,v} | k = 1, …, n}, where A_{k,i}, B_{k,i}, and L_{k,i} are the random variables representing the labels for the ith voxel in the kth randomly sampled image.

2.2.1. Accuracy difference measures

We focus on three types of segmentation accuracy differences. First, the per-voxel segmentation accuracy difference for the ith voxel in the kth image is D_{k,i} = |B_{k,i} − L_{k,i}| − |A_{k,i} − L_{k,i}|. D_{k,i} can take on three values: 1 (when A_{k,i} = L_{k,i} ≠ B_{k,i}), 0 (when A_{k,i} = B_{k,i}) and −1 (when A_{k,i} ≠ L_{k,i} = B_{k,i}). Random vector D_k represents all D_{k,i} for the kth image. Second, the per-image accuracy difference is the proportion of correct voxel labels from algorithm A (with respect to reference standard L) minus the proportion of correct voxel labels from algorithm B (with respect to reference standard L): D̄_k = (1/v) Σ_{i=1}^{v} (1 − |A_{k,i} − L_{k,i}|) − (1/v) Σ_{i=1}^{v} (1 − |B_{k,i} − L_{k,i}|) = (1/v) Σ_{i=1}^{v} D_{k,i}. Third, the population average accuracy difference δ is the expected value E[D̄_k] for a randomly selected image in the population and, equivalently, δ = p(D = 1) − p(D = −1) for a randomly selected per-voxel accuracy difference D.
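The following R sketch makes these definitions concrete for a single image, assuming hypothetical binary label vectors a_k, b_k and l_k (the illustrative values below are not taken from any data set).

```r
# Hypothetical binary label vectors (length v) for one image, from
# algorithms A and B and the reference standard L.
a_k <- c(1, 1, 0, 1, 0, 0, 1, 1)
b_k <- c(1, 0, 0, 1, 1, 0, 1, 1)
l_k <- c(1, 1, 0, 1, 0, 0, 0, 1)

# Per-voxel accuracy differences D_{k,i} = |B - L| - |A - L|, in {-1, 0, 1}.
d_ki <- abs(b_k - l_k) - abs(a_k - l_k)

# Per-image accuracy difference: the mean of the per-voxel differences.
d_bar_k <- mean(d_ki)

# Averaging d_bar_k over a random sample of images estimates the
# population accuracy difference delta.
```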

2.2.2. Model distribution

For calculating power, the model (summarized in Table 3 and illustrated in Fig. 2) must encode the distribution of the metric analysed in the statistical analysis: the per-image accuracy difference D¯k. While D¯k depends on all three segmentations A, B and L, it can be expressed more simply as a unary function of Dk. Therefore, we consider the distribution of Dk directly, modeled as a v-dimensional correlated categorical distribution. To model this distribution, we follow the common convention of breaking down complex joint distributions into the mean, and multiple simpler sources of variation about the mean.

Table 3.

Model summary. These expressions summarize the nested model used in our derivations. The motivation and detailed description are given in Section 2.2.2.

O_k ∼ P(p), where E[O_k] = p
∀i: O_{k,i} ∼ O(O_k), where E(O_{k,i} | O_k) = O_k
∀i: D_{k,i} ∼ Categorical(O_{k,i})
∀i ≠ j: cov(D_{k,i}, D_{k,j} | O_k) = ρ_{i,j} √(σ²_{D_{k,i}|O_k} σ²_{D_{k,j}|O_k})
Fig. 2. The illustrated nested model shows, from left to right, (1) the prior distribution of per-image average marginal probabilities P(p) (shown on the triangular (standard 2-simplex) domain with axes O_{k,1} and O_{k,−1} shown and O_{k,0} implicitly defined as 1 − O_{k,1} − O_{k,−1}; darkness represents the probability density), (2) three different samples (i.e. three images) of per-image average marginal probabilities o_k (shown as arrows labelled o_1, o_2 and o_3), (3) three corresponding conditional prior distributions of per-voxel marginal probabilities O(O_k) for the three images (shown as in (1)), (4) nine different samples (i.e. nine voxels from the second image) of per-voxel marginal probabilities o_{k,i} (shown as unlabelled arrows), and (5) the categorical distributions for the nine voxels from the second image (shown as pie charts of the relative probabilities of the per-voxel accuracy differences p(d_{k,i} = 1 | o_{k,i}) [orange], p(d_{k,i} = 0 | o_{k,i}) [blue], and p(d_{k,i} = −1 | o_{k,i}) [red]). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The mean of Dk is defined by the joint distribution of the segmentation labels. Considering the joint distribution is important, because the algorithm and reference standard labels for a randomly selected voxel (A, B and L) may not be independent from each other, as they depend on the same image information and overlapping prior knowledge. The mean of Dk, therefore, encodes the inter-segmentation correlation in the population average marginal probabilities of the per-voxel accuracy difference D (marginalized over combinations of segmentations A, B and L yielding each difference value):

p(D=1) = p(A=1, B=0, L=1) + p(A=0, B=1, L=0); \quad p(D=0) = p(A=B); \quad p(D=-1) = p(A=1, B=0, L=0) + p(A=0, B=1, L=1). \qquad (1)

For example, when A and B are highly correlated, p(D=0) is higher, and when A and L are highly correlated, p(D=1) increases while p(D=−1) decreases. We consider the population average marginal probabilities as a model parameter p = ⟨p_1, p_0, p_{−1}⟩ = ⟨p(D=1), p(D=0), p(D=−1)⟩.

The variation of Dk about the mean is affected by three sources of variation:

  • intra-image inter-voxel correlation – two voxels in the same image may have correlated labels if, for example, they are adjacent or are commonly affected by the same image artifact.

  • inter-image variability – the expected segmentation performance for different images may vary, as one image may have features that are more or less challenging for a particular algorithm or observer than another image.

  • inter-voxel variability – two voxels in the same image may have different marginal probabilities depending on the image content; for example, voxels that are easy to segment would likely have the same labels for any algorithm, whereas more challenging voxels are more likely to show differences.

Both the inter-image variability and the intra-image inter-voxel correlation affect the covariance matrix of D_k. While the covariance matrix could be an explicit model parameter, interpreting the parameter is challenging because it conflates these different sources of correlation. Instead, we construct an over-parameterized nested model that allows us to separately represent inter-image variability and intra-image inter-voxel correlation. The key concept in this nested model is to introduce per-image priors (random variables O_k ∼ P(p)) on the average marginal probability for D_{k,i} within each image, in order to model inter-image variability. P(p) is a distribution of probability vectors (i.e. O_k lies in the open standard 2-simplex) with mean p. Then, for each image, the conditional distribution of D_{k,i} given O_k models the intra-image inter-voxel correlation. Specifically, we define the conditional covariance of D_k given O_k as

\mathrm{cov}(D_{k,i}, D_{k,j} \mid O_k) = \rho_{i,j} \sqrt{\sigma^2_{D_{k,i}|O_k}\, \sigma^2_{D_{k,j}|O_k}}, \qquad (2)

where ρ_{i,j} is a pair-wise Pearson correlation coefficient and σ²_{D_{k,i}|O_k} is the conditional variance of D_{k,i} given O_k.

To model the inter-voxel variability, each Dk, i has per-voxel priors (random variables Ok,i) defining its marginal probabilities. The conditional distribution of Ok,i given Ok is an arbitrary distribution O(Ok) of probability vectors with mean Ok.

2.3. Derivation of the sample size formula for segmentation

The general form of the sample size formula (Connor, 1987),

n = \frac{\left(t_{\alpha}^{\{2\}}\sqrt{\sigma_0^2} + t_{\beta}^{\{1\}}\sqrt{\sigma_{\mathrm{alt}}^2}\right)^2}{\delta_{\mathrm{MDD}}^2}, \qquad (3)

relates the sample size (n) to the variances (σ_0² and σ_alt²) of per-image accuracy differences under the null hypothesis (δ = 0) and alternate hypothesis (δ ≠ 0), acceptable study error rates (α and β), and the minimum detectable difference (δ_MDD) in population accuracy between algorithms A and B to detect with power (1 − β). t_α^{2} and t_β^{1} are two- and one-tailed critical values taken from the inverse cumulative distribution function of the t-distribution with n − 1 degrees of freedom. Of the parameters in Eq. (3), most are selected based on experimental design choices, but the variances of the per-image accuracy difference are derived from the statistical model.
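Because the critical values t_α^{2} and t_β^{1} themselves depend on the degrees of freedom n − 1, Eq. (3) is typically solved iteratively. The R sketch below is one minimal way to do this, assuming the two variances have already been estimated; the function name, starting value and iteration scheme are choices of this sketch, not part of the paper.

```r
# Iteratively solve the generic sample size formula (Eq. 3) for a paired
# t-test, updating the t critical values as the candidate n changes.
sample_size_generic <- function(sigma2_0, sigma2_alt, delta_mdd,
                                alpha = 0.05, beta = 0.20) {
  n <- 2  # smallest sample size for which a t-test is defined
  for (iter in 1:50) {
    t_a <- qt(1 - alpha / 2, df = n - 1)  # two-tailed critical value
    t_b <- qt(1 - beta, df = n - 1)       # one-tailed critical value
    n_new <- max(2, ceiling((t_a * sqrt(sigma2_0) +
                             t_b * sqrt(sigma2_alt))^2 / delta_mdd^2))
    if (n_new == n) break
    n <- n_new
  }
  n
}

# Example with illustrative variances and a 5% minimum detectable difference.
sample_size_generic(sigma2_0 = 0.0023, sigma2_alt = 0.0022, delta_mdd = 0.05)
```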

The variance of the per-image accuracy difference σ²_D̄ can be derived for any prior distribution of per-image average marginal probabilities (O_k ∼ P(p)) in terms of moments of the prior distribution by marginalizing out O_k and O_{k,i} (see Appendix A for a detailed derivation), yielding

\sigma^2_{\bar{D}} = \bar{\rho}_{i,j}\,(\psi - \delta^2) + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1 - O_{-1}}, \qquad (4)

where ψ = p_1 + p_{−1} is the population-wide probability that algorithms A and B disagree on the labeling of a voxel (see Fig. 3), σ²_{O_1−O_{−1}} (the variance of O_{k,1} − O_{k,−1} for the priors O_k) is a linear combination of moments of the prior distribution (σ²_{O_1−O_{−1}} = σ²_{O_1} − 2σ_{O_1,O_{−1}} + σ²_{O_{−1}}), and ρ̄_{i,j} = (Σ_{i,j} ρ_{i,j})/v² is the average of the intra-image inter-voxel correlation coefficients.

Fig. 3. Illustration of the relationship between the proportion of disagreement (ψ) and the accuracy difference (δ). In these four examples, segmentation algorithms A (blue) and B (yellow) both over-contour the circular object taken as the reference standard segmentation L (red), adding different perturbations that lower accuracy. When sets of segmentations have higher ψ and lower δ (as in the lower right), it is harder to detect accuracy differences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Substituting σ_alt² = σ²_{D̄|δ=δ_MDD} and σ_0² = σ²_{D̄|δ=0} (i.e. substituting δ = δ_MDD and δ = 0 into σ²_D̄) yields the segmentation sample size formula for accuracy differences with respect to reference standard L,

n = \left(t_{\alpha}^{\{2\}}\sqrt{\bar{\rho}_{i,j}\,\psi + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1-O_{-1}|\delta=0}} \;+\; t_{\beta}^{\{1\}}\sqrt{\bar{\rho}_{i,j}\,(\psi - \delta_{\mathrm{MDD}}^2) + (1 - \bar{\rho}_{i,j})\,\sigma^2_{O_1-O_{-1}|\delta=\delta_{\mathrm{MDD}}}}\right)^2 \Big/\, \delta_{\mathrm{MDD}}^2. \qquad (5)

It is interesting to note that when there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 1/v) and no inter-image variability in marginal probabilities (i.e. σ²_{O_1−O_{−1}} = 0), Eq. (5) approaches the sample size formula for McNemar’s two-sample paired proportion test with nv samples (Connor, 1987).

2.3.1. Sample size with the Dirichlet prior distribution

To gain further insight into the sample size relationship, consider the special case where the prior distribution of per-image average marginal probabilities P(p) is a Dirichlet distribution (i.e. O_k ∼ Dirichlet(ω, p)), which represents inter-image variability with a single parameter: the precision ω (Minka, 2000). When ω is large, priors O_k are likely to be near p (i.e. there is little variation between images); when ω is small, priors O_k are distributed more diffusely (i.e. there is more variation between images). The Dirichlet prior distribution has three properties that make interpretation of the sample size relationship easier:

  • It is well-characterised as a model for variability in categorical probabilities, because it is the conjugate prior distribution of the categorical and multinomial distributions and thus commonly adopted in Bayesian analysis (Tu, Mosimann, 1962, Zhu, Zöllei, Wells, 2006)

  • Representing inter-image variability with a single parameter simplifies interpretation and facilitates parameter fitting with small pilot data sets.

  • σ²_{O_1−O_{−1}} for the Dirichlet prior distribution is proportional to ψ − δ², which simplifies the sample size formula.

For the Dirichlet prior distribution, σ²_{O_1} = (p_1 − p_1²)/(ω + 1), σ_{O_1,O_{−1}} = −p_1 p_{−1}/(ω + 1), and σ²_{O_{−1}} = (p_{−1} − p_{−1}²)/(ω + 1); therefore σ²_{O_1−O_{−1}} = (ψ − δ²)/(ω + 1). Substituting σ²_{O_1−O_{−1}} into Eq. (4) and simplifying algebraically gives the variance of the per-image accuracy difference under a Dirichlet prior:

\sigma^2_{\bar{D}} = \frac{1 + \omega\,\bar{\rho}_{i,j}}{\omega + 1}\,(\psi - \delta^2). \qquad (6)

Since σ²_D̄ is expressed in terms of δ, we can readily substitute σ_alt² = σ²_{D̄|δ=δ_MDD} and σ_0² = σ²_{D̄|δ=0} into Eq. (3) to get the sample size formula

n = \frac{1 + \omega\,\bar{\rho}_{i,j}}{\omega + 1}\left(t_{\alpha}^{\{2\}}\sqrt{\psi/\delta_{\mathrm{MDD}}^2} + t_{\beta}^{\{1\}}\sqrt{\psi/\delta_{\mathrm{MDD}}^2 - 1}\right)^2. \qquad (7)

Several aspects of this formula link to previous work. The term (1 + ω ρ̄_{i,j})/(ω + 1) is a type of design factor, denoted hereafter as f (analogous to the design factor in cluster-randomized trials (Kish, 1965)), modelling the inter- vs intra-image variability in accuracy differences (i.e. each image being one correlated cluster of voxel samples). When there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 1/v), Eq. (7) simplifies to the formula found in our preliminary analysis (Gibson et al., 2015). The term ψ/δ² is the squared coefficient of variation of D under the idealized assumption of completely independent voxels (i.e. f = 1/v), or equivalently, the statistical efficiency of estimating δ (Everitt and Skrondal, 2002). We thus refer to ψ/δ² hereafter as the idealized efficiency.
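A sketch of Eq. (7) in R follows, written in terms of the design factor f and the idealized efficiency ψ/δ_MDD²; the parameter values in the example are illustrative only, and the iterative handling of the t critical values is a choice of this sketch.

```r
# Sample size under the Dirichlet prior (Eq. 7), iterating because the
# t critical values depend on the degrees of freedom n - 1.
sample_size_dirichlet <- function(psi, delta_mdd, omega, rho_bar,
                                  alpha = 0.05, beta = 0.20) {
  f   <- (1 + omega * rho_bar) / (omega + 1)  # design factor
  eff <- psi / delta_mdd^2                    # idealized efficiency
  n <- 2
  for (iter in 1:50) {
    t_a <- qt(1 - alpha / 2, df = n - 1)
    t_b <- qt(1 - beta, df = n - 1)
    n_new <- max(2, ceiling(f * (t_a * sqrt(eff) + t_b * sqrt(eff - 1))^2))
    if (n_new == n) break
    n <- n_new
  }
  n
}

# Example: 15% disagreement, 5% minimum detectable difference, moderate
# inter-image variability (omega = 128) and inter-voxel correlation.
sample_size_dirichlet(psi = 0.15, delta_mdd = 0.05, omega = 128, rho_bar = 0.05)
```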

2.4. Incorporating reference standard quality

Conducting segmentation accuracy comparison studies using a lower-quality reference standard introduces an additional challenge: selecting the appropriate minimum detectable difference. On one hand, for the generic sample size formula (Eq. (3)) to be valid, δMDD must be measured with respect to the reference standard used in the study. On the other hand, the selection of δMDD depends on external clinical or technical requirements. Ideally, these requirements would be defined with respect to a high-quality reference standard H (with the MDD denoted δMDD, H), to most closely approximate the true requirement. If the high-quality reference standard can be used for the entire study, there is no conflict and δMDD, H can be used directly. If, however, a lower-quality reference standard is used, an appropriate δMDD needs to be selected. To resolve this dilemma, we have derived a formula to express δMDD for a low-quality reference standard as a function of δMDD, H, by characterizing the differences between the low- and high-quality reference standards (e.g. on a small pilot dataset).

The derivation, detailed in Appendix B, expresses δMDD in terms of the joint probability of segmentation labels of A, B, L and H; isolates the terms of this expression that equate to δMDD, H; and simplifies the remaining terms. This yields an equation for δMDD as a function of δMDD, H and estimable parameters representing deviation of δMDD from δMDD, H:

\delta_{\mathrm{MDD}} = \delta_{\mathrm{MDD,H}} + 2\,(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B,\, L - H), \qquad (8)

where p(x) = p(X = 1) for a randomly selected voxel and cov(A − B, L − H) is the covariance between errors in L (with respect to H) and differences between A and B. The second term of this expression reflects error induced by over- or under-contouring by L (with respect to H). If L tends to over-contour compared to H, algorithms that assign more voxels as foreground will appear more accurate. The third term is the covariance cov(A − B, L − H), reflecting errors in L that are biased in favour of A or B. This expression can be used to estimate the δ_MDD to use for a study using a low-quality reference standard.

3. Applying the sample size formula

The sample size formula derived above supports the design of segmentation accuracy comparison studies by estimating the sample size needed to detect a specified accuracy difference with high probability. As with all sample size calculations, three types of parameters have to be determined to apply the formula: the acceptable study error rates, the minimum detectable difference, and the variance parameters. Some of these parameters are chosen based on experimental, technical or clinical requirements outside the study design, while others are estimated from related literature or pilot data. We denote the estimate of parameter x as x^.

The acceptable error rates are generally set using heuristics by study designers: α=0.05 (i.e. a 5% probability of falsely detecting a difference when there is none) and β=0.2 (i.e. an 80% probability of detecting a true difference).

The minimum detectable difference (δMDD) is typically set by technical or clinical requirements outside the study design to be the smallest difference that is large enough to be important to detect with high probability. Specifically, if the true difference is δMDD or higher, the study should give a true positive with probability 1β or higher. If the study will use a sufficiently high-quality reference standard, δMDD can be chosen directly. If the technical or clinical requirements are expressed with respect to a high-quality reference standard, but the study uses a lower-quality reference standard, then δMDD, H can be chosen and the equivalent δ^MDD can be estimated from the low-quality correction equation (Eq. (8)), using parameter estimation equations (Eqs. (9) and (10)) given in Section 3.1.

The variance parameters depend on the distribution of the data; they are not chosen a priori, but can be estimated using values from related literature, or using pilot data. In the moment-based sample size equation (Eq. (5)), the variance parameters are ψ, ρ̄_{i,j}, σ²_{O_1−O_{−1}|δ=0} and σ²_{O_1−O_{−1}|δ=δ_MDD}. In the Dirichlet-prior-based sample size equation (Eq. (7)), the variance parameters are ψ, ρ̄_{i,j}, and ω. In general, estimating these variance parameters individually can be challenging because the model is parameterized by multiple parameters that affect the intervoxel covariance of per-voxel accuracy differences, and because the moments of the prior for the per-image average marginal probabilities may depend on δ. Under some assumptions, however, we can estimate variance parameters.

  • If we assume σ_0² = σ_alt² = σ̂²_D̄, which may be appropriate when δ and δ_MDD are sufficiently small, we can estimate σ̂²_D̄ from the pilot data (using Eq. (13) in Section 3.1), and apply the generic sample size equation (Eq. (3)) directly.

  • If we assume a parametric distribution for the per-image average marginal probabilities, it may be possible to express σ²_{O_1−O_{−1}} in terms of δ (as shown for the Dirichlet distribution in Eq. (6)) and estimate σ²_{O_1−O_{−1}|δ=0} and σ²_{O_1−O_{−1}|δ=δ_MDD} from σ̂²_D̄. For the Dirichlet distribution, the resulting variance could be characterized by a design factor modeling the combined effect of parameters ρ̄_{i,j} and ω. An estimation equation for the design factor is given in Section 3.1, Eq. (14).

  • If there is a need to estimate the effects of the variance parameters individually (e.g. to explore the effect of increased intra-image inter-voxel correlation on a planned study), and we assume that the intra-image inter-voxel correlation is spatially constrained (e.g. if voxels separated by a specified distance are effectively uncorrelated given O_k), then we can estimate ω̂ using spatially sparse sampling and then estimate ρ̄̂_{i,j} from ω̂ and σ̂²_D̄. This approach is outlined for a Dirichlet prior in Section 3.1.

The optimal size for a pilot study data set has not been well-established in general, and depends on many factors (Hertzog, 2008), including the particular population being studied. In principle, the precision of the estimated sample size depends on the sensitivity of the formula to parameter estimation errors (see supplementary material) and the variances of the parameter estimators (which decrease as the pilot data set grows), both of which vary depending on the population being studied. In practice, formal sample size calculations for such pilot studies are rarely used (Hertzog, 2008); instead, heuristics, such as using 10 samples (Nieswiadomy, 2011), 12 samples (Julious, 2005) or 10% of the anticipated size of the full study (Connelly, 2008; Lackey and Wingate, 1986) for larger studies, can be used. The risk of parameter estimation error can be mitigated using conservative parameter estimates, as described in Section 3.1 for σ̂²_D̄.

3.1. Parameter estimation equations

To estimate parameters from pilot data, a small data set of images must be collected and segmented by algorithms A and B, by the reference standard L to be used for the study, and by the high-quality reference standard H. Given a segmented pilot data set, formula parameters can be estimated as follows.

To estimate δ^MDD in terms of δMDD, H, we first estimate the proportion of positive voxels segmented by A across all images in the pilot data:

\hat{p}(a) = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v} a_{k,i}, \qquad (9)

where n′ is the number of images in the pilot data set. p̂(b), p̂(l), and p̂(h) can be estimated similarly. cov̂(A − B, L − H) can be estimated as

\widehat{\mathrm{cov}}(A - B,\, L - H) = \frac{1}{n'v - 1}\sum_{k=1}^{n'}\sum_{i=1}^{v}\left(a_{k,i} - b_{k,i} - \hat{p}(a) + \hat{p}(b)\right)\left(l_{k,i} - h_{k,i} - \hat{p}(l) + \hat{p}(h)\right). \qquad (10)

Then, from Eq. (8), δ̂_MDD = δ_MDD,H + 2(p̂(a) − p̂(b))(p̂(l) − p̂(h)) + 2 cov̂(A − B, L − H).
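The R sketch below applies Eqs. (9), (10) and (8) to pilot data stored as n′ × v binary matrices. The randomly generated matrices are placeholders for real pilot segmentations and ignore spatial structure; they are used only to show the shape of the computation.

```r
# Hypothetical pilot data: n' x v binary label matrices for algorithms A and
# B, the study reference standard L and the high-quality reference H.
set.seed(1)
n_pilot <- 10; v <- 500
a <- matrix(rbinom(n_pilot * v, 1, 0.25), n_pilot, v)
b <- matrix(rbinom(n_pilot * v, 1, 0.20), n_pilot, v)
l <- matrix(rbinom(n_pilot * v, 1, 0.21), n_pilot, v)
h <- matrix(rbinom(n_pilot * v, 1, 0.21), n_pilot, v)

# Eq. (9): proportions of positive voxels across all pilot images.
p_a <- mean(a); p_b <- mean(b); p_l <- mean(l); p_h <- mean(h)

# Eq. (10): sample covariance between (A - B) and (L - H) over all voxels
# (cov() uses the n'v - 1 denominator, matching Eq. 10).
cov_ab_lh <- cov(as.vector(a - b), as.vector(l - h))

# Eq. (8): translate a requirement stated with respect to H into the
# minimum detectable difference with respect to L.
delta_mdd_h <- 0.05
delta_mdd   <- delta_mdd_h + 2 * (p_a - p_b) * (p_l - p_h) + 2 * cov_ab_lh
```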

The probability of disagreement can be estimated using the sample mean as

\hat{\psi} = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v} |a_{k,i} - b_{k,i}|. \qquad (11)

The population average accuracy difference can be estimated using the sample mean as

\hat{\delta} = \frac{1}{n'v}\sum_{k=1}^{n'}\sum_{i=1}^{v}\left(|b_{k,i} - l_{k,i}| - |a_{k,i} - l_{k,i}|\right). \qquad (12)

The variance in per-image accuracy differences can be estimated using the unbiased sample variance as

\hat{\sigma}^2_{\bar{D}} = \frac{1}{n' - 1}\sum_{k=1}^{n'}\left(\bar{d}_k - \hat{\delta}\right)^2, \qquad (13)

where d̄_k = (1/v) Σ_{i=1}^{v} (|b_{k,i} − l_{k,i}| − |a_{k,i} − l_{k,i}|). However, sample variance estimates from small pilot studies are imprecise and skewed (Browne, 1995), which inflates the probability of having an underpowered study. To mitigate this effect, Browne (1995) recommended using the upper bound of a γ% confidence interval on the variance to guarantee the specified power with γ% probability. This can be estimated using a double bootstrap method (e.g. Lee and Young, 1995, implemented for Matlab as ibootci (Penn, 2015)).
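Eqs. (11)–(13) translate directly into R. The sketch below assumes the same layout as above (images in rows, voxels in columns); the function name and the small random matrices standing in for real pilot segmentations are assumptions of this sketch.

```r
# Estimate psi (Eq. 11), delta (Eq. 12) and the per-image accuracy
# difference variance (Eq. 13) from pilot label matrices.
estimate_variance_params <- function(a, b, l) {
  psi_hat   <- mean(abs(a - b))                   # Eq. (11)
  d_bar     <- rowMeans(abs(b - l) - abs(a - l))  # per-image differences
  delta_hat <- mean(d_bar)                        # Eq. (12)
  list(psi = psi_hat, delta = delta_hat,
       sigma2_dbar = var(d_bar))                  # Eq. (13), unbiased variance
}

# Example with illustrative random matrices (5 pilot images, 200 voxels).
set.seed(2)
a <- matrix(rbinom(5 * 200, 1, 0.25), 5, 200)
b <- matrix(rbinom(5 * 200, 1, 0.20), 5, 200)
l <- matrix(rbinom(5 * 200, 1, 0.22), 5, 200)
estimate_variance_params(a, b, l)
```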

When modeling the per-image marginal probability prior as a Dirichlet distribution, the design factor encoding the combined effect of parameters ρ̄_{i,j} and ω can be estimated from Eq. (6) using sample estimates:

\hat{f} = \hat{\sigma}^2_{\bar{D}} \,\big/\, (\hat{\psi} - \hat{\delta}^2), \qquad (14)

and the idealized efficiency can be estimated as ψ̂/δ_MDD².

To estimate the effects of the variance parameters individually, we can model the per-image marginal probability prior as a Dirichlet distribution and assume that the intra-image inter-voxel correlation is spatially constrained (i.e. voxels more than x voxels apart are effectively uncorrelated given O_k). Sampling d_{k,i} from voxels spaced x voxels apart gives counts from a Dirichlet-multinomial distribution, and we can estimate the precision parameter ω̂ using an iterative approach described by Minka (2000). The average correlation coefficient can then be estimated from Eq. (6) using sample estimates as

\widehat{\bar{\rho}}_{i,j} = \frac{\hat{\sigma}^2_{\bar{D}}\,(\hat{\omega} + 1) - (\hat{\psi} - \hat{\delta}^2)}{(\hat{\psi} - \hat{\delta}^2)\,\hat{\omega}}. \qquad (15)
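For example, given pilot estimates of ψ, δ and σ²_D̄ and a separately estimated Dirichlet precision ω̂, Eqs. (14) and (15) reduce to simple arithmetic; the numeric values in this R sketch are illustrative only.

```r
# Illustrative pilot estimates (not from any data set).
psi_hat    <- 0.15    # probability of disagreement between A and B
delta_hat  <- 0.03    # accuracy difference
sigma2_hat <- 0.002   # per-image accuracy difference variance
omega_hat  <- 128     # Dirichlet precision (e.g. from sparse voxel sampling)

# Eq. (14): design factor.
f_hat <- sigma2_hat / (psi_hat - delta_hat^2)

# Eq. (15): average intra-image inter-voxel correlation.
rho_bar_hat <- (sigma2_hat * (omega_hat + 1) - (psi_hat - delta_hat^2)) /
               ((psi_hat - delta_hat^2) * omega_hat)
```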

4. Simulations

Three sets of Monte Carlo simulations were used to evaluate the accuracy of the sample size formulae under three different conditions:

  • 1.

    with simulated images and segmentations from the assumed statistical model, to test the validity of the model;

  • 2.

    with real-world data (the PROMISE12 prostate MRI segmentation data set described in Section 4.2.1) using a high-quality reference standard, to test the applicability of the Dirichlet-based sample size formula (Eq. (7)) to real data; and

  • 3.

    with real-world data using a low-quality reference standard while expressing the minimum detectable difference in terms of a high-quality reference standard, to test the applicability of the low-quality correction equation (Eq. (8)) to real data.

4.1. Simulations with simulated data from the assumed statistical model

In order to characterize the validity of the model described in Section 2.2, we performed sets of simulations with controlled variation of a subset of model parameters (hereafter referred to as a simulation set). Recall that Eq. (7) defines the sample size needed to detect a significant accuracy difference with probability 1 − β if the underlying population difference were δ_MDD. To test this, we set δ_MDD to the specified population accuracy difference, and compare the proportion of simulated studies yielding significant accuracy differences to 1 − β. Note that this approach to select δ_MDD is appropriate for validating the sample size formula, but not for designing real segmentation comparison studies: in practice, δ_MDD should be chosen based on clinical or technical requirements.

In each simulation, we repeatedly simulated a segmentation evaluation study by sampling per-voxel accuracy differences for ⌈n⌉ v-voxel segmentations and reference standards (where ⌈n⌉ denotes the smallest integer ≥ n) using the assumed model and testing for an accuracy difference using a Student’s t-test. In each simulation, we compared the observed proportion of positive statistical tests with the predicted probability (i.e. the statistical power 1 − β) for sample size ⌈n⌉. To clarify the impact of this error in power, we also substituted the observed power into the Dirichlet-based sample size formula (Eq. (7)) to calculate the equivalent error in the predicted sample size n and detectable difference δ_MDD. In each simulation, we ran 25,000 repetitions in order to estimate the probability of a positive outcome with a 95% confidence interval with a width of 1%.

Each per-image accuracy difference d¯ was computed by sampling the derived per-voxel accuracy differences dk, i directly as follows:

  • the marginal probability priors of per-voxel accuracy differences were drawn from a Dirichlet prior using the rdirichlet (Warnes et al., 2015) function in R version 3.1.1 (R Core Team, 2013),

  • a correlation matrix ρ_{i,j} = exp(−Dist_{i,j}/σ_ρ²) was constructed, where Dist_{i,j} is the intervoxel distance in a √v × √v voxel image and σ_ρ² is a scale parameter controlling the spatial extent of the correlation,

  • dk, i were sampled using the ordsample (Barbiero and Ferrari, 2015) function in R. While this is equivalent to drawing samples from the algorithm and reference standard segmentations and computing dk, i, it facilitates the direct control of the dk, i correlation matrix needed in these experiments.

The scripts used to generate these samples are available at https://github.com/eligibson/MedIA2016.
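As a simplified illustration of one simulated study, the R sketch below samples per-voxel accuracy differences with no intra-image inter-voxel correlation (the published scripts impose the correlation matrix via ordsample); the parameter values are illustrative, and the one-sample t-test on the per-image differences is equivalent to the paired t-test on per-image accuracies.

```r
library(gtools)  # for rdirichlet

# One simulated study: n images of v voxels, per-image marginal priors drawn
# from a Dirichlet distribution with mean p and precision omega.
simulate_study <- function(n, v, p, omega, alpha = 0.05) {
  d_bar <- numeric(n)
  for (k in 1:n) {
    o_k  <- as.vector(rdirichlet(1, omega * p))           # per-image prior
    d_ki <- sample(c(-1, 0, 1), v, replace = TRUE, prob = o_k)
    d_bar[k] <- mean(d_ki)                                 # per-image difference
  }
  t.test(d_bar)$p.value < alpha  # significant accuracy difference?
}

# The proportion of positive tests over many simulated studies approximates
# the statistical power; this p encodes delta = 6% and psi = 15%.
p <- c(0.045, 0.85, 0.105)   # probabilities of d = -1, 0, 1
mean(replicate(1000, simulate_study(n = 30, v = 36, p = p, omega = 128)))
```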

The baseline parameter values in the simulation sets and the ranges of varied parameters are given in Table 4. Note that the simulations varying v, ω, σ_ρ and ψ were conducted at two baseline δ values. The parameter ranges for these simulations were chosen to balance the applicability of parameter values to medical image segmentation problems against practical constraints. The range of ω encompassed both highly consistent and highly variable prior distributions. Ranges of δ and ψ reflected plausible algorithm differences based on previous experience. Due to limitations of the ordsample algorithm, the ranges of v and σ_ρ were constrained: v was limited to 100 because of the computational complexity of sampling high-dimensional correlated discrete random variables, and σ_ρ was constrained to 0.7 because of algorithmic constraints. The baseline parameter values were chosen to reflect typical sample sizes in segmentation studies (10–200). Because the population parameters derived in Section 2.4 (δ_H, p(a), p(b), p(l), p(h) and cov(A − B, L − H)) are linked to statistical power through their influence on the parameter δ, simulations were run as a function of δ, instead of simulating many combinations of parameters that map to the same δ.

Table 4.

Simulation parameters used to estimate the accuracy of the model. Note that the simulations varying v, ω, σρ and ψ were conducted twice at two baseline δ values.

Parameter: # voxels (v)  population accuracy difference (δ)  Dirichlet precision (ω)  spatial correlation width (σ_ρ)  population probability of disagreement (ψ)
Baseline: 36  3% / 6%  128  0.7  15%
Minimum: 9  2%  64  0  15%
Maximum: 100  10%  1024  0.7  45%
Increment: √v by +1  +1%  ×2  +0.1  +5%

4.2. Simulations with real-world data

To evaluate the applicability of sample size formula (Eq. (7)) and the low-quality correction equation (Eq. (8)) to a real-world data set, we simulated segmentation accuracy comparison studies using bootstrapped samples from the PROMISE12 data set.

The PROMISE12 challenge is an ongoing resource for comparing many state-of-the-art prostate segmentation algorithms against a common reference standard. The challenge images comprise 100 T2W prostate MR images collected from 4 centres, split into 50 training images (with publicly available reference segmentations) and 30 testing images (with reference segmentations withheld). The reference segmentations were manually segmented by an experienced clinical reader, and verified by another independent clinical reader. In order to establish a standardised scoring system for multiple metrics, the challenge had a non-clinical graduate student manually segment the images and her metric scores were used to normalize the metric scores of the algorithms. Although the PROMISE12 challenge principally used the high-quality reference standard for evaluation, the second segmentation is analogous to a presumably lower-quality reference standard that could be considered as a lower cost option. Thus, the clinical manual segmentations will represent the high-quality reference standard H, the graduate student manual segmentations will represent the low-quality reference standard L, and two algorithms from the challenge will represent A and B. Using 10 algorithms from the PROMISE12 challenge, the simulations were repeated for all 45 possible pairs of algorithms.

As in Section 4.1, we set δ_MDD to the population accuracy difference (treating the PROMISE12 test data set as the entire population) and compare the proportion of simulated studies yielding significant accuracy differences to 1 − β.

4.2.1. Simulations with high-quality real-world data

To evaluate the applicability of the Dirichlet-based sample size formula (Eq. (7)) to a real-world data set, each simulated study in this experiment compared two algorithms to the high-quality reference standard. For every pair of algorithms, we estimated the population accuracy difference (δ̂_H) and variance parameters using all 30 test cases from the PROMISE12 test data set. Using α = 0.05, β = 0.20, δ_MDD = δ̂_H, and the estimated variance parameters, we computed the predicted sample size n using Eq. (7). We then simulated 100,000 segmentation accuracy comparison studies using bootstrap sampling by sampling ⌈n⌉ images with replacement from the PROMISE12 images and testing the per-image accuracy differences using a paired Student’s t-test. We compared the proportion of positive tests to the power predicted by the model for ⌈n⌉ samples.
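A minimal R sketch of this bootstrap procedure is shown below, assuming hypothetical vectors acc_A and acc_B holding the per-image accuracies of the two algorithms against the reference standard for the full test set; the function name and defaults are assumptions of this sketch.

```r
# Estimate power by resampling n images with replacement and applying the
# paired t-test, repeated n_rep times.
bootstrap_power <- function(acc_A, acc_B, n, n_rep = 10000, alpha = 0.05) {
  positive <- replicate(n_rep, {
    idx <- sample(seq_along(acc_A), n, replace = TRUE)  # resample n images
    t.test(acc_A[idx], acc_B[idx], paired = TRUE)$p.value < alpha
  })
  mean(positive)  # proportion of significant results = estimated power
}

# Example usage (with per-image accuracy vectors for a pair of algorithms):
# bootstrap_power(acc_A, acc_B, n = 12)
```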

4.2.2. Simulations with low-quality real-world data

To evaluate the applicability of the low-quality correction equation (Eq. (8)) to a real-world data set, each simulated study in this experiment compared two algorithms to the low-quality reference standard, with δ̂_MDD calculated from Eq. (8) and the observed δ̂_H. Simulation using bootstrap sampling and evaluation proceeded as in Section 4.2.1 except that δ̂_MDD = δ̂_H + 2(p̂(a) − p̂(b))(p̂(l) − p̂(h)) + 2 cov̂(A − B, L − H), and the variance parameters were estimated with respect to the low-quality reference standard L.

5. Results

5.1. Simulations under the statistical model

The variance of accuracy differences predicted by the model (σD¯2) was within 2% relative error of the Monte Carlo simulations across all simulation sets (RMS relative error 0.5%). The predicted power was within 4% error (simulated – predicted power) of the Monte Carlo simulations across all simulation sets with 95% confidence.

Fig. 4 shows the absolute error in the predicted power (i.e., simulation - model power) under varying model parameters. The parameter with the largest impact on the accuracy of power prediction was δ. For simulations with baseline δ=3% and δ=6%, the predicted power was within 2% and 3% absolute error, respectively, of the simulations with 95% confidence. A larger positive bias in the power prediction error across all values of v, ω and σρ was observed for simulations with δ=6%, compared to simulations with δ=3%, suggesting that the positive bias can be primarily attributed to the baseline accuracy difference. The simulation with δ=10% had the largest absolute error of 4%.

Fig. 4. Model accuracy (95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the simulated and model power) for each simulation set. For example, with δ = 10%, the model predicted 82% power, 4% below the 86% power observed in the simulation. Each accuracy graph shows a blue line representing the expected error due to the observed skew alone (for the simulation varying δ and the baseline δ = 6%) based on applying the regular t-test sample size formula to a skewed Pearson distribution. The similar shape of this curve to the observed errors suggests that the skew is a considerable contributor to the error. The histogram (lower right) shows the distribution of accuracy differences for the simulation with δ = 10%, illustrating the slight but significant skew in the distribution, which contributes to the observed error. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

A proportion of the observed error can be attributed to skew in the distribution of per-image accuracy differences, deviating from the normality assumption of the t-test used in this work. The largest skew amongst our experiments (corresponding to the largest power prediction error) occurred when δ=10%; this is illustrated in a histogram of the accuracy differences, shown in Fig. 4. The effect of the deviation from normality is exacerbated in the simulations with large δ due to the lower sample size (n=8), for which the t-test is more sensitive to violations of its assumptions. To illustrate the expected impact of skew alone on the error in predicted power, Fig. 4 shows the error of the standard paired t-test power calculation for a correspondingly skewed population (Pearson distribution with skew matching the simulation) overlaid in blue.

The impact of these errors in predicted power on the sample size and minimum detectable difference is illustrated in Figs. 5 and 6.

Fig. 5. The equivalent error in predicted sample size (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the sample size needed to achieve the simulated power and the sample size needed to achieve the modeled power. For example, with δ = 10%, the model would overestimate by 1 the number of subjects needed to achieve the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. The equivalent error in predicted minimum detectable difference (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline δ = 3% and in cyan for baseline δ = 6%) on the absolute difference between the minimum difference detectable with the simulated power and the minimum difference detectable with the modeled power. For example, with δ = 10%, the model would predict that a minimum detectable difference of 10.5% would result in the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.2. Simulations with high-quality real-world data

When the minimum detectable difference was defined and tested relative to the high-quality reference standard in the PROMISE12 data set, the simulated power was < 4% higher than the power specified by the model (approximately 80%) for the majority of algorithm comparisons (range 0–20%). The error was strongly correlated with the skew of per-image accuracy differences in the population (Spearman’s ρ = 0.77; p < 1 × 10⁻⁸). The model did not over-estimate the power in any comparison, suggesting that it is conservative (i.e. avoiding predictions that result in underpowered studies) in the presence of skew. The errors for each pair of algorithms are reported in Table 5.

Table 5.

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the high-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

B C D E F G H I J

A 3 (108) 1 (41) 12 (28) 2 (31) 14 (11) 1 (50) 2 (101) 13 (8) 7 (22)
B 10 (15) 1 (163) 1 (26418) 1 (35) 1 (1.8E6) 10 (28) 0 (14) 0 (157)
C 12 (11) 4 (10) 14 (9) 11 (12) 0 (42) 17 (5) 9 (6)
D 4 (102) 2 (50) 3 (115) 13 (14) 1 (15) 2 (3357)
E 7 (19) 1 (14084) 2 (11) 12 (8) 3 (95)
F 5 (23) 12 (10) 1 (312) 5 (48)
G 7 (16) 8 (10) 0 (97)
H 20 (5) 15 (8)
I 2 (17)

5.3. Simulations with low-quality real-world data

When the minimum detectable difference was defined relative to the high-quality reference standard and tested relative to the low-quality reference standard in the PROMISE12 data set, the model predicted the simulated power with a median error of 5% (simulated − predicted power; range −29% to 16%) and a median absolute error of 6% (|simulated − predicted power|). The two algorithm pairs with the smallest δ_MDD (0.1% and 0.2% accuracy differences) and largest sample sizes (5714 and 3721) had the largest errors, overestimating power by 27% and 29%, respectively. The error was correlated with the skew of per-image accuracy differences (Spearman’s ρ = 0.34; p = 0.02), and excluding the 2 cases with the smallest δ_MDD, the correlation was stronger (Spearman’s ρ = 0.67; p < 1 × 10⁻⁶). The errors for each pair of algorithms are reported in Table 6.

Table 6.

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the low-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

B C D E F G H I J

A 6 (43) 2 (167) 12 (22) 5 (133) 2 (11) 5 (25) 27 (5714) 15 (7) 9 (21)
B 8 (14) 6 (403) 8 (71) 5 (67) 12 (3598) 12 (24) 5 (17) 29 (3721)
C 11 (12) 11 (34) 8 (11) 7 (13) 2 (50) 10 (6) 13 (8)
D 6 (31) 2 (87) 4 (165) 13 (16) 0 (17) 6 (508)
E 0 (15) 2 (41) 1 (76) 11 (6) 6 (34)
F 0 (37) 5 (13) 4 (159) 4 (58)
G 4 (17) 6 (12) 8 (466)
H 13 (6) 16 (11)
I 5 (16)

6. Case study

The direct application of the sample size formula to calculate the sample size is described in Section 3. The formula can also be used indirectly to guide other aspects in the design of segmentation comparison studies. In this case study, we illustrate one such application: evaluating the cost (in terms of sample size vs cost per subject) of using a lower-quality reference standard manually segmented by a non-clinical graduate student instead of one generated by clinical collaborators. For illustration, this case study simulates the availability of a pilot data set by using two algorithms and the 30 test data sets from the PROMISE12 challenge.

To evaluate the cost of the two approaches, we can compare the sample sizes under the two reference standard strategies. The error rates and minimum detectable difference δMDD, H will be the same for both scenarios. We use commonly accepted Type I and II error rates: α=0.05 and β=0.20. The appropriate δMDD, H depends on the clinical or technical requirements; for example, in the context of prostate segmentation, the MDD could represent the minimal improvement in prostate segmentation accuracy that would make an automated prostate MRI computer-aided detection (CAD) system (e.g. Litjens et al., 2014a) clinically suitable as a first reader. In this case study, we suppose that an analysis of an existing CAD system suggests an improvement in accuracy of 5% (with respect to a high-quality reference standard) would be sufficient to make the system clinically suitable.

The variance parameters differ between the scenarios. To assess the scenario where the study uses a high-quality reference standard, we can estimate ψ̂, δ̂ and σ̂²_D̄ using A, B and H. Using Eqs. (11)–(13) with h_{k,i} in place of l_{k,i} gives ψ̂ = 13.4%, δ̂ = 4.02% and σ̂²_D̄ = 0.00231. Since δ̂ and δ_MDD are small relative to ψ̂, assuming σ_0² = σ_alt² = σ̂²_D̄ will yield similar results to assuming a Dirichlet prior (σ_0² = 0.00234 and σ_alt² = 0.00229). The resulting sample size to detect a difference δ_MDD,H = 5% was 9 subjects. To assess the scenario where the study uses a low-quality reference standard instead, we first estimate δ̂_MDD using A, B, L and H. The parameter estimation equations (Eqs. (9) and (10)) give p̂(a) = 0.246, p̂(b) = 0.195, p̂(l) = 0.210, p̂(h) = 0.214, and cov̂(A − B, L − H) = 0.29%, yielding δ̂_MDD = 0.0348. Using Eqs. (11)–(13) gives ψ̂ = 13.4%, δ̂ = 3.37%, and σ̂²_D̄ = 0.00253. The resulting sample size to detect a difference δ_MDD,H = 5% was 12 subjects.

Based on this analysis, we estimate that a study using this lower-quality reference standard would require 30% more subjects to detect a 5% improvement in accuracy than one using the high-quality reference standard. Since the cost per subject of generating the lower-quality reference standard is typically much lower, this could be a suitable approach for comparing these algorithms.

7. Discussion

In this work, we derived a sample size formula for studies comparing the segmentation accuracy of two algorithms, and also a relationship describing the effect of using lower-quality reference standards on the minimum detectable difference in segmentation accuracy. The formula accuracy was evaluated using Monte Carlo simulations, yielding errors in predicted power of less than 4% across a range of model parameters. The applicability of the formulae to real-world data was evaluated using bootstrap sampling from the PROMISE12 prostate MRI segmentation data set yielding median errors in predicted power less than 6%, but showed the error to be sensitive to skewed distributions and small sample sizes. A case study was also analyzed to illustrate the use of the formulae in a realistic context.

7.1. Validation in segmentation comparison studies

Improvements in the methodology for the validation and comparison of segmentation algorithms span a wide variety of approaches.

One avenue to improve segmentation validation is to develop improved metrics. Simple segmentation metrics such as accuracy, Dice overlap, Cohen’s Kappa, mean absolute boundary distances and Hausdorff distances compare segmentations to a single reference standard and are commonly used (Taha and Hanbury, 2015). Newer metrics allow comparisons to multiple reference standards (e.g. the validation index (Juneja et al., 2013)) or comparisons that consider application-specific utility (e.g. accuracy of quantitative measurements in segmented ROIs (Jha et al., 2012)). This latter concept can be taken further by validating segmentation through its impact on a larger system, such as the accuracy of a computer-assisted detection pipeline (Guo and Li, 2014). Model observers have also been developed to assess aspects of segmentation quality without a reference standard (Frounchi et al., 2011; Kohlberger et al., 2012), effectively creating a learned, reference-standard-independent segmentation metric.

Another avenue to improve segmentation validation is to improve the reference standard quality. Label fusion algorithms, such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010), enable the generation of higher-quality reference standards that combine information from multiple experts. Improvements in multimodal registration (Shah et al., 2009; Gibson et al., 2012) enable reference standards based on information that is less dependent on the image being segmented.

A third avenue is to increase the size of reference standards by reducing the cost per image, or via data augmentation. Active learning (Konyushkova et al., 2015; Top et al., 2011) and other interactive annotation tools reduce the cost of generating expert segmentations by partially automating the process. Crowdsourcing non-expert segmentations (Maier-Hein et al., 2014; Irshad et al., 2015) can cheaply generate reference standards on many images, using the large numbers to offset the potential loss in quality. For some anatomy, artificial data with reference segmentations can be generated by simulating the imaging process (Cocosco et al., 1997) or perturbing the geometry and image signal of existing images (Hamarneh et al., 2008).

This work, in contrast, aims to improve validation by enabling researchers to design efficient and appropriately powered studies. This work focuses on a particular analysis used in segmentation comparison studies: comparing the proportion of voxels where each of two segmentation algorithms agree with a single reference standard. The presented formulae can be directly applied by researchers developing new segmentation algorithms to facilitate the design of their studies. More broadly, this work has particular importance for work focused on improving reference standard quality and reference standard size by providing a framework for understanding the tradeoffs between quality and quantity in segmentation reference standards.

7.2. Accuracy and applicability of the sample size formulae

In typical study designs, the statistical power, i.e. the probability of detecting an accuracy difference of a specified size, is fixed heuristically at 80%, specifying that a 20% risk of missing a true effect is acceptable. Other study design parameters are optimized under this constraint, balancing costs and effect sizes. A study design with statistical power substantially above the acceptable level uses resources inefficiently, while one with lower power gives an unacceptable risk of false negatives. The largest errors observed in the model were for large accuracy differences. The variance predicted by the model matches the simulations to within 2%, suggesting that model errors are not primarily due to an incorrect variance prediction. Rather, the distribution of the accuracy differences in these simulation sets suggests that the error can be attributed to a combination of two factors: low sample size and skewness. The accuracy difference distribution under our statistical model, when using a Dirichlet prior, generally has non-zero skew when there are accuracy differences (i.e. |δ| > 0) and inter-image variability (ω < ∞), and the simulations show a skew as high as 0.3 in these simulation sets. The t-test, however, assumes samples are drawn from a normal distribution with 0 skew. While the t-test is robust to such deviations from normality at large sample sizes, large accuracy differences are more easily detectable and thus require small sample sizes. This suggests that segmentation comparison studies should be careful in their application of the t-test for studies with small sample sizes; in such cases, a McNemar test adjusted for clustered sampling (Gönen, 2004; Durkalski et al., 2003) may be more appropriate.

When applied to real-world data, the errors were generally larger than those observed under the statistical model. The errors were strongly correlated with the skew of the distribution of per-image accuracy differences, which is consistent with our observations on simulated data. This effect was particularly evident when the predicted sample size was low: five of the six largest observed errors (where the model underestimated power by 13–20%) corresponded to simulated studies with n < 10, which is also consistent with our observations on simulated data. In general, the model underestimated the simulated power, which could lead to inefficient resource usage but would not lead to failed studies caused by insufficient power. When using a low-quality reference standard with δMDD defined with respect to a high-quality reference standard, the error was also correlated with skew. However, in this context, another source of error must be considered: error in the estimation of δMDD. When the estimated minimum detectable difference was very small (|δ̂MDD| < 0.2%), small absolute estimation errors (|δ − δ̂MDD| < 0.06%) led to large relative estimation errors, resulting in large errors in the predicted power. When using a low-quality reference standard, the model over-estimated the simulated power for 10/45 of the algorithm pairs, suggesting that additional subjects may be needed when using this model to avoid underpowered studies.

The proposed approach for using low-quality reference standards presumes that a high-quality data set can be obtained, even if only as a small pilot data set, and that clinical or technical requirements on accuracy differences specified with respect to that reference standard are useful. In some medical segmentation tasks (such as prostate cancer delineation on MRI (Gibson et al., 2016) or mitosis detection on histology images (Chowdhury et al., 2006)), even expert segmentations are highly variable. For some tasks, it may be appropriate to combine segmentations from multiple experts by consensus or using a label fusion algorithm such as STAPLE to generate a high-quality reference standard for a pilot study; however, care should be taken to consider whether requirements specified with respect to the resulting reference standard will be practically useful.

7.3. Model interpretation

Although the sample size relationship is a continuous function of multiple parameters, it can be useful to break the parameters into coarse categories to see emerging trends (see Table 7). In particular, we focus on the special case of modeling the prior as a Dirichlet random variable, and examine the parameters that comprise the idealized efficiency ψ/δ² and the design factor f.

Table 7.

Number of images required to detect a desired segmentation accuracy difference. When compensating for the use of a lower-quality reference standard, use Eq. (8) to estimate the minimum detectable difference (δMDD) first.

                                       Design factor (f)
                                       0.01     0.05     0.1
Small differences (δMDD = 2%)
  ψ = 2%     (ψ/δMDD² = 50)              6*       21       41
  ψ = 11%    (ψ/δMDD² = 275)             24      110      218
  ψ = 20%    (ψ/δMDD² = 500)             41      198      394
Medium differences (δMDD = 5%)
  ψ = 5%     (ψ/δMDD² = 20)              3*       10       17
  ψ = 12.5%  (ψ/δMDD² = 50)              6*       21       41
  ψ = 20%    (ψ/δMDD² = 80)              8*       33       65
Large differences (δMDD = 10%)
  ψ = 10%    (ψ/δMDD² = 10)              3*        6*      10
  ψ = 15%    (ψ/δMDD² = 15)              3*        8*      14
  ψ = 20%    (ψ/δMDD² = 20)              3*       10       17

* Small sample sizes calculated from Eq. (7) are reported here; however, studies with such small sample sizes may be highly sensitive to violations of the assumptions of the t-test and are not recommended.

δMDD can be coarsely categorized into small (δMDD ≤ 2%), medium (2% < δMDD < 10%), and large (δMDD ≥ 10%) differences. Detecting small differences can require large (often infeasible) sample sizes, whereas detecting large differences may be limited not by δMDD but by the assumptions of the statistical analysis.

Within these effect size categories, the likelihood of disagreement between algorithms (ψ) plays an important role. ψ has the range δ ≤ ψ ≤ δ + 2·min(p(A ≠ L), p(B ≠ L)). When ψ ≈ δ, most of the differences between the algorithms correspond to the more accurate algorithm correcting the errors of the less accurate one while making few new errors. When ψ ≫ δ, the more accurate algorithm also makes new errors on voxels where the less accurate algorithm was correct. Table 7 shows three levels of disagreement: minimal disagreement (ψ = δMDD), large disagreement (ψ = 20%) and a midpoint between them. When δMDD is small, the level of disagreement can introduce an order of magnitude difference in required sample sizes.

The idealized efficiency is modulated by the design factor. The design factor ranges from 1/v (denoting that each voxel gives an independent estimate of accuracy differences) to 1 (denoting that each image gives an independent estimate of accuracy differences, but voxel segmentations are perfectly correlated). For realistic medical image segmentation algorithms, however, either of these extremes is unlikely. Table 7 shows three levels of the design factor: low correlation (f=0.01), medium correlation (f=0.05) and high correlation (f=0.1).

Our derivations show that sample sizes for studies comparing the accuracy of segmentation algorithms principally depend on the idealized efficiency ψ/δMDD², which relates the probability of voxel-wise disagreement (ψ) between algorithms to the minimum detectable difference δMDD, and on the design factor f, which reflects increased variability due to inter-voxel correlation and inter-image variability. The sample size is approximately proportional to the idealized efficiency ψ/δMDD². ψ has the range δ ≤ ψ ≤ δ + 2·min(p(A ≠ L), p(B ≠ L)), which suggests that it is easier, in general, to detect a given accuracy difference when at least one of the algorithms is highly accurate (lowering the upper bound on ψ). Furthermore, it is easier to detect a given accuracy improvement when algorithm A principally corrects errors made by algorithm B (where ψ ≈ δ, minimizing the idealized efficiency) than when algorithm A makes errors that are independent of B.
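
As a rough illustration of these relationships (and of how Table 7 can be read), the sketch below computes a back-of-the-envelope sample size from ψ, δMDD and f, assuming the standard normal-approximation form n ≈ (z1−α/2 + z1−β)²·f·ψ/δMDD². This approximation is assumed here only for illustration and is not a substitute for the exact formula (Eq. (7)), which yields slightly larger values at small n.

import math
from scipy import stats

def approx_sample_size(delta_mdd, psi, f, alpha=0.05, power=0.80):
    # Normal-approximation sample size for a paired comparison of voxel-wise
    # segmentation accuracy, assuming n is proportional to the idealized
    # efficiency psi / delta_mdd^2 scaled by the design factor f.
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return math.ceil(z ** 2 * f * psi / delta_mdd ** 2)

# Medium difference, moderate disagreement, medium correlation:
print(approx_sample_size(delta_mdd=0.05, psi=0.125, f=0.05))  # 20; Table 7 reports 21 from the exact formula
# Small difference, large disagreement, high correlation:
print(approx_sample_size(delta_mdd=0.02, psi=0.20, f=0.1))    # 393; Table 7 reports 394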

Although intuition would suggest that using lower-quality reference standards should consistently increase the required sample size, our derivations and simulations suggest a more complex relationship. The impact of errors in the reference standard is reduced by using a paired analysis, which excludes variance due to factors that affect both algorithms in the same way, such as reference standard errors in voxels where the algorithms agree. Reference standard errors in regions of disagreement, however, do affect the variance of per-image accuracy differences (σD̄² = ((1 + ω·ρ̄i,j) / (ω + 1))·(ψ − δ²) from Eq. (6)). In the rightmost factor of this equation, ψ (which does not depend on the reference standard) is generally much larger than δ² (see Table 7), suggesting that the impact of reference standard errors on variance is predominantly via changing the design factor. Reference standard errors also affect the sample size (Eq. (8)) by altering the detectable accuracy difference when the reference standard has errors that are biased in favour of one algorithm, or when it has systematic over- or under-contouring and one algorithm contours more foreground than the other. Relatively speaking, systematic over- or under-contouring will have only a small impact on the detectable accuracy differences unless the algorithms' foreground proportions are very different: for example, if A contours 5% more foreground than B, then 10% over-contouring by L (25× that observed in the PROMISE12 data) will change the measured accuracy difference by only 0.5%, unless the contouring errors are biased towards one algorithm. Furthermore, errors in the reference standard that are biased towards one algorithm do not necessarily decrease power: reference standard errors biased towards the more accurate algorithm will exaggerate the true difference, increasing power at the expense of increased type I error.1

These observations were reflected in our analysis of the PROMISE12 challenge data (see Tables 5 and 6). Comparing the low-quality to the high-quality reference standard, the root-mean-squared relative error in f̂ was 4%, compared to 0.3% for ψ̂ − δ̂². Because the low-quality reference standard had substantial agreement with the high-quality one (96% ± 1% mean ± SD accuracy), the effect of sample biases in reference standard errors was observable: for 17/45 pairs of algorithms, the studies designed to use the low-quality reference standard actually needed fewer subjects than studies using the high-quality reference standard; in all of these cases, there were slight sample biases in the low-quality reference standard towards the more accurate algorithm (primarily, as expected, in the covariance term in Eq. (8)). This increased |δMDD| relative to |δH| (i.e. the underlying differences between the algorithms were exaggerated and thus easier to detect). Because the experimental design for evaluating the model on real data required δMDD = δH, which was very small for some comparisons ( < 2% in 20/45 algorithm pairs and  < 0.5% in 4 algorithm pairs), this effect was magnified. Overall, our analysis of the PROMISE12 data aligns well with our theoretical model. Based on our analysis, using reference standards that are lower quality but unbiased may be a suitable approach for comparing segmentation algorithm accuracy.

7.4. Limitations

The contributions of this work should be considered in the context of its limitations. First, the sample size calculation presented in this work is specific to the statistical analysis (the paired Student’s t-test) and to the accuracy metric (the proportion of voxels matching the reference standard). Further work is needed to develop these formulae for other analyses and metrics. Second, our correlation model is over-parameterized, representing inter-image variability and intra-image inter-voxel correlation separately, when their effects on the covariance of Dk are coupled. This complicates the estimation of parameters, but yields formulae expressed in concepts familiar to the image analysis community. Third, due to constraints on sampling from specified high-dimensional correlated discrete distributions, we were unable to generate Monte Carlo simulations testing the extremes of some parameter ranges (e.g. high numbers of voxels and high inter-voxel correlation). Because the metric analysed in the study, D̄k, is a mean over voxels (which becomes more precise with higher v), and because we did not observe an increase in error as v increased from 9 to 100, we do not anticipate notable differences in model performance at larger v. Fourth, our application of the formulae to real segmentation studies was limited by the public availability of data sets with high- and low-quality reference standards; the PROMISE12 data set used in our case study is a rare example of such data. Finally, the sensitivity of the formula to violations of its underlying assumptions was not estimated; future work in this area could clarify which of these assumptions are critical to the accuracy of the formula and which could be relaxed.

8. Conclusions

In this work, we derived formulae to address two interrelated questions in the design of studies comparing segmentation algorithms: How many validation images are needed to evaluate a segmentation algorithm? and How accurate does the reference standard need to be? The sample size formula predicted the power of simulated segmentation studies to within 4% across a range of model parameters and, when applied to the PROMISE12 prostate segmentation challenge data, predicted the power to within a median error of 6%. In addition to their direct application in calculating sample sizes, the formulae offer several insights for study design. First, it is generally easier to detect a given accuracy difference when at least one algorithm is highly accurate, as this reduces accuracy variability. Second, it is generally easier to detect a given accuracy difference when one algorithm principally corrects the errors of another than when the two algorithms make independent errors. Third, systematic over- or under-contouring by a low-quality reference standard does not substantially affect accuracy measurements unless one algorithm tends to contour more voxels as foreground than the other, but correlation between reference standard errors and algorithm differences can bias accuracy measurements. These formulae, together with the parameter estimation equations and guidelines that facilitate their use, can enable researchers to make statistically motivated decisions about their study design and choice of reference standard, and to make the most efficient use of limited research resources.

Acknowledgements

This work was supported by the UK Medical Research Council, Radboud University Medical Centre and the Canadian Institutes of Health Research. Yipeng Hu is funded by Cancer Research UK and the UK Engineering and Physical Sciences Research Council (EPSRC) as part of the UCL-KCL Comprehensive Cancer Imaging Centre.

Footnotes

1. Because of this, care should be taken when estimates of this bias (Eq. (10)) are not substantially smaller than δH.

Appendix A. Derivation of the variance of the accuracy difference

The variance of the per-image difference in accuracy, $\sigma^2_{\bar D}$, affects the statistical power of segmentation accuracy comparison experiments. This appendix derives an expression for $\sigma^2_{\bar D}$, based on the statistical model described in Section 2, for any prior distribution of the per-image average marginal probabilities ($O_k \sim P(p)$), in terms of moments of that prior distribution.

A1. Statistical segmentation model and notation reiterated

The per-image difference in accuracy is $\bar{D}_k = \frac{1}{v}\sum_{i=1}^{v} D_{k,i}$, where v is the number of voxels and the random variable $D_{k,i}$ is the per-voxel segmentation accuracy difference for the ith voxel in the kth image, defined as $D_{k,i} = |B_{k,i} - L_{k,i}| - |A_{k,i} - L_{k,i}|$. Random variables $A_{k,i}$, $B_{k,i}$ and $L_{k,i}$ are segmentation labels for the ith voxel in the kth image from segmentation algorithms A and B and reference standard L, respectively.
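
For concreteness, a minimal sketch of these quantities for a single image, assuming flattened 0/1 numpy label arrays (the example labels are hypothetical):

import numpy as np

a = np.array([1, 1, 0, 0, 1])  # algorithm A labels for one image (hypothetical)
b = np.array([1, 0, 0, 1, 1])  # algorithm B labels
l = np.array([1, 1, 0, 0, 0])  # reference standard labels

d = np.abs(b - l) - np.abs(a - l)  # per-voxel accuracy differences D_{k,i}, each in {-1, 0, 1}
d_bar = d.mean()                   # per-image accuracy difference \bar{D}_k
print(d, d_bar)                    # [0 1 0 1 0] 0.4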

The statistical model, motivated and described in Section 2, models the distribution of the random vector of per-voxel accuracy differences $\mathbf{D}_k = \langle D_{k,1}, \ldots, D_{k,v} \rangle$ as a v-dimensional correlated categorical distribution with three categories (−1, 0, and 1). The marginal probabilities $O_{k,i} = \langle O_{k,i,-1}, O_{k,i,0}, O_{k,i,1} \rangle$ of the categorical distribution are identically distributed random probability vectors with mean $O_k = \langle O_{k,-1}, O_{k,0}, O_{k,1} \rangle$, but no other constraint on the shape of the distribution. The covariance of the categorical distribution given $O_k$ is defined such that $\mathrm{cov}(D_{k,i}, D_{k,j} | O_k) = \rho_{i,j} \sqrt{\sigma^2_{D_{k,i}|O_k}\, \sigma^2_{D_{k,j}|O_k}}$, where $\sigma^2_{D_{k,i}|O_k}$ is the conditional variance of $D_{k,i}$ given $O_k$. Priors $O_k$ are independently and identically distributed random variables sampled for each image with mean p (the population mean probability vector).

A2. Derivation of $\sigma^2_{\bar D}$ in terms of moments of the priors $O_k$

To derive $\sigma^2_{\bar D}$ under this model, we express the covariance matrix of the variables $D_{k,i}$ in terms of $E(D_{k,i}|O_{k,i}) = O_{k,i,1} - O_{k,i,-1}$ and $E(D_{k,i}^2|O_{k,i}) = O_{k,i,1} + O_{k,i,-1}$, marginalize out the prior parameters $O_{k,i}$ and $O_k$ to give an expression in terms of moments of $O_k$, and express $\sigma^2_{\bar D}$ as the average of the covariance matrix elements.

Because $\bar{D}_k = \frac{1}{v}\sum_{i=1}^{v} D_{k,i}$, $\sigma^2_{\bar D}$ can be expressed as

\sigma^2_{\bar D} = \frac{1}{v^2}\sum_{i,j} \mathrm{cov}(D_{k,i}, D_{k,j}),    (A.1)

for a random image k. By the law of total covariance, $\mathrm{cov}(D_{k,i}, D_{k,j})$ can be expressed in terms of conditional probabilities given $O_k$ as the sum of two components,

\sigma^2_{\bar D} = \frac{1}{v^2}\sum_{i,j} \Big[ \mathrm{cov}\big(E(D_{k,i}|O_k),\, E(D_{k,j}|O_k)\big) + E\big[\mathrm{cov}(D_{k,i}, D_{k,j}|O_k)\big] \Big],    (A.2)

where E(X|Y) denotes the conditional expectation of X given Y, and cov(X, Y|Z) denotes the conditional covariance of X and Y given Z. The two components can be expressed in terms of moments of $O_k$ by

  1. expressing each component in terms of the marginal probabilities $O_{k,i}$,
  2. marginalizing them over $O_{k,i}$ to express them in terms of $O_k$, and
  3. marginalizing them over $O_k$ to express them in terms of moments of $O_k$.

It is helpful to first note that $E(D_{k,j}|O_k) = E(D_{k,i}|O_k)$ and $\sigma^2_{D_{k,i}|O_k} = \sigma^2_{D_{k,j}|O_k}$, since $O_{k,i}$ and $O_{k,j}$ are identically distributed given $O_k$. The first term of Eq. (A.2) represents the covariance due to variability of the prior, and can be simplified following the three steps above (shown in Eqs. (A.3), (A.4) and (A.5)) with details shown below:

\mathrm{cov}\big(E(D_{k,i}|O_k),\, E(D_{k,j}|O_k)\big) = \mathrm{var}\big(E(D_{k,i}|O_k)\big) = \mathrm{var}\left(\int E(D_{k,i}|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i}\right)    (A.3)
= \mathrm{var}\left(\int (O_{k,i,1} - O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i}\right) = \mathrm{var}(O_{k,1} - O_{k,-1})    (A.4)
= \sigma^2_{O_1} + \sigma^2_{O_{-1}} - 2\sigma_{O_1,O_{-1}},    (A.5)

where $\sigma^2_{O_1}$ and $\sigma^2_{O_{-1}}$ are the variances of $O_{k,1}$ and $O_{k,-1}$, and $\sigma_{O_1,O_{-1}}$ is the covariance of $O_{k,1}$ and $O_{k,-1}$.

The second component of Eq. (A.2) represents the covariance due to sampling the marginal probability and per-voxel accuracy difference variables, and can be simplified following the three steps above (shown in Eqs. (A.6), (A.7) and (A.8)) with details shown below:

E\big[\mathrm{cov}(D_{k,i}, D_{k,j}|O_k)\big] = E\big[\rho_{i,j}\,\sigma_{D_{k,i}|O_k}\,\sigma_{D_{k,j}|O_k}\big] = E\big[\rho_{i,j}\,\sigma^2_{D_{k,i}|O_k}\big] = \rho_{i,j}\, E\big[E(D_{k,i}^2|O_k) - E(D_{k,i}|O_k)^2\big]
= \rho_{i,j}\, E\Big[\int E(D_{k,i}^2|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i} - \Big(\int E(D_{k,i}|O_{k,i})\, p(O_{k,i}|O_k)\, dO_{k,i}\Big)^2\Big]    (A.6)
= \rho_{i,j}\, E\Big[\int (O_{k,i,1} + O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i} - \Big(\int (O_{k,i,1} - O_{k,i,-1})\, p(O_{k,i}|O_k)\, dO_{k,i}\Big)^2\Big] = \rho_{i,j}\, E\big[O_{k,1} + O_{k,-1} - (O_{k,1} - O_{k,-1})^2\big]    (A.7)
= \rho_{i,j}\big(E[O_{k,1}] + E[O_{k,-1}] - E[O_{k,1}^2] + 2E[O_{k,1}O_{k,-1}] - E[O_{k,-1}^2]\big)
= \rho_{i,j}\big(p_1 + p_{-1} - E[O_{k,1}^2] + 2E[O_{k,1}O_{k,-1}] - E[O_{k,-1}^2]\big)
= \rho_{i,j}\big(p_1 + p_{-1} - (\sigma^2_{O_1} + \mu^2_{O_1}) + 2(\sigma_{O_1,O_{-1}} + \mu_{O_1}\mu_{O_{-1}}) - (\sigma^2_{O_{-1}} + \mu^2_{O_{-1}})\big)
= \rho_{i,j}\big(p_1 + p_{-1} - \sigma^2_{O_1} + 2\sigma_{O_1,O_{-1}} - \sigma^2_{O_{-1}} - (p_1 - p_{-1})^2\big).    (A.8)

Substituting Eqs. (A.5) and (A.8) into Eq. (A.2) yields the variance of the per-image accuracy difference in terms of moments of the prior $O_k$:

\sigma^2_{\bar D} = \overline{\rho_{i,j}}\,\big(p_1 + p_{-1} - (p_1 - p_{-1})^2\big) + \big(1 - \overline{\rho_{i,j}}\big)\big(\sigma^2_{O_1} - 2\sigma_{O_1,O_{-1}} + \sigma^2_{O_{-1}}\big),    (A.9)

where $\overline{\rho_{i,j}} = \frac{1}{v^2}\sum_{i,j}\rho_{i,j}$ is the average of the intra-image inter-voxel correlation coefficients. For conciseness, we introduce two terms: $\psi = p_1 + p_{-1}$ is the population-wide probability that algorithms A and B disagree on the labeling of a voxel, and $\sigma^2_{O_1 - O_{-1}} = \sigma^2_{O_1} - 2\sigma_{O_1,O_{-1}} + \sigma^2_{O_{-1}}$ is the variance of $O_{k,1} - O_{k,-1}$ under the prior on $O_k$. Recalling that the population average accuracy difference is $\delta = p_1 - p_{-1}$, this substitution yields a more concise expression (identical to Eq. (4)):

\sigma^2_{\bar D} = \overline{\rho_{i,j}}\,(\psi - \delta^2) + \big(1 - \overline{\rho_{i,j}}\big)\,\sigma^2_{O_1 - O_{-1}}.    (A.10)
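
As a quick numerical sanity check of Eq. (A.10), the sketch below considers the degenerate special case in which every image shares the same marginal probabilities (so $\sigma^2_{O_1 - O_{-1}} = 0$) and voxels are independent (so $\overline{\rho_{i,j}} = 1/v$); Eq. (A.10) then reduces to $\sigma^2_{\bar D} = (\psi - \delta^2)/v$, which can be compared directly against simulation. The parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.03, 0.93, 0.04])  # arbitrary P(D = -1), P(D = 0), P(D = +1)
v, n_images = 100, 50000          # voxels per image, simulated images

psi = p[0] + p[2]                 # probability that A and B disagree on a voxel
delta = p[2] - p[0]               # population mean accuracy difference

# Simulate per-image mean accuracy differences with independent voxels and a fixed prior.
d = rng.choice([-1, 0, 1], size=(n_images, v), p=p)
var_simulated = d.mean(axis=1).var()

# Eq. (A.10) with rho_bar = 1/v and sigma^2_{O_1 - O_-1} = 0.
var_predicted = (psi - delta ** 2) / v
print(var_simulated, var_predicted)  # the two values should agree closely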

Appendix B. Derivation of the accuracy difference in terms of the high-quality reference standard

The minimum detectable difference δMDD must be defined with respect to the study’s reference standard, while clinical or technical requirements may be better defined with respect to a high-quality reference standard (δMDD,H). This appendix derives an equation to express the population average accuracy difference with respect to one reference standard (L) as a function of the population average accuracy difference with respect to another reference standard (H), and uses this to express δMDD as a function of δMDD,H when a low-quality reference standard is used.

B1. Model and notation

As we did for A, B and L, we consider the segmentation labels of the high-quality reference standard H as random variables. We denote the population average accuracy difference with respect to L as $\delta$, and that with respect to H as $\delta_H$. We abbreviate the probability of a particular combination of segmentation labels for a randomly selected voxel as the conjunction of the events $\bar a$, $\bar b$, $\bar l$ and $\bar h$ when the respective labels are 0, and $a$, $b$, $l$ and $h$ when the respective labels are 1. For example, $p(a\bar b l)$ denotes the probability that A gives the label 1, B gives the label 0 and L gives the label 1 for the randomly selected voxel.

B2. Derivation

As described in Section 2.4, the derivation of $\delta$ as a function of $\delta_H$ uses the following approach:

  1. Express $\delta$ in terms of the joint probability of the segmentation labels of A, B, L and H.
  2. Isolate the terms of this expression that equate to $\delta_H$, and simplify the remaining terms.

B3. Express δ in terms of the joint probability of segmentation labels of A, B, L and H

Since events where A = B do not affect the difference in accuracy, $\delta$ is the probability of events where A = L and B ≠ L minus the probability of events where A ≠ L and B = L. $\delta$ can be expressed in terms of the probabilities of specific combinations of segmentation labels for A, B and L for a randomly selected voxel:

\delta = p(a\bar b l) + p(\bar a b \bar l) - p(\bar a b l) - p(a\bar b \bar l).    (B.1)

We then express each term in Eq. (B.1) in terms of H using the substitution $p(xy) = p(xz) - p(x\bar y z) + p(xy\bar z)$, where x represents $a\bar b$ (for terms 1 and 4) or $\bar a b$ (for terms 2 and 3), and y and z represent $l$ and $h$ (for terms 1 and 3) or $\bar l$ and $\bar h$ (for terms 2 and 4):

\delta = \big(p(a\bar b h) - p(a\bar b \bar l h) + p(a\bar b l \bar h)\big) + \big(p(\bar a b \bar h) - p(\bar a b l \bar h) + p(\bar a b \bar l h)\big) - \big(p(\bar a b h) - p(\bar a b \bar l h) + p(\bar a b l \bar h)\big) - \big(p(a\bar b \bar h) - p(a\bar b l \bar h) + p(a\bar b \bar l h)\big).    (B.2)

B4. Isolate the terms of this expression that equate to δH, and simplify the remaining terms

The difference in accuracy with respect to H is $\delta_H = p(a\bar b h) + p(\bar a b \bar h) - p(\bar a b h) - p(a\bar b \bar h)$. Isolating these terms in Eq. (B.2) gives the sum of $\delta_H$ and an error term:

\delta = \delta_H + 2\big(p(a\bar b l \bar h) - p(a\bar b \bar l h) + p(\bar a b \bar l h) - p(\bar a b l \bar h)\big).    (B.3)

To simplify the error term, we first expand each term with the substitution $p(x\bar y \bar z) = p(x) - p(xyz) - p(xy\bar z) - p(x\bar y z)$, where x represents the non-complemented events and $\bar y$ and $\bar z$ represent the complemented events, giving

\delta = \delta_H + 2\big(p(al) - p(ablh) - p(a\bar b lh) - p(abl\bar h)\big) - 2\big(p(ah) - p(ablh) - p(a\bar b lh) - p(ab\bar l h)\big) + 2\big(p(bh) - p(ablh) - p(\bar a blh) - p(ab\bar l h)\big) - 2\big(p(bl) - p(ablh) - p(abl\bar h) - p(\bar a blh)\big),    (B.4)

and then cancel out duplicated terms, giving

\delta = \delta_H + 2\big(p(al) - p(ah) + p(bh) - p(bl)\big).    (B.5)

If A, B, L, and H are encoded as 0 (for background) and 1 (for foreground), this error term can be expressed as $2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H)$ in the full equation:

\delta = \delta_H + 2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H).    (B.6)

By substituting $\delta = \delta_{MDD}$ and $\delta_H = \delta_{MDD,H}$ into Eq. (B.6), we can express $\delta_{MDD}$ as a function of $\delta_{MDD,H}$ when a low-quality reference standard is used (identical to Eq. (8)):

\delta_{MDD} = \delta_{MDD,H} + 2(p(a) - p(b))(p(l) - p(h)) + 2\,\mathrm{cov}(A - B, L - H).    (B.7)
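
To make the use of Eq. (B.7) concrete, the following sketch estimates the correction terms from co-registered binary label arrays for the two algorithms (A, B) and the two reference standards (L, H), e.g. pooled over a pilot data set; the function name, the pooled-voxel estimators and the toy labels are illustrative assumptions rather than part of the published method.

import numpy as np

def adjusted_mdd(delta_mdd_h, a, b, l, h):
    # Translate a minimum detectable difference defined against a high-quality
    # reference standard H into one defined against a lower-quality reference
    # standard L, following Eq. (B.7):
    #   delta_MDD = delta_MDD_H + 2(p(a) - p(b))(p(l) - p(h)) + 2 cov(A - B, L - H)
    # a, b, l, h: flattened 0/1 label arrays over the pooled voxels.
    a, b, l, h = (np.asarray(x, dtype=float).ravel() for x in (a, b, l, h))
    bias = 2 * (a.mean() - b.mean()) * (l.mean() - h.mean())
    bias += 2 * np.cov(a - b, l - h, bias=True)[0, 1]
    return delta_mdd_h + bias

# Toy example with hypothetical labels for eight pooled voxels.
a = [1, 1, 1, 0, 0, 1, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0]
l = [1, 1, 1, 1, 0, 0, 0, 0]
h = [1, 1, 1, 0, 0, 0, 0, 0]
print(adjusted_mdd(0.02, a, b, l, h))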

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.media.2017.07.004

Appendix C. Supplementary materials

Supplementary Data S1

Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

mmc1.pdf (55.2KB, pdf)

References

  1. Barbiero A., Ferrari P.A., 2015. GenOrd: simulation of discrete random variables with given correlation matrix and marginal distributions. http://CRAN.R-project.org/package=GenOrd. R package version 1.4.0.
  2. Beiden S.V., Campbell G., Meier K.L., Wagner R.F. The problem of ROC analysis without truth: the EM algorithm and the information matrix. SPIE Medical Imaging. 2000:126–134.
  3. Browne R.H. On the use of a pilot sample for sample size determination. Stat. Med. 1995;14(17):1933–1940. doi: 10.1002/sim.4780141709.
  4. Caballero J., Bai W., Price A.N., Rueckert D., Hajnal J.V. Application-driven MRI: joint reconstruction and segmentation from undersampled MRI data. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Vol. 1. 2014:106–118.
  5. Chowdhury N., Pai M.R., Lobo F.D., Kini H., Varghese R. Interobserver variation in breast cancer grading: a statistical modeling approach. Anal. Quant. Cytol. Histol. 2006;28(4):213–218.
  6. Cocosco C.A., Kollokian V., Kwan R.K.-S., Pike G.B., Evans A.C. Brainweb: online interface to a 3D MRI simulated brain database. Proceedings of Functional Mapping of the Human Brain; NeuroImage. Vol. 5. 1997:425.
  7. Connelly L.M. Pilot studies. Medsurg Nursing. 2008;17(6):411–413.
  8. Connor R.J. Sample size for testing differences in proportions for the paired-sample design. Biometrics. 1987;43(1):207–211.
  9. Durkalski V.L., Palesch Y.Y., Lipsitz S.R., Rust P.F. Analysis of clustered matched-pair data. Stat. Med. 2003;22(15):2417–2428. doi: 10.1002/sim.1438.
  10. Everitt B.S., Skrondal A. The Cambridge Dictionary of Statistics. Cambridge University Press; 2002.
  11. Frounchi K., Briand L.C., Grady L., Labiche Y., Subramanyan R. Automating image segmentation verification and validation by learning test oracles. Inf. Softw. Technol. 2011;53(12):1337–1348.
  12. Gibson E., Bauman G.S., Romagnoli C., Cool D.W., Bastian-Jordan M., Kassam Z., Gaed M., Moussa M., Gómez J.A., Pautler S.E., Chin J.L., Crukley C., Haider M.A., Fenster A., Ward A.D., 2016. Toward prostate cancer contouring guidelines on MRI: dominant lesion gross and clinical target volume coverage via accurate histology fusion. (1), 188–196. doi: 10.1016/j.ijrobp.2016.04.018.
  13. Gibson E., Crukley C., Gaed M., Gómez J.A., Moussa M., Chin J.L., Bauman G.S., Fenster A., Ward A.D. Registration of prostate histology images to ex vivo MR images via strand-shaped fiducials. J. Magn. Reson. Imaging. 2012;36(6):1402–1412. doi: 10.1002/jmri.23767.
  14. Gibson E., Huisman H.J., Barratt D.C. Statistical power in image segmentation: relating sample size to reference standard quality. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2015:105–113.
  15. Gönen M. Sample size and power for McNemar’s test with clustered data. Stat. Med. 2004;23(14):2283–2294. doi: 10.1002/sim.1768.
  16. Guo W., Li Q. Effect of segmentation algorithms on the performance of computerized detection of lung nodules in CT. Med. Phys. 2014;41(9):091906. doi: 10.1118/1.4892056.
  17. Hamarneh G., Jassi P., Tang L. Simulation of ground-truth validation data via physically- and statistically-based warps. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2008:459–467.
  18. Hertzog M.A. Considerations in determining sample size for pilot studies. Res. Nurs. Health. 2008;31(2):180–191. doi: 10.1002/nur.20247.
  19. Irshad H., Montaser-Kouhsari L., Waltz G., Bucur O., Nowak J., Dong F., Knoblauch N.W., Beck A.H. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. Pacific Symposium on Biocomputing. 2015:294.
  20. Jha A.K., Kupinski M.A., Rodríguez J.J., Stephen R.M., Stopeck A.T. Task-based evaluation of segmentation algorithms for diffusion-weighted MRI without using a gold standard. Phys. Med. Biol. 2012;57(13):4425. doi: 10.1088/0031-9155/57/13/4425.
  21. Julious S.A. Sample size of 12 per group rule of thumb for a pilot study. Pharm. Stat. 2005;4(4):287–291.
  22. Juneja P., Evans P.M., Harris E.J. The validation index: a new metric for validation of segmentation algorithms using two or more expert outlines with application to radiotherapy planning. IEEE Trans. Med. Imaging. 2013;32(8):1481–1489. doi: 10.1109/TMI.2013.2258031.
  23. Kish L. Survey Sampling. Wiley; New York: 1965.
  24. Kohlberger T., Singh V., Alvino C., Bahlmann C., Grady L. Evaluating segmentation error without ground truth. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2012:528–536.
  25. Konyushkova K., Sznitman R., Fua P. Introducing geometry in active learning for image segmentation. Proceedings of the IEEE International Conference on Computer Vision. 2015:2974–2982.
  26. Lackey N., Wingate A. The pilot study: one key to research success. Kans. Nurse. 1986;61(11):6–7.
  27. Langerak T.R., van der Heide U.A., Kotte A.N., Viergever M.A., van Vulpen M., Pluim J.P. Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (SIMPLE). IEEE Trans. Med. Imag. 2010;29(12):2000–2008. doi: 10.1109/TMI.2010.2057442.
  28. Lee S.M., Young G.A. Asymptotic iterated bootstrap confidence intervals. Ann. Stat. 1995:1301–1330.
  29. Litjens G., Debats O., Barentsz J., Karssemeijer N., Huisman H. Computer-aided detection of prostate cancer in MRI. IEEE Trans. Med. Imaging. 2014;33(5):1083–1092. doi: 10.1109/TMI.2014.2303821.
  30. Litjens G., Toth R., van de Ven W., Hoeks C., Kerkstra S., van Ginneken B., Vincent G., Guillard G. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med. Image Anal. 2014;18(2):359–373. doi: 10.1016/j.media.2013.12.002.
  31. Mace A.E. Sample-Size Determination. Reinhold; New York: 1964.
  32. Maier-Hein L., Mersmann S., Kondermann D., Bodenstedt S., Sanchez A., Stock C., Kenngott H.G., Eisenmann M., Speidel S. Can masses of non-experts train highly accurate image classifiers? Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2014:438–445.
  33. Minka T.P. Estimating a Dirichlet distribution. Technical Report. M.I.T.; 2000.
  34. Mosimann J.E. On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika. 1962;49(1/2):65–82.
  35. Nieswiadomy R.M. Foundations in Nursing Research. Pearson Higher Ed; 2011.
  36. Penn A., 2015. ibootci. https://www.mathworks.com/matlabcentral/fileexchange/52741.
  37. R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. ISBN 3-900051-07-0.
  38. Rosner B. Fundamentals of Biostatistics. Nelson Education; 2015.
  39. Shah V., Pohida T., Turkbey B., Mani H., Merino M., Pinto P.A., Choyke P., Bernardo M. A method for correlating in vivo prostate magnetic resonance imaging and histopathology using individualized magnetic resonance-based molds. Rev. Sci. Instrum. 2009;80(10):104301. doi: 10.1063/1.3242697.
  40. Taha A.A., Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging. 2015;15(1):29. doi: 10.1186/s12880-015-0068-x.
  41. Top A., Hamarneh G., Abugharbieh R. Active learning for interactive 3D image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2011:603–610.
  42. Tu S., 2014. The Dirichlet-multinomial and Dirichlet-categorical models for Bayesian inference. Computer Science Division, UC Berkeley, Tech. Rep. http://www.cs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf.
  43. Warfield S.K., Zou K.H., Wells W.M. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imag. 2004;23(7):903–921. doi: 10.1109/TMI.2004.828354.
  44. Warnes G.R., Bolker B., Lumley T., 2015. gtools: Various R Programming Tools. R package version 3.5.0.
  45. Zhu Y., 2002. Correlated multinomial data. Encyclopedia of Environmetrics.
  46. Zöllei L., Wells W. Multi-modal image registration using Dirichlet-encoded prior information. International Workshop on Biomedical Image Registration. Springer; 2006:34–42.
