Author manuscript; available in PMC: 2016 Sep 1.
Published in final edited form as: IEEE J Biomed Health Inform. 2015 Apr 30;19(5):1589–1597. doi: 10.1109/JBHI.2015.2428279

Optimal MAP Parameters Estimation in STAPLE using Local Intensity Similarity Information

Subrahmanyam Gorthi 1, Alireza Akhondi-Asl 1, Simon K Warfield 1
PMCID: PMC4587381  NIHMSID: NIHMS721303  PMID: 25955854

Abstract

In recent years, fusing segmentation results obtained from multiple template images has become standard practice in many medical imaging applications; such multiple-templates-based methods have been found to provide more reliable and accurate segmentations than single-template-based methods. In this paper, we present a new approach for learning prior knowledge about the performance parameters of template images using local intensity similarity information; we also propose a methodology to incorporate that prior knowledge through the estimation of the optimal MAP parameters. The proposed method is evaluated in the context of segmenting structures in brain Magnetic Resonance (MR) images, by comparing our results with those of several state-of-the-art segmentation methods. These experiments clearly demonstrate the advantages of learning and incorporating prior knowledge about the performance parameters using the proposed method.

Index Terms: Medical Imaging, Segmentation, Atlas-based Segmentation, Label Fusion, STAPLE, MAP Formulation, MRI, Brain

I. Introduction

Many recent works have shown that automated segmentations obtained from multiple template images are more accurate than those obtained from single-template-based methods [1]–[10]. Multiple-templates-based segmentation can be defined as the alignment of a set of reference images, together with their corresponding segmentations, to the target image to be segmented, followed by the fusion of those aligned segmentations to estimate the reference standard segmentation.

Fusion methods can be broadly classified into three categories: (i) voting-based methods [4]–[6], (ii) distance-based methods [7], [10], and (iii) statistically driven methods [1]–[3], [8], [11]–[13]. Voting-based methods assign a weight to the decisions made by each template regarding the probable output label at each voxel in the target image, and finally select a label that satisfies certain optimal criteria. Distance-based methods compute the signed Euclidean distances to the contours of the structures, weigh those distances based on the similarity information, and finally assign a label that results in the least cumulative weighted Euclidean distance. On the other hand, the third category of statistical fusion methods simultaneously estimates both the probable output segmentation and performance parameters for each template, using an iterative approach.

Simultaneous Truth and Performance Level Estimation (STAPLE) is a widely used algorithm [1] that belongs to the third category of statistical fusion methods. The STAPLE algorithm not only generates the output segmentation (or reference standard), but also simultaneously rates the performance of the input segmentations. In practice, there are two specific scenarios where the STAPLE algorithm is widely used. First, it is used to generate ground truth segmentations (also known as the "reference standard") from multiple manual delineations prepared by multiple experts (or even multiple delineations prepared by the same expert at different times). Second, the STAPLE algorithm is used for merging multiple automated segmentations that are obtained by registering multiple template images to a new target image, thereby generating more accurate segmentations for the target image than the individual segmentations from each template. It is this second scenario that we focus on in the current manuscript.

The Expectation-Maximization (EM) approach used with the classical STAPLE algorithm guarantees convergence to a local optimum solution. However, if we can incorporate appropriate prior knowledge about the performance parameters of the templates into the Maximum-a-Posteriori (MAP) formulation of the STAPLE [12], [13], then it can provide more accurate estimations of both the reference standard and performance parameters.

The MAP-based formulation of the STAPLE algorithm has been used previously, for a different purpose, to merge manual delineations made by multiple experts [12], [13]; it was used in the context of performing fusion with missing manual delineations for some of the structures of interest in one or more template images. Such a situation arises when some of the experts did not delineate all the structures of interest, but only a subset of the labels. To address such a scenario, the authors in [12], [13] proposed to incorporate this "missing" information into STAPLE by appropriately constraining the performance parameters through the MAP formulation. As that approach is specifically designed to deal with the fusion problem in the presence of missing data, it does not distinguish between the performances of the regular templates without any missing data. The current manuscript addresses a completely different problem: learning prior knowledge about the performance parameters of automated segmentations obtained from multiple template images, and incorporating it into the MAP formulation of the STAPLE algorithm.

In this paper, we introduce a general and powerful framework for learning prior knowledge about the performance parameters of each label in each template, and for using that information to optimally set the MAP parameters of the STAPLE algorithm. More specifically, we propose here a new approach for learning the relationships between the local intensity similarities and the performance parameters of each label. Some of the previous works learn prior knowledge about performance parameters based on the training data [14], and also from the labels of the template images [13]. To the best of our knowledge, this is the first work that deals with learning prior knowledge about the performance parameters from the intensity information of the template images.

The method that we propose in this paper is an extension of the preliminary ideas that we presented in a recent workshop paper [15]. There are, however, substantial new contributions and extensions in the current manuscript compared to [15], as follows. First, we modified the way we learn the relationships between the performance parameters and the similarity information. Second, unlike in [15], we compute the relationships locally rather than globally. Finally, we present a comprehensive evaluation on 10 subcortical structures in brain MR images, and we compare our results with many state-of-the-art fusion methods.

The rest of the paper is organized as follows. Section II describes our new method, and the proposed optimal MAP parameters estimation procedure. Section III presents a detailed evaluation of the proposed method for the segmentation of subcortical structures in brain MR images, and comparisons with state-of-the-art methods. Finally, discussion and conclusions are presented in Section IV.

II. Methods

A. Regular EM-based Formulation of STAPLE

As mentioned in the preceding section, the STAPLE algorithm takes multiple segmentation results obtained from multiple template images as the input; it then estimates both the final output segmentations and the performance parameters for each template image.

Let $D = \{D_1, \ldots, D_i, \ldots, D_N\}$ be a matrix of size $J \times N$, where $J$ and $N$ are respectively the number of templates and the number of voxels. In this matrix, $D_i = [D_{i1}, \ldots, D_{ij}, \ldots, D_{iJ}]'$, and $D_{ij}$ is the label of template $j$ at voxel $i$. The goal here is to estimate the output segmentation $T = \{T_1, \ldots, T_i, \ldots, T_N\}$ and the performance parameters $\theta = \{\theta_1, \ldots, \theta_j, \ldots, \theta_J\}$, where $\theta_j$ is a matrix of size $S \times S$, $\theta_{js's} = f(D_{ij} = s' \mid T_i = s)$, and $S$ is the number of segmentation labels.

Since both the output segmentation ($T$) and the performance parameters ($\theta$) are unknown, the following complete data log-likelihood function is maximized iteratively using an EM algorithm:

$$Q(\theta \mid \theta^t) = \sum_i \sum_j \sum_s W_{si}^t \log\left(\theta_{j D_{ij} s}\right), \qquad (1)$$

where $W_{si}^t$ is the posterior probability of the reference standard segmentation $T_i$ being label $s$.

The EM algorithm approaches the problem of maximizing the above log-likelihood function by proceeding iteratively through expectation and maximization steps. In the expectation step, the evaluation of $Q(\theta \mid \theta^t)$ requires the computation of the posterior probability of $T$ for each label $s$, which is given by:

$$P(T = s \mid D, \theta^t) = \prod_i W_{si}^t = \prod_i \frac{P(T_i = s) \prod_j \theta^t_{j D_{ij} s}}{\sum_{s'} P(T_i = s') \prod_j \theta^t_{j D_{ij} s'}}. \qquad (2)$$

Given the estimated weight variables $W_{si}^t$, the new performance parameters $\theta^{t+1}$ at iteration $(t+1)$ are computed by maximizing the complete data log-likelihood function $Q(\theta \mid \theta^t)$.
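To make the two steps concrete, the following is a minimal NumPy sketch of the binary EM iteration, a simplified illustration of Eqs. (1)-(2) and not the authors' implementation; the initialization values and the flat prior on $T_i$ are our own choices:

```python
import numpy as np

def staple_em_binary(D, prior=0.5, n_iters=50):
    """D: (N voxels, J templates) array of binary decisions D_ij.
    Returns the posterior W (Eq. (2)) and per-template sensitivity p
    (theta_j11) and specificity q (theta_j00)."""
    N, J = D.shape
    p = np.full(J, 0.9)   # initial sensitivities
    q = np.full(J, 0.9)   # initial specificities
    for _ in range(n_iters):
        # E-step, Eq. (2): posterior probability that T_i = 1
        a = prior * np.prod(np.where(D == 1, p, 1 - p), axis=1)
        b = (1 - prior) * np.prod(np.where(D == 0, q, 1 - q), axis=1)
        W = a / (a + b)
        # M-step: maximize Eq. (1) with respect to theta (no prior term here)
        p = (W[:, None] * (D == 1)).sum(axis=0) / W.sum()
        q = ((1 - W)[:, None] * (D == 0)).sum(axis=0) / (1 - W).sum()
    return W, p, q
```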

The above EM formulation of the STAPLE algorithm guarantees convergence to a local optimum. However, incorporating appropriate prior knowledge about the performance parameters of the template images through the MAP formulation of the STAPLE algorithm can not only lead to convergence to a global (or stronger local) optimum, but can also yield more accurate estimates of both the performance parameters and the output segmentations. The following subsection presents the beta-distribution-based MAP formulation of STAPLE.

B. Beta Distribution based MAP formulation of STAPLE

The MAP formulation of the STAPLE algorithm can be expressed as:

$$Q_{MAP}(\theta \mid \theta^t) = Q(\theta \mid \theta^t) + \gamma \log\left(p(\theta)\right), \qquad (3)$$

where $p(\theta)$ is the prior probability of the performance parameters, and $\gamma$ is the weighting parameter between the data term and the MAP prior term. As the performance parameters for each template and each label can be considered independent of each other [13], $p(\theta)$ can be expressed as the product of the probabilities of the individual performance parameters, denoted $p(\theta_{js's})$.

Similar to [13], in this paper we use the beta distribution $B_{\alpha,\beta}(x) = \frac{1}{Z}\, x^{\alpha-1}(1-x)^{\beta-1}$ for modeling the prior probability of each performance parameter. The main advantage of the beta distribution is that it facilitates modeling a variety of differently shaped performance characteristics by simply varying the two shape parameters, $\alpha$ and $\beta$; moreover, its logarithm and derivatives, which are required during the optimization procedure, are straightforward to obtain. Using the beta distribution for the prior probabilities of the performance parameters leads to the following expected value of the complete data log-likelihood function:

$$Q_{MAP}(\theta \mid \theta^t) = \sum_i \sum_j \sum_s W_{si}^t \log\left(\theta_{j D_{ij} s}\right) + \gamma \sum_j \sum_{s'} \sum_s \left[(\alpha_{js's} - 1)\log(\theta_{js's}) + (\beta_{js's} - 1)\log(1 - \theta_{js's})\right]. \qquad (4)$$

Notice that the computation of the posterior probabilities depends only on the current estimate $\theta^t$, and not on the prior on these parameters; hence, the computation of the posterior probabilities of the output segmentation $T$ for each label $s$ is the same for both the EM-based and MAP-based formulations of the STAPLE algorithm, and was already presented in Eq. (2).

The $\theta$ values that optimize Eq. (4) are obtained by equating the derivatives of $Q_{MAP}$ to zero for each template image $j$; this results in the following system of equations:

$$\theta_{js's}^t = \frac{\gamma A_{s's} + \sum_{i: D_{ij} = s'} W_{si}^t}{\sum_n \left(\gamma A_{ns} + \sum_{i: D_{ij} = n} W_{si}^t\right)}, \qquad (5)$$

where

$$A_{ns} = \alpha_{jns} + \beta_{jns} + \frac{\beta_{jns} - 1}{\theta_{jns} - 1} - 2. \qquad (6)$$

The above system of equations always has a unique solution, known as a fixed point. The solution scheme is an iterative process, described in detail in [12].

In the case of a binary segmentation problem (i.e., s ∈ {0, 1}), several simplifications can be made to the above system of equations, which finally yield the following analytical closed-form solution [13]:

$$\theta_{jss}^t = \frac{\sum_{i: D_{ij} = s} W_{si}^t + \gamma(\alpha_{jss} - 1)}{\sum_i W_{si}^t + \gamma(\alpha_{jss} + \beta_{jss} - 2)}, \qquad \theta_{j01}^t = 1 - \theta_{j11}^t, \qquad \theta_{j10}^t = 1 - \theta_{j00}^t. \qquad (7)$$
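As an illustration of how the beta prior enters Eq. (7), the sketch below implements the binary MAP M-step: the prior acts as pseudo-counts $\gamma(\alpha - 1)$ and $\gamma(\beta - 1)$ added to the voting statistics. The function and variable names are our own, not the authors' released code:

```python
import numpy as np

def map_m_step_binary(W, D, alpha11, beta11, alpha00, beta00, gamma):
    """Binary MAP M-step of Eq. (7). W: (N,) posterior of label 1;
    D: (N, J) binary decisions; alpha*/beta*: (J,) beta-prior parameters."""
    # sensitivity theta_j11, regularized by the prior on the diagonal element
    p = ((W[:, None] * (D == 1)).sum(axis=0) + gamma * (alpha11 - 1)) / \
        (W.sum() + gamma * (alpha11 + beta11 - 2))
    # specificity theta_j00, same form with the complementary posterior
    q = (((1 - W)[:, None] * (D == 0)).sum(axis=0) + gamma * (alpha00 - 1)) / \
        ((1 - W).sum() + gamma * (alpha00 + beta00 - 2))
    return p, q   # off-diagonals follow as theta_j01 = 1-p, theta_j10 = 1-q
```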

In [12], [13], the authors used the MAP solution for the specific problem of missing data. To this end, they used a set of empirically fixed parameters for all templates containing labels, chosen so that the priors place probability close to one on the diagonal performance parameters and close to zero on the off-diagonal ones. In this paper, by contrast, we are interested in incorporating prior knowledge about the performance parameters of each label in each template. The following subsection presents our approach for achieving this goal, which is based on learning the relationships between the performance parameters and image similarity information.

C. Learning Performance Parameters vs. Image Similarity Relations

In this paper, we consider the binary segmentation problem. Notice that, in case of binary segmentation, the diagonal elements of the performance matrix θ represent specificity and sensitivity [13], while the off-diagonal elements are (1-sensitivity) and (1-specificity); thus, we only need to learn prior knowledge about sensitivity and specificity. Please note that, in the rest of the paper, when we say “performance parameters”, we are actually referring to only the diagonal elements of the matrix θ (i.e., specificity and sensitivity).

A common underlying assumption in many fusion methods [4]–[7] is that the accuracy of the segmentations obtained from a given template is proportional to its intensity similarity to the target intensity image. Similarly, we assume here that if the intensity similarity of a template to the target intensity image is low, there is a high probability that its performance parameters are poor. This assumption is based on the observation that low intensity similarity can indicate significant anatomical differences between the template and the target intensity images, and/or considerable error in registering the template to the target intensity image; since both of these scenarios eventually reduce the accuracy of the segmentations obtained from that particular template, we make the aforementioned assumption.

We then proceed further by learning the relationships between the performance parameters and the intensity information, by using all templates as our training data. The training procedure that we proposed in [15] for learning the prior knowledge is briefly as follows:

  1. Select an image from the template database, and treat it as the target image to be segmented (i.e., pseudo-target image). The rest of the images in the database are used as templates for that pseudo-target image.

  2. Compute the non-consensus mask for the pseudo-target image, containing only those voxels for which at least two template images disagree on the output label, and compute both performance parameters (sensitivity and specificity) over this mask.

  3. Compute intensity similarities over the non-consensus mask.

  4. Repeat steps 1 to 3 for each image in the template database using a leave-one-out approach.

  5. Upon completion of step 4, for a database of J templates, we will have J(J − 1) pairs of sensitivity (or specificity) versus similarity values. Perform a robust linear regression analysis to obtain the final parameters representing the overall relation between sensitivity (or specificity) and image similarity. (A sketch of steps 2 and 3 is given after this list.)
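For concreteness, the following is a minimal sketch of steps 2 and 3 for a single pseudo-target, assuming the template segmentations and intensity images have already been aligned to it (registration is out of scope here; all names are our own illustration):

```python
import numpy as np

def performance_and_similarity(seg_true, segs, img_target, imgs):
    """seg_true: (N,) binary ground truth of the pseudo-target;
    segs: (J, N) aligned template segmentations;
    img_target, imgs: corresponding (N,) and (J, N) intensities."""
    # step 2: non-consensus mask, i.e., voxels where the templates disagree
    mask = segs.min(axis=0) != segs.max(axis=0)
    pairs = []
    for j in range(segs.shape[0]):
        s, t = segs[j][mask], seg_true[mask]
        sens = ((s == 1) & (t == 1)).sum() / max((t == 1).sum(), 1)
        spec = ((s == 0) & (t == 0)).sum() / max((t == 0).sum(), 1)
        # step 3: intensity similarity (NCC) over the same mask
        ncc = np.corrcoef(imgs[j][mask], img_target[mask])[0, 1]
        pairs.append((ncc, sens, spec))
    return pairs   # the (similarity, performance) pairs used in step 5
```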

In this paper, we propose the following modifications to the aforementioned learning approach:

  1. Instead of learning the relationships over the entire image, we propose to learn them locally. This is based on the well-known observation that the intensity similarity between two images can vary significantly across spatial locations; thus, making inferences based on local intensity similarity can yield more accurate results than inferences based on global intensity similarity.

  2. In order to avoid introducing any undesired bias while estimating the relationships, unlike in the aforementioned approach, we do not use any mask; instead, we compute the similarity metric at each voxel based on the intensity information at all neighboring voxels within a predefined radius (rp) of that voxel.

  3. Notice that learning the relationships locally using the approach previously proposed in [15] would require performing a robust linear regression at each voxel in the image; such an approach becomes computationally very demanding as the number of template images and the image size increase. Hence, in this paper, we propose a new approach that estimates the MAP parameters directly from the similarity metric values, without requiring a regression analysis at each voxel.

The MAP parameters estimation procedure that we propose in this paper is presented in the following subsection.

D. MAP Parameters Estimation

As described in the preceding subsection, if the similarity between a template and the target intensity image is low, there is a high probability that the performance parameters of that template are low; similarly, we could expect high values of performance parameters (i.e., sensitivity and specificity) for the segmentations obtained from a template that has high intensity similarity to the target image.

The intensity similarity between two images can be estimated using various metrics such as the Mean Square Error (MSE) and the Normalized Cross Correlation (NCC). In this work, we use NCC as the intensity similarity metric; however, the proposed approach can be readily adapted to other similarity metrics as well.

Let φij represent the NCC value between the jth aligned template image and the target image, computed over a neighborhood patch of radius rp centered at the ith voxel. The value of φij varies between −1 and +1, whereas the performance parameters vary between 0 and 1. In order to map high intensity similarity values to high performance parameters during the initialization, and to map the range of NCC values to the range of performance parameter values, we first apply the following exponential-based transformation:

$$m_{ij} = \frac{1}{1 + e^{-A(\varphi_{ij} - b)}}, \qquad (8)$$

where A and b are respectively the scale and shift parameters, which can be optimized for each specific problem. Notice that, before applying the exponential, the above function shifts φij by b and then scales it by A; thus, intuitively, this function not only maps NCC values from [−1, 1] to (0, 1), but also reduces the weight (or importance) given to φij values below b.
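A minimal sketch of this mapping is given below: a local NCC computed over a cubic patch of half-width rp, followed by the transform of Eq. (8). The patch-extraction details are our own simplification (boundary handling is omitted):

```python
import numpy as np

def local_ncc(target, template, center, r_p):
    """NCC between cubic patches of half-width r_p centered at a voxel
    (center is a tuple of 3 indices; boundary handling omitted)."""
    sl = tuple(slice(c - r_p, c + r_p + 1) for c in center)
    x = target[sl].ravel() - target[sl].mean()
    y = template[sl].ravel() - template[sl].mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def mode_from_ncc(phi, A, b):
    """Eq. (8): map phi in [-1, 1] to a beta-distribution mode in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-A * (phi - b)))
```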

We now present how we use the mij values to compute the αjss and βjss parameters of the beta distribution at each voxel, for each template j and label s.

Notice that the mode of a beta distribution $B_{\alpha,\beta}(x)$ is the value of $x$ at which the distribution attains its maximum. In other words, the mode can be interpreted as the "best guess" of what we are likely to observe in a single realization of the quantity of interest. Subjective estimates of the mode are not only easier to elicit, but also more reliable than subjective estimates of other characteristics (or parameters) such as the mean, α, and β values.

In our previous work [15], we assumed that a linear relationship exists between the mode values of the beta distribution of each performance parameter and the intensity similarity metric. In the current work, we relax this rather strong assumption of a linear relationship between the mode and NCC values by assuming the more general relationship presented in Eq. (8). The scale and shift parameters of the exponential-based transformation in Eq. (8) give more freedom in choosing the exact form (or shape) of the relationship between the mode values of the performance parameters and the NCC values. Although we use the same scale and shift values for all voxels in our current experiments, the proposed framework allows optimizing these values individually for each label. To summarize, we assume here that the mode of the beta distribution for template j at voxel i occurs at mij.

Further, notice that the variance of the beta distribution indicates our confidence in the prior knowledge about the performance parameters learned from the intensity similarities; in other words, a small variance indicates high confidence in the prior knowledge, and conversely, a high variance indicates low confidence. In all experiments presented in this paper, we have empirically set the variance of the beta distribution to a fixed value (1e-4).

This implies that for each beta distribution, we know the mode and variance values, and the goal now is to obtain their equivalent α and β values as parameterized in Eq. (4). For this purpose, we use the method that was proposed in [16], and we now briefly summarize the derivation procedure.

Let m and σ2 respectively represent the mode and variance of the beta distribution. Since the mode occurs when the beta distribution reaches maximum, i.e., when the derivative is zero, the mode of the beta distribution parametrized in terms of α and β parameters is given by:

$$m = \frac{\alpha - 1}{\alpha + \beta - 2}. \qquad (9)$$

Similarly, the variance of the beta distribution is given by:

$$\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}. \qquad (10)$$

Our goal now is to obtain the α and β values that result in the mode and variance given by Eq. (9) and Eq. (10), respectively. This can be achieved by rewriting and solving the above two equations in the standard way; a more detailed description of the procedure can be found in [16].

For the convenience of notation, let us define an intermediate variable τ as:

$$\tau = \frac{\sigma^2}{(1 - m)^2}. \qquad (11)$$

Then, the parameter β of the beta-distribution corresponds to the largest positive real root of the following cubic equation:

$$c_3 \beta^3 + c_2 \beta^2 + c_1 \beta + c_0 = 0, \qquad (12)$$

whose coefficients are given by:

$$\begin{aligned}
c_0 &= -12\tau m^3 + 20\tau m^2 - 11\tau m + 2\tau, \\
c_1 &= 16\tau m^2 + (2 - 18\tau)m + 5\tau - 1, \\
c_2 &= -(7\tau + 1)m + 4\tau, \\
c_3 &= \tau.
\end{aligned}$$

The other shape parameter α of the fitted beta distribution is given by:

$$\alpha = \frac{(\beta - 2)m + 1}{1 - m}. \qquad (13)$$
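The fit of Eqs. (11)-(13) is straightforward to implement; the sketch below solves the cubic numerically and selects the largest positive real root, as prescribed above:

```python
import numpy as np

def beta_params_from_mode_var(m, sigma2):
    """Given mode m in (0, 1) and variance sigma2 of a beta distribution,
    return the (alpha, beta) shape parameters via Eqs. (11)-(13)."""
    tau = sigma2 / (1.0 - m) ** 2                                           # Eq. (11)
    coeffs = [tau,                                                          # c3
              -(7.0 * tau + 1.0) * m + 4.0 * tau,                           # c2
              16.0 * tau * m**2 + (2.0 - 18.0 * tau) * m + 5.0 * tau - 1.0, # c1
              -12.0 * tau * m**3 + 20.0 * tau * m**2
              - 11.0 * tau * m + 2.0 * tau]                                 # c0
    roots = np.roots(coeffs)                                                # Eq. (12)
    real = roots[np.abs(roots.imag) < 1e-9].real
    beta = real[real > 0].max()          # largest positive real root
    alpha = ((beta - 2.0) * m + 1.0) / (1.0 - m)                            # Eq. (13)
    return alpha, beta
```

As a sanity check, the empirical prior of [13] discussed in Section III-C (mode 0.89, variance 0.02) is recovered by this fit as (α, β) ≈ (5, 1.5).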

To summarize, prior knowledge about the performance parameters of each template at each voxel is inferred from the intensity information; this prior knowledge is incorporated into the MAP formulation of STAPLE presented in Eq. (4) through the α and β parameters of the beta distribution, computed using Eq. (13) and Eq. (12), respectively.

Regarding the weighting parameter γ in Eq. (4): in all our experiments, we set its value to the average number of voxels in the output label obtained from the simple EM-based STAPLE algorithm; in this way, the two terms in the MAP formulation of STAPLE have approximately equal weight. When applying the MAP-STAPLE algorithm locally, for each voxel, over a neighborhood of radius rs, we scale the γ value accordingly, using the empirically derived expression presented in [17]; thus, the new weight γ′ is given by:

$$\gamma' = \frac{\gamma\, N_w \ln(J)}{N}, \qquad (14)$$

where N is the total number of voxels in the image, J is the number of templates, and Nw is the number of voxels present within the cube of radius rs.
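As an illustrative instance (using the settings from our experiments, not an additional result): with J = 17 templates and rs = 7, the cube of radius rs has side 2rs + 1 = 15 and hence Nw = 15³ = 3375 voxels, so Eq. (14) gives γ′ = γ · 3375 · ln(17)/N ≈ 9.6 × 10³ · γ/N.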

Finally, we summarize the complete algorithm described in this section.

  1. Compute the local intensity similarity metric (NCC) at each voxel in the target image, for each aligned template image.

  2. Compute the mode values corresponding to each similarity metric value (computed in step 1) using the exponential-based transformation presented in Eq. (8).

  3. Compute the α and β parameters of the beta distributions corresponding to each mode value (computed in step 2) using the system of equations presented in Eq. (12) and Eq. (13).

  4. Compute the weighting parameter γ′ for a given weight (γ) and rs using the expression presented in Eq. (14).

  5. Solve iteratively the system of equations presented in Eq. (2) and Eq. (5).

Please note that, unlike in [15], we compute the intensity similarity metric (step 1) at a given voxel based on the intensity information at all neighboring voxels within a radius of rp; on the other hand, like most STAPLE-based algorithms, we estimate the ground truth (step 5) only at the non-consensus voxels.
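The following end-to-end sketch ties these steps together for the binary case, reusing the helper functions sketched earlier in this section (mode_from_ncc, beta_params_from_mode_var, map_m_step_binary). For brevity it uses a single global NCC per template, whereas the proposed method computes the NCC and performance parameters locally, per voxel, within windows of radii rp and rs, with γ scaled by Eq. (14); it is our own simplified illustration, not the authors' implementation:

```python
import numpy as np

def optimal_map_staple_binary(target, templates, segs, A, b,
                              sigma2=1e-4, n_iters=30):
    """Simplified global sketch. target: (N,) intensities; templates: (J, N)
    aligned intensities; segs: (J, N) aligned binary segmentations."""
    J, N = segs.shape
    D = segs.T                                  # (N, J) decisions
    gamma = segs.sum(axis=1).mean()             # proxy for the average label size
    # steps 1-3: similarity -> mode (Eq. (8)) -> beta parameters (Eqs. (11)-(13))
    alpha = np.empty(J); beta = np.empty(J)
    for j in range(J):
        phi = np.corrcoef(target, templates[j])[0, 1]
        m = mode_from_ncc(phi, A, b)
        alpha[j], beta[j] = beta_params_from_mode_var(m, sigma2)
    # step 5: EM iterations alternating Eq. (2) and the MAP update of Eq. (7);
    # the same prior is used here for sensitivity and specificity for simplicity
    p = np.full(J, 0.9); q = np.full(J, 0.9)
    for _ in range(n_iters):
        num1 = 0.5 * np.prod(np.where(D == 1, p, 1 - p), axis=1)
        num0 = 0.5 * np.prod(np.where(D == 0, q, 1 - q), axis=1)
        W = num1 / (num1 + num0)
        p, q = map_m_step_binary(W, D, alpha, beta, alpha, beta, gamma)
    return W    # per-voxel posterior of the foreground label
```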

III. Experiments

In this section, we validate our new method in the context of segmentation of structures in the 3D brain MR images. In addition, we compare the results from our proposed approach with the results from some of the state-of-the-art fusion methods.

A. Dataset

We utilize the IBSR brain dataset1 of 18 healthy subjects for our experiments. It is a publicly available dataset that contains T1 intensity images of subjects, and the corresponding ground truth segmentations for various structures in the brain. We considered 10 subcortical structures for our evaluation: (i) Left Thalamus, (ii) Right Thalamus, (iii) Left Caudate, (iv) Right Caudate, (v) Left Putamen, (vi) Right Putamen, (vii) Left Pallidum, (viii) Right Pallidum, (ix) Left Hippocampus, and (x) Right Hippocampus.

B. Registration Procedure

The registration procedure that we followed in this paper is very similar to [19]. We started with a linear registration step for initial alignment: we linearly registered all 18 brain images to a common template using the FMRIB Software Library (FSL) FLIRT tool, with the following settings: 9-parameter affine, correlation ratio cost function, and trilinear interpolation. The common template was the "nonlinear MNI152", the nonlinear average template in MNI space used by FSL.

We then rigidly registered each of the 18 brain images in the MNI space to the rest of the 17 images in a leave-one-out manner, again using FLIRT with the following settings: 6-parameter, correlation ratio, trilinear interpolation.

As a final registration step, we performed non-rigid registration between each pair of rigidly registered images, using the diffeomorphic demons registration algorithm proposed in [20]. For this purpose, we used the publicly available ITK implementation of the diffeomorphic demons registration [21]. As in [19], we used the following settings for this non-rigid registration: 3 multi-scale-pyramid levels with iterations of 30, 20 and 10 respectively, smoothing sigma of 2 for the deformation field, and use of histogram matching prior to registration.
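For reference, the affine step described above can be reproduced with a FLIRT invocation along the following lines; this is a sketch with placeholder file names (the flags correspond to the settings listed above, and -dof would be 6 for the subsequent rigid steps):

```python
import subprocess

# 9-parameter affine registration of one subject to the MNI template
# (input/output file names are placeholders)
subprocess.run([
    "flirt",
    "-in", "subject01_T1.nii.gz",
    "-ref", "MNI152_T1_2mm.nii.gz",
    "-dof", "9",
    "-cost", "corratio",
    "-interp", "trilinear",
    "-out", "subject01_mni.nii.gz",
    "-omat", "subject01_mni.mat",
], check=True)
```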

C. Fusion Methods and Parameters

We compare the results from our new method with the results obtained from various categories of existing fusion methods, namely, simple voting method, voting-based method that uses local intensity information ([4]), STAPLE-based methods that do not use any intensity information ([1], [13], [18]), and STAPLE-based method that uses local intensity information ([8]).

More specifically, we evaluate the segmentation results obtained from the following fusion methods:

  1. Majority Voting (MV)

  2. STAPLE [1]

  3. COLLATE [18]

  4. Local Weighted Voting (LWV) [4]

  5. Logarithmic Opinion Pool STAPLE (LOP-STAPLE) [8]

  6. Empirical local MAP-STAPLE [13]

  7. Our new optimal local MAP-STAPLE

For the STAPLE [1] and COLLATE [18] methods, we used the default parameters presented in the respective papers. For empirical local MAP-STAPLE, we used the same parameters that were used in [13] for templates without missing labels, i.e., we set the α and β values to 5 and 1.5 respectively for the diagonal elements of the performance matrix for all template images; notice that those values of α and β are equivalent to setting a mode of 0.89 and a variance of 0.02 for all template images. We use the same value of rs (i.e., the half window size for computing performance parameters) for both empirical local MAP-STAPLE and our optimal local MAP-STAPLE, so that the comparison between these two methods is fair.

Unlike the above three methods, LWV, LOP-STAPLE, and optimal local MAP-STAPLE have certain parameters, especially those related to the intensity information, that need to be optimized. To this end, we optimized those parameters for each fusion method independently, by evaluating the Dice similarity coefficient over all 10 structures and all 18 images in the dataset; for each fusion method, we finally selected the parameters that resulted in the maximum overall Dice similarity coefficient. The parameters used for the different fusion methods are described in Table I.

TABLE I.

The parameters used for Local Weighted Voting (LWV) [4], Logarithmic Opinion Pool based STAPLE (LOP-STAPLE) [8], empirical local MAP-STAPLE method [13], and our new optimal local MAP-STAPLE are described in this table.

Fusion Method | Parameter values | Description

LWV | rp = 3 | Half window size for computing intensity similarity

LOP-STAPLE | rp = 2 | Half window size for computing intensity similarity
 | A = 5 | Scale parameter
 | b = 0.8 | Shift parameter

Empirical Local MAP-STAPLE | αjs′s (s′ ≠ s) = 1.5 | α value for non-diagonal elements of the beta distribution
 | αjss = 5 | α value for diagonal elements of the beta distribution
 | βjs′s (s′ ≠ s) = 5 | β value for non-diagonal elements of the beta distribution
 | βjss = 1.5 | β value for diagonal elements of the beta distribution
 | rs = 7 | Half window size for computing performance parameters

Optimal Local MAP-STAPLE | rp = 4 | Half window size for computing intensity similarity
 | A = 3 | Scale parameter
 | b = 0.8 | Shift parameter
 | rs = 7 | Half window size for computing performance parameters

D. Evaluation Results

In this subsection, we evaluate our new “optimal local MAP-STAPLE” fusion method, by comparing it with the existing MV, STAPLE, COLLATE, LWV, LOP-STAPLE, and empirical local MAP-STAPLE methods. We perform the evaluation in the context of segmenting 10 subcortical structures in the IBSR brain dataset of 18 images. We use the leave-one-out approach for template-fusion, i.e., for each target image, we combine the segmentation results obtained from the remaining 17 template images that are registered to the current target image.

As we consider only the binary segmentation problem, all 10 structures are segmented independently. To speed up the fusion process, we computed a region of interest for each structure based on the labeled images of all templates, and then cropped the images accordingly.

We use the average Dice similarity coefficient to compare the fusion methods. Further, in order to evaluate the statistical significance of the results, we also perform a two-sided Wilcoxon signed-rank test (with a significance level of 0.05) between each existing fusion method and the new method. Table II presents the mean and standard deviation of the Dice similarity coefficients obtained from all fusion methods, for all structures, along with the statistical test results. Finally, Fig. 1 shows a representative segmentation obtained for one of the images, using MV, STAPLE, and the proposed method.
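The evaluation metrics are standard; the sketch below shows the Dice coefficient and the paired two-sided Wilcoxon signed-rank test (via scipy.stats.wilcoxon) as one would apply them to arrays of per-structure, per-subject scores (variable names are illustrative):

```python
import numpy as np
from scipy.stats import wilcoxon

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# dice_ours, dice_other: paired scores over 10 structures x 18 subjects (N = 180)
# stat, p = wilcoxon(dice_ours, dice_other, alternative="two-sided")
```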

TABLE II.

Average Dice similarity results, computational times, and statistical results for the segmentation of 10 subcortical structures, in a dataset of 18 subjects. The fusion methods evaluated are: (i) Majority Voting (MV), (ii) STAPLE [1], (iii) COLLATE [18], (iv) Local Weighted Voting (LWV) [4], (v) Logarithmic Opinion Pool based STAPLE (LOP-STAPLE) [8], (vi) empirical local MAP-STAPLE [13], and (vii) our new optimal local MAP-STAPLE. The best Dice similarity results are marked in bold.

Structure Name | MV | STAPLE | COLLATE | LWV | LOP-STAPLE | Empirical Local MAP-STAPLE | Optimal Local MAP-STAPLE

1. Left Thalamus | 88.04% | 88.25% | 88.25% | 88.12% | 88.41% | 87.98% | 88.45%
2. Right Thalamus | 87.25% | 87.47% | 87.55% | 87.43% | 87.74% | 87.22% | 88.01%
3. Left Caudate | 83.02% | 83.28% | 83.17% | 83.08% | 83.33% | 82.87% | 83.52%
4. Right Caudate | 82.03% | 82.23% | 82.12% | 82.16% | 82.31% | 81.79% | 82.76%
5. Left Putamen | 85.49% | 85.77% | 85.93% | 85.64% | 85.98% | 85.61% | 86.23%
6. Right Putamen | 84.77% | 85.17% | 85.25% | 85.12% | 85.57% | 84.89% | 86.10%
7. Left Pallidum | 73.19% | 75.06% | 74.78% | 74.48% | 75.52% | 74.49% | 76.74%
8. Right Pallidum | 70.64% | 73.26% | 73.90% | 72.52% | 74.27% | 71.90% | 76.13%
9. Left Hippocampus | 77.66% | 78.62% | 78.80% | 78.31% | 79.18% | 77.90% | 79.17%
10. Right Hippocampus | 77.79% | 78.87% | 79.04% | 78.52% | 79.40% | 78.35% | 79.45%

Average | 80.99% | 81.80% | 81.88% | 81.54% | 82.17% | 81.30% | 82.66%
Standard Deviation | 5.94% | 5.17% | 5.10% | 5.39% | 4.95% | 5.47% | 4.57%

Average computational time per structure | < 1 s | < 1 s | 6.5 s | 3.5 s | 8.1 s | 48.3 s | 52.5 s

Statistical comparison of each method against the proposed optimal local MAP-STAPLE:
p | < 1e-5 | 1e-5 | 1e-5 | 1e-5 | 0.0005 | 1e-5 | —
V | 15927 | 12640 | 11574.5 | 15828 | 10563 | 15174 | —
CH | 0.014 | 0.008 | 0.008 | 0.009 | 0.005 | 0.012 | —
CL | 0.009 | 0.004 | 0.003 | 0.006 | 0.001 | 0.008 | —

Fig. 1.


Screenshot of segmentation results for subcortical structures in one of the images in the IBSR dataset. Ground truth segmentations are shown in column (a); segmentation results from majority voting, STAPLE, and optimal local MAP-STAPLE are shown in columns (b), (c), and (d), respectively. The segmentations for the thalamus, caudate, putamen, pallidum, and hippocampus are shown in red, blue, green, magenta, and yellow, respectively. From qualitative comparisons with the ground truth segmentations in column (a), it can be seen that the proposed optimal local MAP-STAPLE provides the best segmentation results among these methods.

Based on the average Dice similarity coefficient values presented in Table II, the following observations can be made. The proposed optimal local MAP-STAPLE method provided the best overall segmentation results among all seven fusion methods, with an average Dice similarity coefficient of 82.66%. The best overall and structure-wise Dice similarity values are marked in bold in the table for easy reference. Looking at the results structure-wise, the proposed method provided the best segmentation results for 9 out of the 10 structures; for the remaining structure (the left hippocampus), the results from LOP-STAPLE are slightly better than those from our new method. To summarize, optimal local MAP-STAPLE provided the best overall segmentation results, followed by LOP-STAPLE, COLLATE, STAPLE, LWV, empirical local MAP-STAPLE, and MV, in that order.

Regarding the computational aspects, all experiments were run on a 64-bit 10-core workstation with an Intel Xeon 2.40 GHz processor and 47 GB of RAM. All fusion methods except COLLATE are implemented in C++ with parallel processing; for COLLATE, we used the MATLAB-based implementation provided by the authors of [18]. Table II presents the average computational time per structure for each fusion method; notice that the proposed algorithm took less than a minute (about 53 s) per structure. Thus, with the parallel implementation run on 10 cores, the proposed method incurs an additional overhead of only about 5 seconds per structure compared to empirical local MAP-STAPLE (48 s).

In addition to the average Dice similarity results, Table II also presents various statistics obtained from the two-sided Wilcoxon signed-rank tests, namely the p value, the sum of the ranks assigned to the differences with positive sign (V), and the confidence interval [CL, CH] associated with each comparison. Notice that, since these statistics are computed over 10 structures and a dataset of 18 subjects, the maximum possible value of V is 16290. From the large values of V obtained in all six statistical tests, together with the associated confidence intervals, we can conclude with 95% confidence that the Dice similarity coefficients obtained from the proposed method are statistically better than the results from all the other fusion methods. Thus, the results from the proposed method are better than those of the other six methods, both quantitatively and statistically.

IV. Discussion and Conclusions

In this paper, we have presented a new approach for learning prior knowledge about the performance parameters of template images. We have also proposed a methodology for incorporating this prior knowledge into the STAPLE algorithm.

To the best of our knowledge, this is the first work that deals with learning prior knowledge about the performance parameters (i.e., sensitivity and specificity) from the intensity information. The prior knowledge about the performance parameters is inferred based on the local intensity similarity between each template image and the target image; it is then incorporated into the fusion method through the estimation of the optimal parameters of the MAP-based formulation of the STAPLE algorithm.

The proposed algorithm has been evaluated in the context of segmentation of structures in the brain MR images. We compared the proposed “optimal local MAP-STAPLE” algorithm with six state-of-the-art methods, namely, (i) Majority Voting (MV), (ii) STAPLE [1], (iii) COLLATE [18], (iv) Local Weighted Voting (LWV) [4], (v) Logarithmic Opinion Pool based STAPLE (LOP-STAPLE) [8], and (vi) empirical local MAP-STAPLE [13].

Notice that among all seven fusion methods, the MV, STAPLE, COLLATE, and empirical local MAP-STAPLE algorithms do not take the intensity similarity information into account. On the other hand, the voting-based fusion algorithms (MV and LWV), unlike the STAPLE-based algorithms, are not based on an explicit estimation of rater performance parameters. In this perspective, among all seven methods, the proposed algorithm and LOP-STAPLE are the only ones that take into account both the intensity information and the rater performance parameters.

When compared to the LOP-STAPLE algorithm, our proposed method uses the local intensity information in a very different manner. LOP-STAPLE incorporates the local intensity information by modifying the way the reliability weights for each rater are computed in the EM-based STAPLE algorithm; in contrast, our proposed algorithm learns prior knowledge about the performance parameters of each rater from the local intensity information, and then incorporates it into the fusion process through the computation of the optimal parameters of the MAP-based STAPLE algorithm. In addition, unlike the EM-based formulation, the MAP-based formulation used in our algorithm can lead to convergence to a global (or stronger local) optimum, thereby yielding more accurate estimates of both the performance parameters and the output segmentations.

The aforementioned theoretical differences between the fusion methods are consistent with the quantitative results obtained in the context of segmenting structures in brain MR images. For instance, LOP-STAPLE and the proposed fusion algorithm provided the best segmentation results among all the methods, and of these two, the proposed method provided the best overall results. The improvements in the Dice similarity coefficient for the proposed method are statistically significant when compared with the rest of the fusion methods.

As mentioned in the preceding section, the parameters for each fusion method (shown in Table I) were optimized independently. For the current application, our proposed algorithm is found to be robust and relatively insensitive to changes in these parameters over a broad range. For instance, we observed that the behavior of the proposed algorithm with respect to changes in the scale (A) and shift (b) parameters is very similar to the sensitivity analysis results presented for the LOP-STAPLE algorithm in [8].

Similarly, in all our evaluations, we empirically set the variance (σ²) of the beta distribution to a fixed value of 1e-4. Notice that setting the variance to very high values is, in effect, equivalent to assuming a uniform prior on the performance parameters; conversely, very low variance values force the algorithm to converge strictly to the prior knowledge learned from the intensity information. In other words, the variance value indicates our confidence in the prior knowledge. We observed that, for the current application, the segmentation results from the proposed method are quite robust to changes in the variance value. In future work, we would like to perform a detailed sensitivity analysis of the various parameters of the proposed algorithm, and also explore the possibility of learning these parameters from training data.

Unlike in our preliminary work [15], where we assumed a strictly linear relationship between the mode values of the beta distributions (of the performance parameters) and the intensity similarity, we proposed here a more general exponential-based relationship. In future work, we would like to further investigate other possible relationships between the intensity similarity and the performance parameters. For instance, one could learn those relationships using deep learning techniques and then incorporate that information into the STAPLE algorithm; one could also explore other strategies, such as Bayesian learning of the MAP parameters.

In the current work, we have considered the binary segmentation problem. It is indeed possible to extend the proposed method to the multi-label segmentation problem. For binary segmentation, the system of equations for the performance parameters (Eq. (5)) has an analytical closed-form solution (Eq. (7)). For a multi-label problem, although the system of equations does not have a closed-form solution and can be computationally more expensive than the binary case, it still has a unique solution (a fixed point). In future work, we would like to extend the current framework to the multi-label segmentation problem, and also develop computationally more efficient models for multi-label fusion.

Acknowledgments

Subrahmanyam Gorthi is supported by the Swiss National Science Foundation (SNF) under grant P2ELP2 148892. This research was supported in part by NIH grants R01 EB013248 and R01 NS079788.

Footnotes

1. The MR brain data sets and their manual segmentations were provided by the Center for Morphometric Analysis at Massachusetts General Hospital and are available at http://www.cma.mgh.harvard.edu/ibsr/.

References

1. Warfield S, Zou K, Wells W. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging. 2004;23(7):903–921. doi: 10.1109/TMI.2004.828354.
2. Akhondi-Asl A, Warfield S. Simultaneous truth and performance level estimation through fusion of probabilistic segmentations. IEEE Transactions on Medical Imaging. 2013 Oct;32(10):1840–1852. doi: 10.1109/TMI.2013.2266258.
3. Asman A, Landman B. Formulating spatially varying performance in the statistical fusion framework. IEEE Transactions on Medical Imaging. 2012;31(6):1326–1336. doi: 10.1109/TMI.2012.2190992.
4. Artaechevarria X, Munoz-Barrutia A. Combination strategies in multi-atlas image segmentation: Application to brain MR data. IEEE Transactions on Medical Imaging. 2009;28(8):1266–1277. doi: 10.1109/TMI.2009.2014372.
5. Sabuncu M, Yeo B, Van Leemput K, Fischl B, Golland P. A generative model for image segmentation based on label fusion. IEEE Transactions on Medical Imaging. 2010;29(10):1714–1729. doi: 10.1109/TMI.2010.2050897.
6. Wang H, Suh J, Das S, Pluta J, Craige C, Yushkevich P. Multi-atlas segmentation with joint label fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;35(3):611–623. doi: 10.1109/TPAMI.2012.143.
7. Gorthi S, Bach Cuadra M, Tercier P-A, Allal A, Thiran J-P. Weighted shape-based averaging with neighborhood prior model for multiple atlas fusion-based medical image segmentation. IEEE Signal Processing Letters. 2013 Nov;20(11):1034–1037.
8. Akhondi-Asl A, Hoyte L, Lockhart M, Warfield S. A logarithmic opinion pool based STAPLE algorithm for the fusion of segmentations with associated reliability weights. IEEE Transactions on Medical Imaging. 2014 Oct;33(10):1997–2009. doi: 10.1109/TMI.2014.2329603.
9. Gorthi S. Multi Atlas Fusion Methods for Medical Image Segmentation. PhD dissertation, EPFL; 2013.
10. Rohlfing T, Maurer CR Jr. Shape-based averaging. IEEE Transactions on Image Processing. 2007 Jan;16(1):153–161. doi: 10.1109/TIP.2006.884936.
11. Cardoso M, Leung K, Modat M, Barnes J, Ourselin S. Locally ranked STAPLE for template based segmentation propagation. Workshop on Multi-Atlas Labeling and Statistical Fusion. 2011.
12. Commowick O, Warfield SK. Incorporating priors on expert performance parameters for segmentation validation and label fusion: a Maximum A Posteriori STAPLE. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2010;6363:25–32. doi: 10.1007/978-3-642-15711-0_4.
13. Commowick O, Akhondi-Asl A, Warfield SK. Estimating a reference standard segmentation with spatially varying performance parameters: Local MAP STAPLE. IEEE Transactions on Medical Imaging. 2012;31(8):1593–1606. doi: 10.1109/TMI.2012.2197406.
14. Landman B, Asman A, Scoggins A, Bogovic J, Xing F, Prince J. Robust statistical fusion of image labels. IEEE Transactions on Medical Imaging. 2012 Feb;31(2):512–522. doi: 10.1109/TMI.2011.2172215.
15. Gorthi S, Akhondi-Asl A, Thiran J-P, Warfield S. Optimal MAP parameters estimation in STAPLE - learning from performance parameters versus image similarity information. Machine Learning in Medical Imaging. 2014;8679:174–181.
16. AbouRizk SM, Halpin DW, Wilson JR. Visual interactive fitting of beta distributions. Journal of Construction Engineering and Management. 1991;117(4):589–605.
17. Asman AJ, Landman BA. Characterizing spatially varying performance to improve multi-atlas multi-label segmentation. Information Processing in Medical Imaging. Springer; 2011. pp. 85–96.
18. Asman A, Landman B. Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (COLLATE). IEEE Transactions on Medical Imaging. 2011;30(10):1779–1794. doi: 10.1109/TMI.2011.2147795.
19. Klein A, Andersson J, Ardekani BA, Ashburner J, Avants B, Chiang MC, Christensen GE, Collins DL, Gee J, Hellier P, et al. Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. Neuroimage. 2009;46(3):786–802. doi: 10.1016/j.neuroimage.2008.12.037.
20. Vercauteren T, Pennec X, Perchant A, Ayache N. Non-parametric diffeomorphic image registration with the demons algorithm. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2007:319–326. doi: 10.1007/978-3-540-75759-7_39.
21. Vercauteren T, Pennec X, Perchant A, Ayache N. Diffeomorphic demons using ITK's finite difference solver hierarchy. The Insight Journal. 2007. Available: http://www.insight-journal.org/browse/publication/154.
