Abstract
Lesions or organ boundaries visible through medical imaging data are often ambiguous, resulting in significant variations in multi-reader delineations, i.e., the source of aleatoric uncertainty. In particular, quantifying the inter-observer variability of manual annotations with Magnetic Resonance (MR) imaging data plays a crucial role in establishing a reference standard for various diagnosis and treatment tasks. Most segmentation methods, however, simply model a mapping from an image to a single segmentation map and do not take the disagreement among annotators into consideration. To account for inter-observer variability without sacrificing accuracy, we propose a novel variational inference framework to model the distribution of plausible segmentation maps given a specific MR image, which explicitly represents the multi-reader variability. Specifically, we resort to a latent vector to encode the multi-reader variability and counteract the inherent information loss in the imaging data. We then apply a variational autoencoder network and optimize its evidence lower bound (ELBO) to efficiently approximate the distribution of the segmentation map given an MR image. Experimental results on the QUBIQ brain growth MRI segmentation dataset, annotated by seven raters, demonstrate the effectiveness of our approach.
1. INTRODUCTION
Segmentation has been a crucial prerequisite that provides the basis for various downstream diagnosis and treatment tasks.1 Lesions or organ boundaries visible through medical imaging data, however, are not perfectly identified in many cases because of imaging-related artifacts or lesion characteristics; therefore, multimodal imaging approaches are often used to accurately characterize lesions or organ boundaries.2,3 For example, different tumor tissues can be observed with different MRI modalities, and it is challenging to clearly delineate the anatomical and pathological structures with a single MR image.4 In practice, this ambiguity can result in large variations in manual delineations from different observers, thereby introducing aleatoric uncertainty into training.5
Typical segmentation solutions simply train a network to map an image to its segmentation label, without considering the variability across multiple labels. Relying on only a single most likely segmentation map, however, could lead to misdiagnosis and sub-optimal treatment.3 Therefore, quantifying the inter-observer variability of multiple annotators has been an important task in establishing a reference standard for clinical decision-making.4 With the distribution of possible segmentation maps, the underlying aleatoric uncertainty can be quantified in each voxel, and further diagnostic tests can be planned to resolve the ambiguities inherent in the segmentation result.4 In addition, clinicians can select appropriate segmentation maps from this distribution for subsequent tasks.
Recently, the QUBIQ challenge database was proposed to benchmark segmentation algorithms that return inter-observer variability estimates. The top two methods6,7 simply used the labels from different annotators to train neural networks independently, and averaged the predictions of these networks as the inter-observer variability measurement. With insufficient labeled training data, however, the epistemic uncertainty w.r.t. the model parameters can be significant,5,8,9 so the resulting uncertainty map can be a mixture of aleatoric and epistemic uncertainties.8,9 In addition, joint training with multiple annotations could potentially be used to improve segmentation accuracy. Furthermore, Bayesian dropout5 can be an alternative for measuring aleatoric uncertainty, but the simple Bernoulli prior in its dropout operation may limit its expressiveness.
In this work, we propose a novel variational inference framework to explicitly model the inter-observer variability, i.e., the posterior distribution of segmentation maps given an image. All of the labels are used to train a single network jointly, which can potentially improve accuracy and alleviate the epistemic uncertainty.
2. RELATED WORK
Segmentation has been a vital task in medical image analysis.10–15 Semantic segmentation performs pixel-wise classification and provides more explainable results for subsequent decision making.16,17 The conventional training of segmentation neural networks, however, uses a single label as the ground truth and does not consider the inter-observer variability.
The classical solution for modeling the variability of multiple labels is to use probabilistic graphical models; e.g., conditional random fields have been proposed to measure the joint probability distribution over labels,18 which relies on the maximum a posteriori estimate.19 Following the development of deep learning, conditional variational autoencoders have been utilized to model spatially correlated aleatoric uncertainty.20–22 Hu et al.23 extended this framework by regressing the uncertainty maps in a supervised manner. In this work, we propose a new perspective on variational inference along with a novel formulation of its ELBO.
3. METHODS
In the task of multi-reader delineation, each image x has several segmentation labels s from different annotators. To quantify the inter-observer variability, we need to model the conditional distribution p(s|x). Directly modeling p(s|x) with a forward model is intractable, since the mapping x → s can be inherently ambiguous and ill-posed due to the insufficient information in the imaging data. In this work, we aim to approximate p(s|x) w.r.t. the Kullback-Leibler (KL) divergence, which is equivalent to maximizing its log-likelihood log p(s|x).24 We resort to a latent vector z to encode the multi-reader variability and counteract the inherent information loss. With the latent variable z, we can derive an ELBO for log p(s|x), which can be efficiently estimated and optimized as follows:
log p(s|x) = KL[q(z|x,s) || p(z|x,s)] − log p(x) + E_{z∼q(z|x,s)}[log p(s|x,z)] + E_{z∼q(z|x,s)}[log p(x|z)] − KL[q(z|x,s) || p(z)],   (1)
where KL[q(z|x,s) || p(z|x,s)] and −log p(x) are the non-negative terms, and the last three terms can be regarded as the evidence lower bound. We note that the KL divergence is non-negative, and p(x) ∈ [0,1] implies −log p(x) ≥ 0, so the ELBO is a lower bound of log p(s|x). Maximizing the ELBO can therefore efficiently maximize log p(s|x), i.e., the segmentation network learns to approach the underlying distribution p(s|x).
Proof. We have the following decomposition to derive the ELBO of log p(s|x):
KL[q(z|x,s) || p(z|x,s)] = E_{z∼q(z|x,s)}[log q(z|x,s) − log p(z|x,s)] = E_{z∼q(z|x,s)}[log q(z|x,s) − log p(s|x,z) − log p(x|z) − log p(z)] + log p(x) + log p(s|x),   (2)
where KL[·] indicates the KL divergence. We note that, by Bayes' rule, we have p(z|x,s) = p(s|x,z)p(z|x)/p(s|x) and p(z|x) = p(x|z)p(z)/p(x). In addition, log p(s|x) and log p(x) are independent of the value of z ∼ q(z|x,s), so they can be taken out of the expectation; rearranging Eq. (2) then yields Eq. (1).
Based on the ELBO, we construct our variational inference framework as shown in Fig. 1. The encoder takes both x and s as input to model q(z|x,s) and outputs the latent vector z. In addition, we require z to follow a simple prior distribution p(z), typically a Gaussian. The decoder takes z as input to reconstruct x, which models the distribution p(x|z). The segmentation network takes both x and z as input, so the segmentation can differ w.r.t. different values of z.
Figure 1. Illustration of our variational inference framework for quantifying inter-observer variability, which consists of an encoder (Enc), a decoder for reconstruction (Dec), and a segmentation network (Seg). Observed variables are shaded in gray. Only the yellow shaded part is used in testing.
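To make the three components concrete, the following is a minimal PyTorch sketch of the encoder, decoder, and segmentation network. The layer widths, image size, and class names are illustrative assumptions, not the paper's implementation; in practice the segmentation backbone is the 2D U-Net noted in Sec. 4, and only the latent dimension N = 6 comes from the text.

```python
import torch
import torch.nn as nn

LATENT_DIM = 6  # dimension N of the latent vector z, as stated after Eq. (4)

class Encoder(nn.Module):
    """Models q(z|x,s): takes the image and one annotator's mask as input."""
    def __init__(self, in_ch=2, latent_dim=LATENT_DIM):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(32, latent_dim)       # µ head
        self.log_var = nn.Linear(32, latent_dim)  # log σ² head

    def forward(self, x, s):
        h = self.conv(torch.cat([x, s], dim=1))
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Models p(x|z): reconstructs the image from z alone."""
    def __init__(self, latent_dim=LATENT_DIM, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        self.fc = nn.Linear(latent_dim, 32 * (out_hw // 4) ** 2)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, z):
        h = self.fc(z).view(-1, 32, self.out_hw // 4, self.out_hw // 4)
        return self.deconv(h)

class Segmenter(nn.Module):
    """Models p(s|x,z): z is tiled spatially and concatenated with x.
    A plain conv stack stands in here for the 2D U-Net backbone of Sec. 4."""
    def __init__(self, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + latent_dim, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))  # one-channel logit map

    def forward(self, x, z):
        z_map = z[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.net(torch.cat([x, z_map], dim=1))
```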
Empirically, we maximize E_{z∼q(z|x,s)}[log p(s|x,z)] by minimizing the cross-entropy (CE) loss of the segmentation network. All of the samples are trained with the multiple labels. Maximizing E_{z∼q(z|x,s)}[log p(x|z)] can be achieved by minimizing the reconstruction loss, as in vanilla VAEs; we simply adopt the pixel-wise mean squared error (MSE). Suppose that x̂ is the reconstruction of x; the reconstruction loss can then be formulated as:
L_rec = (1/M) Σ_{i=1}^{M} (x_i − x̂_i)²,   (3)
where M is the number of pixels.
In addition, maximizing −KL[q(z|x,s) || p(z)] is equivalent to minimizing KL[q(z|x,s) || p(z)], which enforces the encoded latent vectors to align with the prior. We adopt the reparameterization trick to formulate the KL divergence. Similar to vanilla VAEs,25–27 the encoder has two output vectors, µ and σ. We then utilize the reparameterization z = µ + σ ⊙ ϵ, where ϵ ∼ N(0, I). The posterior distribution of z is thus N(µ, diag(σ²)). With the standard Gaussian prior p(z) = N(0, I), the KL divergence can be computed as
KL[q(z|x,s) || p(z)] = (1/2) Σ_{n=1}^{N} (µ_n² + σ_n² − log σ_n² − 1),   (4)
where N = 6 is the dimension of the latent vector z.
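A single training step can then combine the three ELBO terms. The sketch below is an assumed implementation: the CE term uses the binary cross-entropy suited to the one-channel brain-growth masks, the reconstruction term follows Eq. (3), and the KL term follows Eq. (4). The loss weights `beta` and `gamma` are hypothetical knobs not specified in the text, and each call pairs x with one annotator's mask s so that all labels are seen over training.

```python
import torch
import torch.nn.functional as F

def training_step(enc, dec, seg, x, s, beta=1.0, gamma=1.0):
    """One step of ELBO maximization for a batch of images x and masks s."""
    mu, log_var = enc(x, s)               # parameters of q(z|x,s)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)  # reparameterization: z = µ + σ⊙ϵ
    # CE term: -E_{z~q}[log p(s|x,z)] for the binary mask
    ce = F.binary_cross_entropy_with_logits(seg(x, z), s)
    # Reconstruction term (Eq. 3): pixel-wise MSE between x and x̂
    rec = F.mse_loss(dec(z), x)
    # KL term (Eq. 4): closed-form KL[N(µ, σ²) || N(0, I)] over the N latent dims
    kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1, dim=1).mean()
    return ce + gamma * rec + beta * kl   # minimizing this maximizes the ELBO
```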
In testing, only the segmentation network is used. We sample from the prior distribution p(z), i.e., the standard Gaussian, to obtain different z and generate different segmentation predictions. After training, the network models p(s|x,z), and z explicitly encodes the inter-observer variability.
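Test-time sampling therefore touches only the segmentation network. Below is a sketch assuming the modules defined earlier; the sample count is our own choice (seven mirrors the number of annotators and the evaluation protocol in Sec. 4).

```python
import torch

@torch.no_grad()
def predict_distribution(seg, x, n_samples=7):
    """Draw z ~ p(z) = N(0, I) and collect diverse segmentation predictions."""
    preds = []
    for _ in range(n_samples):
        z = torch.randn(x.shape[0], LATENT_DIM, device=x.device)  # sample the prior
        preds.append(torch.sigmoid(seg(x, z)))  # one plausible segmentation
    preds = torch.stack(preds)                  # [n_samples, B, 1, H, W]
    return preds, preds.mean(dim=0)             # samples and the confidence map
```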
4. RESULTS
The QUBIQ brain growth MRI segmentation database consists of 34 subjects for training and 5 subjects for validation. Each MRI slice has seven segmentation annotations. We note that the testing set is private, so the validation set is used for comparison. Given a single image x, we were able to output the distribution of s by sampling different z, rather than the deterministic output of conventional segmentation networks. We used the same 2D U-Net backbone for the segmentation network as in ESULR.6 The statistics of the brain-growth sub-task on QUBIQ are shown in Table 1.
Table 1. Dataset statistics of the brain growth task on QUBIQ.
| Sub-task | Training | Validation | Testing | Tasks | Channels |
|---|---|---|---|---|---|
| brain-growth | 34 | 5 | 10 | 1 | 1 |
Following the standard evaluation metrics on QUBIQ, we averaged seven sampled predictions into a confidence map and compared it with the average of the seven manual labels using the continuous Dice similarity score. We note that the averaged value of each pixel is in [0,1]. The confidence map can be regarded as the inverse of the aleatoric uncertainty. Fig. 2 depicts representative outputs for a validation sample and the averaged inter-observer variability results. The numerical results are shown in Table 2. Our framework outperformed the independent-training methods6,7 and Bayesian dropout5 for aleatoric uncertainty quantification.
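For reference, a minimal implementation of the soft (continuous) Dice between two averaged maps in [0,1] might look as follows. This is an illustrative sketch only; the official QUBIQ scoring may differ in details (e.g., it may also evaluate Dice at multiple binarization thresholds).

```python
import numpy as np

def continuous_dice(pred_mean: np.ndarray, label_mean: np.ndarray) -> float:
    """Soft Dice: 2*sum(p*g) / (sum(p) + sum(g)), with p, g in [0, 1]."""
    inter = (pred_mean * label_mean).sum()
    denom = pred_mean.sum() + label_mean.sum()
    return float(2.0 * inter / denom) if denom > 0 else 1.0
```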
Figure 2. Illustration of the predicted segmentation distribution and the comparison with the averaged manual labels. The averaged segmentation map is regarded as the ground truth for inter-observer variability on QUBIQ.
Table 2. Numerical comparisons of different methods. Mean ± standard deviation is reported over three training runs with random initialization.
In Figure 3, we show the uncertainty estimation on the testing set. We note that there is no publicly available ground truth for the testing data. The structures to be segmented vary significantly in shape, scale, and location, and it can be quite challenging to precisely delineate them with limited training data. We can see that our predictions align well with the anatomical shapes.
Figure 3. Illustration of MR slices and their corresponding predicted segmentation distributions in the QUBIQ testing set.
5. CONCLUSION
In this work, we proposed a novel variational inference framework to estimate the posterior distribution of multi-reader segmentation labels from a Bayesian perspective. We derived the ELBO of the conditional log-likelihood, which can be efficiently estimated and optimized, and designed the networks and loss functions to maximize this ELBO. After training, the segmentation network yields diverse segmentations from the sampled latent vectors, which inherit the multi-reader variability. All of the labels can be used to train the segmentation network jointly, instead of training independent networks. The quantified inter-observer variability, i.e., the averaged segmentation, aligns well with the manual average when evaluated on the QUBIQ database, which indicates that the aleatoric uncertainty can be accurately measured. In this work, we only explored the brain growth task of the QUBIQ challenge database, but the method can be readily generalized to other inter-observer variability quantification tasks, which we leave for future work.
ACKNOWLEDGMENTS
This work is partially supported by NIH R01DC018511, R01CA165221, and P41EB022544.
REFERENCES
- [1]. Liu X, Xing F, Fakhri GE, and Woo J, "Self-semantic contour adaptation for cross modality brain tumor segmentation," ISBI (2022).
- [2]. Marin T, Zhuo Y, Lahoud RM, Tian F, Ma X, Xing F, Moteabbed M, Liu X, Grogg K, Shusharina N, et al., "Deep learning-based GTV contouring modeling inter- and intra-observer variability in sarcomas," Radiotherapy and Oncology (2021).
- [3]. Joskowicz L, Cohen D, Caplan N, and Sosna J, "Inter-observer variability of manual contour delineation of structures in CT," European Radiology 29(3), 1391–1399 (2019).
- [4]. Becker AS, Chaitanya K, Schawkat K, Muehlematter UJ, Hötker AM, Konukoglu E, and Donati OF, "Variability of manual segmentation of the prostate in axial T2-weighted MRI: A multi-reader study," European Journal of Radiology 121, 108716 (2019).
- [5]. Liu X, Xing F, Stone M, Zhuo J, Reese T, Prince JL, El Fakhri G, and Woo J, "Generative self-training for cross-domain unsupervised tagged-to-cine MRI synthesis," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 138–148, Springer (2021).
- [6]. Ma J, "Estimating segmentation uncertainties like radiologists," MICCAI: QUBIQ (2020).
- [7]. Yang Y and Ma T, "Integrated segmentation with multiple annotations," MICCAI: QUBIQ (2020).
- [8]. Der Kiureghian A and Ditlevsen O, "Aleatory or epistemic? Does it matter?," Structural Safety 31(2), 105–112 (2009).
- [9]. Kendall A and Gal Y, "What uncertainties do we need in Bayesian deep learning for computer vision?," arXiv preprint arXiv:1703.04977 (2017).
- [10]. Liu X, Han Y, Bai S, Ge Y, Wang T, Han X, Li S, You J, and Lu J, "Importance-aware semantic segmentation in self-driving with discrete Wasserstein training," in [AAAI], 11629–11636 (2020).
- [11]. Liu X, Ji W, You J, Fakhri GE, and Woo J, "Severity-aware semantic segmentation with reinforced Wasserstein training," in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], 12566–12575 (2020).
- [12]. Liu X, Lu Y, Liu X, Bai S, Li S, and You J, "Wasserstein loss with alternative reinforcement learning for severity-aware semantic segmentation," IEEE Transactions on Intelligent Transportation Systems (2020).
- [13]. Liu X, Xing F, Gaggin HK, Wang W, Kuo C-CJ, Fakhri GE, and Woo J, "Segmentation of cardiac structures via successive subspace learning with Saab transform from cine MRI," arXiv preprint arXiv:2107.10718 (2021).
- [14]. Wang J, Liu X, Wang F, Zheng L, Gao F, Zhang H, Zhang X, Xie W, and Wang B, "Automated interpretation of congenital heart disease from multi-view echocardiograms," Medical Image Analysis 69, 101942 (2021).
- [15]. Liu X, Zhang Y, Liu X, Bai S, Li S, and You J, "Reinforced Wasserstein training for severity-aware semantic segmentation in autonomous driving," arXiv preprint arXiv:2008.04751 (2020).
- [16]. Liu X, Hu B, Liu X, Lu J, You J, and Kong L, "Energy-constrained self-training for unsupervised domain adaptation," in [2020 25th International Conference on Pattern Recognition (ICPR)], 7515–7520, IEEE (2021).
- [17]. Liu X, Xing F, El Fakhri G, and Woo J, "Adapting off-the-shelf source segmenter for target medical image segmentation," in [MICCAI] (2021).
- [18]. Blake A, Kohli P, and Rother C, [Markov Random Fields for Vision and Image Processing], MIT Press (2011).
- [19]. Monteiro M, Folgoc LL, de Castro DC, Pawlowski N, Marques B, Kamnitsas K, van der Wilk M, and Glocker B, "Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty," arXiv preprint arXiv:2006.06015 (2020).
- [20]. Kohl SA, Romera-Paredes B, Meyer C, De Fauw J, Ledsam JR, Maier-Hein KH, Eslami S, Rezende DJ, and Ronneberger O, "A probabilistic U-Net for segmentation of ambiguous images," arXiv preprint arXiv:1806.05034 (2018).
- [21]. Kohl SA, Romera-Paredes B, Maier-Hein KH, Rezende DJ, Eslami S, Kohli P, Zisserman A, and Ronneberger O, "A hierarchical probabilistic U-Net for modeling multi-scale ambiguities," arXiv preprint arXiv:1905.13077 (2019).
- [22]. Baumgartner CF, Tezcan KC, Chaitanya K, Hötker AM, Muehlematter UJ, Schawkat K, Becker AS, Donati O, and Konukoglu E, "PHiSeg: Capturing uncertainty in medical image segmentation," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 119–127, Springer (2019).
- [23]. Hu S, Worrall D, Knegt S, Veeling B, Huisman H, and Welling M, "Supervised uncertainty quantification for segmentation with multiple annotations," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 137–145, Springer (2019).
- [24]. Odaibo S, "Tutorial: Deriving the standard variational autoencoder (VAE) loss function," arXiv preprint arXiv:1907.08956 (2019).
- [25]. Kingma DP and Welling M, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114 (2013).
- [26]. Liu X, Che T, Lu Y, Yang C, Li S, and You J, "Auto3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation," European Conference on Computer Vision (2020).
- [27]. Liu X, Xing F, El Fakhri G, and Woo J, "A unified conditional disentanglement framework for multimodal brain MR image translation," in [2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)], 10–14, IEEE (2021).