. Author manuscript; available in PMC: 2025 Dec 8.
Published in final edited form as: IEEE J Biomed Health Inform. 2026 Feb;30(2):1555–1567. doi: 10.1109/JBHI.2025.3624331

CalDiff: calibrating uncertainty and assessing reliability of diffusion models for trustworthy lesion segmentation

Xinxin Wang 1,2, Mingrui Yang 3, Sercan Tosun 4, Kunio Nakamura 5,6, Shuo Li 7, Xiaojuan Li 8,9
PMCID: PMC12682437  NIHMSID: NIHMS2118326  PMID: 41129433

Abstract

Low reliability has consistently been a challenge in the application of deep learning models for high-risk decision-making scenarios. In medical image segmentation, multiple expert annotations can be consulted to reduce subjective bias and reach a consensus, thereby enhancing the segmentation accuracy and reliability. To develop a reliable lesion segmentation model, we propose CalDiff, a novel framework that can leverage the uncertainty from multiple annotations, capture real-world diagnostic variability and provide more informative predictions. To harness the superior generative ability of diffusion models, a dual step-wise and sequence-aware calibration mechanism is proposed on the basis of the sequential nature of diffusion models. We evaluate the calibrated model through a comprehensive quantitative and visual analysis, addressing the previously overlooked challenge of assessing uncertainty calibration and model reliability in scenarios with multiple annotations and multiple predictions. Experimental results on two lesion segmentation datasets demonstrate that CalDiff produces uncertainty maps that reflect low-confidence areas, further indicating the false predictions made by the model. By calibrating the uncertainty in the training phase, the uncertain areas produced by our model are closely correlated with the areas where the model makes errors during inference. In summary, the uncertainty captured by CalDiff can serve as a powerful indicator that helps mitigate the risks of adopting the model's outputs, allowing clinicians to prioritize reviewing areas or slices with higher uncertainty and enhancing the model's reliability and trustworthiness in clinical practice.

Index Terms: Diffusion Model, Lesion Segmentation, Multiple Annotations, Reliability, Uncertainty Calibration

I. Introduction

Reliability and trustworthiness of deep learning models are crucial for their real-world deployment, particularly in medical image segmentation [1]. Relying on a single expert annotation or a single model prediction can be risky, as overconfident yet incorrect segmentations may affect diagnosis-related features and lead to misdiagnoses [2]. It is suggested in [3], [4] that collective insights from multiple experts can be aggregated to improve patient care and safety and mitigate subjective bias. To enable automated segmentation, a reliable model could utilize multiple expert annotations and generate predictions with appropriate expressions of confidence, ensuring that high-confidence predictions align with greater accuracy, while areas of high uncertainty indicate potential errors. By providing uncertainty estimates, the model enhances interpretability, allowing users to better assess its predictions and ultimately improving trust in its outputs [5].

However, traditional deep learning models often fail to express uncertainty effectively and explicitly. For instance, given an image containing tumors, U-Net, as a deterministic model, produces a single segmentation prediction, which makes it difficult to identify areas of uncertainty and may result in suboptimal radiotherapy planning. Although a deterministic segmentation model can produce a softmax probabilistic map before generating a binary prediction mask, the softmax probabilities tend to be overconfident for most pixels [1]. In addition, the uncertainty obtained from a pixel-wise map can be spatially incoherent without correlated uncertainty modeling over the entire map [6]. Thus, it can be infeasible to generate high-quality uncertainty maps from a deterministic model for reliable clinical use.

To resolve the limited uncertainty quality of a pixel-wise probabilistic map from a deterministic model, generative probabilistic models have been applied to produce a set of complete segmentation masks instead of a single probabilistic map. The probabilistic U-Net combines a conditional Variational Autoencoder (c-VAE) with a U-Net, where multiple hypotheses can be obtained by sampling learnt distributions to indicate the uncertainty [7]. A Hierarchical Probabilistic U-Net was later proposed to enrich the limited sampling diversity [8]. However, c-VAE-based frameworks are restricted in modeling complexity by their pre-defined distributions. Compared with the c-VAE structure, diffusion models have superior expressive abilities, as shown across diverse generative tasks [9]. Recently, Collective Intelligent Medical Diffusion (CIMD) [10] applied diffusion models to model and control ambiguities in medical image segmentation through additional prior and posterior networks, showing the fine quality of segmentation masks predicted by diffusion models. Although diffusion models are capable of modeling uncertainty, existing research on calibrating uncertainty and measuring the quality of the calibrated uncertainty remains insufficient.

Calibrating uncertainty plays an important role in improving the reliability and trustworthiness of deep learning models by ensuring that a model’s predicted probabilities accurately reflect the true likelihood of an event. Previous studies on uncertainty calibration are mostly focused on image classification tasks [11], where a well-calibrated model should provide confidence scores that correspond to the actual frequency of correctness. However, uncertainty calibration for semantic segmentation has not been studied adequately. [13] proposes a local temperature scaling method by deploying a neural network to predict local temperature values for semantic segmentation calibration. [11] proposes a selective scaling method based on separating predictions for scaling based on correctness and focusing on misprediction logit smoothing.

To the best of our knowledge, uncertainty calibration has not been conducted when multiple annotations and predictions are available, which can be especially useful for avoiding overconfident yet incorrect results from either a single expert or a single prediction produced by a deterministic model. Therefore, the goal of this study is to develop a reliable lesion segmentation tool via diffusion models that leverage multiple annotations and predictions to generate high-quality segmentation results, accompanied by calibrated uncertainty maps indicating regions of low model confidence.

For the first time, our work proposes a complete pipeline for reliable lesion segmentation via uncertainty calibration in diffusion models. It initiates a new research direction in uncertainty calibration by addressing the previously unexplored scenario where both multiple annotations and multiple predictions are available. Our main contributions are as follows:

  • CalDiff is the first to tackle the previously unexplored challenge of improving and evaluating uncertainty calibration in medical image segmentation, specifically in scenarios involving both multiple annotations and multiple predictions. It presents a diffusion-based generative framework that implements, assesses, and leverages uncertainty calibration to enhance model reliability with interpretable quantification and visualization analysis provided for clinical use.

  • A novel dual calibration mechanism is designed, motivated by the sequential nature of diffusion models, to improve calibration performance implicitly and effectively. The step-wise uncertainty calibration module (Step-UC) enables fine-grained calibration at each denoising step by modeling and matching prior and posterior distributions across multiple scales; the sequence-aware uncertainty calibration module (Seq-UC) is deployed to reduce the accumulated errors of denoised masks between consecutive steps, enabling accurate uncertainty estimation and robust prediction.

  • A formalized and comprehensive evaluation methodology is developed for uncertainty calibration, extending beyond traditional segmentation metrics by incorporating a broader set of measures. We further complement quantitative evaluations with qualitative visualizations, providing an interpretable view of the model’s predictive uncertainty. This dual perspective highlights the potential of CalDiff for clinical integration, where both accuracy and interpretability are critical.

  • CalDiff, both as a whole and in individual modules, is empirically validated on two lesion segmentation datasets and demonstrates its ability to consistently improve the alignment between predicted probabilities and true likelihood—the key indicator of model reliability.

II. Literature Review

In this section, we will review previous works and describe the existing research gap that needs to be addressed in this study. Firstly, we will introduce the source and utilization of multiple annotations within non-probabilistic segmentation pipelines as shown in Fig. 1 and Fig. 2 (a). Secondly, we will discuss how to implement probabilistic segmentation pipelines to model the uncertainty in Fig. 2 (b), how previous works incorporate diffusion models for medical image segmentation, and how to measure the uncertainty calibration performance in Fig. 2 (c).

Fig. 1:


Examples showing the ambiguity of lesion segmentation in medical images. (a) and (b) are axial CT slices with lung nodules annotated by 4 radiologists from LIDC-IDRI dataset [15]; (c) presents 4 coronal MRI sequences from a subject with multiple sclerosis, annotated by 2 radiologists from MS-MRI dataset [16].

Fig. 2:


Overview of previous works for segmentation on multi-annotated datasets. (a) shows the deterministic and ensemble pipelines, which fail to fully utilize the multiple ground truths. (b) shows the probabilistic pipelines, highlighting different model architectures and the possible uncertainty calibration issues in the most advanced diffusion-model structure. One issue is that the diffusion model represents the probabilistic distribution implicitly, lacking explicit calibration to bring the learnt distributions, represented by multiple sampled results, closer to the real distributions, represented by multiple annotations. The other issue is that the model is misled by the training samples, whose input is a noised ground truth mask, whereas the input to the denoising U-Net in the actual predicting trajectory is a predicted mask denoised from the previous step. These issues hinder the model from refining the uncertainty accurately between consecutive steps and from predicting calibrated results robustly during inference. (c) presents two examples of reliability diagrams and ECE metrics generated from two sets of multiple predictions and multiple ground truths, demonstrating that these traditional metrics are neither effective nor applicable in this setting.

A. Source and Utilization of Multiple Annotations

Medical imaging is notorious for its ambiguity, because even the full image may not provide sufficient context for deciding whether it contains a lesion or where its exact borders lie. Multi-annotated datasets can be collected since different experts may disagree over the same image [17]. As Fig. 1 (a) shows, annotations from four radiologists have varying microlobulated shapes. This may be due to different interpretations of structures adjacent to the lesion, such as vascular trajectories, mucus-filled bronchioles or an irregularly angulated part of the original lesion. As for Fig. 1 (b), the annotators vary in the segmentation shape and even in whether they include this lesion at all. A few possibilities can explain the variation: the size may be too small to be counted as a lesion, or its ground-glass nature may lead it to be interpreted as a focal atelectasis or an atypical pneumonia instead of a lung nodule. The third example is multiple sclerosis (MS) lesion segmentation, shown in Fig. 1 (c). The nonidentical segmentations between the two annotators may result from differences in interpreting periventricular white matter hyperintensities, variations in defining lesion boundaries, classifying focal versus diffuse abnormalities, determining lesion sizes, or simply missing small lesions. Some raters may be more influenced by clinical history, while others may require distinct Dawson-finger-like shapes to confidently differentiate MS lesions from periventricular inflammation.

To deal with multiple annotations, majority voting is one of the most common ways to preprocess a multi-annotated dataset into a single-ground-truth dataset. For example, the LUNA16 dataset [18] was derived from LIDC-IDRI by merging all annotations through majority vote. SoftSeg [19] proposes merging multiple annotations into one soft ground truth as the training label. However, fusing all labels into a single one ignores the diverse interpretations of the image, which is neither flexible nor effective in fully utilizing the anatomical information provided by multiple annotations. It can be seen in Fig. 2 (a1) that only the softmax probabilistic map can be obtained for a given input, which provides pixel-wise probabilities with no effective uncertainty modeling. Training a multi-decoder U-Net [21] is the other straightforward way to utilize all ground truths. However, it can only produce a fixed number of predictions and an averaged probabilistic map, limited by the number of decoders, and it is rigid and incapable of learning from the variation between different ground truths.
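For concreteness, majority-vote and soft-label fusion of multi-rater binary masks can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are ours, and it is not the exact preprocessing used by LUNA16 or SoftSeg.

```python
import numpy as np

def majority_vote(annotations):
    """Fuse multiple binary annotation masks into one ground truth.

    annotations: array-like of shape (n_raters, H, W) with values in {0, 1};
    a pixel becomes foreground when more than half of the raters marked it.
    """
    votes = np.asarray(annotations).mean(axis=0)   # per-pixel fraction of raters
    return (votes > 0.5).astype(np.uint8)

def soft_ground_truth(annotations):
    """SoftSeg-style fusion: keep the per-pixel rater agreement as a soft label."""
    return np.asarray(annotations, dtype=float).mean(axis=0)
```

Both fusions collapse the rater distribution into a single target, which is precisely the information loss the probabilistic pipelines below try to avoid.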

B. Uncertainty Modeling and Calibration

In this section, we will first discuss probabilistic segmentation pipelines with a focus on model architectures and uncertainty modeling. Then we will delve into uncertainty calibration and its measurement.

1). Uncertainty modeling:

Several modeling techniques have been proposed to capture segmentation uncertainty using probabilistic pipelines. As shown in Fig. 2 (b1), a U-Net with dropout layers, termed Bayesian SegNet, is proposed in [20]. It features Monte Carlo sampling with dropout at test time to generate a posterior distribution over pixel class labels. C-VAE-based frameworks randomly use one annotation per iteration, with all annotations utilized over the whole training process, which activates the parameters of distributions modeled in the latent space [22]. An unlimited number of predictions can be generated by sampling from these distributions. However, both types of architectures can suffer from underestimation of uncertainty, mode collapse and limited expressiveness compared with diffusion models.

The diffusion model, as the most advanced probabilistic modeling method, has demonstrated its potential in modeling uncertainty and boosting medical image segmentation performance. As shown in Fig. 2 (b2), the image-guided diffusion model [32] can represent the segmentation distribution implicitly through a step-by-step denoising process. Most diffusion-based works ensemble multiple sampled predictions and focus on better generalization and higher accuracy: MedSegDiff [33] designs Dynamic Conditional Encoding and Feature Frequency Parser mechanisms to enhance regional attention and eliminate the negative effect of high-frequency noise components; MedSegDiff-v2 [34] proposes an SS-Former architecture to incorporate a vision transformer mechanism into the denoising U-Net backbone; HiDiff [35] synergizes the strengths of discriminative models and generative diffusion models to better generalize and segment small objects. Although these works are effective in elevating segmentation accuracy, the availability of multiple annotations has not been taken into consideration. BerDiff [36] and CIMD [10] work on multi-annotated datasets, highlighting the sampling diversity and salient regions of interest that indicate the uncertainty predicted by the model. However, uncertainty calibration and model reliability have not been defined and discussed in previous works.

2). Uncertainty calibration and its measurement:

The goal of uncertainty calibration is to adjust a model's predicted probabilities so that they accurately reflect the true likelihood of outcomes. There are three groups of methods for calibrating models in machine learning [23]: post-hoc methods, uncertainty estimation methods and regularized training methods. Post-processing is the most common kind: histogram binning is widely used as a simple non-parametric calibration method designed for binary classification; Platt scaling uses logistic regression to calibrate classification uncertainty and has been generalized into temperature scaling approaches [12]; Meta-Cal adds a calibration map to the last layer of networks by integrating a bipartite-ranking model with selective classification [24]; a selective scaling method separates predictions by correctness and focuses on misprediction logit smoothing [11]. In addition, some works implicitly improve calibration during training, such as Bayesian neural networks and deep ensembling [25], [26], which can be categorized as uncertainty estimation methods. However, uncertainty calibration in diffusion models remains understudied for semantic segmentation tasks, especially for lesion segmentation. As observed in Fig. 2 (b2), the diffusion model can suffer from two types of calibration issues, at the step-wise and the sequence-aware levels. At the step-wise level, each ground truth $GT_i$ is only utilized independently in different iterations, which can lead to a discrepancy between the real and predicted data distributions without explicit distribution modeling to capture the uncertainty. At the sequence-aware level, each denoising step in the actual predicting trajectory differs from how it is trained in the conventional way: each next step $x_{b,t-1}$ in the actual trajectory is predicted from the predicted $\hat{x}_{b,t}$, whereas during training the ground truth $x_{b,t}$ is provided for predicting the next-step output $x_{b,t-1}$. Conventional training can therefore cause the model to overfit on training samples whose input is generated from the ground truth mask, a situation that does not exist in the actual predicting trajectory. This introduces a miscalibration issue and poses challenges for accurate and reliable inference. Although [10], [39] and [40] examined these situations, neither issue has been explored from the perspective of uncertainty calibration for clinical tasks.

Furthermore, assessing the calibration of uncertainty is a key factor in enabling the entire workflow. Reliability diagrams measure model calibration by plotting expected sample accuracy as a function of confidence [27]. Expected Calibration Error (ECE) quantifies the calibration error by discretizing probability intervals into a fixed number of bins [28]. These calibration metrics suffer from problems such as fixed calibration ranges and the bias-variance tradeoff [29]. As shown in Fig. 2 (c), given a single ground truth and a probabilistic map, or a probabilistic map averaged from multiple predictions, we calculate the corresponding reliability diagrams and ECE, from which we observe high model confidence on each pixel and no significant difference between the two metrics across the two sets of predictions. It can be seen that the reliability diagram and ECE metrics cannot provide effective measurements of uncertainty calibration performance. Both uncertainty calibration and its measurement are understudied, yet required, when multiple annotations and multiple predictions are available for image segmentation.
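To make the saturation problem concrete, here is a minimal NumPy sketch of binned ECE for binary per-pixel probabilities (our own illustrative implementation, not the exact protocol of [28]). A uniformly overconfident map scores either near 0 or near 1 depending on correctness, leaving little resolution in between.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for per-pixel foreground probabilities vs. binary labels.

    probs, labels: flat sequences of equal length; probs in [0, 1], labels in {0, 1}.
    Confidence of a binary prediction is max(p, 1 - p); the prediction is
    correct when thresholding p at 0.5 matches the label.
    """
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    conf = np.maximum(probs, 1.0 - probs)
    correct = (probs > 0.5).astype(int) == labels
    # Binary-task confidence lives in [0.5, 1], so bin that range.
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if in_bin.any():
            # |accuracy - average confidence| weighted by the bin's mass
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)
```

With pixels predicted at probability 0.99, the ECE is 0.01 when the labels agree and 0.99 when they disagree, with nothing in between; this is the lack of discrimination discussed above.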

III. Methods

In this section, we propose the CalDiff pipeline, with a model architecture designed to address the uncertainty calibration issue and calibration measurement tools to address the model evaluation issue for scenarios with multiple ground truths and predictions. The overall workflow is presented in Fig. 3. Fig. 3 (a) illustrates the overall training, predicting and evaluation pipeline using the CalDiff model. Three denoising steps of the CalDiff model are demonstrated in Fig. 3 (b), highlighting our calibration designs at the step-wise level (b1) and the sequence-aware level (b2).

Fig. 3:


Overview of our methods. (a) illustrates the proposed CalDiff pipeline, which can utilize multiple ground truths, generate multiple predictions along with the uncertainty map and error map, and provide both qualitative and quantitative evaluations of uncertainty calibration; (b) presents our proposed CalDiff model architecture, which integrates the Step-wise Hierarchical Uncertainty Calibration module and the Sequence-aware Uncertainty Calibration module; (b1) highlights the step-wise calibration module, where we enforce the model to learn from the data distributions formed from multiple annotations. (b2) showcases the sequence-aware calibration module, where we obtain the denoised mask predicted from the previously predicted denoised mask, thus informing the model of the accumulated error across a series of consecutive steps, mitigating the risk of overfitting on the training pattern and generalizing well on the actual inference trajectory.

A. CalDiff Model Framework

The CalDiff model, as a type of diffusion model, comprises a forward noising process and a reverse denoising process [30]. The forward process gradually adds Gaussian noise to the data over $T$ timesteps. Given a data point $x_0$, the process can be defined as:

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$ (1)

where $I$ denotes an identity matrix and $\beta_t$ is the variance schedule that determines the amount of noise added at each timestep $t$.

The overall forward process from $x_0$ to $x_T$ can be written as:

$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ (2)

Denote $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. Using the reparameterization trick, we can sample $x_t$ at any timestep directly from $x_0$:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$ (3)

where $\epsilon \sim \mathcal{N}(0, I)$.
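The closed-form sampling of Eq. (3) is straightforward to implement; below is a minimal NumPy sketch with a linear variance schedule (the schedule endpoints are common illustrative defaults, not values from this paper):

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_t and cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 via Eq. (3), returning x_t and the noise eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return xt, eps
```

Since $\bar{\alpha}_t$ is a product of terms smaller than one, it decays monotonically, so later timesteps are dominated by noise.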

The reverse process aims to denoise the data, gradually removing the added noise to generate new samples. This process is defined as:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$ (4)

where $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance predicted by a neural network parameterized by $\theta$; here the network is a U-Net serving as the noise predictor. Given $x_t$, $x_{t-1}$ can then be computed as:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$ (5)

where $z \sim \mathcal{N}(0, I)$, introducing sampling stochasticity into the predictions.
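A single reverse step of Eq. (5) can be sketched as follows; this is a minimal NumPy illustration assuming the common choice $\sigma_t^2 = \beta_t$ (the paper does not state its variance choice), with the predicted noise passed in so the step stays network-agnostic:

```python
import numpy as np

def p_sample_step(xt, t, eps_pred, betas, alpha_bars, rng):
    """One reverse denoising step following Eq. (5).

    eps_pred is the noise predicted by the denoising network for (x_t, t);
    sigma_t^2 = beta_t is used, a common (assumed) choice for the variance.
    """
    alpha_t = 1.0 - betas[t]
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean                       # no noise is injected at the final step
    z = rng.standard_normal(xt.shape)     # z ~ N(0, I): source of sampling diversity
    return mean + np.sqrt(betas[t]) * z
```

The fresh draw of $z$ at every step is what makes repeated runs of the chain produce different samples from the learned distribution.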

The training objective is to learn the parameters $\theta$ such that the reverse process matches the forward process posteriors. The loss function can be derived as a variational lower bound (VLB) on the negative log-likelihood:

$L_{\mathrm{VLB}} = \mathbb{E}_q\!\left[ D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1)\right]$ (6)

It is found in [31] that a simplified version of $L_{\mathrm{VLB}}$ can be beneficial by ignoring the terms affecting $\Sigma_\theta$:

$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon_\theta(x_t, t) - \epsilon\|^2\big].$ (7)

It computes the mean squared error (MSE) between the predicted noise and the true added noise, which is denoted as $L_{\mathrm{MSE}}$ in our setting.

A hybrid training objective is then proposed in [31] to improve the log-likelihood by combining the VLB and its simplified variant:

$L_{\mathrm{Diff}} = L_{\mathrm{MSE}} + \lambda L_{\mathrm{VLB}}$ (8)

where the second term guides $\Sigma_\theta$ and $\lambda$ is set to 0.001 to avoid overwhelming $L_{\mathrm{MSE}}$.

To conduct the image segmentation task, we follow the implementation in [32]. Given an image $b$ with dimensions $(c, h, w)$ and its ground truth $x_b$, the forward process adds noise to $x_b$ over $T$ steps, while the reverse process denoises $x_{b,T}$ over $T$ steps to produce a segmentation prediction $\hat{x}_{b,0}$. During the denoising process, $b$ is concatenated with $x_{b,t}$ to provide information on anatomical structures. Thus, for each step $t$, $X_{b,t} = b \oplus x_{b,t}$.

Equation 3 is then adapted to:

$x_{b,t} = \sqrt{\bar{\alpha}_t}\, x_b + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$ (9)

where $\epsilon$ is sampled from $\mathcal{N}(0, I)$, and Equation 5 is then altered to:

$x_{b,t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_{b,t} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(X_{b,t}, t)\right) + \sigma_t z,$ (10)

where $z$ is sampled from $\mathcal{N}(0, I)$, resulting in different predictions.

Therefore, the segmentation uncertainty is implicitly integrated into the Markov chain of diffusion steps. When trained with a multi-annotated dataset, the learned mapping matches the distribution over different annotations. However, calibration of the learned uncertainty model is limited at both the step level and the sequence level. To overcome these limitations, the CalDiff framework is further integrated with the Step-wise Uncertainty Calibration (Step-UC) module and the Sequence-aware Uncertainty Calibration (Seq-UC) module.
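To illustrate how this sampling diversity arises, the reverse chain can be run several times per image. The following schematic NumPy sketch uses a hypothetical `denoise_fn` callback standing in for the trained denoising U-Net; real masks come from the trained model:

```python
import numpy as np

def sample_masks(image, denoise_fn, betas, alpha_bars, n_samples, rng):
    """Draw several segmentation hypotheses for one (2-D) image.

    denoise_fn(x_cat, t) -> predicted noise, where x_cat stacks the image b with
    the current noisy mask x_{b,t} along a channel axis (X_{b,t} = b ⊕ x_{b,t}).
    Repeated runs differ only through the injected Gaussian noise, which is how
    the model's uncertainty surfaces as mask-to-mask variation.
    """
    T = len(betas)
    h, w = image.shape
    masks = []
    for _ in range(n_samples):
        x = rng.standard_normal((h, w))                # x_{b,T} ~ N(0, I)
        for t in range(T - 1, -1, -1):
            x_cat = np.stack([image, x])               # condition on the image
            eps = denoise_fn(x_cat, t)
            alpha_t = 1.0 - betas[t]
            x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_t)
            if t > 0:                                  # inject fresh noise, sigma_t^2 = beta_t
                x = x + np.sqrt(betas[t]) * rng.standard_normal((h, w))
        masks.append((x > 0.5).astype(np.uint8))       # binarize the denoised mask
    return np.stack(masks)
```

The set of binarized masks is what the evaluation pipeline of Section III-D consumes as "multiple predictions."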

B. Step-wise Uncertainty Modeling and Calibration

At each time step of the reverse process, $\hat{x}_{b,t}$ is predicted by the denoising U-Net and compared with the ground truth $x_{b,t}$ through the term $L_{\mathrm{MSE}}$. The issue is that uncertainty is not explicitly modeled in this framework, making it challenging to calibrate properly.

Inspired by C-VAE models, a potential solution is to separately encode the predictions and ground truths into a set of pre-defined distributions and match the two sets of distributions, which can be seen as the prior distribution and the posterior distribution. This can be inserted as an independent module into the original framework in a post-processing manner. A previous state-of-the-art framework CIMD [10] proposes a posterior encoder and a prior encoder as the Ambiguity Modeling module and Ambiguity Controlling module, where the Kullback-Leibler divergence is computed between the two. The KL loss function is denoted as

$L_{\mathrm{KL}} = D_{\mathrm{KL}}\big(\mathcal{N}(\mu_{\mathrm{prior}}, \sigma_{\mathrm{prior}})\,\|\,\mathcal{N}(\mu_{\mathrm{post}}, \sigma_{\mathrm{post}})\big)$ (11)

However, a single $L_{\mathrm{KL}}$ loss term is restricted to the lowest resolution level, which has been shown to be insufficient in modeling complexity and flexibility [8], [37]. To calibrate the uncertainty with high fidelity, we propose a hierarchical step-wise uncertainty calibration module (Step-UC).

As shown in Fig. 3 (b1), the Step-UC module takes $x_{b,t}$, $\hat{x}_{b,t}$ and $b$ as input. It is a Laplacian-pyramid-like architecture composed of two identical structures, denoted the Uncertainty Modeling Net and the Uncertainty Calibrating Net, which encode the input into hierarchical latent spaces. Each Net is constructed from hierarchical convolutional and down-sampling layers, in which 1 × 1 convolutional layers are embedded to produce the parameters of axis-aligned normal distributions.

Given $L$ hierarchical levels, we denote the posterior distribution generated by the Modeling Net at the $l$-th level as:

$\mathcal{N}_{m,l} = \mathcal{N}\big(\mu(x_{b,t}, b; \phi_l),\ \sigma(x_{b,t}, b; \phi_l)\big)$ (12)

where $\mu$ and $\sigma$ are computed from the part of the Modeling Net parameterized by $\phi_l$. Their dimensionality is $2^{\,l+1-L}h \times 2^{\,l+1-L}w \times D$, where $D$ is a hyperparameter set to 6 in our network.

Similarly, the prior latent space is denoted as:

$\mathcal{N}_{c,l} = \mathcal{N}\big(\mu(\hat{x}_{b,t}, b; \psi_l),\ \sigma(\hat{x}_{b,t}, b; \psi_l)\big)$ (13)

where $\mu$ and $\sigma$ are computed from the part of the Calibrating Net parameterized by $\psi_l$.

To make the modeling and calibration effective for the denoising process, we propose a hierarchical loss function to match the predicted uncertainty with the real data distribution:

$L_{\mathrm{Hier}} = \sum_{l=0}^{L-1} e^{-l\ln(10)}\, D_{\mathrm{KL}}\big(\mathcal{N}(\mu(x_{b,t}, b; \phi_l), \sigma(x_{b,t}, b; \phi_l))\,\|\,\mathcal{N}(\mu(\hat{x}_{b,t}, b; \psi_l), \sigma(\hat{x}_{b,t}, b; \psi_l))\big)$ (14)

During training, $L_{\mathrm{Hier}}$ is computed for each sample $b$ and $x_b$ with a randomly selected step $t$ in the current batch. The loss gradient represents the divergence between the real data distribution and the predicted distribution. It is backpropagated into both the additional Nets and the denoising U-Net, as shown in Fig. 3 (b1). The Modeling Net and Calibrating Net are trained to filter features at different resolution levels and then model the latent space. The denoising U-Net is trained to produce segmentation masks more compatible with the real data distributions, since the loss gradient is reflected through $\hat{x}_{b,t}$, allowing the denoising U-Net to learn to predict results with calibrated uncertainty. Note that when multiple ground truths are available, the distribution over them is modeled through the Modeling Net, which in turn calibrates the predicted uncertainty through the Calibrating Net.
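For axis-aligned Gaussians, the hierarchical matching in Eq. (14) reduces to a weighted sum of closed-form KL divergences. Below is a minimal NumPy sketch, assuming the level weight $e^{-l\ln 10} = 10^{-l}$ and treating the per-level $(\mu, \sigma)$ tensors as given (in CalDiff they come from the Modeling and Calibrating Nets):

```python
import numpy as np

def kl_diag_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag sigma_p^2) || N(mu_q, diag sigma_q^2) ), summed over dims."""
    var_p, var_q = sigma_p ** 2, sigma_q ** 2
    return 0.5 * float(np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0))

def hierarchical_kl(posterior_params, prior_params):
    """Level-weighted sum of KL terms, weight e^{-l ln 10} = 10^{-l} per level l.

    posterior_params / prior_params: lists of (mu, sigma) arrays per level,
    posterior from the Modeling Net, prior from the Calibrating Net.
    """
    loss = 0.0
    for l, ((mu_m, s_m), (mu_c, s_c)) in enumerate(zip(posterior_params, prior_params)):
        loss += 10.0 ** (-l) * kl_diag_gauss(mu_m, s_m, mu_c, s_c)
    return loss
```

The loss vanishes exactly when the two sets of per-level distributions coincide, which is the calibrated state the module drives the denoising U-Net toward.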

C. Sequence-aware Uncertainty Calibration

At each iteration, the typical way of training is to denoise a given sample from a randomly selected time step $t$ to $t-1$. As shown in Fig. 3 (b), $x_{b,t} \to \hat{x}_{b,t-1}$ and $x_{b,t-1} \to \hat{x}_{b,t-2}$ are trained separately in different iterations. Nevertheless, during inference, every next step is based on the prediction made at the last step, except $x_{b,T} \to \hat{x}_{b,T-1}$, as discussed in Fig. 2 (b). Since a small error can be introduced at each denoising step and a large number of denoising steps are required to complete the generation, the errors may accumulate, introducing undesired uncertainty into the final prediction. The conventional training procedure is thus inconsistent with the inference process regarding the correlation between time steps. This evolves into a miscalibration issue, as the model can overfit on the provided training samples, which are not consistent with the true inference trajectory, further hindering the model from predicting uncertainty accurately and robustly.

An intuitive solution is to regularize model training by incorporating the sequential nature of diffusion models. Ideally, the estimation gap within a total of T steps could be formulated and minimized:

$d_\theta = \sum_{i=3}^{T} \frac{\sqrt{\bar{\alpha}_{i-1}}\sqrt{1-\bar{\alpha}_1}}{\sqrt{\alpha_1}\sqrt{1-\bar{\alpha}_{i-1}}}\, d_{\theta,i} + d_{\theta,2},$ (15)

which is theoretically proved in [40]. However, it is demanding to consecutively compute each next step based on the current prediction and backpropagate the loss gradient through the recursively called denoising U-Net. Thus, it is more efficient to optimize a short trajectory sliced from the global trajectory to minimize the whole estimation gap. The local gap term within $S$ steps is formulated as:

$d_{\theta,t}^{S} = \sum_{s=t}^{t+S-1} \tau_s\big(\epsilon_\theta(x_s, s) - \epsilon_s\big)$ (16)

where $\tau_s$ is defined as:

$\tau_s = \frac{\sqrt{\bar{\alpha}_{s-1}}\sqrt{1-\bar{\alpha}_1}}{\sqrt{\bar{\alpha}_1}\sqrt{1-\bar{\alpha}_{s-1}}} \cdot \frac{\sqrt{\bar{\alpha}_{s-1}}\,(1-\alpha_s)\sqrt{1-\bar{\alpha}_s}}{(1-\bar{\alpha}_s)\sqrt{\bar{\alpha}_s}}.$ (17)

The sequence-aware loss function is denoted $L_{\mathrm{SA}}$:

$L_{\mathrm{SA}} = \mathbb{E}_{t, x_0, \epsilon_{t:t+S-1}}\!\left[\frac{1}{S}\sum_{s=t}^{t+S-1}\tau_s\,\big\|\epsilon_\theta(x_s, s) - \epsilon_s\big\|^2\right]$ (18)

where $t \in \{1-S, \ldots, T\}$ and $\tau_s = 0$ for any $s \notin \{2, \ldots, T\}$.

The terms of subsequent steps are reweighted to decrease the impact of further steps.
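Under our reading of Eq. (18) as a per-step reweighted squared noise-prediction error averaged over $S$ consecutive steps, the loss can be sketched minimally in NumPy (the function name and array layout are illustrative, and the $\tau_s$ values are assumed precomputed from Eq. (17)):

```python
import numpy as np

def sequence_aware_loss(eps_pred, eps_true, taus):
    """Per-step reweighted squared noise-prediction error over S consecutive steps.

    eps_pred, eps_true: arrays of shape (S, ...) with the predicted and true noise
    at steps t .. t+S-1; taus: the S reweighting factors tau_s, which shrink the
    contribution of later steps.
    """
    S = len(taus)
    total = 0.0
    for s in range(S):
        total += taus[s] * float(np.sum((eps_pred[s] - eps_true[s]) ** 2))
    return total / S
```

In the full framework, the gradient of this term flows through the denoising U-Net evaluated at every step of the sliced trajectory, coupling the consecutive steps that conventional training treats independently.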

In all, the training procedure of our CalDiff model is summarized in Algorithm 1.

Algorithm 1.

CalDiff model training

Require: data distribution $q$, total number of timesteps $T$, noise predictor $\epsilon_\theta$, Modeling Net parameterized by $\phi$, Calibrating Net parameterized by $\psi$, number of consecutive steps $S$, hyperparameters $\gamma$ and $\kappa$, learning rate $\eta$
repeat
 $x_0 \sim q(x_0)$
 $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$
 for $s \in \{0, \ldots, S-1\}$ do
  $\epsilon_{t+s} \sim \mathcal{N}(0, I)$
  $x_{t+s} = \sqrt{\bar{\alpha}_{t+s}}\, x_0 + \sqrt{1-\bar{\alpha}_{t+s}}\, \epsilon_{t+s}$
 end for
 $L = L_{\mathrm{Diff}} + \gamma L_{\mathrm{Hier}} + \kappa L_{\mathrm{SA}}$
 $\phi \leftarrow \phi - \eta \nabla_\phi L_{\mathrm{Hier}}$
 $\psi \leftarrow \psi - \eta \nabla_\psi L_{\mathrm{Hier}}$
 $\theta \leftarrow \theta - \eta \nabla_\theta L$
until converged

D. Measurement of Uncertainty Calibration

Following the analysis in Section II-B.2, we propose to systematically quantify and visualize the uncertainty calibration performance with multiple segmentation ground truths and predictions incorporated. We demonstrated in Fig. 2 how infeasible it is to measure segmentation uncertainty with traditional metrics such as ECE and the reliability diagram, especially when there are multiple annotations and predictions. In previous works [11], a softmax prediction mask is obtained and compared with the single ground truth to compute the ECE and reliability diagram. However, we note that overconfidence can be a serious issue [41]. As shown in Fig. 2 (a) and (c), the two softmax outputs obtained from a well-trained diffusion model achieve very low ECE, and most of the pixels are highly confident about the label prediction, i.e., the pixel value is close to 0 or 1 for a binary segmentation task. Consequently, although our model may achieve a lower ECE than the diffusion model, the difference shown in the reliability diagram and the ECE metric will not be distinct. More importantly, multiple ground truths are not employed in the measurement.

We propose an alternative as shown in Fig. 3 (a3) to measure the segmentation uncertainty calibration when multiple annotations and predictions are available. Normalised Cross Correlation (NCC) is a generic metric measuring the similarity between two signals or images [38]. The definition is given by the following:

\mathrm{NCC}(f, g) = \frac{\sum_x \left(f(x) - \bar{f}\right)\left(g(x) - \bar{g}\right)}{\sqrt{\sum_x \left(f(x) - \bar{f}\right)^2} \sqrt{\sum_x \left(g(x) - \bar{g}\right)^2}} \quad (19)

where $f(x)$ and $g(x)$ are the values of the two signals or images at each point $x$, and $\bar{f}$ and $\bar{g}$ denote the mean values of $f$ and $g$, respectively.
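Since the means are subtracted, Eq. (19) coincides with the Pearson correlation of the flattened images; a minimal NumPy implementation:

```python
import numpy as np

def ncc(f, g):
    """Normalised cross correlation of two equally-sized signals, Eq. (19)."""
    f = np.asarray(f, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    fc, gc = f - f.mean(), g - g.mean()
    denom = np.sqrt((fc ** 2).sum()) * np.sqrt((gc ** 2).sum())
    return float((fc * gc).sum() / denom)
```

Values lie in [-1, 1], with 1 attained by images that are identical up to a positive affine rescaling.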

Although [37] proposes an NCC score S_NCC to quantify how well the model predicts regions with large uncertainty, it has not previously been interpreted as a measure of uncertainty calibration. It is defined as

S_{NCC}(p_{gt}, p_{pred}) = \mathbb{E}_{y \sim p_{gt}}\left[\mathrm{NCC}\left(\mathbb{E}_{\hat{y} \sim p_{pred}}\left[\mathrm{CE}(\bar{y}, \hat{y})\right],\; \mathbb{E}_{\hat{y} \sim p_{pred}}\left[\mathrm{CE}(y, \hat{y})\right]\right)\right] \quad (20)

where $p_{gt}$ is the ground truth distribution and $p_{pred}$ is the learned distribution. We find that S_NCC indeed measures the similarity between the uncertainty and the error produced by the model, thus quantifying how well the model is calibrated. The term $\mathbb{E}_{\hat{y} \sim p_{pred}}[\mathrm{CE}(\bar{y}, \hat{y})]$ is called the $\gamma$ map in [37]; it can be interpreted as the uncertainty produced by the model, computed as the expected cross entropy between each prediction and the averaged prediction. The term $\mathbb{E}_{\hat{y} \sim p_{pred}}[\mathrm{CE}(y, \hat{y})]$ computes the cross entropy between all predictions and one ground truth, which can be interpreted as the error of the model outputs. The cross correlation of the two terms is calculated for each ground truth, resulting in the S_NCC metric, which we propose to apply for quantifying uncertainty calibration performance.
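A minimal NumPy sketch of S_NCC for binary segmentation follows. The function names and the clipping constant are ours; the NCC of Eq. (19) is computed via `np.corrcoef`, which is equivalent after mean subtraction.

```python
import numpy as np

def _ce(p, q, eps=1e-7):
    """Pixel-wise binary cross entropy CE(p, q)."""
    q = np.clip(q, eps, 1.0 - eps)
    return -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

def s_ncc(gts, preds):
    """S_NCC of Eq. (20): mean over ground truths of the NCC between
    the gamma (uncertainty) map and the per-ground-truth error map."""
    preds = np.asarray(preds, dtype=float)          # (n, H, W) soft masks
    mean_pred = preds.mean(axis=0)                  # averaged prediction ȳ
    unc = np.mean([_ce(mean_pred, p) for p in preds], axis=0)
    scores = []
    for y in gts:
        y = np.asarray(y, dtype=float)
        err = np.mean([_ce(y, p) for p in preds], axis=0)
        # NCC with mean removal equals the Pearson correlation
        scores.append(np.corrcoef(unc.ravel(), err.ravel())[0, 1])
    return float(np.mean(scores))
```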

Visualization of uncertainty map and error map:

Given the concept of uncertainty calibration, a well-calibrated model should provide predicted uncertainty close to the error, indicating the wrongly predicted regions. For example, an area with low confidence in the prediction should correspond to the difference between the prediction and the ground truth. Accordingly, $\mathbb{E}_{\hat{y} \sim p_{pred}}[\mathrm{CE}(\bar{y}, \hat{y})]$ can be interpreted and visualized as the uncertainty map, since it is derived only from the predictions by comparing each prediction with the mean prediction, thus representing the model confidence. The error map can be obtained through $\mathbb{E}_{\hat{y} \sim p_{pred},\, y \sim p_{gt}}[\mathrm{CE}(y, \hat{y})]$, where the cross entropy of each pair of ground truth and prediction is evaluated, indicating the erroneous areas predicted by the model. The uncertainty map should indicate regions where the model is likely or unlikely to make errors, thus improving the trustworthiness of model outputs.

IV. Experiments and Results

A. Datasets

The Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) is a publicly available dataset in The Cancer Imaging Archive (TCIA) for lung cancer diagnosis [15]. It is a multi-annotated dataset consisting of 1018 thoracic CT scans from 1010 patients. Each scan was examined by four radiologists, and all nodules were manually delineated. In total, 2669 pulmonary nodules were annotated by at least one radiologist. We follow the preprocessing procedure in [37], extracting 2D slices centered around the lesion positions and cropping them into 128 × 128-pixel patches. Each slice can have up to three empty labels because the experts may disagree on whether a region is abnormal. The training set contains 13512 slices, the validation set 376 slices, and the test set 1208 slices; the three sets involve different patients.

The Multiple Sclerosis Lesion Segmentation (MS-MRI) [16] dataset includes 84 longitudinal MRI scans of five subjects, with an average of 4.4 time points per subject. Two radiologists annotated the white matter lesions associated with multiple sclerosis. The volumetric MRI data are converted into coronal-view slices of 64 × 64 pixels as in [10]. The dataset contains four sequence types: Proton Density, FLAIR, MP-RAGE, and T2 MRI. The four sequences are concatenated, so the input size to the model is (4, 64, 64). The training set contains 52 scans from 3 patients, the validation set 16 scans from 1 patient, and the test set 16 scans from the remaining patient.

B. Experimental Setting

The number of timesteps is set to T = 1000 with a linear noise schedule. The weights and biases of the U-Net denoising diffusion model are optimized using an Adam optimizer with a learning rate of 10^{-4}. $\lambda$ and $\gamma$ are set to 0.001 to regularize $\mathcal{L}_{vlb}$ and $\mathcal{L}_{Hier}$. $\kappa$ is set to 1.0, 0.2, and 0.2 for S = 2, 3, and 4, respectively. In the testing phase, 16 samples are produced for calculating the metrics.

C. Evaluation Metrics

1). Normalised Cross Correlation Score:

S_NCC measures how well the predicted uncertainty is calibrated by comparing the similarity between the uncertainty map and the error map. Its definition is given in Section III-D.

2). Maximum Dice Score:

The Dice score is one of the most important metrics in segmentation tasks and can be adapted to the setting of multiple ground truths and predictions. Assume there are $m$ annotations $y_i \sim p_{gt}$ and $n$ predictions $\hat{y}_i \sim p_{pred}$; then

D_{max} = \frac{1}{m} \sum_{i=1}^{m} \max(D_i) \quad (21)

where $D_i = \{\mathrm{Dice}(y_i, \hat{y}_1), \mathrm{Dice}(y_i, \hat{y}_2), \ldots, \mathrm{Dice}(y_i, \hat{y}_n)\}$.
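A short NumPy sketch of Eq. (21), assuming binary masks and a smoothed Dice to handle empty annotations:

```python
import numpy as np

def dice(a, b, eps=1e-7):
    """Smoothed Dice coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def max_dice(annotations, predictions):
    """D_max of Eq. (21): for each annotation, take the best Dice over
    all predictions, then average over the annotations."""
    return float(np.mean([max(dice(y, p) for p in predictions)
                          for y in annotations]))
```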

3). Collective Insights Score:

The CI score is proposed in [10] to quantify prediction quality using three terms. It is defined as

\mathrm{CI} = \frac{3 \times S_c \times D_{max} \times D_a}{S_c + D_{max} + D_a} \quad (22)

$S_c$ is the Combined Sensitivity; given the combined ground truths $y_c$ and combined predictions $\hat{y}_c$,

S_c(y_c, \hat{y}_c) = \frac{TP}{TP + FN} \quad (23)

$D_a$ denotes the Diversity Agreement, formulated as:

D_a = 1 - \frac{\Delta V_{max} + \Delta V_{min}}{2} \quad (24)

where the variance $V$ is defined as $1 - \mathrm{Dice}$, $\Delta V_{max} = \left|V_{max}(y) - V_{max}(\hat{y})\right|$ and $\Delta V_{min} = \left|V_{min}(y) - V_{min}(\hat{y})\right|$.
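The three terms can be combined as sketched below. This is illustrative only, after [10]: we assume the "combined" masks are pixel-wise unions and that $V_{max}$/$V_{min}$ are taken over the pairwise variances $V = 1 - \mathrm{Dice}$ within the ground-truth set and within the prediction set; readers should consult [10] for the exact protocol.

```python
import numpy as np

def dice(a, b, eps=1e-7):
    """Smoothed Dice coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def ci_score(gts, preds, eps=1e-7):
    """Collective Insight score, Eqs. (21)-(24) (sketch after [10])."""
    gts = [np.asarray(y, dtype=bool) for y in gts]
    preds = [np.asarray(p, dtype=bool) for p in preds]
    # Combined sensitivity S_c (Eq. 23) on the union masks
    yc, pc = np.logical_or.reduce(gts), np.logical_or.reduce(preds)
    tp = np.logical_and(yc, pc).sum()
    fn = np.logical_and(yc, ~pc).sum()
    sc = tp / (tp + fn + eps)
    # Maximum Dice D_max (Eq. 21)
    dmax = np.mean([max(dice(y, p) for p in preds) for y in gts])
    # Diversity agreement D_a (Eq. 24), V = 1 - Dice over mask pairs
    def var_extremes(masks):
        vs = [1.0 - dice(a, b) for i, a in enumerate(masks)
              for b in masks[i + 1:]]
        return (max(vs), min(vs)) if vs else (0.0, 0.0)
    vmax_g, vmin_g = var_extremes(gts)
    vmax_p, vmin_p = var_extremes(preds)
    da = 1.0 - (abs(vmax_g - vmax_p) + abs(vmin_g - vmin_p)) / 2.0
    # Harmonic-mean style combination, Eq. (22)
    return float(3.0 * sc * dmax * da / (sc + dmax + da + eps))
```

When the prediction set reproduces the annotation set exactly, all three terms approach 1 and so does the CI score.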

4). Generalized energy distance:

This metric evaluates how closely the learned distribution matches the distribution of ground-truth annotations. It is defined as:

D_{GED}^{2}(p_{gt}, p_{pred}) = 2\,\mathbb{E}_{y \sim p_{gt},\, \hat{y} \sim p_{pred}}\left[d(y, \hat{y})\right] - \mathbb{E}_{y_i, y_j \sim p_{gt}}\left[d(y_i, y_j)\right] - \mathbb{E}_{\hat{y}_i, \hat{y}_j \sim p_{pred}}\left[d(\hat{y}_i, \hat{y}_j)\right] \quad (25)

where $d(\cdot,\cdot) = 1 - \mathrm{IoU}(\cdot,\cdot)$. It is demonstrated in [10] that this metric is not valid for some cases compared with the CI score.
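With a finite set of samples from each distribution, the expectations in Eq. (25) become plain averages; a minimal NumPy sketch with a smoothed IoU:

```python
import numpy as np

def iou(a, b, eps=1e-7):
    """Smoothed intersection-over-union of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return (np.logical_and(a, b).sum() + eps) / (np.logical_or(a, b).sum() + eps)

def ged_squared(gts, preds):
    """D^2_GED of Eq. (25) with d(.,.) = 1 - IoU(.,.), estimated from
    finite sets of annotations and prediction samples."""
    d = lambda a, b: 1.0 - iou(a, b)
    cross = np.mean([d(y, p) for y in gts for p in preds])
    within_gt = np.mean([d(a, b) for a in gts for b in gts])
    within_pr = np.mean([d(a, b) for a in preds for b in preds])
    return float(2.0 * cross - within_gt - within_pr)
```

The score vanishes when the two sample sets coincide and grows as the prediction distribution drifts away from the annotation distribution.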

D. Quantitative Results

1). Comparative Experiments:

The quantitative results are listed in Table I. Two C-VAE-based models (Prob U-Net [7] and PHiSeg [37]) and two diffusion-based models (SegDiff [32] and CIMD [10]) are selected for comparison. For each of the four metrics, the values averaged over all testing samples are presented, together with the standard deviation.

TABLE I:

Comparison with previous methods showing CalDiff outperforms others in all metrics on two datasets. The bolded numbers in the table represent the best performance for each corresponding column. We use (↑) to indicate that higher values are better and (↓) to indicate that lower values are better.

Methods
LIDC-IDRI
MS-MRI
S_NCC↑ D_max↑ CI↑ D²_GED↓ S_NCC↑ D_max↑ CI↑ D²_GED↓

Prob U-Net 0.8382±0.09 0.8795±0.10 0.7147±0.21 0.2522±0.24 0.5028±0.19 0.4873±0.26 0.2869±0.20 0.6545±0.29
PHiSeg 0.8529±0.08 0.8595±0.11 0.7422±0.19 0.2432±0.21 0.5375±0.17 0.4997±0.26 0.3126±0.19 0.6374±0.28
SegDiff 0.8523±0.09 0.8815±0.10 0.7268±0.21 0.2546±0.24 0.6717±0.16 0.5141±0.22 0.3566±0.23 0.5988±0.33
CIMD 0.8688±0.08 0.8811±0.11 0.7955±0.19 0.2413±0.21 0.7360±0.10 0.5052±0.19 0.4064±0.20 0.5134±0.20
CalDiff (ours) 0.8857±0.07 0.9019±0.08 0.8171±0.16 0.2240±0.18 0.7602±0.11 0.5462±0.18 0.4279±0.19 0.4777±0.16

It can be seen that as the modeling capacity increases, the model performance improves, which is consistent with our analysis in Section II-B.1. Our model performs best among the state-of-the-art methods on every metric, with a generally lower variance indicating inference stability. The highest S_NCC value achieved by our CalDiff model demonstrates the positive effect of the proposed modules on calibration performance, since this metric measures how well the predicted uncertainty is aligned with the true performance. The highest D_max and CI scores obtained by CalDiff exhibit its superior performance in terms of Dice similarity and consistency with collective insights. In addition, the lowest D²_GED score showcases the reduced generalized energy distance between the distribution of predictions and the distribution of ground truths.

To assess the significance of the performance improvements, we compared CalDiff directly against the previous state-of-the-art model CIMD on both the LIDC-IDRI and MS-MRI datasets, focusing on the most indicative evaluation metric, S_NCC. We conducted statistical significance testing using the paired t-test. The resulting p-values are 2.39 × 10^{-10} for the LIDC-IDRI dataset and 6.97 × 10^{-6} for the MS-MRI dataset. Both values are well below the standard threshold of 0.05, indicating that the performance improvements achieved by CalDiff are statistically significant on both datasets.
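The testing procedure can be reproduced with SciPy's paired t-test on the per-sample S_NCC values of the two models. The values below are synthetic stand-ins for illustration, not the paper's data:

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic per-sample S_NCC scores: model A is consistently a little higher
rng = np.random.default_rng(0)
model_a = rng.normal(loc=0.88, scale=0.02, size=100)
model_b = model_a - rng.normal(loc=0.02, scale=0.005, size=100)

# Paired test: the same test samples are scored by both models
t_stat, p_value = ttest_rel(model_a, model_b)
```

`ttest_rel` pairs the scores sample-by-sample, which is the appropriate design when both models are evaluated on an identical test set.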

2). Ablation Study:

To observe the contribution of key components of our proposed CalDiff model, we conduct the ablation study by selectively removing or modifying these components. The quantitative results are shown in Table II.

TABLE II:

Ablation study of two proposed modules showing the necessity of both the Seq-UC and Step-UC modules.

Methods LIDC-IDRI MS-MRI
S_NCC↑ D_max↑ CI↑ D²_GED↓ S_NCC↑ D_max↑ CI↑ D²_GED↓

Step-UC only 0.8714±0.08 0.8911±0.10 0.8024±0.17 0.2391±0.19 0.7429±0.14 0.5538±0.19 0.3910±0.20 0.5079±0.22
Seq-UC only (S=2) 0.8799±0.07 0.8964±0.07 0.7878±0.17 0.2284±0.18 0.7433±0.11 0.5406±0.18 0.4214±0.20 0.4888±0.21
Seq-UC only (S=3) 0.8780±0.07 0.8909±0.07 0.7801±0.17 0.2385±0.21 0.7266±0.14 0.5326±0.21 0.3683±0.20 0.5493±0.26
Seq-UC only (S=4) 0.8794±0.08 0.8827±0.10 0.7942±0.18 0.2410±0.22 0.7099±0.12 0.5104±0.21 0.4020±0.21 0.5353±0.22
CalDiff (both) 0.8857±0.07 0.9019±0.08 0.8171±0.16 0.2240±0.18 0.7602±0.11 0.5462±0.18 0.4279±0.19 0.4777±0.16

The first row of Table II shows that removing the Seq-UC module reduces the overall performance compared with the complete CalDiff model, demonstrating the necessity of the Seq-UC module. It also exhibits the improved performance brought by Step-UC over previous state-of-the-art models, demonstrating that step-wise calibration boosts performance through the Uncertainty Modeling Net and Uncertainty Calibrating Net built at multiple resolution levels.

After removing the Step-UC module, the Seq-UC module is studied with three parameter settings: S = 2, 3, 4, corresponding to κ = 1.0, 0.2, 0.2. Among these configurations, Seq-UC (S = 2, κ = 1.0) achieves the best overall performance. With only the Seq-UC module enabled, the model consistently outperforms prior state-of-the-art approaches on the majority of evaluation metrics, suggesting improved uncertainty calibration by reducing the estimation gap between sequential steps. However, its performance remains inferior to that of the complete CalDiff model, highlighting the critical contribution of the Step-UC module.

3). Training Loss Observation:

By monitoring the training losses, we can gain insight into the convergence of the model throughout the training process. Since the full loss terms differ between models, we select the MSE loss term $\mathcal{L}_{MSE}$ for a fair comparison. As shown in Fig. 4, our CalDiff model converges to a lower mean squared error loss with reduced fluctuations during training. This signifies the enhanced training efficiency and stability facilitated by regularization applied at both the step-wise and sequence-aware levels, resulting in a better calibrated model.

Fig. 4:

Training loss curves. (a) is training on LIDC-IDRI dataset. (b) is training on MS-MRI dataset. The green loss curves are obtained in the proposed CalDiff model training, showing its fast and effective convergence.

E. Qualitative Analysis

The practical value of uncertainty in real clinical scenarios depends on how effectively the uncertainty can be communicated to radiologists for more reliable decision making. This motivates us to devise an intuitive method to present and evaluate the predicted uncertainty. The expectation is that the predicted uncertainty is aligned with the actual mistakes, i.e., regions with high uncertainty truly coincide with areas where the model is more likely to be incorrect. Therefore, the evaluation is conducted by assessing the consistency between the uncertainty and the error, which can be visualized as the uncertainty map and the error map. The two maps are defined as $\mathbb{E}_{\hat{y} \sim p_{pred}}[\mathrm{CE}(\bar{y}, \hat{y})]$ and $\mathbb{E}_{\hat{y} \sim p_{pred},\, y \sim p_{gt}}[\mathrm{CE}(y, \hat{y})]$, respectively, and are elaborated in Section III-D.

Following these definitions, the visual representations provide an intuitive understanding of both the model uncertainty and the accuracy of predictions in relation to the multiple ground truths. Fig. 5 presents four examples, with predictions from two diffusion-based methods included for comparison. Fig. 5 (a) is a lung CT slice that may contain a small nodule, which is challenging to detect due to its size. The SegDiff and CIMD models fail to capture the possibility of a nodule within the image, while CalDiff successfully locates the area of the possible nodule. Fig. 5 (b) is a lung CT slice with one empty annotation and three other annotations of various shapes and boundaries. Visualizing the uncertainty map and error map presents the uncertainty in a more intuitive way than the ECE metric and Reliability Diagram. The baseline models fail to highlight the boundary uncertainty and also produce a high degree of error, while CalDiff generates an uncertainty map with less error and more targeted uncertainty. Fig. 5 (c) shows brain MR slices of four sequences. Due to the limited amount of training data and the subtle nature of MS lesions, this is a challenging task in which uncertainty is even more important. Other methods tend to miss parts of lesions or over-predict lesion areas, as denoted by the yellow circles, whereas ours generates an uncertainty map highly correlated with the error map. For the case in Fig. 5 (d), other methods generate uncertainty maps that are inconsistent with the error map, where the difference is denoted using the yellow circle. In contrast, our CalDiff model identifies uncertain areas consistent with the error map structure, showing the improvement brought by orienting the model with calibrated uncertainty.

Fig. 5:

Four examples with multiple ground truths, multiple independent predictions, the uncertainty map calculated from the cross entropy between multiple predictions, and the error map obtained by averaging the cross entropy of each prediction and each ground truth. The color bar is attached to the bottom right corner. In the uncertainty maps, brighter colors represent higher entropy and indicate lower model confidence for a particular example, signifying a higher degree of uncertainty; in the error maps, brighter colors represent higher entropy and signify a more substantial discrepancy between the predictions and the ground truths.

V. Discussion

The quantitative and qualitative results provide compelling evidence of our CalDiff model’s effectiveness. In this section, we will further discuss the necessity of multiple annotations and predictions in achieving reliable segmentation and how to maximize its application value in clinical practice.

1). Necessity of Multiple Annotations and Predictions:

Lesion segmentation is inherently challenging due to inter-observer variability, particularly in complex cases where anatomical boundaries shown in medical images are ambiguous. While majority voting is commonly used to generate a single consensus segmentation from multiple expert annotations, this approach can obscure meaningful variations in expert opinions. A single aggregated segmentation may not fully capture the range of plausible interpretations, potentially leading to overconfident model predictions that do not reflect the inherent uncertainty in the data [41].

In contrast, training with multiple expert annotations as ground truths provides the model with a more comprehensive representation of ground truth, enabling it to learn from variations in expert assessments. Furthermore, evaluation with multiple annotations and multiple predictions from the model, rather than a single ground truth and a single deterministic output, is comprehensive and informative, allowing for the construction of an uncertainty map that highlights regions where the model exhibits low confidence and the comparison between the uncertainty map and the error map. Therefore, we believe that effectively incorporating multiple expert annotations and plausible predictions into the model’s training and evaluation pipeline is crucial for enhancing its reliability and trustworthiness.

2). Application Values:

The visual representations shown in Section IV-E not only exhibit the ability of our CalDiff model to identify subtle features that could indicate the presence of a lesion, but also demonstrate that the uncertainty, or confidence level, of our model over the segmentation predictions is a strong indicator of areas that may diverge from the ground truths.

When applied in real-world clinical scenarios, the uncertainty map can still be visualized from multiple predictions, even though ground truths do not exist and the error map is not attainable. Since the model's predicted uncertainty map is aligned with its actual performance, where low uncertainty indicates correct predictions and high uncertainty indicates possible false positives or false negatives, the model's outputs can be trusted with greater confidence. In this way, our model can serve as a robust tool for high-risk decision making related to lesion segmentation. The highlighted uncertain areas can guide clinicians to pay more attention to these critical regions, helping mitigate the risks associated with false predictions. Ultimately, the proposed CalDiff model with calibrated uncertainty modeling offers a pipeline with enhanced interpretability and trustworthiness. Additionally, it enables clinicians to prioritize reviewing the uncertain slices and regions closely, streamlining the clinical workflow and promoting an efficient use of expert time and resources [42].

To optimize its feasibility for real-world applications, we believe efficiency is also a key factor. We therefore investigate the model complexity and compare with other diffusion-based works. The number of trainable parameters (in millions) is 23, 23, 25, 26, and 26 for SegDiff [32], BerDiff [36], MedSegDiff [33], CIMD [10], and CalDiff, respectively; the number of floating point operations (in GFLOPs) is 1770 for MedSegDiff, which uses fewer denoising steps, and 2399 for the other frameworks. Our model thus consumes only slightly more memory but excels in calibration performance based on the reported results for lung nodule segmentation. The Calibrating Net and Modeling Net in CalDiff are lightweight compared to the denoising U-Net backbone and introduce no additional computational overhead during the testing phase. Although its efficiency is comparable with the basic diffusion model, acceleration remains to be explored. Therefore, one of our future directions is to speed up the sampling process to promote widespread adoption. Its potential can also be further unlocked by employing it in segmentation-based downstream tasks such as diagnostic systems [43].

VI. Conclusion

In this paper, we have developed a novel framework CalDiff for calibrating the diffusion model uncertainty for reliable and trustworthy lesion segmentation. To the best of our knowledge, a dual Step-UC and Seq-UC mechanism is proposed and incorporated for the first time through leveraging the sequential nature of diffusion models towards a more calibrated lesion segmentation system. Our approach includes the measurement and visualization of the model’s performance in a multi-annotation and multi-prediction setting. The experimental results have validated the effectiveness of our proposed CalDiff framework on two multi-annotated lesion segmentation datasets. Compared to previous methods, CalDiff produces more calibrated uncertainty maps, effectively highlighting areas where the model has low confidence; meanwhile, these uncertain regions are more closely correlated with actual problematic areas, offering clinicians a more interpretable and transparent view of the model’s predictions. Overall, our work offers valuable insights for advancing deep learning-based models as reliable tools in clinical practice, paving the way for their widespread applications in the future.

Acknowledgments

This work was supported by NIH/NIAMS R01AR075422, NIH/NIAMS K25AR078928 and the Arthritis Foundation.

Footnotes

Our code is publicly available at https://github.com/xinwxin/CalDiff.

Contributor Information

Xinxin Wang, Program of Advanced Musculoskeletal Imaging and Department of Biomedical Engineering, Cleveland Clinic, Cleveland, OH 44106 USA; Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106 USA.

Mingrui Yang, Program of Advanced Musculoskeletal Imaging and Department of Biomedical Engineering, Cleveland Clinic, Cleveland, OH 44106 USA.

Sercan Tosun, Program of Advanced Musculoskeletal Imaging and Department of Biomedical Engineering, Cleveland Clinic, Cleveland, OH 44106 USA.

Kunio Nakamura, Program of Advanced Musculoskeletal Imaging and Department of Biomedical Engineering, Cleveland Clinic, Cleveland, OH 44106 USA; Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106 USA.

Shuo Li, Department of Biomedical Engineering and Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106 USA.

Xiaojuan Li, Program of Advanced Musculoskeletal Imaging and Department of Biomedical Engineering, Cleveland Clinic, Cleveland, OH 44106 USA; Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106 USA.

REFERENCES

  • [1]. Mehrtash A, Wells WM, Tempany CM, Abolmaesumi P, and Kapur T, "Confidence calibration and predictive uncertainty estimation for deep medical image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 3868–3878, 2020, doi: 10.1109/TMI.2020.3006437.
  • [2]. Chen Z, Chen J, Zhao J, Liu B, Jiang S, Si D, Ding H, Nian Y, Yang X, and Xiao J, "What matters in radiological image segmentation? Effect of segmentation errors on the diagnostic related features," Journal of Digital Imaging, vol. 36, no. 5, pp. 2088–2099, Oct. 2023, doi: 10.1007/s10278-023-00865-2.
  • [3]. Radcliffe K, Lyson HC, Barr-Walker J, and Sarkar U, "Collective intelligence in medical decision-making: a systematic scoping review," BMC Medical Informatics and Decision Making, vol. 19, no. 1, p. 158, Aug. 2019, doi: 10.1186/s12911-019-0882-0.
  • [4]. Kurvers RH, Herzog SM, Hertwig R, Krause J, Carney PA, Bogart A, Argenziano G, Zalaudek I, and Wolf M, "Boosting medical diagnostics by pooling independent judgments," Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 31, pp. 8777–8782, Aug. 2016, doi: 10.1073/pnas.1601827113.
  • [5]. Zou K, Chen Z, Yuan X, Shen X, Wang M, and Fu H, "A review of uncertainty estimation and its application in medical imaging," Meta-Radiology, p. 100003, 2023.
  • [6]. Monteiro M, Le Folgoc L, Coelho de Castro D, Pawlowski N, Marques B, Kamnitsas K, van der Wilk M, and Glocker B, "Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty," Advances in Neural Information Processing Systems, vol. 33, pp. 12756–12767, 2020.
  • [7]. Kohl SAA, Romera-Paredes B, Meyer C, De Fauw J, Ledsam JR, Maier-Hein K, Eslami SMA, Rezende DJ, and Ronneberger O, "A probabilistic U-Net for segmentation of ambiguous images," arXiv preprint arXiv:1806.05034, 2018.
  • [8]. Kohl SAA, Romera-Paredes B, Maier-Hein K, Rezende DJ, Eslami SMA, Kohli P, Zisserman A, and Ronneberger O, "A hierarchical probabilistic U-Net for modeling multi-scale ambiguities," arXiv preprint arXiv:1905.13077, 2019.
  • [9]. Croitoru FA, Hondru V, Ionescu RT, and Shah M, "Diffusion models in vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023, doi: 10.1109/TPAMI.2023.3261988.
  • [10]. Rahman A, Valanarasu JMJ, Hacihaliloglu I, and Patel VM, "Ambiguous medical image segmentation using diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11536–11546.
  • [11]. Wang D, Gong B, and Wang L, "On calibrating semantic segmentation models: Analyses and an algorithm," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2023, pp. 23652–23662.
  • [12]. Guo C, Pleiss G, Sun Y, and Weinberger KQ, "On calibration of modern neural networks," in International Conference on Machine Learning, 2017, pp. 1321–1330, PMLR.
  • [13]. Ding Z, Han X, Liu P, and Niethammer M, "Local temperature scaling for probability calibration," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6889–6899.
  • [14]. Ronneberger O, Fischer P, and Brox T, "U-Net: Convolutional networks for biomedical image segmentation," arXiv preprint arXiv:1505.04597, 2015.
  • [15]. Armato SG, McLennan G, Bidaut LM, McNitt-Gray MF, Meyer CR, Reeves AP, Zhao B, Aberle DR, Henschke CI, Hoffman EA, Kazerooni EA, MacMahon H, Van Beeke EJR, Yankelevitz DF, Biancardi AM, Bland PH, Brown MS, Engelmann RM, Laderach GE, Max D, Pais RC, Qing DP, Roberts RY, Smith AR, Starkey A, Batrah P, Caligiuri P, Farooqi AO, Gladish GW, Jude CM, Munden R, Petkovska I, Quint LE, Schwartz LH, Sundaram B, Dodd LE, Fenimore C, Gur D, Petrick NA, Freymann JB, Kirby JS, Hughes B, Casteele AV, Gupte S, Sallamm M, Heath M, Kuhn M, Dharaiya E, Burns R, Fryd D, Salganicoff M, Anand V, Shreter U, Vastagh S, and Croft BY, "The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans," Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.
  • [16]. Carass A, Roy S, Jog A, Cuzzocreo JL, Magrath E, Gherman A, Button J, Nguyen J, Prados F, Sudre CH, et al., "Longitudinal multiple sclerosis lesion segmentation: resource and challenge," NeuroImage, vol. 148, pp. 77–102, 2017.
  • [17]. Joskowicz L, Cohen D, Caplan N, and Sosna J, "Inter-observer variability of manual contour delineation of structures in CT," European Radiology, vol. 29, pp. 1391–1399, 2018.
  • [18]. Setio AA, Traverso A, De Bel T, Berens MSN, Van Den Bogaard C, Cerello P, Chen H, Dou Q, Fantacci ME, Geurts B, et al., "Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge," Medical Image Analysis, vol. 42, pp. 1–13, 2017.
  • [19]. Gros C, Lemay A, and Cohen-Adad J, "SoftSeg: Advantages of soft versus binary training for image segmentation," Medical Image Analysis, vol. 71, p. 102038, 2021.
  • [20]. Kendall A, Badrinarayanan V, and Cipolla R, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," arXiv preprint arXiv:1511.02680, 2015.
  • [21]. Yang Y, Guo X, Pan Y, Shi P, Lv H, and Ma T, "Uncertainty quantification in medical image segmentation with multi-decoder U-Net," arXiv preprint arXiv:2109.07045, 2021.
  • [22]. Zhou X, Wang X, Ma H, Zhang J, Wang X, Bai X, Zhang L, Long J, Chen J, Le H, et al., "Customized T-time inner sampling network with uncertainty-aware data augmentation strategy for multi-annotated lesion segmentation," Computers in Biology and Medicine, vol. 180, p. 108990, 2024.
  • [23]. Gawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, Kruspe AM, Triebel R, Jung P, Roscher R, Shahzad M, Yang W, Bamler R, and Zhu XX, "A survey of uncertainty in deep neural networks," arXiv preprint arXiv:2107.03342, 2021.
  • [24]. Ma X and Blaschko MB, "Meta-Cal: Well-controlled post-hoc calibration by ranking," in International Conference on Machine Learning, 2021, pp. 7235–7245, PMLR.
  • [25]. Kendall A and Gal Y, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [26]. Lakshminarayanan B, Pritzel A, and Blundell C, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [27]. DeGroot MH and Fienberg SE, "The comparison and evaluation of forecasters," Journal of the Royal Statistical Society: Series D (The Statistician), vol. 32, no. 1–2, pp. 12–22, 1983.
  • [28]. Naeini MP, Cooper G, and Hauskrecht M, "Obtaining well calibrated probabilities using Bayesian binning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015.
  • [29]. Vaicenavicius J, Widmann D, Andersson C, Lindsten F, Roll J, and Schön T, "Evaluating model calibration in classification," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 3459–3467, PMLR.
  • [30]. Ho J, Jain A, and Abbeel P, "Denoising diffusion probabilistic models," in Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [31]. Nichol AQ and Dhariwal P, "Improved denoising diffusion probabilistic models," in International Conference on Machine Learning, 2021, pp. 8162–8171, PMLR.
  • [32]. Wolleb J, Sandkücher R, Bieder F, Valmaggia P, and Cattin PC, "Diffusion models for implicit image segmentation ensembles," in International Conference on Medical Imaging with Deep Learning, 2022, pp. 1336–1348, PMLR.
  • [33]. Wu J, et al., "MedSegDiff: Medical image segmentation with diffusion probabilistic model," in Medical Imaging with Deep Learning, 2024, PMLR.
  • [34]. Wu J, et al., "MedSegDiff-V2: Diffusion-based medical image segmentation with transformer," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024.
  • [35]. Chen T, et al., "HiDiff: Hybrid diffusion framework for medical image segmentation," IEEE Transactions on Medical Imaging, 2024.
  • [36]. Chen T, Wang C, and Shan H, "BerDiff: Conditional Bernoulli diffusion model for medical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Cham: Springer Nature Switzerland, 2023.
  • [37]. Baumgartner CF, Tezcan KC, Chaitanya K, Hötker AM, Muehlematter UJ, Schawkat K, Becker AS, Donati OF, and Konukoglu E, "PHiSeg: Capturing uncertainty in medical image segmentation," in Proceedings of MICCAI, 2019.
  • [38]. Zhao F, Huang Q, and Gao W, "Image matching by normalized cross-correlation," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, vol. 2, pp. II-729–II-732, 2006.
  • [39]. Fu Y, Li Y, Saeed SU, Clarkson MJ, and Hu Y, "A recycling training strategy for medical image segmentation with diffusion denoising models," arXiv preprint arXiv:2308.16355, 2023.
  • [40]. Nguyen V, Vu G, Nguyen TT, Than K, and Tran T, "On inference stability for diffusion models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, pp. 14449–14456, 2024.
  • [41]. Yeung M, Rundo L, Nan Y, Sala E, Schönlieb C-B, and Yang G, "Calibrating the dice loss to handle neural network overconfidence for biomedical image segmentation," Journal of Digital Imaging, vol. 36, no. 2, pp. 739–752, 2023.
  • [42]. Griebel M, Segebarth D, Stein N, Schukraft N, Tovote P, Blum R, and Flath CM, "Deep learning-enabled segmentation of ambiguous bioimages with deepflash2," Nature Communications, vol. 14, no. 1, p. 1679, 2023.
  • [43]. Ghoshal B, Tucker A, Sanghera B, and Wong WL, "Estimating uncertainty in deep learning for reporting confidence to clinicians in medical image segmentation and disease detection," Computational Intelligence, vol. 37, no. 2, pp. 701–734, 2021.
