Abstract
Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as Dice, are used to evaluate the performance of these models so that results are comparable.
However, there is a mismatch between the distributions of cases and the difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics used to assess performance fail to capture the impact of this mismatch, particularly when dealing with datasets in clinical settings that involve challenging segmentation tasks, pathologies with low signal, and reference annotations that are uncertain, small, or empty. Limitations of common metrics may result in ineffective machine learning research in designing and optimizing models. To effectively evaluate the clinical value of such models, it is essential to consider factors such as the uncertainty associated with reference annotations, the ability to accurately measure performance regardless of the size of the reference annotation volume, and the classification of cases where reference annotations are empty.
We study how uncertain, small, and empty reference annotations influence the value of metrics on a stroke in-house data set regardless of the model. We examine the behavior of metrics on the predictions of a standard deep learning framework in order to identify suitable metrics in such a setting. We compare our results to the BRATS 2019 and Spinal Cord public data sets. We show how uncertain, small, or empty reference annotations require a rethinking of the evaluation. The evaluation code was released to encourage further analysis of this topic: https://github.com/SophieOstmeier/UncertainSmallEmpty.git.
MSC: 41A05, 41A10, 65D05, 65D17
Keywords: Deep learning, Dice, Ischemic stroke, Medical image segmentation, Metrics
1. Introduction
The performance of machine learning algorithms is assessed by metrics. The optimal choice of metrics depends on the data set and the machine learning task to guarantee that the predictions accurately describe the intended phenomenon (Taha and Hanbury, 2015). Metrics can be used in two different ways: first, as the criterion that a model optimizes in the form of a loss function; second, as a way of validating and evaluating the performance of the model. This work focuses on the latter, referred to as performance metrics.
Performance metrics differ in their characteristics. The correlations between them determine the additional information revealed. Therefore, the appropriate selection of a performance metric for a specific task ensures consistency in model performance between development and deployment. For example, physicians who may use model predictions for treatment decisions rely on models that have been optimized and evaluated toward reliable and meaningful clinical information.
For data sets with uncertain (inter-expert variability), small (e.g. < 1% of organ), or empty reference annotations, common metrics may penalize or misinterpret clinically meaningful information. Prior studies have described the importance of quantifying uncertainty in the reference annotation (Mehta et al., 2022), the dependency of metric values on the segmentation size and degree of class imbalance (Taha and Hanbury, 2015; Liu et al., 2021; Commowick et al., 2018), the equal weighting of all regions of misplaced delineation independently of their distance from the surface (Nikolov et al., 2018), and the missing definition of metrics for empty reference annotations (Commowick et al., 2018; Maier-Hein et al., 2022).
The failure to describe uncertain, small, or empty segmentations may lead to irrelevant and misleading optimization and evaluation procedures in segmentation models.
Here, we determine how to implement clinically meaningful metrics for medical segmentation models with the UncertainSmallEmpty (USE)-Evaluator. We analyze the behavior of established metrics on benchmark deep learning models trained on four data sets with and without uncertain, small, and empty reference annotations (in-house and public).
1.1. Uncertain reference annotations
While experts’ annotations may exhibit variations in the volume and location of segmented objects, we assume that each expert possesses the highest level of human ability for the given task and, as a result, their judgments are considered equally valid (Jungo et al., 2018). Identifying a superior expert who can definitively determine the correctness among the experts would require someone with even higher human ability. However, the process of appointing such an overruling expert would necessitate another individual with even greater abilities to make this judgment, leading to an infinite loop. Consequently, in the context of our study, we assume the absence of an overruling expert.
For volume agreement, the reference annotation’s classification of a voxel can be true, and the segmentation of another expert or the prediction of the models can be false or vice versa. In practice, the spectrum ranges from a worst-case to a best-case scenario. In the best-case scenario, all false positives ($FP$) are truly true positives ($TP$). In the worst-case scenario, all $FP$ are truly $FP$. For example, in Fig. 1(a) the union annotation of an acute stroke from experts A, B, and C (blue, green, and red) is larger than the majority vote (green and red). Some blue voxels at the border of the segmentation might falsely or truly be part of the acute stroke ($FP$ or $TP$). Another example is shown in Fig. 1(c). The experts’ reference annotation is empty (first row). However, the prediction (second row) is not empty. Visual investigation shows an ambiguous lesion that was not segmented by the experts, making all predicted voxels $FP$ but maybe truly $TP$. The underlying low signal-to-noise ratio of stroke on Non-contrast Computed Tomography (NCCT) and the continuous transition from healthy to pathological brain tissue inevitably prevent a precise membership of these voxels.
Fig. 1(a).

Row 1: Uncertainty: Non-contrast Computed Tomography from an acute stroke patient within 16 h. Row 2: Segmentation of all experts. The segmentations of all experts do not completely overlap.
Fig. 1(c).

Empty reference annotations: Row 1, Segmentation of all experts is “empty”. Row 2, predicted voxel probabilities (softmax output values) of the models (low to high probability indicated by blue to red colors). All colored pixels may be “false positives”.
For location agreement, the distance between voxels from the reference annotation to another expert or prediction might be longer or shorter. For example, in Fig. 1(a) the surface voxels of the green segmentation will have a different distance than the blue surface voxels to the surface voxels of a predicted segmentation.
In the BRATS 2019 data set, we reproduce an underlying low signal-to-noise ratio and a more continuous transition by using the non-enhancing tumor segmentation on native T1 as reference annotation. We compare to a high signal-to-noise ratio setting with a more discrete transition by using the whole tumor segmentation on T1, T1 enhanced, T2-FLAIR, and T2 MR images and Spinal Cord white matter segmentation on T2 MR images.
We propose the Uncertainty score (U-score) as a measure that quantifies the uncertainty of reference annotations.
1.2. Small reference annotations
Depending on the clinical context, small reference annotations may be defined relative to the total size of the studied body region (e.g. less than 1%). For the brain, 1% is about 13 ml (Akeret et al., 2021). The distributions of reference annotation volumes vary across medical image data sets and segmentation tasks (Fig. 1(b)) (Bakas et al., 2018, 2017; Menze et al., 2015; Prados et al., 2017).
Fig. 1(b).

Small Volumes: Boxplot with volume distribution of all data sets.
We hypothesize that the distribution of reference annotation volumes influences the value of metrics independently from the model’s performance (Fig. 1(a)) (Maier-Hein et al., 2022). For example, an acute ischemic stroke patient with a suspected large vessel occlusion undergoes emergent imaging to quantify the extent of the irreversible brain injury. The stroke volume is often quite small (1–5 ml in volume (Powers et al., 2018)). Models may segment a 1–2 ml lesion volume that has poor overlap with the segmentation by a neuroradiologist and therefore receive a low performance metric value despite properly identifying the volume. A slight difference in volume location within the brain is highly unlikely to influence a physician’s decision to treat the patient.
We describe how the distribution of reference annotation volumes produces different metrics values, irrespective of the level of location and volume agreement between the model’s predictions and the annotations.
1.3. Empty reference annotations
Images with empty reference annotations are described as masks in which the object of interest could not be identified by the annotators. The object might have been invisible at the time of the segmentation (Fig. 1(c)).
Segmentation of an object within an image is a different task than a classification of an image. A classification task confirms the presence or absence of an object in the image (image-level), while a segmentation task assigns each voxel of the image to an object class (voxel-level) (Maier-Hein et al., 2022). An image-level classification task can also be formulated as a segmentation task by checking if the reference and predicted masks are empty. Therefore, when using a segmentation model in this way, it is important for the performance metrics to capture behavior on empty masks. However, some metrics for image segmentation return “NaN” or 0 if the model correctly predicts an empty mask (e.g. Dice, Specificity, Sensitivity, IoU).
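The behavior on empty masks can be illustrated with a minimal sketch. It assumes the standard Dice definition, 2·TP / (2·TP + FP + FN), and uses flat Python lists as toy masks; the helper names `dice` and `confusion` are hypothetical and are not taken from the USE-Evaluator code.

```python
import math

def dice(tp: int, fp: int, fn: int) -> float:
    """Dice = 2*TP / (2*TP + FP + FN); NaN when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator > 0 else math.nan

def confusion(reference, prediction):
    """Count voxel-level TP, FP, FN for two flat binary masks (lists of 0/1)."""
    pairs = list(zip(reference, prediction))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    return tp, fp, fn

empty = [0] * 8
one_false_positive = [0] * 7 + [1]

# Reference and prediction are both empty: the prediction is correct,
# but Dice is undefined (0 / 0).
both_empty = dice(*confusion(empty, empty))

# Reference is empty and a single voxel is predicted: Dice drops to its
# minimum no matter how small the predicted volume is.
spurious = dice(*confusion(empty, one_false_positive))
```

In both situations the voxel-level score carries no graded information, which is why an image-level evaluation is needed for such cases.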
For clinical deployment, images with empty reference annotation are possible and their presence is crucial information. The predictions of segmentation models need to be optimized and evaluated for correct image-level classification (Commowick et al., 2018). For example, it is possible that a stroke lesion in an early time window (0–4 h after symptom onset) has a very low signal and cannot be segmented on NCCT. In this case, the reference annotation and the predicted segmentation should both be empty and an image-classification metric should return the optimal value. No visible and no predicted lesion would result in a treatment decision in favor of endovascular therapy (Powers et al., 2018).
We explore a potential solution by setting a volumetric threshold tailored to each clinical context, below which voxel-wise agreement is expected to go beyond clinical relevance. Below the threshold, the agreement between the reference annotation and prediction is automatically evaluated by the USE-Evaluator as an image-level classification task (e.g. stroke present or absent in the image).
1.4. Clinical value
For a successful transition from challenge-winning to clinically translatable segmentation models, a focus on clinically meaningful optimization and performance metrics for each clinical context is crucial. Clinical value includes:
Robustness toward uncertainty in the reference annotation
Independence from the reference volume
Reward of volumetric and location agreement between the reference annotation and predictions
Evaluation of correct classification of empty reference annotations and predictions
2. Metrics
2.1. Fundamentals
A 3D image consists of a voxel grid with width $w$, height $h$, and depth $d$. We refer to the set of voxels as $X$ with $|X| = w \cdot h \cdot d$.
A segmentation mask is a grid with the same shape as the image. Pixels/voxels are assigned integer values indicating the semantic class (e.g. organ, pathology) they belong to. In the context of this publication, segmentation masks are either created manually by human experts or, with an automatic algorithm from an image.
A mask can be evaluated by the volume and location agreement of the segmented object. On a voxel level, the agreement between the reference mask, $R$, and the predicted mask, $P$, can be measured with (i) voxel class agreement or (ii) spatial distances between corresponding voxels.
For voxel class agreement, we use the assignment of voxels to classes in the reference mask as the true classes. The model’s classification for each voxel results in a predicted mask. Let $C$ be the set of classes. We note that $C$ completely partitions the mask. That is, every voxel is assigned to exactly one class: the class regions are pairwise disjoint and their union is the whole mask.
For a binary classification task, a confusion matrix of four cardinalities, namely true positives ($TP$), false positives ($FP$), false negatives ($FN$), and true negatives ($TN$), can be defined, where $TP + FP + FN + TN$ equals the total number of voxels (Table 2).
Table 2.
Voxel-level cardinalities for a binary classification task in which each voxel is assigned to one of the cardinalities depending on its mapping in the reference mask, $R$, and predicted mask, $P$.
The volume, $V_R$, of the target object in the reference mask and the volume, $V_P$, in the predicted mask are defined as
$$V_R = (TP + FN) \cdot v \tag{1}$$
$$V_P = (TP + FP) \cdot v \tag{2}$$
where $v$ refers to the physical volume of each voxel.
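As a small worked example of Eqs. (1) and (2), the sketch below converts voxel-level cardinalities into physical volumes. The function name `volumes_ml` and the counts are hypothetical. The voxel size reuses the spacing of (3.00, 0.45, 0.45) reported for the default model configuration and assumes that it is given in millimeters.

```python
def volumes_ml(tp: int, fp: int, fn: int, voxel_volume_mm3: float):
    """Reference and predicted volume in ml (1 ml = 1000 mm^3).

    Reference volume uses all reference-positive voxels (TP + FN),
    predicted volume uses all predicted-positive voxels (TP + FP).
    """
    v_ref = (tp + fn) * voxel_volume_mm3 / 1000.0
    v_pred = (tp + fp) * voxel_volume_mm3 / 1000.0
    return v_ref, v_pred

# Assumed spacing in mm; the product is the physical volume of one voxel.
voxel_volume_mm3 = 3.00 * 0.45 * 0.45

# Hypothetical voxel counts for a single case.
v_ref, v_pred = volumes_ml(tp=1500, fp=500, fn=1000,
                           voxel_volume_mm3=voxel_volume_mm3)

# Absolute volume difference between reference and prediction, in ml.
avd = abs(v_ref - v_pred)
```

With these numbers the reference volume is about 1.5 ml, which illustrates how few voxels separate a case from a 1 ml threshold.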
For distance agreement, the distance between a voxel $x$ and a set of surface voxels $S$ is defined as
$$d(x, S) = \min_{s \in S} \lVert x - s \rVert$$
For a binary segmentation task with two classes (background and target object), the set of voxels $S_R$ is defined as the surface voxels of the target object in the reference mask, and $S_P$ as the surface voxels of the target object in the predicted mask.
Metric definitions for common volume, overlap, and distance metrics, that were used for the experiments, can be found with their implementations on GitHub and in Table 1.
Table 1.
Definitions of performance metrics for medical image segmentation.
With threshold.
Subscript indicates the image-level cardinalities.
We note that the frequent class imbalance of 3D medical image segmentation (background voxels far outnumber target voxels) limits meaningful performance evaluation by any metric that includes $TN$ (Specificity, ROC, Accuracy, Kappa, etc.). Therefore, metrics that include $TN$ in their function should be avoided. Overlap metrics measure $TP$ relative to a combination of $FP$, $FN$, and $TN$. Overlap metrics that exclude $TN$ are Dice, Recall, and Precision. Volume metrics without $TN$ are VS (Volumetric Similarity) and AVD (Absolute Volume Difference).
We also note that the Jaccard index (Intersection over Union, IoU) and Dice are equivalent, and one can be derived from the other using the following formulas (Bertels et al., 2019).
$$Dice = \frac{2 \, IoU}{1 + IoU} \tag{3}$$
$$IoU = \frac{Dice}{2 - Dice} \tag{4}$$
The concrete choice for either one of these metrics depends on the user or community preference (Maier-Hein et al., 2022).
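The equivalence can be checked numerically. The sketch below uses the standard identities Dice = 2·IoU / (1 + IoU) and IoU = Dice / (2 − Dice); the function names and the counts are hypothetical.

```python
def iou_to_dice(iou: float) -> float:
    """Convert Jaccard index (IoU) to Dice."""
    return 2 * iou / (1 + iou)

def dice_to_iou(dice: float) -> float:
    """Convert Dice to Jaccard index (IoU)."""
    return dice / (2 - dice)

# Hypothetical voxel counts for one case.
tp, fp, fn = 3, 1, 2
dice_from_counts = 2 * tp / (2 * tp + fp + fn)  # 6 / 9
iou_from_counts = tp / (tp + fp + fn)           # 3 / 6

# Both conversions reproduce the value computed directly from the counts.
dice_converted = iou_to_dice(iou_from_counts)
iou_converted = dice_to_iou(dice_from_counts)
```

Because both functions are strictly increasing on [0, 1], ranking models by Dice or by IoU on a single case gives the same order.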
2.2. Surface dice at tolerance
Surface Dice is an evaluation metric introduced by Nikolov et al. (2018). It describes which portion of voxels on the surface of the target object in the predicted mask have the same spatial location as the surface voxels in the reference mask. For that, it classifies the surface voxels into $TP$, $FP$, and $FN$ depending on their distance to the closest surface voxel in the reference/predicted mask. The contribution of individual voxels to these terms is weighted by the estimated surface area that it represents. The tolerable distance $\tau$ from the surface at which a voxel still counts as a $TP$ establishes a set of border voxels $B_\tau$. $\tau$ is a variable that needs to be set according to the clinical context and (estimated) inter-rater variability for the given segmentation task.
The Surface Dice is computed for the target object class. A possible method to choose the tolerated distance is to compute the distance between different experts as an acceptable variability (e.g. ASSD, Table 1). This might lead to an optimization procedure with a realistic tolerance in which uncertainty within the voxel classification of the reference mask is expected and acceptable.
The Boundary IoU with a distance $d$ proposed by Cheng et al. (2021) can be converted to the Surface Dice at Tolerance with tolerated distance $\tau$ by using Eq. (3), where $d$ and $\tau$ are equivalent.
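The idea of a surface overlap with tolerance can be sketched on a toy 2D grid. This is a simplified, brute-force illustration and not the implementation of Nikolov et al. (2018) or of the USE-Evaluator: surface pixels are counted equally instead of being weighted by estimated surface area, and all function names are hypothetical.

```python
import math

def surface(mask):
    """Foreground pixels with a 4-neighbour that is background or off-grid."""
    rows, cols = len(mask), len(mask[0])
    points = []
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] != 1:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < rows and 0 <= nx < cols)
                if outside or mask[ny][nx] == 0:
                    points.append((y, x))
                    break
    return points

def min_distance(point, surface_points, spacing=(1.0, 1.0)):
    """Distance from one pixel to the closest pixel of a surface set."""
    return min(
        math.hypot((point[0] - s[0]) * spacing[0], (point[1] - s[1]) * spacing[1])
        for s in surface_points
    )

def surface_dice(reference, prediction, tolerance, spacing=(1.0, 1.0)):
    """Unweighted surface Dice: share of surface pixels within the tolerance."""
    s_ref, s_pred = surface(reference), surface(prediction)
    if not s_ref and not s_pred:
        return math.nan  # undefined for two empty masks
    if not s_ref or not s_pred:
        return 0.0
    ref_ok = sum(1 for p in s_ref if min_distance(p, s_pred, spacing) <= tolerance)
    pred_ok = sum(1 for p in s_pred if min_distance(p, s_ref, spacing) <= tolerance)
    return (ref_ok + pred_ok) / (len(s_ref) + len(s_pred))

# A 2x2 square and the same square shifted by one pixel to the right.
reference = [[0, 0, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 0, 0, 0]]
prediction = [[0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0]]

strict = surface_dice(reference, prediction, tolerance=0.0)
tolerant = surface_dice(reference, prediction, tolerance=1.0)
```

The one-pixel shift is penalized at zero tolerance but fully accepted once the tolerance covers the shift, which mirrors the role of the tolerated distance as accepted inter-expert variability.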
2.3. Uncertainty score
We develop a score to estimate the uncertainty across a set of experts across the set of cases. This score may be used as an indicator of the uncertainty in the data set. Our score is built around evaluating entropy, a measure of information contained in samples, on a target region of each image.
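We index cases using $i$. We index the reference masks by expert $e \in E$ as $R_{i,e}$ and the corresponding membership functions as $f_{i,e}$, where $f_{i,e}(x, c) = 1$ if expert $e$ assigns voxel $x$ of case $i$ to class $c$ and $0$ otherwise. We consider the case where $|E| > 1$.
Our score will require counting over experts, classes, and voxels. Let $n_i(x, c)$ be the function that returns the number of experts that put voxel $x$ of case $i$ in class $c$. Formally, $n_i(x, c) = \sum_{e \in E} f_{i,e}(x, c)$.
We compute the U-score over the set of voxels where at least one expert classified the voxel as positive. We denote this set as $X_i^{+}$.
For a case, we compute its U-score as the average, over voxels, of the expert annotation entropy of the voxel. Formally
$$U_i = \frac{1}{|X_i^{+}|} \sum_{x \in X_i^{+}} H_i(x) \tag{5}$$
With entropy computed as
$$H_i(x) = -\sum_{c \in C} \frac{n_i(x, c)}{|E|} \log_2 \frac{n_i(x, c)}{|E|}$$
where terms with $n_i(x, c) = 0$ contribute zero.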
We can compute the U-score of a dataset as the average U-score over cases.
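The U-score computation described above can be sketched for binary masks as follows. The sketch makes several assumptions that are not stated in the text: entropy uses the base-2 logarithm, masks are flat lists of 0/1 with one list per expert, and cases without any positive voxel are skipped when averaging over the data set. All function names are hypothetical.

```python
import math

def voxel_entropy(counts, n_experts):
    """Entropy of the expert votes for one voxel (base 2, assumed)."""
    entropy = 0.0
    for n in counts:
        if n > 0:  # classes without votes contribute zero
            p = n / n_experts
            entropy -= p * math.log2(p)
    return entropy

def u_score_case(expert_masks):
    """Mean vote entropy over voxels marked positive by at least one expert."""
    n_experts = len(expert_masks)
    entropies = []
    for votes in zip(*expert_masks):
        positives = sum(votes)
        if positives == 0:
            continue  # voxel lies outside the union of the expert annotations
        counts = (positives, n_experts - positives)
        entropies.append(voxel_entropy(counts, n_experts))
    return sum(entropies) / len(entropies) if entropies else math.nan

def u_score_dataset(cases):
    """Mean U-score over cases; cases with an empty union are skipped (assumed)."""
    scores = [u_score_case(case) for case in cases]
    scores = [s for s in scores if not math.isnan(s)]
    return sum(scores) / len(scores) if scores else math.nan

# Three hypothetical experts annotating a four-voxel image.
expert_a = [1, 1, 0, 0]
expert_b = [1, 0, 0, 0]
expert_c = [1, 1, 1, 0]
case_score = u_score_case([expert_a, expert_b, expert_c])

# Perfect agreement yields a score of zero.
agreement_score = u_score_case([expert_a, expert_a, expert_a])
```

In the example, one voxel is unanimous (entropy 0) and two voxels are split 2:1 or 1:2 among the experts, so the case score is two thirds of the entropy of a (1/3, 2/3) vote.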
2.4. Voxel-level class imbalances
The class imbalance ratio is commonly defined as the ratio between the cardinality of the majority class and the cardinality of the minority class (Zhu et al., 2020).
2.4.1. Class imbalances of segmentation
In the context of image segmentation, the class imbalance ratio is given by the ratio between the number of background voxels and the number of target object voxels in the reference mask.
An image segmentation task can be a perfectly balanced voxel-level binary classification problem (class imbalance ratio of 1). However, it is often the case that the target object is small relative to the image, that is, the target voxels are far outnumbered by the background voxels. This indicates high class imbalance. This can result in even very simple models achieving many true negatives with a low false positive rate (Table 2). Since the number of background voxels in medical images may vary (due to scanner settings and image processing), we aim to control the considered background voxels in a consistent way. We do so by restricting the region of interest to either an organ or the immediate body cavity, and only background voxels within this region of interest are counted. For example, for stroke and brain tumor this would be the brain, for the gray matter in the spinal cord this would be the total spinal cord. We then get
| (6) |
2.4.2. Image-level class imbalances
In the realm of image classification, we define an image-level class imbalance ratio. When considering reference and predicted volumes that fall below a clinically reasonable threshold (i.e. 1 ml for the NCCT and BRATS models), the significance of segmentation performance diminishes. Images that are correctly or incorrectly classified below this threshold are designated as image-level true negatives or false negatives, respectively. As a result, we derive the equation:
| (7) |
with an optimal value of 1. Here, the positive cases (i.e. patients with a stroke larger than 1 ml) serve as the majority class, while the negative cases (i.e. patients with a stroke smaller than 1 ml) serve as the minority class. Visual examples illustrating image-level true positives, true negatives, false positives, and false negatives can be observed in Fig. 2.
Fig. 2.

Example of true positive ($TP$), true negative ($TN$), false positive ($FP$), and false negative ($FN$) cases for the NCCT and BRATS data set for a threshold of 1 ml.
3. Methods
3.1. Data sets
To evaluate metrics for models trained on uncertain, small, or empty reference annotations we use several data sets (Table 3).
Table 3.
Data set properties.
| Data set | Target object | Multiple labels | USE¹ | Positive cases² | Negative cases³ |
|---|---|---|---|---|---|
| NCCT⁴ | ischemic core | ✓ | ✓ | ✓ | ✓ |
| BRATS 2019 | non-enhancing tumor | – | ✓ | ✓ | ✓ |
| BRATS 2019 | whole tumor | – | – | ✓ | – |
| Spinal Cord | gray matter | ✓ | – | ✓ | – |
¹ USE = uncertain, small and empty reference annotations.
² Cases with the target object present in the image.
³ Cases without the target object present in the image or with the target object below a volume threshold.
⁴ NCCT = Non-Contrast Computed Tomography.
A de-identified dataset of 200 NCCT images of patients with an acute ischemic stroke from the DEFUSE3 trial (Albers et al., 2018) was provided to three neuroradiologists with 4, 4, and 5 years of experience in neuroradiology (B.V., A.M., J.J.H.) (study design https://clinicaltrials.gov/ct2/show/NCT02586415). The experts were instructed to segment abnormal hypodensity on the NCCT that corresponds to acute ischemic brain injury. Detailed instructions and videos, as well as an oral explanation of the task, were given. Any missed lesions or missed slices were not corrected. The experts’ masks were fused by a majority vote to form the reference mask. In addition, 156 institutional NCCT images were added of patients who were scanned with suspicion of stroke but were confirmed on follow-up Diffusion-weighted MR imaging not to have a stroke.
The BRATS 2019 public data set was used to reproduce and compare results and included 345 MRIs of high and low-grade glioma patients (Bakas et al., 2018, 2017; Menze et al., 2015). One to four experts segmented the brain tumors followed by a consensus procedure. The reference masks had four target objects; “background”, “edema”, “non-enhancing”, and “enhancing”. We used this data set to train two segmentation tasks (i) with the target object “non-enhancing” tumor on only T1 and (ii) with the target object whole tumor, defined as the union over “edema”, “non-enhancing”, and “enhancing” target objects, on T1, T1 contrast-enhanced, T2-Flair and T2.
The Spinal cord data set is a public data set with 40 annotated MRIs of 40 healthy patients from 4 different hospitals and annotated by 4 experts per case. The annotations include the white and gray matter of the spinal cord on T2 (Prados et al., 2017). The experts’ masks were fused by a majority vote to form the reference mask.
3.2. Data partition
For each segmentation task, the cases were randomly divided into 5 folds that consisted of 80% training and 20% test examples. The default self-configured nnUNet was used to train all folds for each segmentation task. All analyses were done on the aggregated 5 test sets for each segmentation task (Supplemental material, Fig. 6).
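A random 5-fold partition of this kind can be sketched as follows. This is an illustration only and not the splitting routine used by nnUNet; the function name `five_fold_splits`, the seed, and the case identifiers are hypothetical.

```python
import random

def five_fold_splits(case_ids, n_folds=5, seed=0):
    """Randomly partition cases into folds; each fold serves once as test set."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    # Every n_folds-th case goes to the same fold, giving near-equal sizes.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        splits.append((train, test))
    return splits

# 40 hypothetical case identifiers, matching the size of the Spinal Cord data set.
splits = five_fold_splits(range(40))
aggregated_test_cases = sorted(c for _, test in splits for c in test)
```

Because every case appears in exactly one test fold, aggregating the five test sets yields one prediction per case for the whole data set.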
All models shared the same training schedule with 500 epochs, stochastic gradient descent with Nesterov momentum of 0.99, an initial learning rate of 0.01 with linear decay, and oversampling of 33% for the target lesion.
3.3. Models
3.3.1. Deep learning models
We chose the 3D full-resolution nnUNet as our deep learning framework (Isensee et al., 2021). For fairness and ease of comparability, we let all models undergo the same training schedule and did not modify hyperparameters.
The default configured model included a patch size of (1 × 28 × 512 × 512) and spacing of (3.00, 0.45, 0.45), Dice and Cross-Entropy loss function, seven stages, two 3D convolutions per stage and a leaky ReLU as activation function.
For the NCCT ischemic core segmentation task, the model input was the NCCT image (356 cases) to output a predicted mask for ischemic brain tissue.
For the first BRATS 2019 segmentation task, the model input encompasses only the T1 image (345 cases) to simulate a lower signal-to-noise ratio and the output was the predicted mask for the non-enhancing tumor. For the second BRATS 2019 segmentation task, the model input included all available MR sequences images (345 cases) to output the predicted mask of all tumor parts.
For the Spinal cord gray matter segmentation task (40 cases), the model input was a T2 image to output a predicted mask of gray matter.
3.3.2. Random model
To demonstrate that the dependence of the Dice on the target object size is independent of the model, we analyze the behavior of the Dice on a model that randomly labels voxels.
Our objective is to investigate the impact of target object size on the Dice score. As the class imbalance ratio increases, the Dice score tends to decrease, creating difficulties in comparing model performance across different levels of class imbalance. However, we aim to demonstrate that this relationship is also a general property not tied to a single task.
Consider a binary segmentation model that decides each voxel’s membership in the predicted mask randomly with a biased coin toss. We derive the expected Dice score for images drawn from this random model and show that the trend observed empirically matches the trend in this theoretical model (Supplemental material A.1.). It is cleaner to parameterize this random model using the expected portion of voxels that are positive, which can be directly computed from the class imbalance ratio.
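The trend can be explored with a small simulation. The sketch below is not the derivation from the supplemental material; it assumes that the coin is biased with the same probability as the fraction of positive voxels in the reference mask, and it compares the simulated Dice with the ratio-of-expectations approximation 2pq / (p + q), where p and q are the positive fractions of prediction and reference. All names are hypothetical.

```python
import random

def simulated_dice(n_voxels, positive_fraction, seed=0):
    """Dice of a coin-toss prediction against a reference of fixed size."""
    rng = random.Random(seed)
    n_reference_positive = round(n_voxels * positive_fraction)
    tp = fp = fn = 0
    for i in range(n_voxels):
        in_reference = i < n_reference_positive            # first voxels form the target
        in_prediction = rng.random() < positive_fraction   # biased coin toss
        if in_reference and in_prediction:
            tp += 1
        elif in_prediction:
            fp += 1
        elif in_reference:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")

def approximate_expected_dice(reference_fraction, prediction_fraction):
    """Ratio-of-expectations approximation: 2pq / (p + q)."""
    return (2 * reference_fraction * prediction_fraction
            / (reference_fraction + prediction_fraction))

# Smaller positive fractions correspond to higher class imbalance.
results = {f: simulated_dice(20000, f) for f in (0.5, 0.1, 0.01)}
```

Under these assumptions the approximation reduces to the positive fraction itself when p = q, so the Dice of a random model shrinks as the target object becomes smaller relative to the region of interest.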
3.4. Evaluation tool
All evaluations were performed using the USE-Evaluator inspired by Nikolov et al. (2018) and Isensee et al. (2021) (Table 1). The source code can be applied to folders with reference annotation and prediction masks in .nii.gz format and produces a .xlsx file with sheets for all studies, the means, medians, and image-level classification with bootstrapped 95% confidence intervals. A threshold flag can be set as a lower volume threshold for the segmentation and image-level classification evaluation. If the reference or predicted volume is below the threshold, a case is excluded from the segmentation evaluation but included as a negative case for the image-level classification evaluation.
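The thresholding rule described above can be sketched as a small routing function. This is an illustration of the logic as described in the text and not the USE-Evaluator source code; the function name `route_case` and the choice to treat a volume exactly equal to the threshold as positive are assumptions.

```python
def route_case(ref_volume_ml, pred_volume_ml, threshold_ml=1.0):
    """Assign an image-level label and decide on voxel-level evaluation.

    Assumption: volumes equal to the threshold count as positive.
    """
    ref_positive = ref_volume_ml >= threshold_ml
    pred_positive = pred_volume_ml >= threshold_ml
    if ref_positive and pred_positive:
        label = "TP"
    elif not ref_positive and not pred_positive:
        label = "TN"
    elif pred_positive:
        label = "FP"
    else:
        label = "FN"
    # Segmentation metrics are only computed when both volumes reach the threshold.
    return {"image_level": label,
            "segmentation_evaluated": ref_positive and pred_positive}

# Hypothetical (reference, predicted) volumes in ml.
examples = {
    "both empty": route_case(0.0, 0.0),
    "missed lesion": route_case(5.0, 0.4),
    "spurious lesion": route_case(0.3, 2.0),
    "detected lesion": route_case(6.0, 4.5),
}
```

With this routing, a correctly predicted empty mask counts as an image-level true negative instead of producing an undefined voxel-level score.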
3.5. Evaluation of reference annotations
We analyzed the variability among different experts’ annotation masks, available for the NCCT and the Spinal cord data set, with the evaluation tool described in Section 3.4. To estimate uncertainty, we compute the U-score (Eq. (5)), the median inter-expert agreement, and the median agreement with the majority vote (majority-expert) using the metrics presented in Table 1.
3.6. Evaluation of model performance
Performances were measured with the evaluation tool (Section 3.4) with a threshold of 1 ml for the NCCT and BRATS 2019 data sets. For other medical applications, this might depend on the clinical task the model is trained on. With the evaluator tool, this threshold can be easily changed. For the Spinal cord data set, we did not set a threshold, because the clinical concern in healthy populations would not be about the non-existence vs. existence of gray matter in the spinal cord.
3.7. Evaluation of metrics
We evaluate the segmentation metrics by their correlation to uncertainty among the experts’ masks, independence from reference volume, the reward of volumetric and location expert-model agreement, and evaluation of correct classification of empty reference masks or small reference volume cases. Correlations were visualized using the R package corrplot (Version 0.92).
To compute the voxel-level class imbalance ratio for the stylized model, we defined the regions of interest as the entire brain for the NCCT and BRATS datasets, and the entire spinal cord for the Spinal Cord dataset (Section 2.4.1). The BET_CT was used to extract the brain on NCCT according to Schell et al. (2019). For the extraction of the brain on MRI, the HD_BET was applied (Isensee et al., 2019). For extraction of the spinal cord, the union of the gray and white matter in the majority vote reference mask was used.
For the evaluation of empty reference and predicted masks, we explore possible image-classification metrics and their relationship to the image-level class imbalance ratio (Section 2.4.2).
4. Results and discussion
In this section, we will examine the relationship between metric values and varying prevalence of uncertain, small, or empty reference annotations.
In Section 4.1 we measure uncertainty in reference annotations. We conduct empirical validation of the U-score across data sets and its correlation with inter-expert variability and consensus among the majority of experts.
In Section 4.2 we analyze all models’ performances with each metric across data sets in order to provide a first indication of trends between dataset properties and metric values that we explore in further detail in the following section.
In Section 4.3 we use the correlation of metric values to provide empirical evidence of the link between the uncertain, small, and empty reference annotations and the metric values.
For the Dice metric, we demonstrate that the link is even more general by illustrating that the relationship found empirically is present in the evaluation of a stylized theoretical model (Section 4.3.2). Finally, we explore trends in image-classification metrics in Section 4.4.
Upon negative tests for normal distribution, results for each metric are shown as medians with 95% confidence interval (bootstrapped, 1000 repetitions), and the correlations are reported as Spearman’s rank correlation coefficient.
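A bootstrapped confidence interval for a median can be sketched as follows. The text specifies 1000 repetitions and a 95% interval; the percentile method, the seed, and the function name `bootstrap_median_ci` are assumptions, and the input values are hypothetical.

```python
import random
import statistics

def bootstrap_median_ci(values, n_repetitions=1000, confidence=0.95, seed=0):
    """Median with a percentile-bootstrap confidence interval (assumed method)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_repetitions)
    )
    alpha = (1 - confidence) / 2
    lower_index = max(0, round(alpha * n_repetitions))
    upper_index = min(n_repetitions - 1, round((1 - alpha) * n_repetitions) - 1)
    return statistics.median(values), medians[lower_index], medians[upper_index]

# Hypothetical per-case metric values.
values = [i / 50 for i in range(50)]
median, lower, upper = bootstrap_median_ci(values)
```

The interval is read directly from the sorted bootstrap medians, so it requires no normality assumption, which fits the use of medians after negative tests for normal distribution.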
4.1. Evaluation of reference annotations of experts
Variability in reference annotations can impact the model’s segmentation performance and solutions have been discussed (Karimi et al., 2020). However, we focus on a better choice of evaluation techniques to enhance the clinical applicability of segmentation models. In this regard, we first propose the introduction of the U-score as a measure of uncertainty for reference annotations (Section 1.1). We found that the overall median U-score differs significantly between the NCCT ischemic core and the Spinal cord gray matter segmentation task (0.87 ± 0.05 vs. 0.39 ± 0.02, respectively). These findings are consistent with common measures such as inter-expert and majority-expert agreement (supplemental material, Table 7) (Yang et al., 2023). Inter-expert and majority-expert agreements use pairwise expert comparison and rely on common segmentation metrics to indirectly estimate uncertainty in reference annotations. The U-score directly measures uncertainty.
We found varying distributions of reference volumes across the studied data sets (median (IQR) volume 6 (2–21) ml, 10 (4–25) ml, 89 (48–146) ml and 0.7 (0.3–1.1) ml for the NCCT, BRATS 2019 non-enhancing tumor part, BRATS 2019 whole tumor and Spinal Cord gray matter segmentation tasks, respectively).
We further conduct correlation analyses of the U-score and reference volumes with common metrics, outlined in Section 4.3.
4.2. Evaluation of segmentation and image classification performance
The performance of the NCCT ischemic core and BRATS 2019 non-enhancing tumor models, trained using uncertain, small, and empty reference annotations, shows similar results across volume, overlap, and distance metrics. However, the BRATS 2019 whole tumor and Spinal cord model, trained on larger and more certain reference annotations, consistently outperforms both the NCCT and BRATS 2019 non-enhancing tumor model (Table 4).
Table 4.
Results of segmentation task performance¹.
| Category | Metric² | NCCT ischemic core | BRATS 2019 non-enhancing tumor | BRATS 2019 whole tumor | Spinal Cord gray matter |
|---|---|---|---|---|---|
| Volume | VS | 0.58 ± 0.09 | 0.78 ± 0.03 | 0.97 ± 0 | 0.92 ± 0.02 |
| | AVD | 4.48 ± 1.17 | 4.41 ± 0.95 | 4.95 ± 0.85 | 0.08 ± 0.04 |
| Overlap | Dice | 0.56 ± 0.04 | 0.60 ± 0.04 | 0.93 ± 0.01 | 0.83 ± 0.02 |
| | Precision | 0.69 ± 0.11 | 0.66 ± 0.06 | 0.94 ± 0.01 | 0.89 ± 0.03 |
| | Recall | 0.20 ± 0.08 | 0.58 ± 0.05 | 0.93 ± 0.01 | 0.79 ± 0.03 |
| Distance | ASSD | 2.34 ± 0.41 | 2.50 ± 0.15 | 0.94 ± 0.07 | 0.12 ± 0.02 |
| | HD 95 | 8.30 ± 1.51 | 8.00 ± 0.55 | 2.87 ± 0.3 | 0.50 ± 0.05 |
| | SDT small³ | 0.61 ± 0.05 | 0.56 ± 0.03 | 0.93 ± 0.01 | 0.84 ± 0.08 |
| | SDT large⁴ | 0.86 ± 0.03 | 0.85 ± 0.02 | 0.99 ± 0 | 0.84 ± 0.08 |
¹ Median ± 95% Confidence Interval (bootstrapped).
² VS = Volumetric Similarity, AVD = Absolute Volume Difference, ASSD = Average Symmetric Surface Distance, HD 95 = Hausdorff Distance 95th percentile, SDT = Surface Dice at Tolerance.
³ Surface Dice at Tolerance with 2 mm for NCCT and BRATS 2019 models and 0.05 mm for the Spinal Cord model.
⁴ Surface Dice at Tolerance with 5 mm for NCCT and BRATS 2019 models and 0.1 mm for the Spinal Cord model.
For image classification, the total number of cases with reference volumes < 1 ml is 192 cases for the NCCT ischemic core segmentation task, 36 cases for the BRATS 2019 non-enhancing tumor part segmentation task and 0 cases for the BRATS 2019 whole tumor. We visualize these class distributions and report the confusion matrix (Fig. 3) (Maier-Hein et al., 2022). Reference volumes > 1 ml cluster around the identity line, whereas reference volumes < 1 ml are more spread. Sensitivity, F1-score and ACC of the NCCT model are lower compared to the BRATS non-enhancing tumor model. However, the AUC and Specificity are higher for the NCCT model (Table 5). Further analyses of the relationship between data set properties and common image classification metrics are summarized in Section 4.4.
Fig. 3.

Scatter plot with log scale and confusion matrix with a volume threshold of 1 ml that divides the cases into true positives, false positives, false negatives, and true negatives. For the NCCT data set (violet points), almost all incorrectly classified cases are predicted too small, namely false negatives, whereas for the BRATS non-enhancing tumor data set the opposite is the case. None of the BRATS whole tumor cases are incorrectly classified.
Table 5.
Results of image-classification task¹.
| Categories | Metrics² | NCCT ischemic core | BRATS 2019 non-enhancing tumor | BRATS 2019 whole tumor | Spinal Cord gray matter⁴ |
|---|---|---|---|---|---|
| Class imbalance | | 0.46 | 0.89 | 1.00 | - |
| Image-level | Sensitivity | 0.67 ± 0.04 | 0.93 ± 0.01 | 1.00 ± 0 | - |
| | Specificity | 0.98 ± 0.01 | 0.53 ± 0.09 | - | - |
| | F1-score | 0.79 ± 0.03 | 0.94 ± 0.01 | 1.00 ± 0 | - |
| | ACC | 0.84 ± 0.02 | 0.89 ± 0.02 | 1.00 ± 0 | - |
| | AUC | 0.91 ± 0.02 | 0.86 ± 0.03 | - | - |
¹ Median ± 95% Confidence Interval (bootstrapped).
² 1 ml threshold.
³ ACC = Accuracy, AUC = Area under the Curve.
⁴ Healthy cohort, no threshold for pathology set.
4.3. Evaluation of segmentation metrics
We use the relationship between metrics and data set properties to identify evaluation strategies that are robust to the presence of uncertain, small, or empty reference annotations. These recommendations are backed by empirical data (Fig. 5), and we provide an intuition of how each metric's formula produces the observed effect (Table 1).
Fig. 5.

Correlation matrices of Spearman coefficients for data sets and metrics. X indicates correlations that are not statistically significant. Overall correlation patterns among metrics (e.g., Dice and SDT) remain similar across the data sets. The correlation between Dice and uncertainty, as well as the reference volume, is reproducible in all data sets, albeit to varying degrees.
We categorize the segmentation metrics according to volume, overlap, or distance agreement. We then analyze every segmentation metric based on the characteristics outlined in Section 1.4. If not otherwise specified, all numbers in this section refer to the Spearman correlation coefficients presented in Fig. 5. A summary of the core results and guidelines for the choice of metrics are provided in Table 6.
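As an illustration of the correlation analysis referred to above, the following is a minimal pure-Python sketch of a Spearman coefficient between a per-case data set property and a per-case metric value. The function names and the example numbers are hypothetical and are not taken from the released evaluation code; significance testing is not covered.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / (spread_x * spread_y) ** 0.5


# Made-up per-case values, for illustration only.
reference_volume_ml = [0.5, 2.0, 10.0, 40.0, 120.0]
dice_per_case = [0.10, 0.35, 0.30, 0.70, 0.85]
print(round(spearman(reference_volume_ml, dice_per_case), 2))  # 0.9
```

Because only ranks enter the computation, the coefficient captures monotonic rather than linear relationships, which suits metrics with different value ranges.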
Table 6.
Suggestions for choosing meaningful metrics for data sets with uncertain, small and empty reference annotation.
| Category | Metric¹ | Robustness toward uncertainty in reference annotation | Independence from volume of reference annotation | Reward of volume and location agreement | Reward of agreement of emptiness |
|---|---|---|---|---|---|
| Volume | VS | ✓ | ✓ | – | ✓ |
| | AVD | ✓ | – | – | ✓ |
| Overlap | Dice | – | – | ✓ | – (set threshold²) |
| | Recall | – | – | ✓ | – (set threshold²) |
| | Precision | – | – | ✓ | – (set threshold²) |
| Distance | HD 95 | ✓ | ✓ | – | – (set threshold²) |
| | ASSD | (✓) | – | ✓ | – (set threshold²) |
| | SDT small | ✓ | ✓ | ✓ | – (set threshold²) |
| | SDT large | ✓ | ✓ | ✓ | – (set threshold²) |
¹ VS = Volumetric Similarity, AVD = Absolute Volume Difference, ASSD = Average Surface Distance, HD 95 = Hausdorff Distance 95th percentile, SDT = Surface Dice at Tolerance.
² Set threshold: below this volume threshold, images are considered to have no lesion.
4.3.1. Volume agreement
VS:
Robustness toward Uncertainty and Independence from Reference Volume:
Conceptually, VS allows location variability of reference volumes (Table 6), because FP and FN voxels can be anywhere in the image without influencing the value of VS. This characteristic becomes particularly valuable when dealing with uncertain reference annotations. Assuming that uncertainty is a source of FP and FN (see Section 1.1), VS does not penalize uncertainty as long as the difference between FP and FN has a linear relationship to the reference volume. This is because VS normalizes to the sum of the reference and predicted volumes. Our findings support this with a low correlation to uncertainty (−0.17 and −0.32 for NCCT and Spinal Cord) and to reference volume (below 0.25 across all data sets). We conclude that the VS value is less driven by uncertainty or reference volume.
Reward of Volume and Location Agreement:
VS does not reward location agreement, since VS measures the relative relationship between the reference and predicted volumes rather than their distance. VS is therefore suitable if volume agreement is the major clinical concern, as in some neuroimaging applications such as stroke (Powers et al., 2018).
In theory, VS may be less appropriate for clinical datasets and segmentation tasks that heavily rely on spatial information, such as those involving multiple sclerosis (Filippi et al., 2019). However, our observations indicate a consistent moderate to strong correlation with overlap metrics (e.g., Dice coefficient ranging from 0.58 to 0.84) and distance metrics (e.g., ranging from 0.57 to 0.82), particularly in datasets where reference annotations are uncertain and small in size.
Reward of Agreement of Emptiness:
For cases with empty reference and predicted masks, VS returns the optimal value of 1. Therefore, VS is suitable for data sets with expected empty reference masks. Nevertheless, we recommend setting a threshold for very small volumes (<1 ml), because the frequency of empty reference or predicted masks could skew the distribution of values compared to other metrics.
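The properties of VS discussed above can be reproduced with a small sketch. It assumes the common definition VS = 1 − |FN − FP| / (2TP + FP + FN) and represents masks as sets of voxel coordinates; the function names are hypothetical and do not reflect the USE-Evaluator interface.

```python
def confusion_counts(reference, prediction):
    """Count TP, FP and FN voxels for two masks given as sets of coordinates."""
    tp = len(reference & prediction)
    fp = len(prediction - reference)
    fn = len(reference - prediction)
    return tp, fp, fn


def volumetric_similarity(reference, prediction):
    """VS = 1 - |FN - FP| / (2TP + FP + FN); 1.0 when both masks are empty."""
    tp, fp, fn = confusion_counts(reference, prediction)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 1.0  # agreement of emptiness is rewarded with the optimal value
    return 1.0 - abs(fn - fp) / denominator


reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
shifted = {(5, 5), (5, 6), (6, 5), (6, 6)}   # same volume, different location
half = {(0, 0), (0, 1)}                      # same location, half the volume

print(volumetric_similarity(reference, shifted))         # 1.0
print(round(volumetric_similarity(reference, half), 3))  # 0.667
print(volumetric_similarity(set(), set()))               # 1.0
```

The shifted prediction scores the same as a perfect one, which shows why VS tolerates location variability but cannot reward location agreement.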
AVD:
Our findings suggest no advantage of AVD over VS for data sets with a small median reference volume and uncertainty.
In contrast to VS, AVD does not normalize to the sum of reference and predicted volumes. Larger reference volumes have potentially larger volume differences, resulting in a notably positive correlation between AVD and reference volumes across all data sets. In data sets with a wide spread of reference volumes (Fig. 1(b)), it is unclear whether a reduction in AVD reflects a slight improvement for large reference volumes or a substantial improvement for small reference volumes. This ambiguity can introduce bias when comparing model performance within and across data sets, as evidenced by inconsistent correlation patterns with overlap and distance agreement metrics in our correlation analysis.
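A short worked example may make the normalization argument concrete. The sketch below compares AVD and VS for two hypothetical cases, one small and one large lesion; the volumes are invented for illustration, and VS is written in its volume form, 1 − |V_pred − V_ref| / (V_pred + V_ref).

```python
def absolute_volume_difference(reference_ml, predicted_ml):
    """AVD in ml: not normalized, so it scales with lesion size."""
    return abs(predicted_ml - reference_ml)


def volumetric_similarity(reference_ml, predicted_ml):
    """VS: volume difference normalized by the summed volumes."""
    total = reference_ml + predicted_ml
    if total == 0:
        return 1.0
    return 1.0 - abs(predicted_ml - reference_ml) / total


# Hypothetical cases: (reference volume, predicted volume) in ml.
small_case = (2.0, 1.0)     # prediction misses half of a small lesion
large_case = (100.0, 95.0)  # prediction misses 5% of a large lesion

for name, (ref, pred) in (("small", small_case), ("large", large_case)):
    avd = absolute_volume_difference(ref, pred)
    vs = volumetric_similarity(ref, pred)
    print(f"{name}: AVD = {avd:.1f} ml, VS = {vs:.3f}")
# small: AVD = 1.0 ml, VS = 0.667
# large: AVD = 5.0 ml, VS = 0.974
```

AVD ranks the large case as the worse prediction although its relative error is far smaller, whereas VS ranks the small case as worse.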
4.3.2. Overlap agreement
Dice:
Robustness toward Uncertainty:
We observed that the Dice correlates more strongly with uncertainty than other metrics (−0.62 to −0.72). This indicates that the Dice value is influenced not only by the extent of overlap but also by the level of uncertainty.
In a theoretical context, let us consider two scenarios. In the best-case scenario, a model outperforms the experts (as determined by the majority vote of the reference annotation) by correctly classifying voxels. In this ideal situation, all FP are truly TP, and all FN are truly TN. However, the denominator contains the sum of FP and FN (Table 1), which would disproportionately increase and lead to a lower Dice value. As a result, the performance of the model is underestimated. In the worst-case scenario, a model is inferior to the experts in classifying voxels correctly; all FP are truly FP and all FN are truly FN. The Dice value does not change. As a result, Dice is biased toward the worst-case scenario. Hence, in the presence of uncertainty between the experts’ masks, the Dice over-penalizes overlap disagreement with a lower value.
Independence from Reference Volume:
In our study, we consistently observed a positive correlation between the reference volume and the Dice across all data sets, ranging from 0.40 to 0.56, with the NCCT data set exhibiting the highest correlation. We hypothesized that the size of the target object affects the Dice value. More specifically, we investigated how the size of the target object impacts the likelihood of a voxel being classified as TP, because the Dice primarily rewards accurate voxel assignment to TP. To validate our hypothesis, we analyzed the Dice value of a random model as a function of the class imbalance parameter (Section 2.4.1). The value of this parameter can be calculated directly from the reference volume and represents the probability of a voxel in the prediction mask being classified as belonging to the target object class (see Section 3.3.2). We plotted the Dice curve of the random model (dark red line) and compared it to all data sets (Fig. 4). If the parameter is very low at 0.01 (1% of the brain), the expected Dice of the random model is 0.02. If it is 0.5 (50% of the brain), the expected Dice is much higher at 0.5 (dashed line). The regression lines for the random model, NCCT, and BRATS non-enhancing tumor show a positive monotonic tendency of the Dice values with higher parameter values. This behavior is also present, although less pronounced, in the BRATS 2019 whole tumor and Spinal Cord models with larger parameter values (shallower slope of gray and orange lines).
Fig. 4.

Dot plot with regression lines for the Dice over class imbalance for all segmentation models. The gray areas represent 95% confidence intervals. The dark red dots and line represent the random model with the expected Dice defined in Appendix A.1. The dashed line indicates the expected Dice for a balanced reference mask.
We infer that a high imbalance ratio is more likely to produce lower Dice values. Location and volume errors may be penalized more heavily for small reference annotations than for larger ones, making the Dice a sub-optimal choice of metric for data sets with small reference annotations and a wide distribution of reference volumes.
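The random-model argument can be checked numerically with a simple simulation. The sketch below is an illustration under stated assumptions, not the authors' exact definition (see Appendix A.1): a fraction `target_fraction` of all voxels belongs to the target object, and the model labels each voxel positive independently with probability `predict_prob`. Both parameter names, the voxel counts, and the closed-form approximation 2pq / (p + q) are assumptions of this sketch.

```python
import random


def dice(tp, fp, fn):
    """Dice = 2TP / (2TP + FP + FN); NaN when reference and prediction are empty."""
    denominator = 2 * tp + fp + fn
    return float("nan") if denominator == 0 else 2 * tp / denominator


def simulate_random_dice(target_fraction, predict_prob,
                         n_voxels=10_000, n_draws=40, seed=0):
    """Average Dice of a model that labels voxels positive at random."""
    rng = random.Random(seed)
    n_target = round(target_fraction * n_voxels)
    scores = []
    for _ in range(n_draws):
        # Random positives inside the target are TP, outside they are FP.
        tp = sum(rng.random() < predict_prob for _ in range(n_target))
        fp = sum(rng.random() < predict_prob for _ in range(n_voxels - n_target))
        scores.append(dice(tp, fp, n_target - tp))
    return sum(scores) / len(scores)


def approx_expected_dice(target_fraction, predict_prob):
    """Ratio-of-expectations approximation: 2pq / (p + q)."""
    return 2 * target_fraction * predict_prob / (target_fraction + predict_prob)


for p in (0.01, 0.1, 0.5):
    simulated = simulate_random_dice(p, predict_prob=0.5)
    print(p, round(simulated, 3), round(approx_expected_dice(p, 0.5), 3))
```

With `predict_prob = 0.5` the approximation gives about 0.02 for a target fraction of 0.01 and 0.5 for a target fraction of 0.5, in line with the values quoted above. Setting `predict_prob` equal to `target_fraction` instead gives an expected Dice of roughly `target_fraction`. In both cases the Dice of a model with no skill rises monotonically with the size of the target object.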
Reward of Volume and Location Agreement:
The numerator of the Dice, which comprises TP, represents the voxels assigned to both the reference and prediction masks. This value is maximized when the masks agree closely in both location and volume. Empirically, the Dice rewards volume and location agreement, with a consistent, moderate to strong correlation to VS and distance metrics across all data sets.
Reward of Agreement of Emptiness:
The Dice does not reward the agreement of emptiness between the reference and predicted mask, but returns “NaN”. We found a high number of cases in the NCCT data set with a Dice value of 0. Investigation showed that the Dice is zero both if target objects are right next to each other and if they are far from each other, especially for small reference volumes. This may lead to a disproportionate count of cases with Dice equal to zero. Depending on the clinical context, very small reference volumes (i.e., <1 ml) may be excluded from the evaluation of segmentation metrics to avoid biasing the overall performance without obtaining meaningful information.
Instead, we suggest image-classification metrics to evaluate very small or empty reference annotation masks. For example, a case with a reference volume below 1 ml may be better evaluated by image-classification metrics than by a segmentation metric. We implemented this idea in the USE-Evaluator, where a lower volume threshold can be set that excludes studies with a reference volume below the threshold and automatically initializes an image-classification evaluation.
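The thresholding idea can be sketched as follows. This is a minimal illustration with hypothetical function names, and it does not reproduce the USE-Evaluator interface: every case receives an image-level outcome based on the volume threshold, and the Dice is computed only when the reference volume reaches the threshold.

```python
import math


def dice(tp, fp, fn):
    """Dice = 2TP / (2TP + FP + FN); NaN when reference and prediction are empty."""
    denominator = 2 * tp + fp + fn
    return math.nan if denominator == 0 else 2 * tp / denominator


def evaluate_case(tp, fp, fn, voxel_volume_ml, threshold_ml=1.0):
    """Return (image-level outcome, Dice or None) for one case."""
    reference_ml = (tp + fn) * voxel_volume_ml
    predicted_ml = (tp + fp) * voxel_volume_ml
    reference_positive = reference_ml >= threshold_ml
    predicted_positive = predicted_ml >= threshold_ml
    outcome = {
        (True, True): "TP",
        (False, True): "FP",
        (True, False): "FN",
        (False, False): "TN",
    }[(reference_positive, predicted_positive)]
    # Segmentation metrics are skipped for cases below the threshold.
    segmentation_dice = dice(tp, fp, fn) if reference_positive else None
    return outcome, segmentation_dice


print(dice(0, 0, 0))                                   # nan
print(evaluate_case(0, 0, 0, voxel_volume_ml=0.001))   # ('TN', None)
print(evaluate_case(0, 50, 60, voxel_volume_ml=0.001)) # ('TN', None)
print(evaluate_case(3000, 500, 1000, voxel_volume_ml=0.001))  # ('TP', 0.8)
```

The second and third cases would contribute "NaN" or zero to the average Dice, while the routing above counts them as correct image-level negatives.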
Recall and Precision:
Overall, Recall and Precision show similar behavior compared to the Dice, but only capture certain aspects of overlap agreement, and should be evaluated with other segmentation metrics and in the context of the clinical question.
They differ in their consideration of FP and FN in the denominator. Precision rewards TP relative to the predicted volume, TP + FP, and Recall rewards TP relative to the reference volume, TP + FN (Table 1).
Recall in particular showed a correlation to uncertainty in the NCCT and Spinal Cord data sets (−0.60 and −0.43). One can argue that in the setting of high class imbalance, the models learn to classify voxels with high entropy less frequently as the target class, because the chance of being correct is higher if they are classified as background, increasing FN (Leevy et al., 2018). We then get a higher denominator for Recall and thus an underestimation of uncertain reference volumes.
Similarly to the Dice, Recall and Precision do not reward the agreement of emptiness. We therefore recommend setting a threshold for very small reference volumes and evaluating such cases with image-classification metrics.
4.3.3. Distance agreement
Overall, we found that distance metrics, especially SDT, show favorable behavior in the context of small and uncertain reference annotations, while still exhibiting a consistent correlation to metrics that measure volume and overlap agreement.
SDT:
Robustness toward Uncertainty and Independence from Reference Volume:
SDT assigns cardinalities to surface voxels based on their proximity to the nearest surface voxel in either the reference or predicted mask. This approach emulates the behavior of the Dice while serving as a distance metric. However, contrary to the Dice, if the reference and predicted volumes are right next to each other and within the border region defined by the tolerance, SDT still measures this agreement. This becomes particularly advantageous when a lower signal caused by pathophysiological factors and modality-related effects introduces more uncertainty in the outer regions of the target object compared to its inner regions, e.g., in a stroke on NCCT. Compared to the Dice, we found weaker correlations to both the U-score and reference volume for the NCCT data set (−0.37, respectively).
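The difference between the Dice and SDT for adjacent but non-overlapping objects can be illustrated with a simplified sketch. It works in 2D, measures distances in voxel units rather than millimetres, and counts surface voxels without surface-area weighting; these simplifications and all function names are assumptions made for illustration.

```python
import math


def dice(reference, prediction):
    """Dice on voxel sets; NaN when both masks are empty."""
    total = len(reference) + len(prediction)
    return math.nan if total == 0 else 2 * len(reference & prediction) / total


def surface(mask):
    """Voxels of the mask with at least one 4-connected neighbour outside it."""
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c) for (r, c) in mask
            if any((r + dr, c + dc) not in mask for dr, dc in neighbours)}


def count_within(surface_a, surface_b, tolerance):
    """Number of voxels in surface_a lying within tolerance of surface_b."""
    return sum(1 for p in surface_a
               if any(math.dist(p, q) <= tolerance for q in surface_b))


def surface_dice(reference, prediction, tolerance):
    """Share of surface voxels of both masks that lie within the tolerance."""
    s_ref, s_pred = surface(reference), surface(prediction)
    if not s_ref or not s_pred:
        return math.nan  # left undefined in this sketch when a mask is empty
    close = (count_within(s_ref, s_pred, tolerance)
             + count_within(s_pred, s_ref, tolerance))
    return close / (len(s_ref) + len(s_pred))


def block(row0, col0, size=2):
    """Square mask of size x size voxels with its corner at (row0, col0)."""
    return {(row0 + r, col0 + c) for r in range(size) for c in range(size)}


reference = block(0, 0)
adjacent = block(0, 2)   # touches the reference without overlapping it
distant = block(0, 10)   # far away from the reference

print(dice(reference, adjacent), dice(reference, distant))  # 0.0 0.0
print(surface_dice(reference, adjacent, tolerance=1.0))     # 0.5
print(surface_dice(reference, adjacent, tolerance=2.0))     # 1.0
print(surface_dice(reference, distant, tolerance=2.0))      # 0.0
```

The Dice cannot distinguish the adjacent from the distant prediction, whereas the surface-based score separates them and grows with the chosen tolerance.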
In the Spinal Cord data set, there is a correlation between SDT and the U-score. Image analysis of a few distinct cases with high uncertainty and low SDT values revealed deteriorating image quality in the cranial and caudal slices of the spinal cord, which we suggest is the primary source of this relationship.
Reward of Volume and Location Agreement:
SDT shares similarities with overlap measures, due to its reliance on the spatial relationships among surface voxels and the direct influence of the object size on . Consequently, SDT captures both the agreement in location and volume, which is further supported by its strong correlation with volume and distance metrics across all data sets (0.52–0.87).
Reward of Agreement of Emptiness:
In the presence of empty reference masks, all distance metrics return “inf”. Similarly to overlap metrics, distance metrics may need a lower-bound volume threshold so that empty and small-volume reference masks are evaluated with image-classification metrics.
HD 95 and ASSD:
Since HD 95 and ASSD are distance-based metrics (as explained in Section 2.1), they primarily assess location accuracy while tolerating volume errors. Their values remain close to optimal even if the model’s predicted volume deviates slightly from the reference.
Consistent with this, we found that HD 95 and ASSD exhibited mostly no correlations with reference volumes and uncertainty. However, similar to SDT, we observed a correlation between ASSD and the U-score in the Spinal Cord data set, likely attributed to low image quality in the cranial and caudal slices in distinct cases.
Overall, HD 95 and ASSD are robust to uncertainty and reference volume but mostly measure distance agreement. Even though they empirically show strong correlations with volume and overlap metrics across all data sets, SDT should be preferred if volume and location agreement is crucial.
4.4. Evaluation of image-level classification metrics
We propose evaluating image-level classification metrics alongside segmentation metrics to ensure an unbiased evaluation of model performance when models are trained on data sets that include cases with uncertain, small, or empty reference annotations. Negligible reference volumes below a certain threshold may be evaluated with image-classification metrics only.
For example, Liu et al. only included positive cases and proposed an image-classification metric, LDR (Liu et al., 2021). However, the agreement in image-level classification is not assessed for empty reference masks or small-volume cases. Clinical tests are unlikely to be performed exclusively on patients with a present pathology.
Data sets with negative cases or cases with negligible reference volumes would be more representative of the distribution of patients in clinical practice. This can have major implications for the idea of mostly positive or ambiguous cases being read by a radiologist and negative cases confidently evaluated by an algorithm (Wang et al., 2021).
In this section, we briefly highlight how a class imbalance between positive and negative/small-volume cases also introduces evaluation biases for inter-model and inter-data-set comparisons.
Sensitivity, Specificity, and F1-score:
Whether to use Specificity or Sensitivity as the primary image-classification metric depends on the clinical context. In this study, we found that higher Sensitivity, Specificity, and F1-score are associated with a higher class imbalance value (Supplemental material, Fig. 7). However, further studies are needed for more general statements.
ACC:
The ACC evaluates the agreement in image-level classification in the case of TP and TN (Maier-Hein et al., 2022). As for Sensitivity, Specificity, and F1-score, we found that a higher ACC value is associated with a higher class imbalance value (Supplemental material, Fig. 7).
AUC:
AUC is a standard multi-threshold classification metric to evaluate a predictor and is not defined for populations where only one class is present (it is therefore reported only for NCCT and BRATS non-enhancing tumor) (Maier-Hein et al., 2022). As the true discrete class in this setting is defined by the volume threshold, the AUC reveals how well the models classify volumes. We found that AUC does not change with the class imbalance value, suggesting that AUC is a more robust metric for unbalanced data sets.
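As a sketch of the image-level evaluation described in this section, the code below derives the true class from the reference volume and the volume threshold, and uses the predicted volume as the score. AUC is computed through its rank interpretation, the probability that a randomly chosen positive case has a larger predicted volume than a randomly chosen negative case. The function names and example volumes are hypothetical.

```python
def split_by_reference(reference_ml, predicted_ml, threshold_ml=1.0):
    """Group predicted volumes by the class of the reference volume."""
    pairs = list(zip(reference_ml, predicted_ml))
    positives = [p for r, p in pairs if r >= threshold_ml]
    negatives = [p for r, p in pairs if r < threshold_ml]
    return positives, negatives


def image_level_auc(reference_ml, predicted_ml, threshold_ml=1.0):
    """Rank-based AUC; None when only one class is present."""
    positives, negatives = split_by_reference(reference_ml, predicted_ml, threshold_ml)
    if not positives or not negatives:
        return None
    # Each positive/negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(reference_ml, predicted_ml, threshold_ml=1.0):
    """Sensitivity and Specificity at the volume threshold (None if undefined)."""
    positives, negatives = split_by_reference(reference_ml, predicted_ml, threshold_ml)
    sensitivity = (sum(p >= threshold_ml for p in positives) / len(positives)
                   if positives else None)
    specificity = (sum(p < threshold_ml for p in negatives) / len(negatives)
                   if negatives else None)
    return sensitivity, specificity


# Invented volumes in ml, for illustration only.
reference = [0.0, 0.4, 0.8, 2.0, 15.0, 60.0]
predicted = [0.0, 1.5, 0.2, 0.6, 12.0, 70.0]

print(round(image_level_auc(reference, predicted), 3))    # 0.889
print(sensitivity_specificity(reference, predicted))
print(image_level_auc([2.0, 15.0], [1.0, 20.0]))          # None
```

The last call shows why AUC and Specificity cannot be reported for a data set without negative cases.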
5. Limitations
The first limitation is that we evaluate metric behavior and reference annotation uncertainty in only three medical neuroimaging data sets and examine four different target objects. The methodologies we introduce are intended to be applicable to a broader range of medical imaging data sets, which would allow a more comprehensive examination of our findings. The second limitation is that the choice of baseline model might influence the correlations between metrics. To mitigate this, we chose nnUNet, a model that generalizes to many medical segmentation tasks (Isensee et al., 2021). Furthermore, correlation does not prove causation. For example, the correlation of reference volume and uncertainty with metric values does not imply that a higher reference volume causes a higher metric value. The correlation between Dice and reference volume has been reported in previous works (Taha and Hanbury, 2015; Liu et al., 2021; Commowick et al., 2018; Maier-Hein et al., 2022); however, an analysis of data set properties, an in-depth analysis, and a quantification of the uncertainty were missing.
6. Conclusion
We notice a mismatch between the data set properties underlying challenge-winning segmentation models and the cases encountered in clinical practice. Some commonly used metrics (e.g., the Dice score) might not capture whether a model’s performance generalizes well to the distribution of images encountered in clinical practice. In particular, (i) the presence of uncertainty in reference annotations causes misleading values, (ii) small reference volumes lead to unreasonably low metric values, and (iii) empty reference annotations cause a return of “NaN”, “inf”, or zero. For a data set with uncertain, small, and empty reference annotations, we suggest that model performance generalizes better to clinical practice when evaluated by the Surface Dice at Tolerance. We further propose setting a lower volume threshold for very small volumes or empty reference masks and using image-level classification metrics such as AUC (USE-Evaluator).
It is crucial to evaluate the performance of the model using multiple metrics that effectively encompass the specific objectives of the clinical segmentation task. These objectives can vary significantly across different areas of clinical practice. To facilitate the selection of appropriate metrics, we recommend referring to Table 6.
We highlight the difficulty of comparing models trained to address different clinical problems. While uncertain, small, and empty reference annotations require a rethinking of evaluation, they also increase the value an algorithmic tool provides, because the underlying task is hard for human experts.
Acknowledgments
We would like to thank Georg Schramm for the critical review of this manuscript. We extend our appreciation to the German Research Foundation (DFG) for its support of this project through the Walter-Benjamin fellowship (ID: 517316550).
Appendix. Supplemental material
A.1. Dice score of the random model
We aim to show how the Dice score depends on the volume of the target object, independently of the model performance. We do this by showing that the trend of the Dice score, with respect to volume, is present in the Dice score for a simple, random model. Furthermore, we show that this trend is replicated in multiple settings.
Fig. 6.

Data Sampling and Partition of 5-fold-Cross-Validation.
Fig. 7.

Line plot of image-classification metric values over class imbalance for the NCCT, BRATS 2019 non-enhancing tumor, and BRATS 2019 whole tumor data sets.
Table 7.
Reference annotation uncertainty: Inter-expert and majority-expert agreement.
| Categories | Metric² | NCCT inter-expert¹ | NCCT majority-expert¹ | Spinal inter-expert¹ | Spinal majority-expert¹ |
|---|---|---|---|---|---|
| Uncertainty | U-score | 0.87 ± 0.05 | | 0.39 ± 0.02 | |
| Volume | VS | 0.50 ± 0.02 | 0.75 ± 0.03 | 0.93 ± 0.01 | 0.95 ± 0.01 |
| | AVD [ml] | 10.50 ± 2.11 | 4.25 ± 0.98 | 0.13 ± 0.02 | 0.10 ± 0.02 |
| Overlap | Dice | 0.39 ± 0.05 | 0.67 ± 0.04 | 0.84 ± 0.01 | 0.91 ± 0.01 |
| | Precision | 0.39 ± 0.05 | 0.64 ± 0.06 | 0.83 ± 0.02 | 0.95 ± 0.01 |
| | Recall | 0.41 ± 0.04 | 0.91 ± 0.02 | 0.85 ± 0.02 | 0.88 ± 0.01 |
| Distance | ASSD | 4.75 ± 0.54 | 2.03 ± 0.23 | 0.11 ± 0.01 | 0.07 ± 0.01 |
| | HD 95 [mm] | 16.73 ± 2.12 | 9.49 ± 1.15 | 0.50 ± 0.14 | 0.48 ± 0.13 |
| | SDT small³ | 0.40 ± 0.03 | 0.67 ± 0.02 | 0.85 ± 0.07 | 0.93 ± 0.04 |
| | SDT large⁴ | 0.59 ± 0.05 | 0.83 ± 0.03 | 0.85 ± 0.07 | 0.93 ± 0.04 |
¹ Per case and data set median ± 95% Confidence Interval (bootstrapped).
² VS = Volumetric Similarity, AVD = Absolute Volume Difference, ASSD = Average Surface Distance, HD 95 = Hausdorff Distance 95th percentile, SDT = Surface Dice at Tolerance.
³ Surface Dice at Tolerance with 2 mm for NCCT and BRATS 2019 models and 0.05 mm for the Spinal Cord model.
⁴ Surface Dice at Tolerance with 5 mm for NCCT and BRATS 2019 models and 0.1 mm for the Spinal Cord model.
We define the random model for a parameter , as one where each voxel is chosen to be positive in the predicted mask with a probability and there are exactly voxels in the target object. Note that the expected number of predicted positive voxels under this model is exactly .
Under this model, we can compute the expected Dice score, , across multiple draws. We use the standard combinations notation, to denote the number of orderings where we flip heads from coin flips. Note that by definition, , the size of the target object.
| (8) |
where
| (9) |
and
| (10) |
Footnotes
Dataset link: https://github.com/SophieOstmeier/UncertainSmallEmpty
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
The evaluation code was released to encourage further analysis of this topic https://github.com/SophieOstmeier/UncertainSmallEmpty. The data set will be made available upon reasonable request. Please contact Jeremy Heit, MD, PhD.
References
- Akeret K, Bas van Niftrik CH, Sebök M, Muscas G, Visser T, Staartjes VE, Marinoni F, Serra C, Regli L, Krayenbühl N, Piccirelli M, Fierstra J, 2021. Topographic volume-standardization atlas of the human brain. 2021.02.26.21251901. 10.1101/2021.02.26.21251901 medRxiv . [DOI] [PMC free article] [PubMed] [Google Scholar]
- Albers GW, Marks MP, Kemp S, Christensen S, Tsai JP, Ortega-Gutierrez S, McTaggart RA, Torbey MT, Kim-Tenser M, Leslie-Mazwi T, Sarraj A, Kasner SE, Ansari SA, Yeatts SD, Hamilton S, Mlynash M, Heit JJ, Zaharchuk G, Kim S, Carrozzella J, Palesch YY, Demchuk AM, Bammer R, Lavori PW, Broderick JP, Lansberg MG, 2018. Thrombectomy for stroke at 6 to 16 hours with selection by perfusion imaging. N. Engl. J. Med 378 (8), 708–718. 10.1056/NEJMoa1713973. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Amukotuwa S, Straka M, Aksoy D, Fischbein N, Desmond P, Albers G, Bammer R, 2019. Cerebral blood flow predicts the infarct core. Stroke 50 (10), 2783–2789. 10.1161/STROKEAHA.119.026640. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, Freymann JB, Farahani K, Davatzikos C, 2017. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117. 2052–4463 Bakas, Spyridon Orcid: 0000-0001-8734-6482 Akbari, Hamed Orcid: 0000-0001-9786-3707 Sotiras, Aristeidis Bilello, Michel Rozycki, Martin Kirby, Justin S Freymann, John B Farahani, Keyvan Davatzikos, Christos U24 CA189523/CA/NCI NIH HHS/United States Dataset Journal Article Research Support, N.I.H., Extramural 2017/09/06 Sci Data. 2017 Sep 5;4:170117. doi: 10.1038/sdata.2017.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bakas S, Reyes M, Jakab A, Bauer S, Rempfler M, Crimi A, Shinohara RT, Berger C, Ha SM, Rozycki M, 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629 [Google Scholar]
- Becker AS, Chaitanya K, Schawkat K, Muehlematter UJ, Hötker AM, Konukoglu E, Donati OF, 2019. Variability of manual segmentation of the prostate in axial T2-weighted MRI: A multi-reader study. Eur. J. Radiol 121, 108716. 10.1016/j.ejrad.2019.108716. [DOI] [PubMed] [Google Scholar]
- Bertels J, Eelbode T, Berman M, Vandermeulen D, Maes F, Bisschops R, Blaschko MB, 2019. Optimizing the dice score and jaccard index for medical image segmentation: Theory and practice. In: Medical Image Computing and Computer Assisted Intervention-MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22. Springer, pp. 92–100. [Google Scholar]
- Brosch T, Peters J, Groth A, Stehle T, Weese J, 2018. Deep learning-based boundary detection for model-based segmentation with application to MR prostate segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 515–522. [Google Scholar]
- Caradu C, Spampinato B, Vrancianu AM, Bérard X, Ducasse E, 2021. Fully automatic volume segmentation of infrarenal abdominal aortic aneurysm computed tomography images with deep learning approaches versus physician controlled manual segmentation. J. Vasc. Surg 74 (1), 246–256.e6. 10.1016/j.jvs.2020.11.036. [DOI] [PubMed] [Google Scholar]
- Cheng B, Girshick R, Dollár P, Berg AC, Kirillov A, 2021. Boundary iou: Improving object-centric image segmentation evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15334–15342. [Google Scholar]
- Commowick O, Istace A, Kain M, Laurent B, Leray F, Simon M, Pop SC, Girard P, Améli R, Ferré J-C, Kerbrat A, Tourdias T, Cervenansky F, Glatard T, Beaumont J, Doyle S, Forbes F, Knight J, Khademi A, Mahbod A, Wang C, McKinley R, Wagner F, Muschelli J, Sweeney E, Roura E, Lladó X, Santos MM, Santos WP, Silva-Filho AG, Tomas-Fernandez X, Urien H, Bloch I, Valverde S, Cabezas M, Vera-Olmos FJ, Malpica N, Guttmann C, Vukusic S, Edan G, Dojat M, Styner M, Warfield SK, Cotton F, Barillot C, 2018. Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Sci. Rep 8 (1), 13650. 10.1038/s41598-018-31911-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Vos V, Timmins KM, van der Schaaf IC, Ruigrok Y, Velthuis BK, Kuijf HJ, 2021. Automatic cerebral vessel extraction in TOF-MRA using deep learning. arXiv preprint arXiv:2101.09253 [Google Scholar]
- Dewey BE, Zhao C, Reinhold JC, Carass A, Fitzgerald KC, Sotirchos ES, Saidha S, Oh J, Pham DL, Calabresi PA, van Zijl PCM, Prince JL, 2019. DeepHarmony: A deep learning approach to contrast harmonization across scanner changes. Magn. Reson. Imaging 64, 160–170. 10.1016/j.mri.2019.05.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Elguindi S, Zelefsky MJ, Jiang J, Veeraraghavan H, Deasy JO, Hunt MA, Tyagi N, 2019. Deep learning-based auto-segmentation of targets and organs-at-risk for magnetic resonance imaging only planning of prostate radiotherapy. Phys. Imaging Radiat. Oncol 12, 80–86. 10.1016/j.phro.2019.11.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Filippi M, Preziosa P, Banwell BL, Barkhof F, Ciccarelli O, De Stefano N, Geurts JJ, Paul F, Reich DS, Toosy AT, et al. , 2019. Assessment of lesions on magnetic resonance imaging in multiple sclerosis: Practical guidelines. Brain 142 (7), 1858–1875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gautam A, Raman B, 2021. Towards effective classification of brain hemorrhagic and ischemic stroke using CNN. Biomed. Signal Process. Control 63, 102178. [Google Scholar]
- Heimann T, Van Ginneken B, Styner MA, Arzhaeva Y, Aurich V, Bauer C, Beck A, Becker C, Beichel R, Bekes G, et al. , 2009. Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Trans. Med. Imaging 28 (8), 1251–1265. [DOI] [PubMed] [Google Scholar]
- Huttenlocher DP, Klanderman GA, Rucklidge WJ, 1993. Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell 15 (9), 850–863. [Google Scholar]
- Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH, 2021. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18 (2), 203–211. 10.1038/s41592-020-01008-z Isensee, Fabian Jaeger, Paul F Kohl, Simon A A Petersen, Jens Maier-Hein, Klaus H eng Research Support, Non-U.S. Gov’t 2020/12/09 Nat Methods. 2021 Feb;18(2):203–211. Epub 2020 Dec 7. [DOI] [PubMed] [Google Scholar]
- Isensee F, Schell M, Pflueger I, Brugnara G, Bonekamp D, Neuberger U, Wick A, Schlemmer H-P, Heiland S, Wick W, et al. , 2019. Automated brain extraction of multisequence MRI using artificial neural networks. Hum. Brain Mapp 40 (17), 4952–4964. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janssens R, Zeng G, Zheng G, 2018. Fully automatic segmentation of lumbar vertebrae from CT images using cascaded 3D fully convolutional networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging. ISBI 2018, IEEE, pp. 893–897. [Google Scholar]
- Jungo A, Meier R, Ermis E, Blatti-Moreno M, Herrmann E, Wiest R, Reyes M, 2018. On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 682–690. [Google Scholar]
- Karimi D, Dou H, Warfield SK, Gholipour A, 2020. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Med. Image Anal 65, 101759. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuijf HJ, Biesbroek JM, De Bresser J, Heinen R, Andermatt S, Bento M, Berseth M, Belyaev M, Cardoso MJ, Casamitjana A, et al. , 2019. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the WMH segmentation challenge. IEEE Trans. Med. Imaging 38 (11), 2556–2568. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N, 2018. A survey on addressing high-class imbalance in big data. J. Big Data 5 (1), 42. 10.1186/s40537-018-0151-6. [DOI] [Google Scholar]
- Litjens G, Toth R, Van De Ven W, Hoeks C, Kerkstra S, van Ginneken B, Vincent G, Guillard G, Birbeck N, Zhang J, et al., 2014. Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Med. Image Anal 18 (2), 359–373.
- Liu CF, Hsu J, Xu X, Ramachandran S, Wang V, Miller MI, Hillis AE, Faria AV, Wintermark M, Warach SJ, Albers GW, Davis SM, Grotta JC, Hacke W, Kang D-W, Kidwell C, Koroshetz WJ, Lees KR, Lev MH, Liebeskind DS, Sorensen AG, Thijs VN, Thomalla G, Wardlaw JM, Luby M, the STIR and VISTA Imaging investigators, 2021. Deep learning-based detection and segmentation of diffusion abnormalities in acute ischemic stroke. Commun. Med 1 (1), 61. 10.1038/s43856-021-00062-8.
- Maier-Hein L, Reinke A, Christodoulou E, Glocker B, Godau P, Isensee F, Kleesiek J, Kozubek M, Reyes M, Riegler MA, 2022. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653.
- Mehta R, Filos A, Baid U, Sako C, McKinley R, Rebsamen M, Dätwyler K, Meier R, Radojewski P, Murugesan GK, et al., 2022. QU-BraTS: MICCAI BraTS 2020 challenge on quantifying uncertainty in brain tumor segmentation-analysis of ranking scores and benchmarking results. J. Mach. Learn. Biomed. Imaging 1.
- Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, Lanczi L, Gerstner E, Weber MA, Arbel T, Avants BB, Ayache N, Buendia P, Collins DL, Cordier N, Corso JJ, Criminisi A, Das T, Delingette H, Demiralp Ç, Durst CR, Dojat M, Doyle S, Festa J, Forbes F, Geremia E, Glocker B, Golland P, Guo X, Hamamci A, Iftekharuddin KM, Jena R, John NM, Konukoglu E, Lashkari D, Mariz JA, Meier R, Pereira S, Precup D, Price SJ, Raviv TR, Reza SMS, Ryan M, Sarikaya D, Schwartz L, Shin HC, Shotton J, Silva CA, Sousa N, Subbanna NK, Szekely G, Taylor TJ, Thomas OM, Tustison NJ, Unal G, Vasseur F, Wintermark M, Ye DH, Zhao L, Zhao B, Zikic D, Prastawa M, Reyes M, Leemput KV, 2015. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34 (10), 1993–2024. 10.1109/TMI.2014.2377694.
- Nikolov S, Blackwell S, Zverovitch A, Mendes R, Livne M, De Fauw J, Patel Y, Meyer C, Askham H, Romera-Paredes B, 2018. Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy. arXiv preprint arXiv:1809.04430.
- Powers WJ, Rabinstein AA, Ackerson T, Adeoye OM, Bambakidis NC, Becker K, Biller J, Brown M, Demaerschalk BM, Hoh B, 2018. 2018 Guidelines for the early management of patients with acute ischemic stroke: A guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke 49 (3), e46–e99.
- Prados F, Ashburner J, Blaiotta C, Brosch T, Carballido-Gamio J, Cardoso MJ, Conrad BN, Datta E, Dávid G, De Leener B, et al., 2017. Spinal cord grey matter segmentation challenge. Neuroimage 152, 312–329.
- Schell M, Tursunova I, Fabian I, Bonekamp D, Neuberger U, Wick W, Bendszus M, Maier-Hein K, Kickingereder P, 2019. Automated Brain Extraction of Multi-Sequence MRI Using Artificial Neural Networks. European Congress of Radiology-ECR 2019.
- Shusharina N, Söderberg J, Edmunds D, Löfman F, Shih H, Bortfeld T, 2020. Automated delineation of the clinical target volume using anatomically constrained 3D expansion of the gross tumor volume. Radiother. Oncol 146, 37–43. 10.1016/j.radonc.2020.01.028.
- Styner M, Lee J, Chin B, Chin M, Commowick O, Tran H, Markovic-Plese S, Jewells V, Warfield S, 2008. 3D segmentation in the clinic: A grand challenge II: MS lesion segmentation. MIDAS J 2008, 1–6.
- Taha AA, Hanbury A, 2015. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15, 29. 10.1186/s12880-015-0068-x.
- Tiulpin A, Finnilä M, Lehenkari P, Nieminen HJ, Saarakkala S, 2020. Deep-learning for tidemark segmentation in human osteochondral tissues imaged with micro-computed tomography. In: International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, pp. 131–138.
- Vania M, Mureja D, Lee D, 2019. Automatic spine segmentation from CT images using convolutional neural network via redundant generation of class labels. J. Comput. Des. Eng 6 (2), 224–232. 10.1016/j.jcde.2018.05.002.
- Wang B, Jin S, Yan Q, Xu H, Luo C, Wei L, Zhao W, Hou X, Ma W, Xu Z, et al., 2021. AI-assisted CT imaging analysis for COVID-19 screening: Building and deploying a medical AI system. Appl. Soft Comput 98, 106897.
- Yang F, Zamzmi G, Angara S, Rajaraman S, Aquilina A, Xue Z, Jaeger S, Papagiannakis E, Antani SK, 2023. Assessing inter-annotator agreement for medical image segmentation. IEEE Access 11, 21300–21312. 10.1109/ACCESS.2023.3249759.
- Zhu R, Guo Y, Xue J-H, 2020. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recognit. Lett 133, 217–223.
Data Availability Statement
The evaluation code has been released to encourage further analysis of this topic: https://github.com/SophieOstmeier/UncertainSmallEmpty. The data set will be made available upon reasonable request; please contact Jeremy Heit, MD, PhD.
