Abstract
Accurate segmentation of small brain lesions in magnetic resonance imaging (MRI) is essential for understanding neurological disorders and guiding clinical decisions. However, detecting small lesions remains challenging due to low contrast and limited size. This study proposes two simple yet effective labeling strategies, Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL), that can seamlessly integrate into existing segmentation networks. MSL groups lesions based on volume to enable size-aware learning, while DBL emphasizes lesion boundaries to enhance structural sensitivity. We evaluate our approach on two benchmark datasets: stroke lesion segmentation using the Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset and multiple sclerosis lesion segmentation using the MSLesSeg dataset. On ATLAS v2.0, our approach achieved higher Dice (+1.3%), F1 (+2.4%), precision (+7.2%), and recall (+3.6%) scores compared to the top-performing method from a previous challenge. On MSLesSeg, our approach achieved the highest Dice score (0.7146) and ranked first among 16 international teams. Additionally, we examined the effectiveness of attention-based and mamba-based segmentation models but found that our proposed labeling strategies yielded more consistent improvements. These findings demonstrate that MSL and DBL offer a robust and generalizable solution for enhancing small brain lesion segmentation across various tasks and architectures. Our code is available at: https://github.com/nadluru/StrokeLesSeg.
Keywords: Small Lesion, Segmentation, Stroke, Multiple Sclerosis, Labeling, Data Augmentation, Deep Learning
1. Introduction
Segmenting small brain lesions from structural MRI remains a significant challenge in medical image analysis. These lesions, often associated with neurological disorders such as ischemic stroke, multiple sclerosis (MS), and focal cortical dysplasia (FCD), are difficult to detect due to their small spatial extent, low contrast, and the class imbalance between the numbers of voxels belonging to non-lesional and lesional tissue [1]. The morphology and spatial distribution of these lesions vary substantially across pathologies. Stroke lesions are often localized and follow vascular territories. MS lesions, in contrast, are generally small, ovoid, and scattered, showing a predilection for periventricular, juxtacortical, infratentorial, and spinal cord regions of the white matter [2]. FCD lesions are most commonly found in the cerebral cortex, often presenting as cortical thickening, blurring of the gray-white matter junction, and abnormal gyral patterns; these lesions tend to have an irregular, wedge-shaped or transmantle appearance and may preferentially affect the frontal and temporal lobes [3, 4].
Notably, the accurate detection and segmentation of small lesions are crucial for early diagnosis, prognostication, and timely intervention, particularly in conditions such as stroke, MS, and early-stage FCDs. For example, for ischemic stroke, especially small vessel or lacunar strokes, diffusion-weighted imaging (DWI) can detect small, acute lesions within minutes of onset, providing a definitive early diagnosis [5, 6]. For MS, the diagnosis often relies on McDonald criteria, which require evidence of lesions spread over space and time. The appearance of new, small lesions on a follow-up MRI is a key indicator of disease activity and can be crucial for confirming a diagnosis [7]. In patients with unexplained epilepsy, identifying a small, subtle area of focal cortical dysplasia (FCD) can provide a diagnosis for their seizure disorder [4]. The total volume and number of brain lesions (including small ones) at diagnosis and over time are correlated with long-term disability and disease progression [8, 9]. Accurately detecting a small acute stroke is what triggers time-sensitive interventions like thrombolysis (clot-mitigating drugs) or thrombectomy [10]. The location and total burden of small lesions can help predict the risk of future strokes and the likelihood of developing vascular cognitive impairment or dementia [11, 12]. For patients with drug-resistant epilepsy caused by FCD, the accurate segmentation of the lesion is the critical first step in planning for potentially curative epilepsy surgery, and the ability to precisely delineate the boundaries of a dysplasia is a strong predictor of whether a patient will be seizure-free after surgery [13].
Although small lesions are clinically significant, they have received limited attention in the literature, particularly regarding the unique challenges of their segmentation. Among the few efforts, several studies have investigated ensemble learning strategies that fuse predictions from multiple convolutional neural network (CNN) branches operating at different resolutions to enhance sensitivity to small lesion features [14, 15, 16]. While such approaches have demonstrated performance gains, they often necessitate specialized architectural modifications and increased computational complexity, potentially limiting their widespread adoption. To preserve fine-grained lesion details, SPiN [17] introduced a subpixel embedding technique for 2D MR image slices, effectively enhancing boundary delineation for small structures; however, extending this approach to volumetric (3D) data is computationally demanding and remains an open research area. Recently, the UNet-MSCSA model [18] incorporated multi-stage cross-scale attention mechanisms [19] within the widely used U-Net [20, 21, 22, 23, 24], specifically to bolster the network’s ability to capture subtle morphological cues of small lesions. The segmentation of small lesions continues to be an active area of investigation, with ongoing efforts focused on balancing model complexity, computational efficiency, and clinical applicability.
Despite these advances, the segmentation of small lesions continues to lag behind that of larger ones. This limitation fundamentally arises in part because most segmentation methods are evaluated using voxel-wise overlap metrics, such as the Dice similarity coefficient, which inherently favor large lesions due to their greater voxel count [25, 26]. Consequently, segmentation models often achieve high overall accuracy by accurately delineating large lesions, while failing to detect or accurately segment smaller ones. This imbalance is further exacerbated by the pronounced class imbalance between lesional and non-lesional tissue, which can bias neural network training toward the background or larger structures [27, 28]. Accordingly, the development and rigorous validation of segmentation algorithms that can reliably detect and delineate small lesions remains a central and unresolved challenge in advancing the clinical translation of medical image analysis techniques.
In this work, we propose two novel label-level strategies that explicitly incorporate lesion size and spatial structure into the training process. We demonstrate the advantages of the strategies using non-contrast T1-weighted MRI (T1w-MRI) data, since these are currently the primary publicly available imaging data for lesion segmentation tasks. The first strategy, Multi-Size Labeling (MSL), stratifies lesions into small, medium, and large categories, enabling size-aware supervision and allowing the network to learn features tailored to lesions of varying scales. The second strategy, Distance-Based Labeling (DBL), encodes lesion boundaries through signed distance transforms, providing continuous representations that emphasize shape and edge information. Both approaches operate directly on the ground-truth masks and can be seamlessly integrated into existing training pipelines without necessitating any modifications to the underlying segmentation architecture. When applied in conjunction with a standard U-Net [20], these “plug and play” label-level strategies yield substantial improvements in the detection and delineation of small lesions, while also maintaining or enhancing segmentation performance across all lesion sizes. Importantly, the methods are generalizable and can be readily incorporated into a wide range of neural network-based segmentation frameworks, facilitating broader applicability in medical image analysis.
We evaluate the effectiveness of the proposed label-level strategies across two different neuroimaging domains. The first is the Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset [29], which provides manually delineated ischemic stroke lesions, representing a challenging and clinically relevant scenario for lesion segmentation. The second domain is the Multiple Sclerosis Lesion Segmentation (MSLesSeg) dataset [30, 31, 32], the official benchmark for the International Conference on Pattern Recognition (ICPR) 2024 Competition on Multiple Sclerosis Lesion Segmentation [33]. This dataset is widely recognized for its complexity, as it includes a wide range of lesion sizes and appearances across different imaging protocols.
Our experimental results demonstrate that both Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL) generalize robustly across varying pathologies and imaging conditions. Notably, an ensemble of models incorporating these strategies achieved a Dice score of 0.7146 on the MSLesSeg test set, earning 1st place in the 2024 International Conference on Pattern Recognition (ICPR) Competition on Multiple Sclerosis Lesion Segmentation. These findings highlight the potential of our approaches to advance the state-of-the-art in small lesion segmentation and support their applicability in real-world clinical and research settings.
1.1. Contributions
The key contributions of this work are:
We propose two novel label-aware strategies, Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL), which incorporate lesion size and spatial cues. Both are plug and play and compatible with different segmentation networks.
We design ensemble and postprocessing techniques that improve segmentation accuracy under these strategies, especially for small lesions.
On ATLAS v2.0, our ensemble surpasses the challenge-winning baseline ensemble [34] in recall and F1, and matches or exceeds it in Dice, on both the full dataset and an evaluation-only small-lesion subset. Even a single MSL model outperforms that baseline ensemble on small lesions.
Our system ranked 1st in the 2024 ICPR MSLesSeg Challenge, setting a new benchmark in multiple sclerosis lesion segmentation.
Together, these contributions advance the state-of-the-art in small lesion segmentation and offer promising solutions for improving clinical and basic research applications of lesion segmentation in neuroimaging.
2. Approach
Most contemporary brain lesion segmentation methods [14, 15, 16, 17, 34] use binary masks to distinguish normal tissue from lesions. While effective for large lesions, this binary formulation often underperforms on small lesions due to severe class imbalance, limiting sensitivity and accuracy for clinically important but under-represented lesion types. To address this, we reformulate binary voxel-wise classification as a multi-class segmentation task. Lesion voxels are further categorized by size or boundary proximity, enabling more informative supervision that enhances the detection of subtle lesions. Notably, this strategy leaves backbone architectures such as U-Net [20] unchanged. Specifically, we introduce two lightweight label augmentation methods: Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL). MSL improves small-lesion detection by differentiating lesions by size, while DBL leverages signed distance transforms to emphasize lesion shape and edges. As shown in Table 1, both MSL and DBL add minimal overhead (< 0.4% latency), ensuring practicality for clinical use. An overview of our full pipeline and integration of these labeling strategies is shown in Fig. 1.
Table 1:
Computational efficiency comparison on the ATLAS v2.0 dataset. Results are shown for the baseline U-Net and our proposed variants (MSL and DBL). All measurements were performed on an RTX 3080 GPU with PyTorch 1.11. Latency is reported for a 128 × 128 × 128 image patch, matching the input size used during training and inference. Both FLOPs and parameter counts increase only marginally, and inference latency remains effectively unchanged, confirming that MSL and DBL introduce negligible computational overhead.
| Model | FLOPs (G) | Params (M) | Latency (s) |
|---|---|---|---|
| Baseline U-Net | 534.034 | 311.948 | 0.11212 |
| MSL (5 classes) | 534.302 (+0.05%) | 311.972 (+0.007%) | 0.11256 (+0.4%) |
| DBL (3 classes) | 534.123 (+0.02%) | 311.956 (+0.003%) | 0.11238 (+0.2%) |
Figure 1:

(a) Method Overview. In the conventional U-Net pipeline, the input MRI volume is processed end-to-end to yield a binary lesion mask. In our approach, we introduce non-binary labeling strategies of Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL) to enhance small lesion and boundary segmentation. Their outputs are ensembled, followed by postprocessing. (b) Ensemble Strategy. MSL/DBL outputs are linearly blended for small lesions; DBL is used exclusively for large ones. (c) Postprocessing. Small lesions with low maximum probability are excluded to reduce false positives.
2.1. Proposed Labeling Strategies
Multi-Size Labeling (MSL).
Segmenting small lesions is inherently challenging due to the severe size imbalance. For example, in the ATLAS v2.0 dataset [29], over 500 lesions are smaller than 100 mm³, yet their combined volume is still less than that of a single large lesion exceeding 100,000 mm³. This imbalance causes evaluation metrics like Dice to heavily favor large lesions. To address this, MSL stratifies lesion voxels into size-based categories using empirically defined volume thresholds. For instance, lesion voxels may be classified as:
$$
y_k =
\begin{cases}
1 \ (\text{small}), & V(\ell) < T_1 \\
2 \ (\text{medium}), & T_1 \le V(\ell) < T_2 \\
3 \ (\text{large}), & V(\ell) \ge T_2
\end{cases}
\tag{1}
$$

where $y_k$ is the label of lesion voxel $k$, $V(\ell)$ denotes the volume of the lesion $\ell$ containing it, and $T_1$, $T_2$ are empirically chosen thresholds. For ATLAS v2.0, we selected thresholds of 100, 1,000, and 10,000 mm³, following a logarithmic scale. This choice directly addresses the skewed lesion volume distribution: while a few very large lesions dominate voxel counts, the vast majority fall into the small-to-medium range. As shown in Fig. B.1 (a) of the Supplement, logarithmic partitioning yields a more balanced lesion-level distribution, thereby improving learning of underrepresented small lesions.

Distance-Based Labeling (DBL).
The discriminative power of interior voxels is often limited, whereas boundary voxels are crucial for localizing and characterizing lesion morphology. Motivated by this, DBL assigns each lesion voxel to either a boundary or interior region based on its shortest Euclidean distance to the nearest background voxel:

$$
y_k =
\begin{cases}
\text{boundary}, & d_k \le \delta \\
\text{interior}, & d_k > \delta
\end{cases}
\tag{2}
$$

where $d_k$ is the distance from lesion voxel $k$ to the nearest non-lesion voxel, and $\delta$ is a dataset-specific threshold. For ATLAS v2.0, voxels within 2 voxels of the lesion boundary were designated as “boundary,” with the remainder as “interior.” This threshold was chosen to approximate a clinically meaningful boundary thickness, enabling the model to focus on delineating lesion edges, which are critical for characterizing perilesional morphology. As shown in Fig. B.1 (b) of the Supplement, this setting produces a balanced division of boundary versus interior voxels. We also explored alternative configurations, including 5-class versus 3-class MSL (Fig. B.2) and the introduction of a “transition” class in DBL (Fig. B.3), but these variants proved less stable, which supports our adopted design.
In Section 4.3, we further present ablations of lesion-class and boundary-class splits for MSL and DBL, respectively. The results show that the adopted four-class MSL and two-class DBL consistently outperform more complex alternatives, supporting our rationale that these thresholds achieve an effective balance between accuracy, stability, and model simplicity.
From Multi-Class to Binary Prediction.
Although MSL and DBL adopt a multi-class formulation during training, the final segmentation output is typically required in binary form for downstream applications. To generate binary masks, we aggregate the softmax-normalized probabilities $p_k^{(c)}$ of each voxel $k$ across all foreground classes $c \ge 1$ (with $c = 0$ representing background), yielding the overall lesion probability $p_k = \sum_{c \ge 1} p_k^{(c)}$. A voxel is then classified as lesional if $p_k \ge \tau$, where $\tau$ is a predefined threshold (defaulting to 0.5).
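The aggregation step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function names are hypothetical, channel 0 is assumed to be background, and whether the comparison with the threshold is strict or non-strict is an assumption.

```python
import numpy as np

def lesion_probability(logits: np.ndarray) -> np.ndarray:
    """Sum softmax probabilities over all foreground classes.

    logits: array of shape (C, ...) with channel 0 assumed to be background.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    return probs[1:].sum(axis=0)  # equivalent to 1 - probs[0]

def to_binary_mask(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize the aggregated lesion probability (non-strict comparison assumed)."""
    return lesion_probability(logits) >= threshold

# Two voxels, three classes (background + two lesion classes).
logits = np.array([[0.0, 5.0],
                   [0.0, 0.0],
                   [0.0, 0.0]])
mask = to_binary_mask(logits)
```

Because the foreground probabilities are summed before thresholding, a voxel can be called lesional even when no single lesion class exceeds the background probability on its own.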
2.2. Ensemble Integration and Postprocessing
Ensemble Integration.
While MSL and DBL each address limitations of conventional lesion segmentation, they exhibit complementary strengths. MSL enhances small lesion detection via size-aware supervision but may underperform on large lesions that require broader context. DBL sharpens boundaries and captures morphology, but can struggle with very small lesions where interior and boundary voxels are hard to distinguish. To exploit their strengths, we propose a targeted ensemble strategy that dynamically integrates their predictions.
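For each voxel $k$, let $p_k^{\mathrm{MSL}}$ and $p_k^{\mathrm{DBL}}$ denote the foreground probabilities from the MSL and DBL models, respectively. At inference, for lesions identified as small (based on MSL thresholds), we blend predictions using a mixing coefficient $\alpha \in [0, 1]$, while DBL predictions are used unchanged elsewhere:

$$
p_k =
\begin{cases}
\alpha \, p_k^{\mathrm{MSL}} + (1 - \alpha) \, p_k^{\mathrm{DBL}}, & k \text{ in a small lesion} \\
p_k^{\mathrm{DBL}}, & \text{otherwise}
\end{cases}
\tag{3}
$$

The optimal value of $\alpha$ is selected via 5-fold cross-validation, as described in Section 4.3, to balance sensitivity and specificity. This adaptive fusion enhances robustness across lesion sizes and mitigates failure cases of individual methods.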
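As a rough sketch of this fusion rule, the snippet below blends two foreground probability maps inside a small-lesion mask and falls back to DBL elsewhere. The function name and the way the small-lesion mask is obtained are assumptions (the mask is simply passed in here); the default mixing rate of 0.8 is the value reported in Section 4.3.

```python
import numpy as np

def blend_msl_dbl(p_msl, p_dbl, small_mask, alpha=0.8):
    """Weight MSL by alpha inside small lesions; use DBL alone elsewhere.

    small_mask is assumed to mark voxels belonging to lesions flagged as small.
    """
    p_msl = np.asarray(p_msl, dtype=float)
    p_dbl = np.asarray(p_dbl, dtype=float)
    blended = alpha * p_msl + (1.0 - alpha) * p_dbl
    return np.where(small_mask, blended, p_dbl)

# First voxel lies in a small lesion, second does not.
fused = blend_msl_dbl([0.9, 0.2], [0.4, 0.6], [True, False])
```

A higher mixing rate shifts the decision for small lesions toward the size-aware model, which is the behaviour the ablation in Section 4.3 favours.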
Postprocessing (PP).
Despite improvements from MSL and DBL, small lesion predictions remain prone to false positives from low-confidence activations. To address this, we apply a lightweight postprocessing step to suppress noise while preserving true positives.
Let $R$ be a predicted small lesion component. For each voxel $k$ in $R$, we define a binary indicator:

$$
m_k =
\begin{cases}
1, & \max_{j \in R} p_j \ge \gamma \\
0, & \text{otherwise}
\end{cases}
\tag{4}
$$

where $p_j$ is the predicted probability and $\gamma$ is a confidence threshold selected via ablation. If the maximum probability in $R$ falls below $\gamma$, the entire region is suppressed as a false positive. This targeted filtering improves segmentation precision by retaining only high-confidence small lesion predictions. The selection of the optimal $\gamma$ is also carried out using 5-fold cross-validation, with details provided in Section 4.3.
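The filtering rule can be illustrated with connected-component analysis from SciPy. This is a sketch under stated assumptions, not the authors' code: predicted lesions are taken to be connected components of the binarized map (SciPy's default face connectivity), "small" is expressed as a voxel-count limit (`max_small_voxels`, a hypothetical parameter), and the default confidence threshold of 0.75 is the value reported in Section 4.3.

```python
import numpy as np
from scipy import ndimage

def suppress_low_confidence_small_lesions(prob, max_small_voxels,
                                          conf_threshold=0.75,
                                          bin_threshold=0.5):
    """Drop small predicted components whose peak probability is too low."""
    mask = prob >= bin_threshold
    components, n_components = ndimage.label(mask)  # default: face connectivity
    keep = mask.copy()
    for idx in range(1, n_components + 1):
        region = components == idx
        is_small = region.sum() < max_small_voxels
        if is_small and prob[region].max() < conf_threshold:
            keep[region] = False  # whole component removed as a false positive
    return keep

# Three components: small/low-confidence, small/high-confidence, large.
prob = np.array([0.6, 0.6, 0.0, 0.9, 0.7, 0.0, 0.55, 0.6, 0.6, 0.6])
cleaned = suppress_low_confidence_small_lesions(prob, max_small_voxels=3)
```

Only the first component is removed: the second is small but confident, and the third is not small, so the confidence test never applies to it.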
3. Experimental Settings
ATLAS v2.0 [29].
This dataset includes 655 T1-weighted MRI scans with manually annotated stroke lesions. Preprocessing involves intensity non-uniformity correction, intensity standardization, and linear registration to the MNI-152 space. All scans have an isotropic resolution of 1 mm³, with lesion volumes ranging from 13 to over 200,000 mm³. To support focused evaluation of small lesion segmentation, we extract a subset of 138 scans, each containing only lesions smaller than 1,000 mm³. This subset avoids ambiguity from lesion merging and enables precise assessment of small lesion detection.
For Multi-Size Labeling (MSL), lesion voxels are categorized by lesion volume:
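$$
y_k =
\begin{cases}
1, & V(\ell) < 100 \\
2, & 100 \le V(\ell) < 1{,}000 \\
3, & 1{,}000 \le V(\ell) < 10{,}000 \\
4, & V(\ell) \ge 10{,}000
\end{cases}
\tag{5}
$$

with $y_k$ the label of lesion voxel $k$ and $V(\ell)$ the volume (in mm³) of the lesion containing it.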
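A minimal sketch of how such size-stratified labels could be generated from a binary ground-truth mask is given below. It assumes that a lesion corresponds to a connected component of the mask (SciPy's default face connectivity), that class indices run from 1 (smallest) upward with 0 as background, and that lower bounds are inclusive; the function name and parameters are hypothetical.

```python
import numpy as np
from scipy import ndimage

def multi_size_labels(mask, thresholds=(100, 1000, 10000), voxel_volume_mm3=1.0):
    """Relabel each lesion voxel by the size class of its connected component."""
    components, n_components = ndimage.label(mask > 0)
    if n_components == 0:
        return np.zeros(mask.shape, dtype=np.uint8)
    # Voxel count per component (index 0 of bincount is background, so skip it).
    counts = np.bincount(components.ravel())[1:]
    volumes = counts * voxel_volume_mm3
    # digitize returns 0 below the first threshold, len(thresholds) at or above the last.
    size_class = np.digitize(volumes, thresholds) + 1
    lookup = np.concatenate(([0], size_class)).astype(np.uint8)
    return lookup[components]

# Toy example with tiny thresholds so the classes are visible on a 1 x 12 grid.
toy = np.zeros((1, 12), dtype=np.uint8)
toy[0, 0:2] = 1   # 2-voxel lesion
toy[0, 4:9] = 1   # 5-voxel lesion
toy_labels = multi_size_labels(toy, thresholds=(3, 10, 100))
```

With 1 mm³ isotropic voxels, as in ATLAS v2.0, the voxel count of a component equals its volume in mm³, so the default thresholds apply directly.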
For Distance-Based Labeling (DBL), lesion voxels are labeled by distance to the lesion boundary:
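$$
y_k =
\begin{cases}
\text{boundary}, & d_k \le 2 \\
\text{interior}, & d_k > 2
\end{cases}
\tag{6}
$$

with $d_k$ the distance (in voxels) from lesion voxel $k$ to the nearest non-lesion voxel.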
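The boundary/interior split can be sketched with a Euclidean distance transform. The class indices (1 for boundary, 2 for interior), the function name, and the inclusive comparison at the threshold are assumptions made for illustration only.

```python
import numpy as np
from scipy import ndimage

def distance_based_labels(mask, boundary_width=2.0):
    """Label lesion voxels as boundary (1) or interior (2); background stays 0."""
    lesion = mask > 0
    # For each lesion voxel: Euclidean distance to the nearest background voxel.
    dist = ndimage.distance_transform_edt(lesion)
    labels = np.zeros(mask.shape, dtype=np.uint8)
    labels[lesion & (dist <= boundary_width)] = 1  # boundary shell
    labels[lesion & (dist > boundary_width)] = 2   # interior core
    return labels

# A 7-voxel lesion on a 1 x 9 grid: two boundary voxels on each side, three interior.
toy = np.array([[0, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=np.uint8)
toy_labels = distance_based_labels(toy)
```

Lesions thinner than twice the boundary width end up with no interior voxels at all, which is consistent with the observation in Section 2.2 that DBL can struggle on very small lesions.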
MSLesSeg [30, 31, 32].
This dataset contains 93 T1-weighted MRI scans registered to the MNI-152 template, with 2,688 annotated lesions ranging from 1 to 67,904 mm³. Notably, 66% of lesions are smaller than 100 mm³, and 95% are under 1,000 mm³, making it ideal for small lesion segmentation evaluation.
For Multi-Size Labeling (MSL), lesion voxels are categorized by volume:
| (7) |
For Distance-Based Labeling (DBL), lesion voxels are labeled by proximity to the lesion boundary:
| (8) |
Implementation Details.
All models are implemented in PyTorch [35] using the nnU-Net framework [36] with U-Net [20] as the backbone. Training is performed for 1000 epochs with a batch size of 2, SGD (initial learning rate = 0.01, momentum = 0.99), and z-score input normalization. A polynomial learning-rate decay with exponent 0.9 is applied as the learning-rate scheduler. We employ 5-fold cross-validation with lesion-size balanced splits to ensure reliable comparisons. During training, each batch consists of two 128 × 128 × 128 patches: one randomly sampled to provide contextual diversity, and one centered on lesion voxels to guarantee coverage of target regions. For data augmentation, we follow the pipeline from our baseline method [34] and adopt nnU-Net’s “moreDA” pipeline with modifications. Specifically, we widen 3D rotations to ±30°, extend the scaling range to (0.7, 1.4), and disable elastic deformations. All other transformations follow the nnU-Net defaults, including spatial transforms, Gaussian noise, Gaussian blur, multiplicative brightness, contrast, simulated low resolution, gamma transforms (standard and inverted), and 3D mirroring. Deep-supervision scales are derived from the network pooling plan so that targets are consistently downsampled for multi-scale supervision. The detailed data augmentation settings are summarized in Table E.5 of the Supplement.
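For reference, the polynomial decay schedule with the stated hyperparameters can be written as a one-line function. This is a sketch of the commonly used form, lr = lr0 · (1 − epoch / max_epochs)^0.9; the function name is illustrative and the exact scheduler in the training code may differ in detail.

```python
def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Polynomial learning-rate decay from initial_lr down to 0 at max_epochs."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent

# Learning rate at the start, midpoint, and end of training.
schedule = [poly_lr(e) for e in (0, 500, 1000)]
```

With an exponent just below 1, the decay is close to linear, so the learning rate stays relatively high for most of training and drops to zero only at the final epoch.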
Training Schemes.
On the ATLAS v2.0 dataset, we train MSL and DBL models using U-Net [20] with Dice+Cross-Entropy (CE) loss, following [34]. To ensure fair comparison with ensemble-based baselines, which combine predictions from four separately trained models, MSL and DBL models are also trained using Dice+Focal [37] loss, where the focal term emphasizes underrepresented categories. For each strategy, we average softmax outputs from both loss variants before applying our MSL+DBL ensemble integration.
On the MSLesSeg dataset, we evaluate five configurations: U-Net with Dice+CE loss; U-Net with Dice+Focal loss [37]; Res U-Net to assess backbone adaptability; UNet-MSCSA [18, 19], a ViT [38, 39]-based model tailored for small lesions; and LightM-UNet [40], a Mamba [41]-based architecture. Softmax outputs from selected MSL and DBL variants are averaged and then integrated using our ensemble method.
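Averaging softmax outputs across separately trained models amounts to an element-wise mean over per-class probability maps. The helper below is a minimal sketch with a hypothetical name; it assumes all maps share the same class ordering and shape.

```python
import numpy as np

def average_softmax(prob_maps):
    """Element-wise mean of per-class probability maps from several models."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_maps], axis=0)
    return stacked.mean(axis=0)

# Two models, two classes (rows) and two voxels (columns).
model_a = [[0.8, 0.4], [0.2, 0.6]]
model_b = [[0.6, 0.2], [0.4, 0.8]]
averaged = average_softmax([model_a, model_b])
```

Since each input sums to one over the class axis, the average does too, so it can be passed directly to the binarization and ensemble steps of Section 2.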
Evaluation Metrics.
We adopt four evaluation metrics to assess segmentation performance: Dice score, F1 score, precision, and recall. Dice is computed in a voxel-wise manner, quantifying the spatial overlap between predicted and ground-truth masks across all voxels. While Dice remains the most widely used benchmark in medical image segmentation, it is inherently biased toward large lesions because their greater voxel counts dominate the calculation. To complement this, we also report lesion-wise metrics, F1, precision, and recall, which more directly capture whether individual lesions are detected. This combination balances segmentation performance, reflecting both overlap quality and lesion-level detection ability, particularly for small but clinically important lesions.
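The bias of voxel-wise Dice toward large lesions can be seen in a toy example. The sketch below computes Dice from binary masks and precision, recall, and F1 from lesion-level counts; the lesion counts are supplied by hand here, because the rule for matching predicted and ground-truth lesions is not specified in this section.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Voxel-wise Dice overlap between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

def detection_scores(tp, fp, fn):
    """Lesion-wise precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One 1000-voxel lesion segmented perfectly, one 10-voxel lesion missed entirely.
gt = np.zeros(2000, dtype=bool)
gt[:1000] = True
gt[1500:1510] = True
pred = np.zeros(2000, dtype=bool)
pred[:1000] = True

dice = dice_score(pred, gt)                              # about 0.995
precision, recall, f1 = detection_scores(tp=1, fp=0, fn=1)  # recall is 0.5
```

Dice barely registers the missed small lesion, whereas lesion-wise recall drops to one half, which is why the two families of metrics are reported together.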
4. Results
4.1. Results on ATLAS v2.0
To establish strong baselines, we replicate the top-performing model from the 2022 MICCAI ATLAS Challenge [34], a standard U-Net [20] trained with Dice+CE loss. For reference, we also report results from UNet-MSCSA [18], which integrates MSCSA modules to enhance small lesion segmentation. We then evaluate our Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL) strategies. Unlike architectural changes, our methods enhance supervision using size- and boundary-aware labels within standard segmentation frameworks. To focus on small lesion detection, we construct a test subset of 138 ATLAS v2.0 scans containing only lesions smaller than 1,000 mm³. All models are trained on the full dataset; this “small lesion” subset is used solely for evaluation to avoid ambiguity from size-unaware models merging small lesions into adjacent large ones.
As shown in Table 2, MSL under default training outperforms the baselines from [34], including their ensemble, with gains of +0.3% Dice, +1.0% F1, and +10.7% precision over that ensemble. It also outperforms the default UNet-MSCSA model in both Dice and F1 and exceeds the UNet-MSCSA ensemble in F1, suggesting that size-aware labeling can be more effective than dedicated architectural modules. The F1 improvement, measured lesion-wise, indicates that more small lesions are successfully detected rather than just better overlap with large lesions. DBL likewise improves over the default baseline, validating the benefit of boundary-aware supervision. Our final ensemble, averaging MSL and DBL predictions trained with Dice+CE and Dice+Focal loss, yields further gains: +7.2% precision, +3.6% recall, +2.4% F1, and +1.3% Dice over the baseline ensemble, and a +2.4% F1 gain over the UNet-MSCSA ensemble at a comparable Dice (0.456 vs. 0.458), reflecting superior lesion detection.
Table 2: Small Lesion Subset Performance.
Top four rows are baseline methods [34, 18]; bottom three rows are our proposed strategies. ‘−’ indicates unavailable metrics. Bold marks best scores. MSL outperforms the baselines from [34] in Dice, F1, and precision. DBL also improves over the default baseline. The MSL+DBL ensemble yields the best F1 and recall, with Dice close to the best.
| Setting | Scheme | Dice | F1 | Precision | Recall |
|---|---|---|---|---|---|
| Baseline [34] | Default | 0.417 | 0.500 | 0.509 | 0.628 |
| Baseline [34] | Ensemble | 0.443 | 0.574 | 0.627 | 0.618 |
| UNet-MSCSA [18] | Default | 0.436 | 0.527 | - | - |
| UNet-MSCSA [18] | Ensemble | 0.458 | 0.574 | - | - |
| MSL | Default | 0.446 | 0.584 | 0.734 | 0.612 |
| DBL | Default | 0.421 | 0.536 | 0.655 | 0.576 |
| MSL+DBL | Ensemble | 0.456 | 0.598 | 0.699 | 0.654 |
On the full ATLAS v2.0 dataset (Table 3), our methods maintain competitive overall performance. Both MSL and DBL provide consistent gains in lesion-wise metrics, particularly F1, precision, and recall. Although the Dice scores differ from the UNet-MSCSA ensemble by only −0.3%, the MSL+DBL ensemble achieves the best F1 while remaining comparable across other metrics. This reflects the intended advantage of our strategies: enhancing the detection of small and clinically meaningful lesions without sacrificing overall voxel-wise accuracy.
Table 3: Entire Dataset Performance.
Top four rows are baseline methods [34, 18]; bottom three rows are our proposed strategies. ‘−’ indicates unavailable metrics. Bold marks best scores. MSL and DBL improve F1; MSL+DBL ensemble yields best F1 with strong Dice.
| Setting | Scheme | Dice | F1 | Precision | Recall |
|---|---|---|---|---|---|
| Baseline [34] | Default | 0.635 | 0.549 | 0.686 | 0.561 |
| Baseline [34] | Ensemble | 0.645 | 0.575 | 0.747 | 0.553 |
| UNet-MSCSA [18] | Default | 0.636 | 0.551 | - | - |
| UNet-MSCSA [18] | Ensemble | 0.648 | 0.573 | - | - |
| MSL | Default | 0.632 | 0.566 | 0.727 | 0.559 |
| DBL | Default | 0.634 | 0.581 | 0.766 | 0.556 |
| MSL+DBL | Ensemble | 0.645 | 0.590 | 0.721 | 0.590 |
Fig. 2 illustrates these effects by comparing ground truth contours (green) with predictions from different models (red). The first two rows highlight the strengths of our methods: boundaries are delineated more finely, and additional small lesions are detected compared to the baselines, reinforcing the clinical relevance of the proposed strategies.
Figure 2: Segmentation results on ATLAS v2.0.

Green contours show ground truth; red contours are predictions. The top two rows show cases where MSL and DBL better capture small lesions and refine lesion boundaries compared to baseline methods, reducing false negatives and improving delineation. The middle row highlights subtle lesions that were missed by all methods, which may reflect ambiguity in the ground truth for atypical or faint cases. The bottom row illustrates an extremely small, low-contrast lesion where all methods failed, underscoring the intrinsic limitations of current approaches when lesions approach the resolution and signal-to-noise limits of the imaging data.
The figure also showcases two challenging conditions. In the second row, above the zoomed-in contours, several subtle lesions were missed by all methods. We hypothesize that these cases reflect ambiguity in the ground truth itself, potentially arising from low inter-rater reliability in annotating atypical or faint lesions. In the bottom row, all methods failed to capture an extremely small, low-contrast lesion (highlighted in the zoomed view). This illustrates a fundamental limitation of current segmentation approaches: when lesions approach the limits of spatial resolution and signal-to-noise ratio, reliable detection may be infeasible. These cases underscore that future progress may depend not only on methodological advances but also on higher-resolution imaging, multimodal data integration, or incorporation of clinical priors. Additional representative failure cases are provided in Section D.1 of the Supplement.
We also provide additional analyses in the Supplement, including distance-based metrics (HD95 and ASSD) in Section C.1, supplementary uncertainty measures, e.g., standard error values in Section C.2, comparisons with Focal loss models in Section C.3, and evaluations of different postprocessing settings in Section C.4. These analyses complement the main experiments and reinforce our conclusions, while keeping the core presentation in the manuscript focused and accessible.
4.2. Results on MSLesSeg
We further evaluate MSL and DBL on the MSLesSeg dataset, which contains many small, scattered lesions and poses additional segmentation challenges. All results are summarized in Table 4. As the competition uses Dice as the main metric, we adopt Dice for model selection. For each of MSL and DBL, we trained five models using different schemes: Default, Focal, Res U-Net, UNet-MSCSA, and LightM-UNet. We also evaluate two averaging strategies: one averaging the predictions from the first three schemes, and another averaging all five. This protocol allowed us to test MSL and DBL under a broad benchmarking context, combining them with diverse backbones and loss functions in line with the competition rules. The best-performing ensemble was then submitted for blind testing, where it achieved first place in the final leaderboard. This outcome not only validates the robustness of our labeling strategies but also demonstrates their compatibility with a wide range of methodological choices.
Table 4: MSLesSeg Performance.
Dice averaged across 5 folds. The MSL+DBL ensemble using Default, Focal, and Res U-Net achieves the best results, outperforming those with MSCSA and LightM-UNet.
| Labeling | Scheme | Dice |
|---|---|---|
| MSL | Default | 0.748 |
| MSL | Focal [37] | 0.744 |
| MSL | Res U-Net [34] | 0.751 |
| MSL | UNet-MSCSA [18] | 0.735 |
| MSL | LightM-UNet [40] | 0.729 |
| MSL | Average (Default, Focal, Res U-Net) | 0.754 |
| MSL | Average (All) | 0.753 |
| DBL | Default | 0.750 |
| DBL | Focal [37] | 0.750 |
| DBL | Res U-Net [34] | 0.749 |
| DBL | UNet-MSCSA [18] | 0.742 |
| DBL | LightM-UNet [40] | 0.740 |
| DBL | Average (Default, Focal, Res U-Net) | 0.759 |
| DBL | Average (All) | 0.759 |
| MSL+DBL | Ensemble (Default, Focal, Res U-Net) | 0.764 |
| MSL+DBL | Ensemble (All) | 0.763 |
Interestingly, MSCSA and LightM-UNet underperform compared to simpler U-Net-based setups. One possible explanation is that while attention [38] and selective state space models [41] excel at capturing long-range dependencies across image patches, they typically require larger datasets to realize their full potential. In contrast, averaging the Default, Focal, and Res U-Net configurations yields the highest Dice scores: 0.754 for MSL and 0.759 for DBL. Their ensemble (MSL+DBL) further improves Dice to 0.764, outperforming the five-scheme ensemble. These findings suggest that added architectural complexity does not necessarily lead to better generalization, particularly on small datasets dominated by small lesions.
Motivated by these insights, we selected the best-performing MSL+DBL ensemble (Default, Focal, Res U-Net) as our final submission to the MSLesSeg Competition. This configuration achieved first place with a Dice of 0.7146 on 22 hidden test cases, outperforming the second-best by +0.63%. Notably, these improvements were achieved despite MSLesSeg not being specifically tailored for small lesion analysis, demonstrating that our label-aware supervision improved the detection of clinically significant small lesions without compromising accuracy on larger ones. We also include representative visualizations of MSLesSeg samples in Section D.2 of the Supplement to illustrate these results qualitatively.
4.3. Ablation Studies
We conduct ablation experiments on the ATLAS v2.0 dataset to analyze the design choices and hyperparameters.
Number of Categories in MSL and DBL.
For MSL, we compare 3, 4, and 5 size-based classes. As shown in Table 5, using 4 categories provides the best overall performance, balancing size distinction and label complexity. For DBL, we test 1-, 2-, and 3-region splits. The 2-class setup (boundary and interior) achieves the best performance, highlighting the importance of edge emphasis without over-fragmentation. Definitions and lesion distributions for each configuration, along with additional results, are detailed in Section B of the Supplement. These findings support our chosen four-class MSL and two-class DBL thresholds described in Section 2.1, confirming that they strike an effective balance between accuracy and model simplicity.
Table 5: Ablation on Number of Label Categories.
Dividing binary masks into 4 classes in MSL yields optimal results. For DBL, a 2-class split between boundary and interior outperforms alternatives.
| #Classes | Small Lesion Subset: Dice | Small Lesion Subset: F1 | Entire Dataset: Dice | Entire Dataset: F1 |
|---|---|---|---|---|
| Multi-Size Labeling (MSL) | | | | |
| 3 | 0.430 | 0.545 | 0.628 | 0.571 |
| 4 | 0.446 | 0.584 | 0.632 | 0.566 |
| 5 | 0.353 | 0.463 | 0.605 | 0.522 |
| Distance-Based Labeling (DBL) | | | | |
| 1 | 0.415 | 0.544 | 0.632 | 0.577 |
| 2 | 0.421 | 0.536 | 0.634 | 0.581 |
| 3 | 0.406 | 0.525 | 0.624 | 0.574 |
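To make the class-count ablation concrete, the following is a minimal sketch of how a binary lesion mask could be converted into MSL-style and DBL-style label maps with a configurable number of classes. It is an illustration under stated assumptions, not the authors' implementation: lesions are taken to be face-connected components, size is measured in voxels, the boundary is defined by Euclidean distance to the nearest background voxel, and the names `multi_size_labels`, `distance_based_labels`, `size_thresholds`, and `boundary_width` as well as all cut-off values are hypothetical.

```python
import numpy as np
from scipy import ndimage


def multi_size_labels(mask, size_thresholds):
    """Relabel each connected lesion by its voxel count (MSL-style sketch).

    size_thresholds: ascending voxel-count cut-offs (hypothetical values).
    k thresholds give k + 1 lesion classes labeled 1..k+1; background is 0.
    """
    components, _ = ndimage.label(mask > 0)          # face-connected lesions
    sizes = np.bincount(components.ravel())[1:]      # voxels per lesion
    classes = np.digitize(sizes, size_thresholds) + 1
    lookup = np.concatenate(([0], classes)).astype(np.uint8)
    return lookup[components]


def distance_based_labels(mask, boundary_width=1.0):
    """Split lesion voxels into boundary (1) and interior (2) (DBL-style sketch).

    Assumption: a voxel is 'boundary' if its Euclidean distance to the
    nearest background voxel is at most boundary_width.
    """
    foreground = mask > 0
    dist = ndimage.distance_transform_edt(foreground)
    labels = np.zeros(mask.shape, dtype=np.uint8)
    labels[foreground & (dist <= boundary_width)] = 1
    labels[foreground & (dist > boundary_width)] = 2
    return labels


# Toy 2D mask: one 5x5 lesion and one single-voxel lesion.
mask = np.zeros((9, 9), dtype=np.uint8)
mask[1:6, 1:6] = 1
mask[7, 7] = 1

msl = multi_size_labels(mask, size_thresholds=[10])   # two size classes
dbl = distance_based_labels(mask, boundary_width=1.0)  # boundary vs interior
```

Changing the length of `size_thresholds` (or adding further distance bands) changes the number of classes, which is the quantity varied in Table 5. At inference, merging all non-zero classes recovers a binary lesion mask.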
Mixing Rate and Post-Processing Threshold in Ensemble.
Our ensemble strategy and post-processing (PP) step involve two hyperparameters: the mixing rate between MSL and DBL, and a probability threshold for filtering low-confidence voxels. A grid search is conducted under a 5-fold cross-validation protocol, with results shown in Fig. 3. The best overall performance is achieved with a mixing rate of 0.8 and a PP threshold of 0.75. Higher mixing rates (favoring MSL) improve Dice when the threshold exceeds 0.6, confirming MSL’s advantage in small lesion detection. While F1 peaks near a mixing rate of 0.5 on the full dataset, it continues to rise beyond that on the small lesion subset, supporting stronger weighting toward MSL in such cases.
Figure 3: Ablation of Ensemble Hyperparameters.

Performance under various mixing rates and probability thresholds. The best configuration (mixing = 0.8, threshold = 0.75) is consistent across evaluation datasets.
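The two ensemble hyperparameters can be sketched as follows. This is a simplified illustration, assuming the mixing rate is a convex weight on the MSL probability map (so higher values favor MSL, as described above) and that post-processing keeps only voxels whose mixed probability reaches the threshold; the actual post-processing step may differ. The function names and toy probabilities are hypothetical, and the search here runs on a single toy case rather than under 5-fold cross-validation.

```python
import numpy as np


def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def mix_and_threshold(p_msl, p_dbl, mixing_rate=0.8, pp_threshold=0.75):
    """Blend MSL and DBL lesion probabilities, then drop low-confidence voxels."""
    mixed = mixing_rate * p_msl + (1.0 - mixing_rate) * p_dbl
    return mixed >= pp_threshold


def grid_search(p_msl, p_dbl, target, rates, thresholds):
    """Return (best Dice, mixing rate, threshold) over a small grid."""
    best = (-1.0, None, None)
    for rate in rates:
        for thr in thresholds:
            score = dice_score(mix_and_threshold(p_msl, p_dbl, rate, thr), target)
            if score > best[0]:
                best = (score, rate, thr)
    return best


# Toy example with four voxels (values are illustrative only).
target = np.array([1, 1, 0, 0])
p_msl = np.array([0.9, 0.8, 0.3, 0.1])
p_dbl = np.array([0.9, 0.3, 0.8, 0.1])

best_dice, best_rate, best_thr = grid_search(
    p_msl, p_dbl, target, rates=[0.2, 0.5, 0.8], thresholds=[0.25, 0.5, 0.75]
)
```

In practice the score inside the loop would be averaged over validation cases and folds before selecting the best pair.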
5. Conclusion
In this study, we have presented two novel “plug and play” label design strategies, Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL), that can augment any deep learning segmentation architecture with enhanced supervision tailored to the unique challenges of small lesion segmentation. MSL discretizes lesion masks according to size, enabling the network to learn features relevant to lesions of varying scales, while DBL incorporates boundary-aware information by modeling the signed distance from lesion voxels to lesion borders. Both approaches are compatible with widely used architectures, such as U-Net, and can be seamlessly integrated into existing workflows without requiring major modifications.
Through extensive experiments on the ATLAS v2.0 stroke dataset, we demonstrate that both MSL and DBL significantly outperform conventional binary labeling schemes across key evaluation metrics. Notably, a single MSL model surpasses the strong ensemble baseline from the 2022 MICCAI ATLAS Challenge, achieving gains of 1.0% in lesion-wise F1 and 0.3% in Dice score on the small lesion subset. When combined as an ensemble, MSL and DBL deliver further performance improvements, with increases of 3.6% in recall, 2.4% in F1 score, and 1.3% in Dice score over the challenge-winning ensemble for small lesion detection. Similar or superior gains are observed for the overall dataset, particularly with respect to lesion-wise F1, highlighting the broad utility of these strategies.
To evaluate generalizability, we applied our methods to the MSLesSeg dataset, which features an even greater prevalence of small and diffusely distributed lesions. Here, the ensemble of MSL and DBL, trained across three configurations (Default, Focal, and Res U-Net), achieves a cross-validation Dice score of 0.764, outperforming more complex and recent segmentation architectures such as UNet-MSCSA and LightM-UNet. These findings suggest that careful label refinement and focused supervision can have a greater impact on small lesion segmentation than increasing architectural complexity alone.
Applying our methods to non-contrast T1w-MRI yields accurate and robust segmentation of chronic, mature porencephalic cavities that appear in the aftermath of demyelination. This is a clinically relevant task, for example, in quantifying long-term lesion burden in clinical trials or studying the chronic effects of disease. Additionally, the model weights learned from non-contrast T1w-MRI could be used in a transfer learning framework to further train models on T2-FLAIR and DWI/ADC MRI, which are used clinically in the acute phase. While our experiments benchmarked against the most relevant and competitive methods within the medical imaging community (U-Net, UNet-MSCSA, and LightM-UNet), we acknowledge that broader connections exist with state-of-the-art small-object detection approaches in computer vision. Such methods, including feature pyramid networks and transformer-based detectors, are designed to address challenges similar to those encountered in small lesion segmentation. Adapting these ideas to medical imaging represents a promising direction for future research, and we view our label-centric strategies as complementary to potential architectural advances in this area.
In summary, MSL and DBL offer simple yet powerful alternatives to architectural redesigns, substantially enhancing small lesion detection and delineation in stroke and multiple sclerosis. Beyond their immediate application to neuroimaging, these strategies provide a general and adaptable framework that can be extended to a wide range of medical image segmentation tasks where subtle or rare structures are of clinical interest.
Supplementary Material
6. Acknowledgement
We thank the organizers of the MSLesSeg challenge, which offered a valuable platform for advancing research in small brain lesion segmentation. We acknowledge funding support from NIH grants R01NS123378, P50HD105353, R01NS105646, R01NS11102, and R01NS117568.
Footnotes
We would like to note that non-contrast T1w-MRI is not the primary imaging for either detecting lesions in the acute phase of the stroke or for applying the McDonald criteria in MS.
References
- [1] Valverde S, et al., Improving automated multiple sclerosis lesion segmentation with a cascaded 3d convolutional neural network approach, NeuroImage 155 (2017) 159–168.
- [2] Filippi M, et al., Multiple sclerosis, Nature Reviews Disease Primers 4 (1) (2018) 43.
- [3] Barkovich AJ, et al., A developmental and genetic classification for malformations of cortical development: update 2012, Brain 135 (5) (2012) 1348–1369.
- [4] Blümcke I, et al., The clinicopathologic spectrum of focal cortical dysplasias: A consensus classification proposed by an ad hoc task force of the ilae diagnostic methods commission, Epilepsia 52 (1) (2011) 158–174.
- [5] Lansberg MG, et al., Evolution of apparent diffusion coefficient, diffusion-weighted, and t2-weighted signal intensity of acute stroke, American Journal of Neuroradiology 22 (4) (2001) 637–644.
- [6] Latchaw RE, et al., Recommendations for imaging of acute ischemic stroke, Stroke 40 (11) (2009) 3646–3678.
- [7] Thompson AJ, et al., Diagnosis of multiple sclerosis: 2017 revisions of the mcdonald criteria, The Lancet Neurology 17 (2) (2018) 162–173.
- [8] Sormani MP, Bruzzi P, Mri lesions as a surrogate for relapses in multiple sclerosis: a meta-analysis of randomised trials, The Lancet Neurology 12 (7) (2013) 669–676.
- [9] Dobson R, Giovannoni G, Multiple sclerosis – a review, European Journal of Neurology 26 (1) (2019) 27–40.
- [10] Powers WJ, et al., Guidelines for the early management of patients with acute ischemic stroke: 2019 update to the 2018 guidelines for the early management of acute ischemic stroke: A guideline for healthcare professionals from the american heart association/american stroke association, Stroke 50 (12) (2019) e344–e418.
- [11] Wardlaw JM, et al., Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration, The Lancet Neurology 12 (8) (2013) 822–838.
- [12] Debette S, et al., Clinical significance of magnetic resonance imaging markers of vascular brain injury: A systematic review and meta-analysis, JAMA Neurology 76 (1) (2019) 81–94.
- [13] Krsek P, et al., Incomplete resection of focal cortical dysplasia is the main predictor of poor postsurgical outcome, Neurology 72 (3) (2009) 217–223.
- [14] Kamnitsas K, et al., Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, Medical Image Analysis 36 (2017) 61–78.
- [15] Xu B, et al., Orchestral fully convolutional networks for small lesion segmentation in brain mri, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 889–892.
- [16] Liu C-F, et al., Deep learning-based detection and segmentation of diffusion abnormalities in acute ischemic stroke, Communications Medicine 1 (1) (2021) 61.
- [17] Wong A, et al., Small lesion segmentation in brain mris with subpixel embedding, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Cham, 2022, pp. 75–87.
- [18] Shang L, et al., Stroke lesion segmentation using multi-stage cross-scale attention (2025). arXiv:2501.15423.
- [19] Shang L, et al., Vision backbone enhancement via multi-stage cross-scale attention (2023). arXiv:2308.05872.
- [20] Ronneberger O, et al., U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
- [21] Liu Q, Liu M, Zhu Y, Liu L, Zhang Z, Wang Y, Daunet: A deformable aggregation unet for multiorgan 3d medical image segmentation, Pattern Recognition Letters 191 (2025) 58–65.
- [22] Lin D, Li Y, Nwe TL, Dong S, Oo ZM, Refineu-net: Improved u-net with progressive global feedbacks and residual attention guided local refinement for medical image segmentation, Pattern Recognition Letters 138 (2020) 267–275.
- [23] Trebing K, Stańczyk T, Mehrkanoon S, Smaatunet: Precipitation nowcasting using a small attention-unet architecture, Pattern Recognition Letters 145 (2021) 178–186.
- [24] Ji Z, Zou B, Kui X, Li H, Vera P, Ruan S, Generation of super-resolution for medical image via a self-prior guided mamba network with edge-aware constraint, Pattern Recognition Letters 187 (2025) 93–99.
- [25] Crum WR, et al., Generalized overlap measures for evaluation and validation in medical image analysis, IEEE Transactions on Medical Imaging 25 (11) (2006) 1451–1461.
- [26] Zhang Y, et al., Rethinking the dice loss for deep learning lesion segmentation in medical images, Journal of Shanghai Jiaotong University (Science) 26 (1) (2021) 93–102.
- [27] Milletari F, et al., V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571.
- [28] Zhou Z, et al., Unet++: Redesigning skip connections to exploit multiscale features in image segmentation, IEEE Transactions on Medical Imaging 39 (6) (2020) 1856–1867.
- [29] Liew S-L, et al., A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms, Scientific Data 9 (1) (2022) 320.
- [30] Guarnera F, et al., Mslesseg: baseline and benchmarking of a new multiple sclerosis lesion segmentation dataset, Scientific Data 12 (1) (2025) 920.
- [31] Rondinella A, et al., Boosting multiple sclerosis lesion segmentation through attention mechanism, Computers in Biology and Medicine 161 (2023) 107021.
- [32] Rondinella A, et al., Enhancing multiple sclerosis lesion segmentation in multimodal mri scans with diffusion models, in: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2023, pp. 3733–3740.
- [33] Rondinella A, et al., Icpr 2024 competition on multiple sclerosis lesion segmentation—methods and results, in: Pattern Recognition. Competitions, 2025, pp. 1–16.
- [34] Huo J, et al., Mapping: Model average with post-processing for stroke lesion segmentation (2022). arXiv:2211.15486.
- [35] Paszke A, et al., Pytorch: an imperative style, high-performance deep learning library, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 8026–8037.
- [36] Isensee F, et al., nnu-net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2) (2021) 203–211.
- [37] Lin T-Y, et al., Focal loss for dense object detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
- [38] Vaswani A, et al., Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
- [39] Dosovitskiy A, et al., An image is worth 16×16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations, 2021.
- [40] Liao W, et al., Lightm-unet: Mamba assists in lightweight unet for medical image segmentation (2024). arXiv:2403.05246.
- [41] Gu A, Dao T, Mamba: Linear-time sequence modeling with selective state spaces (2024). arXiv:2312.00752.