ABSTRACT
Background and Purpose
Assessment of brain lesions on magnetic resonance imaging (MRI) is crucial for research in multiple sclerosis (MS). Manual segmentation is time‐consuming and inconsistent. We aimed to develop an automated MS lesion segmentation algorithm for T2‐weighted fluid‐attenuated inversion recovery (FLAIR) MRI.
Methods
We developed FLAIR Lesion Analysis in Multiple Sclerosis (FLAMeS), a deep learning‐based MS lesion segmentation algorithm based on the nnU‐Net 3D full‐resolution U‐Net and trained on 668 FLAIR scans acquired at 1.5 and 3 tesla from persons with MS. FLAMeS was evaluated on three external datasets: MSSEG‐2 (n = 14), MSLesSeg (n = 51), and a clinical cohort (n = 10), and compared to SAMSEG, LST‐LPA, and LST‐AI. Performance was assessed qualitatively by two blinded experts and quantitatively by comparing automated and ground truth lesion masks using standard segmentation metrics.
Results
In a blinded qualitative review of 20 scans, both raters selected FLAMeS as the most accurate segmentation in 15 cases, with one rater favoring FLAMeS in two additional cases. Across all testing datasets, FLAMeS achieved a mean Dice score of 0.74, a true positive rate of 0.84, and an F1 score of 0.78, consistently outperforming the benchmark methods. For other metrics, including positive predictive value, relative volume difference, and false positive rate, FLAMeS performed similarly to or better than benchmark methods. Most lesions missed by FLAMeS were smaller than 10 mm3, whereas the benchmark methods missed larger lesions in addition to smaller ones.
Conclusions
FLAMeS is an accurate, robust method for MS lesion segmentation that outperforms other publicly available methods.
Keywords: lesion segmentation, machine learning, MRI, multiple sclerosis
1. Introduction
Multiple sclerosis (MS) is a chronic immune‐mediated condition of the central nervous system, characterized by inflammatory demyelination and neurodegeneration [1]. Inflammatory demyelination results in focal brain lesions that are hyperintense on T2‐weighted (T2w) magnetic resonance imaging (MRI) images. Clinically, these lesions are key for MS diagnosis [2], assessment of response to treatment [3], and predictions of future disability [4]. Assessment of lesion number and volume is also essential for nearly all MS imaging research. Reliable lesion segmentation provides the foundation for studying advanced imaging biomarkers, including central vein sign and paramagnetic rim lesions [5].
Manual segmentation of MS lesions is time‐consuming and susceptible to interrater variability, particularly in people with high lesion burden. Automated methods offer faster and more consistent results, making them a powerful tool for MS research. Among these, convolutional neural network‐based approaches have emerged as the state‐of‐the‐art technique [6, 7, 8, 9], consistently outperforming traditional machine learning methods and achieving the best performance in MS lesion segmentation challenges [10, 11, 12].
The vast majority of automated lesion segmentation methods require both a T2w FLAIR and a T1‐weighted (T1w) image as input. These algorithms are typically trained on high‐resolution, research‐grade scans (1 mm isotropic resolution, acquired at 3 tesla [T]). However, such imaging has only been widely feasible in the last decade, and even now, high‐resolution, 3D scans are not always acquired in routine clinical practice, which may limit the generalizability of some approaches to clinical and prior research data.
In this study, we aimed to develop an accurate and effective automated method for MS lesion segmentation from T2w FLAIR images alone that improves upon currently available algorithms. Our proposed algorithm, FLAIR Lesion Analysis in Multiple Sclerosis (FLAMeS), is a deep learning‐based segmentation method that is trained on a diverse dataset of 668 scans acquired with 1.5 and 3 T MRI scanners. We evaluate the performance of FLAMeS in comparison to two established and publicly available algorithms, SAMSEG and LST. As testing datasets, we consider both research‐ and clinical‐grade MRI scans acquired on several different MRI scanners from four different manufacturers at 1.5 and 3T.
2. Methods
2.1. Datasets
All datasets used in this study were stripped of protected health information. For all private datasets, participants provided informed consent as part of an Institutional Review Board‐approved research protocol.
2.2. Training Set
The training set was composed of 668 T2w FLAIR brain MRI scans from 575 people with MS (pwMS), imaged at seven sites (Table 1). The training set included images from three publicly available [12, 13, 14] and five private datasets, acquired at 1.5 and 3T. Ground truth lesion masks were either manually segmented by experts [12, 13, 14] (n = 97) or derived from other automated lesion segmentation tools [15, 16, 17] and subsequently refined through manual correction (n = 571). The median number of lesions per scan was 49 (IQR: 27–92; range: 0–426), and the median total lesion volume was 4574 mm3 (IQR: 1429–12,804 mm3; range: 0–88,930 mm3) (Figure 1).
TABLE 1.
Training datasets.
| Dataset | Number of scans a | Manufacturer | Model | Field strength (T) | In‐plane resolution (mm) | Slice thickness (mm) | 3D versus 2D |
|---|---|---|---|---|---|---|---|
| RADIEMS [16] | 241 (182) | Siemens | Skyra | 3 | 0.45×0.45 | 0.9 | 3D |
| NIH [17] | 200 (181) | Siemens | Skyra | 3 | 1×1 | 1 | 3D |
| LCL [15] | 64 | Siemens | Skyra | 3 | 1×1 | 1 | 3D |
| MSSEG‐1 Challenge [13] | |||||||
| Center 01 | 15 | Siemens | Verio | 3 | 0.5×0.5 | 1.1 | 3D |
| Center 03 | 8 | GE | Discovery | 3 | 0.47×0.47 | 0.9 | 3D |
| Center 07 | 15 | Siemens | Aera | 1.5 | 1.03×1.03 | 1.25 | 3D |
| Center 08 | 15 | Philips | Ingenia | 3 | 0.74×0.74 | 0.7 | 3D |
| MSCog | 66 | Philips | Achieva dStream | 3 | 0.89×0.89 | 1 | 3D |
| ISBI Challenge [12] | 20 (5) | Philips | Not specified | 3 | 0.82×0.82 | 2 | 2D |
| PubMRI [14] | 20 | Siemens | Magnetom Trio | 3 | 0.47×0.47 | 0.8 | 3D |
| NYU | 4 | Siemens | Prisma | 3 | 1×1 | 1 | 3D |
Note: MRI scanner and acquisition parameters for the training dataset.
a Number of scans (number of subjects, if different).
FIGURE 1.

Lesion volume distribution for training data.
This plot displays the distribution of lesion volumes for the 668 scans used to train the FLAMeS model.
We initially trained the FLAMeS model using 161 FLAIR scans. To enhance performance, we later expanded the training dataset by incorporating an additional 507 scans from three external datasets [15, 16, 17]. The lesion masks for 307 of these additional training scans were first generated by the initial FLAMeS model [18, 19] and then manually refined.
2.3. Testing Sets
2.3.1. MSSEG‐2 Challenge
Fourteen scans from the MSSEG‐2 challenge [10] test set were used for testing (Table 2; demographic data not available). The cases were selected from the larger MSSEG‐2 challenge set to include images acquired on various 3T MRI scanners from differing manufacturers and models. Lesions were manually segmented on 3D FLAIR images using ITK‐SNAP [20] by E.D., followed by review and adjustment if needed by E.S.B. All foci of hyperintensity in the brain that were at least 3 voxels in size on one slice and appeared on at least two consecutive slices were segmented in the ground truth masks. Lesions were characterized by sharply defined borders, allowing for a clear distinction between the lesion and diffusely abnormal white matter (DAWM). The median number of lesions per scan was 85 (IQR: 31–141; range 5–712), and the median total lesion volume was 9970 mm3 (IQR: 5873−16,013 mm3; range 494−55,348 mm3). All segmented lesions were included in the initial evaluation during testing, with no minimum lesion size threshold applied. A secondary evaluation was then conducted, excluding lesions smaller than 14 mm3 (approximately 3 mm in diameter).
TABLE 2.
MSSEG‐2 testing dataset.
| ID | Manufacturer | Model | In‐plane resolution (mm) | Slice thickness (mm) |
|---|---|---|---|---|
| 013 | Philips | Ingenia | 0.98×0.98 | 0.5 |
| 014 | Siemens | Prisma | 1×1 | 1 |
| 015 | Philips | Ingenia | 0.98×0.98 | 0.5 |
| 017 | Philips | Achieva | 1.04×1.04 | 0.56 |
| 018 | Philips | Achieva | 1.04×1.04 | 0.56 |
| 026 | Siemens | Skyra | 1.25×1.25 | 1.25 |
| 029 | Siemens | Verio | 0.5×0.5 | 1.1 |
| 035 | Siemens | Skyra | 0.49×0.49 | 0.65 |
| 054 | GE | Signa HDx | 0.49×0.49 | 1 |
| 071 | GE | Signa HDx | 0.49×0.49 | 1 |
| 075 | Philips | Ingenia | 0.98×0.98 | 0.5 |
| 081 | Siemens | Verio | 0.5×0.5 | 1.1 |
| 092 | Philips | Achieva dStream | 0.63×0.63 | 0.45 |
| 096 | Siemens | Prisma | 1×1 | 1 |
Note: MRI scanner and acquisition parameters for MSSEG‐2 testing dataset (14 subjects).
2.3.2. MSLesSeg Challenge
The MSLesSeg challenge was held at the International Conference on Pattern Recognition (ICPR) in 2024 [11, 21, 22]. It included a training dataset of images from 53 pwMS imaged clinically at 1.5 T at different hospital centers in Catania, Italy. The imaging protocols varied, but all included FLAIR and T1w images. Ground truth MS lesion annotations were performed using Jim 9 (Xinapse Systems Ltd, West Bergholt, UK, https://www.xinapse.com/j‐im‐9‐software/), a semi‐automated segmentation tool, refined manually by two raters, and reviewed by an experienced neurologist specializing in MS. Lesions that could not be distinctly identified on at least two consecutive slices were excluded from the ground truth lesion masks. The lesion masks were publicly released as part of the challenge dataset. The median number of lesions per scan was 23 (IQR: 14–38; range 3–106), and the median total lesion volume was 4679 mm3 (IQR: 2465−13,060 mm3; range 744−72,872 mm3). All segmented lesions were included in the initial evaluation during testing, with no minimum lesion size threshold applied. A secondary evaluation was then conducted, excluding lesions smaller than 14 mm3 (approximately 3 mm in diameter). The images were released publicly after the following preprocessing steps: deidentification, conversion to NIfTI format, linear registration to the standard MNI152 template space at 1 mm3 resolution, and skull‐stripping. Original scan parameters for this testing set, including original resolution, were not available. For our analysis, we excluded two cases from the MSLesSeg challenge because failed skull‐stripping led to errors in all automated segmentation methods. The demographics of the included pwMS are not known.
2.3.3. Clinical Dataset
To evaluate the performance of the segmentation methods on diverse clinical data, we selected 10 clinical‐grade FLAIR scans of pwMS (8 [80%] female; mean age 46 ± 12 years; mean time since diagnosis 9 ± 8 years; 9 relapsing‐remitting multiple sclerosis [RRMS], 1 secondary progressive multiple sclerosis [SPMS]) performed at Mount Sinai. These images varied in manufacturer, model, field strength, slice thickness, and dimensions to best represent the diversity in clinical MRI acquisition (Table 3). MS lesions were manually segmented using ITK‐SNAP [20] by E.D. and subsequently reviewed and adjusted as needed by E.S.B. All foci of hyperintensity in the brain that were at least 3 voxels in size on one slice and appeared on at least two consecutive slices were segmented in the ground truth masks. Lesions were characterized by sharply defined borders, allowing for a clear distinction between the lesion and DAWM. The median number of lesions per scan was 36 (IQR: 22–44; range 12–84), and the median total lesion volume was 4124.9 mm3 (IQR: 3498.8−9752.5 mm3; range 1587.4−30,680.1 mm3). All segmented lesions were included in the initial evaluation during testing, with no minimum lesion size threshold applied. A secondary evaluation was then conducted, excluding lesions smaller than 14 mm3 (approximately 3 mm in diameter).
TABLE 3.
Clinical testing dataset.
| ID | Manufacturer | Model | Software version | Field strength (T) | In‐plane resolution (mm) | Slice thickness a (mm) | Acquisition |
|---|---|---|---|---|---|---|---|
| 1 | GE | Signa HDxt | 23 LX | 1.5 | 0.45×0.45 | 0.9 (1.8) | 3D |
| 2 | GE | Discovery MR750 | 25 LX | 3 | 0.47×0.47 | 0.8 (1.6) | 3D |
| 3 | Siemens | Skyra | Syngo MR E11 | 3 | 0.49×0.49 | 1 | 3D |
| 4 | GE | Signa Excite | 11 LX | 1.5 | 0.43×0.43 | 6 | 2D‐Ax |
| 5 | Siemens | Skyra | Syngo MR E11 | 3 | 0.49×0.49 | 3 | 2D‐Ax |
| 6 | Siemens | Aera | Syngo MR E11 | 1.5 | 0.69×0.69 | 5 | 2D‐Ax |
| 7 | GE | Signa HDxt | 23 LX | 1.5 | 0.43×0.43 | 0.8 (1.6) | 3D |
| 8 | Siemens | Aera | Syngo MR E11 | 1.5 | 0.86×0.86 | 5.5 | 2D‐Ax |
| 9 | Siemens | Aera | Syngo MR E11 | 1.5 | 0.98×0.98 | 1 | 3D |
| 10 | Siemens | Verio | Syngo MR B19 | 3 | 0.69×0.69 | 5 | 2D‐Ax |
Note: MRI scanner and acquisition parameters for clinical testing dataset (10 subjects).
Abbreviation: T, Tesla.
a Reconstructed slice thickness (acquired slice thickness, if different).
2.4. FLAMeS
FLAMeS is based on the nnU‐Net 3D full‐resolution U‐Net architecture [23], optimized for segmenting MS lesions from FLAIR images. The model takes as input 3D volumetric data (FLAIR images), leveraging a six‐stage U‐Net architecture with progressively increasing feature channels: 32, 64, 128, 256, 320, and 320. Each stage consists of two convolutional layers with 3 × 3 × 3 kernels, followed by instance normalization and LeakyReLU activation. Downsampling is performed via strided convolutions, with strides of (2,2,2) at most levels, except for the first stage, which maintains full resolution, and the final stage, which has an asymmetric stride of (1,2,2) because of the anisotropic voxel spacing. The decoder follows a symmetric structure with transposed convolutions for upsampling. FLAMeS is trained for 8000 epochs with a batch size of 2, using stochastic gradient descent with momentum (0.99) and weight decay (3e−5). The standard nnU‐Net loss function is a combination of cross‐entropy and Dice loss.
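For illustration, the following minimal PyTorch sketch shows a compound cross‐entropy plus soft Dice objective of the kind used by default in nnU‐Net, reduced to a single binary lesion class. It is a simplified re‐implementation for readers, not the nnU‐Net source code, and omits deep supervision and batch‐level details.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Compound cross-entropy + soft Dice loss for binary lesion segmentation.

    logits: raw network output, shape (B, 1, X, Y, Z)
    target: binary ground truth mask, same shape
    """
    # Binary cross-entropy computed on raw logits (numerically stable form).
    ce = F.binary_cross_entropy_with_logits(logits, target.float())

    # Soft Dice on the sigmoid probabilities.
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    denominator = probs.sum() + target.sum()
    soft_dice = (2.0 * intersection + eps) / (denominator + eps)

    # Minimize CE plus the Dice complement, weighted equally.
    return ce + (1.0 - soft_dice)
```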
We compared the standard loss function with a loss function that combines Tversky loss and Focal loss, balancing sensitivity and specificity while addressing the class imbalance typical in lesion segmentation tasks. The Tversky loss parameters were α = 0.55 and β = 0.45, adjusting the trade‐off between false positives and false negatives, where slightly higher α values penalize false positives more heavily. The Focal loss used γ = 1.5 to focus learning on harder‐to‐classify lesions, with an additional weighting factor α = 0.20 to control the contribution of positive samples. To improve generalizability, nnU‐Net's built‐in deep supervision strategy is employed, enforcing loss constraints at multiple feature levels. The model ensemble combines predictions from five independently trained networks, further enhancing robustness across diverse MRI datasets.
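The sketch below illustrates how such a combined Tversky–Focal objective can be implemented with the parameters reported above (Tversky α = 0.55, β = 0.45; focal γ = 1.5 with positive‐class weight 0.20). It is a schematic re‐implementation under these assumptions, not the exact training code.

```python
import torch

def tversky_focal_loss(logits, target, alpha=0.55, beta=0.45,
                       gamma=1.5, focal_alpha=0.20, eps=1e-5):
    """Combined Tversky + Focal loss (illustrative sketch).

    alpha/beta weight false positives/false negatives in the Tversky index;
    gamma focuses the focal term on hard voxels; focal_alpha weights the
    positive class.
    """
    probs = torch.sigmoid(logits)
    target = target.float()

    # Tversky index: TP / (TP + alpha*FP + beta*FN); loss is its complement.
    tp = (probs * target).sum()
    fp = (probs * (1.0 - target)).sum()
    fn = ((1.0 - probs) * target).sum()
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)

    # Focal term: down-weights easy voxels via the (1 - p_t)**gamma factor.
    p_t = probs * target + (1.0 - probs) * (1.0 - target)
    alpha_t = focal_alpha * target + (1.0 - focal_alpha) * (1.0 - target)
    focal = (-alpha_t * (1.0 - p_t).clamp(min=eps) ** gamma
             * torch.log(p_t.clamp(min=eps))).mean()

    return (1.0 - tversky) + focal
```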
The only preprocessing step needed for FLAMeS is skull‐stripping, which we performed with the SynthSeg [24] tool from FreeSurfer [25], selecting the --no-csf flag, which excludes the cerebrospinal fluid from the skull‐stripped image. FLAMeS then follows the nnU‐Net standardized processing pipeline to ensure robustness across different datasets and scanner manufacturers. All input FLAIR images are resampled to a uniform voxel spacing of 0.9 × 0.9 × 1.0 mm, using third‐order spline interpolation for intensity data and first‐order interpolation for segmentation masks. Z‐score intensity normalization is applied within the brain mask to standardize signal distributions across scans. To minimize unnecessary background and optimize GPU memory usage, images are cropped to a bounding box around the brain. During training, nnU‐Net's standard data augmentation strategies are used to reduce the risk of overfitting. These include random rotations and scaling, elastic deformation, random axis flipping, additive Gaussian noise, Gaussian blur, contrast adjustment, gamma intensity augmentation, bias field simulation, and foreground oversampling. All transformations are applied on‐the‐fly for each batch of training images. Following inference, nnU‐Net automatically resamples the predicted segmentation masks back to the native resolution of the original FLAIR images using nearest‐neighbor interpolation.
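As a rough illustration of these preprocessing steps, the following Python sketch approximates the resampling, cropping, and normalization described above using nibabel and SciPy. The exact nnU‐Net implementation differs in its internals and in the order of operations; this is a minimal sketch assuming a skull‐stripped FLAIR NIfTI in which non‐brain voxels are zero.

```python
import nibabel as nib
import numpy as np
from scipy import ndimage

TARGET_SPACING = (0.9, 0.9, 1.0)  # mm, the uniform spacing used by FLAMeS

def preprocess_flair(path: str) -> np.ndarray:
    """Approximate the nnU-Net-style preprocessing described in the text."""
    img = nib.load(path)
    data = img.get_fdata(dtype=np.float32)
    spacing = img.header.get_zooms()[:3]

    # Resample to 0.9 x 0.9 x 1.0 mm with third-order spline interpolation.
    zoom = [s / t for s, t in zip(spacing, TARGET_SPACING)]
    data = ndimage.zoom(data, zoom, order=3)

    # Crop to the bounding box of the brain to save GPU memory.
    coords = np.argwhere(data != 0)
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    data = data[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

    # Z-score intensity normalization within the brain mask.
    mask = data != 0
    data[mask] = (data[mask] - data[mask].mean()) / (data[mask].std() + 1e-8)
    return data
```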
We developed a web app based on the Hugging Face Spaces platform [26] (https://huggingface.co/spaces/FrancescoLR/FLAMeS) to publicly deploy FLAMeS, enabling users to upload skull‐stripped FLAIR images and download automated lesion segmentations. Built using Python and the Gradio library, the web app features an intuitive user interface, real‐time processing status updates, and downloadable output files, including the binary lesion mask, a mask with individual labels for each lesion, and a visual overlay of the segmentation on the input image. This deployment ensures accessibility for researchers and clinicians, facilitating seamless integration of FLAMeS into various neuroimaging research workflows without requiring local computational resources. Inference through the web app takes approximately 1 min per input image. For accelerated performance and large‐scale processing, we recommend running FLAMeS in a GPU‐enabled environment.
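For programmatic access, a Gradio Space like this one can typically be queried with the gradio_client library. The endpoint name, parameter, and return value below are assumptions for illustration only; the Space's "Use via API" page documents the actual signature.

```python
from gradio_client import Client, handle_file

# Connect to the public FLAMeS Space. The api_name and argument layout are
# hypothetical here; consult the Space's API documentation before use.
client = Client("FrancescoLR/FLAMeS")
result = client.predict(handle_file("flair_skullstripped.nii.gz"),
                        api_name="/predict")
print(result)  # expected: path(s) to the downloadable lesion mask(s)
```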
2.5. Evaluation
2.5.1. Benchmark Methods
LST: The Lesion Segmentation Toolbox (LST) [27] provides three distinct methods for MS lesion segmentation: LST‐LGA (Lesion Growth Algorithm), LST‐LPA (Lesion Prediction Algorithm), and LST‐AI (Artificial Intelligence‐based segmentation) [7]. Briefly, LST‐LPA is a binary regression model that integrates the lesion belief map with predefined parameters. These parameters were initially derived using logistic regression during the tool's development to compute the lesion probability map. The final binary lesion mask is then obtained by applying a threshold of 0.5 to this probability map. The most recently described LST‐AI is a deep learning‐based extension of LST that consists of an ensemble of three 3D U‐Net architectures. The networks were trained on 491 pairs of T1w and FLAIR images, collected in‐house from a 3T MRI scanner, and their respective manual lesion segmentations. A combination of binary cross‐entropy and Tversky loss was used to increase sensitivity to heterogeneous MS lesions. Following the authors’ recommendations, we ran LST‐LPA on the datasets where only a FLAIR image was available, while LST‐AI was used when both FLAIR and T1w scans were available.
SAMSEG: The Sequence Adaptive Multimodal SEGmentation (SAMSEG) algorithm [28] is part of FreeSurfer [25] and can operate with a single MRI contrast but also supports multiple modalities. The algorithm was trained on manual segmentations from 212 pwMS. SAMSEG employs a deformable probabilistic atlas as a segmentation prior, which is iteratively adapted to the input data. Each voxel is assigned to the most probable brain structure, including lesions. The final binary lesion mask is generated by setting voxels classified as lesion to one and all remaining voxels to zero. In this study, FLAIR images were always provided as input, and when available, T1w images were included to enhance segmentation accuracy.
2.5.2. Blinded Qualitative Evaluation
We selected 20 FLAIR scans, including the 10 clinical testing set cases and 10 randomly selected from the MSLesSeg test set. Each scan was paired with lesion masks generated by FLAMeS, LST‐AI, and SAMSEG. Two independent raters, blinded to the method used, evaluated all three segmentations side by side for each scan. Segmentations were rated on a 4‐point scale. A score of 1 reflected high accuracy with little to no need for adjustment, 2 indicated minor issues such as a few small missed lesions, 3 denoted multiple missed lesions or boundary inaccuracies, and 4 represented substantial errors, including large missed or poorly segmented lesions. Each rater also indicated the segmentation that they deemed most accurate for each case.
2.5.3. Metrics
To quantitatively evaluate the segmentation tools, we computed segmentation metrics using an in‐house Python script. Lesions were defined as clusters of connected voxels using 3D 26‐connectivity. For voxel‐wise metrics, we considered the Dice similarity coefficient (DSC), the positive predictive value (PPV), and the relative volume difference (RVD). They are defined as follows:

$$\mathrm{DSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{RVD} = \frac{\left| V_{\mathrm{pred}} - V_{\mathrm{GT}} \right|}{V_{\mathrm{GT}}}$$

where TP (true positive) is the number of correctly predicted lesion voxels, FP (false positive) is the number of nonlesion voxels incorrectly classified as lesion, FN (false negative) is the number of lesion voxels missed by the automated tool, $V_{\mathrm{pred}}$ is the total lesion volume in the predicted segmentation, and $V_{\mathrm{GT}}$ is the total lesion volume in the ground truth segmentation.
We also calculated a normalized DSC (nDSC) [29], which adjusts the standard Dice coefficient to account for class imbalance by penalizing false positives more heavily in proportion to their overrepresentation. It is defined as:

$$\mathrm{nDSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \kappa\,\mathrm{FP} + \mathrm{FN}}$$

where TP, FP, and FN are defined as above and $\kappa$ is a scaling factor applied to the false positives. The penalty term is calculated as $\kappa = \bar{h}/h$, where $h$ is the predicted class imbalance (the ratio of negative to positive voxels in the predicted mask), and $\bar{h}$ is the reference class balance estimated from a large dataset.
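A minimal Python sketch of this computation, under the definitions above, is shown below. The reference positive‐class rate r̄ = 0.001 is an illustrative assumption; in [29] it is estimated from a large dataset.

```python
import numpy as np

def ndsc(pred: np.ndarray, gt: np.ndarray, r_bar: float = 0.001) -> float:
    """Normalized Dice (nDSC) sketch following the definitions above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)

    # kappa rescales false positives by the ratio of the reference class
    # imbalance to the predicted class imbalance.
    r = pred.mean()                          # positive-class rate of prediction
    h = (1.0 - r) / max(r, 1e-12)            # predicted negative:positive ratio
    h_bar = (1.0 - r_bar) / r_bar            # reference negative:positive ratio
    kappa = h_bar / h

    return 2.0 * tp / (2.0 * tp + kappa * fp + fn)
```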
We also assessed lesion‐wise segmentation performance using the Lesion True Positive Rate (LTPR), Lesion False Positive Rate (LFPR), and the Lesion‐wise F1 Score (F1). These metrics evaluate the model's ability to accurately detect individual lesions, as opposed to voxel‐level segmentation accuracy. The definitions of these metrics are as follows:

$$\mathrm{LTPR} = \frac{\mathrm{TP_L}}{\mathrm{TP_L} + \mathrm{FN_L}}, \qquad \mathrm{LFPR} = \frac{\mathrm{FP_L}}{\mathrm{TP_L} + \mathrm{FP_L}}, \qquad \mathrm{F1} = \frac{2\,\mathrm{TP_L}}{2\,\mathrm{TP_L} + \mathrm{FP_L} + \mathrm{FN_L}}$$

where TPL (true positive lesions) is the number of correctly detected lesions, FNL (false negative lesions) is the number of true lesions missed by the tool, and FPL (false positive lesions) is the number of incorrectly predicted lesions.
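The following sketch illustrates how these lesion‐wise metrics can be computed with 26‐connectivity labeling, as in our evaluation script. The any‐voxel overlap criterion used to count a lesion as detected is one common choice and may differ in detail from our in‐house implementation.

```python
import numpy as np
from scipy import ndimage

# A full 3x3x3 structuring element gives 26-connectivity in 3D.
CONN26 = np.ones((3, 3, 3), dtype=bool)

def lesion_wise_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Sketch of the lesion-wise LTPR/LFPR/F1 computation described above."""
    gt_labels, n_gt = ndimage.label(gt > 0, structure=CONN26)
    pred_labels, n_pred = ndimage.label(pred > 0, structure=CONN26)

    # True positive lesions: ground truth components touched by the prediction.
    tpl = sum(1 for i in range(1, n_gt + 1) if np.any(pred[gt_labels == i] > 0))
    fnl = n_gt - tpl
    # False positive lesions: predicted components with no ground truth overlap.
    fpl = sum(1 for j in range(1, n_pred + 1) if not np.any(gt[pred_labels == j] > 0))

    ltpr = tpl / max(n_gt, 1)
    lfpr = fpl / max(n_pred, 1)
    f1 = 2 * tpl / max(2 * tpl + fpl + fnl, 1)
    return {"LTPR": ltpr, "LFPR": lfpr, "F1": f1}
```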
2.5.4. Statistical Analysis
Given the relatively small sample size, the presence of outliers, and strong evidence of non‐normality, all pairwise comparisons were performed using the Wilcoxon signed‐rank test with median values reported. We accounted for multiple comparisons using the Benjamini−Hochberg correction for false discovery rate. All significant results remained statistically significant after adjustment. The data were analyzed using IBM SPSS Statistics (Version 30.0, Armonk, NY: IBM Corp).
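As an illustration, this analysis can be reproduced in Python with SciPy and statsmodels; the score arrays below are placeholders, not study data.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Paired per-subject scores for two methods (illustrative values only).
flames_dsc = np.array([0.74, 0.71, 0.80, 0.69, 0.77])
samseg_dsc = np.array([0.55, 0.50, 0.62, 0.47, 0.58])

# Wilcoxon signed-rank test on the paired differences.
stat, p = wilcoxon(flames_dsc, samseg_dsc)

# Collect p-values from all pairwise comparisons, then control the false
# discovery rate with the Benjamini-Hochberg procedure.
p_values = [p]  # ...append the remaining metric/method comparisons here
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(p_adjusted)
```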
3. Results
3.1. Qualitative Evaluation
Lesion segmentation performance was evaluated on three testing datasets through both qualitative and quantitative comparisons with ground truth masks. Visual inspection showed that FLAMeS‐generated masks more closely matched the ground truth than those from benchmark methods, with more precise lesion boundaries and fewer visible false positives or missed lesions (Figure 2). In a blinded qualitative assessment in which two raters evaluated FLAMeS, SAMSEG, and LST‐AI masks side by side, both raters chose FLAMeS as the most accurate segmentation in 15 out of 20 cases. In two additional cases, one rater chose FLAMeS, while the other preferred a benchmark method, and in the remaining three cases, both raters preferred a benchmark method. When evaluating the amount of manual correction needed (rated on a scale from 1 to 4, where 1 indicates minimal adjustment and 4 indicates extensive adjustment), rater 1 assigned FLAMeS an average score of 1.7, compared to 3.1 for SAMSEG and 2.5 for LST‐AI. Rater 2 gave FLAMeS an average score of 2.3, versus 3.4 for SAMSEG and 3.1 for LST‐AI.
FIGURE 2.

FLAMeS generates qualitatively superior lesion masks. The top row shows an axial FLAIR slice and the ground truth lesion mask. The bottom row displays the lesion masks from each method overlaid on the FLAIR image. Multiple lesions are missed (yellow arrows), undersegmented (blue arrows), or falsely segmented (orange arrow) by LST‐AI and SAMSEG.
FLAMeS exhibits a slight tendency toward oversegmentation, particularly in regions with intermediate signal intensity between focal lesions and normal‐appearing white matter, often with ill‐defined borders. This behavior is also observed in benchmark methods (Figure 3A). Additionally, our review of the MSLesSeg masks revealed numerous small lesions that our raters would have annotated but were missing from the ground truth masks and were thus classified as false positives by our metrics script (Figure 3B).
FIGURE 3.

Assessment of false positive voxels and apparently false positive lesions identified by automated methods. (A) Oversegmentation of diffusely abnormal white matter surrounding lesions by FLAMeS and SAMSEG. (B) Example of a periventricular hyperintensity segmented by all three automated methods but not present on the MSLesSeg ground truth mask (yellow arrow). On manual review, this area was judged to be a missed lesion on the original ground truth masks.
3.2. Quantitative Evaluation
3.2.1. MSSEG‐2
In the MSSEG‐2 testing set, FLAMeS outperforms benchmark methods across all voxel‐wise and lesion‐wise metrics, except for LFPR, which was similar to that of SAMSEG but better than LST‐LPA (Table 4). FLAMeS correctly identifies 70% of reference lesions—significantly more than LST‐LPA (37%, p<0.001) and SAMSEG (13%, p<0.001)—while maintaining a low false positive rate of just 5%. Additionally, FLAMeS achieves an F1 score of 0.81, substantially higher than LST‐LPA (0.37, p<0.001) and SAMSEG (0.13, p<0.001), demonstrating a superior balance of precision and recall. FLAMeS also has a significantly lower RVD than the benchmark methods, with an 18% discrepancy between the automated and manual lesion masks. In comparison, RVD was 55% for SAMSEG (p = 0.003) and 44% for LST‐LPA (p = 0.03).
TABLE 4.
Performance comparison of segmentation methods on the MSSEG‐2 testing dataset.
| voxel‐wise | lesion‐wise | ||||||
|---|---|---|---|---|---|---|---|
| Tool | DSC | nDSC | PPV | RVD | LTPR | LFPR | F1 |
| FLAMeS | 0.72***††† (0.72)§ | 0.75**†† (0.76) | 0.96***† (0.96) | 0.18*†† (0.20) | 0.70***††† (0.73) | 0.05*** (0.05) | 0.81***††† (0.83) |
| LST‐LPA | 0.51 (0.51) | 0.68 (0.58) | 0.50 (0.79) | 0.48 (0.49) | 0.37 (0.34) | 0.35 (0.26) | 0.38 (0.47) |
| SAMSEG | 0.47 (0.49) | 0.66 (0.47) | 0.93 (1.0) | 0.55 (0.52) | 0.13 (0.21) | 0.04 (0.00) | 0.23 (0.35) |
Note: FLAMeS had the best performance on all metrics except lesion‐wise false positive rate. Median values across 14 subjects are reported for each metric. Pairwise comparisons were performed using the Wilcoxon signed‐rank test.
Abbreviations: DSC, Dice similarity coefficient; LFPR, lesion false positive rate; LTPR, lesion true positive rate; nDSC, normalized Dice similarity coefficient; PPV, positive predictive value; RVD, relative volume difference.
Asterisks indicate comparisons between FLAMeS and LST * p<0.05; ** p<0.01; *** p<0.001. Daggers indicate comparisons between FLAMeS and SAMSEG † p<0.05; †† p<0.01; ††† p<0.001. § Numbers in parentheses for all metrics indicate scores after all lesions under 14 mm3 have been excluded from ground truth mask and automated mask.
3.2.2. MSLesSeg
In the MSLesSeg testing dataset, FLAMeS significantly outperformed LST‐AI across all voxel‐wise and lesion‐wise metrics (p<0.05 or less for all comparisons) (Table 5 and Figure 4). FLAMeS also outperformed SAMSEG on most metrics. While no significant differences were observed between FLAMeS and SAMSEG for PPV (p = 0.05) or RVD (p = 0.49), SAMSEG had a lower LFPR than FLAMeS (0.21 vs. 0.26, p = 0.007). Notably, FLAMeS achieved a significantly higher LTPR (0.87) than LST‐AI (0.71, p<0.001) and SAMSEG (0.43, p<0.001). It also demonstrated a significantly higher F1 score (0.77) than both LST‐AI (0.57, p<0.001) and SAMSEG (0.53, p<0.001) (Table 5).
TABLE 5.
Performance comparison of segmentation methods on the MSLesSeg testing dataset.
| voxel‐wise | lesion‐wise | ||||||
|---|---|---|---|---|---|---|---|
| Tool | DSC | nDSC | PPV | RVD | LTPR | LFPR | F1 |
| FLAMeS | 0.74***††† (0.74)§ | 0.70***††† (0.70) | 0.72*** (0.81) | 0.17* (0.18) | 0.87***††† (0.81) | 0.26***†† (0.18) | 0.77***††† (0.83) |
| LST‐AI | 0.43 (0.43) | 0.52 (0.41) | 0.45 (0.63) | 0.37 (0.37) | 0.71 (0.62) | 0.44 (0.32) | 0.57 (0.64) |
| SAMSEG | 0.55 (0.58) | 0.59 (0.56) | 0.75 (0.82) | 0.31 (0.32) | 0.43 (0.42) | 0.21 (0.20) | 0.53 (0.55) |
Note: FLAMeS had the best performance on all metrics except positive predictive value and lesion‐wise false positive rate. Median values across 51 subjects are reported for each metric. Pairwise comparisons were performed using the Wilcoxon signed‐rank test.
Abbreviations: DSC, Dice similarity coefficient; LFPR, lesion false positive rate; LTPR, lesion true positive rate; nDSC, normalized Dice similarity coefficient; PPV, positive predictive value; RVD, relative volume difference.
Asterisks indicate comparisons between FLAMeS and LST * p<0.05; *** p<0.001. Daggers indicate comparisons between FLAMeS and SAMSEG †† p<0.01; ††† p<0.001. § Numbers in parentheses for all metrics indicate scores after all lesions under 14 mm3 have been excluded from ground truth mask and automated mask.
FIGURE 4.

FLAMeS outperforms SAMSEG and LST‐AI in DSC and LTPR. Box plots display the median (center line), interquartile range (IQR, box edges), and whiskers extending up to 1.5 × IQR. Abbreviations: DSC, Dice similarity coefficient; LTPR, lesion true positive rate.
3.2.3. Clinical
In the clinical testing set, the proposed method outperforms the benchmark models across all metrics except for nDSC, where LST‐AI matches FLAMeS with a score of 0.75 (Table 6 and Figure 4). FLAMeS achieves a higher DSC of 0.68 compared to 0.63 for LST‐AI (p = 0.009) and 0.51 for SAMSEG (p = 0.005). FLAMeS’ LFPR is 0.03, lower than LST‐AI (0.24, p = 0.02) but similar to SAMSEG (0.11, p = 0.89). Additionally, FLAMeS attains an F1 score of 0.78, compared to 0.74 for LST‐AI (p = 0.009) and 0.50 for SAMSEG (p = 0.007). We compared the performance of each model on 2D versus 3D FLAIR scans in the clinical dataset and found no substantial differences for DSC, LTPR, LFPR, or RVD, suggesting comparable robustness to both acquisition types for all methods (Table 7).
TABLE 6.
Performance comparison of segmentation methods on the clinical testing dataset.
| voxel‐wise | lesion‐wise | ||||||
|---|---|---|---|---|---|---|---|
| Tool | DSC | nDSC | PPV | RVD | LTPR | LFPR | F1 |
| FLAMeS | 0.68**†† (0.68)§ | 0.75 (0.70) | 0.97* (0.90) | 0.24 (0.25) | 0.75†† (0.78) | 0.03* (0.10) | 0.78**†† (0.80) |
| LST‐AI | 0.63 (0.63) | 0.75 (0.64) | 0.75 (0.82) | 0.25 (0.25) | 0.73 (0.71) | 0.24 (0.18) | 0.74 (0.78) |
| SAMSEG | 0.51 (0.52) | 0.71 (0.53) | 0.88 (1.0) | 0.48 (0.48) | 0.35 (0.41) | 0.11 (0.00) | 0.50 (0.54) |
Note: FLAMeS had equivalent or superior performance on all metrics. Median values across 10 subjects are reported for each metric. Pairwise comparisons were performed using the Wilcoxon signed‐rank test.
Abbreviations: DSC, Dice similarity coefficient; LFPR, lesion false positive rate; LTPR, lesion true positive rate; nDSC, normalized Dice similarity coefficient; PPV, positive predictive value; RVD, relative volume difference.
Asterisks indicate comparisons between FLAMeS and LST * p<0.05; ** p<0.01. Daggers indicate comparisons between FLAMeS and SAMSEG †† p<0.01. § Numbers in parentheses for all metrics indicate scores after all lesions under 14 mm3 have been excluded from ground truth mask and automated mask.
TABLE 7.
Performance comparison between 2D and 3D scans in the clinical dataset across all models.
| 2D (five subjects) | 3D (five subjects) | |
|---|---|---|
| DSC | ||
| FLAMeS | 0.70 | 0.66 |
| LST‐AI | 0.66 | 0.57 |
| SAMSEG | 0.59 | 0.53 |
| LTPR | ||
| FLAMeS | 0.77 | 0.73 |
| LST‐AI | 0.80 | 0.64 |
| SAMSEG | 0.31 | 0.31 |
| LFPR | ||
| FLAMeS | 0.09 | 0.19 |
| LST‐AI | 0.21 | 0.26 |
| SAMSEG | 0.31 | 0.13 |
| RVD | ||
| FLAMeS | 0.25 | 0.25 |
| LST‐AI | 0.29 | 0.29 |
| SAMSEG | 0.36 | 0.42 |
Note: Average values across five subjects are provided for each metric.
Abbreviations: LFPR, lesion false positive rate; LTPR, lesion true positive rate; RVD, relative volume difference.
3.2.4. Cumulative Assessment
We also assessed the performance of FLAMeS versus SAMSEG when all three datasets were combined. LST methods were excluded from this analysis as LST‐AI could not be applied to the MSSEG‐2 dataset, which does not include T1w images. Across all testing sets, FLAMeS significantly outperformed SAMSEG, with superior DSC (0.74 vs. 0.53, p<0.001), LTPR (0.84 vs. 0.37, p<0.001), and F1 scores (0.78 vs. 0.47, p<0.001), as well as higher nDSC and lower RVD (Table 8 and Figure 4). FLAMeS and SAMSEG had similar PPV (0.77 vs. 0.80, p = 0.46) and LFPR (0.21 vs. 0.15, p = 0.06).
TABLE 8.
Performance comparison for FLAMeS and SAMSEG across all three testing datasets.
| voxel‐wise | lesion‐wise | ||||||
|---|---|---|---|---|---|---|---|
| Tool | DSC | nDSC | PPV | RVD | LTPR | LFPR | F1 |
| FLAMeS | 0.74††† (0.73)§ | 0.73††† (0.72) | 0.77 (0.88) | 0.18† (0.19) | 0.84††† (0.81) | 0.21 (0.13) | 0.78††† (0.83) |
| SAMSEG | 0.53 (0.55) | 0.63 (0.54) | 0.80 (0.89) | 0.38 (0.43) | 0.37 (0.38) | 0.15 (0.10) | 0.47 (0.52) |
Note: FLAMeS had the best performance on all metrics except positive predictive value and lesion‐wise false positive rate. Median values are reported for every metric after pooling all three datasets (75 total subjects). Pairwise comparisons were performed using the Wilcoxon signed‐rank test.
Abbreviations: DSC, Dice similarity coefficient; LFPR, lesion false positive rate; LTPR, lesion true positive rate; nDSC, normalized Dice similarity coefficient; PPV, positive predictive value; RVD, relative volume difference.
Daggers indicate comparisons between FLAMeS and SAMSEG † p<0.05; ††† p<0.001. § Numbers in parentheses for all metrics indicate scores after all lesions under 14 mm3 have been excluded from ground truth mask and automated mask.
We next determined how lesion size is related to lesion detection for each method. Figure 5A depicts the lesion volume distribution across all three testing datasets. FLAMeS detects a higher proportion of smaller lesions, identifying 35% of lesions in the 3–10 mm3 range compared to just 6% detected by SAMSEG. For lesions over 10 mm3, FLAMeS detects 81% compared to only 25% for SAMSEG. Since LST‐AI could not be run on MSSEG‐2 due to a lack of T1w images in the challenge dataset, we evaluated FLAMeS against LST‐AI using the remaining two datasets (Figure 5B), where FLAMeS detected 43% of lesions from 3 to 10 mm3 and 85% of lesions over 10 mm3, compared to 38% and 74% for LST‐AI, respectively. We also found that the median volume of missed lesions was smaller for FLAMeS than for SAMSEG in every dataset, as well as for LST‐AI in the MSLesSeg dataset but not the clinical dataset.
FIGURE 5.

Lesion detection rates by lesion size for each segmentation method. (A) Shows the distribution of lesion volumes and detection rates by lesion size across all three testing datasets, comparing FLAMeS to SAMSEG. (B) Focuses specifically on the MSLesSeg and clinical testing datasets, showing lesion volume distribution and detection rates for these subsets, comparing FLAMeS to LST‐AI.
3.3. Loss Function Comparison
To explore whether segmentation performance could be improved in regions with ill‐defined lesion borders, we compared the standard nnUNet loss function with a balanced combination of Tversky and Focal losses (FLAMeS_TF). This modification was motivated by FLAMeS’ tendency toward slight oversegmentation in areas of intermediate signal intensity between lesions and normal‐appearing white matter, particularly when lesion boundaries were not clearly distinguishable.
Qualitative assessment revealed minimal overall differences in segmentation quality between the two models. However, FLAMeS_TF generally produced lesion masks with slightly more conservative borders. In most cases, the lesion margins appeared reduced by approximately one voxel compared to the standard FLAMeS model. Quantitative evaluation of segmentation performance showed that the two models performed similarly across metrics. The median LTPR was 0.84 for FLAMeS and 0.81 for FLAMeS_TF (p<0.001), the LFPR was 0.21 and 0.17, respectively (p<0.01), and the RVD was 0.18 and 0.17 (p = 0.042). While these differences are statistically significant, the overall performance remained largely consistent between models.
3.4. Model Availability
We developed a web‐based version of FLAMeS, hosted on Hugging Face [26]. The app features a user‐friendly interface, real‐time processing updates, and downloadable output files, including multiplanar snapshots of the input image and lesion mask, as well as the binary lesion mask in NIfTI format. Users also have the option to apply SynthSeg for automatic skull stripping of the input image. In addition, the app provides an output mask with a unique label assigned to each voxel cluster to facilitate clear differentiation between lesions. Inference takes approximately 1 min per MRI scan, though the app currently processes only one scan at a time due to the usage limits of Hugging Face's ZeroGPU hardware.
4. Discussion
In this work, we propose a deep learning‐based method for lesion segmentation in MS, termed FLAMeS. The model is built on a 3D nnU‐Net architecture and requires only a skull‐stripped FLAIR brain MRI as input. Trained on 668 FLAIR brain MRI scans from 575 individuals across seven scanning sites, the model consistently demonstrated accurate lesion segmentation across diverse acquisition parameters. To evaluate its robustness, we validated FLAMeS on three external datasets and compared its performance against two benchmark methods, SAMSEG and LST, finding that FLAMeS outperforms them across multiple metrics. These results establish FLAMeS as a reliable and widely applicable tool for automated lesion segmentation in MS. A key strength of FLAMeS is its ability to perform consistently across diverse datasets without retraining. The only preprocessing step needed is skull‐stripping, which can be efficiently performed using the SynthSeg tool from FreeSurfer.
FLAMeS delivers highly accurate lesion segmentation, effectively identifying nearly all lesions and capturing their boundaries with precision. In a blinded qualitative evaluation, two experienced raters independently selected FLAMeS segmentations as the most accurate in 15 out of 20 cases. One rater also selected FLAMeS as the top choice in two additional scans. These results highlight the accuracy, consistency, and practical utility of FLAMeS. The resulting segmentations require minimal manual adjustment and are well‐suited for lesion‐based analyses, especially in large datasets where substantial manual editing is often impractical.
FLAMeS consistently achieves a high DSC across all three test sets, significantly outperforming LST‐AI and SAMSEG and demonstrating strong overall segmentation accuracy across diverse cases. When evaluating cumulative scores across all testing sets, FLAMeS significantly outperforms SAMSEG on every metric except for PPV and LFPR. Although FLAMeS has a marginally higher LFPR than SAMSEG, this is driven by SAMSEG's poor lesion detection, as reflected in its very low LTPR and F1 score. In contrast, FLAMeS provides segmentations with substantially higher sensitivity and overall accuracy, as evidenced by its superior LTPR and F1 score. The slight increase in false positives may be a reasonable trade‐off for robust lesion detection.
When evaluating lesion segmentation models, the F1 score is a critical metric as it reflects the balance between two key clinical priorities: accurate lesion detection (recall) and reliable prediction (precision). FLAMeS demonstrates a strong balance between these factors, achieving consistently high F1 scores and significantly outperforming LST‐AI and SAMSEG across all testing sets. In addition to its high performance, FLAMeS exhibits a more favorable distribution of scores, with a tighter range across subjects and metrics (Figure 4). These findings highlight the robustness of FLAMeS at varying field strengths and on clinical and research scans.
To further evaluate model performance in clinically realistic settings, we compared results on 2D versus 3D FLAIR scans in the clinical testing set. Despite differences in isotropic versus anisotropic acquisition, all evaluated methods, including FLAMeS, demonstrated comparable performance across scan types with no substantial difference in sensitivity, precision, or overall segmentation accuracy (Table 7). Although the sample size is small, these results suggest that FLAMeS, along with the other evaluated methods, is well‐suited for use across a range of imaging protocols, including 2D anisotropic scans that are still commonly acquired in routine clinical practice.
Detecting very small MS lesions, particularly those under 10 mm3, remains a challenge for automated models due to their subtle appearance and the influence of the partial volume effect [30, 31]. While all methods struggle with the smallest lesions, FLAMeS demonstrates greater sensitivity than LST‐AI and SAMSEG (Figure 5). The lesions missed by FLAMeS tend to be smaller than those missed by the benchmark methods, indicating that FLAMeS reliably detects most larger lesions and primarily struggles with only the smallest ones. In contrast, LST‐AI and SAMSEG often miss both small and large lesions, suggesting a higher threshold for consistent detection. FLAMeS’ improved sensitivity to smaller lesions enhances segmentation accuracy, reducing the need for manual corrections and ensuring a more comprehensive assessment of lesion burden.
However, it is important to note that performance on small lesions may vary depending on scan resolution and orientation. Many clinical datasets, including the one used in this study, include 2D anisotropic FLAIR images, which can introduce variability in lesion appearance by obscuring small or thin lesions depending on their alignment with the imaging plane. While FLAMeS is able to detect most lesions identified by human raters on a 2D FLAIR image, these images are intrinsically limited in their sensitivity for small lesions. Additionally, although FLAIR is highly effective for detecting supratentorial white matter lesions, it has known limitations in regions such as the posterior fossa, where artifacts and lower contrast‐to‐noise ratios can obscure lesion visibility. As FLAMeS relies solely on FLAIR imaging, infratentorial lesions may be underrepresented.
To further evaluate the impact of small lesion detection on model performance, we repeated our analysis after applying a 14 mm3 lesion volume threshold. This value, approximately equivalent to a 3 mm diameter lesion, is commonly used in clinical trials and MS imaging studies [32, 33]. The threshold was applied uniformly across all ground truth and automated lesion masks, excluding clusters below that size. Across most datasets, this led to a reduction in LFPR and an increase in PPV for all methods, particularly in the MSLesSeg dataset, where small lesions were less consistently labeled in the ground truth masks (Table 5). Notably, overall performance rankings remained mostly consistent, and the differences between FLAMeS and comparator methods were almost always preserved (Tables 4, 5, 6, and 8). This analysis suggests that FLAMeS maintains superior performance even when small lesions are excluded.
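A minimal sketch of this thresholding step, assuming a known voxel volume in mm3, is shown below; it drops connected components smaller than the 14 mm3 cutoff from any binary mask.

```python
import numpy as np
from scipy import ndimage

# A full 3x3x3 structuring element gives 26-connectivity in 3D.
CONN26 = np.ones((3, 3, 3), dtype=bool)

def remove_small_lesions(mask, voxel_volume_mm3, min_volume_mm3=14.0):
    """Drop lesions below ~3 mm diameter (14 mm^3); applied identically
    to ground truth and automated masks (illustrative sketch)."""
    labels, n = ndimage.label(mask > 0, structure=CONN26)
    out = np.zeros_like(mask, dtype=np.uint8)
    for i in range(1, n + 1):
        component = labels == i
        if component.sum() * voxel_volume_mm3 >= min_volume_mm3:
            out[component] = 1
    return out
```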
Regarding the MSLesSeg testing set, it is important to note that this dataset differs from the other testing sets in its lesion annotation criteria. In the description of the challenge, the organizers state that lesions that could not be distinctly identified on at least two consecutive slices were excluded from the ground truth lesion masks [11]. On qualitative review, we noticed that many small lesions that our raters would have segmented were not included in the MSLesSeg ground truth masks, likely because of the greater slice thickness of many of these scans (Figure 3B). This stricter annotation criterion likely accounts for the comparatively higher LFPR observed across all three methods for the MSLesSeg dataset (Table 5).
Our qualitative and quantitative assessments demonstrate that FLAMeS provides more accurate and reliable MS lesion segmentation compared to benchmark models. This prompts the question of whether FLAMeS’ superior performance stems primarily from its underlying architecture or the diversity of its training data. It is likely that both factors play a significant role. FLAMeS is built on the nnU‐Net framework—a self‐configuring architecture that automatically adapts to dataset‐specific properties such as image size, voxel spacing, and class imbalance, eliminating the need for manual tuning [23]. Its preprocessing pipeline includes automated resampling, intensity normalization, and class balancing, making it particularly well‐suited for heterogeneous datasets. In contrast, LST‐AI employs an ensemble of 3D U‐Nets with fixed hyperparameters and standardized preprocessing, while SAMSEG is based on a generative probabilistic framework. The difference in training data is also a key factor. FLAMeS was trained on a larger and more diverse set of 668 MS scans, compared to 491 for LST‐AI and 212 for SAMSEG. Notably, LST‐AI was trained exclusively on data acquired on a single 3T Achieva scanner [7], and SAMSEG was similarly trained on data from a single site [28]. This limited diversity likely hinders the generalizability of the benchmark models to scans acquired with varied protocols, contributing to their reduced performance relative to FLAMeS.
One limitation of FLAMeS is its slight tendency toward oversegmentation, both at the edges of lesions identified on the ground truth lesion masks as well as in areas of FLAIR signal that are intermediate in intensity between focal lesions and normal‐appearing white matter, sometimes termed diffusely abnormal white matter (DAWM) (Figure 3A). Similar discrepancies have been reported in other automated methods, which also tend to segment DAWM, and these regions are often difficult to classify even for human raters [34]. Oversegmentation of DAWM likely stems from the inherent limitations of a model trained solely on FLAIR images to predict lesion segmentation patterns. Notably, the benchmark methods exhibit a similar tendency toward oversegmentation (Figure 3A).
To address this issue, we trained a modified version of FLAMeS, FLAMeS_TF, incorporating a loss function that combines Tversky and Focal losses to mitigate oversegmentation. The Tversky loss, unlike the Dice loss used in the standard nnU‐Net loss function, allows explicit penalization of false positives. Combining it with Focal loss further emphasizes hard‐to‐classify voxels, possibly reducing the model's tendency toward oversegmentation. Quantitative comparison between the two models reveals modest but significant differences. FLAMeS_TF achieved a lower LFPR and RVD, indicating improved precision and reduced lesion overestimation. However, this came at the cost of a slightly reduced LTPR, reflecting a marginal decrease in sensitivity to true positive lesions. These results suggest a small tradeoff between sensitivity and specificity, with the Tversky and Focal loss yielding slightly fewer false positives at the cost of slightly reduced sensitivity. Despite these differences, the overall F1 scores remained similar, and visual inspection confirmed that the segmentations produced by both models were nearly indistinguishable. We conclude that neither model substantially outperforms the other in practical terms. Future work may benefit more from expanding the training data to include additional scans with DAWM, which could help improve the model's ability to discern between lesions and areas of diffuse hyperintensity.
While FLAMeS was benchmarked against several existing segmentation methods, we acknowledge that direct comparisons with other deep learning‐based models are limited. In particular, we were unable to directly compare LST‐AI and FLAMeS on all of the same datasets because many public challenge datasets do not include T1w imaging, a requirement for LST‐AI. This reflects a broader challenge in the field: the lack of publicly available deep learning tools and standardized datasets that support comprehensive and consistent benchmarking. Addressing this gap was a central motivation for our work. With FLAMeS, we aim to provide a robust, publicly available deep learning‐based model for MS lesion segmentation that can be readily adopted by the research community.
To conclude, we introduce an automated MS lesion segmentation tool for FLAIR brain MRI. The model requires no preprocessing beyond skull‐stripping and runs efficiently, with the inference taking around 1 min per scan. Our validation on three external datasets demonstrates FLAMeS’ robustness and strong performance across both research and clinical scans with diverse acquisition parameters. By offering improved accuracy and resilience to variations in image quality, resolution, and acquisition protocols, FLAMeS represents a valuable tool for MS research applications. We have made FLAMeS publicly available via a web app on Hugging Face (https://huggingface.co/spaces/FrancescoLR/FLAMeS). In addition, we have publicly released the model's weights and code [35] for use and to allow others to fine‐tune or further improve the method.
Conflicts of Interest
Dr. Reich has received research funding from Abata and Sanofi, unrelated to this work. The other authors declare no competing financial interests.
Disclaimer
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.
Dereskewicz E., La Rosa F., Dos Santos Silva J., et al. “A Novel Convolutional Neural Network for Automated Multiple Sclerosis Brain Lesion Segmentation.” Journal of Neuroimaging 35, no. 5 (2025): e70085. 10.1111/jon.70085
Emma Dereskewicz and Francesco La Rosa contributed equally to this work.
Funding: F.L.R. received support through a Swiss National Science Foundation (SNSF) Postdoc Mobility Fellowship (P500PB_206833), Schmidt Sciences, and the Office of the Assistant Secretary of Defense for Health Affairs through the Multiple Sclerosis Research Program (HT9425‐24‐1‐0857). O.A.‐L. was supported by the National Multiple Sclerosis Society (FAN‐1807‐32163) and the Office of the Assistant Secretary of Defense for Health Affairs through the Multiple Sclerosis Research Program (HT9425‐23‐1‐0571). M.W. is supported by TRAIL and the Walloon Region. J.A. acknowledges support from the NIH/NINDS (grant R01 NS131948) and NIH/NIBIB (grant P41 EB017183). Support for data collection was provided by the Intramural Research Program of the NINDS/NIH to D.S.R. (1Z1ANS003119), a grant from Bristol Myers Squibb to A.J.S., and NIH R01NS136523 to J.S. This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463.
Prior Dissemination: This work has been previously published as a preprint [18] and as an abstract at the 2025 American Committee for Treatment and Research in Multiple Sclerosis (ACTRIMS) Forum [19].
Contributor Information
Francesco La Rosa, Email: francesco.larosa@mssm.edu.
Erin S. Beck, Email: erin.beck@mssm.edu.
References
- 1. Reich D. S., Lucchinetti C. F., and Calabresi P. A., “Multiple Sclerosis,” New England Journal of Medicine 378, no. 2 (2018): 169–180, 10.1056/NEJMra1401483.
- 2. Wattjes M. P., Ciccarelli O., Reich D. S., et al., “2021 MAGNIMS–CMSC–NAIMS Consensus Recommendations on the Use of MRI in Patients With Multiple Sclerosis,” Lancet Neurology 20, no. 8 (2021): 653–670, 10.1016/S1474-4422(21)00095-8.
- 3. Calabresi P. A., Kieseier B. C., Arnold D. L., et al., “Pegylated Interferon Beta‐1a for Relapsing‐Remitting Multiple Sclerosis (ADVANCE): A Randomised, Phase 3, Double‐Blind Study,” Lancet Neurology 13, no. 7 (2014): 657–665, 10.1016/S1474-4422(14)70068-7.
- 4. Rovira À., Wattjes M. P., Tintoré M., et al., “MAGNIMS Consensus Guidelines on the Use of MRI in Multiple Sclerosis—Clinical Implementation in the Diagnostic Process,” Nature Reviews Neurology 11, no. 8 (2015): 471–482, 10.1038/nrneurol.2015.106.
- 5. La Rosa F., Wynen M., and Al‐Louzi O., “Cortical Lesions, Central Vein Sign, and Paramagnetic Rim Lesions in Multiple Sclerosis: Emerging Machine Learning Techniques and Future Avenues,” NeuroImage Clinical 36 (2022): 103205, 10.1016/j.nicl.2022.103205.
- 6. La Rosa F., Abdulkadir A., Fartaria M. J., et al., “Multiple Sclerosis Cortical and WM Lesion Segmentation at 3T MRI: A Deep Learning Method Based on FLAIR and MP2RAGE,” NeuroImage Clinical 27 (2020): 102335, 10.1016/j.nicl.2020.102335.
- 7. Wiltgen T., McGinnis J., Schlaeger S., et al., “LST‐AI: A Deep Learning Ensemble for Accurate MS Lesion Segmentation,” NeuroImage Clinical 42 (2024): 103611, 10.1016/j.nicl.2024.103611.
- 8. Aslani S., Dayan M., Storelli L., et al., “Multi‐Branch Convolutional Neural Network for Multiple Sclerosis Lesion Segmentation,” Neuroimage 196 (2019): 1–15, 10.1016/j.neuroimage.2019.03.068.
- 9. Hashemi S. R., Salehi S. S. M., Erdogmus D., Prabhu S. P., Warfield S. K., and Gholipour A., “Asymmetric Loss Functions and Deep Densely Connected Networks for Highly Imbalanced Medical Image Segmentation: Application to Multiple Sclerosis Lesion Detection,” IEEE Access 7 (2019): 1721–1735, 10.1109/ACCESS.2018.2886371.
- 10. Commowick O., Cervenansky F., Cotton F., and Dojat M., MSSEG‐2 Challenge Proceedings: Multiple Sclerosis New Lesions Segmentation Challenge Using a Data Management and Processing Infrastructure (HAL‐Inria, 2021).
- 11. Rondinella A., Guarnera F., Crispino E., et al., “ICPR 2024 Competition on Multiple Sclerosis Lesion Segmentation—Methods and Results,” in Pattern Recognition. Competitions, ed. Antonacopoulos A., Chaudhuri S., Chellappa R., Liu C. L., Bhattacharya S., and Pal U. (Springer Nature Switzerland, 2025), 1–16, 10.1007/978-3-031-80139-6_1.
- 12. Carass A., Roy S., Jog A., et al., “Longitudinal Multiple Sclerosis Lesion Segmentation: Resource & Challenge,” Neuroimage 148 (2017): 77–102, 10.1016/j.neuroimage.2016.12.064.
- 13. Commowick O., Kain M., Casey R., et al., “Multiple Sclerosis Lesions Segmentation From Multiple Experts: The MICCAI 2016 Challenge Dataset,” Neuroimage 244 (2021): 118589, 10.1016/j.neuroimage.2021.118589.
- 14. Lesjak Ž., Galimzianova A., Koren A., et al., “A Novel Public MR Image Dataset of Multiple Sclerosis Patients With Lesion Segmentations Based on Multi‐Rater Consensus,” Neuroinformatics 16, no. 1 (2018): 51–63, 10.1007/s12021-017-9348-7.
- 15. Beck E. S., Maranzano J., Luciano N. J., et al., “Cortical Lesion Hotspots and Association of Subpial Lesions With Disability in Multiple Sclerosis,” Multiple Sclerosis Journal 28, no. 9 (2022): 1351–1363, 10.1177/13524585211069167.
- 16. Brandstadter R., Katz Sand I., and Sumowski J. F., “Beyond Rehabilitation: A Prevention Model of Reserve and Brain Maintenance in Multiple Sclerosis,” Multiple Sclerosis Journal 25, no. 10 (2019): 1372–1378, 10.1177/1352458519856847.
- 17. Al‐Louzi O., Roy S., Osuorah I., et al., “Progressive Multifocal Leukoencephalopathy Lesion and Brain Parenchymal Segmentation From MRI Using Serial Deep Convolutional Neural Networks,” NeuroImage Clinical 28 (2020): 102499, https://pubmed.ncbi.nlm.nih.gov/33395989/.
- 18. Dereskewicz E., La Rosa F., dos Santos Silva J., et al., “FLAMeS: A Robust Deep Learning Model for Automated Multiple Sclerosis Lesion Segmentation,” medRxiv (2025), 10.1101/2025.05.19.25327707.
- 19. Dereskewicz E., La Rosa F., Dos Santos Silva J., et al., “FLAIR Lesion Analysis in Multiple Sclerosis (FLAMeS): Advancing Longitudinal Automated Lesion Segmentation in MS,” poster presented at the ACTRIMS Forum (2025).
- 20. Yushkevich P. A., Piven J., Hazlett H. C., et al., “User‐Guided 3D Active Contour Segmentation of Anatomical Structures: Significantly Improved Efficiency and Reliability,” Neuroimage 31, no. 3 (2006): 1116–1128, 10.1016/j.neuroimage.2006.01.015.
- 21. Rondinella A., Crispino E., Guarnera F., et al., “Boosting Multiple Sclerosis Lesion Segmentation Through Attention Mechanism,” Computers in Biology and Medicine 161 (2023): 107021, 10.1016/j.compbiomed.2023.107021.
- 22. Rondinella A., Guarnera F., Giudice O., et al., “Enhancing Multiple Sclerosis Lesion Segmentation in Multimodal MRI Scans With Diffusion Models,” in 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (IEEE, 2023), 3733–3740, 10.1109/BIBM58861.2023.10385334.
- 23. Isensee F., Jaeger P. F., Kohl S. A. A., Petersen J., and Maier‐Hein K. H., “nnU‐Net: A Self‐Configuring Method for Deep Learning‐Based Biomedical Image Segmentation,” Nature Methods 18, no. 2 (2021): 203–211, 10.1038/s41592-020-01008-z.
- 24. Billot B., Greve D. N., Puonti O., et al., “SynthSeg: Segmentation of Brain MRI Scans of Any Contrast and Resolution Without Retraining,” Medical Image Analysis 86 (2023): 102789, 10.1016/j.media.2023.102789.
- 25. Fischl B., “FreeSurfer,” Neuroimage 62, no. 2 (2012): 774–781, 10.1016/j.neuroimage.2012.01.021.
- 26. Hugging Face—The AI Community Building the Future, accessed April 7, 2025, https://huggingface.co/spaces/FrancescoLR/FLAMeS/.
- 27. Schmidt P., Gaser C., Arsic M., et al., “An Automated Tool for Detection of FLAIR‐Hyperintense White‐Matter Lesions in Multiple Sclerosis,” Neuroimage 59, no. 4 (2012): 3774–3783, 10.1016/j.neuroimage.2011.11.032.
- 28. Cerri S., Puonti O., Meier D. S., et al., “A Contrast‐Adaptive Method for Simultaneous Whole‐Brain and Lesion Segmentation in Multiple Sclerosis,” Neuroimage 225 (2021): 117471, 10.1016/j.neuroimage.2020.117471.
- 29. Raina V., Molchanova N., Graziani M., et al., “Tackling Bias in the Dice Similarity Coefficient: Introducing nDSC for White Matter Lesion Segmentation” (2023), 10.48550/arXiv.2302.05432.
- 30. Krishnan A. P., Song Z., Clayton D., Jia X., de Crespigny A., and Carano R. A. D., “Multi‐Arm U‐Net With Dense Input and Skip Connectivity for T2 Lesion Segmentation in Clinical Trials of Multiple Sclerosis,” Scientific Reports 13, no. 1 (2023): 4102, 10.1038/s41598-023-31207-5.
- 31. Nair T., Precup D., Arnold D. L., and Arbel T., “Exploring Uncertainty Measures in Deep Networks for Multiple Sclerosis Lesion Detection and Segmentation,” Medical Image Analysis 59 (2020): 101557, 10.1016/j.media.2019.101557.
- 32. Sedaghat S., Jang H., Athertya J. S., et al., “The Signal Intensity Variation of Multiple Sclerosis (MS) Lesions on Magnetic Resonance Imaging (MRI) as a Potential Biomarker for Patients' Disability: A Feasibility Study,” Frontiers in Neuroscience 17 (2023): 1145251, 10.3389/fnins.2023.1145251.
- 33. Al‐Louzi O., Manukyan S., Donadieu M., et al., “Lesion Size and Shape in Central Vein Sign Assessment for Multiple Sclerosis Diagnosis: An In Vivo and Postmortem MRI Study,” Multiple Sclerosis Journal 28, no. 12 (2022): 1891–1902, 10.1177/13524585221097560.
- 34. Tran P., Thoprakarn U., Gourieux E., et al., “Automatic Segmentation of White Matter Hyperintensities: Validation and Comparison With State‐of‐the‐Art Methods on Both Multiple Sclerosis and Elderly Subjects,” NeuroImage Clinical 33 (2022): 102940, 10.1016/j.nicl.2022.102940.
- 35. FrancescoLR/FLAMeS‐Model, Hugging Face, accessed April 7, 2025, https://huggingface.co/FrancescoLR/FLAMeS‐model.
