Radiology: Artificial Intelligence. 2024 Jun 26;6(6):e230229. doi: 10.1148/ryai.230229

Optimizing Performance of Transformer-based Models for Fetal Brain MR Image Segmentation

Nicolò Pecco 1, Pasquale Anthony Della Rosa 1,, Matteo Canini 1, Gianluca Nocera 1, Paola Scifo 1, Paolo Ivo Cavoretto 1, Massimo Candiani 1, Andrea Falini 1, Antonella Castellano 1,#, Cristina Baldoli 1,#
PMCID: PMC11605146  PMID: 38922031

Abstract

Purpose

To test the performance of a transformer-based model while manipulating pretraining weights, dataset size, and input size, and to compare the best model with the reference standard and state-of-the-art models for a resting-state functional MRI (rs-fMRI) fetal brain extraction task.

Materials and Methods

An internal retrospective dataset (172 fetuses, 519 images; collected 2018–2022) was used to investigate the influence of dataset size, pretraining approach, and image input size on Swin U-Net transformer (Swin-UNETR) and U-Net transformer (UNETR) models. The internal and an external (131 fetuses, 561 images) dataset were used to cross-validate and to assess the generalization capability of the best model versus state-of-the-art models across scanner types and gestational weeks (GWs). The Dice similarity coefficient (DSC) and the balanced average Hausdorff distance (BAHD) were used as segmentation performance metrics. Generalized equation estimation (GEE) multifactorial models were used to assess significant model and interaction effects of interest.

Results

The Swin-UNETR model was not affected by the pretraining approach or dataset size and performed best with the mean dataset image size, with a mean DSC of 0.92 and BAHD of 0.097. Swin-UNETR was not affected by scanner type. Generalization results showed that Swin-UNETR had lower performance than the reference standard models on the internal dataset and comparable performance on the external dataset. Cross-validation on internal and external test sets demonstrated better and comparable performance of Swin-UNETR versus convolutional neural network architectures during the late-fetal period (GWs > 25) but lower performance during the midfetal period (GWs ≤ 25).

Conclusion

Swin-UNETR showed flexibility in dealing with smaller datasets, regardless of pretraining approach. For fetal brain extraction from rs-fMR images, Swin-UNETR showed performance comparable with that of reference standard models during the late-fetal period and lower performance during the early GW period.

Keywords: Transformers, CNN, Medical Imaging Segmentation, MRI, Dataset Size, Input Size, Transfer Learning

Supplemental material is available for this article.

© RSNA, 2024

See also the commentary by Prabhu in this issue.





Summary

Swin-UNETR with pretraining weights, dataset size, and input size optimization showed comparable performance with that of state-of-the-art models in resting-state functional MRI fetal brain segmentation, with performance differences between mid- and late-fetal periods.

Key Points

■ When comparing the performance of Swin-UNETR with that of UNETR on a retrospective fetal resting-state functional MRI dataset (172 fetuses; 519 images), the Swin-UNETR model showed the best performance metrics using the mean image voxel size, with a mean Dice similarity coefficient (DSC) of 0.92 and a balanced average Hausdorff distance of 0.097.

■ The Swin-UNETR best model showed comparable performance with that of convolutional neural network models for both 1.5-T and 3-T scanners, with mean DSCs of 0.918 and 0.923, respectively.

■ Model performance generalization results showed that Swin-UNETR had lower performance than the reference standard on the internal dataset and comparable performance on the external dataset; model cross-validation on both test sets demonstrated better overall performance on the internal test set and performance differences between the midfetal and late-fetal periods.

Introduction

Automated extraction of the fetal brain from the surrounding maternal compartment faces substantial challenges due to the considerable changes in the brain’s shape and size throughout the gestational weeks (GWs).

Convolutional neural networks (CNNs) (1) have proven to be a powerful tool for segmentation tasks across various medical imaging modalities (2,3). Integration of attention mechanisms (4–7) and vision transformers (8) in CNN-inspired architectures (8–10) has led to the development of transformer-based architectures such as U-Net transformer (UNETR) and Swin-UNETR, showing promising results for image segmentation tasks (11–13). The UNETR is a modified version of the U-Net architecture, which replaces the convolutional layers with multihead self-attention modules (14,15). The Swin-UNETR instead introduces a hierarchical structure with alternating stages of shift-based windows and nonlocal self-attention modules (11).
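For orientation, the sketch below instantiates both architectures with MONAI, the library the study uses for implementation; the 96³-voxel region-of-interest size, channel counts, and feature sizes are illustrative assumptions, not the authors’ configuration.

```python
import torch
from monai.networks.nets import SwinUNETR, UNETR

# Illustrative sizes only: single-channel 3D input, binary
# (brain vs background) output, 96^3-voxel patches.
roi = (96, 96, 96)

unetr = UNETR(
    in_channels=1, out_channels=2, img_size=roi,
    feature_size=16, hidden_size=768, mlp_dim=3072, num_heads=12,
)
swin_unetr = SwinUNETR(
    img_size=roi, in_channels=1, out_channels=2, feature_size=48,
)

x = torch.randn(1, 1, *roi)  # (batch, channel, D, H, W)
with torch.no_grad():
    # Both decoders restore the input spatial size: (1, 2, 96, 96, 96)
    print(unetr(x).shape, swin_unetr(x).shape)
```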

The main goal of this work was twofold: (a) to evaluate the performance of Swin-UNETR against the UNETR transformer-based architecture on a fetal resting-state functional MRI (rs-fMRI) brain extraction task, assessing the influence of pretraining weights and dataset size between architectures and testing the influence of image size on the transformer-based “best model” setup; and (b) to assess the transformer-based best model’s generalization and cross-validation performance as compared with a reference standard and state-of-the-art existing models with different architectures and degrees of optimization for the same task (ie, nnU-Net [16], U-Net [17], generative adversarial network [GAN] [18]).

Materials and Methods

Study Sample

Data from 179 fetuses retrospectively collected at San Raffaele Hospital (Ospedale San Raffaele) between 2018 and 2022 were initially considered for inclusion in the study. Fetuses with central nervous system congenital anomalies, brain parenchymal signal alterations, and inaccurate masking were excluded by a senior developmental neuroscientist (P.A.D.R.) with more than 20 years of experience in advanced neuroimaging data preprocessing and analyses, resulting in a final study sample of 172 fetuses with 519 scans. Seventy-seven fetuses were scanned at 1.5 T (median GW, 32; minimum GW, 21; maximum GW, 37) and 95 at 3 T (median GW, 28; minimum GW, 21; maximum GW, 36). The number of fetal GWs at scanning ranged from 21 to 37 (median, 29 ± 3.72 [SD]). The study protocol was conducted in accordance with the Declaration of Helsinki and approved a priori by the Ospedale San Raffaele Ethics Committee (registration no. EK Nr.2174/2016). All women provided written informed consent prior to MRI examination. Additionally, an independent sample of 131 fetuses with 561 scans (OpenNeuro.org) was included for external testing. The rs-fMRI acquisition details for the internal and external datasets, along with specifics of the creation (using the RS-FetMRI package [19]) and validation of the manually generated ground truth for the internal dataset, are reported in Appendix S1 (internal dataset and external dataset).

Development of Models

Models were implemented in PyTorch (version 1.13.1; https://pytorch.org/) and MONAI (version 1.1.0; https://monai.io/) on a workstation running Ubuntu (version 18.04; https://releases.ubuntu.com/18.04/) and equipped with an Intel Xeon processor with 20 cores and 16 GB of RAM, as well as eight Tesla V100 graphics cards (NVIDIA DGX-1; 6 GB). Pretraining weights were downloaded from MONAI (https://github.com/Project-MONAI/tutorials). The performance of our “best-model” setup and architecture was compared with that of three deep learning models: (a) a reference standard U-Net (17) model, developed and optimized for a fetal brain extraction task with a different architecture (ie, CNN 1), sourced from https://github.com/rauschecker-sugrue-labs/fetal-brain-segmentation, and two state-of-the-art models: (b) nnU-Net (16), optimized for general-purpose brain extraction tasks with an architecture similar to the reference standard (CNN 2), obtained from https://github.com/MIC-DKFZ/nnUNet; and (c) SeGAN (18), optimized for the same task with a different architecture (ie, GAN), downloaded from https://github.com/YuanXue1993/SegAN (Fig 1). The parameters of Swin-UNETR, the reference standard, and the state-of-the-art models, together with the detailed experimental methodology, are listed in Table S1 and Appendix S1.
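As a minimal sketch of the pretrain-versus-scratch manipulation, MONAI’s SwinUNETR exposes a load_from method for the self-supervised pretrained Swin encoder weights distributed with the MONAI tutorials; the local checkpoint filename below is an assumption.

```python
import torch
from monai.networks.nets import SwinUNETR

model = SwinUNETR(img_size=(96, 96, 96), in_channels=1,
                  out_channels=2, feature_size=48)

# "Pretrain" arm: initialize the Swin encoder from the self-supervised
# checkpoint shipped with the MONAI tutorials (filename assumed).
ssl_weights = torch.load("model_swinvit.pt")
model.load_from(weights=ssl_weights)  # loads only the Swin ViT encoder

# "Scratch" arm: skip the two lines above and keep the default
# random initialization.
```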

Figure 1:

Flow diagram shows internal and external dataset allocation and splits into model training, validation, and test sets, with respect to the study aims. CNN = convolutional neural network, GW = gestational week, Val. = validation.


Data Stratification

The internal dataset (172 fetuses; 519 images) was divided at the fetus level and split into training and testing sets with a ratio of 90:10. A fixed optimal train-test split ratio (20) was chosen to reach convergence toward the optimum values of the weights in the pretrained approaches. The training set was further split into training and validation sets, with a ratio of 80:20, with fetuses equally distributed across GWs and scanners. The test set (17 fetuses; 47 images) included one fetus per GW and remained consistent across all experiments. For GWs with a limited number of images, assignment priority was given to the test, validation, and finally the training sets. The training dataset was progressively downsized to two subsets (ie, 66% [103 fetuses; 310 images], 33% [52 fetuses; 154 images]) by reducing the number of fetuses sequentially from the 66% subset to the 33% subset, keeping GW and scanner representativeness consistent (Table S2; Figs 1, S1). Additionally, the external dataset was used for two distinct purposes. First, the external dataset was split with the same splitting procedure implemented for our internal dataset (ie, train [94 fetuses; 403 images]; validation [23 fetuses; 103 images]; test [14 fetuses; 55 images]) to train and test all of the models on a different data distribution, scan acquisition, and image processing, thus assessing the models’ generalization capabilities. Second, a test set mirroring the GW distribution of our internal test set (one fetus per GW) was created for the cross-validation of models trained on our internal dataset, aimed at testing differences or similarities between model performances at specific GW ranges. Appendix S1 lists the details of the external test set construction.
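A minimal sketch of a fetus-level 90:10 split using scikit-learn’s GroupShuffleSplit (grouping by fetus so no fetus spans two sets); the manifest file and column names are hypothetical, and this sketch does not reproduce the authors’ additional balancing of GWs and scanners across splits.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical manifest: one row per image with fetus ID, GW, and scanner.
df = pd.read_csv("internal_dataset_manifest.csv")  # columns: image, fetus, gw, scanner

# 90:10 train/test split at the fetus level.
gss = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["fetus"]))
trainval_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Further 80:20 train/validation split of the remaining fetuses.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
tr_idx, val_idx = next(gss_val.split(trainval_df, groups=trainval_df["fetus"]))
train_df, val_df = trainval_df.iloc[tr_idx], trainval_df.iloc[val_idx]
```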

Statistical Analysis

Dice similarity coefficient (DSC) and balanced average Hausdorff distance (BAHD) (21) metrics were used to measure segmentation overlap and shape disparities, using the “EvaluateSegmentation” tool (22). We used a generalized equation estimation (GEE) approach, as implemented in SPSS Statistics (version 23; IBM), to assess (a) GW representativeness across dataset sizes (GEE 1); (b) performance differences between Swin-UNETR and UNETR, while evaluating the influence of pretraining and scratch approaches and of dataset size (GEE 2); (c) the influence of image size on the best model’s performance (GEE 3); (d) model performance generalization, in terms of differences between our best model, nnU-Net, U-Net, and GAN performance metrics, while evaluating the influence of scanner type in the internal dataset (GEE 4) and of using the internal or external dataset (GEE 5); and (e) model performance cross-validation, in terms of differences between our best model, nnU-Net, U-Net, and GAN performance metrics on the internal and external test sets, comparing models within test sets at each GW and between test sets for the midfetal (GWs ≤ 25) and late-fetal (GWs > 25) periods (GEE 6) (23). For detailed descriptions of each GEE model, please refer to Table 1.
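The study computes these metrics with the EvaluateSegmentation tool; purely for illustration, the following is a voxel-based reimplementation of DSC and of the balanced average Hausdorff distance (which, following Aydin et al [21], normalizes both directed average nearest-neighbor distances by the number of ground truth points), here ignoring voxel spacing.

```python
import numpy as np
from scipy.spatial import cKDTree

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def balanced_ahd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Balanced average Hausdorff distance: both directed average
    nearest-neighbor distances are divided by the number of ground
    truth voxels, then averaged (voxel units; spacing ignored)."""
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    d_pred_to_gt = cKDTree(g).query(p)[0]  # each predicted voxel -> nearest GT voxel
    d_gt_to_pred = cKDTree(p).query(g)[0]  # each GT voxel -> nearest predicted voxel
    return 0.5 * (d_pred_to_gt.sum() + d_gt_to_pred.sum()) / len(g)
```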

Table 1:

GEE Model Descriptions for Model Effect


We report model effects and only the post hoc tests, corrected for sequential Bonferroni comparisons (P < .05), of significant factorial interactions of interest for each GEE model, based on our experimental aims. Table 2 summarizes significant post hoc mean differences. All multifactorial post hoc comparisons, including both significant and nonsignificant effects, are reported in Table S9. Descriptive statistics, including means, standard errors, and 95% CIs for each GEE model, are included in Tables S3–S8.

Table 2:

Significant Results of the GEE Models Summarized for DSC and BAHD for Models, Pretraining and Dataset Size Effect (GEE 2c), Image Size Effect (GEE 3c), and Models and Scanner Effect (GEE 4c)


Differences in performance between models were considered statistically significant when both DSC and BAHD metrics were statistically significant.
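The GEE models were fitted in SPSS; as a rough illustrative analogue (here of GEE 2, the model × approach × dataset size analysis), a generalized estimating equation with fetus-level clustering could be specified in Python’s statsmodels as follows, with a hypothetical long-format results table and column names.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per test image per trained model,
# with columns dsc, model, approach, dataset_size, and fetus.
results_df = pd.read_csv("performance_long_format.csv")

gee = smf.gee(
    "dsc ~ C(model) * C(approach) * C(dataset_size)",
    groups="fetus",                        # repeated measures clustered by fetus
    data=results_df,
    family=sm.families.Gaussian(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
print(gee.fit().summary())
```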

Data and Code Availability

The internal dataset is publicly accessible at the Ospedale San Raffaele repository (https://ordr.hsr.it/datasets/dyg9dpmgvs/1). The source code of the Swin-UNETR model for the fetal rs-fMRI brain extraction task is publicly accessible on GitHub (https://github.com/NicoloPecco/Swin-Functional-Fetal-Brain-Segmentation), and the original code for the Swin-UNETR algorithm is available at https://github.com/Project-MONAI/research-contributions/tree/main/SwinUNETR.

Results

GEE Approach Results

GEE 1 showed no significant phase- or dataset size–dependent effects on the number of images representing each GW (ie, no significant effects for the phase × GW, dataset size × GW, and phase × dataset size × GW interactions; throughout the article and the tables, with specific reference to the models and effects investigated, × and * denote the combined effects of factors) (Table 1, GEE 1).

GEE 2 showed a significant approach- and dataset size–dependent interaction effect on DSC and BAHD metric differences between Swin-UNETR and UNETR (ie, a significant approach × dataset size × model interaction) (Table 1, GEE 2 and GEE 2b). Post hoc tests revealed that Swin-UNETR demonstrated significantly higher performance than UNETR across all approach × dataset size combinations (Table 2, GEE 2c with significant P values). No evidence of differences was observed between pretrain and scratch approaches, nor among dataset sizes, for the Swin-UNETR (Table S9, GEE 2c). UNETR had significantly higher performance on the 66% than on the 33% subset in the pretrain approach (DSC: P = .004; BAHD: P = .02) and on the 100% dataset than on the 66% subset in the scratch approach (DSC: P = .009; BAHD: P = .02) (Tables 2 [GEE 2c], S3, S9 [GEE 2c]; Fig 2).

Figure 2:

Plots compare the performance of pretraining (orange bars) and scratch (blue bars) approaches for Swin-UNETR (left) and UNETR (right) models on three dataset configurations (full, 66%, and 33%), in terms of mean DSC and mean BAHD. BAHD = balanced average Hausdorff distance, DSC = Dice similarity coefficient.


GEE 3 showed a significant main effect of image size on the best model’s DSC and BAHD performance metrics (Table 1, GEE 3 and GEE 3b). Post hoc tests demonstrated that using the mean dataset image size resulted in higher performance than smaller or larger voxel sizes (Table 2, GEE 3c with significant P values), with no evidence of a difference from a voxel size of 2 mm³ (Table S9, GEE 3c). The smallest image voxel size of 0.50 mm³ resulted in lower performance than larger voxel sizes (Table 2, GEE 3c with significant P values). No evidence of differences in performance was found for an image voxel size of 0.75 mm³ compared with larger voxel sizes (Table S9, GEE 3c), except against the mean image voxel size (DSC: P = .001; BAHD: P = .048). Performance with a voxel size of 3 mm³ was higher than with voxel sizes of 4 mm³ (DSC: P < .001; BAHD: P < .001) and 5 mm³ (DSC: P < .001; BAHD: P < .001) (Tables 2 [GEE 3c], S4, S9 [GEE 3c]; Figs 3, S2).
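A plausible way to realize this input-size manipulation with the MONAI transforms the study already relies on is isotropic resampling before training; the file names below are placeholders, and the 3-mm target stands in for any of the tested voxel sizes.

```python
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, Spacingd

def make_transform(pixdim):
    """Load an image-mask pair and resample both to the target voxel size."""
    return Compose([
        LoadImaged(keys=["image", "label"]),
        EnsureChannelFirstd(keys=["image", "label"]),
        Spacingd(keys=["image", "label"], pixdim=pixdim,
                 mode=("bilinear", "nearest")),  # nearest neighbor keeps masks binary
    ])

transform = make_transform((3.0, 3.0, 3.0))  # eg, 3-mm isotropic voxels
sample = transform({"image": "sub01_bold.nii.gz", "label": "sub01_mask.nii.gz"})
```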

Figure 3:

Top: Graphs show the mean DSC (upper left) and mean BAHD (upper right) performance on the internal test set for the Swin-UNETR model using eight different input image sizes. Bottom: Graph reports DSC values for each validation fold. Color coding is relative to the image voxel size. DSC = Dice similarity coefficient, BAHD = balanced average Hausdorff distance.


GEE 4 showed a significant scanner-dependent effect on DSC and BAHD metric differences between the best model, nnU-Net, U-Net, and GAN on the internal dataset (ie, a significant scanner × model interaction) (Table 1, GEE 4 and GEE 4b). Post hoc tests revealed that the reference standard demonstrated significantly higher performance metrics than the CNN 2 at both 1.5 T (DSC: P < .001; BAHD: P < .001) and 3 T (DSC: P < .001; BAHD: P < .001). Swin-UNETR showed no evidence of a difference in performance compared with CNN 2 and the reference standard at either 1.5 T or 3 T (Table S9, GEE 4c). In addition, at 1.5 T, the reference standard model showed higher performance than the GAN model (DSC: P < .001; BAHD: P = .02). The GAN model exhibited significantly lower performance than all other models at 3 T (Tables 2 [GEE 4c], S5; Fig S4).

GEE 5 showed a significant dataset type–dependent effect on DSC and BAHD metric differences between the best model, nnU-Net, U-Net, and GAN when using the internal or external dataset (ie, a significant dataset type × model interaction) (Table 1, GEE 5 and GEE 5b). Post hoc tests revealed higher performance of the reference standard model on the internal dataset compared with the other models (Table 3, GEE 5c with significant P values) but no evidence of differences compared with CNN 2 and Swin-UNETR on the external dataset (Table S9, GEE 5c). Our best model showed no evidence of a difference in performance compared with the CNN 2 model on either dataset (Table S9, GEE 5c). Conversely, the GAN model demonstrated significantly lower performance than the reference standard model, the other state-of-the-art model, and our best model on both datasets (Tables 3 [GEE 5c], S6, S9 [GEE 5c]; Fig S3).

Table 3:

Significant Results of the GEE Models Summarized for DSC and BAHD for Models and Datasets Effect


GEE 6 showed a significant test set type– and GW-dependent interaction effect on DSC and BAHD metric differences between the best model, nnU-Net, U-Net, and GAN (ie, a significant test set type × GW × model interaction) (Table 1, GEE 6 and GEE 6b). Post hoc tests within test set type at each GW revealed that, on the internal test set, Swin-UNETR demonstrated lower performance than the reference standard throughout the midfetal period (except at GW 23) and at GWs 26, 33, and 36 (Table 4, GEE 6c with significant P values). Conversely, at GWs 28 and 31, Swin-UNETR demonstrated higher performance (Table 4, GEE 6c with significant P values). The CNN 2 had lower performance than the reference standard at GWs 25, 27–30, and 32–37 (Table 4, GEE 6c with significant P values). When comparing Swin-UNETR with CNN 2, better performance was observed for CNN 2 at GWs 22 and 25, whereas Swin-UNETR showed better performance at GWs 28, 30, 31, and 37 (Table 4, GEE 6c with significant P values). The GAN model showed significantly lower performance than the reference standard at every GW except GWs 22, 28, 31, 35, and 37, where no evidence of a difference was observed (Tables 4 [GEE 6c], S9 [GEE 6c]).

Table 4:

Significant Results of the GEE Models Summarized for DSC and BAHD for Model, Gestational Week, and Test Sets Effect


For the external test set, post hoc tests revealed that Swin-UNETR demonstrated significantly lower performance than the reference standard at GWs 26, 28, and 37 (Table 4, GEE 6c with significant P values). Conversely, at GWs 31 and 36, Swin-UNETR demonstrated significantly higher performance (Table 4, GEE 6c with significant P values). No evidence of differences in DSC and BAHD metrics was found between the reference standard and Swin-UNETR across the other GWs (Table S9, GEE 6c). The CNN 2 showed higher performance metrics than the reference standard at GWs 30–32 and 36 but lower performance at GW 26 (Table 4, GEE 6c with significant P values). No evidence of differences between CNN 2 and the reference standard was found for either DSC or BAHD across the other GWs (Table S9, GEE 6c). When assessing performance differences between Swin-UNETR and CNN 2, significantly higher performance was observed for CNN 2 at GWs 23, 26, 28, 30, and 31, whereas at GW 29 Swin-UNETR showed significantly higher performance (Table 4, GEE 6c with significant P values). The GAN showed significantly lower performance than the reference standard at GWs 23 and 28 (Table 4, GEE 6c with significant P values).

Post hoc tests for comparisons between test set types for the midfetal (GWs ≤ 25) and late-fetal (GWs > 25) periods revealed that, in the midfetal period, most comparisons (75% [12 of 16]) between internal and external test sets showed significantly higher performance for all models on the internal dataset. For the late-fetal period, half of the comparisons (50% [24 of 48]) showed significantly higher performance for all models on the internal dataset, except for the CNN 2 at GW 37 (Tables 4 [GEE 6c], S7, S8, S9 [GEE 6c]; Figs 4, 5).

Figure 4:

Raw fetal resting-state functional MRI volumes (first column) obtained during GW 21 for the midfetal period (top row) and GW 37 for the late-fetal period (bottom row). Segmentation masks are shown for ground truth resting-state fetal MRI (GT) (second column), the Swin-UNETR model (third column), SeGAN (fourth column), CNN 1 (fifth column), and CNN 2 (sixth column). Arrows indicate the models’ segmentation errors (ie, inclusion of nonfetal brain tissue or exclusion of certain regions within the fetal brain). CNN = convolutional neural network, GW = gestational week.


Figure 5:

Graphs show the mean DSC (left) and mean BAHD (right) for each gestational week (GW) from 21 to 37 for the Swin-UNETR (violet), CNN 1 (blue), CNN 2 (yellow), and SeGAN (orange) for model cross-validation on internal and external test sets. The y-axes were log-transformed to allow visualization of the lower and upper bound CIs for all models’ DSC and BAHD performance metrics in the same graph. BAHD = balanced average Hausdorff distance, CNN = convolutional neural network, DSC = Dice similarity coefficient.


Discussion

The general objective of this study was twofold. First, we aimed to pinpoint the best transformer-based deep learning architecture (ie, Swin-UNETR vs UNETR) for a fetal brain extraction task using rs-fMR images, evaluating first the impact of different training approaches (pretrain vs scratch) and dataset sizes (100%, 66%, 33%) on the two transformer-based architectures and then the impact of rs-fMR image resolution (upsampling vs downsampling) on the best transformer-based model setup. Second, we aimed to evaluate the performance of the best transformer-based model setup against the most representative alternative architectures (ie, CNNs and GAN) tailored for the same task, in terms of both model generalization (ie, validation of the best transformer-based model setup against the other architectures’ model setups across two different datasets) and model cross-validation for specific task characteristics (ie, cross-validation of model performance at different GWs across the midfetal and late-fetal periods on internal and external test sets).

First, neither UNETR nor Swin-UNETR was affected by the pretraining approach, likely because the pretraining weights were derived from CT data and may not offer significant benefits for the fetal brain rs-fMRI extraction task. The Swin-UNETR did not show any performance difference when trained with different subset sizes, demonstrating flexibility in dealing with smaller datasets. On the contrary, the UNETR showed a difference between the 66% and 33% subsets, suggesting a more limited capability to manage smaller datasets. In addition, the best transformer-based model setup (ie, the Swin-UNETR best model) achieved higher performance when employing the mean dataset image size as compared with both smaller and larger voxel sizes, likely owing to the redundant information and artifacts introduced by resampling.

Second, our model generalization findings highlight the importance of dataset choice in model training and evaluation, as different fetal rs-fMRI datasets may contain unique image characteristics but also a particular selection of fetuses that can impact the variability on model performances.

Model cross-validation results on the GW-split internal dataset revealed that fetal brain extraction at earlier GWs presented a heightened challenge for the Swin-UNETR best model. The specific features of the fetal brain–maternal abdomen boundary at early GWs may explain why the Swin-UNETR best model’s performance drops with respect to CNN 2, which shows the highest performance for GWs at the lower tail of the GW sample distribution. CNNs offer effective feature hierarchies, parameter efficiency, and local detail recognition, which in our study likely translated into optimal model performance metrics at lower GWs as well. Notably, nnU-Net (CNN 2) is, like the Swin-UNETR, optimized for general-purpose brain extraction tasks; thus, a more “CNN architecture-general” advantage seems to arise, with CNN 2 being less affected by (a) the lower number of scans at the lower tail of the GW distribution (21–23) and (b) the substantial differences in fetal brain anatomic landmarks between the midfetal (GWs ≤ 25) and late-fetal (GWs > 25) periods (23).

On the contrary, for GWs at the upper tail of the GW sample distribution, the Swin-UNETR best model showed a significantly higher performance compared with CNN 2 and comparable performance with the reference standard model, most likely due to the emergence of fetal brain discriminating features from maternal abdominal tissue, resembling a “more general” brain extraction task. Taken together, these results suggest that, for a fetal brain extraction task, the sensitivity of model performance is closely linked to the representativeness of scans across different GWs. This, in turn, is intricately tied to the diverse features exhibited in fetal images, particularly in delineating boundaries between the fetal brain and maternal abdomen, change in fetal brain size, and contrast variability in the rs-fMRI scans especially in the midfetal period. Therefore, augmenting dataset representativeness could potentially lead to more comparable results between transformer-based and CNN architectures on the full range of GWs, augmenting their generalization capabilities (24).

Interestingly, performance differences between models at GWs in the lower tail of the GW sample distribution were less evident for the external dataset. Furthermore, for GWs in the midfetal period, 75% of the comparisons between performance metrics on internal and external test sets showed significantly higher performance of all models on the internal dataset, whereas in the late-fetal period the percentage dropped to 50%, highlighting a general model sensitivity to dataset type and to GW representativeness within each dataset. This finding is further confirmed by the significantly higher model performance differences observed at earlier GWs on the internal test set, which vanished on the external test set, where differences remained only at later GWs, with the Swin-UNETR best model and CNN 2 still showing better performance.

This study had some limitations. First, images are not consistently and homogeneously represented across the GW range considered, especially at lower GWs (GWs < 25). This lack of representativeness at lower GWs could affect model performance and generalizability and introduce potential biases toward better-represented GWs. Second, our sample can be considered relatively small for deep learning applications in radiology, and the unavailability of open-source databases containing fetal functional MR images limits external validation, preventing us from fully addressing the advantages or disadvantages associated with either architecture and model. Thus, future research should aim to create a publicly available large fetal resting-state database with balanced and representative images across GWs to enhance the robustness and generalization of deep learning models for fetal resting-state brain extraction tasks across the entire gestational period.

In conclusion, future exploration and development of transformer-based models for this specific fetal brain extraction task, using ad hoc hyperparameter fine-tuning, contrast-enhancing approaches, and GW-specific data representativeness augmentation procedures, holds great potential for translational research on prenatal clinical applications with smaller datasets. In addition, this study adds compelling evidence on the generalization and cross-validation of deep learning architectures for a fetal brain extraction task with rs-fMRI, as it uses all of the most representative architectures and compares model performance at different fetal stages of development (see Ciceri et al [25] for a review) while using datasets with different acquisition characteristics and GW representativeness in their sample distributions.

Acknowledgments

The authors would like to thank the reviewers for their exceptionally detailed, thoughtful, and constructive feedback and all the participants for their participation and their motivation. We also want to thank the Service Center for Research (CSTOR) at IRCCS San Raffaele Research Institute for their support on IT setup and implementation.

* A.C. and C.B. are co–senior authors.

Supported by the Italian Ministry of Health’s Ricerca Finalizzata 2016 grant (RF-2016-02364081) (principal investigator, P.A.D.R.).

Disclosures of conflicts of interest: N.P. No relevant relationships. P.A.D.R. Grant from Italian Ministry of Health’s “RicercaFinalizzata 2016” (grant RF-2016-02364081), principal investigator. M. Canini Grant from Italian Ministry of Health’s “RicercaFinalizzata 2016” (grant RF-2016-02364081). G.N. No relevant relationships. P.S. No relevant relationships. P.I.C. No relevant relationships. M. Candiani Grant from Italian Ministry of Health’s “RicercaFinalizzata 2016” (grant RF-2016-02364081). A.F. No relevant relationships. A.C. No relevant relationships. C.B. Grant from Italian Ministry of Health’s “RicercaFinalizzata 2016” (grant RF-2016-02364081).

Abbreviations:

BAHD
balanced average Hausdorff distance
CNN
convolutional neural network
DSC
Dice similarity coefficient
GAN
generative adversarial network
GEE
generalized equation estimation
GW
gestational week
rs-fMRI
resting-state functional MRI
UNETR
U-Net transformer

References


