Abstract
Purpose
To test the performance of transformer-based models when manipulating pretraining weights, dataset size, and input size, and to compare the best model with reference standard and state-of-the-art models for a resting-state functional MRI (rs-fMRI) fetal brain extraction task.
Materials and Methods
An internal retrospective dataset (172 fetuses, 519 images; collected 2018–2022) was used to investigate the influence of dataset size, pretraining approach, and image input size on Swin U-Net transformer (Swin-UNETR) and U-Net transformer (UNETR) models. The internal and external (131 fetuses, 561 images) datasets were used for cross-validation and to assess the generalization capability of the best model versus state-of-the-art models across scanner types and gestational weeks (GWs). The Dice similarity coefficient (DSC) and the balanced average Hausdorff distance (BAHD) were used as segmentation performance metrics. Generalized estimating equation multifactorial models were used to assess significant model and interaction effects of interest.
Results
The Swin-UNETR model was unaffected by pretraining approach and dataset size and performed best with the mean dataset image size, with a mean DSC of 0.92 and a mean BAHD of 0.097. Swin-UNETR was also unaffected by scanner type. Generalization results showed that Swin-UNETR had lower performance than the reference standard model on the internal dataset and comparable performance on the external dataset. Cross-validation on the internal and external test sets demonstrated better or comparable performance of Swin-UNETR versus convolutional neural network architectures during the late-fetal period (GWs > 25) but lower performance during the midfetal period (GWs ≤ 25).
Conclusion
Swin-UNETR showed flexibility in dealing with smaller datasets, regardless of pretraining approach. For fetal brain extraction from rs-fMR images, Swin-UNETR showed performance comparable with that of reference standard models during the late-fetal period and lower performance during the midfetal period.
Keywords: Transformers, CNN, Medical Imaging Segmentation, MRI, Dataset Size, Input Size, Transfer Learning
Supplemental material is available for this article.
© RSNA, 2024
See also the commentary by Prabhu in this issue.
Summary
Swin-UNETR with pretraining weights, dataset size, and input size optimization showed comparable performance with that of state-of-the-art models in resting-state functional MRI fetal brain segmentation, with performance differences between mid- and late-fetal periods.
Key Points
■ When comparing the performance of Swin-UNETR with that of UNETR on a retrospective fetal resting-state functional MRI dataset (172 fetuses; 519 images), the Swin-UNETR model showed the best performance metrics using the mean image voxel size, with a mean Dice similarity coefficient (DSC) of 0.92 and balanced average Hausdorff distance of 0.097.
■ The Swin-UNETR best model showed comparable performance with that of convolutional neural network models for both 1.5-T and 3-T scanners, with mean DSC of 0.918 and 0.923, respectively.
■ Model generalization results showed that Swin-UNETR had lower performance than the reference standard on the internal dataset and comparable performance on the external dataset; model cross-validation on both test sets demonstrated better model performance on the internal test set and performance differences between the midfetal and late-fetal periods.
Introduction
Automated extraction of the fetal brain from the surrounding maternal compartment faces substantial challenges due to the considerable changes in the brain’s shape and size throughout the gestational weeks (GWs).
Convolutional neural networks (CNNs) (1) have proven to be a powerful tool for segmentation tasks across various medical imaging modalities (2,3). Integration of attention mechanisms (4–7) and vision transformers (8) in CNN-inspired architectures (8–10) has led to the development of transformer-based architectures such as U-Net transformer (UNETR) and Swin-UNETR, showing promising results for image segmentation tasks (11–13). The UNETR is a modified version of the U-Net architecture, which replaces the convolutional layers with multihead self-attention modules (14,15). The Swin-UNETR instead introduces a hierarchical structure with alternating stages of shift-based windows and nonlocal self-attention modules (11).
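For orientation, both architectures can be instantiated from MONAI’s reference implementations; the following is a minimal sketch with illustrative hyperparameters, not this study’s configuration (the study’s parameters are listed in Table S1).

```python
# Minimal sketch: instantiating the two transformer-based architectures from
# MONAI's reference implementations. Hyperparameter values are illustrative
# defaults, not this study's configuration (see Table S1).
import torch
from monai.networks.nets import SwinUNETR, UNETR

swin_unetr = SwinUNETR(
    img_size=(96, 96, 96),  # hierarchical encoder with shifted-window attention
    in_channels=1,
    out_channels=1,
    feature_size=48,
)
unetr = UNETR(
    in_channels=1,
    out_channels=1,
    img_size=(96, 96, 96),  # single-resolution ViT encoder with multihead self-attention
    feature_size=16,
    hidden_size=768,
    num_heads=12,
)

x = torch.randn(1, 1, 96, 96, 96)           # one single-channel 3D volume
print(swin_unetr(x).shape, unetr(x).shape)  # both return (1, 1, 96, 96, 96)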
The main goal of this work was twofold: (a) to evaluate the performance of Swin-UNETR against UNETR transformer-based architectures on a fetal resting-state functional MRI (rs-fMRI) brain extraction task, while assessing the influence of pretraining weights and dataset size between architectures and testing the influence of image size on the transformer-based “best model” setup; and (b) to assess the transformer-based best model’s generalization and cross-validation performance as compared with a reference standard and state-of-the-art existing models with different architectures and degrees of optimization for the same task (ie, nnU-Net [16], U-Net [17], generative adversarial network [GAN] [18]).
Materials and Methods
Study Sample
Data from 179 fetuses retrospectively collected at San Raffaele Hospital (Ospedale San Raffaele) between 2018 and 2022 were initially considered for inclusion in the study. Fetuses with central nervous system congenital anomalies, brain parenchymal signal alterations, or inaccurate masking were excluded by a senior developmental neuroscientist (P.A.D.R.) with more than 20 years of experience in advanced neuroimaging data preprocessing and analysis, resulting in a final study sample of 172 fetuses with 519 scans. Seventy-seven fetuses were scanned at 1.5 T (median GW, 32; minimum, 21; maximum, 37) and 95 at 3 T (median GW, 28; minimum, 21; maximum, 36). The number of fetal GWs at scanning ranged from 21 to 37 (median, 29 ± 3.72 [SD]). The study protocol was conducted in accordance with the Declaration of Helsinki and approved a priori by the Ospedale San Raffaele Ethics Committee (registration no. EK Nr.2174/2016). All women provided written informed consent prior to MRI examination. Additionally, an independent sample of 131 fetuses with 561 scans (OpenNeuro.org) was included for external testing. Rs-fMRI acquisition details for the internal and external datasets, as well as specifics about the creation (using the RS-FetMRI package [19]) and validation of the manually generated ground truth for the internal dataset, are reported in Appendix S1 (internal dataset and external dataset).
Development of Models
Models were implemented in PyTorch (version 1.13.1; https://pytorch.org/) and MONAI (version 1.1.0; https://monai.io/) on a workstation running Ubuntu 18.04 (https://releases.ubuntu.com/18.04/) and equipped with an Intel Xeon processor with 20 cores, 16 GB of RAM, and eight Tesla V100 graphics cards (NVIDIA DGX-1; 6 GB). Pretraining weights were downloaded from MONAI (https://github.com/Project-MONAI/tutorials). Performance of our “best-model” setup and architecture was compared with that of three deep learning models: (a) a reference standard U-Net (17) model developed and optimized for a fetal brain extraction task with a different architecture (ie, CNN 1; https://github.com/rauschecker-sugrue-labs/fetal-brain-segmentation) and two state-of-the-art models, namely (b) nnU-Net (16), optimized for general-purpose brain extraction with an architecture similar to that of the reference standard (CNN 2; https://github.com/MIC-DKFZ/nnUNet), and (c) SegAN (18), optimized for the same task with a different architecture (ie, GAN; https://github.com/YuanXue1993/SegAN) (Fig 1). Parameters of the Swin-UNETR, reference standard, and state-of-the-art models, along with the detailed experimental methodology, are listed in Table S1 and Appendix S1.
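As an illustration of the pretraining approach, the following is a minimal sketch assuming the self-supervised Swin ViT checkpoint distributed with the MONAI tutorials; the checkpoint filename and training-loop details shown here are assumptions, not this study’s exact setup (see Appendix S1 for that).

```python
# Hedged sketch: initializing Swin-UNETR with self-supervised pretraining
# weights, following MONAI's tutorial convention. The checkpoint filename
# ("model_swinvit.pt") is the one used in MONAI's examples and is an
# assumption here, not necessarily this study's exact file.
import torch
from monai.losses import DiceLoss
from monai.networks.nets import SwinUNETR

model = SwinUNETR(img_size=(96, 96, 96), in_channels=1, out_channels=1, feature_size=48)
weights = torch.load("model_swinvit.pt")
model.load_from(weights=weights)  # MONAI helper: copies the pretrained Swin ViT encoder weights

# From here, the "pretrain" and "scratch" approaches share the same training
# loop, eg, Dice loss with an AdamW optimizer on the rs-fMRI volumes.
loss_fn = DiceLoss(sigmoid=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
```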
Figure 1:
Flow diagram shows internal and external dataset allocation and splits into training, validation, and test sets with respect to the study aims. CNN = convolutional neural network, GW = gestational week, Val. = validation.
Data Stratification
The internal dataset (172 fetuses; 519 images) was divided at the fetus level and split into training and test sets with a ratio of 90:10. A fixed optimal train-test split ratio (20) was chosen to reach convergence toward the optimum weight values in the pretrained approaches. The training set was further split into training and validation sets with a ratio of 80:20, with fetuses equally distributed across GWs and scanners. The test set (17 fetuses; 47 images) included one fetus per GW and remained consistent across all experiments. For GWs with a limited number of images, assignment priority was given to the test set, then the validation set, and finally the training set. The training dataset was progressively downsized to two subsets (ie, 66% [103 fetuses; 310 images] and 33% [52 fetuses; 154 images]) by sequentially reducing the number of fetuses from the 66% subset to the 33% subset while keeping GW and scanner representativeness consistent (Table S2; Figs 1, S1). Additionally, the external dataset was used for two distinct purposes. First, it was split with the same procedure implemented for the internal dataset (ie, training [94 fetuses; 403 images], validation [23 fetuses; 103 images], and test [14 fetuses; 55 images] sets) to train and test all models on a different data distribution, scan acquisition, and image processing, thus assessing the models’ generalization capabilities. Second, a test set mirroring the GW distribution of our internal test set (one fetus per GW) was created for the cross-validation of models trained on our internal dataset, aimed at testing differences or similarities between model performances at specific GW ranges. Appendix S1 lists the details of the external test set construction.
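A fetus-level grouped split can be sketched as follows; the arrays are dummy stand-ins, and the GW- and scanner-stratified assignment described above would be layered on top of this grouping.

```python
# Minimal sketch of the fetus-level split: grouping by fetus ID keeps all
# images of one fetus in the same partition. Data here are illustrative.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
image_idx = np.arange(519)                   # stand-ins for the 519 image paths
fetus_ids = rng.integers(0, 172, size=519)   # dummy fetus ID per image (172 fetuses)

# 90:10 training-test split at the fetus level.
gss = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
trainval_idx, test_idx = next(gss.split(image_idx, groups=fetus_ids))

# Further 80:20 training-validation split, again grouped by fetus.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_rel, val_rel = next(gss_val.split(trainval_idx, groups=fetus_ids[trainval_idx]))
train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]
```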
Statistical Analysis
The Dice similarity coefficient (DSC) and balanced average Hausdorff distance (BAHD) (21) metrics were used to measure segmentation overlap and shape disparities, using the “EvaluateSegmentation” tool (22). We used a generalized estimating equation (GEE) approach, as implemented in SPSS Statistics (version 23; IBM), to assess (a) GW representativeness across dataset sizes (GEE 1); (b) performance differences between Swin-UNETR and UNETR, while evaluating the influence of pretraining and scratch approaches and dataset size (GEE 2); (c) the influence of image size on best model performance (GEE 3); (d) model performance generalization, in terms of differences between our best model, nnU-Net, U-Net, and GAN performance metrics, while evaluating the influence of scanner type in the internal dataset (GEE 4) and the use of an internal or external dataset (GEE 5); and (e) model performance cross-validation, in terms of differences between our best model, nnU-Net, U-Net, and GAN performance metrics on the internal and external test sets, comparing models within test sets for each GW and between test sets for the midfetal (GWs ≤ 25) and late-fetal (GWs > 25) periods (GEE 6) (23). For detailed descriptions of each GEE model, please refer to Table 1.
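As a rough Python analogue of these SPSS analyses, a GEE with fetus-level clustering can be fit with statsmodels; the data frame, column names, and covariance structure below are illustrative assumptions, not the study’s exact specification.

```python
# Hedged sketch: a statsmodels analogue of one of the SPSS GEE analyses
# (here, GEE 2: approach x dataset size x model effects on DSC). The input
# table, column names, and covariance structure are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("per_image_metrics.csv")  # hypothetical per-image results table

gee2 = smf.gee(
    "dsc ~ C(approach) * C(dataset_size) * C(model)",
    groups="fetus_id",                      # repeated measures clustered by fetus
    data=df,
    family=sm.families.Gaussian(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = gee2.fit()
print(result.summary())                     # Wald tests for main and interaction effects
```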
Table 1:
GEE Model Descriptions for Model Effect
We report model effects and only those post hoc tests, corrected for sequential Bonferroni comparisons (P < .05), of significant factorial interactions of interest for each GEE model, based on our experimental aims. Table 2 summarizes significant post hoc mean differences. All multifactorial post hoc comparisons, including both significant and nonsignificant effects, are instead reported in Table S9. Descriptive statistics, including means, standard errors, and 95% CIs for each GEE model, are included in Tables S3–S8.
Table 2:
Significant Results of the GEE Models Summarized for DSC and BAHD for Models, Pretraining and Dataset Size Effect (GEE 2c), Image Size Effect (GEE 3c), and Models and Scanner Effect (GEE 4c)
Differences in performance between models were considered statistically significant when both DSC and BAHD metrics were statistically significant.
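For reference, a minimal NumPy/SciPy sketch of the two metrics follows; the study itself used the EvaluateSegmentation tool (22), and the BAHD form shown follows the definition of Aydin et al (21), in which both directed average surface distances are normalized by the ground truth voxel count.

```python
# Hedged sketch of the two performance metrics; illustrative only, since the
# study computed them with the "EvaluateSegmentation" tool (22).
import numpy as np
from scipy.spatial import cKDTree

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC = 2|P intersect G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def balanced_ahd(pred: np.ndarray, gt: np.ndarray) -> float:
    """BAHD (21): both directed sums of nearest-neighbor distances are
    normalized by the ground truth voxel count, unlike the classic average
    Hausdorff distance, which normalizes each direction by its own set size."""
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    d_pred_to_gt = cKDTree(g).query(p)[0].sum()  # prediction -> ground truth
    d_gt_to_pred = cKDTree(p).query(g)[0].sum()  # ground truth -> prediction
    return (d_pred_to_gt + d_gt_to_pred) / (2.0 * len(g))
```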
Data and Code Availability
The internal dataset is publicly accessible at the Ospedale San Raffaele repository (https://ordr.hsr.it/datasets/dyg9dpmgvs/1). The source code of the Swin-UNETR model for the fetal rs-fMRI brain extraction task is publicly accessible on GitHub (https://github.com/NicoloPecco/Swin-Functional-Fetal-Brain-Segmentation), and the original code for the Swin-UNETR algorithm is available at https://github.com/Project-MONAI/research-contributions/tree/main/SwinUNETR.
Results
GEE Approach Results
GEE 1 showed no significant phase- and dataset size–dependent effects on the number of images representative of each GW (ie, no significant effects for the phase × GW, dataset size × GW, and phase × dataset size × GW interactions) (Table 1, GEE 1). Throughout this article and the tables, × and * denote combined (interaction) effects of factors, with specific reference to the models and effects investigated.
GEE 2 showed a significant approach × dataset size interaction effect on DSC and BAHD metric differences between Swin-UNETR and UNETR (ie, a significant approach × dataset size × model interaction) (Table 1, GEE 2 and GEE 2b). Post hoc tests revealed that Swin-UNETR demonstrated significantly higher performance than UNETR across all approach × dataset size combinations (Table 2, GEE 2c with significant P values). No evidence of differences was observed between pretrain and scratch approaches, nor among dataset sizes, for the Swin-UNETR (Table S9, GEE 2c). UNETR had significantly higher performance on the 66% subset compared with the 33% subset in the pretrain approach (DSC: P = .004; BAHD: P = .02) and on the 100% dataset compared with the 66% subset in the scratch approach (DSC: P = .009; BAHD: P = .02) (Tables 2 [GEE 2c], S3, S9 [GEE 2c]; Fig 2).
Figure 2:
Plots compare the performance of pretraining (orange bars) and scratch (blue bars) approaches for the Swin-UNETR (left) and UNETR (right) models on three dataset configurations (full, 66%, and 33%), in terms of mean DSC and mean BAHD performance. BAHD = balanced average Hausdorff distance, DSC = Dice similarity coefficient.
GEE 3 showed a significant main effect of image size on the best model’s DSC and BAHD performance metrics (Table 1, GEE 3 and GEE 3b). Post hoc tests demonstrated that using the mean dataset image size resulted in higher performance compared with smaller or larger voxel sizes (Table 2, GEE 3c with significant P values) and no evidence of a difference compared with a voxel size of 2 mm³ (Table S9, GEE 3c). The smallest voxel size of 0.50 mm³ resulted in lower performance compared with larger voxel sizes (Table 2, GEE 3c with significant P values). No evidence of differences in performance was found for a voxel size of 0.75 mm³ compared with larger voxel sizes (Table S9, GEE 3c), except in comparison with the mean image voxel size (DSC: P = .001; BAHD: P = .048). Performance with a voxel size of 3 mm³ was higher than performance with voxel sizes of 4 mm³ (DSC: P < .001; BAHD: P < .001) and 5 mm³ (DSC: P < .001; BAHD: P < .001) (Tables 2 [GEE 3c], S4, S9 [GEE 3c]; Figs 3, S2).
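Although the exact preprocessing pipeline is detailed in Appendix S1, varying the input voxel size can be sketched with MONAI transforms; the candidate spacings below mirror several of the sizes compared above and are otherwise illustrative assumptions.

```python
# Hedged sketch: resampling rs-fMRI volumes to candidate isotropic voxel sizes
# with MONAI transforms; the study's actual pipeline is an assumption here.
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, Spacingd

def resample_pipeline(spacing_mm: float) -> Compose:
    return Compose([
        LoadImaged(keys="image"),
        EnsureChannelFirstd(keys="image"),
        # Resample the volume to an isotropic voxel size (mm).
        Spacingd(keys="image", pixdim=(spacing_mm,) * 3, mode="bilinear"),
    ])

# One experiment per candidate input size (cf the sizes compared above).
pipelines = {s: resample_pipeline(s) for s in (0.50, 0.75, 2.0, 3.0, 4.0, 5.0)}
```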
Figure 3:
Top: Graphs show the mean DSC (upper left) and mean BAHD (upper right) performance on the internal test set for the Swin-UNETR model using eight different input image sizes. Bottom: Graph reports DSC values for each validation fold. Color coding is relative to the image voxel size. DSC = Dice similarity coefficient, BAHD = balanced average Hausdorff distance.
GEE 4 showed a significant scanner-dependent effect on DSC and BAHD metric differences for comparisons between the best model, nnU-Net, U-Net, and GAN performance on the internal dataset (ie, a significant scanner × model interaction) (Table 1, GEE 4 and GEE 4b). Post hoc tests revealed that the reference standard demonstrated significantly higher performance metrics than CNN 2 at both 1.5 T (DSC: P < .001; BAHD: P < .001) and 3 T (DSC: P < .001; BAHD: P < .001). Swin-UNETR showed no evidence of a difference in performance compared with CNN 2 and the reference standard on both 1.5- and 3-T scanners (Table S9, GEE 4c). In addition, on the 1.5-T scanner, the reference standard model showed higher performance than the GAN model (DSC: P < .001; BAHD: P = .02). The GAN model exhibited significantly lower performance than all other models at 3 T (Tables 2 [GEE 4c], S5; Fig S4).
GEE 5 showed a significant dataset type–dependent effect on DSC and BAHD metric differences for comparisons between the best model, nnU-Net, U-Net, and GAN performance on the internal and external datasets (ie, a significant dataset type × model interaction) (Table 1, GEE 5 and GEE 5b). Post hoc tests revealed higher performance of the reference standard model on the internal dataset compared with the other models (Table 3, GEE 5c with significant P values) but no evidence of differences compared with CNN 2 and Swin-UNETR on the external dataset (Table S9, GEE 5c). Our best model showed no evidence of a difference in performance compared with the CNN 2 model on either dataset (Table S9, GEE 5c). Conversely, the GAN model demonstrated significantly lower performance than the reference standard model, the other state-of-the-art model, and our best model on both datasets (Tables 3 [GEE 5c], S6, S9 [GEE 5c]; Fig S3).
Table 3:
Significant Results of the GEE Models Summarized for DSC and BAHD for Models and Datasets Effect
GEE 6 showed a significant test set type × GW interaction effect on DSC and BAHD metric differences for comparisons between the best model, nnU-Net, U-Net, and GAN performances (ie, a significant test set type × GW × model interaction) (Table 1, GEE 6 and GEE 6b). Post hoc tests for comparisons within test set type at each GW revealed that, on the internal test set, the Swin-UNETR demonstrated lower performance than the reference standard throughout the midfetal period, except at GW 23, as well as at GWs 26, 33, and 36 (Table 4, GEE 6c with significant P values). Conversely, at GWs 28 and 31, Swin-UNETR demonstrated higher performance (Table 4, GEE 6c with significant P values). The CNN 2 had lower performance than the reference standard at GWs 25, 27–30, and 32–37 (Table 4, GEE 6c with significant P values). When comparing the Swin-UNETR with the CNN 2, better performance was observed for the CNN 2 at GWs 22 and 25, while the Swin-UNETR showed better performance at GWs 28, 30, 31, and 37 (Table 4, GEE 6c with significant P values). The GAN model showed significantly lower performance than the reference standard at every GW except GWs 22, 28, 31, 35, and 37, where no evidence of a difference was observed (Tables 4 [GEE 6c with significant P values] and S9 [GEE 6c]).
Table 4:
Significant Results of the GEE Models Summarized for DSC and BAHD for Model, Gestational Week, and Test Sets Effect
For the external test set, post hoc tests revealed that the Swin-UNETR demonstrated significantly lower performance than the reference standard at GWs 26, 28, and 37 (Table 4, GEE 6c with significant P values). Conversely, at GWs 31 and 36, the Swin-UNETR demonstrated significantly higher performance (Table 4, GEE 6c with significant P values). No evidence of differences in DSC and BAHD metrics was found between reference standard and Swin-UNETR performance across the other GWs (Table S9, GEE 6c). The CNN 2 showed higher performance metrics than the reference standard at GWs 30, 31, 32, and 36 but lower performance at GW 26 (Table 4, GEE 6c with significant P values). No evidence of differences was found when comparing CNN 2 performance with the reference standard for both DSC and BAHD scores across the other GWs (Table S9, GEE 6c). When assessing performance differences between the Swin-UNETR and CNN 2, significantly higher performance was observed for the CNN 2 at GWs 23, 26, 28, 30, and 31, whereas at GW 29, Swin-UNETR showed significantly higher performance (Table 4, GEE 6c with significant P values). The GAN showed significantly lower performance than the reference standard at GWs 23 and 28 (Table 4, GEE 6c with significant P values).
Post hoc tests for comparisons between test set types for the midfetal (GWs ≤ 25) and late-fetal (GWs > 25) periods revealed that, in the midfetal period, most comparisons (75% [12 of 16]) between internal and external test sets showed significantly higher performance for all models on the internal dataset. For the late-fetal period, half of the comparisons (50% [24 of 48]) showed significantly higher performance for all models on the internal dataset, except for the CNN 2 at GW 37 (Tables 4 [GEE 6c], S7, S8, S9 [GEE 6c]; Figs 4, 5).
Figure 4:
Raw fetal resting-state functional MRI volumes (first column) obtained during GW 21 for the midfetal period (top row) and GW 37 for the late-fetal period (bottom row). Segmentation masks are shown for ground truth resting-state fetal MRI (GT) (second column), the Swin-UNETR model (third column), SegAN (fourth column), CNN 1 (fifth column), and CNN 2 (sixth column). Arrows indicate the models’ segmentation errors (ie, inclusion of nonfetal brain tissue or exclusion of certain regions within the fetal brain). CNN = convolutional neural network, GW = gestational week.
Figure 5:
Graphs show the mean DSC (left) and mean BAHD (right) for each gestational week (GW) spanning from 21 to 37 for the Swin-UNETR (violet), CNN 1 (blue), CNN 2 (yellow), and SegAN (orange) for model cross-validation on internal and external test sets. The y-axis scale was log-transformed so that the lower and upper bound CIs of all models’ DSC and BAHD performance metrics could be visualized in the same graph. BAHD = balanced average Hausdorff distance, CNN = convolutional neural network, DSC = Dice similarity coefficient.
Discussion
The general objective of this study was twofold. First, we aimed to pinpoint the best transformer-based deep learning model architecture (ie, Swin-UNETR vs UNETR) for a fetal brain extraction task using rs-fMR images, evaluating first the impact of different approaches (pretrain vs scratch) and dataset sizes (100%, 66%, 33%) on the two transformer-based architectures and then the impact of rs-fMR image resolution (upsampling vs downsampling) on the best transformer-based architecture model setup. Second, we aimed to evaluate the performance of the best transformer-based architecture model setup against the most representative alternative architectures (ie, CNNs and GAN) tailored for the same task, in terms of both model generalization for the task at hand (ie, validation of the best transformer-based model setup against the other architectures across two different datasets) and model cross-validation for specific task characteristics (ie, cross-validation of model performances at different GWs across two fetal periods [midfetal and late fetal] on internal and external test sets).
First, both UNETR and Swin-UNETR were unaffected by the pretraining approach, likely because the pretraining weights were derived from CT data and may not offer significant benefits for the fetal brain rs-fMRI extraction task. The Swin-UNETR did not show any performance difference when trained with different subset sizes, demonstrating flexibility in dealing with smaller datasets. In contrast, the UNETR showed a difference between the 66% and 33% subsets, suggesting a more limited capability to manage smaller datasets. In addition, the best transformer-based architecture model setup (ie, the Swin-UNETR best model) achieved higher performance when employing the mean dataset image size as compared with both smaller and larger voxel sizes, likely because resampling to other voxel sizes introduces redundant information and artifacts.
Second, our model generalization findings highlight the importance of dataset choice in model training and evaluation, as different fetal rs-fMRI datasets may contain not only unique image characteristics but also a particular selection of fetuses, both of which can affect the variability of model performance.
Model cross-validation results on the GW-split internal dataset revealed that fetal brain extraction at earlier GWs presented a heightened challenge for the Swin-UNETR best model. The specific features of fetal brain–maternal abdomen boundaries at early GWs may explain why the Swin-UNETR best model’s performance drops relative to CNN 2, which shows the highest performance at GWs in the lower tail of the GW sample distribution. CNNs offer effective feature hierarchies, parameter efficiency, and local detail recognition, which in our study likely translated into optimal model performance metrics also at lower GWs. Notably, nnU-Net (CNN 2) is optimized for general-purpose brain extraction, as is the Swin-UNETR; thus, a more “CNN architecture-general” advantage seems to arise, with CNN 2 being less affected by (a) the lower number of scans at the lower tail of the GW distribution (GWs 21–23) and (b) the substantial differences in fetal brain anatomic landmarks between the midfetal period (GWs ≤ 25) and the late-fetal period (GWs > 25) (23).
Conversely, for GWs at the upper tail of the GW sample distribution, the Swin-UNETR best model showed significantly higher performance than CNN 2 and comparable performance with the reference standard model, most likely due to the emergence of features discriminating the fetal brain from maternal abdominal tissue, resembling a “more general” brain extraction task. Taken together, these results suggest that, for a fetal brain extraction task, the sensitivity of model performance is closely linked to the representativeness of scans across different GWs. This, in turn, is intricately tied to the diverse features exhibited in fetal images, particularly in delineating boundaries between the fetal brain and maternal abdomen, changes in fetal brain size, and contrast variability in the rs-fMRI scans, especially in the midfetal period. Therefore, augmenting dataset representativeness could potentially lead to more comparable results between transformer-based and CNN architectures across the full range of GWs, augmenting their generalization capabilities (24).
Interestingly, performance differences between models at GWs in the lower tail of the GW sample distribution were less evident for the external dataset. Furthermore, for GWs in the midfetal period, 75% of the comparisons between performance metrics on the internal and external test sets showed significantly higher performance of all models on the internal dataset, while in the late-fetal period the percentage dropped to 50%, highlighting a general model sensitivity to dataset type and GW representativeness within each dataset. This finding is further confirmed by the significantly higher model performance differences observed at earlier GWs when using the internal test set, which vanished for the external test set, where differences instead remained only at later GWs, with the Swin-UNETR best model and CNN 2 still showing better performance.
This study had some limitations. First, images were not consistently and homogeneously represented across the GW range considered, especially at lower GWs (GW < 25). This lack of lower GW representativeness could affect model performance and generalizability and introduce potential biases toward better-represented GWs. Second, our sample can be considered relatively small for deep learning applications in radiology, and the unavailability of open-source databases containing fetal functional MR images limits external validation, preventing us from fully addressing the advantages or disadvantages associated with either architecture. Thus, future research should aim to create a large, publicly available fetal resting-state database with balanced and representative images across GWs to enhance the robustness and generalization of deep learning models for fetal resting-state brain extraction tasks across the entire gestational period.
In conclusion, future exploration and development of transformer-based models for this specific fetal brain extraction task, using ad hoc hyperparameter fine-tuning, contrast-enhancing approaches, and GW-specific data augmentation procedures, holds great potential for translational research on prenatal clinical applications with smaller datasets. In addition, this study adds compelling evidence on the generalization and cross-validation of deep learning architectures for a fetal brain extraction task at rs-fMRI, as it uses all of the most representative architectures and compares model performances at different fetal stages of development (see Ciceri et al [25] for a review) while using datasets with different acquisition characteristics and GW representativeness in their sample distributions.
Acknowledgments
The authors would like to thank the reviewers for their exceptionally detailed, thoughtful, and constructive feedback and all the participants for their involvement and motivation. We also thank the Service Center for Research (CSTOR) at IRCCS San Raffaele Research Institute for their support with the IT setup and implementation.
A.C. and C.B. are co–senior authors.
Supported by the Italian Ministry of Health’s Ricerca Finalizzata 2016 grant (RF-2016-02364081) (principal investigator, P.A.D.R.).
Disclosures of conflicts of interest: N.P. No relevant relationships. P.A.D.R. Grant from Italian Ministry of Health’s “Ricerca Finalizzata 2016” (grant RF-2016-02364081), principal investigator. M. Canini Grant from Italian Ministry of Health’s “Ricerca Finalizzata 2016” (grant RF-2016-02364081). G.N. No relevant relationships. P.S. No relevant relationships. P.I.C. No relevant relationships. M. Candiani Grant from Italian Ministry of Health’s “Ricerca Finalizzata 2016” (grant RF-2016-02364081). A.F. No relevant relationships. A.C. No relevant relationships. C.B. Grant from Italian Ministry of Health’s “Ricerca Finalizzata 2016” (grant RF-2016-02364081).
Abbreviations:
- BAHD = balanced average Hausdorff distance
- CNN = convolutional neural network
- DSC = Dice similarity coefficient
- GAN = generative adversarial network
- GEE = generalized estimating equation
- GW = gestational week
- rs-fMRI = resting-state functional MRI
- UNETR = U-Net transformer
References
1. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; 11976–11986. https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html. Accessed March 28, 2023.
2. Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ. Deep Learning for Brain MRI Segmentation: State of the Art and Future Directions. J Digit Imaging 2017;30(4):449–459.
3. Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Z Med Phys 2019;29(2):102–127.
4. Vaswani A, Shazeer N, Parmar N, et al. Attention is All you Need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Curran Associates. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Published 2017. Accessed March 28, 2023.
5. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 1810.04805 [preprint] https://arxiv.org/abs/1810.04805. Posted October 11, 2018. Accessed March 28, 2023.
6. Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J Mach Learn Res 2023;23(1):5232–5270.
7. Shamshad F, Khan S, Zamir SW, et al. Transformers in Medical Imaging: A Survey. arXiv 2201.09873 [preprint] https://arxiv.org/abs/2201.09873. Posted January 24, 2022. Accessed March 28, 2023.
8. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2010.11929 [preprint] https://arxiv.org/abs/2010.11929. Posted October 22, 2020. Accessed March 28, 2023.
9. Bello I, Zoph B, Vaswani A, Shlens J, Le QV. Attention Augmented Convolutional Networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019; 3286–3295. https://openaccess.thecvf.com/content_ICCV_2019/html/Bello_Attention_Augmented_Convolutional_Networks_ICCV_2019_paper.html. Accessed March 28, 2023.
10. Vaswani A, Ramachandran P, Srinivas A, Parmar N, Hechtman B, Shlens J. Scaling Local Self-Attention for Parameter Efficient Visual Backbones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; 12894–12904. https://openaccess.thecvf.com/content/CVPR2021/html/Vaswani_Scaling_Local_Self-Attention_for_Parameter_Efficient_Visual_Backbones_CVPR_2021_paper.html. Accessed March 28, 2023.
11. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth H, Xu D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. arXiv 2201.01266 [preprint] https://arxiv.org/abs/2201.01266. Posted January 4, 2022. Accessed March 28, 2023.
12. Tang Y, Yang D, Li W, et al. Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022; 20698–20708.
13. Hatamizadeh A, Tang Y, Nath V, et al. UNETR: Transformers for 3D Medical Image Segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022; 574–584. https://openaccess.thecvf.com/content/WACV2022/html/Hatamizadeh_UNETR_Transformers_for_3D_Medical_Image_Segmentation_WACV_2022_paper.html. Accessed March 28, 2023.
14. Fu J, Liu J, Tian H, et al. Dual Attention Network for Scene Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019; 3146–3154. https://openaccess.thecvf.com/content_CVPR_2019/html/Fu_Dual_Attention_Network_for_Scene_Segmentation_CVPR_2019_paper.html. Accessed March 28, 2023.
15. Wang X, Girshick R, Gupta A, He K. Non-Local Neural Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018; 7794–7803. https://openaccess.thecvf.com/content_cvpr_2018/html/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.html. Accessed March 28, 2023.
16. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18(2):203–211.
17. Tran CBN, Nedelec P, Weiss DA, et al. Development of Gestational Age-Based Fetal Brain and Intracranial Volume Reference Norms Using Deep Learning. AJNR Am J Neuroradiol 2023;44(1):82–90.
18. Xue Y, Xu T, Zhang H, Long LR, Huang X. SegAN: Adversarial Network with Multi-scale L1 Loss for Medical Image Segmentation. Neuroinformatics 2018;16(3-4):383–392.
19. Pecco N, Canini M, Mosser KHH, et al. RS-FetMRI: a MATLAB-SPM Based Tool for Pre-processing Fetal Resting-State fMRI Data. Neuroinformatics 2022;20(4):1137–1154.
20. Reis HC, Turk V, Khoshelham K, Kaya S. MediNet: transfer learning approach with MediNet medical visual database. Multimedia Tools Appl 2023;82(25):1–44.
21. Aydin OU, Taha AA, Hilbert A, et al. On the usage of average Hausdorff distance for segmentation performance assessment: hidden error when used for ranking. Eur Radiol Exp 2021;5(1):4. [Published correction appears in Eur Radiol Exp 2022;6(1):56.]
22. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging 2015;15(1):29.
23. Kostović I, Katušić A, Kostović Srzentić M. Chapter 19 - Linking histology and neurological development of the fetal and infant brain. In: Martin CR, Preedy VR, Rajendram R, eds. Factors Affecting Neurodevelopment: Genetics, Neurology, Behavior, and Diet. Academic Press, 2021; 213–225.
24. Khosla C, Saini BS. Enhancing Performance of Deep Learning Models with different Data Augmentation Techniques: A Survey. In: 2020 International Conference on Intelligent Engineering and Management (ICIEM), London, UK. IEEE, 2020; 79–85.
25. Ciceri T, Squarcina L, Giubergia A, Bertoldo A, Brambilla P, Peruzzo D. Review on deep learning fetal brain segmentation from Magnetic Resonance images. Artif Intell Med 2023;143:102608.