FastSurferVINN: Building resolution-independence into deep learning segmentation methods—A solution for HighRes brain MRI

Leonie Henschel; David Kügler; Martin Reuter

doi:10.1016/j.neuroimage.2022.118933

. Author manuscript; available in PMC: 2022 Dec 30.

Published in final edited form as: Neuroimage. 2022 Feb 3;251:118933. doi: 10.1016/j.neuroimage.2022.118933

FastSurferVINN: Building resolution-independence into deep learning segmentation methods—A solution for HighRes brain MRI

Leonie Henschel ^a,¹, David Kügler ^a,¹, Martin Reuter ^a,^b,^c,^*

PMCID: PMC9801435 NIHMSID: NIHMS1849831 PMID: 35122967

Abstract

Leading neuroimaging studies have pushed 3T MRI acquisition resolutions below 1.0 mm for improved structure definition and morphometry. Yet, only few, time-intensive automated image analysis pipelines have been validated for high-resolution (HiRes) settings. Efficient deep learning approaches, on the other hand, rarely support more than one fixed resolution (usually 1.0 mm). Furthermore, the lack of a standard submillimeter resolution as well as limited availability of diverse HiRes data with sufficient coverage of scanner, age, diseases, or genetic variance poses additional, unsolved challenges for training HiRes networks. Incorporating resolution-independence into deep learning-based segmentation, i.e., the ability to segment images at their native resolution across a range of different voxel sizes, promises to overcome these challenges, yet no such approach currently exists. We now fill this gap by introducing a Voxel-size Independent Neural Network (VINN) for resolution-independent segmentation tasks and present FastSurferVINN, which (i) establishes and implements resolution-independence for deep learning as the first method simultaneously supporting 0.7–1.0 mm whole brain segmentation, (ii) significantly outperforms state-of-the-art methods across resolutions, and (iii) mitigates the data imbalance problem present in HiRes datasets. Overall, internal resolution-independence mutually benefits both HiRes and 1.0 mm MRI segmentation. With our rigorously validated FastSurferVINN we distribute a rapid tool for morphometric neuroimage analysis. The VINN architecture, furthermore, represents an efficient resolution-independent segmentation method for wider application.

Keywords: Computational neuroimaging, Deep learning, Structural MRI, Artificial intelligence, High-resolution

1. Introduction

While neuroimaging pipelines have benefited substantially from the standardization of Magnetic Resonance Imaging (MRI) at 1.0 mm, the resulting fixed-resolution paradigm now hinders transition to high-resolution (HiRes) MRI. With the hope of advancing quantification of structural detail, increasing explanatory power, and improving our understanding of the brain in health and disease (Glasser et al., 2013; Mellerio et al., 2014; Solano-Castiella et al., 2011; Stankiewicz et al., 2011; Wattjes et al., 2006; Zaretskaya et al., 2018), leading large-cohort neuroimaging studies have started to acquire structural MRI at 3T field strength and 0.7–0.9 mm resolutions (see Section 2.1). However, the lack of reference segmentations and limited diversity of HiRes MRI datasets (e.g. regarding scanner, disease, genetic variation) lead to substantial limitations for bias-free method development and validation. Additionally, since no de-facto standard resolution exists for HiRes imaging, neuroimaging tools introducing HiRes processing (Bazin et al., 2014; Gaser and Dahnke, 2016; Huntenburg et al., 2018; Yushkevich et al., 2014; Zaretskaya et al., 2018) have to provide resolution-independence instead of following the fixed-resolution paradigm. Although Convolutional Neural Networks (CNNs) deliver convincing performance under the fixed-resolution paradigm (Chen et al., 2018; Coupé et al., 2020; Henschel et al., 2020; Huo et al., 2019; Ito et al., 2019; McClure et al., 2019; Mehta et al., 2017; Roy et al., 2019; Sun et al., 2019; Wachinger et al., 2018), no methodological solution leverages explicit knowledge of the native image resolution, consequently limiting all output segmentations to one pre-defined voxel size and potentially ignoring important structural detail specifically for submillimeter scans. By introducing Voxel-size Independent Neural Networks (VINNs), we now leverage the diversity of widely available 1.0 mm MRIs and enrich the model with details derived from HiRes MRI, achieving not only resolution-independence but improving segmentation performance across resolutions.

For resolution-independent deep learning, we establish two core requirements: 1. Native-resolution segmentation: the network’s input and output for training and crucially inference should be at the native resolution, to avoid any external resampling. 2. Resolution-independence: the network should be able to learn and predict from images at a range of different resolutions. Segmentation for HiRes images additionally aims to improve the quality of fine structures (e.g. narrow sulci, gyri and white matter (WM) details). To achieve high-quality whole brain segmentation and avoid training biases, a neural network should generalize to various resolutions (seen and unseen during training) and datasets with different characteristics (e.g. with respect to scanners, demographics, diseases, genetic variation) ideally sharing and transferring knowledge between resolutions.

While some traditional neuroimaging pipelines fulfill these requirements (Bazin et al., 2014; Gaser and Dahnke, 2016; Huntenburg et al., 2018; Yushkevich et al., 2014; Zaretskaya et al., 2018), no related work directly addresses these challenges with deep learning. Fig. 1 illustrates two adaptations to form baseline solutions: A. dedicated fixed-resolution networks and B. a single resolution-ignorant CNN that accepts multiple resolutions through training. Training one dedicated network per resolution (A.) trades the potential of a larger, more diverse training corpus for compatibility with the native resolution raising bias and generalization limitations (there are, for example, currently no HiRes neurodegeneration datasets). However, a fixed-resolution network trained and evaluated on its native resolution represents an upper bound for achievable performance under same-size training datasets. On the other hand, if one network is naïvely trained on multiple resolutions (B.), each convolutional layer has to learn to generalize across scales as the network cannot easily differentiate between different voxel sizes (i.e. resolution-ignorant), which allocates network capacity to this task. To support this process and reduce potential bias from missing or unbalanced resolution data, one can add external scale augmentation (+exSA) by resampling individual images and reference labels during training. This, however, induces information loss and interpolation artefacts, e.g. from lossy nearest-neighbour (NN) interpolation of discrete label maps. Since neither approach (A. or B.) has been implemented or compared for submillimeter whole brain segmentation, we introduce respective baseline models utilizing our proposed, optimized micro-architecture in all models for a fair comparison.

Fig. 1. — Resolution-independence in deep learning networks: A. Dedicated fixed-resolution convolutional neural networks (CNNs) only work on the resolution they are trained on and are limited by the availability and quality of corresponding datasets. B. One single resolution-ignorant CNN can learn to segment multiple resolutions by training on a diverse dataset. External scale augmentation (+exSA, B. left) simulates resolutions with few or no training cases by resampling the image and the reference segmentation map. Here, however, lossy interpolation and resulting artefacts, especially from nearest-neighbour interpolation of discrete label maps, may result in a loss of structural details and sub-optimal performance. C. Our voxel size independent neural network (VINN) avoids interpolation of the images and discrete labels by integrating the interpolation step into the network architecture. Further, the explicit transition from the native resolution to a normalized internal resolution facilitates an understanding of the difference between image features (MultiRes blocks with distances measured in voxels) and anatomical features (FixedRes inner blocks with normalized distances).

To overcome limitations of both approaches (A. and B.), we propose a VINN with innovations including its micro-architecture and the addition of a HiRes loss. The core contribution, however, is the network-integrated resolution-normalization to support native segmentation at various voxel sizes. In fact, any UNet-based architecture (Ronneberger et al., 2015) is, by design, a multi-scale approach where pooling operations represent fixed-factor integer down- and up-scale transitions (usually by the scale factor 2). Our network-integrated resolution-normalization in VINN now replaces this fixed scale transition with a flexible re-scaling for the first and last scale transitions. This has the advantage of placing our interpolation operation at a position where information loss naturally occurs (down- or up-scaling via pooling). As illustrated in Fig. 1C., we retain compatibility with a range of resolutions for input and output (MultiRes) by shifting the interpolation into the architecture itself. At the same time, we leverage the lower variance of perceived size differences in the inner normalized resolution blocks (FixedRes). This has the advantages of (i) retaining important image information at the native resolution in the MultiRes blocks, (ii) interpolating multi-dimensional continuous feature maps rather than discrete labels or single slice images hence avoiding lossy NN interpolation and extending contextual neighbourhood information during the resolution-normalization, and (iii) disentangling perceived voxel versus actual structure size differences inside the VINN.

Especially the last point may have been underappreciated so far: Due to the nature of convolutional layers, CNNs ‘perceive’ distances and thus structure sizes by number of voxels rather than millimeters. Therefore, the original voxel size impacts perceived distances for baseline architectures, requiring them to cover larger size variety during training and inference in the multi-resolution context. The VINN, on the other hand, transitions from a voxel-based distance context in the MultiRes blocks to a standardized distance context in FixedRes via the network-integrated resolution-normalization and, therefore, releases network capacity in the inner layers for other tasks.

1.1. Contributions

All in all, our VINN, for the first time, effectively addresses the challenges associated with HiRes MRI (i.e. multiple resolutions and reduced variety in datasets) in a single framework. We extensively test variations of the network architectures and demonstrate that inherent biases introduced by the unavailability of certain HiRes scans are reduced by transferring information across resolutions via our network-integrated resolution-normalization.

Specifically, we show that our VINN

segments 3T brain MRI at their native resolutions (0.7–1.0 mm) into 95 regions in less than 1 min on the GPU and generalizes robustly to unseen resolutions within and beyond the training corpus,
significantly outperforms state-of-the-art scale augmentations as well as fixed-resolution models with respect to segmentation accuracy and generalizability with an optimized architecture, and
improves accuracy by combining and leveraging both the increased structural information from submillimeter 3T brain scans and the generalizability to intensity, scanner, disease, and other variations from standard 1.0 mm MRIs.

2. Related work

2.1. High-resolution MRI

In neuroimaging, spatial resolution is of great importance as the available voxel size directly dictates the degree to which fine-scale subcortical and cortical structures, specifically narrow gyri and sulci, can be resolved in an MRI. In particular, the diversity of tissue types in a single voxel, which heavily depends on its size, influences the signal intensity known as the partial volume effect (PVE). A strong partial volume effect (PVE) specifically complicates the delineation of tissue borders on the voxel grid (Glasser et al., 2013; Zaretskaya et al., 2018). HiRes images offer finer sampling of the underlying information, thereby directly reduce PVE and enable more accurate segmentations, improved volume-based measurements, morphometry, surface placement, and derived thickness measures (see Fig. 2) (Glasser et al., 2013; Luesebrink et al., 2013; Zaretskaya et al., 2018).

Fig. 2. — Image resolution affects the detail of discrete segmentation label maps and derived measures such as surface models and thickness. A. The low-resolution image is less detailed and causes partial volume effects (PVEs) by accumulating signals across tissue boundaries into larger voxels, whereas B. High-resolution images and derived segmentations allow more precise region delineation and capture details, e.g. for improved shape or thickness analysis (white arrows).

It is, therefore, not surprising that HiRes MRIs are becoming increasingly popular within the neuroimaging community. While most established large-scale neuroimaging studies acquire data at 1.0 mm, the current and next generation studies are shifting to submillimeter resolutions (e.g. Human Connectome Project (HCP) (Glasser et al., 2013), Rhineland Study (RS) (Breteler et al., 2014), Autism Brain Imaging Data Exchange II (ABIDE-II) (Di Martino et al., 2017), TRACK-PD (Wolters et al., 2020)). However, to date the consolidated superset of publicly available 3T HiRes neuroimages is sparse, unbalanced, and heterogeneous with respect to available submillimeter resolutions (e.g. HCP: 0.7 and 0.8, ABIDE-II: 0.7-0.9, RS: 0.8). Especially the limited data variety at each specific resolution poses a real challenge for data-driven computational methods, which translates into limited compatibility with scanners, age spans, and especially disease groups. Moreover, unbalanced training data can easily lead to the introduction of biases into the model.

While suffering from stronger PVEs and ensuing detrimental effects on segmentation accuracy, the available collection of standardized 1.0 mm images, on the other hand, is large and diverse. Consequently, a wide coverage of age-groups, diseases, genetic variants, and scanners can be retrieved from the rich reservoir of openly available MRI data sources (e.g. OpenfMRI database (Poldrack et al., 2013; Poldrack and Gorgolewski, 2014), OpenNeuro (Markiewicz et al., 2021), NITRC-IR²). Furthermore, manual reference labels for validation are exclusively published openly for 1.0 mm (Klein and Tourville, 2012). To address dataset sparsity and bias at submillimeter resolutions, we believe data-driven computational methods require built-in resolution-independence. Only a single model that spans across the available resolutions can simultaneously provide benefits for both HiRes and standard 1.0 mm image analysis.

2.2. Automated analysis of high-resolution images

Traditionally, common neuroimaging pipelines have been developed and optimized for 1.0 mm voxels (Fischl et al., 2002; Friston et al., 2007; Gaser and Dahnke, 2016; Jenkinson et al., 2012; Zhang et al., 2001) representing the de-facto standard for years. FreeSurfer offers a validated HiRes stream (Zaretskaya et al., 2018), which provides sub-segmentation of the cortex into 31 structures per hemisphere (DKTatlas) (Klein and Tourville, 2012). However, one problem here is the extended processing time caused by the cubic voxel increase – a common issue limiting applicability of traditional tools to large cohort studies.

The introduction of CNNs for whole brain segmentation has substantially reduced processing times to seconds on the GPU. Recent works employ both 2.5D and 3D UNet architectures (Billot et al., 2020; Chen et al., 2018; Coupé et al., 2020; Henschel et al., 2020; Huo et al., 2019; Iglesias et al., 2021; Ito et al., 2019; McClure et al., 2019; Mehta et al., 2017; Roy et al., 2019; Sun et al., 2019; Wachinger et al., 2018). Since GPU memory limitations render full volume 3D models impractical specifically for higher number of feature channels and output classes, top performing methods process the volume in slices (QuickNat (Roy et al., 2019), FastSurfer (Henschel et al., 2020)) or in large patches (DeepNat (Wachinger et al., 2018), SLANT (Huo et al., 2019), AssemblyNet (Coupé et al., 2020)) and then leverage aggregation schemes to recombine predictions into the full volume.

Despite their success for brain segmentation and other applications no work has introduced deep learning for submillimeter whole brain segmentation. Since CNNs require a large and diverse number of volume and segmentation pairs for effective training, the missing availability of diverse training data hinders this resolution transition. Additionally, the cubic relationship between resolution increase and GPU memory requirements puts memory hungry 3D architectures at a disadvantage, e.g. factor of 2.92 = (1/0.7)³ memory demand for a 1 to 0.7 mm reduction of voxel sizes.

In the past, multi-branch segmentation frameworks (Fu et al., 2018; Gerard et al., 2020; Gu et al., 2018; Kamnitsas et al., 2017; Liu et al., 2020; van Rijthoven et al., 2021; Wang et al., 2017; Xu et al., 2021; Yang and Peng, 2018; Zheng et al., 2021) have been used to avoid memory issues while simultaneously leveraging HiRes information. Here, multiple potentially cross-linked pathways are dedicated to specific down-scaled or cropped versions of the original image. The same principle is employed in other scale-aware networks (Chen et al., 2016; Huang et al., 2021; Li et al., 2019) by implementing more trans-scale connections or combinations of different dilation and kernel sizes. To further avoid sub-optimal equal weighting of different scale information, the networks often include attention mechanisms (Chen et al., 2016; Qin et al., 2018; Xu et al., 2021; Yang and Peng, 2018; Zheng et al., 2021). In practice, these methods are only compatible with the discrete input and sub-level resolutions they were trained on, leaving subvoxel scaling more common in neuroimaging data (0.7/0.8/1) unexploited. In fact, their intend is to explicitly integrate information from multiple image scales rather than achieve multi-resolution compatibility. Critically, in neuroimaging settings the true scale is known (image resolution) while scale-aware architectures assume this knowledge is not available. In VINN we inject this explicit knowledge into the network at the spatial resolution normalization step.

2.3. Resolution-independence in deep learning

Built-in resolution-independence in CNNs has not been described for brain segmentation nor – to our knowledge – for any other segmentation tasks. Approaches such as (Billot et al., 2020; Iglesias et al., 2021) pre-sample input images (with associated reliability maps) to a common resolution (here 1.0 mm) and provide outputs there, which makes them inherently fixed-resolution techniques. While they can provide 1.0 mm segmentations (and even images) for lower resolutional clinical scans via heavy augmentation, they can neither profit from submillimeter details, nor provide native HiRes segmentations. A transfer to higher resolutions would require retraining with a fixed submillimeter training set. This is, however, problematic as (i) no standard submillimeter resolution exists, meaning one would either need to train multiple versions or focus on the highest available resolution to retain input resolution-independence, (ii) HiRes datasets, as mentioned before, demonstrate low subject variance (disease, age, genetics) making this approach susceptible to training biases, and (iii) upscaling the 3D-UNet architecture to 0.7 mm would require 2.9-times as much GPU memory surpassing memory limits.

Different from segmentation networks, state-of-the-art super-resolution networks aim specifically at reconstructing a (fixed resolution) HiRes image based on a low-resolution (LowRes) input, also requiring ground truth at the higher resolution. They often rely on pre-sampling, i.e. using interpolation-based up-sampling methods to initialize the desired output grid and fine-tune the features in the following network (Dong et al., 2016; Kim et al., 2016a; 2016b). Using augmentation, these networks are trained to restore images coming from various lower resolutions via a single model, similar to the discussed segmentation strategy (Fig. 3B.). While showing better performance than dedicated single-scale models (fixed input and fixed output resolution), pre-sampling significantly increases the computational complexity and cost of the architectures due to the increase in image size. To overcome this limitation, interpolation-based post-sampling has been introduced recently (Alao et al., 2021; Shen et al., 2021). Here, the up-sampling step is shifted towards the end of the network architecture and performs the interpolation in the latent space. Interestingly, this post-sampling approach is as effective or even superior to the pre-sampling methods, while maintaining versatility with respect to the chosen output scale (Alao et al., 2021; Shen et al., 2021). All these super-resolution architectures, in addition to not being aimed at segmentation, differ from our approach by transitioning once, from the input resolution to an output resolution. In contrast, we insert two latent-space interpolation blocks transitioning both ways between the native and the inner resolution.

Fig. 3. — Voxel size independence in FastSurferVINN. Flexible transitions between resolutions become possible by replacement of (un)pooling with our network-integrated resolution-normalization (green) after the first encoder (pre-IDB) and before the last decoder block (post-CDB). Scale transitions between the other competitive dense blocks (CDB) remain as standard MaxPool and UnPool operations. Each CDB is composed of four sequences of parametric rectified linear unit (PReLU), convolution (Conv) and batch normalization (BN). In the first two encoder blocks ((pre)-IDB), the PReLU is replaced with a BN to normalize the inputs.

More generally, multi-source domain adaptations (MSDAs) leverage available data with different underlying distributions (i.e. resolutions in our case). This field of research focuses on enhancing the generalization ability of a model by transferring knowledge between resource-full source domains (i.e. LowRes data) to a sparsely represented target domain (i.e. HiRes data) by transforming either the features in the latent space or images on a pixel-level. In contrast to our resolution-normalization, the latent space transformations in MSDAs are based on optimizing a discrepancy (Guo et al., 2018; Peng et al., 2019; Zhu et al., 2019) or adversarial loss (Li et al., 2019; Wang et al., 2019; Xu et al., 2018). These latent space alignments are, however, often insufficient for segmentation tasks due to their focus on high-level information only (Zhao et al., 2019). To circumvent this problem, intermediate domain generators have been proposed and successfully applied for semantic segmentation (Hoffman et al., 2018; Russo et al., 2019; Zhao et al., 2019). Here, a pixel-level alignment between source and target domain is learned through Generative adversarial networks. However, these MSDA methods are limited to a single target distribution. Extension to multi-targets is a relatively unexplored area with only a few published methods for classification tasks so far (Chen et al., 2019; Gholami et al., 2020; Jin et al., 2020; Liu et al., 2020; Peng et al., 2019; Roy et al., 2021; Yang et al., 2020).

Spatial transformers (Jaderberg et al., 2015) represent another technology relying on internal interpolations as an important building block. Here, spatial invariance (registration) is targeted via a learnable affine transformation inside the network (localisation network). After the source feature map coordinates are computed (grid calculator), intensity values for each target pixel are determined via bi-linear interpolation (sampler). Spatial transformers hence attempt to implicitly learn data representations in a resolution-ignorant way. While our approach shares grid calculation and interpolation within the network with spatial transformers, our training approach is different: Instead of a localisation network, we directly determine the sampling-grid based on the input scale factors, i.e. the ratio between the native input resolution and the desired normalized inner resolution. Hence, we explicitly integrate knowledge about the image resolutions into the architecture. As a result, computational complexity is reduced while still achieving desired resolution-independence.

3. Material and methods

3.1. Datasets

The following three submillimeter MRI datasets were selected for training, testing, and validation of FastSurferVINN. An extended list of the used 1.0 mm datasets can be found in the Section A.3 including a tabulated overview of all used datasets (Table 3). When not specifically mentioned otherwise, all sets are balanced for gender, age, and study. Participants of the individual studies gave informed consent in accordance with the Institutional Review Board at each of the participating sites. Complete ethic statements are available at the respective study webpages.

ABIDE-II

The Autism Brain Imaging Data Exchange II Di Martino et al. (2017) contains cross-sectional data and focuses on autism spectrum disorders. The dataset contains 1114 subjects from 19 different institutions in total and is accessible online³. The 3D magnetization prepared rapid gradient echo (MPRAGE) sequence, or a vendor specific variant, was used to acquire all data using 3T GE, Philips, and Siemens scanners. Voxel resolutions are not standardized across sites although most scans are acquired at 1.0 mm. In this work we use only the ETH_1 sub-cohort, which provides 0.9 mm HiRes scans acquired on a 3T Philips Achieva with a repetition time (TR) of 3 ms, an echo time (TE) of 3.90 ms, an inversion time (TI) of 1.15 ms, and flip angle of 8°. Since HiRes datasets often exclusively feature Siemens scanners, a subset of 25 HiRes scans from ABIDE-II ETH_1 covering an age range of 20–31 years serves as an independent test set to evaluate generalizability to this unseen Philips scanner and 0.9 mm resolution.

HCP

The Human Connectome Project Young Adult (Van Essen et al., 2012) is a cross-sectional study including 3T MRIs from 1200 healthy participants acquired on a customized Siemens scanner (Connectome Skyra). It provides 0.7 mm isotropic de-faced scans of individuals between ages 22 and 35. All scans follow the same MPRAGE protocol with TR 2.4 s, TE 2.14 ms, TI 1 s, and flip angle of 8°. The full dataset is available online.⁴ The Human Connectome Lifespan Pilot Project (Phase 1a)⁵ is an extension of the Young Adult project and contains imaging data from five age groups. Five participants per age group 25–35, 45–55, and 65–75 were scanned on a 3T Siemens Connectome scanner at an isotropic voxel resolution of 0.8 mm following the Young Adult protocol except for a slightly smaller TE of 2.12 ms. In the present study, 30 cases from the Young Adult dataset are used for network training and 20 for validation. A total of 80 cases are used in the final test set. Further, 10 scans from the Lifespan Pilot Project are assembled into a separate test set to assess generalizability to another dataset at 0.8 mm in Section 4.4.

Rhineland Study

The Rhineland Study (Breteler et al., 2014) is a large cohort population dataset spanning ages 30 to 90. The 0.8 mm isotropic T1-weighted MRI data is acquired on a 3T Siemens Magnetom Prisma scanner using a multi-echo MPRAGE (ME-MPRAGE) sequence with TR 2560 ms, 10 TEs (1.68 ms, 3.29 ms, 4.90 ms, 6.51 ms, 6 × 5.0 ms), TI 1100 ms, and flip angle 7°. Two separate sets of 30 and 20 subjects are selected for network training and validation, respectively. The final testing set contains another 80 RS subjects.

All datasets were processed using the open source neuroimage analysis suite FreeSurfer⁶ Fischl (2012); Fischl et al. (2002). In particular, the FreeSurfer v7.1.1 HiRes stream (Zaretskaya et al., 2018) was used to generate the desired parcellations following the “Desikan–Killiany–Tourville” (DKT) protocol atlas (Desikan et al., 2006; Klein and Tourville, 2012). We follow the same label mapping approach as in FastSurferCNN (Henschel et al., 2020). In short, identical cortical regions on the left and right hemisphere are joined into one class unless they are in close proximity to each other reducing the total number of labels from 95 (DKT without corpus callosum segmentations) to 78 during network training. The affiliation to the left or right hemisphere are restored in the final prediction by estimating the closest WM centroid (left or right hemisphere) to each label cluster. A list of all segmentation labels is provided in the Appendix (see Table 5). In accordance with the FreeSurfer HiRes stream, all MRI brain volumes are conformed to a uniform coordinate orientation at their respective isotropic voxel resolutions and robustly normalized to unsigned characters (0… 255), i.e. the trained network does not depend on skull stripping or bias-field removal.

3.2. Voxel size independent neural network

As mentioned above, inside a standard UNet (Ronneberger et al., 2015), fixed transitions between scales (i.e. resolutions) are performed via down- and up-sampling operations (e.g. pooling and unpooling, see Fig. 3), naturally leading to a reduction in information. Theoretically, any scale-transition operation is replaceable with an alternative sampling strategy as long as it still allows gradients to flow effectively through the network. In FastSurferVINN, we use this concept to enforce voxel size independence by changing the first level transition in the encoder and the last in the decoder to a flexible interpolation step (i.e. network-integrated resolution-normalization, see Fig. 3). Thus, variable transitions between resolutions without restrictions to pre-defined fixed voxel sizes become possible, both during training and inference.

3.2.1. Network-integrated resolution-normalization

Similar to spatial transformers (Jaderberg et al., 2015), the interpolation-based scale transition can be divided into two parts: (i) calculation of the sampling coordinates (grid generator) and (ii) interpolation operation (sampler) to retrieve the spatially transferred output map.

The sampling coordinates are calculated based on the scale factor $S F \in R^{+}$ – the quotient of the resolution information of the inner normalized scale Res_inner (a tuneable hyperparameter set to 1.0 mm throughout our experiments) and the input image Res_native. For optimal interpolation, SF is slightly adjusted to ensure integer feature map dimensions. In the first transition step (Fig. 3, pre-IDB to IDB), the output feature map $V \in R^{H_{inner} \times W_{inner} \times C}$ is produced by sampling the input feature map $U \in R^{H_{native} \times W_{native} \times C}$ . in the final transition step (Fig. 3, CDB to post-CDB), this process is reversed effectively by using the inverse scale factor 1/SF. The interpolation itself is performed by applying a sampling kernel to the input map U to retrieve the value at a particular pixel in the output map V. Different interpolation strategies can be defined based on the selection of the sampling kernel. Theoretically, any kernel with definable (sub-)gradients is applicable. Here, we evaluate the bi-linear, bi-cubic, area, and integer sampling kernels (=NN interpolation). The sampling is identical for each channel, hence, conserving the spatial consistency.

3.2.2. Network architecture modifications

The proposed interpolation strategy can, in general, be included in any CNN equipped with pooling-based scale transitions. Here, due to its success in neuroanatomical segmentation, the principal network design is based on FastSurferCNN (Henschel et al., 2020) – a UNet-type network with a series of four competitive dense blocks (CDB) in the encoder and decoder arm separated by a CDB bottleneck layer. In FastSurferVINN (FastSurferVINN) one additional CDB layer is added (Fig. 3 to each arm, i.e. the pre-IDB and post-CDB).

CDB design

In FastSurferCNN (Henschel et al., 2020), a CDB is formed by repetitions of the basic composite function consisting of a 5 × 5 convolution, followed by batch-normalization (BN) and a probabilistic rectified linear unit (pReLU) activation function. In this work we optimize the architecture design slightly by replacing each 5 × 5 convolution kernel with two 3 × 3 kernels (see Fig. 3). This keeps the effective receptive field size within each block identical to FastSurferCNN while reducing parameter load. We implement this change in FastSurferVINN and also, for better comparability, in an updated FastSurferCNN version denoted by: FasturferCNN*. An ablation study detailing the changes from FastSurferCNN to FasturferCNN* is included in the appendix (Section A.2). As in FastSurferCNN, feature competition is achieved by using maxout (Goodfellow et al., 2013) instead of concatenations (Jégou et al., 2017) in the local skip connections. In order to guarantee normalized inputs to the maxout activation, the feature map stacking operation is always performed after the BN (see position of maxout in CDB design in Fig. 3).

Pre-IDB

The additional encoder block in FastSurferVINN (see Fig. 3, pre-IDB) transfers image intensity information from the native image to the latent space and encodes voxel size-dependent information before the internal interpolation step. In contrast to the described CDB, the raw inputs are normalized by first passing them through a BN-Conv-BN combination before adhering to the original composite function scheme (Conv-BN-pReLU) (see Fig. 3, (pre-)IDB).

Post-CDB

Akin to the pre-IDB, an additional CDB block in the decoder is used to merge the non-interpolated feature information returned from the pre-IDB skip connection and the upsampled feature maps from the network-integrated resolution-normalization step. Both maps are combined via a concatenation operation and then fed to a standard CDB block (see Fig. 3, (post-)CDB). After the final 1 × 1 convolution a softmax operation returns the desired class probabilities.

3.3. High-res network modifications

To improve segmentation accuracy of detailed structural features, we explore two network modifications, namely a loss-function weighting scheme and an adaptive attention mechanism.

Loss function

The network is trained with a weighted composite loss function of logistic loss and Dice loss (Roy et al., 2017). With p_l,i(x) the estimated probability of pixel i to belong to class l and the corresponding ground truth probability y, the loss function can be formulated as

𝓛 = \underset{Logistic loss}{\underset{︸}{- \sum_{l, i} ω_{i} y_{l, i} \log p_{l, i} (x)}} - \underset{Soft Dice loss}{\underset{︸}{\sum_{l} \frac{2 \sum_{i} p_{l, i} (x) y_{l, i}}{\sum_{i} p_{l, i} (x) + \sum_{i} y_{l, i}}}}

(1)

with ω_i = ω_{median freq.} + ω_gradient + ω_GM + ω_WM/Sulci.

Localized weights

Here, ω_{median freq.} represents median frequency balancing and ω_gradient boundary refinement through a 2D gradient vector (Roy et al., 2017). We now extend (ωi by two weighting terms (ω_GM and ω_WM/Sulci) to improve segmentation quality in highly convoluted areas of the cortex that are better represented in submillimeter scans. The WM strand and deep sulci mask (ω_WM/Sulci) emphasizes thin WM strands and narrow sulci, and is defined by the voxels added through a binary closing operation on the gray matter (GM) labels (left side of Fig. 4). The outer gray matter mask (ω_GM) accentuates pixels at the boundary of the cortex and is defined by the voxels lost during brain mask erosion (right side of Fig. 4). Overall, ω_GM and ω_WM/Sulci aim to adjust the underlying decision boundary to closely match the target segmentation in PVE-affected locations by assigning higher weights to narrow WM strands, deep sulci, and tissue boundaries with emphasis on the border between cortex and cerebrospinal fluid (CSF).

Fig. 4. — HiRes weight mask generation. Left: The difference between the gray matter label map (blue) and its closure (gray) produces the deep sulci and WM strand mask. Right: The difference between the original (blue) and eroded (gray) brain mask produces the outer gray matter mask.

Adaptive attention module

Context-driven learnable attention mechanisms have been used to automatically select optimal scale or filter sizes for specific image regions and can boost segmentation accuracy of differently sized structures within an image (Qin et al., 2018). As a reference, we therefore evaluate the addition of attention in the pre-encoder and post-decoder. The generation of the activation map follows the method proposed in Qin et al. (2018). A detailed description is given in the Appendix (Section A.1). In short, learned activation maps from the attention module are used to dynamically weight each feature response generated by the sequence of convolutions within the CDB. This online weight calculation introduces non-linearities outside the activation function into the CDB.

3.4. View aggregation

In order to account for the inherent 3D geometry of the brain, we adopt the same view aggregation scheme as in Henschel et al. (2020) for all evaluated models: one F-CNN per anatomical plane is trained and the resulting probability maps are aggregated through a weighted average. Due to missing lateralization in the sagittal view, the number of classes is effectively reduced from 78 to 50 and the weight of the sagittal predictions is reduced by one half compared to the other two views. Inherently, and similar to ensemble learning, the view aggregation combines final soft predictions boosting segmentation accuracy.

3.5. Augmentations

External scaling augmentation for CNNs (exSA)

The current state-of-the-art method to introduce resolution invariance into neural networks is extensive scale augmentation (see Fig. 1B). Therefore, we contrast our proposed network-integrated resolution-normalization against this approach. We use random linear transforms with scaling parameters sampled from a uniform distribution of the predefined range 0.8 to 1.15 to augment images during the training phase. Every minibatch hence consist of a potentially scaled MRI (bi-linear interpolation) and a corresponding equally re-scaled label map (NN sampling). To disentangle resolution-independence strategies from micro-architecture changes, we use the same CDB and IDB implementations as for FastSurferVINN described above, with the exception of the resolution-normalization.

Internal scaling augmentation for VINNs (inSA)

In order to increase the robustness of the latent space interpolation, we augment the scale factor SF introduced by the VINN with a parameter α, so SF = Res_inner/Res_native + α during the network-integrated resolution-normalization. This effectively introduces small resolution variations within the grid sampling procedure. The values for α are sampled from a Gaussian distribution with parameters sigma=0.1 and mean=0. Overall, this modification can be interpreted as an internal scale augmentation randomly resizing the feature maps in the latent space (as opposed to externally augmenting the native images).

3.6. Evaluation metrics

We use the Dice Similarity Coefficient (DSC) and Average Surface Distance (ASD) to compare different network architectures and modifications against each other, and estimate similarity of the predictions to a number of previously unseen scans with respect to FreeSurfer and manual labels as a reference. Both are standard metrics to evaluate segmentation performance.

Here, the DSC Dice (1945); Sørensen (1948) is defined as twice the intersection of ground truth and prediction divided by the sum of their cardinalities and multiplied by 100. A larger DSC represents better overlap between the segmentations with a maximum value of 100 for perfect agreement. The ASD measures the average distance (in mm) between all points x ∈ Y, x′ ∈ P on the outer surface of the ground truth (Y) and the prediction (P). It is defined as

ASD = \frac{1}{∣ Y ∣ + ∣ P ∣} (\sum_{x \in Y} d (x, P) + \sum_{x^{'} \in P} d (x^{'}, Y))

(2)

with distance d(x, P) = min_x′) ∈P ∥x − x′∥₂ representing the minimum of the Euclidean norm. In contrast to the DSC, a smaller ASD indicates better capture of the segmentation boundaries with a value of zero being the minimum (perfect match). Within each section, improvements in segmentation performance are confirmed by statistical testing (Wilcoxon signed-rank test (Wilcoxon, 1945) after Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) for multiple testing) referred to as corrected p throughout the paper.

3.7. Training setup

Training dataset

Due to the nature of the experiments performed in this paper, the dataset composition varies between individual sections. An overview of the different trainingsets is given in the Appendix (Table 4). All representative datasets are balanced with regard to gender and age. Empty slices were filtered from the volumes. All directly compared networks are trained under the same conditions unless stated otherwise. Experimental setups demanding different datasets for training are separated by vertical white lines between bar plots and/or indicated in the figure legend. We additionally discuss this in the corresponding sections. In the first ablative evaluations of FastSurferVINN (Sections 4.1 and 4.2), 120 representative subjects are selected for training (60 1.0 mm and 60 submillimeter subjects, see Table 4: Mix), leaving on average 155 single view planes per subject and a total training size of at least 23k images per network. To determine the generalization performance of FastSurferVINN across resolutions, the training set is changed such that the submillimeter scans are of the same resolution and study (see Table 4: No 0.7 mm and No 0.8 mm). Similarly, the fixed-resolution networks are trained with 60 or 120 0.8 mm scans (see Table 4: Only 0.8 mm). For the LowRes version 120 1.0 mm scans are used (see Table 4: Only 1.0 mm). In Big-FastSurferVINN, the 1.0 mm component is extended to 1255 scans while keeping the same number (60) of HiRes scans as in the original training set (see Table 4: Mix (Big)).

Training parameters

Independent models for the coronal, axial, and sagittal plane are implemented in PyTorch (Paszke et al., 2017) and trained for 70 epochs using one NVIDIA V100 GPU with 32 GB RAM. The modified adam optimizer (Loshchilov and Hutter, 2019) is used with a learning rate set to 0.001. A cosine annealing schedule (Loshchilov and Hutter, 2017) adapts the learning rate during training where the number of epochs between two warm restarts is initially set to 10 and subsequently increased by a factor of two. The momentum parameter is fixed at 0.95 to compensate for the relatively small mini batch size of 16 images. For maximum fairness, all networks presented within this paper have been trained under equal hardware and hyper-parameter settings.

4. Results

We group the presentation of results into two blocks: 1. ablative architecture improvements to determine the best performing multi-resolution architecture (Sections 4.1 and 4.2), and 2. performance analysis to comprehensively characterize the advantages of our VINN on a wider variety of datasets, resolutions, scanners, and variations of the training corpus (Sections 4.3-4.5). Following best practice in data-science, we utilize completely separate datasets during the evaluations: the validation set Table 4: Validation (for 1.), and various test sets Table 4: Testing (for 2.). This avoids data-leakage and overfitting, i.e., it ensures that training, architectural design decisions, and final testing cannot influence each other, which could lead to overly optimistic results.

4.1. Scaling augmentation versus network-integrated resolution-normalization in FastSurferVINN

The central contribution of this paper is the design and evaluation of a Voxel-size Independent Neural Network for (sub)millimeter whole brain segmentation. Here, we compare segmentation performance of our FastSurferVINN, which avoids interpolation of label maps, with several approaches that rely on traditional scaling data augmentation. Each subsequent improvement in segmentation performance is confirmed by statistical testing (corrected p < 0.05).

Firstly, the original FastSurferCNN CDBs consecutively perform two 5 × 5 convolution operations followed by a final 1 × 1 convolution (Section 3.2.2, FastSurferCNN in Fig. 5). In total, an average DSC of 88.63 and a SD of 0.317 mm is reached for the subcortical structures while the cortical structures average around 87.09 for the DSC and 0.283 mm for the ASD. Optimization of the CDB design (kernel size of 3 × 3, Section 3.2.2) leads to a significant increase in the DSC and reduction of the ASD on both, the subcortical and cortical structures (Fig. 5, FastSurferCNN*). Particularly, the cortical segmentations are improved with an average DSC of 88.01 and a SD of 0.257 mm. Similarly, addition of external scaling augmentation (exSA) to FastSurferCNN* significantly improves segmentation accuracy on the cortical structures (Fig. 5, FastSurferCNN* + exSA). Performance on the subcortical structures is, however, slightly reduced.

Fig. 5. — Ablative optimization of FastSurferCNN and comparison to FastSurfer-VINN. FastSurferCNN (light green) is optimized through a switch to 3 × 3 kernels (FastSurferCNN*, green). Addition of data augmentation (external scaling augmentation, FastSurferCNN* + exSA, dark green) improves performance further. VINN equipped with internal scaling augmentation (inSA) (FastSurferVINN, orange) outperforms all other models on both subcortical (left) and cortical (right) structures with respect to Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom). Further addition of external scaling augmentation negatively affects performance (VINN + inSA + exSA). Segmentation results with FastSurferVINN are significantly better compared to all other models (corrected p < 10⁻⁷).

Interestingly, VINN, which avoids label map interpolation al-together, shows a positive effect across structures compared to FastSurferCNN* with and without scaling augmentation (Fig. 5, VINN). Specifically, combination of VINN with our internal scaling augmentation (VINN + inSA, referred to as FastSurferVINN) strengthens segmentation performance significantly. The DSC increases to 89.05 for the subcortical and 88.93 for the cortical structures representing the best value across the compared architectures. Similarly, the ASD is significantly reduced with a distance of 0.293 mm for the subcortical and 0.226 mm for the cortical structures (Fig. 5, FastSurferVINN). Finally, addition of external scaling augmentation negatively impacts segmentation performance of VINN (Fig. 5, VINN + inSA + exSA). On average, performance on the subcortical structures is reduced to the level of FastSurferCNN* + exSA, while the cortical structures still represent the second best result for both, DSC and ASD. Overall, FastSurferVINN outperforms all traditional scale augmentation approaches by a significant margin (corrected p < 10⁻⁷).

As our network-integrated resolution-normalization directly operates on continuous 2D feature maps in the latent space, various sampling kernels can be incorporated. Here, we evaluate the effect of four different interpolation strategies, namely NN, area, bi-cubic, and bi-linear. As visible in Fig. 6, with the exception of NN changing the sampling kernel does not significantly affect segmentation performance. The NN interpolation (Fig. 6, first column) overall reduces performance by 2% for DSC. The change on the ASD is more severe with a decrease in performance by 13% for the subcortical and 7% for the cortical structures. The bi-linear kernel (Fig. 6, last column) shows the best performance overall with a 89.05 DSC and an ASD of 0.293 mm for the subcortical structures and a 88.93 DSC and an ASD of 0.226 mm for the cortical structures. Given that the bi-linear interpolation is computationally favourable to bi-cubic, we henceforth keep it as the sampling kernel of choice.

Fig. 6. — Effect of sampling kernels on network-integrated resolution-normalization. Comparison of nearest-neighbour (NN, purple), area (light violet), bi-cubic (light yellow), and bi-linear (orange) sampling kernels with respect to the Dice Similarity Coefficient (DSC, top) and the average surface distance (ASD, bottom) for subcortical (left) and cortical (right) structures. Segmentation performance with NN is significantly worse than all other interpolation strategies (corrected p < 10⁻¹³). Area, bi-cubic and bi-linear give equivalent results.

4.2. HiRes specific adjustments

The higher resolution in submillimeter scans reduces PVEs and offers a potential to optimize segmentation performance (see Fig. 2). Overall, the changes in tissue and border assignment might, however, not be accurately captured during network training as they represent marginal changes in the loss compared to the whole brain volume. In order to focus on the PVE affected regions (specifically small WM strands and deep sulci), we test two different modifications. First, a HiRes loss function (HiRes Loss) up-weights information from the areas in question. Second, attention mechanisms are introduced to enable automatic refocusing on important information during network training. Here, we equip FastSurferVINN separately with both modifications and evaluate the change in segmentation performance with respect to the DSC and ASD.

Training FastSurferVINN with the new HiRes loss function (see Fig. 7, + HiRes Loss, right bar) significantly improves segmentation performance on the cortical structures (corrected p < 10⁻⁵), while maintaining high accuracy on the subcortical structures (no significant change consistent with expectations). A final DSC of 89.3 and an ASD of 0.209 mm is achieved for the cortical structures. Further addition of attention, while simultaneously adjusting the number of feature maps to keep the total number of network parameters of FastSurferVINN constant, does not lead to a significant change in segmentation performance (see Fig. 7, + attention, middle bar). Due to the overall significant improvement on DSC and ASD, the HiRes Loss modification is included in all further comparisons.

4.3. Generalizability

After selection of the best architectures for both, VINN and CNN, we now perform a detailed evaluation in a broad selection of test scenarios (see Table 4: Testing). To highlight the advantages of our VINN, we compare FastSurferVINN with FastSurferCNN* + exSA. The latter is improved over the state-of-the-art FastSurferCNN by architectural updates. The difference between FastSurferCNN* + exSA and FastSurfer-VINN is exclusively the substitution of external scale augmentation (exSA, Section 3.5) with network-integrated resolution-normalization (see Section 3.2.1) and internal scale augmentation (inSA, Section 3.5).

4.3.1. Across datasets

In this section, we evaluate the generalization capabilities of FastSurferVINN in comparison to scale augmentation on the test corpus to establish the performance metrics across multiple datasets.

FastSurferVINN consistently reaches the best DSC and ASD across all nine datasets (Fig. 8, orange bar). All improvements compared to external scale augmentation (dark green bar) are again significant (corrected p < 0.01). The best performance is reached for the submillimeter scans from HCP and RS with a subcortical DSC of 88.78 and 90.07 and a cortical DSC of 89.86 and 89.68, respectively. La5c reaches the highest DSC for the 1.0 mm scans (89.51 and 89.07 subcortical and cortical DSC). Similar to the DSC, the ASD is significantly improved for FastSurferVINN compared to the scale augmentation by around 5.2% for the subcortical and 10.5% for the cortical structures. On the cortical and subcortical structures the best ASD is reached for HCP (0.31 mm, 0.22 mm), RS (0.28 mm, 0.21 mm) and la5c (0.27 mm, 0.22 mm). The biggest improvement with regards to comparability between FastSurfer-VINN and scale augmentation can be seen for ABIDE-II. Here, the DSC differs by 1.1 and 2.1% for the subcortical and cortical DSC and around 11% for the ASD. This again reflects the better cross-resolution generalization performance of FastSurferVINN already outlined in the previous section. ADNI and OASIS1 also benefit strongly from FastSurferVINN, specifically for the cortical structures. Here, the DSC and ASD are improved by around 1.66% and 12.1%, respectively. Overall, the cortical structures benefit stronger from the internal resolution-normalization in FastSurferVINN.

4.3.2. Unseen resolutions

A core aspect of FastSurferVINN is the implicit compatibility with a variety of resolutions independent of their explicit presence in the training corpus. In order to investigate the generalization capacity of FastSurferVINN in contrast to scale augmentation approaches, we evaluate the inter- and extrapolation capabilities of the trained networks based on the segmentation performance for three resolutions purposefully excluded during network training.

To this end, we specifically drop either (i) all 30 0.8 mm, or (ii) all 30 0.7 mm scans from the training corpus (see Table 4: No 0.7 mm and No 0.8 mm). In order to ensure comparability with respect to the total number of data points and balance between HiRes and LowRes scans, an equal number of subjects from the other respective HiRes datasets are added. In addition, we evaluate performance using the original mixed training set on 25 subjects from ABIDE-II, representing an unseen scanner type (Philips) and resolution (0.9 mm).

As presented in Fig. 9, FastSurferVINN (orange bar) consistently outperforms traditional scale augmentation (dark green bar) across all resolutions (corrected p < 10⁻⁴). Segmentation performance on the 0.7 mm scans reflects the network’s resolution extrapolation capabilities (training corpus consists of 0.8 mm and 1.0 mm scans only). Here, FastSurferVINN reaches a DSC of 86.49 and 87.50 for the subcortical and cortical structures, respectively, representing a significant increase compared to the traditional scale augmentation approach. The improvements on the ASD are even more pronounced with FastSurferVINN reducing the ASD by 4.5% to 0.397 mm on the subcortical and by 19.7% to 0.294 mm on the cortical structures. Comparison of the interpolation capabilities reflected in the 0.8 mm results (training corpus consists of 0.7 mm and 1.0 mm) paint a similar picture. FastSurferVINN reaches the highest DSC and lowest ASD for both subcortical (DSC: 88.75, ASD: 0.316 mm) and cortical structures (DSC of 88.28 and ASD of 0.273 mm). The difference to traditional scale augmentation approaches is again more evident on the cortical structures. Finally, metrics on the 0.9 mm Philips scans are significantly better with FastSurferVINN. A final DSC of 87.85 and 83.68 and an ASD of 0.311 mm and 0.313 mm is reached on the subcortical and cortical structures, respectively.

We face the limitation that no publicly available 3T datasets exists at finer resolutions than 0.7 mm. In order to further explore the extrapolation capabilities of FastSurferVINN in comparison to traditional scale augmentation (FastSurferCNN* + exSA), we therefore evaluate segmentation performance at lower resolutions vastly outside the training range (1.4 mm and 1.6 mm). Note, that FreeSurfer segmentations are not available at lower native resolutions either, as images at resolutions coarser than 1.0 mm are upsampled to a millimeter voxel resolution in an initial conversion step (Fischl, 2012). Therefore, we downsample the high-resolution FreeSurfer segmentations of HiRes test sets (HCP and RS) by a factor of 2 along each axis using majority voting to generate “ground truth”. The intensity images used for inference were resampled using cubic interpolation.

In Fig. 10 we illustrate a strong divergence between FastSurferVINN (orange) and FastSurferCNN* + exSA (dark green) as we test with resolutions increasingly outside the training range. With FastSurferVINN, accuracy stays consistently high for both, 1.4 mm and 1.6 mm downsampled images. Here, FastSurferVINN reaches a DSC of above 84.90 and a ASD of 0.473 mm on the subcortical structures (left plot) and a DSC of around 77.70 (77.88 on the 1.4 mm and 77.69 on the 1.6 mm images) and an ASD of 0.480 mm and 0.494 mm on the cortical structures (right plot). The difference to FastSurferCNN* + exSA is again more evident on the cortical structures. Overall, FastSurferVINN outperforms the scaling augmentation by 6.74% on the subcortical and 16.80% on the cortical structures with respect to the DSC when using the downsampled 1.6 mm image as an input. Note, that this comparison is solemnly performed to visualize the performance difference between VINN and augmentation in the absence of high-resolution ground truth beyond the training range. We expect a reduction of performance at low resolutions due to stronger PVE and an 8-fold information reduction.

4.3.3. Comparison to manual reference

Due to the limited availability of manual labels, FreeSurfer segmentations have been used as the reference for comparison so far. In order to account for potential biases we now also evaluate DSC and ASD for the manually edited cortical regions on 78 subjects from the 1.0 mm Mindboggle101 dataset (Klein and Tourville, 2012). The subcortical segmentations available for a subset of 20 subjects within this cohort are used for subcortical evaluations.

On the subcortical structures (see Fig. 11, left part of left plot), FastSurferVINN (orange bar) and traditional scale augmentation (dark green bar) perform equally well with respect to both, DSC (80.26) and ASD (0.616 mm). The difference between the two methods is more pronounced on the cortical structures (Fig. 11, right side of left plot). FastSurferVINN reaches a final DSC of 81.89 and an ASD of 0.471 mm), representing a significant improvement compared to the external scale augmentation approach (corrected p < 0.005). These results again indicate that the cortical structures specifically benefit from the internal interpolation approach of FastSurferVINN.

In order to evaluate the benefit of FastSurferVINN for submillimeter scans, we compare segmentation accuracy on three structures (WM, GM and hippocampus) on an in-house set, as no manual HiRes full brain segmentations are publicly available. Note, while we previously reported averages across the 45 cortical and 33 subcortical structures listed in Table 5 for clarity, we now calculate performance measures specifically for these three individual structures. To generate manual annotations, a trained expert corrected FreeSurfer generated segmentations on six cases from the RS. The results are depicted in the left part of Fig. 11. FastSurferVINN again outperforms traditional scale augmentation with a final DSC of 97.54, 96.04, and 93.04 and ASD of 0.075 mm, 0.062 mm, 0.181 mm for WM, GM, and hippocampus, respectively. Consistent with previous results, the GM segmentations show the strongest improvement. Due to the small sample size, no statistical analysis could be performed.

4.4. Fixed-resolution networks versus FastSurferVINN

In order to analyse the advantage of the multi-resolution training approach over a fixed-resolution network, we compare FastSurferVINN with networks trained on (i) 60 0.8 mm subjects, (ii) 120 0.8 mm subjects, and (iii) 120 1.0 mm subjects (see Table 4: Only 0.8 mm and Only 1.0 mm). Inherently, this analysis highlights how the interplay of HiRes information from submillimeter 3T scans and dataset-variations from LowRes MRIs can mutually benefit segmentation performance on the different resolutions. Overall, the analysis of Fig. 12 highlights three important properties.

Fig. 12. — Flexible-versus fixed-resolution networks. FastSurferVINN (orange) is comparable to, or outperforms, all fixed-resolution networks (green) (left plot: 0.8 mm from RS only, right plots: 1.0 mm) with respect to the Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom). On the submillimeter scans (left plots) generalization to an unseen dataset (HCPL) is significantly improved. Results are consistently better for the 1.0 mm scans (right plot). To highlight cumulative VINN and architectural optimizations, we also compare with the state-of-the-art FastSurferCNN (gray, without optimizations from Sections 4.1 and 4.2, which are already included in CNN*). We retrain this 1 mm fixed-resolution network ensuring equal training datasets and conditions.

First, increasing the number of data samples has a strong effect on segmentation performance. The fixed-resolution network trained with 60 0.8 mm scans (left plot; light green bar) reaches an average DSC of 89.28 and ASD of 0.257 mm for RS data (same cohort as in the training set). Doubling the number of cases to 120, significantly improves both measures (corrected p < 10⁻⁹) with a final DSC of 89.85 and ASD of 0.245 mm (FastSurferVINN, orange bar). Interestingly, FastSurferVINN trained with 60 HiRes (30 0.7 mm and 30 0.8 mm) and 60 LowRes scans (1.0 mm) reaches the same accuracy, as if these scans were actually all HiRes scans from the RS cohort (120 0.8 mm, dark green bar). Benefits from the sample size increase are, therefore, independent of the data resolution. Advantageously, the additional 90 HiRes cases at 0.8 mm may easily be integrated into the FastSurferVINN training dataset, hence, increasing the training corpus with another expected performance gain.

Second, fixed-resolution networks perform well on the cohort they are trained on, but lack generalization capability to other datasets (left plot, HCPL). When comparing networks trained with equally sized datasets (120 HiRes RS fixed-resolution net (dark green bar) and FastSurferVINN (orange bar)) on the 0.8 mm HCPL cohort, FastSurferVINN clearly outperforms the fixed-resolution network with an increase in DSC to 88.99 and a decrease of the ASD to 0.326 mm, representing a significant improvement (corrected p < 0.005). The gap in segmentation accuracy between the RS and HCP cohort with respect to DSC is halved for FastSurferVINN compared to the fixed-resolution approach and a similar reduction is visible for the difference in ASD. Third, multi-resolution training benefits 1.0 mm datasets as well. In the right part of Fig. 12, we compare FastSurferVINN (orange bar) and two fixed-resolution 1.0 mm networks – the original FastSurferCNN (CNN, gray bar) as well as the optimized FastSurferCNN* (CNN*, dark green bar) – with DSC (top) and ASD (bottom) metrics. FastSurferCNN is included to highlight the cumulative performance gain of FastSurferVINN (architectural optimization already included in CNN* and voxel-size independence via resolution normalization). The training sets of all three networks contain the exact same subjects, with the only difference being the MRIs resolution: We train the fixed-resolution networks exclusively with 1.0 mm images (native or downsampled from HiRes), while FastSurferVINN uses all images at their native resolution (1.0 mm, 0.8 mm, or 0.7 mm). Note, since the FreeSurfer-based label maps are obtainable at either resolution, we circumvent resampling with NN. Interestingly, the HiRes information from submillimeter scans boosts performance for the 1.0 mm scans. FastSurferVINN reaches a DSC of 87.62 and an ASD of 0.296 mm representing a significant improvement in DSC and ASD compared to the fixed-resolution networks (corrected p < 0.001). Compared to the original FastSurferCNN, on average an improvement of 1.46% DSC and 10.06% ASD (0.93 to 2.37% and 8.3 to 12.7%, respectively, across datasets) can be achieved with FastSurferVINN.

Overall, the inherent voxel size independence of FastSurferVINN and resulting multi-resolution training option is highly beneficial to both, submillimeter and 1.0 mm scans.

4.5. Big-FastSurferVINN

Based on the observed performance gain with a larger training corpus, we evaluate whether ASD and DSC can be further improved across resolutions by expanding the training set from 120 to 1315 cases when exclusively adding 1.0 mm scans.

As illustrated in Fig. 13, expansion of the training corpus (n = 1315. yellow bar) leads to a noticeable performance gain. Specifically the 1.0 mm dataset benefits from the sample increase with a final DSC of 90.26 and an ASD of 0.231 mm. On average, a 2.95% increase in the DSC and 16.26% decrease in the ASD can be observed for the 1.0 mm scans compared to the smaller training set (n = 120, orange bar), representing a significant performance gain (corrected p < 10⁻²⁰). Additionally, ASD and DSC are significantly improved for the 0.7 mm, 0.8 mm and most strongly for the 0.9 mm scans (p < 10⁻⁵, Wilcoxon signed-rank test). Here, DSC and ASD improve by 3.77% and 24.08%, respectively. Note, that the 0.9 mm dataset is exclusively present in the testing corpus and thus completely new to the networks. A significant resolution bias in the training set, hence, does not decrease, but rather elevate performance across 1.0 mm and submillimeter scans.

5. Discussion

In this paper, we present the first multi-resolution deep learning tool for accurate and efficient (sub)millimeter 3T MRI whole brain segmentation. FastSurferVINN addresses the two main difficulties associated with HiRes MRIs, namely limited availability and diversity specifically for certain groups (e.g. neurodegeneration or scanner types) as well as resolution non-uniformity. Applicability of deep learning approaches is generally restricted to domains were enough training data exists (traditionally 1.0 mm scans). FastSurferVINNs’ network-integrated resolution-normalization provides independence to the input voxel grid during both, training and inference, and favourably extends processing spans to resolutions not explicitly included in the training corpus. Further, both, LowRes and HiRes scans, can be included during training and, in turn, benefit from each others favourable properties, i.e. coverage and size versus better representation of detailed structures.

The current state-of-the-art to approximate voxel size independence in networks is data augmentation. Here, transformations with randomly sampled scale parameters are applied to the images (both intensity and label map) in the native space. In FastSurferVINN, we shift this step into the latent space by replacing the first scale transition from a pooling/unpooling operation to interpolation-based down/upsampling. This network-integrated resolution-normalization interpolates the continuous feature maps on both the encoder and decoder arms and avoids alteration of the underlying ground truth labels entirely.

When optimizing the network architecture, we first show that the best results are achieved with a 3 × 3 kernel and bilinear interpolation. Also, our HiRes loss which focuses segmentation performance on PVE-affected structures (deep sulci, thin white matter strands and GM-CSF boundary) improves DSC and ASD specifically for the cortical structures. Interestingly, the improvement is consistent across resolutions, indicating that the structural information learned from the HiRes scans is effectively transferred to the 1.0 mm scans. Finally, we evaluate the introduction of attention mechanisms into the architecture. Overall, performance does not improve if the total number of parameters of the network with and without attention is controlled, indicating that network capacity (i.e. number of learnable parameters) rather than adaptive attention is the important factor here.

We demonstrate FastSurferVINN’s superior performance compared to state-of-the-art data augmentation for the fast and detailed segmentation of whole brain MRI. Our network-integrated resolution-normalization in combination with internal scale augmentation outperforms traditional scale augmentation in terms of accuracy by a significant margin both with respect to FreeSurfer and a manual standard. Across nine different datasets and four resolutions, including both defaced (HCP) and full-head images, our network achieves the highest DSC compared to FreeSurfer as a reference (88.38 on average), as well as the lowest ASD (0.283 mm on average). In addition, FastSurferVINN achieves the best results on the manually labeled 1.0 mm Mindboggle101 dataset with a DSC of 81.89 and an AVG HD of 0.471 mm on the cortical structures. Correspondingly, WM, GM, and hippocampus are better segmented with FastSurferVINN on six manually corrected scans from the submillimeter RS dataset.

One possible explanation for the consistently improved segmentation metrics with our VINN compared to traditional scale augmentation is the circumvention of label interpolation. For discrete labels, NN or majority voting have to be applied. These kernels are prone to remove important structural details and create jagged segmentation maps (Allebach, 2005; Parker et al., 1983; Schaum, 1993; Thevenaz, 2009). This is underlined by the observation that segmentation performance deteriorates for FastSurferVINN when external scale augmentation is added. Further, analysis of the interpolation methods during the network-integrated resolution-normalization also highlights the negative impact of the NN interpolation kernels. While all other methods (area, bi-linear, and bi-cubic) performed equally well, performance drops for NN kernels on average by approximately 2% DSC and 10% ASD. Overall, NN seems to have a systematic negative effect on segmentation performance and should, hence, be avoided wherever possible.

In addition, our resolution-normalization interpolates a feature map of vectors instead of a single scalar image slice. As each vector describes a neighborhood in the input image, the available contextual information is extended before the interpolation step and may subsequently support a smoother transition between resolutions. Further, our network-integrated resolution-normalization reduces the range of perceived anatomical size variation in the inner blocks, thereby liberating resources to focus on structural details at a common metric scale (structures have similar sizes after normalization). Finally, a skip-connection transfers the non-interpolated feature maps from the pre-IDB into the post-CDB, potentially improving ambiguous label borders with details available at the native resolution. The differentiation between native voxel-scale and normalized inner scale may specifically be helpful in settings with large inter-subject size similarities (e.g. head size is relatively stable). As such, our network-integrated resolution-normalization approach is not limited to the neuroimaging context, but can be expected to benefit segmentation performance in other domains.

FastSurferVINN further demonstrates excellent generalization performance with respect to inter- and extrapolation, achieving the best results for a variety of unseen submillimeter resolutions excluded from the training corpus. Especially more detailed cortical structures targeted with HiRes acquisition show a significant DSC increase on the 0.7 mm, 0.8 mm and 0.9 mm scans, respectively. Similarly, ASD improved by 19.7%, 10% and 11% highlighting specifically the improved extrapolation capabilities of FastSurferVINN on the left-out 0.7 mm dataset component compared to traditional scale augmentation. Given the lack of a de-facto standard in HiRes datasets, consistent performance towards unseen resolutions is an important property. Specifically, new releases such as the upcoming HCP disease studies⁷ may benefit from a flexible and validated method to process scans with good accuracy at their native submillimeter resolution. Conveniently, FastSurferVINN avoids lossy image downsampling, as well as time intensive re-training, and simplifies re-validation. One future aspect to investigate, is the extend to which this generalization is effective. Unfortunately, with increasing resolutions, evaluation becomes virtually impossible due to (i) unavailability of manual segmentations and (ii) limited support by existing neuroimaging pipelines (i.e. validation is difficult here as well due to point (i)). Our substitute evaluation on the downsampled, lower resolution images underlines that FastSurferVINN robustly generalizes beyond the tested resolution range (see Section 4.3.2, Fig. 10). Specifically, differences to traditional data augmentation are magnified with a more than 16.8% increase in DSC for the 1.6 mm scans. The improved generalizability of FastSurferVINN to under- or unrepresented resolutions especially indicates superior performance in real applications. Furthermore, FastSurferVINN was trained and evaluated on 3T scans only. Adoption to ultra high-field MRI capable of producing resolutions below 0.5 mm is an interesting future avenue, as direct transferability is limited due to the strong signal, contrast, and noise differences (van der Kolk et al., 2013; van der Zwaag et al., 2016).

Overall, FastSurferVINN allows processing of both LowRes and HiRes data without reducing precision on either set. On the contrary, we demonstrated that FastSurferVINN even outperforms a fixed-resolution 1.0 mm network by a significant margin. As the only difference in the training corpus is the increased image resolution of 60 subjects, the gain in accuracy seems to be predominantly motivated by the structural details provided by the submillimeter scans.

Further, the intensity and demographical variety added by the 1.0 mm scans promoted generalization performance to the HCPL component when FastSurferVINN was compared to a fixed-resolution RS HiRes network. This is of special interest in longitudinal settings, where training on images from only the first time point may introduce a bias towards healthier or younger brains, which can be mitigated by introducing appropriate LowRes out-of-study cases. Similarly, potential future variations of acquisition parameters within studies may negatively affect performance of fixed-resolution networks and, hence, would require their retraining.

Comparison to fixed-resolution networks revealed another promising application of FastSurferVINN. As shown in Section 4.4, FastSurferVINN’s segmentation performance improves across resolutions even if only additional 1.0 mm scans are included. Surprisingly, performance is dominated by the training corpus size and not significantly influenced by its heterogeneous resolution. On RS, the performance was comparable for FastSurferVINN when training with a mixed resolution corpus (120 total, with only 30 0.8 mm RS cases) as opposed to a fixed-resolution network trained exclusively on 120 0.8 mm RS scans (see Section 4.4). This can be exploited in settings where training cases are scarce at a specific resolution and new acquisitions difficult (e.g. manual segmentations or custom sub-resolution acquisition protocols). FastSurfer-VINN’s resolution flexibility is optimally suited to integrate a breadth of already existing resources, reducing the amount of newly acquired data necessary to achieve good performance.

Finally, the results of Big-FastSurferVINN indicate that an imbalance between 1.0 mm and submillimeter data distribution does not introduce a resolution bias. On the contrary, increasing the 1.0 mm component in Big-FastSurferVINN by a factor 20 improved segmentation performance across all resolutions (0.7 to 1.0 mm). Specifically, the 0.9 mm dataset (ABIDE-II) benefits from the training set extension with an 3.67% increase in DSC and 22.11% decrease in ASD. Unlike HCP and RS, the 0.9 mm ABIDE-II scans were acquired on a Philips scanner, which is present in a larger proportion in the extended training set (30% Philip scanners compared to 1.6%). However, no submillimeter Philips scan was included at any point during training. The improved performance on ABIDE-II, therefore, highlights the potential for FastSurferVINN to actively reduce scanner-biases through inclusion of 1.0 mm MRIs. Exploring an expected alleviation of age- or disease-biases in submillimeter datasets with FastSurferVINN presents an interesting direction for future work.

Overall, we introduce a fast, voxel size independent neural network that scales well to large datasets and enables seamless integration of a variety of resolutions during both, training and inference. Thereby, FastSurferVINN offers the potential to improve generalization performance to future HiRes datasets without retraining, reduce potentially existing dataset biases, and curtail necessary labour and time intensive manual labeling efforts.

FastSurferVINN will be made available as part of the open source FastSurfer (Henschel et al., 2020) package.⁸

Acknowledgments

This work was supported by DZNE institutional funds, by the Federal Ministry of Education and Research of Germany (031L0206, 01GQ1801), and by NIH (R01 LM012719, R01 AG064027, R56 MH121426, and P41 EB030006).

We would like to thank the Rhineland Study group (PI Monique M.B. Breteler) for supporting the data acquisition and management. We also thank Rika Etteldorf for providing corrected high-resolution labels.

Data used in the preparation of this article were obtained in part by the OASIS Cross-Sectional with principal investigators D. Marcus, R, Buckner, J, Csernansky J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, U24 RR021382, and OASIS: Longitudinal: Principal Investigators: D. Marcus, R, Buckner, J. Csernansky, J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, U24 RR021382. Further, data used in the preparation of this article were obtained from the MIRIAD database. The MIRIAD investigators did not participate in analysis or writing of this report. The MIRIAD dataset is made available through the support of the UK Alzheimer’s Society (Grant RF116). The original data collection was funded through an unrestricted educational grant from GlaxoSmithKline (Grant 6GKC). Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abb-Vie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IX-ICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. Data were also provided in part by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.

Abbreviations:

HiRes: high-resolution
LowRes: low-resolution
VINN: Voxel-size Independent Neural Network
exSA: external scale augmentation
inSA: internal scale augmentation
SF: scale factor
PVE: partial volume effect

Appendix A

A1. Attention

A context-driven learnable attention mechanism following the method proposed in Qin et al. (2018) is evaluated as a reference. This modular mechanism can be inserted into any CDB and offers variable placement across the network architecture. Here, we investigate the addition of attention in the pre-IDB and post-CDB (see Section 3.2.2) as they are predominantly dealing with feature maps at the native image resolution. In short, a sequence of two convolutions ( ${\tilde{H}}_{1}^{a}$ : 3 × 3 kernel, half the number of filters as the input; ${\tilde{H}}_{2}^{a}$ : 1 × 1 kernel, one more filter than convolutions in the dense blocks) is followed by a final elementwise softmax function normalizing the activations, denoted as λ₀, λ₁, λ₂ and λ₃, to a sum of 1. The activation map λ₀ is used for weighting the input features representing an effective receptive field of 1. Each voxel is, therefore, assigned a different attention value driven by the image context. In contrast to Qin et al. (2018), the feature responses at different scales (c₁, c₂ and X_a) are generated by a sequence of convolutions within the CDB ( $H_{a}^{1}$ , $H_{a}^{2}$ and $H_{a}^{3}$ ). Each convolution increases the receptive field size by (k₁ − 1 × k₂ − 1) with (k₁ × k₂) representing the kernel size. The calculated attention maps are then used to weight the local skip connections which combine the outputs of each convolution within these blocks. This is mathematically formulated in Eqs. (A.1a)-(A.1c).

X_{a} = λ_{3} H_{a}^{3} (c_{2})

(A.1a)

c_{2} = m a x (λ_{2} H_{a}^{2} (c_{1}), c_{1})

(A.1b)

\underset{Attention CDB}{\underset{︸}{c_{1} = m a x (λ_{1} H_{a}^{1} (X_{a - 1}), λ_{0} X_{a - 1})}}

(A.1c)

A2. Ablative optimization of FastSurferCNN

Ablative evaluation of the CNN base architecture optimizations are summarized in Tables 1 and 2. Training and validation are based on the multi-resolution datasets listed in Table 4 (Mix). The original FastSurferCNN CDBs consecutively perform two 5 × 5 convolution operations followed by a final 1 × 1 convolution (Section 3.2.2, first row in Tables 1 and 2). In total, an average DSC of 88.63 and 87.09 and a ASD of 0.317 mm and 0.283 mm is reached for the subcortical and cortical structures, respectively. Changing the kernel size to 3 × 3 while keeping the effective receptive field size of 9 × 9 per CDB constant (Tables 1 and 2, 3 × 3, 64 F) leads to a significant improvement in DSC (p < 0. 01, Wilcoxon-signed rank test). Increasing the number of filters per layer from 64 to 71 preserves the original number of trainable parameters (approximately 1.85 × 10⁶). This change leads to a slight improvement on the subcortical and cortical structures. For comparability to our final VINN, we also add the pre-IDB and post-CDB blocks (Tables 1 and 2, FastSurferCNN*). This addition merely assures comparability between the augmentation and interpolation approach. As visible in Tables 1 and 2, this change improves segmentation accuracy further with a final DSC of 88.85 and 88.01 and an ASD of 0.307 mm and 0.257 mm. The optimized FastSurferCNN* architecture is used as the baseline for all comparisons (i.e. augmentation and interpolation).

Table 1.

FastSurferCNN optimization. Dice Similarity Coefficent evaluation.

	DSC
Datasets	Subcortical	Cortical
FastSurferCNN	88.63 (±1.87)	87.09 (±2.4)
+3 × 3, 64F	88.75 (±1.78)	87.68 (±2.35)
+3 × 3, 71F	88.79 (±1.79)	87.78 (±2.41)
FastSurferCNN*	88.85 (±2)	88.01 (±2.4)

Dataset	Scanner	1.5T/3T	Groups	Age	Res
HCP	Siemens	3T	Normal	22–35	0.7 mm
HCPL	Siemens	3T	Normal	25–75	0.8 mm
RS	Siemens	3T	Normal	30–95	0.8 mm
ABIDE-II ETHZ1	Philips	3T	ASD/Normal	20–31	0.9 mm
ABIDE-I	Philips/GE/Siemens	3T	ASD/Normal	18–64	1.0 mm
ADNI	Philips/GE/Siemens	1.5T/3T	AD/MCI/Normal	55–93	1.0 mm
IXI	Philips/GE	1.5T/3T	Normal	19–87	1.0 mm
LA5C	Siemens	3T	Neuropsych/Normal	21–50	1.0 mm
MBB	Siemens	3T	Normal	20–77	1.0 mm
MIRIAD	GE	1.5T	AD/Normal	55–86	1.0 mm
OASIS1	Siemens	1.5T/3T	Normal	18–90	1.0 mm
OASIS2	Siemens	1.5T/3T	AD/Normal	60–96	1.0 mm

Usage		Datasets (subjects)	n
Training	Mix	HCP (30), RS (30), ADNI (15), LA5C (16), MIRIAD (7), OASIS1 (14), OASIS2 (8)	120
	No 0.8 mm	HCP (60), ADNI (15), LA5C (16), MIRIAD (7), OASIS1 (14), OASIS2 (8)	120
	No 0.7 mm	RS (60), ADNI (15), LA5C (16), MIRIAD (7), OASIS1 (14), OASIS2 (8)	120
	Only 0.8 mm	RS (small=60, big= 120)	60/120
	Only 1.0 mm	HCP* (30), RS* (30), ADNI (15), LA5C (16), MIRIAD (7), OASIS1 (14), OASIS2 (8)	120
	Mix (Big)	HCP (30), RS (30), ABIDE-I (68), ADNI (215), IXI (400), LA5C (203), MBB (195), MIRIAD (30), OASIS1 (79), OASIS2 (65)	1315
Validation	Mix (Big), No 0.8 mm, No 0.7 mm	HCP (20), RS (20), ADNI (8), LA5C (9), MIRIAD (7), OASIS1 (11), OASIS2 (5)	80
	Only 0.8 mm	RS (20)	20
	Only 1.0 mm	HCP* (20), RS* (20), ADNI (8), LA5C (9), MIRIAD (7), OASIS1 (11), OASIS2 (5)	80
Testing	Mix	HCP (80), RS (80), ABIDE-II (25), ABIDE-I (20), ADNI (40), IXI (43), LA5C (15), OASIS1 (30), OASIS2 (17)	350
	No 0.7 mm, No 0.8 mm	HCP (80), RS (80), ABIDE-II (25)	185
	Manual Labels	RS (6), Mindboggle (78)	84
	Only 0.8 mm, Only 1.0 mm	RS (102), HCPL (10), ABIDE-I (20), ADNI (40), IXI (43) LA5C (15), OASIS1 (36), OASIS2 (17)	259
	Mix (Big)	HCP (80), RS (80), ABIDE-II (25), ABIDE-I (20), ADNI (40), IXI (43), LA5C (15), OASIS1 (35), OASIS2 (17)	355

Subcortical structures	FastS	FreeS	Cortical structures	FastS	FreeS
Cerebral white matter (lh)	1	2	caudalanteriorcingulate (lh)	34	1002
Lateral Ventricle (lh)	2	4	caudalmiddlefrontal (lh, rh)	35	1003, 2003
Inferior Lateral Ventricle (lh)	3	5	cuneus (lh)	36	1005
Cerebellar White Matter (lh)	4	7	entorhinal (lh, rh)	37	1006, 2006
Cerebellar Cortex (lh)	5	8	fusiform (lh, rh)	38	1007, 2007
Thalamus (lh)	6	10	inferiorparietal (lh, rh)	39	1008, 2008
Caudate (lh)	7	11	inferiortemporal (lh, rh)	40	1009, 2009
Putamen (lh)	8	12	isthmuscingulate (lh)	41	1010
Pallidum (lh)	9	13	lateraloccipital (lh, rh)	42	1011, 2011
3rd-Ventricle	10	14	lateralorbitofrontal (lh)	43	1012
4th-Ventricle	11	15	lingual (lh)	44	1013
Brain Stem	12	16	medialorbitofrontal (lh)	45	1014
Hippocampus (lh)	13	17	middletemporal (lh, rh)	46	1015, 2015
Amygdala (lh)	14	18	parahippocampal (lh)	47	1016
CSF	15	24	paracentral (lh)	48	1017
Accumbens (lh)	16	26	parsopercularis (lh, rh)	49	1018, 2018
Ventral DC (lh)	17	28	parsorbitalis (lh, rh)	50	1019, 2019
Choroid Plexus (lh)	18	31	parstriangularis (lh, rh)	51	1020, 2020
Cerebral white matter (rh)	19	41	pericalcarine (lh)	52	1021
Lateral Ventricle (rh)	20	43	postcentral (lh)	53	1022
Inferior Lateral Ventricle (rh)	21	44	posteriorcingulate (lh)	54	1023
Cerebellar White Matter (rh)	22	46	precentral (lh)	55	1024
Cerebellar Cortex (rh)	23	47	precuneus (lh)	56	1025
Thalamus (rh)	24	49	rostralanteriorcingulate (lh, rh)	57	1026, 2026
Caudate (rh)	25	50	rostralmiddlefrontal (lh, rh)	58	1027, 2027
Putamen (rh)	26	51	superiorfrontal (lh)	59	1028
Pallidum (rh)	27	52	superiorparietal (lh, rh)	60	1029, 2029
Hippocampus (rh)	28	53	superiortemporal (lh, rh)	61	1030, 2030
Amygdala (rh)	29	54	supramarginal (lh, rh)	62	1031, 2031
Accumbens (rh)	30	58	transversetemporal (lh, rh)	63	1034, 2034
Ventral DC (rh)	31	60	insula (lh, rh)	64	1035, 2035
Choroid Plexus (rh)	32	63	caudalanteriorcingulate (rh)	65	2002
WM-hypointensities	33	77	cuneus (rh)	66	2005
			isthmuscingulate (rh)	67	2010
			lateralorbitofrontal (rh)	68	2012
			lingual (rh)	69	2013
			medialorbitofrontal (rh)	70	2014
			parahippocampal (rh)	71	2016
			paracentral (rh)	72	2017
			pericalcarine (rh)	73	2021
			postcentral (rh)	74	2022
			posteriorcingulate (rh)	75	2023
			precentral (rh)	76	2024
			precuneus (rh)	77	2025
			superiorfrontal (rh)	78	2028

	DSC		ASD
Networks	Subcortical	Cortical	Subcortical	Cortical
FastSurferCNN	88.63 (± 1.87)	87.09 (± 2.4)	0.317 (± 0.081)	0.283 (± 0.085)
FastSurferCNN*	88.85 (± 2)	88.01 (± 2.4)	0.307 (± 0.077)	0.257 (± 0.082)
FastSurferCNN* + exSA	88.79 (± 1.33)	88.33 (± 2.23)	0.311 (± 0.066)	0.242 (± 0.077)
VINN	88.86 (± 1.82)	88.42 (± 2.28)	0.308 (± 0.081)	0.243 (± 0.077)
FastSurferVINN	89.05 (±1.7)	88.93 (±2.01)	0.293 (±0.076)	0.226 (±0.066)
FastSurferVINN + exSA	88.83 (± 1.31)	88.6 (± 2.13)	0.31 (± 0.066)	0.23 (± 0.069)

		Subcortical		Cortical
		FastSurferCNN* + exSA	FastSurferVINN	FastSurferCNN* + exSA	FastSurferVINN
DSC	HCP	88.5 (± 1.25)	88.78 (± 1.2)	89.12 (± 1.93)	89.86 (± 1.89)
	Rhineland	89.54 (± 0.94)	90.07 (± 0.88)	88.71 (± 2.16)	89.68 (± 2.21)
	abide-i	86.55 (± 2.37)	86.67 (± 2.31)	84.42 (± 3.12)	84.97 (± 3.13)
	abide-ii	86.9 (± 1.2)	87.85 (± 1.19)	81.92 (± 1.72)	83.68 (± 1.55)
	adni	88.65 (± 1.86)	88.58 (±2)	87.3 (± 2.99)	87.96 (± 2.69)
	ixi	86.51 (± 3.33)	86.54 (± 3.67)	85.94 (± 3.01)	86.2 (± 2.83)
	la5c	89.49 (± 0.51)	89.63 (± 0.49)	88.86 (± 1.83)	89.46 (± 1.49)
	oasis1	88.9 (± 0.88)	89.12 (± 0.91)	88.33 (± 0.88)	89.03 (± 0.8)
	oasis2	88.35 (± 1.59)	88.49 (± 1.65)	87.00 (± 2.21)	87.57 (± 2.26)
ASD	HCP	0.317 (± 0.06)	0.306 (± 0.059)	0.245 (± 0.126)	0.224 (± 0.128)
	Rhineland	0.291 (± 0.043)	0.276 (± 0.042)	0.241 (± 0.13)	0.215 (± 0.131)
	abide-i	0.428 (± 0.114)	0.421 (± 0.108)	0.366 (± 0.225)	0.345 (± 0.226)
	abide-ii	0.347 (± 0.058)	0.311 (± 0.054)	0.352 (± 0.079)	0.313 (± 0.073)
	adni	0.307 (± 0.07)	0.31 (± 0.075)	0.278 (± 0.176)	0.261 (± 0.164)
	ixi	0.44 (± 0.203)	0.416 (± 0.151)	0.296 (± 0.107)	0.281 (± 0.102)
	la5c	0.279 (± 0.038)	0.274 (± 0.034)	0.236 (± 0.076)	0.217 (± 0.062)
	oasis1	0.291 (± 0.034)	0.283 (± 0.034)	0.233 (± 0.031)	0.215 (± 0.028)
	oasis2	0.312 (± 0.058)	0.309 (± 0.065)	0.255 (± 0.064)	0.24 (± 0.064)

		DSC		ASD
Data	Networks	Subcortical	Cortical	Subcortical	Cortical
0.7 mm	FastSurferCNN* + exSA	85.8 (± 1.69)	86.11 (± 1.69)	0.416 (± 0.088)	0.367 (± 0.088)
0.7 mm	FastSurferVINN	86.49 (± 1.62)	87.5 (± 1.62)	0.397 (± 0.084)	0.294 (± 0.084)
0.8 mm	FastSurferCNN* + exSA	88.31 (± 1.13)	87.31 (± 1.13)	0.328 (± 0.057)	0.273 (± 0.057)
0.8 mm	FastSurferVINN	88.75 (± 0.93)	88.28 (± 0.93)	0.316 (± 0.042)	0.246 (± 0.042)
0.9 mm	FastSurferCNN* + exSA	86.9 (± 1.2)	81.92 (± 1.2)	0.347 (± 0.052)	0.352 (± 0.052)
0.9 mm	FastSurferVINN	87.85 (± 1.19)	83.68 (± 1.19)	0.311 (± 0.049)	0.313 (± 0.049)

Measure		FastSurferCNN* + exSA	FastSurferVINN
DSC	GM	95.77 (± 0.46)	96.04 (± 0.5)
	WM	97.40 (± 0.4)	97.54 (± 0.38)
	Hippocampus	92.94 ( ± 1.02)	93.04 (± 1.07)
ASD	GM	0.067 (± 0.016)	0.062 (± 0.018)
	WM	0.079 (± 0.025)	0.075 (± 0.024)
	Hippocampus	0.185 (± 0.078)	0.181 (± 0.081)

	Networks	FastSurferCNN (n = 120)	FastSurferCNN* (n = 120)	FastSurferVINN (n = 120)
DSC	ABIDE-I	83.33 (± 4.29)	84.55 (± 3.46)	85.30 (± 2.88)
	ADNI	86.99 (± 2.15)	87.59 (± 2.09)	87.94 (± 1.91)
	IXI	84.46 (± 4.61)	85.22 (± 5.05)	86.14 (± 3.70)
	LA5C	88.3 (± 1.40)	88.84 (± 1.33)	89.25 (± 1.35)
	OASIS1	88 (± 1.38)	88.53 (± 1.37)	88.89 (± 1.39)
	OASIS2	87.36 (± 1.53)	87.81 (± 1.51)	88.18 (± 1.26)
ASD	ABIDE-I	0.441 (± 0.140)	0.416 ( ± 0.129)	0.402 (± 0.125)
	ADNI	0.315 (± 0.094)	0.294 ( ± 0.086)	0.281 (± 0.079)
	IXI	0.381 (± 0.128)	0.367 (± 0.148)	0.332 (± 0.106)
	LA5C	0.273 (± 0.045)	0.256 (± 0.044)	0.241 (± 0.043)
	OASIS1	0.278 (± 0.062)	0.264 ( ± 0.063)	0.255 (± 0.066)
	OASIS2	0.29 (± 0.048)	0.276 ( ± 0.048)	0.262 (± 0.039)

	Datasets	FastSurferVINN + inSA (n = 120)	FastSurferVINN + inSA (n = 1300)		FastSurferVINN + inSA (n = 120)	FastSurferVINN + inSA (n = 1300)
DSC	0.7 mm	89.4 (± 1.29)	89.87 (± 1.28)	ASD	0.259 (± 0.08)	0.243 (± 0.080)
	0.8 mm	89.84 (± 1.46)	90.57 (± 1.44)		0.241 (± 0.079)	0.218 (± 0.080)
	0.9 mm	85.45 (± 1.13)	88.75 (± 1.39)		0.347 (± 0.050)	0.263 (± 0.052)
	1.0 mm	87.67 (± 2.51)	90.26 (± 1.54)		0.296 (± 0.105)	0.231 (± 0.074)

PERMALINK

FastSurferVINN: Building resolution-independence into deep learning segmentation methods—A solution for HighRes brain MRI

Leonie Henschel

David Kügler

Martin Reuter

Abstract

1. Introduction

Fig. 1.

1.1. Contributions

2. Related work

2.1. High-resolution MRI

Fig. 2.

2.2. Automated analysis of high-resolution images

2.3. Resolution-independence in deep learning

Fig. 3.

3. Material and methods

3.1. Datasets

ABIDE-II

HCP

Rhineland Study

3.2. Voxel size independent neural network

3.2.1. Network-integrated resolution-normalization

3.2.2. Network architecture modifications

CDB design

Pre-IDB

Post-CDB

3.3. High-res network modifications

Loss function

Localized weights

Fig. 4.

Adaptive attention module

3.4. View aggregation

3.5. Augmentations

External scaling augmentation for CNNs (exSA)

Internal scaling augmentation for VINNs (inSA)

3.6. Evaluation metrics

3.7. Training setup

Training dataset

Training parameters

4. Results

4.1. Scaling augmentation versus network-integrated resolution-normalization in FastSurferVINN

Fig. 5.

Fig. 6.

4.2. HiRes specific adjustments

Fig. 7.

4.3. Generalizability

4.3.1. Across datasets

Fig. 8.

4.3.2. Unseen resolutions

Fig. 9.

Fig. 10.

4.3.3. Comparison to manual reference

Fig. 11.

4.4. Fixed-resolution networks versus FastSurferVINN

Fig. 12.

4.5. Big-FastSurferVINN

Fig. 13.

5. Discussion

Acknowledgments

Abbreviations:

Appendix A

A1. Attention

A2. Ablative optimization of FastSurferCNN

Table 1.

Table 2.

A3. 1.0 mm Datasets

ABIDE I

ADNI

IXI

LA5c

Table 3.

Table 4.

Mindboggle-101

MIRIAD

Mind-brain-body

Oasis-1

Table 5.

A4. Tabulated results

Table 6.

Table 7.