Abstract
Neuroimaging pipelines have long been known to generate mildly differing results depending on various factors, including software version. While these differences are generally considered acceptable and within a reasonable margin of error, little is known about their effect in common research scenarios such as inter-group comparisons between healthy controls and various pathological conditions. The aim of the presented study was to explore the differences in inferences and statistical significance in a model situation comparing volumetric parameters between healthy controls and type 1 diabetes patients using various FreeSurfer versions. Worryingly, FreeSurfer 5.3 detected both cortical and subcortical volume differences among the preselected regions of interest, but newer versions such as FreeSurfer 5.3 HCP and FreeSurfer 6.0 reported only subcortical differences of lower magnitude, and FreeSurfer 7.1 failed to find any statistically significant inter-group differences. Since the group averages of the individual FreeSurfer versions closely matched, in keeping with previous literature, the main origin of this disparity seemed to lie in substantially higher within-group variability in the model pathological condition. Therefore, until validation in common research scenarios such as case-control comparison studies is included in the development process of new software suites, confirmatory analyses utilising similar software based on analogous, but not fully equivalent, principles might be considered as a supplement to careful quality control.
Keywords: FreeSurfer, Magnetic resonance imaging, Volumetry, Repeatability
Introduction
The automation of brain segmentation techniques, continuously developed over the last few decades, has substantially contributed to advancements in the field by enabling volumetric analyses of large cohorts with reasonable computation time requirements. Even though expecting full autonomy from any software package without any follow-up inspection is scientifically irresponsible and may yield misleading results, automatic methods often require mere minutes of operator time per subject to produce output of generally good quality, comparing favourably to the hours necessary for demanding manual tracing protocols.
Among the more complex segmentation and parcellation tools, FreeSurfer (FS) (Fischl 2012) holds the position of de facto standard tool for the measurement of cortical thickness and volume of a wide spectrum of neuroanatomical structures. Its accuracy has been repeatedly validated against manually derived volumes for the studies of hippocampus and amygdala (Morey et al. 2009; Tae et al. 2008) and against histological analyses (Cardinale et al. 2014; Rosas et al. 2002). Studies assessing the influence of MRI scanner and acquisition-specific parameters such as field strength, pulse sequence or manufacturer on FS output (Han et al. 2006; Jovicich et al. 2009), scan-rescan reliability (Morey et al. 2010) and even the effect of utilising different computing platforms (Glatard et al. 2015) have sparked much interest. Furthermore, the variability of FS across software versions is non-negligible as well, with reported average volume differences of 8.8 ±6.6% (Gronenschild et al. 2012). While inherent and understandable given the character of implemented improvements of existing tools, correction of bugs and extension of the processing pipeline steps as described in detail in relevant documentation (http://surfer.nmr.mgh.harvard.edu/fswiki/ReleaseNotes), these differences cast an uneasy shadow on the validity of the outcomes and inferences. Indeed, the developers themselves strongly advocate against changing the program version during one study. Nonetheless, the magnitude of the observed volume differences related to various hardware and software differences is often close to the effect sizes reported in studies investigating differences between healthy controls (HC) and both primarily neurodegenerative diseases (Hlavatá et al. 2020; Messina et al. 2011; Vasconcellos et al. 2018) and secondary cerebral affections (Bednarik et al. 2017; Filip et al. 2020). 
This poses uneasy questions about possible systemic software bias in virtually all research approaches utilizing one approach and one software package, which even large population studies are not fully shielded from. In contrast, it has been shown that the use of different FS versions does not affect the classification accuracy of controls, those with Alzheimer’s disease, and those with mild cognitive impairment despite the significant differences in reported cortical thicknesses between FS packages (Chepkoech et al. 2016) or the outcomes of analyses correlating the volumes of intracranial structures with demographic parameters (Bigler et al. 2020). The consequent impact of different FS versions in a simple exploration of differences between a control group and a patient cohort, with inference constrained by generally accepted hard thresholding of statistically significant and non-significant inter-group differences, has been largely neglected in the current literature.
The primary aim of the present study was to address the above-stated concern in the most straightforward way conceivable: repeating the automated segmentation of the same cohort of HC and patients with various FS versions, thereby simulating a common processing workflow in a volumetric study, followed by a simple, common inter-group comparison with Student’s T-test. One of the main incentives for this study was the disparity in volumetric results between our two previous studies in type 1 diabetes (T1D), which even shared nearly half of the subject cohort. The first, a purely volumetric study utilising FS 5.3 with meticulous manual correction and editing to achieve the best possible segmentation and parcellation results (Bednarik et al. 2017), found statistically significant differences between T1D and HC in the volume of the whole cortex and the volume of the frontal cortex. The second study, primarily focusing on the T1-weighted/T2-weighted (T1w/T2w) ratio, where the structural analysis was processed using the Human Connectome Project (HCP) pipeline (Glasser et al. 2013) based on a substantially modified version of FS 5.3, found lower volume of both putamina and thalami in T1D, with no apparent changes at the level of the whole cortex or in the frontal cortex (Filip et al. 2020). Therefore, the current study compares separate group-level analyses of volumetric outcomes provided by FS 5.3 without manual correction (FS 5.3 raw), FS 5.3 with careful manual corrections (FS 5.3 MC), the FS 5.3 HCP version, FS 6.0 and FS 7.1 to see whether the cross-version differences induce a systemic bias reflected in all the analysed subjects in the same way, hence not affecting the ultimate group comparison results, or whether the variability of FS outputs is reflected in the final data-based inferences on inter-group differences.
Methods
Subjects
The subjects were retrospectively drawn from a previously published project (Bednarik et al. 2017; Filip et al. 2020) that aimed to describe the brain response to hypoglycaemia in the healthy population and in T1D patients. In total, 24 T1D patients (15 women; average age ± standard deviation (SD): 36.6 ±11.9 years) and 27 HC (13 women; 33.9 ±14.3 years) were included in the study. These subject groups comprise only the part of the cohort shared by the two previously published reports. For more information on the T1D cohort inclusion and exclusion criteria, see the referenced publications. As described in more detail in the second paper, the switch to another MRI scanner and the differences in the T2w sequence, which was updated during the course of the project, do not enable a new analysis of the whole cohort, since the HCP pipeline utilises the T2w scans both for pial surface refinement and for general intensity normalisation before FS processing.
The study protocol was approved by the institutional review board of the University of Minnesota. Each subject provided his/her written informed consent in keeping with the Declaration of Helsinki.
Image acquisition protocol
MRI acquisition was performed using a Siemens 3T Prisma system (Siemens, Erlangen, Germany). T1w images were acquired using the MPRAGE sequence with 1-mm isotropic voxel size, repetition time (TR) of 2,150 ms, echo time (TE) of 2.47 ms, GRAPPA 2 and flip angle of 8°. T2w images were acquired using the SPACE sequence with 1-mm isotropic voxel size, TR 3,200 ms, TE 137 ms and GRAPPA 2.
Data processing
The volumetric output for FS 5.3 was drawn from the already published study (Bednarik et al. 2017). Briefly, the processing included standard FS 5.3 pipeline with pial surface refinement based on the T2w scan. Subsequently, the segmentation outputs were diligently corrected in several steps consisting of control point additions, manual editing of white matter segmentations and pial surface corrections by an experienced neuroradiologist (P.B.). For all further analyses, these outputs (FS 5.3 MC) are considered the baseline results to which the other pipelines are compared. This approach was chosen to simplify the analysis outputs and avoid listing the comparisons between all the versions, without inferring that FS 5.3 MC results are closer to the ground truth about the presence or absence of atrophy in T1D. In addition to these outputs, the raw, uncorrected results of automatic FS 5.3 segmentation and parcellation (FS 5.3 raw) were included as the second analysis set to assess the effect of careful manual corrections in a way similar to the evaluation of the effect of different FS versions. Similarly to the FS 5.3 output, the FS 5.3 HCP results were reused from the other published study (Filip et al. 2020). Contrary to the other simple FS analyses, the HCP pipeline feeds the T1w and T2w images to FS only after the gradient distortion correction based on a proprietary Siemens gradient coefficient file, alignment of the anterior commissure, anterior-posterior commissure line and interhemispheric plane, custom brain extraction different from the method of brain mask extraction implemented in FS, boundary-based registration of the T2w scan to the T1w space and lastly intensity inhomogeneity correction (Glasser et al. 2013). FS 6.0 and FS 7.1 were run in the standard pipeline based on the raw T1w scans without any preprocessing, followed by pial surface refinement using the T2w image. 
For FS 5.3 HCP, FS 6.0 and FS 7.1, less stringent quality control criteria were implemented, where only defects and/or imprecisions exceeding 3-voxel threshold in any direction were manually corrected.
All the above-described processing outputs were separately used to generate the same 9 regions of interest (ROIs) as in the first study, specifically whole cortex, frontal lobe, parietal lobe, temporal lobe, occipital lobe, anterior cingulate cortices (ACC; corresponding to the bilateral rostral anterior cingulate parcel delineated by FS), posterior cingulate cortices (PCC), thalami and hippocampi. The ROI volumes were scaled as a percent of the estimated intracranial volume (eICV) as reported by individual FS versions. Both eICV-normalised and raw ROI volumes were used in group-level statistical analysis to be able to appreciate the eventual influence of both the ROI volume and eICV on the final results. And lastly, the newly introduced and supposedly superior optional Sequence Adaptive Multimodal Segmentation (SAMSEG) in FS 7.1 was used to assess the consistency of FS-derived eICV values.
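The eICV normalisation described above amounts to a single division per ROI. The following is a minimal sketch only, assuming hypothetical volumes in mm³; the function name and the numbers are illustrative and are not taken from the study data or from any FS tool.

```python
def normalise_by_eicv(roi_volume_mm3: float, eicv_mm3: float) -> float:
    """Return the ROI volume as a fraction of the estimated intracranial volume."""
    if eicv_mm3 <= 0:
        raise ValueError("eICV must be positive")
    return roi_volume_mm3 / eicv_mm3


# Hypothetical subject: 500 cm^3 of cortex in a 1,500 cm^3 intracranial volume.
cortex_fraction = normalise_by_eicv(500_000.0, 1_500_000.0)
print(f"{cortex_fraction:.4f} ({100 * cortex_fraction:.2f}% of eICV)")
```

Because the denominator comes from the same FS run as the ROI volume, any error in the eICV estimate propagates to every normalised ROI of that subject.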
We also assessed the spatial overlap of each of these ROIs against FS 5.3 MC using a measure previously labelled Similarity Index (SI) or Dice coefficient (Dice 1945; Hammers et al. 2007). Boundary-based rigid-body coregistration matrix of the T1w scan in each FS run to the FS 5.3 MC space was applied to each of the 9 ROIs and to the brain mask in each subject. This step was necessary due to the use of a different space in FS 5.3 HCP and minor inter-version differences in the outputs of linear Talairach transform preceded by intensity non-uniformity correction in several subjects. Overall (pooled HC and T1D) averages and SDs for each ROI and each FS version are presented.
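For orientation, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below illustrates the computation on toy masks represented as sets of voxel coordinates; the representation, the function name and the cube example are assumptions made for illustration and do not reproduce the tooling used in the study.

```python
def dice_coefficient(mask_a: set, mask_b: set) -> float:
    """Dice coefficient of two masks given as sets of (x, y, z) voxel indices."""
    if not mask_a and not mask_b:
        # Convention chosen for this sketch: two empty masks overlap perfectly.
        return 1.0
    return 2.0 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


# Toy example: a 10 x 10 x 10 voxel cube and the same cube shifted by one voxel.
roi_reference = {(x, y, z) for x in range(10) for y in range(10) for z in range(10)}
roi_shifted = {(x + 1, y, z) for (x, y, z) in roi_reference}

si = dice_coefficient(roi_reference, roi_shifted)
print(f"Similarity index: {si:.3f}")
```

In this toy case a one-voxel shift of a 1,000-voxel cube already lowers the index to 0.9, which illustrates why small structures are more sensitive to boundary disagreements than large ones.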
Statistical analyses
In contrast to the two previous publications, the presented study used two-sample two-tailed Student’s T-test with unequal variance to compare the normalised volumes of individual ROIs in HC and T1D patients. This approach was deliberately chosen for the sake of simplicity, to be able to fully appreciate individual contributors to the results of the comparison, namely the roles of averages and of SDs in any inconsistencies. The results were considered statistically significant at the predetermined alpha of 0.05.
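The unequal-variance form of the two-sample T-test is commonly referred to as Welch's test. The sketch below shows how the test statistic and the Welch-Satterthwaite degrees of freedom follow from group averages, SDs and group sizes alone. It uses the rounded whole-cortex summary values of FS 5.3 MC from Table 1 and the group sizes from the Subjects section, and it assumes that all subjects contributed to this ROI. The conversion to a p-value is omitted because it requires the CDF of the t distribution, which is not part of the Python standard library.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of the first group mean
    var_b = sd_b ** 2 / n_b  # squared standard error of the second group mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Whole cortex, FS 5.3 MC (rounded values from Table 1): HC vs. T1D.
t_stat, df = welch_t(0.3313, 0.0235, 27, 0.3187, 0.0182, 24)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Since the inputs are rounded, the resulting statistic can only approximate the value underlying the p-value reported in Table 1.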
Subsequently, the percentage differences of both averages and SDs when compared to FS 5.3 MC were calculated (e.g., (FS 5.3 raw − FS 5.3 MC)/FS 5.3 MC) and plotted for those ROIs considered statistically significantly different in the analysis based on FS 5.3 MC (Bednarik et al. 2017).
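As a worked illustration of this formula, the sketch below computes the relative difference of FS 7.1 against FS 5.3 MC for the whole cortex in T1D patients, using the rounded averages and SDs listed in Table 1. The function name is hypothetical; because the inputs are rounded, the outputs only approximate the corresponding entries of Table 2, which were presumably derived from unrounded data.

```python
def percent_difference(value: float, baseline: float) -> float:
    """Percentage difference of `value` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


# Whole cortex, T1D patients: FS 7.1 vs. FS 5.3 MC (rounded values from Table 1).
mean_change = percent_difference(0.3202, 0.3187)  # within-group average
sd_change = percent_difference(0.0284, 0.0182)    # within-group SD

print(f"average: {mean_change:+.2f}%, SD: {sd_change:+.2f}%")
```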
Results
In the FS 5.3 MC quality control, 0.04 ±1.5% (average ±SD) and 0.06 ±1.1% of voxels were removed on average in controls and T1D patients, respectively (Bednarik et al. 2017). Based on the less stringent quality control protocols, where manual correction was performed only in defects exceeding 3 contiguous voxels in any direction, FS 5.3 HCP and FS 6.0 performed exceedingly well, as only 1 subject in each FS run required manual pial surface correction. However, FS 7.1 underperformed due to intensity normalisation problems leading to the exclusion of a substantial part of the occipital cortex from the brain mask in 6 subjects (4 HCs and 2 T1D patients) and reporting of implausibly low eICV in 2 further subjects (1 HC and 1 T1D). All of these errors were manually corrected.
The averages and SDs for both HC and T1D patients are presented together with the results of the T-tests for all the reported FS versions in Table 1. Interestingly, the inter-group differences at the level of the whole cortex as previously published were detected only by FS 5.3, both raw and MC. The other FS versions failed to reveal any statistically significant results (p >0.20) in the volume of the whole cortex. The situation was similar for all the individual cerebral lobes: only FS 5.3 found differences in the frontal (FS 5.3 MC) or parietal (FS 5.3 raw) cortex, with sub-threshold differences in several other lobes (see Table 1). Alarmingly, the other FS versions failed to reveal any plausible group effects (p >0.20), with the exception of the occipital cortex in FS 5.3 HCP, although only at a borderline level. Interestingly, the most consistent structure was the thalamus, with group-level differences detected using FS 5.3 (both raw and MC), FS 5.3 HCP and FS 6.0. All in all, FS 7.1 failed to yield any results with significance levels close to those of the other versions.
Table 1:
Within-group average volumes normalised by estimated intracranial volume and their standard deviations in healthy controls and type 1 diabetes patients as reported by different versions of FreeSurfer, including the percentage inter-group differences of averages and p-values of two-sample two-tailed T-tests.
| FS version | Measure | Cortex whole | Frontal lobe | Parietal lobe | Temporal lobe | Occipital lobe | Anterior cingulate | Posterior cingulate | Thalami | Hippocampi |
|---|---|---|---|---|---|---|---|---|---|---|
| FS5.3 with careful manual correction | HC (average [SD]) | 0.3313 [0.0235] | 0.1207 [0.0093] | 0.0805 [0.0066] | 0.0752 [0.0049] | 0.0315 [0.0037] | 0.0034 [0.0004] | 0.0034 [0.0004] | 0.0105 [0.0010] | 0.0059 [0.0004] |
| | T1D (average [SD]) | 0.3187 [0.0182] | 0.1158 [0.0080] | 0.0773 [0.0053] | 0.0729 [0.0049] | 0.0299 [0.0022] | 0.0034 [0.0004] | 0.0033 [0.0004] | 0.0098 [0.0008] | 0.0059 [0.0005] |
| | % inter-group dif. | −3.80% | −4.05% | −4.01% | −3.12% | −5.10% | 1.08% | −4.64% | −6.87% | −0.17% |
| | p (uncorrected) | 0.0348 | 0.0467 | 0.0575 | 0.0927 | 0.0615 | 0.7499 | 0.1674 | 0.0052 | 0.9405 |
| FS5.3 raw | HC (average [SD]) | 0.3341 [0.0238] | 0.1173 [0.0094] | 0.0756 [0.0064] | 0.0761 [0.0053] | 0.0329 [0.0040] | 0.0034 [0.0004] | 0.0035 [0.0004] | 0.0102 [0.0010] | 0.0058 [0.0004] |
| | T1D (average [SD]) | 0.3214 [0.0181] | 0.1133 [0.0070] | 0.0722 [0.0050] | 0.0740 [0.0052] | 0.0312 [0.0022] | 0.0035 [0.0004] | 0.0034 [0.0004] | 0.0095 [0.0007] | 0.0057 [0.0005] |
| | % inter-group dif. | −3.81% | −3.44% | −4.52% | −2.66% | −5.24% | 1.85% | −3.82% | −6.79% | −0.53% |
| | p (uncorrected) | 0.0355 | 0.0875 | 0.0376 | 0.1741 | 0.0599 | 0.5817 | 0.2585 | 0.0059 | 0.8201 |
| FS5.3 HCP | HC (average [SD]) | 0.3477 [0.0221] | 0.1268 [0.0097] | 0.0810 [0.0065] | 0.0810 [0.0042] | 0.0350 [0.0033] | 0.0037 [0.0004] | 0.0037 [0.0005] | 0.0105 [0.0011] | 0.0055 [0.0004] |
| | T1D (average [SD]) | 0.3394 [0.0265] | 0.1243 [0.0112] | 0.0780 [0.0068] | 0.0791 [0.0069] | 0.0335 [0.0024] | 0.0037 [0.0005] | 0.0035 [0.0004] | 0.0098 [0.0010] | 0.0056 [0.0005] |
| | % inter-group dif. | −2.38% | −2.01% | −3.64% | −2.41% | −4.10% | 0.62% | −4.66% | −6.82% | 2.10% |
| | p (uncorrected) | 0.2351 | 0.3938 | 0.1221 | 0.2351 | 0.0803 | 0.8580 | 0.1784 | 0.0179 | 0.3695 |
| FS6.0 | HC (average [SD]) | 0.3271 [0.0236] | 0.1173 [0.0098] | 0.0754 [0.0064] | 0.0755 [0.0047] | 0.0341 [0.0045] | 0.0032 [0.0004] | 0.0035 [0.0004] | 0.0104 [0.0011] | 0.0054 [0.0004] |
| | T1D (average [SD]) | 0.3218 [0.0221] | 0.1156 [0.0094] | 0.0730 [0.0056] | 0.0746 [0.0055] | 0.0332 [0.0031] | 0.0033 [0.0004] | 0.0034 [0.0004] | 0.0097 [0.0010] | 0.0055 [0.0004] |
| | % inter-group dif. | −1.62% | −1.50% | −3.27% | −1.25% | −2.51% | 2.45% | −1.43% | −6.96% | 1.22% |
| | p (uncorrected) | 0.4119 | 0.5162 | 0.1492 | 0.5135 | 0.4299 | 0.5152 | 0.6795 | 0.0180 | 0.5715 |
| FS7.1 | HC (average [SD]) | 0.3276 [0.0289] | 0.1194 [0.0113] | 0.0756 [0.0074] | 0.0734 [0.0059] | 0.0327 [0.0048] | 0.0032 [0.0004] | 0.0034 [0.0004] | 0.0107 [0.0012] | 0.0056 [0.0005] |
| | T1D (average [SD]) | 0.3202 [0.0284] | 0.1163 [0.0117] | 0.0735 [0.0066] | 0.0718 [0.0069] | 0.0316 [0.0035] | 0.0032 [0.0005] | 0.0033 [0.0004] | 0.0102 [0.0011] | 0.0057 [0.0006] |
| | % inter-group dif. | −2.27% | −2.60% | −2.90% | −2.27% | −3.42% | 1.86% | −1.32% | −4.61% | 1.25% |
| | p (uncorrected) | 0.3582 | 0.3424 | 0.2695 | 0.3635 | 0.3432 | 0.6482 | 0.6909 | 0.1425 | 0.6396 |
Green colour denotes statistically significant group differences at the predetermined alpha of 0.05, yellow colour denotes borderline significant group differences with the p-value between 0.05 and 0.10. Abbreviations: HC – healthy controls; T1D – type 1 diabetes; SD – standard deviation; FS – FreeSurfer.
When looking at the differences between the within-group average and SD for HC and T1D patients in individual FS versions against FS 5.3 MC, a clear pattern emerges: while version differences in averages in both HC and T1D patients generally stay under the 10% threshold, with the exception of the occipital cortex, version differences in the SDs, mainly in the T1D group, commonly approach 50% (see Table 2; Figure 1 charts the 2 ROIs, whole cortex and thalami, that were statistically significant in FS 5.3 MC).
Table 2:
Comparison of the percentage differences between individual FS versions (no manual correction) relative to FS 5.3 MC (with careful manual correction) for within-group averages and standard deviations of volumes of predetermined regions of interest normalised using the estimated intracranial volume (positive values denote the parameter being larger in the respective FS version than in FS 5.3 MC).
Percent difference compared to FS 5.3 with careful manual correction:

| Parameter | Group | FS version | Cortex whole | Frontal lobe | Parietal lobe | Temporal lobe | Occipital lobe | Anterior cingulate | Posterior cingulate | Thalami | Hippocampi | eICV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Averages | Healthy controls | FS5.3 raw | 0.84% | −2.81% | −6.09% | 1.14% | 4.53% | 0.36% | 3.68% | −2.43% | −2.38% | 0.00% |
| | | FS5.3 HCP | 4.95% | 5.05% | 0.61% | 7.74% | 11.07% | 9.95% | 8.77% | 0.54% | −7.42% | −0.61% |
| | | FS6.0 | −1.27% | −2.80% | −6.29% | 0.39% | 8.38% | −5.70% | 1.83% | −0.98% | −8.30% | 1.39% |
| | | FS7.1 | −1.11% | −1.09% | −6.01% | −2.35% | 3.87% | −6.48% | −1.40% | 2.24% | −4.38% | −2.77% |
| | Type 1 diabetes patients | FS5.3 raw | 0.83% | −2.18% | −6.59% | 1.61% | 4.37% | 1.13% | 4.58% | −2.34% | −2.73% | 0.00% |
| | | FS5.3 HCP | 6.49% | 7.29% | 1.00% | 8.53% | 12.25% | 9.46% | 8.76% | 0.59% | −5.31% | −3.43% |
| | | FS6.0 | 0.96% | −0.22% | −5.56% | 2.32% | 11.34% | −4.42% | 5.26% | −1.07% | −7.03% | −0.64% |
| | | FS7.1 | 0.46% | 0.41% | −4.92% | −1.50% | 5.71% | −5.75% | 2.04% | 4.72% | −3.02% | −1.25% |
| Standard deviations | Healthy controls | FS5.3 raw | 1.10% | 1.20% | −2.89% | 7.18% | 7.94% | 4.77% | 8.69% | −1.77% | −1.97% | 0.00% |
| | | FS5.3 HCP | −5.77% | 4.64% | −1.67% | −14.74% | −11.22% | 10.99% | 16.49% | 16.09% | 4.84% | −1.18% |
| | | FS6.0 | 0.53% | 5.13% | −3.65% | −5.65% | 21.82% | 11.14% | 2.88% | 10.48% | −10.63% | 0.52% |
| | | FS7.1 | 23.16% | 21.95% | 12.25% | 20.27% | 31.41% | 19.57% | −5.60% | 25.39% | 18.06% | −2.73% |
| | Type 1 diabetes patients | FS5.3 raw | −0.21% | −11.69% | −6.61% | 5.40% | 1.20% | −5.84% | −1.18% | −4.81% | −1.90% | 0.00% |
| | | FS5.3 HCP | 45.97% | 40.69% | 27.99% | 39.49% | 9.47% | 12.82% | 5.77% | 21.13% | −13.11% | −2.22% |
| | | FS6.0 | 21.67% | 18.05% | 5.75% | 10.88% | 41.14% | −1.32% | 6.45% | 29.41% | −16.78% | 3.65% |
| | | FS7.1 | 56.20% | 47.24% | 23.70% | 40.99% | 55.71% | 6.23% | −0.55% | 44.06% | 5.79% | 16.75% |
Colour shading, for better visualisation, denotes the magnitude of the relative difference against FS 5.3 MC. Abbreviations: eICV – estimated intracranial volume; FS – FreeSurfer.
Figure 1:

Column charts of percentage differences between individual FS versions (no manual correction) relative to FS 5.3 MC (with careful manual correction) for the within-group average normalised volume of whole cortex and thalamus and for the within-group standard deviation of normalised volumes for healthy controls and type 1 diabetes patients (positive values denote the parameter being larger in the respective FS version than in FS 5.3 MC). Abbreviations: FS – FreeSurfer; SD – standard deviation; HC – healthy control; T1D – type 1 diabetes
The differences between the eICVs used to normalise the volumes, as reported by individual FS versions, were non-negligible as well (see Figure 2). Compared to the SAMSEG output, FS 7.1 provided substantially lower eICVs, with the difference exceeding 3% on average, whereas FS 6.0 overestimated the values. Surprisingly, FS 5.3 and FS 5.3 HCP provided the results closest to SAMSEG. No clear group-dependent pattern was seen, in contrast to the within-group SDs for ROI volumes shown in Figure 1.
Figure 2:

Column charts of percentage differences between the individual FS versions relative to the output of Sequence Adaptive Multimodal Segmentation (SAMSEG) for the estimated intracranial volume for healthy controls and type 1 diabetes patients (positive values denote the parameter being larger in the respective FS version than in the SAMSEG output). Abbreviations: FS – FreeSurfer; SD – standard deviation; HC – healthy control; T1D – type 1 diabetes
Interestingly, when repeating the same inter-group ROI volume comparisons, but using volumes normalised with SAMSEG-derived ICV (uniformly in all the FS versions), virtually all the previously detected inter-group differences in the whole cortex and individual lobes disappear, leaving only borderline thalamic group differences in FS 5.3 MC, FS 5.3 raw, FS 5.3 HCP and FS 6.0 (see Supplementary table 1). Again, FS 7.1 fails to detect any significant difference. Furthermore, the pattern of increased SDs mainly in the T1D group is not apparent for FS 7.1 (see Supplementary table 2).
And finally, brain mask SIs comparing other FS versions to the FS 5.3 MC showed generally good correspondence, over 95% in all the FS versions (see Table 3). For the whole cortical mask, the correspondence level was slightly lower, but still over 90%. The most substantial inconsistencies were detected in the ACC and PCC parcellation masks, generally below 85% and for the ACC mask in FS 7.1, even lower than 80% (see Supplementary Figure 1 for an example of a good and a poor ACC spatial overlap of individual FS versions).
Table 3:
Similarity indices (spatial overlap) comparing individual FreeSurfer versions to FreeSurfer 5.3 with manual correction.
| Region | FS5.3 raw | FS5.3 HCP | FS6.0 | FS7.1 |
|---|---|---|---|---|
| Brain mask (av. [SD]) | 0.993 [0.005] | 0.971 [0.005] | 0.979 [0.005] | 0.967 [0.008] |
| Cortex whole (av. [SD]) | 0.978 [0.015] | 0.905 [0.010] | 0.936 [0.017] | 0.901 [0.017] |
| Frontal lobe (av. [SD]) | 0.974 [0.016] | 0.893 [0.013] | 0.923 [0.018] | 0.891 [0.033] |
| Parietal lobe (av. [SD]) | 0.970 [0.018] | 0.880 [0.014] | 0.905 [0.022] | 0.876 [0.033] |
| Temporal lobe (av. [SD]) | 0.971 [0.019] | 0.911 [0.009] | 0.931 [0.013] | 0.878 [0.032] |
| Occipital lobe (av. [SD]) | 0.951 [0.029] | 0.858 [0.013] | 0.903 [0.027] | **0.847** [0.045] |
| Anterior cingulate (av. [SD]) | 0.927 [0.049] | **0.823** [0.023] | **0.814** [0.057] | **0.759** [0.055] |
| Posterior cingulate (av. [SD]) | 0.948 [0.030] | **0.847** [0.017] | 0.863 [0.032] | **0.830** [0.030] |
| Thalami (av. [SD]) | 0.994 [0.021] | 0.920 [0.022] | 0.897 [0.031] | 0.899 [0.031] |
| Hippocampi (av. [SD]) | 0.993 [0.023] | 0.855 [0.017] | 0.862 [0.021] | 0.858 [0.022] |
Overall (pooled healthy controls and type 1 diabetes patients) averages and standard deviations for the brain mask, each region of interest and each FreeSurfer version are presented. Values below 0.85 marked bold. Abbreviations: FS – FreeSurfer; av – average; SD – standard deviation.
Discussion
Neuroimaging pipelines have long been known to generate different results depending on various seemingly innocuous factors, including the computing platform, hardware architecture, MRI acquisition parameters and software package version. Indeed, it is reasonable to expect minor variability of absolute measurements related to the development of new software with the introduction of new processes and/or the upgrading of the original pipelines. Nonetheless, from a practical research perspective, these differences may be of relatively low importance if outputs are reasonably accurate and stable over all the populations of interest. In line with previous reports (Gronenschild et al. 2012; Jovicich et al. 2009), the absolute differences between volumetric outputs of individual FS versions detected in the presented study were mostly below 10%. However, this accumulation of minor variability seemed to have detrimental effects on one of the main practical implementations of such volumetric analyses: inference on neuroanatomical volumetric differences between a group of healthy controls and a group with a pathological condition, T1D in this case. While FS 5.3 (both raw and MC) indicated statistically significant group differences in several major brain structures, other versions of this software suite were unable to replicate these findings utilising the same data, with Student’s T-test in FS 7.1 data yielding p-values well above 0.3 for all cortical structures except the parietal lobe (p = 0.27), which would commonly be deemed a failure to find differences.
When contrasting the inter-version variability with inter-group differences in brain morphology, i.e. volume quantification differences between FS versions of up to 10% vs. generally up to 7% volume difference between HC and T1D patients, it becomes clear that the power of simple volumetric studies may be substantially compromised by FS version changes. Nevertheless, in one of the main findings in FS 5.3 MC data, the whole cortical volume, the inter-version difference in the average values was minimal, e.g. −1.1% and 0.5% in FS 7.1 for HC and T1D patients, respectively (see Tables 1 and 2). It was the SD, which exhibited a 23.2% and 56.2% rise in FS 7.1 when compared to FS 5.3 MC, that ultimately decreased the statistical significance of the T-test. This trend of relatively consistent average values with extreme increases in the within-group SDs, mainly in T1D patients, is apparent in other regions as well (see Table 2 and Figure 1).
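The effect of the SD on the test statistic can be isolated with a short hypothetical calculation. The sketch below keeps the whole-cortex difference of averages from FS 5.3 MC fixed (rounded values from Table 1) and inflates only the within-group SDs by the 23.2% and 56.2% quoted above. It is a thought experiment under these stated assumptions, not a re-analysis: in the actual FS 7.1 output the difference of averages changed as well.

```python
import math


def welch_t_statistic(mean_diff, sd_a, n_a, sd_b, n_b):
    """Welch's t statistic for a given difference of group averages."""
    return mean_diff / math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)


# Whole cortex, FS 5.3 MC (rounded values from Table 1): HC minus T1D.
mean_diff = 0.3313 - 0.3187
sd_hc, n_hc = 0.0235, 27
sd_t1d, n_t1d = 0.0182, 24

t_baseline = welch_t_statistic(mean_diff, sd_hc, n_hc, sd_t1d, n_t1d)

# Same averages, but SDs inflated by 23.2% (HC) and 56.2% (T1D).
t_inflated = welch_t_statistic(mean_diff, sd_hc * 1.232, n_hc, sd_t1d * 1.562, n_t1d)

print(f"t (FS 5.3 MC SDs) = {t_baseline:.2f}")
print(f"t (inflated SDs)  = {t_inflated:.2f}")
```

With roughly 48 degrees of freedom, the two-tailed critical value at alpha 0.05 is close to 2.01, so under these assumptions the inflation of the SDs alone is enough to move the statistic from above to below the significance threshold.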
A further word of caution on this issue concerns the eICV-based normalisation used to standardise for the general differences in head sizes. FS derives this value from the relationship between the intracranial volume and the linear transform to MNI305 space using the T1w scan (Buckner et al. 2004). While associated with several problems (Nordenskjöld et al. 2013; Sargolzaei et al. 2015), it remains a standard in the field, possibly to be replaced by the newly introduced SAMSEG in FS 7.1. Interestingly, the linear transform-derived eICV reported in the main FS 7.1 pipeline output differs substantially from the SAMSEG output (see Figure 2), even though both methods are available as part of a single FS 7.1 package. In the current study, eICV presented a problem in FS 7.1 outputs for two people, where the underestimation of the value led to a substantial increase of all eICV-normalised volumes, which was manually corrected, but could be overlooked in less stringent quality control protocols. Given the relatively high SD of eICV in FS 7.1 output for T1D patients, it is probable that there were further under- or overestimations that would seem plausible (see Table 2), and hence not be questioned, when operating within a simple pipeline utilising one FS version. Furthermore, the inter-version differences in eICV seem to be a major culprit in the inconsistency reported in this study, where more advanced segmentation-based ICV calculation methods may be of advantage to further decrease the software-related bias.
Lastly, there were relatively high differences in the volumes of ACC and PCC (see Table 2) and, more importantly, their low SIs (see Table 3) for FS 5.3 HCP, FS 6.0 and FS 7.1, pointing to parcellation differences, which are very difficult to account for in standard quality control procedures. Interestingly, similar problems were encountered even due to “mere” operating system changes (Glatard et al. 2015), where SI values as low as 0.59 were reported in FSL-derived subcortical classifications.
It is difficult to speculate on all the origins of these inter-version differences. The “PreFreeSurfer” step of the HCP pipeline, introducing non-negligible warps and further data manipulation in the FS 5.3 HCP recon-all process, is definitely one of the possible culprits. However, the full exploration of all the potential contributors to the inter-version variance is beyond the scope of the current study, requiring extensive analysis of the source code and validation not only in healthy brains, but also across various pathological states. Our data seem to indicate that even relatively minor cerebral effects in a primarily non-neurodegenerative condition such as T1D induced a substantial increase in the variability of the output when compared to HC, as reflected in higher SDs in the T1D group despite the virtually stable average volumes. Nonetheless, be it atrophy, blurring of the boundary between grey and white matter, subject motion or other disease-related factors, it seems that the very presence of pathology affects different FS versions differently, even though they provide rather consistent results in healthy brains. Indisputably, this issue is of high importance and relevance for the field, which would definitely benefit from further studies and simulations specifically designed to identify neuropathological factors introducing eventual systemic bias.
However, there are several limitations to be addressed in the presented study. Firstly, the analysis of a subset of HC and T1D patients, even though defined by chance with little risk of true selection bias, failed to fully replicate the results of either of our previous T1D studies (Bednarik et al. 2017; Filip et al. 2020), since it detected both cortical (in keeping with the first study, but contrary to the second study) and subcortical changes (in keeping with the second, but contrary to the first study). This rather inconvenient issue likely reflects the inherent biological variability of the condition we currently diagnose as T1D, which encompasses a disease spectrum of various pathophysiological origins. Unfortunately, this ambiguity holds for other T1D studies, as reflected in the heterogeneity of the reported neuroanatomical outcomes (Hughes et al. 2013; Moulton et al. 2015; Musen et al. 2006; Wessels et al. 2006), and for almost all general diagnoses. Secondly, the utilised sample size may be of concern. While the use of larger cohorts might lessen the FS version effects detected in this study, it is very common to see neuroanatomical studies basing the outcomes and evaluation of relevant hypotheses on similar cohort sizes. It is exactly in this highly relevant scenario that we recommend caution in the interpretation of results. Lastly, the inconsistency of FS version outcomes may also be examined in light of the deliberate lack of multiple comparison correction in this study, as sufficiently stringent multiple comparison correction would lead to the survival of only thalamic results in FS 5.3 MC, FS 5.3 raw and possibly FS 5.3 HCP and FS 6.0. However, the aim of this study was not to find a definitive answer on the presence of T1D-related cerebral atrophy, but to explore the differences in outcomes when using different FS versions.
Just this seemingly innocuous version change would mean the difference between reporting cortical effects in T1D, when choosing the multiple comparison correction approach as in our first study (Bednarik et al. 2017), and reporting no disease-related alterations at all.
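The effect of multiple comparison correction on the set of surviving regions can be sketched as follows. This is a minimal illustration using the Holm-Bonferroni step-down procedure, chosen here as an assumption and not necessarily the correction approach of either of our previous studies. The p-values and the label `other_roi` are hypothetical and serve only to show the mechanics.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return {region: True/False} for survival under Holm-Bonferroni correction."""
    # Sort regions by ascending p-value
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    survives = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # Step-down threshold: alpha / (number of hypotheses not yet rejected)
        if p > alpha / (m - rank):
            break  # this and all larger p-values are not rejected
        survives[name] = True
    return survives


# Hypothetical uncorrected p-values for four regions of interest
hypothetical_p = {"thalamus": 0.002, "ACC": 0.03, "PCC": 0.04, "other_roi": 0.20}

uncorrected = {name: p < 0.05 for name, p in hypothetical_p.items()}
corrected = holm_bonferroni(hypothetical_p)

print("uncorrected:", [name for name, ok in uncorrected.items() if ok])
print("corrected:  ", [name for name, ok in corrected.items() if ok])
```

In this hypothetical setting, three regions pass the uncorrected threshold, but only the region with the smallest p-value survives the correction, which mirrors how the choice of correction can interact with small version-dependent shifts in p-values.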
Conclusions
The inter-version differences detected in the presented study support the recommendations of FS developers to avoid mixing FS package versions. Moreover, they raise an uneasy question about the replicability of intricate, complex processing results, not only for FS, but likely also for other packages or approaches to the same research question (Rajagopalan and Pioro 2015). Until validation of new software suites in general research scenarios such as inter-group comparisons is included in the development process, it might be optimal to supplement the meticulous and extensive quality control of automated pipelines with corroborating analyses of complementary variables, e.g. combining volumetry with voxel-based morphometry, or with confirmation of the main results using similar software based on analogous, but not fully equivalent, principles.
Acknowledgments and funding
The authors wish to express their sincere thanks to the volunteers who participated in the study and the MRI support available at CMRR (namely Erik Solheid and Wendy Elvendahl). Furthermore, the authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota (http://www.msi.umn.edu) for the computing resources required for extensive repeated analyses necessary for this study. Research reported in this publication was supported by the National Institutes of Health (Award Numbers P41 EB015894, P41 EB027061, P30 NS076408, R01DK099137 and R56DK099137) and by the National Center for Advancing Translational Sciences of the National Institutes of Health, Award Numbers KL2TR000113 and UL1TR000114.
The authors are solely responsible for the content, which does not necessarily represent the official views of the funding agencies.
Abbreviations
- MRI: magnetic resonance imaging
- T1w: T1-weighted
- T2w: T2-weighted
- FS: FreeSurfer
- ACC: anterior cingulate cortex
- PCC: posterior cingulate cortex
- eICV: estimated intracranial volume
- T1D: type 1 diabetes
- HC: healthy controls
- SD: standard deviation
Footnotes
Conflicts of interest: There are no potential conflicts of interest or financial relationships regarding this paper that could bias this work.
Ethics approval: The study protocol was approved by the institutional review board of the University of Minnesota.
Consent to participate: Each subject provided his/her written informed consent in keeping with the Declaration of Helsinki.
Data availability:
Raw or processed data of the presented study are not publicly available because of the sensitive nature of human data acquired in patients. However, they are available upon reasonable request addressed to the corresponding author.
References
- Bednarik P, Moheet AA, Grohn H, Kumar AF, Eberly LE, Seaquist ER, & Mangia S (2017). Type 1 Diabetes and Impaired Awareness of Hypoglycemia Are Associated with Reduced Brain Gray Matter Volumes. Frontiers in Neuroscience, 11, 529. 10.3389/fnins.2017.00529
- Bigler ED, Skiles M, Wade BSC, Abildskov TJ, Tustison NJ, Scheibel RS, et al. (2020). FreeSurfer 5.3 versus 6.0: are volumes comparable? A Chronic Effects of Neurotrauma Consortium study. Brain Imaging and Behavior, 14(5), 1318–1327. 10.1007/s11682-018-9994-x
- Buckner RL, Head D, Parker J, Fotenos AF, Marcus D, Morris JC, & Snyder AZ (2004). A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. Neuroimage, 23(2), 724–738.
- Cardinale F, Chinnici G, Bramerio M, Mai R, Sartori I, Cossu M, et al. (2014). Validation of FreeSurfer-estimated brain cortical thickness: comparison with histologic measurements. Neuroinformatics, 12(4), 535–542.
- Chepkoech J-L, Walhovd KB, Grydeland H, Fjell AM, & Initiative ADN (2016). Effects of change in FreeSurfer version on classification accuracy of patients with Alzheimer’s disease and mild cognitive impairment. Human brain mapping, 37(5), 1831–1841.
- Dice LR (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302.
- Filip P, Canna A, Moheet A, Bednarik P, Grohn H, Li X, et al. (2020). Structural Alterations in Deep Brain Structures in Type 1 Diabetes. Diabetes. 10.2337/db19-1100
- Fischl B (2012). FreeSurfer. NeuroImage, 62(2), 774–781. 10.1016/j.neuroimage.2012.01.021
- Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, et al. (2013). The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80, 105–124.
- Glatard T, Lewis LB, Ferreira da Silva R, Adalat R, Beck N, Lepage C, et al. (2015). Reproducibility of neuroimaging analyses across operating systems. Frontiers in Neuroinformatics, 9. 10.3389/fninf.2015.00012
- Gronenschild EHBM, Habets P, Jacobs HIL, Mengelers R, Rozendaal N, van Os J, & Marcelis M (2012). The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PloS One, 7(6), e38234. 10.1371/journal.pone.0038234
- Hammers A, Heckemann R, Koepp MJ, Duncan JS, Hajnal JV, Rueckert D, & Aljabar P (2007). Automatic detection and quantification of hippocampal atrophy on MRI in temporal lobe epilepsy: a proof-of-principle study. Neuroimage, 36(1), 38–47.
- Han X, Jovicich J, Salat D, van der Kouwe A, Quinn B, Czanner S, et al. (2006). Reliability of MRI-derived measurements of human cerebral cortical thickness: the effects of field strength, scanner upgrade and manufacturer. Neuroimage, 32(1), 180–194.
- Hlavatá P, Linhartová P, Šumec R, Filip P, Světlák M, Baláž M, et al. (2020). Behavioral and Neuroanatomical Account of Impulsivity in Parkinson’s Disease. Frontiers in Neurology, 10. 10.3389/fneur.2019.01338
- Hughes TM, Ryan CM, Aizenstein HJ, Nunley K, Gianaros PJ, Miller R, et al. (2013). Frontal gray matter atrophy in middle aged adults with type 1 diabetes is independent of cardiovascular risk factors and diabetes complications. Journal of Diabetes and Its Complications, 27(6), 558–564. 10.1016/j.jdiacomp.2013.07.001
- Jovicich J, Czanner S, Han X, Salat D, van der Kouwe A, Quinn B, et al. (2009). MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. Neuroimage, 46(1), 177–192.
- Messina D, Cerasa A, Condino F, Arabia G, Novellino F, Nicoletti G, et al. (2011). Patterns of brain atrophy in Parkinson’s disease, progressive supranuclear palsy and multiple system atrophy. Parkinsonism & Related Disorders, 17(3), 172–176. 10.1016/j.parkreldis.2010.12.010
- Morey RA, Petty CM, Xu Y, Hayes JP, Wagner II HR, Lewis DV, et al. (2009). A comparison of automated segmentation and manual tracing for quantifying hippocampal and amygdala volumes. Neuroimage, 45(3), 855–866.
- Morey RA, Selgrade ES, Wagner HR, Huettel SA, Wang L, & McCarthy G (2010). Scan–rescan reliability of subcortical brain volumes derived from automated segmentation. Human brain mapping, 31(11), 1751–1762.
- Moulton CD, Costafreda SG, Horton P, Ismail K, & Fu CHY (2015). Meta-analyses of structural regional cerebral effects in type 1 and type 2 diabetes. Brain Imaging and Behavior, 9(4), 651–662. 10.1007/s11682-014-9348-2
- Musen G, Lyoo IK, Sparks CR, Weinger K, Hwang J, Ryan CM, et al. (2006). Effects of type 1 diabetes on gray matter density as measured by voxel-based morphometry. Diabetes, 55(2), 326–333.
- Nordenskjöld R, Malmberg F, Larsson E-M, Simmons A, Brooks SJ, Lind L, et al. (2013). Intracranial volume estimated with commonly used methods could introduce bias in studies including brain volume measurements. NeuroImage, 83, 355–360. 10.1016/j.neuroimage.2013.06.068
- Rajagopalan V, & Pioro EP (2015). Disparate voxel based morphometry (VBM) results between SPM and FSL softwares in ALS patients with frontotemporal dementia: which VBM results to consider? BMC Neurology, 15(1), 32. 10.1186/s12883-015-0274-8
- Rosas HD, Liu AK, Hersch S, Glessner M, Ferrante RJ, Salat DH, et al. (2002). Regional and progressive thinning of the cortical ribbon in Huntington’s disease. Neurology, 58(5), 695–701.
- Sargolzaei S, Sargolzaei A, Cabrerizo M, Chen G, Goryawala M, Noei S, et al. (2015). A practical guideline for intracranial volume estimation in patients with Alzheimer’s disease. BMC Bioinformatics, 16(Suppl 7), S8. 10.1186/1471-2105-16-S7-S8
- Tae WS, Kim SS, Lee KU, Nam E-C, & Kim KW (2008). Validation of hippocampal volumes measured using a manual method and two automated methods (FreeSurfer and IBASPM) in chronic major depressive disorder. Neuroradiology, 50(7), 569.
- Vasconcellos LF, Pereira JS, Adachi M, Greca D, Cruz M, Malak AL, & Charchat-Fichman H (2018). Volumetric brain analysis as a predictor of a worse cognitive outcome in Parkinson’s disease. Journal of Psychiatric Research, 102, 254–260. 10.1016/j.jpsychires.2018.04.016
- Wessels AM, Simsek S, Remijnse PL, Veltman DJ, Biessels GJ, Barkhof F, et al. (2006). Voxel-based morphometry demonstrates reduced grey matter density on brain MRI in patients with diabetic retinopathy. Diabetologia, 49(10), 2474–2480. 10.1007/s00125-006-0283-7
