Abstract
When analyzing large multi-center databases, the effects of multiple confounding covariates increase the variability in the data and may reduce the ability to detect changes due to the actual effect of interest, e.g. changes due to disease. Efficient ways to evaluate the effect of covariates on data harmonization are therefore important. In this paper, we showcase techniques to assess the “goodness of harmonization” of covariates. We analyzed 7656 MR images in the multi-site, multi-scanner ADNI database. We present a comparison of three methods for estimating total intracranial volume to assess their accuracy and longitudinal consistency, and correct the brain structure volumes using the residual method and the proportional (normalization by division) method. We then evaluated the distribution of brain structure volumes over the entire ADNI database before and after accounting for multiple covariates such as total intracranial volume, scanner field strength, sex and age using two techniques: 1) Zscapes, a panoramic visualization technique to analyze the entire database, and 2) empirical cumulative distribution functions. The results from this study highlight the importance of assessing the goodness of data harmonization as a necessary preprocessing step when pooling large datasets with multiple covariates, prior to further statistical data analysis.
Keywords: data harmonization, total intracranial volume, field strength, magnetic resonance imaging, LDDMM, multi-atlas fusion
2. Introduction
Data harmonization is an important step for data mining and statistical analysis in many fields of research, especially in the era of big data [1]. The “goodness of harmonization” is important to ensure the optimal power of statistical analysis, since additional covariates introduce undesirable variations that may swamp the true effect of interest. In the field of neuroimaging, brain imaging databases such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) now include thousands of brain images [59] from multiple sites. In such databases, confounding covariates can enter at multiple steps due to differences in protocols for data acquisition [40], processing [100] and analysis [24, 22, 19, 102].
Significant efforts are being directed to harmonize the data acquisition and processing protocols to minimize site-related variations. The EADC-ADNI harmonization protocol initiated by Frisoni et al. [24] aims to generate consensus for manual hippocampus segmentation among research groups around the world, to reduce the systematic bias of the data due to intra-rater variability. The ENIGMA consortium [91] has worked to harmonize diffusion tensor imaging (DTI) measures for genetic-association studies [35, 43]. Potvin et al. have constructed normative data of structure volumes and cortical thicknesses from a large number of healthy control subjects across different studies by taking into account the effect of age, sex, total intracranial volume (TIV), scanner manufacturer, and magnetic field strength [71, 70]. Mirzaalian et al. have proposed a multi-model registration-based framework to harmonize the raw diffusion MRI signal in a model-independent manner, and reduced the analysis bias on data acquired from multiple sites [57, 58]. Fortin et al. have addressed the importance of controlling the non-biological variance (the scanner-specific effects), effectively harmonizing the signal intensity of T1W images [22], the fractional anisotropy (FA) and mean diffusivity (MD) for DTI [21], as well as the automatically estimated cortical thickness [19], improving the statistical and classification power for data analysis. Using the same harmonization method (ComBat), Yu et al. have successfully removed the site effects from multi-site resting-state fMRI data [102]. Data harmonization also helps to improve the performance of machine learning algorithms, as removing unwanted covariates from the data not only helps to improve the training accuracy, but also helps to generalize the model and prevent overfitting due to learning of signatures from unrelated covariates. Rozycki et al. [78] have shown that data pooled from multiple sites with inter-site image harmonization improve both group-level statistical analysis and multivariate classification power compared to single-site analysis.
The harmonization of the data can be affected by various sources of covariates. For instance, MRI-derived structural volumetric measures such as hippocampal atrophy [51] and ventricle expansion [96, 62, 67] are important quantitative imaging biomarkers of disease progression, and these are influenced by head size (measured via TIV) [5, 36, 28, 94]. The measurement of head size itself, and of brain structural volumes concurrently, is influenced by scanner field strength (1.5T vs 3T) [40, 13, 51, 11]. Sex is also an important source of demographic-related variation in TIV and volumes of brain structures [27, 69, 76]. Another source of individual-level variation is normal aging-related change [83, 89, 90], which is introduced when analyzing databases that include subjects over a range of ages. Signatures of subtle structural change due to disease in these neuroimaging measures may be masked by the gross variations due to head size, sex or age across subjects [69, 27, 92, 3, 5, 64, 33, 72], or because of the selection of image processing pipelines [63]. These sources of variation have become much more prominent as multi-site databases are beginning to be pooled. Changes in data distribution and variability measures before and after adjusting for such covariates are therefore important indicators of how well multiple sources of data are harmonized. For example, when analyzing the brain structure volume, the difference between subgroups of each covariate (e.g. male and female, or 1.5T and 3T MRI scanners) should be minimized after data harmonization.
In this paper, we propose two qualitative methods and one quantitative method to assess such “goodness of harmonization”. One of the qualitative (visual) methods is a heatmap of normalized regional structural volumes. This panoramic visualization of the entire database, which we term Zscapes, offers a visual assessment of the harmonization procedure. Harmonized measures can be visually assessed after removal of one or more covariates by viewing the data in its entirety, and any remaining systematic biases can be seen in patterns of color changes across the database. Another visual technique we propose is the use of empirical cumulative distribution functions (ECDFs), where the covariate-induced variability introduces overlaps between the distributions of each measure. After harmonization, the ECDFs converge to a common distribution if the effect of interest (such as disease) is the primary source of remaining variability. We also propose the use of the Kolmogorov-Smirnov statistic as a quantitative measure of the overlap of each of the ECDFs before and after accounting for covariates. Using these tools, we investigated the effect of TIV, field strength, sex and age on brain structure volumetric analysis. To the best of our knowledge, this is the first study to investigate multivariate multi-feature effects on demographic-related data harmonization over a large number of images (7656) taken from the ADNI database.
3. Methods
In this study, we analyzed the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. ADNI is a large cross-sectional and longitudinal neuroimaging database with three clinically diagnosed groups at the time of the assessments: the cognitively normal group (CN), the mild cognitive impairment group (MCI) and the Alzheimer’s disease group (AD). The CN group is treated as the reference group for all analyses. In the sections below, we detail each of the steps of analysis. Briefly, we segmented the grey matter structures using FreeSurfer and extracted the raw volumes for each FreeSurfer-labeled structure. We present three TIV measurement methods on these ADNI images and show their longitudinal variabilities. We also compare the TIV estimation methods on paired 1.5T/3T scans. We performed a comparison of two methods for head size normalization using TIV, namely the proportional and the residual-based method. We evaluated the effect of covariates such as field strength, TIV, sex, and age on volumetric analysis. We then propose two qualitative (visual) methods and one quantitative method to assess the “goodness of harmonization” of data before and after accounting for the covariates.
3.1. Experimental data
3.1.1. The ADNI Database
T1-weighted structural MRI data along with corresponding demographic and scanner-specific information were obtained from the publicly available ADNI database (adni.loni.usc.edu) [97, 34, 60, 59]. The image acquisition protocol for the dataset is described in a previous study by Chow et al. [12], and detailed scanner-specific parameters are described in Jack et al. (2008) [34]. The MRI database we analyzed consists of a total of 7656 scans collected from 1727 subjects, acquired longitudinally for up to 13 timepoints (from baseline up to 120 months), for which covariate information on field strength, sex, age, and clinical diagnosis was available. The ADNI dataset includes a mixture of 1.5T and 3T images, with a mean subject age of approximately 75 years (range 55 to 95).
3.1.2. Database with pairs of 1.5T/3T scans for each subject
To study the effect of field-strength on the TIV estimation (section 3.5), we also analyzed MR images from 187 subjects (91 male and 96 female) with both 1.5T and 3T MRI scans (755 images for each field strength) taken back to back at multiple timepoints (up to 36 months). This set of 1510 longitudinally scanned images was collected by the ADNI MRI core specifically for methods comparison [100], and the corresponding 1.5T scans have been included in the main ADNI dataset described above. A subset of this dataset has been used to show improved statistical power for 3T over 1.5T for measuring hippocampal volume [12].
3.2. FreeSurfer structure segmentation and volume extraction
We used the volume-based stream of the FreeSurfer processing pipeline version 5.3.0 [15] to segment 87 anatomical structures (left/right separated) of the cortical [17] and subcortical [18] grey matter and extracted their volumes. The FreeSurfer processing pipeline includes the following steps: 1) affine registration to the MNI305 space, 2) B1 intensity inhomogeneity correction, 3) non-rigid registration to the MNI305 space, and 4) atlas-based structure labeling based on the maximum likelihood of the probability atlas. The FreeSurfer volume-based pipeline is described in the papers by Fischl et al. [18, 17]. All images were pre-processed with non-parametric non-uniform intensity normalization (N3) [87] prior to the structure segmentation and TIV estimation.
3.3. Evaluation of the automatic TIV estimation methods
Brain structure volumes are known to depend on the individual’s head size [5, 36, 94, 28]. The TIV is a measurement of head size, and is a crucial covariate to adjust for when performing volumetric analysis [28, 99]. Accurate estimation of the TIV is therefore important to minimize bias during data analysis [80]. Ideally, TIV is measured by segmenting the cranial vault directly and measuring its volume, but indirect ways of estimating the TIV without segmenting the cranial vault have also been proposed. We compared three different automatic TIV estimation methods. Among them, FreeSurfer and SPM are two widely used brain image processing and analysis packages that provide fully automated procedures to estimate the TIV indirectly through affine scaling and tissue segmentation [81, 41, 28, 68, 63, 52, 93, 31]. In addition, multi-atlas label fusion (MALF)-based TIV estimation has been proposed, which segments the cranial vault directly and has demonstrated higher correlation and similarity measurements when compared with manual segmentation as ground truth [84, 53, 32].
For large databases like ADNI, it is very difficult to undertake manual segmentation for TIV to perform the standard analysis based on Dice overlap accuracy. Therefore, we adopt two alternative evaluation criteria to study the robustness of automated estimation: the longitudinal consistency and test-retest reliability.
Firstly, the adult bony cranial vault is not expected to change over time [98], and previous studies on elderly subjects (age > 52) demonstrated no association between the measured TIV and aging for both healthy subjects and AD patients [16, 36]. The ADNI dataset includes elderly adult subjects (aged between 55 and 95 years), and hence their TIV is not expected to change during the ADNI study. Therefore, we chose longitudinal consistency, defined as the change of estimated TIV over time, as a metric to evaluate the robustness of the automated estimation methods. TIV was estimated with the FreeSurfer, SPM, and MALF methods from the available within-subject longitudinal images. Longitudinal consistency of TIV is thus used as an outcome metric to identify the TIV estimation method delivering the most consistent measures over time.
Secondly, we evaluate the robustness of the three TIV estimation methods by analyzing the test-retest reliability using a subset of the OASIS-1 study with consecutive scans dedicated to evaluating the robustness of image processing methods. The pairwise percentage volume difference between the test and retest data was used as an outcome metric for assessing the robustness of the TIV estimation methods [8, 26, 61].
3.3.1. Three TIV estimation methods
FreeSurfer:
TIV is estimated using a scaling factor derived from an affine transformation between the template and the target, and applying that scaling factor to the TIV of the template.
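As a rough illustration of the affine-scaling idea, the sketch below takes the volume scaling factor to be the absolute determinant of the linear part of a 4x4 affine matrix and applies it to a template TIV. The direction of the transform (template to subject), the template TIV value, and the function names are assumptions made for illustration only; FreeSurfer's actual implementation may define the factor differently.

```python
def affine_volume_scale(affine):
    """Absolute determinant of the upper-left 3x3 block of a 4x4 affine matrix."""
    a, b, c = affine[0][:3]
    d, e, f = affine[1][:3]
    g, h, i = affine[2][:3]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return abs(det)


def estimate_tiv(template_tiv_mm3, template_to_subject_affine):
    """Scale the template TIV by the volume change of the affine transform.

    Assumes the affine maps template space to subject space.
    """
    return template_tiv_mm3 * affine_volume_scale(template_to_subject_affine)


# Hypothetical affine: anisotropic scaling plus a translation.
example_affine = [
    [0.95, 0.0, 0.0, 2.0],
    [0.0, 1.00, 0.0, -3.0],
    [0.0, 0.0, 0.90, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical template TIV of 1,500,000 mm^3.
example_tiv = estimate_tiv(1_500_000.0, example_affine)
```

The translation column does not affect the result, since only the linear part of the transform changes volume.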
SPM:
The most recent version of SPM [52] utilizes a generative model to integrate partial-volume tissue classification with image registration and intensity non-uniformity correction [88]. Each brain image is segmented into white matter (WM), grey matter (GM), cerebrospinal fluid (CSF) and three additional tissue types (bone, soft tissue, and air/background) for more accurate characterization of tissue composition in the image. We used the “Tissue Volumes” utility introduced in SPM12, which constrains the tissue segmentation within a manually corrected TIV mask and then sums the WM, GM, and CSF volumes to obtain the estimated TIV [52].
MALF:
We used the OASIS-BC2 atlas by Huo et al. [32], containing 27 T1 MR images taken from the OASIS dataset [55, 54], as the templates in image-registration-based label propagation and fusion. The corresponding manual TIV labels were created based on the CT images of the same subjects to ensure very accurate segmentation following the BrainCOLOR protocol [42], and are part of the brain structure atlas of the MICCAI12 Multi-Atlas Grand Challenge [46]. The image intensity for each template-test image pair is normalized, and the template MRI is registered to the target MRI using first an affine and then a non-rigid large deformation diffeomorphic metric mapping (LDDMM) algorithm [6]. Each manually segmented template TIV label was then propagated from the template atlas to the target image with the derived deformation map, and the propagated labels were finally fused to generate the TIV mask with weighted majority voting. All TIV segmentations were visually inspected by two experienced raters for quality control.
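The fusion step can be sketched as follows: a minimal weighted majority vote over already-propagated binary masks. How the weights are derived is not specified here, so the weights below are hypothetical, and the masks are flattened to short lists purely for illustration.

```python
def weighted_majority_vote(masks, weights):
    """Fuse binary masks (flat lists of 0/1) into a single mask.

    A voxel is labeled 1 when the summed weight of the templates voting 1
    exceeds half of the total weight.
    """
    if not masks or len(masks) != len(weights):
        raise ValueError("need one weight per mask")
    total_weight = sum(weights)
    fused = []
    for voxel in range(len(masks[0])):
        score = sum(w * mask[voxel] for mask, w in zip(masks, weights))
        fused.append(1 if score > total_weight / 2.0 else 0)
    return fused


# Three hypothetical propagated template masks over four voxels.
propagated_masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
template_weights = [0.5, 0.2, 0.3]  # hypothetical, e.g. from local image similarity
fused_mask = weighted_majority_vote(propagated_masks, template_weights)
tiv_voxels = sum(fused_mask)  # multiply by the voxel volume to obtain a TIV in mm^3
```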
3.3.2. Robustness analysis
We evaluate the robustness of automated estimation through two criteria: longitudinal consistency and test-retest reliability. To evaluate the longitudinal consistency, we use a linear mixed-effect (LME) model with random intercepts (Equation 1) [7, 101] to measure the correlation between TIV and age. For this experiment, TIV is the dependent variable and age is the independent variable (predictor) with a fixed effect, while field strength (1.5T and 3T) and sex (male and female) are modeled as grouping variables with random effects, each with two levels. We use the restricted maximum likelihood (REML) approach to fit the model:

$$\mathrm{TIV}_i = \beta_0 + \beta_1 X_i + \sum_{r=1}^{2} b^{(r)}_{m(r,i)} + \varepsilon_i \qquad (1)$$

where $\beta_0$ is the fixed intercept, $\beta_1$ is the fixed-effect coefficient for the age variable $X_i$ of the $i$th subject, $b^{(r)}_{m(r,i)}$ is the random effect for the $r$th grouping variable ($b^{(1)}$: field strength, $b^{(2)}$: sex) at level $m(r,i)$ with $m \in \{0, 1\}$, and $\varepsilon_i$ is the residual.
To evaluate the test-retest reliability, we analyzed a dataset from the OASIS-1 database [55] dedicated for testing the reproducibility of image processing methods. This dataset includes 20 healthy subjects between 20–34 years of age who underwent two consecutive MRI scans using the same 1.5T scanner. Detailed scanning protocol and subject demographics of the OASIS-1 reliability dataset are described in Marcus et al. [55]. Bland-Altman analysis is used to study the pairwise percentage volume difference (PVD, Equation 2) between the estimated TIV from the test data and the retest data for assessing the robustness of the three methods [8, 26, 61].
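$$\mathrm{PVD} = \frac{\mathrm{TIV}_{test} - \mathrm{TIV}_{retest}}{(\mathrm{TIV}_{test} + \mathrm{TIV}_{retest})/2} \times 100\% \qquad (2)$$

where $\mathrm{TIV}_{test}$ is the TIV estimated from the test data, and $\mathrm{TIV}_{retest}$ is the TIV estimated from the retest data.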
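A minimal sketch of this analysis is given below. It assumes the PVD is the test-retest difference expressed as a percentage of the pair mean (the usual Bland-Altman convention), and that the reported 95% interval corresponds to the conventional mean ± 1.96 SD limits; both are assumptions, and the function names and TIV values are hypothetical.

```python
import statistics


def percentage_volume_difference(tiv_test, tiv_retest):
    """Difference between two TIV estimates as a percentage of their mean."""
    pair_mean = (tiv_test + tiv_retest) / 2.0
    return 100.0 * (tiv_test - tiv_retest) / pair_mean


def bland_altman_limits(test_values, retest_values):
    """Return (bias, lower limit, upper limit) of the pairwise PVD."""
    pvd = [
        percentage_volume_difference(t, r)
        for t, r in zip(test_values, retest_values)
    ]
    bias = statistics.mean(pvd)
    spread = 1.96 * statistics.stdev(pvd)
    return bias, bias - spread, bias + spread


# Hypothetical test and retest TIV estimates (cm^3) for three subjects.
tiv_test = [1010.0, 1500.0, 1395.0]
tiv_retest = [990.0, 1500.0, 1405.0]
bias, lower, upper = bland_altman_limits(tiv_test, tiv_retest)
```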
3.4. Evaluation of the TIV normalization methods
There are two methods commonly used for TIV normalization [79]: 1) the proportion method [37], and 2) the residual method [64, 79]. The proportion method simply divides the structure volume by the TIV, while the residual method (Equation 3) models the structural volume as a linear combination of the TIV and a residual term, fits the linear regression on the reference (CN) group measures, and takes the residual $\varepsilon_i$ (the difference between the actual measure and that predicted by the reference-group fitted linear model) as the normalized measure for further analysis.

$$V_i = \beta_0 + \beta_1\,\mathrm{TIV}_i + \varepsilon_i \qquad (3)$$

where $V_i$ is the structure volume of subject $i$, and $\beta_0$ and $\beta_1$ are estimated from the CN group. Specifically, it has been recommended to use the standardized residual, also known as the W-score (defined as $W_i = (\varepsilon_i - \mu_{\varepsilon_{CN}})/\sigma_{\varepsilon_{CN}}$), rather than the raw residual when assessing structural changes such as atrophy [14, 66, 45]. The W-score is the Z-score of the residuals, where $\mu_{\varepsilon_{CN}}$ is the mean of the residuals within the reference (CN) group and $\sigma_{\varepsilon_{CN}}$ is their standard deviation.
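The two normalization approaches can be sketched as below, assuming a simple linear model of volume against TIV fitted on the CN group only. The function names and the toy values are hypothetical and only illustrate the arithmetic.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def proportional(volume, tiv):
    """Proportion method: structure volume divided by TIV."""
    return [v / t for v, t in zip(volume, tiv)]


def w_scores(tiv, volume, is_cn):
    """Residual method: residuals from the CN-fitted line, standardized by CN."""
    cn_tiv = [t for t, cn in zip(tiv, is_cn) if cn]
    cn_volume = [v for v, cn in zip(volume, is_cn) if cn]
    intercept, slope = fit_line(cn_tiv, cn_volume)
    residual = [v - (intercept + slope * t) for t, v in zip(tiv, volume)]
    cn_residual = [r for r, cn in zip(residual, is_cn) if cn]
    mu = statistics.mean(cn_residual)
    sigma = statistics.stdev(cn_residual)
    return [(r - mu) / sigma for r in residual]


# Hypothetical TIV (cm^3) and hippocampal volume (cm^3); the last subject is not CN.
tiv = [1400.0, 1500.0, 1600.0, 1700.0, 1500.0]
volume = [3.5, 3.8, 3.9, 4.3, 3.0]
is_cn = [True, True, True, True, False]
scores = w_scores(tiv, volume, is_cn)
ratios = proportional(volume, tiv)
```

In this toy example the non-CN subject has a strongly negative W-score, reflecting a volume well below what the CN-fitted line predicts for that head size.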
3.5. TIV variation due to scanner field strength difference
To study the influence of scanner field strength (1.5T vs 3T) on TIV, we performed an additional analysis using a second ADNI cohort of subjects with both 1.5T and 3T MRI scans taken back to back at multiple timepoints (up to 36 months), as described in Section 3.1.2. We measured the correlation between the 1.5T and 3T TIV for each processing method, and calculated the coefficient of determination R2. In addition, we used Bland-Altman analysis to study the pairwise PVD, as in the test-retest analysis in Section 3.3.2.
Here, TIVtest is the 1.5T TIV, and TIVretest is the 3T TIV. We also calculated the empirical cumulative distribution function (ECDF) for each field strength for male and female subjects separately.
3.6. GLM-based combined accounting of covariates
We evaluate the data distribution and variability of the structural volume before and after harmonization (adjusting for covariates such as field strength, TIV, sex, and age). We used the general linear model (GLM), where the structure volume is the dependent variable and the covariates are the independent (predictor) variables (Equation 4).

$$V_i = \beta_0 + \sum_{r=1}^{R} \beta_r X_{r,i} + \varepsilon_i \qquad (4)$$

where $V_i$ is the structure volume of subject $i$, $X_{r,i}$ is the $r$th covariate (such as field strength, TIV, sex or age) of subject $i$, and $R$ is the total number of independent variables. We can analyze the variability in the data that is explained by these covariates individually and together. In this paper, we selected and present some covariate combinations to illustrate the difference in data harmonization in different scenarios: a) the scanner-specific covariate (field strength) only; b) the individual-specific covariate (TIV) only; c) the combination of field strength and TIV; and d) the combination of field strength, TIV and the demographic covariates (sex and age).
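To make the covariate adjustment concrete, the sketch below fits a multi-covariate linear model by solving the normal equations in plain Python and returns the residuals as the adjusted measure. The solver, function names, coefficients, and synthetic data are illustrative assumptions; in practice a statistics package would typically be used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_glm(covariates, volumes):
    """Least-squares coefficients [intercept, beta_1, ..., beta_R]."""
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * v for d, v in zip(design, volumes)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def glm_residuals(covariates, volumes, beta):
    """Volume minus the part explained by the covariates."""
    return [
        v - (beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
        for row, v in zip(covariates, volumes)
    ]


# Synthetic rows: (field strength 0/1, TIV in liters, sex 0/1, age in years).
covariates = [
    (0, 1.40, 0, 70), (1, 1.55, 1, 75), (0, 1.60, 1, 80), (1, 1.45, 0, 65),
    (0, 1.50, 1, 72), (1, 1.70, 1, 78), (1, 1.35, 0, 85), (0, 1.65, 0, 68),
]
true_beta = [1.0, 0.2, 2.0, -0.3, -0.01]  # hypothetical coefficients
volumes = [
    true_beta[0] + sum(b * x for b, x in zip(true_beta[1:], row))
    for row in covariates
]
beta = fit_glm(covariates, volumes)
adjusted = glm_residuals(covariates, volumes, beta)
```

Dropping columns from `covariates` gives the reduced models (for example field strength only, or TIV only) used to compare harmonization scenarios.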
3.7. Evaluation of “goodness of harmonization” of a database
The goal of harmonization of covariates is to remove the unwanted sources of variation (field strength, TIV, sex, and age) within acquired measures (structural volumes) and retain only those sources of variation that are of interest (such as disease). The hypothesis is that variation of structural volume measures within each diagnostic group (CN, MCI, and AD) will be progressively diminished as more unwanted covariates are removed, and minimized when all covariates have been suitably accounted for. As a result, the distance between distributions of measures across the effect of interest (e.g. disease diagnostic groups CN, MCI, and AD) will be progressively enhanced, and maximized when all unwanted covariates have been accounted for and removed. These outcome metrics form the basis of the following methods proposed to demonstrate the “goodness of harmonization”.
3.7.1. Visualization of “goodness of harmonization” using Zscapes
To evaluate the variation of the structure volume feature across the entire sampled population after each covariate regression, we first assess the within-group variation by calculating the Z-score of each measure $X_i$ for subject $i$, given by $Z_i = (X_i - \mu_{CN})/\sigma_{CN}$, where $\mu_{CN}$ is the mean value of the reference (CN) group, and $\sigma_{CN}$ is the standard deviation of the reference (CN) group. In the cases where the residual method is used to regress out covariates, Z-scores effectively become W-scores. The Z-score represents the distance of each measurement from the reference group mean, normalized by the reference group standard deviation. Because the Z-score is expressed as a multiple of the standard deviation, changes in each structure with respect to the reference mean are highlighted and comparable across the range of structural volumes.
We plot the Z-score over the entire ADNI database analyzed such that all structure volumes for a subject are presented in one column, and each FreeSurfer-derived structure volume is presented in one row across all subjects. This resulting panoramic heat-map is denoted as a Zscape and enables the assessment of patterns across all the subjects and all the structures at the same time.
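The matrix underlying such a heat-map can be assembled as in the sketch below, with one row per structure and one column per subject. The function name, structure names, and values are hypothetical, and rendering the matrix as a color map is left to any plotting tool.

```python
import statistics


def zscape_matrix(volumes_by_structure, is_cn):
    """Return {structure: [Z-score per subject]} standardized by the CN group."""
    rows = {}
    for structure, values in volumes_by_structure.items():
        cn_values = [v for v, cn in zip(values, is_cn) if cn]
        mu = statistics.mean(cn_values)
        sigma = statistics.stdev(cn_values)
        rows[structure] = [(v - mu) / sigma for v in values]
    return rows


# Hypothetical volumes (cm^3) for four subjects; the last subject is not CN.
volumes_by_structure = {
    "left_hippocampus": [3.0, 4.0, 5.0, 2.0],
    "left_lateral_ventricle": [10.0, 12.0, 14.0, 20.0],
}
is_cn = [True, True, True, False]
zscape = zscape_matrix(volumes_by_structure, is_cn)
```

Here the non-CN subject sits two CN standard deviations below the reference mean for the hippocampus and three above it for the ventricle, which would appear as opposite colors in the same column.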
3.7.2. Visualization of “goodness of harmonization” using ECDF
To evaluate and compare the data variability before and after harmonization, we plotted the data distribution of each structure’s volume using the empirical cumulative distribution function (ECDF) and tested the goodness-of-fit. Since the MCI group is heterogeneous, that is, it includes subjects who will develop AD (progressive MCI) and subjects who will not develop AD in their lifetime (stable MCI), we excluded the MCI group in this step and only included the CN and AD groups, in order to control the effect of unknown covariates when evaluating the goodness of harmonization. We first calculated the ECDF separately for the CN and AD groups, the sex groups (male and female) and the field strength groups (1.5T and 3T). As the covariates of TIV, sex, and field strength are removed, first individually and then in combination, we expect disease to remain the primary source of variability, and hence the ECDFs for the residual of each measure are expected to coalesce to a common ECDF for the control and AD groups respectively.
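An ECDF is simple to compute directly, as in this minimal sketch; the function name and sample values are hypothetical.

```python
import bisect


def ecdf(sample):
    """Return a function F where F(x) is the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative(x):
        return bisect.bisect_right(ordered, x) / n

    return cumulative


# Hypothetical hippocampal volumes (cm^3) for two subgroups.
ecdf_cn = ecdf([3.6, 3.9, 4.1, 4.4])
ecdf_ad = ecdf([2.8, 3.1, 3.3, 3.7])
# Fraction of each group at or below 3.5 cm^3.
below_cn, below_ad = ecdf_cn(3.5), ecdf_ad(3.5)
```

Evaluating both functions on a common grid of volumes gives the curves that are overlaid to compare subgroups.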
3.7.3. Measurement of “goodness of harmonization” using the Kolmogorov-Smirnov statistic
We propose to use the Kolmogorov-Smirnov (K-S) test, a non-parametric test for the “goodness-of-fit” of the ECDF, which is widely used to evaluate the maximum absolute difference between the CDFs of sample distributions [4]. We used the two-sample K-S test ([56], Equation 5) to measure the separation of the sample distributions and quantitatively compare the ECDFs for sex, age, and diagnosis. The hypothesis is that, for more harmonized data, the distance between the ECDF curves of the subgroups with different values of confounding covariates (e.g. field strength and sex) will be smaller, and the distance among the ECDF curves of different diagnostic groups will be larger.

$$D_{m,n} = \sup_x \left| F_{1,m}(x) - F_{2,n}(x) \right| \qquad (5)$$

where $F_{1,m}(x)$ and $F_{2,n}(x)$ are the ECDFs of the two samples, of sizes $m$ and $n$. In the K-S test, the two distributions are considered significantly different (the null hypothesis is rejected) when the statistic $D_{m,n}$ exceeds the critical value:

$$D_{m,n} > c(\alpha)\sqrt{\frac{m+n}{mn}}, \qquad c(\alpha) = \sqrt{-\tfrac{1}{2}\ln\frac{\alpha}{2}} \qquad (6)$$

in which $\alpha$ is the rejection level, set to 0.05. We refer to the right-hand side of Equation 6 as the level of rejection.
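A minimal sketch of the two-sample K-S comparison follows. It assumes the standard large-sample critical value c(α)·sqrt((m+n)/(mn)) with c(α) = sqrt(−ln(α/2)/2); the function name and sample values are hypothetical.

```python
import bisect
import math


def ks_two_sample(sample1, sample2, alpha=0.05):
    """Return (D, critical value, reject) for the two-sample K-S test."""
    s1, s2 = sorted(sample1), sorted(sample2)
    m, n = len(s1), len(s2)
    # The supremum of |F1 - F2| is attained at one of the observed values.
    d_stat = max(
        abs(bisect.bisect_right(s1, x) / m - bisect.bisect_right(s2, x) / n)
        for x in s1 + s2
    )
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical = c_alpha * math.sqrt((m + n) / (m * n))
    return d_stat, critical, d_stat > critical


# Hypothetical, fully separated subgroups versus identical subgroups.
d_sep, crit_sep, reject_sep = ks_two_sample([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
d_same, crit_same, reject_same = ks_two_sample([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```

For a confounding covariate such as sex, a well-harmonized measure should move the statistic toward the second case; for diagnosis, toward the first. The critical value is an asymptotic approximation and is rough for samples as small as those in this toy example.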
We selected the hippocampus to demonstrate how the different regressions separate the ECDFs of the subgroups of each grouping variable, given that hippocampal atrophy is considered one of the signature hallmarks of AD progression. We included all subjects currently available in ADNI who are diagnosed as either CN or AD, and compared the difference between the two diagnostic groups (CN versus AD), the two sex groups (male versus female), and the two field strength groups (1.5T versus 3T).
4. Results
4.1. Demographic analysis
The results of the demographic analysis are listed in Table 1. Statistical comparisons of the age distribution were performed at each level of the categorization, i.e. among diagnostic groups, between male and female within each diagnostic group, and between 1.5T and 3T within each sex subgroup. The population age in the MCI group is significantly lower than in the other two groups (CN and AD). Significant age differences were detected for all the comparisons between male and female groups, and between the 1.5T and 3T groups. These point to the necessity of adjusting for age when performing groupwise structure volume comparison, as age affects regional brain structure volumes [50].
Table 1:
Demographic analysis of the entire ADNI database. Some subjects were scanned on a 1.5T scanner at early timepoints and on a 3T scanner at later timepoints. CN = cognitively normal, MCI = mild cognitive impairment, AD = clinically diagnosed Alzheimer’s disease. The mean ± SD of the age distribution for each group is shown in brackets (unit: years). Statistical comparisons of the age distribution were performed at each level of the categorization. One-way ANOVA was performed among the CN/MCI/AD groups. Unpaired two-tailed t-tests were performed between the male and female populations for each diagnostic group, as well as between 1.5T and 3T for each sex subgroup within each diagnostic group. Multiple comparisons were corrected with the false discovery rate (FDR) set to 0.05.
Diagnosis (Age, mean ± SD years) | Sex (Age) | Field Strength (Age) | Subjects |
---|---|---|---|
CN (76.15±6.24) | Female* (75.79±6.11) | 1.5T* (78.13±5.09) | 169 |
CN (76.15±6.24) | Female* (75.79±6.11) | 3T (73.30±6.13) | 132 |
CN (76.15±6.24) | Male (76.50±6.34) | 1.5T* (77.49±5.87) | 198 |
CN (76.15±6.24) | Male (76.50±6.34) | 3T (75.12±6.71) | 114 |
MCI* (74.64±7.70) | Female* (73.44±8.04) | 1.5T* (75.18±7.78) | 294 |
MCI* (74.64±7.70) | Female* (73.44±8.04) | 3T (72.12±7.99) | 284 |
MCI* (74.64±7.70) | Male (75.41±7.37) | 1.5T* (77.11±7.04) | 239 |
MCI* (74.64±7.70) | Male (75.41±7.37) | 3T (73.54±7.28) | 163 |
AD (76.06±7.47) | Female* (75.15±7.93) | 1.5T* (75.99±7.55) | 150 |
AD (76.06±7.47) | Female* (75.15±7.93) | 3T (73.29±8.42) | 221 |
AD (76.06±7.47) | Male (76.75±7.04) | 1.5T* (77.07±6.88) | 113 |
AD (76.06±7.47) | Male (76.75±7.04) | 3T (76.10±7.31) | 170 |
The population age in the MCI group is statistically significantly lower than in the CN and AD groups. Significant age differences were found in all subgroup comparisons.
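The multiple-comparison correction can be sketched as follows, assuming the Benjamini-Hochberg step-up procedure. The text only states that the FDR was set to 0.05, so the choice of procedure, the function name, and the p-values below are assumptions for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= fdr * k / m.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from four subgroup age comparisons.
significant = benjamini_hochberg([0.001, 0.04, 0.03, 0.2])
```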
4.2. TIV estimation
Figure 1 shows sample sagittal images of the MALF TIV overlaid on the brain image for male and female subjects, acquired at both 1.5T and 3T. All 7656 images passed the visual quality-control inspection. MALF not only provides an estimate of TIV, but also delineates the boundary of the cranial vault, giving a 3D mask of the cranial vault independent of the brain tissue outline. The surface and shape information of the TIV mask, in addition to the volume measure, could also be useful for additional analyses. By comparison, FreeSurfer only estimates the intracranial volume through an affine-based scaling factor; therefore, no FreeSurfer TIV mask is available. In SPM, a TIV mask is generated in the subject space during the pipeline process (through the template-based non-rigid registered “reverse brain mask” as part of the “new segmentation” method in SPM12). However, in SPM12 this TIV mask is not used to calculate the final measurement of TIV directly, but rather to constrain the final TIV calculation, which sums the thresholded tissue probability maps. Compared to this single-template-based TIV mask, MALF provides a 3D TIV mask through the fusion of multiple non-rigidly registered template masks [32], giving a direct measure of the 3D surface/shape of the cranial vault.
Figure 1:
Sagittal view of the multi-atlas label fusion (MALF) estimated TIV overlaid on the brain images for (A) 1.5T image of a male subject; (B) 3T image of the same male subject; (C) 1.5T image of a female subject; (D) 3T image of the same female subject. This visualization shows that the MALF method is able to generate accurate outlines of the cranial vault based on the OASIS-BC2 atlas, and that the cranial vault contour shapes are comparable for the same subject at both field strengths.
4.2.1. Longitudinal consistency
Figure 2 A–C show the longitudinal trajectory of TIV normalized to the baseline volume across all available time-points using the FreeSurfer, SPM, and MALF methods. The estimate of TIV exhibits variability as a function of different acquisition timepoints (in months). FreeSurfer TIV estimate on 1.5T data (top row) shows a small negative longitudinal trend. SPM TIV shows better overall longitudinal consistency, although there are more variations in the data (more data points lie outside the ±5% percentage change from the baseline). The MALF exhibits the most visually consistent longitudinal TIV among the three methods, and most of the estimated TIV measures are within the ±5% variation range.
Figure 2:
(A-C) Longitudinal trajectories of percentage change of TIV from baseline for both 1.5T (top row) and 3T (bottom row) over time (in months). Each colored line represents the longitudinal trajectory of an individual subject. Median of TIV trajectory is shown in the black line. The dashed line represents the ±5% variation range. The MALF method shows smaller longitudinal variations of TIV as compared to FreeSurfer and SPM methods. (D) Visualization of the test-retest reliability analysis via the Bland-Altman plot. Dashed lines represent 95% confidence interval (CI) of the mean difference, and solid lines represent the linear regression result that fit the data. The FreeSurfer (Left column) showed larger CI than the SPM (middle column) and MALF (right column).
To quantitatively evaluate the longitudinal consistency, we used the LME model to remove the effects of field strength and sex on the linear intercept (base TIV), and examined the relationship between TIV estimates and scanning time [47]. The results of the LME model are shown in Table 2. Theoretically, there should be no association between the adult TIV and time. No significant correlation between age and TIV was detected with any of the three methods: FreeSurfer exhibited the largest coefficient (−0.45%/year) and the largest variance (0.34%/year), SPM showed a modest coefficient (−0.15%/year) and variance (0.25%/year), and MALF showed the smallest coefficient (0.11%/year) and variance (0.11%/year). In addition, Figure 2 (C) shows the TIV residual after fitting the LME model. Most of the MALF-estimated TIVs lie within the ±5% residual range, while for both FreeSurfer and SPM a large proportion of residuals exceed the ±5% range. In summary, all three TIV estimation methods showed good longitudinal consistency, with MALF demonstrating marginally better performance.
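The ±5% criterion used when reading the longitudinal trajectories can be expressed as in the sketch below, which computes the percentage change from baseline for one subject and the fraction of timepoints outside a tolerance band. The function names and TIV values are hypothetical.

```python
def percent_change_from_baseline(tiv_series):
    """Percentage change of each timepoint relative to the first (baseline)."""
    baseline = tiv_series[0]
    return [100.0 * (value - baseline) / baseline for value in tiv_series]


def fraction_outside(changes, limit=5.0):
    """Fraction of timepoints whose absolute change exceeds the limit (in %)."""
    return sum(1 for change in changes if abs(change) > limit) / len(changes)


# Hypothetical TIV estimates (cm^3) for one subject at three timepoints.
changes = percent_change_from_baseline([1500.0, 1515.0, 1410.0])
outside = fraction_outside(changes)
```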
Table 2:
Quantitative evaluation of the longitudinal consistency and test-retest reliability for three TIV estimation methods (FreeSurfer, SPM, and MALF) across all the time points of 1.5T and 3T. The estimated coefficient of age (first column) represents the longitudinal slope of TIV change across time. The residual variance (second column) represents the standard error (SE) after fitting the LME model. The p-value (third column) reflects the significance of the correlation between the predictor (age) and the dependent variable (TIV). The fourth column reports the 95% confidence interval of the Bland-Altman analysis, which shows the percentage difference between the estimated TIV of the test and retest data. All three methods showed p-values larger than 0.1, and the estimated coefficients are of the same magnitude as the residual variance, which indicates that no significant correlation between age and TIV was detected. MALF showed the smallest coefficient and SE among the three methods, although all three methods show a comparable level of consistency. The MALF method also showed the smallest and most balanced confidence interval among the three methods.
Method | LMEM coefficient vs. time: estimated coefficient | LMEM coefficient vs. time: residual variance | LMEM coefficient vs. time: P-value | Bland-Altman analysis: 95% confidence interval |
---|---|---|---|---|
FreeSurfer | −0.45% | 0.34% | 0.73 | −1.5 ~ 2.0% |
SPM | −0.15% | 0.25% | 0.83 | −0.52 ~ 0.58% |
MALF | 0.11% | 0.11% | 0.31 | −0.55 ~ 0.55% |
4.2.2. Test-retest reliability
Figure 2 D and Table 2 show the results of the test-retest reliability evaluation using Bland-Altman analysis. FreeSurfer showed the widest confidence interval (CI) of the mean difference among the three methods (−1.53 ~ 2.01%), followed by SPM (−0.52 ~ 0.58%) and MALF (−0.55 ~ 0.55%).
In conclusion, MALF showed the most robust performance compared with FreeSurfer and SPM in terms of both longitudinal consistency and test-retest reliability. Since MALF also provides an accurate 3D mask of the intracranial space, we used the MALF-based estimate of TIV in the following analyses.
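As an illustration of the Bland-Altman summary reported above, the following minimal sketch computes the mean percentage difference and a 95% interval for paired test/retest TIV estimates. It assumes the percentage difference is taken relative to the mean of each pair and that the reported interval corresponds to the usual limits of agreement (mean ± 1.96 SD); the function names and the sample values are hypothetical and are not taken from the ADNI data.

```python
from statistics import mean, stdev

def percent_volume_difference(test, retest):
    """Percentage difference of each pair, relative to the pair mean (assumed definition)."""
    return [100.0 * (a - b) / ((a + b) / 2.0) for a, b in zip(test, retest)]

def bland_altman_limits(test, retest, z=1.96):
    """Return (bias, lower, upper): mean difference and the assumed 95% limits."""
    diffs = percent_volume_difference(test, retest)
    bias = mean(diffs)
    spread = z * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread

# Hypothetical TIV estimates in mm^3 for four test/retest pairs.
test_tiv = [1500000.0, 1420000.0, 1610000.0, 1380000.0]
retest_tiv = [1495000.0, 1426000.0, 1604000.0, 1383000.0]
bias, lower, upper = bland_altman_limits(test_tiv, retest_tiv)
print(f"bias = {bias:.2f}%, interval = {lower:.2f}% ~ {upper:.2f}%")
```

A narrow interval centred near zero, as reported for MALF, would indicate both small random disagreement and little systematic bias between the test and retest scans.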
4.3. TIV variation due to scanner field strength difference
When comparing TIV estimated from 1.5T and 3T images using the second cohort, which includes back-to-back scans at both 1.5T and 3T, both the correlation (Figure 3 A–C) and the percentage volume difference (PVD) (Figure 3 D–F, Bland-Altman plots) [26] showed good agreement. However, as seen in these results, the TIV estimates for 3T images are smaller than the 1.5T estimates across all three methods (FreeSurfer, SPM and MALF). This field-strength-related discrepancy is also visible in the ECDF plots (Figure 3 G–I) of the 1.5T and 3T TIV, where the ECDFs of the 3T TIVs are shifted leftward (representing relatively lower values) compared with those of the 1.5T TIVs.
Figure 3:
Comparing TIV at 1.5T and 3T for all three methods: FreeSurfer, SPM, and MALF. (A,B,C) Correlation, (D,E,F) agreement in terms of percentage volume difference using Bland-Altman plots and (G,H,I) empirical cumulative distribution function (ECDF). The PVD in the Bland-Altman plot is defined in Equation 2. (A-C): All three methods show good correlation, with MALF being the highest. TIV at 3T is slightly lower than at 1.5T. (D-F): Visualization of agreement between the values via the Bland-Altman plot shows qualitatively lower disagreement between 1.5T and 3T TIVs with MALF, as evidenced by a narrower 95% confidence interval (CI) (dashed lines) compared to FreeSurfer and SPM. Further, no systematic biases towards larger or smaller TIVs are noted for any method. (G-I): The 3T TIV values are slightly lower than the 1.5T values, and the female TIV values at each field strength are markedly lower than the male TIV values. (X-axis unit: mm³)
4.4. Correlation between ROI volume and TIV
Figure 4 shows the correlation between the volumes of a set of FreeSurfer-derived ROIs (14 subcortical/cortical structures and the lateral ventricle) and the TIV for the CN group across all time points. The 1.5T and 3T data are shown separately. An overall positive correlation between ROI volumes and TIV is found, indicating that larger head sizes generally translate to larger brain structures. However, the strength of the correlation varies among structures, which indicates that different structures in the brain scale with TIV in a non-proportional way.
Figure 4:
Correlation between the MALF-based TIV (x-axis) and selected structure volumes (y-axis) for the CN group for males (blue) and females (red). The correlations are shown with the left and right volumes combined, and separately for each field strength (1.5T and 3T in separate rows). The TIV of male subjects tends towards larger values than the TIV of female subjects at both field strengths. Males with larger TIV showed larger structure volumes compared to female subjects, as evidenced by a positive correlation between the ROI volumes and TIV. The strength of the correlation varies across ROIs.
4.5. Evaluation of “goodness of harmonization”
In this section, we evaluated the distribution and variation of the brain structure volumes over the entire ADNI database before and after accounting for covariates such as TIV, scanner field-strength, sex and age using 1) Zscapes and 2) ECDFs.
4.5.1. Visualization of the goodness of harmonization using volume Zscapes
Figure 5 shows the Zscape, a panoramic view of the Z-score of each grey matter ROI volume for all subjects in the ADNI database (with the CN group regarded as the reference group). The CN, MCI and AD diagnostic groups are shown separately, each divided into male and female, which are further divided into 1.5T and 3T. Within each Zscape plot, the horizontal axis is sorted by age at the time of scan in ascending order. The color spectrum from blue to red represents Z-scores ranging from −6 to +6, showing the level of volume shift from the mean of the reference (CN) group. If the data are fully harmonized, we expect the visual patterns within any diagnostic group (CN/MCI/AD) to be homogeneously distributed with minimum intra-group variation, which means minimum male-v-female or 3T-v-1.5T differences, and minimum volume variation due to normal aging. Figure 5 demonstrates different levels of data harmonization after adjusting for the different confounding covariates.
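The Z-score underlying each cell of a Zscape can be sketched as follows. This is a minimal illustration, assuming the Z-score is computed per structure from the mean and sample standard deviation of the CN reference group, and that values within ±1 SD are left blank as stated in the caption of Figure 5; the function names and volumes are hypothetical.

```python
from statistics import mean, stdev

def zscores(values, reference):
    """Z-score of each value relative to the reference (CN) group mean and SD."""
    mu = mean(reference)
    sd = stdev(reference)
    return [(v - mu) / sd for v in values]

def mask_within(z, threshold=1.0):
    """Keep only scores beyond +/- threshold; the rest become None (not drawn)."""
    return [s if abs(s) > threshold else None for s in z]

# Hypothetical volumes of one structure in mm^3.
cn_volumes = [3900.0, 4100.0, 4000.0, 4200.0, 3800.0]
ad_volumes = [3200.0, 3950.0, 3400.0]

z_ad = zscores(ad_volumes, cn_volumes)
print(mask_within(z_ad))
```

Repeating this for every structure (rows) and every subject sorted by age within each subgroup (columns) yields the matrix that is rendered as a color map.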
Figure 5:
The Zscape of all FreeSurfer-segmented GM structures across the entire ADNI database, showing the Z-score of every structure for every subject. Data are first categorized into three diagnostic groups (CN, MCI, and AD), with the CN group serving as the reference group for calculating the Z-score. Each diagnostic group is then divided into female and male groups, which are further separated into 3T and 1.5T subgroups. Within each 1.5T and 3T subgroup, the data are sorted left to right by increasing age. Only Z-scores beyond ±1 SD of the CN group are shown. The color spectrum from blue to red represents Z-scores ranging from −6 to +6, showing the level of volume shift from the mean of the reference (CN) group. (A) The raw structure volumes show systematic differences between the 1.5T and 3T subgroups, as well as between the male and female groups. Within each subgroup, the volume also decreases as age increases (from left to right), reflecting the effect of normal aging. (B) Regressing out only field strength removes the systematic difference between 1.5T and 3T. (C) Normalizing by TIV with the proportional method (directly dividing the volume by TIV) does not remove the intra-diagnostic-group variation. (D) Regressing out only TIV reduces the sex-based variation, but the systematic bias between 1.5T and 3T remains. (E) Regressing out field strength followed by proportional TIV normalization does not reduce the variation further. (F) Regressing out both field strength and TIV removes the systematic volume differences between 1.5T and 3T as well as between males and females, which is similar to (G) regressing out TIV, field strength and sex, indicating that TIV and sex are highly correlated. (H) Including age in the model further removes the volume reduction due to normal aging.
In summary, the residual-based covariate regression reduces the variation within each individual diagnostic group.
Figure 5 (A): No covariates adjusted. There is a clear distinction between the covariate subgroups: the structure volume decreases with age (left to right), as shown by the trend towards cooler colors. The volumes in the male group appear larger (warmer colors) than in the female group (cooler colors). The structure volumes at 3T appear larger (warmer colors) than at 1.5T (cooler colors). Compared to Figure 3, which showed smaller TIV at 3T than at 1.5T, this result shows that the effect of field strength is not proportionally scaled across different tissue types. Figure 8 in a later section provides a more in-depth investigation of this finding.
Figure 5 (B): Adjusting for field strength. The discrimination between 1.5T and 3T has been controlled for, while the distinction between male and female and across age is still visible.
Figure 5 (C): Volume normalized by direct division by TIV. Contrary to the raw data Zscape in (A), the normalized female structural volumes tend towards larger values (warmer colors) than the normalized male volumes. The variation between 1.5T and 3T still persists after the TIV normalization.
Figure 5 (D): Volume normalization by TIV with the residual method. Compared to (C), the regression-normalized male and female volume W-scores become more similar, although the differences between 1.5T and 3T volumes remain.
Figure 5 (E): Adjust field strength, then divide by TIV. Compared to (B), which only adjusts for field strength, little improvement in harmonization is observed.
Figure 5 (F): Adjust field strength and TIV with the residual method. Compared to (B), which only adjusts for field strength, the difference between the male and female groups is also markedly reduced, indicating a strong correlation between TIV and sex. This finding aligns with the results shown in Figure 3 (G-I) and Figure 4.
Figure 5 (G): Adjust field strength, TIV, and sex. Compared to (F), there is no obvious further reduction of the female/male structure volume difference, as most of the difference has already been removed by adjusting for TIV.
Figure 5 (H): Adjust all covariates, including field strength, TIV, sex, and age. This harmonization process has removed the color patterns across the subgroups leading to a uniform pattern of structure volume distribution across subjects within each disease diagnostic group.
Figure 8:
ECDF of tissue volume (GM/WM/CSF) from FreeSurfer and SPM taken from the CN group only. Thick line = 3T, thin line = 1.5T; Red = Female, Blue = Male. Field-strength corrected residual shows a prominent sex-effect, whereas correcting additionally for TIV accounts for the variability attributed to sex as well. In addition, the ECDFs show that GM is scaled larger at 3T field-strength (thicker lines of the 3T ECDF to the right of thinner lines for the 1.5T ECDF, for both males and females), while WM and CSF are scaled smaller at the 3T field-strength (thicker lines to the left of thinner lines) indicating non-linear scaling of different tissue types across field-strengths.
4.5.2. Visualization of “goodness of harmonization” using ECDF
Figures 6 and 7 show the ECDF for a selected sampling of subcortical and cortical structures respectively, including both the left and right hemisphere’s structural volume measures to simplify the presentation. The ECDFs of the raw measures (column 1) show marked scatter and reduced separations between CN and AD groups prior to the control of covariates. The female (red) ECDF curves are generally to the left as compared to the male (blue) ECDF curves indicating overall smaller uncorrected regional volumes in females. The 1.5T measures (thin lines) are generally to the left of the 3T measures (thick lines) indicating that grey matter volumes are lower at 1.5T relative to 3T except for the lateral ventricles where the pattern is reversed indicating that ventricles are larger on 1.5T. The AD group measures (dashed lines) are generally to the left or coincident with the CN group measures (solid lines) indicating that structural volumes are lower, or preserved, in AD as compared to controls, except for the ventricles where the pattern is reversed, indicating enlargement of ventricles in AD.
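For readers less familiar with ECDF plots, the following minimal sketch shows how an ECDF is evaluated and how a leftward shift appears numerically. The helper name `make_ecdf` and the sample volumes are hypothetical.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F, where F(x) is the fraction of the sample that is <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n

# Hypothetical raw volumes in mm^3 for two subgroups.
female = [3500.0, 3650.0, 3800.0, 3900.0, 4000.0]
male = [3800.0, 3950.0, 4100.0, 4250.0, 4400.0]

f_female, f_male = make_ecdf(female), make_ecdf(male)
for x in (3600.0, 3900.0, 4200.0):
    # A curve lying to the left has the larger ECDF value at the same volume.
    print(x, f_female(x), f_male(x))
```

When one ECDF is at or above another at every volume, the corresponding curve lies to the left, which is the pattern described for female versus male raw volumes.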
Figure 6:
The empirical cumulative distribution function (ECDF) of the volumetric measures taken from a select few subcortical structures. Solid line = CN, dashed line = AD; thick line = 3T, thin line = 1.5T; Red = Female, Blue = Male. The residual in the title represents the standardized residual after regression (W-score with respect to the CN reference group). Note the overlap of ECDFs in the raw measures. As the variability attributed to field-strength, TIV, sex, and age are accounted for traversing from left to right, the ECDFs of the harmonized measures tend to coalesce leaving ECDFs for the CN and AD distribution.
Figure 7:
The empirical cumulative distribution function (ECDF) of the volumetric measures taken from a select few cortical structures. Solid line = CN, dashed line = AD; thick line = 3T, thin line = 1.5T; Red = Female, Blue = Male. The residual in the title represents the standardized residual after regression (W-score with respect to the CN reference group). Note the overlap of ECDFs in the raw measures. As the variability attributed to field-strength, TIV, sex, and age are accounted for traversing from left to right, the ECDFs of the harmonized measures tend to coalesce leaving ECDFs for the CN and AD distribution.
After accounting for field strength (2nd column), the systematic bias between 1.5T and 3T measures is reduced as shown by the coalescing of the corresponding 3T (thick) and 1.5T (thin) ECDF lines. The variabilities due to female/male differences still remain, as evidenced by the leftward shift of the female ECDFs (red lines) compared to the male ECDFs (blue lines). Removing TIV (by division as in third column or by regression as in fourth column) without adjusting for field-strength shows that the male and female ECDFs tend to coalesce, as TIV is correlated to sex, but the variation due to field-strength is evident in the separation of the 1.5T (thin) and the 3T (thick) ECDF lines.
Using a GLM with field-strength and TIV (column 6) further reduces the systematic bias between the female and male ECDFs; the result is similar to the ECDFs obtained after adding the sex covariate to the GLM (7th column), reaffirming the correlation between TIV and sex. Interestingly, controlling for field-strength with the regression residual and then dividing by TIV, as is often done in the literature, does not account for these covariates as satisfactorily, as shown in column 5 compared to column 6, where the distributions generally do not coalesce. Introducing age into the GLM does not produce a marked change in the ECDFs, indicating no distinctive effect of age on the distribution pattern when comparing among the different covariate subgroups (i.e. the age-dependent volume variation is similar for each subgroup).
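The GLM-based adjustment for several covariates at once can be illustrated with an ordinary least-squares fit followed by residual computation. The sketch below is only an approximation of the procedure under stated assumptions: the coefficients are fitted in the CN reference group and then applied to any subject, the model is solved through the normal equations in plain Python rather than with a statistics package, and all function names and data values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients for y ~ 1 + covariates (intercept first)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def residuals(rows, y, beta):
    """Observed value minus the value predicted from the covariates."""
    return [yi - (beta[0] + sum(b * v for b, v in zip(beta[1:], r)))
            for r, yi in zip(rows, y)]

# Hypothetical CN reference data: (field strength indicator 0=1.5T/1=3T, TIV in litres).
cn_rows = [(0, 1.40), (1, 1.45), (0, 1.55), (1, 1.60), (0, 1.50), (1, 1.38)]
cn_y = [4.05, 4.50, 4.35, 4.65, 4.20, 4.40]  # structure volume in mL

beta = fit_ols(cn_rows, cn_y)
# Apply the reference-group model to a hypothetical AD subject scanned at 3T.
print(residuals([(1, 1.50)], [3.60], beta))
```

Adding sex or age amounts to appending further columns to each covariate row; dividing the residuals by the standard deviation of the reference-group residuals would give standardized residuals.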
The ECDFs also showcase the influence of AD on these structures relative to the CN group by the leftward separation of the ECDFs after accounting for covariates. The hippocampus and amygdala (in Figure 6), and entorhinal cortex, para-hippocampal gyrus, precuneus, posterior and isthmus cingulate (in Figure 7) show lowering of volume in the AD group as the dashed lines all coalesce into a single distribution leftward of the coalesced solid lines. Ventricles, on the other hand, show enlargement, as expected (in Figure 6). On the other hand, for putamen, thalamus, the dashed lines (AD) and solid lines (CN) are relatively closer compared to other subcortical structures (in Figure 6), indicating a smaller effect of AD to lower the volume. Interestingly, for entorhinal cortex (Figure 7, first row), normalizing the volume by dividing with TIV (column 3) or residual with TIV (column 4) already accounted for the variability induced by other covariates. This indicates that different structures have different nonlinear relationships to field-strength and TIV, and visual evaluation of “goodness of harmonization” of measures can help assess whether the accounting for covariates via the chosen method achieved the intended result.
We further plot the ECDF of grey matter (GM), white matter (WM) and cerebrospinal fluid (CSF) tissue volume of the CN group extracted from both the FreeSurfer and SPM pipelines (Figure 8) to assess the normalization effectiveness on the total GM/WM/CSF compartments. The field-strength corrected residuals in this Figure show a prominent sex-effect, whereas correcting additionally for TIV accounts for the variability attributed to sex as well. In addition, the ECDFs of the raw tissue volume (1st column) show that GM is scaled larger at 3T field-strength (thicker lines of the 3T ECDF to the right of thinner lines for the 1.5T ECDF, for both males and females), while WM and CSF are scaled smaller at the 3T field-strength (thicker lines to the left of thinner lines) indicating non-linear scaling of different tissue types across field-strengths.
4.5.3. Quantitative evaluation of data harmonization based on ECDF
To quantitatively assess the shift of ECDFs after accounting for each covariate, we performed the K-S test between two subgroups for each of the three variables (diagnosis, field strength and sex) for the hippocampus (both left and right), a region considered to be a hallmark of AD-induced degeneration. Figure 9 (A) shows the ECDFs for 4 different normalizations, and Figure 9 (B-D) shows the results of the K-S test, representing the statistical distances between the distributions. In Figure 9 (B-D), the y-axis shows the value of the K-S statistic Dm,n, and the dashed line represents the threshold value for rejecting the null hypothesis that the two sample distributions come from the same population. We anticipate that the Dm,n statistic will be maximized between subgroups of CN and AD, the main effect of interest (e.g. CN 1.5T female vs AD 1.5T female will show larger separation after normalization), and minimized between subgroups of nuisance covariates that need to be reduced or removed, such as 1.5T vs 3T (e.g. 1.5T CN male vs 3T CN male ECDFs will show reduced separation after normalization). When comparing the diagnostic groups (panel B), all normalization methods showed significant differences between CN and AD. The value of the K-S statistic Dm,n increases after including all covariates in the GLM, representing a larger difference between the sample ECDFs and effectively increasing the power for discrimination (red bar, representing the 4th column, “Residual(All)”, in panel A). When comparing 1.5T and 3T (panel C), the significant difference between the two distributions is removed after all covariates have been controlled for. The difference between the ECDFs of the male and female groups (panel D) becomes non-significant after controlling for TIV using the standardized residual of the GLM, confirming the strong correlation between TIV and subject sex.
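The two-sample K-S comparison described above can be sketched in a few lines. This illustration assumes the standard definition of Dm,n as the largest vertical distance between two ECDFs and the standard large-sample rejection threshold c(α)·sqrt((m+n)/(mn)) with c(α) = sqrt(−ln(α/2)/2); whether the dashed line in Figure 9 was computed in exactly this way is an assumption, and the function names and data are hypothetical.

```python
from math import log, sqrt

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest vertical gap between the two ECDFs."""
    xa, xb = sorted(a), sorted(b)
    m, n = len(xa), len(xb)
    i = j = 0
    d = 0.0
    while i < m and j < n:
        x = min(xa[i], xb[j])
        # Step both ECDFs past every observation equal to x (handles ties).
        while i < m and xa[i] <= x:
            i += 1
        while j < n and xb[j] <= x:
            j += 1
        d = max(d, abs(i / m - j / n))
    return d

def ks_threshold(m, n, alpha=0.05):
    """Large-sample critical value for rejecting 'same population' at level alpha."""
    c = sqrt(-0.5 * log(alpha / 2.0))
    return c * sqrt((m + n) / (m * n))

# Hypothetical standardized residuals for two subgroups.
cn = [-0.4, 0.1, 0.3, -0.2, 0.6, 0.0, -0.7, 0.5]
ad = [-2.1, -1.5, -1.8, -0.9, -2.4, -1.2, -1.7, -0.3]

d = ks_statistic(cn, ad)
print(d, d > ks_threshold(len(cn), len(ad)))
```

For a well-harmonized measure, this statistic would be expected to exceed the threshold for CN versus AD comparisons and to fall below it for comparisons across field strength or sex.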
Figure 9:
Panel (A): Comparison of normalization methods for hippocampus volumes as a function of field strength and TIV, sex, age. ECDFs for structural volumes of each subgroup are shown after accounting for covariates. The rightmost column, for example, shows the residual after regression for TIV, age, sex and field-strength showing that the ECDFs of all subgroups coalesce closer into ECDFs of CN and AD (the effect of interest). Solid line = CN, dashed line = AD; thick line = 3T, thin line = 1.5T; Red = Female, Blue = Male. Panel (B-D): The Kolmogorov-Smirnov (KS) statistic (Equation 5) comparing the ECDF separation within each subgroup. In each panel, different colors represent different normalization methods used in panel (A). Panel (B) shows the K-S statistic between the ECDFs of CN and AD groups. For example, the group of 4 bars on the left in panel B are the K-S statistic between the ECDFs of CN 1.5T female and AD 1.5T female subgroups (the thin red lines in panel (A), both solid and dash lines) for each of the four normalization methods (corresponding to the four columns in panel (A)). The K-S statistic is the highest for the “residual-all” method (red) indicating an increasing separation between the ECDFs with this method of harmonization. Panel (C) shows the K-S statistic between 1.5T and 3T group ECDFs. As an example, the bars on the left in this panel show K-S statistic for 1.5T CN female and 3T CN female for each of the four methods. The separation between these ECDFs decreases after “residual-all” method (red) is utilized. Panel (D) shows the K-S statistic between female and male groups. As an example, the bars on the left in this panel show K-S statistic between the female 1.5T CN ECDF and the male 1.5T CN ECDF for the four methods. This panel shows that the separation between female and male ECDFs is reduced with the “residual (TIV)” (yellow) and “residual-all” (red) method.
5. Discussion
5.1. “Goodness of harmonization”
In the quest towards improved understanding of brain structure and function, recent neuroimaging databases such as ADNI leverage data-sharing from multiple sites, and multiple research labs to increase the number of imaging scans available for analyses. Differences in site-specific parameters (such as acquisition pulse sequences) or processing-specific parameters (such as segmentation protocols) can introduce undesirable variability in data that can reduce the power to detect smaller effect sizes of interest. To reduce/remove such unrelated and undesirable sources of variability, a significant amount of recent collaborative research effort has been directed towards harmonization of acquisition and processing protocols. Techniques for assessment of “goodness of harmonization” are relevant even with harmonized data acquisition and processing protocols, as systematic sources of variability can still exist due to unaccounted methodological or demographic covariates potentially biasing all downstream analyses [86].
In this study, we presented several methods to assess the “goodness of harmonization” of images with varying field-strength (1.5T/3T), TIV (proportional/residual normalization), sex (male/female) and age as covariates. Using these methods, we demonstrate the effect of different data harmonization choices, before and after controlling for the effect of these covariates. Our experiments indicate that GLM-based residuals are the appropriate choice for these covariates for volumetric analysis purposes. Group difference analysis can of course directly incorporate multiple covariates into the GLM [49]. By directly analyzing the residuals at each stage of the GLM correction, deeper insight into how well the covariates are accounted for can be obtained. These harmonized residuals are inputs for the development and/or testing of classification models. Proper modeling of covariates, which reduces spurious variability while retaining the variability of interest in the input features for classification, is known to help increase the discrimination between the patterns related to the effect of interest inherent in these features [78].
We used the Kolmogorov-Smirnov (K-S) statistic [56] to quantify the distance between ECDFs after covariate normalization. We note that this quantification can also be performed with alternative distance measures and statistical tests such as the discrete Cramér-von Mises (CVM) criterion [2], the Kullback-Leibler divergence [44], or the k-sample Anderson-Darling test [85], depending on the distribution of the data. In addition, although the covariates we evaluated in this study only include scanner field strength, TIV, sex and age, the proposed methods can be extended to evaluate the effect of additional covariates including, but not limited to, technical covariates such as scanner vendor [47] or demographic/biological covariates such as disease-risk-related genes (e.g. APOE mutation status). Further research is also needed to identify universal thresholds for assessing “good” or “bad” harmonization, which will likely depend on the databases being pooled and the particular covariates chosen for the study.
Fortin et al. have previously reviewed and compared several different data harmonization methods [21], such as functional normalization [20], RAVEL [22], surrogate variable analysis (SVA) [48], ComBat [38], and RUV [25]. The tools developed in this study for assessing “goodness of harmonization” could be potentially used for comparison of these competing harmonization strategies. In addition, although we only evaluated the goodness of harmonization for data within a single database (ADNI) in this study, the data harmonization can also be extended to pool data from multiple studies by including additional site-specific covariates [19].
5.2. TIV estimation and normalization
TIV is an important covariate for neuroanatomical studies of changes in brain structure. However, accurate TIV estimation from T1-weighted (T1W) brain MRI is difficult given the lack of adequate contrast between the skull and the CSF. Currently, the best-validated and most widely used TIV estimation methods are those in the FreeSurfer [10] and SPM [41] toolboxes, and the MALF method [84, 32]. FreeSurfer (version 5.3.0) uses a template with pre-calculated TIV, which is affinely registered to the target image, and uses the scaling factors derived from the affine matrix to approximate the TIV [10]. In SPM, the TIV is calculated as the sum of all intracranial tissues, with additional tissue classes introduced in the more recent SPM12 (e.g. external CSF as a separate class, as opposed to part of the overall CSF class in earlier versions), and regularized through wrapping the tissue segmentation with a manually corrected TIV mask [41] to improve the segmentation accuracy. The more rigorous definition of T1-based TIV in SPM appears to be more consistent compared to FreeSurfer’s scaling-based estimation [80, 81, 82, 41, 28], and is less affected by brain atrophy [68]. However, both the FreeSurfer and the SPM8 automatic TIV estimation introduce systematic overestimation [63]. This has been alleviated in SPM12, where a new method using template registration was introduced that improves the accuracy [52] and consistency [31] of both TIV estimation and brain volume measurements [93, 31].
On the other hand, the MALF approach has demonstrated great accuracy in brain structure segmentation and parcellation, brain extraction [30], and skull stripping [77]. Schaerer et al. [84] used a MALF framework (STAPLE) to estimate the TIV on the ADNI dataset and demonstrated better performance compared to FreeSurfer and SPM. Manjón et al. introduced a TIV extraction framework [53] using a MALF-based TIV as an extension of the brain extraction framework BEaST, including extra-cerebral spinal fluid in the manual atlas templates to obtain the entire TIV using conditional mask dilation (only over CSF voxels) followed by manual correction. More recently, Huo et al. [32] applied an improved MALF framework (Non-Local Spatial STAPLE, NLSS) and reported better TIV estimation accuracy compared to SPM12 and FreeSurfer, validated using a semi-manually segmented atlas of CT-MRI image pairs as the gold-standard TIV.
The aim of the longitudinal consistency analysis is to evaluate the performance of the automated procedures from the three commonly used image processing packages (FreeSurfer, SPM and MALF) with minimal or no human intervention, to ensure unbiased validation and comparison. We therefore show the TIVs estimated for all subjects with these three methods. Samples whose percentage change lies outside the ±5% variation range (Figure 2) are highlighted to enhance the visual comparison across the three methods.
5.3. 3D TIV mask versus a scalar for volume
One advantage of MALF over other TIV methods such as FreeSurfer and SPM is the readily available 3D mask of the intracranial vault (a sample is shown in Figure 10). SPM also generates a TIV mask in the subject space during its pipeline (through the single-template-based, non-rigidly registered “reverse brain mask” as part of the “new segmentation” method in SPM 12). However, this TIV mask is not used to calculate the final measurement of TIV, but rather to constrain the final TIV calculation, which is the summation of the thresholded tissue probability maps. Compared to this single-template-based TIV mask, MALF provides a 3D TIV mask through the fusion of multiple non-rigidly registered template masks, resulting in higher accuracy [32]. When studying the shape of subcortical structures, such as the hippocampus, a typical approach is to perform affine registration of the segmented hippocampus ROI to a template hippocampus segmentation prior to non-rigid registration-based shape analysis [95]. However, not only may head size influence the size of brain structures; the shape of the cranial vault could also potentially influence the shape of brain structures in a non-isotropic manner. If this is the case, the shape of the intracranial vault could be used to perform non-isotropic normalization of the shape of brain structures such as the hippocampus to account for this 3D covariate of cranial-vault shape. Normalization of cranial vault shape could, therefore, be important for studying shape changes of brain structures and is an interesting topic for future investigation.
Figure 10:
The sagittal view of sample images showing the overlay of the surface rendering of the 3D TIV mask estimated using the MALF with the corresponding T1 brain MR. Top row: 1.5T, bottom row: 3T (A) Sample male CN subject (B) Sample female CN subject (C) Sample male AD subject (D) sample female AD subject. (E,F) The TIV mask of the same subject in (D) with an overlaid view (E) and contour view (F).
5.4. TIV normalization methods
The two commonly used methods proposed to account for the influence of TIV variations when analyzing changes in brain structure volumes are the proportional method and the residual method [65, 64, 94]. Sanfilipo et al. [79] performed a theoretical comparison of the proportional and residual methods for TIV normalization. The results of that study showed that the proportional method may aggregate errors from both the numerator (structure volume) and the denominator (TIV), which can be observed in Figure 5 (C, E). On the other hand, the residual method guarantees that the error in the predicted value is minimized through the least-squares solution of the linear regression. The flexibility of the residual method also enables TIV to be combined with other covariates for better intra-group harmonization (Figure 5 D, F, H). One limitation of the residual method is that it requires a sufficiently large sample in the reference group to derive the linear coefficient relating TIV (the covariate) to structure volume (the dependent variable), unless such a coefficient is generated a priori. However, this is not an issue when performing analyses on large databases such as ADNI.
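The difference between the two normalizations can be made concrete with a small sketch. It assumes a simple linear regression of structure volume on TIV fitted in a reference group, with the residual standardized by the sample standard deviation of the reference residuals (a simplified W-score); the function names and all values are hypothetical.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def proportional(volumes, tivs):
    """Proportional method: divide each structure volume by the subject's TIV."""
    return [v / t for v, t in zip(volumes, tivs)]

def w_scores(volumes, tivs, ref_volumes, ref_tivs):
    """Residual method: standardized residual relative to the reference-group fit."""
    slope, intercept = fit_line(ref_tivs, ref_volumes)
    ref_res = [v - (intercept + slope * t) for v, t in zip(ref_volumes, ref_tivs)]
    sd = stdev(ref_res)  # sample SD of reference residuals (a simplification)
    return [(v - (intercept + slope * t)) / sd for v, t in zip(volumes, tivs)]

# Hypothetical reference (CN) data: TIV in litres, structure volume in mL.
ref_tivs = [1.40, 1.50, 1.60, 1.45, 1.55]
ref_volumes = [3.80, 4.00, 4.30, 3.95, 4.10]

# Two hypothetical patients with the same structure volume but different TIV.
print(proportional([3.40, 3.40], [1.40, 1.60]))
print(w_scores([3.40, 3.40], [1.40, 1.60], ref_volumes, ref_tivs))
```

The proportional value inherits measurement error from both the volume and the TIV, whereas the residual only depends on the fitted relation in the reference group, which is why a sufficiently large reference sample is needed.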
5.5. Influence of covariates on TIV and brain tissue volumes
Our results show that, as is previously known, male TIV are generally larger than female TIV for both field strengths 1.5T and 3T (Figure 3, panels G,H,I and Figure 4). Larger TIV could lead to an assumption of a tendency towards larger brain structure volumes (a uniform scaling effect for all structures). Indeed, positive correlation between TIV and ROI volume is observed as seen in Figure 4. However, the strength of correlation varies across structures indicating that not all structures are uniformly larger with larger TIVs. This suggests that structural volumes in the brain are non-uniformly scaling with head size as measured by TIV.
TIV is found smaller on 3T than at 1.5T for both male and female subjects (Figure 3). Jovicich et. al have previously reported systematic smaller TIV from 1.5T than the 3T counterpart using FreeSurfer with 25 subjects [40]. Keihaninejad et al. have proposed an SPM5-based TIV estimation method - Reverse Brain Mask (RBM), and reported over-estimation of TIV from 1.5T an under-estimation the 3T data when compared with manually defined ground truth with smaller samples (5 healthy and 2 AD subject [41]). Heckemann et al. [29] studied a large number of subjects (n=176) from ADNI with a semi-automatic method [23, 103] and also showed smaller 3T TIV than 1.5T. A recent study by Heinen et al. compared the TIV of ten subjects and reported smaller 3T TIV compared to 1.5T using FreeSurfer 5.3.0, while interestingly no such significance TIV differences were reported with SPM12 [31] in which the author attribute to the potentially explanation due to the improved estimation accuracy with the newer version of the analysis package. However, our result showed that the SPM 12 also showed significant smaller 3T TIV compared to 1.5T data, also the difference is smaller than FreeSurfer. The difference in the results of the two study may be due to the much-increased sample size of the dataset used in this study (n = 187) compared to the study of Heinen et al. (n = 10) which significantly increases the effect size.
There is currently no consensus on the explanation for the field-strength-dependent TIV difference. Chu et al. reported a similar finding, that brain volume measured from 3T data is smaller than that from 1.5T data, and concluded that this is due to the lower image contrast in the 1.5T data, which causes over-segmentation of the brain at the boundary between the parenchyma and CSF compartments because of partial volume averaging [13]. In other words, the improved tissue contrast in the 3T data may help to prevent over-segmentation.
Possible biases in MRI-derived volume measurements could also come from other scanner-specific factors, such as the effect of field-strength differences on the correction of B1 intensity inhomogeneities [40]. The ADNI MR imaging core used a standardized preprocessing protocol to remove intensity inhomogeneity from the data [60, 39], although further studies are required to investigate the effect of field strength on these preprocessing steps.
Lateral ventricles are smaller at 3T than at 1.5T (Figure 4) for both male and female subjects. This trend appears more prominent for males than for females, as evidenced by the slightly larger separation between the male ECDFs for lateral ventricle volume. The fact that grey matter is larger at 3T than at 1.5T for both males and females, while white matter and CSF are smaller at 3T (Figure 8), further indicates a non-uniform scaling of tissue compartments with scanner field strength. A previous study by Brunton et al. reported similar observations using SPM8 and concluded that CSF/WM/GM tissue volume measurements from 1.5T and 3T are not directly comparable in voxel-wise analysis with tools such as SPM [9]. These tissue-type-dependent volume variations might further affect TIV estimation, especially with the SPM method, which estimates TIV from a combination of CSF/WM/GM volumes [52]. The differential effect of field strength on different tissues may be due to the physical properties of the tissues, such as their MR relaxation times.
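The ECDF separation discussed above can be quantified as the largest vertical gap between two empirical cumulative distribution functions, which is the two-sample Kolmogorov-Smirnov statistic. The sketch below is a minimal illustration on synthetic volumes; the data, the function names and the mean-shift adjustment (a deliberately crude stand-in for harmonization) are assumptions, not the procedure used in this study.

```python
import numpy as np

def ecdf(sample):
    """Return sorted values and the empirical cumulative probability at each."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def ks_statistic(a, b):
    """Largest vertical gap between the ECDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
vol_15t = rng.normal(40.0, 10.0, size=400)  # hypothetical 1.5T volumes
vol_3t = rng.normal(34.0, 10.0, size=400)   # hypothetical 3T volumes, shifted down

gap_before = ks_statistic(vol_15t, vol_3t)
# Crude stand-in for harmonization: align the 3T group mean to the 1.5T mean.
vol_3t_adj = vol_3t - vol_3t.mean() + vol_15t.mean()
gap_after = ks_statistic(vol_15t, vol_3t_adj)
print(f"ECDF gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

A smaller gap after adjustment indicates that the two field-strength groups have become more similar in distribution, which is the sense in which ECDFs are used here to judge harmonization.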
Interestingly, the influence of the other covariates (sex and field strength) is not uniformly distributed across tissue types or across grey matter structures. Comparison between Figure 5 (A) and (C) indicates that the difference in grey matter structure volume between males and females is smaller than the difference in TIV, so the TIV difference between sexes is not simply scaled proportionally across tissues. In addition, Figure 8 shows that the effect of field strength (1.5T vs. 3T) is also non-uniform across tissue types.
Finally, owing to differences in the definition of TIV, the TIV estimates differ among the three methods, with SPM giving the smallest estimates and FreeSurfer the largest. Similar differences have been reported in previous studies [23, 103, 40, 41, 29, 31].
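Because the covariate effects differ from structure to structure, a GLM-based correction is fitted separately for each structure. The following minimal sketch regresses one synthetic structure volume on TIV, field strength, sex and age, and then removes the fitted covariate effects. The effect sizes, variable names and the helper `glm_residualize` are hypothetical and chosen only for illustration.

```python
import numpy as np

def glm_residualize(volume, covariates):
    """Fit volume ~ intercept + covariates by least squares and remove
    the fitted covariate effects while keeping the overall mean."""
    X = np.column_stack([np.ones(len(volume)), covariates])
    beta, *_ = np.linalg.lstsq(X, volume, rcond=None)
    effects = X[:, 1:] @ beta[1:]
    adjusted = volume - effects + effects.mean()
    return adjusted, beta

rng = np.random.default_rng(2)
n = 600
tiv = rng.normal(1500.0, 150.0, n)             # synthetic TIV (cm^3)
is_3t = rng.integers(0, 2, n).astype(float)    # 0 = 1.5T, 1 = 3T
is_male = rng.integers(0, 2, n).astype(float)  # 0 = female, 1 = male
age = rng.uniform(55.0, 90.0, n)               # years

# Hypothetical structure with structure-specific covariate effects.
volume = (1.0 + 0.002 * tiv - 0.3 * is_3t + 0.1 * is_male
          - 0.02 * age + rng.normal(0.0, 0.2, n))

covs = np.column_stack([tiv, is_3t, is_male, age])
adjusted, beta = glm_residualize(volume, covs)

# Mean difference between field strengths before and after adjustment.
gap_before = abs(volume[is_3t == 1].mean() - volume[is_3t == 0].mean())
gap_after = abs(adjusted[is_3t == 1].mean() - adjusted[is_3t == 0].mean())
print(f"field-strength gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

Repeating the fit for each structure yields a separate coefficient vector per structure, so the correction does not assume that sex or field strength affects all structures equally.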
5.6. Limitation of the current study
Reuter et al. introduced the longitudinal stream in the FreeSurfer framework [74, 73, 75]. Xu et al. [101] recommended initializing the segmentation of within-subject longitudinal images with an average template for each subject, together with within-subject registration across the longitudinal data. This is a valid procedure for analyses such as longitudinal hippocampal volume change. On the other hand, the affine-scale-based TIV estimation in FreeSurfer may benefit only to a limited extent from the longitudinal stream. Hence, in this paper we used the standard FreeSurfer cross-sectional stream to estimate TIV and segment each image, without taking advantage of longitudinal context where available. Evaluation of the FreeSurfer longitudinal stream for TIV estimation is suggested for future work.
6. Conclusion
In conclusion, we have proposed two qualitative methods and one quantitative method to assess the “goodness of harmonization” of covariates such as field strength, TIV, sex and age for volumetric analysis of brain MR imaging data. Our results show that GLM-based normalization of covariates can be adopted, given its satisfactory performance in harmonizing the selected covariates. The methods proposed for assessing goodness of harmonization can be used to compare existing and novel harmonization methods. With these tools, diverse databases can be harmonized and assessed for “goodness of harmonization” before further statistical analysis.
1. Acknowledgement
The authors declare no conflict of interest. Funding for this research is gratefully acknowledged from the Natural Sciences and Engineering Research Council (NSERC), Canadian Institutes of Health Research (CIHR), Brain Canada, Pacific Alzheimer’s Research Foundation, the Michael Smith Foundation for Health Research (MSFHR), and the National Institute on Aging (R01 AG055121–01A1). We thank Compute Canada for the computational infrastructure provided for the data processing in this study. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada.
Footnotes
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
7. REFERENCES
- [1].Agarwal P, Shroff G, and Malhotra P (2013). Approximate incremental big-data harmonization. In Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013, pages 118–125. [Google Scholar]
- [2].Anderson TW (1962). On the Distribution of the Two-Sample Cramer-von Mises Criterion. The Annals of Mathematical Statistics, 33(3):1148–1159. [Google Scholar]
- [3].Aoyagi M, Kim Y, Yokoyama J, Kiren T, Suzuki Y, and Koike Y (1990). Head size as a basis of gender difference in the latency of the brainstem auditory-evoked response. International Journal of Audiology, 29(2):107–112. [DOI] [PubMed] [Google Scholar]
- [4].Arnold TB and Emerson JW (2011). Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions. The R Journal, pages 34–39. [Google Scholar]
- [5].Barnes J, Ridgway GR, Bartlett J, Henley SM, Lehmann M, Hobbs N, Clarkson MJ, MacManus DG, Ourselin S, and Fox NC (2010). Head size, age and gender adjustment in MRI studies: A necessary nuisance? NeuroImage, 53(4):1244–1255. [DOI] [PubMed] [Google Scholar]
- [6].Beg MF, Miller MI, Trouvé A, and Younes L (2005). Computing Large Deformation Metric Mappings via Geodesic Flows of Diffeomorphisms. International Journal of Computer Vision, 61(2):139–157. [Google Scholar]
- [7].Bernal-Rusiel JL, Greve DN, Reuter M, Fischl B, and Sabuncu MR (2013). Statistical analysis of longitudinal neuroimage data with Linear Mixed Effects models. NeuroImage, 66:249–260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Bland JM and Altman DG (1994). Statistics Notes: Correlation, regression, and repeated data. BMJ, 308(6933):896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Brunton S, Gunasinghe C, Jones N, Kempton M, Westman E, and Simmons A (2014). A Voxel-Wise Morphometry Comparison of the ADNI 1.5T and ADNI 3.0T Volumetric MRI Protocols. Alzheimer’s & Dementia, 10(4):P823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Buckner RL, Head D, Parker J, Fotenos AF, Marcus D, Morris JC, and Snyder AZ (2004). A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. NeuroImage, 23(2):724–738. [DOI] [PubMed] [Google Scholar]
- [11].Chow N, Hwang K, Hurtz S, Green A, Somme J, Thompson P, Elashoff D, Jack C, Weiner M, and Apostolova L (2015a). Comparing 3T and 1.5T MRI for Mapping Hippocampal Atrophy in the Alzheimer’s Disease Neuroimaging Initiative. American Journal of Neuroradiology, 36(4). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Chow N, Hwang KS, Hurtz S, Green AE, Somme JH, Thompson PM, Elashoff DA, Jack CR, Weiner M, and Apostolova LG (2015b). Comparing 3T and 1.5T MRI for mapping hippocampal atrophy in the Alzheimer’s disease neuroimaging initiative. American Journal of Neuroradiology, 36(4):653–660. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Chu R, Tauhid S, Glanz BI, Healy BC, Kim G, Oommen VV, Khalid F, Neema M, and Bakshi R (2016). Whole Brain Volume Measured from 1.5T versus 3T MRI in Healthy Subjects and Patients with Multiple Sclerosis. Journal of neuroimaging : official journal of the American Society of Neuroimaging, 26(1):62–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Collij LE, Heeman F, Kuijer JPA, Ossenkoppele R, Benedictus MR, Möller C, Verfaillie SCJ, Sanz-Arigita EJ, van Berckel BNM, van der Flier WM, Scheltens P, Barkhof F, and Wink AM (2016). Application of Machine Learning to Arterial Spin Labeling in Mild Cognitive Impairment and Alzheimer Disease. Radiology, 281(3):865–875. [DOI] [PubMed] [Google Scholar]
- [15].Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT, Albert MS, and Killiany RJ (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968–980. [DOI] [PubMed] [Google Scholar]
- [16].Edland SD, Xu Y, Plevak M, O’Brien P, Tangalos EG, Petersen RC, and Jack CR (2002). Total intracranial volume: Normative values and lack of association with Alzheimer’s disease. Neurology. [DOI] [PubMed] [Google Scholar]
- [17].Fischl B (2004). Automatically Parcellating the Human Cerebral Cortex. Cerebral Cortex, 14(1):11–22. [DOI] [PubMed] [Google Scholar]
- [18].Fischl B, Salat DH, Busa E, Albert M, Dieterich M, Haselgrove C, Kouwe AVD, Killiany R, Kennedy D, Klaveness S, Montillo A, Makris N, Rosen B, Dale AM, and van der Kouwe A (2002). Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron, 33(3):341–55. [DOI] [PubMed] [Google Scholar]
- [19].Fortin JP, Cullen N, Sheline YI, Taylor WD, Aselcioglu I, Cook PA, Adams P, Cooper C, Fava M, McGrath PJ, McInnis M, Phillips ML, Trivedi MH, Weissman MM, and Shinohara RT (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167:104–120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Fortin JP, Labbe A, Lemire M, Zanke BW, Hudson TJ, Fertig EJ, Greenwood CM, and Hansen KD (2014). Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biology, 15(11). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Fortin J-P, Parker D, Tunc B, Watanabe T, Elliott MA, Ruparel, Roalf DR, Satterthwaite TD, Gur RC, Gur RE, Schultz RT, Verma R, and Shinohara RT (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161:149–170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Fortin JP, Sweeney EM, Muschelli J, Crainiceanu CM, and Shinohara RT (2016). Removing inter-subject technical variability in magnetic resonance imaging studies. NeuroImage, 132:198–212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Freeborough PA, Fox NC, and Kitney RI (1997). Interactive algorithms for the segmentation and quantitation of 3-D MRI brain scans. Computer Methods and Programs in Biomedicine, 53(1):15–25. [DOI] [PubMed] [Google Scholar]
- [24].Frisoni GB and Jack CR (2015). HarP: The EADC-ADNI Harmonized Protocol for manual hippocampal segmentation. A standard of reference from a global working group. Alzheimer’s & Dementia, 11(2):107–110. [DOI] [PubMed] [Google Scholar]
- [25].Gagnon-Bartsch JA and Speed TP (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13(3):539–552. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Giavarina D (2015). Understanding Bland Altman analysis. Biochemia medica, 25(2):141–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Gur RC, Mozley PD, Resnick SM, Gottlieb GL, Kohn M, Zimmerman R, Herman G, Atlas S, Grossman R, and Berretta D (1991). Gender differences in age effect on brain atrophy measured by magnetic resonance imaging. Proceedings of the National Academy of Sciences of the United States of America, 88(7):2845–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Hansen TI, Brezova V, Eikenes L, Håberg A, and Vangberg TR (2015). How does the accuracy of intracranial volume measurements affect normalized brain volumes? Sample size estimates based on 966 subjects from the HUNT MRI cohort. American Journal of Neuroradiology, 36(8):1450–1456. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Heckemann RA, Keihaninejad S, Aljabar P, Gray KR, Nielsen C, Rueckert D, Hajnal JV, and Hammers A (2011). Automatic morphometry in Alzheimer’s disease and mild cognitive impairment. NeuroImage, 56(4):2024–2037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Heckemann RA, Ledig C, Gray KR, Aljabar P, Rueckert D, Hajnal JV, and Hammers A (2015). Brain extraction using label propagation and group agreement: Pincram. PLoS ONE, 10(7):e0129211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Heinen R, Bouvy WH, Mendrik AM, Viergever MA, Biessels GJ, and de Bresser J (2016). Robustness of Automated Methods for Brain Volume Measurements across Different MRI Field Strengths. PLOS ONE, 11(10):e0165719. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [32].Huo Y, Asman AJ, Plassard AJ, and Landman BA (2017). Simultaneous total intracranial volume and posterior fossa volume estimation using multi-atlas label fusion. Human Brain Mapping, 38(2):599–616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [33].Ingalhalikar M, Smith A, Parker D, Satterthwaite TD, Elliott MA, Ruparel K, Hakonarson H, Gur RE, Gur RC, and Verma R (2014). Sex differences in the structural connectome of the human brain. Proceedings of the National Academy of Sciences, 111(2):823–828. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Jack CR, Bernstein MA, Fox NC, Thompson P, Alexander G, Harvey D, Borowski B, Britson PJ, Whitwell JL, Ward C, Dale AM, Felmlee JP, Gunter JL, Hill DLG, Killiany R, Schuff N, Fox-Bosetti S, Lin C, Studholme C, DeCarli CS, Krueger G, Ward HA, Metzger GJ, Scott KT, Mallozzi R, Blezek D, Levy J, Debbins JP, Fleisher AS, Albert M, Green R, Bartzokis G, Glover G, Mugler J, and Weiner MW (2008). The Alzheimer’s Disease Neuroimaging Initiative (ADNI): MRI methods. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].Jahanshad N, Kochunov PV, Sprooten E, Mandl RC, Nichols TE, Almasy L, Blangero J, Brouwer RM, Curran JE, de Zubicaray GI, Duggirala R, Fox PT, Hong LE, Landman BA, Martin NG, McMahon KL, Medland SE, Mitchell BD, Olvera RL, Peterson CP, Starr JM, Sussmann JE, Toga AW, Wardlaw JM, Wright MJ, Hulshoff Pol HE, Bastin ME, McIntosh AM, Deary IJ, Thompson PM, and Glahn DC (2013). Multi-site genetic analysis of diffusion images and voxelwise heritability analysis: A pilot project of the ENIGMA-DTI working group. NeuroImage, 81:455–469. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Jenkins R, Fox NC, Rossor AM, Harvey RJ, and Rossor MN (2014). Intracranial Volume and Alzheimer Disease. Archives of neurology, 57(2):220–224. [DOI] [PubMed] [Google Scholar]
- [37].Jernigan TL, Zatz LM, Moses JA, and Berger PA (1982). Computed Tomography in Schizophrenics and Normal Volunteers: I. Fluid Volume. Archives of General Psychiatry, 39(7):765–770. [DOI] [PubMed] [Google Scholar]
- [38].Johnson WE, Li C, and Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127. [DOI] [PubMed] [Google Scholar]
- [39].Jovicich J, Czanner S, Greve D, Haley E, Van Der Kouwe A, Gollub R, Kennedy D, Schmitt F, Brown G, MacFall J, Fischl B, and Dale A (2006). Reliability in multi-site structural MRI studies: Effects of gradient non-linearity correction on phantom and human data. NeuroImage. [DOI] [PubMed] [Google Scholar]
- [40].Jovicich J, Czanner S, Han X, Salat D, van der Kouwe A, Quinn B, Pacheco J, Albert M, Killiany R, Blacker D, Maguire P, Rosas D, Makris N, Gollub R, Dale A, Dickerson BC, and Fischl B (2009). MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: Reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. NeuroImage, 46(1):177–192. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Keihaninejad S, Heckemann RA, Fagiolo G, Symms MR, Hajnal JV, and Hammers A (2010). A robust method to estimate the intracranial volume across MRI field strengths (1.5T and 3T). NeuroImage, 50(4):1427–1437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Klein A and Tourville J (2012). 101 labeled brain images and a consistent human cortical labeling protocol. Frontiers in Neuroscience, 6:171. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Kochunov P, Jahanshad N, Marcus D, Winkler A, Sprooten E, Nichols TE, Wright SN, Hong LE, Patel B, Behrens T, Jbabdi S, Andersson J, Lenglet C, Yacoub E, Moeller S, Auerbach E, Ugurbil K, Sotiropoulos SN, Brouwer RM, Landman B, Lemaitre H, den Braber A, Zwiers MP, Ritchie S, van Hulzen K, Almasy L, Curran J, DeZubicaray GI, Duggirala R, Fox P, Martin NG, McMahon KL, Mitchell B, Olvera RL, Peterson C, Starr J, Sussmann J, Wardlaw J, Wright M, Boomsma DI, Kahn R, de Geus EJ, Williamson DE, Hariri A, van ‘t Ent D, Bastin ME, McIntosh A, Deary IJ, Hulshoff pol HE, Blangero J, Thompson PM, Glahn DC, and Van Essen DC (2015). Heritability of fractional anisotropy in human white matter: A comparison of Human Connectome Project and ENIGMA-DTI data. NeuroImage, 111:300–301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Kullback S (1997). Information theory and statistics. Courier Corporation. [Google Scholar]
- [45].La Joie R, Perrotin A, Barre L, Hommet C, Mezenge F, Ibazizene M, Camus V, Abbas A, Landeau B, Guilloteau D, de La Sayette V, Eustache F, Desgranges B, and Chetelat G (2012). Region-Specific Hierarchy between Atrophy, Hypometabolism, and β-Amyloid (Aβ) Load in Alzheimer’s Disease Dementia. Journal of Neuroscience, 32(46):16265–16273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [46].Landman B and Warfield S (2012). Miccai 2012 multi-atlas labeling challenge. In MICCAI 2012 Workshop on Multi-Atlas Labeling, pages 1–164. [Google Scholar]
- [47].Lee H, Nakamura K, Narayanan S, Brown RA, and Arnold DL (2018). Estimating and accounting for the effect of MRI scanner changes on longitudinal whole-brain volume change measurements. NeuroImage, 184:555–565. [DOI] [PubMed] [Google Scholar]
- [48].Leek JT and Storey JD (2007). Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLoS Genetics, 3(9):e161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Lenoski B, Baxter LC, Karam LJ, Maisog J, and Debbins J (2008). On the performance of autocorrelation estimation algorithms for fMRI analysis. IEEE Journal on Selected Topics in Signal Processing, 2(6):828–838. [Google Scholar]
- [50].Li X, Pu F, Fan Y, Niu H, Li S, and Li D (2013). Age-related changes in brain structural covariance networks. Frontiers in human neuroscience, 7:98. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [51].Macdonald KE, Leung KK, Bartlett JW, Blair M, Malone IB, Barnes J, Ourselin S, Fox NC, Alzheimer’s Disease Neuroimaging Initiative, et al. (2014). Automated template-based hippocampal segmentations from MRI: the effects of 1.5T or 3T field strength on accuracy. Neuroinformatics, 12(3):405–412. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [52].Malone IB, Leung KK, Clegg S, Barnes J, Whitwell JL, Ashburner J, Fox NC, and Ridgway GR (2015). Accurate automatic estimation of total intracranial volume: A nuisance variable with less nuisance. NeuroImage, 104:366–372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [53].Manjón JV, Eskildsen SF, Coupé P, Romero JE, Collins DL, and Robles M (2014). Nonlocal intracranial cavity extraction. International journal of biomedical imaging, 2014:820205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [54].Marcus DS, Fotenos AF, Csernansky JG, Morris JC, and Buckner RL (2010). Open Access Series of Imaging Studies: Longitudinal MRI Data in Nondemented and Demented Older Adults. Journal of Cognitive Neuroscience, 22(12):2677–2684. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [55].Marcus DS, Wang TH, Parker J, Csernansky JG, Morris JC, and Buckner RL (2007). Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults. Journal of Cognitive Neuroscience, 19(9):1498–1507. [DOI] [PubMed] [Google Scholar]
- [56].Massey FJ (1951). The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association, 46(253):68–78. [Google Scholar]
- [57].Mirzaalian H, Ning L, Savadjiev P, Pasternak O, Bouix S, Michailovich O, Grant G, Marx C, Morey R, Flashman L, George M, McAllister T, Andaluz N, Shutter L, Coimbra R, Zafonte R, Coleman M, Kubicki M, Westin C, Stein M, Shenton M, and Rathi Y (2016). Inter-site and inter-scanner diffusion MRI data harmonization. NeuroImage, 135:311–323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [58].Mirzaalian H, Ning L, Savadjiev P, Pasternak O, Bouix S, Michailovich O, Karmacharya S, Grant G, Marx CE, Morey RA, Flashman LA, George MS, McAllister TW, Andaluz N, Shutter L, Coimbra R, Zafonte RD, Coleman MJ, Kubicki M, Westin CF, Stein MB, Shenton ME, and Rathi Y (2018). Multi-site harmonization of diffusion MRI data in a registration framework. Brain Imaging and Behavior, 12(1):284–295. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [59].Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, Trojanowski JQ, Toga AW, and Beckett L (2005a). The alzheimer’s disease neuroimaging initiative. Neuroimaging Clinics, 15(4):869–877. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [60].Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack CR, Jagust W, Trojanowski JQ, Toga AW, and Beckett L (2005b). Ways toward an early diagnosis in Alzheimer’s disease: The Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & Dementia, 1(1):55–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [61].Myles PS and Cui J (2007). I. Using the Bland-Altman method to measure agreement with repeated measures. British Journal of Anaesthesia, 99(3):309–311. [DOI] [PubMed] [Google Scholar]
- [62].Nestor SM, Rupsingh R, Borrie M, Smith M, Accomazzi V, Wells JL, Fogarty J, and Bartha R (2008). Ventricular enlargement as a possible measure of Alzheimer’s disease progression validated using the Alzheimer’s disease neuroimaging initiative database. Brain, 131(9):2443–2454. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [63].Nordenskjöld R, Malmberg F, Larsson EM, Simmons A, Brooks SJ, Lind L, Ahlström H, Johansson L, and Kullberg J (2013). Intracranial volume estimated with commonly used methods could introduce bias in studies including brain volume measurements. NeuroImage, 83:355–360. [DOI] [PubMed] [Google Scholar]
- [64].O’Brien LM, Ziegler DA, Deutsch CK, Frazier JA, Herbert MR, and Locascio JJ (2011). Statistical adjustments for brain size in volumetric neuroimaging studies: Some practical implications in methods. Psychiatry Research: Neuroimaging, 193(2):113–122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [65].O’brien LM, Ziegler DA, Deutsch CK, Kennedy DN, Goldstein JM, Seidman LJ, Hodge S, Makris N, Caviness V, Frazier JA, et al. (2006). Adjustment for whole brain and cranial size in volumetric brain studies: a review of common adjustment factors and statistical methods. Harvard review of psychiatry, 14(3):141–151. [DOI] [PubMed] [Google Scholar]
- [66].O’brien PC and Dyck PJ (1995). Procedures for setting normal values. Neurology, 45(1):17–23. [DOI] [PubMed] [Google Scholar]
- [67].Ott BR, Cohen RA, Gongvatana A, Okonkwo OC, Johanson CE, Stopa EG, Donahue JE, Silverberg GD, and Alzheimer’s Disease Neuroimaging Initiative (2010). Brain ventricular volume and cerebrospinal fluid biomarkers of Alzheimer’s disease. Journal of Alzheimers disease JAD, 20(2):647–657. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [68].Pengas G, Pereira JMS, Williams GB, and Nestor PJ (2009). Comparative Reliability of Total Intracranial Volume Estimation Methods and the Influence of Atrophy in a Longitudinal Semantic Dementia Cohort. Journal of Neuroimaging, 19(1):37–46. [DOI] [PubMed] [Google Scholar]
- [69].Perlaki G, Orsi G, Plozer E, Altbacker A, Darnai G, Nagy SA, Horvath R, Toth A, Doczi T, Kovacs N, Bogner P, Schwarcz A, and Janszky J (2014). Are there any gender differences in the hippocampus volume after head-size correction? A volumetric and voxel-based morphometric study. Neuroscience Letters, 570:119–123. [DOI] [PubMed] [Google Scholar]
- [70].Potvin O, Dieumegarde L, and Duchesne S (2017). Freesurfer cortical normative data for adults using Desikan-Killiany-Tourville and ex vivo protocols. NeuroImage, 156:43–64. [DOI] [PubMed] [Google Scholar]
- [71].Potvin O, Mouiha A, Dieumegarde L, and Duchesne S (2016). Normative data for subcortical regional volumes over the lifetime of the adult human brain. NeuroImage, 137:9–20. [DOI] [PubMed] [Google Scholar]
- [72].Rathore S, Habes M, Iftikhar MA, Shacklett A, and Davatzikos C (2017). A review on neuroimaging-based classification studies and associated feature extraction methods for alzheimer’s disease and its prodromal stages. NeuroImage, 155:530–548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [73].Reuter M and Fischl B (2011). Avoiding asymmetry-induced bias in longitudinal image processing. Neuroimage, 57(1):19–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [74].Reuter M, Rosas HD, and Fischl B (2010). Highly accurate inverse consistent registration: A robust approach. NeuroImage, 53(4):1181–1196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [75].Reuter M, Schmansky NJ, Rosas HD, and Fischl B (2012). Within-subject template estimation for unbiased longitudinal image analysis. NeuroImage, 61(4):1402–1418. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [76].Ritchie SJ, Cox SR, Shen X, Lombardo MV, Reus LM, Alloza C, Harris MA, Alderson HL, Hunter S, Neilson E, Liewald DCM, Auyeung B, Whalley HC, Lawrie SM, Gale CR, Bastin ME, McIntosh AM, and Deary IJ (2018). Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cerebral Cortex, 28(8):2959–2975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [77].Roy S, Butman JA, and Pham DL (2017). Robust skull stripping using multiple MR image contrasts insensitive to pathology. NeuroImage, 146:132–147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [78].Rozycki M, Satterthwaite TD, Koutsouleris N, Erus G, Doshi J, Wolf DH, Fan Y, Gur RE, Gur RC, Meisenzahl EM, Zhuo C, Yin H, Yan H, Yue W, Zhang D, and Davatzikos C (2018). Multisite Machine Learning Analysis Provides a Robust Structural Imaging Signature of Schizophrenia Detectable Across Diverse Patient Populations and Within Individuals. Schizophrenia Bulletin, 44(5):1035–1044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [79].Sanfilipo MP, Benedict RH, Zivadinov R, and Bakshi R (2004). Correction for intracranial volume in analysis of whole brain atrophy in multiple sclerosis: the proportion vs. residual method. NeuroImage, 22(4):1732–1743. [DOI] [PubMed] [Google Scholar]
- [80].Sargolzaei S, Goryawala M, Cabrerizo M, Chen G, Jayakar P, Duara R, Barker W, and Adjouadi M (2014). Comparative reliability analysis of publicly available software packages for automatic intracranial volume estimation In Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference, volume 2014, pages 2342–2345. [DOI] [PubMed] [Google Scholar]
- [81].Sargolzaei S, Sargolzaei A, Cabrerizo M, Chen G, Goryawala M, Noei S, Zhou Q, Duara R, Barker W, and Adjouadi M (2015a). A practical guideline for intracranial volume estimation in patients with Alzheimer’s disease. BMC Bioinformatics, 16(Suppl 7):S8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [82].Sargolzaei S, Sargolzaei A, Cabrerizo M, Chen G, Goryawala M, Pinzon-Ardila A, Gonzalez-Arias SM, and Adjouadi M (2015b). Estimating Intracranial Volume in Brain Research: An Evaluation of Methods. Neuroinformatics, 13(4):427–441. [DOI] [PubMed] [Google Scholar]
- [83].Scahill RI, Frost C, Jenkins R, Whitwell JL, Rossor MN, and Fox NC (2003). A longitudinal study of brain volume changes in normal aging using serial registered magnetic resonance imaging. Archives of Neurology. [DOI] [PubMed] [Google Scholar]
- [84].Schaerer J, Belaroussi B, Bonnand F, Roche F, Bracoud L, Yu HJ, and Pachai C (2012). Accurate intracranial cavity volume estimation using multiatlas segmentation. Alzheimer’s and Dementia, 8(4):P272. [Google Scholar]
- [85].Scholz FW and Stephens MA (1987). K-sample Anderson-Darling tests. Journal of the American Statistical Association, 82(399):918–924. [Google Scholar]
- [86].Shinohara R, Oh J, Nair G, Calabresi P, Davatzikos C, Doshi J, Henry R, Kim G, Linn K, Papinutto N, Pelletier D, Pham D, Reich D, Rooney W, Roy S, Stern W, Tummala S, Yousuf F, Zhu A, Sicotte N, and Bakshi R (2017). Volumetric Analysis from a Harmonized Multisite Brain MRI Study of a Single Subject with Multiple Sclerosis. American Journal of Neuroradiology, 38(8):1501–1509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [87].Sled JG, Zijdenbos AP, and Evans AC (1998). A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE transactions on medical imaging, 17(1):87–97. [DOI] [PubMed] [Google Scholar]
- [88].Ashburner J and Friston KJ (2005). Unified segmentation. NeuroImage, 26(3):839–51. [DOI] [PubMed] [Google Scholar]
- [89].Takao H, Hayashi N, and Ohtomo K (2012). A longitudinal study of brain volume changes in normal aging. European journal of radiology, 81(10):2801–2804. [DOI] [PubMed] [Google Scholar]
- [90].Taki Y, Thyreau B, Kinomura S, Sato K, Goto R, Wu K, Kawashima R, and Fukuda H (2013). A longitudinal study of age- and gender-related annual rate of volume changes in regional gray matter in healthy adults. Human Brain Mapping, 34(9):2292–2301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [91].Thompson PM, Dennis EL, Gutman BA, Hibar DP, Jahanshad N, Kelly S, Stein JL, Whelan CD, Andreassen OA, Andreassen OA, Arias-Vasquez A, Bearden CE, Bearden CE, Bearden CE, Boedhoe PS, van den Heuvel OL, Veltman DJ, Brouwer RM, de Reus MA, Pol HE, van den Heuvel MP, Buckner RL, Buitelaar JK, Fisher SE, Francks C, Franke B, Hoogman M, van Rooij D, Buitelaar JK, Bulayeva KB, Cannon DM, Cannon DM, McDonald C, Cohen RA, Conrod PJ, Dale AM, Holland D, Thompson PM, Deary IJ, Wardlaw JM, Desrivieres S, Schumann G, Dima D, Dima D, Frangou S, Donohoe G, Fisher SE, Francks C, Guadalupe T, Fouche JP, Stein DJ, Franke B, Hoogman M, Franke B, Ganjgahi H, Garavan H, Glahn DC, Glahn DC, Grabe HJ, Grabe HJ, Guadalupe T, Hashimoto R, Hosten N, Kochunov P, Kremen WS, Lee PH, Lee PH, Lee PH, Mackey S, Mazoyer B, Martin NG, Medland SE, Morey RA, Nichols TE, Nichols TE, Paus T, Paus T, Paus T, Pausova Z, Pausova Z, Shen L, Shen L, Sisodiya SM, Smit DJ, Smoller JW, Stein DJ, Stein JL, Toro R, Turner JA, Boedhoe PS, Schmaal L, van den Heuvel OL, Veltman DJ, Boedhoe PS, Schmaal L, Smit DJ, van den Heuvel OL, Veltman DJ, van Erp TG, Walter H, Wang Y, Wardlaw JM, Wardlaw JM, Wright MJ, Ye J, and Ye J (2017). ENIGMA and the individual: Predicting factors that affect the brain in 35 countries worldwide. NeuroImage, 145:389–408. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [92].Trune DR, Mitchell C, and Phillips DS (1988). The relative importance of head size, gender and age on the auditory brainstem response. Hearing Research, 32(2–3):165–174.
- [93].Vagberg M, Ambarki K, Lindqvist T, Birgander R, and Svenningsson A (2016). Brain parenchymal fraction in an age-stratified healthy population determined by MRI using manual segmentation and three automated segmentation methods. Journal of Neuroradiology, 43(6):384–391.
- [94].Voevodskaya O (2014). The effects of intracranial volume adjustment approaches on multiple regional MRI volumes in healthy aging and Alzheimer’s disease. Frontiers in Aging Neuroscience, 6(OCT):264.
- [95].Wang L, Beg F, Ratnanather T, Ceritoglu C, Younes L, Morris JC, Csernansky JG, and Miller MI (2007). Large deformation diffeomorphism and momentum based hippocampal shape discrimination in dementia of the Alzheimer type. IEEE Transactions on Medical Imaging, 26(4):462–470.
- [96].Weiner MW (2008). Expanding ventricles may detect preclinical Alzheimer disease. Neurology, 70(11):824–825.
- [97].Weiner MW, Veitch DP, Aisen PS, Beckett LA, Cairns NJ, Green RC, Harvey D, Jack CR, Jagust W, Liu E, Morris JC, Petersen RC, Saykin AJ, Schmidt ME, Shaw L, Shen L, Siuciak JA, Soares H, Toga AW, and Trojanowski JQ (2013). The Alzheimer’s Disease Neuroimaging Initiative: A review of papers published since its inception. Alzheimer’s & Dementia, 9(5):e111–e194.
- [98].Whitwell JL, Crum WR, Watt HC, and Fox NC (2001). Normalization of cerebral volumes by use of intracranial volume: implications for longitudinal quantitative MR imaging. AJNR. American Journal of Neuroradiology, 22(8):1483–1489.
- [99].Wolf H, Kruggel F, Hensel A, Wahlund LO, Arendt T, and Gertz HJ (2003). The relationship between head size and intracranial volume in elderly subjects. Brain Research, 973(1):74–80.
- [100].Wyman BT, Harvey DJ, Crawford K, Bernstein MA, Carmichael O, Cole PE, Crane PK, Decarli C, Fox NC, Gunter JL, Hill D, Killiany RJ, Pachai C, Schwarz AJ, Schuff N, Senjem ML, Suhy J, Thompson PM, Weiner M, and Jack CR (2013). Standardization of analysis sets for reporting results from ADNI MRI data. Alzheimer’s & Dementia, 9(3):332–337.
- [101].Xu Z, Shen X, and Pan W (2014). Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS ONE, 9(8):e102312.
- [102].Yu M, Linn KA, Cook PA, Phillips ML, McInnis M, Fava M, Trivedi MH, Weissman MM, Shinohara RT, and Sheline YI (2018). Statistical harmonization corrects site effects in functional connectivity measurements from multi-site fMRI data. Human Brain Mapping, 0(0):1–15.
- [103].Zhang Y, Brady M, and Smith S (2001). Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57.