Abstract
Background-
Neuroimaging data from large epidemiologic cohort studies often come from multiple scanners. The variations of MRI measurements due to differences in magnetic field strength, image acquisition protocols, and scanner vendors can influence the interpretation of aggregated data. The purpose of the present study was to compare methods that meta-analyze findings from a small number of different MRI scanners.
Methods-
We proposed a bootstrap resampling method using individual participant data and compared it with two common random effects meta-analysis methods, DerSimonian-Laird and Hartung-Knapp, and a conventional pooling method that combines MRI data from different scanners. We first performed simulations to compare the power and coverage probabilities of the four methods in the absence and presence of scanner effects on measurements. We then examined the association of age with white matter hyperintensity (WMH) volumes from 787 participants.
Results-
In simulations, the bootstrap approach performed better than the other three methods in terms of coverage probability and power when scanner differences were present. When scanner differences were absent, the bootstrap approach performed comparably to pooling, the optimal approach in that setting. In the analysis of age and WMH volume, age was significantly associated with WMH volume using the bootstrap approach, pooling, and the DerSimonian-Laird method (all p < 0.0001), but not using the Hartung-Knapp method (p = 0.1439).
Conclusion-
The bootstrap approach using individual participant data is suitable for integrating outcomes from multiple MRI scanners regardless of absence or presence of scanner effects on measurements.
Keywords: white matter hyperintensity, multiple MRI scanners, meta-analysis, bootstrap resampling
1. Introduction
Access to multiple magnetic resonance imaging (MRI) facilities is indispensable when recruiting volunteers for research scans in large community-based studies in order to reduce participant burden. However, pooling data from multiple MRI scanners has potential limitations because MRI measurements are sensitive to vendor, magnetic field strength, and image acquisition parameters (Takao et al., 2011, Takao et al., 2013, Barrio-Arranz et al., 2015).
White matter hyperintensities (WMHs), changes in white matter as seen on T2-weighted MRI scans, are thought to reflect vascular disease (Wardlaw et al., 2015, Chao et al., 2013, Dickie et al., 2016) and Alzheimer’s disease (AD) risk (Brickman et al., 2009, Alosco et al., 2018, Kandel et al., 2016, Arfanakis et al., 2020). Total WMH or lesion volume is reported to vary by image acquisition parameters, magnetic field strength, vendors, and automatic segmentation tools (Guo et al., 2019, Shinohara et al., 2017, Tudorascu et al., 2016). For this reason, while pooling data increases statistical power, the validity of pooling MRI measures acquired from different scanners and protocols has been questioned (Ashburner, 2009, Focke et al., 2011, Stonnington et al., 2008, Bauer et al., 2010, Marchewka et al., 2014).
Meta-analysis has been utilized for synthesizing results from multiple studies or multi-center studies, where the study effect is often treated as a random entity. Application of meta-analysis in biomedical research has increased dramatically over the past decade. A PubMed search with the keyword "meta-analysis" returned about 151,000 publications between 2008 and 2018, more than five times the 27,000 publications from the previous decade (1998–2008). Random effects meta-analysis has been employed for neuroimaging studies (Costafreda, 2009, Berlingeri et al., 2019, Tench et al., 2017, Fusar-Poli et al., 2009, Fusar-Poli, 2012). However, random effects meta-analysis that incorporates variance across scanners/centers in a population might be overly conservative when inferences are about the average effect with data from a small number of MRI scanners.
In order to measure the effects of interest without undue variation that comes from combining data across scanners, we propose a meta-analysis using bootstrap resampling with individual participant data within each scanner, where the variance of combined summary statistics across scanners is estimated by a bootstrap method. Here we compared this approach to two common random effects meta-analysis methods, DerSimonian and Laird (DerSimonian and Laird, 1986) and Hartung and Knapp (Hartung and Knapp, 2001b, Hartung and Knapp, 2001a), and a pooling method that combines individual-level data from different MRI scanners. We simulated data under scenarios with and without scanner effects on measurements and compared the coverage probability and power of the four methods. In addition, the performance of all four methods was compared using WMH volumes from 787 participants from four ongoing cohort studies of aging and dementia (Bennett et al., 2018, Barnes et al., 2012, Barnes et al., 2015, Schneider et al., 2009) at the Rush Alzheimer’s Disease Center to replicate the well-established association of age with WMH (Gunning-Dixon et al., 2009).
2. Methods
2.1. Meta-analysis
We analyzed the association of WMH volume with age after controlling for sex, race, clinical diagnosis, and global cognition at each scanner. The association of age with WMH volume at each scanner is represented by the estimated regression coefficient of age ($\hat{\beta}_i$, i=1, …, K, where K is the number of scanners) in a multiple linear regression on the WMH volumes. In random effects meta-analysis, the association of age estimated from the data obtained from the ith MRI scanner, denoted by $\beta_i$, is considered a random perturbation from the population-level effect ($\beta$), so that $\beta_i = \beta + \varepsilon_i$. A synthesized estimate is computed as a weighted sum of the individual estimates $\hat{\beta}_i$ (i=1, …, K):
$$\hat{\beta} = \frac{\sum_{i=1}^{K} w_i \hat{\beta}_i}{\sum_{i=1}^{K} w_i} \qquad (1)$$
where $w_i = 1/(\hat{\sigma}_i^2 + \hat{\tau}^2)$ is the weight for the ith scanner, $\hat{\sigma}_i^2$ is the sampling variance of $\hat{\beta}_i$ estimated from the regression model at the ith scanner, and $\hat{\tau}^2$ is the between-scanner variance across the K estimated coefficients.
Estimation of the variance of the weighted sum, $\mathrm{Var}(\hat{\beta})$, differs by method. We investigated two commonly applied meta-analytic approaches: DerSimonian-Laird (DerSimonian and Laird, 1986) and Hartung-Knapp (Hartung and Knapp, 2001b, Hartung and Knapp, 2001a). In DerSimonian-Laird, the variance of the weighted sum is $1/\sum_{i=1}^{K} w_i$, whereas Hartung-Knapp adopts a more empirical approach, a weighted extension of the empirical variance of the $\hat{\beta}_i$, $\sum_{i=1}^{K} w_i (\hat{\beta}_i - \hat{\beta})^2 / \{(K-1) \sum_{i=1}^{K} w_i\}$ (Hartung and Knapp, 2001a, Hartung and Knapp, 2001b). The Hartung-Knapp approach is often preferred because its coverage probability is closer to the nominal level than that of the DerSimonian-Laird approach (Partlett and Riley, 2017, Guolo and Varin, 2017). Additionally, while the DerSimonian-Laird approach uses a normal distribution for tests, the Hartung-Knapp approach uses a t-distribution with degrees of freedom (df) equal to one less than the number of studies. When the number of studies is small, the t-test can be conservative (Higgins et al., 2008, Guolo and Varin, 2017).
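The two random effects variance estimators can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the DerSimonian-Laird moment estimator for the between-scanner variance, uses weights $w_i = 1/(\hat{\sigma}_i^2 + \hat{\tau}^2)$ for both methods, and takes the rounded per-scanner coefficients and standard errors from Table 7 as inputs. The function names are hypothetical.

```python
def dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate with the DL moment estimator of tau^2.

    Returns (pooled estimate, tau^2, random-effects weights)."""
    k = len(estimates)
    w_fixed = [1.0 / v for v in variances]
    sw = sum(w_fixed)
    beta_fixed = sum(w * b for w, b in zip(w_fixed, estimates)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(w_fixed, estimates))
    c = sw - sum(w ** 2 for w in w_fixed) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)  # truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]
    beta_re = sum(w * b for w, b in zip(w_re, estimates)) / sum(w_re)
    return beta_re, tau2, w_re


def variance_dl(w_re):
    """DerSimonian-Laird variance of the weighted sum: 1 / sum of weights."""
    return 1.0 / sum(w_re)


def variance_hk(estimates, w_re, beta_re):
    """Hartung-Knapp variance: weighted empirical variance of the estimates."""
    k = len(estimates)
    weighted_ss = sum(w * (b - beta_re) ** 2 for w, b in zip(w_re, estimates))
    return weighted_ss / ((k - 1) * sum(w_re))


# Rounded per-scanner age coefficients and standard errors from Table 7.
betas = [0.0198, 0.0106]
variances = [0.0029 ** 2, 0.0022 ** 2]
beta_re, tau2, w_re = dersimonian_laird(betas, variances)
se_dl = variance_dl(w_re) ** 0.5
se_hk = variance_hk(betas, w_re, beta_re) ** 0.5
```

Because the inputs are rounded and the software used in the study may estimate the between-scanner variance differently, these values need not reproduce Table 7 exactly. The point of the sketch is that both methods share the same point estimate and differ only in the variance formula and in the reference distribution used for testing.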
2.2. Bootstrap approach with individual participant data
In this paper, we propose a bootstrap approach to deal with scanner effects when meta-analyzing data. In contrast to the random effects meta-analysis in Eq. (1), we did not incorporate the variance across scanners into the weight terms because we performed inferences on the average effect from the observed data. Therefore, we set the weights as $w_i = 1/\hat{\sigma}_i^2$, where $\hat{\sigma}_i^2$ is the estimated sampling variance of the age coefficient $\hat{\beta}_i$ at the ith scanner, and the weighted sum in this approach is given as
$$\hat{\beta} = \frac{\sum_{i=1}^{K} w_i \hat{\beta}_i}{\sum_{i=1}^{K} w_i}, \quad w_i = \frac{1}{\hat{\sigma}_i^2} \qquad (2)$$
This formulation is also referred to as common effect meta-analysis. In this study, we adopt an empirical approach for estimating the variance of the weighted sum in Eq. (2). Common effect meta-analysis adopts a theoretical variance based on the unrealistic assumption that the weights (the inverses of the variances) are known prior to the study. It has been reported that the theoretical variance underestimates the true variance (Li et al., 1994, Hedges and Olkin, 1985, Domínguez Islas and Rice, 2018). Our method requires individual participant data to estimate the variance of the weighted sum in Eq. (2). We estimate the variance by employing bootstrap resampling (Efron, 1979).
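As a small worked illustration of the common effect weighted sum in Eq. (2), the sketch below computes the inverse-variance weighted estimate and the theoretical variance $1/\sum_i w_i$ that the bootstrap is meant to replace. The function name is hypothetical, and the inputs are the rounded per-scanner values reported in Table 7.

```python
import math


def common_effect(estimates, std_errors):
    """Inverse-variance weighted sum and its theoretical standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, estimates)) / total
    se_theoretical = math.sqrt(1.0 / total)  # treats the weights as known
    return beta, se_theoretical


# Rounded age coefficients and standard errors for Scanners A and B (Table 7).
beta, se_theoretical = common_effect([0.0198, 0.0106], [0.0029, 0.0022])
```

With these rounded inputs the weighted sum is about 0.0140, close to the estimate reported for the bootstrap approach in Table 7. The theoretical standard error is shown only for comparison; the proposed method replaces it with a bootstrap estimate.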
Our procedure consists of three steps, as specified below. We denote the data set for each MRI scanner as $D_i$ (i=1, …, K), and the jth bootstrap replicate of the data as $D_i^{*j}$. We repeat Steps 1 and 2 below M times (j=1, …, M).
Step 1: Within-scanner estimation
Both the coefficient and its sampling variance, ($\hat{\beta}_i^{*j}$, $(\hat{\sigma}_i^{*j})^2$), are estimated from $D_i^{*j}$ (i=1, …, K).
Step 2: Integration of estimates across scanners
The weighted sum, $\hat{\beta}^{*j}$, with the jth bootstrap replicate of the data ($D_i^{*j}$: i=1, …, K) is given as
$$\hat{\beta}^{*j} = \frac{\sum_{i=1}^{K} w_i^{*j} \hat{\beta}_i^{*j}}{\sum_{i=1}^{K} w_i^{*j}} \qquad (3)$$
where the weights are given as $w_i^{*j} = 1/(\hat{\sigma}_i^{*j})^2$.
Step 3: Construction of the confidence interval
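Finally, we construct the confidence interval by using quantiles from the normal distribution, as suggested in DiCiccio and Efron (1996):
$$\hat{\beta} \pm z_{1-\alpha/2} \sqrt{\hat{V}^{*}}$$
where $\hat{\beta}$ is the weighted sum in Eq. (2), $z_{1-\alpha/2}$ is the standard normal quantile, and $\hat{V}^{*}$ is the variance estimated with the bootstrap samples $\hat{\beta}^{*j}$ (j=1, …, M) in Eq. (3), i.e., $\hat{V}^{*} = \frac{1}{M-1} \sum_{j=1}^{M} (\hat{\beta}^{*j} - \bar{\beta}^{*})^2$ and $\bar{\beta}^{*} = \frac{1}{M} \sum_{j=1}^{M} \hat{\beta}^{*j}$. A diagram of the proposed bootstrap approach is shown in Figure 1. In Figure 2, the procedures of the bootstrap approach, the two random effects approaches, and pooling are compared. R code for implementing the proposed bootstrap algorithm is provided in Supplementary material 1.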
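The three steps can be sketched as follows. This is a simplified illustration and not the R code in Supplementary material 1: it assumes a single covariate per scanner (the study's models include several covariates), case resampling within each scanner, weights recomputed in every replicate, and an interval centered on the weighted sum from the observed data. All function names and the synthetic data are hypothetical.

```python
import random
import statistics


def fit_slope(x, y):
    """OLS slope of y on x and its estimated sampling variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    return slope, rss / (n - 2) / sxx


def weighted_sum(fits):
    """Inverse-variance weighted sum of per-scanner slopes (Eq. 2 and Eq. 3)."""
    weights = [1.0 / var for _, var in fits]
    return sum(w * b for w, (b, _) in zip(weights, fits)) / sum(weights)


def bootstrap_meta(scanners, n_boot=2000, z=1.959964, seed=0):
    """scanners: list of (x, y) pairs, one per scanner.

    Returns (estimate, bootstrap standard error, normal-quantile interval)."""
    rng = random.Random(seed)
    estimate = weighted_sum([fit_slope(x, y) for x, y in scanners])
    replicates = []
    for _ in range(n_boot):
        fits = []
        for x, y in scanners:
            # Step 1: resample participants with replacement within a scanner.
            idx = [rng.randrange(len(x)) for _ in x]
            fits.append(fit_slope([x[i] for i in idx], [y[i] for i in idx]))
        # Step 2: combine the per-scanner estimates for this replicate.
        replicates.append(weighted_sum(fits))
    # Step 3: bootstrap standard error (divisor M - 1) and normal interval.
    se = statistics.stdev(replicates)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Synthetic two-scanner example with different intercepts and noise levels.
data_rng = random.Random(1)


def make_scanner(n, intercept, sd):
    x = [data_rng.uniform(65, 95) for _ in range(n)]
    y = [intercept + 0.015 * xi + data_rng.gauss(0, sd) for xi in x]
    return x, y


scanners = [make_scanner(50, -2.0, 0.3), make_scanner(50, -1.5, 0.4)]
est, se, ci = bootstrap_meta(scanners, n_boot=500)
```

Resampling is done separately within each scanner, so scanner-specific intercepts and error variances never enter the combined variance estimate. That is the property that distinguishes this procedure from pooling.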
Figure 1. Diagram of the Bootstrap approach.

At each bootstrap resampling (j=1, …, M), the weighted sum, $\hat{\beta}^{*j}$, is estimated, and the variance of these weighted sums is obtained as the final outcome. The 95% CI is estimated using quantiles from the standard normal distribution.
Figure 2.

Procedures from the four methods.
2.3. Simulation design
We conducted simulations under two scenarios: presence and absence of scanner differences. The parameters varied in the simulations were the number of scanners and the number of participants at each scanner. The number of scanners (K) tested was small (2, 5); the number of participants (n) at each scanner ranged from small to moderate (30, 50, 100). Without scanner differences, individual participant data at each scanner were generated as $y_{il} = \alpha + \beta x_l + e_{il}$, with $e_{il} \sim N(0, \sigma^2)$ (i=1, …, K; l=1, …, n). In the presence of scanner effects on measurements, we incorporated a scanner-specific level and variance as $y_{il} = \alpha_i + \beta x_l + e_{il}$ with $e_{il} \sim N(0, \sigma_i^2)$, where $\alpha_i = \alpha + \delta_i$ and $\sigma_i^2$ is the scanner-specific error variance. The two random components were sampled as $\delta_i \sim U(-5, 5)$ for the level and from a second distribution with parameter 5 for the variance. The coefficient of determination was 0.09 on average per dataset in the absence of scanner effects and 0.018 in the presence of scanner effects.
We compared all four methods based on coverage probability and power. Coverage probability is defined as the probability that the confidence interval includes the true value, and is estimated as the proportion of confidence intervals in simulations that cover the pre-specified true value. With a (1-p)×100% confidence interval, we expect to cover the true value with probability (1-p), where p refers to the type I error rate when the null hypothesis is true. Coverage above the nominal level indicates an overly conservative interval and thus a higher type II error rate, and coverage below the nominal level indicates a higher type I error rate. The accuracy of a method is thus determined by its closeness to the target coverage probability. Power is the probability of rejecting the null hypothesis when an alternative hypothesis is true. Thus, higher power is preferred. Details of the simulation design are shown in Supplementary material 2.
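The simulation logic can be sketched as below for the pooling method only. This is an illustration under stated assumptions, not the design in Supplementary material 2: the covariate distribution, the values of $\alpha$, $\beta$, and $\sigma$, the form of the scanner-specific variance (here a uniform multiplier on the error SD), and the use of a normal quantile for the pooled interval are all hypothetical choices. Only the level shift $\delta_i \sim U(-5, 5)$ is taken from the text.

```python
import math
import random


def simulate_scanners(k, n, beta, rng, scanner_effects, alpha=0.0, sigma=1.0):
    """Generate k per-scanner datasets sharing the covariate values x_l."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed covariate distribution
    data = []
    for _ in range(k):
        if scanner_effects:
            delta = rng.uniform(-5.0, 5.0)      # level shift, as in the text
            sd = sigma * rng.uniform(1.0, 2.0)  # assumed form of the variance effect
        else:
            delta, sd = 0.0, sigma
        y = [alpha + delta + beta * xl + rng.gauss(0.0, sd) for xl in x]
        data.append((x, y))
    return data


def pooled_ci(data, z=1.959964):
    """Normal-quantile CI for the slope from OLS on the stacked data (pooling)."""
    x = [v for xs, _ in data for v in xs]
    y = [v for _, ys in data for v in ys]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope - z * se, slope + z * se


def coverage(k, n, beta, scanner_effects, n_sim=1000, seed=0):
    """Proportion of simulated intervals that contain the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        lo, hi = pooled_ci(simulate_scanners(k, n, beta, rng, scanner_effects))
        hits += lo <= beta <= hi
    return hits / n_sim


cov_no_effects = coverage(2, 50, 0.3, scanner_effects=False)
cov_with_effects = coverage(2, 50, 0.3, scanner_effects=True, n_sim=200)
```

Power can be estimated with the same loop by counting the intervals that exclude zero when the true slope is nonzero. Swapping `pooled_ci` for another interval function gives the corresponding comparison for the other methods.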
2.4. MRI Acquisition
Scanner A is a 3 Tesla Siemens Trio (Erlangen, Germany) MRI scanner. High-resolution T1-weighted anatomical data were obtained using a 3D MPRAGE sequence. T2–weighted FLAIR data were collected on all participants using a 2D fast spin-echo sequence. Details of acquisition parameters for the 3D MPRAGE and 2D fast spin-echo sequence are shown in Table 1.
Table 1.
MRI acquisition parameters
| Scanner | Scanner A | Scanner B |
|---|---|---|
| Vendor | Siemens Trio | Philips Achieva |
| Field strength | 3T | 3T |
| T1-weighted 3D MPRAGE | ||
| TE (msec) | 2.98 | 3.7 |
| TR (msec) | 2300 | 8 |
| TI (msec) | 900 | 955 |
| Flip angle (degrees) | 9 | 8 |
| FOV (cm × cm) | 25.6×25.6 | 24.0×22.8 |
| Sagittal slices (n) | 176 | 181 |
| Slice thickness (mm) | 1 | 1 |
| Acquisition matrix | 256×256 | 240×228 |
| Parallel Imaging Acceleration Factor | 2 | 2 |
| T2-weighted 2D FLAIR | ||
| TE (msec) | 150 | 90 |
| TR (sec) | 9 | 9 |
| TI (sec) | 2.49 | 2.5 |
| FOV (cm × cm) | 22×22 | 22×22 |
| Axial slices (n) | 35 | 35 |
| Slice thickness (mm) | 4 | 4 |
| Acquisition matrix | 256×256 | 256×201 |
| Parallel Imaging Acceleration Factor | 2 | 1.6 |
Scanner B is a 3 Tesla Philips Achieva (Best, The Netherlands) MRI scanner. High-resolution T1-weighted anatomical data were obtained using a 3D MPRAGE sequence. T2–weighted FLAIR data were collected on all participants using a 2D fast spin-echo sequence. Details of acquisition parameters for the 3D MPRAGE and 2D fast spin-echo sequence are shown in Table 1.
2.5. WMH assessment
The total volume of WMH was quantified for each participant using an automated WMH segmentation tool, the brain intensity abnormality classification algorithm (BIANCA) (Griffanti et al., 2016). The T1-weighted MPRAGE data were first spatially registered to the T2-weighted FLAIR data using affine registration (FLIRT, FMRIB, University of Oxford, UK). The brain was extracted from the coregistered MPRAGE and FLAIR image volumes (BET, FMRIB, University of Oxford, UK). WMHs were then automatically segmented in each participant’s native space using BIANCA with the T2-weighted FLAIR image. For each participant, maps of WMHs were generated, and the total volume of WMHs was measured and normalized by the intracranial volume (ICV) derived from CAT12 (http://www.neuro.uni-jena.de/cat/). WMH volume, expressed as a percentage of ICV, was then log10-transformed to reduce skewness.
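The normalization and transformation in the last step amount to a one-line calculation. The sketch below is illustrative; the function name and the example volumes are hypothetical, and raising an error for a zero WMH volume is an assumption, since the text does not say how such cases were handled.

```python
import math


def log10_wmh_percent_icv(wmh_mm3, icv_mm3):
    """log10 of WMH volume expressed as a percentage of intracranial volume."""
    if wmh_mm3 <= 0 or icv_mm3 <= 0:
        # log10 is undefined at zero; handling of such cases is an assumption.
        raise ValueError("volumes must be positive")
    return math.log10(100.0 * wmh_mm3 / icv_mm3)


# Hypothetical participant: 5,000 mm^3 of WMH in a 1,400,000 mm^3 ICV,
# i.e. about 0.36% of ICV.
value = log10_wmh_percent_icv(5000.0, 1400000.0)
```

On this scale, a WMH volume equal to 1% of ICV maps to 0, and smaller burdens map to negative values.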
2.6. Participants
Participants were older adults enrolled in ongoing cohort studies of aging and dementia including the Religious Orders Study (ROS), the Rush Memory and Aging Project (MAP) (Bennett et al., 2018), the Minority Aging Research Study (MARS) (Barnes et al., 2012), and the Rush Clinical Core (Barnes et al., 2015, Schneider et al., 2009). Eligibility for each study requires age ≥65 years, absence of known dementia, and agreement to annual clinical evaluations. The 3T MRI sub-study began in 2012. Of about 1900 participants eligible for and consenting to MRI scans since 2012, 1127 participants completed at least one MRI scanning session. All four studies and the MRI sub-study were approved by the Institutional Review Board of Rush University Medical Center. Written informed consent was obtained from each participant. De-identified data employed in this study can be requested at the Research Resource Sharing Hub of the Rush Alzheimer’s Disease Center (www.radc.rush.edu).
2.7. Demographic and clinical variables
Date of birth, sex, years of education, and race were self-reported at study entry; age at scan was calculated. Race categories were based on the 1990 U.S. Census as previously described (Barnes et al., 2012). Vascular risk factors such as smoking, diabetes, and hypertension have been shown to be associated with WMH volumes, with a higher number of vascular risk factors being associated with a higher volume of WMH (Williamson, Lewandowski et al. 2018, Gebeily, Fares et al. 2014). Smoking status and histories of hypertension and diabetes were obtained through self-report and inspection of medications. The number of these factors present (0 to 3) served as a summary measure of vascular risk.
Participants underwent a comprehensive neuropsychological assessment. A core battery of 17 tests was used to make a composite score for global cognition, as previously described (Wilson et al., 2015, Wilson et al., 2005). To create the composite, raw scores from the individual cognitive tests were converted to z-scores using the baseline mean and standard deviation of all participants enrolled in the parent cohorts. Each participant’s z-scores were then averaged to yield a composite global cognition score. Data were reviewed by a neuropsychologist to determine cognitive status at each visit. Participants were evaluated by a clinician who used all cognitive and clinical data available to provide a diagnosis. Dementia and related conditions were diagnosed using the guidelines of the joint working group of the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (McKhann et al., 1984). A diagnosis of mild cognitive impairment (MCI) was applied to individuals who had cognitive impairment but did not meet criteria for dementia.
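The composite score described above can be sketched as an average of per-test z-scores. The example is illustrative only: the test names and baseline values are hypothetical, only three of the 17 tests are shown, and averaging over the tests that are available when a score is missing is an assumption not stated in the text.

```python
import statistics


def composite_z(raw_scores, baseline_mean, baseline_sd):
    """Average of per-test z-scores; tests with a missing (None) score are skipped."""
    z_scores = [
        (score - baseline_mean[test]) / baseline_sd[test]
        for test, score in raw_scores.items()
        if score is not None
    ]
    return statistics.fmean(z_scores)


# Hypothetical test names and values, for illustration only.
baseline_mean = {"word_list_recall": 17.0, "digit_span": 7.0, "symbol_digit": 38.0}
baseline_sd = {"word_list_recall": 4.0, "digit_span": 2.0, "symbol_digit": 10.0}
participant = {"word_list_recall": 21.0, "digit_span": 6.0, "symbol_digit": None}
score = composite_z(participant, baseline_mean, baseline_sd)
```

Here the two available tests give z-scores of 1.0 and -0.5, so the composite is 0.25. A score of 0 corresponds to the baseline average of the parent cohorts.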
2.8. Association of age with WMH volume
We analyzed the association of age with WMH volume after controlling for sex, education, race, vascular risk, clinical diagnosis, and a measure of global cognition. The association of age with WMH volume was estimated from each scanner, and then meta-analyzed using the four methods described above: the bootstrap approach with individual participant data, pooling, DerSimonian-Laird, and Hartung-Knapp. We compared coverage probability and power of four methods using simulated data with and without scanner differences.
3. Results
In simulations without scanner differences, the bootstrap approach performed better than the two random effects meta-analyses, with its coverage probability being closer to the nominal level of 95% (Bootstrap 0.94, DerSimonian-Laird 0.99, Hartung-Knapp 1.0, Pooling 0.95 with the number of scanners (K) = 2 and the number of participants per scanner (N) = 50) (Table 2, Figure 3). Moreover, the coverage of the bootstrap approach was similar to that of pooling, which is the optimal approach in the absence of scanner differences (Table 2, Figure 3). In the presence of scanner differences, the bootstrap approach performed better than the two random effects meta-analyses and pooling (Table 3, Figure 3). While pooling performed the best in the absence of scanner differences, its coverage fell below the nominal level in the presence of scanner differences (Bootstrap 0.94, DerSimonian-Laird 0.99, Hartung-Knapp 1.0, Pooling 0.92 with K=2 and N=50). Coverage probabilities were generally closer to the nominal level when the number of scanners was five and as the number of participants per scanner increased, whether scanner differences were present or absent.
Table 2.
Coverage probability in the absence of scanner differences.
| N | Bootstrap | Pooling | DerSimonian-Laird | Hartung-Knapp |
|---|---|---|---|---|
| Number of scanners (K) = 2 | | | | |
| 30 | 0.945 | 0.950 | 0.985 | 1 |
| 50 | 0.941 | 0.945 | 0.985 | 1 |
| 100 | 0.956 | 0.959 | 0.989 | 1 |
| Number of scanners (K) = 5 | ||||
| 30 | 0.941 | 0.942 | 0.969 | 0.998 |
| 50 | 0.950 | 0.950 | 0.978 | 1 |
| 100 | 0.948 | 0.949 | 0.976 | 0.998 |
N, number of participants at each scanner.
The nominal coverage probability was set at 0.95.
Figure 3.

Coverage probability. Coverage probability (y-axis) versus the number of participants (N = 30, 50, 100; x-axis) for the four methods (bootstrap, pooling, DerSimonian-Laird, and Hartung-Knapp). Simulation results without scanner differences are shown in (a) and (b), and those with scanner differences in (c) and (d). The number of scanners is two in (a) and (c), and five in (b) and (d). In each plot, the red line marks the nominal coverage probability (0.95), the blue solid line connects the simulation results for the bootstrap method, the black solid line those for pooling, the black dotted line those for DerSimonian-Laird, and the black dot-dashed line those for Hartung-Knapp.
Table 3.
Coverage probability in the presence of scanner differences.
| N | Bootstrap | Pooling | DerSimonian-Laird | Hartung-Knapp |
|---|---|---|---|---|
| Number of scanners (K) = 2 | | | | |
| 30 | 0.927 | 0.921 | 0.982 | 1.000 |
| 50 | 0.942 | 0.917 | 0.987 | 1.000 |
| 100 | 0.942 | 0.908 | 0.987 | 1.000 |
| Number of scanners (K) = 5 | ||||
| 30 | 0.944 | 0.920 | 0.975 | 0.996 |
| 50 | 0.944 | 0.937 | 0.976 | 0.998 |
| 100 | 0.941 | 0.927 | 0.981 | 0.998 |
N, number of participants at each scanner.
The nominal coverage probability was set at 0.95.
The bootstrap approach showed higher power than the two random effects meta-analyses in the absence of scanner differences in simulation (Table 4, Figure 4). Again, the power of the bootstrap approach was similar to that of pooling, the optimal approach in the absence of scanner differences (Bootstrap 0.89, DerSimonian-Laird 0.60, Hartung-Knapp 0.0, Pooling 0.90 with K=2 and N=50) (Table 4, Figure 4). In the presence of scanner differences, the bootstrap approach showed higher power than the two random effects meta-analyses and pooling (Table 5, Figure 4). While pooling performed the best in the absence of scanner differences, its performance was again poorer when scanner differences were present (Bootstrap 0.45, DerSimonian-Laird 0.20, Hartung-Knapp 0.0, Pooling 0.41 with K=2 and N=50). The power of all four methods was higher as the number of participants increased and when the number of scanners was five, whether scanner differences were present or not.
Table 4.
Power in the absence of scanner differences.
| N | Bootstrap | Pooling | DerSimonian-Laird | Hartung-Knapp |
|---|---|---|---|---|
| Number of scanners (K) = 2 | | | | |
| 30 | 0.738 | 0.773 | 0.456 | 0.000 |
| 50 | 0.890 | 0.897 | 0.602 | 0.000 |
| 100 | 0.998 | 0.999 | 0.930 | 0.000 |
| Number of scanners (K) = 5 | ||||
| 30 | 0.998 | 0.973 | 0.911 | 0.679 |
| 50 | 0.999 | 0.999 | 0.993 | 0.940 |
| 100 | 1.000 | 1.000 | 1.000 | 1.000 |
N, number of participants at each scanner.
The coefficient of determination simulated was 0.09 on average per scanner.
Figure 4.

Power. Power (y-axis) versus the number of participants (N = 30, 50, 100; x-axis) for the four methods (bootstrap, pooling, DerSimonian-Laird, and Hartung-Knapp). Results without scanner differences are shown in (a) and (b), and those with scanner differences in (c) and (d). The number of scanners is two in (a) and (c), and five in (b) and (d). In each plot, the blue solid line is for the bootstrap method, the black solid line for pooling, the black dotted line for DerSimonian-Laird, and the black dot-dashed line for Hartung-Knapp.
Table 5.
Power in the presence of scanner differences.
| N | Bootstrap | Pooling | DerSimonian-Laird | Hartung-Knapp |
|---|---|---|---|---|
| Number of scanners (K) = 2 | | | | |
| 30 | 0.311 | 0.308 | 0.127 | 0.000 |
| 50 | 0.445 | 0.414 | 0.205 | 0.000 |
| 100 | 0.680 | 0.615 | 0.364 | 0.000 |
| Number of scanners (K) = 5 | ||||
| 30 | 0.590 | 0.470 | 0.417 | 0.154 |
| 50 | 0.866 | 0.748 | 0.699 | 0.398 |
| 100 | 0.980 | 0.933 | 0.917 | 0.712 |
N, number of participants at each scanner.
The coefficient of determination simulated was 0.018 on average per scanner.
We examined four methods with WMH data from 787 participants described above. There were 409 participants scanned in scanner A and 378 scanned in scanner B. The mean age was 79.6 years (SD=6.7) and 77.4 years (SD=6.8) for scanners A and B, respectively, and the mean education was about 16 years (SD=3.2), and 15.8 (SD=3.6), respectively. Over 70% of participants were female at each scanner. The Scanner B group was younger, had a higher percentage of African Americans, higher vascular risk scores, lower global cognitive scores, and a higher proportion of cognitively intact participants (Table 6).
Table 6.
Characteristics of study participants by Scanner.
| | Scanner A (N=409) | Scanner B (N=378) | P-value for difference |
|---|---|---|---|
| Age | |||
| Mean (SD) | 79.6 (6.7) | 77.4 (6.8) | P < 0.0001 |
| Men (%) | 96 (23.5) | 70 (18.5) | P = 0.0968 |
| Education | |||
| Mean (SD) | 16.0 (3.2) | 15.8 (3.6) | P = 0.3423 |
| Race | |||
| AA (%) | 90 (22.0) | 275 (72.8) | P < 0.0001 |
| Global cognition | |||
| Mean (SD) | 0.2048 (0.61) | 0.1181 (0.62) | P = 0.0337 |
| Clinical diagnosis (a) | |||
| NCI (number, %) | 316 (77.5) | 318 (85.5) | P = 0.0044 |
| MCI (number, %) | 87 (21.3) | 48 (12.9) | |
| Dementia (number, %) | 5 (1.2) | 6 (1.6) | |
| Vascular risk factors | 1.2 | 1.4 | P = 0.0002 |
| Smoke (number, %) (b) | 166 (41.4) | 155 (41.3) | P = 1.0000 |
| Diabetes (number, %) (c) | 63 (15.5) | 99 (26.3) | P = 0.0002 |
| Hypertension (number, %) (d) | 243 (59.7) | 268 (71.1) | P = 0.0010 |
AA, African American; NCI, no cognitive impairment; MCI, mild cognitive impairment.
(a) Clinical diagnosis was not available for seven participants.
(b) History of smoking was not available for eleven participants.
(c) History of diabetes was not available for three participants.
(d) History of hypertension was not available for three participants.
Note: Participants from the two scanners differed in age, racial distribution, cognitive score, clinical diagnosis, and vascular risk factors.
In a meta-analysis of the association of age with WMH volume after controlling for demographics, vascular risk, clinical diagnosis, and global cognition, the proposed bootstrap approach, DerSimonian-Laird, and pooling all indicated that older age was associated with increased WMH volume (all Ps < 0.0001), but the Hartung-Knapp approach did not (P = 0.1439), as shown in Table 7. The association of age with WMH volume by MRI scanner is shown in Figure 5.
Table 7.
Association of age with WMH volume.
| | Coefficient | S.E. | P-value |
|---|---|---|---|
| Scanner A | 0.0198 | 0.0029 | P < 0.0001 |
| Scanner B | 0.0106 | 0.0022 | P < 0.0001 |
| Bootstrap approach | 0.0139 | 0.0017 | P < 0.0001 |
| Pooling | 0.0154 | 0.0018 | P < 0.0001 |
| DerSimonian-Laird | 0.0146 | 0.0034 | P < 0.0001 |
| Hartung-Knapp | 0.0146 | 0.0034 | P = 0.1439 |
Note: 1. WMH volume = log10 (WMH volume in mm3 as a percentage of ICV). 2. Because the outcome is on the log10 scale, one additional year of age is associated with a 1.05-fold (about 4.7%) greater WMH volume at Scanner A and a 1.02-fold (about 2.5%) greater WMH volume at Scanner B. 3. In these analyses, the DerSimonian-Laird approach applies the standard Gaussian distribution for inference, while the Hartung-Knapp approach applies the t-distribution with 1 degree of freedom.
WMH, white matter hyperintensity; ICV, intracranial volume.
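Because WMH volume is analyzed on the log10 scale, a regression coefficient is read as a multiplicative change per year of age. The short sketch below back-transforms the rounded per-scanner coefficients from Table 7; the function name is hypothetical.

```python
def percent_change_per_year(coef_log10):
    """Percent change in WMH volume per additional year for a log10-scale slope."""
    return (10 ** coef_log10 - 1) * 100


change_a = percent_change_per_year(0.0198)  # Scanner A
change_b = percent_change_per_year(0.0106)  # Scanner B
```

These evaluate to roughly 4.7% and 2.5% per year, corresponding to multiplicative factors of about 1.05 and 1.02.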
Figure 5.

Association of age with WMH volume by scanner. Blue dots show participants from Scanner A, and turquoise dots from Scanner B. WMH volume = log10 (WMH volume in mm3 as a percentage of ICV).
4. Discussion
Large epidemiological studies involving MRI often utilize multiple scanners. Estimated lesion volume has been reported to vary by image acquisition parameters, magnetic field strength, and vendors as well as automatic segmentation tools (Shinohara et al., 2017, Tudorascu et al., 2016, Guo et al., 2019). We proposed a bootstrap approach that integrates data from different MRI scanners using individual participant data. We also examined pooling and two random effects meta-analysis methods: DerSimonian-Laird and Hartung-Knapp. In simulations without scanner differences, the bootstrap approach performed better than the two random effects meta-analyses in terms of coverage probability and power, and was similar to pooling, the optimal approach. When scanner differences were present, the bootstrap approach performed better than the other three methods. In testing the association of age with WMH, we found the association was supported with the bootstrap, pooling, and DerSimonian-Laird approaches, but not with the Hartung-Knapp approach.
While meta-analysis is a standard approach for synthesis of findings, random effects meta-analysis incorporating variance across scanners could be overly conservative when inferences are about the average effect in data from a small number of MRI scanners. We demonstrated this using simulated data and actual WMH volume data. In simulations, random effects meta-analysis showed higher coverage probability and lower power whether scanner effects were present or absent. In the data analysis of the association of age with WMH volume, the Hartung-Knapp approach was overly conservative and failed to show an association (p-values < 0.0001 with all methods except Hartung-Knapp (p-value = 0.1439)). We also found that meta-analysis results varied depending on the method chosen. Of the two methods, Hartung-Knapp was more conservative. Of note, the reference distribution for the test statistic is a normal distribution for DerSimonian-Laird, compared to a t-distribution for Hartung-Knapp (df = the number of studies - 1; df = 1 with two studies). While differences between normal and t-distributions become negligible when the number of scanners increases, hypothesis testing using t-distributions has very low power when the number of studies is small, as was the case in the current study (Higgins et al., 2008, Guolo and Varin, 2017).
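The effect of the reference distribution can be checked directly, because a t-distribution with 1 degree of freedom is the standard Cauchy distribution and has a closed-form tail probability. The sketch below is illustrative: the function names are hypothetical, and the test statistic is formed from the rounded estimate and standard error in Table 7, so the resulting p-values only approximate the reported ones.

```python
import math


def two_sided_p_normal(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def two_sided_p_t1(t):
    """Two-sided p-value under a t-distribution with 1 df (standard Cauchy)."""
    return 1.0 - 2.0 * math.atan(abs(t)) / math.pi


# Same estimate and standard error for both random effects methods (Table 7).
statistic = 0.0146 / 0.0034
p_normal = two_sided_p_normal(statistic)  # DerSimonian-Laird reference
p_t1 = two_sided_p_t1(statistic)          # Hartung-Knapp reference, K = 2
```

With a statistic near 4.3, the normal reference gives a p-value well below 0.0001, while the t-distribution with 1 degree of freedom gives a p-value of about 0.15. With only two scanners, the statistic would need to exceed about 12.7 in absolute value to reach significance at the 0.05 level.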
While pooling had properties similar to those of the bootstrap approach in simulations without scanner effects, pooling performed worse when scanner effects were present. Potential bias and increased variation from multiple scanners have been reported (Barrio-Arranz et al., 2015, Takao et al., 2011, Takao et al., 2013). Therefore, conclusions from pooling can differ from those from the bootstrap approach and can often be misleading. We thus discourage the pooling approach unless scanner differences are found to be negligible. We simulated scanner-specific level and variance. Further study is warranted to understand which sources of scanner differences have the more detrimental influence on the results. We demonstrated our bootstrap approach with case resampling (Efron, 1979). However, our method can be combined with other types of resampling methods as well, for example, residual or parametric resampling. Residual resampling is a semi-parametric approach that resamples residuals from the regression model fitted to the whole data under the assumption that the regression estimates are close to the truth, while parametric resampling makes assumptions on both the shape of the regression and the distribution of errors. Since case resampling requires no such assumptions, it is suitable for situations where large uncertainty in the true model is expected, and it gives wider confidence intervals than does residual or parametric resampling (Davison and Hinkley, 2013). Case resampling is typically recommended when the sample size is relatively small.
Since the bootstrap approach resamples data with replacement, the number of all possible resamples is $n^n$ at each scanner. We randomly drew 10,000 bootstrap samples. While a larger M gives a better approximation, we confirmed that the estimate was fairly stable, as shown in Supplementary Figure S.3. The bootstrap approach was constructed based on the assumption that there exists a common effect across scanners. If the common effect assumption of the bootstrap method is suspect, we suggest using random effects meta-analysis, which allows random variation in the effect of interest.
We acknowledge that there are other methods that adjust MRI measurements for scanner effects as opposed to meta-analysis, such as ComBat (Johnson et al., 2007). It is noteworthy, however, that to estimate scanner effects with the ComBat method, multiple MRI measurements from the same scanner are required, such as FreeSurfer (Fischl and Dale, 2000) measurements from all subfields. Our bootstrap method is therefore favored when a single measure per person is of interest, as in the present WMH volume analyses. Furthermore, the bootstrap approach can be applied to metrics from other MRI modalities such as FreeSurfer measurements, diffusion tensor imaging, and magnetic susceptibility mapping. Our finding replicates the well-established association between older age and greater WMH volume. We chose this example to demonstrate that random effects meta-analysis is conservative when the average effect across scanners is of interest and the number of scanners is small. Of the two meta-analysis methods employed, the Hartung-Knapp method was more conservative and failed to show the WMH/age association.
This study has several strengths. Our study is based on four well-defined community-based cohorts that have relatively large sample sizes and well-characterized cognitive performance, health-related factors, and MRI data. Using a range of validated meta-analytic approaches, we compared the utility of each to the bootstrap approach with individual participant data.
This study also has several limitations. First, participants were not randomized to scanner. We therefore did not test for differences in WMH volumes between the MRI scanners, because the scanners differed in the demographic characteristics of participants as well as in image acquisition parameters and vendors. Second, the proposed approach is computationally intensive. It is well known that resampling algorithms (e.g., permutation, jackknife, and bootstrap) are computationally intensive. However, given recent advances in computing power, 10,000 bootstrap resamples of the WMH volume data in this study were completed in about 30 seconds on a PC with a 3.0 GHz CPU and 8.0 GB of RAM. Computation time may be reduced further by adopting a parallel computation algorithm. Continued gains in efficiency and computing power will make the proposed method easier to apply to important questions in the future.
Supplementary Material
Highlights.
Random effects meta-analysis is not suitable when the number of MRI scanners is small.
Pooling performs poorly when there is undue variation due to scanner differences.
The proposed approach performs better than random effects meta-analysis and pooling.
Acknowledgement
We thank all the ROS, MAP, MARS, and Rush Clinical Core participants, and also the staff and investigators at the Rush Alzheimer’s Disease Center (RADC) for providing and processing high quality data. Please visit the RADC Research Resource Sharing Hub (www.radc.rush.edu) to obtain data for research purposes. This study was supported in part by National Institutes of Health (NIH) of the United States grants RF1AG022018, R01AG056405, P30R01AG10161, R01AG15819, R01AG17917, U01AG046152, R01AG034374, 1R01AG057911, U01AG61356, R01AG055430, R01AG062711, K01AG064044.
Footnotes
Conflict of interest
The authors have no actual or potential conflicts of interest.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
- Alosco ML, Sugarman MA, Besser LM, Tripodis Y, Martin B, Palmisano JN, Kowall NW, Au R, Mez J, DeCarli C, Stein TD, McKee AC, Killiany RJ, Stern RA. A Clinicopathological Investigation of White Matter Hyperintensities and Alzheimer’s Disease Neuropathology. J.Alzheimers Dis, 2018;63:1347–60.
- Arfanakis K, Evia AM, Leurgans SE, Cardoso LFC, Kulkarni A, Alqam N, Lopes LF, Vieira D, Bennett DA, Schneider JA. Neuropathologic Correlates of White Matter Hyperintensities in a Community-Based Cohort of Older Adults. 2020;73:333–45.
- Ashburner J. Computational anatomy with the SPM software. Magn.Reson.Imaging, 2009;27:1163–74.
- Barnes LL, Leurgans S, Aggarwal NT, Shah RC, Arvanitakis Z, James BD, Buchman AS, Bennett DA, Schneider JA. Mixed pathology is more likely in black than white decedents with Alzheimer dementia. Neurology, 2015;85:528–34.
- Barnes LL, Shah RC, Aggarwal NT, Bennett DA, Schneider JA. The Minority Aging Research Study: Ongoing Efforts to Obtain Brain Donation in African Americans without Dementia. 2012;9:734–45.
- Barrio-Arranz G, de Luis-García R, Tristán-Vega A, Martín-Fernández M, Aja-Fernández S. Impact of MR Acquisition Parameters on DTI Scalar Indexes: A Tractography Based Approach. PLOS ONE, 2015;10:e0137905.
- Bauer CM, Jara H, Killiany R, Alzheimer’s Disease Neuroimaging Initiative. Whole brain quantitative T2 MRI across multiple scanners with dual echo FSE: applications to AD, MCI, and normal aging. Neuroimage, 2010;52:508–14.
- Bennett DA, Buchman AS, Boyle PA, Barnes LL, Wilson RS, Schneider JA. Religious Orders Study and Rush Memory and Aging Project. J.Alzheimers Dis, 2018;64:S161–89.
- Berlingeri M, Devoto F, Gasparini F, Saibene A, Corchs SE, Clemente L, Danelli L, Gallucci M, Borgoni R, Borghese NA, Paulesu E. Clustering the Brain With “CluB”: A New Toolbox for Quantitative Meta-Analysis of Neuroimaging Data. 2019;13:1037.
- Brickman AM, Muraskin J, Zimmerman ME. Structural neuroimaging in Alzheimer’s disease: do white matter hyperintensities matter? 2009;11:181–90.
- Chao LL, Decarli C, Kriger S, Truran D, Zhang Y, Laxamana J, Villeneuve S, Jagust WJ, Sanossian N, Mack WJ, Chui HC, Weiner MW. Associations between white matter hyperintensities and beta amyloid on integrity of projection, association, and limbic fiber tracts measured with diffusion tensor MRI. PLoS One, 2013;8:e65175.
- Costafreda SG. Pooling FMRI data: meta-analysis, mega-analysis and multi-center studies. Front.Neuroinform, 2009;3:33.
- Davison AC, Hinkley DV. Bootstrap Methods and their Application, Cambridge University Press, 2013.
- DerSimonian R, Laird N. Meta-analysis in clinical trials. Control.Clin.Trials, 1986;7:177–88.
- DiCiccio TJ, Efron B. Bootstrap confidence intervals. Statist.Sci, 1996;11:189–228.
- Dickie DA, Ritchie SJ, Cox SR, Sakka E, Royle NA, Aribisala BS, Valdés Hernández MDC, Maniega SM, Pattie A, Corley J, Starr JM, Bastin ME, Deary IJ, Wardlaw JM. Vascular risk factors and progression of white matter hyperintensities in the Lothian Birth Cohort 1936. Neurobiol.Aging, 2016;42:116–23.
- Domínguez Islas C, Rice KM. Addressing the estimation of standard errors in fixed effects meta-analysis. Stat.Med, 2018;37:1788–809.
- Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann.Statist, 1979;7:1–26.
- Fischl B, Dale AM. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc.Natl.Acad.Sci.USA, 2000;97:11050.
- Focke NK, Helms G, Kaspar S, Diederich C, Tóth V, Dechent P, Mohr A, Paulus W. Multi-site voxel-based morphometry--not quite there yet. Neuroimage, 2011;56:1164–70.
- Fusar-Poli P. Voxel-wise meta-analysis of fMRI studies in patients at clinical high risk for psychosis. 2012;37:106–12.
- Fusar-Poli P, Placentino A, Carletti F, Landi P, Allen P, Surguladze S, Benedetti F, Abbamonte M, Gasparotti R, Barale F, Perez J, McGuire P, Politi P. Functional atlas of emotional faces processing: a voxel-based meta-analysis of 105 functional magnetic resonance imaging studies. J.Psychiatry Neurosci, 2009;34:418–32.
- Griffanti L, Zamboni G, Khan A, Li L, Bonifacio G, Sundaresan V, Schulz UG, Kuker W, Battaglini M, Rothwell PM, Jenkinson M. BIANCA (Brain Intensity AbNormality Classification Algorithm): A new tool for automated segmentation of white matter hyperintensities. 2016;141:191–205.
- Gunning-Dixon F, Brickman AM, Cheng JC, Alexopoulos GS. Aging of cerebral white matter: a review of MRI findings. Int.J.Geriatr.Psychiatry, 2009;24:109–17.
- Guo C, Niu K, Luo Y, Shi L, Wang Z, Zhao M, Wang D, Zhu W, Zhang H, Sun L. Intra-Scanner and Inter-Scanner Reproducibility of Automatic White Matter Hyperintensities Quantification. 2019;13:679.
- Guolo A, Varin C. Random-effects meta-analysis: the number of studies matters. Stat.Methods Med.Res, 2017;26:1500–18.
- Hartung J, Knapp G. On tests of the overall treatment effect in meta-analysis with normally distributed responses. Stat.Med, 2001a;20:1771–82.
- Hartung J, Knapp G. A refined method for the meta-analysis of controlled clinical trials with binary outcome. Stat.Med, 2001b;20:3875–89.
- Hedges LV, Olkin I. Statistical Methods for Meta-Analysis, New York: Academic Press, 1985.
- Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. 2008;172:137–59.
- Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 2007;8:118–27.
- Kandel BM, Avants BB, Gee JC, McMillan CT, Erus G, Doshi J, Davatzikos C, Wolk DA. White matter hyperintensities are more highly associated with preclinical Alzheimer’s disease than imaging and cognitive markers of neurodegeneration. 2016;4:18–27.
- Li Y, Shi L, Daniel Roth H. The bias of the commonly-used estimate of variance in meta-analysis. 1994;23:1063–85.
- Marchewka A, Kherif F, Krueger G, Grabowska A, Frackowiak R, Draganski B, Alzheimer’s Disease Neuroimaging Initiative. Influence of magnetic field strength and image registration strategy on voxel-based morphometry in a study of Alzheimer’s disease. Hum.Brain Mapp, 2014;35:1865–74.
- McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer’s disease. Neurology, 1984;34:939.
- Partlett C, Riley RD. Random effects meta-analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation. Statist.Med, 2017;36:301–17.
- Schneider JA, Aggarwal NT, Barnes L, Boyle P, Bennett DA. The neuropathology of older persons with and without dementia from community versus clinic cohorts. J.Alzheimers Dis, 2009;18:691–701.
- Shinohara RT, Oh J, Nair G, Calabresi PA, Davatzikos C, Doshi J, Henry RG, Kim G, Linn KA, Papinutto N, Pelletier D, Pham DL, Reich DS, Rooney W, Roy S, Stern W, Tummala S, Yousuf F, Zhu A, Sicotte NL, Bakshi R. Volumetric Analysis from a Harmonized Multisite Brain MRI Study of a Single Subject with Multiple Sclerosis. Am.J.Neuroradiol, 2017;38:1501.
- Stonnington CM, Tan G, Klöppel S, Chu C, Draganski B, Jack CR Jr, Chen K, Ashburner J, Frackowiak RS. Interpreting scan data acquired from multiple scanners: a study with Alzheimer’s disease. Neuroimage, 2008;39:1180–5.
- Takao H, Hayashi N, Ohtomo K. Effects of the use of multiple scanners and of scanner upgrade in longitudinal voxel-based morphometry studies. J.Magn.Reson.Imaging, 2013;38:1283–91.
- Takao H, Hayashi N, Ohtomo K. Effect of scanner in asymmetry studies using diffusion tensor imaging. Neuroimage, 2011;54:1053–62.
- Tench CR, Tanasescu R, Constantinescu CS, Auer DP, Cottam WJ. Coordinate based random effect size meta-analysis of neuroimaging studies. Neuroimage, 2017;153:293–306.
- Tudorascu DL, Karim HT, Maronge JM, Alhilali L, Fakhran S, Aizenstein HJ, Muschelli J, Crainiceanu CM. Reproducibility and Bias in Healthy Brain Segmentation: Comparison of Two Popular Neuroimaging Platforms. 2016;10:503.
- Wardlaw JM, Valdes Hernandez MC, Munoz-Maniega S. What are white matter hyperintensities made of? Relevance to vascular cognitive impairment. J.Am.Heart Assoc, 2015;4:001140.
- Wilson RS, Barnes LL, Krueger KR, Hoganson G, Bienias JL, Bennett DA. Early and late life cognitive activity and cognitive systems in old age. J.Int.Neuropsychol.Soc, 2005;11:400–7.
- Wilson RS, Boyle PA, Yu L, Segawa E, Sytsma J, Bennett DA. Conscientiousness, dementia related pathology, and trajectories of cognitive aging. Psychol.Aging, 2015;30:74–82.
