Identical, but not the same: Intra-site and inter-site reproducibility of fractional anisotropy measures on two 3.0 T scanners

Christian Vollmar; Jonathan O'Muircheartaigh; Gareth J Barker; Mark R Symms; Pamela Thompson; Veena Kumari; John S Duncan; Mark P Richardson; Matthias J Koepp

doi:10.1016/j.neuroimage.2010.03.046

NeuroImage. 2010 Jul 15;51(4):1384–1394.

PMCID: PMC3163823 PMID: 20338248

Abstract

Diffusion Tensor Imaging (DTI) is being increasingly used to assess white matter integrity and it is therefore paramount to address the test–retest reliability of DTI measures. In this study we assessed inter- and intra-site reproducibility of two nominally identical 3 T scanners at different sites in nine healthy controls using a DTI protocol representative of typical current “best practice” including cardiac gating, a multichannel head coil, parallel imaging and optimized diffusion gradient parameters. We calculated coefficients of variation (CV) and intraclass correlation coefficients (ICC) of fractional anisotropy (FA) measures for the whole brain, for three regions of interest (ROI) and for three tracts derived from these ROI by probabilistic tracking. We assessed the impact of affine, nonlinear and template based methods for spatially aligning FA maps on the reproducibility. The intra-site CV for FA ranged from 0.8% to 3.0% with ICC from 0.90 to 0.99, while the inter-site CV ranged from 1.0% to 4.1% with ICC of 0.82 to 0.99. Nonlinear image coregistration improved reproducibility compared to affine coregistration. Normalization to template space reduced the between-subject variation, resulting in lower ICC values and indicating a possibly reduced sensitivity. CV from probabilistic tractography were about 50% higher than for the corresponding seed ROI.

Reproducibility maps of the whole scan volume showed a low variation of less than 5% in the major white matter tracts but higher variations of 10–15% in gray matter regions.

One of the two scanners showed better intra-site reproducibility, while the intra-site CV for both scanners was significantly better than the inter-site CV. However, when using nonlinear coregistration of FA maps, the average inter-site CV was below 2%. There was a consistent inter-site bias: FA values on site 2 were 1.0–1.5% lower than on site 1. Correction for this bias with a global scaling factor reduced the inter-site CV to the range of the intra-site CV. Our results are encouraging for multi-centre DTI studies in larger populations, but also illustrate the importance of the image processing pipeline for reproducibility.

Introduction

Diffusion Tensor Imaging (DTI) is an advanced Magnetic Resonance Imaging (MRI) technique that allows the assessment of water diffusion in the brain. In highly organized tissue like cerebral white matter, diffusion preferentially follows the longitudinal direction of axonal bundles and myelin sheaths while transverse diffusivity is limited by cell membranes, organelles and other structures. The degree of this directionality is described by the fractional anisotropy (FA) and high FA values represent highly anisotropic diffusion. FA is commonly used as a measure of white matter organization or white matter integrity, being higher in densely packed, parallel white matter bundles such as the corpus callosum (CC). FA measures are increasingly used in clinical studies and have shown alterations in various brain diseases such as multiple sclerosis (Ge et al., 2005) and epilepsy (Focke et al., 2008), as well as in normal aging (Sullivan and Pfefferbaum, 2006).

The intra-site test–retest reliability of DTI measures has been addressed mainly at 1.5 Tesla (T) (Ciccarelli et al., 2003; Pfefferbaum et al., 2003; Heiervang et al., 2006; Bonekamp et al., 2007) with just two recent studies at 3 T (Jansen et al., 2007; Bisdas et al., 2008) (Table 1). There are considerably fewer data on the cross-centre reliability of DTI measures; previous studies have shown large variability of FA quantification on different 1.5 T scanners (Cercignani et al., 2003; Pfefferbaum et al., 2003) with an expected higher inter-site than intra-site variability (Pfefferbaum et al., 2003). Typical current “best practice” 3 T DTI protocols differ considerably from older 1.5 T versions, with the inclusion of modern array head coils resulting in higher signal to noise ratios, and the increasing use of parallel imaging methods. There is little or no information on the inter-site reproducibility of measurements made using these recent MR technological developments. Reproducibility studies require image coregistration, for which there are several possible methods, such as affine, nonlinear and template based approaches. The quality of these coregistration procedures is likely to affect measurement reproducibility. As repeat measurements of the same subject need to be coregistered, the most straightforward approach appears to be to coregister repeat scans in each subject's native space, avoiding any additional image transformation. However, in practice it is common to apply a nonlinear normalization to a common template space before further analysis, and we have therefore directly compared both approaches.

Table 1.

Comparison of results with previous studies on DTI test–retest reliability. The values shown for this study are the average CV from the three nonlinear methods. See Fig. 3 for other values. dwd = diffusion weighted directions, CV = coefficient of variation in %, WSV = within-subject variation, cg = cardiac gating, inter-site measures are printed italic.

Study	Field strength, scanner	Acquisition (dwd, repetitions; voxel size x, y, z; duration)	Subjects (age: mean ± SD or [range])	Repeated scans/measures	Statistics used	Reported CV whole brain [%]	Reported CV corpus callosum [%]	Reported CV other regions [%]	Comment
This study	3.0 T GE Signa + GE Signa	32 dwd 2.4 × 2.4 × 2.4 mm 10 min	9 volunteers 34 ± 8	Intra-site rescan ×2 Inter-site rescan ×2	Mean FA from ROI	1.1 1.5	1.2 1.6	LFWM 1.2 LFWM 2.2	ROI SCC: 0.8 cm³ LFWM: left frontal white matter

Bisdas	3.0 T Philips Intera	16 dwd × 2 2 × 2 × 3 mm	12 volunteers 38 ± 11	Intra-site rescan ×2	Mean FA from ROI		2		ROI SCC: 0.2 cm²

Jansen	3.0 T Philips Achieva	15 dwd 2 × 2 × 2 mm 10 min	10 volunteers 26 ± 2	Intra-site rescan ×2	Median FA Voxel wise	3.0 6.5			Images normalized to MNI Smoothed 6 mm FWHM

Bonekamp	1.5 T GE	15 dwd × 2 2.5 × 2.5 × 5 mm 5 min	10 volunteers 14.1 ± 2.8	Intra-site rescan ×2	Mean FA from ROI		2.6	SCR 3.8	SCR: superior corona radiata ROI: 16 voxels in single slice

Heiervang	1.5 T Siemens Sonata	60 dwd 2 × 2 × 2 mm	8 volunteers [21–34]	Intra-site rescan ×3	Variable	0.78 (mean FA from white matter)	4.81		Images normalized to MNI ROI in GCC, size 9 voxels

Ciccarelli	1.5 T GE Signa	60 dwd 2.5 × 2.5 × 2.5 mm 20–30 min (cg)	10 volunteers 37.5 ± 9.7	Intra-site rescan	Mean FA in tract		6.2		4 subjects rescanned ROI: ‘callosal fibers’ after tracking

Cercignani	1.5 T Philips Gyroscan + Siemens Vision	6 dwd ×10 or 8 dwd ×8 1.95 × 1.95 × 5 mm	12 volunteers 28.9 [23–33]	Intra-site rescan ×2 Inter-sequence rescan ×3 Inter-site rescan ×2	Mean FA from histogram	Not reported 5.45 7.71			4 subjects rescanned intra-site 8 subjects rescanned inter-site

Pfefferbaum	1.5 T GE Echospeed + GE Twinspeed	6 dwd ×6 Not reported	10 volunteers [21–33]	Intra-site rescan ×2 Inter-site rescan ×2	Mean FA from ROI	1.36 1.93	1.90 5.20		Images coregistered to common space ROI: ‘outlined in midsagittal FA’


Clinical studies often target very specific patient populations which are difficult to recruit by one imaging centre alone. Large scale pharmacological investigations are usually multi-centre studies that increase statistical power by pooling patients, but differences in MRI scanner manufacturers, models and set-ups even for the same type of scanner restrict the comparison of imaging parameters across sites. A necessary first step is the acquisition of test–retest data in controls for the assessment of reliability. Test–retest studies allow for an estimation of reproducibility, i.e. within-subject differences.

The purpose of the current study is fourfold:

1. To assess the reproducibility of DTI measures using a contemporary 3 T high field scanner system and a protocol typical of that which might be used in multi-centre studies using a variety of scanners.
2. To determine whether using this protocol on two nominally identical GE Signa HDx scanners at different sites (National Society for Epilepsy MRI Unit and Institute of Psychiatry, King's College London) results in acceptably low levels of cross-site variability.
3. To assess the impact of different steps of the image processing pipeline on measurement reproducibility: we compared different methods for image coregistration, for ROI definition and the effect of tractography compared to ROI analysis of FA maps.
4. To assess the measurement reproducibility within the scan volume, creating reproducibility maps to identify regions of unfavorably high FA variability.

Methods

Subjects

Nine healthy subjects (2 female, age range 28–52 years) underwent four MRI scans each, two at each imaging site. The order of scans across sites was randomized, the interval between individual scans ranged from 1 to 95 days, and all scans were acquired within a 12 month period. The study was approved by the Research Ethics Committee of the UCL Institute of Neurology and UCL Hospitals and written informed consent was obtained from each participant.

MR image acquisition

A 3 T MRI scanner was used at each site, with imaging gradients with a maximum strength of 40 mT/m and slew rate 150 mT/m/s (GE Signa HDx, General Electric, Milwaukee, WI, USA). The body coil was used for RF transmission, and an 8 channel head coil for signal reception, allowing a parallel imaging (ASSET) speed up factor of two. Each volume was acquired using a multi-slice, peripherally gated, doubly refocused spin echo EPI sequence, optimized for precise measurement of the diffusion tensor in parenchyma, from 60 contiguous near-axial slice locations with 128 × 128 isotropic (2.4 × 2.4 × 2.4 mm) voxels. The echo time was 104.5 ms. To minimize physiological noise, cardiac gating with a peripheral pulse sensor was applied (Wheeler-Kingshott et al., 2002), so the effective repetition time varied between subjects in the range of 12 to 20 RR intervals. Based on the recommendations of Jones et al. (2002), the maximum diffusion weighting was 1300 s mm^−2, and at each slice location, 4 images were acquired with no diffusion gradients applied, together with 32 diffusion weighted images in which gradient directions were uniformly distributed in space. The total acquisition time for this sequence was approximately 10 min, depending on the heart rate.

Image processing

Image distortions induced by eddy currents and subject movement during the acquisition were corrected using a mutual information based affine realignment of all volumes to the first non-diffusion weighted volume (FSL 4, http://www.fmrib.ox.ac.uk/fsl/) (Behrens et al., 2003). The brain tissue was automatically segmented from skull and background using FSL's deformable brain model based Brain Extraction Tool (Smith, 2002). Brain extraction was performed on a non-diffusion weighted volume with a fractional intensity threshold of 0.3 and then applied to the whole realigned DTI acquisition.

Diffusion tensors were reconstructed from the 32 diffusion weighted volumes using Camino software (http://www.cs.ucl.ac.uk/research/medic/camino/, Version 2, rev 530) (Cook et al., 2006). The resulting diffusion tensors were diagonalized, yielding the three principal eigenvalues λ₁, λ₂ and λ₃, from which FA maps were calculated (Basser and Pierpaoli, 1996).
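For illustration, the step from eigenvalues to FA can be sketched as follows. This is a minimal sketch of the standard FA definition, not the Camino implementation used in the study; the function name `fractional_anisotropy` and the example eigenvalues are hypothetical.

```python
import math


def fractional_anisotropy(l1, l2, l3):
    """Return FA for one voxel from the three diffusion tensor eigenvalues.

    Uses the standard definition: sqrt(3/2) times the root sum of squared
    deviations of the eigenvalues from their mean, divided by the root sum
    of squared eigenvalues.
    """
    mean_diffusivity = (l1 + l2 + l3) / 3.0
    numerator = ((l1 - mean_diffusivity) ** 2
                 + (l2 - mean_diffusivity) ** 2
                 + (l3 - mean_diffusivity) ** 2)
    denominator = l1 ** 2 + l2 ** 2 + l3 ** 2
    if denominator == 0.0:
        return 0.0  # no diffusion signal, e.g. a background voxel
    return math.sqrt(1.5 * numerator / denominator)


# Hypothetical eigenvalues (mm^2/s) for a strongly anisotropic voxel.
fa_example = fractional_anisotropy(1.7e-3, 0.3e-3, 0.3e-3)
```

FA is 0 for perfectly isotropic diffusion (all eigenvalues equal) and approaches 1 when diffusion occurs along a single direction only.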

To assess reproducibility, images created in each of the four sessions needed to be coregistered to each other. We used three different methods for coregistration and compared their impact on measurement reproducibility.

1. A rigid body coregistration with 6 degrees of freedom (3 translations, 3 rotations and no scaling) was performed using SPM software (SPM5, http://www.fil.ion.ucl.ac.uk/spm/). This was done using a two pass procedure: to achieve a gross alignment of images, the first FA map of each subject was initially coregistered to a FA template in MNI space by a rigid body transformation, preserving each subject's individual anatomy. Then all four FA images were coregistered to this template aligned image, the average FA was calculated and the rigid body coregistration was repeated, using the average FA as target image. Coregistered images were resampled to 1 mm isotropic voxels using 2nd degree spline interpolation. This procedure will be referred to as ‘affine’ coregistration.
2. The same procedure was then repeated, including nonlinear warping (32 nonlinear iterations) for normalization to each subject's mean FA image. For the nonlinear normalization the subject's smoothed average FA image was used as a weighting mask, assigning more importance to regions with high FA for the normalization procedure.
3. We used FSL's tract based spatial statistics (TBSS) tools to normalize each single FA image to the provided FMRIB58_FA template image in MNI space. TBSS default settings were used for this nonlinear transformation.

The masks created by TBSS for each scan were combined to create an average mask image for each subject that was eroded by two voxels to exclude non-brain voxels for all further processing and analyses. For voxel wise comparison, the realigned FA images were smoothed with a 4 mm FWHM kernel.
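Smoothing software usually expects the Gaussian kernel width as a standard deviation rather than as a FWHM. The conversion can be sketched as below; the helper name `fwhm_to_sigma` is hypothetical, and the 1 mm voxel size is an assumption taken from the resampling resolution stated for the coregistered images.

```python
import math


def fwhm_to_sigma(fwhm_mm, voxel_size_mm=1.0):
    """Convert a Gaussian FWHM in mm to a standard deviation in voxels."""
    # FWHM = 2 * sqrt(2 * ln 2) * sigma for a Gaussian kernel.
    sigma_mm = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return sigma_mm / voxel_size_mm


# 4 mm FWHM kernel on 1 mm isotropic voxels (assumed resolution).
sigma_voxels = fwhm_to_sigma(4.0, voxel_size_mm=1.0)
```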

Regions of interest

We chose three commonly used regions of interest (ROI), representatively reflecting different characteristics of white matter, and defined these ROIs manually on each subject's individual mean FA image in native space as well as on a FA template image in MNI space using MRIcro software (http://www.sph.sc.edu/comd/rorden/mricro.html) (Rorden and Brett, 2000) and the following anatomic guidelines:

1. A region representing an area of white matter with mainly parallel, densely packed fibers was defined in the splenium of the corpus callosum (SCC). A ROI of 0.8 cm³ was drawn in adjacent coronal slices, and the shape of the ROI was checked in sagittal slices (see Fig. 1a). To minimize partial volume effects at the edge of anatomic structures, ROI were restricted to the centre of the CC and 2 mm distance was kept to its anatomic boundaries.
2. A large, 3.5 cm³, region representing white matter with fibers of different and crossing orientations, was drawn in the left frontal white matter (LFWM), lateral to the commissural fibers from the CC and including the superior part of the corona radiata (Fig. 1b).
3. For the left uncinate fascicle (LUF), a smaller tract with lower average FA, a small 0.3 cm³ ROI was drawn in sagittal FA slices, selecting the first voxels with high FA values, ascending anteriorly from the inferior longitudinal bundle when scrolling from lateral to mesial. The anterior part of the core of the LUF was best defined in coronal slices where it can easily be depicted as a bright fiber bundle at the inferior frontal lobe (Fig. 1c).

Fig. 1 — ROI placement in template space. a) Splenium of corpus callosum (SCC), b) left frontal white matter (LFWM) and c) left uncinate fascicle (LUF).

All ROI were smoothed with a 3 × 3 × 3 voxel mean filter after drawing. ROI defined in template space were also backnormalized to each subject's individual native space, and measurements were performed in both template and native space.

For comparison with other studies that used all brain voxels or histogram based statistics to assess DTI reproducibility, we also determined statistics for a whole brain ROI, using each subject's thresholded b0 image to mask out CSF.

Tractography

Probabilistic tractography was performed with FSL's probtrack algorithm, using the default settings with 5000 iterations per seed voxel. The abovementioned ROI were defined in template space, backnormalized to each subject's four individual scans and used as seed regions with the following constraints:

1. For the SCC ROI no further restrictions were made, the resulting tract mainly showing the commissural connections between homologous areas of the two parietal and occipital lobes.
2. For the LFWM ROI a waypoint mask in the brainstem was defined in the lowest axial slice, the resulting tract therefore showing the descending fibers of the corticospinal tract.
3. To track from the LUF ROI, exclusion masks were used in the sagittal midline to avoid crossing fibers and posterior to the vertex of the uncinate fascicle to exclude the inferior longitudinal bundle.

Tracking was performed independently for all four scans from each subject and the average FA within the tract and tract volume were calculated, thresholding the probability maps at 2%.
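The thresholding and averaging step can be sketched as below. The text does not state what the 2% refers to; this sketch assumes it is taken relative to the maximum value of the probability map, which is only one possible reading. The function name `mean_fa_in_tract` and all example values are hypothetical.

```python
def mean_fa_in_tract(prob_map, fa_map, voxel_volume_mm3, fraction=0.02):
    """Return (mean FA, tract volume in cm^3) for a thresholded tract.

    prob_map and fa_map are flat lists with one entry per voxel.
    ASSUMPTION: the threshold is a fraction of the map maximum.
    """
    if len(prob_map) != len(fa_map):
        raise ValueError("probability map and FA map must have equal length")
    cutoff = fraction * max(prob_map)
    # Keep FA values of voxels that were visited and reach the cutoff.
    in_tract = [fa for p, fa in zip(prob_map, fa_map) if p > 0 and p >= cutoff]
    if not in_tract:
        return None, 0.0
    mean_fa = sum(in_tract) / len(in_tract)
    volume_cm3 = len(in_tract) * voxel_volume_mm3 / 1000.0
    return mean_fa, volume_cm3


# Hypothetical five-voxel example with 2.4 mm isotropic voxels.
probabilities = [0, 10, 100, 5000, 99]
fa_values = [0.1, 0.2, 0.5, 0.7, 0.3]
tract_fa, tract_volume = mean_fa_in_tract(probabilities, fa_values, 2.4 ** 3)
```

Because both the mean FA and the volume depend on which voxels survive the cutoff, small changes in the probability map near the threshold affect both measures.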

Reproducibility maps

To assess the spatial distribution of FA reproducibility within the scan volume, reproducibility maps were generated. For each subject, a difference image was created for each scan, calculating the absolute (positive or negative) difference of each single FA voxel from the subject's average FA. An average absolute difference image was created as well as an average relative difference image, dividing the absolute difference by the average FA, thereby showing the percentage change of the initial FA value. All maps were normalized to MNI space to create the group average reproducibility maps.
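A voxel-wise version of this calculation might look as follows. It is a minimal sketch that assumes the scans are already coregistered and flattened to lists of equal length, and that the absolute difference means the unsigned deviation from the subject's mean; the function name `reproducibility_maps` and the example values are hypothetical.

```python
def reproducibility_maps(scans):
    """Return (absolute, relative) difference maps for one subject.

    scans: list of coregistered FA maps, each a flat list of voxel values.
    The absolute map holds the mean unsigned deviation from the voxel's
    mean FA; the relative map expresses it as a percentage of that mean.
    """
    n_scans = len(scans)
    abs_map, rel_map = [], []
    for voxel_values in zip(*scans):
        mean_fa = sum(voxel_values) / n_scans
        mean_abs_diff = sum(abs(v - mean_fa) for v in voxel_values) / n_scans
        abs_map.append(mean_abs_diff)
        # Guard against division by zero in background voxels.
        rel_map.append(100.0 * mean_abs_diff / mean_fa if mean_fa > 0 else 0.0)
    return abs_map, rel_map


# Hypothetical subject: four scans, two voxels each.
example_scans = [[0.80, 0.20], [0.84, 0.20], [0.80, 0.20], [0.76, 0.20]]
abs_diff, rel_diff = reproducibility_maps(example_scans)
```

The relative map makes low-FA regions look more variable than the absolute map does, since the same absolute deviation is divided by a smaller mean.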

Statistics

ROI were applied to all four FA maps for each subject and ROI statistics were determined using FSLstats (FSL 4). Mean, standard deviation, minimum and maximum FA were extracted per ROI and analyzed with SPSS 14 (SPSS Inc., Chicago, IL, USA) and Microsoft Excel. For voxel wise comparison, AFNI (http://afni.nimh.nih.gov/afni) was used to extract individual voxel values from the SCC ROI for further correlation analysis.

The coefficient of variation (CV) is defined as the standard deviation σ of the measurements divided by their mean μ, multiplied by 100. It allows an intuitive estimate of measurement variance expressed as a relative percentage, regardless of the absolute measurement value. In previous studies on DTI test–retest reliability, the CV is the most commonly reported statistical measure. However, there are different ways to determine the CV for a given ROI:

• CV of the mean (CV_mn): the mean value from each ROI is determined for each scan and the difference between these mean values is determined.
• CV of the median (CV_md): instead of calculating the mean value from a ROI, the median value is determined and compared across scans. Assuming a symmetric distribution of values within a ROI, this should be close to the CV_mn.
• CV of voxel wise comparison (CV_vw): within each ROI, corresponding voxels from different scans are compared against each other and the CV_vw is determined for voxel wise differences.

CV_mn were calculated for each ROI and pairs of scans (intra-site and inter-site) per subject and for the group. CV_vw were calculated only for the SCC ROI, derived from both the raw and smoothed FA maps.
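The three CV variants described above can be sketched as follows. This is an illustration only: the use of the sample standard deviation and the averaging of per-voxel CVs for CV_vw are assumptions, as the text does not specify these details, and the function names and example values are hypothetical.

```python
import statistics


def cv_percent(values):
    """CV in percent: sample standard deviation over mean, times 100."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def cv_mn(roi_scans):
    """CV of the ROI mean across scans."""
    return cv_percent([statistics.mean(scan) for scan in roi_scans])


def cv_md(roi_scans):
    """CV of the ROI median across scans."""
    return cv_percent([statistics.median(scan) for scan in roi_scans])


def cv_vw(roi_scans):
    """Voxel-wise CV: CV per voxel across scans, averaged over the ROI."""
    per_voxel = [cv_percent(voxel_values) for voxel_values in zip(*roi_scans)]
    return statistics.mean(per_voxel)


# Hypothetical ROI with three voxels, scanned twice. The ROI means agree,
# but two individual voxels differ between the scans.
scan_a = [0.80, 0.84, 0.82]
scan_b = [0.84, 0.80, 0.82]
roi_cv_mn = cv_mn([scan_a, scan_b])
roi_cv_vw = cv_vw([scan_a, scan_b])
```

In this example the voxel-wise CV is clearly above zero while the CV of the ROI mean is zero, which illustrates how averaging within a ROI hides voxel-level noise.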

A different assessment of a method's reliability is the intraclass correlation coefficient (ICC) which relates the within-subject variation to the between-subject variation:

ICC = \frac{σ^{2}_{bs}}{σ^{2}_{bs} + σ^{2}_{ws}}

where σ_bs = between-subjects standard deviation of the population and σ_ws = within-subject standard deviation for repeated measurements. The ICC expresses the fraction of the total variance in the data that is caused by true biological variation between subjects rather than by measurement error within subjects. For test–retest data of healthy controls, acquired under similar conditions, true within-subject differences will be small, and the method yielding the highest ICC will be preferable.
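As a worked illustration of the formula, the sketch below inserts the LFWM standard deviations from Table 2 for the nonlinear and the template based method. These are averages over all four scans, so the results are illustrative and are not expected to match the ICC values computed per scan pair; the function name `icc_from_sd` is hypothetical.

```python
def icc_from_sd(sd_bs, sd_ws):
    """ICC from between-subject and within-subject standard deviations."""
    var_bs = sd_bs ** 2
    var_ws = sd_ws ** 2
    return var_bs / (var_bs + var_ws)


# LFWM values taken from Table 2 (SD_bs, SD_ws).
icc_nonlinear = icc_from_sd(sd_bs=0.0613, sd_ws=0.0074)
icc_template = icc_from_sd(sd_bs=0.0227, sd_ws=0.0080)
```

The within-subject SD is similar for both methods, but the smaller between-subject SD in template space lowers the ICC, which is why a reduced between-subject variation can indicate reduced sensitivity rather than better measurement.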

Results

Visual inspection showed a very high similarity between the generated FA maps. Fig. 2 shows the same mid-axial slice from the four different scans of subject one. Detailed gyral anatomy was reliably reproduced.

ROI characteristics

The cross subject mean FA, within-subject SD and between-subject SD are summarized in Table 2. The average within-subject SD across the four different scans was always lower than the between-subject SD for all FA measures. The between-subject CV_mn ranged from 3.1% to 12.1%.

Table 2.

Group characteristics of the examined regions: ROI size, group mean FA. Average within (SD_ws) and between (SD_bs) subjects SD is shown for all four analysis methods.

Region	ROI size [cm³]	Mean FA	Affine SD_ws	Affine SD_bs	Nonlinear SD_ws	Nonlinear SD_bs	Template SD_ws	Template SD_bs	Backnormalized SD_ws	Backnormalized SD_bs
Whole brain	–	0.28	0.0038	0.0145	0.0033	0.0157	0.0034	0.0073	0.0031	0.0087
SCC	0.8	0.84	0.0148	0.0456	0.0118	0.0550	0.0105	0.0368	0.0117	0.0388
LFWM	3.5	0.48	0.0095	0.0499	0.0074	0.0613	0.0080	0.0227	0.0088	0.0273
LUF	0.3	0.39	0.0169	0.0621	0.0102	0.1035	0.0055	0.0219	0.0094	0.0270


SCC: splenium of corpus callosum, LFWM: left frontal white matter, LUF: left uncinate fascicle.

Coefficient of variation, CV

CV_mn for intra- and inter-site rescans are summarized in Fig. 3.

Comparing the examined regions, the highest CV was found for the LUF, the smallest of the three regions, and therefore most prone to partial volume effects from imperfect coregistration and interpolation. Unsurprisingly the whole brain average FA showed the lowest variation and also the least dependence on the applied coregistration method.

Comparing the different coregistration methods, in general, affine coregistration resulted in bigger variation compared to any of the nonlinear methods for most measurements. For all three regions, the CV of FA within the tract was higher than the CV of the corresponding backnormalized seed region, on average by 50%.

Fig. 4 shows the average CV across all regions for a given coregistration method. The three nonlinear methods did not differ significantly, but affine coregistration performed worse than any of the three methods including nonlinear normalization steps (nonlinear in native space, template based and backnormalized from template). The average CV from these three methods were 1.3% for intra-site 1, 1.4% for intra-site 2 and 1.9% for inter-site scan–rescan.

There was a non-significant trend toward a higher intra-site CV for site 2 and both intra-site CV were significantly lower than inter-site CV (paired T-test, p = 0.0026 and p = 0.0015). However, using nonlinear coregistration, the average inter-site CV across regions still remained very low at 1.9%.

CV for the tract volume from the three tracts is not shown in the plots; the average was 8.4% for intra-site 1, 6.2% for intra-site 2 and 7.4% for inter-site, more than 2.5 times the variation of the average FA within the tract.

Intraclass correlation coefficient, ICC

The ICC relates the within-subject variation to the between-subject variation. Results are plotted in Fig. 5 for all regions and methods. The ICC values were higher for the two normalization methods in native space (affine and nonlinear) compared to the two template based methods (template and backnormalized).

Like the CV values, ICC of all tract FA measures showed a much lower reproducibility than the corresponding ROI analyses (Fig. 6). The lowest ICC was observed for the LUF tract FA which was only 0.55 for intra-site 1 scan–rescan, compared to 0.91 for the corresponding ROI analysis.

Voxel wise comparison

For the SCC, FA maps from the four scans were compared on a voxel by voxel basis. CV derived from voxel wise comparison (CV_vw) of raw FA images were 4.2% for intra-site 1, 4.4% for intra-site 2 and 4.3% for inter-site, more than twice as large as those derived from the ROI mean value (CV_mn). This illustrates the noise in unsmoothed data at the single voxel level and the averaging effect of a ROI analysis. However, smoothing the FA maps with a 4 mm FWHM kernel before comparison reduced the CV_vw to 1.5%, 1.8% and 2.2% respectively, much closer to the ROI derived CV_mn.

Scanner differences

We found a consistent inter-site bias: FA values on site 2 were 1.0–1.5% lower than on site 1. This difference was slightly higher in areas with higher FA. Correction for this bias with a global scaling factor reduced the average inter-site CV for the nonlinear methods from 1.9% to 1.6%. This was no longer significantly different from the intra-site CV of 1.3% and 1.4% (paired t-test, p = 0.07 and p = 0.18).
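A global scaling correction of this kind could be sketched as below. How the scaling factor was derived is not described in the text; this sketch assumes it is the ratio of the mean FA at site 1 to the mean FA at site 2 over matched measurements. The function names and the example values, including the 1.2% bias, are hypothetical.

```python
import statistics


def global_scaling_factor(site1_values, site2_values):
    """Ratio of mean FA at site 1 to mean FA at site 2 (assumed definition)."""
    return statistics.mean(site1_values) / statistics.mean(site2_values)


def correct_site2(site2_values, factor):
    """Rescale site 2 FA values to the level of site 1."""
    return [value * factor for value in site2_values]


# Hypothetical matched ROI means with site 2 reading 1.2% lower.
site1_fa = [0.84, 0.48, 0.39]
site2_fa = [value * 0.988 for value in site1_fa]
factor = global_scaling_factor(site1_fa, site2_fa)
site2_corrected = correct_site2(site2_fa, factor)
```

A single factor removes a proportional offset between scanners but cannot account for a bias that varies with FA, such as the slightly larger difference reported for areas with higher FA.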