Abstract
Magnetization transfer imaging (MT) may have considerable promise for early detection and monitoring of subtle brain changes before they are apparent on conventional magnetic resonance images. At 3 Tesla (T), MT affords higher resolution and increased tissue contrast associated with macromolecules. The reliability and reproducibility of a new high-resolution MT strategy were assessed in brain images acquired from 9 healthy subjects. Repeated measures were taken for 12 brain regions of interest (ROIs): genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Spearman's correlation coefficient, coefficient of variation, and intraclass correlation coefficient (ICC) were computed. Multivariate mixed-effects regression models were used to fit the mean ROI values and to test the significance of the effects due to region, subject, observer, time, and manual repetition. A sensitivity analysis of various model specifications and the corresponding ICCs was conducted. Our statistical methods may be generalized to many similar evaluative studies of the reliability and reproducibility of various imaging modalities.
1. Introduction
Magnetization transfer (MT) imaging is a quantitative approach for detecting subtle or occult abnormalities in brain tissue. In previous studies, the Magnetization Transfer Ratio (MTR), an index of MT imaging, was sensitive to brain changes in patients with mild cognitive impairment, an Alzheimer's disease prodrome [1, 2], to new lesions in patients with multiple sclerosis, [3] and to changes associated with progression in chronic neurological disorders [4]. The higher magnetic field strength afforded by 3T allows MT image resolution to be augmented compared with conventional MT acquisition at 1.5T [5–7]. We developed a high resolution MT technique to detect subtle changes in anatomically small, functionally eloquent brain structures. The increased field strength affords whole-brain coverage with considerably thinner slices, potentially reducing partial volume artifacts. However, even among healthy subjects, numerous factors may introduce variability in measures derived from magnetic resonance (MR) data, such as static field B 0 signal dropout and RF nonuniformity. Measurement variation may be introduced by scan repetitions, repositioning at different time points, and image post-processing. Moreover, 3T may be susceptible to variation associated with increased field strength [8]. Such variability may pose limitations when conducting clinical comparisons to differentiate normal and diseased brains or in developing statistically predictive algorithms.
To validate high resolution MT for detecting early disease or for monitoring progression in chronic neurological disease, it is necessary to collect information on normative values and to evaluate the reliability and reproducibility of the measurements when measured across time in healthy controls. This investigation evaluated observer-agreement of high-resolution MT measurements determined from repeated brain scans of 9 healthy volunteers. We postulated that MT values would remain stable during the one month study interval. We evaluated the reliability and reproducibility of the high resolution MT measurements in 12 brain regions of interest (ROIs), applied statistical measures to the data and used complex multivariate mixed-effects models to test the statistical significance of several effects due to region, subject, observer, time, and manual repetition.
2. Materials and Methods
2.1. Study Subjects
The study was approved by the IRB at the North Shore University Health System, and conducted following the ethical principles outlined in the Declaration of Helsinki. Eleven healthy adult volunteers were randomly selected from a database maintained at the Center for Advanced Imaging, Radiology Department, NorthShore University Health System provided written informed consent and evaluated for eligibility criteria. To protect the subjects' confidentiality, all data were de-identified and handled according to the guidelines specified by the Health Insurance Portability and Accountability Act (HIPAA) in the USA.
2.2. Image Acquisition
Brain images were acquired using a 3T General Electric (GE) HDx system (Waukesha, WI, USA). Each volunteer was scanned twice in a randomly-selected time interval between 1 to 4 weeks. Methods for reducing random errors in image acquisition included the use of a body-coil for excitation to control B1 non-uniformities and an 8-channel quadrature receive-only coil [9]. MT pulses with (M s) and without saturation (M 0) were applied at an offset frequency from water resonance. To accelerate the scan for whole-brain coverage, while maintaining thin slices, the image protocol was optimized based on 3T using 3D SPGR [5]. The Gaussian Sinc MT pulse was applied in 8 ms at a 1200 HZ offset. The stability of the scanner and set-up procedure were addressed with a fixed set of parameters per subject. MT pulse was based on a three-dimensional spoiled gradient recalled (3D SPGR) acquisition. The image protocol included the following parameters: TR 34 to 35 ms, TE 4 to 8 ms, imaging FA 5°, bandwidth 15.6 kHz, 0.75 NEX, phase FOV 0.75, voxel dimensions 0.9 × 0.9 × 0.9 ~ 1.3 mm3. The whole brain was covered in 90 to 140 slices with acquisition time ranging from 7 minutes 40 seconds to 10 minutes 20 seconds using a partial k-space acquisition.
2.3. Image Analysis
MTR maps were generated off-line on a General Electric AW Workstation (General Electric, Milwaukee, WI, USA) using the standard equation:
(1) |
where M S and M 0 were the signal intensities in a given voxel obtained, with and without the MT saturation pulse, respectively. MTR maps generated based on the high resolution MT are demonstrated in Figure 1. The 12 ROIs were: genu, splenium, left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Figure 2 illustrated the 12 ROIs that were investigated. Each ROI was sized approximately 30 to 43 mm2 and manually and independently placed by Observers 1 and 2 (Authors S.S. and Y.W.) following procedures in classical and standard agreement studies [10]. After an initial consensus decision was drawn regarding the sizes and locations of the 12 ROIs, the observers performed manual segmentations of the ROI independently on each set of images. This ROI placement procedure was repeated by each observer in the following week.
MTR values were extracted using the manually-defined ROIs with the combinations of observer, time point, and repetition (Table 1). The mean and SDs of the ROI values were calculated. Meta-data were stored in a SAS 9.1 (SAS, Cary, NC, USA) dataset, with individual volunteer identification numbers withheld and replaced by a sequence of 1 to 9 for each subject.
Table 1.
Outcome Variable Y ijkln | Effect in the Variance-Component Analysis | Type of Effect | Mathematical Symbol | Index | Maximum of the Index |
---|---|---|---|---|---|
Mean ROI Value via Manual Segmentations | Subject | Random | S i | i = i,…, I | I = 9 |
Observer | Fixed or Random | O j | j = 1,…, J | J = 2 | |
Time Point | Fixed or Random | T k | k = 1,…, K | K = 2 | |
Repetition | Fixed or Random | R l | l = 1,…, L | L = 2 | |
Region of Interest | Fixed | K m | m = 1,…, M | M = 12 | |
Interaction Terms | Generally Mixed | {S i; O j; T k; R l; K m} | {i; j; k; l; n} | Based on the Appropriate Model Specification |
2.4. Statistical Methods
Statistical analyses were performed using SAS 9.1 (SAS Institute, Cary, NC, USA; http://www.sas.com). The SAS analytic procedures conducted included “Proc Univariate,” “Proc Means,” “Proc Corr,” and “Proc Mixed.” Bar diagrams were constructed using Microsoft Excel (http://www.microsoft.com). Age and gender were not controlled for in analyses.
2.4.1. Descriptive Statistics
Let Y = Y ijklm having the indices described in Table 1 be a random variable representing the mean ROI value. For the mth ROI, we first computed the sample mean and standard deviation of all mean ROI values:
(2) |
where N m = I × J × K × L = 9 × 23 = 72 measurements and the operator “•” means the marginal sum over the particular index.
The 95-percentile normality range was approximately within the following interval, with the following lower and upper bounds:
(3) |
The term “normality range” as used in Europe, could be arbitrarily-defined according to the number of standard deviations away from the mean [11]. Thus, it should not be viewed as the range of the entire dataset, but rather an interval useful for estimating the population value by one or several standard deviations away from the mean. Here the critical value of 2 was chosen as recommended by Bland and Altman [12].
Additionally, we justified using a Student's t-distribution with N m − 1 = 71 degrees of freedom. For any tail probability of α/2 (e.g., 0.025 for a 95-percent normality range), we used the quantile of the corresponding to particular t-distribution, such that
(4) |
This value happened to be close to the recommended multiplier of 2. Therefore, we rounded it to 2 in (3) for convenience.
2.4.2. Concordance Using Spearman's Rank Coefficient Coefficients
We first explored and measured the concordance between the various measurements fully nonparametrically via Spearman's rank correlation coefficient. Suppose that we correlated the ROI values by Observers j = 1 and j′ = 2, then denoted the marginal ranks, R ijklm = ranki(Y ijklm) and R ij′klm = ranki′(Y i′jklm), respectively, for all j ≠ j′ with j = 1 and j′ = 2. The sample version of Pearson's product-moment correlation coefficient between the ranks of the data was equivalent to Spearman's rank correlation coefficient [13]:
(5) |
where 𝒜 denotes (∑l = 1 2∑k = 1 2∑i = 1 9 R i1klm)2 and ℬ denotes (∑l = 1 2∑k = 1 2∑i = 1 9 R i2klm)2.
Assuming that there was no presence of any ties since the ROI values were of continuous random variables, the Spearman's rank correlation coefficient between Observers j and j′ was
(6) |
where the difference of an arbitrary pair of marginal ranks for Observer j and j′ was denoted by D i•klm = R ijklm − R ij′klm, for all j ≠ j′. Consequently, all of the raw mean ROI values were converted to their marginal ranks and the differences between the ranks of each observation on the two variables were computed. Spearman's rank correlation coefficient was also computed for the ROI values between any two different time points k = 1 and k′ = 2.
The strength of the concordance and the benchmark values have been discussed [14]. Bar diagrams were made to display the Spearman's rank correlation coefficients between observers or time points for each ROI.
2.4.3. Reproducibility Using Coefficients of Variations
We used the normalized measure of dispersion of a distribution to evaluate the reproducibility of the measurement [15]. The measure was the coefficient of variation (CV), defined as the ratio of the SD to the mean.
(7) |
where both the numerator (i.e., sample SD) and the denominators (i.e., sample mean) in the above expression for CV are provided in (2). Skewed data, such as those generated by an exponential distribution for which the underlying population mean and standard deviation would be equal, and thus the CV became 1. Hence, CV < 1 would generally represent low variability, and CV > 1 would represent high variability. As in (4) and (6), further stratified computations of CV for different observers, time point, or repetitions were achieved using formulae similar to (7).
2.4.4. Normality and Significance Tests for the Effects via a Multivariate Regression Analysis
As overall variability was likely a result of the effects illustrated in Table 1. We employed a multivariate mixed-effects regression analysis to direct model the ROI values.
A variance-component approach has advantages over many stratified analyses, especially studying studies with a limited sample size. Here, because of the novel imaging modality using MT and 3T acquisitions with labor-intensive manual segmentation procedures, large number of subjects would not have been feasible. To conduct an analysis of variance (ANOVA) based on the various effects, a distributional assumption of normality was necessary and convenient. Therefore, we conducted marginal normality tests using the Shapiro-Wilk test [16]. We would demonstrate (see Section 3.4) that the normality assumption was generally satisfactory.
Thus, we could then consider adopting a linear random-effects model with all pair-wise interactions, in addition to a third-order interaction term:
(8) |
The effects represented the following: μ m as intercept, S i as subjects, O i as observers, T i as time points, R i as repetitions, and ε ijklm as the error team. A random-effects model assumed that each of the effects would have independent normal distributions with mean and variance.
If normality had failed and because the data were mean ROI values that were positively-valued, we would recommend a Box-Cox transformation, h(Y ijklm, λ), of the outcome variable with an optimal power coefficient λ [17–19]. Note that the log-normal becomes a special case when the power coefficient λ = 0. This normality transformation is given by:
(9) |
A profile log-likelihood, llik of λ given the observations y ijklm, would be maximized to estimate an optimal Box-Cox transformation via a nonlinear minimization routine, where the log-likelihood was
(10) |
where c was a constant free of the power coefficient to be optimized.
Due to the limited number of subjects, however, even with an optimal normality transformation, over-fitting and non-convergence might be issues. Alternatively, we could regard all of the observers, time points, and repetitions as fixed and specify a mixed-effects model. The significances of the sources of variability were tested via a restricted maximum likelihood (REML) approach. For our multivariate analysis, the significance threshold for two-tailed P-values was set if P ≤ .05.
2.4.5. Interobserver Reliability Using the ICCs
Stratified by the time points within each ROI, a two-way ANOVA was performed by regarding all of the observers, time points, and repetitions as fixed. We specified a mixed-effects model for simplicity. Due to the complexity of the variance components, we instead adopted a hybrid approach by considering two effects at once. For example, all subjects were segmented by the same observers who were from an entire population of observers. In other words, the subject effect was always assumed to be random, while the remaining effect (e.g., here the observer) was assumed to be fixed. We computed the Case-3 ICCs, accordingly [20].
We simplified our notations by only keeping the indices for the subject and observer effects of interest. We decomposed the data as follows:
(11) |
where the subject effect S i was assumed to be random in an upper-case letter, which had a normal distribution with mean 0 and variance σ S 2, for all i = 1,…, I (here I = 9); the observer effect o j was considered to be a fixed effect in a lower-case letter, with the constraint ∑j = 1 J o j = 0, with the corresponding parameter to the variance being θ o 2 = (1/(J − 1))∑j = 1 J o j 2, for all j = 1,…, J (here J = 2); the interaction term between the subject and the observer S i × o j was the degree to which the jth observer departed from his or her usual rating tendencies for the ith subject, which had a normal distribution with a mean of 0 and variance σ S×o 2; the errors terms ε ij were assumed to have an independent and identical distribution (iid) normal distribution with a mean of 0 and variance σ E 2. For the same ith subject, the effects are further assumed to be subjected to the constraint ∑j = 1 J(S×o)ij = 0 over all of the observers. The corresponding two-way ANOVA table was listed (Table 3).
Table 3.
Source of Variation | Degrees of Freedom | Mean Squares | |
---|---|---|---|
(A) Between Subjects | I − 1 | BSMS | J σ S 2 + σ E 2 |
(B) Within Subjects | I(J − 1) | WSMS | θ o 2 + J σ S×o 2/(J − 1) + σ E 2 |
(B.1) Between Observers | J − 1 | OMS | I θ o 2 + J σ S×o 2/(J − 1) + σ E 2 |
(B.2) Error | (I − 1)(J − 1) | EMS | J σ S×o 2/(J − 1) + σ E 2 |
Note: BSMS: Between Subjects Mean Squares; WSMS: Within Subject Mean Squares; OMS: Observer Mean Squares; EMS: Error Mean Squares.
Shrout and Fleiss gave the true definition of ICC using the variance ratio of the subject variance over the total variance, with its estimated version using the quantities via ANOVA (Table 3) [19]:
(12) |
2.4.6. Intraobserver Reliability Using the ICCs
Similar to the analysis described above, we adopted a hybrid approach by considering two effects at once, with the subject effect always assumed to be random and the time point assumed to be fixed. The associate model was given by
(13) |
As in (12), the estimated intraobserver agreement and its estimate were provided by:
(14) |
where the interaction term the interaction term between the subject and the time S i × t k had a normal distribution with a mean of 0 and variance σ S×t 2.
2.4.7. Sensitivity Analyses of the ICCs under Various Models
We performed a sensitivity analysis by computing 6 different ICC values Shrout and Fleiss previously proposed assumptions for ICCs (Table 4) [18]. A SAS macro, written by Professor Robert Hamer, University of North Carolina School of Medicine, Chapel Hill, NC, USA (http://www.bios.unc.edu/~hamer), was run to perform the various ICC computations.
Table 4.
Notation for the ICC Measure | Multivariate Modeling Assumptions |
---|---|
ICC(1,1) | Each subject is rated by multiple observers; the observers are assumed to be randomly assigned to the subjects; all subjects have the same number of observers. |
ICC(2,1) | All subjects are rated by the same observers who are assumed to be a random subset of all possible observers. |
ICC(3,1) | All subjects are rated by the same observers who are assumed to be the entire population of observers. |
ICC(1,2) | Same assumptions as ICC(1,1) but reliability for the mean of 2 ratings. |
ICC(2,2) | Same assumptions as ICC(2,1) but reliability for the mean of 2 ratings. |
ICC(3,2) | Same assumptions as ICC(3,1) but reliability for the mean of 2 ratings. Assumes additionally there is no subject × observer interaction. |
3. Results
3.1. Descriptive Statistics
Eleven healthy adults provided written informed consent to be evaluated and 9 underwent brain scans. Mean age of participants who received scans was 37.9 ± 14.2 years; 7 participants were men and 2 were women.
The mean ROI values varied across different region (Table 5). The left and right hemispheres tended to yield similar results when the average over these healthy subjects was considered.
Table 5.
Region of Interest | Descriptive Statistics (Mean ± SD) | 95% Normality Range (Mean ± 2 × SD) |
---|---|---|
Genu | 77.0 ± 1.0 | 75.0–79.0 |
Splenium | 72.8 ± 1.5 | 69.9–75.7 |
Left Hippocampus | 51.5 ± 2.5 | 46.6–56.4 |
Left Caudate | 59.5 ± 2.2 | 55.2–63.8 |
Left Putamen | 62.0 ± 2.0 | 58.1–65.9 |
Left Thalamus | 61.6 ± 2.3 | 57.1–66.1 |
Left Cerebral White Matter | 73.2 ± 1.2 | 70.8–75.6 |
Right Hippocampus | 52.0 ± 3.3 | 45.5–58.5 |
Right Caudate | 61.3 ± 1.7 | 58.0–64.6 |
Right Putamen | 62.8 ± 1.5 | 59.9–65.7 |
Right Thalamus | 61.1 ± 2.5 | 56.2–66.0 |
Right Cerebral White Matter | 73.0 ± 1.3 | 70.5–75.5 |
Note: Results were pooled among all 72 observations within each region of interest. SD: standard deviation.
3.2. Concordance Using Spearman's Rank Coefficient Coefficients
Spearman's rank correlation coefficients showed that a majority of correlations within each observer was above 0.5, suggesting a moderate to high concordance (Figure 3). Time point 2 tended to yield higher concordance between the observers, which suggested a possible learning effect over time (Figure 4). Due to limited sample sizes in this pilot study, in Figures 3 and 4, we demonstrated the effect of observers by averaging over repetitions by each observer. Similarly, we demonstrated the effect of time points by averaging over repetitions at each time point.
3.3. Reproducibility Using Coefficients of Variations
Overall, CVs ranged from 1.2% in the genu for Observer 2 to 7.0% in the right hippocampus for Observer 1 (Table 6). Since all of the CVs were within 7%, that is, all CVs were less than 10%, the reproducibility was reasonably high.
Table 6.
Region of Interest | Observer 1 | Observer 2 | ||
---|---|---|---|---|
Mean ± SD (N = 36) | CV (%) | Mean ± SD (N = 36) | CV (%) | |
Genu | 76.9 ± 1.0 | 1.3 | 77.1 ± 0.9 | 1.2 |
Splenium | 73.1 ± 1.4 | 1.9 | 72.6 ± 1.5 | 2.1 |
Left Hippocampus | 51.3 ± 2.4 | 4.7 | 51.6 ± 2.7 | 5.2 |
Left Caudate | 59.7 ± 1.9 | 3.2 | 59.3 ± 2.5 | 4.2 |
Left Putamen | 61.9 ± 2.2 | 3.6 | 62.1 ± 1.9 | 3.1 |
Left Thalamus | 59.9 ± 1.5 | 2.5 | 63.3 ± 1.7 | 2.7 |
Left Cerebral White Matter | 73.3 ± 1.3 | 1.8 | 73.1 ± 1.2 | 1.6 |
Right Hippocampus | 52.5 ± 3.7 | 7.0 | 51.5 ± 2.7 | 5.2 |
Right Caudate | 61.2 ± 1.9 | 3.1 | 61.5 ± 1.4 | 2.3 |
Right Putamen | 62.7 ± 1.5 | 2.4 | 62.8 ± 1.5 | 2.4 |
Right Thalamus | 59.7 ± 1.7 | 2.8 | 62.5 ± 2.5 | 4.0 |
Right Cerebral White Matter | 73.2 ± 1.2 | 1.6 | 72.8 ± 1.4 | 1.9 |
Note. SD: standard deviation.
3.4. Normality and Significance Tests via a Multivariate Analysis
The tests of the normal distribution assumption marginally using the Shapiro-Wilk test indicated that only occasionally (e.g., for left caudate, left and right putamen, and right hippocampus), this assumption was not met (see Table 7). Therefore, it was reasonable to specify linear mixed-effects modeling and two-way ANOVA reported in Sections 3.5 and 3.6.
Table 7.
Region of Interest | P-value | P-value | ||
---|---|---|---|---|
Time Point 1 | Time Point 2 | |||
Observer 1 | Observer 2 | Observer 1 | Observer 2 | |
Genu | .29 | .17 | .70 | .36 |
Splenium | .31 | .06 | .93 | .61 |
Left Hippocampus | .14 | .81 | .45 | >.99 |
Left Caudate | .97 | <.0001a | .49 | .92 |
Left Putamen | .20 | .06 | .01a | .01a |
Left Thalamus | .86 | .51 | .63 | .13 |
Left Cerebral White Matter | .82 | .43 | .21 | .02 |
Right Hippocampus | .54 | .86 | .01a | .58 |
Right Caudate | .49 | .80 | .60 | .89 |
Right Putamen | .07 | .003a | .25 | .03a |
Right Thalamus | .50 | .68 | .82 | .13 |
Right Cerebral White Matter | .79 | .78 | .16 | .54 |
aNormal distribution was not met.
3.5. Interobserver Reliability Using the ICCs
At time point 1, ICCs were greater than 0.7 in regions of genu, left and right putamen, whereas ICCs were from 0.5 to 0.7 in regions of splenium, left and right hippocampus, left caudate, and right cerebral white matter (Table 8). These results indicated moderate to strong interobserver reliability. In comparison, at time point 2, ICCs were greater than 0.7 in regions of genu, splenium, left and right caudate, putamen and cerebral white matter, and left hippocampus and thalamus, while ICCs were from 0.5 to 0.7 in right hippocampus and thalamus. These results suggested a learning effect over time. However, for some ROIs such as the left cerebral white matter, right caudate, right thalamus, ICCs increased from 0.2 (at time point 1) to 0.9 (at time point 2), making it difficult to determine whether this represents a learning effect.
Table 8.
Region of Interest | Inter-Reader ICC | Inter-Reader ICC |
---|---|---|
Time Point 1 | Time Point 2 | |
Genu | 0.866 | 0.726 |
Splenium | 0.537 | 0.758 |
Left Hippocampus | 0.693 | 0.796 |
Left Caudate | 0.580 | 0.902 |
Left Putamen | 0.869 | 0.962 |
Left Thalamus | 0.410 | 0.855 |
Left Cerebral White Matter | 0.378 | 0.929 |
Right Hippocampus | 0.653 | 0.656 |
Right Caudate | 0.209 | 0.872 |
Right Putamen | 0.725 | 0.882 |
Right Thalamus | 0.264 | 0.572 |
Right Cerebral White Matter | 0.637 | 0.896 |
3.6. Intraobserver Reliability Using the ICCs
At each time point, intraobserver agreement was at least 0.5 for a majority of the regions (Table 9).
Table 9.
Region of Interest | Intraobserver ICC | Intraobserver ICC |
---|---|---|
Observer 1 | Observer 2 | |
Genu | 0.537 | 0.555 |
Splenium | 0.598 | 0.756 |
Left Hippocampus | 0.520 | 0.596 |
Left Caudate | 0.709 | 0.362 |
Left Putamen | 0.940 | 0.784 |
Left Thalamus | 0.479 | 0.622 |
Left Cerebral White Matter | 0.560 | 0.703 |
Right Hippocampus | 0.411 | 0.826 |
Right Caudate | 0.473 | 0.436 |
Right Putamen | 0.659 | 0.657 |
Right Thalamus | 0.687 | 0.308 |
Right Cerebral White Matter | 0.570 | 0.770 |
3.7. Sensitivity Analyses of the ICCs under Various Models
Six different methods for generating ICCs exhibited similar patterns for high vs. low reliability results in different ROIs (Table 10). Thus, reliability appeared to be sensitive to ROI.
Table 10.
Region of Interest | ICC (1,1) | ICC (2,1) | ICC (3, 1) | ICC (1, 2) | ICC (2, 2) | ICC (3, 2) |
---|---|---|---|---|---|---|
Interobserver ICC at Time 1 | ||||||
| ||||||
Genu | 0.870 | 0.879 | 0.866 | 0.931 | 0.935 | 0.928 |
Splenium | 0.497 | 0.463 | 0.537 | 0.664 | 0.633 | 0.699 |
Left Hippocampus | 0.653 | 0.605 | 0.693 | 0.790 | 0.754 | 0.819 |
Left Caudate | 0.562 | 0.542 | 0.580 | 0.719 | 0.703 | 0.734 |
Left Putamen | 0.871 | 0.874 | 0.869 | 0.931 | 0.933 | 0.930 |
Left Thalamus | −0.015 | 0.114 | 0.410 | −0.030 | 0.205 | 0.581 |
Left Cerebral White Matter | 0.382 | 0.385 | 0.378 | 0.553 | 0.556 | 0.549 |
Right Hippocampus | 0.660 | 0.669 | 0.653 | 0.795 | 0.802 | 0.790 |
Right Caudate | 0.178 | 0.180 | 0.209 | 0.302 | 0.306 | 0.346 |
Right Putamen | 0.725 | 0.732 | 0.720 | 0.840 | 0.845 | 0.837 |
Right Thalamus | −0.092 | 0.079 | 0.264 | −0.202 | 0.146 | 0.417 |
Right Cerebral White Matter | 0.630 | 0.621 | 0.637 | 0.773 | 0.766 | 0.779 |
| ||||||
Interobserver ICC at Time 2 | ||||||
| ||||||
Genu | 0.722 | 0.715 | 0.726 | 0.838 | 0.834 | 0.841 |
Splenium | 0.758 | 0.757 | 0.758 | 0.862 | 0.862 | 0.863 |
Left Hippocampus | 0.792 | 0.785 | 0.796 | 0.884 | 0.880 | 0.886 |
Left Caudate | 0.905 | 0.909 | 0.902 | 0.950 | 0.952 | 0.949 |
Left Putamen | 0.961 | 0.959 | 0.962 | 0.980 | 0.979 | 0.980 |
Left Thalamus | 0.297 | 0.239 | 0.855 | 0.458 | 0.385 | 0.922 |
Left Cerebral White Matter | 0.928 | 0.926 | 0.929 | 0.963 | 0.962 | 0.963 |
Right Hippocampus | 0.640 | 0.620 | 0.656 | 0.781 | 0.765 | 0.793 |
Right Caudate | 0.876 | 0.884 | 0.872 | 0.934 | 0.938 | 0.932 |
Right Putamen | 0.884 | 0.887 | 0.882 | 0.938 | 0.940 | 0.937 |
Right Thalamus | 0.419 | 0.347 | 0.572 | 0.591 | 0.516 | 0.728 |
Right Cerebral White Matter | 0.889 | 0.876 | 0.896 | 0.941 | 0.934 | 0.945 |
4. Conclusions and Discussion
We present mathematical methods for MT brain images using 3-T high resolution. Our image analysis may provide useful pilot information for future investigations. These mathematical and statistical methods may easily be generalized to practical studies with larger sample sizes or to studies of patients with active disease.
We acquired repeat brain measurements based on a high resolution MT imaging protocol at 3T in 9 healthy adults. Our results indicate moderate to high reproducibility, supporting the validity of this method for further studies. Overall, higher intraobserver reliability was observed at the second time point than that at the initial time point, suggesting a possible learning curve effect for both observers. Interobserver reliability was generally lower than intraobserver variability, suggesting a strong observer effect in this comparison, which may be a factor in future investigations using MT imaging.
Our analyses examined different aspects in a typical observer-agreement study, using measures for concordance, reproducibility, reliability, variance-component analysis, and multivariate analysis. In other studies, all or some of such methods may be considered. However, with a simpler study of either several observers, or one observer with several repetitions at different sessions or time points, then these scenarios may only require several of our methods. Only a small sample of healthy volunteers was evaluated in this initial pilot study. Therefore, the generalization of the 95-percentile normality range may be limited with respect to the wider spectrum of brain mechanisms represented in the broader population. For instance, demonstrating summary measures using all possible observer and time point combinations may not lead to meaningful interpretations in all cases. Nevertheless, since the technology is new, this research may provide useful pilot information for future investigations. Moreover, the statistical methods employed and illustrated here may easily be generalized to studies with larger sample sizes and diseased subjects.
Another limitation was that this study aimed to evaluate only the reproducibility and reliability, rather than the accuracy in a more comprehensive validation study. In the absence of a true gold standard, such as one based on digital phantoms where realistic variability may still not be simulated, or on histopathology, improved reliability may not be equated with improved accuracy [21]. Both sensitivity and specificity are of interest. Further research would benefit from a useful algorithm to perhaps statistically and optimally estimate the underlying spatial “ground truth” [22, 23].
Finally, future research may be directed to evaluating the diagnostic utility of high resolution MT for early detection of Alzheimer's disease, multiple sclerosis or other neurological disorders and for monitoring progression across the clinical course.
Table 2.
Absolute Value of the Correlation Coefficient | Strength of the Concordance Between Samples |
---|---|
0.0 | No |
0.2 | Weak |
0.5 | Moderate |
0.8 | Strong |
1.0 | Perfect |
Table 11.
Region of Interest | ICC (1,1) | ICC (2,1) | ICC (3, 1) | ICC (1, k) | ICC (2, k) | ICC (3, k) |
---|---|---|---|---|---|---|
Intraobserver for Observer 1 | ||||||
| ||||||
Genu | 0.537 | 0.537 | 0.537 | 0.699 | 0.699 | 0.699 |
Splenium | 0.590 | 0.579 | 0.598 | 0.742 | 0.733 | 0.749 |
Left Hippocampus | 0.531 | 0.544 | 0.520 | 0.694 | 0.705 | 0.684 |
Left Caudate | 0.704 | 0.696 | 0.709 | 0.826 | 0.821 | 0.830 |
Left Putamen | 0.942 | 0.946 | 0.940 | 0.970 | 0.972 | 0.969 |
Left Thalamus | 0.481 | 0.484 | 0.479 | 0.650 | 0.653 | 0.647 |
Left Cerebral White Matter | 0.550 | 0.539 | 0.560 | 0.710 | 0.701 | 0.718 |
Right Hippocampus | 0.426 | 0.439 | 0.411 | 0.597 | 0.610 | 0.582 |
Right Caudate | 0.470 | 0.467 | 0.473 | 0.640 | 0.637 | 0.643 |
Right Putamen | 0.657 | 0.654 | 0.659 | 0.793 | 0.791 | 0.795 |
Right Thalamus | 0.696 | 0.711 | 0.687 | 0.821 | 0.831 | 0.814 |
Right Cerebral White Matter | 0.582 | 0.596 | 0.570 | 0.736 | 0.747 | 0.727 |
| ||||||
Intraobserver ICC for Observer 2 | ||||||
| ||||||
Genu | 0.563 | 0.572 | 0.555 | 0.720 | 0.728 | 0.714 |
Splenium | 0.760 | 0.767 | 0.756 | 0.864 | 0.868 | 0.861 |
Left Hippocampus | 0.607 | 0.623 | 0.596 | 0.756 | 0.767 | 0.747 |
Left Caudate | 0.365 | 0.367 | 0.362 | 0.535 | 0.537 | 0.531 |
Left Putamen | 0.790 | 0.800 | 0.784 | 0.883 | 0.889 | 0.879 |
Left Thalamus | 0.632 | 0.645 | 0.622 | 0.774 | 0.784 | 0.767 |
Left Cerebral White Matter | 0.712 | 0.726 | 0.703 | 0.832 | 0.841 | 0.826 |
Right Hippocampus | 0.829 | 0.835 | 0.826 | 0.907 | 0.910 | 0.905 |
Right Caudate | 0.432 | 0.429 | 0.436 | 0.603 | 0.601 | 0.607 |
Right Putamen | 0.667 | 0.682 | 0.657 | 0.800 | 0.811 | 0.793 |
Right Thalamus | 0.298 | 0.294 | 0.308 | 0.459 | 0.455 | 0.471 |
Right Cerebral White Matter | 0.777 | 0.789 | 0.770 | 0.875 | 0.882 | 0.870 |
Acknowledgments
None of the authors on this study had any conflict of interest. This study was partially supported by research Grants 1R01MH080636-01A2, NorthShore University Health System Pilot Grant EH07-267 and Alzheimer's Drug Discovery Foundation (ISOA 271222). The authors are grateful for the assistance of Fiona Malone and Yuyuan Ouyang. In addition, they acknowledge with thanks for the SAS macro for computing various ICCs, developed by Dr. Robert M. Hamer, Professor of Psychiatry and Research Professor of Biostatistics, University of North Carolina School of Medicine, Chapel Hill, NC, USA. Dr. DeTora is a paid employee of Novartis Vaccines and Diagnostics, Cambridge MA, USA.
References
- 1.Kabani NJ, Sled JG, Shuper A, Chertkow H. Regional magnetization transfer ratio changes in mild cognitive impairment. Magnetic Resonance in Medicine. 2002;47(1):143–148. doi: 10.1002/mrm.10028. [DOI] [PubMed] [Google Scholar]
- 2.van der Flier WM, van den Heuvel DMJ, Weverling-Rijnsburger AWE, et al. Magnetization transfer imaging in normal aging, mild cognitive impairment, and Alzheimer’s disease. Annals of Neurology. 2002;52(1):62–67. doi: 10.1002/ana.10244. [DOI] [PubMed] [Google Scholar]
- 3.Agosta F, Rovaris M, Pagani E, Sormani MP, Comi G, Filippi M. Magnetization transfer MRI metrics predict the accumulation of disability 8 years later in patients with multiple sclerosis. Brain. 2006;129(10):2620–2627. doi: 10.1093/brain/awl208. [DOI] [PubMed] [Google Scholar]
- 4.Chen JT, Collins DL, Atkins HL, et al. Magnetization transfer ratio evolution with demyelination and remyelination in multiple sclerosis lesions. Annals of Neurology. 2008;63(2):254–262. doi: 10.1002/ana.21302. [DOI] [PubMed] [Google Scholar]
- 5.Cercignani M, Symms MR, Ron M, Barker GJ. 3D MTR measurement: from 1.5 T to 3.0 T. NeuroImage. 2006;31(1):181–186. doi: 10.1016/j.neuroimage.2005.11.028. [DOI] [PubMed] [Google Scholar]
- 6.Helms G, Draganski B, Frackowiak R, Ashburner J, Weiskopf N. Improved segmentation of deep brain grey matter structures using magnetization transfer (MT) parameter maps. NeuroImage. 2009;47(1):194–198. doi: 10.1016/j.neuroimage.2009.03.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Wu Y, Storey P, Carrillo A, et al. Whole brain and localized magnetization transfer measurements are associated with cognitive impairment in patients infected with human immunodeficiency virus. American Journal of Neuroradiology. 2008;29(1):140–145. doi: 10.3174/ajnr.A0740. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Edelman RR. MR imaging of the pancreas: 1.5T versus 3T. Magnetic Resonance Imaging Clinics of North America. 2007;15(3):349–353. doi: 10.1016/j.mric.2007.06.005. [DOI] [PubMed] [Google Scholar]
- 9.Tofts PS, Steens SCA, Cercignani M, et al. Sources of variation in multi-centre brain MTR histogram studies: body-coil transmission eliminates inter-centre differences. Magnetic Resonance Materials in Physics, Biology and Medicine. 2006;19(4):209–222. doi: 10.1007/s10334-006-0049-8. [DOI] [PubMed] [Google Scholar]
- 10.Graham P. Modelling covariate effects in observer agreement studies: the case of nominal scale agreement. Statistics in Medicine. 1995;14(3):299–310. doi: 10.1002/sim.4780140308. [DOI] [PubMed] [Google Scholar]
- 11.Filipović SR, Kostić VS. Utility of auditory P300 in detection of presenile dementia. Journal of the Neurological Sciences. 1995;131(2):150–155. doi: 10.1016/0022-510x(95)00093-h. [DOI] [PubMed] [Google Scholar]
- 12.Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 1986;1(8476):307–310. [PubMed] [Google Scholar]
- 13.Hettmansperger TS. Statistical Inference Based on Ranks. Malabar, Fla, USA: Krieger; 1991. [Google Scholar]
- 14.Zou KH, Tuncali K, Silverman SG. Correlation and simple linear regression. Radiology. 2003;227(3):617–622. doi: 10.1148/radiol.2273011499. [DOI] [PubMed] [Google Scholar]
- 15.Zijdenbos AP, Dawant BM, Margolin RA, Palmer AC. Morphometric analysis of white matter lesions in MR images: method and validation. IEEE Transactions on Medical Imaging. 1994;13(4):716–724. doi: 10.1109/42.363096. [DOI] [PubMed] [Google Scholar]
- 16.Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples) Biometrika. 1965;52:591–611. [Google Scholar]
- 17.Box GEP, Cox DR. An analysis of transformations. Journal of the Royal Statistical Society. Series B. 1964 ;26:211–252. [Google Scholar]
- 18.Zou KH, O’Malley AJ. A Bayesian hierarchical non-linear regression model in receiver operating characteristic analysis of clustered continuous diagnostic data. Biometrical Journal. 2005;47(4):417–427. doi: 10.1002/bimj.200310158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.O’Malley AJ, Zou KH. Bayesian multivariate hierarchical transformation models for ROC analysis. Statistics in Medicine. 2006;25(3):459–479. doi: 10.1002/sim.2187. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin. 1979;86(2):420–428. doi: 10.1037//0033-2909.86.2.420. [DOI] [PubMed] [Google Scholar]
- 21.Zou KH, Wells WM, III, Kikinis R, Warfield SK. Three validation metrics for automated probabilistic image segmentation of brain tumours. Statistics in Medicine. 2004;23(8):1259–1282. doi: 10.1002/sim.1723. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging. 2004;23(7):903–921. doi: 10.1109/TMI.2004.828354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Warfield SK, Zou KH, Wells WM. Validation of image segmentation by estimating rater bias and variance. Philosophical Transactions of the Royal Society A. 2008;366(1874):2361–2375. doi: 10.1098/rsta.2008.0040. [DOI] [PMC free article] [PubMed] [Google Scholar]