Abstract
We propose fast and scalable statistical methods for the analysis of hundreds or thousands of high dimensional vectors observed at multiple visits. The proposed inferential methods do not require loading the entire data set into computer memory at once and instead use only sequential access to the data. This allows deployment of our methodology on low-resource computers, where computations on extremely large data sets can be done in minutes. Our methods are motivated by and applied to a study where hundreds of subjects were scanned using Magnetic Resonance Imaging (MRI) at two visits roughly five years apart. The original data comprise over ten billion measurements. The approach can be applied to any type of study where the data can be unfolded into a long vector, including densely observed functions and images. Supplemental materials provide source code for the simulations, some technical details and proofs, and additional imaging results of the brain study.
Keywords: Voxel-based morphometry, MRI, brain imaging data
1 Introduction
Massive and complex data sets raise a host of statistical challenges that were unthinkable even a few years ago. For example, functional and anatomic imaging data on many subjects at one or multiple visits are now routinely collected in many scientific studies. The volume and dimensionality of these types of data make standard inferential tools computationally infeasible and have led to intensive methodological research. One of the main directions of research centers on the use of functional principal components for statistical modeling of functions and images (Crainiceanu et al. 2011, 2009; Di and Crainiceanu 2010; Di et al. 2009; Greven et al. 2010; Mohamed and Davatzikos 2004; Reiss and Ogden 2008, 2010; Reiss et al. 2005; Staicu et al. 2010; Zipunnikov et al. 2011). Another direction employs spline and wavelet bases. In particular, Eilers et al. (2006) developed extensions of the P-spline approach to multidimensional grids by constructing the smooth multidimensional surface from tensor products of B-splines; the authors worked out an efficient computational procedure by making use of rearranged partitioned matrices. Morris et al. (2003) and Morris and Carroll (2006) proposed Bayesian hierarchical models based on wavelets to address multilevel functional observations. Note that these approaches differ from ours by making explicit use of spatial relationships; in contrast, our methods treat the data as one high dimensional vector, so that inferences with permuted dimensions would be identical.
Specifically, in this paper we are concerned with modeling data with the following structure: 1) measurements are observed on hundreds or thousands of subjects; 2) each subject has the same number of visits and there are no missing or mistimed observations (balanced design); and 3) each individual observation at each visit is high-dimensional enough that the full data set cannot be loaded into computer memory. In particular, we focus on MRI data where each image is composed of roughly 3 million measurements per subject and visit. These types of data are opening an area of research referred to as the large n (number of subjects), large J (number of visits), and large p (dimensionality of the observations) problem (Crainiceanu et al. 2011). Our paper efficiently solves the computational problems for very large p and moderate n · J; the suggested approach loses its appeal when p is very large and n · J becomes comparable to p.
To address the challenges arising from these types of data we develop Multilevel Functional Principal Component Analysis for High Dimensional data (HD-MFPCA). HD-MFPCA combines powerful data compression techniques and statistical inference to decompose the observed data into population- and visit-specific means and subject-specific within and between level variability. Our inferential methods are very fast, even when used on modest computing infrastructures for the analysis of hundreds or thousands of large images; it currently takes roughly 16 minutes on a standard personal computer to run an HD-MFPCA analysis of 3D images of hundreds of subjects at two visits. This was achieved by 1) designing new methods for obtaining Karhunen-Loève (K-L) expansions of covariance operators; 2) obtaining best linear unbiased predictors (BLUPs) of subject- and visit-specific scores as a by-product of a singular value decomposition (SVD); 3) using the parallel between the K-L expansion and the SVD; 4) relying on partitioning of images into blocks that can be loaded into computer memory (see also Demmel (1997) and Golub and Van Loan (1996) for comprehensive reviews of partitioned matrix techniques); 5) changing all calculations to rely only on block calculations and sequential access to memory; and 6) using the SVD only for matrices that have at least one dimension smaller than 10,000. With these methods our algorithms are linear in the dimensionality of the data and do not require loading the data matrix into memory, an impossible and wasteful task in many applications. HD-MFPCA opens the road for multilevel modeling of data of a size previously thought to be prohibitively large for statistically principled analysis.
The motivation for this work came from the analysis of cross-sectional and longitudinal brain volume via magnetic resonance imaging (MRI). The data arise from a population-based study of lead exposure and its relationship to brain structure and function (Caffo et al. 2008; Chen et al. 2009; Schwartz et al. 2007; Su et al. 2008).
1.1 Description of imaging data and their processing
In this manuscript, we consider voxel (three dimensional pixel) based analysis of brain magnetic resonance imaging (MRI) data. The motivating data arise from a study of voxel-based morphometry (VBM) (Ashburner and Friston 2000) in former lead manufacturing workers. VBM is a whole-brain technique for studying localized changes in brain shape that has become a standard component of the neuroscientist's toolbox. The primary benefits of VBM are the lack of need for a priori specified regions of interest and its exploratory nature, which facilitates identification of spatio-temporally complex, and perhaps previously unknown, patterns of brain structure and function.
Typically, VBM is performed via the following steps: 1) registration methods warp each subject's brain into a template brain space, where the Jacobian of the transformation is retained for analysis; 2) the Jacobian maps, which are aligned in template space, are smoothed using an aggressive filter; and 3) the maps are analyzed across subjects using voxel-wise regression models. In this paper we do not analyze voxels separately and focus instead on the joint analysis of the whole-brain data. Steps 1) and 2) imply that the image used for analysis in VBM is not the original MRI and that intensities represent relative sizes of the original image in normalized space. That is, the intensity of a given voxel represents how relatively large or small that same space was in the subject's original image. For example, two subjects with different head sizes would have the same VBM image shape; however, the smaller subject would have a darker overall image while the larger subject would have a brighter image.
The specific method we use for creating the VBM images in step 1) is a generalization of the regional analysis of volumes examined in normalized space (RAVENS) algorithm (Davatzikos et al. 2001), an early and influential standard in VBM analysis. To obtain more accurate registration and analysis, each subject's brain is segmented into gray matter, white matter and cerebrospinal fluid, and the analysis is separated by tissue type (gray or white). Moreover, the data for this analysis use a new 4D longitudinal registration technique that reduces errors by accounting for within-subject correlations: RAVENS images were spatially registered as four-dimensional images, that is, considering the three spatial dimensions and time (visits 1 and 2) simultaneously. Conceptually, the approach coregisters each subject across time, then registers across subjects; this proved more accurate than the standard approach of registering all subjects to the template while disregarding within-subject correlation (Xue et al. 2006). We registered brains to a template aligned with the International Consortium for Brain Mapping's ICBM152 template (Mazziotta et al. 1995). For smoothing we used a 10 mm full width at half maximum (FWHM) isotropic Gaussian smoother, where the FWHM is the diameter of the spherical cross-section of the 3D Gaussian kernel at one half of the height at the mode.
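For reference, the FWHM of a Gaussian kernel determines its standard deviation σ through the standard identity

$$\mathrm{FWHM} = 2\sqrt{2\ln 2}\;\sigma \approx 2.355\,\sigma,$$

so the 10 mm filter corresponds to σ ≈ 10/2.355 ≈ 4.25 mm.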
The data were derived from an epidemiologic study of the central nervous system effects of organic and inorganic lead in former organolead manufacturing workers (Schwartz et al. 2000a,b; Stewart et al. 1999). Subject scans were from a GE 1.5 Tesla Signa scanner. RAVENS image processing was performed on the T1-weighted volume acquisitions.
Our analysis differs from standard VBM in that it explores hierarchical models of morphometric variation using a generalization of multilevel functional principal components. That is, we offer the first approach to decompose a population of morphometric (RAVENS) images into cross-sectional and longitudinal directions of variation. While our methods are motivated by and applied to a large longitudinal study of RAVENS images, they are general and can be adapted or scaled up to other longitudinal studies where images or other very large vectors are collected at multiple visits.
The remainder of the paper is organized as follows. Section 2 briefly describes the MFPCA approach and its limitations when dealing with high-dimensional data. Section 3 provides a solution to the problem, introduces HD-MFPCA, and shows an efficient way of calculating best linear unbiased predictors (BLUPs) for the principal scores of the model. The results of two simulation studies are provided in Section 4. Our methods are applied to the RAVENS data in Section 5. The paper concludes with a discussion in Section 6.
2 Basic multilevel model for high-dimensional data
In this section we provide the basic framework for MFPCA developed in Di et al. (2009) and discuss the reasons why those methods cannot be applied directly to high-dimensional data sets such as the one described in Section 1.1.
Suppose that we have a sample of images Xij, where Xij is the recorded brain image of the ith subject, i = 1, …, I, at visit j, j = 1, …, J. Every image is a 3-dimensional array of dimension p = p1 × p2 × p3; for example, in the RAVENS data introduced in Section 1.1, p = 256 × 256 × 198 = 12,976,128. For the purpose of this paper we represent the data Xij as a p × 1 dimensional vector containing the voxels in a particular order, with the same ordering used for every subject and visit.
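In Matlab, which we use throughout, this unfolding is a single reshape. The following minimal sketch (with the dimensions above and a random stand-in volume) shows the operation and its inverse.

```matlab
% Minimal sketch: unfold a 3D volume into a p x 1 vector using Matlab's
% column-major ordering, identically for every scan.
p1 = 256; p2 = 256; p3 = 198;
img = rand(p1, p2, p3);            % stand-in for one loaded MRI volume
x = img(:);                        % p x 1 vector, p = p1*p2*p3 = 12976128
imgBack = reshape(x, p1, p2, p3);  % inverse operation, e.g. for display
```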
Following Di et al. (2009) we consider the following functional two-way ANOVA model
$$X_{ij}(v) = \mu(v) + \eta_j(v) + Z_i(v) + W_{ij}(v), \qquad (1)$$
where μ(v) is the overall mean image, ηj(v) is the visit-specific image shift from the overall mean image, Zi(v) is a subject-specific image deviation from the visit-specific population mean, and Wij(v) is the visit-specific image deviation from the subject-specific mean. We make the standard assumptions that μ(v) and ηj(v) are fixed and Zi(v) and Wij(v) are zero mean second-order stationary stochastic processes with continuous covariance functions.
Omitting the details (which can be found in the web-appendix), the MFPCA model reduces (1) to
$$X_{ij}(v) = \mu(v) + \eta_j(v) + \sum_{k=1}^{N_1}\xi_{ik}\,\phi^{(1)}_k(v) + \sum_{l=1}^{N_2}\zeta_{ijl}\,\phi^{(2)}_l(v), \qquad (2)$$
where φ_k^(1) and φ_l^(2) are p-dimensional principal components of the first and second level, respectively. The principal scores ξ_i = (ξ_i1, …, ξ_iN1)′ ~ (0, Λ^(1)) and ζ_ij = (ζ_ij1, …, ζ_ijN2)′ ~ (0, Λ^(2)) are assumed to be zero mean random vectors with diagonal covariance matrices Λ^(1) and Λ^(2), respectively, whose diagonal elements are non-increasing: λ_k^(1) ≥ λ_{k+1}^(1), k = 1, …, N1 − 1, and λ_l^(2) ≥ λ_{l+1}^(2), l = 1, …, N2 − 1; note that no distributional assumptions are required. In addition, the level 1 principal scores ξ_i are assumed to be independent of the level 2 principal scores ζ_ij. Statistical estimation of this model conditional on the eigenvectors is not very difficult; see, for example, the projection idea in Di et al. (2009). Thus, estimating the eigenvectors is the crucial first step of the methods.
The main reason why MFPCA is infeasible in the high dimensional setting is that the covariance operators involved cannot be calculated, stored, or diagonalized. Indeed, standard MFPCA proceeds as follows. The mean image μ̂ is estimated by the sample average Σ_ij X_ij/(IJ). The visit effects η̂_j, j ≥ 1, are estimated by Σ_i X_ij/I − μ̂. To obtain the first and second level decompositions of (2) for the unexplained part of the image, X̃_ij = X_ij − μ̂ − η̂_j, an eigenanalysis has to be performed. Denote by X̃ = (X̃_1, …, X̃_I) the centered p × IJ matrix, where X̃_i is the p × J block whose column j, j = 1, …, J, contains the unfolded centered image for subject i at visit j. The overall covariance operator is estimated as K̂_T = X̃X̃′/(IJ), which can be decomposed as the sum of the "between" and "within" covariance operators defined as
$$\hat{K}_B = \frac{1}{IJ^2}\sum_{i=1}^{I}\Big(\sum_{j=1}^{J}\tilde{X}_{ij}\Big)\Big(\sum_{j=1}^{J}\tilde{X}_{ij}\Big)^{\prime}, \qquad \hat{K}_W = \frac{1}{2IJ^2}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{j^{\prime}=1}^{J}\big(\tilde{X}_{ij}-\tilde{X}_{ij^{\prime}}\big)\big(\tilde{X}_{ij}-\tilde{X}_{ij^{\prime}}\big)^{\prime}, \qquad (3)$$
respectively. It is easy to see that K̂T = K̂B + K̂W. The size of operators is p × p and the standard eigenanalysis requires O(p3) operations, which makes calculations infeasible for p > 104. The data in our application are 4 to 5 orders of magnitude larger than the infeasibility limit. Even the storage of the covariance operators for the RAVENS data with p × p = 1.6 · 1014 is impossible. Therefore, for high-dimensional problems MFPCA, which performs extremely well when the functional dimensionality is in the thousands, fails in very high dimensional setting. The next section provides a methodology capable of handling multi-level models of very high dimensionality. While some attention is required to understand the technical details, our solution works because the ranks of the matrices involved do not exceed the sample size of the study.
3 HD-MFPCA
We now describe our statistical model and inferential methods, with particular focus on solving computational problems. In particular, we show how to calculate all necessary quantities using sequential access to data that cannot be loaded into the computer memory. We introduce high-dimensional MFPCA (HD-MFPCA), which overcomes the problems discussed in the previous section and provides new and deep insights into the geometry of MFPCA. While our approach was inspired by the RAVENS data, HD-MFPCA has enormous potential for the analysis of populations of high dimensional data.
3.1 Eigenanalysis
Constructing and diagonalizing covariance operators for high dimensional data is impossible, which makes a brute-force implementation of the methods in Di et al. (2009) infeasible for the type of problems discussed here. In this section we propose an algorithm for calculating the eigenvalues and eigenvectors of the relevant covariance operators that avoids this problem. This crucial result uses sequential access to the data and allows us to perform multilevel principal component analysis on observations of extremely high dimensionality. Note that the approach we adopt here to obtain the SVD is general and can be readily incorporated into any methodological framework that deals with high-dimensional observations and requires an SVD.
Our approach starts with constructing the SVD of the matrix X̃, the matrix obtained by column-binding the centered subject-specific data matrices
$$\tilde{X} = V\,\Sigma^{1/2}\,U^{\prime}. \qquad (4)$$
Here, the matrix V is p × IJ dimensional with IJ orthonormal columns, Σ is a diagonal IJ × IJ matrix, and U is an IJ × IJ orthogonal matrix. Calculating the SVD of X̃ requires only a number of operations linear in p. Indeed, consider the IJ × IJ symmetric matrix X̃′X̃ with its spectral decomposition X̃′X̃ = UΣU′. Note that for high-dimensional p the matrix X̃ cannot be loaded into memory. The solution is to partition it into M slices as X̃′ = [(X̃¹)′|(X̃²)′| … |(X̃^M)′], where the size of the mth slice, X̃^m, is p/M × IJ and can be adapted to the available computer memory and optimized to reduce computation time. The matrix X̃′X̃ is then calculated as X̃′X̃ = Σ_{m=1}^{M} (X̃^m)′X̃^m.
From the SVD (4), the p × IJ matrix V can be obtained as V = X̃UΣ^{−1/2}. The actual calculations can be performed on the slices of the partitioned matrix X̃ as V^m = X̃^m U Σ^{−1/2}, m = 1, …, M. The stacked slices [(V¹)′|(V²)′| … |(V^M)′] form the matrix V′ of left singular vectors. Therefore, the SVD (4) can be constructed with sequential access to the data X̃ and p-linear effort.
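The following Matlab sketch illustrates this sequential-access computation; the function readSlice(m) is hypothetical and stands for reading the mth slice of X̃ from disk, so only one slice is ever held in memory.

```matlab
% Sequential-access computation of the SVD (4); readSlice(m) is a
% hypothetical reader returning the m-th (p/M) x IJ slice of Xt.
I = 350; J = 2; M = 100;               % illustrative sizes
n = I * J;
G = zeros(n, n);
for m = 1:M
    Xm = readSlice(m);                 % (p/M) x n slice of Xt
    G = G + Xm' * Xm;                  % accumulate Xt' * Xt
end
[U, D] = eig((G + G') / 2);            % spectral decomposition Xt'*Xt = U*D*U'
[s, ix] = sort(diag(D), 'descend');    % non-increasing eigenvalues
U = U(:, ix);
% In practice, keep only eigenvalues above a small tolerance before inverting.
for m = 1:M
    Vm = readSlice(m) * U * diag(1 ./ sqrt(s));   % V^m = Xt^m * U * Sigma^{-1/2}
    % write Vm back to disk; the stacked slices form the p x IJ matrix V
end
```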
After obtaining the SVD (4), each image can be represented as X̃_ij = VΣ^{1/2}U_ij, where U_ij is the corresponding column of the matrix U′. Therefore, the vectors X̃_ij differ only via the factors U_ij of dimension IJ × 1. Thus, the high dimensional covariance operators K̂_B and K̂_W can be expressed as
$$\hat{K}_B = V\,\Sigma^{1/2}\,\hat{K}^{U}_{B}\,\Sigma^{1/2}\,V^{\prime}, \qquad \hat{K}_W = V\,\Sigma^{1/2}\,\hat{K}^{U}_{W}\,\Sigma^{1/2}\,V^{\prime}, \qquad (5)$$

where the inner IJ × IJ matrices K̂_B^U and K̂_W^U are

$$\hat{K}^{U}_{B} = \frac{1}{IJ^2}\sum_{i=1}^{I}\Big(\sum_{j=1}^{J}U_{ij}\Big)\Big(\sum_{j=1}^{J}U_{ij}\Big)^{\prime}, \qquad \hat{K}^{U}_{W} = \frac{1}{2IJ^2}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{j^{\prime}=1}^{J}\big(U_{ij}-U_{ij^{\prime}}\big)\big(U_{ij}-U_{ij^{\prime}}\big)^{\prime}, \qquad (6)$$
Equation (5) establishes the correspondence between the high-dimensional operators K̂_B and K̂_W and their low-dimensional counterparts K̂_B^U and K̂_W^U through the "sandwich" operator VΣ^{1/2}[·]Σ^{1/2}V′. These crucial identities can be used to obtain the spectral decompositions of K̂_B and K̂_W through the spectral decompositions of the low-dimensional IJ × IJ matrices Σ^{1/2}K̂_B^U Σ^{1/2} and Σ^{1/2}K̂_W^U Σ^{1/2}.
Indeed, using the spectral decompositions Σ^{1/2}K̂_B^U Σ^{1/2} = A_B Σ_B A_B′ and Σ^{1/2}K̂_W^U Σ^{1/2} = A_W Σ_W A_W′, where the matrices A_B and A_W are orthogonal of size IJ × IJ and the matrices Σ_B and Σ_W are diagonal, the covariance operators (5) can be written as

$$\hat{K}_B = (V A_B)\,\Sigma_B\,(V A_B)^{\prime}, \qquad \hat{K}_W = (V A_W)\,\Sigma_W\,(V A_W)^{\prime}. \qquad (7)$$
Since the matrices A_B and A_W are orthogonal of size IJ × IJ, the p × IJ matrices VA_B and VA_W have orthonormal columns. From the uniqueness of the spectral decomposition of a symmetric matrix it follows that (7) provides the spectral decompositions of the covariance operators. It implies that
$$\big[\phi_1^{(1)}\,|\cdots|\,\phi_{IJ}^{(1)}\big] = V A_B,\;\; \Lambda^{(1)} = \Sigma_B, \qquad \big[\phi_1^{(2)}\,|\cdots|\,\phi_{IJ}^{(2)}\big] = V A_W,\;\; \Lambda^{(2)} = \Sigma_W. \qquad (8)$$
Therefore, diagonalizing the operators K̂_B and K̂_W requires a few simple steps, each implementable with a number of calculations linear in the dimensionality of the problem and with sequential access to the data; Algorithm 1 summarizes them. The entire data set is never loaded into computer memory.
Algorithm 1.
Computing eigendecompositions in p-linear time.
1. Center the data, X̃_ij = X_ij − μ̂ − η̂_j, storing X̃ in M slices X̃¹, …, X̃^M.
2. Compute X̃′X̃ = Σ_{m=1}^{M} (X̃^m)′X̃^m in one sequential pass over the slices.
3. Obtain the spectral decomposition X̃′X̃ = UΣU′.
4. Form the IJ × IJ matrices K̂_B^U and K̂_W^U from the columns of U′ as in (6).
5. Diagonalize Σ^{1/2}K̂_B^U Σ^{1/2} = A_B Σ_B A_B′ and Σ^{1/2}K̂_W^U Σ^{1/2} = A_W Σ_W A_W′.
6. Compute the eigenvectors VA_B and VA_W slice by slice via V^m = X̃^m UΣ^{−1/2}; the eigenvalues are the diagonal elements of Σ_B and Σ_W.
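A Matlab sketch of the low-dimensional steps of Algorithm 1, continuing the sketch above (the quantities s, U, n, I and J are carried over, and the inner matrices follow our display of (6)):

```matlab
% Low-dimensional eigenanalysis (steps 4-6 of Algorithm 1), continuing the
% sketch above; only IJ x IJ matrices are involved.
Sh = diag(sqrt(s));                    % Sigma^{1/2}
Ut = U';                               % column (i-1)*J + j of Ut is U_ij
KBu = zeros(n); KWu = zeros(n);
for i = 1:I
    Ui = Ut(:, (i-1)*J + (1:J));       % IJ x J block for subject i
    si = sum(Ui, 2);
    KBu = KBu + (si * si') / (I*J^2);
    for j = 1:J
        for jp = 1:J
            d = Ui(:, j) - Ui(:, jp);
            KWu = KWu + (d * d') / (2*I*J^2);
        end
    end
end
SB = Sh * KBu * Sh; SW = Sh * KWu * Sh;
[AB, DB] = eig((SB + SB') / 2);                          % step 5
[lamB, ix] = sort(diag(DB), 'descend'); AB = AB(:, ix);
[AW, DW] = eig((SW + SW') / 2);
[lamW, ix] = sort(diag(DW), 'descend'); AW = AW(:, ix);
% lamB and lamW hold the level 1 and level 2 eigenvalues; the eigenimages
% V*AB and V*AW are computed slice by slice, exactly as V itself was.
```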
Note that the low-dimensional matrices AB and AW completely determine the geometry of the high-dimensional "between" and "within" spaces as well as the interaction between the two. In the next section, we provide more insight into the matrices AB and AW. In particular, we show that AB and AW paired with the right singular vectors Uij provide all the necessary information for estimating the principal scores of model (2).
3.2 Principal scores
In this section we show how the BLUPs for the scores in model (2) can be obtained without using high-dimensional matrices or vectors. This is essential, as a brute force extension of the methods in Crainiceanu et al. (2009) and Di and Crainiceanu (2010) would require inversion of p × p dimensional matrices. We propose a completely different approach to calculating the BLUPs that is computationally feasible for samples of massive images; new insights into the MFPCA geometry are obtained as a by-product.
We first introduce some notation. The subject level principal scores are ξ_i = (ξ_i1, …, ξ_iN1)′ and ζ_i = (ζ_i1′, …, ζ_iJ′)′, where ζ_ij = (ζ_ij1, …, ζ_ijN2)′. As in Section 3.1, the SVD of the matrix X̃ can be written in by-subject blocks as X̃_i = VΣ^{1/2}U^i, where the IJ × J matrix U^i = (U_i1, …, U_iJ) is the block of columns of U′ corresponding to subject i. The following theorem contains the main result of this section; it provides a simple and fast recipe for calculating the BLUPs for the MFPCA model in the context of high dimensional data.
Theorem 1
Under the MFPCA model (2), the estimated best linear unbiased predictor (EBLUP) of ξ_i and ζ_i is given by

$$\begin{pmatrix} \hat{\xi}_i \\ \hat{\zeta}_i \end{pmatrix} = \begin{pmatrix} J\, I_{N_1} & \mathbf{1}_J^{\prime} \otimes C_{BW} \\ \mathbf{1}_J \otimes C_{BW}^{\prime} & I_{JN_2} \end{pmatrix}^{-1} \begin{pmatrix} A_B^{N_1}\, \Sigma^{1/2}\, U^i\, \mathbf{1}_J \\ \operatorname{vec}\big(A_W^{N_2}\, \Sigma^{1/2}\, U^i\big) \end{pmatrix}, \qquad (9)$$
where C_BW = A_B^{N1}(A_W^{N2})′ is of size N1 × N2, the matrix A_B^{N1} is of size N1 × IJ and its rows are the first N1 columns of the matrix A_B, and the N2 × IJ matrix A_W^{N2} is defined in a similar way. The vector 1_J is a J × 1 vector of ones, ⊗ denotes the Kronecker product of matrices, and the operation vec(·) stacks the columns of a matrix on top of each other.
The fundamental property of the EBLUP is that all the matrices involved in (9) are low-dimensional and do not depend on the dimension p. Therefore, the EBLUP calculations are almost instantaneous. Careful inspection of the results in Theorem 1 provides novel insights into single and multilevel functional principal component analysis. Indeed, the matrices A_B^{N1} and A_W^{N2} are computed using only the right singular vectors in equation (4). Thus, all the information necessary for estimating the scores is contained in the low-dimensional space spanned by the orthonormal right singular vectors. The BLUPs (9) provide a unique geometrical insight. For illustration, assume that the "between" and "within" eigenvectors are orthogonal, that is, C_BW = 0. Then ξ̂_i = J^{−1}A_B^{N1}Σ^{1/2}U^i 1_J and ζ̂_i = vec(A_W^{N2}Σ^{1/2}U^i). The ith block U^i of size IJ × J is composed of IJ directions, each of size J. The product Σ^{1/2}U^i 1_J/J averages over each of these directions and scales the averages by the singular values. The matrix A_B^{N1} then defines the BLUPs of the "between" scores by calculating N1 orthonormal linear combinations of the entries of Σ^{1/2}U^i 1_J/J. If we set J = 1, the MFPCA model (2) reduces to the single-level functional model obtained in Zipunnikov et al. (2011). In this case there is no "within" level, A_B = I_I, and the BLUPs of the "between" scores are exactly the ones obtained in Zipunnikov et al. (2011). Similarly, the matrix A_W^{N2} defines the BLUPs of the "within" scores as the N2 orthonormal combinations of the coordinates of the jth column of the block Σ^{1/2}U^i, weighted by the singular values. For non-orthogonal "between" and "within" bases the interaction matrix C_BW has to be taken into account. This generalizes the results of Zipunnikov et al. (2011) and unifies the theory of single and multilevel functional principal component analysis for low, moderate and high dimensional data.
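A minimal Matlab sketch of these score calculations in the orthogonal case C_BW = 0 follows; in the general case the block matrix in (9) must also be inverted. The quantities AB, AW, Sh and Ut are carried over from the sketches of Section 3.1, and N1 and N2 are illustrative choices.

```matlab
% EBLUPs of equation (9), as reconstructed above, in the special case
% C_BW = 0; AB, AW, Sh and Ut come from the sketches in Section 3.1.
N1 = 4; N2 = 4;                        % illustrative numbers of components
ABn = AB(:, 1:N1)';                    % N1 x IJ, rows = first N1 columns of AB
AWn = AW(:, 1:N2)';                    % N2 x IJ
xiHat = zeros(N1, I); zetaHat = zeros(N2, J, I);
for i = 1:I
    Ui = Ut(:, (i-1)*J + (1:J));       % IJ x J block for subject i
    xiHat(:, i) = ABn * Sh * Ui * ones(J, 1) / J;   % level 1 scores
    zetaHat(:, :, i) = AWn * Sh * Ui;               % level 2 scores
end
```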
4 Simulations
In this section, we illustrate our HD-MFPCA method with two simulation studies. The first replicates the simulation scenario in Di et al. (2009) and shows the equivalence of the two approaches in low and moderate dimensional applications. The second tests how well spatial bases are recovered by HD-MFPCA in an application where the approach of Di et al. (2009) cannot be implemented. Both studies were run on a quad-core i7 2.67 GHz PC with 6 GB of RAM.
First, we start with the scenario considered in Di et al. (2009). Data are generated according to the model

$$X_{ij}(v) = \sum_{k=1}^{N_1}\xi_{ik}\,\phi^{(1)}_k(v) + \sum_{l=1}^{N_2}\zeta_{ijl}\,\phi^{(2)}_l(v),$$

where the zero mean scores ξ_ik, with variances λ_k^(1), and ζ_ijl, with variances λ_l^(2), are mutually independent. To match the RAVENS data design we deviate slightly from Di et al. (2009) and choose I = 350 and J = 2, i = 1, …, I. We set the number of eigenfunctions to N1 = 4 at level 1 and N2 = 4 at level 2, and take the true eigenvalues equal at the two levels, λ_k^(1) = λ_k^(2), k = 1, 2, 3, 4. We follow Case 2 of Section 4 in Di et al. (2009) and choose the non-orthogonal level 1 and level 2 bases used there,
measured on a regular grid of p equidistant points of the interval [0, 1], {1/p, 2/p, 3/p, …, 1}, with p = 50,000. Note that a brute-force extension of standard MFPCA would be infeasible for such a large p even for one data set, whereas HD-MFPCA easily handles this dimension and completes the entire simulation study in 27 minutes. The simulation study was implemented in Matlab 2010a and the software is available upon request.
Figure 1 displays the true and estimated eigenfunctions from the HD-MFPCA approach at levels 1 and 2 (see the web-appendix for high-resolution images). The results for the level 1 eigenfunctions are displayed in the top panel and those for the level 2 eigenfunctions in the bottom panel. Shown are the true function (solid black line), the estimated functions (cyan lines), the pointwise median of the estimated eigenvectors (indistinguishable from the true functions), and the pointwise 5th and 95th percentiles of the estimated eigenvectors (black dashed lines). Comparing Figure 1 with Figure 5 in Di et al. (2009), we conclude that our estimation procedure fully reproduces the eigenfunction results obtained using the standard MFPCA approach.
Figure 1.
True and estimated level 1 and level 2 eigenfunctions for scenario 1, replicated 100 times. Each box shows the estimated functions (cyan solid lines), the true function (solid black line), the pointwise median, and the 5th and 95th pointwise percentile curves (dashed black lines).
Figure 2 shows the boxplots of the estimated level 1 and level 2 eigenvalues. We display the centered and standardized eigenvalues, (λ̂_k^(1) − λ_k^(1))/λ_k^(1) and (λ̂_l^(2) − λ_l^(2))/λ_l^(2), respectively. The results confirm the very good performance of the estimation procedure and are consistent with those reported for the no-noise case in Figure 4 of Di et al. (2009).
Figure 2.
Boxplots of the normalized estimated level 1 eigenvalues (left box) and the normalized estimated level 2 eigenvalues (right box) based on scenario 1 with 100 replications. Zero is shown by the solid black line.
To estimate the scores ξ_ik and ζ_ijl, we use the EBLUP results of Theorem 1. Each replication of the scenario provides 350 estimates of the level 1 scores ξ_ik, k = 1, 2, 3, 4, and 700 estimates of the level 2 scores ζ_ijl, l = 1, 2, 3, 4. The total number of scores ξ_ik estimated in the study is therefore 35,000 for each k; similarly, the total number of estimated scores ζ_ijl is 70,000 for each l. Note that the estimated scores within each replication are dependent even though the true scores are independent. The boxplots of the normalized estimated scores at levels 1 and 2 are reported in panels one and three of Figure 3, respectively. The distribution of the normalized estimated scores corresponding to the fourth eigenfunction at level 1 has a slightly wider spread around zero; this is likely due to the smaller signal-to-noise ratio. Panels two and four of Figure 3 display the medians and the 0.5%, 5%, 95% and 99.5% quantiles of the distribution of the normalized estimated scores. These plots are consistent with the theoretical result of Theorem 1 that the EBLUPs given by equation (9) are unbiased. To summarize, the first simulation study shows that HD-MFPCA replicates the results of standard MFPCA but has the key advantage of being able to efficiently handle arbitrarily high-dimensional multilevel functional data.
Figure 3.
The left two panels show the distribution of the normalized estimated level 1 scores: boxplots in the first panel; medians (black marker), 5% and 95% quantiles (blue markers), and 0.5% and 99.5% quantiles (red markers) in the second. The right two panels show the same summaries for the normalized estimated level 2 scores.
In the second scenario, we consider the extension of our methods to 2D bases. We simulated 100 data sets from the model

$$X_{ij}(v) = \sum_{k=1}^{3}\xi_{ik}\,\phi^{(1)}_k(v) + \sum_{l=1}^{2}\zeta_{ijl}\,\phi^{(2)}_l(v), \qquad (10)$$
where v ranges over the two-dimensional grid [1, 300] × [1, 300]. The level 1 and level 2 eigenbases, or eigenimages, are displayed in Figure 4. The images in this scenario can be thought of as 2D grayscale images with pixel intensities on the [0, 1] scale; black pixels are set to 1 and white ones to zero. The eigenimages are uncorrelated within the same level but are correlated across levels. We take I = 350 and J = 2, i = 1, …, I, with true eigenvalues λ_k^(1), k = 1, 2, 3, at level 1 and λ_l^(2), l = 1, 2, at level 2. To apply HD-MFPCA we unfold each image X_ij into a vector of size p = 300 × 300 = 90,000. The simulation study took 32 minutes.
Figure 4.
Level 1 (left three panels) and Level 2 (right two panels) grayscale eigenimages for the 2nd scenario.
Figures 5, 6, and 7 display the mean of the estimated eigenimages and the pointwise 5th and 95th percentile images, respectively. To obtain a grayscale image with pixel values in the [0, 1] interval, each estimated eigenvector φ̂ = (φ̂_1, …, φ̂_p)′ was normalized as φ̂ → (φ̂ − min_s φ̂_s)/(max_s φ̂_s − min_s φ̂_s). Figure 5 shows that, on average, our method recovers the spatial configuration of both bases. The percentile images in Figures 6 and 7 show a similar pattern as the average, with small distortions from the true functions (note the light gray areas). To better understand the quantile patterns, consider as an example the behavior of the estimated level 2 eigenimages. A closer inspection of K̂_W and model (10) reveals that an estimated level 2 eigenvector is a linear combination of the true level 2 eigenvectors φ_1^(2) and φ_2^(2), with weights that are functions of the level 2 eigenvalues and principal scores. The larger the sample size, the closer the estimated eigenfunction is to the true one. Large differences between the weights create the pattern observed in the 5% quantile plot of Figure 6; small differences between the weights result in the pattern shown in the 95% quantile plot of Figure 7. Similarly, it can be shown that the estimates of the first level eigenvectors are linear combinations of both the true level 1 and level 2 eigenvectors, which leads to the overlapping patterns shown in the left three panels of Figures 6 and 7. We conclude that the estimation of the 2D eigenimages is very good.
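In Matlab, this display normalization is a one-liner (phiHat denotes an estimated eigenvector):

```matlab
% Rescale an estimated eigenvector to [0, 1] and restore the pixel grid.
g = (phiHat - min(phiHat)) / (max(phiHat) - min(phiHat));  % phiHat is p x 1
img = reshape(g, 300, 300);
```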
Figure 8 shows the boxplots of the normalized estimated level 1 and level 2 eigenvalues, indicating that the eigenvalues are estimated with essentially no bias. As in Scenario 1, the total number of estimated scores ξ_ik is 35,000 for each k, and there are 70,000 estimated scores ζ_ijl for each l. The boxplots of the normalized estimated level 1 and level 2 scores are displayed in the first and third panels of Figure 9, respectively; they show that the EBLUPs approximate the true scores very well. The scores corresponding to the second and third eigenimages at level 1 have a slightly larger spread due to a reduced signal-to-noise ratio. The second and fourth panels of Figure 9 display the medians and the 0.5%, 5%, 95% and 99.5% quantiles of the distribution of the normalized estimated scores.
5 The analysis of the RAVENS data
In this section we apply HD-MFPCA to the RAVENS images discussed in Section 1.1. The RAVENS images are 256 × 256 × 198 dimensional for 352 subjects, each with two visits, on the order of ten billion (10^10) numbers. Assuming 32 bits per number, this represents 34 gigabytes of data for the gray matter images alone. To emphasize the processing issues, one must multiply this number by three to represent both tissue types (gray and white) and cerebrospinal fluid; hence at least an additional one hundred gigabytes is required for every saved processing step. However, even restricting ourselves to the processed gray matter data, the data matrix is too large (ten billion numbers) to work with in statistical models without further restrictions; therefore, the following strategy was executed. First, after processing, the intersection of non-background voxels across images was collected. This intersection greatly reduces the dimension of the data matrix from ten billion to two billion numbers: roughly three million relevant voxels for each of the seven hundred and four subject-visit combinations. However, even this relevant intersection remains prohibitive for modern systems. Hence the data matrix, of size 704 by 3 million, was divided into 100 submatrices of size 704 by 30 thousand (about 21 million numbers each). Note that on lower-resource computers the only change would be to reduce the size of the submatrices. A parallel implementation of our approach would reduce the computation time even further. The entire analysis, performed in Matlab 2010a, took around 16 minutes on a PC with a quad-core i7 2.67 GHz processor and 6 GB of RAM.
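The voxel-selection step can be sketched as follows; readImageVector(c) is a hypothetical loader for the cth unfolded image.

```matlab
% Voxel-selection sketch: keep only voxels that are non-background in
% every subject-visit image; readImageVector(c) is a hypothetical reader.
p = 12976128;                      % voxels per image
mask = true(p, 1);
for c = 1:704                      % all subject-visit images
    x = readImageVector(c);
    mask = mask & (x > 0);         % intersect non-background supports
end
% Only the roughly 3 million voxels with mask == true enter the 704 by
% 3 million data matrix, which is then split into 100 column submatrices.
```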
A small technical concern was the presence of a few artifactual negative values in the data resulting from preprocessing; these voxels were removed from the analysis. In addition, negative eigenvalues occurred when calculating the spectral decomposition of the between covariance operator, K̂_B. An estimator of a non-negative definite matrix is not necessarily non-negative definite itself (see Section 2.2 of Di et al. 2009 and Hall et al. 2008). The negative eigenvalues of K̂_B represented 2.27 percent of the overall variation. Following Hall et al. (2008), all negative eigenvalues and their corresponding eigenvectors were trimmed to zero for the analysis.
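In the notation of the sketches of Section 3, the trimming amounts to:

```matlab
% Trimming following Hall et al. (2008): negative eigenvalues of the
% estimated between covariance and their eigenvectors are set to zero.
neg = lamB < 0;
lamB(neg) = 0;
AB(:, neg) = 0;
```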
In the analysis, we first estimated the mean and the visit effects by the method of moments. The visit-specific mean images are uniform over the template and simply convey that localized changes in morphometry within subgroups are averaged out in the visit-specific mean calculations. In our eigenimage analysis we de-mean the data by subtracting these vectors.
Next, the total variation was decomposed into "between" and "within" levels. Most of the total variability, 91.42%, is explained by level 1 (between-subject) variation, while 8.58% is explained by level 2 (within-subject, between-visit) variation. This is expected but nonetheless scientifically interesting: the majority of variation in brain shape in this large sample is cross-sectional across subjects rather than longitudinal. A key contribution of our approach is that it quantifies these differences. It would be of interest to see how the ratio of between to within variation changes with the age of the sample. For example, we would expect the proportion of variability explained longitudinally to be much higher in infants during their first five years of life, or in very elderly subjects, where there is much more longitudinal change.
Table 1 provides the percentages of variability explained by the first 10 eigenimages at the between and within levels, respectively. We also plot the percentages explained by the first 30 between and within eigenimages in Figure 10. The first 30 between eigenimages explain roughly 47.13% of the between variability; for the within level, Figure 10 indicates that 52.52% of the within variability is explained by the first 30 within eigenimages.
The 3D renderings of the first three estimated between eigenimages at five different angles are shown in Figure 11. Similar renderings for the within level are presented in Figure 12. The web-appendix contains additional images showing eleven equidistant axial slices of the first three estimated between and within eigenimages.
We interpret the between eigenimages as follows. The first eigenimage loads positively on the whole cortex and has a small negative region around the ventricles; it clearly represents overall brain size and shape. A person loading positively on this eigenimage has a larger brain. This component of variation is expected, simply due to variation in head sizes within the sample. The second eigenimage loads positively on the majority of the cortex and negatively on the temporal lobe and cerebellum. The third loads negatively on the cerebellum. It is extremely encouraging that the eigenimages obey regional boundaries fairly strictly, despite the algorithm's complete ignorance of spatial relationships. We believe that this is the first characterization of cross-sectional brain morphometry using VBM data.
The within eigenimages are more difficult to interpret. They appear to load positively and negatively on layers of the cortex, with no clear anatomical boundaries, and they appear to span large areas of the brain. We interpret this as suggesting that the most common form of longitudinal brain aging is a uniform volumetric loss. Such effects could also be seen if there are localized volume losses whose locations are random across subjects. It is also important to note that a longer follow-up time, or the investigation of different populations, might reveal more longitudinal variation. Moreover, specific diseases, such as Alzheimer's disease or multiple sclerosis, may differentially affect regional brain loss across time. Our method is designed to identify and quantify such potentially interesting processes.
6 Discussion
Longitudinal observational studies that collect high-dimensional functional measurements at multiple visits are appearing in increasing numbers. These studies create a great demand for methods that separate the observed functional variability into cross-sectional and longitudinal components. However, the high dimensionality and size of the observed data pose enormous computational challenges for existing methods of multilevel analysis. We have developed a powerful and computationally efficient solution to these challenges, HD-MFPCA. The simulation studies confirmed the ability of the methodology to recover the spatial configurations in high-dimensional subject-specific and subject/visit-specific spaces. The approach was applied to a large imaging study of 704 RAVENS images containing over ten billion measurements. The high-dimensional multilevel methodology developed in this paper applies equally to any longitudinal study with a balanced design.
A potential limitation of our approach is the need for the total number of functional observations to not be prohibitively large. In particular, the number of subjects, I, and the number of observations per subject, J, must be of moderate size so that a spectral decomposition of an IJ × IJ matrix can be performed. Our methods are designed for IJ of the order of 10,000 to 15,000 and arbitrarily large p. Although, to the best of our knowledge, there are currently no longitudinal imaging studies of such a size, it is not difficult to foresee a new generation of observational studies where the number of subjects reaches tens of thousands with dozens or hundreds of observations per subject. In such a situation, a possible alternative to our approach would be an adaptive aggregation of eigendecompositions calculated on subsamples of subjects. Efficient ways of aggregation remain an open problem and will need to be addressed in the future. Another limitation is that our methods assume a balanced design. The LFPCA methodology developed in Greven et al. (2010) is capable of handling missing or mistimed observations; however, its extension to the high-dimensional case is an extremely challenging problem which remains to be solved. Motivated by the RAVENS data, which represent preprocessed and smoothed images, we have not assumed noise in the model. However, one can easily think of applications where functional observations are measured with non-ignorable noise. Another possible scenario is sparsity of the high-dimensional functional observations. These situations have been considered in Di et al. (2009) and Di and Crainiceanu (2010), respectively, and efficient solutions have been proposed. However, those solutions require smoothing of the covariance operators, which is infeasible for high-dimensional data. Thus, a computationally efficient procedure for covariance smoothing, or an equivalent, is highly desirable in the high dimensional context.
Supplementary Material
Acknowledgments
The authors would like to thank the Reviewers, the Associate Editor and the Editor for their helpful comments and suggestions which led to an improved version of the manuscript. The research of Vadim Zipunnikov, Brian Caffo and Ciprian Crainiceanu was supported by Award Number R01NS060910 from the National Institute of Neurological Disorders and Stroke and by Award Number EB012547 from the NIH National Institute of Biomedical Imaging and Bioengineering (NIBIB).
7 Appendix
Figure 5.
Grayscale images of the averages for the level 1 (left three panels) and level 2 (right two panels) estimated eigenimages calculated from 100 replications of the 2nd scenario.
Figure 6.
Grayscale images of the 5th pointwise percentiles for the level 1 (left three panels) and level 2 (right two panels) estimated eigenimages calculated from 100 replications of the 2nd scenario.
Figure 7.
Grayscale images of the 95th pointwise percentiles for the level 1 (left three panels) and level 2 (right two panels) estimated eigenimages calculated from 100 replications of the 2nd scenario.
Figure 8.
Boxplots of the normalized estimated level 1 eigenvalues (left box) and the normalized estimated level 2 eigenvalues (right box) based on scenario 2 with 100 replications. Zero is shown by the solid black line.
Figure 9.
The left two panels show the distribution of the normalized estimated level 1 scores: boxplots in the left panel; medians (black marker), 5% and 95% quantiles (blue markers), and 0.5% and 99.5% quantiles (red markers) in the right panel. The right two panels show the same summaries for the normalized estimated level 2 scores.
Figure 10.
On the left: proportions of variability explained by the 2nd to 30th level 1 eigenimages; the first "between" eigenimage explains 13.15% of the "between" variation. On the right: proportions explained by the 2nd to 30th level 2 eigenimages; the first "within" eigenimage explains 11.91% of the "within" variation.
Figure 11.
3-Dimensional rendering of the first three between eigenimages overlaid with a thresholded template. Snapshots are taken from five different angles. Negative loadings are depicted in red, positive ones in blue.
Figure 12.
3-Dimensional rendering of the first three within eigenimages overlaid with a thresholded template. Snapshots are taken from five different angles. Negative loadings are depicted in red, positive ones in blue.
Table 1.
Estimated eigenvalues for level 1 and level 2 for the RAVENS data using HD-MFPCA. The first ten components are reported for each level; "cum % var" is the cumulative percentage of variance explained.
Level 1 eigenvalues

| Component | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| eigenvalue (× 10^8) | 8.79 | 1.97 | 1.84 | 1.50 | 1.22 | 1.03 | 0.90 | 0.89 | 0.87 | 0.83 |
| cum % var | 13.46 | 16.48 | 19.30 | 21.60 | 23.47 | 25.06 | 26.43 | 27.79 | 29.13 | 30.40 |

Level 2 eigenvalues

| Component | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| eigenvalue (× 10^7) | 7.47 | 4.71 | 2.99 | 1.62 | 1.44 | 1.14 | 1.12 | 0.92 | 0.79 | 0.74 |
| cum % var | 11.91 | 19.44 | 24.21 | 26.80 | 29.09 | 30.91 | 32.70 | 34.16 | 35.42 | 36.60 |
Footnotes
web-appendix: The file contains some background details on MFPCA, the proof of Theorem 1, high-resolution images of Figure 1, and images showing eleven equidistant axial slices of the first three estimated between and within eigenimages. (.pdf)
HD-MFPCA-simulations-scenario-01: Matlab code for Scenario 1 of Simulations. (.m)
HD-MFPCA-simulations-scenario-02: Matlab code for Scenario 2 of Simulations. (.m)
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of Neurological Disorders And Stroke or the National Institute of Biomedical Imaging and Bioengineering or the National Institutes of Health.
Contributor Information
Vadim Zipunnikov, Email: vzipunni@jhsph.edu, Postdoctoral Fellow, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205.
Brian Caffo, Email: bcaffo@jhsph.edu, Associate Professor, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205.
David M. Yousem, Professor, Department of Radiology, Johns Hopkins Hospital
Christos Davatzikos, Professor of Radiology, University of Pennsylvania.
Brian S. Schwartz, Professor of Environmental Health Sciences, Epidemiology, and Medicine, Johns Hopkins Bloomberg School of Public Health
Ciprian Crainiceanu, Email: ccrainic@jhsph.edu, Associate Professor, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205.
References
- Ashburner J, Friston KJ. Voxel-based morphometry - the methods. NeuroImage. 2000;11:805–821.
- Caffo BS, Chen S, Stewart W, Bolla K, Yousem D, Davatzikos C, Schwartz BS. Are brain volumes based on magnetic resonance imaging mediators of the associations of cumulative lead dose with cognitive function? American Journal of Epidemiology. 2008;167(4):429–437.
- Chen S, Wang C, Eberly L, Caffo B, Schwartz B. Adaptive control of the false discovery rate in voxel-based morphometry. Human Brain Mapping. 2009;30.
- Crainiceanu CM, Caffo BS, Luo S, Zipunnikov VV, Punjabi NM. Population value decomposition, a framework for the analysis of image populations. Journal of the American Statistical Association. 2011, in press.
- Crainiceanu CM, Staicu A-M, Di C-Z. Generalized multilevel functional regression. Journal of the American Statistical Association. 2009;104(488):1550–1561.
- Davatzikos C, Genc A, Xu D, Resnick SM. Voxel-based morphometry using the RAVENS maps: methods and validation using simulated longitudinal atrophy. NeuroImage. 2001;14:1361–1369.
- Demmel JW. Applied Numerical Linear Algebra. SIAM; 1997.
- Di C-Z, Crainiceanu CM. Multilevel sparse functional principal component analysis. Technical Report; 2010.
- Di C-Z, Crainiceanu CM, Caffo BS, Punjabi NM. Multilevel functional principal component analysis. Annals of Applied Statistics. 2009;3(1):458–488.
- Eilers PH, Currie ID, Durban M. Fast and compact smoothing on large multidimensional grids. Computational Statistics and Data Analysis. 2006;50:61–76.
- Golub GH, Van Loan CF. Matrix Computations. The Johns Hopkins University Press; 1996.
- Greven S, Crainiceanu CM, Caffo BS, Reich D. Longitudinal functional principal component analysis. Electronic Journal of Statistics. 2010;4:1022–1054.
- Hall P, Müller HG, Yao F. Modelling sparse generalized longitudinal observations with latent Gaussian processes. Journal of the Royal Statistical Society: Series B. 2008;70(4):703–723.
- Mazziotta J, Toga A, Evans A, Fox P, Lancaster J. A probabilistic atlas of the human brain: theory and rationale for its development. The International Consortium for Brain Mapping (ICBM). NeuroImage. 1995;2:89–101.
- Mohamed A, Davatzikos C. Shape representation via best orthogonal basis selection. Medical Image Computing and Computer-Assisted Intervention. 2004;3216:225–233.
- Morris JS, Carroll RJ. Wavelet-based functional mixed models. Journal of the Royal Statistical Society, Series B. 2006;68:179–199.
- Morris JS, Vannucci M, Brown PJ, Carroll RJ. Wavelet-based nonparametric modeling of hierarchical functions in colon carcinogenesis. Journal of the American Statistical Association. 2003;98:573–584.
- Reiss PT, Ogden RT. Functional generalized linear models with applications to neuroimaging. Poster presentation, Workshop on Contemporary Frontiers in High-Dimensional Statistical Data Analysis; 2008.
- Reiss PT, Ogden RT. Functional generalized linear models with images as predictors. Biometrics. 2010;66(1):61–69.
- Reiss PT, Ogden RT, Mann J, Parsey RV. Functional logistic regression with PET imaging data: a voxel-level clinical diagnostic tool. Journal of Cerebral Blood Flow & Metabolism. 2005;25:s635.
- Schwartz B, Chen S, Caffo B, Stewart W, Bolla K, Yousem D, Davatzikos C. Relations of brain volumes with cognitive function in males 45 years and older with past lead exposure. NeuroImage. 2007;37:633–641.
- Schwartz B, Stewart W, Bolla K, Simon D, Bandeen-Roche K, Gordon B, Links J, Todd A. Past adult lead exposure is associated with longitudinal decline in cognitive function. Neurology. 2000a;55:1144.
- Schwartz B, Stewart W, Kelsey K, Simon D, Park S, Links J, Todd A. Associations of tibial lead levels with BsmI polymorphisms in the vitamin D receptor in former organolead manufacturing workers. Environmental Health Perspectives. 2000b;108:199.
- Staicu A-M, Crainiceanu CM, Carroll RJ. Fast analysis of spatially correlated multilevel functional data. Biostatistics. 2010;11(2):177–194.
- Stewart W, Schwartz B, Simon D, Bolla K, Todd A, Links J. Neurobehavioral function and tibial and chelatable lead levels in 543 former organolead workers. Neurology. 1999;52:1610.
- Su S, Caffo BS, Eberly LE, Garrett-Mayer E, Stewart WF, Chen S, Yousem D, Davatzikos C, Schwartz B. On the merits of voxel-based morphometric path-analysis for investigating volumetric mediation of a toxicant's influence on cognitive function. Johns Hopkins University, Department of Biostatistics Working Papers. 2008;160.
- Xue Z, Shen D, Davatzikos C. CLASSIC: consistent longitudinal alignment and segmentation for serial image computing. NeuroImage. 2006;30:388–399.
- Zipunnikov V, Caffo B, Yousem DM, Davatzikos C, Schwartz BS, Crainiceanu CM. Functional principal component models for high dimensional brain volumetrics. NeuroImage. 2011, in press.