Abstract
An increasingly important goal of psychiatry is the use of brain imaging data to develop predictive models. Here we present two contributions to statistical methodology for this purpose. First, we propose and compare a set of wavelet-domain procedures for fitting generalized linear models with scalar responses and image predictors: sparse variants of principal component regression and of partial least squares, and the elastic net. Second, we consider assessing the contribution of image predictors over and above available scalar predictors, in particular via permutation tests and an extension of the idea of confounding to the case of functional or image predictors. Using the proposed methods, we assess whether maps of a spontaneous brain activity measure, derived from functional magnetic resonance imaging, can meaningfully predict presence or absence of attention deficit/hyperactivity disorder (ADHD). Our results shed light on the role of confounding in the surprising outcome of the recent ADHD-200 Global Competition, which challenged researchers to develop algorithms for automated image-based diagnosis of the disorder.
Keywords: ADHD-200, elastic net, functional confounding, functional magnetic resonance imaging, functional regression, sparse principal component regression, sparse partial least squares
1. Introduction
A major goal of current psychiatric neuroimaging research is to predict clinical outcomes on the basis of quantitative image data. Many studies have focused on “predicting” current disease states from brain images (e.g., Craddock et al., 2009; Sun et al., 2009). While seemingly less difficult than accurate prediction of future outcomes, the goal of clinically useful imaging-based diagnosis has proved highly challenging (Kapur, Phillips and Insel, 2012; Honorio et al., 2012).
This paper addresses two important limitations of standard methods for using brain images to predict psychiatric outcomes:
- (i) Ordinarily, the voxels (volume units) of the brain are treated as interchangeable predictors or “features.” Accuracy might be improved by properly exploiting the spatial arrangement of the brain.
- (ii) In some cases brain images may prove successful for diagnostic classification, but only because the images are related to one or more scalar covariates that drive the association. This is a nonstandard form of confounding, and there seems to be no existing methodology for detecting it. In other words, little is known about how to assess whether image data offers “added value” for prediction, beyond what is available from non-image data—which will typically be much simpler to acquire.
To address limitation (i), we approach the general problem as one of regressing scalar responses on image predictors, which are viewed, as in Reiss (2006) and Reiss and Ogden (2010), as a challenging special case of functional predictors (Ramsay and Silverman, 2005). The responses y1, . . . , yn are assumed to be generated independently by the model
$$y_i \sim \mathrm{EF}(\mu_i,\,\phi), \qquad (1)$$
$$g(\mu_i) = t_i^T \delta + \int_{\mathcal{S}} x_i(s)\,\beta(s)\,ds. \qquad (2)$$
Here EF(μi, ϕ) denotes an exponential family distribution with mean μi and scale parameter ϕ, with which is associated a link function g; ti is an m-dimensional vector of (scalar) covariates, of which the first is the constant 1; xi(·) is a functional predictor with domain $\mathcal{S} \subset \mathbb{R}^2$ or $\mathbb{R}^3$; and the corresponding effect, the coefficient function or coefficient image β(·), is the parameter of interest. The simplest special case is the linear model
$$y_i = t_i^T \delta + \int_{\mathcal{S}} x_i(s)\,\beta(s)\,ds + \varepsilon_i, \qquad (3)$$
where the εi are independent and identically distributed errors with mean 0 and variance σ² (= ϕ). When ti ≡ 1 (i.e., no scalar covariates), model (3) is the extension, from one-dimensional to multidimensional predictors, of the functional linear model that has been studied by Marx and Eilers (1999), Cardot, Ferraty and Sarda (1999), Müller and Stadtmüller (2005), Ramsay and Silverman (2005), Hall and Horowitz (2007), Reiss and Ogden (2007), Goldsmith et al. (2011), and many others.
For the case of one-dimensional functional predictors, a popular way to take spatial information into account is to restrict β(·) to the span of a spline basis (e.g., Marx and Eilers, 1999). Spline methods for two-dimensional predictors have been studied by Marx and Eilers (2005) and Guillas and Lai (2010), and by Reiss and Ogden (2010), whose work was motivated by neuroimaging applications.
Some more-recent work has considered neuroimaging applications with two- and three-dimensional predictors (Zhou, Li and Zhu, 2013; Goldsmith, Huang and Crainiceanu, 2013; Huang et al., 2013). In this paper, we propose a set of new approaches based on a wavelet representation of the coefficient image. The idea of transforming the images to the wavelet domain has previously appeared in the brain mapping literature, where it is customary to fit separate models at each voxel, with the image-derived quantity regressed on demographic or clinical variables of interest (e.g., Ruttimann et al., 1998; Van De Ville et al., 2007). But for our objective of using entire images in a single model to predict a scalar response, working in the wavelet domain has been mentioned as a natural idea (Grosenick et al., 2013) but rarely if ever pursued, at least until the very recent work of Wang et al. (2014). Unlike spline bases, wavelet bases are designed for sparse representation, and yield estimates that adapt to the features of the coefficient image.
Limitation (ii) was highlighted by the results of the recent ADHD-200 Global Competition for automated diagnosis of attention deficit/hyperactivity disorder (ADHD-200 Consortium, 2012). Teams were provided with functional magnetic resonance images from ADHD subjects and controls on which to train diagnostic algorithms, and then applied these algorithms to predict diagnosis in a separate set of images. A team of biostatisticians from Johns Hopkins University, whose methods are described by Eloyan et al. (2012), achieved the highest score for correct imaging-based classification, and were declared the winners. But a team from the University of Alberta, which discarded the images and used just four scalar predictors (age, sex, handedness, and IQ; see Brown et al., 2012), attained a slightly higher classification score (see Caffo et al., 2012, for related discussion).
To address limitation (ii), we test the effect of image predictors via a permutation-based approach originally proposed in the neuroimaging literature (Golland and Fischl, 2003), which we extend to allow for scalar covariates. We also consider how to extend the traditional notion of confounding to settings with both scalar and image predictors. These ideas are illustrated using our wavelet methods, but are not specific to them; rather, they are applicable with other approaches to functional or high-dimensional regression.
Our contributions can be summarized as follows. (i) We propose novel wavelet-domain methodology for regression with image predictors. While Wang et al. (2014) and Zhao, Chen and Ogden (2014) have studied the wavelet-domain lasso for image predictors, we also propose and compare several other methods, and consider the generalized linear case and the role of scalar covariates. (ii) We extend predictive performance-based hypothesis testing (Golland and Fischl, 2003) to the case where scalar confounders are present, providing a new way to assess the usefulness of image-based prediction.
In §2 we introduce wavelet bases, and motivate and outline a general template for scalar-on-image regression in the wavelet domain. §3 describes three specific algorithms, which are evaluated in simulations in §4. §5 considers hypothesis testing and confounding with image predictors. In §6 the proposed methods are applied to a portion of the ADHD-200 data set, and the results point to a possible role of confounding in the competition's surprising result. §7 offers a concluding discussion.
2. Wavelets and their use in regression on images
2.1 A brief introduction to wavelet basis representations
Wavelet bases are a popular way to obtain a sparse representation for functional data, in particular when the degree of smoothness exhibits local variation (see Ogden, 1997; Vidakovic, 1999; Nason, 2008, for statistically-oriented treatments). A wavelet basis for $L^2(\mathbb{R})$ is constructed from a scaling function (or “father wavelet”) ϕ and a wavelet function (“mother wavelet”) ψ (see Figure 1(a)–(b)), with the following properties:
For each $j \in \mathbb{Z}$, the shifted and dilated functions $\phi_{jk}(t) = 2^{j/2}\phi(2^j t - k)$, $k \in \mathbb{Z}$, form an orthonormal basis for $V_j$, where $\{V_j\}_{j \in \mathbb{Z}}$ is a nested sequence of subspaces whose union is a dense subspace of $L^2(\mathbb{R})$.
For each $j \in \mathbb{Z}$, the functions $\psi_{jk}(t) = 2^{j/2}\psi(2^j t - k)$, $k \in \mathbb{Z}$, form an orthonormal basis for a “detail space” $W_j$ satisfying $V_{j+1} = V_j \oplus W_j$.
Hence, for any integer $j_0 \geq 0$, $V_{j_0} \oplus W_{j_0} \oplus W_{j_0+1} \oplus \cdots$ is a dense subspace of $L^2(\mathbb{R})$.
Fig 1.
(a) Scaling function ϕ and (b) wavelet function ψ for 1D Daubechies (1988) “least-asymmetric” wavelets with 10 vanishing moments. 2D basis functions are formed from tensor products of these, two of which are shown in (c) and (d).
Given appropriate boundary handling, such as modifying the scaling and wavelet functions to be periodic, one can likewise construct orthonormal wavelet bases for L2[0, 1], of the form
$$\{\phi_{j_0 k} : k = 0, \dots, 2^{j_0} - 1\} \cup \{\psi_{jk} : j \geq j_0;\ k = 0, \dots, 2^{j} - 1\}$$
—i.e., $2^{j_0}$ scaling functions (corresponding to the large-scale features of the data), $2^{j_0}$ wavelet functions at level $j_0$, $2^{j_0+1}$ wavelet functions at level $j_0 + 1$, and so on, with higher wavelet levels capturing finer-scale details. This multiscale structure is what makes wavelet bases so useful for sparse representation of functions with varying degrees of smoothness.
The wavelet decomposition level j0 acts as a tuning parameter. A small j0 implies that a small number ($2^{j_0}$) of scaling functions are used to construct the macro features of the function, with most of the basis elements dedicated to providing detail at a variety of scales. A large j0 allows for many more scaling functions, each at higher resolution, and thus fewer basis elements corresponding to detail.
In practice, a function f ∈ L2[0, 1] is observed at finitely many points, ordinarily taken to be the N = 2^J (for some positive integer J) equally spaced points $0, \frac{1}{N}, \dots, \frac{N-1}{N}$. (When the function is observed at a number of points that is not a power of 2, one can insert zeroes before and after to attain the next-highest power of 2.) The observed values can then be interpolated by the N-dimensional truncated basis
$$\{\phi_{j_0 k} : k = 0, \dots, 2^{j_0} - 1\} \cup \{\psi_{jk} : j_0 \leq j \leq J - 1;\ k = 0, \dots, 2^{j} - 1\}.$$
The discrete wavelet transform (DWT), implemented by the O(N) pyramid algorithm of Mallat (1989), expands f with respect to this basis. Given a judicious choice of ϕ and ψ, signals of varying smoothness can be well represented with a small number of coefficients. Throughout this paper we use the compactly supported Daubechies (1988) “least-asymmetric” wavelets with 10 vanishing moments, displayed in Figure 1.
Wavelet bases for two dimensions can be constructed by taking tensor products of the ϕ and ψ functions. The two-dimensional scaling function is ϕ(x)ϕ(y), and there are three types of 2D wavelets: ϕ(x)ψ(y), ψ(x)ϕ(y) and ψ(x)ψ(y), roughly corresponding to “horizontal,” “vertical,” and “diagonal” detail respectively (see Figure 1(c)–(d)). These functions are dilated and translated just as their 1D counterparts are. Wavelet bases for 3D are constructed similarly. Morris et al. (2011) discuss alternative wavelet transforms for images that are not constructed as tensor products.
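To make this concrete, the following minimal R sketch, using the wavethresh package (Nason, 2013) on a simulated 64 × 64 image, computes a 2D DWT with the wavelets used throughout this paper, sparsifies it by thresholding, and inverts the transform:

```r
library(wavethresh)

set.seed(1)
img <- matrix(rnorm(64 * 64), 64, 64)  # toy image; side length must be a power of 2

# 2D DWT with Daubechies "least-asymmetric" wavelets, 10 vanishing moments
dec <- imwd(img, filter.number = 10, family = "DaubLeAsymm")

# Sparsify by thresholding the detail coefficients, then invert the transform
rec_sparse <- imwr(threshold(dec))

# Without thresholding, the inverse DWT recovers the image up to numerical error
max(abs(imwr(dec) - img))
```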
2.2 A meta-algorithm for scalar-on-image regression
Henceforth, the functional predictor xi(·) of (2), (3) will be replaced by the ith discretized image observation $x_i = (x_{i1}, \dots, x_{iN})^T \equiv [x_i(s_1), \dots, x_i(s_N)]^T$, where $s_1, \dots, s_N \in \mathcal{S}$ are distinct spatial locations at which the function xi is measured. Often, in practice, each image is given as a matrix or 3D array; xi is then obtained by converting this into a vector. From now until §3.5 we focus on the linear model (3), which can now be written in matrix form as
$$y = T\delta + X\beta + \varepsilon. \qquad (4)$$
Here $y = (y_1, \dots, y_n)^T$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$; $T$ is the n × m matrix with ith row $t_i^T$; $X$ is the n × N matrix with ith row $x_i^T$; and $\beta = (\beta_1, \dots, \beta_N)^T$ is a similarly discretized version of the coefficient image β(·). More precisely, for j = 1, . . . , N, $\beta_j = w_j\beta(s_j)$, where the $w_j$'s are quadrature weights such that $x_i^T\beta$ is a good approximation to the integral in (3); but for image data, $s_1, \dots, s_N$ typically form an equally spaced grid, so these weights are taken as constant and hence ignored. With these definitions, (3) is just the ith of the n equations that make up the vector equation (4).
To simplify the notation, we shall use a single subscript and denote the wavelet basis functions for a given j0 as {ψ1, ψ2, . . ., ψN}. The wavelet representation of the ith observed image is $x_i = \sum_{j=1}^{N} \tilde{x}_{ij}\psi_j$, in which the wavelet coefficients are given by $\tilde{x}_{ij} = \psi_j^T x_i$. The coefficient vector $\tilde{x}_i = (\tilde{x}_{i1}, \dots, \tilde{x}_{iN})^T$ can be written as $\tilde{x}_i = W x_i$, where $W$ is an N × N orthonormal matrix (which is not formed explicitly when $\tilde{x}_i$ is computed by the DWT). Similarly, the discretized coefficient function β can be represented in terms of its wavelet coefficients $\tilde{\beta}$ as $\beta = W^T\tilde{\beta}$, leading to the wavelet-domain version of model (4):
$$y = T\delta + \tilde{X}\tilde{\beta} + \varepsilon, \qquad (5)$$
where $\tilde{X} = XW^T$ is the n × N matrix with ith row $\tilde{x}_i^T$.
The key point is that the wavelet-domain form (5) is better suited than the original form (4) for applying sparse techniques for high-dimensional regression—both because wavelet bases are designed for sparse representation of images (Mallat, 2009), and because the DWT approximately decorrelates or “whitens” data (Vidakovic, 1999). We can thus formulate a “meta-algorithm” for scalar-on-image regression in the wavelet domain:
1. Apply the DWT to the image predictors to transform model (4) into model (5).
2. Use some high-dimensional regression methodology to derive a sparse estimate $\hat{\tilde{\beta}}$ of $\tilde{\beta}$ in (5).
3. Apply the inverse DWT to $\hat{\tilde{\beta}}$ to obtain a coefficient image estimate $\hat{\beta} = W^T\hat{\tilde{\beta}}$ for the original model (4).
Different choices for step 2 lead to specific algorithms, as described in the next section.
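The following R sketch illustrates the three steps on simulated data. It is a minimal illustration, not the paper's own implementation: step 2 here uses the lasso via glmnet (anticipating §3.3), and the DWT matrix W is formed explicitly for clarity, even though in practice the pyramid algorithm avoids doing so.

```r
library(wavethresh)
library(glmnet)

set.seed(1)
n <- 100; d <- 64; N <- d^2
X <- matrix(rnorm(n * N), n, N)  # rows are vectorized d x d images (toy data)
y <- rnorm(n)                    # scalar responses (toy data)

# Orthonormal 1D DWT matrix W (periodic boundary), built column by column;
# the separable 2D transform of a d x d image M is then W %*% M %*% t(W)
W <- matrix(0, d, d)
for (j in 1:d) {
  e <- numeric(d); e[j] <- 1
  w <- wd(e, filter.number = 10, family = "DaubLeAsymm", bc = "periodic")
  W[, j] <- c(accessC(w, level = 0), w$D)
}

# Step 1: DWT of each image predictor, giving the matrix X-tilde of (5)
to_wave <- function(x) as.vector(W %*% matrix(x, d, d) %*% t(W))
Xt <- t(apply(X, 1, to_wave))

# Step 2: sparse high-dimensional regression in the wavelet domain (lasso here)
cvfit <- cv.glmnet(Xt, y, alpha = 1)
beta_tilde <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]

# Step 3: inverse DWT of the estimated wavelet coefficients
beta_image <- t(W) %*% matrix(beta_tilde, d, d) %*% W
```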
The above general scheme can be extended to multiple image predictors (cf. Zhu, Vannucci and Cox, 2010). We note that this meta-algorithm has been applied before for 1D functional predictors (Brown, Fearn and Vannucci, 2001; Wang, Ray and Mallick, 2007; Malloy et al., 2010; Zhao, Ogden and Reiss, 2012), and more recently for image predictors (Wang et al., 2014; Zhao, Chen and Ogden, 2014). Past work on wavelet-domain classification, as opposed to regression (e.g. Berlinet, Biau and Rouvière, 2008; Zhu, Brown and Morris, 2012; Chang, Chen and Ogden, 2014), may bear comparison to our proposed methods. Morris et al. (2011) develop wavelet-domain functional mixed models with images as responses.
3. Three wavelet-domain algorithms
3.1. Sparse wavelet-domain principal component regression
The functional linear model (3) is often fitted by assuming the coefficient function has a truncated functional principal component, or Karhunen-Loève, representation $\beta(\cdot) = \sum_{k=1}^{m} \gamma_k \rho_k(\cdot)$, where m is a positive integer and ρ1, ρ2, . . . , ρm are the first m eigenfunctions of the covariance operator associated with the predictor functions xi (e.g., Cardot, Ferraty and Sarda, 1999; Müller and Stadtmüller, 2005; Cai and Hall, 2006). The eigenfunctions ρ1, ρ2, . . . , ρm can be estimated by viewing the functional predictors as (highly) multivariate data, and applying ordinary principal component analysis to the predictor matrix X.
Here and in §3.2, we assume that X has mean-centered columns, i.e., 1TX = 0. The approach of the previous paragraph then amounts to assuming $\beta = V_m\gamma$ for some $\gamma \in \mathbb{R}^m$, where $UDV^T$ is the singular value decomposition of X, and $V_m$ comprises the leading m columns of V. Hence estimation reduces to choosing δ, γ to minimize the principal component regression (PCR; Massy, 1965) criterion
$$\|y - T\delta - XV_m\gamma\|^2. \qquad (6)$$
(This is a slightly nonstandard PCR criterion, in that principal component reduction is applied only to X but not to T. A similar remark applies to the other criteria introduced below.)
As shown by Reiss and Ogden (2007), PCR can be implemented more effectively by exploiting the functional character of the data. In the one-dimensional functional predictor case, this has usually meant forming smooth estimates of the eigenfunctions—as in the FPCR_C method of Reiss and Ogden (2007), which expands the eigenfunctions with respect to a B-spline basis (cf. Cardot, Ferraty and Sarda, 2003). But for image predictors, local adaptivity—the ability to capture sharp features in some areas, vs. a high degree of smoothness elsewhere—becomes particularly important. This motivates using a wavelet basis, rather than a spline basis, to represent the eigenfunctions; or in other words, developing a wavelet-domain version of PCR as an instance of the meta-algorithm of §2.2.
A non-sparse wavelet-domain PCR estimate would minimize
$$\|y - T\delta - \tilde{X}\tilde{V}_m\gamma\|^2, \qquad (7)$$
which is analogous to (6) but based on the SVD of X̃ rather than of X. However, the advantage of working in the wavelet domain is to obtain a sparse coefficient estimate, by replacing the PC weights Ṽ with weights from a sparse version of PCA. Several penalty-based methods have been proposed for sparse PCA (e.g., Zou, Hastie and Tibshirani, 2006; Shen and Huang, 2008; Witten, Tibshirani and Hastie, 2009); but we opted for the approach of Johnstone and Lu (2009), which is simpler than the penalized methods and, unlike them, was developed with a view toward sparse wavelet representations of signals. Johnstone and Lu (2009) propose to select the features or coordinates with highest variance, and apply PCA only to these. The resulting sparse PCR criterion is
$$\|y - T\delta - \tilde{X}^*\tilde{V}_m^*\gamma\|^2; \qquad (8)$$
here X̃* consists of the c columns of X̃ having highest variance, and $\tilde{V}_m^*$ consists of the leading m columns of Ṽ*, where Ũ* D̃* Ṽ*T is the SVD of X̃*. The minimizer of (8) can be obtained by simple least squares. The vector of wavelet coefficient estimates is then $\hat{\tilde{\beta}} = \tilde{V}_m^*\hat{\gamma}$ (with zeroes in the coordinates excluded from X̃*), and the coefficient image estimate is derived by the inverse DWT.
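A minimal R sketch of this procedure, continuing with the wavelet-coefficient matrix Xt from the code in §2.2, and assuming no scalar covariates (ti ≡ 1, so that T reduces to the intercept):

```r
# Screen wavelet coefficients by variance (Johnstone & Lu, 2009), then do PCR
Xt_c <- scale(Xt, center = TRUE, scale = FALSE)  # mean-center the columns
c_keep <- 200; m <- 4                            # tuning parameters c and m
sel <- order(apply(Xt_c, 2, var), decreasing = TRUE)[1:c_keep]
sv <- svd(Xt_c[, sel])
Vm <- sv$v[, 1:m, drop = FALSE]                  # leading m sparse-PC weight vectors
scores <- Xt_c[, sel] %*% Vm                     # sparse principal component scores

fit <- lm(y ~ scores)                            # least-squares fit of criterion (8)
beta_tilde <- numeric(ncol(Xt))                  # wavelet-coefficient estimate:
beta_tilde[sel] <- Vm %*% coef(fit)[-1]          # zero outside the selected set
# beta_tilde is then mapped back by the inverse DWT, as in step 3 of Section 2.2
```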
3.2 Sparse wavelet-domain partial least squares
Whereas PCR reduces dimension by regressing on the leading PCs of the predictors, partial least squares (PLS; Wold, 1966) works by regressing on a set of components that are relevant to predicting the responses. A (non-sparse) wavelet-domain PLS estimate (cf. Nadler and Coifman, 2005) is derived by minimizing
$$\|y - T\delta - \tilde{X}\tilde{R}_m\gamma\|^2 \qquad (9)$$
(cf. (7)), where the columns of R̃m are defined iteratively as follows (Stone and Brooks, 1990):
- $\tilde{r}_1 = \arg\max_{\|r\|=1} \mathrm{Cov}(y, \tilde{X}r)$;
- for j = 2, . . ., m, $\tilde{r}_j = \arg\max_{\|r\|=1} \mathrm{Cov}(y, \tilde{X}r)$ subject to the scores $\tilde{X}\tilde{r}_j$ being orthogonal to $\tilde{X}\tilde{r}_1, \dots, \tilde{X}\tilde{r}_{j-1}$.
Once again, however, the point of working in the wavelet domain is to obtain a sparse estimate. To define sparse wavelet-domain PLS, as with PCR, we could have used penalization to derive sparse PLS components (Chun and Keleş, 2010); but we instead opted to build on the aforementioned approach of Johnstone and Lu (2009) to sparse PCA. A natural PLS analogue of that approach is to select those features x̃j whose covariance with y has the greatest magnitude. This results in the sparse PLS criterion
$$\|y - T\delta - \tilde{X}^{\dagger}\tilde{R}_m^{\dagger}\gamma\|^2; \qquad (10)$$
here X̃† consists of the c columns of X̃ having highest covariance with y, and the columns of $\tilde{R}_m^{\dagger}$ are defined analogously to those of R̃m in (9). As for PCR, the least-squares minimizer of (10) leads directly to estimates of the wavelet coefficients and of the resulting coefficient image β.
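A sketch in the same style (again assuming ti ≡ 1). The components are built by the usual deflation algorithm; for brevity we stop at the component scores, on which the response is regressed:

```r
# Screen wavelet coefficients by |covariance with y|, then build PLS components
c_keep <- 200; m <- 4
sel <- order(abs(cov(Xt_c, y)), decreasing = TRUE)[1:c_keep]
Z <- Xt_c[, sel]
y_c <- y - mean(y)

S <- matrix(0, nrow(Z), m)                    # PLS component scores
for (j in 1:m) {
  r <- as.numeric(crossprod(Z, y_c))          # direction maximizing Cov(y, Zr)
  r <- r / sqrt(sum(r^2))
  s <- as.numeric(Z %*% r)
  S[, j] <- s
  Z <- Z - outer(s, as.numeric(crossprod(Z, s)) / sum(s^2))  # deflate Z
}
fit_pls <- lm(y ~ S)                          # least-squares fit of criterion (10)
# Mapping the fit back to wavelet coefficients uses the accumulated weights
# (omitted here); prediction can proceed directly from the component scores.
```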
Our PLS algorithm is a wavelet-domain counterpart of the spline-based functional PLS procedure denoted by FPLS_C in Reiss and Ogden (2007). We note that Preda and Saporta (2005) and Delaigle and Hall (2012b) have proposed more explicitly functional formulations of PLS, based on covariance operators on function spaces.
3.3 Wavelet-domain elastic net
Since wavelet bases are well suited for sparse representation of functions, recent work has considered combining them with sparsity-inducing penalties, both for semiparametric regression (Wand and Ormerod, 2011) and for regression with functional or image predictors (Zhao, Ogden and Reiss, 2012; Wang et al., 2014; Zhao, Chen and Ogden, 2014). The latter papers focused on ℓ1 penalization, also known as the lasso (Tibshirani, 1996), in the wavelet domain. Alternatives to the lasso include the SCAD penalty (Fan and Li, 2001) and the adaptive lasso (Zou, 2006). Here we consider the elastic net (EN) estimator for wavelet-domain model (5), which minimizes
$$\|y - T\delta - \tilde{X}\tilde{\beta}\|^2 + \lambda\left[\alpha\|\tilde{\beta}\|_1 + \frac{1-\alpha}{2}\|\tilde{\beta}\|_2^2\right] \qquad (11)$$
over $(\delta, \tilde{\beta})$ for a regularization parameter λ > 0 and a mixing parameter α ∈ [0, 1] which controls the relative strength of the ℓ1 and ℓ2 penalties on the coefficients (Zou and Hastie, 2005).
In the original nomenclature of Zou and Hastie (2005), the minimizer of (11) is the “naïve” EN, whereas EN is a rescaled version. Since we shall make use of the generalized linear extension of EN as implemented by Friedman, Hastie and Tibshirani (2010), we follow these authors in omitting the rescaling step. When α > 0, the ℓ1 penalty shrinks small coefficients to zero, leading to a sparse wavelet representation. The wavelet-domain lasso is obtained when α = 1. As explained by Zou and Hastie (2005), given a group of important features that are highly correlated, the lasso tends to select just one whereas EN selects the entire group, which is often preferable—even in the wavelet domain, notwithstanding the “whitening” property of the discrete wavelet transform.
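In R, the minimizer of (11) is available directly from the glmnet package (a minimal sketch continuing the earlier notation; leaving the scalar covariates unpenalized via the penalty.factor argument is our suggestion, not necessarily the authors' implementation):

```r
library(glmnet)

# Wavelet-domain elastic net: alpha mixes the l1 and l2 penalties in (11)
cvfit <- cv.glmnet(Xt, y, alpha = 0.5, standardize = FALSE)
beta_tilde <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # sparse estimate

# With scalar covariates in T_mat (here a toy n x 1 matrix), left unpenalized
T_mat <- cbind(rnorm(length(y)))
pf <- c(rep(0, ncol(T_mat)), rep(1, ncol(Xt)))  # zero penalty on the T columns
cvfit2 <- cv.glmnet(cbind(T_mat, Xt), y, alpha = 0.5,
                    penalty.factor = pf, standardize = FALSE)
```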
3.4 Summary: Alternative routes to sparsity
All three of the above methods seek to represent the coefficient image β(·) sparsely, as a linear combination of a subset of the wavelet basis functions; but they deploy very different strategies to choose that subset. The ℓ1 penalty in the elastic net criterion (11) has the effect of shrinking small coefficients to zero. This can be interpreted as imposing a prior that favors a sparse estimate. The PCR criterion (8) eliminates basis elements before performing regression, based on an implicit assumption that those basis elements with low variance in the data have little to contribute to the coefficient image. This assumption is broadly consistent, on the one hand, with the assumption of Johnstone and Lu (2009) that such basis elements are merely capturing noise; and on the other hand, with the underlying assumption of PCR, namely that the highest-variance principal components are most relevant in regression (see Cook, 2007, for some relevant discussion). The PLS criterion (10) likewise lets the data determine which basis elements to include; but here, instead of considering only the wavelet-transformed image data X̃ as in PCR, we define relevant components by iteratively maximizing covariance with the responses y.
3.5 Extension to the generalized linear case
The above three wavelet-domain algorithms can be straightforwardly extended from linear to generalized linear models (GLMs) of the form
$$y_i \sim \mathrm{EF}(\mu_i,\,\phi), \qquad g(\mu_i) = t_i^T\delta + \tilde{x}_i^T\tilde{\beta}, \qquad (12)$$
for a link function g, generalizing (5). For PCR, one simply fits a GLM, as opposed to a linear model, to the sparse PCs. For the elastic net, the glmnet algorithm of Friedman, Hastie and Tibshirani (2010) is available for the generalized linear case.
PLS is sometimes performed in an iteratively reweighted manner for GLMs (Marx, 1996); but in high-dimensional settings, such algorithms may require nontrivial modification (e.g. Ding and Gentleman, 2005) to avoid convergence problems. Here we view PLS as a generic approach to constructing relevant components, which may be employed beyond the linear regression setting (e.g. Nguyen and Rocke, 2002; Delaigle and Hall, 2012a). Thus we construct PLS components exactly as we would for a linear model, but then use these components to fit a GLM.
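A brief sketch of this two-stage approach for a binary response. The components are built exactly as in the linear sketch of §3.2 (with the binary response in place of y; build_pls_scores below is a hypothetical wrapper of that loop), and then fed to an ordinary GLM fit; for the elastic net, glmnet handles the GLM directly:

```r
# Toy binary response
y_bin <- rbinom(nrow(Xt), 1, 0.5)

# PCR/PLS route: construct components as in the linear case, then fit a GLM.
# build_pls_scores() is a hypothetical wrapper of the Section 3.2 loop.
S <- build_pls_scores(Xt, y_bin, c_keep = 200, m = 4)
glm_fit <- glm(y_bin ~ S, family = binomial())

# Elastic net route: glmnet fits the penalized GLM directly
cv_bin <- cv.glmnet(Xt, y_bin, family = "binomial", alpha = 0.5,
                    standardize = FALSE)
```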
3.6 Tuning parameter selection
For wavelet-domain PCR and PLS, three tuning parameters must be selected: the resolution-level parameter j0; the number c of wavelet coefficients to retain (i.e., the number of columns of X̃* in (8), or of X̃† in (10)); and the number m of PCs or PLS components. We generally fix j0 = 4, since we have found this resolution level to be either optimal or near-optimal as measured by cross-validation (CV). For wavelet-domain elastic net, one must choose j0 and the two penalty parameters α and λ in (11), but we again prefer to fix j0 = 4.
These tuning parameters are chosen by repeated K-fold CV. In the rth of R repetitions we divide the data points (yi, ti, xi) (i = 1, . . . , n) into K equal-sized validation sets indexed by Ir,1, . . . , Ir,K. We can then choose the tuning parameters to minimize the CV score
$$\mathrm{CV} = \sum_{r=1}^{R}\sum_{k=1}^{K}\sum_{i \in I_{r,k}} L\!\left(y_i,\ \hat{y}_i^{(-r,k)}\right), \qquad (13)$$
where $\hat{y}_i^{(-r,k)} = g^{-1}\big(t_i^T\hat{\delta}^{(-r,k)} + \tilde{x}_i^T\hat{\tilde{\beta}}^{(-r,k)}\big)$, with $\hat{\delta}^{(-r,k)}, \hat{\tilde{\beta}}^{(-r,k)}$ the estimates that result when model (12) is fitted (by PCR, PLS, or EN) with the observations indexed by Ir,k excluded, and L is an appropriate loss function. For linear regression the standard loss function is the squared error $L(y, \hat{y}) = (y - \hat{y})^2$. For the generalized linear case, following Zhu and Hastie (2004), we use the deviance as the loss function. For logistic regression specifically, unusually large summands can dominate criterion (13). Therefore, similarly to Chi and Scott (2013), we instead choose the tuning parameters by a robust CV score that takes the median rather than the mean over each repetition's K validation sets:
$$\mathrm{rCV} = \sum_{r=1}^{R}\ \operatorname*{median}_{k \in \{1,\dots,K\}}\ \sum_{i \in I_{r,k}} L\!\left(y_i,\ \hat{y}_i^{(-r,k)}\right). \qquad (14)$$
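A sketch of the robust CV criterion (14) for logistic regression, with deviance loss. The inner fit uses glmnet at fixed tuning values; in practice (14) would be evaluated over a grid of candidate values:

```r
robust_cv <- function(y, X, K = 5, R = 10, alpha = 0.5, lambda = 0.01) {
  rep_scores <- numeric(R)
  for (r in 1:R) {
    folds <- sample(rep(1:K, length.out = length(y)))  # random K-fold split
    fold_dev <- numeric(K)
    for (k in 1:K) {
      fit <- glmnet(X[folds != k, ], y[folds != k], family = "binomial",
                    alpha = alpha, lambda = lambda)
      p <- as.numeric(predict(fit, X[folds == k, ], type = "response"))
      p <- pmin(pmax(p, 1e-8), 1 - 1e-8)               # guard the log terms
      yk <- y[folds == k]
      fold_dev[k] <- -2 * sum(yk * log(p) + (1 - yk) * log(1 - p))
    }
    rep_scores[r] <- median(fold_dev)  # median, not mean, over the K folds
  }
  sum(rep_scores)
}
```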
4. Comparative simulation study
To test the performance of our methods with realistic image predictors, we created a data set based on the positron emission tomography (PET) data previously studied by Reiss and Ogden (2010). That data set included axial slices from 33 amyloid beta maps, from which we extracted a square region of 64 × 64 voxels. To generate a larger sample of n = 500 images, we applied a procedure similar to that of Goldsmith, Huang and Crainiceanu (2013):
1. We estimated the (vectorized) principal components (eigenimages) $\hat{v}_1, \dots, \hat{v}_{32}$ of the 33 images, with corresponding eigenvalues λ1, . . . , λ32.
2. For i = 1, . . ., 500, we generated the ith simulated predictor image as $x_i = \bar{x} + \sum_{j=1}^{32} c_{ij}\hat{v}_j$, where $\bar{x}$ is the sample mean image, with the cij's simulated independently from the N(0, λj) distribution.
In step 1 above we used the sparse PCA method of Johnstone and Lu (2009), including the 492 wavelet coefficients having the highest variance. This number of wavelet coefficients was sufficient to capture 99.5% of the “excess” variance, in the sense of section 4.2 of Johnstone and Lu (2009).
We used two different true coefficient images, β(1) and β(2), which are shown in Figure 2. The first image β(1) is similar to that used by Goldsmith, Huang and Crainiceanu (2013). Taking its domain to be [1, 64]2, this coefficient image is given by β(1) = g1 − g2, where g1, g2 are the densities of two bivariate normal distributions. The second image β(2) is a two-dimensional analogue of the “bumps” function used by Donoho and Johnstone (1994), and many subsequent authors, to illustrate the properties of wavelets.
Fig 2.
Coefficient images β(1) (left) and β(2) (right) used in the simulation study.
We then simulated continuous or binary outcomes y1, . . . , yn with specified approximate values of the coefficient of determination R2, in the sense detailed in Supplementary Appendix A.1 (Reiss et al., 2015). We generated 100 sets of n = 500 continuous outcomes, and 100 sets of 500 binary outcomes, for each of the R2 values 0.1, 0.5.
We compared the performance of the three wavelet-domain methods described in §3 with three analogous “voxel-domain” methods, i.e., sparse PCR, sparse PLS and elastic net without transformation to the wavelet domain. The wavelet- and voxel-domain methods are denoted by WPCR, WPLS and WNet and by VPCR, VPLS and VNet, respectively. We also included the B-spline-based functional PCR method (“FPCR_R”, or simply FPCR) of Reiss and Ogden (2007, 2010). Tuning parameter selection was as described in Supplementary Appendix A.1 (Reiss et al., 2015).
Performance was evaluated in terms of estimation error and prediction error. Estimation error is defined by the scaled mean squared error (MSE) $\|\hat{\beta} - \beta\|^2 / \|\beta\|^2$, where β, β̂ are the true and estimated coefficient images. Prediction error is defined using a separate set of outcomes $y_1^*, \dots, y_n^*$, generated from the same conditional distribution as y1, . . ., yn. We use the scaled mean squared prediction error as our criterion for linear regression, and the mean of the deviances of the $y_i^*$ as our criterion for logistic regression.
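For concreteness, the two criteria can be computed along the following lines (a sketch; scaling the prediction error by the empirical variance of the test responses is an assumed convention on our part):

```r
# Scaled estimation MSE: ||beta_hat - beta_true||^2 / ||beta_true||^2
scaled_mse <- sum((beta_hat - beta_true)^2) / sum(beta_true^2)

# Scaled mean squared prediction error on an independent test set
# (scaling by the test-response variance is an assumption here)
y_pred <- as.numeric(X_test %*% beta_hat)
scaled_mspe <- mean((y_test - y_pred)^2) / mean((y_test - mean(y_test))^2)
```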
Figure 3 presents boxplots of the results. In general all seven methods differ only slightly in prediction error. Much greater differences are seen for estimation error. Compared with the corresponding voxel-domain methods, the estimation MSE for wavelet methods is either roughly equal or clearly lower on average, and the variability of the MSE is often much lower. The wavelet methods also markedly outperform B-spline-based FPCR. Somewhat contrary to expectation, the superior performance of wavelet methods is not clearly more pronounced for β(2) than for β(1).
Fig 3.
Estimation error, displayed as log(scaled MSE) (left subfigure), and prediction error (right subfigure) in the simulation study.
While the wavelet-domain methods do not clearly attain lower estimation error than voxel-domain methods for logistic regression with R2 = 0.5, they do appear superior for the R2 = 0.1 setting (which seems more realistic) and for linear regression. Moreover, qualitatively, wavelet-domain modeling helps to capture the main features of the coefficient image. Figure 4 displays an example of the training-set estimates derived by wavelet-domain lasso versus ordinary lasso. The wavelet-domain estimates are clearly more similar to each other and to the true coefficient image than are the ordinary lasso estimates.
Fig 4.
True coefficient function β(1) from the comparative simulation study (top left) compared with five training-set coefficient function estimates (for data simulated under the R2 = .1 setting) based on wavelet-domain lasso (other top panels) and voxel-domain lasso (bottom panels). The wavelet-based estimates are reasonably accurate, while each of the voxel-domain estimates has about 20–25 scattered voxels with nonzero values. Note the unequal scales.
The wavelet-domain EN appears to have a slight edge overall compared with PCR and PLS. For this reason, and because wavelet EN (or at least its special case, the lasso) is now somewhat established in the literature (Zhao, Ogden and Reiss, 2012; Wang et al., 2014; Zhao, Chen and Ogden, 2014), the simulations and real-data analyses in the next two sections consider only wavelet-domain EN.
5. Inferential issues
We now turn to what the Introduction referred to as limitation (ii) of predictive analyses in neuroimaging: the need for methodology to assess the predictive value of image data, in particular when scalar covariates are present.
5.1 Permutation testing
Consider testing the null hypothesis β(s) ≡ 0 in the general model (1), (2), i.e., testing the null parametric model $g(\mu_i) = t_i^T\delta$ versus the alternative (2). Informally, we are asking whether the images have predictive value beyond the information contained in the scalar predictors. We propose a permutation test procedure in which the CV criterion (13) or (14) serves as the test statistic. If the true-data CV falls in the left tail of the distribution of permuted-data CV values, significance is declared. Permutation techniques of this kind have previously appeared in the neuroimaging and machine learning literature (Golland and Fischl, 2003; Ojala and Garriga, 2010).
The way the permutation distribution is constructed depends on the null model under consideration. When ti ≡ 1 in (2) (no scalar covariates), one can simply permute the responses: i.e., we repeatedly reorder the responses as yπ(1), . . ., yπ(n) for some permutation π, refit the model, and record the CV value. For the linear model (3) with scalar covariates, a common approach is to permute the residuals from the null parametric model: i.e., model (3) is refitted repeatedly with ith response of the form $t_i^T\hat{\delta} + \hat{\varepsilon}_{\pi(i)}$, where the hats refer to fitted values and residuals from the model $y_i = t_i^T\delta + \varepsilon_i$. For some GLMs, however, such pseudo-responses based on permuted residuals are not of the correct form (e.g., for logistic regression, they are not binary). One can instead form pseudo-predictors, by regressing the predictor of interest on the nuisance covariates and permuting the residuals from this fit. In other words, we replace the design matrix (T|X) with
$$\left(T \ \middle|\ P_T X + \Pi(I - P_T)X\right), \qquad (15)$$
where $P_T = T(T^TT)^{-1}T^T$ and Π is a permutation matrix. Although a similar idea was proposed by Potter (2005) for (ordinary) logistic regression, we have adopted it as our preferred permutation approach even for the linear case; see Supplementary Appendix B (Reiss et al., 2015) for further discussion.
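A sketch of the resulting permutation test, using the robust_cv function sketched in §3.6 as the test statistic. For simplicity the tuning parameters are held fixed across permutations (in practice they are re-selected for each fit), and the covariate columns are penalized along with the image columns:

```r
perm_test_image <- function(y, T_mat, X, n_perm = 999, ...) {
  Pt <- T_mat %*% solve(crossprod(T_mat), t(T_mat))  # projection onto col(T)
  Xf <- Pt %*% X                                     # fitted part of X given T
  Xr <- X - Xf                                       # residual part of X
  cv_obs <- robust_cv(y, cbind(T_mat, X), ...)       # true-data CV statistic

  cv_perm <- replicate(n_perm, {
    Xp <- Xf + Xr[sample(nrow(X)), ]                 # pseudo-predictors, eq. (15)
    robust_cv(y, cbind(T_mat, Xp), ...)
  })
  # Significance if the true-data CV lies in the left tail of the permutation
  # distribution: smaller CV deviance means better prediction
  (1 + sum(cv_perm <= cv_obs)) / (n_perm + 1)
}
```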
We conducted a simulation study, using the ADHD-200 image data analyzed in §6, to assess the type-I error rate and power of the permutation test procedure. Here we focus on logistic regression (see Supplementary Appendix C, Reiss et al., 2015, for linear regression results) and the wavelet-domain lasso. We first considered the case without scalar covariates, and generated binary responses yi ~ Bernoulli(pi), i = 1, . . . , n = 333, where
$$\mathrm{logit}(p_i) = \delta_0 + x_i^T\beta. \qquad (16)$$
Here δ0 is a constant used to adjust the base rate (probability of event); $x_i$ is the ith image (expressed as a mean-zero vector); and β is the true coefficient image shown in Figure 5(a) (similarly vectorized), multiplied by an appropriate constant to attain a specified value of R2 (see Supplementary Appendix A, Reiss et al., 2015, regarding the definition of R2). For each of the base rates .25, .5, .75 and each of the R2 values .0, .07, .1, .15, .2, .25, .3, we simulated 200 response vectors to assess power to reject H0 : β = 0 at the p = .05 level, as well as 1000 response vectors with β = 0 (R2 = 0) to assess the type-I error rate. Next we considered testing the same null hypothesis for the model
$$\mathrm{logit}(p_i) = t_i\delta + x_i^T\beta \qquad (17)$$
with a scalar covariate ti such that the R2 for the submodel logit{E(yi | ti)} = tiδ is approximately 0.2. We generated the same number of response vectors as above for each of the above R2 values, but here R2 refers to the partial R2 adjusting for ti (see Supplementary Appendix A.2, Reiss et al., 2015).
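A sketch of how binary responses with a given base rate can be generated under model (16). Choosing δ0 by solving for the average event probability is our assumption about how "adjusting the base rate" was implemented:

```r
gen_binary <- function(X_c, beta, base_rate) {
  eta <- as.numeric(X_c %*% beta)  # image contribution (X_c has mean-zero rows)
  # choose delta0 so that the mean event probability equals the target base rate
  delta0 <- uniroot(function(d) mean(plogis(d + eta)) - base_rate,
                    interval = c(-30, 30))$root
  rbinom(length(eta), 1, plogis(delta0 + eta))
}
```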
Fig 5.
(a) True coefficient image β used in the power study: gray denotes 0, black denotes 1. (b) Estimated probability of rejecting the null hypothesis β = 0 as a function of R2, with 95% confidence intervals, for model (16). (c) Same, for model (17).
The results, displayed in Figure 5(b) and (c), indicate that the nominal type-I error rate is approximately attained for both models. For a given R2 > 0, the power is somewhat higher for model (16) than for model (17), and highest for either model when the base rate is .5. Evidently, for base rates closer to 0 or 1, the CV deviance under the null hypothesis tends to be lower, and thus a stronger signal is needed to reject the null.
Basing a test of the hypothesis β(·) ≡ 0 on the prediction performance of an estimation algorithm, rather than on an estimate of β, is admittedly somewhat unconventional. In neuroimaging specifically, inference typically proceeds by fitting separate models at each voxel, and then applying some form of multiple testing correction (Nichols, 2012). In the present setting of a single model that uses the entire image to predict a scalar response, it might be possible to assign p-values to individual voxels as in Meinshausen, Meier and Bühlmann (2009). In practice, however, predictive algorithms tend to produce rather unstable estimates, as a number of authors have acknowledged (e.g., Craddock et al., 2009; Honorio et al., 2012; Sabuncu, Van Leemput and the Alzheimer Disease Neuroimaging Initiative, 2012). Our hypothesis testing approach thus sets the more modest inferential goal of verifying that the coefficient image as a whole yields better-than-chance prediction.
5.2. Confounding
For ordinary, as opposed to functional, regression, confounding is said to occur when (i) x appears predictive of y, but this relationship can be attributed to a third variable t such that (ii) t is predictive of y and (iii) t is correlated with x. For example, birth order (x) is associated with the occurrence of Down syndrome (y), but this is due to the effect of the confounding variable maternal age (t) (Rothman, 2012).
To extend the above definition to the case of a functional predictor x(·), suppose that (i) x(·) is ostensibly related to y, in the sense that β(·) is not identically zero when model (2) includes no scalar covariates; but (ii) the scalar variable t is also predictive of y. A functional-predictor analogue of point (iii) is to suppose that t is correlated with $\int_{\mathcal{S}} x(s)\hat{\beta}(s)\,ds$, where $\hat{\beta}$ is an estimate obtained with t excluded from model (2). Aside from this “global” analogue of (iii), it may be useful to consider a “local” analogue, which holds if t is correlated with x(s) specifically for s such that β(s) ≠ 0; but this is somewhat less straightforward to assess.
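The "global" check amounts to a single correlation per scalar covariate, e.g. (a sketch; beta_hat_noT denotes the discretized estimate β̂ fitted with t excluded):

```r
# Discretized version of the integral of x(s) * beta_hat(s): one value per subject
image_term <- as.numeric(X %*% beta_hat_noT)

# Correlation of the candidate confounder (e.g. age) with the image contribution
cor.test(age, image_term)
```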
6. Application: fALFF and ADHD
6.1 ADHD-200 data set and candidate models
We now apply the wavelet-domain elastic net to “predicting” ADHD diagnosis using maps of fractional amplitude of low-frequency fluctuations (fALFF) (Zou et al., 2008) from a portion of the ADHD-200 sample referred to in the Introduction (http://fcon_1000.projects.nitrc.org/indi/adhd200/). fALFF is defined as the ratio of BOLD signal power in the 0.01–0.08 Hz band to total power over the entire frequency range. Yang et al. (2011) reported altered levels of fALFF, specifically in frontal regions, in a sample of children with ADHD relative to controls. That study relied on the traditional analytic approach in neuroimaging, which regresses the imaged quantity (in this case fALFF) on diagnostic group, separately at each voxel. Here we employed scalar-on-image logistic regression, which reverses the roles of response and predictor, to regress diagnostic group on fALFF images. Our sample consisted of 333 individuals: 257 typically developing controls and 76 with combined-type ADHD. The sample included 198 males and 135 females, with age range 7–20 (see Supplementary Appendix D, Reiss et al., 2015, for further details). We chose the 2D slice for which the mean across voxels of the SD of fALFF was highest; this was the axial slice located at z = 26 (just dorsal to the corpus callosum) in the coordinate space of the Montreal Neurological Institute's MNI152 template (4mm resolution). We fitted two models. The first was
$$\mathrm{logit}\,P(y_i = 1) = \delta_0 + \int_{\mathcal{S}} x_i(s)\,\beta(s)\,ds, \qquad (18)$$
where xi(s) denotes the ith subject's fALFF image. The second model was
$$\mathrm{logit}\,P(y_i = 1) = t_i^T\delta + \int_{\mathcal{S}} x_i(s)\,\beta(s)\,ds, \qquad (19)$$
where the vector ti includes the ith subject's age, sex, IQ and mean framewise displacement (FD, a summary of head motion), as well as a leading 1 for the intercept.
Figure 6 shows the coefficient images attained for model (18) with each value of the mixing parameter α. As expected, increasing values of α lead to more-sparse estimates in the wavelet domain and hence in the voxel domain. Figure 7 shows the CV deviance as a function of λ for α = 0.1, which had the lowest CV deviance overall, as well as for α = 1.
Fig 6.
Coefficient image estimates for model (18) applied to the ADHD-200 data, using wavelet-domain elastic net with four different values of the mixing parameter α.
Fig 7.
Cross-validated deviance ± one approximate standard error, for the wavelet-domain elastic net models with α = 0.1, 1.
The left subfigure of Figure 8 shows that the CV deviance lies in the left tail of the permutation distribution for model (18), indicating a significant effect of the fALFF image predictors (p = 0.015). However, with the scalar covariate adjustment of model (19), this effect disappears. The next subsection examines more closely how the scalar covariates may be acting as confounders.
Fig 8.
Permutation test results. For the full sample (left), a significant effect of the fALFF images is seen in model (18), but not in model (19), which adjusts for scalar covariates. When only younger individuals are included (right), neither model shows a significant fALFF effect.
Our test of model (18) entailed 999 permuted-data fits with four candidate values of α and 100 of λ, requiring 14.25 hours on an Intel Xeon E5-2670 processor running at 2.6 GHz. In practice we recommend parallelizing the permutations via cluster computing to make the computation time more manageable. In addition, truncated sequential probability ratio tests (Fay, Kim and Hachey, 2007) could in some cases reduce computation time via early stopping. We also explored fitting model (18) with the full 3D fALFF images as predictors; see Supplementary Appendix E (Reiss et al., 2015).
6.2 Assessing and remedying confounding
As discussed in §5.2, the notion of confounding entails three elements (see Figure 9). Point (i), an apparent effect of the image predictor fALFF on diagnosis, was established by the above permutation test result for model (18). To check point (ii) of the definition for each of the four scalar covariates under consideration, we performed an ordinary logistic regression with diagnosis (1=ADHD, 0=control) as response and the four scalar covariates as predictors. In Table 1 (at left), sex, age and IQ are all seen to be significantly related to diagnosis. See also Figure 10, which compares the fitted probabilities from this ordinary logistic regression with those resulting from models (18) and (19). The scalar-covariates model is seen to separate the two groups (black vs. gray dots) quite well; the image predictors increase the spread of the predicted probabilities without clearly improving the two groups’ separation. Based on these results, each of these three variables may be acting as a confounder.
Fig 9.
Relationships among a putative predictor X, outcome Y and confounder T (see §5.2), illustrated with respect to the ADHD-200 data.
Table 1.
To examine element (ii) of confounding, an ordinary logistic regression was fitted with the four scalar predictors and with ADHD diagnosis as response; the resulting estimates are shown with 95% confidence intervals. For element (iii), we display the correlations of each predictor with the logit probabilities estimated by fitting model (18).
| | (ii) Log odds ratio | (ii) p-value | (iii) Correlation | (iii) p-value |
|---|---|---|---|---|
| Intercept | 3.90 (1.11, 6.78) | 0.007 | | |
| Sex (M − F) | 1.26 (0.65, 1.91) | 0.00008 | 0.14 (0.03, 0.24) | 0.011 |
| Age | −0.20 (−0.32, −0.09) | 0.0005 | −0.35 (−0.44, −0.25) | 6 × 10⁻¹¹ |
| IQ | −0.03 (−0.05, −0.01) | 0.003 | −0.09 (−0.19, 0.02) | 0.10 |
| Mean FD | −2.51 (−8.80, 3.56) | 0.42 | −0.04 (−0.15, 0.07) | 0.47 |
Fig 10.
Predicted probabilities of ADHD diagnosis, according to the images-only model (18); an ordinary logistic regression with the four scalar covariates; and model (19), which includes both. Also shown are the R2 values, as defined in Supplementary Appendix A.1, for the three models.
Next we consider point (iii), i.e., the correlations of each scalar covariate with $\int_{\mathcal{S}} x_i(s)\hat{\beta}(s)\,ds$, where $\hat{\beta}$ is the coefficient image estimate from the fALFF-only model (18), or equivalently with the predicted logit probability of ADHD from that model. The results, shown at right in Table 1, point to age and sex as the principal confounders. (Here sex was treated as a binary variable, with 1 for male and 0 for female; a t-test and a Mann-Whitney test yielded similar results.) “Local” examination in the sense of §5.2 reveals that the fALFF x(s) tends to be higher in males and in younger individuals for many voxels s; and such regions overlap considerably with those in which $\hat{\beta}(s) > 0$. In other words, the ostensible association between fALFF and ADHD likely reflects the dependence of fALFF on age and sex, which in turn are related to ADHD in our sample.
Further inspection revealed that, of the 67 individuals with age above 14.0, only 8 had ADHD, with maximum age 17.43—whereas the controls had ages as high as 20.45. This led us to suspect that these older individuals might be driving the confounding with age that results in a spurious effect of fALFF on diagnosis. To investigate this possibility, we repeated the analysis using only the 266 individuals of age 14.0 or lower. Figure 8 shows that in this subsample, the fALFF effect is no longer significant, even without adjusting for the scalar covariates. Moreover, given how far the test statistic is from the left tail of the permutation distribution, it seems unlikely that the loss of significance is due merely to the lower sample size.
In general, absent careful matching at the design stage, it would be advisable to match the two diagnostic groups optimally on a complete set of clearly relevant variables, via algorithms such as those described in Rosenbaum (2010). Our aim here, however, was to show how a straightforward new notion of confounding for functional predictors can be used to identify a principal scalar confounder, whose impact can be removed by the crude device of simply truncating the age range.
7. Discussion
Our analysis in §6 included only one imaging modality and only a subset of the individuals from the ADHD-200 Global Competition database. At any rate, our essentially negative result is consistent with the finding (Brown et al., 2012) that diagnostic accuracy was optimized by basing prediction on scalar predictors, while ignoring the image data. In a blog comment on that outcome, cited both by ADHD-200 Consortium (2012) and by Brown et al. (2012), the neuroscientist Russ Poldrack suggested that “any successful imaging-based decoding could have been relying upon correlates of those variables rather than truly decoding a correlate of the disease.” Stated a bit differently, the competing teams’ successes in using the image data to predict diagnosis may have been brought about by confounding. But there appear to have been few attempts, if any, to study systematically how confounding may give rise to spurious relationships between quantitative image data and clinical variables. Similarly, analyses of the ADHD-200 data, and related work on brain “decoding,” have devoted little attention to formally testing the contribution of imaging data to prediction of scalar responses.
As we have shown, these two interrelated issues—testing the effect of image predictors, and investigating possible confounders—can be handled straightforwardly within our scalar-on-image regression framework. The permutation test procedure of §5.1 found a statistically significant relationship between fALFF images and ADHD diagnosis, but this disappeared when four scalar covariates were adjusted for. Further examination, in light of our extension of the notion of confounding to functional/image predictors in §5.2, pointed to age and sex as the key confounders.
The ADHD-200 project is one of a number of recent initiatives to make large samples of neuroimaging data publicly available (Milham, 2012). These initiatives have been a boon for statistical methodology development, but it must be borne in mind that even as neuroimaging sample sizes increase rapidly, they remain much smaller than the data dimension. No approach to scalar-on-image regression can completely escape the ensuing non-identifiability of the coefficient image. We can, however, (i) put forth assumptions, likely to hold approximately in practice, that reduce the effective dimension of the coefficient image; and (ii) employ multiple methods in the hope that these will converge upon similar coefficient image estimates, at least when the signal is sufficiently strong.
With these considerations in mind, we have introduced three methods for scalar-on-image regression, each relying on a different set of assumptions to achieve dimension reduction in the wavelet domain. Implementations of these three methods, for 2D and 3D image data, are provided in the refund.wave package (Huo, Reiss and Zhao, 2014) for R (R Development Core Team, 2012), available at http://cran.r-project.org/web/packages/refund.wave. This new package, a spinoff of the refund package (Crainiceanu et al., 2014), relies on the wavethresh package (Nason, 2013) for wavelet decomposition and reconstruction.
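For instance, a call along the following lines fits the wavelet-domain elastic net to an array of 2D images. This is a hypothetical sketch: the function name wnet comes from the package, but the argument and component names shown are assumptions and should be checked against the package documentation:

```r
library(refund.wave)

# xfuncs: n x 64 x 64 array of image predictors; y: length-n binary response
# (argument names below are assumptions; see ?wnet after installing the package)
fit <- wnet(y, xfuncs = xfuncs, alpha = 0.5, family = "binomial")
image(fit$fhat)  # estimated coefficient image ("fhat" is an assumed name)
```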
As discussed in §2.2, the three methods described here are merely three instances of a meta-algorithm for scalar-on-image regression. The refund.wave package allows for straightforward incorporation of alternative penalties, and other extensions may allow for more-refined wavelet-domain algorithms, which may improve the stability and reproducibility of the coefficient image estimates (Rasmussen et al., 2012). For instance, in wavelet-based nonparametric regression, thresholding is often performed in a level-specific manner. Analogously, it might be appropriate to modify criterion (11) so as to differentially penalize coefficients at different levels. One might also employ resampling techniques (cf. Meinshausen and Bühlmann, 2010) to select those wavelet basis elements that are consistently predictive of the outcome. Finally, wavelets whose domain is anatomically customized, such as the wavelets defined on the cortex by Özkaya and Van De Ville (2011), offer a promising new way to confine the analysis to relevant portions of the brain.
Supplementary Material
Acknowledgments
The authors are grateful to the Editor, Karen Kafadar, and to the Associate Editor and referees, whose feedback led to major improvements in the paper; to Adam Ciarleglio, for contributions to software implementation; to Xavier Castellanos, Samuele Cortese, Cameron Craddock, Brett Lullo, Eva Petkova, Fabian Scheipl and Victor Solo for helpful discussions about our methodology and its application; to Jeff Goldsmith, Lei Huang and Ciprian Crainiceanu for sharing their insights as well as a preprint of Goldsmith, Huang and Crainiceanu (2013); and to the ADHD-200 Consortium (http://fcon_1000.projects.nitrc.org/indi/adhd200/) and the Neuro Bureau (http://neurobureau.projects.nitrc.org/) for making the fMRI data set publicly available. In addition to the funding sources listed on the first page, the first author thanks the National Science Foundation for its support of the Statistical and Applied Mathematical Sciences Institute, whose Summer 2013 Program on Neuroimaging Data Analysis provided a valuable opportunity to present part of this research. This work utilized computing resources at the High Performance Computing Facility of the Center for Health Informatics and Bioinformatics at New York University Langone Medical Center.
Supported in part by National Science Foundation grant DMS-0907017 and National Institutes of Health grants 5R01EB009744-03 and 1R01MH095836-01A1.
References
- ADHD-200 Consortium The ADHD-200 Consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience. 2012;6 doi: 10.3389/fnsys.2012.00062. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Berlinet A, Biau G, Rouvière L. Functional supervised classification with wavelets. Annales de l'ISUP. 2008;52:61–80. [Google Scholar]
- Brown PJ, Fearn T, Vannucci M. Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association. 2001;96:398–408. [Google Scholar]
- Brown MRG, Sidhu GS, Greiner R, Asgarian N, Bastani M, Silverstone PH, Greenshaw AJ, Dursun SM. ADHD-200 Global Competition: Diagnosing ADHD using personal characteristic data can outperform resting state fMRI measurements. Frontiers in Systems Neuroscience. 2012;6 doi: 10.3389/fnsys.2012.00069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Caffo B, Eloyan A, Han F, Liu H, Muschelli J, Nebel MB, Zhao T, Crainiceanu C. SMART thoughts on the ADHD 200 Competition. 2012 Available at http://www.smart-stats.org/?q=content/repost-our-document-adhd-competition.
- Cai TT, Hall P. Prediction in functional linear regression. Annals of Statistics. 2006;34:2159–2179. [Google Scholar]
- Cardot H, Ferraty F, Sarda P. Functional linear model. Statistics & Probability Letters. 1999;45:11–22. [Google Scholar]
- Cardot H, Ferraty F, Sarda P. Spline estimators for the functional linear model. Statistica Sinica. 2003;13:571–592. [Google Scholar]
- Chang C, Chen Y, Ogden RT. Functional data classification: a wavelet approach. Computational Statistics. 2014 doi: 10.1007/s00180-014-0503-4. to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chi EC, Scott DW. Robust parametric classification and variable selection by a minimum distance criterion. Journal of Computational and Graphical Statistics. 2013 in press. [Google Scholar]
- Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B. 2010;72:3–25. doi: 10.1111/j.1467-9868.2009.00723.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cook RD. Fisher lecture: Dimension reduction in regression. Statistical Science. 2007;22:1–26. [Google Scholar]
- Craddock RC, Holtzheimer PE, III, Hu XP, Mayberg HS. Disease state prediction from resting state functional connectivity. Magnetic Resonance in Medicine. 2009;62:1619–1628. doi: 10.1002/mrm.22159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crainiceanu CM, Reiss PT, Goldsmith J, Huang L, Huo L, Scheipl F. refund: Regression with functional data R package version 0.1-10. 2014 [Google Scholar]
- Daubechies I. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics. 1988;41:909–996. [Google Scholar]
- Delaigle A, Hall P. Achieving near perfect classification for functional data. Journal of the Royal Statistical Society: Series B. 2012a;74:267–286. [Google Scholar]
- Delaigle A, Hall P. Methodology and theory for partial least squares applied to functional data. Annals of Statistics. 2012b;40:322–352. [Google Scholar]
- Ding B, Gentleman R. Classification using generalized partial least squares. Journal of Computational and Graphical Statistics. 2005;14:280–298. [Google Scholar]
- Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455. [Google Scholar]
- Eloyan A, Muschelli J, Nebel MB, Liu H, Han F, Zhao T, Barber A, Joel S, Pekar JJ, Mostofsky S, Caffo B. Automated diagnoses of attention deficit hyperactive disorder using magnetic resonance imaging. Frontiers in Systems Neuroscience. 2012;6 doi: 10.3389/fnsys.2012.00061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]
- Fay MP, Kim H-J, Hachey M. On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational and Graphical Statistics. 2007;16:946–967. doi: 10.1198/106186007X257025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
- Goldsmith J, Huang L, Crainiceanu CM. Smooth scalar-on-image regression via spatial Bayesian variable selection. Journal of Computational and Graphical Statistics. 2013 doi: 10.1080/10618600.2012.743437. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goldsmith J, Bobb J, Crainiceanu CM, Caffo B, Reich D. Penalized functional regression. Journal of Computational and Graphical Statistics. 2011;20:830–851. doi: 10.1198/jcgs.2010.10007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Golland P, Fischl B. Permutation tests for classification: towards statistical significance in image-based studies. In: Taylor CJ, Noble JA, editors. Information Processing in Medical Imaging: Proceedings of the 18th International Conference. Springer; Berlin and Heidelberg: 2003. pp. 330–341. [DOI] [PubMed] [Google Scholar]
- Grosenick L, Klingenberg B, Katovich K, Knutson B, Taylor JE. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage. 2013;72:304–321. doi: 10.1016/j.neuroimage.2012.12.062. [DOI] [PubMed] [Google Scholar]
- Guillas S, Lai MJ. Bivariate splines for spatial functional regression models. Journal of Nonparametric Statistics. 2010;22:477–497. [Google Scholar]
- Hall P, Horowitz JL. Methodology and convergence rates for functional linear regression. Annals of Statistics. 2007;35:70–91. [Google Scholar]
- Honorio J, Tomasi D, Goldstein R, Leung H, Samaras D. Can a single brain region predict a disorder? IEEE Transactions on Medical Imaging. 2012;31:2062–2072. doi: 10.1109/TMI.2012.2206047. [DOI] [PubMed] [Google Scholar]
- Huang L, Goldsmith J, Reiss PT, Reich DS, Crainiceanu CM. Bayesian scalar-on-image regression with application to association between intracranial DTI and cognitive outcomes. NeuroImage. 2013;83:210–223. doi: 10.1016/j.neuroimage.2013.06.020. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huo L, Reiss P, Zhao Y. refund.wave: Wavelet-Domain Regression with Functional Data R package version 0.1. 2014 [Google Scholar]
- Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. 2009;104:682–693. doi: 10.1198/jasa.2009.0121.
- Kapur S, Phillips A, Insel T. Why has it taken so long for biological psychiatry to develop clinical tests and what to do about it? Molecular Psychiatry. 2012;17:1174–1179. doi: 10.1038/mp.2012.105.
- Mallat SG. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1989;11:674–693.
- Mallat S. A Wavelet Tour of Signal Processing: The Sparse Way. 3rd ed. Academic Press; Burlington, MA: 2009.
- Malloy EJ, Morris JS, Adar SD, Suh H, Gold DR, Coull BA. Wavelet-based functional linear mixed models: an application to measurement error-corrected distributed lag models. Biostatistics. 2010;11:432–452. doi: 10.1093/biostatistics/kxq003.
- Marx BD. Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics. 1996;38:374–381.
- Marx BD, Eilers PHC. Generalized linear regression on sampled signals and curves: a P-spline approach. Technometrics. 1999;41:1–13.
- Marx BD, Eilers PHC. Multidimensional penalized signal regression. Technometrics. 2005;47:13–22.
- Massy WF. Principal components regression in exploratory statistical research. Journal of the American Statistical Association. 1965;60:234–256.
- Meinshausen N, Bühlmann P. Stability selection (with Discussion). Journal of the Royal Statistical Society, Series B. 2010;72:417–473.
- Meinshausen N, Meier L, Bühlmann P. P-values for high-dimensional regression. Journal of the American Statistical Association. 2009;104:1671–1681.
- Milham MP. Open neuroscience solutions for the connectome-wide association era. Neuron. 2012;73:214–218. doi: 10.1016/j.neuron.2011.11.004.
- Morris JS, Baladandayuthapani V, Herrick RC, Sanna P, Gutstein HB. Automated analysis of quantitative image data using isomorphic functional mixed models, with application to proteomics data. Annals of Applied Statistics. 2011;5:894–923. doi: 10.1214/10-aoas407.
- Müller HG, Stadtmüller U. Generalized functional linear models. Annals of Statistics. 2005;33:774–805.
- Nadler B, Coifman RR. The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration. Journal of Chemometrics. 2005;19:107–118.
- Nason GP. Wavelet Methods in Statistics with R. Springer; New York: 2008.
- Nason G. wavethresh: Wavelets Statistics and Transforms. R package version 4.6.2; 2013.
- Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50. doi: 10.1093/bioinformatics/18.1.39.
- Nichols TE. Multiple testing corrections, nonparametric methods, and random field theory. NeuroImage. 2012;62:811–815. doi: 10.1016/j.neuroimage.2012.04.014.
- Ogden RT. Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser; Boston: 1997.
- Ojala M, Garriga GC. Permutation tests for studying classifier performance. Journal of Machine Learning Research. 2010;11:1833–1863.
- Özkaya SG, Van De Ville D. Anatomically adapted wavelets for integrated statistical analysis of fMRI data. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE; New York: 2011. pp. 469–472.
- Potter DM. A permutation test for inference in logistic regression with small- and moderate-sized data sets. Statistics in Medicine. 2005;24:693–708. doi: 10.1002/sim.1931.
- Preda C, Saporta G. PLS regression on a stochastic process. Computational Statistics and Data Analysis. 2005;48:149–158.
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2012. ISBN 3-900051-07-0.
- Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer; New York: 2005.
- Rasmussen PM, Hansen LK, Madsen KH, Churchill NW, Strother SC. Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recognition. 2012;45:2085–2100.
- Reiss PT. PhD thesis. Department of Biostatistics, Columbia University; 2006. Regression with signals and images as predictors.
- Reiss PT, Ogden RT. Functional principal component regression and functional partial least squares. Journal of the American Statistical Association. 2007;102:984–996.
- Reiss PT, Ogden RT. Functional generalized linear models with images as predictors. Biometrics. 2010;66:61–69. doi: 10.1111/j.1541-0420.2009.01233.x.
- Reiss PT, Huo L, Zhao Y, Kelly C, Ogden RT. Supplement to “Wavelet-domain regression and predictive inference in psychiatric neuroimaging”. 2015. doi: 10.1214/15-AOAS829.
- Rosenbaum PR. Design of Observational Studies. Springer; New York: 2010.
- Rothman KJ. Epidemiology: An Introduction. 2nd ed. Oxford University Press; New York: 2012.
- Ruttimann UE, Unser M, Rawlings RR, Rio D, Ramsey NF, Mattay VS, Hommer DW, Frank JA, Weinberger DR. Statistical analysis of functional MRI data in the wavelet domain. IEEE Transactions on Medical Imaging. 1998;17:142–154. doi: 10.1109/42.700727.
- Sabuncu M, Van Leemput K, the Alzheimer's Disease Neuroimaging Initiative. The relevance voxel machine (RVoxM): a self-tuning Bayesian model for informative image-based prediction. IEEE Transactions on Medical Imaging. 2012;31:2290–2306. doi: 10.1109/TMI.2012.2216543.
- Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis. 2008;99:1015–1034.
- Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society, Series B. 1990;52:237–269.
- Sun D, Van Erp TGM, Thompson PM, Bearden CE, Daley M, Kushan L, Hardt ME, Nuechterlein KH, Toga AW, Cannon TD. Elucidating a magnetic resonance imaging-based neuroanatomic biomarker for psychosis: classification analysis using probabilistic brain atlas and machine learning algorithms. Biological Psychiatry. 2009;66:1055–1060. doi: 10.1016/j.biopsych.2009.07.019.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Van De Ville D, Seghier ML, Lazeyras F, Blu T, Unser M. WSPM: Wavelet-based statistical parametric mapping. NeuroImage. 2007;37:1205–1217. doi: 10.1016/j.neuroimage.2007.06.011.
- Vidakovic B. Statistical Modeling by Wavelets. Wiley; New York: 1999.
- Wand MP, Ormerod JT. Penalized wavelets: Embedding wavelets into semiparametric regression. Electronic Journal of Statistics. 2011;5:1654–1717.
- Wang X, Ray S, Mallick BK. Bayesian curve classification using wavelets. Journal of the American Statistical Association. 2007;102:962–973.
- Wang X, Nan B, Zhu J, Koeppe R, the Alzheimer's Disease Neuroimaging Initiative. Regularized 3D functional regression for brain image data via Haar wavelets. Annals of Applied Statistics. 2014. doi: 10.1214/14-AOAS736. To appear.
- Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. doi: 10.1093/biostatistics/kxp008.
- Wold H. Nonlinear estimation by iterative least squares procedures. In: David FN, editor. Research Papers in Statistics: Festschrift for J. Neyman. Wiley; New York: 1966. pp. 414–444.
- Yang H, Wu Q-Z, Guo L-T, Li Q-Q, Long X-Y, Huang X-Q, Chan RCK, Gong Q-Y. Abnormal spontaneous brain activity in medication-naïve ADHD children: A resting state fMRI study. Neuroscience Letters. 2011;502:89–93. doi: 10.1016/j.neulet.2011.07.028.
- Zhao Y, Chen H, Ogden RT. Wavelet-based weighted LASSO and screening approaches in functional linear regression. Journal of Computational and Graphical Statistics. 2014. To appear.
- Zhao Y, Ogden RT, Reiss PT. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics. 2012;21:600–617. doi: 10.1080/10618600.2012.679241.
- Zhou H, Li L, Zhu H. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association. 2013;108:540–552. doi: 10.1080/01621459.2013.776499.
- Zhu H, Brown PJ, Morris JS. Robust classification of functional and quantitative image data using functional mixed models. Biometrics. 2012;68:1260–1268. doi: 10.1111/j.1541-0420.2012.01765.x.
- Zhu H, Vannucci M, Cox DD. A Bayesian hierarchical model for classification with selection of functional predictors. Biometrics. 2010;66:463–473. doi: 10.1111/j.1541-0420.2009.01283.x.
- Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004;5:427–443. doi: 10.1093/biostatistics/5.3.427.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics. 2006;15:265–286.
- Zou QH, Zhu CZ, Yang Y, Zuo XN, Long XY, Cao QJ, Wang YF, Zang YF. An improved approach to detection of amplitude of low-frequency fluctuation (ALFF) for resting-state fMRI: fractional ALFF. Journal of Neuroscience Methods. 2008;172:137–141. doi: 10.1016/j.jneumeth.2008.04.012.