SUMMARY
This work concerns spatial variable selection for scalar-on-image regression. We propose a new class of Bayesian nonparametric models and develop an efficient posterior computation algorithm. The proposed soft-thresholded Gaussian process provides large prior support over the class of piecewise-smooth, sparse, and continuous spatially-varying regression coefficient functions. In addition, under mild regularity conditions the soft-thresholded Gaussian process prior leads to posterior consistency for parameter estimation and variable selection in scalar-on-image regression, even when the number of predictors is larger than the sample size. The proposed method is compared to alternatives via simulation and applied to an electroencephalography study of alcoholism.
Keywords: Electroencephalography, Gaussian processes, Posterior consistency, Spatial variable selection
1. INTRODUCTION
Scalar-on-image regression has attracted considerable attention recently in both the frequentist and Bayesian literature. This problem is challenging for several reasons: the predictor is a two-dimensional or three-dimensional image where the number of pixels or voxels is often larger than the sample size; the observed predictors may be contaminated with noise; the true signal may exhibit complex spatial structure; and most components of the predictor may have no effect on the response, and when they have an effect it may vary smoothly.
Regularized regression techniques are often needed when the number of predictors is much larger than the sample size. The lasso (Tibshirani, 1996) is a popular method for variable selection that employs a penalty based on the sum of the absolute values of the regression coefficients. However, most penalized approaches do not accommodate predictors with ordered components, such as an image predictor. One exception is the fused lasso (Tibshirani et al., 2005), which generalizes the lasso by penalizing both the coefficients and their successive differences, and thus ensures both sparsity and smoothness of the estimated effect. To incorporate the spatial structure of the predictors, Reiss & Ogden (2010) extended functional principal component regression to handle image predictors by approximating the coefficient function using B-splines; however, this method is not sensitive to sparsity or sharp edges. Wang et al. (2016) proposed a penalty based on the total variation of the regression function that yields piecewise smooth regression coefficients, but the approach is focused primarily on prediction and does not quantify uncertainty in a way that permits statistical inference. Reiss et al. (2015) considered a wavelet expansion for the regression coefficients and conducted inference via hypothesis testing. Their approach requires that the image predictor have dimensions of equal size and, moreover, that the common size be a power of two; these strong assumptions are violated for our motivating application. Therefore, none of these methods is appropriate for detecting a piecewise smooth, sparse, and continuous signal with a scalar outcome and an image predictor.
Scalar-on-image regression has also been approached from a Bayesian viewpoint. Goldsmith et al. (2014) proposed to model the regression coefficients as the product of two latent spatial processes to capture both sparsity and spatial smoothness of the important regression coefficients. They used an Ising prior for the binary indicator that a voxel is predictive of the response, and a conditionally autoregressive prior to smooth the nonzero regression coefficients. The use of an Ising prior for binary indicators was first discussed by Smith & Fahrmeir (2007) in the context of high-dimensional spatial predictors and was also recently employed by Li et al. (2015), who used a Dirichlet process prior for the nonzero coefficients. In both Li et al. (2015) and Goldsmith et al. (2014), sparsity and smoothness are controlled separately by two independent spatial processes, so the transitions from zero areas to neighboring nonzero areas may be abrupt, and computation is challenging because the Ising probability mass function does not have a simple closed form.
We propose an alternative approach for spatial variable selection in scalar-on-image regression by modeling the regression coefficients through soft-thresholding of a latent Gaussian process. The soft-thresholding function is well known for its relation to the lasso estimate when the design matrix is orthonormal (Tibshirani, 1996), and here we use it to specify a spatial prior with mass at zero. The idea is inspired by Boehm Vock et al. (2014), who considered Gaussian processes as a regularization technique for spatial variable selection. However, their approach does not assign prior probability mass at zero for the regression coefficients and is not designed for scalar-on-image regression. Unlike other Bayesian spatial models (Goldsmith et al., 2014; Li et al., 2015), the soft-thresholded Gaussian process ensures a gradual transition between the zero and nonzero effects at neighboring locations and provides large support over the class of spatially-varying regression coefficient functions that are piecewise smooth, sparse, and continuous. The use of the soft-thresholded Gaussian process avoids the computational problems posed by the Ising prior, and it can be scaled to large datasets using a low-rank spatial model for the latent process. We show that the soft-thresholded Gaussian process prior leads to posterior consistency for both parameter estimation and variable selection under mild regularity conditions, even when the number of predictors is larger than the sample size.
The proposed method is introduced for a single image predictor and Gaussian responses, mainly for simplicity. Extensions to accommodate multiple predictors and non-Gaussian responses are relatively straightforward. However, the theoretical investigation of the procedure for non-Gaussian responses is not trivial, and so we establish the theory for binary responses with the probit link function. The methods are applied to data from an electroencephalography, EEG, study of alcoholism; see http://kdd.ics.uci.edu/datasets/eeg/eeg.data.html. The objective of our analysis is to estimate the relationship between alcoholism and brain activity. EEG signals are recorded for both alcoholics and controls at 64 electrode channels on the subjects' scalps at 256 time points, leading to a high-dimensional predictor. The data have been previously described in Li et al. (2010) and Zhou & Li (2014). Previous literature has ignored the spatial structure of the electrodes shown in Fig. 1, which was recovered from the standard electrode position nomenclature described by Fig. 1 of https://www.acns.org/pdf/guidelines/Guideline-5.pdf. Our spatial analysis exploits the spatial configuration of the electrodes and reveals regions of the brain whose activity is predictive of alcoholism.
Fig. 1.
Standard electrode position nomenclature for the 10–10 system.
2. MODEL
2.1. Scalar-on-image regression
Let ℝ^m denote the m-dimensional vector space of real values. Suppose there are n subjects in the dataset and the data for subject i consist of a scalar response variable Yi, a set of spatially-distributed image predictors, denoted Xi = {Xi(s1), . . . , Xi(spn)}T, and other scalar covariates collected by Wi = (Wi1, . . . , Wiq)T. Assume that the Wi are fixed design covariates. Here Xi(sj) denotes the image intensity value measured at location sj ∈ Ɓ for j = 1, . . . , pn. We assume that the set of locations S = {s1, . . . , spn} is a fixed subset of a compact closed region Ɓ ⊂ ℝ^d. Let N(µ, Σ) denote a normal distribution with mean µ and variance-covariance matrix Σ, or variance σ2 in the one-dimensional case. We consider the following scalar-on-image regression model:
$$Y_i = W_i^{\mathrm{T}}\gamma + p_n^{-1/2}\sum_{j=1}^{p_n} X_i(s_j)\,\beta(s_j) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2), \tag{1}$$
where γ quantifies the effect of Wi and β(·) is a spatially-varying coefficient function defined on Ɓ. In practice, the normalizing scalar pn^{-1/2} can be absorbed into the image predictors; its role is to rescale the total effect of the massive number of image predictors so that it is bounded away from infinity with large probability when pn is very large. Scientifically, in brain imaging studies, the image predictors take values that measure the body tissue contrast or the neural activity at each spatial location, and the number of image predictors, pn, is determined by the image resolution. Thus, the total effect of the image predictors reflects the total amount of intensity in the brain signals, which should not increase to infinity as the image resolution increases. In model (1), the response is taken to be Gaussian and only one type of image predictor is included, although extensions of the modeling framework to non-Gaussian responses and multi-modality image predictor regression are straightforward.
2.2. Soft-thresholded Gaussian processes
To capture the characteristics of the image predictors and their effects on the response variable, the prior for β(·) should be sparse and spatial. That is, we assume that β(sj) = 0 at many locations, that the sites with nonzero coefficients cluster spatially, and that the coefficients vary smoothly within clusters of nonzero coefficients. To encode these desired properties into the prior, we represent β(s) = gλ{α(s)} as a transformation of a Gaussian process, where gλ is the transformation function dependent on a parameter λ and α(·) follows a Gaussian process prior. In this trans-kriging (Cressie, 1993) or Gaussian copula (Nelsen, 1999) model, the function gλ determines the marginal distribution of β(s), while the covariance of the latent α(·) determines the dependence structure.
Spatial dependence is determined by the prior for α(·). We assume that α(·) is a Gaussian process with mean zero and stationary covariance function cov{α(s), α(s′)} = σa2 k(s, s′) for some covariance kernel k. Although other transformations are possible (Boehm Vock et al., 2014), we select gλ to be the soft-thresholding function, which maps values near zero to exact zero and thus gives a sparse prior. Let
$$g_\lambda(x) = \mathrm{sgn}(x)\max(|x| - \lambda,\, 0), \tag{2}$$
where sgn(x) = 1 if x > 0, sgn(x) = −1 if x < 0, and sgn(0) = 0. The thresholding parameter λ determines the degree of sparsity. This soft-thresholded Gaussian process prior is denoted β ~ STGP(λ, k).
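The transformation (2) is simple to compute; the following R sketch applies it to a draw from a latent Gaussian process on a one-dimensional grid, so that values of α(s) in [−λ, λ] map to exact zeros. The grid and the exponential covariance are illustration choices, not specifications from the paper.

```r
# Soft-thresholding function g_lambda from equation (2)
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# Illustration: draw a latent Gaussian process on a 1-d grid (exponential
# covariance chosen only for this example) and threshold it
set.seed(1)
s <- seq(0, 1, length.out = 200)
K <- exp(-as.matrix(dist(s)) / 0.1)                       # covariance k(s, s')
alpha <- drop(t(chol(K + 1e-8 * diag(200))) %*% rnorm(200))  # alpha ~ GP(0, k)
beta <- soft_threshold(alpha, lambda = 0.5)               # beta = g_lambda(alpha)
mean(beta == 0)                                           # fraction of exact zeros
```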
3. THEORETICAL PROPERTIES
3.1. Notation and definitions
We first introduce additional notation for the theoretical development and the formal definitions of the class of spatially-varying coefficient functions under consideration. We assume that all the random variables and stochastic processes in this article are defined on the same probability space. Let ℕ^d represent the space of d-dimensional vectors of non-negative integers. For any vector v = (v1, . . . , vd)T, let ‖v‖p = (Σ_{i=1}^d |vi|^p)^{1/p} be the Lp norm for any p ≥ 1 and ‖v‖∞ = max_{1≤i≤d} |vi| be the supremum norm. For any x ∈ ℝ, let ⌈x⌉ be the smallest integer not smaller than x and ⌊x⌋ the largest integer not larger than x. Define the event indicator I(E) = 1 if event E occurs and I(E) = 0 otherwise. For any a = (a1, . . . , ad)T ∈ ℕ^d, define |a| = Σ_{i=1}^d ai and the mixed partial derivative operator D^a = ∂^{|a|}/(∂s1^{a1} · · · ∂sd^{ad}). For any real function f on Ɓ, let ‖f‖p = {∫_Ɓ |f(s)|^p ds}^{1/p} denote the Lp norm for any p ≥ 1 and ‖f‖∞ = sup_{s∈Ɓ} |f(s)| denote the supremum norm.
Definition 1. Denote by C^m(Ɓ) the set of differentiable functions of order m defined on Ɓ, such that f(s) has partial derivatives D^a f(s) for all a ∈ ℕ^d with |a| ≤ m, where given any point s0 of Ɓ and any ε > 0, there is a δ > 0 such that if s and t are any two points of Ɓ with ‖s − s0‖2 < δ and ‖t − s0‖2 < δ, then |D^a f(s) − D^a f(t)| < ε. If a = (0, . . . , 0)T, then D^a f = f.
Denote by Ā and ∂A the closure and the boundary, respectively, of any set A ⊆ Ɓ.
Definition 2. Define Θ to be the collection of all spatially-varying coefficient functions β that satisfy the following conditions. Assume there exist two disjoint non-empty open sets Ɓ1, Ɓ2 ⊂ Ɓ such that

(2.1) β(s) > 0 for all s ∈ Ɓ1, β(s) < 0 for all s ∈ Ɓ2, and β(s) = 0 for all s ∈ Ɓ \ (Ɓ1 ∪ Ɓ2);

(2.2) β is continuous on Ɓ;

(2.3) the restriction of β to the closure of Ɓ1 ∪ Ɓ2 belongs to C^m for some m ≥ 1.
Simply put, Θ is the collection of all piecewise smooth, sparse and continuous functions defined on Ɓ.
3.2. Large support
One desired property of a Bayesian nonparametric model is prior support over a large class of functions. In this section, we show that the soft-thresholded Gaussian process has large support over Θ. We begin with two appealing properties of the soft-thresholding function. All technical conditions are listed in the Appendix as Conditions A1–A5.
Lemma 1. The soft-thresholding function gλ(x) is Lipschitz continuous for any λ > 0, that is, |gλ(x) − gλ(y)| ≤ |x − y| for all x, y ∈ ℝ.
Lemma 2. For any function β0 ∈ Θ and any threshold parameter λ0 > 0, there exists a smooth function α0 such that gλ0{α0(s)} = β0(s) for all s ∈ Ɓ.
Lemma 1 is proved directly by verifying the definition. The proof of Lemma 2 is not trivial; it requires a detailed construction of the smooth function α0. See the Appendix for details.
Theorem 1. For any function β0 ∈ Θ and any ε > 0, the soft-thresholded Gaussian process prior β ~ STGP(λ, k) satisfies pr(‖β − β0‖∞ < ε) > 0, for any λ > 0 and kernel k that satisfy Condition A5.
Proof. By Lemma 2, for the threshold parameter λ > 0 there exists a smooth function α0 such that gλ(α0) = β0. Since β ~ STGP(λ, k), we have β = gλ(α) with α a mean-zero Gaussian process with covariance kernel k. By Lemma 1, ‖β − β0‖∞ = ‖gλ(α) − gλ(α0)‖∞ ≤ ‖α − α0‖∞. By Condition A5 and Theorem 4.5 of Tokdar & Ghosh (2007), α0 is in the reproducing kernel Hilbert space of k, and then by Theorem 4 of Ghosal & Roy (2006), we have pr(‖α − α0‖∞ < ε) > 0, so that pr(‖β − β0‖∞ < ε) ≥ pr(‖α − α0‖∞ < ε) > 0, which completes the proof. □
Theorem 1 implies that there is always positive prior probability that the soft-thresholded Gaussian process concentrates in an arbitrarily small neighborhood of any spatially-varying coefficient function with the piecewise smoothness, sparsity and continuity properties. According to Lemma 2, for any positive λ1 ≠ λ2 there exist smooth functions α1 and α2 such that gλ1(α1) = gλ2(α2) = β0. This implies that the thresholding parameter and the latent smooth curve are not jointly identifiable, but β0 itself is identifiable, as we establish via the posterior consistency of parameter estimation.
3.3. Posterior consistency
For i = 1, . . . , n, given the image predictor Xi on a set of spatial locations S and other covariates Wi, suppose the response Yi is generated from the scalar-on-image regression model (1) with true parameters γ0 and σ02 and true coefficient function β0 ∈ Θ. We assume γ0 and σ02 are known for theoretical convenience; in practice it is straightforward to estimate them from the data. We assign a soft-thresholded Gaussian process prior to the spatially-varying coefficient function, that is, β ~ STGP(λ, k) for any given λ > 0 and covariance kernel k. In light of the large support property in Theorem 1, the following lemma shows the positivity of prior neighborhoods:
Lemma 3. Denote by pi(·; β) the density function of Yi in model (1) and suppose Condition A4 holds for β0. Define Λi(β0, β) = log{pi(Yi; β0)/pi(Yi; β)}, Ki(β0, β) = Eβ0{Λi(β0, β)} and Vi(β0, β) = varβ0{Λi(β0, β)}. There exists a set B with pr(β ∈ B) > 0 such that, for any ε > 0, Σ_{i=1}^∞ Vi(β0, β)/i2 < ∞ for all β ∈ B, and pr[β ∈ B, Ki(β0, β) < ε for all i] > 0.
We construct sieves for the spatially-varying coefficient functions in Θ as
$$\Theta_n = \big\{\beta = g_\lambda(\alpha) : \|\alpha\|_\infty \le n^{\rho},\ \|D^a\alpha\|_\infty \le n^{\rho}\ \text{for all } |a| = 1\big\}, \tag{3}$$
where ρ is defined in Condition A1. By Lemmas A1–A5 in the Appendix, we can bound the prior probability of the sieve complement and construct uniformly consistent tests, as stated in the following lemmas:
Lemma 4. If β ~ STGP(λ0, k) with λ0 > 0 and the kernel function k satisfies Condition A5, then there exist constants K and b such that pr(β ∉ Θn) ≤ K exp(−bn) for all n ≥ 1.
Lemma 5. For any ε > 0 and any β1 ∈ Θn with ‖β1 − β0‖1 > ε, there exist N, C0, C1 and C2 such that for all n > N a test function Φn can be constructed with Eβ0(Φn) ≤ C1 exp(−C2n^{1−2ρ}) and sup{Eβ(1 − Φn) : β ∈ Θn, ‖β − β1‖1 ≤ C0ε} ≤ C1 exp(−C2n^{1−2ρ}), where ρ is defined in Condition A1.
Proofs of Lemmas 3–5 are provided in the Supplementary Material. These lemmas verify the three conditions required for posterior consistency in scalar-on-image regression by Theorem A1 of Choudhuri et al. (2004), and together they yield the following theorem:
Theorem 2. Denote the data by Dn = {(Yi, Xi, Wi) : i = 1, . . . , n}. If Conditions A1–A5 hold, then for any ε > 0,
$$\mathrm{pr}\{\|\beta - \beta_0\|_1 > \varepsilon \mid D_n\} \to 0 \quad \text{in probability as } n \to \infty. \tag{4}$$
Theorem 2 implies that the soft-thresholded Gaussian process prior ensures that the posterior distribution of the spatially-varying coefficient function concentrates in an arbitrarily small neighborhood of the true value when the numbers of subjects and spatial locations are both sufficiently large. Given that the true function of interest is piecewise smooth, sparse and continuous, the soft-thresholded Gaussian process prior further ensures that the posterior probability that the sign of the spatially-varying coefficient function is correct converges to one as the sample size goes to infinity. The result is formally stated in the following theorem.
Theorem 3. Suppose the model assumptions, prior settings and regularity conditions for Theorem 2 hold. Then
$$\mathrm{pr}\big[\mathrm{sgn}\{\beta(s_j)\} = \mathrm{sgn}\{\beta_0(s_j)\},\ j = 1, \ldots, p_n \mid D_n\big] \to 1 \quad \text{in probability as } n \to \infty. \tag{5}$$
This theorem establishes the consistency of spatial variable selection. It does not require that the number of truly predictive image locations be finite or smaller than the sample size. This result is reasonable because the true spatially-varying coefficient function is piecewise smooth and continuous, so the soft-thresholded Gaussian process borrows strength from neighboring locations to identify the predictive locations. The Supplementary Material gives proofs of Theorems 2 and 3.
To apply the proposed model to the motivating dataset, we extend model (1) and Theorems 2–3 to the analysis of a binary response variable using the probit model:
$$\mathrm{pr}(Y_i = 1 \mid X_i, W_i) = \Phi\Big\{W_i^{\mathrm{T}}\gamma + p_n^{-1/2}\sum_{j=1}^{p_n} X_i(s_j)\,\beta(s_j)\Big\}, \tag{6}$$
where Φ is the standard normal cumulative distribution function.
Theorem 4. Assume the data Dn are generated from model (6) and the prior settings are the same as in Theorem 2. If Conditions A1–A5 and S1–S3 hold, then (4) and (5) hold under model (6).
Conditions S1–S3 and the proof of Theorem 4 are in the Supplementary Material.
4. POSTERIOR COMPUTATION
4.1. Model representation and prior specification
Next, we discuss the practical implementation of our method. We select a low-rank spatial model to ensure that computation remains feasible for large datasets. We exploit the kernel convolution approximation of a spatial Gaussian process. As discussed in Higdon et al. (1999), any stationary Gaussian process V(s) can be written as V(s) = ∫ K(s − t) w(t) dt, where K is a kernel function and w is a white-noise process with mean zero and variance σw2. Then the covariance function of V is cov{V(s), V(s′)} = σw2 ∫ K(s − t) K(s′ − t) dt.
This representation suggests the following approximation for the latent process: α(s) ≈ Σ_{l=1}^L K(s − tl) al δ, where t1, . . . , tL are a grid of equally-spaced spatial knots covering Ɓ and δ is the grid size. Without loss of generality, we assume δ = 1 and K is a local kernel function. We use tapered Gaussian kernels K(s − t) ∝ exp{−‖s − t‖22/(2σh2)} with bandwidth σh, tapered so that K(s − tl) = 0 for s separated from tl by more than a fixed multiple of σh. Taking L < pn knots and selecting compact kernels leads to computational savings, as discussed in Section 4.2.
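As an illustration of the low-rank construction, the following R sketch builds the pn × L kernel matrix for a two-dimensional grid, using the knot layout later adopted in the simulations (Section 5.2); the taper radius of three bandwidths is our own choice, since the construction only requires a compact kernel.

```r
# Minimal sketch of the kernel-convolution basis (assumptions: a 2-d square
# domain and a taper radius of 3 bandwidths; the paper requires only a
# compact kernel, not this specific radius)
make_kernel_basis <- function(locs, knots, sigma_h) {
  D <- as.matrix(dist(rbind(locs, knots)))           # all pairwise distances
  D <- D[seq_len(nrow(locs)), nrow(locs) + seq_len(nrow(knots))]
  K <- exp(-D^2 / (2 * sigma_h^2))                   # Gaussian kernel
  K[D > 3 * sigma_h] <- 0                            # taper to exact zero
  K                                                  # p x L basis matrix
}

m <- 30
locs  <- as.matrix(expand.grid(1:m, 1:m))            # p = 900 pixel locations
knots <- as.matrix(expand.grid(seq(1, m, by = 2),    # L = 225 knots (m/2 grid)
                               seq(1, m, by = 2)))
Kmat <- make_kernel_basis(locs, knots, sigma_h = 2)  # alpha(s) ~ Kmat %*% a
```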
The compact kernels K control the local spatial structure, and the prior for the coefficients a = (a1, . . . , aL)T controls the broad spatial structure. Following Nychka et al. (2015), we assume that the knots t1, . . . , tL are arranged on an array, and write l ~ k to denote that knots tl and tk are adjacent on this array. We then use a conditionally autoregressive prior (Gelfand et al., 2010) for the kernel coefficients. The conditional autoregressive prior is also defined locally, with full conditional distribution
$$a_l \mid a_k,\ k \neq l\ \sim\ \mathrm{N}\Big(\frac{\phi}{n_l}\sum_{k \sim l} a_k,\ \frac{\sigma_a^2}{n_l}\Big), \tag{7}$$
where nl is the number of knots adjacent to knot l, φ ∈ (0, 1) quantifies the strength of spatial dependence, and σa2 determines the variance. These full conditional distributions correspond to the joint distribution a ~ N{0, σa2(M − φA)−1}, where M is diagonal with diagonal elements (n1, . . . , nL) and A is the adjacency matrix with (k, l) element equal to 1 if k ~ l and zero otherwise.
Write β̃(s) = Σ_{l=1}^L K(s − tl) al and β̃ = {β̃(s1), . . . , β̃(spn)}T. Denote by K the kernel matrix with (j, l) element K(sj − tl); then β̃ ~ N{0, σa2 K(M − φA)−1KT} as a prior distribution. In this case, the β̃(sj) do not have equal variances, which may be undesirable. The non-constant variance arises because the kernel knots may be unequally distributed, and because the conditional autoregressive model is non-stationary in that the variances of the al are unequal.
To stabilize the prior variance, define K̃ = diag(w1, . . . , wpn)−1K to be the corresponding pn × L matrix of standardized kernel coefficients, where the wj are constants chosen so that the prior variance of β̃(sj) is the same for all j. We take wj to be the square root of the j-th diagonal element of K(M − φA)−1KT, so the kernel functions depend on φ. By pulling the prior standard deviation σa out of the thresholding transformation we have an equivalent representation of model (1) as
$$Y_i = W_i^{\mathrm{T}}\gamma + p_n^{-1/2}\sum_{j=1}^{p_n} X_i(s_j)\,\sigma_a\, g_\lambda\{\tilde\alpha(s_j)\} + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2), \tag{8}$$
where α̃(sj) = Σ_{l=1}^L K(sj − tl) al/wj and a ~ N{0, (M − φA)−1}. After standardization, the prior variance of each α̃(sj) is one, and therefore the prior probability that β(sj) is nonzero is 2Φ(−λ) for all j. This endows each parameter with a distinct interpretation: σa controls the scale of the nonzero coefficients, λ controls the prior degree of sparsity, and φ controls spatial dependence. With an additional set of conditions, we can show that the model representation (8) also has large prior support, thus leading to the posterior consistency and selection consistency. The Supplementary Material contains more details.
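To make the standardization concrete, the sketch below continues the earlier snippets (it reuses Kmat, knots and soft_threshold) and draws β under representation (8); the values of φ, σa and λ are arbitrary illustration choices, not the priors used in the paper.

```r
# Sketch of the standardized prior in (8), continuing the previous snippets
L <- nrow(knots)
A <- (as.matrix(dist(knots)) == 2) * 1            # adjacency on the knot grid
M <- diag(rowSums(A))                             # diagonal (n_1, ..., n_L)
phi <- 0.95; sigma_a <- 1; lambda <- 0.5          # illustration values only
Q <- M - phi * A                                  # CAR precision, a ~ N(0, Q^{-1})

Sigma_beta <- Kmat %*% solve(Q, t(Kmat))          # cov of K a before scaling
w <- sqrt(diag(Sigma_beta))                       # standardizing weights w_j
K_tilde <- Kmat / w                               # row j divided by w_j

a <- drop(t(chol(solve(Q))) %*% rnorm(L))         # draw a ~ N(0, Q^{-1})
alpha_tilde <- drop(K_tilde %*% a)                # unit prior variance at each s_j
beta <- sigma_a * soft_threshold(alpha_tilde, lambda)  # coefficients in (8)
2 * pnorm(-lambda)                                # prior pr{beta(s_j) != 0}
```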
In practice, we normalize the response and covariates, and then select priors for γ, σ2, σa2, φ, and λ. Following Banerjee et al. (2004), we use a beta prior for φ with mean near one, because only values near one provide appreciable spatial dependence. Finally, although our asymptotic results suggest that any λ > 0 will give posterior consistency, in finite samples posterior inference may be sensitive to the threshold, so we use a prior for λ rather than a fixed value.
4.2. Markov chain Monte Carlo algorithm
For fully Bayesian inference on model (1), we sample from the posterior distribution using Metropolis–Hastings within Gibbs sampling. The parameters γ, σ2 and σa2 have conjugate full conditional distributions and are updated using Gibbs steps. The spatial dependence parameter φ is sampled using Metropolis–Hastings with a beta candidate distribution centered at the current value, with standard deviation tuned to give an acceptance rate of around 0.4. The threshold λ is updated using Metropolis sampling with a random-walk Gaussian candidate distribution whose standard deviation is tuned to give an acceptance probability of around 0.4. The Metropolis update for al uses the prior full conditional distribution in (7) as the candidate distribution, which gives a high acceptance rate and thus good mixing without tuning.
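As an illustration of the threshold update, the sketch below implements a generic random-walk Metropolis step for λ; log_post stands for a hypothetical function returning the log-posterior of λ with all other parameters held fixed, and sd_lam would be tuned to reach the target acceptance rate of about 0.4.

```r
# Random-walk Metropolis update for the threshold lambda (sketch)
update_lambda <- function(lambda, sd_lam, log_post) {
  prop <- rnorm(1, mean = lambda, sd = sd_lam)     # Gaussian random walk
  if (prop <= 0) return(lambda)                    # lambda must stay positive
  log_ratio <- log_post(prop) - log_post(lambda)   # symmetric candidate
  if (log(runif(1)) < log_ratio) prop else lambda  # accept or keep current
}
```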
To make posterior inference for the probit regression model (6), we slightly modify the algorithm above by introducing an auxiliary continuous variable Zi for each response variable Yi. We assume that Yi = I(Zi > 0) and that Zi follows model (1) with σ2 = 1. The full conditional distribution of Zi is a truncated normal distribution, which is straightforward to sample as a Gibbs step in the posterior computation. The updating schemes for the other parameters remain the same. For other types of non-Gaussian response models, we can directly use Metropolis–Hastings sampling by modifying the likelihood function accordingly.
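The truncated normal draw can be implemented by inversion; in the minimal sketch below, mu denotes the vector of linear predictors from model (1), an assumed name for illustration.

```r
# Inverse-CDF draw of Z_i ~ N(mu_i, 1) truncated to (0, Inf) if Y_i = 1
# and to (-Inf, 0] if Y_i = 0, as in the probit data augmentation step
draw_z <- function(y, mu) {
  lo <- ifelse(y == 1, pnorm(0 - mu), 0)   # lower bound on the CDF scale
  hi <- ifelse(y == 1, 1, pnorm(0 - mu))   # upper bound on the CDF scale
  u <- runif(length(y), lo, hi)            # uniform draw on the valid range
  mu + qnorm(u)                            # invert the standard normal CDF
}
```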
5. SIMULATION STUDY
5.1. Data generation
In this section we conduct a simulation study to compare the proposed methods with other popular methods for scalar-on-image regression. For each simulated observation, we generate a two-dimensional image Xi on the {1, . . . , m} × {1, . . . , m} grid with m = 30. The covariates are generated following two covariance structures: exponential, and sharing structure with the signal. The exponential covariates are Gaussian with mean E{Xi(sj)} = 0 and cov{Xi(sj), Xi(sl)} = exp(−dj,l/υx), where dj,l is the distance between locations sj and sl and υx controls the range of spatial dependence. The covariates generated to share structure with the true signal β0 are constructed by adding to an exponentially correlated Gaussian field a random multiple of the true signal, τUiβ0(sj) with Ui ~ N(0, 1), so that τ controls the similarity between the covariates and the true signal. The continuous response is then generated as Yi = pn^{-1/2} Σ_{j=1}^{pn} Xi(sj)β0(sj) + εi with εi ~ N(0, σ2); the binary response is generated as Yi ~ Bernoulli(πi) with πi = Φ{pn^{-1/2} Σ_{j=1}^{pn} Xi(sj)β0(sj)}. Both Xi and Yi are independent across i = 1, . . . , n. We consider three true images: two sparse images plotted in Fig. 2, Five peaks and Triangle, and the dense image Waves with no zero coefficients. We also compare sample sizes n ∈ {100, 250}, spatial correlation ranges υx ∈ {3, 6}, and error standard deviations σ ∈ {2, 5}. For all combinations of these parameters considered we generate S = 100 datasets. We also simulate binary data for each true image with n = 250 and exponentially correlated covariates.
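To make the simulation design concrete, the following sketch generates data under the exponential-covariance, continuous-response scenario; the true image beta0 is left as a placeholder to be filled with one of the signals in Fig. 2, and the parameter values correspond to one of the scenarios in Table 1.

```r
# Generate n images with exponential spatial covariance and the continuous
# response (beta0 is a placeholder for a true image from Fig. 2)
m <- 30; p <- m^2; n <- 100; upsilon_x <- 3; sigma <- 5
locs <- as.matrix(expand.grid(1:m, 1:m))
Sigma_x <- exp(-as.matrix(dist(locs)) / upsilon_x)  # cov{X(s_j), X(s_l)}
R <- chol(Sigma_x)                                  # factorize once, reuse
X <- matrix(rnorm(n * p), n, p) %*% R               # n x p image predictors
beta0 <- rep(0, p)                                  # placeholder true image
y <- drop(X %*% beta0) / sqrt(p) + rnorm(n, sd = sigma)
```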
Fig. 2.
True images used in the simulation study.
5.2. Methods
We fit our model with an m/2 × m/2 equally-spaced grid of knots covering {1, . . . , m} × {1, . . . , m}, with bandwidth σh set to the minimum distance between knots. We fit our model both with λ > 0, in which case the prior is a soft-thresholded Gaussian process with sparsity, and with λ = 0, in which case the prior is a Gaussian process with no sparsity. For both models, we run the proposed Markov chain Monte Carlo algorithm for 50,000 iterations with 10,000 burn-in, and compute the posterior mean of β. For the sparse model, we also compute the posterior probability of a nonzero β(sj). We compare our method with the lasso (Tibshirani, 1996) and fused lasso (Tibshirani et al., 2005; Tibshirani & Taylor, 2011) penalized regression estimates
$$\hat\beta = \arg\min_{\beta}\Big[\sum_{i=1}^{n}\Big\{Y_i - p_n^{-1/2}\sum_{j=1}^{p_n}X_i(s_j)\beta(s_j)\Big\}^2 + \lambda\gamma\sum_{j=1}^{p_n}|\beta(s_j)| + \lambda\sum_{j\sim j'}|\beta(s_j)-\beta(s_{j'})|\Big], \tag{9}$$
where j ~ j′ denotes pairs of adjacent locations; the lasso corresponds to the first penalty alone with a single tuning parameter, while the fused lasso uses both terms.
The lasso estimate is computed using the lars package (Hastie & Efron, 2013) in R (R Core Team, 2017), and the tuning parameter is selected using the Bayesian information criterion. The fused lasso estimate is computed using the genlasso package (Arnold & Tibshirani, 2014) in R. The fused lasso has two tuning parameters, λ and γ. Due to computational considerations, we search only over γ ∈ {1/5, 1, 5}. For each γ, λ is selected using the Bayesian information criterion. The Supplementary Material gives results for each γ, and in the main text we report only the results for the γ with the most precise estimates.
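As a sketch of the BIC-based tuning for the lasso, the code below (reusing X, y and n from the data-generation sketch) computes one common form of the criterion along the lars path, approximating the degrees of freedom by the number of nonzero coefficients at each step.

```r
# BIC selection along the lasso path (sketch; df approximated by the
# number of nonzero coefficients at each path step)
library(lars)
fit  <- lars(X, y, type = "lasso")            # full lasso solution path
yhat <- predict(fit, X, type = "fit")$fit     # fitted values at each step
rss  <- colSums((y - yhat)^2)                 # residual sum of squares
df   <- rowSums(coef(fit) != 0)               # nonzero coefficients per step
bic  <- n * log(rss / n) + log(n) * df        # one common BIC form
beta_lasso <- coef(fit)[which.min(bic), ]     # estimate at the BIC minimum
```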
We also consider a functional principal component analysis-based alternative. We smooth each image using the technique of Xiao et al. (2013), implemented in the fbps function in R's refund package (Crainiceanu et al., 2014), compute the eigendecomposition of the sample covariance of the smoothed images, and then perform principal components regression with the lasso penalty tuned via the Bayesian information criterion. The Supplementary Material gives results using the leading eigenvectors that explain 80%, 90%, and 95% of the variation in the sample images; in the main text we report only the results for the value with the most precise estimates.
Finally, we compare our approach with the Bayesian spatial model of Goldsmith et al. (2014) using β(sj) = δjθj, where δj ∈ {0, 1} is the binary indicator that location j is included in the model, and θj ∈ ℝ is the regression coefficient given that the location is included. Both the δj and the θj have spatial priors; the continuous components θj follow a conditional autoregressive prior, and the binary components δj follow an Ising, or autologistic, prior (Gelfand et al., 2010) with full conditional distributions
$$\mathrm{pr}(\delta_j = 1 \mid \delta_k,\ k \neq j) = \frac{\exp\big(a_0 + b_0\sum_{k\sim j}\delta_k\big)}{1 + \exp\big(a_0 + b_0\sum_{k\sim j}\delta_k\big)}. \tag{10}$$
Estimating a0 and b0 is challenging because of the complexity of the Ising model (Møller et al., 2006); Goldsmith et al. (2014) therefore recommended selecting a0 and b0 by cross-validation over a grid of values. Due to computational limitations we select values in the middle of the recommended ranges and set a0 = −2 and b0 = 1. The posterior mean of β(s) and the posterior probability of a nonzero β(s) are approximated based on 5,000 Markov chain Monte Carlo samples after the first 1,000 are discarded as burn-in.
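For intuition, a single Gibbs sweep over the indicators under the full conditional (10) takes the following form; neighbors is a hypothetical adjacency list, and in the actual posterior sampler the likelihood contribution given θj would also enter the update.

```r
# One Gibbs sweep over the Ising indicators using the full conditional (10)
# (neighbors is a hypothetical list: neighbors[[j]] holds indices k with k ~ j)
update_delta <- function(delta, neighbors, a0 = -2, b0 = 1) {
  for (j in seq_along(delta)) {
    eta <- a0 + b0 * sum(delta[neighbors[[j]]])  # linear term in (10)
    delta[j] <- rbinom(1, 1, plogis(eta))        # pr(delta_j = 1 | rest)
  }
  delta
}
```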
5.3. Results
Tables 1 and 2 give the mean squared error for estimating βv averaged over locations, the Type I error and power for detecting nonzero signals, and the computing time. The soft-thresholded Gaussian process model gives the smallest mean squared error when the covariates have exponential correlation. Compared to the Gaussian process model, adding thresholding reduces the mean squared error by roughly 50% in many cases. As expected, the functional principal component analysis method gives the smallest mean squared error in the shared-structure scenarios, where the covariates are generated to have a spatial pattern similar to the true signal. Even in this case, the proposed method outperforms the other methods, which do not exploit this shared structure. Under the dense Waves signal the non-sparse Gaussian process model gives the smallest mean squared error, but again the proposed method remains competitive.
Table 1.
Simulation study results for linear regression models. Methods are compared in terms of mean squared error for βv, Type I error and power for feature detection. The scenarios vary by the true β (Fig. 2), sample size n, similarity between covariates and true signal determined by τ, error standard deviation σ, and spatial correlation range of the covariates υx. Results are reported as the mean with the standard deviation over the 100 simulated datasets
Mean squared error for βv, multiplied by 10⁴ | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Signal | n | τ | σ | υx | Lasso | Fused lasso | FPCA | Ising | GP | STGP |
Five peaks | 100 | 0 | 5 | 3 | 326(5) | 22(0) | 36(0) | 46(1) | 27(0) | 13(0) |
100 | 0 | 5 | 6 | 553(9) | 22(0) | 33(0) | 44(1) | 28(0) | 13(0) | |
100 | 0 | 2 | 3 | 102(1) | 11(0) | 24(0) | 26(0) | 15(0) | 4(0) | |
250 | 0 | 5 | 3 | 674(9) | 13(0) | 27(0) | 52(1) | 17(0) | 6(0) | |
Triangle | 100 | 0 | 5 | 3 | 289(4) | 9(0) | 17(0) | 30(1) | 19(0) | 8(0) |
100 | 0 | 5 | 6 | 515(9) | 8(0) | 15(0) | 28(1) | 19(0) | 8(0) |
100 | 0 | 2 | 3 | 73(1) | 5(0) | 12(0) | 14(0) | 10(0) | 4(0) | |
250 | 0 | 5 | 3 | 650(9) | 6(0) | 14(0) | 34(0) | 12(0) | 6(0) | |
100 | 2 | 5 | 3 | 1011(17) | 7(0) | 10(0) | 27(1) | 33(1) | 13(1) | |
100 | 4 | 5 | 3 | 1018(17) | 6(0) | 4(1) | 32(1) | 34(1) | 14(1) | |
Waves | 100 | 0 | 5 | 3 | 1260(13) | 250(5) | 188(7) | 419(3) | 48(1) | 109(10) |
100 | 0 | 5 | 6 | 1639(17) | 233(4) | 126(3) | 402(4) | 51(1) | 128(12) |
Type I error, % | Power, % | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Signal | n | τ | σ | υx | Lasso | Fused lasso | Ising | STGP | Lasso | Fused lasso | Ising | STGP |
Five peaks | 100 | 0 | 5 | 3 | 10(0) | 14(1) | 0(0) | 4(1) | 17(0) | 55(2) | 4(0) | 51(1) |
100 | 0 | 5 | 6 | 10(0) | 37(1) | 0(0) | 7(1) | 15(0) | 80(1) | 5(0) | 58(1) | |
100 | 0 | 2 | 3 | 79(0) | 24(1) | 0(0) | 4(1) | 28(0) | 84(1) | 4(0) | 82(1) | |
250 | 0 | 5 | 3 | 27(0) | 19(1) | 0(0) | 4(1) | 32(0) | 77(1) | 10(0) | 71(1) | |
Triangle | 100 | 0 | 5 | 3 | 10(0) | 5(0) | 0(0) | 4(1) | 23(1) | 85(1) | 9(0) | 87(1) |
100 | 0 | 5 | 6 | 11(0) | 8(1) | 0(0) | 4(1) | 19(1) | 91(1) | 9(0) | 86(1) | |
100 | 0 | 2 | 3 | 9(0) | 7(1) | 0(0) | 4(0) | 42(1) | 95(0) | 5(0) | 98(0) | |
250 | 0 | 5 | 3 | 27(0) | 5(0) | 0(0) | 4(0) | 35(1) | 92(1) | 16(0) | 96(1) | |
100 | 2 | 5 | 3 | 11(0) | 7(1) | 0(0) | 1(0) | 17(0) | 86(1) | 8(0) | 70(1) | |
100 | 4 | 5 | 3 | 11(0) | 2(1) | 0(0) | 3(1) | 19(1) | 84(1) | 12(0) | 73(1) |
Computing time, minutes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Signal | n | τ | σ | υx | Lasso | Fused lasso | FPCA | Ising | GP | STGP |
Five peaks | 100 | 0 | 5 | 3 | 0.02 | 16.77 | 5.40 | 27.61 | 4.81 | 11.28 |
Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of variation; Ising: Bayesian spatial variable selection with Ising priors (Goldsmith et al., 2014); GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.
Table 2.
Simulation study results for binary regression models. Methods are compared in terms of mean squared error for βv, Type I error and power for feature detection. The scenarios vary by the true β; see Fig. 2. Results are reported as the mean with the standard deviation over the 100 simulated datasets
Mean squared error for βv, multiplied by 10⁴ | ||||||
---|---|---|---|---|---|---|
Signal | Lasso | FPCA, 80% | FPCA, 90% | FPCA, 95% | GP | STGP |
Five peaks | 11(2) | 3(0) | 6(2) | 21(16) | 1(0) | 1(0) |
Triangle | 4(1) | 1(0) | 3(1) | 8(4) | 1(0) | 0(0) |
Waves | 80(5) | 18(4) | 52(16) | 38(8) | 107(46) | 191(138) |
Type I error, % | Power, % | |||
---|---|---|---|---|
Signal | Lasso | STGP | Lasso | STGP |
Five peaks | 3(0) | 7(1) | 15(0) | 62(1) |
Triangle | 1(0) | 4(0) | 24(1) | 91(1) |
Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of variation; Ising: Bayesian spatial variable selection with Ising priors (Goldsmith et al., 2014); GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.
For variable selection, we compare the proposed method only with the fused lasso and the Ising model, because the lasso does not incorporate spatial locations and the other methods do not perform variable selection directly. The results show that the fused lasso has much larger Type I errors in all cases, and that the Ising model has very small power to detect the signal in each case. The proposed method is much more efficient than both the fused lasso and the Ising model for variable selection. In terms of computing time, the proposed method is comparable to the fused lasso and faster than the Ising model.
6. ANALYSIS OF EEG DATA
Our motivating application is the study of the relationship between electrical brain activity, as measured through multichannel EEG signals, and genetic predisposition to alcoholism. EEG is a medical imaging technique that records the electrical activity of the brain by measuring the current flows produced when neurons are activated. The study comprises a total of 122 subjects: 77 alcoholic subjects and 45 non-alcoholic controls. For each subject, 64 electrodes were placed on the scalp and EEG was recorded from each electrode at a frequency of 256 Hz. The electrode positions were located at standard sites according to the American Electroencephalographic Society standard electrode position nomenclature (Sharbrough et al., 1991). The subjects were presented with 120 trials under several settings involving one stimulus or two stimuli. We consider the multichannel average EEG across the 120 trials corresponding to a single stimulus. The dataset is publicly available at the University of California, Irvine Knowledge Discovery in Databases Archive, https://kdd.ics.uci.edu/databases/eeg/eeg.data.html.
These data have been previously analyzed by Li et al. (2010), Hung & Wang (2013) and Zhou & Li (2014). However, all of these analyses ignored the spatial location of the electrodes on the scalp and instead smoothed based on their identification number, which ranges from 1 to 64 and is assigned arbitrarily relative to the electrodes' position on the scalp. Our goal in this analysis is to detect the regions of the brain that are most predictive of alcoholism status; thus accounting for the actual position of the electrodes is a key component of our approach. In the absence of more sophisticated means to determine the electrodes' position on the scalp, we consider a lattice design and assign a two-dimensional location to each electrode that closely matches the electrode's standard position. Using the labels of the electrodes, we were able to identify only 60 of them. As a result, our analysis is based on the multichannel EEG from these 60 electrodes.
In accordance with the notation employed earlier, let Yi be the alcoholism status indicator, with Yi = 1 if the ith subject is alcoholic and Yi = 0 otherwise. Furthermore, let Xi(sj, t) be the EEG image data for the ith subject, indexed by a two-dimensional spatial index sj, accounting for the location on the matching lattice design, and a one-dimensional index t for time.
We use a probit model, in the spirit of model (6), to relate the alcoholism status to the multichannel EEG image: pr(Yi = 1 | Xi) = Φ{γ0 + pn^{-1/2} Σj Σt Xi(sj, t)β(sj, t)}, where γ0 is an intercept and pn is the total number of spatiotemporal locations. The spatially-temporally varying coefficient function β quantifies the effect of the image on the response over time and is modeled using the soft-thresholded Gaussian process on the joint spatial and temporal domain. We select a 5 × 5 square grid of spatial knots and 64 temporal knots, for a total of 1,600 three-dimensional knots. We initially fitted a conditional autoregressive model with different dependence parameters for spatial and temporal neighbors (Reich et al., 2007), but found that convergence was slow and that the estimates of the spatial and temporal dependence were similar. Thus, we elected to use the same dependence parameter for all neighbors.
We evaluate the prediction performance of the proposed model using cross-validation. We first calculate the posterior predictive probability that each test-set response is one, and then apply the standard receiver operating characteristic curve, which evaluates classification accuracy over thresholds on the predictive probabilities. Figure 3 shows the receiver operating characteristic curve using leave-one-out cross-validation. The results are compared with those of the lasso, functional principal component analysis, and the soft-thresholded Gaussian process approach with thresholding parameter λ = 0. To facilitate computation for these methods, we thin the time points by two, leaving 128 time points. While no model is uniformly superior, the area under the curve for our approach is the largest among the alternatives we considered.
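Given the leave-one-out posterior predictive probabilities, the receiver operating characteristic curve and its area can be computed directly; a minimal sketch in base R, where prob and y are hypothetical vectors of predictive probabilities and observed binary labels:

```r
# ROC curve points from predictive probabilities (prob) and outcomes (y)
roc_points <- function(y, prob) {
  thr <- sort(unique(prob), decreasing = TRUE)
  t(sapply(thr, function(c) c(fpr = mean(prob[y == 0] >= c),
                              tpr = mean(prob[y == 1] >= c))))
}

# AUC as the probability that a case outranks a control (Wilcoxon form)
auc <- function(y, prob) mean(outer(prob[y == 1], prob[y == 0], ">") +
                              0.5 * outer(prob[y == 1], prob[y == 0], "=="))
```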
Fig. 3.
Receiver operating characteristic curves, with the area under the curve (AUC), for the leave-one-out cross-validation of the EEG data by six different methods: lasso (black solid, AUC = 0.789), functional principal component analysis using the leading eigenvectors that explain 80% (red solid, AUC = 0.775), 90% (red dashes, AUC = 0.789) and 95% (red dots, AUC = 0.777) of the variation, Gaussian process approach (green solid, AUC = 0.770), and soft-thresholded Gaussian process approach (navy solid, AUC = 0.818)
The differences between the models are further examined via the estimated β functions plotted in Fig. 4, where we ignore the spatial locations of the electrodes and plot them by their identification number. The lasso solution is nonzero at a single spatiotemporal location, while the functional principal component analysis and Gaussian process methods lead to non-sparse and thus less interpretable estimates of β. In contrast, the soft-thresholded Gaussian process estimate is near zero for the vast majority of locations and isolates a subset of the electrodes near time point 86 as the most powerful predictors of alcoholism.
Fig. 4.
Estimated spatial-temporal effects of the EEG image predictors by four different methods: Lasso, functional principal component analysis, Gaussian process and soft-thresholded Gaussian process. The Gaussian process and soft-thresholded Gaussian process estimates are posterior means.
Our analysis indicates that EEG measurements at time point t = 86, which corresponds to roughly 0.336 seconds after the stimulus, are predictive of alcoholism status. This observation is further confirmed by the plot of the posterior probability of a nonzero β(s, t) in Fig. 5a. This implies a delayed reaction to the stimulus, although this finding has yet to be confirmed with the investigators. To gain more insight into these findings, Figs. 5b–5d focus on particular time points and display the posterior mean and the posterior probability of a nonzero signal across the electrode locations. They indicate that the right occipital/lateral region is the most predictive of alcoholism status.
Fig. 5.
Summary of analysis of the EEG data by the soft-thresholded Gaussian process. Panel (a) plots the posterior probability of a nonzero β(s, t); each electrode is a line plotted over time t. The remaining panels map either the posterior probability of a nonzero β(s, t) or the posterior mean of β(s, t) at individual time points
7. DISCUSSION
The proposed method suggests several future directions that we plan to pursue. First, we plan to develop a more efficient posterior computation algorithm for the analysis of voxel-level functional magnetic resonance imaging (fMRI) data, which typically contain around 180,000 voxels per subject. Any fast and scalable Gaussian process approximation can potentially be applied to the soft-thresholded Gaussian process; for example, the recent nearest-neighbor Gaussian process approach of Datta et al. (2016) can be applied to our model. In addition, it is of great interest to perform joint analyses of datasets involving multiple imaging modalities, such as fMRI, diffusion tensor imaging and structural MRI. It is very challenging to model the dependence between multiple imaging modalities over space and to select the interactions between multiple-modality image predictors in scalar-on-image regression. An extension of the soft-thresholded Gaussian process might solve this problem. The basic idea is to introduce hierarchical latent Gaussian processes and different thresholding parameters for different modalities, leading to a hierarchical soft-thresholded Gaussian process as the prior model for the effects of interactions.
ACKNOWLEDGEMENT
This research was supported partially by grants from the National Institutes of Health and the National Science Foundation. The authors would like to thank the Editor, the Associate Editor and anonymous reviewers for suggestions that led to an improved manuscript.
Contributor Information
Jian Kang, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A. jiankang@umich.edu.
Brian J. Reich, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. bjreich@ncsu.edu
Ana-Maria Staicu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. astaicu@ncsu.edu.
REFERENCES
- Arnold TB & Tibshirani RJ (2014). genlasso: Path algorithm for generalized lasso problems. R package version 3.0.2.
- Banerjee S, Carlin BP & Gelfand AE (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman & Hall/CRC.
- Boehm Vock LF, Reich BJ, Fuentes M & Dominici F (2014). Spatial variable selection methods for investigating acute health effects of fine particulate matter components. Biometrics 71, 167–177.
- Choudhuri N, Ghosal S & Roy A (2004). Bayesian estimation of the spectral density of a time series. J. Am. Statist. Assoc. 99, 1050–1059.
- Crainiceanu C, Reiss P, Goldsmith J, Huang L, Huo L & Scheipl F (2014). refund: Regression with functional data. R package version 3.0.2.
- Cressie N (1993). Statistics for Spatial Data. New York: Wiley.
- Datta A, Banerjee S, Finley AO & Gelfand AE (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Am. Statist. Assoc. 111, 800–812.
- Gelfand AE, Diggle PJ, Fuentes M & Guttorp P (2010). Handbook of Spatial Statistics. New York: Chapman & Hall/CRC.
- Ghosal S & Roy A (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Statist. 34, 2413–2429.
- Goldsmith J, Huang L & Crainiceanu CM (2014). Smooth scalar-on-image regression via spatial Bayesian variable selection. J. Comp. Graph. Stat. 23, 46–64.
- Hastie TJ & Efron B (2013). lars: Least angle regression, lasso and forward stagewise. R package version 3.0.2.
- Higdon DM, Swall J & Kern J (1999). Non-stationary spatial modeling. In Bayesian Statistics 6: Proceedings of the Sixth Valencia Meeting, Bernardo J, Berger J, Dawid A & Smith A, eds. Oxford: Clarendon Press, pp. 761–768.
- Hung H & Wang C-C (2013). Matrix variate logistic regression model with application to EEG data. Biostatistics 14, 189–202.
- Li B, Kim MK & Altman N (2010). On dimension folding of matrix- or array-valued statistical objects. Ann. Statist. 38, 1094–1121.
- Li F, Zhang T, Wang Q, Gonzalez M, Maresh E & Coan J (2015). Spatial Bayesian variable selection and grouping in high-dimensional scalar-on-image regressions. Ann. Appl. Statist. 9, 687–713.
- Møller J, Pettitt A, Berthelsen K & Reeves R (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93, 451–458.
- Nelsen RB (1999). An Introduction to Copulas. New York: Springer-Verlag.
- Nychka D, Bandyopadhyay S, Hammerling DM, Lindgren F & Sain S (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. J. Comp. Graph. Stat. 24, 579–599.
- R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Reich BJ, Hodges JS & Carlin BP (2007). Spatial analysis of periodontal data using conditionally autoregressive priors having two types of neighbor relations. J. Am. Statist. Assoc. 102, 44–55.
- Reiss PT, Huo L, Zhao Y, Kelly C & Ogden RT (2015). Wavelet-domain regression and predictive inference in psychiatric neuroimaging. Ann. Appl. Statist. 9, 1076–1101.
- Reiss PT & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics 66, 61–69.
- Sharbrough F, Chatrian G, Lesser R, Luders H, Nuwer M & Picton T (1991). American Electroencephalographic Society guidelines for standard electrode position nomenclature. J. Clin. Neurophysiol. 8, 200–202.
- Smith M & Fahrmeir L (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. J. Am. Statist. Assoc. 102, 417–431.
- Tibshirani RJ (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288.
- Tibshirani RJ, Saunders M, Rosset S, Zhu J & Knight K (2005). Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B 67, 91–108.
- Tibshirani RJ & Taylor J (2011). The solution path of the generalized lasso. Ann. Statist. 39, 1335–1371.
- Tokdar ST & Ghosh JK (2007). Posterior consistency of logistic Gaussian process priors in density estimation. J. Stat. Plan. Inference 137, 34–42.
- Wang X, Zhu H & the Alzheimer's Disease Neuroimaging Initiative (2016). Generalized scalar-on-image regression models via total variation. J. Am. Statist. Assoc., DOI: 10.1080/01621459.2016.1194846.
- Xiao L, Li Y & Ruppert D (2013). Fast bivariate P-splines: the sandwich smoother. J. R. Statist. Soc. B 75, 577–599.
- Zhou H & Li L (2014). Regularized matrix regression. J. R. Statist. Soc. B 76, 463–483.