
Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process

Jian Kang, Brian J. Reich and Ana-Maria Staicu

SUMMARY

This work concerns spatial variable selection for scalar-on-image regression. We propose a new class of Bayesian nonparametric models and develop an efficient posterior computational algorithm. The proposed soft-thresholded Gaussian process provides large prior support over the class of piecewise-smooth, sparse, and continuous spatially-varying regression coefficient functions. In addition, under mild regularity conditions the soft-thresholded Gaussian process prior leads to posterior consistency for parameter estimation and variable selection in scalar-on-image regression, even when the number of predictors is larger than the sample size. The proposed method is compared to alternatives via simulation and applied to an electroencephalography study of alcoholism.

Keywords: Electroencephalography, Gaussian processes, Posterior consistency, Spatial variable selection

1. INTRODUCTION

Scalar-on-image regression has attracted considerable attention recently in both the frequentist and Bayesian literature. This problem is challenging for several reasons: the predictor is a two-dimensional or three-dimensional image where the number of pixels or voxels is often larger than the sample size; the observed predictors may be contaminated with noise; the true signal may exhibit complex spatial structure; and most components of the predictor may have no effect on the response, and when they do have an effect it may vary smoothly.

Regularized regression techniques are often needed when the number of predictors is much larger than the sample size. The lasso (Tibshirani, 1996) is a popular method for variable selection that employs a penalty based on the sum of the absolute values of the regression coefficients. However, most penalized approaches do not accommodate predictors with ordered components, such as an image predictor. One exception is the fused lasso (Tibshirani et al., 2005), which generalizes the lasso by penalizing both the coefficients and their successive differences and thus ensures both sparsity and smoothness of the estimated effect. To incorporate the spatial structure of the predictors, Reiss & Ogden (2010) extended functional principal component regression to handle image predictors by approximating the coefficient function using B-splines. However, this method is not sensitive to sparsity or sharp edges. Wang et al. (2016) proposed a penalty based on the total variation of the regression function that yields piecewise-smooth regression coefficients. That approach focuses primarily on prediction and does not quantify uncertainty in a way that permits statistical inference. Reiss et al. (2015) considered a wavelet expansion for the regression coefficients and conducted inference via hypothesis testing. Their approach requires that the image predictor have dimensions of equal size and, moreover, that the common size be a power of two. These strong assumptions are violated in our motivating application. Therefore, none of these methods is appropriate for detecting a piecewise-smooth, sparse, and continuous signal relating a scalar outcome to an image predictor.

Scalar-on-image regression has also been approached from a Bayesian viewpoint. Goldsmith et al. (2014) proposed to model the regression coefficients as the product of two latent spatial processes to capture both sparsity and spatial smoothness of the important regression coefficients. They used an Ising prior for the binary indicator that a voxel is predictive of the response, and a conditionally autoregressive prior to smooth the nonzero regression coefficients. Use of an Ising prior for binary indicators was first discussed by Smith & Fahrmeir (2007) in the context of high-dimensional spatial predictors and was recently employed by Li et al. (2015), who use a Dirichlet process prior for the nonzero coefficients. In both Li et al. (2015) and Goldsmith et al. (2014), sparsity and smoothness are controlled separately by two independent spatial processes, so the transitions from zero areas to neighboring nonzero areas may be abrupt, and computation is challenging because the Ising probability mass function does not have a simple closed form.

We propose an alternative approach to spatial variable selection in scalar-on-image regression by modeling the regression coefficients through soft-thresholding of a latent Gaussian process. The soft-thresholding function is well known for its relation to the lasso estimate when the design matrix is orthonormal (Tibshirani, 1996), and here we use it to specify a spatial prior with mass at zero. The idea is inspired by Boehm Vock et al. (2014), who considered Gaussian processes as a regularization technique for spatial variable selection; however, their approach does not assign prior probability mass at zero for the regression coefficients and is not designed for scalar-on-image regression. Unlike other Bayesian spatial models (Goldsmith et al., 2014; Li et al., 2015), the soft-thresholded Gaussian process ensures a gradual transition between the zero and nonzero effects of neighboring locations and provides large support over the class of spatially-varying regression coefficient functions that are piecewise smooth, sparse, and continuous. The soft-thresholded Gaussian process avoids the computational problems posed by the Ising prior, and it can be scaled to large datasets using a low-rank spatial model for the latent process. We show that the soft-thresholded Gaussian process prior leads to posterior consistency for both parameter estimation and variable selection under mild regularity conditions, even when the number of predictors is larger than the sample size.

The proposed method is introduced for a single image predictor and Gaussian responses, mainly for simplicity. Extensions to accommodate multiple predictors and non-Gaussian responses are relatively straightforward. However, the theoretical investigation of the procedure for non-Gaussian responses is not trivial, so we establish the theory for binary responses with the probit link function. The methods are applied to data from an electroencephalography (EEG) study of alcoholism; see http://kdd.ics.uci.edu/datasets/eeg/eeg.data.html. The objective of our analysis is to estimate the relationship between alcoholism and brain activity. EEG signals were recorded for both alcoholics and controls from 64 electrode channels on the subjects' scalps at 256 Hz for one second, leading to a high-dimensional predictor. The data have been previously described in Li et al. (2010) and Zhou & Li (2014). Previous literature has ignored the spatial structure of the electrodes shown in Fig. 1, which was recovered from the standard electrode position nomenclature described by Fig. 1 of https://www.acns.org/pdf/guidelines/Guideline-5.pdf. Our spatial analysis exploits the spatial configuration of the electrodes and reveals regions of the brain with activity that is predictive of alcoholism.

Fig. 1. Standard electrode position nomenclature for the 10–10 system.

2. MODEL

2.1. Scalar-on-image regression

Let $\mathbb{R}^m$ denote the $m$-dimensional real vector space. Suppose there are $n$ subjects in the dataset and the data for subject $i$ consist of a scalar response variable $Y_i$, a set of $p_n$ spatially-distributed image predictors, denoted $X_i = (X_{i,1},\dots,X_{i,p_n})^{\mathrm T} \in \mathbb{R}^{p_n}$, and other scalar covariates collected in $W_i = (W_{i,1},\dots,W_{i,q})^{\mathrm T} \in \mathbb{R}^q$. Assume that $\{W_i\}_{i=1}^n$ are fixed design covariates. Here $X_{i,j}$ denotes the image intensity value measured at location $s_j$, for $j = 1,\dots,p_n$. We assume that the set of locations $\mathcal S = \{s_j\}_{j=1}^{p_n}$ is a fixed subset of a compact closed region $\mathcal B \subset \mathbb{R}^d$. Let $N(\mu, \Sigma)$ denote a normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$, or variance in the one-dimensional case. We consider the following scalar-on-image regression model:

$$ (Y_i \mid W_i, X_i, \alpha_v, \beta, \sigma^2) \;\sim\; N\Big\{ \sum_{k=1}^{q} \alpha_k W_{i,k} + p_n^{-1/2} \sum_{j=1}^{p_n} \beta(s_j) X_{i,j},\; \sigma^2 \Big\}, \quad i = 1, \dots, n, \tag{1} $$

where $\alpha_v = (\alpha_1,\dots,\alpha_q)^{\mathrm T}$ quantifies the effect of $W_i$ and $\beta(\cdot)$ is a spatially-varying coefficient function defined on $\mathcal B$. In practice, the normalizing scalar $p_n^{-1/2}$ can be absorbed into the image predictors; its role is to rescale the total effect of the massive number of image predictors so that it is bounded away from infinity with large probability when $p_n$ is very large. Scientifically, in brain imaging studies the image predictors take values that measure body tissue contrast or neural activity at each spatial location, and the number of image predictors, $p_n$, is determined by the image resolution. Thus, the total effect of the image predictors reflects the total intensity of the brain signals, which should not increase to infinity as the image resolution increases. In model (1) the response is taken to be Gaussian and only one type of image predictor is included, although extensions of the modeling framework to non-Gaussian responses and multi-modality image predictors are straightforward.

2.2. Soft-thresholded Gaussian processes

To capture the characteristics of the image predictors and their effects on the response variable, the prior for $\beta(\cdot)$ should be sparse and spatial. That is, we assume that many locations have $\beta(s_j) = 0$, that the sites with nonzero coefficients cluster spatially, and that the coefficients vary smoothly within clusters of nonzero coefficients. To encode these desired properties into the prior, we represent $\beta(\cdot)$ as a transformation of a Gaussian process, $\beta(s) = g_\lambda\{\tilde\beta(s)\}$, where $g_\lambda$ is a transformation function dependent on the parameter $\lambda$ and $\tilde\beta(s)$ follows a Gaussian process prior. In this trans-kriging (Cressie, 1993) or Gaussian copula (Nelsen, 1999) model, the function $g_\lambda$ determines the marginal distribution of $\beta(s)$, while the covariance of the latent $\tilde\beta(s)$ determines the dependence structure of $\beta(s)$.

Spatial dependence is determined by the prior for $\tilde\beta(s)$. We assume that $\tilde\beta(\cdot)$ is a Gaussian process with mean zero and stationary covariance function $\mathrm{cov}\{\tilde\beta(s), \tilde\beta(s')\} = \kappa(s - s')$ for some covariance function $\kappa$. Although other transformations are possible (Boehm Vock et al., 2014), we select $g_\lambda$ to be the soft-thresholding function, which maps values of $\tilde\beta(s)$ near zero to exactly zero and thus gives a sparse prior. Let

$$ g_\lambda(x) = \begin{cases} 0, & |x| \le \lambda, \\ \operatorname{sgn}(x)\,(|x| - \lambda), & |x| > \lambda, \end{cases} \tag{2} $$

where $\operatorname{sgn}(x) = 1$ if $x > 0$, $\operatorname{sgn}(x) = -1$ if $x < 0$, and $\operatorname{sgn}(0) = 0$. The thresholding parameter $\lambda > 0$ determines the degree of sparsity. This soft-thresholded Gaussian process prior is denoted $\beta \sim \mathrm{STGP}(\lambda, \kappa)$.
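To make the prior concrete, the following R sketch draws a latent Gaussian process on a one-dimensional grid and applies the soft-thresholding map (2); the grid size, the squared-exponential kernel and its bandwidth are illustrative assumptions, not the authors' settings.

```r
# A minimal sketch of the soft-thresholding map (2) applied to one draw from a
# latent Gaussian process on a one-dimensional grid; the grid, kernel and
# bandwidth below are illustrative assumptions.
soft_threshold <- function(x, lambda) {
  sign(x) * pmax(abs(x) - lambda, 0)
}

set.seed(1)
s <- seq(0, 1, length.out = 200)                    # spatial grid
kappa <- exp(-as.matrix(dist(s))^2 / (2 * 0.1^2))   # covariance matrix
beta_tilde <- drop(t(chol(kappa + 1e-8 * diag(200))) %*% rnorm(200))
beta <- soft_threshold(beta_tilde, lambda = 0.5)    # exact zeros where |x| <= 0.5
```

The thresholded draw is exactly zero wherever the latent process lies in $[-\lambda, \lambda]$, yet remains continuous at the boundaries of the zero set, which is the qualitative behavior the prior is designed to encode.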

3. THEORETICAL PROPERTIES

3.1. Notation and definitions

We first introduce additional notation for the theoretical development and formal definitions of the class of spatially-varying coefficient functions under consideration. We assume that all the random variables and stochastic processes in this article are defined on the same probability space, denoted $(\Omega, \mathcal F, \Pi)$. Let $\mathbb Z_+^d = \{0, 1, \dots\}^d$ represent the space of $d$-dimensional vectors of non-negative integers. For any vector $v = (v_1,\dots,v_d)^{\mathrm T} \in \mathbb R^d$, let $\|v\|_p = (\sum_{l=1}^d |v_l|^p)^{1/p}$ be the $L_p$ norm for any $p \ge 1$, and let $\|v\|_\infty = \max_{l=1,\dots,d} |v_l|$ be the supremum norm. For any $x \in \mathbb R$, let $\lceil x \rceil$ be the smallest integer not smaller than $x$ and $\lfloor x \rfloor$ the largest integer not larger than $x$. Define the event indicator $I(A) = 1$ if event $A$ occurs and $I(A) = 0$ otherwise. For any $z = (z_1,\dots,z_d)^{\mathrm T} \in \mathbb Z_+^d$, define $z! = \prod_{l=1}^d z_l!$ and $v^z = \prod_{l=1}^d v_l^{z_l}$. For any real function $f$ on $\mathcal B$, let $\|f\|_p = \{\int_{\mathcal B} |f(s)|^p \, ds\}^{1/p}$ denote the $L_p$ norm for any $p \ge 1$ and $\|f\|_\infty = \sup_{s \in \mathcal B} |f(s)|$ the supremum norm.

Definition 1. Denote by $C^m(\mathcal B)$ the set of differentiable functions of order $m$ defined on $\mathcal B$, that is, functions $f(\cdot)$ such that $f(s)$ has partial derivatives

$$ D^\tau f(s) = \frac{\partial^{\|\tau\|_1} f}{\partial s_1^{\tau_1} \cdots \partial s_d^{\tau_d}}(s) = \sum_{\|\eta\|_1 + \|\tau\|_1 \le m} \frac{D^{\tau+\eta} f(t)}{\eta!}\,(s - t)^\eta + R_m(s, t), $$

where $\tau = (\tau_1,\dots,\tau_d)^{\mathrm T} \in \mathbb Z_+^d$, $\eta \in \mathbb Z_+^d$ and $t \in \mathcal B$, and, given any point $s_0$ of $\mathcal B$ and any $\epsilon > 0$, there is a $\delta > 0$ such that if $s$ and $t$ are any two points of $\mathcal B$ with $\|s - s_0\|_1 < \delta$ and $\|t - s_0\|_1 < \delta$, then $|R_m(s,t)| \le \|s - t\|_1^{m - \|\tau\|_1}\,\epsilon$. If $\|D^\tau f\|_\infty \le M < \infty$, then $|R_m(s,t)| \le M\|s - t\|_1^{m+1}/(m+1)!$.

Denote by $\bar R$ and $\partial R$ the closure and the boundary, respectively, of any set $R \subset \mathcal B$.

Definition 2. Define $\Theta = \{\beta(s) : s \in \mathcal B\}$ to be the collection of all spatially-varying coefficient functions that satisfy the following conditions. Assume there exist two disjoint non-empty open sets $R_1$ and $R_{-1}$ with $\bar R_1 \cap \bar R_{-1} = \emptyset$ such that:

(2.1) $\beta(\cdot)$ is smooth over $\bar R_1 \cup \bar R_{-1}$, that is, $\beta(s)\,I(s \in \bar R_1 \cup \bar R_{-1}) \in C^\rho(\bar R_1 \cup \bar R_{-1})$ with $\rho = \lceil d/2 \rceil$;

(2.2) $\beta(s) = 0$ for $s \in R_0$, $\beta(s) > 0$ for $s \in R_1$ and $\beta(s) < 0$ for $s \in R_{-1}$, where $R_0 = \mathcal B \setminus (R_1 \cup R_{-1})$ and $\partial R_0 \supseteq \partial(R_1 \cup R_{-1})$;

(2.3) $\beta(\cdot)$ is continuous over $\mathcal B$, that is, $\lim_{s \to s_0} \beta(s) = \beta(s_0)$ for all $s_0 \in \mathcal B$.

Simply put, $\Theta$ is the collection of all piecewise-smooth, sparse and continuous functions defined on $\mathcal B$.

3.2. Large support

A desired property for a Bayesian nonparametric model is prior support over a large class of functions. In this section we show that the soft-thresholded Gaussian process has large support over $\Theta$. We begin with two appealing properties of the soft-thresholding function. All technical conditions are listed in the Appendix as Conditions A1–A5.

Lemma 1. The soft-thresholding function $g_\lambda(x)$ is Lipschitz continuous for any $\lambda > 0$; that is, for all $x_1, x_2$, $|g_\lambda(x_1) - g_\lambda(x_2)| \le |x_1 - x_2|$.

Lemma 2. For any function $\beta_0 \in \Theta$ and any threshold parameter $\lambda_0 > 0$, there exists a smooth function $\tilde\beta_0(s) \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_0}\{\tilde\beta_0(s)\}$.

Lemma 1 is proved directly by verifying the definition. The proof of Lemma 2 is not trivial: it requires a detailed construction of the smooth function $\tilde\beta_0(s)$. See the Appendix for details.

Theorem 1. For any function $\beta_0 \in \Theta$ and $\varepsilon > 0$, the soft-thresholded Gaussian process prior $\beta \sim \mathrm{STGP}(\lambda_0, \kappa)$ satisfies $\Pi(\|\beta - \beta_0\|_\infty < \varepsilon) > 0$ for any $\lambda_0 > 0$ and $\kappa$ that satisfy Condition A5.

Proof. By Lemma 2, for any threshold parameter $\lambda_0 > 0$ there exists a smooth function $\tilde\beta_0(s) \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_0}\{\tilde\beta_0(s)\}$. Since $\beta \sim \mathrm{STGP}(\lambda_0, \kappa)$, we have $\beta(s) = g_{\lambda_0}\{\tilde\beta(s)\}$ with $\tilde\beta(s) \sim \mathrm{GP}(0, \kappa)$. By Lemma 1,

Π{supsB|β(s)β0(s)|<ε}=Π[supsB|gλ0{β˜(s)}gλ0{β˜0(s)}|<ε]Π{supsB|β˜(s)β˜0(s)|<ε}.

By Condition A5 and Theorem 4.5 of Tokdar & Ghosh (2007), $\tilde\beta_0(\cdot)$ is in the reproducing kernel Hilbert space of $\kappa$; then, by Theorem 4 of Ghosal & Roy (2006), $\Pi\{\sup_{s \in \mathcal B} |\tilde\beta(s) - \tilde\beta_0(s)| < \varepsilon\} > 0$, which completes the proof. □

Theorem 1 implies that there is always positive probability that the soft-thresholded Gaussian process concentrates in an arbitrarily small neighborhood of any spatially-varying coefficient function with the piecewise smoothness, sparsity and continuity properties. According to Lemma 2, for any positive $\lambda_1 \ne \lambda_2$ there exist $\tilde\beta_1, \tilde\beta_2 \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_1}\{\tilde\beta_1(s)\} = g_{\lambda_2}\{\tilde\beta_2(s)\}$. This implies that the thresholding parameter $\lambda_0$ and the latent smooth function $\tilde\beta_0$ are not jointly identifiable, but $\beta_0$ itself is identifiable, as we show by establishing the posterior consistency of parameter estimation.

3.3. Posterior consistency

For $i = 1,\dots,n$, given the image predictor $X_i$ on a set of spatial locations $\mathcal S$ and other covariates $W_i$, suppose the response $Y_i$ is generated from the scalar-on-image regression model (1) with parameters $\alpha_{0v} \in \mathbb R^q$, $\sigma_0^2 > 0$ and $\beta_0 \in \Theta$. We assume $\alpha_{0v}$ and $\sigma_0^2$ are known for theoretical convenience; in practice it is straightforward to estimate them from the data. We assign a soft-thresholded Gaussian process prior to the spatially-varying coefficient function, that is, $\beta \sim \mathrm{STGP}(\lambda, \kappa)$ for given $\lambda > 0$ and covariance kernel $\kappa$. In light of the large-support property in Theorem 1, the following lemma shows the positivity of prior neighborhoods:

Lemma 3. Denote by $\pi_{n,i}(\cdot\,;\beta)$ the density function of $Z_{n,i} = (Y_i, W_i, X_i)$ in model (1) and suppose Condition A4 holds for $X_i$. Define $\Lambda_{n,i}(\cdot\,;\beta_0,\beta) = \log \pi_{n,i}(\cdot\,;\beta_0) - \log \pi_{n,i}(\cdot\,;\beta)$, $K_{n,i}(\beta_0,\beta) = E_{\beta_0}\{\Lambda_{n,i}(Z_{n,i};\beta_0,\beta)\}$ and $V_{n,i}(\beta_0,\beta) = \mathrm{var}_{\beta_0}\{\Lambda_{n,i}(Z_{n,i};\beta_0,\beta)\}$. There exists a set $B$ with $\Pi(B) > 0$ such that, for any $\varepsilon > 0$,

$$ \liminf_{n\to\infty} \Pi\Big\{\beta \in B,\; n^{-1}\sum_{i=1}^n K_{n,i}(\beta_0,\beta) < \varepsilon\Big\} > 0, \qquad n^{-2}\sum_{i=1}^n V_{n,i}(\beta_0,\beta) \to 0 \;\text{ for all } \beta \in B. $$

We construct sieves for the spatially-varying coefficient functions in Θ as

Θn={βΘ:βpn1/(2d),supsR1R1|Dτβ(s)|pn1/(2d),1τ1ρ}, (3)

where $\rho$ is defined in Condition A1. Using Lemmas A1–A5 in the Appendix, we can bound the tail probability of the sieve and construct uniformly consistent tests, as stated in the following lemmas:

Lemma 4. If $\beta(s) \sim \mathrm{STGP}(\lambda_0, \kappa)$ with $\lambda_0 > 0$ and the kernel function $\kappa$ satisfies Condition A5, then there exist constants $K$ and $b$ such that for all $n \ge 1$, $\Pi(\Theta_n^{\mathrm c}) \le K\exp(-b\,p_n^{1/d})$.

Lemma 5. For any $\varepsilon > 0$ and $v_0/2 < v < 1/2$, there exist $N$, $C_0$, $C_1$ and $C_2$ such that for all $n > N$ and all $\beta \in \Theta_n$ with $\|\beta - \beta_0\|_1 > \varepsilon$, a test function $\Psi_n$ can be constructed such that $E_{\beta_0}(\Psi_n) \le C_0\exp(-C_2 n^{2v})$ and $E_\beta(1 - \Psi_n) \le C_0\exp(-C_1 n)$, where $v_0$ is defined in Condition A1.

Proofs of Lemmas 3–5 are provided in the Supplementary Material. These lemmas verify the three key conditions for proving posterior consistency in scalar-on-image regression based on Theorem A1 of Choudhuri et al. (2004). We therefore have the following theorem:

Theorem 2. Denote the data by $D_n = \{Y_i, X_i, W_i\}_{i=1}^n$. If Conditions A1–A5 hold, then for any $\varepsilon > 0$,

Π(βΘ:ββ01<ε|Dn)1,n. (4)

Theorem 2 implies that the soft-thresholded Gaussian process prior ensures that the posterior distribution of the spatially-varying coefficient function concentrates in an arbitrarily small neighborhood of the true value when the numbers of subjects and spatial locations are both sufficiently large. Given that the true function of interest is piecewise smooth, sparse and continuous, the soft-thresholded Gaussian process prior further ensures that the posterior probability that the sign of the spatially-varying coefficient function is correct converges to one as the sample size goes to infinity. The result is formally stated in the following theorem.

Theorem 3. Suppose the model assumptions, prior settings and regularity conditions for Theorem 2 hold. Then

$$ \Pi[\operatorname{sgn}\{\beta(s)\} = \operatorname{sgn}\{\beta_0(s)\} \text{ for all } s \in \mathcal B \mid D_n] \to 1, \quad n \to \infty. \tag{5} $$

This theorem establishes the consistency of spatial variable selection. It does not require that the number of true image predictors is finite or less than the sample size. This result is reasonable in that the true spatially-varying coefficient function is piecewise smooth and continuous, and the soft-thresholded Gaussian process will borrow strength from neighboring locations to estimate the true image predictors. The Supplementary Material gives proofs of Theorems 2 and 3.

To apply the proposed model to the motivating dataset, we extend model (1) and Theorems 2–3 to the analysis of a binary response variable using a probit model:

$$ Y_i \sim \mathrm{Bernoulli}(\pi_i), \qquad \Phi^{-1}(\pi_i) = \sum_{k=1}^q \alpha_k W_{i,k} + p_n^{-1/2}\sum_{j=1}^{p_n}\beta(s_j)X_{i,j}, \tag{6} $$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function.

Theorem 4. Assume the data $D_n$ are generated from model (6) and the prior settings are the same as in Theorem 2. If Conditions A1–A5 and S1–S3 hold, then (4) and (5) hold under model (6).

Conditions S1–S3 and the proof of Theorem 4 are in the Supplementary Material.

4. POSTERIOR COMPUTATION

4.1. Model representation and prior specification

Next, we discuss the practical implementation of our method. We adopt a low-rank spatial model to keep computation feasible for large datasets, exploiting the kernel convolution approximation of a spatial Gaussian process. As discussed in Higdon et al. (1999), any stationary Gaussian process $V(s)$ can be written as $V(s) = \int K(s - t)\,w(t)\,dt$, where $K$ is a kernel function and $w$ is a white-noise process with mean zero and variance $\sigma_w^2$. The covariance function of $V(\cdot)$ is then $\mathrm{cov}\{V(s), V(s+h)\} = \kappa(h) = \sigma_w^2 \int K(s - t)\,K(s + h - t)\,dt$.

This representation suggests the approximation $\tilde\beta(s) \approx \sum_{l=1}^L K(s - t_l)\,a_l\,\delta$ for the latent process, where $t_1,\dots,t_L \in \mathbb R^d$ form a grid of equally-spaced spatial knots covering $\mathcal B$ and $\delta$ is the grid size. Without loss of generality, we assume $\delta = 1$ and that $K$ is a local kernel function. We use tapered Gaussian kernels with bandwidth $\sigma_h$, $K(h) = \exp\{-\|h\|^2/(2\sigma_h^2)\}\,I(\|h\| < 3\sigma_h)$, so that $K(s - t_l) = 0$ when $s$ is separated from $t_l$ by at least $3\sigma_h$. Taking $L < p_n$ knots and selecting compact kernels leads to computational savings, as discussed in Section 4.2.
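As an illustration, the following R sketch builds the tapered-Gaussian kernel basis on a two-dimensional image domain; the image grid, knot spacing and bandwidth are assumptions chosen for illustration.

```r
# A sketch of the tapered-Gaussian kernel basis on a two-dimensional domain;
# m, the knot spacing and sigma_h are illustrative assumptions.
tapered_gauss <- function(d, sigma_h) {
  exp(-d^2 / (2 * sigma_h^2)) * (d < 3 * sigma_h)
}

m <- 30
s <- as.matrix(expand.grid(1:m, 1:m))                        # p = m^2 pixels
knots <- as.matrix(expand.grid(seq(1, m, 2), seq(1, m, 2)))  # L = 225 knots
sigma_h <- 2

# p x L matrix of squared distances ||s_j - t_l||^2, then the kernel matrix K.
D2 <- outer(rowSums(s^2), rep(1, nrow(knots))) -
  2 * s %*% t(knots) +
  outer(rep(1, nrow(s)), rowSums(knots^2))
K <- tapered_gauss(sqrt(pmax(D2, 0)), sigma_h)  # exactly zero beyond 3 * sigma_h
```

Because of the taper, each row of K has only a handful of nonzero entries, which is the source of the computational savings mentioned above.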

The compact kernels $K$ control the local spatial structure, and the prior for the coefficients $a = (a_1,\dots,a_L)^{\mathrm T}$ controls the broad spatial structure. Following Nychka et al. (2015), we assume that the knots $t_1,\dots,t_L$ are arranged on an $m_1 \times \cdots \times m_d$ array, and we use $l \sim k$ to denote that knots $t_l$ and $t_k$ are adjacent on this array. We then use a conditionally autoregressive prior (Gelfand et al., 2010) for the kernel coefficients. The conditionally autoregressive prior is also defined locally, with full conditional distribution

$$ a_l \mid a_k,\, k \ne l \;\sim\; N\Big(\frac{\vartheta}{n_l}\sum_{k \sim l} a_k,\; \frac{\sigma_a^2}{n_l}\Big), \tag{7} $$

where $n_l$ is the number of knots adjacent to knot $l$, $\vartheta \in (0,1)$ quantifies the strength of spatial dependence, and $\sigma_a^2$ determines the variance. These full conditional distributions correspond to the joint distribution $a \sim N\{0, \sigma_a^2(M - \vartheta A)^{-1}\}$, where $M$ is diagonal with diagonal elements $(n_1,\dots,n_L)$ and $A$ is the adjacency matrix with $(k,l)$ element equal to 1 if $k \sim l$ and zero otherwise.

Write $\tilde\beta_v = \{\tilde\beta(s_1),\dots,\tilde\beta(s_p)\}^{\mathrm T}$. Denote by $K$ the $p \times L$ kernel matrix with $(j,l)$ element $K(\|s_j - t_l\|_2)$; then $\tilde\beta_v \sim N\{0, \sigma_a^2 K(M - \vartheta A)^{-1}K^{\mathrm T}\}$ a priori. In this case the $\tilde\beta(s_j)$ do not have equal variances, which may be undesirable. Non-constant variance arises because the kernel knots may be unequally distributed, and because the conditionally autoregressive model is non-stationary in that the variances of the $a_l$ are unequal.

To stabilize the prior variance, define $\tilde K_{j,l} = K(\|s_j - t_l\|_2)/w_j$ and let $\tilde K$ be the corresponding $p \times L$ matrix of standardized kernel coefficients, where the $w_j$ are constants chosen so that the prior variance of $\tilde\beta(s_j)$ is the same for all $j$. We take $w_j$ to be the square root of the $j$th diagonal element of $K(M - \vartheta A)^{-1}K^{\mathrm T}$, so the kernel functions depend on $\vartheta$. Pulling the prior standard deviation $\sigma_a$ out of the thresholding transformation yields an equivalent representation of model (1):

$$ Y_i \sim N(W_i^{\mathrm T}\alpha_v + p_n^{-1/2}X_i^{\mathrm T}\beta_v,\; \sigma^2), \qquad \beta(s_j) = \sigma_a\, g_\lambda\{\tilde\beta(s_j)\}, \tag{8} $$

where $\tilde\beta_v \sim N\{0, \tilde K(M - \vartheta A)^{-1}\tilde K^{\mathrm T}\}$. After standardization, the prior variance of each $\tilde\beta(s_j)$ is one, and therefore the prior probability that $\beta(s_j)$ is nonzero is $2\Phi(-\lambda)$ for all $j$. This endows each parameter with a distinct interpretation: $\sigma_a$ controls the scale of the nonzero coefficients, $\lambda$ controls the prior degree of sparsity, and $\vartheta$ controls spatial dependence. With an additional set of conditions, we can show that the model representation (8) has large prior support as $L \to \infty$, thus leading to posterior consistency and selection consistency. The Supplementary Material contains more details.
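A sketch of the conditionally autoregressive prior (7) and the kernel standardization behind (8), continuing the kernel-basis sketch above, follows; the value of ϑ is an illustrative assumption.

```r
# A sketch of the CAR prior (7) on the knot array and the standardization
# behind (8); continues the kernel matrix K built above, vartheta is assumed.
gd <- c(15, 15)                      # the 225 knots form a 15 x 15 array
L <- prod(gd)
idx <- expand.grid(r = 1:gd[1], c = 1:gd[2])
A <- 1 * outer(1:L, 1:L, function(k, l)
  abs(idx$r[k] - idx$r[l]) + abs(idx$c[k] - idx$c[l]) == 1)  # adjacency matrix
M <- diag(rowSums(A))                # diagonal matrix of neighbor counts n_l

vartheta <- 0.95
Sigma_a <- solve(M - vartheta * A)   # a ~ N{0, sigma_a^2 (M - vartheta A)^{-1}}

# Standardize the rows of K so each beta_tilde(s_j) has prior variance one.
w <- sqrt(diag(K %*% Sigma_a %*% t(K)))
K_tilde <- K / w                     # element (j, l) is K(s_j - t_l) / w_j
```

After this step, diag(K_tilde %*% Sigma_a %*% t(K_tilde)) equals one, matching the standardization that makes the prior inclusion probability $2\Phi(-\lambda)$ constant across locations.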

In practice, we normalize the response and covariates, and then select the priors $\alpha_v \sim N(0, 10^2 I_q)$, $\sigma^2 \sim \mathrm{InvGamma}(0.1, 0.1)$, $\sigma_a \sim \mathrm{HalfNormal}(0, 1)$, $\vartheta \sim \mathrm{Beta}(10, 1)$ and $\lambda \sim \mathrm{Uniform}(0, 5)$. Following Banerjee et al. (2004), we use a beta prior for $\vartheta$ with mean near one, because only values near one provide appreciable spatial dependence. Finally, although our asymptotic results suggest that any $\lambda > 0$ will give posterior consistency, in finite samples posterior inference may be sensitive to the threshold, so we place a prior on $\lambda$ rather than fixing its value.

4.2. Markov chain Monte Carlo algorithm

For fully Bayesian inference on model (1), we sample from the posterior distribution using Metropolis–Hastings within Gibbs sampling. The parameters $\alpha_v$, $\sigma^2$ and $\sigma_a^2$ have conjugate full conditional distributions and are updated using Gibbs sampling. The spatial dependence parameter $\vartheta$ is updated using Metropolis–Hastings sampling with a beta candidate distribution whose mean is the current value and whose standard deviation is tuned to give an acceptance rate of around 0.4. The threshold $\lambda$ is updated using Metropolis sampling with a random-walk Gaussian candidate distribution whose standard deviation is tuned to give an acceptance probability of around 0.4. The Metropolis update for $a_l$ uses the prior full conditional distribution in (7) as the candidate distribution, which gives a high acceptance rate and thus good mixing without tuning.
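To illustrate why the update for $a_l$ needs no tuning, the following sketch shows a single coefficient update; log_lik is a hypothetical function (not from the paper) returning the log-likelihood of model (8) as a function of the coefficient vector.

```r
# A sketch of the Metropolis update for one kernel coefficient a_l: proposing
# from the CAR full conditional (7) makes the prior terms cancel, so the
# acceptance ratio reduces to the likelihood ratio. log_lik() is a
# hypothetical function returning the log-likelihood of model (8) given a.
update_a_l <- function(l, a, vartheta, sigma_a, A, log_lik) {
  nl <- sum(A[l, ])
  a_prop <- a
  a_prop[l] <- rnorm(1, mean = vartheta * sum(a[A[l, ] == 1]) / nl,
                     sd = sigma_a / sqrt(nl))
  if (log(runif(1)) < log_lik(a_prop) - log_lik(a)) a_prop else a
}
```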

To make posterior inference for the probit regression model (6), we slightly modify the above algorithm by introducing an auxiliary continuous variable $Y_i^*$ for each response variable $Y_i$. We assume that $Y_i = I(Y_i^* > 0)$ and that $Y_i^*$ follows (1) with $\sigma^2 = 1$. The full conditional of $Y_i^*$ is a truncated normal distribution, which is straightforward to sample as a Gibbs step in the posterior computation. The updating schemes for the other parameters remain the same. For other types of non-Gaussian response models, we can use Metropolis–Hastings sampling directly by modifying the likelihood function accordingly.
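A minimal sketch of this data-augmentation step, using the inverse-CDF method to sample from the truncated normal full conditional, is given below; eta denotes the current linear predictor and is an assumed input.

```r
# A sketch of the data-augmentation step for the probit model (6): Y* is drawn
# from N(eta, 1) truncated to (0, Inf) when Y = 1 and to (-Inf, 0] when Y = 0,
# using the inverse-CDF method.
sample_ystar <- function(y, eta) {
  F0 <- pnorm(0, mean = eta, sd = 1)   # normal CDF at the truncation point
  lo <- ifelse(y == 1, F0, 0)
  hi <- ifelse(y == 1, 1, F0)
  qnorm(lo + runif(length(y)) * (hi - lo), mean = eta, sd = 1)
}
```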

5. SIMULATION STUDY

5.1. Data generation

In this section we conduct a simulation study to compare the proposed methods with other popular methods for scalar-on-image regression. For each simulated observation, we generate a two-dimensional image $X_i$ on the $\{1,\dots,m\} \times \{1,\dots,m\}$ grid with $m = 30$. The covariates are generated under two covariance structures: exponential, and shared structure with the signal. The exponential covariates are Gaussian with mean $E(X_{i,j}) = 0$ and $\mathrm{cov}(X_{i,j}, X_{i,l}) = \exp(-d_{j,l}/\vartheta_x)$, where $d_{j,l}$ is the distance between locations $j$ and $l$ and $\vartheta_x$ controls the range of spatial dependence. The covariates generated with shared structure with $\beta_v$ are $X_i = \tilde X_i/2 + e_i\beta_v$, where $\tilde X_i$ is Gaussian with exponential covariance with $\vartheta_x = 3$ and $e_i \sim N(0, \tau^2)$. The continuous response is then generated as $Y_i \sim N(X_i^{\mathrm T}\beta_v, \sigma^2)$, and the binary response as $Y_i \sim \mathrm{Bernoulli}(\pi_i)$ with $\Phi^{-1}(\pi_i) = X_i^{\mathrm T}\beta_v$. Both $X_i$ and $Y_i$ are independent across $i = 1,\dots,n$. We consider three true $\beta_v$ images: the two sparse images plotted in Fig. 2, Five peaks and Triangle, and the dense image Waves with $\beta(s) = \{\cos(6\pi s_1/m) + \cos(6\pi s_2/m)\}/5$. We also vary the sample size $n \in \{100, 250\}$, the spatial correlation $\vartheta_x \in \{3, 6\}$ and the error standard deviation $\sigma \in \{2, 5\}$. For all combinations of these parameters considered, we generate $S = 100$ datasets. The binary data are simulated with $n = 250$ and exponential covariance with $\vartheta_x = 3$.
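A minimal R sketch of the exponential-covariance data generation follows; the seed is arbitrary and the toy disc-shaped signal merely stands in for the Fig. 2 images, which are not reproduced here.

```r
# A sketch of the covariate and response generation under the exponential
# covariance; the toy sparse signal below is an illustrative assumption.
set.seed(2)
m <- 30; n <- 100; vartheta_x <- 3; sigma <- 5
s <- as.matrix(expand.grid(1:m, 1:m))             # p = 900 pixel locations
Sigma_x <- exp(-as.matrix(dist(s)) / vartheta_x)  # exponential covariance
R <- chol(Sigma_x)                                # Sigma_x = t(R) %*% R
X <- matrix(rnorm(n * m^2), n, m^2) %*% R         # n images, one per row

beta_v <- 0.1 * (sqrt((s[, 1] - 10)^2 + (s[, 2] - 10)^2) < 5)  # toy sparse signal
Y <- drop(X %*% beta_v) + rnorm(n, 0, sigma)      # continuous responses
```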

Fig. 2. True $\beta_v$ images used in the simulation study.

5.2. Methods

We fit our model with an $m/2 \times m/2$ equally-spaced grid of knots covering $\{1,\dots,m\} \times \{1,\dots,m\}$, with bandwidth $\sigma_h$ set to the minimum distance between knots. We fit the model both with $\lambda > 0$, where the prior is a soft-thresholded Gaussian process with sparsity, and with $\lambda = 0$, in which case the prior is a Gaussian process without sparsity. For both models, we run the proposed Markov chain Monte Carlo algorithm for 50,000 iterations with 10,000 burn-in, and compute the posterior mean of $\beta_v$. For the sparse model, we also compute the posterior probability of a nonzero $\beta(s)$. We compare our method with the lasso (Tibshirani, 1996) and fused lasso (Tibshirani et al., 2005; Tibshirani & Taylor, 2011) penalized regression estimates

$$ \hat\beta_{Lv} = \arg\min_{\beta_v}\Big\{(Y - X\beta_v)^{\mathrm T}(Y - X\beta_v) + \tilde\lambda\sum_j |\beta(s_j)|\Big\}, $$
$$ \hat\beta_{FLv} = \arg\min_{\beta_v}\Big\{(Y - X\beta_v)^{\mathrm T}(Y - X\beta_v) + \tilde\lambda\sum_{j \sim k} |\beta(s_j) - \beta(s_k)| + \tilde\gamma\tilde\lambda\sum_j |\beta(s_j)|\Big\}. \tag{9} $$

The lasso estimate $\hat\beta_{Lv}$ is computed using the lars package (Hastie & Efron, 2013) in R (R Core Team, 2017), with the tuning parameter $\tilde\lambda$ selected using the Bayesian information criterion. The fused lasso estimate $\hat\beta_{FLv}$ is computed using the genlasso package (Arnold & Tibshirani, 2014) in R. The fused lasso has two tuning parameters, $\tilde\gamma$ and $\tilde\lambda$. For computational reasons, we search only over $\tilde\gamma \in \{1/5, 1, 5\}$; for each $\tilde\gamma$, $\tilde\lambda$ is selected using the Bayesian information criterion. The Supplementary Material gives results for each $\tilde\gamma$; in the main text we report only the results for the $\tilde\gamma$ with the most precise estimates.
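A sketch of BIC-based tuning along the lasso path with lars is shown below, using X and Y from the simulation sketch above; the RSS-based BIC with degrees of freedom equal to the number of nonzero coefficients is an assumed convention, not taken from the paper.

```r
# A sketch of BIC selection along the lasso path with the lars package.
library(lars)
fit <- lars(X, Y, type = "lasso")
B <- coef(fit)                                 # steps x p coefficient path
Xc <- scale(X, center = TRUE, scale = FALSE)   # lars centers internally
fits <- mean(Y) + Xc %*% t(B)                  # fitted values at each step
rss <- colSums((Y - fits)^2)
df <- rowSums(B != 0)                          # nonzero coefficients per step
bic <- n * log(rss / n) + log(n) * df
beta_lasso <- B[which.min(bic), ]              # BIC-selected lasso estimate
```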

We also consider a functional principal component analysis-based alternative. We smooth each image using the technique of Xiao et al. (2013), implemented in the fbps function in R's refund package (Crainiceanu et al., 2014), compute the eigendecomposition of the sample covariance of the smoothed images, and then perform principal components regression using the lasso penalty tuned via the Bayesian information criterion. The Supplementary Material gives results using the leading eigenvectors that explain 80%, 90% and 95% of the variation in the sample images; in the main text we report only the results for the value with the most precise estimates.
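For intuition, a stripped-down sketch of this comparator follows; for brevity, the fbps pre-smoothing step is replaced by the raw images and ordinary least squares stands in for the lasso on the scores, so this is a simplification of the method described above.

```r
# A sketch of the principal components regression comparator; raw images
# replace the fbps-smoothed images, and OLS replaces the lasso on the scores.
pc <- prcomp(X, center = TRUE)
nv <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= 0.90)[1]   # 90% cutoff
Z <- pc$x[, 1:nv, drop = FALSE]            # leading principal component scores
fit_pcr <- lm(Y ~ Z)
beta_fpca <- pc$rotation[, 1:nv, drop = FALSE] %*% coef(fit_pcr)[-1]
```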

Finally, we compare our approach with the Bayesian spatial model of Goldsmith et al. (2014), which sets $\beta(s_j) = \tilde\alpha_j\theta_j$, where $\tilde\alpha_j \in \{0, 1\}$ is the binary indicator that location $j$ is included in the model and $\theta_j \in \mathbb R$ is the regression coefficient given that the location is included. Both the $\tilde\alpha_j$ and the $\theta_j$ have spatial priors: the continuous components $\theta_j$ follow a conditionally autoregressive prior, and the binary components $\tilde\alpha_j$ follow an Ising, or autologistic, prior (Gelfand et al., 2010) with full conditional distributions

$$ \operatorname{logit} \Pi(\tilde\alpha_j = 1 \mid \tilde\alpha_l,\, l \ne j) = a_0 + b_0 \sum_{l \sim j} \tilde\alpha_l. \tag{10} $$

Estimating $a_0$ and $b_0$ is challenging because of the complexity of the Ising model (Møller et al., 2006); therefore, Goldsmith et al. (2014) recommended selecting $a_0$ and $b_0$ by cross-validation over $a_0 \in (-4, 0)$ and $b_0 \in (0, 2)$. Due to computational limitations we select values in the middle of these intervals and set $a_0 = -2$ and $b_0 = 1$. The posterior mean of $\beta(s)$ and the posterior probability of a nonzero $\beta(s)$ are approximated based on 5,000 Markov chain Monte Carlo samples after the first 1,000 are discarded as burn-in.
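To show the structure of (10), a sketch of a single-site Gibbs scan over the indicators is given below; the likelihood contribution is omitted for illustration, so the update shown draws from the prior full conditional only, and adj is a hypothetical list of neighbor indices.

```r
# A sketch of a single-site Gibbs scan over the Ising indicators using the
# full conditional (10); likelihood terms are omitted for illustration.
update_alpha <- function(alpha, adj, a0 = -2, b0 = 1) {
  for (j in seq_along(alpha)) {
    eta <- a0 + b0 * sum(alpha[adj[[j]]])
    alpha[j] <- rbinom(1, 1, plogis(eta))   # inverse-logit inclusion draw
  }
  alpha
}
```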

5.3. Results

Tables 1 and 2 give the mean squared error for βv estimation averaged over location, Type I error and power for detecting nonzero signals, along with the computing time. The soft-thresholded Gaussian process model gives the smallest mean squared error when the covariate has exponential correlation. Compared to the Gaussian process model, adding thresholding reduces mean squared error by roughly 50% in many cases. As expected, the functional principal component analysis method gives the smallest mean squared error in the shared-structure scenarios where the covariates are generated to have a similar spatial pattern as the true signal. Even in this case, the proposed method outperforms other methods which do not exploit this shared structure. Under the dense waves signal the non-sparse Gaussian process model gives the smallest mean squared error, but again the proposed method remains competitive.

Table 1.

Simulation study results for linear regression models. Methods are compared in terms of mean squared error for βv, and Type I error and power for feature detection. The scenarios vary by the true β0v (Fig. 2), sample size n, similarity between covariates and true signal determined by τ, error standard deviation σ, and spatial correlation range of the covariates ϑx. Results are reported as the mean with the standard deviation over the 100 simulated datasets

Mean squared error for βv, multiplied by 104
Signal n τ σ ϑx Lasso Fused lasso FPCA Ising GP STGP
Five peaks 100 0 5 3 326(5) 22(0) 36(0) 46(1) 27(0) 13(0)
100 0 5 6 553(9) 22(0) 33(0) 44(1) 28(0) 13(0)
100 0 2 3 102(1) 11(0) 24(0) 26(0) 15(0) 4(0)
250 0 5 3 674(9) 13(0) 27(0) 52(1) 17(0) 6(0)
Triangle 100 0 5 3 289(4) 9(0) 17(0) 30(1) 19(0) 8(0)
100 0 5 6 515(9) 8(0) 15(0) 28(1) 19(0) 8(0)
100 0 2 3 73(1) 5(0) 12(0) 14(0) 10(0) 4(0)
250 0 5 3 650(9) 6(0) 14(0) 34(0) 12(0) 6(0)
100 2 5 3 1011(17) 7(0) 10(0) 27(1) 33(1) 13(1)
100 4 5 3 1018(17) 6(0) 4(1) 32(1) 34(1) 14(1)
Waves 100 0 5 3 1260(13) 250(5) 188(7) 419(3) 48(1) 109(10)
100 0 5 6 1639(17) 233(4) 126(3) 402(4) 51(1) 128(12)
Type I error, % Power, %
Signal n τ σ ϑx Lasso Fused lasso Ising STGP Lasso Fused lasso Ising STGP
Five peaks 100 0 5 3 10(0) 14(1) 0(0) 4(1) 17(0) 55(2) 4(0) 51(1)
100 0 5 6 10(0) 37(1) 0(0) 7(1) 15(0) 80(1) 5(0) 58(1)
100 0 2 3 79(0) 24(1) 0(0) 4(1) 28(0) 84(1) 4(0) 82(1)
250 0 5 3 27(0) 19(1) 0(0) 4(1) 32(0) 77(1) 10(0) 71(1)
Triangle 100 0 5 3 10(0) 5(0) 0(0) 4(1) 23(1) 85(1) 9(0) 87(1)
100 0 5 6 11(0) 8(1) 0(0) 4(1) 19(1) 91(1) 9(0) 86(1)
100 0 2 3 9(0) 7(1) 0(0) 4(0) 42(1) 95(0) 5(0) 98(0)
250 0 5 3 27(0) 5(0) 0(0) 4(0) 35(1) 92(1) 16(0) 96(1)
100 2 5 3 11(0) 7(1) 0(0) 1(0) 17(0) 86(1) 8(0) 70(1)
100 4 5 3 11(0) 2(1) 0(0) 3(1) 19(1) 84(1) 12(0) 73(1)
Computing time, minutes
Signal n τ σ ϑx Lasso Fused lasso FPCA Ising GP STGP
Five peaks 100 0 5 3 0.02 16.77 5.40 27.61 4.81 11.28

Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of the variation; Ising: Bayesian spatial variable selection with Ising priors (Goldsmith et al., 2014); GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.

Table 2.

Simulation study results for binary regression models. Methods are compared in terms of mean squared error for βv, and Type I error and power for feature detection. The scenarios vary by the true β0v (see Fig. 2). Results are reported as the mean with the standard deviation over the 100 simulated datasets

Mean squared error for βv, multiplied by 104
Signal Lasso FPCA, 80% FPCA, 90% FPCA, 95% GP STGP
Five peaks 11(2) 3(0) 6(2) 21(16) 1(0) 1(0)
Triangle 4(1) 1(0) 3(1) 8(4) 1(0) 0(0)
Waves 80(5) 18(4) 52(16) 38(8) 107(46) 191(138)
Type I error, % Power, %
Signal Lasso STGP Lasso STGP
Five peaks 3(0) 7(1) 15(0) 62(1)
Triangle 1(0) 4(0) 24(1) 91(1)

Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of the variation; GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.

For variable selection, we compare the proposed method only with the fused lasso and the Ising model, because the lasso does not incorporate spatial locations and the other methods do not perform variable selection directly. The results show that the fused lasso has much larger Type I errors in all cases, while the Ising model has very small power to detect the signal in each case. The proposed method is much more efficient than both the fused lasso and the Ising model for variable selection. In terms of computing time, the proposed method is comparable to the fused lasso and faster than the Ising model.

6. ANALYSIS OF EEG DATA

Our motivating application is the study of the relationship between electrical brain activity, as measured through multichannel EEG signals, and genetic predisposition to alcoholism. EEG is a medical imaging technique that records the electrical activity of the brain by measuring the current flows produced when neurons are activated. The study comprises a total of 122 subjects: 77 alcoholic subjects and 45 non-alcoholic controls. For each subject, 64 electrodes were placed on the scalp and the EEG was recorded from each electrode at a frequency of 256 Hz. The electrodes were located at standard sites, following the standard electrode position nomenclature of the American Electroencephalographic Society (Sharbrough et al., 1991). The subjects were presented with 120 trials under several settings involving one or two stimuli. We consider the multichannel average EEG across the 120 trials corresponding to a single stimulus. The dataset is publicly available from the University of California, Irvine Knowledge Discovery in Databases archive: https://kdd.ics.uci.edu/databases/eeg/eeg.data.html.

These data have been previously analyzed by Li et al. (2010), Hung & Wang (2013) and Zhou & Li (2014). However, these analyses ignore the spatial locations of the electrodes on the scalp and instead smooth based on the electrodes' identification numbers, which range from 1 to 64 and are assigned arbitrarily relative to the electrodes' positions on the scalp. Our goal in this analysis is to detect the regions of the brain that are most predictive of alcoholism status; thus, accounting for the actual positions of the electrodes is a key component of our approach. In the absence of more sophisticated means of determining the electrodes' positions on the scalp, we consider a lattice design and assign to each electrode a two-dimensional location that closely matches the electrode's standard position. Using the labels of the electrodes, we were able to identify only 60 of them. As a result, our analysis is based on the multichannel EEG from these 60 electrodes.

In accordance with the notation employed earlier, let $Y_i$ be the alcoholism status indicator, with $Y_i = 1$ if the $i$th subject is alcoholic and 0 otherwise. Furthermore, let $X_i = \{X_i(s_j; t) : s_j \in \mathbb R^2,\ j = 1,\dots,60,\ t = 1,\dots,256\}$ be the EEG image data for the $i$th subject, indexed by a two-dimensional spatial location on the matching lattice design, $s_j$, and a one-dimensional index for time, $t$.

We use a probit model to relate the alcoholism status to the multichannel EEG image: $Y_i \mid X_i, \beta \sim \mathrm{Bernoulli}(\pi_i)$ and $\Phi^{-1}(\pi_i) = \sum_{j=1}^{60}\sum_{k=1}^{256} X_i(s_j, t_k)\,\beta(s_j, t_k)$. The spatially-temporally varying coefficient function $\beta$ quantifies the effect of the image on the response over time and is modeled using the soft-thresholded Gaussian process on the joint spatial and temporal domain. We select a 5 × 5 square grid of spatial knots and 64 temporal knots, for a total of 1,600 three-dimensional knots. We initially fitted a conditionally autoregressive model with different dependence parameters $\vartheta$ for spatial and temporal neighbors (Reich et al., 2007), but found that convergence was slow and that the estimates of the spatial and temporal dependence were similar. We therefore elected to use the same dependence parameter for all neighbors.

We evaluate the predictive performance of the proposed model using cross validation. We first calculate the posterior predictive probability that each test-set response is one, and then apply the standard receiver operating characteristic curve, which evaluates classification accuracy over thresholds on the predictive probabilities. Figure 3 shows the receiver operating characteristic curves under leave-one-out cross validation. The results are compared with those of the lasso, functional principal component analysis, and the Gaussian process approach, that is, the soft-thresholded model with thresholding parameter $\lambda = 0$. To facilitate computation for these methods, we thin the time points by two, leaving 128 time points. While no model is uniformly superior, the area under the curve for our approach is the largest among the alternatives considered.
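A sketch of the receiver operating characteristic curve and AUC computed from such predictive probabilities is given below; y (the 0/1 status) and p_hat (the leave-one-out posterior predictive probabilities) are assumed already available.

```r
# A sketch of the ROC curve and AUC from leave-one-out predictive
# probabilities; y and p_hat are assumed inputs.
roc_points <- function(y, p_hat) {
  cuts <- sort(unique(p_hat), decreasing = TRUE)
  t(sapply(cuts, function(th) c(fpr = mean(p_hat[y == 0] >= th),
                                tpr = mean(p_hat[y == 1] >= th))))
}

auc <- function(y, p_hat) {          # Wilcoxon form of the area under the curve
  gt <- outer(p_hat[y == 1], p_hat[y == 0], ">")
  eq <- outer(p_hat[y == 1], p_hat[y == 0], "==")
  mean(gt + 0.5 * eq)
}
```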

Fig. 3. Receiver operating characteristic curves, with areas under the curve (AUC), for the leave-one-out cross validation of the EEG data by six methods: lasso (black solid, AUC = 0.789); functional principal component analysis using the leading eigenvectors that explain 80% (red solid, AUC = 0.775), 90% (red dashes, AUC = 0.789) and 95% (red dots, AUC = 0.777) of the variation; the Gaussian process approach (green solid, AUC = 0.770); and the soft-thresholded Gaussian process approach (navy solid, AUC = 0.818).

The differences between the models are further examined through the estimated β functions plotted in Fig. 4, where we ignore the spatial locations of the electrodes and plot them by identification number. The lasso solution is nonzero at a single spatiotemporal location, while the functional principal component analysis and Gaussian process methods lead to non-sparse, and thus uninterpretable, β estimates. In contrast, the soft-thresholded Gaussian process estimate is near zero for the vast majority of locations and isolates a subset of the electrodes near time point 86 as the most powerful predictors of alcoholism.

Fig. 4. Estimated spatial-temporal effects of the EEG image predictors by four methods: lasso, functional principal component analysis, Gaussian process and soft-thresholded Gaussian process. The Gaussian process and soft-thresholded Gaussian process estimates are posterior means.

Our analysis indicates that EEG measurements at time $t = 86$, roughly 0.336 seconds after the stimulus, are predictive of alcoholism status. This observation is further supported by the plot of the posterior probability of a nonzero $\beta(s_j, t)$ in Fig. 5a. This suggests a delayed reaction to the stimulus, although the finding has to be confirmed with the investigators. To gain more insight into these findings, Figs. 5b–5d focus on particular time points and display the posterior mean and the posterior probability of a nonzero signal across the electrode locations. They indicate that the right occipital/lateral region is the most predictive of alcoholism status.

Fig. 5. Summary of the analysis of the EEG data by the soft-thresholded Gaussian process. Panel (a) plots the posterior probability of a nonzero β(s, t); each electrode is a line plotted over time t. The remaining panels map either the posterior probability of a nonzero β(s, t) or the posterior mean of β(s, t) at individual time points.

7. DISCUSSION

The proposed method suggests several future directions that we plan to pursue. First, we plan to develop a more efficient posterior computation algorithm for the analysis of voxel-level functional magnetic resonance imaging (fMRI) data, which typically contain around 180,000 voxels per subject. Any fast and scalable Gaussian process approximation can potentially be applied to the soft-thresholded Gaussian process; for example, the recent nearest-neighbor Gaussian process approach of Datta et al. (2016) can be applied to our model. In addition, it is of great interest to perform joint analyses of datasets involving multiple imaging modalities, such as fMRI, diffusion tensor imaging and structural MRI. It is very challenging to model the dependence between multiple imaging modalities over space and to select the interactions between multi-modality image predictors in scalar-on-image regression. An extension of the soft-thresholded Gaussian process might solve this problem: the basic idea is to introduce hierarchical latent Gaussian processes and different thresholding parameters for different modalities, leading to a hierarchical soft-thresholded Gaussian process as the prior model for the interaction effects.

Supplementary Material


ACKNOWLEDGEMENT

This research was supported partially by grants from the National Institutes of Health and the National Science Foundation. The authors would like to thank the Editor, the Associate Editor and anonymous reviewers for suggestions that led to an improved manuscript.

Contributor Information

Jian Kang, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A. jiankang@umich.edu.

Brian J. Reich, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. bjreich@ncsu.edu

Ana-Maria Staicu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. astaicu@ncsu.edu.

REFERENCES

  1. Arnold TB & Tibshirani RJ (2014). genlasso: Path algorithm for generalized lasso problems. R package.
  2. Banerjee S, Carlin BP & Gelfand AE (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman & Hall/CRC.
  3. Boehm Vock LF, Reich BJ, Fuentes M & Dominici F (2014). Spatial variable selection methods for investigating acute health effects of fine particulate matter components. Biometrics 71, 167–177.
  4. Choudhuri N, Ghosal S & Roy A (2004). Bayesian estimation of the spectral density of a time series. J. Am. Statist. Assoc. 99, 1050–1059.
  5. Crainiceanu C, Reiss P, Goldsmith J, Huang L, Huo L & Scheipl F (2014). refund: Regression with functional data. R package.
  6. Cressie N (1993). Statistics for Spatial Data. New York: Wiley.
  7. Datta A, Banerjee S, Finley AO & Gelfand AE (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Am. Statist. Assoc. 111, 800–812.
  8. Gelfand AE, Diggle PJ, Fuentes M & Guttorp P (2010). Handbook of Spatial Statistics. New York: Chapman & Hall/CRC.
  9. Ghosal S & Roy A (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Statist. 34, 2413–2429.
  10. Goldsmith J, Huang L & Crainiceanu CM (2014). Smooth scalar-on-image regression via spatial Bayesian variable selection. J. Comp. Graph. Stat. 23, 46–64.
  11. Hastie TJ & Efron B (2013). lars: Least angle regression, lasso and forward stagewise. R package.
  12. Higdon DM, Swall J & Kern J (1999). Non-stationary spatial modeling. In Bayesian Statistics 6: Proceedings of the Sixth Valencia Meeting, Bernardo J, Berger J, Dawid A & Smith A, eds. Oxford: Clarendon Press, pp. 761–768.
  13. Hung H & Wang C-C (2013). Matrix variate logistic regression model with application to EEG data. Biostatistics 14, 189–202.
  14. Li B, Kim MK & Altman N (2010). On dimension folding of matrix- or array-valued statistical objects. Ann. Statist. 38, 1094–1121.
  15. Li F, Zhang T, Wang Q, Gonzalez M, Maresh E & Coan J (2015). Spatial Bayesian variable selection and grouping in high-dimensional scalar-on-image regressions. Ann. Appl. Statist. 9, 687–713.
  16. Møller J, Pettitt A, Berthelsen K & Reeves R (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93, 451–458.
  17. Nelsen RB (1999). An Introduction to Copulas. New York: Springer-Verlag.
  18. Nychka D, Bandyopadhyay S, Hammerling DM, Lindgren F & Sain S (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. J. Comp. Graph. Stat. 24, 579–599.
  19. R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  20. Reich BJ, Hodges JS & Carlin BP (2007). Spatial analyses of periodontal data using conditionally autoregressive priors having two classes of neighbor relations. J. Am. Statist. Assoc. 102, 44–55.
  21. Reiss PT, Huo L, Zhao Y, Kelly C & Ogden RT (2015). Wavelet-domain regression and predictive inference in psychiatric neuroimaging. Ann. Appl. Statist. 9, 1076–1101.
  22. Reiss PT & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics 66, 61–69.
  23. Sharbrough F, Chatrian G, Lesser R, Luders H, Nuwer M & Picton T (1991). American Electroencephalographic Society guidelines for standard electrode position nomenclature. J. Clin. Neurophysiol. 8, 200–202.
  24. Smith M & Fahrmeir L (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. J. Am. Statist. Assoc. 102, 417–431.
  25. Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288.
  26. Tibshirani R, Saunders M, Rosset S, Zhu J & Knight K (2005). Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B 67, 91–108.
  27. Tibshirani RJ & Taylor J (2011). The solution path of the generalized lasso. Ann. Statist. 39, 1335–1371.
  28. Tokdar ST & Ghosh JK (2007). Posterior consistency of logistic Gaussian process priors in density estimation. J. Stat. Plan. Inference 137, 34–42.
  29. Wang X, Zhu H & for the Alzheimer's Disease Neuroimaging Initiative (2016). Generalized scalar-on-image regression models via total variation. J. Am. Statist. Assoc., doi: 10.1080/01621459.2016.1194846.
  30. Xiao L, Li Y & Ruppert D (2013). Fast bivariate P-splines: the sandwich smoother. J. R. Statist. Soc. B 75, 577–599.
  31. Zhou H & Li L (2014). Regularized matrix regression. J. R. Statist. Soc. B 76, 463–483.
