
Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process

Jian Kang, Brian J. Reich and Ana-Maria Staicu

SUMMARY

This work concerns spatial variable selection for scalar-on-image regression. We propose a new class of Bayesian nonparametric models and develop an efficient posterior computational algorithm. The proposed soft-thresholded Gaussian process provides large prior support over the class of piecewise-smooth, sparse, and continuous spatially-varying regression coefficient functions. In addition, under mild regularity conditions the soft-thresholded Gaussian process prior leads to posterior consistency for parameter estimation and variable selection in scalar-on-image regression, even when the number of predictors is larger than the sample size. The proposed method is compared to alternatives via simulation and applied to an electroencephalography study of alcoholism.

Keywords: Electroencephalography, Gaussian processes, Posterior consistency, Spatial variable selection

1. INTRODUCTION

Scalar-on-image regression has attracted considerable attention recently in both the frequentist and Bayesian literature. This problem is challenging for several reasons: the predictor is a two-dimensional or three-dimensional image where the number of pixels or voxels is often larger than the sample size; the observed predictors may be contaminated with noise; the true signal may exhibit complex spatial structure; and most components of the predictor may have no effect on the response, and when they do have an effect it may vary smoothly.

Regularized regression techniques are often needed when the number of predictors is much larger than the sample size. The lasso (Tibshirani, 1996) is a popular method for variable selection that employs a penalty based on the sum of the absolute values of the regression coefficients. However, most penalized approaches do not accommodate predictors with ordered components, such as an image predictor. One exception is the fused lasso (Tibshirani et al., 2005), which generalizes the lasso by penalizing both the coefficients and their successive differences and thus ensures both sparsity and smoothness of the estimated effect. To incorporate the spatial structure of the predictors, Reiss & Ogden (2010) extended functional principal component regression to handle image predictors by approximating the coefficient function using B-splines. However, this method is not sensitive to sparsity or sharp edges. Wang et al. (2016) proposed a penalty based on the total variation of the regression function that yields piecewise-smooth regression coefficients. That approach focuses primarily on prediction and does not quantify uncertainty in a way that permits statistical inference. Reiss et al. (2015) considered a wavelet expansion for the regression coefficients and conducted inference via hypothesis testing. Their approach requires that the image predictor have dimensions of equal size and, moreover, that the common size be a power of two. These strong assumptions are violated in our motivating application. Therefore, none of these methods is appropriate for detecting a piecewise-smooth, sparse, and continuous signal relating a scalar outcome to an image predictor.

Scalar-on-image regression has also been approached from a Bayesian viewpoint. Goldsmith et al. (2014) proposed to model the regression coefficients as the product of two latent spatial processes to capture both sparsity and spatial smoothness of the important regression coefficients. They used an Ising prior for the binary indicator that a voxel is predictive of the response, and a conditionally autoregressive prior to smooth the nonzero regression coefficients. Use of an Ising prior for binary indicators was first discussed by Smith & Fahrmeir (2007) in the context of high-dimensional spatial predictors and was recently employed by Li et al. (2015), who use a Dirichlet process prior for the nonzero coefficients. In both Li et al. (2015) and Goldsmith et al. (2014), sparsity and smoothness are controlled separately by two independent spatial processes, so the transitions from zero areas to neighboring nonzero areas may be abrupt, and computation is challenging because the Ising probability mass function does not have a simple closed form.

We propose an alternative approach to spatial variable selection in scalar-on-image regression by modeling the regression coefficients through soft-thresholding of a latent Gaussian process. The soft-thresholding function is well known for its relation to the lasso estimate when the design matrix is orthonormal (Tibshirani, 1996), and here we use it to specify a spatial prior with mass at zero. The idea is inspired by Boehm Vock et al. (2014), who considered Gaussian processes as a regularization technique for spatial variable selection; however, their approach does not assign prior probability mass at zero for the regression coefficients and is not designed for scalar-on-image regression. Unlike other Bayesian spatial models (Goldsmith et al., 2014; Li et al., 2015), the soft-thresholded Gaussian process ensures a gradual transition between the zero and nonzero effects of neighboring locations and provides large support over the class of spatially-varying regression coefficient functions that are piecewise smooth, sparse, and continuous. The soft-thresholded Gaussian process avoids the computational problems posed by the Ising prior, and it can be scaled to large datasets using a low-rank spatial model for the latent process. We show that the soft-thresholded Gaussian process prior leads to posterior consistency for both parameter estimation and variable selection under mild regularity conditions, even when the number of predictors is larger than the sample size.

The proposed method is introduced for a single image predictor and Gaussian responses, mainly for simplicity. Extensions to accommodate multiple predictors and non-Gaussian responses are relatively straightforward. However, the theoretical investigation of the procedure for non-Gaussian responses is not trivial, so we establish the theory for binary responses with the probit link function. The methods are applied to data from an electroencephalography (EEG) study of alcoholism; see http://kdd.ics.uci.edu/datasets/eeg/eeg.data.html. The objective of our analysis is to estimate the relationship between alcoholism and brain activity. EEG signals were recorded for both alcoholics and controls from 64 electrode channels on the subjects' scalps at 256 Hz for one second, leading to a high-dimensional predictor. The data have been previously described in Li et al. (2010) and Zhou & Li (2014). Previous literature has ignored the spatial structure of the electrodes shown in Fig. 1, which was recovered from the standard electrode position nomenclature described by Fig. 1 of https://www.acns.org/pdf/guidelines/Guideline-5.pdf. Our spatial analysis exploits the spatial configuration of the electrodes and reveals regions of the brain with activity that is predictive of alcoholism.

Fig. 1. Standard electrode position nomenclature for the 10–10 system.

2. MODEL

2.1. Scalar-on-image regression

Let $\mathbb{R}^m$ denote the $m$-dimensional real vector space. Suppose there are $n$ subjects in the dataset and the data for subject $i$ consist of a scalar response variable $Y_i$, a set of $p_n$ spatially-distributed image predictors, denoted $X_i = (X_{i,1},\dots,X_{i,p_n})^{\mathrm T} \in \mathbb{R}^{p_n}$, and other scalar covariates collected in $W_i = (W_{i,1},\dots,W_{i,q})^{\mathrm T} \in \mathbb{R}^q$. Assume that $\{W_i\}_{i=1}^n$ are fixed design covariates. Here $X_{i,j}$ denotes the image intensity value measured at location $s_j$, for $j = 1,\dots,p_n$. We assume that the set of locations $\mathcal S = \{s_j\}_{j=1}^{p_n}$ is a fixed subset of a compact closed region $\mathcal B \subset \mathbb{R}^d$. Let $N(\mu, \Sigma)$ denote a normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$, or variance in the one-dimensional case. We consider the following scalar-on-image regression model:

$$ (Y_i \mid W_i, X_i, \alpha_v, \beta, \sigma^2) \;\sim\; N\Big\{ \sum_{k=1}^{q} \alpha_k W_{i,k} + p_n^{-1/2} \sum_{j=1}^{p_n} \beta(s_j) X_{i,j},\; \sigma^2 \Big\}, \quad i = 1, \dots, n, \tag{1} $$

where $\alpha_v = (\alpha_1,\dots,\alpha_q)^{\mathrm T}$ quantifies the effect of $W_i$ and $\beta(\cdot)$ is a spatially-varying coefficient function defined on $\mathcal B$. In practice, the normalizing scalar $p_n^{-1/2}$ can be absorbed into the image predictors; its role is to rescale the total effect of the massive number of image predictors so that it is bounded away from infinity with large probability when $p_n$ is very large. Scientifically, in brain imaging studies the image predictors take values that measure body tissue contrast or neural activity at each spatial location, and the number of image predictors, $p_n$, is determined by the image resolution. Thus, the total effect of the image predictors reflects the total intensity of the brain signals, which should not increase to infinity as the image resolution increases. In model (1) the response is taken to be Gaussian and only one type of image predictor is included, although extensions of the modeling framework to non-Gaussian responses and multi-modality image predictors are straightforward.

2.2. Soft-thresholded Gaussian processes

To capture the characteristics of the image predictors and their effects on the response variable, the prior for $\beta(\cdot)$ should be sparse and spatial. That is, we assume that many locations have $\beta(s_j) = 0$, that the sites with nonzero coefficients cluster spatially, and that the coefficients vary smoothly within clusters of nonzero coefficients. To encode these desired properties into the prior, we represent $\beta(\cdot)$ as a transformation of a Gaussian process, $\beta(s) = g_\lambda\{\tilde\beta(s)\}$, where $g_\lambda$ is a transformation function dependent on the parameter $\lambda$ and $\tilde\beta(s)$ follows a Gaussian process prior. In this trans-kriging (Cressie, 1993) or Gaussian copula (Nelsen, 1999) model, the function $g_\lambda$ determines the marginal distribution of $\beta(s)$, while the covariance of the latent $\tilde\beta(s)$ determines the dependence structure of $\beta(s)$.

Spatial dependence is determined by the prior for $\tilde\beta(s)$. We assume that $\tilde\beta(\cdot)$ is a Gaussian process with mean zero and stationary covariance function $\mathrm{cov}\{\tilde\beta(s), \tilde\beta(s')\} = \kappa(s - s')$ for some covariance function $\kappa$. Although other transformations are possible (Boehm Vock et al., 2014), we select $g_\lambda$ to be the soft-thresholding function, which maps values of $\tilde\beta(s)$ near zero to exactly zero and thus gives a sparse prior. Let

$$ g_\lambda(x) = \begin{cases} 0, & |x| \le \lambda, \\ \operatorname{sgn}(x)\,(|x| - \lambda), & |x| > \lambda, \end{cases} \tag{2} $$

where $\operatorname{sgn}(x) = 1$ if $x > 0$, $\operatorname{sgn}(x) = -1$ if $x < 0$, and $\operatorname{sgn}(0) = 0$. The thresholding parameter $\lambda > 0$ determines the degree of sparsity. This soft-thresholded Gaussian process prior is denoted $\beta \sim \mathrm{STGP}(\lambda, \kappa)$.
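To make the prior concrete, the following R sketch draws a latent Gaussian process on a one-dimensional grid and applies the soft-thresholding map (2); the grid size, the squared-exponential kernel and its bandwidth are illustrative assumptions, not the authors' settings.

```r
# A minimal sketch of the soft-thresholding map (2) applied to one draw from a
# latent Gaussian process on a one-dimensional grid; the grid, kernel and
# bandwidth below are illustrative assumptions.
soft_threshold <- function(x, lambda) {
  sign(x) * pmax(abs(x) - lambda, 0)
}

set.seed(1)
s <- seq(0, 1, length.out = 200)                    # spatial grid
kappa <- exp(-as.matrix(dist(s))^2 / (2 * 0.1^2))   # covariance matrix
beta_tilde <- drop(t(chol(kappa + 1e-8 * diag(200))) %*% rnorm(200))
beta <- soft_threshold(beta_tilde, lambda = 0.5)    # exact zeros where |x| <= 0.5
```

The thresholded draw is exactly zero wherever the latent process lies in $[-\lambda, \lambda]$, yet remains continuous at the boundaries of the zero set, which is the qualitative behavior the prior is designed to encode.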

3. THEORETICAL PROPERTIES

3.1. Notation and definitions

We first introduce additional notation for the theoretical development and formal definitions of the class of spatially-varying coefficient functions under consideration. We assume that all the random variables and stochastic processes in this article are defined on the same probability space, denoted $(\Omega, \mathcal F, \Pi)$. Let $\mathbb Z_+^d = \{0, 1, \dots\}^d$ represent the space of $d$-dimensional vectors of non-negative integers. For any vector $v = (v_1,\dots,v_d)^{\mathrm T} \in \mathbb R^d$, let $\|v\|_p = (\sum_{l=1}^d |v_l|^p)^{1/p}$ be the $L_p$ norm for any $p \ge 1$, and let $\|v\|_\infty = \max_{l=1,\dots,d} |v_l|$ be the supremum norm. For any $x \in \mathbb R$, let $\lceil x \rceil$ be the smallest integer not smaller than $x$ and $\lfloor x \rfloor$ the largest integer not larger than $x$. Define the event indicator $I(A) = 1$ if event $A$ occurs and $I(A) = 0$ otherwise. For any $z = (z_1,\dots,z_d)^{\mathrm T} \in \mathbb Z_+^d$, define $z! = \prod_{l=1}^d z_l!$ and $v^z = \prod_{l=1}^d v_l^{z_l}$. For any real function $f$ on $\mathcal B$, let $\|f\|_p = \{\int_{\mathcal B} |f(s)|^p \, ds\}^{1/p}$ denote the $L_p$ norm for any $p \ge 1$ and $\|f\|_\infty = \sup_{s \in \mathcal B} |f(s)|$ the supremum norm.

Definition 1. Denote by $C^m(\mathcal B)$ the set of differentiable functions of order $m$ defined on $\mathcal B$, that is, functions $f(\cdot)$ such that $f(s)$ has partial derivatives

$$ D^\tau f(s) = \frac{\partial^{\|\tau\|_1} f}{\partial s_1^{\tau_1} \cdots \partial s_d^{\tau_d}}(s) = \sum_{\|\eta\|_1 + \|\tau\|_1 \le m} \frac{D^{\tau+\eta} f(t)}{\eta!}\,(s - t)^\eta + R_m(s, t), $$

where $\tau = (\tau_1,\dots,\tau_d)^{\mathrm T} \in \mathbb Z_+^d$, $\eta \in \mathbb Z_+^d$ and $t \in \mathcal B$, and, given any point $s_0$ of $\mathcal B$ and any $\epsilon > 0$, there is a $\delta > 0$ such that if $s$ and $t$ are any two points of $\mathcal B$ with $\|s - s_0\|_1 < \delta$ and $\|t - s_0\|_1 < \delta$, then $|R_m(s,t)| \le \|s - t\|_1^{m - \|\tau\|_1}\,\epsilon$. If $\|D^\tau f\|_\infty \le M < \infty$, then $|R_m(s,t)| \le M\|s - t\|_1^{m+1}/(m+1)!$.

Denote by $\bar R$ and $\partial R$ the closure and the boundary, respectively, of any set $R \subset \mathcal B$.

Definition 2. Define $\Theta = \{\beta(s) : s \in \mathcal B\}$ to be the collection of all spatially-varying coefficient functions that satisfy the following conditions. Assume there exist two disjoint non-empty open sets $R_1$ and $R_{-1}$ with $\bar R_1 \cap \bar R_{-1} = \emptyset$ such that:

(2.1) $\beta(\cdot)$ is smooth over $\bar R_1 \cup \bar R_{-1}$, that is, $\beta(s)\,I(s \in \bar R_1 \cup \bar R_{-1}) \in C^\rho(\bar R_1 \cup \bar R_{-1})$ with $\rho = \lceil d/2 \rceil$;

(2.2) $\beta(s) = 0$ for $s \in R_0$, $\beta(s) > 0$ for $s \in R_1$ and $\beta(s) < 0$ for $s \in R_{-1}$, where $R_0 = \mathcal B \setminus (R_1 \cup R_{-1})$ and $\partial R_0 \supseteq \partial(R_1 \cup R_{-1})$;

(2.3) $\beta(\cdot)$ is continuous over $\mathcal B$, that is, $\lim_{s \to s_0} \beta(s) = \beta(s_0)$ for all $s_0 \in \mathcal B$.

Simply put, $\Theta$ is the collection of all piecewise-smooth, sparse and continuous functions defined on $\mathcal B$.

3.2. Large support

A desired property for a Bayesian nonparametric model is prior support over a large class of functions. In this section we show that the soft-thresholded Gaussian process has large support over $\Theta$. We begin with two appealing properties of the soft-thresholding function. All technical conditions are listed in the Appendix as Conditions A1–A5.

Lemma 1. The soft-thresholding function $g_\lambda(x)$ is Lipschitz continuous for any $\lambda > 0$; that is, for all $x_1, x_2$, $|g_\lambda(x_1) - g_\lambda(x_2)| \le |x_1 - x_2|$.

Lemma 2. For any function $\beta_0 \in \Theta$ and any threshold parameter $\lambda_0 > 0$, there exists a smooth function $\tilde\beta_0(s) \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_0}\{\tilde\beta_0(s)\}$.

Lemma 1 is proved directly by verifying the definition. The proof of Lemma 2 is not trivial: it requires a detailed construction of the smooth function $\tilde\beta_0(s)$. See the Appendix for details.

Theorem 1. For any function $\beta_0 \in \Theta$ and $\varepsilon > 0$, the soft-thresholded Gaussian process prior $\beta \sim \mathrm{STGP}(\lambda_0, \kappa)$ satisfies $\Pi(\|\beta - \beta_0\|_\infty < \varepsilon) > 0$ for any $\lambda_0 > 0$ and $\kappa$ that satisfy Condition A5.

Proof. By Lemma 2, for any threshold parameter $\lambda_0 > 0$ there exists a smooth function $\tilde\beta_0(s) \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_0}\{\tilde\beta_0(s)\}$. Since $\beta \sim \mathrm{STGP}(\lambda_0, \kappa)$, we have $\beta(s) = g_{\lambda_0}\{\tilde\beta(s)\}$ with $\tilde\beta(s) \sim \mathrm{GP}(0, \kappa)$. By Lemma 1,

Π{supsB|β(s)β0(s)|<ε}=Π[supsB|gλ0{β˜(s)}gλ0{β˜0(s)}|<ε]Π{supsB|β˜(s)β˜0(s)|<ε}.

By Condition A5 and Theorem 4.5 of Tokdar & Ghosh (2007), $\tilde\beta_0(\cdot)$ is in the reproducing kernel Hilbert space of $\kappa$; then, by Theorem 4 of Ghosal & Roy (2006), $\Pi\{\sup_{s \in \mathcal B} |\tilde\beta(s) - \tilde\beta_0(s)| < \varepsilon\} > 0$, which completes the proof. □

Theorem 1 implies that there is always positive probability that the soft-thresholded Gaussian process concentrates in an arbitrarily small neighborhood of any spatially-varying coefficient function with the piecewise smoothness, sparsity and continuity properties. According to Lemma 2, for any positive $\lambda_1 \ne \lambda_2$ there exist $\tilde\beta_1, \tilde\beta_2 \in C^\rho(\mathcal B)$ such that $\beta_0(s) = g_{\lambda_1}\{\tilde\beta_1(s)\} = g_{\lambda_2}\{\tilde\beta_2(s)\}$. This implies that the thresholding parameter $\lambda_0$ and the latent smooth function $\tilde\beta_0$ are not jointly identifiable, but $\beta_0$ itself is identifiable, as we show by establishing the posterior consistency of parameter estimation.

3.3. Posterior consistency

For $i = 1,\dots,n$, given the image predictor $X_i$ on a set of spatial locations $\mathcal S$ and other covariates $W_i$, suppose the response $Y_i$ is generated from the scalar-on-image regression model (1) with parameters $\alpha_{0v} \in \mathbb R^q$, $\sigma_0^2 > 0$ and $\beta_0 \in \Theta$. We assume $\alpha_{0v}$ and $\sigma_0^2$ are known for theoretical convenience; in practice it is straightforward to estimate them from the data. We assign a soft-thresholded Gaussian process prior to the spatially-varying coefficient function, that is, $\beta \sim \mathrm{STGP}(\lambda, \kappa)$ for given $\lambda > 0$ and covariance kernel $\kappa$. In light of the large-support property in Theorem 1, the following lemma shows the positivity of prior neighborhoods:

Lemma 3. Denote by $\pi_{n,i}(\cdot\,;\beta)$ the density function of $Z_{n,i} = (Y_i, W_i, X_i)$ in model (1) and suppose Condition A4 holds for $X_i$. Define $\Lambda_{n,i}(\cdot\,;\beta_0,\beta) = \log \pi_{n,i}(\cdot\,;\beta_0) - \log \pi_{n,i}(\cdot\,;\beta)$, $K_{n,i}(\beta_0,\beta) = E_{\beta_0}\{\Lambda_{n,i}(Z_{n,i};\beta_0,\beta)\}$ and $V_{n,i}(\beta_0,\beta) = \mathrm{var}_{\beta_0}\{\Lambda_{n,i}(Z_{n,i};\beta_0,\beta)\}$. There exists a set $B$ with $\Pi(B) > 0$ such that, for any $\varepsilon > 0$,

$$ \liminf_{n\to\infty} \Pi\Big\{\beta \in B,\; n^{-1}\sum_{i=1}^n K_{n,i}(\beta_0,\beta) < \varepsilon\Big\} > 0, \qquad n^{-2}\sum_{i=1}^n V_{n,i}(\beta_0,\beta) \to 0 \;\text{ for all } \beta \in B. $$

We construct sieves for the spatially-varying coefficient functions in Θ as

Θn={βΘ:βpn1/(2d),supsR1R1|Dτβ(s)|pn1/(2d),1τ1ρ}, (3)

where $\rho$ is defined in Condition A1. Using Lemmas A1–A5 in the Appendix, we can bound the tail probability of the sieve and construct uniformly consistent tests, as stated in the following lemmas:

Lemma 4. If $\beta(s) \sim \mathrm{STGP}(\lambda_0, \kappa)$ with $\lambda_0 > 0$ and the kernel function $\kappa$ satisfies Condition A5, then there exist constants $K$ and $b$ such that for all $n \ge 1$, $\Pi(\Theta_n^{\mathrm c}) \le K\exp(-b\,p_n^{1/d})$.

Lemma 5. For any $\varepsilon > 0$ and $v_0/2 < v < 1/2$, there exist $N$, $C_0$, $C_1$ and $C_2$ such that for all $n > N$ and all $\beta \in \Theta_n$ with $\|\beta - \beta_0\|_1 > \varepsilon$, a test function $\Psi_n$ can be constructed such that $E_{\beta_0}(\Psi_n) \le C_0\exp(-C_2 n^{2v})$ and $E_\beta(1 - \Psi_n) \le C_0\exp(-C_1 n)$, where $v_0$ is defined in Condition A1.

Proofs of Lemmas 3–5 are provided in the Supplementary Material. These lemmas verify the three key conditions for proving posterior consistency in scalar-on-image regression based on Theorem A1 of Choudhuri et al. (2004). We therefore have the following theorem:

Theorem 2. Denote the data by $D_n = \{Y_i, X_i, W_i\}_{i=1}^n$. If Conditions A1–A5 hold, then for any $\varepsilon > 0$,

Π(βΘ:ββ01<ε|Dn)1,n. (4)

Theorem 2 implies that the soft-thresholded Gaussian process prior ensures that the posterior distribution of the spatially-varying coefficient function concentrates in an arbitrarily small neighborhood of the true value when the numbers of subjects and spatial locations are both sufficiently large. Given that the true function of interest is piecewise smooth, sparse and continuous, the soft-thresholded Gaussian process prior further ensures that the posterior probability that the sign of the spatially-varying coefficient function is correct converges to one as the sample size goes to infinity. The result is formally stated in the following theorem.

Theorem 3. Suppose the model assumptions, prior settings and regularity conditions for Theorem 2 hold. Then

$$ \Pi[\operatorname{sgn}\{\beta(s)\} = \operatorname{sgn}\{\beta_0(s)\} \text{ for all } s \in \mathcal B \mid D_n] \to 1, \quad n \to \infty. \tag{5} $$

This theorem establishes the consistency of spatial variable selection. It does not require that the number of true image predictors is finite or less than the sample size. This result is reasonable in that the true spatially-varying coefficient function is piecewise smooth and continuous, and the soft-thresholded Gaussian process will borrow strength from neighboring locations to estimate the true image predictors. The Supplementary Material gives proofs of Theorems 2 and 3.

To apply the proposed model to the motivating dataset, we extend model (1) and Theorems 2–3 to the analysis of a binary response variable using a probit model:

$$ Y_i \sim \mathrm{Bernoulli}(\pi_i), \qquad \Phi^{-1}(\pi_i) = \sum_{k=1}^q \alpha_k W_{i,k} + p_n^{-1/2}\sum_{j=1}^{p_n}\beta(s_j)X_{i,j}, \tag{6} $$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function.

Theorem 4. Assume the data $D_n$ are generated from model (6) and the prior settings are the same as in Theorem 2. If Conditions A1–A5 and S1–S3 hold, then (4) and (5) hold under model (6).

Conditions S1–S3 and the proof of Theorem 4 are in the Supplementary Material.

4. POSTERIOR COMPUTATION

4.1. Model representation and prior specification

Next, we discuss the practical implementation of our method. We adopt a low-rank spatial model to keep computation feasible for large datasets, exploiting the kernel convolution approximation of a spatial Gaussian process. As discussed in Higdon et al. (1999), any stationary Gaussian process $V(s)$ can be written as $V(s) = \int K(s - t)\,w(t)\,dt$, where $K$ is a kernel function and $w$ is a white-noise process with mean zero and variance $\sigma_w^2$. The covariance function of $V(\cdot)$ is then $\mathrm{cov}\{V(s), V(s+h)\} = \kappa(h) = \sigma_w^2 \int K(s - t)\,K(s + h - t)\,dt$.

This representation suggests the approximation $\tilde\beta(s) \approx \sum_{l=1}^L K(s - t_l)\,a_l\,\delta$ for the latent process, where $t_1,\dots,t_L \in \mathbb R^d$ form a grid of equally-spaced spatial knots covering $\mathcal B$ and $\delta$ is the grid size. Without loss of generality, we assume $\delta = 1$ and that $K$ is a local kernel function. We use tapered Gaussian kernels with bandwidth $\sigma_h$, $K(h) = \exp\{-\|h\|^2/(2\sigma_h^2)\}\,I(\|h\| < 3\sigma_h)$, so that $K(s - t_l) = 0$ when $s$ is separated from $t_l$ by at least $3\sigma_h$. Taking $L < p_n$ knots and selecting compact kernels leads to computational savings, as discussed in Section 4.2.
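As an illustration, the following R sketch builds the tapered-Gaussian kernel basis on a two-dimensional image domain; the image grid, knot spacing and bandwidth are assumptions chosen for illustration.

```r
# A sketch of the tapered-Gaussian kernel basis on a two-dimensional domain;
# m, the knot spacing and sigma_h are illustrative assumptions.
tapered_gauss <- function(d, sigma_h) {
  exp(-d^2 / (2 * sigma_h^2)) * (d < 3 * sigma_h)
}

m <- 30
s <- as.matrix(expand.grid(1:m, 1:m))                        # p = m^2 pixels
knots <- as.matrix(expand.grid(seq(1, m, 2), seq(1, m, 2)))  # L = 225 knots
sigma_h <- 2

# p x L matrix of squared distances ||s_j - t_l||^2, then the kernel matrix K.
D2 <- outer(rowSums(s^2), rep(1, nrow(knots))) -
  2 * s %*% t(knots) +
  outer(rep(1, nrow(s)), rowSums(knots^2))
K <- tapered_gauss(sqrt(pmax(D2, 0)), sigma_h)  # exactly zero beyond 3 * sigma_h
```

Because of the taper, each row of K has only a handful of nonzero entries, which is the source of the computational savings mentioned above.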

The compact kernels $K$ control the local spatial structure, and the prior for the coefficients $a = (a_1,\dots,a_L)^{\mathrm T}$ controls the broad spatial structure. Following Nychka et al. (2015), we assume that the knots $t_1,\dots,t_L$ are arranged on an $m_1 \times \cdots \times m_d$ array, and we use $l \sim k$ to denote that knots $t_l$ and $t_k$ are adjacent on this array. We then use a conditionally autoregressive prior (Gelfand et al., 2010) for the kernel coefficients. The conditionally autoregressive prior is also defined locally, with full conditional distribution

$$ a_l \mid a_k,\, k \ne l \;\sim\; N\Big(\frac{\vartheta}{n_l}\sum_{k \sim l} a_k,\; \frac{\sigma_a^2}{n_l}\Big), \tag{7} $$

where $n_l$ is the number of knots adjacent to knot $l$, $\vartheta \in (0,1)$ quantifies the strength of spatial dependence, and $\sigma_a^2$ determines the variance. These full conditional distributions correspond to the joint distribution $a \sim N\{0, \sigma_a^2(M - \vartheta A)^{-1}\}$, where $M$ is diagonal with diagonal elements $(n_1,\dots,n_L)$ and $A$ is the adjacency matrix with $(k,l)$ element equal to 1 if $k \sim l$ and zero otherwise.

Write $\tilde\beta_v = \{\tilde\beta(s_1),\dots,\tilde\beta(s_p)\}^{\mathrm T}$. Denote by $K$ the $p \times L$ kernel matrix with $(j,l)$ element $K(\|s_j - t_l\|_2)$; then $\tilde\beta_v \sim N\{0, \sigma_a^2 K(M - \vartheta A)^{-1}K^{\mathrm T}\}$ a priori. In this case the $\tilde\beta(s_j)$ do not have equal variances, which may be undesirable. Non-constant variance arises because the kernel knots may be unequally distributed, and because the conditionally autoregressive model is non-stationary in that the variances of the $a_l$ are unequal.

To stabilize the prior variance, define $\tilde K_{j,l} = K(\|s_j - t_l\|_2)/w_j$ and let $\tilde K$ be the corresponding $p \times L$ matrix of standardized kernel coefficients, where the $w_j$ are constants chosen so that the prior variance of $\tilde\beta(s_j)$ is the same for all $j$. We take $w_j$ to be the square root of the $j$th diagonal element of $K(M - \vartheta A)^{-1}K^{\mathrm T}$, so the kernel functions depend on $\vartheta$. Pulling the prior standard deviation $\sigma_a$ out of the thresholding transformation yields an equivalent representation of model (1):

$$ Y_i \sim N(W_i^{\mathrm T}\alpha_v + p_n^{-1/2}X_i^{\mathrm T}\beta_v,\; \sigma^2), \qquad \beta(s_j) = \sigma_a\, g_\lambda\{\tilde\beta(s_j)\}, \tag{8} $$

where $\tilde\beta_v \sim N\{0, \tilde K(M - \vartheta A)^{-1}\tilde K^{\mathrm T}\}$. After standardization, the prior variance of each $\tilde\beta(s_j)$ is one, and therefore the prior probability that $\beta(s_j)$ is nonzero is $2\Phi(-\lambda)$ for all $j$. This endows each parameter with a distinct interpretation: $\sigma_a$ controls the scale of the nonzero coefficients, $\lambda$ controls the prior degree of sparsity, and $\vartheta$ controls spatial dependence. With an additional set of conditions, we can show that the model representation (8) has large prior support as $L \to \infty$, thus leading to posterior consistency and selection consistency. The Supplementary Material contains more details.
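A sketch of the conditionally autoregressive prior (7) and the kernel standardization behind (8), continuing the kernel-basis sketch above, follows; the value of ϑ is an illustrative assumption.

```r
# A sketch of the CAR prior (7) on the knot array and the standardization
# behind (8); continues the kernel matrix K built above, vartheta is assumed.
gd <- c(15, 15)                      # the 225 knots form a 15 x 15 array
L <- prod(gd)
idx <- expand.grid(r = 1:gd[1], c = 1:gd[2])
A <- 1 * outer(1:L, 1:L, function(k, l)
  abs(idx$r[k] - idx$r[l]) + abs(idx$c[k] - idx$c[l]) == 1)  # adjacency matrix
M <- diag(rowSums(A))                # diagonal matrix of neighbor counts n_l

vartheta <- 0.95
Sigma_a <- solve(M - vartheta * A)   # a ~ N{0, sigma_a^2 (M - vartheta A)^{-1}}

# Standardize the rows of K so each beta_tilde(s_j) has prior variance one.
w <- sqrt(diag(K %*% Sigma_a %*% t(K)))
K_tilde <- K / w                     # element (j, l) is K(s_j - t_l) / w_j
```

After this step, diag(K_tilde %*% Sigma_a %*% t(K_tilde)) equals one, matching the standardization that makes the prior inclusion probability $2\Phi(-\lambda)$ constant across locations.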

In practice, we normalize the response and covariates, and then select the priors $\alpha_v \sim N(0, 10^2 I_q)$, $\sigma^2 \sim \mathrm{InvGamma}(0.1, 0.1)$, $\sigma_a \sim \mathrm{HalfNormal}(0, 1)$, $\vartheta \sim \mathrm{Beta}(10, 1)$ and $\lambda \sim \mathrm{Uniform}(0, 5)$. Following Banerjee et al. (2004), we use a beta prior for $\vartheta$ with mean near one, because only values near one provide appreciable spatial dependence. Finally, although our asymptotic results suggest that any $\lambda > 0$ will give posterior consistency, in finite samples posterior inference may be sensitive to the threshold, so we place a prior on $\lambda$ rather than fixing its value.

4.2. Markov chain Monte Carlo algorithm

For fully Bayesian inference on model (1), we sample from the posterior distribution using Metropolis–Hastings within Gibbs sampling. The parameters $\alpha_v$, $\sigma^2$ and $\sigma_a^2$ have conjugate full conditional distributions and are updated using Gibbs sampling. The spatial dependence parameter $\vartheta$ is updated using Metropolis–Hastings sampling with a beta candidate distribution whose mean is the current value and whose standard deviation is tuned to give an acceptance rate of around 0.4. The threshold $\lambda$ is updated using Metropolis sampling with a random-walk Gaussian candidate distribution whose standard deviation is tuned to give an acceptance probability of around 0.4. The Metropolis update for $a_l$ uses the prior full conditional distribution in (7) as the candidate distribution, which gives a high acceptance rate and thus good mixing without tuning.
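To illustrate why the update for $a_l$ needs no tuning, the following sketch shows a single coefficient update; log_lik is a hypothetical function (not from the paper) returning the log-likelihood of model (8) as a function of the coefficient vector.

```r
# A sketch of the Metropolis update for one kernel coefficient a_l: proposing
# from the CAR full conditional (7) makes the prior terms cancel, so the
# acceptance ratio reduces to the likelihood ratio. log_lik() is a
# hypothetical function returning the log-likelihood of model (8) given a.
update_a_l <- function(l, a, vartheta, sigma_a, A, log_lik) {
  nl <- sum(A[l, ])
  a_prop <- a
  a_prop[l] <- rnorm(1, mean = vartheta * sum(a[A[l, ] == 1]) / nl,
                     sd = sigma_a / sqrt(nl))
  if (log(runif(1)) < log_lik(a_prop) - log_lik(a)) a_prop else a
}
```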

To make posterior inference for the probit regression model (6), we slightly modify the above algorithm by introducing an auxiliary continuous variable $Y_i^*$ for each response variable $Y_i$. We assume that $Y_i = I(Y_i^* > 0)$ and that $Y_i^*$ follows (1) with $\sigma^2 = 1$. The full conditional of $Y_i^*$ is a truncated normal distribution, which is straightforward to sample as a Gibbs step in the posterior computation. The updating schemes for the other parameters remain the same. For other types of non-Gaussian response models, we can use Metropolis–Hastings sampling directly by modifying the likelihood function accordingly.
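A minimal sketch of this data-augmentation step, using the inverse-CDF method to sample from the truncated normal full conditional, is given below; eta denotes the current linear predictor and is an assumed input.

```r
# A sketch of the data-augmentation step for the probit model (6): Y* is drawn
# from N(eta, 1) truncated to (0, Inf) when Y = 1 and to (-Inf, 0] when Y = 0,
# using the inverse-CDF method.
sample_ystar <- function(y, eta) {
  F0 <- pnorm(0, mean = eta, sd = 1)   # normal CDF at the truncation point
  lo <- ifelse(y == 1, F0, 0)
  hi <- ifelse(y == 1, 1, F0)
  qnorm(lo + runif(length(y)) * (hi - lo), mean = eta, sd = 1)
}
```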

5. SIMULATION STUDY

5.1. Data generation

In this section we conduct a simulation study to compare the proposed methods with other popular methods for scalar-on-image regression. For each simulated observation, we generate a two-dimensional image $X_i$ on the $\{1,\dots,m\} \times \{1,\dots,m\}$ grid with $m = 30$. The covariates are generated under two covariance structures: exponential, and shared structure with the signal. The exponential covariates are Gaussian with mean $E(X_{i,j}) = 0$ and $\mathrm{cov}(X_{i,j}, X_{i,l}) = \exp(-d_{j,l}/\vartheta_x)$, where $d_{j,l}$ is the distance between locations $j$ and $l$ and $\vartheta_x$ controls the range of spatial dependence. The covariates generated with shared structure with $\beta_v$ are $X_i = \tilde X_i/2 + e_i\beta_v$, where $\tilde X_i$ is Gaussian with exponential covariance with $\vartheta_x = 3$ and $e_i \sim N(0, \tau^2)$. The continuous response is then generated as $Y_i \sim N(X_i^{\mathrm T}\beta_v, \sigma^2)$, and the binary response as $Y_i \sim \mathrm{Bernoulli}(\pi_i)$ with $\Phi^{-1}(\pi_i) = X_i^{\mathrm T}\beta_v$. Both $X_i$ and $Y_i$ are independent across $i = 1,\dots,n$. We consider three true $\beta_v$ images: the two sparse images plotted in Fig. 2, Five peaks and Triangle, and the dense image Waves with $\beta(s) = \{\cos(6\pi s_1/m) + \cos(6\pi s_2/m)\}/5$. We also vary the sample size $n \in \{100, 250\}$, the spatial correlation $\vartheta_x \in \{3, 6\}$ and the error standard deviation $\sigma \in \{2, 5\}$. For all combinations of these parameters considered, we generate $S = 100$ datasets. The binary data are simulated with $n = 250$ and exponential covariance with $\vartheta_x = 3$.
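A minimal R sketch of the exponential-covariance data generation follows; the seed is arbitrary and the toy disc-shaped signal merely stands in for the Fig. 2 images, which are not reproduced here.

```r
# A sketch of the covariate and response generation under the exponential
# covariance; the toy sparse signal below is an illustrative assumption.
set.seed(2)
m <- 30; n <- 100; vartheta_x <- 3; sigma <- 5
s <- as.matrix(expand.grid(1:m, 1:m))             # p = 900 pixel locations
Sigma_x <- exp(-as.matrix(dist(s)) / vartheta_x)  # exponential covariance
R <- chol(Sigma_x)                                # Sigma_x = t(R) %*% R
X <- matrix(rnorm(n * m^2), n, m^2) %*% R         # n images, one per row

beta_v <- 0.1 * (sqrt((s[, 1] - 10)^2 + (s[, 2] - 10)^2) < 5)  # toy sparse signal
Y <- drop(X %*% beta_v) + rnorm(n, 0, sigma)      # continuous responses
```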

Fig. 2. True $\beta_v$ images used in the simulation study.

5.2. Methods

We fit our model with an $m/2 \times m/2$ equally-spaced grid of knots covering $\{1,\dots,m\} \times \{1,\dots,m\}$, with bandwidth $\sigma_h$ set to the minimum distance between knots. We fit the model both with $\lambda > 0$, where the prior is a soft-thresholded Gaussian process with sparsity, and with $\lambda = 0$, in which case the prior is a Gaussian process without sparsity. For both models, we run the proposed Markov chain Monte Carlo algorithm for 50,000 iterations with 10,000 burn-in, and compute the posterior mean of $\beta_v$. For the sparse model, we also compute the posterior probability of a nonzero $\beta(s)$. We compare our method with the lasso (Tibshirani, 1996) and fused lasso (Tibshirani et al., 2005; Tibshirani & Taylor, 2011) penalized regression estimates

$$ \hat\beta_{Lv} = \arg\min_{\beta_v}\Big\{(Y - X\beta_v)^{\mathrm T}(Y - X\beta_v) + \tilde\lambda\sum_j |\beta(s_j)|\Big\}, $$
$$ \hat\beta_{FLv} = \arg\min_{\beta_v}\Big\{(Y - X\beta_v)^{\mathrm T}(Y - X\beta_v) + \tilde\lambda\sum_{j \sim k} |\beta(s_j) - \beta(s_k)| + \tilde\gamma\tilde\lambda\sum_j |\beta(s_j)|\Big\}. \tag{9} $$

The lasso estimate $\hat\beta_{Lv}$ is computed using the lars package (Hastie & Efron, 2013) in R (R Core Team, 2017), with the tuning parameter $\tilde\lambda$ selected using the Bayesian information criterion. The fused lasso estimate $\hat\beta_{FLv}$ is computed using the genlasso package (Arnold & Tibshirani, 2014) in R. The fused lasso has two tuning parameters, $\tilde\gamma$ and $\tilde\lambda$. For computational reasons, we search only over $\tilde\gamma \in \{1/5, 1, 5\}$; for each $\tilde\gamma$, $\tilde\lambda$ is selected using the Bayesian information criterion. The Supplementary Material gives results for each $\tilde\gamma$; in the main text we report only the results for the $\tilde\gamma$ with the most precise estimates.
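A sketch of BIC-based tuning along the lasso path with lars is shown below, using X and Y from the simulation sketch above; the RSS-based BIC with degrees of freedom equal to the number of nonzero coefficients is an assumed convention, not taken from the paper.

```r
# A sketch of BIC selection along the lasso path with the lars package.
library(lars)
fit <- lars(X, Y, type = "lasso")
B <- coef(fit)                                 # steps x p coefficient path
Xc <- scale(X, center = TRUE, scale = FALSE)   # lars centers internally
fits <- mean(Y) + Xc %*% t(B)                  # fitted values at each step
rss <- colSums((Y - fits)^2)
df <- rowSums(B != 0)                          # nonzero coefficients per step
bic <- n * log(rss / n) + log(n) * df
beta_lasso <- B[which.min(bic), ]              # BIC-selected lasso estimate
```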

We also consider a functional principal component analysis-based alternative. We smooth each image using the technique of Xiao et al. (2013), implemented in the fbps function in R's refund package (Crainiceanu et al., 2014), compute the eigendecomposition of the sample covariance of the smoothed images, and then perform principal components regression using the lasso penalty tuned via the Bayesian information criterion. The Supplementary Material gives results using the leading eigenvectors that explain 80%, 90% and 95% of the variation in the sample images; in the main text we report only the results for the value with the most precise estimates.
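For intuition, a stripped-down sketch of this comparator follows; for brevity, the fbps pre-smoothing step is replaced by the raw images and ordinary least squares stands in for the lasso on the scores, so this is a simplification of the method described above.

```r
# A sketch of the principal components regression comparator; raw images
# replace the fbps-smoothed images, and OLS replaces the lasso on the scores.
pc <- prcomp(X, center = TRUE)
nv <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= 0.90)[1]   # 90% cutoff
Z <- pc$x[, 1:nv, drop = FALSE]            # leading principal component scores
fit_pcr <- lm(Y ~ Z)
beta_fpca <- pc$rotation[, 1:nv, drop = FALSE] %*% coef(fit_pcr)[-1]
```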

Finally, we compare our approach with the Bayesian spatial model of Goldsmith et al. (2014), which sets $\beta(s_j) = \tilde\alpha_j\theta_j$, where $\tilde\alpha_j \in \{0, 1\}$ is the binary indicator that location $j$ is included in the model and $\theta_j \in \mathbb R$ is the regression coefficient given that the location is included. Both the $\tilde\alpha_j$ and the $\theta_j$ have spatial priors: the continuous components $\theta_j$ follow a conditionally autoregressive prior, and the binary components $\tilde\alpha_j$ follow an Ising, or autologistic, prior (Gelfand et al., 2010) with full conditional distributions

$$ \operatorname{logit} \Pi(\tilde\alpha_j = 1 \mid \tilde\alpha_l,\, l \ne j) = a_0 + b_0 \sum_{l \sim j} \tilde\alpha_l. \tag{10} $$

Estimating $a_0$ and $b_0$ is challenging because of the complexity of the Ising model (Møller et al., 2006); therefore, Goldsmith et al. (2014) recommended selecting $a_0$ and $b_0$ by cross-validation over $a_0 \in (-4, 0)$ and $b_0 \in (0, 2)$. Due to computational limitations we select values in the middle of these intervals and set $a_0 = -2$ and $b_0 = 1$. The posterior mean of $\beta(s)$ and the posterior probability of a nonzero $\beta(s)$ are approximated based on 5,000 Markov chain Monte Carlo samples after the first 1,000 are discarded as burn-in.
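To show the structure of (10), a sketch of a single-site Gibbs scan over the indicators is given below; the likelihood contribution is omitted for illustration, so the update shown draws from the prior full conditional only, and adj is a hypothetical list of neighbor indices.

```r
# A sketch of a single-site Gibbs scan over the Ising indicators using the
# full conditional (10); likelihood terms are omitted for illustration.
update_alpha <- function(alpha, adj, a0 = -2, b0 = 1) {
  for (j in seq_along(alpha)) {
    eta <- a0 + b0 * sum(alpha[adj[[j]]])
    alpha[j] <- rbinom(1, 1, plogis(eta))   # inverse-logit inclusion draw
  }
  alpha
}
```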

5.3. Results

Tables 1 and 2 give the mean squared error for βv estimation averaged over location, Type I error and power for detecting nonzero signals, along with the computing time. The soft-thresholded Gaussian process model gives the smallest mean squared error when the covariate has exponential correlation. Compared to the Gaussian process model, adding thresholding reduces mean squared error by roughly 50% in many cases. As expected, the functional principal component analysis method gives the smallest mean squared error in the shared-structure scenarios where the covariates are generated to have a similar spatial pattern as the true signal. Even in this case, the proposed method outperforms other methods which do not exploit this shared structure. Under the dense waves signal the non-sparse Gaussian process model gives the smallest mean squared error, but again the proposed method remains competitive.

Table 1.

Simulation study results for linear regression models. Methods are compared in terms of mean squared error for βv, and Type I error and power for feature detection. The scenarios vary by the true β0v (Fig. 2), sample size n, similarity between covariates and true signal determined by τ, error standard deviation σ, and spatial correlation range of the covariates ϑx. Results are reported as the mean with the standard deviation over the 100 simulated datasets

Mean squared error for βv, multiplied by 104
Signal n τ σ ϑx Lasso Fused lasso FPCA Ising GP STGP
Five peaks 100 0 5 3 326(5) 22(0) 36(0) 46(1) 27(0) 13(0)
100 0 5 6 553(9) 22(0) 33(0) 44(1) 28(0) 13(0)
100 0 2 3 102(1) 11(0) 24(0) 26(0) 15(0) 4(0)
250 0 5 3 674(9) 13(0) 27(0) 52(1) 17(0) 6(0)
Triangle 100 0 5 3 289(4) 9(0) 17(0) 30(1) 19(0) 8(0)
100 0 5 6 515(9) 8(0) 15(0) 28(1) 19(0) 8(0)
100 0 2 3 73(1) 5(0) 12(0) 14(0) 10(0) 4(0)
250 0 5 3 650(9) 6(0) 14(0) 34(0) 12(0) 6(0)
100 2 5 3 1011(17) 7(0) 10(0) 27(1) 33(1) 13(1)
100 4 5 3 1018(17) 6(0) 4(1) 32(1) 34(1) 14(1)
Waves 100 0 5 3 1260(13) 250(5) 188(7) 419(3) 48(1) 109(10)
100 0 5 6 1639(17) 233(4) 126(3) 402(4) 51(1) 128(12)
Type I error, % Power, %
Signal n τ σ ϑx Lasso Fused lasso Ising STGP Lasso Fused lasso Ising STGP
Five peaks 100 0 5 3 10(0) 14(1) 0(0) 4(1) 17(0) 55(2) 4(0) 51(1)
100 0 5 6 10(0) 37(1) 0(0) 7(1) 15(0) 80(1) 5(0) 58(1)
100 0 2 3 79(0) 24(1) 0(0) 4(1) 28(0) 84(1) 4(0) 82(1)
250 0 5 3 27(0) 19(1) 0(0) 4(1) 32(0) 77(1) 10(0) 71(1)
Triangle 100 0 5 3 10(0) 5(0) 0(0) 4(1) 23(1) 85(1) 9(0) 87(1)
100 0 5 6 11(0) 8(1) 0(0) 4(1) 19(1) 91(1) 9(0) 86(1)
100 0 2 3 9(0) 7(1) 0(0) 4(0) 42(1) 95(0) 5(0) 98(0)
250 0 5 3 27(0) 5(0) 0(0) 4(0) 35(1) 92(1) 16(0) 96(1)
100 2 5 3 11(0) 7(1) 0(0) 1(0) 17(0) 86(1) 8(0) 70(1)
100 4 5 3 11(0) 2(1) 0(0) 3(1) 19(1) 84(1) 12(0) 73(1)
Computing time, minutes
Signal n τ σ ϑx Lasso Fused lasso FPCA Ising GP STGP
Five peaks 100 0 5 3 0.02 16.77 5.40 27.61 4.81 11.28

Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of the variation; Ising: Bayesian spatial variable selection with Ising priors (Goldsmith et al., 2014); GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.

Table 2.

Simulation study results for binary regression models. Methods are compared in terms of mean squared error for βv, and Type I error and power for feature detection. The scenarios vary by the true β0v (see Fig. 2). Results are reported as the mean with the standard deviation over the 100 simulated datasets

Mean squared error for βv, multiplied by 104
Signal Lasso FPCA, 80% FPCA, 90% FPCA, 95% GP STGP
Five peaks 11(2) 3(0) 6(2) 21(16) 1(0) 1(0)
Triangle 4(1) 1(0) 3(1) 8(4) 1(0) 0(0)
Waves 80(5) 18(4) 52(16) 38(8) 107(46) 191(138)
Type I error, % Power, %
Signal Lasso STGP Lasso STGP
Five peaks 3(0) 7(1) 15(0) 62(1)
Triangle 1(0) 4(0) 24(1) 91(1)

Type I error: proportion of times zero coefficients are estimated to be nonzero; Power: proportion of times nonzero coefficients are estimated to be nonzero; FPCA, α%: functional principal component analysis approach (Xiao et al., 2013) using eigenvectors that explain α% of the variation; GP: Gaussian process approach; STGP: soft-thresholded Gaussian process approach.

For variable selection, we compare the proposed method only with the fused lasso and the Ising model, because the lasso does not incorporate spatial locations and the other methods do not perform variable selection directly. The results show that the fused lasso has much larger Type I errors in all cases, while the Ising model has very small power to detect the signal in each case. The proposed method is much more efficient than both the fused lasso and the Ising model for variable selection. In terms of computing time, the proposed method is comparable to the fused lasso and faster than the Ising model.

6. ANALYSIS OF EEG DATA

Our motivating application is the study of the relationship between electrical brain activity, as measured through multichannel EEG signals, and genetic predisposition to alcoholism. EEG is a medical imaging technique that records the electrical activity of the brain by measuring the current flows produced when neurons are activated. The study comprises a total of 122 subjects: 77 alcoholic subjects and 45 non-alcoholic controls. For each subject, 64 electrodes were placed on the scalp and the EEG was recorded from each electrode at a frequency of 256 Hz. The electrodes were located at standard sites, following the standard electrode position nomenclature of the American Electroencephalographic Society (Sharbrough et al., 1991). The subjects were presented with 120 trials under several settings involving one or two stimuli. We consider the multichannel average EEG across the 120 trials corresponding to a single stimulus. The dataset is publicly available from the University of California, Irvine Knowledge Discovery in Databases archive: https://kdd.ics.uci.edu/databases/eeg/eeg.data.html.

These data have been previously analyzed by Li et al. (2010), Hung & Wang (2013) and Zhou & Li (2014). However, these analyses ignore the spatial locations of the electrodes on the scalp and instead smooth based on the electrodes' identification numbers, which range from 1 to 64 and are assigned arbitrarily relative to the electrodes' positions on the scalp. Our goal in this analysis is to detect the regions of the brain that are most predictive of alcoholism status; thus, accounting for the actual positions of the electrodes is a key component of our approach. In the absence of more sophisticated means of determining the electrodes' positions on the scalp, we consider a lattice design and assign to each electrode a two-dimensional location that closely matches the electrode's standard position. Using the labels of the electrodes, we were able to identify only 60 of them. As a result, our analysis is based on the multichannel EEG from these 60 electrodes.

In accordance with the notation employed earlier, let $Y_i$ be the alcoholism status indicator, with $Y_i = 1$ if the $i$th subject is alcoholic and 0 otherwise. Furthermore, let $X_i = \{X_i(s_j; t) : s_j \in \mathbb R^2,\ j = 1,\dots,60,\ t = 1,\dots,256\}$ be the EEG image data for the $i$th subject, indexed by a two-dimensional spatial location on the matching lattice design, $s_j$, and a one-dimensional index for time, $t$.

We use a probit model to relate the alcoholism status to the multichannel EEG image: $Y_i \mid X_i, \beta \sim \mathrm{Bernoulli}(\pi_i)$ and $\Phi^{-1}(\pi_i) = \sum_{j=1}^{60}\sum_{k=1}^{256} X_i(s_j, t_k)\,\beta(s_j, t_k)$. The spatially-temporally varying coefficient function $\beta$ quantifies the effect of the image on the response over time and is modeled using the soft-thresholded Gaussian process on the joint spatial and temporal domain. We select a 5 × 5 square grid of spatial knots and 64 temporal knots, for a total of 1,600 three-dimensional knots. We initially fitted a conditionally autoregressive model with different dependence parameters $\vartheta$ for spatial and temporal neighbors (Reich et al., 2007), but found that convergence was slow and that the estimates of the spatial and temporal dependence were similar. We therefore elected to use the same dependence parameter for all neighbors.

We evaluate the predictive performance of the proposed model using cross validation. We first calculate the posterior predictive probability that each test-set response is one, and then apply the standard receiver operating characteristic curve, which evaluates classification accuracy over thresholds on the predictive probabilities. Figure 3 shows the receiver operating characteristic curves under leave-one-out cross validation. The results are compared with those of the lasso, functional principal component analysis, and the Gaussian process approach, that is, the soft-thresholded model with thresholding parameter $\lambda = 0$. To facilitate computation for these methods, we thin the time points by two, leaving 128 time points. While no model is uniformly superior, the area under the curve for our approach is the largest among the alternatives considered.
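A sketch of the receiver operating characteristic curve and AUC computed from such predictive probabilities is given below; y (the 0/1 status) and p_hat (the leave-one-out posterior predictive probabilities) are assumed already available.

```r
# A sketch of the ROC curve and AUC from leave-one-out predictive
# probabilities; y and p_hat are assumed inputs.
roc_points <- function(y, p_hat) {
  cuts <- sort(unique(p_hat), decreasing = TRUE)
  t(sapply(cuts, function(th) c(fpr = mean(p_hat[y == 0] >= th),
                                tpr = mean(p_hat[y == 1] >= th))))
}

auc <- function(y, p_hat) {          # Wilcoxon form of the area under the curve
  gt <- outer(p_hat[y == 1], p_hat[y == 0], ">")
  eq <- outer(p_hat[y == 1], p_hat[y == 0], "==")
  mean(gt + 0.5 * eq)
}
```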

Fig. 3. Receiver operating characteristic curves, with areas under the curve (AUC), for the leave-one-out cross validation of the EEG data by six methods: lasso (black solid, AUC = 0.789); functional principal component analysis using the leading eigenvectors that explain 80% (red solid, AUC = 0.775), 90% (red dashes, AUC = 0.789) and 95% (red dots, AUC = 0.777) of the variation; the Gaussian process approach (green solid, AUC = 0.770); and the soft-thresholded Gaussian process approach (navy solid, AUC = 0.818).

The differences between the models are further examined through the estimated β functions plotted in Fig. 4, where we ignore the spatial locations of the electrodes and plot them by identification number. The lasso solution is nonzero at a single spatiotemporal location, while the functional principal component analysis and Gaussian process methods lead to non-sparse, and thus uninterpretable, β estimates. In contrast, the soft-thresholded Gaussian process estimate is near zero for the vast majority of locations and isolates a subset of the electrodes near time point 86 as the most powerful predictors of alcoholism.

Fig. 4. Estimated spatial-temporal effects of the EEG image predictors by four methods: lasso, functional principal component analysis, Gaussian process and soft-thresholded Gaussian process. The Gaussian process and soft-thresholded Gaussian process estimates are posterior means.

Our analysis indicates that EEG measurements at time $t = 86$, roughly 0.336 seconds after the stimulus, are predictive of alcoholism status. This observation is further supported by the plot of the posterior probability of a nonzero $\beta(s_j, t)$ in Fig. 5a. This suggests a delayed reaction to the stimulus, although the finding has to be confirmed with the investigators. To gain more insight into these findings, Figs. 5b–5d focus on particular time points and display the posterior mean and the posterior probability of a nonzero signal across the electrode locations. They indicate that the right occipital/lateral region is the most predictive of alcoholism status.

Fig. 5. Summary of the analysis of the EEG data by the soft-thresholded Gaussian process. Panel (a) plots the posterior probability of a nonzero β(s, t); each electrode is a line plotted over time t. The remaining panels map either the posterior probability of a nonzero β(s, t) or the posterior mean of β(s, t) at individual time points.

7. DISCUSSION

The proposed method suggests several future directions that we plan to pursue. First, we plan to develop a more efficient posterior computation algorithm for the analysis of voxel-level functional magnetic resonance imaging (fMRI) data, which typically contain around 180,000 voxels per subject. Any fast and scalable Gaussian process approximation can potentially be applied to the soft-thresholded Gaussian process; for example, the recent nearest-neighbor Gaussian process approach of Datta et al. (2016) can be applied to our model. In addition, it is of great interest to perform joint analyses of datasets involving multiple imaging modalities, such as fMRI, diffusion tensor imaging and structural MRI. It is very challenging to model the dependence between multiple imaging modalities over space and to select the interactions between multi-modality image predictors in scalar-on-image regression. An extension of the soft-thresholded Gaussian process might solve this problem: the basic idea is to introduce hierarchical latent Gaussian processes and different thresholding parameters for different modalities, leading to a hierarchical soft-thresholded Gaussian process as the prior model for the interaction effects.

Supplementary Material


ACKNOWLEDGEMENT

This research was supported partially by grants from the National Institutes of Health and the National Science Foundation. The authors would like to thank the Editor, the Associate Editor and anonymous reviewers for suggestions that led to an improved manuscript.

Contributor Information

Jian Kang, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A. jiankang@umich.edu.

Brian J. Reich, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. bjreich@ncsu.edu

Ana-Maria Staicu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, U.S.A. astaicu@ncsu.edu.

REFERENCES

  1. Arnold TB & Tibshirani RJ (2014). genlasso: Path algorithm for generalized lasso problems. R package.
  2. Banerjee S, Carlin BP & Gelfand AE (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman & Hall/CRC.
  3. Boehm Vock LF, Reich BJ, Fuentes M & Dominici F (2014). Spatial variable selection methods for investigating acute health effects of fine particulate matter components. Biometrics 71, 167–177.
  4. Choudhuri N, Ghosal S & Roy A (2004). Bayesian estimation of the spectral density of a time series. J. Am. Statist. Assoc. 99, 1050–1059.
  5. Crainiceanu C, Reiss P, Goldsmith J, Huang L, Huo L & Scheipl F (2014). refund: Regression with functional data. R package.
  6. Cressie N (1993). Statistics for Spatial Data. New York: Wiley.
  7. Datta A, Banerjee S, Finley AO & Gelfand AE (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Am. Statist. Assoc. 111, 800–812.
  8. Gelfand AE, Diggle PJ, Fuentes M & Guttorp P (2010). Handbook of Spatial Statistics. New York: Chapman & Hall/CRC.
  9. Ghosal S & Roy A (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Statist. 34, 2413–2429.
  10. Goldsmith J, Huang L & Crainiceanu CM (2014). Smooth scalar-on-image regression via spatial Bayesian variable selection. J. Comp. Graph. Stat. 23, 46–64.
  11. Hastie TJ & Efron B (2013). lars: Least angle regression, lasso and forward stagewise. R package.
  12. Higdon DM, Swall J & Kern J (1999). Non-stationary spatial modeling. In Bayesian Statistics 6: Proceedings of the Sixth Valencia Meeting, Bernardo J, Berger J, Dawid A & Smith A, eds. Oxford: Clarendon Press, pp. 761–768.
  13. Hung H & Wang C-C (2013). Matrix variate logistic regression model with application to EEG data. Biostatistics 14, 189–202.
  14. Li B, Kim MK & Altman N (2010). On dimension folding of matrix- or array-valued statistical objects. Ann. Statist. 38, 1094–1121.
  15. Li F, Zhang T, Wang Q, Gonzalez M, Maresh E & Coan J (2015). Spatial Bayesian variable selection and grouping in high-dimensional scalar-on-image regressions. Ann. Appl. Statist. 9, 687–713.
  16. Møller J, Pettitt A, Berthelsen K & Reeves R (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93, 451–458.
  17. Nelsen RB (1999). An Introduction to Copulas. New York: Springer-Verlag.
  18. Nychka D, Bandyopadhyay S, Hammerling DM, Lindgren F & Sain S (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. J. Comp. Graph. Stat. 24, 579–599.
  19. R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  20. Reich BJ, Hodges JS & Carlin BP (2007). Spatial analyses of periodontal data using conditionally autoregressive priors having two classes of neighbor relations. J. Am. Statist. Assoc. 102, 44–55.
  21. Reiss PT, Huo L, Zhao Y, Kelly C & Ogden RT (2015). Wavelet-domain regression and predictive inference in psychiatric neuroimaging. Ann. Appl. Statist. 9, 1076–1101.
  22. Reiss PT & Ogden RT (2010). Functional generalized linear models with images as predictors. Biometrics 66, 61–69.
  23. Sharbrough F, Chatrian G, Lesser R, Luders H, Nuwer M & Picton T (1991). American Electroencephalographic Society guidelines for standard electrode position nomenclature. J. Clin. Neurophysiol. 8, 200–202.
  24. Smith M & Fahrmeir L (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. J. Am. Statist. Assoc. 102, 417–431.
  25. Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288.
  26. Tibshirani R, Saunders M, Rosset S, Zhu J & Knight K (2005). Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B 67, 91–108.
  27. Tibshirani RJ & Taylor J (2011). The solution path of the generalized lasso. Ann. Statist. 39, 1335–1371.
  28. Tokdar ST & Ghosh JK (2007). Posterior consistency of logistic Gaussian process priors in density estimation. J. Stat. Plan. Inference 137, 34–42.
  29. Wang X, Zhu H & for the Alzheimer's Disease Neuroimaging Initiative (2016). Generalized scalar-on-image regression models via total variation. J. Am. Statist. Assoc., doi: 10.1080/01621459.2016.1194846.
  30. Xiao L, Li Y & Ruppert D (2013). Fast bivariate P-splines: the sandwich smoother. J. R. Statist. Soc. B 75, 577–599.
  31. Zhou H & Li L (2014). Regularized matrix regression. J. R. Statist. Soc. B 76, 463–483.
