Abstract
We propose a nested Gaussian process (nGP) as a locally adaptive prior for Bayesian nonparametric regression. Specified through a set of stochastic differential equations (SDEs), the nGP imposes a Gaussian process prior for the function’s mth-order derivative. The nesting comes in through including a local instantaneous mean function, which is drawn from another Gaussian process inducing adaptivity to locally-varying smoothness. We discuss the support of the nGP prior in terms of the closure of a reproducing kernel Hilbert space, and consider theoretical properties of the posterior. The posterior mean under the nGP prior is shown to be equivalent to the minimizer of a nested penalized sum-of-squares involving penalties for both the global and local roughness of the function. Using highly-efficient Markov chain Monte Carlo for posterior inference, the proposed method performs well in simulation studies compared to several alternatives, and is scalable to massive data, illustrated through a proteomics application.
Keywords: Bayesian nonparametric regression, Nested Gaussian processes, Nested smoothing spline, Penalized sum-of-square, Reproducing kernel Hilbert space, Stochastic differential equations
1 Introduction
We consider the nonparametric regression problem

Y(tj) = U(tj) + εj, j = 1, 2, …, J,   (1)

where U: 𝒯 → ℝ is an unknown mean regression function to be estimated at the design points t1, t2, …, tJ, with 0 = t0 < t1 < ⋯ < tJ < tU and 𝒯 = [0, tU], and ε = (ε1, ε2, …, εJ)′ follows a J-dimensional multivariate normal distribution with mean vector 0 and covariance matrix σε²IJ. We are particularly interested in allowing the smoothness of U to vary locally as a function of t. For example, consider the protein mass spectrometry data in panel (a) of Figure 1. There are clearly regions of t across which the function is very smooth and other regions in which there are distinct spikes, with these spikes being quite important. An additional challenge is that the data are generated in a high-throughput experiment with J = 11,186 observations. Hence, we need a statistical model which allows locally-varying smoothness, while also permitting efficient computation even when data are available at a large number of locations along the function.
Figure 1.
(a) Plot of protein mass spectrometry data: observed intensities versus mass-to-charge ratio m/z; (b) posterior mean and 95% credible interval of U; (c) posterior mean and 95% credible interval of U for a local region; (d) posterior mean and 95% credible interval of the rate of intensity change DU.
A commonly used approach for nonparametric regression is to place a Gaussian process (GP) prior (Neal, 1998; Rasmussen and Williams, 2006; Shi and Choi, 2011) on the unknown U, where the GP is usually specified by its mean and covariance function (e.g. squared exponential). The posterior distribution of U at the observation points can be conveniently obtained as a multivariate Gaussian distribution. When carefully-chosen hyperpriors are placed on the parameters in the covariance kernel, GP priors have been shown to lead to large support, posterior consistency (Ghosal and Roy, 2006; Choi and Schervish, 2007) and even near-minimax optimal adaptive rates of posterior contraction (Van der Vaart and Van Zanten, 2008a). However, the focus of this literature has been on isotropic Gaussian processes, which have a single bandwidth parameter controlling global smoothness, with the contraction rate theory assuming the true function has a single smoothness level. There has been applied work allowing the smoothness of a multivariate regression surface to vary in different directions by using predictor-specific bandwidths in a GP with a squared exponential covariance (Savitsky et al., 2011; Zou et al., 2010). Bhattacharya, Pati, and Dunson (2011) recently showed that a carefully-scaled anisotropic GP leads to minimax optimal adaptive rates in anisotropic function classes, including when the true function depends on a subset of the predictors. However, the focus was on allowing a single smoothness level for each predictor, while our current interest is allowing smoothness to vary locally in nonparametric regression in a single predictor.
There is a rich literature on locally-varying smoothing. One popular approach relies on free knot splines, for which various strategies have been proposed to select the number of knots and their locations, including stepwise forward and/or backward knots selection (Friedman and Silverman, 1989; Friedman, 1991; Luo and Wahba, 1997), accurate knots selection scheme (Zhou and Shen, 2001) and Bayesian knots selection (Smith and Kohn, 1996; Denison et al., 1998; Dimatteo et al., 2001) via Gibbs sampling (George and McCulloch, 1993) or reversible jump Markov chain Monte Carlo (Green, 1995). Although many of these methods perform well in simulations, such free knot approaches tend to be highly computationally demanding making their implementation in massive data sets problematic.
In addition to free knot methods, adaptive penalization approaches have also been proposed. An estimate of U is obtained as the minimizer of a penalized sum of squares including a roughness penalty with a spatially-varying smoothness parameter (Wahba, 1995; Ruppert and Carroll, 2000; Pintore et al., 2006; Crainiceanu et al., 2007). Other smoothness adaptive methods include wavelet shrinkage (Donoho and Johnstone, 1995), local polynomial fitting with variable bandwidth (Fan and Gijbels, 1995), L-spline (Abramovich and Steinberg, 1996; Heckman and Ramsay, 2000), mixture of splines (Wood et al., 2002) and linear combination of kernels with varying bandwidths (Wolpert et al., 2011). Additionally, adaptive penalization approaches have been applied to non-Gaussian data (Krivobokova et al., 2008; Wood et al., 2008; Scheipl and Kneib, 2009). The common theme of these approaches is to reduce the constraint on the single smoothness level assumption and to implicitly allow the derivatives of U, a common measurement of the smoothness of U, to vary over t.
In this paper, we instead propose a nested Gaussian process (nGP) prior to explicitly model the expectation of the derivative of U as a function of t and to make full Bayesian inference using an efficient Markov chain Monte Carlo (MCMC) algorithm scalable to massive data. More formally, our nGP prior specifies a GP for U's mth-order derivative DmU centered on a local instantaneous mean function A: 𝒯 → ℝ, which is in turn drawn from another GP. Both GPs are defined by stochastic differential equations (SDEs), related to the method proposed by Zhu et al. (2011). However, Zhu et al. (2011) centered their process on a parametric model, while we instead center on a higher-level GP to allow nonparametric locally-adaptive smoothing. Along with the observation equation (1), the SDEs can be reformulated as a state space model (Durbin and Koopman, 2001). This reformulation facilitates the application of the simulation smoother (Durbin and Koopman, 2002), an efficient sampling algorithm with O(J) computational complexity, which is essential for dealing with large-scale data. We will show that the nGP prior has large support and that its posterior distribution is asymptotically consistent. In addition, the posterior mean or mode of U under the nGP prior can be shown to correspond to the minimizer of a penalized sum of squares with nested penalty functions.
The remainder of the paper is organized as follows. Section 2 defines the nGP prior and discusses some of its properties. Section 3 outlines an efficient Markov chain Monte Carlo (MCMC) algorithm for posterior computation. Section 4 presents simulation studies. The proposed method is applied to a mass spectra dataset in Section 5. Finally, Section 6 contains several concluding remarks and outlines some future directions.
2 Nested Gaussian Process Prior
2.1 Definition and Properties
The nGP defines a GP prior for the mean regression function U and the local instantaneous mean function A through the following SDEs with parameters σU ∈ ℝ+ and σA ∈ ℝ+:

DmU(t) = A(t) + σU ẆU(t),   (2)

DnA(t) = σA ẆA(t),   (3)

where ẆU(t) and ẆA(t) are two independent Gaussian white noise processes with mean functions E{ẆU(t)} = E{ẆA(t)} = 0 and covariance functions E{ẆU(t)ẆU(t′)} = E{ẆA(t)ẆA(t′)} = δ(t − t′), with δ(·) a Dirac delta function, and Dm denotes the mth-order differential operator. The initial values of U and its derivatives up to order m − 1 at t0 = 0 are denoted μ = (μ0, μ1, …, μm−1)′. Similarly, the initial values of A and its derivatives up to order n − 1 at t0 = 0 are denoted α = (α0, α1, …, αn−1)′. In addition, we assume that μ, α, ẆU(·) and ẆA(·) are mutually independent. The definition of the nGP naturally induces a prior for U with varying smoothness. Indeed, the SDE (2) implies that E{DmU(t) | A(t)} = A(t). Thus the smoothness of U, as measured by DmU, is centered on a function A that varies over t.
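To make the construction concrete, the following sketch (our own illustration; the function and variable names do not appear in the paper) draws an approximate sample path from the nGP prior with m = 2 and n = 1 by Euler discretization of the SDEs (2) and (3). The locally varying A drives the locally varying curvature, and hence smoothness, of U.

```python
import numpy as np

def simulate_ngp_path(J=2000, dt=0.005, sigma_u=1.0, sigma_a=5.0, seed=0):
    """Euler-discretized draw from the nGP prior with m = 2, n = 1.

    State (U, DU, A) evolves according to
        D^2 U(t) = A(t) + sigma_u * white noise,
        D^1 A(t) = sigma_a * white noise,
    so A acts as a slowly drifting local mean for the second derivative of U.
    """
    rng = np.random.default_rng(seed)
    U, DU, A = np.zeros(J), np.zeros(J), np.zeros(J)
    for j in range(J - 1):
        U[j + 1] = U[j] + DU[j] * dt
        DU[j + 1] = DU[j] + A[j] * dt + sigma_u * np.sqrt(dt) * rng.standard_normal()
        A[j + 1] = A[j] + sigma_a * np.sqrt(dt) * rng.standard_normal()
    return U, DU, A

U, DU, A = simulate_ngp_path()  # U exhibits locally-varying smoothness driven by A
```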
We first recall the definition of the reproducing kernel Hilbert space (RKHS) generated by a zero-mean Gaussian process W = {W(t): t ∈ 𝒯} and the characterization of the support of W, which will be useful for exploring the theoretical properties of the nGP prior. Let (Ω, ℱ, P) be the probability space for W such that for any t1, t2, …, tk ∈ 𝒯 with k ∈ ℕ, {W(t1), W(t2), …, W(tk)}′ follows a zero-mean multivariate normal distribution with covariance matrix induced by the covariance function 𝒞: 𝒯 × 𝒯 → ℝ, defined by 𝒞(s, t) = E{W(s)W(t)}. The RKHS ℋ generated by W is the completion of the linear space of all functions of the form f(·) = Σi ai𝒞(si, ·), equipped with the inner product 〈Σi ai𝒞(si, ·), Σj bj𝒞(tj, ·)〉ℋ = Σi Σj ai bj 𝒞(si, tj), which satisfies the reproducing property f(t) = 〈f, 𝒞(t, ·)〉ℋ for any f ∈ ℋ, f: 𝒯 → ℝ.
With the specification of the RKHS ℋ, the support of W can be characterized as the closure of ℋ (Lemma 5.1, Van der Vaart and Van Zanten, 2008b). We apply this characterization to describe the support of the nGP prior, which is formally stated in Theorem 1.
Theorem 1
The support of the nested Gaussian process U is the closure of the RKHS ℋ = ℋŨ0 ⊕ ℋŨ1 ⊕ ℋÃ0 ⊕ ℋÃ1, the direct sum of the RKHSs ℋŨ0, ℋŨ1, ℋÃ0 and ℋÃ1 with reproducing kernels 𝒞Ũ0(s, t), 𝒞Ũ1(s, t), 𝒞Ã0(s, t) and 𝒞Ã1(s, t), respectively.
The proof is in Appendix A. It is of interest to note that the support of the Gaussian process Ũ = Ũ0 + Ũ1, the prior for the polynomial smoothing spline, is the closure of the RKHS ℋŨ = ℋŨ0 ⊕ ℋŨ1 with ℋŨ ⊂ ℋ (the proof is given in Appendix A). Hence, the nGP prior includes the GP prior for the polynomial smoothing spline as a special case, obtained when α = 0 and σA = 0 so that Ã0(t) = Ã1(t) = 0.
The nGP prior can generate functions U arbitrarily close to any function U0 in the support of the prior. From Theorem 1 it is clear that the support is large, and hence sample paths from the proposed prior can approximate any function in a broad class. As a stronger property, it is also appealing for the posterior distribution to concentrate in arbitrarily small neighborhoods of the true function U0 that generated the data as the sample size J increases; this property is referred to as posterior consistency. More formally, a prior Π on Θ achieves posterior consistency at the true parameter θ0 if, for any neighborhood 𝒰 of θ0, the posterior probability Π(𝒰 | Y1, Y2, …, YJ) → 1 almost surely under the true joint distribution of the observations. For our case, the parameters θ = (U, σε) lie in the product space Θ = C(𝒯) × ℝ+, with C(𝒯) the space of continuous functions on 𝒯, and have prior Πθ = ΠU × Πσε, where ΠU is an nGP prior for U and Πσε is a prior distribution for σε. The L1 neighborhood of θ0 = (U0, σε,0) is defined as Bε(θ0) = {(U, σε): ∫𝒯 |U(t) − U0(t)| dt < ε, |σε/σε,0 − 1| < ε}.
Under these specifications and the regularity conditions detailed in Appendix A, the following result on strong posterior consistency for Bayesian nonparametric regression with the nGP prior holds.
Theorem 2
Let Y(t1), Y(t2), …, Y(tJ) be independent but non-identically distributed observations following normal distributions with unknown mean function U and unknown standard deviation σε at design points t1, t2, …, tJ. Suppose U follows an nGP prior and Assumptions 1, 2 and 3 hold. Then for every θ0 ∈ Θ and every ε > 0, Π{Bε(θ0) | Y(t1), Y(t2), …, Y(tJ)} → 1 almost surely as J → ∞.
The proof is based on the strong consistency theorem of Choi and Schervish (2007) and is detailed in Appendix A.
2.2 The Bayes Estimate as The Nested Smoothing Spline
We show in Theorem 4 that the posterior mean of U under an nGP prior can be related to the minimizer, namely the nested smoothing spline (nSS) Û, of the following penalized sum-of-squares with nested penalties,

nPSS(U, A) = (1/J) Σj=1..J {Y(tj) − U(tj)}² + λU ∫𝒯 {DmU(t) − A(t)}² dt + λA ∫𝒯 {DnA(t)}² dt,   (4)

where λU ∈ ℝ+ and λA ∈ ℝ+ are smoothing parameters that control the smoothness of the unknown functions U(t) and A(t), respectively. A discretized illustration of this objective is sketched below; the following Theorem 3 then provides the explicit form of the nSS, with the proofs included in Appendix A.
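As a rough numerical illustration (our own sketch, not part of the paper's methodology), the nested objective can be minimized on an equally spaced grid by replacing DmU and DnA with finite differences for m = 2, n = 1 and solving the resulting quadratic problem jointly in U and A. All names below are ours, and a banded solver would replace the dense solve in practice.

```python
import numpy as np

def nested_penalized_fit(y, t, lam_u=1e-3, lam_a=1e-1):
    """Minimize a finite-difference discretization of the nested objective (4)
    jointly over the values of U and A on an equally spaced grid (m = 2, n = 1)."""
    J = len(y)
    dt = float(np.mean(np.diff(t)))
    # discrete versions of D^2 U (interior points) and D^1 A
    D2 = (np.eye(J, k=2) - 2 * np.eye(J, k=1) + np.eye(J))[:J - 2] / dt**2
    D1 = (np.eye(J, k=1) - np.eye(J))[:J - 1] / dt
    # objective: (1/J)||y - U||^2 + lam_u*dt*||D2 U - A||^2 + lam_a*dt*||D1 A||^2,
    # written as a quadratic form in the stacked vector x = (U, A)
    Zy = np.hstack([np.eye(J), np.zeros((J, J))])   # picks out U
    Zu = np.hstack([D2, -np.eye(J)[:J - 2]])        # D2 U - A (crudely aligned)
    Za = np.hstack([np.zeros((J - 1, J)), D1])      # D1 A
    H = Zy.T @ Zy / J + lam_u * dt * Zu.T @ Zu + lam_a * dt * Za.T @ Za
    x = np.linalg.solve(H, Zy.T @ y / J)            # normal equations
    return x[:J], x[J:]                             # fitted U and A on the grid
```

The nesting is visible in the two penalties: a large λA forces A toward a constant, so D²U is shrunk toward a single level everywhere, while a small λA lets A, and hence the local roughness of U, change quickly across t.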
Theorem 3
The nested smoothing spline Û (t) has the form
where μ = (μ0, μ1, ···, μm−1)′, ν = (ν1, ν2, ···, νJ)′, α = (α0, α1, ···, αn−1)′ and β = (β1, β2, ···, βJ)′ are the coefficients for the bases
In addition, the nested penalized sum-of-squares can be written as
where
The coefficients μ, ν, α and β of the nested smoothing spline Û (t) are given as
where .
By applying Theorem 3, it is straightforward to show that the nested smoothing spline Û(t) is a linear smoother, expressed in matrix form as Û = KλU,λAY, a linear combination of the observations with weight matrix KλU,λA = φμBμ + RŨBν + φαBα + RÃBβ.
Theorem 4 below gives the main result of this subsection: the posterior mean of U under the nGP prior is equivalent to the nSS Û when λU = σε²/(JσU²) and λA = σε²/(JσA²), in the limit as the variances ρμ and ρα of the normal priors on the initial values μ and α diverge. The proof is in Appendix A.
Theorem 4
Given observations Y = {Y(t1), Y(t2), …, Y(tJ)}′, let E{U(t) | Y} denote the posterior mean of U(t) under the nested Gaussian process prior with μ ~ Nm(0, ρμIm) and α ~ Nn(0, ραIn). With λU = σε²/(JσU²) and λA = σε²/(JσA²), we have

lim(ρμ→+∞, ρα→+∞) E{U(t) | Y} = Û(t),

where Û(t) is the nested smoothing spline.
3 Posterior Computation
To complete a Bayesian specification, we choose priors for the initial values, covariance parameters in the nGP and residual variance. In particular, we let μ ~ Nm(0, ρμIm) and α ~ Nn(0, ραIn), and we assign inverse gamma priors to σU², σA² and σε², where invGamma(a, b) denotes the inverse gamma distribution with shape parameter a and scale parameter b. In the applications shown below, the data are rescaled so that the absolute value of the maximum observation is less than 100. We choose diffuse but proper priors as a default to allow the data to inform strongly, and have observed good performance in a variety of settings for this choice. In practice, we have found the posterior distributions for these hyperparameters to be substantially more concentrated than the priors in the applications we have considered, suggesting substantial Bayesian learning.
With this prior specification, we propose an MCMC algorithm for posterior computation. This algorithm consists of two iterative steps: (1) given σU², σA², σε² and Y, draw posterior samples of μ, U = {U(t1), U(t2), …, U(tJ)}′, α and A = {A(t1), A(t2), …, A(tJ)}′; (2) given μ, U, α, A and Y, draw posterior samples of σU², σA² and σε².
In the first step, it would seem natural to draw U and A from their multivariate normal conditional posterior distributions. However, this is extremely expensive computationally in high dimensions involving O(J3) computations in inverting J × J covariance matrices, which do not have any sparsity structure that can be exploited. To reduce this computational bottleneck in GP models, there is a rich literature relying on low rank matrix approximations (Smola and Bartlett, 2001; Lawrence et al., 2002; Quinonero-Candela and Rasmussen, 2005). Of course, such low rank approximations introduce some associated approximation error, with the magnitude of this error unknown but potentially substantial in our motivating mass spectrometry applications, as it is not clear that typical approximations having sufficiently low rank to be computationally feasible can be accurate.
To bypass the need for such approximations, we propose a different approach that does not require inverting J × J covariance matrices but instead exploits the Markovian property implied by the SDEs (2) and (3). The Markovian property is represented by a stochastic difference equation, namely the state equation, illustrated here for the case m = 2 and n = 1; the extension to higher orders of m and n is straightforward. Given m = 2 and n = 1, the nested Gaussian process U(t), along with its first-order derivative D1U(t) and A(t), follows the exact state equation:

θj+1 = Gjθj + ωj,   (5)

where θj+1 = {U(tj+1), D1U(tj+1), A(tj+1)}′, ωj ~ N3(0, Wj), Gj is the 3 × 3 matrix with rows (1, δj, δj²/2), (0, 1, δj) and (0, 0, 1), and δj = tj+1 − tj; the innovation covariance Wj is given in Appendix A. The proof is in Appendix A. The exact state equation combined with the observation equation (1) forms a state space model (West and Harrison, 1997; Durbin and Koopman, 2001), for which the latent states θj can be efficiently sampled by a simulation smoother algorithm (Durbin and Koopman, 2002) with O(J) computational complexity.
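The following sketch (our own, with invented function names and a diffuse initialization chosen purely for illustration) assembles the transition quantities for m = 2, n = 1 and runs the O(J) forward filtering pass for this state space model; the simulation smoother of Durbin and Koopman (2002) adds a backward sampling sweep of the same cost.

```python
import numpy as np

def transition(delta, sigma_u2, sigma_a2):
    """Transition matrix G_j and innovation covariance W_j for the
    m = 2, n = 1 state theta_j = (U, D1U, A)' at spacing delta,
    as derived from the SDE representation (see Appendix A)."""
    d = delta
    G = np.array([[1.0, d, d * d / 2.0],
                  [0.0, 1.0, d],
                  [0.0, 0.0, 1.0]])
    Wu = np.array([[d**3 / 3, d**2 / 2, 0.0],
                   [d**2 / 2, d,        0.0],
                   [0.0,      0.0,      0.0]])
    Wa = np.array([[d**5 / 20, d**4 / 8, d**3 / 6],
                   [d**4 / 8,  d**3 / 3, d**2 / 2],
                   [d**3 / 6,  d**2 / 2, d]])
    return G, sigma_u2 * Wu + sigma_a2 * Wa

def forward_filter(y, t, sigma_u2, sigma_a2, sigma_e2):
    """O(J) Kalman forward pass for the state space model."""
    F = np.array([[1.0, 0.0, 0.0]])        # observation picks out U(t_j)
    m, P = np.zeros(3), np.eye(3) * 1e6    # vague initialization (illustrative)
    means, covs = [], []
    for j in range(len(y)):
        S = (F @ P @ F.T).item() + sigma_e2   # predictive variance of Y(t_j)
        K = (P @ F.T) / S                     # Kalman gain
        m = m + (K * (y[j] - (F @ m).item())).ravel()
        P = P - K @ F @ P
        means.append(m.copy())
        covs.append(P.copy())
        if j < len(y) - 1:                    # propagate to the next time point
            G, W = transition(t[j + 1] - t[j], sigma_u2, sigma_a2)
            m, P = G @ m, G @ P @ G.T + W
    return np.array(means), np.array(covs)
```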
Given μ, U, α and A, posterior samples of σε² can be obtained by drawing from its inverse-gamma conditional posterior, while σU² and σA² can be updated in Metropolis-Hastings (MH) steps. We have found that typical MH random walk steps tend to be sticky, and it is preferable to use MH independence chain proposals in which one samples candidates for σU² and σA² from approximations to their conditional posteriors that are easy to sample from. To accomplish this, we rely on the following approximate state equation.
When δj is sufficiently small, the exact state equation (5) can be approximated by

θj+1 = G̃jθj + Hω̃j,   (6)

where ω̃j ~ N2(0, W̃j), G̃j is the 3 × 3 matrix with rows (1, δj, 0), (0, 1, δj) and (0, 0, 1), H is the 3 × 2 matrix with rows (0, 0), (1, 0) and (0, 1), and W̃j = diag(σU²δj, σA²δj). The above approximate state equation is derived by applying the Euler approximation (chapter 9, Kloeden and Platen, 1992), essentially a first-order Taylor approximation, to the SDEs (2) and (3). Given the θj's, σU² and σA² in the above approximate state equation can be easily sampled as in a Bayesian linear regression model with given coefficients; a sketch of these conditional draws follows.
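Under this Euler approximation, the increments of D1U (net of the drift δjA) and of A behave like normal regression residuals with variances σU²δj and σA²δj, so inverse gamma priors lead to inverse gamma conditionals. A minimal sketch, assuming m = 2, n = 1 (our own names; the invGamma(a0, b0) values are placeholders), is:

```python
import numpy as np

def draw_variances_euler(DU, A, t, a0=0.01, b0=0.01, rng=None):
    """Draw sigma_U^2 and sigma_A^2 from the inverse-gamma conditionals implied
    by the Euler-approximate state equation (6): the increments of D1U minus
    the drift delta_j * A, and the increments of A, are normal with variances
    sigma_U^2 * delta_j and sigma_A^2 * delta_j, respectively."""
    rng = rng or np.random.default_rng()
    d = np.diff(t)
    resid_u = np.diff(DU) - d * A[:-1]
    resid_a = np.diff(A)

    def draw(resid):
        shape = a0 + 0.5 * len(resid)
        scale = b0 + 0.5 * np.sum(resid**2 / d)
        return 1.0 / rng.gamma(shape, 1.0 / scale)   # inverse-gamma draw

    return draw(resid_u), draw(resid_a)
```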
Finally, we outline the proposed MCMC algorithm as follows:
(1) For the state space model with the observation equation (1) and the exact state equation (5), update the latent states μ, U, α and A by using the simulation smoother.
(2) Sample σε² from its inverse-gamma conditional posterior distribution.
(3a) Given the current σU² and σA², sample latent states μ*, U*, α* and A* for the approximate state space model with the observation equation (1) and the approximate state equation (6).
(3b) Given μ*, U*, α* and A*, draw the proposals σU²* and σA²* from their conditional posterior distributions under the approximate state equation.
(3c) Accept the proposals σU²* and σA²* with probability min(1, r), where the ratio r compares the exact and approximate state-equation densities at the current and proposed variances. Here fN,k(X | 0, Σ) denotes the probability density function of a k-dimensional normal random vector with mean 0 and covariance matrix Σ, and θj, Wj and W̃j are specified in equations (5) and (6); analogous quantities for the proposal are obtained with μ, U, α, A, σU² and σA² replaced by μ*, U*, α*, A*, σU²* and σA²*, respectively.
We also consider an empirical Bayes approach, which modifies the above MCMC algorithm to use conditional method-of-moments estimators for σU² and σA². In particular, focus on the m = 2, n = 1 case and assume δj = δ is sufficiently small and the observations are equally spaced. Then, given Y(tj), U(tj) and A(tj), conditional method-of-moments estimators of the variance parameters can be derived from the observation equation (1) and the approximate state equation (6); one illustrative possibility is sketched below. Given these estimators, the latent states can be easily sampled using the first step of the above MCMC algorithm.
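As one concrete possibility (an illustration consistent with (1) and (6), not necessarily the exact estimators used in the paper; all names are ours), residual-based moment estimators given the latent states could take the following form.

```python
import numpy as np

def moment_estimates(y, U, DU, A, delta):
    """Illustrative conditional method-of-moments estimates of
    (sigma_e^2, sigma_U^2, sigma_A^2) given the latent states, implied by the
    observation equation (1) and the Euler-approximate state equation (6)
    with m = 2, n = 1 and equal spacing delta."""
    y, U, DU, A = map(np.asarray, (y, U, DU, A))
    sigma_e2 = np.mean((y - U) ** 2)
    sigma_u2 = np.mean((np.diff(DU) - delta * A[:-1]) ** 2) / delta
    sigma_a2 = np.mean(np.diff(A) ** 2) / delta
    return sigma_e2, sigma_u2, sigma_a2
```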
4 Simulations
We conducted a simulation study to assess the performance of the proposed method, Bayesian nonparametric regression via an nGP prior (BNR-nGP), and compared it to several alternative methods: cubic smoothing spline (SS, Wahba, 1990), wavelet method with the soft minimax threshold (Wavelet1, Donoho and Johnstone, 1994), wavelet method with the soft Stein's unbiased estimate of risk for threshold choice (Wavelet2, Donoho and Johnstone, 1995) and hybrid adaptive splines (HAS, Luo and Wahba, 1997). For BNR-nGP, we take the posterior mean as the estimate, based on draws from the proposed MCMC algorithm run for 1,500 iterations, discarding the first 500 as burn-in and saving the remaining draws. The initial values of σU² and σA² are chosen as method-of-moments estimators. The other methods are implemented in R (R Development Core Team, 2011), using the corresponding R packages for the wavelet methods (wmtsa, Constantine and Percival, 2010) and hybrid adaptive splines (bsml, Wu et al., 2011).
Our first simulation study focuses on four functions adapted from Donoho and Johnstone (1994) with different types of locally-varying smoothness. The functions are plotted in Figure 2; the smoothness level varies, for example, abruptly in panel (a) and gradually in panel (d). For each function, equally-spaced observations are obtained with Gaussian noise at a fixed signal-to-noise ratio. We use the mean squared error (MSE) to compare the performance of different methods based on 100 replicates. The simulation results are summarized in Table 1. Among all methods, SS performs worst, which is not surprising since it cannot adapt to the locally-varying smoothness. Among the remaining methods, BNR-nGP performs well in general, with either the smallest or the second smallest average MSE across 100 replicates in all cases. In contrast, Wavelet2 and HAS may perform better for a given function, but their performance is clearly inferior for another function (e.g. Heavisine for Wavelet2 and Doppler for HAS). This suggests the nGP prior is able to adapt to a wide variety of locally-varying smoothness profiles.
Figure 2.
Four functions with locally-varying smoothness: true function and 128 noisy observations.
Table 1.
Average MSE and the interquartile range of MSE (in parentheses) for Bayesian nonparametric regression with nGP prior (BNR-nGP), smoothing spline (SS), wavelet method with the soft minimax threshold (Wavelet1), wavelet method with the soft Stein’s unbiased estimate of risk for threshold choice (Wavelet2) and Hybrid adaptive spline (HAS).
| Example | BNR-nGP | SS | Wavelet1 | Wavelet2 | HAS |
|---|---|---|---|---|---|
| Blocks | 0.950(0.166) | 3.018(0.248) | 2.750(0.748) | 1.237(0.341) | 0.539(0.113) |
| Bumps | 1.014(0.185) | 26.185(0.787) | 3.433(0.938) | 1.195(0.282) | 0.904(0.258) |
| Heavisine | 0.320(0.058) | 0.337(0.087) | 0.702(0.230) | 1.620(0.460) | 0.818(0.122) |
| Doppler | 0.989(0.183) | 3.403(0.361) | 1.517(0.402) | 0.695(0.179) | 3.700(0.534) |
| MS Region 1(×10−3) | 1.498(0.266) | 2.293(0.513) | 2.367(0.616) | 6.048(3.441) | 72.565(39.596) |
| MS Region 2(×10−3) | 0.840(0.375) | 0.798(0.490) | 0.948(0.587) | 1.885(0.493) | 7.958(5.559) |
We further compare the proposed method and the alternative methods for analyzing mass spectrometry data. The 100 datasets are generated by the 'virtual mass spectrometer' (Coombes et al., 2005a), which incorporates the physical principles of the instrument. One of these simulated datasets is plotted in Figure 3 with σε = 66. The simulated data have been shown to accurately mimic real data (Morris et al., 2005) and are available at http://bioinformatics.mdanderson.org/Supplements/Datasets/Simulations/index.html. Since analyzing all observations (J = 20,695) of a given dataset is computationally infeasible for HAS, we focus on the observations within two regions, 5 < km/z < 8 (region 1, J = 2,524) and 20 < km/z < 25 (region 2, J = 2,235). These two regions illustrate a distinctive feature of mass spectrometry data: for smaller km/z values the peaks are much taller and sharper than the peaks in the region with larger km/z values. The results in Table 1 indicate that BNR-nGP performs better than the other smoothness-adaptive methods for both regions, with smaller average MSE and narrower interquartile range of MSE. Although the smoothing spline appears to work well in terms of smaller average MSE in region 2, the peaks are clearly over-smoothed, leading to large errors at these important locations. In contrast, BNR-nGP has excellent performance relative to the competitors across locations.
Figure 3.
The plot of one simulated mass spectrometry data (J=20,695).
We finally compare the full Bayes inference approach with the empirical Bayes approach detailed in Section 3. The performance of the two methods is measured by MSE. The observations are simulated from the exact state equation (5) and the observation equation (1) with θ0 = 0 and J = 256. The observations are equally spaced with δ = 1, and we simulate 100 replicate datasets. The full Bayes approach is implemented as in the previous simulation study. The MCMC algorithm took 86 seconds for 1,500 iterations on a PC with a 2.33GHz Intel(R) Xeon(R) CPU. The trace plots and autocorrelation plots indicate fast convergence and rapid mixing, especially for the samples of U(t). For the posterior means of the variance parameters, the MSEs are 0.003, 4.319 and 0.142, while the average MSE of the posterior mean of U(t) is 0.622. We also apply the empirical Bayes approach with method-of-moments estimators of the variance parameters; their MSEs are 0.002, 11.459 and 0.183, respectively. The average MSE of U(t) is 0.629 when we take the posterior mean as the estimator. Although the empirical Bayes approach with the method-of-moments estimators is simple to implement and computationally less intensive, the MSEs of two of the variance parameters are inflated significantly, possibly due to approximating the differential equations when deriving the method-of-moments estimators.
5 Applications
We apply the proposed method to protein mass spectrometry (MS) data. Protein MS plays an important role in proteomics for identifying disease-related proteins in samples (Cottrell and London, 1999; Tibshirani et al., 2004; Domon and Aebersold, 2006; Morris et al., 2008). For example, Panel (a) of Figure 1 plots 11,186 intensities in a pooled sample of nipple aspirate fluid from healthy breasts and breasts with cancer versus the mass-to-charge ratio m/z of ions (Coombes et al., 2005b). Analysis of protein MS data involves several steps, including spectra alignment, signal extraction, baseline subtraction, normalization and peak detection. As an illustration of our method, we focus on the second step, i.e., estimating the intensity function adjusted for measurement error. Peaks in the intensity function may correspond to proteins that differ in expression level between cancer and control patients.
We fit the Bayesian nonparametric regression with the nGP prior and ran the MCMC algorithm for 11,000 iterations, with the first 1,000 iterations discarded as burn-in and every 10th draw retained for analysis. The trace plots and autocorrelation plots suggested the algorithm converged quickly and mixed well. Panel (b) of Figure 1 plots the posterior mean of U and its pointwise 95% credible interval. Note that the posterior mean of U adapts to the varying smoothness in different regions, which is more apparent in Panel (c) of Figure 1. Panel (d) of Figure 1 shows the posterior mean and 95% credible interval of the rate of intensity change DU, which suggests a peak around 4 km/z.
6 Discussion
We have proposed a novel nested Gaussian process prior, which is designed for flexible nonparametric locally adaptive smoothing while facilitating efficient computation even in large data sets. Most approaches for Bayesian locally adaptive smoothing, such as free knot splines and kernel regression with varying bandwidths, encounter substantial problems with scalability. Even isotropic Gaussian processes, which provide a widely used and studied prior for nonparametric regression, face well-known issues in large data sets, with standard approaches for speeding up computation relying on low rank approximations. It is typically not possible to assess the accuracy of such approximations and whether a low rank assumption is warranted for a particular data set. However, when the function of interest is not smooth but can have many local bumps and features, high resolution data may be intrinsically needed to obtain an accurate estimate of local features of the function, with low rank approximations having poor accuracy. This seems to be the case in mass spectrometry applications, such as the motivating proteomics example we considered in Section 5. We have simultaneously addressed two fundamental limitations of typical isotropic Gaussian process priors for nonparametric Bayes regression: (i) the lack of spatially-varying smoothness; and (ii) the lack of scalability to large sample sizes. In addition, this was accomplished in a single coherent Bayesian probability model that fully accounts for uncertainty in the function without relying on multistage estimation.
Although we have provided an initial study of some basic theoretical properties, the fundamental motivation in this paper is to obtain a practically useful method. We hope that this initial work stimulates additional research along several interesting lines. The first relates to generalizing the models and computational algorithms to multivariate regression surfaces. Seemingly this will be straightforward to accomplish using additive models and tensor product specifications. The second is to allow for the incorporation of prior knowledge regarding the shapes of the functions; in some applications, there is information available in the form of differential equations or even a rough knowledge of the types of curves one anticipates, which could ideally be incorporated into an nGP prior. Finally, there are several interesting theoretical directions, such as showing rates of posterior contraction for true functions belonging to a spatially-varying smoothness class.
Acknowledgments
This work was supported by Award Number R01ES017436 and R01ES17240 from the National Institute of Environmental Health Sciences, by funding from the National Institutes of Health (5P2O-RR020782-O3) and the U.S. Environmental Protection Agency (RD-83329301-0) and by the Intramural Research Program of the National Cancer Institute, National Institutes of Health, Maryland, USA. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences, the National Institutes of Health or the U.S. Environmental Protection Agency.
A Appendix: Proofs of Theoretical Results
A.1 Proof of Theorem 1
Its proof requires the results of the following lemma.
Lemma 1
The nested Gaussian process U can be written as U(t) = Ũ0(t) + Ũ1(t) + Ã0(t) + Ã1(t), the sum of mutually independent Gaussian processes with mean functions E{Ũ0(t)} = E{Ũ1(t)} = E{Ã0(t)} = E{Ã1(t)} = 0 and covariance functions 𝒞Ũ0(s, t), 𝒞Ũ1(s, t), 𝒞Ã0(s, t) and 𝒞Ã1(s, t), respectively.
Proof
We specify U(t) = Ũ(t) + Ã(t) and A(t) = DmÃ(t). By the SDEs (2) and (3),

DmŨ(t) = σU ẆU(t),   (7)

Dm+nÃ(t) = σA ẆA(t).   (8)

By applying stochastic integration to the SDEs (7) and (8), it can be shown that

Ũ(t) = Σi=0..m−1 μi t^i/i! + σU ∫0^t {(t − u)^(m−1)/(m − 1)!} dWU(u) = Ũ0(t) + Ũ1(t),
Ã(t) = Ã0(t) + σA ∫0^t {(t − u)^(m+n−1)/(m + n − 1)!} dWA(u) = Ã0(t) + Ã1(t),

given the initial values μ and α. Since Ũ0(t), Ũ1(t), Ã0(t) and Ã1(t) are linear combinations of Gaussian random variables at every t, they are Gaussian processes over t, whose mean and covariance functions can be derived as required. In addition, Ũ0(t), Ũ1(t), Ã0(t) and Ã1(t) are mutually independent due to the mutual independence of μ, α, ẆU(·) and ẆA(·) in the definition of the nGP.
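For later reference, a routine calculation from these integral representations gives the integrated-Wiener covariance functions of the stochastic components (the covariance functions of Ũ0(t) and Ã0(t) depend on the priors assigned to μ and α):

𝒞Ũ1(s, t) = σU² ∫0^min(s,t) {(s − u)^(m−1)(t − u)^(m−1)/((m − 1)!)²} du,
𝒞Ã1(s, t) = σA² ∫0^min(s,t) {(s − u)^(m+n−1)(t − u)^(m+n−1)/((m + n − 1)!)²} du.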
Given the above lemma, Theorem 1 can be proven as follows. We aim to characterize ℋ, the RKHS of U with reproducing kernel 𝒞(s, t). The support of U, a mean-zero Gaussian random element, is the closure of ℋ (Van der Vaart and Van Zanten, 2008b, Lemma 5.1).
By Loève's Theorem (Berlinet and Thomas-Agnan, 2004, Theorem 35), the RKHSs generated by the processes Ũ0(t), Ũ1(t), Ã0(t) and Ã1(t) with covariance functions 𝒞Ũ0(s, t), 𝒞Ũ1(s, t), 𝒞Ã0(s, t) and 𝒞Ã1(s, t) (given in Lemma 1) are congruent to the RKHSs ℋŨ0, ℋŨ1, ℋÃ0 and ℋÃ1, respectively. Based on Theorem 5 of Berlinet and Thomas-Agnan (2004), we conclude that 𝒞(s, t) = 𝒞Ũ0(s, t) + 𝒞Ũ1(s, t) + 𝒞Ã0(s, t) + 𝒞Ã1(s, t) is the reproducing kernel of the RKHS ℋ = ℋŨ0 ⊕ ℋŨ1 ⊕ ℋÃ0 ⊕ ℋÃ1.
A.2 Proof of Theorem 2
We first specify a couple of regularity conditions given by:
Assumption 1 The tj arise according to an infill design: for each δj = tj+1 − tj, there exists a constant 0 < Cd ≤ 1 such that Cd/J ≤ δj ≤ 1/(CdJ).
Assumption 2 The prior distributions and satisfy an exponential tail condition. Specifically, there exist sequences MJ, and such that: (i) and , for some positive constants Cμ, CU, Cα and CA; (ii) , for every Cg > 0 and , the minimal element of { }.
Assumption 3 The prior distribution Πσε is continuous, and σε,0 lies in the support of Πσε.
Similar to the proof of Lemma 1, we write U = Ũ0 + Ũ1 + Ã0 + Ã1, which is a mean-zero Gaussian process with continuous and differentiable covariance function 𝒞(s, t) = 𝒞Ũ0(s, t) + 𝒞Ũ1(s, t) + 𝒞Ã0(s, t) + 𝒞Ã1(s, t).
We aim to verify the sufficient conditions of the strong consistency theorem (Theorem 1, Choi and Schervish, 2007) for nonparametric regression: (I) prior positivity of neighborhoods, and (II) the existence of uniformly exponentially consistent tests and of sieves ΘJ with ΠU(ΘJᶜ) ≤ C1 exp(−C2J) for some positive constants C1 and C2.
Given that U is a Gaussian process with continuous sample paths and continuous covariance function, it follows from Theorem 4 of Ghosal and Roy (2006) that ΠU(||U − U0||∞ < δ) > 0 for any δ > 0. In addition, Πσε assigns positive probability to every neighborhood of σε,0 under Assumption 3. Hence, we can define a neighborhood Bδ of θ0 such that Π(U,σε)(Bδ) > 0, satisfying condition (I).
From Theorem 2 of Choi and Schervish (2007), for a sequence MJ there exist uniformly exponentially consistent tests for the sieves ΘJ = {U: ||U||∞ < MJ, ||DU||∞ < MJ} under the infill design Assumption 1. What remains is to verify that ΠU(||U||∞ > MJ) and ΠU(||DU||∞ > MJ) are exponentially small. Using Borell's inequality (Proposition A.2.7, Van der Vaart and Wellner, 1996) together with the exponential tail Assumption 2 and the Borel-Cantelli theorem, we obtain ΠU(||U||∞ > MJ) ≤ C1 exp(−C2J) for some positive constants C1 and C2. By similar arguments, we can show that ΠU(||DU||∞ > MJ) ≤ C1 exp(−C2J).
Hence, conditions (I) and (II) hold, which leads to strong posterior consistency for Bayesian nonparametric regression with the nGP prior.
A.3 Proof of the Support of GP Prior for Polynomial Smoothing Spline
Note that Ũ = Ũ0 + Ũ1 is the prior for the polynomial smoothing spline (Wahba, 1990, Section 1.5). By arguments similar to those in Theorem 1, we can show that the support of Ũ is the closure of the RKHS ℋŨ = ℋŨ0 ⊕ ℋŨ1. Thus, 𝒞(s, t) − 𝒞Ũ(s, t) = 𝒞Ã0(s, t) + 𝒞Ã1(s, t) is a nonnegative kernel, which implies that ℋŨ ⊂ ℋ by Corollary 4 of Aronszajn (1950).
A.4 Proof of Theorem 3
Let U(t) = Ũ(t) + Ã(t) and A(t) = DmÃ(t). The nested penalized sum-of-squares (4) can then be written as

nPSS = (1/J) Σj=1..J {Y(tj) − Ũ(tj) − Ã(tj)}² + λU ∫𝒯 {DmŨ(t)}² dt + λA ∫𝒯 {Dm+nÃ(t)}² dt,   (9)

so that Ũ(t) plays the role of an mth-order polynomial smoothing spline and Ã(t) that of an (m + n)th-order polynomial smoothing spline.
By the classical RKHS theory of the polynomial smoothing spline (Wahba, 1990, Section 1.2), there exists a unique decomposition Ũ(t) = Ũ0(t) + Ũ1(t) with Ũ0(t) ∈ ℋŨ0 and Ũ1(t) ∈ ℋŨ1, where ℋŨ0 = {f(t): Dmf(t) = 0, t ∈ 𝒯} and ℋŨ1 = {f(t): Dif(t) absolutely continuous for i = 0, 1, ···, m − 1, Dmf(t) ∈ L2(𝒯)} are the RKHSs with reproducing kernels 𝒞Ũ0(s, t) and 𝒞Ũ1(s, t), respectively; here φi(t), Gm(t, u), 𝒞Ũ0(s, t) and 𝒞Ũ1(s, t) are defined in Theorem 1, and L2(𝒯) = {f(t): ∫𝒯 f²(t)dt < ∞} is the space of square-integrable functions on the index set 𝒯.
Given the design points {tj: j = 1, 2, ···, J}, Ũ1(t) can be uniquely written as Ũ1(t) = Σj=1..J νj𝒞Ũ1(tj, t) + ηŨ1(t), where ηŨ1(·) ∈ ℋŨ1 is orthogonal to each 𝒞Ũ1(tj, ·), with inner product 〈𝒞Ũ1(tj, ·), ηŨ1(·)〉ℋŨ1 = ∫𝒯 Dm𝒞Ũ1(tj, u) DmηŨ1(u) du = 0 for j = 1, 2, ···, J.
As a result,

Ũ(t) = Σi=0..m−1 μiφi(t) + Σj=1..J νj𝒞Ũ1(tj, t) + ηŨ1(t).
By similar arguments, Ã(t) = Ã0(t) + Ã1(t) and

Ã(t) = Ã0(t) + Σj=1..J βj𝒞Ã1(tj, t) + ηÃ1(t),

where Ã0(t) ∈ ℋÃ0 and Ã1(t) ∈ ℋÃ1, with ℋÃ0 = {f(t): Dm+nf(t) = 0, t ∈ 𝒯} and ℋÃ1 = {f(t): Dif(t) absolutely continuous for i = 0, 1, ···, m + n − 1, Dm+nf(t) ∈ L2(𝒯)} the RKHSs with reproducing kernels 𝒞Ã0(s, t) and 𝒞Ã1(s, t), respectively; ηÃ1(·) ∈ ℋÃ1 is orthogonal to each 𝒞Ã1(tj, ·), with inner product 〈𝒞Ã1(tj, ·), ηÃ1(·)〉ℋÃ1 = ∫𝒯 Dm+n𝒞Ã1(tj, u) Dm+nηÃ1(u) du = 0 for j = 1, 2, ···, J.
Note that ηŨ1(tj) = 〈𝒞Ũ1(tj, ·), ηŨ1(·)〉ℋŨ1 = 0 and ηÃ1(tj) = 〈𝒞Ã1(tj, ·), ηÃ1(·)〉ℋÃ1 = 0 due to the reproducing property of 𝒞Ũ1(tj, ·) and 𝒞Ã1(tj, ·). It then follows from expression (9) that the objective depends on ηŨ1 and ηÃ1 only through the penalty terms, and hence is minimized when 〈ηŨ1(·), ηŨ1(·)〉ℋŨ1 = 〈ηÃ1(·), ηÃ1(·)〉ℋÃ1 = 0. Thus ηŨ1(·) = ηÃ1(·) = 0, and we obtain the forms of Û(t) and nPSS as required.
To obtain the coefficients μ, ν, α and β, we first take partial derivatives of the nested penalized sum-of-squares nPSS with respect to μ, ν, α and β and set them to zero:
| (10) |
| (11) |
| (12) |
| (13) |
where MŨ = RŨ + JλUI and MÃ = RÃ + JλAI. It follows from equations (11) and (13) that
Substituting them into equations (10) and (12) with some algebra leads to
from which we obtain
It is then straightforward to show
as desired.
A.5 Proof of Theorem 4
The proof is based on the following results of Lemma 2.
Lemma 2
For the observations Y = {Y(t1), Y(t2), …, Y(tJ)}′ and the nested Gaussian process U(t), we have
Proof
Let U(t) = Ũ(t) + Ã(t) and A(t) = Dm Ã(t). From SDEs (2) and (3),
Thus, given the initial value μ, it can be shown that Ũ(t) − Ũ0(t) = Ũ1(t) is an (m − 1)-fold integrated Wiener process (Shepp, 1966). Similarly, Ã(t) − Ã0(t) = Ã1(t) is an (m + n − 1)-fold integrated Wiener process.
It is obvious that E{U(t)} = 0 and E{Y} = 0. Given the mutual independence of μ, α, ẆU(·) and ẆA(·),
and
for j = 1, 2, ···, J and j′ = 1, 2, ···, J. The lemma holds.
The proof of Theorem 4 is given as follows. By Lemma 2 and the results on conditional multivariate normal distribution (Searle, 1982),
We evaluate the limits of these quantities as ρμ → +∞ and ρα → +∞.
It can be verified that
| (14) |
It follows that and and .
As a result,
By expression (14),
It follows that . By similar arguments, .
Hence, and . The theorem holds.
A.6 Proof of the Exact State Equation
When m = 2 and n = 1, the SDEs (2) and (3) can be written in vector form as

dθ(t) = Fθ(t)dt + L dW(t),

where θ(t) = {U(t), D1U(t), A(t)}′, W(t) = {WU(t), WA(t)}′, F is the 3 × 3 matrix with rows (0, 1, 0), (0, 0, 1) and (0, 0, 0), and L is the 3 × 2 matrix with rows (0, 0), (σU, 0) and (0, σA).
As a result,

θj+1 = exp(Fδj)θj + ∫0^δj exp{F(δj − u)}L dW(u) = Gjθj + ωj,

where Gj = exp(Fδj) is the 3 × 3 matrix with rows (1, δj, δj²/2), (0, 1, δj) and (0, 0, 1), and ωj ~ N3(0, Wj) with

Wj = ∫0^δj exp(Fu)LL′exp(F′u) du = σU² [δj³/3, δj²/2, 0; δj²/2, δj, 0; 0, 0, 0] + σA² [δj⁵/20, δj⁴/8, δj³/6; δj⁴/8, δj³/3, δj²/2; δj³/6, δj²/2, δj],

as required.
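A quick symbolic check of this discretization (our own verification script; the symbol names are arbitrary) can be run with sympy:

```python
import sympy as sp

# Check that exp(F * delta) reproduces G_j and that integrating
# exp(F u) L L' exp(F' u) over [0, delta] reproduces W_j above.
delta, u, su2, sa2 = sp.symbols('delta u sigmaU2 sigmaA2', positive=True)

F = sp.Matrix([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])
LLt = sp.diag(0, su2, sa2)   # L L' with L = [[0, 0], [sigma_U, 0], [0, sigma_A]]

G = (F * delta).exp()        # -> [[1, delta, delta**2/2], [0, 1, delta], [0, 0, 1]]
M = (F * u).exp()
W = (M * LLt * M.T).applyfunc(lambda e: sp.integrate(e, (u, 0, delta)))

print(G)
print(sp.simplify(W))
```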
Contributor Information
Bin Zhu, Email: bin.zhu@nih.gov, Tenure-Track Principal Investigator, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852.
David B. Dunson, Email: dunson@stat.duke.edu, Arts & Sciences Distinguished Professor, Department of Statistical Science, Duke University, Durham, NC 27708
References
- Abramovich F, Steinberg DM. Improved inference in nonparametric regression using L-smoothing splines. Journal of Statistical Planning and Inference. 1996;49:327–341.
- Aronszajn N. Theory of Reproducing Kernels. Transactions of the American Mathematical Society. 1950;68:337–404.
- Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Netherlands: Springer; 2004.
- Bhattacharya A, Pati D, Dunson DB. Adaptive dimension reduction with a Gaussian process prior. 2011. ArXiv preprint arXiv:1111.1044.
- Choi T, Schervish MJ. On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis. 2007;98:1969–1987.
- Constantine W, Percival D. wmtsa: Insightful Wavelet Methods for Time Series Analysis. 2010. http://CRAN.R-project.org/package=wmtsa.
- Coombes K, Koomen J, Baggerly K, Morris J, Kobayashi R. Understanding the characteristics of mass spectrometry data through the use of simulation. Cancer Informatics. 2005a;1:41.
- Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics. 2005b;5:4107–4117. doi:10.1002/pmic.200401261.
- Cottrell J, London U. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20:3551–3567. doi:10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
- Crainiceanu CM, Ruppert D, Carroll RJ, Joshi A, Goodner B. Spatially adaptive Bayesian penalized splines with heteroscedastic errors. Journal of Computational and Graphical Statistics. 2007;16:265–288.
- Denison DGT, Mallick BK, Smith AFM. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1998;60:333–350.
- Dimatteo I, Genovese CR, Kass RE. Bayesian curve-fitting with free-knot splines. Biometrika. 2001;88:1055.
- Domon B, Aebersold R. Mass spectrometry and protein analysis. Science. 2006;312:212. doi:10.1126/science.1124619.
- Donoho DL, Johnstone IM. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association. 1995:1200–1224.
- Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455.
- Durbin J, Koopman SJ. A simple and efficient simulation smoother for state space time series analysis. Biometrika. 2002;89:603.
- Durbin J, Koopman SJ. Time series analysis by state space methods. Vol. 24. Oxford: Oxford University Press; 2001.
- Fan J, Gijbels I. Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 1995:371–394.
- Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics. 1991:1–67.
- Friedman JH, Silverman BW. Flexible parsimonious smoothing and additive modeling. Technometrics. 1989:3–21.
- George EI, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993:881–889.
- Ghosal S, Roy A. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics. 2006;34:2413–2429.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711.
- Heckman NE, Ramsay JO. Penalized regression with model-based penalties. Canadian Journal of Statistics. 2000;28:241–258.
- Kloeden PE, Platen E. Numerical Solution of Stochastic Differential Equations. New York: Springer Verlag; 1992.
- Krivobokova T, Crainiceanu CM, Kauermann G. Fast Adaptive Penalized Splines. Journal of Computational and Graphical Statistics. 2008;17:1–20.
- Lawrence ND, Seeger M, Herbrich R. Fast sparse Gaussian process methods: The informative vector machine. Advances in Neural Information Processing Systems. 2002;15:609–616.
- Luo Z, Wahba G. Hybrid adaptive splines. Journal of the American Statistical Association. 1997:107–116.
- Morris J, Coombes K, Koomen J, Baggerly K, Kobayashi R. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics. 2005;21:1764–1775. doi:10.1093/bioinformatics/bti254.
- Morris JS, Brown PJ, Herrick RC, Baggerly KA, Coombes KR. Bayesian Analysis of Mass Spectrometry Proteomic Data Using Wavelet-Based Functional Mixed Models. Biometrics. 2008;64:479–489. doi:10.1111/j.1541-0420.2007.00895.x.
- Neal R. Regression and classification using Gaussian process priors. Bayesian Statistics. 1998;6:475–501.
- Pintore A, Speckman P, Holmes CC. Spatially adaptive smoothing splines. Biometrika. 2006;93:113.
- Quinonero-Candela J, Rasmussen CE. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research. 2005;6:1939–1959.
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. http://www.R-project.org.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. Boston: MIT Press; 2006.
- Ruppert D, Carroll RJ. Spatially-adaptive Penalties for Spline Fitting. Australian & New Zealand Journal of Statistics. 2000;42:205–223.
- Savitsky T, Vannucci M, Sha N. Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science. 2011;26:130–149. doi:10.1214/11-STS354.
- Scheipl F, Kneib T. Locally adaptive Bayesian P-splines with a Normal-Exponential-Gamma prior. Computational Statistics and Data Analysis. 2009;53:3533–3552.
- Searle SR. Matrix Algebra Useful for Statistics. New York: Wiley; 1982.
- Shepp LA. Radon-Nikodym derivatives of Gaussian measures. The Annals of Mathematical Statistics. 1966;37:321–354.
- Shi JQ, Choi T. Gaussian Process Regression Analysis for Functional Data. London: Chapman & Hall/CRC Press; 2011.
- Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics. 1996;75:317–343.
- Smola AJ, Bartlett P. Sparse greedy Gaussian process regression. Advances in Neural Information Processing Systems. 2001;13.
- Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT. Sample classification from protein mass spectrometry, by peak probability contrasts. Bioinformatics. 2004;20:3034–3044. doi:10.1093/bioinformatics/bth357.
- Van der Vaart AW, Van Zanten JH. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics. 2008a;36:1435–1463.
- Van der Vaart AW, Van Zanten JH. Reproducing kernel Hilbert spaces of Gaussian priors. Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh. 2008b;3:200–222.
- Van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer Verlag; 1996.
- Wahba G. Spline Models for Observational Data. Vol. 59. Philadelphia: Society for Industrial and Applied Mathematics; 1990.
- Wahba G. Discussion of a paper by Donoho et al. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 1995:360–361.
- West M, Harrison J. Bayesian Forecasting and Dynamic Models. New York: Springer Verlag; 1997.
- Wolpert RL, Clyde MA, Tu C. Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels. The Annals of Statistics. 2011;39:1916–1962.
- Wood SA, Jiang W, Tanner M. Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika. 2002;89:513.
- Wood SA, Kohn R, Cottet R, Jiang W, Tanner M. Locally adaptive nonparametric binary regression. Journal of Computational and Graphical Statistics. 2008;17:352–372.
- Wu JQ, Sklar J, Wang YD, Meiring W. bsml: Basis Selection from Multiple Libraries. 2011. http://CRAN.R-project.org/package=bsml.
- Zhou S, Shen X. Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association. 2001;96:247–259.
- Zhu B, Song PXK, Taylor JMG. Stochastic Functional Data Analysis: A Diffusion Model-Based Approach. Biometrics. 2011;67:1295–1304. doi:10.1111/j.1541-0420.2011.01591.x.
- Zou F, Huang H, Lee S, Hoeschele I. Nonparametric Bayesian Variable Selection With Applications to Multiple Quantitative Trait Loci Mapping With Epistasis and Gene–Environment Interaction. Genetics. 2010;186:385. doi:10.1534/genetics.109.113688.