Summary
Gaussian processes (GPs) are common components in Bayesian non-parametric models having a rich methodological literature and strong theoretical grounding. The use of exact GPs in Bayesian models is limited to problems containing at most several thousand observations due to their prohibitive computational demands. We develop a posterior sampling algorithm using ℋ-matrix approximations that scales at O(n log² n). We show that this approximation's Kullback-Leibler divergence to the true posterior can be made arbitrarily small. Though multidimensional GPs could be used with our algorithm, d-dimensional surfaces are modeled as tensor products of univariate GPs to minimize the cost of matrix construction and maximize computational efficiency. We illustrate the performance of this fast increased fidelity approximate GP, FIFA-GP, using both simulated and non-synthetic data sets.
Keywords: Bayesian Gaussian process, Matrix compression, Scalability, Surface estimation, Tensor product
1. Introduction
Gaussian processes (GPs) provide flexible priors over smooth function spaces and are widely used in Bayesian non-parametric modeling. Though such models have strong theoretical underpinnings (e.g., see Rasmussen and Williams [2006] and references therein), they are limited in use due to the computational complexity required to sample from the posterior distribution. For n observations, estimation requires computing the inverse, determinant, and square-root decomposition of the n × n covariance matrix. These operations scale at O(n³) with O(n²) memory requirements. When using Gibbs sampling, computational issues are compounded, as reliable inference depends on multiple samples being drawn from the posterior distribution. In practice, these bottlenecks limit the use of GPs to data sets containing at most several thousand observations.
A large literature exists on scalable GPs. Common methods for approximating the likelihood from the full GP include (a) using a subset of the training data [Herbrich et al., 2003, Chalupka et al., 2013]; (b) introducing a grid of inducing points [Williams and Seeger, 2001, Quiñonero-Candela and Rasmussen, 2005, Titsias, 2009, Hensman et al., 2013] to approximate the covariance matrix using a low-rank matrix that is scaled further using structure exploiting algebra [Wilson and Nickisch, 2015, Wilson et al., 2015]; and (c) relying on local data to make predictions in a given region [Gramacy and Lee, 2008, Nguyen-Tuong et al., 2009, Datta et al., 2016b]. Methods based on (a) or (b) struggle to capture local structure in the data unless the subset size or number of inducing points approach the number of training points. Methods based upon (c) tend to capture local structure well but may struggle with global patterns; additionally, they do not offer an approximation to the global covariance. Saibaba and Kitanidis [2012, 2015] suggest methods for approximating posterior realizations of a function conditional on observed data, but do not address hyperparameter uncertainty or quantify the distributional similarity between the true and approximating posterior. The bulk of existing approximation approaches do not address the issue of unified parameter estimation and function interpolation. Further, the majority of the extant literature focuses on fast approximate methods for computing the inverse of a matrix, but does not address estimating the square-root of this inverse, which is required to sample from the process. For a thorough review of scalable GP approaches, see Liu et al. [2018].
Note that special kernels (e.g., sparse kernels) are not directly relevant here because we focus on approximating realizations from a given covariance matrix rather than on modifying the generating kernel itself.
Though the majority of the above methodologies are used in a frequentist context for point estimation, Bayesian models have also been proposed to approximate samples from a GP for large n. Sang and Huang [2012] use a combination of inducing point and sparse kernel methods to capture both local and global structure. Global structure is captured via a reduced rank approximation to the GP covariance matrix based on truncating the Karhunen–Loève expansion of the process and solving the resulting integral equation using the Nyström method, and residual local structure is captured via a tapered kernel. Approximation fidelity is a function of taper length and number of inducing points. The computational cost of this method grows with m, the number of inducing points, and with k, the average number of non-zero entries per row in the approximated covariance matrix. For high fidelity approximations, or in high dimensions, m and k may need to approach n to reach the desired accuracy.
Banerjee et al. [2012] outline an approximation approach that specifies a probabilistic bound on the Frobenius norm between the approximated and true covariance matrix. This approach constructs a linear projection onto an m-dimensional subspace, m ≪ n, using a stochastic basis construction [Sarlos, 2006, Halko et al., 2009]. As with the method of Sang and Huang [2012], the construction of the subspace may scale poorly as n increases. For higher dimensions, m tends to approach n, and because the algorithm's cost grows rapidly with m, there may then be no computational savings.
A Bayesian alternative to the linear projection approaches is the hierarchical nearest-neighbor approach (NNGP) of Datta et al. [2016a]. Computations using this process scale linearly in n, which allows for scalability to previously unimplementable data sizes. The NNGP is a well-defined spatial process, with sparse precision matrices available in closed form enabling the aforementioned computational gains. However, this sparsity guarantees the approximating covariance matrix will differ from the covariance matrix of an exact GP.
We propose an approach using tree-based approximations of the original matrix, called hierarchical matrices or ℋ-matrices [Hackbusch, 1999, Grasedyck and Hackbusch, 2003]. These matrices can approximate covariance matrices to user defined precision while providing quasi-linear solutions to linear systems of equations, determinant computation, matrix multiplication, and Cholesky-like square root matrix decomposition. The computational complexity of a given operation can vary depending on the type of ℋ-matrix and the particular algorithm, but the majority of the ℋ-matrix approaches defined in the literature have a cost of at most O(n log² n). ℋ-matrix techniques combine methods similar to Datta et al. [2016a] and Banerjee et al. [2012] to accurately approximate a covariance matrix to near machine precision by decomposing the matrix as a tree of partial and full rank matrices.
ℋ-matrices have previously been used for likelihood approximation and kriging in large scale GPs [Ambikasaran et al., 2015, Litvinenko et al., 2019, Geoga et al., 2019], Gaussian probability computations in high dimensions [Cao et al., 2019, Genton et al., 2018], and as a step in approximate matrix square root computation using Chebyshev matrix polynomials for conditional realizations [Saibaba and Kitanidis, 2012]. For Bayesian analysis, the benefits of the ℋ-matrix formulation do not directly translate into efficient Gibbs sampling algorithms. This is because the posterior conditional distribution has a covariance matrix comprised of multiplicative pieces not directly amenable to ℋ-matrix approximations. To overcome this, we propose a sampling algorithm that does not require direct computation of the posterior covariance matrix in ℋ-matrix form, but instead uses the properties of ℋ-matrices to sample from the posterior conditional distribution efficiently.
Though the algorithm can be used for any ℋ-matrix construction, we develop an approach focusing on the Hierarchical Off-Diagonal Low-Rank (HODLR) ℋ-matrix construction of Ambikasaran et al. [2015]. This method is extremely efficient at compressing matrices when d, the dimension of the inputs, is 1, allowing for efficient sampling algorithms on modest hardware for n > 100,000. When d > 1, the HODLR construction bottlenecks and the compression is too computationally demanding for run-time construction. Consequently, when d = 1 we approximate the GP directly, and when d > 1 we approximate the surface using a tensor product of d 1-dimensional GPs similar to Wheeler [2019]. The result is a process that can approximate a surface with high fidelity by relying on d 1-dimensional functions and that, given the right covariance kernel, is not constrained by the need (typical of spline tensor product methods) to define inducing points [De Boor, 1978, Eilers and Marx, 2010]. The method, which we call fast increased fidelity approximate GP (FIFA-GP), involves approximations of the GP covariance using ℋ-matrices, requiring d individual matrix factorizations for the tensor product formulation, which allows for the development of a novel sampling step with O(dn log² n) cost.
To investigate our approach, we first show that the Kullback–Leibler divergence between the approximate and true conditional posterior goes to zero as the max-norm between the approximate and true matrix goes to zero. For most ℋ-matrix constructions, this deviation can be decreased to machine precision, which implies there is theoretically no difference between sampling from the posterior with approximation error induced by the ℋ-matrix approximation and sampling with the error induced by finite computer precision. As FIFA-GP uses 1-dimensional GPs, when d > 1 we show how the tensor product, with an appropriate covariance kernel, defines a stochastic process over a space of d-dimensional surfaces dense in the space of all possible d-dimensional smooth surfaces. We use these results in various simulation comparisons and non-synthetic data examples to show the applicability of the method. The R code that can reproduce all numerical results is included in the web-based supporting materials.
The manuscript is structured as follows: Section 2 gives an overview of Gaussian process modeling and describes existing methods for approximate Bayesian computation. Section 3 describes the matrix decomposition method used for the proposed GP approximation. Section 4 offers proofs for the approximation fidelity of the posterior and details the sampling algorithm. We illustrate the relative performance of all approximate methods alongside a true GP in Section 5 with a simulation study having small n and then compare performance of approximate methods using a large n simulation. Section 6 provides timing and performance results using non-synthetic data. Finally, Section 7 discusses possible applications and extensions of this approximation to other high-dimensional Bayesian settings.
2. Gaussian process models
We provide an overview of Gaussian process models and detail the computational bottlenecks associated with their use in Bayesian samplers. We also review existing approaches for approximating the GP posterior, pointing out drawbacks with each to motivate our approach.
2.1. Gaussian process regression
Suppose one observes n noisy realizations of a function f at a set of inputs x = (x₁, …, xₙ). Denote these observations y = (y₁, …, yₙ), and for simplicity of exposition, assume the inputs are unique. In the classic regression setting, the error is assumed independent and normally distributed:
$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} \mathrm{N}(0, \tau^{-1}), \qquad i = 1, \ldots, n.$ | (1)
When f is an unknown function it is common to assume f ∼ GP(μ, K), a Gaussian process with mean function μ(·) and covariance function K(·, ·). This process is completely specified by the mean and the covariance function. A priori the mean function is frequently taken to be zero, with the behavior of the process defined by the covariance function. This function, K(·, ·), along with its hyper-parameters, Θ, specifies properties of f (e.g., for the squared exponential covariance kernel, realizations of f are smooth and infinitely differentiable); the hyper-parameters Θ further specify these properties (e.g., how quickly covariance between points decays or how far the function tends to stray from its mean).
Letting f be a zero centered GP with K(·, ·) being any symmetric covariance kernel (e.g., squared exponential or Matérn) having hyper-parameters Θ, one specifies a nonparametric prior over f for the regression problem given in (1). As an example, one may define K using the squared exponential covariance kernel, parameterized as $K(x, x'; \Theta) = \sigma_f^2 \exp\{-\rho (x - x')^2\}$ with $\Theta = (\sigma_f^2, \rho)$, where $\sigma_f^2 > 0$ is the marginal variance and $\rho > 0$ is an inverse length-scale.
In general, for n observations, define K to be the n × n covariance matrix for the n observed inputs using K(·, ·; Θ). That is, K is such that $K_{ij} = K(x_i, x_j; \Theta)$ for i, j ∈ {1, …, n}. Let $y = (y_1, \ldots, y_n)^{\top}$ be the vector of observed noisy realizations of f. The log-likelihood function is then
$\log p(y \mid \Theta, \tau) = -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log\left|K + \tau^{-1}I_n\right| - \tfrac{1}{2}\, y^{\top}\!\left(K + \tau^{-1}I_n\right)^{-1} y.$ | (2)
Computing this log-likelihood requires calculating the determinant and inverse of the n × n matrix $K + \tau^{-1}I_n$, which are both O(n³) operations. For Bayesian computation, the computational bottlenecks increase. Defining $\mu_f = K\left(K + \tau^{-1}I_n\right)^{-1}y$ and $\Sigma_f = K - K\left(K + \tau^{-1}I_n\right)^{-1}K$, the posterior of f conditional on the hyper-parameters is known, i.e.,
$f \mid y, \Theta, \tau \sim \mathrm{N}\!\left(\mu_f, \Sigma_f\right).$ | (3)
Calculating the covariance in (3) requires the inversion of $K + \tau^{-1}I_n$ followed by matrix multiplication involving K and $\left(K + \tau^{-1}I_n\right)^{-1}$, which are both O(n³) operations. Calculating just the mean in (3) requires either the inversion of $K + \tau^{-1}I_n$ followed by two matrix-vector products (an O(n³) operation followed by two O(n²) operations) or first calculating the variance in (3) followed by a matrix-vector product (two O(n³) operations followed by one O(n²) operation).
For Markov chain Monte Carlo algorithms, assuming prior distributions are assigned to Θ and τ, inference proceeds by sampling from the joint posterior of (f, Θ, τ), which requires evaluating (2) and (3) a large number of times. In addition to these evaluations, the Cholesky decomposition of the posterior covariance $\Sigma_f$ is required for sampling from the posterior of f at each iteration. This is also an O(n³) operation. The computational cost of each sample limits the use of Bayesian estimation of the full GP to problems having at most 5,000 observations on most single processor computers, and 5,000 observations is likely a generous upper bound. It is the authors' experience that 1,500 is a more reasonable limit for most problems.
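To make these costs concrete, the following R sketch (our own illustration, not the manuscript's implementation; the kernel settings and simulated data are arbitrary) draws a single sample of f from (3) using dense linear algebra. Each flagged step is O(n³) and must be repeated whenever Θ or τ changes.

```r
set.seed(1)
n   <- 1000
x   <- sort(runif(n, -2, 2))
tau <- 30                                   # noise precision
y   <- sin(2 * pi * x) + rnorm(n, sd = 1 / sqrt(tau))

sq_exp <- function(x1, x2, sigma_f = 1, rho = 1)
  sigma_f^2 * exp(-rho * outer(x1, x2, "-")^2)

K <- sq_exp(x, x) + 1e-8 * diag(n)          # small nugget for numerical stability
M <- K + diag(n) / tau                      # K + tau^{-1} I_n

M_inv   <- solve(M)                         # O(n^3) inversion
Sigma_f <- K - K %*% M_inv %*% K            # O(n^3) matrix products
mu_f    <- K %*% (M_inv %*% y)              # two O(n^2) matrix-vector products

L <- t(chol(Sigma_f + 1e-8 * diag(n)))      # O(n^3) Cholesky of the posterior covariance
f_draw <- as.vector(mu_f + L %*% rnorm(n))  # one posterior draw of f
```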
2.2. Bayesian approximation methods
Given the difficulties of evaluating (2) and (3) for large n, a number of approximation methods have been developed. The approach of Sang and Huang [2012] combines the reduced rank process of Banerjee et al. [2008] with the idea of covariance tapering in order to capture both global and local structure. Specifically, f(x) is decomposed as $f(x) = f_l(x) + f_s(x)$, where $f_l(x)$ is the reduced rank approximation from Banerjee et al. [2008] and $f_s(x)$ is the residual of the process after this global structure has been accounted for. The covariance function of the residual is approximated using a tapered covariance kernel, i.e. a kernel in which the covariance is exactly 0 for any two locations whose distance is larger than the specified taper range. The full-scale method has improved performance relative to the reduced rank process or covariance tapering alone and has the desired quality of capturing global and local structure. However, the quality of this approximation to the original function is highly dependent on the choice of taper length and number of inducing points, and there is no way to constructively bound the error between realizations of the approximate and true covariance matrices.
The compression approach of Banerjee et al. [2012] approximates the covariance matrix K using a low-rank construction built from a projection matrix Φ. The "best" rank-m projection matrix in terms of approximation error is the matrix of the first m eigenvectors of K. Because finding the spectral decomposition of K is itself an O(n³) operation, the two algorithms in Banerjee et al. [2012] focus on finding near-optimal projection approximations. The second of the proposed algorithms in the paper has the advantage of a probabilistic bound on the Frobenius norm between the projection approximation and the true covariance matrix. That is, one can choose algorithm settings such that this Frobenius-norm error falls below a chosen tolerance with some desired probability p. However, the algorithm is iterative; its use requires expensive pre-computation steps that scale poorly as n increases, with no defined order of computational complexity in this pre-computation phase.
The hierarchical nearest-neighbor GP [Datta et al., 2016a] is introduced as a sparsity-inducing prior that allows for fully Bayesian sampling with scalability to previously unimplementable data sizes. This prior introduces a finite set of neighborhood sets having block sparsity structure, and defines a relation between these neighborhood sets via a directed acyclic graph. The distribution at any new points is then expressed via nearest neighborhood sets, yielding a valid spatial process over uncountable sets. While this prior is shown to yield similar inference to the full GP, the choice of a sparse prior means the ability to retain fidelity to a true (non-sparse) covariance matrix is sacrificed. Thus, although inference using the true and approximated posteriors appears similar in numerical simulations, there is no bound on their divergence.
The ℋ-matrix approximations used in FIFA-GP for approximating K generalize block-diagonal, sparse, and low-rank approximations to the underlying covariance matrix [Litvinenko et al., 2019]. The first layer of the FIFA-GP approximation to K is analogous to the NNGP with neighbor sets defined to be those points located in the same dense diagonal block. The off-diagonal blocks can be estimated using parallel random projection approaches to enable fast decomposition into low-rank approximations, which are similar to the projection algorithms proposed for the full matrix in the compressed GP model. Additionally, the unique structure of the ℋ-matrix composition allows for fast computation of the determinant, the Cholesky factorization, and the inverse, which are required for an MCMC algorithm. However, these operations cannot be used directly to sample from (3). Here, one of the key contributions of this manuscript is showing how to combine standard ℋ-matrix operations to draw a random vector with the distribution defined in (3).
Using FIFA-GP, local structure is captured by the dense block diagonal elements. At the same time, global structure is preserved via high-fidelity off-diagonal compression. Furthermore, the approximation error between the true and approximate covariance matrices is bounded when constructing the ℋ-matrix. This constructive bound allows for a bound on the KL-divergence of the posterior for the GP, a feature that has not been shown for previous methods. The idea for our accelerated Gibbs sampler is similar to Zhang et al. [2019] and Zhang and Banerjee [2021], both of which rely on the NNGP, but FIFA-GP provides an attractive alternative due to its fidelity to exact GP regression.
3. Covariance approximation using ℋ-matrices
In this section we describe how low rank approximations and tree-based hierarchical matrix decompositions enable fast and accurate computations for Gaussian processes.
3.1. Low-rank matrices
The rank of a matrix is the dimension of the vector space spanned by its columns. Intuitively, the rank can be thought of as the amount of information contained in a matrix. An m × n matrix A is full rank if rank(A) = min(m, n); such a matrix contains maximal information for its dimension. For any rank-p matrix A it is possible to write $A = UV^{\top}$, where U is m × p and V is n × p; then it is only necessary to store (m + n)p values rather than mn values. For GP covariance matrices having n observations the savings gained by relaxing the memory requirements can be quite significant (see the web-based supporting materials for a concrete example).
It is also possible to approximate a full-rank matrix A with some rank-p matrix $A_p$ so that $\|A - A_p\|$ is small. If $p \ll \mathrm{rank}(A)$, significant computational gain can be achieved, but approximation fidelity depends on how much information is lost in the representation of A by a lower-rank matrix. With most low-rank factorization methods the approximation error decreases as p approaches rank(A).
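A minimal R sketch of this trade-off, assuming an arbitrary smooth kernel and an arbitrary rank p (the concrete example in the supporting materials may differ), is:

```r
set.seed(1)
n <- 500
x <- sort(runif(n))
A <- exp(-25 * outer(x, x, "-")^2)          # smooth kernel matrix; numerically low rank

p  <- 20
s  <- svd(A, nu = p, nv = p)
Ap <- s$u %*% (s$d[1:p] * t(s$v))           # rank-p approximation U diag(d_1..d_p) V'

c(dense_storage = n^2, low_rank_storage = 2 * n * p)  # stored values: n^2 versus 2np
max(abs(A - Ap))                            # elementwise (max-norm) approximation error
```

Increasing p toward the numerical rank of A drives the elementwise error toward machine precision at the expense of storage.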
3.2. Hierarchical matrices
A hierarchical matrix (ℋ-matrix) is a data-sparse approximation of a non-sparse matrix relying on recursive tree-based subdivision. The data-sparsity of this approximation depends on how well sub-blocks of the original matrix can be represented by low-rank matrices, and on whether the assumed tree structure aligns with these potentially low-rank blocks. For dense matrices having suitably data-sparse ℋ-matrix approximations, many dense linear algebra operations (e.g. matrix-vector products, matrix factorizations, solving linear equations) can be performed stably and with near linear complexity, since they take place on low-rank or small full-rank matrices. Storage gains are made because the decomposed low-rank blocks can be stored rather than the original full dense matrix blocks. The construction of and algorithms for ℋ-matrices are vast topics and outside of the scope of this manuscript; for further information on ℋ-matrices we refer the reader to Hackbusch [1999, 2015] and Grasedyck and Hackbusch [2003].
A key advantage of using ℋ-matrices for matrix approximation is that the error between the true matrix and the approximate matrix can be bounded by construction. Specifically, the max-norm (i.e. the maximum elementwise absolute difference) can be made arbitrarily small for the type of ℋ-matrix used in FIFA-GP.
The ability of a given matrix to be well approximated by an ℋ-matrix (and thus the potential for computational and storage gains) depends on the structure of that matrix. With a GP covariance matrix, the relationship between rows and columns of the matrix depends on the ordering of the individual data points. In one dimension, a simple direct sorting of the points will lead to the points closest in space being closest in the matrix. In multiple dimensions, clustering algorithms can be used to group similar data points. Alternatively, fast sorting via a kd-tree can be used (e.g. Wald and Havran [2006]), in which the data are sorted recursively one dimension at a time.
There are many types of ℋ-matrices [Hackbusch, 1999, Börm and Hackbusch, 2002, Grasedyck and Hackbusch, 2003, Ambikasaran and Darve, 2013, Hackbusch, 2015]. Each can facilitate posterior computation using the methods outlined below. We focus on HODLR matrices, a type of ℋ-matrix, due to their fast construction and ease of use. HODLR matrices are defined recursively via 2 × 2 block partitions. Off-diagonal blocks are approximated via low rank representations and diagonal blocks are again assumed to be HODLR matrices. The recursion ends when the diagonal blocks reach some specified dimension, at which point the remaining diagonal block matrices are stored densely. The level of a HODLR matrix refers to the number of recursive partitions performed. Storage cost of off-diagonal blocks is reduced by storing the components of the low-rank decomposition rather than the original matrix elements. Compared to general ℋ-matrices, HODLR matrices tend to be superior in practice in one dimension, with faster decomposition and solve algorithms for our purposes. However, they sacrifice the flexibility of allowing high-rank off-diagonal blocks and adaptive matrix partitionings. Other ℋ-matrix classes would be better suited than HODLR matrices to Gaussian processes defined over dimensions greater than one.
HODLR matrices, their factorization, and their use in GP likelihood estimation and kriging have been thoroughly discussed in Ambikasaran and Darve [2013] and Ambikasaran et al. [2014]. Open source code for constructing, factorizing, and performing select linear algebra operations using HODLR matrices is available in the HODLRlib library [Ambikasaran et al., 2019]. Of relevance to our work are the fast symmetric factorization of symmetric positive-definite HODLR matrices, determinant calculation, matrix-vector multiplication, and solves. Specifically, a symmetric positive-definite HODLR matrix A can be symmetrically factored into $A = WW^{\top}$ in O(n log² n) time [Ambikasaran et al., 2014]. An example is included in the web-based supporting materials. Further details on these algorithms are available in Ambikasaran and Darve [2013] and Ambikasaran et al. [2014].
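To make the structure concrete, the R sketch below performs a single level of a HODLR-style split by hand: the diagonal blocks are kept dense while the off-diagonal block is replaced by truncated-SVD factors whose rank is chosen to meet a tolerance. This is an illustration of the idea only; HODLRlib's actual construction is recursive and uses faster compression schemes, and the kernel, tolerance, and block split used here are arbitrary.

```r
set.seed(1)
n <- 1000
x <- sort(runif(n))
K <- exp(-10 * outer(x, x, "-")^2) + 1e-8 * diag(n)

i1  <- 1:(n / 2); i2 <- (n / 2 + 1):n
tol <- 1e-10

# Compress a block to the smallest rank whose discarded singular values fall below tol.
compress <- function(B, tol) {
  s <- svd(B)
  p <- max(1, sum(s$d > tol))
  list(U = s$u[, 1:p, drop = FALSE] %*% diag(s$d[1:p], p),
       V = s$v[, 1:p, drop = FALSE])
}

off <- compress(K[i1, i2], tol)             # K[i1, i2] is approximately U V'

# One-level HODLR-style representation: dense diagonal blocks plus low-rank off-diagonal factors.
K_h <- rbind(cbind(K[i1, i1], off$U %*% t(off$V)),
             cbind(off$V %*% t(off$U), K[i2, i2]))

max(abs(K - K_h))                           # max-norm error, controlled by tol
```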
Assuming points proximal in space are near each other in the covariance matrix, a major benefit of the HODLR decomposition is that local structure is captured with high fidelity due to the dense diagonal blocks in the approximate covariance matrix. By construction, these diagonal blocks are perfectly preserved. Furthermore, global structure is not ignored, as in covariance tapering methods, but rather approximated via low-rank representations of the off-diagonal blocks. The neighborhood sets of the hierarchical nearest-neighbor GP [Datta et al., 2016a] can be thought of similarly to the block diagonal entries (although these neighborhoods need not be all of the same size), but with relationships between neighborhood sets providing an update to this block-sparse structure rather than the subsequent matrix factors in Figure 2.
Figure 2:
Factorization of a 2-level HODLR matrix.
4. Bayesian fast increased fidelity approximate GP algorithm
4.1. Gibbs sampler
Consider the case where d = 1. Bayesian GP regression requires estimating the unknown hyperparameters of the covariance kernel. These parameters vary based upon the kernel chosen; we focus on the squared exponential kernel, i.e., $K(x, x'; \sigma_f, \rho) = \sigma_f^2 \exp\{-\rho (x - x')^2\}$, with extensions to other covariance kernels being straightforward. Under this assumption, let $K_{\sigma_f, \rho}$ denote the n × n matrix computed from the squared-exponential kernel having hyperparameters σf and ρ, and evaluated at the observed inputs x.
Let $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathrm{N}(0, \tau^{-1})$, and specify conditionally conjugate priors for the parameters, i.e., take $\sigma_f^2 \sim \mathrm{IG}(a_\sigma, b_\sigma)$, $\tau \sim \mathrm{Ga}(a_\tau, b_\tau)$, and $\rho \sim \mathrm{DU}(\rho_1, \ldots, \rho_v)$, a discrete uniform distribution over v possible values for ρ. Such priors are standard in the literature [Banerjee et al., 2008, Sang and Huang, 2012, Datta et al., 2016a, Wheeler, 2019]. Then exact GP regression proceeds by sampling from the following conditional distributions:
$f \mid y, \sigma_f^2, \rho, \tau \sim \mathrm{N}\!\left(K\left(K + \tau^{-1}I_n\right)^{-1}y,\; K - K\left(K + \tau^{-1}I_n\right)^{-1}K\right)$ | (4)
$\tau \mid y, f \sim \mathrm{Ga}\!\left(a_\tau + \tfrac{n}{2},\; b_\tau + \tfrac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2\right)$ | (5)
$\sigma_f^2 \mid f, \rho \sim \mathrm{IG}\!\left(a_\sigma + \tfrac{n}{2},\; b_\sigma + \tfrac{1}{2} f^{\top} C_\rho^{-1} f\right)$ | (6)
$\Pr\!\left(\rho = \rho_j \mid f, \sigma_f^2\right) \propto \left|K_{\sigma_f, \rho_j}\right|^{-1/2} \exp\!\left\{-\tfrac{1}{2} f^{\top} K_{\sigma_f, \rho_j}^{-1} f\right\}, \qquad j = 1, \ldots, v,$ | (7)
where $K = K_{\sigma_f, \rho}$ and $C_\rho = \sigma_f^{-2} K_{\sigma_f, \rho}$.
This sampler requires computing the inverse, determinant, and Cholesky decomposition, all O(n³) operations. For what follows, we use the properties of ℋ-matrices to replace these operations with quasi-linear counterparts to develop a fast Gibbs sampler. In the case where a given ℋ-matrix has already been factorized, e.g. when the covariance kernels can be reused by fixing the length-scale components on a discrete grid and pre-computing the factorization, inverse, and determinant, computations can be done in O(n log n) time using the HODLRlib library [Ambikasaran et al., 2019]. The discrete prior in step (7) allows the relevant matrix inverse and determinant to be computed at every grid point before running the Gibbs sampler, so these values need not be recomputed at each step. In practice, this step could be replaced with a Metropolis step for even faster computation when the number of possible values for ρ is large. The Gaussian error sampler can be extended to non-Gaussian errors by defining an appropriate relation between a latent GP and the observed data, e.g., via a probit link for binary data [Choudhuri et al., 2007] or a rounding operator for count data [Canale and Dunson, 2013].
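A minimal R skeleton of the sampler in (4)–(7) is sketched below, with dense linear algebra standing in for the ℋ-matrix operations. The grid of length-scale values, the prior hyperparameters (a_τ, b_τ, a_σ, b_σ), and the kernel parameterization are illustrative assumptions; a production implementation would replace each solve, determinant, and Cholesky with its HODLR counterpart.

```r
set.seed(1)
n <- 200
x <- sort(runif(n, -2, 2))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

sq_exp <- function(x, rho) exp(-rho * outer(x, x, "-")^2) + 1e-8 * diag(length(x))

rho_grid  <- exp(seq(log(0.1), log(10), length.out = 15))        # support of the discrete uniform prior
C_list    <- lapply(rho_grid, function(r) sq_exp(x, r))          # pre-computed correlation matrices
Cinv_list <- lapply(C_list, solve)                               # pre-computed inverses (HODLR: fast solves)
ldet_list <- sapply(C_list, function(C) determinant(C)$modulus)  # pre-computed log-determinants

a_tau <- b_tau <- a_sig <- b_sig <- 1       # assumed prior hyperparameters
tau <- 1; sig2 <- 1; j <- 1                 # initial values; j indexes rho_grid
draws <- matrix(NA, 500, n)

for (it in 1:500) {
  K <- sig2 * C_list[[j]]
  M <- K + diag(n) / tau
  Sigma_f <- K - K %*% solve(M, K)                                     # step (4): posterior covariance
  mu_f    <- K %*% solve(M, y)                                         # step (4): posterior mean
  f <- as.vector(mu_f + t(chol(Sigma_f + 1e-8 * diag(n))) %*% rnorm(n))
  tau  <- rgamma(1, a_tau + n / 2, b_tau + 0.5 * sum((y - f)^2))       # step (5)
  quad <- sapply(Cinv_list, function(Ci) drop(crossprod(f, Ci %*% f))) # f' C_rho^{-1} f for each grid value
  sig2 <- 1 / rgamma(1, a_sig + n / 2, b_sig + 0.5 * quad[j])          # step (6)
  logw <- -0.5 * n * log(sig2) - 0.5 * ldet_list - 0.5 * quad / sig2   # step (7): log N(f; 0, sig2 * C_rho)
  j    <- sample(seq_along(rho_grid), 1, prob = exp(logw - max(logw)))
  draws[it, ] <- f
}
```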
4.1.1. Sampling f
Consider step (4), which requires computation of the posterior mean,
$\mu_f = K\left(K + \tau^{-1}I_n\right)^{-1} y,$
and posterior covariance,
$\Sigma_f = K - K\left(K + \tau^{-1}I_n\right)^{-1}K.$ | (8)
Direct computation of these quantities has cubic time complexity. It is possible to use ℋ-matrices to quickly compute the mean [Ambikasaran et al., 2015], but it is difficult to construct (8) in ℋ-matrix format, which makes sampling via the standard algorithm difficult.
We propose a sampling algorithm for the GP function posterior that leverages the near-linear HODLR operations (specifically matrix-vector products, solutions to linear systems, and applications of the symmetric factor to a vector) rather than direct matrix multiplication or inversion. Let $M = K + \tau^{-1}I_n$ for notational convenience. We approximate the matrices K and M in (4) by ℋ-matrices $\tilde{K}$ and $\tilde{M}$, respectively. In what follows, we use the relation between $\Sigma_f$, K, and M in (8) to develop our sampling algorithm.
Algorithm 1:
GP sampler using an ℋ-matrix
Lemma 1.
From Algorithm 1, $\tilde{f} \sim \mathrm{N}(\tilde{\mu}_f, \tilde{\Sigma}_f)$, where $\tilde{\Sigma}_f$ and $\tilde{\mu}_f$ are defined to be the approximations of the posterior function variance $\Sigma_f$ and mean $\mu_f$, respectively.
Proof. See Appendix.
The most expensive steps in Algorithm 1 are the ℋ-matrix factorizations of the matrices $\tilde{K}$ and $\tilde{M}$. When using a fixed grid of length-scale parameters, the factorization of $\tilde{K}$ can be pre-computed. If the precision τ is also fixed or known (i.e., if the factorization of $\tilde{M}$ can also be pre-computed), the cost of the entire algorithm will be O(n log n) rather than O(n log² n).
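The precise steps of Algorithm 1 are given in the original manuscript; as a hedged reconstruction consistent with Lemma 1 and with the operations listed above (one symmetric-factor application, matrix-vector products, and one linear solve involving M = K + τ⁻¹Iₙ), the R sketch below draws from N(μ_f, Σ_f) without ever forming Σ_f. Dense chol and solve stand in for the corresponding HODLR routines, and the particular update f = u + K M⁻¹(y − u − e) is our assumption rather than a restatement of the authors' algorithm; its mean and covariance equal K M⁻¹y and K − K M⁻¹K.

```r
set.seed(1)
n   <- 1000
x   <- sort(runif(n, -2, 2))
tau <- 10
y   <- sin(2 * pi * x) + rnorm(n, sd = 1 / sqrt(tau))

K <- exp(-5 * outer(x, x, "-")^2) + 1e-6 * diag(n)  # stands in for the HODLR approximation of K
M <- K + diag(n) / tau                              # stands in for the HODLR approximation of K + tau^{-1} I_n
W <- t(chol(K))                                     # symmetric factor K = W W' (HODLR: fast symmetric factorization)

sample_f <- function() {
  u <- as.vector(W %*% rnorm(n))                    # u ~ N(0, K): one symmetric-factor application
  e <- rnorm(n, sd = 1 / sqrt(tau))                 # e ~ N(0, tau^{-1} I_n)
  v <- solve(M, y - u - e)                          # the single linear solve involving M
  as.vector(u + K %*% v)                            # mean K M^{-1} y, covariance K - K M^{-1} K
}

f_draw <- sample_f()

# Monte Carlo check: the average of many draws approaches mu_f = K M^{-1} y.
mu_f <- as.vector(K %*% solve(M, y))
mean(abs(rowMeans(replicate(100, sample_f())) - mu_f))
```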
Remark:
There are a number of ℋ-matrix constructions that can be used in this algorithm, and some may be preferable in certain situations. For example, the ℋ²-matrix [Börm and Hackbusch, 2002] provides a faster construction for 2- and 3-dimensional inputs, and may be preferable in these cases.
When discussing computation times above we have assumed the ℋ-matrix decomposition method used is the HODLRlib implementation of the HODLR decomposition. However, any hierarchical decomposition allowing for symmetric matrix factorization and having an ϵ bound by construction on the max-norm between the approximated and true matrices could be used instead, with the associated computation cost being that of the method used.
4.1.2. Hyperparameter posterior approximation
Sampling (6) and (7) involves solving the linear system $Kb = f$ for b (i.e., finding $K^{-1}f$) and finding the determinant of K. If the matrix K is approximated by an ℋ-matrix $\tilde{K}$, significant computational savings result. The HODLR ℋ-matrix decomposition performs these operations at a cost of O(n log n), where $\tilde{K}$ is a factorized HODLR matrix that has been computed prior to this operation (the factorization itself is an O(n log² n) operation).
4.1.3. Approximation fidelity
The following theorem concerns the approximation fidelity of the conditional posterior of f when the component pieces of the GP posterior covariance $\Sigma_f$, namely K and $M = K + \tau^{-1}I_n$, are calculated using approximations $\tilde{K}$ and $\tilde{M}$, respectively, with the maximum absolute difference between the true and approximated matrices being bounded by ϵ. The proof is general and gives an upper bound on convergence for large n. The implication of this result is that there is an upper limit to the approximation fidelity when n grows past a certain point, even if the true matrices are used, because of the limits of finite computer arithmetic.
Theorem 1.
Let $p = \mathrm{N}(\mu, \Sigma)$, where $\mu = KM^{-1}y$ and $\Sigma = K - KM^{-1}K$, with $M = K + \tau^{-1}I_n$ an n × n positive definite matrix, K the n × n realization of some symmetric covariance kernel, y a length-n vector, and τ a constant. Define approximations $\tilde{K}$ and $\tilde{M}$ such that $\|K - \tilde{K}\|_F \le \epsilon$ and $\|M - \tilde{M}\|_F \le \epsilon$. Then there exist matrices $\tilde{\mu} = \tilde{K}\tilde{M}^{-1}y$ and $\tilde{\Sigma} = \tilde{K} - \tilde{K}\tilde{M}^{-1}\tilde{K}$ such that, for $q = \mathrm{N}(\tilde{\mu}, \tilde{\Sigma})$,
| (9) |
with $\mathrm{KL}(p \,\|\, q) = \int p(x) \log\{p(x)/q(x)\}\, dx$, where the density functions of p and q are denoted by p(·) and q(·), respectively. The constants c1, c2, and c3 are dependent on the conditioning of K and M. Note that using the HODLRlib library, $\tilde{K}$ and $\tilde{M}$ can be created and factorized in O(n log² n) time.
Proof. See Appendix.
The proof relies on the assumption that $\|K - \tilde{K}\|_F$ and $\|M - \tilde{M}\|_F$ can be bounded to be arbitrarily small. The advantage of using a hierarchical matrix decomposition to approximate K and M is that this norm is often bounded by construction. Specifically, for the HODLR decomposition used in this paper the max-norm is bounded on construction [Ambikasaran et al., 2019]. Therefore, the resulting F-norm of the difference between the HODLR approximation and the true matrix is bounded by n² times the max-norm for each matrix (i.e., for matrix A, $\|A - \tilde{A}\|_F \le n^2 \|A - \tilde{A}\|_{\max}$), satisfying the assumption of the proof.
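For completeness, the elementary norm inequality behind this statement (a standard fact, not a result taken from the manuscript) is:

```latex
\|A - \tilde{A}\|_F
  = \Big(\textstyle\sum_{i=1}^{n}\sum_{j=1}^{n}\big(A_{ij} - \tilde{A}_{ij}\big)^2\Big)^{1/2}
  \le \Big(n^2 \max_{i,j}\big(A_{ij} - \tilde{A}_{ij}\big)^2\Big)^{1/2}
  = n \,\|A - \tilde{A}\|_{\max}
  \le n^2 \,\|A - \tilde{A}\|_{\max}
  \le n^2 \epsilon .
```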
4.1.4. Additional considerations
Smooth GPs measured at a dense set of locations tend to have severely ill-conditioned covariance matrices (Section 3.2 of Banerjee et al. [2012] provides an excellent discussion of this issue). An ℋ-matrix approximation does not necessarily improve the conditioning relative to the original dense matrix. Typically, practitioners mitigate this ill-conditioning by adding a small nugget term to the diagonal of the covariance matrix. We include this nugget term prior to ℋ-matrix construction.
Algorithm 1 does not require a linear solve involving the ℋ-matrix approximation of K, but it does require one involving that of $M = K + \tau^{-1}I_n$. A practical tweak that improves conditioning without altering the fundamental algorithm is to scale y so as to make τ smaller, and then remove this scaling factor in post-processing. A smaller τ makes inverting $K + \tau^{-1}I_n$ (and its approximation) more stable with no impact on the posterior surface estimates. It is also possible to combine the ideas in Banerjee et al. [2012] with a given ℋ-matrix algorithm so that the dense diagonal blocks in the factorizations of both $\tilde{K}$ and $\tilde{M}$ could be compressed to the desired level of fidelity in order to improve conditioning within each block.
Details on how the function posterior at new inputs can be sampled, an efficient way to handle non-unique input points, and adjustments for heteroskedastic noise are provided in Section 9.1 of the Appendix.
4.2. Extending to higher dimensional inputs
Though Lemma 1 and Theorem 1 are valid for any input dimension, the HODLR factorization does not generally scale well with the number of input dimensions. For an example of this phenomenon, see the web-based supporting materials. To take advantage of the fast factorization and linear algebra operations afforded by HODLR when d = 1, we propose an extension of these algorithms that scales to higher dimensions at O(dn log² n). We model a d-dimensional surface as a scaling factor times a tensor product of 1-dimensional GPs having unit variance:
$f(x_1, \ldots, x_d) = \beta \prod_{j=1}^{d} g_j(x_j), \qquad g_j \sim \mathrm{GP}\!\left(0, K_j\right), \quad j = 1, \ldots, d,$ | (10)
where β is the scaling factor that controls the variability of the stochastic process and each $K_j$ is a covariance kernel with unit function variance. In practice we assign β a prior distribution, choosing its hyperparameters based on prior knowledge of the scale of the data.
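As an illustration of (10) with d = 2, the R sketch below draws two unit-variance univariate GP paths on marginal grids and forms the surface β g₁(x₁)g₂(x₂); the grid size, kernel, and value of β are arbitrary choices. On a full grid the surface evaluations reduce to an outer product of the two univariate draws.

```r
set.seed(1)
m  <- 60
x1 <- seq(0, 1, length.out = m)
x2 <- seq(0, 1, length.out = m)

unit_sq_exp <- function(x, rho) exp(-rho * outer(x, x, "-")^2) + 1e-6 * diag(length(x))

g1 <- as.vector(t(chol(unit_sq_exp(x1, 20))) %*% rnorm(m))  # g1 ~ GP(0, K1) with unit variance
g2 <- as.vector(t(chol(unit_sq_exp(x2, 20))) %*% rnorm(m))  # g2 ~ GP(0, K2) with unit variance

beta    <- 2                                # scaling factor controlling overall variability
surface <- beta * outer(g1, g2)             # f(x1, x2) = beta * g1(x1) * g2(x2) on the grid

persp(x1, x2, surface, theta = 35, phi = 25,
      xlab = "x1", ylab = "x2", zlab = "f") # one prior draw of the tensor product surface
```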
Approximating surfaces using a tensor product of bases is common [De Boor, 1978, Dierckx, 1984, Greiner and Hormann, 1996, Jüttler, 1997]. Often, the tensor product takes the form of a finite basis expansion using splines. The flexibility of more traditional tensor product spline approaches carries over to this approach, but one does not have to be concerned with the choice of the knot set to achieve this flexibility (e.g. see De Jonge et al. [2012] for an example of the issue of knot set selection using tensor product B-splines). By utilizing Lemma 2.1 of De Jonge et al. [2012], it is trivial to show that there exist GP tensor product specifications whose sample paths under (10) are arbitrarily close to any function in a d-dimensional Hölder space of functions having up to r continuous partial derivatives.
Lemma 2.
Let $\mathcal{C}^r$ be the Hölder space of functions on $[0,1]^d$ having r continuous partial derivatives and let ϵ > 0. Then for any $w \in \mathcal{C}^r$ there exists a tensor product of the form (10) with sample paths f such that $\sup_{x \in [0,1]^d} |f(x) - w(x)| < \epsilon$.
This follows from the fact that B-splines are continuous functions and that, by Lemma 2.1 of De Jonge et al. [2012], there exist B-splines such that their tensor product is within an ϵ ball of any continuous surface. The result then follows because there are a large number of covariance kernels that are dense in the space of continuous functions (see for example Tokdar and Ghosh [2007]). This lemma shows that (10) provides a flexible solution to modeling higher dimensional surfaces.
In practice, the actual covariance kernel used may not be adequate to model a given data set (e.g. there may be high prior probability placed on sample paths that are too smooth and/or the prior over the hyperparameters may be too restrictive). In these cases, one tensor product may over-smooth the data and miss local features of the surface; here, multiple additive tensor products may be considered to alleviate this problem. This was the strategy of Wheeler [2019], who used an additive sum of GP tensor products like those defined in (10). It is also the strategy in most standard spline based approaches, and it is often used in hierarchical grid refinements for tensor product B-splines (e.g. see Jüttler [1997]); however, the choice of an appropriate knot set for each additive component is not necessary when using FIFA-GP.
ℋ-matrices can be used to approximate the GPs in equation (10) and provide the previously discussed benefits of speed and near-machine-precision fidelity. Sampling from the function posterior for each $g_j$ involves heteroskedastic noise, and one must modify the KL divergence bound as well as the sampling algorithm accordingly. To see why, note that $y_i = \beta\, g_j(x_{ij}) \prod_{k \ne j} g_k(x_{ik}) + \epsilon_i$, so that $y_i \big/ \big(\beta \prod_{k \ne j} g_k(x_{ik})\big)$ is a noisy observation of $g_j(x_{ij})$ with variance $\tau^{-1} \big/ \big(\beta \prod_{k \ne j} g_k(x_{ik})\big)^2$.
This approach also provides additional computational advantages when the number of unique input values in a given dimension is smaller than n. For example, consider observations made on a 500 × 500 grid of inputs. Then each univariate GP involves only 500 unique input values, so computation proceeds rapidly even though the total number of observations is 250,000. Such computational gains are not unique to the tensor product approach, but are also available using additive kernels [Duvenaud et al., 2011, Durrande et al., 2012] or Kronecker based inference [Flaxman et al., 2015]; however, these approaches do not have the computational advantages of ℋ-matrix arithmetic and thus still scale with cubic complexity.
The tensor product of GP realizations differs from a GP with a separable covariance kernel comprised of the tensor product of univariate kernels, as in Karol' and Nazarov [2014]. The latter also offers computational advantages, but has the same issues for large covariance matrices. Finally, the tensor product approach in (10) is only one way in which univariate GPs can be used to model higher dimensional surfaces. In some cases, an additive model may suffice [Duvenaud et al., 2011, Durrande et al., 2012].
5. Simulation study
In this section, we report performance and timing of each method for a small-n simulation using data generated from a Gaussian process and a large-n simulation in which the true generating function is known but not a GP. All calculations in this and the subsequent section are performed on a 2016 MacBook Pro with a 2.9 GHz Intel Core i7 processor.
5.1. Small-n simulation
We compare the performance of the approximate methods to that of an exact GP sampler. Of interest are both the similarity of point estimates to the exact GP and the fidelity of uncertainty about those estimates. The experiment with synthetic data proceeds as follows for n ∈ {100, 500, 1000} training points and n* = 50 test points:
Points are simulated from a N[−2, 2](0, 1) distribution (a standard normal truncated to [−2, 2]), so observed data are more concentrated about the origin.
A “true” function is simulated from a Gaussian process having a squared exponential covariance function with hyperparameters σf and ρ. Note that the true values of σf, ρ, and τ are varied in each configuration.
A sampler based on the exact GP covariance matrix is run to get exact Bayesian estimates of the posterior for the GP hyperparameters σf and ρ, the noise precision τ, the function f at the training points, and the function f* at the test points. These estimates are defined as the “exact posterior.”
Each of the approximate methods is used to obtain samples from the posterior for σf, ρ, τ, f, and f*, referred to as “approximate posterior” samples.
For each sampler, the first 2,000 samples are discarded as burn-in and every tenth draw of the next 25,000 samples is retained.
Table 1 shows simulation results for n = 1000. In the table, MSPE (f*) summarizes predictive performance for determining the true function mean at the test points, and 95% CI (f*) Area summarizes the geometric area covered by the 95% credible interval about f*. Each hyperparameter summary provides the mean of the T posterior samples, where θ⁽ᵗ⁾ denotes the t-th draw of θ, and gives the associated 95% pointwise posterior credible interval. The key takeaway from Table 1 is that inference made using FIFA-GP is essentially the same as that made using the exact GP for the GP hyperparameters, the function posterior, and the noise precision. The compressed GP has similar predictive performance and noise precision estimates, but inference on the GP hyperparameters and the area of uncertainty about the function posterior differ from those quantities as measured by the exact GP. Analogous tables for n = 100 and n = 500, and replicate samplers run with the same data for n = 100, shown in the web-based supporting materials, provide further empirical support that the exact GP and FIFA-GP behave almost identically in terms of both predictive performance and inference. An additional table and illustrative figure, also found in the web-based supporting materials, show relatively large deviations between the results for the NNGP and the exact GP when compared to those between FIFA-GP and the exact GP.
Table 1:
Performance results, parameter estimates, and computing time for Bayesian GP regression with n = 1000 data points simulated using a squared exponential covariance kernel. Time shown is total time for setup and 27,000 Gibbs sampler iterations, with 2,500 samples retained.
| | | Truth | Exact GP | FIFA-GP | FIFA-GP | Compressed GP | Compressed GP |
|---|---|---|---|---|---|---|---|
| Smooth and low noise | MSPE (f*) | - | 1e-04 | 1e-04 | 1e-04 | 2e-04 | 2e-04 |
| | 95% CI (f*) Area | - | 0.27 | 0.27 | 0.27 | 0.29 | 0.30 |
| | τ | 30 | 28.3 [25.9, 30.9] | 28.3 [25.9, 30.8] | 28.3 [25.9, 30.8] | 28.4 [26.0, 31] | 28.4 [25.9, 30.9] |
| | σf | 1 | 1.04 [0.52, 2.27] | 0.92 [0.51, 1.85] | 0.98 [0.52, 2.06] | 0.72 [0.44, 1.22] | 0.70 [0.43, 1.17] |
| | ρ | 0.25 | 0.33 [0.15, 0.69] | 0.35 [0.17, 0.70] | 0.33 [0.16, 0.65] | 0.68 [0.56, 0.77] | 0.76 [0.70, 0.80] |
| | Time (min) | - | 96.7 | 8.7 | 8.0 | 6.2 | 6.0 |
| Smooth and high noise | MSPE (f*) | - | 0.008 | 0.008 | 0.008 | 0.007 | 0.007 |
| | 95% CI (f*) Area | - | 0.83 | 0.85 | 0.85 | 0.98 | 0.95 |
| | τ | 2 | 2.25 [2.06, 2.45] | 2.26 [2.07, 2.46] | 2.26 [2.06, 2.46] | 2.26 [2.07, 2.47] | 2.26 [2.06, 2.46] |
| | σf | 1 | 1.01 [0.53, 2.01] | 0.98 [0.52, 1.88] | 0.97 [0.50, 1.98] | 0.89 [0.51, 1.73] | 0.87 [0.50, 1.49] |
| | ρ | 0.25 | 0.23 [0.13, 0.53] | 0.27 [0.13, 0.85] | 0.26 [0.13, 0.68] | 0.68 [0.55, 0.82] | 0.60 [0.55, 0.71] |
| | Time (min) | - | 94.5 | 8.7 | 8.0 | 6.1 | 6.2 |
| Wiggly and low noise | MSPE (f*) | - | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| | 95% CI (f*) Area | - | 0.37 | 0.37 | 0.37 | 0.40 | 0.39 |
| | τ | 30 | 30.6 [28.0, 33.4] | 30.6 [28.0, 33.4] | 30.6 [28.0, 33.3] | 30.7 [28.0, 33.5] | 30.7 [28.1, 33.4] |
| | σf | 1 | 1.45 [0.90, 2.50] | 1.41 [0.87, 2.39] | 1.50 [0.90, 2.91] | 1.14 [0.80, 1.69] | 1.17 [0.79, 1.74] |
| | ρ | 2 | 1.70 [1.20, 2.50] | 1.74 [1.22, 2.60] | 1.69 [1.14, 2.55] | 2.54 [2.54, 2.54] | 2.44 [2.35, 2.53] |
| | Time (min) | - | 96.0 | 9.6 | 8.6 | 7.9 | 7.3 |
| Wiggly and high noise | MSPE (f*) | - | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 |
| | 95% CI (f*) Area | - | 1.32 | 1.33 | 1.33 | 1.30 | 1.30 |
| | τ | 2 | 2.09 [1.91, 2.28] | 2.09 [1.91, 2.28] | 2.09 [1.91, 2.29] | 2.09 [1.91, 2.28] | 2.09 [1.91, 2.28] |
| | σf | 1 | 0.89 [0.57, 1.48] | 0.93 [0.56, 1.64] | 0.91 [0.58, 1.59] | 0.92 [0.58, 1.51] | 0.92 [0.60, 1.52] |
| | ρ | 2 | 2.61 [1.67, 3.73] | 2.48 [1.56, 3.78] | 2.56 [1.62, 3.79] | 2.43 [2.41, 2.50] | 2.42 [2.36, 2.49] |
| | Time (min) | - | 95.3 | 10.8 | 9.7 | 7.6 | 8.0 |
5.2. Large-n simulation
We compare the predictive performance and the computing time of each approximation method using a larger simulated data set that is not based on a true underlying Gaussian process model. In this simulation, the true function f is a fixed, known smooth function and input data are sampled from a N[−2, 2](0, 1) distribution. Values of y are sampled having mean f and precision τ = 1.
Parameter estimates and performance results analogous to those provided in the small-n simulation are provided in the web-based supporting materials. As when data were simulated using a true GP for the mean, parameter estimates for FIFA-GP are similar to those of the exact GP sampler. As n increases, both the precision and accuracy of FIFA-GP increase. Figure 3 shows the time taken (in minutes) to iterate through 100 steps of the sampler for the exact GP and for the approximate methods, with a grid of 100 possible length-scale values comprising the options for the discrete sampling step. The FIFA-GP method remains computationally tractable even when n = 200,000. The cost of the pre-computation steps (i.e., projection construction) scales poorly for the compressed GP method.
Figure 3:
Time taken for 100 samples from full Bayesian posterior. Type “Total” includes the time taken to pre-create the matrices for the discrete uniform grid of length-scale values; “Sampling” only includes the time taken during the sampling phase after these setup computations have occurred.
MSE and MSPE for FIFA-GP are similar to those of the exact GP and compressed GP, and superior to those of the NNGP. The NNGP showed signs of numerical instability for n > 2,000, with empirical coverage jumping to near 1 and large oscillations in curve predictions. Because of this behavior, timing results for the NNGP are omitted from Figure 3, although findings in the next two sections show that whether the NNGP is faster than FIFA-GP depends on the number of neighbors and the structure of the design matrix. MSE, MSPE, and coverage for all methods and error details for the NNGP are provided in the web-based supporting materials, along with results for the NNGP using the more numerically stable exponential covariance kernel, which shows similar performance to the NNGP with a squared exponential kernel for n ≥ 2,000.
5.3. Tensor product simulation
In order to illustrate the capability of the tensor product approach defined in equation (10), we simulate from two functions having d = 2. The first function, g1, is separable into a function of x1 multiplied by a function of x2 and thus should be easily approximated by the tensor product formulation. The second function, g2, is not separable and thus may not be well approximated by a single tensor product term. For each function, independent draws of x1 and x2 are sampled and noisy observations of the functions are made with precision τ = 0.5. The prior parameters on the scaling term β are set to their default values. The number of observations n is varied.
Figure 4 shows the noisy observations of g1 and model estimated surface for varying n. Figure 5 shows the MSPE for surface estimates at test points. The model is able to learn the surface reasonably well, even with small n. Coverage of the 95% credible interval is near nominal (0.962, 0.949, and 0.946 for n = 500, n = 2000, and n = 25000, respectively). Furthermore, the model is scalable to large n because it relies on one-dimensional HODLR-approximated GPs.
Figure 4:
Top: Noisy observations. Bottom: Estimated surface from tensor product model. Left to right: n = 100, n = 1000, and n = 10000.
Figure 5:
MSPE of the tensor product surface for predicting g1 at new inputs.
Figure 6 shows the noisy observations of g2 and the model-estimated surface for varying n, with either 1 or 2 additive basis components used. Figure 7 shows the MSPE for surface estimates at test points. In this example, g2 is poorly approximated by a single tensor product component, i.e. by a model consisting of one term of the form (10). When a second tensor product basis is added, i.e. when we model the data as an additive sum of two such tensor product terms, the model is flexible enough to capture the true surface. Coverage of the 95% credible interval is near nominal (0.956, 0.949, and 0.947 for n = 500, n = 2000, and n = 25000, respectively).
Figure 6:
Top: Noisy observations. Middle (bottom): Estimated surface from tensor product model with one (two) additive basis term(s). Left to right: n = 100, n = 1000, and n = 10000.
Figure 7:
MSPE of the tensor product surface for predicting g2 at new inputs with 1 or 2 additive bases.
With an appropriate number of bases (i.e., 1 for g1 and 2 for g2), FIFA-GP outperforms the competitor methods in terms of MSE and MSPE at large n and is comparable to competitors (including the exact GP) at small n. Of note, the ratio between the MSE using the NNGP and the MSE using FIFA-GP is greater than 1 for all n ≥ 500, and increases as n increases. That is, the relative amount by which FIFA-GP outperforms the NNGP grows with n. This phenomenon is likely due, in part, to the smoothness of these two example functions; the NNGP tends to under-smooth due to its sensitivity to local effects. In terms of speed, FIFA-GP falls between the NNGP with 15 neighbors and the NNGP with 50 neighbors. Full results for the MSE, MSPE, coverage of 95% credible intervals, and timing for all competitors are available in the web-based supporting materials. Also included are the MSE ratios for all n and side-by-side surface prediction visualizations for the NNGP and FIFA-GP. Performance, coverage, timing, and ratio plots are also shown for the NNGP run with an exponential covariance kernel, which performs better than the squared exponential NNGP but worse than FIFA-GP. Finally, we include a head-to-head performance comparison showing that FIFA-GP still exhibits superior performance when it and the NNGP are both run using a Matérn covariance kernel.
6. Non-synthetic data illustrations
We consider two data examples to show how multidimensional surfaces can be modeled using FIFA-GP. As we only include a spatial process and do not include relevant covariates, we do not claim the estimates to be optimal; rather, they are illustrative of how the methodology can be used to model flexible functional forms for extremely large data sets on relatively modest computer hardware.
6.1. Atmospheric carbon dioxide
NOAA started recording carbon dioxide (CO2) measurements at Mauna Loa Observatory, Hawaii in May of 1974. The CO2 data are measured as the mole fraction in dry air in units of parts per million (ppm). Hourly data are available for download from 1974 to the present at https://www.esrl.noaa.gov/gmd/dv/data/index.php?parameter_name=Carbon%2BDioxide&frequency=Hourly%2BAverages&site=MLO. To illustrate the utility of the proposed method, we use the full data for which the quality control flag indicated no obvious problems during collection or analysis (358,253 observations). We model the year-season surface as the sum of a GP for the annual effect, a GP for the seasonal effect, and a tensor product of GPs for year and season. For the tensor product, the one-dimensional GPs had just under 9,000 observations in each dimension. For this example, the full MCMC algorithm took under thirty minutes for 7,000 sampling iterations.
Figure 8 shows the observed and model-predicted year-season CO2 surfaces. There is an evident increase in CO2 across years along with a seasonal pattern, with a peak in early summer and a trough in the fall. The tensor product of GPs allows the seasonal shape to vary smoothly with year.
Figure 8:
Left: Observed surface. Right: Model-predicted surface.
Figure 9 shows slices of the model-predicted year-season CO2 surfaces along each dimension. The tensor product of Gaussian processes fits the data well, with residuals having no apparent seasonal or annual patterns. Earlier years seem to exhibit some heteroskedastic and positively skewed non-Gaussian errors, which could be explicitly accounted for using an alternative noise model.
Figure 9:
Top: Slice of CO2 surface viewed along the year dimension (hour 14 of day 185 shown). Evident is the annual increase in CO2. Bottom: Slice of CO2 surface viewed along the season dimension (year 1975 shown). Evident is the seasonal pattern of CO2 levels, with a peak in early summer and a trough in the fall.
A leave-one-year-out cross validation assessment, shown in detail in the web-based supporting materials, shows that FIFA-GP exhibits consistently strong performance and reasonable posterior predictive estimates. Furthermore, FIFA-GP outperforms the NNGP in terms of MSPE for 75% of the hold-out years and consistently takes around 1/3 the computation time. Of the holdout years for which FIFA-GP outperformed the NNGP, the median reduction in MSPE was about 32%, and the best was 85%. Conversely, of the holdout years for which the NNGP outperformed FIFA-GP these numbers were 21% and 51%, respectively. Note the NNGP was run using the more numerically stable exponential kernel. The exact and compressed GP were not assessed for comparison because both are infeasible to run on data of this size.
6.2. Particulate matter
PM2.5 refers to fine inhalable particulate matter with diameters generally under 2.5 micrometers. The EPA monitors and reviews national PM2.5 concentrations, and sets national air quality standards under the Clean Air Act. Fine particles are the class of particulate matter posing the greatest threat to human health. These particles are also the main cause of reduced visibility in many parts of the United States.
We use the sample-level normalized PM2.5 data for the contiguous states since 1999 (4,977,391 observations). We model the latitude-longitude-time surface as the sum of a GP for the geographic effect and a tensor product of GPs for latitude, longitude, and time. This three dimensional surface fit was again fast, taking about one hour for 7,000 sampling iterations.
Figure 10 shows the model-predicted mean normalized PM2.5 value for the contiguous United States at three time points. There is an evident decrease in normalized PM2.5 across years along with regional hot spots, with the southeast and parts of California being particularly problematic. The total mean normalized PM2.5 decreased over time, which reflects the tightening of air quality standards over this period.
Figure 10:
Model predicted mean PM2.5 value for the contiguous United States. From top to bottom: February 19, 1999; July 16, 2009; and December 11, 2019.
We assess the predictive ability of FIFA-GP for modeling normalized PM2.5 by withholding one full year of data from training and assessing both MSPE on the holdout data and the fidelity of the posterior predictive distribution for two potential quantities of interest. FIFA-GP exhibited satisfactory performance in terms of accuracy, posterior predictive checks, and speed, and outperformed the NNGP (again run using an exponential kernel for numerical stability) in terms of MSPE. Detailed results are shown in the web-based supporting materials.
7. Discussion
We have described an algorithm that allows near-exact Gaussian process sampling to be performed in a computationally tractable way up to n on the order of 10⁵ for univariate inputs using ℋ-matrices. As observed by Litvinenko [2006] and Ambikasaran et al. [2015], the HODLR matrix format can quickly grow in computational cost in higher dimensional input spaces. We have addressed this issue by proposing a tensor product approach that leverages the speed of HODLR operations in univariate input domains and scales to higher dimensions at O(dn log² n). It would also be possible to use the univariate GPs as building blocks in other ways to construct multidimensional surfaces. The downside to this approach is that it is unsuitable for approximating an elaborate GP with a carefully chosen covariance for d > 1.
While the use of HODLR matrices is convenient due to the well supported code base, the ideas in the paper are extensible to other classes of ℋ-matrices as well. Lemma 1 and Theorem 1 are valid for any matrix approximation having an error bound. Other ℋ-matrix approximations could ameliorate some of the computational costs associated with HODLR matrices in higher-dimensional input settings, without requiring the tensor product GP approach, and we have begun to implement these ideas using ℋ²-matrix compression via the H2LIB library [Hackbusch and Börm, 2002]. This library efficiently compresses matrices for inputs of up to three dimensions, and we have had success modeling 3-dimensional GP problems up to n = 20,000. For the H2LIB library, the inversion uses a single processor, but one sample still takes less than a second for many problems where n = 20,000. For higher dimensional inputs, the tensor product approach is still applicable, and we imagine one could approximate a surface using different matrix compression algorithms with the same theoretical guarantees on approximation as well as computational efficiency.
Although we considered the intensive Bayesian regression problem of large n when f(·) is modeled as a Gaussian process, the framework introduced here is general and can be applied to any problem in which some large structured matrix inversion is required within a Bayesian sampler. Still within the GP realm, there are many cases in which univariate GPs are used as model components rather than for standalone regression problems. Having the ability to plug in a fast near-exact sampler could open the use of these component GPs up to larger problems. For these large problems, we note that other methods are continuing and will continue to emerge in the race toward scalability (e.g. meshed Gaussian process models [Peruzzi et al., 2020], which in a sense generalize the NNGP by allowing blocks of locations as nodes). We also see promise in a hybrid approach combining the ℋ-matrix and NNGP approaches, potentially allowing for both increased scalability and higher accuracy.
Outside of the GP regression setting, a similar set of computational bottlenecks arises in the case of large p when f(·) is given a parametric sparsity-inducing prior (e.g., the horseshoe prior [Carvalho et al., 2010]). Specifically, the matrix inversion required in Makalic and Schmidt [2015] is also of the form $\left(\Sigma + \tau^{-1} I\right)^{-1}$, where Σ is some p × p positive definite matrix, τ is a constant, and I is a p × p identity matrix. Bhattacharya et al. [2016] proposed an algorithm that reduces the cubic complexity to O(n²p), but exploring the use of ℋ-matrix compression and arithmetic in the sparse Bayesian regression setting provides a promising avenue of future research that has the potential to mitigate the bottleneck due to the n² term.
Supplementary Material
Figure 1:
Example partition tree and associated partitioning of a 2-level HODLR matrix. The diagonal blocks are full rank, while the off-diagonal blocks can be represented via a reduced rank factorization.
8. Acknowledgements
Research presented in this paper was partially supported by the Department of Energy Computational Science Graduate Fellowship (grant DE-FG02-97ER25308). It was also supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences, and in part by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20200065DR. Approved for public release: LA-UR-21-23822. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors would like to thank David Dunson and Amy Herring for helpful comments.
9. Appendix
9.1. Additional sampling considerations
Here we provide details on sampling the function posterior at new inputs, efficiently handling non-unique input points, and accommodating heteroskedastic noise.
9.1.1. Function posterior at new inputs
In the Gibbs sampling framework, i.e. when we can condition on f, the predictive distribution of the realization of the function f* at a new point x* is given by
f* | f ~ N( k_*^T K^{-1} f, k_{**} − k_*^T K^{-1} k_* ),  (11)
where k_{**} = k(x*, x*), and k_* = (k(x*, x_1), …, k(x*, x_n))^T collects the realizations of the covariance kernel between new input x* and each of the observed inputs.
Posterior realizations of f* can be sampled in O(n log n) per point assuming the ℋ-matrix approximation for K is pre-assembled and pre-factorized; in this case the most costly calculation from equation (11) is the linear solve K^{-1} k_*. Such pre-computation is possible when the length-scale is sampled on a fixed grid. In this case we can also pre-solve K^{-1} k_* for each possible length-scale value.
Unfortunately, we cannot pre-solve K^{-1} f because f is sampled and changes at each iteration of the Gibbs sampler. Therefore, for N such new points the cost is O(Nn log n). Thus, this method is not preferred for applications in which function realizations are desired for a very large number of new points (i.e., when N ≈ n). However, it is quite powerful in applications for which N ≪ n.
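The steps above can be summarized in a short sketch. The code below is illustrative only and assumes a hypothetical pre-factorized solver K_solve(b) returning K^{-1} b (in our setting this would be the HODLR factorization); the dense np.linalg.solve stand-in is used purely so the example runs end to end.

```python
# Illustrative sketch of sampling f* at a new input per equation (11); `kernel`
# and `K_solve` are placeholders, not functions from the paper or its libraries.
import numpy as np

def sample_f_star(x_new, x_obs, f, kernel, K_solve, rng):
    """One draw of f(x_new) given the current Gibbs draw of f."""
    k_star = kernel(x_new, x_obs)                  # covariances with observed inputs
    k_ss = kernel(x_new, x_new)                    # prior variance at x_new
    mean = k_star @ K_solve(f)                     # K^{-1} f cannot be pre-solved across iterations
    var = max(k_ss - k_star @ K_solve(k_star), 0.0)  # guard against round-off
    return rng.normal(mean, np.sqrt(var))

rng = np.random.default_rng(1)
x_obs = np.linspace(0.0, 1.0, 200)
sq_exp = lambda a, b: np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / 0.1 ** 2).squeeze()
K = sq_exp(x_obs, x_obs) + 1e-8 * np.eye(200)
f = rng.multivariate_normal(np.zeros(200), K)      # stand-in for the current Gibbs draw
K_solve = lambda b: np.linalg.solve(K, b)          # dense stand-in for the HODLR solver
print(sample_f_star(0.37, x_obs, f, sq_exp, K_solve, rng))
```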
9.1.2. Non-unique input points
Denote the set of all n (possibly non-unique) inputs as {x_1, …, x_n}. Let {x̃_1, …, x̃_U} denote the set of U unique input values. Then U ≤ n and each x_j equals some x̃_i. Computational savings can be achieved when U ≪ n. Let Q_i denote the set of original indices for which the input values are equal to x̃_i, i.e. Q_i = {j : x_j = x̃_i}. Then define ȳ_i = |Q_i|^{-1} ∑_{j ∈ Q_i} y_j, i = 1, …, U, to be the set of U average observations at each unique input value. Sampling and inference can proceed using a GP prior on the U × U covariance matrix associated with observations (ȳ_1, …, ȳ_U). The precision term τ added to the diagonal of the original covariance matrix is then replaced with |Q_i| · τ for entry (i, i) of the new covariance matrix for all i = 1, …, U.
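A brief numpy sketch of this collapsing step (ours, not the paper's code) is given below: average the replicated observations at each unique input and record the replicate counts |Q_i| that rescale the noise precision.

```python
# Collapse non-unique inputs: average observations at each unique input value
# and return the replicate counts used to scale the precision term tau.
import numpy as np

def collapse_duplicates(x, y):
    x_unique, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    y_bar = np.bincount(inverse, weights=y) / counts      # average y within each Q_i
    return x_unique, y_bar, counts

x = np.array([0.1, 0.1, 0.4, 0.7, 0.7, 0.7])
y = np.array([1.0, 1.2, 0.3, 2.0, 2.2, 1.8])
x_u, y_bar, counts = collapse_duplicates(x, y)
# Entry (i, i) of the noise term in the U x U system uses counts[i] * tau
# in place of tau, reflecting the higher precision of an averaged observation.
```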
9.1.3. Heteroskedastic noise
It was assumed that the observation noise is homoskedastic with precision τ. However, heteroskedastic noise can easily be accommodated in the above samplers. Suppose instead that observation i has its own precision τ_i. Define Λ^{-1} = diag(τ_1^{-1}, …, τ_n^{-1}) to be the diagonal matrix of variance parameters, so that Λ = diag(τ_1, …, τ_n) is then the diagonal matrix of precision parameters. Then the posterior variance from equation (3) becomes K(K + Λ^{-1})^{-1}Λ^{-1}.
The posterior mean from equation (3) becomes K(K + Λ^{-1})^{-1}y.
The format of the above posterior covariance is chosen so as to enable easier HODLR operations, since only the matrix K + Λ^{-1}, a HODLR matrix plus a diagonal, must be factorized. See the web-based supporting materials for further algebraic exposition and a discussion of how the KL divergence bound and Algorithm 1 are adjusted for heteroskedastic variance.
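As an illustration (a dense-matrix sketch under the model stated above, not the paper's HODLR implementation), one conditional draw of f under heteroskedastic noise only requires solves against K + Λ^{-1}:

```python
# Dense sketch (illustration only) of drawing f | y with heteroskedastic noise,
# precision matrix Lambda = diag(tau_1, ..., tau_n); the paper's samplers would
# replace the np.linalg solves with HODLR factorizations of K + Lambda^{-1}.
import numpy as np

def sample_f_hetero(K, y, tau_diag, rng):
    n = len(y)
    A = K + np.diag(1.0 / tau_diag)                        # K + Lambda^{-1}
    mean = K @ np.linalg.solve(A, y)                       # K (K + Lambda^{-1})^{-1} y
    cov = K @ np.linalg.solve(A, np.diag(1.0 / tau_diag))  # K (K + Lambda^{-1})^{-1} Lambda^{-1}
    cov = 0.5 * (cov + cov.T) + 1e-10 * np.eye(n)          # symmetrize, jitter against round-off
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 100)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2 ** 2) + 1e-8 * np.eye(100)
tau_diag = rng.uniform(5.0, 50.0, size=100)                # observation-specific precisions
y = np.sin(2 * np.pi * x) + rng.normal(scale=1.0 / np.sqrt(tau_diag))
f_draw = sample_f_hetero(K, y, tau_diag, rng)
```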
9.1.4. Practical considerations
The HODLR approximation relies on random compression of the off-diagonal block matrices. This process does not guarantee that the result is a positive semidefinite matrix, although in practice it most often is (and when it is not, simply decreasing ϵ will often lead to the conditions being met). Similarly, the product in Theorem 1 is not guaranteed to be positive semidefinite for every perturbation from K and M. In fact, this issue arises in practice for large n even when using the exact GP, due to finite precision arithmetic.
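A possible retry loop implementing the fix mentioned above is sketched below; build_hodlr is a hypothetical stand-in for whatever compression routine is used (e.g., a HODLRlib binding), and the Cholesky attempt simply plays the role of a positive definiteness check.

```python
# Illustrative retry loop: if the compressed approximation fails a positive
# definiteness check, shrink the tolerance epsilon and rebuild.
import numpy as np

def build_psd_approximation(K_dense, build_hodlr, eps=1e-6, shrink=0.1, max_tries=5):
    for _ in range(max_tries):
        K_approx = build_hodlr(K_dense, eps)      # compress with tolerance eps
        try:
            np.linalg.cholesky(K_approx)          # cheap PD check for this demo
            return K_approx, eps
        except np.linalg.LinAlgError:
            eps *= shrink                         # tighten tolerance and retry
    raise RuntimeError("no positive definite approximation found; decrease eps further")

# Usage: build_psd_approximation(K, lambda M, e: my_compressor(M, tol=e))
```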
9.2. Proofs
Proof of Lemma 1.
After letting , . Note that is equivalent to because of the tolerances specified on construction. To see why, note that , so the HODLR approximation satisfies the bound for approximating τK. Since the off-diagonal block matrices are factorized via partial-pivoted LU decomposition, multiplication by a constant does not impact the solution other than by rescaling. That is, finding an approximation for τK with tolerance is equivalent to finding an approximation for τK with tolerance ϵ* and dividing the result by τ. Next, recall the HODLR decomposition preserves diagonal elements of the original matrix, so a decomposition of matrix A plus the identity matrix is equivalently expressed as and . Thus and at the start of Step 1. Now after solving . Because , , and their product are symmetric, the terms commute and . Thus . The mean is created by solving for r (here ), leading to the sum in the returned Z** to have the desired distribution.
Proof of Theorem 1.
What follows is a condensed proof – an extended proof with full exposition on each step is available in the web-based supporting materials.
From equation (3), let . Define and to be the ℋ-matrix approximations of K and M, respectively. Denote and , with density functions and , respectively. In other words, is the true GP posterior and is the ℋ-approximated GP posterior.
| (12) |
Let , and be the matrices of differences between the approximated and true matrices. Assume by construction and . That is, assume the absolute entry-wise difference between each approximated and true matrix is at most some ϵ.
Part 1.
For the part (i.) bound we will rely on the Hoffman-Wielandt (HW) inequality [Bhatia, 1997]. This inequality states that for symmetric n × n matrices Σ0, Σ1 with eigenvalues λ_1, …, λ_n and ν_1, …, ν_n of Σ0, Σ1 respectively, there exists a permutation π of {1, …, n} such that ∑_i (λ_i − ν_{π(i)})^2 ≤ ‖Σ0 − Σ1‖_F^2.
In the case that the max norm of (Σ0 − Σ1) is bounded by ϵ > 0, the HW inequality allows us to bound ∑_i (λ_i − ν_{π(i)})^2 by n^2 ϵ^2, since ‖Σ0 − Σ1‖_F^2 ≤ n^2 ϵ^2.
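A quick numeric sanity check of this step (ours, with arbitrary test matrices) is below: for a symmetric perturbation with entries bounded by ϵ, the sorted-eigenvalue differences satisfy the n^2 ϵ^2 bound.

```python
# Numeric check of the Hoffman-Wielandt step: an entry-wise (max-norm) error of
# at most eps implies sum_i (lambda_i - nu_i)^2 <= ||E||_F^2 <= n^2 eps^2.
import numpy as np

rng = np.random.default_rng(3)
n, eps = 50, 1e-3
A = rng.normal(size=(n, n))
Sigma0 = A @ A.T                                       # symmetric test matrix
E = rng.uniform(-eps, eps, size=(n, n))
E = 0.5 * (E + E.T)                                    # symmetric, still |E_ij| <= eps
lam0 = np.linalg.eigvalsh(Sigma0)                      # sorted eigenvalues
lam1 = np.linalg.eigvalsh(Sigma0 + E)
print(np.sum((lam0 - lam1) ** 2), n ** 2 * eps ** 2)   # left value is the smaller one
```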
We return to the issue of bounding part (i.):
Consider part (d1.). Let and be the eigenvalues of K and , respectively, and π be some permutation of the indices , then we have
The bound on part (d2.) follows analogously, so the overall bound on part (i.) is
| (13) |
Part 2.
For the part (ii.) bound we have
Consider the bound on the first term in the above expression:
Because the ϵ bound on applies to as well, the upper bound on the second term may be found analogously as that on the first . Let and define the minimum and maximum eigenvalues, respectively, of a matrix. By Weyl’s inequality for estimating perturbations of the spectrum [Weyl, 1912], assuming that , we can bound by . Now consider the third and final term. Using the linearity of the trace and the Cauchy-Schwarz inequality applied to the trace of two products, we have
Using properties of trace and the fact that and are PSD to bound (t1.) and (t2.), we have and .
As with the max-norm of , we want the term to only involve K, n, and ϵ. Assuming as before that , and using the fact that the trace of a matrix is the sum of its eigenvalues, we have
Therefore, the bound on all of part (ii.) is
| (14) |
Part 3.
To bound the part (iii.) term, we will use Rayleigh's inequality, the definitions of the spectral and Frobenius norms, and an inequality in Horn and Johnson [2012] for the difference of matrix inverses (call this the DMI inequality).
We bound the term in (n1.) involving and by a function of K, M, n, and ϵ. Assuming as before that , we have .
The term (n1.) thus acts as a constant, and we show that (n2.) will go toward 0 with ϵ. Here, consider the square root of (n2.). Assume , a required condition for the DMI inequality.
As with , we can bound in terms of M, n, and ϵ. Assuming that , .
Note that since n, , and are fixed (although it is possible they are each quite large), we have
Since we chose , we have (assuming ; if not, we could simply choose ϵ satisfying ). Then , so , and all of (n2.) is .
Therefore, the bound on all of part (iii.) is
| (15) |
References
- Ambikasaran Sivaram and Darve Eric. An O(N log N) fast direct solver for partial hierarchically semi-separable matrices. Journal of Scientific Computing, 57(3):477–501, 2013.
- Ambikasaran Sivaram, O’Neil Michael, and Singh Karan Raj. Fast symmetric factorization of hierarchical matrices with applications. arXiv preprint arXiv:1405.0223, 2014.
- Ambikasaran Sivaram, Foreman-Mackey Daniel, Greengard Leslie, Hogg David W, and O’Neil Michael. Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):252–265, 2015.
- Ambikasaran Sivaram, Singh Karan, and Sankaran Shyam. HODLRlib: A library for hierarchical matrices. The Journal of Open Source Software, 4:1167, 2019.
- Banerjee Anjishnu, Dunson David B, and Tokdar Surya T. Efficient Gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2012.
- Banerjee Sudipto, Gelfand Alan E, Finley Andrew O, and Sang Huiyan. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848, 2008.
- Bhatia Rajendra. Matrix analysis. Springer Verlag, 1997.
- Bhattacharya Anirban, Chakraborty Antik, and Mallick Bani K. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, page asw042, 2016.
- Börm Steffen and Hackbusch Wolfgang. Data-sparse approximation by adaptive ℋ²-matrices. Computing, 69(1):1–35, 2002.
- Canale Antonio and Dunson David B. Nonparametric Bayes modelling of count processes. Biometrika, 100(4):801–816, 2013.
- Cao Jian, Genton Marc G, Keyes David E, and Turkiyyah George M. Hierarchical-block conditioning approximations for high-dimensional multivariate normal probabilities. Statistics and Computing, 29(3):585–598, 2019.
- Carvalho Carlos M, Polson Nicholas G, and Scott James G. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
- Chalupka Krzysztof, Williams Christopher KI, and Murray Iain. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14(Feb):333–350, 2013.
- Choudhuri Nidhan, Ghosal Subhashis, and Roy Anindya. Nonparametric binary regression using a Gaussian process prior. Statistical Methodology, 4(2):227–243, 2007.
- Datta Abhirup, Banerjee Sudipto, Finley Andrew O, and Gelfand Alan E. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016a.
- Datta Abhirup, Banerjee Sudipto, Finley Andrew O, and Gelfand Alan E. On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics, 8(5):162–171, 2016b.
- De Boor Carl. A practical guide to splines. Springer-Verlag, 1978.
- De Jonge R, Van Zanten JH, et al. Adaptive estimation of multivariate functions using conditionally Gaussian tensor-product spline priors. Electronic Journal of Statistics, 6:1984–2001, 2012.
- Dierckx Paul. Algorithms for smoothing data on the sphere with tensor product splines. Computing, 32(4):319–342, 1984.
- Durrande Nicolas, Ginsbourger David, and Roustant Olivier. Additive covariance kernels for high-dimensional Gaussian process modeling. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 21, pages 481–499, 2012.
- Duvenaud David K, Nickisch Hannes, and Rasmussen Carl E. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226–234, 2011.
- Eilers Paul HC and Marx Brian D. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2(6):637–653, 2010.
- Flaxman Seth, Wilson Andrew, Neill Daniel, Nickisch Hannes, and Smola Alex. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In International Conference on Machine Learning, pages 607–616, 2015.
- Genton Marc G, Keyes David E, and Turkiyyah George. Hierarchical decompositions for the computation of high-dimensional multivariate normal probabilities. Journal of Computational and Graphical Statistics, 27(2):268–277, 2018.
- Geoga Christopher J, Anitescu Mihai, and Stein Michael L. Scalable Gaussian process computations using hierarchical matrices. Journal of Computational and Graphical Statistics, pages 1–11, 2019.
- Gramacy Robert B and Lee Herbert K H. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.
- Grasedyck Lars and Hackbusch Wolfgang. Construction and arithmetics of ℋ-matrices. Computing, 70(4):295–334, 2003.
- Greiner Günther and Hormann Kai. Interpolating and approximating scattered 3D-data with hierarchical tensor product B-splines. In Proceedings of Chamonix, volume 1, 1996.
- Hackbusch Wolfgang. A sparse matrix arithmetic based on ℋ-matrices. Part I: Introduction to ℋ-matrices. Computing, 62(2):89–108, 1999.
- Hackbusch Wolfgang. Hierarchical matrices: Algorithms and analysis, volume 49. Springer, 2015.
- Hackbusch Wolfgang and Börm Steffen. Data-sparse approximation by adaptive ℋ²-matrices. Computing, 69(1):1–35, 2002.
- Halko Nathan, Martinsson Per-Gunnar, and Tropp Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061, 2009.
- Hensman James, Fusi Nicolo, and Lawrence Neil D. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
- Herbrich Ralf, Lawrence Neil D, and Seeger Matthias. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, pages 625–632, 2003.
- Horn Roger A. and Johnson Charles R. 5.8 Condition numbers: inverses and linear systems. In Matrix Analysis, page 381. Cambridge University Press, 2012.
- Jüttler Bert. Surface fitting using convex tensor-product splines. Journal of Computational and Applied Mathematics, 84(1):23–44, 1997.
- Karol’ Andrei I and Nazarov Alexander I. Small ball probabilities for smooth Gaussian fields and tensor products of compact operators. Mathematische Nachrichten, 287(5–6):595–609, 2014.
- Litvinenko Alexander. Application of hierarchical matrices for solving multiscale problems. PhD thesis, 2006.
- Litvinenko Alexander, Sun Ying, Genton Marc G, and Keyes David E. Likelihood approximation with hierarchical matrices for large spatial datasets. Computational Statistics & Data Analysis, 137:115–132, 2019.
- Liu Haitao, Ong Yew-Soon, Shen Xiaobo, and Cai Jianfei. When Gaussian process meets big data: A review of scalable GPs. arXiv preprint arXiv:1807.01065, 2018.
- Makalic Enes and Schmidt Daniel F. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2015.
- Nguyen-Tuong Duy, Peters Jan R, and Seeger Matthias. Local Gaussian process regression for real time online model learning. In Advances in Neural Information Processing Systems, pages 1193–1200, 2009.
- Peruzzi Michele, Banerjee Sudipto, and Finley Andrew O. Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association, pages 1–14, 2020.
- Quiñonero-Candela Joaquin and Rasmussen Carl Edward. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
- Rasmussen Carl Edward and Williams Christopher KI. Gaussian Processes for Machine Learning. MIT Press, Cambridge MA, 2006.
- Saibaba Arvind K and Kitanidis Peter K. Efficient methods for large-scale linear inversion using a geostatistical approach. Water Resources Research, 48(5), 2012.
- Saibaba Arvind K and Kitanidis Peter K. Fast computation of uncertainty quantification measures in the geostatistical approach to solve inverse problems. Advances in Water Resources, 82:124–138, 2015.
- Sang Huiyan and Huang Jianhua Z. A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):111–132, 2012.
- Sarlos Tamas. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 143–152. IEEE, 2006.
- Titsias Michalis. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
- Tokdar Surya T and Ghosh Jayanta K. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.
- Wald Ingo and Havran Vlastimil. On building fast kd-trees for ray tracing, and on doing that in O(N log N). In 2006 IEEE Symposium on Interactive Ray Tracing, pages 61–69. IEEE, 2006.
- Weyl Hermann. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
- Wheeler Matthew W. Bayesian additive adaptive basis tensor product models for modeling high dimensional surfaces: An application to high-throughput toxicity testing. Biometrics, 75(1):193–201, 2019.
- Williams Christopher KI and Seeger Matthias. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
- Wilson Andrew and Nickisch Hannes. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
- Wilson Andrew Gordon, Dann Christoph, and Nickisch Hannes. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.
- Zhang Lu and Banerjee Sudipto. Spatial factor modeling: A Bayesian matrix-normal approach for misaligned data. Biometrics, 2021.
- Zhang Lu, Datta Abhirup, and Banerjee Sudipto. Practical Bayesian modeling and inference for massive spatial data sets on modest computing environments. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(3):197–209, 2019.