Keywords: Parallel, Hierarchical matrices, Large datasets, Matérn covariance, Random fields, HLIBCov, HLIBpro, Cholesky, Matrix determinant, Parameter identification
Abstract
We provide more technical details about the HLIBCov package, which uses parallel hierarchical (H-) matrices to:
• Approximate large dense inhomogeneous covariance matrices with log-linear computational cost and storage requirements.
• Compute matrix-vector products, Cholesky factorizations, and inverses with log-linear complexity.
• Identify unknown parameters of the covariance function (variance, smoothness, and covariance length).
These unknown parameters are estimated by maximizing the joint Gaussian log-likelihood function. To demonstrate the numerical performance, we identify three unknown parameters in an example with 2,000,000 locations on a desktop PC.
| Subject Area | Mathematics |
|---|---|
| More specific subject area | Applied mathematics, computational statistics, data analysis |
| Method name | HLIBCov (parallel hierarchical matrices for approximating large covariance matrices, likelihood functions and MLE estimations) |
| Name and reference of original method | A. Litvinenko, Y. Sun, M. G. Genton, and D. E. Keyes. Likelihood approximation with hierarchical matrices for large spatial datasets. Computational Statistics & Data Analysis, 137:115–132, (2019). |
| | W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62(2):89–108, 1999. ISSN 0010-485X. |
| | R. Kriemann. Parallel H-matrix arithmetics on shared memory systems. Computing, 74(3):273–297, 2005. ISSN 0010-485X. doi: 10.1007/s00607-004-0102-2. URL: http://www.mis.mpg.de/de/publications/preprints/2004/prepr2004-29.html |
| Resource availability | A. Litvinenko. HLIBCov: Log-likelihood approximation with hierarchical matrices, 2017. URL https://github.com/litvinen/HLIBCov.git |
1 Technical details
- Program title:
HLIBCov
- Nature of problem:
To approximate large covariance matrices. To perform efficient linear algebra with large covariance matrices on a non-tensor grid. To estimate the unknown parameters (variance, smoothness parameter, and covariance length) of a covariance function by maximizing the joint Gaussian log-likelihood function with a log-linear computational cost and storage.
- Software license:
HLIBCov (GPL 2.0), HLIBpro (proprietary)
- CiCP scientific software URL:
- Distribution format:
*.cc files via GitHub
- Programming language(s):
C++
- Computer platform:
any
- Operating system:
Linux, MacOSX and MS Windows
- Compilers:
standard C++ compilers
- RAM:
4 GB and more (depending on the matrix size)
- External routines/libraries:
HLIBCov requires HLIBpro and GNU Scientific Library (https://www.gnu.org/software/gsl/).
- Running time:
depends on the matrix size and scales with the number p of CPU cores (see Table 5 for timings)
- Restrictions:
the same limitations as HLIBpro
- Supplementary material and references:
www.HLIBpro.com and references therein.
- Additional Comments:
HLIBpro is a software library that implements parallel algorithms for hierarchical matrices. It is freely available in binary form for academic purposes. HLIBpro algorithms are designed for one-, two-, and three-dimensional problems.
2 Introduction
HLIBpro is a very fast and efficient parallel H-matrix library. This is an auxiliary technical paper, which contains technical details complementing our previous paper [31]. In [31], we used a gradient-free optimization method to estimate the unknown parameters of a covariance function using HLIB and HLIBpro.
Parameter estimation and problem settings. We let n be the number of spatial measurements located irregularly across a given geographical region at locations s1, …, sn ∈ ℝd, d ≥ 1. We also let Z = {Z(s1), …, Z(sn)}⊤, where Z(s) is a stationary Gaussian random field. Then, we assume that Z has mean zero and a stationary parametric covariance function C(h; θ) = cov{Z(s), Z(s + h)}, where h ∈ ℝd is a spatial distance and the vector θ ∈ ℝq denotes q unknown parameters. To infer θ, we maximize the joint Gaussian log-likelihood function,
ℒ(θ) = −(n/2)·log(2π) − (1/2)·log|C(θ)| − (1/2)·Z⊤C(θ)−1Z, | (2.1) |
where C(θ)ij = C(si − sj; θ), i, j = 1, …, n. Let us assume that θ̂ maximizes (2.1). When the sample size n is large, the evaluation of (2.1) becomes challenging due to the O(n3) computational cost of the Cholesky factorization. Hence, scalable and efficient methods that can process larger n are needed.
For this, the hierarchical matrix (H-matrix) technique is used, which approximates sub-blocks of the dense matrix by low-rank matrices of either a given rank k or a given accuracy ϵ > 0 (see Section 3.2).
Definition 2.1
An H-matrix approximation with maximal rank k of the exact log-likelihood ℒ(θ) is defined by ℒ̃(θ; k):
ℒ̃(θ; k) = −(n/2)·log(2π) − Σi=1n log{(L̃(θ))ii} − (1/2)·v(θ)⊤v(θ), | (2.2) |
where L̃(θ) is an H-matrix approximation of the Cholesky factor L(θ) with maximal rank k in the sub-blocks, C(θ) = L(θ)L(θ)⊤, and the vector v(θ) is the solution of the system L̃(θ)v(θ) = Z.
To maximize ℒ̃(θ; k) in (2.2), we use the Brent-Dekker method [9], [33], which can be used with or without derivatives.
An additional difficulty is the ill-posedness of the optimization problem. Even a small perturbation in the covariance matrix C(θ) may result in large perturbations in the log-determinant and the log-likelihood. A possible remedy, which may or may not help, is to take a higher rank k.
Features of the H-matrix approximation. Other advantages of applying the H-matrix technique are the following:
1. The H-matrix class is large, including the low-rank and sparse matrix classes;
2. C(θ)−1, C(θ)1/2, |C(θ)|, the Cholesky decomposition, the Schur complement, and many other quantities can be computed in the H-matrix format [16];
3. Since the H-matrix technique has been well studied, many examples, multiple sequential and parallel implementations, and a solid theory are already available. Therefore, no specific MPI or OpenMP knowledge is needed;
4. The H-matrix cost and accuracy are controlled by k;
5. The H-Cholesky factor and the H-inverse often have moderate ranks.
Structure of the paper. After the introduction and problem setting in Section 2, we explain the H-matrix approximation of Matérn covariance matrices in Section 3. In Section 4, we describe the software installation details and the input and output of the HLIBCov code. In Section 5, we show the dependence of the storage and computing costs on the H-matrix rank k. We visualize the shapes of the log-likelihood functions for different n. Finally, we demonstrate how to estimate three unknown parameters of C(θ). Best practices are listed in Section 6. We end the paper with a conclusion in Section 7. Auxiliary H-matrix and log-likelihood details are provided in Appendices A and B.
3 Methodology and algorithms
3.1 Matérn covariance functions
Matérn covariance functions [32] form a widely used class of covariance functions [14], [20].
For any two spatial locations s and s′ and the distance h : = ||s − s′||, the Matérn class of covariance functions is defined as
C(h; θ) = (σ2/(2ν−1·Γ(ν)))·(h/ℓ)ν·Kν(h/ℓ), | (3.1) |
where θ = (σ2, ℓ, ν)⊤; ℓ > 0 is a spatial range parameter; ν > 0 is the smoothness, with larger values of ν corresponding to smoother random fields; and σ2 is the variance. Here, Kν denotes the modified Bessel function of the second kind of order ν, and Γ(·) denotes the Gamma function. The values ν = 1/2 and ν = ∞ correspond to the exponential and Gaussian covariance functions, respectively.
3.2 Introduction to hierarchical matrices
Detailed descriptions of hierarchical matrices [12], [15], [16], [17], [18], [29] and their applications can be found elsewhere [1], [2], [6], [21], [23], [24], [30].
The H-matrix technique was originally introduced by Hackbusch (1999) for the approximation of stiffness matrices and their inverses coming from partial differential and integral equations [8], [12], [15]. Briefly, the key idea of the H-matrix technique is to divide the initial matrix into sub-blocks in a specific way, identify the sub-blocks that can be approximated by low-rank matrices, and compute the corresponding low-rank approximations.
The partitioning of the matrix into sub-blocks starts by recursively dividing the rows and columns into disjoint sub-sets, e.g., splitting the set of all rows into two (equally sized) sub-sets, which are again divided. This yields a cluster tree where each sub-set of rows/columns is called a cluster. By multiplying the cluster trees for the rows and the columns, a hierarchical partitioning of the matrix index set is obtained, the so-called block cluster tree TI×I. Within this block cluster tree, low-rank approximable blocks are identified using an admissibility condition. Such admissible blocks are not further refined into sub-blocks, i.e., the corresponding sub-tree is not computed or stored. For all admissible blocks a low-rank approximation of the initial matrix is computed, either with a given rank k (fixed-rank strategy) or an accuracy ε > 0 (fixed-accuracy strategy). The result of this computation is called an H-matrix. This process is also shown in Fig. 1.
Definition 3.1
Let I be an index set (representing the rows/columns) and TI be a cluster tree based on I. Furthermore, let TI×I be a block cluster tree based on TI and an admissibility condition adm : TI×I → {true, false}. Then the set of H-matrices with maximal rank k is defined as
H(TI×I, k) := {M ∈ ℝI×I : rank(M|t×s) ≤ k for all admissible leaves (t, s) of TI×I}.
Fig. 1.
Examples of a cluster tree TI (left) and a block cluster tree TI×I (right). The decomposition of the matrix into sub-blocks is defined by TI×I and the admissibility condition.
Here the block cluster (t, s) is an element (a vertex) of the block cluster tree TI×I.
Various partitioning strategies for the rows and columns of the matrix and various admissibility conditions have been developed to approximate different types of matrices. Typical admissibility conditions are strong (also called standard), weak, and based on domain decomposition [16], for which examples are shown in Fig. 2. The red blocks indicate dense or inadmissible blocks, whereas green blocks are identified as admissible. The maximal size of the dense blocks (i.e., how deep the hierarchical subdivision into sub-blocks is) is regulated by the parameter “nmin”, whose value affects the storage size and the runtime of the H-matrix arithmetic; e.g., a smaller value leads to less storage but is often inefficient with respect to CPU performance. Typical values of nmin are in the range 20 to 150.
Remark 1
With a more appropriate choice of the admissibility condition (criterion), one can influence the depth of the hierarchy, the size of the sub-blocks, their number, and the number of empty sub-blocks. The choice of the admissibility criterion is not crucial here; the H-matrix rank (or accuracy) in each sub-block is much more important. If the H-matrix rank in each block is sufficiently large, then any admissibility criterion will work. In our numerical tests, we used the standard admissibility criterion.
Fig. 2.
Examples of three different block partitionings, generated with three different admissibility criteria: (left) strong, (middle) domain-decomposition-based, and (right) weak.
For the computation of the low-rank approximation of admissible sub-blocks, many different methods are available, e.g., adaptive cross approximation (ACA), hybrid cross approximation (HCA), rank-revealing QR, and randomized SVD [3], [5], [7], [8], [11], [19], [22]. For the fixed-rank strategy, the resulting low-rank matrix is of rank at most k. In the case of the fixed-accuracy strategy with a given ε > 0, the low-rank approximation M̃ of the sub-block M is computed such that ‖M − M̃‖2 ≤ ε·‖M‖2. The storage size of the resulting H-matrix is of order O(kn log n) [12].
In Fig. 3 (left), an example of an H-matrix approximation to C(θ) can be found. There, the local ranks and the decay of the singular values in the admissible blocks (green) on a logarithmic scale are shown.
Fig. 3.
Examples of H-matrix approximations of the exponential covariance matrix (left), its hierarchical Cholesky factor (middle), and the zoomed upper-left corner of the matrix (right); n = 4000, ℓ = 0.09, ν = 0.5, σ2 = 1. Approximation and arithmetic are performed with a fixed accuracy of 10−5. The number inside a sub-block indicates the maximal rank, while the “stairs” represent its singular values in logarithmic scaling.
In addition to efficient matrix approximation, H-matrices also permit full matrix arithmetic, e.g., matrix addition, matrix multiplication, inversion, or factorization. However, similar to matrix compression, H-matrix arithmetic is approximate in order to maintain log-linear complexity. The approximation during arithmetic again uses either a fixed rank or a fixed accuracy [12]. In this work, we make use of the H-Cholesky factorization of C(θ) (see Fig. 3).
For C(θ), the predefined rank (or accuracy ε) defines the accuracy of the H-matrix approximation, both for the initial approximation of C(θ) and for the Cholesky factorization used to apply C(θ)−1.
3.3 Parallel hierarchical-matrix technique
We used the parallel H-matrix library HLIBpro [13], [25], [27], [28], which implements H-matrix approximation and arithmetic functions using a task-based approach to make use of today's many-core architectures. For this, each mathematical operation is decomposed into small atomic tasks with corresponding incoming and outgoing data dependencies. This set of tasks and dependencies forms a directed acyclic graph (DAG), which is used for scheduling the tasks to the CPU cores, e.g., once all incoming data dependencies of a task are met, it is executed on the next free CPU core.
The computational complexity of the different H-matrix operations is shown in Table 1. Here, |V(T)| denotes the number of vertices and |L(T)| the number of leaves in the block-cluster tree T = TI×I. The sequential terms in these estimates are typically due to the sequential behaviour of the corresponding algorithm, e.g., strictly following the diagonal during the Cholesky factorization, but usually do not dominate in practical applications since the majority of the computational work is parallelized.
Table 1.
Parallel complexity of the main linear operations in HLIBpro on p cores. Truncated multiplication and addition are denoted by ⊙ and ⊕.
| Operation | Parallel Complexity [26] (Shared Memory) |
|---|---|
| build | |
| store | |
| H-Cholesky | |
| determinant | |
4 HLIBCov and HLIBpro installation
This section contains a summary of the information provided at https://www.hlibpro.com and https://github.com/litvinen/HLIBCov.git. HLIBpro supports both shared and distributed memory architectures, though in this work we only use the shared memory version. For the implementation of the task-parallel approach, Intel's Threading Building Blocks (TBB) is used (Table 2). HLIBpro is free for academic purposes and is distributed in pre-compiled form (no source code available). Originally, HLIBpro was developed for solving FEM and BEM problems [13], [28]. In this work, we extend the applicability of HLIBpro to dense covariance matrices and log-likelihood functions.
Table 2.
Version of Software used for Experiments.
| Software | Version |
|---|---|
| HLIBCov | 1.0 |
| HLIBpro | 2.6 |
| GSL | 1.16 |
| TBB | 4.3 |
Installation: HLIBCov uses the functionality of HLIBpro; therefore, HLIBpro must be installed first. All functionality implemented by HLIBCov is based on HLIBpro, i.e., no extra software is needed in addition to the libraries needed by HLIBpro. This also holds for the Matérn kernel, which uses Bessel functions and maximization algorithms, both provided by the GNU Scientific Library (GSL) and also used by HLIBpro. The reader can easily replace GSL with their own optimization library. The Bessel functions are also available in other packages.
To install HLIBpro on MacOS and Windows, we refer the reader to www.HLIBpro.com for further details.
Hardware. All of the numerical experiments herein are performed on a Dell workstation with two Intel(R) Xeon(R) E5-2680 v2 CPUs (2.80GHz, 10 cores/20 threads) and 128 GB main memory.
Adding HLIBCov to HLIBpro. The easiest form of compiling HLIBCov is by using the compilation system of HLIBpro. For this, the source code file of HLIBCov is placed in the examples directory of HLIBpro and an entry is added to the file examples/SConscript:
Afterwards, the make process of HLIBpro is run to compile also HLIBCov (see HLIBpro installation instructions at www.hlibpro.com).
Input of HLIBCov. The first line of the input file contains the total number of locations n. Lines 2, …, n + 1 contain the coordinates xi, yi, and the measurement value. An example is provided below.
HLIBpro requires neither a list of finite elements nor a list of edges. We provide several input files of different sizes on the open-access file hosting service GitHub (https://github.com/litvinen/HLIBCov.git): the two archives data.tar.gz and moisture_data.zip each contain multiple data sets of different sizes.
Output of HLIBCov. The main output is the three identified parameter values (ℓ̂, ν̂, σ̂2). The auxiliary output may include H-matrix details: the maximal rank k, the maximal accuracy in each sub-block, and the Frobenius and spectral norms of the approximation error. Additionally, the iterations of the maximization algorithm can also be printed out. The example of an output file provided below contains two iterations: the index, ν, ℓ, σ2, the value of ℒ̃, and the residual TOL of the iterative method:
If the iterative process converges, then the last row contains the solution (ℓ̂, ν̂, σ̂2). When computing error boxes, the output file contains M solutions (n, ℓ*, ν*, σ2*), where M is the number of replicates:
The name of the output file can be found in the main() procedure in loglikelihood.cc.
5 Numerical experiments
We perform several numerical tests. First, we investigate how the approximation errors and the Kullback-Leibler divergence depend on the maximal rank k. Second, we show how the memory requirement for the matrix depends on ℓ and ν. Third, we compute the variances of ℓ and ν vs. k and n. Fourth, we estimate the unknown parameters.
5.1 Convergence errors and memory requirement
The Kullback-Leibler divergence (KLD) DKL(P‖Q) is a measure of the information loss when a distribution Q is used to approximate P. For the multivariate normal distributions P = N(μ0, C) and Q = N(μ0, C̃), it is defined as follows:
DKL(P‖Q) = (1/2)·[trace(C̃−1C) − n + log(|C̃|/|C|)].
In Tables 3 and 4, we show the dependence of the KLD and two matrix errors on the H-matrix rank k for the Matérn covariance function with parameters ℓ = {0.25, 0.75}, ν = {0.5, 1.5}, and σ2 = {1.0, 1.0}, computed on a two-dimensional domain. All errors are under control, except in the last column. The ranks k = {10, 12} or k = {10, 20} are too small to approximate the inverse, and, therefore, the resulting error is large. Relatively often, the H-matrix procedure that computes the H-Cholesky factor or the H-inverse produces “NaN” (not a number) and terminates. One possible cause is that some of the diagonal elements are very close to zero, so that their inverses are not defined. This may happen when two locations are very close to each other and, as a result, two columns (rows) are nearly linearly dependent. To avoid such cases, the available data should be preprocessed to remove duplicate locations. Very often, a nugget τ2I is added to the main diagonal to stabilize the numerical calculations (see more in Section 5.5), i.e., C̃(θ) is replaced by C̃(θ) + τ2I. In Tables 3 and 4, the nugget is equal to zero.
Table 3.
KLD and H-matrix approximation errors vs. the H-matrix rank k for the Matérn covariance function; ℓ = {0.25, 0.75}, ν = 0.5, σ2 = 1, and ||C(ℓ=0.25,0.75)||2 = {212, 568}.
| k | KLD | | ‖C − C̃‖2 | | ‖I − (L̃L̃⊤)−1C‖2 | |
|---|---|---|---|---|---|---|
| | ℓ = 0.25 | ℓ = 0.75 | ℓ = 0.25 | ℓ = 0.75 | ℓ = 0.25 | ℓ = 0.75 |
| 10 | 2.6 × 10−3 | 2.0 × 10−1 | 7.7 × 10−4 | 7.0 × 10−4 | 6.0 × 10−2 | 3.1 × 100 |
| 12 | 5.0 × 10−4 | 2.2 × 10−2 | 9.7 × 10−5 | 5.6 × 10−5 | 1.6 × 10−2 | 5.0 × 10−1 |
| 15 | 1.0 × 10−5 | 9.0 × 10−4 | 2.0 × 10−5 | 1.1 × 10−5 | 8.0 × 10−4 | 2.0 × 10−2 |
| 20 | 4.5 × 10−7 | 4.8 × 10−5 | 6.5 × 10−7 | 2.8 × 10−7 | 2.1 × 10−5 | 1.2 × 10−3 |
| 50 | 3.4 × 10−13 | 5.0 × 10−12 | 2.0 × 10−13 | 2.4 × 10−13 | 4.0 × 10−11 | 2.7 × 10−9 |
Table 4.
KLD and H-matrix approximation errors vs. the H-matrix rank k for the Matérn covariance function; ℓ = {0.25, 0.75}, ν = 1.5, σ2 = 1, and ||C(ℓ=0.25,0.75)||2 = {720, 1068}.
| k | KLD | | ‖C − C̃‖2 | | ‖I − (L̃L̃⊤)−1C‖2 | |
|---|---|---|---|---|---|---|
| | ℓ = 0.25 | ℓ = 0.75 | ℓ = 0.25 | ℓ = 0.75 | ℓ = 0.25 | ℓ = 0.75 |
| 20 | 1.2 × 10−1 | 2.7 × 100 | 5.3 × 10−7 | 2.3 × 10−7 | 4.5 × 100 | 7.2 × 101 |
| 30 | 3.2 × 10−5 | 4.0 × 10−1 | 1.3 × 10−9 | 5.0 × 10−10 | 4.8 × 10−3 | 2.0 × 101 |
| 40 | 6.5 × 10−8 | 1.0 × 10−2 | 1.5 × 10−11 | 8.0 × 10−12 | 7.4 × 10−6 | 5.0 × 10−1 |
| 50 | 8.3 × 10−10 | 3.0 × 10−3 | 2.0 × 10−13 | 1.5 × 10−13 | 1.5 × 10−7 | 1.0 × 10−1 |
Fig. 4 shows that the H-matrix storage cost remains almost the same for the different parameters ℓ = {0.15, …, 2.2} (left) and ν = {0.3, …, 1.3} (right). The Matérn covariance function is discretized in the domain [32.4, 43.4] × [−84.8, −72.9] with n = 2,000 mesh points.
Fig. 4.
Dependence of the matrix size on (left) the covariance length ℓ (the other two parameters are fixed: ν = 0.325, σ2 = 0.98), and (right) the smoothness ν (the other two parameters are fixed: ℓ = 0.58, σ2 = 0.98), for two different accuracies in the H-matrix sub-blocks, ε = {10−4, 10−6}, and n = 2,000 locations in the domain [32.4, 43.4] × [−84.8, −72.9].
5.2 Uncertainty in parameters vs. k and n
In Fig. 5 (left), the results of computing ℓ with different ranks in the H-matrix approximation for 100 replicates are shown. On each box, the central red line indicates the median. The lower edge of the box indicates the 25th percentile, and the top edge the 75th percentile. The outliers are marked by the red symbol ‘+’. The bold long red line denotes the true value of the parameter, ℓ = 0.0334. With a larger rank and, hence, a better approximation, the variance of ℓ decreases.
Fig. 5.
(left) Dependence of the boxplots for ℓ on the H-matrix rank k, when n = 16,000; (right) convergence of the boxplots for ν with increasing n (read this plot from right to left); 100 replicates.
The dependence of ν on the problem size, i.e., the number of measurements n, is also tested, with the results shown in Fig. 5 (right). As the results demonstrate, the estimation of the parameter ν improves with a larger number of measurements n. This experiment also shows that the estimates obtained on a smaller data set, say with n = 2000 measurements, can be used as starting values in the next experiment with a larger data set, say n = 4000.
5.3 Log-likelihood vs. ℓ and ν for different n
In Fig. 6, we illustrate the dependence of ℒ̃ on the parameters ℓ (left, with ν = 0.5, σ2 = 1) and ν (right, with ℓ = 0.0864 and σ2 = 1). Both figures demonstrate the smooth dependence of ℒ̃ on ℓ and ν. They also illustrate the locations of the minima for different sample sizes n = {2000, 4000, …, 128000}.
Fig. 6.
(left) Shape of the scaled log-likelihood function ℒ̃ vs. ℓ for different sample sizes n. (right) Shape of the scaled log-likelihood function ℒ̃ vs. ν for different sample sizes n.
5.4 Identification of unknown parameters
We generate a “synthetic” data set with known parameters (ℓ*, ν*, σ2*) and then try to infer these parameters. To build M different data sets (M replicates) with n ∈ {2, 4, …, 64} × 1000 locations, we generate a large vector Z0 at n0 = 2 × 106 locations and randomly sample n points from it. We note that if locations are very close to each other, then the covariance matrix may be singular, or the Cholesky factorization will be very difficult to compute.
To generate the random data Z0, we compute the H-Cholesky factorization C̃(θ*) = L̃(θ*)L̃(θ*)⊤. Then, we evaluate Z0 = L̃(θ*)ξ, where ξ is a normal vector with zero mean and unit variance. We generate Z0 only once. Next, we run our optimization algorithm and try to identify (recover) the “unknown” parameters (ℓ*, ν*, σ2*). The resulting boxplots for ℓ and σ2 over M = 100 replicates are illustrated in Fig. 7. We see that the variance (or uncertainty) decreases with increasing n. The green line indicates the true values.
Fig. 7.
Synthetic data with known parameters (ℓ*, ν*, σ2*). Boxplots for ℓ and σ2 for n = 1,000 × {64, 32, …, 4, 2}; 100 replicates.
To identify all three parameters simultaneously, we solve a three-dimensional optimization problem. The maximal number of iterations is set to 200, and the residual is 10−6. The behavior and accuracy of the boxplots depend on the -matrix rank, the maximum number of iterations to achieve a certain threshold, the threshold (or residual) itself, the initial guess, the step size in each parameter of the maximization algorithm, and the maximization algorithm. All replicates of Z are sampled from the same generated vector of size n0 = 2 ×106.
In Table 5, we present the almost-linear storage cost (columns 3 and 6) and the computing time (columns 2 and 5).
Table 5.
Computing time and storage vs. n. The number of parallel computing cores is 40. The H-matrix accuracy in each sub-block, for both C̃ and L̃, is 10−5.
| n | C̃(θ) | | | L̃(θ) | | |
|---|---|---|---|---|---|---|
| | comp. time (sec.) | size (MB) | kB/dof | comp. time (sec.) | size (MB) | |
| 32,000 | 3.3 | 162 | 5.1 | 2.4 | 172.7 | 2.4 × 10−3 |
| 128,000 | 13.3 | 776 | 6.1 | 13.9 | 881.2 | 1.1 × 10−2 |
| 512,000 | 52.8 | 3420 | 6.7 | 77.6 | 4150 | 3.5 × 10−2 |
| 2,000,000 | 229 | 14790 | 7.4 | 473 | 18970 | 1.4 × 10−1 |
The shape of the negative log-likelihood function and its components are illustrated in Fig. 8. This helps us to understand the behavior of the iterative optimization method, and the contributions of the log-determinant and the quadratic functional. We see that the log-likelihood is almost flat, and that it may be necessary to perform many iterations in order to find the minimum.
Fig. 8.
Dependence of the negative log-likelihood and its ingredients on parameters ℓ (on the left); ν (in the middle); and σ2 (on the right). In each experiment the other two parameters are always fixed: (ν = 0.325, σ2 = 0.98) on the left, (ℓ = 0.62, σ2 = 0.98) in the middle; (ℓ = 0.62, ν = 0.325) on the right; n = 64, 000.
To examine how close to each other the log-likelihood functions computed with different H-matrix accuracies are, we present Table 6. It contains the values of three log-likelihood functions computed with three different H-matrix accuracies, {10−7, 10−9, 10−11}. The covariance function is exponential (i.e., ν = 0.5) and is discretized in the domain [32.4, 43.4] × [−84.8, −72.9] with n = 32,000 mesh points (locations). The columns correspond to different covariance lengths {0.001, …, 0.1}.
Table 6.
Comparison of three log-likelihood functions computed with three different H-matrix accuracies {10−7, 10−9, 10−11}. Exponential covariance function discretized in the domain [32.4, 43.4] × [−84.8, −72.9], n = 32,000 locations.
| ℓ | 0.001 | 0.005 | 0.01 | 0.02 | 0.03 | 0.05 | 0.07 | 0.1 |
|---|---|---|---|---|---|---|---|---|
| ε = 10−7 | 44657 | 36157 | 36427 | 40522 | 45398 | 68450 | 70467 | 90649 |
| ε = 10−9 | 44585 | 36352 | 36113 | 41748 | 47443 | 60286 | 70688 | 90615 |
| ε = 10−11 | 44529 | 37655 | 36390 | 42020 | 47954 | 60371 | 72785 | 90639 |
5.5 Adding nugget τ2
When diagonal values of C̃ are very close to zero, the H-Cholesky procedure becomes unstable and produces negative entries on the diagonal during the computation. By adding a diagonal matrix with small positive numbers, all singular values become larger and move away from zero. However, by adding a nugget, we redefine the original matrix as C̃ + τ2I. Below, we analyze how the log-likelihood function, as well as its maximum, changes as a result.
We assume that C̃ and C̃ + E are symmetric positive definite. For a small perturbation matrix E [10], it holds that
log|C̃ + E| − log|C̃| = log|I + C̃−1E| ≤ n·log(1 + κ(C̃)·‖E‖2/‖C̃‖2),
where κ(C̃) = ‖C̃‖2·‖C̃−1‖2 is the condition number of C̃, and E = τ2I. Alternatively, by substituting ‖E‖2 = τ2, we obtain
log|C̃ + τ2I| − log|C̃| ≤ n·log(1 + τ2·‖C̃−1‖2). | (5.1) |
From (5.1), we see that the error on the left-hand side of (5.1) depends on the norm ‖C̃−1‖2, i.e., it is inversely proportional to the smallest singular value of C̃. This may explain possible failures when approximating matrices whose smallest singular values tend towards zero. Estimates for the H-Cholesky factorization and the Schur complement of general sparse positive-definite matrices are given in [4]. The approximation errors are proportional to the condition number κ(C̃), i.e., matrices with a very large condition number may require a very large H-matrix rank.
Fig. 9 (left) demonstrates three negative log-likelihood functions computed with the nuggets 0.01, 0.005, and 0.001. For this particular example, the behavior of the likelihood is preserved, and the minimum does not change (or changes very slightly). Fig. 9 (right) is just a zoomed version of the picture on the left.
Fig. 9.
(left) Dependence of the log-likelihood on parameter ℓ with nuggets ({0.01, 0.005, 0.001}) for Gaussian covariance. (right) Zoom of the left figure near minimum; n = 2000 random locations, rank k = 14, σ2 = 1.
6 Best practices (HLIBCov)
In this section, we list our recommendations and warnings.
For practical computations, use the adaptive-rank (fixed-accuracy) arithmetic, since it produces smaller matrices and shorter runtimes.
For the input, it is sufficient to define a file by three columns in 2D and four columns in 3D: location coordinates (x, y, z) and the observed value; no triangles or edges are required.
If two locations coincide or are very close to each other, then the matrix will be singular or close to singular. As a result, it will be hard to compute the Cholesky factorization. Our suggested remedy is to improve the quality of the locations by preprocessing the data.
By default, the -Cholesky or LU factorizations use a task-based approach employing a DAG (directed acyclic graph). For sequential computations this can be turned off to revert to a slightly faster recursive implementation by setting
HLIB::CFG::Arith::use_dag = false
By default, HLIBpro uses all available computing cores. To perform computations on 16 cores, use HLIB::CFG::set_nthreads(16) at the beginning of the program (after command INIT()).
Since HLIBpro works for d = 1, 2, 3, …-dimensional problems, only very minor changes are required to move from 1D locations to 2D or 3D (or higher). Replace dim = 2 with dim = 3 in
TCoordinate coord(vertices, dim);
then add ">> z" to obtain
in >> x >> y >> z >> v;
The H-matrix data format is a rather complicated data structure (class) in HLIBpro. Therefore, H-matrix objects (or pointers to them) are neither input nor output parameters. Instead, the input parameters of the HLIBpro C++ routines are a vector (array) of locations and a vector of observations Z. A triangulation (a list of triangles/edges) is not needed. The output parameters are either scalar values or a vector; for example, the determinant, the trace, a norm, the result of a matrix-vector product, or an approximation error.
7 Conclusion
We extended the functionality of the parallel H-matrix library HLIBpro to infer unknown parameters for applications in spatial statistics. This new extension allows us to work with large covariance matrices. We approximated the joint multivariate Gaussian likelihood function and found its maxima in the H-matrix format. These maxima were used to estimate the unknown parameters (ℓ, ν, and σ2) of a covariance model. The new code is parallel, highly efficient, and written in C++. With the H-matrix technique, we reduced the storage and computing costs of the log-likelihood function dramatically, from cubic to almost linear (Tables 4, 5). We demonstrated these advantages in a synthetic example, where we were able to identify the true parameters of the covariance model. We were also able to compute the log-likelihood function for 2,000,000 locations in just a few minutes on a desktop machine (Table 5). The H-matrix technique allowed us to increase the spatial resolution, handle more measurements, consider larger regions, and identify more parameters simultaneously.
Acknowledgments
The research reported in this publication was supported by funding from the Alexander von Humboldt foundation (chair of Mathematics for Uncertainty Quantification at RWTH Aachen) and Extreme Computing Research Center (ECRC) at King Abdullah University of Science and Technology (KAUST).
Appendix A. Admissibility condition
Here we give an example of an admissibility criterion [15], [12], [8]. Let
χ(x, y) = log|x − y|, | (A.1) |
with a singularity at x = y. We will introduce a condition that divides all sub-blocks into admissible and inadmissible ones. Admissible blocks will be approximated by low-rank matrices.
Definition A.1
Let I be an index set of all degrees of freedom, i.e., I = {1, 2, …, n}. Denote for each index i ∈ I, corresponding to a basis function b_i, the support Ω_i := supp(b_i).
Let τ, δ ∈ T_I be two clusters (elements of the cluster tree T_I). Clusters τ, δ are subsets of I, i.e., τ, δ ⊆ I. We generalise the supports to clusters τ ∈ T_I by setting Ω_τ := ∪_{i∈τ} Ω_i, i.e., Ω_τ is the minimal set that contains the supports of all basis functions b_i with i ∈ τ.
Suppose that Ω_τ and Ω_δ are compact and that χ(x, y) is defined for (x, y) ∈ Ω_τ × Ω_δ with x ≠ y. The standard assumption on the kernel function in ℋ-matrix theory is asymptotic smoothness of χ(x, y), i.e., that

$$\left|\partial_x^{\alpha}\partial_y^{\beta}\,\chi(x, y)\right| \le C_0\,(|\alpha| + |\beta|)!\;C_1^{|\alpha|+|\beta|}\,\|x - y\|^{-|\alpha|-|\beta|-\sigma}$$

holds for constants C_0, C_1 and σ ≥ 0. This estimate is used to control the error ε_q of the truncated Taylor expansion of χ.
Suppose that χ_k(x, y) is an approximation of χ in Ω_τ × Ω_δ of the separable form (e.g., Taylor or Lagrange polynomials):

$$\chi_k(x, y) = \sum_{\nu=1}^{k} \varphi_{\nu}(x)\,\psi_{\nu}(y), \tag{A.2}$$

where k is the rank of separation. We are aiming at an approximation of the form (A.2) such that the exponential convergence

$$\|\chi - \chi_k\|_{\infty,\,\Omega_\tau \times \Omega_\delta} \le C q^{k}, \qquad 0 < q < 1, \tag{A.3}$$

holds.
Let B_τ, B_δ be axis-parallel bounding boxes of the clusters τ and δ such that Ω_τ ⊆ B_τ and Ω_δ ⊆ B_δ.
Definition A.2
The standard admissibility condition (Adm_η), shown in Fig. 2 on the left, for two clusters τ and δ is

$$\min\{\operatorname{diam}(B_\tau), \operatorname{diam}(B_\delta)\} \le \eta \cdot \operatorname{dist}(B_\tau, B_\delta). \tag{A.4}$$

Another example is

$$\max\{\operatorname{diam}(B_\tau), \operatorname{diam}(B_\delta)\} \le \eta \cdot \operatorname{dist}(B_\tau, B_\delta),$$

where η is some positive number.
Definition A.3
We say that a pair (τ, δ) of clusters τ, δ ∈ T_I is admissible if condition (A.4) is satisfied. The blocks for which condition (A.4) holds are called admissible blocks.
The admissibility condition indicates blocks that allow rank-k approximation and those that do not. Admissible blocks are either very small (and computed exactly) or are approximated by rank-k matrices. All other (inadmissible) blocks are computed as usual.
In order to get a simpler partitioning (see an example in Fig. 2, right), we define the weak admissibility condition Adm_W for a pair (τ, δ):

$$(\tau, \delta)\ \text{is admissible} \iff \tau \neq \delta, \tag{A.5}$$

where τ, δ are assumed to belong to the same level of T_{I×I}.
See [24] for more details on the derivation of admissibility conditions for covariance matrices.
Appendix B. Maximum of the log-likelihood function
The C++ code that evaluates the log-likelihood function for given parameters and computes its maximum is provided in the file loglikelihood.cc of the HLIBCov repository.
Rank-k Adaptive Cross Approximation (ACA): An ℋ-matrix contains many sub-blocks that can be well approximated by low-rank matrices. These low-rank approximations could be computed accurately by a truncated singular value decomposition (SVD), but this is slow. HLIBpro instead uses the adaptive cross approximation (ACA) method [11] and its improved variants such as ACA+ and HACA [3], [5], [7].
Remark B.1
Further optimization of the ACA algorithm can be achieved by recompression using a low-rank SVD. Suppose that a factorization R = AB⊤, A ∈ ℝ^{n×K}, B ∈ ℝ^{m×K}, has been found by ACA and that the actual rank of R is k, with k < K. Then we can apply the low-rank SVD algorithm to compute R = UΣV⊤ in O((n + m)K² + K³) time.
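The recompression mentioned above is the standard QR-plus-small-SVD scheme, which can be sketched as follows (with R = AB⊤, A ∈ ℝ^{n×K}, B ∈ ℝ^{m×K}):

```latex
A = Q_A R_A, \qquad B = Q_B R_B
  \qquad \text{(thin QR factorizations, cost } \mathcal{O}((n+m)K^2)\text{)},\\[4pt]
R_A R_B^{\top} = \widehat{U}\,\Sigma\,\widehat{V}^{\top}
  \qquad \text{(SVD of a small } K \times K \text{ matrix, cost } \mathcal{O}(K^3)\text{)},\\[4pt]
R = A B^{\top} = (Q_A \widehat{U})\,\Sigma\,(Q_B \widehat{V})^{\top} = U \Sigma V^{\top}.
```

Keeping only the k dominant singular values of Σ (and the corresponding columns of U and V) yields the recompressed rank-k factorization.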
References
- 1.Ambikasaran S., Foreman-Mackey D., Greengard L., Hogg D.W., O'Neil M. Fast direct methods for Gaussian processes and the analysis of NASA Kepler mission data. arXiv preprint arXiv:1403.6015, 2014.
- 2.Ambikasaran S., Li J.Y., Kitanidis P.K., Darve E. Large-scale stochastic linear inversion using hierarchical matrices. Comput. Geosci. 2013;17(6):913–927.
- 3.Bebendorf M. Approximation of boundary element matrices. Numer. Math. 2000;86(4):565–589.
- 4.Bebendorf M., Fischer T. On the purely algebraic data-sparse approximation of the inverse and the triangular factors of sparse matrices. Numer. Linear Algebra Appl. 2011;18(1):105–122.
- 5.Bebendorf M., Rjasanow S. Adaptive low-rank approximation of collocation matrices. Computing. 2003;70(1):1–24.
- 6.Börm S., Garcke J. Approximating Gaussian processes with H2-matrices. In: Kok J.N., Koronacki J., de Mantaras R.L., Matwin S., Mladenić D., Skowron A., editors. Proceedings of the 18th European Conference on Machine Learning (ECML 2007), Warsaw, Poland, September 17-21, 2007, volume 4701. 2007. pp. 42–53.
- 7.Börm S., Grasedyck L. Hybrid cross approximation of integral operators. Numer. Math. 2005;101(2):221–249.
- 8.Börm S., Grasedyck L., Hackbusch W. Hierarchical Matrices, volume 21 of Lecture Notes. Max-Planck Institute for Mathematics; Leipzig: 2003. www.mis.mpg.de.
- 9.Brent R.P. Algorithms for Minimization without Derivatives, Chapter 4: An Algorithm with Guaranteed Convergence for Finding a Zero of a Function. Prentice-Hall; Englewood Cliffs, NJ: 1973.
- 10.Demmel J. The componentwise distance to the nearest singular matrix. SIAM J. Matrix Anal. Appl. 1992;13(1):10–19.
- 11.Goreinov S.A., Tyrtyshnikov E.E., Zamarashkin N.L. A theory of pseudoskeleton approximations. Linear Algebra Appl. 1997;261:1–21.
- 12.Grasedyck L., Hackbusch W. Construction and arithmetics of ℋ-matrices. Computing. 2003;70(4):295–334.
- 13.Grasedyck L., Kriemann R., LeBorne S. Parallel black box H-LU preconditioning for elliptic boundary value problems. Comput. Visualization Sci. 2008;11(4-6):273–291.
- 14.Guttorp P., Gneiting T. Studies in the history of probability and statistics XLIX: On the Matérn correlation family. Biometrika. 2006;93:989–995.
- 15.Hackbusch W. A sparse matrix arithmetic based on ℋ-matrices. I. Introduction to ℋ-matrices. Computing. 1999;62(2):89–108.
- 16.Hackbusch W. Hierarchical Matrices: Algorithms and Analysis, volume 49 of Springer Series in Computational Mathematics. Springer; 2015.
- 17.Hackbusch W., Khoromskij B.N. A sparse ℋ-matrix arithmetic. II. Application to multi-dimensional problems. Computing. 2000;64(1):21–47.
- 18.Hackbusch W., Khoromskij B.N., Kriemann R. Hierarchical matrices based on a weak admissibility criterion. Computing. 2004;73(3):207–243.
- 19.Halko N., Martinsson P.G., Tropp J.A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011;53(2):217–288.
- 20.Handcock M.S., Stein M.L. A Bayesian analysis of kriging. Technometrics. 1993;35:403–410.
- 21.Harbrecht H., Peters M., Siebenmorgen M. Efficient approximation of random fields for numerical applications. Numer. Linear Algebra Appl. 2015;22(4):596–617.
- 22.Hong Y.P., Pan C.-T. Rank-revealing QR factorizations and the singular value decomposition. Math. Comput. 1992;58(197):213–232.
- 23.Khoromskij B.N., Litvinenko A. Data sparse computation of the Karhunen-Loève expansion. AIP Conference Proceedings. 2008;1048(1):311.
- 24.Khoromskij B.N., Litvinenko A., Matthies H.G. Application of hierarchical matrices for computing the Karhunen-Loève expansion. Computing. 2009;84(1-2):49–67.
- 25.Kriemann R. Parallel ℋ-matrix arithmetics on shared memory systems. Computing. 2005;74(3):273–297.
- 26.Kriemann R. Parallele Algorithmen für ℋ-Matrizen. PhD thesis. University of Kiel; 2005.
- 27.Kriemann R. HLIBpro user manual. Technical report. Max Planck Institute for Mathematics in the Sciences; 2008.
- 28.Kriemann R. H-LU factorization on many-core systems. Comput. Visualization Sci. 2013;16(3):105–117.
- 29.Litvinenko A. Application of hierarchical matrices for solving multiscale problems. PhD Dissertation. Leipzig University; 2006.
- 30.Litvinenko A., Matthies H.G. Sparse data representation of random fields. PAMM. 2009;9(1):587–588.
- 31.Litvinenko A., Sun Y., Genton M.G., Keyes D.E. Likelihood approximation with hierarchical matrices for large spatial datasets. Comput. Stat. Data Anal. 2019;137:115–132.
- 32.Matérn B. Spatial Variation, volume 36 of Lecture Notes in Statistics. 2nd ed. Springer-Verlag; Berlin, New York: 1986.
- 33.Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P. Numerical Recipes: The Art of Scientific Computing. 3rd ed. Section 9.3: Van Wijngaarden-Dekker-Brent Method. Cambridge University Press; New York: 2007.