Abstract
Modeling correlation (and covariance) matrices can be challenging due to the positive-definiteness constraint and potential high-dimensionality. Our approach is to decompose the covariance matrix into the correlation and variance matrices and propose a novel Bayesian framework based on modeling the correlations as products of unit vectors. By specifying a wide range of distributions on a sphere (e.g. the squared-Dirichlet distribution), the proposed approach induces flexible prior distributions for covariance matrices (that go beyond the commonly used inverse-Wishart prior). For modeling real-life spatio-temporal processes with complex dependence structures, we extend our method to dynamic cases and introduce unit-vector Gaussian process priors in order to capture the evolution of correlation among components of a multivariate time series. To handle the intractability of the resulting posterior, we introduce the adaptive Δ-Spherical Hamiltonian Monte Carlo. We demonstrate the validity and flexibility of our proposed framework in a simulation study of periodic processes and an analysis of a rat's local field potential activity in a complex sequence memory task.
Keywords: dynamic covariance modeling, spatio-temporal models, geometric methods, posterior contraction, Δ-Spherical Hamiltonian Monte Carlo
1. Introduction
Modeling covariance matrices—or more broadly, positive definite (PD) matrices—is one of the most fundamental problems in statistics. In general, the task is difficult because the number of parameters grows quadratically with the dimension of the matrices. The complexity of the challenge increases substantially if we allow dependencies to vary over time (or space) in order to account for the dynamic (non-stationary) nature of the underlying probability model. In this paper, we propose a novel solution to the problem by developing a flexible and yet computationally efficient Bayesian inferential framework for both static and dynamic covariance matrices.
This work is motivated by modeling dynamic brain connectivity (i.e., associations between brain activity at different regions). In light of recent technical advances that allow the collection of large, multidimensional neural activity datasets, brain connectivity analyses are emerging as critical tools in neuroscience research. Specifically, the development of such analytical tools will help elucidate fundamental mechanisms underlying cognitive processes such as learning and memory, and identify potential biomarkers for early detection of neurological disorders. A number of new methods have been developed (Cribben et al., 2012; Fiecas and Ombao, 2016; Lindquist et al., 2014; Ting et al., 2015; Prado, 2013), but the main limitation of these methods (especially the ones that take a frequentist approach) is the lack of a natural framework for inference. Moreover, parametric approaches (e.g. vector auto-regressive models) must be tested for their adequacy in modeling complex brain processes, and often have high-dimensional parameter spaces (especially with a large number of channels and high lag order). This work provides both a nonparametric Bayesian model and an efficient inferential method for modeling the complex dynamic dependence among multiple stochastic processes that is common in the study of brain connectivity.
Within the Bayesian framework, it is common to use an inverse-Wishart prior on the covariance matrix for computational convenience (Mardia et al., 1980; Anderson, 2003). This choice of prior, however, is very restrictive (e.g., a common degrees-of-freedom parameter for all components of variance) (Barnard et al., 2000; Tokuda et al., 2011). Daniels (1999) and Daniels and Kass (2001) propose uniform shrinkage priors. Daniels and Kass (1999) discuss three hierarchical priors to generalize the inverse-Wishart prior. Alternatively, one may use decomposition strategies for more flexible modeling choices (see Barnard et al. (2000) for more details). For instance, Banfield and Raftery (1993), Yang and Berger (1994), Celeux and Govaert (1995), Leonard and Hsu (1992), Chiu et al. (1996), and Bensmail et al. (1997) propose methods based on the spectral decomposition of the covariance matrix. Another strategy is to use the Cholesky decomposition of the covariance matrix or its inverse, e.g., Pourahmadi (1999, 2000); Liu (1993); Pinheiro and Bates (1996). There are other approaches directly related to correlation, including the constrained model based on truncated distributions (Liechty, 2004), the Cholesky decomposition of the correlation matrix using an angular parametrization (Pourahmadi and Wang, 2015), and methods based on partial autocorrelation and parametrizations using angles (Rapisarda et al., 2007). In general, these methods fail to yield full flexibility and generality, and often sacrifice statistical interpretability.
While our proposed method in this paper is also based on the separation strategy (Barnard et al., 2000) and the Cholesky decomposition, the main distinction from the existing methods is that it represents each entry of the correlation matrix as a product of unit vectors. This in turn provides a flexible framework for modeling covariance matrices without sacrificing interpretability. Additionally, this framework can be easily extended to dynamic settings in order to model real-life spatio-temporal processes with complex dependence structures that evolve over the course of the experiment.
To address the constraint on correlation processes (a positive definite matrix at each time, with unit diagonals and off-diagonal entries of magnitude no greater than 1), we introduce unit-vector Gaussian process priors. There are other related works, e.g. the generalized Wishart process (Wilson and Ghahramani, 2011) and the latent factor process (Fox and Dunson, 2015), that explore products of vector Gaussian processes. In general, they do not grant full flexibility in simultaneously modeling the mean, variance, and correlation processes. For example, latent factor based models link the mean and covariance processes through a loading matrix, which is restrictive and undesirable when a linear link is not appropriate, and thus are outperformed by our proposed flexible framework (see Section 4.2 for details). Other approaches to modeling non-stationary processes use a representation in terms of a basis, such as wavelets (Nason et al., 2000; Park et al., 2014; Cho and Fryzlewicz, 2015) and the smooth localized complex exponentials (SLEX) (Ombao et al., 2005), which were inspired by the Fourier representations in the Dahlhaus locally stationary processes (Dahlhaus, 2000; Priestley, 1965). These approaches are frequentist and do not easily provide a framework for inference (e.g., obtaining confidence intervals). The class of time-domain parametric models allows the autoregressive-moving-average (ARMA) parameters to evolve over time (see, e.g., Rao, 1970) or via parametric latent signals (West et al., 1999; Prado et al., 2001). A restriction of this class of parametric models is that some processes may not be adequately modeled by them.
The main contributions of this paper are: (a) a sphere-product representation of the correlation/covariance matrix is introduced to induce flexible priors for correlation/covariance matrices and processes; (b) a general and flexible framework is proposed for modeling mean, variance, and correlation processes separately; (c) an efficient algorithm is introduced to infer correlation matrices and processes; (d) the posterior contraction of modeling covariance (correlation) functions with Gaussian process priors is studied for the first time.
The rest of the paper is organized as follows. In the next section, we present a geometric view of covariance matrices and extend this view to allow covariance matrices to change over time. In Section 3, we use this geometrical perspective to develop an effective and computationally efficient inferential method for modeling static and dynamic covariance matrices. Using simulated data, we will evaluate our method in Section 4. In Section 5, we apply our proposed method to local field potential (LFP) activity data recorded from the hippocampus of rats performing a complex sequence memory task (Allen et al., 2014, 2016; Ng et al., 2017). In the final section, we conclude with discussions on the limitations of the current work and future extensions.
2. Structured Bayesian Modeling of the Covariance (Correlation) Matrices
To derive flexible models for covariance and correlation matrices, we start with the Cholesky decomposition, form a sphere-product representation, and finally obtain the separation decomposition of Barnard et al. (2000) with correlations represented as products of unit vectors. The sphere-product representation is amenable to an inferential algorithm that handles the resulting intractability, and hence lays the foundation for full flexibility in choosing priors.
Any covariance matrix Σ = [σ_ij] > 0 is symmetric positive definite, and hence has a unique Cholesky decomposition Σ = LL⊤, where the Cholesky factor L = [l_ij] is a lower triangular matrix such that l_ii > 0. We denote the variance vector as σ² := (σ_1²,⋯,σ_D²); then each variance component, σ_i², can be written in terms of the corresponding row l_i of L as follows:

σ_i² = σ_ii = ‖l_i‖² = ∑_{j=1}^{i} l_ij²,  i = 1,⋯,D.  (2.1)
For Σ to be positive definite, it is equivalent to require all the leading principal minors {M_i} to be positive:

M_i = ∏_{j=1}^{i} l_jj² > 0 ⟺ l_ii ≠ 0,  i = 1,⋯,D.  (2.2)
Based on (2.1) and (2.2), for i ∈ {1,⋯,D}, l_i can be viewed as a point on a sphere with radius σ_i excluding the equator, denoted as S_0^{i−1}(σ_i) := {l_i ∈ R^i : ‖l_i‖ = σ_i, l_ii ≠ 0}. Therefore the space of the Cholesky factor in terms of its rows can be written as a product of spheres, and we require

vech(L) ∈ ∏_{i=1}^{D} S_0^{i−1}(σ_i).  (2.3)
Note that (2.3) is a necessary and sufficient condition for the matrix Σ = LL⊤ to be a covariance matrix.
We present probabilistic models involving covariance matrices in the following generic form:
y | μ, Σ ~ ℓ(y; μ, Σ),  Σ = LL⊤,  σ ~ p(σ),  vech(L) | σ ~ p(vech(L) | σ),  (2.4)

where σ := [σ_1,⋯,σ_D]⊤, and the half-vectorization in row order, vech⊤, transforms the lower triangular matrix L into the vector (l_1, l_2,⋯,l_D). The total dimension of (σ, L) is D + D(D−1)/2 = D(D+1)/2.¹
Alternatively, if we separate the variances from the covariance, then we have a unique Cholesky decomposition for the correlation matrix P = [ρ_ij] = L*(L*)⊤, where the Cholesky factor L* = diag(σ⁻¹)L can be obtained by normalizing each row of L. The magnitude requirements for correlations are immediately satisfied by the Cauchy–Schwarz inequality: |ρ_ij| = |⟨l_i*, l_j*⟩| ≤ ‖l_i*‖‖l_j*‖ = 1. Thus we require

vech(L*) ∈ ∏_{i=1}^{D} S_0^{i−1},  (2.5)

where S_0^{i−1} := S_0^{i−1}(1). Similarly, (2.5) is a necessary and sufficient condition for P = L*(L*)⊤ to be a correlation matrix. Then we have the following alternatively structured model for the covariance Σ that involves the correlation P explicitly:
y | μ, Σ ~ ℓ(y; μ, Σ),  Σ = diag(σ) P diag(σ),  P = L*(L*)⊤,  σ ~ p(σ),  vech(L*) ~ p(vech(L*)).  (2.6)
Note that this direct decomposition Σ = diag(σ) P diag(σ), as a separation strategy, is motivated by statistical thinking in terms of standard deviations and correlations (Barnard et al., 2000). This setting is especially relevant when the statistical quantity of interest is the correlation matrix P itself; we can then skip inference for the standard deviation σ by fixing it at a data-derived point estimate.
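To make the sphere-product representation concrete, the following minimal NumPy sketch (our illustration, not code from the paper) verifies numerically that the rows of the Cholesky factor have norms equal to the standard deviations, as in (2.1), and that normalizing the rows yields the Cholesky factor of the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # a random symmetric positive definite matrix

L = np.linalg.cholesky(Sigma)     # lower triangular factor with Sigma = L L^T

# Each row l_i of L lies on a sphere of radius sigma_i, cf. (2.1):
row_norms = np.linalg.norm(L, axis=1)
assert np.allclose(row_norms, np.sqrt(np.diag(Sigma)))

# Normalizing each row gives the Cholesky factor L* of the correlation matrix:
L_star = L / row_norms[:, None]
P = L_star @ L_star.T
assert np.allclose(np.diag(P), 1.0)        # unit diagonal
assert np.all(np.abs(P) <= 1.0 + 1e-12)    # |rho_ij| <= 1 (Cauchy-Schwarz)
```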
In what follows, we will show that the above framework includes the inverse-Wishart prior as a special case, but it can be easily generalized to a broader range of priors for additional flexibility. Such flexibility enables us to better express prior knowledge, control the model complexity and speed up computation in modeling real-life phenomena. This is crucial in modeling spatio-temporal processes with complex structures.
2.1. Connection to the Inverse-Wishart Prior
There are some interesting connections between the spherical product representations (2.3), (2.5) and the early development of the Wishart distribution (Wishart, 1928). The original Wishart distribution was derived by orthogonalizing multivariate Gaussian random variables, leading to a lower triangular matrix whose elements (analogous to l_ij) were called rectangular coordinates. This way, the probability density has a geometric interpretation as a product of volumes and approximate densities on a series of spherical shells with radii ‖l_i‖ (see more details in Sverdrup, 1947; Anderson, 2003). Now we demonstrate that the proposed schemes (2.4), (2.6) include the commonly used inverse-Wishart prior as a special case in modeling covariances.
Suppose Σ is a random sample from the inverse-Wishart distribution with scale matrix Ψ > 0 and degrees of freedom ν ≥ D, i.e. Σ ~ W_D⁻¹(Ψ, ν). Therefore Σ⁻¹ ~ W_D(Ψ⁻¹, ν). Denote C as the Cholesky factor of Ψ⁻¹, i.e. Ψ⁻¹ = CC⊤. Then Σ⁻¹ has the following Bartlett decomposition (Anderson, 2003; Smith and Hocking, 1972):

Σ⁻¹ = (CT)(CT)⊤,  t_ii² ~ χ²_{ν−i+1},  t_ij ~ N(0, 1) independently for i > j,  (2.7)
where the lower triangular matrix T, named the Bartlett factor, has the following density (Theorem 7.2.1 of Anderson, 2003):

p(T) = [2^D / (2^{νD/2} Γ_D(ν/2))] ∏_{i=1}^{D} t_ii^{ν−i} exp{−tr(TT⊤)/2},

with the multivariate gamma function defined as Γ_D(ν/2) := π^{D(D−1)/4} ∏_{i=1}^{D} Γ((ν + 1 − i)/2).
Now taking the inverse of the first equation in (2.7) yields the following reversed Cholesky decomposition²

Σ = (CT)⁻⊤(CT)⁻¹ = UU⊤,

where U := (CT)⁻⊤ = C⁻⊤T⁻⊤ is an upper triangular matrix. The following proposition describes the density of the reversed Cholesky factor U of Σ, which enables us to treat the inverse-Wishart distribution as a special instance of strategy (2.4) or (2.6).
Proposition 2.1. Assume Σ ~ W_D⁻¹(Ψ, ν). Then its reversed Cholesky factor U has the following density:

p(U) = [2^D |Ψ|^{ν/2} / (2^{νD/2} Γ_D(ν/2))] ∏_{i=1}^{D} u_ii^{−(ν+D+1−i)} exp{−tr(Ψ U⁻⊤U⁻¹)/2}.
Proof. See Section A in the supplementary file (Lan et al., 2019). □
If we normalize each row of U and write

U = diag(σ)U*,  σ_i = ‖u_i‖,

with u_i the i-th row of U, then the following joint prior of (σ, U*) is inseparable in general:

p(σ, U*) ∝ ∏_{i=1}^{D} σ_i^{−(ν+1)} (u*_ii)^{−(ν+D+1−i)} exp{−tr(Ψ U⁻⊤U⁻¹)/2},  U = diag(σ)U*.  (2.8)
With this result, we can conditionally model the variance and the correlation factor as p(σ|U*) and p(U*|σ) respectively, as in our proposed schemes (2.4) and (2.6). It is also used to verify the validity of our proposed method (2.6) (see more details in Section 4.1). A similar result exists for the Wishart prior distribution regarding the Cholesky factor. This representation facilitates the construction of a broader class of more flexible prior distributions for covariance matrices, detailed below.
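As a quick sanity check on the reversed Cholesky construction (see footnote 2), the sketch below draws Σ from an inverse-Wishart distribution and recovers an upper triangular U with Σ = UU⊤ via the exchange matrix E; SciPy's invwishart is our choice for the draw, and Ψ and ν are illustrative values:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
D, nu = 3, 5
Psi = np.eye(D)
Sigma = invwishart.rvs(df=nu, scale=Psi, random_state=rng)

E = np.eye(D)[::-1]                    # exchange (reversal) matrix, E = E^{-1} = E^T
L = np.linalg.cholesky(E @ Sigma @ E)  # usual lower Cholesky of the reversed matrix
U = E @ L @ E                          # reversed Cholesky factor of Sigma

assert np.allclose(U, np.triu(U))      # U is upper triangular
assert np.allclose(U @ U.T, Sigma)     # Sigma = U U^T
```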
2.2. More Flexible Priors
Within the above framework, the only constraint on U or L is that it resides on the product of spheres with increasing dimensions. Using this fact, we can develop a broader class of priors on covariance matrices and thus be able to model processes with more complicated dependence in covariance structures. Since σ and L* have independent priors in (2.6), in what follows we focus on the scheme (2.6) and, for simplicity, denote the normalized Cholesky factor L* as L. Also, following Barnard et al. (2000), we assume a log-Normal prior on σ:

log(σ) ~ N(m, V).
We now discuss priors on L that properly reflect prior knowledge regarding the covariance structure among variables. If two variables, yi and yj (assuming i < j), are known to be uncorrelated a priori, i.e. 0 = ρij = ⟨li, lj⟩, then we can choose a prior that encourages li ⊥ lj, e.g. ljk ≈ 0 for k ≤ i. In contrast, if we believe a priori that there is a strong correlation between the two variables, we can specify that li and lj be linearly dependent, e.g., by setting [ljk]k≤i ≈ ±li. When there is no prior information, we might assume that components are uncorrelated and consider the following Jeffreys prior for li, which concentrates on the (two) poles of S_0^{i−1}:
p(li) ∝ |G(li)|^{1/2} ∝ |l_ii|⁻¹,  (2.9)

where G(li) is the canonical metric on spheres (Lan et al., 2014). Putting more prior probability on the diagonal elements of L renders fewer non-zero off-diagonal elements, which in turn leads to a larger number of perpendicular variables; that is, such a prior favors zeros in the correlation matrix P. More generally, one can map a probability distribution defined on the simplex onto the sphere and consider the following squared-Dirichlet distribution.
Definition 1 (Squared-Dirichlet distribution). A random vector li ∈ S^{i−1} is said to have a squared-Dirichlet distribution with parameter αi := (αi1,αi2,⋯,αii) if

li² := (l_i1², l_i2²,⋯,l_ii²) ~ Dir(αi).

Denote li ~ Dir²(αi). Then li has the following density:

p(li) ∝ ∏_{k=1}^{i} |l_ik|^{2α_ik − 1}.  (2.10)
Remark 1. This definition includes a large class of flexible prior distributions on the unit sphere that specify different concentrations of probability density through the parameter αi. For example, the Jeffreys prior (2.9) corresponds to αi = (1/2,⋯,1/2, 0).
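Sampling from Dir²(α) is straightforward under Definition 1: draw a point on the simplex from the Dirichlet distribution and attach independent random signs to the square roots. A minimal sketch (the helper name is ours):

```python
import numpy as np

def sample_squared_dirichlet(alpha, size, rng):
    """Draw unit vectors l with l**2 ~ Dir(alpha) and uniform random signs."""
    x = rng.dirichlet(alpha, size=size)            # points on the simplex
    signs = rng.choice([-1.0, 1.0], size=x.shape)  # reflect into all orthants
    return signs * np.sqrt(x)                      # points on the unit sphere

rng = np.random.default_rng(2)
l3 = sample_squared_dirichlet(np.array([0.5, 0.5, 0.5]), size=10_000, rng=rng)
assert np.allclose(np.linalg.norm(l3, axis=1), 1.0)   # all samples on the sphere

# Since l_1 = (1,), the correlation rho_12 = <l_1, l_2> is the first coordinate
# of l_2; with alpha_2 = (1/2, 1), its histogram is approximately Unif(-1, 1):
l2 = sample_squared_dirichlet(np.array([0.5, 1.0]), size=10_000, rng=rng)
rho_12 = l2[:, 0]
```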
To induce a prior distribution for the correlation matrix P = LL⊤, one can specify priors on the row vectors of L, li ~ Dir²(αi) for i = 2,⋯,D. Each pair (li, lj) in turn defines a prior distribution for the correlation ρij = ⟨li, lj⟩, whose sign is determined by the angle between li and lj. To encourage small correlations, we choose the concentration parameter αi so that the probability density concentrates around the (two) poles of S_0^{i−1}, e.g. 0 < α_ik ≪ α_ii for k < i. Figure 1 illustrates the density heat maps of some symmetric squared-Dirichlet distributions Dir²(α1) on S². Interestingly, the squared-Dirichlet distribution induces two important uniform prior distributions over correlation matrices considered by Barnard et al. (2000) in their effort to provide flexible priors for covariance matrices, as stated in the following theorem.
Figure 1:
Symmetric squared-Dirichlet distributions Dir²(α) defined on the 2-sphere with different settings of the concentration parameter α = α·1. The uniform distribution on the simplex, Dir(1), becomes non-uniform on the sphere due to the stretching of the geometry (left); the symmetric Dirichlet distribution Dir(½·1) becomes uniform on the sphere (middle); with α closer to 0, the induced distribution becomes more concentrated on the polar points (right).
Theorem 2.1 (Uniform distributions). Let P = LL⊤. Suppose li ~ Dir²(αi), for i = 2,⋯,D, are independent, where li is the i-th row of L. We have the following:

1. If αi = (½·1_{i−1}, (4 − i)/2), for i = 2,⋯,D, then P follows a marginally uniform distribution, that is, ρij ~ Unif(−1, 1), i ≠ j.

2. If αi = (½·1_{i−1}, (D − i + 2)/2), for i = 2,⋯,D, then P follows a jointly uniform distribution, that is, p(P) ∝ 1.
Proof. See Section A in the supplementary file (Lan et al., 2019). □
Another natural spherical prior can be obtained by constraining a multivariate Gaussian random vector to have unit norm. This is later generalized to a vector Gaussian process constrained to a sphere that serves as a suitable prior for modeling correlation processes. Now we consider the following unit-vector Gaussian distribution:
Definition 2 (Unit-vector Gaussian distribution). A random vector li ∈ S^{i−1} is said to have a unit-vector Gaussian distribution with mean μ and covariance Σ if

li ~ N_i(μ, Σ) | {‖li‖₂ = 1}.

Then we denote li ~ N^S(μ, Σ), and li has the following (conditional) density:

p(li | ‖li‖₂ = 1) ∝ exp{−(li − μ)⊤Σ⁻¹(li − μ)/2}.
Remark 2. This conditional density essentially defines a Fisher–Bingham distribution (a.k.a. generalized Kent distribution; Kent, 1982; Mardia and Jupp, 2009). If Σ = I, then the above distribution reduces to the von Mises–Fisher distribution (Fisher, 1953; Mardia and Jupp, 2009) as a special case. If in addition μ = 0, then the above density becomes a constant; that is, the corresponding distribution is uniform on the sphere S^{i−1}. See more details in Section E.1 of the supplementary file (Lan et al., 2019).
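The conditional density above is known only up to a normalizing constant, but that suffices for the MCMC methods of Section 3. A small sketch of evaluating it (helper name ours, with the density as reconstructed above):

```python
import numpy as np

def log_uvg_density(l, mu, Sigma_inv):
    """Unnormalized log density of the unit-vector Gaussian at l with ||l|| = 1."""
    d = l - mu
    return -0.5 * d @ Sigma_inv @ d

mu = np.array([0.0, 0.0, 1.0])       # concentrate mass near the north pole
Sigma_inv = np.eye(3) / 0.1          # small covariance -> high concentration

north = np.array([0.0, 0.0, 1.0])
equator = np.array([1.0, 0.0, 0.0])
assert log_uvg_density(north, mu, Sigma_inv) > log_uvg_density(equator, mu, Sigma_inv)

# With Sigma = I the kernel reduces (on the sphere) to exp(mu^T l), the
# von Mises-Fisher form; with mu = 0 as well it is constant, i.e. uniform.
```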
2.3. Dynamically Modeling the Covariance Matrices
We can generalize the proposed framework for modeling covariance/correlation matrices to the dynamic setting by adding a subscript t to the variables in the model (2.4) and the model (2.6); we call the results dynamic covariance and dynamic correlation models respectively. We focus on the latter in this section. One can model the components of σt as independent dynamic processes using, e.g., ARMA, generalized autoregressive conditional heteroskedasticity (GARCH), or log-Gaussian process models. For Lt, we use vector processes. Since each row of Lt has to be on a sphere of the appropriate dimension, we require the unit-norm constraint to hold for the dynamic process at all times. We refer to any multivariate process li(x) satisfying ‖li(x)‖ ≡ 1 as a unit-vector process (uvP). A unit-vector process can be obtained by constraining an existing multivariate process, e.g. the vector Gaussian process (vGP), as defined below.
Definition 3 (Vector Gaussian process). A D-dimensional vector Gaussian process Z(x) := (Z1(x),⋯,ZD(x)), with vector mean function μ(x) = (μ1(x),⋯,μD(x)), covariance function C and (D-dimensional) cross covariance V_{D×D}, denoted as Z ~ vGP_D(μ, C, V), is a collection of D-dimensional random vectors, indexed by x ∈ X, such that for any finite set of indices {x1,⋯,xN}, the random matrix Z_{N×D} := [Z(x1),⋯,Z(xN)]⊤ has the following matrix normal distribution:

Z_{N×D} ~ MN_{N×D}(M, K, V),

where M_{N×D} := (m1,⋯,mD), with mk = (μk(x1),⋯,μk(xN))⊤, and K is the kernel matrix with elements Kij = C(xi, xj).
Remark 3. Note for each k = 1,⋯,D, we have the following marginal GP:

Zk ~ GP(μk, C·Vkk).

In the above definition, we require a common kernel C for all the marginal GPs, whose dependence is characterized by the cross covariance V_{D×D}. On the other hand, for any fixed x ∈ X, we have

Z(x) ~ N_D(μ(x), C(x, x)·V).

For simplicity, we often consider μ ≡ 0 and V_{D×D} = I_D. That is, Zk ~ GP(0, C) independently for k = 1,⋯,D.
Restricting the vGP Z(·) to the sphere yields a unit-vector Gaussian process (uvGP) Z*(·) := Z(·) | {‖Z(·)‖₂ ≡ 1}, denoted as Z* ~ uvGP_D(μ, C, V). Note for any fixed x ∈ X, Z*(x) ~ N^S(μ(x), C(x, x)·V). Setting μ ≡ 0, V = I, and conditioning on the length ℓn of each row zn of Z_{N×D}, we have

p(Z* | {‖zn‖ = ℓn}) ∝ exp{−tr(K⁻¹Z*(Z*)⊤)/2}.

This conditional density is preserved by the inference algorithm in Section 3 and is used for defining priors for correlations with all ℓn = 1. For each marginal GP, we select the following powered exponential function as the common kernel:

C(x, x′) = γ exp{−(‖x − x′‖/ρ)^s},

where s controls the smoothness, the scale parameter γ is given an inverse-Gamma prior, and the correlation length parameter ρ is given a log-normal prior. Figure 2 shows a realization of the vector GP Zt, the unit-vector GP (forming rows of) Lt, and the induced correlation process Pt respectively.
Figure 2:
A realization of vector GP Zt (left), unit-vector GP (forming rows of) Lt (middle) and the induced correlation process Pt (right).
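A Figure-2-style realization can be sketched as follows: draw independent GP paths with the powered exponential kernel (as reconstructed above) and map each time slice to the sphere by normalization. Normalization is only a convenient surrogate for visualization here; the uvGP itself is defined by conditioning on the unit norm, which the inference algorithm in Section 3 respects exactly. Hyperparameter values are illustrative:

```python
import numpy as np

def powered_exp_kernel(t, gamma=1.0, rho=0.2, s=2.0, jitter=1e-8):
    """K_ij = gamma * exp(-(|t_i - t_j| / rho)^s), with a small nugget."""
    d = np.abs(t[:, None] - t[None, :])
    return gamma * np.exp(-(d / rho) ** s) + jitter * np.eye(len(t))

rng = np.random.default_rng(3)
N, D = 200, 3
t = np.linspace(0.0, 1.0, N)
C = np.linalg.cholesky(powered_exp_kernel(t))

Z = C @ rng.standard_normal((N, D))   # vGP draw: columns are iid GP(0, C) paths
L_i = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # each time slice on S^2

# Using such paths as rows l_i(t) of L_t induces a smoothly evolving
# correlation process rho_ij(t) = <l_i(t), l_j(t)>.
```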
In what follows, we focus on multivariate time series; therefore, we use the one-dimensional time index t. The overall dynamic correlation model can be summarized as follows:

y_t ~ N_D(μ_t, Σ_t),  Σ_t = diag(σ_t) P_t diag(σ_t),  P_t = L_t L_t⊤,
μ_t ~ GP_D(0, C_μ),  log(σ_t) ~ GP_D(0, C_σ),  l_i(t) ~ uvGP(n_i, C_L),  i = 1,⋯,D,  (2.11)
where a constant mean function n_i = (0,⋯,0,1) is used in the uvGP prior for l_i(t), with mean matrix N_i := 1_N n_i⊤ for the realization L_i := [l_i(t1),⋯,l_i(tN)]⊤. This model (2.11) captures the spatial dependence in the matrix Σt, which evolves over time, while the temporal correlation is characterized by the various GPs. The induced covariance process Σt is not a generalized Wishart process (Wilson and Ghahramani, 2011), which models only the Cholesky factor of the covariance with GPs. Although a GP-based dynamic covariance model may behave similarly to the dynamic correlation model (2.11), the latter provides extra flexibility in modeling the evolution of variances and correlations separately. In general, such flexibility can be useful in handling constraints on processes, e.g. modeling the dynamic probability of a binary time series.
With this structured model (2.11), one can naturally model the evolution of variances and correlations separately in order to obtain more flexibility. If the focus is on modeling the correlation among multiple time series, then one can substitute σt with a point estimate from one trial and assume a steady variance vector. Alternatively, if sufficiently many trials are available, one can obtain an empirical estimate, σ̂t, from multiple trials at each time point. In the following, we study the posterior contraction of GP modeling in this setting.
2.4. Posterior Contraction Theorem
We now provide a theorem on the posterior contraction of the dynamic covariance model before we conclude this section. Because the posterior contraction for mean regression using Gaussian processes has been extensively investigated in the literature (van der Vaart and van Zanten, 2008a, 2009, 2011; Yang and Dunson, 2016), we only investigate the posterior contraction for the covariance regression and set μt ≡ 0. We leave the posterior contraction of the dynamic correlation model (2.11) for future work. Note that the Cholesky decomposition of a covariance matrix Σ = LL⊤ is unique if all the diagonal entries of L are positive. Therefore, in the remainder of this section, we identify Cholesky factors up to column-wise sign changes, i.e. L ~ L diag(s) with s_j = −1 for j ∈ J and s_j = +1 otherwise, for any J ⊂ {1,⋯,D}.
In most cases, the Gaussian process Lt can be viewed as a tight Borel measurable map in a Banach space, e.g. a space of continuous functions or an Lp space. It is well known that the support of a centered GP is equal to the closure of the reproducing kernel Hilbert space (RKHS) associated with this process (Lemma 5.1 of van der Vaart and van Zanten, 2008b). Because the posterior distribution necessarily puts all its mass on the support of the prior, posterior consistency requires the true parameter L0 governing the distribution of the data to fall in this support (van der Vaart and van Zanten, 2008a). Following van der Vaart and van Zanten (2008a,b, 2011), we express the rate of posterior contraction in terms of the concentration function

φ_{L0}(ε) := inf_{h ∈ H: ‖h − L0‖ < ε} ‖h‖²_H − log Π(‖L‖ < ε),  (2.12)

where ‖·‖ is the norm of the Banach space where the GP L takes values, Π is the GP prior, and H is the associated RKHS with norm ‖·‖_H. Under certain regularity conditions, the posterior contracts with increasing data, expressed in n, at a rate εn → 0 satisfying

φ_{L0}(εn) ≤ n εn².  (2.13)
Let T := [0, 1] denote the (rescaled) time domain and consider the Banach space B := C(T; R^{D×D}) of continuous matrix-valued functions equipped with the uniform norm ‖·‖∞. Let p be a (centered) Gaussian model, which is uniquely determined by the covariance matrix Σ = LL⊤; the model density is therefore parametrized by L, hence denoted as pL. Denote P_L^{(n)} := ⊗_{i=1}^{n} P_{L,i} as the product measure of the first n observations, where each P_{L,i} has a density p_{L,i} with respect to a σ-finite measure μi. Define the average Hellinger distance as

d_n²(L, L′) := (1/n) ∑_{i=1}^{n} ∫ (√p_{L,i} − √p_{L′,i})² dμi.

Denote the observations Y^{(n)} := (Y1,⋯,Yn) with Yi = y(ti). Note they are independent but not identically distributed (inid). Now we state the main theorem on posterior contraction.
Theorem 2.2 (Posterior contraction). Let L − I be a Borel measurable, zero-mean tight Gaussian random element in (B, ‖·‖∞), and let P_L^{(n)} be the product measure of Y^{(n)} parametrized by L. Let φ_{L0} be the function in (2.12) with the uniform norm ‖·‖∞. If L0 is contained in the support of L and φ_{L0} satisfies (2.13) with εn ≥ n^{−1/2}, then Πn(L : dn(L, L0) > Mnεn | Y^{(n)}) → 0 in P_{L0}^{(n)}-probability for every Mn → ∞.
Proof. See Section B in the supplementary file (Lan et al., 2019). □
Remark 4. In principle, the smoothness of the GP should match the regularity of the true parameter to achieve the optimal rate of contraction (van der Vaart and van Zanten, 2008a, 2011). One can rescale the GP, e.g. using an inverse-Gamma bandwidth, to obtain the optimal contraction rate for every regularity level, so that the resulting estimator is rate adaptive (van der Vaart and van Zanten, 2009, 2011). One can refer to Section 3.2 of van der Vaart and van Zanten (2011) for posterior contraction rates using the squared exponential kernel for the GP. We leave further investigation of contraction rates in the setting of covariance regression to future work.
Remark 5. Here the GP prior L defines a probability measure on the space of bounded functions. The true parameter function L0 is required to be contained in the support of the prior, the closure of the RKHS of L. The contraction rate depends on the position of L0 relative to the RKHS and on the small-ball probability Π(‖L‖ < ε).
3. Posterior Inference
Now we obtain the posterior probability of the mean μt, variance σt, Cholesky factor of correlation Lt, and hyper-parameters γ := (γμ, γσ, γL) and ρ := (ρμ, ρσ, ρL) in the model (2.11). Denote the realizations of the processes μt, σt, Lt at the discrete time points as μ (N × D), σ (N × D), and L (N rows vech⊤(Ln) of length D(D+1)/2) respectively. Transform the parameters τ := log(σ) and η := log(ρ) for computational convenience. Denote, for M trials, (Ym)_{N×D} := [ym1,⋯,ymN]⊤ and Ỹm := (Ym − μ) ∘ exp(−τ), where ∘ is the Hadamard product (a.k.a. Schur product), i.e. the entry-wise product. Let K_*(γ_*, η_*) = γ_* K_{0*}(η_*) for * = μ, τ, L.
3.1. Metropolis-within-Gibbs
We use a Metropolis-within-Gibbs algorithm and alternate updating the model parameters μ, τ, L, γ, η. We now list the parameters and their respective updates one by one.

(γ) Note the prior for γ_* is conditionally conjugate given * = μ, τ, or L:

γ_* | η_*, Z_* ~ Γ⁻¹(a_* + N D_*/2,  b_* + tr{K_{0*}(η_*)⁻¹ Z_* Z_*⊤}/2),  D_* = D·[* ≠ L] + D(D+1)/2·[* = L],

where Z_* denotes the (mean-removed) realization of the corresponding process and [condition] is 1 with the condition satisfied and 0 otherwise.
(η) Given * = μ, τ, or L, we could sample η_* using the slice sampler (Neal, 2003), which only requires the log-posterior density and works well for scalar parameters.

(μ) By the definition of the vGP, we have μ ~ MN_{N×D}(0, K_μ, I_D); therefore vec(μ) ~ N(0, I_D ⊗ K_μ). On the other hand, one can write the likelihood in terms of

vec(μ⊤) = K(N,D) vec(μ),

where K(N,D) is the commutation matrix of size ND × ND such that for any N × D matrix A, K(N,D) vec(A) = vec(A⊤) (Tracy and Dwyer, 1969; Magnus and Neudecker, 1979). Therefore, the prior on μ is conditionally conjugate, and we have

vec(μ⊤) | · ~ N(Λ⁻¹b, Λ⁻¹),  Λ = K_μ⁻¹ ⊗ I_D + M Ω⁻¹,  b = Ω⁻¹ ∑_{m=1}^{M} vec(Ym⊤),  Ω := diag(Σ1,⋯,ΣN).
(τ) Using a similar argument with the matrix normal prior for τ, we have vec(τ) ~ N(0, I_D ⊗ K_τ). Therefore, we could use the elliptical slice sampler (ESS, Murray et al., 2010), which only requires the log-likelihood

log L(τ) = −M ∑_{n=1}^{N} 1⊤τn − (1/2) ∑_{m=1}^{M} ∑_{n=1}^{N} ỹmn⊤ Pn⁻¹ ỹmn,

where ỹmn := exp(−τn) ∘ (ymn − μn) and Pn = Ln Ln⊤.
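For completeness, a generic elliptical slice sampling step (Murray et al., 2010) is sketched below; log_lik stands for the log-likelihood above and prior_chol for the Cholesky factor of the Gaussian prior covariance (both placeholders):

```python
import numpy as np

def ess_step(f, log_lik, prior_chol, rng):
    """One elliptical slice sampling update of f ~ N(0, prior_chol @ prior_chol.T)."""
    nu = prior_chol @ rng.standard_normal(f.shape)  # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())      # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta             # initial bracket
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new                            # ESS has no rejections
        if theta < 0.0:                             # shrink the bracket towards 0
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```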
(L) For each n ∈ {1,⋯,N}, we have vech(Ln) ∈ ∏_{i=1}^{D} S_0^{i−1}. We could sample vech(L) from its posterior distribution using the Δ-Spherical Hamiltonian Monte Carlo (Δ-SphHMC) described below. The log-posterior density of vech(L) is, up to an additive constant,

log p(vech(L) | ·) = −M ∑_{n=1}^{N} ∑_{i=1}^{D} log|l_{n,ii}| − (1/2) ∑_{m,n} ỹmn⊤ Pn⁻¹ ỹmn + log p(vech(L)),

where the last term is the log-density of the uvGP prior. The derivative of the log-likelihood with respect to Ln and the derivative of the log-prior with respect to L can be calculated as

∂ log L/∂Ln = tril{∑_{m=1}^{M} Pn⁻¹ ỹmn ỹmn⊤ Pn⁻¹ Ln} − M diag(1/l_{n,ii}),  ∂ log p/∂L = −K_L⁻¹(L − N_L),

where tril{·} extracts the lower triangular part and N_L is the prior mean matrix formed from the ni.
3.2. Spherical HMC
We need an efficient algorithm to handle the intractability of the posterior distribution of vech(L) introduced by the various flexible priors. Spherical Hamiltonian Monte Carlo (SphHMC, Lan et al., 2014; Lan and Shahbaba, 2016) is a Hamiltonian Monte Carlo (HMC, Duane et al., 1987; Neal, 2011) algorithm on spheres that can be viewed as a special case of geodesic Monte Carlo (Byrne and Girolami, 2013), or of manifold Monte Carlo methods (Girolami and Calderhead, 2011; Lan et al., 2015). The algorithm was originally proposed to handle norm constraints in sampling, so it is natural to use it to sample each row of the Cholesky factor of a correlation matrix, which has a unit 2-norm constraint. The general notation q is instantiated as li in this section.
Assume a probability distribution with density function f(q) is defined on a (D−1)-dimensional sphere with radius r, S^{D−1}(r). Due to the norm constraint, there are (D−1) free parameters q_{−D} := (q1,⋯,q_{D−1}), which can be viewed as the Cartesian coordinates for the manifold S^{D−1}(r). To induce Hamiltonian dynamics on the sphere, we define the potential energy for position q as U(q) := −log f(q). Endowing the canonical spherical metric G(q_{−D}) = I_{D−1} + q_{−D}q_{−D}⊤/q_D² on the Riemannian manifold S^{D−1}(r), we introduce the auxiliary velocity vector v | q ~ N(0, G(q_{−D})⁻¹) and define the associated kinetic energy as K(v; q) := v_{−D}⊤G(q_{−D})v_{−D}/2 (Girolami and Calderhead, 2011). Therefore the total energy is defined as

E(q, v) := U(q) + K(v; q) = U(q) + v⊤v/2,  (3.1)

where we denote v := (v_{−D}, v_D), with v_D = −q_{−D}⊤v_{−D}/q_D so that ⟨q, v⟩ = 0 (Lan and Shahbaba, 2016). Therefore the Lagrangian dynamics with the above total energy (3.1) is (Lan et al., 2015)

q̇_{−D} = v_{−D},  v̇_{−D} = −v_{−D}⊤Γ(q_{−D})v_{−D} − G(q_{−D})⁻¹∇_{q_{−D}}U(q),  (3.2)
where Γ(q_{−D}) = r⁻²G(q_{−D}) ⊗ q_{−D} denotes the Christoffel symbols of the second kind (see details in Lan and Shahbaba, 2016, for r = 1). A splitting technique yields a geometric integrator, called leapfrog, that mainly moves along geodesics (great circles) on the sphere, with the gradient perturbing the moving direction. The integrator is applied T times with discrete step size h to numerically solve (3.2) and obtain a proposed state (qT, vT). This proposal is accepted with a probability expressed in terms of the total energy (3.1). Interested readers can refer to Lan et al. (2014), Lan and Shahbaba (2016), or Section C in the supplementary file (Lan et al., 2019). In addition to the original work in Lan et al. (2014) and Lan and Shahbaba (2016), we prove the following result on energy conservation of the algorithm (Beskos et al., 2011).
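A sketch of this integrator on the unit sphere (r = 1) is given below: tangent-space gradient half-steps alternate with exact rotation along great circles, and we include the orthant-exit check that Section 3.3 formalizes as the stopping rule (3.3). Here grad_U is a placeholder for the Euclidean gradient of U, and the code is our illustration of the scheme rather than the authors' implementation:

```python
import numpy as np

def sph_leapfrog(q, v, grad_U, h, T_max):
    """Geodesic leapfrog on the unit sphere, with ||q|| = 1 and <q, v> = 0."""
    proj = lambda q, g: g - q * (q @ g)        # project g onto the tangent space at q
    q0 = q.copy()
    for _ in range(T_max):
        v = v - 0.5 * h * proj(q, grad_U(q))   # half-step velocity update
        vn = np.linalg.norm(v)                 # exact flow along the great circle:
        q, v = (q * np.cos(vn * h) + (v / vn) * np.sin(vn * h),
                v * np.cos(vn * h) - q * vn * np.sin(vn * h))
        v = v - 0.5 * h * proj(q, grad_U(q))   # second half-step
        if q0 @ q < 0.0:                       # left the initial orthant: stop
            break
    return q, v
```

The proposed state (qT, vT) is then accepted with probability min{1, exp(E(q0, v0) − E(qT, vT))}, with the total energy E as in (3.1).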
Theorem 3.1. As h → 0, we have the following energy conservation:

E(qT, vT) − E(q0, v0) → 0.
Proof. See Section C in the supplementary file (Lan et al., 2019). □
3.3. Adaptive Spherical HMC
There are two tuning parameters in HMC and its variants: the step size h and the number of integration (leapfrog) steps T. Hand tuning relies heavily on domain expertise and can be inefficient. Here, we adopt the 'No-U-Turn' idea of Hoffman and Gelman (2014) and introduce a novel adaptive algorithm, beyond Lan et al. (2014) and Lan and Shahbaba (2016), that obviates manual tuning of these parameters.
First, for any given step size h, we adopt a rule for setting the number of leapfrog steps T based on the same philosophy as 'No-U-Turn' (Hoffman and Gelman, 2014). The idea is to avoid wasted computation (e.g. when the sampler backtracks on its trajectory) without breaking the detailed balance condition for the Markov chain Monte Carlo (MCMC) transition kernel. S^{D−1}(r) is a compact manifold on which any two points q^(0), q^(t) have geodesic distance at most πr. We stop the leapfrog steps when the sampler exits the orthant of the initial state, that is, when the trajectory measured in geodesic distance is at least πr/2, which is equivalent to ⟨q^(0), q^(t)⟩ < 0. On the other hand, this condition may not be satisfied within a reasonable number of iterations because the geometric integrator does not exactly follow a geodesic in general (only the middle part does); therefore we set a threshold Tmax for the number of tests and adopt the following 'Two-Orthants' (as the starting and end points occupy two orthants) rule for the number of leapfrog steps:

T = min{ min{t ≥ 1 : ⟨q^(0), q^(t)⟩ < 0}, Tmax }.  (3.3)
Alternatively, one can stop the leapfrog steps in a stochastic way based on the geodesic distance travelled, e.g.

T = min{ t ≥ 1 : ut ≤ (1/π) arccos(⟨q^(0), q^(t)⟩/r²), ut ~ Unif(0, 1) } ∧ Tmax.  (3.4)
These stopping criteria are already time reversible, so the recursive binary tree used in the 'No-U-Turn' algorithm (Hoffman and Gelman, 2014) is no longer needed.
Lastly, we adopt the dual averaging scheme (Nesterov, 2009) for the adaptation of step size h. See Hoffman and Gelman (2014) for more details. We summarize our Adaptive Spherical Hamiltonian Monte Carlo (adp-SphHMC) in the supplementary file (Lan et al., 2019).
To sample L (or Lt), we could update each row vector in parallel and accept/reject vech⊤(L) (or vech⊤(Lt)) simultaneously based on the sum of the total energies of all components. We refer to the resulting algorithm as Δ-Spherical HMC (Δ-SphHMC).
The computational cost of the GP priors is dominated by inverting the N × N kernel matrices, which is O(N³), while the likelihood evaluation costs O(MND²) using triangular solves with the Cholesky factors Ln; the MCMC updates of μ, τ, L scale accordingly. To scale up applications to larger dimension D, one could preliminarily classify data into groups and arrange the corresponding blocks of their covariance/correlation matrix in a 'band' along the main diagonal, assuming no correlation among groups. More specifically, we can assume Lt is a w-band lower triangular matrix at each time t, i.e. lij = 0 for i < j or i − j ≥ w; then the resulting covariance/correlation matrix will be (2w − 1)-banded. In this way, the complexity of the likelihood evaluation and of updating L is reduced to scale linearly with D for fixed w. Therefore the total computational cost scales linearly with the dimension D. This technique is investigated in Section 4.2.
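The bandedness claim is easy to check numerically: if each Lt is w-band lower triangular with unit-norm rows, the induced correlation matrix is (2w − 1)-banded. A small sketch with illustrative D and w:

```python
import numpy as np

rng = np.random.default_rng(4)
D, w = 10, 2
i, j = np.indices((D, D))
L = np.tril(rng.standard_normal((D, D)))
L[i - j >= w] = 0.0                              # w-band: l_ij = 0 for i - j >= w
L /= np.linalg.norm(L, axis=1, keepdims=True)    # rows on unit spheres

P = L @ L.T
assert np.allclose(P[np.abs(i - j) >= w], 0.0)   # only 2w - 1 central bands survive
```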
4. Simulation Studies
In this section, we use simulated examples to illustrate the advantages of our structured models for covariance. First, we consider the normal-inverse-Wishart problem. Since there is conjugacy and we know the true posterior, we use this to verify our method and to investigate the flexible priors of Section 2.2. Then we test our dynamic modeling method of Section 2.3 on a periodic process model. Our model demonstrates full flexibility compared to a state-of-the-art nonparametric covariance regression model based on a latent factor process (Fox and Dunson, 2015).
4.1. Normal-inverse-Wishart Problem
Consider the following example involving the inverse-Wishart prior:

yn | Σ ~ N_D(μ0, Σ) independently for n = 1,⋯,N,  Σ ~ W_D⁻¹(Ψ, ν).  (4.1)
It is known that the posterior of Σ|Y is still an inverse-Wishart distribution:

Σ | Y ~ W_D⁻¹(Ψ + ∑_{n=1}^{N} (yn − μ0)(yn − μ0)⊤, ν + N).  (4.2)
We consider dimension D = 3 and generate data Y with μ0 = 0 for N = 20 data points, so that the prior is not overwhelmed by the data.
Verification of Validity
Specifying conditional priors based on (2.8) in the structured model (2.6), we check the validity of our proposed method by comparing the posterior estimates obtained by Δ-SphHMC against the truth (4.2).

We sample τ := log(σ) using standard HMC and U* using Δ-SphHMC, updated in a Metropolis-within-Gibbs scheme. 10⁶ samples are collected after burning the first 10% and subsampling 1 of every 10. For each sample of τ and vech(U*), we calculate Σ = diag(e^τ)U*(U*)⊤ diag(e^τ). Marginal densities of the entries of Σ are estimated from these samples and plotted against the results of direct sampling in Figure 3. Despite sampling variance, these estimates closely match the results of direct sampling, indicating the validity of our proposed method.
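The direct-sampling reference in this comparison only requires the conjugate update (4.2). A sketch using SciPy (with illustrative Ψ and ν, since the paper's exact hyperparameter values are not reproduced here):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(5)
D, N, nu = 3, 20, 5
Psi = np.eye(D)
mu0 = np.zeros(D)

Sigma_true = invwishart.rvs(df=nu, scale=Psi, random_state=rng)
Y = rng.multivariate_normal(mu0, Sigma_true, size=N)

# Conjugate posterior (4.2): Sigma | Y ~ IW(Psi + S, nu + N)
S = (Y - mu0).T @ (Y - mu0)
posterior = invwishart(df=nu + N, scale=Psi + S)
draws = posterior.rvs(size=10_000, random_state=rng)  # array of shape (10000, D, D)
print(draws.mean(axis=0))  # direct-sampling estimate of E[Sigma | Y]
```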
Figure 3:
Marginal posterior densities of σij in the normal-inverse-Wishart problem. Solid blue lines are estimates by Δ-SphHMC and dashed red lines are estimates by direct sampling. All densities are estimated with 10⁶ samples.
Examining Flexibility of Priors
We have proposed several spherical priors for the Cholesky factor of the correlation matrix in Section 2.2. Now we examine the flexibility of these priors in providing prior information for the correlations under various parameter settings.
With the same data generated according to (4.1), we now consider the squared-Dirichlet prior (2.10) for L in the structured model (2.6) with the following setting:

li ~ Dir²(αi),  αi = (α·1_{i−1}, α0),  i = 2,⋯,D,  (4.3)

where we consider three cases: i) α = 1, α0 = 1; ii) α = 0.1, α0 = 1; iii) α = 0.1, α0 = 10.
We generate 10⁶ prior samples (according to (4.3)) and posterior samples (by Δ-SphHMC) for L respectively, and convert them to P = LL⊤. For each entry ρij, we estimate the marginal posterior (prior) density based on these posterior (prior) samples. The posteriors, priors, and maximum likelihood estimates (MLEs) of the correlations ρij are plotted in Figure 4 for the different α's. In general, the posteriors are a compromise between the priors and the likelihoods (MLEs). As more and more weight (through α) is put on the poles (last component) of each factor sphere, the priors become so dominant that the posteriors (red dashed lines) almost fall on the priors (blue solid lines) when α = (0.1, 0.1, 10). In this extreme case, the squared-Dirichlet distributions induce priors in favor of trivial (zero) correlations. We reach similar conclusions for the Bingham prior and the von Mises–Fisher prior; results are reported in Section E.1 of the supplementary file (Lan et al., 2019).
Figure 4:
Marginal posterior and prior (induced from the squared-Dirichlet distribution) densities of correlations, together with MLEs, under different settings of the concentration parameter α, estimated with 10⁶ samples.
4.2. Simulated Periodic Processes
In this section, we investigate the performance of our dynamic model (2.11) on the following periodic process example:

y_t ~ N_D(μ_t, Σ_t),  Σ_t = diag(σ_t) P_t diag(σ_t),  P_t = L_t L_t⊤,  (4.4)

where the mean μ_t, the standard deviations σ_t, and the entries of the Cholesky factor L_t are periodic (sinusoidal) functions of t, specified through coefficients Lij and Sij.
Based on the model (4.4), we generate M trials (process realizations) of data y at N evenly spaced points t in [0, 2]; therefore the whole data set {y(t)} is an M × N × D array. We first consider D = 2 to investigate the posterior contraction phenomenon and the model flexibility; we then consider D = 100 over a shorter period [0, 1] to show the scalability of the 'w-band' structure.
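Since the exact sinusoidal coefficients of (4.4) are not reproduced here, the following generator is only a structural stand-in: it produces an M × N × D data array with sinusoidal mean, variance, and correlation processes. All numerical settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, D = 10, 20, 2
t = np.linspace(0.0, 2.0, N)

mu = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)  # N x D mean
sd = np.stack([1.0 + 0.5 * np.sin(np.pi * t),
               1.0 + 0.5 * np.cos(np.pi * t)], axis=1)                 # N x D sd
theta = 0.4 * np.pi * np.sin(2 * np.pi * t)      # angle driving the correlation

Y = np.empty((M, N, D))
for n in range(N):
    L = np.array([[1.0, 0.0],
                  [np.sin(theta[n]), np.cos(theta[n])]])  # unit rows: l_2 on S^1
    P = L @ L.T                                           # rho(t) = sin(theta(t))
    Sigma = np.diag(sd[n]) @ P @ np.diag(sd[n])
    Y[:, n, :] = rng.multivariate_normal(mu[n], Sigma, size=M)
# Y plays the role of the M x N x D array {y(t)} described above.
```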
Posterior Contraction
Posterior contraction describes the phenomenon whereby the posterior concentrates on smaller and smaller neighborhoods of the true parameter (function) as more and more data are collected (van der Vaart and van Zanten, 2008a). We investigate this phenomenon for both the mean functions and the covariance functions in our model (2.11) using the following settings: i) M = 10, N = 20; ii) M = 100, N = 20; iii) M = 10, N = 200; iv) M = 100, N = 200.
To fit the data using the model (2.11), we set s = 2, a = (1, 1, 1), b = (0.1, 10⁻³, 0.2), m = (0, 0, 0) for all settings, and V = (1, 0.5, 1) for N = 20 and V = (1, 1, 0.3) for N = 200. We also add a nugget of 10⁻⁵I_N to all the covariance kernels of the GPs to ensure non-degeneracy. Following the procedure in Section 3.1, we run MCMC for 1.5 × 10⁵ iterations, burn in the first 5 × 10⁴, and subsample 1 of every 10. Based on the resulting 10⁴ posterior samples, we estimate the underlying mean functions and covariance functions and plot the estimates in Figure 5.
Figure 5:
Estimation of the underlying mean functions μt (left in each of the 4 subpanels) and covariance functions Σt (right in each of the 4 subpanels) of 2-dimensional periodic processes. M is the number of trials, and N is the number of discretization points. Dashed lines are true values, solid lines are estimates, and shaded regions are 95% credible bands.
Note in Figure 5 that both M and N affect the amount of data information, and hence the posterior contraction, but the contraction rate may depend on them differently. Both the mean and covariance functions have narrower credible bands for more discretization points N (compare N = 20 in the first row with N = 200 in the second row). On the other hand, both posteriors contract further with more trials M (compare M = 10 in the first column against M = 100 in the second column). In general, the posterior of the mean function contracts to the truth faster than the posterior of the covariance function. With M = 100 trials and N = 200 discretization points, both the mean and covariance functions are almost fully recovered by the model (2.11).
Full Flexibility
Our method (2.11) grants full flexibility because it models the mean, variance, and correlation processes separately. This is particularly useful when they behave differently. It contrasts with latent-factor-based models that tie the mean and covariance processes together. One of the state-of-the-art models of this type is Bayesian nonparametric covariance regression (Fox and Dunson, 2015):

y(x) = Λ(x)η(x) + ε(x),  η(x) = ψ(x) + ν(x),  ε(x) ~ N(0, Σ0),  ν(x) ~ N(0, I_k),  (4.5)

which induces the mean μ(x) = Λ(x)ψ(x) and covariance Σ(x) = Λ(x)Λ(x)⊤ + Σ0 through the shared loading matrix Λ(x).
We tweak the simulated example (4.4) for D = 2 so that the mean and correlation processes have higher frequency than the variance processes, as shown by the dashed lines in Figure 6. We generate M = 10 trials of data over N = 200 evenly spaced points. In this case, the true mean processes μ(x) and the true covariance processes Σ(x) behave differently but are modeled through a common loading matrix Λ(x) in model (4.5). This makes it difficult for (4.5) to find a latent factor process ψ(x) that can properly accommodate the heterogeneity of the mean and covariance processes. Figure 6 shows that, for this reason, the latent-factor-based model (4.5) (upper row) fails to produce a satisfactory fit for the mean, variance, and correlation processes. Our fully flexible model (2.11) (bottom row), on the contrary, successfully produces a more accurate characterization of all of them. Note that this artificial example is used to demonstrate the flexibility of our dynamic model (2.11). For cases that are not as extreme, (4.5) may perform equally well. See more discussion in Section 6 and more details in Section E.2 of the supplementary file (Lan et al., 2019).
Figure 6:
Estimation of the underlying mean functions μt (left column), variance functions σt (middle column) and correlation function ρt (right column) of 2-dimensional periodic processes, using latent factor process model (upper row) and our flexible model (lower row), based on M = 10 trials of data over N = 200 evenly spaced points. Dashed lines are true values, solid lines are estimates and shaded regions are 95% credible bands.
Scalability
Now we use the same simulation model (4.4) with D = 100 dimensions to test the scalability of our dynamic model (2.11). However, instead of a full covariance matrix, we consider a covariance matrix with 110 non-zero off-diagonal entries: ρ1,2, ρ3,4,⋯,ρ99,100, plus a few outside the '2-band' of the diagonal. The sparsity structure of the covariance is shown in the left panel of Figure 7, where the red lines indicate the '2-band' structure. We focus on the correlation process in this example, and thus set μt ≡ 0 and σt ≡ 1 for t ∈ [0, 1]. More specifically, when generating the data {yt}, we only use the coefficients Lij and Sij in (4.4) for the non-zero entries.
Figure 7:
Underlying structure of the correlation matrix (left), selective posterior estimates of the correlation functions Pt (middle), and its Frobenius-norm distance to the truth (right) based on 100-dimensional periodic processes with 2-band structure, using M = 100 trials over N = 100 discretization points. Dashed lines are true values, solid lines are estimates and shaded regions are the 95% credible bands.
To apply our dynamic model (2.11) in this setting, we let Lt have the 'w-band' structure with w = 2 at each time t. Setting s = 2, a = 1, b = 0.1, m = 0, V = 10⁻³, N = 100, and M = 100, we run MCMC for 7.5 × 10⁴ iterations, burn in the first 2.5 × 10⁴, and subsample 1 of every 10 to obtain 5 × 10³ posterior samples. Based on those samples, we estimate the underlying correlation functions and plot only the selected correlations ρ1,2, ρ3,12, ρ50,51 and ρ99,100 in Figure 7. With the 'w-band' structure, we have fewer entries in the covariance matrix and focus on the 'in-group' correlations. Our dynamic model (2.11) is sensitive enough to discern the informative non-zero components from the non-informative ones in these correlation functions. Unit-vector GP priors provide the flexibility for the model to capture the changing pattern of the informative correlations. The middle panel of Figure 7 shows that, except for ρ3,12, which falls outside the '2-band' as indicated by the red circle in the left panel of Figure 7, the model (2.11) correctly identifies the non-zero components ρ1,2 and ρ99,100 and characterizes their evolution. The right panel shows that the relative Frobenius-norm distance between the estimated and true correlation matrices is small, indicating that our dynamic model (2.11) performs well in higher dimensions in estimating complex dependence structures among multiple stochastic processes.
5. Analysis of Local Field Potential Activity
Now we use the proposed model (2.11) to analyze a local field potential (LFP) activity dataset. The goal of this analysis is to elucidate how memory encoding, retrieval and decision-making arise from functional interactions among brain regions, by modeling how their dynamic connectivity varies during performance of complex memory tasks. Here we focus on LFP activity data recorded from 24 electrodes spanning the dorsal CA1 subregion of the hippocampus as rats performed a sequence memory task (Allen et al., 2014, 2016; Ng et al., 2017; Holbrook et al., 2017). The task involves repeated presentations of a sequence of odors (e.g., ABCDE) at a single port and requires rats to correctly determine whether each odor is presented ‘in sequence’ (InSeq; e.g., ABCDE; by holding their nosepoke response until the signal at 1.2s) or ‘out of sequence’ (OutSeq; e.g., ABDDE; by withdrawing their nose before the signal). In previous work using the same dataset, Holbrook et al. (2016) used a direct MCMC algorithm to study the spectral density matrix of LFP from 4 selected channels. However, they did not examine how their correlations varied across time and recording site. These limitations are addressed in this paper.
We focus our analyses on the time window from 0ms to 750ms (with 0 corresponding to when the rat's nose enters the odor port). Critically, this includes a time period during which the behavior of the animal is held constant (0–500ms), so differences in LFP reflect the cognitive processes associated with task performance, and, to serve as a comparison, a time period near 750ms during which the behavioral state of the animal is known to be different (i.e., by 750ms the animal has already withdrawn from the port on the majority of OutSeq trials, but is still in the port on InSeq trials). We also focus our analyses on two sets of adjacent electrodes (electrodes 20 and 22, and electrodes 8 and 9), which allows for comparisons between probes that are near each other (<1mm; i.e., 20:22 and 8:9) or more distant from each other (>2mm; i.e., 20:8, 20:9, 22:8, and 22:9). Figure 8 shows M = 20 trials of these LFP signals from D = 4 channels under both InSeq and OutSeq conditions. Our main objective is to quantify how the correlations among these LFP channels varied across trial types (InSeq vs OutSeq) and over time (within the first 750ms of trials). To do so, we discretize the time window of 0.75 seconds into N = 300 equally spaced small intervals. Under each experimental condition (InSeq or OutSeq), we treat all the signals as a 4-dimensional time series and fit them using our proposed dynamic correlation model (2.11) in order to discover the evolution of their relationship. Note that we model the mean, variance, and correlation processes separately but only report findings about the evolution of the correlation among these brain signals.
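For readers interested in reproducing the preprocessing, a typical pipeline is sketched below. The band edges match the beta range quoted in the text, but the sampling rate, filter order, and array layout are our assumptions, not specifics from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000.0                        # assumed sampling rate (Hz)
n_raw = int(0.75 * fs)             # samples in the 0-750 ms window

def beta_filter(x, fs=fs, low=20.0, high=40.0, order=4):
    """Zero-phase band-pass filter in the beta range (20-40 Hz)."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, x, axis=-1)

# lfp: M trials x D channels x n_raw samples (placeholder data)
lfp = np.random.default_rng(7).standard_normal((20, 4, n_raw))
beta = beta_filter(lfp)

# Discretize the 750 ms window into N = 300 equally spaced points:
N = 300
idx = np.linspace(0, n_raw - 1, N).astype(int)
y = beta[:, :, idx]   # reshape/transpose as needed into the N x D series per trial
```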
Figure 8:
LFP signals on “in sequence” and “out of sequence” trials. It is difficult to identify differences between the two conditions based on a mere visual inspection of the LFPs.
We set s = 2, a = (1, 1, 1), b = (1, 0.1, 0.2), m = (0, 0, 0), and V = (1, 1.2, 2); the general results are not very sensitive to the choice of these fine-tuning parameters. We also scale the discretized time points into (0, 1] and add a nugget of 10⁻⁵I_N to the covariance kernels of the GPs. We follow the same procedure as in Section 3.1 to collect 7.5 × 10⁴ samples, burn in the first 2.5 × 10⁴, and subsample 1 of every 10. The resulting 10⁴ samples yield estimates of the correlation processes, shown in Figure 9 for beta-filtered traces (20–40Hz); similar patterns were also observed for theta-filtered traces (4–12Hz; see the supplement). The bottom panel of Figure 9 shows the dissimilarity between the correlation processes under the two conditions, measured by the Frobenius norm of their difference.
Figure 9:
Estimated correlation processes of LFPs (beta) under in-sequence condition (top), out-of-sequence condition (middle) and the (Frobenius) distance between two correlation matrices (bottom).
Our approach revealed several important patterns in the data. First, it showed that electrodes near each other (20:22 and 8:9) displayed remarkably high correlations in their LFP activity on InSeq and OutSeq trials, whereas correlations were considerably lower among more distant electrodes (20:8, 20:9, 22:8, and 22:9). Second, it revealed that the difference between the InSeq and OutSeq correlation matrices evolved during the presentation of individual trials. These results are consistent with other analyses on learning (see, e.g., Fiecas and Ombao, 2016). As expected, InSeq and OutSeq activity was very similar at the beginning of the time window (e.g., before 350ms), which is before the animal has any information about the InSeq or OutSeq status of the presented odor, but maximally different at the end of the time window, which is after it has made its response on OutSeq trials. Most important, however, is the discovery of InSeq vs OutSeq differences before 500ms, which reveal changes in neural activity associated with the complex cognitive process of identifying whether events occurred in their expected order. These findings highlight the sensitivity of our novel approach, as such differences have not been detected with traditional analyses. Interested readers can find more results about all 12 channels in Section E.3 of the supplementary file (Lan et al., 2019).
6. Conclusion
In this paper, we propose a novel Bayesian framework that grants full flexibility in modeling covariance and correlation matrices. It extends the separation strategy proposed by Barnard et al. (2000) and uses the Cholesky decomposition to maintain the positive definiteness of the correlation matrix. By defining distributions on spheres, a large class of flexible priors can be induced for the covariance matrix that go beyond the commonly used but restrictive inverse-Wishart distribution. Furthermore, the structured models we propose maintain the interpretability of covariance in terms of variance and correlation. Adaptive Δ-Spherical HMC is introduced to handle the intractability of the resulting posterior. We extend this structured scheme to dynamic models to capture complex dependence among multiple stochastic processes, and demonstrate its effectiveness and efficiency in Bayesian modeling of covariance and correlation matrices using a normal-inverse-Wishart problem, a simulated periodic process, and an analysis of LFP data. In addition, we provide both a theoretical characterization and an empirical investigation of posterior contraction for dynamic covariance modeling, which, to the best of our knowledge, is a first attempt.
In this work, we consider the marginal (pairwise) dependence among multiple stochastic processes. The priors for the correlation matrix specified through the sphere-product representation are in general dependent among the component variables. For example, the method we use to induce an uncorrelated prior between yi and yj (i < j), by setting ljk ≈ 0 for k ≤ i, has the direct consequence that Cor(yi′, yj) ≈ 0 for i′ ≤ i. In other words, more informative priors (in which some of the components are correlated) may require careful ordering of {yi}. To avoid this issue, one might consider the inverse of the covariance (the precision) matrix instead. This leads to modeling the conditional dependence, or a Markov network (Dempster, 1972; Friedman et al., 2008). Our proposed methodology applies directly to (dynamic) precision matrices/processes, which will be our future direction.
To further scale our method to problems of greater dimensionality, one could explore the low-rank structure of covariance and correlation matrices, e.g. by adopting a factorization similar to that of Fox and Dunson (2015) and assuming a rank-k structure for some k ≪ D, or by imposing sparse structure on the precision matrices.
We have proved that the posterior of the covariance function contracts at a rate given by the general form of the concentration function (van der Vaart and van Zanten, 2008a). Empirical evidence (Section 4.2) shows that the posterior of the covariance contracts more slowly than that of the mean. More theoretical work is needed to compare their contraction rates. Future research could also investigate posterior contraction in covariance regression with respect to the optimal rates under different GP priors.
While our research has generated interesting new findings regarding brain signals during memory tasks, one limitation of our current analysis on LFP data is that it is conducted on a single rat. The proposed model can be generalized to account for variation among rats. In the future, we will apply this sensitive approach to other datasets, including simultaneous LFP recordings from multiple brain regions in rats as well as BOLD functional magnetic resonance imaging (fMRI) data collected from human subjects performing the same task.
Acknowledgments
SL is supported by ONR N00014-17-1-2079. AH is supported by NIH grant T32 AG000096. GAE is supported by NIDCD T32 DC010775. NJF is supported by NSF awards IOS-1150292 and BCS-1439267 and Whitehall Foundation award 2010-05-84. HO is supported by NSF DMS 1509023 and NSF SES 1461534. BS is supported by NSF DMS 1622490 and NIH R01 MH115697. We thank Yulong Lu at Duke for discussions on posterior contraction of covariance regression with Gaussian processes.
Footnotes
Supplementary Material
Web-based supplementary file for “Flexible Bayesian Dynamic Modeling of Correlation and Covariance Matrices” (DOI: 10.1214/19-BA1173SUPP; .pdf).
For each i ∈ {1,⋯,D}, given σi, there are only (i − 1) free parameters on S_0^{i−1}(σi), so there are in total ∑_{i=1}^{D} (i − 1) = D(D−1)/2 free parameters in L, and D + D(D−1)/2 = D(D+1)/2 in (σ, L).
This can be achieved through the exchange matrix (a.k.a. reversal matrix, backward identity, or standard involutory permutation) E, with 1's on the anti-diagonal and 0's elsewhere. Note that E is both involutory and orthogonal, i.e. E = E⁻¹ = E⊤. Let EΣE = LL⊤ be the usual Cholesky decomposition. Then Σ = (ELE)(ELE)⊤ = UU⊤ with U := ELE upper triangular.
References
- Allen TA, Morris AM, Mattfeld AT, Stark CE, and Fortin NJ (2014). “A sequence of events model of episodic memory shows parallels in rats and humans.” Hippocampus, 24(10): 1178–1188.
- Allen TA, Salz DM, McKenzie S, and Fortin NJ (2016). “Nonspatial sequence coding in CA1 neurons.” Journal of Neuroscience, 36(5): 1547–1563.
- Anderson TW (2003). An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience. MR1990662.
- Banfield JD and Raftery AE (1993). “Model-based Gaussian and non-Gaussian clustering.” Biometrics, 803–821. MR1243494. doi: 10.2307/2532201.
- Barnard J, McCulloch R, and Meng X-L (2000). “Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage.” Statistica Sinica, 1281–1311. MR1804544.
- Bensmail H, Celeux G, Raftery AE, and Robert CP (1997). “Inference in model-based cluster analysis.” Statistics and Computing, 7(1): 1–10.
- Beskos A, Pinski FJ, Sanz-Serna JM, and Stuart AM (2011). “Hybrid Monte Carlo on Hilbert spaces.” Stochastic Processes and their Applications, 121(10): 2201–2230.
- Byrne S and Girolami M (2013). “Geodesic Monte Carlo on embedded manifolds.” Scandinavian Journal of Statistics, 40(4): 825–845.
- Celeux G and Govaert G (1995). “Gaussian parsimonious clustering models.” Pattern Recognition, 28(5): 781–793.
- Chiu TY, Leonard T, and Tsui K-W (1996). “The matrix-logarithmic covariance model.” Journal of the American Statistical Association, 91(433): 198–210.
- Cho H and Fryzlewicz P (2015). “Multiple-change-point detection for high dimensional time series via sparsified binary segmentation.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2): 475–507.
- Cribben I, Haraldsdottir R, Atlas LY, Wager TD, and Lindquist MA (2012). “Dynamic connectivity regression: Determining state-related changes in brain connectivity.” NeuroImage, 61(4): 907–920.
- Dahlhaus R (2000). “A likelihood approximation for locally stationary processes.” Annals of Statistics, 28(6): 1762–1794.
- Daniels MJ (1999). “A prior for the variance in hierarchical models.” Canadian Journal of Statistics, 27(3): 567–578.
- Daniels MJ and Kass RE (1999). “Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models.” Journal of the American Statistical Association, 94(448): 1254–1263.
- Daniels MJ and Kass RE (2001). “Shrinkage estimators for covariance matrices.” Biometrics, 57(4): 1173–1184.
- Dempster AP (1972). “Covariance selection.” Biometrics, 28(1): 157–175.
- Duane S, Kennedy AD, Pendleton BJ, and Roweth D (1987). “Hybrid Monte Carlo.” Physics Letters B, 195(2): 216–222.
- Fiecas M and Ombao H (2016). “Modeling the evolution of dynamic brain processes during an associative learning experiment.” Journal of the American Statistical Association, 111(516): 1440–1453.
- Fisher R (1953). “Dispersion on a sphere.” Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 217: 295–305.
- Fox EB and Dunson DB (2015). “Bayesian nonparametric covariance regression.” Journal of Machine Learning Research, 16: 2501–2542. MR3450515.
- Friedman J, Hastie T, and Tibshirani R (2008). “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics, 9(3): 432–441.
- Girolami M and Calderhead B (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion).” Journal of the Royal Statistical Society, Series B, 73(2): 123–214.
- Hoffman MD and Gelman A (2014). “The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 15(1): 1593–1623. MR3214779.
- Holbrook A, Lan S, Vandenberg-Rodes A, and Shahbaba B (2016). “Geodesic Lagrangian Monte Carlo over the space of positive definite matrices: with application to Bayesian spectral density estimation.” arXiv preprint arXiv:1612.08224. MR3750634.
- Holbrook A, Vandenberg-Rodes A, Fortin N, and Shahbaba B (2017). “A Bayesian supervised dual-dimensionality reduction model for simultaneous decoding of LFP and spike train signals.” Stat, 6(1): 53–67.
- Kent JT (1982). “The Fisher-Bingham distribution on the sphere.” Journal of the Royal Statistical Society, Series B (Methodological), 71–80. MR0655376.
- Lan S, Holbrook A, Elias GA, Fortin NJ, Ombao H, and Shahbaba B (2019). “Web-based supplementary file for ‘Flexible Bayesian Dynamic Modeling of Correlation and Covariance Matrices’.” Bayesian Analysis. doi: 10.1214/19-BA1173SUPP.
- Lan S and Shahbaba B (2016). Algorithmic Advances in Riemannian Geometry and Applications, chapter 2, 25–71. Advances in Computer Vision and Pattern Recognition. Springer International Publishing, 1st edition. MR3811242.
- Lan S, Stathopoulos V, Shahbaba B, and Girolami M (2015). “Markov chain Monte Carlo from Lagrangian dynamics.” Journal of Computational and Graphical Statistics, 24(2): 357–378.
- Lan S, Zhou B, and Shahbaba B (2014). “Spherical Hamiltonian Monte Carlo for constrained target distributions.” In Proceedings of the 31st International Conference on Machine Learning, volume 32, 629–637. Beijing.
- Leonard T and Hsu JS (1992). “Bayesian inference for a covariance matrix.” The Annals of Statistics, 1669–1696.
- Liechty JC (2004). “Bayesian correlation estimation.” Biometrika, 91(1): 1–14.
- Lindquist MA, Xu Y, Nebel MB, and Caffo BS (2014). “Evaluating dynamic bivariate correlations in resting-state fMRI: A comparison study and a new approach.” NeuroImage, 101: 531–546.
- Liu C (1993). “Bartlett’s decomposition of the posterior distribution of the covariance for normal monotone ignorable missing data.” Journal of Multivariate Analysis, 46(2): 198–206. MR1240420. doi: 10.1006/jmva.1993.1056.
- Magnus JR and Neudecker H (1979). “The commutation matrix: some properties and applications.” The Annals of Statistics, 381–394. MR0520247.
- Mardia KV and Jupp PE (2009). Directional Statistics, volume 494. John Wiley & Sons. MR1828667.
- Mardia KV, Kent JT, and Bibby JM (1980). Multivariate Analysis. Probability and Mathematical Statistics. London: Academic Press. MR0560319.
- Murray I, Adams RP, and MacKay DJ (2010). “Elliptical slice sampling.” JMLR: Workshop and Conference Proceedings, 9: 541–548.
- Nason GP, von Sachs R, and Kroisandt G (2000). “Wavelet processes and adaptive estimation of the evolutionary wavelet spectrum.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(2): 271–292.
- Neal RM (2003). “Slice sampling.” Annals of Statistics, 31(3): 705–767.
- Neal RM (2011). “MCMC using Hamiltonian dynamics.” In Brooks S, Gelman A, Jones G, and Meng X-L (eds.), Handbook of Markov Chain Monte Carlo, 113–162. Chapman and Hall/CRC.
- Nesterov Y (2009). “Primal-dual subgradient methods for convex problems.” Mathematical Programming, 120(1): 221–259. MR2496434. doi: 10.1007/s10107-007-0149-x.
- Ng C-W, Elias GA, Asem JS, Allen TA, and Fortin NJ (2017). “Nonspatial sequence coding varies along the CA1 transverse axis.” Behavioural Brain Research.
- Ombao H, von Sachs R, and Guo W (2005). “SLEX analysis of multivariate nonstationary time series.” Journal of the American Statistical Association, 100(470): 519–531. MR2160556. doi: 10.1198/016214504000001448.
- Park T, Eckley IA, and Ombao HC (2014). “Estimating time-evolving partial coherence between signals via multivariate locally stationary wavelet processes.” IEEE Transactions on Signal Processing, 62(20): 5240–5250. MR3268108. doi: 10.1109/TSP.2014.2343937.
- Pinheiro JC and Bates DM (1996). “Unconstrained parametrizations for variance-covariance matrices.” Statistics and Computing, 6(3): 289–296.
- Pourahmadi M (1999). “Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation.” Biometrika, 86(3): 677–690. MR1723786. doi: 10.1093/biomet/86.3.677.
- Pourahmadi M (2000). “Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix.” Biometrika, 425–435.
- Pourahmadi M and Wang X (2015). “Distribution of random correlation matrices: Hyperspherical parameterization of the Cholesky factor.” Statistics & Probability Letters, 106: 5–12. MR3389964. doi: 10.1016/j.spl.2015.06.015.
- Prado R (2013). “Sequential estimation of mixtures of structured autoregressive models.” Computational Statistics & Data Analysis, 58: 58–70. MR2997925. doi: 10.1016/j.csda.2011.03.017.
- Prado R, West M, and Krystal AD (2001). “Multichannel electroencephalographic analyses via dynamic regression models with time-varying lag-lead structure.” Journal of the Royal Statistical Society: Series C (Applied Statistics), 50(1): 95–109.
- Priestley M (1965). “Evolutionary spectra and non-stationary processes.” Journal of the Royal Statistical Society, Series B, 27: 204–237. MR0199886.
- Rao TS (1970). “The fitting of non-stationary time-series models with time-dependent parameters.” Journal of the Royal Statistical Society, Series B (Methodological), 32(2): 312–322. MR0269065.
- Rapisarda F, Brigo D, and Mercurio F (2007). “Parameterizing correlations: a geometric interpretation.” IMA Journal of Management Mathematics, 18(1): 55–73. MR2332847. doi: 10.1093/imaman/dpl010.
- Smith W and Hocking R (1972). “Algorithm AS 53: Wishart variate generator.” Journal of the Royal Statistical Society, Series C (Applied Statistics), 21(3): 341–345.
- Sverdrup E (1947). “Derivation of the Wishart distribution of the second order sample moments by straightforward integration of a multiple integral.” Scandinavian Actuarial Journal, 1947(1): 151–166. MR0024102. doi: 10.1080/03461238.1947.10419659.
- Ting C-M, Seghouane A-K, Salleh S-H, and Noor AM (2015). “Estimating effective connectivity from fMRI data using factor-based subspace autoregressive models.” IEEE Signal Processing Letters, 22(6): 757–761.
- Tokuda T, Goodrich B, Van Mechelen I, Gelman A, and Tuerlinckx F (2011). “Visualizing distributions of covariance matrices.” Technical report, Columbia University, New York.
- Tracy DS and Dwyer PS (1969). “Multivariate maxima and minima with matrix derivatives.” Journal of the American Statistical Association, 64(328): 1576–1594. MR0251849.
- van der Vaart A and van Zanten H (2011). “Information rates of nonparametric Gaussian process methods.” Journal of Machine Learning Research, 12: 2095–2119. MR2819028.
- van der Vaart AW and van Zanten JH (2008a). “Rates of contraction of posterior distributions based on Gaussian process priors.” The Annals of Statistics, 36(3): 1435–1463. MR2418663. doi: 10.1214/009053607000000613.
- van der Vaart AW and van Zanten JH (2008b). “Reproducing kernel Hilbert spaces of Gaussian priors.” Volume 3 of IMS Collections, 200–222. Beachwood, Ohio: Institute of Mathematical Statistics.
- van der Vaart AW and van Zanten JH (2009). “Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth.” Annals of Statistics, 37(5B): 2655–2675. MR2541442. doi: 10.1214/08-AOS678.
- West M, Prado R, and Krystal AD (1999). “Evaluation and comparison of EEG traces: Latent structure in nonstationary time series.” Journal of the American Statistical Association, 94(446): 375–387. MR2697749.
- Wilson AG and Ghahramani Z (2011). “Generalised Wishart processes.” In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence.
- Wishart J (1928). “The generalised product moment distribution in samples from a normal multivariate population.” Biometrika, 32–52.
- Yang R and Berger JO (1994). “Estimation of a covariance matrix using the reference prior.” The Annals of Statistics, 1195–1211. MR1311972. doi: 10.1214/aos/1176325625.
- Yang Y and Dunson DB (2016). “Bayesian manifold regression.” The Annals of Statistics, 44(2): 876–905.