Published in final edited form as: J Am Stat Assoc. 2018 Aug 7;114(526):735–748. doi: 10.1080/01621459.2018.1437043

High-Dimensional Posterior Consistency in Bayesian Vector Autoregressive Models

Satyajit Ghosh 1, Kshitij Khare 1, George Michailidis 1

Abstract

Vector autoregressive (VAR) models aim to capture linear temporal interdependencies amongst multiple time series. They have been widely used in macroeconomics and financial econometrics, and more recently have found novel applications in functional genomics and neuroscience. These applications have also accentuated the need to investigate the behavior of the VAR model in a high-dimensional regime, which provides novel insights into the role of temporal dependence for regularized estimates of the model’s parameters. However, hardly anything is known regarding properties of the posterior distribution for Bayesian VAR models in such regimes. In this work, we consider a VAR model with two prior choices for the autoregressive coefficient matrix: a non-hierarchical matrix-normal prior and a hierarchical prior, which corresponds to an arbitrary scale mixture of normals. We establish posterior consistency for both these priors under standard regularity assumptions, when the dimension p of the VAR model grows with the sample size n (but still remains smaller than n). A special case corresponds to a shrinkage prior that introduces (group) sparsity in the columns of the model coefficient matrices. The performance of the model estimates is illustrated on synthetic and real macroeconomic data sets.

Keywords: Vector Autoregressive Models, Posterior consistency, Shrinkage prior, Bayesian Lasso

1. Introduction

There has been recent interest in modeling high-dimensional time series data sets. In macroeconomics, Mol et al. (2008) advocated the need to include a large number of variables in econometric models to improve forecastability, while Billio et al. (2012) examined stock returns of many financial institutions to assess systemic risk of the financial system. Similar modeling challenges arise in functional genomics for the reconstruction of regulatory networks, as discussed in Basu et al. (2015), while in neuroscience one is interested in understanding functional connectivity between brain regions (Seth et al. (2015)).

A popular and informative model has been the vector autoregression (VAR), which captures linear temporal dependencies between time series. The VAR model and its properties have been thoroughly explored in low-dimensional settings, both from a frequentist perspective (for a comprehensive overview see Lütkepohl (2007)) and a Bayesian one (Bańbura et al. (2010)).

More recently, Basu and Michailidis (2015) provided an in depth analysis of the model for Gaussian data in a high-dimensional setting under sparsity assumptions, while Melnyk and Banerjee (2016) extended the results to other regularizers (e.g. group lasso, sparse group lasso, etc.). The results of Basu and Michailidis (2015); Melnyk and Banerjee (2016) and related follow-up work (Raskutti and Yuan (2015); Schweinberger et al. (2017); Lin and Michailidis (2017)) indicate that the resulting estimation error rates are those obtained for independent and identically distributed data times a factor that captures the temporal dependence in the data.

On the Bayesian front, there has been primarily methodological/computational work for low-dimensional VAR models. The so-called Minnesota prior (Litterman et al. (1979); Doan et al. (1984)) has been a staple of applied econometric work involving VAR models. This is a normal prior distribution on the elements of the transition matrix that puts stronger weights on the “own” lags of each time series, since they are considered more informative for forecasting purposes than lags from “other” time series. For large size VAR models, Bańbura et al. (2010) advocate a normal-inverted Wishart distribution that leads to a posterior mean that can be interpreted as a ridge shrinkage estimator, suitable for such models. A first attempt for Bayesian estimation of VAR models combined with variable selection is presented in Korobilis (2013), where an indicator variable is specified for each parameter in the transition matrix that indicates whether the cross-autocorrelation coefficient is included or set to zero. A prior is specified for the indicator variables that in principle can also be combined with the Minnesota prior.

On the other hand, Bayesian investigations into high-dimensional asymptotics of statistical models that incorporate sparsity with temporally dependent data are not in general available to the best of our knowledge. Hence, the main objective of this work is to study posterior (estimation) consistency for a VAR model, which asserts that the posterior concentrates around the “true” parameter value (in an appropriate norm) as the sample size increases. There is a rich literature on high-dimensional posterior estimation consistency for linear regression models for independent and identically distributed data. Ghoshal (1999) established posterior consistency and asymptotic normality with a general prior on the p-vector of regression coefficients (with appropriate positivity and Lipschitz assumptions) when $p^3 \log p / n \to 0$ and $p^4 \log p / n \to 0$, respectively. Bontemps (2011) extended the work of Ghoshal (1999) by permitting the model to be misspecified and the number of predictor variables to grow proportionally to the sample size. Armagan et al. (2013) focus on shrinkage priors, which are appropriate scale mixtures of normal priors and induce weak sparsity in the vector of regression coefficients (see Carvalho et al. (2010); Griffin and Brown (2010); Armagan et al. (2011, 2013)). They establish posterior consistency under a simple sufficient condition on prior concentration when p = o(n). Lee and Oh (2013) establish posterior consistency under a high-dimensional Bayesian PCA regression setup with p > n under appropriate assumptions on the rank of the design matrix. Posterior estimation consistency in linear regression models with g-priors has also been addressed in Sparks et al. (2015).

A crucial difference between the linear regression models considered in the above work and VAR models (expressed as a linear model) is that the design matrix in the latter case is random, and exhibits dependencies both between its rows and across its columns, and also with the error term in the model (see Section 2). This leads to a significantly more involved and challenging theoretical analysis that we successfully resolve. In this paper, we investigate high-dimensional posterior consistency for Bayesian VAR models in two natural and relevant settings: (a) with a non-hierarchical matrix normal prior on the dp × p autoregressive parameter matrix and (b) a hierarchical prior which corresponds to a general scale mixture of normals. In particular, this includes spherically symmetric priors such as the multivariate-t and standard shrinkage priors which induce (group) sparsity in the columns of the coefficient matrices, such as the group structure in Basu et al. (2015). Further, we employ a flat (uniform) prior distribution for the error term. Note that the joint maximum likelihood estimation problem for a sparse VAR model, with a sparse error covariance matrix is investigated in Lin and Michailidis (2017). The posterior consistency results are established under mild regularity assumptions on the underlying spectral density and with p = o(n/ log n). The key to handling the dependencies, within the design matrix and also between the design matrix and the error term, is a pair of high-dimensional concentration inequalities established in the supplementary material (Propositions B1 and B3). Note that we are considering the “large p large n” setting with p = o(n). However, we make no assumption reducing the effective dimension of the “true parameter matrix”. We only assume that the matrix norm of the true parameter matrix is of the order p in the non-hierarchical prior setting and bounded by a constant in the hierarchical prior setting. The large p small n situation, where p is allowed to grow at a much faster rate than n is also of interest, but assumptions such as sparsity/restricted eigenvalue type conditions are required, which in turn reduce the effective dimension of the true parameters. General posterior consistency results for VAR models in the large p small n setting are also not available to the best of our knowledge and are topics of future discussion/research.

The remainder of the paper is organized as follows. In Section 2, we introduce the VAR model and necessary notions of posterior consistency. We consider the non-hierarchical matrix normal prior on the coefficient matrix in Section 3.1 and establish posterior consistency under suitable regularity assumptions. In Section 3.2, we prove posterior consistency considering a hierarchical prior corresponding to a scale-mixture of matrix normals. In Section 4 and Section 5 the methodology/results of this paper are illustrated on simulated and real data sets, respectively. Finally, we conclude with a discussion in Section 6.

1.1. Notation

Throughout this paper, ℤ, ℝ and ℂ denote the sets of integers, real numbers and complex numbers, respectively. We denote the cardinality of a set J by |J|. For a vector $v \in \mathbb{R}^p$, $\|v\| \equiv \sqrt{\sum_j v_j^2}$ denotes the ℓ2-norm. For a matrix A, ‖A‖ and σmax(A) denote the spectral norm, i.e., $\|A\| = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$, and the largest singular value of A, respectively. For a symmetric or Hermitian matrix A, we denote its maximum and minimum eigenvalues by λmax(A) and λmin(A). The vector $e_i$ is used for the i-th unit vector in ℝp. Bold uppercase letters are only used to denote matrices, and the vectorized forms of such matrices are represented by the corresponding lowercase letters. For example, if Φ is a p × p matrix then ϕ is vec(Φ). Also, O represents a zero matrix of appropriate dimension, and in general vectors are denoted by italicized bold lowercase letters.

2. Model Formulation

For a p-dimensional stationary time series {Xt}, a vector autoregressive model of lag-d is given by

$$X_t = c + \sum_{i=1}^{d} A_i X_{t-i} + \varepsilon_t. \quad (1)$$

The temporal dependence structure of the VAR model is characterized by the p × p transition matrices A1, A2, ···, Ad, and c is a p × 1 location vector which we choose to be 0. In the Gaussian VAR, the errors εt are i.i.d. $N_p(0, \Sigma_\varepsilon)$, where Σε is a p × p unknown error covariance matrix. The model in (1) can be rewritten in the Yule-Walker representation (Lütkepohl (2007)) as

$$X_t - \mu = \sum_{i=1}^{d} A_i (X_{t-i} - \mu) + \varepsilon_t,$$

where $\mu = (I - A_1 - A_2 - \cdots - A_d)^{-1} c$ is known as the process mean. Usually μ will not be known in advance; in that case, μ may be estimated by the vector of sample means $\bar{X} = \frac{1}{n} \sum_t X_t$. An alternative estimator is $\hat{\mu} = (I - \hat{A}_1 - \hat{A}_2 - \cdots - \hat{A}_d)^{-1} \hat{c}$, in which $\hat{c}$ and the $\hat{A}_i$’s are the least squares estimators. Henceforth we assume without loss of generality that μ = 0. Based on the data {X0, ···, XT}, we define the response matrix Y and the design matrix X as follows,

$$Y = \begin{bmatrix} X_T^\top \\ \vdots \\ X_d^\top \end{bmatrix}_{n \times p} \qquad X = \begin{bmatrix} X_{T-1}^\top & \cdots & X_{T-d}^\top \\ \vdots & & \vdots \\ X_{d-1}^\top & \cdots & X_0^\top \end{bmatrix}_{n \times dp}.$$

We can now rewrite the above model in a linear regression setup as

$$Y = X\Phi + E \quad (2)$$

where

$$\Phi = \begin{bmatrix} A_1^\top \\ A_2^\top \\ \vdots \\ A_d^\top \end{bmatrix}_{dp \times p} \qquad E = \begin{bmatrix} \varepsilon_T^\top \\ \vdots \\ \varepsilon_d^\top \end{bmatrix}_{n \times p}.$$

In this formulation, the number of samples is n = T − d + 1 and the number of unknown parameters is $q = dp^2$. Vectorizing (column-wise) each matrix, we get

$$y \equiv \mathrm{vec}(Y) = Z\phi + \varepsilon,$$

where $Z := (I_p \otimes X)$, $\phi = \mathrm{vec}(\Phi)$ and $\varepsilon = \mathrm{vec}(E)$. In this paper, we consider a high-dimensional setting where the dimension p of the VAR model increases with the sample size n. However, we assume that the lag d does not vary with n. This basic regression formulation lends itself easily to a Bayesian analysis in which priors are placed on the unknown parameter matrices Φ and Σε.
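To make the construction above concrete, the following is a minimal sketch (in Python with NumPy; all function and variable names are illustrative and not from the authors' code) of how Y and X in (2) can be assembled from observations X0, ..., XT:

import numpy as np

def build_var_regression(X_obs, d):
    # X_obs: (T+1) x p array holding X_0, ..., X_T; d: lag order.
    # Returns Y (n x p) and X (n x dp) with n = T - d + 1, as in (2).
    # Rows here ascend in time; the paper stacks them in reverse order,
    # which is immaterial for estimation.
    T = X_obs.shape[0] - 1
    Y = X_obs[d:]                                       # X_d, ..., X_T
    X = np.hstack([X_obs[d - i : T + 1 - i] for i in range(1, d + 1)])
    return Y, X

# Vectorized form of (2): y = Z phi + eps with Z = I_p kron X, e.g.,
# Z = np.kron(np.eye(X_obs.shape[1]), X)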

As previously mentioned, we let the dimension p = pn of the VAR model vary with n, so that our results are relevant to high-dimensional settings. We assume that our data come from a true VAR model described as follows: for every n ≥ 1, let $\mathcal{Y}_n \equiv (X_{n,0}, \ldots, X_{n,n+d-1})$ be the set of observations for sample size n, which satisfy $X_{n,k} = \sum_{i=1}^{d} A_{i,0n} X_{n,k-i} + \varepsilon_{n,k}$ for d ≤ k ≤ n + d − 1. The errors $\{\varepsilon_{n,k}\}_{k=d}^{n+d-1}$ are i.i.d. $N_{p_n}(0, \Sigma_{\varepsilon,0n})$. Here $\{\Phi_{0n}\}_{n \geq 1}$ denotes the sequence of true coefficient matrices, given by $\Phi_{0n} \equiv [A_{1,0n}^\top \; A_{2,0n}^\top \; \cdots \; A_{d,0n}^\top]^\top$, and $\{\Sigma_{\varepsilon,0n}\}_{n \geq 1}$ denotes the sequence of true error covariance matrices. Let ℙ0 denote the probability measure underlying the true model described above.

Next, consider a Bayesian model which builds on (2) by placing priors on the parameters (Φ, Σε). In particular, let $\{\pi_n(\Phi, \Sigma_\varepsilon)\}_{n \geq 1}$ and $\{\pi_n(\Phi, \Sigma_\varepsilon \mid \mathcal{Y}_n)\}_{n \geq 1}$ denote the sequences of the corresponding (joint) prior and posterior densities. Analogously, $\{\Pi_n(\cdot)\}_{n \geq 1}$ and $\{\Pi_n(\cdot \mid \mathcal{Y}_n)\}_{n \geq 1}$ denote the corresponding sequences of (joint) prior and posterior distributions. We will also use the notation πn and Πn to denote the marginal prior and posterior densities/distributions for Φ and Σε as appropriate.

Note that our main parameter of interest is Φ, while the error covariance matrix Σε is more of an unknown nuisance parameter that we need to deal with. One would hope that as the sample size n tends to infinity, the posterior probability assigned to any ε-neighborhood of $\Phi_{0n}$ converges to 1 almost surely under ℙ0. The following definition formalizes this notion of posterior consistency.

Definition 1. The sequence of posterior distributions $\Pi_n(\cdot \mid \mathcal{Y}_n)$ is said to be consistent at $\{\Phi_{0n}\}_{n \geq 1}$ if, for every ε > 0, $\Pi_n(\|\Phi - \Phi_{0n}\| > \varepsilon \mid \mathcal{Y}_n) \to 0$ as n → ∞ a.s. ℙ0.

For ease of exposition, we will henceforth denote Φ0n as Φ0, and Σε,0n as Σε,0, and highlight their dependence on n as needed.

2.1. Stability of VAR(d) process

Since VAR models are dynamical systems, the notion of ‘stability’ plays an important role in their analysis and asymptotic properties.

Definition 2. A VAR(d) process as defined in (1) is said to be stable if the matrix-valued polynomial $\mathcal{A}(z) \equiv I_p - \sum_{i=1}^{d} A_i z^i$ satisfies $\det(\mathcal{A}(z)) \neq 0$ on the unit circle of the complex plane {z ∈ ℂ : |z| = 1}.

The autocovariance function of a p-dimensional centered covariance-stationary time series {Xt} is defined as $\Gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h})$ for all t, h, and the corresponding spectral density is given by $f_X(\theta) \equiv \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \Gamma_X(h) e^{-ih\theta}$, θ ∈ [−π, π]. For a Gaussian stable VAR(d) model the spectral density has the closed-form expression

$$f_X(\theta) = \frac{1}{2\pi} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{-1} \Sigma_\varepsilon \left[ \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{-1} \right]^{*},$$

where * denotes the Hermitian conjugate of a matrix and $i \equiv \sqrt{-1}$. The autocovariance function, which characterizes a centered Gaussian process, can be used to quantify the temporal and cross-sectional dependence for VAR(d) models. The peak of the spectral density, measured by its maximum eigenvalue $\mathcal{M}(f_X) \equiv \max_{\theta \in [-\pi,\pi]} \lambda_{\max}(f_X(\theta))$, can be used as a measure of stability of the process, while the minimum eigenvalue $\mathfrak{m}(f_X) \equiv \min_{\theta \in [-\pi,\pi]} \lambda_{\min}(f_X(\theta))$ captures cross-dependence among its components. However, as mentioned in Basu and Michailidis (2015), instead of working with $\mathcal{M}(f_X)$ and $\mathfrak{m}(f_X)$ it is often easier to work with the eigenvalues of $\mathcal{A}^{*}(z)\mathcal{A}(z)$ over the unit circle {z ∈ ℂ : |z| = 1}. Let

$$\mu_{\min}(\mathcal{A}) \equiv \min_{|z|=1} \lambda_{\min}\left( \mathcal{A}^{*}(z) \mathcal{A}(z) \right) = \min_{\theta \in [-\pi,\pi]} \lambda_{\min}\left( \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{*} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right) \right),$$
$$\mu_{\max}(\mathcal{A}) \equiv \max_{|z|=1} \lambda_{\max}\left( \mathcal{A}^{*}(z) \mathcal{A}(z) \right) = \max_{\theta \in [-\pi,\pi]} \lambda_{\max}\left( \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{*} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right) \right).$$

For a stable VAR(d) process, $0 < \mu_{\min}(\mathcal{A}) \leq \mu_{\max}(\mathcal{A}) < \infty$. Since the εt are i.i.d. $N_p(0, \Sigma_\varepsilon)$, each row of X is distributed as $N_{dp}(0, C_X)$, where the covariance matrix $C_X$ has the following structure,

$$C_X = \begin{bmatrix} \Gamma(0) & \Gamma(1) & \cdots & \Gamma(d-1) \\ \Gamma(1)^\top & \Gamma(0) & \cdots & \Gamma(d-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(d-1)^\top & \Gamma(d-2)^\top & \cdots & \Gamma(0) \end{bmatrix}_{dp \times dp} \quad (3)$$

The quantities $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$ provide useful bounds for the eigenvalues of $C_X$. As mentioned in Melnyk and Banerjee (2016), and from Proposition 2.3 and equation (2.6) of Basu and Michailidis (2015), we have the following chain of inequalities,

$$\frac{\lambda_{\min}(\Sigma_\varepsilon)}{\mu_{\max}(\mathcal{A})} \leq 2\pi \, \mathfrak{m}(f_X) \leq \lambda_{\min}(C_X) \leq \lambda_{\max}(C_X) \leq 2\pi \, \mathcal{M}(f_X) \leq \frac{\lambda_{\max}(\Sigma_\varepsilon)}{\mu_{\min}(\mathcal{A})}. \quad (4)$$

We finally note that the p-dimensional VAR(d) model in (1) can be equivalently written as a dp-dimensional VAR(1) process. Let

$$\tilde{X}_t = \begin{bmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-d+1} \end{bmatrix}_{dp \times 1}, \qquad \tilde{A}_1 = \begin{bmatrix} A_1 & A_2 & \cdots & A_{d-1} & A_d \\ I_p & O & \cdots & O & O \\ O & I_p & \cdots & O & O \\ \vdots & & \ddots & & \vdots \\ O & O & \cdots & I_p & O \end{bmatrix}_{dp \times dp} \quad \text{and} \quad \omega_t = \begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{dp \times 1}.$$

Then the new representation becomes

$$\tilde{X}_t = \tilde{A}_1 \tilde{X}_{t-1} + \omega_t, \quad t = d, \ldots, n+d-1. \quad (5)$$

It follows that $X = \left[ \tilde{X}_{n+d-2} \; \tilde{X}_{n+d-3} \; \cdots \; \tilde{X}_{d-1} \right]^\top$, i.e., the i-th row of X is the (dp × 1) vector $\tilde{X}_{n+d-i-1}$. Note that if the underlying VAR(d) process {Xt} is stable, then the process $\{\tilde{X}_t\}$, with characteristic polynomial $\tilde{\mathcal{A}}(z) \equiv I_{dp} - \tilde{A}_1 z$, is also stable. This is because $\{\tilde{X}_t\}$ is generated according to a VAR(1) process with transition matrix $\tilde{A}_1$, and $\{\tilde{X}_t\}$ is stable if and only if {Xt} is stable (Lütkepohl (2007)). Based on $\tilde{\mathcal{A}}(z)$, we define

$$\mu_{\min}(\tilde{\mathcal{A}}) \equiv \min_{\theta \in [-\pi,\pi]} \lambda_{\min}\left( \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right)^{*} \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right) \right), \qquad \mu_{\max}(\tilde{\mathcal{A}}) \equiv \max_{\theta \in [-\pi,\pi]} \lambda_{\max}\left( \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right)^{*} \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right) \right). \quad (6)$$

While $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$ are not necessarily the same as $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$, the inequalities in (4) still hold with $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$ replaced by $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$, respectively.
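The quantities introduced in this subsection are directly computable. The following sketch (Python; solve_discrete_lyapunov does exist in scipy.linalg, while all other names are illustrative) checks stability via the companion matrix of (5), approximates the extrema in (6) on a grid over [−π, π], and obtains CX in (3) as the stationary covariance of the companion process:

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def companion(A_list):
    # Stack A_1, ..., A_d into the dp x dp companion matrix A_tilde of (5).
    d, p = len(A_list), A_list[0].shape[0]
    top = np.hstack(A_list)
    if d == 1:
        return top
    bottom = np.hstack([np.eye((d - 1) * p), np.zeros(((d - 1) * p, p))])
    return np.vstack([top, bottom])

def is_stable(A_list):
    # VAR(d) stability <=> all companion eigenvalues lie inside the unit circle.
    return np.max(np.abs(np.linalg.eigvals(companion(A_list)))) < 1

def mu_min_max(A_tilde, n_grid=200):
    # Grid approximation of mu_min(A_tilde) and mu_max(A_tilde) in (6).
    dp = A_tilde.shape[0]
    mu_min, mu_max = np.inf, 0.0
    for theta in np.linspace(-np.pi, np.pi, n_grid):
        Az = np.eye(dp) - A_tilde * np.exp(-1j * theta)
        eig = np.linalg.eigvalsh(Az.conj().T @ Az)   # Hermitian, sorted ascending
        mu_min, mu_max = min(mu_min, eig[0]), max(mu_max, eig[-1])
    return mu_min, mu_max

def C_X(A_list, Sigma_eps):
    # C_X in (3) solves the Lyapunov equation C = A_tilde C A_tilde' + Var(omega_t).
    d, p = len(A_list), Sigma_eps.shape[0]
    Sigma_omega = np.zeros((d * p, d * p))
    Sigma_omega[:p, :p] = Sigma_eps
    return solve_discrete_lyapunov(companion(A_list), Sigma_omega)

The eigenvalues of C_X computed this way can be checked numerically against the bounds in (4).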

3. Bayesian Estimation and Posterior Consistency

In this section, we first discuss Bayesian estimation of VAR models with non-hierarchical and hierarchical scale-mixture matrix normal prior distributions on the parameter matrix Φ (conditioned on Σε), and subsequently establish high-dimensional posterior consistency in this setting under mild regularity assumptions. We start by introducing the necessary notation for the matrix-variate normal distribution. Let $\mathcal{M}_{a,b}$ denote the space of a × b matrices.

Definition 3. An a × b random matrix X is said to follow a matrix-variate normal distribution ($MN_{a \times b}(M, B_1, B_2)$) if its density function (on the space $\mathcal{M}_{a,b}$) is given by

$$\frac{1}{(2\pi)^{ab/2}} |B_1|^{-b/2} |B_2|^{-a/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ B_1^{-1} (X - M) B_2^{-1} (X - M)^\top \right\} \right].$$

Here $M \in \mathcal{M}_{a,b}$, and $B_1 \in \mathcal{M}_{a,a}$ and $B_2 \in \mathcal{M}_{b,b}$ are positive definite matrices corresponding to the covariances among the rows and columns of X, respectively. Note that the matrix normal distribution is related to the multivariate normal distribution in the following way: $X \sim MN_{n \times p}(M, B_1, B_2)$ if and only if $\mathrm{vec}(X) \sim N_{np}(\mathrm{vec}(M), B_2 \otimes B_1)$.

3.1. Non-Hierarchical Matrix Normal Prior

We consider a matrix normal prior for Φ conditional on Σε, and a flat (uniform) prior on Σε. In particular,

$$\Phi \mid \Sigma_\varepsilon \sim MN_{dp \times p}(O, U^{-1}, \Sigma_\varepsilon) \quad \text{and} \quad \pi(\Sigma_\varepsilon) \propto 1, \quad (7)$$

where U is a dp × dp known positive definite matrix. Note that under this matrix normal prior, $U^{-1}$ and Σε are the covariance matrices corresponding to the rows and columns of Φ, respectively. The posterior distribution of Φ (conditional on Σε) can easily be shown to be $MN_{dp \times p}(\Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon)$, where $\Phi^{PM} \equiv (X^\top X + U)^{-1} X^\top Y$ is the (conditional) posterior mean, which does not depend on Σε. Hence, the unconditional posterior mean of Φ is available in closed form and is given by $\Phi^{PM}$. It follows by standard computations using the multivariate normal density that the marginal posterior density of Σε is proportional to

$$|\Sigma_\varepsilon|^{-n/2} \, \mathrm{etr}\left( -\frac{1}{2} \Sigma_\varepsilon^{-1} \hat{\Sigma}_{res} \right),$$

where $\hat{\Sigma}_{res} = Y^\top \left( I - X (X^\top X + U)^{-1} X^\top \right) Y$. This density is proper if and only if n > 2p, in which case the marginal posterior density of Σε corresponds to the Inverse-Wishart density with scale matrix $\hat{\Sigma}_{res}$ and shape parameter n − p − 1. We summarize the above observations in the lemma below.

Lemma 3.1. Under the non-hierarchical prior in (7), the posterior density of (Φ,Σε) is proper if and only if n > 2p. In this case

$$\Phi \mid \Sigma_\varepsilon, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right), \qquad \Sigma_\varepsilon \mid \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right).$$
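Since both distributions in Lemma 3.1 are standard, exact Monte Carlo draws from the posterior are straightforward. A hedged sketch follows (scipy.stats does provide invwishart and matrix_normal; the function name and defaults below are illustrative):

import numpy as np
from scipy.stats import invwishart, matrix_normal

def posterior_draws_nonhier(Y, X, U, n_draws=1000, rng=None):
    n, p = Y.shape
    assert n > 2 * p, "posterior is proper if and only if n > 2p (Lemma 3.1)"
    K_inv = np.linalg.inv(X.T @ X + U)
    Phi_pm = K_inv @ X.T @ Y                 # posterior mean Phi^PM
    S_res = Y.T @ (Y - X @ Phi_pm)           # = Y'(I - X(X'X+U)^{-1}X')Y
    draws = []
    for _ in range(n_draws):
        # Sigma_eps | Y ~ Inverse-Wishart(scale = S_res, shape = n - p - 1)
        Sigma = invwishart.rvs(df=n - p - 1, scale=S_res, random_state=rng)
        # Phi | Sigma_eps, Y ~ MN(Phi^PM, (X'X + U)^{-1}, Sigma_eps)
        Phi = matrix_normal.rvs(mean=Phi_pm, rowcov=K_inv, colcov=Sigma,
                                random_state=rng)
        draws.append((Phi, Sigma))
    return Phi_pm, draws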

Assumptions for posterior consistency

We will establish our results under the high-dimensional setting from Section 2. Recall that Φ0 = Φ0n denotes the true underlying parameter matrix, and Σε,0 = Σε,0n denotes the true underlying error covariance matrix in this setting. The quantities $\mu_{\min}(\tilde{\mathcal{A}})$, $\mu_{\max}(\tilde{\mathcal{A}})$ and $C_X$ are as defined in (6) and (3), but with Φ0 and Σε,0 as the underlying parameter values. We assume the following:

Assumption A1. The VAR(d) model given in (1) is stable.

Assumption A2. $\frac{1 + \mu_{\max}(\tilde{\mathcal{A}})}{\mu_{\min}(\tilde{\mathcal{A}})}$ is $O\left( \sqrt{\frac{n}{p^5}} \right)$ as n → ∞.

Assumption A3. $\inf_{n \geq 1} \lambda_{\min}(C_X) > 0$ and $\lambda_{\max}(\Sigma_{\varepsilon,0n}) = O(1)$.

Assumption A4. The true parameter matrix Φ0 of the VAR model (2) and the hyperparameter U in (7) are such that $\|\Phi_0^\top U \Phi_0\| = o(n)$ and $\|U \Phi_0\| = o(n)$.

Assumption A5. p = o(n).

When d = 1, we deal with a VAR(1) model, $C_X$ becomes $\Gamma_X(0)$, and $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$ coincide with $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$, respectively. Assumption A1 is a standard assumption which ensures that the underlying VAR process is well behaved. Assumption A2 plays an important role in deriving high-dimensional concentration bounds for $X^\top X / n$ and $X^\top E / n$ around $C_X$ and O, respectively (see Propositions B.1 and B.3 in the supplementary material). Assumption A3 is needed to ensure that $\lambda_{\min}(X^\top X / n)$ is bounded away from 0 with high probability. Further, if each column of Φ is independently and identically distributed according to a standard normal prior, that is, $U = I_{dp}$, Assumption A4 reduces to $\|\Phi_0^\top \Phi_0\| = o(n)$ and $\|\Phi_0\| = o(n)$.

We now state the main theoretical result of posterior consistency with a non-hierarchical matrix normal prior distribution on Φ. The proof is given in Appendix C.1 of the supplementary material.

Theorem 3.2 (Posterior consistency for non-hierarchical prior). For any centered VAR(d) model (2) with the non-hierarchical prior (7) on Φ satisfying Assumptions A1–A5, posterior consistency for the parameter matrix holds, i.e., for every fixed ε > 0,

$$E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] \to 0 \quad \text{as } n \to \infty,$$

where Φ0 is the true parameter matrix under the model (2).

A natural question is whether the assumption p = o(n) can be relaxed while retaining posterior consistency. In the lemma below, we consider a situation in which Assumptions A1–A4 are satisfied and p is of the same order as n, and prove that the resulting posterior is not consistent. The proof is given in Appendix C.2 of the supplementary material.

Lemma 3.3. Consider a (sequence of) VAR(1) models with $p_n = \gamma n$, $\Phi_{0n} = \alpha I_{p_n}$ and $\Sigma_{\varepsilon,0n} = I_{p_n}$, where $\gamma \in (0, \frac{1}{2})$ and α ∈ (0, 1) do not depend on n. If we use the non-hierarchical prior (7) on Φ with $\|U\| = o(n)$, then there exists ε > 0 such that

$$\liminf_{n \to \infty} E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] > 0.$$

Remark. Note that the condition $\|U\| = o(n)$ assumed in Lemma 3.3 corresponds to Assumption A4 in the setting of the lemma. The reason for making this assumption is that we want to show that violating Assumption A5 (p = o(n)) can lead to posterior inconsistency even when all of Assumptions A1–A4 hold. If we instead also violate Assumption A4, by letting $\|U\|$ grow at the same rate as n or faster than n, then the posterior inconsistency proof becomes comparatively easier. We provide the corresponding proofs in Supplemental Section C.3 and Supplemental Section C.4, respectively.

3.2. Hierarchical Normal-mixture Prior

Next, we study the posterior consistency of the parameter matrix in model (2) in which Φ has the following hierarchical prior

$$\Phi \mid \Sigma_\varepsilon, U \sim MN_{dp \times p}(O, U^{-1}, \Sigma_\varepsilon), \quad \pi(\Sigma_\varepsilon) \propto 1, \quad \text{and} \quad U \sim \pi_{scl}(\cdot), \quad (8)$$

where U is a dp × dp random matrix having probability density $\pi_{scl}(\cdot)$ over the space $\mathcal{M}_{dp}^{+}$ of dp × dp positive definite matrices. As shown below, the group lasso and multivariate t-distribution priors on Φ can be obtained from (8) using appropriate choices of $\pi_{scl}(\cdot)$. The lemma below shows that the posterior is proper if n > (d + 1)p, and provides the form of various conditional and marginal posterior densities. The proof is given in Appendix C.5 of the supplementary material.

Lemma 3.4. Under the hierarchical normal-mixture prior in (8), the posterior density of (Φ,Σε,U) is proper if n > (d + 1)p. In this case

$$\Phi \mid \Sigma_\varepsilon, U, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right)$$
$$\Sigma_\varepsilon \mid U, \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right)$$
$$\pi(U \mid \mathcal{Y}_n) \propto \frac{|U|^{dp/2}}{|X^\top X + U|^{dp/2}} \, |\hat{\Sigma}_{res}|^{-(n-p-1)/2} \, \pi_{scl}(U).$$

The Bayesian group lasso prior was proposed by Kyung et al. (2010) in the context of linear regression. We adapt it to the VAR setting as follows. Suppose the rows of Φ are divided into G groups $\Phi_{[1]}, \ldots, \Phi_{[G]}$, where each $\Phi_{[g]}$ is an $m_g \times p$ sub-matrix of Φ (hence $\sum_{g=1}^{G} m_g = dp$), and $X_g$ is the $n \times m_g$ submatrix of X corresponding to the group $\Phi_{[g]}$. The frequentist group lasso estimator (conditional on Σε) is obtained by solving

$$\min_{\Phi_{[1]}, \ldots, \Phi_{[G]}} \left\| \left( Y - \sum_{g=1}^{G} X_g \Phi_{[g]} \right) \Sigma_\varepsilon^{-1/2} \right\|_F^2 + \sum_{g=1}^{G} \lambda_g \left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F,$$

where λg is a tuning parameter corresponding to the group g. The group lasso estimator (conditional on Σε) can also be expressed as the maximum a posteriori probability (MAP) estimate under model (2) with the prior

$$\pi(\Phi \mid \Sigma_\varepsilon) \propto \exp\left( - \sum_{g=1}^{G} \lambda_g \left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F \right),$$

which is a multivariate generalization of the double exponential prior and can also be expressed as a scale mixture of normals with Gamma hyperpriors (Park and Casella (2008), Kyung et al. (2010)) leading to the group lasso hierarchy,

$$\Phi_{[g]} \mid \tau_g, \Sigma_\varepsilon \;\stackrel{ind}{\sim}\; MN_{m_g \times p}(O, \tau_g I_{m_g}, \Sigma_\varepsilon) \quad \text{and} \quad \tau_g \;\stackrel{ind}{\sim}\; \text{Gamma}\left( \frac{m_g + 1}{2}, \frac{\lambda_g^2}{2} \right), \quad g = 1, \ldots, G.$$

Here Gamma(α, λ) denotes the Gamma distribution with shape parameter α and rate parameter λ. This can alternatively be presented as $\Phi \mid \tau, \Sigma_\varepsilon \sim MN_{dp \times p}(O, \mathrm{BDiag}(\tau_1, \ldots, \tau_G), \Sigma_\varepsilon)$ and $\tau_g \stackrel{ind}{\sim} \text{Gamma}\left( \frac{m_g + 1}{2}, \frac{\lambda_g^2}{2} \right)$, where BDiag(τ1, ···, τG) denotes a block-diagonal matrix whose g-th block is $\tau_g I_{m_g}$. Note that under the above hierarchical prior, conditionally on (τ1, ···, τG) and Σε, the columns of Φ are independent. If mg = 1 for all g = 1, ···, dp, we get the ordinary Bayesian lasso.

Under the specification given in (8), suppose we assume $U^{-1} = \mathrm{Diag}(\tau_1, \ldots, \tau_{dp})$ and $1/\tau_i \stackrel{ind}{\sim} \mathrm{Gamma}(\alpha_i, \lambda_i/2)$; then it can be shown that the prior density of Φ given only Σε is proportional to

$$\prod_{i=1}^{dp} \left( \left\| \Phi_{i \cdot} \Sigma_\varepsilon^{-1/2} \right\|_2^2 + \lambda_i \right)^{-\left( \alpha_i + \frac{1}{2} \right)},$$

which corresponds to the multivariate t-distribution.

Estimation

For the hierarchical model given in (8), the posterior density of Φ is intractable and quantities such as the posterior mean are not available in closed form. Hence, we develop a Markov Chain Monte Carlo algorithm to generate values from the posterior density. It follows by straightforward calculations that

$$\Phi \mid \Sigma_\varepsilon, U, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right)$$
$$\Sigma_\varepsilon \mid U, \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right)$$
$$\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n) \propto |U|^{dp/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ \Phi \Sigma_\varepsilon^{-1} \Phi^\top U \right\} \right] \pi_{scl}(U),$$

where $\hat{\Sigma}_{res} = Y^\top \left( I - X (X^\top X + U)^{-1} X^\top \right) Y$. While the conditional posterior distributions of Φ given (Σε, U) and of Σε given U are easy to simulate from (being matrix normal and Inverse-Wishart, respectively), the tractability of the conditional posterior density of U given (Φ, Σε) depends on the form of the prior $\pi_{scl}(U)$. We show below that for three standard choices of $\pi_{scl}(U)$, corresponding to the Wishart prior, the group lasso prior and the multivariate t-prior, $\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n)$ becomes a tractable density that is easy to simulate from.

Case 1: Wishart Prior

For a dp × dp positive definite matrix V, let $U \sim \text{Wishart}_{dp}(V, \, \mathrm{df} = \nu + dp)$, that is, $\pi(U) \propto |U|^{\frac{\nu - 1}{2}} \exp\left[ -\frac{1}{2} \mathrm{tr}\{ V^{-1} U \} \right]$. In this case,

$$\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n) \propto |U|^{\frac{\nu + dp - 1}{2}} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ \left( \Phi \Sigma_\varepsilon^{-1} \Phi^\top + V^{-1} \right) U \right\} \right],$$

which is $\text{Wishart}_{dp}\left( \left( \Phi \Sigma_\varepsilon^{-1} \Phi^\top + V^{-1} \right)^{-1}, \, \mathrm{df} = \nu + 2dp \right)$. Note that as long as ν > −(dp + 1), the posterior of U given (Φ, Σε, 𝒴n) is proper.
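Putting the conditionals together gives a simple Gibbs sampler; the following sketch implements one sweep for the Wishart prior of this case (scipy.stats does provide wishart, invwishart and matrix_normal; everything else is illustrative):

import numpy as np
from scipy.stats import wishart, invwishart, matrix_normal

def gibbs_sweep_wishart(Y, X, U, V, nu, rng=None):
    # One sweep through the conditionals of the hierarchical model (8).
    n, p = Y.shape
    dp = X.shape[1]
    K_inv = np.linalg.inv(X.T @ X + U)
    Phi_pm = K_inv @ X.T @ Y
    S_res = Y.T @ (Y - X @ Phi_pm)
    # Sigma_eps | U, Y ~ Inverse-Wishart(S_res, n - p - 1)
    Sigma = invwishart.rvs(df=n - p - 1, scale=S_res, random_state=rng)
    # Phi | Sigma_eps, U, Y ~ MN(Phi^PM, (X'X + U)^{-1}, Sigma_eps)
    Phi = matrix_normal.rvs(mean=Phi_pm, rowcov=K_inv, colcov=Sigma,
                            random_state=rng)
    # U | Phi, Sigma_eps, Y ~ Wishart((Phi Sigma^{-1} Phi' + V^{-1})^{-1},
    #                                 df = nu + 2dp); proper for nu > -(dp + 1).
    scale_U = np.linalg.inv(Phi @ np.linalg.solve(Sigma, Phi.T)
                            + np.linalg.inv(V))
    U_new = wishart.rvs(df=nu + 2 * dp, scale=scale_U, random_state=rng)
    return Phi, Sigma, U_new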

Case 2: Bayesian Group Lasso

In this case, as already discussed in Section 3.2, $U^{-1}$ has the block-diagonal form BDiag(τ1, ···, τG), and the τg’s are a priori independently distributed as Gamma with shape (mg + 1)/2 and rate $\lambda_g^2 / 2$. Hence, the conditional distribution of τg has the following form,

$$\frac{1}{\tau_g} \,\Big|\, \Phi, \Sigma_\varepsilon, \mathcal{Y}_n \;\stackrel{ind}{\sim}\; \text{Inverse-Gaussian}\left( \mu_g = \frac{\lambda_g}{\left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F}, \; \lambda_g^2 \right).$$
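NumPy exposes the inverse-Gaussian distribution as Generator.wald, so this scale update can be drawn directly. A sketch (names are illustrative; Sigma_inv_sqrt denotes a symmetric square root of Σε⁻¹):

import numpy as np

def update_group_tau(Phi_g, Sigma_inv_sqrt, lam_g, rng):
    # Draw 1/tau_g ~ Inverse-Gaussian(lam_g / ||Phi_[g] Sigma^{-1/2}||_F, lam_g^2)
    # and return tau_g; Phi_g is the m_g x p block of Phi for group g.
    norm_g = np.linalg.norm(Phi_g @ Sigma_inv_sqrt, ord="fro")
    inv_tau = rng.wald(mean=lam_g / norm_g, scale=lam_g ** 2)
    return 1.0 / inv_tau

# usage: rng = np.random.default_rng(0); tau_g = update_group_tau(Phi_g, S, 1.0, rng)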

Case 3: Multivariate t-distribution

By taking $U^{-1} = \mathrm{Diag}(\tau_1, \ldots, \tau_{dp})$ and $1/\tau_i$ to be independently distributed as Gamma with shape αi and rate λi/2, we obtain the multivariate t-distribution as the prior on Φ. In this case, the conditional distribution of τi has the following form,

$$\frac{1}{\tau_i} \,\Big|\, \Phi, \Sigma_\varepsilon, \mathcal{Y}_n \;\stackrel{ind}{\sim}\; \text{Gamma}\left( \alpha_i + \frac{dp}{2}, \; \frac{\left\| \Phi_{i \cdot} \Sigma_\varepsilon^{-1/2} \right\|_2^2 + \lambda_i}{2} \right).$$

Assumptions for posterior consistency

We now introduce regularity conditions to establish posterior consistency under the hierarchical prior model.

Assumption B1. The VAR(d) model given in (1) is stable.

Assumption B2. $\frac{1 + \mu_{\max}(\tilde{\mathcal{A}})}{\mu_{\min}(\tilde{\mathcal{A}})}$ is $O\left( \sqrt{\frac{n}{p^5}} \right)$ as n → ∞.

Assumption B3. $0 < \inf_{n \geq 1} \lambda_{\min}(C_X) \leq \sup_{n \geq 1} \lambda_{\max}(C_X) < \infty$ and $\lambda_{\max}(\Sigma_{\varepsilon,0n}) = O(1)$.

Assumption B4. The singular values of the true parameter matrices $\{\Phi_{0n}\}_{n \geq 1}$ are uniformly bounded; equivalently, the eigenvalues of $\{\Phi_{0n}^\top \Phi_{0n}\}_{n \geq 1}$ are uniformly bounded.

Assumption B5. $p = O\left( \frac{n}{\log n} \right)$.

Assumption B6. There exists a fixed α > 0 (not depending on n) such that $\liminf_{n \to \infty} \pi_{scl,n}(\lambda_{\max}(U) > \alpha) > 0$, and for every β > 0 we have $\lim_{n \to \infty} \pi_{scl,n}(\lambda_{\max}(U) > \beta n) = 0$.

We now discuss these assumptions and compare them to the assumptions for the non-hierarchical prior model.

  • Assumptions B1 and B2 are identical to A1 and A2, while B3 is fairly similar to A3.

  • One key difference is the permissible scaling of p as a function of the sample size n: Assumption B5 is slightly more stringent than the permissible scaling for the non-hierarchical matrix normal prior in Assumption A5.

  • Note that Assumption B6 is a mild one. For example, a sufficient condition for it to be satisfied is that $\limsup_{n \to \infty} \max_{1 \leq i \leq p} E_{\pi_{scl,n}}[U_{ii}^{\delta}] < \infty$ for some δ > 0 and $\liminf_{n \to \infty} \pi_{scl,n}(U_{11} > \alpha) > 0$. It can be easily checked that this condition, and hence Assumption B6, is satisfied in the case of the Wishart, Inverse-Wishart, Bayesian group lasso, multivariate t-distribution, Horseshoe (Carvalho et al. (2010)), Strawderman-Berger and generalized double Pareto (Armagan et al. (2013)) priors, as long as the prior parameters do not depend on n.

  • In the non-hierarchical prior case, assumptions regarding Φ0 and the (non-random) U are provided jointly in Assumption A4, through the conditions $\|\Phi_0^\top U \Phi_0\| = o(n)$ and $\|U \Phi_0\| = o(n)$. For the hierarchical prior case, for clarity of exposition, we provide the assumptions regarding Φ0 in Assumption B4, and those regarding the distribution of the (random) U in Assumption B6. Combining these two assumptions, it can be shown that, a priori, $\|\Phi_0^\top U \Phi_0\|/n$ and $\|U \Phi_0\|/n$ converge to zero in $\Pi_{scl,n}$-probability as n → ∞. In that sense, the assumptions on (Φ0, U) in the hierarchical model are stronger than in the non-hierarchical model case.

With these assumptions in hand, we state our key consistency result, whose proof is delegated to Appendix C.6 of the supplementary material.

Theorem 3.5 (Posterior consistency for hierarchical prior). For any centered VAR(d) model with the hierarchical prior (8) on the transition matrix satisfying Assumptions B1–B6, posterior consistency for the transition matrix holds, i.e., for every fixed ε > 0,

$$E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] \to 0 \quad \text{as } n \to \infty,$$

where Φ0 is the true parameter matrix under the model (2).

4. Performance Evaluation

To illustrate the performance of our Bayesian modeling framework for VAR processes, we design three sets of numerical experiments involving: (a) Small VAR (p = 10), (b) Medium VAR (p = 100) and (c) Large VAR (p = 500) models, each with two lag values: (i) d = 1 and (ii) d = 2.

In each setting, we use transition matrices Ai with 10–30% non-zero entries, whose locations are selected at random and whose values are generated from U(−2, 0) ∪ U(0, 2); the matrices are rescaled to ensure that the process is stable with SNR = 2. For small VAR models, we generate n = 40, 80, 120 time points; for medium VAR models, n = 400, 800, 1200; while for large VAR models we use n = 2000, 4000, 6000. The hyper-parameters for the prior distributions are selected using the Deviance Information Criterion (DIC). Note that $DIC = 2\bar{D} - D(\bar{\Phi}, \bar{\Sigma}_\varepsilon)$, where $D(\Phi, \Sigma_\varepsilon) \equiv -2 \log L(\mathcal{Y}_n \mid \Phi, \Sigma_\varepsilon) = n \log |\Sigma_\varepsilon| + \mathrm{tr}\left\{ \Sigma_\varepsilon^{-1} \left( \Phi^\top X^\top X \Phi - 2 \Phi^\top X^\top Y + Y^\top Y \right) \right\}$, $\bar{D}$ is the posterior expectation of $D(\Phi, \Sigma_\varepsilon)$, and $\bar{\Phi}$ and $\bar{\Sigma}_\varepsilon$ are the posterior expectations of Φ and Σε, respectively.
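A sketch of this DIC computation given a set of posterior draws {(Φ(s), Σε(s))} (Python/NumPy; names illustrative):

import numpy as np

def deviance(Y, X, Phi, Sigma):
    # D(Phi, Sigma) = n log|Sigma| + tr{Sigma^{-1}(Phi'X'X Phi - 2 Phi'X'Y + Y'Y)}
    n = Y.shape[0]
    R = Phi.T @ X.T @ X @ Phi - 2 * Phi.T @ (X.T @ Y) + Y.T @ Y
    return n * np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, R))

def dic(Y, X, draws):
    # DIC = 2 * D_bar - D(Phi_bar, Sigma_bar)
    D_bar = np.mean([deviance(Y, X, P, S) for P, S in draws])
    Phi_bar = np.mean([P for P, _ in draws], axis=0)
    Sigma_bar = np.mean([S for _, S in draws], axis=0)
    return 2 * D_bar - deviance(Y, X, Phi_bar, Sigma_bar)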

4.1. Non-hierarchical prior

We generate two different error processes using $\Sigma_\varepsilon = \sigma^2 I_p$ and $\Sigma_\varepsilon = \sigma^2 \left( (\rho^{|i-j|}) \right)_{ij}$ (Toeplitz form). For each of the small, medium and large VAR models, U is taken to be the diagonal matrix $c I_{dp}$, where c is chosen according to the minimum DIC value over the interval [0, 10]. In Table 1, for both the posterior mean (PM) and the least squares estimator (LS), we report the relative estimation error $\|\hat{\Phi} - \Phi_0\|_2 / \|\Phi_0\|_2$, with the standard error of $\|\hat{\Phi}\|_2$ in parentheses, averaged over 10 × p replicates for the small and medium VARs and 100 replicates for the large VAR (p = 500). Since the true parameter matrix Φ0 is sparse, we identify entries whose 95% posterior credible intervals contain zero, and set them to zero in both parameter matrix estimates (PM and LS).

Table 1:

Relative error in VAR (d = 1, 2) with Σε = σ²Ip, where % denotes the percentage of non-zero entries in Φ0.

                          Lag d = 1                               Lag d = 2
                n1           n2           n3            n1           n2           n3
  %         LS     PM     LS     PM     LS     PM     LS     PM     LS     PM     LS     PM
Small VAR 10 0.93 0.83 0.82 0.70 0.70 0.55 1.26 1.09 1.11 0.96 1.00 0.76
(0.21) (0.12) (0.12) (0.08) (0.07) (0.05) (0.24) (0.17) (0.17) (0.07) (0.10) (0.08)
20 1.06 1.00 1.00 0.87 0.83 0.71 1.39 1.22 1.28 1.07 1.14 0.94
(0.34) (0.24) (0.26) (0.16) (0.16) (0.11) (0.35) (0.30) (0.27) (0.21) (0.18) (0.15)
30 1.23 1.15 1.12 0.97 0.99 0.81 1.57 1.40 1.45 1.28 1.30 1.15
(0.45) (0.37) (0.37) (0.27) (0.27) (0.19) (0.47) (0.40) (0.37) (0.32) (0.34) (0.22)

Medium VAR 10 1.81 1.64 1.69 1.53 1.56 1.40 2.45 2.12 2.31 2.01 2.24 1.85
(0.40) (0.21) (0.31) (0.12) (0.23) (0.07) (0.46) (0.32) (0.37) (0.24) (0.31) (0.14)
20 1.95 1.81 1.85 1.70 1.74 1.56 2.60 2.26 2.50 2.15 2.43 2.01
(0.50) (0.32) (0.42) (0.25) (0.35) (0.17) (0.57) (0.42) (0.51) (0.33) (0.40) (0.29)
30 2.10 1.94 1.98 1.84 1.87 1.74 2.74 2.46 2.62 2.31 2.52 2.20
(0.64) (0.43) (0.52) (0.38) (0.45) (0.27) (0.69) (0.54) (0.60) (0.46) (0.56) (0.39)

Large VAR 10 2.70 2.47 2.55 2.34 2.46 2.22 3.66 3.18 3.50 3.04 3.38 2.92
(0.58) (0.30) (0.50) (0.23) (0.40) (0.16) (0.67) (0.45) (0.57) (0.35) (0.49) (0.28)
20 2.84 2.62 2.75 2.52 2.60 2.35 3.81 3.33 3.69 3.23 3.54 3.05
(0.70) (0.42) (0.60) (0.34) (0.54) (0.25) (0.78) (0.56) (0.70) (0.50) (0.61) (0.40)
30 3.03 2.78 2.90 2.65 2.71 2.49 3.96 3.49 3.89 3.32 3.78 3.25
(0.82) (0.54) (0.70) (0.44) (0.63) (0.35) (0.88) (0.69) (0.84) (0.63) (0.76) (0.51)

First, we assume the true error covariance matrix Σε is diagonal, i.e., σ²Ip. Here % denotes the percentage of non-zero entries in Φ0 and d represents the lag length of the underlying VAR process. Recall that the sample sizes used for small VAR models are n1 = 40, n2 = 80, n3 = 120, for medium VAR ones n1 = 400, n2 = 800, n3 = 1200, and for large VAR ones n1 = 2000, n2 = 4000, n3 = 6000.

It can be seen that the relative estimation error decreases with an increase in the number of time points (sample size) n for both lags d = 1, 2; further, its values are significantly larger in medium and large size VAR models than in small VAR ones. Moreover, the estimation error for lag 1 is uniformly smaller than that for lag 2, and the same holds true for their respective standard errors. Regarding the percentage of non-zero entries in the true transition matrices, the results show that for fixed n and p, the more true non-zero entries in A1, A2, the less accurate the posterior mean and the LS estimator are, while their variability, as indicated by their standard errors, follows the same pattern. However, the posterior mean clearly outperforms the LS estimates, especially in settings with large p. This is to a large extent because the true transition matrices A1, A2 exhibit weaker signal as p or the number of non-zero entries increases (to ensure stability of the underlying VAR model), and because with our choice U = cIdp the posterior mean is the ridge regression estimator, which applies shrinkage to the coefficients.

Next, we introduce correlation in the error components by specifying Σε to be of Toeplitz form. As discussed in Section 3.2.1 of Lütkepohl (2007), the generalized least squares estimator in this multivariate regression setup is the same as the ordinary one, i.e., $(X^\top X)^{-1} X^\top Y$, a result due to Zellner (1962). In Table 2, we compare the performance of the least squares and posterior (ridge) estimates with noise covariance Σε = Toeplitz (ρ = 0.8).

Table 2:

Relative error in VAR (d = 1, 2) with Σε = Toeplitz (ρ = 0.80), where % denotes the percentage of non-zero entries in Φ0.

                          Lag d = 1                               Lag d = 2
                n1           n2           n3            n1           n2           n3
  %         LS     PM     LS     PM     LS     PM     LS     PM     LS     PM     LS     PM
Small VAR 10 1.03 0.95 0.95 0.82 0.82 0.64 1.45 1.24 1.31 1.08 1.19 0.98
(0.27) (0.18) (0.20) (0.09) (0.10) (0.05) (0.37) (0.27) (0.30) (0.19) (0.18) (0.09)
20 1.20 1.10 1.06 0.94 0.98 0.83 1.60 1.41 1.49 1.30 1.35 1.17
(0.40) (0.30) (0.29) (0.21) (0.25) (0.17) (0.48) (0.37) (0.40) (0.29) (0.31) (0.19)
30 1.36 1.27 1.22 1.11 1.12 0.99 1.76 1.61 1.66 1.45 1.50 1.26
(0.50) (0.41) (0.44) (0.32) (0.32) (0.25) (0.60) (0.51) (0.50) (0.40) (0.46) (0.30)

Medium VAR 10 2.02 1.87 1.94 1.72 1.75 1.61 2.80 2.46 2.70 2.35 2.53 2.21
(0.52) (0.34) (0.41) (0.27) (0.37) (0.16) (0.69) (0.50) (0.62) (0.43) (0.54) (0.33)
20 2.20 2.01 2.10 1.87 1.92 1.80 2.95 2.62 2.86 2.52 2.74 2.34
(0.63) (0.45) (0.55) (0.38) (0.45) (0.33) (0.81) (0.62) (0.71) (0.52) (0.65) (0.42)
30 2.36 2.20 2.19 2.06 2.06 1.91 3.12 2.78 2.99 2.67 2.91 2.50
(0.73) (0.56) (0.67) (0.48) (0.56) (0.41) (0.94) (0.72) (0.83) (0.67) (0.77) (0.57)

Large VAR 10 3.03 2.80 2.88 2.63 2.83 2.55 4.19 3.67 4.09 3.55 3.93 3.44
(0.75) (0.49) (0.67) (0.41) (0.60) (0.33) (1.03) (0.73) (0.97) (0.62) (0.87) (0.58)
20 3.17 2.94 3.07 2.79 2.89 2.69 4.34 3.85 4.25 3.70 4.11 3.594
(0.86) (0.60) (0.79) (0.51) (0.69) (0.45) (1.14) (0.84) (1.06) (0.76) (1.00) (0.66)
30 3.33 3.10 3.27 3.00 3.08 2.80 4.53 4.00 4.37 3.89 4.26 3.70
(0.98) (0.73) (0.90) (0.63) (0.82) (0.56) (1.26) (0.95) (1.17) (0.87) (1.11) (0.79)

In this setting, the relative estimation error of both the least squares and ridge estimators increases compared to the uncorrelated error structure in Table 1; in particular, the performance of the LS estimator deteriorates even further. However, with an increase in sample size, the accuracy of both estimates improves significantly. Further, as gleaned from the entries of the table corresponding to lag 2, the relative error exhibits a further increase, a pattern consistent with the results in Table 1. This is expected, since we not only have a Toeplitz-type error covariance structure, but the total number of unknown parameters has also increased by p².

Finally, we study support recovery under both error processes. In Table 3, we provide the percentage of true positives recovered using 95% posterior credible intervals, based on the same sample sizes n1, n2, and n3 as used previously.

Table 3:

Percentage of true positive non-zero entries recovered in Φ.

                      Lag d = 1                                  Lag d = 2
          Σε = σ²Ip          Σε = Toep            Σε = σ²Ip          Σε = Toep
  %     n1    n2    n3     n1    n2    n3       n1    n2    n3     n1    n2    n3
Small VAR 10 85.0 85.0 82.0 85.0 85.0 84.0 84.7 84.5 81.6 84.6 84.6 83.6
20 80.0 80.0 81.0 80.0 78.0 80.0 79.7 79.6 80.5 79.5 77.6 79.8
30 77.0 75.0 77.0 77.0 73.0 73.0 76.6 74.6 76.5 76.7 72.6 72.6

Medium VAR 10 89.3 90.0 89.3 89.3 89.8 88.3 88.8 89.5 89.0 88.9 89.4 87.9
20 85.3 85.5 85.5 85.3 84.5 85.0 84.9 85.3 85.3 84.8 84.1 84.8
30 81.0 80.8 81.3 81.0 79.8 78.8 80.8 80.5 81.0 80.6 79.3 78.3

Large VAR 10 92.0 92.1 92.2 92.0 92.1 91.8 91.8 91.6 91.8 91.7 91.7 91.4
20 88.1 88.1 88.1 88.1 87.9 87.9 87.7 87.8 87.7 87.7 87.5 87.5
30 83.1 83.3 83.4 83.1 82.8 82.9 82.7 82.8 82.9 82.7 82.4 82.6

The above table indicates that support recovery is not sensitive to the sample size or to the lag; however, for all VAR models and error covariance settings it deteriorates as the density of non-zero entries increases, while it exhibits a small improvement with model dimension.

4.2. Hierarchical Priors

As discussed in Section 3.2, three types of hierarchical priors (Wishart, group lasso and multivariate t) are studied. As in the non-hierarchical prior case, the performance of the LS estimator is unsatisfactory in this setup as well; thus, we only compare the relative accuracy of the three prior choices in this setting. We select $V = c I_{dp}$ and df = ν = dp for the Wishart prior, λi = λ for all 1 ≤ i ≤ dp for the group lasso prior, and αi = 1, λi = λ for all 1 ≤ i ≤ dp for the multivariate t prior. The hyper-parameters c and λ are chosen using DIC. In Table 4, we report the relative estimation error $\|\hat{\Phi} - \Phi_0\|_2 / \|\Phi_0\|_2$ (d = 1, 2) of the three hierarchical estimators when the error process covariance is set to σ²Ip; d represents the lag length of the underlying VAR model.

Table 4:

Relative error in VAR (d = 1, 2) with Σε = σ²Ip, where % denotes the percentage of non-zero entries in Φ0.

                       Wishart              Group lasso          Multivariate t
Lag d = 1    %     n1     n2     n3      n1     n2     n3       n1     n2     n3
Small VAR 10 0.84 0.75 0.65 0.84 0.75 0.64 0.85 0.74 0.58
20 0.99 0.91 0.73 1.00 0.89 0.79 1.00 0.90 0.76
30 1.14 1.09 0.92 1.16 1.02 0.94 1.15 1.04 0.93

Medium VAR 10 1.63 1.53 1.42 1.62 1.54 1.43 1.63 1.53 1.39
20 1.78 1.71 1.53 1.76 1.69 1.56 1.79 1.66 1.55
30 1.93 1.83 1.70 1.95 1.86 1.72 1.95 1.82 1.69

Large VAR 10 2.39 2.26 2.16 2.41 2.31 2.22 2.42 2.34 2.22
20 2.59 2.42 2.31 2.58 2.47 2.39 2.59 2.46 2.32
30 2.77 2.61 2.52 2.70 2.60 2.53 2.74 2.63 2.54

Lag d = 2    %     n1     n2     n3      n1     n2     n3       n1     n2     n3

Small VAR 10 0.88 0.74 0.60 0.87 0.80 0.71 0.89 0.78 0.63
20 1.05 0.91 0.83 1.04 0.94 0.85 1.08 0.94 0.86
30 1.25 1.13 0.97 1.23 1.10 0.99 1.27 1.13 1.03

Medium VAR 10 1.71 1.58 1.46 1.71 1.58 1.53 1.70 1.60 1.50
20 1.85 1.72 1.60 1.85 1.76 1.71 1.89 1.82 1.66
30 2.05 1.96 1.77 2.01 1.95 1.82 2.04 1.97 1.83

Large VAR 10 2.52 2.37 2.26 2.51 2.41 2.29 2.53 2.44 2.25
20 2.69 2.58 2.47 2.67 2.59 2.47 2.69 2.59 2.51
30 2.88 2.75 2.60 2.86 2.75 2.62 2.87 2.74 2.64

In Table 5, we present relative estimation errors with the same three hierarchical priors when Σε = Toeplitz (ρ = 0.80).

All of the hierarchical estimates outperform the ridge estimator (Tables 1 and 2) across all settings considered. This is again expected, since the Ai’s have a sparse structure by construction and the group lasso prior favors sparsity. However, the above results are not conclusive as to whether the group lasso estimate exhibits better accuracy than the Wishart or multivariate t estimates. To gain some insight into this issue, we use a VAR(1) model with p = 9 and a transition matrix A1 in which the columns form three groups, each containing three columns; the sparsity increases as we move from group 1 to group 3. Finally, we rescale the coefficient matrix so that the corresponding VAR process is stable with SNR = 2. The structure of the resulting A1 transition matrix is depicted next, where * indicates a non-zero entry.

$$A_1 = \begin{pmatrix}
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & 0 & 0 & 0 & 0 & 0 \\
* & * & * & 0 & * & 0 & 0 & 0 & 0 \\
* & * & * & 0 & 0 & * & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & * & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & * & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & *
\end{pmatrix}_{9 \times 9}$$

We generate n = 100 observations from the above VAR(1) model using two noise covariance settings, (1) $\Sigma_\varepsilon = \sigma^2 I_p$ and (2) Σε = Toeplitz (ρ = 0.80), and report in Table 6 the relative estimation error $\|\hat{A}_1 - A_1\|_2 / \|A_1\|_2$ of five different estimates: least squares (LS), the posterior mean under the non-hierarchical normal prior (Non H), and the hierarchical Wishart (W), group lasso (GL) and multivariate t (Mult. t) estimates.

Table 6:

Relative error

Estimator      Σε = σ²Ip      Σε = Toeplitz

LS 0.604 1.384
Non H 0.583 0.614
W 0.462 0.544
Mult. t 0.430 0.414
GL 0.321 0.362

The results show that the group lasso estimator exhibits the best performance, followed by the multivariate t one, whereas the LS estimator is the least accurate. The result is consistent with the structure of the underlying transition matrix, since the group lasso prior can capitalize on it.

In Appendix A.1 of the supplementary document, we illustrate the posterior estimate of a VAR(1) transition matrix A1, together with 95% credible intervals and estimated posterior densities for several entries of A1. We also examine the performance of $\hat{\Sigma}_\varepsilon$ when the true error covariance is Toeplitz (ρ = 0.8), and report the relative error of $\hat{\Phi}$ under all four priors for a noise covariance matrix generated from a Wishart distribution with ν = p degrees of freedom and scale matrix Ip.

5. Application to Macroeconomic Data

We use the proposed Bayesian framework to understand lead-lag relationships in the FRED-MD dataset, containing p = 137 key macroeconomic variables for the period January 1973 to December 2014. VAR modeling for this task was strongly advocated by Sims (1980) and has since become a standard tool for it, although usually the focus is on small models involving a few macroeconomic indicators (e.g., the consumer price index, an employment index and the federal funds rate). However, recent work has advocated larger VAR models (see Bernanke et al. (2005); Bańbura et al. (2010) and references therein), in order to improve forecastability and to avoid relationships that are hard to interpret, or even contradict economic theory, arising from not including an adequate number of variables to properly model the economic phenomenon under consideration. Before centering the data and estimating Σε as discussed earlier in Section 3.2, we ensure stationarity by transforming the variables according to the recommendations in Stock and Watson (2005). The specific transformations used for each time series are given in the supplementary documents. Analogously to Bańbura et al. (2010), we consider the following three different size VAR models:

  • SMALL: This model contains p = 4 key variables - CPI, number of employees non-farm (PAYEMS), Federal Funds Rate (FEDFUNDS) and Unemployment Rate (UNRATE).

  • MEDIUM: In addition to the four variables in the SMALL VAR model, this one contains an additional 16 variables (total p = 20) listed next - Reserves Of Depository Institutions (NONBORRES), Total Reserves of Depository Institutions (TOTRESNS), M2 (M2REAL), Real Personal Income (RPI), Real personal consumption expenditures (DPCERA), IP Index (INDPRO), Capacity Utilization: Manufacturing (CAPT), Housing Starts: Total New Privately Owned (HOUST), Avg Hourly Earnings : Goods-Producing (CES), M1 (M1), S & P’s Common Stock Price Index: Composite (S.P.), 10-Year Treasury Rate (GS10), Personal Cons. Expend.: Chain Index (PCEPI), Foreign Exchange Rate (EXS), Crude Oil, spliced WTI and Cushing (OIL) and Retail and Food Services Sales (RETAILx).

  • LARGE: This specification has all p = 134 macroeconomic indicators (3 were excluded from further analysis due to the presence of a large number of missing values).

Based on initial exploratory work, we choose lag d = 6 according to the Bayesian information criterion (BIC), and the following distributions were used for prior specification to obtain the estimated parameter matrix Φ: (i) non-hierarchical normal (Non H), (ii) hierarchical Wishart (W), (iii) group lasso (GL) and (iv) multivariate t prior (Mult. t). Since the number of parameters increases linearly with the lag length d, we suggest using BIC over the Akaike information criterion (AIC). For the non-hierarchical prior, we use U = BDiag(λ1, ⋯, λd), while for the hierarchical Wishart, group lasso and multivariate t priors on Φ, we use $V = c_1 I_{dp}$ and αi = α for all 1 ≤ i ≤ dp. The values of c1 and λ are chosen using the Deviance Information Criterion (DIC), which, as explained previously, is a hierarchical Bayesian modeling generalization of BIC. The respective posterior means were compared to the least squares (LS) estimate. For each estimate $\hat{\Phi}$, the residual norm ratio $\|Y - X\hat{\Phi}\|_F / \|Y\|_F$, which measures in-sample fit, is reported in Table 7.

Table 7:

In-sample prediction error

SMALL (p = 4) MEDIUM (p = 20) LARGE (p = 134)

LS 0.847 0.863 0.673
Non-H 0.852 0.864 0.674
W 0.845 0.863 0.675
Mult. t 0.877 0.885 0.685
GL 0.847 0.873 0.674

Note that since the LS estimator is obtained by minimizing $\|Y - X\Phi\|_F$, it always results in the minimum relative residual norm, as observed in the above table; i.e., the LS estimator is always the best one in terms of in-sample prediction accuracy.

Next, we investigate the 4 different Bayesian estimates based on their out-of-sample prediction performance with respect to the benchmark prior, analogously to the evaluation strategies discussed in Bańbura et al. (2010) and Stock and Watson (2005). We consider the following two benchmark priors:

  1. Prior information is imposed exactly by setting U⁻¹ = O (the zero matrix), which corresponds to λ = 0 in the Minnesota prior. Bańbura et al. (2010) use this specification as the benchmark prior, in which case the corresponding benchmark model becomes a random walk with drift, i.e., $X_t = \alpha + X_{t-1} + \varepsilon_t$.

  2. A uniform prior on Φ, obtained by setting U = O, which corresponds to λ = ∞ in the Minnesota prior. In this case the posterior mean coincides with the least squares (LS) estimate.

Let $\hat{X}_{t+h}$ be the h-step-ahead predicted value of $X_{t+h}$ based on our posited Bayesian model, using the data up to time t. The corresponding forecast under the benchmark prior is $\hat{X}^{O}_{t+h}$. The mean squared forecast error relative to the benchmark (RMSFE) is defined as $$\frac{\sum_{t=T_0}^{T_1} \left\| X_{t+h} - \hat{X}_{t+h} \right\|^2}{\sum_{t=T_0}^{T_1} \left\| X_{t+h} - \hat{X}^{O}_{t+h} \right\|^2}.$$ Table 8 gives the RMSFE results for three different choices of forecasting horizon, h = 1, 6, 12, for the two benchmark priors considered, over the period T0 = January 1978 to T1 = December 2006. An RMSFE value smaller than 1 implies that the VAR model with the corresponding prior outperforms that with the naive/benchmark prior.
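The RMSFE defined above reduces to a ratio of accumulated squared forecast errors; a minimal sketch (Python/NumPy; array names are illustrative placeholders):

import numpy as np

def rmsfe(actual, forecast, benchmark):
    # actual, forecast, benchmark: (T1 - T0 + 1) x p arrays holding
    # X_{t+h}, X^hat_{t+h} and the benchmark forecast X^O_{t+h}.
    num = np.sum((actual - forecast) ** 2)
    den = np.sum((actual - benchmark) ** 2)
    return num / den     # values < 1 favor the model over the benchmark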

Table 8:

Out-of-sample relative prediction error

                           Uniform prior                        Random Walk
                  SMALL      MEDIUM      LARGE         SMALL      MEDIUM      LARGE
                  p = 4      p = 20      p = 134       p = 4      p = 20      p = 134
Non Hierarchical h = 1 0.88 0.72 0.62 0.49 0.40 0.33
h = 6 0.90 0.78 0.62 0.43 0.42 0.37
h = 12 0.95 0.84 0.74 0.43 0.41 0.37

Wishart h = 1 0.88 0.81 0.68 0.49 0.45 0.32
h = 6 0.86 0.86 0.68 0.42 0.40 0.36
h = 12 0.93 0.92 0.71 0.43 0.41 0.38

Group Lasso h = 1 0.90 0.86 0.60 0.51 0.45 0.30
h = 6 0.88 0.88 0.63 0.46 0.42 0.34
h = 12 0.93 0.92 0.67 0.47 0.45 0.37

Mult. t h = 1 0.91 0.89 0.71 0.50 0.46 0.31
h = 6 0.87 0.93 0.77 0.49 0.44 0.34
h = 12 0.92 0.94 0.81 0.45 0.41 0.38

It can easily be seen that all four Bayesian methods not only outperform the LS estimate (uniform prior on Φ), but also exhibit substantially smaller relative error compared to the random walk with drift process (point-mass prior on Φ). Further, increasing the number of predictor variables improves forecasting performance, a point argued forcefully in favor of large VAR models by Sims (1980). On the other hand, forecasting performance deteriorates for larger values of h, an expected result. Nevertheless, even for h = 12 (one year ahead) the results are still very satisfactory. Further, for the SMALL and MEDIUM VAR models, the non-hierarchical normal and hierarchical Wishart priors result in better predictions, whereas for the LARGE VAR model the group lasso prior outperforms the other forecasts.

Next, we examine more closely the out-of-sample prediction performance for three macroeconomic variables, CPI, PAYEMS and FEDFUNDS, under the hierarchical Wishart prior; the results are reported in Table 9.

Note that Bańbura et al. (2010) only consider a random walk process as the naive prior. From Table 9 it can be seen that, although for CPI and PAYEMS the LS estimate performs better than the Bayesian estimates in the SMALL and MEDIUM VARs, overall the Wishart prior has better forecasting accuracy than both of the benchmark priors. As previously observed, adding information (i.e., including more variables) improves the accuracy of forecasts for all three variables. The fourth column (LARGE BVAR) provides the numbers reported in Table III of Bańbura et al. (2010), where a Bayesian VAR model on the same 134 variables, with d = 13 lags, was estimated using a normal-inverted Wishart distribution that leads to a ridge-type posterior mean estimate for the parameters in Φ, based on data covering the period 1971–2003. Although the results are not directly comparable to those obtained by our methodology, they nevertheless provide a certain degree of calibration. It can be seen that our model is more parsimonious, using only d = 6 lags, and provides better forecasting performance for all three variables at all forecasting horizons.

Next, we examine in more detail the estimated transition matrix A1 for the SMALL VAR model (p = 4) under the non-hierarchical normal and group lasso priors, shown below; estimated posterior densities of four selected entries are given in Figure 1. It is worth noting that the non-hierarchical prior centers around a different value and exhibits a less smooth behavior than the hierarchical one. This smoothness should be expected given the specification of the latter.

With rows and columns ordered as CPI, PAYEMS, FEDFUND, UNRATE:

$$\hat{A}_1^{NonH} = \begin{pmatrix} 0.133 & 0.001 & 0.001 & 0.001 \\ 0.016 & 0.311 & 0.001 & 0.002 \\ 1.000 & 10.200 & 0.498 & 0.185 \\ 1.217 & 23.300 & 0.035 & 0.105 \end{pmatrix} \qquad \hat{A}_1^{GL} = \begin{pmatrix} 0.167 & 0.005 & 0.001 & 0.001 \\ 0.021 & 0.560 & 0.001 & 0.001 \\ 1.614 & 18.6079 & 0.486 & 0.134 \\ 1.609 & 41.818 & 0.015 & 0.107 \end{pmatrix}$$

Further, we present the 95% posterior credible intervals (PCI) for the entries of A1 under the above two priors (rows and columns ordered as above):

Non-hierarchical:
$$\begin{pmatrix} (0.19, 0.07) & (0.08, 0.07) & (+0, +0) & (0, +0) \\ (0.05, 0.01) & (0.27, 0.35) & (+0, +0) & (0, 0) \\ (7, 5.25) & (1.86, 19) & (0.44, 0.55) & (0.3, 0.06) \\ (1.98, 4.45) & (27.8, 18.7) & (0.06, 0.01) & (0.07, 0.06) \end{pmatrix}$$

Group lasso:
$$\begin{pmatrix} (0.22, 0.11) & (0.10, 0.1) & (+0, +0) & (0, +0) \\ (0.05, 0.01) & (0.51, 0.61) & (+0, +0) & (0, 0) \\ (8.58, 5.7) & (6.5, 31.01) & (0.44, 0.55) & (0.26, 0) \\ (1.75, 4.86) & (47.72, 36.37) & (0.04, 0.01) & (0.07, 0.06) \end{pmatrix}$$

Next, in Figure 2 we depict the estimated networks for the MEDIUM VAR based on the first lag transition matrix produced by: (a) least squares and (b) a non-hierarchical normal prior, where for ease of representation the nodes of the network are abbreviated; the full list of the variable names is given in Table A1 of Appendix A in the supplementary material.

Figure 2: Network representation of the transition matrix (A1).

As expected, for most variables their previous lag value influences the current value. Further, the LS-based network exhibits a high degree of connectivity, whereas the one based on the non-hierarchical prior exhibits a sparser structure. For the latter, it is of interest that the employment index (PAYEMS), the 10-year Treasury rate (GS10) and CPI are influenced by many other variables. On the other hand, the Federal Funds Rate influences the broad stock market (SP500), as expected based on finance theory, and GS10. In general, the sparser result provided by the non-hierarchical prior, in addition to better forecasting, also aids interpretation vis-a-vis the LS estimate.

6. Discussion

In this paper, we investigate posterior consistency in Bayesian VAR(d) models with both non-hierarchical and hierarchical matrix normal prior distributions on the transition matrices, under a Gaussian assumption for the temporal evolution of the time series under consideration and in the presence of a general covariance matrix that captures additional contemporaneous dependence between them. We establish posterior consistency for both of these priors under high-dimensional scaling. To obtain the desired results, some novel concentration inequalities are provided that are of independent interest. The methodology is illustrated on synthetic and real macroeconomic data. The proposed priors provide better forecasts than the LS estimates for periods up to one year ahead, while leading to sparser and potentially easier to interpret relationships, especially for large-scale models.

Supplementary Material

Supp1
Supp2

Figure 1: Posterior densities of entries (1,1), (2,4), (3,2) and (4,2) of A1 under the four different priors.

Table 5:

Relative error in VAR (d = 1,2) with Σε = Toeplitz (ρ = 0.80), where % denotes percentage of non-zero entries in Φ0.

                       Wishart              Group lasso          Multivariate t
Lag d = 1    %     n1     n2     n3      n1     n2     n3       n1     n2     n3
Small VAR 10 0.90 0.76 0.65 0.90 0.83 0.69 0.89 0.79 0.70
20 1.06 0.94 0.77 1.03 0.92 0.82 1.07 0.96 0.81
30 1.23 1.13 0.99 1.21 1.09 0.99 1.23 1.14 0.97

Medium VAR 10 1.75 1.61 1.50 1.73 1.63 1.53 1.74 1.61 1.56
20 1.92 1.80 1.67 1.92 1.78 1.73 1.91 1.79 1.72
30 2.08 1.98 1.82 2.02 1.98 1.88 2.07 1.94 1.84

Large VAR 10 2.58 2.49 2.33 2.60 2.50 2.38 2.60 2.49 2.33
20 2.76 2.65 2.51 2.75 2.65 2.54 2.77 2.62 2.57
30 2.93 2.81 2.71 2.90 2.81 2.74 2.93 2.82 2.74

Lag d = 2    %     n1     n2     n3      n1     n2     n3       n1     n2     n3

Small VAR 10 1.16 1.02 0.96 1.15 1.06 0.98 1.18 1.06 0.89
20 1.31 1.24 1.12 1.31 1.22 1.11 1.34 1.24 1.10
30 1.49 1.41 1.23 1.49 1.41 1.27 1.49 1.40 1.26

Medium VAR 10 2.25 2.13 2.04 2.25 2.17 2.05 2.27 2.16 2.01
20 2.43 2.30 2.16 2.41 2.32 2.24 2.45 2.32 2.26
30 2.59 2.47 2.33 2.59 2.49 2.38 2.59 2.48 2.36

Large VAR 10 3.37 3.24 3.10 3.36 3.25 3.18 3.36 3.24 3.11
20 3.53 3.40 3.29 3.51 3.43 3.31 3.54 3.46 3.36
30 3.72 3.62 3.41 3.65 3.57 3.49 3.74 3.59 3.48

Table 9:

Out-of-sample relative prediction error for CPI, PAYEMS and FEDFUNDS for the three VAR model specifications considered. The column LARGE BVAR corresponds to the entries of Table III in Bańbura et al. (2010) for a Bayesian VAR model with a normal-inverted Wishart prior distribution and d = 13 lags, based on the same set of variables, but covering the period 1971–2003.

                       SMALL      MEDIUM      LARGE       LARGE BVAR
Uniform prior          p = 4      p = 20      p = 134     p = 134
h = 1 CPI 1.05 0.91 0.44 -
PAYEMS 1.21 1.04 0.91 -
FFUND 0.78 0.75 0.68 -

h = 6 CPI 1.03 0.97 0.38 -
PAYEMS 1.08 1.06 0.48 -
FFUND 0.92 0.82 0.68 -

h = 12 CPI 0.98 0.96 0.42 -
PAYEMS 0.93 0.91 0.73 -
FFUND 0.92 0.88 0.72 -

Random Walk

h = 1 CPI 0.43 0.41 0.34 0.50
PAYEMS 0.45 0.43 0.39 0.46
FFUND 0.50 0.45 0.36 0.75

h = 6 CPI 0.38 0.34 0.28 0.40
PAYEMS 0.53 0.48 0.39 0.50
FFUND 0.41 0.37 0.36 1.29

h = 12 CPI 0.51 0.45 0.42 0.44
PAYEMS 0.51 0.88 0.73 0.78
FFUND 0.33 0.31 0.28 1.93

Acknowledgments

* The authors gratefully acknowledge support from NSF grants DMS-1511945 (KK), IIS-1632730 and CNS-1422078 (GM), and NIH grant R01 5R01GM11402902.

Contributor Information

Satyajit Ghosh, Email: satyajitghosh90@ufl.edu.

Kshitij Khare, Email: kdkhare@ufl.edu.

George Michailidis, Email: gmichail@ufl.edu.

References

  1. Armagan A, Clyde M, and Dunson DB (2011). Generalized beta mixtures of Gaussians. In Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, and Weinberger KQ (Eds.), Advances in Neural Information Processing Systems 24, pp. 523–531. Curran Associates, Inc.
  2. Armagan A, Dunson DB, and Lee J (2013). Generalized double Pareto shrinkage. Statistica Sinica 23, 119–143.
  3. Armagan A, Dunson DB, Lee J, Bajwa WU, and Strawn N (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100(4), 1011–1018.
  4. Bańbura M, Giannone D, and Reichlin L (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics 25(1), 71–92.
  5. Basu S and Michailidis G (2015). Regularized estimation in sparse high-dimensional time series models. Ann. Statist. 43(4), 1535–1567.
  6. Basu S, Shojaie A, and Michailidis G (2015). Network Granger causality with inherent grouping structure. Journal of Machine Learning Research 16, 417–453.
  7. Bernanke BS, Boivin J, and Eliasz P (2005). Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics 120(1), 387–422.
  8. Billio M, Getmansky M, Lo AW, and Pelizzon L (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics 104(3), 535–559.
  9. Bontemps D (2011). Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors. Ann. Statist. 39(5), 2557–2584.
  10. Carvalho CM, Polson NG, and Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.
  11. Doan T, Litterman R, and Sims C (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews 3(1), 1–100.
  12. Ghoshal S (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5(2), 315–331.
  13. Griffin JE and Brown PJ (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 5(1), 171–188.
  14. Korobilis D (2013). VAR forecasting using Bayesian variable selection. Journal of Applied Econometrics 28, 204–230.
  15. Kyung M, Gill J, Ghosh M, and Casella G (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369–411.
  16. Lee J and Oh H-S (2013). Bayesian regression based on principal components for high-dimensional data. Journal of Multivariate Analysis 117, 175–192.
  17. Lin J and Michailidis G (2017). Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models. Journal of Machine Learning Research 18(117), 1–49.
  18. Litterman RB et al. (1979). Techniques of forecasting using vector autoregressions. Technical report.
  19. Lütkepohl H (2007). New Introduction to Multiple Time Series Analysis. Springer.
  20. Melnyk I and Banerjee A (2016). Estimating structured vector autoregressive models. Proceedings of the 33rd International Conference on Machine Learning, 830–839.
  21. Mol CD, Giannone D, and Reichlin L (2008). Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics 146(2), 318–328.
  22. Park T and Casella G (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
  23. Raskutti G and Yuan M (2015). Convex regularization for high-dimensional tensor regression. arXiv e-prints.
  24. Schweinberger M, Babkin S, and Ensor KB (2017). High-dimensional multivariate time series with additional structure. Journal of Computational and Graphical Statistics 26(3), 610–622.
  25. Seth AK, Barrett AB, and Barnett L (2015). Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience 35(8), 3293–3297.
  26. Sims CA (1980). Macroeconomics and reality. Econometrica 48(1), 1–48.
  27. Sparks DK, Khare K, and Ghosh M (2015). Necessary and sufficient conditions for high-dimensional posterior consistency under g-priors. Bayesian Anal. 10(3), 627–664.
  28. Stock JH and Watson MW (2005). An empirical comparison of methods for forecasting using many predictors. Manuscript, Princeton University.
  29. Zellner A (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57(298), 348–368.
