Abstract
Vector autoregressive (VAR) models aim to capture linear temporal interdependencies amongst multiple time series. They have been widely used in macroeconomics and financial econometrics and more recently have found novel applications in functional genomics and neuroscience. These applications have also accentuated the need to investigate the behavior of the VAR model in a high-dimensional regime, which provides novel insights into the role of temporal dependence for regularized estimates of the model's parameters. However, hardly anything is known regarding properties of the posterior distribution for Bayesian VAR models in such regimes. In this work, we consider a VAR model with two prior choices for the autoregressive coefficient matrix: a non-hierarchical matrix-normal prior and a hierarchical prior, which corresponds to an arbitrary scale mixture of normals. We establish posterior consistency for both these priors under standard regularity assumptions, when the dimension p of the VAR model grows with the sample size n (but still remains smaller than n). A special case corresponds to a shrinkage prior that introduces (group) sparsity in the columns of the model coefficient matrices. The performance of the model estimates is illustrated on synthetic and real macroeconomic data sets.
Keywords: Vector Autoregressive Models, Posterior consistency, Shrinkage prior, Bayesian Lasso
1. Introduction
There has been recent interest in modeling high-dimensional time series data sets. In macroeconomics, Mol et al. (2008) advocated the need to include a large number of variables in econometric models to improve forecastability, while Billio et al. (2012) examined stock returns of many financial institutions to assess systemic risk of the financial system. Similar modeling challenges arise in functional genomics for the reconstruction of regulatory networks as discussed in Basu et al. (2015), while in neuroscience one is interested in understanding functional connectivity between brain regions (Seth et al. (2015)).
A popular and informative model is the vector autoregression (VAR), which captures linear temporal dependencies between time series. The VAR model and its properties have been thoroughly explored in low-dimensional settings, both from a frequentist (for a comprehensive overview see Lütkepohl (2007)) and a Bayesian perspective (Bańbura et al. (2010)).
More recently, Basu and Michailidis (2015) provided an in-depth analysis of the model for Gaussian data in a high-dimensional setting under sparsity assumptions, while Melnyk and Banerjee (2016) extended the results to other regularizers (e.g., group lasso, sparse group lasso). The results of Basu and Michailidis (2015); Melnyk and Banerjee (2016) and related follow-up work (Raskutti and Yuan (2015); Schweinberger et al. (2017); Lin and Michailidis (2017)) indicate that the resulting estimation error rates are those obtained for independent and identically distributed data times a factor that captures the temporal dependence in the data.
On the Bayesian front, there has been primarily methodological/computational work for low-dimensional VAR models. The so-called Minnesota prior (Litterman et al. (1979); Doan et al. (1984)) has been a staple of applied econometric work involving VAR models. This is a normal prior distribution on the elements of the transition matrix that puts stronger weights on the “own” lags of each time series, since they are considered more informative for forecasting purposes than lags of “other” time series. For large VAR models, Bańbura et al. (2010) advocate a normal-inverted Wishart distribution that leads to a posterior mean that can be interpreted as a ridge shrinkage estimator, suitable for such models. A first attempt at Bayesian estimation of VAR models combined with variable selection is presented in Korobilis (2013), where an indicator variable is specified for each parameter in the transition matrix, indicating whether the corresponding cross-autocorrelation coefficient is included or set to zero. A prior is specified for the indicator variables that in principle can also be combined with the Minnesota prior.
On the other hand, Bayesian investigations into high-dimensional asymptotics of statistical models that incorporate sparsity with temporally dependent data are, to the best of our knowledge, generally unavailable. Hence, the main objective of this work is to study posterior (estimation) consistency for a VAR model, which asserts that the posterior concentrates around the “true” parameter value (in an appropriate norm) as the sample size increases. There is a rich literature on high-dimensional posterior estimation consistency for linear regression models for independent and identically distributed data. Ghoshal (1999) established posterior consistency and asymptotic normality with a general prior on the p-vector of regression coefficients (with appropriate positivity and Lipschitz assumptions) when p³ log p/n → 0 and p⁴ log p/n → 0, respectively. Bontemps (2011) extended the work of Ghoshal (1999) by permitting the model to be misspecified and the number of predictor variables to grow proportionally to the sample size. Armagan et al. (2013) focus on shrinkage priors, which are appropriate scale mixtures of normal priors and induce weak sparsity in the vector of regression coefficients (see Carvalho et al. (2010); Griffin and Brown (2010); Armagan et al. (2011, 2013)). They establish posterior consistency under a simple sufficient condition on prior concentration when p = o(n). Lee and Oh (2013) establish posterior consistency in a high-dimensional Bayesian PCA regression setup with p > n under appropriate assumptions on the rank of the design matrix. Posterior estimation consistency in linear regression models with g-priors has also been addressed in Sparks et al. (2015).
A crucial difference between the linear regression models considered in the above work and VAR models (expressed as a linear model) is that the design matrix in the latter case is random, exhibits dependencies both between its rows and across its columns, and is also dependent on the error term in the model (see Section 2). This leads to a significantly more involved and challenging theoretical analysis, which we successfully resolve. In this paper, we investigate high-dimensional posterior consistency for Bayesian VAR models in two natural and relevant settings: (a) with a non-hierarchical matrix normal prior on the dp × p autoregressive parameter matrix, and (b) with a hierarchical prior which corresponds to a general scale mixture of normals. In particular, this includes spherically symmetric priors such as the multivariate t, as well as standard shrinkage priors which induce (group) sparsity in the columns of the coefficient matrices, such as the group structure in Basu et al. (2015). Further, we employ a flat (uniform) prior distribution for the error covariance matrix. Note that the joint maximum likelihood estimation problem for a sparse VAR model with a sparse error covariance matrix is investigated in Lin and Michailidis (2017). The posterior consistency results are established under mild regularity assumptions on the underlying spectral density and with p = o(n/log n). The key to handling the dependencies, within the design matrix and also between the design matrix and the error term, is a pair of high-dimensional concentration inequalities established in the supplementary material (Propositions B1 and B3). Note that we are considering the “large p large n” setting with p = o(n). However, we make no assumption reducing the effective dimension of the true parameter matrix; we only assume that its matrix norm is of the order p in the non-hierarchical prior setting and bounded by a constant in the hierarchical prior setting. The “large p small n” situation, where p is allowed to grow at a much faster rate than n, is also of interest, but assumptions such as sparsity/restricted-eigenvalue type conditions are required, which in turn reduce the effective dimension of the true parameters. General posterior consistency results for VAR models in the large p small n setting are also not available to the best of our knowledge, and constitute a topic for future research.
The remainder of the paper is organized as follows. In Section 2, we introduce the VAR model and necessary notions of posterior consistency. We consider the non-hierarchical matrix normal prior on the coefficient matrix in Section 3.1 and establish posterior consistency under suitable regularity assumptions. In Section 3.2, we prove posterior consistency considering a hierarchical prior corresponding to a scale-mixture of matrix normals. In Section 4 and Section 5 the methodology/results of this paper are illustrated on simulated and real data sets, respectively. Finally, we conclude with a discussion in Section 6.
1.1. Notation
Throughout this paper, ℤ, ℝ and ℂ denote the sets of integers, real numbers and complex numbers, respectively. We denote the cardinality of a set J by |J|. For a vector v ∈ ℝᵖ, ‖v‖ denotes its ℓ2-norm. For a matrix A, ‖A‖ denotes its spectral norm, i.e., ‖A‖ = σmax(A), the largest singular value of A. For a symmetric or Hermitian matrix A, we denote its maximum and minimum eigenvalues by λmax(A) and λmin(A), respectively. The vector ei denotes the i-th unit vector in ℝᵖ. Bold uppercase letters are used only to denote matrices, and the vectorized forms of such matrices are represented by the corresponding lowercase letters; for example, if Φ is a p × p matrix then φ = vec(Φ). Also, O represents a zero matrix of appropriate dimension, and in general vectors are denoted by italicized bold lowercase letters.
2. Model Formulation
For a p-dimensional stationary time series {Xt}, a vector autoregressive model of lag d (VAR(d)) is given by

Xt = c + A1Xt−1 + A2Xt−2 + ··· + AdXt−d + εt, t ∈ ℤ. (1)
The temporal dependence structure of the VAR model is characterized by the p × p transition matrices A1, A2, …, Ad, and c is a p × 1 location vector which we choose to be 0. In the Gaussian VAR, the errors εt are i.i.d. N(0, Σε), where Σε is a p × p unknown error covariance matrix. The model in (1) can be rewritten in the Yule-Walker representation (Lütkepohl (2007)) as

Xt − μ = A1(Xt−1 − μ) + A2(Xt−2 − μ) + ··· + Ad(Xt−d − μ) + εt,
where μ = (I − A1 − A2 − ··· − Ad)⁻¹c is known as the process mean. Usually μ is not known in advance; in that case, it may be estimated by the vector of sample means X̄ = (T + 1)⁻¹ Σt=0,…,T Xt. An alternative estimator is μ̂ = (I − Â1 − ··· − Âd)⁻¹ĉ, in which ĉ and the Âi's are the least squares estimators. Henceforth we assume without loss of generality that μ = 0. Based on the data {X0, ···, XT}, we define the n × p response matrix Y, whose rows are X′d, X′d+1, …, X′T, and the n × dp design matrix X, whose corresponding rows are (X′t−1, X′t−2, …, X′t−d) for t = d, …, T.
We can now rewrite the above model in a linear regression setup as

Y = XΦ + E, (2)

where Φ = (A1, A2, …, Ad)′ is the dp × p matrix of autoregressive parameters and E is the n × p error matrix whose rows are ε′d, ε′d+1, …, ε′T.
In this formulation, the number of samples is n = T − d + 1 and the number of unknown parameters is q = dp². Vectorizing (column-wise) each matrix, we get

y = Zφ + ε,

where y := vec(Y), Z := (Ip ⊗ X), φ := vec(Φ) and ε := vec(E). In this paper, we consider a high-dimensional setting where the dimension p of the VAR model increases with the sample size n. However, we assume that the lag d does not vary with n. This basic formulation as a regression model lends itself easily to a Bayesian analysis in which priors are placed on the unknown parameter matrices Φ and Σε.
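To make the construction above concrete, the following sketch (in Python/NumPy; the sizes, seed and matrices are illustrative choices, not values from the paper) simulates a small stable VAR(2), assembles the response matrix Y, the design matrix X and the vectorized regression form.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, T = 4, 2, 200                      # illustrative dimensions (not from the paper)

# Two transition matrices, scaled down so that the process is stable.
A = [0.3 * rng.standard_normal((p, p)) for _ in range(d)]

# Simulate X_0, ..., X_T from the centered VAR(d) in (1) with Sigma_eps = I_p.
Xs = np.zeros((T + 1, p))
for t in range(d, T + 1):
    Xs[t] = sum(A[j] @ Xs[t - 1 - j] for j in range(d)) + rng.standard_normal(p)

# Regression form (2): n = T - d + 1 rows, dp predictors per row.
n = T - d + 1
Y = Xs[d:]                                                # rows X_d', ..., X_T'
X = np.hstack([Xs[d - 1 - j : T - j] for j in range(d)])  # row for time t: (X_{t-1}', ..., X_{t-d}')
assert Y.shape == (n, p) and X.shape == (n, d * p)

# Vectorized form: vec(Y) = (I_p kron X) vec(Phi) + vec(E), with q = d*p^2 unknowns.
Z = np.kron(np.eye(p), X)
```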
As previously mentioned, we let the dimension p = pn of the VAR model vary with n, so that our results are relevant to high-dimensional settings. We assume that our data come from a true VAR model described as follows: for every n ≥ 1, let {Xk : 0 ≤ k ≤ n + d − 1} be the set of observations for sample size n, which satisfy Xk = A01,nXk−1 + A02,nXk−2 + ··· + A0d,nXk−d + εk for d ≤ k ≤ n + d − 1. The errors εk are i.i.d. N(0, Σε,0n). Here {Φ0n}n≥1 denotes the sequence of the true coefficient matrices, given by Φ0n = (A01,n, A02,n, …, A0d,n)′, and {Σε,0n}n≥1 denotes the sequence of the true error covariance matrices. Let ℙ0 denote the probability measure underlying the true model described above.
Next, consider a Bayesian model which builds on (2) by placing priors on the parameters (Φ, Σε). In particular, let {πn(·)}n≥1 and {πn(· | Y)}n≥1 denote the sequences of the corresponding (joint) prior and posterior densities. Analogously, {Πn(·)}n≥1 and {Πn(· | Y)}n≥1 denote the corresponding sequences of (joint) prior and posterior distributions. We will also use the notation πn and Πn to denote the marginal prior and posterior densities/distributions for Φ and Σε, as appropriate.
Note that our main parameter of interest is Φ, while the error covariance matrix Σε is more of an unknown nuisance parameter that we need to deal with. One would hope that as the sample size n tends to infinity, the posterior probability assigned to any ε-neighborhood of Φ0n converges to 1 almost surely under ℙ0. The following notion of posterior consistency formalizes this.
Definition 1. The sequence of posterior distributions {Πn(· | Y)}n≥1 is said to be consistent at {Φ0n}n≥1 if, for every ε > 0, Πn(‖Φ − Φ0n‖ < ε | Y) → 1 as n → ∞, a.s. ℙ0.
For ease of exposition, we will henceforth denote Φ0n as Φ0, and Σε,0n as Σε,0, and highlight their dependence on n as needed.
2.1. Stability of VAR(d) process
Since VAR models are dynamical systems, the notion of ‘stability’ plays an important role in their analysis and asymptotic properties.
Definition 2. A VAR(d) process as defined in (1) is said to be stable if the matrix-valued polynomial 𝒜(z) := Ip − A1z − A2z² − ··· − Adz^d satisfies det(𝒜(z)) ≠ 0 on the unit circle of the complex plane {z ∈ ℂ: |z| = 1}.
The autocovariance function of a p-dimensional centered covariance-stationary time series {Xt} is defined as ΓX(h) := Cov(Xt, Xt+h), h ∈ ℤ, and the corresponding spectral density is given by fX(θ) := (2π)⁻¹ Σh∈ℤ ΓX(h) exp(−ihθ), θ ∈ [−π, π]. For a Gaussian stable VAR(d) model the spectral density has a closed form expression,

fX(θ) = (2π)⁻¹ 𝒜(exp(−iθ))⁻¹ Σε (𝒜(exp(−iθ))⁻¹)*,

where * denotes the Hermitian conjugate of a matrix and 𝒜(z) is as in Definition 2. The autocovariance function, which characterizes a centered Gaussian process, can be used to quantify the temporal and cross-sectional dependence of VAR(d) models. The peak of the spectral density, measured by its maximum eigenvalue M(fX) := ess supθ λmax(fX(θ)), can be used as a measure of stability of the process, while the minimum eigenvalue m(fX) := ess infθ λmin(fX(θ)) captures cross-dependence among its components. However, as mentioned in Basu and Michailidis (2015), instead of working with M(fX) and m(fX) it is often easier to work with the eigenvalues of 𝒜(z) over the unit circle {z ∈ ℂ : |z| = 1}. Let

μmax(𝒜) := max|z|=1 λmax(𝒜*(z)𝒜(z)), μmin(𝒜) := min|z|=1 λmin(𝒜*(z)𝒜(z)).
For a stable VAR(d) process, μmin(𝒜) > 0. Since the εt are i.i.d. N(0, Σε), each row of X is distributed as N(0, CX), where the dp × dp covariance matrix CX has the following block-Toeplitz structure:

CX = (ΓX(i − j))1≤i,j≤d, with (i, j)-th p × p block equal to ΓX(i − j). (3)
The quantities μmax(𝒜) and μmin(𝒜) provide useful bounds for the eigenvalues of CX. As mentioned in Melnyk and Banerjee (2016), and from Proposition 2.3 and equation (2.6) of Basu and Michailidis (2015), we have the following chain of inequalities:

λmin(Σε)/μmax(𝒜) ≤ 2π m(fX) ≤ λmin(CX) ≤ λmax(CX) ≤ 2π M(fX) ≤ λmax(Σε)/μmin(𝒜). (4)
We finally note that the p-dimensional VAR(d) model in (2) can be equivalently written as a dp-dimensional VAR(1) process. Let

X̃t = (X′t, X′t−1, …, X′t−d+1)′, ε̃t = (ε′t, 0, …, 0)′,

and let Ã denote the dp × dp companion matrix whose first block row is (A1, A2, …, Ad), with Ip in the (j + 1, j)-th block position for j = 1, …, d − 1 and O elsewhere. Then the new representation becomes

X̃t = ÃX̃t−1 + ε̃t. (5)
It follows that the i-th row of X is the dp × 1 vector X̃d+i−2 (written as a row), i.e., X̃′d+i−2 = (X′d+i−2, …, X′i−1). Note that if the underlying VAR(d) process {Xt} is stable then the process {X̃t}, with characteristic polynomial 𝒜̃(z) := Idp − Ãz, is also stable. This is because {X̃t} can be viewed as generated according to a VAR(1) process with transition matrix Ã, and {X̃t} is stable if and only if {Xt} is stable (Lütkepohl (2007)). Based on 𝒜̃ we define

μmax(𝒜̃) := max|z|=1 λmax(𝒜̃*(z)𝒜̃(z)), μmin(𝒜̃) := min|z|=1 λmin(𝒜̃*(z)𝒜̃(z)). (6)

While μmax(𝒜̃) and μmin(𝒜̃) are not necessarily the same as μmax(𝒜) and μmin(𝒜), the inequalities in (4) still hold with μmax(𝒜) and μmin(𝒜) replaced by μmax(𝒜̃) and μmin(𝒜̃), respectively.
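The quantities in (6) are straightforward to approximate numerically. The sketch below (a numerical illustration of the definitions reconstructed above; grid size and example matrix are arbitrary) forms the companion matrix Ã, scans the unit circle, and checks stability.

```python
import numpy as np

def mu_extremes(A_list, n_grid=400):
    """Approximate mu_max / mu_min in (6) for the VAR(1) polynomial
    A~(z) = I - A~ z by scanning z over a grid on the unit circle."""
    p, d = A_list[0].shape[0], len(A_list)
    A_comp = np.zeros((d * p, d * p))          # companion (VAR(1)) matrix A~
    A_comp[:p] = np.hstack(A_list)
    if d > 1:
        A_comp[p:, :-p] = np.eye((d - 1) * p)
    mu_hi, mu_lo = -np.inf, np.inf
    for theta in np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False):
        Az = np.eye(d * p) - A_comp * np.exp(-1j * theta)
        ev = np.linalg.eigvalsh(Az.conj().T @ Az)   # eigenvalues of A~*(z) A~(z)
        mu_hi, mu_lo = max(mu_hi, ev[-1]), min(mu_lo, ev[0])
    return mu_hi, mu_lo

# det(I - A~ z) = 0 iff z = 1/lambda for an eigenvalue lambda of A~, so the
# stability condition of Definition 2 fails iff A~ has an eigenvalue of modulus 1.
A1 = 0.5 * np.eye(3)
print(mu_extremes([A1]))
stable = not np.any(np.isclose(np.abs(np.linalg.eigvals(A1)), 1.0))
```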
3. Bayesian Estimation and Posterior Consistency
In this section, we first discuss Bayesian estimation of VAR models with non-hierarchical and hierarchical scale mixture matrix normal prior distributions on the parameter matrix Φ (conditioned on Σε) and subsequently establish high-dimensional posterior consistency in this setting under mild regularity assumptions. We start by introducing the necessary notation for the matrix-variate normal distribution. Let Ma,b denote the space of a × b matrices.
Definition 3. An a × b random matrix X is said to follow a matrix-variate normal distribution, X ∼ MNa,b(M, B1, B2), if its density function (on the space Ma,b) is given by

f(X) = (2π)^(−ab/2) det(B1)^(−b/2) det(B2)^(−a/2) exp(−tr[B1⁻¹(X − M)B2⁻¹(X − M)′]/2).

Here M ∈ Ma,b, and B1 ∈ Ma,a and B2 ∈ Mb,b are positive definite matrices corresponding to the variances among the rows and columns of X, respectively. Note that the matrix normal distribution is related to the multivariate normal distribution in the following way: X ∼ MNa,b(M, B1, B2) if and only if vec(X) ∼ N(vec(M), B2 ⊗ B1).
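As a small illustration of the vec characterization, one can sample a matrix-variate normal via Cholesky factors of the two covariance matrices; the helper name below is hypothetical.

```python
import numpy as np

def sample_matrix_normal(M, B1, B2, rng):
    """Draw X ~ MN_{a,b}(M, B1, B2): since vec(X) ~ N(vec(M), B2 kron B1)
    (column-wise vec), X = M + L1 Z L2' with B1 = L1 L1' and B2 = L2 L2'."""
    L1, L2 = np.linalg.cholesky(B1), np.linalg.cholesky(B2)
    Z = rng.standard_normal(M.shape)
    return M + L1 @ Z @ L2.T

# Example: a 3 x 2 draw with AR(1)-type row covariance and identity column covariance.
rng = np.random.default_rng(1)
B1 = 0.5 ** np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
X = sample_matrix_normal(np.zeros((3, 2)), B1, np.eye(2), rng)
```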
3.1. Non-Hierarchical Matrix Normal Prior
We consider a matrix normal prior for Φ conditional on Σε, and a flat (uniform) prior on Σε. In particular,
Φ | Σε ∼ MNdp,p(O, U⁻¹, Σε), π(Σε) ∝ 1, (7)
where U is a dp × dp known positive definite matrix. Note that under this matrix normal prior, U⁻¹ and Σε are the covariance matrices corresponding to the columns and rows of Φ, respectively. The posterior distribution of Φ (conditional on Σε) can easily be shown to be MNdp,p(ΦPM, (X′X + U)⁻¹, Σε), where ΦPM = (X′X + U)⁻¹X′Y is the (conditional) posterior mean, which does not depend on Σε. Hence, the unconditional posterior mean of Φ is available in closed form and is given by ΦPM. It follows by standard computations using the multivariate normal density that the marginal posterior density of Σε is proportional to

det(Σε)^(−n/2) exp(−tr(Σε⁻¹Ŝ)/2),

where Ŝ := Y′Y − Y′X(X′X + U)⁻¹X′Y. This density is proper if and only if n > 2p. In this case, the marginal posterior density of Σε corresponds to the inverse-Wishart density with scale matrix Ŝ and shape parameter n − p − 1. We summarize the above observations in the lemma below.
Lemma 3.1. Under the non-hierarchical prior in (7), the posterior density of (Φ, Σε) is proper if and only if n > 2p. In this case,

Φ | Σε, Y ∼ MNdp,p(ΦPM, (X′X + U)⁻¹, Σε) and Σε | Y ∼ Inverse-Wishart(Ŝ, n − p − 1).
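A minimal sketch of these closed-form quantities, assuming the expressions in Lemma 3.1 as reconstructed above and SciPy's inverse-Wishart parameterization (degrees of freedom n − p − 1, scale Ŝ); all names are illustrative.

```python
import numpy as np
from scipy.stats import invwishart

def nonhier_posterior(Y, X, U, rng=None):
    """Posterior quantities under the non-hierarchical prior (7):
    ridge-type posterior mean Phi_PM and an inverse-Wishart draw of Sigma_eps."""
    n, p = Y.shape
    K = X.T @ X + U                           # dp x dp "row-side" posterior precision
    Phi_PM = np.linalg.solve(K, X.T @ Y)      # posterior mean, free of Sigma_eps
    S_hat = Y.T @ Y - Y.T @ X @ Phi_PM        # Y'Y - Y'X (X'X + U)^{-1} X'Y
    assert n > 2 * p, "posterior is proper only when n > 2p"
    Sigma_eps = invwishart(df=n - p - 1, scale=S_hat).rvs(random_state=rng)
    return Phi_PM, S_hat, Sigma_eps
```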
Assumptions for posterior consistency
We will establish our results under the high-dimensional setting from Section 2. Recall that Φ0 = Φ0n denotes the true underlying parameter matrix, and Σε,0 = Σε,0n denotes the true underlying error covariance matrix in this setting. The quantities μmax(𝒜̃), μmin(𝒜̃) and CX are as defined in (6) and (3), but with Φ0 and Σε,0 as the underlying parameter values. We assume the following:
Assumption A1. The VAR(d) model given in (1) is stable.
Assumption A2. λmax(Σε,0n)/μmin(𝒜̃) is O(1) as n → ∞.
Assumption A3. 1/λmin(Σε,0n) = O(1) and μmax(𝒜̃) = O(1).
Assumption A4. The true parameter matrix Φ0 of the VAR model (2) and the hyperparameter U of (7) are such that tr(Φ0′UΦ0) = O(p) and λmax(U) = O(1).
Assumption A5. p = o(n).
When d = 1, we deal with a VAR(1) model, CX becomes ΓX(0), and 𝒜̃ coincides with 𝒜, so that μmax(𝒜̃) = μmax(𝒜) and μmin(𝒜̃) = μmin(𝒜). Assumption A1 is a standard assumption which ensures that the underlying VAR process is well behaved. Assumption A2 plays an important role in deriving high-dimensional concentration bounds for X′X/n and X′E/n around CX and O, respectively (see Propositions B.1 and B.3 in the supplementary material). Assumption A3 is needed to ensure that λmin(X′X/n) is bounded away from 0 with high probability. Further, if each column of Φ is independently and identically distributed according to a normal prior distribution, that is U = Idp, then Assumption A4 reduces to tr(Φ0′Φ0) = O(p), i.e., a bound on the Frobenius norm of Φ0.
We now state the main theoretical result of posterior consistency with a non-hierarchical matrix normal prior distribution on Φ. The proof is given in Appendix C.1 of the supplementary material.
Theorem 3.2 (Posterior consistency for non-hierarchical prior). For any centered VAR(d) model (2) with the non-hierarchical prior (7) on Φ satisfying Assumptions A1-A5, posterior consistency holds for the parameter matrix, i.e., for every ε > 0,

Πn(‖Φ − Φ0‖ < ε | Y) → 1 as n → ∞, a.s. ℙ0,

where Φ0 is the true parameter matrix under the model (2).
A natural question to ask is whether the assumption p = o(n) can be relaxed for posterior consistency. In the lemma below, we consider a situation in which Assumptions A1-A4 are satisfied and p is of the same order as n, and prove that the resulting posterior is not consistent. The proof is given in Appendix C.2 of the supplementary material.
Lemma 3.3. Consider a (sequence of) VAR(1) models with pn = γn, Φ0n = αIpn and Σε,0n = Ipn, where γ, α ∈ (0,1) do not depend on n. If we use the non-hierarchical prior (7) on Φ with U = Ipn, then there exists ε > 0 such that Πn(‖Φ − Φ0n‖ < ε | Y) does not converge to 1 as n → ∞, a.s. ℙ0.
Remark. Note that the conditions on Φ0n and U assumed in Lemma 3.3 correspond to Assumption A4 in the setting of the Lemma. The reason for making this assumption is that we want to show that violating Assumption A5 (p = o(n)) can lead to posterior inconsistency, even if all of Assumptions A1-A4 hold. If we decide to also violate Assumption A4, by letting the relevant quantity grow at the same rate as n or faster, then the posterior inconsistency proof becomes comparatively easier. We have provided the corresponding proofs in Supplemental Sections C.3 and C.4, respectively.
3.2. Hierarchical Normal-mixture Prior
Next, we study the posterior consistency of the parameter matrix in model (2) in which Φ has the following hierarchical prior
Φ | Σε, U ∼ MNdp,p(O, U⁻¹, Σε), U ∼ πscl(·), π(Σε) ∝ 1, (8)
where U is a dp × dp random positive definite matrix having probability density πscl(·) over the space of dp × dp positive definite matrices. As shown below, the group lasso and multivariate t-distribution priors on Φ can be obtained from (8) using appropriate choices of πscl(·). The lemma below shows that the posterior is proper if n > (d + 1)p, and provides the form of various conditional and marginal posterior densities. The proof is given in Appendix C.5 of the supplementary material.
Lemma 3.4. Under the hierarchical normal-mixture prior in (8), the posterior density of (Φ, Σε, U) is proper if n > (d + 1)p. In this case,

Φ | Σε, U, Y ∼ MNdp,p(ΦU, (X′X + U)⁻¹, Σε) and Σε | U, Y ∼ Inverse-Wishart(ŜU, n − p − 1),

where ΦU := (X′X + U)⁻¹X′Y and ŜU := Y′Y − Y′X(X′X + U)⁻¹X′Y.
The Bayesian group lasso prior was proposed by Kyung et al. (2010) in the context of linear regression. We adapt it to the VAR setting as follows. Suppose the rows of Φ are divided into G groups Φ[1], …, Φ[G], where each Φ[g] is an mg × p sub-matrix of Φ (hence Σg=1,…,G mg = dp) and Xg is the n × mg sub-matrix of X corresponding to the group Φ[g]. The frequentist group lasso estimator (conditional on Σε) is obtained by solving

minΦ { tr[Σε⁻¹(Y − XΦ)′(Y − XΦ)] + 2 Σg=1,…,G λg ‖Φ[g]‖F },

where λg is a tuning parameter corresponding to the group g. The group lasso estimator (conditional on Σε) can also be expressed as the maximum a posteriori probability (MAP) estimate under model (2) with the prior

π(Φ | Σε) ∝ ∏g=1,…,G exp(−λg ‖Φ[g]‖F),

which is a multivariate generalization of the double exponential prior and can also be expressed as a scale mixture of normals with Gamma hyperpriors (Park and Casella (2008), Kyung et al. (2010)), leading to the group lasso hierarchy

Φ[g] | τg, Σε ∼ MNmg,p(O, τgImg, Σε) and τg ∼ Gamma((mg + 1)/2, λg²/2), independently for g = 1, …, G.

Here Gamma(α, λ) denotes the Gamma distribution with shape parameter α and rate parameter λ. This can alternatively be presented as Φ | τ1, …, τG, Σε ∼ MNdp,p(O, BDiag(τ1, …, τG), Σε), where BDiag(τ1, …, τG) denotes a block-diagonal matrix with g-th block equal to τgImg. Note that under the above hierarchical prior, conditionally on (τ1, …, τG) and Σε, the columns of Φ are independent. If mg = 1 for all g = 1, …, dp, we get the ordinary Bayesian lasso.
Under the specification given in (8), suppose we take U⁻¹ = Diag(τ1, …, τdp) with the τi independently distributed so that 1/τi ∼ Gamma(αi, λi/2); then it can be shown that the prior density for Φ given only Σε is proportional to

∏i=1,…,dp (1 + φiΣε⁻¹φi′/λi)^(−(αi + p/2)),

where φi denotes the i-th row of Φ, which corresponds to the multivariate t-distribution.
Estimation
For the hierarchical model given in (8), the posterior density of Φ is intractable and quantities such as the posterior mean are not available in closed form. Hence, we develop a Markov chain Monte Carlo algorithm to generate values from the posterior density. It follows by straightforward calculations that

Φ | Σε, U, Y ∼ MNdp,p(ΦU, (X′X + U)⁻¹, Σε) and Σε | U, Y ∼ Inverse-Wishart(ŜU, n − p − 1),

where ΦU = (X′X + U)⁻¹X′Y and ŜU = Y′Y − Y′X(X′X + U)⁻¹X′Y. While the conditional posterior distributions of Φ given (Σε, U) and of Σε given U are easy to simulate from (being matrix normal and inverse-Wishart, respectively), the tractability of the conditional posterior density of U given (Φ, Σε) depends on the form of the prior πscl(U). We show below that for three standard choices of πscl(U), corresponding to the Wishart prior, the group lasso prior and the multivariate t-prior, π(U | Φ, Σε) becomes a tractable density and easy to simulate from.
Case 1: Wishart Prior
For a dp × dp positive definite matrix V, let U ∼ Wishartdp(V, df = ν + dp), that is, π(U) ∝ det(U)^((ν−1)/2) exp(−tr(V⁻¹U)/2). In this case,

π(U | Φ, Σε) ∝ det(U)^((ν+p−1)/2) exp(−tr[(V⁻¹ + ΦΣε⁻¹Φ′)U]/2),

which is Wishartdp((V⁻¹ + ΦΣε⁻¹Φ′)⁻¹, ν + dp + p). Note that as long as we have ν > −(dp + 1), the posterior of U given (Φ, Σε) is proper.
Case 2: Bayesian Group Lasso
In this case, as already discussed in Subsection 3.2, U⁻¹ has the block-diagonal form BDiag(τ1, …, τG) and the τg's are a priori independently distributed as Gamma with shape (mg + 1)/2 and rate λg²/2. Hence, the conditional distribution of τg has the following form:

π(τg | Φ, Σε) ∝ τg^((mg+1)/2 − mgp/2 − 1) exp{−(λg²τg + tr(Σε⁻¹Φ[g]′Φ[g])/τg)/2},

a generalized inverse Gaussian density that is easy to simulate from, independently for g = 1, …, G.
Case 3: Multivariate t-distribution
By taking U⁻¹ = Diag(τ1, …, τdp) and 1/τi to be independently distributed as Gamma with shape αi and rate λi/2, we have the multivariate t-distribution as the prior on Φ. In this case, the conditional distribution of τi has the following form:

1/τi | Φ, Σε ∼ Gamma(αi + p/2, (λi + φiΣε⁻¹φi′)/2), independently for i = 1, …, dp,

where φi denotes the i-th row of Φ.
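Putting the pieces together, a Gibbs sampler for Case 1 cycles through the three conditionals above. The sketch below follows the conditional forms as reconstructed in this section; the distributional parameterizations and function names are assumptions made for illustration, not the paper's verbatim algorithm.

```python
import numpy as np
from scipy.stats import invwishart, wishart

def gibbs_var_wishart(Y, X, V, nu, n_iter, rng):
    """Blocked Gibbs sampler for the hierarchical prior (8) with a
    Wishart prior on U (Case 1)."""
    n, p = Y.shape
    dp = X.shape[1]
    U = np.eye(dp)
    V_inv = np.linalg.inv(V)
    draws = []
    for _ in range(n_iter):
        K = X.T @ X + U
        Phi_U = np.linalg.solve(K, X.T @ Y)
        # Sigma_eps | U, Y ~ Inverse-Wishart(S_U, n - p - 1)
        S_U = Y.T @ Y - Y.T @ X @ Phi_U
        Sigma = invwishart(df=n - p - 1, scale=S_U).rvs(random_state=rng)
        # Phi | Sigma_eps, U, Y ~ MN(Phi_U, K^{-1}, Sigma_eps)
        L1 = np.linalg.cholesky(np.linalg.inv(K))
        L2 = np.linalg.cholesky(Sigma)
        Phi = Phi_U + L1 @ rng.standard_normal((dp, p)) @ L2.T
        # U | Phi, Sigma_eps ~ Wishart((V^{-1} + Phi Sigma^{-1} Phi')^{-1}, nu + dp + p)
        scale = np.linalg.inv(V_inv + Phi @ np.linalg.solve(Sigma, Phi.T))
        U = wishart(df=nu + dp + p, scale=scale).rvs(random_state=rng)
        draws.append(Phi)
    return draws
```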
Assumptions for posterior consistency
We now introduce regularity conditions to establish posterior consistency under the hierarchical prior model.
Assumption B1. The VAR(d) model given in (1) is stable.
Assumption B2. λmax(Σε,0n)/μmin(𝒜̃) is O(1) as n → ∞.
Assumption B3. 1/λmin(Σε,0n) = O(1), μmax(𝒜̃) = O(1), and 0 < λmax(Σε,0n) = O(1).
Assumption B4. The singular values of the true parameter matrices {Φ0n}n≥1 are uniformly bounded. Equivalently, the eigenvalues of Φ0n′Φ0n are uniformly bounded.
Assumption B5. p log n/n → 0 as n → ∞, i.e., p = o(n/log n).
Assumption B6. There exists (fixed and not-depending on n) α > 0 such that and for every β > 0 we have .
We now discuss these assumptions and compare them to the assumptions for the non-hierarchical prior model.
Assumptions B1 and B2 are identical to A1 and A2, while B3 is fairly similar to A3.
One key difference is the permissible scaling of p as a function of the sample size n in Assumption B5, which is slightly more stringent than the permissible scaling for the nonhierarchical matrix normal prior in Assumption A5.
Note that Assumption B6 is a mild one. For example, a sufficient condition for this assumption to be satisfied is that for some δ > 0 and . It can be easily checked that this condition, and hence Assumption B6, is satisfied in the case of Wishart, Inverse-Wishart, Bayesian group lasso, multivariate t-distribution, Horseshoe (Carvalho et al. (2010)), Strawderman-Berger and generalized double Pareto (Armagan et al. (2013)) priors as long as the prior parameters do not depend on n.
In the non-hierarchical prior case, assumptions regarding Φ0 and the (non-random) U are simultaneously and exclusively provided in Assumption A4 through the conditions tr(Φ0′UΦ0) = O(p) and λmax(U) = O(1). For the hierarchical prior case, for clarity of exposition, we provide the assumptions regarding Φ0 in Assumption B4, and those regarding the distribution of the (random) U in Assumption B6. Combining these two assumptions, it can be easily shown that the corresponding prior quantities converge to zero in Πscl,n-probability as n → ∞. In that sense, the assumptions on (Φ0, U) in the hierarchical model are stronger than in the non-hierarchical model case.
With these assumptions in hand, we state our key consistency result, whose proof is delegated to Appendix C.6 of the supplementary material.
Theorem 3.5 (Posterior consistency for hierarchical prior). For any centered VAR(d) model with the hierarchical prior (8) on the transition matrix satisfying Assumptions B1-B6, posterior consistency holds for the transition matrix, i.e., for every ε > 0,

Πn(‖Φ − Φ0‖ < ε | Y) → 1 as n → ∞, a.s. ℙ0,

where Φ0 is the true parameter matrix under the model (2).
4. Performance Evaluation
To illustrate the performance of our Bayesian modeling framework for VAR processes, we design three sets of numerical experiments involving: (a) Small VAR (p = 10), (b) Medium VAR (p = 100) and (c) Large VAR (p = 500) models, each with two lag values: (i) d = 1 and (ii) d = 2.
In each setting, we use transition matrices Ai with 10-30% non-zero entries, whose locations are selected at random and whose values are generated from U(0,2) ∪ U(−2,0), rescaled to ensure that the process is stable with SNR = 2. For small VAR models, we generate n = 40, 80, 120 time points; for medium VAR models, n = 400, 800, 1200; while for large VAR models we use n = 2000, 4000, 6000. The hyper-parameters of the prior distributions are selected using the Deviance Information Criterion (DIC). Note that DIC = 2D̄ − D(Φ̄, Σ̄ε), where D(Φ, Σε) = −2 log L(Y | Φ, Σε) is the deviance, D̄ is the posterior expectation of D(Φ, Σε), and Φ̄ and Σ̄ε are the posterior expectations of Φ and Σε, respectively.
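For concreteness, a sketch of the DIC computation from posterior draws under model (2), using the deviance of the matrix-normal likelihood (helper names and arguments are illustrative):

```python
import numpy as np

def dic(Y, X, Phi_draws, Sigma_draws):
    """DIC = 2 * Dbar - D(Phi_bar, Sigma_bar), where D is the deviance
    (-2 log-likelihood) of model (2) and bars denote posterior means."""
    n, p = Y.shape

    def deviance(Phi, Sigma):
        R = Y - X @ Phi
        _, logdet = np.linalg.slogdet(Sigma)
        quad = np.trace(np.linalg.solve(Sigma, R.T @ R))
        return n * p * np.log(2 * np.pi) + n * logdet + quad

    Dbar = np.mean([deviance(P, S) for P, S in zip(Phi_draws, Sigma_draws)])
    return 2 * Dbar - deviance(np.mean(Phi_draws, axis=0), np.mean(Sigma_draws, axis=0))
```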
4.1. Non-hierarchical prior
We generate two different error processes, using Σε = σ²Ip and Σε of Toeplitz form. For each of the small, medium and large VAR models, U is taken to be the diagonal matrix cIdp, where c is chosen according to the minimum DIC value over the interval [0,10]. In Table 1, for both the posterior mean (PM) and the least squares estimator (LS), we report the relative estimation error, with its standard error within parentheses, averaged over 10×p replicates for the small and medium VARs and 100 replicates for the large VAR (p = 500). Since the true parameter matrix Φ0 is sparse, we identify entries whose 95% posterior credible intervals contain zero, and set them to zero in both parameter matrix estimates (PM and LS).
Table 1:
Relative error in VAR (d = 1, 2) with Σε = σ²Ip, where % denotes the percentage of non-zero entries in Φ0.

| | | Lag d = 1 | | | | | | Lag d = 2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | % | LS (n1) | PM (n1) | LS (n2) | PM (n2) | LS (n3) | PM (n3) | LS (n1) | PM (n1) | LS (n2) | PM (n2) | LS (n3) | PM (n3) |
| Small VAR | 10 | 0.93 | 0.83 | 0.82 | 0.70 | 0.70 | 0.55 | 1.26 | 1.09 | 1.11 | 0.96 | 1.00 | 0.76 |
| (0.21) | (0.12) | (0.12) | (0.08) | (0.07) | (0.05) | (0.24) | (0.17) | (0.17) | (0.07) | (0.10) | (0.08) | ||
| 20 | 1.06 | 1.00 | 1.00 | 0.87 | 0.83 | 0.71 | 1.39 | 1.22 | 1.28 | 1.07 | 1.14 | 0.94 | |
| (0.34) | (0.24) | (0.26) | (0.16) | (0.16) | (0.11) | (0.35) | (0.30) | (0.27) | (0.21) | (0.18) | (0.15) | ||
| 30 | 1.23 | 1.15 | 1.12 | 0.97 | 0.99 | 0.81 | 1.57 | 1.40 | 1.45 | 1.28 | 1.30 | 1.15 | |
| (0.45) | (0.37) | (0.37) | (0.27) | (0.27) | (0.19) | (0.47) | (0.40) | (0.37) | (0.32) | (0.34) | (0.22) | ||
| Medium VAR | 10 | 1.81 | 1.64 | 1.69 | 1.53 | 1.56 | 1.40 | 2.45 | 2.12 | 2.31 | 2.01 | 2.24 | 1.85 |
| (0.40) | (0.21) | (0.31) | (0.12) | (0.23) | (0.07) | (0.46) | (0.32) | (0.37) | (0.24) | (0.31) | (0.14) | ||
| 20 | 1.95 | 1.81 | 1.85 | 1.70 | 1.74 | 1.56 | 2.60 | 2.26 | 2.50 | 2.15 | 2.43 | 2.01 | |
| (0.50) | (0.32) | (0.42) | (0.25) | (0.35) | (0.17) | (0.57) | (0.42) | (0.51) | (0.33) | (0.40) | (0.29) | ||
| 30 | 2.10 | 1.94 | 1.98 | 1.84 | 1.87 | 1.74 | 2.74 | 2.46 | 2.62 | 2.31 | 2.52 | 2.20 | |
| (0.64) | (0.43) | (0.52) | (0.38) | (0.45) | (0.27) | (0.69) | (0.54) | (0.60) | (0.46) | (0.56) | (0.39) | ||
| Large VAR | 10 | 2.70 | 2.47 | 2.55 | 2.34 | 2.46 | 2.22 | 3.66 | 3.18 | 3.50 | 3.04 | 3.38 | 2.92 |
| (0.58) | (0.30) | (0.50) | (0.23) | (0.40) | (0.16) | (0.67) | (0.45) | (0.57) | (0.35) | (0.49) | (0.28) | ||
| 20 | 2.84 | 2.62 | 2.75 | 2.52 | 2.60 | 2.35 | 3.81 | 3.33 | 3.69 | 3.23 | 3.54 | 3.05 | |
| (0.70) | (0.42) | (0.60) | (0.34) | (0.54) | (0.25) | (0.78) | (0.56) | (0.70) | (0.50) | (0.61) | (0.40) | ||
| 30 | 3.03 | 2.78 | 2.90 | 2.65 | 2.71 | 2.49 | 3.96 | 3.49 | 3.89 | 3.32 | 3.78 | 3.25 | |
| (0.82) | (0.54) | (0.70) | (0.44) | (0.63) | (0.35) | (0.88) | (0.69) | (0.84) | (0.63) | (0.76) | (0.51) | ||
First, we assume the true error covariance matrix Σε is diagonal, i.e., Σε = σ²Ip. Here % denotes the percentage of non-zero entries in Φ0 and d represents the lag length of the underlying VAR process. Recall that the sample sizes used for small VAR models are n1 = 40, n2 = 80, n3 = 120; for medium VAR ones n1 = 400, n2 = 800, n3 = 1200; and for large VAR ones n1 = 2000, n2 = 4000, n3 = 6000.
It can be seen that the relative estimation error decreases with an increase in the number of time points (sample size) n for both lags d = 1,2; further, its values are significantly larger in medium and large size VAR models than in small VAR ones. Moreover, the estimation error for lag 1 is uniformly smaller than that for lag 2, and the same holds true for their respective standard errors. Regarding the percentage of non-zero entries in the true transition matrices, the results show that for fixed n and p, the more true non-zero entries in A1,A2, the less accurate the posterior mean and the LS estimator are, while their variability as indicated by their standard errors also follows the same pattern. However, the posterior mean clearly outperforms the LS estimates, especially in settings with large p. This is to a large extent due to the fact that the true transition matrices A1,A2 exhibit weaker signal as p or the number of non-zero edges increases (this is to ensure stability of the underlying VAR model) and due to our choice of U = cIdp the posterior mean is the ridge regression estimator which applies shrinkage on the coefficients.
Next, we introduce correlation in the error components by specifying Σε to be of Toeplitz form. As discussed in Section 3.2.1 of Lütkepohl (2007), the generalized least squares estimate in this multivariate regression setup is the same as the ordinary one, i.e., (X′X)⁻¹X′Y, a result due to Zellner (1962). In Table 2, we compare the performance of the least squares and posterior (ridge) estimates with noise covariance Σε = Toeplitz(ρ = 0.8).
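Zellner's result is easy to verify numerically: the vectorized GLS estimator with error covariance Σε ⊗ In coincides with equation-by-equation OLS. A minimal check on synthetic inputs (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 30, 3, 4
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, p))
Sigma = np.cov(rng.standard_normal((p, 200)))            # arbitrary p x p PD matrix

Z = np.kron(np.eye(p), X)                                # vec(Y) = Z vec(Phi) + vec(E)
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))     # (Sigma kron I_n)^{-1}
y = Y.T.reshape(-1)                                      # column-wise vec(Y)
gls = np.linalg.solve(Z.T @ Omega_inv @ Z, Z.T @ Omega_inv @ y)
ols = np.linalg.solve(X.T @ X, X.T @ Y)                  # (X'X)^{-1} X'Y

assert np.allclose(gls.reshape(p, q).T, ols)             # Zellner (1962): GLS = OLS
```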
Table 2:
Relative error in VAR (d = 1, 2) with Σε = Toeplitz(ρ = 0.80), where % denotes the percentage of non-zero entries in Φ0.

| | | Lag d = 1 | | | | | | Lag d = 2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | % | LS (n1) | PM (n1) | LS (n2) | PM (n2) | LS (n3) | PM (n3) | LS (n1) | PM (n1) | LS (n2) | PM (n2) | LS (n3) | PM (n3) |
| Small VAR | 10 | 1.03 | 0.95 | 0.95 | 0.82 | 0.82 | 0.64 | 1.45 | 1.24 | 1.31 | 1.08 | 1.19 | 0.98 |
| (0.27) | (0.18) | (0.20) | (0.09) | (0.10) | (0.05) | (0.37) | (0.27) | (0.30) | (0.19) | (0.18) | (0.09) | ||
| 20 | 1.20 | 1.10 | 1.06 | 0.94 | 0.98 | 0.83 | 1.60 | 1.41 | 1.49 | 1.30 | 1.35 | 1.17 | |
| (0.40) | (0.30) | (0.29) | (0.21) | (0.25) | (0.17) | (0.48) | (0.37) | (0.40) | (0.29) | (0.31) | (0.19) | |
| 30 | 1.36 | 1.27 | 1.22 | 1.11 | 1.12 | 0.99 | 1.76 | 1.61 | 1.66 | 1.45 | 1.50 | 1.26 | |
| (0.50) | (0.41) | (0.44) | (0.32) | (0.32) | (0.25) | (0.60) | (0.51) | (0.50) | (0.40) | (0.46) | (0.30) | ||
| Medium VAR | 10 | 2.02 | 1.87 | 1.94 | 1.72 | 1.75 | 1.61 | 2.80 | 2.46 | 2.70 | 2.35 | 2.53 | 2.21 |
| (0.52) | (0.34) | (0.41) | (0.27) | (0.37) | (0.16) | (0.69) | (0.50) | (0.62) | (0.43) | (0.54) | (0.33) | ||
| 20 | 2.20 | 2.01 | 2.10 | 1.87 | 1.92 | 1.80 | 2.95 | 2.62 | 2.86 | 2.52 | 2.74 | 2.34 | |
| (0.63) | (0.45) | (0.55) | (0.38) | (0.45) | (0.33) | (0.81) | (0.62) | (0.71) | (0.52) | (0.65) | (0.42) | ||
| 30 | 2.36 | 2.20 | 2.19 | 2.06 | 2.06 | 1.91 | 3.12 | 2.78 | 2.99 | 2.67 | 2.91 | 2.50 | |
| (0.73) | (0.56) | (0.67) | (0.48) | (0.56) | (0.41) | (0.94) | (0.72) | (0.83) | (0.67) | (0.77) | (0.57) | ||
| Large VAR | 10 | 3.03 | 2.80 | 2.88 | 2.63 | 2.83 | 2.55 | 4.19 | 3.67 | 4.09 | 3.55 | 3.93 | 3.44 |
| (0.75) | (0.49) | (0.67) | (0.41) | (0.60) | (0.33) | (1.03) | (0.73) | (0.97) | (0.62) | (0.87) | (0.58) | ||
| 20 | 3.17 | 2.94 | 3.07 | 2.79 | 2.89 | 2.69 | 4.34 | 3.85 | 4.25 | 3.70 | 4.11 | 3.594 | |
| (0.86) | (0.60) | (0.79) | (0.51) | (0.69) | (0.45) | (1.14) | (0.84) | (1.06) | (0.76) | (1.00) | (0.66) | ||
| 30 | 3.33 | 3.10 | 3.27 | 3.00 | 3.08 | 2.80 | 4.53 | 4.00 | 4.37 | 3.89 | 4.26 | 3.70 | |
| (0.98) | (0.73) | (0.90) | (0.63) | (0.82) | (0.56) | (1.26) | (0.95) | (1.17) | (0.87) | (1.11) | (0.79) | ||
In this setting, the relative estimation error of both the least squares and ridge estimators increases compared to the uncorrelated error structure in Table 1; in particular, the performance of the LS estimator deteriorates even further. However, with an increase in sample size, the accuracy of both estimates improves significantly. Further, as gleaned from the entries of the table corresponding to lag 2, the relative error exhibits a further increase, a pattern consistent with the results in Table 1. This is expected, as we not only have a Toeplitz-type error covariance structure, but the total number of unknown parameters has also increased by p².
Finally, we study the support recovery under both error processes. In Table 3, we provide the percentage of true positives recovered by using 95% posterior credible intervals based on the same sample sizes n1, n2, and n3 as used previously.
Table 3:
Percentage of true positive non-zero entries recovered in Φ.
| | | Lag d = 1 | | | | | | Lag d = 2 | | | | | |
| | | Σε = σ²Ip | | | Σε = Toep | | | Σε = σ²Ip | | | Σε = Toep | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | % | n1 | n2 | n3 | n1 | n2 | n3 | n1 | n2 | n3 | n1 | n2 | n3 |
| Small VAR | 10 | 85.0 | 85.0 | 82.0 | 85.0 | 85.0 | 84.0 | 84.7 | 84.5 | 81.6 | 84.6 | 84.6 | 83.6 |
| 20 | 80.0 | 80.0 | 81.0 | 80.0 | 78.0 | 80.0 | 79.7 | 79.6 | 80.5 | 79.5 | 77.6 | 79.8 | |
| 30 | 77.0 | 75.0 | 77.0 | 77.0 | 73.0 | 73.0 | 76.6 | 74.6 | 76.5 | 76.7 | 72.6 | 72.6 | |
| Medium VAR | 10 | 89.3 | 90.0 | 89.3 | 89.3 | 89.8 | 88.3 | 88.8 | 89.5 | 89.0 | 88.9 | 89.4 | 87.9 |
| 20 | 85.3 | 85.5 | 85.5 | 85.3 | 84.5 | 85.0 | 84.9 | 85.3 | 85.3 | 84.8 | 84.1 | 84.8 | |
| 30 | 81.0 | 80.8 | 81.3 | 81.0 | 79.8 | 78.8 | 80.8 | 80.5 | 81.0 | 80.6 | 79.3 | 78.3 | |
| Large VAR | 10 | 92.0 | 92.1 | 92.2 | 92.0 | 92.1 | 91.8 | 91.8 | 91.6 | 91.8 | 91.7 | 91.7 | 91.4 |
| 20 | 88.1 | 88.1 | 88.1 | 88.1 | 87.9 | 87.9 | 87.7 | 87.8 | 87.7 | 87.7 | 87.5 | 87.5 | |
| 30 | 83.1 | 83.3 | 83.4 | 83.1 | 82.8 | 82.9 | 82.7 | 82.8 | 82.9 | 82.7 | 82.4 | 82.6 | |
The above table indicates that support recovery is not sensitive to the sample size or to the lag; however, it deteriorates for all VAR models and error covariance settings as the density of non-zero entries increases, and it exhibits a small improvement as the model dimension grows.
4.2. Hierarchical Priors
As discussed in Section 3.2, three types of hierarchical priors (Wishart, group lasso and multivariate t) are studied. Analogously to the non-hierarchical prior case, the performance of the LS estimator is not satisfactory in this setup either. Thus, we only compare the relative accuracy of the three prior choices in this setting. We select V = cIdp and ν = dp for the Wishart prior, λi = λ for all 1 ≤ i ≤ dp for the group lasso prior, and αi = 1, λi = λ for all 1 ≤ i ≤ dp for the multivariate t prior. The hyper-parameters c and λ are chosen using DIC. In Table 4, we report the relative estimation error of the three hierarchical estimators when the error covariance is set to Σε = σ²Ip; d represents the lag length of the underlying VAR model.
Table 4:
Relative error in VAR (d = 1,2) with Σε = σ2Ip, where % denotes percentage of non-zero entries in Φ0.
| | | Wishart | | | Group lasso | | | Multivariate t | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Lag d = 1 | % | n1 | n2 | n3 | n1 | n2 | n3 | n1 | n2 | n3 |
| Small VAR | 10 | 0.84 | 0.75 | 0.65 | 0.84 | 0.75 | 0.64 | 0.85 | 0.74 | 0.58 |
| 20 | 0.99 | 0.91 | 0.73 | 1.00 | 0.89 | 0.79 | 1.00 | 0.90 | 0.76 | |
| 30 | 1.14 | 1.09 | 0.92 | 1.16 | 1.02 | 0.94 | 1.15 | 1.04 | 0.93 | |
| Medium VAR | 10 | 1.63 | 1.53 | 1.42 | 1.62 | 1.54 | 1.43 | 1.63 | 1.53 | 1.39 |
| 20 | 1.78 | 1.71 | 1.53 | 1.76 | 1.69 | 1.56 | 1.79 | 1.66 | 1.55 | |
| 30 | 1.93 | 1.83 | 1.70 | 1.95 | 1.86 | 1.72 | 1.95 | 1.82 | 1.69 | |
| Large VAR | 10 | 2.39 | 2.26 | 2.16 | 2.41 | 2.31 | 2.22 | 2.42 | 2.34 | 2.22 |
| 20 | 2.59 | 2.42 | 2.31 | 2.58 | 2.47 | 2.39 | 2.59 | 2.46 | 2.32 | |
| 30 | 2.77 | 2.61 | 2.52 | 2.70 | 2.60 | 2.53 | 2.74 | 2.63 | 2.54 | |
| Lag d = 2 | ||||||||||
| Small VAR | 10 | 0.88 | 0.74 | 0.60 | 0.87 | 0.80 | 0.71 | 0.89 | 0.78 | 0.63 |
| 20 | 1.05 | 0.91 | 0.83 | 1.04 | 0.94 | 0.85 | 1.08 | 0.94 | 0.86 | |
| 30 | 1.25 | 1.13 | 0.97 | 1.23 | 1.10 | 0.99 | 1.27 | 1.13 | 1.03 | |
| Medium VAR | 10 | 1.71 | 1.58 | 1.46 | 1.71 | 1.58 | 1.53 | 1.70 | 1.60 | 1.50 |
| 20 | 1.85 | 1.72 | 1.60 | 1.85 | 1.76 | 1.71 | 1.89 | 1.82 | 1.66 | |
| 30 | 2.05 | 1.96 | 1.77 | 2.01 | 1.95 | 1.82 | 2.04 | 1.97 | 1.83 | |
| Large VAR | 10 | 2.52 | 2.37 | 2.26 | 2.51 | 2.41 | 2.29 | 2.53 | 2.44 | 2.25 |
| 20 | 2.69 | 2.58 | 2.47 | 2.67 | 2.59 | 2.47 | 2.69 | 2.59 | 2.51 | |
| 30 | 2.88 | 2.75 | 2.60 | 2.86 | 2.75 | 2.62 | 2.87 | 2.74 | 2.64 | |
In Table 5, we present relative estimation errors with the same three hierarchical priors when Σε = Toeplitz(ρ = 0.6).
All of our hierarchical estimates outperform the ridge estimator (Table 1 and Table 2) across all settings considered. This is again expected, since the Ai's have a sparse structure by construction and the group lasso prior favors sparsity. However, the above results are not conclusive as to whether the group lasso estimate exhibits better accuracy than the Wishart or multivariate t estimates. To gain some insight into this issue, we use a VAR(1) model with p = 9 and transition matrix A1 in which the columns form three groups, each containing three columns. The sparsity increases as we move from group 1 to group 3. Finally, we rescale the coefficient matrix so that the corresponding VAR process is stable with SNR = 2. The structure of the resulting A1 transition matrix is depicted next, where * indicates non-zero entries.
We generate n = 100 observations from the above VAR(1) model using two noise covariance settings, (1) Σε = σ²Ip and (2) Σε = Toeplitz(ρ = 0.80), and report, in Table 6, the relative estimation error of five different estimates: least squares (LS), the posterior mean for the non-hierarchical normal prior (Non H), hierarchical Wishart (W), group lasso (GL) and multivariate t prior (Mult. t).
Table 6:
Relative error
| Estimator | Σε = σ²Ip | Σε = Toeplitz |
|---|---|---|
| LS | 0.604 | 1.384 |
| Non H | 0.583 | 0.614 |
| W | 0.462 | 0.544 |
| Mult. t | 0.430 | 0.414 |
| GL | 0.321 | 0.362 |
The results show that the group lasso estimator exhibits the best performance, followed by the multivariate t one, whereas the LS estimator is the least accurate. The result is consistent with the structure of the underlying transition matrix, since the group lasso prior can capitalize on it.
In Appendix A.1 of the supplementary document, we illustrate the posterior estimate of a VAR(1) model transition matrix A1, together with 95% credible intervals and estimated posterior densities for several entries of A1. We also look into the performance of the estimates when the true error covariance is Toeplitz(ρ = 0.8). The relative errors of the estimates under all four priors, using a new noise covariance matrix generated from a Wishart distribution with ν = p degrees of freedom and scale matrix Ip, are also given.
5. Application to Macroeconomic Data
We use the proposed Bayesian framework to understand the lead-lag relationships in the FRED-MD dataset, containing p = 137 key macroeconomic variables for the period January 1973 to December 2014. VAR modeling for this task was strongly advocated by Sims (1980) and has since become a standard tool, although the focus is usually on small models involving a few macroeconomic indicators (e.g., the consumer price index, an employment index and the federal funds rate). However, recent work has advocated larger VAR models (see Bernanke et al. (2005); Bańbura et al. (2010) and references therein), both to improve forecastability and to avoid relationships that are hard to interpret or even contradict economic theory, owing to the omission of variables needed to properly model the economic phenomenon under consideration. Before centering the data and estimating Σε as discussed in Section 3.2, we ensure stationarity by processing the variables according to the recommendations in Stock and Watson (2005). The specific transformations used for each time series are given in the supplementary document. Analogously to Bańbura et al. (2010), we consider the following three different size VAR models:
SMALL: This model contains p = 4 key variables - CPI, number of employees non-farm (PAYEMS), Federal Funds Rate (FEDFUNDS) and Unemployment Rate (UNRATE).
MEDIUM: In addition to the four variables in the SMALL VAR model, this one contains an additional 16 variables (total p = 20) listed next - Reserves Of Depository Institutions (NONBORRES), Total Reserves of Depository Institutions (TOTRESNS), M2 (M2REAL), Real Personal Income (RPI), Real personal consumption expenditures (DPCERA), IP Index (INDPRO), Capacity Utilization: Manufacturing (CAPT), Housing Starts: Total New Privately Owned (HOUST), Avg Hourly Earnings : Goods-Producing (CES), M1 (M1), S & P’s Common Stock Price Index: Composite (S.P.), 10-Year Treasury Rate (GS10), Personal Cons. Expend.: Chain Index (PCEPI), Foreign Exchange Rate (EXS), Crude Oil, spliced WTI and Cushing (OIL) and Retail and Food Services Sales (RETAILx).
LARGE: This specification has all p = 134 macroeconomic indicators (3 were excluded from further analysis due to the presence of a large number of missing values).
Based on initial exploratory work, we choose lag d = 6 according to the Bayesian information criterion (BIC), and the following distributions were used for prior specification to obtain the estimated parameter matrix Φ̂: (i) non-hierarchical normal (Non H), (ii) hierarchical Wishart (W), (iii) group lasso (GL) and (iv) multivariate t prior (Mult. t). Since the number of parameters increases linearly with the lag length d, we suggest using BIC over the Akaike information criterion (AIC). For the non-hierarchical prior, we use U = BDiag(λ1, ⋯, λd), while for the hierarchical Wishart, group lasso and multivariate t priors on Φ, we use V = c1Idp and αi = α for all 1 ≤ i ≤ dp. The values of c1 and λ are chosen using the Deviance Information Criterion (DIC), which, as explained previously, is a hierarchical Bayesian modeling generalization of BIC. The respective posterior means were compared to the least squares (LS) estimate. For each estimate Φ̂, the residual norm ratio ‖Y − XΦ̂‖F/‖Y‖F, which measures the in-sample fit, is reported in Table 7.
Table 7:
In-sample prediction error
| | SMALL (p = 4) | MEDIUM (p = 20) | LARGE (p = 134) |
|---|---|---|---|
| LS | 0.847 | 0.863 | 0.673 |
| Non-H | 0.852 | 0.864 | 0.674 |
| W | 0.845 | 0.863 | 0.675 |
| Mult. t | 0.877 | 0.885 | 0.685 |
| GL | 0.847 | 0.873 | 0.674 |
Note that since the LS estimator is obtained by minimizing ‖Y − XΦ‖F², it will always result in the minimum relative residual norm, as observed in the above table; i.e., the LS estimator is always the best one in terms of in-sample prediction accuracy.
Next, we investigate the 4 different Bayesian estimates based on their out-of-sample prediction performance with respect to the benchmark prior, analogously to the evaluation strategies discussed in Bańbura et al. (2010) and Stock and Watson (2005). We consider the following two benchmark priors:
Prior information is imposed exactly by setting U⁻¹ = O (the zero matrix), which corresponds to λ = 0 in the Minnesota prior. Bańbura et al. (2010) use this specification as the benchmark prior, in which case the corresponding benchmark model becomes a random walk with drift, i.e., Xt = c + Xt−1 + εt.
A uniform prior on Φ, obtained by setting U = O, which corresponds to λ = ∞ in the Minnesota prior. In this case the posterior mean coincides with the least squares estimate (LS).
Let X̂t+h|t denote the h-step-ahead predicted value of Xt+h based on our posited Bayesian model, using the data up to time t, and let X̂⁰t+h|t denote the corresponding forecast under the benchmark prior. The mean squared forecast error relative to the benchmark (RMSFE) is defined as

RMSFE(h) = Σt ‖X̂t+h|t − Xt+h‖² / Σt ‖X̂⁰t+h|t − Xt+h‖²,

where the sums run over the evaluation period. Table 8 gives the RMSFE results for three different choices of forecasting horizon, h = 1, 6, 12, for the two benchmark priors considered, over the period T0 = January 1978 to T1 = December 2006. An RMSFE value smaller than 1 implies that the VAR model with the corresponding prior outperforms that with the naive/benchmark prior.
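A sketch of this computation (aggregating squared forecast errors over variables and the evaluation period is one natural reading of the definition; names are illustrative):

```python
import numpy as np

def rmsfe(model_fc, bench_fc, actuals):
    """Relative MSFE: squared-error loss of the model's h-step forecasts
    divided by that of the benchmark-prior forecasts, summed over the
    evaluation period (each argument is a list of p-vectors)."""
    num = sum(np.sum((f - a) ** 2) for f, a in zip(model_fc, actuals))
    den = sum(np.sum((b - a) ** 2) for b, a in zip(bench_fc, actuals))
    return num / den
```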
Table 8:
Out of sample relative prediction error
| | | Uniform prior | | | Random Walk | | |
|---|---|---|---|---|---|---|---|
| | | SMALL (p = 4) | MEDIUM (p = 20) | LARGE (p = 134) | SMALL (p = 4) | MEDIUM (p = 20) | LARGE (p = 134) |
| Non Hierarchical | h = 1 | 0.88 | 0.72 | 0.62 | 0.49 | 0.40 | 0.33 |
| h = 6 | 0.90 | 0.78 | 0.62 | 0.43 | 0.42 | 0.37 | |
| h = 12 | 0.95 | 0.84 | 0.74 | 0.43 | 0.41 | 0.37 | |
| Wishart | h = 1 | 0.88 | 0.81 | 0.68 | 0.49 | 0.45 | 0.32 |
| h = 6 | 0.86 | 0.86 | 0.68 | 0.42 | 0.40 | 0.36 | |
| h = 12 | 0.93 | 0.92 | 0.71 | 0.43 | 0.41 | 0.38 | |
| Group Lasso | h = 1 | 0.90 | 0.86 | 0.60 | 0.51 | 0.45 | 0.30 |
| h = 6 | 0.88 | 0.88 | 0.63 | 0.46 | 0.42 | 0.34 | |
| h = 12 | 0.93 | 0.92 | 0.67 | 0.47 | 0.45 | 0.37 | |
| Mult. t | h = 1 | 0.91 | 0.89 | 0.71 | 0.50 | 0.46 | 0.31 |
| h = 6 | 0.87 | 0.93 | 0.77 | 0.49 | 0.44 | 0.34 | |
| h = 12 | 0.92 | 0.94 | 0.81 | 0.45 | 0.41 | 0.38 | |
It can easily be seen that all four Bayesian methods not only outperform the LS estimate (uniform prior on Φ), but also exhibit substantially smaller relative error compared to the random walk with drift process (point-mass prior on Φ). Further, increasing the number of predictor variables improves forecasting performance, a point argued forcefully in favor of large VAR models by Sims (1980). On the other hand, forecasting performance deteriorates for larger values of h, an expected result. Nevertheless, even for h = 12 (one year ahead) the results are still very satisfactory. Further, for the SMALL and MEDIUM VAR models, the non-hierarchical normal and hierarchical Wishart priors result in better prediction, whereas for the LARGE VAR model the group lasso prior outperforms the other forecasts.
Next, we examine closely the out-of-sample prediction performance of the following three macroeconomic variables - CPI, PAYEMS and FEDFUND under the hierarchical Wishart prior.
Note that Bańbura et al. (2010) only consider a random walk process as the naive prior. From Table 9 it can be seen that, although for CPI and PAYEMS the LS estimate performs better than the Bayesian estimates in the SMALL and MEDIUM VARs, overall the Wishart prior has better forecasting accuracy than both of the benchmark priors. As previously observed, adding information (i.e., including more variables) improves the accuracy of forecasts for all three variables. The fourth column (LARGE BVAR) provides the numbers reported in Table III of Bańbura et al. (2010), where a Bayesian VAR model on the same 134 variables, with d = 13 lags, was estimated using a normal-inverted Wishart distribution that leads to a ridge-type posterior mean estimate for the parameters in Φ, based on data covering the period 1971-2003. Although the results are not directly comparable to those obtained by our methodology, they nevertheless provide a certain degree of calibration. It can be seen that our model is more parsimonious, using only d = 6 lags, and provides better forecasting performance for all three variables at all forecasting horizons.
Next, we examine in more detail the estimated transition matrix A1 for the SMALL VAR model (p = 4) under the non-hierarchical normal and group lasso priors. Estimated posterior densities of selected entries of A1 are shown in Figure 1. It is worth noting that the non-hierarchical prior centers around a different value and exhibits less smooth behavior than the hierarchical one. This smoothness should be expected given the specification of the latter.
Further, we present the 95% posterior credible intervals (PCI) of the entries of A1 under the above two priors.
Next, in Figure 2 we depict the estimated networks for the MEDIUM VAR based on the first lag transition matrix produced by: (a) least squares and (b) a non-hierarchical normal prior, where for ease of representation the nodes of the network are abbreviated; the full list of the variable names is given in Table A1 of Appendix A in the supplementary material.
Figure 2:
Network representation of the transition matrix (A1)
As expected, for most variables their previous lag value influences the current value. Further, the LS-based network exhibits a high degree of connectivity, whereas the non-hierarchical one exhibits a sparser structure. For the latter, it is of interest that the employment index (PAYEMS), the 10-year Treasury rate (GS10) and CPI are influenced by many other variables. On the other hand, the Federal Funds Rate influences the broad stock market (SP500), as expected from finance theory, as well as GS10. In general, the sparser result provided by the non-hierarchical prior, in addition to better forecasting, also aids interpretation vis-a-vis the LS estimate.
6. Discussion
In this paper, we investigate posterior consistency in Bayesian VAR(d) models with both non-hierarchical and hierarchical matrix normal prior distributions on the transition matrices, under a Gaussian assumption for the temporal evolution of the time series under consideration and in the presence of a general covariance matrix that captures additional contemporaneous dependence between them. We establish posterior consistency for both of these priors under high-dimensional scaling. To obtain the desired results, some novel concentration inequalities are provided that are of independent interest. The methodology is illustrated on synthetic and real macroeconomic data. The proposed priors provide better forecasts than the LS estimates for periods up to one year ahead, while leading to sparser and potentially easier to interpret relationships, especially for large-scale models.
Supplementary Material
Figure 1:
Posterior densities of entries (11), (24), (32) and (42) in A1 under 4 different priors
Table 5:
Relative error in VAR (d = 1,2) with Σε = Toeplitz (ρ = 0.80), where % denotes percentage of non-zero entries in Φ0.
| | | Wishart | | | Group lasso | | | Multivariate t | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Lag d = 1 | % | n1 | n2 | n3 | n1 | n2 | n3 | n1 | n2 | n3 |
| Small VAR | 10 | 0.90 | 0.76 | 0.65 | 0.90 | 0.83 | 0.69 | 0.89 | 0.79 | 0.70 |
| 20 | 1.06 | 0.94 | 0.77 | 1.03 | 0.92 | 0.82 | 1.07 | 0.96 | 0.81 | |
| 30 | 1.23 | 1.13 | 0.99 | 1.21 | 1.09 | 0.99 | 1.23 | 1.14 | 0.97 | |
| Medium VAR | 10 | 1.75 | 1.61 | 1.50 | 1.73 | 1.63 | 1.53 | 1.74 | 1.61 | 1.56 |
| 20 | 1.92 | 1.80 | 1.67 | 1.92 | 1.78 | 1.73 | 1.91 | 1.79 | 1.72 | |
| 30 | 2.08 | 1.98 | 1.82 | 2.02 | 1.98 | 1.88 | 2.07 | 1.94 | 1.84 | |
| Large VAR | 10 | 2.58 | 2.49 | 2.33 | 2.60 | 2.50 | 2.38 | 2.60 | 2.49 | 2.33 |
| 20 | 2.76 | 2.65 | 2.51 | 2.75 | 2.65 | 2.54 | 2.77 | 2.62 | 2.57 | |
| 30 | 2.93 | 2.81 | 2.71 | 2.90 | 2.81 | 2.74 | 2.93 | 2.82 | 2.74 | |
| Lag d = 2 | ||||||||||
| Small VAR | 10 | 1.16 | 1.02 | 0.96 | 1.15 | 1.06 | 0.98 | 1.18 | 1.06 | 0.89 |
| 20 | 1.31 | 1.24 | 1.12 | 1.31 | 1.22 | 1.11 | 1.34 | 1.24 | 1.10 | |
| 30 | 1.49 | 1.41 | 1.23 | 1.49 | 1.41 | 1.27 | 1.49 | 1.40 | 1.26 | |
| Medium VAR | 10 | 2.25 | 2.13 | 2.04 | 2.25 | 2.17 | 2.05 | 2.27 | 2.16 | 2.01 |
| 20 | 2.43 | 2.30 | 2.16 | 2.41 | 2.32 | 2.24 | 2.45 | 2.32 | 2.26 | |
| 30 | 2.59 | 2.47 | 2.33 | 2.59 | 2.49 | 2.38 | 2.59 | 2.48 | 2.36 | |
| Large VAR | 10 | 3.37 | 3.24 | 3.10 | 3.36 | 3.25 | 3.18 | 3.36 | 3.24 | 3.11 |
| 20 | 3.53 | 3.40 | 3.29 | 3.51 | 3.43 | 3.31 | 3.54 | 3.46 | 3.36 | |
| 30 | 3.72 | 3.62 | 3.41 | 3.65 | 3.57 | 3.49 | 3.74 | 3.59 | 3.48 | |
Table 9:
Out-of-sample relative prediction error for CPI, PAYEMS and FedFund for the three VAR model specifications considered. The column LARGE BVAR corresponds to the entries of Table III in Bańbura et al. (2010) for a Bayesian VAR model with a normal-inverted Wishart prior distribution and d = 13 lags, based on the same set of variables, but covering the period 1971-2003.
| | | SMALL | MEDIUM | LARGE | LARGE BVAR |
|---|---|---|---|---|---|
| Uniform prior | | p = 4 | p = 20 | p = 134 | p = 134 |
| h = 1 | CPI | 1.05 | 0.91 | 0.44 | - |
| PAYEMS | 1.21 | 1.04 | 0.91 | - | |
| FFUND | 0.78 | 0.75 | 0.68 | - | |
| h = 6 | CPI | 1.03 | 0.97 | 0.38 | - |
| PAYEMS | 1.08 | 1.06 | 0.48 | - | |
| FFUND | 0.92 | 0.82 | 0.68 | - | |
| h = 12 | CPI | 0.98 | 0.96 | 0.42 | - |
| PAYEMS | 0.93 | 0.91 | 0.73 | - | |
| FFUND | 0.92 | 0.88 | 0.72 | - | |
| Random Walk | |||||
| h = 1 | CPI | 0.43 | 0.41 | 0.34 | 0.50 |
| PAYEMS | 0.45 | 0.43 | 0.39 | 0.46 | |
| FFUND | 0.50 | 0.45 | 0.36 | 0.75 | |
| h = 6 | CPI | 0.38 | 0.34 | 0.28 | 0.40 |
| PAYEMS | 0.53 | 0.48 | 0.39 | 0.50 | |
| FFUND | 0.41 | 0.37 | 0.36 | 1.29 | |
| h = 12 | CPI | 0.51 | 0.45 | 0.42 | 0.44 |
| PAYEMS | 0.51 | 0.88 | 0.73 | 0.78 | |
| FFUND | 0.33 | 0.31 | 0.28 | 1.93 | |
Acknowledgments
* The authors gratefully acknowledge support from NSF grants DMS-1511945 (KK), IIS-1632730 and CNS-1422078 (GM), and NIH grant R01 5R01GM11402902.
Contributor Information
Satyajit Ghosh, Email: satyajitghosh90@ufl.edu.
Kshitij Khare, Email: kdkhare@ufl.edu.
George Michailidis, Email: gmichail@ufl.edu.
References
- Armagan A, Clyde M, and Dunson DB (2011). Generalized beta mixtures of Gaussians. In Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, and Weinberger KQ (Eds.), Advances in Neural Information Processing Systems 24, pp. 523-531. Curran Associates, Inc.
- Armagan A, Dunson DB, and Lee J (2013). Generalized double Pareto shrinkage. Statistica Sinica 23, 119-143.
- Armagan A, Dunson DB, Lee J, Bajwa WU, and Strawn N (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100(4), 1011-1018.
- Bańbura M, Giannone D, and Reichlin L (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics 25(1), 71-92.
- Basu S and Michailidis G (2015). Regularized estimation in sparse high-dimensional time series models. Ann. Statist. 43(4), 1535-1567.
- Basu S, Shojaie A, and Michailidis G (2015). Network Granger causality with inherent grouping structure. Journal of Machine Learning Research 16, 417-453.
- Bernanke BS, Boivin J, and Eliasz P (2005). Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics 120(1), 387-422.
- Billio M, Getmansky M, Lo AW, and Pelizzon L (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics 104(3), 535-559.
- Bontemps D (2011). Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors. Ann. Statist. 39(5), 2557-2584.
- Carvalho CM, Polson NG, and Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465-480.
- Doan T, Litterman R, and Sims C (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews 3(1), 1-100.
- Ghoshal S (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5(2), 315-331.
- Griffin JE and Brown PJ (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 5(1), 171-188.
- Korobilis D (2013). VAR forecasting using Bayesian variable selection. Journal of Applied Econometrics 28, 204-230.
- Kyung M, Gill J, Ghosh M, and Casella G (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369-411.
- Lee J and Oh H-S (2013). Bayesian regression based on principal components for high-dimensional data. Journal of Multivariate Analysis 117, 175-192.
- Lin J and Michailidis G (2017). Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models. Journal of Machine Learning Research 18(117), 1-49.
- Litterman RB et al. (1979). Techniques of forecasting using vector autoregressions. Technical report.
- Lütkepohl H (2007). New Introduction to Multiple Time Series Analysis. Springer.
- Melnyk I and Banerjee A (2016). Estimating structured vector autoregressive models. In Proceedings of the 33rd International Conference on Machine Learning, pp. 830-839.
- Mol CD, Giannone D, and Reichlin L (2008). Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics 146(2), 318-328.
- Park T and Casella G (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681-686.
- Raskutti G and Yuan M (2015). Convex regularization for high-dimensional tensor regression. arXiv e-prints.
- Schweinberger M, Babkin S, and Ensor KB (2017). High-dimensional multivariate time series with additional structure. Journal of Computational and Graphical Statistics 26(3), 610-622.
- Seth AK, Barrett AB, and Barnett L (2015). Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience 35(8), 3293-3297.
- Sims CA (1980). Macroeconomics and reality. Econometrica 48(1), 1-48.
- Sparks DK, Khare K, and Ghosh M (2015). Necessary and sufficient conditions for high-dimensional posterior consistency under g-priors. Bayesian Anal. 10(3), 627-664.
- Stock JH and Watson MW (2005). An empirical comparison of methods for forecasting using many predictors. Manuscript, Princeton University.
- Zellner A (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57(298), 348-368.