Published in final edited form as: J Am Stat Assoc. 2018 Aug 7;114(526):735–748. doi: 10.1080/01621459.2018.1437043

High-Dimensional Posterior Consistency in Bayesian Vector Autoregressive Models

Satyajit Ghosh 1, Kshitij Khare 1, George Michailidis 1

Abstract

Vector autoregressive (VAR) models aim to capture linear temporal interdependencies amongst multiple time series. They have been widely used in macroeconomics and financial econometrics, and more recently have found novel applications in functional genomics and neuroscience. These applications have also accentuated the need to investigate the behavior of the VAR model in a high-dimensional regime, which provides novel insights into the role of temporal dependence for regularized estimates of the model’s parameters. However, hardly anything is known regarding properties of the posterior distribution for Bayesian VAR models in such regimes. In this work, we consider a VAR model with two prior choices for the autoregressive coefficient matrix: a non-hierarchical matrix-normal prior and a hierarchical prior, which corresponds to an arbitrary scale mixture of normals. We establish posterior consistency for both these priors under standard regularity assumptions, when the dimension p of the VAR model grows with the sample size n (but still remains smaller than n). A special case corresponds to a shrinkage prior that introduces (group) sparsity in the columns of the model coefficient matrices. The performance of the model estimates is illustrated on synthetic and real macroeconomic data sets.

Keywords: Vector Autoregressive Models, Posterior consistency, Shrinkage prior, Bayesian Lasso

1. Introduction

There has been recent interest in modeling high-dimensional time series data sets. In macroeconomics, Mol et al. (2008) advocated the need to include a large number of variables in econometric models to improve forecastability, while Billio et al. (2012) examined stock returns of many financial institutions to assess systemic risk of the financial system. Similar modeling challenges arise in functional genomics for the reconstruction of regulatory networks, as discussed in Basu et al. (2015), while in neuroscience one is interested in understanding functional connectivity between brain regions (Seth et al. (2015)).

A popular and informative model has been the vector autoregression (VAR), which captures linear temporal dependencies between time series. The VAR model and its properties have been thoroughly explored in low-dimensional settings, both from a frequentist perspective (for a comprehensive overview see Lütkepohl (2007)) and a Bayesian one (Bańbura et al. (2010)).

More recently, Basu and Michailidis (2015) provided an in depth analysis of the model for Gaussian data in a high-dimensional setting under sparsity assumptions, while Melnyk and Banerjee (2016) extended the results to other regularizers (e.g. group lasso, sparse group lasso, etc.). The results of Basu and Michailidis (2015); Melnyk and Banerjee (2016) and related follow-up work (Raskutti and Yuan (2015); Schweinberger et al. (2017); Lin and Michailidis (2017)) indicate that the resulting estimation error rates are those obtained for independent and identically distributed data times a factor that captures the temporal dependence in the data.

On the Bayesian front, there has been primarily methodological/computational work for low-dimensional VAR models. The so-called Minnesota prior (Litterman et al. (1979); Doan et al. (1984)) has been a staple of applied econometric work involving VAR models. This is a normal prior distribution on the elements of the transition matrix that puts stronger weights on the “own” lags of each time series, since they are considered more informative for forecasting purposes than lags from “other” time series. For large size VAR models, Bańbura et al. (2010) advocate a normal-inverted Wishart distribution that leads to a posterior mean that can be interpreted as a ridge shrinkage estimator, suitable for such models. A first attempt for Bayesian estimation of VAR models combined with variable selection is presented in Korobilis (2013), where an indicator variable is specified for each parameter in the transition matrix that indicates whether the cross-autocorrelation coefficient is included or set to zero. A prior is specified for the indicator variables that in principle can also be combined with the Minnesota prior.

On the other hand, Bayesian investigations into high-dimensional asymptotics of statistical models that incorporate sparsity with temporally dependent data are not in general available to the best of our knowledge. Hence, the main objective of this work is to study posterior (estimation) consistency for a VAR model, which asserts that the posterior concentrates around the “true” parameter value (in an appropriate norm) as the sample size increases. There is a rich literature on high-dimensional posterior estimation consistency for linear regression models for independent and identically distributed data. Ghoshal (1999) established posterior consistency and asymptotic normality with a general prior on the p-vector of regression coefficients (with appropriate positivity and Lipschitz assumptions) when $p^3 \log p / n \to 0$ and $p^4 \log p / n \to 0$, respectively. Bontemps (2011) extended the work of Ghoshal (1999) by permitting the model to be misspecified and the number of predictor variables to grow proportionally to the sample size. Armagan et al. (2013) focus on shrinkage priors, which are appropriate scale mixtures of normal priors and induce weak sparsity in the vector of regression coefficients (see Carvalho et al. (2010); Griffin and Brown (2010); Armagan et al. (2011, 2013)). They establish posterior consistency under a simple sufficient condition on prior concentration when p = o(n). Lee and Oh (2013) establish posterior consistency under a high-dimensional Bayesian PCA regression setup with p > n under appropriate assumptions on the rank of the design matrix. Posterior estimation consistency in linear regression models with g-priors has also been addressed in Sparks et al. (2015).

A crucial difference between the linear regression models considered in the above work and VAR models (expressed as a linear model) is that the design matrix in the latter case is random, and exhibits dependencies both between its rows and across its columns, and also with the error term in the model (see Section 2). This leads to a significantly more involved and challenging theoretical analysis that we successfully resolve. In this paper, we investigate high-dimensional posterior consistency for Bayesian VAR models in two natural and relevant settings: (a) with a non-hierarchical matrix normal prior on the dp × p autoregressive parameter matrix and (b) a hierarchical prior which corresponds to a general scale mixture of normals. In particular, this includes spherically symmetric priors such as the multivariate-t and standard shrinkage priors which induce (group) sparsity in the columns of the coefficient matrices, such as the group structure in Basu et al. (2015). Further, we employ a flat (uniform) prior distribution for the error term. Note that the joint maximum likelihood estimation problem for a sparse VAR model, with a sparse error covariance matrix is investigated in Lin and Michailidis (2017). The posterior consistency results are established under mild regularity assumptions on the underlying spectral density and with p = o(n/ log n). The key to handling the dependencies, within the design matrix and also between the design matrix and the error term, is a pair of high-dimensional concentration inequalities established in the supplementary material (Propositions B1 and B3). Note that we are considering the “large p large n” setting with p = o(n). However, we make no assumption reducing the effective dimension of the “true parameter matrix”. We only assume that the matrix norm of the true parameter matrix is of the order p in the non-hierarchical prior setting and bounded by a constant in the hierarchical prior setting. The large p small n situation, where p is allowed to grow at a much faster rate than n is also of interest, but assumptions such as sparsity/restricted eigenvalue type conditions are required, which in turn reduce the effective dimension of the true parameters. General posterior consistency results for VAR models in the large p small n setting are also not available to the best of our knowledge and are topics of future discussion/research.

The remainder of the paper is organized as follows. In Section 2, we introduce the VAR model and necessary notions of posterior consistency. We consider the non-hierarchical matrix normal prior on the coefficient matrix in Section 3.1 and establish posterior consistency under suitable regularity assumptions. In Section 3.2, we prove posterior consistency considering a hierarchical prior corresponding to a scale-mixture of matrix normals. In Section 4 and Section 5 the methodology/results of this paper are illustrated on simulated and real data sets, respectively. Finally, we conclude with a discussion in Section 6.

1.1. Notation

Throughout this paper, ℤ, ℝ and ℂ denote the sets of integers, real numbers and complex numbers, respectively. We denote the cardinality of a set J by |J|. For a vector $v \in \mathbb{R}^p$, $\|v\| \equiv \sqrt{\sum_j v_j^2}$ denotes the ℓ2-norm. For a matrix A, ‖A‖ and σmax(A) denote the spectral norm, i.e., $\|A\| = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$, and the largest singular value of A, respectively. For a symmetric or Hermitian matrix A, we denote its maximum and minimum eigenvalues by λmax(A) and λmin(A). The vector $e_i$ is used for the i-th unit vector in ℝp. Bold uppercase letters are only used to denote matrices, and the vectorized forms of such matrices are represented by the corresponding lowercase letters. For example, if Φ is a p × p matrix then ϕ is vec(Φ). Also, O represents a zero matrix of appropriate dimension, and in general vectors are denoted by italicized bold lowercase letters.

2. Model Formulation

For a p-dimensional stationary time series {Xt}, a vector autoregressive model of lag-d is given by

$$X_t = c + \sum_{i=1}^{d} A_i X_{t-i} + \varepsilon_t. \quad (1)$$

The temporal dependence structure of the VAR model is characterized by the p × p transition matrices A1, A2, ···, Ad, and c is a p × 1 location vector which we choose to be 0. In the Gaussian VAR, the errors εt are i.i.d. $N_p(0, \Sigma_\varepsilon)$, where Σε is a p × p unknown error covariance matrix. The model in (1) can be rewritten in the Yule-Walker representation (Lütkepohl (2007)) as

$$X_t - \mu = \sum_{i=1}^{d} A_i (X_{t-i} - \mu) + \varepsilon_t,$$

where $\mu = (I - A_1 - A_2 - \cdots - A_d)^{-1} c$ is known as the process mean. Usually μ will not be known in advance; in that case, μ may be estimated by the vector of sample means $\bar{X} = \frac{1}{n} \sum_t X_t$. An alternative estimator is $\hat{\mu} = (I - \hat{A}_1 - \hat{A}_2 - \cdots - \hat{A}_d)^{-1} \hat{c}$, in which $\hat{c}$ and the $\hat{A}_i$’s are the least squares estimators. Henceforth we assume without loss of generality that μ = 0. Based on the data {X0, ···, XT}, we define the response matrix Y and the design matrix X as follows,

$$Y = \begin{bmatrix} X_T^\top \\ \vdots \\ X_d^\top \end{bmatrix}_{n \times p} \qquad X = \begin{bmatrix} X_{T-1}^\top & \cdots & X_{T-d}^\top \\ \vdots & & \vdots \\ X_{d-1}^\top & \cdots & X_0^\top \end{bmatrix}_{n \times dp}.$$

We can now rewrite the above model in a linear regression setup as

$$Y = X\Phi + E \quad (2)$$

where

$$\Phi = \begin{bmatrix} A_1^\top \\ A_2^\top \\ \vdots \\ A_d^\top \end{bmatrix}_{dp \times p} \qquad E = \begin{bmatrix} \varepsilon_T^\top \\ \vdots \\ \varepsilon_d^\top \end{bmatrix}_{n \times p}.$$

In this formulation, the number of samples is n = T − d + 1 and the number of unknown parameters is $q = dp^2$. Vectorizing (column-wise) each matrix, we get

$$y \equiv \mathrm{vec}(Y) = Z\phi + \varepsilon,$$

where $Z := (I_p \otimes X)$, $\phi = \mathrm{vec}(\Phi)$ and $\varepsilon = \mathrm{vec}(E)$. In this paper, we consider a high-dimensional setting where the dimension p of the VAR model increases with the sample size n. However, we assume that the lag d does not vary with n. This basic regression formulation lends itself easily to a Bayesian analysis in which priors are placed on the unknown parameter matrices Φ and Σε.
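To make the construction above concrete, the following is a minimal sketch (in Python with NumPy; all function and variable names are illustrative and not from the authors' code) of how Y and X in (2) can be assembled from observations X0, ..., XT:

import numpy as np

def build_var_regression(X_obs, d):
    # X_obs: (T+1) x p array holding X_0, ..., X_T; d: lag order.
    # Returns Y (n x p) and X (n x dp) with n = T - d + 1, as in (2).
    # Rows here ascend in time; the paper stacks them in reverse order,
    # which is immaterial for estimation.
    T = X_obs.shape[0] - 1
    Y = X_obs[d:]                                       # X_d, ..., X_T
    X = np.hstack([X_obs[d - i : T + 1 - i] for i in range(1, d + 1)])
    return Y, X

# Vectorized form of (2): y = Z phi + eps with Z = I_p kron X, e.g.,
# Z = np.kron(np.eye(X_obs.shape[1]), X)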

As previously mentioned, we let the dimension p = pn of the VAR model vary with n, so that our results are relevant to high-dimensional settings. We assume that our data come from a true VAR model described as follows: for every n ≥ 1, let $\mathcal{Y}_n \equiv (X_{n,0}, \ldots, X_{n,n+d-1})$ be the set of observations for sample size n, which satisfy $X_{n,k} = \sum_{i=1}^{d} A_{i,0n} X_{n,k-i} + \varepsilon_{n,k}$ for d ≤ k ≤ n + d − 1. The errors $\{\varepsilon_{n,k}\}_{k=d}^{n+d-1}$ are i.i.d. $N_{p_n}(0, \Sigma_{\varepsilon,0n})$. Here $\{\Phi_{0n}\}_{n \geq 1}$ denotes the sequence of true coefficient matrices, given by $\Phi_{0n} \equiv [A_{1,0n}^\top \; A_{2,0n}^\top \; \cdots \; A_{d,0n}^\top]^\top$, and $\{\Sigma_{\varepsilon,0n}\}_{n \geq 1}$ denotes the sequence of true error covariance matrices. Let ℙ0 denote the probability measure underlying the true model described above.

Next, consider a Bayesian model which builds on (2) by placing priors on the parameters (Φ, Σε). In particular, let $\{\pi_n(\Phi, \Sigma_\varepsilon)\}_{n \geq 1}$ and $\{\pi_n(\Phi, \Sigma_\varepsilon \mid \mathcal{Y}_n)\}_{n \geq 1}$ denote the sequences of the corresponding (joint) prior and posterior densities. Analogously, $\{\Pi_n(\cdot)\}_{n \geq 1}$ and $\{\Pi_n(\cdot \mid \mathcal{Y}_n)\}_{n \geq 1}$ denote the corresponding sequences of (joint) prior and posterior distributions. We will also use the notation πn and Πn to denote the marginal prior and posterior densities/distributions for Φ and Σε as appropriate.

Note that our main parameter of interest is Φ, while the error covariance matrix Σε is more of an unknown nuisance parameter that we need to deal with. One would hope that as the sample size n tends to infinity, the posterior probability assigned to any ε-neighborhood of $\Phi_{0n}$ converges to 1 almost surely under ℙ0. The following definition formalizes this notion of posterior consistency.

Definition 1. The sequence of posterior distributions $\Pi_n(\cdot \mid \mathcal{Y}_n)$ is said to be consistent at $\{\Phi_{0n}\}_{n \geq 1}$ if, for every ε > 0, $\Pi_n(\|\Phi - \Phi_{0n}\| > \varepsilon \mid \mathcal{Y}_n) \to 0$ as n → ∞ a.s. ℙ0.

For ease of exposition, we will henceforth denote Φ0n as Φ0, and Σε,0n as Σε,0, and highlight their dependence on n as needed.

2.1. Stability of VAR(d) process

Since VAR models are dynamical systems, the notion of ‘stability’ plays an important role in their analysis and asymptotic properties.

Definition 2. A VAR(d) process as defined in (1) is said to be stable if the matrix-valued polynomial $\mathcal{A}(z) \equiv I_p - \sum_{i=1}^{d} A_i z^i$ satisfies $\det(\mathcal{A}(z)) \neq 0$ on the unit circle of the complex plane {z ∈ ℂ : |z| = 1}.

The autocovariance function of a p-dimensional centered covariance-stationary time series {Xt} is defined as $\Gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h})$ for all t, h, and the corresponding spectral density is given by $f_X(\theta) \equiv \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \Gamma_X(h) e^{-ih\theta}$, θ ∈ [−π, π]. For a Gaussian stable VAR(d) model the spectral density has the closed-form expression

$$f_X(\theta) = \frac{1}{2\pi} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{-1} \Sigma_\varepsilon \left[ \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{-1} \right]^{*},$$

where * denotes the Hermitian conjugate of a matrix and $i \equiv \sqrt{-1}$. The autocovariance function, which characterizes a centered Gaussian process, can be used to quantify the temporal and cross-sectional dependence for VAR(d) models. The peak of the spectral density, measured by its maximum eigenvalue $\mathcal{M}(f_X) \equiv \max_{\theta \in [-\pi,\pi]} \lambda_{\max}(f_X(\theta))$, can be used as a measure of stability of the process, while the minimum eigenvalue $\mathfrak{m}(f_X) \equiv \min_{\theta \in [-\pi,\pi]} \lambda_{\min}(f_X(\theta))$ captures cross-dependence among its components. However, as mentioned in Basu and Michailidis (2015), instead of working with $\mathcal{M}(f_X)$ and $\mathfrak{m}(f_X)$ it is often easier to work with the eigenvalues of $\mathcal{A}^{*}(z)\mathcal{A}(z)$ over the unit circle {z ∈ ℂ : |z| = 1}. Let

$$\mu_{\min}(\mathcal{A}) \equiv \min_{|z|=1} \lambda_{\min}\left( \mathcal{A}^{*}(z) \mathcal{A}(z) \right) = \min_{\theta \in [-\pi,\pi]} \lambda_{\min}\left( \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{*} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right) \right),$$
$$\mu_{\max}(\mathcal{A}) \equiv \max_{|z|=1} \lambda_{\max}\left( \mathcal{A}^{*}(z) \mathcal{A}(z) \right) = \max_{\theta \in [-\pi,\pi]} \lambda_{\max}\left( \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right)^{*} \left( I_p - \sum_{j=1}^{d} A_j e^{-ij\theta} \right) \right).$$

For a stable VAR(d) process, $0 < \mu_{\min}(\mathcal{A}) \leq \mu_{\max}(\mathcal{A}) < \infty$. Since the εt are i.i.d. $N_p(0, \Sigma_\varepsilon)$, each row of X is distributed as $N_{dp}(0, C_X)$, where the covariance matrix $C_X$ has the following structure,

$$C_X = \begin{bmatrix} \Gamma(0) & \Gamma(1) & \cdots & \Gamma(d-1) \\ \Gamma(1)^\top & \Gamma(0) & \cdots & \Gamma(d-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(d-1)^\top & \Gamma(d-2)^\top & \cdots & \Gamma(0) \end{bmatrix}_{dp \times dp} \quad (3)$$

The quantities $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$ provide useful bounds for the eigenvalues of $C_X$. As mentioned in Melnyk and Banerjee (2016), and from Proposition 2.3 and equation (2.6) of Basu and Michailidis (2015), we have the following chain of inequalities,

$$\frac{\lambda_{\min}(\Sigma_\varepsilon)}{\mu_{\max}(\mathcal{A})} \leq 2\pi \, \mathfrak{m}(f_X) \leq \lambda_{\min}(C_X) \leq \lambda_{\max}(C_X) \leq 2\pi \, \mathcal{M}(f_X) \leq \frac{\lambda_{\max}(\Sigma_\varepsilon)}{\mu_{\min}(\mathcal{A})}. \quad (4)$$

We finally note that the p-dimensional VAR(d) model in (1) can be equivalently written as a dp-dimensional VAR(1) process. Let

$$\tilde{X}_t = \begin{bmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-d+1} \end{bmatrix}_{dp \times 1}, \qquad \tilde{A}_1 = \begin{bmatrix} A_1 & A_2 & \cdots & A_{d-1} & A_d \\ I_p & O & \cdots & O & O \\ O & I_p & \cdots & O & O \\ \vdots & & \ddots & & \vdots \\ O & O & \cdots & I_p & O \end{bmatrix}_{dp \times dp} \quad \text{and} \quad \omega_t = \begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{dp \times 1}.$$

Then the new representation becomes

$$\tilde{X}_t = \tilde{A}_1 \tilde{X}_{t-1} + \omega_t, \quad t = d, \ldots, n+d-1. \quad (5)$$

It follows that $X = \left[ \tilde{X}_{n+d-2} \; \tilde{X}_{n+d-3} \; \cdots \; \tilde{X}_{d-1} \right]^\top$, i.e., the i-th row of X is the (dp × 1) vector $\tilde{X}_{n+d-i-1}$. Note that if the underlying VAR(d) process {Xt} is stable, then the process $\{\tilde{X}_t\}$, with characteristic polynomial $\tilde{\mathcal{A}}(z) \equiv I_{dp} - \tilde{A}_1 z$, is also stable. This is because $\{\tilde{X}_t\}$ is generated according to a VAR(1) process with transition matrix $\tilde{A}_1$, and $\{\tilde{X}_t\}$ is stable if and only if {Xt} is stable (Lütkepohl (2007)). Based on $\tilde{\mathcal{A}}(z)$, we define

$$\mu_{\min}(\tilde{\mathcal{A}}) \equiv \min_{\theta \in [-\pi,\pi]} \lambda_{\min}\left( \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right)^{*} \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right) \right), \qquad \mu_{\max}(\tilde{\mathcal{A}}) \equiv \max_{\theta \in [-\pi,\pi]} \lambda_{\max}\left( \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right)^{*} \left( I_{dp} - \tilde{A}_1 e^{-i\theta} \right) \right). \quad (6)$$

While $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$ are not necessarily the same as $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$, the inequalities in (4) still hold with $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$ replaced by $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$, respectively.
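The quantities introduced in this subsection are directly computable. The following sketch (Python; solve_discrete_lyapunov does exist in scipy.linalg, while all other names are illustrative) checks stability via the companion matrix of (5), approximates the extrema in (6) on a grid over [−π, π], and obtains CX in (3) as the stationary covariance of the companion process:

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def companion(A_list):
    # Stack A_1, ..., A_d into the dp x dp companion matrix A_tilde of (5).
    d, p = len(A_list), A_list[0].shape[0]
    top = np.hstack(A_list)
    if d == 1:
        return top
    bottom = np.hstack([np.eye((d - 1) * p), np.zeros(((d - 1) * p, p))])
    return np.vstack([top, bottom])

def is_stable(A_list):
    # VAR(d) stability <=> all companion eigenvalues lie inside the unit circle.
    return np.max(np.abs(np.linalg.eigvals(companion(A_list)))) < 1

def mu_min_max(A_tilde, n_grid=200):
    # Grid approximation of mu_min(A_tilde) and mu_max(A_tilde) in (6).
    dp = A_tilde.shape[0]
    mu_min, mu_max = np.inf, 0.0
    for theta in np.linspace(-np.pi, np.pi, n_grid):
        Az = np.eye(dp) - A_tilde * np.exp(-1j * theta)
        eig = np.linalg.eigvalsh(Az.conj().T @ Az)   # Hermitian, sorted ascending
        mu_min, mu_max = min(mu_min, eig[0]), max(mu_max, eig[-1])
    return mu_min, mu_max

def C_X(A_list, Sigma_eps):
    # C_X in (3) solves the Lyapunov equation C = A_tilde C A_tilde' + Var(omega_t).
    d, p = len(A_list), Sigma_eps.shape[0]
    Sigma_omega = np.zeros((d * p, d * p))
    Sigma_omega[:p, :p] = Sigma_eps
    return solve_discrete_lyapunov(companion(A_list), Sigma_omega)

The eigenvalues of C_X computed this way can be checked numerically against the bounds in (4).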

3. Bayesian Estimation and Posterior Consistency

In this section, we first discuss Bayesian estimation of VAR models with non-hierarchical and hierarchical scale-mixture matrix normal prior distributions on the parameter matrix Φ (conditioned on Σε), and subsequently establish high-dimensional posterior consistency in this setting under mild regularity assumptions. We start by introducing the necessary notation for the matrix-variate normal distribution. Let $\mathcal{M}_{a,b}$ denote the space of a × b matrices.

Definition 3. An a × b random matrix X is said to follow a matrix-variate normal distribution ($MN_{a \times b}(M, B_1, B_2)$) if its density function (on the space $\mathcal{M}_{a,b}$) is given by

$$\frac{1}{(2\pi)^{ab/2}} |B_1|^{-b/2} |B_2|^{-a/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ B_1^{-1} (X - M) B_2^{-1} (X - M)^\top \right\} \right].$$

Here $M \in \mathcal{M}_{a,b}$, and $B_1 \in \mathcal{M}_{a,a}$ and $B_2 \in \mathcal{M}_{b,b}$ are positive definite matrices corresponding to the covariances among the rows and columns of X, respectively. Note that the matrix normal distribution is related to the multivariate normal distribution in the following way: $X \sim MN_{n \times p}(M, B_1, B_2)$ if and only if $\mathrm{vec}(X) \sim N_{np}(\mathrm{vec}(M), B_2 \otimes B_1)$.

3.1. Non-Hierarchical Matrix Normal Prior

We consider a matrix normal prior for Φ conditional on Σε, and a flat (uniform) prior on Σε. In particular,

$$\Phi \mid \Sigma_\varepsilon \sim MN_{dp \times p}(O, U^{-1}, \Sigma_\varepsilon) \quad \text{and} \quad \pi(\Sigma_\varepsilon) \propto 1, \quad (7)$$

where U is a dp × dp known positive definite matrix. Note that under this matrix normal prior, $U^{-1}$ and Σε are the covariance matrices corresponding to the rows and columns of Φ, respectively. The posterior distribution of Φ (conditional on Σε) can easily be shown to be $MN_{dp \times p}(\Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon)$, where $\Phi^{PM} \equiv (X^\top X + U)^{-1} X^\top Y$ is the (conditional) posterior mean, which does not depend on Σε. Hence, the unconditional posterior mean of Φ is available in closed form and is given by $\Phi^{PM}$. It follows by standard computations using the multivariate normal density that the marginal posterior density of Σε is proportional to

$$|\Sigma_\varepsilon|^{-n/2} \, \mathrm{etr}\left( -\frac{1}{2} \Sigma_\varepsilon^{-1} \hat{\Sigma}_{res} \right),$$

where $\hat{\Sigma}_{res} = Y^\top \left( I - X (X^\top X + U)^{-1} X^\top \right) Y$. This density is proper if and only if n > 2p, in which case the marginal posterior density of Σε corresponds to the Inverse-Wishart density with scale matrix $\hat{\Sigma}_{res}$ and shape parameter n − p − 1. We summarize the above observations in the lemma below.

Lemma 3.1. Under the non-hierarchical prior in (7), the posterior density of (Φ,Σε) is proper if and only if n > 2p. In this case

$$\Phi \mid \Sigma_\varepsilon, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right), \qquad \Sigma_\varepsilon \mid \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right).$$
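Since both distributions in Lemma 3.1 are standard, exact Monte Carlo draws from the posterior are straightforward. A hedged sketch follows (scipy.stats does provide invwishart and matrix_normal; the function name and defaults below are illustrative):

import numpy as np
from scipy.stats import invwishart, matrix_normal

def posterior_draws_nonhier(Y, X, U, n_draws=1000, rng=None):
    n, p = Y.shape
    assert n > 2 * p, "posterior is proper if and only if n > 2p (Lemma 3.1)"
    K_inv = np.linalg.inv(X.T @ X + U)
    Phi_pm = K_inv @ X.T @ Y                 # posterior mean Phi^PM
    S_res = Y.T @ (Y - X @ Phi_pm)           # = Y'(I - X(X'X+U)^{-1}X')Y
    draws = []
    for _ in range(n_draws):
        # Sigma_eps | Y ~ Inverse-Wishart(scale = S_res, shape = n - p - 1)
        Sigma = invwishart.rvs(df=n - p - 1, scale=S_res, random_state=rng)
        # Phi | Sigma_eps, Y ~ MN(Phi^PM, (X'X + U)^{-1}, Sigma_eps)
        Phi = matrix_normal.rvs(mean=Phi_pm, rowcov=K_inv, colcov=Sigma,
                                random_state=rng)
        draws.append((Phi, Sigma))
    return Phi_pm, draws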

Assumptions for posterior consistency

We will establish our results under the high-dimensional setting from Section 2. Recall that Φ0 = Φ0n denotes the true underlying parameter matrix, and Σε,0 = Σε,0n denotes the true underlying error covariance matrix in this setting. The quantities $\mu_{\min}(\tilde{\mathcal{A}})$, $\mu_{\max}(\tilde{\mathcal{A}})$ and $C_X$ are as defined in (6) and (3), but with Φ0 and Σε,0 as the underlying parameter values. We assume the following:

Assumption A1. The VAR(d) model given in (1) is stable.

Assumption A2. $\frac{1 + \mu_{\max}(\tilde{\mathcal{A}})}{\mu_{\min}(\tilde{\mathcal{A}})}$ is $O\left( \sqrt{\frac{n}{p^5}} \right)$ as n → ∞.

Assumption A3. $\inf_{n \geq 1} \lambda_{\min}(C_X) > 0$ and $\lambda_{\max}(\Sigma_{\varepsilon,0n}) = O(1)$.

Assumption A4. The true parameter matrix Φ0 of the VAR model (2) and the hyperparameter U in (7) are such that $\|\Phi_0^\top U \Phi_0\| = o(n)$ and $\|U \Phi_0\| = o(n)$.

Assumption A5. p = o(n).

When d = 1, we deal with a VAR(1) model, $C_X$ becomes $\Gamma_X(0)$, and $\mu_{\min}(\tilde{\mathcal{A}})$ and $\mu_{\max}(\tilde{\mathcal{A}})$ coincide with $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$, respectively. Assumption A1 is a standard assumption which ensures that the underlying VAR process is well behaved. Assumption A2 plays an important role in deriving high-dimensional concentration bounds for $X^\top X / n$ and $X^\top E / n$ around $C_X$ and O, respectively (see Propositions B.1 and B.3 in the supplementary material). Assumption A3 is needed to ensure that $\lambda_{\min}(X^\top X / n)$ is bounded away from 0 with high probability. Further, if each column of Φ is independently and identically distributed according to a standard normal prior, that is, $U = I_{dp}$, Assumption A4 reduces to $\|\Phi_0^\top \Phi_0\| = o(n)$ and $\|\Phi_0\| = o(n)$.

We now state the main theoretical result of posterior consistency with a non-hierarchical matrix normal prior distribution on Φ. The proof is given in Appendix C.1 of the supplementary material.

Theorem 3.2 (Posterior consistency for non-hierarchical prior). For any centered VAR(d) model (2) with the non-hierarchical prior (7) on Φ satisfying Assumptions A1–A5, posterior consistency for the parameter matrix holds, i.e., for every fixed ε > 0,

$$E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] \to 0 \quad \text{as } n \to \infty,$$

where Φ0 is the true parameter matrix under the model (2).

A natural question is whether the assumption p = o(n) can be relaxed while retaining posterior consistency. In the lemma below, we consider a situation in which Assumptions A1–A4 are satisfied and p is of the same order as n, and prove that the resulting posterior is not consistent. The proof is given in Appendix C.2 of the supplementary material.

Lemma 3.3. Consider a (sequence of) VAR(1) models with $p_n = \gamma n$, $\Phi_{0n} = \alpha I_{p_n}$ and $\Sigma_{\varepsilon,0n} = I_{p_n}$, where $\gamma \in (0, \frac{1}{2})$ and α ∈ (0, 1) do not depend on n. If we use the non-hierarchical prior (7) on Φ with $\|U\| = o(n)$, then there exists ε > 0 such that

$$\liminf_{n \to \infty} E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] > 0.$$

Remark. Note that the condition $\|U\| = o(n)$ assumed in Lemma 3.3 corresponds to Assumption A4 in the setting of the lemma. The reason for making this assumption is that we want to show that violating Assumption A5 (p = o(n)) can lead to posterior inconsistency even when all of Assumptions A1–A4 hold. If we instead also violate Assumption A4, by letting $\|U\|$ grow at the same rate as n or faster than n, then the posterior inconsistency proof becomes comparatively easier. We provide the corresponding proofs in Supplemental Section C.3 and Supplemental Section C.4, respectively.

3.2. Hierarchical Normal-mixture Prior

Next, we study the posterior consistency of the parameter matrix in model (2) in which Φ has the following hierarchical prior

$$\Phi \mid \Sigma_\varepsilon, U \sim MN_{dp \times p}(O, U^{-1}, \Sigma_\varepsilon), \quad \pi(\Sigma_\varepsilon) \propto 1, \quad \text{and} \quad U \sim \pi_{scl}(\cdot), \quad (8)$$

where U is a dp × dp random matrix having probability density $\pi_{scl}(\cdot)$ over the space $\mathcal{M}_{dp}^{+}$ of dp × dp positive definite matrices. As shown below, the group lasso and multivariate t-distribution priors on Φ can be obtained from (8) using appropriate choices of $\pi_{scl}(\cdot)$. The lemma below shows that the posterior is proper if n > (d + 1)p, and provides the form of various conditional and marginal posterior densities. The proof is given in Appendix C.5 of the supplementary material.

Lemma 3.4. Under the hierarchical normal-mixture prior in (8), the posterior density of (Φ,Σε,U) is proper if n > (d + 1)p. In this case

$$\Phi \mid \Sigma_\varepsilon, U, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right)$$
$$\Sigma_\varepsilon \mid U, \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right)$$
$$\pi(U \mid \mathcal{Y}_n) \propto \frac{|U|^{dp/2}}{|X^\top X + U|^{dp/2}} \, |\hat{\Sigma}_{res}|^{-(n-p-1)/2} \, \pi_{scl}(U).$$

The Bayesian group lasso prior was proposed by Kyung et al. (2010) in the context of linear regression. We adapt it to the VAR setting as follows. Suppose the rows of Φ are divided into G groups $\Phi_{[1]}, \ldots, \Phi_{[G]}$, where each $\Phi_{[g]}$ is an $m_g \times p$ sub-matrix of Φ (hence $\sum_{g=1}^{G} m_g = dp$), and $X_g$ is the $n \times m_g$ submatrix of X corresponding to the group $\Phi_{[g]}$. The frequentist group lasso estimator (conditional on Σε) is obtained by solving

$$\min_{\Phi_{[1]}, \ldots, \Phi_{[G]}} \left\| \left( Y - \sum_{g=1}^{G} X_g \Phi_{[g]} \right) \Sigma_\varepsilon^{-1/2} \right\|_F^2 + \sum_{g=1}^{G} \lambda_g \left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F,$$

where λg is a tuning parameter corresponding to the group g. The group lasso estimator (conditional on Σε) can also be expressed as the maximum a posteriori probability (MAP) estimate under model (2) with the prior

$$\pi(\Phi \mid \Sigma_\varepsilon) \propto \exp\left( - \sum_{g=1}^{G} \lambda_g \left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F \right),$$

which is a multivariate generalization of the double exponential prior and can also be expressed as a scale mixture of normals with Gamma hyperpriors (Park and Casella (2008), Kyung et al. (2010)) leading to the group lasso hierarchy,

$$\Phi_{[g]} \mid \tau_g, \Sigma_\varepsilon \;\stackrel{ind}{\sim}\; MN_{m_g \times p}(O, \tau_g I_{m_g}, \Sigma_\varepsilon) \quad \text{and} \quad \tau_g \;\stackrel{ind}{\sim}\; \text{Gamma}\left( \frac{m_g + 1}{2}, \frac{\lambda_g^2}{2} \right), \quad g = 1, \ldots, G.$$

Here Gamma(α, λ) denotes the Gamma distribution with shape parameter α and rate parameter λ. This can alternatively be presented as $\Phi \mid \tau, \Sigma_\varepsilon \sim MN_{dp \times p}(O, \mathrm{BDiag}(\tau_1, \ldots, \tau_G), \Sigma_\varepsilon)$ and $\tau_g \stackrel{ind}{\sim} \text{Gamma}\left( \frac{m_g + 1}{2}, \frac{\lambda_g^2}{2} \right)$, where BDiag(τ1, ···, τG) denotes a block-diagonal matrix whose g-th block is $\tau_g I_{m_g}$. Note that under the above hierarchical prior, conditionally on (τ1, ···, τG) and Σε, the columns of Φ are independent. If mg = 1 for all g = 1, ···, dp, we get the ordinary Bayesian lasso.

Under the specification given in (8), suppose we assume $U^{-1} = \mathrm{Diag}(\tau_1, \ldots, \tau_{dp})$ and $1/\tau_i \stackrel{ind}{\sim} \mathrm{Gamma}(\alpha_i, \lambda_i/2)$; then it can be shown that the prior density of Φ given only Σε is proportional to

$$\prod_{i=1}^{dp} \left( \left\| \Phi_{i \cdot} \Sigma_\varepsilon^{-1/2} \right\|_2^2 + \lambda_i \right)^{-\left( \alpha_i + \frac{1}{2} \right)},$$

which corresponds to the multivariate t-distribution.

Estimation

For the hierarchical model given in (8), the posterior density of Φ is intractable and quantities such as the posterior mean are not available in closed form. Hence, we develop a Markov Chain Monte Carlo algorithm to generate values from the posterior density. It follows by straightforward calculations that

$$\Phi \mid \Sigma_\varepsilon, U, \mathcal{Y}_n \sim MN_{dp \times p}\left( \Phi^{PM}, (X^\top X + U)^{-1}, \Sigma_\varepsilon \right)$$
$$\Sigma_\varepsilon \mid U, \mathcal{Y}_n \sim \text{Inverse-Wishart}\left( \hat{\Sigma}_{res}, \; n - p - 1 \right)$$
$$\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n) \propto |U|^{dp/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ \Phi \Sigma_\varepsilon^{-1} \Phi^\top U \right\} \right] \pi_{scl}(U),$$

where $\hat{\Sigma}_{res} = Y^\top \left( I - X (X^\top X + U)^{-1} X^\top \right) Y$. While the conditional posterior distributions of Φ given (Σε, U) and of Σε given U are easy to simulate from (being matrix normal and Inverse-Wishart, respectively), the tractability of the conditional posterior density of U given (Φ, Σε) depends on the form of the prior $\pi_{scl}(U)$. We show below that for three standard choices of $\pi_{scl}(U)$, corresponding to the Wishart prior, the group lasso prior and the multivariate t-prior, $\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n)$ becomes a tractable density that is easy to simulate from.

Case 1: Wishart Prior

For a dp × dp positive definite matrix V, let $U \sim \text{Wishart}_{dp}(V, \, \mathrm{df} = \nu + dp)$, that is, $\pi(U) \propto |U|^{\frac{\nu - 1}{2}} \exp\left[ -\frac{1}{2} \mathrm{tr}\{ V^{-1} U \} \right]$. In this case,

$$\pi(U \mid \Phi, \Sigma_\varepsilon, \mathcal{Y}_n) \propto |U|^{\frac{\nu + dp - 1}{2}} \exp\left[ -\frac{1}{2} \mathrm{tr}\left\{ \left( \Phi \Sigma_\varepsilon^{-1} \Phi^\top + V^{-1} \right) U \right\} \right],$$

which is $\text{Wishart}_{dp}\left( \left( \Phi \Sigma_\varepsilon^{-1} \Phi^\top + V^{-1} \right)^{-1}, \, \mathrm{df} = \nu + 2dp \right)$. Note that as long as ν > −(dp + 1), the posterior of U given (Φ, Σε, 𝒴n) is proper.
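Putting the conditionals together gives a simple Gibbs sampler; the following sketch implements one sweep for the Wishart prior of this case (scipy.stats does provide wishart, invwishart and matrix_normal; everything else is illustrative):

import numpy as np
from scipy.stats import wishart, invwishart, matrix_normal

def gibbs_sweep_wishart(Y, X, U, V, nu, rng=None):
    # One sweep through the conditionals of the hierarchical model (8).
    n, p = Y.shape
    dp = X.shape[1]
    K_inv = np.linalg.inv(X.T @ X + U)
    Phi_pm = K_inv @ X.T @ Y
    S_res = Y.T @ (Y - X @ Phi_pm)
    # Sigma_eps | U, Y ~ Inverse-Wishart(S_res, n - p - 1)
    Sigma = invwishart.rvs(df=n - p - 1, scale=S_res, random_state=rng)
    # Phi | Sigma_eps, U, Y ~ MN(Phi^PM, (X'X + U)^{-1}, Sigma_eps)
    Phi = matrix_normal.rvs(mean=Phi_pm, rowcov=K_inv, colcov=Sigma,
                            random_state=rng)
    # U | Phi, Sigma_eps, Y ~ Wishart((Phi Sigma^{-1} Phi' + V^{-1})^{-1},
    #                                 df = nu + 2dp); proper for nu > -(dp + 1).
    scale_U = np.linalg.inv(Phi @ np.linalg.solve(Sigma, Phi.T)
                            + np.linalg.inv(V))
    U_new = wishart.rvs(df=nu + 2 * dp, scale=scale_U, random_state=rng)
    return Phi, Sigma, U_new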

Case 2: Bayesian Group Lasso

In this case, as already discussed in Section 3.2, $U^{-1}$ has the block-diagonal form BDiag(τ1, ···, τG), and the τg’s are a priori independently distributed as Gamma with shape (mg + 1)/2 and rate $\lambda_g^2 / 2$. Hence, the conditional distribution of τg has the following form,

$$\frac{1}{\tau_g} \,\Big|\, \Phi, \Sigma_\varepsilon, \mathcal{Y}_n \;\stackrel{ind}{\sim}\; \text{Inverse-Gaussian}\left( \mu_g = \frac{\lambda_g}{\left\| \Phi_{[g]} \Sigma_\varepsilon^{-1/2} \right\|_F}, \; \lambda_g^2 \right).$$
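NumPy exposes the inverse-Gaussian distribution as Generator.wald, so this scale update can be drawn directly. A sketch (names are illustrative; Sigma_inv_sqrt denotes a symmetric square root of Σε⁻¹):

import numpy as np

def update_group_tau(Phi_g, Sigma_inv_sqrt, lam_g, rng):
    # Draw 1/tau_g ~ Inverse-Gaussian(lam_g / ||Phi_[g] Sigma^{-1/2}||_F, lam_g^2)
    # and return tau_g; Phi_g is the m_g x p block of Phi for group g.
    norm_g = np.linalg.norm(Phi_g @ Sigma_inv_sqrt, ord="fro")
    inv_tau = rng.wald(mean=lam_g / norm_g, scale=lam_g ** 2)
    return 1.0 / inv_tau

# usage: rng = np.random.default_rng(0); tau_g = update_group_tau(Phi_g, S, 1.0, rng)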

Case 3: Multivariate t-distribution

By taking $U^{-1} = \mathrm{Diag}(\tau_1, \ldots, \tau_{dp})$ and $1/\tau_i$ to be independently distributed as Gamma with shape αi and rate λi/2, we obtain the multivariate t-distribution as the prior on Φ. In this case, the conditional distribution of τi has the following form,

$$\frac{1}{\tau_i} \,\Big|\, \Phi, \Sigma_\varepsilon, \mathcal{Y}_n \;\stackrel{ind}{\sim}\; \text{Gamma}\left( \alpha_i + \frac{dp}{2}, \; \frac{\left\| \Phi_{i \cdot} \Sigma_\varepsilon^{-1/2} \right\|_2^2 + \lambda_i}{2} \right).$$

Assumptions for posterior consistency

We now introduce regularity conditions to establish posterior consistency under the hierarchical prior model.

Assumption B1. The VAR(d) model given in (1) is stable.

Assumption B2. $\frac{1 + \mu_{\max}(\tilde{\mathcal{A}})}{\mu_{\min}(\tilde{\mathcal{A}})}$ is $O\left( \sqrt{\frac{n}{p^5}} \right)$ as n → ∞.

Assumption B3. $0 < \inf_{n \geq 1} \lambda_{\min}(C_X) \leq \sup_{n \geq 1} \lambda_{\max}(C_X) < \infty$ and $\lambda_{\max}(\Sigma_{\varepsilon,0n}) = O(1)$.

Assumption B4. The singular values of the true parameter matrices $\{\Phi_{0n}\}_{n \geq 1}$ are uniformly bounded; equivalently, the eigenvalues of $\{\Phi_{0n}^\top \Phi_{0n}\}_{n \geq 1}$ are uniformly bounded.

Assumption B5. $p = O\left( \frac{n}{\log n} \right)$.

Assumption B6. There exists a fixed α > 0 (not depending on n) such that $\liminf_{n \to \infty} \pi_{scl,n}(\lambda_{\max}(U) > \alpha) > 0$, and for every β > 0 we have $\lim_{n \to \infty} \pi_{scl,n}(\lambda_{\max}(U) > \beta n) = 0$.

We now discuss these assumptions and compare them to the assumptions for the non-hierarchical prior model.

  • Assumptions B1 and B2 are identical to A1 and A2, while B3 is fairly similar to A3.

  • One key difference is the permissible scaling of p as a function of the sample size n: Assumption B5 is slightly more stringent than the permissible scaling for the non-hierarchical matrix normal prior in Assumption A5.

  • Note that Assumption B6 is a mild one. For example, a sufficient condition for it to be satisfied is that $\limsup_{n \to \infty} \max_{1 \leq i \leq p} E_{\pi_{scl,n}}[U_{ii}^{\delta}] < \infty$ for some δ > 0 and $\liminf_{n \to \infty} \pi_{scl,n}(U_{11} > \alpha) > 0$. It can be easily checked that this condition, and hence Assumption B6, is satisfied in the case of the Wishart, Inverse-Wishart, Bayesian group lasso, multivariate t-distribution, Horseshoe (Carvalho et al. (2010)), Strawderman-Berger and generalized double Pareto (Armagan et al. (2013)) priors, as long as the prior parameters do not depend on n.

  • In the non-hierarchical prior case, assumptions regarding Φ0 and the (non-random) U are provided jointly in Assumption A4, through the conditions $\|\Phi_0^\top U \Phi_0\| = o(n)$ and $\|U \Phi_0\| = o(n)$. For the hierarchical prior case, for clarity of exposition, we provide the assumptions regarding Φ0 in Assumption B4, and those regarding the distribution of the (random) U in Assumption B6. Combining these two assumptions, it can be shown that, a priori, $\|\Phi_0^\top U \Phi_0\|/n$ and $\|U \Phi_0\|/n$ converge to zero in $\Pi_{scl,n}$-probability as n → ∞. In that sense, the assumptions on (Φ0, U) in the hierarchical model are stronger than in the non-hierarchical model case.

With these assumptions in hand, we state our key consistency result, whose proof is delegated to Appendix C.6 of the supplementary material.

Theorem 3.5 (Posterior consistency for hierarchical prior). For any centered VAR(d) model with the hierarchical prior (8) on the transition matrix satisfying Assumptions B1–B6, posterior consistency for the transition matrix holds, i.e., for every fixed ε > 0,

$$E_0\left[ \Pi_n\left( \|\Phi - \Phi_0\| > \varepsilon \mid \mathcal{Y}_n \right) \right] \to 0 \quad \text{as } n \to \infty,$$

where Φ0 is the true parameter matrix under the model (2).

4. Performance Evaluation

To illustrate the performance of our Bayesian modeling framework for VAR processes, we design three sets of numerical experiments involving: (a) Small VAR (p = 10), (b) Medium VAR (p = 100) and (c) Large VAR (p = 500) models, each with two lag values: (i) d = 1 and (ii) d = 2.

In each setting, we use transition matrices Ai with 10–30% non-zero entries, whose locations are selected at random and whose values are generated from U(−2, 0) ∪ U(0, 2); the matrices are rescaled to ensure that the process is stable with SNR = 2. For small VAR models, we generate n = 40, 80, 120 time points; for medium VAR models, n = 400, 800, 1200; while for large VAR models we use n = 2000, 4000, 6000. The hyper-parameters for the prior distributions are selected using the Deviance Information Criterion (DIC). Note that $DIC = 2\bar{D} - D(\bar{\Phi}, \bar{\Sigma}_\varepsilon)$, where $D(\Phi, \Sigma_\varepsilon) \equiv -2 \log L(\mathcal{Y}_n \mid \Phi, \Sigma_\varepsilon) = n \log |\Sigma_\varepsilon| + \mathrm{tr}\left\{ \Sigma_\varepsilon^{-1} \left( \Phi^\top X^\top X \Phi - 2 \Phi^\top X^\top Y + Y^\top Y \right) \right\}$, $\bar{D}$ is the posterior expectation of $D(\Phi, \Sigma_\varepsilon)$, and $\bar{\Phi}$ and $\bar{\Sigma}_\varepsilon$ are the posterior expectations of Φ and Σε, respectively.
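A sketch of this DIC computation given a set of posterior draws {(Φ(s), Σε(s))} (Python/NumPy; names illustrative):

import numpy as np

def deviance(Y, X, Phi, Sigma):
    # D(Phi, Sigma) = n log|Sigma| + tr{Sigma^{-1}(Phi'X'X Phi - 2 Phi'X'Y + Y'Y)}
    n = Y.shape[0]
    R = Phi.T @ X.T @ X @ Phi - 2 * Phi.T @ (X.T @ Y) + Y.T @ Y
    return n * np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, R))

def dic(Y, X, draws):
    # DIC = 2 * D_bar - D(Phi_bar, Sigma_bar)
    D_bar = np.mean([deviance(Y, X, P, S) for P, S in draws])
    Phi_bar = np.mean([P for P, _ in draws], axis=0)
    Sigma_bar = np.mean([S for _, S in draws], axis=0)
    return 2 * D_bar - deviance(Y, X, Phi_bar, Sigma_bar)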

4.1. Non-hierarchical prior

We generate two different error processes using $\Sigma_\varepsilon = \sigma^2 I_p$ and $\Sigma_\varepsilon = \sigma^2 \left( (\rho^{|i-j|}) \right)_{ij}$ (Toeplitz form). For each of the small, medium and large VAR models, U is taken to be the diagonal matrix $c I_{dp}$, where c is chosen according to the minimum DIC value over the interval [0, 10]. In Table 1, for both the posterior mean (PM) and the least squares estimator (LS), we report the relative estimation error $\|\hat{\Phi} - \Phi_0\|_2 / \|\Phi_0\|_2$, with the standard error of $\|\hat{\Phi}\|_2$ in parentheses, averaged over 10 × p replicates for the small and medium VARs and 100 replicates for the large VAR (p = 500). Since the true parameter matrix Φ0 is sparse, we identify entries whose 95% posterior credible intervals contain zero, and set them to zero in both parameter matrix estimates (PM and LS).

Table 1:

Relative error in VAR (d = 1, 2) with Σε = σ²Ip, where % denotes the percentage of non-zero entries in Φ0.

                          Lag d = 1                               Lag d = 2
                n1           n2           n3            n1           n2           n3
  %         LS     PM     LS     PM     LS     PM     LS     PM     LS     PM     LS     PM
Small VAR 10 0.93 0.83 0.82 0.70 0.70 0.55 1.26 1.09 1.11 0.96 1.00 0.76
(0.21) (0.12) (0.12) (0.08) (0.07) (0.05) (0.24) (0.17) (0.17) (0.07) (0.10) (0.08)
20 1.06 1.00 1.00 0.87 0.83 0.71 1.39 1.22 1.28 1.07 1.14 0.94
(0.34) (0.24) (0.26) (0.16) (0.16) (0.11) (0.35) (0.30) (0.27) (0.21) (0.18) (0.15)
30 1.23 1.15 1.12 0.97 0.99 0.81 1.57 1.40 1.45 1.28 1.30 1.15
(0.45) (0.37) (0.37) (0.27) (0.27) (0.19) (0.47) (0.40) (0.37) (0.32) (0.34) (0.22)

Medium VAR 10 1.81 1.64 1.69 1.53 1.56 1.40 2.45 2.12 2.31 2.01 2.24 1.85
(0.40) (0.21) (0.31) (0.12) (0.23) (0.07) (0.46) (0.32) (0.37) (0.24) (0.31) (0.14)
20 1.95 1.81 1.85 1.70 1.74 1.56 2.60 2.26 2.50 2.15 2.43 2.01
(0.50) (0.32) (0.42) (0.25) (0.35) (0.17) (0.57) (0.42) (0.51) (0.33) (0.40) (0.29)
30 2.10 1.94 1.98 1.84 1.87 1.74 2.74 2.46 2.62 2.31 2.52 2.20
(0.64) (0.43) (0.52) (0.38) (0.45) (0.27) (0.69) (0.54) (0.60) (0.46) (0.56) (0.39)

Large VAR 10 2.70 2.47 2.55 2.34 2.46 2.22 3.66 3.18 3.50 3.04 3.38 2.92
(0.58) (0.30) (0.50) (0.23) (0.40) (0.16) (0.67) (0.45) (0.57) (0.35) (0.49) (0.28)
20 2.84 2.62 2.75 2.52 2.60 2.35 3.81 3.33 3.69 3.23 3.54 3.05
(0.70) (0.42) (0.60) (0.34) (0.54) (0.25) (0.78) (0.56) (0.70) (0.50) (0.61) (0.40)
30 3.03 2.78 2.90 2.65 2.71 2.49 3.96 3.49 3.89 3.32 3.78 3.25
(0.82) (0.54) (0.70) (0.44) (0.63) (0.35) (0.88) (0.69) (0.84) (0.63) (0.76) (0.51)

First, we assume the true error covariance matrix Σε is diagonal, i.e., σ²Ip. Here % denotes the percentage of non-zero entries in Φ0 and d represents the lag length of the underlying VAR process. Recall that the sample sizes used for small VAR models are n1 = 40, n2 = 80, n3 = 120, for medium VAR ones n1 = 400, n2 = 800, n3 = 1200, and for large VAR ones n1 = 2000, n2 = 4000, n3 = 6000.

It can be seen that the relative estimation error decreases with an increase in the number of time points (sample size) n for both lags d = 1, 2; further, its values are significantly larger in medium and large size VAR models than in small VAR ones. Moreover, the estimation error for lag 1 is uniformly smaller than that for lag 2, and the same holds true for their respective standard errors. Regarding the percentage of non-zero entries in the true transition matrices, the results show that for fixed n and p, the more true non-zero entries in A1, A2, the less accurate the posterior mean and the LS estimator are, while their variability, as indicated by their standard errors, follows the same pattern. However, the posterior mean clearly outperforms the LS estimates, especially in settings with large p. This is to a large extent because the true transition matrices A1, A2 exhibit weaker signal as p or the number of non-zero entries increases (to ensure stability of the underlying VAR model), and because with our choice U = cIdp the posterior mean is the ridge regression estimator, which applies shrinkage to the coefficients.

Next, we introduce correlation in the error components by specifying Σε to be of Toeplitz form. As discussed in Section 3.2.1 of Lütkepohl (2007), the generalized least squares estimator in this multivariate regression setup is the same as the ordinary one, i.e., $(X^\top X)^{-1} X^\top Y$, a result due to Zellner (1962). In Table 2, we compare the performance of the least squares and posterior (ridge) estimates with noise covariance Σε = Toeplitz (ρ = 0.8).

Table 2:

Relative error in VAR (d = 1, 2) with Σε = Toeplitz (ρ = 0.80), where % denotes the percentage of non-zero entries in Φ0.

                          Lag d = 1                               Lag d = 2
                n1           n2           n3            n1           n2           n3
  %         LS     PM     LS     PM     LS     PM     LS     PM     LS     PM     LS     PM
Small VAR 10 1.03 0.95 0.95 0.82 0.82 0.64 1.45 1.24 1.31 1.08 1.19 0.98
(0.27) (0.18) (0.20) (0.09) (0.10) (0.05) (0.37) (0.27) (0.30) (0.19) (0.18) (0.09)
20 1.20 1.10 1.06 0.94 0.98 0.83 1.60 1.41 1.49 1.30 1.35 1.17
(0.40) (0.30) (0.29) (0.21) (0.25) (0.17) (0.48) (0.37) (0.40) (0.29) (0.31) (0.19)
30 1.36 1.27 1.22 1.11 1.12 0.99 1.76 1.61 1.66 1.45 1.50 1.26
(0.50) (0.41) (0.44) (0.32) (0.32) (0.25) (0.60) (0.51) (0.50) (0.40) (0.46) (0.30)

Medium VAR 10 2.02 1.87 1.94 1.72 1.75 1.61 2.80 2.46 2.70 2.35 2.53 2.21
(0.52) (0.34) (0.41) (0.27) (0.37) (0.16) (0.69) (0.50) (0.62) (0.43) (0.54) (0.33)
20 2.20 2.01 2.10 1.87 1.92 1.80 2.95 2.62 2.86 2.52 2.74 2.34
(0.63) (0.45) (0.55) (0.38) (0.45) (0.33) (0.81) (0.62) (0.71) (0.52) (0.65) (0.42)
30 2.36 2.20 2.19 2.06 2.06 1.91 3.12 2.78 2.99 2.67 2.91 2.50
(0.73) (0.56) (0.67) (0.48) (0.56) (0.41) (0.94) (0.72) (0.83) (0.67) (0.77) (0.57)

Large VAR 10 3.03 2.80 2.88 2.63 2.83 2.55 4.19 3.67 4.09 3.55 3.93 3.44
(0.75) (0.49) (0.67) (0.41) (0.60) (0.33) (1.03) (0.73) (0.97) (0.62) (0.87) (0.58)
20 3.17 2.94 3.07 2.79 2.89 2.69 4.34 3.85 4.25 3.70 4.11 3.594
(0.86) (0.60) (0.79) (0.51) (0.69) (0.45) (1.14) (0.84) (1.06) (0.76) (1.00) (0.66)
30 3.33 3.10 3.27 3.00 3.08 2.80 4.53 4.00 4.37 3.89 4.26 3.70
(0.98) (0.73) (0.90) (0.63) (0.82) (0.56) (1.26) (0.95) (1.17) (0.87) (1.11) (0.79)

In this setting, the relative estimation error of both the least squares and ridge estimators increases compared to the uncorrelated error structure in Table 1; in particular, the performance of the LS estimator deteriorates even further. However, with an increase in sample size, the accuracy of both estimates improves significantly. Further, as gleaned from the entries of the table corresponding to lag 2, the relative error exhibits a further increase, a pattern consistent with the results in Table 1. This is expected, since we not only have a Toeplitz-type error covariance structure, but the total number of unknown parameters has also increased by p².

Finally, we study support recovery under both error processes. In Table 3, we provide the percentage of true positives recovered using 95% posterior credible intervals, based on the same sample sizes n1, n2, and n3 as used previously.

Table 3:

Percentage of true positive non-zero entries recovered in Φ.

                      Lag d = 1                                  Lag d = 2
          Σε = σ²Ip          Σε = Toep            Σε = σ²Ip          Σε = Toep
  %     n1    n2    n3     n1    n2    n3       n1    n2    n3     n1    n2    n3
Small VAR 10 85.0 85.0 82.0 85.0 85.0 84.0 84.7 84.5 81.6 84.6 84.6 83.6
20 80.0 80.0 81.0 80.0 78.0 80.0 79.7 79.6 80.5 79.5 77.6 79.8
30 77.0 75.0 77.0 77.0 73.0 73.0 76.6 74.6 76.5 76.7 72.6 72.6

Medium VAR 10 89.3 90.0 89.3 89.3 89.8 88.3 88.8 89.5 89.0 88.9 89.4 87.9
20 85.3 85.5 85.5 85.3 84.5 85.0 84.9 85.3 85.3 84.8 84.1 84.8
30 81.0 80.8 81.3 81.0 79.8 78.8 80.8 80.5 81.0 80.6 79.3 78.3

Large VAR 10 92.0 92.1 92.2 92.0 92.1 91.8 91.8 91.6 91.8 91.7 91.7 91.4
20 88.1 88.1 88.1 88.1 87.9 87.9 87.7 87.8 87.7 87.7 87.5 87.5
30 83.1 83.3 83.4 83.1 82.8 82.9 82.7 82.8 82.9 82.7 82.4 82.6

The above table indicates that support recovery is not sensitive to the sample size or to the lag; however, for all VAR models and error covariance settings it deteriorates as the density of non-zero entries increases, while it exhibits a small improvement with model dimension.

4.2. Hierarchical Priors

As discussed in Section 3.2, three types of hierarchical priors (Wishart, group lasso and multivariate t) are studied. As in the non-hierarchical prior case, the performance of the LS estimator is unsatisfactory in this setup as well; thus, we only compare the relative accuracy of the three prior choices in this setting. We select $V = c I_{dp}$ and df = ν = dp for the Wishart prior, λi = λ for all 1 ≤ i ≤ dp for the group lasso prior, and αi = 1, λi = λ for all 1 ≤ i ≤ dp for the multivariate t prior. The hyper-parameters c and λ are chosen using DIC. In Table 4, we report the relative estimation error $\|\hat{\Phi} - \Phi_0\|_2 / \|\Phi_0\|_2$ (d = 1, 2) of the three hierarchical estimators when the error process covariance is set to σ²Ip; d represents the lag length of the underlying VAR model.

Table 4:

Relative error in VAR (d = 1, 2) with Σε = σ²Ip, where % denotes the percentage of non-zero entries in Φ0.

                       Wishart              Group lasso          Multivariate t
Lag d = 1    %     n1     n2     n3      n1     n2     n3       n1     n2     n3
Small VAR 10 0.84 0.75 0.65 0.84 0.75 0.64 0.85 0.74 0.58
20 0.99 0.91 0.73 1.00 0.89 0.79 1.00 0.90 0.76
30 1.14 1.09 0.92 1.16 1.02 0.94 1.15 1.04 0.93

Medium VAR 10 1.63 1.53 1.42 1.62 1.54 1.43 1.63 1.53 1.39
20 1.78 1.71 1.53 1.76 1.69 1.56 1.79 1.66 1.55
30 1.93 1.83 1.70 1.95 1.86 1.72 1.95 1.82 1.69

Large VAR 10 2.39 2.26 2.16 2.41 2.31 2.22 2.42 2.34 2.22
20 2.59 2.42 2.31 2.58 2.47 2.39 2.59 2.46 2.32
30 2.77 2.61 2.52 2.70 2.60 2.53 2.74 2.63 2.54

Lag d = 2    %     n1     n2     n3      n1     n2     n3       n1     n2     n3

Small VAR 10 0.88 0.74 0.60 0.87 0.80 0.71 0.89 0.78 0.63
20 1.05 0.91 0.83 1.04 0.94 0.85 1.08 0.94 0.86
30 1.25 1.13 0.97 1.23 1.10 0.99 1.27 1.13 1.03

Medium VAR 10 1.71 1.58 1.46 1.71 1.58 1.53 1.70 1.60 1.50
20 1.85 1.72 1.60 1.85 1.76 1.71 1.89 1.82 1.66
30 2.05 1.96 1.77 2.01 1.95 1.82 2.04 1.97 1.83

Large VAR 10 2.52 2.37 2.26 2.51 2.41 2.29 2.53 2.44 2.25
20 2.69 2.58 2.47 2.67 2.59 2.47 2.69 2.59 2.51
30 2.88 2.75 2.60 2.86 2.75 2.62 2.87 2.74 2.64

In Table 5, we present relative estimation errors with the same three hierarchical priors when Σε = Toeplitz (ρ = 0.80).

All of the hierarchical estimates outperform the ridge estimator (Tables 1 and 2) across all settings considered. This is again expected, since the Ai’s have a sparse structure by construction and the group lasso prior favors sparsity. However, the above results are not conclusive as to whether the group lasso estimate exhibits better accuracy than the Wishart or multivariate t estimates. To gain some insight into this issue, we use a VAR(1) model with p = 9 and a transition matrix A1 in which the columns form three groups, each containing three columns; the sparsity increases as we move from group 1 to group 3. Finally, we rescale the coefficient matrix so that the corresponding VAR process is stable with SNR = 2. The structure of the resulting A1 transition matrix is depicted next, where * indicates a non-zero entry.

$$A_1 = \begin{pmatrix}
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & * & * & 0 & 0 & 0 \\
* & * & * & * & 0 & 0 & 0 & 0 & 0 \\
* & * & * & 0 & * & 0 & 0 & 0 & 0 \\
* & * & * & 0 & 0 & * & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & * & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & * & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & *
\end{pmatrix}_{9 \times 9}$$

We generate n = 100 observations from the above VAR(1) model using two noise covariance settings, (1) $\Sigma_\varepsilon = \sigma^2 I_p$ and (2) Σε = Toeplitz (ρ = 0.80), and report in Table 6 the relative estimation error $\|\hat{A}_1 - A_1\|_2 / \|A_1\|_2$ of five different estimates: least squares (LS), the posterior mean under the non-hierarchical normal prior (Non H), and the hierarchical Wishart (W), group lasso (GL) and multivariate t (Mult. t) estimates.

Table 6:

Relative error

Estimator      Σε = σ²Ip      Σε = Toeplitz

LS 0.604 1.384
Non H 0.583 0.614
W 0.462 0.544
Mult. t 0.430 0.414
GL 0.321 0.362

The results show that the group lasso estimator exhibits the best performance, followed by the multivariate t one, whereas the LS estimator is the least accurate. The result is consistent with the structure of the underlying transition matrix, since the group lasso prior can capitalize on it.

In Appendix A.1 of the supplementary document, we illustrate the posterior estimate of a VAR(1) transition matrix A1, together with 95% credible intervals and estimated posterior densities for several entries of A1. We also examine the performance of $\hat{\Sigma}_\varepsilon$ when the true error covariance is Toeplitz (ρ = 0.8), and report the relative error of $\hat{\Phi}$ under all four priors for a noise covariance matrix generated from a Wishart distribution with ν = p degrees of freedom and scale matrix Ip.

5. Application to Macroeconomic Data

We use the proposed Bayesian framework to understand lead-lag relationships in the FRED-MD dataset, containing p = 137 key macroeconomic variables for the period January 1973 to December 2014. VAR modeling for this task was strongly advocated by Sims (1980) and has since become a standard tool for it, although usually the focus is on small models involving a few macroeconomic indicators (e.g., the consumer price index, an employment index and the federal funds rate). However, recent work has advocated larger VAR models (see Bernanke et al. (2005); Bańbura et al. (2010) and references therein), in order to improve forecastability and to avoid relationships that are hard to interpret, or even contradict economic theory, arising from not including an adequate number of variables to properly model the economic phenomenon under consideration. Before centering the data and estimating Σε as discussed earlier in Section 3.2, we ensure stationarity by transforming the variables according to the recommendations in Stock and Watson (2005). The specific transformations used for each time series are given in the supplementary documents. Analogously to Bańbura et al. (2010), we consider the following three different size VAR models:

  • SMALL: This model contains p = 4 key variables - CPI, number of employees non-farm (PAYEMS), Federal Funds Rate (FEDFUNDS) and Unemployment Rate (UNRATE).

  • MEDIUM: In addition to the four variables in the SMALL VAR model, this one contains an additional 16 variables (total p = 20) listed next - Reserves Of Depository Institutions (NONBORRES), Total Reserves of Depository Institutions (TOTRESNS), M2 (M2REAL), Real Personal Income (RPI), Real personal consumption expenditures (DPCERA), IP Index (INDPRO), Capacity Utilization: Manufacturing (CAPT), Housing Starts: Total New Privately Owned (HOUST), Avg Hourly Earnings : Goods-Producing (CES), M1 (M1), S & P’s Common Stock Price Index: Composite (S.P.), 10-Year Treasury Rate (GS10), Personal Cons. Expend.: Chain Index (PCEPI), Foreign Exchange Rate (EXS), Crude Oil, spliced WTI and Cushing (OIL) and Retail and Food Services Sales (RETAILx).

  • LARGE: This specification has all p = 134 macroeconomic indicators (3 were excluded from further analysis due to the presence of a large number of missing values).

Based on initial exploratory work, we choose lag d = 6 according to the Bayesian information criterion (BIC), and the following distributions were used for prior specification to obtain the estimated parameter matrix Φ: (i) non-hierarchical normal (Non H), (ii) hierarchical Wishart (W), (iii) group lasso (GL) and (iv) multivariate t prior (Mult. t). Since the number of parameters increases linearly with the lag length d, we suggest using BIC over the Akaike information criterion (AIC). For the non-hierarchical prior, we use U = BDiag(λ1, ⋯, λd), while for the hierarchical Wishart, group lasso and multivariate t priors on Φ, we use $V = c_1 I_{dp}$ and αi = α for all 1 ≤ i ≤ dp. The values of c1 and λ are chosen using the Deviance Information Criterion (DIC), which, as explained previously, is a hierarchical Bayesian modeling generalization of BIC. The respective posterior means were compared to the least squares (LS) estimate. For each estimate $\hat{\Phi}$, the residual norm ratio $\|Y - X\hat{\Phi}\|_F / \|Y\|_F$, which measures in-sample fit, is reported in Table 7.

Table 7:

In-sample prediction error

SMALL (p = 4) MEDIUM (p = 20) LARGE (p = 134)

LS 0.847 0.863 0.673
Non-H 0.852 0.864 0.674
W 0.845 0.863 0.675
Mult. t 0.877 0.885 0.685
GL 0.847 0.873 0.674

Note that since the LS estimator is obtained by minimizing $\|Y - X\Phi\|_F$, it always results in the minimum relative residual norm, as observed in the above table; i.e., the LS estimator is always the best one in terms of in-sample prediction accuracy.

Next, we investigate the 4 different Bayesian estimates based on their out-of-sample prediction performance with respect to the benchmark prior, analogously to the evaluation strategies discussed in Bańbura et al. (2010) and Stock and Watson (2005). We consider the following two benchmark priors:

  1. Prior information is imposed exactly by setting U⁻¹ = O (the zero matrix), which corresponds to λ = 0 in the Minnesota prior. Bańbura et al. (2010) use this specification as the benchmark prior, in which case the corresponding benchmark model becomes a random walk with drift, i.e., $X_t = \alpha + X_{t-1} + \varepsilon_t$.

  2. A uniform prior on Φ, obtained by setting U = O, which corresponds to λ = ∞ in the Minnesota prior. In this case the posterior mean coincides with the least squares (LS) estimate.

Let $\hat{X}_{t+h}$ be the h-step-ahead predicted value of $X_{t+h}$ based on our posited Bayesian model, using the data up to time t. The corresponding forecast under the benchmark prior is $\hat{X}^{O}_{t+h}$. The mean squared forecast error relative to the benchmark (RMSFE) is defined as $$\frac{\sum_{t=T_0}^{T_1} \left\| X_{t+h} - \hat{X}_{t+h} \right\|^2}{\sum_{t=T_0}^{T_1} \left\| X_{t+h} - \hat{X}^{O}_{t+h} \right\|^2}.$$ Table 8 gives the RMSFE results for three different choices of forecasting horizon, h = 1, 6, 12, for the two benchmark priors considered, over the period T0 = January 1978 to T1 = December 2006. An RMSFE value smaller than 1 implies that the VAR model with the corresponding prior outperforms that with the naive/benchmark prior.
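The RMSFE defined above reduces to a ratio of accumulated squared forecast errors; a minimal sketch (Python/NumPy; array names are illustrative placeholders):

import numpy as np

def rmsfe(actual, forecast, benchmark):
    # actual, forecast, benchmark: (T1 - T0 + 1) x p arrays holding
    # X_{t+h}, X^hat_{t+h} and the benchmark forecast X^O_{t+h}.
    num = np.sum((actual - forecast) ** 2)
    den = np.sum((actual - benchmark) ** 2)
    return num / den     # values < 1 favor the model over the benchmark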

Table 8:

Out-of-sample relative prediction error

                           Uniform prior                        Random Walk
                  SMALL      MEDIUM      LARGE         SMALL      MEDIUM      LARGE
                  p = 4      p = 20      p = 134       p = 4      p = 20      p = 134
Non Hierarchical h = 1 0.88 0.72 0.62 0.49 0.40 0.33
h = 6 0.90 0.78 0.62 0.43 0.42 0.37
h = 12 0.95 0.84 0.74 0.43 0.41 0.37

Wishart h = 1 0.88 0.81 0.68 0.49 0.45 0.32
h = 6 0.86 0.86 0.68 0.42 0.40 0.36
h = 12 0.93 0.92 0.71 0.43 0.41 0.38

Group Lasso h = 1 0.90 0.86 0.60 0.51 0.45 0.30
h = 6 0.88 0.88 0.63 0.46 0.42 0.34
h = 12 0.93 0.92 0.67 0.47 0.45 0.37

Mult. t h = 1 0.91 0.89 0.71 0.50 0.46 0.31
h = 6 0.87 0.93 0.77 0.49 0.44 0.34
h = 12 0.92 0.94 0.81 0.45 0.41 0.38

It can easily be seen that all four Bayesian methods not only outperform the LS estimate (uniform prior on Φ), but also exhibit substantially smaller relative error compared to the random walk with drift process (point-mass prior on Φ). Further, increasing the number of predictor variables improves forecasting performance, a point argued forcefully in favor of large VAR models by Sims (1980). On the other hand, forecasting performance deteriorates for larger values of h, an expected result. Nevertheless, even for h = 12 (one year ahead) the results are still very satisfactory. Further, for the SMALL and MEDIUM VAR models, the non-hierarchical normal and hierarchical Wishart priors result in better predictions, whereas for the LARGE VAR model the group lasso prior outperforms the other forecasts.

Next, we examine more closely the out-of-sample prediction performance for three macroeconomic variables, CPI, PAYEMS and FEDFUNDS, under the hierarchical Wishart prior; the results are reported in Table 9.

Note that Bańbura et al. (2010) only consider a random walk process as the naive prior. From Table 9 it can be seen that, although for CPI and PAYEMS the LS estimate performs better than the Bayesian estimates in the SMALL and MEDIUM VARs, overall the Wishart prior has better forecasting accuracy than both of the benchmark priors. As previously observed, adding information (i.e., including more variables) improves the accuracy of forecasts for all three variables. The fourth column (LARGE BVAR) provides the numbers reported in Table III of Bańbura et al. (2010), where a Bayesian VAR model on the same 134 variables, with d = 13 lags, was estimated using a normal-inverted Wishart distribution that leads to a ridge-type posterior mean estimate for the parameters in Φ, based on data covering the period 1971–2003. Although the results are not directly comparable to those obtained by our methodology, they nevertheless provide a certain degree of calibration. It can be seen that our model is more parsimonious, using only d = 6 lags, and provides better forecasting performance for all three variables at all forecasting horizons.

Next, we examine in more detail the estimated transition matrix A1 for the SMALL VAR model (p = 4) under the non-hierarchical normal and group lasso priors, shown below; estimated posterior densities of four selected entries are given in Figure 1. It is worth noting that the non-hierarchical prior centers around a different value and exhibits a less smooth behavior than the hierarchical one. This smoothness should be expected given the specification of the latter.

With rows and columns ordered as CPI, PAYEMS, FEDFUND, UNRATE:

$$\hat{A}_1^{NonH} = \begin{pmatrix} 0.133 & 0.001 & 0.001 & 0.001 \\ 0.016 & 0.311 & 0.001 & 0.002 \\ 1.000 & 10.200 & 0.498 & 0.185 \\ 1.217 & 23.300 & 0.035 & 0.105 \end{pmatrix} \qquad \hat{A}_1^{GL} = \begin{pmatrix} 0.167 & 0.005 & 0.001 & 0.001 \\ 0.021 & 0.560 & 0.001 & 0.001 \\ 1.614 & 18.6079 & 0.486 & 0.134 \\ 1.609 & 41.818 & 0.015 & 0.107 \end{pmatrix}$$

Further, we present the 95% posterior credible intervals (PCI) for the entries of A1 under the above two priors (rows and columns ordered as above):

Non-hierarchical:
$$\begin{pmatrix} (0.19, 0.07) & (0.08, 0.07) & (+0, +0) & (0, +0) \\ (0.05, 0.01) & (0.27, 0.35) & (+0, +0) & (0, 0) \\ (7, 5.25) & (1.86, 19) & (0.44, 0.55) & (0.3, 0.06) \\ (1.98, 4.45) & (27.8, 18.7) & (0.06, 0.01) & (0.07, 0.06) \end{pmatrix}$$

Group lasso:
$$\begin{pmatrix} (0.22, 0.11) & (0.10, 0.1) & (+0, +0) & (0, +0) \\ (0.05, 0.01) & (0.51, 0.61) & (+0, +0) & (0, 0) \\ (8.58, 5.7) & (6.5, 31.01) & (0.44, 0.55) & (0.26, 0) \\ (1.75, 4.86) & (47.72, 36.37) & (0.04, 0.01) & (0.07, 0.06) \end{pmatrix}$$

Next, in Figure 2 we depict the estimated networks for the MEDIUM VAR based on the first lag transition matrix produced by: (a) least squares and (b) a non-hierarchical normal prior, where for ease of representation the nodes of the network are abbreviated; the full list of the variable names is given in Table A1 of Appendix A in the supplementary material.

Figure 2: Network representation of the transition matrix (A1).

As expected, for most variables their previous lag value influences the current value. Further, the LS-based network exhibits a high degree of connectivity, whereas the one based on the non-hierarchical prior exhibits a sparser structure. For the latter, it is of interest that the employment index (PAYEMS), the 10-year Treasury rate (GS10) and CPI are influenced by many other variables. On the other hand, the Federal Funds Rate influences the broad stock market (SP500), as expected based on finance theory, and GS10. In general, the sparser result provided by the non-hierarchical prior, in addition to better forecasting, also aids interpretation vis-a-vis the LS estimate.

6. Discussion

In this paper, we investigate posterior consistency in Bayesian VAR(d) models with both non-hierarchical and hierarchical matrix normal prior distributions on the transition matrices, under a Gaussian assumption for the temporal evolution of the time series under consideration and in the presence of a general covariance matrix that captures additional contemporaneous dependence between them. We establish posterior consistency for both of these priors under high-dimensional scaling. To obtain the desired results, some novel concentration inequalities are provided that are of independent interest. The methodology is illustrated on synthetic and real macroeconomic data. The proposed priors provide better forecasts than the LS estimates for periods up to one year ahead, while leading to sparser and potentially easier to interpret relationships, especially for large-scale models.

Supplementary Material

Supp1
Supp2

Figure 1: Posterior densities of entries (1,1), (2,4), (3,2) and (4,2) of A1 under the four different priors.

Table 5:

Relative error in VAR (d = 1,2) with Σε = Toeplitz (ρ = 0.80), where % denotes percentage of non-zero entries in Φ0.

                       Wishart              Group lasso          Multivariate t
Lag d = 1    %     n1     n2     n3      n1     n2     n3       n1     n2     n3
Small VAR 10 0.90 0.76 0.65 0.90 0.83 0.69 0.89 0.79 0.70
20 1.06 0.94 0.77 1.03 0.92 0.82 1.07 0.96 0.81
30 1.23 1.13 0.99 1.21 1.09 0.99 1.23 1.14 0.97

Medium VAR 10 1.75 1.61 1.50 1.73 1.63 1.53 1.74 1.61 1.56
20 1.92 1.80 1.67 1.92 1.78 1.73 1.91 1.79 1.72
30 2.08 1.98 1.82 2.02 1.98 1.88 2.07 1.94 1.84

Large VAR 10 2.58 2.49 2.33 2.60 2.50 2.38 2.60 2.49 2.33
20 2.76 2.65 2.51 2.75 2.65 2.54 2.77 2.62 2.57
30 2.93 2.81 2.71 2.90 2.81 2.74 2.93 2.82 2.74

Lag d = 2    %     n1     n2     n3      n1     n2     n3       n1     n2     n3

Small VAR 10 1.16 1.02 0.96 1.15 1.06 0.98 1.18 1.06 0.89
20 1.31 1.24 1.12 1.31 1.22 1.11 1.34 1.24 1.10
30 1.49 1.41 1.23 1.49 1.41 1.27 1.49 1.40 1.26

Medium VAR 10 2.25 2.13 2.04 2.25 2.17 2.05 2.27 2.16 2.01
20 2.43 2.30 2.16 2.41 2.32 2.24 2.45 2.32 2.26
30 2.59 2.47 2.33 2.59 2.49 2.38 2.59 2.48 2.36

Large VAR 10 3.37 3.24 3.10 3.36 3.25 3.18 3.36 3.24 3.11
20 3.53 3.40 3.29 3.51 3.43 3.31 3.54 3.46 3.36
30 3.72 3.62 3.41 3.65 3.57 3.49 3.74 3.59 3.48

Table 9:

Out-of-sample relative prediction error for CPI, PAYEMS and FEDFUNDS for the three VAR model specifications considered. The column LARGE BVAR corresponds to the entries of Table III in Bańbura et al. (2010) for a Bayesian VAR model with a normal-inverted Wishart prior distribution and d = 13 lags, based on the same set of variables, but covering the period 1971–2003.

                       SMALL      MEDIUM      LARGE       LARGE BVAR
Uniform prior          p = 4      p = 20      p = 134     p = 134
h = 1 CPI 1.05 0.91 0.44 -
PAYEMS 1.21 1.04 0.91 -
FFUND 0.78 0.75 0.68 -

h = 6 CPI 1.03 0.97 0.38 -
PAYEMS 1.08 1.06 0.48 -
FFUND 0.92 0.82 0.68 -

h = 12 CPI 0.98 0.96 0.42 -
PAYEMS 0.93 0.91 0.73 -
FFUND 0.92 0.88 0.72 -

Random Walk

h = 1 CPI 0.43 0.41 0.34 0.50
PAYEMS 0.45 0.43 0.39 0.46
FFUND 0.50 0.45 0.36 0.75

h = 6 CPI 0.38 0.34 0.28 0.40
PAYEMS 0.53 0.48 0.39 0.50
FFUND 0.41 0.37 0.36 1.29

h = 12 CPI 0.51 0.45 0.42 0.44
PAYEMS 0.51 0.88 0.73 0.78
FFUND 0.33 0.31 0.28 1.93

Acknowledgments

* The authors gratefully acknowledge support from NSF grants DMS-1511945 (KK), IIS-1632730 and CNS-1422078 (GM), and NIH grant R01 5R01GM11402902.

Contributor Information

Satyajit Ghosh, Email: satyajitghosh90@ufl.edu.

Kshitij Khare, Email: kdkhare@ufl.edu.

George Michailidis, Email: gmichail@ufl.edu.

References

  1. Armagan A, Clyde M, and Dunson DB (2011). Generalized beta mixtures of Gaussians. In Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, and Weinberger KQ (Eds.), Advances in Neural Information Processing Systems 24, pp. 523–531. Curran Associates, Inc.
  2. Armagan A, Dunson DB, and Lee J (2013). Generalized double Pareto shrinkage. Statistica Sinica 23, 119–143.
  3. Armagan A, Dunson DB, Lee J, Bajwa WU, and Strawn N (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100(4), 1011–1018.
  4. Bańbura M, Giannone D, and Reichlin L (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics 25(1), 71–92.
  5. Basu S and Michailidis G (2015). Regularized estimation in sparse high-dimensional time series models. Ann. Statist. 43(4), 1535–1567.
  6. Basu S, Shojaie A, and Michailidis G (2015). Network Granger causality with inherent grouping structure. Journal of Machine Learning Research 16, 417–453.
  7. Bernanke BS, Boivin J, and Eliasz P (2005). Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics 120(1), 387–422.
  8. Billio M, Getmansky M, Lo AW, and Pelizzon L (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics 104(3), 535–559.
  9. Bontemps D (2011). Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors. Ann. Statist. 39(5), 2557–2584.
  10. Carvalho CM, Polson NG, and Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.
  11. Doan T, Litterman R, and Sims C (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews 3(1), 1–100.
  12. Ghoshal S (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5(2), 315–331.
  13. Griffin JE and Brown PJ (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 5(1), 171–188.
  14. Korobilis D (2013). VAR forecasting using Bayesian variable selection. Journal of Applied Econometrics 28, 204–230.
  15. Kyung M, Gill J, Ghosh M, and Casella G (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369–411.
  16. Lee J and Oh H-S (2013). Bayesian regression based on principal components for high-dimensional data. Journal of Multivariate Analysis 117, 175–192.
  17. Lin J and Michailidis G (2017). Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models. Journal of Machine Learning Research 18(117), 1–49.
  18. Litterman RB et al. (1979). Techniques of forecasting using vector autoregressions. Technical report.
  19. Lütkepohl H (2007). New Introduction to Multiple Time Series Analysis. Springer.
  20. Melnyk I and Banerjee A (2016). Estimating structured vector autoregressive models. Proceedings of the 33rd International Conference on Machine Learning, 830–839.
  21. Mol CD, Giannone D, and Reichlin L (2008). Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics 146(2), 318–328.
  22. Park T and Casella G (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
  23. Raskutti G and Yuan M (2015). Convex regularization for high-dimensional tensor regression. arXiv e-prints.
  24. Schweinberger M, Babkin S, and Ensor KB (2017). High-dimensional multivariate time series with additional structure. Journal of Computational and Graphical Statistics 26(3), 610–622.
  25. Seth AK, Barrett AB, and Barnett L (2015). Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience 35(8), 3293–3297.
  26. Sims CA (1980). Macroeconomics and reality. Econometrica 48(1), 1–48.
  27. Sparks DK, Khare K, and Ghosh M (2015). Necessary and sufficient conditions for high-dimensional posterior consistency under g-priors. Bayesian Anal. 10(3), 627–664.
  28. Stock JH and Watson MW (2005). An empirical comparison of methods for forecasting using many predictors. Manuscript, Princeton University.
  29. Zellner A (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57(298), 348–368.
