Abstract
In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.
Keywords: high dimensional inference, Gaussian sequence, linear functional, squared error loss, posterior distribution, frequentist
The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory. It is traditionally formulated in situations in which the number of parameters p is fixed and the sample size n → ∞. The situation is very different in high dimensional settings in which p is allowed to grow with n. In this primarily expository paper, we use simple Gaussian sequence models to draw some conclusions about when a version of Bernstein-von Mises can hold.
We begin with a somewhat informal statement of the classical theorem. Suppose that Y1, …, Yn are i.i.d. observations from a distribution Pθ having density pθ(y)dμ(y), where θ ∈ Θ ⊂ ℝ^p. The log-likelihood for a single observation is

ℓ(θ; y) = log pθ(y),

and, as usual, the score function vector and Fisher information matrix are given by

∂ℓ(θ; y)/∂θ,  I(θ) = Eθ[(∂ℓ/∂θ)(∂ℓ/∂θ)ᵀ].

Writing Yn = (Y1, …, Yn) for the full data, the log-likelihood is

Ln(θ) = Σ_{i=1}^n ℓ(θ; Yi),

and we write θ̂n for a maximizer of Ln(θ). Classical likelihood theory says that any (nice) estimator θ̃n satisfies, asymptotically, the information bound

Var θ̃n ≥ (n I(θ))^{−1}

in the usual ordering of nonnegative definite matrices, and that the bound is asymptotically attained by the MLE, which is also asymptotically Gaussian:

√n (θ̂n − θ) ⇒ Np(0, I(θ)^{−1}).
Now suppose that π(θ) is the density of a prior distribution with respect to Lebesgue measure. Then the posterior distribution of θ given Yn is given by Bayes' rule; we denote it simply by Pθ|Yn.
The Bernstein-von Mises theorem says, informally, that this posterior distribution is, in large samples, approximately normal with mean approximately the MLE θ̂n, and variance matrix approximately (n I(θ0))^{−1} (here θ0 is the `true' value of θ generating the observations Y1, …, Yn). Using the scalar case for simplicity, and writing θ̂ = θ̂n and sn = (n I(θ̂))^{−1/2}, we have that an approximate 100(1 − α)% credible interval for θ would be given by θ̂ ± z_{α/2} sn. This is exactly the same as the frequentist confidence interval based on asymptotic normality of the MLE. Thus in large samples the effect of the prior density π disappears: “the data overwhelms the prior”.
A somewhat more formal statement uses the notion of variation distance between probability measures P and Q, and an equivalent expression in terms of the densities p = dP/dμ and q = dQ/dμ relative to a dominating measure μ:

∥P − Q∥ = sup_B |P(B) − Q(B)| = (1/2) ∫ |p − q| dμ.
Suppose that π(θ) is continuous and positive at the `true' value θ0, and that θ → Pθ is differentiable in quadratic mean and satisfies a further mild separation condition; then

∥ Pθ|Yn − N(θ̂n, (n I(θ0))^{−1}) ∥ → 0 | (1) |

in probability under Pθ0.
In other words, the variation distance between the posterior and the approximating Gaussian distribution is a random variable depending on Yn, which converges to zero in probability under repeated draws from Pθ0.
A development of the Bernstein-von Mises theorem as formulated above may be found in van der Vaart (1998, §10.2). A proof due to Bickel is given in Lehmann and Casella (1998, §6.8). Extensions from independent to dependent sampling settings are possible, see e.g. Borwanker et al. (1971); Heyde and Johnstone (1979). For further references and methods of proof of the classical results, see Ghosh and Ramamoorthi (2003, §1.4 and §1.5).
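As a concrete check of the informal statement above, the following small calculation (our added illustration, not part of the original development; it uses only numpy and the standard library) compares the exact Beta posterior in a Bernoulli(θ) model with its Bernstein-von Mises normal approximation N(θ̂, θ̂(1 − θ̂)/n), and shows the variation distance shrinking as n grows.

```python
import math
import numpy as np

def tv_beta_vs_normal(n, successes, a=1.0, b=1.0, grid=20000):
    """Total variation between the Beta(a+s, b+n-s) posterior for a
    Bernoulli parameter and its Bernstein-von Mises normal approximation
    N(mle, mle*(1-mle)/n), computed by numerical integration on a grid."""
    s = successes
    theta = np.linspace(1e-6, 1 - 1e-6, grid)
    # exact Beta(a+s, b+n-s) posterior density, via log-gamma for stability
    a1, b1 = a + s, b + n - s
    logc = math.lgamma(a1 + b1) - math.lgamma(a1) - math.lgamma(b1)
    post = np.exp(logc + (a1 - 1) * np.log(theta) + (b1 - 1) * np.log1p(-theta))
    # normal approximation centred at the MLE with variance 1/(n I(mle))
    mle = s / n
    var = mle * (1 - mle) / n
    approx = np.exp(-(theta - mle) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    dtheta = theta[1] - theta[0]
    return 0.5 * np.sum(np.abs(post - approx)) * dtheta

# TV distance shrinks as n grows (success fraction held near 0.3)
d_small = tv_beta_vs_normal(n=50, successes=15)
d_large = tv_beta_vs_normal(n=5000, successes=1500)
```

With a flat Beta(1, 1) prior the posterior is exactly Beta; the residual distance at large n reflects only the skewness that the normal approximation ignores.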
1. Growing Gaussian location model
In nonparametric and semiparametric settings the situation is very different. Even frequentist consistency of nonparametric Bayesian methods is a difficult issue with a large literature of both positive and negative results (e.g. Ghosh and Ramamoorthi (2003); Ghosal and van der Vaart (2010)). One cannot therefore expect Bernstein-von Mises phenomena in any great generality for the full posterior.
In this largely expository paper, we do some simple calculations in symmetric Gaussian sequence models. The Gaussian sequence structure makes possible an elementary set of examples that avoid the technical challenges posed by, and sophistication needed for, posterior Gaussian approximation in high dimensional settings (see references in Section 5). Nevertheless, the Gaussian examples can conveniently illustrate some of the issues related to validity of the Bernstein-von Mises theorem in high dimensional models. Depending on the frequentist or Bayesian perspective, we assume that p = p(n) grows with n, and one, or both, of

(D) Y | θ ~ Np(θ, σn² Ip),  σn² = σ²/n,
(P) θ ~ Np(0, τn² Ip).

The notation σn² = σ²/n suggests an average (Y1+⋯+Yn)/n of observations individually of variance σ², so that in this case the MLE is θ̂ = Y. [If p were held fixed, not depending on n, then (n I(θ))^{−1} = σn² Ip would match with the definition given in the introductory section.] We also allow the prior variance τn² to depend on the sample size n.
Our goal is to compare the Bayesian posterior distribution with frequentist distributions, in particular those of the MLE θ̂ = Y and of the posterior mean Bayes estimator E(θ | Y). A key simplification is that since both prior and likelihood are Gaussian, so also is the posterior distribution, and hence all the behavior will be determined by centering and scaling. Thus from standard results, the posterior is given by

θ | Y ~ Np(wn Y, wn σn² Ip),  wn = τn²/(τn² + σn²). | (2) |
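The conjugate formula (2) can be checked by a quick Monte Carlo sketch (our added illustration; the variance values are chosen arbitrarily): in the joint Gaussian model the regression slope of θ on Y recovers the shrinkage factor wn, and the residual variance of θ about wnY recovers wnσn².

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.5, 2.0            # illustrative noise and prior variances
w = tau2 / (tau2 + sigma2)         # shrinkage factor w_n from (2)

# draw (theta, y) from the joint model (P), (D) with p = 1
theta = rng.normal(0.0, np.sqrt(tau2), size=200_000)
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=theta.size)

# in a bivariate Gaussian, E[theta | y] is linear in y with slope
# cov(theta, y) / var(y) = tau2 / (tau2 + sigma2) = w, matching (2)
slope = np.cov(theta, y)[0, 1] / np.var(y)

# posterior variance w * sigma2: residual variance of theta about w * y
resid_var = np.var(theta - w * y)
```

Here w = 0.8 and wσn² = 0.4, and both estimates land within Monte Carlo error of those values.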
Remarks: 1. The reference to Gaussian sequence models becomes clearer if, as will be helpful later, we write out assumptions (D) and (P) in co-ordinates:

(Dseq) yk = θk + σn εk,  (Pseq) θk = τn ζk,

with εk and ζk all i.i.d. standard Gaussian, for k = 1, …, p(n).
Strictly speaking, the indexing by n of parameters σn, τn and p(n) creates a sequence of sequence models. However, one can, as needed for almost sure results, think of the infinite sequences {(εk, ζk), k ≥ 1} as being drawn from a single common probability space.
2. We also consider the infinite sequence Gaussian white noise model
dY(t) = f(t)dt + σn dW(t),  0 ≤ t ≤ 1, | (3) |
or equivalently, when expressed in any orthonormal basis {φk(t)} for L2[0, 1],
yk = θk + σn εk,  θk = ⟨f, φk⟩,  k = 1, 2, …, | (4) |
where it is assumed that f ∈ L2[0, 1], that is, Σk θk² < ∞. For some examples, it is helpful to use doubly indexed orthonormal bases {ψjk} such as arise with systems of orthonormal wavelets.
The forthcoming book Johnstone (2010) will have more on estimation in such Gaussian sequence models.
We develop three perspectives on the Bernstein-von Mises phenomenon:
1. global convergence of the joint posterior distribution,
2. behavior of a non-linear functional, the squared error loss ∥θ − wnY∥², and
3. estimation of linear functionals ⟨a, θ⟩.
We shall see that these situations are progressively “less demanding” in terms of validity of the Bernstein-von Mises phenomenon. Indeed, case (1) requires that wn → 1 at a sufficiently fast rate, while setting (2) needs only wn → 1. In case (3), the formulation itself delivers wn → 1, and covers at least all bounded linear functionals.
2. Global convergence of posterior
The first calculation considers the p-dimensional posterior distribution (2) and shows that the convergence in (1) occurs, even in the best possible case that θ0 = 0, only if the shrinkage factor wn approaches 1 at a sufficiently fast rate.
Proposition 1
Let θ0 = 0. The variation distance between the posterior distribution N(wnY, wnσn²Ip) and N(Y, σn²Ip) converges to zero in Pθ0-probability if and only if √p (1 − wn) → 0, or equivalently, if

√p σn²/τn² → 0. | (5) |
PROOF. We introduce notation Py,n(dθ) for the posterior distribution of θ | Yn = y and Qy,n(dθ) for the Gaussian distribution centered at the MLE θ̂ = y. Thus

Py,n = Np(wn y, wn σn² Ip),  Qy,n = Np(y, σn² Ip). | (6) |
Let ρ(P, Q) = ∫ √(pq) dμ denote the Hellinger affinity between two probability measures P, Q having densities p, q with respect to a common dominating measure μ. We recall an elementary bound (van der Vaart, 1998, p. 212) for variation distance in terms of Hellinger distance and hence Hellinger affinity:

1 − ρ(P, Q) ≤ ∥P − Q∥ ≤ [2(1 − ρ(P, Q))]^{1/2}. | (7) |
Thus ∥Py,n − Qy,n∥ → 0 if and only if ρ(Py,n, Qy,n) → 1. We recall also that affinity commutes with products:

ρ(Πk Pk, Πk Qk) = Πk ρ(Pk, Qk).
An elementary calculation shows that

ρ(N(μ1, σ1²), N(μ2, σ2²)) = [2σ1σ2/(σ1² + σ2²)]^{1/2} exp{−(μ1 − μ2)²/[4(σ1² + σ2²)]}. | (8) |
When applied to Py,n and Qy,n, we set μ1 = wn yk, μ2 = yk, σ1² = wn σn², and σ2² = σn², to obtain

ρ(Py,n, Qy,n) = [2√wn/(1 + wn)]^{p/2} exp{−(1 − wn)² ∥y∥²/[4(1 + wn)σn²]}. | (9) |
Introduce rn = σn²/τn². Suppose first that √p rn → 0. Since wn = (1 + rn)−1, we have

2√wn/(1 + wn) ≥ 1 − c rn²

for rn ≤ 1, say. When p → ∞, we have with probability tending to one that ∥y∥² ≤ 2pσn², and so for rn ≤ 1,

log ρ(Py,n, Qy,n) ≥ (p/2) log(1 − c rn²) − p(1 − wn)²/2 ≥ −c′ p rn².

Consequently, when √p rn → 0, ρ(Py,n, Qy,n) → 1, and hence ∥Py,n − Qy,n∥ → 0, in Pθ0-probability.

Suppose now that √p rn does not approach 0. Again with probability tending to one, ∥y∥² ≥ pσn²/2, and since 2√wn/(1 + wn) ≤ 1, we have from (9) that

ρ(Py,n, Qy,n) ≤ exp{−p(1 − wn)²/16} = exp{−p rn²/[16(1 + rn)²]},

so that 1 − ρ(Py,n, Qy,n), and hence ∥Py,n − Qy,n∥, cannot converge to zero if √p rn does not.
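The dichotomy in Proposition 1 is easy to see numerically from formula (9) alone. The following sketch (our addition; it evaluates (9) at the typical value ∥y∥² ≈ pσn² under θ0 = 0, with σn² scaled to 1) shows the affinity staying near 1 when √p rn is small and collapsing towards 0 when it is not.

```python
import numpy as np

def affinity(p, w, y_norm_sq, sig2=1.0):
    """Hellinger affinity (9) between the posterior N(w y, w sig2 I_p)
    and the MLE-centered law N(y, sig2 I_p)."""
    log_rho = (p / 2) * np.log(2 * np.sqrt(w) / (1 + w)) \
              - (1 - w) ** 2 * y_norm_sq / (4 * (1 + w) * sig2)
    return np.exp(log_rho)

def shrinkage(p, c):
    """w_n when sqrt(p) * r_n = c, where r_n = sigma_n^2 / tau_n^2."""
    return 1.0 / (1.0 + c / np.sqrt(p))

fast, slow = [], []
for p in (10**2, 10**4, 10**6):
    y_norm_sq = p * 1.0   # typical size of ||y||^2 under theta0 = 0, sig2 = 1
    fast.append(affinity(p, shrinkage(p, 0.01), y_norm_sq))  # sqrt(p) r_n -> 0
    slow.append(affinity(p, shrinkage(p, 10.0), y_norm_sq))  # sqrt(p) r_n stays large
```

In the first column √p rn = 0.01 and the affinity is essentially 1 at every dimension; in the second, √p rn = 10 and the affinity is bounded well away from 1, so by (7) the variation distance cannot vanish.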
Remark. If θ0 = θ0n ≠ 0, so that the data mean differs from the prior mean, then the rate condition (5) is replaced by

√p σn²/τn² + σn ∥θ0n∥/τn² → 0.
Example. We illustrate the result by considering estimation in the Gaussian white noise model (3). When expressed in a suitable orthonormal basis of wavelets {ψjk}, we obtain yjk = θjk + σn εjk, for k = 1, …, 2^j, with θjk = ⟨f, ψjk⟩. Pinsker's theorem (Pinsker, 1980) describes the minimax linear estimator of f, or equivalently of (θjk), under squared error loss when it is assumed that f has α mean square derivatives, and shows that such minimax linear estimators are asymptotically minimax among all estimators as σn → 0.
Pinsker's estimator is linear, and hence is necessarily the posterior mean Bayes estimator for a corresponding Gaussian prior. The mean square differentiability condition can be equivalently expressed in terms of the coefficients as

Σjk 2^{2jα} θjk² ≤ C²,

and the corresponding least favorable Gaussian prior puts

θjk ~ N(0, τj²),  τj² = σn² (μn 2^{−jα} − 1)+, | (10) |

where μn = cαn(C/σn)^{2α/(2α+1)}. The constant cαn satisfies bounds independent of n, c1α ≤ cαn ≤ c2α, whose precise values are unimportant here; for further details see Johnstone (2010).
We consider the validity of the Bernstein-von Mises phenomenon for the collection of coefficients {θjk, k = 1, …, 2j} at a given level j = j(n)–possibly fixed, or possibly varying with n.
The prior variances decrease with j, and vanish above a “critical level” j* = j*(α, C; n). Since j* ~ (2/(2α+1)) log2(C/σn) grows with n, so does the number 2^{j*} of parameters θj*,k at the critical level. From (10), since j* is the last level with τj² > 0, we have μn ≤ 2^{(j*+1)α}, and we conclude that at the critical level

wn = τj*²/(τj*² + σn²) = 1 − 2^{j*α}/μn ≤ 1 − 2^{−α},

and hence that wn does not approach 1, so that the condition of Proposition 1 fails.
On the other hand, at a fixed level j0, we have p = 2^{j0} fixed and τj0²/σn² = μn 2^{−j0α} − 1 → ∞, so that wn → 1, √p(1 − wn) → 0, and so Proposition 1 applies. Thus we may say informally that the Bernstein-von Mises phenomenon holds at a fixed level but fails at the critical level.
3. Behavior of the squared loss
In this section, we pay homage to a remarkable paper by Freedman (1999), itself stimulated by Cox (1993), which sets out the failure of the Bernstein-von Mises theorem in a simple sequence model of function estimation in Gaussian white noise. To further simplify the calculations, we use the growing Gaussian location model (D), (P), yielding results parallel to, but not identical with, Freedman's. Hence, define the squared error loss of the posterior mean,

Tn = ∥θ − wn Y∥² = Σk (θk − wn yk)².
The posterior distribution of θ|Y is described by (2); in particular the shrinkage factor again plays a critical role.
Theorem 2 (Bayesian) The posterior distribution of Tn is given by

Tn | Y =d Cn + Dn Z1n,

where

Cn = p wn σn², | (11) |

Dn = √(2p) wn σn², | (12) |

and the random variable Z1n has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞.
Proof. From (2), the posterior distribution of θ − wnY given Y is N(0, wn σn² Ip), so that Tn | Y =d wn σn² χ²(p); in particular it is free of Y. Hence we have the representation

Tn | Y =d wn σn² [p + √(2p) Z1n],  Z1n = (χ²(p) − p)/√(2p),

and the theorem follows because Z1n ⇒ N(0, 1) as p → ∞.
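Theorem 2 amounts to saying that, given Y, Tn is a scaled χ²(p) variable. A short simulation (our added illustration; dimension and shrinkage values are arbitrary) confirms the centering and scaling (11)-(12).

```python
import numpy as np

rng = np.random.default_rng(1)
p, sig2, w = 2000, 1.0, 0.7             # illustrative dimension, noise variance, shrinkage
Cn = p * w * sig2                       # posterior mean of Tn, eq. (11)
Dn = np.sqrt(2 * p) * w * sig2          # posterior SD of Tn, eq. (12)

# posterior draws: theta - w y ~ N(0, w sig2 I_p) given y,
# so Tn | y =d w sig2 * chi^2_p, free of y
draws = w * sig2 * rng.chisquare(p, size=100_000)
```

The sample mean and standard deviation of the draws match Cn and Dn to within Monte Carlo error, and a histogram of (draws − Cn)/Dn is close to standard normal for p this large.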
Turn now to the frequentist perspective, in which θ is a fixed and unknown (sequence of) parameters. We will therefore use the decomposition yk = θk + σn εk with εk i.i.d. standard Gaussian, c.f. (Dseq) above. Since the posterior mean is wn y, we have

θk − wn yk = (1 − wn) θk − wn σn εk. | (13) |
Some of the conclusions will be valid only for “most” θ: to formulate this it is useful to give θ a distribution. The natural one to use is (P), despite the possible confusion arising because, for the frequentist, this is not an a priori law!
Theorem 3 (Frequentist) The conditional distribution of Tn given θ is given by

Tn | θ =d Cn + √(2p) wn(1 − wn) σn² Z2n(θ) + D̃n(θ) Z3n(θ, ε), | (14) |

where Cn is as in Theorem 2, while Z3n(θ, ε) has mean 0 and variance 1, and

Z2n(θ) = [∥θ∥²/τn² − p]/√(2p),  D̃n(θ)² = Var(Tn | θ) = 4wn²(1 − wn)²σn²∥θ∥² + 2p wn⁴ σn⁴.

If θ is distributed according to (P), then Z2n(θ) has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞. In addition, if wn → w = 1 − cos ω, then

D̃n(θ)/Dn → [w(2 − w)]^{1/2} = sin ω, | (15) |

and

Z3n(θ, ε) ⇒ N(0, 1). | (16) |
Formulas (15) and (16) hold as n → ∞, for almost all θ's generated from (P).
Proof. Using (13), and writing ⟨θ, ε⟩ = Σk θk εk, we may write

Tn = (1 − wn)²∥θ∥² − 2wn(1 − wn)σn⟨θ, ε⟩ + wn²σn²∥ε∥², | (17) |

with

E(Tn | θ) = (1 − wn)²∥θ∥² + p wn²σn².

This leads immediately to the representation (14) after observing that (1 − wn)²τn² = wn(1 − wn)σn², so that

(1 − wn)²∥θ∥² = p wn(1 − wn)σn² + √(2p) wn(1 − wn)σn² Z2n(θ),

and setting Z3n(θ, ε) = [Tn − E(Tn | θ)]/D̃n(θ).

Turning to the final assertions, we may rewrite

D̃n(θ)² = G1n(θ) + G2n,

where

G1n(θ) = 4wn²(1 − wn)²σn²∥θ∥²,  G2n = 2p wn⁴σn⁴.

Using again (1 − wn)²τn² = wn(1 − wn)σn², we have

G1n(θ) = 4p wn³(1 − wn)σn⁴ · [∥θ∥²/(p τn²)].

For almost all θ's generated from (P), ∥θ∥²/(pτn²) → 1, and since D̃n(θ)² = G1n(θ) + G2n,

D̃n(θ)²/Dn² → [4w³(1 − w) + 2w⁴]/(2w²) = w(2 − w) = sin²ω,

and (15) follows.

Finally, write Tn − E(Tn | θ) = G1n(θ)^{1/2} Z4n(θ) + G2n^{1/2} Z5n, where

Z4n(θ) = −⟨θ, ε⟩/∥θ∥,  Z5n = (∥ε∥² − p)/√(2p).

Clearly Z4n(θ) ~ N(0, 1), free of θ, while Z5n ⇒ N (0, 1), and so (16) follows.
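A Monte Carlo sketch of Theorem 3 (our addition, with the arbitrary choices p = 2000, wn = 0.75): holding one θ drawn from (P) fixed and replicating only the noise, the conditional mean of Tn sits near Cn while the conditional SD is deflated relative to Dn by the factor [wn(2 − wn)]^{1/2} < 1.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sig2, tau2 = 2000, 1.0, 3.0
w = tau2 / (tau2 + sig2)                 # w_n = 0.75, bounded away from 1
Dn = np.sqrt(2 * p) * w * sig2           # Bayesian SD of Tn, eq. (12)

theta = rng.normal(0.0, np.sqrt(tau2), size=p)   # one 'true' theta drawn from (P)

# frequentist replications of Tn = ||theta - w y||^2, using (13):
# theta - w y = (1 - w) theta - w sigma eps
reps = 4000
eps = rng.standard_normal((reps, p))
Tn = np.sum(((1 - w) * theta - w * np.sqrt(sig2) * eps) ** 2, axis=1)

mean_ratio = Tn.mean() / (p * w * sig2)  # conditional mean stays close to Cn
sd_ratio = Tn.std() / Dn                 # tends to sqrt(w(2 - w)) = sin(omega) < 1
target = np.sqrt(w * (2 - w))
```

With w = 0.75 the deflation factor is √0.9375 ≈ 0.97, and the simulated SD ratio lands near it, illustrating why frequentist intervals built from the posterior spread are conservative in one direction yet mis-centered through the wobble term.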
Remark. The doctrinaire frequentist would not contemplate the joint distribution of (θ, Y) in (D, P); but anyone else would observe that in that joint distribution, E Tn = Cn, as follows easily in two ways, either from the proof of Theorem 2, or from (17).
The Bernstein-von Mises theorem fails if lim wn = w < 1, as may be seen in Figure 1. For the Bayesian, conditional on Y, θ − wnY is a noise vector, and Theorem 2 says that the distribution of Tn is approximately normal with mean Cn and standard deviation Dn. For the frequentist, the posterior mean wnY is biased (also asymptotically) as an estimator of θ, and some of Tn comes from this bias. As a result, Theorem 3 says that, conditional on θ, Tn is approximately normal with mean Cn + √(2p)wn(1 − wn)σn² Z2n(θ) and standard deviation D̃n(θ). Comparing (12) and (15) shows that the frequentist SD D̃n(θ) ≈ Dn sin ω is smaller than the Bayesian Dn.
Fig 1.
The top panel, for the Bayesian, has θ − wnY as a noise vector, and the posterior distribution of Tn approximately N(Cn, Dn²). The bottom panel, for the frequentist, shows the effect of the bias of the posterior mean wnY for θ, with Tn | θ approximately N(Cn + √(2p)wn(1 − wn)σn²Z2n(θ), D̃n(θ)²).
Under the assumption (P), the `wobble' √(2p)wn(1 − wn)σn²Z2n(θ) in the frequentist mean can be arbitrarily large relative to Dn: from the law of the iterated logarithm, with probability one,

lim sup_n √(2p)wn(1 − wn)σn² Z2n(θ) / [Dn √(2 log log p)] = 1 − w > 0.
By contrast, if lim wn = 1, then the wobble disappears, √(2p)wn(1 − wn)σn²Z2n(θ) = o(Dn √(2 log log p)) almost surely, and the Bayesian SD equals the frequentist SD asymptotically: D̃n(θ)/Dn → 1.
4. Linear Functionals
We turn now to the least demanding of our three scenarios for the Bernstein-von Mises theorem: the behavior of linear functionals. We change the setting slightly to the infinite sequence Gaussian white noise model (3). We consider linear functionals Lf such as integrals ∫ a(t)f(t)dt or derivatives f(r)(t0): if f has expansion f = Σk θk φk, then on setting ak = Lφk, we have

Lf = Σk ak θk.
Again, for maximum simplicity, we consider Gaussian priors on the coefficients:
θk ~ N(0, τk²),  independently,  k = 1, 2, …. | (18) |
In order that Lf = Σ ak θk be well defined with probability 1, it is necessary and sufficient that Σ ak²τk² < ∞.
Consequently, the posterior laws are Gaussian,

θk | y ~ N(wkn yk, wkn σn²),  independently,

again with

wkn = τk²/(τk² + σn²), | (19) |

so that the posterior mean estimate is

L̂f = E(Lf | y) = Σk wkn ak yk.
Centering at posterior mean. Write L̂f = Σ wkn ak yk for the posterior mean. For the Bayesian, the posterior distribution is

L(Lf − L̂f | y) = N(0, Vyn),  Vyn = σn² Σ ak² wkn,

while for the frequentist, the conditional distribution is

L(L̂f − Eθ L̂f | θ) = N(0, VFn),  VFn = σn² Σ ak² wkn².
The Bayesian might use 100(1 − α)% posterior credible intervals of the form L̂f ± z_{α/2} √Vyn, while the frequentist might employ 100(1 − α)% confidence intervals L̂f ± z_{α/2} √VFn. This leads us to consider the variance ratio

VFn/Vyn = Σ ak² wkn² / Σ ak² wkn ≤ 1, | (20) |

from which we see that the frequentist intervals are narrower, because the frequentist bias is being ignored for now, along with the attendant implications for coverage (but see below).
As sample size n → ∞, the noise variance σn² → 0 and so for a given Gaussian prior (18), the weights (19) converge marginally: wkn → 1 for each fixed k. This alone does not imply convergence of the variance ratio VFn/Vyn → 1, as a later example shows. A sufficient condition is that the linear functional Lf be bounded (as a mapping from L2[0, 1] to ℝ). This amounts to saying that Lf has the representation Lf = ∫ a(t)f(t)dt with ∫ a²(t)dt < ∞, or equivalently, in sequence terms, that Σ ak² < ∞.
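The convergence VFn/Vyn → 1 for bounded functionals can be seen numerically (our sketch, with the hypothetical choices ak = 1/k and τk² = 1/k): as σn² → 0 the variance ratio (20) climbs to 1, since each weight wkn → 1 and dominated convergence controls the summable tails.

```python
import numpy as np

k = np.arange(1.0, 200_001.0)
a2 = 1.0 / k**2          # a_k = 1/k: bounded functional, sum a_k^2 < infinity
tau2 = 1.0 / k           # illustrative prior variances tau_k^2 = 1/k

def variance_ratio(sig2):
    w = tau2 / (tau2 + sig2)                    # weights w_kn of (19)
    return np.sum(a2 * w**2) / np.sum(a2 * w)   # V_Fn / V_yn as in (20)

ratios = [variance_ratio(s) for s in (1.0, 0.1, 0.001)]
```

The ratio increases monotonically towards 1 along this sequence of noise levels, in line with Proposition 4 below; truncating the sums at 200,000 terms costs only a negligible summable tail.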
Proposition 4
Let Pf,n denote the measure corresponding to (3). If Lf = ∫ af is a bounded linear functional, then the variation distance between the Bayesian and frequentist distributions converges to zero:

∥ N(0, Vyn) − N(0, VFn) ∥ → 0. | (21) |
Proof. We again use the Hellinger affinity bounds (7) and apply (8) to the laws N(0, Vyn) and N(0, VFn) to obtain

ρ = [2√(Vyn VFn)/(Vyn + VFn)]^{1/2}.

In view of (7), the merging in (21) occurs if and only if ρ → 1, that is, if and only if

VFn/Vyn → 1.

When Σ ak² < ∞, this convergence follows from (20) and the dominated convergence theorem.
Remarks. 1. Examples of bounded functionals include polynomials a(t) and “regions of interest” a(t) = I{t ∈ B}.
2. Examples of unbounded functionals are given by evaluation of a function (or its derivatives) at a point: Lf = f(r)(t0). We shall see that the variance ratio does not converge to 1, and so the Bernstein-von Mises theorem fails. Indeed, in the Fourier basis

φ1(t) ≡ 1,  φ2k(t) = √2 cos 2πkt,  φ2k+1(t) = √2 sin 2πkt,  k = 1, 2, …,

we find that ak = Lφk = φk(r)(t0), and an easy calculation shows that a2k² + a2k+1² = 2(2πk)^{2r}. We use a Gaussian prior (18) with τk² = k^{−2m} and 2m > 2r + 1. It follows from (19) that, after writing V1n and V2n for Vyn and VFn respectively, we have

Vjn = σn² Σk ak² (wkn)^j,  wkn = (1 + σn² k^{2m})^{−1},  j = 1, 2.
As λ → 0, sums of the form

Σ_{k≥1} k^p (1 + λk^q)^{−j} ~ c_{pqj} λ^{−μ},

with c_{pqj} = q^{−1} B(μ, j − μ) and μ = (p + 1)/q, provided jq > p + 1. In the present case, with λ = σn², p = 2r, q = 2m and j = 1, 2, we conclude that

V2n/V1n → c_{2r,2m,2}/c_{2r,2m,1} = 1 − (2r + 1)/(2m) < 1.
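The asymptotic evaluation of such sums is easy to test numerically. The sketch below (our addition) checks the λ^{−μ} rate in the simplest point-evaluation case r = 0, m = 1, with the constant c_{pqj} = q^{−1}B(μ, j − μ) obtained from the integral approximation Σ ≈ ∫0^∞ x^p(1 + λx^q)^{−j} dx.

```python
import math
import numpy as np

def S(p, q, j, lam, K=2_000_000):
    """Truncated sum of k^p (1 + lam k^q)^{-j} over k = 1..K."""
    k = np.arange(1, K + 1, dtype=float)
    return np.sum(k**p / (1 + lam * k**q)**j)

def c(p, q, j):
    """Constant c_{pqj} = B(mu, j - mu)/q with mu = (p+1)/q."""
    mu = (p + 1) / q
    return math.gamma(mu) * math.gamma(j - mu) / (q * math.gamma(j))

# check S ~ c * lam^{-mu} for p = 2r = 0, q = 2m = 2 (so mu = 1/2)
lam = 1e-6
s1, s2 = S(0, 2, 1, lam), S(0, 2, 2, lam)
c1, c2 = c(0, 2, 1), c(0, 2, 2)   # pi/2 and pi/4 respectively
```

Both truncated sums match c λ^{−1/2} to well under a percent, and the ratio s2/s1 is close to c2/c1 = 1 − μ = 1/2, the limiting variance ratio V2n/V1n for this case.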
Centering at the MLE. For a bounded linear functional, the MLE L̄f = Σ ak yk is well defined and unbiased, with mean Lf and frequentist variance Vn = σn² Σ ak². A frequentist might prefer to use 100(1 − α)% intervals L̄f ± z_{α/2} √Vn, which will have the correct coverage property. However, extra conditions are required for the Bernstein-von Mises result to hold in this case.
Proposition 5
Assume that Lf = ∫ af is a bounded linear functional. Suppose also that the coefficients θk = ⟨f, φk⟩ of the `true' f and the variances τk² of the Gaussian prior together satisfy Σ |ak θk|/τk < ∞. Then the variation distance between Bayesian and frequentist distributions converges to zero in Pf,n-probability:

∥ L(Lf | y) − N(L̄f, Vn) ∥ → 0, | (22) |

where L̄f = Σ ak yk and Vn = σn² Σ ak².
Proof. The argument is a slight elaboration of that used in the previous proposition. We use (7) and ρ(ΠPk, ΠQk) = Πρ(Pk, Qk) as before, but now the centerings L̂f = Σ wkn ak yk and L̄f = Σ ak yk differ, and (8) yields

ρ = [2√(Vyn Vn)/(Vyn + Vn)]^{1/2} exp{−Δn²/[4(Vyn + Vn)]},  Δn = L̄f − L̂f = Σ (1 − wkn) ak yk.

As before Vyn/Vn → 1 by dominated convergence. Using this and the expression for ρ, and in view of the bounds (7), the conclusion (22) is equivalent to Δn/σn →p 0. We may write

Δn = Σ (1 − wkn) ak θk + σn Σ (1 − wkn) ak εk.

The stochastic term has mean 0 and variance σn² Σ (1 − wkn)² ak² = o(σn²), again by dominated convergence. Thus we may focus on the deterministic term, and note that the merging in (22) occurs if and only if

σn^{−1} Σ (1 − wkn) ak θk → 0.

The bound

(1 − wkn)|ak θk|/σn = σn |ak θk|/(τk² + σn²) ≤ |ak θk|/(2τk),

along with the dominated convergence theorem, then shows that Σ |ak θk|/τk < ∞ is a sufficient condition for (22), as claimed.
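The sufficiency argument can be checked numerically (our sketch, with hypothetical coefficients ak = 1/k, θk = k^{−2} and τk = 1/k, for which Σ |ak θk|/τk = Σ k^{−2} < ∞): the deterministic part of Δn/σn decreases to zero as σn → 0.

```python
import numpy as np

k = np.arange(1.0, 1_000_001.0)
a = 1.0 / k            # bounded functional
theta = 1.0 / k**2     # 'true' coefficients
tau2 = 1.0 / k**2      # prior variances, tau_k = 1/k; sum |a_k theta_k|/tau_k < inf

def bias_term(sig):
    """Deterministic part of Delta_n / sigma_n in the proof of Proposition 5."""
    one_minus_w = sig**2 / (tau2 + sig**2)   # 1 - w_kn from (19)
    return np.sum(one_minus_w * a * theta) / sig

vals = [bias_term(s) for s in (1.0, 0.1, 0.01)]
```

Each term is dominated by |ak θk|/(2τk), exactly the bound in the proof, so the sum is driven to zero term by term as σn decreases.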
5. Related work
As remarked earlier, this paper avoids the important Gaussian approximation part of the Bernstein-von Mises phenomenon by focusing on examples with Gaussian likelihoods and priors. A growing literature addresses the approximation challenges; we give a brief listing here, and refer to the books (Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010) and the survey discussion in Ghosal (2010, §2.7) for more detailed discussion.
Ghosal (1997, 1999, 2000) developed posterior normality results for the full posterior in cases where the dimension of the parameter space increases sufficiently slowly. In each case, the emphasis is on conditions under which a non-Gaussian likelihood and appropriate prior sequence can yield approximately Gaussian posteriors. However, Ghosal (2000, Sec. 4) specializes his results to our setting (D) and notes that one can choose priors, in general not Gaussian, so that the posterior distribution centered by the MLE is approximately Gaussian if p³(log p)/n → 0.
In survival analysis, Bernstein-von Mises theorems for the cumulative hazard function are established by Kim and Lee (2004) and for the cumulative hazard and fixed dimensional covariate regression parameter in a proportional hazards model in Kim (2006).
Boucheron and Gassiat (2009) develop a Bernstein-von Mises theorem for discrete probability distributions of growing dimension, and consider application to functionals such as Shannon and Renyi entropies.
In a semiparametric setting, where a finite dimensional parameter of interest can be separated from an infinite dimensional nuisance parameter, Castillo (2008) obtains conditions leading to a Bernstein-von Mises theorem on the parametric part, clarifying an earlier work of Shen (2002).
Rivoirard and Rousseau (2009) give conditions under which Bernstein-von Mises holds for linear functionals of a nonparametrically specified probability density function.
Acknowledgments
This work was supported in part by NIH grant RO1 EB 001988 and NSF DMS 0906812.
References
- Borwanker J, Kallianpur G, Prakasa Rao BLS. The Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist. 1971;42:1241–1253.
- Boucheron S, Gassiat E. A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 2009;3:114–148.
- Castillo I. A semiparametric Bernstein-von Mises theorem. 2008. Submitted.
- Cox DD. An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 1993;21(2):903–923.
- Freedman D. On the Bernstein-von Mises theorem with infinite-dimensional parameters. Ann. Statist. 1999;27:1119–1140.
- Ghosal S. Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 1997;6(3):332–348.
- Ghosal S. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli. 1999;5(2):315–331.
- Ghosal S. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 2000;74(1):49–68.
- Ghosal S. The Dirichlet process, related priors and posterior asymptotics. In: Hjort NL, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge University Press; 2010. Chapter 2.
- Ghosal S, van der Vaart A. Theory of Nonparametric Bayesian Inference. Cambridge University Press; 2010. In preparation.
- Ghosh JK, Ramamoorthi RV. Bayesian Nonparametrics. Springer Series in Statistics. Springer-Verlag; New York: 2003.
- Heyde CC, Johnstone IM. On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. Ser. B. 1979;41(2):184–189.
- Johnstone IM. Function estimation and Gaussian sequence models. 2010. Book manuscript at www-stat.stanford.edu.
- Kim Y. The Bernstein-von Mises theorem for the proportional hazard model. Ann. Statist. 2006;34(4):1678–1700.
- Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Ann. Statist. 2004;32(4):1492–1512.
- Lehmann EL, Casella G. Theory of Point Estimation. Springer Texts in Statistics. Second edn. Springer-Verlag; New York: 1998.
- Pinsker M. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii 16:52–68.
- Rivoirard V, Rousseau J. Bernstein-von Mises theorem for linear functionals of the density. 2009. Submitted.
- Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 2002;97(457):222–235.
- van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 3. Cambridge University Press; Cambridge: 1998.

