Abstract
In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.
Keywords: high dimensional inference, Gaussian sequence, linear functional, squared error loss, posterior distribution, frequentist
The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory. It is traditionally formulated in situations in which the number of parameters p is fixed and the sample size n → ∞. The situation is very different in high dimensional settings in which p is allowed to grow with n. In this primarily expository paper, we use simple Gaussian sequence models to draw some conclusions about when a version of Bernstein-von Mises can hold.
We begin with a somewhat informal statement of the classical theorem. Suppose that Y1, …, Yn are i.i.d. observations from a distribution Pθ having density pθ(y)dμ(y), where θ ∈ Θ ⊂ ℝ^p. The log-likelihood for a single observation is

ℓ(θ; y) = log pθ(y),

and, as usual, the score function vector and Fisher information matrix are given by

∂ℓ(θ; y)/∂θ,  I(θ) = Eθ[(∂ℓ/∂θ)(∂ℓ/∂θ)ᵀ].

Writing Yn = (Y1, …, Yn) for the full data, the log-likelihood is

Ln(θ) = Σ_{i=1}^n ℓ(θ; Yi),

and we write θ̂n for a maximizer of Ln(θ). Classical likelihood theory says that any (nice) estimator θ̃n satisfies, asymptotically, the information bound

Var θ̃n ≥ (n I(θ))^{−1}

in the usual ordering of nonnegative definite matrices, and that the bound is asymptotically attained by the MLE, which is also asymptotically Gaussian:

√n (θ̂n − θ) ⇒ Np(0, I(θ)^{−1}).
Now suppose that π(θ) is the density of a prior distribution with respect to Lebesgue measure. Then the posterior distribution of θ given Yn is given by Bayes' rule; we denote it simply by Pθ|Yn.
The Bernstein-von Mises theorem says, informally, that this posterior distribution is, in large samples, approximately normal with mean approximately the MLE θ̂n, and variance matrix approximately (n I(θ0))^{−1} (here θ0 is the `true' value of θ generating the observations Y1, …, Yn). Using the scalar case for simplicity, and writing θ̂ = θ̂n and sn = (n I(θ̂))^{−1/2}, we have that an approximate 100(1 − α)% credible interval for θ would be given by θ̂ ± z_{α/2} sn. This is exactly the same as the frequentist confidence interval based on asymptotic normality of the MLE. Thus in large samples the effect of the prior density π disappears: “the data overwhelms the prior”.
A somewhat more formal statement uses the notion of variation distance between probability measures P and Q, and an equivalent expression in terms of the densities p = dP/dμ and q = dQ/dμ relative to a dominating measure μ:

∥P − Q∥ = sup_B |P(B) − Q(B)| = (1/2) ∫ |p − q| dμ.
Suppose that π(θ) is continuous and positive at the `true' value θ0, and that θ → Pθ is differentiable in quadratic mean and satisfies a further mild separation condition; then

∥ Pθ|Yn − N(θ̂n, (n I(θ0))^{−1}) ∥ → 0 | (1) |

in probability under Pθ0.
In other words, the variation distance between the posterior and the approximating Gaussian distribution is a random variable depending on Yn, which converges to zero in probability under repeated draws from Pθ0.
A development of the Bernstein-von Mises theorem as formulated above may be found in van der Vaart (1998, §10.2). A proof due to Bickel is given in Lehmann and Casella (1998, §6.8). Extensions from independent to dependent sampling settings are possible, see e.g. Borwanker et al. (1971); Heyde and Johnstone (1979). For further references and methods of proof of the classical results, see Ghosh and Ramamoorthi (2003, §1.4 and §1.5).
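As a concrete check of the informal statement above, the following small calculation (our added illustration, not part of the original development; it uses only numpy and the standard library) compares the exact Beta posterior in a Bernoulli(θ) model with its Bernstein-von Mises normal approximation N(θ̂, θ̂(1 − θ̂)/n), and shows the variation distance shrinking as n grows.

```python
import math
import numpy as np

def tv_beta_vs_normal(n, successes, a=1.0, b=1.0, grid=20000):
    """Total variation between the Beta(a+s, b+n-s) posterior for a
    Bernoulli parameter and its Bernstein-von Mises normal approximation
    N(mle, mle*(1-mle)/n), computed by numerical integration on a grid."""
    s = successes
    theta = np.linspace(1e-6, 1 - 1e-6, grid)
    # exact Beta(a+s, b+n-s) posterior density, via log-gamma for stability
    a1, b1 = a + s, b + n - s
    logc = math.lgamma(a1 + b1) - math.lgamma(a1) - math.lgamma(b1)
    post = np.exp(logc + (a1 - 1) * np.log(theta) + (b1 - 1) * np.log1p(-theta))
    # normal approximation centred at the MLE with variance 1/(n I(mle))
    mle = s / n
    var = mle * (1 - mle) / n
    approx = np.exp(-(theta - mle) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    dtheta = theta[1] - theta[0]
    return 0.5 * np.sum(np.abs(post - approx)) * dtheta

# TV distance shrinks as n grows (success fraction held near 0.3)
d_small = tv_beta_vs_normal(n=50, successes=15)
d_large = tv_beta_vs_normal(n=5000, successes=1500)
```

With a flat Beta(1, 1) prior the posterior is exactly Beta; the residual distance at large n reflects only the skewness that the normal approximation ignores.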
1. Growing Gaussian location model
In nonparametric and semiparametric settings the situation is very different. Even frequentist consistency of nonparametric Bayesian methods is a difficult issue with a large literature of both positive and negative results (e.g. Ghosh and Ramamoorthi (2003); Ghosal and van der Vaart (2010)). One cannot therefore expect Bernstein-von Mises phenomena in any great generality for the full posterior.
In this largely expository paper, we do some simple calculations in symmetric Gaussian sequence models. The Gaussian sequence structure makes possible an elementary set of examples that avoid the technical challenges posed by, and sophistication needed for, posterior Gaussian approximation in high dimensional settings (see references in Section 5). Nevertheless, the Gaussian examples can conveniently illustrate some of the issues related to validity of the Bernstein-von Mises theorem in high dimensional models. Depending on the frequentist or Bayesian perspective, we assume that p = p(n) grows with n, and one, or both, of

(D) Y | θ ~ Np(θ, σn² Ip),  σn² = σ²/n,
(P) θ ~ Np(0, τn² Ip).

The notation σn² = σ²/n suggests an average (Y1+⋯+Yn)/n of observations individually of variance σ², so that in this case the MLE is θ̂ = Y. [If p were held fixed, not depending on n, then (n I(θ))^{−1} = σn² Ip would match with the definition given in the introductory section.] We also allow the prior variance τn² to depend on the sample size n.
Our goal is to compare the Bayesian posterior distribution with frequentist distributions, in particular those of the MLE θ̂ = Y and of the posterior mean Bayes estimator E(θ | Y). A key simplification is that since both prior and likelihood are Gaussian, so also is the posterior distribution, and hence all the behavior will be determined by centering and scaling. Thus from standard results, the posterior is given by

θ | Y ~ Np(wn Y, wn σn² Ip),  wn = τn²/(τn² + σn²). | (2) |
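The conjugate formula (2) can be checked by a quick Monte Carlo sketch (our added illustration; the variance values are chosen arbitrarily): in the joint Gaussian model the regression slope of θ on Y recovers the shrinkage factor wn, and the residual variance of θ about wnY recovers wnσn².

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.5, 2.0            # illustrative noise and prior variances
w = tau2 / (tau2 + sigma2)         # shrinkage factor w_n from (2)

# draw (theta, y) from the joint model (P), (D) with p = 1
theta = rng.normal(0.0, np.sqrt(tau2), size=200_000)
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=theta.size)

# in a bivariate Gaussian, E[theta | y] is linear in y with slope
# cov(theta, y) / var(y) = tau2 / (tau2 + sigma2) = w, matching (2)
slope = np.cov(theta, y)[0, 1] / np.var(y)

# posterior variance w * sigma2: residual variance of theta about w * y
resid_var = np.var(theta - w * y)
```

Here w = 0.8 and wσn² = 0.4, and both estimates land within Monte Carlo error of those values.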
Remarks: 1. The reference to Gaussian sequence models becomes clearer if, as will be helpful later, we write out assumptions (D) and (P) in co-ordinates:

(Dseq) yk = θk + σn εk,  (Pseq) θk = τn ζk,

with εk and ζk all i.i.d. standard Gaussian, for k = 1, …, p(n).
Strictly speaking, the indexing by n of parameters σn, τn and p(n) creates a sequence of sequence models. However, one can, as needed for almost sure results, think of the infinite sequences {(εk, ζk), k ≥ 1} as being drawn from a single common probability space.
2. We also consider the infinite sequence Gaussian white noise model
dY(t) = f(t)dt + σn dW(t),  0 ≤ t ≤ 1, | (3) |
or equivalently, when expressed in any orthonormal basis {φk(t)} for L2[0, 1],
yk = θk + σn εk,  θk = ⟨f, φk⟩,  k = 1, 2, …, | (4) |
where it is assumed that f ∈ L2[0, 1], that is, Σk θk² < ∞. For some examples, it is helpful to use doubly indexed orthonormal bases {ψjk} such as arise with systems of orthonormal wavelets.
The forthcoming book Johnstone (2010) will have more on estimation in such Gaussian sequence models.
We develop three perspectives on the Bernstein-von Mises phenomenon:
1. global convergence of the joint posterior distribution,
2. behavior of a non-linear functional, the squared error loss ∥θ − wnY∥², and
3. estimation of linear functionals ⟨a, θ⟩.
We shall see that these situations are progressively “less demanding” in terms of validity of the Bernstein-von Mises phenomenon. Indeed, case (1) requires that wn → 1 at a sufficiently fast rate, while setting (2) needs only wn → 1. In case (3), the formulation itself delivers wn → 1, and covers at least all bounded linear functionals.
2. Global convergence of posterior
The first calculation considers the p-dimensional posterior distribution (2) and shows that the convergence in (1) occurs, even in the best possible case that θ0 = 0, only if the shrinkage factor wn approaches 1 at a sufficiently fast rate.
Proposition 1
Let θ0 = 0. The variation distance between the posterior distribution N(wnY, wnσn²Ip) and N(Y, σn²Ip) converges to zero in Pθ0-probability if and only if √p (1 − wn) → 0, or equivalently, if

√p σn²/τn² → 0. | (5) |
PROOF. We introduce notation Py,n(dθ) for the posterior distribution of θ | Yn = y and Qy,n(dθ) for the Gaussian distribution centered at the MLE θ̂ = y. Thus

Py,n = Np(wn y, wn σn² Ip),  Qy,n = Np(y, σn² Ip). | (6) |
Let ρ(P, Q) = ∫ √(pq) dμ denote the Hellinger affinity between two probability measures P, Q having densities p, q with respect to a common dominating measure μ. We recall an elementary bound (van der Vaart, 1998, p. 212) for variation distance in terms of Hellinger distance and hence Hellinger affinity:

1 − ρ(P, Q) ≤ ∥P − Q∥ ≤ [2(1 − ρ(P, Q))]^{1/2}. | (7) |
Thus ∥Py,n − Qy,n∥ → 0 if and only if ρ(Py,n, Qy,n) → 1. We recall also that affinity commutes with products:

ρ(Πk Pk, Πk Qk) = Πk ρ(Pk, Qk).
An elementary calculation shows that

ρ(N(μ1, σ1²), N(μ2, σ2²)) = [2σ1σ2/(σ1² + σ2²)]^{1/2} exp{−(μ1 − μ2)²/[4(σ1² + σ2²)]}. | (8) |
When applied to Py,n and Qy,n, we set μ1 = wn yk, μ2 = yk, σ1² = wn σn², and σ2² = σn², to obtain

ρ(Py,n, Qy,n) = [2√wn/(1 + wn)]^{p/2} exp{−(1 − wn)² ∥y∥²/[4(1 + wn)σn²]}. | (9) |
Introduce rn = σn²/τn². Suppose first that √p rn → 0. Since wn = (1 + rn)−1, we have

2√wn/(1 + wn) ≥ 1 − c rn²

for rn ≤ 1, say. When p → ∞, we have with probability tending to one that ∥y∥² ≤ 2pσn², and so for rn ≤ 1,

log ρ(Py,n, Qy,n) ≥ (p/2) log(1 − c rn²) − p(1 − wn)²/2 ≥ −c′ p rn².

Consequently, when √p rn → 0, ρ(Py,n, Qy,n) → 1, and hence ∥Py,n − Qy,n∥ → 0, in Pθ0-probability.

Suppose now that √p rn does not approach 0. Again with probability tending to one, ∥y∥² ≥ pσn²/2, and since 2√wn/(1 + wn) ≤ 1, we have from (9) that

ρ(Py,n, Qy,n) ≤ exp{−p(1 − wn)²/16} = exp{−p rn²/[16(1 + rn)²]},

so that 1 − ρ(Py,n, Qy,n), and hence ∥Py,n − Qy,n∥, cannot converge to zero if √p rn does not.
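The dichotomy in Proposition 1 is easy to see numerically from formula (9) alone. The following sketch (our addition; it evaluates (9) at the typical value ∥y∥² ≈ pσn² under θ0 = 0, with σn² scaled to 1) shows the affinity staying near 1 when √p rn is small and collapsing towards 0 when it is not.

```python
import numpy as np

def affinity(p, w, y_norm_sq, sig2=1.0):
    """Hellinger affinity (9) between the posterior N(w y, w sig2 I_p)
    and the MLE-centered law N(y, sig2 I_p)."""
    log_rho = (p / 2) * np.log(2 * np.sqrt(w) / (1 + w)) \
              - (1 - w) ** 2 * y_norm_sq / (4 * (1 + w) * sig2)
    return np.exp(log_rho)

def shrinkage(p, c):
    """w_n when sqrt(p) * r_n = c, where r_n = sigma_n^2 / tau_n^2."""
    return 1.0 / (1.0 + c / np.sqrt(p))

fast, slow = [], []
for p in (10**2, 10**4, 10**6):
    y_norm_sq = p * 1.0   # typical size of ||y||^2 under theta0 = 0, sig2 = 1
    fast.append(affinity(p, shrinkage(p, 0.01), y_norm_sq))  # sqrt(p) r_n -> 0
    slow.append(affinity(p, shrinkage(p, 10.0), y_norm_sq))  # sqrt(p) r_n stays large
```

In the first column √p rn = 0.01 and the affinity is essentially 1 at every dimension; in the second, √p rn = 10 and the affinity is bounded well away from 1, so by (7) the variation distance cannot vanish.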
Remark. If θ0 = θ0n ≠ 0, so that the data mean differs from the prior mean, then the rate condition (5) is replaced by

√p σn²/τn² + σn ∥θ0n∥/τn² → 0.
Example. We illustrate the result by considering estimation in the Gaussian white noise model (3). When expressed in a suitable orthonormal basis of wavelets {ψjk}, we obtain yjk = θjk + σn εjk, for k = 1, …, 2^j, with θjk = ⟨f, ψjk⟩. Pinsker's theorem (Pinsker, 1980) describes the minimax linear estimator of f, or equivalently of (θjk), under squared error loss when it is assumed that f has α mean square derivatives, and shows that such minimax linear estimators are asymptotically minimax among all estimators as σn → 0.
Pinsker's estimator is linear, and hence is necessarily the posterior mean Bayes estimator for a corresponding Gaussian prior. The mean square differentiability condition can be equivalently expressed in terms of the coefficients as

Σjk 2^{2jα} θjk² ≤ C²,

and the corresponding least favorable Gaussian prior puts

θjk ~ N(0, τj²),  τj² = σn² (μn 2^{−jα} − 1)+, | (10) |

where μn = cαn(C/σn)^{2α/(2α+1)}. The constant cαn satisfies bounds independent of n, c1α ≤ cαn ≤ c2α, whose precise values are unimportant here; for further details see Johnstone (2010).
We consider the validity of the Bernstein-von Mises phenomenon for the collection of coefficients {θjk, k = 1, …, 2j} at a given level j = j(n)–possibly fixed, or possibly varying with n.
The prior variances decrease with j, and vanish above a “critical level” j* = j*(α, C; n). Since j* ~ (2/(2α+1)) log2(C/σn) grows with n, so does the number 2^{j*} of parameters θj*,k at the critical level. From (10), since j* is the last level with τj² > 0, we have μn ≤ 2^{(j*+1)α}, and we conclude that at the critical level

wn = τj*²/(τj*² + σn²) = 1 − 2^{j*α}/μn ≤ 1 − 2^{−α},

and hence that wn does not approach 1, so that the condition of Proposition 1 fails.
On the other hand, at a fixed level j0, we have p = 2^{j0} fixed and τj0²/σn² = μn 2^{−j0α} − 1 → ∞, so that wn → 1, √p(1 − wn) → 0, and so Proposition 1 applies. Thus we may say informally that the Bernstein-von Mises phenomenon holds at a fixed level but fails at the critical level.
3. Behavior of the squared loss
In this section, we pay homage to a remarkable paper by Freedman (1999), itself stimulated by Cox (1993), which sets out the failure of the Bernstein-von Mises theorem in a simple sequence model of function estimation in Gaussian white noise. To further simplify the calculations, we use the growing Gaussian location model (D), (P), yielding results parallel to, but not identical with, Freedman's. Hence, define the squared error loss of the posterior mean,

Tn = ∥θ − wn Y∥² = Σk (θk − wn yk)².
The posterior distribution of θ|Y is described by (2); in particular the shrinkage factor again plays a critical role.
Theorem 2 (Bayesian) The posterior distribution of Tn is given by

Tn | Y =d Cn + Dn Z1n,

where

Cn = p wn σn², | (11) |

Dn = √(2p) wn σn², | (12) |

and the random variable Z1n has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞.
Proof. From (2), the posterior distribution of θ − wnY given Y is N(0, wn σn² Ip), so that Tn | Y =d wn σn² χ²(p); in particular it is free of Y. Hence we have the representation

Tn | Y =d wn σn² [p + √(2p) Z1n],  Z1n = (χ²(p) − p)/√(2p),

and the theorem follows because Z1n ⇒ N(0, 1) as p → ∞.
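Theorem 2 amounts to saying that, given Y, Tn is a scaled χ²(p) variable. A short simulation (our added illustration; dimension and shrinkage values are arbitrary) confirms the centering and scaling (11)-(12).

```python
import numpy as np

rng = np.random.default_rng(1)
p, sig2, w = 2000, 1.0, 0.7             # illustrative dimension, noise variance, shrinkage
Cn = p * w * sig2                       # posterior mean of Tn, eq. (11)
Dn = np.sqrt(2 * p) * w * sig2          # posterior SD of Tn, eq. (12)

# posterior draws: theta - w y ~ N(0, w sig2 I_p) given y,
# so Tn | y =d w sig2 * chi^2_p, free of y
draws = w * sig2 * rng.chisquare(p, size=100_000)
```

The sample mean and standard deviation of the draws match Cn and Dn to within Monte Carlo error, and a histogram of (draws − Cn)/Dn is close to standard normal for p this large.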
Turn now to the frequentist perspective, in which θ is a fixed and unknown (sequence of) parameters. We will therefore use the decomposition yk = θk + σn εk with εk i.i.d. standard Gaussian, c.f. (Dseq) above. Since the posterior mean is wn y, we have

θk − wn yk = (1 − wn) θk − wn σn εk. | (13) |
Some of the conclusions will be valid only for “most” θ: to formulate this it is useful to give θ a distribution. The natural one to use is (P), despite the possible confusion arising because, for the frequentist, this is not an a priori law!
Theorem 3 (Frequentist) The conditional distribution of Tn given θ is given by

Tn | θ =d Cn + √(2p) wn(1 − wn) σn² Z2n(θ) + D̃n(θ) Z3n(θ, ε), | (14) |

where Cn is as in Theorem 2, while Z3n(θ, ε) has mean 0 and variance 1, and

Z2n(θ) = [∥θ∥²/τn² − p]/√(2p),  D̃n(θ)² = Var(Tn | θ) = 4wn²(1 − wn)²σn²∥θ∥² + 2p wn⁴ σn⁴.

If θ is distributed according to (P), then Z2n(θ) has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞. In addition, if wn → w = 1 − cos ω, then

D̃n(θ)/Dn → [w(2 − w)]^{1/2} = sin ω, | (15) |

and

Z3n(θ, ε) ⇒ N(0, 1). | (16) |
Formulas (15) and (16) hold as n → ∞, for almost all θ's generated from (P).
Proof. Using (13), and writing ⟨θ, ε⟩ = Σk θk εk, we may write

Tn = (1 − wn)²∥θ∥² − 2wn(1 − wn)σn⟨θ, ε⟩ + wn²σn²∥ε∥², | (17) |

with

E(Tn | θ) = (1 − wn)²∥θ∥² + p wn²σn².

This leads immediately to the representation (14) after observing that (1 − wn)²τn² = wn(1 − wn)σn², so that

(1 − wn)²∥θ∥² = p wn(1 − wn)σn² + √(2p) wn(1 − wn)σn² Z2n(θ),

and setting Z3n(θ, ε) = [Tn − E(Tn | θ)]/D̃n(θ).

Turning to the final assertions, we may rewrite

D̃n(θ)² = G1n(θ) + G2n,

where

G1n(θ) = 4wn²(1 − wn)²σn²∥θ∥²,  G2n = 2p wn⁴σn⁴.

Using again (1 − wn)²τn² = wn(1 − wn)σn², we have

G1n(θ) = 4p wn³(1 − wn)σn⁴ · [∥θ∥²/(p τn²)].

For almost all θ's generated from (P), ∥θ∥²/(pτn²) → 1, and since D̃n(θ)² = G1n(θ) + G2n,

D̃n(θ)²/Dn² → [4w³(1 − w) + 2w⁴]/(2w²) = w(2 − w) = sin²ω,

and (15) follows.

Finally, write Tn − E(Tn | θ) = G1n(θ)^{1/2} Z4n(θ) + G2n^{1/2} Z5n, where

Z4n(θ) = −⟨θ, ε⟩/∥θ∥,  Z5n = (∥ε∥² − p)/√(2p).

Clearly Z4n(θ) ~ N(0, 1), free of θ, while Z5n ⇒ N (0, 1), and so (16) follows.
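A Monte Carlo sketch of Theorem 3 (our addition, with the arbitrary choices p = 2000, wn = 0.75): holding one θ drawn from (P) fixed and replicating only the noise, the conditional mean of Tn sits near Cn while the conditional SD is deflated relative to Dn by the factor [wn(2 − wn)]^{1/2} < 1.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sig2, tau2 = 2000, 1.0, 3.0
w = tau2 / (tau2 + sig2)                 # w_n = 0.75, bounded away from 1
Dn = np.sqrt(2 * p) * w * sig2           # Bayesian SD of Tn, eq. (12)

theta = rng.normal(0.0, np.sqrt(tau2), size=p)   # one 'true' theta drawn from (P)

# frequentist replications of Tn = ||theta - w y||^2, using (13):
# theta - w y = (1 - w) theta - w sigma eps
reps = 4000
eps = rng.standard_normal((reps, p))
Tn = np.sum(((1 - w) * theta - w * np.sqrt(sig2) * eps) ** 2, axis=1)

mean_ratio = Tn.mean() / (p * w * sig2)  # conditional mean stays close to Cn
sd_ratio = Tn.std() / Dn                 # tends to sqrt(w(2 - w)) = sin(omega) < 1
target = np.sqrt(w * (2 - w))
```

With w = 0.75 the deflation factor is √0.9375 ≈ 0.97, and the simulated SD ratio lands near it, illustrating why frequentist intervals built from the posterior spread are conservative in one direction yet mis-centered through the wobble term.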
Remark. The doctrinaire frequentist would not contemplate the joint distribution of (θ, Y) in (D, P); but anyone else would observe that in that joint distribution, E Tn = Cn, as follows easily in two ways, either from the proof of Theorem 2, or from (17).
The Bernstein-von Mises theorem fails if lim wn = w < 1, as may be seen in Figure 1. For the Bayesian, conditional on Y, θ − wnY is a noise vector, and Theorem 2 says that the distribution of Tn is approximately normal with mean Cn and standard deviation Dn. For the frequentist, the posterior mean wnY is biased (also asymptotically) as an estimator of θ, and some of Tn comes from this bias. As a result, Theorem 3 says that, conditional on θ, Tn is approximately normal with mean Cn + √(2p)wn(1 − wn)σn² Z2n(θ) and standard deviation D̃n(θ). Comparing (12) and (15) shows that the frequentist SD D̃n(θ) ≈ Dn sin ω is smaller than the Bayesian Dn.
Fig 1.
The top panel, for the Bayesian, has θ − wnY as a noise vector, and the posterior distribution of Tn approximately N(Cn, Dn²). The bottom panel, for the frequentist, shows the effect of the bias of the posterior mean wnY for θ, with Tn | θ approximately N(Cn + √(2p)wn(1 − wn)σn²Z2n(θ), D̃n(θ)²).
Under the assumption (P), the `wobble' √(2p)wn(1 − wn)σn²Z2n(θ) in the frequentist mean can be arbitrarily large relative to Dn: from the law of the iterated logarithm, with probability one,

lim sup_n √(2p)wn(1 − wn)σn² Z2n(θ) / [Dn √(2 log log p)] = 1 − w > 0.
By contrast, if lim wn = 1, then the wobble disappears, √(2p)wn(1 − wn)σn²Z2n(θ) = o(Dn √(2 log log p)) almost surely, and the Bayesian SD equals the frequentist SD asymptotically: D̃n(θ)/Dn → 1.
4. Linear Functionals
We turn now to the least demanding of our three scenarios for the Bernstein-von Mises theorem: the behavior of linear functionals. We change the setting slightly to the infinite sequence Gaussian white noise model (3). We consider linear functionals Lf such as integrals ∫ a(t)f(t)dt or derivatives f(r)(t0): if f has expansion f = Σk θk φk, then on setting ak = Lφk, we have

Lf = Σk ak θk.
Again, for maximum simplicity, we consider Gaussian priors on the coefficients:
θk ~ N(0, τk²),  independently,  k = 1, 2, …. | (18) |
In order that Lf = Σ ak θk be well defined with probability 1, it is necessary and sufficient that Σ ak²τk² < ∞.
Consequently, the posterior laws are Gaussian,

θk | y ~ N(wkn yk, wkn σn²),  independently,

again with

wkn = τk²/(τk² + σn²), | (19) |

so that the posterior mean estimate is

L̂f = E(Lf | y) = Σk wkn ak yk.
Centering at posterior mean. Write L̂f = Σ wkn ak yk for the posterior mean. For the Bayesian, the posterior distribution is

L(Lf − L̂f | y) = N(0, Vyn),  Vyn = σn² Σ ak² wkn,

while for the frequentist, the conditional distribution is

L(L̂f − Eθ L̂f | θ) = N(0, VFn),  VFn = σn² Σ ak² wkn².
The Bayesian might use 100(1 − α)% posterior credible intervals of the form L̂f ± z_{α/2} √Vyn, while the frequentist might employ 100(1 − α)% confidence intervals L̂f ± z_{α/2} √VFn. This leads us to consider the variance ratio

VFn/Vyn = Σ ak² wkn² / Σ ak² wkn ≤ 1, | (20) |

from which we see that the frequentist intervals are narrower, because the frequentist bias is being ignored for now, along with the attendant implications for coverage (but see below).
As sample size n → ∞, the noise variance σn² → 0 and so for a given Gaussian prior (18), the weights (19) converge marginally: wkn → 1 for each fixed k. This alone does not imply convergence of the variance ratio VFn/Vyn → 1, as a later example shows. A sufficient condition is that the linear functional Lf be bounded (as a mapping from L2[0, 1] to ℝ). This amounts to saying that Lf has the representation Lf = ∫ a(t)f(t)dt with ∫ a²(t)dt < ∞, or equivalently, in sequence terms, that Σ ak² < ∞.
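The convergence VFn/Vyn → 1 for bounded functionals can be seen numerically (our sketch, with the hypothetical choices ak = 1/k and τk² = 1/k): as σn² → 0 the variance ratio (20) climbs to 1, since each weight wkn → 1 and dominated convergence controls the summable tails.

```python
import numpy as np

k = np.arange(1.0, 200_001.0)
a2 = 1.0 / k**2          # a_k = 1/k: bounded functional, sum a_k^2 < infinity
tau2 = 1.0 / k           # illustrative prior variances tau_k^2 = 1/k

def variance_ratio(sig2):
    w = tau2 / (tau2 + sig2)                    # weights w_kn of (19)
    return np.sum(a2 * w**2) / np.sum(a2 * w)   # V_Fn / V_yn as in (20)

ratios = [variance_ratio(s) for s in (1.0, 0.1, 0.001)]
```

The ratio increases monotonically towards 1 along this sequence of noise levels, in line with Proposition 4 below; truncating the sums at 200,000 terms costs only a negligible summable tail.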
Proposition 4
Let Pf,n denote the measure corresponding to (3). If Lf = ∫ af is a bounded linear functional, then the variation distance between the Bayesian and frequentist distributions converges to zero:

∥ N(0, Vyn) − N(0, VFn) ∥ → 0. | (21) |
Proof. We again use the Hellinger affinity bounds (7) and apply (8) to the laws N(0, Vyn) and N(0, VFn) to obtain

ρ = [2√(Vyn VFn)/(Vyn + VFn)]^{1/2}.

In view of (7), the merging in (21) occurs if and only if ρ → 1, that is, if and only if

VFn/Vyn → 1.

When Σ ak² < ∞, this convergence follows from (20) and the dominated convergence theorem.
Remarks. 1. Examples of bounded functionals include polynomials a(t) and “regions of interest” a(t) = I{t ∈ B}.
2. Examples of unbounded functionals are given by evaluation of a function (or its derivatives) at a point: Lf = f(r)(t0). We shall see that the variance ratio does not converge to 1, and so the Bernstein-von Mises theorem fails. Indeed, in the Fourier basis

φ1(t) ≡ 1,  φ2k(t) = √2 cos 2πkt,  φ2k+1(t) = √2 sin 2πkt,  k = 1, 2, …,

we find that ak = Lφk = φk(r)(t0), and an easy calculation shows that a2k² + a2k+1² = 2(2πk)^{2r}. We use a Gaussian prior (18) with τk² = k^{−2m} and 2m > 2r + 1. It follows from (19) that, after writing V1n and V2n for Vyn and VFn respectively, we have

Vjn = σn² Σk ak² (wkn)^j,  wkn = (1 + σn² k^{2m})^{−1},  j = 1, 2.
As λ → 0, sums of the form

Σ_{k≥1} k^p (1 + λk^q)^{−j} ~ c_{pqj} λ^{−μ},

with c_{pqj} = q^{−1} B(μ, j − μ) and μ = (p + 1)/q, provided jq > p + 1. In the present case, with λ = σn², p = 2r, q = 2m and j = 1, 2, we conclude that

V2n/V1n → c_{2r,2m,2}/c_{2r,2m,1} = 1 − (2r + 1)/(2m) < 1.
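The asymptotic evaluation of such sums is easy to test numerically. The sketch below (our addition) checks the λ^{−μ} rate in the simplest point-evaluation case r = 0, m = 1, with the constant c_{pqj} = q^{−1}B(μ, j − μ) obtained from the integral approximation Σ ≈ ∫0^∞ x^p(1 + λx^q)^{−j} dx.

```python
import math
import numpy as np

def S(p, q, j, lam, K=2_000_000):
    """Truncated sum of k^p (1 + lam k^q)^{-j} over k = 1..K."""
    k = np.arange(1, K + 1, dtype=float)
    return np.sum(k**p / (1 + lam * k**q)**j)

def c(p, q, j):
    """Constant c_{pqj} = B(mu, j - mu)/q with mu = (p+1)/q."""
    mu = (p + 1) / q
    return math.gamma(mu) * math.gamma(j - mu) / (q * math.gamma(j))

# check S ~ c * lam^{-mu} for p = 2r = 0, q = 2m = 2 (so mu = 1/2)
lam = 1e-6
s1, s2 = S(0, 2, 1, lam), S(0, 2, 2, lam)
c1, c2 = c(0, 2, 1), c(0, 2, 2)   # pi/2 and pi/4 respectively
```

Both truncated sums match c λ^{−1/2} to well under a percent, and the ratio s2/s1 is close to c2/c1 = 1 − μ = 1/2, the limiting variance ratio V2n/V1n for this case.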
Centering at the MLE. For a bounded linear functional, the MLE L̄f = Σ ak yk is well defined and unbiased, with mean Lf and frequentist variance Vn = σn² Σ ak². A frequentist might prefer to use 100(1 − α)% intervals L̄f ± z_{α/2} √Vn, which will have the correct coverage property. However, extra conditions are required for the Bernstein-von Mises result to hold in this case.
Proposition 5
Assume that Lf = ∫ af is a bounded linear functional. Suppose also that the coefficients θk = ⟨f, φk⟩ of the `true' f and the variances τk² of the Gaussian prior together satisfy Σ |ak θk|/τk < ∞. Then the variation distance between Bayesian and frequentist distributions converges to zero in Pf,n-probability:

∥ L(Lf | y) − N(L̄f, Vn) ∥ → 0, | (22) |

where L̄f = Σ ak yk and Vn = σn² Σ ak².
Proof. The argument is a slight elaboration of that used in the previous proposition. We use (7) and ρ(ΠPk, ΠQk) = Πρ(Pk, Qk) as before, but now the centerings L̂f = Σ wkn ak yk and L̄f = Σ ak yk differ, and (8) yields

ρ = [2√(Vyn Vn)/(Vyn + Vn)]^{1/2} exp{−Δn²/[4(Vyn + Vn)]},  Δn = L̄f − L̂f = Σ (1 − wkn) ak yk.

As before Vyn/Vn → 1 by dominated convergence. Using this and the expression for ρ, and in view of the bounds (7), the conclusion (22) is equivalent to Δn/σn →p 0. We may write

Δn = Σ (1 − wkn) ak θk + σn Σ (1 − wkn) ak εk.

The stochastic term has mean 0 and variance σn² Σ (1 − wkn)² ak² = o(σn²), again by dominated convergence. Thus we may focus on the deterministic term, and note that the merging in (22) occurs if and only if

σn^{−1} Σ (1 − wkn) ak θk → 0.

The bound

(1 − wkn)|ak θk|/σn = σn |ak θk|/(τk² + σn²) ≤ |ak θk|/(2τk),

along with the dominated convergence theorem, then shows that Σ |ak θk|/τk < ∞ is a sufficient condition for (22), as claimed.
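The sufficiency argument can be checked numerically (our sketch, with hypothetical coefficients ak = 1/k, θk = k^{−2} and τk = 1/k, for which Σ |ak θk|/τk = Σ k^{−2} < ∞): the deterministic part of Δn/σn decreases to zero as σn → 0.

```python
import numpy as np

k = np.arange(1.0, 1_000_001.0)
a = 1.0 / k            # bounded functional
theta = 1.0 / k**2     # 'true' coefficients
tau2 = 1.0 / k**2      # prior variances, tau_k = 1/k; sum |a_k theta_k|/tau_k < inf

def bias_term(sig):
    """Deterministic part of Delta_n / sigma_n in the proof of Proposition 5."""
    one_minus_w = sig**2 / (tau2 + sig**2)   # 1 - w_kn from (19)
    return np.sum(one_minus_w * a * theta) / sig

vals = [bias_term(s) for s in (1.0, 0.1, 0.01)]
```

Each term is dominated by |ak θk|/(2τk), exactly the bound in the proof, so the sum is driven to zero term by term as σn decreases.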
5. Related work
As remarked earlier, this paper avoids the important Gaussian approximation part of the Bernstein-von Mises phenomenon by focusing on examples with Gaussian likelihoods and priors. A growing literature addresses the approximation challenges; we give a brief listing here, and refer to the books (Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010) and the survey discussion in Ghosal (2010, §2.7) for more detailed discussion.
Ghosal (1997, 1999, 2000) developed posterior normality results for the full posterior in cases where the dimension of the parameter space increases sufficiently slowly. In each case, the emphasis is on conditions under which a non-Gaussian likelihood and appropriate prior sequence can yield approximately Gaussian posteriors. However, Ghosal (2000, Sec. 4) specializes his results to our setting (D) and notes that one can choose priors, in general not Gaussian, so that the posterior distribution centered by the MLE is approximately Gaussian if p³(log p)/n → 0.
In survival analysis, Bernstein-von Mises theorems for the cumulative hazard function are established by Kim and Lee (2004) and for the cumulative hazard and fixed dimensional covariate regression parameter in a proportional hazards model in Kim (2006).
Boucheron and Gassiat (2009) develop a Bernstein-von Mises theorem for discrete probability distributions of growing dimension, and consider application to functionals such as Shannon and Renyi entropies.
In a semiparametric setting, where a finite dimensional parameter of interest can be separated from an infinite dimensional nuisance parameter, Castillo (2008) obtains conditions leading to a Bernstein-von Mises theorem on the parametric part, clarifying an earlier work of Shen (2002).
Rivoirard and Rousseau (2009) give conditions under which Bernstein-von Mises holds for linear functionals of a nonparametrically specified probability density function.
Acknowledgments
This work was supported in part by NIH grant RO1 EB 001988 and NSF DMS 0906812.
References
- Borwanker J, Kallianpur G, Prakasa Rao BLS. The Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist. 1971;42:1241–1253.
- Boucheron S, Gassiat E. A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 2009;3:114–148.
- Castillo I. A semiparametric Bernstein-von Mises theorem. 2008. Submitted.
- Cox DD. An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 1993;21(2):903–923.
- Freedman D. On the Bernstein-von Mises theorem with infinite-dimensional parameters. Ann. Statist. 1999;27:1119–1140.
- Ghosal S. Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 1997;6(3):332–348.
- Ghosal S. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli. 1999;5(2):315–331.
- Ghosal S. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 2000;74(1):49–68.
- Ghosal S. The Dirichlet process, related priors and posterior asymptotics. In: Hjort NL, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge University Press; 2010. Chapter 2.
- Ghosal S, van der Vaart A. Theory of Nonparametric Bayesian Inference. Cambridge University Press; 2010. In preparation.
- Ghosh JK, Ramamoorthi RV. Bayesian Nonparametrics. Springer Series in Statistics. Springer-Verlag; New York: 2003.
- Heyde CC, Johnstone IM. On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. Ser. B. 1979;41(2):184–189.
- Johnstone IM. Function estimation and Gaussian sequence models. 2010. Book manuscript at www-stat.stanford.edu.
- Kim Y. The Bernstein-von Mises theorem for the proportional hazard model. Ann. Statist. 2006;34(4):1678–1700.
- Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Ann. Statist. 2004;32(4):1492–1512.
- Lehmann EL, Casella G. Theory of Point Estimation. Springer Texts in Statistics. Second edn. Springer-Verlag; New York: 1998.
- Pinsker M. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission. 1980;16:120–133. Originally in Russian in Problemy Peredachi Informatsii 16:52–68.
- Rivoirard V, Rousseau J. Bernstein-von Mises theorem for linear functionals of the density. 2009. Submitted.
- Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 2002;97(457):222–235.
- van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 3. Cambridge University Press; Cambridge: 1998.

