Author manuscript; available in PMC 2014 Jul 23. Published in final edited form as: J Multivar Anal. 2013 Apr 1;116:456–472. doi: 10.1016/j.jmva.2013.01.011

Posterior consistency in conditional distribution estimation

Debdeep Pati a, David B Dunson b, Surya T Tokdar b
PMCID: PMC4107341  NIHMSID: NIHMS441685  PMID: 25067858

Abstract

A wide variety of priors have been proposed for nonparametric Bayesian estimation of conditional distributions, and there is a clear need for theorems providing conditions on the prior for large support, as well as posterior consistency. Estimation of an uncountable collection of conditional distributions across different regions of the predictor space is a challenging problem, which differs in some important ways from density and mean regression estimation problems. Defining various topologies on the space of conditional distributions, we provide sufficient conditions for posterior consistency focusing on a broad class of priors formulated as predictor-dependent mixtures of Gaussian kernels. This theory is illustrated by showing that the conditions are satisfied for a class of generalized stick-breaking process mixtures in which the stick-breaking lengths are monotone, differentiable functions of a continuous stochastic process. We also provide a set of sufficient conditions for the case where stick-breaking lengths are predictor independent, such as those arising from a fixed Dirichlet process prior.

Keywords: Asymptotics, Bayesian nonparametrics, Density regression, Dependent Dirichlet process, Large support, Probit stick-breaking process

1. Introduction

One of the most common problems in data analysis is the need to characterize the dependence of a response on predictors in a flexible manner. We want to avoid parametric assumptions on the response density and on how features, such as the mean, variance, skewness, shape and even modality, change with predictors. Nonparametric estimates of the conditional distribution [1, 2] are appealing in this context, but in most applications one requires not just a point estimate but also a characterization of uncertainty. For this reason, and because of excellent practical performance in a rich variety of application areas, Bayesian approaches for conditional distribution estimation have become popular in recent years. The most common class of models is infinite mixture models, due in part to the rich literature on algorithms for posterior computation using Markov chain Monte Carlo (MCMC) [3–5] and fast approximations [6]. Such MCMC algorithms are straightforward to implement, and the output can be used to estimate exact posterior densities for functionals of interest.

The ever increasing literature on new nonparametric Bayes models and exciting new applications in areas ranging from finance to biostatistics to machine learning has generated considerable enthusiasm. However, there is a clear lack of frequentist asymptotic theory supporting these models. The emphasis of this article is on substantially closing this gap, focusing on a new class of generalized stick-breaking process (gSB) priors, which encompasses a number of the most widely applied priors as special cases.

In the absence of predictors, there is a rich theory and methods literature on nonparametric Bayes methods for estimating a density f using mixture models of the form

$y_i \sim f, \quad f \sim \Pi, \qquad (1.1)$

where Π is a mixture prior of the form $\sum_{h=1}^{\infty} \pi_h k(y; \theta_h)$ for a suitably chosen kernel k, and atoms and weights {(θh, πh), h = 1, …, ∞} with $\sum_{h=1}^{\infty} \pi_h = 1$ almost surely. The most common choice of Π is the Dirichlet process mixture of normals, first introduced by [7]. Original works on the Dirichlet process can be found in [8, 9]. The support of Π in (1.1) and asymptotic properties of the posterior are now well understood [10–15].

Recent literature has focused on generalizing model (1.1) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Bayesian density regression views the entire conditional density f(y | x) as a function valued parameter and allows its center, spread, skewness, modality and other such features to vary with x. For data {(yi, xi), i = 1, …, n} let

$y_i \mid x_i \sim f(\cdot \mid x_i), \quad \{f(\cdot \mid x),\ x \in \mathcal{X}\} \sim \Pi_{\mathcal{X}}, \qquad (1.2)$

where $\mathcal{X}$ is the predictor space and $\Pi_{\mathcal{X}}$ is a prior for the class of conditional densities $\{f_x, x \in \mathcal{X}\}$ indexed by the predictors. Refer, for example, to [16–21] and [22] among others.

The primary focus of this recent development has been infinite mixture models of the form

$f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, \frac{1}{\sigma_h}\phi\left\{\frac{y - \mu_h(x)}{\sigma_h}\right\}, \qquad (1.3)$

where φ is the standard normal density, {πh(x), h = 1, 2, …} are predictor-dependent probability weights that sum to one almost surely for each $x \in \mathcal{X}$, and (μh, σh) ~ G0 independently, with G0 a base probability measure on $\Theta \times \Re^+$, where Θ is a subset of the space of all $\mathcal{X} \to \Re$ functions. A single finite mixture of Gaussians is inadequate to represent the shape of the density f(y | x) at different levels of the predictor x unless the number of components is huge. By using an infinite mixture we inherently allow for uncertainty in the number of components needed to characterize the data and bypass the difficult issue of selecting the number of components.
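To make (1.3) concrete, the following minimal sketch (ours, not from the paper; the weight function, locations and scales are arbitrary illustrative choices) evaluates a two-component predictor-dependent mixture and shows how the conditional density changes shape and mode with x while each slice remains a proper density.

```python
import numpy as np
from scipy.stats import norm

# Illustrative instance of (1.3) with two active components:
# weights and locations vary smoothly with the predictor x in [0, 1].
def pi1(x):
    # first mixture weight, increasing in x (hypothetical choice)
    return 1.0 / (1.0 + np.exp(-6.0 * (x - 0.5)))

def f_cond(y, x, sigma1=0.3, sigma2=0.5):
    mu1, mu2 = 2.0 * x, -1.0 + x          # predictor-dependent locations
    w1 = pi1(x)
    return (w1 * norm.pdf(y, mu1, sigma1)
            + (1.0 - w1) * norm.pdf(y, mu2, sigma2))

y = np.linspace(-3.0, 4.0, 400)
for x in (0.1, 0.5, 0.9):
    fx = f_cond(y, x)
    # each conditional slice integrates to one; the mode moves with x
    print(f"x={x:.1f}: mode at y={y[np.argmax(fx)]:+.2f}, "
          f"integral={(fx * (y[1] - y[0])).sum():.3f}")
```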

Model (1.1) is similar in spirit to the kernel mixtures used in nonparametric smoothing approaches. However, a major advantage of the Bayesian paradigm is that we do not need to optimize tuning parameters, which becomes difficult in higher dimensions. Recent adaptation results [23, 24] reveal that even a single prior specification can adapt to the unknown smoothness level of the true density and achieve estimation that is optimal in an asymptotic minimax sense. For conditional densities, smoothing needs to be done over the response space as well as the predictor space, making the choice of optimal smoothing even more difficult, especially when the predictors have varying degrees of influence on the response. A Bayesian approach offers an easier practical solution in this case.

To our knowledge, only [25] have considered formalizing the notions of support for dependent stick-breaking processes. We focus on a novel class of gSB processes, which express the probability weights πh(x) in stick-breaking form, with the stick lengths constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function. This class includes dependent Dirichlet processes [26] as a special case.

Only a few papers have considered asymptotic properties of the posterior in conditional density estimation. [22] considers posterior consistency in estimating conditional distributions focusing exclusively on logistic Gaussian process priors [27]. Such priors lack the computational simplicity of the countable mixture priors in (1.3). [28] considers posterior consistency in conditional distribution estimation through a limited information approach, approximating the likelihood by the quantiles of the true distribution. [29, 30] provide sufficient conditions for posterior consistency in estimating an autoregressive conditional density and a transition density, rather than regression with respect to an exogenous covariate.

In this article, focusing on model (1.3), we first provide sufficient conditions on the prior and the true data-generating model under which the prior leads to weak and various types of strong posterior consistency. In this context, we define notions of weak and L1-integrated neighborhoods. We then show that the sufficient conditions are satisfied for gSB priors. The theory is illustrated through application to a model relying on probit transformations of Gaussian processes, an approach related to the probit stick-breaking process of [21] and [31]. We also consider Gaussian mixtures of fixed-π dependent processes [26, 32].

In a very recent unpublished article, [33] showed posterior consistency in conditional density estimation using kernel stick-breaking process mixtures of Gaussians. They approximated a conditional density by a smooth mixture of linear regressions as in [34] to demonstrate the KL property. In this paper, we show KL support using a more direct approach of approximating the true density by a kernel mixture of a compactly supported conditional measure.

The fundamental contribution of this article is formalizing the notion of support of the gSB process mixture of Gaussians on the space of conditional densities and formulating sufficient conditions to ensure that it leads to a consistent posterior. In doing so, a key technical contribution is the development of a novel method of constructing a sieve for the proposed class of priors. It has been noted by [35] that the usual method of constructing a sieve by controlling prior probabilities is unable to lead to a consistency theorem in the multivariate case. This is because of the explosion of the L1-metric entropy with increasing dimension. They developed a technique specific to the Dirichlet process in the multivariate case for showing weak and strong posterior consistency. The proposed sieve (see footnote 1) avoids the pitfall mentioned by [35] in showing consistency using multivariate mixtures. Our sieve construction has recently been used for studying convergence rates in multivariate density estimation [36, 37].

2. Notations

Throughout the paper, Lebesgue measure on ℜ or ℜp is denoted by λ and the set of natural numbers by ℕ. The supremum and L1 norms are denoted by ||·|| and ||·||1 respectively. The indicator function of a set B is denoted by 1B. Let Lp(ν, M) denote the space of real valued measurable functions defined on M with ν-integrable pth absolute power. For two density functions f, g, the Kullback-Leibler divergence is given by K(f, g) = ∫ log(f/g)f dλ. A ball of radius r with centre x0 relative to the metric d is denoted B(x0, r; d). The diameter of a bounded metric space M relative to a metric d is sup{d(x, y) : x, y ∈ M}. The ε-covering number N(ε, M, d) of a semi-metric space M relative to the semi-metric d is the minimal number of balls of radius ε needed to cover M. The logarithm of the covering number is referred to as the entropy. “≾” stands for inequality up to a constant multiple whose value is irrelevant to the given situation. δ0 stands for a distribution degenerate at 0 and supp(ν) for the support of a measure ν.

3. Conditional density estimation

In this section, we will define the space of conditional densities and construct a prior on this space. It is first necessary to generalize the topologies to allow appropriate neighborhoods to be constructed around an uncountable collection of conditional densities indexed by predictors. With such neighborhoods in place, we then state our main theorems providing sufficient conditions under which various modes of posterior consistency hold for a broad class of predictor-dependent mixtures of Gaussian kernels.

Let $\mathcal{Y} = \Re$ be the response space and $\mathcal{X}$ the covariate space, which is a compact subset of ℜp. Unless otherwise stated, we will assume $\mathcal{X} = [0, 1]^p$ without loss of generality. Let $\mathcal{F}$ denote the space of densities on $\mathcal{X} \times \mathcal{Y}$ w.r.t. the Lebesgue measure and $\mathcal{F}_d$ denote a subset of the space of conditional densities satisfying

$\mathcal{F}_d = \left\{g : \mathcal{X} \times \mathcal{Y} \to (0, \infty),\ \int_{\mathcal{Y}} g(x, y)\, dy = 1\ \forall x \in \mathcal{X},\ x \mapsto g(x, \cdot)\ \text{continuous as a function from}\ \mathcal{X}\ \text{to}\ L_1(\lambda, \mathcal{Y})\right\}.$

Suppose yi is observed independently given the covariates xi, i = 1, 2, …, which are drawn independently from a probability distribution Q on $\mathcal{X}$. Assume that Q admits a density q with respect to the Lebesgue measure.

If we define h(x, y) = q(x)f(y | x) and h0(x, y) = q(x)f0(y | x), then h, h0 ∈ $\mathcal{F}$. Throughout the paper, h0 is assumed to be a fixed density in $\mathcal{F}$, which we alternatively refer to as the true data generating density, and {f0(· | x), x ∈ $\mathcal{X}$} is referred to as the true conditional density. The density q(x) will be needed only for the theoretical investigation. In practice, we do not need to know it or learn it from the data.

We propose to induce a prior $\Pi_{\mathcal{X}}$ on the space of conditional densities through a prior $P_{\mathcal{X}}$ for a collection of mixing measures $\{G_x, x \in \mathcal{X}\}$ using the following predictor-dependent mixture of kernels

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_x(\psi), \qquad (3.1)$

where ψ = (μ, σ), and

$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{\{\mu_h(x), \sigma_h\}}, \quad (\mu_h, \sigma_h) \sim G_0, \qquad (3.2)$

where πh(x) ≥ 0 are random functions of x such that $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s. for each fixed $x \in \mathcal{X}$, and $\{\mu_h(x), x \in \mathcal{X}\}_{h=1}^{\infty}$ are i.i.d. realizations of a real valued stochastic process; that is, G0 is a probability distribution over $\Theta \times \Re^+$, where Θ is a subset of the space of functions from $\mathcal{X}$ to ℜ. Hence for each $x \in \mathcal{X}$, Gx is a random probability measure over the measurable Polish space (ℜ × ℜ+, $\mathcal{B}$(ℜ × ℜ+)). We are interested in the following two important special cases.

3.1. Predictor dependent countable mixtures of Gaussian linear regressions

We define the predictor dependent countable mixtures of Gaussian linear regressions (MGLRx) as

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma),$

and

$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{(\beta_h, \sigma_h)}, \quad (\beta_h, \sigma_h) \sim G_0, \qquad (3.3)$

where πh(x) ≥ 0 are random functions of x such that $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s. for each fixed $x \in \mathcal{X}$, and G0 = G0,β × G0,σ is a probability distribution on ℜp × ℜ+, where G0,β and G0,σ are probability distributions on ℜp and ℜ+ respectively. For a particular choice of the πh(x)’s, we obtain the probit stick-breaking mixtures of Gaussians which have been previously applied to real data applications by [21, 31, 38]. The latter two articles considered probit transformations of Gaussian processes in constructing the stick-breaking weights.

3.2. Gaussian mixtures of fixed-π dependent processes

In (3.1), set Gx as in (3.2) with πh(x) ≡ πh for all $x \in \mathcal{X}$, where πh ≥ 0 are random probability weights with $\sum_{h=1}^{\infty} \pi_h = 1$ a.s. and $\{\mu_h(x), x \in \mathcal{X}\}_{h=1}^{\infty}$ are as in (3.2). Examples include fixed-π dependent Dirichlet process mixtures of Gaussians [26]. Versions of the fixed-π DDP have been applied to ANOVA [32], survival analysis [39, 40], spatial modeling [41], and many more settings.

A Gaussian process is a common choice for constructing the stochastic processes πh(x) and μh(x). Recall that a Gaussian process {α(x) : x ∈ $\mathcal{X}$} is defined as a stochastic process for which any finite dimensional collection {α(x1), …, α(xk)}, k ≥ 1, has a joint Gaussian distribution. We denote by GP(μ, c) a Gaussian process with mean function μ : $\mathcal{X}$ → ℜ and covariance kernel c : $\mathcal{X}$ × $\mathcal{X}$ → ℜ.
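As a small illustration (our own sketch; the squared-exponential kernel, grid and parameter values are arbitrary choices), the defining finite dimensional property can be used directly to simulate a GP path: evaluate the covariance kernel on a grid and draw from the resulting multivariate Gaussian.

```python
import numpy as np

def se_kernel(x1, x2, tau2=1.0, A=5.0):
    """Squared-exponential kernel tau^2 * exp(-A * (x - x')^2), p = 1."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return tau2 * np.exp(-A * d2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)                 # fine grid on [0, 1]
K = se_kernel(x, x) + 1e-10 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)
alpha = L @ rng.standard_normal(len(x))        # one path of alpha ~ GP(0, c)
# Larger A (inverse bandwidth) gives wigglier, but still continuous, paths.
print(alpha[:5])
```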

4. Notions of posterior consistency for conditional densities

We recall the definition of posterior consistency, stated in terms of yn = (y1, …, yn) and xn = (x1, …, xn).

Definition 4.1

The posterior $\Pi_{\mathcal{X}}(\cdot \mid y^n, x^n)$ is consistent at {f0(· | x), x ∈ $\mathcal{X}$} with respect to a given topology if $\Pi_{\mathcal{X}}(U^c \mid y^n, x^n) \to 0$ a.s. for an arbitrary neighborhood U of {f0(· | x), x ∈ $\mathcal{X}$} in that topology.

Here a.s. consistency at {f0(· | x), x ∈ $\mathcal{X}$} means that the posterior distribution concentrates around a neighborhood of {f0(· | x), x ∈ $\mathcal{X}$} for almost every sequence $\{y_i, x_i\}_{i=1}^{\infty}$ generated by i.i.d. sampling from the joint density q(x)f0(y | x).

We define the weak and ν-integrated L1 neighborhoods of a collection of conditional densities {f0(· | x), x ∈ $\mathcal{X}$} as follows. A sub-base of a weak neighborhood is defined as

$W_{\varepsilon,g}(f_0) = \left\{f : f \in \mathcal{F}_d,\ \left|\int_{\mathcal{X} \times \mathcal{Y}} g\,h - \int_{\mathcal{X} \times \mathcal{Y}} g\,h_0\right| < \varepsilon\right\}, \qquad (4.1)$

for a bounded continuous function g : $\mathcal{X}$ × $\mathcal{Y}$ → ℜ. A weak neighborhood base is formed by finite intersections of neighborhoods of the type (4.1). Define a ν-integrated L1 neighborhood

$S_\varepsilon(f_0 : \nu) = \left\{f : f \in \mathcal{F}_d,\ \int_{\mathcal{X}} \|f(\cdot \mid x) - f_0(\cdot \mid x)\|_1\, \nu(x)\, dx < \varepsilon\right\} \qquad (4.2)$

for any measure ν with supp(ν) ⊂ $\mathcal{X}$. Observe that under the topology in (4.2), $\mathcal{F}_d$ can be identified with a closed subset of L1(λ × ν, $\mathcal{Y}$ × supp(ν)), making it a complete separable metric space. Thus measurability issues do not arise with these topologies.
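For intuition, the ν-integrated L1 distance in (4.2) can be approximated numerically. The sketch below is ours, with ν = q uniform on [0, 1] and two normal conditional densities as illustrative placeholders; it nests a quadrature over y inside a quadrature over x.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Two illustrative conditional densities on Y = R, X = [0, 1].
f  = lambda y, x: norm.pdf(y, loc=x,       scale=0.5)
f0 = lambda y, x: norm.pdf(y, loc=x + 0.2, scale=0.5)

def l1_slice(x):
    # ||f(.|x) - f0(.|x)||_1 by quadrature over y
    val, _ = quad(lambda y: abs(f(y, x) - f0(y, x)), -10.0, 10.0)
    return val

# q-integrated L1 distance with q = Uniform(0, 1): integrate the slice over x.
dist, _ = quad(l1_slice, 0.0, 1.0)
print(f"q-integrated L1 distance ~= {dist:.4f}")
```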

In the following, we define the Kullback-Leibler (KL) property of $\Pi_{\mathcal{X}}$ at a given f0 ∈ $\mathcal{F}_d$. Note that we define a KL-type neighborhood around the collection of conditional densities f0 by defining a KL neighborhood around the joint density h0, while keeping Q fixed at its true unknown value.

Definition 4.2

For any f0Inline graphic, such that h0(x, y) = q(x)f0(y | x) is the true joint data-generating density, we define an ε-sized KL neighborhood around f0 as

Kε(f0)={f:fFd,KL(h0,h)<ε,h(x,y)=q(x)f(yx)yY,xX},

where KL(h0, h) = ∫ h0 log(h0/h). Then $\Pi_{\mathcal{X}}$ is said to have the KL property at f0 ∈ $\mathcal{F}_d$, denoted f0 ∈ KL($\Pi_{\mathcal{X}}$), if $\Pi_{\mathcal{X}}\{K_\varepsilon(f_0)\} > 0$ for any ε > 0.

Another definition we require for showing the KL support is the notion of a weak neighborhood of a collection of mixing measures {Gx, x ∈ $\mathcal{X}$}, where Gx is a probability measure on S × ℜ+ for each x ∈ $\mathcal{X}$. Here S = ℜp or ℜ depending on the case considered above. We formulate the notion of a sub-base of the weak neighborhood of {Gx, x ∈ $\mathcal{X}$} below.

Definition 4.3

For a bounded continuous function g : S × ℜ+ × $\mathcal{X}$ → ℜ and ε > 0, a sub-base of the weak neighborhood of a conditional probability measure {Fx, x ∈ $\mathcal{X}$} is defined as

$\left\{\{G_x, x \in \mathcal{X}\} : \left|\int_{S \times \Re^+ \times \mathcal{X}} g(s, \sigma, x)\, dG_x(s, \sigma)\, q(x)\, dx - \int g(s, \sigma, x)\, dF_x(s, \sigma)\, q(x)\, dx\right| < \varepsilon\right\}. \qquad (4.3)$

A conditional probability measure {Gx, x ∈ $\mathcal{X}$} lies in the weak support of $P_{\mathcal{X}}$ if $P_{\mathcal{X}}$ assigns positive probability to every basic neighborhood generated by sub-bases of the type (4.3). In the sequel, we will also consider neighborhoods of the form

$\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{S \times \Re^+} \{g(s, \sigma)\, dG_x(s, \sigma) - g(s, \sigma)\, dF_x(s, \sigma)\}\right| < \varepsilon\right\} \qquad (4.4)$

for a bounded continuous function g : S × ℜ+ → ℜ.

5. Posterior consistency in MGLRx mixture of Gaussians

5.1. Kullback-Leibler property

We will work with a specific choice of $P_{\mathcal{X}}$ motivated by the probit stick-breaking process construction in [21]. Let

$\pi_h(x) = \Phi\{\alpha_h(x)\} \prod_{l<h} \left[1 - \Phi\{\alpha_l(x)\}\right], \qquad (5.1)$

where αh ~ GP(0, ch), for h = 1, 2, …, ∞. Assume the following conditions hold.

  • S1

    ch is chosen so that αh ~ GP(0, ch) has continuous path realizations.

  • S2
    Under the GP(0, ch) prior for αh, for any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\alpha_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

  • S3

    G0 is absolutely continuous with respect to the Lebesgue measure λ on ℜp × ℜ+.

Consider the subset $\tilde{\mathcal{F}}_d \subset \mathcal{F}_d$ satisfying the following conditions.

  • A1

    f is nowhere zero and bounded by M < ∞.

  • A2

    $\left|\int_{\mathcal{X}} \int_{\mathcal{Y}} f(y \mid x) \log f(y \mid x)\, dy\, q(x)\, dx\right| < \infty.$

  • A3

    $\int_{\mathcal{X}} \int_{\mathcal{Y}} f(y \mid x) \log \frac{f(y \mid x)}{\psi_x(y)}\, dy\, q(x)\, dx < \infty,$

    where $\psi_x(y) = \inf_{t \in [y-1, y+1]} f(t \mid x)$.

  • A4

    There exists η > 0 such that $\int_{\mathcal{X}} \int_{\mathcal{Y}} |y|^{2(1+\eta)} f(y \mid x)\, dy\, q(x)\, dx < \infty$.

  • A5

    (x, y) ↦ f(y | x) is jointly continuous.

Remark 5.1

A1 is usually satisfied by common densities arising in practice. A4 imposes a minor tail restriction; e.g., a mean regression model with continuous mean function and a heavy-tailed t residual density with 4 degrees of freedom satisfies A4. Conditions A2 and A3 are more subtle, but are also mild. A flexible class of models which satisfies A1–A5 is as follows. Let yi = μ(xi) + εi, with μ : $\mathcal{X}$ → ℜ continuous and εi ~ fxi, where $f_x(\varepsilon) = \sum_{h=1}^{H} \pi_h(x)\, \psi(\varepsilon; \mu_h, \sigma_h^2)$ for some H ≥ 1, $\sum_{h=1}^{H} \pi_h(x) = 1$, πh : $\mathcal{X}$ → [0, 1] continuous, and ψ Gaussian or t with more than 2 degrees of freedom.

Remark 5.2

S2 is satisfied if $c_h(x, x') = e^{-A_h \|x - x'\|^2}$ and the prior for Ah has full support on ℜ+.
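A small simulation (our sketch; the truncation level, grid, and the gamma draws for the Ah are arbitrary illustrative choices) of the construction (5.1): draw i.i.d. GP paths αh with a squared-exponential kernel and map them through Φ into stick-breaking weights, whose sum across components approaches one at every x as the truncation grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
H = 50                                       # truncation level (illustrative)

def gp_draw(A, tau2=1.0):
    # one path of a GP(0, tau^2 exp(-A (x-x')^2)) on the grid
    d2 = (x[:, None] - x[None, :]) ** 2
    K = tau2 * np.exp(-A * d2) + 1e-10 * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

A_h = rng.gamma(2.0, 2.0, size=H)            # hypothetical inverse-bandwidth draws
V = norm.cdf(np.array([gp_draw(a) for a in A_h]))  # V_h(x) = Phi{alpha_h(x)}
# pi_h(x) = V_h(x) * prod_{l<h} (1 - V_l(x)), as in (5.1)
pi = V * np.cumprod(np.vstack([np.ones(len(x)), 1.0 - V[:-1]]), axis=0)
print("min/max of sum_h pi_h(x):", pi.sum(axis=0).min(), pi.sum(axis=0).max())
```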

The following theorem characterizes a subset of $\mathcal{F}_d$ on which $\Pi_{\mathcal{X}}$ has the KL property. The proof of Theorem 5.3 is provided in Appendix C.

Theorem 5.3

f0KL( Inline graphic) for each f0 in Fd if Inline graphic satisfies S1–S3.

Remark 5.4

The conditions are satisfied for a class of gSB process mixtures in which the stick-breaking lengths are constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function.

To prove Theorem 5.3, we need several auxiliary results related to the support of the prior $P_{\mathcal{X}}$ which might be of independent interest. The key idea for showing that the true f0 satisfies $\Pi_{\mathcal{X}}\{K_\varepsilon(f_0)\} > 0$ for any ε > 0 is to impose certain tail conditions on f0(y | x) and approximate it by $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) d\tilde{G}_x(\beta, \sigma)$, where $\{\tilde{G}_x, x \in \mathcal{X}\}$ is compactly supported. Observe that

$KL(h_0, h) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx + \int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx. \qquad (5.2)$

We construct such an $\tilde{f}$ in Theorem 5.3 which makes the first term on the right hand side of (5.2) sufficiently small. The following lemma (which is similar to Lemma 3.1 in [12] and Theorem 3 in [11]) guarantees that the second term on the right hand side of (5.2) is also sufficiently small if {Gx, x ∈ $\mathcal{X}$} lies inside a finite intersection of neighborhoods of $\{\tilde{G}_x, x \in \mathcal{X}\}$ of the type (4.4).

Lemma 5.5

Assume that f0 ∈ $\mathcal{F}_d$ satisfies $\int_{\mathcal{X}} \int_{\mathcal{Y}} y^2 f_0(y \mid x)\, dy\, q(x)\, dx < \infty$. Suppose $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) d\tilde{G}_x(\beta, \sigma)$, where there exist a > 0 and 0 < σ̲ < σ̄ such that

$\tilde{G}_x\left([-a, a]^p \times (\underline{\sigma}, \bar{\sigma})\right) = 1\ \forall x \in \mathcal{X}, \qquad (5.3)$

so that $\tilde{G}_x$ has compact support for each x ∈ $\mathcal{X}$. Then given any ε > 0, there exists a finite intersection W of neighborhoods of $\{\tilde{G}_x, x \in \mathcal{X}\}$ of the type (4.4) such that for any conditional density $f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma)$, x ∈ $\mathcal{X}$, with {Gx, x ∈ $\mathcal{X}$} ∈ W,

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx < \varepsilon. \qquad (5.4)$

The proof is similar to Theorem 3 in [11] and is omitted here. In order to ensure that the weak support of $P_{\mathcal{X}}$ is sufficiently large to contain all densities satisfying the assumptions of Lemma 5.5, we define a collection, denoted $\mathcal{G}_{\mathcal{X}}$, of fixed conditional probability measures on (ℜp × ℜ+, $\mathcal{B}$(ℜp × ℜ+)) satisfying

  1. x ↦ Fx(B) is a continuous function of x ∈ $\mathcal{X}$, ∀ B ∈ $\mathcal{B}$(ℜp × ℜ+).

  2. For any sequence of sets An ⊂ ℜp × ℜ+ with An ↓ ∅, $\sup_{x \in \mathcal{X}} F_x(A_n) \downarrow 0$.

Next we state the theorem characterizing the weak support of $P_{\mathcal{X}}$, which will be proved in Appendix B.

Theorem 5.6

If $P_{\mathcal{X}}$ satisfies S1–S3, then any $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$ lies in the weak support of $P_{\mathcal{X}}$.

Corollary 5.7

Assume S1–S3 hold and assume $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$ is compactly supported, i.e., there exist a, σ̲, σ̄ > 0 such that Fx([−a, a]p × [σ̲, σ̄]) = 1 for all x ∈ $\mathcal{X}$. Then for a bounded uniformly continuous function g : ℜp × ℜ+ → [0, 1] satisfying g(β, σ) → 0 as ||β|| → ∞, σ → ∞,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re^p \times \Re^+} \{g(\beta, \sigma)\, dG_x(\beta, \sigma) - g(\beta, \sigma)\, dF_x(\beta, \sigma)\}\right| < \varepsilon\right\} > 0. \qquad (5.5)$
Proof

The proof is similar to Theorem 5.6 with the L1 convergence in (B.1) replaced by convergence uniform in x. This is because under the assumptions of Corollary 5.7, the uniformly continuous sequence of functions $\sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}) F_x(A_{k,n})$ on $\mathcal{X}$ monotonically decreases to $\int_C g(\beta, \sigma)\, dF_x(\beta, \sigma)$ as n → ∞, where C is given by [−a, a]p × [σ̲, σ̄].

The proof of the following corollary is along the lines of the proof of Theorem 5.6 and is omitted here.

Corollary 5.8

Under the assumptions of Corollary 5.7 for any k0 ≥ 1,

$P_{\mathcal{X}}\left\{\bigcap_{j=1}^{k_0} U_j\right\} > 0, \qquad (5.6)$

where Uj’s are neighborhoods of the type (5.5).

5.2. Strong Consistency with the q-integrated L1 neighborhood

To obtain strong consistency in the q-integrated L1 topology, we need the following straightforward extension of Theorem 2 of [11].

Theorem 5.9

Suppose f0KL( Inline graphic) and there exist subsets Inline graphicInline graphic with

  1. log N(ε, $\mathcal{F}_n$, ||·||1) = o(n),

  2. $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le c_2 e^{-n\beta_2}$ for some c2, β2 > 0;

then the posterior is strongly consistent with respect to the q-integrated L1 topology.

Before stating the main theorem on strong consistency, we consider a hierarchical extension of MGLRx in which the bandwidths are taken to be random. We define a sequence of random inverse-bandwidths Ah for the Gaussian processes αh, h ≥ 1, each having ℜ+ as its support. Since the first few atoms suffice to explain most of the dependence of y on x, we expect the variability due to the covariate in the stochastic process Φ{αh} to decrease as h increases. This is achieved through a carefully chosen prior for the covariance kernel ch of the Gaussian process αh.

Let α0 denote the base Gaussian process on [0, 1]p with covariance kernel $c_0(x, x') = \tau^2 e^{-\|x - x'\|^2}$. Then $\alpha_h(x) = \alpha_0(A_h^{1/2} x)$ for each x ∈ $\mathcal{X}$. The variability of αh with respect to the covariate is shrunk or stretched over the rectangle $[0, A_h^{1/2}]^p$ as Ah decreases or increases. The Ah’s are constructed to be stochastically decreasing to δ0 in the following manner. We assume that there exist η, η0 > 0 and a sequence $\delta_n = O((\log n)^2 / n^{5/2})$ such that $P(A_h > \delta_n) \le \exp\{-n^{-\eta_0} h^{(\eta_0+2)} \log h\}$ for each h ≥ 1. Also assume that there exists a sequence rn ↑ ∞ such that $r_n^p n^{\eta} (\log n)^{p+1} = o(n)$ and $P(A_h > r_n) \le e^{-n}$. We discuss how to construct such a sequence of random variables in Remark 5.12 following Theorem 5.10.

The following theorem provides sufficient conditions for strong posterior consistency in the q-integrated L1 topology. The proof is provided in Appendix D.

Theorem 5.10

Let the πh’s satisfy (5.1) with αh ~ GP(0, ch), where $c_h(x, x') = \tau^2 e^{-A_h \|x - x'\|^2}$, h ≥ 1, τ2 > 0 fixed. Suppose the following hold.

  • C1

    There exist sequences an, hn ↑ ∞, ln ↓ 0 with $a_n/l_n = O(n)$ and $h_n/l_n = O(e^n)$, and constants d1, d2 > 0, such that $G_0\{B(0; a_n) \times [l_n, h_n]\}^c < d_1 e^{-d_2 n}$.

  • C2

    The Ah’s are constructed as in the paragraph preceding this theorem.

then f0KL( Inline graphic) implies that Inline graphic achieves strong posterior consistency in q-integrated L1 topology at f0.

Remark 5.11

Verification of condition C1 of Theorem 5.10 is particularly simple. For example, if G0 is a product of a multivariate normal prior on β and an inverse-Gamma prior on σ2, condition C1 is satisfied with $a_n = O(\sqrt{n})$, $h_n = e^n$, $l_n = O(1/\sqrt{n})$. It follows from [42] that f0 ∈ KL($\Pi_{\mathcal{X}}$) is still satisfied when we have the additional assumptions C1–C2 together with S1–S3 on the prior $P_{\mathcal{X}}$.

Remark 5.12

Since we need $r_n^p n^{\eta} (\log n)^{p+1} = o(n)$, $r_n^p$ can be chosen to be $O(n^{\eta_1})$ for some 0 < η1 < 1. Let d be such that $d\eta_1/p \ge 1$ and set η0 = 3d. Let Ah = chBh, where $B_h^d \sim \mathrm{Exp}(\lambda)$ and $c_h = (h^{(3d+2)/\eta} \log h)^{-1/d}$ for any 0 < η < 1. Then $P(A_h > n^{\eta_1/p}) \le P(B_h > n^{\eta_1/p}) \le e^{-\lambda n^{d\eta_1/p}} \le e^{-n}$ for all large n, and $P(A_h > (\log n)^2 n^{-5/2}) \le \exp\{-n^{-3d} h^{(3d+2)/\eta} \log h\}$.
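A sketch of this construction follows (ours; the values of λ, d and η are arbitrary within the remark's constraints, and h starts at 2 so that log h > 0). Since $B_h^d \sim \mathrm{Exp}(\lambda)$, one can draw Bh as an exponential variable raised to the power 1/d, then shrink by the deterministic sequence ch, making the Ah stochastically decreasing toward δ0.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, d, eta = 1.0, 2.0, 0.5            # illustrative; Remark 5.12 needs 0 < eta < 1
h = np.arange(2, 200)                   # start at h = 2 so that log h > 0
c_h = (h ** ((3 * d + 2) / eta) * np.log(h)) ** (-1.0 / d)
B_h = rng.exponential(1.0 / lam, size=len(h)) ** (1.0 / d)  # B_h^d ~ Exp(lam)
A_h = c_h * B_h                         # stochastically decreasing to delta_0
print(A_h[:3], A_h[-3:])                # later A_h are (stochastically) tiny
```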

Remark 5.13

The theory of strong posterior consistency can be generalized to an arbitrary monotone differentiable link function L : ℜ → [0, 1] which is Lipschitz, i.e., for which there exists a constant K > 0 such that |L(x) − L(x′)| ≤ K|x − x′| for all x, x′ ∈ ℜ. Also, as long as the πh(x)’s satisfy the hypothesis of Lemma Appendix A.1 and possess the required tail probability bound in Lemma 5.15, general predictor-dependent mixing weights can be used.

Below we develop several auxiliary results required to prove Theorem 5.10. They are stated separately as some of them might be of independent interest. Let $\phi_{\beta,\sigma}(x, y) := \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right)$ for y ∈ $\mathcal{Y}$ and x ∈ $\mathcal{X}$. From [12], we obtain for σ2 > σ1 > σ2/2 and for each x ∈ $\mathcal{X}$,

$\int_{\mathcal{Y}} \left|\phi_{\beta_1,\sigma_1}(x, y) - \phi_{\beta_2,\sigma_2}(x, y)\right| dy \le (2\pi)^{-1/2} \frac{\|\beta_2 - \beta_1\|\sqrt{p}}{\sigma_2} + \frac{3(\sigma_2 - \sigma_1)}{\sigma_1}.$

Construct a sieve for (β, σ) as

$\Theta_{a,h,l} = \left\{\phi_{\beta,\sigma} : \|\beta\| \le a,\ l \le \sigma \le h\right\}. \qquad (5.7)$

In the following lemma, we provide an upper bound on $N(\Theta_{a,h,l}, \varepsilon, d_{SS})$. The proof is omitted as it follows trivially from Lemma 4.1 in [12].

Lemma 5.14

There exist constants d1, d2 > 0 such that $N(\Theta_{a,h,l}, \varepsilon, d_{SS}) \le d_1 \left(\frac{a}{l}\right)^p + d_2 \log \frac{h}{l} + 1$.

In the proof of Theorem 5.10, we will verify the sufficient conditions of Theorem 5.9. We calibrate $\Pi_{\mathcal{X}}$ by a carefully chosen sequence of subsets $\mathcal{F}_n \subset \mathcal{F}_d$. The fundamental problem with mixture models ∫ N(y; μ, σ2Ip)dP(μ) in estimating a multivariate density lies in attempting to compactify the model space by $\{\int N(y; \mu, \sigma^2 I_p)\, dP(\mu) : P((-a_n, a_n]^p) > 1 - \delta\}$ for each σ, leading to an entropy $a_n^p$ growing exponentially with the dimension p. Here we marginalize P in ∫ N(y; μ, σ2Ip)dP(μ) to yield the construction $\{\sum_{h=1}^{m_n} \pi_h N(y; \mu_h, \sigma^2 I_p) : \|\mu_h\| \le a_n,\ h = 1, \ldots, m_n,\ \sum_{h=m_n+1}^{\infty} \pi_h < \varepsilon\}$, leading to an entropy $m_n \log a_n$, where mn is related to the tail decay of $P(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon)$. With this idea in place, we extend the construction of $\mathcal{F}_n$ to conditional densities below.

Before constructing the sieve, we briefly review alternative definitions [43] of a Gaussian process as a Banach space valued element. A Borel measurable random element W with values in a separable Banach space ($\mathbb{B}$, ||·||) is called Gaussian if the random variable b*W is normally distributed for any element b* ∈ $\mathbb{B}^*$, the dual space of $\mathbb{B}$. Recall that in general, the reproducing kernel Hilbert space (RKHS) ℍ attached to a zero-mean Gaussian process W is defined as the collection of all E[WH] for H ranging over the closed linear span of the variables b*W in L2(ν, M), with inner product

$\langle E[W(\cdot)H_1],\ E[W(\cdot)H_2]\rangle_{\mathbb{H}} = E[H_1 H_2]. \qquad (5.8)$

The RKHS can be viewed as a subset of $\mathbb{B}$, and the RKHS norm ||·||ℍ is stronger than the Banach space norm ||·||.

In particular, if W is a Borel measurable zero-mean Gaussian random element in a complete separable subspace of ℓ∞(T), the Banach space of uniformly bounded functions g : T → ℜ equipped with the uniform norm ||g||∞ = sup{|g(t)| : t ∈ T}, then the RKHS is the completion of the linear space of functions t ↦ E[W(t)H] relative to the inner product (5.8), where H, H1 and H2 are finite linear combinations of the form $\sum_i a_i W(s_i)$ with ai ∈ ℜ and si in the index set of W. See Theorem 2.1 of [43] for details.

Next we turn to constructing the sieve. Assume ε > 0 is given. Let $\mathbb{H}_1^a$ denote the unit ball of the RKHS of the covariance kernel $\tau^2 e^{-a\|x - x'\|^2}$ and let $\mathbb{B}_1$ denote the unit ball of ℂ[0, 1]p. For numbers M, m, r, δ, construct a sequence of subsets {Bh, h = 1, …, m} of ℂ[0, 1]p as follows.

$B_h = \begin{cases} \left(M\sqrt{r/\delta}\,\mathbb{H}_1^r + \frac{\varepsilon}{m^2}\mathbb{B}_1\right) \cup \left(\bigcup_{a < \delta} M\,\mathbb{H}_1^a + \frac{\varepsilon}{m^2}\mathbb{B}_1\right), & \text{if } h = 1, \ldots, m^\eta, \\ \bigcup_{a < \delta} M\,\mathbb{H}_1^a + \frac{\varepsilon}{m^2}\mathbb{B}_1, & \text{if } h = m^\eta + 1, \ldots, m. \end{cases}$

The idea is to construct

$\mathcal{F}_n = \left\{f : f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x) \frac{1}{\sigma_h}\phi\left(\frac{y - x'\beta_h}{\sigma_h}\right),\ \{\phi_{\beta_h,\sigma_h}\}_{h=1}^{m_n} \subset \Theta_{a_n,h_n,l_n},\ \alpha_h \in B_{h,n},\ h = 1, \ldots, m_n,\ \sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) \le \varepsilon\right\} \qquad (5.9)$

for appropriate sequences an, ln, hn, Mn, mn, rn, δn to be chosen in the proof of Theorem 5.10.

The following lemma is also crucial to the proof of Theorem 5.10; it allows us to calculate the rate of decay of $P(\sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) > \varepsilon)$ with mn.

Lemma 5.15

Let the πh’s satisfy (5.1) with αh ~ GP(0, ch), where $c_h(x, x') = \tau^2 e^{-A_h \|x - x'\|^2}$, h ≥ 1, τ2 > 0 fixed. Then for some constant C7 > 0,

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon\right) \le e^{-C_7 m_n \log m_n} + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n). \qquad (5.10)$
Proof

Let $W_h = -\log[1 - \Phi\{\alpha_h^*\}]$, where $\alpha_h^* = \inf_{x \in \mathcal{X}} \alpha_h(x)$, and let Zh ~ Ga(1, γ0); we will choose an appropriate value for γ0 in the sequel. Let t0 = −log ε > 0. Observe that

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon\right) \le \Pi_{\mathcal{X}}\left(-\sum_{h=m_n^\eta+1}^{m_n} \log\{1 - \Phi(\alpha_h^*)\} < t_0\right).$

If the αh* were i.i.d. standard normal random variables, we would have $\Pi_{\mathcal{X}}(-\sum_{h=1}^{m_n} \log\{1 - \Phi(\alpha_h^*)\} < t_0) = \Pi_{\mathcal{X}}(\Lambda < t_0)$ with Λ ~ Ga(mn, 1), and it is easy to show that $\Pi_{\mathcal{X}}(\Lambda < t_0) \precsim e^{-m_n \log m_n}$. However, the calculation gets complicated when the αh’s are i.i.d. realizations of a zero mean Gaussian process. The proof relies on the fact that the supremum of a Gaussian process has sub-Gaussian tails.

Below we calculate the rate of decay of $\Pi_{\mathcal{X}}(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon)$ with mn. We will show that there exists γ0, depending on ε and τ but not on n, such that

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n^\eta+1}^{m_n} W_h \le t_0\right) \le \xi(\delta_n)^{m_n - m_n^\eta}\, \Pi_{\mathcal{X}}\left(\sum_{h=m_n^\eta+1}^{m_n} Z_h < t_0\right) + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n), \qquad (5.11)$

where $\xi(x) = C_5 x^{p/2}$ for x > 0 and some constant C5 > 0. Observe that $\Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0) \le \Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0,\ A_h \le \delta_n,\ h = m_n^\eta+1, \ldots, m_n) + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n)$.

Since $\Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0) = \Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} (\tau'/\tau) W_h < \tau' t_0/\tau)$ for some τ′ < 1, we can re-parameterize t0 as τ′t0/τ and τ as τ′. Hence without loss of generality we assume τ < 1.

Define g : [0, t0] → ℜ, $t \mapsto -\Phi^{-1}(1 - e^{-t})$; g is a continuous function on (0, t0]. Assume α0 ~ GP(0, c0), where $c_0(x, x') = \tau^2 e^{-\|x - x'\|^2}$. For $h = m_n^\eta+1, \ldots, m_n$,

$P\left(\inf_{x \in \mathcal{X}} \alpha_h(x) \le -\lambda,\ A_h \le \delta_n\right) \le P\left(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda\right).$

Below we estimate $P(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda)$ for large enough λ, following Theorem 5.2 of [44]. However, extra care is required to identify the role of δn. Since $N(\varepsilon, \sqrt{\delta_n}\mathcal{X}, \|\cdot\|) \le C_1(\sqrt{\delta_n}/\varepsilon)^p$,

$\int_0^{\varepsilon} \{\log N(u, \sqrt{\delta_n}\mathcal{X}, \|\cdot\|)\}^{1/2}\, du \le C_2 \varepsilon \{1 + \log(1/\varepsilon)\}$

for some constant C2 > 0. Hence

$P\left(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda\right) \le C_3(\sqrt{\delta_n}\lambda)^p \exp\left[-\frac{1}{2\tau^2}\left\{\lambda - \frac{C_2}{\lambda}(1 + \log \lambda)\right\}^2\right] \le C_3 \delta_n^{p/2} \lambda^{p+2}\{1 - \Phi(\lambda/\tau^2)\} \le C_4 \delta_n^{p/2}\{1 - \Phi(\lambda)\}

for constants C3, C4 > 0, where the last inequality holds for all large λ because τ < 1. Hence there exists t1 ∈ (0, t0), sufficiently small and independent of n, such that for all t ∈ (0, t1), $\Pi_{\mathcal{X}}\{\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge g(t)\} \le C_4 \delta_n^{p/2}\Phi\{-g(t)\}$. Observe that

$\Pi_{\mathcal{X}}\left\{\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge g(t)\right\} \le C_4 \delta_n^{p/2}\Phi\{-g(t)\} < C_5 \delta_n^{p/2}(1 - e^{-\gamma_0 t}),$

for any γ0 > 1. Further choose γ0 large enough such that $2(1 - e^{-\gamma_0 t}) > 1$ for all t ∈ [t1, t0]. Hence $P(W_h \le t,\ A_h \le \delta_n) \le \xi(\delta_n) P(Z_h < t)$ for all t ∈ (0, t0], where $\xi(\delta_n) = C_5 \delta_n^{p/2}$ with C5 = max{2, C4}. Applying Lemma Appendix E.1, stated in Appendix E, we conclude (5.11) by induction. As $\sum_{h=1}^{m_n} Z_h \sim \mathrm{Ga}(m_n, \gamma_0)$, $\Pi_{\mathcal{X}}(\sum_{h=1}^{m_n} Z_h < t_0) \le e^{-C_6 m_n \log m_n}$ for some constant C6 > 0. Since $\xi(\delta_n)^{m_n - m_n^\eta}\, \Pi_{\mathcal{X}}(\sum_{h=1}^{m_n} Z_h < t_0) \le e^{-C_7 m_n \log m_n}$ for some constant C7 > 0, the result follows immediately.

5.3. Prior specification and posterior computation

To illustrate the applicability of the proposed methods, we mention the prior choices and key steps for posterior computation for the MGLRx model. Recall that

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma), \qquad (5.12)$
$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{(\beta_h, \sigma_h^2)}, \quad (\beta_h, \sigma_h^{-2}) \sim N(\beta_0, \Sigma_0) \times \mathrm{Ga}(a_\sigma, b_\sigma), \qquad (5.13)$

where $\pi_h(x) = \Phi\{\alpha_h(x)\}\prod_{l<h}[1 - \Phi\{\alpha_l(x)\}]$. We assume αh ~ GP(0, ch), where $c_h(x, x') = \frac{1}{\tau_\alpha} e^{-A_h\|x - x'\|^2}$, τα ~ Ga(να/2, να/2). See Remark 5.12 for constructing a prior for the Ah’s. If the yi’s are standardized, we would expect the total variance $\sum_{h=1}^{\infty} \pi_h \sigma_h^2$ to be around 1. Hence choose aσ = 1, bσ = 10 so that $E(\sigma_h^{-2}) = 0.1$. We can resort to an MCMC algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of [45] and Metropolis-Hastings sampling, to sample from the posterior of (5.12). [45] proposed the exact block Gibbs sampler as an efficient approach to posterior computation in infinite-dimensional Dirichlet process mixture models, modifying the block Gibbs sampler of [46] to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler [47] and the slice sampler [4, 48]. Introduce γ1, …, γn such that πh(xi) = P(γi = h), h = 1, 2, …, ∞. Then

$\gamma_i \sim \sum_{h=1}^{\infty} \pi_h(x_i)\, \delta_h = \sum_{h=1}^{\infty} 1(u_i < \pi_h(x_i))\, \delta_h$ (marginally over ui),

where ui ~ U(0, 1).

We continue up to $h^* = \max\{h_1^*, \ldots, h_n^*\}$, where $h_i^*$ is the minimum integer satisfying $\sum_{l=1}^{h_i^*} \pi_l(x_i) > 1 - \min\{u_1, \ldots, u_n\}$, i = 1, …, n. The Markov chain adaptively estimates the desired number of components h* at each iteration of the MCMC, making it more efficient than a finite mixture model with a pre-specified large number of components. Here we describe the key steps of the posterior computation; a code sketch of the truncation and allocation logic follows the list below.

  1. Update ui’s and stick breaking random variables: Generate
    $u_i \mid - \sim U(0, \pi_{\gamma_i}(x_i)),$
    where $\pi_h(x_i) = \Phi\{\alpha_h(x_i)\}\prod_{l<h}[1 - \Phi\{\alpha_l(x_i)\}]$. For i = 1, …, n, introduce latent variables Zh(xi), h = 1, 2, …, such that Zh(xi) ~ N(αh(xi), 1). Thus πh(xi) = P(Zh(xi) > 0, Zl(xi) < 0 for l < h). Then
    $Z_h(x_i) \mid - \sim \begin{cases} N(\alpha_h(x_i), 1)\, \mathbb{I}_{(0,\infty)}, & h = \gamma_i, \\ N(\alpha_h(x_i), 1)\, \mathbb{I}_{(-\infty,0)}, & h < \gamma_i. \end{cases}$
    Let Zh = (Zh(x1), …, Zh(xn))′ and αh = (αh(x1), …, αh(xn))′. Letting $(\Sigma_h)_{ij} = e^{-A_h\|x_i - x_j\|^2}$, we have Zh ~ N(αh, I) and $\alpha_h \sim N(0, \frac{1}{\tau_\alpha}\Sigma_h)$, so
    $\alpha_h \mid - \sim N\left((\tau_\alpha \Sigma_h^{-1} + I_n)^{-1} Z_h,\ (\tau_\alpha \Sigma_h^{-1} + I_n)^{-1}\right).$
    Continue up to $h^* = \max\{h_1^*, \ldots, h_n^*\}$, where $h_i^*$ is the minimum integer satisfying $\sum_{l=1}^{h_i^*} \pi_l(x_i) > 1 - \min\{u_1, \ldots, u_n\}$, i = 1, …, n. Now
    $\tau_\alpha \mid - \sim \mathrm{Ga}\left(\frac{1}{2}(n h^* + \nu_\alpha),\ \frac{1}{2}\left(\sum_{l=1}^{h^*} \alpha_l' \Sigma_l^{-1} \alpha_l + \nu_\alpha\right)\right),$

    while κα is updated using a Metropolis-Hastings step.

  2. Update allocation to atoms: Update (γ1, …, γn)|– as multinomial random variables with probabilities
    $P(\gamma_i = h) \propto N(y_i; x_i'\beta_h, \tau_h^{-1})\, \mathbb{I}(u_i < \pi_h(x_i)),\ h = 1, \ldots, h^*.$
  3. Update component-specific locations and precisions: Let nh = #{i : γi = h}, h = 1, 2, …, h*. Let Yh = (yi : γi = h) be the nh-dimensional response vector and Xh the corresponding nh × p covariate matrix. Then
    $\beta_h \mid - \sim N\left((X_h'X_h + \Sigma_0^{-1})^{-1}(X_h'Y_h + \Sigma_0^{-1}\beta_0),\ (X_h'X_h + \Sigma_0^{-1})^{-1}\right),$
    $\tau_h \mid - \sim \mathrm{Ga}\left(\frac{n_h}{2} + a_\sigma,\ b_\sigma + \frac{1}{2}\sum_{i: \gamma_i = h}(y_i - x_i'\beta_h)^2\right),\ h = 1, 2, \ldots, h^*.$

    Update the Ah’s in a Metropolis-Hastings step.

At each iteration of the MCMC, we obtain samples from the full conditional distributions of the parameters, which, after discarding a burn-in, can be used to compute summary statistics of the posterior distribution of the parameters or of functionals of interest.
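To fix ideas, here is a minimal sketch (our own simplification, not the authors' code) of the adaptive-truncation logic in steps 1–2: draw slice variables, find h*, and reallocate observations. It assumes the weights πh(x) are available as a function, one covariate (p = 1), and atom arrays already long enough; in the full sampler, atoms beyond the current h* would be drawn from the prior as needed, followed by the GP and regression updates of steps 1 and 3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def slice_allocation_step(pi_fn, x, y, gamma, beta, tau, max_h=500):
    """One slice-sampling allocation sweep (steps 1 and 2 above), simplified.

    pi_fn(h, xi) returns the current weight pi_h(xi), 0-indexed in h;
    beta, tau hold current atom draws of length >= the truncation reached."""
    n = len(y)
    # Step 1: slice variables u_i ~ U(0, pi_{gamma_i}(x_i)).
    u = np.array([rng.uniform(0.0, pi_fn(gamma[i], x[i])) for i in range(n)])
    # Adaptive truncation h*: smallest H with sum_{l<=H} pi_l(x_i) > 1 - min(u).
    h_star, u_min = 0, u.min()
    for i in range(n):
        tot, h = 0.0, 0
        while tot <= 1.0 - u_min and h < max_h:
            tot += pi_fn(h, x[i])
            h += 1
        h_star = max(h_star, h)
    # Step 2: multinomial reallocation restricted to {h : u_i < pi_h(x_i)}.
    for i in range(n):
        logp = np.array([norm.logpdf(y[i], x[i] * beta[h], tau[h] ** -0.5)
                         if u[i] < pi_fn(h, x[i]) else -np.inf
                         for h in range(h_star)])
        w = np.exp(logp - logp.max())
        gamma[i] = rng.choice(h_star, p=w / w.sum())
    return gamma, h_star
```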

6. Posterior consistency in Gaussian mixture of fixed-π dependent processes

6.1. Kullback-Leibler property

The following theorem verifies that $\Pi_{\mathcal{X}}$ has the KL property at f0 ∈ $\tilde{\mathcal{F}}_d$. The proof of Theorem 6.1 is similar to that of Theorem 5.3 and can be found in Appendix F.

Theorem 6.1

f0 ∈ KL($\Pi_{\mathcal{X}}$) for each f0 in $\tilde{\mathcal{F}}_d$ if $P_{\mathcal{X}}$ satisfies

  • T1

    G0 is specified by μh ~ GP(μ, c), σh ~ G0,σ, where c is chosen so that GP(0, c) has continuous path realizations and G0,σ is absolutely continuous w.r.t. the Lebesgue measure on ℜ+.

  • T2

    For every k ≥ 2, (π1, …, πk) is absolutely continuous w.r.t. the Lebesgue measure on Sk−1.

  • T3
    For any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

6.2. Strong consistency with the q-integrated L1 neighborhood

Next we summarize the consistency theorem with respect to the q-integrated L1 topology. The proof of Theorem 6.2 is also similar to that of Theorem 5.10 and is provided in Appendix G.

Theorem 6.2

Let $\mu_h(x) = x'\beta_h + \eta_h(x)$, βh ~ Gβ and ηh ~ GP(0, c), h = 1, …, ∞, where $c(x, x') = \tau^2 e^{-A\|x - x'\|^2}$ and $A^{p(1+\eta_2)/\eta_2} \sim \mathrm{Ga}(a, b)$ for some η2 > 0. Suppose the following hold.

  • F1

    There exist sequences an, hn ↑ ∞, ln ↓ 0 with $a_n/l_n = O(n)$ and $h_n/l_n = O(e^n)$, and constants d1, d2, d3, d4 > 0, such that $G_\beta\{B(0; a_n)\}^c < d_1 e^{-d_2 n}$ and $G_{0,\sigma}\{[l_n, h_n]\}^c \le d_3 e^{-d_4 n}$.

  • F2

    $P\left(\sum_{h=n}^{\infty} \pi_h > \varepsilon\right) \precsim O\left(e^{-n^{1+\eta_2}(\log n)^{(p+1)}}\right).$

    Then f0 ∈ KL($\Pi_{\mathcal{X}}$) implies that $\Pi_{\mathcal{X}}$ achieves strong posterior consistency at f0 with respect to the q-integrated L1 topology.

Remark 6.3

F2 is satisfied if the πh’s are made to decay more rapidly than the usual Beta(1, α) stick-breaking random variables. For example, if $\pi_h = \nu_h \prod_{l<h}(1 - \nu_l)$ with νh ~ Beta(1, αh), where $\alpha_h = h^{1+\eta_2}(\log h)^{p+1}\alpha_0$ for some α0 > 0, then F2 is satisfied. Large values of αh for the higher-indexed weights favor a smaller number of components; the sketch below illustrates the resulting decay.
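A quick numerical check (ours; α0, η2, p and the truncation are arbitrary illustrative values) that this modified stick-breaking construction concentrates mass on few components: simulate νh ~ Beta(1, αh) with the growing αh of the remark and inspect the remaining tail mass.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha0, eta2, p, H = 1.0, 0.5, 1, 200
h = np.arange(1, H + 1)
# alpha_h = h^{1+eta2} (log h)^{p+1} alpha_0; guard h = 1 where log h = 0
a_h = np.where(h == 1, alpha0,
               h ** (1 + eta2) * np.log(np.maximum(h, 2)) ** (p + 1) * alpha0)
nu = rng.beta(1.0, a_h)                   # nu_h ~ Beta(1, alpha_h)
pi = nu * np.cumprod(np.concatenate([[1.0], 1.0 - nu[:-1]]))
tail = 1.0 - np.cumsum(pi)                # mass beyond the first h components
print("tail mass after 5, 10, 20 sticks:", tail[4], tail[9], tail[19])
```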

Remark 6.4

A Gaussian kernel is used here for technical simplification. One can obtain similar results using a variety of kernels, e.g., t, Laplace, etc. However, the KL support conditions A1–A5 will differ for different kernels. Refer to [49] for a catalogue of conditions for various kernels in a density estimation framework.

7. Discussion

We have provided sufficient conditions for posterior consistency in estimating a conditional density via predictor-dependent mixtures of Gaussians, which include probit stick-breaking mixtures of Gaussians and fixed-π dependent processes as special cases. The problem is of interest as it provides a more flexible and informative alternative to the usual mean regression. For both models, we need the same set of tail conditions (collected in $\tilde{\mathcal{F}}_d$) on f0 for KL support. Although the first prior is flexible in the weights and the second in the atoms through their corresponding GP terms, S1, S2, T1 and T3 show that verification of the KL property only requires that both GP terms have continuous path realizations and the desired approximation property. Moreover, for the second prior, any set of weights summing to one a.s. and satisfying T2 suffices for the KL property. Careful investigation of the prior for the GP kernel in the first model, and of the probability weights in the second, is required for strong consistency. For the first model we need the covariate dependence of the higher-indexed GP terms in the weights to fade out. For the second model, on the other hand, the atoms can be i.i.d. realizations of a GP with Gaussian covariance kernel and inverse-Gamma bandwidth, while limiting the model complexity through a sequence of probability weights which are allowed to decay rapidly. This suggests that full flexibility in the weights should be down-weighted by an appropriately chosen prior, while full flexibility in the atoms should be accompanied by a restriction favoring a smaller number of components. It would be interesting to see how the conditions on the bandwidth can be modified when we use a sieve Bayes prior, i.e., a prior with the number of components kn diverging to ∞.

Another interesting direction is to consider rates of convergence of the posterior and Bernstein-von Mises (BvM) type results. For infinite dimensional parameters [50], there have been quite a few positive BvM results very recently, for linear functionals of a probability density function [51] and for general classes of linear and non-linear functionals in a Gaussian white noise model [52]. We conjecture that such BvM-type results hold for linear functionals of a conditional density (e.g., the conditional mean or conditional cdf) as well, under appropriate conditions on the prior and the true data generating conditional density.

Acknowledgments

This work was supported by Award Number R01ES017240 from the National Institute of Environmental Health Sciences. We also thank the Associate Editor and the referees for the comments which significantly improved the exposition of the paper.

Appendix A. A useful lemma

To prove Theorem 5.6, we need an auxiliary lemma which we state below.

Lemma Appendix A.1

If {πh(x), h = 1, …, ∞} constructed as in (5.1) satisfies S1 and S2, then

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\pi_1(x) - F_x(A_1)| < \varepsilon_1, \ldots, \sup_{x \in \mathcal{X}} |\pi_k(x) - F_x(A_k)| < \varepsilon_k\right\} > 0 \qquad (A.1)$

for any measurable partition {Ai, i = 1, …, k} of ℜp × ℜ+, any εi > 0 and any conditional probability measure {Fx, x ∈ $\mathcal{X}$}.

Proof

Without loss of generality, let 0 < Fx(Ai) < 1, i = 1, …, k, ∀ x ∈ $\mathcal{X}$. We want to show that for any εi > 0, i = 1, …, k, (A.1) holds. Construct continuous functions gi : $\mathcal{X}$ → ℜ with 0 < gi(x) < 1 ∀ x ∈ $\mathcal{X}$, i = 1, …, k − 1, such that

$g_1(x) = F_x(A_1), \quad g_i(x)\prod_{l<i}\{1 - g_l(x)\} = F_x(A_i),\ 2 \le i \le k-1, \quad g_k(x) = 1\ \forall x. \qquad (A.2)$

As 0 < Fx(Ai) < 1, i = 1, …, k, ∀ x ∈ $\mathcal{X}$, it is trivial to find gi, i = 1, …, k satisfying (A.2), since one can solve back for the gi’s from (A.2). $\sum_{i=1}^{k} F_x(A_i) = 1$ enforces gk ≡ 1. Since Φ is a continuous function, for any εi > 0, i = 1, …, k − 1,

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\Phi\{\alpha_i(x)\} - g_i(x)| < \varepsilon_i\right\} > 0 \qquad (A.3)$

and for i = k,

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\Phi\{\alpha_k(x)\} - 1| < \varepsilon_k\right\} = P_{\mathcal{X}}\left\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\right\}. \qquad (A.4)$

Choose M > Φ−1(1 − εk) + εk. Then

$\left\{\sup_{x \in \mathcal{X}} |\alpha_k(x) - M| < \varepsilon_k\right\} \subset \left\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\right\}.$

Hence by S2, $P_{\mathcal{X}}\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\} > 0$. Let Sk−1 denote the (k − 1)-dimensional simplex. For notational simplicity let pi(x) = Φ{αi(x)}, gi(x) = Fx(Ai), i = 1, …, k − 1, and gk(x) = 1. Let z = (z1, …, zk)′, fi : Sk−1 → ℜ, $z \mapsto z_i \prod_{l<i}(1 - z_l)$, i = 2, …, k, and f1(z) = z1. Let p(x) = (p1(x), …, pk(x)) and g(x) = (g1(x), …, gk(x)). Then we need to show that

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}}|f_1(p) - f_1(g)| < \varepsilon_1, \ldots, \sup_{x \in \mathcal{X}}|f_{k-1}(p) - f_{k-1}(g)| < \varepsilon_{k-1},\ \sup_{x \in \mathcal{X}}|f_k(p) - 1| < \varepsilon_k\right\} > 0.$

Note that for 2 ≤ i ≤ k,

$|f_i(p) - f_i(g)| \le (i - 1)|p_i - g_i| + \sum_{l<i} |f_l(p) - f_l(g)|.$

Thus one can find $\varepsilon_i' > 0$, i = 1, …, k, such that

$\left\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i',\ i = 1, \ldots, k\right\} \subset \left\{\sup_{x\in\mathcal{X}}|f_1(p) - f_1(g)| < \varepsilon_1, \ldots, \sup_{x\in\mathcal{X}}|f_{k-1}(p) - f_{k-1}(g)| < \varepsilon_{k-1},\ \sup_{x\in\mathcal{X}}|f_k(p) - 1| < \varepsilon_k\right\}.$

But since $P_{\mathcal{X}}\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i',\ i = 1, \ldots, k\} = \prod_{i=1}^{k} P_{\mathcal{X}}\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i'\}$, the result follows immediately.

Appendix B. Proof of Theorem 5.6

Fix $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$. Without loss of generality it is enough to show that for a uniformly continuous function g : ℜp × ℜ+ × $\mathcal{X}$ → [0, 1] and ε > 0,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| < \varepsilon\right\} > 0.$

Furthermore, it suffices to assume that g(β, σ, x) → 0 uniformly in x ∈ $\mathcal{X}$ as ||β|| → ∞, σ → ∞.

Fix ε > 0. There exist a, σ̲, σ̄ > 0 not depending on x such that Fx([−a, a]p × [σ̲, σ̄]) > 1 − ε for all x ∈ $\mathcal{X}$. Let C = [−a, a]p × [σ̲, σ̄].

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \left|\int_{\mathcal{X}} \left\{\sum_{h=1}^{\infty} \pi_h(x)\, g(\beta_h, \sigma_h, x) - \int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\right\} q(x)\, dx\right| + \varepsilon,$

where the πh’s are specified by (5.1) with ch satisfying S1 and S2, and (βh, σh) ~ G0. Now for each x ∈ $\mathcal{X}$, construct a Riemann sum approximation of $\int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)$.

Let {Ak,n, k = 1, …, n} be a sequence of partitions of C with increasing refinement as n increases. Assume max1≤k≤n diam(Ak,n) → 0 as n ↑ ∞. Fix $(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}) \in A_{k,n}$, k = 1, …, n. Then by the DCT, as n → ∞,

$\int_{\mathcal{X}} \left\{\sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}, x)\, F_x(A_{k,n})\right\} q(x)\, dx \to \int_{\mathcal{X}} \int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\, q(x)\, dx. \qquad (B.1)$

Hence there exists n1 such that for nn1

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \left|\int_{\mathcal{X}} \left\{\sum_{h=1}^{\infty} \pi_h(x)\, g(\beta_h, \sigma_h, x) - \sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}, x)\, F_x(A_{k,n})\right\} q(x)\, dx\right| + 2\varepsilon.$

Consider the set

$\Omega_1 = \left\{(\pi_h, h = 1, \ldots, \infty) : \sup_{x \in \mathcal{X}} |\pi_1(x) - F_x(A_{1,n_1})| < \varepsilon/n_1, \ldots, \sup_{x \in \mathcal{X}} |\pi_{n_1}(x) - F_x(A_{n_1,n_1})| < \varepsilon/n_1\right\}.$

By Lemma Appendix A.1, which is proved in Appendix A, $P_{\mathcal{X}}(\Omega_1) > 0$. Since $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s., there exists Ω with $P_{\mathcal{X}}(\Omega) = 1$ such that for each ω = {πh, h = 1, …, ∞} ∈ Ω, $g_n(x) = \sum_{h=1}^{n} \pi_h(x) \to 1$ as n → ∞ for each x ∈ $\mathcal{X}$. Note that this convergence is uniform, since the gn(·), n ≥ 1, are continuous functions defined on a compact set, monotonically increasing to a continuous function identically equal to 1. Hence for each ω = {πh, h = 1, …, ∞} ∈ Ω, gn(x) → 1 uniformly in x. By Egoroff’s theorem, there exists a measurable subset Ω2 of Ω1 with $P_{\mathcal{X}}(\Omega_2) > 0$ such that gn(x) → 1 uniformly in x and uniformly in ω ∈ Ω2. Thus there exists a positive integer nε ≥ n1, not depending on x and ω, such that $\sum_{h=n_\varepsilon+1}^{\infty} \pi_h(x) < \varepsilon$ on Ω2. Moreover, one can find a K > 0 independent of x such that g(β, σ, x) < ε if ||β|| > K and σ > K. Let A1 = {(β, σ) : ||β|| > K, σ > K} and let Ω3 = Ω2 ∩ {(βh, σh) ∈ A1, h = n1 + 1, …, nε}. For ω ∈ Ω3,

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \int_{\mathcal{X}} \left\{\sum_{k=1}^{n_1} \left|\pi_k(x)\, g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\, F_x(A_{k,n_1})\right|\right\} q(x)\, dx + 4\varepsilon

and

$\int_{\mathcal{X}} \left\{\sum_{k=1}^{n_1} \left|\pi_k(x)\, g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\, F_x(A_{k,n_1})\right|\right\} q(x)\, dx \le \sum_{k=1}^{n_1} \int_{\mathcal{X}} \pi_k(x) \left|g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\right| q(x)\, dx + \varepsilon.$

There exist sets Bk, k = 1, …, n1, depending on n1 but independent of x, such that if (βk, σk) ∈ Bk, then $|g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)| < \varepsilon$. So for ω ∈ Ω4 = Ω3 ∩ {(β1, σ1) ∈ B1, …, (βn1, σn1) ∈ Bn1},

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| < 5\varepsilon.$

Now since $P_{\mathcal{X}}(\Omega_2) > 0$ and the events {(βh, σh) ∈ A1, h = n1 + 1, …, nε} and {(β1, σ1) ∈ B1, …, (βn1, σn1) ∈ Bn1} are independent of Ω2 and have positive probability, it follows that $P_{\mathcal{X}}(\Omega_4) > 0$.

Appendix C. Proof of Theorem 5.3

Without loss of generality, assume that the covariate space $\mathcal{X}$ is [ζ, 1]p for some 0 < ζ < 1. The proof is essentially along the lines of Theorem 3.2 of [12]. The $\tilde{f}$ in (5.2) will be constructed so as to satisfy the assumptions of Lemma 5.5 and such that $\int_{\mathcal{X}}\int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx < \frac{\varepsilon}{2}$ for any ε > 0. Define a sequence of conditional densities $f_n(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_{n,x}(\beta, \sigma)$, n ≥ 1, where for $\sigma_n = n^{-\eta}$,

$dG_{n,x}(\beta, \sigma) = \frac{1_{[-n,n]}(\beta_1)\, f_0(x_1\beta_1 \mid x)\, \prod_{j=2}^{p} \delta_0(\beta_j)\, \delta_{\sigma_n}(\sigma)}{\int_{-n}^{n} f_0(x_1\beta_1 \mid x)\, d\beta_1}. \qquad (C.1)$

Define

$f_n(y \mid x) = \frac{\int_{-nx_1}^{nx_1} \frac{1}{\sigma_n}\phi\left(\frac{y - t}{\sigma_n}\right) f_0(t \mid x)\, dt}{\int_{-nx_1}^{nx_1} f_0(t \mid x)\, dt}. \qquad (C.2)$

Proceeding as in Theorem 3.2 of [12], an application of DCT using the conditions A1–A5 yields

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{f_n(y \mid x)}\, dy\, q(x)\, dx \to 0 \quad \text{as } n \to \infty.$

Therefore one can simply choose $\tilde{f} = f_{n_0}$ for sufficiently large n0. $f_{n_0}$ satisfies the assumptions of Lemma 5.5 since $\{G_{n_0,x}, x \in \mathcal{X}\}$ is compactly supported. Also $\{G_{n_0,x}, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$, as x ↦ Gn0,x(A) is continuous. Hence there exists a finite intersection W of neighborhoods of $\{G_{n_0,x}, x \in \mathcal{X}\}$ of the type (5.5) such that for any {Gx, x ∈ $\mathcal{X}$} ∈ W, the second term of (5.2) is arbitrarily small. The conclusion of the theorem follows immediately from Corollary 5.8.

Appendix D. Proof of Theorem 5.10

Consider the sequence of sieves defined by (5.9) for given ε > 0 and for sequences an, hn, ln, Mn, mn, rn to be chosen later, with $\delta_n = K_1\varepsilon/(M_n m_n^2)$ for some constant K1. We will first show that given ξ > 0, there exist c1, c2 > 0 and sequences mn and Mn such that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le c_1 e^{-nc_2}$ and log N(δ, $\mathcal{F}_n$, ||·||1) < nξ.

For f1, f2Inline graphic, we have for each xInline graphic,

f1(·x)-f2(·x)1h=1mnπh(1)-πh(2)+2ε.

Let $\Theta_{\pi,n} = \{\pi^{m_n} = (\pi_1, \pi_2, \ldots, \pi_{m_n}) : \alpha_h \in B_{h,n},\ h = 1, \ldots, m_n\}$. Fix $\pi^{m_n}_{(1)}, \pi^{m_n}_{(2)} \in \Theta_{\pi,n}$. Note that since |Φ(x1) − Φ(x2)| ≤ K2|x1 − x2| for a global constant K2 > 0, we have

$\|\Phi(\alpha_{h,1}) - \Phi(\alpha_{h,2})\|_{\infty} \le K_2 \|\alpha_{h,1} - \alpha_{h,2}\|_{\infty}.$

This fact, together with the proof of Lemma Appendix A.1, shows that if we can make $\|\alpha_{h,1} - \alpha_{h,2}\|_{\infty} < \varepsilon/m_n^2$, h = 1, …, mn, we would have $\sum_{h=1}^{m_n} \|\pi_h^{(1)} - \pi_h^{(2)}\|_{\infty} < \varepsilon$. From the proof of Theorem 3.1 in [42] it follows that for $h = 1, \ldots, m_n^{\eta}$ and for sufficiently large Mn, rn,

$\log N(2\varepsilon/m_n^2, B_{h,n}, \|\cdot\|_{\infty}) \le K_3 r_n^p \left\{\log\left(\frac{M_n m_n^2 \sqrt{r_n/\delta_n}}{\varepsilon}\right)\right\}^{p+1} + 2\log\frac{K_4 M_n m_n^2}{\varepsilon} \qquad (D.1)$

for global constants K3, K4 > 0. For $M_n^2 > 16 K_5 r_n^p \{\log(r_n/\varepsilon)\}^{1+p}$, rn > 1, we have for $h = 1, \ldots, m_n^{\eta}$,

$P(\alpha_h \notin B_{h,n}) \le P(A_h > r_n) + e^{-M_n^2/2}. \qquad (D.2)$

Hence for sufficiently large Mn, we have for $h = m_n^{\eta}+1, \ldots, m_n$,

$\log N(3\varepsilon/m_n^2, B_{h,n}, \|\cdot\|_{\infty}) \le 2\log\frac{K_4 M_n m_n^2}{\varepsilon}. \qquad (D.3)$

For $h = m_n^{\eta}+1, \ldots, m_n$,

$P(\alpha_h \notin B_{h,n}) \le P(A_h > \delta_n) + \int_{0}^{\delta_n} P(\alpha_h \notin B_{h,n} \mid A_h = a)\, g_{A_h}(a)\, da \le P(A_h > \delta_n) + \left(1 - \Phi\left(\Phi^{-1}\left(e^{-\phi_0^{\delta_n}(\varepsilon/m_n^2)}\right) + M_n\right)\right),$

where $\phi_0^{\kappa}(\varepsilon)$ denotes the concentration function of the Gaussian process with covariance kernel $c(x, x') = \tau^2 e^{-\kappa\|x - x'\|^2}$. Now

$\phi_0^{\delta_n}(\varepsilon/m_n^2) \le -\log P\left(\|\alpha_0\|_{\infty} \le \varepsilon/m_n^2\right) \le K_6 \log(m_n^2/\varepsilon)$

for some constant K6 > 0. Hence if $M_n \ge K_7 \sqrt{\log(m_n^2/\varepsilon)}$ for some K7 > 0, then it follows from the proof of Theorem 3.1 in [42] that

$P(\alpha_h \notin B_{h,n}) \le P(A_h > \delta_n) + e^{-M_n^2/2}. \qquad (D.4)$

From (D.1) and (D.3),

$\log N(\varepsilon, B_{1,n} \times \cdots \times B_{m_n,n}, \|\cdot\|_{\infty}) \le 2 m_n \log\frac{K_4 M_n m_n^2}{\varepsilon} + m_n^{\eta} r_n^p \left\{\log\left(\frac{M_n m_n^2 \sqrt{r_n/\delta_n}}{\varepsilon}\right)\right\}^{p+1}. \qquad (D.5)$

Also from (D.2) and (D.4),

$\sum_{h=1}^{m_n} P(\alpha_h \notin B_{h,n}) \le m_n e^{-M_n^2/2} + \sum_{h=1}^{m_n^{\eta}} P(A_h > r_n) + \sum_{h=m_n^{\eta}+1}^{m_n} P(A_h > \delta_n).$

We will show that with $m_n = O(n/\log n)$, $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) < e^{-n\xi_0}$ for some ξ0 > 0. By assumption C1, we have

$\Pi_{\mathcal{X}}(\Theta_{a_n,h_n,l_n}^c) \le m_n O(e^{-n}) \precsim e^{-n/2}. \qquad (D.6)$

With $m_n = O(n/\log n)$, $\sum_{h=1}^{m_n^{\eta}} P(A_h > r_n) \le m_n^{\eta} e^{-n} \precsim e^{-n/2}$ and $\sum_{h=m_n^{\eta}+1}^{m_n} P(A_h > \delta_n) \le (m_n - m_n^{\eta})\, e^{-n^{-\eta_0} m_n^{\eta_0+2} \log m_n} \le e^{-m_n \log m_n}$.

With $m_n = n/\log n$, $m_n \log m_n > n/2$ for large enough n, and it follows from Lemma 5.15 that

$\Pi_{\mathcal{X}}\left(\sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) > \varepsilon\right) \le O(e^{-n/2}). \qquad (D.7)$

Thus with $M_n = O(n^{1/2})$,

$\sum_{h=1}^{m_n} P(\alpha_h \notin B_{h,n}) \precsim e^{-n/2}. \qquad (D.8)$

(D.6), (D.7) and (D.8) together imply that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \precsim e^{-n/2}$.

Also $m_n^{\eta} r_n^p \left\{\log\left(M_n \sqrt{r_n/\delta_n}/\varepsilon\right)\right\}^{p+1} = o(n)$ for the choice of the sequence rn. With mn = n/(C log n) for some large C > 0, one can make

$\log N(\varepsilon, B_{1,n} \times \cdots \times B_{m_n,n}, \|\cdot\|_{\infty}) < n\xi \qquad (D.9)$

for any ξ > 0. Also from Lemma 5.14,

$m_n \log N(\Theta_{a_n,h_n,l_n}, \varepsilon, \|\cdot\|_{\infty}) \le m_n \log\left\{d_1\left(\frac{a_n}{l_n}\right)^p + d_2 \log\frac{h_n}{l_n} + 1\right\} < n\xi \qquad (D.10)$

for any ξ > 0. Combining (D.9) and (D.10), log N($\mathcal{F}_n$, 4ε, ||·||1) < nξ for any ξ > 0.

Appendix E. Another useful lemma

We state without proof the following lemma, which is needed in the proof of Lemma 5.15.

Lemma Appendix E.1

For non-negative random variables Ai, Bi, if P(Ai ≤ u) ≤ Ci P(Bi ≤ u) for u ∈ (0, t0), t0 > 0, i = 1, 2, then P(A1 + A2 ≤ t0) ≤ C1C2 P(B1 + B2 ≤ t0).

Appendix F. Proof of Theorem 6.1

Proof

Once again we approximate f0(y | x) by $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) d\tilde{G}_x(\mu, \sigma)$, so that the first term of (5.2) is arbitrarily small. We construct such an $\tilde{f}$ analogously to that in Theorem 5.3. Lemma Appendix F.1 is a variant of Lemma 5.5 which ensures that the second term in (5.2) is also sufficiently small. Before that we need a different notion of neighborhood of {Fx, x ∈ $\mathcal{X}$}, which we formulate below:

$\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re \times \Re^+} \{g(\mu, \sigma)\, dG_x(\mu, \sigma) - g(\mu, \sigma)\, dF_x(\mu, \sigma)\}\right| < \varepsilon\right\}. \qquad (F.1)$

Lemma Appendix F.1

Assume that f0 ∈ $\mathcal{F}_d$ satisfies $\int_{\mathcal{X}} \int_{\mathcal{Y}} y^2 f_0(y \mid x)\, dy\, q(x)\, dx < \infty$. Suppose $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) d\tilde{G}_x(\mu, \sigma)$, where there exist a > 0 and 0 < σ̲ < σ̄ such that

$\tilde{G}_x\left([-a, a] \times (\underline{\sigma}, \bar{\sigma})\right) = 1\ \forall x \in \mathcal{X}, \qquad (F.2)$

so that $\tilde{G}_x$ has compact support for each x ∈ $\mathcal{X}$. Then given any ε > 0, there exists a neighborhood W of $\{\tilde{G}_x, x \in \mathcal{X}\}$, a finite intersection of neighborhoods of the type (F.1), such that for any conditional density $f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_x(\mu, \sigma)$, x ∈ $\mathcal{X}$, with {Gx, x ∈ $\mathcal{X}$} ∈ W,

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx < \varepsilon. \qquad (F.3)$

The proof of Lemma Appendix F.1 is similar to that of Lemma 5.5 and is omitted here. To characterize the support of $P_{\mathcal{X}}$, we define a collection, denoted $\mathcal{G}_{\mathcal{X}}$, of fixed conditional probability measures {Fx, x ∈ $\mathcal{X}$} on (ℜ × ℜ+, $\mathcal{B}$(ℜ × ℜ+)) satisfying: x ↦ ∫ℜ×ℜ+ g(μ, σ)dFx(μ, σ) is a continuous function of x for all bounded uniformly continuous functions g : ℜ × ℜ+ → [0, 1].

Theorem Appendix F.2

Assume the following holds.

  • T1

    G0 is specified by μh ~ GP(μ, c), σh ~ G0,σ, where c is chosen so that GP(0, c) has continuous path realizations and G0,σ is absolutely continuous w.r.t. the Lebesgue measure on ℜ+.

  • T2

    For every k ≥ 2, (π1, …, πk) is absolutely continuous w.r.t. the Lebesgue measure on Sk−1.

  • T3
    For any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

Then for a bounded uniformly continuous function g : ℜ × ℜ+ → [0, 1] satisfying g(μ, σ) → 0 as |μ| → ∞, σ → ∞,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re \times \Re^+} \{g(\mu, \sigma)\, dG_x(\mu, \sigma) - g(\mu, \sigma)\, dF_x(\mu, \sigma)\}\right| < \varepsilon\right\} > 0. \qquad (F.4)$

Proof

It suffices to assume that g is coordinatewise monotonically increasing on ℜ × ℜ+. Let ε > 0 be given and let ψ(x) = ∫ℜ×ℜ+ g(μ, σ)dFx(μ, σ). Let nε be such that $P_{\mathcal{X}}(\Omega_1) > 0$, where $\Omega_1 = \{\sum_{h=n_\varepsilon+1}^{\infty} \pi_h < \varepsilon\}$. Then on Ω1,

$\left|\int_{\Re \times \Re^+} g(\mu, \sigma)\, dG_x(\mu, \sigma) - \psi(x)\right| \le \sum_{k=1}^{n_\varepsilon} \pi_k \left|g(\mu_k(x), \sigma_k) - \psi(x)\right| + \varepsilon.$

Define Ω2 = {$\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε, k = 1, …, nε}. For a fixed σk, there exists a δ such that $\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε/2 if $\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta$, where $g_{\sigma_k}^{-1}$ denotes the inverse of g(·, σk) for fixed σk. Hence there exists a neighborhood Bk of σk such that for σk ∈ Bk and $\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta$, we have $\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε. Since for each k = 1, …, nε,

$P_{\mathcal{X}}\left\{\sigma_k \in B_k,\ \sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta\right\} = \int_{\sigma_k \in B_k} P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta\right\} dG_{0,\sigma}(\sigma_k) > 0,$

Inline graphic2) > 0. The conclusion of the theorem follows from the independence of Ω1 and Ω2.

The $\tilde{f}$ in (5.2) will be constructed so as to satisfy the assumptions of Lemma Appendix F.1 and such that $\int_{\mathcal{X}}\int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx < \frac{\varepsilon}{2}$ for any ε > 0. Define a sequence of conditional densities $f_n(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_{n,x}(\mu, \sigma)$, n ≥ 1, where for $\sigma_n = n^{-\eta}$,

$dG_{n,x}(\mu, \sigma) = \frac{1_{[-n,n]}(\mu)\, f_0(\mu \mid x)\, \delta_{\sigma_n}(\sigma)}{\int_{-n}^{n} f_0(\mu \mid x)\, d\mu}. \qquad (F.5)$

As before define the approximator

$f_n(y \mid x) = \frac{\int_{-n}^{n} \frac{1}{\sigma_n}\phi\left(\frac{y - t}{\sigma_n}\right) f_0(t \mid x)\, dt}{\int_{-n}^{n} f_0(t \mid x)\, dt}. \qquad (F.6)$

$\tilde{f}$ will be chosen to be $f_{n_0}$ for some large n0. $f_{n_0}$ satisfies the assumptions of Lemma Appendix F.1 since $\{G_{n_0,x}, x \in \mathcal{X}\}$ is compactly supported. Moreover $\{G_{n_0,x}, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$, as x ↦ ∫ℜ×ℜ+ g(μ, σ)dGn0,x(μ, σ) is a continuous function of x for all bounded uniformly continuous functions g. Hence there exists a finite intersection W of neighborhoods of $\{G_{n_0,x}, x \in \mathcal{X}\}$ of the type (F.1) such that for any {Gx, x ∈ $\mathcal{X}$} ∈ W, the second term of (5.2) is arbitrarily small. The conclusion of the theorem follows immediately from a variant of Corollary 5.8 applied to neighborhoods of the type (F.1).

Appendix G. Proof of Theorem 6.2

Proof

As before, we establish q-integrated L1 consistency of Gaussian mixtures of fixed-π dependent processes by verifying the conditions of Theorem 5.9. Let $\phi_{\mu,\sigma}(x, y) := \frac{1}{\sigma}\phi\left(\frac{y - \mu(x)}{\sigma}\right)$ for y ∈ $\mathcal{Y}$ and x ∈ $\mathcal{X}$. Construct Bn as

$B_n = \left(M_n\sqrt{r_n/\delta_n}\,\mathbb{H}_1^{r_n} + \frac{\varepsilon l_n \sqrt{\pi}}{4\sqrt{2}}\mathbb{B}_1\right) \cup \left(\bigcup_{a < \delta_n} M_n\,\mathbb{H}_1^{a} + \frac{\varepsilon l_n \sqrt{\pi}}{4\sqrt{2}}\mathbb{B}_1\right),$

with $\delta_n = K_1 \varepsilon l_n / M_n$ for some constant K1 > 0. Let

$\Theta_n = \left\{\phi_{\mu,\sigma} : \mu(x) = x'\beta + \eta(x),\ \|\beta\| \le a_n,\ \eta \in B_n,\ l_n \le \sigma \le h_n\right\}. \qquad (G.1)$

It is easy to see that

$\log N(\mathcal{F}_n, 4\varepsilon, \|\cdot\|_1) \le K_2 m_n r_n^p \left\{\log\left(\frac{8\sqrt{2} M_n \sqrt{r_n/\delta_n}}{\varepsilon\sqrt{\pi}\, l_n}\right)\right\}^{p+1} + m_n \log\frac{K_3 M_n}{\varepsilon l_n} + m_n \log\left\{d_1\left(\frac{a_n}{l_n}\right)^p + d_2 \log\frac{h_n}{l_n} + 1\right\}. \qquad (G.2)$

Note that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le m_n P(\Theta_n^c) + P\left(\sum_{h=m_n}^{\infty} \pi_h > \varepsilon\right)$ and $P(\Theta_n^c) \le P(\|\beta\| > a_n) + P(\sigma \in [l_n, h_n]^c) + P(\eta \in B_n^c)$. It follows from the proof of Theorem 3.1 of [42] that

$P(\eta \in B_n^c) \le P(A > r_n) + e^{-M_n^2/2}$

if $M_n^2 > r_n^p \left\{\log\left(\frac{8\sqrt{2} M_n \sqrt{r_n/\delta_n}}{\varepsilon\sqrt{\pi}\, l_n}\right)\right\}^{p+1}$. Since $A^{p(1+\eta_2)/\eta_2} \sim \mathrm{Ga}(a, b)$, Lemma 4.9 of [42] indicates that $P(A > r_n) \precsim \exp\{-r_n^{p(1+\eta_2)/\eta_2}\}$. Hence with $M_n = O(n^{1/2})$, $m_n = O\{n/(\log n)^{p+1}\}^{1/(1+\eta_2)}$ and $r_n^p = O\{n^{\eta_2/(1+\eta_2)}\}$, we get $P(\Theta_n^c) \precsim e^{-n}$ and

$P\left(\sum_{h=m_n}^{\infty} \pi_h > \varepsilon\right) \le \exp\{-m_n^{1+\eta_2}(\log m_n)^{(p+1)}\} \le e^{-n}. \qquad (G.3)$

Also, the first term on the right hand side of (G.2) can be made smaller than nξ since $m_n r_n^p = O(n/(\log n)^{p+1})$. Also, by F1, the last two terms on the right hand side of (G.2) grow at rate o(n).

Footnotes

1

A similar sieve appears in [33] with a citation to an earlier draft of our paper.


References

1. Fan J, Yao Q, Tong H. Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika. 1996;83:189–206.
2. Rojas A, Genovese C, Miller C, Nichol R, Wasserman L. Conditional density estimation using finite mixture models with an application to astrophysics. 2005.
3. Jain S, Neal R. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics. 2004;13:158–182.
4. Walker S. Sampling the Dirichlet mixture model with slices. Communications in Statistics-Simulation and Computation. 2007;36:45–54.
5. Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
6. Minka T. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology; 2001.
7. Lo A. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics. 1984;12:351–357.
8. Ferguson T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
9. Ferguson T. Prior distributions on spaces of probability measures. The Annals of Statistics. 1974;2:615–629.
10. Barron A, Schervish M, Wasserman L. The consistency of posterior distributions in nonparametric problems. The Annals of Statistics. 1999;27:536–561.
11. Ghosal S, Ghosh J, Ramamoorthi R. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics. 1999;27:143–158.
12. Tokdar S. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā: The Indian Journal of Statistics. 2006;67:90–110.
13. Ghosal S, van der Vaart A. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics. 2001;29:1233–1263.
14. Ghosal S, van der Vaart A. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics. 2007;35:697–723.
15. Bhattacharya A, Dunson D. Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds. Annals of the Institute of Statistical Mathematics. 2011:1–28. doi:10.1007/s10463-011-0341-x.
16. Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
17. Griffin J, Steel M. Order-based dependent Dirichlet processes. Journal of the American Statistical Association. 2006;101:179–194.
18. Griffin J, Steel M. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica. 2010;20:1507–1527.
19. Dunson D, Pillai N, Park J. Bayesian density regression. Journal of the Royal Statistical Society, Series B. 2007;69:163–183.
20. Dunson D, Park J. Kernel stick-breaking processes. Biometrika. 2008;95:307–323. doi:10.1093/biomet/asn012.
21. Chung Y, Dunson D. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2009;104:1646–1660. doi:10.1198/jasa.2009.tm08302.
22. Tokdar S, Zhu Y, Ghosh J. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis. 2010;5:1–26.
23. Kruijer W, Rousseau J, van der Vaart A. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics. 2010;4:1225–1257.
24. Scricciolo C. Posterior rates of convergence for Dirichlet mixtures of exponential power densities. Electronic Journal of Statistics. 2011;5:270–308.
25. Barrientos F, Jara A, Quintana F. On the support of MacEachern's dependent Dirichlet processes. Bayesian Analysis. 2012;7:1–34.
26. MacEachern S. Dependent nonparametric processes. In: ASA Proceedings of the Section on Bayesian Statistical Science. American Statistical Association; 1999. pp. 50–55.
27. Tokdar S, Ghosh J. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference. 2007;137:34–42.
28. Yoon J. Bayesian analysis of conditional density functions: a limited information approach. Unpublished manuscript, Claremont McKenna College; 2009.
29. Tang Y, Ghosal S. A consistent nonparametric Bayesian procedure for estimating autoregressive conditional densities. Computational Statistics & Data Analysis. 2007;51:4424–4437.
30. Tang Y, Ghosal S. Posterior consistency of Dirichlet mixtures for estimating a transition density. Journal of Statistical Planning and Inference. 2007;137:1711–1726.
31. Rodriguez A, Dunson D. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6:145–178. doi:10.1214/11-BA605.
32. De Iorio M, Müller P, Rosner G, MacEachern S. An ANOVA model for dependent random measures. Journal of the American Statistical Association. 2004;99:205–215.
33. Norets A, Pelenis J. Posterior consistency in conditional density estimation by covariate dependent mixtures. Unpublished manuscript, Princeton University; 2010.
34. Norets A. Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics. 2010;38:1733–1766.
35. Wu Y, Ghosal S. The L1-consistency of Dirichlet mixtures in multivariate Bayesian density estimation. Journal of Multivariate Analysis. 2010;101:2411–2419.
36. Shen W, Tokdar S, Ghosal S. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. 2011. arXiv preprint arXiv:1109.6406.
37. Tokdar S. Adaptive convergence rates in Dirichlet process mixtures of multivariate normals. 2011. arXiv preprint arXiv:1111.4148.
38. Pati D, Dunson D. Bayesian nonparametric regression with varying residual density. Unpublished paper. 2009.
39. De Iorio M, Johnson W, Müller P, Rosner G. Bayesian nonparametric nonproportional hazards survival modeling. Biometrics. 2009;65:762–771. doi:10.1111/j.1541-0420.2008.01166.x.
40. Jara A, Lesaffre E, De Iorio M, Quintana F. Bayesian semiparametric inference for multivariate doubly-interval-censored data. The Annals of Applied Statistics. 2010;4:2126–2149.
41. Gelfand A, Kottas A, MacEachern S. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association. 2005;100:1021–1035.
42. van der Vaart A, van Zanten J. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics. 2009;37:2655–2675.
43. van der Vaart A, van Zanten J. Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections. 2008;3:200–222.
44. Adler R. An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series, Vol. 12. Institute of Mathematical Statistics; 1990.
45. Papaspiliopoulos O. A note on posterior sampling from Dirichlet mixture models. Technical report; 2008.
46. Ishwaran H, James L. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96:161–173.
47. Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
48. Kalli M, Griffin J, Walker S. Slice sampling mixture models. Statistics and Computing. 2010:1–13.
49. Wu Y, Ghosal S. Kullback-Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
50. Freedman D. Wald lecture: On the Bernstein-von Mises theorem with infinite-dimensional parameters. The Annals of Statistics. 1999;27:1119–1141.
51. Rivoirard V, Rousseau J. Bernstein-von Mises theorem for linear functionals of the density. The Annals of Statistics. 2012;40:1489–1523.
52. Castillo I, Nickl R. Nonparametric Bernstein-von Mises theorems. 2012. arXiv preprint arXiv:1208.3862.
