Author manuscript; available in PMC 2014 Jul 23. Published in final edited form as: J Multivar Anal. 2013 Apr 1;116:456–472. doi: 10.1016/j.jmva.2013.01.011

Posterior consistency in conditional distribution estimation

Debdeep Pati a, David B Dunson b, Surya T Tokdar b
PMCID: PMC4107341  NIHMSID: NIHMS441685  PMID: 25067858

Abstract

A wide variety of priors have been proposed for nonparametric Bayesian estimation of conditional distributions, and there is a clear need for theorems providing conditions on the prior for large support, as well as posterior consistency. Estimation of an uncountable collection of conditional distributions across different regions of the predictor space is a challenging problem, which differs in some important ways from density and mean regression estimation problems. Defining various topologies on the space of conditional distributions, we provide sufficient conditions for posterior consistency focusing on a broad class of priors formulated as predictor-dependent mixtures of Gaussian kernels. This theory is illustrated by showing that the conditions are satisfied for a class of generalized stick-breaking process mixtures in which the stick-breaking lengths are monotone, differentiable functions of a continuous stochastic process. We also provide a set of sufficient conditions for the case where stick-breaking lengths are predictor independent, such as those arising from a fixed Dirichlet process prior.

Keywords: Asymptotics, Bayesian nonparametrics, Density regression, Dependent Dirichlet process, Large support, Probit stick-breaking process

1. Introduction

One of the most common problems in data analysis is the need to characterize the dependence of a response on predictors in a flexible manner. We want to avoid parametric assumptions on the response density and on how features, such as the mean, variance, skewness, shape and even modality, change with predictors. Nonparametric estimates of the conditional distribution [1, 2] are appealing in this context, but in most applications one requires not just a point estimate but also a characterization of uncertainty. For this reason, and because of excellent practical performance in a rich variety of application areas, Bayesian approaches for conditional distribution estimation have become popular in recent years. The most common class of models is infinite mixture models, due in part to the rich literature on algorithms for posterior computation using Markov chain Monte Carlo (MCMC) [3–5] and fast approximations [6]. Such MCMC algorithms are straightforward to implement, and the output can be used to estimate exact posterior densities for functionals of interest.

The ever increasing literature on new nonparametric Bayes models and exciting new applications in areas ranging from finance to biostatistics to machine learning has generated considerable enthusiasm. However, there is a clear lack of frequentist asymptotic theory supporting these models. The emphasis of this article is on substantially closing this gap, focusing on a new class of generalized stick-breaking process (gSB) priors, which encompasses a number of the most widely applied priors as special cases.

In the absence of predictors, there is a rich theory and methods literature on nonparametric Bayes methods for estimating a density f using mixture models of the form

$y_i \sim f, \quad f \sim \Pi, \qquad (1.1)$

where Π is a mixture prior of the form $\sum_{h=1}^{\infty} \pi_h k(y; \theta_h)$ for a suitably chosen kernel k, and atoms and weights {(θh, πh), h = 1, …, ∞} with $\sum_{h=1}^{\infty} \pi_h = 1$ almost surely. The most common choice of Π is the Dirichlet process mixture of normals, first introduced by [7]. Original works on the Dirichlet process can be found in [8, 9]. The support of Π in (1.1) and asymptotic properties of the posterior are now well understood [10–15].

Recent literature has focused on generalizing model (1.1) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Bayesian density regression views the entire conditional density f(y | x) as a function valued parameter and allows its center, spread, skewness, modality and other such features to vary with x. For data {(yi, xi), i = 1, …, n} let

$y_i \mid x_i \sim f(\cdot \mid x_i), \quad \{f(\cdot \mid x),\ x \in \mathcal{X}\} \sim \Pi_{\mathcal{X}}, \qquad (1.2)$

where $\mathcal{X}$ is the predictor space and $\Pi_{\mathcal{X}}$ is a prior for the class of conditional densities $\{f_x, x \in \mathcal{X}\}$ indexed by the predictors. Refer, for example, to [16–21] and [22] among others.

The primary focus of this recent development has been infinite mixture models of the form

$f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, \frac{1}{\sigma_h}\phi\left\{\frac{y - \mu_h(x)}{\sigma_h}\right\}, \qquad (1.3)$

where φ is the standard normal density, {πh(x), h = 1, 2, …} are predictor-dependent probability weights that sum to one almost surely for each $x \in \mathcal{X}$, and (μh, σh) ~ G0 independently, with G0 a base probability measure on $\Theta \times \Re^+$, where Θ is a subset of the space of all $\mathcal{X} \to \Re$ functions. A single finite mixture of Gaussians is inadequate to represent the shape of the density f(y | x) at different levels of the predictor x unless the number of components is huge. By using an infinite mixture we inherently allow for uncertainty in the number of components needed to characterize the data and bypass the difficult issue of selecting the number of components.
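To make (1.3) concrete, the following minimal sketch (ours, not from the paper; the weight function, locations and scales are arbitrary illustrative choices) evaluates a two-component predictor-dependent mixture and shows how the conditional density changes shape and mode with x while each slice remains a proper density.

```python
import numpy as np
from scipy.stats import norm

# Illustrative instance of (1.3) with two active components:
# weights and locations vary smoothly with the predictor x in [0, 1].
def pi1(x):
    # first mixture weight, increasing in x (hypothetical choice)
    return 1.0 / (1.0 + np.exp(-6.0 * (x - 0.5)))

def f_cond(y, x, sigma1=0.3, sigma2=0.5):
    mu1, mu2 = 2.0 * x, -1.0 + x          # predictor-dependent locations
    w1 = pi1(x)
    return (w1 * norm.pdf(y, mu1, sigma1)
            + (1.0 - w1) * norm.pdf(y, mu2, sigma2))

y = np.linspace(-3.0, 4.0, 400)
for x in (0.1, 0.5, 0.9):
    fx = f_cond(y, x)
    # each conditional slice integrates to one; the mode moves with x
    print(f"x={x:.1f}: mode at y={y[np.argmax(fx)]:+.2f}, "
          f"integral={(fx * (y[1] - y[0])).sum():.3f}")
```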

Model (1.1) is similar in spirit to the kernel mixtures used in nonparametric smoothing approaches. However, a major advantage of the Bayesian paradigm is that we do not need to optimize tuning parameters, which becomes difficult in higher dimensions. Recent adaptation results [23, 24] reveal that even a single prior specification can adapt to the unknown smoothness level of the true density and achieve estimation that is optimal in an asymptotic minimax sense. For conditional densities, smoothing needs to be done over the response space as well as the predictor space, making the choice of optimal smoothing even more difficult, especially when the predictors have varying degrees of influence on the response. A Bayesian approach offers an easier practical solution in this case.

To our knowledge, only [25] have considered formalizing the notions of support for dependent stick-breaking processes. We focus on a novel class of gSB processes, which express the probability weights πh(x) in stick-breaking form, with the stick lengths constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function. This class includes dependent Dirichlet processes [26] as a special case.

Only a few papers have considered asymptotic properties of the posterior in conditional density estimation. [22] considers posterior consistency in estimating conditional distributions focusing exclusively on logistic Gaussian process priors [27]. Such priors lack the computational simplicity of the countable mixture priors in (1.3). [28] considers posterior consistency in conditional distribution estimation through a limited information approach, approximating the likelihood by the quantiles of the true distribution. [29, 30] provide sufficient conditions for posterior consistency in estimating an autoregressive conditional density and a transition density, rather than regression with respect to an exogenous covariate.

In this article, focusing on model (1.3), we first provide sufficient conditions on the prior and the true data-generating model under which the prior leads to weak and various types of strong posterior consistency. In this context, we define notions of weak and L1-integrated neighborhoods. We then show that the sufficient conditions are satisfied for gSB priors. The theory is illustrated through application to a model relying on probit transformations of Gaussian processes, an approach related to the probit stick-breaking process of [21] and [31]. We also consider Gaussian mixtures of fixed-π dependent processes [26, 32].

In a very recent unpublished article, [33] showed posterior consistency in conditional density estimation using kernel stick-breaking process mixtures of Gaussians. They approximated a conditional density by a smooth mixture of linear regressions as in [34] to demonstrate the KL property. In this paper, we show KL support using a more direct approach of approximating the true density by a kernel mixture of a compactly supported conditional measure.

The fundamental contribution of this article is formalizing the notion of support of the gSB process mixture of Gaussians on the space of conditional densities and formulating sufficient conditions to ensure that it leads to a consistent posterior. In doing so, a key technical contribution is the development of a novel method of constructing a sieve for the proposed class of priors. It has been noted by [35] that the usual method of constructing a sieve by controlling prior probabilities is unable to lead to a consistency theorem in the multivariate case. This is because of the explosion of the L1-metric entropy with increasing dimension. They developed a technique specific to the Dirichlet process in the multivariate case for showing weak and strong posterior consistency. The proposed sieve (see footnote 1) avoids the pitfall mentioned by [35] in showing consistency using multivariate mixtures. Our sieve construction has recently been used for studying convergence rates in multivariate density estimation [36, 37].

2. Notations

Throughout the paper, Lebesgue measure on ℜ or ℜp is denoted by λ and the set of natural numbers by ℕ. The supremum and L1 norms are denoted by ||·|| and ||·||1 respectively. The indicator function of a set B is denoted by 1B. Let Lp(ν, M) denote the space of real valued measurable functions defined on M with ν-integrable pth absolute power. For two density functions f, g, the Kullback-Leibler divergence is given by K(f, g) = ∫ log(f/g)f dλ. A ball of radius r with centre x0 relative to the metric d is denoted B(x0, r; d). The diameter of a bounded metric space M relative to a metric d is sup{d(x, y) : x, y ∈ M}. The ε-covering number N(ε, M, d) of a semi-metric space M relative to the semi-metric d is the minimal number of balls of radius ε needed to cover M. The logarithm of the covering number is referred to as the entropy. “≾” stands for inequality up to a constant multiple whose value is irrelevant to the given situation. δ0 stands for a distribution degenerate at 0 and supp(ν) for the support of a measure ν.

3. Conditional density estimation

In this section, we will define the space of conditional densities and construct a prior on this space. It is first necessary to generalize the topologies to allow appropriate neighborhoods to be constructed around an uncountable collection of conditional densities indexed by predictors. With such neighborhoods in place, we then state our main theorems providing sufficient conditions under which various modes of posterior consistency hold for a broad class of predictor-dependent mixtures of Gaussian kernels.

Let $\mathcal{Y} = \Re$ be the response space and $\mathcal{X}$ the covariate space, which is a compact subset of ℜp. Unless otherwise stated, we will assume $\mathcal{X} = [0, 1]^p$ without loss of generality. Let $\mathcal{F}$ denote the space of densities on $\mathcal{X} \times \mathcal{Y}$ w.r.t. the Lebesgue measure and $\mathcal{F}_d$ denote a subset of the space of conditional densities satisfying

$\mathcal{F}_d = \left\{g : \mathcal{X} \times \mathcal{Y} \to (0, \infty),\ \int_{\mathcal{Y}} g(x, y)\, dy = 1\ \forall x \in \mathcal{X},\ x \mapsto g(x, \cdot)\ \text{continuous as a function from}\ \mathcal{X}\ \text{to}\ L_1(\lambda, \mathcal{Y})\right\}.$

Suppose yi is observed independently given the covariates xi, i = 1, 2, …, which are drawn independently from a probability distribution Q on $\mathcal{X}$. Assume that Q admits a density q with respect to the Lebesgue measure.

If we define h(x, y) = q(x)f(y | x) and h0(x, y) = q(x)f0(y | x), then h, h0 ∈ $\mathcal{F}$. Throughout the paper, h0 is assumed to be a fixed density in $\mathcal{F}$, which we alternatively refer to as the true data generating density, and {f0(· | x), x ∈ $\mathcal{X}$} is referred to as the true conditional density. The density q(x) will be needed only for the theoretical investigation. In practice, we do not need to know it or learn it from the data.

We propose to induce a prior $\Pi_{\mathcal{X}}$ on the space of conditional densities through a prior $P_{\mathcal{X}}$ for a collection of mixing measures $\{G_x, x \in \mathcal{X}\}$ using the following predictor-dependent mixture of kernels

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_x(\psi), \qquad (3.1)$

where ψ = (μ, σ), and

$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{\{\mu_h(x), \sigma_h\}}, \quad (\mu_h, \sigma_h) \sim G_0, \qquad (3.2)$

where πh(x) ≥ 0 are random functions of x such that $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s. for each fixed $x \in \mathcal{X}$, and $\{\mu_h(x), x \in \mathcal{X}\}_{h=1}^{\infty}$ are i.i.d. realizations of a real valued stochastic process; that is, G0 is a probability distribution over $\Theta \times \Re^+$, where Θ is a subset of the space of functions from $\mathcal{X}$ to ℜ. Hence for each $x \in \mathcal{X}$, Gx is a random probability measure over the measurable Polish space (ℜ × ℜ+, $\mathcal{B}$(ℜ × ℜ+)). We are interested in the following two important special cases.

3.1. Predictor dependent countable mixtures of Gaussian linear regressions

We define the predictor dependent countable mixtures of Gaussian linear regressions (MGLRx) as

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma),$

and

$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{(\beta_h, \sigma_h)}, \quad (\beta_h, \sigma_h) \sim G_0, \qquad (3.3)$

where πh(x) ≥ 0 are random functions of x such that $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s. for each fixed $x \in \mathcal{X}$, and G0 = G0,β × G0,σ is a probability distribution on ℜp × ℜ+, where G0,β and G0,σ are probability distributions on ℜp and ℜ+ respectively. For a particular choice of the πh(x)’s, we obtain the probit stick-breaking mixtures of Gaussians which have been previously applied to real data applications by [21, 31, 38]. The latter two articles considered probit transformations of Gaussian processes in constructing the stick-breaking weights.

3.2. Gaussian mixtures of fixed-π dependent processes

In (3.1), set Gx as in (3.2) with πh(x) ≡ πh for all $x \in \mathcal{X}$, where πh ≥ 0 are random probability weights with $\sum_{h=1}^{\infty} \pi_h = 1$ a.s. and $\{\mu_h(x), x \in \mathcal{X}\}_{h=1}^{\infty}$ are as in (3.2). Examples include fixed-π dependent Dirichlet process mixtures of Gaussians [26]. Versions of the fixed-π DDP have been applied to ANOVA [32], survival analysis [39, 40], spatial modeling [41], and many more settings.

A Gaussian process is a common choice for constructing the stochastic processes πh(x) and μh(x). Recall that a Gaussian process {α(x) : x ∈ $\mathcal{X}$} is defined as a stochastic process for which any finite dimensional collection {α(x1), …, α(xk)}, k ≥ 1, has a joint Gaussian distribution. We denote by GP(μ, c) a Gaussian process with mean function μ : $\mathcal{X}$ → ℜ and covariance kernel c : $\mathcal{X}$ × $\mathcal{X}$ → ℜ.
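As a small illustration (our own sketch; the squared-exponential kernel, grid and parameter values are arbitrary choices), the defining finite dimensional property can be used directly to simulate a GP path: evaluate the covariance kernel on a grid and draw from the resulting multivariate Gaussian.

```python
import numpy as np

def se_kernel(x1, x2, tau2=1.0, A=5.0):
    """Squared-exponential kernel tau^2 * exp(-A * (x - x')^2), p = 1."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return tau2 * np.exp(-A * d2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)                 # fine grid on [0, 1]
K = se_kernel(x, x) + 1e-10 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)
alpha = L @ rng.standard_normal(len(x))        # one path of alpha ~ GP(0, c)
# Larger A (inverse bandwidth) gives wigglier, but still continuous, paths.
print(alpha[:5])
```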

4. Notions of posterior consistency for conditional densities

We recall the definition of posterior consistency, stated in terms of yn = (y1, …, yn) and xn = (x1, …, xn).

Definition 4.1

The posterior $\Pi_{\mathcal{X}}(\cdot \mid y^n, x^n)$ is consistent at {f0(· | x), x ∈ $\mathcal{X}$} with respect to a given topology if $\Pi_{\mathcal{X}}(U^c \mid y^n, x^n) \to 0$ a.s. for an arbitrary neighborhood U of {f0(· | x), x ∈ $\mathcal{X}$} in that topology.

Here a.s. consistency at {f0(· | x), x ∈ $\mathcal{X}$} means that the posterior distribution concentrates around a neighborhood of {f0(· | x), x ∈ $\mathcal{X}$} for almost every sequence $\{y_i, x_i\}_{i=1}^{\infty}$ generated by i.i.d. sampling from the joint density q(x)f0(y | x).

We define the weak and ν-integrated L1 neighborhoods of a collection of conditional densities {f0(· | x), x ∈ $\mathcal{X}$} as follows. A sub-base of a weak neighborhood is defined as

$W_{\varepsilon,g}(f_0) = \left\{f : f \in \mathcal{F}_d,\ \left|\int_{\mathcal{X} \times \mathcal{Y}} g\,h - \int_{\mathcal{X} \times \mathcal{Y}} g\,h_0\right| < \varepsilon\right\}, \qquad (4.1)$

for a bounded continuous function g : $\mathcal{X}$ × $\mathcal{Y}$ → ℜ. A weak neighborhood base is formed by finite intersections of neighborhoods of the type (4.1). Define a ν-integrated L1 neighborhood

$S_\varepsilon(f_0 : \nu) = \left\{f : f \in \mathcal{F}_d,\ \int_{\mathcal{X}} \|f(\cdot \mid x) - f_0(\cdot \mid x)\|_1\, \nu(x)\, dx < \varepsilon\right\} \qquad (4.2)$

for any measure ν with supp(ν) ⊂ $\mathcal{X}$. Observe that under the topology in (4.2), $\mathcal{F}_d$ can be identified with a closed subset of L1(λ × ν, $\mathcal{Y}$ × supp(ν)), making it a complete separable metric space. Thus measurability issues do not arise with these topologies.
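For intuition, the ν-integrated L1 distance in (4.2) can be approximated numerically. The sketch below is ours, with ν = q uniform on [0, 1] and two normal conditional densities as illustrative placeholders; it nests a quadrature over y inside a quadrature over x.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Two illustrative conditional densities on Y = R, X = [0, 1].
f  = lambda y, x: norm.pdf(y, loc=x,       scale=0.5)
f0 = lambda y, x: norm.pdf(y, loc=x + 0.2, scale=0.5)

def l1_slice(x):
    # ||f(.|x) - f0(.|x)||_1 by quadrature over y
    val, _ = quad(lambda y: abs(f(y, x) - f0(y, x)), -10.0, 10.0)
    return val

# q-integrated L1 distance with q = Uniform(0, 1): integrate the slice over x.
dist, _ = quad(l1_slice, 0.0, 1.0)
print(f"q-integrated L1 distance ~= {dist:.4f}")
```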

In the following, we define the Kullback-Leibler (KL) property of $\Pi_{\mathcal{X}}$ at a given f0 ∈ $\mathcal{F}_d$. Note that we define a KL-type neighborhood around the collection of conditional densities f0 by defining a KL neighborhood around the joint density h0, while keeping Q fixed at its true unknown value.

Definition 4.2

For any f0Inline graphic, such that h0(x, y) = q(x)f0(y | x) is the true joint data-generating density, we define an ε-sized KL neighborhood around f0 as

Kε(f0)={f:fFd,KL(h0,h)<ε,h(x,y)=q(x)f(yx)yY,xX},

where KL(h0, h) = ∫ h0 log(h0/h). Then $\Pi_{\mathcal{X}}$ is said to have the KL property at f0 ∈ $\mathcal{F}_d$, denoted f0 ∈ KL($\Pi_{\mathcal{X}}$), if $\Pi_{\mathcal{X}}\{K_\varepsilon(f_0)\} > 0$ for any ε > 0.

Another definition we require for showing the KL support is the notion of a weak neighborhood of a collection of mixing measures {Gx, x ∈ $\mathcal{X}$}, where Gx is a probability measure on S × ℜ+ for each x ∈ $\mathcal{X}$. Here S = ℜp or ℜ depending on the case considered above. We formulate the notion of a sub-base of the weak neighborhood of {Gx, x ∈ $\mathcal{X}$} below.

Definition 4.3

For a bounded continuous function g : S × ℜ+ × $\mathcal{X}$ → ℜ and ε > 0, a sub-base of the weak neighborhood of a conditional probability measure {Fx, x ∈ $\mathcal{X}$} is defined as

$\left\{\{G_x, x \in \mathcal{X}\} : \left|\int_{S \times \Re^+ \times \mathcal{X}} g(s, \sigma, x)\, dG_x(s, \sigma)\, q(x)\, dx - \int g(s, \sigma, x)\, dF_x(s, \sigma)\, q(x)\, dx\right| < \varepsilon\right\}. \qquad (4.3)$

A conditional probability measure {Gx, x ∈ $\mathcal{X}$} lies in the weak support of $P_{\mathcal{X}}$ if $P_{\mathcal{X}}$ assigns positive probability to every basic neighborhood generated by sub-bases of the type (4.3). In the sequel, we will also consider neighborhoods of the form

$\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{S \times \Re^+} \{g(s, \sigma)\, dG_x(s, \sigma) - g(s, \sigma)\, dF_x(s, \sigma)\}\right| < \varepsilon\right\} \qquad (4.4)$

for a bounded continuous function g : S × ℜ+ → ℜ.

5. Posterior consistency in MGLRx mixture of Gaussians

5.1. Kullback-Leibler property

We will work with a specific choice of $P_{\mathcal{X}}$ motivated by the probit stick-breaking process construction in [21]. Let

$\pi_h(x) = \Phi\{\alpha_h(x)\} \prod_{l<h} \left[1 - \Phi\{\alpha_l(x)\}\right], \qquad (5.1)$

where αh ~ GP(0, ch), for h = 1, 2, …, ∞. Assume the following conditions hold.

  • S1

    ch is chosen so that αh ~ GP(0, ch) has continuous path realizations.

  • S2
    Under the GP(0, ch) prior for αh, for any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\alpha_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

  • S3

    G0 is absolutely continuous with respect to the Lebesgue measure λ on ℜp × ℜ+.

Consider the subset $\tilde{\mathcal{F}}_d \subset \mathcal{F}_d$ satisfying the following conditions.

  • A1

    f is nowhere zero and bounded by M < ∞.

  • A2

    $\left|\int_{\mathcal{X}} \int_{\mathcal{Y}} f(y \mid x) \log f(y \mid x)\, dy\, q(x)\, dx\right| < \infty.$

  • A3

    $\int_{\mathcal{X}} \int_{\mathcal{Y}} f(y \mid x) \log \frac{f(y \mid x)}{\psi_x(y)}\, dy\, q(x)\, dx < \infty,$

    where $\psi_x(y) = \inf_{t \in [y-1, y+1]} f(t \mid x)$.

  • A4

    There exists η > 0 such that $\int_{\mathcal{X}} \int_{\mathcal{Y}} |y|^{2(1+\eta)} f(y \mid x)\, dy\, q(x)\, dx < \infty$.

  • A5

    (x, y) ↦ f(y | x) is jointly continuous.

Remark 5.1

A1 is usually satisfied by common densities arising in practice. A4 imposes a minor tail restriction; e.g., a mean regression model with continuous mean function and a heavy-tailed t residual density with 4 degrees of freedom satisfies A4. Conditions A2 and A3 are more subtle, but are also mild. A flexible class of models which satisfies A1–A5 is as follows. Let yi = μ(xi) + εi, with μ : $\mathcal{X}$ → ℜ continuous and εi ~ fxi, where $f_x(\varepsilon) = \sum_{h=1}^{H} \pi_h(x)\, \psi(\varepsilon; \mu_h, \sigma_h^2)$ for some H ≥ 1, $\sum_{h=1}^{H} \pi_h(x) = 1$, πh : $\mathcal{X}$ → [0, 1] continuous, and ψ Gaussian or t with more than 2 degrees of freedom.

Remark 5.2

S2 is satisfied if $c_h(x, x') = e^{-A_h \|x - x'\|^2}$ and the prior for Ah has full support on ℜ+.
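A small simulation (our sketch; the truncation level, grid, and the gamma draws for the Ah are arbitrary illustrative choices) of the construction (5.1): draw i.i.d. GP paths αh with a squared-exponential kernel and map them through Φ into stick-breaking weights, whose sum across components approaches one at every x as the truncation grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
H = 50                                       # truncation level (illustrative)

def gp_draw(A, tau2=1.0):
    # one path of a GP(0, tau^2 exp(-A (x-x')^2)) on the grid
    d2 = (x[:, None] - x[None, :]) ** 2
    K = tau2 * np.exp(-A * d2) + 1e-10 * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

A_h = rng.gamma(2.0, 2.0, size=H)            # hypothetical inverse-bandwidth draws
V = norm.cdf(np.array([gp_draw(a) for a in A_h]))  # V_h(x) = Phi{alpha_h(x)}
# pi_h(x) = V_h(x) * prod_{l<h} (1 - V_l(x)), as in (5.1)
pi = V * np.cumprod(np.vstack([np.ones(len(x)), 1.0 - V[:-1]]), axis=0)
print("min/max of sum_h pi_h(x):", pi.sum(axis=0).min(), pi.sum(axis=0).max())
```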

The following theorem characterizes a subset of $\mathcal{F}_d$ on which $\Pi_{\mathcal{X}}$ has the KL property. The proof of Theorem 5.3 is provided in Appendix C.

Theorem 5.3

f0KL( Inline graphic) for each f0 in Fd if Inline graphic satisfies S1–S3.

Remark 5.4

The conditions are satisfied for a class of gSB process mixtures in which the stick-breaking lengths are constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function.

To prove Theorem 5.3, we need several auxiliary results related to the support of the prior $P_{\mathcal{X}}$ which might be of independent interest. The key idea for showing that the true f0 satisfies $\Pi_{\mathcal{X}}\{K_\varepsilon(f_0)\} > 0$ for any ε > 0 is to impose certain tail conditions on f0(y | x) and approximate it by $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) d\tilde{G}_x(\beta, \sigma)$, where $\{\tilde{G}_x, x \in \mathcal{X}\}$ is compactly supported. Observe that

$KL(h_0, h) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx + \int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx. \qquad (5.2)$

We construct such an $\tilde{f}$ in Theorem 5.3 which makes the first term on the right hand side of (5.2) sufficiently small. The following lemma (which is similar to Lemma 3.1 in [12] and Theorem 3 in [11]) guarantees that the second term on the right hand side of (5.2) is also sufficiently small if {Gx, x ∈ $\mathcal{X}$} lies inside a finite intersection of neighborhoods of $\{\tilde{G}_x, x \in \mathcal{X}\}$ of the type (4.4).

Lemma 5.5

Assume that f0 ∈ $\mathcal{F}_d$ satisfies $\int_{\mathcal{X}} \int_{\mathcal{Y}} y^2 f_0(y \mid x)\, dy\, q(x)\, dx < \infty$. Suppose $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) d\tilde{G}_x(\beta, \sigma)$, where there exist a > 0 and 0 < σ̲ < σ̄ such that

$\tilde{G}_x\left([-a, a]^p \times (\underline{\sigma}, \bar{\sigma})\right) = 1\ \forall x \in \mathcal{X}, \qquad (5.3)$

so that $\tilde{G}_x$ has compact support for each x ∈ $\mathcal{X}$. Then given any ε > 0, there exists a finite intersection W of neighborhoods of $\{\tilde{G}_x, x \in \mathcal{X}\}$ of the type (4.4) such that for any conditional density $f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma)$, x ∈ $\mathcal{X}$, with {Gx, x ∈ $\mathcal{X}$} ∈ W,

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx < \varepsilon. \qquad (5.4)$

The proof is similar to Theorem 3 in [11] and is omitted here. In order to ensure that the weak support of $P_{\mathcal{X}}$ is sufficiently large to contain all densities satisfying the assumptions of Lemma 5.5, we define a collection, denoted $\mathcal{G}_{\mathcal{X}}$, of fixed conditional probability measures on (ℜp × ℜ+, $\mathcal{B}$(ℜp × ℜ+)) satisfying

  1. x ↦ Fx(B) is a continuous function of x ∈ $\mathcal{X}$, ∀ B ∈ $\mathcal{B}$(ℜp × ℜ+).

  2. For any sequence of sets An ⊂ ℜp × ℜ+ with An ↓ ∅, $\sup_{x \in \mathcal{X}} F_x(A_n) \downarrow 0$.

Next we state the theorem characterizing the weak support of $P_{\mathcal{X}}$, which will be proved in Appendix B.

Theorem 5.6

If $P_{\mathcal{X}}$ satisfies S1–S3, then any $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$ lies in the weak support of $P_{\mathcal{X}}$.

Corollary 5.7

Assume S1–S3 hold and assume $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$ is compactly supported, i.e., there exist a, σ̲, σ̄ > 0 such that Fx([−a, a]p × [σ̲, σ̄]) = 1 for all x ∈ $\mathcal{X}$. Then for a bounded uniformly continuous function g : ℜp × ℜ+ → [0, 1] satisfying g(β, σ) → 0 as ||β|| → ∞, σ → ∞,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re^p \times \Re^+} \{g(\beta, \sigma)\, dG_x(\beta, \sigma) - g(\beta, \sigma)\, dF_x(\beta, \sigma)\}\right| < \varepsilon\right\} > 0. \qquad (5.5)$
Proof

The proof is similar to Theorem 5.6 with the L1 convergence in (B.1) replaced by convergence uniform in x. This is because under the assumptions of Corollary 5.7, the uniformly continuous sequence of functions $\sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}) F_x(A_{k,n})$ on $\mathcal{X}$ monotonically decreases to $\int_C g(\beta, \sigma)\, dF_x(\beta, \sigma)$ as n → ∞, where C is given by [−a, a]p × [σ̲, σ̄].

The proof of the following corollary is along the lines of the proof of Theorem 5.6 and is omitted here.

Corollary 5.8

Under the assumptions of Corollary 5.7 for any k0 ≥ 1,

$P_{\mathcal{X}}\left\{\bigcap_{j=1}^{k_0} U_j\right\} > 0, \qquad (5.6)$

where Uj’s are neighborhoods of the type (5.5).

5.2. Strong Consistency with the q-integrated L1 neighborhood

To obtain strong consistency in the q-integrated L1 topology, we need the following straightforward extension of Theorem 2 of [11].

Theorem 5.9

Suppose f0KL( Inline graphic) and there exist subsets Inline graphicInline graphic with

  1. log N(ε, $\mathcal{F}_n$, ||·||1) = o(n),

  2. $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le c_2 e^{-n\beta_2}$ for some c2, β2 > 0;

then the posterior is strongly consistent with respect to the q-integrated L1 topology.

Before stating the main theorem on strong consistency, we consider a hierarchical extension of MGLRx in which the bandwidths are taken to be random. We define a sequence of random inverse-bandwidths Ah for the Gaussian processes αh, h ≥ 1, each having ℜ+ as its support. Since the first few atoms suffice to explain most of the dependence of y on x, we expect the variability due to the covariate in the stochastic process Φ{αh} to decrease as h increases. This is achieved through a carefully chosen prior for the covariance kernel ch of the Gaussian process αh.

Let α0 denote the base Gaussian process on [0, 1]p with covariance kernel $c_0(x, x') = \tau^2 e^{-\|x - x'\|^2}$. Then $\alpha_h(x) = \alpha_0(A_h^{1/2} x)$ for each x ∈ $\mathcal{X}$. The variability of αh with respect to the covariate is shrunk or stretched over the rectangle $[0, A_h^{1/2}]^p$ as Ah decreases or increases. The Ah’s are constructed to be stochastically decreasing to δ0 in the following manner. We assume that there exist η, η0 > 0 and a sequence $\delta_n = O((\log n)^2 / n^{5/2})$ such that $P(A_h > \delta_n) \le \exp\{-n^{-\eta_0} h^{(\eta_0+2)} \log h\}$ for each h ≥ 1. Also assume that there exists a sequence rn ↑ ∞ such that $r_n^p n^{\eta} (\log n)^{p+1} = o(n)$ and $P(A_h > r_n) \le e^{-n}$. We discuss how to construct such a sequence of random variables in Remark 5.12 following Theorem 5.10.

The following theorem provides sufficient conditions for strong posterior consistency in the q-integrated L1 topology. The proof is provided in Appendix D.

Theorem 5.10

Let the πh’s satisfy (5.1) with αh ~ GP(0, ch), where $c_h(x, x') = \tau^2 e^{-A_h \|x - x'\|^2}$, h ≥ 1, τ2 > 0 fixed. Suppose the following hold.

  • C1

    There exist sequences an, hn ↑ ∞, ln ↓ 0 with $a_n/l_n = O(n)$ and $h_n/l_n = O(e^n)$, and constants d1, d2 > 0, such that $G_0\{B(0; a_n) \times [l_n, h_n]\}^c < d_1 e^{-d_2 n}$.

  • C2

    The Ah’s are constructed as in the paragraph preceding this theorem.

then f0KL( Inline graphic) implies that Inline graphic achieves strong posterior consistency in q-integrated L1 topology at f0.

Remark 5.11

Verification of condition C1 of Theorem 5.10 is particularly simple. For example, if G0 is a product of a multivariate normal prior on β and an inverse-Gamma prior on σ2, condition C1 is satisfied with $a_n = O(\sqrt{n})$, $h_n = e^n$, $l_n = O(1/\sqrt{n})$. It follows from [42] that f0 ∈ KL($\Pi_{\mathcal{X}}$) is still satisfied when we have the additional assumptions C1–C2 together with S1–S3 on the prior $P_{\mathcal{X}}$.

Remark 5.12

Since we need $r_n^p n^{\eta} (\log n)^{p+1} = o(n)$, $r_n^p$ can be chosen to be $O(n^{\eta_1})$ for some 0 < η1 < 1. Let d be such that $d\eta_1/p \ge 1$ and set η0 = 3d. Let Ah = chBh, where $B_h^d \sim \mathrm{Exp}(\lambda)$ and $c_h = (h^{(3d+2)/\eta} \log h)^{-1/d}$ for any 0 < η < 1. Then $P(A_h > n^{\eta_1/p}) \le P(B_h > n^{\eta_1/p}) \le e^{-\lambda n^{d\eta_1/p}} \le e^{-n}$ for all large n, and $P(A_h > (\log n)^2 n^{-5/2}) \le \exp\{-n^{-3d} h^{(3d+2)/\eta} \log h\}$.
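A sketch of this construction follows (ours; the values of λ, d and η are arbitrary within the remark's constraints, and h starts at 2 so that log h > 0). Since $B_h^d \sim \mathrm{Exp}(\lambda)$, one can draw Bh as an exponential variable raised to the power 1/d, then shrink by the deterministic sequence ch, making the Ah stochastically decreasing toward δ0.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, d, eta = 1.0, 2.0, 0.5            # illustrative; Remark 5.12 needs 0 < eta < 1
h = np.arange(2, 200)                   # start at h = 2 so that log h > 0
c_h = (h ** ((3 * d + 2) / eta) * np.log(h)) ** (-1.0 / d)
B_h = rng.exponential(1.0 / lam, size=len(h)) ** (1.0 / d)  # B_h^d ~ Exp(lam)
A_h = c_h * B_h                         # stochastically decreasing to delta_0
print(A_h[:3], A_h[-3:])                # later A_h are (stochastically) tiny
```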

Remark 5.13

The theory of strong posterior consistency can be generalized to an arbitrary monotone differentiable link function L : ℜ → [0, 1] which is Lipschitz, i.e., for which there exists a constant K > 0 such that |L(x) − L(x′)| ≤ K|x − x′| for all x, x′ ∈ ℜ. Also, as long as the πh(x)’s satisfy the hypothesis of Lemma Appendix A.1 and possess the required tail probability bound in Lemma 5.15, general predictor-dependent mixing weights can be used.

Below we develop several auxiliary results required to prove Theorem 5.10. They are stated separately as some of them might be of independent interest. Let $\phi_{\beta,\sigma}(x, y) := \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right)$ for y ∈ $\mathcal{Y}$ and x ∈ $\mathcal{X}$. From [12], we obtain for σ2 > σ1 > σ2/2 and for each x ∈ $\mathcal{X}$,

$\int_{\mathcal{Y}} \left|\phi_{\beta_1,\sigma_1}(x, y) - \phi_{\beta_2,\sigma_2}(x, y)\right| dy \le (2\pi)^{-1/2} \frac{\|\beta_2 - \beta_1\|\sqrt{p}}{\sigma_2} + \frac{3(\sigma_2 - \sigma_1)}{\sigma_1}.$

Construct a sieve for (β, σ) as

$\Theta_{a,h,l} = \left\{\phi_{\beta,\sigma} : \|\beta\| \le a,\ l \le \sigma \le h\right\}. \qquad (5.7)$

In the following lemma, we provide an upper bound on $N(\Theta_{a,h,l}, \varepsilon, d_{SS})$. The proof is omitted as it follows trivially from Lemma 4.1 in [12].

Lemma 5.14

There exist constants d1, d2 > 0 such that $N(\Theta_{a,h,l}, \varepsilon, d_{SS}) \le d_1 \left(\frac{a}{l}\right)^p + d_2 \log \frac{h}{l} + 1$.

In the proof of Theorem 5.10, we will verify the sufficient conditions of Theorem 5.9. We calibrate $\Pi_{\mathcal{X}}$ by a carefully chosen sequence of subsets $\mathcal{F}_n \subset \mathcal{F}_d$. The fundamental problem with mixture models ∫ N(y; μ, σ2Ip)dP(μ) in estimating a multivariate density lies in attempting to compactify the model space by $\{\int N(y; \mu, \sigma^2 I_p)\, dP(\mu) : P((-a_n, a_n]^p) > 1 - \delta\}$ for each σ, leading to an entropy $a_n^p$ growing exponentially with the dimension p. Here we marginalize P in ∫ N(y; μ, σ2Ip)dP(μ) to yield the construction $\{\sum_{h=1}^{m_n} \pi_h N(y; \mu_h, \sigma^2 I_p) : \|\mu_h\| \le a_n,\ h = 1, \ldots, m_n,\ \sum_{h=m_n+1}^{\infty} \pi_h < \varepsilon\}$, leading to an entropy $m_n \log a_n$, where mn is related to the tail decay of $P(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon)$. With this idea in place, we extend the construction of $\mathcal{F}_n$ to conditional densities below.

Before constructing the sieve, we briefly review alternative definitions [43] of a Gaussian process as a Banach space valued element. A Borel measurable random element W with values in a separable Banach space ($\mathbb{B}$, ||·||) is called Gaussian if the random variable b*W is normally distributed for any element b* ∈ $\mathbb{B}^*$, the dual space of $\mathbb{B}$. Recall that in general, the reproducing kernel Hilbert space (RKHS) ℍ attached to a zero-mean Gaussian process W is defined as the collection of all E[WH] for H ranging over the closed linear span of the variables b*W in L2(ν, M), with inner product

$\langle E[W(\cdot)H_1],\ E[W(\cdot)H_2]\rangle_{\mathbb{H}} = E[H_1 H_2]. \qquad (5.8)$

The RKHS can be viewed as a subset of $\mathbb{B}$, and the RKHS norm ||·||ℍ is stronger than the Banach space norm ||·||.

In particular, if W is a Borel measurable zero-mean Gaussian random element in a complete separable subspace of ℓ∞(T), the Banach space of uniformly bounded functions g : T → ℜ equipped with the uniform norm ||g||∞ = sup{|g(t)| : t ∈ T}, then the RKHS is the completion of the linear space of functions t ↦ E[W(t)H] relative to the inner product (5.8), where H, H1 and H2 are finite linear combinations of the form $\sum_i a_i W(s_i)$ with ai ∈ ℜ and si in the index set of W. See Theorem 2.1 of [43] for details.

Next we turn to constructing the sieve. Assume ε > 0 is given. Let $\mathbb{H}_1^a$ denote the unit ball of the RKHS of the covariance kernel $\tau^2 e^{-a\|x - x'\|^2}$ and let $\mathbb{B}_1$ denote the unit ball of ℂ[0, 1]p. For numbers M, m, r, δ, construct a sequence of subsets {Bh, h = 1, …, m} of ℂ[0, 1]p as follows.

$B_h = \begin{cases} \left(M\sqrt{r/\delta}\,\mathbb{H}_1^r + \frac{\varepsilon}{m^2}\mathbb{B}_1\right) \cup \left(\bigcup_{a < \delta} M\,\mathbb{H}_1^a + \frac{\varepsilon}{m^2}\mathbb{B}_1\right), & \text{if } h = 1, \ldots, m^\eta, \\ \bigcup_{a < \delta} M\,\mathbb{H}_1^a + \frac{\varepsilon}{m^2}\mathbb{B}_1, & \text{if } h = m^\eta + 1, \ldots, m. \end{cases}$

The idea is to construct

$\mathcal{F}_n = \left\{f : f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x) \frac{1}{\sigma_h}\phi\left(\frac{y - x'\beta_h}{\sigma_h}\right),\ \{\phi_{\beta_h,\sigma_h}\}_{h=1}^{m_n} \subset \Theta_{a_n,h_n,l_n},\ \alpha_h \in B_{h,n},\ h = 1, \ldots, m_n,\ \sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) \le \varepsilon\right\} \qquad (5.9)$

for appropriate sequences an, ln, hn, Mn, mn, rn, δn to be chosen in the proof of Theorem 5.10.

The following lemma is also crucial to the proof of Theorem 5.10; it allows us to calculate the rate of decay of $P(\sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) > \varepsilon)$ with mn.

Lemma 5.15

Let the πh’s satisfy (5.1) with αh ~ GP(0, ch), where $c_h(x, x') = \tau^2 e^{-A_h \|x - x'\|^2}$, h ≥ 1, τ2 > 0 fixed. Then for some constant C7 > 0,

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon\right) \le e^{-C_7 m_n \log m_n} + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n). \qquad (5.10)$
Proof

Let $W_h = -\log[1 - \Phi\{\alpha_h^*\}]$, where $\alpha_h^* = \inf_{x \in \mathcal{X}} \alpha_h(x)$, and let Zh ~ Ga(1, γ0); we will choose an appropriate value for γ0 in the sequel. Let t0 = −log ε > 0. Observe that

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon\right) \le \Pi_{\mathcal{X}}\left(-\sum_{h=m_n^\eta+1}^{m_n} \log\{1 - \Phi(\alpha_h^*)\} < t_0\right).$

If the αh* were i.i.d. standard normal random variables, we would have $\Pi_{\mathcal{X}}(-\sum_{h=1}^{m_n} \log\{1 - \Phi(\alpha_h^*)\} < t_0) = \Pi_{\mathcal{X}}(\Lambda < t_0)$ with Λ ~ Ga(mn, 1), and it is easy to show that $\Pi_{\mathcal{X}}(\Lambda < t_0) \precsim e^{-m_n \log m_n}$. However, the calculation gets complicated when the αh’s are i.i.d. realizations of a zero mean Gaussian process. The proof relies on the fact that the supremum of a Gaussian process has sub-Gaussian tails.

Below we calculate the rate of decay of $\Pi_{\mathcal{X}}(\sum_{h=m_n+1}^{\infty} \pi_h > \varepsilon)$ with mn. We will show that there exists γ0, depending on ε and τ but not on n, such that

$\Pi_{\mathcal{X}}\left(\sum_{h=m_n^\eta+1}^{m_n} W_h \le t_0\right) \le \xi(\delta_n)^{m_n - m_n^\eta}\, \Pi_{\mathcal{X}}\left(\sum_{h=m_n^\eta+1}^{m_n} Z_h < t_0\right) + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n), \qquad (5.11)$

where $\xi(x) = C_5 x^{p/2}$ for x > 0 and some constant C5 > 0. Observe that $\Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0) \le \Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0,\ A_h \le \delta_n,\ h = m_n^\eta+1, \ldots, m_n) + \sum_{h=m_n^\eta+1}^{m_n} P(A_h > \delta_n)$.

Since $\Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} W_h < t_0) = \Pi_{\mathcal{X}}(\sum_{h=m_n^\eta+1}^{m_n} (\tau'/\tau) W_h < \tau' t_0/\tau)$ for some τ′ < 1, we can re-parameterize t0 as τ′t0/τ and τ as τ′. Hence without loss of generality we assume τ < 1.

Define g : [0, t0] → ℜ, $t \mapsto -\Phi^{-1}(1 - e^{-t})$; g is a continuous function on (0, t0]. Assume α0 ~ GP(0, c0), where $c_0(x, x') = \tau^2 e^{-\|x - x'\|^2}$. For $h = m_n^\eta+1, \ldots, m_n$,

$P\left(\inf_{x \in \mathcal{X}} \alpha_h(x) \le -\lambda,\ A_h \le \delta_n\right) \le P\left(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda\right).$

Below we estimate $P(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda)$ for large enough λ, following Theorem 5.2 of [44]. However, extra care is required to identify the role of δn. Since $N(\varepsilon, \sqrt{\delta_n}\mathcal{X}, \|\cdot\|) \le C_1(\sqrt{\delta_n}/\varepsilon)^p$,

$\int_0^{\varepsilon} \{\log N(u, \sqrt{\delta_n}\mathcal{X}, \|\cdot\|)\}^{1/2}\, du \le C_2 \varepsilon \{1 + \log(1/\varepsilon)\}$

for some constant C2 > 0. Hence

$P\left(\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge \lambda\right) \le C_3(\sqrt{\delta_n}\lambda)^p \exp\left[-\frac{1}{2\tau^2}\left\{\lambda - \frac{C_2}{\lambda}(1 + \log \lambda)\right\}^2\right] \le C_3 \delta_n^{p/2} \lambda^{p+2}\{1 - \Phi(\lambda/\tau^2)\} \le C_4 \delta_n^{p/2}\{1 - \Phi(\lambda)\}

for constants C3, C4 > 0, where the last inequality holds for all large λ because τ < 1. Hence there exists t1 ∈ (0, t0), sufficiently small and independent of n, such that for all t ∈ (0, t1), $\Pi_{\mathcal{X}}\{\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge g(t)\} \le C_4 \delta_n^{p/2}\Phi\{-g(t)\}$. Observe that

$\Pi_{\mathcal{X}}\left\{\sup_{x \in \sqrt{\delta_n}\mathcal{X}} \alpha_0(x) \ge g(t)\right\} \le C_4 \delta_n^{p/2}\Phi\{-g(t)\} < C_5 \delta_n^{p/2}(1 - e^{-\gamma_0 t}),$

for any γ0 > 1. Further choose γ0 large enough such that $2(1 - e^{-\gamma_0 t}) > 1$ for all t ∈ [t1, t0]. Hence $P(W_h \le t,\ A_h \le \delta_n) \le \xi(\delta_n) P(Z_h < t)$ for all t ∈ (0, t0], where $\xi(\delta_n) = C_5 \delta_n^{p/2}$ with C5 = max{2, C4}. Applying Lemma Appendix E.1, stated in Appendix E, we conclude (5.11) by induction. As $\sum_{h=1}^{m_n} Z_h \sim \mathrm{Ga}(m_n, \gamma_0)$, $\Pi_{\mathcal{X}}(\sum_{h=1}^{m_n} Z_h < t_0) \le e^{-C_6 m_n \log m_n}$ for some constant C6 > 0. Since $\xi(\delta_n)^{m_n - m_n^\eta}\, \Pi_{\mathcal{X}}(\sum_{h=1}^{m_n} Z_h < t_0) \le e^{-C_7 m_n \log m_n}$ for some constant C7 > 0, the result follows immediately.

5.3. Prior specification and posterior computation

To illustrate the applicability of the proposed methods, we mention the prior choices and key steps for posterior computation for the MGLRx model. Recall that

$f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_x(\beta, \sigma), \qquad (5.12)$
$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{(\beta_h, \sigma_h^2)}, \quad (\beta_h, \sigma_h^{-2}) \sim N(\beta_0, \Sigma_0) \times \mathrm{Ga}(a_\sigma, b_\sigma), \qquad (5.13)$

where $\pi_h(x) = \Phi\{\alpha_h(x)\}\prod_{l<h}[1 - \Phi\{\alpha_l(x)\}]$. We assume αh ~ GP(0, ch), where $c_h(x, x') = \frac{1}{\tau_\alpha} e^{-A_h\|x - x'\|^2}$, τα ~ Ga(να/2, να/2). See Remark 5.12 for constructing a prior for the Ah’s. If the yi’s are standardized, we would expect the total variance $\sum_{h=1}^{\infty} \pi_h \sigma_h^2$ to be around 1. Hence choose aσ = 1, bσ = 10 so that $E(\sigma_h^{-2}) = 0.1$. We can resort to an MCMC algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of [45] and Metropolis-Hastings sampling, to sample from the posterior of (5.12). [45] proposed the exact block Gibbs sampler as an efficient approach to posterior computation in infinite-dimensional Dirichlet process mixture models, modifying the block Gibbs sampler of [46] to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler [47] and the slice sampler [4, 48]. Introduce γ1, …, γn such that πh(xi) = P(γi = h), h = 1, 2, …, ∞. Then

$\gamma_i \sim \sum_{h=1}^{\infty} \pi_h(x_i)\, \delta_h = \sum_{h=1}^{\infty} 1(u_i < \pi_h(x_i))\, \delta_h$ (marginally over ui),

where ui ~ U(0, 1).

We continue up to $h^* = \max\{h_1^*, \ldots, h_n^*\}$, where $h_i^*$ is the minimum integer satisfying $\sum_{l=1}^{h_i^*} \pi_l(x_i) > 1 - \min\{u_1, \ldots, u_n\}$, i = 1, …, n. The Markov chain adaptively estimates the desired number of components h* at each iteration of the MCMC, making it more efficient than a finite mixture model with a pre-specified large number of components. Here we describe the key steps of the posterior computation; a code sketch of the truncation and allocation logic follows the list below.

  1. Update ui’s and stick breaking random variables: Generate
    $u_i \mid - \sim U(0, \pi_{\gamma_i}(x_i)),$
    where $\pi_h(x_i) = \Phi\{\alpha_h(x_i)\}\prod_{l<h}[1 - \Phi\{\alpha_l(x_i)\}]$. For i = 1, …, n, introduce latent variables Zh(xi), h = 1, 2, …, such that Zh(xi) ~ N(αh(xi), 1). Thus πh(xi) = P(Zh(xi) > 0, Zl(xi) < 0 for l < h). Then
    $Z_h(x_i) \mid - \sim \begin{cases} N(\alpha_h(x_i), 1)\, \mathbb{I}_{(0,\infty)}, & h = \gamma_i, \\ N(\alpha_h(x_i), 1)\, \mathbb{I}_{(-\infty,0)}, & h < \gamma_i. \end{cases}$
    Let Zh = (Zh(x1), …, Zh(xn))′ and αh = (αh(x1), …, αh(xn))′. Letting $(\Sigma_h)_{ij} = e^{-A_h\|x_i - x_j\|^2}$, we have Zh ~ N(αh, I) and $\alpha_h \sim N(0, \frac{1}{\tau_\alpha}\Sigma_h)$, so
    $\alpha_h \mid - \sim N\left((\tau_\alpha \Sigma_h^{-1} + I_n)^{-1} Z_h,\ (\tau_\alpha \Sigma_h^{-1} + I_n)^{-1}\right).$
    Continue up to $h^* = \max\{h_1^*, \ldots, h_n^*\}$, where $h_i^*$ is the minimum integer satisfying $\sum_{l=1}^{h_i^*} \pi_l(x_i) > 1 - \min\{u_1, \ldots, u_n\}$, i = 1, …, n. Now
    $\tau_\alpha \mid - \sim \mathrm{Ga}\left(\frac{1}{2}(n h^* + \nu_\alpha),\ \frac{1}{2}\left(\sum_{l=1}^{h^*} \alpha_l' \Sigma_l^{-1} \alpha_l + \nu_\alpha\right)\right),$

    while κα is updated using a Metropolis-Hastings step.

  2. Update allocation to atoms: Update (γ1, …, γn)|– as multinomial random variables with probabilities
    $P(\gamma_i = h) \propto N(y_i; x_i'\beta_h, \tau_h^{-1})\, \mathbb{I}(u_i < \pi_h(x_i)),\ h = 1, \ldots, h^*.$
  3. Update component-specific locations and precisions: Let nh = #{i : γi = h}, h = 1, 2, …, h*. Let Yh = (yi : γi = h) be the nh-dimensional response vector and Xh the corresponding nh × p covariate matrix. Then
    $\beta_h \mid - \sim N\left((X_h'X_h + \Sigma_0^{-1})^{-1}(X_h'Y_h + \Sigma_0^{-1}\beta_0),\ (X_h'X_h + \Sigma_0^{-1})^{-1}\right),$
    $\tau_h \mid - \sim \mathrm{Ga}\left(\frac{n_h}{2} + a_\sigma,\ b_\sigma + \frac{1}{2}\sum_{i: \gamma_i = h}(y_i - x_i'\beta_h)^2\right),\ h = 1, 2, \ldots, h^*.$

    Update the Ah’s in a Metropolis-Hastings step.

At each iteration of the MCMC, we obtain samples from the full conditional distributions of the parameters, which, after discarding a burn-in, can be used to compute summary statistics of the posterior distribution of the parameters or of functionals of interest.
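To fix ideas, here is a minimal sketch (our own simplification, not the authors' code) of the adaptive-truncation logic in steps 1–2: draw slice variables, find h*, and reallocate observations. It assumes the weights πh(x) are available as a function, one covariate (p = 1), and atom arrays already long enough; in the full sampler, atoms beyond the current h* would be drawn from the prior as needed, followed by the GP and regression updates of steps 1 and 3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def slice_allocation_step(pi_fn, x, y, gamma, beta, tau, max_h=500):
    """One slice-sampling allocation sweep (steps 1 and 2 above), simplified.

    pi_fn(h, xi) returns the current weight pi_h(xi), 0-indexed in h;
    beta, tau hold current atom draws of length >= the truncation reached."""
    n = len(y)
    # Step 1: slice variables u_i ~ U(0, pi_{gamma_i}(x_i)).
    u = np.array([rng.uniform(0.0, pi_fn(gamma[i], x[i])) for i in range(n)])
    # Adaptive truncation h*: smallest H with sum_{l<=H} pi_l(x_i) > 1 - min(u).
    h_star, u_min = 0, u.min()
    for i in range(n):
        tot, h = 0.0, 0
        while tot <= 1.0 - u_min and h < max_h:
            tot += pi_fn(h, x[i])
            h += 1
        h_star = max(h_star, h)
    # Step 2: multinomial reallocation restricted to {h : u_i < pi_h(x_i)}.
    for i in range(n):
        logp = np.array([norm.logpdf(y[i], x[i] * beta[h], tau[h] ** -0.5)
                         if u[i] < pi_fn(h, x[i]) else -np.inf
                         for h in range(h_star)])
        w = np.exp(logp - logp.max())
        gamma[i] = rng.choice(h_star, p=w / w.sum())
    return gamma, h_star
```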

6. Posterior consistency in Gaussian mixture of fixed-π dependent processes

6.1. Kullback-Leibler property

The following theorem verifies that $\Pi_{\mathcal{X}}$ has the KL property at f0 ∈ $\tilde{\mathcal{F}}_d$. The proof of Theorem 6.1 is similar to that of Theorem 5.3 and can be found in Appendix F.

Theorem 6.1

f0 ∈ KL($\Pi_{\mathcal{X}}$) for each f0 in $\tilde{\mathcal{F}}_d$ if $P_{\mathcal{X}}$ satisfies

  • T1

    G0 is specified by μh ~ GP(μ, c), σh ~ G0,σ, where c is chosen so that GP(0, c) has continuous path realizations and G0,σ is absolutely continuous w.r.t. the Lebesgue measure on ℜ+.

  • T2

    For every k ≥ 2, (π1, …, πk) is absolutely continuous w.r.t. the Lebesgue measure on Sk−1.

  • T3
    For any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

6.2. Strong consistency with the q-integrated L1 neighborhood

Next we summarize the consistency theorem with respect to the q-integrated L1 topology. The proof of Theorem 6.2 is also similar to that of Theorem 5.10 and is provided in Appendix G.

Theorem 6.2

Let $\mu_h(x) = x'\beta_h + \eta_h(x)$, βh ~ Gβ and ηh ~ GP(0, c), h = 1, …, ∞, where $c(x, x') = \tau^2 e^{-A\|x - x'\|^2}$ and $A^{p(1+\eta_2)/\eta_2} \sim \mathrm{Ga}(a, b)$ for some η2 > 0. Suppose the following hold.

  • F1

    There exist sequences an, hn ↑ ∞, ln ↓ 0 with $a_n/l_n = O(n)$ and $h_n/l_n = O(e^n)$, and constants d1, d2, d3, d4 > 0, such that $G_\beta\{B(0; a_n)\}^c < d_1 e^{-d_2 n}$ and $G_{0,\sigma}\{[l_n, h_n]\}^c \le d_3 e^{-d_4 n}$.

  • F2

    $P\left(\sum_{h=n}^{\infty} \pi_h > \varepsilon\right) \precsim O\left(e^{-n^{1+\eta_2}(\log n)^{(p+1)}}\right).$

    Then f0 ∈ KL($\Pi_{\mathcal{X}}$) implies that $\Pi_{\mathcal{X}}$ achieves strong posterior consistency at f0 with respect to the q-integrated L1 topology.

Remark 6.3

F2 is satisfied if the πh’s are made to decay more rapidly than the usual Beta(1, α) stick-breaking random variables. For example, if $\pi_h = \nu_h \prod_{l<h}(1 - \nu_l)$ with νh ~ Beta(1, αh), where $\alpha_h = h^{1+\eta_2}(\log h)^{p+1}\alpha_0$ for some α0 > 0, then F2 is satisfied. Large values of αh for the higher-indexed weights favor a smaller number of components; the sketch below illustrates the resulting decay.
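A quick numerical check (ours; α0, η2, p and the truncation are arbitrary illustrative values) that this modified stick-breaking construction concentrates mass on few components: simulate νh ~ Beta(1, αh) with the growing αh of the remark and inspect the remaining tail mass.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha0, eta2, p, H = 1.0, 0.5, 1, 200
h = np.arange(1, H + 1)
# alpha_h = h^{1+eta2} (log h)^{p+1} alpha_0; guard h = 1 where log h = 0
a_h = np.where(h == 1, alpha0,
               h ** (1 + eta2) * np.log(np.maximum(h, 2)) ** (p + 1) * alpha0)
nu = rng.beta(1.0, a_h)                   # nu_h ~ Beta(1, alpha_h)
pi = nu * np.cumprod(np.concatenate([[1.0], 1.0 - nu[:-1]]))
tail = 1.0 - np.cumsum(pi)                # mass beyond the first h components
print("tail mass after 5, 10, 20 sticks:", tail[4], tail[9], tail[19])
```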

Remark 6.4

A Gaussian kernel is used here for technical simplification. One can obtain similar results using a variety of kernels, e.g., t, Laplace, etc. However, the KL support conditions A1–A5 will differ for different kernels. Refer to [49] for a catalogue of conditions for various kernels in a density estimation framework.

7. Discussion

We have provided sufficient conditions for posterior consistency in estimating a conditional density via predictor-dependent mixtures of Gaussians, which include probit stick-breaking mixtures of Gaussians and fixed-π dependent processes as special cases. The problem is of interest as it provides a more flexible and informative alternative to the usual mean regression. For both models, we need the same set of tail conditions (collected in $\tilde{\mathcal{F}}_d$) on f0 for KL support. Although the first prior is flexible in the weights and the second in the atoms through their corresponding GP terms, S1, S2, T1 and T3 show that verification of the KL property only requires that both GP terms have continuous path realizations and the desired approximation property. Moreover, for the second prior, any set of weights summing to one a.s. and satisfying T2 suffices for the KL property. Careful investigation of the prior for the GP kernel in the first model, and of the probability weights in the second, is required for strong consistency. For the first model we need the covariate dependence of the higher-indexed GP terms in the weights to fade out. For the second model, on the other hand, the atoms can be i.i.d. realizations of a GP with Gaussian covariance kernel and inverse-Gamma bandwidth, while limiting the model complexity through a sequence of probability weights which are allowed to decay rapidly. This suggests that full flexibility in the weights should be down-weighted by an appropriately chosen prior, while full flexibility in the atoms should be accompanied by a restriction favoring a smaller number of components. It would be interesting to see how the conditions on the bandwidth can be modified when we use a sieve Bayes prior, i.e., a prior with the number of components kn diverging to ∞.

Another interesting direction is to consider rates of convergence of the posterior and Bernstein-von Mises (BvM) type results. For infinite dimensional parameters [50], there have been quite a few positive BvM results very recently, for linear functionals of a probability density function [51] and for general classes of linear and non-linear functionals in a Gaussian white noise model [52]. We conjecture that such BvM-type results hold for linear functionals of a conditional density (e.g., the conditional mean or conditional cdf) as well, under appropriate conditions on the prior and the true data generating conditional density.

Acknowledgments

This work was supported by Award Number R01ES017240 from the National Institute of Environmental Health Sciences. We also thank the Associate Editor and the referees for the comments which significantly improved the exposition of the paper.

Appendix A. A useful lemma

To prove Theorem 5.6, we need an auxiliary lemma which we state below.

Lemma Appendix A.1

If {πh(x), h = 1, …, ∞} constructed as in (5.1) satisfies S1 and S2, then

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\pi_1(x) - F_x(A_1)| < \varepsilon_1, \ldots, \sup_{x \in \mathcal{X}} |\pi_k(x) - F_x(A_k)| < \varepsilon_k\right\} > 0 \qquad (A.1)$

for any measurable partition {Ai, i = 1, …, k} of ℜp × ℜ+, any εi > 0 and any conditional probability measure {Fx, x ∈ $\mathcal{X}$}.

Proof

Without loss of generality, let 0 < Fx(Ai) < 1, i = 1, …, k, ∀ x ∈ $\mathcal{X}$. We want to show that for any εi > 0, i = 1, …, k, (A.1) holds. Construct continuous functions gi : $\mathcal{X}$ → ℜ with 0 < gi(x) < 1 ∀ x ∈ $\mathcal{X}$, i = 1, …, k − 1, such that

$g_1(x) = F_x(A_1), \quad g_i(x)\prod_{l<i}\{1 - g_l(x)\} = F_x(A_i),\ 2 \le i \le k-1, \quad g_k(x) = 1\ \forall x. \qquad (A.2)$

As 0 < Fx(Ai) < 1, i = 1, …, k, ∀ x ∈ $\mathcal{X}$, it is trivial to find gi, i = 1, …, k satisfying (A.2), since one can solve back for the gi’s from (A.2). $\sum_{i=1}^{k} F_x(A_i) = 1$ enforces gk ≡ 1. Since Φ is a continuous function, for any εi > 0, i = 1, …, k − 1,

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\Phi\{\alpha_i(x)\} - g_i(x)| < \varepsilon_i\right\} > 0 \qquad (A.3)$

and for i = k,

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\Phi\{\alpha_k(x)\} - 1| < \varepsilon_k\right\} = P_{\mathcal{X}}\left\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\right\}. \qquad (A.4)$

Choose M > Φ−1(1 − εk) + εk. Then

$\left\{\sup_{x \in \mathcal{X}} |\alpha_k(x) - M| < \varepsilon_k\right\} \subset \left\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\right\}.$

Hence by S2, $P_{\mathcal{X}}\{\inf_{x \in \mathcal{X}} \alpha_k(x) > \Phi^{-1}(1 - \varepsilon_k)\} > 0$. Let Sk−1 denote the (k − 1)-dimensional simplex. For notational simplicity let pi(x) = Φ{αi(x)}, gi(x) = Fx(Ai), i = 1, …, k − 1, and gk(x) = 1. Let z = (z1, …, zk)′, fi : Sk−1 → ℜ, $z \mapsto z_i \prod_{l<i}(1 - z_l)$, i = 2, …, k, and f1(z) = z1. Let p(x) = (p1(x), …, pk(x)) and g(x) = (g1(x), …, gk(x)). Then we need to show that

$P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}}|f_1(p) - f_1(g)| < \varepsilon_1, \ldots, \sup_{x \in \mathcal{X}}|f_{k-1}(p) - f_{k-1}(g)| < \varepsilon_{k-1},\ \sup_{x \in \mathcal{X}}|f_k(p) - 1| < \varepsilon_k\right\} > 0.$

Note that for 2 ≤ i ≤ k,

$|f_i(p) - f_i(g)| \le (i - 1)|p_i - g_i| + \sum_{l<i} |f_l(p) - f_l(g)|.$

Thus one can find $\varepsilon_i' > 0$, i = 1, …, k, such that

$\left\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i',\ i = 1, \ldots, k\right\} \subset \left\{\sup_{x\in\mathcal{X}}|f_1(p) - f_1(g)| < \varepsilon_1, \ldots, \sup_{x\in\mathcal{X}}|f_{k-1}(p) - f_{k-1}(g)| < \varepsilon_{k-1},\ \sup_{x\in\mathcal{X}}|f_k(p) - 1| < \varepsilon_k\right\}.$

But since $P_{\mathcal{X}}\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i',\ i = 1, \ldots, k\} = \prod_{i=1}^{k} P_{\mathcal{X}}\{\sup_{x\in\mathcal{X}}|p_i - g_i| < \varepsilon_i'\}$, the result follows immediately.

Appendix B. Proof of Theorem 5.6

Fix $\{F_x, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$. Without loss of generality it is enough to show that for a uniformly continuous function g : ℜp × ℜ+ × $\mathcal{X}$ → [0, 1] and ε > 0,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| < \varepsilon\right\} > 0.$

Furthermore, it suffices to assume that g(β, σ, x) → 0 uniformly in x ∈ $\mathcal{X}$ as ||β|| → ∞, σ → ∞.

Fix ε > 0. There exist a, σ̲, σ̄ > 0 not depending on x such that Fx([−a, a]p × [σ̲, σ̄]) > 1 − ε for all x ∈ $\mathcal{X}$. Let C = [−a, a]p × [σ̲, σ̄].

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \left|\int_{\mathcal{X}} \left\{\sum_{h=1}^{\infty} \pi_h(x)\, g(\beta_h, \sigma_h, x) - \int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\right\} q(x)\, dx\right| + \varepsilon,$

where the πh’s are specified by (5.1) with ch satisfying S1 and S2, and (βh, σh) ~ G0. Now for each x ∈ $\mathcal{X}$, construct a Riemann sum approximation of $\int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)$.

Let {Ak,n, k = 1, …, n} be a sequence of partitions of C with increasing refinement as n increases. Assume max1≤k≤n diam(Ak,n) → 0 as n ↑ ∞. Fix $(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}) \in A_{k,n}$, k = 1, …, n. Then by the DCT, as n → ∞,

$\int_{\mathcal{X}} \left\{\sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}, x)\, F_x(A_{k,n})\right\} q(x)\, dx \to \int_{\mathcal{X}} \int_C g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\, q(x)\, dx. \qquad (B.1)$

Hence there exists n1 such that for nn1

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \left|\int_{\mathcal{X}} \left\{\sum_{h=1}^{\infty} \pi_h(x)\, g(\beta_h, \sigma_h, x) - \sum_{k=1}^{n} g(\tilde{\beta}_{k,n}, \tilde{\sigma}_{k,n}, x)\, F_x(A_{k,n})\right\} q(x)\, dx\right| + 2\varepsilon.$

Consider the set

$\Omega_1 = \left\{(\pi_h, h = 1, \ldots, \infty) : \sup_{x \in \mathcal{X}} |\pi_1(x) - F_x(A_{1,n_1})| < \varepsilon/n_1, \ldots, \sup_{x \in \mathcal{X}} |\pi_{n_1}(x) - F_x(A_{n_1,n_1})| < \varepsilon/n_1\right\}.$

By Lemma Appendix A.1, which is proved in Appendix A, $P_{\mathcal{X}}(\Omega_1) > 0$. Since $\sum_{h=1}^{\infty} \pi_h(x) = 1$ a.s., there exists Ω with $P_{\mathcal{X}}(\Omega) = 1$ such that for each ω = {πh, h = 1, …, ∞} ∈ Ω, $g_n(x) = \sum_{h=1}^{n} \pi_h(x) \to 1$ as n → ∞ for each x ∈ $\mathcal{X}$. Note that this convergence is uniform, since the gn(·), n ≥ 1, are continuous functions defined on a compact set, monotonically increasing to a continuous function identically equal to 1. Hence for each ω = {πh, h = 1, …, ∞} ∈ Ω, gn(x) → 1 uniformly in x. By Egoroff’s theorem, there exists a measurable subset Ω2 of Ω1 with $P_{\mathcal{X}}(\Omega_2) > 0$ such that gn(x) → 1 uniformly in x and uniformly in ω ∈ Ω2. Thus there exists a positive integer nε ≥ n1, not depending on x and ω, such that $\sum_{h=n_\varepsilon+1}^{\infty} \pi_h(x) < \varepsilon$ on Ω2. Moreover, one can find a K > 0 independent of x such that g(β, σ, x) < ε if ||β|| > K and σ > K. Let A1 = {(β, σ) : ||β|| > K, σ > K} and let Ω3 = Ω2 ∩ {(βh, σh) ∈ A1, h = n1 + 1, …, nε}. For ω ∈ Ω3,

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| \le \int_{\mathcal{X}} \left\{\sum_{k=1}^{n_1} \left|\pi_k(x)\, g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\, F_x(A_{k,n_1})\right|\right\} q(x)\, dx + 4\varepsilon

and

$\int_{\mathcal{X}} \left\{\sum_{k=1}^{n_1} \left|\pi_k(x)\, g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\, F_x(A_{k,n_1})\right|\right\} q(x)\, dx \le \sum_{k=1}^{n_1} \int_{\mathcal{X}} \pi_k(x) \left|g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)\right| q(x)\, dx + \varepsilon.$

There exist sets Bk, k = 1, …, n1, depending on n1 but independent of x, such that if (βk, σk) ∈ Bk, then $|g(\beta_k, \sigma_k, x) - g(\tilde{\beta}_{k,n_1}, \tilde{\sigma}_{k,n_1}, x)| < \varepsilon$. So for ω ∈ Ω4 = Ω3 ∩ {(β1, σ1) ∈ B1, …, (βn1, σn1) ∈ Bn1},

$\left|\int_{\Re^p \times \Re^+ \times \mathcal{X}} \{g(\beta, \sigma, x)\, dG_x(\beta, \sigma) - g(\beta, \sigma, x)\, dF_x(\beta, \sigma)\}\, q(x)\, dx\right| < 5\varepsilon.$

Now since $P_{\mathcal{X}}(\Omega_2) > 0$ and the events {(βh, σh) ∈ A1, h = n1 + 1, …, nε} and {(β1, σ1) ∈ B1, …, (βn1, σn1) ∈ Bn1} are independent of Ω2 and have positive probability, it follows that $P_{\mathcal{X}}(\Omega_4) > 0$.

Appendix C. Proof of Theorem 5.3

Without loss of generality, assume that the covariate space $\mathcal{X}$ is [ζ, 1]p for some 0 < ζ < 1. The proof is essentially along the lines of Theorem 3.2 of [12]. The $\tilde{f}$ in (5.2) will be constructed so as to satisfy the assumptions of Lemma 5.5 and such that $\int_{\mathcal{X}}\int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx < \frac{\varepsilon}{2}$ for any ε > 0. Define a sequence of conditional densities $f_n(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - x'\beta}{\sigma}\right) dG_{n,x}(\beta, \sigma)$, n ≥ 1, where for $\sigma_n = n^{-\eta}$,

$dG_{n,x}(\beta, \sigma) = \frac{1_{[-n,n]}(\beta_1)\, f_0(x_1\beta_1 \mid x)\, \prod_{j=2}^{p} \delta_0(\beta_j)\, \delta_{\sigma_n}(\sigma)}{\int_{-n}^{n} f_0(x_1\beta_1 \mid x)\, d\beta_1}. \qquad (C.1)$

Define

$f_n(y \mid x) = \frac{\int_{-nx_1}^{nx_1} \frac{1}{\sigma_n}\phi\left(\frac{y - t}{\sigma_n}\right) f_0(t \mid x)\, dt}{\int_{-nx_1}^{nx_1} f_0(t \mid x)\, dt}. \qquad (C.2)$

Proceeding as in Theorem 3.2 of [12], an application of DCT using the conditions A1–A5 yields

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{f_n(y \mid x)}\, dy\, q(x)\, dx \to 0 \quad \text{as } n \to \infty.$

Therefore one can simply choose $\tilde{f} = f_{n_0}$ for sufficiently large n0. $f_{n_0}$ satisfies the assumptions of Lemma 5.5 since $\{G_{n_0,x}, x \in \mathcal{X}\}$ is compactly supported. Also $\{G_{n_0,x}, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$, as x ↦ Gn0,x(A) is continuous. Hence there exists a finite intersection W of neighborhoods of $\{G_{n_0,x}, x \in \mathcal{X}\}$ of the type (5.5) such that for any {Gx, x ∈ $\mathcal{X}$} ∈ W, the second term of (5.2) is arbitrarily small. The conclusion of the theorem follows immediately from Corollary 5.8.

Appendix D. Proof of Theorem 5.10

Consider the sequence of sieves defined by (5.9) for given ε > 0 and for sequences an, hn, ln, Mn, mn, rn to be chosen later, with $\delta_n = K_1\varepsilon/(M_n m_n^2)$ for some constant K1. We will first show that given ξ > 0, there exist c1, c2 > 0 and sequences mn and Mn such that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le c_1 e^{-nc_2}$ and log N(δ, $\mathcal{F}_n$, ||·||1) < nξ.

For f1, f2Inline graphic, we have for each xInline graphic,

f1(·x)-f2(·x)1h=1mnπh(1)-πh(2)+2ε.

Let $\Theta_{\pi,n} = \{\pi^{m_n} = (\pi_1, \pi_2, \ldots, \pi_{m_n}) : \alpha_h \in B_{h,n},\ h = 1, \ldots, m_n\}$. Fix $\pi^{m_n}_{(1)}, \pi^{m_n}_{(2)} \in \Theta_{\pi,n}$. Note that since |Φ(x1) − Φ(x2)| ≤ K2|x1 − x2| for a global constant K2 > 0, we have

$\|\Phi(\alpha_{h,1}) - \Phi(\alpha_{h,2})\|_{\infty} \le K_2 \|\alpha_{h,1} - \alpha_{h,2}\|_{\infty}.$

This fact, together with the proof of Lemma Appendix A.1, shows that if we can make $\|\alpha_{h,1} - \alpha_{h,2}\|_{\infty} < \varepsilon/m_n^2$, h = 1, …, mn, we would have $\sum_{h=1}^{m_n} \|\pi_h^{(1)} - \pi_h^{(2)}\|_{\infty} < \varepsilon$. From the proof of Theorem 3.1 in [42] it follows that for $h = 1, \ldots, m_n^{\eta}$ and for sufficiently large Mn, rn,

$\log N(2\varepsilon/m_n^2, B_{h,n}, \|\cdot\|_{\infty}) \le K_3 r_n^p \left\{\log\left(\frac{M_n m_n^2 \sqrt{r_n/\delta_n}}{\varepsilon}\right)\right\}^{p+1} + 2\log\frac{K_4 M_n m_n^2}{\varepsilon} \qquad (D.1)$

for global constants K3, K4 > 0. For $M_n^2 > 16 K_5 r_n^p \{\log(r_n/\varepsilon)\}^{1+p}$, rn > 1, we have for $h = 1, \ldots, m_n^{\eta}$,

$P(\alpha_h \notin B_{h,n}) \le P(A_h > r_n) + e^{-M_n^2/2}. \qquad (D.2)$

Hence for sufficiently large Mn, we have for $h = m_n^{\eta}+1, \ldots, m_n$,

$\log N(3\varepsilon/m_n^2, B_{h,n}, \|\cdot\|_{\infty}) \le 2\log\frac{K_4 M_n m_n^2}{\varepsilon}. \qquad (D.3)$

For $h = m_n^{\eta}+1, \ldots, m_n$,

$P(\alpha_h \notin B_{h,n}) \le P(A_h > \delta_n) + \int_{0}^{\delta_n} P(\alpha_h \notin B_{h,n} \mid A_h = a)\, g_{A_h}(a)\, da \le P(A_h > \delta_n) + \left(1 - \Phi\left(\Phi^{-1}\left(e^{-\phi_0^{\delta_n}(\varepsilon/m_n^2)}\right) + M_n\right)\right),$

where $\phi_0^{\kappa}(\varepsilon)$ denotes the concentration function of the Gaussian process with covariance kernel $c(x, x') = \tau^2 e^{-\kappa\|x - x'\|^2}$. Now

$\phi_0^{\delta_n}(\varepsilon/m_n^2) \le -\log P\left(\|\alpha_0\|_{\infty} \le \varepsilon/m_n^2\right) \le K_6 \log(m_n^2/\varepsilon)$

for some constant K6 > 0. Hence if $M_n \ge K_7 \sqrt{\log(m_n^2/\varepsilon)}$ for some K7 > 0, then it follows from the proof of Theorem 3.1 in [42] that

$P(\alpha_h \notin B_{h,n}) \le P(A_h > \delta_n) + e^{-M_n^2/2}. \qquad (D.4)$

From (D.1) and (D.3),

$\log N(\varepsilon, B_{1,n} \times \cdots \times B_{m_n,n}, \|\cdot\|_{\infty}) \le 2 m_n \log\frac{K_4 M_n m_n^2}{\varepsilon} + m_n^{\eta} r_n^p \left\{\log\left(\frac{M_n m_n^2 \sqrt{r_n/\delta_n}}{\varepsilon}\right)\right\}^{p+1}. \qquad (D.5)$

Also from (D.2) and (D.4),

$\sum_{h=1}^{m_n} P(\alpha_h \notin B_{h,n}) \le m_n e^{-M_n^2/2} + \sum_{h=1}^{m_n^{\eta}} P(A_h > r_n) + \sum_{h=m_n^{\eta}+1}^{m_n} P(A_h > \delta_n).$

We will show that with $m_n = O(n/\log n)$, $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) < e^{-n\xi_0}$ for some ξ0 > 0. By assumption C1, we have

$\Pi_{\mathcal{X}}(\Theta_{a_n,h_n,l_n}^c) \le m_n O(e^{-n}) \precsim e^{-n/2}. \qquad (D.6)$

With $m_n = O(n/\log n)$, $\sum_{h=1}^{m_n^{\eta}} P(A_h > r_n) \le m_n^{\eta} e^{-n} \precsim e^{-n/2}$ and $\sum_{h=m_n^{\eta}+1}^{m_n} P(A_h > \delta_n) \le (m_n - m_n^{\eta})\, e^{-n^{-\eta_0} m_n^{\eta_0+2} \log m_n} \le e^{-m_n \log m_n}$.

With $m_n = n/\log n$, $m_n \log m_n > n/2$ for large enough n, and it follows from Lemma 5.15 that

$\Pi_{\mathcal{X}}\left(\sup_{x \in \mathcal{X}} \sum_{h=m_n+1}^{\infty} \pi_h(x) > \varepsilon\right) \le O(e^{-n/2}). \qquad (D.7)$

Thus with $M_n = O(n^{1/2})$,

$\sum_{h=1}^{m_n} P(\alpha_h \notin B_{h,n}) \precsim e^{-n/2}. \qquad (D.8)$

(D.6), (D.7) and (D.8) together imply that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \precsim e^{-n/2}$.

Also $m_n^{\eta} r_n^p \left\{\log\left(M_n \sqrt{r_n/\delta_n}/\varepsilon\right)\right\}^{p+1} = o(n)$ for the choice of the sequence rn. With mn = n/(C log n) for some large C > 0, one can make

$\log N(\varepsilon, B_{1,n} \times \cdots \times B_{m_n,n}, \|\cdot\|_{\infty}) < n\xi \qquad (D.9)$

for any ξ > 0. Also from Lemma 5.14,

$m_n \log N(\Theta_{a_n,h_n,l_n}, \varepsilon, \|\cdot\|_{\infty}) \le m_n \log\left\{d_1\left(\frac{a_n}{l_n}\right)^p + d_2 \log\frac{h_n}{l_n} + 1\right\} < n\xi \qquad (D.10)$

for any ξ > 0. Combining (D.9) and (D.10), log N($\mathcal{F}_n$, 4ε, ||·||1) < nξ for any ξ > 0.

Appendix E. Another useful lemma

We state without proof the following lemma, which is needed in the proof of Lemma 5.15.

Lemma Appendix E.1

For non-negative random variables Ai, Bi, if P(Ai ≤ u) ≤ Ci P(Bi ≤ u) for u ∈ (0, t0), t0 > 0, i = 1, 2, then P(A1 + A2 ≤ t0) ≤ C1C2 P(B1 + B2 ≤ t0).

Appendix F. Proof of Theorem 6.1

Proof

Once again we approximate f0(y | x) by $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) d\tilde{G}_x(\mu, \sigma)$, so that the first term of (5.2) is arbitrarily small. We construct such an $\tilde{f}$ analogously to that in Theorem 5.3. Lemma Appendix F.1 is a variant of Lemma 5.5 which ensures that the second term in (5.2) is also sufficiently small. Before that we need a different notion of neighborhood of {Fx, x ∈ $\mathcal{X}$}, which we formulate below:

$\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re \times \Re^+} \{g(\mu, \sigma)\, dG_x(\mu, \sigma) - g(\mu, \sigma)\, dF_x(\mu, \sigma)\}\right| < \varepsilon\right\}. \qquad (F.1)$

Lemma Appendix F.1

Assume that f0 ∈ $\mathcal{F}_d$ satisfies $\int_{\mathcal{X}} \int_{\mathcal{Y}} y^2 f_0(y \mid x)\, dy\, q(x)\, dx < \infty$. Suppose $\tilde{f}(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) d\tilde{G}_x(\mu, \sigma)$, where there exist a > 0 and 0 < σ̲ < σ̄ such that

$\tilde{G}_x\left([-a, a] \times (\underline{\sigma}, \bar{\sigma})\right) = 1\ \forall x \in \mathcal{X}, \qquad (F.2)$

so that $\tilde{G}_x$ has compact support for each x ∈ $\mathcal{X}$. Then given any ε > 0, there exists a neighborhood W of $\{\tilde{G}_x, x \in \mathcal{X}\}$, a finite intersection of neighborhoods of the type (F.1), such that for any conditional density $f(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_x(\mu, \sigma)$, x ∈ $\mathcal{X}$, with {Gx, x ∈ $\mathcal{X}$} ∈ W,

$\int_{\mathcal{X}} \int_{\mathcal{Y}} f_0(y \mid x) \log \frac{\tilde{f}(y \mid x)}{f(y \mid x)}\, dy\, q(x)\, dx < \varepsilon. \qquad (F.3)$

The proof of Lemma Appendix F.1 is similar to that of Lemma 5.5 and is omitted here. To characterize the support of $P_{\mathcal{X}}$, we define a collection, denoted $\mathcal{G}_{\mathcal{X}}$, of fixed conditional probability measures {Fx, x ∈ $\mathcal{X}$} on (ℜ × ℜ+, $\mathcal{B}$(ℜ × ℜ+)) satisfying: x ↦ ∫ℜ×ℜ+ g(μ, σ)dFx(μ, σ) is a continuous function of x for all bounded uniformly continuous functions g : ℜ × ℜ+ → [0, 1].

Theorem Appendix F.2

Assume the following holds.

  • T1

    G0 is specified by μh ~ GP(μ, c), σh ~ G0,σ, where c is chosen so that GP(0, c) has continuous path realizations and G0,σ is absolutely continuous w.r.t. the Lebesgue measure on ℜ+.

  • T2

    For every k ≥ 2, (π1, …, πk) is absolutely continuous w.r.t. the Lebesgue measure on Sk−1.

  • T3
    For any continuous function g : $\mathcal{X}$ → ℜ,
    $P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_h(x) - g(x)| < \varepsilon\right\} > 0$

    for h = 1, …, ∞ and for any ε > 0.

Then for a bounded uniformly continuous function g : ℜ × ℜ+ → [0, 1] satisfying g(μ, σ) → 0 as |μ| → ∞, σ → ∞,

$P_{\mathcal{X}}\left\{\{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \left|\int_{\Re \times \Re^+} \{g(\mu, \sigma)\, dG_x(\mu, \sigma) - g(\mu, \sigma)\, dF_x(\mu, \sigma)\}\right| < \varepsilon\right\} > 0. \qquad (F.4)$

Proof

It suffices to assume that g is coordinatewise monotonically increasing on ℜ × ℜ+. Let ε > 0 be given and let ψ(x) = ∫ℜ×ℜ+ g(μ, σ)dFx(μ, σ). Let nε be such that $P_{\mathcal{X}}(\Omega_1) > 0$, where $\Omega_1 = \{\sum_{h=n_\varepsilon+1}^{\infty} \pi_h < \varepsilon\}$. Then on Ω1,

$\left|\int_{\Re \times \Re^+} g(\mu, \sigma)\, dG_x(\mu, \sigma) - \psi(x)\right| \le \sum_{k=1}^{n_\varepsilon} \pi_k \left|g(\mu_k(x), \sigma_k) - \psi(x)\right| + \varepsilon.$

Define Ω2 = {$\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε, k = 1, …, nε}. For a fixed σk, there exists a δ such that $\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε/2 if $\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta$, where $g_{\sigma_k}^{-1}$ denotes the inverse of g(·, σk) for fixed σk. Hence there exists a neighborhood Bk of σk such that for σk ∈ Bk and $\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta$, we have $\sup_{x \in \mathcal{X}}$|g(μk(x), σk) − ψ(x)| < ε. Since for each k = 1, …, nε,

$P_{\mathcal{X}}\left\{\sigma_k \in B_k,\ \sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta\right\} = \int_{\sigma_k \in B_k} P_{\mathcal{X}}\left\{\sup_{x \in \mathcal{X}} |\mu_k(x) - g_{\sigma_k}^{-1}\psi(x)| < \delta\right\} dG_{0,\sigma}(\sigma_k) > 0,$

Inline graphic2) > 0. The conclusion of the theorem follows from the independence of Ω1 and Ω2.

The $\tilde{f}$ in (5.2) will be constructed so as to satisfy the assumptions of Lemma Appendix F.1 and such that $\int_{\mathcal{X}}\int_{\mathcal{Y}} f_0(y \mid x) \log \frac{f_0(y \mid x)}{\tilde{f}(y \mid x)}\, dy\, q(x)\, dx < \frac{\varepsilon}{2}$ for any ε > 0. Define a sequence of conditional densities $f_n(y \mid x) = \int \frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right) dG_{n,x}(\mu, \sigma)$, n ≥ 1, where for $\sigma_n = n^{-\eta}$,

$dG_{n,x}(\mu, \sigma) = \frac{1_{[-n,n]}(\mu)\, f_0(\mu \mid x)\, \delta_{\sigma_n}(\sigma)}{\int_{-n}^{n} f_0(\mu \mid x)\, d\mu}. \qquad (F.5)$

As before define the approximator

$f_n(y \mid x) = \frac{\int_{-n}^{n} \frac{1}{\sigma_n}\phi\left(\frac{y - t}{\sigma_n}\right) f_0(t \mid x)\, dt}{\int_{-n}^{n} f_0(t \mid x)\, dt}. \qquad (F.6)$

$\tilde{f}$ will be chosen to be $f_{n_0}$ for some large n0. $f_{n_0}$ satisfies the assumptions of Lemma Appendix F.1 since $\{G_{n_0,x}, x \in \mathcal{X}\}$ is compactly supported. Moreover $\{G_{n_0,x}, x \in \mathcal{X}\} \in \mathcal{G}_{\mathcal{X}}$, as x ↦ ∫ℜ×ℜ+ g(μ, σ)dGn0,x(μ, σ) is a continuous function of x for all bounded uniformly continuous functions g. Hence there exists a finite intersection W of neighborhoods of $\{G_{n_0,x}, x \in \mathcal{X}\}$ of the type (F.1) such that for any {Gx, x ∈ $\mathcal{X}$} ∈ W, the second term of (5.2) is arbitrarily small. The conclusion of the theorem follows immediately from a variant of Corollary 5.8 applied to neighborhoods of the type (F.1).

Appendix G. Proof of Theorem 6.2

Proof

As before, we establish q-integrated L1 consistency of Gaussian mixtures of fixed-π dependent processes by verifying the conditions of Theorem 5.9. Let $\phi_{\mu,\sigma}(x, y) := \frac{1}{\sigma}\phi\left(\frac{y - \mu(x)}{\sigma}\right)$ for y ∈ $\mathcal{Y}$ and x ∈ $\mathcal{X}$. Construct Bn as

$B_n = \left(M_n\sqrt{r_n/\delta_n}\,\mathbb{H}_1^{r_n} + \frac{\varepsilon l_n \sqrt{\pi}}{4\sqrt{2}}\mathbb{B}_1\right) \cup \left(\bigcup_{a < \delta_n} M_n\,\mathbb{H}_1^{a} + \frac{\varepsilon l_n \sqrt{\pi}}{4\sqrt{2}}\mathbb{B}_1\right),$

with $\delta_n = K_1 \varepsilon l_n / M_n$ for some constant K1 > 0. Let

$\Theta_n = \left\{\phi_{\mu,\sigma} : \mu(x) = x'\beta + \eta(x),\ \|\beta\| \le a_n,\ \eta \in B_n,\ l_n \le \sigma \le h_n\right\}. \qquad (G.1)$

It is easy to see that

$\log N(\mathcal{F}_n, 4\varepsilon, \|\cdot\|_1) \le K_2 m_n r_n^p \left\{\log\left(\frac{8\sqrt{2} M_n \sqrt{r_n/\delta_n}}{\varepsilon\sqrt{\pi}\, l_n}\right)\right\}^{p+1} + m_n \log\frac{K_3 M_n}{\varepsilon l_n} + m_n \log\left\{d_1\left(\frac{a_n}{l_n}\right)^p + d_2 \log\frac{h_n}{l_n} + 1\right\}. \qquad (G.2)$

Note that $\Pi_{\mathcal{X}}(\mathcal{F}_n^c) \le m_n P(\Theta_n^c) + P\left(\sum_{h=m_n}^{\infty} \pi_h > \varepsilon\right)$ and $P(\Theta_n^c) \le P(\|\beta\| > a_n) + P(\sigma \in [l_n, h_n]^c) + P(\eta \in B_n^c)$. It follows from the proof of Theorem 3.1 of [42] that

$P(\eta \in B_n^c) \le P(A > r_n) + e^{-M_n^2/2}$

if $M_n^2 > r_n^p \left\{\log\left(\frac{8\sqrt{2} M_n \sqrt{r_n/\delta_n}}{\varepsilon\sqrt{\pi}\, l_n}\right)\right\}^{p+1}$. Since $A^{p(1+\eta_2)/\eta_2} \sim \mathrm{Ga}(a, b)$, Lemma 4.9 of [42] indicates that $P(A > r_n) \precsim \exp\{-r_n^{p(1+\eta_2)/\eta_2}\}$. Hence with $M_n = O(n^{1/2})$, $m_n = O\{n/(\log n)^{p+1}\}^{1/(1+\eta_2)}$ and $r_n^p = O\{n^{\eta_2/(1+\eta_2)}\}$, we get $P(\Theta_n^c) \precsim e^{-n}$ and

$P\left(\sum_{h=m_n}^{\infty} \pi_h > \varepsilon\right) \le \exp\{-m_n^{1+\eta_2}(\log m_n)^{(p+1)}\} \le e^{-n}. \qquad (G.3)$

Also, the first term on the right hand side of (G.2) can be made smaller than nξ since $m_n r_n^p = O(n/(\log n)^{p+1})$. Also, by F1, the last two terms on the right hand side of (G.2) grow at rate o(n).

Footnotes

1

A similar sieve appears in [33] with a citation to an earlier draft of our paper.


References

1. Fan J, Yao Q, Tong H. Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika. 1996;83:189–206.
2. Rojas A, Genovese C, Miller C, Nichol R, Wasserman L. Conditional density estimation using finite mixture models with an application to astrophysics. 2005.
3. Jain S, Neal R. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics. 2004;13:158–182.
4. Walker S. Sampling the Dirichlet mixture model with slices. Communications in Statistics-Simulation and Computation. 2007;36:45–54.
5. Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
6. Minka T. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology; 2001.
7. Lo A. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics. 1984;12:351–357.
8. Ferguson T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
9. Ferguson T. Prior distributions on spaces of probability measures. The Annals of Statistics. 1974;2:615–629.
10. Barron A, Schervish M, Wasserman L. The consistency of posterior distributions in nonparametric problems. The Annals of Statistics. 1999;27:536–561.
11. Ghosal S, Ghosh J, Ramamoorthi R. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics. 1999;27:143–158.
12. Tokdar S. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā: The Indian Journal of Statistics. 2006;67:90–110.
13. Ghosal S, van der Vaart A. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics. 2001;29:1233–1263.
14. Ghosal S, van der Vaart A. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics. 2007;35:697–723.
15. Bhattacharya A, Dunson D. Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds. Annals of the Institute of Statistical Mathematics. 2011:1–28. doi:10.1007/s10463-011-0341-x.
16. Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
17. Griffin J, Steel M. Order-based dependent Dirichlet processes. Journal of the American Statistical Association. 2006;101:179–194.
18. Griffin J, Steel M. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica. 2010;20:1507–1527.
19. Dunson D, Pillai N, Park J. Bayesian density regression. Journal of the Royal Statistical Society, Series B. 2007;69:163–183.
20. Dunson D, Park J. Kernel stick-breaking processes. Biometrika. 2008;95:307–323. doi:10.1093/biomet/asn012.
21. Chung Y, Dunson D. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2009;104:1646–1660. doi:10.1198/jasa.2009.tm08302.
22. Tokdar S, Zhu Y, Ghosh J. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis. 2010;5:1–26.
23. Kruijer W, Rousseau J, van der Vaart A. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics. 2010;4:1225–1257.
24. Scricciolo C. Posterior rates of convergence for Dirichlet mixtures of exponential power densities. Electronic Journal of Statistics. 2011;5:270–308.
25. Barrientos F, Jara A, Quintana F. On the support of MacEachern's dependent Dirichlet processes. Bayesian Analysis. 2012;7:1–34.
26. MacEachern S. Dependent nonparametric processes. In: ASA Proceedings of the Section on Bayesian Statistical Science. American Statistical Association; 1999. pp. 50–55.
27. Tokdar S, Ghosh J. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference. 2007;137:34–42.
28. Yoon J. Bayesian analysis of conditional density functions: a limited information approach. Unpublished manuscript, Claremont McKenna College; 2009.
29. Tang Y, Ghosal S. A consistent nonparametric Bayesian procedure for estimating autoregressive conditional densities. Computational Statistics & Data Analysis. 2007;51:4424–4437.
30. Tang Y, Ghosal S. Posterior consistency of Dirichlet mixtures for estimating a transition density. Journal of Statistical Planning and Inference. 2007;137:1711–1726.
31. Rodriguez A, Dunson D. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6:145–178. doi:10.1214/11-BA605.
32. De Iorio M, Müller P, Rosner G, MacEachern S. An ANOVA model for dependent random measures. Journal of the American Statistical Association. 2004;99:205–215.
33. Norets A, Pelenis J. Posterior consistency in conditional density estimation by covariate dependent mixtures. Unpublished manuscript, Princeton University; 2010.
34. Norets A. Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics. 2010;38:1733–1766.
35. Wu Y, Ghosal S. The L1-consistency of Dirichlet mixtures in multivariate Bayesian density estimation. Journal of Multivariate Analysis. 2010;101:2411–2419.
36. Shen W, Tokdar S, Ghosal S. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. 2011. arXiv preprint arXiv:1109.6406.
37. Tokdar S. Adaptive convergence rates in Dirichlet process mixtures of multivariate normals. 2011. arXiv preprint arXiv:1111.4148.
38. Pati D, Dunson D. Bayesian nonparametric regression with varying residual density. Unpublished paper. 2009.
39. De Iorio M, Johnson W, Müller P, Rosner G. Bayesian nonparametric nonproportional hazards survival modeling. Biometrics. 2009;65:762–771. doi:10.1111/j.1541-0420.2008.01166.x.
40. Jara A, Lesaffre E, De Iorio M, Quintana F. Bayesian semiparametric inference for multivariate doubly-interval-censored data. The Annals of Applied Statistics. 2010;4:2126–2149.
41. Gelfand A, Kottas A, MacEachern S. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association. 2005;100:1021–1035.
42. van der Vaart A, van Zanten J. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics. 2009;37:2655–2675.
43. van der Vaart A, van Zanten J. Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections. 2008;3:200–222.
44. Adler R. An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series, Vol. 12. Institute of Mathematical Statistics; 1990.
45. Papaspiliopoulos O. A note on posterior sampling from Dirichlet mixture models. Technical report; 2008.
46. Ishwaran H, James L. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96:161–173.
47. Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
48. Kalli M, Griffin J, Walker S. Slice sampling mixture models. Statistics and Computing. 2010:1–13.
49. Wu Y, Ghosal S. Kullback-Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
50. Freedman D. Wald lecture: On the Bernstein-von Mises theorem with infinite-dimensional parameters. The Annals of Statistics. 1999;27:1119–1141.
51. Rivoirard V, Rousseau J. Bernstein-von Mises theorem for linear functionals of the density. The Annals of Statistics. 2012;40:1489–1523.
52. Castillo I, Nickl R. Nonparametric Bernstein-von Mises theorems. 2012. arXiv preprint arXiv:1208.3862.
