Abstract
As a generalization of the Dirichlet process (DP) to allow predictor dependence, we propose a local Dirichlet process (lDP). The lDP provides a prior distribution for a collection of random probability measures indexed by predictors. This is accomplished by assigning stick-breaking weights and atoms to random locations in a predictor space. The probability measure at a given predictor value is then formulated using the weights and atoms located in a neighborhood about that predictor value. This construction results in a marginal DP prior for the random measure at any specific predictor value. Dependence is induced through local sharing of random components. Theoretical properties are considered and a blocked Gibbs sampler is proposed for posterior computation in lDP mixture models. The methods are illustrated using simulated examples and an epidemiologic application.
Keywords: Dependent Dirichlet process, Blocked Gibbs sampler, Mixture model, Non-parametric Bayes, Stick-breaking representation
1 Introduction
In recent years, there has been a dramatic increase in applications of non-parametric Bayes methods, motivated largely by the availability of simple and efficient methods for posterior computation in Dirichlet process mixture (DPM) models (Lo 1984; Escobar 1994; Escobar and West 1995). DPM models incorporate Dirichlet process (DP) priors (Ferguson 1973, 1974) for components in Bayesian hierarchical models, resulting in an extremely flexible class of models. Due to this flexibility and ease of implementation, DPM models are now routinely implemented in a wide variety of applications, ranging from machine learning (Beal et al. 2002; Blei et al. 2004) to genomics (Xing et al. 2004; Kim et al. 2006).
In many settings, it is natural to consider generalizations of the DP and DPM-based models to accommodate dependence. For example, one may be interested in studying changes in a density with predictors. Following Lo (1984), one can use a DPM for Bayes inference on a single density as follows:
(1) f(y) = ∫ k(y, u) dG(u),
where k(y, u) is a non-negative valued kernel defined on (𝒴 × Ω) such that k(·, u) is a probability density for each u ∈ Ω and k(y, ·) is measurable for each y ∈ 𝒴, with 𝒴, Ω Borel subsets of Euclidean spaces and 𝒜, ℬ the corresponding σ-fields, and G is a random probability measure on (Ω, ℬ) following a DP. A natural extension for modeling of a conditional density f(y|x), for x ∈ 𝒳, with 𝒳 a Lebesgue measurable subset of a Euclidean space, is as follows:
(2) f(y|x) = ∫ k(y, u) dGx(u),
where the mixing measure Gx is now indexed by the predictor value x. We are then faced with modeling a collection of random mixing measures denoted as 𝒢𝒳 = {Gx : x ∈ 𝒳}.
Recent work on defining priors for collections of random probability measures has primarily relied on extending the stick-breaking representation of the DP (Sethuraman 1994). This literature was stimulated by the dependent Dirichlet process (DDP) framework proposed by MacEachern (1999, 2000, 2001), which replaces the atoms in the Sethuraman (1994) representation with stochastic processes. The DDP framework has been adopted to develop ANOVA-type models for random probability measures (De Iorio et al. 2004), for flexible spatial modeling (Gelfand et al. 2004), in time series applications (Caron et al. 2006), and for inferences on stochastic ordering (Dunson and Peddada 2008). The specification of the DDP used in applications incorporates dependence only through the atoms while assuming fixed weights. In other recent work, Griffin and Steel (2006) proposed an order-based DDP (πDDP) which allows varying weights, while Duan et al. (2005) developed a multivariate stick-breaking process for spatial data.
Alternatively, convex combinations of independent DPs can be used for modeling collections of dependent random measures. Müller et al. (2004) proposed this idea to allow dependence across experiments, while discrete dynamic settings were considered by Pennell and Dunson (2006) and Dunson (2006). Recently, the idea has been extended to the continuous covariate case by Dunson et al. (2007) and Dunson and Park (2008).
Some desirable properties of a prior for a collection, 𝒢𝒳 = {Gx : x ∈ 𝒳}, of predictor-dependent probability measures are: (1) increasing dependence between Gx and Gx′ with decreasing distance between x and x′; (2) simple and interpretable expressions for the expectation and variance of each Gx, as well as the correlation between Gx and Gx′; (3) Gx has a marginal DP prior for all x ∈ 𝒳; (4) posterior computation can proceed efficiently through a straightforward MCMC algorithm in a broad variety of applications. Although the DDP, πDDP, and the prior proposed by Duan et al. (2005) achieve (1), the πDDP and Duan et al. (2005) approaches are not straightforward to implement in general applications. The fixed stick-breaking weights version of the DDP tends to be easy to implement, but has the disadvantage of not allowing locally adaptive mixture weights. The kernel mixture approaches of Dunson et al. (2007) and Dunson and Park (2008) lack the marginal DP property (3). Property (3) is appealing in that there is a rich theoretical literature on DPs, showing posterior consistency (Ghosal et al. 1999; Lijoi et al. 2005) and rates of convergence (Ghosal and Van der Vaart 2007).
This article proposes a simple extension of the DP, which provides an alternative to the fixed weights DDP in order to allow local adaptivity, while also achieving properties (1)–(4). The prior is constructed by first assigning stick-breaking weights and atoms to random locations in a predictor space. Each predictor-dependent random probability measure is formulated using the random weights and atoms located in a neighborhood about that predictor value. Dependence is induced by local sharing of random components. We call this prior the local Dirichlet process (lDP).
Section 2 describes stick-breaking priors (SBP) for collections of predictor-dependent random probability measures. Section 3 introduces the lDP and discusses properties. Computation is described in Sect. 4. Sections 5 and 6 include simulation studies and an epidemiologic application. Section 7 concludes with a discussion. Proofs are included in appendices.
2 Predictor-dependent stick-breaking priors
2.1 Stick-breaking priors
Ishwaran and James (2001) proposed a general class of SBPs for random probability measures. This class provides a useful starting point in considering extensions to allow predictor dependence.
Definition 1 A random probability measure, G, has an SBP if
(3) G(·) = Σ_{h=1}^{N} ph δ_{θh}(·), ph = Vh ∏_{l<h} (1 − Vl),
where δθ is a discrete measure concentrated at θ, {ph, h = 1, …, N} are random weights with Vh ~ Beta(ah, bh) independently from θh ~ G0, with G0 a non-atomic base probability measure. For N = ∞, the condition Σ_{h=1}^{∞} ph = 1 a.s. is satisfied by Lemma 1 in Ishwaran and James (2001). For finite N, the condition is satisfied by letting VN = 1.
There are many processes that fall into this class of SBP. The DP corresponds to the special case in which N = ∞, ah = 1 and bh = α as established in Sethuraman (1994). The two-parameter Poisson-DP corresponds to the case where N = ∞, ah = 1 − a, and bh = b + ha with 0 ≤ a < 1 and b > −a (Pitman 1995, 1996). Additional special cases are listed in Ishwaran and James (2001).
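For readers who find code helpful, the following is a minimal numpy sketch of drawing the weights and atoms of an N-truncation of (3) in the DP special case (ah = 1, bh = α); the function and argument names are ours, and VN is set to 1 as in the finite-N case above.

```python
import numpy as np

def dp_stick_breaking(alpha, g0_sampler, N, rng):
    """Draw the weights and atoms of an N-truncation of (3) in the DP
    special case: V_h ~ Beta(1, alpha), theta_h ~ G0, and
    p_h = V_h * prod_{l<h} (1 - V_l), with V_N set to 1 so the
    truncated weights sum to one."""
    V = rng.beta(1.0, alpha, size=N)
    V[-1] = 1.0
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    theta = g0_sampler(N, rng)  # iid draws from the base measure G0
    return p, theta

rng = np.random.default_rng(0)
# Example with alpha = 1 and G0 = N(0, 1):
p, theta = dp_stick_breaking(1.0, lambda m, r: r.normal(0.0, 1.0, m), 50, rng)
assert abs(p.sum() - 1.0) < 1e-12
```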
2.2 Predictor-dependent stick-breaking priors
Consider an uncountable collection of predictor-dependent random probability measures, 𝒢𝒳 = {Gx : x ∈ 𝒳}. The predictor space 𝒳 is a Lebesgue measurable subset of a Euclidean space, and the random measures Gx are defined on (Ω, ℬ), where Ω is a complete and separable metric space and ℬ is the corresponding Borel σ-algebra. Let 𝒫 be a probability measure on (𝒢, 𝒞), where 𝒢 is the space of uncountable collections of random probability measures Gx, for x ∈ 𝒳, and 𝒞 is the corresponding Borel σ-algebra. Then, 𝒢𝒳 ~ 𝒫 denotes that 𝒫 is a prior on the random collection 𝒢𝒳.
We call 𝒫 a predictor-dependent stick-breaking prior if each Gx ∈ 𝒢𝒳 can be represented as:
(4) Gx(·) = Σ_{h=1}^{N(x)} ph(x) δ_{θh(x)}(·),
where the random weights ph(x) have a stick-breaking form, ph(x) and θh(x) are predictor-dependent, and N(x) is also indexed by the predictor value x. Depending on how we form ph(x), θh(x) and N(x), different dependencies among the Gx are induced. Several interesting priors, such as the DDP, πDDP and the prior proposed by Duan et al. (2005), fall into this class. In the next section, we propose a new choice of 𝒫, deemed the lDP.
3 Local Dirichlet process
3.1 Formulation
Formulating the lDP starts with obtaining the following three sequences of mutually independent global random components:
(5) Γ = {Γh, h = 1, …, ∞}, Γh ~ iid H,
V = {Vh, h = 1, …, ∞}, Vh ~ iid Beta(1, α),
Θ = {θh, h = 1, …, ∞}, θh ~ iid G0,
where Γ are locations, V are probability weights, and Θ are atoms. G0 is a probability measure on (Ω, ℬ), on which Gx will be defined, and H is a probability measure on (𝒮, 𝒜), where 𝒜 is a Borel σ-algebra of subsets of 𝒮 and 𝒮 is a Lebesgue measurable subset of a Euclidean space that may or may not correspond to the predictor space 𝒳. For a given predictor space 𝒳, we introduce the probability space (𝒮, 𝒜, H) such that it satisfies the following regularity condition, from which one can deduce 𝒳 ⊆ 𝒮:
Condition 1 For all x ∈ 𝒳 and ψ > 0, H{ηψ(x)} > 0, where ηψ(x) = {s ∈ 𝒮 : d(x, s) < ψ} is defined as a ψ-neighborhood around a point x, with d(·, ·) being some distance measure.
Next, focusing on a local predictor point x ∈ 𝒳, we define sets of local random components for x as:
(6) Γ(x) = {Γh : h ∈ Lx}, V(x) = {Vh : h ∈ Lx}, Θ(x) = {θh : h ∈ Lx}, Lx = {h : Γh ∈ ηψ(x)},
where Lx is a predictor-dependent set indexing the locations belonging to the ψ-neighborhood of x, ηψ(x), which is defined on 𝒮 by ψ and d(·, ·). Hence, the sets V(x) and Θ(x) contain the random weights and atoms that are assigned to the locations Γ(x) in ηψ(x). Here, ψ controls the neighborhood size. For simplicity, we treat ψ as fixed throughout the paper, though one can obtain a more flexible class of priors by assuming a hyperprior for ψ.
Using the local random components in (6), we consider the following form for Gx:
(7) Gx(·) = Σ_{l=1}^{N(x)} pl(x) δ_{θπl(x)}(·), pl(x) = Vπl(x) ∏_{m<l} {1 − Vπm(x)},
where N(x) is the cardinality of Lx and πl(x) is the lth ordered index in Lx. Then, Condition 1 ensures that the following lemma holds (refer to the proof of Lemma 1 in the Appendix).
Lemma 1 For all x ∈ 𝒳, N(x) = ∞ and Σ_{l=1}^{∞} pl(x) = 1 almost surely.
By Lemma 1, it is apparent that Gx formed as in (7) is a well-defined stick-breaking random probability measure for x. It is also straightforward that we can define Gx for all x ∈ 𝒳 by (6) and (7), using the global components in (5). Therefore, given (α, G0, H, ψ) and a choice of d(·, ·), the steps from (5) to (7) define a new choice of predictor-dependent SBP for 𝒢𝒳, deemed the lDP. We use the shorthand notation 𝒢𝒳 ~ lDP(α, G0, H, ψ) to denote that 𝒢𝒳 is assigned an lDP with hyperparameters α, G0, H, ψ.
Figure 1 illustrates the lDP formulation graphically for a case where 𝒳 = 𝒮 = [0, 1]², with H = Uniform([0, 1]²) and ψ = 0.2. For a simple illustration, we consider the Euclidean distance for d(·, ·) for bivariate predictors. Random locations in [0, 1]² are generated from the uniform distribution, with the first 100 locations plotted as '*' in Fig. 1. The random pair of weight and atom (Vh, θh) is placed at location Γh, with the first ten pairs labeled in Fig. 1. For a predictor value x = (0.5, 0.3)′, the red dashed circle indicates the neighborhood of x, ηψ(x). Then, Gx at x = (0.5, 0.3)′ is constructed using the weights and atoms within the dashed circle, in the order of their indices, to formulate the stick-breaking representation. For all other x ∈ 𝒳, Gx are formed following the same steps.
Fig. 1.
Graphical illustration of the lDP formulation. Black asterisks are the first 100 random locations generated on [0, 1]² from H = Uniform([0, 1]²). The red dashed circle indicates the neighborhood of the red crossed predictor point x = (0.5, 0.3)′, determined by the Euclidean distance d(·, ·) and ψ = 0.2. (Vh, θh), for h = 1, …, 10, are the first ten random pairs of weight and atom, assigned to the first ten random locations Γh, for h = 1, …, 10
From Fig. 1, it is apparent that the dependence between Gx and Gx′ increases as the distance between x and x′ decreases. For closer x and x′, their neighborhoods overlap more, so that similar components are used in constructing Gx and Gx′, while if x and x′ are far apart, there will be at most a small area of intersection, so that few or none of the random components are shared. In the non-overlapping case, Gx and Gx′ are assigned independent DP priors, as is clear from Theorem 1 and the subsequent development.
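As a concrete companion to Fig. 1, the following Python sketch (names ours) draws the global components in (5) and forms the local stick-breaking measure (7) at a single predictor value; since a finite draw cannot carry the infinite construction, the last local stick is set to one as a finite-N truncation device in the spirit of Sect. 3.3.

```python
import numpy as np

def ldp_measure_at_x(x0, Gamma, V, theta, psi):
    """Local weights and atoms of G_x in (7) at one predictor value x0.

    Gamma: (N, q) global locations from H; V: (N,) Beta(1, alpha) sticks;
    theta: (N,) atoms from G0.  The neighborhood L_x collects the indices
    with d(x0, Gamma_h) < psi (Euclidean), in increasing index order; the
    last local stick is set to 1 as a finite-N truncation device."""
    idx = np.where(np.linalg.norm(Gamma - x0, axis=1) < psi)[0]
    if idx.size == 0:
        raise ValueError("empty neighborhood: increase N or psi")
    Vx = V[idx].copy()
    Vx[-1] = 1.0
    p = Vx * np.concatenate(([1.0], np.cumprod(1.0 - Vx[:-1])))
    return p, theta[idx]

rng = np.random.default_rng(1)
N, alpha, psi = 500, 1.0, 0.2
Gamma = rng.uniform(0.0, 1.0, size=(N, 2))   # H = Uniform([0, 1]^2), as in Fig. 1
V = rng.beta(1.0, alpha, size=N)
theta = rng.normal(0.0, 1.0, size=N)         # G0 = N(0, 1), for illustration
p, atoms = ldp_measure_at_x(np.array([0.5, 0.3]), Gamma, V, theta, psi)
```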
Theorem 1 If 𝒢𝒳 ~ lDP(α, G0, H, ψ), then for any x ∈ 𝒳, Gx ~ DP(αG0).
The marginal DP property shown in Theorem 1 is appealing in allowing one to rely directly on the rich literature on properties of the DP to obtain insight into the prior for the random probability measure at any particular predictor value. However, unlike the DP, the lDP allows the probability measure to vary with predictors, while borrowing information across local regions of the predictor space. This is accomplished through incorporating shared random components. Due to the sharing and to the almost sure discreteness property of each Gx, the lDP will induce local clustering of subjects according to their predictor values. Theorem 2 illustrates this local clustering property more clearly.
Theorem 2 Suppose 𝒢𝒳 ~ lDP(α, G0, H, ψ) and φi ~ Gxi, for i = 1, …, n, with xi denoting the predictor value for subject i. Then,

Pr(φi = φj | α) = κxi,xj = 2Pxi,xj / {α(1 + Pxi,xj) + 2},
where Pxi,xj = H{ηψ(xi) ∩ ηψ(xj)} / H{ηψ(xi) ∪ ηψ(xj)} is the conditional probability of Γh falling within the intersection region given Γh ∈ ηψ(xi) ∪ ηψ(xj).
The clustering probability κxi,xj increases from 0, when Pxi,xj = 0, to 1/(α + 1), when xi = xj, which is the case of Pxi,xj = 1. This implies that, for fixed α, the clustering probability under the lDP is bounded above by the clustering probability under the global DP, which takes Pxi,xj = 1, leading to Pr(φi = φj | α) = 1/(α + 1). Also, note that small values of the precision parameter α will induce Vh values that are close to one. This in turn causes a small number of atoms in each neighborhood to dominate, inducing few local clusters. However, when ψ is small, and hence neighborhood sizes are small, there will still be many clusters across 𝒳.
It is interesting to consider relationships between the lDP and other priors proposed in the literature in limiting special cases. First, note that the lDP converges to the DP as ψ → ∞, so that all the neighborhoods around each of the predictor values encompass the entire predictor space. Also, the lDP(α, G0, H, ψ) corresponds to a limiting case of the kernel stick-breaking process (KSBP) (Dunson and Park 2008) in which the kernel is defined as K(x, Γ) = 1{d(x, Γ) < ψ} and the DP random measures placed at each location have precision parameters converging to zero, so that each reduces to a point mass at a single atom.
3.2 Moments and correlation
From Theorem 1 and properties of the DP, 𝒢𝒳 ~ lDP(α, G0, H, ψ) implies, for any x ∈ 𝒳 and B ∈ ℬ,

(8) E{Gx(B)} = G0(B), Var{Gx(B)} = G0(B){1 − G0(B)}/(α + 1).
Next, let us consider the correlation between Gx1 and Gx2, for any x1, x2 ∈ 𝒳. First, we show the correlation conditionally on the locations Γ, but marginalizing out the weights V and atoms Θ. As discussed in Sect. 3.1, if Γ is given, the lDP can be regarded as a special case of the πDDP. Hence, following Theorem 1 in Griffin and Steel (2006), for any x1, x2 ∈ 𝒳 and B ∈ ℬ,
(9) corr{Gx1(B), Gx2(B) | Γ} = Σ_{h∈S} {2/(α + 2)} {α/(α + 2)}^{#Sh} {α/(α + 1)}^{#NSh},
where #S is the cardinality of the set S = {h : Γh ∈ ηψ(x1) ∩ ηψ(x2)}, #Sh = #{l ∈ S : l < h}, and #NSh = #{l ∉ S : l < h, Γl ∈ ηψ(x1) ∪ ηψ(x2)}, for h ∈ S. In other words, #Sh is the number of indices on the locations Γ that are below h and are shared by the neighborhoods of x1 and x2, while #NSh is the number of indices that are below h and belong to the neighborhood of either x1 or x2 but not both. For a given h, reducing #Sh by one (holding the neighborhood sizes fixed) induces adding two elements to #NSh, thus reducing the correlation, as expected. From expression (9), it is clear that the neighborhoods around x1 and x2 overlap increasingly, and the correlation between Gx1 and Gx2 increases, as x1 → x2. Expression (9) is particularly useful in being free of dependence on B.
Marginalizing the correlation in (9) over the prior for the random locations Γ is equivalent to marginalizing out the #Sh and #NSh for h ∈ S. In considering the correlation between Gx1 and Gx2, we can ignore the Γh for h ∉ U and focus on the Γh only for h ∈ U, where U = {h : Γh ∈ ηψ(x1) ∪ ηψ(x2)}. Let γj be the jth ordered component of U. For example, if U = {1, 3, 5, 6, …}, then γ1 = 1, γ2 = 3, γ3 = 5, γ4 = 6, …. Let Zγj be an indicator of whether Γγj is shared by the neighborhoods of x1 and x2 or not. Then, the formula in (9) can be re-expressed with respect to Zγj as follows:
(10) corr{Gx1(B), Gx2(B) | Z} = Σ_{j=1}^{∞} Zγj {2/(α + 2)} ∏_{k<j} {α/(α + 2)}^{Zγk} {α/(α + 1)}^{1−Zγk},
Note that it is straightforward to show that Zγj ~ iid Bernoulli(Px1,x2), for j = 1, …, ∞, with Px1,x2 = H{ηψ(x1) ∩ ηψ(x2)} / H{ηψ(x1) ∪ ηψ(x2)} the conditional probability of Γh falling within the intersection region given Γh ∈ ηψ(x1) ∪ ηψ(x2). Finally, marginalizing out the Zγj results in the following theorem.
Theorem 3 If 𝒢𝒳 ~ lDP(α, G0, H, ψ), then for any x1, x2 ∈ 𝒳 and any B ∈ ℬ,

ρx1,x2 = corr{Gx1(B), Gx2(B)} = 2Px1,x2(α + 1) / {α(1 + Px1,x2) + 2}.
The correlation is expressed only in terms of Px1,x2 and α. Regardless of α, the correlation is 1 if x1 = x2, which implies that the neighborhoods around x1, x2 are identical and Px1,x2 = 1. Also, the correlation is 0 when the neighborhoods are non-overlapping, with Px1,x2 = 0. In addition, Px1,x2 ≤ ρx1,x2 ≤ 1, and ρx1,x2 increases as α increases for fixed Px1,x2. When α → 0, the correlation converges to Px1,x2. Meanwhile, when α → ∞, the correlation converges to 2Px1,x2/(1 + Px1,x2).
Note that Px1,x2 depends on H, ψ, and the locations x1 and x2, given a choice of d(·, ·). When the support 𝒮 of H is chosen to satisfy Condition 2, some appealing properties result.
Condition 2 For all x ∈ 𝒳, with 𝒳 being p-dimensional, ηψ(x) ⊂ 𝒮.
From Condition 2, one can deduce that 𝒮 contains all the points within distance ψ of x, for any x ∈ 𝒳. Under Condition 2, with H chosen to be a uniform probability measure on a bounded space 𝒮, Px1,x2 depends only on ψ and d(x1, x2), the distance between x1 and x2, but not on the exact locations of x1 and x2 in 𝒳. Hence, upon examination of Theorem 3, it is apparent that Condition 2 implies an isotropic correlation structure, which is an appealing default in the absence of prior knowledge of changes in the correlation structure according to the locations in 𝒳. Figure 2 shows how the correlation ρx1,x2 changes as a function of d(x1, x2) in the case where 𝒳 = [0, 1] and H is Uniform([−ψ, 1 + ψ]), so that Condition 2 holds, for different ψ, with d(·, ·) the Euclidean distance. The correlation ρx1,x2 decays from 1 to 0 as d(x1, x2) increases, and the decay is faster for smaller ψ. As ψ → ∞, the decay curve approaches a horizontal line at ρx1,x2 = 1, which is the case where the lDP reduces to the DP. Also, for given ψ and d(x1, x2), ρx1,x2 is higher as α → ∞. Although the choice of d(·, ·) as Euclidean makes the curves in Fig. 2 close to linear, the curvature can easily be changed by choosing a different distance measure d(·, ·).
Fig. 2.
Change in correlation ρx1,x2 over the change in distance d(x1, x2) for different α and ψ: α = 0.0001 (red dashed), α = 1 (blue dot-dashed), α = 10 (green dotted), α = 10,000 (black solid)
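The decay in Fig. 2 can be reproduced numerically. The sketch below (names ours) computes Px1,x2 in closed form for this univariate uniform-H setting, where the neighborhoods are intervals of radius ψ, and plugs it into the Theorem 3 expression as reconstructed above.

```python
import numpy as np

def overlap_prob(d, psi):
    """P_{x1,x2} for the univariate example of Fig. 2: with H uniform and
    neighborhoods the intervals of radius psi around x1 and x2, P is the
    length of their intersection over the length of their union,
    max(0, 2*psi - d) / (2*psi + d), which hits 0 once d >= 2*psi."""
    d = np.asarray(d, dtype=float)
    return np.maximum(2.0 * psi - d, 0.0) / (2.0 * psi + d)

def ldp_corr(P, alpha):
    """Theorem 3 correlation as given in the text:
    rho = 2 P (alpha + 1) / (alpha (1 + P) + 2), so that rho -> P as
    alpha -> 0 and rho -> 2P / (1 + P) as alpha -> infinity."""
    return 2.0 * P * (alpha + 1.0) / (alpha * (1.0 + P) + 2.0)

d = np.linspace(0.0, 0.5, 6)
for alpha in (1e-4, 1.0, 10.0, 1e4):  # the four values plotted in Fig. 2
    print(alpha, np.round(ldp_corr(overlap_prob(d, psi=0.2), alpha), 3))
```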
3.3 Truncation approximation
Finite approximations to infinite SBPs form the basis for commonly used computational algorithms (Ishwaran and James 2001). In this subsection, we discuss a finite dimensional approximation to the lDP.
Since the lDP has the marginal DP property, let us recall the finite dimensional DP. Ishwaran and James (2001) define an N-truncation of the DP (DPN) by discarding the N + 1, N + 2, …, ∞ terms and replacing pN with 1 − Σ_{h=1}^{N−1} ph in the DP stick-breaking form in (3). They show that the DPN approximates the DP well in terms of the total variation (tv) norm of the marginal densities of the data obtained from the corresponding DPM models. According to their Theorem 2,
(11) ||μN − μ∞|| ≤ 4[1 − E{(Σ_{h=1}^{N−1} ph)^n}] ≈ 4n exp{−(N − 1)/α},
where || · || is the tv norm, μN and μ∞ are the marginal probability measures for the data from the DPMN and DPM models, and n is the sample size. Note that the sample size has a modest effect on the bound for a reasonably large value of N, and the bound decreases exponentially as N increases, implying that even for a fairly large sample size, the DPMN approximates the DPM well with moderate N.
Following a similar route, let us define an N-truncation of the lDP (lDPN) as follows:
Definition 2 For a finite N, let ΓN = {Γh, h = 1, …, N}, VN = {Vh, h = 1, …, N}, and ΘN = {θh, h = 1, …, N} be the sets of global random locations, weights, and atoms, respectively. Distributional assumptions for Γh, Vh, and θh are the same as in (5), and the corresponding local sets are defined as in (6). Then, 𝒢𝒳 ~ lDPN(α, G0, H, ψ) if

Gx(·) = Σ_{l=1}^{N(x)} pl(x) δ_{θπl(x)}(·), pl(x) = Vπl(x) ∏_{m<l} {1 − Vπm(x)},

where the last local stick VπN(x)(x) is set to 1 so that the N(x) local weights sum to one for every x ∈ 𝒳.
The Gx in Definition 2 has a similar form to the G obtained from the DPN, except that N in G is replaced by N(x) in Gx, and N in the DPN is fixed while N(x) in the lDPN is random. Focusing on a particular predictor value x, it is easy to show that N(x) ~ Binomial(N, Px), where N is the total number of global locations in the lDPN and Px = H{ηψ(x)} is the probability that a location belongs to the neighborhood around x, ηψ(x). Then, marginalizing out N(x) in the bound on the tv distance between the marginal densities of an observation obtained at a particular predictor value x from the lDPM and lDPMN models results in Theorem 4.
Theorem 4 Define model (2) with 𝒢𝒳 ~ lDP(α, G0, H, ψ) as the local Dirichlet process mixture (lDPM) model; lDPMN corresponds to (2) with 𝒢𝒳 ~ lDPN(α, G0, H, ψ). Suppose an observation is obtained from the lDPMN and lDPM models at x. Then,

||μN(x) − μ∞(x)|| ≤ 4{(α + 1)/α}{1 − Px/(α + 1)}^N,

where μN(x) and μ∞(x) are the marginal probability measures for the observation. Notice that the bound decreases exponentially as N increases, suggesting that we can obtain a good approximation to the lDP using a moderate N, as long as α is small and the neighborhood size is not too small. In particular, we require a larger N for a given level of accuracy as ψ → 0, since Px decreases as the size of ηψ(x) decreases.
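In practice, one can invert the bound numerically to pick a truncation level. A small sketch (names ours), assuming the Theorem 4 bound as stated above:

```python
import numpy as np

def ldp_truncation_bound(N, alpha, Px):
    """Theorem 4 bound as given in the text:
    4 ((alpha + 1) / alpha) (1 - Px / (alpha + 1))**N, obtained by
    averaging the DP_N bound over N(x) ~ Binomial(N, Px)."""
    return 4.0 * (alpha + 1.0) / alpha * (1.0 - Px / (alpha + 1.0)) ** N

def smallest_N(alpha, Px, eps=1e-4):
    """Smallest truncation level whose bound falls below eps."""
    N = 1
    while ldp_truncation_bound(N, alpha, Px) > eps:
        N += 1
    return N

# Smaller neighborhoods (smaller Px) demand a larger N for the same accuracy:
for Px in (0.5, 0.1, 0.02):
    print(Px, smallest_N(alpha=1.0, Px=Px))
```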
4 Posterior computation
We develop an MCMC algorithm based on the blocked Gibbs sampler (Ishwaran and James 2001) for an lDPMN model. For simplicity in exposition, we describe a Gibbs sampling algorithm for a particular hierarchical model, though the approach can be easily adapted for computation in a broad variety of other settings. We let
(12) yi | xi, βi, τ ~ f(yi | xi, βi, τ), βi ~ Gxi, 𝒢𝒳 = {Gx : x ∈ 𝒳} ~ lDPN(α, G0, H, ψ),
where f(yi | xi, βi, τ) = N(yi; x̃i′βi, τ⁻¹), with x̃i the p × 1 design vector for subject i and βi = (βi1, …, βip)′. For simplicity, we consider a univariate predictor case where p = 2 and x̃i = (1, xi)′, with d(·, ·) the Euclidean distance, but the generalization to multiple predictors or to a different distance metric is straightforward. G0 is assumed to be Np(μβ, Σβ), H is assumed to be Uniform(aΓ, bΓ), and additional conjugate priors are assigned for τ, α, μβ and Σβ.
Let Ki be an indicator variable such that Ki = h implies that the ith subject is assigned to the hth mixture component. Then, the hierarchical structure of the model (12) with respect to the random variables is recast as follows.
(13) yi | Ki, β*, τ ~ N(x̃i′β*Ki, τ⁻¹),
Pr{Ki = πl(xi) | V, Γ} = Vπl(xi) ∏_{m<l} {1 − Vπm(xi)}, l = 1, …, N(xi),
β*h ~ iid Np(μβ, Σβ), Vh ~ iid Beta(1, α), Γh ~ iid Uniform(aΓ, bΓ), h = 1, …, N,
where β* = {β*h, h = 1, …, N}, K = {Ki, i = 1, …, n}, V = {Vh, h = 1, …, N}, and Γ = {Γh, h = 1, …, N}. The full conditionals for each of the random components are based on the following joint distribution.
(14) ∏_{i=1}^{n} [N(yi; x̃i′β*Ki, τ⁻¹) Pr{Ki | V, Γ}] ∏_{h=1}^{N} [Beta(Vh; 1, α) Np(β*h; μβ, Σβ) Uniform(Γh; aΓ, bΓ)] π(τ) π(α) π(μβ) π(Σβ).
Then, the Gibbs sampler proceeds by sampling from the following conditional posterior distributions:
- Conditional for Ki, i = 1, …, n
- Conditional for Vh, h = 1, …, N
- Conditional for Γh, h = 1, …, N
- Conditional for β*h, h = 1, …, N, where yh is the nh × 1 response vector and Xh is the nh × p design matrix for the subjects with Ki = h, and nh is the number of such subjects
- Conditional for μβ
- Conditional for Σβ
- Conditional for τ
- Conditional for α
Note that this Gibbs sampling algorithm consists only of simple steps for sampling from standard distributions and is no more complex than blocked Gibbs samplers for DPMs. In addition, we have observed good computational performance, in terms of mixing and convergence rates, in simulated and real data applications.
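To make these steps concrete, the following is a minimal, self-contained Python/numpy sketch of a sampler of this type, not the authors' exact algorithm, under simplifying assumptions we flag explicitly: α, μβ = 0, and Σβ = I are held fixed rather than given hyperpriors; the Γh update is a Metropolis step standing in for the exact full conditional; and the Vh update ignores the local truncation of the last stick in each neighborhood. All function and variable names are ours.

```python
import numpy as np

def local_weights(xi, Gamma, V, psi):
    """Neighborhood indices L_x (increasing index order) and the local
    stick-breaking weights p_l(x) of (7), with the last local stick set
    to 1 as the finite-N truncation."""
    idx = np.where(np.abs(Gamma - xi) < psi)[0]
    if idx.size == 0:
        return idx, np.empty(0)
    Vx = V[idx].copy()
    Vx[-1] = 1.0
    return idx, Vx * np.concatenate(([1.0], np.cumprod(1.0 - Vx[:-1])))

def ldpm_gibbs(y, x, X, n_iter, N=50, psi=0.05, alpha=1.0,
               nu1=0.01, nu2=0.01, a_gam=-0.05, b_gam=1.05, seed=0):
    """Blocked Gibbs sketch for the lDPM_N model (12), holding alpha,
    mu_beta = 0 and Sigma_beta = I fixed for brevity."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Sb_inv = np.eye(p)
    Gamma = rng.uniform(a_gam, b_gam, N)              # global locations
    V = rng.beta(1.0, alpha, N)                       # global sticks
    beta = rng.multivariate_normal(np.zeros(p), np.eye(p), N)
    tau, K = 1.0, np.zeros(n, dtype=int)

    def log_prior_K(K_, Gamma_):
        lp = 0.0
        for i in range(n):
            idx, pw = local_weights(x[i], Gamma_, V, psi)
            pos = np.flatnonzero(idx == K_[i])
            if pos.size == 0:
                return -np.inf                        # K_i outside its neighborhood
            lp += np.log(pw[pos[0]] + 1e-300)
        return lp

    for _ in range(n_iter):
        # 1. allocations K_i: local prior weights times normal likelihood
        for i in range(n):
            idx, pw = local_weights(x[i], Gamma, V, psi)
            if idx.size == 0:                         # pragmatic fallback, not in the paper
                K[i] = np.argmin(np.abs(Gamma - x[i]))
                continue
            ll = -0.5 * tau * (y[i] - X[i] @ beta[idx].T) ** 2
            w = pw * np.exp(ll - ll.max())
            K[i] = rng.choice(idx, p=w / w.sum())
        # 2. sticks V_h ~ Beta(1 + m_h, alpha + r_h); r_h counts subjects
        # whose neighborhood contains h but whose K_i has a later index
        # (ignoring the local-truncation correction for the last stick)
        for h in range(N):
            in_nbhd = np.abs(Gamma[h] - x) < psi
            V[h] = rng.beta(1 + np.sum(K == h),
                            alpha + np.sum(in_nbhd & (K > h)))
        # 3. locations Gamma_h: Metropolis stand-in for the exact conditional
        for h in range(N):
            prop = Gamma.copy()
            prop[h] = rng.uniform(a_gam, b_gam)
            if np.log(rng.uniform()) < log_prior_K(K, prop) - log_prior_K(K, Gamma):
                Gamma = prop
        # 4. coefficients beta_h: conjugate multivariate normal update
        for h in range(N):
            Xh, yh = X[K == h], y[K == h]
            Cov = np.linalg.inv(Sb_inv + tau * Xh.T @ Xh)
            beta[h] = rng.multivariate_normal(Cov @ (tau * Xh.T @ yh), Cov)
        # 5. precision tau: conjugate gamma update
        resid = y - np.einsum('ij,ij->i', X, beta[K])
        tau = rng.gamma(nu1 + 0.5 * n, 1.0 / (nu2 + 0.5 * resid @ resid))
    return K, V, Gamma, beta, tau

# Example call: ldpm_gibbs(y, x, np.column_stack([np.ones_like(x), x]), n_iter=500)
```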
5 Simulation examples
We obtained data from two simulated examples, where n = 500 and a univariate predictor xi was simulated from Uniform(0,1). Case 1 was a null case where yi was generated from a normal regression model N(yi; −1 + 2xi, 0.01). Case 2 was a mixture of two normal linear regression models, with the mixture weights depending on the predictor, with the error variance differing, and with a nonlinear mean function for the second component:
(15) yi ~ e^{−2xi} N(yi; xi, 0.01) + (1 − e^{−2xi}) N(yi; xi⁴, 0.04).
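Data for the two cases can be generated as below; this is a minimal sketch (names ours), with the Case 2 specification following (15) as given above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0.0, 1.0, n)

# Case 1: the null case, a single normal linear regression
y1 = rng.normal(-1.0 + 2.0 * x, np.sqrt(0.01))

# Case 2: predictor-dependent mixture of a linear and a nonlinear
# component, following (15) as given above
w = np.exp(-2.0 * x)                        # weight on the first component
z = rng.uniform(0.0, 1.0, n) < w            # component indicators
y2 = np.where(z, rng.normal(x, np.sqrt(0.01)),
                 rng.normal(x ** 4, np.sqrt(0.04)))
```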
We applied the lDPMN model in (12) to the simulated data with N = 50. Based on the results, N = 50 appears to be large enough, since the clusters with higher indices are either not occupied by any subject or are occupied by only a small proportion of them. Also, repeating the analysis with twice the value of N, we obtained very similar results, suggesting that the results are robust to the choice of N, as long as N is not chosen too small.
For the hyperparameters, we let ν1 = ν2 = 0.01, η1 = η2 = 2, ν0 = p, Σ0 = Ip, μ0 = 0, Σμ = n(X′X)−1, aΓ = −0.05, and bΓ = 1.05. The neighborhood size ψ = 0.05 was chosen such that the average number of subjects belonging to the neighborhoods around each predictor value in the sample is ≈n/10. We analyzed the simulated data using the proposed Gibbs sampling algorithm run for 10,000 iterations with a 5,000 iteration burn-in. The convergence and mixing of the MCMC algorithm were good (trace plots not shown). Also, results tended to be robust to repeating analysis with reasonable alternative hyperparameter values.
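The rule of thumb for ψ above is easy to operationalize; a small sketch (function names and grid ours) that picks the smallest ψ on a grid whose average neighborhood contains the target number of sample points:

```python
import numpy as np

def avg_neighborhood_count(x, psi):
    """Average number of sample predictor values in the psi-neighborhoods
    centered at each observed x_i (the point itself included)."""
    d = np.abs(x[:, None] - x[None, :])
    return (d < psi).sum(axis=1).mean()

def choose_psi(x, target):
    """Smallest psi on a grid whose average neighborhood count reaches
    the target (n / 10 in the paper's rule of thumb)."""
    for psi in np.linspace(0.01, 0.5, 200):
        if avg_neighborhood_count(x, psi) >= target:
            return psi
    return 0.5

# e.g. psi = choose_psi(x, target=len(x) / 10)
```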
For Case 1, as shown in Fig. 3, the predictive mean regression curve (blue dashed, right bottom panel), the true linear regression function (red solid), and the pointwise 95% credible intervals (green dashed) were almost the same. Figure 3 also shows the predictive densities (blue dashed) at the 10th, 25th, 50th, 75th, and 90th sample percentiles of xi, with these densities almost indistinguishable from the true densities (red solid).
Fig. 3.
Results for simulation Case 1: true conditional densities of y|x (red solid), predictive conditional densities (blue dot-dashed), and 95% pointwise credible intervals (green dashed). The lower right panel shows the data (black dots), along with true (red solid) and estimated mean (blue dashed) regression curves superimposed with 95% credible line (green dashed)
For Case 2, Fig. 4 shows an x – y plot (right bottom panel) of the data along with the estimated predictive mean curve (blue dashed), which closely follows the true mean curve (red solid). Figure 4 also shows the estimated predictive densities (blue dashed) correspond approximately to the true densities (red solid) in most cases and the 95% credible intervals (green dashed) closely cover the true densities in all cases.
Fig. 4.
Results for simulation Case 2: true conditional densities of y|x (red solid), predictive conditional densities (blue dot-dashed), and 95% pointwise credible intervals (green dashed). The lower right panel shows the data (black dots), along with true (red solid) and estimated mean (blue dashed) regression curves superimposed with 95% credible line (green dashed)
Repeating the analysis for Case 2, but with βi ~ G and G ~ DP(αG0), we obtained poor results (estimates diverged substantially from the true densities, and the posterior mean curve failed to capture the true nonlinear function), suggesting that a DPM model is inadequate.
6 Epidemiological application
6.1 Background and motivation
In diabetic studies, interest often focuses on the relationship between 2-h serum insulin levels (an indicator of insulin sensitivity/resistance) and 2-h plasma glucose levels (an indicator of diabetic risk), both measured in the oral glucose tolerance test (OGTT). Although most studies examine the mean change of the 2-h insulin versus the 2-h glucose, it would be more interesting to assess the whole distributional change of the 2-h insulin level across the range of the 2-h glucose levels.
We obtained data from a study that has followed a sample of Pima Indians from a population near Phoenix, Arizona since 1965. This study was conducted by the National Institute of Diabetes and Digestive and Kidney Diseases, with the Pima Indians chosen because of their high risk of diabetes. Using these data, our goal is to conduct inference on changes in the 2-h serum insulin distribution with changes in the 2-h glucose level, without making restrictive assumptions such as normality or constant residual variation. Certainly, it is biologically plausible that the insulin distribution is non-normal and should change as the glucose level changes, not only in mean but also in other features such as skewness, residual variation, and modality.
6.2 Analysis and results
For woman i (i = 1, …, 393), let yi correspond to the 2-h serum insulin level measured in μU/ml (micro units per milliliter) and let xi denote the 2-h plasma glucose level measured in mg/dl (milligrams per deciliter). We applied the lDPMN model described in (12), after scaling y and x by dividing by 100. Hyperparameters were set to be the same as in the simulation study, except that ψ = 0.08, such that n/10 subjects belong to each neighborhood on average, and aΓ = min(xi) − ψ and bΓ = max(xi) + ψ, so that edge effects are avoided in the inference. We analyzed the data using the proposed Gibbs sampling algorithm run for 10,000 iterations with a 5,000 iteration burn-in. The convergence and mixing of the MCMC algorithm were good (trace plots not shown), and results were robust to reasonable alternative hyperparameter values.
Figure 5 shows the predictive distributions for the insulin level at various empirical percentiles of the glucose level. As the glucose level increases, there is a slightly nonlinear change in the mean insulin level (right bottom panel) and a dramatic increase in the heaviness of the right tail of the insulin distribution. Also, some multi-modality in the insulin distribution appears as the glucose level falls into the pre-diabetes range (140–200 mg/dl) and approaches the cut point (200 mg/dl) for the diagnosis of diabetes. This shift in the shape of the insulin distribution implies biologically that women with pre-diabetes are expected to have different insulin sensitivities, which may further induce different diabetic risks even at the same glucose level. This may be due to unadjusted covariates or unmeasured risk factors. Such distributional changes in a response induced by predictors (e.g., risk factors, exposures, and treatments) are pervasive in epidemiologic studies, but are not at all well characterized by standard regression models that do not allow the whole distribution to change flexibly with predictors.
Fig. 5.
Results for Pima Indian Example: predictive conditional densities (blue dot-dashed), and 95% pointwise credible intervals (green dashed). The lower right panel shows the data (black dots), along with estimated mean (blue dashed) regression curves superimposed with 95% credible line (green dashed)
7 Discussion
This article proposed a new SBP for a collection of predictor-dependent random probability measures. The prior, called the lDP, is a useful alternative to recently developed priors that induce predictor dependence among distributions. Its marginal DP structure should be useful in considering theoretical properties, such as posterior consistency and rates of convergence. A related formulation was independently developed by Griffin and Steel (2008), although the lDP is appealing in its simplicity of construction and computation. In particular, the construction is intuitive and leads to simple expressions for the dependence between random measures at different locations, while also leading to straightforward posterior computation relying on a truncation approximation with a fair degree of accuracy.
Although we have focused on a conditional density estimation application, there are many interesting applications of the lDP to be considered in future work. First, the DP is widely used to induce a prior on a random partition or clustering structure (Quintana 2006; Kim et al. 2006). In such settings, the DP has the potential disadvantage of requiring an exchangeability assumption, which may be violated when predictors are available that can inform about the clustering. The lDP provides a straightforward mechanism for local, predictor-dependent clustering, which can be used as an alternative to product partition models (Quintana and Iglesias 2003) and model-based clustering approaches (Fraley and Raftery 2002). It is of interest to explore the theoretical properties of the induced prior on the random partition. In this respect, it is likely that the hyperparameter ψ plays a key role. Hence, as a more robust data-driven approach one may consider fully Bayes or empirical Bayes methods for allowing uncertainty in ψ.
Acknowledgments
This research was supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.
Appendix
Proof of Lemma 1 An infinite number of locations Γ = {Γh, h = 1, …, ∞} are generated from H on 𝒮. Any ψ-neighborhood of x, defined as ηψ(x) with ψ > 0, is a subset of 𝒮. The regularity Condition 1 for H ensures that there is a positive probability of a location Γh being generated in any ηψ(x). Therefore, there are also an infinite number of locations in ηψ(x), for all x ∈ 𝒳 and ψ > 0, which implies N(x) = ∞. Then, Σ_{l=1}^{∞} pl(x) = 1 almost surely, by Lemma 1 in Ishwaran and James (2001).
Proof of Theorem 1 Assume that 𝒢𝒳 ~ lDP(α, G0, H, ψ). Then, from the definition of the lDP in (5)–(7), we can re-express (7) as Gx(·) = Σ_{l=1}^{∞} pl(x) δ_{θl(x)}(·), where pl(x) = Vl(x) ∏_{m<l} {1 − Vm(x)}, Vl(x) is the lth element of V(x), and θl(x) is the lth element of Θ(x). Note that it follows from the proof of Lemma 1 that N(x) = ∞. Since the random weights and atoms are generated by iid sampling from Beta(1, α) and G0, respectively, independently of the locations, we have Vl(x) ~ iid Beta(1, α) independently from θl(x) ~ iid G0, for l = 1, …, ∞. Hence, it follows directly from Sethuraman's (1994) representation of the DP that Gx ~ DP(αG0).
Proof of Theorem 2 Given Γ and V,

Pr(φi = φj | Γ, V) = Σ_{h∈S} Vh² ∏_{l<h, l∈S} (1 − Vl)² ∏_{l<h, l∈NS} (1 − Vl),

where S indexes the locations shared by the neighborhoods of xi and xj and NS indexes those belonging to exactly one of the two neighborhoods. For the definitions of #Sh and #NSh, refer to Eq. (9) in Sect. 3.2. Marginalizing out V over the Beta distribution,

Pr(φi = φj | Γ, α) = Σ_{h∈S} [2/{(α + 1)(α + 2)}] {α/(α + 2)}^{#Sh} {α/(α + 1)}^{#NSh}.

In order to marginalize out #Sh and #NSh, we introduce Zγj as described in the formulations from (9) to (10) in Sect. 3.2. Then,

Pr(φi = φj | Z, α) = Σ_{j=1}^{∞} Zγj [2/{(α + 1)(α + 2)}] ∏_{k<j} {α/(α + 2)}^{Zγk} {α/(α + 1)}^{1−Zγk}.

After marginalizing out the Zγj as in the proof of Theorem 3, we obtain:

Pr(φi = φj | α) = 2Pxi,xj / {α(1 + Pxi,xj) + 2}.
Proof of Theorem 3 From (10),

corr{Gx1(B), Gx2(B) | Z} = Σ_{j=1}^{∞} Zγj {2/(α + 2)} ∏_{k<j} {α/(α + 2)}^{Zγk} {α/(α + 1)}^{1−Zγk},

where Zγj are iid draws from Bernoulli(Px1,x2). Taking the expectation of corr{Gx1(B), Gx2(B) | Z} with respect to Bernoulli(Px1,x2),

E[corr{Gx1(B), Gx2(B)}] = Σ_{j=1}^{∞} Px1,x2 {2/(α + 2)} E[{α/(α + 2)}^{Yj} {α/(α + 1)}^{j−1−Yj}],

where Yj ~ Binomial(j − 1, Px1,x2). Using the Binomial Theorem, the expectation on the right is marginalized out with respect to Binomial(j − 1, Px1,x2), which results in

E[corr{Gx1(B), Gx2(B)}] = {2Px1,x2/(α + 2)} Σ_{j=1}^{∞} [Px1,x2 α/(α + 2) + (1 − Px1,x2) α/(α + 1)]^{j−1}.

Since Px1,x2 α/(α + 2) + (1 − Px1,x2) α/(α + 1) < 1, the infinite sum on the right converges. Then,

ρx1,x2 = {2Px1,x2/(α + 2)} × (α + 1)(α + 2)/{α(1 + Px1,x2) + 2} = 2Px1,x2(α + 1)/{α(1 + Px1,x2) + 2}.
Proof of Theorem 4 Due to the marginal DP property and using the inequality on the left in (11) with n = 1, we get ||μN(x) − μ∞(x)|| ≤ 4[1 − E{Σ_{h=1}^{N(x)−1} ph(x)}], where μN, μ∞, N in (11) are replaced by μN(x), μ∞(x), N(x), respectively, and n is substituted by 1. Here, N(x) is random, unlike N in (11). Conditioning on N(x) but marginalizing out the ph(x), we get ||μN(x) − μ∞(x)|| ≤ 4{α/(α + 1)}^{N(x)−1}. Note that N(x) ~ Binomial(N, Px), as discussed in Sect. 3.3. Then, using the Binomial Theorem, we obtain ||μN(x) − μ∞(x)|| ≤ 4{(α + 1)/α}{1 − Px/(α + 1)}^N.
Contributor Information
Yeonseung Chung, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave. Bldg 2, Room 435A, Boston, MA 02115, USA.
David B. Dunson, Department of Statistical Science, Duke University, 219A Old Chemistry Bldg, Box 90251, Durham, NC 27708, USA, dunson@stat.duke.edu
References
- Beal M, Ghahramani Z, Rasmussen C. The infinite hidden Markov model. In: Neural information processing systems, vol. 14. Cambridge: MIT Press; 2002.
- Blei D, Griffiths T, Jordan M, Tenenbaum J. Hierarchical topic models and the nested Chinese restaurant process. In: Neural information processing systems, vol. 16. Cambridge: MIT Press; 2004.
- Caron F, Davy M, Doucet A, Duflos E, Vanheeghe P. Bayesian inference for dynamic models with Dirichlet process mixtures. In: International conference on information fusion, Italy, July 10–13; 2006.
- De Iorio M, Müller P, Rosner GL, MacEachern SN. An ANOVA model for dependent random measures. Journal of the American Statistical Association. 2004;99:205–215.
- Dowse KG, Zimmet PZ, Alberti GMM, Bringham L, Carlin JB, Tuomlehto J, Knight LT, Gareeboo H. Serum insulin distributions and reproducibility of the relationship between 2-hour insulin and plasma glucose levels in Asian Indian, Creole, and Chinese Mauritians. Metabolism. 1993;42:1232–1241. doi: 10.1016/0026-0495(93)90119-9.
- Duan JA, Guindani M, Gelfand AE. Generalized spatial Dirichlet process models. ISDS Discussion Paper 05-23. Durham: Duke University; 2005.
- Dunson DB. Bayesian dynamic modeling of latent trait distributions. Biostatistics. 2006;7:551–568. doi: 10.1093/biostatistics/kxj025.
- Dunson DB, Park J-H. Kernel stick-breaking processes. Biometrika. 2008;95:307–323. doi: 10.1093/biomet/asn012.
- Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008;95:859–874. doi: 10.1093/biomet/asn043.
- Dunson DB, Pillai N, Park J-H. Bayesian density regression. Journal of the Royal Statistical Society, Series B. 2007;69:163–183.
- Escobar MD. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association. 1994;89:268–277.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90:577–588.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
- Ferguson TS. Prior distributions on spaces of probability measures. The Annals of Statistics. 1974;2:615–629.
- Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
- Gelfand AE, Kottas A, MacEachern SN. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association. 2004;100:1021–1035.
- Ghosal S, Van der Vaart AW. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics. 2007;35(2):697–723.
- Ghosal S, Ghosh JK, Ramamoorthi RV. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics. 1999;27:143–158.
- Griffin JE, Steel MFJ. Order-based dependent Dirichlet processes. Journal of the American Statistical Association. 2006;101:179–194.
- Griffin JE, Steel MFJ. Bayesian nonparametric modeling with the Dirichlet process regression smoother. Technical Report. University of Warwick; 2008.
- Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96:161–173.
- Kim S, Tadesse MG, Vannucci M. Variable selection in clustering via Dirichlet process mixture models. Biometrika. 2006;94:877–893.
- Lijoi A, Prünster I, Walker SG. On consistency of non-parametric normal mixtures for Bayesian density estimation. Journal of the American Statistical Association. 2005;100:1292–1296.
- Lo AY. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics. 1984;12:351–357.
- MacEachern SN. Dependent nonparametric processes. In: ASA proceedings of the section on Bayesian statistical science. Alexandria: American Statistical Association; 1999.
- MacEachern SN. Dependent Dirichlet processes. Unpublished manuscript. Department of Statistics, The Ohio State University; 2000.
- MacEachern SN. Decision theoretic aspects of dependent nonparametric processes. In: George E, editor. Bayesian methods with applications to science, policy and official statistics. Creta: ISBA; 2001. pp. 551–560.
- Müller P, Quintana F, Rosner G. A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society B. 2004;66:735–749.
- Pennell ML, Dunson DB. Bayesian semiparametric dynamic frailty models for multiple event time data. Biometrics. 2006;62:1044–1052. doi: 10.1111/j.1541-0420.2006.00571.x.
- Pitman J. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields. 1995;102:145–158.
- Pitman J. Some developments of the Blackwell-MacQueen urn scheme. In: Ferguson TS, Shapley LS, MacQueen JB, editors. Statistics, probability and game theory. IMS Lecture Notes-Monograph Series, vol. 30. Hayward: Institute of Mathematical Statistics; 1996. pp. 245–267.
- Quintana FA. A predictive view of Bayesian clustering. Journal of Statistical Planning and Inference. 2006;136:2407–2429.
- Quintana FA, Iglesias PL. Bayesian clustering and product partition models. Journal of the Royal Statistical Society B. 2003;65:557–574.
- Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
- Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the symposium on computer applications in medical care; 1988. pp. 261–265.
- Xing EP, Sharan R, Jordan M. Bayesian haplotype inference via the Dirichlet process. In: Proceedings of the international conference on machine learning (ICML); 2004.