Bayesian isotonic density regression

Lianming Wang; David B Dunson

doi:10.1093/biomet/asr025

Biometrika. 2011 Sep;98(3):537–551. doi: 10.1093/biomet/asr025

PMCID: PMC3384359 PMID: 22822259

Abstract

Density regression models allow the conditional distribution of the response given predictors to change flexibly over the predictor space. Such models are much more flexible than nonparametric mean regression models with nonparametric residual distributions, and are well supported in many applications. A rich variety of Bayesian methods have been proposed for density regression, but it is not clear whether such priors have full support so that any true data-generating model can be accurately approximated. This article develops a new class of density regression models that incorporate stochastic-ordering constraints which are natural when a response tends to increase or decrease monotonely with a predictor. Theory is developed showing large support. Methods are developed for hypothesis testing, with posterior computation relying on a simple Gibbs sampler. Frequentist properties are illustrated in a simulation study, and an epidemiology application is considered.

Keywords: Conditional density estimation, Dependent Dirichlet process, Hypothesis test, Isotonic regression, Nonparametric Bayes, Quantile regression, Stochastic ordering

1. Introduction

In studying the relationship between a continuous predictor x and a response variable y, it is common to have prior knowledge of stochastic ordering. For example, in environmental epidemiology studies, one can assume that the severity of an adverse health outcome is stochastically nondecreasing with dose of a potentially adverse exposure. Incorporation of stochastic-ordering constraints can improve the efficiency in estimating conditional distributions and in testing for local or global changes as x increases. The focus of this article is on addressing these problems using nonparametric Bayesian methods.

Letting F_x(y) = pr(Y ⩽ y | X = x) denote the conditional distribution function of the response given the predictor, we assume prior knowledge is available that F_x is stochastically smaller than F_{x′} for any two points x ⩽ x′ in 𝒳, with 𝒳 the real line or an interval on the real line. Such a restriction is natural when the response is known to increase monotonically with the predictor. We otherwise have substantial uncertainty about {F_x, x ∈ 𝒳}, with the true conditional distribution potentially having any form subject to the stochastic-ordering restriction. It is appealing to choose a prior with full support on the space of stochastically ordered conditional distributions. Full support means that the prior can generate conditional distributions within arbitrarily small neighbourhoods of any true conditional distribution. If the true conditional distribution is not in the support of the prior, the posterior will not concentrate on small neighbourhoods of the truth, and hence an unrealistic and inconsistent estimate will be obtained.

Although using a prior with large support is critical in obtaining robust inferences and appropriately accounting for uncertainty in the true data-generating model, few results are available characterizing the support of the prior in Bayesian density regression models even in the unconstrained case. An important exception is the recent article of Tokdar et al. (2010), who showed large support and posterior consistency in unconstrained density estimation based on a class of logistic Gaussian process priors. Their approach was specific to the logistic Gaussian process, relying heavily on theoretical properties of Gaussian processes. Norets (2010) showed large Kullback–Leibler support for finite mixtures of normal regressions, with the mixing weights potentially dependent on predictors, and unpublished work by D. Pati and D. B. Dunson from Duke University and A. Norets and J. Pelenis from Princeton University independently obtained large support and posterior consistency results for various density regression priors. All of the previously proposed Bayesian models for incorporating monotonicity constraints in nonparametric regression with a continuous predictor violate the full support condition, and make restrictive assumptions about the nature of dependence on the predictor.

For example, the typical focus in the literature has been on nonparametric regression models of the form y_i = f (x_i) + ∊_i, where f is an unknown nondecreasing function and ∊_i ∼ F is a residual. One can then induce a prior on the conditional distribution through independent priors for the regression function f and residual distribution F. Using this general idea, Lavine & Mockus (1995) allowed both f and F to be unknown using Dirichlet process priors (Ferguson, 1973, 1974); Neelon & Dunson (2004) instead used adaptive linear splines for f ; Holmes & Held (2006) used a piecewise constant model; Wang (2008) used approximated monotone functions with cubic splines allowing an unknown number and location of knots; Bornkamp & Ickstadt (2009) modelled the monotone function with a mixture of parametric probability distributions with the mixing measure assigned a stick-breaking prior (Ongaro & Cattaneo, 2004); Shively et al. (2009) proposed Bayesian nonparametric approaches relying on the characterization of monotone functions proposed by Ramsay (1998) and quadratic regression splines. Even choosing flexible priors for f and F leads to a prior on {F_x, x ∈ 𝒳} with small support, as conditional distributions having residual variance or shape varying with x are not supported.

There has been some work on more flexible nonparametric Bayes methods for categorical x. Arjas & Gasbarra (1996) characterized stochastically ordered survival functions in two groups using a piecewise constant hazard model on a fixed grid, with stochastic ordering implying a restriction on the interval-specific hazards. Gelfand & Kottas (2001) instead expressed the distribution functions in two groups as F₁ = G₁ and F₂ = G₁G₂, with independent Dirichlet process (Ferguson, 1973, 1974) mixture priors used for G₁ and G₂. Although this approach enforces stochastic ordering, the implied constraint is more restrictive and many stochastically ordered distributions are not supported. Karabatsos & Walker (2007) proposed alternative models for stochastically ordered distributions based on Pólya tree (Ferguson, 1974) or Bernstein polynomial (Petrone, 1999) priors, but did not show large support. Without support results for the above priors, it is not clear whether the priors rule out a large subset of the set of stochastically ordered distributions a priori.

Hoff (2003a) proposed a latent variable specification in which individual i has potential outcomes in each of K groups, with these outcomes following a partial ordering. By placing a Dirichlet process prior on the joint distribution of the potential outcomes, he induced a prior on the marginal distributions in each group having full support on the space of distributions satisfying a partial stochastic ordering. Dunson & Peddada (2008) generalized this approach to a mixture specification by proposing a restricted dependent Dirichlet process, while also developing methods for hypothesis testing. The models, theory of large support and algorithms for computation are not straightforward to modify to accommodate continuous predictors. The continuous case is much more challenging, since instead of a finite collection of unknown distributions, we are faced with an uncountable collection.

2. Stochastically ordered densities

2.1. The restricted dependent Dirichlet process mixture

Let f_𝒳 = {f_x, x ∈ 𝒳} denote a collection of conditional densities indexed by the predictor x ∈ 𝒳. To induce a prior on f_𝒳, we consider the mixture model

$$f_x(y; P) = \int \tau^{1/2} \phi[\tau^{1/2}\{y - \mu(x)\}] \, dP_x\{\mu(x), \tau\}, \tag{1}$$

where ϕ(·) denotes the standard normal density, μ = {μ(x), x ∈ 𝒳} is a real-valued stochastic process over 𝒳, τ ∈ ℜ⁺ is a scalar precision parameter and P = (P_x, x ∈ 𝒳) is a collection of predictor-dependent mixing measures.

A prior for f_𝒳 can be induced through a prior for P. One such prior is the fixed p-dependent Dirichlet process (MacEachern, 1999), which can be expressed in stick-breaking form as

$$P_x = \sum_{h=1}^{\infty} p_h \delta_{\Theta_h(x)}, \qquad p_h = V_h \prod_{l<h} (1 - V_l), \tag{2}$$

where δ_a is a Dirac probability measure concentrated at a, V_h ∼ Be(1, α) and Θ_h = (μ̃_h, τ̃_h) ∼ M are mutually independent for h = 1, . . . , ∞, and M = H ⊗ L is a baseline probability measure. For this choice of prior for P = (P_x, x ∈ 𝒳), (1) becomes

$$f_x(y; P) = \sum_{h=1}^{\infty} p_h \tilde{\tau}_h^{1/2} \phi[\tilde{\tau}_h^{1/2}\{y - \tilde{\mu}_h(x)\}], \tag{3}$$

and the conditional density f_x is marginally assigned a Dirichlet process location-scale mixture of Gaussians prior (Lo, 1984; Escobar & West, 1995). The densities f_x and f_{x′} are not independent under this dependent Dirichlet process mixture prior but are correlated owing to the common mixture weights {p_h} and scale parameters {τ̃_h} and to dependence in the kernel locations μ̃_h(x) and μ̃_h(x′). By letting τ̃_h = τ to fix the kernel bandwidth across the mixture components, the model simplifies to a location mixture. Allowing the bandwidth to vary across components has some well-known advantages in terms of sparsely characterizing unknown densities.
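For intuition, the stick-breaking weights in (2) and the mixture density in (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the infinite sum is truncated at a finite number of atoms purely for display, and the function names, the truncation level of 50, and the linear nondecreasing kernel means used in the example draw are all hypothetical choices introduced here.

```python
import math
import random


def stick_breaking_weights(alpha, n_atoms, rng):
    """Weights p_h = V_h * prod_{l<h} (1 - V_l) with V_h ~ Be(1, alpha), as in (2)."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def mixture_density(y, x, weights, mean_fns, precisions):
    """Finite-sum version of (3): sum_h p_h tau_h^{1/2} phi[tau_h^{1/2} {y - mu_h(x)}]."""
    total = 0.0
    for p, mu, tau in zip(weights, mean_fns, precisions):
        root_tau = math.sqrt(tau)
        total += p * root_tau * std_normal_pdf(root_tau * (y - mu(x)))
    return total


# Illustrative draw (assumed settings): 50 atoms, linear kernel means a + b*x with b >= 0.
rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, n_atoms=50, rng=rng)
mean_fns = [
    (lambda x, a=rng.gauss(0.0, 1.0), b=rng.expovariate(1.0): a + b * x)
    for _ in weights
]
precisions = [rng.gammavariate(1.0, 1.0) for _ in weights]
density_at_point = mixture_density(0.0, 0.5, weights, mean_fns, precisions)
```

Because the weights {p_h} and precisions are shared across x while only the kernel means move, evaluating `mixture_density` at two predictor values reuses the same components, which is the source of the dependence between f_x and f_{x′} described above.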

The above model specification closely follows that proposed in previous work. In an as-yet unpublished paper, D. Pati and D. B. Dunson recently provided conditions for large support and weak and strong posterior consistency for fixed p-dependent Dirichlet processes for unconstrained density regression. Their theory does not apply automatically to the constrained case, and it is not clear whether the prior can be specified in such a way as to restrict support to a large subset of the space of stochastically nondecreasing densities.

Assumption 1. Assume the following continuity condition holds: f_x(·) → f_{x′}(·) as x → x′, for any x, x′ ∈ 𝒳.

From an applied perspective, it is appealing to focus on conditional densities satisfying the continuity condition stated in Assumption 1. This regularity condition rules out sudden jumps in f_x as x changes.

Letting (y_i, x_i) denote the data for subject i, for i = 1, . . . , n, (3) implies that

$$y_i \sim N\{\mu_i(x_i), \tau_i^{-1}\}, \qquad (\mu_i, \tau_i) \sim P, \qquad P = \sum_{h=1}^{\infty} p_h \delta_{(\tilde{\mu}_h, \tilde{\tau}_h)}. \tag{4}$$

Hence, to sample y_i, first generate a random function μ_i : 𝒳 → ℜ and precision τ_i ∈ ℜ⁺ from a discrete distribution, which is assigned a Dirichlet process prior. Then draw y_i from a normal distribution with variance $\tau_i^{-1}$, that is, precision τ_i, centred on the realization of μ_i at predictor value x_i. The function μ_i can be viewed as a latent trajectory defining the mean of the normal kernel for individual i at each possible value of the predictor. With this interpretation, a natural strategy for restricting the induced conditional distributions to be stochastically nondecreasing in x is to restrict the μ_i to be nondecreasing functions. To also incorporate the continuity condition, one can draw μ̃_h ∼ H independently from τ̃_h ∼ L, with H chosen as a probability measure on S_𝒳,c, the space of continuous nondecreasing 𝒳 → ℜ functions.

Theorem 1 states that restricting H to have support on S_𝒳,c is a necessary and sufficient condition for the collection of conditional densities to be stochastically nondecreasing in the index x and subject to the continuity condition with probability one.

Theorem 1. Let D_𝒳 denote the space of collections of densities indexed by x ∈ 𝒳 and subject to Assumption 1 and the nondecreasing stochastic ordering constraint. Under (2) and (3), the following statements are equivalent: (a) f_𝒳 ∈ D_𝒳 with probability one; and (b) H is a probability measure with support on S_𝒳,c.

Theorem 1 also holds for a simplified version of (3) with common unknown variance in all of the normal components.
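The sufficiency direction of Theorem 1 can be checked numerically on a small example. The sketch below is only an illustration: the three-component finite mixture, its weights, precisions and kernel means are hypothetical values chosen here, and the check is made on a grid rather than for all y.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_cdf(y, x, weights, mean_fns, precisions):
    """F_x(y) = sum_h p_h * Phi[tau_h^{1/2} {y - mu_h(x)}] for a finite mixture."""
    return sum(
        p * STD_NORMAL.cdf(tau ** 0.5 * (y - mu(x)))
        for p, mu, tau in zip(weights, mean_fns, precisions)
    )


# Hypothetical three-component example; every kernel mean is nondecreasing in x.
weights = [0.5, 0.3, 0.2]
mean_fns = [lambda x: 0.0, lambda x: 2.0 * x, lambda x: max(0.0, x - 0.5)]
precisions = [4.0, 1.0, 9.0]


def is_ordered(x_low, x_high, y_grid):
    """True if F_{x_low}(y) >= F_{x_high}(y) at every grid point,
    i.e. F_{x_low} is stochastically smaller than F_{x_high} on the grid."""
    return all(
        conditional_cdf(y, x_low, weights, mean_fns, precisions)
        >= conditional_cdf(y, x_high, weights, mean_fns, precisions) - 1e-12
        for y in y_grid
    )


y_grid = [i / 10.0 for i in range(-50, 51)]
ordered_forward = is_ordered(0.2, 0.8, y_grid)   # expected to hold
ordered_backward = is_ordered(0.8, 0.2, y_grid)  # expected to fail
```

The ordering holds component by component: when μ̃_h(x) ⩽ μ̃_h(x′), each term Φ[τ̃_h^{1/2}{y − μ̃_h(x)}] is at least as large as its counterpart at x′, so the weighted sum is ordered as well.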

2.2. Hypothesis testing

We also develop methods for testing equalities in conditional distributions against stochastically ordered alternatives. Letting D(x, x′) = sup_{y∈ℜ} |F_x(y) − F_{x′}(y)| be a measure of distance between $F_x(y) = \int_{-\infty}^{y} f_x(z)\,dz$ and $F_{x'}(y) = \int_{-\infty}^{y} f_{x'}(z)\,dz$ for x < x′, we consider the hypotheses H₀(x, x′) : D(x, x′) < ∊ versus H₁(x, x′) : D(x, x′) ⩾ ∊, where ∊ is a small positive constant chosen so that D(x, x′) < ∊ implies that F_x and F_{x′} are not significantly different. We recommend ∊ = 0.05 as a default value. As a weight of evidence in favour of H₁(x, x′), we use the Bayes factor. Fixing x at the minimum predictor value observed and varying x′, we obtain a log Bayes factor function, which is useful for identifying threshold predictor values, a common interest in biomedical studies. Global hypothesis tests of near equality in the conditional distributions across a region A = [a, b] ⊂ 𝒳 can be formulated as H₀(A) = ∩_{x,x′∈A} H₀(x, x′). We use the supremum of the log Bayes factor function as a weight of evidence in favour of H₁(A).
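A rough sketch of these quantities follows. The distance D(x, x′) is approximated on a finite grid of y values. How the Bayes factor is computed is not spelled out in this section, so the estimator below, the ratio of posterior odds to prior odds of H₁ estimated from Monte Carlo draws of D(x, x′), is an assumption made for illustration, as are the function names.

```python
import math
from statistics import NormalDist


def sup_distance(cdf_x, cdf_x_prime, y_grid):
    """Grid approximation to D(x, x') = sup_y |F_x(y) - F_x'(y)|."""
    return max(abs(cdf_x(y) - cdf_x_prime(y)) for y in y_grid)


def bayes_factor_h1(posterior_distances, prior_distances, eps=0.05):
    """Assumed estimator: posterior odds of D >= eps divided by prior odds,
    with both probabilities estimated from draws of D(x, x')."""
    post = sum(d >= eps for d in posterior_distances) / len(posterior_distances)
    prior = sum(d >= eps for d in prior_distances) / len(prior_distances)
    numerator = post * (1.0 - prior)
    denominator = (1.0 - post) * prior
    if denominator == 0.0:
        # Odds are unbounded (or undefined) when a probability sits on the boundary.
        return math.inf if numerator > 0.0 else math.nan
    return numerator / denominator


# Toy check: distance between N(0, 1) and N(1, 1) distribution functions.
y_grid = [i / 100.0 for i in range(-500, 601)]
toy_distance = sup_distance(NormalDist(0.0, 1.0).cdf, NormalDist(1.0, 1.0).cdf, y_grid)
toy_bf = bayes_factor_h1([0.10] * 8 + [0.01] * 2, [0.10] * 5 + [0.01] * 5)
```

In the toy call, 80% of the posterior draws and 50% of the prior draws exceed ∊ = 0.05, giving posterior odds of 4 against prior odds of 1.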

3. Prior specification and posterior computation

To complete a specification of the prior under the model proposed in § 2.1, first let L correspond to the Ga(a_τ, b_τ) distribution and choose α ∼ Ga(a_α, b_α) to allow the Dirichlet process concentration parameter to be unknown. In addition, H is chosen to induce a prior for f_𝒳 with support on a large subset of D_𝒳 by letting

$$\tilde{\mu}_h(x) = \sum_{l=0}^{k} \beta_{hl} b_l(x) \tag{5}$$

for all x ∈ 𝒳, with b₀(x) = 1, $\{b_l\}_{l=1}^{k}$ monotone spline basis functions (Ramsay, 1988), and $\{\beta_{hl}\}_{l=0}^{k}$ basis coefficients constrained to be nonnegative for l ⩾ 1. We focus in our applications on splines of degree 1, leading to piecewise linearity. It is well known that any continuous function can be approximated arbitrarily well by a piecewise linear function with sufficiently many knots.
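The construction in (5) can be sketched as follows. The code assumes that the degree-1 monotone basis reduces to one normalized ramp per knot interval, rising linearly from 0 to 1 across its interval and constant elsewhere; this reading is consistent with piecewise linearity but is an assumption, as are the function names and the convention that `knots` includes both boundary points.

```python
def ramp_basis(x, knots):
    """Return [b_0(x), b_1(x), ..., b_k(x)] with b_0 = 1 and one ramp per knot interval.

    Assumed form: b_l(x) = clip((x - t_{l-1}) / (t_l - t_{l-1}), 0, 1),
    where `knots` = [t_0, ..., t_k] includes the two boundary points.
    """
    values = [1.0]
    for left, right in zip(knots[:-1], knots[1:]):
        values.append(min(1.0, max(0.0, (x - left) / (right - left))))
    return values


def kernel_mean(x, beta, knots):
    """Evaluate (5): mu_h(x) = sum_l beta_{hl} b_l(x)."""
    return sum(b * v for b, v in zip(beta, ramp_basis(x, knots)))


# Example: 19 equally spaced interior knots on (0, 1) give 20 ramps plus an intercept.
knots = [i / 20.0 for i in range(21)]
beta = [-1.0] + [0.0, 0.3] * 10          # intercept is free, slopes are nonnegative
grid = [i / 100.0 for i in range(101)]
trajectory = [kernel_mean(x, beta, knots) for x in grid]
```

With nonnegative coefficients for l ⩾ 1, every ramp contributes a nondecreasing term, so the resulting trajectory is continuous and nondecreasing, while coefficients equal to zero simply flatten the function over the corresponding interval.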

Potentially, one can treat the number and locations of knots as unknown and adaptively determine their values by using reversible jump Markov chain Monte Carlo (Green, 1995). To facilitate efficient computation, we instead choose a large number of equally-spaced knots and then adaptively drop unnecessary basis functions by letting their coefficients have zero values (Smith & Kohn, 1996). This is accomplished through the following prior for β_h,

$$\pi(\beta_h; m_0, \nu_0, \lambda, \pi_0) = N(\beta_{h0}; m_0, \nu_0^{-1}) \prod_{l=1}^{k} \{\pi_0 \delta_0(\beta_{hl}) + (1 - \pi_0)\,\mathrm{Exp}(\beta_{hl}; \lambda)\},$$

where m₀ ∼ N(ω₀, η₀), ν₀ ∼ Ga(a_ν₀, b_ν₀), π₀ ∼ Be(a_π₀, b_π₀) is the prior probability that a basis function is excluded from the model, λ ∼ Ga(a_λ, b_λ), and Exp(λ) denotes the exponential distribution with mean λ⁻¹. The exponential distribution penalizes large values of the nonzero spline coefficients.
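A draw from this prior for a single coefficient vector β_h, conditional on the hyperparameters, can be sketched as below. The function name and the argument values in the example are illustrative assumptions.

```python
import random


def sample_beta_prior(k, m0, nu0, lam, pi0, rng):
    """Draw beta_h = (beta_h0, ..., beta_hk) from the prior given (m0, nu0, lam, pi0).

    beta_h0 ~ N(m0, 1/nu0); for l >= 1, beta_hl is exactly zero with
    probability pi0 and Exp(lam) (mean 1/lam) otherwise.
    """
    beta = [rng.gauss(m0, nu0 ** -0.5)]      # gauss takes a standard deviation
    for _ in range(k):
        if rng.random() < pi0:
            beta.append(0.0)                 # basis function dropped
        else:
            beta.append(rng.expovariate(lam))
    return beta


rng = random.Random(1)
beta_draw = sample_beta_prior(k=20, m0=0.0, nu0=1.0, lam=1.0, pi0=0.5, rng=rng)
all_dropped = sample_beta_prior(k=5, m0=0.0, nu0=1.0, lam=1.0, pi0=1.0, rng=rng)
```

The point mass at zero is what lets the sampler switch basis functions off, so a large equally spaced knot set can be used without reversible jump moves.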

For posterior computation, we use the exact block Gibbs sampler (Yau et al., 2011), which bypasses the need to update the infinitely-many unknowns in the model by combining the ideas of retrospective sampling (Papaspiliopoulos & Roberts, 2008) and slice sampling (Walker, 2007). This algorithm avoids using truncation of the stick-breaking process as in Ishwaran & James (2001). Let N be the number of atoms sampled at the current iteration of the Gibbs sampler, K = (K₁, . . . , K_n)^T with K_i = h if μ_i = μ̃_h and τ_i = τ̃_h, and $c_{h} = \sum_{i = 1}^{n} 1_{(K_{i} = h)}$ for h = 1, . . . , ∞. The Gibbs sampler cycles through the following steps.

Algorithm 1. The Gibbs sampler.

Step 1. Sample β_h = (β_h₀, β_h₁, . . . , β_hk)^T (h = 1, . . . , N).
1. If c_h = 0 or $\sum_{i : K_i = h} b_l^2(x_i) = 0$, sample β_h₀ from N(m₀, ν₀⁻¹), and sample β_hl from the mixture distribution π₀δ₀ + (1 − π₀)Exp(λ) for l = 1, . . . , k.
2. If c_h > 0 and $\sum_{i : K_i = h} b_l^2(x_i) > 0$, sample β_h₀ from N(E_h₀, W_h₀), where
  $W_{h0} = (\tilde{\tau}_h c_h + \nu_0)^{-1}, \quad E_{h0} = W_{h0} \left[\tilde{\tau}_h \sum_{i : K_i = h} \left\{y_i - \sum_{l \neq 0} \beta_{hl} b_l(x_i)\right\} + \nu_0 m_0\right].$

For l = 1, . . . , k, sample β_hl from π_hl δ₀ + (1 − π_hl)N₊(E_hl, W_hl), where

$$\pi_{hl} = \left\{1 + \frac{(1 - \pi_0) \lambda \Phi(W_{hl}^{-1/2} E_{hl})}{\pi_0 \phi(-W_{hl}^{-1/2} E_{hl}) W_{hl}^{-1/2}}\right\}^{-1}, \qquad W_{hl} = \left\{\tilde{\tau}_h \sum_{i : K_i = h} b_l^2(x_i)\right\}^{-1},$$

$$E_{hl} = W_{hl} \left[\tilde{\tau}_h \sum_{i : K_i = h} b_l(x_i) \left\{y_i - \sum_{l' \neq l} \beta_{hl'} b_{l'}(x_i)\right\} - \lambda\right].$$

In the above, Φ(·) denotes the cumulative distribution function associated with N(0, 1), and N₊(a, b) denotes the N(a, b) distribution truncated to have support (0, ∞).

Step 2. Sample τ̃_h from $\mathrm{Ga}(a_\tau + c_h/2,\; b_\tau + \sum_{i : K_i = h} \{y_i - \sum_{l=0}^{k} \beta_{hl} b_l(x_i)\}^2 / 2)$ for h = 1, . . . , N.
Step 3. Sample V_h from $Be (1 + c_{h}, α + n - \sum_{j = 1}^{h} c_{j})$ for h = 1, . . . , N.
Step 4. Sample U_i from Un(0, p_{K_i}) for i = 1, . . . , n.
Step 5. Sample K_i, i = 1, . . . , n. Let U^* = min(U₁, . . . , U_n).
1. If $\sum_{h=1}^{N} p_h > 1 - U^*$, sample K_i from a multinomial distribution with
  $\mathrm{pr}(K_i = j \mid \cdot) \propto \tilde{\tau}_j^{1/2} \phi[\tilde{\tau}_j^{1/2} \{y_i - \sum_{l=0}^{k} \beta_{jl} b_l(x_i)\}] \, 1_{(p_j > U_i)} \quad (i = 1, \dots, n).$
2. Otherwise, keep updating N = N + 1 and sampling (V_N, β_N, τ̃_N) from their priors until $\sum_{h=1}^{N} p_h > 1 - U^*$. Then sample the K_i as in substep 1.
Step 6. Sample π₀ from $Be (a_{π_{0}} + \sum_{h = 1}^{N} \sum_{l = 1}^{k} 1_{(β_{h l} = 0)}, b_{π_{0}} + \sum_{h = 1}^{N} \sum_{l = 1}^{k} 1_{(β_{h l} > 0)})$ .
Step 7. Sample α from $\mathrm{Ga}\{a_\alpha + N,\; b_\alpha - \sum_{h=1}^{N} \log(1 - V_h)\}$.
Step 8. Sample λ from $Ga (a_{λ} + \sum_{h = 1}^{N} \sum_{l = 1}^{k} 1_{(β_{h l} > 0)}, b_{λ} + \sum_{h = 1}^{N} \sum_{l = 1}^{k} β_{h l})$ .
Step 9. Sample ν₀ from $\mathrm{Ga}\{a_{\nu_0} + N/2,\; b_{\nu_0} + \sum_{h=1}^{N} (\beta_{h0} - m_0)^2 / 2\}$.
Step 10. Sample m₀ from $N\{(\eta_0^{-1} + N\nu_0)^{-1} (\eta_0^{-1}\omega_0 + \nu_0 \sum_{h=1}^{N} \beta_{h0}),\; (\eta_0^{-1} + N\nu_0)^{-1}\}$, where η₀ is the prior variance of m₀.
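As an illustration of Step 1, the update of a single slope coefficient β_hl (l ⩾ 1) in an occupied cluster can be sketched as below. The inputs are the basis values b_l(x_i) and the partial residuals y_i − Σ_{l′≠l} β_hl′ b_l′(x_i) for the subjects with K_i = h. The function names, the inverse-transform draw from the truncated normal and the numerical guards are choices made for this sketch rather than details taken from the algorithm statement.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def coefficient_conditional(tau, basis_vals, partial_resid, lam, pi0):
    """Return (pi_hl, E_hl, W_hl) for one coefficient in an occupied cluster."""
    sum_sq = sum(b * b for b in basis_vals)
    w = 1.0 / (tau * sum_sq)
    e = w * (tau * sum(b * r for b, r in zip(basis_vals, partial_resid)) - lam)
    z = e / math.sqrt(w)
    slab = (1.0 - pi0) * lam * STD_NORMAL.cdf(z)
    spike = pi0 * STD_NORMAL.pdf(-z) / math.sqrt(w)
    if spike + slab == 0.0:
        # Both terms underflow only for very negative z, where pi_hl is close to 1.
        return 1.0, e, w
    return spike / (spike + slab), e, w      # equals {1 + slab / spike}^{-1}


def sample_coefficient(pi_hl, e, w, rng):
    """Draw from pi_hl * delta_0 + (1 - pi_hl) * N_+(e, w)."""
    if rng.random() < pi_hl:
        return 0.0
    sd = math.sqrt(w)
    lower = STD_NORMAL.cdf(-e / sd)          # mass of N(e, w) below zero
    u = lower + (1.0 - lower) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)      # keep inv_cdf inside (0, 1)
    return max(0.0, e + sd * STD_NORMAL.inv_cdf(u))  # max guards against rounding


rng = random.Random(2)
pi_hl, e_hl, w_hl = coefficient_conditional(
    tau=1.0, basis_vals=[1.0, 1.0], partial_resid=[2.0, 2.0], lam=1.0, pi0=0.5
)
draws = [sample_coefficient(pi_hl, e_hl, w_hl, rng) for _ in range(200)]
```

In the example, W_hl = 1/2 and E_hl = (4 − 1)/2 = 1.5, so most of the conditional mass sits on the truncated normal component rather than on the point mass at zero.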

In Algorithm 1, steps 4 and 5 allow the number of clusters N to increase as needed during the sampling process. Because all the unknowns have full conditional posterior distributions that are easy to sample from directly, the algorithm is simple to implement and computation is fast per Gibbs iteration. Many functionals, such as conditional densities, distribution functions, quantile functions and posterior hypothesis probabilities, can be estimated easily from the Markov chain Monte Carlo samples.
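The way Steps 4 and 5 let N grow can be sketched as follows. The code draws the slice variables U_i, then appends atoms from the prior until the stick-breaking weights cover 1 − U^*. The function names, the `draw_atom` callback standing in for the prior draws of (β_N, τ̃_N), the 0-based cluster labels and the `max_atoms` safety cap are assumptions introduced for this illustration.

```python
import random


def weights_from_sticks(sticks):
    """p_h = V_h * prod_{l<h} (1 - V_l)."""
    weights, remaining = [], 1.0
    for v in sticks:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def grow_until_covered(sticks, atoms, labels, alpha, draw_atom, rng, max_atoms=1000):
    """Draw U_i ~ Un(0, p_{K_i}), then add atoms until sum_h p_h > 1 - min_i U_i.

    `sticks` and `atoms` are extended in place; `labels` holds 0-based K_i.
    """
    weights = weights_from_sticks(sticks)
    slices = [weights[k] * rng.random() for k in labels]
    u_star = min(slices)
    while sum(weights) <= 1.0 - u_star and len(sticks) < max_atoms:
        sticks.append(rng.betavariate(1.0, alpha))   # V_N from its prior
        atoms.append(draw_atom(rng))                 # (beta_N, tau_N) from their priors
        weights = weights_from_sticks(sticks)
    return weights, slices


def candidate_clusters(u_i, weights):
    """Clusters with p_j > U_i, the only ones K_i can move to in Step 5."""
    return [j for j, p in enumerate(weights) if p > u_i]


rng = random.Random(3)
sticks, atoms, labels = [0.5, 0.5], ["atom-1", "atom-2"], [0, 1, 0]
weights, slices = grow_until_covered(
    sticks, atoms, labels, alpha=1.0, draw_atom=lambda r: r.gammavariate(1.0, 1.0), rng=rng
)
```

Once the weights cover 1 − U^*, any atom not yet generated has weight below every U_i, so the indicator 1_{(p_j > U_i)} in Step 5 restricts each allocation to a finite candidate set without truncating the prior.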

Nonparametric Bayes models involve infinitely-many parameters and are typically specified so that higher level parameters in the hierarchy are under-identified from a frequentist perspective and there are many values of these parameters that lead to similar values of the observed data likelihood. In this sense, the models share a close connection with parametric models that arise from redundant parameterizations (Gelman, 2004, 2006; Ghosh & Dunson, 2009). In both cases, we expect the redundancy and under-identification to lead to poor mixing for higher level unknowns that are typically not of inferential interest, while obtaining good mixing for unknowns that directly appear in the observed data likelihood. For example, we anticipate slow mixing for some of the hyperparameters sampled in Steps 6 through 10 and even for the basis coefficients sampled in Step 1. As we start with an overcomplete set of potential basis functions, we fully expect that there are many different subsets of these bases that lead to similar induced collections of conditional densities. This will intrinsically lead to slow mixing for the basis coefficients that should not degrade and may even aid mixing of the conditional density values f (y | x). Indeed, this is exactly the behaviour we have observed in comprehensive simulation and sensitivity analyses that are summarized in the Supplementary Material; rates of convergence and mixing are good for f (y | x) and functions of these conditional densities, but are poor for a number of the hyperparameters.

Prior elicitation in nonparametric models is challenging, since the impact of the hyperparameters on the induced conditional densities f (y | x) is opaque, though some insight can be obtained through sampling from the prior. This opaqueness is not specific to our proposed model and is a general feature of most nonparametric Bayes models even in simple cases. We follow the strategy of choosing a weakly informative prior (Gelman et al., 2008) that leads to good performance in a wide variety of cases, with the posterior densities for f (y | x) substantially different from the priors even in small sample sizes such as n = 25. Specifically, we suggest choosing α ∼ Ga(1, 1), λ ∼ Ga(1, 1), ν₀ ∼ Ga(1, 1), π₀ ∼ Un(0, 1), m₀ ∼ N(0, 1) and L = Ga(1, 1) after normalizing y. The Supplementary Material contains a detailed sensitivity analysis.

4. Simulation studies

Two simulation cases were investigated to evaluate the performance of the proposed method: a null case and an alternative case. For each case, we tried different sample sizes n = 25, 50, 100, 200 and 500. All x_i s were independently sampled from Un(0, 1). In the null case, y_i was generated from

$$f(y \mid x) = 0.6\,N(y; 0, 0.3^2) + 0.4\,N(y; 1, 0.2^2),$$

and in the alternative case from

$$f(y \mid x) = 0.6\,N\{y; 3(x - 0.3)^2 1_{(x > 0.3)}, 0.3^2\} + 0.4\,N\{y; 5(x - 0.8)^{3/2} 1_{(x > 0.8)}, 0.2^2\}.$$

We generated 100 datasets for each set-up. For each dataset, we ran the Gibbs sampler described in § 3. In specifying the nondecreasing piecewise linear basis functions, we chose I-splines (Ramsay, 1988) with degree equal to 1 and 19 equally spaced interior knots in the predictor space (0, 1). Markov chain Monte Carlo was started with initial parameter values sampled from the recommended priors and N = 20, and results were summarized based on 2000 iterations after deleting the first 500 iterations as a burn-in for each dataset. Using 49 equally spaced interior knots did not give noticeably different results. As shown in the Supplementary Material, we ran a smaller number of simulations using 1100 iterations with a 100 iteration burn-in and obtained essentially identical results, justifying the use of short 2500 iteration chains in these simulations.
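The two data-generating mechanisms can be reproduced with a short simulation routine. The sketch below follows the stated mixtures, with x_i drawn from Un(0, 1); the function names and the seed are assumptions made here.

```python
import random


def alt_mean_first(x):
    """Mean of the first component in the alternative case: 3 (x - 0.3)^2 1(x > 0.3)."""
    return 3.0 * (x - 0.3) ** 2 if x > 0.3 else 0.0


def alt_mean_second(x):
    """Mean of the second component in the alternative case: 5 (x - 0.8)^{3/2} 1(x > 0.8)."""
    return 5.0 * (x - 0.8) ** 1.5 if x > 0.8 else 0.0


def simulate(n, case, rng):
    """Return n pairs (x_i, y_i) from the null or alternative case."""
    data = []
    for _ in range(n):
        x = rng.random()                      # x_i ~ Un(0, 1)
        first = rng.random() < 0.6            # component 1 has weight 0.6
        if case == "null":
            mean, sd = (0.0, 0.3) if first else (1.0, 0.2)
        else:
            mean, sd = (alt_mean_first(x), 0.3) if first else (alt_mean_second(x), 0.2)
        data.append((x, rng.gauss(mean, sd)))
    return data


rng = random.Random(0)
null_data = simulate(25, "null", rng)
alt_data = simulate(25, "alternative", rng)
```

Both component means in the alternative case are flat up to a changepoint and increase afterwards, so the conditional distributions are stochastically nondecreasing in x.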

Figure 1 shows the conditional density estimates at predictor values x = 0.10, 0.50, 0.75 and 0.90 for sample sizes n = 25 and 50, with the null case in the left panels and the alternative case in the right panels. In the null case, the true conditional densities are all the same, while in the alternative case, they are substantially different across 𝒳. The proposed method does a good job in estimating the conditional densities, with the performance improving with increased sample size as expected.

Fig. 1 — Estimates of the conditional densities f (y | x) at x = 0.10, 0.50, 0.75, and 0.90, respectively. The left panels correspond to the null case and the right panels correspond to the alternative case. Each plot shows the true density (solid), the average of the 100 estimated densities when sample size n = 25 (dash), and the average of the 100 estimated densities when n = 50 (dot-dash).

For global testing, we reject the global null if the supremum of the log Bayes factor function is larger than log(10) for each dataset. The choice of log(10) corresponds to a standard threshold for strong evidence against the null hypothesis (Kass & Raftery, 1995). The proportion of rejected global hypotheses is interpreted as the type 1 error of the test in the null case and the power in the alternative case. Our simulations yield type 1 errors equal to 0.05, 0.07, 0.03, 0.04 and 0.02 and powers equal to 0.79, 0.98, 1.0, 1.0 and 1.0 for sample size n = 25, 50, 100, 200, and 500, respectively.

We are also interested in local hypothesis testing of differences between F_x and F₀, with F₀ denoting the distribution of y given x = 0. We estimate a threshold x̂ = min{x : BF(0, x) > 10}, the smallest predictor value at which the Bayes factor is larger than 10. There is significant evidence in favour of the local alternative hypothesis for x > x̂. Figure 2 shows the box plots of the estimated thresholds obtained from 100 datasets in the alternative case with sample size n = 25, 50, 100, 200, and 500, respectively. The true threshold is 0.45. From Fig. 2, the centre of the box plot moves closer to the true threshold and the variation becomes smaller as sample size increases. The averages of the thresholds are 0.89, 0.74, 0.65, 0.57, and 0.46 when sample size is 25, 50, 100, 200, and 500, respectively. Hence, for small sample sizes, there is a tendency to overestimate the threshold, which is as expected.
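The two decision rules used in this section can be written compactly. The sketch assumes that the Bayes factor curve has already been evaluated on an increasing grid of predictor values; the function names and the toy curve are illustrative only.

```python
import math


def reject_global_null(log_bf_curve, cutoff=math.log(10.0)):
    """Reject the global null when the supremum of the log Bayes factor exceeds log(10)."""
    return max(log_bf_curve) > cutoff


def estimated_threshold(x_grid, bf_curve, cutoff=10.0):
    """Smallest grid value x with BF(0, x) > cutoff, or None if the curve never exceeds it.

    `x_grid` is assumed to be sorted in increasing order.
    """
    for x, bf in zip(x_grid, bf_curve):
        if bf > cutoff:
            return x
    return None


# Toy curve: evidence against the local null builds up as x increases.
x_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
bf_curve = [1.0, 2.0, 8.0, 12.0, 40.0]
threshold = estimated_threshold(x_grid, bf_curve)
rejected = reject_global_null([math.log(b) for b in bf_curve])
```

Averaging the rejection indicator over simulated datasets gives the proportions reported above as type 1 error in the null case and power in the alternative case.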

Fig. 2 — Boxplots of thresholds x̂ identified from the local hypothesis tests for 100 datasets in the alternative case of simulation with sample size n = 25, 50, 100, 200, and 500, respectively. The dashed horizontal line shows the true value of threshold, 0.45.

5. An epidemiologic application

We apply the proposed method to an epidemiologic study relating 1,1-dichloro-2,2-bis(p-chlorophenyl)ethylene, commonly referred to as p,p′-DDE, to risk of premature delivery. The study was based on the U.S. Collaborative Perinatal Project (Longnecker et al., 2001). Dunson & Park (2008) previously showed a relationship between DDE and the density of gestational age at delivery without incorporating the stochastic ordering constraint. As it is reasonable to assume that gestational length does not increase as dose increases, the stochastic-ordering assumption is biologically plausible. The proposed method allows for testing and identification of a threshold DDE level, which is of substantial interest from a public health perspective.

We analysed the data using a modification of (4), which adds $z_i^T \gamma$ to μ_i(x_i) to include a parametric adjustment for the covariates z_i, with z_i including race and standardized cholesterol and triglyceride levels for subject i. Letting z = (z₁, . . . , z_n)^T with sample size n = 2313, we chose the prior γ ∼ N{0, n(z^Tz)⁻¹}, with the prior otherwise specified as in § 3. The Gibbs sampler of § 3 can be applied with a minor modification. We used 36 I-spline basis functions $\{\tilde{b}_l\}_{l=1}^{36}$ resulting from taking 35 equally spaced interior knots within (0, 180) and specifying degree equal to 1, and took b_l = 1 − b̃_l for l = 1, . . . , 36 as basis functions in (5) in order to model stochastically nonincreasing densities. We based our results on 10 000 Markov chain Monte Carlo iterations collected after discarding the first 10 000 iterations as a burn-in.

Figure 3(a) shows the estimated conditional densities of gestational age at delivery for DDE levels equal to 2.5, 24.68, 36.53 and 53.75, corresponding to the minimum and the 50th, 75th and 90th percentiles of the observed DDE values, with the other covariates set to zero. These density plots show an increasingly heavy left tail as DDE increases, implying an increasing risk of premature delivery. We also obtain substantially narrower credible intervals for the density estimates compared with Dunson & Park (2008), likely due to improved efficiency attributable to inclusion of the order constraint.

Fig. 3 — Estimated conditional densities of gestational age at delivery and the log Bayes factor curve used for global hypothesis testing. (a) The estimated conditional densities of gestational age at delivery at DDE values x = 2.5 (solid), 24.68 (dot), 36.53 (dash), 53.75 (dot-dash), respectively. (b) the log Bayes factor curve log-BF (2.5, x) (solid). The three dashed horizontal lines in the right panel are log(10), log(3) and 0 for reference.

In order to assess the weight of evidence in the data in favour of stochastic ordering and to estimate a threshold dose, we apply the approach implemented in § 4. Let F₀ denote the distribution function of gestation length given x = 2.5, the minimum of the observed DDE levels, and let F_x denote the distribution function at DDE = x. Figure 3(b) plots the log Bayes factor curve for testing H₀(2.5, x) versus H₁(2.5, x) for increasing x. Treating Bayes factors greater than 10 as significant, we conclude in favour of H₁(2.5, x) for all x > 52.

In addition to density estimation and hypothesis testing, the proposed method also yields a mean function estimate, as well as various quantile curves of gestational age at delivery changing with the DDE values. The proposed mean function estimate has much narrower 95% intervals than the estimate from a generalized additive model fitted with the gam function in R (R Development Core Team, 2011). We also estimate different quantile curves by using the proposed approach and by running the R function rqss with the nonincreasing constraint for a nonparametric additive quantile regression model. The estimates from both methods look similar overall, but the estimates from the proposed method are much smoother. The proposed method should be more efficient in estimating different quantile curves since it estimates all of them jointly while existing methods usually estimate them separately. Matlab code for the simulation study and data analysis is available upon request from the first author.

6. Stochastically ordered discrete distributions

When the response variable is discrete, it is useful to consider a simplification in which y_i ∼ P_{x_i}, with P = (P_x, x ∈ 𝒳) assigned the prior in (2) simplified to let Θ_h = μ̃_h ∼ H. A restricted dependent Dirichlet process prior for P results by constraining H to have support on the space of nondecreasing functions. We show this leads to a prior with large support on the space of nondecreasing stochastically ordered distributions.

Remark 1. When y_i ∈ {0, 1, . . . , ∞} is a count, H should have support on the space of nondecreasing 𝒳 → {0, 1, . . . , ∞} functions. This can be accomplished by starting with an H^* having support on nondecreasing 𝒳 → ℜ⁺ functions. One such H^* corresponds to an exponentiated Gaussian process. Then, to obtain realizations μ̃_h ∼ H, draw $μ_{h}^{*} ~ H^{*}$ and let ${\tilde{μ}}_{h} (x) = ⌊ μ_{h}^{*} (x) ⌋$ . The resulting P_{x_i} will have nonnegative integer atoms.
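The flooring construction in Remark 1 can be illustrated with a deterministic stand-in for a draw from H^*. The positive nondecreasing function exp(a + bx) with b ⩾ 0 used below is a hypothetical example chosen for simplicity, not a draw from an exponentiated Gaussian process, and the function name is an assumption.

```python
import math


def count_trajectory(positive_fn):
    """Map a positive function mu*(x) to the integer-valued atom floor(mu*(x))."""
    return lambda x: math.floor(positive_fn(x))


# Hypothetical nondecreasing positive trajectory standing in for a draw from H*.
mu_star = lambda x: math.exp(0.5 + 2.0 * x)
mu_tilde = count_trajectory(mu_star)
atom_values = [mu_tilde(x) for x in (0.0, 0.5, 1.0)]
```

Because the floor function is nondecreasing, flooring preserves the ordering of the trajectory while forcing the atoms onto the nonnegative integers.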

Let P = (P_x, x ∈ 𝒳) denote an uncountable collection of stochastically ordered probability measures. The nondecreasing stochastic ordering constraint means P_x ≼ P_{x′} for any two points x ⩽ x′ in 𝒳. Let 𝒫 denote the set of probability measures on {𝒴, ℬ(𝒴)}, with 𝒴 a subset of the real line and ℬ(𝒴) the Borel subsets of 𝒴. Let P_𝒳 = {P = (P_x, x ∈ 𝒳) : P_x ∈ 𝒫 for all x ∈ 𝒳}, and C_𝒳 = {P = (P_x, x ∈ 𝒳) : P_x ∈ 𝒫, P_x ≼ P_{x′} for any x ⩽ x′, x, x′ ∈ 𝒳}, with P_𝒳 the set of uncountable collections of probability measures on {𝒴, ℬ(𝒴)} indexed by x ∈ 𝒳, and C_𝒳 the subset of collections satisfying nondecreasing stochastic order.

In addition, let R_𝒳 denote the 𝒳 → 𝒴 function space and S_𝒳 the nondecreasing 𝒳 → 𝒴 function space. Define a metric d in R_𝒳 with d(s₁, s₂) = sup_{x∈𝒳} |s₁(x) − s₂(x)| for any s₁, s₂ ∈ R_𝒳. This metric induces a topology on R_𝒳 and thus on S_𝒳. Let 𝒬 denote the set of probability measures on S_𝒳.

Theorem 2. Consider the prior π_P for P defined in (2), with Θ_h ∼ H and H ∈ 𝒬 a probability measure that has full support on S_𝒳. Then π_P has full support on C_𝒳 with respect to the topology of weak convergence defined in the Appendix.

From Theorem 2, the constructed prior π_P in (2) can assign a positive probability to any open set containing the true P of interest. Theorem 2 follows directly from a more general result in Theorem 3 to be presented below. In the Appendix, Lemma A1 establishes a generalized Choquet’s theorem by showing that every P ∈ C_𝒳 can be represented as a mixture over the extreme points of C_𝒳 as follows

$$P(B) = T(Q)(B) = \int_{S_{𝒳}} \{δ_{s(x)}(B),\ x \in 𝒳\}\, dQ(s) \tag{6}$$

for any B ∈ 𝒝(𝒴), where Q is a mixing measure over S_𝒳, T is an integral operator mapping Q to its marginals and {δ_{s(x)}(·), x ∈ 𝒳} is an extreme point of C_𝒳. The mapping operator T in (6) is shown to be continuous in Lemma A2.
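As an illustration of representation (6), the sketch below takes a finite mixing measure Q over a few nondecreasing functions and evaluates the tail probability P_x((a, ∞)) as the Q-mass of functions with s(x) > a. The particular atoms, weights and the helper name `mixture_tail_prob` are hypothetical choices made for this example only.

```python
def mixture_tail_prob(atoms, weights, x, a):
    """P_x((a, inf)) for P = T(Q) with Q = sum_j w_j * delta_{s_j}:
    the Q-mass of functions s with s(x) > a."""
    return sum(w for s, w in zip(atoms, weights) if s(x) > a)


# Hypothetical finite mixing measure over nondecreasing functions on [0, 1].
atoms = [
    lambda x: 0.0,
    lambda x: x,
    lambda x: 2.0 * x,
    lambda x: 1.0 if x >= 0.5 else 0.0,
]
weights = [0.1, 0.4, 0.3, 0.2]

grid = [i / 10 for i in range(11)]
tails = [mixture_tail_prob(atoms, weights, x, a=0.45) for x in grid]
```

Since every atom is nondecreasing, the tail probabilities in `tails` are nondecreasing in x, which is the stochastic ordering P_x ≼ P_{x′} for x ⩽ x′ evaluated at one threshold a.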

The integral representation (6) allows us to induce a prior π_P for P by introducing a prior π_Q for the mixing measure Q. Based on Lemmas A1 and A2, we establish the following theorem.

Theorem 3. Through (6), the prior π_P for P has full support on C_𝒳 with respect to the topology of weak convergence if and only if the prior π_Q for Q has full support on 𝒬.

The proofs of Lemmas A1 and A2 and Theorem 3 are sketched in the Appendix. Theorem 2 follows directly from Theorem 3 by taking π_Q to be DP(αH), a Dirichlet process with precision parameter α and base probability measure H having full support on S_𝒳, after noting that DP(αH) has weak support on all of 𝒬 (Hoff, 2003a) with respect to the topology induced by the metric d.

Supplementary material

Supplementary material available at Biometrika online contains a sensitivity analysis, the influence of the recommended prior specification for the hyperparameters, the Markov chain Monte Carlo mixing performance in the simulation and DDE data analysis, and the estimated covariate effects in the DDE data analysis.

Acknowledgments

This research was partially supported by a grant from the National Institute of Environmental Health Sciences of the National Institutes of Health. The authors wish to thank the editor, an associate editor, and two reviewers for their critical and constructive comments that greatly improved the original presentation.

Appendix

Proof of Theorem 1. Under (3), the conditional cumulative distribution functions are

$$F_{x}(y) = \sum_{h=1}^{\infty} p_{h}\, Φ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x)\}]$$

for any y ∈ 𝒴 and x ∈ 𝒳. Then for any two predictor values x and x′ with x ⩽ x′,

$$F_{x}(y) - F_{x'}(y) = \sum_{h=1}^{\infty} p_{h} \left( Φ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x)\}] - Φ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x')\}] \right).$$

First, we show that (b) implies (a). With probability 1, F_x(y) − F_{x′}(y) ⩾ 0 for any y ∈ 𝒴 and x < x′ ∈ 𝒳, because Φ is an increasing function and the μ̃_h are nondecreasing random functions sampled from H. This shows that f_𝒳 satisfies the nondecreasing stochastic ordering constraint. Now we show that f_𝒳 under (3) satisfies the continuity condition. First, for any y₁, y₂ ∈ ℜ and b > 0, the mean value theorem gives $ϕ(b^{1/2} y_{1}) - ϕ(b^{1/2} y_{2}) = -(2π)^{-1/2} b y_{0} \exp(-b y_{0}^{2}/2)(y_{1} - y_{2})$ for some y₀ between y₁ and y₂. One further obtains $|ϕ(b^{1/2} y_{1}) - ϕ(b^{1/2} y_{2})| ⩽ (2π)^{-1/2} b^{1/2} |y_{1} - y_{2}|$ since $|b^{1/2} y_{0}| \exp(-b y_{0}^{2}/2) ⩽ 1$ for any y₀. Fixing x′ ∈ 𝒳, for any y ∈ 𝒴 and any x ∈ 𝒳, one then has

$$\begin{aligned} f_{x}(y) - f_{x'}(y) &= \sum_{h=1}^{\infty} p_{h} \tilde{τ}_{h}^{1/2} \left( ϕ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x)\}] - ϕ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x')\}] \right) \\ &⩽ (2π)^{-1/2} \sum_{h=1}^{\infty} p_{h} \tilde{τ}_{h} |\tilde{μ}_{h}(x) - \tilde{μ}_{h}(x')|. \end{aligned}$$

Then f_x(y) − f_{x′}(y) → 0 for any y ∈ 𝒴 as x → x′ follows directly by noting that $\sum_{h=1}^{\infty} p_{h} \tilde{τ}_{h} |\tilde{μ}_{h}(x) - \tilde{μ}_{h}(x')| \to 0$ as x → x′, since the μ̃_h are continuous random functions sampled from H.
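The Lipschitz-type bound on the normal density used in this step can be checked numerically. The following sketch evaluates the bound (2π)^{−1/2} b^{1/2} |y₁ − y₂| against the absolute difference of the densities on a small grid; the grid, the values of b and the helper name `lipschitz_gap` are arbitrary choices for illustration.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def lipschitz_gap(b, y1, y2):
    """Bound minus |phi(sqrt(b) y1) - phi(sqrt(b) y2)|.
    The value is nonnegative whenever the bound holds."""
    lhs = abs(phi(math.sqrt(b) * y1) - phi(math.sqrt(b) * y2))
    rhs = math.sqrt(b) * abs(y1 - y2) / math.sqrt(2.0 * math.pi)
    return rhs - lhs


grid = [-3.0 + 0.25 * i for i in range(25)]
gaps = [
    lipschitz_gap(b, y1, y2)
    for b in (0.1, 1.0, 10.0)
    for y1 in grid
    for y2 in grid
]
```

A nonnegative minimum of `gaps` (up to rounding) is consistent with the bound; it is a numerical check on a grid, not a proof.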

Now we show that (a) implies (b) by showing that if (b) does not hold, then (a) does not hold. Suppose H(S_{𝒳,c}) < 1. Denote B_{x,x′} = {s ∈ R_𝒳 : s(x) ⩽ s(x′)} for a pair of rational numbers x, x′ ∈ 𝒳 with x < x′, where R_𝒳 is the space of 𝒳 → ℜ functions. Being the space of continuous nondecreasing functions on 𝒳, S_{𝒳,c} can be written as a countable intersection of B_{x,x′} over all possible rational pairs (x, x′) in 𝒳. Thus, R_𝒳 \ S_{𝒳,c} can be written as a countable union of $B_{x, x'}^{c}$ over all possible rational pairs (x, x′) in 𝒳, where $B_{x, x'}^{c} = R_{𝒳} \setminus B_{x, x'}$. Since H(R_𝒳 \ S_{𝒳,c}) = 1 − H(S_{𝒳,c}) > 0, there exists a pair of rational numbers (x₁, x₂) with x₁ < x₂ such that $H(B_{x_{1}, x_{2}}^{c}) > 0$ by the subadditivity of a probability measure. Taking this pair (x₁, x₂) and fixing a y ∈ 𝒴, we consider

$$C = \{(s, t) \in B_{x_{1}, x_{2}}^{c} \times ℜ^{+} : Φ[t^{1/2} \{y - s(x_{1})\}] - Φ[t^{1/2} \{y - s(x_{2})\}] < 0\}.$$

It is straightforward to show $C = B_{x_{1}, x_{2}}^{c} \times ℜ^{+}$ and

$$\mathrm{pr}\{(Θ_{1}, \tilde{τ}_{1}) \in C\} = M(C) = H \otimes L(B_{x_{1}, x_{2}}^{c} \times ℜ^{+}) = H(B_{x_{1}, x_{2}}^{c})\, L(ℜ^{+}) = H(B_{x_{1}, x_{2}}^{c}) > 0$$

due to the independence between H and L and L(ℜ⁺) = 1. Write

$$C = \cup_{k=1}^{\infty} C_{k} = \cup_{k=1}^{\infty} \{(s, t) \in B_{x_{1}, x_{2}}^{c} \times ℜ^{+} : Φ[t^{1/2} \{y - s(x_{1})\}] - Φ[t^{1/2} \{y - s(x_{2})\}] < -1/k\}.$$

Then there exists a positive integer k₀ such that pr(C_{k₀}) > 0 since $0 < \mathrm{pr}(C) ⩽ \sum_{k=1}^{\infty} \mathrm{pr}(C_{k})$.

Choose ∊ in the open interval (0, 1/(1 + 2k₀)) and a positive integer T₀ such that T₀ > log(∊)/log{α/(1 + α)}. Define

$$A = \left\{q = (q_{h}, h = 1, \dots, \infty) : \sum_{h=1}^{\infty} q_{h} = 1,\ \sum_{h=T_{0}+1}^{\infty} q_{h} ⩽ ∊,\ \text{and } 0 < q_{h} < 1 \text{ for any } h\right\}.$$

Denote $g_{h} = Φ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x_{1})\}] - Φ[\tilde{τ}_{h}^{1/2} \{y - \tilde{μ}_{h}(x_{2})\}]$ for each h. Clearly, the g_h are independent and identically distributed, and pr(|g_h| ⩽ 2) = 1 for each h since Φ is a cumulative distribution function. Then on D = {p = (p_h, h = 1, . . . , ∞) ∈ A, and (μ̃_h, τ̃_h) ∈ C_{k₀} for h = 1, . . . , T₀},

$$\sum_{h=1}^{\infty} p_{h} g_{h} = \sum_{h=1}^{T_{0}} p_{h} g_{h} + \sum_{h=T_{0}+1}^{\infty} p_{h} g_{h} < -\frac{1}{k_{0}} \sum_{h=1}^{T_{0}} p_{h} + 2∊ < -\frac{1}{k_{0}}(1 - ∊) + 2∊ < 0. \tag{A1}$$

By using Markov’s inequality, one has

$$\mathrm{pr}(p \in A) = \mathrm{pr}\left(\sum_{h=T_{0}+1}^{\infty} p_{h} ⩽ ∊\right) ⩾ 1 - \frac{1}{∊} E\left(\sum_{h=T_{0}+1}^{\infty} p_{h}\right) = 1 - \frac{1}{∊} \left(\frac{α}{1 + α}\right)^{T_{0}} > 0.$$

Hence, one obtains pr(D) = pr(p ∈ A){pr(C_{k₀})}^{T₀} > 0 by using the mutual independence of the (μ̃_h, τ̃_h) and p. This result together with (A1) indicates that F_{x₂} is not stochastically larger than F_{x₁}, and thus (a) cannot hold.
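The role of the constants ∊ and T₀ in the argument can be made concrete with a small numerical sketch. It assumes Dirichlet process stick-breaking weights with E(p_h) = {1/(1 + α)}{α/(1 + α)}^{h−1}, which is consistent with the tail expectation {α/(1 + α)}^{T₀} used in the Markov bound; the values α = 1 and k₀ = 3 and the helper names are hypothetical.

```python
import math


def choose_constants(alpha, k0):
    """Pick eps in (0, 1/(1 + 2*k0)) and the smallest integer T0 with
    T0 > log(eps) / log(alpha / (1 + alpha))."""
    eps = 0.5 / (1 + 2 * k0)  # any value in the open interval would do
    ratio = alpha / (1 + alpha)
    t0 = math.floor(math.log(eps) / math.log(ratio)) + 1
    return eps, t0


def expected_tail_mass(alpha, t0):
    """E(sum_{h > T0} p_h), computed as 1 - sum_{h <= T0} E(p_h)
    with E(p_h) = ratio**(h - 1) / (1 + alpha) (assumed stick-breaking)."""
    ratio = alpha / (1 + alpha)
    head = sum(ratio ** (h - 1) / (1 + alpha) for h in range(1, t0 + 1))
    return 1.0 - head


alpha, k0 = 1.0, 3
eps, t0 = choose_constants(alpha, k0)
# Final upper bound in (A1); it must be negative.
upper_bound_a1 = -(1 - eps) / k0 + 2 * eps
# Markov lower bound on pr(p in A); it must be positive.
markov_lower = 1 - expected_tail_mass(alpha, t0) / eps
```

With these choices the last bound in (A1) is negative and the Markov lower bound on pr(p ∈ A) is positive, which are the two facts the proof combines.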

Definition 1 (Weak convergence in P_𝒳 used in § 6). Define G_Z(P)(·) = ∫_𝒳 P_x(·) dZ(x) for any P = (P_x, x ∈ 𝒳) ∈ P_𝒳 and any probability measure Z on 𝒝(𝒳), the Borel sets of 𝒳. Then G_Z(P) ∈ 𝒫 is a probability measure on 𝒝(𝒴). Define weak convergence in P_𝒳 as follows: a sequence $\{P^{(n)}\}_{n=1}^{\infty}$, with P⁽ⁿ⁾ ∈ P_𝒳 for all n, converges weakly to P ∈ P_𝒳 if and only if G_Z(P⁽ⁿ⁾) converges weakly to G_Z(P) for any probability measure Z on 𝒝(𝒳).

The above weak convergence of P⁽ⁿ⁾ to P in P_𝒳 implies pointwise weak convergence of $P_{x}^{(n)}$ to P_x for all x ∈ 𝒳, which can be easily seen by taking Z = δ_x. Essentially, the weak convergence of P⁽ⁿ⁾ to P in P_𝒳 is equivalent to the uniform weak convergence of $P_{x}^{(n)}$ to P_x for all x ∈ 𝒳. Another equivalent definition is that d(∫ f dP, ∫ f dP⁽ⁿ⁾) → 0 for any bounded continuous f on 𝒴, with d being the uniform metric defined in § 6 and ∫ f dP = (∫ f dP_x, x ∈ 𝒳) ∈ R_𝒳.

The above weak convergence induces a weak topology in P_𝒳. Under this topology, it can be shown that C_𝒳 is a weakly closed convex set and that the set of extreme points of C_𝒳 is exC_𝒳 = {(δ_{s(x)}, x ∈ 𝒳) : s ∈ S_𝒳}. Lemma A1 generalizes Choquet’s theorem by considering an uncountable collection of probability measures.

Lemma A1. For any probability measure Q ∈ 𝒬, T (Q) ∈ C_𝒳, where T is defined in (6). Also, for any P ∈ C_𝒳, there exists a probability measure Q ∈ 𝒬 such that Q({s ∈ S_𝒳 : s(x) ∈ ·}) = P_x (·) for all x ∈ 𝒳, where s denotes a random function over 𝒳.

Proof of Lemma A1. For a probability measure Q on S_𝒳, define Q_x(·) = ∫_{S_𝒳} δ_{s(x)}(·) dQ(s). Then Q_x is a probability measure on 𝒝(𝒴). To show T(Q) ∈ C_𝒳, one only needs to show that (Q_x, x ∈ 𝒳) satisfies the nondecreasing stochastic ordering constraint. For any two points x, x′ ∈ 𝒳 with x ⩽ x′, one has s(x) ⩽ s(x′) for s ∈ S_𝒳. Further one has {s ∈ S_𝒳 : s(x) > a} ⊆ {s ∈ S_𝒳 : s(x′) > a} for any a ∈ ℜ. By the definition of Q_x and Q_{x′}, one obtains that Q_x(a, ∞) = ∫_{S_𝒳} δ_{s(x)}(a, ∞) dQ(s) = Q[{s ∈ S_𝒳 : s(x) > a}] ⩽ Q[{s ∈ S_𝒳 : s(x′) > a}] = ∫_{S_𝒳} δ_{s(x′)}(a, ∞) dQ(s) = Q_{x′}(a, ∞). This shows Q_x ≼ Q_{x′} and thus T(Q) = (Q_x, x ∈ 𝒳) ∈ C_𝒳.

Suppose P = (P_x, x ∈ 𝒳) ∈ C_𝒳. Let F_x denote the cumulative distribution function with respect to P_x and define $s_{x}(ω) = F_{x}^{-1}(ω) = \inf\{u : F_{x}(u) ⩾ ω\}$ for any x ∈ 𝒳 and ω ∈ [0, 1]. For any x, x′ ∈ 𝒳 with x ⩽ x′, P_x ≼ P_{x′} leads to {u : F_{x′}(u) ⩾ ω} ⊆ {u : F_x(u) ⩾ ω} and thus s_x(ω) ⩽ s_{x′}(ω) for all ω. Let Q be the canonical measure of s = (s(x), x ∈ 𝒳) on (S_𝒳, σ_{S_𝒳}) induced by a uniform distribution on ω, where σ_{S_𝒳} is a σ-algebra on S_𝒳. Then Q represents P with Q_x = P_x for any x ∈ 𝒳 using similar arguments to the proof of Proposition 8 in Hoff (2003b).
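The quantile construction in this part of the proof can be sketched for discrete distributions. The example below uses a hypothetical family of three stochastically ordered distributions on {0, 1, 2, 3} and checks that, for each fixed ω, the path x ↦ F_x^{−1}(ω) is nondecreasing; the family and the helper name `quantile` are assumptions made for illustration.

```python
def quantile(support, probs, omega):
    """Generalized inverse F^{-1}(omega) = inf{u : F(u) >= omega}
    for a discrete distribution on sorted support points."""
    cum = 0.0
    for u, p in zip(support, probs):
        cum += p
        if cum >= omega - 1e-12:  # tolerance for floating-point sums
            return u
    return support[-1]


support = [0, 1, 2, 3]
# Hypothetical stochastically ordered family indexed by x in {0, 1, 2}:
# the distribution functions decrease pointwise as x increases.
family = {
    0: [0.5, 0.3, 0.2, 0.0],
    1: [0.3, 0.3, 0.3, 0.1],
    2: [0.1, 0.2, 0.3, 0.4],
}
omegas = [k / 20 for k in range(1, 20)]
# One path s(.) per omega; each path is evaluated at x = 0, 1, 2.
paths = [
    [quantile(support, family[x], w) for x in sorted(family)]
    for w in omegas
]
```

Each entry of `paths` is a nondecreasing function of x, so drawing ω uniformly induces a measure on nondecreasing functions, as in the construction of Q.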

Lemma A2. The integral operator T defined in (6) is continuous.

Proof of Lemma A2. The continuity of T means that if P⁽ⁿ⁾ = T(Q⁽ⁿ⁾), P = T(Q), and Q⁽ⁿ⁾ converges weakly to Q, then P⁽ⁿ⁾ converges weakly to P. By the definition of weak convergence in P_𝒳, we need to show that G_Z(P⁽ⁿ⁾) converges weakly to G_Z(P) for any probability measure Z on 𝒝(𝒳). This is equivalent to showing that ∫_𝒴 f dG_Z(P⁽ⁿ⁾) → ∫_𝒴 f dG_Z(P) for any bounded uniformly continuous function f on 𝒴 and any probability measure Z on 𝒝(𝒳), by noting that 𝒴 is a separable metric space (Parthasarathy, 1967; Billingsley, 1999). Denote by U(𝒴) the set of bounded uniformly continuous functions on 𝒴.

First, for any f ∈ U(𝒴) and any probability measure Z on 𝒝(𝒳), we have the following facts: G_Z(P)(·) = ∫_𝒳 Q_x(·) dZ(x) = ∫_𝒳 ∫_{S_𝒳} δ_{s(x)}(·) dQ(s) dZ(x) and

$$\begin{aligned} \int_{𝒴} f\, dG_{Z}(P) &= \int_{𝒴} f(y) \int_{𝒳} \int_{S_{𝒳}} δ_{s(x)}(dy)\, dQ(s)\, dZ(x) \\ &= \int_{S_{𝒳}} \int_{𝒳} \int_{𝒴} f(y)\, δ_{s(x)}(dy)\, dZ(x)\, dQ(s) = \int_{S_{𝒳}} \int_{𝒳} f(s(x))\, dZ(x)\, dQ(s). \end{aligned}$$

Let g_{(Z, f)}(s) = ∫_𝒳 f(s(x)) dZ(x), a function on S_𝒳, given a probability measure Z on 𝒝(𝒳) and f ∈ U(𝒴). Then one has ∫_𝒴 f dG_Z(P) = ∫_{S_𝒳} g_{(Z, f)}(s) dQ(s) and ∫_𝒴 f dG_Z(P⁽ⁿ⁾) = ∫_{S_𝒳} g_{(Z, f)}(s) dQ⁽ⁿ⁾(s). Since Q⁽ⁿ⁾ converges weakly to Q, one only needs to show that g_{(Z, f)} is a bounded continuous function of s on S_𝒳 for any f ∈ U(𝒴) and any probability measure Z on 𝒝(𝒳). This follows from the uniform continuity of f together with the uniform metric d in R_𝒳.
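The boundedness and continuity of g_{(Z, f)} can be spelled out as a short worked inequality. This is a sketch of the standard argument under the definitions above, with η and δ introduced here only for the illustration.

```latex
% Boundedness: f is bounded and Z is a probability measure, so
\[
  |g_{(Z,f)}(s)| \le \int_{\mathcal{X}} |f(s(x))|\, dZ(x) \le \sup_{u \in \mathcal{Y}} |f(u)| .
\]
% Continuity: fix \eta > 0 and choose \delta > 0 such that
% |f(u) - f(v)| < \eta whenever |u - v| < \delta (uniform continuity of f).
% If d(s_1, s_2) < \delta, then |s_1(x) - s_2(x)| < \delta for every x, hence
\[
  |g_{(Z,f)}(s_1) - g_{(Z,f)}(s_2)|
  \le \int_{\mathcal{X}} |f(s_1(x)) - f(s_2(x))|\, dZ(x)
  \le \eta .
\]
```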

Proof of Theorem 3. Suppose that π_Q has full support on 𝒬. Let A be an open subset of C_𝒳; then one has π_P(A) = pr{T(Q) ∈ A} = π_Q(T⁻¹A) > 0, because T⁻¹A is an open set by the continuity of T and π_Q has full support on 𝒬. It is straightforward to obtain π_P(C_𝒳) = π_Q(𝒬) = 1 based on Lemma A1. The other part can be proven similarly.

References

Arjas E, Gasbarra D. Bayesian inference of survival probabilities, under stochastic ordering constraints. J Am Statist Assoc. 1996;91:1101–9.
Billingsley P. Convergence of Probability Measures. New York: John Wiley & Sons, Inc; 1999.
Bornkamp B, Ickstadt K. Bayesian nonparametric estimation of continuous monotone functions with applications to dose response analysis. Biometrics. 2009;65:198–205. doi: 10.1111/j.1541-0420.2008.01060.x.
Dunson DB, Park J-H. Kernel stick-breaking processes. Biometrika. 2008;95:307–23. doi: 10.1093/biomet/asn012.
Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008;95:859–74. doi: 10.1093/biomet/asn043.
Escobar MD, West M. Bayesian density estimation and inference using mixtures. J Am Statist Assoc. 1995;90:577–88.
Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann Statist. 1973;1:209–30.
Ferguson TS. Prior distributions on spaces of probability measures. Ann Statist. 1974;2:615–29.
Gelfand AE, Kottas A. Nonparametric Bayesian modeling for stochastic order. Ann Inst Statist Math. 2001;53:865–76.
Gelman A. Parameterization and Bayesian modeling. J Am Statist Assoc. 2004;99:537–45.
Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006;3:515–33.
Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Statist. 2008;2:1360–83.
Ghosh J, Dunson DB. Default prior distributions and efficient posterior computation in Bayesian factor analysis. J Comp Graph Statist. 2009;18:306–20. doi: 10.1198/jcgs.2009.07145.
Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–32.
Hoff PD. Bayesian methods for partial stochastic orderings. Biometrika. 2003a;90:303–17.
Hoff PD. Nonparametric estimation of convex models via mixtures. Ann Statist. 2003b;31:174–200.
Holmes CC, Held L. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 2006;1:145–68.
Ishwaran H, James LF. Gibbs sampling methods for stick breaking priors. J Am Statist Assoc. 2001;96:161–73.
Karabatsos G, Walker SG. Bayesian nonparametric inference of stochastically ordered distributions, with Pólya trees and Bernstein polynomials. Statist Prob Lett. 2007;77:907–13.
Kass RE, Raftery AE. Bayes factors. J Am Statist Assoc. 1995;90:773–95.
Lavine M, Mockus A. A nonparametric Bayes method for isotonic regression. J Statist Plan Infer. 1995;46:235–48.
Lo AY. On a class of Bayesian nonparametric estimates: I. Density estimates. Ann Statist. 1984;12:351–7.
Longnecker MP, Klebanoff MA, Zhou HB, Brock JW. Association between maternal serum concentration of the DDT metabolite DDE and preterm and small-for-gestational-age babies at birth. Lancet. 2001;358:110–4. doi: 10.1016/S0140-6736(01)05329-6.
MacEachern SN. Dependent nonparametric processes. In: Proc Bayesian Statist Sci Sect. Alexandria, VA: American Statistical Association; 1999. pp. 50–5.
Neelon B, Dunson DB. Bayesian isotonic regression and trend analysis. Biometrics. 2004;60:398–406. doi: 10.1111/j.0006-341X.2004.00184.x.
Norets A. Approximation of conditional densities by smooth mixtures of regressions. Ann Statist. 2010;38:1733–66.
Ongaro A, Cattaneo C. Discrete random probability measures: A general framework for nonparametric Bayesian inference. Statist Prob Lett. 2004;67:33–5.
Papaspiliopoulos O, Roberts GO. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–86.
Parthasarathy K. Probability Measures on Metric Spaces. New York: Academic Press Inc; 1967.
Petrone S. Bayesian density estimation using Bernstein polynomials. Can J Statist. 1999;27:105–26.
Ramsay JO. Monotone regression splines in action. Statist Sci. 1988;3:425–41.
Ramsay JO. Estimating smooth monotone functions. J R Statist Soc B. 1998;60:365–75.
Shively TS, Sager TW, Walker SG. A Bayesian approach to non-parametric monotone function estimation. J R Statist Soc B. 2009;71:159–75.
Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. J Economet. 1996;75:317–43.
Tokdar S, Zhu Y, Ghosh J. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Anal. 2010;5:1–26.
Walker SG. Sampling the Dirichlet mixture model with slices. Commun Statist B. 2007;36:45–54.
Wang X. Bayesian free-knot monotone cubic spline regression. J Comp Graph Statist. 2008;17:373–87.
Yau C, Papaspiliopoulos O, Roberts GO, Holmes C. Bayesian nonparametric hidden Markov models with applications in genomics. J R Statist Soc B. 2011;73:37–57. doi: 10.1111/j.1467-9868.2010.00756.x.

