Abstract
Density regression models allow the conditional distribution of the response given predictors to change flexibly over the predictor space. Such models are much more flexible than nonparametric mean regression models with nonparametric residual distributions, and are well supported in many applications. A rich variety of Bayesian methods have been proposed for density regression, but it is not clear whether such priors have full support so that any true data-generating model can be accurately approximated. This article develops a new class of density regression models that incorporate stochastic-ordering constraints which are natural when a response tends to increase or decrease monotonely with a predictor. Theory is developed showing large support. Methods are developed for hypothesis testing, with posterior computation relying on a simple Gibbs sampler. Frequentist properties are illustrated in a simulation study, and an epidemiology application is considered.
Keywords: Conditional density estimation, Dependent Dirichlet process, Hypothesis test, Isotonic regression, Nonparametric Bayes, Quantile regression, Stochastic ordering
1. Introduction
In studying the relationship between a continuous predictor x and a response variable y, it is common to have prior knowledge of stochastic ordering. For example, in environmental epidemiology studies, one can assume that the severity of an adverse health outcome is stochastically nondecreasing with dose of a potentially adverse exposure. Incorporation of stochastic-ordering constraints can improve the efficiency in estimating conditional distributions and in testing for local or global changes as x increases. The focus of this article is on addressing these problems using nonparametric Bayesian methods.
Letting Fx (y) = pr(Y ⩽ y | X = x) denote the conditional distribution function of the response given the predictor, we assume prior knowledge is available that Fx is stochastically smaller than Fx′ for any two points x ⩽ x′ in 𝒳, with 𝒳 the real line or an interval on the real line. Such a restriction is natural when the response is known to increase monotonically with the predictor. We otherwise have substantial uncertainty about {Fx, x ∈ 𝒳}, with the true conditional distribution potentially having any form subject to the stochastic-ordering restriction. It is appealing to choose a prior with full support on the space of stochastically ordered conditional distributions. Full support means that the prior can generate conditional distributions within arbitrarily small neighbourhoods of any true conditional distribution. If the true conditional distribution is not in the support of the prior, the posterior will not concentrate on small neighbourhoods of the truth, and hence an unrealistic and inconsistent estimate will be obtained.
Although using a prior with large support is critical in obtaining robust inferences and appropriately accounting for uncertainty in the true data-generating model, few results are available characterizing the support of the prior in Bayesian density regression models even in the unconstrained case. An important exception is the recent article of Tokdar et al. (2010), who showed large support and posterior consistency in unconstrained density regression based on a class of logistic Gaussian process priors. Their approach was specific to the logistic Gaussian process, relying heavily on theoretical properties of Gaussian processes. Norets (2010) showed large Kullback–Leibler support for finite mixtures of normal regressions, with the mixing weights potentially dependent on predictors, and unpublished work by D. Pati and D. B. Dunson from Duke University and A. Norets and J. Pelenis from Princeton University independently obtained large support and posterior consistency results for various density regression priors. All of the previously proposed Bayesian models for incorporating monotonicity constraints in nonparametric regression with a continuous predictor violate the full support condition, and make restrictive assumptions about the nature of dependence on the predictor.
For example, the typical focus in the literature has been on nonparametric regression models of the form yi = f (xi) + ∊i, where f is an unknown nondecreasing function and ∊i ∼ F is a residual. One can then induce a prior on the conditional distribution through independent priors for the regression function f and residual distribution F. Using this general idea, Lavine & Mockus (1995) allowed both f and F to be unknown using Dirichlet process priors (Ferguson, 1973, 1974); Neelon & Dunson (2004) instead used adaptive linear splines for f ; Holmes & Held (2006) used a piecewise constant model; Wang (2008) approximated monotone functions with cubic splines, allowing an unknown number and location of knots; Bornkamp & Ickstadt (2009) modelled the monotone function with a mixture of parametric probability distributions with the mixing measure assigned a stick-breaking prior (Ongaro & Cattaneo, 2004); Shively et al. (2009) proposed Bayesian nonparametric approaches relying on the characterization of monotone functions proposed by Ramsay (1998) and quadratic regression splines. Even choosing flexible priors for f and F leads to a prior on {Fx, x ∈ 𝒳} with small support, as conditional distributions having residual variance or shape varying with x are not supported.
There has been some work on more flexible nonparametric Bayes methods for categorical x. Arjas & Gasbarra (1996) characterized stochastically ordered survival functions in two groups using a piecewise constant hazard model on a fixed grid, with stochastic ordering implying a restriction on the interval-specific hazards. Gelfand & Kottas (2001) instead expressed the distribution functions in two groups as F1 = G1 and F2 = G1G2, with independent Dirichlet process (Ferguson, 1973, 1974) mixture priors used for G1 and G2. Although this approach enforces stochastic ordering, the implied constraint is more restrictive and many stochastically ordered distributions are not supported. Karabatsos & Walker (2007) proposed alternative models for stochastically ordered distributions based on Pólya tree (Ferguson, 1974) or Bernstein polynomial (Petrone, 1999) priors, but did not show large support. Without support results for the above priors, it is not clear whether the priors rule out a large subset of the set of stochastically ordered distributions a priori.
Hoff (2003a) proposed a latent variable specification in which individual i has potential outcomes in each of K groups, with these outcomes following a partial ordering. By placing a Dirichlet process prior on the joint distribution of the potential outcomes, he induced a prior on the marginal distributions in each group having full support on the space of distributions satisfying a partial stochastic ordering. Dunson & Peddada (2008) generalized this approach to a mixture specification by proposing a restricted dependent Dirichlet process, while also developing methods for hypothesis testing. The models, theory of large support and algorithms for computation are not straightforward to modify to accommodate continuous predictors. The continuous case is much more challenging, since instead of a finite collection of unknown distributions, we are faced with an uncountable collection.
2. Stochastically ordered densities
2.1. The restricted dependent Dirichlet process mixture
Let f𝒳 = {fx, x ∈ 𝒳} denote a collection of conditional densities indexed by the predictor x ∈ 𝒳. To induce a prior on f𝒳, we consider the mixture model
$$f_x(y) = \int \tau^{1/2}\,\phi[\tau^{1/2}\{y - \mu(x)\}]\,dP_x\{\mu(x), \tau\}, \qquad (1)$$
where ϕ(·) denotes the standard normal density, μ = {μ(x), x ∈ 𝒳} is a real-valued stochastic process over 𝒳, τ ∈ ℜ+ is a scalar precision parameter and P = (Px, x ∈ 𝒳) is a collection of predictor-dependent mixing measures.
A prior for f𝒳 can be induced through a prior for P. One such prior is the fixed p-dependent Dirichlet process (MacEachern, 1999), which can be expressed in stick-breaking form as
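$$P_x = \sum_{h=1}^{\infty} p_h\,\delta_{\Theta_h(x)}, \quad p_h = V_h \prod_{l<h}(1 - V_l), \quad \Theta_h(x) = \{\tilde\mu_h(x), \tilde\tau_h\}, \qquad (2)$$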
where δa is a Dirac probability measure concentrated at a, Vh ∼ Be(1, α) and Θh = (μ̃h, τ̃h) ∼ M are mutually independent for h = 1, . . . , ∞, and M = H ⊗ L is a baseline probability measure. For this choice of prior for P = (Px, x ∈ 𝒳), (1) becomes
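$$f_x(y) = \sum_{h=1}^{\infty} p_h\,\tilde\tau_h^{1/2}\,\phi[\tilde\tau_h^{1/2}\{y - \tilde\mu_h(x)\}], \qquad (3)$$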
and the conditional density fx is marginally assigned a Dirichlet process location-scale mixture of Gaussians prior (Lo, 1984; Escobar & West, 1995). The fx and fx′ are not independent under this dependent Dirichlet process mixture prior but are correlated due to the common mixture weights {ph} and scale parameters {τ̃h} and to dependence in the kernel locations μ̃h(x) and μ̃h(x′). By letting τ̃h = τ to fix the kernel bandwidth across the mixture components, the model simplifies to a location mixture. Allowing the bandwidth to vary across components has some well-known advantages in terms of sparsely characterizing unknown densities.
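As a rough illustration of (2) and (3), the following Python sketch draws truncated stick-breaking weights and evaluates the resulting mixture density at a single predictor value. The truncation level, the linear form chosen for the kernel locations, the gamma draws for the precisions and all function names are assumptions made purely for illustration; they are not part of the model specification.

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """p_h = V_h * prod_{l<h} (1 - V_l) with V_h ~ Be(1, alpha), truncated at n_atoms."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def mixture_density(y, x, weights, intercepts, slopes, taus):
    """sum_h p_h tau_h^{1/2} phi[tau_h^{1/2} {y - mu_h(x)}], with mu_h(x) = a_h + b_h * x."""
    y = np.asarray(y, dtype=float)[:, None]
    mu = intercepts + slopes * x                  # kernel locations evaluated at x
    z = np.sqrt(taus) * (y - mu)
    kernels = np.sqrt(taus) * np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (weights * kernels).sum(axis=1)

rng = np.random.default_rng(0)
n_atoms = 50                                      # truncation level (assumption)
weights = stick_breaking_weights(alpha=1.0, n_atoms=n_atoms, rng=rng)
intercepts = rng.normal(0.0, 1.0, size=n_atoms)   # hypothetical a_h
slopes = rng.exponential(1.0, size=n_atoms)       # hypothetical b_h >= 0
taus = rng.gamma(shape=1.0, scale=1.0, size=n_atoms)

y_grid = np.linspace(-5.0, 5.0, 201)
f_half = mixture_density(y_grid, 0.5, weights, intercepts, slopes, taus)
```

Because the sum is truncated, the weights add up to slightly less than one; the mass left over belongs to the atoms that were not generated.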
The above model specification closely follows that proposed in previous work. In an as-yet unpublished paper, D. Pati and D. B. Dunson recently provided conditions for large support and weak and strong posterior consistency for fixed p-dependent Dirichlet processes for unconstrained density regression. Their theory does not apply automatically to the constrained case, and it is not clear whether the prior can be specified in such a way as to restrict support to a large subset of the space of stochastically nondecreasing densities.
Assumption 1. Assume the following continuity condition holds: fx (·) → fx′ (·) as x → x′, for any x, x′ ∈ 𝒳.
From an applied perspective, it is appealing to focus on conditional densities satisfying the continuity condition stated in Assumption 1. This regularity condition rules out sudden jumps in fx as x changes.
Letting (yi, xi) denote the data for subject i, for i = 1, . . . , n, (3) implies that
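$$y_i \mid x_i, \mu_i, \tau_i \sim N\{\mu_i(x_i), \tau_i^{-1}\}, \quad (\mu_i, \tau_i) \sim \sum_{h=1}^{\infty} p_h\,\delta_{(\tilde\mu_h, \tilde\tau_h)}. \qquad (4)$$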
Hence, to sample yi first generate a random function μi : 𝒳 → ℜ and precision τi ∈ ℜ+ from a discrete distribution, which is assigned a Dirichlet process prior. Then, draw yi from a normal distribution with precision τi centred on the realization of μi at predictor value xi. The function μi can be viewed as a latent trajectory defining the mean of the normal kernel for individual i at each possible value of the predictor. With this interpretation, a natural strategy for restricting the induced conditional density to be nondecreasing with x is to restrict the μi s to be nondecreasing functions. To also incorporate the continuity condition, one can draw μ̃h ∼ H independently from τ̃h ∼ L, with H chosen as a probability measure on S𝒳,c, the space of continuous nondecreasing 𝒳 → ℜ functions.
Theorem 1 states that restricting H to have support on S𝒳,c is a necessary and sufficient condition for the collection of conditional densities to be stochastically nondecreasing in the index x and subject to the continuity condition with probability one.
Theorem 1. Let D𝒳 denote the space of collections of densities indexed by x ∈ 𝒳 and subject to Assumption 1 and the nondecreasing stochastic ordering constraint. Under (2) and (3), the following statements are equivalent: (a) f𝒳 ∈ D𝒳 with probability one; and (b) H is a probability measure with support on S𝒳,c.
Theorem 1 also holds for a simplified version of (3) with common unknown variance in all of the normal components.
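The direction (b) implies (a) of Theorem 1 can be checked numerically in a toy setting. The sketch below evaluates the distribution function implied by a finite normal mixture whose kernel locations are nondecreasing lines and verifies on a grid that Fx(y) ⩾ Fx′(y) for x < x′. The finite number of atoms, the Dirichlet weights standing in for stick-breaking weights, and the linear locations are all illustrative assumptions.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal distribution function Phi, applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def conditional_cdf(y, x, weights, intercepts, slopes, taus):
    """F_x(y) = sum_h p_h Phi[tau_h^{1/2} {y - mu_h(x)}], with mu_h(x) = a_h + b_h * x."""
    y = np.asarray(y, dtype=float)[:, None]
    mu = intercepts + slopes * x
    return (weights * norm_cdf(np.sqrt(taus) * (y - mu))).sum(axis=1)

rng = np.random.default_rng(1)
n_atoms = 20
weights = rng.dirichlet(np.ones(n_atoms))        # stand-in for stick-breaking weights
intercepts = rng.normal(0.0, 1.0, size=n_atoms)
slopes = rng.exponential(1.0, size=n_atoms)      # nonnegative, so each mu_h is nondecreasing
taus = rng.gamma(2.0, 1.0, size=n_atoms)

y_grid = np.linspace(-6.0, 8.0, 141)
F_low = conditional_cdf(y_grid, 0.2, weights, intercepts, slopes, taus)
F_high = conditional_cdf(y_grid, 0.8, weights, intercepts, slopes, taus)

# Stochastic ordering: the distribution function at the smaller x dominates pointwise.
ordering_holds = bool(np.all(F_low - F_high >= -1e-12))
```

A grid check of this kind only illustrates the ordering for one draw; it is not a substitute for the proof in the Appendix.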
2.2. Hypothesis testing
We also develop methods for testing equalities in conditional distributions against stochastically ordered alternatives. Letting D(x, x′) = supy∈ℜ |Fx (y) − Fx′ (y)| be a measure of distance between Fx and Fx′ for x < x′, we consider the hypotheses H0(x, x′) : D(x, x′) < ∊ versus H1(x, x′) : D(x, x′) ⩾ ∊, where ∊ is a small positive constant chosen so that D(x, x′) < ∊ implies that Fx and Fx′ are not meaningfully different. We recommend ∊ = 0.05 as a default value. As a weight of evidence in favour of H1(x, x′), we use the Bayes factor. Fixing x at the minimum predictor value observed and varying x′, we obtain a log Bayes factor function, which is useful for identifying threshold predictor values, a common interest in biomedical studies. Global hypothesis tests of near equality in the conditional distributions across a region A = [a, b] ⊂ 𝒳 can be formulated as H0(A) = ∩x,x′∈A H0(x, x′). We use the supremum of the log Bayes factor function as a weight of evidence in favour of H1(A).
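A minimal sketch of how these quantities might be computed from simulation output is given below. It approximates D(x, x′) on a finite grid and forms the log Bayes factor as the difference between log posterior odds and log prior odds of H1. Estimating the prior probability of H1 from draws of the prior is an assumption of this sketch, as are the function names and the toy values of D(x, x′).

```python
import numpy as np

def sup_distance(cdf_x, cdf_xp):
    """Grid approximation to D(x, x') = sup_y |F_x(y) - F_x'(y)|."""
    return float(np.max(np.abs(np.asarray(cdf_x) - np.asarray(cdf_xp))))

def log_bayes_factor(post_prob_h1, prior_prob_h1):
    """log BF in favour of H1: log posterior odds minus log prior odds."""
    post_odds = post_prob_h1 / (1.0 - post_prob_h1)
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    return float(np.log(post_odds) - np.log(prior_odds))

epsilon = 0.05
# Hypothetical draws of D(x, x') under the posterior and under the prior.
post_draws = np.array([0.02, 0.08, 0.11, 0.09, 0.07, 0.04, 0.12, 0.10, 0.06, 0.09])
prior_draws = np.array([0.01, 0.03, 0.20, 0.02, 0.04, 0.30, 0.01, 0.02, 0.15, 0.03])

post_prob = float(np.mean(post_draws >= epsilon))    # share of draws supporting H1
prior_prob = float(np.mean(prior_draws >= epsilon))
log_bf = log_bayes_factor(post_prob, prior_prob)
```

With real output, probabilities estimated as exactly 0 or 1 would make the odds undefined, so some safeguard would be needed in practice.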
3. Prior specification and posterior computation
To complete a specification of the prior under the model proposed in § 2.1, first let L correspond to the Ga(aτ, bτ) distribution and choose α ∼ Ga(aα, bα) to allow the Dirichlet process concentration parameter to be unknown. In addition, H is chosen to induce a prior for f𝒳 with support on a large subset of D𝒳 by letting
$$\tilde\mu_h(x) = \sum_{l=0}^{k} \beta_{hl}\, b_l(x), \qquad (5)$$
for all x ∈ 𝒳, with b0(x) = 1, bl (l = 1, . . . , k) monotone spline basis functions (Ramsay, 1988), and basis coefficients βhl constrained to be nonnegative for l ⩾ 1. We focus in our applications on splines of degree 1, leading to piecewise linearity. It is well known that any continuous function on a bounded interval can be approximated arbitrarily well by a piecewise linear function with sufficiently many knots.
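To make the construction concrete, the sketch below builds a piecewise linear nondecreasing basis in which each function rises from 0 to 1 across one knot interval and stays at 1 afterwards, and then forms a nondecreasing function from nonnegative coefficients. Treating the degree 1 monotone splines as such ramps is an assumption of this sketch, as are the function name and the coefficient values; with 19 interior knots it gives 20 ramps plus the constant b0.

```python
import numpy as np

def ramp_basis(x, n_interior_knots, lower=0.0, upper=1.0):
    """Constant column b_0 = 1 followed by one 0-to-1 ramp per knot interval."""
    knots = np.linspace(lower, upper, n_interior_knots + 2)   # includes both boundaries
    x = np.asarray(x, dtype=float)[:, None]
    ramps = np.clip((x - knots[:-1]) / np.diff(knots), 0.0, 1.0)
    return np.hstack([np.ones((x.shape[0], 1)), ramps])

x_grid = np.linspace(0.0, 1.0, 101)
B = ramp_basis(x_grid, n_interior_knots=19)                   # shape (101, 21)

rng = np.random.default_rng(2)
beta = np.concatenate(([-0.5], rng.exponential(0.2, size=B.shape[1] - 1)))
beta[1::2] = 0.0            # zero coefficients drop the corresponding basis functions
mu = B @ beta               # nondecreasing because beta[1:] >= 0
```

Setting a coefficient to zero simply flattens the function over the corresponding knot interval, which is how unnecessary basis functions can be dropped without changing the knot set.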
Potentially, one can treat the number and locations of knots as unknown and adaptively determine their values by using reversible jump Markov chain Monte Carlo (Green, 1995). To facilitate efficient computation, we instead choose a large number of equally-spaced knots and then adaptively drop unnecessary basis functions by letting their coefficients have zero values (Smith & Kohn, 1996). This is accomplished through the following prior for βh,
$$\beta_{h0} \sim N(m_0, \nu_0), \quad \beta_{hl} \sim \pi_0\,\delta_0 + (1 - \pi_0)\,\mathrm{Exp}(\lambda) \quad (l = 1, \ldots, k),$$
where m0 ∼ N(ω0, η0), ν0 ∼ Ga(aν0, bν0), π0 ∼ Be(aπ, bπ) is the prior probability that a basis function is excluded from the model, λ ∼ Ga(aλ, bλ), and Exp(λ) denotes the exponential distribution with mean λ−1. The exponential distribution penalizes large values of the nonzero spline coefficients.
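The point-mass mixture prior on the coefficients with l ⩾ 1 is straightforward to simulate. The following sketch draws k coefficients, each equal to zero with probability π0 and otherwise exponential with mean 1/λ; the function name and the parameter values are illustrative assumptions, and the intercept βh0 is left out.

```python
import numpy as np

def sample_spline_coefs(k, pi0, lam, rng):
    """Draw beta_l ~ pi0 * delta_0 + (1 - pi0) * Exp(lam) for l = 1..k (Exp mean 1/lam)."""
    excluded = rng.random(k) < pi0                    # True: basis function dropped
    slab = rng.exponential(scale=1.0 / lam, size=k)   # NumPy uses scale = mean = 1/lam
    return np.where(excluded, 0.0, slab)

rng = np.random.default_rng(3)
coefs = sample_spline_coefs(k=20, pi0=0.5, lam=1.0, rng=rng)
all_zero = sample_spline_coefs(k=20, pi0=1.0, lam=1.0, rng=rng)   # every basis excluded
```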
For posterior computation, we use the exact block Gibbs sampler (Yau et al., 2011), which bypasses the need to update the infinitely-many unknowns in the model by combining the ideas of retrospective sampling (Papaspiliopoulos & Roberts, 2008) and slice sampling (Walker, 2007). This algorithm avoids using truncation of the stick-breaking process as in Ishwaran & James (2001). Let N be the number of atoms sampled at the current iteration of the Gibbs sampler, K = (K1, . . . , Kn)T with Ki = h if μi = μ̃h and τi = τ̃h, and for h = 1, . . . , ∞. The Gibbs sampler cycles through the following steps.
Algorithm 1. The Gibbs sampler.
Step 1. Sample βh = (βh0, βh1, . . . , βhk)T (h = 1, . . . , N).
(a) If ch = 0 or , sample βh0 from N (m0, ν0), and sample βhl from the mixture distribution π0δ0 + (1 − π0)Exp(λ) for l = 1, . . . , k.
(b) If ch > 0 and , sample βh0 from N (Eh0, Wh0), where
For l = 1, . . . , k, sample βhl from πhl δ0 + (1 − πhl)N+(Ehl, Whl), where
In the above, Φ(·) denotes the cumulative distribution function associated with N(0, 1), and N+(a, b) denotes the truncated normal distribution of N(a, b) with support (0, ∞).
Step 2. Sample τ̃h from for h = 1, . . . , N.
Step 3. Sample Vh from for h = 1, . . . , N.
Step 4. Sample Ui from Un(0, pKi) for i = 1, . . . , n.
Step 5. Sample Ki, i = 1, . . . , n. Let U* = min(U1, . . . , Un).
(a) If p1 + · · · + pN > 1 − U*, sample Ki from a multinomial distribution with
(b) Otherwise, keep updating N = N + 1 and sampling (VN, βN, τ̃N) from their priors until p1 + · · · + pN > 1 − U*. Then sample the Ki s as in (a).
Step 6. Sample π0 from .
Step 7. Sample α from .
Step 8. Sample λ from .
Step 9. Sample ν0 from .
Step 10. Sample m0 from .
In Algorithm 1, steps 4 and 5 allow the number of clusters N to increase as needed during the sampling process. Because all the unknowns have full conditional posterior distributions that are easy to sample from directly, the algorithm is simple to implement and computation is fast per Gibbs iteration. Many functionals, such as conditional densities, distribution functions, quantile functions and posterior hypothesis probabilities, can be estimated easily from the Markov chain Monte Carlo samples.
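The way Steps 4 and 5 let the number of atoms grow can be sketched as follows. The code draws the slice variables Ui given current labels and weights, and then appends new stick-breaking variables from the prior until the accumulated weight exceeds 1 − U*. Only this extension step is shown; the atoms attached to new sticks and the multinomial update of the labels are omitted, and the function names, starting values and zero-based labels are assumptions of the sketch.

```python
import numpy as np

def weights_from_sticks(v):
    """p_h = V_h * prod_{l<h} (1 - V_l) for the sticks generated so far."""
    v = np.asarray(v, dtype=float)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

def extend_sticks(v, u_star, alpha, rng):
    """Append V_h ~ Be(1, alpha) until sum_h p_h > 1 - u_star (assumes u_star > 0)."""
    v = list(v)
    while weights_from_sticks(v).sum() <= 1.0 - u_star:
        v.append(rng.beta(1.0, alpha))
    return np.asarray(v)

rng = np.random.default_rng(4)
alpha = 1.0
v = rng.beta(1.0, alpha, size=3)        # sticks for the N = 3 current atoms
p = weights_from_sticks(v)
labels = np.array([0, 0, 1, 2, 1])      # hypothetical K_i, zero-based

u = rng.uniform(0.0, p[labels])         # Step 4: U_i ~ Un(0, p_{K_i})
v_new = extend_sticks(v, u.min(), alpha, rng)
p_new = weights_from_sticks(v_new)
```

Since the leftover mass 1 − (p1 + · · · + pN) shrinks geometrically as sticks are added, the loop stops after finitely many draws whenever U* is positive.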
Nonparametric Bayes models involve infinitely-many parameters and are typically specified so that higher level parameters in the hierarchy are under-identified from a frequentist perspective and there are many values of these parameters that lead to similar values of the observed data likelihood. In this sense, the models share a close connection with parametric models that arise from redundant parameterizations (Gelman, 2004, 2006; Ghosh & Dunson, 2009). In both cases, we expect the redundancy and under-identification to lead to poor mixing for higher level unknowns that are typically not of inferential interest, while obtaining good mixing for unknowns that directly appear in the observed data likelihood. For example, we anticipate slow mixing for some of the hyperparameters sampled in Steps 6 through 10 and even for the basis coefficients sampled in Step 1. As we start with an overcomplete set of potential basis functions, we fully expect that there are many different subsets of these bases that lead to similar induced collections of conditional densities. This will intrinsically lead to slow mixing for the basis coefficients that should not degrade and may even aid mixing of the conditional density values f (y | x). Indeed, this is exactly the behaviour we have observed in comprehensive simulation and sensitivity analyses that are summarized in the Supplementary Material; rates of convergence and mixing are good for f (y | x) and functions of these conditional densities, but are poor for a number of the hyperparameters.
Prior elicitation in nonparametric models is challenging, since the impact of the hyperparameters on the induced conditional densities f (y | x) is opaque, though some insight can be obtained through sampling from the prior. This opaqueness is not specific to our proposed model and is a general feature of most nonparametric Bayes models even in simple cases. We follow the strategy of choosing a weakly informative prior (Gelman et al., 2008) that leads to good performance in a wide variety of cases, with the posterior densities for f (y | x) substantially different from the priors even in small sample sizes such as n = 25. Specifically, we suggest choosing α ∼ Ga(1, 1), λ ∼ Ga(1, 1), ν0 ∼ Ga(1, 1), π0 ∼ Un(0, 1), m0 ∼ N(0, 1) and L = Ga(1, 1) after normalizing y. The Supplementary Material contains a detailed sensitivity analysis.
4. Simulation studies
Two simulation cases were investigated to evaluate the performance of the proposed method: a null case and an alternative case. For each case, we tried different sample sizes n = 25, 50, 100, 200 and 500. All xi s were independently sampled from Un(0, 1). In the null case, yi was generated from
and in the alternative case from
We generated 100 datasets for each set-up. For each dataset, we ran the Gibbs sampler described in § 3. In specifying the nondecreasing piecewise linear basis functions, we chose I-splines (Ramsay, 1988) with degree equal to 1 and 19 equally spaced interior knots in the predictor space (0, 1). Markov chain Monte Carlo was started with initial parameter values sampled from the recommended priors and N = 20, and results were summarized based on 2000 iterations after deleting the first 500 iterations as a burn-in for each dataset. Using 49 equally spaced interior knots did not give noticeably different results. As shown in the Supplementary Material, we ran a smaller number of simulations using 1100 iterations with a 100 iteration burn-in and obtained essentially identical results, justifying the use of short 2500 iteration chains in these simulations.
Figure 1 shows the results of conditional density estimation at predictor values x = 0.10, 0.50, 0.75 and 0.90 for sample sizes n = 25 and 50, with the null case shown in the left panels and the alternative case in the right panels. In the null case, the true conditional densities are all the same, while in the alternative case, they are substantially different across 𝒳. The proposed method does a good job in estimating the conditional densities, with the performance improving with increased sample size as expected.
Fig. 1.
Estimates of the conditional densities f (y | x) at x = 0.10, 0.50, 0.75, and 0.90, respectively. The left panels correspond to the null case and the right panels correspond to the alternative case. Each plot shows the true density (solid), the average of the 100 estimated densities when sample size n = 25 (dash), and the average of the 100 estimated densities when n = 50 (dot-dash).
For global testing, we reject the global null if the supremum of the log Bayes factor function is larger than log(10) for each dataset. The choice of log(10) corresponds to a standard threshold for strong evidence against the null hypothesis (Kass & Raftery, 1995). The proportion of rejected global hypotheses is interpreted as the type 1 error of the test in the null case and the power in the alternative case. Our simulations yield type 1 errors equal to 0.05, 0.07, 0.03, 0.04 and 0.02 and powers equal to 0.79, 0.98, 1.0, 1.0 and 1.0 for sample size n = 25, 50, 100, 200, and 500, respectively.
We are also interested in local hypothesis testing of differences between Fx and F0, with F0 denoting the distribution of y given x = 0. We estimate a threshold x̂ = min{x : BF(0, x) > 10}, the smallest predictor value at which the Bayes factor is larger than 10. There is significant evidence in favour of the local alternative hypothesis for x > x̂. Figure 2 shows the box plots of the estimated thresholds obtained from 100 datasets in the alternative case with sample size n = 25, 50, 100, 200, and 500, respectively. The true threshold is 0.45. From Fig. 2, the centre of the box plot moves closer to the true threshold and the variation becomes smaller as sample size increases. The averages of the thresholds are 0.89, 0.74, 0.65, 0.57, and 0.46 when sample size is 25, 50, 100, 200, and 500, respectively. Hence, for small sample sizes, there is a tendency to overestimate the threshold, which is as expected.
Fig. 2.
Boxplots of thresholds x̂ identified from the local hypothesis tests for 100 datasets in the alternative case of simulation with sample size n = 25, 50, 100, 200, and 500, respectively. The dashed horizontal line shows the true value of threshold, 0.45.
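The threshold rule x̂ = min{x : BF(0, x) > 10} can be applied to a Bayes factor curve evaluated on a grid, as in the short sketch below. The grid, the Bayes factor values and the choice to return None when no grid point exceeds the cut-off are illustrative assumptions.

```python
import numpy as np

def estimate_threshold(x_grid, bayes_factors, cutoff=10.0):
    """Smallest grid value whose Bayes factor exceeds the cut-off (grid sorted ascending)."""
    x_grid = np.asarray(x_grid, dtype=float)
    exceeds = np.flatnonzero(np.asarray(bayes_factors, dtype=float) > cutoff)
    return float(x_grid[exceeds[0]]) if exceeds.size else None

x_grid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
bf_curve = np.array([1.0, 2.0, 12.0, 30.0, 50.0])    # hypothetical BF(0, x) values
x_hat = estimate_threshold(x_grid, bf_curve)         # 0.5 for these values
```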
5. An epidemiologic application
We apply the proposed method to an epidemiologic study relating 1,1-dichloro-2,2-bis (p-chlorophenyl)-ethylene, commonly referred to as p, p′-DDE, to risk of premature delivery. The study was based on the U.S. Collaborative Perinatal Project (Longnecker et al., 2001). Dunson & Park (2008) previously showed a relationship between DDE and the density of gestational age at delivery without incorporating the stochastic ordering constraint. As it is reasonable to assume that gestational length does not increase as dose increases, the stochastic-ordering assumption is biologically plausible. The proposed method allows for testing and identification of a threshold DDE level, which is of substantial interest from a public health perspective.
We analysed the data using a modification of (4), which adds ziTγ to μi (xi) to include a parametric adjustment for the covariates zi, with zi including race and standardized cholesterol and triglyceride levels for subject i. Letting z = (z1, . . . , zn)T with sample size n = 2313, we chose the prior γ ∼ N{0, n(zTz)−1}, with the prior otherwise specified as in § 3. The Gibbs sampler of § 3 can be applied with a minor modification. We used 36 I-spline basis functions resulting from taking 35 equally spaced interior knots within (0, 180) and specifying degree equal to 1, and took bl = 1 − b̃l for l = 1, . . . , 36 as basis functions in (5) in order to model stochastically nonincreasing densities. We based our results on 10 000 Markov chain Monte Carlo iterations collected after discarding the first 10 000 iterations as a burn-in.
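The reversal bl = 1 − b̃l can be illustrated with the same kind of piecewise linear ramps. In the sketch below each b̃l rises from 0 to 1 over one knot interval on (0, 180), so each bl falls from 1 to 0, and any nonnegative combination is nonincreasing. Representing the degree 1 I-splines as ramps, the function name and the coefficient values are assumptions made for illustration.

```python
import numpy as np

def decreasing_ramp_basis(x, n_interior_knots, lower, upper):
    """Columns b_l = 1 - b~_l, where b~_l is a 0-to-1 ramp on the l-th knot interval."""
    knots = np.linspace(lower, upper, n_interior_knots + 2)
    x = np.asarray(x, dtype=float)[:, None]
    ramps = np.clip((x - knots[:-1]) / np.diff(knots), 0.0, 1.0)
    return 1.0 - ramps

dde_grid = np.linspace(0.0, 180.0, 181)
B_dec = decreasing_ramp_basis(dde_grid, 35, 0.0, 180.0)   # 36 nonincreasing columns
coefs = np.full(B_dec.shape[1], 0.1)                      # hypothetical nonnegative coefficients
mu_dec = B_dec @ coefs                                    # nonincreasing in DDE
```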
Figure 3(a) shows the estimated conditional densities of gestational age at delivery for a DDE level equal to 2.5, 24.68, 36.53, 53.75 corresponding to the minimum, the 50th, 75th, and 90th percentiles of the observed DDE values, controlling the other covariates equal to zero. These density plots show an increasingly heavy left tail as DDE increases, implying an increasing risk of premature delivery. We also obtain substantially narrower credible intervals for the density estimates compared with Dunson & Park (2008), likely due to improved efficiency attributable to inclusion of the order constraint.
Fig. 3.
Estimated conditional densities of gestational age at delivery and the log Bayes factor curve used for global hypothesis testing. (a) The estimated conditional densities of gestational age at delivery at DDE values x = 2.5 (solid), 24.68 (dot), 36.53 (dash), 53.75 (dot-dash), respectively. (b) The log Bayes factor curve log-BF (2.5, x) (solid). The three dashed horizontal lines in the right panel are log(10), log(3) and 0 for reference.
In order to assess the weight of evidence in the data in favour of stochastic ordering and to estimate a threshold dose, we apply the approach implemented in § 4. Let F0 denote the distribution function of gestation length given x = 2.5, the minimum of the observed DDE levels, and let Fx denote the distribution function at DDE = x. Figure 3(b) plots the log Bayes factor curve for testing H0(2.5, x) versus H1(2.5, x) for increasing x. Treating Bayes factors greater than 10 as significant, we conclude in favour of H1(2.5, x) for all x > 52.
In addition to density estimation and hypothesis testing, the proposed method also yields a mean function estimate, as well as various quantile curves of gestational age at delivery changing with the DDE values. The proposed mean function estimate has much narrower 95% intervals than the estimate obtained from a generalized additive model fitted with the gam function in R (R Development Core Team, 2011). We also estimate different quantile curves by using the proposed approach and by running the R function rqss with the nonincreasing constraint for a nonparametric additive quantile regression model. The estimates from both methods look similar overall, but the estimates from the proposed method are much smoother. The proposed method should be more efficient in estimating different quantile curves since it estimates all of them jointly while existing methods usually estimate them separately. Matlab code for the simulation study and data analysis is available upon request from the first author.
6. Stochastically ordered discrete distributions
When the response variable is discrete, it is useful to consider a simplification in which yi ∼ Pxi, with P = (Px, x ∈ 𝒳) assigned the prior in (2) simplified to let Θh = μ̃h ∼ H. A restricted dependent Dirichlet process prior for P results by constraining H to have support on the space of nondecreasing functions. We show this leads to a prior with large support on the space of nondecreasing stochastically ordered distributions.
Remark 1. When yi ∈ {0, 1, . . . , ∞} is a count, H should have support on the space of nondecreasing 𝒳 → {0, 1, . . . , ∞} functions. This can be accomplished by starting with an H* having support on nondecreasing 𝒳 → ℜ+ functions. One such H* corresponds to an exponentiated Gaussian process. Then, to obtain realizations μ̃h ∼ H, draw and let . The resulting Pxi will have nonnegative integer atoms.
Let P = (Px, x ∈ 𝒳) denote an uncountable collection of stochastically ordered probability measures. The nondecreasing stochastic ordering constraint means Px ≼ Px′, for any two points x ⩽ x′ in 𝒳. Let 𝒫 denote the set of probability measures on {𝒴, ℬ(𝒴)}, with 𝒴 a subset of the real line and ℬ(𝒴) the Borel subsets of 𝒴. Let P𝒳 = {P = (Px, x ∈ 𝒳) : Px ∈ 𝒫 for all x ∈ 𝒳}, and C𝒳 = {P = (Px, x ∈ 𝒳) : Px ∈ 𝒫, Px ≼ Px′ for any x ⩽ x′, x, x′ ∈ 𝒳}, with P𝒳 the set of uncountable collections of probability measures on {𝒴, ℬ(𝒴)} indexed by x ∈ 𝒳, and C𝒳 the subset of collections satisfying nondecreasing stochastic order.
In addition, let R𝒳 denote the 𝒳 → 𝒴 function space and S𝒳 the nondecreasing 𝒳 → 𝒴 function space. Define a metric d in R𝒳 with d(s1, s2) = supx∈𝒳 |s1(x) − s2(x)| for any s1, s2 ∈ R𝒳. This metric induces a topology on R𝒳 and thus on S𝒳. Let 𝒬 denote the set of probability measures on S𝒳.
Theorem 2. Consider the prior πP for P defined in (2), with Θh ∼ H and H ∈ 𝒬 a probability measure that has full support on S𝒳. Then πP has full support on C𝒳 with respect to the topology of weak convergence defined in the Appendix.
From Theorem 2, the constructed prior πP in (2) can assign a positive probability to any open set containing the true P of interest. Theorem 2 follows directly from a more general result in Theorem 3 to be presented below. In the Appendix, Lemma A1 establishes a generalized Choquet’s theorem by showing that every P ∈ C𝒳 can be represented as a mixture over the extreme points of C𝒳 as follows
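$$P_x(B) = \{T(Q)\}_x(B) = \int_{S_{\mathcal X}} \delta_{s(x)}(B)\,dQ(s), \quad x \in \mathcal X, \qquad (6)$$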
for any B ∈ ℬ(𝒴), where Q is a mixing measure over S𝒳, T is an integral operator mapping Q to its marginals and {δs(x)(·), x ∈ 𝒳} is an extreme point of C𝒳. The mapping operator T in (6) is shown to be continuous in Lemma A2.
The integral representation (6) allows us to induce a prior πP for P by introducing a prior πQ for the mixing measure Q. Based on Lemmas A1 and A2, we establish the following theorem.
Theorem 3. Through (6), the prior πP for P has full support on C𝒳 with respect to the topology of weak convergence if and only if the prior πQ for Q has full support on 𝒬.
The proofs of Lemmas A1 and A2 and Theorem 3 are sketched in the Appendix. Theorem 2 follows directly from Theorem 3 by taking πQ to be DP(αH), a Dirichlet process with a precision parameter α and a base probability measure H having full support on S𝒳, after noting that DP(αH) has weak support on all of 𝒬 (Hoff, 2003a) with respect to the topology induced by the metric d.
Supplementary material
Supplementary material available at Biometrika online contains a sensitivity analysis, the influence of the recommended prior specification for the hyperparameters, the Markov chain Monte Carlo mixing performance in the simulation and DDE data analysis, and the estimated covariate effects in the DDE data analysis.
Acknowledgments
This research was partially supported by a grant from the National Institute of Environmental Health Sciences of the National Institutes of Health. The authors wish to thank the editor, an associate editor, and two reviewers for their critical and constructive comments that greatly improved the original presentation.
Appendix
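Proof of Theorem 1. Under (3), the conditional cumulative distribution functions are
$$F_x(y) = \sum_{h=1}^{\infty} p_h\,\Phi[\tilde\tau_h^{1/2}\{y - \tilde\mu_h(x)\}]$$
for any y ∈ 𝒴 and x ∈ 𝒳. Then for any two predictor values x and x′ with x ⩽ x′,
$$F_x(y) - F_{x'}(y) = \sum_{h=1}^{\infty} p_h \left(\Phi[\tilde\tau_h^{1/2}\{y - \tilde\mu_h(x)\}] - \Phi[\tilde\tau_h^{1/2}\{y - \tilde\mu_h(x')\}]\right).$$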
First, we show that (b) implies (a). It is straightforward to see Fx (y) − Fx′ (y) ⩾ 0 for any y ∈ 𝒴 and x < x′ ∈ 𝒳 with probability 1 based on the facts that Φ is an increasing function and that μ̃hs are nondecreasing random functions sampled from H. This suggests that f𝒳 is subject to the nondecreasing stochastic ordering constraint. Now we show f𝒳 under (3) are subject to the continuity condition. First, for any y1, y2 ∈ ℜ and b > 0, one has for some y0 between y1 and y2 using the mean value theorem. One further obtains ϕ (b1/2 y1) − ϕ (b1/2 y2) ⩽ (2π)−1/2b1/2|y1 − y2| since for any y0. Fixing x′ ∈ 𝒳, for any y ∈ 𝒴 and any x ∈ 𝒳, one then has
Then fx (y) − fx′ (y) → 0 for any y ∈ 𝒴 as x → x′ follows directly by noting that μ̃h(x) → μ̃h(x′) as x → x′, since the μ̃hs are continuous random functions sampled from H.
Now we show (a) implying (b) by showing if (b) does not hold, then (a) does not hold. Suppose H(S𝒳,c) < 1. Denote Bx,x′ = {s ∈ R𝒳 : s(x) ⩽ s(x′)} for a pair of rational numbers x, x′ ∈ 𝒳 and x < x′, where R𝒳 is the space of 𝒳 → ℜ functions. Being the space of continuous nondecreasing functions on 𝒳, S𝒳,c can be written as a countable intersection of Bx,x′ over all possible rational pairs (x, x′) in 𝒳. Thus, R𝒳\S𝒳,c can be written as a countable union of over all possible rational pairs (x, x′) in 𝒳, where . Since H(R𝒳 \S𝒳,c) = 1 − H(S𝒳,c) > 0, there exist a pair of rational numbers (x1, x2) with x1 < x2 such that by using the subadditivity of a probability measure. Taking this pair (x1, x2) and fixing a y ∈ 𝒴, we consider
It is straightforward to show and
due to the independence between H and L and L(ℜ+) = 1. Write
Then there exists a positive integer k0 such that pr(Ck0) > 0 since .
Choose ∊ ∈ (0, 1/(1 + 2k0)) and a positive integer T0 such that T0 > log(∊)/ log{α/(1 + α)}. Define
Denote for each h. Clearly, ghs are independent and identically distributed, and pr(|gh| ⩽ 2) = 1 for each h since Φ is a cumulative distribution function. Then on D = {p = (ph, h = 1, . . . , ∞) ∈ A, and (μ̃h, τ̃h) ∈ Ck0 for h = 1, . . . , T0},
(A1)
By using Markov’s inequality, one has
Hence, one obtains pr(D) = pr(A) pr(Ck0)^{T0} > 0 by using the mutual independence of the (μ̃h, τ̃h) and p. This result together with (A1) indicates that Fx2 is not stochastically larger than Fx1, and thus (a) cannot hold.
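The role of the truncation level T0 can be checked numerically. The sketch below assumes Dirichlet process stick-breaking weights, ph = Vh ∏l<h (1 − Vl) with Vh ∼ Beta(1, α); this is an assumption suggested by the α/(1 + α) term, since the weight prior is not restated in this appendix. Under it, the mass left after T0 sticks has expectation {α/(1 + α)}^{T0}, which falls below ∊ exactly when T0 > log(∊)/ log{α/(1 + α)}. The function names are hypothetical.

```python
import math
import random


def min_truncation_level(alpha, eps):
    """Smallest integer T0 with T0 > log(eps) / log(alpha / (1 + alpha))."""
    bound = math.log(eps) / math.log(alpha / (1.0 + alpha))
    return max(1, math.floor(bound) + 1)


def expected_tail_mass(alpha, t0):
    """E(1 - sum_{h<=T0} p_h) = {alpha / (1 + alpha)}^T0 for Beta(1, alpha) sticks."""
    return (alpha / (1.0 + alpha)) ** t0


def simulated_tail_mass(alpha, t0, n_draws=20000, seed=0):
    """Monte Carlo average of the mass left after the first T0 sticks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        remaining = 1.0
        for _ in range(t0):
            # 1 - V_h with V_h ~ Beta(1, alpha) (assumed stick-breaking prior)
            remaining *= 1.0 - rng.betavariate(1.0, alpha)
        total += remaining
    return total / n_draws


if __name__ == "__main__":
    alpha, eps = 1.0, 0.1  # illustrative values, not taken from the article
    t0 = min_truncation_level(alpha, eps)
    print(t0, expected_tail_mass(alpha, t0), simulated_tail_mass(alpha, t0))
```

With the expected tail mass below ∊, Markov's inequality bounds the probability that the tail mass is at least ∊ by a number strictly less than 1, so an event on which the first T0 weights carry mass above 1 − ∊ has positive probability.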
Definition 1 (Weak convergence in P𝒳 used in § 6). Define GZ (P)(·) = ∫𝒳 Px (·)dZ(x) for any P = (Px, x ∈ 𝒳) ∈ P𝒳 and any probability measure Z on ℬ(𝒳), the Borel sets of 𝒳. Then GZ (P) ∈ 𝒫 is a probability measure on ℬ(𝒴). Define weak convergence in P𝒳 as follows: a sequence {P(n)}, with P(n) ∈ P𝒳 for all n, converges weakly to P ∈ P𝒳 if and only if GZ (P(n)) converges weakly to GZ (P) for any probability measure Z on ℬ(𝒳).
The above weak convergence of P(n) to P in P𝒳 implies pointwise weak convergence of P_x^{(n)}, the component of P(n) at x, to Px for all x ∈ 𝒳, which can be seen by taking Z = δx. Essentially, the weak convergence of P(n) to P in P𝒳 is equivalent to the uniform weak convergence of P_x^{(n)} to Px over x ∈ 𝒳. Another equivalent definition is that d(∫ f dP, ∫ f dP(n)) → 0 for any bounded continuous f on 𝒴, with d being the uniform metric defined in § 6 and ∫ f dP = (∫ f dPx, x ∈ 𝒳) ∈ R𝒳.
The above weak convergence induces a weak topology in P𝒳. Under this topology, it can be shown that C𝒳 is a weakly closed convex set and that the set of extreme points of C𝒳 is exC𝒳 = {(δs(x), x ∈ 𝒳) : s ∈ S𝒳}. Lemma A1 generalizes Choquet’s Theorem by considering an uncountable collection of probability measures.
Lemma A1. For any probability measure Q ∈ 𝒬, T (Q) ∈ C𝒳, where T is defined in (6). Also, for any P ∈ C𝒳, there exists a probability measure Q ∈ 𝒬 such that Q({s ∈ S𝒳 : s(x) ∈ ·}) = Px (·) for all x ∈ 𝒳, where s denotes a random function over 𝒳.
Proof of Lemma A1. For a probability measure Q on S𝒳, define Qx (·) = ∫S𝒳 δs(x)(·)dQ(s). Then Qx is a probability measure on ℬ(𝒴). To show T (Q) ∈ C𝒳, one only needs to show that (Qx, x ∈ 𝒳) satisfies the nondecreasing stochastic-ordering constraint. For any two points x, x′ ∈ 𝒳 with x ⩽ x′, one has s(x) ⩽ s(x′) for s ∈ S𝒳, and therefore {s ∈ S𝒳 : s(x) > a} ⊆ {s ∈ S𝒳 : s(x′) > a} for any a ∈ ℜ. By the definition of Qx and Qx′, one obtains Qx (a, ∞) = ∫S𝒳 δs(x)(a, ∞)dQ(s) = Q[{s ∈ S𝒳 : s(x) > a}] ⩽ Q[{s ∈ S𝒳 : s(x′) > a}] = ∫S𝒳 δs(x′)(a, ∞)dQ(s) = Qx′ (a, ∞). This shows Qx ≼ Qx′ and thus T (Q) = (Qx, x ∈ 𝒳) ∈ C𝒳.
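The inclusion argument above can be illustrated with a finite mixing measure. The sketch below uses a hypothetical discrete Q that puts weights 0.5, 0.3 and 0.2 on three nondecreasing functions on [0, 1]; the functions and weights are illustrative assumptions, not taken from the article. It computes Qx (a, ∞) = Q[{s : s(x) > a}] and checks that this upper-tail probability never decreases as x increases.

```python
import math

# Hypothetical discrete Q: (weight, nondecreasing function on [0, 1]).
PATHS = [
    (0.5, lambda x: x),
    (0.3, lambda x: math.floor(2 * x) / 2),  # nondecreasing step function
    (0.2, lambda x: x * x - 1.0),
]


def upper_tail(x, a):
    """Q_x((a, inf)) = Q({s : s(x) > a}) for the discrete Q above."""
    return sum(weight for weight, s in PATHS if s(x) > a)


def is_stochastically_nondecreasing(xs, thresholds, tol=1e-12):
    """Check Q_x((a, inf)) <= Q_x'((a, inf)) for all grid points x <= x' and all a."""
    xs = sorted(xs)
    for a in thresholds:
        tails = [upper_tail(x, a) for x in xs]
        if any(tails[i] > tails[i + 1] + tol for i in range(len(tails) - 1)):
            return False
    return True


if __name__ == "__main__":
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    cuts = [-1.5, -0.9, -0.5, 0.0, 0.3, 0.5, 0.9]
    print(is_stochastically_nondecreasing(grid, cuts))
```

The check passes for any weights because every path that exceeds a at x also exceeds a at x′, which is the set inclusion used in the proof.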
Conversely, suppose P = (Px, x ∈ 𝒳) ∈ C𝒳. Let Fx denote the cumulative distribution function of Px and define sx (ω) = inf{u : Fx (u) ⩾ ω} for any x ∈ 𝒳 and ω ∈ [0, 1]. For any x, x′ ∈ 𝒳 with x ⩽ x′, Px ≼ Px′ leads to {u : Fx′ (u) ⩾ ω} ⊆ {u : Fx (u) ⩾ ω} and thus sx (ω) ⩽ sx′ (ω) for all ω. Let Q be the canonical measure of s = (s(x), x ∈ 𝒳), with s(x) = sx (ω), on (S𝒳, σS𝒳) induced by a uniform distribution on ω, where σS𝒳 is a σ-algebra on S𝒳. Then Q represents P with Qx = Px for any x ∈ 𝒳, using similar arguments to the proof of Proposition 8 in Hoff (2003b).
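The quantile coupling in this converse step can be sketched in a few lines. The example reads the construction as the generalized inverse sx (ω) = inf{u : Fx (u) ⩾ ω}, an assumption suggested by the set inclusion in the proof, and uses a hypothetical ordered family in which Px is exponential with mean 1 + x for x ⩾ 0. Neither the family nor the function names come from the article. A single uniform draw ω then yields a whole nondecreasing path, and the marginal at each x recovers Px.

```python
import math
import random


def quantile_path(omega, xs):
    """Evaluate s_x(omega) = inf{u : F_x(u) >= omega} on the grid xs.

    Hypothetical family: P_x is exponential with mean 1 + x (x >= 0), so
    F_x(u) = 1 - exp(-u / (1 + x)) and s_x(omega) = -(1 + x) * log(1 - omega).
    """
    return [-(1.0 + x) * math.log1p(-omega) for x in xs]


def sample_paths(xs, n_paths=5000, seed=1):
    """Draw omega ~ Uniform[0, 1) and return one quantile path per draw."""
    rng = random.Random(seed)
    return [quantile_path(rng.random(), xs) for _ in range(n_paths)]


def marginal_means(paths):
    """Empirical mean of s(x) at each grid point."""
    n = len(paths)
    return [sum(path[j] for path in paths) / n for j in range(len(paths[0]))]


if __name__ == "__main__":
    grid = [0.0, 0.5, 1.0, 2.0]
    paths = sample_paths(grid)
    # Every sampled path should be nondecreasing along the grid.
    print(all(p[i] <= p[i + 1] for p in paths for i in range(len(grid) - 1)))
    # Empirical means should be close to the true means 1 + x.
    print(marginal_means(paths))
```

Using the same ω at every x is what turns a family of ordered marginals into a distribution over monotone functions; independent draws at each x would preserve the marginals but not the monotone paths.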
Lemma A2. The integral operator T defined in (6) is continuous.
Proof of Lemma A2. The continuity of T means that if P(n) = T (Q(n)), P = T (Q), and Q(n) converges weakly to Q, then P(n) converges weakly to P. By the definition of weak convergence in P𝒳, we need to show that GZ (P(n)) converges weakly to GZ (P) for any probability measure Z on ℬ(𝒳). Because 𝒴 is a separable metric space (Parthasarthy, 1967; Billingsley, 1999), this is equivalent to showing that ∫𝒴 f dGZ (P(n)) → ∫𝒴 f dGZ (P) for any bounded uniformly continuous function f on 𝒴 and any probability measure Z on ℬ(𝒳). Denote by U(𝒴) the set of bounded uniformly continuous functions on 𝒴.
First, for any f ∈ U(𝒴) and any probability measure Z on ℬ(𝒳), we have the following facts: GZ (P)(·) = ∫𝒳 Qx (·)dZ(x) = ∫𝒳 ∫S𝒳 δs(x)(·)dQ(s)dZ(x) and
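Let g(Z, f)(s) = ∫𝒳 f (s(x))dZ(x), a function on S𝒳, given a probability measure Z on ℬ(𝒳) and f ∈ U(𝒴). Then one has ∫𝒴 f dGZ (P) = ∫S𝒳 g(Z, f)(s)dQ(s) and ∫𝒴 f dGZ (P(n)) = ∫S𝒳 g(Z, f)(s)dQ(n)(s). Since Q(n) converges weakly to Q, one only needs to show that g(Z, f) is a bounded continuous function of s on S𝒳 for any f ∈ U(𝒴) and any probability measure Z on ℬ(𝒳). This follows from the uniform metric d on R𝒳: g(Z, f) is bounded because f is bounded, and if d(sn, s) → 0 for a sequence sn in S𝒳, then supx |f (sn(x)) − f (s(x))| → 0 by the uniform continuity of f, so that g(Z, f)(sn) → g(Z, f)(s).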
Proof of Theorem 3. Suppose that πQ has full support on 𝒬. Let A be an open subset of C𝒳. Then πP (A) = pr{T (Q) ∈ A} = πQ(T^{−1}A) > 0, because T^{−1}A is an open set by the continuity of T and πQ has full support on 𝒬. It follows from Lemma A1 that πP (C𝒳) = πQ(𝒬) = 1. The other part can be proven similarly.
References
- Arjas E, Gasbarra D. Bayesian inference of survival probabilities, under stochastic ordering constraints. J Am Statist Assoc. 1996;91:1101–9.
- Billingsley P. Convergence of Probability Measures. New York: John Wiley & Sons, Inc; 1999.
- Bornkamp B, Ickstadt K. Bayesian nonparametric estimation of continuous monotone functions with applications to dose response analysis. Biometrics. 2009;65:198–205. doi: 10.1111/j.1541-0420.2008.01060.x.
- Dunson DB, Park J-H. Kernel stick-breaking processes. Biometrika. 2008;95:307–23. doi: 10.1093/biomet/asn012.
- Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008;95:859–74. doi: 10.1093/biomet/asn043.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. J Am Statist Assoc. 1995;90:577–88.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann Statist. 1973;1:209–30.
- Ferguson TS. Prior distributions on spaces of probability measures. Ann Statist. 1974;2:615–29.
- Gelfand AE, Kottas A. Nonparametric Bayesian modeling for stochastic order. Inst Statist Math. 2001;53:865–76.
- Gelman A. Parameterization and Bayesian modeling. J Am Statist Assoc. 2004;99:537–45.
- Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006;3:515–33.
- Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Statist. 2008;2:1360–83.
- Ghosh J, Dunson DB. Default prior distributions and efficient posterior computation in Bayesian factor analysis. J Comp Graph Statist. 2009;18:306–20. doi: 10.1198/jcgs.2009.07145.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–32.
- Hoff PD. Bayesian methods for partial stochastic orderings. Biometrika. 2003a;90:303–17.
- Hoff PD. Nonparametric estimation of convex models via mixtures. Ann Statist. 2003b;31:174–200.
- Holmes CC, Held L. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 2006;1:145–68.
- Ishwaran H, James LF. Gibbs sampling methods for stick breaking priors. J Am Statist Assoc. 2001;96:161–73.
- Karabatsos G, Walker SG. Bayesian nonparametric inference of stochastically ordered distributions, with Pólya trees and Bernstein polynomials. Statist Prob Lett. 2007;77:907–13.
- Kass RE, Raftery AE. Bayes factors. J Am Statist Assoc. 1995;90:773–95.
- Lavine M, Mockus A. A nonparametric Bayes method for isotonic regression. J Statist Plan Infer. 1995;46:235–48.
- Lo AY. On a class of Bayesian nonparametric estimates: I. Density estimates. Ann Statist. 1984;12:351–7.
- Longnecker MP, Klebanoff MA, Zhou HB, Brock JW. Association between maternal serum concentration of the DDT metabolite DDE and preterm and small-for-gestational-age babies at birth. Lancet. 2001;358:110–4. doi: 10.1016/S0140-6736(01)05329-6.
- MacEachern SN. Proc Bayesian Statist Sci Sect. Alexandria, VA: American Statistical Association; 1999. Dependent nonparametric processes; pp. 50–5.
- Neelon B, Dunson DB. Bayesian isotonic regression and trend analysis. Biometrics. 2004;60:398–406. doi: 10.1111/j.0006-341X.2004.00184.x.
- Norets A. Approximation of conditional densities by smooth mixtures of regressions. Ann Statist. 2010;38:1733–66.
- Ongaro A, Cattaneo C. Discrete random probability measures: A general framework for nonparametric Bayesian inference. Statist Prob Lett. 2004;67:33–5.
- Papaspiliopoulos O, Roberts GO. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–86.
- Parthasarthy K. Probability Measures on Metric Spaces. New York: Academic Press Inc; 1967.
- Petrone S. Bayesian density estimation using Bernstein polynomials. Can J Statist. 1999;27:105–26.
- Ramsay JO. Monotone regression splines in action. Statist Sci. 1988;3:425–41.
- Ramsay JO. Estimating smooth monotone functions. J. R. Statist. Soc. B. 1998;60:365–75.
- Shively TS, Sager TW, Walker SG. A Bayesian approach to non-parametric monotone function estimation. J. R. Statist. Soc. B. 2009;71:159–75.
- Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. J Economet. 1996;75:317–43.
- Tokdar S, Zhu Y, Ghosh J. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Anal. 2010;5:1–26.
- Walker SG. Sampling the Dirichlet mixture model with slices. Commun. Statist. B. 2007;36:45–54.
- Wang X. Bayesian free-knot monotone cubic spline regression. J Comp Graph Statist. 2008;17:373–87.
- Yau C, Papaspiliopoulos O, Roberts GO, Holmes C. Bayesian nonparametric hidden Markov models with applications in genomics. J. R. Statist. Soc. B. 2011;73:37–57. doi: 10.1111/j.1467-9868.2010.00756.x.



