Abstract
We consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. The proposed class of models is based on a Gaussian process prior for the mean regression function and mixtures of Gaussians for the collection of residual densities indexed by predictors. Initially considering the homoscedastic case, we propose priors for the residual density based on probit stick-breaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Both priors restrict the residual density to be symmetric about zero, with the sPSB prior more flexible in allowing multimodal densities. We provide sufficient conditions to ensure strong posterior consistency in estimating the regression function under the sPSB prior, generalizing existing theory focused on parametric residual distributions. The PSB and sPSB priors are generalized to allow residual densities to change nonparametrically with predictors through incorporating Gaussian processes in the stick-breaking components. This leads to a robust Bayesian regression procedure that automatically down-weights outliers and influential observations in a locally-adaptive manner. Posterior computation relies on an efficient data augmentation exact block Gibbs sampler. The methods are illustrated using simulated and real data applications.
Keywords: Data augmentation, exact block Gibbs sampler, Gaussian process, nonparametric regression, outliers, symmetrized probit stick breaking process
1 Introduction
Nonparametric regression offers a more flexible way of modeling the effect of covariates on the response compared to parametric models having restrictive assumptions on the mean function and the residual distribution. Here we consider a fully Bayesian approach. The response y ∈ 𝒴 corresponding to a set of covariates x = (x1, x2, …, xp)′ ∈ 𝒳 can be expressed as
(1)  y = η(x) + ε,
where η(x) = 𝖤(y | x) is the mean regression function under the assumption that the residual density has mean zero, i.e., 𝖤(ε | x) = 0 for all x ∈ 𝒳. Our focus is on obtaining a robust estimate of η while allowing heavy tails to down-weight influential observations. We propose a class of models that allows the residual density to change nonparametrically with predictors x, with homoscedasticity arising as a special case.
There is a substantial literature proposing priors for flexible estimation of the mean function, typically using basis function representations such as splines or wavelets (Denison et al 2002). Most of this literature assumes a constant residual density, possibly up to a scale factor allowing heteroscedasticity. Yau and Kohn (2003) allow the mean and variance to change with predictors using thin plate splines. In certain applications, this structure may be overly restrictive due to the specific splines used and the normality assumption. Chan et al (2006) also used splines for heteroscedastic regression, but with locally adaptive estimation of the residual variance and allowance for uncertainty in variable selection. Nott (2006) considered the problem of simultaneous estimation of the mean and variance function by using penalized splines for possibly non Gaussian data. Due to the lack of conjugacy, these methods rely on involved sampling techniques using Metropolis Hastings, requiring proposal distributions to be chosen that may not be efficient in all cases. The residual density is assumed to have a known parametric form and heavy-tailed distributions have not been considered. In addition, since basis function selection for multiple predictors is highly computationally demanding, additive assumptions are typically made that rule out interactions.
Gaussian process (GP) regression (Adler 1990; Ghosal and Roy 2006; van der Vaart and van Zanten 2008, 2009; Neal 1998) is an increasingly popular choice, which avoids the need to explicitly choose the basis functions, while having many appealing computational and theoretical properties. For articles describing some of these properties, refer to Adler (1990), Cramér and Leadbetter (1967), van der Vaart and van Zanten (2008) and van der Vaart and Wellner (1996). A wide variety of functions can arise as the sample paths of the Gaussian process. GP priors can be chosen that have support on the space of all smooth functions while facilitating Bayes computation through conjugacy properties. In particular, the GP realizations at the data points are simply multivariate Gaussian. As shown by Choi and Schervish (2007), GP priors also lead to consistent estimation of the regression function under normality assumptions on the residuals. van der Vaart and van Zanten (2009) demonstrated that a Gaussian process prior with an inverse-gamma bandwidth leads to an optimal rate of posterior convergence in a mean regression problem with Gaussian errors. Recently, Choi (2009) extended the results of Choi and Schervish (2007) to allow for non-Gaussian symmetric residual distributions (for example, the Laplace distribution) that satisfy certain regularity conditions and for which the induced conditional density belongs to a location-scale family. Although only mild assumptions are required on the parametric scale family, the results depend heavily on parametric assumptions. In particular, this theory of posterior consistency is not applicable to an infinite mixture prior on the residual density. We extend these results, allowing a rich class of residual distributions through PSB mixtures of Gaussians, in Section 3.
There is a rich literature on Bayesian methods for density estimation using mixture models of the form
(2)  g(y) = ∫ f(y; θ) dP(θ),  P ~ Π,
where f(·) is a parametric density and P is an unknown mixing distribution assigned a prior Π. The most common choice of Π is the Dirichlet process (Ferguson 1973, 1974). Lo (1984) showed that Dirichlet process mixtures of normals have dense support on the space of densities with respect to Lebesgue measure, while Escobar and West (1995) developed methods for posterior computation and inference. James et al (2005) considered a broader class of normalized random measures for Π.
In order to combine methods for Bayesian nonparametric regression with methods for Bayesian density estimation, one can potentially use mixture model (2) for the residual density in (1). A number of authors have considered nonparametric priors for the residual distribution in regression. For example, Kottas and Gelfand (2001) proposed mixture models for the error distributions in median regression models. To ensure identifiability of the regression coefficients, the residual distribution is constrained to have median zero. Their approach is very flexible but has the unappealing property of producing a residual density that is discontinuous at zero. In addition, the approach of mixing uniforms leads to blocky-looking estimates of the residual density, particularly for sparse data. Lavine and Mockus (2005) allow both a regression function for a single predictor and the residual distribution to be unknown subject to a monotonicity constraint. A number of recent papers have focused on generalizing model (2) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Refer, for example, to Müller et al (1996); Griffin and Steel (2006, 2008); Dunson et al (2007) and Dunson and Park (2008b) among others. Bush and MacEachern (1996) considered nonparametric estimation of random block effects in an ANOVA-type linear mean regression model with a t residual density, rather than density regression.
Although these approaches are clearly highly flexible, there are several issues that provide motivation for this article. First, to simplify inferences and prior elicitation, it is appealing to separate the mean function η(x) from the residual distribution in the specification, which is accomplished by only a few density regression methods. The general framework of separately modeling the mean function and residual distribution nonparametrically was introduced by Griffin and Steel (2008). They allow the residual distribution to change flexibly with predictors using the order-based Dirichlet process (Griffin and Steel 2006). On the other hand, we would like a computationally simpler specification with straightforward prior elicitation. Chib and Greenberg (2010) develop a nonparametric model jointly for continuous and categorical responses in which they model the mean of the link function and the residual density separately. The mean is modeled using flexible additive splines and the residual density is modeled using a DP scale mixture of normals. However, they do not allow the residual distribution to change flexibly with the predictors. Often we have strong prior information regarding the form of the regression function. Most density regression models do not allow incorporation of prior information on the mean function separately from the residual densities. Second, in many applications, the main interest is in inference on η or in prediction, and the residual distribution can be considered as a nuisance. Third, the use of a residual distribution constrained to have zero mean has rarely been attempted in the nonparametric Bayes literature. This is one of the important contributions of the paper. Finally, we would like to be able to provide a specification with theoretical support. In particular, it would be appealing to show strong posterior consistency in estimating η without requiring restrictive assumptions on η or the residual distribution. Current density regression models lack such theoretical support. In addition, computation for density regression can be quite involved, particularly in cases involving more than a few predictors, and one encounters the curse of dimensionality. Our goal is to obtain a computationally convenient specification that allows consistent estimation of the regression function, while being flexible in the residual distribution specification to obtain robust estimates.
We propose to place a Gaussian process prior on η and to allow the residual density to be unknown through a probit stick-breaking (PSB) process mixture. The basic PSB process specification was proposed by Chung and Dunson (2009) in developing a density regression approach that allows variable selection. On the other hand we are concerned with robust estimation of the mean regression function allowing the residual distribution to change flexibly with predictors. While we want to model the mean regression function nonparametrically, we also want to be able to incorporate our prior knowledge for the regression function quite easily. Here, we propose four novel variants of PSB mixtures for the residual distribution. The first uses a scale mixture of Gaussians to obtain a prior with large support on unimodal symmetric distributions. The next is based on a symmetrized location-scale PSB mixture, which is more flexible in avoiding the unimodality constraint, while constraining the residual density to be symmetric and have mean zero. In addition, we show that this prior leads to strong posterior consistency in estimating η under weak conditions.
To allow the residual density to change flexibly with predictors, we generalize the above priors through incorporating probit transformations of Gaussian processes in the weights. The last two prior specifications allow changing residual variances and tail heaviness with predictors, leading to a highly robust specification which is shown to have better performance in simulation studies and out of sample prediction. It will be shown in some small sample simulated examples that the heteroscedastic symmetrized location-scale PSB mixture leads to even more robust inference than the heteroscedastic scale PSB mixture without compromising out of sample predictive performance.
Section 2 proposes the class of models under consideration. Section 3 shows consistency properties. Section 4 develops efficient posterior computation through an exact block Gibbs sampler. Section 5 describes measures of influence to study robustness properties of our proposed methods. Section 6 contains simulation study results, Section 7 applies the methods to the Boston housing data and body fat data, and Section 8 discusses the results. Proofs are included in the Appendix.
2 Nonparametric regression modeling
2.1 Data Structure and Model
Consider n observations with the ith observation recorded in response to the covariate xi = (xi1, xi2, …, xip)′. Let X = (x1, …, xn)′ be the predictor matrix for all n subjects. The regression model can be expressed as yi = η(xi) + εi, i = 1, …, n.
We assume that the response y ∈ 𝒴 is continuous and x ∈ 𝒳, where 𝒳 ⊂ ℝp is compact. Also, the residuals εi are sampled independently, with fx denoting the residual density specific to predictor value xi = x. We focus initially on the case in which the covariate space 𝒳 is continuous, with the covariates arising from a fixed, non-random design or consisting of i.i.d. realizations of a random variable. We choose a prior on the regression function η(x) that has support on a large subset of 𝒞∞(𝒳), the space of smooth real-valued functions on 𝒳. The priors proposed for {fx, x ∈ 𝒳} will be chosen to have large support so that heavy-tailed distributions and outliers will automatically be accommodated, with influential observations downweighted in estimating η.
2.2 Prior on the Mean Regression Function
We assume that η ∈ ℱ = {g : 𝒳 → ℝ is a continuous function}, with η assigned a Gaussian process (GP) prior, η ~ GP(μ, c), where μ is the mean function and c is the covariance kernel. A Gaussian process is a stochastic process {η(x) : x ∈ 𝒳} such that any finite dimensional distribution is multivariate normal, i.e., for any n and x1, …, xn, η(X) ≔ (η(x1), …, η(xn))′ ~ N(μ(X), Ση), where μ(X) = (μ(x1), …, μ(xn))′ and (Ση)ij = c(xi, xj). Naturally the covariance kernel c(·, ·) must satisfy, for each n and x1, …, xn, that the matrix Ση is positive definite. The smoothness of the covariance kernel essentially controls the smoothness of the sample paths of {η(x) : x ∈ 𝒳}. For an appropriate choice of c, a Gaussian process has large support in the space of all smooth functions. More precisely, the support of a Gaussian process is the closure of the reproducing kernel Hilbert space generated by the covariance kernel with a shift by the mean function (Ghosal and Roy 2006). For example, when 𝒳 ⊂ ℝ, the eigenfunctions of the univariate covariance kernel c(x, x′) = e^{−κ(x−x′)²} span 𝒞∞(𝒳) if κ is allowed to vary freely over ℝ+. Thus we can see that the Gaussian process prior has a rich class of functions as its support and hence is appealing as a prior on the mean regression function. Refer to Rasmussen and Williams (2006) as an introductory textbook on Gaussian processes.
We follow common practice in choosing the mean function in the GP prior to correspond to a linear regression, μ(X) = Xβ, with β denoting unknown regression coefficients. As a commonly used covariance kernel, we took the Gaussian kernel c(x, x′) = τ^{−1} e^{−κ‖x−x′‖²}, where τ and κ are unknown hyperparameters, with κ controlling the local smoothness of the sample paths of η(x). Smoother sample paths imply more borrowing of information from neighboring x values.
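As a small illustration of this prior (a sketch we add here, not the authors' code), the following draws realizations of η at a set of design points under a linear mean and the squared-exponential kernel above; the values of β, τ and κ are arbitrary.

```python
# Minimal sketch: drawing mean-function realizations from the GP prior
# eta ~ GP(X*beta, c) with c(x, x') = (1/tau) * exp(-kappa * ||x - x'||^2).
import numpy as np

def gp_prior_draws(X, beta, tau=1.0, kappa=5.0, n_draws=3, jitter=1e-8):
    # pairwise squared Euclidean distances between the rows of X
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-kappa * d2) / tau                      # covariance at the design points
    mu = X @ beta                                      # linear mean function
    L = np.linalg.cholesky(K + jitter * np.eye(len(X)))
    z = np.random.standard_normal((len(X), n_draws))
    return mu[:, None] + L @ z                         # each column is one realization eta(X)

X = np.random.rand(50, 2)                              # 50 points, p = 2 covariates in [0, 1]
draws = gp_prior_draws(X, beta=np.array([1.0, -0.5]))
print(draws.shape)                                     # (50, 3)
```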
2.3 Priors for Residual Distribution
Motivated by the problem of robust estimation of the regression function η, we consider five different types of priors for the residual distributions {fx, x ∈ 𝒳} as enumerated below. The first prior corresponds to the t distribution, which is widely used for robust modeling of residual distributions (West 1984; Lange et al 1989; Fonseca et al 2008), while the remaining priors are flexible nonparametric specifications.
P1. Heavy tailed parametric error distribution
Following many previous authors, we first consider the case in which the residual distributions follow a homoscedastic Student-t distribution with unknown degrees of freedom. As the Student-t with low degrees of freedom is heavy tailed, outliers are allowed. By placing a hyperprior on the degrees of freedom, νσ ~ Ga(aν, bν), with Ga(a, b) denoting the Gamma distribution with mean a/b, one can obtain a data adaptive approach to down-weighting outliers in estimating the mean regression function. However, note that this specification assumes that the same degrees of freedom and tail-heaviness holds for all x ∈ 𝒳. Following West (1987), we express the Student-t distribution as a scale mixture of normals for ease in computation. In addition, we allow an unknown scale parameter, letting εi ~ N(0, σ2/ϕi), with ϕi ~ Ga(νσ/2, νσ/2), σ−2 ~ Ga(a, b).
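The scale-mixture representation can be checked directly by simulation; the sketch below (ours, with arbitrary parameter values) generates residuals as N(0, σ²/ϕi) with ϕi ~ Ga(νσ/2, νσ/2) and compares their tails with a direct Student-t draw.

```python
# Sketch of the P1 residual construction: eps_i ~ N(0, sigma^2/phi_i) with
# phi_i ~ Ga(nu/2, nu/2) gives, marginally, Student-t residuals with nu degrees
# of freedom and scale sigma.  Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, nu, sigma = 100_000, 3.0, 1.0
phi = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)    # Ga(nu/2, nu/2): mean 1
eps = rng.normal(0.0, sigma / np.sqrt(phi))            # N(0, sigma^2 / phi_i)

# compare tail behaviour with a direct t_nu draw
t_direct = sigma * rng.standard_t(df=nu, size=n)
print(np.quantile(eps, [0.95, 0.99]), np.quantile(t_direct, [0.95, 0.99]))
```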
P2. Nonparametric error distribution
Let 𝒴 = ℜ be the response space and 𝒳 be the covariate space, a compact subset of ℜp. Let ℱ denote the space of densities on 𝒳 × 𝒴 with respect to Lebesgue measure and let ℱd denote the space of all conditional densities subject to mean zero, ℱd = {f(· | ·) : ∫𝒴 y f(y | x) dy = 0 for all x ∈ 𝒳}.
We propose to induce a prior on the space of mean zero conditional densities through a prior for a collection of mixing measures {Px, x ∈ 𝒳} using the following predictor-dependent mixture of kernels.
(3)  Px = ∑_{h=1}^∞ πh(x) δ_{(μh(x), τh)},
where πh(x) ≥ 0 are random functions of x such that ∑_{h=1}^∞ πh(x) = 1 a.s. for each fixed x ∈ 𝒳. The atoms {μh(·)} are iid realizations of a real-valued stochastic process, i.e., P0 is a probability distribution over a function space ℱ𝒳, and the τh are iid from P0,σ, a probability distribution on ℜ+. Hence for each x ∈ 𝒳, Px is a random probability measure over the measurable Polish space (ℜ × ℜ+, ℬ(ℜ × ℜ+)). Before proposing the prior, we first review the probit stick-breaking process specification and its relationship to the Dirichlet process. Rodriguez and Dunson (2011) introduce the probit stick-breaking process in broad settings and discuss some smoothness and clustering properties. A probability measure P ∈ 𝒫 on (𝒴, ℬ(𝒴)) follows a probit stick-breaking process with base measure P0 if it has a representation of the form
(4)  P = ∑_{h=1}^∞ wh δ_{θh},
where the atoms {θh} are independent and identically distributed from P0 and the random weights are defined as wh = Φ(αh) ∏l<h{1 − Φ(αl)}, with αh ~ N(μα, σα²) independently. Here Φ(·) denotes the cumulative distribution function for the standard normal distribution. Note that expression (4) is identical to the stick-breaking representation (Sethuraman 1994) of the Dirichlet process (DP), but the DP is obtained by replacing the stick-breaking weight Φ(αh) with a beta(1, α) distributed random variable. Hence, the PSB process differs from the DP in using probit transformations of Gaussian random variables instead of betas for the stick lengths, with the two specifications being identical in the special case in which μα = 0, σα = 1 and the DP precision parameter is α = 1. Rodriguez and Dunson (2011) also mentioned the possibility of constructing a variety of predictor-dependent models, e.g., latent Markov random fields, spatio-temporal processes, etc., by using probit transformations of latent Gaussian processes. Such latent Gaussian processes can be updated using data augmentation Gibbs sampling as in continuation-ratio probit models for survival analysis (Albert and Chib 2001). While we follow similar computational strategies as in Rodriguez and Dunson (2011), they did not consider robust regression using a predictor-dependent residual density.
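A minimal sketch of the stick-breaking construction in (4) (our illustration, not the authors' code) is given below; with μα = 0 and σα = 1 the stick lengths Φ(αh) are uniform, matching the Beta(1, 1) sticks of a DP with precision 1.

```python
# Sketch of the probit stick-breaking weights in (4):
# w_h = Phi(alpha_h) * prod_{l<h} {1 - Phi(alpha_l)}, alpha_h ~ N(mu_alpha, sigma_alpha^2).
import numpy as np
from scipy.stats import norm

def psb_weights(H, mu_alpha=0.0, sigma_alpha=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    v = norm.cdf(rng.normal(mu_alpha, sigma_alpha, size=H))   # stick lengths Phi(alpha_h)
    w = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))   # stick-breaking weights
    return w

w = psb_weights(H=25)
print(w[:5], w.sum())   # weights decay; the truncated sum is close to 1
```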
Under the symmetric about zero assumption, we propose two nonparametric priors for the residual density fx for all x ∈ 𝒳. The first prior is a predictor dependent PSB scale mixture of Gaussians which enforces symmetry about zero and unimodality, and the next is a symmetrized location-scale PSB mixture of Gaussians, which we develop to satisfy the symmetric about zero assumption while allowing multimodality.
P2a. Heteroscedastic scale PSB mixtures
To allow the residual density to change flexibly with predictors, while maintaining the constraint that each of the predictor-dependent residual distributions is unimodal and symmetric about zero, we propose the following specification
(5)  fx(ε) = ∑_{h=1}^∞ πh(x) N(ε; 0, τh^{−1}),  τh ~ Ga(ατ, βτ) independently,
where πh(x) = Φ{αh(x)} ∏l<h[1 − Φ{αl(x)}] is the predictor-dependent probability weight on the hth mixture component, and the αh’s are drawn independently from zero mean Gaussian processes having covariance kernel cα(x, x′) = e^{−κα‖x−x′‖²}. This implies ∑_{h=1}^∞ πh(x) = 1 almost surely for each x ∈ 𝒳 and is a highly-flexible specification that enforces smoothly changing mixture weights across the predictor space, so that the residual densities at x and x′ will tend to be similar if x is located close to x′, as measured by κα‖x − x′‖².
Clearly, the specification allows the residual variance to change flexibly with predictors, as we have var(ε | x) = ∑_{h=1}^∞ πh(x)/τh. However, unlike the previously proposed methods for heteroscedastic nonlinear regression, we do not just allow the variances to vary, but allow any aspect of the density to vary, including the heaviness of the tails. This allows locally adaptive downweighting of outliers in estimating the mean function. Previous methods, which instead assume a single heavy-tailed residual distribution, such as a t-distribution, can lead to a lack of robustness due to global estimation of a single degree of freedom parameter. In addition, due to the form of our specification, posterior computation becomes very straightforward using a data augmentation Gibbs sampler, which involves simple steps for sampling from conjugate full conditional distributions. Even under the assumption of Gaussian residual distributions, posterior computation for heteroscedastic models tends to be complex, with Gibbs sampling typically not possible due to the lack of conditional conjugacy.
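The following sketch (ours; the truncation level and hyperparameter values are illustrative) shows how latent GP draws are mapped to predictor-dependent weights πh(x) and how the implied residual variance ∑h πh(x)/τh then varies smoothly over the predictor space.

```python
# Sketch of the P2a construction (5): latent GPs alpha_h(x) with covariance
# exp(-kappa_alpha * ||x - x'||^2) are pushed through the probit stick-breaking map
# to give pi_h(x); the residual density at x is sum_h pi_h(x) N(0, 1/tau_h).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)[:, None]                     # one predictor on a grid
d2 = (x - x.T) ** 2
K = np.exp(-5.0 * d2) + 1e-8 * np.eye(len(x))          # kappa_alpha = 5 (illustrative)
L = np.linalg.cholesky(K)

H = 20                                                 # truncation level for the sketch
alpha = L @ rng.standard_normal((len(x), H))           # alpha_h(x), h = 1, ..., H
v = norm.cdf(alpha)                                    # Phi{alpha_h(x)}
pi = v * np.cumprod(np.hstack([np.ones((len(x), 1)), 1 - v[:, :-1]]), axis=1)

tau = rng.gamma(1.0, 1.0 / 0.1, size=H)                # tau_h ~ Ga(1, 0.1) in rate form
resid_var = pi @ (1 / tau)                             # var(eps | x) = sum_h pi_h(x)/tau_h
print(resid_var[:5])                                   # the variance changes smoothly with x
```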
P2b. Heteroscedastic symmetric PSB (sPSB) location-scale mixtures
The PSB scale mixture in (5) restricts the residual density to be unimodal. As this is a very restrictive assumption, it is appealing to define a prior with larger support that allows multimodal residual densities, while enforcing the symmetric about zero assumption so that the residual density is constrained to have mean zero. To accomplish this, we propose a novel symmetrized PSB process specification, which is related to the symmetrized Dirichlet process proposed by Tokdar (2006). We define
(6)  Px = ∑_{h=1}^∞ πh(x) ½{δ_{(μh, τh)} + δ_{(−μh, τh)}},
where the atoms (μh, τh) are drawn independently from P0 a priori, with P0 chosen as a product of a N(μ0, σ0²) and a Ga(ατ, βτ) measure. The difference between the sPSB process prior and the PSB process prior is that instead of just placing probability weight πh on atom (μh, τh), we place probability πh/2 on (−μh, τh) and πh/2 on (μh, τh). The resulting residual density under (6) has the form fx(ε) = ∑_{h=1}^∞ πh(x) ½{N(ε; μh, τh^{−1}) + N(ε; −μh, τh^{−1})}. Clearly, each of the realizations corresponds to a mixture of Gaussians that is constrained to be symmetric about zero. The same comments made for the heteroscedastic scale PSB mixture apply here, but (6) is more flexible in allowing multi-modal residual distributions, with modality changing flexibly with predictors. Posterior computation is again straightforward, as will be shown later.
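A small simulation sketch of a single sPSB residual density at a fixed x is given below (our illustration; for simplicity the weights are drawn from a Dirichlet distribution as a stand-in for πh(x), and all hyperparameter values are arbitrary).

```python
# Sketch of a draw from the symmetrized mixture (6) at a fixed x: each atom
# (mu_h, tau_h) receives weight pi_h(x)/2 at +mu_h and pi_h(x)/2 at -mu_h,
# so the resulting residual density is symmetric about zero.
import numpy as np

rng = np.random.default_rng(2)
H = 20
pi = rng.dirichlet(np.ones(H))                         # stand-in for pi_h(x) at one x
mu = rng.normal(0.0, 1.0, size=H)                      # locations mu_h ~ N(mu_0, sigma_0^2)
tau = rng.gamma(1.0, 1.0 / 0.1, size=H)                # precisions tau_h ~ Ga(alpha_tau, beta_tau)

def r_sPSB(n):
    h = rng.choice(H, size=n, p=pi)                    # pick a component
    s = rng.choice([-1.0, 1.0], size=n)                # pick +mu_h or -mu_h with prob 1/2 each
    return rng.normal(s * mu[h], 1.0 / np.sqrt(tau[h]))

eps = r_sPSB(100_000)
print(eps.mean())                                      # close to zero by symmetry
```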
P2c. Homoscedastic scale PSB process mixture of Gaussians
A simpler homoscedastic version of (5) is to consider
(7)  f(ε) = ∑_{h=1}^∞ πh N(ε; 0, τh^{−1}),
where the weights {πh} are specified as in
(8)  πh = Φ(αh) ∏l<h{1 − Φ(αl)},  αh ~ N(μα, σα²).
This implies that ∑_{h=1}^∞ πh = 1 almost surely, so that the unknown density of the residuals is expressed as a countable mixture of Gaussians centered at zero but with varying variances. Observations will be automatically allocated to clusters, with outlying clusters corresponding to components having large variance (low τh). By choosing a hyperprior on μα while letting σα = 1, we allow the data to inform more strongly about the posterior distribution on the number, sizes and allocation to clusters.
P2d. Location-scale symmetrized PSB (sPSB) mixture of Gaussians
A homoscedastic version of (6) is the following.
(9)  f(ε) = ∑_{h=1}^∞ πh ½{N(ε; μh, τh^{−1}) + N(ε; −μh, τh^{−1})},
where the prior on the weights πh is given by (8) and the prior for (μh, τh) is exactly as in P2b.
3 Consistency properties
Let f ~ Πu and f ~ Πs denote the priors for the unknown residual density defined in expressions (7) and (9) respectively. It is appealing for Πu and Πs to have support on a large subset of 𝒮u and 𝒮s respectively, where 𝒮s denotes the set of densities on ℝ with respect to Lebesgue measure that are symmetric about zero and 𝒮u ⊂ 𝒮s is the subset of 𝒮s corresponding to unimodal densities. We characterize the weak support of Πu, denoted by wk(Πu) ⊂ 𝒮u, in the following lemma.
Lemma 1 wk(Πu) = 𝒞m, where 𝒞m = {f ∈ 𝒮u : f(√x) is a completely monotone function of x > 0}.
A function h(x) on (0, ∞) is completely monotone in x if it is infinitely differentiable and (−1)^m h^(m)(x) ≥ 0 for all x > 0 and all integers m ≥ 0. Chu (1973) proved that a density f on ℝ which is symmetric about zero and unimodal can be written as a scale mixture of normals,
f(y) = ∫0^∞ σ^{−1} ϕ(y/σ) g(σ) dσ
for some density g on ℝ+, if and only if f(√x) is a completely monotone function of x, where ϕ is the standard normal pdf. This restriction places a smoothness constraint on f, but still allows a broad variety of densities.
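As a concrete example of this characterization (added here for illustration), the standard Laplace density is symmetric, unimodal and a scale mixture of normals, and complete monotonicity of f(√x) is easy to verify:

```latex
% Example (ours): the Laplace density lies in C_m.
% f(y) = (1/2) e^{-|y|}, so f(\sqrt{x}) = (1/2) e^{-\sqrt{x}} for x > 0.
% e^{-\sqrt{x}} is completely monotone (it is e^{-x^a} with a = 1/2),
% hence f \in \mathcal{C}_m and f lies in the weak support of \Pi_u by Lemma 1.
f(y) = \tfrac{1}{2} e^{-|y|}, \qquad
f(\sqrt{x}) = \tfrac{1}{2} e^{-\sqrt{x}}, \qquad
(-1)^m \frac{d^m}{dx^m} e^{-\sqrt{x}} \ge 0 \quad \text{for all } x > 0,\ m \ge 0 .
```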
Definition 1 Letting f ~ Π, f0 is in the Kullback-Leibler (KL) support of Π if Π{f : ∫ f0 log(f0/f) < ε} > 0 for every ε > 0.
The set of densities f in the Kullback-Leibler support of Π is denoted by KL(Π).
Let 𝒮̃s denote the subset of 𝒮s corresponding to densities satisfying the following regularity conditions.
1. f is nowhere zero and bounded by M < ∞;
2. |∫ℜ f(y) log f(y) dy| < ∞;
3. ∫ℜ f(y) log{f(y)/ψ1(y)} dy < ∞, where ψ1(y) = inf_{t∈[y−1, y+1]} f(t);
4. there exists ψ > 0 such that ∫ℜ |y|^{2(1+ψ)} f(y) dy < ∞.
Lemma 2 KL(Πs) ⊇ 𝒮̃s.
Remark 1 The above assumptions on f are standard regularity conditions introduced by Tokdar (2006) and Wu and Ghosal (2008) to prove that f ∈ KL(Π), where Π is a general stick-breaking prior which has all compactly supported probability distributions as its support. (1) is usually satisfied by common densities arising in practice. (4) imposes a minor tail restriction; e.g., a t-density with (2 + δ) degrees of freedom for some δ > 0 satisfies (4). (1)–(4) are satisfied by a finite mixture of t-densities, or even by an infinite mixture of t-densities with (2 + δ) degrees of freedom for some δ > 0 and bounded component-specific means and variances.
From Lemma 2, it follows that the sPSB location-scale mixture has KL-support on a large subset of the set of densities symmetric about zero. These conditions are important in verifying that the priors are flexible enough to approximate any density subject to the noted restrictions.
We provide fairly general sufficient conditions to ensure strong and weak posterior consistency in estimating the mean regression function and the residual density, respectively. We focus on the case in which a GP prior is chosen for η and an sPSB location-scale mixture of Gaussians is chosen for the residual density as in (9). Similar results can be obtained for the homoscedastic scale PSB process mixture under stronger restrictions on the true residual density. Although showing consistency results using predictor dependent mixtures of normals as the prior for the residual density in (5) and (6) is a challenging task, one can anticipate such results given the theory in Pati et al (2010) and Norets and Pelenis (2010). Indeed, Norets and Pelenis (2011) showed posterior consistency of the regression coefficients in a mean linear regression model with covariate dependent nonparametric residuals using the kernel stick-breaking process of Dunson and Park (2008a). However, showing posterior consistency of the mean regression when we have a Gaussian process prior on the regression function and predictor dependent residuals is quite challenging and is a topic of future research.
For this section, we assume the xi’s are non-random and arising from a fixed design, though the proofs are easily modified for random xi’s. When the covariate values are fixed in advance, we consider the neighborhood based on the empirical measure of the design points. Let Qn be the empirical probability measure of the design points, Qn = n^{−1} ∑_{i=1}^n δ_{xi}. Based on Qn, we define a strong L1 neighborhood of radius Δ > 0 as in Choi (2005) around the true regression function η0. Letting ∥η − η0∥1,n = ∫x∈𝒳 |η(x) − η0(x)| dQn(x), set
(10)  Sn(η0; Δ) = {η ∈ ℱ : ∥η − η0∥1,n < Δ}.
We introduce the following notation. Let f0 denote an arbitrary fixed density in 𝒮̃s, η0 denote an arbitrary fixed regression function in ℱ, and set f0i(·) = f0{· − η0(xi)} and fηi(·) = f{· − η(xi)} for i = 1, …, n.
For any two densities f and g, let K(f, g) = ∫ f log(f/g) and V(f, g) = ∫ f {log+(f/g)}², where log+ x = max(log x, 0). Set Ki(f, η) = K(f0i, fηi) and Vi(f, η) = V(f0i, fηi) for i = 1, …, n.
For technical simplicity assume 𝒳 = [0, 1]p, τ = 1 and μ ≡ 0. Denote by W a mean zero Gaussian process {Wx : x ∈ [0, 1]p} with covariance kernel c(x, x′) = e^{−∥x−x′∥²}. Rescaling the sample paths of an infinitely smooth Gaussian process is a powerful technique to improve the approximation of α-Hölder functions from the RKHS of the scaled process W^κ = {W_{√κ x} : x ∈ [0, 1]p} with κ > 0. Intuitively, for large values of κ, the scaled process traverses the sample path of an unscaled process on the larger set [0, √κ]p, thereby incorporating more “roughness”. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes for a positive random variable κ stochastically independent of W and showed that with a Gamma prior on κ^{p/2}, one obtains the minimax-optimal rate of convergence for arbitrary smooth functions.
Assumption 1: η ~ W^κ with the density g of √κ on the positive real line satisfying
C1 x^p exp{−D1 x^p (log x)^q} ≤ g(x) ≤ C2 x^p exp{−D2 x^p (log x)^q}
for positive constants C1, C2, D1, D2 and every sufficiently large x > 0. Next we state the lemma on prior positivity due to van der Vaart and van Zanten (2009).
Lemma 3 If η satisfies Assumption 1 then P(∥η − η0∥∞ < ε) > 0 ∀ ε > 0, if η0 is continuous.
In order to prove posterior consistency for our proposed model, we rely on a theorem of Amewou-Atisso et al (2003), which is a modification of the celebrated Schwartz (1965) theorem to accommodate independent but not identically distributed data.
Theorem 1 Suppose η is as in Assumption 1 with q ≥ p + 2 and f ~ Πs, with Πs defined in (9). In addition, assume the data are drawn from the true density f0(yi − η0(xi)), with {xi} fixed and non-random, f0 ∈ 𝒮̃s and η0 ∈ ℱ, and with f0 satisfying the additional regularity conditions
∫ y4 f0 (y) dy < ∞ and ∫ f0 (y) | log f0 (y)|2 dy < ∞.
∫ f0(y) [log{f0(y)/ψ1(y)}]² dy < ∞, where ψ1(y) = inf_{t∈[y−1,y+1]} f0(t).
Let 𝒰 be a weak neighborhood of f0 and 𝒲n = 𝒰 × Sn(η0; Δ), with 𝒲n ⊂ 𝒮̃s × ℱ. Then the posterior probability of 𝒲n converges to one almost surely as n → ∞.
Theorem 1 ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.
4 Posterior computation
We first describe the choice of hyperpriors and hyperparameters for the regression function. We choose the typical conjugate prior for the regression coefficients in the mean of the GP, β ~ N(β0, Σ0), where β0 = 0 and Σ0 = cI is a common choice corresponding to a ridge regression shrinkage prior. The prior on τ is given by τ ~ Ga(ντ/2, ντ/2). We let κ ~ Ga(ακ, βκ) with small βκ and large ακ. Normalizing the predictors prior to analysis, we find that the data are quite informative about κ under these priors, so as long as the priors are not overly informative, inferences are robust. The parameter τ controls the heaviness of the tails of the prior for the regression function. In fact, choosing a Ga(ντ/2, ντ/2) prior induces a heavy-tailed t-process with ντ degrees of freedom as a prior for the regression function. We chose ντ to be 3. κ controls the correlation of the Gaussian process at two points in the covariate space, similar to a spatial decay parameter in a spatial random effects model. Although a discrete uniform prior for κ is computationally efficient in leading to a griddy Gibbs update step, there can be sensitivity to the choice of grid. A gamma prior for κ eliminates such sensitivity at some associated computational price in terms of requiring a Metropolis Hastings update that tends to mix slowly. We choose the parameters ακ and βκ so that the mean correlation is 0.1 for two points separated by a distance √p in the covariate space.
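The last calibration step can be made explicit; the sketch below (ours, under the reconstructed kernel form c(x, x′) = τ^{−1}e^{−κ‖x−x′‖²} and an assumed value of ακ) solves for βκ so that the prior mean correlation at distance √p equals 0.1.

```python
# Sketch of the hyperparameter calculation: with kappa ~ Ga(a_k, b_k) and correlation
# exp(-kappa * d^2), the mean correlation at distance d = sqrt(p) is
# E[exp(-kappa * p)] = (b_k / (b_k + p))^{a_k} (gamma moment generating function).
# Fixing a_k (an assumption), we solve for b_k so that this equals 0.1.
import numpy as np

def solve_b(p, a_k, target=0.1):
    # (b/(b+p))^a = target  =>  b = p * t / (1 - t), with t = target^(1/a)
    t = target ** (1.0 / a_k)
    return p * t / (1.0 - t)

p, a_k = 10, 2.0                                   # e.g. 10 normalized covariates
b_k = solve_b(p, a_k)
print(b_k, (b_k / (b_k + p)) ** a_k)               # second value equals 0.1 by construction
```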
Next we describe the hyperprior and associated hyperparameter choices for P1 and P2a-d.
P1: Since the responses are normalized and the covariates are scaled to lie in the interval [0, 1], using a single decay parameter appears to be reasonable. νσ controls the tail-heaviness of the prior for the scaling ϕ. To accommodate outliers with the mean being fixed at 1, we assume ϕi ~ Ga(νσ/2, νσ/2) with νσ ~ Ga(αν, βν). We took Σ0 = 5I, αν = 1, βν = 1. a and b are fixed at 3/2 to resemble a t-distribution with 3 degrees of freedom without the scaling ϕi.
P2a & P2b: We assume κα ~ Ga(γκ, δκ). Assuming the yi are normalized, we can expect the overall variance to be close to one, so the variance of the residuals, ∑h πh(x)/τh, should be less than one. We set ατ = 1 and choose a hyperprior on βτ, βτ ~ Ga(1, k0) with k0 > 1, so that the prior mean of 1/τh is significantly less than one. Different values of k0 were tried to assess robustness of the posteriors. In Sections 5 and 6, we choose γκ = 1, δκ = 5, να = 1, k0 = 10, μ0 = 0, σ0 = 1.
P2c & P2d: Same choices as above except k0 = 5, μα = 0, σα = 1.
For brevity, we provide details for posterior computation only for P1, P2a–b.
4.1 Gaussian process regression with t residuals (P1)
Let Y = (y1, …, yn)′, η = (η(x1), η(x2), …, η(xn))′ and define a matrix T such that Tij = e^{−κ∥xi−xj∥²}. Hence Ση = τ^{−1}T. Assume Ω = diag(1/ϕi : i = 1, …, n) and ϕ = (ϕ1, …, ϕn)′. Then we have Y | η, σ², ϕ ~ N(η, σ²Ω) and η | β, τ, κ ~ N(Xβ, τ^{−1}T).
Next we provide the full conditional distributions needed for Gibbs sampling. Due to conjugacy, η, β, σ^{−2}, ϕ and τ have closed form full conditional distributions, while νσ and κ are updated by using Metropolis Hastings steps within the Gibbs sampler. Let Vη = (τT^{−1} + σ^{−2}Ω^{−1})^{−1} and μη = Vη(τT^{−1}Xβ + σ^{−2}Ω^{−1}Y), so that η | − ~ N(μη, Vη).
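A minimal sketch of this conjugate update (ours, written under the likelihood and prior forms reconstructed above; not the authors' code) is:

```python
# Gibbs update of eta in model P1: Y | eta ~ N(eta, sigma^2 * Omega), Omega = diag(1/phi_i),
# and eta ~ N(X beta, T / tau), so eta | - ~ N(V m, V) with
# V = (tau T^{-1} + sigma^{-2} Omega^{-1})^{-1} and
# m = tau T^{-1} X beta + sigma^{-2} Omega^{-1} Y.
import numpy as np

def update_eta(Y, X, beta, phi, sigma2, tau, kappa, rng):
    """rng is a numpy random Generator; all other arguments are current parameter values."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-kappa * d2) + 1e-8 * np.eye(len(Y))    # T_ij = exp(-kappa ||x_i - x_j||^2)
    T_inv = np.linalg.inv(T)
    prec = tau * T_inv + np.diag(phi) / sigma2         # posterior precision of eta
    V = np.linalg.inv(prec)
    m = V @ (tau * T_inv @ (X @ beta) + (phi * Y) / sigma2)
    return rng.multivariate_normal(m, V)
```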
4.2 Heteroscedastic PSB mixture of normals (P2a)
We propose a Markov chain Monte Carlo algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of Papaspiliopoulos (2008) and Metropolis Hastings sampling. Papaspiliopoulos (2008) proposed the exact block Gibbs sampler as an efficient approach to posterior computation in Dirichlet process mixture models, modifying the block Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler (Papaspiliopoulos and Roberts 2008) and the slice sampler (Walker 2007; Kalli et al 2010). We included the label switching moves introduced by Papaspiliopoulos and Roberts (2008) for better mixing. Introduce γ1, …, γn such that πh(xi) = P(γi = h), h = 1, 2, …. Then the likelihood contribution of observation i can be augmented with a slice variable as f(yi, ui) = ∑_{h=1}^∞ 1{ui < πh(xi)} N{yi; η(xi), τh^{−1}},
where ui ~ U(0, 1). The MCMC steps are given below.
1. Update ui’s and stick breaking random variables: Generate ui | − ~ U{0, πγi(xi)}, i = 1, …, n,
where πh(xi) = Φ{αh(xi)} ∏l<h[1 − Φ{αl(xi)}]. For i = 1, …, n, introduce latent variables Zh(xi), h = 1, 2, …, such that Zh(xi) ~ N(αh(xi), 1). Thus πh(xi) = P(Zh(xi) > 0, Zl(xi) < 0 for l < h). Then, conditionally on γi, Zl(xi) | − ~ N(αl(xi), 1) truncated to (0, ∞) if l = γi, truncated to (−∞, 0) if l < γi, and untruncated for l > γi.
Let Zh = (Zh(x1), …, Zh(xn))′ and αh = (αh(x1), …, αh(xn))′. Letting (Σα)ij = e^{−κα∥xi−xj∥²}, we have Zh ~ N(αh, I) and αh ~ N(0, Σα), so that αh | Zh ~ N{(Σα^{−1} + I)^{−1}Zh, (Σα^{−1} + I)^{−1}}.
Continue up to h*, the minimum integer satisfying ∑_{h=1}^{h*} πh(xi) > 1 − ui for every i = 1, …, n (a small sketch of this truncation logic appears after step 5 below). The Zh and αh, h = 1, …, h*, are updated as above,
while κα is updated using a Metropolis Hastings step.
2. Update allocation to atoms: Update (γ1, … , γn)|− as multinomial random variables with probabilities
P(γi = h | −) ∝ 1{πh(xi) > ui} N{yi; η(xi), τh^{−1}},  h = 1, …, h*.
3. Update component-specific precisions: Letting nl = #{i : γi = l}, l = 1, 2, …, h*, update τl | − ~ Ga{ατ + nl/2, βτ + ½ ∑_{i:γi=l}(yi − η(xi))²}.
4. Update the mean regression function: Letting Λ = diag(τγ1^{−1}, …, τγn^{−1}) and W = (τT^{−1} + Λ^{−1})^{−1}, update η | − ~ N{W(τT^{−1}Xβ + Λ^{−1}Y), W}.
5. Update κ in a Metropolis Hastings step.
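The sketch below (ours, illustrating only the adaptive truncation idea referenced in step 1; in the actual sampler the αh(·) are GP draws conditional on the Zh rather than the independent normals used here) shows how components are introduced only until the generated weights cover 1 − ui for every observation.

```python
# Slice-sampling truncation: given slice variables u_i, new components are introduced
# only until, for every i, the weights already generated cover 1 - u_i, so only
# finitely many atoms are ever needed.
import numpy as np
from scipy.stats import norm

def adaptive_truncation(u, draw_alpha_column, max_H=500):
    """u: slice variables (n,); draw_alpha_column(h) returns alpha_h(x_1..x_n)."""
    n = len(u)
    pi_list, remaining = [], np.ones(n)                # remaining stick mass at each x_i
    for h in range(max_H):
        v = norm.cdf(draw_alpha_column(h))             # Phi{alpha_h(x_i)}
        pi_list.append(remaining * v)                  # pi_h(x_i)
        remaining = remaining * (1 - v)
        # stop when 1 - sum_h pi_h(x_i) = remaining < u_i for every i
        if np.all(remaining < u):
            break
    return np.column_stack(pi_list)                    # n x h* matrix of weights

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, size=30)
pi = adaptive_truncation(u, lambda h: rng.normal(0, 1, size=30))
print(pi.shape)                                        # (30, h*)
```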
4.3 Heteroscedastic sPSB process location-scale mixture (P2b)
We will need the following changes in the updating steps from the previous case.
2. Update allocation to atoms: Update (γ1, …, γn) | − as multinomial random variables with probabilities
P(γi = h | −) ∝ 1{πh(xi) > ui} ½[N{yi; η(xi) + μh, τh^{−1}} + N{yi; η(xi) − μh, τh^{−1}}],  h = 1, …, h*.
3. Component-specific locations and precisions: Let nl = #{i : γi = l}, l = 1, 2, …, h*, and ml = ∑i:γi=l(yi − ηi). The location atoms μl are updated from a two-component mixture of normals as
where .
where .
4. Update the mean regression function: Let Λ = diag(τγ1^{−1}, …, τγn^{−1}), μ* = (μγ1, μγ2, …, μγn)′ and W = (τT^{−1} + Λ^{−1})^{−1}. Hence
where .
5 Measures of influence
There has been limited work on sensitivity of the posterior distribution to perturbations of the data and outliers. Arellano-Valle et al (2000) use deletion diagnostics to assess sensitivity, but their methods are computationally expensive in requiring posterior computation with and without data deleted. Weiss (1996) proposed an alternative that perturbs the posterior instead of the likelihood, and only requires samples from the full posterior. Following Weiss (1996), let f(yi | Θ̃, xi) denote the likelihood of the data yi, define the perturbation ratio gi,δ(Θ̃) = f(yi + δ | Θ̃, xi)/f(yi | Θ̃, xi) for some small δ > 0, and let pi(Θ̃ | Y) denote a new perturbed posterior, pi(Θ̃ | Y) ∝ p(Θ̃ | Y) gi,δ(Θ̃).
Since the responses are normalized prior to analysis, it is reasonable to assume that the perturbation is less than 0.1. We vary δ in [0.01, 0.1] over a grid of width 0.01 and obtain the average of results. Denote by Li the influence measure, which is a divergence measure between the unperturbed posterior p(Θ̃|Y) and the perturbed posterior pi(Θ̃|Y),
Li is bounded and takes values in [0, 1]. When p(Θ̃|Y) = pi(Θ̃|Y), Li = 0, indicating that the perturbation has no influence. On the other hand, if Li = 1, the supports of p(Θ̃|Y) and pi(Θ̃|Y) are disjoint, indicating maximum influence. We can define an overall influence measure as L = n^{−1} ∑_{i=1}^n Li. Clearly L also takes values in [0, 1], with L = 0 ⇒ Li = 0 ∀ i = 1, 2, …, n and L = 1 ⇒ Li = 1 ∀ i = 1, 2, …, n. Weiss (1996) provided a sample version of Li, i = 1, …, n. Letting Θ̃1, …, Θ̃M be the posterior samples with B the burn-in,
where L̂i denotes the resulting sample estimate of Li. Our estimated influence measure is L̂ = n^{−1} ∑_{i=1}^n L̂i. We will calculate the influence measure for our proposed methods and compare their sensitivity.
6 Simulation studies
To assess the performance of our proposed approaches, we consider a number of simulation examples: (i) linear model, homoscedastic error with no outliers, (ii) linear model, homoscedastic error with outliers, (iii) linear model, heteroscedastic errors and outliers, (iv) non-linear model with heteroscedastic errors and outliers, and (v) non-linear model with heteroscedastic errors and outliers, but with fewer true predictors. We let the heaviness of the tails and error variance change with x in cases (iii), (iv) and (v). We considered the following methods of assessing the performance, namely, mean squared prediction error (MSPE), coverage of 95% prediction intervals, mean integrated squared error (MISE) in estimating the regression function at the points for which we have data, pointwise coverage of 95% credible intervals for the regression function and the influence measure (L̂) as described in Section 5. We also consider a variety of sample sizes in the simulation, n = 40, 60, 80, and simulate 10 covariates independently from U(0, 1). Let z be a 10-dimensional vector of i.i.d. U(0, 1) random variables independent of the covariates.
Generation of errors in the heteroscedastic case and outliers: Let fxi(εi) = pxi N(εi; 0, 1) + qxi N(εi; 0, 5), where the weights pxi and qxi = 1 − pxi change with xi. The outliers are simulated from the model with an error distribution that is a mixture of truncated normal distributions (with predictor-dependent weights in the heteroscedastic case), where TNℛ(·; μ, σ²) denotes a truncated normal distribution with mean μ and standard deviation σ over the region ℛ. We consider the following five cases.
Case (i): yi = 2.3 + 5.7x1i + εi, εi ~ N(0, 1) with no outliers.
Case (ii): yi = 2.3 + 5.7x1i + εi, εi ~ 0.95N(0, 1) + 0.05N(0, 10).
Case (iii): yi = 1.2 + 5.7x1i + 4.7x2i + 0.12x3i − 8.9x4i + 2.4x5i + 3.1x6i + 0.01x7i + εi, εi ~ fxi, with 5% outliers generated from the outlier distribution described above.
Case (iv): a non-linear mean function with εi ~ fxi and 5% outliers generated from the outlier distribution described above.
Case (v): yi = 1.2 + 5.7 sin x1i + 3.4 exp(x2i) + 4.7 log |x3i| + εi, εi ~ fxi, with 5% outliers generated from the outlier distribution described above. (A small data-generation sketch for case (ii) is given below.)
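The sketch below (ours) generates a dataset under case (ii); the contaminating component is taken as N(0, 10) with 10 read as a variance, which is an assumption since the parameterization of the 5% contamination is not fully explicit here.

```python
# Sketch of data generation for simulation case (ii): a linear mean with a
# two-component normal contamination error, 95% N(0, 1) and 5% N(0, 10).
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 10
X = rng.uniform(0, 1, size=(n, p))                     # 10 covariates, only x1 is active
outlier = rng.uniform(size=n) < 0.05
eps = np.where(outlier,
               rng.normal(0, np.sqrt(10), size=n),     # contaminating component, variance 10
               rng.normal(0, 1, size=n))               # main component N(0, 1)
y = 2.3 + 5.7 * X[:, 0] + eps
```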
For each of the cases and for each sample size n, we took the first n samples as the training set and the next n samples as the test set. We also compare the MSPE of the proposed methods with robust regression using M-estimation (Huber 1964), Bayesian additive regression trees (Chipman et al 2010), and treed Gaussian processes (Gramacy and Lee 2008). The MCMC algorithms described in Section 4 are used to obtain samples from the posterior distribution. The results for model P1 given here are based on 20,000 samples obtained after a burn-in period of 3,000. The results for models P2a–d are based on 20,000 samples obtained after a burn-in period of 7,000. Rapid convergence was observed based on diagnostic tests of Geweke (1992) and Raftery and Lewis (1992). In addition, the mixing was very good for model P1. For models P2a–d, we use the label switching moves of Papaspiliopoulos and Roberts (2008), which lead to adequate mixing. Tables 1, 2 and 3 summarize the performance of all the methods based on 50 replicated datasets.
Table 1. Simulation results for case (i) (left block of columns) and case (ii) (right block).
n=40 | MSPE | cov(y)a | MISE | cov(η)b | L | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|---|---|---|---|---|
P1 | 0.2997 | 1 | 0.0248 | 1 | 0.0017 | 0.6043 | 1 | 0.0232 | 1 | 0.0027 |
P2a | 0.2821 | 0.9980 | 0.0141 | 1 | 0.0015 | 0.5983 | 0.9740 | 0.0173 | 1 | 0.0019 |
P2b | 0.2798 | 1 | 0.0144 | 1 | 0.0015 | 0.5987 | 0.9745 | 0.0169 | 1 | 0.0017 |
P2c | 0.2980 | 1 | 0.0156 | 1 | 0.0016 | 0.5980 | 0.9750 | 0.0189 | 1 | 0.0020 |
P2d | 0.2869 | 1 | 0.0155 | 1 | 0.0016 | 0.5983 | 0.9750 | 0.0190 | 1 | 0.0020 |
M-Estimation | 0.2820 | 0.0140 | 0.6013 | 0.0177 | ||||||
BART | 0.3510 | 0.6866 | 0.0714 | 0.7051 | 0.7845 | 0.0950 | ||||
Treed GP | 0.3042 | 0.9134 | 0.0256 | 0.6968 | 0.9365 | 0.0803 | ||||
n=60 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2990 | 1 | 0.0246 | 1 | 0.0019 | 0.5776 | 1 | 0.0242 | 1 | 0.0023 |
P2a | 0.2769 | 0.9947 | 0.0103 | 1 | 0.0017 | 0.5471 | 0.95 | 0.0143 | 0.97 | 0.0016 |
P2b | 0.2752 | 0.9963 | 0.0104 | 1 | 0.0016 | 0.5541 | 0.95 | 0.0141 | 0.98 | 0.0016 |
P2c | 0.2852 | 0.9945 | 0.0176 | 1 | 0.0019 | 0.5664 | 0.95 | 0.0142 | 0.99 | 0.0021 |
P2d | 0.2826 | 0.9960 | 0.0173 | 1 | 0.0018 | 0.5561 | 0.95 | 0.0141 | 0.98 | 0.0021 |
M-Estimation | 0.2759 | 0.0103 | 0.5623 | 0.0139 | ||||||
BART | 0.3314 | 0.6753 | 0.0539 | 0.6725 | 0.7777 | 0.1098 | ||||
Treed GP | 0.3000 | 0.9193 | 0.0218 | 0.6880 | 0.9301 | 0.1198 | ||||
n=80 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2913 | 1 | 0.0252 | 1 | 0.0021 | 0.5583 | 1 | 0.0172 | 1 | 0.0022 |
P2a | 0.2592 | 0.9940 | 0.0086 | 1 | 0.0021 | 0.4989 | 0.97 | 0.0050 | 1 | 0.0014 |
P2b | 0.2574 | 0.9956 | 0.0069 | 1 | 0.0020 | 0.4898 | 0.98 | 0.0067 | 1 | 0.0010 |
P2c | 0.2724 | 0.9976 | 0.0187 | 1 | 0.0020 | 0.5104 | 0.98 | 0.0103 | 1 | 0.0017 |
P2d | 0.2716 | 0.9976 | 0.0189 | 1 | 0.0020 | 0.5002 | 0.98 | 0.0097 | 1 | 0.0018 |
M-Estimation | 0.2720 | 0.0079 | 0.5431 | 0.0068 | ||||||
BART | 0.3128 | 0.6525 | 0.0437 | 0.6509 | 0.7815 | 0.1098 | ||||
Treed GP | 0.2886 | 0.9301 | 0.0175 | 0.6532 | 0.9224 | 0.1031 |
cov(y) denotes the coverage of the 95% predictive intervals of the test cases
cov(η) denotes the coverage of the 95% credible intervals of the mean regression function
Table 2. Simulation results for case (iii) (left block of columns) and case (iv) (right block).
n=40 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|---|---|---|---|---|
P1 | 0.4833 | 1 | 0.3612 | 1 | 0.0027 | 0.4416 | 1 | 0.3274 | 1 | 0.0029 |
P2a | 0.2570 | 0.9990 | 0.1394 | 1 | 0.0025 | 0.2783 | 0.9923 | 0.1583 | 0.98 | 0.0023 |
P2b | 0.2586 | 0.9990 | 0.1298 | 1 | 0.0025 | 0.2712 | 0.9867 | 0.1501 | 0.97 | 0.0017 |
P2c | 0.3057 | 1 | 0.2412 | 1 | 0.0026 | 0.3334 | 0.9967 | 0.2213 | 0.99 | 0.0024 |
P2d | 0.3024 | 1 | 0.2304 | 1 | 0.0026 | 0.3216 | 0.9967 | 0.2192 | 0.99 | 0.0022 |
M-Estimation | 0.2613 | 0.1376 | 0.2889 | 0.1663 | ||||||
BART | 0.4639 | 0.8444 | 0.3413 | 0.4103 | 0.8833 | 0.2675 | ||||
Treed GP | 0.3320 | 0.7834 | 0.1979 | 0.3548 | 0.8268 | 0.2108 | ||||
n=60 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2254 | 1 | 0.1154 | 1 | 0.0023 | 0.2367 | 1 | 0.1067 | 1 | 0.0021 |
P2a | 0.1744 | 0.9973 | 0.0572 | 1 | 0.0020 | 0.2178 | 1 | 0.0562 | 0.97 | 0.0019 |
P2b | 0.1712 | 0.9878 | 0.0567 | 1 | 0.0016 | 0.2099 | 1 | 0.0656 | 0.98 | 0.0017 |
P2c | 0.1952 | 0.9998 | 0.0854 | 1 | 0.0021 | 0.2216 | 1 | 0.0879 | 0.99 | 0.0020 |
P2d | 0.1934 | 0.9998 | 0.0799 | 1 | 0.0020 | 0.2208 | 1 | 0.0812 | 0.99 | 0.0020 |
M-Estimation | 0.1746 | 0.0564 | 0.2125 | 0.0678 | ||||||
BART | 0.3429 | 0.8546 | 0.2217 | 0.3385 | 0.9122 | 0.1799 | ||||
Treed GP | 0.2047 | 0.8349 | 0.0779 | 0.2611 | 0.8867 | 0.0899 | ||||
n=80 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.1636 | 1 | 0.0454 | 1 | 0.0018 | 0.1855 | 1 | 0.0346 | 1 | 0.0019 |
P2a | 0.1509 | 0.9976 | 0.0373 | 0.95 | 0.0015 | 0.1653 | 1 | 0.0321 | 0.9952 | 0.0014 |
P2b | 0.1578 | 0.9931 | 0.0324 | 1 | 0.0013 | 0.1614 | 1 | 0.0312 | 0.9932 | 0.0010 |
P2c | 0.1589 | 0.9960 | 0.0404 | 1 | 0.0017 | 0.1774 | 1 | 0.0329 | 0.9980 | 0.0016 |
P2d | 0.1567 | 0.9969 | 0.0401 | 1 | 0.0017 | 0.1770 | 1 | 0.0320 | 0.9969 | 0.0016 |
M-Estimation | 0.1582 | 0.0364 | 0.1832 | 0.0325 | ||||||
BART | 0.2284 | 0.9265 | 0.1098 | 0.2491 | 0.9490 | 0.1083 | ||||
Treed GP | 0.1655 | 0.8876 | 0.0427 | 0.2022 | 0.8923 | 0.0548 |
Table 3. Simulation results for case (v).
n=40 | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|
P1 | 0.6666 | 0.9800 | 0.5856 | 1 | 0.0033 |
P2a | 0.5233 | 0.9770 | 0.3980 | 0.9812 | 0.0025 |
P2b | 0.5231 | 0.9854 | 0.3745 | 0.9765 | 0.0019 |
P2c | 0.5875 | 0.9850 | 0.4452 | 1 | 0.0029 |
P2d | 0.5788 | 0.9859 | 0.4223 | 1 | 0.0028 |
M-Estimation | 0.5531 | 0.3671 | |||
BART | 0.4956 | 0.8980 | |||
Treed GP | 0.7224 | 0.8123 | 0.6132 | ||
n=60 | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.3828 | 1 | 0.2911 | 0.9985 | 0.0031 |
P2a | 0.3745 | 0.9832 | 0.2617 | 0.9840 | 0.0022 |
P2b | 0.3767 | 0.9812 | 0.2601 | 0.9867 | 0.0020 |
P2c | 0.3810 | 0.9900 | 0.2800 | 0.9998 | 0.0027 |
P2d | 0.3800 | 0.9906 | 0.2798 | 0.9998 | 0.0026 |
M-Estimation | 0.3939 | 0.2824 | |||
BART | 0.3930 | 0.9313 | 0.2668 | ||
Treed GP | 0.4225 | 0.9023 | 0.3217 | ||
n=80 | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.3599 | 0.9901 | 0.2759 | 0.9998 | 0.0029 |
P2a | 0.3503 | 0.9762 | 0.2582 | 0.9765 | 0.0022 |
P2b | 0.3519 | 0.9712 | 0.2545 | 0.9715 | 0.0019 |
P2c | 0.3560 | 0.9856 | 0.2656 | 0.9885 | 0.0025 |
P2d | 0.3557 | 0.9800 | 0.2677 | 0.9881 | 0.0024 |
M-Estimation | 0.3905 | 0.2887 | |||
BART | 0.3594 | 0.9442 | 0.2867 | ||
Treed GP | 0.4489 | 0.9125 | 0.3509 |
Tables 1, 2 and 3 clearly show that in small samples both of the heteroscedastic methods (P2a and P2b) have substantially reduced MSPE and MISE relative to the heavy-tailed parametric error model in most of the cases, interestingly even in the homoscedastic cases. This may be because a discrete mixture of Gaussians approximates a single normal better in small samples than a t-distribution does. Methods P2a and P2b also did a better job than method P1 in allowing uncertainty in estimating the mean regression and predicting the test sample observations. The homoscedastic versions P2c and P2d perform better than the parametric model but worse than the heteroscedastic models. In some cases, the heavy-tailed t-residual distribution results in overly conservative predictive and credible intervals. As seen from the value of the influence statistic, the heteroscedastic PSB process mixtures result in more robust inference compared to the parametric error model, with the sPSB process mixture of normals being more robust than the symmetric and unimodal version. As the sample size increases, the difference in the predictive performances between the parametric and the nonparametric models is reduced, and in some cases the parametric error model performs as well as the nonparametric approaches, which is as expected given the Central Limit Theorem.
Table 1 shows that, in the simple linear model with normal homoscedastic errors, all the models perform similarly in terms of mean squared prediction error, though methods P2a and P2b are somewhat better than the rest. Also, in estimating the mean regression function in case (i), methods P2a and P2b performed better than all the other methods. In case (ii) (Table 1), methods P2a and P2b are most robust in terms of estimation and prediction in the presence of outliers. However, there is no significant difference between methods P2a and P2b and methods P2c and P2d in cases (i) and (ii). In cases (iii) and (iv), when the residual distribution is heteroscedastic, methods P2a and P2b perform significantly better than the parametric model P1 and the homoscedastic models P2c and P2d in both estimation and prediction, since the heteroscedastic PSB mixture is very flexible in modeling the residual distribution. This is quite evident from the MSPE values under cases (iii) and (iv) in Table 2. Huber’s M-Estimation method performs similarly to methods P2a–d in cases (i) and (ii) but did not do as well in estimation and prediction in cases (iii) and (iv), when the underlying mean function is non-linear with a heteroscedastic residual distribution. Also, BART failed to perform well in estimating the mean function in small samples in these cases. On the other hand, GP-based approaches perform quite well in these cases in estimating the regression function, with methods P2a and P2b performing better than the rest. Treed GP performed close to method P1 in estimation and prediction, as both methods are based on GP priors on the mean function and have a parametric error distribution. In not allowing heteroscedastic error variance, BART and Treed GP underestimate uncertainty in prediction, leading to overly narrow predictive intervals.
In case (v) (Table 3), where the true model is generated using comparatively fewer true predictors, BART performed slightly better in terms of prediction than the other methods in small samples. However, as the sample size increased, BART performed poorly, while the GP prior on the mean can accommodate the non-linearity, resulting in substantially better predictive performance.
7 Applications
7.1 Boston housing data application
To compare our proposed approaches to alternatives, we applied the methods to a commonly used data set from the literature, the Boston housing data. The response is the median value of the owner-occupied homes (in thousands of dollars) in 506 census tracts in the Boston area, and there are 13 predictors (12 continuous, 1 binary) that might help to explain the variation in the median value across tracts. We predict the median value of the owner-occupied homes, taking the first 253 tracts as the training set and the remaining 253 as the test set. Out of sample predictive performance of our methods is compared to competitors in Table 4. The parametric model P1, the mixture models P2a–d and the M-Estimation method perform very closely to each other in terms of prediction and did better than BART and Treed GP. Methods P1 and P2a even perform slightly better than methods P2b, P2c and P2d. As in the simulation examples, BART and Treed GP underestimate the uncertainty in prediction. On the other hand, the predictive intervals of the methods P1, P2a–d are more conservative and accommodate uncertainty in predicting regions with outliers quite flexibly. Also, model P2b appears to be more robust compared to models P1, P2a, P2c & P2d in terms of the influence measure.
Table 4.
Boston housing data | body fat data | |||||||
---|---|---|---|---|---|---|---|---|
Methods | MSPE | cov(y) | L | corr(Ytest, Ypred)a | MSPE | cov(y) | L | corr(Ytest, Ypred) |
P1 | 0.0012 | 0.99 | 0.0034 | 0.9894 | 0.0055 | 1 | 0.0020 | 0.9972 |
P2a | 0.0013 | 0.99 | 0.0027 | 0.9901 | 0.0031 | 1 | 0.0017 | 0.9984 |
P2b | 0.0016 | 0.99 | 0.0020 | 0.9863 | 0.0029 | 1 | 0.0017 | 0.9989 |
P2c | 0.0014 | 0.99 | 0.0030 | 0.9879 | 0.0034 | 1 | 0.0019 | 0.9969 |
P2d | 0.0013 | 0.99 | 0.0029 | 0.9881 | 0.0032 | 1 | 0.0018 | 0.9975 |
M-Estimation | 0.0016 | 0.9858 | 0.0375 | 0.9710 | ||||
BART | 0.0024 | 0.92 | 0.9836 | 0.0355 | 0.95 | 0.9655 | ||
Treed GP | 0.0053 | 0.91 | 0.9524 | 0.1526 | 0.98 | 0.9250 |
corr(Ytest, Ypred) denotes the sample correlation between the test and predicted y
7.2 Body fat data application
With the increasing trend in obesity and concerns about associated adverse health effects, such as heart disease and diabetes, it has become even more important to obtain accurate estimates of body fat percentage. It is well known that body mass index, which is calculated based only on weight and height, can produce a misleading measure of adiposity as it does not take into account muscle mass or variability in frame size. As a gold standard for measuring percentage of body fat, one can rely on underwater weighing techniques, and age and body circumference measurements have also been widely used as additional predictors. We consider a commonly-used data set from Statlib (http://lib.stat.cmu.edu/datasets/bodyfat), which contains the following 15 variables: percentage of body fat (%), body density from underwater weighing (gm/cm3), age (year), weight (lbs.), height (inches), and ten body circumferences (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, wrist, all in cm). Percentage of body fat is given by Siri’s (1956) equation: percentage of body fat = 495/(body density) − 450.
We predict the percentage of body fat (%), taking the first 126 observations as the training set and the remaining 126 as the test set. We summarize the predictive performances in Table 4.
Table 4 suggests that the nonparametric regression procedures with heteroscedastic residual distributions, P2a and P2b, perform better than the parametric model P1, the homoscedastic models P2c and P2d, BART, M-Estimation and Treed GP in predicting the percentage of body fat.
8 Discussion
We have developed a novel regression model that can accommodate a large range of non-linearity in the mean function and at the same time can flexibly deal with outliers and heteroscedasticity. Based on preliminary simulation results, it appears that our methods P2a and P2b can outperform contemporary nonparametric regression methods, such as Huber’s M-Estimation method, BART and treed Gaussian processes. We also provide theoretical support for the proposed methodology when both the mean and the residuals are modeled nonparametrically.
One possible future direction is to relax the symmetry assumption on the residual distribution and introduce a model for median regression based on conditional PSB mixtures allowing possibly asymmetric residual densities constrained to have zero median. Conditional DP mixtures are well known in the literature (Doss 1985; Burr and Doss 2005) and it is certainly interesting to extend our approach via a conditional PSB. In that way we can hope to obtain a more robust estimate of the regression function. It is challenging to extend our theoretical results to the conditional PSB and to develop a fast algorithm for computation. Another possible theoretical direction is to prove posterior consistency using heteroscedastic mixtures. Currently we only have results for the homoscedastic PSB process mixture.
Acknowledgements
This research was partially supported by grant number R01 ES017240-01 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH). The authors would like to thank Prof. Taeryon Choi for helpful suggestions on the manuscript.
A Proofs of main results
Proof of Lemma 1
It follows from Chu (1973) that any f ∈ 𝒞m can be written as
f(y) = ∫0^∞ σ^{−1} ϕ(y/σ) g(σ) dσ
for some density g on ℝ+. Recall from Ongaro and Cattaneo (2004) that a collection of random weights {πh} with ∑h πh = 1 a.s. is said to have full support if for any m ≥ 1, (π1, …, πm) admits a positive joint density with respect to Lebesgue measure on the corresponding simplex. Ongaro and Cattaneo (2004) showed that if the πh’s have full support, the weak support of a random measure of the form P = ∑h πh δθh, with θh iid from G0, is the set of all probability measures whose support is contained in the support of G0. Since Φ(αh), with αh ~ N(μα, σα²), has a positive density on (0, 1), the πh’s have full support, and hence the weak support of the mixing measure ∑h πh δτh in (7) is the set of all probability measures on ℝ+. It follows that the weak support of the induced prior Πu on 𝒮u, denoted by wk(Πu), is precisely 𝒞m.
Proof of Lemma 2
It follows from Tokdar (2006) that if we can show that the weak support of Πs contains all probability measures symmetric about zero and having compact support, then f ∈ 𝒮̃s ⟹ f ∈ KL(Πs). The argument given in Lemma 1 shows that the weak support of the PSB prior in (4) is the set of all probability measures on ℝ × ℝ+. Now we will show that the symmetrized measure P̃s is in a weak neighborhood of Ps whenever P̃ is in a weak neighborhood of P. We state a lemma to prove our claim.
Lemma 4 Let P̃n be a sequence of probability measures and P̃ be a fixed probability measure. Then P̃n ⟹ P̃ implies P̃ns ⟹ P̃s, with P̃ns and P̃s the symmetrised versions of P̃n and P̃, respectively, where the symmetrizing operation is as defined in (9).
Proof Assume P̃n ⟹ P̃. We have to show that, for any bounded continuous function ϕ on ℝ × ℝ+, ∫ ϕ dP̃ns → ∫ ϕ dP̃s. Now,
∫ ϕ(t, τ) dP̃ns(t, τ) = ∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃n(t, τ).
Since (t, τ) ↦ ½{ϕ(t, τ) + ϕ(−t, τ)} is also a bounded continuous function and P̃n ⟹ P̃,
∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃n(t, τ) → ∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃(t, τ) = ∫ ϕ(t, τ) dP̃s(t, τ)
as n → ∞. This completes the proof of Lemma 4.
Lemma 4 in fact shows that the weak support of Πs contains all probability measures symmetric about zero. With an appeal to Tokdar (2006), f ∈ 𝒮̃s ⟹ f ∈ KL(Πs).
Proof of Theorem 1
In order to prove the theorem we need the following variant of Theorem 2.1 of Amewou-Atisso et al (2003) and Theorem 1 of Choi and Schervish (2007) which we state as Lemma 5. Existence of exponentially consistent tests is a typical tool in showing strong consistency.
Definition 2 Let 𝒲 ⊂ 𝒮̃s × ℱ. A sequence of test functions {Φn} is said to be exponentially consistent for testing H0 : (f, η) = (f0, η0) against H1 : (f, η) ∈ 𝒲
if there exist constants C1, C2, C > 0 such that E(f0,η0)(Φn) ≤ C1 e^{−nC} and sup(f,η)∈𝒲 E(f,η)(1 − Φn) ≤ C2 e^{−nC}.
Lemma 5 Let Π̃ = (Πs × π) be the prior on 𝒮̃s × ℱ. Let Un be a sequence of subsets of 𝒮̃s × ℱ. Suppose that there exist test functions {Φn}, sets Θn ⊂ 𝒮̃s × ℱ, n ≥ 1, and constants C1, C2, c1, c2 > 0 such that
- For all δ > 0 and for almost every data sequence ,
Then .
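For orientation, the mechanism behind Lemma 5 is the usual test-plus-denominator argument. Writing Λn(f, η) = ∏i=1n fi(yi)/f0i(yi) and suppressing the sieve Θn, a sketch of the reasoning (not a restatement of the lemma) is

```latex
\tilde{\Pi}\{U_n^c \mid (x_1,y_1),\ldots,(x_n,y_n)\}
  \;\le\; \Phi_n \;+\;
  \frac{\displaystyle\int_{U_n^c} (1-\Phi_n)\,\Lambda_n(f,\eta)\, d\tilde{\Pi}(f,\eta)}
       {\displaystyle\int \Lambda_n(f,\eta)\, d\tilde{\Pi}(f,\eta)} ,
```

where the first term is handled by the type I error bound, the numerator (in expectation) by the type II error bound over the sieve, and the denominator by the likelihood-ratio lower bound in condition 4.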
In this case we take Unc = 𝒲n with Un = 𝒰 × Sn(f0, Δ) for all n ≥ 1. As in van der Vaart and van Zanten (2009), we construct Θn = ℱ × Θ1n, where Θ1n is built from ℍ1 and 𝔹1, the unit ball of the RKHS of Wa and the unit ball of the Banach space C[0, 1]p respectively, and rn, Mn are increasing sequences to be chosen later. The nth test is constructed by combining a collection of tests, one for each element of a finite covering of Θ1n. It follows from the proof of Theorem 3.1 in van der Vaart and van Zanten (2009) that, under Assumption 1, the sieve Θ1n satisfies entropy and prior mass bounds involving constants d1, d2, K > 0.
Choosing Mn = O(n1/2) and a suitable increasing sequence rn, we observe that
log N(ε, Θ1n, ‖·‖∞) = o(n) and π(Θ1nc) ≤ Ke−d2n
for some constant d2 > 0.
In order to verify conditions 1 and 2 of Lemma 5, we write 𝒲n as a disjoint union of two tractable regions. The particular form of 𝒲n that is of interest to us is 𝒲1n ∪ 𝒲2n, where for any Δ > 0,
𝒲1n = {(f, η) : ‖η − η0‖1,n > Δ} and 𝒲2n = {(f, η) : f ∈ 𝒰c, ‖η − η0‖1,n ≤ Δ}.
We establish the existence of an exponentially consistent sequence of tests for each of these regions by considering the following variants of Proposition 3.1 and Proposition 3.3 of Amewou-Atisso et al (2003).
Proposition 1 There exists an exponentially consistent sequence of tests for 𝒲1n.
Proof Let 0 < t < Δ/2 and set Nt = N(t, Θ1n, ‖·‖∞). Let η1, …, ηNt ∈ Θ1n be such that for each η ∈ Θ1n there exists j with ‖η − ηj‖∞ < t. If ‖η − η0‖1,n > Δ, then ‖ηj − η0‖1,n > Δ/2. It follows from Lemma 3.2 of Amewou-Atisso et al (2003) that there exist a set Kn ⊂ {1, …, n}, sets Ai for i ∈ Kn, and a constant C > 0 depending on f0 for which the bounds stated there hold. If i ≤ n and i ∉ Kn, set Ai = ℝ.
From Lemma 3.1 and Lemma 3.2 of Amewou-Atisso et al (2003), it follows that there exist test functions Φnj, one for each ηj, whose type I and type II error probabilities are exponentially small, with constants C1, C2 > 0. Now define Φn = max1≤j≤Nt Φnj. Then
𝖤(f0,η0)Φn ≤ Σj=1Nt 𝖤(f0,η0)Φnj ≤ NtC1e−nC3
for some constant C3 > 0, which is exponentially small since log Nt = o(n).
Next we consider the type II error probability. The type II error probability of Φn is no larger than that of any of the Φnj and hence is exponentially small.
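The way the covering number Nt enters the type I error bound can be spelled out as follows; this is the standard union-bound step, written with the constants used above.

```latex
\mathsf{E}_{(f_0,\eta_0)}\Phi_n
  \;=\; \mathsf{E}_{(f_0,\eta_0)}\max_{1\le j\le N_t}\Phi_{nj}
  \;\le\; \sum_{j=1}^{N_t}\mathsf{E}_{(f_0,\eta_0)}\Phi_{nj}
  \;\le\; N_t\,C_1 e^{-nC_3}
  \;=\; C_1\exp\{\log N_t - nC_3\},
```

which decays exponentially in n because log Nt = log N(t, Θ1n, ‖·‖∞) = o(n) by the choice of the sieve.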
Proposition 2 There exists an exponentially consistent sequence of tests for 𝒲2n.
Proof Without loss of generality take
𝒰 = {f : |∫ Θ df − ∫ Θ df0| < ε},
where 0 ≤ Θ ≤ 1 and Θ is Lipschitz continuous. Hence there exists M > 0 such that |Θ(y1) − Θ(y2)| < M|y1 − y2|. Set Θ̃i(y) = Θ{y − η0(xi)}. Notice that 𝖤f0iΘ̃i = 𝖤f0Θ. Now a change of variables gives 𝖤fiΘ̃i = ∫ Θ{z + η(xi) − η0(xi)} f(z) dz, so that
|𝖤fiΘ̃i − ∫ Θ df| ≤ M|η(xi) − η0(xi)| for each i.
Hence n−1 Σi=1n |𝖤fiΘ̃i − 𝖤f0iΘ̃i| ≥ ε − MΔ for any f ∈ 𝒰c with ‖η − η0‖1,n ≤ Δ. Now choosing Δ < ε/M and applying Lemma 3.1 of Amewou-Atisso et al (2003) we complete the proof.
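For completeness, the separation behind the last step can be written as the following chain of inequalities; this is a sketch under the assumptions used above, namely that 𝒰 is the subbasic weak neighborhood defined through the bounded Lipschitz function Θ and that ‖η − η0‖1,n = n−1 Σi |η(xi) − η0(xi)|.

```latex
\frac{1}{n}\sum_{i=1}^{n}\bigl|\mathsf{E}_{f_i}\tilde{\Theta}_i-\mathsf{E}_{f_{0i}}\tilde{\Theta}_i\bigr|
 \;\ge\; \Bigl|\int\Theta\,df-\int\Theta\,df_0\Bigr|
         \;-\;\frac{M}{n}\sum_{i=1}^{n}\bigl|\eta(x_i)-\eta_0(x_i)\bigr|
 \;\ge\; \varepsilon - M\Delta \;>\; 0
```

for every f ∈ 𝒰c and every η with ‖η − η0‖1,n ≤ Δ, provided Δ < ε/M; this uniform separation is what Lemma 3.1 of Amewou-Atisso et al (2003) converts into an exponentially consistent test.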
It remains to verify the last condition of Lemma 5, which follows once the prior assigns positive probability to suitable Kullback–Leibler neighborhoods of (f0, η0). Under the assumptions, it follows from Lemma 2 that f0 ∈ KL(Πs). We present an important lemma similar to Lemma 5.1 of Tokdar (2006); it guarantees that K(f0, fθ) and V(f0, fθ) are continuous at θ = 0. First we state and prove some properties of the prior Πs described in (9) which will be used in proving the lemma.
Lemma 6 Suppose Πs is the prior described in (9), with the base measure given by the product of a normal distribution on the location t and a Ga(ατ, βτ) distribution on the precision τ, with ατ > 0 and βτ > 0. Then, Πs-almost surely,
∫ t2 dPs(t, τ) < ∞, ∫ τt2 dPs(t, τ) < ∞, ∫ τ dPs(t, τ) < ∞ and ∫ |log τ| dPs(t, τ) < ∞. (11)
Proof
The proofs of ∫ t2 dPs(t, τ) < ∞ a.s. and ∫ τt2 dPs(t, τ) < ∞ a.s. are similar, so we focus on ∫ |log τ| dPs(t, τ) < ∞. Since ατ > 0, choose an integer m large enough such that ατ − 1/m > 0. Then
∫0<τ<1 (−log τ) τατ−1 e−βττ dτ = ∫0<τ<1 (−τ1/m log τ) τατ−1/m−1 e−βττ dτ < ∞,
since τ1/m log τ is bounded on [0, 1]. Also ∫τ≥1 (log τ) τατ−1 e−βττ dτ ≤ ∫τ≥1 τ·τατ−1 e−βττ dτ < ∞. Hence ∫ |log τ| dG0(t, τ) < ∞, and since 𝖤{∫ |log τ| dPs(t, τ)} = ∫ |log τ| dG0(t, τ), we conclude that ∫ |log τ| dPs(t, τ) < ∞ a.s.
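As a quick numerical sanity check of the base-measure integral above, one can evaluate ∫ |log τ| τατ−1 e−βττ dτ directly for illustrative hyperparameter values (ατ = 0.3 and βτ = 1 below are arbitrary choices, not values used in the paper):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

a_tau, b_tau = 0.3, 1.0                                   # illustrative Ga(a_tau, b_tau) hyperparameters
dens = lambda t: t**(a_tau - 1.0) * np.exp(-b_tau * t) * b_tau**a_tau / gamma_fn(a_tau)

lower, _ = quad(lambda t: -np.log(t) * dens(t), 0.0, 1.0)     # piece controlled by the tau^(1/m) log(tau) bound
upper, _ = quad(lambda t: np.log(t) * dens(t), 1.0, np.inf)   # piece bounded by the first moment, since log(t) <= t
print(lower, upper, lower + upper)                            # all finite: E|log tau| < infinity under this Gamma law
```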
Lemma 7 Under the conditions of Theorem 1, if f(·) = ∫ N(·; t, τ−1) dPs(t, τ) and fθ(y) = f(y − θ), then K(f0, fθ) → K(f0, f) and V(f0, fθ) → V(f0, f) as θ → 0.
Proof Clearly √τ ϕ{√τ(y − θ − t)} → √τ ϕ{√τ(y − t)} as θ → 0. Since √τ ϕ{√τ(y − θ − t)} ≤ √τ/√(2π), which is Ps-integrable by (11) and Jensen’s inequality, the dominated convergence theorem (DCT) gives fθ(y) → f(y) as θ → 0. Hence log fθ(y) → log f(y) pointwise. To apply DCT again, we have to bound |log fθ(y)| by an f0-integrable function.
Let c = ∫ τ dPs(t, τ) < ∞. Then fθ(y) ≤ ∫ √τ/√(2π) dPs(t, τ) ≤ √{c/(2π)}, so that log fθ(y) ≤ ½ log{c/(2π)}.
Now fθ(y) = ∫ √τ ϕ{√τ(y − θ − t)} dPs(t, τ). Hence, by Jensen’s inequality applied to −log x, we get
−log fθ(y) ≤ ∫ [½ log(2π) − ½ log τ + τ(y − θ − t)2/2] dPs(t, τ).
Now since θ → 0, without loss of generality assume |θ| ≤ 1. Hence
|log fθ(y)| ≤ ½ |log{c/(2π)}| + ½ log(2π) + ½ ∫ |log τ| dPs(t, τ) + ∫ τ{(|y| + 1)2 + t2} dPs(t, τ),
which is clearly f0-integrable according to the assumptions of the lemma and the properties of Πs proved in Lemma 6. Similarly |log fθ(y)|2 can be bounded by an f0-integrable function. The conclusion of the lemma follows from a simple application of DCT.
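The continuity statement of Lemma 7 can be checked numerically on a toy example; the two-component mixture, the choice f0 = N(0, 1) and the grid of θ values below are all illustrative and are not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative densities: a two-component normal mixture f, and f0 = N(0, 1) as the "true" density.
f = lambda y: 0.6 * norm.pdf(y, loc=0.8, scale=1.0) + 0.4 * norm.pdf(y, loc=-1.2, scale=2.0)
f0 = lambda y: norm.pdf(y)

def kl(p, q):
    """K(p, q) = integral of p * log(p / q), evaluated by numerical quadrature."""
    val, _ = quad(lambda y: p(y) * np.log(p(y) / q(y)), -np.inf, np.inf)
    return val

base = kl(f0, f)
for theta in [1.0, 0.5, 0.1, 0.01]:
    f_theta = lambda y, th=theta: f(y - th)               # shifted density f_theta(y) = f(y - theta)
    print(theta, round(kl(f0, f_theta), 5), round(base, 5))
# K(f0, f_theta) approaches K(f0, f) as theta -> 0, matching the continuity claim of Lemma 7.
```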
Lemma 2 together with assumption (2) of Theorem 1 guarantees Π{f : K(f0, f) < δ, V(f0, f) < ∞} > 0 for all δ > 0. Since (11) holds Πs-almost surely, we may assume
Πs(𝒰) > 0, where 𝒰 = {f : K(f0, f) < δ, V(f0, f) < ∞ and the integrals in (11) are finite}. (12)
Now for every f(·) = ∫ N(·; t, τ−1) dPs(t, τ) ∈ 𝒰, using Lemma 7, choose δf such that for |θ| < δf,
K(f0, fθ) < 2δ and V(f0, fθ) < ∞.
Now if ‖η − η0‖∞ < δf, then |η(xi) − η0(xi)| < δf for i = 1, …, n. So if f ∈ 𝒰 and ‖η − η0‖∞ < δf, we have
K(f0i, fi) < 2δ and V(f0i, fi) < ∞ for each i = 1, …, n.
From (12) and Lemma 3 we have
Π̃{(f, η) : K(f0i, fi) < 2δ, V(f0i, fi) < ∞ for all i} ≥ ∫𝒰 π{η : ‖η − η0‖∞ < δf} dΠs(f) > 0.
Hence the prior assigns positive probability to the required Kullback–Leibler type neighborhoods of (f0, η0), so the last condition of Lemma 5 is satisfied. This ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.
Contributor Information
Debdeep Pati, Department of Statistical Science, Duke University, dp55@stat.duke.edu.
David B. Dunson, Department of Statistical Science, Duke University, dunson@stat.duke.edu.
References
- Adler RJ. An introduction to continuity, extrema, and related topics for general Gaussian processes. vol 12. Hayward, CA: Academic Press; 1990.
- Albert J, Chib S. Sequential ordinal modeling with applications to survival data. Biometrics. 2001;57(3):829–836. doi: 10.1111/j.0006-341x.2001.00829.x.
- Amewou-Atisso M, Ghosal S, Ghosh JK, Ramamoorthi RV. Posterior consistency for semi-parametric regression problems. Bernoulli. 2003;9(2):291–312.
- Arellano-Valle RB, Galea-Rojas M, Zuazola PI. Bayesian sensitivity analysis in elliptical linear regression models. Journal of Statistical Planning and Inference. 2000;86:175–199.
- Burr D, Doss H. A Bayesian semiparametric model for random-effects meta-analysis. Journal of the American Statistical Association. 2005;100:242–251.
- Bush C, MacEachern S. A semiparametric Bayesian model for randomised block designs. Biometrika. 1996;83(2):275.
- Chan D, Kohn R, Nott D, Kirby C. Locally adaptive semiparametric estimation of the mean and variance functions in regression models. Journal of Computational and Graphical Statistics. 2006;15:915–936.
- Chib S, Greenberg E. Additive cubic spline regression with Dirichlet process mixture errors. Journal of Econometrics. 2010;156(2):322–336.
- Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. The Annals of Applied Statistics. 2010;4(1):266–298.
- Choi T. Posterior consistency in nonparametric regression problems in Gaussian process priors. PhD thesis, Department of Statistics, Carnegie Mellon University; 2005.
- Choi T. Asymptotic properties of posterior distributions in nonparametric regression with non-Gaussian errors. Annals of the Institute of Statistical Mathematics. 2009;61(4):835–859.
- Choi T, Schervish MJ. On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis. 2007;10:1969–1987.
- Chu KC. Estimation and detection in linear systems with elliptical errors. IEEE Transactions on Automatic Control. 1973;18:499–505.
- Chung Y, Dunson D. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2009;104(488):1646–1660. doi: 10.1198/jasa.2009.tm08302.
- Cramér H, Leadbetter MR. Stationary and related stochastic processes, sample function properties and their applications. New York: John Wiley & Sons; 1967.
- Denison D, Holmes C, Mallick B, Smith AFM. Bayesian methods for nonlinear classification and regression. London: Wiley & Sons; 2002.
- Doss H. Bayesian nonparametric estimation of the median; Part I: Computation of the estimates. The Annals of Statistics. 1985;13(4):1432–1444.
- Dunson D, Park J. Kernel stick-breaking processes. Biometrika. 2008a;95(2):307–323. doi: 10.1093/biomet/asn012.
- Dunson DB, Park JH. Kernel stick-breaking processes. Biometrika. 2008b;95(2):307–323. doi: 10.1093/biomet/asn012.
- Dunson DB, Pillai N, Park JH. Bayesian density regression. Journal of the Royal Statistical Society, Series B. 2007;69:163–183.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1(2):209–230.
- Ferguson TS. Prior distributions on spaces of probability measures. The Annals of Statistics. 1974;2(4):615–629.
- Fonseca TCO, Ferreira MAR, Migon HS. Objective Bayesian analysis for the Student-t regression model. Biometrika. 2008;95(2):325–333.
- Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Bayesian Statistics. 1992;4:169–194.
- Ghosal S, Roy A. Posterior consistency of Gaussian process prior in nonparametric binary regression. The Annals of Statistics. 2006;34(5):2413–2429.
- Gramacy R, Lee H. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association. 2008;103(483):1119–1130.
- Griffin J, Steel M. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica. 2008 (to appear).
- Griffin J, Steel MFJ. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, Theory and Methods. 2006;101(473):179–194.
- Huber PJ. Robust estimation of a location parameter. The Annals of Mathematical Statistics. 1964;35(1):73–101.
- Ishwaran H, James L. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96(453):161–173.
- James LF, Lijoi A, Prünster I. Bayesian nonparametric inference via classes of normalized random measures. Tech. rep., ICER Applied Mathematics Working Papers Series 5/2005; 2005.
- Kalli M, Griffin J, Walker S. Slice sampling mixture models. Statistics and Computing. 2010:1–13.
- Kottas A, Gelfand AE. Bayesian semiparametric median regression modeling. Journal of the American Statistical Association. 2001;96(456):1458–1468.
- Lange K, Little RJA, Taylor JMG. Robust statistical modelling using the t distribution. Journal of the American Statistical Association. 1989;84(408):881–896.
- Lavine M, Mockus A. A nonparametric Bayes method for isotonic regression. Journal of Statistical Planning and Inference. 2005;46:235–248.
- Lo AY. On a class of Bayesian nonparametric estimates. I: Density estimates. The Annals of Statistics. 1984;12(1):351–357.
- Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83(1):67–79.
- Neal RJ. Regression and classification using Gaussian process priors. Bayesian Statistics. 1998;6:475–501.
- Norets A, Pelenis J. Posterior consistency in conditional distribution estimation by covariate dependent mixtures. Unpublished manuscript, Princeton University; 2010.
- Norets A, Pelenis J. Bayesian semiparametric regression. Unpublished manuscript, Princeton University; 2011.
- Nott D. Semiparametric estimation of mean and variance functions for non-Gaussian data. Computational Statistics. 2006;21:603–620.
- Ongaro A, Cattaneo C. Discrete random probability measures: a general framework for nonparametric Bayesian inference. Statistics & Probability Letters. 2004;67:33–45.
- Papaspiliopoulos O. A note on posterior sampling from Dirichlet mixture models. Tech. rep.; 2008.
- Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
- Pati D, Dunson D, Tokdar S. Posterior consistency in conditional distribution estimation. Tech. rep., Department of Statistics, Duke University; 2010.
- Raftery AE, Lewis S. How many iterations in the Gibbs sampler? Bayesian Statistics. 1992;4:763–773.
- Rasmussen C, Williams C. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press; 2006.
- Rodriguez A, Dunson D. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6(1):145–178. doi: 10.1214/11-BA605.
- Schwartz L. On Bayes procedures. Z Wahrsch Verw Gebiete. 1965;4:10–26.
- Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
- Tokdar ST. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā. 2006;68:90–110.
- van der Vaart A, van Zanten J. Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections. 2008;3:200–222.
- van der Vaart A, van Zanten J. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics. 2009;37(5B):2655–2675.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
- Walker SG. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation. 2007;36:45–54.
- Weiss R. An approach to Bayesian sensitivity analysis. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(4):739–750.
- West M. Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B. 1984;46(3):431–439.
- West M. On scale mixtures of normal distributions. Biometrika. 1987;74(3):646–648.
- Wu Y, Ghosal S. Kullback–Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
- Yau P, Kohn R. Estimation and variable selection in nonparametric heteroscedastic regression. Statistics and Computing. 2003;13:191–208.