Abstract
We consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. The proposed class of models is based on a Gaussian process prior for the mean regression function and mixtures of Gaussians for the collection of residual densities indexed by predictors. Initially considering the homoscedastic case, we propose priors for the residual density based on probit stick-breaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Both priors restrict the residual density to be symmetric about zero, with the sPSB prior more flexible in allowing multimodal densities. We provide sufficient conditions to ensure strong posterior consistency in estimating the regression function under the sPSB prior, generalizing existing theory focused on parametric residual distributions. The PSB and sPSB priors are generalized to allow residual densities to change nonparametrically with predictors through incorporating Gaussian processes in the stick-breaking components. This leads to a robust Bayesian regression procedure that automatically down-weights outliers and influential observations in a locally-adaptive manner. Posterior computation relies on an efficient data augmentation exact block Gibbs sampler. The methods are illustrated using simulated and real data applications.
Keywords: Data augmentation, exact block Gibbs sampler, Gaussian process, nonparametric regression, outliers, symmetrized probit stick breaking process
1 Introduction
Nonparametric regression offers a more flexible way of modeling the effect of covariates on the response compared to parametric models having restrictive assumptions on the mean function and the residual distribution. Here we consider a fully Bayesian approach. The response y ∈ 𝒴 corresponding to a set of covariates x = (x1, x2, …, xp)′ ∈ 𝒳 can be expressed as
(1)  y = η(x) + ε,
where η(x) = 𝖤(y | x) is the mean regression function under the assumption that the residual density has mean zero, i.e., 𝖤(ε | x) = 0 for all x ∈ 𝒳. Our focus is on obtaining a robust estimate of η while allowing heavy tails to down-weight influential observations. We propose a class of models that allows the residual density to change nonparametrically with predictors x, with homoscedasticity arising as a special case.
There is a substantial literature proposing priors for flexible estimation of the mean function, typically using basis function representations such as splines or wavelets (Denison et al 2002). Most of this literature assumes a constant residual density, possibly up to a scale factor allowing heteroscedasticity. Yau and Kohn (2003) allow the mean and variance to change with predictors using thin plate splines. In certain applications, this structure may be overly restrictive due to the specific splines used and the normality assumption. Chan et al (2006) also used splines for heteroscedastic regression, but with locally adaptive estimation of the residual variance and allowance for uncertainty in variable selection. Nott (2006) considered the problem of simultaneous estimation of the mean and variance function by using penalized splines for possibly non Gaussian data. Due to the lack of conjugacy, these methods rely on involved sampling techniques using Metropolis Hastings, requiring proposal distributions to be chosen that may not be efficient in all cases. The residual density is assumed to have a known parametric form and heavy-tailed distributions have not been considered. In addition, since basis function selection for multiple predictors is highly computationally demanding, additive assumptions are typically made that rule out interactions.
Gaussian process (GP) regression (Adler 1990; Ghosal and Roy 2006; van der Vaart and van Zanten 2008, 2009; Neal 1998) is an increasingly popular choice, which avoids the need to explicitly choose the basis functions, while having many appealing computational and theoretical properties. For articles describing some of these properties, refer to Adler (1990), Cramér and Leadbetter (1967), van der Vaart and van Zanten (2008) and van der Vaart and Wellner (1996). A wide variety of functions can arise as the sample paths of the Gaussian process. GP priors can be chosen that have support on the space of all smooth functions while facilitating Bayes computation through conjugacy properties. In particular, the GP realizations at the data points are simply multivariate Gaussian. As shown by Choi and Schervish (2007), GP priors also lead to consistent estimation of the regression function under normality assumptions on the residuals. van der Vaart and van Zanten (2009) demonstrated that a Gaussian process prior with an inverse-gamma bandwidth leads to an optimal rate of posterior convergence in a mean regression problem with Gaussian errors. Recently, Choi (2009) extended the results of Choi and Schervish (2007) to allow for non-Gaussian symmetric residual distributions (for example, the Laplace distribution) that satisfy certain regularity conditions and for which the induced conditional density belongs to a location-scale family. Although only mild assumptions are required on the parametric scale family, the results depend heavily on parametric assumptions. In particular, this theory of posterior consistency is not applicable to an infinite mixture prior on the residual density. We extend these results, allowing a rich class of residual distributions through PSB mixtures of Gaussians, in Section 3.
There is a rich literature on Bayesian methods for density estimation using mixture models of the form
(2)  g(y) = ∫ f(y; θ) dP(θ),  P ~ Π,
where f(·) is a parametric density and P is an unknown mixing distribution assigned a prior Π. The most common choice of Π is the Dirichlet process (Ferguson 1973, 1974). Lo (1984) showed that Dirichlet process mixtures of normals have dense support on the space of densities with respect to Lebesgue measure, while Escobar and West (1995) developed methods for posterior computation and inference. James et al (2005) considered a broader class of normalized random measures for Π.
In order to combine methods for Bayesian nonparametric regression with methods for Bayesian density estimation, one can potentially use mixture model (2) for the residual density in (1). A number of authors have considered nonparametric priors for the residual distribution in regression. For example, Kottas and Gelfand (2001) proposed mixture models for the error distributions in median regression models. To ensure identifiability of the regression coefficients, the residual distribution is constrained to have median zero. Their approach is very flexible but has the unappealing property of producing a residual density that is discontinuous at zero. In addition, the approach of mixing uniforms leads to blocky-looking estimates of the residual density, particularly for sparse data. Lavine and Mockus (2005) allow both a regression function for a single predictor and the residual distribution to be unknown subject to a monotonicity constraint. A number of recent papers have focused on generalizing model (2) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Refer, for example, to Müller et al (1996); Griffin and Steel (2006, 2008); Dunson et al (2007) and Dunson and Park (2008b) among others. Bush and MacEachern (1996) considered nonparametric estimation of random block effects in an ANOVA-type linear mean regression model with a t residual density, rather than density regression.
Although these approaches are clearly highly flexible, there are several issues that provide motivation for this article. First, to simplify inferences and prior elicitation, it is appealing to separate the mean function η(x) from the residual distribution in the specification, which is accomplished by only a few density regression methods. The general framework of separately modeling the mean function and residual distribution nonparametrically was introduced by Griffin and Steel (2008). They allow the residual distribution to change flexibly with predictors using the order-based Dirichlet process (Griffin and Steel 2006). On the other hand, we would like a computationally simpler specification with straightforward prior elicitation. Chib and Greenberg (2010) develop a nonparametric model jointly for continuous and categorical responses in which they model the mean of the link function and the residual density separately. The mean is modeled using flexible additive splines and the residual density is modeled using a DP scale mixture of normals. However, they do not allow the residual distribution to change flexibly with the predictors. Often we have strong prior information regarding the form of the regression function. Most density regression models do not allow incorporation of prior information on the mean function separately from the residual densities. Second, in many applications, the main interest is in inference on η or in prediction, and the residual distribution can be considered as a nuisance. Third, the use of a residual distribution constrained to have zero mean has rarely been attempted in the nonparametric Bayes literature. This is one of the important contributions of the paper. Finally, we would like to be able to provide a specification with theoretical support. In particular, it would be appealing to show strong posterior consistency in estimating η without requiring restrictive assumptions on η or the residual distribution. Current density regression models lack such theoretical support. In addition, computation for density regression can be quite involved, particularly in cases involving more than a few predictors, and one encounters the curse of dimensionality. Our goal is to obtain a computationally convenient specification that allows consistent estimation of the regression function, while being flexible in the residual distribution specification to obtain robust estimates.
We propose to place a Gaussian process prior on η and to allow the residual density to be unknown through a probit stick-breaking (PSB) process mixture. The basic PSB process specification was proposed by Chung and Dunson (2009) in developing a density regression approach that allows variable selection. On the other hand we are concerned with robust estimation of the mean regression function allowing the residual distribution to change flexibly with predictors. While we want to model the mean regression function nonparametrically, we also want to be able to incorporate our prior knowledge for the regression function quite easily. Here, we propose four novel variants of PSB mixtures for the residual distribution. The first uses a scale mixture of Gaussians to obtain a prior with large support on unimodal symmetric distributions. The next is based on a symmetrized location-scale PSB mixture, which is more flexible in avoiding the unimodality constraint, while constraining the residual density to be symmetric and have mean zero. In addition, we show that this prior leads to strong posterior consistency in estimating η under weak conditions.
To allow the residual density to change flexibly with predictors, we generalize the above priors through incorporating probit transformations of Gaussian processes in the weights. The last two prior specifications allow changing residual variances and tail heaviness with predictors, leading to a highly robust specification which is shown to have better performance in simulation studies and out of sample prediction. It will be shown in some small sample simulated examples that the heteroscedastic symmetrized location-scale PSB mixture leads to even more robust inference than the heteroscedastic scale PSB mixture without compromising out of sample predictive performance.
Section 2 proposes the class of models under consideration. Section 3 shows consistency properties. Section 4 develops efficient posterior computation through an exact block Gibbs sampler. Section 5 describes measures of influence to study robustness properties of our proposed methods. Section 6 contains simulation study results, Section 7 applies the methods to the Boston housing data and body fat data, and Section 8 discusses the results. Proofs are included in the Appendix.
2 Nonparametric regression modeling
2.1 Data Structure and Model
Consider n observations with the ith observation recorded in response to the covariate xi = (xi1, xi2, …, xip)′. Let X = (x1, …, xn)′ be the predictor matrix for all n subjects. The regression model can be expressed as yi = η(xi) + εi, i = 1, …, n.
We assume that the response y ∈ 𝒴 is continuous and x ∈ 𝒳, where 𝒳 ⊂ ℝp is compact. Also, the residuals εi are sampled independently, with fx denoting the residual density specific to predictor value xi = x. We focus initially on the case in which the covariate space 𝒳 is continuous, with the covariates arising from a fixed, non-random design or consisting of i.i.d. realizations of a random variable. We choose a prior on the regression function η(x) that has support on a large subset of 𝒞∞(𝒳), the space of smooth real-valued functions on 𝒳. The priors proposed for {fx, x ∈ 𝒳} will be chosen to have large support so that heavy-tailed distributions and outliers will automatically be accommodated, with influential observations downweighted in estimating η.
2.2 Prior on the Mean Regression Function
We assume that η ∈ ℱ = {g : 𝒳 → ℝ is a continuous function}, with η assigned a Gaussian process (GP) prior, η ~ GP(μ, c), where μ is the mean function and c is the covariance kernel. A Gaussian process is a stochastic process {η(x) : x ∈ 𝒳} such that any finite dimensional distribution is multivariate normal, i.e., for any n and x1, …, xn, η(X) ≔ (η(x1), …, η(xn))′ ~ N(μ(X), Ση), where μ(X) = (μ(x1), …, μ(xn))′ and (Ση)ij = c(xi, xj). Naturally the covariance kernel c(·, ·) must satisfy, for each n and x1, …, xn, that the matrix Ση is positive definite. The smoothness of the covariance kernel essentially controls the smoothness of the sample paths of {η(x) : x ∈ 𝒳}. For an appropriate choice of c, a Gaussian process has large support in the space of all smooth functions. More precisely, the support of a Gaussian process is the closure of the reproducing kernel Hilbert space generated by the covariance kernel with a shift by the mean function (Ghosal and Roy 2006). For example, when 𝒳 ⊂ ℝ, the eigenfunctions of the univariate covariance kernel c(x, x′) = e^{−κ(x−x′)²} span 𝒞∞(𝒳) if κ is allowed to vary freely over ℝ+. Thus we can see that the Gaussian process prior has a rich class of functions as its support and hence is appealing as a prior on the mean regression function. Refer to Rasmussen and Williams (2006) as an introductory textbook on Gaussian processes.
We follow common practice in choosing the mean function in the GP prior to correspond to a linear regression, μ(X) = Xβ, with β denoting unknown regression coefficients. As a commonly used covariance kernel, we took the Gaussian kernel c(x, x′) = τ^{−1} e^{−κ‖x−x′‖²}, where τ and κ are unknown hyperparameters, with κ controlling the local smoothness of the sample paths of η(x). Smoother sample paths imply more borrowing of information from neighboring x values.
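As a small illustration of this prior (a sketch we add here, not the authors' code), the following draws realizations of η at a set of design points under a linear mean and the squared-exponential kernel above; the values of β, τ and κ are arbitrary.

```python
# Minimal sketch: drawing mean-function realizations from the GP prior
# eta ~ GP(X*beta, c) with c(x, x') = (1/tau) * exp(-kappa * ||x - x'||^2).
import numpy as np

def gp_prior_draws(X, beta, tau=1.0, kappa=5.0, n_draws=3, jitter=1e-8):
    # pairwise squared Euclidean distances between the rows of X
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-kappa * d2) / tau                      # covariance at the design points
    mu = X @ beta                                      # linear mean function
    L = np.linalg.cholesky(K + jitter * np.eye(len(X)))
    z = np.random.standard_normal((len(X), n_draws))
    return mu[:, None] + L @ z                         # each column is one realization eta(X)

X = np.random.rand(50, 2)                              # 50 points, p = 2 covariates in [0, 1]
draws = gp_prior_draws(X, beta=np.array([1.0, -0.5]))
print(draws.shape)                                     # (50, 3)
```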
2.3 Priors for Residual Distribution
Motivated by the problem of robust estimation of the regression function η, we consider five different types of priors for the residual distributions {fx, x ∈ 𝒳} as enumerated below. The first prior corresponds to the t distribution, which is widely used for robust modeling of residual distributions (West 1984; Lange et al 1989; Fonseca et al 2008), while the remaining priors are flexible nonparametric specifications.
P1. Heavy tailed parametric error distribution
Following many previous authors, we first consider the case in which the residual distributions follow a homoscedastic Student-t distribution with unknown degrees of freedom. As the Student-t with low degrees of freedom is heavy tailed, outliers are allowed. By placing a hyperprior on the degrees of freedom, νσ ~ Ga(aν, bν), with Ga(a, b) denoting the Gamma distribution with mean a/b, one can obtain a data adaptive approach to down-weighting outliers in estimating the mean regression function. However, note that this specification assumes that the same degrees of freedom and tail-heaviness holds for all x ∈ 𝒳. Following West (1987), we express the Student-t distribution as a scale mixture of normals for ease in computation. In addition, we allow an unknown scale parameter, letting εi ~ N(0, σ2/ϕi), with ϕi ~ Ga(νσ/2, νσ/2), σ−2 ~ Ga(a, b).
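The scale-mixture representation can be checked directly by simulation; the sketch below (ours, with arbitrary parameter values) generates residuals as N(0, σ²/ϕi) with ϕi ~ Ga(νσ/2, νσ/2) and compares their tails with a direct Student-t draw.

```python
# Sketch of the P1 residual construction: eps_i ~ N(0, sigma^2/phi_i) with
# phi_i ~ Ga(nu/2, nu/2) gives, marginally, Student-t residuals with nu degrees
# of freedom and scale sigma.  Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, nu, sigma = 100_000, 3.0, 1.0
phi = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)    # Ga(nu/2, nu/2): mean 1
eps = rng.normal(0.0, sigma / np.sqrt(phi))            # N(0, sigma^2 / phi_i)

# compare tail behaviour with a direct t_nu draw
t_direct = sigma * rng.standard_t(df=nu, size=n)
print(np.quantile(eps, [0.95, 0.99]), np.quantile(t_direct, [0.95, 0.99]))
```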
P2. Nonparametric error distribution
Let 𝒴 = ℜ be the response space and 𝒳 be the covariate space, a compact subset of ℜp. Let ℱ denote the space of densities on 𝒳 × 𝒴 with respect to Lebesgue measure and let ℱd denote the space of all conditional densities subject to mean zero, ℱd = {f(· | ·) : ∫𝒴 y f(y | x) dy = 0 for all x ∈ 𝒳}.
We propose to induce a prior on the space of mean zero conditional densities through a prior for a collection of mixing measures {Px, x ∈ 𝒳} using the following predictor-dependent mixture of kernels.
(3)  Px = ∑_{h=1}^∞ πh(x) δ_{(μh(x), τh)},
where πh(x) ≥ 0 are random functions of x such that ∑_{h=1}^∞ πh(x) = 1 a.s. for each fixed x ∈ 𝒳. The atoms {μh(·)} are iid realizations of a real-valued stochastic process, i.e., P0 is a probability distribution over a function space ℱ𝒳, and the τh are iid from P0,σ, a probability distribution on ℜ+. Hence for each x ∈ 𝒳, Px is a random probability measure over the measurable Polish space (ℜ × ℜ+, ℬ(ℜ × ℜ+)). Before proposing the prior, we first review the probit stick-breaking process specification and its relationship to the Dirichlet process. Rodriguez and Dunson (2011) introduce the probit stick-breaking process in broad settings and discuss some smoothness and clustering properties. A probability measure P ∈ 𝒫 on (𝒴, ℬ(𝒴)) follows a probit stick-breaking process with base measure P0 if it has a representation of the form
(4)  P = ∑_{h=1}^∞ wh δ_{θh},
where the atoms {θh} are independent and identically distributed from P0 and the random weights are defined as wh = Φ(αh) ∏l<h{1 − Φ(αl)}, with αh ~ N(μα, σα²) independently. Here Φ(·) denotes the cumulative distribution function for the standard normal distribution. Note that expression (4) is identical to the stick-breaking representation (Sethuraman 1994) of the Dirichlet process (DP), but the DP is obtained by replacing the stick-breaking weight Φ(αh) with a beta(1, α) distributed random variable. Hence, the PSB process differs from the DP in using probit transformations of Gaussian random variables instead of betas for the stick lengths, with the two specifications being identical in the special case in which μα = 0, σα = 1 and the DP precision parameter is α = 1. Rodriguez and Dunson (2011) also mentioned the possibility of constructing a variety of predictor-dependent models, e.g., latent Markov random fields, spatio-temporal processes, etc., by using probit transformations of latent Gaussian processes. Such latent Gaussian processes can be updated using data augmentation Gibbs sampling as in continuation-ratio probit models for survival analysis (Albert and Chib 2001). While we follow similar computational strategies as in Rodriguez and Dunson (2011), they did not consider robust regression using a predictor-dependent residual density.
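A minimal sketch of the stick-breaking construction in (4) (our illustration, not the authors' code) is given below; with μα = 0 and σα = 1 the stick lengths Φ(αh) are uniform, matching the Beta(1, 1) sticks of a DP with precision 1.

```python
# Sketch of the probit stick-breaking weights in (4):
# w_h = Phi(alpha_h) * prod_{l<h} {1 - Phi(alpha_l)}, alpha_h ~ N(mu_alpha, sigma_alpha^2).
import numpy as np
from scipy.stats import norm

def psb_weights(H, mu_alpha=0.0, sigma_alpha=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    v = norm.cdf(rng.normal(mu_alpha, sigma_alpha, size=H))   # stick lengths Phi(alpha_h)
    w = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))   # stick-breaking weights
    return w

w = psb_weights(H=25)
print(w[:5], w.sum())   # weights decay; the truncated sum is close to 1
```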
Under the symmetric about zero assumption, we propose two nonparametric priors for the residual density fx for all x ∈ 𝒳. The first prior is a predictor dependent PSB scale mixture of Gaussians which enforces symmetry about zero and unimodality, and the next is a symmetrized location-scale PSB mixture of Gaussians, which we develop to satisfy the symmetric about zero assumption while allowing multimodality.
P2a. Heteroscedastic scale PSB mixtures
To allow the residual density to change flexibly with predictors, while maintaining the constraint that each of the predictor-dependent residual distributions is unimodal and symmetric about zero, we propose the following specification
(5)  fx(ε) = ∑_{h=1}^∞ πh(x) N(ε; 0, τh^{−1}),  τh ~ Ga(ατ, βτ) independently,
where πh(x) = Φ{αh(x)} ∏l<h[1 − Φ{αl(x)}] is the predictor-dependent probability weight on the hth mixture component, and the αh’s are drawn independently from zero mean Gaussian processes having covariance kernel cα(x, x′) = e^{−κα‖x−x′‖²}. This implies ∑_{h=1}^∞ πh(x) = 1 almost surely for each x ∈ 𝒳 and is a highly-flexible specification that enforces smoothly changing mixture weights across the predictor space, so that the residual densities at x and x′ will tend to be similar if x is located close to x′, as measured by κα‖x − x′‖².
Clearly, the specification allows the residual variance to change flexibly with predictors, as we have var(ε | x) = ∑_{h=1}^∞ πh(x)/τh. However, unlike the previously proposed methods for heteroscedastic nonlinear regression, we do not just allow the variances to vary, but allow any aspect of the density to vary, including the heaviness of the tails. This allows locally adaptive downweighting of outliers in estimating the mean function. Previous methods, which instead assume a single heavy-tailed residual distribution, such as a t-distribution, can lead to a lack of robustness due to global estimation of a single degree of freedom parameter. In addition, due to the form of our specification, posterior computation becomes very straightforward using a data augmentation Gibbs sampler, which involves simple steps for sampling from conjugate full conditional distributions. Even under the assumption of Gaussian residual distributions, posterior computation for heteroscedastic models tends to be complex, with Gibbs sampling typically not possible due to the lack of conditional conjugacy.
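The following sketch (ours; the truncation level and hyperparameter values are illustrative) shows how latent GP draws are mapped to predictor-dependent weights πh(x) and how the implied residual variance ∑h πh(x)/τh then varies smoothly over the predictor space.

```python
# Sketch of the P2a construction (5): latent GPs alpha_h(x) with covariance
# exp(-kappa_alpha * ||x - x'||^2) are pushed through the probit stick-breaking map
# to give pi_h(x); the residual density at x is sum_h pi_h(x) N(0, 1/tau_h).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)[:, None]                     # one predictor on a grid
d2 = (x - x.T) ** 2
K = np.exp(-5.0 * d2) + 1e-8 * np.eye(len(x))          # kappa_alpha = 5 (illustrative)
L = np.linalg.cholesky(K)

H = 20                                                 # truncation level for the sketch
alpha = L @ rng.standard_normal((len(x), H))           # alpha_h(x), h = 1, ..., H
v = norm.cdf(alpha)                                    # Phi{alpha_h(x)}
pi = v * np.cumprod(np.hstack([np.ones((len(x), 1)), 1 - v[:, :-1]]), axis=1)

tau = rng.gamma(1.0, 1.0 / 0.1, size=H)                # tau_h ~ Ga(1, 0.1) in rate form
resid_var = pi @ (1 / tau)                             # var(eps | x) = sum_h pi_h(x)/tau_h
print(resid_var[:5])                                   # the variance changes smoothly with x
```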
P2b. Heteroscedastic symmetric PSB (sPSB) location-scale mixtures
The PSB scale mixture in (5) restricts the residual density to be unimodal. As this is a very restrictive assumption, it is appealing to define a prior with larger support that allows multimodal residual densities, while enforcing the symmetric about zero assumption so that the residual density is constrained to have mean zero. To accomplish this, we propose a novel symmetrized PSB process specification, which is related to the symmetrized Dirichlet process proposed by Tokdar (2006). We define
(6)  Px = ∑_{h=1}^∞ πh(x) ½{δ_{(μh, τh)} + δ_{(−μh, τh)}},
where the atoms (μh, τh) are drawn independently from P0 a priori, with P0 chosen as a product of a N(μ0, σ0²) and a Ga(ατ, βτ) measure. The difference between the sPSB process prior and the PSB process prior is that instead of just placing probability weight πh on atom (μh, τh), we place probability πh/2 on (−μh, τh) and πh/2 on (μh, τh). The resulting residual density under (6) has the form fx(ε) = ∑_{h=1}^∞ πh(x) ½{N(ε; μh, τh^{−1}) + N(ε; −μh, τh^{−1})}. Clearly, each of the realizations corresponds to a mixture of Gaussians that is constrained to be symmetric about zero. The same comments made for the heteroscedastic scale PSB mixture apply here, but (6) is more flexible in allowing multi-modal residual distributions, with modality changing flexibly with predictors. Posterior computation is again straightforward, as will be shown later.
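A small simulation sketch of a single sPSB residual density at a fixed x is given below (our illustration; for simplicity the weights are drawn from a Dirichlet distribution as a stand-in for πh(x), and all hyperparameter values are arbitrary).

```python
# Sketch of a draw from the symmetrized mixture (6) at a fixed x: each atom
# (mu_h, tau_h) receives weight pi_h(x)/2 at +mu_h and pi_h(x)/2 at -mu_h,
# so the resulting residual density is symmetric about zero.
import numpy as np

rng = np.random.default_rng(2)
H = 20
pi = rng.dirichlet(np.ones(H))                         # stand-in for pi_h(x) at one x
mu = rng.normal(0.0, 1.0, size=H)                      # locations mu_h ~ N(mu_0, sigma_0^2)
tau = rng.gamma(1.0, 1.0 / 0.1, size=H)                # precisions tau_h ~ Ga(alpha_tau, beta_tau)

def r_sPSB(n):
    h = rng.choice(H, size=n, p=pi)                    # pick a component
    s = rng.choice([-1.0, 1.0], size=n)                # pick +mu_h or -mu_h with prob 1/2 each
    return rng.normal(s * mu[h], 1.0 / np.sqrt(tau[h]))

eps = r_sPSB(100_000)
print(eps.mean())                                      # close to zero by symmetry
```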
P2c. Homoscedastic scale PSB process mixture of Gaussians
A simpler homoscedastic version of (5) is to consider
(7)  f(ε) = ∑_{h=1}^∞ πh N(ε; 0, τh^{−1}),
where the weights {πh} are specified as in
(8)  πh = Φ(αh) ∏l<h{1 − Φ(αl)},  αh ~ N(μα, σα²).
This implies that ∑_{h=1}^∞ πh = 1 almost surely, so that the unknown density of the residuals is expressed as a countable mixture of Gaussians centered at zero but with varying variances. Observations will be automatically allocated to clusters, with outlying clusters corresponding to components having large variance (low τh). By choosing a hyperprior on μα while letting σα = 1, we allow the data to inform more strongly about the posterior distribution on the number, sizes and allocation to clusters.
P2d. Location-scale symmetrized PSB (sPSB) mixture of Gaussians
A homoscedastic version of (6) is the following.
(9)  f(ε) = ∑_{h=1}^∞ πh ½{N(ε; μh, τh^{−1}) + N(ε; −μh, τh^{−1})},
where the prior on the weights πh is given by (8) and the prior for (μh, τh) is exactly as in P2b.
3 Consistency properties
Let f ~ Πu and f ~ Πs denote the priors for the unknown residual density defined in expressions (7) and (9) respectively. It is appealing for Πu and Πs to have support on a large subset of 𝒮u and 𝒮s respectively, where 𝒮s denotes the set of densities on ℝ with respect to Lebesgue measure that are symmetric about zero and 𝒮u ⊂ 𝒮s is the subset of 𝒮s corresponding to unimodal densities. We characterize the weak support of Πu, denoted by wk(Πu) ⊂ 𝒮u, in the following lemma.
Lemma 1 wk(Πu) = 𝒞m, where 𝒞m = {f ∈ 𝒮u : f(√x) is a completely monotone function of x > 0}.
A function h(x) on (0, ∞) is completely monotone in x if it is infinitely differentiable and (−1)^m h^(m)(x) ≥ 0 for all x > 0 and all integers m ≥ 0. Chu (1973) proved that a density f on ℝ which is symmetric about zero and unimodal can be written as a scale mixture of normals,
f(y) = ∫0^∞ σ^{−1} ϕ(y/σ) g(σ) dσ
for some density g on ℝ+, if and only if f(√x) is a completely monotone function of x, where ϕ is the standard normal pdf. This restriction places a smoothness constraint on f, but still allows a broad variety of densities.
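As a concrete example of this characterization (added here for illustration), the standard Laplace density is symmetric, unimodal and a scale mixture of normals, and complete monotonicity of f(√x) is easy to verify:

```latex
% Example (ours): the Laplace density lies in C_m.
% f(y) = (1/2) e^{-|y|}, so f(\sqrt{x}) = (1/2) e^{-\sqrt{x}} for x > 0.
% e^{-\sqrt{x}} is completely monotone (it is e^{-x^a} with a = 1/2),
% hence f \in \mathcal{C}_m and f lies in the weak support of \Pi_u by Lemma 1.
f(y) = \tfrac{1}{2} e^{-|y|}, \qquad
f(\sqrt{x}) = \tfrac{1}{2} e^{-\sqrt{x}}, \qquad
(-1)^m \frac{d^m}{dx^m} e^{-\sqrt{x}} \ge 0 \quad \text{for all } x > 0,\ m \ge 0 .
```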
Definition 1 Letting f ~ Π, f0 is in the Kullback-Leibler (KL) support of Π if Π{f : ∫ f0 log(f0/f) < ε} > 0 for every ε > 0.
The set of densities f in the Kullback-Leibler support of Π is denoted by KL(Π).
Let 𝒮̃s denote the subset of 𝒮s corresponding to densities satisfying the following regularity conditions.
1. f is nowhere zero and bounded by M < ∞;
2. |∫ℜ f(y) log f(y) dy| < ∞;
3. ∫ℜ f(y) log{f(y)/ψ1(y)} dy < ∞, where ψ1(y) = inf_{t∈[y−1, y+1]} f(t);
4. there exists ψ > 0 such that ∫ℜ |y|^{2(1+ψ)} f(y) dy < ∞.
Lemma 2 KL(Πs) ⊇ 𝒮̃s.
Remark 1 The above assumptions on f are standard regularity conditions introduced by Tokdar (2006) and Wu and Ghosal (2008) to prove that f ∈ KL(Π), where Π is a general stick-breaking prior which has all compactly supported probability distributions as its support. (1) is usually satisfied by common densities arising in practice. (4) imposes a minor tail restriction; e.g., a t-density with (2 + δ) degrees of freedom for some δ > 0 satisfies (4). (1)–(4) are satisfied by a finite mixture of t-densities, or even by an infinite mixture of t-densities with (2 + δ) degrees of freedom for some δ > 0 and bounded component-specific means and variances.
From Lemma 2, it follows that the sPSB location-scale mixture has KL-support on a large subset of the set of densities symmetric about zero. These conditions are important in verifying that the priors are flexible enough to approximate any density subject to the noted restrictions.
We provide fairly general sufficient conditions to ensure strong and weak posterior consistency in estimating the mean regression function and the residual density, respectively. We focus on the case in which a GP prior is chosen for η and an sPSB location-scale mixture of Gaussians is chosen for the residual density as in (9). Similar results can be obtained for the homoscedastic scale PSB process mixture under stronger restrictions on the true residual density. Although showing consistency results using predictor dependent mixtures of normals as the prior for the residual density in (5) and (6) is a challenging task, one can anticipate such results given the theory in Pati et al (2010) and Norets and Pelenis (2010). Indeed, Norets and Pelenis (2011) showed posterior consistency of the regression coefficients in a mean linear regression model with covariate dependent nonparametric residuals using the kernel stick-breaking process of Dunson and Park (2008a). However, showing posterior consistency of the mean regression when we have a Gaussian process prior on the regression function and predictor dependent residuals is quite challenging and is a topic of future research.
For this section, we assume the xi’s are non-random and arising from a fixed design, though the proofs are easily modified for random xi’s. When the covariate values are fixed in advance, we consider the neighborhood based on the empirical measure of the design points. Let Qn be the empirical probability measure of the design points, Qn = n^{−1} ∑_{i=1}^n δ_{xi}. Based on Qn, we define a strong L1 neighborhood of radius Δ > 0 as in Choi (2005) around the true regression function η0. Letting ∥η − η0∥1,n = ∫x∈𝒳 |η(x) − η0(x)| dQn(x), set
(10)  Sn(η0; Δ) = {η ∈ ℱ : ∥η − η0∥1,n < Δ}.
We introduce the following notation. Let f0 denote an arbitrary fixed density in 𝒮̃s, η0 denote an arbitrary fixed regression function in ℱ, and set f0i(·) = f0{· − η0(xi)} and fηi(·) = f{· − η(xi)} for i = 1, …, n.
For any two densities f and g, let K(f, g) = ∫ f log(f/g) and V(f, g) = ∫ f {log+(f/g)}², where log+ x = max(log x, 0). Set Ki(f, η) = K(f0i, fηi) and Vi(f, η) = V(f0i, fηi) for i = 1, …, n.
For technical simplicity assume 𝒳 = [0, 1]p, τ = 1 and μ ≡ 0. Denote by W a mean zero Gaussian process {Wx : x ∈ [0, 1]p} with covariance kernel c(x, x′) = e^{−∥x−x′∥²}. Rescaling the sample paths of an infinitely smooth Gaussian process is a powerful technique to improve the approximation of α-Hölder functions from the RKHS of the scaled process W^κ = {W_{√κ x} : x ∈ [0, 1]p} with κ > 0. Intuitively, for large values of κ, the scaled process traverses the sample path of an unscaled process on the larger set [0, √κ]p, thereby incorporating more “roughness”. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes for a positive random variable κ stochastically independent of W and showed that with a Gamma prior on κ^{p/2}, one obtains the minimax-optimal rate of convergence for arbitrary smooth functions.
Assumption 1: η ~ W^κ with the density g of √κ on the positive real line satisfying
C1 x^p exp{−D1 x^p (log x)^q} ≤ g(x) ≤ C2 x^p exp{−D2 x^p (log x)^q}
for positive constants C1, C2, D1, D2 and every sufficiently large x > 0. Next we state the lemma on prior positivity due to van der Vaart and van Zanten (2009).
Lemma 3 If η satisfies Assumption 1 then P(∥η − η0∥∞ < ε) > 0 ∀ ε > 0, if η0 is continuous.
In order to prove posterior consistency for our proposed model, we rely on a theorem of Amewou-Atisso et al (2003), which is a modification of the celebrated Schwartz (1965) theorem to accommodate independent but not identically distributed data.
Theorem 1 Suppose η is as in Assumption 1 with q ≥ p + 2 and f ~ Πs, with Πs defined in (9). In addition, assume the data are drawn from the true density f0(yi − η0(xi)), with {xi} fixed and non-random, f0 ∈ 𝒮̃s and η0 ∈ ℱ, and with f0 satisfying the additional regularity conditions
∫ y4 f0 (y) dy < ∞ and ∫ f0 (y) | log f0 (y)|2 dy < ∞.
∫ f0(y) [log{f0(y)/ψ1(y)}]² dy < ∞, where ψ1(y) = inf_{t∈[y−1,y+1]} f0(t).
Let 𝒰 be a weak neighborhood of f0 and 𝒲n = 𝒰 × Sn(η0; Δ), with 𝒲n ⊂ 𝒮̃s × ℱ. Then the posterior probability of 𝒲n converges to one almost surely as n → ∞.
Theorem 1 ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.
4 Posterior computation
We first describe the choice of hyperpriors and hyperparameters for the regression function. We choose the typical conjugate prior for the regression coefficients in the mean of the GP, β ~ N(β0, Σ0), where β0 = 0 and Σ0 = cI is a common choice corresponding to a ridge regression shrinkage prior. The prior on τ is given by τ ~ Ga(ντ/2, ντ/2). We let κ ~ Ga(ακ, βκ) with small βκ and large ακ. Normalizing the predictors prior to analysis, we find that the data are quite informative about κ under these priors, so as long as the priors are not overly informative, inferences are robust. The parameter τ controls the heaviness of the tails of the prior for the regression function. In fact, choosing a Ga(ντ/2, ντ/2) prior induces a heavy-tailed t-process with ντ degrees of freedom as a prior for the regression function. We chose ντ to be 3. κ controls the correlation of the Gaussian process at two points in the covariate space, similar to a spatial decay parameter in a spatial random effects model. Although a discrete uniform prior for κ is computationally efficient in leading to a griddy Gibbs update step, there can be sensitivity to the choice of grid. A gamma prior for κ eliminates such sensitivity at some associated computational price in terms of requiring a Metropolis Hastings update that tends to mix slowly. We choose the parameters ακ and βκ so that the mean correlation is 0.1 for two points separated by a distance √p in the covariate space.
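The last calibration step can be made explicit; the sketch below (ours, under the reconstructed kernel form c(x, x′) = τ^{−1}e^{−κ‖x−x′‖²} and an assumed value of ακ) solves for βκ so that the prior mean correlation at distance √p equals 0.1.

```python
# Sketch of the hyperparameter calculation: with kappa ~ Ga(a_k, b_k) and correlation
# exp(-kappa * d^2), the mean correlation at distance d = sqrt(p) is
# E[exp(-kappa * p)] = (b_k / (b_k + p))^{a_k} (gamma moment generating function).
# Fixing a_k (an assumption), we solve for b_k so that this equals 0.1.
import numpy as np

def solve_b(p, a_k, target=0.1):
    # (b/(b+p))^a = target  =>  b = p * t / (1 - t), with t = target^(1/a)
    t = target ** (1.0 / a_k)
    return p * t / (1.0 - t)

p, a_k = 10, 2.0                                   # e.g. 10 normalized covariates
b_k = solve_b(p, a_k)
print(b_k, (b_k / (b_k + p)) ** a_k)               # second value equals 0.1 by construction
```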
Next we describe the hyperprior and associated hyperparameter choices for P1 and P2a-d.
P1: Since the responses are normalized and the covariates are scaled to lie in the interval [0, 1], using a single decay parameter appears to be reasonable. νσ controls the tail-heaviness of the prior for the scaling ϕ. To accommodate outliers with the mean being fixed at 1, we assume ϕi ~ Ga(νσ/2, νσ/2) with νσ ~ Ga(αν, βν). We took Σ0 = 5I, αν = 1, βν = 1. a and b are fixed at 3/2 to resemble a t-distribution with 3 degrees of freedom without the scaling ϕi.
P2a & P2b: We assume κα ~ Ga(γκ, δκ). Assuming the yi are normalized, we can expect the overall variance to be close to one, so the variance of the residuals, ∑h πh(x)/τh, should be less than one. We set ατ = 1 and choose a hyperprior on βτ, βτ ~ Ga(1, k0) with k0 > 1, so that the prior mean of 1/τh is significantly less than one. Different values of k0 were tried to assess robustness of the posteriors. In Sections 5 and 6, we choose γκ = 1, δκ = 5, να = 1, k0 = 10, μ0 = 0, σ0 = 1.
P2c & P2d: Same choices as above except k0 = 5, μα = 0, σα = 1.
For brevity, we provide details for posterior computation only for P1, P2a–b.
4.1 Gaussian process regression with t residuals (P1)
Let Y = (y1, …, yn)′, η = (η(x1), η(x2), …, η(xn))′ and define a matrix T such that Tij = e^{−κ∥xi−xj∥²}. Hence Ση = τ^{−1}T. Assume Ω = diag(1/ϕi : i = 1, …, n) and ϕ = (ϕ1, …, ϕn)′. Then we have Y | η, σ², ϕ ~ N(η, σ²Ω) and η | β, τ, κ ~ N(Xβ, τ^{−1}T).
Next we provide the full conditional distributions needed for Gibbs sampling. Due to conjugacy, η, β, σ^{−2}, ϕ and τ have closed form full conditional distributions, while νσ and κ are updated by using Metropolis Hastings steps within the Gibbs sampler. Let Vη = (τT^{−1} + σ^{−2}Ω^{−1})^{−1} and μη = Vη(τT^{−1}Xβ + σ^{−2}Ω^{−1}Y), so that η | − ~ N(μη, Vη).
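A minimal sketch of this conjugate update (ours, written under the likelihood and prior forms reconstructed above; not the authors' code) is:

```python
# Gibbs update of eta in model P1: Y | eta ~ N(eta, sigma^2 * Omega), Omega = diag(1/phi_i),
# and eta ~ N(X beta, T / tau), so eta | - ~ N(V m, V) with
# V = (tau T^{-1} + sigma^{-2} Omega^{-1})^{-1} and
# m = tau T^{-1} X beta + sigma^{-2} Omega^{-1} Y.
import numpy as np

def update_eta(Y, X, beta, phi, sigma2, tau, kappa, rng):
    """rng is a numpy random Generator; all other arguments are current parameter values."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-kappa * d2) + 1e-8 * np.eye(len(Y))    # T_ij = exp(-kappa ||x_i - x_j||^2)
    T_inv = np.linalg.inv(T)
    prec = tau * T_inv + np.diag(phi) / sigma2         # posterior precision of eta
    V = np.linalg.inv(prec)
    m = V @ (tau * T_inv @ (X @ beta) + (phi * Y) / sigma2)
    return rng.multivariate_normal(m, V)
```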
4.2 Heteroscedastic PSB mixture of normals (P2a)
We propose a Markov chain Monte Carlo algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of Papaspiliopoulos (2008) and Metropolis Hastings sampling. Papaspiliopoulos (2008) proposed the exact block Gibbs sampler as an efficient approach to posterior computation in Dirichlet process mixture models, modifying the block Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler (Papaspiliopoulos and Roberts 2008) and the slice sampler (Walker 2007; Kalli et al 2010). We included the label switching moves introduced by Papaspiliopoulos and Roberts (2008) for better mixing. Introduce γ1, …, γn such that πh(xi) = P(γi = h), h = 1, 2, …. Then the likelihood contribution of observation i can be augmented with a slice variable as f(yi, ui) = ∑_{h=1}^∞ 1{ui < πh(xi)} N{yi; η(xi), τh^{−1}},
where ui ~ U(0, 1). The MCMC steps are given below.
1. Update ui’s and stick breaking random variables: Generate ui | − ~ U{0, πγi(xi)}, i = 1, …, n,
where πh(xi) = Φ{αh(xi)} ∏l<h[1 − Φ{αl(xi)}]. For i = 1, …, n, introduce latent variables Zh(xi), h = 1, 2, …, such that Zh(xi) ~ N(αh(xi), 1). Thus πh(xi) = P(Zh(xi) > 0, Zl(xi) < 0 for l < h). Then, conditionally on γi, Zl(xi) | − ~ N(αl(xi), 1) truncated to (0, ∞) if l = γi, truncated to (−∞, 0) if l < γi, and untruncated for l > γi.
Let Zh = (Zh(x1), …, Zh(xn))′ and αh = (αh(x1), …, αh(xn))′. Letting (Σα)ij = e^{−κα∥xi−xj∥²}, we have Zh ~ N(αh, I) and αh ~ N(0, Σα), so that αh | Zh ~ N{(Σα^{−1} + I)^{−1}Zh, (Σα^{−1} + I)^{−1}}.
Continue up to h*, the minimum integer satisfying ∑_{h=1}^{h*} πh(xi) > 1 − ui for every i = 1, …, n (a small sketch of this truncation logic appears after step 5 below). The Zh and αh, h = 1, …, h*, are updated as above,
while κα is updated using a Metropolis Hastings step.
2. Update allocation to atoms: Update (γ1, … , γn)|− as multinomial random variables with probabilities
P(γi = h | −) ∝ 1{πh(xi) > ui} N{yi; η(xi), τh^{−1}},  h = 1, …, h*.
3. Update component-specific precisions: Letting nl = #{i : γi = l}, l = 1, 2, …, h*, update τl | − ~ Ga{ατ + nl/2, βτ + ½ ∑_{i:γi=l}(yi − η(xi))²}.
4. Update the mean regression function: Letting Λ = diag(τγ1^{−1}, …, τγn^{−1}) and W = (τT^{−1} + Λ^{−1})^{−1}, update η | − ~ N{W(τT^{−1}Xβ + Λ^{−1}Y), W}.
5. Update κ in a Metropolis Hastings step.
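The sketch below (ours, illustrating only the adaptive truncation idea referenced in step 1; in the actual sampler the αh(·) are GP draws conditional on the Zh rather than the independent normals used here) shows how components are introduced only until the generated weights cover 1 − ui for every observation.

```python
# Slice-sampling truncation: given slice variables u_i, new components are introduced
# only until, for every i, the weights already generated cover 1 - u_i, so only
# finitely many atoms are ever needed.
import numpy as np
from scipy.stats import norm

def adaptive_truncation(u, draw_alpha_column, max_H=500):
    """u: slice variables (n,); draw_alpha_column(h) returns alpha_h(x_1..x_n)."""
    n = len(u)
    pi_list, remaining = [], np.ones(n)                # remaining stick mass at each x_i
    for h in range(max_H):
        v = norm.cdf(draw_alpha_column(h))             # Phi{alpha_h(x_i)}
        pi_list.append(remaining * v)                  # pi_h(x_i)
        remaining = remaining * (1 - v)
        # stop when 1 - sum_h pi_h(x_i) = remaining < u_i for every i
        if np.all(remaining < u):
            break
    return np.column_stack(pi_list)                    # n x h* matrix of weights

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, size=30)
pi = adaptive_truncation(u, lambda h: rng.normal(0, 1, size=30))
print(pi.shape)                                        # (30, h*)
```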
4.3 Heteroscedastic sPSB process location-scale mixture (P2b)
We will need the following changes in the updating steps from the previous case.
2. Update allocation to atoms: Update (γ1, …, γn) | − as multinomial random variables with probabilities
P(γi = h | −) ∝ 1{πh(xi) > ui} ½[N{yi; η(xi) + μh, τh^{−1}} + N{yi; η(xi) − μh, τh^{−1}}],  h = 1, …, h*.
3. Component-specific locations and precisions: Let nl = #{i : γi = l}, l = 1, 2, …, h*, and ml = ∑i:γi=l(yi − ηi). The location atoms μl are updated from a two-component mixture of normals as
where .
where .
4. Update the mean regression function: Let Λ = diag(τγ1^{−1}, …, τγn^{−1}), μ* = (μγ1, μγ2, …, μγn)′ and W = (τT^{−1} + Λ^{−1})^{−1}. Hence
where .
5 Measures of influence
There has been limited work on sensitivity of the posterior distribution to perturbations of the data and outliers. Arellano-Valle et al (2000) use deletion diagnostics to assess sensitivity, but their methods are computationally expensive in requiring posterior computation with and without data deleted. Weiss (1996) proposed an alternative that perturbs the posterior instead of the likelihood, and only requires samples from the full posterior. Following Weiss (1996), let f(yi | Θ̃, xi) denote the likelihood of the data yi, define the perturbation ratio gi,δ(Θ̃) = f(yi + δ | Θ̃, xi)/f(yi | Θ̃, xi) for some small δ > 0, and let pi(Θ̃ | Y) denote a new perturbed posterior, pi(Θ̃ | Y) ∝ p(Θ̃ | Y) gi,δ(Θ̃).
Since the responses are normalized prior to analysis, it is reasonable to assume that the perturbation is less than 0.1. We vary δ in [0.01, 0.1] over a grid of width 0.01 and obtain the average of results. Denote by Li the influence measure, which is a divergence measure between the unperturbed posterior p(Θ̃|Y) and the perturbed posterior pi(Θ̃|Y),
Li is bounded and takes values in [0, 1]. When p(Θ̃|Y) = pi(Θ̃|Y), Li = 0, indicating that the perturbation has no influence. On the other hand, if Li = 1, the supports of p(Θ̃|Y) and pi(Θ̃|Y) are disjoint, indicating maximum influence. We can define an overall influence measure as L = n^{−1} ∑_{i=1}^n Li. Clearly L also takes values in [0, 1], with L = 0 ⇒ Li = 0 ∀ i = 1, 2, …, n and L = 1 ⇒ Li = 1 ∀ i = 1, 2, …, n. Weiss (1996) provided a sample version of Li, i = 1, …, n. Letting Θ̃1, …, Θ̃M be the posterior samples with B the burn-in,
where L̂i denotes the resulting sample estimate of Li. Our estimated influence measure is L̂ = n^{−1} ∑_{i=1}^n L̂i. We will calculate the influence measure for our proposed methods and compare their sensitivity.
6 Simulation studies
To assess the performance of our proposed approaches, we consider a number of simulation examples: (i) linear model, homoscedastic error with no outliers, (ii) linear model, homoscedastic error with outliers, (iii) linear model, heteroscedastic errors and outliers, (iv) non-linear model with heteroscedastic errors and outliers, and (v) non-linear model with heteroscedastic errors and outliers, but with fewer true predictors. We let the heaviness of the tails and error variance change with x in cases (iii), (iv) and (v). We considered the following methods of assessing the performance, namely, mean squared prediction error (MSPE), coverage of 95% prediction intervals, mean integrated squared error (MISE) in estimating the regression function at the points for which we have data, pointwise coverage of 95% credible intervals for the regression function and the influence measure (L̂) as described in Section 5. We also consider a variety of sample sizes in the simulation, n = 40, 60, 80, and simulate 10 covariates independently from U(0, 1). Let z be a 10-dimensional vector of i.i.d. U(0, 1) random variables independent of the covariates.
Generation of errors in the heteroscedastic case and outliers: Let fxi(εi) = pxi N(εi; 0, 1) + qxi N(εi; 0, 5), where the weights pxi and qxi = 1 − pxi change with xi. The outliers are simulated from the model with an error distribution that is a mixture of truncated normal distributions (with predictor-dependent weights in the heteroscedastic case), where TNℛ(·; μ, σ²) denotes a truncated normal distribution with mean μ and standard deviation σ over the region ℛ. We consider the following five cases.
Case (i): yi = 2.3 + 5.7x1i + εi, εi ~ N(0, 1) with no outliers.
Case (ii): yi = 2.3 + 5.7x1i + εi, εi ~ 0.95N(0, 1) + 0.05N(0, 10).
Case (iii): yi = 1.2 + 5.7x1i + 4.7x2i + 0.12x3i − 8.9x4i + 2.4x5i + 3.1x6i + 0.01x7i + εi, εi ~ fxi, with 5% outliers generated from the outlier distribution described above.
Case (iv): a non-linear mean function with εi ~ fxi and 5% outliers generated from the outlier distribution described above.
Case (v): yi = 1.2 + 5.7 sin x1i + 3.4 exp(x2i) + 4.7 log |x3i| + εi, εi ~ fxi, with 5% outliers generated from the outlier distribution described above. (A small data-generation sketch for case (ii) is given below.)
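The sketch below (ours) generates a dataset under case (ii); the contaminating component is taken as N(0, 10) with 10 read as a variance, which is an assumption since the parameterization of the 5% contamination is not fully explicit here.

```python
# Sketch of data generation for simulation case (ii): a linear mean with a
# two-component normal contamination error, 95% N(0, 1) and 5% N(0, 10).
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 10
X = rng.uniform(0, 1, size=(n, p))                     # 10 covariates, only x1 is active
outlier = rng.uniform(size=n) < 0.05
eps = np.where(outlier,
               rng.normal(0, np.sqrt(10), size=n),     # contaminating component, variance 10
               rng.normal(0, 1, size=n))               # main component N(0, 1)
y = 2.3 + 5.7 * X[:, 0] + eps
```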
For each of the cases and for each sample size n, we took the first n samples as the training set and the next n samples as the test set. We also compare the MSPE of the proposed methods with robust regression using M-estimation (Huber 1964), Bayesian additive regression trees (Chipman et al 2010), and treed Gaussian processes (Gramacy and Lee 2008). The MCMC algorithms described in Section 4 are used to obtain samples from the posterior distribution. The results for model P1 given here are based on 20,000 samples obtained after a burn-in period of 3,000. The results for models P2a–d are based on 20,000 samples obtained after a burn-in period of 7,000. Rapid convergence was observed based on diagnostic tests of Geweke (1992) and Raftery and Lewis (1992). In addition, the mixing was very good for model P1. For models P2a–d, we use the label switching moves of Papaspiliopoulos and Roberts (2008), which lead to adequate mixing. Tables 1, 2 and 3 summarize the performance of all the methods based on 50 replicated datasets.
Table 1. Simulation results for case (i) (left block of columns) and case (ii) (right block).
n=40 | MSPE | cov(y)a | MISE | cov(η)b | L | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|---|---|---|---|---|
P1 | 0.2997 | 1 | 0.0248 | 1 | 0.0017 | 0.6043 | 1 | 0.0232 | 1 | 0.0027 |
P2a | 0.2821 | 0.9980 | 0.0141 | 1 | 0.0015 | 0.5983 | 0.9740 | 0.0173 | 1 | 0.0019 |
P2b | 0.2798 | 1 | 0.0144 | 1 | 0.0015 | 0.5987 | 0.9745 | 0.0169 | 1 | 0.0017 |
P2c | 0.2980 | 1 | 0.0156 | 1 | 0.0016 | 0.5980 | 0.9750 | 0.0189 | 1 | 0.0020 |
P2d | 0.2869 | 1 | 0.0155 | 1 | 0.0016 | 0.5983 | 0.9750 | 0.0190 | 1 | 0.0020 |
M-Estimation | 0.2820 | 0.0140 | 0.6013 | 0.0177 | ||||||
BART | 0.3510 | 0.6866 | 0.0714 | 0.7051 | 0.7845 | 0.0950 | ||||
Treed GP | 0.3042 | 0.9134 | 0.0256 | 0.6968 | 0.9365 | 0.0803 | ||||
n=60 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2990 | 1 | 0.0246 | 1 | 0.0019 | 0.5776 | 1 | 0.0242 | 1 | 0.0023 |
P2a | 0.2769 | 0.9947 | 0.0103 | 1 | 0.0017 | 0.5471 | 0.95 | 0.0143 | 0.97 | 0.0016 |
P2b | 0.2752 | 0.9963 | 0.0104 | 1 | 0.0016 | 0.5541 | 0.95 | 0.0141 | 0.98 | 0.0016 |
P2c | 0.2852 | 0.9945 | 0.0176 | 1 | 0.0019 | 0.5664 | 0.95 | 0.0142 | 0.99 | 0.0021 |
P2d | 0.2826 | 0.9960 | 0.0173 | 1 | 0.0018 | 0.5561 | 0.95 | 0.0141 | 0.98 | 0.0021 |
M-Estimation | 0.2759 | 0.0103 | 0.5623 | 0.0139 | ||||||
BART | 0.3314 | 0.6753 | 0.0539 | 0.6725 | 0.7777 | 0.1098 | ||||
Treed GP | 0.3000 | 0.9193 | 0.0218 | 0.6880 | 0.9301 | 0.1198 | ||||
n=80 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2913 | 1 | 0.0252 | 1 | 0.0021 | 0.5583 | 1 | 0.0172 | 1 | 0.0022 |
P2a | 0.2592 | 0.9940 | 0.0086 | 1 | 0.0021 | 0.4989 | 0.97 | 0.0050 | 1 | 0.0014 |
P2b | 0.2574 | 0.9956 | 0.0069 | 1 | 0.0020 | 0.4898 | 0.98 | 0.0067 | 1 | 0.0010 |
P2c | 0.2724 | 0.9976 | 0.0187 | 1 | 0.0020 | 0.5104 | 0.98 | 0.0103 | 1 | 0.0017 |
P2d | 0.2716 | 0.9976 | 0.0189 | 1 | 0.0020 | 0.5002 | 0.98 | 0.0097 | 1 | 0.0018 |
M-Estimation | 0.2720 | 0.0079 | 0.5431 | 0.0068 | ||||||
BART | 0.3128 | 0.6525 | 0.0437 | 0.6509 | 0.7815 | 0.1098 | ||||
Treed GP | 0.2886 | 0.9301 | 0.0175 | 0.6532 | 0.9224 | 0.1031 |
cov(y) denotes the coverage of the 95% predictive intervals of the test cases
cov(η) denotes the coverage of the 95% credible intervals of the mean regression function
Table 2. Simulation results for case (iii) (left block of columns) and case (iv) (right block).
n=40 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|---|---|---|---|---|
P1 | 0.4833 | 1 | 0.3612 | 1 | 0.0027 | 0.4416 | 1 | 0.3274 | 1 | 0.0029 |
P2a | 0.2570 | 0.9990 | 0.1394 | 1 | 0.0025 | 0.2783 | 0.9923 | 0.1583 | 0.98 | 0.0023 |
P2b | 0.2586 | 0.9990 | 0.1298 | 1 | 0.0025 | 0.2712 | 0.9867 | 0.1501 | 0.97 | 0.0017 |
P2c | 0.3057 | 1 | 0.2412 | 1 | 0.0026 | 0.3334 | 0.9967 | 0.2213 | 0.99 | 0.0024 |
P2d | 0.3024 | 1 | 0.2304 | 1 | 0.0026 | 0.3216 | 0.9967 | 0.2192 | 0.99 | 0.0022 |
M-Estimation | 0.2613 | 0.1376 | 0.2889 | 0.1663 | ||||||
BART | 0.4639 | 0.8444 | 0.3413 | 0.4103 | 0.8833 | 0.2675 | ||||
Treed GP | 0.3320 | 0.7834 | 0.1979 | 0.3548 | 0.8268 | 0.2108 | ||||
n=60 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.2254 | 1 | 0.1154 | 1 | 0.0023 | 0.2367 | 1 | 0.1067 | 1 | 0.0021 |
P2a | 0.1744 | 0.9973 | 0.0572 | 1 | 0.0020 | 0.2178 | 1 | 0.0562 | 0.97 | 0.0019 |
P2b | 0.1712 | 0.9878 | 0.0567 | 1 | 0.0016 | 0.2099 | 1 | 0.0656 | 0.98 | 0.0017 |
P2c | 0.1952 | 0.9998 | 0.0854 | 1 | 0.0021 | 0.2216 | 1 | 0.0879 | 0.99 | 0.0020 |
P2d | 0.1934 | 0.9998 | 0.0799 | 1 | 0.0020 | 0.2208 | 1 | 0.0812 | 0.99 | 0.0020 |
M-Estimation | 0.1746 | 0.0564 | 0.2125 | 0.0678 | ||||||
BART | 0.3429 | 0.8546 | 0.2217 | 0.3385 | 0.9122 | 0.1799 | ||||
Treed GP | 0.2047 | 0.8349 | 0.0779 | 0.2611 | 0.8867 | 0.0899 | ||||
n=80 | MSPE | cov(y) | MISE | cov(η) | L | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.1636 | 1 | 0.0454 | 1 | 0.0018 | 0.1855 | 1 | 0.0346 | 1 | 0.0019 |
P2a | 0.1509 | 0.9976 | 0.0373 | 0.95 | 0.0015 | 0.1653 | 1 | 0.0321 | 0.9952 | 0.0014 |
P2b | 0.1578 | 0.9931 | 0.0324 | 1 | 0.0013 | 0.1614 | 1 | 0.0312 | 0.9932 | 0.0010 |
P2c | 0.1589 | 0.9960 | 0.0404 | 1 | 0.0017 | 0.1774 | 1 | 0.0329 | 0.9980 | 0.0016 |
P2d | 0.1567 | 0.9969 | 0.0401 | 1 | 0.0017 | 0.1770 | 1 | 0.0320 | 0.9969 | 0.0016 |
M-Estimation | 0.1582 | 0.0364 | 0.1832 | 0.0325 | ||||||
BART | 0.2284 | 0.9265 | 0.1098 | 0.2491 | 0.9490 | 0.1083 | ||||
Treed GP | 0.1655 | 0.8876 | 0.0427 | 0.2022 | 0.8923 | 0.0548 |
Table 3. Simulation results for case (v).
n=40 | MSPE | cov(y) | MISE | cov(η) | L |
---|---|---|---|---|---|
P1 | 0.6666 | 0.9800 | 0.5856 | 1 | 0.0033 |
P2a | 0.5233 | 0.9770 | 0.3980 | 0.9812 | 0.0025 |
P2b | 0.5231 | 0.9854 | 0.3745 | 0.9765 | 0.0019 |
P2c | 0.5875 | 0.9850 | 0.4452 | 1 | 0.0029 |
P2d | 0.5788 | 0.9859 | 0.4223 | 1 | 0.0028 |
M-Estimation | 0.5531 | 0.3671 | |||
BART | 0.4956 | 0.8980 | |||
Treed GP | 0.7224 | 0.8123 | 0.6132 | ||
n=60 | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.3828 | 1 | 0.2911 | 0.9985 | 0.0031 |
P2a | 0.3745 | 0.9832 | 0.2617 | 0.9840 | 0.0022 |
P2b | 0.3767 | 0.9812 | 0.2601 | 0.9867 | 0.0020 |
P2c | 0.3810 | 0.9900 | 0.2800 | 0.9998 | 0.0027 |
P2d | 0.3800 | 0.9906 | 0.2798 | 0.9998 | 0.0026 |
M-Estimation | 0.3939 | 0.2824 | |||
BART | 0.3930 | 0.9313 | 0.2668 | ||
Treed GP | 0.4225 | 0.9023 | 0.3217 | ||
n=80 | MSPE | cov(y) | MISE | cov(η) | L |
P1 | 0.3599 | 0.9901 | 0.2759 | 0.9998 | 0.0029 |
P2a | 0.3503 | 0.9762 | 0.2582 | 0.9765 | 0.0022 |
P2b | 0.3519 | 0.9712 | 0.2545 | 0.9715 | 0.0019 |
P2c | 0.3560 | 0.9856 | 0.2656 | 0.9885 | 0.0025 |
P2d | 0.3557 | 0.9800 | 0.2677 | 0.9881 | 0.0024 |
M-Estimation | 0.3905 | 0.2887 | |||
BART | 0.3594 | 0.9442 | 0.2867 | ||
Treed GP | 0.4489 | 0.9125 | 0.3509 |
Tables 1, 2 and 3 clearly show that in small samples both of the heteroscedastic methods (P2a and P2b) have substantially reduced MSPE and MISE relative to the heavy-tailed parametric error model in most of the cases, interestingly even in the homoscedastic cases. This may be because a discrete mixture of Gaussians approximates a single normal better in small samples than a t-distribution does. Methods P2a and P2b also did a better job than method P1 in allowing uncertainty in estimating the mean regression and predicting the test sample observations. The homoscedastic versions P2c and P2d perform better than the parametric model but worse than the heteroscedastic models. In some cases, the heavy-tailed t-residual distribution results in overly conservative predictive and credible intervals. As seen from the value of the influence statistic, the heteroscedastic PSB process mixtures result in more robust inference compared to the parametric error model, with the sPSB process mixture of normals being more robust than the symmetric and unimodal version. As the sample size increases, the difference in the predictive performances between the parametric and the nonparametric models is reduced, and in some cases the parametric error model performs as well as the nonparametric approaches, which is as expected given the Central Limit Theorem.
Table 1 shows that, in the simple linear model with normal homoscedastic errors, all the models perform similarly in terms of mean squared prediction error, though methods P2a and P2b are somewhat better than the rest. Also, in estimating the mean regression function in case (i), methods P2a and P2b performed better than all the other methods. In case (ii) (Table 1), methods P2a and P2b are most robust in terms of estimation and prediction in the presence of outliers. However, there is no significant difference between methods P2a and P2b and methods P2c and P2d in cases (i) and (ii). In cases (iii) and (iv), when the residual distribution is heteroscedastic, methods P2a and P2b perform significantly better than the parametric model P1 and the homoscedastic models P2c and P2d in both estimation and prediction, since the heteroscedastic PSB mixture is very flexible in modeling the residual distribution. This is quite evident from the MSPE values under cases (iii) and (iv) in Table 2. Huber’s M-Estimation method performs similarly to methods P2a–d in cases (i) and (ii) but did not do as well in estimation and prediction in cases (iii) and (iv), when the underlying mean function is non-linear with a heteroscedastic residual distribution. Also, BART failed to perform well in estimating the mean function in small samples in these cases. On the other hand, GP-based approaches perform quite well in these cases in estimating the regression function, with methods P2a and P2b performing better than the rest. Treed GP performed close to method P1 in estimation and prediction, as both methods are based on GP priors on the mean function and have a parametric error distribution. In not allowing heteroscedastic error variance, BART and Treed GP underestimate uncertainty in prediction, leading to overly narrow predictive intervals.
In case (v) (Table 3), where the true model is generated using comparatively fewer true predictors, BART performed slightly better in terms of prediction than the other methods in small samples. However, as the sample size increased, BART performed poorly, while the GP prior on the mean can accommodate the non-linearity, resulting in substantially better predictive performance.
7 Applications
7.1 Boston housing data application
To compare our proposed approaches to alternatives, we applied the methods to a commonly used data set from the literature, the Boston housing data. The response is the median value of the owner-occupied homes (in thousands of dollars) in 506 census tracts in the Boston area, and there are 13 predictors (12 continuous, 1 binary) that might help to explain the variation in the median value across tracts. We predict the median value of the owner-occupied homes, taking the first 253 tracts as the training set and the remaining 253 as the test set. Out of sample predictive performance of our methods is compared to competitors in Table 4. The parametric model P1, the mixture models P2a–d and the M-Estimation method perform very closely to each other in terms of prediction and did better than BART and Treed GP. Methods P1 and P2a even perform slightly better than methods P2b, P2c and P2d. As in the simulation examples, BART and Treed GP underestimate the uncertainty in prediction. On the other hand, the predictive intervals of the methods P1, P2a–d are more conservative and accommodate uncertainty in predicting regions with outliers quite flexibly. Also, model P2b appears to be more robust compared to models P1, P2a, P2c & P2d in terms of the influence measure.
Table 4.
Boston housing data | body fat data | |||||||
---|---|---|---|---|---|---|---|---|
Methods | MSPE | cov(y) | L | corr(Ytest, Ypred)a | MSPE | cov(y) | L | corr(Ytest, Ypred) |
P1 | 0.0012 | 0.99 | 0.0034 | 0.9894 | 0.0055 | 1 | 0.0020 | 0.9972 |
P2a | 0.0013 | 0.99 | 0.0027 | 0.9901 | 0.0031 | 1 | 0.0017 | 0.9984 |
P2b | 0.0016 | 0.99 | 0.0020 | 0.9863 | 0.0029 | 1 | 0.0017 | 0.9989 |
P2c | 0.0014 | 0.99 | 0.0030 | 0.9879 | 0.0034 | 1 | 0.0019 | 0.9969 |
P2d | 0.0013 | 0.99 | 0.0029 | 0.9881 | 0.0032 | 1 | 0.0018 | 0.9975 |
M-Estimation | 0.0016 | 0.9858 | 0.0375 | 0.9710 | ||||
BART | 0.0024 | 0.92 | 0.9836 | 0.0355 | 0.95 | 0.9655 | ||
Treed GP | 0.0053 | 0.91 | 0.9524 | 0.1526 | 0.98 | 0.9250 |
corr(Ytest, Ypred) denotes the sample correlation between the test and predicted y
7.2 Body fat data application
With the increasing trend in obesity and concerns about associated adverse health effects, such as heart disease and diabetes, it has become even more important to obtain accurate estimates of body fat percentage. It is well known that body mass index, which is calculated based only on weight and height, can produce a misleading measure of adiposity as it does not take into account muscle mass or variability in frame size. As a gold standard for measuring percentage of body fat, one can rely on underwater weighing techniques, and age and body circumference measurements have also been widely used as additional predictors. We consider a commonly-used data set from Statlib (http://lib.stat.cmu.edu/datasets/bodyfat), which contains the following 15 variables: percentage of body fat (%), body density from underwater weighing (gm/cm3), age (year), weight (lbs.), height (inches), and ten body circumferences (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, wrist, all in cm). Percentage of body fat is given by Siri’s (1956) equation: percentage of body fat = 495/(body density) − 450.
We predict the percentage of body fat (%), taking the first 126 observations as the training set and the remaining 126 as the test set. We summarize the predictive performances in Table 4.
Table 4 suggests that the nonparametric regression procedures with heteroscedastic residual distributions, P2a and P2b, perform better than the parametric model P1, the homoscedastic models P2c and P2d, BART, M-Estimation and Treed GP in predicting the percentage of body fat.
8 Discussion
We have developed a novel regression model that can accommodate a large range of non-linearity in the mean function and at the same time can flexibly deal with outliers and heteroscedasticity. Based on preliminary simulation results, it appears that our methods P2a and P2b can outperform contemporary nonparametric regression methods, such as Huber’s M-Estimation method, BART and treed Gaussian processes. We also provide theoretical support for the proposed methodology when both the mean and the residuals are modeled nonparametrically.
One possible future direction is to relax the symmetry assumption on the residual distribution and introduce a model for median regression based on conditional PSB mixtures allowing possibly asymmetric residual densities constrained to have zero median. Conditional DP mixtures are well known in the literature (Doss 1985; Burr and Doss 2005) and it is certainly interesting to extend our approach via a conditional PSB. In that way we can hope to obtain a more robust estimate of the regression function. It is challenging to extend our theoretical results to the conditional PSB and to develop a fast algorithm for computation. Another possible theoretical direction is to prove posterior consistency using heteroscedastic mixtures. Currently we only have results for the homoscedastic PSB process mixture.
Acknowledgements
This research was partially supported by grant number R01 ES017240-01 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH). The authors would like to thank Prof. Taeryon Choi for helpful suggestions on the manuscript.
A Proofs of main results
Proof of Lemma 1
It follows from Chu (1973) that any f ∈ 𝒞m can be written as
f(y) = ∫0^∞ σ^{−1} ϕ(y/σ) g(σ) dσ
for some density g on ℝ+. Recall from Ongaro and Cattaneo (2004) that a collection of random weights {πh} with ∑h πh = 1 a.s. is said to have full support if for any m ≥ 1, (π1, …, πm) admits a positive joint density with respect to Lebesgue measure on the corresponding simplex. Ongaro and Cattaneo (2004) showed that if the πh’s have full support, the weak support of a random measure of the form P = ∑h πh δθh, with θh iid from G0, is the set of all probability measures whose support is contained in the support of G0. Since Φ(αh), with αh ~ N(μα, σα²), has a positive density on (0, 1), the πh’s have full support, and hence the weak support of the mixing measure ∑h πh δτh in (7) is the set of all probability measures on ℝ+. It follows that the weak support of the induced prior Πu on 𝒮u, denoted by wk(Πu), is precisely 𝒞m.
Proof of Lemma 2
It follows from Tokdar (2006) that if we can show that the weak support of Πs contains all probability measures symmetric about zero and having compact support, then f ∈ 𝒮̃s ⟹ f ∈ KL(Πs). The argument given in Lemma 1 shows that the weak support of the PSB prior in (4) is the set of all probability measures on ℝ × ℝ+. Now we will show that the symmetrized measure P̃s is in a weak neighborhood of Ps whenever P̃ is in a weak neighborhood of P. We state a lemma to prove our claim.
Lemma 4 Let P̃n be a sequence of probability measures and P̃ be a fixed probability measure. Then P̃n ⟹ P̃ implies P̃ns ⟹ P̃s, with P̃ns and P̃s the symmetrised versions of P̃n and P̃, respectively, where the symmetrizing operation is as defined in (9).
Proof Assume P̃n ⟹ P̃. We have to show that, for any bounded continuous function ϕ on ℝ × ℝ+, ∫ ϕ dP̃ns → ∫ ϕ dP̃s. Now,
∫ ϕ(t, τ) dP̃ns(t, τ) = ∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃n(t, τ).
Since (t, τ) ↦ ½{ϕ(t, τ) + ϕ(−t, τ)} is also a bounded continuous function and P̃n ⟹ P̃,
∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃n(t, τ) → ∫ ½{ϕ(t, τ) + ϕ(−t, τ)} dP̃(t, τ) = ∫ ϕ(t, τ) dP̃s(t, τ)
as n → ∞. This completes the proof of Lemma 4.
Lemma 4 in fact shows that the weak support of Πs contains all probability measures symmetric about zero. With an appeal to Tokdar (2006), f ∈ 𝒮̃s ⟹ f ∈ KL(Πs).
Proof of Theorem 1
In order to prove the theorem we need the following variant of Theorem 2.1 of Amewou-Atisso et al (2003) and Theorem 1 of Choi and Schervish (2007) which we state as Lemma 5. Existence of exponentially consistent tests is a typical tool in showing strong consistency.
Definition 2 Let 𝒲 ⊂ 𝒮̃s × ℱ. A sequence of test functions {Φn} is said to be exponentially consistent for testing H0 : (f, η) = (f0, η0) against H1 : (f, η) ∈ 𝒲
if there exist constants C1, C2, C > 0 such that E(f0,η0)(Φn) ≤ C1 e^{−nC} and sup(f,η)∈𝒲 E(f,η)(1 − Φn) ≤ C2 e^{−nC}.
Lemma 5 Let Π̃ = (Πs × π) be the prior on 𝒮̃s × ℱ. Let Un be a sequence of subsets of 𝒮̃s × ℱ. Suppose that there exist test functions {Φn}, sets Θn ⊂ 𝒮̃s × ℱ, n ≥ 1, and constants C1, C2, c1, c2 > 0 such that
- For all δ > 0 and for almost every data sequence ,
Then .
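For orientation, the mechanism behind Lemma 5 is the usual test-plus-denominator argument. Writing Λn(f, η) = ∏i=1n fi(yi)/f0i(yi) and suppressing the sieve Θn, a sketch of the reasoning (not a restatement of the lemma) is

```latex
\tilde{\Pi}\{U_n^c \mid (x_1,y_1),\ldots,(x_n,y_n)\}
  \;\le\; \Phi_n \;+\;
  \frac{\displaystyle\int_{U_n^c} (1-\Phi_n)\,\Lambda_n(f,\eta)\, d\tilde{\Pi}(f,\eta)}
       {\displaystyle\int \Lambda_n(f,\eta)\, d\tilde{\Pi}(f,\eta)} ,
```

where the first term is handled by the type I error bound, the numerator (in expectation) by the type II error bound over the sieve, and the denominator by the likelihood-ratio lower bound in condition 4.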
In this case we take Unc = 𝒲n with Un = 𝒰 × Sn(f0, Δ) for all n ≥ 1. As in van der Vaart and van Zanten (2009), we construct Θn = ℱ × Θ1n, where Θ1n is built from ℍ1 and 𝔹1, the unit ball of the RKHS of Wa and the unit ball of the Banach space C[0, 1]p respectively, and rn, Mn are increasing sequences to be chosen later. The nth test is constructed by combining a collection of tests, one for each element of a finite covering of Θ1n. It follows from the proof of Theorem 3.1 in van der Vaart and van Zanten (2009) that, under Assumption 1, the sieve Θ1n satisfies entropy and prior mass bounds involving constants d1, d2, K > 0.
Choosing Mn = O(n1/2) and a suitable increasing sequence rn, we observe that
log N(ε, Θ1n, ‖·‖∞) = o(n) and π(Θ1nc) ≤ Ke−d2n
for some constant d2 > 0.
In order to verify conditions 1 and 2 of Lemma 5, we write 𝒲n as a disjoint union of two tractable regions. The particular form of 𝒲n that is of interest to us is 𝒲1n ∪ 𝒲2n, where for any Δ > 0,
𝒲1n = {(f, η) : ‖η − η0‖1,n > Δ} and 𝒲2n = {(f, η) : f ∈ 𝒰c, ‖η − η0‖1,n ≤ Δ}.
We establish the existence of an exponentially consistent sequence of tests for each of these regions by considering the following variants of Proposition 3.1 and Proposition 3.3 of Amewou-Atisso et al (2003).
Proposition 1 There exists an exponentially consistent sequence of tests for 𝒲1n.
Proof Let 0 < t < Δ/2 and set Nt = N(t, Θ1n, ‖·‖∞). Let η1, …, ηNt ∈ Θ1n be such that for each η ∈ Θ1n there exists j with ‖η − ηj‖∞ < t. If ‖η − η0‖1,n > Δ, then ‖ηj − η0‖1,n > Δ/2. It follows from Lemma 3.2 of Amewou-Atisso et al (2003) that there exist a set Kn ⊂ {1, …, n}, sets Ai for i ∈ Kn, and a constant C > 0 depending on f0 for which the bounds stated there hold. If i ≤ n and i ∉ Kn, set Ai = ℝ.
From Lemma 3.1 and Lemma 3.2 of Amewou-Atisso et al (2003), it follows that there exist test functions Φnj, one for each ηj, whose type I and type II error probabilities are exponentially small, with constants C1, C2 > 0. Now define Φn = max1≤j≤Nt Φnj. Then
𝖤(f0,η0)Φn ≤ Σj=1Nt 𝖤(f0,η0)Φnj ≤ NtC1e−nC3
for some constant C3 > 0, which is exponentially small since log Nt = o(n).
Next we consider the type II error probability. The type II error probability of Φn is no larger than that of any of the Φnj and hence is exponentially small.
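The way the covering number Nt enters the type I error bound can be spelled out as follows; this is the standard union-bound step, written with the constants used above.

```latex
\mathsf{E}_{(f_0,\eta_0)}\Phi_n
  \;=\; \mathsf{E}_{(f_0,\eta_0)}\max_{1\le j\le N_t}\Phi_{nj}
  \;\le\; \sum_{j=1}^{N_t}\mathsf{E}_{(f_0,\eta_0)}\Phi_{nj}
  \;\le\; N_t\,C_1 e^{-nC_3}
  \;=\; C_1\exp\{\log N_t - nC_3\},
```

which decays exponentially in n because log Nt = log N(t, Θ1n, ‖·‖∞) = o(n) by the choice of the sieve.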
Proposition 2 There exists an exponentially consistent sequence of tests for 𝒲2n.
Proof Without loss of generality take
𝒰 = {f : |∫ Θ df − ∫ Θ df0| < ε},
where 0 ≤ Θ ≤ 1 and Θ is Lipschitz continuous. Hence there exists M > 0 such that |Θ(y1) − Θ(y2)| < M|y1 − y2|. Set Θ̃i(y) = Θ{y − η0(xi)}. Notice that 𝖤f0iΘ̃i = 𝖤f0Θ. Now a change of variables gives 𝖤fiΘ̃i = ∫ Θ{z + η(xi) − η0(xi)} f(z) dz, so that
|𝖤fiΘ̃i − ∫ Θ df| ≤ M|η(xi) − η0(xi)| for each i.
Hence n−1 Σi=1n |𝖤fiΘ̃i − 𝖤f0iΘ̃i| ≥ ε − MΔ for any f ∈ 𝒰c with ‖η − η0‖1,n ≤ Δ. Now choosing Δ < ε/M and applying Lemma 3.1 of Amewou-Atisso et al (2003) we complete the proof.
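For completeness, the separation behind the last step can be written as the following chain of inequalities; this is a sketch under the assumptions used above, namely that 𝒰 is the subbasic weak neighborhood defined through the bounded Lipschitz function Θ and that ‖η − η0‖1,n = n−1 Σi |η(xi) − η0(xi)|.

```latex
\frac{1}{n}\sum_{i=1}^{n}\bigl|\mathsf{E}_{f_i}\tilde{\Theta}_i-\mathsf{E}_{f_{0i}}\tilde{\Theta}_i\bigr|
 \;\ge\; \Bigl|\int\Theta\,df-\int\Theta\,df_0\Bigr|
         \;-\;\frac{M}{n}\sum_{i=1}^{n}\bigl|\eta(x_i)-\eta_0(x_i)\bigr|
 \;\ge\; \varepsilon - M\Delta \;>\; 0
```

for every f ∈ 𝒰c and every η with ‖η − η0‖1,n ≤ Δ, provided Δ < ε/M; this uniform separation is what Lemma 3.1 of Amewou-Atisso et al (2003) converts into an exponentially consistent test.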
It remains to verify the last condition of Lemma 5, which follows once the prior assigns positive probability to suitable Kullback–Leibler neighborhoods of (f0, η0). Under the assumptions, it follows from Lemma 2 that f0 ∈ KL(Πs). We present an important lemma similar to Lemma 5.1 of Tokdar (2006); it guarantees that K(f0, fθ) and V(f0, fθ) are continuous at θ = 0. First we state and prove some properties of the prior Πs described in (9) which will be used in proving the lemma.
Lemma 6 Suppose Πs is the prior described in (9), with the base measure given by the product of a normal distribution on the location t and a Ga(ατ, βτ) distribution on the precision τ, with ατ > 0 and βτ > 0. Then, Πs-almost surely,
∫ t2 dPs(t, τ) < ∞, ∫ τt2 dPs(t, τ) < ∞, ∫ τ dPs(t, τ) < ∞ and ∫ |log τ| dPs(t, τ) < ∞. (11)
Proof
The proofs of ∫ t2 dPs(t, τ) < ∞ a.s. and ∫ τt2 dPs(t, τ) < ∞ a.s. are similar, so we focus on ∫ |log τ| dPs(t, τ) < ∞. Since ατ > 0, choose an integer m large enough such that ατ − 1/m > 0. Then
∫0<τ<1 (−log τ) τατ−1 e−βττ dτ = ∫0<τ<1 (−τ1/m log τ) τατ−1/m−1 e−βττ dτ < ∞,
since τ1/m log τ is bounded on [0, 1]. Also ∫τ≥1 (log τ) τατ−1 e−βττ dτ ≤ ∫τ≥1 τ·τατ−1 e−βττ dτ < ∞. Hence ∫ |log τ| dG0(t, τ) < ∞, and since 𝖤{∫ |log τ| dPs(t, τ)} = ∫ |log τ| dG0(t, τ), we conclude that ∫ |log τ| dPs(t, τ) < ∞ a.s.
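As a quick numerical sanity check of the base-measure integral above, one can evaluate ∫ |log τ| τατ−1 e−βττ dτ directly for illustrative hyperparameter values (ατ = 0.3 and βτ = 1 below are arbitrary choices, not values used in the paper):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

a_tau, b_tau = 0.3, 1.0                                   # illustrative Ga(a_tau, b_tau) hyperparameters
dens = lambda t: t**(a_tau - 1.0) * np.exp(-b_tau * t) * b_tau**a_tau / gamma_fn(a_tau)

lower, _ = quad(lambda t: -np.log(t) * dens(t), 0.0, 1.0)     # piece controlled by the tau^(1/m) log(tau) bound
upper, _ = quad(lambda t: np.log(t) * dens(t), 1.0, np.inf)   # piece bounded by the first moment, since log(t) <= t
print(lower, upper, lower + upper)                            # all finite: E|log tau| < infinity under this Gamma law
```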
Lemma 7 Under the conditions of Theorem 1, if f(·) = ∫ N(·; t, τ−1) dPs(t, τ) and fθ(y) = f(y − θ), then K(f0, fθ) → K(f0, f) and V(f0, fθ) → V(f0, f) as θ → 0.
Proof Clearly √τ ϕ{√τ(y − θ − t)} → √τ ϕ{√τ(y − t)} as θ → 0. Since √τ ϕ{√τ(y − θ − t)} ≤ √τ/√(2π), which is Ps-integrable by (11) and Jensen’s inequality, the dominated convergence theorem (DCT) gives fθ(y) → f(y) as θ → 0. Hence log fθ(y) → log f(y) pointwise. To apply DCT again, we have to bound |log fθ(y)| by an f0-integrable function.
Let c = ∫ τ dPs(t, τ) < ∞. Then fθ(y) ≤ ∫ √τ/√(2π) dPs(t, τ) ≤ √{c/(2π)}, so that log fθ(y) ≤ ½ log{c/(2π)}.
Now fθ(y) = ∫ √τ ϕ{√τ(y − θ − t)} dPs(t, τ). Hence, by Jensen’s inequality applied to −log x, we get
−log fθ(y) ≤ ∫ [½ log(2π) − ½ log τ + τ(y − θ − t)2/2] dPs(t, τ).
Now since θ → 0, without loss of generality assume |θ| ≤ 1. Hence
|log fθ(y)| ≤ ½ |log{c/(2π)}| + ½ log(2π) + ½ ∫ |log τ| dPs(t, τ) + ∫ τ{(|y| + 1)2 + t2} dPs(t, τ),
which is clearly f0-integrable according to the assumptions of the lemma and the properties of Πs proved in Lemma 6. Similarly |log fθ(y)|2 can be bounded by an f0-integrable function. The conclusion of the lemma follows from a simple application of DCT.
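The continuity statement of Lemma 7 can be checked numerically on a toy example; the two-component mixture, the choice f0 = N(0, 1) and the grid of θ values below are all illustrative and are not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative densities: a two-component normal mixture f, and f0 = N(0, 1) as the "true" density.
f = lambda y: 0.6 * norm.pdf(y, loc=0.8, scale=1.0) + 0.4 * norm.pdf(y, loc=-1.2, scale=2.0)
f0 = lambda y: norm.pdf(y)

def kl(p, q):
    """K(p, q) = integral of p * log(p / q), evaluated by numerical quadrature."""
    val, _ = quad(lambda y: p(y) * np.log(p(y) / q(y)), -np.inf, np.inf)
    return val

base = kl(f0, f)
for theta in [1.0, 0.5, 0.1, 0.01]:
    f_theta = lambda y, th=theta: f(y - th)               # shifted density f_theta(y) = f(y - theta)
    print(theta, round(kl(f0, f_theta), 5), round(base, 5))
# K(f0, f_theta) approaches K(f0, f) as theta -> 0, matching the continuity claim of Lemma 7.
```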
Lemma 2 together with assumption (2) of Theorem 1 guarantees Π{f : K(f0, f) < δ, V(f0, f) < ∞} > 0 for all δ > 0. Since (11) holds Πs-almost surely, we may assume
Πs(𝒰) > 0, where 𝒰 = {f : K(f0, f) < δ, V(f0, f) < ∞ and the integrals in (11) are finite}. (12)
Now for every f(·) = ∫ N(·; t, τ−1) dPs(t, τ) ∈ 𝒰, using Lemma 7, choose δf such that for |θ| < δf,
K(f0, fθ) < 2δ and V(f0, fθ) < ∞.
Now if ‖η − η0‖∞ < δf, then |η(xi) − η0(xi)| < δf for i = 1, …, n. So if f ∈ 𝒰 and ‖η − η0‖∞ < δf, we have
K(f0i, fi) < 2δ and V(f0i, fi) < ∞ for each i = 1, …, n.
From (12) and Lemma 3 we have
Π̃{(f, η) : K(f0i, fi) < 2δ, V(f0i, fi) < ∞ for all i} ≥ ∫𝒰 π{η : ‖η − η0‖∞ < δf} dΠs(f) > 0.
Hence the prior assigns positive probability to the required Kullback–Leibler type neighborhoods of (f0, η0), so the last condition of Lemma 5 is satisfied. This ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.
Contributor Information
Debdeep Pati, Department of Statistical Science, Duke University, dp55@stat.duke.edu.
David B. Dunson, Department of Statistical Science, Duke University, dunson@stat.duke.edu.
References
- Adler RJ. An introduction to continuity, extrema, and related topics for general Gaussian processes. vol 12. Hayward, CA: Academic Press; 1990.
- Albert J, Chib S. Sequential ordinal modeling with applications to survival data. Biometrics. 2001;57(3):829–836. doi: 10.1111/j.0006-341x.2001.00829.x.
- Amewou-Atisso M, Ghosal S, Ghosh JK, Ramamoorthi RV. Posterior consistency for semi-parametric regression problems. Bernoulli. 2003;9(2):291–312.
- Arellano-Valle RB, Galea-Rojas M, Zuazola PI. Bayesian sensitivity analysis in elliptical linear regression models. Journal of Statistical Planning and Inference. 2000;86:175–199.
- Burr D, Doss H. A Bayesian semiparametric model for random-effects meta-analysis. Journal of the American Statistical Association. 2005;100:242–251.
- Bush C, MacEachern S. A semiparametric Bayesian model for randomised block designs. Biometrika. 1996;83(2):275.
- Chan D, Kohn R, Nott D, Kirby C. Locally adaptive semiparametric estimation of the mean and variance functions in regression models. Journal of Computational and Graphical Statistics. 2006;15:915–936.
- Chib S, Greenberg E. Additive cubic spline regression with Dirichlet process mixture errors. Journal of Econometrics. 2010;156(2):322–336.
- Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. The Annals of Applied Statistics. 2010;4(1):266–298.
- Choi T. Posterior consistency in nonparametric regression problems in Gaussian process priors. PhD thesis, Department of Statistics, Carnegie Mellon University; 2005.
- Choi T. Asymptotic properties of posterior distributions in nonparametric regression with non-Gaussian errors. Annals of the Institute of Statistical Mathematics. 2009;61(4):835–859.
- Choi T, Schervish MJ. On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis. 2007;10:1969–1987.
- Chu KC. Estimation and detection in linear systems with elliptical errors. IEEE Transactions on Automatic Control. 1973;18:499–505.
- Chung Y, Dunson D. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2009;104(488):1646–1660. doi: 10.1198/jasa.2009.tm08302.
- Cramér H, Leadbetter MR. Stationary and related stochastic processes, sample function properties and their applications. New York: John Wiley & Sons; 1967.
- Denison D, Holmes C, Mallick B, Smith AFM. Bayesian methods for nonlinear classification and regression. London: Wiley & Sons; 2002.
- Doss H. Bayesian nonparametric estimation of the median; Part I: Computation of the estimates. The Annals of Statistics. 1985;13(4):1432–1444.
- Dunson D, Park J. Kernel stick-breaking processes. Biometrika. 2008a;95(2):307–323. doi: 10.1093/biomet/asn012.
- Dunson DB, Park JH. Kernel stick-breaking processes. Biometrika. 2008b;95(2):307–323. doi: 10.1093/biomet/asn012.
- Dunson DB, Pillai N, Park JH. Bayesian density regression. Journal of the Royal Statistical Society, Series B. 2007;69:163–183.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1(2):209–230.
- Ferguson TS. Prior distributions on spaces of probability measures. The Annals of Statistics. 1974;2(4):615–629.
- Fonseca TCO, Ferreira MAR, Migon HS. Objective Bayesian analysis for the Student-t regression model. Biometrika. 2008;95(2):325–333.
- Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Bayesian Statistics. 1992;4:169–194.
- Ghosal S, Roy A. Posterior consistency of Gaussian process prior in nonparametric binary regression. The Annals of Statistics. 2006;34(5):2413–2429.
- Gramacy R, Lee H. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association. 2008;103(483):1119–1130.
- Griffin J, Steel M. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica. 2008 (to appear).
- Griffin J, Steel MFJ. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, Theory and Methods. 2006;101(473):179–194.
- Huber PJ. Robust estimation of a location parameter. The Annals of Mathematical Statistics. 1964;35(1):73–101.
- Ishwaran H, James L. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96(453):161–173.
- James LF, Lijoi A, Prünster I. Bayesian nonparametric inference via classes of normalized random measures. Tech. rep., ICER Applied Mathematics Working Papers Series 5/2005; 2005.
- Kalli M, Griffin J, Walker S. Slice sampling mixture models. Statistics and Computing. 2010:1–13.
- Kottas A, Gelfand AE. Bayesian semiparametric median regression modeling. Journal of the American Statistical Association. 2001;96(456):1458–1468.
- Lange K, Little RJA, Taylor JMG. Robust statistical modelling using the t distribution. Journal of the American Statistical Association. 1989;84(408):881–896.
- Lavine M, Mockus A. A nonparametric Bayes method for isotonic regression. Journal of Statistical Planning and Inference. 2005;46:235–248.
- Lo AY. On a class of Bayesian nonparametric estimates. I: Density estimates. The Annals of Statistics. 1984;12(1):351–357.
- Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83(1):67–79.
- Neal RJ. Regression and classification using Gaussian process priors. Bayesian Statistics. 1998;6:475–501.
- Norets A, Pelenis J. Posterior consistency in conditional distribution estimation by covariate dependent mixtures. Unpublished manuscript, Princeton University; 2010.
- Norets A, Pelenis J. Bayesian semiparametric regression. Unpublished manuscript, Princeton University; 2011.
- Nott D. Semiparametric estimation of mean and variance functions for non-Gaussian data. Computational Statistics. 2006;21:603–620.
- Ongaro A, Cattaneo C. Discrete random probability measures: a general framework for nonparametric Bayesian inference. Statistics & Probability Letters. 2004;67:33–45.
- Papaspiliopoulos O. A note on posterior sampling from Dirichlet mixture models. Tech. rep.; 2008.
- Papaspiliopoulos O, Roberts G. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–183.
- Pati D, Dunson D, Tokdar S. Posterior consistency in conditional distribution estimation. Tech. rep., Department of Statistics, Duke University; 2010.
- Raftery AE, Lewis S. How many iterations in the Gibbs sampler? Bayesian Statistics. 1992;4:763–773.
- Rasmussen C, Williams C. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press; 2006.
- Rodriguez A, Dunson D. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6(1):145–178. doi: 10.1214/11-BA605.
- Schwartz L. On Bayes procedures. Z Wahrsch Verw Gebiete. 1965;4:10–26.
- Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
- Tokdar ST. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā. 2006;68:90–110.
- van der Vaart A, van Zanten J. Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections. 2008;3:200–222.
- van der Vaart A, van Zanten J. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics. 2009;37(5B):2655–2675.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
- Walker SG. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation. 2007;36:45–54.
- Weiss R. An approach to Bayesian sensitivity analysis. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(4):739–750.
- West M. Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B. 1984;46(3):431–439.
- West M. On scale mixtures of normal distributions. Biometrika. 1987;74(3):646–648.
- Wu Y, Ghosal S. Kullback–Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
- Yau P, Kohn R. Estimation and variable selection in nonparametric heteroscedastic regression. Statistics and Computing. 2003;13:191–208.