Abstract
We propose a generalized double Pareto prior for Bayesian shrinkage estimation and inferences in linear models. The prior can be obtained via a scale mixture of Laplace or normal distributions, forming a bridge between the Laplace and Normal-Jeffreys’ priors. While it has a spike at zero like the Laplace density, it also has a Student’s t-like tail behavior. Bayesian computation is straightforward via a simple Gibbs sampling algorithm. We investigate the properties of the maximum a posteriori estimator, as sparse estimation plays an important role in many problems, reveal connections with some well-established regularization procedures, and show some asymptotic results. The performance of the prior is tested through simulations and an application.
Keywords: Heavy tails, high-dimensional data, LASSO, maximum a posteriori estimation, relevance vector machine, robust prior, shrinkage estimation
1. Introduction
There has been a great deal of work in shrinkage estimation and simultaneous variable selection in the frequentist framework. The LASSO of Tibshirani (1996) has drawn much attention to the area, particularly after the introduction of LARS (Efron et al. (2004)) due to its superb computational performance. There is a rich literature analyzing the LASSO and related approaches (Fu (1998), Knight and Fu (2000), Fan and Li (2001), Yuan and Lin (2005), Zhao and Yu (2006), Zou (2006), Zou and Li (2008)), with a number of articles considering asymptotic properties.
Bayesian approaches to the same problem became popular with the works of Tipping (2001) and Figueiredo (2003). By expressing Student’s t priors for basis coefficients as scale mixtures of normals (West (1987)), and relying on type II maximum likelihood estimation (Berger (1985)), Tipping (2001) developed the relevance vector machine for sparse estimation in kernel regression. In this setting, however, exact sparsity comes at the price of forfeiting propriety of the posterior by driving the scale parameter of the Student’s t distribution toward zero. In fact, driving both the scale parameter and the degrees of freedom to zero yields the so-called Normal-Jeffreys’ prior, π(θ) ∝ 1/|θ|. The name arises because the hierarchy follows as θ ~ N(0, τ), π(τ) ∝ 1/τ, where the latter is the Jeffreys’ prior on the prior variance of θ. Figueiredo (2003) proposed an expectation-maximization algorithm for maximum a posteriori estimation under Laplace and Normal-Jeffreys’ priors, with estimates under the Laplace corresponding to the LASSO. The Normal-Jeffreys’ prior leads to substantially improved performance with finite samples, strongly shrinking small coefficients to zero while, owing to its heavy tails, shrinking large coefficients only minimally; however, it is not meaningful from an inferential standpoint as it leads to an improper posterior.
A Bayesian LASSO was proposed by Park and Casella (2008) and Hans (2009). However, these procedures inherit the problem of over-shrinking large coefficients due to the relatively light tails of the Laplace prior. Strawderman-Berger priors (Strawderman (1971), Berger (1980)) have some desirable properties yet lack a simple analytic form. Recently proposed priors have been designed to have high density near zero and heavy tails without the impropriety problem of Normal-Jeffreys. The horseshoe prior of Carvalho, Polson, and Scott (2009, 2010) is induced through a carefully-specified mixture of normals, leading to such desirable properties as an infinite spike at zero and very heavy tails. They studied sparse shrinkage estimation properties of the horseshoe in a normal means problem. Griffin and Brown (2007, 2010) proposed an alternative class of hierarchical priors for shrinkage with some similarities to the prior we propose, but it lacks a simple analytic form that facilitates the study of some properties.
There is a need for alternative shrinkage priors that lead to sparse point estimates if desired, do not over-shrink coefficients that are not close to zero, facilitate straightforward computation even in large p cases, and result in a joint posterior distribution that does a good job of quantifying uncertainty. We propose the generalized double Pareto prior, which is independently mentioned in Cevher (2009). It has a simple analytic form, yields a proper posterior, and possesses such appealing properties as a spike at zero, Student’s t-like tails, and a simple characterization as a scale mixture of normals that leads to a straightforward Gibbs sampler for posterior inferences. We consider both fully Bayesian and frequentist penalized likelihood approaches based on this prior. We show that the induced penalty in the regularization framework yields a consistent thresholding rule having the continuity property in the orthogonal case, with a simple expectation-maximization algorithm described for sparse estimation in non-orthogonal cases. In another independent work, motivated by applications to genome-wide association studies, Lee et al. (2012) consider a generalized t prior (McDonald and Newey (1988)) that includes the generalized double Pareto as a special case. Similarities to previous work are limited, and our contributions beyond them are (i) the formal introduction of a generalized Pareto density, thresholded and folded at zero, as a shrinkage prior in Bayesian analysis, (ii) the scale mixture representation of the generalized double Pareto in Proposition 1, which is central to our work, (iii) its connection to the Laplace and Normal-Jeffreys’ priors as limiting cases in Proposition 2, (iv) the resulting full conditional posteriors in a linear regression setting along with a simple Gibbs sampling procedure, (v) a detailed discussion of the hyper-parameters α and η and their treatment, along with the incorporation of a griddy sampling scheme into the Gibbs sampler, (vi) a detailed analysis of the penalty induced by the generalized double Pareto prior and the properties of the resulting thresholding rule, (vii) an explicit analytic form for the maximum a posteriori estimator in orthogonal cases, (viii) an expectation-maximization procedure to obtain the maximum a posteriori estimate in non-orthogonal cases using the normal mixture representation, (ix) the one-step estimator (Zou and Li (2008)) resulting from the Laplace mixture representation, revealing the connection of the resulting procedure to the adaptive LASSO of Zou (2006), and (x) the oracle properties of the resulting estimators.
2. Generalized Double Pareto Prior
The generalized double Pareto density is
(2.1) f(θ|ξ, α) = {1/(2ξ)} {1 + |θ|/(αξ)}^{−(α+1)},  −∞ < θ < ∞,
where ξ > 0 is a scale parameter and α > 0 is a shape parameter. In contrast to (2.1), the generalized Pareto density of Pickands (1975) is parametrized in terms of a location parameter μ, a scale parameter ξ > 0, and a shape parameter α as
(2.2) f(θ|μ, ξ, α) = (1/ξ) {1 + (θ − μ)/(αξ)}^{−(α+1)},
with θ ≥ μ for α > 0 and μ ≤ θ ≤ μ − ξα for α < 0. The mean and variance for the generalized Pareto distribution are E(θ) = μ + ξα/(α − 1) for α ∉ [0, 1] and V(θ) = ξ²α²/{(α − 1)²(1 − 2/α)} for α ∉ [0, 2]. If we let μ = 0, (2.2) becomes an exponential density as α → ∞ with mean ξ and variance ξ².
To modify the generalized Pareto density to be a shrinkage prior, we let μ = 0 and reflect the positive part about the origin, assuming α > 0, for a density that is symmetric about zero. The mean and variance for the generalized double Pareto distribution are E(θ) = 0 for α > 1 and V(θ) = 2ξ²α²/{(α − 1)(α − 2)} for α > 2. The dispersion is controlled by ξ and α, with α controlling the tail heaviness and α = 1 corresponding to Cauchy-like tails and no finite moments.
Figure 2.1 compares the density in (2.1) to Cauchy and Laplace densities for the special case ξ = α = 1, so that f(θ) = 1/{2(1 + |θ|)²}. We refer to this form as the standard double Pareto. Near zero, the standard double Pareto resembles the Laplace density, suggesting similar sparse shrinkage properties of small coefficients in maximum a posteriori estimation. It also has Cauchy-like tails, which is appealing in avoiding over-shrinkage away from the origin. This is illustrated in Figure 2.1(a). Figure 2.1(b) illustrates how the density in (2.1) changes for different values of ξ and α.
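As a quick numerical companion to Figure 2.1, the density in (2.1) can be evaluated directly; the sketch below (in Python, with an illustrative set of grid points) compares the standard double Pareto with the Laplace and Cauchy densities.

```python
import numpy as np

def gdp_pdf(theta, xi=1.0, alpha=1.0):
    """Generalized double Pareto density f(theta | xi, alpha) as in (2.1)."""
    theta = np.asarray(theta, dtype=float)
    return (1.0 / (2.0 * xi)) * (1.0 + np.abs(theta) / (alpha * xi)) ** (-(alpha + 1.0))

# Standard double Pareto (xi = alpha = 1) against Laplace and Cauchy at a few points.
x = np.array([0.0, 0.5, 1.0, 3.0, 10.0])
laplace = 0.5 * np.exp(-np.abs(x))          # Laplace(0, 1)
cauchy  = 1.0 / (np.pi * (1.0 + x ** 2))    # standard Cauchy
print(gdp_pdf(x), laplace, cauchy, sep="\n")
```

Near the origin the standard double Pareto tracks the Laplace density, while far in the tails it decays polynomially like the Cauchy.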
Prior (2.1) can be represented as a scale mixture of normal distributions leading to computational simplifications. As shorthand notation, let θ ~ GDP(ξ, α) denote that θ has density (2.1).
Proposition 1. Let θ ~ N(0, τ), τ ~ Exp(λ²/2), and λ ~ Ga(α, η), where α > 0 and η > 0. The resulting marginal density for θ is GDP(ξ = η/α, α).
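Proposition 1 can be checked by simulation: draw from the three-stage hierarchy and compare the resulting histogram with the GDP(ξ = η/α, α) density. A minimal sketch, assuming illustrative values α = 3 and η = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, eta = 3.0, 2.0               # illustrative shape and rate for the Ga(alpha, eta) mixing
n = 200_000

# Hierarchy of Proposition 1: lambda ~ Ga(alpha, eta), tau | lambda ~ Exp(lambda^2/2),
# theta | tau ~ N(0, tau).
lam = rng.gamma(alpha, 1.0 / eta, size=n)          # numpy parameterizes gamma by scale = 1/rate
tau = rng.exponential(2.0 / lam ** 2, size=n)      # Exp with rate lambda^2/2 => scale 2/lambda^2
theta = rng.normal(0.0, np.sqrt(tau), size=n)

# Compare the sample to the GDP(xi = eta/alpha, alpha) density on a grid.
xi = eta / alpha
grid = np.linspace(-5, 5, 21)
dens = (1.0 / (2 * xi)) * (1 + np.abs(grid) / (alpha * xi)) ** (-(alpha + 1))
hist, edges = np.histogram(theta, bins=200, range=(-5, 5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(grid, centers, hist) - dens)))  # should be small
```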
Proposition 1 reveals a relationship between the prior in (2.1) and the prior of Griffin and Brown (2007), with the difference being that Griffin and Brown (2007) place a mixing distribution on λ2 leading to a marginal density on θ with no simple analytic form.
In Proposition 2 we show that the prior in (2.1) forms a bridge between two limiting cases – Laplace and Normal-Jeffreys’ priors.
Proposition 2. Given the representation in Proposition 1, θ ~ GDP(ξ = η/α, α) implies
f(θ) ∝ 1/|θ| for α = 0 and η = 0;
f(θ|λ′) = (λ′/2) exp(−λ′|θ|) for α → ∞ with α/η = λ′ and 0 < λ′ < ∞.
Proof. For the first item, setting α = η = 0 implies placing a Jeffreys’ prior on λ, π(λ) ∝ 1/λ. Integration over λ yields π(τ) ∝ 1/τ, which implies the Normal-Jeffreys’ prior on θ. For the second item, notice that π(λ) → δ(λ − λ′), where δ(.) denotes the Dirac delta function, since E(λ) = α/η = λ′ and V(λ) = α/η² = λ′²/α → 0 as α → ∞. Thus, f(θ|λ′) = ∫ (λ/2) exp(−λ|θ|) δ(λ − λ′) dλ = (λ′/2) exp(−λ′|θ|).
As noted in Polson and Scott (2010), if π(τ) has exponential or lighter tails, observations are shrunk towards zero by some non-diminishing amount, regardless of size. This phenomenon is well-understood and commonly observed in estimation under the Laplace prior, where the normal density is mixed over an exponential distribution on its variance. The higher-level mixing (over λ) in Proposition 1 allows π(τ) to have heavier tails, remedying the unwanted bias.
As α grows, the density becomes lighter tailed, more peaked and the variance becomes smaller, while as η grows, the density becomes flatter and the variance increases. Hence if we increase α, we may cause unwanted bias for large signals, though causing stronger shrinkage for noise-like signals; if we increase η we may lose the ability to shrink noise-like signals, as the density is not as pronounced around zero; and finally, if we increase α and η at the same rate, the variance remains constant but the tails become lighter, converging to a Laplace density in the limit. This leads to over-shrinking of coefficients that are away from zero. As a typical default specification for the hyperparameters, one can take α = η = 1. This choice leads to Cauchy-like tail behavior, which is well-known to have desirable Bayesian robustness properties.
To motivate this default choice, we assess the behavior of the prior shrinkage factor κ = 1/(1 + τ) ∈ (0, 1), where θ ~ N(0, τ) is the parameter of interest (Carvalho et al. (2010)). As κ → 0, the prior imposes no shrinkage, while as κ → 1 it exerts a strong pull towards zero. The generalized double Pareto distribution implies a prior π(κ) on κ upon integration over λ in Proposition 1. For the standard double Pareto, this is
π(κ) = {2(1 − κ)²}^{−1} ( [π/{2κ(1 − κ)}]^{1/2} exp[κ/{2(1 − κ)}] Erfc([κ/{2(1 − κ)}]^{1/2}) − 1 ),
where Erfc(.) denotes the complementary error function. In Figure 2.2, we compare π(κ) under the standard double Pareto, Strawderman-Berger, horseshoe, and Cauchy priors, which may all be considered default choices. The priors behave similarly for κ ≈ 0, implying similar tail behavior. The behavior of π(κ) for κ ≈ 1 governs the strength of shrinkage of small signals. As κ → 1, π(κ) tends towards zero for the Cauchy, implying weak shrinkage, while π(κ) is unbounded for the horseshoe, suggesting a strong pull towards zero for small signals. The Strawderman-Berger and standard double Pareto priors are a compromise between these extremes, with π(κ) bounded for κ → 1 in both cases. The standard double Pareto assigns higher density close to one than the Strawderman-Berger prior, and has the advantage of a simple analytic form over the Strawderman-Berger and horseshoe priors.
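The expression above can be double-checked by integrating the hierarchy of Proposition 1 numerically: given λ, τ is exponential with rate λ²/2, κ = 1/(1 + τ), and λ ~ Ga(1, 1) for the standard double Pareto. A minimal sketch (the evaluation points are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erfc

def pi_kappa_closed(k):
    """Closed form shown above for the standard double Pareto (alpha = eta = 1)."""
    z = k / (2.0 * (1.0 - k))
    return 0.5 * (1.0 - k) ** -2 * (
        np.sqrt(np.pi / (2 * k * (1 - k))) * np.exp(z) * erfc(np.sqrt(z)) - 1.0)

def pi_kappa_numeric(k):
    """Integrate the lambda mixture directly: tau = (1-k)/k, pi(tau|lam) = (lam^2/2) e^{-lam^2 tau/2}."""
    tau = (1.0 - k) / k
    integrand = lambda lam: (lam ** 2 / 2.0) * np.exp(-lam ** 2 * tau / 2.0 - lam)  # Ga(1,1) on lam
    val, _ = quad(integrand, 0, np.inf)
    return val / k ** 2    # Jacobian |d tau / d kappa| = 1 / k^2

for k in [0.1, 0.5, 0.9]:
    print(k, pi_kappa_closed(k), pi_kappa_numeric(k))
```

The numerical and closed-form values agree, and π(κ) stays bounded as κ → 1, consistent with the comparison in Figure 2.2.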
Of course it is best to adjust α and η according to any available prior information pertaining to the sparsity structure of the estimated vector. For general α > 0 and η > 0 values, the prior on κ is
(2.3) π(κ|α, η) = α(α + 1) η^α 2^{−(α+4)/2} κ^{(α−2)/2} (1 − κ)^{−(α+2)/2} U{(α + 2)/2, 1/2, η²κ/(2(1 − κ))},
where U(a, b, z) denotes the confluent hypergeometric function of the second kind. Note that π(κ|α, η) takes a “horseshoe” shape when α = η = 0. Carvalho, Polson, and Scott (2010) show that π(κ) ∝ κ^{−1}(1 − κ)^{−1} implies a Normal-Jeffreys’ prior on θ, which can also be observed by setting α = η = 0 in (2.3) in conjunction with Proposition 1. Hence π(κ|α, η) is unbounded at κ = 1, forcing π(θ|α, η) to be unbounded at 0, only if η = 0. The effects of α and η are seen more clearly in Figure 2.3. As η increases, less and less density is assigned to the neighborhood of κ ≈ 1, suppressing shrinkage. On the other hand, increasing values of α place more and more density in the neighborhood of κ ≈ 1, promoting further shrinkage. This notion is later reinforced by Proposition 3, which shows that the prior induces a thresholding rule under maximum a posteriori estimation if η < 2(α + 1)^{1/2}. Hence, we need to carefully pick these hyper-parameters, in particular α, as there is a trade-off between the magnitude of shrinkage and tail robustness.
3. Bayesian Inference in Linear Models
Consider the linear regression model y = Xβ + ∊, where y is an n-dimensional vector of responses, X is the n × p design matrix, and ∊ ~ N(0, σ²Iₙ). Letting βj|σ ~ GDP(ξ = ση/α, α) independently for j = 1, …, p,
(3.1) π(β|σ) = ∏_{j=1}^p {α/(2ση)} {1 + |βj|/(ση)}^{−(α+1)}.
From Proposition 1, this prior is equivalent to βj|σ ~ N(0, σ²τj), with τj ~ Exp(λj²/2) and λj ~ Ga(α, η). We place the Jeffreys’ prior on the error variance, π(σ) ∝ 1/σ.
Using the scale mixture of normals representation, we obtain a simple data augmentation Gibbs sampler having the conditional posteriors (β|σ², T, y) ~ N{(X′X + T^{−1})^{−1}X′y, σ²(X′X + T^{−1})^{−1}}, (σ²|β, T, y) ~ IG{(n + p)/2, [(y − Xβ)′(y − Xβ) + β′T^{−1}β]/2}, (λj|βj, σ) ~ Ga(α + 1, |βj|/σ + η), and (τj^{−1}|βj, λj, σ) ~ Inv-Gauss(μ = λjσ/|βj|, ρ = λj²), where T = diag(τ1, …, τp) and Inv-Gauss denotes the inverse Gaussian distribution with location and scale parameters μ and ρ. In our experience, this Gibbs sampler is efficient with fast rates of convergence and mixing.
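The following is a minimal sketch of this block-updating Gibbs sampler for fixed (α, η); the starting values, default arguments, and the explicit matrix inversion (reasonable only for small p) are illustrative choices rather than part of the algorithm’s specification.

```python
import numpy as np

def gdp_gibbs(y, X, alpha=1.0, eta=1.0, n_iter=2000, seed=0):
    """Sketch of the data-augmentation Gibbs sampler described above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, sigma2 = np.zeros(p), 1.0
    tau, lam = np.ones(p), np.ones(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # beta | sigma2, T, y ~ N{(X'X + T^-1)^-1 X'y, sigma2 (X'X + T^-1)^-1}
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau))   # use linear solves for larger p
        mean = A_inv @ Xty
        beta = mean + np.sqrt(sigma2) * np.linalg.cholesky(A_inv) @ rng.standard_normal(p)
        # sigma2 | beta, T, y ~ IG{(n+p)/2, [(y - Xb)'(y - Xb) + b'T^-1 b]/2}
        resid = y - X @ beta
        scale = 0.5 * (resid @ resid + np.sum(beta ** 2 / tau))
        sigma2 = 1.0 / rng.gamma(0.5 * (n + p), 1.0 / scale)
        sigma = np.sqrt(sigma2)
        # lambda_j | beta_j, sigma ~ Ga(alpha + 1, |beta_j|/sigma + eta)   (rate parameterization)
        lam = rng.gamma(alpha + 1.0, 1.0 / (np.abs(beta) / sigma + eta))
        # 1/tau_j | beta_j, lambda_j, sigma ~ Inv-Gauss(mu = lambda_j sigma/|beta_j|, rho = lambda_j^2)
        mu = lam * sigma / np.maximum(np.abs(beta), 1e-12)
        tau = 1.0 / rng.wald(mu, lam ** 2)
        draws[t] = beta
    return draws
```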
In the absence of any prior information on α and η, one may either set them to their default values or, as an alternative, choose hyper-priors to allow the data to inform about the values of α and η. We use π(α) = 1/(1 + α)² and π(η) = 1/(1 + η)², corresponding to generalized Pareto hyper-priors with location parameter 0, scale parameter 1, and shape parameter 1. The median of the resulting distribution for each of α and η is 1, centered at the default choices suggested earlier, while the mean and variance do not exist.
For sampling purposes, let a = 1/(1+α) and e = 1/(1+η). These transformations imply uniform priors on a and e over (0, 1), given the generalized Pareto priors on α and η. Consequently, the conditional posteriors for a and e are π(a|β, σ, η) ∝ {α/(2ση)}^p ∏_{j=1}^p {1 + |βj|/(ση)}^{−(α+1)} with α = 1/a − 1, and π(e|β, σ, α) ∝ {α/(2ση)}^p ∏_{j=1}^p {1 + |βj|/(ση)}^{−(α+1)} with η = 1/e − 1, neither of which is of a standard form.
We propose the embedded griddy Gibbs (Ritter and Tanner (1992)) sampling scheme:
Form a grid of m points a(1), …, a(m) in the interval (0, 1).
Calculate w(k) = π(a(k)|β, η).
Normalize the weights, w̄(k) = w(k)/Σ_{l=1}^m w(l).
Draw a sample from the set {a(1), …, a(m)} with probabilities {w̄(1), …, w̄(m)}, and set α = 1/a − 1 to be used at the current iteration of the Gibbs sampler.
Repeat the same procedure for e and obtain a random draw for η. We also experiment with fixing η at 1 while treating α as unknown; in this case, the prior variance of β|σ² is determined by α.
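A small sketch of the embedded griddy step for α is given below; the grid of midpoints, the grid size m, and the use of the marginal GDP form of the conditional (under a uniform prior on a) follow the scheme above, while the implementation details are illustrative.

```python
import numpy as np

def sample_alpha_griddy(beta, sigma, eta, rng, m=100):
    """Griddy update for a = 1/(1 + alpha) on a uniform grid over (0, 1)."""
    a_grid = (np.arange(m) + 0.5) / m                     # m grid points in (0, 1)
    alpha_grid = 1.0 / a_grid - 1.0
    # log pi(a | beta, sigma, eta) up to a constant: product of GDP(beta_j | xi = sigma*eta/alpha, alpha)
    # densities under a uniform prior on a (assumed form of the conditional shown above).
    log_w = np.array([
        np.sum(np.log(al / (2.0 * sigma * eta))
               - (al + 1.0) * np.log1p(np.abs(beta) / (sigma * eta)))
        for al in alpha_grid
    ])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    a = rng.choice(a_grid, p=w)
    return 1.0 / a - 1.0
```

The analogous update for e is obtained by exchanging the roles of α and η.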
In what follows we establish ties between the Bayesian approach we have taken and some frequentist regularization approaches. The simple analytic structure of the generalized double Pareto prior facilitates these analyses, while its hierarchical formulation leads to straightforward computation.
4. Sparse Maximum a Posteriori Estimation
The generalized double Pareto distribution can be used not only as a prior in a Bayesian analysis, but also to induce a sparsity-favoring penalty in regularized least squares:
(4.1) β̂ = arg min_β { (1/2) ∥y − Xβ∥² + σ² Σ_{j=1}^p p(|βj|) },
where X is initially assumed to have orthonormal columns and p(.) denotes the penalty function implied by the prior on the regression coefficients. Following Fan and Li (2001), let z = X′y, and write the minimization problem in (4.1) for a component of β as
(4.2) β̂j = arg min_{βj} { (1/2)(zj − βj)² + σ² p(|βj|) },
with the penalty function p(|βj|) = (α + 1) log (ση + |βj|) that simply retains the term in −log π(βj|α, η) that depends on βj.
From Fan and Li (2001), a good penalty function should result in an estimator that is (i) nearly unbiased when the true unknown parameter is large, (ii) a thresholding rule that automatically sets small estimated coefficients to zero to reduce model complexity, and (iii) continuous in the data zj to avoid instability in model prediction. In the following, we show that the penalty function induced by prior (3.1) may achieve these properties.
4.1. Near-unbiasedness
The first order derivative of (4.2) with respect to βj is sgn(βj){|βj| + σ²p′(|βj|)} − zj, where p′(|βj|) = ∂p(|βj|)/∂|βj| is the term causing bias in estimation. Although it is appealing to introduce bias in small coefficients to reduce the mean squared error and model complexity, it is also desirable to limit the shrinkage of large coefficients with p′(|βj|) → 0 as |βj| → ∞. In addition, it is desirable for p′(|βj|) to approach zero rapidly, so that the shrinkage, and the associated introduction of bias, rapidly decreases as coefficients get further away from zero. In fact, the rate of convergence of p′(|βj|) to zero is of the same order under the generalized double Pareto and Normal-Jeffreys’ priors, with the ratio of the two derivatives converging to (α + 1) as |βj| → ∞. As α controls the tail heaviness in the generalized double Pareto prior, with lighter tails for larger values of α, convergence of the ratio to (α + 1) is intuitive. In the case of LASSO, the bias, p′(|βj|), remains constant regardless of |βj|, which can also be observed in Figure 4.4(b).
4.2. Sparsity
As noted in Fan and Li (2001), a sufficient condition for the resulting estimator to be a thresholding rule is that the minimum of the function |βj| + σ²p′(|βj|) is positive.
Proposition 3. Under the formulation in Proposition 1, prior (3.1) implies a penalty yielding an estimator that is a thresholding rule if η < 2(α + 1)^{1/2}.
This result is obtained by finding the minimum of |βj| + σ²p′(|βj|) and taking it greater than zero. The thresholding is a direct consequence of the fact that when |zj| ≤ σ{2(α + 1)^{1/2} − η}, which requires that η < 2(α + 1)^{1/2}, the derivative of (4.2) is positive for all positive βj and negative for all negative βj. In this case, the penalized least squares estimator is zero. When |zj| > σ{2(α + 1)^{1/2} − η}, two roots may exist. The larger one (in absolute value) or zero is the penalized least squares estimator. To elaborate, the root(s) may exist for βj ≠ 0 only when |zj| > σ{2(α + 1)^{1/2} − η}. A helpful illustration is Figure 3 of Fan and Li (2001).
4.3. Continuity
Continuity in data is important if an estimator is to avoid instabilities in prediction. As in Breiman (1996), “a regularization procedure is unstable if a small change in data can make large changes in the regularized estimator”. Discontinuities in the thresholding rule may result in inclusion or dismissal of a signal with minor changes in the data used (see Figure 4.4(b)). Hard-thresholding, the “usual” variable selection, is an unstable procedure, while ridge and LASSO estimates are considered stable.
A necessary and sufficient condition for continuity is that the minimum of the function |βj| + σ²p′(|βj|) is attained at zero (Fan and Li (2001)). For our prior, the minimum of this function is attained at |βj| = σ{(α + 1)^{1/2} − η} when η < (α + 1)^{1/2}, and at zero otherwise. Therefore η ≥ (α + 1)^{1/2} yields an estimator with this property.
Proposition 4. Under the formulation in Proposition 1, a subfamily of prior (2.1) with η ≥ (α + 1)^{1/2} yields an estimator with the continuity property.
In this particular case, the penalized likelihood estimator is set to zero if |zj| ≤ σ(α + 1)/η. When |zj| > σ(α + 1)/η,
(4.3) β̂j = sgn(zj) [ |zj| − ση + {(|zj| + ση)² − 4σ²(α + 1)}^{1/2} ] / 2.
As can be observed in Figure 4.4(a), ensuring continuity by letting η ≥ (α + 1)^{1/2} creates a trade-off between sparsity and tail-robustness: as the thresholding region becomes wider, the larger values are penalized further, yet not nearly at the level of LASSO.
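For the orthogonal case, the resulting rule is easy to implement component-wise. The sketch below uses the threshold and root expressions given above, with η = (α + 1)^{1/2} as an assumed default so that the rule is continuous; it can be compared directly with soft thresholding.

```python
import numpy as np

def gdp_threshold(z, sigma=1.0, alpha=1.0, eta=None):
    """Component-wise MAP under the GDP penalty with orthonormal X (sketch of the rule in (4.3)).

    Uses eta = sqrt(alpha + 1) by default, i.e. the continuity subfamily discussed above.
    """
    if eta is None:
        eta = np.sqrt(alpha + 1.0)
    z = np.asarray(z, dtype=float)
    thresh = sigma * (alpha + 1.0) / eta
    disc = (np.abs(z) + sigma * eta) ** 2 - 4.0 * sigma ** 2 * (alpha + 1.0)
    root = 0.5 * ((np.abs(z) - sigma * eta) + np.sqrt(np.maximum(disc, 0.0)))
    return np.where(np.abs(z) <= thresh, 0.0, np.sign(z) * root)

z = np.linspace(-6, 6, 13)
print(gdp_threshold(z, sigma=1.0, alpha=3.0))
# compare with soft thresholding: np.sign(z) * np.maximum(np.abs(z) - 2.0, 0.0)
```

Large |z| are shrunk far less than under soft thresholding, illustrating the tail-robustness of the rule.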
4.4. Maximum a Posteriori Estimation via Expectation-Maximization
We assume a normal likelihood to formulate the procedure for non-orthogonal linear regression. Estimation is carried out via the expectation-maximization (EM) algorithm.
4.4.1. Exploiting the Normal Mixture Representation
We take the expectation of the log-posterior with respect to the conditional posterior distribution of the latent scales (τj, λj) given (βj^{(k)}, σ^{(k)}) at the kth step, then maximize with respect to βj and σ² to get the values for the (k + 1)th step.
- E-step: ωj^{(k)} = E(τj^{−1} | βj^{(k)}, σ^{(k)}) = (α + 1)σ^{(k)2} / [ |βj^{(k)}| {|βj^{(k)}| + ησ^{(k)}} ].
- M-step: Letting Ω^{(k)} = diag(ω1^{(k)}, …, ωp^{(k)}), we have β^{(k+1)} = (X′X + Ω^{(k)})^{−1}X′y and σ^{2(k+1)} = {(y − Xβ^{(k+1)})′(y − Xβ^{(k+1)}) + Σ_j ωj^{(k)} βj^{(k+1)2}} / (n + p + 2).
We refer to this estimator as GDP(MAP).
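A compact sketch of the GDP(MAP) iterations under the normal mixture representation is given below; the ridge-type starting value, the convergence tolerance, and the guard against division by zero are illustrative choices.

```python
import numpy as np

def gdp_map_em(y, X, alpha=1.0, eta=1.0, n_iter=500, tol=1e-6):
    """EM for the MAP estimate using the normal scale-mixture representation (sketch)."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX + np.eye(p), Xty)     # ridge start (illustrative)
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(n_iter):
        beta_old = beta.copy()
        sigma = np.sqrt(sigma2)
        # E-step: omega_j = E(1/tau_j | beta_j, sigma) at the current iterate
        abs_b = np.maximum(np.abs(beta), 1e-10)
        omega = sigma2 * (alpha + 1.0) / (abs_b * (abs_b + eta * sigma))
        # M-step: ridge-type update for beta and a closed-form update for sigma^2
        beta = np.linalg.solve(XtX + np.diag(omega), Xty)
        resid = y - X @ beta
        sigma2 = (resid @ resid + np.sum(omega * beta ** 2)) / (n + p + 2)
        if np.sum((beta - beta_old) ** 2) < tol:
            break
    return beta, sigma2
```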
4.4.2. Exploiting the Laplace Mixture Representation and the One-step Estimator
In the proof of Proposition 1, the integration over τ leads to a Laplace mixture representation of the prior. Since the conditional posterior of the Laplace mixing parameter λj is a known gamma distribution, the required expectation is obtained in closed form, resulting in the maximization step
(4.4) β^{(k+1)} = arg min_β [ ∥y − Xβ∥² + 2(α + 1)σ^{(k)2} Σ_{j=1}^p |βj| / {|βj^{(k)}| + ησ^{(k)}} ],  σ^{(k+1)} = {−b − (b² − 4ac)^{1/2}} / (2a),
where a = −(n + p + 2), b = Σ_{j=1}^p E(λj | βj^{(k)}, σ^{(k)}) |βj^{(k+1)}|, and c = (y − Xβ^{(k+1)})′(y − Xβ^{(k+1)}). The component-specific multiplier on |βj| is obtained from the expectation of λj with respect to its conditional posterior distribution, π(λj|βj, σ²). Similar results to (4.4) are found in Candes, Wakin, and Boyd (2008), Cevher (2009), and Garrigues (2009).
An intuitive relationship to the adaptive LASSO of Zou (2006) and the one-step sparse estimator of Zou and Li (2008) can be seen via the Laplace mixture representation. As a computationally fast alternative to estimating the exact mode via the above EM algorithm, we can obtain a “one-step estimator” and exploit the LARS algorithm as in Zou and Li (2008). The one-step estimator is
(4.5) β^{(1)} = arg min_β { ∥y − Xβ∥² + α† Σ_{j=1}^p |βj| / (|βj^{(0)}| + η†) },
with α† = 2σ^{2(0)}(α + 1) and η† = σ^{(0)}η. This estimator resembles the adaptive LASSO. The LARS algorithm can be used to obtain β^{(1)} very quickly. We refer to this estimator as GDP(OS).
Remark 1. For η† = 0, the GDP(OS) solution path for varying α† is identical to the adaptive LASSO solution path with γ = 1 (see (4) in Zou (2006)) using identical β(0).
Remark 2. GDP(OS) forms a bridge between the LASSO and the adaptive LASSO: as η† → ∞ and α†/η† → λ† < ∞, GDP(OS) gives the LASSO solution with penalty parameter λ†.
We derive the GDP(OS) estimator only to reveal a close connection with the adaptive LASSO of Zou (2006) and do not use it in our experiments.
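Although we do not use GDP(OS) in the experiments, the estimator in (4.5) is simply a weighted ℓ1 problem and can be computed with any LASSO solver. The sketch below uses a plain coordinate descent in place of LARS, with the weights α†/(|βj^{(0)}| + η†) built from an initial estimate β^{(0)}; the solver and its iteration count are illustrative.

```python
import numpy as np

def weighted_lasso_cd(y, X, weights, n_iter=500):
    """Coordinate descent for min_b ||y - Xb||^2 + sum_j w_j |b_j| (plain sketch, no LARS)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    r = y.copy()                                     # residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                      # partial residual excluding coordinate j
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - weights[j] / 2.0, 0.0) / col_ss[j]
            r -= X[:, j] * b[j]
    return b

def gdp_one_step(y, X, beta0, sigma0, alpha=1.0, eta=1.0):
    """GDP(OS) of (4.5): one reweighted-l1 step from an initial estimator beta0 (sketch)."""
    alpha_d = 2.0 * sigma0 ** 2 * (alpha + 1.0)      # alpha-dagger
    eta_d = sigma0 * eta                             # eta-dagger
    weights = alpha_d / (np.abs(beta0) + eta_d)      # component-specific l1 weights
    return weighted_lasso_cd(y, X, weights)
```

Setting eta_d = 0 recovers adaptive-LASSO-type weights proportional to 1/|βj^{(0)}|, in line with Remark 1.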
4.4.3. Normal vs. Laplace Representations in Computation
As pointed out by an anonymous referee, it is appropriate to compare the convergence behavior of the EM algorithms that exploit the different mixture representations. We generated n = {200, 400, 600, 800, 1000} observations from yi = xi′β* + ∊i, where the xij were independent standard normals for p = {20, 40, 60, 80, 100}, ∊i ~ N(0, σ²), and σ = 3. We set the first p/4 components of β* to be 1 and the rest to 0. For each (n, p) combination we simulated 100 data sets and ran the EM algorithms obtained from the normal and Laplace scale mixture representations. Figure 4.5 illustrates the number of iterations taken by the two algorithms until ∥β^{(k+1)} − β^{(k)}∥₂ < 10^{−6}. As expected, convergence under the Laplace mixture representation was much faster, with the intermediary mixing parameter τj integrated out rather than handled through the expectation step of the EM algorithm.
4.5. Oracle Properties
Following Zou (2006) and Zou and Li (2008), we show that the GDP(MAP) and GDP(OS) estimators possess oracle properties. Relaxing the normality assumption on the error term, we require two conditions for Theorems 1 and 2.
(A1) yi = xi′β* + ∊i, where ∊1, …, ∊n are independent and identically distributed with mean 0 and variance σ².
(A2) X′X/n → C, where C is a positive definite matrix.
In what follows, 𝒜 = {j : β*j ≠ 0}, β𝒜 retains the entries of β indexed by 𝒜, and C𝒜𝒜 retains the rows and columns of C indexed by 𝒜.
Theorem 1. Let
denote the GDP(MAP) estimator, where and . Let . Suppose that , and, . Then is
consistent in variable selection, in that P(𝒜n = 𝒜) → 1, where 𝒜n denotes the set of indices of the nonzero components of the estimator;
asymptotically normal, with n^{1/2}(β̂𝒜 − β*𝒜) →d N(0, σ²C𝒜𝒜^{−1}).
Remark 3. More generally, the above results hold if and .
Theorem 2. Let denote the GDP(OS) estimator in (4.5) and . Suppose that , , and . Then is
consistent in variable selection, in that P(𝒜n = 𝒜) → 1, where 𝒜n denotes the set of indices of the nonzero components of the estimator;
asymptotically normal, with n^{1/2}(β̂𝒜 − β*𝒜) →d N(0, σ²C𝒜𝒜^{−1}).
The proofs are deferred to Section 8.
5. Experiments
5.1. Simulation
In this section, we compare the proposed estimators to the posterior means obtained under the normal, Laplace, and horseshoe priors, to the Bayesian model averaged (BMA) estimator, as well as to the sparse estimates resulting from LASSO (Tibshirani (1996)) and SCAD (Fan and Li (2001)). GDP(PM) and GDP(MAP) denote the posterior mean and the MAP estimates, respectively, under the generalized double Pareto prior. Hyper-parameter values are provided in footnotes of Tables 5.1 and 5.2 when fixed in advance and are otherwise treated as random with the priors specified in Section 3. When not fixed, we first obtain the posterior means of the hyper-parameters from an initial Bayesian analysis, then use them in the calculation of the MAP estimates.
Table 5.1. Median model error over 100 simulated data sets; the bootstrap standard error of the median is given in parentheses.

n = 50

| Method | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Normal | 2.299 (0.085) | 4.879 (0.263) | 2.585 (0.134) | 4.972 (0.385) | 2.886 (0.150) |
| Laplace | 2.634 (0.137) | 3.662 (0.233) | 2.837 (0.126) | 4.326 (0.211) | 3.458 (0.120) |
| Horseshoe | 2.264 (0.086) | 2.316 (0.167) | 3.205 (0.140) | 3.929 (0.218) | 4.409 (0.130) |
| BMA | 2.451 (0.123) | 1.647 (0.126) | 4.043 (0.233) | 3.062 (0.194) | 6.015 (0.301) |
| GDP(PM)1 | 2.306 (0.114) | 2.405 (0.192) | 3.193 (0.215) | 4.123 (0.304) | 4.283 (0.142) |
| GDP(PM)2 | 2.303 (0.095) | 2.309 (0.195) | 3.124 (0.153) | 3.910 (0.237) | 4.451 (0.109) |
| GDP(PM) | 2.271 (0.085) | 2.606 (0.167) | 3.047 (0.147) | 4.348 (0.171) | 3.640 (0.134) |
| GDP(MAP)1 | 3.414 (0.148) | 1.619 (0.150) | 5.605 (0.298) | 2.970 (0.168) | 8.769 (0.403) |
| GDP(MAP)2 | 4.250 (0.354) | 1.618 (0.153) | 6.331 (0.300) | 3.040 (0.163) | 9.308 (0.377) |
| GDP(MAP) | 4.876 (0.355) | 2.091 (0.182) | 4.299 (0.222) | 3.740 (0.284) | 5.724 (0.177) |
| LASSO | 2.183 (0.124) | 2.618 (0.152) | 3.258 (0.194) | 3.531 (0.172) | 5.646 (0.229) |
| SCAD | 3.732 (0.214) | 2.132 (0.229) | 5.249 (0.239) | 3.179 (0.193) | 8.505 (0.387) |

n = 400

| Method | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Normal | 0.395 (0.014) | 0.455 (0.019) | 0.426 (0.016) | 0.455 (0.024) | 0.412 (0.013) |
| Laplace | 0.315 (0.016) | 0.374 (0.014) | 0.388 (0.016) | 0.422 (0.015) | 0.457 (0.014) |
| Horseshoe | 0.219 (0.016) | 0.205 (0.010) | 0.341 (0.014) | 0.346 (0.009) | 0.514 (0.023) |
| BMA | 0.151 (0.011) | 0.125 (0.005) | 0.240 (0.016) | 0.211 (0.009) | 0.646 (0.037) |
| GDP(PM)1 | 0.233 (0.016) | 0.206 (0.009) | 0.326 (0.015) | 0.284 (0.014) | 0.625 (0.031) |
| GDP(PM)2 | 0.228 (0.017) | 0.215 (0.009) | 0.332 (0.013) | 0.303 (0.010) | 0.579 (0.027) |
| GDP(PM) | 0.248 (0.017) | 0.182 (0.007) | 0.377 (0.016) | 0.362 (0.012) | 0.466 (0.016) |
| GDP(MAP)1 | 0.154 (0.014) | 0.111 (0.011) | 0.286 (0.016) | 0.210 (0.011) | 0.739 (0.043) |
| GDP(MAP)2 | 0.161 (0.013) | 0.111 (0.010) | 0.284 (0.016) | 0.210 (0.009) | 0.652 (0.035) |
| GDP(MAP) | 0.185 (0.017) | 0.119 (0.010) | 0.326 (0.016) | 0.336 (0.010) | 0.478 (0.020) |
| LASSO | 0.251 (0.014) | 0.276 (0.014) | 0.339 (0.020) | 0.348 (0.011) | 0.485 (0.021) |
| SCAD | 0.121 (0.010) | 0.118 (0.008) | 0.233 (0.011) | 0.206 (0.017) | 0.469 (0.019) |

Footnotes: 1, α = 1, η = 1; 2, η = 1.
Table 5.2. Posterior means of α and η and the corresponding model error (ME) on a single data set under Models 2 and 5.

| | | GDP(PM) (n = 50) | GDP(PM)2 (n = 50) | GDP(PM) (n = 400) | GDP(PM)2 (n = 400) |
|---|---|---|---|---|---|
| Model 2 | α | 2.464 | 1.165 | 0.688 | 0.870 |
| | η | 4.181 | – | 0.614 | – |
| | ME | 2.443 | 2.219 | 0.149 | 0.181 |
| Model 5 | α | 5.262 | 1.200 | 9.400 | 0.560 |
| | η | 9.476 | – | 51.735 | – |
| | ME | 6.290 | 7.019 | 0.518 | 0.614 |

Footnote: 2, η = 1.
We generated n = {50, 400} observations from yi = xi′β* + ∊i, where the xij were standard normals with Cov(xj, xj′) = 0.5^{|j−j′|}, ∊i ~ N(0, σ²), and σ = 3. We used the following β* configurations:
Model 1: 5 randomly chosen components of β* set to 1 and the rest to 0.
Model 2: 5 randomly chosen components of β* set to 3 and the rest to 0.
Model 3: 10 randomly chosen components of β* set to 1 and the rest to 0.
Model 4: 10 randomly chosen components of β* set to 3 and the rest to 0.
Model 5: β* = (0.85, …, 0.85)′.
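For reference, one simulated data set under this design can be generated as follows; the dimension p is not restated in the description above, so the value p = 20 in the sketch is a placeholder, and the random selection of nonzero components mirrors the model descriptions.

```python
import numpy as np

def simulate(n, p=20, rho=0.5, sigma=3.0, model=1, seed=0):
    """Generate one data set from the simulation design described above (sketch; p is assumed)."""
    rng = np.random.default_rng(seed)
    # Correlated design: Cov(x_j, x_j') = rho^{|j - j'|}
    C = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), C, size=n)
    beta = np.zeros(p)
    if model in (1, 2):
        beta[rng.choice(p, 5, replace=False)] = 1.0 if model == 1 else 3.0
    elif model in (3, 4):
        beta[rng.choice(p, 10, replace=False)] = 1.0 if model == 3 else 3.0
    else:
        beta[:] = 0.85
    y = X @ beta + sigma * rng.standard_normal(n)
    return y, X, beta, C
```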
In our experiments y and the columns of X were centered, and the columns of X were scaled to have unit length. For the calculation of competing estimators we used the lars (Hastie and Efron (2011)), SIS (Fan et al. (2010)), monomvn (Gramacy (2010)), and BAS (Clyde and Littman (2005), Clyde, Ghosh, and Littman (2010)) packages in R, mainly following the default settings provided by the packages. Under the normal prior, the so-called “ridge” parameter was given an inverse gamma prior with shape and scale parameters equal to 10^{−3}. Under the Laplace prior, as a default choice, a gamma prior was placed on the “LASSO parameter” λ², as given in (6) of Park and Casella (2008), with shape and rate parameters 2 and 0.1, respectively. Under the horseshoe prior, the monomvn package uses the hierarchy given in Section 1.1 of Carvalho, Polson, and Scott (2010). For BMA, we used the default settings of the BAS package, which employs the Zellner-Siow prior given in Section 3.1 of Liang et al. (2008). The tuning parameters for LASSO and SCAD were chosen by the criteria given in Yuan and Lin (2005) and Wang, Li, and Tsai (2007), respectively, avoiding cross-validation.
One hundred data sets were generated for each case. In Table 5.1, we report the median model error, calculated as (β̂ − β*)′C(β̂ − β*), where C is the variance-covariance matrix that generated X and β̂ denotes the estimator in use. The values in parentheses give the bootstrap standard error of the median model error, obtained by generating 500 bootstrap samples from the 100 model error values, finding the median model error for each bootstrap sample, and then calculating the standard error of these medians. Under each model, the best three performances correspond to the three smallest entries in the corresponding column of Table 5.1.
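The model error and the bootstrap standard error of its median can be computed as below (a sketch; the number of bootstrap replicates follows the description above).

```python
import numpy as np

def model_error(beta_hat, beta_star, C):
    """ME = (beta_hat - beta*)' C (beta_hat - beta*)."""
    d = beta_hat - beta_star
    return d @ C @ d

def bootstrap_se_of_median(me_values, n_boot=500, seed=0):
    """Bootstrap the median of the model-error values, as described above (sketch)."""
    rng = np.random.default_rng(seed)
    me_values = np.asarray(me_values)
    meds = np.array([np.median(rng.choice(me_values, size=me_values.size, replace=True))
                     for _ in range(n_boot)])
    return meds.std(ddof=1)
```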
GDP(PM) estimates showed a similar performance to that of horseshoe under sparse setups. GDP(PM) (with α and η unknown) also showed great flexibility in adapting to dense models with small signals. GDP(MAP) estimates performed similarly to SCAD and much better than LASSO, particularly so with increasing sparsity, signal and/or sample size. The GDP(PM) and GDP(MAP) calculations are straightforward and computationally inexpensive due to the normal (and Laplace) scale mixture representation used. Being able to use a simple Gibbs sampler (especially when α = η = 1) makes the procedure attractive for the average user.
Letting α = η = 1 may be somewhat restrictive if the underlying model is very dense or very sparse, but in the cases we considered it performed comparably to the others, and we believe that it constitutes a good default prior, similar to a standard Cauchy, with the added advantage of a thresholding ability. Although we do not take up p ≫ n cases in this paper, in such situations much larger values of α would need to be chosen to adjust for multiplicity.
5.2. Inferences on Hyper-parameters
Here we take a closer look at the inferences on the hyper-parameters obtained from an individual data set for Models 2 and 5 from Section 5.1. This gives us some insight into how α and η are inferred with changing sample size and sparsity structure. Note that GDP(PM)2 is more restrictive than GDP(PM) as η is fixed, treating only α as unknown. Figure 5.6 gives the marginal posteriors of α and η in cases of GDP(PM)2 and GDP(PM) as described in Section 5.1, while Table 5.2 reports the posterior means for α and η, as well as model error (ME) performance (as calculated in Section 5.1) on the particular data set used. We clearly observe the adaptive nature and higher flexibility of GDP(PM) moving from a sparse to a dense model with a big increase, particularly in η, flattening the prior on β. There is not quite as much wiggle room in the case of GDP(PM)2. All it can do is to drive α smaller to allow heavier tails to accommodate a dense structure. As observed in Table 5.1, however, GDP(PM)2 performs comparably in sparse cases.
6. Data Example
We consider the ozone data analyzed by Breiman and Friedman (1985) and by Casella and Moreno (2006). The original data set contains 13 variables and 366 observations. The modeled response is the daily maximum one-hour averaged ozone reading in Los Angeles over 330 days in 1976. There are p = 12 predictors considered and deleting incomplete observations leaves n = 203 observations. For validation, the data were split into a training set containing 180 observations and a test set containing 23 observations. We considered models including main effects, quadratic, and two-way interaction terms, resulting in 90 candidate terms and 2^90 possible subsets. The complex correlation structure of the data is illustrated in Figure 6.7.
Figure 6.8 summarizes the performance of the proposed estimators and their competitors. Median values of the prediction error and the ±2 standard error intervals were obtained by running the methods on 100 different random training-test splits. Standard errors were computed by bootstrapping the medians 500 times.
The median number of predictors retained in the model by all three GDP(MAP) estimates was only 4 while it was 14 and 9 for LASSO and SCAD. Hence GDP(MAP) promoted much sparser models. In terms of prediction, GDP(PM)1 yielded the second best results after BMA, with GDP(PM)2, GDP(PM), and the horseshoe estimator all having somewhat worse performance. These shrinkage priors are designed to mimic model averaging behavior, so we expected to obtain results that were competitive with, but not better than, BMA. The improved performance for GDP(PM)1 may be attributed to the use of default hyper-parameter values that were fixed in advance at values thought to produce good performance in sparse settings. Treating the hyper-parameters as unknown is appealing from the standpoint of flexibility, but in practice the data may not inform sufficiently about their values to outperform a good default choice. GDP(MAP)1 and SCAD both performed within the standard error range of LASSO, while retaining a smaller number of variables in the model. As it is important to account for model uncertainty in prediction, the posterior mean estimator under the GDP prior is appealing in mimicking BMA. In addition, obtaining a simple model containing a relatively small number of predictors is often important, since such models are more likely to be used in fields in which predictive black boxes are not acceptable and practitioners desire interpretable predictive models.
7. Discussion
We have proposed a hierarchical prior obtained through a particular scale mixture of normals where the resulting marginal prior has a folded generalized Pareto density thresholded at zero. This prior combines the best of both worlds in that fully Bayes inferences are feasible through its hierarchical representation, providing a measure of uncertainty in estimation, while the resulting marginal prior on the regression coefficients induces a penalty function that allows for the analysis of frequentist properties under maximum a posteriori estimation. The resulting posterior mean estimator can be argued to be mimicking a Bayesian model averaging behavior through mixing over higher level hyper-parameters. Although Bayesian model averaging is appealing, it can be argued that allowing parameters to be arbitrarily close to zero instead of exactly equal to zero may be more natural in some problems. Hence we have a procedure that not only bridges two paradigms – Bayesian shrinkage estimation and regularization – but also yields three useful tools: a sparse estimator with good frequentist properties through maximum a posteriori estimation, a posterior mean estimator that mimics a model averaging behavior, and a useful measure of uncertainty around the observed estimates. In addition, the proposed methods have substantial computational advantages in relying on simple block-updated Gibbs sampling, while BMA requires sampling from a model space with 2p models. Given the simple and fast computation and the excellent performance in small sample simulation studies, the generalized double Pareto should be useful as a shrinkage prior in a broad variety of Bayesian hierarchical models, while also suggesting close relationships with frequentist penalized likelihood approaches. The proposed prior can be used in generalized linear models, shrinkage of basis coefficients in non-parametric regression, and in such settings as factor analysis and nonparametric Bayes modeling.
8. Technical Details
Proof of Theorem 1. The proof follows along similar lines as does the proof of Theorem 2 in Zou (2006). We first prove asymptotic normality. Let and
Let , suggesting . Now
and we know that X′X/n → C and . Consider the limiting behavior of the third term, noting that . If , then . If , then which is 0 if uj = 0, and diverges otherwise. By Slutsky’s Theorem
Vn(u)−Vn(0) is convex and the unique minimum of the right hand side is . By epiconvergence (Geyer (1994), Knight and Fu (2000)),
(8.1) |
Since , this proves asymptotic normality.
Now ; thus . Hence for consistency, it is sufficient to show that . Consider the event . By the KKT optimality conditions, . Noting that by (8.1), , while
By (8.1) and Slutsky’s Theorem, we know that both terms in the brackets converge in distribution to some normal, so
This concludes the proof.
Proof of Theorem 2. We modify the proof of Theorem 2 in Zou (2006). Here denotes the least squares estimator. We first prove asymptotic normality. Let and
Let , suggesting . Now
and we know that X′X/n → C and . Consider the limiting behavior of the third term. If then, by the Continuous Mapping Theorem, and . By Slutsky’s Theorem, . If , then and , where . Again by Slutsky’s Theorem,
Vn(u)−Vn(0) is convex and the unique minimum of the right hand side is . By epiconvergence (Geyer (1994), Knight and Fu (2000)),
(8.2) |
Since , this proves the asymptotic normality.
Now ; thus . We show that for all , . Consider the event . By the KKT optimality conditions, . We know that , while
By (8.2) and Slutsky’s Theorem, we know that both terms in the brackets converge in distribution to some normal, so
which proves consistency.
Acknowledgments
This work was supported by Award Number R01ES017436 from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health. Jaeyong Lee was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (20110027353).
Contributor Information
Artin Armagan, SAS Institute Inc., Durham, NC 27513, USA, artin.armagan@sas.com.
David B. Dunson, Department of Statistical Science, Duke University, Durham, NC 27708, USA, dunson@stat.duke.edu
Jaeyong Lee, Department of Statistics, Seoul National University, Seoul, 151-747, Korea, leejyc@gmail.com.
References
- Berger J. A robust generalized Bayes estimator and confidence region for a multivariate normal mean. The Annals of Statistics. 1980;8:716–761.
- Berger J. Statistical Decision Theory and Bayesian Analysis. Springer; New York: 1985.
- Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics. 1996;24:2350–2383.
- Breiman L, Friedman JH. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association. 1985;80.
- Candes EJ, Wakin MB, Boyd SP. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications. 2008;14:877–905.
- Carvalho C, Polson N, Scott J. Handling sparsity via the horseshoe. JMLR: W&CP. 2009;5.
- Carvalho C, Polson N, Scott J. The horseshoe estimator for sparse signals. Biometrika. 2010;97:465–480.
- Casella G, Moreno E. Objective Bayesian variable selection. Journal of the American Statistical Association. 2006;101.
- Cevher V. Learning with compressible priors. Advances in Neural Information Processing Systems. 2009;22.
- Clyde M, Ghosh J, Littman ML. Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics. 2010;20:80–101.
- Clyde M, Littman M. Bayesian model averaging using Bayesian adaptive sampling – BAS package manual. 2005.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
- Fan J, Feng Y, Samworth R, Wu Y. SIS package manual. 2010.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Figueiredo MAT. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25:1150–1159.
- Fu W. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
- Garrigues PJ. Sparse coding models of natural images: Algorithms for efficient inference and learning of higher-order structure. PhD Thesis, University of California, Berkeley; 2009.
- Geyer CJ. On the asymptotics of constrained M-estimation. The Annals of Statistics. 1994;22:1993–2010.
- Gramacy RB. Estimation for multivariate normal and Student-t data with monotone missingness – monomvn package manual. 2010.
- Griffin JE, Brown PJ. Bayesian adaptive lassos with non-convex penalization. Technical Report; 2007.
- Griffin JE, Brown PJ. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis. 2010;5:171–188.
- Hans C. Bayesian lasso regression. Biometrika. 2009;96:835–845.
- Hastie T, Efron B. lars package manual. 2011.
- Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
- Lee A, Caron F, Doucet A, Holmes C. Bayesian sparsity-path-analysis of genetic association signal using generalized t priors. Statistical Applications in Genetics and Molecular Biology. 2012;11. doi: 10.2202/1544-6115.1712.
- Liang F, Paulo R, Molina G, Clyde M, Berger J. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association. 2008;103:410–423.
- McDonald JB, Newey WK. Partially adaptive estimation of regression models via the generalized t distribution. Econometric Theory. 1988;4:428–457.
- Park T, Casella G. The Bayesian lasso. Journal of the American Statistical Association. 2008;103:681–686.
- Pickands J. Statistical inference using extreme order statistics. The Annals of Statistics. 1975;3:119–131.
- Polson NG, Scott JG. Shrink globally, act locally: Sparse Bayesian regularization and prediction. In: Bayesian Statistics 9. Oxford University Press; 2010.
- Ritter C, Tanner MA. Facilitating the Gibbs sampler: The Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association. 1992;87:861–868.
- Strawderman WE. Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics. 1971;42:385–388.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
- West M. On scale mixtures of normal distributions. Biometrika. 1987;74:646–648.
- Yuan M, Lin Y. Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association. 2005;100:1215–1225.
- Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.