Bayes variable selection in semiparametric linear models

Suprateek Kundu; David B Dunson

doi:10.1080/01621459.2014.881153

. Author manuscript; available in PMC: 2015 Mar 1.

Published in final edited form as: J Am Stat Assoc. 2014 Mar 19;109(505):437–447. doi: 10.1080/01621459.2014.881153

Bayes variable selection in semiparametric linear models

Suprateek Kundu ¹, David B Dunson ²

PMCID: PMC4111209 NIHMSID: NIHMS571275 PMID: 25071298

Abstract

There is a rich literature on Bayesian variable selection for parametric models. Our focus is on generalizing methods and asymptotic theory established for mixtures of g-priors to semiparametric linear regression models having unknown residual densities. Using a Dirichlet process location mixture for the residual density, we propose a semiparametric g-prior which incorporates an unknown matrix of cluster allocation indicators. For this class of priors, posterior computation can proceed via a straightforward stochastic search variable selection algorithm. In addition, Bayes factor and variable selection consistency is shown to result under a class of proper priors on g even when the number of candidate predictors p is allowed to increase much faster than sample size n, while making sparsity assumptions on the true model size.

Keywords: Asymptotic theory, Bayes factor, g-prior, Large p, small n, Model selection, Posterior consistency, Subset selection, Stochastic search variable selection

1. INTRODUCTION

Bayesian variable selection is widely applied, with O’Hara and Sillanpää providing a recent review (2009). There is a rich literature proposing variable selection methods and studying asymptotic properties for parametric models, while our focus is variable selection in semiparametric linear regression models of the form:

Y^{n} = X_{γ} β_{γ} + ε, ε_{i} ~ f,

(1)

where Yⁿ is n × 1, γ = {γ_j, j = 1, …, p} ∈ Γ, γ_j = 1 if the jth candidate predictor is included in the model with γ_j = 0 otherwise, Γ is the set of all possible subsets that are given non-zero prior probability, $p_{γ} = \sum_{j = 1}^{p} γ_{j}$ is the size of model γ, β_γ is the p_γ × 1 vector of regression coefficients, X_γ is the n × p_γ design matrix containing the predictors in model γ, and f is an unknown residual density. Our focus is on avoiding parametric assumptions on f, while accommodating high-dimensional settings in which the number of candidate predictors p can be much larger than the sample size n but Γ is restricted to sparse models having p_γ < n.

There has been limited consideration of variable selection in semiparametric Bayesian models, with essentially no results on asymptotic properties. In particular, it would be appealing to provide a computationally efficient procedure for Bayesian variable selection based on (1) for which it can be shown that the posterior probability on the true model converges to one as n → ∞ even when the number of candidate predictors increases much faster than n. In order for the asymptotic analysis to reflect the high dimensionality, it is important to allow p to grow with n. There has been some consideration of increasing p asymptotics in Bayesian parametric models. Castillo and van der Vaart (2012) study concentration of the posterior distribution in the normal means problem. Armagan et al. (2013) provide conditions for consistency in high-dimensional normal linear regression with shrinkage priors on the coefficients. Jiang (2007) studies convergence rates of the predictive distribution resulting from Bayesian model averaging in generalized linear models with high-dimensional predictors. These approaches do not consider consistency of model selection or semiparametric settings.

This article proposes a practical, useful and general methodology for Bayesian variable selection in semiparametric linear models (1), while providing basic theoretical support by showing Bayes factor and variable selection consistency. We also extend our approach and theory to increasing model dimensions involving p ≫ n candidate predictors while making sparsity assumptions on the true model. Our approach relies on placing a Dirichlet process (DP, Ferguson, 1972) location mixture of Gaussians (Lo, 1984) prior on the residual density f, inducing clustering of subjects. We introduce a prior on the coefficients β_γ specific to each model γ, which generalizes mixtures of g-priors (Zellner and Siow, 1980; Liang et al., 2008) to include cluster allocation indices induced through the Dirichlet process. The formulation leads to a straightforward implementation via a stochastic search variable selection (SSVS) algorithm (George and McCulloch, 1997).

Section 2 develops the proposed framework. Section 3 considers asymptotic properties. Section 4 contains simulation results. Section 5 applies the approach to a type 2 diabetes data example, and the proofs of Theorems are contained in the Appendix.

2. MIXTURES OF SEMIPARAMETRIC g-PRIORS

2.1 Model Formulation

In this section, we propose a new class of priors for Bayesian variable selection in linear regression models with an unknown residual density characterized via a Dirichlet process (DP) location mixture of Gaussians. In particular, let

\begin{array}{l} y_{i} = x_{γ, i}^{'} β_{γ} + ε_{i}, ε_{i} ~ f, i = 1, \dots, n, \\ f (\cdot) = \int N (\cdot; α, τ^{- 1}) d P (α), P ~ D P (m P_{0}), P_{0} = N (0, τ^{- 1}), \end{array}

(2)

where x_γ_,i is the ith row of X_γ and does not include an intercept as we do not restrict f to have zero mean, and f is a density with respect to Lebesgue measure on ℜ. We address uncertainty in subset selection by placing a prior on γ, while the prior on β_γ characterizes prior knowledge of the size of the coefficients for the selected predictors.

The DP mixture prior on the density f induces clustering of the n subjects into k groups/subclusters, where k is random and each group has a distinct intercept in the linear regression model. Let A denote an n × k allocation matrix, with A_ij = 1 if the ith subject is allocated to the jth cluster and 0 otherwise. The jth column of A then sums to n_j, the number of subjects allocated to subcluster j, with $\sum_{j = 1}^{k} n_{j} = n$ . Following Kyung, Gill and Casella (2009), conditionally on the allocation matrix A, (2) can be represented as a linear model with random intercepts

Y^{n} = A η + X_{γ} β_{γ} + ε, η \sim N (0, τ^{- 1} I_{k}), ε ~ N (0, τ^{- 1} I_{n}),

(3)

where A is random with a certain prior probability given by the coefficients in the summation of the likelihood expression (8) and the response and predictors are centered prior to analysis. In the special case in which A = 1_n, the model reduces to a linear regression model with a common intercept η and Gaussian residuals. In this case, the conditional posterior for η given A = 1_n is $N (\frac{1}{1 + n} \sum_{i = 1}^{n} (y_{i} - x_{γ, i} β_{γ}), \sqrt{\frac{1}{τ (1 + n)}})$ , which has realizations increasingly concentrated at zero as n increases.

We would like the prior on the regression coefficients to retain the essential elements of Zellner’s g-prior (Zellner, 1986), while being suitably adapted to the semiparametric case. To this effect, we propose a mixture of semi-parametric g-priors constructed to scale the covariance matrix in Zellner’s g-prior to reflect the clustering phenomenon as follows:

π (β_{γ}) = N (0, g τ^{- 1} {(X_{γ}^{'} \sum_{A}^{- 1} X_{γ})}^{- 1}), \sum_{A} = I_{n} + {AA}^{'}, g ~ π (g) .

(4)

Prior (4) inherits advantages of previous mixtures of g-priors including computational efficiency in computing marginal likelihoods (conditional on A) and robustness to mis-specification of g. The prior can be interpreted as having arisen from the analysis of a conceptual sample generated using a scaled design matrix $\sum_{A}^{- 1 / 2} X_{γ}$ , reflecting the clustering phenomenon due to the DP kernel mixture prior. Moreover, the proposed prior leads to Bayes factor and variable selection consistency in semi-parametric linear models (2) as we will show.

Note that $(X_{γ}^{'} \sum_{A}^{- 1} X_{γ}) = (X_{γ}^{'} X_{γ}) - X_{γ}^{'} A {(I_{k} + A^{'} A)}^{- 1} A^{'} X_{γ} < (X_{γ}^{'} X_{γ})$ , so ${(X_{γ}^{'} \sum_{A}^{- 1} X_{γ})}^{- 1} > {(X_{γ}^{'} X_{γ})}^{- 1}$ , implying that the prior variance of β_γ conditional on (g, τ) is higher for the semi-parametric g–prior as compared to the traditional g–prior for any allocation matrix A. To assess the influence of A on the prior for β_γ, we did simulations which revealed that for fixed (n, p), var(β_γl) increases but the cov(β_γl, β_γl_′) decreases as the number of underlying subclusters in the data increase (l′, l = 1, …, p, l′ ≠ l). This suggests that as the number of groups in A increase, the components of β_γ are likely to be more dispersed with decreasing association between each other.

2.2 Bayes Factor in Semiparametric Linear Models

In studying asymptotic properties of our proposed approach, we follow standard practice in Bayesian model selection, and assume that the data Yⁿ = (y₁, …, y_n)′ arise from one of the models in the list under consideration. This true model is denoted Inline graphic as defined in equation (5). For pairwise comparison, we evaluate the evidence in favor of compared to an alternative model using the Bayes factor, where

\begin{array}{l} M_{1} : Y^{n} = X_{γ_{1}} β_{γ_{1}} + ε_{1}, ε_{1 i} ~ f \\ M_{2} : Y^{n} = X_{γ_{2}} β_{γ_{2}} + ε_{2}, ε_{2 i} ~ f \\ f (\cdot) = \int N (\cdot; α, τ^{- 1}) d P (α), P ~ D P (m P_{0}), P_{0} = N (0, τ^{- 1}) \\ β_{γ_{j}} ~ π (β_{γ_{j}}), j = 1, 2, π (τ^{- 1}) \propto 1 / τ^{- 1}, g ~ π (g), \end{array}

(5)

where γ_j ∈ Γ indexes models of dimension p_j and π(β_γ_j) is defined in (4), j = 1, 2. Our prior specification philosophy is similar to the one adopted by Guo and Speckman (2009) for normal linear models, in that we assign proper priors on all elements of both β_γ₁, β_γ₂ conditional on (g, τ⁻¹), and an improper prior on τ⁻¹ for a more objective assessment. However unlike Guo and Speckman (2009), our focus is on Bayesian variable selection in semi-parametric linear models.

Note that the conditional likelihood of the response after marginalizing out η in (3) is L(Yⁿ|A, β_γ, τ⁻¹) = N (X_γβ_γ, τ⁻¹Σ_A) (Kyung et. al., 2009). Thus conditional on A and under the DP mixture of Gaussians prior on f, Inline graphic in (5) reduces to the normal linear model:

\sum_{A}^{- 1 / 2} Y^{n} = Z_{A} = {\tilde{X}}_{A, γ_{j}} β_{γ_{j}} + ε, ε ~ N (0, τ^{- 1} I_{n}), π (β_{γ_{j}}) = N (0, g τ^{- 1} {({\tilde{X}}_{A, γ_{j}}^{'} {\tilde{X}}_{A, γ_{j}})}^{- 1}),

(6)

where ${\tilde{X}}_{A, γ_{j}} = \sum_{A}^{- 1 / 2} X_{γ_{j}}$ . Under a mixture of semi-parametric g-priors, we can directly use expression (17) in Guo and Speckman (2009) to obtain (conditional on A) for j = 1, 2

L (Z_{A} ∣ M_{j}) \equiv L (Y^{n} ∣ A, M_{j}) \propto {(Z_{A}^{'} Z_{A})}^{- n / 2} \int_{0}^{\infty} {(1 + g)}^{- p_{j} / 2} {[1 - \frac{g}{1 + g} \frac{Z_{A}^{'} {\tilde{H}}_{A, j} Z_{A}}{Z_{A}^{'} Z_{A}}]}^{- n / 2} π (d g),

(7)

where ${\tilde{H}}_{A, j} = {\tilde{X}}_{A, γ_{j}} {({\tilde{X}}_{A, γ_{j}}^{'} {\tilde{X}}_{A, γ_{j}})}^{- 1} {\tilde{X}}_{A, γ_{j}}^{'}$ .

Also, marginalizing over all possible subcluster allocations for a given sample size n, the following marginal likelihood can be obtained under a DP prior on f (Kyung et. al., 2009):

L (Y^{n} ∣ M_{j}) = \frac{Γ (m)}{Γ (m + n)} \sum_{k = 1}^{n} m^{k} \sum_{A \in A_{k}} \prod_{i = 1}^{k} Γ (n_{i}) L (Y^{n} ∣ A, M_{j}) = \sum_{A_{l} \in C_{n}} w_{l} L (Y^{n} ∣ A_{l}, M_{j}),

(8)

where Inline graphic is the collection of all possible n×k matrices corresponding to different allocations of n subjects into k subclusters, and is the collection of all possible allocation matrices for a sample size n with w_l = 1. In the limiting case as n → ∞, we have as the class of limiting allocation matrices. Further using (7), the Bayes factor in favor of Inline graphic conditional on the allocation matrix A is given by

B F_{21, A}^{n} = \frac{L (Z_{A} ∣ M_{2})}{L (Z_{A} ∣ M_{1})} = \frac{\int_{0}^{\infty} {(1 + g)}^{- p_{2} / 2} {[1 - \frac{g}{1 - g} {\tilde{R}}_{A, 2}^{2}]}^{- n / 2} π (d g)}{\int_{0}^{\infty} {(1 + g)}^{- p_{1} / 2} {[1 - \frac{g}{1 + g} {\tilde{R}}_{A, 1}^{2}]}^{- n / 2} π (d g)},

(9)

where ${\tilde{R}}_{A, j}^{2} = Z_{A}^{'} {\tilde{H}}_{A, j} Z_{A} / Z_{A}^{'} Z_{A}$ , (j = 1, 2). Finally using (8), the unconditional Bayes factor in favor of Inline graphic marginalizing out A is

B F_{21}^{n} = \frac{L (Y^{n} ∣ M_{2})}{L (Y^{n} ∣ M_{1})} = \frac{\sum_{A_{l} \in C_{n}} w_{l} L (Z_{A_{l}} ∣ M_{2})}{\sum_{A_{l} \in C_{n}} w_{l} L (Z_{A_{l}} ∣ M_{1})} .

(10)

2.3 Posterior Computation

We propose a MCMC algorithm for posterior computation for model (2), which combines a stochastic search variable selection algorithm or SSVS (George and McCulloch, 1997) with recently proposed methods for efficient computation in DP mixture models. In particular, we utilize the slice sampler of Walker (2007) incorporating the modification of Yau et al.(2011). Using Sethuraman’s (1994) stick-breaking representation, let

P = \sum_{j = 1}^{\infty} w_{j} η_{j}, η_{j} ~ N (0, τ^{- 1}), w_{j} = ν_{j} \prod_{l < j} (1 - ν_{l}), ν_{l} ~ Beta (1, m) .

(11)

The slice sampler of Walker (2007) relies on augmentation with uniform latent variables, which allows us to move from an infinite summation for P in (11) to a finite sum given the uniform latent variable. In particular,

f_{w, η} (y ∣ u) \propto \sum_{j \in B_{w} (u)} N (y ∣ η_{j}), B_{w} (u) = {j : w_{j} > u} is a finite set, u ~ U (0, 1) .

For the DP precision parameter, we specify the hyperprior m ~ Ga(a_m, b_m) for greater flexibility. We specify a Ga(a_τ, b_τ) prior on τ and Be(a₁, b₁) prior on Pr(γ_l = 1) for implementing SSVS, l = 1, …, p. We choose π(g) as the hyper-g prior with a = 4 and use the fact that $\frac{g}{1 + g} ~ B e (1, 1)$ to sample g using a griddy Gibbs approach employing equally spaced quantiles. Inverting the n × n matrix Σ_A in the mixtures of semiparametric g-prior in (4) does not add much to the computational burden even for large n, as we can use the closed form expression $\sum_{A}^{- 1} = I_{n} - A {(I_{k} + A^{'} A)}^{- 1} A^{'}$ , where k grows at a rate log(n) (Antoniak, 1974) and is small to moderate in most practical applications. We outline the posterior computation steps in Appendix I.

3. ASYMPTOTIC PROPERTIES

In this section we establish asymptotic properties for the proposed approach using γ₁ to index the true model Inline graphic defined in (5) and γ₂ to index an arbitrary model being compared to , with ⊂ denoting nesting of in . Before proceeding, we introduce some regularity conditions essential for the development of asymptotic theory.

(A1′)
${lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} (X_{γ_{1}}^{'} X_{γ_{1}}) β_{γ_{1}}}{n} \to b_{1} > 0$ .
(A2′)
For ⊈ , ${lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} X_{γ_{1}}^{'} H_{2} X_{γ_{1}} β_{γ_{1}}}{n} \to b_{2} \in [0, b_{1}]$ , with $H_{2} = X_{γ_{2}} {(X_{γ_{2}}^{'} X_{γ_{2}})}^{- 1} X_{γ_{2}}^{'}$ .
(A1)
For p₁ = O(n^a₁), 0 ≤ a₁ < 1, ${lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} (X_{γ_{1}}^{'} \sum_{A}^{- 1} X_{γ_{1}}) β_{γ_{1}}}{n} \to b_{A, 1} > 0$ .
(A2)
For ⊈ , ${lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} {\tilde{X}}_{A, γ_{1}}^{'} {\tilde{H}}_{A, 2} {\tilde{X}}_{A, γ_{1}} β_{γ_{1}}}{n} \to b_{A, 2}$ , where b_A_,2 ∈ [0, b_A_,1) for fixed p₁, p₂, and b_A_,2 ∈ (0, b_A_,1) for p_j = O(n^a_j) (j = 1, 2, 0 ≤ a₁ < a₂ < 1).

(A1), (A2) depend on the allocation matrix A, which is a n × k binary matrix that for large n tends to have k ≪ n, and be very sparse containing mostly zeros with sparsity increasing with column index. We also assume the following for the class of proper priors π(g) on g:
(A3)
There exists a constant k ≥ 0 such that $\int_{a_{n}}^{c_{0} a_{n}} π (d g) \approx n^{- k}$ for any constant c₀ > 1 and any sequence a_n ≈ n. Here a_n ≈ b_n implies that lim_n_→∞ a_n/b_n > 0.
(A4)
There exists a constant k_u such that k−(p₂−p₁)/2 < k_u ≤ k and $\int_{0}^{\infty} {(1 + g)}^{k_{u}} π (d g) \approx 1$ .

We state (A1′), (A2′) as the standard assumptions for establishing Bayes factor consistency in normal linear models, on which our assumptions (A1), (A2) are based. We develop asymptotic theory for semiparametric linear models (5) based on assumptions (A1)–(A4). We note that (A1) is stronger compared to (A1′), since (A1) implies (A1′) as $\sum_{A}^{- 1} = I_{n} - A {(I_{k} + A^{'} A)}^{- 1} A^{'}$ . Further, in the extreme case when A = I_n, we have $\frac{X_{γ_{1}}^{'} \sum_{A}^{- 1} X_{γ_{1}}}{n} = \frac{1}{2} \frac{X_{γ_{1}}^{'} X_{γ_{1}}}{n}$ , so that (A1′) implies (A1). Again when A =1_n, $X_{γ_{1}}^{'} \sum_{A}^{- 1} X_{γ_{1}} \approx X_{γ_{1}}^{'} X_{γ_{1}} - n {\bar{X}}_{γ_{1}}^{'} {\bar{X}}_{γ_{1}}$ for large n, for ${\bar{X}}_{γ_{1}} = 1_{n}^{'} X_{γ_{1}} / n$ . Hence $\frac{β_{γ_{1}}^{'} (X_{γ_{1}}^{'} \sum_{A}^{- 1} X_{γ_{1}}^{'}) β_{γ_{1}}}{n} \approx \frac{β_{γ_{1}}^{'} (X_{γ_{1}}^{c^{'}} X_{γ_{1}}^{c}) β_{γ_{1}}}{n}$ , where $X_{γ_{1}}^{c}$ is the centered design matrix. When ${lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} (X_{γ_{1}}^{c^{'}} X_{γ_{1}}^{c}) β_{γ_{1}}}{n} > 0$ , (A1′) implies (A1).

Assumption (A2) can be interpreted as a positive ‘limiting distance’ between the two models corresponding to design matrices X_γ₁ and X_γ₂ in (3) conditional on A, after marginalizing out η, i.e. $Δ_{21, A} = {lim}_{n \to \infty} \frac{β_{γ_{1}}^{'} {\tilde{X}}_{A, γ_{1}}^{'} (I_{n} - {\tilde{H}}_{A, 2}) {\tilde{X}}_{A, γ_{1}} β_{γ_{1}}}{n τ^{- 1}} = \frac{b_{A, 1} - b_{A, 2}}{τ^{- 1}} \in (0, \infty)$ . Such a ‘limiting distance’ (Δ_21,_A) can be considered as a natural extension of the definition of distance between two normal linear models in Casella et. al. (2009) and Moreno et. al. (2010) to models with random intercept as in (3).

Assumptions (A3), (A4) define a class of proper priors for g described in Guo and Speckman (2009). This class includes $hyper - g (\frac{a - 2}{2} {(1 + g)}^{- a / 2})$ and $hyper - g / n (\frac{a - 2}{2 n} {(1 + g / n)}^{- a / 2})$ priors with 2 < a ≤ 4 (Liang et. al. 2008), Zellner-Siow priors (Zellner and Siow, 1980) as well as beta-prime priors (Maruyama and George, 2008). It is clear that these assumptions on π(g) are satisfied by quite a few standard priors are hence are quite reasonable.

The following lemma gives the limits of quantities such as ${\tilde{R}}_{A, j}^{2} = Z_{A}^{'} {\tilde{H}}_{A, j} Z_{A} / Z_{A}^{'} Z_{A}$ , which would be useful for establishing asymptotic properties. The proof follows directly using Lemmas 1, 2 of Guo and Speckman (2009) and from (6) which essentially states that under the DP mixture of Gaussians prior on f for Inline graphic in (5) and conditional on allocation matrix A, $Z_{A} = \sum_{A}^{- 1 / 2} Y^{n} ~ N ({\tilde{X}}_{A, γ_{j}} β_{γ_{j}}, τ^{- 1} I_{n})$ , j = 1, 2.

Lemma 1

Let assumptions (A1), (A2) hold.

If ⊂ , conditional on A, ${\tilde{R}}_{A, 1}^{2} \overset{a . s .}{\to} \frac{b_{A, 1}}{τ^{- 1} + b_{A, 1}}, {\tilde{R}}_{A, 2}^{2} \overset{a . s .}{\to} \frac{b_{A, 1}}{τ^{- 1} + b_{A, 1}}$ , under
If ⊈ , conditional on A, ${\tilde{R}}_{A, 1}^{2} \overset{a . s .}{\to} \frac{b_{A, 1}}{τ^{- 1} + b_{A, 1}}, {\tilde{R}}_{A, 2}^{2} \overset{a . s .}{\to} \frac{b_{A, 2}}{τ^{- 1} + b_{A, 1}}$ , under

As shown by the following result, the proposed approach leads to Bayes factor consistency when comparing fixed dimensional models as well as models growing at the rate O(n^t), 0 < t < 1, when the truth is sparse.

Theorem I

Let assumptions (A1), (A2) hold.

Suppose p₁ and p₂ are fixed. If ⊂ , then under and assumptions (A3), (A4), $B F_{21}^{n} \overset{P}{\to} 0$ as n→ ∞ and if p₂ − p₁ > 2 + 2(k − k_u), $B F_{21}^{n} \overset{a . s .}{\to} 0$ as n → ∞. Further, if ⊈ , then under and assumption (A3), $B F_{21}^{n} \overset{a . s .}{\to} 0$ as n → ∞.
Suppose p_j is growing at the rate O(n^a_j), j=1,2, with 0 ≤ a₁ < a₂ < 1. Then under and assumption (A3), $B F_{21}^{n} \overset{a . s .}{\to} 0$ as n → ∞.

REMARK 1

Although we omit the proof here, Theorem I can be modified to accommodate the case of improper priors on g (i.e. $π (g) \propto \frac{1}{1 + g}$ ). In such a case, assumptions (A3), (A4) are excluded and we require p₂ − p₁ ≥ 3 for a.s. convergence in (I) for Inline graphic ⊂ .

The next result establishes model selection consistency for the proposed approach, even in cases when the cardinality of the model space increases with n. In particular, we consider cases when the number of candidate predictors p_n is growing at the rate O(n^a), a > 0, but the prior on the model space assigns zero probability to models growing at a rate equal to or faster than n. When a ≥ 1, the prior support consists of models constructed using O(n^t) (0 ≤ t < 1) sized subsets of p_n = O(n^a) candidate predictors.

To elaborate, let the support of the prior on the model space be Inline graphic = ∪ , where is the set of all (non-null) models γ such that there exists a sample size n₀ < ∈ for which γ_j = 0 for all j > p_n₀, and is the set of all models with dimensions growing at a rate strictly less than n, $M_{I} = {γ : \sum_{j = 1}^{p_{n}} γ_{j} = O (n^{t}), 0 < t < 1}$ . Letting p₀ = max{j: γ ∈ Inline graphic , γ_j = 1}, we can discard predictors having a higher index than p₀ for all γ ∈ and treat as finite dimensional having 2^p₀ − 1 elements (excluding the null model). Let γ_jl denote the lth model having dimension p_j. Consider the following sequence of priors which penalizes models with increasing dimensions, thus encouraging sparsity:

π^{n} (γ_{j l}) \propto π^{p_{j}} {(1 - π)}^{p_{0} - p_{j}} I [γ_{j l} \in M_{F}] + {(\begin{matrix} p_{n} \\ p_{j} \end{matrix})}^{- 1} I [γ_{j l} \in M_{I}], π ~ B e (a_{1}, b_{1}) .

(12)

When the truth is sparse such that Inline graphic ∈ , we have the following result.

Theorem II

Suppose assumptions (A1)–(A4) hold. For fixed p and under $P (M_{1} ∣ Y^{n}) \overset{P}{\to} 1$ for any prior on Γ with π( Inline graphic ) > 0. When p_n = O(n^a) (a > 0) and $P (M_{1} ∣ Y^{n}) \overset{P}{\to} 1$ under , for πⁿ(γ_jl) defined as in (12).

4. SIMULATION STUDY

We present the results of two simulation studies comparing our method (SLM) with the normal linear model (NLM) having $β_{γ} ~ N (0, g τ^{- 1} {(X_{γ}^{'} \sum_{A = 1_{n}}^{1} X_{γ})}^{- 1})$ (designed to assign comparable prior information when the residual is Gaussian), the lasso (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005), as well as robust variable selection methods including an MM-type regression estimator (Yohai, 1987; Koller and Stahel, 2011), and a median regression model with SSVS for variable selection (Yu et al., 2013). The data is generated as follows:

\begin{array}{l} Case I : & y_{i} = x_{i} β_{T} + ε_{i}, ε_{i} ~ 0.5 N (2.5, 1) + 0.5 N (- 2.5, 1), \\ Case II : & y_{i} = 1 + x_{i} β_{T} + ε_{i}, ε_{i} ~ N (0, 1), i - 1, \dots, n, \end{array}

where x_i is a ten dimensional predictor (p=10), with x_ij, j = 1, …, 10 generated independently from U(−1,1), and β_T = (3, 2, −1, 0, 1.5, 1, 0, −4, −1.5, 0).

We used Ga(0.1, 1) prior on the DP precision parameter and Be(0.1, 1) prior on P(γ_j = 1), j=1,…,p, which corresponds to a weakly informative prior favoring parsimony. We update g using the griddy Gibbs approach having 1000 equally spaced quantiles for $\frac{g}{1 + g} ~ B e (1, 1)$ corresponding to a = 4 in the hyper-g prior. For both SLM and NLM, we ran 50,000 iterations with a burn in of 5,000. We implemented the lasso (L1) and elastic net (EL) using the GLMNET package in R with default settings, while the MM-type estimator (LMR) was implemented using ‘lmrob’ function in ‘robustbase’ package in R and the median regression with SSVS (QR) was implemented using function ‘SSVSquantreg’ in ‘MCMCpack’ package in R, with a Be(0.1, 1) prior on the prior inclusion probability for predictors. All results are summarized across 20 replicates. The computation time for SLM per iteration was marginally slower than NLM. The mixing for the fixed effects was good under both the methods. The results for SLM do not appear to be sensitive to the hyper-parameters in π(m), but are mildly sensitive to hyper-parameters in π(g) for n = 100.

We study the marginal inclusion probabilities (MIP) under SLM and NLM over varying sample sizes in Figures 1 and 2. These plots suggest a faster rate of increase of the MIP for the important predictors under SLM as compared to NLM when the true residuals are non-Gaussian, and a very similar rate of increase under both methods when the true residuals are Gaussian (thus justifying the prior choice for NLM). In contrast, the exclusion probabilities for the unimportant predictors converge to one slowly under both the methods, reflecting the well known tendency for slower accumulation of evidence in favor of the true null.

Marginal Inclusion Probabilities (MIP) over varying sample sizes: Truth generated from bimodal residual. Solid lines - Semi-parametric Linear Model, dashed lines - Normal Linear Model.

Marginal Inclusion Probabilities (MIP) over varying sample sizes: Truth generated from Gaussian residual. Solid lines - Semi-parametric Linear Model, dashed lines - Normal Linear Model.

Tables 1 and 2 present some summaries for n = 100 for Case I. The MIPs in Table I suggests correct variable selection decision by SLM, but poor performance by NLM which fails to exclude any of the unimportant predictors under median probability model. Further, L1, EL and QR seem to favor an overly complex model by choosing a superset of important predictors. In terms of estimation of the fixed effects, SLM has the highest degree of accuracy as reflected by the smallest mean square error ( $\frac{{‖ \hat{β} - β_{T} ‖}_{2}}{p}$ ) in Table 2, where β_T is the vector of true regression coefficients. In addition, the replicate average mean square error for out of sample prediction for a test sample size of 25 (Table 2) is smallest under the SLM, followed by lasso and elastic net. NLM is seen to be clearly inadequate for prediction purposes as indicated by the extremely high out of sample predictive MSE. Thus in conclusion, when the true residual is non-Gaussian, the SLM has the best performance compared to competitors, whereas NLM performs poorly in general.

Table 1.

Fixed effects estimates and marginal inclusion probabilities (MIP) for fixed effects for Case I when n=100.

β_T	MIP_SLM	β_SLM	MIP_NLM	β_SLM	β_L1	β_EL	β_LMR	β_QR
3	1.00	2.88(2.34, 3.41)	1.00	2.83(1.86, 3.81)	3.08	3.08	3.15	2.92
2	0.99	1.89(1.34, 2.44)	0.98	1.95(0.96, 2.91)	2.06	2.06	2.11	1.84
−1	0.93	−0.91(−1.46, −0.36)	0.75	−0.78(−1.75,0.03)	−0.98	−0.98	−0.87	−0.78
0	0.45	−0.01(−0.44,0.44)	0.53	0.006(−0.82, 0.81)	0.01	0.009	−0.003	−0.02
1.5	0.98	1.43(0.89,1.98)	0.90	1.35(0.35, 2.35)	1.54	1.54	1.57	1.29
1	0.90	0.79(0.28, 1.35)	0.68	0.54(−0.26, 1.48)	0.74	0.74	0.66	0.42
0	0.43	−0.005(−0.44, 0.42)	0.53	−0.05(−0.85, 0.73)	−0.04	−0.04	−0.09	−0.06
−4	1.00	−3.89(−4.43, −3.33)	1.00	−3.75(−4.74, −2.74)	−4.05	−4.04	−4.14	−3.95
−1.5	0.99	−1.54(−2.08, −0.98)	0.92	−1.43(−2.41, −0.41)	−1.57	−1.57	−1.54	−1.30
0	0.42	0.008(−0.43, 0.43)	0.54	−0.12(−0.93, 0.64)	−0.12	−0.12	−0.06	−0.14

Open in a new tab

SLM: Semi-parametric linear model, NLM: Normal linear model, L1: Lasso, EL: Elastic Net, LMR: MM-type estimator, QR: Median regression with SSVS.

Table 2.

Summaries for Case I when n=100.

Measure	SLM	NLM	L1	EL	LMR	QR
MSE around β_T	0.07	0.21	0.24	0.24	0.40	0.50
MSE for out of sample prediction	7.70	16.44	8.33	8.32	8.83	9.11

Open in a new tab

SLM: Semi-parametric linear model, NLM: Normal linear model, L1: Lasso, EL: Elastic Net, LMR: MM-type estimator, QR: Median regression with SSVS. MSE: mean square error and β_T is the vector of true regression coefficients.

5. APPLICATION TO DIABETES DATA

The prevalence of diabetes in the United States is expected to more than double to 48 million people by 2050 (Mokdad et. al., 2001). Previous medical studies have suggested that Diabetes Mellitus type II (DM II) or adult onset diabetes could be associated with high levels of total cholesterol (Brunham et. al., 2007) and obesity (often characterized by BMI and waist to hip ratio) (Schmidt et. al., 1992), as well as hypertension (indicated by a high systolic or diastolic blood pressure or both) which is twice as prevalent in diabetics compared to non-diabetic individuals (Epstein and Sowers, 1992).

We develop a comprehensive variable selection strategy for indicators of DM II in African-Americans based on data obtained from Department of Biostatistics, Vanderbilt University website. Our primary focus is to discover important indicators of DM II by modeling the continuous outcome glycosylated hemoglobin (> 7mg/dL indicates a positive diagnosis of diabetes) based on predictors such as total cholesterol (TC), stabilized glucose (SG), high density lipoprotein (HDL), age, gender, body mass index (BMI) indicator (overweight and obese with normal as baseline), systolic and diastolic blood pressure (SBP and DBP), waist to hip ratio (WHR) and postprandial time indicator (PPT) (1 if blood was drawn within 2 hours of a meal, 0 otherwise). We note that lower levels of HDL have been known to be associated with insulin resistance syndrome, often considered a precursor of DM II with a conversion rate around 30%. We also expect PPT to be a significant indicator as blood sugar levels are high up to 2 hours after a meal.

After excluding the records containing missing values, the data consisted of 365 subjects which was split into multiple training and test samples of sizes 330 and 35 respectively. The replicate averaged fixed effects estimates (multiplied by 100) for the SLM, NLM, L1, EL, LMR and QR are presented in Table 3, and the marginal inclusion probabilities (MIP) for the SLM, NLM and QR are summarized in Table 4. We also evaluate the out of sample predictive performance for each training-test split using predictive MSE in Table 5, and additionally provide the mean coverage (COV) and width (CIW) of 95% pointwise credible intervals for the predicted responses under SLM and NLM. The same values of hyper-parameters were used as in section 5. For each replicate, we randomized the initial starting points and made 100,000 runs for SLM (burn in = 20,000) and 50,000 runs for NLM (burn in = 5,000).

Table 3.

Fixed effects (times 100) for type-II diabetes example.

Predictor	β̂_SLM	β̂_NLM	β̂_L1	β̂_EL	β̂_LMR	β̂_QR
TC	0.55(0.11,0.73)	0.74(0.25,1.20)	0.75	0.75	0.29	0.01
SG	2.11(1.75,2.48)	2.82(2.5,3.15)	2.83	2.82	2.99	3.23
HDL	−0.50(−1.4,0.015)	−0.36(−1.61,0)	−1.02	−1.02	−0.42	0
Age	0.34(−0.06,1.3)	0.98(0,2.35)	1.19	1.19	0.57	0.04
Gender	−3.72(−30.12,4.39)	−1.53(−25.46,3.22)	−19.66	−19.81	−7.87	−0.86
BMI(overwt)	1.55(−9.43,24.03)	2.04(−3.33,29.53)	4.33	4.27	15.12	1.84
BMI(obese)	−0.74(−20.33,13.44)	−0.91(−21.93,6.14)	−14.88	−15.03	8.16	0.62
SBP	0.53(0,1.35)	0.03(−0.13,0.65)	0.25	0.25	0.56	0.009
DBP	−0.03(−0.99,0.69)	0(−0.45,0.45)	0.018	0.017	−0.55	0.002
WHR	224.27(67.72,381.88)	3.16(−44.74,91.4)	90.47	91.53	90.79	129.23
PPT	21.42(1.89,57.49)	33.04(0,80.39)	47.31	47.32	37.55	18.99

Open in a new tab

Table 4.

Marginal Inclusion Probabilities for SLM, NLM, QR in type-II diabetes data.

Predictor	TC	SG	HDL	Age	Gender	BMI(overwt)	BMI(obese)	SBP	DBP	WHR	PPT
MIP_SLM	0.97	1.00	0.64	0.43	0.17	0.15	0.22	0.72	0.23	0.93	0.64
MIP_NLM	0.98	1.00	0.39	0.67	0.12	0.13	0.11	0.14	0.10	0.13	0.68
MIP_QR	0.02	1.00	0.002	0.03	0.08	0.10	0.08	0.01	0.004	0.71	0.42

Open in a new tab

Table 5.

Out of Sample Prediction.

Replicate	S 1	S 2	S 3	S 4	S 5	S 6	S 7	S 8
MSE_SLM	1.25	1.24	1.55	1.21	1.45	1.47	3.44	1.23
MSE_NLM	1.23	1.33	1.74	1.29	1.14	1.46	3.43	1.52
MSE_L₁	1.28	1.45	2.49	2.34	1.13	1.45	3.47	1.75
MSE_EL	1.29	1.47	2.51	2.36	1.14	1.45	3.48	1.75
MSE_LMR	2.23	1.21	2.15	1.02	1.09	1.36	4.06	1.69
MSE_QR	1.82	1.91	2.64	1.15	1.64	2.68	3.98	2.44
Cov_SLM	100.00	97.14	100.00	97.14	100.00	100.00	91.42	100.00
Cov_NLM	97.12	97.14	94.28	97.14	100.00	97.14	91.42	100.00
CIW_NLM	5.92	5.41	5.84	5.94	5.93	5.91	5.59	5.90
CIW_SLM	6.93	6.16	6.80	6.81	6.84	6.86	6.13	6.77

Open in a new tab

MSE: out of sample predictive mean square error, Cov: 95% credible interval coverage, CIW: 95% credible interval width.

It is interesting to note from Table 4 that the variable selection decisions under SLM (using median probability model) are quite different compared to the NLM. In particular, while both the models successfully identify total cholesterol, stabilized glucose and postprandial time as important predictors, it is only the SLM which identifies systolic blood pressure (MIP = 0.72), HDL (MIP = 0.64) and waist to hip ratio (MIP = 0.93) as important indicators, compared to NLM which assigns MIP = 0.14, 0.39 and 0.13 to these three predictors respectively. Age is identified as an important predictor under NLM (MIP = 0.67), but not under SLM (MIP=0.43). For both the methods, the MIPs for BMI (overweight and obese) were low, which could potentially be attributed to adjusting for the other obesity factors such as waist to hip ratio. From Tables 3 and 4, we also see that the lasso, elastic net and the MM-type estimator select an overly complex model by excluding minimal number of predictors, while the quantile regression with SSVS fails to include several important predictors and selects a highly parsimonious and inadequate model.

Variable selection in this application is clearly influenced by the assumptions on the residual density, with the nonparametric residual density providing a more realistic characterization that should lead to a more accurate selection of the important predictors. Figure 3 shows an estimate of the residual density obtained from the SLM analysis, suggesting a uni-modal right skewed density with a heavy right tail. The SLM results suggest that a mixture of two Gaussians provides an adequate characterization of this density. The computation time for SLM is only marginally slower than NLM, and in addition SLM exhibits good mixing for most of the fixed effects (Table 6). These results are robust to SSVS starting points, and consistency in the results across training-test splits also indirectly suggests adequate computational efficiency of SSVS.

Residual density for Type II Diabetes study under Semi-parametric Linear Model.

Table 6.

Auto-correlations across lags for fixed effects in type-II diabetes data.

Predictor	Lag 1		Lag 5		Lag 10		Lag 25		Lag 50
	SLM	NLM	SLM	NLM	SLM	NLM	SLM	NLM	SLM	NLM
TC	0.22	0.18	0.113	0.194	0.073	0.159	0.032	0.111	0.013	0.059
SG	0.59	0.06	0.386	0.038	0.285	0.022	0.14	0.009	0.06	0.016
HDL	0.19	0.02	0.081	0.012	0.041	0.013	0.01	0.021	0.0005	−0.006
Age	0.21	0.04	0.072	0.009	0.053	−0.0001	0.025	0.006	0.007	−0.014
Gender	0.06	−0.007	0.030	0.0003	0.013	−0.006	0.009	−0.014	0.005	0.019
BMI(overwt)	0.02	−0.002	0.01	−0.006	0.006	0.013	−0.006	0.009	0.0014	0.018
BMI(obese)	0.02	0.002	0.017	0.004	0.004	0.018	0.007	−0.003	0.000	0.000
SBP	0.29	0.0711	0.137	0.019	0.096	0.007	0.047	0.03	0.014	0.022
DBP	0.07	0.0239	0.021	0.019	0.019	0.031	0.009	−0.003	0.004	−0.012
WHR	0.44	0.0642	0.353	0.043	0.321	0.061	0.251	0.06	0.186	−0.003
PPT	0.22	0.0600	0.118	0.047	0.068	0.045	0.015	0.004	−0.002	0.019

Open in a new tab

In terms of out of sample predictive MSE (Table 5), the relative performance between SLM, NLM, L1 and EL vary across training-test splits so that none of the models can be said to dominate the others, while LMR and QR produce relatively inferior prediction results. Overall, the NLM has narrower 95% pointwise credible intervals compared to SLM, often resulting in poorer coverage for out of sample predictions. In conclusion, SLM succeeds in choosing the most reasonable model for DM II, consistent with previous medical evidence, and compares favorably with other competitors for prediction purposes.

Acknowledgments

This work was supported by Award Number R01ES017240 from the National Institute of Environmental Health Sciences. The authors thank the referees and the associate editor for their valuable comments.

APPENDIX A: PROOF OF RESULTS

Proof of Theorem I

Using similar methods as in the proof of Theorem 2 in Guo and Speckman (2009), it can be shown that conditional on A and assumptions (A3) and (A4), the upper and lower bounds of $L (Y^{n} ∣ A, M_{1}) = \int_{0}^{\infty} {(1 + g)}^{- p_{1} / 2} {[1 - \frac{g}{1 + g} {\tilde{R}}_{A, 1}^{2}]}^{- n / 2} π (d g)$ are

\begin{array}{l} L (Y^{n} ∣ A, M_{1}) \leq {(\frac{p_{1} + 2 k_{u}}{n - p_{1} - 2 k_{u}})}^{p_{1} / 2 + k_{u}} {(\frac{1 - {\tilde{R}}_{A, 1}^{2}}{{\tilde{R}}_{A, 1}^{2}})}^{p_{1} / 2 + k_{u}} {(\frac{n}{n - p_{1} - 2 k_{u}})}^{- n / 2} {(1 - {\tilde{R}}_{A, 1}^{2})}^{- n / 2} \\ \approx {(\frac{p_{1} + 2 k_{u}}{n - p_{1} - 2 k_{u}})}^{p_{1} / 2 + k_{u}} {(\frac{1 - {\tilde{R}}_{A, 1}^{2}}{{\tilde{R}}_{A, 1}^{2}})}^{p_{1} / 2 + k_{u}} {(1 - {\tilde{R}}_{A, 1}^{2})}^{- n / 2} = U_{A, 1} (n), \end{array}

and $L (Y^{n} ∣ A, M_{1}) \geq n^{- p_{1} / 2 - k} {(1 - {\tilde{R}}_{A, 1}^{2})}^{- n / 2} = L_{A, 1} (n)$ . Similarly, $L_{A, 2} (n) \leq L (Y^{n} ∣ A, M_{2}) = \int_{0}^{\infty} {(1 + g)}^{- p_{2} / 2} {[1 - \frac{g}{1 + g} {\tilde{R}}_{A, 2}^{2}]}^{- n / 2} π (d g) \leq U_{A, 2} (n)$ . Therefore, ${BF}_{21, A}^{n} \leq \frac{U_{A, 2} (n)}{L_{A, 1} (n)}$

= {(\frac{p_{2} + 2 k_{u}}{n - p_{2} - 2 k_{u}})}^{p_{2} / 2 + k_{u}} {(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{{\tilde{R}}_{A, 2}^{2}})}^{p_{2} / 2 + k_{u}} {(1 - {\tilde{R}}_{A, 2}^{2})}^{- n / 2} / (n^{- p_{1} / 2 - k} {(1 - {\tilde{R}}_{A, 1}^{2})}^{- n / 2}) .

(13)

Case (I): For fixed p_j (j = 1, 2) and large n, ${BF}_{21, A}^{n} \leq ζ (A, n) = n^{\frac{p_{1} - p_{2}}{2} + k - k_{u}} {(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{1 - {\tilde{R}}_{A, 1}^{2}})}^{- n / 2}$ , ignoring terms independent of n. Using the results in proof of Theorems 2, 3 in Guo and Speckman (2009), we can show that $n ζ (A, n) \overset{a . s .}{\to} 0$ when Inline graphic ⊈ . Again for ⊂ , using results in the aforementioned proofs, we have ${(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{1 - {\tilde{R}}_{A, 1}^{2}})}^{- n / 2} = O_{P} (1)$ . Further for ⊂ when p₂ − p₁ > 2 + 2(k − k_u), we have $n^{δ} ζ (A, n) \overset{a . s .}{\to} 0$ , where δ > 0 is such that i ≤ 2(k − k_u) + 2δ < i + 1 when i ≤ 2(k − k_u) < i + 1. This implies that for large enough n,

\begin{array}{l} for M_{1} \subset M_{2} : ζ (A, n) = n^{- \frac{p_{2} - p_{1}}{2} + (k - k_{u})} in probability, \\ for M_{1} \subset M_{2} : ζ (A, n) \leq n^{- δ} almost surely when p_{2} - p_{1} > 2 + 2 (k - k_{u}), \\ for M_{1} ⊈ M_{2} : ζ (A, n) \leq n^{- 1} almost surely . \end{array}

(14)

Then for large enough n, we have,

\begin{array}{l} {BF}_{21, A}^{n} \leq ζ (A, n) \Leftrightarrow L (Y^{n} ∣ A, M_{2}) \leq ζ (A, n) L (Y^{n} ∣ A, M_{1}) \\ \Rightarrow L (Y^{n} ∣ M_{2}) \leq \sum_{A_{l} \in C_{n}} w_{l} ζ (A_{l}, n) L (Y^{n} ∣ A_{l}, M_{1}) \leq ζ^{*} (n) L (Y^{n} ∣ M_{1}), \end{array}

(15)

where ζ^*(n) is the LHS in equations (14) which is independent of A, and ζ^*(n) → 0 as n → ∞ (using (A4)). Dividing both sides of (15) by L(Yⁿ| Inline graphic ), we have ${BF}_{21}^{n} \leq ζ^{*} (n) \to 0$ as n → ∞.

Case (II): For increasing dimensions p₁ = O(n^a₁), p₂ = O(n^a₂) with 0 ≤ a₁ < a₂ < 1, we will only assume (A3) for g ~ π(g) so that k_u = 0. We have using (13)

B F_{21, A}^{n} \leq ζ (A, n) n^{p_{1} / 2 - (1 - a_{2}) p_{2} / 2 + k} {(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{{\tilde{R}}_{A, 2}^{2}})}^{p_{2} / 2} {(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{1 - {\tilde{R}}_{A, 1}^{2}})}^{- n / 2} .

(16)

Let us consider the following cases under 0 ≤ a₁ < a₂ < 1.

Case C1: Inline graphic ⊂ . We have $Q_{j} = τ (Z_{A}^{'} Z_{A} - Z_{A}^{'} {\tilde{H}}_{A, j} Z_{A}) ~ χ_{n - p_{j}}^{2} (0)$ , j=1,2, and $Q_{1} - Q_{2} = τ (Z_{A}^{'} ({\tilde{H}}_{A, 2} - {\tilde{H}}_{A, 1}) Z_{A}) ~ χ_{p_{2} - p_{1}}^{2} (0)$ . Using Lemma 1 of Guo et. al. (2009),

\frac{1 - {\tilde{R}}_{A, 1}^{2}}{1 - {\tilde{R}}_{A, 2}^{2}} = \frac{Z_{A}^{'} Z_{A} - Z_{A}^{'} {\tilde{H}}_{A, 1} Z_{A}}{Z_{A}^{'} Z_{A} - Z_{A}^{'} {\tilde{H}}_{A, 2} Z_{A}} = \frac{Q_{1}}{Q_{2}} = 1 + \frac{(Q_{1} - Q_{2}) / (p_{2} - p_{1})}{Q_{2} / (n - p_{2})} \frac{p_{2} - p_{1}}{n - p_{2}} \overset{a . s .}{\to} 1.

Moreover $(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{{\tilde{R}}_{A, 2}^{2}}) \overset{a . s .}{\to} (\frac{τ^{- 1}}{b_{A, 1}})$ under Inline graphic . Then for large n,

ζ (A, n) \approx n^{p_{1} / 2 - (1 - a_{2}) p_{2} / 2 + k} {(\frac{τ^{- 1}}{b_{A, 1}})}^{p_{2} / 2} = n^{p_{1} / 2 - (1 - a^{*} - a_{2}) p_{2} / 2 + k} {(\frac{τ^{- 1}}{n^{a^{*}} b_{A, 1}})}^{p_{2} / 2} a . s .,

where a^* > 0 is such that 0 < 1 − a^* − a₂ < 1. This implies that $n^{K^{*}} ζ (A, n) \overset{a . s .}{\to} 0$ under Inline graphic for any constant K^* > 0. Case C2: ⊈ . Using Lemma 1,

\frac{1 - {\tilde{R}}_{A, 2}^{2}}{{\tilde{R}}_{A, 2}^{2}} \overset{a . s .}{\to} \frac{τ^{- 1} + b_{A, 1} - b_{A, 2}}{b_{A, 2}} > 1, \frac{1 - {\tilde{R}}_{A, 1}^{2}}{1 - {\tilde{R}}_{A, 2}^{2}} \overset{a . s .}{\to} \frac{τ^{- 1}}{τ^{- 1} + b_{A, 1} - b_{A, 2}} < 1, under M_{1} .

For fixed τ⁻¹ and b_A_,2 > 0 (under (A2)), ${(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{{\tilde{R}}_{A, 2}^{2}})}^{p_{2} / 2} {(\frac{1 - {\tilde{R}}_{A, 2}^{2}}{1 - {\tilde{R}}_{A, 1}^{2}})}^{- n / 2} \overset{a . s .}{\to} 0$ . This implies that in the limiting case when n → ∞, we have

\begin{array}{l} for M_{1} \subset M_{2} : ζ (A, n) \leq n^{- K^{*}} almost surely, \\ for M_{1} ⊄ M_{2} : ζ (A, n) \leq n^{p_{1} / 2 - (1 - a_{2}) p_{2} / 2 + k} \leq n^{- K^{*}} almost surely, \end{array}

(17)

where K^* > 0 is a constant. Denoting the upper bounds as ζ^*(n), it is clear that ζ^*(n) is independent of A and ζ^*(n) → 0 as n → ∞ when 0 ≤ a₁ < a₂ < 1. Using similar arguments as in equation (15) of Case (I), we have ${BF}_{21}^{n} \leq n^{- K^{*}}$ and consistency follows.

Proof of Theorem II

Given the assumptions (A1)–(A4), Bayes factor consistency holds under the different cases elaborated in Theorem I. For fixed p, the proof follows trivially using Bayes factor consistency. For increasing p_n = O(n^a) (a > 0), our prior is $π^{n} (γ_{j l}) \propto π^{p_{j}} {(1 - π)}^{p_{0} - p_{j}} I [γ_{j_{l}} \in M_{F}] + N_{j}^{- 1} I [γ_{j_{l}} \in M_{I}]$ , where π ~ Be(a₁, b₁) and $N_{j} = (\begin{array}{l} p_{n} \\ p_{j} \end{array})$ . Let W_γ denote the prior weight for γ ∈ Inline graphic after marginalizing out π under the Be(a₁, b₁) prior (W₁ being the weight for ). Let ${BF}_{γ 1}^{n} =$ Bayes factor between models γ and , let D = {p_γ : γ ∈ } and denote = {γ ∈ : dim(γ) = p_j}. Note that under (A1)–(A4) and ∈ , ${BF}_{γ 1}^{n} \overset{P}{\to} 0$ for all $γ \in (M_{F} \cap M_{1}^{c}) \cup M_{I}$ , using Theorem I. Also,

\begin{array}{l} P (M_{1} ∣ Y^{n}) = {[1 + W_{1}^{- 1} \sum_{γ \in M_{F} \cap M_{1}^{c}} W_{γ} B F_{γ 1}^{n} + W_{1}^{- 1} \sum_{γ \in M_{I}} N_{j}^{- 1} B F_{γ 1}^{n}]}^{- 1} \\ = {[1 + W_{1}^{- 1} \sum_{γ \in M_{F} \cap M_{1}^{c}} W_{γ} B F_{γ 1}^{n} + W_{1}^{- 1} \sum_{p_{j} \in D} \sum_{γ_{j l} \in H_{j}} N_{j}^{- 1} B F_{γ_{j l} 1}^{n}]}^{- 1} \\ \geq {[1 + ε_{0} + W_{1}^{- 1} \sum_{p_{j} \in D} \sum_{γ_{j l} \in H_{j}} N_{j}^{- 1} B F_{γ_{j l} 1}^{n}]}^{- 1}, \end{array}

where $W_{1}^{- 1} \sum_{γ \in M_{F} \cap M_{1}^{c}} W_{γ} B F_{γ 1}^{n} \leq ε_{0}$ for large enough n, and ε₀ → 0 as n → ∞ since all the individual terms in the finite summation → 0 using Theorem I. Further using (17), the upper bound of $B F_{γ_{j l} 1}^{n}$ for any γ_jl ∈ Inline graphic is given by ζ^*(n) = n^{−K^*} when n is large, where K^* > 0 is a constant. Noting that the cardinality of $H_{j} \leq N_{j} = (\begin{array}{l} p_{n} \\ p_{j} \end{array})$ , we have for large n,

\sum_{γ_{j l} \in H_{j}} N_{j}^{- 1} B F_{γ_{j l} 1}^{n} \leq n^{- K^{*}} \Rightarrow P (M_{1} ∣ Y^{n}) \geq {[1 + ε_{0} + W_{1}^{- 1} \sum_{p_{j} \in D} n^{- K^{*}}]}^{- 1}, under M_{1} .

Now note that W₁ is fixed and the cardinality of D < κ₀n for some constant κ₀ > 0. Thus it is clear that $W_{1}^{- 1} \sum_{p_{j} \in D} n^{- K^{*}} \to 0$ as n → ∞ for large K^*. The rest is straightforward.

APPENDIX B: COMPUTATIONAL STEPS FOR MCMC

The posterior computation steps are:

Step 1.1
Update the ν’s after marginalizing out the augmented uniform variable using π(ν_h|−) = Be(1 + n_h, Σ_j_>_h n_j + m), h=1, …, M, where M is the total number of clusters satisfying $\sum_{h = 1}^{M} w_{h} > 1 - min (u_{1}, \dots, u_{n})$ , with w_h = ν_h Π_l_<_h(1 − ν_l).
Step 1.2
Update u_i, i=1, …, n, from its full conditional as described in Walker (2007).
Step 2
Update the cluster membership of different subjects using f(y_i|u_i, A_ih = 1) ∝ N(y_i|η_h, x_γ_,_i, β_γ, τ⁻¹)I(h ∈ B_w(u_i)), h=1, …, M, with B_w(u_i) defined as in section 2.3.
Step 3
Update the Dirichlet process atom η_l for the l-th cluster using $π (η_{l} ∣ -) = N (\frac{\sum_{i : A_{i l} = 1} (y_{i} = x_{γ, i} β_{γ})}{1 + n_{l}}, \sqrt{\frac{1}{τ (1 + n_{l})}})$ , where $n_{l} = \sum_{i = 1}^{n} A_{i l}$ is the cardinality of the l-th cluster, l=1, …, M.
Step 4
Update the DP precision using $π (m ∣ -) = G a (a_{m} + M, b_{m} - \sum_{l = 1}^{M} log (1 - ν_{l}))$ .
Step 5
Letting $p_{γ} = \sum_{j = 1}^{p} γ_{j}$ , update precision τ using $π (τ ∣ -) = G a (a_{τ} + \frac{n + p_{γ}}{2}, b_{τ} + \frac{1}{2} {{(Y^{n} - X_{γ} β_{γ})}^{'} \sum_{A}^{- 1} (Y^{n} - X_{γ} β_{γ}) + \frac{1}{g} β_{γ}^{'} (X_{γ}^{'} \sum_{A}^{- 1} X_{γ}) β_{γ}})$ .
Step 6
Using the hyper-g prior and the fact that $\frac{g}{1 + g} ~ B e (1, 1)$ for a = 4, we can adopt the griddy Gibbs approach (Ritter and Tanner, 1992) to update g.
Step 7
Update the prior inclusion probability π = Pr(γ_j = 1) using f(π|−) = Be(a₁ + p_γ, b₁ + p − p_γ).

Step 8

Update γ_j’s one at a time by computing their posterior inclusion probabilities after marginalizing out β_γ and conditional on inclusion indicators for the remaining predictors as well as g, τ and A. Denote γ(j) and γ(−j) as the vector of current variable inclusion indicators with γ_j fixed at 1 and 0 respectively, and let p_γ₍_j₎ and p_γ₍₋_j₎ denote the corresponding vector sums. We can sample γ_j from the Bernoulli conditional posterior distribution with probabilities Pr(γ_j = 1|−) = p_j₁/(p_j₁ + p_j₀) and Pr(γ_j = 0|−) = p_j₀/(p_j₁ + p_j₀), where

\begin{array}{l} p_{j 1} = π {(1 + g)}^{- p_{γ (j)} / 2} exp {\frac{τ}{2} (\frac{g}{1 + g}) ({(Y^{n})}^{T} \sum_{A}^{- 1} X_{γ (j)} {(X_{γ (j)}^{'} \sum_{A}^{- 1} X_{γ (j)})}^{- 1} X_{γ (j)}^{'} \sum_{A}^{- 1} Y^{n})}, \\ p_{j 0} = (1 - π) {(1 + g)}^{- p_{γ (- j)} / 2} exp {\frac{τ}{2} (\frac{g}{1 + g}) ({(Y^{n})}^{T} \sum_{A}^{- 1} X_{γ (- j)} {(X_{γ (- j)}^{'} \sum_{A}^{- 1} X_{γ (- j)})}^{- 1} X_{γ (- j)}^{'} \sum_{A}^{- 1} Y^{n})} . \end{array}

Step 9
Set {β_j : γ_j = 0} = 0 and update β_γ = {β_j : γ_j = 1} using π(β_γ|−) = N(β_γ; E, V), where $V = {(\frac{τ}{g} (X_{γ}^{'} \sum_{A}^{- 1} X_{γ}) + τ (X_{γ}^{'} X_{γ}))}^{- 1}$ and $E = V (τ X_{γ}^{'} (Y^{n} - A η))$ .

Contributor Information

Suprateek Kundu, Email: sk@stat.tamu.edu.

David B. Dunson, Email: dunson@stat.duke.edu.

References

1.Antoniak CE. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics. 1974;2:11521174. [Google Scholar]
2.Armagan A, Dunson DB, Lee J, Bajwa WU, Strawn N. Posterior consistency in high-dimensional linear models. Biometrika. 2013 To appear. [Google Scholar]
3.Brunham LR, Kruit JK, Pape TD, Timmins JM, Reuwer AQ, Vasanji Z, Marsh BJ, Rodrigues B, Johnson JD, Parks JS, Verchere CB, Hayden MR. β-cell ABCA1 influences insulin secretion, glucose homeostasis and response to thiazolidinedione treatment. Nature Medicine. 2007;13:340–347. doi: 10.1038/nm1546. [DOI] [PubMed] [Google Scholar]
4.Casella G, Girón FJ, Martínez ML, Moreno E. Consistency of Bayesian procedures for variable selection. Annals of Statistics. 2009;37:1207–1228. [Google Scholar]
5.Castillo I, van der Vaart A. Needles and straws in a haystack: Posterior concentration for possibly sparse sequences. Annals of Statistics. 2012;40:2069–2101. [Google Scholar]
6.Epstein M, Sowers JR. Diabetes mellitus and hypertension. Hypertension. 1992;19:403–418. doi: 10.1161/01.hyp.19.5.403. [DOI] [PubMed] [Google Scholar]
7.Ferguson TS. A Bayesian analysis of some nonparametric problems. Annals of Statistics. 1972;1:209–230. [Google Scholar]
8.George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7(2):339–74. [Google Scholar]
9.Guo R, Speckman P. Bayes factor consistency in linear models. In the 2009 International Workshop on Objective Bayes Methodology; Philadelphia. 2009.2009. [Google Scholar]
10.Jiang W. Bayesian variable selection for high dimensional generalized linear models: Convergence rates of the fitted densities. Annals of Statistics. 2007;35:1487–1511. [Google Scholar]
11.Koller M, Stahel WA. Sharpening Wald-type inference in robust regression for small samples. Computational Statistics and Data Analysis. 2011;55:25042515. [Google Scholar]
12.Kyung M, Gill J, Casella G. Characterizing the variance improvement in linear Dirichlet random effects models. Statistics and Probability Letters. 2009;79:2343–2350. [Google Scholar]
13.Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g-priors for Bayesian Variable Selection. Journal of the American Statistical Association. 2008;103:410–423. [Google Scholar]
14.Lo AY. On a class of Bayesian nonparametric estimates: I. density estimates. Annals of Statistics. 1984;12:351–357. [Google Scholar]
15.Maruyama Y, George E. A g–prior extension for p > n. 2008 arxiv:0801.4410v1 [stat.ME] [Google Scholar]
16.Mokdad AH, Bowman BA, Ford ES, Vinicor F, Marks JS, Koplan JP. The continuing epidemics of obesity and diabetes in the United States. Journal of the American Medical Association. 2001;286:1195–1200. doi: 10.1001/jama.286.10.1195. [DOI] [PubMed] [Google Scholar]
17.O’Hara RB, Sillanpää MJ. Review of Bayesian variable selection methods: What, how and which. Bayesian Analysis. 2009;4:85–118. [Google Scholar]
18.Yu K, Chen CWS, Reed C, Dunson D. Bayesian Variable Selection in Quantile Regression. Statistics and its Interface. 2013;6:261–274. [Google Scholar]
19.Ritter C, Tanner MA. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association. 1992;87:861–868. [Google Scholar]
20.Schmidt MI, Duncan BB, Canani LH, Karohl C, Chambless L. Association of waist-hip ratio with diabetes mellitus. Strength and possible modifiers. Diabetes Care. 1992;15:912–4. doi: 10.2337/diacare.15.7.912. [DOI] [PubMed] [Google Scholar]
21.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58(1):267–288. [Google Scholar]
22.Walker S. Sampling the dirichlet mixture model with slices. Communication in Statistics - Simulation and Computation. 2007;36:45–54. [Google Scholar]
23.Yau C, Papaspiliopoulos O, Roberts G, Holmes C. Bayesian non-parametric hidden Markov models with applications in genomics. Journal of the Royal Statical Society, Series B. 2011;73(Part 1):33–57. doi: 10.1111/j.1467-9868.2010.00756.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Yohai VJ. High breakdown-point and high efficiency estimates for regression. Annals of Statistics. 1987;15:642–65. [Google Scholar]
25.Zellner A, Siow A. Bayesian Statistics: Proceedings of the First International Meeting. Valencia: University of Valencia Press; 1980. Posterior odds ratios for selected regression hypotheses; pp. 585–603. [Google Scholar]
26.Zellner A. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. 1986:233–243. [Google Scholar]
27.Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320. [Google Scholar]

[R1] 1.Antoniak CE. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics. 1974;2:11521174. [Google Scholar]

[R2] 2.Armagan A, Dunson DB, Lee J, Bajwa WU, Strawn N. Posterior consistency in high-dimensional linear models. Biometrika. 2013 To appear. [Google Scholar]

[R3] 3.Brunham LR, Kruit JK, Pape TD, Timmins JM, Reuwer AQ, Vasanji Z, Marsh BJ, Rodrigues B, Johnson JD, Parks JS, Verchere CB, Hayden MR. β-cell ABCA1 influences insulin secretion, glucose homeostasis and response to thiazolidinedione treatment. Nature Medicine. 2007;13:340–347. doi: 10.1038/nm1546. [DOI] [PubMed] [Google Scholar]

[R4] 4.Casella G, Girón FJ, Martínez ML, Moreno E. Consistency of Bayesian procedures for variable selection. Annals of Statistics. 2009;37:1207–1228. [Google Scholar]

[R5] 5.Castillo I, van der Vaart A. Needles and straws in a haystack: Posterior concentration for possibly sparse sequences. Annals of Statistics. 2012;40:2069–2101. [Google Scholar]

[R6] 6.Epstein M, Sowers JR. Diabetes mellitus and hypertension. Hypertension. 1992;19:403–418. doi: 10.1161/01.hyp.19.5.403. [DOI] [PubMed] [Google Scholar]

[R7] 7.Ferguson TS. A Bayesian analysis of some nonparametric problems. Annals of Statistics. 1972;1:209–230. [Google Scholar]

[R8] 8.George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7(2):339–74. [Google Scholar]

[R9] 9.Guo R, Speckman P. Bayes factor consistency in linear models. In the 2009 International Workshop on Objective Bayes Methodology; Philadelphia. 2009.2009. [Google Scholar]

[R10] 10.Jiang W. Bayesian variable selection for high dimensional generalized linear models: Convergence rates of the fitted densities. Annals of Statistics. 2007;35:1487–1511. [Google Scholar]

[R11] 11.Koller M, Stahel WA. Sharpening Wald-type inference in robust regression for small samples. Computational Statistics and Data Analysis. 2011;55:25042515. [Google Scholar]

[R12] 12.Kyung M, Gill J, Casella G. Characterizing the variance improvement in linear Dirichlet random effects models. Statistics and Probability Letters. 2009;79:2343–2350. [Google Scholar]

[R13] 13.Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g-priors for Bayesian Variable Selection. Journal of the American Statistical Association. 2008;103:410–423. [Google Scholar]

[R14] 14.Lo AY. On a class of Bayesian nonparametric estimates: I. density estimates. Annals of Statistics. 1984;12:351–357. [Google Scholar]

[R15] 15.Maruyama Y, George E. A g–prior extension for p > n. 2008 arxiv:0801.4410v1 [stat.ME] [Google Scholar]

[R16] 16.Mokdad AH, Bowman BA, Ford ES, Vinicor F, Marks JS, Koplan JP. The continuing epidemics of obesity and diabetes in the United States. Journal of the American Medical Association. 2001;286:1195–1200. doi: 10.1001/jama.286.10.1195. [DOI] [PubMed] [Google Scholar]

[R17] 17.O’Hara RB, Sillanpää MJ. Review of Bayesian variable selection methods: What, how and which. Bayesian Analysis. 2009;4:85–118. [Google Scholar]

[R18] 18.Yu K, Chen CWS, Reed C, Dunson D. Bayesian Variable Selection in Quantile Regression. Statistics and its Interface. 2013;6:261–274. [Google Scholar]

[R19] 19.Ritter C, Tanner MA. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association. 1992;87:861–868. [Google Scholar]

[R20] 20.Schmidt MI, Duncan BB, Canani LH, Karohl C, Chambless L. Association of waist-hip ratio with diabetes mellitus. Strength and possible modifiers. Diabetes Care. 1992;15:912–4. doi: 10.2337/diacare.15.7.912. [DOI] [PubMed] [Google Scholar]

[R21] 21.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58(1):267–288. [Google Scholar]

[R22] 22.Walker S. Sampling the dirichlet mixture model with slices. Communication in Statistics - Simulation and Computation. 2007;36:45–54. [Google Scholar]

[R23] 23.Yau C, Papaspiliopoulos O, Roberts G, Holmes C. Bayesian non-parametric hidden Markov models with applications in genomics. Journal of the Royal Statical Society, Series B. 2011;73(Part 1):33–57. doi: 10.1111/j.1467-9868.2010.00756.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Yohai VJ. High breakdown-point and high efficiency estimates for regression. Annals of Statistics. 1987;15:642–65. [Google Scholar]

[R25] 25.Zellner A, Siow A. Bayesian Statistics: Proceedings of the First International Meeting. Valencia: University of Valencia Press; 1980. Posterior odds ratios for selected regression hypotheses; pp. 585–603. [Google Scholar]

[R26] 26.Zellner A. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. 1986:233–243. [Google Scholar]

[R27] 27.Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320. [Google Scholar]

PERMALINK

Bayes variable selection in semiparametric linear models

Suprateek Kundu

David B Dunson

Abstract

1. INTRODUCTION

2. MIXTURES OF SEMIPARAMETRIC g-PRIORS

2.1 Model Formulation

2.2 Bayes Factor in Semiparametric Linear Models

2.3 Posterior Computation

3. ASYMPTOTIC PROPERTIES

Lemma 1

Theorem I

REMARK 1

Theorem II

4. SIMULATION STUDY

Figure 1.

Figure 2.

Table 1.

Table 2.

5. APPLICATION TO DIABETES DATA

Table 3.

Table 4.

Table 5.

Figure 3.

Table 6.

Acknowledgments

APPENDIX A: PROOF OF RESULTS

Proof of Theorem I

Proof of Theorem II

APPENDIX B: COMPUTATIONAL STEPS FOR MCMC

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Bayes variable selection in semiparametric linear models

Suprateek Kundu

David B Dunson

Abstract

1. INTRODUCTION

2. MIXTURES OF SEMIPARAMETRIC g-PRIORS

2.1 Model Formulation

2.2 Bayes Factor in Semiparametric Linear Models

2.3 Posterior Computation

3. ASYMPTOTIC PROPERTIES

Lemma 1

Theorem I

REMARK 1

Theorem II

4. SIMULATION STUDY

Figure 1.

Figure 2.

Table 1.

Table 2.

5. APPLICATION TO DIABETES DATA

Table 3.

Table 4.

Table 5.

Figure 3.

Table 6.

Acknowledgments

APPENDIX A: PROOF OF RESULTS

Proof of Theorem I

Proof of Theorem II

APPENDIX B: COMPUTATIONAL STEPS FOR MCMC

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases