Bayesian Variable Selection and Computation for Generalized Linear Models with Conjugate Priors

Ming-Hui Chen; Lan Huang; Joseph G Ibrahim; Sungduk Kim

doi:10.1214/08-BA323

Author manuscript; available in PMC: 2009 May 11.

Published in final edited form as: Bayesian Anal. 2008 Jul 1;3(3):585–614. doi: 10.1214/08-BA323


PMCID: PMC2680310 NIHMSID: NIHMS94914 PMID: 19436774

Abstract

In this paper, we consider theoretical and computational connections between six popular methods for variable subset selection in generalized linear models (GLM’s). Under the conjugate priors developed by Chen and Ibrahim (2003) for the generalized linear model, we obtain closed-form analytic relationships between the Bayes factor (posterior model probability), the Conditional Predictive Ordinate (CPO), the L measure, the Deviance Information Criterion (DIC), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in the case of the linear model. Moreover, we examine computational relationships in the model space for these Bayesian methods for an arbitrary GLM under conjugate priors as well as examine the performance of the conjugate priors of Chen and Ibrahim (2003) in Bayesian variable selection. Specifically, we show that once Markov chain Monte Carlo (MCMC) samples are obtained from the full model, the four Bayesian criteria can be simultaneously computed for all possible subset models in the model space. We illustrate our new methodology with a simulation study and a real dataset.

Keywords: Bayes factor, Conditional Predictive Ordinate, Conjugate prior, L measure, Poisson regression, Logistic regression

1 Introduction

Bayesian variable selection is still one of the most theoretically and computationally challenging problems encountered in practice due to issues regarding i) prior elicitation, ii) analytic evaluation of the model selection criterion, and iii) numerical computation of the criterion for all possible models in the model space. These issues have been discussed by many authors for various linear and generalized linear models including George and McCulloch (1993), Laud and Ibrahim (1995), George et al. (1996), Raftery (1996), Smith and Kohn (1996), George and McCulloch (1997), Raftery et al. (1997), Brown et al. (1998), Brown et al. (2002), Clyde (1999), Chen et al. (1999), Dellaportas and Forster (1999), Ibrahim et al. (1999), Chipman et al. (1998), Chipman et al. (2001), Chipman et al. (2003), George (2000), George and Foster (2000), Ibrahim et al. (2000), Ntzoufras et al. (2003), and Chen et al. (2003). Clyde and George (2004) present an excellent review article on Bayesian model selection and uncertainty, and give an excellent exposition of the theoretical and computational issues involved in Bayesian variable selection and Bayesian model uncertainty in general. An entire monograph devoted to Bayesian model selection is given by Lahiri (2001).

One of the important unresolved issues in Bayesian model selection and Bayesian variable selection in particular is what the analytic or empirical connections are between the various methods. For example, it is not clear what the relationship is between BIC and DIC, or DIC and the L measure, and whether one is a monotonic function of the other, and whether one can compute BIC from DIC or vice versa. A related question is that if one has MCMC samples from the full model, how can those samples be used to obtain all four Bayesian criteria mentioned above. To answer these questions, we investigate the following in this paper: (i) for the normal linear model with conjugate priors, we obtain analytic relationships between the Bayes factor, CPO, the L measure, DIC, AIC, and BIC, and (ii) for the class of GLM’s we show via the development of several theorems and identities how one can compute all of these Bayesian criteria simultaneously using only an MCMC sample from the full model.

The relationships obtained in (i) for the linear model shed light on the behavior and connections between these criteria for GLM’s. The development of (ii) above is important and useful since it establishes the computational relationships in the model space for each of the four Bayesian criteria and shows that for variable subset selection in GLM’s using the conjugate priors of Chen and Ibrahim (2003), we can compute the four Bayesian criteria for all possible 2^p subset models using only an MCMC sample from the full model with p covariates. Another important issue we examine in this paper is the performance of the conjugate priors proposed by Chen and Ibrahim (2003) in Bayesian variable subset selection. We demonstrate that these priors perform quite well in this context, and they are easy to specify and computationally feasible.

The rest of this paper is organized as follows. Section 2 gives formulas for each of the criteria under the conjugate priors of Chen and Ibrahim (2003) for GLM’s and Section 3 develops the theoretical connections between the six criteria for the normal linear model. Section 4 establishes the computational connections in the model space for the four Bayesian criteria and several key identities and theorems that are needed. Section 5 presents a detailed simulation study examining various properties of the six criteria, and Section 6 presents a real data example. We conclude the article with brief remarks in Section 7. All proofs are given in the Appendix.

2 The Method

2.1 Model and Notation

Suppose that {(x_i, y_i), i = 1, 2, …, n} are independent observations, where y_i is the response variable, and x_i = (1, x_i₁, …, x_ik)′ is a (k + 1) × 1 random vector of covariates. Let ℳ denote the model space. We enumerate the models in ℳ by m = 1, 2, …, $\mathcal{K}$, where $\mathcal{K}$ is the dimension of ℳ and model $\mathcal{K}$ denotes the full model. Also, let $β^{(\mathcal{K})} = (β_{0}, β_{1}, \dots, β_{k})^{'}$ denote the regression coefficients for the full model including an intercept, and let $x_{i}^{(m)}$ and $β^{(m)}$ denote k_m × 1 vectors of covariates and regression coefficients for model m with an intercept, and a specific choice of k_m − 1 covariates. We write $x_{i} = (x_{i}^{(m)^{'}}, x_{i}^{(- m)^{'}})^{'}$ , and $β^{(\mathcal{K})} = (β^{(m)^{'}}, β^{(- m)^{'}})^{'}$ , where $x_{i}^{(- m)}$ is x_i with $x_{i}^{(m)}$ deleted and $β^{(- m)}$ is $β^{(\mathcal{K})}$ with $β^{(m)}$ deleted.

Under model m, the generalized linear model (GLM) is assumed for [ $y_{i} ∣ x_{i}^{(m)}$ ], which has the conditional density given by

f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}, τ) = exp [a_{i}^{- 1} (τ) {y_{i} θ_{i}^{(m)} - b (θ_{i}^{(m)})} + c (y_{i}, τ)], i = 1, 2, \dots, n,

(1)

where $θ_{i}^{(m)} = θ (η_{i}^{(m)})$ is the canonical parameter, $η_{i}^{(m)} = x_{i}^{(m)^{'}} β^{(m)}$ , and τ is a dispersion parameter. The functions a, b and c determine a particular family in the class. The functions a_i(τ) are commonly of the form $a_{i} (τ) = τ^{- 1} w_{i}^{- 1}$ , where the w_i’s are known weights. For ease of exposition, we assume throughout that τ = 1 and w_i = 1, as, for example, in logistic and Poisson regression. The methods proposed here can be easily extended to the case when τ is unknown. Under this assumption, (1) can be rewritten as

f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}) = exp {y_{i} θ_{i}^{(m)} - b (θ_{i}^{(m)}) + c (y_{i})}, i = 1, 2, \dots, n .

(2)
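As a concrete reading of (2), the following minimal Python sketch evaluates the log density for the two families mentioned above, Poisson and logistic (Bernoulli) regression. It assumes the canonical link, so that θ_i = η_i; the function names and the numerical inputs are illustrative and do not come from the paper.

```python
import math

def log_density_poisson(y, theta):
    """log f(y | theta) in form (2): b(theta) = exp(theta), c(y) = -log(y!)."""
    return y * theta - math.exp(theta) - math.lgamma(y + 1)

def log_density_bernoulli(y, theta):
    """log f(y | theta) in form (2): b(theta) = log(1 + exp(theta)), c(y) = 0."""
    return y * theta - math.log1p(math.exp(theta))

# Assumption: canonical link, so theta_i = eta_i = x_i' beta.
beta = [0.2, -0.5]   # hypothetical coefficients (intercept, slope)
x_i = [1.0, 2.0]     # hypothetical covariate vector, first entry is the intercept
eta_i = sum(b_j * x_j for b_j, x_j in zip(beta, x_i))

print(log_density_poisson(3, eta_i), log_density_bernoulli(1, eta_i))
```

Only b and c change between families; the linear predictor and the exponential-family form stay the same, which is what lets the later criteria be written once for the whole class.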

2.2 Prior and Posterior

In the context of Bayesian variable selection, a prior distribution for β⁽^m⁾ needs to be specified for each model in the model space ℳ. To this end, we consider a conjugate prior for the GLM proposed by Chen and Ibrahim (2003). Under model m, the conjugate prior is of the form

\begin{array}{l} π (β^{(m)} ∣ y_{0}, a_{0}, m) \\ \propto \prod_{i = 1}^{n} exp [a_{0} {y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)})}] = exp [a_{0} {y_{0}^{'} θ^{(m)} - J^{'} b (θ^{(m)})}], \end{array}

(3)

where a₀ > 0 is a scalar prior parameter, y₀ = (y₀₁, …, y₀_n)′ is an n × 1 vector of prior parameters, J is an n×1 vector of ones, and $b (θ^{(m)}) = (b (θ_{1}^{(m)}), \dots, b (θ_{n}^{(m)}))^{'}$ is an n×1 vector of the $b (θ_{i}^{(m)})$ ’s. As discussed in Chen and Ibrahim (2003), y₀_i can be viewed as a prior prediction for the marginal mean of y_i at x_i. Thus, in eliciting y₀, the user must focus on a prediction (or guess) for E(y), which narrows the possibilities for choosing y₀. Moreover, the specification of all y₀_i equal has an appealing interpretation. A prior specification with y₀₁ = … = y₀_n implies a prior in which the prior modes of the slopes in the regression model are the same, but the prior modes of intercepts in the regression model vary. For example, a prior with y₀_i = 0.5 will have the same modes of slopes but a different mode of intercept than a prior with y₀_i = 0.1. This is intuitively appealing since in this case the prior prediction on y₀_i does not depend on the i^th subject’s specific information. Mathematically, this result was established in Chen and Ibrahim (2003). The details are as follows. Suppose we drop model index m. Let μ₀ be any prespecified p × 1 vector, where p = k + 1. Suppose we take

y_{0} = \dot{b} (θ) = \dot{b} (θ (X μ_{0})),

where ḃ(θ) is the gradient vector of b(θ). Then, the conjugate prior yields a prior mode of β equal to μ₀. Now we can see that μ₀ = (β₀, 0, …, 0)′ yields y₀₁ = y₀₂ = … = y₀_n = ḃ(θ(β₀)). Conversely, under some mild conditions the prior mode is unique, and hence the specification y₀ = y₀J, in which every component of the vector y₀ equals the same scalar y₀, leads to the prior mode μ₀ = (β₀, 0, …, 0)′, where β₀ satisfies ḃ(θ(β₀)) = y₀. For instance, under normal linear regression, we can show that the prior mode μ₀ of β is given by

μ_{0} = {(X^{'} X)}^{- 1} X^{'} y_{0} .

If we specify y₀ = y₀J, where J is the n × 1 vector of ones and the scalar y₀ is the common value of the y₀_i’s, we have

μ_{0} = (y_{0}, 0, 0, \dots, 0)^{'},

which implies that all the slopes are 0 while the intercept is equal to y₀. This attractive feature allows us to do sensitivity analyses by varying the intercepts in the prior. The parameter a₀ in (3) can be generally viewed as a precision parameter that quantifies the strength of our prior belief in y₀.
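The normal linear regression case can be checked numerically. The sketch below is only an illustration under assumed inputs (a randomly generated design matrix with an intercept column and a common prior guess of 0.5); it computes the prior mode (X′X)⁻¹X′y₀ and shows that only the intercept is nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
# Hypothetical design matrix: intercept column plus k random covariates.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

y0_scalar = 0.5
y0 = np.full(n, y0_scalar)   # every prior prediction y_{0i} is the same

# Prior mode under normal linear regression: (X'X)^{-1} X' y0.
mu0 = np.linalg.solve(X.T @ X, X.T @ y0)
print(np.round(mu0, 10))
```

The result follows because a constant vector lies in the column space of X whenever X contains an intercept column, so the least-squares fit to y₀ uses the intercept alone.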

In the context of Bayesian variable selection, (3) specifies the priors for all models in ℳ in an automatic and systematic fashion. Although various theoretical properties of (3) were examined in Chen and Ibrahim (2003) in great detail, it is not clear how well this type of prior performs in the context of Bayesian variable selection.

Now, under model m, the posterior distribution of β⁽^m⁾ with the conjugate prior (3) is given by

\begin{array}{l} π (β^{(m)} ∣ D, m) \propto exp {y^{'} θ^{(m)} - J^{'} b (θ^{(m)})} π (β^{(m)} ∣ y_{0}, a_{0}, m) \\ \propto exp {(y + a_{0} y_{0})^{'} θ^{(m)} - (1 + a_{0}) J^{'} b (θ^{(m)})}, \end{array}

(4)

where D = {(y_i, x_i), i = 1, 2, …, n} denotes the observed data. From (4), we can see that under the conjugate prior, the resulting posterior has a very attractive form. Furthermore, when a₀ → 0, the posterior π(β⁽^m⁾|D, m) in (4) reduces to

π (β^{(m)} ∣ D, m) \propto exp {y^{'} θ^{(m)} - J^{'} b (θ^{(m)})},

which is the posterior distribution based on an improper uniform prior for β⁽^m⁾.
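To illustrate how simple the kernel in (4) is to evaluate, here is a minimal sketch for logistic regression. It assumes the canonical link (θ = Xβ) and b(θ) = log(1 + e^θ); the function name and the small data set are hypothetical.

```python
import numpy as np

def log_posterior_kernel(beta, X, y, y0, a0):
    """Log of the unnormalized posterior (4) for logistic regression.

    Assumptions: canonical link, theta = X @ beta, b(theta) = log(1 + exp(theta)).
    """
    theta = X @ beta
    b = np.logaddexp(0.0, theta)   # numerically stable log(1 + exp(theta))
    return (y + a0 * y0) @ theta - (1.0 + a0) * b.sum()

# Hypothetical data: three observations, intercept plus one covariate.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
y0 = np.full(3, 0.5)
beta = np.array([0.1, -0.3])

print(log_posterior_kernel(beta, X, y, y0, a0=0.0))   # a0 = 0: likelihood kernel only
print(log_posterior_kernel(beta, X, y, y0, a0=2.0))   # a0 = 2: prior counts twice as much as the data
```

Setting a0 = 0 recovers the log-likelihood kernel, matching the limiting posterior under the improper uniform prior.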

2.3 Variable Selection Criteria

In this section, we consider four Bayesian model assessment criteria, namely, Conditional Predictive Ordinate (CPO) statistic (Geisser (1993); Gelfand et al. (1992); and Gelfand and Dey (1994)), L measure (Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelfand and Ghosh (1998); Ibrahim et al. (2001a); and Chen et al. (2004)), Deviance Information Criterion (DIC) (Spiegelhalter et al. (2002)), and marginal likelihood (Bayes factor).

The CPO, L measure, and DIC are criterion-based methods that are attractive in the sense that they are well defined under improper priors as long as the posterior distribution is proper, and thus have an advantage over the marginal likelihood or Bayes factor approach in this respect. For this reason, these three criterion-based methods can be directly compared to AIC (Akaike (1973)) and BIC (Schwarz (1978)). On the other hand, the marginal likelihood or the Bayes factor is well calibrated and relatively easy to interpret, but generally sensitive to vague proper priors. In the context of variable selection, it is not clear how these methods perform under the conjugate prior given in (3) for the GLM.

Under model m, for the i^th observation, we define the CPO statistic as follows:

{CPO}_{i} = f (y_{i} ∣ x_{i}, D^{(- i)}) = \int f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}) π (β^{(m)} ∣ D^{(- i)}, m) d β^{(m)},

where D⁽⁻ⁱ⁾ is D with the i^th observation deleted, and $π (β^{(m)} ∣ D^{(- i)}, m)$ is the posterior distribution based on the data D⁽⁻ⁱ⁾. Due to the construction of the conjugate prior (3), it is more natural to define

π (β^{(m)} ∣ D^{(- i)}, m) \propto \prod_{j \neq i} exp {(y_{j} + a_{0} y_{0 j}) θ_{j}^{(m)} - (1 + a_{0}) b (θ_{j}^{(m)})} .

After some messy algebra, we can show that CPO_i takes the following form:

\begin{array}{l} {CPO}_{i} = f (y_{i} ∣ x_{i}, D^{(- i)}) \\ = \frac{\int \frac{1}{exp [a_{0} {y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)})}]} π (β^{(m)} ∣ D, m) d β^{(m)}}{\int \frac{1}{f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}) exp [a_{0} {y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)})}]} π (β^{(m)} ∣ D, m) d β^{(m)}}, \end{array}

(5)

where $f (y_{i} ∣ x_{i}^{(m)}, β^{(m)})$ is the density function given in (2). Also, we notice that the CPO defined in (5) is slightly different from the usual CPO (Geisser (1993) and Gelfand et al. (1992)), which is of the form

{\int \frac{1}{f (y_{i} ∣ x_{i}^{(m)}, β^{(m)})} π (β^{(m)} ∣ D, m) d β^{(m)}}^{- 1} .

However, these two forms will be identical as a₀ → 0. As suggested in Ibrahim et al. (2001b), a natural summary statistic of the CPO_i’s is the logarithm of the Pseudo-marginal likelihood (LPML) defined as

{LPML}_{m} = \sum_{i = 1}^{n} log ({CPO}_{i}) .

We will use LPML_m as a criterion-based measure for variable selection.
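Given posterior draws under model m, both integrals in (5) are posterior expectations and can be approximated by sample averages. The sketch below is one possible implementation, not the authors' code. It assumes the draws have already been mapped to an S × n array of θ_i values and that b and c are supplied as vectorized functions; the log-mean-exp helper is a numerical-stability device added here.

```python
import numpy as np

def _log_mean_exp(a, axis=0):
    """Stable log of the mean of exp(a) along an axis."""
    amax = a.max(axis=axis, keepdims=True)
    return np.squeeze(amax, axis=axis) + np.log(np.mean(np.exp(a - amax), axis=axis))

def lpml_from_draws(theta, y, y0, a0, b, c):
    """Monte Carlo version of (5) and of LPML_m.

    theta : (S, n) array, theta[s, i] = theta_i evaluated at posterior draw s
    b, c  : vectorized functions b(theta) and c(y) from (2)
    Returns (LPML, vector of log CPO_i).
    """
    log_prior_i = a0 * (y0 * theta - b(theta))        # log exp[a0{y0i theta_i - b(theta_i)}]
    log_f_i = y * theta - b(theta) + c(y)             # log f(y_i | x_i, beta)
    log_num = _log_mean_exp(-log_prior_i)             # numerator of (5)
    log_den = _log_mean_exp(-log_f_i - log_prior_i)   # denominator of (5)
    log_cpo = log_num - log_den
    return log_cpo.sum(), log_cpo

# Hypothetical usage for logistic regression with made-up draws.
rng = np.random.default_rng(0)
theta_draws = rng.normal(size=(500, 3))
y_obs = np.array([1.0, 0.0, 1.0])
lpml, log_cpo = lpml_from_draws(
    theta_draws, y_obs, np.full(3, 0.5), 0.1,
    b=lambda t: np.logaddexp(0.0, t),
    c=lambda yy: np.zeros_like(yy, dtype=float),
)
```

With a0 = 0 the numerator equals one and the function returns the usual harmonic-mean form of the CPO.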

The L measure criterion is another useful tool for model comparison and variable selection. The L measure is constructed from the posterior predictive distribution of the data. For the entire class of GLM’s in (2), under model m, the L measure is defined as:

\begin{array}{l} L_{m} (ν) = \sum_{i = 1}^{n} [E {b^{″} (θ_{i}^{(m)}) ∣ D, m} + Var {b^{'} (θ_{i}^{(m)}) ∣ D, m}] \\ + ν \sum_{i = 1}^{n} {[E {b^{'} (θ_{i}^{(m)}) ∣ D, m} - y_{i}]}^{2}, \end{array}

(6)

where b′(.) and b″(.) are the mean and variance functions of the GLM in (2), and all expectations and variances are taken with respect to the posterior distribution π(β⁽^m⁾|D, m) in (4). We note that for the GLM in (1), we need to modify L_m(ν) in (6) accordingly, and in this case, the L measure takes the form

\begin{array}{l} L_{m} (ν) = \sum_{i = 1}^{n} [E {a_{i} (τ) b^{″} (θ_{i}^{(m)}) ∣ D, m} + Var {b^{'} (θ_{i}^{(m)}) ∣ D, m}] \\ + ν \sum_{i = 1}^{n} {[E {b^{'} (θ_{i}^{(m)}) ∣ D, m} - y_{i}]}^{2} . \end{array}

(7)
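The L measure in (6) involves only posterior means and variances of b′(θ_i) and b″(θ_i), so it can be approximated directly from posterior draws. The following sketch assumes an S × n array of θ_i draws and user-supplied vectorized derivatives of b; the names are illustrative only.

```python
import numpy as np

def l_measure(theta, y, nu, b_prime, b_double_prime):
    """Monte Carlo version of (6); theta is an (S, n) array of posterior draws of theta_i."""
    mean_fn = b_prime(theta)          # b'(theta_i): mean function
    var_fn = b_double_prime(theta)    # b''(theta_i): variance function
    # Posterior mean of b'' plus posterior variance of b' (population variance over draws).
    first = var_fn.mean(axis=0) + mean_fn.var(axis=0)
    # Squared distance between the posterior mean of b' and the observed response.
    second = (mean_fn.mean(axis=0) - y) ** 2
    return first.sum() + nu * second.sum()

# Hypothetical usage for Poisson regression, where b'(theta) = b''(theta) = exp(theta).
rng = np.random.default_rng(0)
theta_draws = rng.normal(scale=0.3, size=(1000, 4))
y_obs = np.array([1.0, 0.0, 2.0, 1.0])
print(l_measure(theta_draws, y_obs, nu=0.5, b_prime=np.exp, b_double_prime=np.exp))
```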

The DIC criterion, proposed by Spiegelhalter et al. (2002), is given by

{DIC}_{m} = D ({\bar{β}}^{(m)}) + 2 p_{D}^{(m)},

(8)

where

p_{D}^{(m)} = \bar{D (β^{(m)})} - D ({\bar{β}}^{(m)}),

β̄⁽^m⁾ = E[β⁽^m⁾|D, m], and $\bar{D (β^{(m)})} = E [D (β^{(m)}) ∣ D, m]$ . For the GLM in (2), under model m,

D (β^{(m)}) = - 2 \sum_{i = 1}^{n} {y_{i} θ_{i}^{(m)} - b (θ_{i}^{(m)})} .

(9)

Similar to (6), under the GLM in (1), D(β⁽^m⁾) needs to be modified accordingly.
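For illustration, (8) and (9) can be evaluated from posterior draws of β⁽^m⁾ as in the sketch below. It assumes the canonical link, an S × k_m array of draws, and a vectorized b; the function name and example data are hypothetical.

```python
import numpy as np

def dic_from_draws(beta_draws, X, y, b):
    """DIC in (8) with the deviance (9); beta_draws is (S, k_m), canonical link assumed."""
    def deviance(beta):
        theta = X @ beta
        return -2.0 * np.sum(y * theta - b(theta))

    d_bar = np.mean([deviance(beta) for beta in beta_draws])   # posterior mean of the deviance
    d_at_mean = deviance(beta_draws.mean(axis=0))              # deviance at the posterior mean
    p_d = d_bar - d_at_mean                                    # effective number of parameters
    return d_at_mean + 2.0 * p_d, p_d

# Hypothetical usage for Poisson regression, b(theta) = exp(theta).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])
y = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 3.0])
draws = rng.normal(scale=0.2, size=(400, 2))
dic, p_d = dic_from_draws(draws, X, y, np.exp)
```

Because b is convex, the deviance (9) is convex in β under the canonical link, so p_D computed this way is nonnegative for any set of draws.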

In the spirit of marginal likelihoods, after ignoring the constants shared by all variable subset models in model space ℳ for the GLM in (2), for the purpose of variable subset selection it suffices to compute the posterior normalizing constant

C_{m} (D) = \int exp {(y + a_{0} y_{0})^{'} θ^{(m)} - (1 + a_{0}) J^{'} b (θ^{(m)})} d β^{(m)}

(10)

and the prior normalizing constant

C_{0 m} (y_{0}) = \int exp [a_{0} {y_{0}^{'} θ^{(m)} - J^{'} b (θ^{(m)})}] d β^{(m)} .

(11)

Similar to the modification of (6) yielding (7), under the GLM in (1), D(β⁽^m⁾) in (9), C_m(D) in (10), and C₀_m(y₀) in (11) need to be modified accordingly. In the context of variable selection, we select a variable subset model which yields the largest LPML_m under the CPO, the smallest L_m(ν) under the L measure, the smallest DIC_m under the DIC, and the largest C_m(D)/C₀_m(y₀) or log[C_m(D)/C₀_m(y₀)] under the marginal likelihood.

3 Analytic Connections Between Variable Selection Criteria For the Normal Linear Regression Model

In this section, we consider the normal linear regression models given by

f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}, τ) = \frac{τ^{1 / 2}}{\sqrt{2 π}} exp {- \frac{τ}{2} {(y_{i} - x_{i}^{(m)^{'}} β^{(m)})}^{2}} .

(12)

Let $X_{m} = (x_{1}^{(m)}, x_{2}^{(m)}, \dots, x_{n}^{(m)})^{'}$ , which is the design matrix for the normal linear regression under model m. Assume X_m is of full rank k_m throughout. We focus only on the τ known case as analytical connections are more difficult to establish when τ is unknown. For the model in (12) with a known τ, the conjugate prior for β⁽^m⁾ in (3) reduces to

[β^{(m)} ∣ y_{0}, a_{0}, m] \sim N_{k_{m}} ({(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} y_{0}, \frac{1}{τ a_{0}} {(X_{m}^{'} X_{m})}^{- 1}),

(13)

and the posterior distribution for β⁽^m⁾ is given by

[β^{(m)} ∣ D, m] \sim N_{k_{m}} ({(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} \frac{y + a_{0} y_{0}}{1 + a_{0}}, \frac{1}{τ (1 + a_{0})} {(X_{m}^{'} X_{m})}^{- 1}) .

For (12), AIC and BIC under model m are given by

{AIC}_{m} = - 2 log L ({\hat{β}}^{(m)} ∣ D) + 2 k_{m} = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + 2 k_{m},

(14)

where β̂⁽^m⁾ is the maximum likelihood estimate of β⁽^m⁾ and

{SSE}_{m} = y^{'} {I - X_{m} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'}} y

is the usual sum of squared errors, and

{BIC}_{m} = - 2 log L ({\hat{β}}^{(m)} ∣ D) + {log (n)} k_{m} = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + {log (n)} k_{m} .

(15)

After some algebra, we can show that after putting back all normalizing constants, the logarithm of the marginal likelihood under model m is given by

\begin{array}{l} log {C_{m} (D) / C_{0 m} (y_{0})} \\ = \frac{n}{2} log (\frac{τ}{2 π}) - \frac{τ}{2} y^{'} y + \frac{τ (1 + a_{0})}{2} {(\frac{y + a_{0} y_{0}}{1 + a_{0}})}^{'} X_{m} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} (\frac{y + a_{0} y_{0}}{1 + a_{0}}) \\ - \frac{τ a_{0}}{2} {y_{0}^{'} X_{m} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} y_{0}} + (\frac{1}{2} log \frac{a_{0}}{1 + a_{0}}) k_{m} . \end{array}

(16)
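The closed form (16) can be checked numerically. Integrating β⁽^m⁾ out of (12) against the prior (13) gives a multivariate normal marginal for y with mean X_m(X_m′X_m)⁻¹X_m′y₀ and covariance {I + X_m(X_m′X_m)⁻¹X_m′/a₀}/τ. The sketch below, with simulated inputs that are assumptions made here, compares that direct evaluation with the right-hand side of (16).

```python
import numpy as np

def log_marginal_16(y, X, y0, a0, tau):
    """Right-hand side of (16) for the normal linear model with known tau."""
    n, k_m = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)     # projection onto the column space of X_m
    u = (y + a0 * y0) / (1.0 + a0)
    return (0.5 * n * np.log(tau / (2 * np.pi)) - 0.5 * tau * y @ y
            + 0.5 * tau * (1 + a0) * u @ P @ u
            - 0.5 * tau * a0 * y0 @ P @ y0
            + 0.5 * np.log(a0 / (1 + a0)) * k_m)

def log_marginal_direct(y, X, y0, a0, tau):
    """log N(y; P y0, (I + P / a0) / tau), the marginal implied by (12) and (13)."""
    n = len(y)
    P = X @ np.linalg.solve(X.T @ X, X.T)
    cov = (np.eye(n) + P / a0) / tau
    r = y - P @ y0                            # residual from the prior predictive mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

# Simulated (hypothetical) inputs.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(15), rng.normal(size=(15, 2))])
y = rng.normal(size=15)
y0 = np.full(15, 0.3)
closed_form = log_marginal_16(y, X, y0, a0=0.8, tau=2.0)
direct = log_marginal_direct(y, X, y0, a0=0.8, tau=2.0)
print(closed_form, direct)
```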

When y₀ = 0, the conjugate prior in (13) reduces to Zellner’s g-prior (Zellner (1986)). For this special case, (16) becomes

\begin{array}{l} log [C_{m} (D) / C_{0 m} (0)] \\ = \frac{n}{2} log (\frac{τ}{2 π}) - \frac{τ a_{0}}{2 (1 + a_{0})} y^{'} y - \frac{τ}{2 (1 + a_{0})} {SSE}_{m} + (\frac{1}{2} log \frac{a_{0}}{1 + a_{0}}) k_{m} . \end{array}

(17)

Thus, we have

\begin{array}{l} M_{m} (a_{0}) \equiv - 2 (1 + a_{0}) [log {C_{m} (D) / C_{0 m} (0)} + \frac{τ a_{0}}{2 (1 + a_{0})} y^{'} y] + a_{0} n log (\frac{τ}{2 π}) \\ = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + {(1 + a_{0}) log \frac{1 + a_{0}}{a_{0}}} k_{m} . \end{array}

(18)

For purposes of variable selection, it suffices to compare M_m(a₀) across models, and we then choose a model with the smallest M_m(a₀). From (18), we can see that

M_{m} (a_{0}) = \begin{cases} {AIC}_{m} & \text{if } (1 + a_{0}) log \frac{1 + a_{0}}{a_{0}} = 2, \\ {BIC}_{m} & \text{if } (1 + a_{0}) log \frac{1 + a_{0}}{a_{0}} = log n . \end{cases}

(19)
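The two conditions in (19) have no closed-form solution in a₀, but the penalty (1 + a₀) log{(1 + a₀)/a₀} is strictly decreasing in a₀, from +∞ as a₀ → 0 down to 1 as a₀ → ∞, so each condition has a unique root whenever the target exceeds 1. A minimal bisection sketch follows; the function names and the sample size n = 100 are assumptions for illustration.

```python
import math

def penalty(a0):
    """Per-parameter penalty in (18): (1 + a0) * log((1 + a0) / a0)."""
    return (1.0 + a0) * math.log1p(1.0 / a0)

def solve_a0(target, lo=1e-12, hi=1e12, iters=200):
    """Bisection on the log scale for penalty(a0) = target (requires target > 1)."""
    if target <= 1.0:
        raise ValueError("no solution: the penalty is always greater than 1")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)      # geometric midpoint of the bracket
        if penalty(mid) > target:     # penalty too large, so a0 must increase
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

a0_aic = solve_a0(2.0)                # matches the AIC penalty
a0_bic = solve_a0(math.log(100))      # matches the BIC penalty for a hypothetical n = 100
print(a0_aic, a0_bic)
```

The AIC-matching root is about 0.255. Since log n > 2 once n ≥ 8, the BIC-matching a₀ is then smaller than the AIC-matching one, that is, a heavier penalty corresponds to a vaguer prior.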

For (12), we use (7) to compute L_m(ν). In particular, we have a_i(τ) = 1/τ, $E [a_{i} (τ) b^{″} (θ_{i}^{(m)}) ∣ D, m] = \frac{1}{τ}$ ,

\begin{array}{l} Var {b^{'} (θ_{i}^{(m)}) ∣ D, m} = Var {x_{i}^{(m)^{'}} β^{(m)} ∣ D, m} \\ = x_{i}^{(m)^{'}} Var (β^{(m)} ∣ D, m) x_{i}^{(m)} = \frac{1}{τ (1 + a_{0})} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}, \end{array}

and $E {b^{'} (θ_{i}^{(m)}) ∣ D, m} = E {x_{i}^{(m)^{'}} β^{(m)} ∣ D, m} = x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} \frac{y + a_{0} y_{0}}{1 + a_{0}}$ . Thus, we obtain

\begin{array}{l} L_{m} (ν) = \frac{n}{τ} + \frac{1}{τ (1 + a_{0})} \sum_{i = 1}^{n} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)} \\ + ν \sum_{i = 1}^{n} {y_{i} - x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} \frac{y + a_{0} y_{0}}{1 + a_{0}}}^{2} \\ = \frac{n}{τ} + \frac{1}{τ (1 + a_{0})} k_{m} + ν [{y - X_{m} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} \frac{y + a_{0} y_{0}}{1 + a_{0}}}^{'} \\ \times {y - X_{m} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} \frac{y + a_{0} y_{0}}{1 + a_{0}}}] . \end{array}

(20)

When y₀ = 0, (20) reduces to

L_{m} (ν) = \frac{n}{τ} + \frac{1}{τ (1 + a_{0})} k_{m} + \frac{ν a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y + \frac{ν (1 + 2 a_{0})}{{(1 + a_{0})}^{2}} {SSE}_{m} .

(21)

Write

{\tilde{L}}_{m} (ν, a_{0}) = \frac{τ {(1 + a_{0})}^{2}}{ν (1 + 2 a_{0})} {L_{m} (ν) - \frac{n}{τ} - \frac{ν a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y} - n log (\frac{τ}{2 π}) .

(22)

Using (21) and (22), we obtain

{\tilde{L}}_{m} (ν, a_{0}) = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + \frac{1 + a_{0}}{ν (1 + 2 a_{0})} k_{m},

and hence

{\tilde{L}}_{m} (ν, a_{0}) = \begin{cases} {AIC}_{m} & \text{if } \frac{1 + a_{0}}{ν (1 + 2 a_{0})} = 2, \\ {BIC}_{m} & \text{if } \frac{1 + a_{0}}{ν (1 + 2 a_{0})} = log n . \end{cases}

Note that in the context of variable selection, a model with the smallest L_m(ν) is the same model that has the smallest L̃_m(ν, a₀). Thus, in this sense, the L measure can be made equivalent to AIC or BIC by appropriately tuning (ν, a₀). It is interesting to mention that in order to achieve L̃_m(ν, a₀) = AIC_m or L̃_m(ν, a₀) = BIC_m, ν must be small, and hence when ν = 1, the L measure always has a smaller dimensional penalty than both AIC and BIC. Unlike in the marginal likelihood, a₀ plays only a minimal role in controlling the dimensional penalty in the L measure.
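To make the requirement on ν explicit, the matching conditions can be solved for ν. This is only a rearrangement of the conditions displayed above, with c standing for the target penalty (a symbol introduced here for convenience):

```latex
\frac{1 + a_0}{\nu (1 + 2 a_0)} = c
\quad\Longleftrightarrow\quad
\nu = \frac{1 + a_0}{c \, (1 + 2 a_0)},
\qquad c \in \{2, \log n\}.
% For a_0 > 0 we have 1/2 < (1 + a_0)/(1 + 2 a_0) < 1, so
%   AIC (c = 2):       1/4 < \nu < 1/2,
%   BIC (c = \log n):  1/(2 \log n) < \nu < 1/\log n.
% With \nu = 1 the penalty per parameter is (1 + a_0)/(1 + 2 a_0) < 1.
```

The ratio (1 + a₀)/(1 + 2a₀) moves only between 1/2 and 1 as a₀ ranges over (0, ∞), which is why a₀ has little influence on the penalty here.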

When y₀ = 0, the posterior mean of β⁽^m⁾ is given by ${\bar{β}}^{(m)} = \frac{1}{1 + a_{0}} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} y$ . Thus, we have $D (β^{(m)}) = - n log (\frac{τ}{2 π}) + τ (y - X_{m} β^{(m)})^{'} (y - X_{m} β^{(m)})$ ,

\begin{array}{l} \bar{D (β^{(m)})} \\ = E [D (β^{(m)}) ∣ D, m] = - n log (\frac{τ}{2 π}) + τ E [{y - X_{m} {\bar{β}}^{(m)} - X_{m} (β^{(m)} - {\bar{β}}^{(m)})}^{'} \\ \times {y - X_{m} {\bar{β}}^{(m)} - X_{m} (β^{(m)} - {\bar{β}}^{(m)})} ∣ D, m] \\ = - n log (\frac{τ}{2 π}) + \frac{1}{1 + a_{0}} k_{m} + \frac{τ a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y + \frac{τ (1 + 2 a_{0})}{{(1 + a_{0})}^{2}} {SSE}_{m}, \end{array}

(23)

and

\begin{array}{l} D ({\bar{β}}^{(m)}) = - n log (\frac{τ}{2 π}) + τ (y - X_{m} {\bar{β}}^{(m)})^{'} (y - X_{m} {\bar{β}}^{(m)}) \\ = - n log (\frac{τ}{2 π}) + \frac{τ a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y + \frac{τ (1 + 2 a_{0})}{{(1 + a_{0})}^{2}} {SSE}_{m} . \end{array}

(24)

Combining (23) and (24) gives

p_{D}^{(m)} = \bar{D (β^{(m)})} - D ({\bar{β}}^{(m)}) = \frac{1}{1 + a_{0}} k_{m} .

(25)

Thus, the DIC_m for (12) is given by

{DIC}_{m} = - n log (\frac{τ}{2 π}) + \frac{τ a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y + \frac{τ (1 + 2 a_{0})}{{(1 + a_{0})}^{2}} {SSE}_{m} + \frac{2}{1 + a_{0}} k_{m} .

(26)

Write

{DIC}_{m}^{*} (a_{0}) = \frac{{(1 + a_{0})}^{2}}{1 + 2 a_{0}} {{DIC}_{m} - \frac{τ a_{0}^{2}}{{(1 + a_{0})}^{2}} y^{'} y} + \frac{n a_{0}^{2}}{1 + 2 a_{0}} log (\frac{τ}{2 π}) .

We have

{DIC}_{m}^{*} (a_{0}) = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + \frac{2 (1 + a_{0})}{1 + 2 a_{0}} k_{m} .

(27)

Therefore, when a₀ = 0, ${DIC}_{m}^{*} (0) = {DIC}_{m} = {AIC}_{m}$ , and when a₀ > 0, $\frac{2 (1 + a_{0})}{1 + 2 a_{0}} < 2$ , which implies that ${DIC}_{m}^{*} (a_{0})$ has a smaller dimensional penalty than both AIC and BIC.
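The identity (25) and the closed form (24) can be checked by simulation. The sketch below draws from the normal posterior with y₀ = 0 and compares a Monte Carlo estimate of p_D with k_m/(1 + a₀). The design, response, and sample sizes are simulated assumptions, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau, a0 = 40, 1.0, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # hypothetical design, k_m = 3
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)      # hypothetical response
k_m = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
post_mean = XtX_inv @ X.T @ y / (1.0 + a0)     # posterior mean when y0 = 0
post_cov = XtX_inv / (tau * (1.0 + a0))        # posterior covariance

def deviance(beta):
    """D(beta) for (12); beta is one vector or an (S, k_m) array of draws."""
    resid = y - beta @ X.T
    return -n * np.log(tau / (2 * np.pi)) + tau * np.sum(resid**2, axis=-1)

draws = rng.multivariate_normal(post_mean, post_cov, size=50_000)
p_d_mc = deviance(draws).mean() - deviance(post_mean)   # Monte Carlo estimate of p_D
p_d_exact = k_m / (1.0 + a0)                            # value given by (25)
print(p_d_mc, p_d_exact)
```

The two printed values should agree up to Monte Carlo error, which shrinks at the usual 1/√S rate.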

As with DIC, we consider only y₀ = 0. From (5), we have

{LPML}_{m} = \sum_{i = 1}^{n} log ({CPO}_{i}) = \sum_{i = 1}^{n} log ({CPO}_{1 i}) - \sum_{i = 1}^{n} log ({CPO}_{2 i}),

(28)

where ${CPO}_{1 i} = \int exp {\frac{a_{0} τ}{2} β^{(m)^{'}} x_{i}^{(m)} x_{i}^{(m)^{'}} β^{(m)}} π (β^{(m)} ∣ D, m) d β^{(m)}$ and

\begin{array}{l} {CPO}_{2 i} = {(\frac{τ}{2 π})}^{- 1 / 2} exp {\frac{τ}{2} y_{i}^{2}} \int exp [\frac{τ (1 + a_{0})}{2} \\ \times {β^{(m)^{'}} x_{i}^{(m)} x_{i}^{(m)^{'}} β^{(m)} - \frac{2}{1 + a_{0}} β^{(m)^{'}} x_{i}^{(m)} y_{i}}] π (β^{(m)} ∣ D, m) d β^{(m)} \end{array}

for i = 1, 2, …, n. After some messy algebra, we obtain

\begin{array}{l} {CPO}_{1 i} = {1 - \frac{a_{0}}{1 + a_{0}} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}}^{- 1 / 2} \\ \times exp {\frac{τ a_{0}}{2 {(1 + a_{0})}^{2}} \frac{y^{'} X_{m} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} y}{1 - \frac{a_{0}}{1 + a_{0}} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}}} \end{array}

and

\begin{array}{l} {CPO}_{2 i} = {(\frac{τ}{2 π})}^{- 1 / 2} {1 - x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}}^{- 1 / 2} exp (\frac{τ}{2} y_{i}^{2}) \\ \times exp [\frac{τ}{2 (1 + a_{0})} {y_{i} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)} y_{i} - 2 y^{'} X_{m} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)} y_{i}}] \\ \times exp [\frac{τ (X_{m}^{'} y - x_{i}^{(m)} y_{i})^{'} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)} x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} (X_{m}^{'} y - x_{i}^{(m)} y_{i})}{2 (1 + a_{0}) {1 - x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}}}] . \end{array}

Let ${\hat{β}}^{(m)} = {(X_{m}^{'} X_{m})}^{- 1} X_{m}^{'} y, {\hat{y}}_{i}^{(m)} = x_{i}^{(m)^{'}} {\hat{β}}^{(m)}$ , and $h_{i i}^{(m)} = x_{i}^{(m)^{'}} {(X_{m}^{'} X_{m})}^{- 1} x_{i}^{(m)}$ . Plugging CPO₁_i and CPO₂_i into (28) yields

\begin{array}{l} {LPML}_{m} = \frac{n}{2} log (\frac{τ}{2 π}) - \frac{τ}{2} \sum_{i = 1}^{n} y_{i}^{2} + \frac{1}{2} \sum_{i = 1}^{n} {log (1 - h_{i i}^{(m)}) - log (1 - \frac{a_{0}}{1 + a_{0}} h_{i i}^{(m)})} \\ - \frac{τ}{2 (1 + a_{0})} \sum_{i = 1}^{n} {h_{i i}^{(m)} y_{i}^{2} - 2 y_{i} {\hat{y}}_{i}^{(m)}} + \frac{τ a_{0}}{2 {(1 + a_{0})}^{2}} \sum_{i = 1}^{n} {\frac{{\hat{y}}_{i}^{{(m)}^{2}}}{1 - \frac{a_{0}}{1 + a_{0}} h_{i i}^{(m)}}} \\ - \frac{τ}{2 (1 + a_{0})} \sum_{i = 1}^{n} \frac{{({\hat{y}}_{i}^{(m)} - h_{i i}^{(m)} y_{i})}^{2}}{1 - h_{i i}^{(m)}} . \end{array}

(29)

Using Taylor expansion and after some algebra, LPML_m in (29) can be rewritten as

{LPML}_{m} = \frac{n}{2} log (\frac{τ}{2 π}) - \frac{τ a_{0}^{2}}{2 {(1 + a_{0})}^{2}} y^{'} y - \frac{τ (1 + 2 a_{0})}{2 {(1 + a_{0})}^{2}} {SSE}_{m} - \frac{k_{m}}{2 (1 + a_{0})} + R_{m}^{*},

(30)

where

\begin{array}{l} R_{m}^{*} = - \frac{τ}{2 (1 + a_{0})} \sum_{i = 1}^{n} \frac{{(y_{i} - {\hat{y}}_{i}^{(m)})}^{2} h_{i i}^{(m)}}{1 - h_{i i}^{(m)}} + \frac{τ a_{0}}{2 {(1 + a_{0})}^{2}} \sum_{i = 1}^{n} \frac{a_{0}}{1 + a_{0}} \frac{h_{i i}^{(m)} {\hat{y}}_{i}^{{(m)}^{2}}}{1 - \frac{a_{0}}{1 + a_{0}} h_{i i}^{(m)}} \\ + \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 2}^{\infty} {1 - {(\frac{a_{0}}{1 + a_{0}})}^{j}} \frac{{(- 1)}^{j} h_{i i}^{{(m)}^{j}}}{j} . \end{array}

Write

{LPML}_{m}^{*} = - \frac{2 {(1 + a_{0})}^{2}}{1 + 2 a_{0}} {{LPML}_{m} + \frac{τ a_{0}^{2}}{2 {(1 + a_{0})}^{2}} y^{'} y} + \frac{n a_{0}^{2}}{1 + 2 a_{0}} log (\frac{τ}{2 π}) .

(31)

Using (30) and (31), we obtain

{LPML}_{m}^{*} = - n log (\frac{τ}{2 π}) + τ {SSE}_{m} + \frac{1 + a_{0}}{(1 + 2 a_{0})} k_{m} + R_{m},

where $R_{m} = - \frac{2 {(1 + a_{0})}^{2}}{1 + 2 a_{0}} R_{m}^{*}$ . We choose a model with the smallest ${LPML}_{m}^{*}$ . Note that the remainder term R_m is small when all $h_{i i}^{(m)}$ ’s are small. From (14), (15), and (27), we see that when R_m is small and does not vary much in the model space ℳ, LPML has a smaller dimensional penalty than DIC, AIC and BIC. In addition, when a₀ = 0, LPML_m in (30) is consistent with the one derived by Gelfand and Dey (1994) based on the asymptotic approximation.

Finally, we note that the quantities defined in (18), (22), (27) and (31) are linear transformations of those defined by (17), (21), (26) and (30), respectively. In these linear transformations, the relevant coefficients are independent of m. Thus, for the purposes of variable subset selection, these linearly transformed quantities act exactly like those original forms. With (18), (22), (27) and (31), we can much more clearly see the analytical connections to AIC and BIC. We also note that George and Foster (2000) provided some similar connections between model selection probabilities and various model selection criteria for this setup.

4 Computational Development: Theory and Implementation

For the purpose of variable selection, we need to compute LPML_m, L_m(ν), DIC_m, C_m(D) and C₀_m(y₀) for the Bayesian variable selection criteria described in the previous section for m = 1, 2, …, $\mathcal{K}$, where $\mathcal{K}$ is the number of models in ℳ. Due to the complexity and generality of the GLM in (2), the analytical evaluation of these measures does not appear possible. Thus, a Monte Carlo (MC) based method is required for each of the measures under consideration. However, the MC methods currently available in the Bayesian computational literature require a Markov chain Monte Carlo (MCMC) sample from the posterior distribution π(β⁽^m⁾|D, m) in (4) under each variable subset model m. When the number of models in ℳ is large, sampling from the posterior distribution under each variable subset model can be expensive. Thus, the computation of these four measures for all submodels becomes a difficult and challenging task. Therefore, the development of an efficient Monte Carlo method for variable selection for the GLM is essential.

After examining (5), (6), and (8), we observe that there is a common feature in computing LPML_m, L_m(ν), and DIC_m, i.e., all three measures require computing

g_{m} = E {g (β^{(m)}) ∣ D, m},

for various functions g, where the expectation is taken with respect to the joint posterior distribution in (4) under model m. Specifically, the functions required in these calculations include

$g (β^{(m)}) = exp [- a_{0} {y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)})}]$ and $g (β^{(m)}) = {(f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}) exp [a_{0} {y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)})}])}^{- 1}$ for LPML_m;
$g (β^{(m)}) = b^{'} (θ_{i}^{(m)}), g (β^{(m)}) = {b^{'} (θ_{i}^{(m)})}^{2}$ , and $g (β^{(m)}) = b^{″} (θ_{i}^{(m)})$ for L_m(ν);
g(β⁽^m⁾) = β⁽^m⁾ and g(β⁽^m⁾) = D(β⁽^m⁾) for DIC_m.

Write

L (β^{(m)} ∣ D, m) = exp {(y + a_{0} y_{0})^{'} θ^{(m)} - (1 + a_{0}) J^{'} b (θ^{(m)})}

under model m and let $L (β ∣ D) = L (β^{(\mathcal{K})} ∣ D, \mathcal{K})$ , $C (D) = C_{\mathcal{K}} (D)$ , and $C_{0} (y_{0}) = C_{0 \mathcal{K}} (y_{0})$ under the full model, which is indexed by m = $\mathcal{K}$ . Here, we abuse the notation slightly, as L(β⁽^m⁾|D, m) is not a likelihood function in the usual sense. Then, for a given function g, mathematically, we have

g_{m} = E [g (β^{(m)}) ∣ D, m] = \int g (β^{(m)}) \frac{L (β^{(m)} ∣ D, m)}{C_{m} (D)} d β^{(m)},

where C_m(D) is defined in (10). Now, we present a useful identity for g_m, which is formally stated in the following theorem.

Theorem 5

For any given function g, such that E[|g(β⁽^m⁾)| |D, m] < ∞, we have

g_{m} = \frac{C (D)}{C_{m} (D)} E {g (β^{(m)}) \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D},

(32)

where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Here, w(β⁽⁻^m⁾|β⁽^m⁾) is a completely known conditional density, whose support is contained in, or equal to, the support of the conditional density of β⁽⁻^m⁾ given β⁽^m⁾ with respect to the joint posterior distribution in (4) under the full model.

Observe that when g ≡ 1, we have

1 = \frac{C (D)}{C_{m} (D)} E {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D},

which leads to

\frac{C_{m} (D)}{C (D)} = E {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}

(33)

and

g_{m} = \frac{E {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}}{E {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}} .

(34)

It is interesting to mention that the identity (33) is a by-product of this derivation and that it can be used to compute the posterior normalizing constant under model m relative to that of the full model. The identities (33) and (34) play an important role in developing a novel Monte Carlo method for computing LPML_m, L_m(ν), DIC_m, and C_m(D) simultaneously using a single MCMC sample from the joint posterior distribution under the full model. Towards this goal, we let $\{β_{s} = (β_{s}^{(m)^{'}}, β_{s}^{(- m)^{'}})^{'}, s = 1, 2, \dots, S\}$ denote an MCMC sample from the joint posterior distribution (4) under the full model, where S is the MCMC sample size. Then, an estimate of g_m is given by

{\hat{g}}_{m} = \frac{\sum_{s = 1}^{S} \frac{g (β_{s}^{(m)}) L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}}{\sum_{s = 1}^{S} \frac{L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}} .

(35)

Under certain regularity conditions, such as ergodicity, we have

lim_{S \to \infty} {\hat{g}}_{m} = g_{m},

which indicates that ĝ_m is consistent.
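The estimator (35) and the by-product estimate B_S of C_m(D)/C(D) in (37) can be sketched numerically as follows. This is a minimal illustration and not the authors' implementation: the function names, the log-scale inputs, and the bivariate normal toy problem are assumptions made here, chosen so that the true values of g_m and C_m(D)/C(D) are known in closed form.

```python
import numpy as np

def log_weights(log_L_sub, log_w, log_L_full):
    """log of L(beta_m | D, m) * w(beta_-m | beta_m) / L(beta | D), one value per draw."""
    return log_L_sub + log_w - log_L_full

def g_hat(g_vals, log_r):
    """Ratio estimator (35); subtracting the max only rescales numerator and denominator."""
    r = np.exp(log_r - np.max(log_r))
    return np.sum(g_vals * r) / np.sum(r)

def log_B_hat(log_r):
    """log of B_S in (37), which estimates log{C_m(D) / C(D)} by (33)."""
    c = np.max(log_r)
    return c + np.log(np.mean(np.exp(log_r - c)))

# Toy problem (hypothetical): the "full model" kernel is an unnormalized bivariate
# normal, the "sub-model" kernel is an unnormalized N(mu, tau^2) in the first coordinate.
rng = np.random.default_rng(0)
S, rho, mu, tau = 20000, 0.6, 0.5, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
beta = rng.multivariate_normal(np.zeros(2), cov, size=S)  # independent full-model draws
b1, b2 = beta[:, 0], beta[:, 1]

log_L_full = -0.5 * np.einsum("si,ij,sj->s", beta, np.linalg.inv(cov), beta)
log_L_sub = -0.5 * ((b1 - mu) / tau) ** 2
# w is taken to be the exact conditional density of b2 given b1 under the full model
cond_var = 1.0 - rho ** 2
log_w = -0.5 * np.log(2 * np.pi * cond_var) - 0.5 * (b2 - rho * b1) ** 2 / cond_var

log_r = log_weights(log_L_sub, log_w, log_L_full)
est_mean = g_hat(b1, log_r)   # g(beta_m) = beta_m, so the target is mu
est_logB = log_B_hat(log_r)   # the target is log(C_m / C)
true_logB = np.log(tau * np.sqrt(2 * np.pi)) - np.log(2 * np.pi * np.sqrt(np.linalg.det(cov)))
```

Working with log-weights avoids overflow when the likelihood kernels are very large or very small; the common rescaling cancels in (35) and is added back on the log scale for B_S.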

Letting

A_{S} = \frac{1}{S} \sum_{s = 1}^{S} \frac{g (β_{s}^{(m)}) L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}

(36)

and

B_{S} = \frac{1}{S} \sum_{s = 1}^{S} \frac{L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)},

(37)

we have

lim_{S \to \infty} A_{S} = \frac{C_{m} (D)}{C (D)} g_{m} \equiv A,

(38)

and

lim_{S \to \infty} B_{S} = \frac{C_{m} (D)}{C (D)} \equiv B .

(39)

From (38) and (39), we obtain

g_{m} = \frac{C (D)}{C_{m} (D)} A = \frac{A}{B} .

(40)

Using (36)–(40), we have

{\hat{g}}_{m} - g_{m} = \frac{A_{S}}{B_{S}} - g_{m} = \frac{A_{S}}{B_{S}} - \frac{A}{B} = \frac{A_{S} - \frac{A}{B} B_{S}}{B_{S}} = A \frac{\frac{A_{S}}{A} - \frac{B_{S}}{B}}{B_{S}} = g_{m} \frac{\frac{A_{S}}{A} - \frac{B_{S}}{B}}{\frac{B_{S}}{B}} .

(41)

In (41), ${lim}_{S \to \infty} \frac{B_{S}}{B} = 1$ and

lim_{S \to \infty} (\frac{A_{S}}{A} - \frac{B_{S}}{B}) = 0

(42)

In addition, we have

\begin{array}{l} \frac{A_{S}}{A} - \frac{B_{S}}{B} = \frac{1}{S} \sum_{s = 1}^{S} [\frac{1}{A} {\frac{g (β_{s}^{(m)}) L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}} \\ - \frac{1}{B} {\frac{L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}}] . \end{array}

We are then led to the following theorem.

Theorem 6

Let {β_s, s = 1, 2, …, S} be a random sample. Assume A ≠ 0,

V_{w} (g_{m}) = E [{\{\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{A L (β ∣ D)} - \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{B L (β ∣ D)}\}}^{2} ∣ D] < \infty

(43)

and

E [{g (β^{(m)})}^{2} ∣ D] < \infty,

(44)

where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Then we have

lim_{S \to \infty} [S \times E {{(\frac{{\hat{g}}_{m} - g_{m}}{g_{m}})}^{2}}] = V_{w} (g_{m}),

(45)

where V_w(g_m) is defined by (43) and

\sqrt{S} ({\hat{g}}_{m} - g_{m}) \overset{D}{\to} N (0, g_{m}^{2} V_{w} (g_{m})) .

The proof of Theorem 6 follows directly from the proof of Theorem 3.1 of Chen and Shao (1997), so the details are omitted for brevity. From (45), we notice that $E {[\frac{{\hat{g}}_{m} - g_{m}}{g_{m}}]}^{2}$ is the relative mean-square error, and Theorem 6 implies that when S is large,

E {(\frac{{\hat{g}}_{m} - g_{m}}{g_{m}})}^{2} \approx \frac{1}{S} V_{w} (g_{m}) .

Remark 4.1

As discussed in Chen et al. (2000), the simulation standard error of ĝ_m can be approximated by

s e ({\hat{g}}_{m}) = \frac{∣ {\hat{g}}_{m} ∣}{\hat{A}} \sqrt{\frac{1}{S^{2}} \sum_{s = 1}^{S} {[{g (β_{s}^{(m)}) - {\hat{g}}_{m}} \frac{L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}]}^{2}},

where Â = A_S.
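A simulation standard error in the spirit of this remark can be sketched as a plug-in version of Theorem 6: by (45), the variance of ĝ_m is approximately g_m² V_w(g_m)/S, and V_w(g_m) in (43) can be estimated by replacing A and B with A_S and B_S. The function below is an illustrative sketch only; its name and log-scale input are assumptions, and it presumes independent draws and A_S ≠ 0.

```python
import numpy as np

def g_hat_with_se(g_vals, log_r):
    """Return (g_hat, se), with se = |g_hat| * sqrt(V_hat / S) as suggested by (45).

    g_vals : g(beta_s^(m)) for each draw s
    log_r  : log{ L(beta_s^(m) | D, m) w(.) / L(beta_s | D) } for each draw s
    """
    g_vals = np.asarray(g_vals, dtype=float)
    log_r = np.asarray(log_r, dtype=float)
    r = np.exp(log_r - np.max(log_r))   # a common rescaling cancels in every ratio below
    A_S = np.mean(g_vals * r)           # (36), up to the common scale
    B_S = np.mean(r)                    # (37), up to the common scale
    g_hat = A_S / B_S
    V_hat = np.mean((g_vals * r / A_S - r / B_S) ** 2)   # plug-in version of (43)
    se = abs(g_hat) * np.sqrt(V_hat / r.size)
    return g_hat, se

# With equal weights the estimator is the sample mean and the standard error
# reduces to the usual (population) standard deviation divided by sqrt(S).
g_demo, se_demo = g_hat_with_se([1.0, 2.0, 3.0, 4.0], np.zeros(4))
```

For dependent MCMC output, a batch-based standard error would normally be preferred over this independent-sample formula.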

Remark 4.2

From (34), it is natural to think that a more efficient MC estimate of g_m could be obtained by generating two MC samples from the posterior distribution, so that one sample is used for computing $E {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}$ while the second sample is used for computing $E {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}$ . In this remark, we show that using two MC samples to obtain the MC estimate of g_m is not necessarily more efficient than using just one MC sample. In addition, generating two MC samples requires more computing time. Specifically, suppose that {β_{1,s}, s = 1, 2, …, S₁} and {β_{2,s}, s = 1, 2, …, S₂} are two independent random samples from the joint posterior distribution (4) under the full model. Then g_m can be estimated by

{\hat{g}}_{m}^{*} = \frac{\frac{1}{S_{1}} \sum_{s = 1}^{S_{1}} \frac{g (β_{1, s}^{(m)}) L (β_{1, s}^{(m)} ∣ D, m) w (β_{1, s}^{(- m)} ∣ β_{1, s}^{(m)})}{L (β_{1, s} ∣ D)}}{\frac{1}{S_{2}} \sum_{s = 1}^{S_{2}} \frac{L (β_{2, s}^{(m)} ∣ D, m) w (β_{2, s}^{(- m)} ∣ β_{2, s}^{(m)})}{L (β_{2, s} ∣ D)}} .

(46)

By the δ-Method, we have

\begin{array}{l} E {(\frac{{\hat{g}}_{m}^{*} - g_{m}}{g_{m}})}^{2} = \frac{Var {\frac{1}{S_{1}} \sum_{s = 1}^{S_{1}} \frac{g (β_{1, s}^{(m)}) L (β_{1, s}^{(m)} ∣ D, m) w (β_{1, s}^{(- m)} ∣ β_{1, s}^{(m)})}{L (β_{1, s} ∣ D)}}}{A^{2}} \\ + \frac{Var {\frac{1}{S_{2}} \sum_{s = 1}^{S_{2}} \frac{L (β_{2, s}^{(m)} ∣ D, m) w (β_{2, s}^{(- m)} ∣ β_{2, s}^{(m)})}{L (β_{2, s} ∣ D)}}}{B^{2}} + O {\frac{1}{{(S_{1} + S_{2})}^{2}}} \\ = \frac{1}{S_{1} A^{2}} Var {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)}} \\ + \frac{1}{S_{2} B^{2}} Var {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)}} + O {\frac{1}{{(S_{1} + S_{2})}^{2}}}, \end{array}

where the expectation and variance are taken with respect to the joint posterior distribution (4) under the full model.

Assuming that S₁ = S₂ = S, we have

\begin{array}{l} lim_{S \to \infty} {S \times E {(\frac{{\hat{g}}_{m}^{*} - g_{m}}{g_{m}})}^{2}} = \frac{1}{A^{2}} Var {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)}} \\ + \frac{1}{B^{2}} Var {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)}} . \end{array}

(47)

Thus, if

E {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{A L (β ∣ D)} \times \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{B L (β ∣ D)}} \geq 0,

(48)

we have

lim_{S \to \infty} {S \times E {(\frac{{\hat{g}}_{m}^{*} - g_{m}}{g_{m}})}^{2}} \geq lim_{S \to \infty} {S \times E {(\frac{{\hat{g}}_{m} - g_{m}}{g_{m}})}^{2}} .

It is easy to see that when g(β⁽^m⁾) ≥ 0 or g(β⁽^m⁾) ≤ 0, (48) automatically holds. Therefore, for many cases, it is unnecessary to use two MC samples instead of one MC sample in obtaining the MC estimate of g_m.

Note that the estimate ĝ_m depends on w(β⁽⁻^m⁾|β⁽^m⁾). It is reasonable to argue that the best choice of w should yield the smallest asymptotic variance of the estimate ĝ_m among all possible w’s. The following theorem precisely addresses this optimality issue.

Theorem 7

Let

w_{opt} = π (β^{(- m)} ∣ β^{(m)}, D)

(49)

be the conditional posterior density of β⁽⁻^m⁾ given β⁽^m⁾ under the full model, then we have

V_{w_{opt}} (g_{m}) \leq V_{w} (g_{m})

(50)

for all w’s, where V_w(g_m) is defined by (43).

Remark 4.3

Note that (50) holds for any function g that satisfies the condition given in (44). Thus, for various functions g involved in LPML_m, L_m(ν) and DIC_m, the best choice of w is the same w_opt given in (49).

Remark 4.4

When we use ${\hat{g}}_{m}^{*}$ in (46), we can also show that w_opt = π(β⁽⁻^m⁾ | β⁽^m⁾, D) yields the smallest asymptotic relative mean-square error of ${\hat{g}}_{m}^{*}$ , for example, the one given by (47).

Remark 4.5

For computing CPO_i in (5) under model m, we do not need to compute $\frac{C (D)}{C_{m} (D)}$ in (32). In fact, it is easy to see that

{CPO}_{i}^{(m)} = \frac{E {g_{1} (β^{(m)}) \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}}{E {g_{2} (β^{(m)}) \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}},

where $g_{1} (β^{(m)}) = exp [- a_{0} (y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)}))]$ and $g_{2} (β^{(m)}) = {f (y_{i} ∣ x_{i}^{(m)}, β^{(m)}) exp [a_{0} (y_{0 i} θ_{i}^{(m)} - b (θ_{i}^{(m)}))]}^{- 1}$ . Thus, given a MCMC sample { $β_{s} = ({β^{(m)}}_{s}^{'}, {β^{(- m)}}_{s}^{'})$ , s = 1, 2, …, S} from the joint posterior distribution (4), a MC estimate of CPO_i is given as follows:

{\hat{CPO}}_{i}^{(m)} = \frac{\sum_{s = 1}^{S} \frac{g_{1} (β_{s}^{(m)}) L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}}{\sum_{s = 1}^{S} \frac{g_{2} (β_{s}^{(m)}) L (β_{s}^{(m)} ∣ D, m) w (β_{s}^{(- m)} ∣ β_{s}^{(m)})}{L (β_{s} ∣ D)}} .

Following the proof of Theorem 7, we can easily show that the optimal choice of w for ${\hat{CPO}}_{i}^{(m)}$ is still the same w_opt given in (49).
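The CPO estimate in this remark is a ratio of two weighted sums and is conveniently evaluated on the log scale. The sketch below is illustrative only: the function names and array layout are assumptions, and the LPML helper uses the usual definition of LPML as the sum of the log CPO values over observations.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    a = np.asarray(a, dtype=float)
    c = np.max(a)
    return c + np.log(np.sum(np.exp(a - c)))

def log_cpo_hat(log_g1, log_g2, log_r):
    """log of the CPO estimate: numerator uses g1, denominator uses g2, same weights r."""
    return logsumexp(np.asarray(log_g1) + log_r) - logsumexp(np.asarray(log_g2) + log_r)

def lpml_hat(log_g1_rows, log_g2_rows, log_r):
    """Sum of log CPO_i; row i holds log g1 (or log g2) for observation i over all draws."""
    return sum(log_cpo_hat(a, b, log_r) for a, b in zip(log_g1_rows, log_g2_rows))

# Sanity check on made-up numbers: if g1 = 0.5 * g2 for every draw, then CPO = 0.5.
rng = np.random.default_rng(3)
log_r_demo = rng.normal(size=50)
log_g2_demo = rng.normal(size=(3, 50))
log_g1_demo = log_g2_demo + np.log(0.5)
lpml_demo = lpml_hat(log_g1_demo, log_g2_demo, log_r_demo)
```

Because the same weights appear in the numerator and the denominator, any constant added to `log_r` cancels, which is why the ratio C(D)/C_m(D) is not needed here.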

Remark 4.6

To compute LPML, L(ν) and DIC under the full model, we can simply take the sub-model to be the full model, so that β^{(m)} = β and w(β^{(−m)}|β^{(m)}) = 1. Then, for various functions g, (35) reduces to

\hat{g} = \frac{1}{S} \sum_{s = 1}^{S} g (β_{s}),

where {β_s, s = 1, 2, …, S} is an MCMC sample from the posterior distribution (4) under the full model.

Remark 4.7

As shown in Theorem 7, the optimal choice of w is w_opt = π(β⁽⁻^m⁾ | β⁽^m⁾, D). However, for the GLM in (2), w_opt is not available in closed form. Fortunately, for the GLM, a good w(β⁽⁻^m⁾|β⁽^m⁾), which is close to the optimal choice, can be constructed based on the asymptotic approximation to the joint posterior proposed by Chen (1985). Let β̂ denote the posterior mode of β under the full model, i.e.,

\hat{β} = \underset{β}{arg max} log L (β ∣ D) = \underset{β}{arg max} {(y + a_{0} y_{0})^{'} θ - (1 + a_{0}) J^{'} b (θ)} .

Also let

\hat{Σ} = {- \frac{\partial^{2} log L (β ∣ D)}{\partial β \partial β^{'}} ∣_{β = \hat{β}}}^{- 1} .

Then, the joint posterior π(β|D) under the full model can be approximated by

\hat{π} (β ∣ \hat{β}, D) = {(2 π)}^{- \frac{k + 1}{2}} ∣ \hat{Σ} ∣^{- \frac{1}{2}} exp {- \frac{1}{2} (β - \hat{β})^{'} {\hat{Σ}}^{- 1} (β - \hat{β})} .

(51)

Using (51), we simply take w(β⁽⁻^m⁾ | β⁽^m⁾) = π̂(β⁽⁻^m⁾ | β⁽^m⁾, β̂, D), which is the conditional distribution of β⁽⁻^m⁾ given β⁽^m⁾ with respect to the (k + 1)-dimensional multivariate normal distribution in (51).
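Given β̂ and Σ̂, the conditional density used for w follows from the standard partitioned-normal formulas. The sketch below shows one way to evaluate log w for a single draw; the function names, the index convention, and the numerical values of β̂ and Σ̂ are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal, via a Cholesky factor of cov."""
    diff = np.atleast_1d(np.asarray(x, dtype=float) - mean)
    L = np.linalg.cholesky(np.atleast_2d(cov))
    z = np.linalg.solve(L, diff)
    return -0.5 * diff.size * np.log(2 * np.pi) - np.sum(np.log(np.diag(L))) - 0.5 * z @ z

def conditional_normal(beta_hat, Sigma_hat, idx_m, beta_m):
    """Mean and covariance of beta^(-m) given beta^(m) under N(beta_hat, Sigma_hat)."""
    idx_m = np.asarray(idx_m)
    idx_c = np.setdiff1d(np.arange(beta_hat.size), idx_m)   # coordinates excluded from model m
    S_mm = Sigma_hat[np.ix_(idx_m, idx_m)]
    S_cm = Sigma_hat[np.ix_(idx_c, idx_m)]
    S_cc = Sigma_hat[np.ix_(idx_c, idx_c)]
    K = np.linalg.solve(S_mm, S_cm.T).T                      # S_cm @ inv(S_mm)
    mean = beta_hat[idx_c] + K @ (beta_m - beta_hat[idx_m])
    cov = S_cc - K @ S_cm.T
    return idx_c, mean, cov

# Hypothetical posterior mode and inverse negative Hessian for a three-coefficient model
beta_hat = np.array([0.1, -0.2, 0.3])
Sigma_hat = np.array([[1.0, 0.3, 0.2],
                      [0.3, 2.0, 0.5],
                      [0.2, 0.5, 1.5]])
idx_m = [0, 2]                       # sub-model m keeps coefficients 0 and 2
beta = np.array([0.4, -0.1, 0.7])    # one draw from the full-model posterior

idx_c, c_mean, c_cov = conditional_normal(beta_hat, Sigma_hat, idx_m, beta[idx_m])
log_w = mvn_logpdf(beta[idx_c], c_mean, c_cov)
log_joint = mvn_logpdf(beta, beta_hat, Sigma_hat)
log_marg = mvn_logpdf(beta[idx_m], beta_hat[idx_m], Sigma_hat[np.ix_(idx_m, idx_m)])
```

The conditional mean and covariance depend on the draw only through β^{(m)}, so the gain matrix and conditional covariance can be computed once per sub-model and reused for all S draws.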

Remark 4.8

As a by-product, C_m(D)/C(D) can be readily computed via the identity (33). It can also be shown that

\frac{C_{0 m} (y_{0})}{C_{0} (y_{0})} = E {\frac{L (β^{(m)} ∣ y_{0}, a_{0}, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ y_{0}, a_{0})} ∣ y_{0}, a_{0}},

(52)

where $L (β^{(m)} ∣ y_{0}, a_{0}, m) = exp [a_{0} {y_{0}^{'} θ^{(m)} - J^{'} b (θ^{(m)})}]$ and the expectation is taken with respect to the prior distribution in (3) under the full model. After examining the construction of the conjugate prior and the form of the GLM in (2), we can also show that

B_{m} = \frac{C_{m} (D) / C (D)}{C_{0 m} (y_{0}) / C_{0} (y_{0})} = \frac{π (β^{(- m)} = 0 ∣ D)}{π (β^{(- m)} = 0 ∣ y_{0}, a_{0})},

(53)

where π(β⁽⁻^m⁾ = 0|D) and π(β⁽⁻^m⁾ = 0|y₀, a₀) are the marginal posterior density and the marginal prior density of β⁽⁻^m⁾ evaluated at β⁽⁻^m⁾ = 0 under the full model. Furthermore, B_m in (53) is the Bayes factor for comparing model m to the full model. Thus, to compute B_m, we need to generate two MCMC samples, one from the posterior distribution and another one from the prior distribution of β under the full model, and then use (33) and (52).
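On the log scale, the Bayes factor in (53) is the difference between two log-averages: one over posterior draws, estimating (33), and one over prior draws, estimating (52). A minimal sketch, with hypothetical function names and inputs, is:

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log of the sample mean of exp(a)."""
    a = np.asarray(a, dtype=float)
    c = np.max(a)
    return c + np.log(np.mean(np.exp(a - c)))

def log_bayes_factor(log_r_post, log_r_prior):
    """log B_m for model m versus the full model.

    log_r_post  : log{ L(beta^(m) | D, m) w / L(beta | D) } at draws from the full posterior
    log_r_prior : log{ L(beta^(m) | y0, a0, m) w / L(beta | y0, a0) } at draws from the full prior
    """
    return log_mean_exp(log_r_post) - log_mean_exp(log_r_prior)

# Made-up inputs: constant ratios 2 and 4 give B_m = 2 / 4 = 0.5.
log_B_demo = log_bayes_factor(np.full(10, np.log(2.0)), np.full(7, np.log(4.0)))
```

The two samples need not have the same size, since each average is normalized separately.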

Finally, we note that w_opt is derived under the independence assumption. We expect that this optimal choice will work well even when a dependent MCMC sample is used. Some related empirical studies have been reported and discussed in Meng and Wong (1996), DiCiccio et al. (1997) and Meng and Schilling (2002). They suggested that the optimal or near-optimal procedures constructed under the independence assumption can work remarkably well in general, providing orders of magnitude improvement over other methods with similar computational effort. Alternatively, suppose we systematically take a 1-in-b subsample of size S from the Markov chain that is generated from the joint posterior distribution in (4). Then, following Guha et al. (2004), we can show that (45) holds under some mild regularity conditions, such as geometric ergodicity and a sufficiently large b. Thus, an MCMC sample taken in this way can be treated as “a random sample.”

5 A Simulation Study

In Section 3, we established theoretical connections among AIC, BIC and the four Bayesian criteria in the normal linear regression setting. However, analytic connections between AIC or BIC and the four Bayesian criteria do not appear to be available for Poisson regression. For this reason, we present a simulation study for Poisson regression to empirically examine whether there exist any connections among these criteria and to examine the performance of conjugate priors in the context of variable selection. Suppose y_i|θ_i are independent Poisson observations with mean $e^{x_{i}^{'} β}$ , where $x_{i}^{'}$ is a 1 × p vector, i = 1, 2, …, n. The conjugate prior takes the form

π (β ∣ a_{0}, y_{0}) \propto exp {\sum_{i = 1}^{n} a_{0} (y_{0 i} x_{i}^{'} β - exp {x_{i}^{'} β})},

(54)

where y_{0i} is the i-th component of y₀. In the simulation, we assume that x_{i0} = 1 and x_{ij} ~ N(0, 1) independently for j = 1, 2, 3 and i = 1, 2, …, n. In (54), we take y_{0i} = 1 for i = 1, 2, …, n, which yields a prior mode of β equal to 0, as shown in Chen and Ibrahim (2003). Further, we use β = (−0.3, 0.3, 0, 0)′, β = (−0.3, 0.3, 0.2, 0)′, and β = (−0.3, 0.3, 0.2, −0.15)′, which correspond to the true models (x₁), (x₁, x₂), and (x₁, x₂, x₃) (full model), respectively. We use a sample size of n = 500.

Under the simulation design, we independently generated N = 500 datasets. For each simulated dataset, we fit 2³ = 8 models. To compute the posterior model probabilities based on the conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. For all 8 models, we computed BF, DIC, the L measure, LPML, AIC, and BIC.
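The frequentist part of one replicate of this design can be sketched as follows: a single dataset is generated with true model (x₁, x₂), all 8 sub-models are fitted by maximum likelihood, and AIC and BIC are computed. This is an illustration only, not the code used for the study. The seed, the Newton-Raphson fitting routine, and the conventions AIC = −2 log L + 2p and BIC = −2 log L + p log n are assumptions made here.

```python
import itertools
from math import lgamma
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Newton-Raphson MLE for log-link Poisson regression; returns (beta_hat, loglik)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        # Newton step: solve (X' diag(mu) X) step = X' (y - mu)
        step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = float(np.sum(y * eta - np.exp(eta)) - sum(lgamma(v + 1.0) for v in y))
    return beta, loglik

rng = np.random.default_rng(2008)          # arbitrary seed
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])   # x_i0 = 1, x_ij ~ N(0, 1)
beta_true = np.array([-0.3, 0.3, 0.2, 0.0])                      # true model (x1, x2)
y = rng.poisson(np.exp(X @ beta_true))

results = {}
for size in range(4):
    for subset in itertools.combinations((1, 2, 3), size):
        cols = [0, *subset]                 # the intercept is always included
        _, ll = fit_poisson(X[:, cols], y)
        p = len(cols)
        results[subset] = {"loglik": ll, "AIC": -2 * ll + 2 * p, "BIC": -2 * ll + p * np.log(n)}

best_aic = min(results, key=lambda m: results[m]["AIC"])
best_bic = min(results, key=lambda m: results[m]["BIC"])
```

Repeating this over N independently generated datasets and counting how often `best_aic` and `best_bic` equal the true subset gives frequencies of the kind reported for AIC and BIC.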

Tables 1 and 2 show results for the various methods. Our model performance evaluation criterion is a 0-1 loss function, the loss being 0 if the true model is selected and 1 otherwise. In Table 1, we see that BIC performs better than AIC in the number of times the true model is selected as best when the true model is a smaller model. For example, when (x₁) is the true model, AIC correctly identifies this model as best 361 times out of 500 and BIC correctly identifies this model as best 490 times. Table 2 compares the performance of the four other criteria under several values of a₀ from the conjugate prior as well as several values of ν for the L measure. We see from the table that, in general, for small values of a₀, which imply a noninformative prior, the Bayes factor results are quite consistent with DIC, the L measure, and LPML for small models being the true models, whereas when the full model is the true model, the Bayes factor tends to do worse for small a₀ compared to large a₀. In general, as a₀ increases, the performance of DIC, LPML, and the Bayes factor becomes worse, whereas for the L measure, it is fairly robust over several values of a₀. The L measure seems to perform best under moderate values of ν, such as ν = 0.5.

Table 1.

Frequencies for Ranking the True Model as Best Using AIC and BIC Based on n = 500 and N = 500 Datasets

True Model	AIC	BIC
(x₁)	361	490
(x₁, x₂)	425	446
(x₁, x₂, x₃)	474	316


Table 2.

Frequencies for Ranking the True Model as Best Using BF, DIC, CPO and L measure for Various a₀ Based on n = 500 and N = 500 Datasets

					L Measure (ν)
True Model	a₀	LPML	DIC	BF	0.1	0.3	0.5	0.7	0.9
(x₁)	0.001	395	361	492	398	396	359	318	276
	0.01	396	357	466	396	396	357	319	275
	0.1	377	332	386	408	396	352	304	268
	0.5	342	308	311	424	381	335	279	243
	1	320	299	288	424	372	321	264	222

(x₁, x₂)	0.001	425	425	436	164	347	390	380	356
	0.01	423	425	470	157	352	390	383	355
	0.1	417	417	443	195	370	399	372	353
	0.5	398	405	405	254	400	402	362	339
	1	382	394	391	269	410	390	359	329

(x₁, x₂, x₃)	0.001	475	474	291	88	371	456	475	480
	0.01	475	474	388	94	375	458	475	482
	0.1	479	475	460	125	402	466	480	488
	0.5	485	479	479	176	436	478	486	489
	1	486	481	481	214	453	483	487	490


6 A Real Data Example

Due to the lack of analytic connections between AIC or BIC and the four Bayesian criteria for logistic regression, we consider the Chapman data from the Los Angeles Heart Study of men (n = 200) presented in Dixon and Massey (1983) to empirically examine whether there exist any connections among these criteria.

In our analysis, we consider a coronary incident as a binary response variable (y), which takes the values 0 and 1, where 1 denotes that an incident occurred in the previous ten years and 0 indicates otherwise. We consider five prognostic factors: age (Ag), systolic blood pressure in millimeters of mercury (S), diastolic blood pressure in millimeters of mercury (D), cholesterol in milligrams per dL (Ch), and BMI = (703.07 × Weight)/(Height²).

Let x₁, x₂, x₃, x₄, and x₅ denote Ag, S, D, Ch, and BMI, respectively. For the Chapman data, we fit a logistic regression model

logit {P (y = 1 ∣ x)} = log {\frac{P (y = 1 ∣ x)}{1 - P (y = 1 ∣ x)}} = x^{'} β .

(55)

The conjugate prior in (3) corresponding to the model (55) takes the form

π (β ∣ a_{0}, y_{0}) \propto exp (\sum_{i = 1}^{n} a_{0} [y_{0 i} x_{i}^{'} β - log {1 + exp (x_{i}^{'} β)}]),

(56)

where y_{0i} = 0.5, i = 1, 2, …, n, to ensure that the prior mode of β is 0. We wish to compare the following 32 models: Intercept only, (x₁), …, (x₅), (x₁, x₂), …, (x₁, x₂, x₃, x₄, x₅). We note that the notation (x₁, x₂, x₃, x₄, x₅), for example, implies that $x_{i}^{'} β = β_{0} + β_{1} {Ag}_{i} + β_{2} S_{i} + β_{3} D_{i} + β_{4} {Ch}_{i} + β_{5} {BMI}_{i}$ in (55). Thus, “Intercept only” is the model with zero predictors while (x₁, x₂, x₃, x₄, x₅) is the full model with the largest model dimension. We also note that an intercept is included in every model. Further, we denote the models by M₁ = (Int), M₂ = (Int, Ag), M₃ = (Int, S), M₄ = (Int, D), M₅ = (Int, Ch), M₆ = (Int, BMI), M₇ = (Int, Ag, S), M₈ = (Int, Ag, D), M₉ = (Int, Ag, Ch), M₁₀ = (Int, Ag, BMI), M₁₁ = (Int, S, D), M₁₂ = (Int, S, Ch), M₁₃ = (Int, S, BMI), M₁₄ = (Int, D, Ch), M₁₅ = (Int, D, BMI), M₁₆ = (Int, Ch, BMI), M₁₇ = (Int, Ag, S, D), M₁₈ = (Int, Ag, S, Ch), M₁₉ = (Int, Ag, S, BMI), M₂₀ = (Int, Ag, D, Ch), M₂₁ = (Int, Ag, D, BMI), M₂₂ = (Int, Ag, Ch, BMI), M₂₃ = (Int, S, D, Ch), M₂₄ = (Int, S, D, BMI), M₂₅ = (Int, S, Ch, BMI), M₂₆ = (Int, D, Ch, BMI), M₂₇ = (Int, Ag, S, D, Ch), M₂₈ = (Int, Ag, S, D, BMI), M₂₉ = (Int, Ag, S, Ch, BMI), M₃₀ = (Int, Ag, D, Ch, BMI), M₃₁ = (Int, S, D, Ch, BMI), and M₃₂ = (Int, Ag, S, D, Ch, BMI).
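The labelling M₁, …, M₃₂ orders the models by size and, within each size, lexicographically in (Ag, S, D, Ch, BMI). As a small sketch (the variable names are ours), this ordering is reproduced by enumerating subsets with `itertools.combinations`:

```python
from itertools import combinations

factors = ["Ag", "S", "D", "Ch", "BMI"]

# models[k] is the tuple of terms in M_k; the intercept is included in every model
models = {}
k = 1
for size in range(len(factors) + 1):
    for subset in combinations(factors, size):
        models[k] = ("Int", *subset)
        k += 1
```

A lookup such as `models[22]` then recovers the terms of a model from its index when reading the tables.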

To compute the posterior model probability (PMP), DIC, LPML, and L measure under various conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. We see from Table 3 that M₂₂ is selected as the best model by AIC and the fourth best model by BIC, whereas M₁₀ is selected as the second best model by both criteria. Table 4 shows the results of the L measure, posterior model probability (PMP), LPML, and DIC for several values of a₀, as well as several values of ν for the L measure. Table 4 tells a similar story to the simulation study. Model M₂₂ is selected as either the top model or second best model for most values of a₀ for DIC and PMP, as well as for the L measure under small values of ν. Under larger values of ν, the L measure as well as LPML appear to favor model M₃₂. Finally, for small values of a₀, LPML and PMP appear to favor a smaller model, namely M₂. Thus, from these analyses, models {M₂, M₂₂, M₃₂} appear to be the most promising based on all of these model selection criteria. Table 5 shows the top five models selected for each of the four variable selection criteria (PMP, DIC, L measure, LPML). Again we see a remarkable consistency between the four criteria, in which the ordering of the top models is similar for the four criteria for small, moderate, and large values of a₀, and for a wide range of ν values for the L measure.

Table 3.

The Top Model Based on AIC and BIC for Chapman Data

AIC		BIC
M_k	Values	M_k	Values
M₂₂	142.75	M₂	153.34
M₁₀	143.73	M₁₀	153.63
M₂₉	144.69	M₉	155.83
M₃₀	144.75	M₂₂	155.94
M₁₉	145.57	M₁₆	155.99


Table 4.

The Best Model Based on Posterior Model Probability (PMP), DIC, LPML, and L Measure for Chapman Data

	a₀ = 0.001		a₀ = 0.01		a₀ = 0.1		a₀ = 0.5		a₀= 1.0
Criterion	M_k	Values	M_k	Values	M_k	Values	M_k	Values	M_k	Values
PMP	M₂	0.57	M₂	0.25	M₂₂	0.14	M₂₂	0.07	M₂₂	0.06
DIC	M₂₂	142.83	M₂₂	142.67	M₂₂	144.74	M₂₂	165.65	M₂₂	186.77
LPML	M₂	−73.38	M₂	−73.30	M₃₂	−73.79	M₃₂	−83.10	M₃₀	−93.29
L(ν = 0.1)	M₂₂	21.47	M₂₂	21.98	M₂₂	26.96	M₂₂	38.92	M₃₀	45.21
L(ν = 0.25)	M₂₂	24.79	M₂₂	25.29	M₂₂	30.23	M₂₂	42.56	M₃₀	49.39
L(ν = 0.5)	M₃₂	30.20	M₃₂	30.73	M₃₂	35.66	M₂₉	48.59	M₃₀	56.36
L(ν = 0.75)	M₃₂	35.24	M₃₂	35.76	M₃₂	40.77	M₃₂	54.52	M₃₀	63.33
L(ν = 0.9)	M₃₂	38.26	M₃₂	38.78	M₃₂	43.83	M₃₂	58.06	M₃₀	67.51


Table 5.

The Top Five Models Based on PMP, DIC, LPML, and L Measure for Chapman Data

	a₀ = 0.001		a₀ = 0.01		a₀ = 0.1		a₀ = 0.5		a₀ = 1.0
Criterion	M_k	Values	M_k	Values	M_k	Values	M_k	Values	M_k	Values
PMP	M₂	0.57	M₂	0.25	M₂₂	0.14	M₂₂	0.07	M₂₂	0.06
	M₁	0.11	M₁₀	0.23	M₁₀	0.14	M₁₀	0.07	M₁₀	0.05
	M₅	0.07	M₉	0.08	M₉	0.06	M₂₉	0.05	M₁₉	0.04
	M₁₀	0.07	M₂₂	0.08	M₂	0.06	M₁₉	0.05	M₂₉	0.04
	M₆	0.06	M₁₆	0.07	M₁₆	0.06	M₃₀	0.05	M₂₁	0.04

DIC	M₂₂	142.83	M₂₂	142.67	M₂₂	144.74	M₂₂	165.65	M₂₂	186.77
	M₁₀	143.79	M₁₀	143.70	M₁₀	145.70	M₁₀	166.02	M₁₀	186.88
	M₃₀	144.85	M₂₉	144.74	M₂₉	146.43	M₂₉	166.42	M₃₀	187.10
	M₂₉	144.96	M₃₀	144.78	M₃₀	146.48	M₁₉	166.87	M₂₁	187.38
	M₂₁	145.63	M₂₁	145.59	M₂₁	147.29	M₃₀	166.90	M₂₀	187.75

LPML	M₂	−73.38	M₂	−73.30	M₃₂	−73.79	M₃₂	−83.10	M₃₀	−93.29
	M₅	−73.56	M₅	−73.50	M₂	−74.11	M₂₉	−83.32	M₃₂	−93.35
	M₄	−73.58	M₆	−73.55	M₇	−74.43	M₁₀	−83.34	M₂₁	−93.41
	M₆	−73.73	M₄	−73.64	M₈	−74.48	M₁₉	−83.55	M₁₀	−93.41
	M₃₂	−73.91	M₃	−73.65	M₉	−74.68	M₂₁	−83.56	M₂₀	−93.55

L ν = 0.1	M₂₂	21.47	M₂₂	21.98	M₂₂	26.96	M₂₂	38.92	M₃₀	45.21
	M₃₀	22.04	M₂₅	22.63	M₂₅	27.33	M₂₅	38.99	M₂₂	45.23
	M₂₆	22.13	M₃₀	22.64	M₃₀	27.41	M₂₉	39.02	M₂₆	45.26
	M₃₂	22.15	M₂₉	22.65	M₂₆	27.45	M₁₉	39.16	M₂₀	45.28
	M₁₀	22.21	M₁₀	22.67	M₂₉	27.47	M₁₀	39.18	M₂₁	45.31

L ν = 0.25	M₂₂	24.79	M₂₂	25.29	M₂₂	30.23	M₂₂	42.56	M₃₀	49.39
	M₃₀	25.17	M₃₂	25.69	M₃₂	30.56	M₂₉	42.61	M₂₂	49.45
	M₃₂	25.17	M₃₀	25.77	M₃₀	30.56	M₂₅	42.67	M₂₀	49.49
	M₁₀	25.43	M₂₉	25.78	M₂₉	30.62	M₃₂	42.73	M₂₆	49.51
	M₂₀	25.47	M₁₀	25.89	M₂₅	30.65	M₁₉	42.79	M₂₁	49.52

L ν = 0.5	M₃₂	30.20	M₃₂	30.73	M₃₂	35.66	M₂₉	48.59	M₃₀	56.36
	M₂₂	30.31	M₂₂	30.80	M₂₂	35.70	M₃₂	48.62	M₃₂	56.47
	M₃₀	30.38	M₃₀	30.98	M₃₀	35.81	M₂₂	48.64	M₂₂	56.50
	M₂₀	30.77	M₂₉	31.00	M₂₉	35.88	M₃₀	48.78	M₂₀	56.50
	M₁₀	30.80	M₁₀	31.26	M₁₀	36.10	M₂₅	48.79	M₂₁	56.55

L ν = 0.75	M₃₂	35.24	M₃₂	35.76	M₃₂	40.77	M₃₂	54.52	M₃₀	63.33
	M₃₀	35.59	M₃₀	36.20	M₃₀	41.06	M₂₉	54.56	M₃₂	63.41
	M₂₂	35.84	M₂₉	36.22	M₂₉	41.14	M₂₂	54.71	M₂₀	63.52
	M₂₉	36.04	M₂₂	36.31	M₂₂	41.16	M₃₀	54.78	M₂₂	63.54
	M₂₀	36.08	M₁₀	36.63	M₁₀	41.48	M₁₉	54.88	M₂₁	63.57

L ν = 0.9	M₃₂	38.26	M₃₂	38.78	M₃₂	43.83	M₃₂	58.06	M₃₀	67.51
	M₃₀	38.72	M₃₀	39.33	M₃₀	44.21	M₂₉	58.15	M₃₂	67.58
	M₂₂	39.16	M₂₉	39.35	M₂₉	44.29	M₂₂	58.36	M₂₀	67.73
	M₂₉	39.17	M₂₂	39.61	M₂₂	44.44	M₃₀	58.37	M₂₂	67.77
	M₂₀	39.26	M₂₇	39.83	M₁₀	44.70	M₁₉	58.51	M₂₁	67.79


Table 6 shows the posterior means (Estimates), the posterior standard errors (SEs), and 95% HPD intervals for the β_j's under model M₂₂ (Ag, Ch, BMI) and model M₃₂ (Ag, S, D, Ch, BMI) when a₀ = 0.01. Table 6 also shows the corresponding maximum likelihood estimates (MLEs), the standard errors, and p-values. We see from Table 6 that the posterior estimates are very close to the MLEs, which is intuitively appealing, as a fairly noninformative prior (a₀ = 0.01) is used. We also see from this table that under these two “best” models, age and BMI are the only two prognostic factors for the coronary incident that are significant at the 5% significance level.

Table 6.

Estimates of the β under Model (Ag, Ch, BMI) and Model (Ag, S, D, Ch, BMI) for the Chapman Data when a₀ = 0.01

		Maximum Likelihood Estimates			Posterior Estimates
Model	Variable	Estimate	SE	p-value	Estimate	SE	95% HPD Interval
M₂₂	Intercept	−2.252	0.275	< .0001	−2.265	0.272	(−2.805, −1.748)
	Ag	0.556	0.245	0.0230	0.554	0.242	(0.087, 1.032)
	Ch	0.405	0.233	0.0816	0.402	0.234	(−0.064, 0.854)
	BMI	0.470	0.204	0.0211	0.465	0.207	(0.069, 0.882)

M₃₂	Intercept	−2.248	0.274	< .0001	−2.292	0.273	(−2.828, −1.766)
	Ag	0.527	0.270	0.0507	0.531	0.270	(0.012, 1.067)
	S	0.106	0.336	0.7523	0.097	0.344	(−0.583, 0.757)
	D	−0.077	0.383	0.8417	−0.069	0.383	(−0.806, 0.687)
	Ch	0.404	0.235	0.0857	0.402	0.240	(−0.074, 0.866)
	BMI	0.474	0.226	0.0361	0.473	0.230	(0.028, 0.930)


To examine the performance of the Monte Carlo method proposed in Section 4, we first computed various model selection criteria under a sub-model using an MCMC sample from the full model. We then computed the same quantities using an MCMC sample drawn directly from the posterior distribution under the same sub-model. For illustrative purposes, we considered a single-variable sub-model M₂ = (Int, Ag) using the conjugate prior (56) with a₀ = 0.01. Using an MCMC sample size of S = 20,000, the Monte Carlo estimates (simulation standard errors) of DIC, LPML, L(ν = 0.1), L(ν = 0.5), and L(ν = 0.9) under model M₂ are 146.68 (0.08), −73.30 (0.04), 23.91 (0.05), 32.44 (0.06), and 40.96 (0.06), respectively, using the proposed Monte Carlo method via (35). With the same MC sample size, these quantities are 146.67 (0.02), −73.29 (0.01), 23.90 (0.02), 32.42 (0.02), and 40.95 (0.02), respectively, using the MC sample directly from the posterior distribution under model M₂. All simulation standard errors were computed using the overlapping batch statistics (OBS) method of Schmeiser et al. (1990). As expected, the simulation standard errors using the MC sample from the full model are slightly larger than those computed using the MC sample directly from model M₂. However, the two sets of MC estimates are very close. This empirically demonstrates that the proposed MC method works quite well. Finally, we compared the computational times of the proposed Monte Carlo method and the exhaustive alternative, in which a separate MCMC sample is generated for each model. With 2,000 “burn-in” iterations and S = 20,000, the computational times of the proposed Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 71.28, 100.11, and 76.36 seconds, respectively, on a Dell WS Xeon dual 2.4 GHz CPU Linux workstation. Using the same number of “burn-in” iterations, the same MC sample size, and the same computer, the computational times of the exhaustive alternative Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 324.05, 357.97, and 322.13 seconds, respectively. Thus, the proposed Monte Carlo method leads to a substantial computational saving over the exhaustive alternative.

7 Concluding Remarks

We have examined and established theoretical and computational relationships between six commonly used methods for variable subset selection. These connections were facilitated by the class of conjugate priors of Chen and Ibrahim (2003). We saw that under this class of priors the four Bayesian criteria were quite similar in terms of model choice, especially under small values of a₀, and the results were fairly robust under a wide choice of a₀ values. Further work remains to be done. In particular, it is of interest to obtain analytic connections between these criteria for specific GLM’s, such as the logistic and Poisson regression models, as well as to examine theoretically the small sample and large sample behavior of these methods. In Section 4, the theory and algorithm are developed for computing the four Bayesian criteria that are defined for the GLM in (2). With some straightforward modification, this theory and algorithm can be applied to compute the four Bayesian criteria that are defined for the general GLM in (1).

Some philosophical issues about model selection are worth noting. In this paper, we have evaluated the performance of all criteria based on how well they can pick up the true sampling model. However, there are other ways of defining the “Bayesian model.” Many advocate that a Bayesian model is specified by the sampling density and the prior, not only by the sampling density. When one evaluates the success of a criterion based only on how well it picks up the sampling model, a comparison between AIC (or BIC) and DIC is not meaningful when DIC is computed using an informative prior. Since AIC is equivalent to DIC based on a noninformative prior, a comparison of AIC (or BIC) to DIC is simply not meaningful when using informative priors. In general, one should avoid such comparisons, and only comparable criteria should be compared. For example, it is meaningful to compare AIC, BIC, DIC, LPML, the L-measure, and the Bayes factor based on noninformative priors. It is meaningful to compare DIC, the L-measure, LPML, and the Bayes factor based on informative priors. Finally, we note that most criteria for model assessment, especially the information criteria, are based on a well-defined utility function. If a utility function is chosen, a comparison to a criterion based on a different utility function is not justified. For example, the Bayes factor and BIC are prior predictive criteria aiming at the explanation of the data given the prior, whereas DIC (AIC as a special case) and LPML are posterior predictive criteria aiming at the explanation of replicate (unseen) data given the posterior. Thus, one must use caution in comparing these criteria in terms of picking up the true sampling model.

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Editor, the Associate Editor, and the two referees for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.

Appendix: Proofs of Theorems

Proof of Theorem 5

Since ∫ w(β⁽⁻^m⁾|β⁽^m⁾)dβ⁽⁻^m⁾ = 1 and β = (β⁽^m⁾′ β⁽⁻^m⁾′)′, we have

\begin{array}{l} g_{m} = \int g (β^{(m)}) \frac{L (β^{(m)} ∣ D, m)}{C_{m} (D)} d β^{(m)} \\ = \int \int g (β^{(m)}) \frac{L (β^{(m)} ∣ D, m)}{C_{m} (D)} w (β^{(- m)} ∣ β^{(m)}) d β^{(- m)} d β^{(m)} \\ = \frac{C (D)}{C_{m} (D)} \int g (β^{(m)}) \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} \frac{L (β ∣ D)}{C (D)} d β \\ = \frac{C (D)}{C_{m} (D)} E {\frac{g (β^{(m)}) L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)} ∣ D}, \end{array}

which completes the proof.

Proof of Theorem 7

From (43), we have

V_{w} (g_{m}) = E [{\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{L (β ∣ D)}}^{2} ∣ D] .

(A.1)

Plugging w_opt into (A.1), we have

\begin{array}{l} V_{w_{opt}} (g_{m}) \\ = \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{C (D)}}^{2} \frac{π {(β^{(- m)} ∣ β^{(m)}, D)}^{2}}{π (β ∣ D)} d β \\ = \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{C (D)}}^{2} \frac{π (β^{(- m)} ∣ β^{(m)}, D) \frac{π (β^{(- m)}, β^{(m)} ∣ D)}{π (β^{(m)} ∣ D)}}{π (β ∣ D)} d β \\ = \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{C (D)}}^{2} \frac{π (β^{(- m)} ∣ β^{(m)}, D)}{π (β^{(m)} ∣ D)} d β \\ = \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{C (D)}}^{2} \frac{d β^{(m)}}{π (β^{(m)} ∣ D)} \int π (β^{(- m)} ∣ β^{(m)}, D) d β^{(- m)} \\ = \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{π (β^{(m)} ∣ D) C (D)}}^{2} π (β^{(m)} ∣ D) d β^{(m)}, \end{array}

(A.2)

where π(β⁽^m⁾ | D) denotes the marginal posterior distribution of β⁽^m⁾ under the full model. Thus, it suffices to show

\begin{array}{l} \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m)}{π (β^{(m)} ∣ D)}}^{2} π (β^{(m)} ∣ D) d β^{(m)} \\ \leq \int {\frac{g (β^{(m)})}{A} - \frac{1}{B}}^{2} {\frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{π (β ∣ D)}}^{2} π (β ∣ D) d β . \end{array}

(A.3)

By the Cauchy-Schwarz inequality, we have

\begin{array}{l} 1 = {\int w (β^{(- m)} ∣ β^{(m)}) d β^{(- m)}}^{2} \\ = {\int \frac{w (β^{(- m)} ∣ β^{(m)})}{\sqrt{π (β^{(- m)} ∣ β^{(m)}, D)}} \sqrt{π (β^{(- m)} ∣ β^{(m)}, D)} d β^{(- m)}}^{2} \\ \leq \int \frac{w^{2} (β^{(- m)} ∣ β^{(m)})}{π (β^{(- m)} ∣ β^{(m)}, D)} d β^{(- m)} \int π (β^{(- m)} ∣ β^{(m)}, D) d β^{(- m)} \\ = \int \frac{w^{2} (β^{(- m)} ∣ β^{(m)})}{π (β^{(- m)} ∣ β^{(m)}, D)} d β^{(- m)} . \end{array}

(A.4)

Using (A.4), the left-hand side of (A.3) satisfies

\begin{array}{l} \int \left\{ \frac{g (β^{(m)})}{A} - \frac{1}{B} \right\}^{2} \left\{ \frac{L (β^{(m)} ∣ D, m)}{π (β^{(m)} ∣ D)} \right\}^{2} π (β^{(m)} ∣ D) d β^{(m)} \\ \leq \int \left\{ \frac{g (β^{(m)})}{A} - \frac{1}{B} \right\}^{2} \left\{ \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{π (β^{(m)} ∣ D)} \right\}^{2} \frac{π (β^{(m)} ∣ D)}{π (β^{(- m)} ∣ β^{(m)}, D)} d β \\ = \int \left\{ \frac{g (β^{(m)})}{A} - \frac{1}{B} \right\}^{2} \left\{ \frac{L (β^{(m)} ∣ D, m) w (β^{(- m)} ∣ β^{(m)})}{π (β ∣ D)} \right\}^{2} π (β ∣ D) d β, \end{array}

which exactly matches the right-hand side of (A.3).
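
The practical meaning of this result can be illustrated with a small simulation. The sketch below is a hedged illustration, not the authors' code: the toy bivariate normal posterior, the submodel kernel, the deliberately mismatched weight `log_w_poor`, and the helper names are all assumptions introduced for the example, and independent exact draws stand in for MCMC samples. It compares the spread of the ratio estimate of g_m across replications when w is the conditional posterior π(β^{(−m)} ∣ β^{(m)}, D) and when w is a valid but poorly centered density.

```python
import numpy as np

mu = np.array([0.5, -0.3])                    # toy full-model posterior mean
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])    # toy full-model posterior covariance
Sigma_inv = np.linalg.inv(Sigma)
cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
m_mean, m_var = 0.8, 0.5                      # toy submodel posterior for beta1


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_w_opt(b2, b1):
    """Conditional posterior of beta2 given beta1 under the full model."""
    cond_mean = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (b1 - mu[0])
    return log_normal_pdf(b2, cond_mean, cond_var)


def log_w_poor(b2, b1):
    """A valid density in beta2, but centered far from the conditional posterior."""
    return log_normal_pdf(b2, 1.5, cond_var)


def ratio_estimate(draws, log_w):
    """Ratio estimate of g_m with g(beta1) = beta1, from full-model draws."""
    b1, b2 = draws[:, 0], draws[:, 1]
    d = draws - mu
    log_L_full = -0.5 * np.einsum("ni,ij,nj->n", d, Sigma_inv, d)
    log_L_sub = -0.5 * (b1 - m_mean) ** 2 / m_var
    log_h = log_L_sub + log_w(b2, b1) - log_L_full
    h = np.exp(log_h - log_h.max())  # the rescaling cancels in the ratio
    return np.sum(b1 * h) / np.sum(h)


rng = np.random.default_rng(1)
reps, n = 200, 2_000
est_opt, est_poor = [], []
for _ in range(reps):
    draws = rng.multivariate_normal(mu, Sigma, size=n)
    est_opt.append(ratio_estimate(draws, log_w_opt))
    est_poor.append(ratio_estimate(draws, log_w_poor))

sd_opt, sd_poor = np.std(est_opt), np.std(est_poor)
print(f"sd with w_opt: {sd_opt:.4f}, sd with poorly matched w: {sd_poor:.4f}")
```

Both choices of w give consistent estimates because each integrates to one in β^{(−m)}. In this toy setup the optimal weight should produce the visibly smaller replication spread, in line with the variance inequality (A.3).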

References

Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov B, Csaki F, editors. International Symposium on Information Theory. Budapest: Akademia Kiado; 1973. pp. 267–281. 589. [Google Scholar]
Brown PJ, Vanucci M, Fearn T. Multivariate Bayesian Variable Selection and Prediction. Journal of the Royal Statistical Society, Series B. 1998;60:627–641. 585. [Google Scholar]
Brown PJ, Vanucci M, Fearn T. Bayes Model Averaging with Selection of Regresors. Journal of the Royal Statistical Society, Series B. 2002;64:519–536. 585. [Google Scholar]
Chen CF. On Asymptotic Normality of Limiting Density Functions with Bayesian Implications. Journal of the Royal Statistical Society, Series B. 1985;47:540–546. 601. [Google Scholar]
Chen M-H, Dey DK, Ibrahim JG. Bayesian Criterion Based Model Assessment for Categorical Data. Biometrika. 2004;91:45–63. 589. [Google Scholar]
Chen MH, Ibrahim JG. Conjugate Priors for Generalized Linear Models. Statistica Sinica. 2003;13:461–476. 585, 586, 587, 588–608. [Google Scholar]
Chen M-H, Ibrahim JG, Shao Q-M, Weiss RE. Prior Elicitation for Model Selection and Estimation in Generalized Linear Mixed Models. Journal of Statistical Planning and Inference. 2003;111:57–76. 586. [Google Scholar]
Chen M-H, Ibrahim JG, Yiannoutsos C. Prior Elicitation, Variable Selection, and Bayesian Computation for Logistic Regression Models. Journal of the Royal Statistical Society, Series B. 1999;61:223–242. 585. [Google Scholar]
Chen M-H, Shao Q-M. On Monte Carlo Methods for Estimating Ratios of Normalizing Constants. The Annals of Statistics. 1997;25:1563–1594. 599. [Google Scholar]
Chen M-H, Shao Q-M, Ibrahim JG. Monte Carlo Methods in Bayesian Computation. New York: Springer-Verlag; 2000. 599. [Google Scholar]
Chipman HA, George EI, McCulloch RE. Bayesian CART Model Search (with Discussion) Journal of the American Statistical Association. 1998;93:935–960. 585. [Google Scholar]
Chipman HA, George EI, McCulloch RE. The practical Implementation of Bayesian Model Selection (with Discussion) In: Lahiri P, editor. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001. pp. 63–134. 585. [Google Scholar]
Chipman HA, George EI, McCulloch RE. Bayesian Treed Generalized Linear Models (with Discussion) In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian Statistics. Vol. 7. Oxford: Oxford University Press; 2003. pp. 85–103. 585. [Google Scholar]
Clyde M. Bayesian Model Averaging and Model Search Strategies (with Discussion) In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 6. Oxford: Oxford University Press; 1999. pp. 157–185. 585. [Google Scholar]
Clyde M, George EI. Model Uncertainty. Statistical Science. 2004;19:81–94. 586. [Google Scholar]
Dellaportas P, Forster JJ. Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Log-linear Models. Biometrika. 1999;86:615–633. 585. [Google Scholar]
Diciccio TJ, Kass RE, Raftery A, Wasserman L. Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association. 1997;92:903–915. 602. [Google Scholar]
Dixon WJ, Massey FJ. Introduction to Statistical Analysis. 4. New York: McGraw-Hill; 1983. 603. [Google Scholar]
Geisser S. Predictive Inference: An Introduction. London: Chapman & Hall; 1993. p. 588. 589. [Google Scholar]
Gelfand AE, Dey DK. Bayesian Model Choice: Asymptotics and Exact Calculations. Journal of the Royal Statistical Society, Series B. 1994;56:501–514. 589–595. [Google Scholar]
Gelfand AE, Dey DK, Chang H. Model Determinating Using Predictive Distributions with Implementation via Sampling-based Methods (with Discussion) In: Bernado JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford: Oxford University Press; 1992. pp. 147–167.pp. 588–589. [Google Scholar]
Gelfand AE, Ghosh SK. Model Choice: A Minimum Posterior Predictive Loss Approach. Biometrika. 1998;85:1–13. 589. [Google Scholar]
George EI. The Variable Selection Problem. Journal of the American Statistical Association. 2000;95:1304–1308. 585. [Google Scholar]
George EI, Foster DP. Calibration and Empirical Bayes Variable Selection. Biometrika. 2000;87:731–747. 585–595. [Google Scholar]
George EI, McCulloch RE. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association. 1993;88:1304–1308. 585. [Google Scholar]
George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7:339–374. 585. [Google Scholar]
George EI, McCulloch RE, Tsay R. Two Approaches to Bayesian Model Selection with Applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348. 585. [Google Scholar]
Guha S, MacEachern SN, Peruggia M. Benchmark Estimation for Markov Chain Monte Carlo Samples. Journal of Computational and Graphical Statistics. 2004;13:683–701. 602. [Google Scholar]
Ibrahim JG, Chen M-H, McEachern SN. Bayesian Variable Selection for Proportional Hazards Models. Canadian Journal of Statistics. 1999;27:701–717. 585. [Google Scholar]
Ibrahim JG, Chen M-H, Ryan LM. Bayesian Variable Selection for Time Series Count Data. Statistica Sinica. 2000;10:971–987. 586. [Google Scholar]
Ibrahim JG, Chen M-H, Sinha D. Criterion Based Methods for Bayesian Model Assessment. Statistica Sinica. 2001a;11:419–443. 589. [Google Scholar]
Ibrahim JG, Chen M-H, Sinha D. Bayesian Survival Analysis. New York: Springer-Verlag; 2001b. 589. [Google Scholar]
Ibrahim JG, Laud PW. A Predictive Approach to the Analysis of Designed Experiments. Journal of the American Statistical Association. 1994;89:309–319. 589. [Google Scholar]
Lahiri P. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001. 586. [Google Scholar]
Laud PW, Ibrahim JG. Predictive Model Selection. Journal of the Royal Statistical Society, Series B. 1995;57:247–262. 585–589. [Google Scholar]
Meng X-L, Schilling S. Warp Bridge Sampling. Journal of Computational and Graphical Statistics. 2002;11:552–586. 602. [Google Scholar]
Meng X-L, Wong WH. Simulating Ratios of Normalizing Constants via A Simple Identity: A Theoretical Exploration. Statistica Sinica. 1996;6:831–860. 602. [Google Scholar]
Ntzoufras I, Dellaportas P, Forster JJ. Bayesian Variable and Link Determination for Generalised Linear Models. Journal of Statistical Planning and Inference. 2003;111:165–180. 586. [Google Scholar]
Raftery AE. Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika. 1996;83:251–266. 585. [Google Scholar]
Raftery AE, Madigan D, Hoeting JA. Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association. 1997;92:179–191. 585. [Google Scholar]
Schmeiser BW, Avramidis AN, Hashem S. Overlapping Batch Statistics. In: Balci O, Sadowski RP, Nance RE, editors. Proceedings of the 1990 Winter Simulation Conference. San Diego, California: Society for Computer Simulation International; 1990. pp. 395–398. 607. [Google Scholar]
Schwarz G. Estimating the Dimension of A Model. The Annals of Statistics. 1978;6:461–464. 589. [Google Scholar]
Smith M, Kohn R. Nonparametric Regression Using Bayesian Variable Selection. Journal of Econometrics. 1996;75:317–343. 585. [Google Scholar]
Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian Measures of Model Complexity and Fit (with Discussion) Journal of the Royal Statistical Society, Series B. 2002;62:583–639. 589–590. [Google Scholar]
Zellner A. On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions. In: Goel P, Zellner A, editors. Bayesian Inference and Decision Techniques. Amsterdam: Elsevier Science Publishers B.V; 1986. pp. 233–243. 592. [Google Scholar]
