Author manuscript; available in PMC: 2009 May 11.
Published in final edited form as: Bayesian Anal. 2008 Jul 1;3(3):585–614. doi: 10.1214/08-BA323

Bayesian Variable Selection and Computation for Generalized Linear Models with Conjugate Priors

Ming-Hui Chen, Lan Huang, Joseph G Ibrahim, Sungduk Kim
PMCID: PMC2680310  NIHMSID: NIHMS94914  PMID: 19436774

Abstract

In this paper, we consider theoretical and computational connections between six popular methods for variable subset selection in generalized linear models (GLM’s). Under the conjugate priors developed by Chen and Ibrahim (2003) for the generalized linear model, we obtain closed form analytic relationships between the Bayes factor (posterior model probability), the Conditional Predictive Ordinate (CPO), the L measure, the Deviance Information Criterion (DIC), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in the case of the linear model. Moreover, we examine computational relationships in the model space for these Bayesian methods for an arbitrary GLM under conjugate priors, and we examine the performance of the conjugate priors of Chen and Ibrahim (2003) in Bayesian variable selection. Specifically, we show that once Markov chain Monte Carlo (MCMC) samples are obtained from the full model, the four Bayesian criteria can be simultaneously computed for all possible subset models in the model space. We illustrate our new methodology with a simulation study and a real dataset.

Keywords: Bayes factor, Conditional Predictive Ordinate, Conjugate prior, L measure, Poisson regression, Logistic regression

1 Introduction

Bayesian variable selection is still one of the most theoretically and computationally challenging problems encountered in practice due to issues regarding i) prior elicitation, ii) analytic evaluation of the model selection criterion, and iii) numerical computation of the criterion for all possible models in the model space. These issues have been discussed by many authors for various linear and generalized linear models including George and McCulloch (1993), Laud and Ibrahim (1995), George et al. (1996), Raftery (1996), Smith and Kohn (1996), George and McCulloch (1997), Raftery et al. (1997), Brown et al. (1998), Brown et al. (2002), Clyde (1999), Chen et al. (1999), Dellaportas and Forster (1999), Ibrahim et al. (1999), Chipman et al. (1998), Chipman et al. (2001), Chipman et al. (2003), George (2000), George and Foster (2000), Ibrahim et al. (2000), Ntzoufras et al. (2003), and Chen et al. (2003). Clyde and George (2004) present an excellent review article on Bayesian model selection and uncertainty, and give an excellent exposition of the theoretical and computational issues involved in Bayesian variable selection and Bayesian model uncertainty in general. An entire monograph devoted to Bayesian model selection is given by Lahiri (2001).

One of the important unresolved issues in Bayesian model selection and Bayesian variable selection in particular is what the analytic or empirical connections are between the various methods. For example, it is not clear what the relationship is between BIC and DIC, or DIC and the L measure, and whether one is a monotonic function of the other, and whether one can compute BIC from DIC or vice versa. A related question is that if one has MCMC samples from the full model, how can those samples be used to obtain all four Bayesian criteria mentioned above. To answer these questions, we investigate the following in this paper: (i) for the normal linear model with conjugate priors, we obtain analytic relationships between the Bayes factor, CPO, the L measure, DIC, AIC, and BIC, and (ii) for the class of GLM’s we show via the development of several theorems and identities how one can compute all of these Bayesian criteria simultaneously using only an MCMC sample from the full model.

The relationships obtained in (i) for the linear model shed light on the behavior and connections between these criteria for GLM’s. The development of (ii) above is important and useful since it establishes the computational relationships in the model space for each of the four Bayesian criteria and shows that for variable subset selection in GLM’s using the conjugate priors of Chen and Ibrahim (2003), we can compute the four Bayesian criteria for all possible $2^p$ subset models using only an MCMC sample from the full model with p covariates. Another important issue we examine in this paper is the performance of the conjugate priors proposed by Chen and Ibrahim (2003) in Bayesian variable subset selection. We demonstrate that these priors perform quite well in this context, and they are easy to specify and computationally feasible.

The rest of this paper is organized as follows. Section 2 gives formulas for each of the criteria under the conjugate priors of Chen and Ibrahim (2003) for GLM’s and Section 3 develops the theoretical connections between the six criteria for the normal linear model. Section 4 establishes the computational connections in the model space for the four Bayesian criteria and several key identities and theorems that are needed. Section 5 presents a detailed simulation study examining various properties of the six criteria, and Section 6 presents a real data example. We conclude the article with brief remarks in Section 7. All proofs are given in the Appendix.

2 The Method

2.1 Model and Notation

Suppose that $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ are independent observations, where $y_i$ is the response variable and $x_i = (1, x_{i1}, \ldots, x_{ik})'$ is a $(k + 1) \times 1$ random vector of covariates. Let $\mathcal{M}$ denote the model space. We enumerate the models in $\mathcal{M}$ by $m = 1, 2, \ldots, \mathcal{K}$, where $\mathcal{K}$ is the dimension of $\mathcal{M}$ and model $\mathcal{K}$ denotes the full model. Also, let $\beta^{(\mathcal{K})} = (\beta_0, \beta_1, \ldots, \beta_k)'$ denote the regression coefficients for the full model including an intercept, and let $x_i^{(m)}$ and $\beta^{(m)}$ denote the $k_m \times 1$ vectors of covariates and regression coefficients for model $m$, which contains an intercept and a specific choice of $k_m - 1$ covariates. We write $x_i = (x_i^{(m)\prime}, x_i^{(-m)\prime})'$ and $\beta^{(\mathcal{K})} = (\beta^{(m)\prime}, \beta^{(-m)\prime})'$, where $x_i^{(-m)}$ is $x_i$ with $x_i^{(m)}$ deleted and $\beta^{(-m)}$ is $\beta^{(\mathcal{K})}$ with $\beta^{(m)}$ deleted.

Under model $m$, the generalized linear model (GLM) is assumed for $[y_i \mid x_i^{(m)}]$, which has the conditional density given by

$$f(y_i \mid x_i^{(m)}, \beta^{(m)}, \tau) = \exp\left[a_i^{-1}(\tau)\{y_i\theta_i^{(m)} - b(\theta_i^{(m)})\} + c(y_i, \tau)\right], \quad i = 1, 2, \ldots, n, \tag{1}$$

where $\theta_i^{(m)} = \theta(\eta_i^{(m)})$ is the canonical parameter, $\eta_i^{(m)} = x_i^{(m)\prime}\beta^{(m)}$, and $\tau$ is a dispersion parameter. The functions $a$, $b$ and $c$ determine a particular family in the class. The functions $a_i(\tau)$ are commonly of the form $a_i(\tau) = \tau^{-1} w_i^{-1}$, where the $w_i$'s are known weights. For ease of exposition, we assume throughout that $\tau = 1$ and $w_i = 1$, as, for example, in logistic and Poisson regression. The methods proposed here can be easily extended to the case when $\tau$ is unknown. Under this assumption, (1) can be rewritten as

$$f(y_i \mid x_i^{(m)}, \beta^{(m)}) = \exp\{y_i\theta_i^{(m)} - b(\theta_i^{(m)}) + c(y_i)\}, \quad i = 1, 2, \ldots, n. \tag{2}$$

2.2 Prior and Posterior

In the context of Bayesian variable selection, a prior distribution for β(m) needs to be specified for each model in the model space ℳ. To this end, we consider a conjugate prior for the GLM proposed by Chen and Ibrahim (2003). Under model m, the conjugate prior is of the form

$$\pi(\beta^{(m)} \mid y_0, a_0, m) \propto \prod_{i=1}^{n} \exp\left[a_0\{y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)})\}\right] = \exp\left[a_0\{y_0'\theta^{(m)} - J'b(\theta^{(m)})\}\right], \tag{3}$$

where $a_0 > 0$ is a scalar prior parameter, $y_0 = (y_{01}, \ldots, y_{0n})'$ is an $n \times 1$ vector of prior parameters, $\theta^{(m)} = (\theta_1^{(m)}, \ldots, \theta_n^{(m)})'$, $J$ is an $n \times 1$ vector of ones, and $b(\theta^{(m)}) = (b(\theta_1^{(m)}), \ldots, b(\theta_n^{(m)}))'$ is an $n \times 1$ vector of the $b(\theta_i^{(m)})$'s. As discussed in Chen and Ibrahim (2003), $y_{0i}$ can be viewed as a prior prediction for the marginal mean of $y_i$ at $x_i$. Thus, in eliciting $y_0$, the user must focus on a prediction (or guess) for $E(y)$, which narrows the possibilities for choosing $y_0$. Moreover, specifying all $y_{0i}$ to be equal has an appealing interpretation. A prior specification with $y_{01} = \cdots = y_{0n}$ implies a prior in which the prior modes of the slopes in the regression model are the same, but the prior modes of the intercepts vary. For example, a prior with $y_{0i} = 0.5$ will have the same modes of slopes but a different mode of intercept than a prior with $y_{0i} = 0.1$. This is intuitively appealing since in this case the prior prediction $y_{0i}$ does not depend on the $i$th subject’s specific information. Mathematically, this result was established in Chen and Ibrahim (2003). The details are as follows. Suppose we drop the model index $m$. Let $\mu_0$ be any prespecified $p \times 1$ vector, where $p = k + 1$. Suppose we take

$$y_0 = \dot{b}(\theta) = \dot{b}(\theta(X\mu_0)),$$

where $\dot{b}(\theta)$ is the gradient vector of $b(\theta)$. Then, the conjugate prior yields a prior mode of $\beta$ equal to $\mu_0$. Now we can see that $\mu_0 = (\beta_0, 0, \ldots, 0)'$ yields $y_{01} = y_{02} = \cdots = y_{0n} = \dot{b}(\theta(\beta_0))$. On the other hand, under some mild conditions the prior mode is unique, and hence the specification $y_0 = y_0\mathbf{1}$, where the $y_0$ on the right-hand side is a common scalar and $\mathbf{1}$ is the $n \times 1$ vector of ones, leads to the prior mode $\mu_0 = (\beta_0, 0, \ldots, 0)'$, where $\beta_0$ satisfies $\dot{b}(\theta(\beta_0)) = y_0$. For instance, under normal linear regression, we can show that the prior mode $\mu_0$ of $\beta$ is given by

$$\mu_0 = (X'X)^{-1}X'y_0.$$

If we specify $y_0 = y_0\mathbf{1}$, we have

$$\mu_0 = (y_0, 0, 0, \ldots, 0)',$$

which implies that all the slopes are 0 while the intercept is equal to $y_0$. This attractive feature allows us to do sensitivity analyses by varying the intercepts in the prior. The parameter $a_0$ in (3) can be generally viewed as a precision parameter that quantifies the strength of our prior belief in $y_0$.
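The normal linear case above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the design matrix, the seed, and the common prior prediction 0.5 are arbitrary illustrative assumptions, and `prior_mode_normal` is a hypothetical helper name.

```python
import numpy as np

def prior_mode_normal(X, y0):
    """Prior mode mu0 = (X'X)^{-1} X' y0 of the conjugate prior (normal linear case)."""
    return np.linalg.solve(X.T @ X, X.T @ y0)

rng = np.random.default_rng(0)                    # arbitrary seed (assumption)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k covariates
y0 = np.full(n, 0.5)                              # all prior predictions equal to 0.5
mu0 = prior_mode_normal(X, y0)                    # intercept 0.5, slopes 0 (up to rounding)
```

Because a constant `y0` lies in the span of the intercept column, the least-squares projection puts all of it on the intercept, which is the statement made in the text.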

In the context of Bayesian variable selection, (3) specifies the priors for all models in ℳ in an automatic and systematic fashion. Although various theoretical properties of (3) were examined in Chen and Ibrahim (2003) in great detail, it is not clear how well this type of prior performs in the context of Bayesian variable selection.

Now, under model m, the posterior distribution of β(m) with the conjugate prior (3) is given by

$$\pi(\beta^{(m)} \mid D, m) \propto \exp\{y'\theta^{(m)} - J'b(\theta^{(m)})\}\,\pi(\beta^{(m)} \mid y_0, a_0, m) \propto \exp\{(y + a_0 y_0)'\theta^{(m)} - (1 + a_0)J'b(\theta^{(m)})\}, \tag{4}$$

where $D = \{(y_i, x_i),\ i = 1, 2, \ldots, n\}$ denotes the observed data. From (4), we can see that under the conjugate prior, the resulting posterior has a very attractive form. Furthermore, when $a_0 \to 0$, the posterior $\pi(\beta^{(m)} \mid D, m)$ in (4) reduces to

$$\pi(\beta^{(m)} \mid D, m) \propto \exp\{y'\theta^{(m)} - J'b(\theta^{(m)})\},$$

which is the posterior distribution based on an improper uniform prior for $\beta^{(m)}$.

2.3 Variable Selection Criteria

In this section, we consider four Bayesian model assessment criteria, namely, Conditional Predictive Ordinate (CPO) statistic (Geisser (1993); Gelfand et al. (1992); and Gelfand and Dey (1994)), L measure (Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelfand and Ghosh (1998); Ibrahim et al. (2001a); and Chen et al. (2004)), Deviance Information Criterion (DIC) (Spiegelhalter et al. (2002)), and marginal likelihood (Bayes factor).

The CPO, L measure, and DIC are criterion-based methods, which are attractive in that they are well defined under improper priors as long as the posterior distribution is proper; in this sense they have an advantage over the marginal likelihood or Bayes factor approach. For this reason, these three criterion-based methods can be directly compared to AIC (Akaike (1973)) and BIC (Schwarz (1978)). On the other hand, the marginal likelihood or the Bayes factor is well calibrated and relatively easy to interpret, but generally sensitive to vague proper priors. In the context of variable selection, it is not clear how these methods perform under the conjugate prior given in (3) for the GLM.

Under model m, for the ith observation, we define the CPO statistic as follows:

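$$\mathrm{CPO}_i = f(y_i \mid x_i, D^{(-i)}) = \int f(y_i \mid x_i^{(m)}, \beta^{(m)})\,\pi(\beta^{(m)} \mid D^{(-i)}, m)\,d\beta^{(m)},$$

where $D^{(-i)}$ is $D$ with the $i$th observation deleted, and $\pi(\beta^{(m)} \mid D^{(-i)}, m)$ is the posterior distribution based on the data $D^{(-i)}$. Due to the construction of the conjugate prior (3), it is more natural to define

$$\pi(\beta^{(m)} \mid D^{(-i)}, m) \propto \prod_{j \neq i} \exp\{(y_j + a_0 y_{0j})\theta_j^{(m)} - (1 + a_0)b(\theta_j^{(m)})\}.$$

After some messy algebra, we can show that $\mathrm{CPO}_i$ takes the following form:

$$\mathrm{CPO}_i = f(y_i \mid x_i, D^{(-i)}) = \frac{\displaystyle\int \frac{1}{\exp[a_0\{y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)})\}]}\,\pi(\beta^{(m)} \mid D, m)\,d\beta^{(m)}}{\displaystyle\int \frac{1}{f(y_i \mid x_i^{(m)}, \beta^{(m)})\exp[a_0\{y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)})\}]}\,\pi(\beta^{(m)} \mid D, m)\,d\beta^{(m)}}, \tag{5}$$

where $f(y_i \mid x_i^{(m)}, \beta^{(m)})$ is the density function given in (2). Also, we notice that the CPO defined in (5) is slightly different from the usual CPO (Geisser (1993) and Gelfand et al. (1992)), which is of the form

$$\left\{\int \frac{1}{f(y_i \mid x_i^{(m)}, \beta^{(m)})}\,\pi(\beta^{(m)} \mid D, m)\,d\beta^{(m)}\right\}^{-1}.$$

However, these two forms will be identical as $a_0 \to 0$. As suggested in Ibrahim et al. (2001b), a natural summary statistic of the $\mathrm{CPO}_i$’s is the logarithm of the pseudo-marginal likelihood (LPML), defined as

$$\mathrm{LPML}_m = \sum_{i=1}^{n} \log(\mathrm{CPO}_i).$$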

We will use LPMLm as a criterion-based measure for variable selection.
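As an illustration of how (5) might be evaluated from posterior draws, the sketch below computes the LPML for a Poisson GLM with canonical log link, where $b(\theta) = e^{\theta}$ and $c(y) = -\log y!$. It assumes that draws from $\pi(\beta^{(m)} \mid D, m)$ are already available; the function names are hypothetical, and the log-mean-exp device is a numerical choice made here rather than something prescribed by the paper.

```python
import math
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def lpml_poisson(draws, X, y, y0, a0):
    """LPML via the ratio form (5) for a Poisson GLM with log link.

    draws : (S, p) posterior draws of beta under the model with design matrix X (n, p)
    Returns (LPML, vector of log CPO_i).
    """
    theta = draws @ X.T                                    # (S, n) canonical parameters
    log_c = -np.array([math.lgamma(v + 1.0) for v in y])   # c(y_i) = -log(y_i!)
    log_f = y * theta - np.exp(theta) + log_c              # log f(y_i | x_i, beta)
    log_prior_i = a0 * (y0 * theta - np.exp(theta))        # a0 * {y0i*theta_i - b(theta_i)}
    log_num = log_mean_exp(-log_prior_i)                   # numerator of (5)
    log_den = log_mean_exp(-log_f - log_prior_i)           # denominator of (5)
    log_cpo = log_num - log_den
    return float(np.sum(log_cpo)), log_cpo
```

Setting `a0 = 0` makes the numerator equal to one, so the function reduces to the usual harmonic-mean form of the CPO, in line with the remark that the two forms coincide as $a_0 \to 0$.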

The L measure criterion is another useful tool for model comparison and variable selection. The L measure is constructed from the posterior predictive distribution of the data. For the entire class of GLM’s in (2), under model m, the L measure is defined as:

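$$L_m(\nu) = \sum_{i=1}^{n}\left[E\{b''(\theta_i^{(m)}) \mid D, m\} + \mathrm{Var}\{b'(\theta_i^{(m)}) \mid D, m\}\right] + \nu\sum_{i=1}^{n}\left[E\{b'(\theta_i^{(m)}) \mid D, m\} - y_i\right]^2, \tag{6}$$

where $b'(\cdot)$ and $b''(\cdot)$ are the mean and variance functions of the GLM in (2), and all expectations and variances are taken with respect to the posterior distribution $\pi(\beta^{(m)} \mid D, m)$ in (4). We note that for the GLM in (1), we need to modify $L_m(\nu)$ in (6) accordingly, and in this case, the L measure takes the form

$$L_m(\nu) = \sum_{i=1}^{n}\left[E\{a_i(\tau)b''(\theta_i^{(m)}) \mid D, m\} + \mathrm{Var}\{b'(\theta_i^{(m)}) \mid D, m\}\right] + \nu\sum_{i=1}^{n}\left[E\{b'(\theta_i^{(m)}) \mid D, m\} - y_i\right]^2. \tag{7}$$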

The DIC criterion, proposed by Spiegelhalter et al. (2002), is given by

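$$\mathrm{DIC}_m = D(\bar{\beta}^{(m)}) + 2p_D^{(m)}, \tag{8}$$

where

$$p_D^{(m)} = \overline{D(\beta^{(m)})} - D(\bar{\beta}^{(m)}),$$

$\bar{\beta}^{(m)} = E[\beta^{(m)} \mid D, m]$, and $\overline{D(\beta^{(m)})} = E[D(\beta^{(m)}) \mid D, m]$. For the GLM in (2), under model $m$,

$$D(\beta^{(m)}) = -2\sum_{i=1}^{n}\{y_i\theta_i^{(m)} - b(\theta_i^{(m)})\}. \tag{9}$$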

Similar to (6), under the GLM in (1), D(β(m)) needs to be modified accordingly.

In the spirit of marginal likelihoods, after ignoring the constants shared by all variable subset models in model space ℳ for the GLM in (2), for the purpose of variable subset selection it suffices to compute the posterior normalizing constant

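$$C_m(D) = \int \exp\{(y + a_0 y_0)'\theta^{(m)} - (1 + a_0)J'b(\theta^{(m)})\}\,d\beta^{(m)} \tag{10}$$

and the prior normalizing constant

$$C_{0m}(y_0) = \int \exp\left[a_0\{y_0'\theta^{(m)} - J'b(\theta^{(m)})\}\right]d\beta^{(m)}. \tag{11}$$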

Similar to the modification of (6) yielding (7), under the GLM in (1), D(β(m)) in (9), Cm(D) in (10), and C0m(y0) in (11) need to be modified accordingly. In the context of variable selection, we select a variable subset model which yields the largest LPMLm under the CPO, the smallest Lm(ν) under the L measure, the smallest DICm under the DIC, and the largest Cm(D)/C0m(y0) or log[Cm(D)/C0m(y0)] under the marginal likelihood.

3 Analytic Connections Between Variable Selection Criteria For the Normal Linear Regression Model

In this section, we consider the normal linear regression models given by

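$$f(y_i \mid x_i^{(m)}, \beta^{(m)}, \tau) = \frac{\tau^{1/2}}{\sqrt{2\pi}}\exp\left\{-\frac{\tau}{2}\left(y_i - x_i^{(m)\prime}\beta^{(m)}\right)^2\right\}. \tag{12}$$

Let $X_m = (x_1^{(m)}, x_2^{(m)}, \ldots, x_n^{(m)})'$, which is the design matrix for the normal linear regression under model $m$. Assume $X_m$ is of full rank $k_m$ throughout. We focus only on the case of known $\tau$, as analytical connections are more difficult to establish when $\tau$ is unknown. For the model in (12) with a known $\tau$, the conjugate prior for $\beta^{(m)}$ in (3) reduces to

$$[\beta^{(m)} \mid y_0, a_0, m] \sim N_{k_m}\left((X_m'X_m)^{-1}X_m'y_0,\ \frac{1}{\tau a_0}(X_m'X_m)^{-1}\right), \tag{13}$$

and the posterior distribution for $\beta^{(m)}$ is given by

$$[\beta^{(m)} \mid D, m] \sim N_{k_m}\left((X_m'X_m)^{-1}X_m'\,\frac{y + a_0 y_0}{1 + a_0},\ \frac{1}{\tau(1 + a_0)}(X_m'X_m)^{-1}\right).$$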

For (12), AIC and BIC under model m are given by

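$$\mathrm{AIC}_m = -2\log L(\hat{\beta}^{(m)} \mid D) + 2k_m = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + 2k_m, \tag{14}$$

where $\hat{\beta}^{(m)}$ is the maximum likelihood estimate of $\beta^{(m)}$ and

$$\mathrm{SSE}_m = y'\{I - X_m(X_m'X_m)^{-1}X_m'\}y$$

is the usual sum of squared errors, and

$$\mathrm{BIC}_m = -2\log L(\hat{\beta}^{(m)} \mid D) + \{\log(n)\}k_m = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + \{\log(n)\}k_m. \tag{15}$$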

After some algebra, we can show that after putting back all normalizing constants, the logarithm of the marginal likelihood under model m is given by

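$$\log\{C_m(D)/C_{0m}(y_0)\} = \frac{n}{2}\log\left(\frac{\tau}{2\pi}\right) - \frac{\tau}{2}y'y + \frac{\tau(1 + a_0)}{2}\left(\frac{y + a_0 y_0}{1 + a_0}\right)'X_m(X_m'X_m)^{-1}X_m'\left(\frac{y + a_0 y_0}{1 + a_0}\right) - \frac{\tau a_0}{2}\left\{y_0'X_m(X_m'X_m)^{-1}X_m'y_0\right\} + \left(\frac{1}{2}\log\frac{a_0}{1 + a_0}\right)k_m. \tag{16}$$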
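One way to sanity-check (16) is to compare it with the marginal density of $y$ implied directly by the prior (13), namely $y \sim N\big(P_m y_0,\ \tau^{-1}(I + P_m/a_0)\big)$ with $P_m = X_m(X_m'X_m)^{-1}X_m'$. The sketch below does this numerically; the function names are hypothetical and the direct-density route is an independent check added here, not the derivation used in the paper.

```python
import numpy as np

def log_marglik_eq16(X, y, y0, a0, tau):
    """Right-hand side of (16) for the normal linear model with known tau."""
    n, k = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto the column space of X
    u = (y + a0 * y0) / (1.0 + a0)
    return (0.5 * n * np.log(tau / (2 * np.pi)) - 0.5 * tau * (y @ y)
            + 0.5 * tau * (1 + a0) * (u @ P @ u)
            - 0.5 * tau * a0 * (y0 @ P @ y0)
            + 0.5 * k * np.log(a0 / (1 + a0)))

def log_marglik_direct(X, y, y0, a0, tau):
    """log N(y; P y0, (I + P / a0) / tau), the marginal density of y under prior (13)."""
    n = X.shape[0]
    P = X @ np.linalg.solve(X.T @ X, X.T)
    V = (np.eye(n) + P / a0) / tau
    r = y - P @ y0
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(V, r))
```

The two functions should agree to rounding error for any full-rank design, any `y0`, and any `a0 > 0`.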

When y0 = 0, the conjugate prior in (13) reduces to Zellner’s g-prior (Zellner (1986)). For this special case, (16) becomes

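$$\log[C_m(D)/C_{0m}(0)] = \frac{n}{2}\log\left(\frac{\tau}{2\pi}\right) - \frac{\tau a_0}{2(1 + a_0)}y'y - \frac{\tau}{2(1 + a_0)}\mathrm{SSE}_m + \left(\frac{1}{2}\log\frac{a_0}{1 + a_0}\right)k_m. \tag{17}$$

Thus, we have

$$\mathcal{M}_m(a_0) \equiv -2(1 + a_0)\left[\log\{C_m(D)/C_{0m}(0)\} + \frac{\tau a_0}{2(1 + a_0)}y'y\right] + a_0 n\log\left(\frac{\tau}{2\pi}\right) = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + \left\{(1 + a_0)\log\frac{1 + a_0}{a_0}\right\}k_m. \tag{18}$$

For purposes of variable selection, it suffices to compare $\mathcal{M}_m(a_0)$, and we then choose a model with the smallest $\mathcal{M}_m(a_0)$. From (18), we can see that

$$\mathcal{M}_m(a_0) = \begin{cases}\mathrm{AIC}_m & \text{if } (1 + a_0)\log\dfrac{1 + a_0}{a_0} = 2,\\[2ex] \mathrm{BIC}_m & \text{if } (1 + a_0)\log\dfrac{1 + a_0}{a_0} = \log n.\end{cases} \tag{19}$$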
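The two conditions in (19) have no closed-form solution in $a_0$, but the penalty $(1 + a_0)\log\{(1 + a_0)/a_0\}$ decreases monotonically from $+\infty$ to 1 as $a_0$ grows, so each condition can be solved numerically. The following sketch uses bisection on the log scale; the function names, the bracketing interval, and the choice $n = 500$ are illustrative assumptions.

```python
import math

def penalty(a0):
    """Per-parameter penalty (1 + a0) * log((1 + a0) / a0) appearing in (18)."""
    return (1.0 + a0) * math.log((1.0 + a0) / a0)

def solve_a0(target, lo=1e-12, hi=1e6, iters=200):
    """Bisection for penalty(a0) = target on [lo, hi] (bracket is an assumption)."""
    if target <= 1.0:
        raise ValueError("target must exceed 1, the limit of the penalty as a0 -> infinity")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisect on the log scale
        if penalty(mid) > target:         # penalty too large: a0 must increase
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

a0_aic = solve_a0(2.0)                    # a0 at which M_m(a0) coincides with AIC_m
a0_bic = solve_a0(math.log(500))          # a0 at which M_m(a0) coincides with BIC_m, n = 500
```

Because the penalty is decreasing in $a_0$, matching BIC (for $\log n > 2$) requires a smaller $a_0$, that is, a vaguer prior, than matching AIC.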

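For (12), we use (7) to compute $L_m(\nu)$. In particular, we have $a_i(\tau) = 1/\tau$, $E[a_i(\tau)b''(\theta_i^{(m)}) \mid D, m] = 1/\tau$,

$$\mathrm{Var}\{b'(\theta_i^{(m)}) \mid D, m\} = \mathrm{Var}\{x_i^{(m)\prime}\beta^{(m)} \mid D, m\} = x_i^{(m)\prime}\,\mathrm{Var}(\beta^{(m)} \mid D, m)\,x_i^{(m)} = \frac{1}{\tau(1 + a_0)}x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)},$$

and $E\{b'(\theta_i^{(m)}) \mid D, m\} = E\{x_i^{(m)\prime}\beta^{(m)} \mid D, m\} = x_i^{(m)\prime}(X_m'X_m)^{-1}X_m'\dfrac{y + a_0 y_0}{1 + a_0}$. Thus, we obtain

$$\begin{aligned} L_m(\nu) &= \frac{n}{\tau} + \frac{1}{\tau(1 + a_0)}\sum_{i=1}^{n}x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)} + \nu\sum_{i=1}^{n}\left\{y_i - x_i^{(m)\prime}(X_m'X_m)^{-1}X_m'\frac{y + a_0 y_0}{1 + a_0}\right\}^2 \\ &= \frac{n}{\tau} + \frac{1}{\tau(1 + a_0)}k_m + \nu\left[\left\{y - X_m(X_m'X_m)^{-1}X_m'\frac{y + a_0 y_0}{1 + a_0}\right\}'\left\{y - X_m(X_m'X_m)^{-1}X_m'\frac{y + a_0 y_0}{1 + a_0}\right\}\right]. \end{aligned} \tag{20}$$

When $y_0 = 0$, (20) reduces to

$$L_m(\nu) = \frac{n}{\tau} + \frac{1}{\tau(1 + a_0)}k_m + \frac{\nu a_0^2}{(1 + a_0)^2}y'y + \frac{\nu(1 + 2a_0)}{(1 + a_0)^2}\mathrm{SSE}_m. \tag{21}$$

Write

$$L_m^*(\nu, a_0) = \frac{\tau(1 + a_0)^2}{\nu(1 + 2a_0)}\left\{L_m(\nu) - \frac{n}{\tau} - \frac{\nu a_0^2}{(1 + a_0)^2}y'y\right\} - n\log\left(\frac{\tau}{2\pi}\right). \tag{22}$$

Using (21) and (22), we obtain

$$L_m^*(\nu, a_0) = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + \frac{1 + a_0}{\nu(1 + 2a_0)}k_m,$$

and hence

$$L_m^*(\nu, a_0) = \begin{cases}\mathrm{AIC}_m & \text{if } \dfrac{1 + a_0}{\nu(1 + 2a_0)} = 2,\\[2ex] \mathrm{BIC}_m & \text{if } \dfrac{1 + a_0}{\nu(1 + 2a_0)} = \log n.\end{cases}$$

Note that in the context of variable selection, a model with the smallest $L_m(\nu)$ is the same model that has the smallest $L_m^*(\nu, a_0)$. Thus, in this sense, the L measure can be equivalent to AIC or BIC by appropriately tuning $(\nu, a_0)$. It is interesting to mention that in order to achieve $L_m^*(\nu, a_0) = \mathrm{AIC}_m$ or $L_m^*(\nu, a_0) = \mathrm{BIC}_m$, $\nu$ must be small, and hence when $\nu = 1$, the L measure always has a smaller dimensional penalty than both AIC and BIC. Unlike the marginal likelihood, $a_0$ plays a minimal role in controlling the dimensional penalty in the L measure.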
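To make the tuning explicit, the two matching conditions can be solved for $\nu$. This is a short worked step that uses only the penalty coefficient $(1 + a_0)/\{\nu(1 + 2a_0)\}$ derived above; the subscripted names $\nu_{\mathrm{AIC}}$ and $\nu_{\mathrm{BIC}}$ are introduced here for illustration.

```latex
\nu_{\mathrm{AIC}} = \frac{1 + a_0}{2(1 + 2a_0)}, \qquad
\nu_{\mathrm{BIC}} = \frac{1 + a_0}{(1 + 2a_0)\log n}, \qquad
\frac{1}{2} < \frac{1 + a_0}{1 + 2a_0} \le 1 \quad \text{for all } a_0 \ge 0 .
```

Since $(1 + a_0)/(1 + 2a_0)$ only ranges over $(1/2, 1]$, $\nu_{\mathrm{AIC}}$ lies in $(1/4, 1/2]$ whatever the value of $a_0$, and at $\nu = 1$ the penalty coefficient is at most 1, below the AIC value of 2.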

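When $y_0 = 0$, the posterior mean of $\beta^{(m)}$ is given by $\bar{\beta}^{(m)} = \dfrac{1}{1 + a_0}(X_m'X_m)^{-1}X_m'y$. Thus, we have $D(\beta^{(m)}) = -n\log\left(\dfrac{\tau}{2\pi}\right) + \tau(y - X_m\beta^{(m)})'(y - X_m\beta^{(m)})$,

$$\begin{aligned} \overline{D(\beta^{(m)})} &= E[D(\beta^{(m)}) \mid D, m] \\ &= -n\log\left(\frac{\tau}{2\pi}\right) + \tau E\left[\{y - X_m\bar{\beta}^{(m)} - X_m(\beta^{(m)} - \bar{\beta}^{(m)})\}'\{y - X_m\bar{\beta}^{(m)} - X_m(\beta^{(m)} - \bar{\beta}^{(m)})\} \,\Big|\, D, m\right] \\ &= -n\log\left(\frac{\tau}{2\pi}\right) + \frac{1}{1 + a_0}k_m + \frac{\tau a_0^2}{(1 + a_0)^2}y'y + \frac{\tau(1 + 2a_0)}{(1 + a_0)^2}\mathrm{SSE}_m, \end{aligned} \tag{23}$$

and

$$D(\bar{\beta}^{(m)}) = -n\log\left(\frac{\tau}{2\pi}\right) + \tau(y - X_m\bar{\beta}^{(m)})'(y - X_m\bar{\beta}^{(m)}) = -n\log\left(\frac{\tau}{2\pi}\right) + \frac{\tau a_0^2}{(1 + a_0)^2}y'y + \frac{\tau(1 + 2a_0)}{(1 + a_0)^2}\mathrm{SSE}_m. \tag{24}$$

Combining (23) and (24) gives

$$p_D^{(m)} = \overline{D(\beta^{(m)})} - D(\bar{\beta}^{(m)}) = \frac{1}{1 + a_0}k_m. \tag{25}$$

Thus, the $\mathrm{DIC}_m$ for (12) is given by

$$\mathrm{DIC}_m = -n\log\left(\frac{\tau}{2\pi}\right) + \frac{\tau a_0^2}{(1 + a_0)^2}y'y + \frac{\tau(1 + 2a_0)}{(1 + a_0)^2}\mathrm{SSE}_m + \frac{2}{1 + a_0}k_m. \tag{26}$$

Write

$$\mathrm{DIC}_m^*(a_0) = \frac{(1 + a_0)^2}{1 + 2a_0}\left\{\mathrm{DIC}_m - \frac{\tau a_0^2}{(1 + a_0)^2}y'y\right\} + \frac{n a_0^2}{1 + 2a_0}\log\left(\frac{\tau}{2\pi}\right).$$

We have

$$\mathrm{DIC}_m^*(a_0) = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + \frac{2(1 + a_0)}{1 + 2a_0}k_m. \tag{27}$$

Therefore, when $a_0 = 0$, $\mathrm{DIC}_m^*(0) = \mathrm{DIC}_m = \mathrm{AIC}_m$, and when $a_0 > 0$, $\dfrac{2(1 + a_0)}{1 + 2a_0} < 2$, which implies that $\mathrm{DIC}_m^*(a_0)$ has a smaller dimensional penalty than both AIC and BIC.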
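The effective number of parameters in (25) can be checked by simulation. The sketch below draws from the exact normal posterior with $y_0 = 0$ and computes $p_D$ and DIC by Monte Carlo; the sample size, $\tau$, $a_0$, coefficients, seed and number of draws are all arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)                    # arbitrary seed (assumption)
n, k_m, tau, a0 = 100, 3, 2.0, 0.5                # illustrative values (assumptions)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k_m - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=tau ** -0.5, size=n)

XtX = X.T @ X
post_mean = np.linalg.solve(XtX, X.T @ y) / (1 + a0)      # posterior mean with y0 = 0
post_cov = np.linalg.inv(XtX) / (tau * (1 + a0))          # posterior covariance
draws = rng.multivariate_normal(post_mean, post_cov, size=50_000)

def deviance(beta):
    """D(beta) = -n log(tau / 2 pi) + tau * ||y - X beta||^2, for one draw or a batch."""
    resid = y - np.atleast_2d(beta) @ X.T
    return -n * np.log(tau / (2 * np.pi)) + tau * np.sum(resid ** 2, axis=1)

d_bar = deviance(draws).mean()                    # Monte Carlo estimate of the mean deviance
d_hat = deviance(draws.mean(axis=0))[0]           # deviance at the estimated posterior mean
p_d = d_bar - d_hat                               # should be close to k_m / (1 + a0)
dic = d_hat + 2 * p_d
```

With these settings $k_m/(1 + a_0) = 2$, and the Monte Carlo value of `p_d` should match it up to simulation error.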

Similarly to DIC, we consider only y0 = 0. From (5), we have

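$$\mathrm{LPML}_m = \sum_{i=1}^{n}\log(\mathrm{CPO}_i) = \sum_{i=1}^{n}\log(\mathrm{CPO}_{1i}) - \sum_{i=1}^{n}\log(\mathrm{CPO}_{2i}), \tag{28}$$

where $\mathrm{CPO}_{1i} = \displaystyle\int \exp\left\{\frac{a_0\tau}{2}\beta^{(m)\prime}x_i^{(m)}x_i^{(m)\prime}\beta^{(m)}\right\}\pi(\beta^{(m)} \mid D, m)\,d\beta^{(m)}$ and

$$\mathrm{CPO}_{2i} = \left(\frac{\tau}{2\pi}\right)^{-1/2}\exp\left\{\frac{\tau}{2}y_i^2\right\}\int \exp\left[\frac{\tau(1 + a_0)}{2}\left\{\beta^{(m)\prime}x_i^{(m)}x_i^{(m)\prime}\beta^{(m)} - \frac{2}{1 + a_0}\beta^{(m)\prime}x_i^{(m)}y_i\right\}\right]\pi(\beta^{(m)} \mid D, m)\,d\beta^{(m)}$$

for $i = 1, 2, \ldots, n$. After some messy algebra, we obtain

$$\mathrm{CPO}_{1i} = \left\{1 - \frac{a_0}{1 + a_0}x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}\right\}^{-1/2} \exp\left\{\frac{\tau a_0}{2(1 + a_0)^2}\cdot\frac{y'X_m(X_m'X_m)^{-1}x_i^{(m)}x_i^{(m)\prime}(X_m'X_m)^{-1}X_m'y}{1 - \frac{a_0}{1 + a_0}x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}}\right\}$$

and

$$\begin{aligned} \mathrm{CPO}_{2i} = {} & \left(\frac{\tau}{2\pi}\right)^{-1/2}\left\{1 - x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}\right\}^{-1/2}\exp\left(\frac{\tau}{2}y_i^2\right) \\ & \times \exp\left[\frac{\tau}{2(1 + a_0)}\left\{y_i x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}y_i - 2y'X_m(X_m'X_m)^{-1}x_i^{(m)}y_i\right\}\right] \\ & \times \exp\left[\frac{\tau\,(X_m'y - x_i^{(m)}y_i)'(X_m'X_m)^{-1}x_i^{(m)}x_i^{(m)\prime}(X_m'X_m)^{-1}(X_m'y - x_i^{(m)}y_i)}{2(1 + a_0)\{1 - x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}\}}\right]. \end{aligned}$$

Let $\hat{\beta}^{(m)} = (X_m'X_m)^{-1}X_m'y$, $\hat{y}_i^{(m)} = x_i^{(m)\prime}\hat{\beta}^{(m)}$, and $h_{ii}^{(m)} = x_i^{(m)\prime}(X_m'X_m)^{-1}x_i^{(m)}$. Plugging $\mathrm{CPO}_{1i}$ and $\mathrm{CPO}_{2i}$ into (28) yields

$$\begin{aligned} \mathrm{LPML}_m = {} & \frac{n}{2}\log\left(\frac{\tau}{2\pi}\right) - \frac{\tau}{2}\sum_{i=1}^{n}y_i^2 + \frac{1}{2}\sum_{i=1}^{n}\left\{\log(1 - h_{ii}^{(m)}) - \log\left(1 - \frac{a_0}{1 + a_0}h_{ii}^{(m)}\right)\right\} \\ & - \frac{\tau}{2(1 + a_0)}\sum_{i=1}^{n}\left\{h_{ii}^{(m)}y_i^2 - 2y_i\hat{y}_i^{(m)}\right\} + \frac{\tau a_0}{2(1 + a_0)^2}\sum_{i=1}^{n}\left\{\frac{(\hat{y}_i^{(m)})^2}{1 - \frac{a_0}{1 + a_0}h_{ii}^{(m)}}\right\} \\ & - \frac{\tau}{2(1 + a_0)}\sum_{i=1}^{n}\frac{(\hat{y}_i^{(m)} - h_{ii}^{(m)}y_i)^2}{1 - h_{ii}^{(m)}}. \end{aligned} \tag{29}$$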

Using Taylor expansion and after some algebra, LPMLm in (29) can be rewritten as

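$$\mathrm{LPML}_m = \frac{n}{2}\log\left(\frac{\tau}{2\pi}\right) - \frac{\tau a_0^2}{2(1 + a_0)^2}y'y - \frac{\tau(1 + 2a_0)}{2(1 + a_0)^2}\mathrm{SSE}_m - \frac{k_m}{2(1 + a_0)} + R_m, \tag{30}$$

where

$$R_m = -\frac{\tau}{2(1 + a_0)}\sum_{i=1}^{n}\frac{(y_i - \hat{y}_i^{(m)})^2 h_{ii}^{(m)}}{1 - h_{ii}^{(m)}} + \frac{\tau a_0}{2(1 + a_0)^2}\sum_{i=1}^{n}\frac{\frac{a_0}{1 + a_0}h_{ii}^{(m)}(\hat{y}_i^{(m)})^2}{1 - \frac{a_0}{1 + a_0}h_{ii}^{(m)}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=2}^{\infty}\left\{1 - \left(\frac{a_0}{1 + a_0}\right)^j\right\}\frac{(h_{ii}^{(m)})^j}{j}.$$

Write

$$\mathrm{LPML}_m^* = -\frac{2(1 + a_0)^2}{1 + 2a_0}\left\{\mathrm{LPML}_m + \frac{\tau a_0^2}{2(1 + a_0)^2}y'y\right\} + \frac{n a_0^2}{1 + 2a_0}\log\left(\frac{\tau}{2\pi}\right). \tag{31}$$

Using (30) and (31), we obtain

$$\mathrm{LPML}_m^* = -n\log\left(\frac{\tau}{2\pi}\right) + \tau\,\mathrm{SSE}_m + \frac{1 + a_0}{1 + 2a_0}k_m + R_m^*,$$

where $R_m^* = -\dfrac{2(1 + a_0)^2}{1 + 2a_0}R_m$. We choose a model with the smallest $\mathrm{LPML}_m^*$. Note that the remainder term $R_m^*$ is small when all $h_{ii}^{(m)}$’s are small. From (14), (15), and (27), we see that when $R_m^*$ is small and does not vary much in the model space ℳ, LPML has a smaller dimensional penalty than DIC, AIC and BIC. In addition, when $a_0 = 0$, $\mathrm{LPML}_m$ in (30) is consistent with the one derived by Gelfand and Dey (1994) based on the asymptotic approximation.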

Finally, we note that the quantities defined in (18), (22), (27) and (31) are linear transformations of those defined by (17), (21), (26) and (30), respectively. In these linear transformations, the relevant coefficients are independent of m. Thus, for the purposes of variable subset selection, these linearly transformed quantities act exactly like those original forms. With (18), (22), (27) and (31), we can much more clearly see the analytical connections to AIC and BIC. We also note that George and Foster (2000) provided some similar connections between model selection probabilities and various model selection criteria for this setup.

4 Computational Development: Theory and Implementation

For the purpose of variable selection, we need to compute $\mathrm{LPML}_m$, $L_m(\nu)$, $\mathrm{DIC}_m$, $C_m(D)$ and $C_{0m}(y_0)$ for the Bayesian variable selection criteria described in Section 2.3 for $m = 1, 2, \ldots, \mathcal{K}$. Due to the complexity and generality of the GLM in (2), the analytical evaluation of these measures does not appear possible. Thus, a Monte Carlo (MC) based method is required for each of the measures under consideration. However, the MC methods currently available in the Bayesian computational literature require a Markov chain Monte Carlo (MCMC) sample from the posterior distribution $\pi(\beta^{(m)} \mid D, m)$ in (4) under each variable subset model $m$. When the number of models in ℳ is large, sampling from the posterior distribution under each variable subset model can be expensive, and the computation of these four measures for all submodels becomes a difficult and challenging task. Therefore, the development of an efficient Monte Carlo method for variable selection for the GLM is essential.

After examining (5), (6), and (8), we observe that there is a common feature in computing $\mathrm{LPML}_m$, $L_m(\nu)$, and $\mathrm{DIC}_m$: all three measures require computing

$$g_m = E\{g(\beta^{(m)}) \mid D, m\},$$

for various functions g, where the expectation is taken with respect to the joint posterior distribution in (4) under model m. Specifically, the functions required in these calculations include

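  1. $g(\beta^{(m)}) = \exp[-a_0\{y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)})\}]$ and $g(\beta^{(m)}) = \left(f(y_i \mid x_i^{(m)}, \beta^{(m)})\exp[a_0\{y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)})\}]\right)^{-1}$ for $\mathrm{LPML}_m$;

  2. $g(\beta^{(m)}) = b'(\theta_i^{(m)})$, $g(\beta^{(m)}) = \{b'(\theta_i^{(m)})\}^2$, and $g(\beta^{(m)}) = b''(\theta_i^{(m)})$ for $L_m(\nu)$;

  3. $g(\beta^{(m)}) = \beta^{(m)}$ and $g(\beta^{(m)}) = D(\beta^{(m)})$ for $\mathrm{DIC}_m$.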

Write

$$L(\beta^{(m)} \mid D, m) = \exp\{(y + a_0 y_0)'\theta^{(m)} - (1 + a_0)J'b(\theta^{(m)})\}$$

under model $m$, and let $L(\beta \mid D) = L(\beta^{(\mathcal{K})} \mid D, \mathcal{K})$, $C(D) = C_{\mathcal{K}}(D)$, and $C_0(y_0) = C_{0\mathcal{K}}(y_0)$ under the full model. Here, we abuse the notation slightly, as $L(\beta^{(m)} \mid D, m)$ is not a likelihood function in the usual sense. Then, for a given function $g$, we have

$$g_m = E[g(\beta^{(m)}) \mid D, m] = \int g(\beta^{(m)})\,\frac{L(\beta^{(m)} \mid D, m)}{C_m(D)}\,d\beta^{(m)},$$

where $C_m(D)$ is defined in (10). Now, we present a useful identity for $g_m$, which is formally stated in the following theorem.

Theorem 5

For any given function g, such that E[|g(β(m))| |D, m] < ∞, we have

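$$g_m = \frac{C(D)}{C_m(D)}\,E\left\{\frac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}, \tag{32}$$

where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Here, $w(\beta^{(-m)} \mid \beta^{(m)})$ is a completely known conditional density, whose support is contained in, or equal to, the support of the conditional density of $\beta^{(-m)}$ given $\beta^{(m)}$ with respect to the joint posterior distribution in (4) under the full model.

Observe that when $g \equiv 1$, we have

$$1 = \frac{C(D)}{C_m(D)}\,E\left\{\frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\},$$

which leads to

$$\frac{C_m(D)}{C(D)} = E\left\{\frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\} \tag{33}$$

and

$$g_m = \frac{E\left\{\dfrac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}}{E\left\{\dfrac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}}. \tag{34}$$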

It is interesting to mention that the identity (33) is a by-product of this derivation, and this identity can be used to compute the posterior normalizing constant under model $m$. The identities (33) and (34) play an important role in developing a novel Monte Carlo method for computing $\mathrm{LPML}_m$, $L_m(\nu)$, $\mathrm{DIC}_m$, and $C_m(D)$ simultaneously using a single MCMC sample from the joint posterior distribution under the full model. Towards this goal, we let $\{\beta_s = (\beta_s^{(m)}, \beta_s^{(-m)}),\ s = 1, 2, \ldots, S\}$ denote an MCMC sample from the joint posterior distribution (4) under the full model, where $S$ is the MCMC sample size. Then, an estimate of $g_m$ is given by

$$\hat{g}_m = \frac{\displaystyle\sum_{s=1}^{S}\frac{g(\beta_s^{(m)})\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}}{\displaystyle\sum_{s=1}^{S}\frac{L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}}. \tag{35}$$

Under certain regularity conditions, such as ergodicity, we have

$$\lim_{S \to \infty}\hat{g}_m = g_m,$$

which indicates that $\hat{g}_m$ is consistent.
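The estimator (35) is a ratio of two weighted sums, so only the weights $L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})/L(\beta_s \mid D)$ up to a common factor are needed. A minimal sketch follows, assuming the three log-terms have already been evaluated at each draw; the function name `estimate_gm` and the log-scale stabilisation are choices made here for illustration.

```python
import numpy as np

def estimate_gm(g_vals, log_L_m, log_w, log_L_full):
    """Ratio estimator (35), computed with log-weights for numerical stability.

    g_vals     : (S,) or (S, d) values g(beta_s^{(m)})
    log_L_m    : (S,) log L(beta_s^{(m)} | D, m)
    log_w      : (S,) log w(beta_s^{(-m)} | beta_s^{(m)})
    log_L_full : (S,) log L(beta_s | D) under the full model
    """
    log_r = np.asarray(log_L_m) + np.asarray(log_w) - np.asarray(log_L_full)
    r = np.exp(log_r - np.max(log_r))        # any common factor cancels in the ratio
    weights = r / np.sum(r)
    return weights @ np.asarray(g_vals, dtype=float)
```

When all weights are equal the function returns the plain sample mean, which corresponds to the full-model case in which no reweighting is needed.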

Letting

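$$A_S = \frac{1}{S}\sum_{s=1}^{S}\frac{g(\beta_s^{(m)})\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)} \tag{36}$$

and

$$B_S = \frac{1}{S}\sum_{s=1}^{S}\frac{L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}, \tag{37}$$

we have

$$\lim_{S \to \infty}A_S = \frac{C_m(D)}{C(D)}\,g_m \equiv A, \tag{38}$$

and

$$\lim_{S \to \infty}B_S = \frac{C_m(D)}{C(D)} \equiv B. \tag{39}$$

From (38) and (39), we obtain

$$g_m = \frac{C(D)}{C_m(D)}\,A = \frac{A}{B}. \tag{40}$$

Using (36)–(40), we have

$$\hat{g}_m - g_m = \frac{A_S}{B_S} - g_m = \frac{A_S}{B_S} - \frac{A}{B} = \frac{A_S - \frac{A}{B}B_S}{B_S} = A\,\frac{\frac{A_S}{A} - \frac{B_S}{B}}{B_S} = g_m\,\frac{\frac{A_S}{A} - \frac{B_S}{B}}{\frac{B_S}{B}}. \tag{41}$$

In (41), $\lim_{S \to \infty}\dfrac{B_S}{B} = 1$ and

$$\lim_{S \to \infty}\left(\frac{A_S}{A} - \frac{B_S}{B}\right) = 0. \tag{42}$$

In addition, we have

$$\frac{A_S}{A} - \frac{B_S}{B} = \frac{1}{S}\sum_{s=1}^{S}\left[\frac{1}{A}\left\{\frac{g(\beta_s^{(m)})\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}\right\} - \frac{1}{B}\left\{\frac{L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}\right\}\right].$$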

We are then led to the following theorem.

Theorem 6

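Let $\{\beta_s,\ s = 1, 2, \ldots, S\}$ be a random sample. Assume $A \neq 0$,

$$V_w(g_m) = E\left[\left\{\frac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{A\,L(\beta \mid D)} - \frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{B\,L(\beta \mid D)}\right\}^2 \,\Big|\, D\right] < \infty \tag{43}$$

and

$$E\left[\{g(\beta^{(m)})\}^2 \mid D\right] < \infty, \tag{44}$$

where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Then we have

$$\lim_{S \to \infty}\left[S \times E\left\{\left(\frac{\hat{g}_m - g_m}{g_m}\right)^2\right\}\right] = V_w(g_m), \tag{45}$$

where $V_w(g_m)$ is defined by (43), and

$$\sqrt{S}\,(\hat{g}_m - g_m) \xrightarrow{\ D\ } N\left(0,\ g_m^2 V_w(g_m)\right).$$

The proof of Theorem 6 directly follows from the proof of Theorem 3.1 of Chen and Shao (1997). Thus, the details are omitted for brevity. From (45), we notice that $E\left[\dfrac{\hat{g}_m - g_m}{g_m}\right]^2$ is the relative mean-square error, and Theorem 6 implies that when $S$ is large,

$$E\left(\frac{\hat{g}_m - g_m}{g_m}\right)^2 \approx \frac{1}{S}V_w(g_m).$$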

Remark 4.1

As discussed in Chen et al. (2000), the simulation standard error of ĝm can be approximated by

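$$\mathrm{se}(\hat{g}_m) = \frac{\hat{g}_m}{\hat{A}}\cdot\frac{1}{S}\sqrt{\sum_{s=1}^{S}\left[\frac{\{g(\beta_s^{(m)}) - \hat{g}_m\}\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}\right]^2},$$

where $\hat{A} = A_S$.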

Remark 4.2

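From (34), it is quite natural to think that a more efficient way to obtain an MC estimate of $g_m$ is to generate two MC samples from the posterior distribution, so that one sample is used for computing $E\left\{\dfrac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}$ while the second sample is used for computing $E\left\{\dfrac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}$. In this remark, we show that the use of two MC samples in obtaining the MC estimate of $g_m$ may not necessarily be more efficient than the use of just one MC sample. In addition, generating two MC samples requires more computing time. Specifically, suppose that $\{\beta_{1,s},\ s = 1, 2, \ldots, S_1\}$ and $\{\beta_{2,s},\ s = 1, 2, \ldots, S_2\}$ are two independent random samples from the joint posterior distribution (4) under the full model. Then $g_m$ can be estimated by

$$\hat{g}_m^* = \frac{\dfrac{1}{S_1}\displaystyle\sum_{s=1}^{S_1}\frac{g(\beta_{1,s}^{(m)})\,L(\beta_{1,s}^{(m)} \mid D, m)\,w(\beta_{1,s}^{(-m)} \mid \beta_{1,s}^{(m)})}{L(\beta_{1,s} \mid D)}}{\dfrac{1}{S_2}\displaystyle\sum_{s=1}^{S_2}\frac{L(\beta_{2,s}^{(m)} \mid D, m)\,w(\beta_{2,s}^{(-m)} \mid \beta_{2,s}^{(m)})}{L(\beta_{2,s} \mid D)}}. \tag{46}$$

By the δ-method, we have

$$\begin{aligned} E\left(\frac{\hat{g}_m^* - g_m}{g_m}\right)^2 = {} & \frac{\mathrm{Var}\left\{\dfrac{1}{S_1}\displaystyle\sum_{s=1}^{S_1}\frac{g(\beta_{1,s}^{(m)})\,L(\beta_{1,s}^{(m)} \mid D, m)\,w(\beta_{1,s}^{(-m)} \mid \beta_{1,s}^{(m)})}{L(\beta_{1,s} \mid D)}\right\}}{A^2} + \frac{\mathrm{Var}\left\{\dfrac{1}{S_2}\displaystyle\sum_{s=1}^{S_2}\frac{L(\beta_{2,s}^{(m)} \mid D, m)\,w(\beta_{2,s}^{(-m)} \mid \beta_{2,s}^{(m)})}{L(\beta_{2,s} \mid D)}\right\}}{B^2} + O\left\{\frac{1}{(S_1 + S_2)^2}\right\} \\ = {} & \frac{1}{S_1 A^2}\mathrm{Var}\left\{\frac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)}\right\} + \frac{1}{S_2 B^2}\mathrm{Var}\left\{\frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)}\right\} + O\left\{\frac{1}{(S_1 + S_2)^2}\right\}, \end{aligned}$$

where the expectation and variance are taken with respect to the joint posterior distribution (4) under the full model.

Assuming that $S_1 = S_2 = S$, we have

$$\lim_{S \to \infty}\left\{S \times E\left(\frac{\hat{g}_m^* - g_m}{g_m}\right)^2\right\} = \frac{1}{A^2}\mathrm{Var}\left\{\frac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)}\right\} + \frac{1}{B^2}\mathrm{Var}\left\{\frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)}\right\}. \tag{47}$$

Thus, if

$$E\left\{\frac{g(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{A\,L(\beta \mid D)} \times \frac{L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{B\,L(\beta \mid D)}\right\} \geq 0, \tag{48}$$

we have

$$\lim_{S \to \infty}\left\{S \times E\left(\frac{\hat{g}_m - g_m}{g_m}\right)^2\right\} \leq \lim_{S \to \infty}\left\{S \times E\left(\frac{\hat{g}_m^* - g_m}{g_m}\right)^2\right\}.$$

It is easy to see that when $g(\beta^{(m)}) \geq 0$ or $g(\beta^{(m)}) \leq 0$, (48) automatically holds. Therefore, in many cases, it is unnecessary to use two MC samples instead of one MC sample in obtaining the MC estimate of $g_m$.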

Note that the estimate ĝm depends on w(β(−m)|β(m)). It is reasonable to argue that the best choice of w should yield the smallest asymptotic variance of the estimate ĝm among all possible w’s. The following theorem precisely addresses this optimality issue.

Theorem 7

Let

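$$w_{\mathrm{opt}} = \pi(\beta^{(-m)} \mid \beta^{(m)}, D) \tag{49}$$

be the conditional posterior density of $\beta^{(-m)}$ given $\beta^{(m)}$ under the full model. Then we have

$$V_{w_{\mathrm{opt}}}(g_m) \leq V_w(g_m) \tag{50}$$

for all $w$’s, where $V_w(g_m)$ is defined by (43).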

Remark 4.3

Note that (50) holds for any function g that satisfies the condition given in (44). Thus, for various functions g involved in LPMLm, Lm(ν) and DICm, the best choice of w is the same wopt given in (49).

Remark 4.4

When we use the two-sample estimate $\hat{g}_m^*$ in (46), we can also show that $w_{\mathrm{opt}} = \pi(\beta^{(-m)} \mid \beta^{(m)}, D)$ yields the smallest asymptotic relative mean-square error of $\hat{g}_m^*$, for example, the one given by (47).

Remark 4.5

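For computing $\mathrm{CPO}_i$ in (5) under model $m$, we do not need to compute $\dfrac{C(D)}{C_m(D)}$ in (32). In fact, it is easy to see that

$$\mathrm{CPO}_i^{(m)} = \frac{E\left\{\dfrac{g_1(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}}{E\left\{\dfrac{g_2(\beta^{(m)})\,L(\beta^{(m)} \mid D, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D\right\}},$$

where $g_1(\beta^{(m)}) = \exp[-a_0(y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)}))]$ and $g_2(\beta^{(m)}) = \left\{f(y_i \mid x_i^{(m)}, \beta^{(m)})\exp[a_0(y_{0i}\theta_i^{(m)} - b(\theta_i^{(m)}))]\right\}^{-1}$. Thus, given an MCMC sample $\{\beta_s = (\beta_s^{(m)}, \beta_s^{(-m)}),\ s = 1, 2, \ldots, S\}$ from the joint posterior distribution (4), an MC estimate of $\mathrm{CPO}_i$ is given as follows:

$$\widehat{\mathrm{CPO}}_i^{(m)} = \frac{\displaystyle\sum_{s=1}^{S}\frac{g_1(\beta_s^{(m)})\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}}{\displaystyle\sum_{s=1}^{S}\frac{g_2(\beta_s^{(m)})\,L(\beta_s^{(m)} \mid D, m)\,w(\beta_s^{(-m)} \mid \beta_s^{(m)})}{L(\beta_s \mid D)}}.$$

Following the proof of Theorem 7, we can easily show that the optimal choice of $w$ for $\widehat{\mathrm{CPO}}_i^{(m)}$ is still the same $w_{\mathrm{opt}}$ given in (49).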

Remark 4.6

To compute $\mathrm{LPML}_{\mathcal{K}}$, $L_{\mathcal{K}}(\nu)$ and $\mathrm{DIC}_{\mathcal{K}}$ under the full model, we can simply take $\beta^{(\mathcal{K})} = \beta$ and $w(\beta^{(-\mathcal{K})} \mid \beta^{(\mathcal{K})}) = 1$. Then, for various functions $g$, (35) reduces to

$$\hat{g} = \frac{1}{S}\sum_{s=1}^{S}g(\beta_s),$$

where $\{\beta_s,\ s = 1, 2, \ldots, S\}$ is an MCMC sample from the posterior distribution (4) under the full model.

Remark 4.7

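As shown in Theorem 7, the optimal choice of $w$ is $w_{\mathrm{opt}} = \pi(\beta^{(-m)} \mid \beta^{(m)}, D)$. However, for the GLM in (2), $w_{\mathrm{opt}}$ is not available in closed form. Fortunately, for the GLM, a good $w(\beta^{(-m)} \mid \beta^{(m)})$, which is close to the optimal choice, can be constructed based on the asymptotic approximation to the joint posterior proposed by Chen (1985). Let $\hat{\beta}$ denote the posterior mode of $\beta$ under the full model, i.e.,

$$\hat{\beta} = \arg\max_{\beta}\log L(\beta \mid D) = \arg\max_{\beta}\{(y + a_0 y_0)'\theta - (1 + a_0)J'b(\theta)\}.$$

Also let

$$\hat{\Sigma} = \left\{-\frac{\partial^2\log L(\beta \mid D)}{\partial\beta\,\partial\beta'}\bigg|_{\beta = \hat{\beta}}\right\}^{-1}.$$

Then, the joint posterior $\pi(\beta \mid D)$ under the full model can be approximated by

$$\hat{\pi}(\beta \mid \hat{\beta}, D) = (2\pi)^{-\frac{k+1}{2}}\,|\hat{\Sigma}|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\beta - \hat{\beta})'\hat{\Sigma}^{-1}(\beta - \hat{\beta})\right\}. \tag{51}$$

Using (51), we simply take $w(\beta^{(-m)} \mid \beta^{(m)}) = \hat{\pi}(\beta^{(-m)} \mid \beta^{(m)}, \hat{\beta}, D)$, which is the conditional distribution of $\beta^{(-m)}$ given $\beta^{(m)}$ with respect to the $(k + 1)$-dimensional multivariate normal distribution in (51).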
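The construction in this remark can be sketched end to end in a setting where the answer is known. The example below uses the normal linear model with $\tau = 1$, for which the normal approximation (51) is exact and the submodel posterior mean is available in closed form, so the reweighted estimate (35) can be compared with the truth. The data, seed, $a_0$, the constant prior prediction, and the helper names are illustrative assumptions, and independent draws from the exact posterior stand in for an MCMC sample.

```python
import numpy as np

def cond_normal_logpdf(beta, mean, cov, idx_m, idx_rest):
    """Row-wise log density of beta[idx_rest] given beta[idx_m] under N(mean, cov)."""
    S_aa = cov[np.ix_(idx_m, idx_m)]
    S_ba = cov[np.ix_(idx_rest, idx_m)]
    S_bb = cov[np.ix_(idx_rest, idx_rest)]
    K = S_ba @ np.linalg.inv(S_aa)
    c_mean = mean[idx_rest] + (beta[:, idx_m] - mean[idx_m]) @ K.T
    c_cov = S_bb - K @ S_ba.T
    dev = beta[:, idx_rest] - c_mean
    quad = np.einsum("si,ij,sj->s", dev, np.linalg.inv(c_cov), dev)
    _, logdet = np.linalg.slogdet(c_cov)
    return -0.5 * (len(idx_rest) * np.log(2 * np.pi) + logdet + quad)

def log_L(beta, X, y_tilde, a0):
    """log of exp{(y + a0 y0)' theta - (1 + a0) J' b(theta)} with b(t) = t^2 / 2."""
    theta = beta @ X.T
    return theta @ y_tilde - 0.5 * (1 + a0) * np.sum(theta ** 2, axis=1)

rng = np.random.default_rng(3)                       # arbitrary seed (assumption)
n, a0 = 200, 0.1
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
y_tilde = y + a0 * np.ones(n)                        # y + a0 * y0 with every y0i = 1

idx_m, idx_rest = [0, 1], [2]                        # submodel keeps the intercept and x1
mode = np.linalg.solve(X.T @ X, X.T @ y_tilde) / (1 + a0)   # posterior mode, full model
Sigma = np.linalg.inv(X.T @ X) / (1 + a0)                   # inverse negative Hessian
draws = rng.multivariate_normal(mode, Sigma, size=20_000)   # stand-in for an MCMC sample

log_r = (log_L(draws[:, idx_m], X[:, idx_m], y_tilde, a0)
         + cond_normal_logpdf(draws, mode, Sigma, idx_m, idx_rest)
         - log_L(draws, X, y_tilde, a0))
r = np.exp(log_r - log_r.max())
beta_m_hat = (r / r.sum()) @ draws[:, idx_m]         # estimate (35) with g(beta) = beta

Xm = X[:, idx_m]
beta_m_exact = np.linalg.solve(Xm.T @ Xm, Xm.T @ y_tilde) / (1 + a0)
```

For a non-normal GLM the same code structure would apply with `log_L` replaced by the appropriate kernel and `mode`, `Sigma` obtained numerically, in which case `w` is only an approximation to $w_{\mathrm{opt}}$.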

Remark 4.8

As a by-product, Cm(D)/C(D) is ready to compute via the identity (33). It can also be shown that

$$\frac{C_{0m}(y_0)}{C_0(y_0)} = E\left\{\frac{L(\beta^{(m)} \mid y_0, a_0, m)\,w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid y_0, a_0)} \,\Big|\, y_0, a_0\right\}, \tag{52}$$

where $L(\beta^{(m)} \mid y_0, a_0, m) = \exp[a_0\{y_0'\theta^{(m)} - J'b(\theta^{(m)})\}]$ and the expectation is taken with respect to the prior distribution in (3) under the full model. After examining the construction of the conjugate prior and the form of the GLM in (2), we can also show that

$$B_m = \frac{C_m(D)/C(D)}{C_{0m}(y_0)/C_0(y_0)} = \frac{\pi(\beta^{(-m)} = 0 \mid D)}{\pi(\beta^{(-m)} = 0 \mid y_0, a_0)}, \tag{53}$$

where $\pi(\beta^{(-m)} = 0 \mid D)$ and $\pi(\beta^{(-m)} = 0 \mid y_0, a_0)$ are the marginal posterior density and the marginal prior density of $\beta^{(-m)}$ evaluated at $\beta^{(-m)} = 0$ under the full model. Furthermore, $B_m$ in (53) is the Bayes factor for comparing model $m$ to the full model. Thus, to compute $B_m$, we need to generate two MCMC samples, one from the posterior distribution and another one from the prior distribution of $\beta$ under the full model, and then use (33) and (52).
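In the normal linear model both sides of (53) are available in closed form, which gives a convenient numerical check: the left side follows from differences of (16), and the right side from the normal marginal prior and posterior densities of $\beta^{(-m)}$ at zero. The sketch below is an illustrative check under assumed data, $a_0$, $\tau$ and $y_0$; the function names are hypothetical.

```python
import numpy as np

def log_marglik_eq16(X, y, y0, a0, tau):
    """Right-hand side of (16) for the normal linear model with known tau."""
    n, k = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)
    u = (y + a0 * y0) / (1 + a0)
    return (0.5 * n * np.log(tau / (2 * np.pi)) - 0.5 * tau * (y @ y)
            + 0.5 * tau * (1 + a0) * (u @ P @ u)
            - 0.5 * tau * a0 * (y0 @ P @ y0)
            + 0.5 * k * np.log(a0 / (1 + a0)))

def normal_logpdf_at_zero(mean, cov):
    """log N(0; mean, cov)."""
    _, logdet = np.linalg.slogdet(cov)
    quad = mean @ np.linalg.solve(cov, mean)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(5)                       # arbitrary seed (assumption)
n, a0, tau = 80, 0.2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=tau ** -0.5, size=n)
y0 = np.full(n, 0.3)
keep, drop = [0, 1], [2, 3]                          # model m keeps the intercept and x1

XtX_inv = np.linalg.inv(X.T @ X)
post_mean = XtX_inv @ X.T @ (y + a0 * y0) / (1 + a0)
post_cov = XtX_inv / (tau * (1 + a0))
prior_mean = XtX_inv @ X.T @ y0
prior_cov = XtX_inv / (tau * a0)

# Right side of (53): ratio of marginal densities of beta^{(-m)} at zero, on the log scale.
log_bf_density = (normal_logpdf_at_zero(post_mean[drop], post_cov[np.ix_(drop, drop)])
                  - normal_logpdf_at_zero(prior_mean[drop], prior_cov[np.ix_(drop, drop)]))
# Left side of (53): difference of log marginal likelihoods, model m versus full model.
log_bf_marglik = (log_marglik_eq16(X[:, keep], y, y0, a0, tau)
                  - log_marglik_eq16(X, y, y0, a0, tau))
```

The two quantities agree to rounding error, which reflects the fact that the conjugate prior for model $m$ is the full-model prior conditioned on $\beta^{(-m)} = 0$.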

Finally, we note that we derive wopt under the independence assumption. We expect that this optimal choice will work well even when a dependent MCMC sample is used. Some related empirical studies have been reported and discussed in Meng and Wong (1996), Diciccio et al. (1997) and Meng and Schilling (2002). They suggested that the optimal or near-optimal procedures constructed under the independence assumption can work remarkably well in general, providing orders of magnitude improvement over other methods with similar computational effort. Alternatively, suppose we systematically take a 1-in-b subsample of size S from the Markov chain that is generated from the joint posterior distribution in (4). Then, following from Guha et al. (2004), we can show that (45) holds under some mild regularity conditions such as geometrical ergodicity and a sufficiently large b. Thus, if we take a MCMC sample in such a way, this MCMC sample can be treated as “a random sample.”

5 A Simulation Study

In Section 3, we established theoretical connections among AIC, BIC and the four Bayesian criteria in the normal linear regression setting. However, it does not appear possible to obtain analytic connections between AIC or BIC and the four Bayesian criteria for Poisson regression. For this reason, we present a simulation study for Poisson regression to empirically examine whether there exist any connections among these criteria and to examine the performance of conjugate priors in the context of variable selection. Suppose yi|θi are independent Poisson observations with mean exp(xiβ), where xi is a 1 × p vector, i = 1, 2, …, n. The conjugate prior takes the form

$$\pi(\beta \mid a_0, y_0) \propto \exp\left\{ \sum_{i=1}^{n} a_0 \left( y_{0i}\, x_i \beta - \exp\{x_i \beta\} \right) \right\}, \tag{54}$$

where y0i is the ith component of y0. In the simulation, we assume that xi0 = 1, xij ~ N (0, 1) independently for j = 1, 2, 3 and i = 1, 2, …, n. In (54), we take y0i = 1 for i = 1, 2, …, n, which yields a prior mode of β equal to 0, as shown in Chen and Ibrahim (2003). Further, we use β = (−0.3, 0.3, 0, 0)′, β = (−0.3, 0.3, 0.2, 0)′, and β = (−0.3, 0.3, 0.2, −0.15)′, which correspond to the true models (x1), (x1, x2), and (x1, x2, x3) (full model), respectively. We also use a sample size of n = 500.
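The claim that y0i = 1 places the prior mode at β = 0 can be checked directly from the kernel in (54). The following worked step is a sketch that treats xi as a row vector, as stated above, and assumes the design matrix has full column rank.

```latex
\frac{\partial}{\partial \beta} \log \pi(\beta \mid a_0, y_0)
  = a_0 \sum_{i=1}^{n} \{ y_{0i} - \exp(x_i \beta) \}\, x_i' ,
\qquad
\frac{\partial^2}{\partial \beta\, \partial \beta'} \log \pi(\beta \mid a_0, y_0)
  = - a_0 \sum_{i=1}^{n} \exp(x_i \beta)\, x_i' x_i .
% At beta = 0, exp(x_i beta) = 1, so the gradient is a_0 * sum_i (y_{0i} - 1) x_i' = 0 when y_{0i} = 1.
% The Hessian is negative definite for a_0 > 0 under full column rank, so beta = 0 is the unique mode.
```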

Under the simulation design, we independently generated N = 500 datasets. For each simulated dataset, we fit 2^3 = 8 models. To compute the posterior model probabilities based on the conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. For all of these 8 models, we computed BF, DIC, L measure, LPML, AIC, and BIC.
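The AIC and BIC part of this design can be sketched as follows. This is a minimal illustration, not the authors' code: it fits each of the 8 candidate models by a plain Newton-Raphson routine (`fit_poisson_mle`, a hypothetical helper), uses only 50 replicates instead of 500 to keep the run short, and covers only the case where (x1) is the true model. The four Bayesian criteria are not computed here.

```python
import itertools
import numpy as np

def fit_poisson_mle(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson MLE for log-linear Poisson regression (column 0 is the intercept)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                     # assumes y.mean() > 0
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    # Log-likelihood without the -sum(log y!) term, which is the same for every model.
    return beta, float(np.sum(y * eta - np.exp(eta)))

def aic_bic_table(X, y):
    """Return {subset of (1, 2, 3): (AIC, BIC)}; the intercept is in every model."""
    n = len(y)
    out = {}
    for size in range(4):
        for subset in itertools.combinations((1, 2, 3), size):
            cols = (0,) + subset
            _, loglik = fit_poisson_mle(X[:, cols], y)
            k = len(cols)
            out[subset] = (-2 * loglik + 2 * k, -2 * loglik + k * np.log(n))
    return out

rng = np.random.default_rng(2008)
n, n_rep = 500, 50                                 # n from the text; n_rep reduced for speed
beta_true = np.array([-0.3, 0.3, 0.0, 0.0])        # true model (x1)
true_model = (1,)
hits = {"AIC": 0, "BIC": 0}                        # 0-1 loss: count correct selections

for _ in range(n_rep):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    y = rng.poisson(np.exp(X @ beta_true))
    table = aic_bic_table(X, y)
    best_aic = min(table, key=lambda m: table[m][0])
    best_bic = min(table, key=lambda m: table[m][1])
    hits["AIC"] += int(best_aic == true_model)
    hits["BIC"] += int(best_bic == true_model)

print(hits)
```

Because the BIC penalty per parameter, log(500), exceeds the AIC penalty of 2, the model chosen by BIC is never larger than the one chosen by AIC on the same dataset.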

Tables 1 and 2 show results for the various methods. Our model performance evaluation criterion is a 0-1 loss function, the loss being 0 if the true model is selected and 1 otherwise. In Table 1, we see that BIC performs better than AIC in the number of times the true model is selected as best when the true model is a smaller model. For example, when (x1) is the true model, AIC correctly identifies this model as best 361 times out of 500 and BIC correctly identifies this model as best 490 times. Table 2 compares the performance of the four other criteria under several values of a0 from the conjugate prior as well as several values of ν for the L measure. We see from the table that, in general, for small values of a0, which imply a noninformative prior, the Bayes factor results are quite consistent with DIC, the L measure, and LPML for small models being the true models, whereas when the full model is the true model, the Bayes factor tends to do worse for small a0 compared to large a0. In general, as a0 increases, the performance of DIC, LPML, and the Bayes factor becomes worse, whereas for the L measure, it is fairly robust over several values of a0. The L measure seems to perform best under moderate values of ν, such as ν = 0.5.

Table 1.

Frequencies for Ranking the True Model as Best Using AIC and BIC Based on n = 500 and N = 500 Datasets

| True Model | AIC | BIC |
|---|---|---|
| (x1) | 361 | 490 |
| (x1, x2) | 425 | 446 |
| (x1, x2, x3) | 474 | 316 |

Table 2.

Frequencies for Ranking the True Model as Best Using BF, DIC, CPO and L measure for Various a0 Based on n = 500 and N = 500 Datasets

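| True Model | a0 | LPML | DIC | BF | L (ν = 0.1) | L (ν = 0.3) | L (ν = 0.5) | L (ν = 0.7) | L (ν = 0.9) |
|---|---|---|---|---|---|---|---|---|---|
| (x1) | 0.001 | 395 | 361 | 492 | 398 | 396 | 359 | 318 | 276 |
| | 0.01 | 396 | 357 | 466 | 396 | 396 | 357 | 319 | 275 |
| | 0.1 | 377 | 332 | 386 | 408 | 396 | 352 | 304 | 268 |
| | 0.5 | 342 | 308 | 311 | 424 | 381 | 335 | 279 | 243 |
| | 1 | 320 | 299 | 288 | 424 | 372 | 321 | 264 | 222 |
| (x1, x2) | 0.001 | 425 | 425 | 436 | 164 | 347 | 390 | 380 | 356 |
| | 0.01 | 423 | 425 | 470 | 157 | 352 | 390 | 383 | 355 |
| | 0.1 | 417 | 417 | 443 | 195 | 370 | 399 | 372 | 353 |
| | 0.5 | 398 | 405 | 405 | 254 | 400 | 402 | 362 | 339 |
| | 1 | 382 | 394 | 391 | 269 | 410 | 390 | 359 | 329 |
| (x1, x2, x3) | 0.001 | 475 | 474 | 291 | 88 | 371 | 456 | 475 | 480 |
| | 0.01 | 475 | 474 | 388 | 94 | 375 | 458 | 475 | 482 |
| | 0.1 | 479 | 475 | 460 | 125 | 402 | 466 | 480 | 488 |
| | 0.5 | 485 | 479 | 479 | 176 | 436 | 478 | 486 | 489 |
| | 1 | 486 | 481 | 481 | 214 | 453 | 483 | 487 | 490 |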

6 A Real Data Example

Due to the lack of analytic connections between AIC or BIC and the four Bayesian criteria for logistic regression, we consider the Chapman data from the Los Angeles Heart Study of men (n = 200) presented in Dixon and Massey (1983) to empirically examine whether there exist any connections among these criteria.

In our analysis, we consider a coronary incident as a binary response variable (y), which takes the values 0 and 1, where a 1 denotes that an incident had occurred in the previous ten years and a 0 indicates otherwise. We consider five prognostic factors: age (Ag), systolic blood pressure in millimeters of mercury (S), diastolic blood pressure in millimeters of mercury (D), cholesterol in milligrams per dL (Ch), and BMI = (703.07 × Weight)/(Height²).
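As a small illustration of the BMI covariate, the sketch below evaluates the formula above. The text does not state the measurement units; the factor 703.07 suggests weight in pounds and height in inches, and that is assumed here. The function name and the example values are hypothetical.

```python
def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """BMI = 703.07 * weight / height^2, assuming pounds and inches."""
    if height_in <= 0:
        raise ValueError("height must be positive")
    return 703.07 * weight_lb / height_in ** 2

# Example with made-up measurements (not taken from the Chapman data).
example_bmi = bmi_from_imperial(150.0, 65.0)
print(round(example_bmi, 2))
```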

Let x1, x2, x3, x4, and x5 denote Ag, S, D, Ch, and BMI, respectively. For the Chapman data, we fit a logistic regression model

$$\operatorname{logit}\{P(y = 1 \mid x)\} = \log\left\{ \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} \right\} = x\beta. \tag{55}$$

The conjugate prior in (3) corresponding to the model (55) takes the form

$$\pi(\beta \mid a_0, y_0) \propto \exp\left( \sum_{i=1}^{n} a_0 \left[ y_{0i}\, x_i \beta - \log\{1 + \exp(x_i \beta)\} \right] \right), \tag{56}$$

where y0i = 0.5, i = 1, 2, …, n, to ensure that the prior mode of β is 0. We wish to compare the following 32 models: Intercept only, (x1), …, (x5), (x1, x2), …, (x1, x2, x3, x4, x5). We note that the notation (x1, x2, x3, x4, x5), for example, implies that xiβ = β0 + β1Agi + β2Si + β3Di + β4Chi + β5BMIi in (55). Thus, “Intercept only” is the model with zero predictors while (x1, x2, x3, x4, x5) is the full model with the largest model dimension. We also note that an intercept is included in every model. Further, we denote M1 = (Int), M2 = (Int, Ag), M3 = (Int, S), M4 = (Int, D), M5 = (Int, Ch), M6 = (Int, BMI), M7 = (Int, Ag, S), M8 = (Int, Ag, D), M9 = (Int, Ag, Ch), M10 = (Int, Ag, BMI), M11 = (Int, S, D), M12 = (Int, S, Ch), M13 = (Int, S, BMI), M14 = (Int, D, Ch), M15 = (Int, D, BMI), M16 = (Int, Ch, BMI), M17 = (Int, Ag, S, D), M18 = (Int, Ag, S, Ch), M19 = (Int, Ag, S, BMI), M20 = (Int, Ag, D, Ch), M21 = (Int, Ag, D, BMI), M22 = (Int, Ag, Ch, BMI), M23 = (Int, S, D, Ch), M24 = (Int, S, D, BMI), M25 = (Int, S, Ch, BMI), M26 = (Int, D, Ch, BMI), M27 = (Int, Ag, S, D, Ch), M28 = (Int, Ag, S, D, BMI), M29 = (Int, Ag, S, Ch, BMI), M30 = (Int, Ag, D, Ch, BMI), M31 = (Int, S, D, Ch, BMI), and M32 = (Int, Ag, S, D, Ch, BMI).
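The labelling M1, …, M32 follows subset size and then the order Ag, S, D, Ch, BMI, so it can be regenerated programmatically. The sketch below does this and also checks numerically that the gradient of the log prior kernel in (56) vanishes at β = 0 when y0i = 0.5. The design matrix is synthetic (the Chapman data are not reproduced here), and the function names are illustrative.

```python
import itertools
import numpy as np

factors = ("Ag", "S", "D", "Ch", "BMI")

# Enumerate all subsets by size, then lexicographically; the intercept is always included.
models = {}
label = 1
for size in range(len(factors) + 1):
    for subset in itertools.combinations(factors, size):
        models[f"M{label}"] = ("Int",) + subset
        label += 1

def log_prior_kernel(beta, X, a0, y0):
    """Log of the unnormalized conjugate prior (56) for logistic regression."""
    eta = X @ beta
    return float(a0 * np.sum(y0 * eta - np.logaddexp(0.0, eta)))

def grad_log_prior_kernel(beta, X, a0, y0):
    """Gradient of the log kernel: a0 * X'(y0 - expit(X beta))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return a0 * (X.T @ (y0 - p))

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])   # synthetic stand-in design
y0 = np.full(n, 0.5)
grad_at_zero = grad_log_prior_kernel(np.zeros(6), X, a0=0.01, y0=y0)

print(models["M22"], np.abs(grad_at_zero).max())
```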

To compute the posterior model probability (PMP), DIC, LPML, and L measure under various conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. We see from Table 3 that M22 is selected as the best model by AIC and the fourth best model by BIC, whereas M10 is selected as the second best model by both criteria. Table 4 shows the results of the L measure, posterior model probability (PMP), LPML, and DIC for several values of a0, as well as several values of ν for the L measure. Table 4 reveals a similar story as the simulation study. Model M22 is selected as either the top model or second best model for most values of a0 for DIC and PMP, as well as for the L measure under small values of ν. Under larger values of ν, the L measure as well as LPML appear to favor model M32. Finally, for small values of a0, LPML and PMP appear to favor a smaller model, namely M2. Thus, from these analyses, models {M2, M22, M32} appear to be the most promising based on all of these model selection criteria. Table 5 shows the top five models selected for each of the four variable selection criteria (PMP, DIC, L measure, LPML). Again we see a remarkable consistency between the four criteria, in which the ordering of the top models is similar for the four criteria for small, moderate, and large values of a0, and for a wide range of ν values for the L measure.

Table 3.

The Top Model Based on AIC and BIC for Chapman Data

| AIC: Mk | AIC: Value | BIC: Mk | BIC: Value |
|---|---|---|---|
| M22 | 142.75 | M2 | 153.34 |
| M10 | 143.73 | M10 | 153.63 |
| M29 | 144.69 | M9 | 155.83 |
| M30 | 144.75 | M22 | 155.94 |
| M19 | 145.57 | M16 | 155.99 |
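The AIC and BIC columns of Table 3 can be cross-checked against each other. The sketch below assumes the usual definitions, AIC = −2 log L + 2k and BIC = −2 log L + k log n, with n = 200 and k counting the intercept plus the predictors in the model; these conventions are assumptions, since the section does not restate them.

```python
import math

n = 200                                   # Chapman data sample size
aic = {"M22": 142.75, "M10": 143.73}      # values from Table 3
n_params = {"M22": 4, "M10": 3}           # intercept plus predictors (assumed count)

# BIC - AIC = k * (log n - 2) under the assumed definitions.
bic_from_aic = {m: aic[m] + n_params[m] * (math.log(n) - 2) for m in aic}

for m, value in bic_from_aic.items():
    print(m, round(value, 2))
```

Under these assumptions the reconstructed values agree with the tabulated BIC entries for M22 and M10 up to rounding of the reported AIC values.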

Table 4.

The Best Model Based on Posterior Model Probability (PMP), DIC, LPML, and L Measure for Chapman Data

| Criterion | a0 = 0.001: Mk | Value | a0 = 0.01: Mk | Value | a0 = 0.1: Mk | Value | a0 = 0.5: Mk | Value | a0 = 1.0: Mk | Value |
|---|---|---|---|---|---|---|---|---|---|---|
| PMP | M2 | 0.57 | M2 | 0.25 | M22 | 0.14 | M22 | 0.07 | M22 | 0.06 |
| DIC | M22 | 142.83 | M22 | 142.67 | M22 | 144.74 | M22 | 165.65 | M22 | 186.77 |
| LPML | M2 | −73.38 | M2 | −73.30 | M32 | −73.79 | M32 | −83.10 | M30 | −93.29 |
| L(ν = 0.1) | M22 | 21.47 | M22 | 21.98 | M22 | 26.96 | M22 | 38.92 | M30 | 45.21 |
| L(ν = 0.25) | M22 | 24.79 | M22 | 25.29 | M22 | 30.23 | M22 | 42.56 | M30 | 49.39 |
| L(ν = 0.5) | M32 | 30.20 | M32 | 30.73 | M32 | 35.66 | M29 | 48.59 | M30 | 56.36 |
| L(ν = 0.75) | M32 | 35.24 | M32 | 35.76 | M32 | 40.77 | M32 | 54.52 | M30 | 63.33 |
| L(ν = 0.9) | M32 | 38.26 | M32 | 38.78 | M32 | 43.83 | M32 | 58.06 | M30 | 67.51 |

Table 5.

The Top Five Models Based on PMP, DIC, LPML, and L Measure for Chapman Data

| Criterion | a0 = 0.001: Mk | Value | a0 = 0.01: Mk | Value | a0 = 0.1: Mk | Value | a0 = 0.5: Mk | Value | a0 = 1.0: Mk | Value |
|---|---|---|---|---|---|---|---|---|---|---|
| PMP | M2 | 0.57 | M2 | 0.25 | M22 | 0.14 | M22 | 0.07 | M22 | 0.06 |
| | M1 | 0.11 | M10 | 0.23 | M10 | 0.14 | M10 | 0.07 | M10 | 0.05 |
| | M5 | 0.07 | M9 | 0.08 | M9 | 0.06 | M29 | 0.05 | M19 | 0.04 |
| | M10 | 0.07 | M22 | 0.08 | M2 | 0.06 | M19 | 0.05 | M29 | 0.04 |
| | M6 | 0.06 | M16 | 0.07 | M16 | 0.06 | M30 | 0.05 | M21 | 0.04 |
| DIC | M22 | 142.83 | M22 | 142.67 | M22 | 144.74 | M22 | 165.65 | M22 | 186.77 |
| | M10 | 143.79 | M10 | 143.70 | M10 | 145.70 | M10 | 166.02 | M10 | 186.88 |
| | M30 | 144.85 | M29 | 144.74 | M29 | 146.43 | M29 | 166.42 | M30 | 187.10 |
| | M29 | 144.96 | M30 | 144.78 | M30 | 146.48 | M19 | 166.87 | M21 | 187.38 |
| | M21 | 145.63 | M21 | 145.59 | M21 | 147.29 | M30 | 166.90 | M20 | 187.75 |
| LPML | M2 | −73.38 | M2 | −73.30 | M32 | −73.79 | M32 | −83.10 | M30 | −93.29 |
| | M5 | −73.56 | M5 | −73.50 | M2 | −74.11 | M29 | −83.32 | M32 | −93.35 |
| | M4 | −73.58 | M6 | −73.55 | M7 | −74.43 | M10 | −83.34 | M21 | −93.41 |
| | M6 | −73.73 | M4 | −73.64 | M8 | −74.48 | M19 | −83.55 | M10 | −93.41 |
| | M32 | −73.91 | M3 | −73.65 | M9 | −74.68 | M21 | −83.56 | M20 | −93.55 |
| L(ν = 0.1) | M22 | 21.47 | M22 | 21.98 | M22 | 26.96 | M22 | 38.92 | M30 | 45.21 |
| | M30 | 22.04 | M25 | 22.63 | M25 | 27.33 | M25 | 38.99 | M22 | 45.23 |
| | M26 | 22.13 | M30 | 22.64 | M30 | 27.41 | M29 | 39.02 | M26 | 45.26 |
| | M32 | 22.15 | M29 | 22.65 | M26 | 27.45 | M19 | 39.16 | M20 | 45.28 |
| | M10 | 22.21 | M10 | 22.67 | M29 | 27.47 | M10 | 39.18 | M21 | 45.31 |
| L(ν = 0.25) | M22 | 24.79 | M22 | 25.29 | M22 | 30.23 | M22 | 42.56 | M30 | 49.39 |
| | M30 | 25.17 | M32 | 25.69 | M32 | 30.56 | M29 | 42.61 | M22 | 49.45 |
| | M32 | 25.17 | M30 | 25.77 | M30 | 30.56 | M25 | 42.67 | M20 | 49.49 |
| | M10 | 25.43 | M29 | 25.78 | M29 | 30.62 | M32 | 42.73 | M26 | 49.51 |
| | M20 | 25.47 | M10 | 25.89 | M25 | 30.65 | M19 | 42.79 | M21 | 49.52 |
| L(ν = 0.5) | M32 | 30.20 | M32 | 30.73 | M32 | 35.66 | M29 | 48.59 | M30 | 56.36 |
| | M22 | 30.31 | M22 | 30.80 | M22 | 35.70 | M32 | 48.62 | M32 | 56.47 |
| | M30 | 30.38 | M30 | 30.98 | M30 | 35.81 | M22 | 48.64 | M22 | 56.50 |
| | M20 | 30.77 | M29 | 31.00 | M29 | 35.88 | M30 | 48.78 | M20 | 56.50 |
| | M10 | 30.80 | M10 | 31.26 | M10 | 36.10 | M25 | 48.79 | M21 | 56.55 |
| L(ν = 0.75) | M32 | 35.24 | M32 | 35.76 | M32 | 40.77 | M32 | 54.52 | M30 | 63.33 |
| | M30 | 35.59 | M30 | 36.20 | M30 | 41.06 | M29 | 54.56 | M32 | 63.41 |
| | M22 | 35.84 | M29 | 36.22 | M29 | 41.14 | M22 | 54.71 | M20 | 63.52 |
| | M29 | 36.04 | M22 | 36.31 | M22 | 41.16 | M30 | 54.78 | M22 | 63.54 |
| | M20 | 36.08 | M10 | 36.63 | M10 | 41.48 | M19 | 54.88 | M21 | 63.57 |
| L(ν = 0.9) | M32 | 38.26 | M32 | 38.78 | M32 | 43.83 | M32 | 58.06 | M30 | 67.51 |
| | M30 | 38.72 | M30 | 39.33 | M30 | 44.21 | M29 | 58.15 | M32 | 67.58 |
| | M22 | 39.16 | M29 | 39.35 | M29 | 44.29 | M22 | 58.36 | M20 | 67.73 |
| | M29 | 39.17 | M22 | 39.61 | M22 | 44.44 | M30 | 58.37 | M22 | 67.77 |
| | M20 | 39.26 | M27 | 39.83 | M10 | 44.70 | M19 | 58.51 | M21 | 67.79 |

Table 6 shows the posterior means (Estimates), the posterior standard errors (SEs), and 95% HPD intervals for the βj's under model M22 (Ag, Ch, BMI) and model M32 (Ag, S, D, Ch, BMI) when a0 = 0.01. Table 6 also shows the corresponding maximum likelihood estimates (MLEs), the standard errors, and p-values. We see from Table 6 that the posterior estimates are very close to the MLEs, which is intuitively appealing, as a fairly noninformative prior (a0 = 0.01) is used. We also see from this table that under these two “best” models, age and BMI are the only two prognostic factors for the coronary incident that are significant at the 5% significance level.

Table 6.

Estimates of the β under Model (Ag, Ch, BMI) and Model (Ag, S, D, Ch, BMI) for the Chapman Data when a0 = 0.01

| Model | Variable | MLE: Estimate | MLE: SE | MLE: p-value | Posterior: Estimate | Posterior: SE | Posterior: 95% HPD Interval |
|---|---|---|---|---|---|---|---|
| M22 | Intercept | −2.252 | 0.275 | < .0001 | −2.265 | 0.272 | (−2.805, −1.748) |
| | Ag | 0.556 | 0.245 | 0.0230 | 0.554 | 0.242 | (0.087, 1.032) |
| | Ch | 0.405 | 0.233 | 0.0816 | 0.402 | 0.234 | (−0.064, 0.854) |
| | BMI | 0.470 | 0.204 | 0.0211 | 0.465 | 0.207 | (0.069, 0.882) |
| M32 | Intercept | −2.248 | 0.274 | < .0001 | −2.292 | 0.273 | (−2.828, −1.766) |
| | Ag | 0.527 | 0.270 | 0.0507 | 0.531 | 0.270 | (0.012, 1.067) |
| | S | 0.106 | 0.336 | 0.7523 | 0.097 | 0.344 | (−0.583, 0.757) |
| | D | −0.077 | 0.383 | 0.8417 | −0.069 | 0.383 | (−0.806, 0.687) |
| | Ch | 0.404 | 0.235 | 0.0857 | 0.402 | 0.240 | (−0.074, 0.866) |
| | BMI | 0.474 | 0.226 | 0.0361 | 0.473 | 0.230 | (0.028, 0.930) |
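The maximum likelihood p-values in Table 6 are close to what a two-sided Wald test gives from the rounded estimates and standard errors. The sketch below assumes that the reported p-values are of this type, which the section does not state; small differences are expected because the inputs are rounded to three decimals.

```python
import math

def wald_p_value(estimate: float, se: float) -> float:
    """Two-sided normal p-value for z = estimate / se."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# (estimate, SE, reported p-value) for model M22, taken from Table 6.
m22_rows = {
    "Ag": (0.556, 0.245, 0.0230),
    "Ch": (0.405, 0.233, 0.0816),
    "BMI": (0.470, 0.204, 0.0211),
}

recomputed = {name: wald_p_value(est, se) for name, (est, se, _) in m22_rows.items()}
for name, p in recomputed.items():
    print(name, round(p, 4), m22_rows[name][2])
```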

To examine the performance of the Monte Carlo method proposed in Section 4, we first computed various model selection criteria under a sub-model using an MCMC sample from the full model. We then computed the same quantities using an MCMC sample directly from the posterior distribution under the same sub-model. For illustrative purposes, we considered a single variable sub-model M2 = (Int, Ag) using the conjugate prior (56) with a0 = 0.01. Using an MCMC sample size of S = 20,000, the Monte Carlo estimates (simulation standard errors) of DIC, LPML, L(ν = 0.1), L(ν = 0.5), and L(ν = 0.9) under model M2 are 146.68 (0.08), −73.30 (0.04), 23.91 (0.05), 32.44 (0.06), and 40.96 (0.06), respectively, using the proposed Monte Carlo method via (35). With the same MC sample size, these quantities are 146.67 (0.02), −73.29 (0.01), 23.90 (0.02), 32.42 (0.02), and 40.95 (0.02), respectively, using the MC sample directly from the posterior distribution under model M2. All simulation standard errors were computed using the overlapping batch statistics (OBS) method of Schmeiser et al. (1990). As expected, the simulation standard errors using the MC sample from the full model are slightly larger than those computed using the MC sample directly from model M2. However, these two sets of MC estimates are very close. This empirically demonstrates that the proposed MC method works quite well.

Finally, we compared the computational times of the proposed Monte Carlo method and the exhaustive alternative. With 2,000 “burn-in” iterations and S = 20,000, the computational times of the proposed Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 71.28, 100.11, and 76.36 seconds, respectively, on a Dell WS Xeon dual 2.4 GHz CPU Linux workstation. Using the same number of “burn-in” iterations, the same MC sample size, and the same computer, the computational times of the exhaustive alternative Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 324.05, 357.97, and 322.13 seconds, respectively. Thus, the proposed Monte Carlo method leads to a substantial computational saving over the exhaustive alternative.
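Two parts of this comparison lend themselves to a short sketch. The first function below computes a simulation standard error by overlapping batch means, which is the special case of overlapping batch statistics for a sample mean; applying OBS to DIC, LPML, or the L measure would require recomputing the statistic on each batch, which is not shown. The batch size of 100 and the synthetic input are assumptions. The second part simply forms the ratios of the reported running times.

```python
import numpy as np

def obm_standard_error(x, batch_size):
    """Standard error of the sample mean via overlapping batch means."""
    x = np.asarray(x, dtype=float)
    n, m = x.size, int(batch_size)
    if not 1 <= m < n:
        raise ValueError("batch_size must satisfy 1 <= batch_size < len(x)")
    csum = np.concatenate(([0.0], np.cumsum(x)))
    batch_means = (csum[m:] - csum[:-m]) / m          # n - m + 1 overlapping batch means
    var_of_mean = m * np.sum((batch_means - x.mean()) ** 2) / ((n - m) * (n - m + 1))
    return float(np.sqrt(var_of_mean))

# Synthetic independent draws; for these the result should be near 1 / sqrt(S).
rng = np.random.default_rng(3)
S = 20000
draws = rng.normal(size=S)
se_obm = obm_standard_error(draws, batch_size=100)

# Reported running times in seconds for the 32 models.
proposed = {"DIC": 71.28, "LPML": 100.11, "L": 76.36}
exhaustive = {"DIC": 324.05, "LPML": 357.97, "L": 322.13}
speedup = {k: exhaustive[k] / proposed[k] for k in proposed}

print(se_obm, {k: round(v, 2) for k, v in speedup.items()})
```

On the reported times, the exhaustive alternative takes roughly 3.6 to 4.5 times as long as the proposed method.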

7 Concluding Remarks

We have examined and established theoretical and computational relationships between six commonly used methods for variable subset selection. These connections were facilitated by the class of conjugate priors of Chen and Ibrahim (2003). We saw that under this class of priors the four Bayesian criteria were quite similar in terms of model choice, especially under small values of a0, and the results were fairly robust under a wide choice of a0 values. Further work remains to be done. In particular, it is of interest to obtain analytic connections between these criteria for specific GLM’s, such as the logistic and Poisson regression models, as well as to examine theoretically the small sample and large sample behavior of these methods. In Section 4, the theory and algorithm are developed for computing the four Bayesian criteria defined for the GLM in (2). With some straightforward modification, the theory and algorithm can also be applied to compute the four Bayesian criteria defined for the general GLM in (1).

Some philosophical issues about model selection are worth noting. In this paper, we have evaluated the performance of all criteria based on how well they can pick up the true sampling model. However, there are other ways of defining the “Bayesian model.” Many advocate that a Bayesian model is specified by the sampling density and the prior, not only by the sampling density. When one evaluates the success of a criterion only by how well it picks up the sampling model, a comparison between AIC (or BIC) and DIC is not meaningful when DIC is computed using an informative prior, since AIC is equivalent to DIC only under a noninformative prior. In general, one should avoid such comparisons, and only comparable criteria should be compared. For example, it is meaningful to compare AIC, BIC, DIC, LPML, the L measure, and the Bayes factor based on noninformative priors. It is also meaningful to compare DIC, the L measure, LPML, and the Bayes factor based on informative priors. Finally, we note that most criteria for model assessment, especially the information criteria, are based on a well-defined utility function. If a utility function is chosen, a comparison to a criterion based on a different utility function is not justified. For example, the Bayes factor and BIC are prior predictive criteria aiming at the explanation of the data given the prior, whereas DIC (AIC as a special case) and LPML are posterior predictive criteria aiming at the explanation of replicate (unseen) data given the posterior. Thus, one must use caution in comparing these criteria in terms of picking up the true sampling model.

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Editor, the Associate Editor, and the two referees for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.

Appendix: Proofs of Theorems

Proof of Theorem 5

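Since ∫ w(β(−m)|β(m))dβ(−m) = 1 and β = (β(m)′, β(−m)′)′, we have

$$\begin{aligned}
g_m &= \int g(\beta^{(m)})\, \frac{L(\beta^{(m)} \mid D, m)}{C_m(D)}\, d\beta^{(m)} \\
&= \int\!\!\int g(\beta^{(m)})\, \frac{L(\beta^{(m)} \mid D, m)}{C_m(D)}\, w(\beta^{(-m)} \mid \beta^{(m)})\, d\beta^{(-m)}\, d\beta^{(m)} \\
&= \frac{C(D)}{C_m(D)} \int g(\beta^{(m)})\, \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \cdot \frac{L(\beta \mid D)}{C(D)}\, d\beta \\
&= \frac{C(D)}{C_m(D)}\, E\left\{ g(\beta^{(m)})\, \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \,\Big|\, D \right\},
\end{aligned}$$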

which completes the proof.

Proof of Theorem 7

From (43), we have

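$$V_w(g_m) = E\left[ \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{L(\beta \mid D)} \right\}^2 \,\Big|\, D \right]. \tag{A.1}$$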

Plugging wopt into (A.1), we have

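$$\begin{aligned}
V_{w_{\mathrm{opt}}}(g_m) &= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{C(D)} \right\}^2 \frac{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)^2}{\pi(\beta \mid D)}\, d\beta \\
&= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{C(D)} \right\}^2 \frac{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)\, \pi(\beta^{(m)}, \beta^{(-m)} \mid D)}{\pi(\beta^{(m)} \mid D)\, \pi(\beta \mid D)}\, d\beta \\
&= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{C(D)} \right\}^2 \frac{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}{\pi(\beta^{(m)} \mid D)}\, d\beta \\
&= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{C(D)} \right\}^2 \frac{d\beta^{(m)}}{\pi(\beta^{(m)} \mid D)} \int \pi(\beta^{(-m)} \mid \beta^{(m)}, D)\, d\beta^{(-m)} \\
&= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{\pi(\beta^{(m)} \mid D)\, C(D)} \right\}^2 \pi(\beta^{(m)} \mid D)\, d\beta^{(m)},
\end{aligned} \tag{A.2}$$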

where π(β(m) | D) denotes the marginal posterior distribution of β(m) under the full model. Thus, it suffices to show

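$$\int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{\pi(\beta^{(m)} \mid D)} \right\}^2 \pi(\beta^{(m)} \mid D)\, d\beta^{(m)} \le \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{\pi(\beta \mid D)} \right\}^2 \pi(\beta \mid D)\, d\beta. \tag{A.3}$$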

By the Cauchy-Schwarz inequality, we have

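$$\begin{aligned}
1 &= \left\{ \int w(\beta^{(-m)} \mid \beta^{(m)})\, d\beta^{(-m)} \right\}^2
= \left\{ \int \frac{w(\beta^{(-m)} \mid \beta^{(m)})}{\sqrt{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}}\, \sqrt{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}\, d\beta^{(-m)} \right\}^2 \\
&\le \int \frac{w^2(\beta^{(-m)} \mid \beta^{(m)})}{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}\, d\beta^{(-m)} \int \pi(\beta^{(-m)} \mid \beta^{(m)}, D)\, d\beta^{(-m)}
= \int \frac{w^2(\beta^{(-m)} \mid \beta^{(m)})}{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}\, d\beta^{(-m)}.
\end{aligned} \tag{A.4}$$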

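Using (A.4), the left-hand side of (A.3) satisfies

$$\begin{aligned}
\int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)}{\pi(\beta^{(m)} \mid D)} \right\}^2 \pi(\beta^{(m)} \mid D)\, d\beta^{(m)}
&\le \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{\pi(\beta^{(m)} \mid D)} \right\}^2 \frac{\pi(\beta^{(m)} \mid D)}{\pi(\beta^{(-m)} \mid \beta^{(m)}, D)}\, d\beta \\
&= \int \{ g(\beta^{(m)}) - A^{-1}B \}^2 \left\{ \frac{L(\beta^{(m)} \mid D, m)\, w(\beta^{(-m)} \mid \beta^{(m)})}{\pi(\beta \mid D)} \right\}^2 \pi(\beta \mid D)\, d\beta,
\end{aligned}$$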

which exactly matches the right-hand side of (A.3).
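The Cauchy-Schwarz step in (A.4) can be checked numerically in one dimension. The sketch below assumes a standard normal density in the role of π(β(−m) | β(m), D) and compares the choice w = π with a shifted normal w; both densities and the grid are illustrative. For a unit-variance normal w with mean μ, the integral of w²/π equals exp(μ²), so it is 1 at the optimum and larger otherwise.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Fine grid for a simple Riemann-sum approximation of the integral.
x = np.linspace(-15.0, 15.0, 300001)
dx = x[1] - x[0]
target = normal_pdf(x, 0.0, 1.0)          # stands in for pi(beta^(-m) | beta^(m), D)

def second_moment_ratio(w):
    """Approximate the integral of w^2 / target over the grid."""
    return float(np.sum(w ** 2 / target) * dx)

at_optimum = second_moment_ratio(target)                    # w equal to the conditional posterior
shifted = second_moment_ratio(normal_pdf(x, 0.5, 1.0))      # a suboptimal w

print(at_optimum, shifted)
```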

References

  1. Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov B, Csaki F, editors. International Symposium on Information Theory. Budapest: Akademia Kiado; 1973. pp. 267–281.
  2. Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian Variable Selection and Prediction. Journal of the Royal Statistical Society, Series B. 1998;60:627–641.
  3. Brown PJ, Vannucci M, Fearn T. Bayes Model Averaging with Selection of Regressors. Journal of the Royal Statistical Society, Series B. 2002;64:519–536.
  4. Chen CF. On Asymptotic Normality of Limiting Density Functions with Bayesian Implications. Journal of the Royal Statistical Society, Series B. 1985;47:540–546.
  5. Chen M-H, Dey DK, Ibrahim JG. Bayesian Criterion Based Model Assessment for Categorical Data. Biometrika. 2004;91:45–63.
  6. Chen M-H, Ibrahim JG. Conjugate Priors for Generalized Linear Models. Statistica Sinica. 2003;13:461–476.
  7. Chen M-H, Ibrahim JG, Shao Q-M, Weiss RE. Prior Elicitation for Model Selection and Estimation in Generalized Linear Mixed Models. Journal of Statistical Planning and Inference. 2003;111:57–76.
  8. Chen M-H, Ibrahim JG, Yiannoutsos C. Prior Elicitation, Variable Selection, and Bayesian Computation for Logistic Regression Models. Journal of the Royal Statistical Society, Series B. 1999;61:223–242.
  9. Chen M-H, Shao Q-M. On Monte Carlo Methods for Estimating Ratios of Normalizing Constants. The Annals of Statistics. 1997;25:1563–1594.
  10. Chen M-H, Shao Q-M, Ibrahim JG. Monte Carlo Methods in Bayesian Computation. New York: Springer-Verlag; 2000.
  11. Chipman HA, George EI, McCulloch RE. Bayesian CART Model Search (with Discussion). Journal of the American Statistical Association. 1998;93:935–960.
  12. Chipman HA, George EI, McCulloch RE. The Practical Implementation of Bayesian Model Selection (with Discussion). In: Lahiri P, editor. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001. pp. 63–134.
  13. Chipman HA, George EI, McCulloch RE. Bayesian Treed Generalized Linear Models (with Discussion). In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian Statistics. Vol. 7. Oxford: Oxford University Press; 2003. pp. 85–103.
  14. Clyde M. Bayesian Model Averaging and Model Search Strategies (with Discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 6. Oxford: Oxford University Press; 1999. pp. 157–185.
  15. Clyde M, George EI. Model Uncertainty. Statistical Science. 2004;19:81–94.
  16. Dellaportas P, Forster JJ. Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Log-linear Models. Biometrika. 1999;86:615–633.
  17. Diciccio TJ, Kass RE, Raftery A, Wasserman L. Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association. 1997;92:903–915.
  18. Dixon WJ, Massey FJ. Introduction to Statistical Analysis. 4th ed. New York: McGraw-Hill; 1983.
  19. Geisser S. Predictive Inference: An Introduction. London: Chapman & Hall; 1993.
  20. Gelfand AE, Dey DK. Bayesian Model Choice: Asymptotics and Exact Calculations. Journal of the Royal Statistical Society, Series B. 1994;56:501–514.
  21. Gelfand AE, Dey DK, Chang H. Model Determination Using Predictive Distributions with Implementation via Sampling-based Methods (with Discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford: Oxford University Press; 1992. pp. 147–167.
  22. Gelfand AE, Ghosh SK. Model Choice: A Minimum Posterior Predictive Loss Approach. Biometrika. 1998;85:1–13.
  23. George EI. The Variable Selection Problem. Journal of the American Statistical Association. 2000;95:1304–1308.
  24. George EI, Foster DP. Calibration and Empirical Bayes Variable Selection. Biometrika. 2000;87:731–747.
  25. George EI, McCulloch RE. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association. 1993;88:881–889.
  26. George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7:339–374.
  27. George EI, McCulloch RE, Tsay R. Two Approaches to Bayesian Model Selection with Applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348.
  28. Guha S, MacEachern SN, Peruggia M. Benchmark Estimation for Markov Chain Monte Carlo Samples. Journal of Computational and Graphical Statistics. 2004;13:683–701.
  29. Ibrahim JG, Chen M-H, MacEachern SN. Bayesian Variable Selection for Proportional Hazards Models. Canadian Journal of Statistics. 1999;27:701–717.
  30. Ibrahim JG, Chen M-H, Ryan LM. Bayesian Variable Selection for Time Series Count Data. Statistica Sinica. 2000;10:971–987.
  31. Ibrahim JG, Chen M-H, Sinha D. Criterion Based Methods for Bayesian Model Assessment. Statistica Sinica. 2001a;11:419–443.
  32. Ibrahim JG, Chen M-H, Sinha D. Bayesian Survival Analysis. New York: Springer-Verlag; 2001b.
  33. Ibrahim JG, Laud PW. A Predictive Approach to the Analysis of Designed Experiments. Journal of the American Statistical Association. 1994;89:309–319.
  34. Lahiri P. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001.
  35. Laud PW, Ibrahim JG. Predictive Model Selection. Journal of the Royal Statistical Society, Series B. 1995;57:247–262.
  36. Meng X-L, Schilling S. Warp Bridge Sampling. Journal of Computational and Graphical Statistics. 2002;11:552–586.
  37. Meng X-L, Wong WH. Simulating Ratios of Normalizing Constants via A Simple Identity: A Theoretical Exploration. Statistica Sinica. 1996;6:831–860.
  38. Ntzoufras I, Dellaportas P, Forster JJ. Bayesian Variable and Link Determination for Generalised Linear Models. Journal of Statistical Planning and Inference. 2003;111:165–180.
  39. Raftery AE. Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika. 1996;83:251–266.
  40. Raftery AE, Madigan D, Hoeting JA. Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association. 1997;92:179–191.
  41. Schmeiser BW, Avramidis AN, Hashem S. Overlapping Batch Statistics. In: Balci O, Sadowski RP, Nance RE, editors. Proceedings of the 1990 Winter Simulation Conference. San Diego, California: Society for Computer Simulation International; 1990. pp. 395–398.
  42. Schwarz G. Estimating the Dimension of A Model. The Annals of Statistics. 1978;6:461–464.
  43. Smith M, Kohn R. Nonparametric Regression Using Bayesian Variable Selection. Journal of Econometrics. 1996;75:317–343.
  44. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian Measures of Model Complexity and Fit (with Discussion). Journal of the Royal Statistical Society, Series B. 2002;64:583–639.
  45. Zellner A. On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions. In: Goel P, Zellner A, editors. Bayesian Inference and Decision Techniques. Amsterdam: Elsevier Science Publishers B.V; 1986. pp. 233–243.
