Abstract
In this paper, we consider theoretical and computational connections between six popular methods for variable subset selection in generalized linear models (GLM’s). Under the conjugate priors developed by Chen and Ibrahim (2003) for the generalized linear model, we obtain closed form analytic relationships between the Bayes factor (posterior model probability), the Conditional Predictive Ordinate (CPO), the L measure, the Deviance Information Criterion (DIC), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in the case of the linear model. Moreover, we examine computational relationships in the model space for these Bayesian methods for an arbitrary GLM under conjugate priors as well as examine the performance of the conjugate priors of Chen and Ibrahim (2003) in Bayesian variable selection. Specifically, we show that once Markov chain Monte Carlo (MCMC) samples are obtained from the full model, the four Bayesian criteria can be simultaneously computed for all possible subset models in the model space. We illustrate our new methodology with a simulation study and a real dataset.
Keywords: Bayes factor, Conditional Predictive Ordinate, Conjugate prior, L measure, Poisson regression, Logistic regression
1 Introduction
Bayesian variable selection is still one of the most theoretically and computationally challenging problems encountered in practice due to issues regarding i) prior elicitation, ii) analytic evaluation of the model selection criterion, and iii) numerical computation of the criterion for all possible models in the model space. These issues have been discussed by many authors for various linear and generalized linear models including George and McCulloch (1993), Laud and Ibrahim (1995), George et al. (1996), Raftery (1996), Smith and Kohn (1996), George and McCulloch (1997), Raftery et al. (1997), Brown et al. (1998), Brown et al. (2002), Clyde (1999), Chen et al. (1999), Dellaportas and Forster (1999), Ibrahim et al. (1999), Chipman et al. (1998), Chipman et al. (2001), Chipman et al. (2003), George (2000), George and Foster (2000), Ibrahim et al. (2000), Ntzoufras et al. (2003), and Chen et al. (2003). Clyde and George (2004) present an excellent review article on Bayesian model selection and uncertainty, and give an excellent exposition of the theoretical and computational issues involved in Bayesian variable selection and Bayesian model uncertainty in general. An entire monograph devoted to Bayesian model selection is given by Lahiri (2001).
One of the important unresolved issues in Bayesian model selection, and Bayesian variable selection in particular, is what the analytic or empirical connections are between the various methods. For example, it is not clear what the relationship is between BIC and DIC, or DIC and the L measure, whether one is a monotonic function of the other, and whether one can compute BIC from DIC or vice versa. A related question is how, given MCMC samples from the full model, those samples can be used to obtain all four Bayesian criteria mentioned above. To answer these questions, we investigate the following in this paper: (i) for the normal linear model with conjugate priors, we obtain analytic relationships between the Bayes factor, CPO, the L measure, DIC, AIC, and BIC, and (ii) for the class of GLM’s we show, via the development of several theorems and identities, how one can compute all of these Bayesian criteria simultaneously using only an MCMC sample from the full model.
The relationships obtained in (i) for the linear model shed light on the behavior of and connections between these criteria for GLM’s. The development of (ii) above is important and useful since it establishes the computational relationships in the model space for each of the four Bayesian criteria and shows that for variable subset selection in GLM’s using the conjugate priors of Chen and Ibrahim (2003), we can compute the four Bayesian criteria for all possible 2^p subset models using only an MCMC sample from the full model with p covariates. Another important issue we examine in this paper is the performance of the conjugate priors proposed by Chen and Ibrahim (2003) in Bayesian variable subset selection. We demonstrate that these priors perform quite well in this context, and they are easy to specify and computationally feasible.
The rest of this paper is organized as follows. Section 2 gives formulas for each of the criteria under the conjugate priors of Chen and Ibrahim (2003) for GLM’s and Section 3 develops the theoretical connections between the six criteria for the normal linear model. Section 4 establishes the computational connections in the model space for the four Bayesian criteria and several key identities and theorems that are needed. Section 5 presents a detailed simulation study examining various properties of the six criteria, and Section 6 presents a real data example. We conclude the article with brief remarks in Section 7. All proofs are given in the Appendix.
2 The Method
2.1 Model and Notation
Suppose that {(xi, yi), i = 1, 2, …, n} are independent observations, where yi is the response variable, and xi = (1, xi1, …, xik)′ is a (k + 1) × 1 random vector of covariates. Let ℳ denote the model space. We enumerate the models in ℳ by m = 1, 2, …, 𝒦, where 𝒦 is the dimension of ℳ and model 𝒦 denotes the full model. Also, let β(𝒦) = (β0, β1, …, βk)′ denote the regression coefficients for the full model including an intercept, and let xi(m) and β(m) denote km × 1 vectors of covariates and regression coefficients for model m with an intercept and a specific choice of km − 1 covariates. We write xi(𝒦) = (xi(m)′, xi(−m)′)′ and β(𝒦) = (β(m)′, β(−m)′)′, where xi(−m) is xi(𝒦) with xi(m) deleted and β(−m) is β(𝒦) with β(m) deleted.
Under model m, the generalized linear model (GLM) is assumed for yi given xi(m), which has the conditional density given by

p(yi | xi(m), β(m), τ) = exp{ ai^{-1}(τ) [ yi θi(m) − b(θi(m)) ] + c(yi, τ) },   (1)

where θi(m) = θ(ηi(m)) is the canonical parameter, ηi(m) = xi(m)′β(m), and τ is a dispersion parameter. The functions a, b and c determine a particular family in the class. The functions ai(τ) are commonly of the form ai(τ) = τ/wi, where the wi’s are known weights. For ease of exposition, we assume throughout that τ = 1 and wi = 1, as, for example, in logistic and Poisson regression. The methods proposed here can be easily extended to the case when τ is unknown. Under this assumption, (1) can be rewritten as

p(yi | xi(m), β(m)) = exp{ yi θi(m) − b(θi(m)) + c(yi) }.   (2)
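To make the canonical form in (2) concrete, here is a minimal sketch (an illustration added for this exposition, assuming NumPy and SciPy are available) of the log-density yiθi − b(θi) + c(yi) for the two families used repeatedly in this paper, logistic and Poisson regression, with θi = xi′β:

```python
import numpy as np
from scipy.special import gammaln

# Canonical-form log-density log p(y | theta) = y*theta - b(theta) + c(y), with tau = 1.
# theta = x'beta is the canonical parameter for both families below.

def log_density_logistic(y, theta):
    # Bernoulli/logistic: b(theta) = log(1 + exp(theta)), c(y) = 0
    return y * theta - np.logaddexp(0.0, theta)

def log_density_poisson(y, theta):
    # Poisson: b(theta) = exp(theta), c(y) = -log(y!)
    return y * theta - np.exp(theta) - gammaln(y + 1.0)

# Example: evaluate both log-densities at theta = x'beta for a single observation
x, beta = np.array([1.0, 0.2, -0.5]), np.array([-0.3, 0.3, 0.2])
print(log_density_logistic(1.0, x @ beta), log_density_poisson(1.0, x @ beta))
```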
2.2 Prior and Posterior
In the context of Bayesian variable selection, a prior distribution for β(m) needs to be specified for each model in the model space ℳ. To this end, we consider a conjugate prior for the GLM proposed by Chen and Ibrahim (2003). Under model m, the conjugate prior is of the form
π(β(m) | m, a0, y0) ∝ exp{ a0 [ y0′θ(m) − J′b(θ(m)) ] },   (3)

where a0 > 0 is a scalar prior parameter, y0 = (y01, …, y0n)′ is an n × 1 vector of prior parameters, J is an n × 1 vector of ones, θ(m) = (θ1(m), …, θn(m))′ is the n × 1 vector of the θi(m)’s, and b(θ(m)) is applied componentwise. As discussed in Chen and Ibrahim (2003), y0i can be viewed as a prior prediction for the marginal mean of yi at xi. Thus, in eliciting y0, the user must focus on a prediction (or guess) for E(y), which narrows the possibilities for choosing y0. Moreover, the specification of all y0i equal has an appealing interpretation. A prior specification with y01 = … = y0n implies a prior in which the prior modes of the slopes are the same (namely, zero) for any common value, while the prior mode of the intercept varies with that value. For example, a prior with y0i = 0.5 will have the same modes of slopes but a different mode of intercept than a prior with y0i = 0.1. This is intuitively appealing since in this case the prior prediction y0i does not depend on the ith subject’s specific information. Mathematically, this result was established in Chen and Ibrahim (2003). The details are as follows. Suppose we drop the model index m. Let μ0 be any prespecified p × 1 vector, where p = k + 1. Suppose we take

y0 = ḃ(θ(Xμ0)),
where X = (x1, …, xn)′ is the n × p design matrix for the full model and ḃ(θ) is the gradient vector of b(θ). Then, the conjugate prior yields a prior mode of β equal to μ0. Now we can see that μ0 = (β0, 0, …, 0)′ yields y01 = y02 = … = y0n = ḃ(θ(β0)). On the other hand, since under some mild conditions the prior mode is unique, the specification y0 = y0J (all components equal to a common value y0) leads to the prior mode μ0 = (β0, 0, …, 0)′, where β0 satisfies ḃ(θ(β0)) = y0. For instance, under normal linear regression, we can show that the prior mode μ0 of β is given by

μ0 = (X′X)^{-1}X′y0.

If we specify y0 = y0J, we have

μ0 = (X′X)^{-1}X′(y0J) = (y0, 0, …, 0)′,

which implies that all the slopes are 0 while the intercept is equal to y0. This attractive feature allows us to do sensitivity analyses by varying the intercepts in the prior. The parameter a0 in (3) can generally be viewed as a precision parameter that quantifies the strength of our prior belief in y0.
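As a quick numerical check of this prior-mode property (a sketch added here for illustration, using an arbitrary simulated design matrix), the prior mode under the normal linear model is μ0 = (X′X)^{-1}X′y0, so setting every y0i to a common value returns that value as the intercept mode and zero for every slope:

```python
import numpy as np

# Numerical check of the prior-mode property for the normal linear model:
# mu0 = (X'X)^{-1} X' y0, and y0 with all components equal gives zero slope modes.
rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k covariates

y0 = np.full(n, 0.5)                       # all prior predictions equal to 0.5
mu0 = np.linalg.solve(X.T @ X, X.T @ y0)   # prior mode (X'X)^{-1} X' y0

print(mu0)  # approximately [0.5, 0, 0, 0]: intercept mode 0.5, slope modes 0
```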
In the context of Bayesian variable selection, (3) specifies the priors for all models in ℳ in an automatic and systematic fashion. Although various theoretical properties of (3) were examined in Chen and Ibrahim (2003) in great detail, it is not clear how well this type of prior performs in the context of Bayesian variable selection.
Now, under model m, the posterior distribution of β(m) with the conjugate prior (3) is given by
π(β(m) | D, m) ∝ exp{ Σ_{i=1}^{n} [ (yi + a0 y0i) θi(m) − (1 + a0) b(θi(m)) ] },   (4)

where D = {(yi, xi), i = 1, 2, …, n} denotes the observed data. From (4), we can see that under the conjugate prior, the resulting posterior has a very attractive form. Furthermore, when a0 → 0, the posterior π(β(m)|D, m) in (4) reduces to

π(β(m) | D, m) ∝ exp{ Σ_{i=1}^{n} [ yi θi(m) − b(θi(m)) ] },

which is the posterior distribution based on an improper uniform prior for β(m).
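Because the posterior in (4) retains the conjugate form, its log-kernel is straightforward to code and can be handed to any general-purpose MCMC sampler. The following sketch (added for illustration) writes it out for Poisson regression with the canonical log link; replacing exp(θ) by log(1 + exp(θ)) gives the logistic analogue:

```python
import numpy as np

def log_posterior_kernel_poisson(beta, X, y, y0, a0):
    # Log of the unnormalized posterior kernel in (4) for Poisson regression:
    # sum_i [(y_i + a0*y0_i)*theta_i - (1 + a0)*exp(theta_i)], with theta_i = x_i'beta.
    theta = X @ beta
    return np.sum((y + a0 * y0) * theta - (1.0 + a0) * np.exp(theta))
```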
2.3 Variable Selection Criteria
In this section, we consider four Bayesian model assessment criteria, namely, Conditional Predictive Ordinate (CPO) statistic (Geisser (1993); Gelfand et al. (1992); and Gelfand and Dey (1994)), L measure (Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelfand and Ghosh (1998); Ibrahim et al. (2001a); and Chen et al. (2004)), Deviance Information Criterion (DIC) (Spiegelhalter et al. (2002)), and marginal likelihood (Bayes factor).
The CPO, L measure, and DIC are criterion-based methods that are attractive in that they are well defined under improper priors as long as the posterior distribution is proper, and thus have an advantage over the marginal likelihood or Bayes factor approach. For this reason, these three criterion-based methods can be directly compared to AIC (Akaike (1973)) and BIC (Schwarz (1978)). On the other hand, the marginal likelihood or the Bayes factor is well calibrated and relatively easy to interpret, but generally sensitive to vague proper priors. In the context of variable selection, it is not clear how these methods perform under the conjugate prior given in (3) for the GLM.
Under model m, for the ith observation, we define the CPO statistic as follows:

CPOi = π(yi | xi(m), D(−i), m) = ∫ p(yi | xi(m), β(m)) π(β(m) | D(−i), m) dβ(m),

where D(−i) is D with the ith observation deleted, and π(β(m)|D(−i), m) is the posterior distribution based on the data D(−i). Due to the construction of the conjugate prior (3), it is more natural to define
After some messy algebra, we can show that CPOi takes the following form:
(5) |
where p(yi | xi(m), β(m)) is the density function given in (2). Also, we notice that the CPO defined in (5) is slightly different from the usual CPO (Geisser (1993) and Gelfand et al. (1992)), which is of the form

CPOi = { E[ 1 / p(yi | xi(m), β(m)) | D, m ] }^{-1},

where the expectation is taken with respect to the posterior distribution π(β(m)|D, m). However, these two forms will be identical as a0 → 0. As suggested in Ibrahim et al. (2001b), a natural summary statistic of the CPOi’s is the logarithm of the pseudo-marginal likelihood (LPML) defined as

LPMLm = Σ_{i=1}^{n} log(CPOi).
We will use LPMLm as a criterion-based measure for variable selection.
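To illustrate how LPMLm is typically estimated in practice (a sketch added here; it uses the usual harmonic-mean form of CPOi, which, as noted above, coincides with (5) as a0 → 0), suppose log_lik is an S × n array whose (s, i) entry is log p(yi | xi(m), β(m)s) evaluated at the sth posterior draw:

```python
import numpy as np

def lpml_from_draws(log_lik):
    # Monte Carlo estimates of log CPO_i and LPML from an S x n array of log-likelihood values,
    # using CPO_i = { (1/S) * sum_s 1/p(y_i | x_i, beta_s) }^{-1}, computed on the log scale.
    S = log_lik.shape[0]
    neg = -log_lik
    m = neg.max(axis=0)                                    # for numerical stability
    log_cpo = -(m + np.log(np.exp(neg - m).sum(axis=0) / S))
    return log_cpo.sum(), log_cpo                          # LPML_m and the individual log CPO_i
```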
The L measure criterion is another useful tool for model comparison and variable selection. The L measure is constructed from the posterior predictive distribution of the data. For the entire class of GLM’s in (2), under model m, the L measure is defined as:
Lm(ν) = Σ_{i=1}^{n} { E[ b″(θi(m)) | D, m ] + Var[ b′(θi(m)) | D, m ] } + ν Σ_{i=1}^{n} [ E[ b′(θi(m)) | D, m ] − yi ]²,   (6)

where b′(·) and b″(·) are the mean and variance functions of the GLM in (2), and all expectations and variances are taken with respect to the posterior distribution π(β(m)|D, m) in (4). We note that for the GLM in (1), we need to modify Lm(ν) in (6) accordingly, and in this case, the L measure takes the form
Lm(ν) = Σ_{i=1}^{n} { E[ ai(τ) b″(θi(m)) | D, m ] + Var[ b′(θi(m)) | D, m ] } + ν Σ_{i=1}^{n} [ E[ b′(θi(m)) | D, m ] − yi ]².   (7)
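Computationally, the L measure only requires posterior draws of the mean and variance functions b′(θi(m)) and b″(θi(m)). The sketch below (added for illustration, and written to match the form of (6) as given above) assumes those draws are stored as S × n arrays:

```python
import numpy as np

def l_measure(mu_draws, var_draws, y, nu):
    # L measure from posterior draws:
    #   mu_draws[s, i]  = b'(theta_i^(m))  at the s-th draw (mean function)
    #   var_draws[s, i] = b''(theta_i^(m)) at the s-th draw (variance function)
    e_mu = mu_draws.mean(axis=0)                               # E[b'(theta_i) | D, m]
    predictive_var = var_draws.mean(axis=0) + mu_draws.var(axis=0)
    return predictive_var.sum() + nu * np.sum((e_mu - y) ** 2)
```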
The DIC criterion, proposed by Spiegelhalter et al. (2002), is given by
DICm = D(β̄(m)) + 2 pD,   (8)

where pD = E[ D(β(m)) | D, m ] − D(β̄(m)) is the effective number of parameters, β̄(m) = E[β(m) | D, m], and D(β(m)) is the deviance. For the GLM in (2), under model m,

D(β(m)) = −2 Σ_{i=1}^{n} [ yi θi(m) − b(θi(m)) + c(yi) ].   (9)
Similar to (6), under the GLM in (1), D(β(m)) needs to be modified accordingly.
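As an illustration of how DICm in (8) is computed from MCMC output (a sketch added here; beta_draws is an S × km array of posterior draws and deviance is a user-supplied function implementing D(β(m)) as in (9)):

```python
import numpy as np

def dic_from_draws(beta_draws, deviance):
    # DIC = D(beta_bar) + 2 * p_D, with p_D = mean deviance minus deviance at the posterior mean.
    d_draws = np.array([deviance(b) for b in beta_draws])
    beta_bar = beta_draws.mean(axis=0)        # posterior mean of beta^(m)
    d_bar, d_at_mean = d_draws.mean(), deviance(beta_bar)
    p_d = d_bar - d_at_mean                   # effective number of parameters
    return d_at_mean + 2.0 * p_d              # equivalently 2*d_bar - d_at_mean
```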
In the spirit of marginal likelihoods, after ignoring the constants shared by all variable subset models in model space ℳ for the GLM in (2), for the purpose of variable subset selection it suffices to compute the posterior normalizing constant
Cm(D) = ∫ exp{ Σ_{i=1}^{n} [ (yi + a0 y0i) θi(m) − (1 + a0) b(θi(m)) ] } dβ(m)   (10)

and the prior normalizing constant

C0m(y0) = ∫ exp{ a0 Σ_{i=1}^{n} [ y0i θi(m) − b(θi(m)) ] } dβ(m).   (11)
Similar to the modification of (6) yielding (7), under the GLM in (1), D(β(m)) in (9), Cm(D) in (10), and C0m(y0) in (11) need to be modified accordingly. In the context of variable selection, we select a variable subset model which yields the largest LPMLm under the CPO, the smallest Lm(ν) under the L measure, the smallest DICm under the DIC, and the largest Cm(D)/C0m(y0) or log[Cm(D)/C0m(y0)] under the marginal likelihood.
3 Analytic Connections Between Variable Selection Criteria For the Normal Linear Regression Model
In this section, we consider the normal linear regression models given by
yi = xi(m)′β(m) + εi,  εi ~ N(0, 1/τ) independently, i = 1, 2, …, n.   (12)

Let Xm = (x1(m), x2(m), …, xn(m))′, which is the n × km design matrix for the normal linear regression under model m. Assume Xm is of full rank km throughout. We focus only on the τ known case as analytical connections are more difficult to establish when τ is unknown. For the model in (12) with a known τ, the conjugate prior for β(m) in (3) reduces to
(13) |
and the posterior distribution for β(m) is given by
For (12), AIC and BIC under model m are given by
(14) |
where β̂(m) is the maximum likelihood estimate of β(m) and
is the usual sum of squared errors, and
(15) |
After some algebra, and after putting back all normalizing constants, we can show that the logarithm of the marginal likelihood under model m is given by
(16) |
When y0 = 0, the conjugate prior in (13) reduces to Zellner’s g-prior (Zellner (1986)). For this special case, (16) becomes
(17) |
Thus, we have
(18) |
For purposes of variable selection, it suffices to compare ℳm(a0) and we then choose a model with the smallest ℳm(a0). From (18), we can see that
(19) |
For (12), we use (7) to compute Lm(ν). In particular, we have ai(τ) = 1/τ, ,
and . Thus, we obtain
(20) |
When y0 = 0, (20) reduces to
(21) |
Write
(22) |
Using (21) and (22), we obtain
and hence
Note that in the context of variable selection, a model with the smallest Lm(ν) is the same model that has the smallest L̃m(ν, a0). Thus, in this sense, the L measure can be equivalent to AIC or BIC by appropriately tuning (ν, a0). It is interesting to mention that in order to achieve L̃m(ν, a0) = AICm or L̃m(ν, a0) = BICm, ν must be small, and hence when ν = 1, the L measure always has a smaller dimensional penalty than both AIC and BIC. Unlike the marginal likelihood, a0 plays a minimal role in controlling the dimensional penalty in the L measure.
When y0 = 0, the posterior mean of β(m) is given by . Thus, we have ,
(23) |
and
(24) |
(25) |
Thus, the DICm for (12) is given by
(26) |
Write
We have
(27) |
Therefore, when a0 = 0, , and when a0 > 0, , which implies that has a smaller dimensional penalty than both AIC and BIC.
Similarly to DIC, we consider only y0 = 0. From (5), we have
(28) |
where and
for i = 1, 2, …, n. After some messy algebra, we obtain
and
Let , and . Plugging CPO1i and CPO2i into (28) yields
(29) |
Using Taylor expansion and after some algebra, LPMLm in (29) can be rewritten as
(30) |
where
Write
(31) |
Using (30) and (31), we obtain
where . We choose a model with the smallest . Note that the remainder term Rm is small when all ’s are small. From (14), (15), and (27), we see that when Rm is small and does not vary much in the model space ℳ, LPML has a smaller dimensional penalty than DIC, AIC and BIC. In addition, when a0 = 0, LPMLm in (30) is consistent with the one derived by Gelfand and Dey (1994) based on the asymptotic approximation.
Finally, we note that the quantities defined in (18), (22), (27) and (31) are linear transformations of those defined by (17), (21), (26) and (30), respectively. In these linear transformations, the relevant coefficients are independent of m. Thus, for the purposes of variable subset selection, these linearly transformed quantities act exactly like those original forms. With (18), (22), (27) and (31), we can much more clearly see the analytical connections to AIC and BIC. We also note that George and Foster (2000) provided some similar connections between model selection probabilities and various model selection criteria for this setup.
4 Computational Development: Theory and Implementation
For the purpose of variable selection, we need to compute LPMLm, Lm(ν), DICm, Cm(D) and C0m(y0) for the Bayesian variable selection criteria described in the previous section for m = 1, 2, …, 𝒦. Due to the complexity and generality of the GLM in (2), the analytical evaluation of these measures does not appear possible. Thus, a Monte Carlo (MC) based method is required for each of the measures under consideration. However, the MC methods currently available in the Bayesian computational literature require a Markov chain Monte Carlo (MCMC) sample from the posterior distribution π(β(m)|D, m) in (4) under each variable subset model m. When the number of models in ℳ is large, sampling from the posterior distribution under each variable subset model can be expensive. Thus, the computation of these four measures for all submodels becomes a difficult and challenging task. Therefore, the development of an efficient Monte Carlo method for variable selection for the GLM is essential.
After examining (5), (6), and (8), we observe that there is a common feature in computing LPMLm, Lm(ν), and DICm: all three measures require computing posterior expectations of the form

E[ g(β(m)) | D, m ]

for various functions g, where the expectation is taken with respect to the joint posterior distribution in (4) under model m. Specifically, the functions required in these calculations include
and for LPMLm;
, and for Lm(ν);
g(β(m)) = β(m) and g(β(m)) = D(β(m)) for DICm.
Write

L(β(m) | D, m) = exp{ Σ_{i=1}^{n} [ (yi + a0 y0i) θi(m) − (1 + a0) b(θi(m)) ] }

under model m, and let L(β|D) = L(β(𝒦)|D, 𝒦), C(D) = C𝒦(D), and C0(y0) = C0𝒦(y0) under the full model. Here, we abuse the notation a little bit, as L(β(m)|D, m) is not a likelihood function in the usual sense. Then, for a given function g, mathematically, we have

gm = E[ g(β(m)) | D, m ] = (1/Cm(D)) ∫ g(β(m)) L(β(m) | D, m) dβ(m),

where Cm(D) is defined in (10). Now, we present a useful identity for gm, which is formally stated in the following theorem.
Theorem 5
For any given function g, such that E[|g(β(m))| |D, m] < ∞, we have
E[ g(β(m)) L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) | D ] = [Cm(D)/C(D)] gm,   (32)
where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Here, w(β(−m)| β(m)) is a completely known conditional density, whose support is contained in, or equal to, the support of the conditional density of β(− m) given β(m) with respect to the joint posterior distribution in (4) under the full model.
Observing that g ≡ 1 gives gm = 1, the identity (32) leads to

Cm(D)/C(D) = E[ L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) | D ]   (33)

and

gm = E[ g(β(m)) L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) | D ] / E[ L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) | D ].   (34)
It is interesting to mention that the identity (33) is a by-product of this derivation and this identity can be used to compute the posterior normalizing constant under model m. The identities (33) and (34) play an important role in developing a novel Monte Carlo method for computing LPMLm, Lm(ν), DICm, and Cm(D) simultaneously using a single MCMC sample from the joint posterior distribution under the full model. Towards this goal, we let {βs = (β(m)′s, β(−m)′s), s = 1, 2, …, S} denote a MCMC sample from the joint posterior distribution (4) under the full model, where S is the MCMC sample size. Then, an estimate of gm is given by
ĝm = { Σ_{s=1}^{S} g(β(m)s) L(β(m)s|D, m) w(β(−m)s | β(m)s) / L(βs|D) } / { Σ_{s=1}^{S} L(β(m)s|D, m) w(β(−m)s | β(m)s) / L(βs|D) }.   (35)
Under certain regularity conditions, such as ergodicity, we have ĝm → gm almost surely as S → ∞, which indicates that ĝm is consistent.
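To see how (35) is used in practice, here is a sketch (added for illustration; the helper functions log_kernel_sub, log_w_cond, and log_kernel_full are hypothetical stand-ins for log L(β(m)|D, m), log w(β(−m)|β(m)), and log L(β|D), respectively) that turns a single set of full-model draws into an estimate of gm for any submodel m:

```python
import numpy as np

def ghat_submodel(beta_draws, sub_idx, g, log_kernel_sub, log_w_cond, log_kernel_full):
    # Estimate g_m = E[g(beta^(m)) | D, m] from full-model posterior draws via (35).
    # beta_draws is an S x (k+1) array; sub_idx selects the components belonging to beta^(m).
    log_w = np.array([log_kernel_sub(b[sub_idx]) + log_w_cond(b, sub_idx) - log_kernel_full(b)
                      for b in beta_draws])
    w = np.exp(log_w - log_w.max())            # rescale for numerical stability
    g_vals = np.array([g(b[sub_idx]) for b in beta_draws])
    return np.sum(w * g_vals) / np.sum(w)      # ratio estimate; the rescaling cancels

# The same draws and weights serve every submodel; only sub_idx, log_kernel_sub, and
# log_w_cond change, which is exactly the computational saving emphasized in this section.
```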
Letting
(36) |
and
(37) |
we have
(38) |
and
(39) |
(40) |
(41) |
In (41), and
(42) |
In addition, we have
We are then led to the following theorem.
Theorem 6
Let {βs, s = 1, 2, …, S} be a random sample. Assume A ≠ 0,
(43) |
and
(44) |
where the expectation is taken with respect to the joint posterior distribution in (4) under the full model. Then we have
(45) |
where Vw(gm) is defined by (43) and
The proof of Theorem 6 directly follows from the proof of Theorem 3.1 of Chen and Shao (1997). Thus, the details are omitted for brevity. From (45), we notice that is the relative mean-square error and Theorem 6 implies that when S is large,
Remark 4.1
As discussed in Chen et al. (2000), the simulation standard error of ĝm can be approximated by
where  = AS.
Remark 4.2
From (34), one might naturally think that a more efficient way to obtain an MC estimate of gm is to generate two MC samples from the posterior distribution, so that one sample is used for computing the numerator of (34) while the second sample is used for computing the denominator. In this remark, we show that the use of two MC samples in obtaining the MC estimate of gm may not necessarily be more efficient than the use of just one MC sample. In addition, generating two MC samples requires more computing time. Specifically, suppose that {β1;s, s = 1, 2, …, S1} and {β2;s, s = 1, 2, …, S2} are two independent random samples from the joint posterior distribution (4) under the full model. Then gm can be estimated by
(46) |
By the δ-Method, we have
where the expectation and variance are taken with respect to the joint posterior distribution (4) under the full model.
Assuming that S1 = S2 = S, we have
(47) |
Thus, if
(48) |
we have
It is easy to see that when g(β(m)) ≥ 0 or g(β(m)) ≤ 0, (48) automatically holds. Therefore, for many cases, it is unnecessary to use two MC samples instead of one MC sample in obtaining the MC estimate of gm.
Note that the estimate ĝm depends on w(β(−m)|β(m)). It is reasonable to argue that the best choice of w should yield the smallest asymptotic variance of the estimate ĝm among all possible w’s. The following theorem precisely addresses this optimality issue.
Theorem 7
Let
(49) |
be the conditional posterior density of β(−m) given β(m) under the full model, then we have
(50) |
for all w’s, where Vw(gm) is defined by (43).
Remark 4.3
Note that (50) holds for any function g that satisfies the condition given in (44). Thus, for various functions g involved in LPMLm, Lm(ν) and DICm, the best choice of w is the same wopt given in (49).
Remark 4.4
When we use in (46), we can also show that wopt = π(β(−m) | β(m), D) yields the smallest asymptotic relative mean-square error of , for example, the one given by (47).
Remark 4.5
For computing CPOi in (5) under model m, we do not need to compute in (32). In fact, it is easy to see that
where and . Thus, given a MCMC sample { , s = 1, 2, …, S} from the joint posterior distribution (4), a MC estimate of CPOi is given as follows:
Following the proof of Theorem 7, we can easily show that the optimal choice of w for is still the same wopt given in (49).
Remark 4.6
To compute LPML𝒦, L𝒦(ν) and DIC𝒦 under the full model, we can simply take β(𝒦) = β and w(β(−𝒦) | β(𝒦)) = 1. Then, for various functions g, given an MCMC sample {βs, s = 1, 2, …, S} from the posterior distribution (4) under the full model, (35) reduces to

ĝ𝒦 = (1/S) Σ_{s=1}^{S} g(βs).
Remark 4.7
As shown in Theorem 7, the optimal choice of w is wopt = π(β(−m) | β(m), D). However, for the GLM in (2), wopt is not available in closed form. Fortunately, for the GLM, a good w(β(−m)|β(m)), which is close to the optimal choice, can be constructed based on the asymptotic approximation to the joint posterior proposed by Chen (1985). Let β̂ denote the posterior mode of β under the full model, i.e.,

β̂ = arg max_β log L(β|D).

Also let

Σ̂ = [ −∂² log L(β|D) / ∂β ∂β′ |_{β = β̂} ]^{-1}.

Then, the joint posterior π(β|D) under the full model can be approximated by the (k + 1)-dimensional multivariate normal density

π̂(β | D) = (2π)^{-(k+1)/2} |Σ̂|^{-1/2} exp{ −(1/2) (β − β̂)′ Σ̂^{-1} (β − β̂) }.   (51)
Using (51), we simply take w(β(−m) | β(m)) = π̂(β(−m) | β(m), β̂, D), which is the conditional distribution of β(−m) given β(m) with respect to the (k + 1)-dimensional multivariate normal distribution in (51).
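A sketch of the construction in Remark 4.7 (added for illustration, assuming NumPy and SciPy): given the posterior mode β̂ and the approximate covariance matrix Σ̂, take w(β(−m)|β(m)) to be the conditional density of β(−m) given β(m) under the multivariate normal approximation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def make_log_w(beta_hat, Sigma_hat, idx_m, idx_neg):
    # Conditional density of beta^(-m) given beta^(m) under the N(beta_hat, Sigma_hat)
    # approximation to the full-model posterior; idx_m / idx_neg are integer index arrays.
    S11 = Sigma_hat[np.ix_(idx_neg, idx_neg)]
    S12 = Sigma_hat[np.ix_(idx_neg, idx_m)]
    S22_inv = np.linalg.inv(Sigma_hat[np.ix_(idx_m, idx_m)])
    cond_cov = S11 - S12 @ S22_inv @ S12.T      # conditional covariance of beta^(-m)

    def log_w(beta_neg, beta_m):
        cond_mean = beta_hat[idx_neg] + S12 @ S22_inv @ (beta_m - beta_hat[idx_m])
        return multivariate_normal.logpdf(beta_neg, mean=cond_mean, cov=cond_cov)

    return log_w
```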
Remark 4.8
As a by-product, Cm(D)/C(D) can be readily computed via the identity (33). It can also be shown that
(52) |
where and the expectation is taken with respect to the prior distribution in (3) under the full model. After examining the construction of the conjugate prior and the form of the GLM in (2), we can also show that
(53) |
where π(β(−m) = 0|D) and π(β(−m) = 0|y0, a0) are the marginal posterior density and the marginal prior density of β(−m) evaluated at β(−m) = 0 under the full model. Furthermore, Bm in (53) is the Bayes factor for comparing model m to the full model. Thus, to compute Bm, we need to generate two MCMC samples, one from the posterior distribution and another one from the prior distribution of β under the full model, and then use (33) and (52).
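To illustrate Remark 4.8 computationally, the following sketch (added here; all kernel functions are hypothetical stand-ins for the corresponding unnormalized log-kernels, and for simplicity the same conditional density w is used with both samples) estimates Bm from one MCMC sample from the full-model posterior and one from the full-model prior, using the fact that Bm = [Cm(D)/C(D)] / [C0m(y0)/C0(y0)]:

```python
import numpy as np

def log_ratio_hat(draws, sub_idx, log_kernel_sub, log_w_cond, log_kernel_full):
    # Log of an identity-(33)-style ratio estimate, e.g. log{C_m(D)/C(D)} from posterior draws.
    log_w = np.array([log_kernel_sub(b[sub_idx]) + log_w_cond(b, sub_idx) - log_kernel_full(b)
                      for b in draws])
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))          # stable log-mean-exp

def bayes_factor_vs_full(post_draws, prior_draws, sub_idx, log_w_cond,
                         log_post_sub, log_post_full, log_prior_sub, log_prior_full):
    # B_m = [C_m(D)/C(D)] / [C_0m(y0)/C_0(y0)], each ratio estimated from its own MCMC sample.
    log_post_ratio = log_ratio_hat(post_draws, sub_idx, log_post_sub, log_w_cond, log_post_full)
    log_prior_ratio = log_ratio_hat(prior_draws, sub_idx, log_prior_sub, log_w_cond, log_prior_full)
    return np.exp(log_post_ratio - log_prior_ratio)
```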
Finally, we note that we derive wopt under the independence assumption. We expect that this optimal choice will work well even when a dependent MCMC sample is used. Some related empirical studies have been reported and discussed in Meng and Wong (1996), Diciccio et al. (1997) and Meng and Schilling (2002). They suggested that the optimal or near-optimal procedures constructed under the independence assumption can work remarkably well in general, providing orders of magnitude improvement over other methods with similar computational effort. Alternatively, suppose we systematically take a 1-in-b subsample of size S from the Markov chain that is generated from the joint posterior distribution in (4). Then, following from Guha et al. (2004), we can show that (45) holds under some mild regularity conditions such as geometrical ergodicity and a sufficiently large b. Thus, if we take a MCMC sample in such a way, this MCMC sample can be treated as “a random sample.”
5 A Simulation Study
In Section 3, we established theoretical connections among AIC, BIC and the four Bayesian criteria in the normal linear regression setting. However, it does not appear possible to obtain analytic connections between AIC or BIC and the four Bayesian criteria for Poisson regression. For this reason, we present a simulation study for Poisson regression to empirically examine whether there exist any connections among these criteria and to examine the performance of conjugate priors in the context of variable selection. Suppose yi|θi are independent Poisson observations with mean μi = exp(xiβ), where xi = (xi0, xi1, …, xi,p−1) is a 1 × p vector of covariates, i = 1, 2, …, n. The conjugate prior takes the form

π(β | a0, y0) ∝ exp{ a0 Σ_{i=1}^{n} [ y0i xiβ − exp(xiβ) ] },   (54)
where y0i is the ith component of y0. In the simulation, we assume that xi0 = 1 and xij ~ N(0, 1) independently for j = 1, 2, 3 and i = 1, 2, …, n. In (54), we take y0i = 1 for i = 1, 2, …, n, which yields a prior mode of β equal to 0, as shown in Chen and Ibrahim (2003). Further, we use β = (−0.3, 0.3, 0, 0)′, β = (−0.3, 0.3, 0.2, 0)′, and β = (−0.3, 0.3, 0.2, −0.15)′, which correspond to the true models (x1), (x1, x2), and (x1, x2, x3) (full model), respectively. We also use a sample size of n = 500.
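For concreteness, one replicate dataset under this design can be generated as follows (a sketch added for illustration; the seed is arbitrary):

```python
import numpy as np

# Simulation design described above: n = 500, x_i0 = 1, x_ij ~ N(0,1) for j = 1,2,3,
# and y_i ~ Poisson(exp(x_i'beta)).
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([-0.3, 0.3, 0.2, 0.0])   # true model (x1, x2)
y = rng.poisson(np.exp(X @ beta_true))
```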
Under the simulation design, we independently generated N = 500 datasets. For each simulated dataset, we fit 2³ = 8 models. To compute the posterior model probabilities based on the conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. For all of these 8 models, we computed BF, DIC, L measure, LPML, AIC, and BIC.
Tables 1 and 2 show results for the various methods. Our model performance evaluation criterion is a 0-1 loss function, the loss being 0 if the true model is selected and 1 otherwise. In Table 1, we see that BIC performs better than AIC in the number of times the true model is selected as best when the true model is a smaller model. For example, when (x1) is the true model, AIC correctly identifies this model as best 361 times out of 500 and BIC correctly identifies this model as best 490 times. Table 2 compares the performance of the four other criteria under several values of a0 from the conjugate prior as well as several values of ν for the L measure. We see from the table that, in general, for small values of a0, which imply a noninformative prior, the Bayes factor results are quite consistent with DIC, the L measure, and LPML for small models being the true models, whereas when the full model is the true model, the Bayes factor tends to do worse for small a0 compared to large a0. In general, as a0 increases, the performance of DIC, LPML, and the Bayes factor becomes worse, whereas for the L measure, it is fairly robust over several values of a0. The L measure seems to perform best under moderate values of ν, such as ν = 0.5.
Table 1.
Frequencies for Ranking the True Model as Best Using AIC and BIC Based on n = 500 and N = 500 Datasets
True Model | AIC | BIC |
---|---|---|
(x1) | 361 | 490 |
(x1, x2) | 425 | 446 |
(x1, x2, x3) | 474 | 316 |
Table 2.
Frequencies for Ranking the True Model as Best Using BF, DIC, CPO and L measure for Various a0 Based on n = 500 and N = 500 Datasets
L Measure (ν) | |||||||||
---|---|---|---|---|---|---|---|---|---|
True Model | a0 | LPML | DIC | BF | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
(x1) | 0.001 | 395 | 361 | 492 | 398 | 396 | 359 | 318 | 276 |
0.01 | 396 | 357 | 466 | 396 | 396 | 357 | 319 | 275 | |
0.1 | 377 | 332 | 386 | 408 | 396 | 352 | 304 | 268 | |
0.5 | 342 | 308 | 311 | 424 | 381 | 335 | 279 | 243 | |
1 | 320 | 299 | 288 | 424 | 372 | 321 | 264 | 222 | |
| |||||||||
(x1, x2) | 0.001 | 425 | 425 | 436 | 164 | 347 | 390 | 380 | 356 |
0.01 | 423 | 425 | 470 | 157 | 352 | 390 | 383 | 355 | |
0.1 | 417 | 417 | 443 | 195 | 370 | 399 | 372 | 353 | |
0.5 | 398 | 405 | 405 | 254 | 400 | 402 | 362 | 339 | |
1 | 382 | 394 | 391 | 269 | 410 | 390 | 359 | 329 | |
| |||||||||
(x1, x2, x3) | 0.001 | 475 | 474 | 291 | 88 | 371 | 456 | 475 | 480 |
0.01 | 475 | 474 | 388 | 94 | 375 | 458 | 475 | 482 | |
0.1 | 479 | 475 | 460 | 125 | 402 | 466 | 480 | 488 | |
0.5 | 485 | 479 | 479 | 176 | 436 | 478 | 486 | 489 | |
1 | 486 | 481 | 481 | 214 | 453 | 483 | 487 | 490 |
6 A Real Data Example
Due to the lack of analytic connections between AIC or BIC and the four Bayesian criteria for logistic regression, we consider the Chapman data from the Los Angeles Heart Study of men (n = 200), presented in Dixon and Massey (1983), to empirically examine whether there exist any connections among these criteria.
In our analysis, we consider a coronary incident as a binary response variable (y), which takes the values 0 and 1, where a 1 denotes that an incident occurred in the previous ten years and a 0 indicates otherwise. We consider five prognostic factors: age (Ag), systolic blood pressure in millimeters of mercury (S), diastolic blood pressure in millimeters of mercury (D), cholesterol in milligrams per dL (Ch), and body mass index BMI = (703.07 × Weight)/(Height²).
Let x1, x2, x3, x4, and x5 denote Ag, S, D, Ch, and BMI, respectively. For the Chapman data, we fit the logistic regression model

logit[P(yi = 1 | xi)] = β0 + β1xi1 + β2xi2 + β3xi3 + β4xi4 + β5xi5,  i = 1, 2, …, n.   (55)
The conjugate prior in (3) corresponding to the model (55) takes the form

π(β | a0, y0) ∝ exp{ a0 Σ_{i=1}^{n} [ y0i xi′β − log(1 + exp(xi′β)) ] },   (56)

where y0i = 0.5, i = 1, 2, …, n, to ensure that the prior mode of β is 0. We wish to compare the following 32 models: Intercept only, (x1), …, (x5), (x1, x2), …, (x1, x2, x3, x4, x5). We note that the notation (x1, x2, x3, x4, x5), for example, implies that all five covariates enter the linear predictor in (55). Thus, “Intercept only” is the model with zero predictors while (x1, x2, x3, x4, x5) is the full model with the largest model dimension. We also note that an intercept is included in every model. Further, we denote M1 = (Int), M2 = (Int, Ag), M3 = (Int, S), M4 = (Int, D), M5 = (Int, Ch), M6 = (Int, BMI), M7 = (Int, Ag, S), M8 = (Int, Ag, D), M9 = (Int, Ag, Ch), M10 = (Int, Ag, BMI), M11 = (Int, S, D), M12 = (Int, S, Ch), M13 = (Int, S, BMI), M14 = (Int, D, Ch), M15 = (Int, D, BMI), M16 = (Int, Ch, BMI), M17 = (Int, Ag, S, D), M18 = (Int, Ag, S, Ch), M19 = (Int, Ag, S, BMI), M20 = (Int, Ag, D, Ch), M21 = (Int, Ag, D, BMI), M22 = (Int, Ag, Ch, BMI), M23 = (Int, S, D, Ch), M24 = (Int, S, D, BMI), M25 = (Int, S, Ch, BMI), M26 = (Int, D, Ch, BMI), M27 = (Int, Ag, S, D, Ch), M28 = (Int, Ag, S, D, BMI), M29 = (Int, Ag, S, Ch, BMI), M30 = (Int, Ag, D, Ch, BMI), M31 = (Int, S, D, Ch, BMI), and M32 = (Int, Ag, S, D, Ch, BMI).
To compute the posterior model probability (PMP), DIC, LPML, and L measure under various conjugate priors, we implemented the Monte Carlo algorithm proposed in Section 4 with a Monte Carlo sample size of S = 20,000. We see from Table 3 that M22 is selected as the best model by AIC and the fourth best model by BIC, whereas M10 is selected as the second best model by both criteria. Table 4 shows the results of the L measure, posterior model probability (PMP), LPML, and DIC for several values of a0, as well as several values of ν for the L measure. Table 4 reveals a story similar to the simulation study. Model M22 is selected as either the top model or the second best model for most values of a0 for DIC and PMP, as well as for the L measure under small values of ν. Under larger values of ν, the L measure as well as LPML appear to favor model M32. Finally, for small values of a0, LPML and PMP appear to favor a smaller model, namely M2. Thus, from these analyses, models {M2, M22, M32} appear to be the most promising based on all of these model selection criteria. Table 5 shows the top five models selected for each of the four variable selection criteria (PMP, DIC, L measure, LPML). Again we see a remarkable consistency between the four criteria, in which the ordering of the top models is similar for the four criteria for small, moderate, and large values of a0, and for a wide range of ν values for the L measure.
Table 3.
The Top Model Based on AIC and BIC for Chapman Data
AIC | BIC | ||
---|---|---|---|
Mk | Values | Mk | Values |
M22 | 142.75 | M2 | 153.34 |
M10 | 143.73 | M10 | 153.63 |
M29 | 144.69 | M9 | 155.83 |
M30 | 144.75 | M22 | 155.94 |
M19 | 145.57 | M16 | 155.99 |
Table 4.
The Best Model Based on Posterior Model Probability (PMP), DIC, LPML, and L Measure for Chapman Data
a0 = 0.001 | a0 = 0.01 | a0 = 0.1 | a0 = 0.5 | a0= 1.0 | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Criterion | Mk | Values | Mk | Values | Mk | Values | Mk | Values | Mk | Values |
PMP | M2 | 0.57 | M2 | 0.25 | M22 | 0.14 | M22 | 0.07 | M22 | 0.06 |
DIC | M22 | 142.83 | M22 | 142.67 | M22 | 144.74 | M22 | 165.65 | M22 | 186.77 |
LPML | M2 | −73.38 | M2 | −73.30 | M32 | −73.79 | M32 | −83.10 | M30 | −93.29 |
L(ν = 0.1) | M22 | 21.47 | M22 | 21.98 | M22 | 26.96 | M22 | 38.92 | M30 | 45.21 |
L(ν = 0.25) | M22 | 24.79 | M22 | 25.29 | M22 | 30.23 | M22 | 42.56 | M30 | 49.39 |
L(ν = 0.5) | M32 | 30.20 | M32 | 30.73 | M32 | 35.66 | M29 | 48.59 | M30 | 56.36 |
L(ν = 0.75) | M32 | 35.24 | M32 | 35.76 | M32 | 40.77 | M32 | 54.52 | M30 | 63.33 |
L(ν = 0.9) | M32 | 38.26 | M32 | 38.78 | M32 | 43.83 | M32 | 58.06 | M30 | 67.51 |
Table 5.
The Top Five Models Based on PMP, DIC, LPML, and L Measure for Chapman Data
a0 = 0.001 | a0 = 0.01 | a0 = 0.1 | a0 = 0.5 | a0 = 1.0 | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Criterion | Mk | Values | Mk | Values | Mk | Values | Mk | Values | Mk | Values |
PMP | M2 | 0.57 | M2 | 0.25 | M22 | 0.14 | M22 | 0.07 | M22 | 0.06 |
M1 | 0.11 | M10 | 0.23 | M10 | 0.14 | M10 | 0.07 | M10 | 0.05 | |
M5 | 0.07 | M9 | 0.08 | M9 | 0.06 | M29 | 0.05 | M19 | 0.04 | |
M10 | 0.07 | M22 | 0.08 | M2 | 0.06 | M19 | 0.05 | M29 | 0.04 | |
M6 | 0.06 | M16 | 0.07 | M16 | 0.06 | M30 | 0.05 | M21 | 0.04 | |
| ||||||||||
DIC | M22 | 142.83 | M22 | 142.67 | M22 | 144.74 | M22 | 165.65 | M22 | 186.77 |
M10 | 143.79 | M10 | 143.70 | M10 | 145.70 | M10 | 166.02 | M10 | 186.88 | |
M30 | 144.85 | M29 | 144.74 | M29 | 146.43 | M29 | 166.42 | M30 | 187.10 | |
M29 | 144.96 | M30 | 144.78 | M30 | 146.48 | M19 | 166.87 | M21 | 187.38 | |
M21 | 145.63 | M21 | 145.59 | M21 | 147.29 | M30 | 166.90 | M20 | 187.75 | |
| ||||||||||
LPML | M2 | −73.38 | M2 | −73.30 | M32 | −73.79 | M32 | −83.10 | M30 | −93.29 |
M5 | −73.56 | M5 | −73.50 | M2 | −74.11 | M29 | −83.32 | M32 | −93.35 | |
M4 | −73.58 | M6 | −73.55 | M7 | −74.43 | M10 | −83.34 | M21 | −93.41 | |
M6 | −73.73 | M4 | −73.64 | M8 | −74.48 | M19 | −83.55 | M10 | −93.41 | |
M32 | −73.91 | M3 | −73.65 | M9 | −74.68 | M21 | −83.56 | M20 | −93.55 | |
| ||||||||||
L ν = 0.1 |
M22 | 21.47 | M22 | 21.98 | M22 | 26.96 | M22 | 38.92 | M30 | 45.21 |
M30 | 22.04 | M25 | 22.63 | M25 | 27.33 | M25 | 38.99 | M22 | 45.23 | |
M26 | 22.13 | M30 | 22.64 | M30 | 27.41 | M29 | 39.02 | M26 | 45.26 | |
M32 | 22.15 | M29 | 22.65 | M26 | 27.45 | M19 | 39.16 | M20 | 45.28 | |
M10 | 22.21 | M10 | 22.67 | M29 | 27.47 | M10 | 39.18 | M21 | 45.31 | |
| ||||||||||
L ν = 0.25 |
M22 | 24.79 | M22 | 25.29 | M22 | 30.23 | M22 | 42.56 | M30 | 49.39 |
M30 | 25.17 | M32 | 25.69 | M32 | 30.56 | M29 | 42.61 | M22 | 49.45 | |
M32 | 25.17 | M30 | 25.77 | M30 | 30.56 | M25 | 42.67 | M20 | 49.49 | |
M10 | 25.43 | M29 | 25.78 | M29 | 30.62 | M32 | 42.73 | M26 | 49.51 | |
M20 | 25.47 | M10 | 25.89 | M25 | 30.65 | M19 | 42.79 | M21 | 49.52 | |
| ||||||||||
L ν = 0.5 |
M32 | 30.20 | M32 | 30.73 | M32 | 35.66 | M29 | 48.59 | M30 | 56.36 |
M22 | 30.31 | M22 | 30.80 | M22 | 35.70 | M32 | 48.62 | M32 | 56.47 | |
M30 | 30.38 | M30 | 30.98 | M30 | 35.81 | M22 | 48.64 | M22 | 56.50 | |
M20 | 30.77 | M29 | 31.00 | M29 | 35.88 | M30 | 48.78 | M20 | 56.50 | |
M10 | 30.80 | M10 | 31.26 | M10 | 36.10 | M25 | 48.79 | M21 | 56.55 | |
| ||||||||||
L ν = 0.75 |
M32 | 35.24 | M32 | 35.76 | M32 | 40.77 | M32 | 54.52 | M30 | 63.33 |
M30 | 35.59 | M30 | 36.20 | M30 | 41.06 | M29 | 54.56 | M32 | 63.41 | |
M22 | 35.84 | M29 | 36.22 | M29 | 41.14 | M22 | 54.71 | M20 | 63.52 | |
M29 | 36.04 | M22 | 36.31 | M22 | 41.16 | M30 | 54.78 | M22 | 63.54 | |
M20 | 36.08 | M10 | 36.63 | M10 | 41.48 | M19 | 54.88 | M21 | 63.57 | |
| ||||||||||
L ν = 0.9 |
M32 | 38.26 | M32 | 38.78 | M32 | 43.83 | M32 | 58.06 | M30 | 67.51 |
M30 | 38.72 | M30 | 39.33 | M30 | 44.21 | M29 | 58.15 | M32 | 67.58 | |
M22 | 39.16 | M29 | 39.35 | M29 | 44.29 | M22 | 58.36 | M20 | 67.73 | |
M29 | 39.17 | M22 | 39.61 | M22 | 44.44 | M30 | 58.37 | M22 | 67.77 | |
M20 | 39.26 | M27 | 39.83 | M10 | 44.70 | M19 | 58.51 | M21 | 67.79 |
Table 6 shows the posterior means (Estimates), the posterior standard errors (SEs), and 95% HPD intervals for the βj’s under model M22 (Ag, Ch, BMI) and model M32 (Ag, S, D, Ch, BMI) when a0 = 0.01. Table 6 also shows the corresponding maximum likelihood estimates (MLEs), the standard errors, and p-values. We see from Table 6 that the posterior estimates are very close to the MLEs, which is intuitively appealing, as a fairly noninformative prior (a0 = 0.01) is used. We also see from this table that under these two “best” models, age and BMI are the only two prognostic factors for a coronary incident that are significant at the 5% significance level.
Table 6.
Estimates of the β under Model (Ag, Ch, BMI) and Model (Ag, S, D, Ch, BMI) for the Chapman Data when a0 = 0.01
Maximum Likelihood Estimates | Posterior Estimates | ||||||
---|---|---|---|---|---|---|---|
Model | Variable | Estimate | SE | p-value | Estimate | SE | 95% HPD Interval |
M22 | Intercept | −2.252 | 0.275 | < .0001 | −2.265 | 0.272 | (−2.805, −1.748) |
Ag | 0.556 | 0.245 | 0.0230 | 0.554 | 0.242 | (0.087, 1.032) | |
Ch | 0.405 | 0.233 | 0.0816 | 0.402 | 0.234 | (−0.064, 0.854) | |
BMI | 0.470 | 0.204 | 0.0211 | 0.465 | 0.207 | (0.069, 0.882) | |
| |||||||
M32 | Intercept | −2.248 | 0.274 | < .0001 | −2.292 | 0.273 | (−2.828, −1.766) |
Ag | 0.527 | 0.270 | 0.0507 | 0.531 | 0.270 | (0.012, 1.067) | |
S | 0.106 | 0.336 | 0.7523 | 0.097 | 0.344 | (−0.583, 0.757) | |
D | −0.077 | 0.383 | 0.8417 | −0.069 | 0.383 | (−0.806, 0.687) | |
Ch | 0.404 | 0.235 | 0.0857 | 0.402 | 0.240 | (−0.074, 0.866) | |
BMI | 0.474 | 0.226 | 0.0361 | 0.473 | 0.230 | (0.028, 0.930) |
To examine the performance of the proposed Monte Carlo method in Section 4, we first computed various model selection criteria under a sub-model using an MCMC sample from the full model. We then computed the same quantities using an MCMC sample directly from the posterior distribution under the same sub-model. For illustrative purposes, we considered a single variable sub-model M2 = (Int, Ag) using the conjugate prior (56) with a0 = 0.01. Using an MCMC sample size of S = 20,000, the Monte Carlo estimates (simulation standard errors) of DIC, LPML, L(ν = 0.1), L(ν = 0.5), and L(ν = 0.9) under model M2 are 146.68 (0.08), −73.30 (0.04), 23.91 (0.05), 32.44 (0.06), and 40.96 (0.06), respectively, using the proposed Monte Carlo method via (35). With the same MC sample size, these quantities are 146.67 (0.02), −73.29 (0.01), 23.90 (0.02), 32.42 (0.02), and 40.95 (0.02), respectively, using the MC sample directly from the posterior distribution under model M2. All simulation standard errors were computed using the overlapping batch statistics (OBS) method of Schmeiser et al. (1990). As expected, the simulation standard errors using the MC sample from the full model are slightly larger than those computed using the MC sample directly from model M2. However, these two sets of MC estimates are very close. This empirically demonstrates that the proposed MC method works quite well. Finally, we compared the computational times between the proposed Monte Carlo method and the exhaustive alternative. With 2,000 “burn-in” iterations and S = 20,000, the computational times of the proposed Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 71.28, 100.11, and 76.36 seconds, respectively, on a Dell WS Xeon dual 2.4GHz CPU Linux workstation. Using the same number of “burn-in” iterations, the same MC sample size, and the same computer, the computational times of the exhaustive alternative Monte Carlo method for 32 DIC’s, LPML’s, and L(ν)’s are 324.05, 357.97, and 322.13 seconds, respectively. Thus, it becomes apparent that the proposed Monte Carlo method leads to a substantial computational saving over the exhaustive alternative.
7 Concluding Remarks
We have examined and established theoretical and computational relationships between six commonly used methods for variable subset selection. These connections were facilitated by the class of conjugate priors of Chen and Ibrahim (2003). We saw that under this class of priors the four Bayesian criteria were quite similar in terms of model choice, especially under small values of a0, and the results were fairly robust over a wide range of a0 values. Further work remains to be done. In particular, it is of interest to obtain analytic connections between these criteria for specific GLM’s, such as the logistic and Poisson regression models, as well as to theoretically examine the small sample and large sample behavior of these methods. In Section 4, the theory and algorithm are developed for computing the four Bayesian criteria defined for the GLM in (2). With some straightforward modifications, this theory and algorithm can be applied to compute the four Bayesian criteria defined for the general GLM in (1).
Some philosophical issues about model selection are worth noting. In this paper, we have evaluated the performance of all criteria based on how well they can pick up the true sampling model. However, there are other ways of defining the “Bayesian model.” Many advocate that a Bayesian model is specified by the sampling density and the prior, not only by the sampling density. When one only evaluates the success of a criterion based on how well it picks up the sampling model, then a comparison between AIC (or BIC) and DIC is not meaningful when DIC is computed using an informative prior. Since AIC is equivalent to DIC based on a noninformative prior, a comparison of AIC (or BIC) to DIC is simply not meaningful when using informative priors. In general, one should avoid such comparisons, and only comparable criteria should be compared. For example, it is meaningful to compare AIC, BIC, DIC, LPML, the L measure, and the Bayes factor based on noninformative priors. It is meaningful to compare DIC, the L measure, LPML, and the Bayes factor based on informative priors. Finally, we note that most criteria for model assessment, especially the information criteria, are based on a well-defined utility function. If a utility function is chosen, a comparison to a criterion based on a different utility function is not justified. For example, the Bayes factor and BIC are prior predictive criteria aiming at the explanation of the data given the prior, whereas DIC (with AIC as a special case) and LPML are posterior predictive criteria aiming at the explanation of replicate (unseen) data given the posterior. Thus, one must use caution in comparing these criteria in terms of picking up the true sampling model.
Acknowledgments
The authors wish to thank the Editor-in-Chief, the Editor, the Associate Editor, and the two referees for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.
Appendix: Proofs of Theorems
Proof of Theorem 5
Since ∫ w(β(−m)|β(m)) dβ(−m) = 1 and β = (β(m)′, β(−m)′)′, we have

E[ g(β(m)) L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) | D ]
 = ∫ g(β(m)) [ L(β(m)|D, m) w(β(−m)|β(m)) / L(β|D) ] [ L(β|D) / C(D) ] dβ
 = (1/C(D)) ∫ g(β(m)) L(β(m)|D, m) [ ∫ w(β(−m)|β(m)) dβ(−m) ] dβ(m)
 = [Cm(D)/C(D)] (1/Cm(D)) ∫ g(β(m)) L(β(m)|D, m) dβ(m)
 = [Cm(D)/C(D)] gm,

which completes the proof.
Proof of Theorem 7
From (43), we have
(A.1) |
Plugging wopt into (A.1), we have
(A.2) |
where π(β(m) | D) denotes the marginal posterior distribution of β(m) under the full model. Thus, it suffices to show
(A.3) |
By the Cauchy-Schwarz inequality, we have
(A.4) |
Using (A.4), the left-hand side of (A.3) becomes
which exactly matches the right-hand side of (A.3).
References
- Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov B, Csaki F, editors. International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973. pp. 267–281.
- Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian Variable Selection and Prediction. Journal of the Royal Statistical Society, Series B. 1998;60:627–641.
- Brown PJ, Vannucci M, Fearn T. Bayes Model Averaging with Selection of Regressors. Journal of the Royal Statistical Society, Series B. 2002;64:519–536.
- Chen CF. On Asymptotic Normality of Limiting Density Functions with Bayesian Implications. Journal of the Royal Statistical Society, Series B. 1985;47:540–546.
- Chen M-H, Dey DK, Ibrahim JG. Bayesian Criterion Based Model Assessment for Categorical Data. Biometrika. 2004;91:45–63.
- Chen M-H, Ibrahim JG. Conjugate Priors for Generalized Linear Models. Statistica Sinica. 2003;13:461–476.
- Chen M-H, Ibrahim JG, Shao Q-M, Weiss RE. Prior Elicitation for Model Selection and Estimation in Generalized Linear Mixed Models. Journal of Statistical Planning and Inference. 2003;111:57–76.
- Chen M-H, Ibrahim JG, Yiannoutsos C. Prior Elicitation, Variable Selection, and Bayesian Computation for Logistic Regression Models. Journal of the Royal Statistical Society, Series B. 1999;61:223–242.
- Chen M-H, Shao Q-M. On Monte Carlo Methods for Estimating Ratios of Normalizing Constants. The Annals of Statistics. 1997;25:1563–1594.
- Chen M-H, Shao Q-M, Ibrahim JG. Monte Carlo Methods in Bayesian Computation. New York: Springer-Verlag; 2000.
- Chipman HA, George EI, McCulloch RE. Bayesian CART Model Search (with Discussion). Journal of the American Statistical Association. 1998;93:935–960.
- Chipman HA, George EI, McCulloch RE. The Practical Implementation of Bayesian Model Selection (with Discussion). In: Lahiri P, editor. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001. pp. 63–134.
- Chipman HA, George EI, McCulloch RE. Bayesian Treed Generalized Linear Models (with Discussion). In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian Statistics. Vol. 7. Oxford: Oxford University Press; 2003. pp. 85–103.
- Clyde M. Bayesian Model Averaging and Model Search Strategies (with Discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 6. Oxford: Oxford University Press; 1999. pp. 157–185.
- Clyde M, George EI. Model Uncertainty. Statistical Science. 2004;19:81–94.
- Dellaportas P, Forster JJ. Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Log-linear Models. Biometrika. 1999;86:615–633.
- Diciccio TJ, Kass RE, Raftery A, Wasserman L. Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association. 1997;92:903–915.
- Dixon WJ, Massey FJ. Introduction to Statistical Analysis. 4th ed. New York: McGraw-Hill; 1983.
- Geisser S. Predictive Inference: An Introduction. London: Chapman & Hall; 1993.
- Gelfand AE, Dey DK. Bayesian Model Choice: Asymptotics and Exact Calculations. Journal of the Royal Statistical Society, Series B. 1994;56:501–514.
- Gelfand AE, Dey DK, Chang H. Model Determination Using Predictive Distributions with Implementation via Sampling-based Methods (with Discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford: Oxford University Press; 1992. pp. 147–167.
- Gelfand AE, Ghosh SK. Model Choice: A Minimum Posterior Predictive Loss Approach. Biometrika. 1998;85:1–13.
- George EI. The Variable Selection Problem. Journal of the American Statistical Association. 2000;95:1304–1308.
- George EI, Foster DP. Calibration and Empirical Bayes Variable Selection. Biometrika. 2000;87:731–747.
- George EI, McCulloch RE. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association. 1993;88:881–889.
- George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7:339–374.
- George EI, McCulloch RE, Tsay R. Two Approaches to Bayesian Model Selection with Applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348.
- Guha S, MacEachern SN, Peruggia M. Benchmark Estimation for Markov Chain Monte Carlo Samples. Journal of Computational and Graphical Statistics. 2004;13:683–701.
- Ibrahim JG, Chen M-H, MacEachern SN. Bayesian Variable Selection for Proportional Hazards Models. Canadian Journal of Statistics. 1999;27:701–717.
- Ibrahim JG, Chen M-H, Ryan LM. Bayesian Variable Selection for Time Series Count Data. Statistica Sinica. 2000;10:971–987.
- Ibrahim JG, Chen M-H, Sinha D. Criterion Based Methods for Bayesian Model Assessment. Statistica Sinica. 2001a;11:419–443.
- Ibrahim JG, Chen M-H, Sinha D. Bayesian Survival Analysis. New York: Springer-Verlag; 2001b.
- Ibrahim JG, Laud PW. A Predictive Approach to the Analysis of Designed Experiments. Journal of the American Statistical Association. 1994;89:309–319.
- Lahiri P, editor. Model Selection. Beachwood, Ohio: Institute of Mathematical Statistics; 2001.
- Laud PW, Ibrahim JG. Predictive Model Selection. Journal of the Royal Statistical Society, Series B. 1995;57:247–262.
- Meng X-L, Schilling S. Warp Bridge Sampling. Journal of Computational and Graphical Statistics. 2002;11:552–586.
- Meng X-L, Wong WH. Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration. Statistica Sinica. 1996;6:831–860.
- Ntzoufras I, Dellaportas P, Forster JJ. Bayesian Variable and Link Determination for Generalised Linear Models. Journal of Statistical Planning and Inference. 2003;111:165–180.
- Raftery AE. Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika. 1996;83:251–266.
- Raftery AE, Madigan D, Hoeting JA. Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association. 1997;92:179–191.
- Schmeiser BW, Avramidis AN, Hashem S. Overlapping Batch Statistics. In: Balci O, Sadowski RP, Nance RE, editors. Proceedings of the 1990 Winter Simulation Conference. San Diego, California: Society for Computer Simulation International; 1990. pp. 395–398.
- Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
- Smith M, Kohn R. Nonparametric Regression Using Bayesian Variable Selection. Journal of Econometrics. 1996;75:317–343.
- Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian Measures of Model Complexity and Fit (with Discussion). Journal of the Royal Statistical Society, Series B. 2002;64:583–639.
- Zellner A. On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions. In: Goel P, Zellner A, editors. Bayesian Inference and Decision Techniques. Amsterdam: Elsevier Science Publishers B.V.; 1986. pp. 233–243.