Published in final edited form as: Lifetime Data Anal. 2008 Oct 3;14(4):496–520. doi: 10.1007/s10985-008-9101-5

Bayesian variable selection for the Cox regression model with missing covariates

Joseph G. Ibrahim, Ming-Hui Chen, Sungduk Kim

Abstract

In this paper, we develop Bayesian methodology and computational algorithms for variable subset selection in Cox proportional hazards models with missing covariate data. A new joint semi-conjugate prior for the piecewise exponential model is proposed in the presence of missing covariates and its properties are examined. The covariates are assumed to be missing at random (MAR). Under this new prior, a version of the Deviance Information Criterion (DIC) is proposed for Bayesian variable subset selection in the presence of missing covariates. Monte Carlo methods are developed for computing the DICs for all possible subset models in the model space. A Bone Marrow Transplant (BMT) dataset is used to illustrate the proposed methodology.

Keywords: Conjugate prior, Deviance information criterion, Missing at random, Proportional hazards models

1 Introduction

Bayesian variable selection in survival analysis is still one of the most challenging problems encountered in practice due to issues regarding (i) prior elicitation, (ii) evaluation of a model selection criterion due to the complication of censoring, and (iii) numerical computation of the criterion for all possible models in the model space. In the context of survival analysis, these issues have been discussed in Ibrahim et al. (1999a, 2001a) and the many references therein. There have been numerous papers in the statistical literature on Bayesian variable selection and model comparison, including articles by George and McCulloch (1993, 1997); Laud and Ibrahim (1995); George et al. (1996); Raftery (1996); Smith and Kohn (1996); Raftery et al. (1997); Brown et al. (1998, 2002); Clyde (1999); Chen et al. (1999, 2003, 2008); Dellaportas and Forster (1999); Chipman et al. (1998, 2001, 2003); George (2000); George and Foster (2000); Ibrahim et al. (2000); Ntzoufras et al. (2003) and Clyde and George (2004). However, the literature on Bayesian variable selection in the presence of missing data, and in particular for survival data in the presence of missing covariates, is still quite sparse. Part of the reason for this is that in the presence of missing covariate data, models can become quite complex and closed forms are not available even in the simplest of models. Thus, computing quantities such as Bayes factors, posterior model probabilities, the Akaike Information Criterion (AIC) (Akaike 1973), the Bayesian Information Criterion (BIC) (Schwarz 1978), and the Deviance Information Criterion (DIC) (Spiegelhalter et al. 2002) becomes a serious computational challenge. For example, to compute BIC in the presence of missing covariate data, one would need to maximize the observed data likelihood. There are two challenging issues with this: (i) the observed data likelihood does not have a closed form for most models, even the linear model when the covariates are not normally distributed, and a suitable approximation is often not available, and (ii) maximizing the observed data likelihood can be a huge challenge even if it is available in closed form. There are also several technical issues for computing AIC and BIC in the presence of missing covariates. One could argue that these measures are not well defined in the context of missing covariate data since the penalty term is not clearly defined. In particular, if we use the observed data likelihood obtained by averaging over the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimensional penalty for AIC and BIC. We elaborate more on this issue in Sect. 5.

This issue becomes even more complex when computing Bayes factors, as one has to integrate over a very large space and the integrals easily become of very high dimension even in the simplest missing data problems. Specifically, it is well known that for methods based on Bayes factors or posterior model probabilities, proper prior distributions are needed. It is a major task to specify prior distributions for all models in the model space, especially if the model space is large. For survival models with missing covariates, it becomes even more challenging to specify prior distributions, as in this case one needs to specify priors not only for the regression coefficients in the survival model, but also for the parameters involved in the models for the missing covariates. The prior elicitation issue has been discussed in detail by several authors including Laud and Ibrahim (1995); Chen et al. (1999) and Ibrahim and Chen (2000). In addition, it is well known that Bayes factors and posterior model probabilities are generally sensitive to the choices of prior hyperparameters, and thus one cannot simply select vague proper priors to get around the elicitation issue. Even when informative prior distributions are available, computing Bayes factors and posterior model probabilities is difficult and expensive, as one needs to compute prior and posterior normalizing constants for each model in the model space. It may be practically infeasible to compute these quantities in the context of variable subset selection for survival models with missing data. Alternatively, criterion based methods can be attractive in the sense that they generally do not require proper prior distributions, and thus have an advantage over posterior model probabilities in this sense. Several recent papers advocating the use of Bayesian criteria for model assessment include Geisser and Eddy (1979); Gelfand et al. (1992); Gelfand and Dey (1994); Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelman et al. (1996); Dey et al. (1997); Gelfand and Ghosh (1998); Ibrahim et al. (2001b); Spiegelhalter et al. (2002); Chen et al. (2004); Huang et al. (2005); Hanson (2006); Celeux et al. (2006) and Kim et al. (2007).

To overcome some of the methodological and computational issues mentioned above, we develop two methodologies in this paper: (i) a class of semi-conjugate priors in the presence of MAR covariate data, and (ii) a variation of DIC for survival models with missing covariates. The proposed class of priors overcomes the elicitation issues mentioned above as well as the computational challenges. The proposed priors make elicitation easier than other conventional informative priors by basing the elicitation on a prediction for observable quantities rather than on the parameters themselves, along with a scalar quantifying the confidence in that prediction. This is an especially attractive approach in variable selection contexts since the regression coefficients for every model in the model space have a different contextual meaning and interpretation, and thus specifying hyperparameters for all of the models in the model space is a monumental task. This elicitation challenge can be overcome by focusing on constructing a prior based on a prediction for the response variable, as pointed out by Laud and Ibrahim (1995) and Chen and Ibrahim (2003). The proposed priors are also computationally attractive in that they lead to full conditionals that are log-concave and hence easily sampled via Adaptive Rejection Sampling (ARS) (Gilks and Wild 1992) within Gibbs. Thus, sampling the posterior with these priors is computationally very efficient.

The proposed version of DIC is an extension of a version of DIC discussed in Huang et al. (2005) for generalized linear models with missing covariates. For survival data with censored observations and missing covariates, DIC has a computational advantage over other criterion-based methods, such as AIC or BIC. With the computational methods developed in Sect. 4, the DIC measures can be easily computed for all models in the model space for a moderate number of covariates. In contrast, computation of AIC or BIC becomes quite difficult and challenging for variable subset selection for survival data with censored observations and missing covariates.

The rest of this paper is organized as follows. Section 2 presents a detailed development of the semi-conjugate prior under the piecewise exponential model in the presence of MAR covariates. Section 3 sets up all necessary formulas for the survival models, priors, and posteriors in the context of variable subset selection and presents a novel version of DIC for survival data with missing covariates. Section 4 presents the computational algorithms for computing the DIC measures for all models in the model space. A detailed analysis of the BMT data is given in Sect. 5. We conclude the article with brief remarks in Sect. 6.

2 The model, prior and posterior

2.1 The model

Let yi denote the minimum of the censoring time Ci and the survival time Ti, and let xi = (xi1, …, xik)′ be the k × 1 vector of covariates associated with yi for the ith subject. Denote by β = (β1, …, βk)′ the k × 1 vector of regression coefficients. Also, νi = 1{Ti = yi} is the indicator for the event for i = 1, 2, …, n, where n is the total number of observations. As usual, we assume throughout that xi does not include an intercept, since the intercept is not estimable in the Cox proportional hazards model, and that given xi, Ti and Ci are independent. In the presence of missing covariates, the missing data mechanism is defined as the distribution of the k × 1 random vector ri = (ri1, ri2, …, rik)′, where rij = 0 when xij is missing and rij = 1 when xij is observed for i = 1, 2, …, n and j = 1, 2, …, k. We assume that any missingness in covariates xij is missing at random (MAR) (Rubin 1976; Little and Rubin 2002). As discussed in Ibrahim et al. (2005), for MAR covariates xij we do not need to model the missing data mechanism.

We consider the Cox proportional piecewise exponential hazards model for [yi|xi], which has the survival function given by

S(y_i \mid x_i, \beta, \lambda) = \exp\{-\exp(x_i'\beta)\, H_0(y_i \mid \lambda)\},  (2.1)

where H0(t|λ) is the baseline cumulative hazard function. The piecewise exponential model is assumed for the baseline hazard function h0(t). Specifically, we first partition the time axis into J intervals: (s0, s1], (s1, s2], …, (sJ−1, sJ], where s0 = 0 < s1 < s2 < ⋯ < sJ. In practice, it is sufficient to choose sJ to be greater than the largest follow-up time. We then assume a constant hazard λj over the jth interval Ij = (sj−1, sj]. That is, h0(y) = λj if y ∈ Ij for j = 1, 2, …, J. Then the corresponding baseline cumulative hazard function, H0(y|λ), is given by

H_0(y \mid \lambda) = \lambda_j (y - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})  (2.2)

for sj−1 < y ≤ sj, where λ = (λ1, …, λJ)′. We note that when J = 1, H0(y|λ) reduces to the parametric exponential model.
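
As a concrete illustration (a sketch in Python, not the authors' code), the baseline cumulative hazard (2.2) and the survival function (2.1) can be evaluated as follows; the cut points s and hazard levels lam are made-up values, and y is assumed to lie in (0, sJ].

```python
import numpy as np

# Illustrative cut points s_0 = 0 < s_1 < ... < s_J and hazard levels lambda_j;
# these values are invented for the sketch, not taken from the paper.
s = np.array([0.0, 1.0, 2.5, 5.0, 12.0])   # J = 4 intervals
lam = np.array([0.30, 0.22, 0.15, 0.10])   # constant hazard on each interval

def H0(y, s, lam):
    """Baseline cumulative hazard (2.2) at a scalar time 0 < y <= s_J."""
    j = np.searchsorted(s, y, side="left")      # interval index: s_{j-1} < y <= s_j
    widths = np.diff(s)                         # interval widths s_g - s_{g-1}
    return lam[j - 1] * (y - s[j - 1]) + np.dot(lam[: j - 1], widths[: j - 1])

def survival(y, x, beta, s, lam):
    """Survival function (2.1): S(y | x) = exp{-exp(x'beta) H0(y)}."""
    return np.exp(-np.exp(x @ beta) * H0(y, s, lam))

# e.g., survival probability at y = 3.0 for a subject with two covariates
print(survival(3.0, np.array([0.5, -1.0]), np.array([0.2, 0.4]), s, lam))
```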

We write xi = (x1i′, x2i′)′, where x1i is a k1 × 1 vector of covariates that are observed for all n observations, x2i is a k2 × 1 vector of covariates that have at least one missing value in the n observations, and k1 + k2 = k with k1 ≥ 0 and k2 ≥ 1. Furthermore, we let x2i,mis denote the vector of covariates that are missing for the ith case and let x2i,obs be the vector of covariates that are observed for the ith case. Let D = {(yi, νi, x1i, x2i,mis, x2i,obs), i = 1, 2, …, n} denote the complete data. Then, the complete data likelihood function is given by

L(\beta, \lambda \mid D) = \prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i'\beta)\}^{\delta_{ij}\nu_i} \times \exp\Big[-\delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big],  (2.3)

where δij = 1 if the ith subject failed or was right censored in the jth interval (sj−1, sj], and 0 otherwise.
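
Continuing the sketch above (and reusing H0, s, and lam), the complete-data log-likelihood implied by (2.3) collapses to a single term per subject, since δij selects the one interval containing yi:

```python
# A minimal sketch of log L(beta, lambda | D) corresponding to (2.3): for each
# subject, nu_i [log lambda_{j(i)} + x_i'beta] - exp(x_i'beta) H0(y_i | lambda),
# where j(i) is the interval containing y_i.
def log_likelihood(y, nu, X, beta, s, lam):
    total = 0.0
    for yi, vi, xi in zip(y, nu, X):
        eta = xi @ beta                          # linear predictor x_i'beta
        j = np.searchsorted(s, yi, side="left")  # interval with s_{j-1} < y_i <= s_j
        total += vi * (np.log(lam[j - 1]) + eta) - np.exp(eta) * H0(yi, s, lam)
    return total
```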

2.2 Prior and posterior

We first specify a prior distribution for (β, λ). To this end, we extend the conjugate prior proposed by Chen and Ibrahim (2003) for the generalized linear model (GLM) to the piecewise exponential model in (2.1). Let X denote the n × k matrix with its ith row equal to xi′. Given X, we propose a semi-conjugate prior as follows:

\pi(\beta, \lambda \mid y_0, X, a_0) \propto \Big(\prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i'\beta)\}^{a_0 \delta_{0ij}} \times \exp\Big[-a_0 \delta_{0ij} \exp(x_i'\beta)\Big\{\lambda_j (y_{0i} - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big]\Big)\, \pi_0(\lambda),  (2.4)

where a0 > 0 is a scalar prior parameter, y0 = (y01, …, y0n)′ is an n × 1 vector of prior parameters, δ0ij = 1 if sj−1 < y0isj and 0 otherwise, and π0(λ) is an initial prior for λ.

The prior (2.4) is called semi-conjugate since, by ignoring π0(λ), the prior has an identical form as the complete data likelihood given in (2.3). As discussed in Chen and Ibrahim (2003), y0i can be viewed as a prior prediction for the marginal mean of yi. Since y0i is the prior prediction of yi, we assume that y0i is an “observed” failure time. Thus, in eliciting y0, we must focus on a prediction (or guess) for E(yi), which narrows the possibilities for choosing y0i. To obtain a noninformative prior for (β, λ), we specify all the y0i equal. As shown in Chen and Ibrahim (2003), this specification under the GLM yields a prior in which the prior modes of the slopes in the regression model are the same. For the piecewise exponential model, we consider y01 = ⋯ = y0n = y0, where 0 < y0s1. Under this specification of y0, (2.4) reduces to

\pi(\beta, \lambda \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\, \pi_0(\lambda).  (2.5)

We further specify π0(λ) as follows

\pi_0(\lambda) \propto \frac{1}{\lambda_1} \prod_{j=2}^{J} \lambda_j^{b_1 - 1} \exp(-b_2 \lambda_j),  (2.6)

where b1 > 0 and b2 > 0. Note that in (2.5), we assume an improper uniform initial prior for β and an improper Jeffreys-type initial prior for λ1. Thus, π0(λ) introduced in (2.4) and further specified in (2.6) is an improper prior. However, under some mild conditions, the prior (2.5) is proper and (log λ1, β) has a prior mode of (−log y0, 0, …, 0)′. We formally characterize these properties in the following theorem.

Theorem 2.1

Let Xobs denote the submatrix of X whose rows consist of the completely observed xi′'s, and let xmis = (x2i,mis, i = 1, 2, …, n). Also, let Xobs* = (1, Xobs). Assume that Xobs* is of full rank k + 1 and that π0(λ) is given by (2.6). Then, for any given xmis, (i) (log λ1, β′)′ has a unique prior mode of (−log y0, 0, …, 0)′ and (ii) π(β, λ|y0, X, a0) is proper.

The proof of Theorem 2.1 is given in Appendix A. We note that the conditions of Theorem 2.1 require at least k + 1 complete observations with linearly independent covariate vectors including an intercept. From Theorem 2.1, we see that when y01 = ⋯ = y0n = y0, the prior mode of β is 0 and with this prior prediction for the yi, both β and λ1 are identifiable in the sense that the joint prior is proper. Note that if we assume a general gamma prior instead of a Jeffreys-type prior for λ1 in (2.6), we can show that the prior mode of β is still 0, but the prior mode of log λ1 is no longer −log y0. Thus, a different specification of the initial prior for λ1 only changes the prior mode of the “intercept” in the survival model. Although we assume y0 ≤ s1 in Theorem 2.1, we can show that the prior mode of β is still 0 even when y0 > s1. This is intuitively appealing since, in this case, the prior prediction y0i does not depend on the ith subject’s specific covariate information. We further note that the parameter a0 in (2.4) or (2.5) can be generally viewed as a precision parameter that quantifies the strength or confidence of our prior belief in y0. From Theorem 2.1, we see that the prior mode of β does not depend on a0. Thus, a0 controls only the prior precision of β. This is an attractive feature that allows us to do sensitivity analyses by varying a0 in the prior.

Next, we specify the distribution for the missing covariates. Since we are primarily interested in inferences about β, we only need to model x2i since x1i is observed for all n observations. Therefore, we model x2i conditioning on the completely observed covariates x1i throughout. Using a sequence of one-dimensional conditional distributions proposed by Lipsitz and Ibrahim (1996) and Ibrahim et al. (1999b), we specify the distribution of the k2-dimensional covariate vector x2i = (x2i1, x2i2, …, x2ik2)′ as

f(x_{2i} \mid x_{1i}, \alpha) = f(x_{2i1} \mid x_{1i}, \alpha_1)\, f(x_{2i2} \mid x_{2i1}, x_{1i}, \alpha_2) \cdots f(x_{2ik_2} \mid x_{2i,k_2-1}, \ldots, x_{2i1}, x_{1i}, \alpha_{k_2}),  (2.7)

where αl is a vector of parameters for the lth conditional distribution, the αl's are distinct, and moreover, α = (α1′, α2′, …, αk2′)′. To complete the prior specification, we take independent priors for α1, …, αk2 so that

\pi(\alpha) = \prod_{l=1}^{k_2} \pi(\alpha_l).  (2.8)
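
To make the factorization (2.7) concrete, the following toy sketch (assumed forms, not the BMT specification of Sect. 5) evaluates the joint density of k2 = 2 missing covariates as a logistic regression density times a conditional normal density:

```python
import numpy as np
from scipy.stats import norm

# Assumed parameterization: alpha1 = (intercept, slopes on x_1i) for a binary
# x_2i1; alpha2 = (intercept, slope on x_2i1, slopes on x_1i, variance) for a
# normal x_2i2. These forms are illustrative choices only.
def f_x2_given_x1(x21, x22, x1, alpha1, alpha2):
    p = 1.0 / (1.0 + np.exp(-(alpha1[0] + alpha1[1:] @ x1)))
    f1 = p if x21 == 1 else 1.0 - p                       # f(x_2i1 | x_1i, alpha_1)
    mu = alpha2[0] + alpha2[1] * x21 + alpha2[2:-1] @ x1  # conditional mean
    f2 = norm.pdf(x22, loc=mu, scale=np.sqrt(alpha2[-1])) # f(x_2i2 | x_2i1, x_1i, alpha_2)
    return f1 * f2

# e.g., with a single fully observed covariate x_1i = (0.5,)
print(f_x2_given_x1(1, 0.3, np.array([0.5]),
                    np.array([0.1, 0.2]), np.array([0.0, 0.5, -0.3, 1.0])))
```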

Let xobs = ((x1i′, x2i,obs′), i = 1, 2, …, n). Using (2.5)–(2.8), the joint prior for β, λ, xmis, and α is given by

\pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, x_{obs}) \propto \Big[\prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\Big] \times \Big[\prod_{i=1}^{n} f(x_{2i} \mid x_{1i}, \alpha)\Big]\, \pi_0(\lambda)\, \pi(\alpha).  (2.9)

Let Dobs = (y, ν, xobs) denote the completely observed data, where y = (y1, y2, …, yn)′ and ν = (ν1, ν2, …, νn)′. Then, the joint posterior distribution is given by

\pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, D_{obs}) \propto L(\beta, \lambda \mid D)\, \pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, x_{obs}),  (2.10)

where L(β, λ|D) and π (β, λ, xmis, α|y0, a0, xobs) are given by (2.3) and (2.9), respectively. Although the posterior distribution in (2.10) is analytically intractable, a Gibbs sampling algorithm can be easily developed to sample from this posterior distribution. The implementational details of the Gibbs sampling algorithm are discussed in Appendix B.

3 Bayesian variable subset selection

Let ℳ denote the model space. We enumerate the models in ℳ by m = 1, 2, …, 𝒦, where 𝒦 is the dimension of ℳ and model 𝒦 denotes the full model. Also, let β(𝒦) = (β1, β2, …, βk)′ denote the regression coefficients for the full model, and let xi(m) and β(m) denote the km × 1 vectors of covariates and regression coefficients for model m under a specific choice of km covariates. We write xi = (xi(m)′, xi(−m)′)′ and β(𝒦) = (β(m)′, β(−m)′)′, where xi(−m) is xi with xi(m) deleted, and β(−m) is β(𝒦) with β(m) deleted. We also write x1i = (x1i(m)′, x1i(−m)′)′ and x2i = (x2i(m)′, x2i(−m)′)′, where x1i(m) is a k1m (≤ k1) dimensional vector, x2i(m) is a k2m (≤ k2) dimensional vector, and x1i(−m) and x2i(−m) are x1i and x2i with x1i(m) and x2i(m) deleted, respectively. Furthermore, we write x2i,mis = (x2i,mis(m)′, x2i,mis(−m)′)′ and x2i,obs = (x2i,obs(m)′, x2i,obs(−m)′)′, where x2i,mis(−m) and x2i,obs(−m) are x2i,mis and x2i,obs with x2i,mis(m) and x2i,obs(m) deleted, respectively.

Under model m, let Dm = {(yi, νi, x1i(m), x2i,mis(m), x2i,obs(m)), i = 1, 2, …, n} denote the complete data. Then the complete data likelihood function is given by

L(\beta^{(m)}, \lambda \mid D_m) = \prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i^{(m)\prime} \beta^{(m)})\}^{\delta_{ij}\nu_i} \times \exp\Big[-\delta_{ij} \exp\{x_i^{(m)\prime} \beta^{(m)}\}\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big],  (3.1)

where δij is defined in (2.3). Using exactly the same ordering of the sequence of one-dimensional conditional distributions for the covariates x2i in (2.7), with x2i(−m) deleted, we specify the distribution of the k2m-dimensional covariate vector x2i(m) = (x2i1(m), x2i2(m), …, x2ik2m(m))′ as

f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)}) = f(x_{2i1}^{(m)} \mid x_{1i}^{(m)}, \alpha_1^{(m)})\, f(x_{2i2}^{(m)} \mid x_{2i1}^{(m)}, x_{1i}^{(m)}, \alpha_2^{(m)}) \times \cdots \times f(x_{2ik_{2m}}^{(m)} \mid x_{2i,k_{2m}-1}^{(m)}, \ldots, x_{2i1}^{(m)}, x_{1i}^{(m)}, \alpha_{k_{2m}}^{(m)}),  (3.2)

where α(m) = (α1(m)′, α2(m)′, …, αk2m(m)′)′. It is important to note that in (3.2), α(m) is a subvector of α in (2.7). We further write α = (α(m)′, α(−m)′)′, where α(−m) is α with α(m) deleted. Similar to (2.8), the prior for α(m) is specified as π(α(m)) = ∏_{l=1}^{k2m} π(αl(m)).

Let xobs(m) = ((x1i(m)′, x2i,obs(m)′), i = 1, 2, …, n) and xmis(m) = (x2i,mis(m), i = 1, 2, …, n). By applying the semi-conjugate prior (2.5) to model m, we have the joint prior for β(m), λ, xmis(m), and α(m) given by

\pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, x_{obs}^{(m)}) \propto \Big(\prod_{i=1}^{n} [\lambda_1 \exp\{x_i^{(m)\prime} \beta^{(m)}\}]^{a_0} \exp[-a_0 y_0 \lambda_1 \exp\{x_i^{(m)\prime} \beta^{(m)}\}]\Big) \times \Big[\prod_{i=1}^{n} f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)})\Big]\, \pi_0(\lambda)\, \pi(\alpha^{(m)}),  (3.3)

where π0(λ) and f(x2i(m)|x1i(m), α(m)) are defined by (2.6) and (3.2), respectively. Note that all models in the model space share the same prior for λ. Let Dm,obs = (y, ν, xobs(m)) denote the completely observed data. Under model m, the joint posterior distribution is given by

\pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, D_{m,obs}) \propto L(\beta^{(m)}, \lambda \mid D_m) \times \pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, x_{obs}^{(m)}),  (3.4)

where L(β(m), λ|Dm) and π(β(m),λ,xmis(m),α(m)|y0,a0,xobs(m)) are given by (3.1) and (3.3), respectively.

We carry out Bayesian variable selection via DIC, originally proposed by Spiegelhalter et al. (2002). The use of DIC for missing data models has been discussed in detail in Celeux et al. (2006). Let θ(m) = (β(m)′, λ′, xmis(m)′)′. DIC is defined as follows:

\mathrm{DIC}_m = \mathrm{Dev}_m(\bar{\theta}^{(m)}) + 2 p_m,  (3.5)

where Devm(θ(m)) is a deviance function and θ̄(m) is the posterior mean of θ(m). In (3.5), pm is the effective number of model parameters, which is calculated as

p_m = \overline{\mathrm{Dev}}_m(\theta^{(m)}) - \mathrm{Dev}_m(\bar{\theta}^{(m)}),  (3.6)

where

\overline{\mathrm{Dev}}_m(\theta^{(m)}) = E[\mathrm{Dev}_m(\theta^{(m)}) \mid D_{m,obs}]  (3.7)

and the expectation is taken with respect to the posterior distribution given in (3.4). Since we are primarily interested in inferences about the survival model, we define the deviance function, Devm(θ(m)) in (3.5) as follows:

\mathrm{Dev}_m(\theta^{(m)}) = -2 \log L(\beta^{(m)}, \lambda \mid D_m),

where L(β(m), λ|Dm) is given by (3.1). Following Huang et al. (2005), we compute Dev_m(θ̄(m)) as

\mathrm{Dev}_m(\bar{\theta}^{(m)}) = -2 \sum_{i=1}^{n} \sum_{j=1}^{J} \Big( \delta_{ij}\nu_i \big[\log E[\lambda_j \mid D_{m,obs}] + E[x_i^{(m)\prime} \beta^{(m)} \mid D_{m,obs}]\big] - \delta_{ij} \exp\big\{E[x_i^{(m)\prime} \beta^{(m)} \mid D_{m,obs}]\big\} \Big\{E[\lambda_j \mid D_{m,obs}](y_i - s_{j-1}) + \sum_{g=1}^{j-1} E[\lambda_g \mid D_{m,obs}](s_g - s_{g-1})\Big\} \Big),  (3.8)

where all expectations are taken with respect to the posterior distribution in (3.4). In (3.8), instead of computing (E[xi(m)|Dm,obs])′E[β(m)|Dm,obs], we compute E[xi(m)′β(m)|Dm,obs] in the presence of missing covariates, which yields a more appropriate dimensional penalty term pm.

The DIC defined above is a Bayesian measure of predictive model performance, which is decomposed into a measure of fit and a measure of model complexity (pm). The smaller the value of DIC, the better the model will predict new observations generated in the same way as the data. As discussed and shown in Chen et al. (2008), the performance of DIC is similar to AIC. Moreover, the DIC defined in (3.5) has a nice computational property for Bayesian variable selection, which will be discussed in detail in the next section.
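
For concreteness, the following hedged sketch (our own illustration, with assumed array layouts) computes DICm and pm in (3.5)–(3.8) from Q posterior draws; lam_draws is Q × J, and eta_draws is Q × n with eta_draws[q, i] the linear predictor xi(m)′β(m) at draw q, evaluated with the imputed missing covariates as (3.8) requires:

```python
import numpy as np

def log_lik_pw(y, nu, s, lam, eta):
    """Log of (3.1) with the linear predictors eta supplied directly."""
    j = np.searchsorted(s, y, side="left")                   # interval index per subject
    cumH = np.concatenate(([0.0], np.cumsum(lam * np.diff(s))))
    H = cumH[j - 1] + lam[j - 1] * (y - s[j - 1])            # H0(y_i | lambda)
    return np.sum(nu * (np.log(lam[j - 1]) + eta) - np.exp(eta) * H)

def dic(y, nu, s, lam_draws, eta_draws):
    # posterior mean of Dev_m(theta^(m)), i.e. (3.7), by Monte Carlo averaging
    devs = np.array([-2.0 * log_lik_pw(y, nu, s, lam_draws[q], eta_draws[q])
                     for q in range(lam_draws.shape[0])])
    dev_bar = devs.mean()
    # Dev_m at the posterior means of lambda_j and of x'beta, i.e. (3.8)
    dev_at_mean = -2.0 * log_lik_pw(y, nu, s,
                                    lam_draws.mean(axis=0), eta_draws.mean(axis=0))
    p_m = dev_bar - dev_at_mean                              # effective parameters (3.6)
    return dev_at_mean + 2.0 * p_m, p_m                      # DIC_m in (3.5)
```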

4 Computation of DIC measures

To carry out Bayesian variable selection, we need to compute DICm in (3.5) for m = 1, 2, …, 𝒦. Due to the complexity of the survival model in (3.1), analytical evaluation of DICm does not appear possible. Thus, a Monte Carlo (MC) method is needed to compute all DICm’s in the model space. To this end, we propose two approaches for computing the DICm’s. The first approach, called “the direct sampling method”, is based on direct Monte Carlo samples from each model in the model space. The second approach is “the single MC sample method,” which was proposed by Chen et al. (2008). The latter method requires only one Markov chain Monte Carlo (MCMC) sample from the posterior distribution under the full model and computes the Bayesian criterion simultaneously for all possible subset models in the model space. From (3.7) and (3.8), we observe that for DICm in (3.5), we need to compute the following quantities: (i) E[Devm(θ(m))|Dm,obs]; (ii) E[xi(m)′β(m)|Dm,obs]; and (iii) E[λj|Dm,obs] for j = 1, 2, …, J. For (ii), we note that when xi(m) is completely observed, E[xi(m)′β(m)|Dm,obs] = xi(m)′E[β(m)|Dm,obs]. Thus, for (ii), we may further consider (iia) E[β(m)|Dm,obs] and (iib) E[xi(m)′β(m)|Dm,obs] with at least one missing covariate in xi(m). It is interesting to observe that there is a common feature among (i), (iia), (iib), and (iii): all of these quantities can be written as

g_m = E[g(\theta^{(m)}) \mid D_{m,obs}],  (4.1)

for various functions g, where θ(m) = (β(m)′, λ′, xmis(m)′)′ and the expectation is taken with respect to the joint posterior distribution in (3.4) under model m.

First, we discuss the direct sampling method. Using the Gibbs sampling algorithm given in Appendix B, we generate a Monte Carlo sample {θq(m), q = 1, 2, …, Q} from the joint posterior distribution in (3.4) under model m. Then, a Monte Carlo estimate of gm is given by

\hat{g}_m = \frac{1}{Q} \sum_{q=1}^{Q} g(\theta_q^{(m)})  (4.2)

for all g’s listed in (i)–(iii). Then, plugging various ĝm’s in (3.5) gives a Monte Carlo estimate of DICm.

Next, we discuss the single MC sample method. Using the notation given in Sect. 3, we write γ = (β′, λ′, xmis′, α′)′, γ(m) = (β(m)′, λ′, xmis(m)′, α(m)′)′, and γ(−m) = (β(−m)′, xmis(−m)′, α(−m)′)′, where xmis(−m) is xmis with xmis(m) deleted and γ(−m) is γ with γ(m) deleted. Thus, the marginal likelihood under model m is given by

C_m = \int \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})\, d\gamma^{(m)},  (4.3)

where

\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) = L(\beta^{(m)}, \lambda \mid D_m) \Big(\prod_{i=1}^{n} [\lambda_1 \exp\{x_i^{(m)\prime}\beta^{(m)}\}]^{a_0} \exp[-a_0 y_0 \lambda_1 \exp\{x_i^{(m)\prime}\beta^{(m)}\}]\Big) \times \Big[\prod_{i=1}^{n} f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)})\Big]\, \pi_0^*(\lambda)\, \pi(\alpha^{(m)}),  (4.4)

π0*(λ) = (1/λ1) ∏_{j=2}^{J} λj^{b1−1} exp(−b2λj), and L(β(m), λ|Dm), π0(λ), and f(x2i(m)|x1i(m), α(m)) are defined by (3.1), (2.6), and (3.2), respectively. Then, for a given function g, we have

g_m = E[g(\theta^{(m)}) \mid D_{m,obs}] = \int g(\theta^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{C_m}\, d\gamma^{(m)},  (4.5)

where Cm is defined in (4.3).

For any given function g such that E[|g(θ(m))||Dm,obs] < ∞, we have the following identity

g_m = \frac{C_{\mathcal{K}}}{C_m}\, E\Big[ g(\theta^{(m)})\, w(\gamma^{(-m)} \mid \gamma^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{\pi^*(\gamma \mid y_0, a_0, D_{obs})} \,\Big|\, D_{obs} \Big],  (4.6)

where C𝒦 is the marginal likelihood given in (4.3) under the full model, π*(γ|y0, a0, Dobs) is given in (4.4) corresponding to the full model, which is essentially the kernel of the joint posterior distribution in (2.10), and the expectation is taken with respect to the joint posterior distribution in (2.10) under the full model. In (4.6), w(γ(−m)|γ(m)) is a completely known conditional density whose support is contained in, or equal to, the support of the conditional density of γ(−m) given γ(m) with respect to the joint posterior distribution in (2.10) under the full model.

Observe that as a special case of (4.1), we have gm = 1 when g ≡ 1. Using this result, we further obtain that

\frac{C_m}{C_{\mathcal{K}}} = E\Big[ w(\gamma^{(-m)} \mid \gamma^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{\pi^*(\gamma \mid y_0, a_0, D_{obs})} \,\Big|\, D_{obs} \Big].  (4.7)

Using (4.6) and (4.7), we have

g_m = \frac{E\big[ g(\theta^{(m)})\, w(\gamma^{(-m)} \mid \gamma^{(m)})\, \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma \mid y_0, a_0, D_{obs}) \mid D_{obs} \big]}{E\big[ w(\gamma^{(-m)} \mid \gamma^{(m)})\, \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma \mid y_0, a_0, D_{obs}) \mid D_{obs} \big]}.  (4.8)

We note that since the dimension of λ does not change across models, π0*(λ) cancels out in the ratio π*(γ(m)|y0, a0, Dm,obs)/π*(γ|y0, a0, Dobs).

Let {γq = (βq′, λq′, xmis,q′, αq′)′, q = 1, 2, …, Q} denote an MCMC sample of size Q from the joint posterior distribution (2.10) under the full model. Write γq = (γq(m)′, γq(−m)′)′, where γq(m) = (βq(m)′, λq′, xmis,q(m)′, αq(m)′)′ and γq(−m) = (βq(−m)′, xmis,q(−m)′, αq(−m)′)′. Also let θq(m) = (βq(m)′, λq′, xmis,q(m)′)′. Then, an MC estimate of gm is given by

\hat{g}_m = \frac{\sum_{q=1}^{Q} g(\theta_q^{(m)})\, w(\gamma_q^{(-m)} \mid \gamma_q^{(m)})\, \pi^*(\gamma_q^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma_q \mid y_0, a_0, D_{obs})}{\sum_{q=1}^{Q} w(\gamma_q^{(-m)} \mid \gamma_q^{(m)})\, \pi^*(\gamma_q^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma_q \mid y_0, a_0, D_{obs})}.  (4.9)

Under certain regularity conditions, such as ergodicity, we have

\lim_{Q \to \infty} \hat{g}_m = g_m,

implying that ĝm is consistent.
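
Schematically, the estimator (4.9) can be coded as below; g, w, log_kern_m, and log_kern_full are user-supplied callables for the quantities defined above (our names, not the paper's), and the ratios are computed on the log scale for numerical stability:

```python
import numpy as np

# A sketch of the single-MC-sample estimator (4.9): one set of draws from the
# full-model posterior is reweighted to estimate g_m under any submodel m.
def g_hat_single_sample(draws, g, w, log_kern_m, log_kern_full):
    log_r = np.array([np.log(w(gam)) + log_kern_m(gam) - log_kern_full(gam)
                      for gam in draws])
    r = np.exp(log_r - log_r.max())        # stabilized importance ratios
    g_vals = np.array([g(gam) for gam in draws])
    return np.sum(g_vals * r) / np.sum(r)  # ratio estimator (4.9)
```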

As shown in Chen et al. (2008), the optimal choice of w(γ(−m)|γ(m)) is the conditional posterior distribution of γ(−m) given γ(m) under the full model, in the sense that ĝm achieves the minimum asymptotic variance. However, the optimal choice of w(γ(−m)|γ(m)) is not computationally feasible. Thus, we propose the following weight function

w(\gamma^{(-m)} \mid \gamma^{(m)}) = w(\beta^{(-m)} \mid \beta^{(m)}, \lambda, x_{mis})\, w(\alpha^{(-m)} \mid \alpha^{(m)}, x_{mis})\, w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}).  (4.10)

Note that in (4.10), when model m includes all missing covariates x2i, we do not need to compute w(xmis(−m)|xmis(m), α(m)) since in this case xmis(−m) is a null vector in the sense that it has zero dimension. In (4.10), a good w(β(−m)|β(m), λ, xmis), which is close to the optimal choice, can be constructed based on an asymptotic approximation to the joint posterior. Let β̂(−m)(β(m), λ, xmis) denote the conditional posterior mode of β(−m) given β(m), λ, and xmis under the full model. Specifically, we first compute

\hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis}) = \arg\max_{\beta^{(-m)}} \log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs}),  (4.11)

where

\log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs}) = \sum_{i=1}^{n} \sum_{j=1}^{J} \Big[ \delta_{ij}\nu_i x_i'\beta - \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\} \Big] + \sum_{i=1}^{n} \big[ a_0 x_i'\beta - a_0 y_0 \lambda_1 \exp(x_i'\beta) \big]  (4.12)

and then compute

\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis}) = \Big[ -\frac{\partial^2 \log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs})}{\partial \beta^{(-m)} \partial \beta^{(-m)\prime}} \Big|_{\beta^{(-m)} = \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})} \Big]^{-1}.

Thus, a good w(β(−m) | β(m), λ, xmis) can be constructed as follows:

w(\beta^{(-m)} \mid \beta^{(m)}, \lambda, x_{mis}) = (2\pi)^{-\frac{k-k_m}{2}} |\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \big(\beta^{(-m)} - \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big)' \big[\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big]^{-1} \big(\beta^{(-m)} - \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big) \Big\},  (4.13)

which approximates the joint conditional posterior π(β(−m)|β(m), λ, xmis, y0, a0, Dobs) under the full model. Similarly, we can construct a good w(α(−m)|α(m), xmis) in (4.10). For w(xmis(−m)|xmis(m), α(m)), we use a Monte Carlo estimate given by

w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}) = \frac{1}{Q} \sum_{q=1}^{Q} w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}, \alpha_q^{(-m)}),  (4.14)

where

w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}, \alpha_q^{(-m)}) \propto \prod_{i=1}^{n} f(x_{2i} \mid x_{1i}, \alpha^{(m)}, \alpha_q^{(-m)}),

and f(x2i|x1i, α(m), αq(−m)) is given by (2.7) under the full model.
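
A hedged sketch of constructing the weight (4.13) is given below: the conditional mode (4.11) is found numerically, and the BFGS inverse Hessian is used as a convenient stand-in for Σ̂(−m) rather than the exact second-derivative matrix; neg_log_kernel is assumed to return −log π*(β|λ, xmis, y0, a0, Dobs) as a function of β(−m), with (β(m), λ, xmis) held fixed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def make_normal_weight(neg_log_kernel, dim_minus):
    # beta_hat^(-m) via numerical maximization of the log kernel (4.11)
    fit = minimize(neg_log_kernel, np.zeros(dim_minus), method="BFGS")
    # BFGS inverse Hessian approximates Sigma_hat^(-m) (a sketch; one may
    # instead compute and invert the exact Hessian of (4.12))
    mode, cov = fit.x, fit.hess_inv
    return lambda b_minus: multivariate_normal.pdf(b_minus, mean=mode, cov=cov)
```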

5 Analysis of the BMT data

The BMT data set consists of n = 2397 cases who received an HLA-identical sibling transplant from 1995 to 2004 for AML or ALL in CR1 (pre-transplant status = 1st complete remission) with graft source of BM or PB/PB+BM. Infants (age < 2 years old) were excluded. The outcome variable, yi in years, was the time from transplant to death or end of follow-up, and νi denotes the censoring indicator, which equals 1 if the ith subject died and 0 otherwise. The median follow-up was 5.1 years with an interquartile range of 3.0 to 7.8 years. There were 904 deaths in the data set. We consider ten covariates: disease (disease type: AML, ALL), age, yeartx (transplant year), karnofprg (Karnofsky score at pre-transplant), gsource (graft type: BM, PB/PB+BM), sexmatch (donor-patient sex match: MM, MF, FM, FF), regimprg (conditioning regimen: CY+TBI±oth, TBI + other, Busulf + CY ± oth, Other/Unknown), prevgvh1 (GVHD prophylaxis: mtx ± other, csa ± other, mtx + csa ± other, tdep ± other, Other/Unknown), cytoabnew (cytogenetics: Poor, InterMed, Normal, Good), and wbcdx (WBC at diagnosis (10^9/l)). The covariates age, yeartx, karnofprg, and wbcdx are continuous, and the covariates disease and gsource are binary. We code sexmatch through the dummy variables sexmatch1, sexmatch2, and sexmatch3, where (sexmatch1, sexmatch2, sexmatch3) takes values (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1), which correspond to MM, MF, FM, and FF, respectively. In exactly the same fashion, we code regimprg, prevgvh1, and cytoabnew as (regimprg1, regimprg2, regimprg3), (prevgvh11, prevgvh12, prevgvh13, prevgvh14), and (cytoabnew1, cytoabnew2, cytoabnew3). For instance, the values (0,0,0), (1,0,0), (0,1,0), and (0,0,1) for (cytoabnew1, cytoabnew2, cytoabnew3) correspond to Poor, InterMed, Normal, and Good for cytoabnew, respectively.

Let x1 = disease, x2 = age, x3 = yeartx, x4 = karnofprg, x5 = gsource, x6 = (sexmatch1, sexmatch2, sexmatch3)′, x7 = (regimprg1, regimprg2, regimprg3)′, x8 = (prevgvh11, prevgvh12, prevgvh13, prevgvh14)′, x9 = (cytoabnew1, cytoabnew2, cytoabnew3)′, and x10 = log(wbcdx). For these 10 covariates, x1, x2, …, x8 were completely observed for all cases, and x9 and x10 had missing information. There were 488 (20.36%) individuals with cytogenetics (x9) missing, 230 (9.6%) individuals with WBC missing, and 96 individuals with both cytogenetics and WBC missing. Overall, there were 623 (25.99%) individuals with at least one covariate missing. We assume that the missing covariates are MAR. In all computations, we standardized all completely observed covariates.

For the BMT data, we fit the piecewise exponential model given by (2.1) and (2.2) for the outcome variable yi, where sj is chosen to be the (j/J)th quantile of the failure times yi, for j = 1, 2, …, J − 1. Since x1, x2, …, x8 are always observed, they do not need to be modeled, and thus we condition on those covariates throughout. We then use a proportional odds logistic regression model for x9 and a normal regression model for x10. Specifically, under the full model with all ten covariates, f(x9|x1, x2, …, x8, α9) is specified as follows:

P(x_9 = (0,0,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,10} + \alpha_{91} x_1 + \cdots + \alpha_{95} x_5 + \alpha_{96}' x_6 + \alpha_{97}' x_7 + \alpha_{98}' x_8),
P(x_9 = (1,0,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,20} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8) - F(\alpha_{9,10} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8),
P(x_9 = (0,1,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,30} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8) - F(\alpha_{9,20} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8),

and P(x9 = (0,0,1)′|x1, x2, …, x8, α9) = 1 − F(α9,30 + α91x1 + ⋯ + α95x5 + α96′x6 + α97′x7 + α98′x8), where F(u) = exp(u)/{1 + exp(u)}, α9,10 ≤ α9,20 ≤ α9,30, α96 = (α96,1, α96,2, α96,3)′, α97 = (α97,1, α97,2, α97,3)′, α98 = (α98,1, α98,2, α98,3, α98,4)′, and α9 = (α9,10, α9,20, α9,30, α91, …, α95, α96′, α97′, α98′)′. We note that α9,10, α9,20, and α9,30 are the three intercepts in the proportional odds logistic regression model. Furthermore, f(x10|x1, x2, …, x8, x9, α10) is taken to be the density of a N(α10,0 + α10,1x1 + ⋯ + α10,5x5 + α10,6′x6 + α10,7′x7 + α10,8′x8 + α10,9′x9, α10,10) distribution, where α10,10 > 0 denotes the variance. The prior for (β, λ) is given by (2.5) and (2.6). In (2.5), we consider several values for a0 such as a0 = 0.1, 0.01, 0.001, and 0.0001, and in (2.6), we take b1 = b2 = 0.001. For the parameters in the models for the missing covariates, an inverse gamma prior with scale and shape parameters equal to 0.001 is specified for α10,10, and independent normal priors, N(0, 1000), are specified for all other parameters. We wish to compare the following 2^10 = 1024 models: no covariates (null model), (x1), …, (x10), (x1, x2), …, (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) (full model). In all computations, the Gibbs sampling algorithm given in Appendix B was used to sample from the posterior distributions, and 10,000 Gibbs samples after a burn-in of 1,000 iterations were used to compute all DIC measures and other posterior estimates. The convergence of the Gibbs sampling algorithm was checked using several diagnostic procedures as recommended by Cowles and Carlin (1996).
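
As an illustration, the proportional odds probabilities for the four cytogenetics categories can be computed as follows (a sketch; the intercepts in the example call are near the posterior estimates reported in Table 5, and eta stands for the shared linear predictor α91x1 + ⋯ + α98′x8):

```python
import numpy as np

# Cumulative logistic probabilities with ordered intercepts, differenced to
# give the probabilities of Poor, InterMed, Normal, Good for x9.
def prop_odds_probs(intercepts, eta):
    F = lambda u: np.exp(u) / (1.0 + np.exp(u))          # logistic cdf F(u)
    cum = np.array([F(a + eta) for a in intercepts])     # cumulative probabilities
    return np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities

print(prop_odds_probs(np.array([-2.16, -0.43, 2.95]), 0.0))  # sums to 1
```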

We first carry out the complete case (CC) analysis of the BMT data. There were n* = 1,774 subjects with all ten covariates completely observed. In the CC analysis, we first perform subset variable selection using the AIC and BIC criteria, since with no missing data, these two criteria can be easily computed. Let Lcc(β(m), λ|Dcc,m) denote the likelihood function given in (3.1) with the completely observed data Dcc,m under model m in a model space that consists of 2^10 possible subset models. Then AIC and BIC are given by

\mathrm{AIC}_m = -2 \log L_{cc}(\hat{\beta}^{(m)}, \hat{\lambda} \mid D_{cc,m}) + 2 p_m,  (5.1)

where β̂(m) and λ̂ are the maximum likelihood estimates of β(m) and λ, pm = km + J, and

\mathrm{BIC}_m = -2 \log L_{cc}(\hat{\beta}^{(m)}, \hat{\lambda} \mid D_{cc,m}) + [\log(n^*)]\, p_m.  (5.2)
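
Both criteria are straightforward to compute once the complete-case likelihood has been maximized; a minimal sketch (our helper, with loglik_hat denoting log Lcc(β̂(m), λ̂|Dcc,m)):

```python
import numpy as np

# (5.1)-(5.2): both criteria penalize the same p_m = k_m + J free parameters
# (k_m regression coefficients plus J hazard levels); n_star is the number of
# complete cases.
def aic_bic(loglik_hat, k_m, J, n_star):
    p_m = k_m + J
    return (-2.0 * loglik_hat + 2.0 * p_m,             # AIC_m
            -2.0 * loglik_hat + np.log(n_star) * p_m)  # BIC_m
```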

Table 1 shows the best three AIC or BIC models for J = 10, 15, and 20. From Table 1, we see that the best three AIC models are (x1, x2, x4, x9), (x1, x2, x4, x5, x9), and (x1, x2, x3, x4, x5, x9), and the best three BIC models are (x1, x2, x9), (x1, x2, x4, x9), and (x1, x2, x5, x9). In Table 1, under the model (x1, x2, x3, x4, x5, x9), the values of pm = km + J = 8 + J are 18, 23, and 28 for J = 10, 15, and 20, respectively, while the values of pm become 15, 20, and 25 for J = 10, 15, and 20, respectively, under the model (x1, x2, x9). Thus, (x1, x2, x9) is the smallest model while (x1, x2, x3, x4, x5, x9) is the largest model among the five models listed in Table 1. We note that the order of the best three models under either AIC or BIC remains the same for J = 10, J = 15, and J = 20. Thus, subset variable selection under both AIC and BIC is robust to the choice of J. We also see from Table 1 that the lowest values of AIC are attained at J = 15 for all five models while the lowest values of BIC are attained at J = 10. This result is expected since BIC favors smaller and more parsimonious models than AIC, due to a larger dimensional penalty imposed by BIC.

Table 1.

Values of AIC and BIC under best three AIC or BIC models for the completely observed BMT data

Model                     | J = 10                | J = 15                | J = 20
                          | AICm    BICm     pm   | AICm    BICm     pm   | AICm    BICm     pm
(x1, x2, x4, x9)          | 3607.09 3694.78  16   | 3595.19 3710.29  21   | 3614.49 3756.99  26
(x1, x2, x4, x5, x9)      | 3607.09 3700.27  17   | 3595.46 3716.05  22   | 3614.63 3762.62  27
(x1, x2, x3, x4, x5, x9)  | 3607.90 3706.56  18   | 3595.70 3721.76  23   | 3615.16 3768.63  28
(x1, x2, x9)              | 3610.60 3692.82  15   | 3598.82 3708.44  20   | 3618.05 3755.07  25
(x1, x2, x5, x9)          | 3610.68 3698.38  16   | 3599.15 3714.25  21   | 3618.26 3760.76  26

For the CC case, we used the direct sampling method to compute the DIC values for all 1024 models under various choices of J and a0. The results based on the best three DIC models under various choices of J and a0 are shown in Table 2. From Table 2, we see that when a0 is small, for example, a0 = 0.001 or a0 = 0.0001, the DIC values are very close to the corresponding values of AIC and the values of pm in DIC are also very close to those in AIC. We also observe that there is a convex pattern in the DICs as functions of J and a0. Specifically, the DIC values with J = 15 are smaller than those with either J = 10 or J = 20 for all the best three models, and, in addition, the DIC values with a0 = 0.001 are smaller than those with a0 = 0.1, a0 = 0.01, and a0 = 0.0001 under these same models, though the DIC values with a0 = 0.001 are close to those with a0 = 0.0001. These results are quite desirable as they empirically show that DIC may be used to guide the choices of J and a0 in achieving the best predictive model performance. In this CC case, among the values of J and a0 being considered, based on the DIC measure, the best choices of J and a0 are J = 15 and a0 = 0.001. In the CC case, we also implemented the single MC sample method discussed in Sect. 4 for computing the DIC measures. Using a Gibbs sample of 10,000 iterations after a burn-in of 1,000 iterations from the posterior distribution under the full model, the Monte Carlo estimates of DICm and pm are 3595.25 and 20.99 for model (x1, x2, x4, x9), 3595.40 and 21.94 for model (x1, x2, x4, x5, x9), and 3595.74 and 22.99 for model (x1, x2, x3, x4, x5, x9). These estimates are very similar to those given in Table 2 using the direct sampling method.

Table 2.

Values of DIC under best three DIC models for the completely observed BMT data

a0 = 0.001:

Model                     | J = 10          | J = 15          | J = 20
                          | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x9)          | 3607.18  16.02  | 3595.26  21.00  | 3614.65  26.05
(x1, x2, x4, x5, x9)      | 3606.96  16.91  | 3595.47  21.97  | 3615.12  26.94
(x1, x2, x3, x4, x5, x9)  | 3607.76  18.43  | 3595.68  22.96  | 3615.31  28.07

J = 15:

Model                     | a0 = 0.1        | a0 = 0.01       | a0 = 0.0001
                          | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x9)          | 3769.45  19.20  | 3599.56  20.73  | 3595.35  21.07
(x1, x2, x4, x5, x9)      | 3769.71  20.21  | 3599.76  21.70  | 3595.62  22.07
(x1, x2, x3, x4, x5, x9)  | 3769.66  21.18  | 3599.82  22.64  | 3595.87  22.89

For the best two DIC models with J = 15 and a0 = 0.001, we also computed the posterior means (Estimates), the posterior standard deviations (SD’s), and 95% highest posterior density (HPD) intervals of the model parameters. The results are shown in Table 3. Under model (x1, x2, x4, x9), all 95% HPD intervals do not contain 0, indicating the importance of all these covariates. The results given in Table 3 indicate that an ALL patient has a higher risk of death compared to an AML patient, an older patient has a higher risk of death, a higher Karnofsky score at pre-transplant leads to a lower risk of death, and a patient with poor cytogenetics is likely to have a high risk of death. Under model (x1, x2, x4, x5, x9), all covariates except for gsource have 95% HPD intervals that do not contain 0.

Table 3.

Posterior estimates of β under best two DIC models with J = 15 and a0 = 0.001 for the completely observed BMT data

Model Variable Estimate SD 95% HPD interval
(x1, x2, x4, x9) Disease 0.164 0.039 (0.086, 0.239)
Age 0.269 0.040 (0.193, 0.350)
Karnofprg −0.086 0.036 (−0.152, −0.013)
Cytoabnew1 −0.400 0.113 (−0.621, −0.185)
Cytoabnew2 −0.586 0.107 (−0.802, −0.382)
Cytoabnew3 −0.638 0.209 (−1.065, −0.241)
(x1, x2, x4, x5, x9) Disease 0.167 0.039 (0.091, 0.246)
Age 0.250 0.042 (0.167, 0.331)
Karnofprg −0.086 0.036 (−0.154, −0.014)
Gsource 0.054 0.041 (−0.027, 0.133)
Cytoabnew1 −0.399 0.111 (−0.626, −0.190)
Cytoabnew2 −0.580 0.108 (−0.786, −0.372)
Cytoabnew3 −0.621 0.207 (−1.037, −0.217)

Next, we carry out an all case (AC) analysis of the BMT data, that is, an analysis incorporating all of the cases. In the AC case, due to the additional complication of modeling the missing covariates, AIC and BIC are computationally infeasible, as discussed earlier; in fact, one could even argue that these measures are not well defined here since the penalty term is not clearly defined. In particular, if we use the likelihood L(β̂(m), λ̂|Dm) averaged over all of the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimensional penalty pm for AIC and BIC. Thus, for the AC case, we used DIC as the criterion for performing variable subset selection. To this end, we used the direct sampling method to compute the DIC values for all 1,024 models under various choices of J and a0. The DIC values for the best three models are presented in Table 4. Note that models (x1, x2, x4, x5, x8, x9) and (x1, x2, x4, x5, x9) are consistently the best and second best models for all the values of J and a0 considered in Table 4, while model (x1, x2, x3, x4, x5, x9) is the third best for most combinations of J and a0 except for (J, a0) = (15, 0.1) and (J, a0) = (10, 0.001). For (J, a0) = (15, 0.1) the third best model is (x1, x2, x3, x4, x5, x9, x10) with DICm = 4973.88 and pm = 24.75, and for (J, a0) = (10, 0.001) the third best model is (x1, x2, x4, x5, x9, x10) with DICm = 4743.10 and pm = 22.46. From Table 4, we also see that the second best DIC model (x1, x2, x4, x5, x9) in the CC analysis remains the second best DIC model in the AC analysis. In the AC analysis, when a0 = 0.001, the values of DICm and pm for the best CC analysis model (x1, x2, x4, x9) become 4744.66 and 20.71 for J = 10, 4731.20 and 25.69 for J = 15, and 4750.68 and 30.80 for J = 20, which are much larger than the corresponding DIC values under the best AC model (x1, x2, x4, x5, x8, x9) and the second best AC model (x1, x2, x4, x5, x9). When a0 = 0.001, the best CC model (x1, x2, x4, x9) is the ninth best AC model for J = 10 and the tenth best model for both J = 15 and J = 20. Interestingly, similar to the CC analysis, the “optimal” choices of J and a0 are J = 15 and a0 = 0.001. Compared to the CC analysis, another noticeable change in the AC analysis is that the values of the dimensional penalty pm are larger than the corresponding values in the CC analysis, which is expected since the dimension of the missing covariates leads to an additional dimensional penalty in pm.

Table 4.

Values of DIC under best three DIC models for the BMT data with all cases

a0 = 0.001:

Model                        | J = 10          | J = 15          | J = 20
                             | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x5, x8, x9)     | 4742.43  25.07  | 4729.11  30.09  | 4748.67  35.09
(x1, x2, x4, x5, x9)         | 4742.52  21.39  | 4729.30  26.48  | 4748.95  31.53
(x1, x2, x3, x4, x5, x9)     | 4743.19  22.40  | 4729.73  27.44  | 4749.27  32.43

J = 15:

Model                        | a0 = 0.1        | a0 = 0.01       | a0 = 0.0001
                             | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x5, x8, x9)     | 4973.47  27.01  | 4735.55  29.61  | 4729.31  30.17
(x1, x2, x4, x5, x9)         | 4973.85  23.28  | 4735.63  25.76  | 4729.33  26.49
(x1, x2, x3, x4, x5, x9)     | 4974.25  24.29  | 4735.87  26.97  | 4729.84  27.57

For the best two DIC models with J = 15 and a0 = 0.001, we also computed the posterior estimates of the model parameters, and the results are shown in Table 5. Under both of the best two DIC models, all covariates except for gsource in the survival model for the time from transplant to death have 95% HPD intervals that do not contain 0. As x9 is the only missing covariate in both models, using (3.2), we only need to model x9 via the proportional odds logistic regression model conditional on the other covariates, namely disease, age, karnofprg, gsource, and prevgvh1 for model (x1, x2, x4, x5, x8, x9), and disease, age, karnofprg, and gsource for model (x1, x2, x4, x5, x9). The corresponding posterior estimates for these two missing covariate models are also shown in Table 5. Under these two models for the missing covariate cytoabnew (x9), we see that all covariates except for karnofprg have 95% HPD intervals that do not contain 0. For the second best model, comparing Table 5 with Table 3 shows that the AC analysis leads to smaller posterior standard deviations and shorter HPD intervals for all parameters in the survival model. In particular, gsource is nearly “significant” in the response model and “significant” in the covariate model in the AC analysis, where significance means that the 95% HPD interval does not contain 0.

Table 5.

Posterior estimates of β and α under the best two DIC models with J = 15 and a0 = 0.001 for the BMT data with all cases

Model Parameters Variable Estimate SD 95% HPD interval
(x1, x2, x4, x5, x8, x9) β Disease 0.130 0.034 (0.062, 0.196)
Age 0.233 0.037 (0.164, 0.308)
Karnofprg −0.090 0.032 (−0.152, −0.028)
Gsource 0.062 0.037 (−0.009, 0.136)
Prevgvh11 0.127 0.063 (0.008, 0.253)
Prevgvh12 0.062 0.071 (−0.071, 0.207)
Prevgvh13 0.014 0.041 (−0.067, 0.093)
Prevgvh14 0.010 0.044 (−0.076, 0.096)
Cytoabnew1 −0.424 0.109 (−0.633, −0.206)
Cytoabnew2 −0.614 0.103 (−0.812, −0.405)
Cytoabnew3 −0.726 0.205 (−1.150, −0.342)
α Intercept 1 −2.158 0.071 (−2.302, −2.024)
Intercept 2 −0.428 0.046 (−0.518, −0.338)
Intercept 3 2.950 0.097 (2.754, 3.135)
Disease 0.363 0.043 (0.275, 0.444)
Age 0.105 0.047 (0.011, 0.194)
Karnofprg −0.073 0.043 (−0.155, 0.011)
Gsource 0.144 0.046 (0.056, 0.238)
Prevgvh11 −0.021 0.073 (−0.159, 0.127)
Prevgvh12 −0.213 0.080 (−0.367, −0.056)
Prevgvh13 0.036 0.049 (−0.058, 0.134)
Prevgvh14 −0.136 0.054 (−0.245, −0.032)
(x1, x2, x4, x5, x9) β Disease 0.127 0.034 (0.060, 0.195)
Age 0.235 0.037 (0.163, 0.308)
Karnofprg −0.088 0.031 (−0.151, −0.027)
Gsource 0.068 0.036 (−0.004, 0.135)
Cytoabnew1 −0.423 0.108 (−0.623, −0.200)
Cytoabnew2 −0.618 0.102 (−0.809, −0.409)
Cytoabnew3 −0.743 0.204 (−1.140, −0.351)
α Intercept 1 −2.136 0.070 (−2.277, −2.003)
Intercept 2 −0.420 0.046 (−0.508, −0.329)
Intercept 3 2.928 0.096 (2.743, 3.116)
Disease 0.353 0.043 (0.269, 0.437)
Age 0.113 0.046 (0.018, 0.198)
Karnofprg −0.069 0.042 (−0.150, 0.014)
Gsource 0.149 0.045 (0.063, 0.240)

6 Discussion

We have proposed a joint semi-conjugate prior for the regression coefficients β and the piecewise hazard parameters λ and examined its theoretical properties in the piecewise exponential model for right censored survival data. The proposed prior is quite attractive in the context of variable subset selection for survival data with missing covariates. It is proper, and the functional form of the prior is immediately determined for all models once the functional form of the prior is written for the full model. In addition, the prior is completely specified by only one hyperparameter, namely a0. This indeed makes the elicitation of priors for all models in the model space much easier; otherwise, prior elicitation would be an enormous task. In addition, we have empirically shown that the DIC measure can be used to guide the choice of a0 to achieve the best posterior predictive performance. In Sect. 5, for the BMT data, we see that the best model for the AC analysis is different from the one based on a CC analysis. This empirical result demonstrates that one cannot do variable selection based only on the completely observed cases. In fact, it is important to use all cases in performing variable selection.

Our computational methods in this paper are intended for situations where the number of models in the model space can be enumerated, so with this in mind, our proposed procedure works best when the number of covariates is 9–15. We have considered two Monte Carlo methods for computing the DIC measures. The direct sampling method is easy to implement. However, care needs to be taken in monitoring convergence of the Gibbs sampling algorithm for each model in the model space. On the other hand, the single MC sample method requires only one Gibbs sample from the posterior distribution under the full model. Thus, one needs to monitor convergence of the Gibbs sampling algorithm only once. However, in this case, one needs to construct a “good” weight function w(γq(m)|γq(m)) to obtain an efficient single MC sample method. The choice of w(γq(m)|γq(m)) proposed in Sect. 4 works well. However, it requires finding the conditional posterior modes, which may be computationally expensive. Finding a less efficient but less computationally expensive weight function is an important future project, which is currently under investigation. We note that both Monte Carlo methods can be easily implemented using multiple computers. Thus, a parallel computing system can greatly speed up the computation of the DIC measures for variable selection. With a Linux cluster, the proposed computational procedure can work well when the number of covariates is up to 20.

Another important criterion used in model assessment is Bayesian Model Averaging (BMA). Since we have focused this paper on variable selection and selecting a set of top models, we have not addressed the issue of BMA at all, as this is an entirely different topic with different inferential goals and different computational strategies. The performance of the proposed semi-conjugate priors in the presence of MAR covariates and the effects of covariates such as sexmatch within the BMA context will be explored in future work.

Acknowledgements

The authors wish to thank Dr. Mei-Jie Zhang for providing the BMT data. The authors also wish to thank the Editor-in-Chief, the Editor, and a referee for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.

Appendix A: proof of Theorem 2.1

Observe that the marginal prior of (λ1, β) is of the form

\pi(\lambda_1, \beta \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\, \frac{1}{\lambda_1}.

Let β0 = log λ1. We have

\pi(\beta_0, \beta \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\}.  (A.1)

Since Xobs* is of full rank, X* = (1, X) is of full rank. It is easy to show that π(β0, β|y0, X, a0) is log-concave in (β0, β′)′. This implies that π(β0, β|y0, X, a0) has a unique mode. Set

\frac{\partial}{\partial (\beta_0, \beta')'} \log \pi(\beta_0, \beta \mid y_0, X, a_0) = a_0 \sum_{i=1}^{n} \big[1 - y_0 \exp(\beta_0 + x_i'\beta)\big] (1, x_i')' = 0.  (A.2)

Thus, (−log y0, 0, …, 0)′ is the unique solution of (A.2). This implies that (−log y0, 0, …, 0)′ is the unique prior mode of (log λ1, β).

For (ii), it suffices to prove that π(β0, β|y0, X, a0) in (A.1) is proper, since π0(λ2, …, λJ) is a proper prior. We write

\pi^*(\beta_0, \beta \mid y_0, X, a_0) = \prod_{i=1}^{n} \{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\}.

It is easy to observe that

\{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\} \le K_0

for i = 1, 2, …, n, where K0 > 0 is a constant. Since Xobs* is of full rank, there exist i1 < i2 < ⋯ < ik+1 such that xi1, xi2, …, xik+1 are completely observed and the (k+1) × (k+1) matrix X**, whose gth row is (1, xig′) for g = 1, 2, …, k+1, is of full rank. Let u = (u1, u2, …, uk+1)′. Taking the one-to-one transformation u = X**(β0, β′)′ leads to

\int \pi^*(\beta_0, \beta \mid y_0, X, a_0)\, d\beta_0\, d\beta \le K_0^{\,n-k-1} \int \prod_{g=1}^{k+1} \{\exp(\beta_0 + x_{i_g}'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_{i_g}'\beta)\}\, d\beta_0\, d\beta = K_1 \prod_{g=1}^{k+1} \int \exp(a_0 u_g) \exp\{-a_0 y_0 \exp(u_g)\}\, du_g < \infty,  (A.3)

which completes the proof of Theorem 2.1.

Appendix B: posterior sampling

In this appendix, we discuss how to sample from the posterior distribution under the full model given in (2.10). To this end, we propose a Gibbs sampling algorithm, which requires sampling from the following full conditional distributions in turn:

  (i) [β|λ, xmis, a0, Dobs];

  (ii) [λ|β, xmis, a0, Dobs];

  (iii) [xmis|β, λ, α, a0, Dobs];

  (iv) [α|xmis, a0, Dobs].

We briefly discuss how we sample from each of the above full conditional distributions. For (i), the full conditional density of β given λ, xmis, a0, and Dobs is of the form

\pi(\beta \mid \lambda, x_{mis}, a_0, D_{obs}) \propto \prod_{i=1}^{n} \exp\{(\nu_i + a_0) x_i'\beta - a_0 y_0 \lambda_1 \exp(x_i'\beta)\} \times \exp\Big[-\sum_{j=1}^{J} \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big].

It is easy to show that π(β|λ, xmis, a0, Dobs) is log-concave in β. Thus, we can sample the βj's via the adaptive rejection algorithm of Gilks and Wild (1992). For (ii), given β and xmis, λ1, λ2, …, λJ are conditionally independent. Let h_{ij} = \delta_{ij}(y_i - s_{j-1}) + \sum_{g=j+1}^{J} \delta_{ig}(s_j - s_{j-1}). Then, we have

\lambda_1 \sim \mathrm{Gamma}\Big(n a_0 + \sum_{i=1}^{n} \delta_{i1}\nu_i,\ \sum_{i=1}^{n} (a_0 y_0 + h_{i1}) \exp(x_i'\beta)\Big),  (B.1)

and

\lambda_j \sim \mathrm{Gamma}\Big(b_1 + \sum_{i=1}^{n} \delta_{ij}\nu_i,\ b_2 + \sum_{i=1}^{n} h_{ij} \exp(x_i'\beta)\Big),  (B.2)

for j = 2, …, J. Hence, sampling the λj from (B.1) and (B.2) is straightforward.
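
A hedged sketch of these two gamma updates (with our own array conventions: h is the n × J matrix of exposures hij, delta_nu holds the products δijνi, and exp_eta holds exp(xi′β) at the current β and imputed covariates):

```python
import numpy as np

rng = np.random.default_rng(2008)

def sample_lambda(h, delta_nu, exp_eta, a0, y0, b1, b2):
    n, J = h.shape
    lam = np.empty(J)
    # (B.1): lambda_1 absorbs the semi-conjugate prior contribution (a0, y0)
    lam[0] = rng.gamma(n * a0 + delta_nu[:, 0].sum(),
                       1.0 / np.sum((a0 * y0 + h[:, 0]) * exp_eta))  # numpy uses scale
    # (B.2): lambda_j for j >= 2, with the Gamma(b1, b2) pieces of pi_0(lambda)
    for j in range(1, J):
        lam[j] = rng.gamma(b1 + delta_nu[:, j].sum(),
                           1.0 / (b2 + np.sum(h[:, j] * exp_eta)))
    return lam
```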

For (iii), given β, λ, and α, the x2i,mis’s are conditionally independent, and the conditional distribution for x2i,mis is

\pi(x_{2i,mis} \mid \beta, \lambda, \alpha, a_0, D_{obs}) \propto f(x_{2i} \mid x_{1i}, \alpha) \exp\{(\nu_i + a_0) x_i'\beta\} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\} \exp\Big[-\sum_{j=1}^{J} \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big].

Thus, the conditional distribution of x2i,mis depends on the form of f(x2i|x1i, α). In Sect. 5, for the BMT data, f(x2i|x1i, α) is a product of a proportional odds logistic density and a normal density, and hence sampling x2i,mis is relatively straightforward. In fact, the conditional distribution for (cytoabnew1, cytoabnew2, cytoabnew3) is multinomial, while the conditional distribution for log(wbcdx) is log-concave and can be sampled via the adaptive rejection algorithm of Gilks and Wild (1992). For (iv), the full conditional distribution is π(α|xmis, a0, Dobs) ∝ ∏_{i=1}^{n} f(x2i|x1i, α) π(α). For various covariate distributions specified through a series of one-dimensional conditional distributions, sampling α is straightforward. For example, in Sect. 5, the full conditional distribution for each component of α9 is log-concave, and hence we can sample the α9j's via the adaptive rejection algorithm of Gilks and Wild (1992), and the full conditional distributions for the components of α10 are either normal or inverse gamma, which are easy to sample from.
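
Putting the four steps together, the outer Gibbs loop has the following schematic form; the four update functions are placeholders to be supplied by the user (e.g., ARS-type updates for β and the missing covariates, and the gamma draws above for λ), not library routines:

```python
import numpy as np

# A schematic Gibbs loop over the full conditionals (i)-(iv); `state` maps
# parameter names to their current values (assumed numpy arrays).
def gibbs(n_iter, state, update_beta, update_lambda, update_x_mis, update_alpha):
    draws = []
    for _ in range(n_iter):
        state["beta"] = update_beta(state)      # step (i)
        state["lam"] = update_lambda(state)     # step (ii)
        state["x_mis"] = update_x_mis(state)    # step (iii)
        state["alpha"] = update_alpha(state)    # step (iv)
        draws.append({k: np.array(v, copy=True) for k, v in state.items()})
    return draws
```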

Contributor Information

Joseph G. Ibrahim, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail: ibrahim@bios.unc.edu.

Ming-Hui Chen, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA, e-mail: mhchen@stat.uconn.edu.

Sungduk Kim, Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, NIH, Rockville, MD 20852, USA, e-mail: kims2@mail.nih.gov.

References

  1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors. International symposium on information theory. Budapest: Akademia Kiado; 1973. pp. 267–281.
  2. Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian variable selection and prediction. J R Stat Soc B. 1998;60:627–641.
  3. Brown PJ, Vannucci M, Fearn T. Bayes model averaging with selection of regressors. J R Stat Soc B. 2002;64:519–536.
  4. Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models (with discussion). Bayesian Anal. 2006;1:651–674.
  5. Chen MH, Ibrahim JG. Conjugate priors for generalized linear models. Stat Sinica. 2003;13:461–476.
  6. Chen MH, Ibrahim JG, Yiannoutsos C. Prior elicitation, variable selection, and Bayesian computation for logistic regression models. J R Stat Soc B. 1999;61:223–242.
  7. Chen MH, Ibrahim JG, Shao QM, Weiss RE. Prior elicitation for model selection and estimation in generalized linear mixed models. J Stat Plan Inference. 2003;111:57–76.
  8. Chen MH, Dey DK, Ibrahim JG. Bayesian criterion based model assessment for categorical data. Biometrika. 2004;91:45–63.
  9. Chen MH, Huang L, Ibrahim JG, Kim S. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 2008;3:585–614. doi: 10.1214/08-BA323.
  10. Chipman HA, George EI, McCulloch RE. Bayesian CART model search (with discussion). J Am Stat Assoc. 1998;93:935–960.
  11. Chipman HA, George EI, McCulloch RE. The practical implementation of Bayesian model selection (with discussion). In: Lahiri P, editor. Model selection. Beachwood: Institute of Mathematical Statistics; 2001. pp. 63–134.
  12. Chipman HA, George EI, McCulloch RE. Bayesian treed generalized linear models (with discussion). In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian statistics, vol 7. Oxford: Oxford University Press; 2003. pp. 85–103.
  13. Clyde M. Bayesian model averaging and model search strategies (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics, vol 6. Oxford: Oxford University Press; 1999. pp. 157–185.
  14. Clyde M, George EI. Model uncertainty. Stat Sci. 2004;19:81–94.
  15. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. J Am Stat Assoc. 1996;91:883–904.
  16. Dellaportas P, Forster JJ. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika. 1999;86:615–633.
  17. Dey DK, Chen MH, Chang H. Bayesian approach for the nonlinear random effects models. Biometrics. 1997;53:1239–1252.
  18. Geisser S, Eddy W. A predictive approach to model selection. J Am Stat Assoc. 1979;74:153–160.
  19. Gelfand AE, Dey DK. Bayesian model choice: asymptotics and exact calculations. J R Stat Soc B. 1994;56:501–514.
  20. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics, vol 4. Oxford: Oxford University Press; 1992. pp. 147–167.
  21. Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika. 1998;85:1–13.
  22. Gelman A, Meng XL, Stern HS. Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Stat Sinica. 1996;6:733–807.
  23. George EI. The variable selection problem. J Am Stat Assoc. 2000;95:1304–1308.
  24. George EI, Foster DP. Calibration and empirical Bayes variable selection. Biometrika. 2000;87:731–747.
  25. George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88:881–889.
  26. George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sinica. 1997;7:339–374.
  27. George EI, McCulloch RE, Tsay R. Two approaches to Bayesian model selection with applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian analysis in statistics and econometrics: essays in honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348.
  28. Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. J R Stat Soc C (Appl Stat). 1992;41:337–348.
  29. Hanson TE. Inference for mixtures of finite Polya tree models. J Am Stat Assoc. 2006;101:1548–1565.
  30. Huang L, Chen MH, Ibrahim JG. Bayesian analysis for generalized linear models with nonignorably missing covariates. Biometrics. 2005;61:767–780. doi: 10.1111/j.1541-0420.2005.00338.x.
  31. Ibrahim JG, Chen MH. Power prior distributions for regression models. Stat Sci. 2000;15:46–60.
  32. Ibrahim JG, Laud PW. A predictive approach to the analysis of designed experiments. J Am Stat Assoc. 1994;89:309–319.
  33. Ibrahim JG, Chen MH, MacEachern SN. Bayesian variable selection for proportional hazards models. Can J Stat. 1999a;27:701–717.
  34. Ibrahim JG, Lipsitz SR, Chen MH. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. J R Stat Soc B. 1999b;61:173–190.
  35. Ibrahim JG, Chen MH, Ryan LM. Bayesian variable selection for time series count data. Stat Sinica. 2000;10:971–987.
  36. Ibrahim JG, Chen MH, Sinha D. Bayesian survival analysis. New York: Springer-Verlag; 2001a.
  37. Ibrahim JG, Chen MH, Sinha D. Criterion based methods for Bayesian model assessment. Stat Sinica. 2001b;11:419–443.
  38. Ibrahim JG, Chen MH, Lipsitz SR, Herring AH. Missing data methods in regression models. J Am Stat Assoc. 2005;100:332–346.
  39. Kim S, Chen MH, Dey DK, Gamerman D. Bayesian dynamic models for survival data with a cure fraction. Lifetime Data Anal. 2007;13:17–35. doi: 10.1007/s10985-006-9028-7.
  40. Laud PW, Ibrahim JG. Predictive model selection. J R Stat Soc B. 1995;57:247–262.
  41. Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika. 1996;83:916–922.
  42. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd edn. New York: Wiley; 2002.
  43. Ntzoufras I, Dellaportas P, Forster JJ. Bayesian variable and link determination for generalised linear models. J Stat Plan Inference. 2003;111:165–180.
  44. Raftery AE. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika. 1996;83:251–266.
  45. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997;92:179–191.
  46. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
  47. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
  48. Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. J Econom. 1996;75:317–343.
  49. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion). J R Stat Soc B. 2002;64:583–639.
