Published in final edited form as: Lifetime Data Anal. 2008 Oct 3;14(4):496–520. doi: 10.1007/s10985-008-9101-5

Bayesian variable selection for the Cox regression model with missing covariates

Joseph G. Ibrahim, Ming-Hui Chen, Sungduk Kim

Abstract

In this paper, we develop Bayesian methodology and computational algorithms for variable subset selection in Cox proportional hazards models with missing covariate data. A new joint semi-conjugate prior for the piecewise exponential model is proposed in the presence of missing covariates and its properties are examined. The covariates are assumed to be missing at random (MAR). Under this new prior, a version of the Deviance Information Criterion (DIC) is proposed for Bayesian variable subset selection in the presence of missing covariates. Monte Carlo methods are developed for computing the DICs for all possible subset models in the model space. A Bone Marrow Transplant (BMT) dataset is used to illustrate the proposed methodology.

Keywords: Conjugate prior, Deviance information criterion, Missing at random, Proportional hazards models

1 Introduction

Bayesian variable selection in survival analysis is still one of the most challenging problems encountered in practice due to issues regarding (i) prior elicitation, (ii) evaluation of a model selection criterion due to the complication of censoring, and (iii) numerical computation of the criterion for all possible models in the model space. In the context of survival analysis, these issues have been discussed in Ibrahim et al. (1999a, 2001a) and the many references therein. There have been numerous papers in the statistical literature on Bayesian variable selection and model comparison, including articles by George and McCulloch (1993, 1997); Laud and Ibrahim (1995); George et al. (1996); Raftery (1996); Smith and Kohn (1996); Raftery et al. (1997); Brown et al. (1998, 2002); Clyde (1999); Chen et al. (1999, 2003, 2008); Dellaportas and Forster (1999); Chipman et al. (1998, 2001, 2003); George (2000); George and Foster (2000); Ibrahim et al. (2000); Ntzoufras et al. (2003) and Clyde and George (2004). However, the literature on Bayesian variable selection in the presence of missing data, and in particular for survival data in the presence of missing covariates, is still quite sparse. Part of the reason for this is that in the presence of missing covariate data, models can become quite complex and closed forms are not available even in the simplest of models. Thus, computing quantities such as Bayes factors, posterior model probabilities, the Akaike Information Criterion (AIC) (Akaike 1973), the Bayesian Information Criterion (BIC) (Schwarz 1978), and the Deviance Information Criterion (DIC) (Spiegelhalter et al. 2002) becomes a serious computational challenge. For example, to compute BIC in the presence of missing covariate data, one would need to maximize the observed data likelihood. There are two challenging issues with this: (i) the observed data likelihood does not have a closed form for most models, even the linear model when the covariates are not normally distributed, and a suitable approximation is often not available, and (ii) maximizing the observed data likelihood can be a huge challenge even if it is available in closed form. There are also several technical issues for computing AIC and BIC in the presence of missing covariates. One could argue that these measures are not well defined in the context of missing covariate data since the penalty term is not clearly defined. In particular, if we use the observed data likelihood obtained by averaging over the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimensional penalty for AIC and BIC. We elaborate more on this issue in Sect. 5.

This issue becomes even more complex when computing Bayes factors, as one has to integrate over a very large space and the integrals easily become of very high dimension even in the simplest missing data problems. Specifically, it is well known that for methods based on Bayes factors or posterior model probabilities, proper prior distributions are needed. It is a major task to specify prior distributions for all models in the model space, especially if the model space is large. For survival models with missing covariates, it becomes even more challenging to specify prior distributions, as in this case one needs to specify priors not only for the regression coefficients in the survival model, but also for the parameters involved in the models for the missing covariates. The prior elicitation issue has been discussed in detail by several authors including Laud and Ibrahim (1995); Chen et al. (1999) and Ibrahim and Chen (2000). In addition, it is well known that Bayes factors and posterior model probabilities are generally sensitive to the choices of prior hyperparameters, and thus one cannot simply select vague proper priors to get around the elicitation issue. Even when informative prior distributions are available, computing Bayes factors and posterior model probabilities is difficult and expensive, as one needs to compute prior and posterior normalizing constants for each model in the model space. It may be practically infeasible to compute these quantities in the context of variable subset selection for survival models with missing data. Alternatively, criterion based methods can be attractive in the sense that they generally do not require proper prior distributions, and thus have an advantage over posterior model probabilities in this sense. Several recent papers advocating the use of Bayesian criteria for model assessment include Geisser and Eddy (1979); Gelfand et al. (1992); Gelfand and Dey (1994); Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelman et al. (1996); Dey et al. (1997); Gelfand and Ghosh (1998); Ibrahim et al. (2001b); Spiegelhalter et al. (2002); Chen et al. (2004); Huang et al. (2005); Hanson (2006); Celeux et al. (2006) and Kim et al. (2007).

To overcome some of the methodological and computational issues mentioned above, we develop two methodologies in this paper: (i) a class of semi-conjugate priors in the presence of MAR covariate data, and (ii) a variation of DIC for survival models with missing covariates. The proposed class of priors overcomes the elicitation issues mentioned above as well as the computational challenges. The proposed priors make elicitation easier than other conventional informative priors by basing the elicitation on a prediction for observable quantities rather than on the parameters themselves, along with a scalar quantifying the confidence in that prediction. This is an especially attractive approach in variable selection contexts since the regression coefficients for every model in the model space have a different contextual meaning and interpretation, and thus specifying hyperparameters for all of the models in the model space is a monumental task. This elicitation challenge can be overcome by focusing on constructing a prior based on a prediction for the response variable, as pointed out by Laud and Ibrahim (1995) and Chen and Ibrahim (2003). The proposed priors are also computationally attractive in that they lead to full conditionals that are log-concave and hence easily sampled via Adaptive Rejection Sampling (ARS) (Gilks and Wild 1992) within Gibbs. Thus, sampling the posterior with these priors is computationally very efficient.

The proposed version of DIC is an extension of a version of DIC discussed in Huang et al. (2005) for generalized linear models with missing covariates. For survival data with censored observations and missing covariates, DIC has a computational advantage over other criterion-based methods, such as AIC or BIC. With the computational methods developed in Sect. 4, the DIC measures can be easily computed for all models in the model space for a moderate number of covariates. In contrast, computation of AIC or BIC becomes quite difficult and challenging for variable subset selection for survival data with censored observations and missing covariates.

The rest of this paper is organized as follows. Section 2 presents a detailed development of the semi-conjugate prior under the piecewise exponential model in the presence of MAR covariates. Section 3 sets up all necessary formulas for the survival models, priors, and posteriors in the context of variable subset selection and presents a novel version of DIC for survival data with missing covariates. Section 4 presents the computational algorithms for computing the DIC measures for all models in the model space. A detailed analysis of the BMT data is given in Sect. 5. We conclude the article with brief remarks in Sect. 6.

2 The model, prior and posterior

2.1 The model

Let yi denote the minimum of the censoring time Ci and the survival time Ti, and let xi = (xi1, …, xik)′ be the k × 1 vector of covariates associated with yi for the ith subject. Denote by β = (β1, …, βk)′ the k × 1 vector of regression coefficients. Also, νi = 1{Ti = yi} is the indicator for the event for i = 1, 2, …, n, where n is the total number of observations. As usual, we assume throughout that xi does not include an intercept, since the intercept is not estimable in the Cox proportional hazards model, and that given xi, Ti and Ci are independent. In the presence of missing covariates, the missing data mechanism is defined as the distribution of the k × 1 random vector ri = (ri1, ri2, …, rik)′, where rij = 0 when xij is missing and rij = 1 when xij is observed for i = 1, 2, …, n and j = 1, 2, …, k. We assume that any missingness in covariates xij is missing at random (MAR) (Rubin 1976; Little and Rubin 2002). As discussed in Ibrahim et al. (2005), for MAR covariates xij we do not need to model the missing data mechanism.

We consider the Cox proportional piecewise exponential hazards model for [yi|xi], which has the survival function given by

S(y_i \mid x_i, \beta, \lambda) = \exp\{-\exp(x_i'\beta)\, H_0(y_i \mid \lambda)\},  (2.1)

where H0(t|λ) is the baseline cumulative hazard function. The piecewise exponential model is assumed for the baseline hazard function h0(t). Specifically, we first partition the time axis into J intervals: (s0, s1], (s1, s2], …, (sJ−1, sJ], where s0 = 0 < s1 < s2 < ⋯ < sJ. In practice, it is sufficient to choose sJ to be greater than the largest follow-up time. We then assume a constant hazard λj over the jth interval Ij = (sj−1, sj]. That is, h0(y) = λj if y ∈ Ij for j = 1, 2, …, J. Then the corresponding baseline cumulative hazard function, H0(y|λ), is given by

H_0(y \mid \lambda) = \lambda_j (y - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})  (2.2)

for sj−1 < y ≤ sj, where λ = (λ1, …, λJ)′. We note that when J = 1, H0(y|λ) reduces to the parametric exponential model.
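
As a concrete illustration (a sketch in Python, not the authors' code), the baseline cumulative hazard (2.2) and the survival function (2.1) can be evaluated as follows; the cut points s and hazard levels lam are made-up values, and y is assumed to lie in (0, sJ].

```python
import numpy as np

# Illustrative cut points s_0 = 0 < s_1 < ... < s_J and hazard levels lambda_j;
# these values are invented for the sketch, not taken from the paper.
s = np.array([0.0, 1.0, 2.5, 5.0, 12.0])   # J = 4 intervals
lam = np.array([0.30, 0.22, 0.15, 0.10])   # constant hazard on each interval

def H0(y, s, lam):
    """Baseline cumulative hazard (2.2) at a scalar time 0 < y <= s_J."""
    j = np.searchsorted(s, y, side="left")      # interval index: s_{j-1} < y <= s_j
    widths = np.diff(s)                         # interval widths s_g - s_{g-1}
    return lam[j - 1] * (y - s[j - 1]) + np.dot(lam[: j - 1], widths[: j - 1])

def survival(y, x, beta, s, lam):
    """Survival function (2.1): S(y | x) = exp{-exp(x'beta) H0(y)}."""
    return np.exp(-np.exp(x @ beta) * H0(y, s, lam))

# e.g., survival probability at y = 3.0 for a subject with two covariates
print(survival(3.0, np.array([0.5, -1.0]), np.array([0.2, 0.4]), s, lam))
```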

We write xi = (x1i′, x2i′)′, where x1i is a k1 × 1 vector of covariates that are observed for all n observations, x2i is a k2 × 1 vector of covariates that have at least one missing value in the n observations, and k1 + k2 = k with k1 ≥ 0 and k2 ≥ 1. Furthermore, we let x2i,mis denote the vector of covariates that are missing for the ith case and let x2i,obs be the vector of covariates that are observed for the ith case. Let D = {(yi, νi, x1i, x2i,mis, x2i,obs), i = 1, 2, …, n} denote the complete data. Then, the complete data likelihood function is given by

L(\beta, \lambda \mid D) = \prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i'\beta)\}^{\delta_{ij}\nu_i} \times \exp\Big[-\delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big],  (2.3)

where δij = 1 if the ith subject failed or was right censored in the jth interval (sj−1, sj], and 0 otherwise.
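
Continuing the sketch above (and reusing H0, s, and lam), the complete-data log-likelihood implied by (2.3) collapses to a single term per subject, since δij selects the one interval containing yi:

```python
# A minimal sketch of log L(beta, lambda | D) corresponding to (2.3): for each
# subject, nu_i [log lambda_{j(i)} + x_i'beta] - exp(x_i'beta) H0(y_i | lambda),
# where j(i) is the interval containing y_i.
def log_likelihood(y, nu, X, beta, s, lam):
    total = 0.0
    for yi, vi, xi in zip(y, nu, X):
        eta = xi @ beta                          # linear predictor x_i'beta
        j = np.searchsorted(s, yi, side="left")  # interval with s_{j-1} < y_i <= s_j
        total += vi * (np.log(lam[j - 1]) + eta) - np.exp(eta) * H0(yi, s, lam)
    return total
```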

2.2 Prior and posterior

We first specify a prior distribution for (β, λ). To this end, we extend the conjugate prior proposed by Chen and Ibrahim (2003) for the generalized linear model (GLM) to the piecewise exponential model in (2.1). Let X denote the n × k matrix with its ith row equal to xi′. Given X, we propose a semi-conjugate prior as follows:

\pi(\beta, \lambda \mid y_0, X, a_0) \propto \Big(\prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i'\beta)\}^{a_0 \delta_{0ij}} \times \exp\Big[-a_0 \delta_{0ij} \exp(x_i'\beta)\Big\{\lambda_j (y_{0i} - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big]\Big)\, \pi_0(\lambda),  (2.4)

where a0 > 0 is a scalar prior parameter, y0 = (y01, …, y0n)′ is an n × 1 vector of prior parameters, δ0ij = 1 if sj−1 < y0isj and 0 otherwise, and π0(λ) is an initial prior for λ.

The prior (2.4) is called semi-conjugate since, by ignoring π0(λ), the prior has an identical form as the complete data likelihood given in (2.3). As discussed in Chen and Ibrahim (2003), y0i can be viewed as a prior prediction for the marginal mean of yi. Since y0i is the prior prediction of yi, we assume that y0i is an “observed” failure time. Thus, in eliciting y0, we must focus on a prediction (or guess) for E(yi), which narrows the possibilities for choosing y0i. To obtain a noninformative prior for (β, λ), we specify all the y0i equal. As shown in Chen and Ibrahim (2003), this specification under the GLM yields a prior in which the prior modes of the slopes in the regression model are the same. For the piecewise exponential model, we consider y01 = ⋯ = y0n = y0, where 0 < y0s1. Under this specification of y0, (2.4) reduces to

\pi(\beta, \lambda \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\, \pi_0(\lambda).  (2.5)

We further specify π0(λ) as follows

\pi_0(\lambda) \propto \frac{1}{\lambda_1} \prod_{j=2}^{J} \lambda_j^{b_1 - 1} \exp(-b_2 \lambda_j),  (2.6)

where b1 > 0 and b2 > 0. Note that in (2.5), we assume an improper uniform initial prior for β and an improper Jeffreys-type initial prior for λ1. Thus, π0(λ) introduced in (2.4) and further specified in (2.6) is an improper prior. However, under some mild conditions, the prior (2.5) is proper and (log λ1, β) has a prior mode of (−log y0, 0, …, 0)′. We formally characterize these properties in the following theorem.

Theorem 2.1

Let Xobs denote the submatrix of X whose rows consist of the completely observed xi′'s, and let xmis = (x2i,mis, i = 1, 2, …, n). Also, let Xobs* = (1, Xobs). Assume that Xobs* is of full rank k + 1 and that π0(λ) is given by (2.6). Then, for any given xmis, (i) (log λ1, β′)′ has a unique prior mode of (−log y0, 0, …, 0)′ and (ii) π(β, λ|y0, X, a0) is proper.

The proof of Theorem 2.1 is given in Appendix A. We note that the conditions of Theorem 2.1 require at least k + 1 complete observations with linearly independent covariate vectors including an intercept. From Theorem 2.1, we see that when y01 = ⋯ = y0n = y0, the prior mode of β is 0 and with this prior prediction for the yi, both β and λ1 are identifiable in the sense that the joint prior is proper. Note that if we assume a general gamma prior instead of a Jeffreys-type prior for λ1 in (2.6), we can show that the prior mode of β is still 0, but the prior mode of log λ1 is no longer −log y0. Thus, a different specification of the initial prior for λ1 only changes the prior mode of the “intercept” in the survival model. Although we assume y0 ≤ s1 in Theorem 2.1, we can show that the prior mode of β is still 0 even when y0 > s1. This is intuitively appealing since, in this case, the prior prediction y0i does not depend on the ith subject’s specific covariate information. We further note that the parameter a0 in (2.4) or (2.5) can be generally viewed as a precision parameter that quantifies the strength or confidence of our prior belief in y0. From Theorem 2.1, we see that the prior mode of β does not depend on a0. Thus, a0 controls only the prior precision of β. This is an attractive feature that allows us to do sensitivity analyses by varying a0 in the prior.

Next, we specify the distribution for the missing covariates. Since we are primarily interested in inferences about β, we only need to model x2i since x1i is observed for all n observations. Therefore, we model x2i conditioning on the completely observed covariates x1i throughout. Using a sequence of one-dimensional conditional distributions proposed by Lipsitz and Ibrahim (1996) and Ibrahim et al. (1999b), we specify the distribution of the k2-dimensional covariate vector x2i = (x2i1, x2i2, …, x2ik2)′ as

f(x_{2i} \mid x_{1i}, \alpha) = f(x_{2i1} \mid x_{1i}, \alpha_1)\, f(x_{2i2} \mid x_{2i1}, x_{1i}, \alpha_2) \cdots f(x_{2ik_2} \mid x_{2i,k_2-1}, \ldots, x_{2i1}, x_{1i}, \alpha_{k_2}),  (2.7)

where αl is a vector of parameters for the lth conditional distribution, the αl's are distinct, and moreover, α = (α1′, α2′, …, αk2′)′. To complete the prior specification, we take independent priors for α1, …, αk2 so that

\pi(\alpha) = \prod_{l=1}^{k_2} \pi(\alpha_l).  (2.8)
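
To make the factorization (2.7) concrete, the following toy sketch (assumed forms, not the BMT specification of Sect. 5) evaluates the joint density of k2 = 2 missing covariates as a logistic regression density times a conditional normal density:

```python
import numpy as np
from scipy.stats import norm

# Assumed parameterization: alpha1 = (intercept, slopes on x_1i) for a binary
# x_2i1; alpha2 = (intercept, slope on x_2i1, slopes on x_1i, variance) for a
# normal x_2i2. These forms are illustrative choices only.
def f_x2_given_x1(x21, x22, x1, alpha1, alpha2):
    p = 1.0 / (1.0 + np.exp(-(alpha1[0] + alpha1[1:] @ x1)))
    f1 = p if x21 == 1 else 1.0 - p                       # f(x_2i1 | x_1i, alpha_1)
    mu = alpha2[0] + alpha2[1] * x21 + alpha2[2:-1] @ x1  # conditional mean
    f2 = norm.pdf(x22, loc=mu, scale=np.sqrt(alpha2[-1])) # f(x_2i2 | x_2i1, x_1i, alpha_2)
    return f1 * f2

# e.g., with a single fully observed covariate x_1i = (0.5,)
print(f_x2_given_x1(1, 0.3, np.array([0.5]),
                    np.array([0.1, 0.2]), np.array([0.0, 0.5, -0.3, 1.0])))
```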

Let xobs = ((x1i′, x2i,obs′), i = 1, 2, …, n). Using (2.5)–(2.8), the joint prior for β, λ, xmis, and α is given by

\pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, x_{obs}) \propto \Big[\prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\Big] \times \Big[\prod_{i=1}^{n} f(x_{2i} \mid x_{1i}, \alpha)\Big]\, \pi_0(\lambda)\, \pi(\alpha).  (2.9)

Let Dobs = (y, ν, xobs) denote the completely observed data, where y = (y1, y2, …, yn)′ and ν = (ν1, ν2, …, νn)′. Then, the joint posterior distribution is given by

\pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, D_{obs}) \propto L(\beta, \lambda \mid D)\, \pi(\beta, \lambda, x_{mis}, \alpha \mid y_0, a_0, x_{obs}),  (2.10)

where L(β, λ|D) and π (β, λ, xmis, α|y0, a0, xobs) are given by (2.3) and (2.9), respectively. Although the posterior distribution in (2.10) is analytically intractable, a Gibbs sampling algorithm can be easily developed to sample from this posterior distribution. The implementational details of the Gibbs sampling algorithm are discussed in Appendix B.

3 Bayesian variable subset selection

Let ℳ denote the model space. We enumerate the models in ℳ by m = 1, 2, …, 𝒦, where 𝒦 is the dimension of ℳ and model 𝒦 denotes the full model. Also, let β(𝒦) = (β1, β2, …, βk)′ denote the regression coefficients for the full model, and let xi(m) and β(m) denote the km × 1 vectors of covariates and regression coefficients for model m under a specific choice of km covariates. We write xi = (xi(m)′, xi(−m)′)′ and β(𝒦) = (β(m)′, β(−m)′)′, where xi(−m) is xi with xi(m) deleted, and β(−m) is β(𝒦) with β(m) deleted. We also write x1i = (x1i(m)′, x1i(−m)′)′ and x2i = (x2i(m)′, x2i(−m)′)′, where x1i(m) is a k1m (≤ k1) dimensional vector, x2i(m) is a k2m (≤ k2) dimensional vector, and x1i(−m) and x2i(−m) are x1i and x2i with x1i(m) and x2i(m) deleted, respectively. Furthermore, we write x2i,mis = (x2i,mis(m)′, x2i,mis(−m)′)′ and x2i,obs = (x2i,obs(m)′, x2i,obs(−m)′)′, where x2i,mis(−m) and x2i,obs(−m) are x2i,mis and x2i,obs with x2i,mis(m) and x2i,obs(m) deleted, respectively.

Under model m, let Dm = {(yi, νi, x1i(m), x2i,mis(m), x2i,obs(m)), i = 1, 2, …, n} denote the complete data. Then the complete data likelihood function is given by

L(\beta^{(m)}, \lambda \mid D_m) = \prod_{i=1}^{n} \prod_{j=1}^{J} \{\lambda_j \exp(x_i^{(m)\prime} \beta^{(m)})\}^{\delta_{ij}\nu_i} \times \exp\Big[-\delta_{ij} \exp\{x_i^{(m)\prime} \beta^{(m)}\}\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big],  (3.1)

where δij is defined in (2.3). Using exactly the same ordering of the sequence of one-dimensional conditional distributions for the covariates x2i in (2.7), with x2i(−m) deleted, we specify the distribution of the k2m-dimensional covariate vector x2i(m) = (x2i1(m), x2i2(m), …, x2ik2m(m))′ as

f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)}) = f(x_{2i1}^{(m)} \mid x_{1i}^{(m)}, \alpha_1^{(m)})\, f(x_{2i2}^{(m)} \mid x_{2i1}^{(m)}, x_{1i}^{(m)}, \alpha_2^{(m)}) \times \cdots \times f(x_{2ik_{2m}}^{(m)} \mid x_{2i,k_{2m}-1}^{(m)}, \ldots, x_{2i1}^{(m)}, x_{1i}^{(m)}, \alpha_{k_{2m}}^{(m)}),  (3.2)

where α(m) = (α1(m)′, α2(m)′, …, αk2m(m)′)′. It is important to note that in (3.2), α(m) is a subvector of α in (2.7). We further write α = (α(m)′, α(−m)′)′, where α(−m) is α with α(m) deleted. Similar to (2.8), the prior for α(m) is specified as π(α(m)) = ∏_{l=1}^{k2m} π(αl(m)).

Let xobs(m) = ((x1i(m)′, x2i,obs(m)′), i = 1, 2, …, n) and xmis(m) = (x2i,mis(m), i = 1, 2, …, n). By applying the semi-conjugate prior (2.5) to model m, we have the joint prior for β(m), λ, xmis(m), and α(m) given by

\pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, x_{obs}^{(m)}) \propto \Big(\prod_{i=1}^{n} [\lambda_1 \exp\{x_i^{(m)\prime} \beta^{(m)}\}]^{a_0} \exp[-a_0 y_0 \lambda_1 \exp\{x_i^{(m)\prime} \beta^{(m)}\}]\Big) \times \Big[\prod_{i=1}^{n} f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)})\Big]\, \pi_0(\lambda)\, \pi(\alpha^{(m)}),  (3.3)

where π0(λ) and f(x2i(m)|x1i(m), α(m)) are defined by (2.6) and (3.2), respectively. Note that all models in the model space share the same prior for λ. Let Dm,obs = (y, ν, xobs(m)) denote the completely observed data. Under model m, the joint posterior distribution is given by

\pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, D_{m,obs}) \propto L(\beta^{(m)}, \lambda \mid D_m) \times \pi(\beta^{(m)}, \lambda, x_{mis}^{(m)}, \alpha^{(m)} \mid y_0, a_0, x_{obs}^{(m)}),  (3.4)

where L(β(m), λ|Dm) and π(β(m),λ,xmis(m),α(m)|y0,a0,xobs(m)) are given by (3.1) and (3.3), respectively.

We carry out Bayesian variable selection via DIC, originally proposed by Spiegelhalter et al. (2002). The use of DIC for missing data models has been discussed in detail in Celeux et al. (2006). Let θ(m) = (β(m)′, λ′, xmis(m)′)′. DIC is defined as follows:

\mathrm{DIC}_m = \mathrm{Dev}_m(\bar{\theta}^{(m)}) + 2 p_m,  (3.5)

where Devm(θ(m)) is a deviance function and θ̄(m) is the posterior mean of θ(m). In (3.5), pm is the effective number of model parameters, which is calculated as

p_m = \overline{\mathrm{Dev}}_m(\theta^{(m)}) - \mathrm{Dev}_m(\bar{\theta}^{(m)}),  (3.6)

where

\overline{\mathrm{Dev}}_m(\theta^{(m)}) = E[\mathrm{Dev}_m(\theta^{(m)}) \mid D_{m,obs}]  (3.7)

and the expectation is taken with respect to the posterior distribution given in (3.4). Since we are primarily interested in inferences about the survival model, we define the deviance function, Devm(θ(m)) in (3.5) as follows:

\mathrm{Dev}_m(\theta^{(m)}) = -2 \log L(\beta^{(m)}, \lambda \mid D_m),

where L(β(m), λ|Dm) is given by (3.1). Following Huang et al. (2005), we compute Dev_m(θ̄(m)) as

\mathrm{Dev}_m(\bar{\theta}^{(m)}) = -2 \sum_{i=1}^{n} \sum_{j=1}^{J} \Big( \delta_{ij}\nu_i \big[\log E[\lambda_j \mid D_{m,obs}] + E[x_i^{(m)\prime} \beta^{(m)} \mid D_{m,obs}]\big] - \delta_{ij} \exp\big\{E[x_i^{(m)\prime} \beta^{(m)} \mid D_{m,obs}]\big\} \Big\{E[\lambda_j \mid D_{m,obs}](y_i - s_{j-1}) + \sum_{g=1}^{j-1} E[\lambda_g \mid D_{m,obs}](s_g - s_{g-1})\Big\} \Big),  (3.8)

where all expectations are taken with respect to the posterior distribution in (3.4). In (3.8), instead of computing (E[xi(m)|Dm,obs])′E[β(m)|Dm,obs], we compute E[xi(m)′β(m)|Dm,obs] in the presence of missing covariates, which yields a more appropriate dimensional penalty term pm.

The DIC defined above is a Bayesian measure of predictive model performance, which is decomposed into a measure of fit and a measure of model complexity (pm). The smaller the value of DIC, the better the model will predict new observations generated in the same way as the data. As discussed and shown in Chen et al. (2008), the performance of DIC is similar to AIC. Moreover, the DIC defined in (3.5) has a nice computational property for Bayesian variable selection, which will be discussed in detail in the next section.
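
For concreteness, the following hedged sketch (our own illustration, with assumed array layouts) computes DICm and pm in (3.5)–(3.8) from Q posterior draws; lam_draws is Q × J, and eta_draws is Q × n with eta_draws[q, i] the linear predictor xi(m)′β(m) at draw q, evaluated with the imputed missing covariates as (3.8) requires:

```python
import numpy as np

def log_lik_pw(y, nu, s, lam, eta):
    """Log of (3.1) with the linear predictors eta supplied directly."""
    j = np.searchsorted(s, y, side="left")                   # interval index per subject
    cumH = np.concatenate(([0.0], np.cumsum(lam * np.diff(s))))
    H = cumH[j - 1] + lam[j - 1] * (y - s[j - 1])            # H0(y_i | lambda)
    return np.sum(nu * (np.log(lam[j - 1]) + eta) - np.exp(eta) * H)

def dic(y, nu, s, lam_draws, eta_draws):
    # posterior mean of Dev_m(theta^(m)), i.e. (3.7), by Monte Carlo averaging
    devs = np.array([-2.0 * log_lik_pw(y, nu, s, lam_draws[q], eta_draws[q])
                     for q in range(lam_draws.shape[0])])
    dev_bar = devs.mean()
    # Dev_m at the posterior means of lambda_j and of x'beta, i.e. (3.8)
    dev_at_mean = -2.0 * log_lik_pw(y, nu, s,
                                    lam_draws.mean(axis=0), eta_draws.mean(axis=0))
    p_m = dev_bar - dev_at_mean                              # effective parameters (3.6)
    return dev_at_mean + 2.0 * p_m, p_m                      # DIC_m in (3.5)
```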

4 Computation of DIC measures

To carry out Bayesian variable selection, we need to compute DICm in (3.5) for m = 1, 2, …, 𝒦. Due to the complexity of the survival model in (3.1), analytical evaluation of DICm does not appear possible. Thus, a Monte Carlo (MC) method is needed to compute all DICm’s in the model space. To this end, we propose two approaches for computing the DICm’s. The first approach, called “the direct sampling method”, is based on direct Monte Carlo samples from each model in the model space. The second approach is “the single MC sample method,” which was proposed by Chen et al. (2008). The latter method requires only one Markov chain Monte Carlo (MCMC) sample from the posterior distribution under the full model and computes the Bayesian criterion simultaneously for all possible subset models in the model space. From (3.7) and (3.8), we observe that for DICm in (3.5), we need to compute the following quantities: (i) E[Devm(θ(m))|Dm,obs]; (ii) E[xi(m)′β(m)|Dm,obs]; and (iii) E[λj|Dm,obs] for j = 1, 2, …, J. For (ii), we note that when xi(m) is completely observed, E[xi(m)′β(m)|Dm,obs] = xi(m)′E[β(m)|Dm,obs]. Thus, for (ii), we may further consider (iia) E[β(m)|Dm,obs] and (iib) E[xi(m)′β(m)|Dm,obs] with at least one missing covariate in xi(m). It is interesting to observe that there is a common feature among (i), (iia), (iib), and (iii): all of these quantities can be written as

g_m = E[g(\theta^{(m)}) \mid D_{m,obs}],  (4.1)

for various functions g, where θ(m) = (β(m)′, λ′, xmis(m)′)′ and the expectation is taken with respect to the joint posterior distribution in (3.4) under model m.

First, we discuss the direct sampling method. Using the Gibbs sampling algorithm given in Appendix B, we generate a Monte Carlo sample {θq(m), q = 1, 2, …, Q} from the joint posterior distribution in (3.4) under model m. Then, a Monte Carlo estimate of gm is given by

\hat{g}_m = \frac{1}{Q} \sum_{q=1}^{Q} g(\theta_q^{(m)})  (4.2)

for all g’s listed in (i)–(iii). Then, plugging various ĝm’s in (3.5) gives a Monte Carlo estimate of DICm.

Next, we discuss the single MC sample method. Using the notation given in Sect. 3, we write γ = (β′, λ′, xmis′, α′)′, γ(m) = (β(m)′, λ′, xmis(m)′, α(m)′)′, and γ(−m) = (β(−m)′, xmis(−m)′, α(−m)′)′, where xmis(−m) is xmis with xmis(m) deleted and γ(−m) is γ with γ(m) deleted. Thus, the marginal likelihood under model m is given by

C_m = \int \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})\, d\gamma^{(m)},  (4.3)

where

\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) = L(\beta^{(m)}, \lambda \mid D_m) \Big(\prod_{i=1}^{n} [\lambda_1 \exp\{x_i^{(m)\prime}\beta^{(m)}\}]^{a_0} \exp[-a_0 y_0 \lambda_1 \exp\{x_i^{(m)\prime}\beta^{(m)}\}]\Big) \times \Big[\prod_{i=1}^{n} f(x_{2i}^{(m)} \mid x_{1i}^{(m)}, \alpha^{(m)})\Big]\, \pi_0^*(\lambda)\, \pi(\alpha^{(m)}),  (4.4)

π0*(λ) = (1/λ1) ∏_{j=2}^{J} λj^{b1−1} exp(−b2λj), and L(β(m), λ|Dm), π0(λ), and f(x2i(m)|x1i(m), α(m)) are defined by (3.1), (2.6), and (3.2), respectively. Then, for a given function g, we have

g_m = E[g(\theta^{(m)}) \mid D_{m,obs}] = \int g(\theta^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{C_m}\, d\gamma^{(m)},  (4.5)

where Cm is defined in (4.3).

For any given function g such that E[|g(θ(m))||Dm,obs] < ∞, we have the following identity

g_m = \frac{C_{\mathcal{K}}}{C_m}\, E\Big[ g(\theta^{(m)})\, w(\gamma^{(-m)} \mid \gamma^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{\pi^*(\gamma \mid y_0, a_0, D_{obs})} \,\Big|\, D_{obs} \Big],  (4.6)

where C𝒦 is the marginal likelihood given in (4.3) under the full model, π*(γ|y0, a0, Dobs) is given in (4.4) corresponding to the full model, which is essentially the kernel of the joint posterior distribution in (2.10), and the expectation is taken with respect to the joint posterior distribution in (2.10) under the full model. In (4.6), w(γ(−m)|γ(m)) is a completely known conditional density whose support is contained in, or equal to, the support of the conditional density of γ(−m) given γ(m) with respect to the joint posterior distribution in (2.10) under the full model.

Observe that as a special case of (4.1), we have gm = 1 when g ≡ 1. Using this result, we further obtain that

\frac{C_m}{C_{\mathcal{K}}} = E\Big[ w(\gamma^{(-m)} \mid \gamma^{(m)})\, \frac{\pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs})}{\pi^*(\gamma \mid y_0, a_0, D_{obs})} \,\Big|\, D_{obs} \Big].  (4.7)

Using (4.6) and (4.7), we have

g_m = \frac{E\big[ g(\theta^{(m)})\, w(\gamma^{(-m)} \mid \gamma^{(m)})\, \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma \mid y_0, a_0, D_{obs}) \mid D_{obs} \big]}{E\big[ w(\gamma^{(-m)} \mid \gamma^{(m)})\, \pi^*(\gamma^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma \mid y_0, a_0, D_{obs}) \mid D_{obs} \big]}.  (4.8)

We note that since the dimension of λ does not change across models, π0*(λ) cancels out in the ratio π*(γ(m)|y0, a0, Dm,obs)/π*(γ|y0, a0, Dobs).

Let {γq = (βq′, λq′, xmis,q′, αq′)′, q = 1, 2, …, Q} denote an MCMC sample of size Q from the joint posterior distribution (2.10) under the full model. Write γq = (γq(m)′, γq(−m)′)′, where γq(m) = (βq(m)′, λq′, xmis,q(m)′, αq(m)′)′ and γq(−m) = (βq(−m)′, xmis,q(−m)′, αq(−m)′)′. Also let θq(m) = (βq(m)′, λq′, xmis,q(m)′)′. Then, an MC estimate of gm is given by

\hat{g}_m = \frac{\sum_{q=1}^{Q} g(\theta_q^{(m)})\, w(\gamma_q^{(-m)} \mid \gamma_q^{(m)})\, \pi^*(\gamma_q^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma_q \mid y_0, a_0, D_{obs})}{\sum_{q=1}^{Q} w(\gamma_q^{(-m)} \mid \gamma_q^{(m)})\, \pi^*(\gamma_q^{(m)} \mid y_0, a_0, D_{m,obs}) / \pi^*(\gamma_q \mid y_0, a_0, D_{obs})}.  (4.9)

Under certain regularity conditions, such as ergodicity, we have

\lim_{Q \to \infty} \hat{g}_m = g_m,

implying that ĝm is consistent.
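
Schematically, the estimator (4.9) can be coded as below; g, w, log_kern_m, and log_kern_full are user-supplied callables for the quantities defined above (our names, not the paper's), and the ratios are computed on the log scale for numerical stability:

```python
import numpy as np

# A sketch of the single-MC-sample estimator (4.9): one set of draws from the
# full-model posterior is reweighted to estimate g_m under any submodel m.
def g_hat_single_sample(draws, g, w, log_kern_m, log_kern_full):
    log_r = np.array([np.log(w(gam)) + log_kern_m(gam) - log_kern_full(gam)
                      for gam in draws])
    r = np.exp(log_r - log_r.max())        # stabilized importance ratios
    g_vals = np.array([g(gam) for gam in draws])
    return np.sum(g_vals * r) / np.sum(r)  # ratio estimator (4.9)
```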

As shown in Chen et al. (2008), the optimal choice of w(γ(−m)|γ(m)) is the conditional posterior distribution of γ(−m) given γ(m) under the full model, in the sense that ĝm achieves the minimum asymptotic variance. However, the optimal choice of w(γ(−m)|γ(m)) is not computationally feasible. Thus, we propose the following weight function

w(\gamma^{(-m)} \mid \gamma^{(m)}) = w(\beta^{(-m)} \mid \beta^{(m)}, \lambda, x_{mis})\, w(\alpha^{(-m)} \mid \alpha^{(m)}, x_{mis})\, w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}).  (4.10)

Note that in (4.10), when model m includes all missing covariates x2i, we do not need to compute w(xmis(−m)|xmis(m), α(m)) since in this case xmis(−m) is a null vector in the sense that it has zero dimension. In (4.10), a good w(β(−m)|β(m), λ, xmis), which is close to the optimal choice, can be constructed based on an asymptotic approximation to the joint posterior. Let β̂(−m)(β(m), λ, xmis) denote the conditional posterior mode of β(−m) given β(m), λ, and xmis under the full model. Specifically, we first compute

\hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis}) = \arg\max_{\beta^{(-m)}} \log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs}),  (4.11)

where

\log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs}) = \sum_{i=1}^{n} \sum_{j=1}^{J} \Big[ \delta_{ij}\nu_i x_i'\beta - \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\} \Big] + \sum_{i=1}^{n} \big[ a_0 x_i'\beta - a_0 y_0 \lambda_1 \exp(x_i'\beta) \big]  (4.12)

and then compute

\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis}) = \Big[ -\frac{\partial^2 \log \pi^*(\beta \mid \lambda, x_{mis}, y_0, a_0, D_{obs})}{\partial \beta^{(-m)} \partial \beta^{(-m)\prime}} \Big|_{\beta^{(-m)} = \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})} \Big]^{-1}.

Thus, a good w(β(−m) | β(m), λ, xmis) can be constructed as follows:

w(\beta^{(-m)} \mid \beta^{(m)}, \lambda, x_{mis}) = (2\pi)^{-\frac{k-k_m}{2}} |\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \big(\beta^{(-m)} - \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big)' \big[\hat{\Sigma}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big]^{-1} \big(\beta^{(-m)} - \hat{\beta}^{(-m)}(\beta^{(m)}, \lambda, x_{mis})\big) \Big\},  (4.13)

which approximates the joint conditional posterior π(β(−m)|β(m), λ, xmis, y0, a0, Dobs) under the full model. Similarly, we can construct a good w(α(−m)|α(m), xmis) in (4.10). For w(xmis(−m)|xmis(m), α(m)), we use a Monte Carlo estimate given by

w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}) = \frac{1}{Q} \sum_{q=1}^{Q} w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}, \alpha_q^{(-m)}),  (4.14)

where

w(x_{mis}^{(-m)} \mid x_{mis}^{(m)}, \alpha^{(m)}, \alpha_q^{(-m)}) \propto \prod_{i=1}^{n} f(x_{2i} \mid x_{1i}, \alpha^{(m)}, \alpha_q^{(-m)}),

and f(x2i|x1i, α(m), αq(−m)) is given by (2.7) under the full model.
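
A hedged sketch of constructing the weight (4.13) is given below: the conditional mode (4.11) is found numerically, and the BFGS inverse Hessian is used as a convenient stand-in for Σ̂(−m) rather than the exact second-derivative matrix; neg_log_kernel is assumed to return −log π*(β|λ, xmis, y0, a0, Dobs) as a function of β(−m), with (β(m), λ, xmis) held fixed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def make_normal_weight(neg_log_kernel, dim_minus):
    # beta_hat^(-m) via numerical maximization of the log kernel (4.11)
    fit = minimize(neg_log_kernel, np.zeros(dim_minus), method="BFGS")
    # BFGS inverse Hessian approximates Sigma_hat^(-m) (a sketch; one may
    # instead compute and invert the exact Hessian of (4.12))
    mode, cov = fit.x, fit.hess_inv
    return lambda b_minus: multivariate_normal.pdf(b_minus, mean=mode, cov=cov)
```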

5 Analysis of the BMT data

The BMT data set consists of n = 2397 cases who received an HLA-identical sibling transplant from 1995 to 2004 for AML or ALL in CR1 (pre-transplant status = 1st complete remission) with graft source of BM or PB/PB+BM. Infants (age < 2 years old) were excluded. The outcome variable, yi in years, was the time from transplant to death or end of follow-up, and νi denotes the censoring indicator, which equals 1 if the ith subject died and 0 otherwise. The median follow-up was 5.1 years with an interquartile range of 3.0 to 7.8 years. There were 904 deaths in the data set. We consider ten covariates: disease (disease type: AML, ALL), age, yeartx (transplant year), karnofprg (Karnofsky score at pre-transplant), gsource (graft type: BM, PB/PB+BM), sexmatch (donor-patient sex match: MM, MF, FM, FF), regimprg (conditioning regimen: CY+TBI±oth, TBI + other, Busulf + CY ± oth, Other/Unknown), prevgvh1 (GVHD prophylaxis: mtx ± other, csa ± other, mtx + csa ± other, tdep ± other, Other/Unknown), cytoabnew (cytogenetics: Poor, InterMed, Normal, Good), and wbcdx (WBC at diagnosis (10^9/l)). The covariates age, yeartx, karnofprg, and wbcdx are continuous, and the covariates disease and gsource are binary. We code sexmatch through the dummy variables sexmatch1, sexmatch2, and sexmatch3, where (sexmatch1, sexmatch2, sexmatch3) takes values (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1), which correspond to MM, MF, FM, and FF, respectively. In exactly the same fashion, we code regimprg, prevgvh1, and cytoabnew as (regimprg1, regimprg2, regimprg3), (prevgvh11, prevgvh12, prevgvh13, prevgvh14), and (cytoabnew1, cytoabnew2, cytoabnew3). For instance, the values (0,0,0), (1,0,0), (0,1,0), and (0,0,1) for (cytoabnew1, cytoabnew2, cytoabnew3) correspond to Poor, InterMed, Normal, and Good for cytoabnew, respectively.

Let x1 = disease, x2 = age, x3 = yeartx, x4 = karnofprg, x5 = gsource, x6 = (sexmatch1, sexmatch2, sexmatch3)′, x7 = (regimprg1, regimprg2, regimprg3)′, x8 = (prevgvh11, prevgvh12, prevgvh13, prevgvh14)′, x9 = (cytoabnew1, cytoabnew2, cytoabnew3)′, and x10 = log(wbcdx). For these 10 covariates, x1, x2, …, x8 were completely observed for all cases, and x9 and x10 had missing information. There were 488 (20.36%) individuals with cytogenetics (x9) missing, 230 (9.6%) individuals with WBC missing, and 96 individuals with both cytogenetics and WBC missing. Overall, there were 623 (25.99%) individuals with at least one covariate missing. We assume that the missing covariates are MAR. In all computations, we standardized all completely observed covariates.

For the BMT data, we fit the piecewise exponential model given by (2.1) and (2.2) for the outcome variable yi, where sj is chosen to be the (j/J)th quantile of the failure times yi, for j = 1, 2, …, J − 1. Since x1, x2, …, x8 are always observed, they do not need to be modeled, and thus we condition on those covariates throughout. We then use a proportional odds logistic regression model for x9 and a normal regression model for x10. Specifically, under the full model with all ten covariates, f(x9|x1, x2, …, x8, α9) is specified as follows:

P(x_9 = (0,0,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,10} + \alpha_{91} x_1 + \cdots + \alpha_{95} x_5 + \alpha_{96}' x_6 + \alpha_{97}' x_7 + \alpha_{98}' x_8),
P(x_9 = (1,0,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,20} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8) - F(\alpha_{9,10} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8),
P(x_9 = (0,1,0)' \mid x_1, x_2, \ldots, x_8, \alpha_9) = F(\alpha_{9,30} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8) - F(\alpha_{9,20} + \alpha_{91} x_1 + \cdots + \alpha_{98}' x_8),

and P(x9 = (0,0,1)′|x1, x2, …, x8, α9) = 1 − F(α9,30 + α91x1 + ⋯ + α95x5 + α96′x6 + α97′x7 + α98′x8), where F(u) = exp(u)/{1 + exp(u)}, α9,10 ≤ α9,20 ≤ α9,30, α96 = (α96,1, α96,2, α96,3)′, α97 = (α97,1, α97,2, α97,3)′, α98 = (α98,1, α98,2, α98,3, α98,4)′, and α9 = (α9,10, α9,20, α9,30, α91, …, α95, α96′, α97′, α98′)′. We note that α9,10, α9,20, and α9,30 are the three intercepts in the proportional odds logistic regression model. Furthermore, f(x10|x1, x2, …, x8, x9, α10) is taken to be the density of a N(α10,0 + α10,1x1 + ⋯ + α10,5x5 + α10,6′x6 + α10,7′x7 + α10,8′x8 + α10,9′x9, α10,10) distribution, where α10,10 > 0 denotes the variance. The prior for (β, λ) is given by (2.5) and (2.6). In (2.5), we consider several values for a0 such as a0 = 0.1, 0.01, 0.001, and 0.0001, and in (2.6), we take b1 = b2 = 0.001. For the parameters in the models for the missing covariates, an inverse gamma prior with scale and shape parameters equal to 0.001 is specified for α10,10, and independent normal priors, N(0, 1000), are specified for all other parameters. We wish to compare the following 2^10 = 1024 models: no covariates (null model), (x1), …, (x10), (x1, x2), …, (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) (full model). In all computations, the Gibbs sampling algorithm given in Appendix B was used to sample from the posterior distributions, and 10,000 Gibbs samples after a burn-in of 1,000 iterations were used to compute all DIC measures and other posterior estimates. The convergence of the Gibbs sampling algorithm was checked using several diagnostic procedures as recommended by Cowles and Carlin (1996).
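
As an illustration, the proportional odds probabilities for the four cytogenetics categories can be computed as follows (a sketch; the intercepts in the example call are near the posterior estimates reported in Table 5, and eta stands for the shared linear predictor α91x1 + ⋯ + α98′x8):

```python
import numpy as np

# Cumulative logistic probabilities with ordered intercepts, differenced to
# give the probabilities of Poor, InterMed, Normal, Good for x9.
def prop_odds_probs(intercepts, eta):
    F = lambda u: np.exp(u) / (1.0 + np.exp(u))          # logistic cdf F(u)
    cum = np.array([F(a + eta) for a in intercepts])     # cumulative probabilities
    return np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities

print(prop_odds_probs(np.array([-2.16, -0.43, 2.95]), 0.0))  # sums to 1
```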

We first carry out the complete case (CC) analysis of the BMT data. There were n* = 1,774 subjects with all ten covariates completely observed. In the CC analysis, we first perform subset variable selection using the AIC and BIC criteria, since with no missing data, these two criteria can be easily computed. Let Lcc(β(m), λ|Dcc,m) denote the likelihood function given in (3.1) with the completely observed data Dcc,m under model m in a model space that consists of 2^10 possible subset models. Then AIC and BIC are given by

\mathrm{AIC}_m = -2 \log L_{cc}(\hat{\beta}^{(m)}, \hat{\lambda} \mid D_{cc,m}) + 2 p_m,  (5.1)

where β̂(m) and λ̂ are the maximum likelihood estimates of β(m) and λ, pm = km + J, and

\mathrm{BIC}_m = -2 \log L_{cc}(\hat{\beta}^{(m)}, \hat{\lambda} \mid D_{cc,m}) + [\log(n^*)]\, p_m.  (5.2)
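
Both criteria are straightforward to compute once the complete-case likelihood has been maximized; a minimal sketch (our helper, with loglik_hat denoting log Lcc(β̂(m), λ̂|Dcc,m)):

```python
import numpy as np

# (5.1)-(5.2): both criteria penalize the same p_m = k_m + J free parameters
# (k_m regression coefficients plus J hazard levels); n_star is the number of
# complete cases.
def aic_bic(loglik_hat, k_m, J, n_star):
    p_m = k_m + J
    return (-2.0 * loglik_hat + 2.0 * p_m,             # AIC_m
            -2.0 * loglik_hat + np.log(n_star) * p_m)  # BIC_m
```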

Table 1 shows the best three AIC or BIC models for J = 10, 15, and 20. From Table 1, we see that the best three AIC models are (x1, x2, x4, x9), (x1, x2, x4, x5, x9), and (x1, x2, x3, x4, x5, x9), and the best three BIC models are (x1, x2, x9), (x1, x2, x4, x9), and (x1, x2, x5, x9). In Table 1, under the model (x1, x2, x3, x4, x5, x9), the values of pm = km + J = 8 + J are 18, 23, and 28 for J = 10, 15, and 20, respectively, while the values of pm become 15, 20, and 25 for J = 10, 15, and 20, respectively, under the model (x1, x2, x9). Thus, (x1, x2, x9) is the smallest model while (x1, x2, x3, x4, x5, x9) is the largest model among the five models listed in Table 1. We note that the order of the best three models under either AIC or BIC remains the same for J = 10, J = 15, and J = 20. Thus, subset variable selection under both AIC and BIC is robust to the choice of J. We also see from Table 1 that the lowest values of AIC are attained at J = 15 for all five models while the lowest values of BIC are attained at J = 10. This result is expected since BIC favors smaller and more parsimonious models than AIC, due to a larger dimensional penalty imposed by BIC.

Table 1.

Values of AIC and BIC under best three AIC or BIC models for the completely observed BMT data

Model                     | J = 10                | J = 15                | J = 20
                          | AICm    BICm     pm   | AICm    BICm     pm   | AICm    BICm     pm
(x1, x2, x4, x9)          | 3607.09 3694.78  16   | 3595.19 3710.29  21   | 3614.49 3756.99  26
(x1, x2, x4, x5, x9)      | 3607.09 3700.27  17   | 3595.46 3716.05  22   | 3614.63 3762.62  27
(x1, x2, x3, x4, x5, x9)  | 3607.90 3706.56  18   | 3595.70 3721.76  23   | 3615.16 3768.63  28
(x1, x2, x9)              | 3610.60 3692.82  15   | 3598.82 3708.44  20   | 3618.05 3755.07  25
(x1, x2, x5, x9)          | 3610.68 3698.38  16   | 3599.15 3714.25  21   | 3618.26 3760.76  26

For the CC case, we used the direct sampling method to compute the DIC values for all 1024 models under various choices of J and a0. The results based on the best three DIC models under various choices of J and a0 are shown in Table 2. From Table 2, we see that when a0 is small, for example, a0 = 0.001 or a0 = 0.0001, the DIC values are very close to the corresponding values of AIC and the values of pm in DIC are also very close to those in AIC. We also observe that there is a convex pattern in the DICs as functions of J and a0. Specifically, the DIC values with J = 15 are smaller than those with either J = 10 or J = 20 for all the best three models, and, in addition, the DIC values with a0 = 0.001 are smaller than those with a0 = 0.1, a0 = 0.01, and a0 = 0.0001 under these same models, though the DIC values with a0 = 0.001 are close to those with a0 = 0.0001. These results are quite desirable as they empirically show that DIC may be used to guide the choices of J and a0 in achieving the best predictive model performance. In this CC case, among the values of J and a0 being considered, based on the DIC measure, the best choices of J and a0 are J = 15 and a0 = 0.001. In the CC case, we also implemented the single MC sample method discussed in Sect. 4 for computing the DIC measures. Using a Gibbs sample of 10,000 iterations after a burn-in of 1,000 iterations from the posterior distribution under the full model, the Monte Carlo estimates of DICm and pm are 3595.25 and 20.99 for model (x1, x2, x4, x9), 3595.40 and 21.94 for model (x1, x2, x4, x5, x9), and 3595.74 and 22.99 for model (x1, x2, x3, x4, x5, x9). These estimates are very similar to those given in Table 2 using the direct sampling method.

Table 2.

Values of DIC under best three DIC models for the completely observed BMT data

a0 = 0.001:

Model                     | J = 10          | J = 15          | J = 20
                          | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x9)          | 3607.18  16.02  | 3595.26  21.00  | 3614.65  26.05
(x1, x2, x4, x5, x9)      | 3606.96  16.91  | 3595.47  21.97  | 3615.12  26.94
(x1, x2, x3, x4, x5, x9)  | 3607.76  18.43  | 3595.68  22.96  | 3615.31  28.07

J = 15:

Model                     | a0 = 0.1        | a0 = 0.01       | a0 = 0.0001
                          | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x9)          | 3769.45  19.20  | 3599.56  20.73  | 3595.35  21.07
(x1, x2, x4, x5, x9)      | 3769.71  20.21  | 3599.76  21.70  | 3595.62  22.07
(x1, x2, x3, x4, x5, x9)  | 3769.66  21.18  | 3599.82  22.64  | 3595.87  22.89

For the best two DIC models with J = 15 and a0 = 0.001, we also computed the posterior means (Estimates), the posterior standard deviations (SD’s), and 95% highest posterior density (HPD) intervals of the model parameters. The results are shown in Table 3. Under model (x1, x2, x4, x9), all 95% HPD intervals do not contain 0, indicating the importance of all these covariates. The results given in Table 3 indicate that an ALL patient has a higher risk of death compared to an AML patient, an older patient has a higher risk of death, a higher Karnofsky score at pre-transplant leads to a lower risk of death, and a patient with poor cytogenetics is likely to have a high risk of death. Under model (x1, x2, x4, x5, x9), all covariates except for gsource have 95% HPD intervals that do not contain 0.

Table 3.

Posterior estimates of β under best two DIC models with J = 15 and a0 = 0.001 for the completely observed BMT data

Model Variable Estimate SD 95% HPD interval
(x1, x2, x4, x9) Disease 0.164 0.039 (0.086, 0.239)
Age 0.269 0.040 (0.193, 0.350)
Karnofprg −0.086 0.036 (−0.152, −0.013)
Cytoabnew1 −0.400 0.113 (−0.621, −0.185)
Cytoabnew2 −0.586 0.107 (−0.802, −0.382)
Cytoabnew3 −0.638 0.209 (−1.065, −0.241)
(x1, x2, x4, x5, x9) Disease 0.167 0.039 (0.091, 0.246)
Age 0.250 0.042 (0.167, 0.331)
Karnofprg −0.086 0.036 (−0.154, −0.014)
Gsource 0.054 0.041 (−0.027, 0.133)
Cytoabnew1 −0.399 0.111 (−0.626, −0.190)
Cytoabnew2 −0.580 0.108 (−0.786, −0.372)
Cytoabnew3 −0.621 0.207 (−1.037, −0.217)

Next, we carry out an all case (AC) analysis of the BMT data, that is, an analysis incorporating all of the cases. In the AC case, due to the additional complication of modeling the missing covariates, AIC and BIC are computationally infeasible, as discussed earlier; in fact, one could even argue that these measures are not well defined here since the penalty term is not clearly defined. In particular, if we use the likelihood L(β̂(m), λ̂|Dm) averaged over all of the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimensional penalty pm for AIC and BIC. Thus, for the AC case, we used DIC as the criterion for performing variable subset selection. To this end, we used the direct sampling method to compute the DIC values for all 1,024 models under various choices of J and a0. The DIC values for the best three models are presented in Table 4. Note that models (x1, x2, x4, x5, x8, x9) and (x1, x2, x4, x5, x9) are consistently the best and second best models for all the values of J and a0 considered in Table 4, while model (x1, x2, x3, x4, x5, x9) is the third best for most combinations of J and a0 except for (J, a0) = (15, 0.1) and (J, a0) = (10, 0.001). For (J, a0) = (15, 0.1) the third best model is (x1, x2, x3, x4, x5, x9, x10) with DICm = 4973.88 and pm = 24.75, and for (J, a0) = (10, 0.001) the third best model is (x1, x2, x4, x5, x9, x10) with DICm = 4743.10 and pm = 22.46. From Table 4, we also see that the second best DIC model (x1, x2, x4, x5, x9) in the CC analysis remains the second best DIC model in the AC analysis. In the AC analysis, when a0 = 0.001, the values of DICm and pm for the best CC analysis model (x1, x2, x4, x9) become 4744.66 and 20.71 for J = 10, 4731.20 and 25.69 for J = 15, and 4750.68 and 30.80 for J = 20, which are much larger than the corresponding DIC values under the best AC model (x1, x2, x4, x5, x8, x9) and the second best AC model (x1, x2, x4, x5, x9). When a0 = 0.001, the best CC model (x1, x2, x4, x9) is the ninth best AC model for J = 10 and the tenth best model for both J = 15 and J = 20. Interestingly, similar to the CC analysis, the “optimal” choices of J and a0 are J = 15 and a0 = 0.001. Compared to the CC analysis, another noticeable change in the AC analysis is that the values of the dimensional penalty pm are larger than the corresponding values in the CC analysis, which is expected since the dimension of the missing covariates leads to an additional dimensional penalty in pm.

Table 4.

Values of DIC under best three DIC models for the BMT data with all cases

a0 = 0.001:

Model                        | J = 10          | J = 15          | J = 20
                             | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x5, x8, x9)     | 4742.43  25.07  | 4729.11  30.09  | 4748.67  35.09
(x1, x2, x4, x5, x9)         | 4742.52  21.39  | 4729.30  26.48  | 4748.95  31.53
(x1, x2, x3, x4, x5, x9)     | 4743.19  22.40  | 4729.73  27.44  | 4749.27  32.43

J = 15:

Model                        | a0 = 0.1        | a0 = 0.01       | a0 = 0.0001
                             | DICm     pm     | DICm     pm     | DICm     pm
(x1, x2, x4, x5, x8, x9)     | 4973.47  27.01  | 4735.55  29.61  | 4729.31  30.17
(x1, x2, x4, x5, x9)         | 4973.85  23.28  | 4735.63  25.76  | 4729.33  26.49
(x1, x2, x3, x4, x5, x9)     | 4974.25  24.29  | 4735.87  26.97  | 4729.84  27.57

For the best two DIC models with J = 15 and a0 = 0.001, we also computed the posterior estimates of the model parameters, and the results are shown in Table 5. Under both of the best two DIC models, all covariates except for gsource in the survival model for the time from transplant to death have 95% HPD intervals that do not contain 0. As x9 is the only missing covariate in both models, using (3.2), we only need to model x9 via the proportional odds logistic regression model conditional on the other covariates, namely disease, age, karnofprg, gsource, and prevgvh1 for model (x1, x2, x4, x5, x8, x9), and disease, age, karnofprg, and gsource for model (x1, x2, x4, x5, x9). The corresponding posterior estimates for these two missing covariate models are also shown in Table 5. Under these two models for the missing covariate cytoabnew (x9), we see that all covariates except for karnofprg have 95% HPD intervals that do not contain 0. For the second best model, comparing Table 5 with Table 3 shows that the AC analysis leads to smaller posterior standard deviations and shorter HPD intervals for all parameters in the survival model. In particular, gsource is nearly “significant” in the response model and “significant” in the covariate model in the AC analysis, where significance means that the 95% HPD interval does not contain 0.

Table 5.

Posterior estimates of β and α under the best two DIC models with J = 15 and a0 = 0.001 for the BMT data with all cases

Model Parameters Variable Estimate SD 95% HPD interval
(x1, x2, x4, x5, x8, x9) β Disease 0.130 0.034 (0.062, 0.196)
Age 0.233 0.037 (0.164, 0.308)
Karnofprg −0.090 0.032 (−0.152, −0.028)
Gsource 0.062 0.037 (−0.009, 0.136)
Prevgvh11 0.127 0.063 (0.008, 0.253)
Prevgvh12 0.062 0.071 (−0.071, 0.207)
Prevgvh13 0.014 0.041 (−0.067, 0.093)
Prevgvh14 0.010 0.044 (−0.076, 0.096)
Cytoabnew1 −0.424 0.109 (−0.633, −0.206)
Cytoabnew2 −0.614 0.103 (−0.812, −0.405)
Cytoabnew3 −0.726 0.205 (−1.150, −0.342)
α Intercept 1 −2.158 0.071 (−2.302, −2.024)
Intercept 2 −0.428 0.046 (−0.518, −0.338)
Intercept 3 2.950 0.097 (2.754, 3.135)
Disease 0.363 0.043 (0.275, 0.444)
Age 0.105 0.047 (0.011, 0.194)
Karnofprg −0.073 0.043 (−0.155, 0.011)
Gsource 0.144 0.046 (0.056, 0.238)
Prevgvh11 −0.021 0.073 (−0.159, 0.127)
Prevgvh12 −0.213 0.080 (−0.367, −0.056)
Prevgvh13 0.036 0.049 (−0.058, 0.134)
Prevgvh14 −0.136 0.054 (−0.245, −0.032)
(x1, x2, x4, x5, x9) β Disease 0.127 0.034 (0.060, 0.195)
Age 0.235 0.037 (0.163, 0.308)
Karnofprg −0.088 0.031 (−0.151, −0.027)
Gsource 0.068 0.036 (−0.004, 0.135)
Cytoabnew1 −0.423 0.108 (−0.623, −0.200)
Cytoabnew2 −0.618 0.102 (−0.809, −0.409)
Cytoabnew3 −0.743 0.204 (−1.140, −0.351)
α Intercept 1 −2.136 0.070 (−2.277, −2.003)
Intercept 2 −0.420 0.046 (−0.508, −0.329)
Intercept 3 2.928 0.096 (2.743, 3.116)
Disease 0.353 0.043 (0.269, 0.437)
Age 0.113 0.046 (0.018, 0.198)
Karnofprg −0.069 0.042 (−0.150, 0.014)
Gsource 0.149 0.045 (0.063, 0.240)

6 Discussion

We have proposed a joint semi-conjugate prior for the regression coefficients β and the piecewise hazard parameters λ and examined its theoretical properties in the piecewise exponential model for right censored survival data. The proposed prior is quite attractive in the context of variable subset selection for survival data with missing covariates. It is proper, and the functional form of the prior is immediately determined for all models once the functional form of the prior is written for the full model. In addition, the prior is completely specified by only one hyperparameter, namely a0. This indeed makes the elicitation of priors for all models in the model space much easier; otherwise, prior elicitation would be an enormous task. In addition, we have empirically shown that the DIC measure can be used to guide the choice of a0 to achieve the best posterior predictive performance. In Sect. 5, for the BMT data, we see that the best model for the AC analysis is different from the one based on a CC analysis. This empirical result demonstrates that one cannot do variable selection based only on the completely observed cases. In fact, it is important to use all cases in performing variable selection.

Our computational methods in this paper are intended for situations where the number of models in the model space can be enumerated, so with this in mind, our proposed procedure works best when the number of covariates is 9–15. We have considered two Monte Carlo methods for computing the DIC measures. The direct sampling method is easy to implement. However, care needs to be taken in monitoring convergence of the Gibbs sampling algorithm for each model in the model space. On the other hand, the single MC sample method requires only one Gibbs sample from the posterior distribution under the full model. Thus, one needs to monitor convergence of the Gibbs sampling algorithm only once. However, in this case, one needs to construct a “good” weight function w(γq(m)|γq(m)) to obtain an efficient single MC sample method. The choice of w(γq(m)|γq(m)) proposed in Sect. 4 works well. However, it requires finding the conditional posterior modes, which may be computationally expensive. Finding a less efficient but less computationally expensive weight function is an important future project, which is currently under investigation. We note that both Monte Carlo methods can be easily implemented using multiple computers. Thus, a parallel computing system can greatly speed up the computation of the DIC measures for variable selection. With a Linux cluster, the proposed computational procedure can work well when the number of covariates is up to 20.

Another important criterion used in model assessment is Bayesian Model Averaging (BMA). Since we have focused this paper on variable selection and selecting a set of top models, we have not addressed the issue of BMA at all, as this is an entirely different topic with different inferential goals and different computational strategies. The performance of the proposed semi-conjugate priors in the presence of MAR covariates and the effects of covariates such as sexmatch within the BMA context will be explored in future work.

Acknowledgements

The authors wish to thank Dr. Mei-Jie Zhang for providing the BMT data. The authors also wish to thank the Editor-in-Chief, the Editor, and a referee for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.

Appendix A: proof of Theorem 2.1

Observe that the marginal prior of (λ1, β) is of the form

\pi(\lambda_1, \beta \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\lambda_1 \exp(x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\}\, \frac{1}{\lambda_1}.

Let β0 = log λ1. We have

\pi(\beta_0, \beta \mid y_0, X, a_0) \propto \prod_{i=1}^{n} \{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\}.  (A.1)

Since Xobs* is of full rank, X* = (1, X) is of full rank. It is easy to show that π(β0, β|y0, X, a0) is log-concave in (β0, β′)′. This implies that π(β0, β|y0, X, a0) has a unique mode. Set

\frac{\partial}{\partial (\beta_0, \beta')'} \log \pi(\beta_0, \beta \mid y_0, X, a_0) = a_0 \sum_{i=1}^{n} \big[1 - y_0 \exp(\beta_0 + x_i'\beta)\big] (1, x_i')' = 0.  (A.2)

Thus, (−log y0, 0, …, 0)′ is the unique solution of (A.2). This implies that (−log y0, 0, …, 0)′ is the unique prior mode of (log λ1, β).

For (ii), it suffices to prove that π(β0, β|y0, X, a0) in (A.1) is proper, since π0(λ2, …, λJ) is a proper prior. We write

\pi^*(\beta_0, \beta \mid y_0, X, a_0) = \prod_{i=1}^{n} \{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\}.

It is easy to observe that

\{\exp(\beta_0 + x_i'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_i'\beta)\} \le K_0

for i = 1, 2, …, n, where K0 > 0 is a constant. Since Xobs* is of full rank, there exist i1 < i2 < ⋯ < ik+1 such that xi1, xi2, …, xik+1 are completely observed and the (k+1) × (k+1) matrix X**, whose gth row is (1, xig′) for g = 1, 2, …, k+1, is of full rank. Let u = (u1, u2, …, uk+1)′. Taking the one-to-one transformation u = X**(β0, β′)′ leads to

\int \pi^*(\beta_0, \beta \mid y_0, X, a_0)\, d\beta_0\, d\beta \le K_0^{\,n-k-1} \int \prod_{g=1}^{k+1} \{\exp(\beta_0 + x_{i_g}'\beta)\}^{a_0} \exp\{-a_0 y_0 \exp(\beta_0 + x_{i_g}'\beta)\}\, d\beta_0\, d\beta = K_1 \prod_{g=1}^{k+1} \int \exp(a_0 u_g) \exp\{-a_0 y_0 \exp(u_g)\}\, du_g < \infty,  (A.3)

which completes the proof of Theorem 2.1.

Appendix B: posterior sampling

In this appendix, we discuss how to sample from the posterior distribution under the full model given in (2.10). To this end, we propose a Gibbs sampling algorithm, which requires sampling from the following full conditional distributions in turn:

  (i) [β|λ, xmis, a0, Dobs];

  (ii) [λ|β, xmis, a0, Dobs];

  (iii) [xmis|β, λ, α, a0, Dobs];

  (iv) [α|xmis, a0, Dobs].

We briefly discuss how we sample from each of the above full conditional distributions. For (i), the full conditional density of β given λ, xmis, a0, and Dobs is of the form

\pi(\beta \mid \lambda, x_{mis}, a_0, D_{obs}) \propto \prod_{i=1}^{n} \exp\{(\nu_i + a_0) x_i'\beta - a_0 y_0 \lambda_1 \exp(x_i'\beta)\} \times \exp\Big[-\sum_{j=1}^{J} \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big].

It is easy to show that π(β|λ, xmis, a0, Dobs) is log-concave in β. Thus, we can sample the βj's via the adaptive rejection algorithm of Gilks and Wild (1992). For (ii), given β and xmis, λ1, λ2, …, λJ are conditionally independent. Let h_{ij} = \delta_{ij}(y_i - s_{j-1}) + \sum_{g=j+1}^{J} \delta_{ig}(s_j - s_{j-1}). Then, we have

\lambda_1 \sim \mathrm{Gamma}\Big(n a_0 + \sum_{i=1}^{n} \delta_{i1}\nu_i,\ \sum_{i=1}^{n} (a_0 y_0 + h_{i1}) \exp(x_i'\beta)\Big),  (B.1)

and

\lambda_j \sim \mathrm{Gamma}\Big(b_1 + \sum_{i=1}^{n} \delta_{ij}\nu_i,\ b_2 + \sum_{i=1}^{n} h_{ij} \exp(x_i'\beta)\Big),  (B.2)

for j = 2, …, J. Hence, sampling the λj from (B.1) and (B.2) is straightforward.
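
A hedged sketch of these two gamma updates (with our own array conventions: h is the n × J matrix of exposures hij, delta_nu holds the products δijνi, and exp_eta holds exp(xi′β) at the current β and imputed covariates):

```python
import numpy as np

rng = np.random.default_rng(2008)

def sample_lambda(h, delta_nu, exp_eta, a0, y0, b1, b2):
    n, J = h.shape
    lam = np.empty(J)
    # (B.1): lambda_1 absorbs the semi-conjugate prior contribution (a0, y0)
    lam[0] = rng.gamma(n * a0 + delta_nu[:, 0].sum(),
                       1.0 / np.sum((a0 * y0 + h[:, 0]) * exp_eta))  # numpy uses scale
    # (B.2): lambda_j for j >= 2, with the Gamma(b1, b2) pieces of pi_0(lambda)
    for j in range(1, J):
        lam[j] = rng.gamma(b1 + delta_nu[:, j].sum(),
                           1.0 / (b2 + np.sum(h[:, j] * exp_eta)))
    return lam
```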

For (iii), given β, λ, and α, the x2i,mis’s are conditionally independent, and the conditional distribution for x2i,mis is

\pi(x_{2i,mis} \mid \beta, \lambda, \alpha, a_0, D_{obs}) \propto f(x_{2i} \mid x_{1i}, \alpha) \exp\{(\nu_i + a_0) x_i'\beta\} \exp\{-a_0 y_0 \lambda_1 \exp(x_i'\beta)\} \exp\Big[-\sum_{j=1}^{J} \delta_{ij} \exp(x_i'\beta)\Big\{\lambda_j (y_i - s_{j-1}) + \sum_{g=1}^{j-1} \lambda_g (s_g - s_{g-1})\Big\}\Big].

Thus, the conditional distribution of x2i,mis depends on the form of f(x2i|x1i, α). In Sect. 5, for the BMT data, f(x2i|x1i, α) is a product of a proportional odds logistic density and a normal density, and hence sampling x2i,mis is relatively straightforward. In fact, the conditional distribution for (cytoabnew1, cytoabnew2, cytoabnew3) is multinomial, while the conditional distribution for log(wbcdx) is log-concave and can be sampled via the adaptive rejection algorithm of Gilks and Wild (1992). For (iv), the full conditional distribution is π(α|xmis, a0, Dobs) ∝ ∏_{i=1}^{n} f(x2i|x1i, α) π(α). For various covariate distributions specified through a series of one-dimensional conditional distributions, sampling α is straightforward. For example, in Sect. 5, the full conditional distribution for each component of α9 is log-concave, and hence we can sample the α9j's via the adaptive rejection algorithm of Gilks and Wild (1992), and the full conditional distributions for the components of α10 are either normal or inverse gamma, which are easy to sample from.
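
Putting the four steps together, the outer Gibbs loop has the following schematic form; the four update functions are placeholders to be supplied by the user (e.g., ARS-type updates for β and the missing covariates, and the gamma draws above for λ), not library routines:

```python
import numpy as np

# A schematic Gibbs loop over the full conditionals (i)-(iv); `state` maps
# parameter names to their current values (assumed numpy arrays).
def gibbs(n_iter, state, update_beta, update_lambda, update_x_mis, update_alpha):
    draws = []
    for _ in range(n_iter):
        state["beta"] = update_beta(state)      # step (i)
        state["lam"] = update_lambda(state)     # step (ii)
        state["x_mis"] = update_x_mis(state)    # step (iii)
        state["alpha"] = update_alpha(state)    # step (iv)
        draws.append({k: np.array(v, copy=True) for k, v in state.items()})
    return draws
```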

Contributor Information

Joseph G. Ibrahim, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail: ibrahim@bios.unc.edu.

Ming-Hui Chen, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA, e-mail: mhchen@stat.uconn.edu.

Sungduk Kim, Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, NIH, Rockville, MD 20852, USA, e-mail: kims2@mail.nih.gov.

References

  1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors. International symposium on information theory. Budapest: Akademia Kiado; 1973. pp. 267–281.
  2. Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian variable selection and prediction. J R Stat Soc B. 1998;60:627–641.
  3. Brown PJ, Vannucci M, Fearn T. Bayes model averaging with selection of regressors. J R Stat Soc B. 2002;64:519–536.
  4. Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models (with discussion). Bayesian Anal. 2006;1:651–674.
  5. Chen MH, Ibrahim JG. Conjugate priors for generalized linear models. Stat Sinica. 2003;13:461–476.
  6. Chen MH, Ibrahim JG, Yiannoutsos C. Prior elicitation, variable selection, and Bayesian computation for logistic regression models. J R Stat Soc B. 1999;61:223–242.
  7. Chen MH, Ibrahim JG, Shao QM, Weiss RE. Prior elicitation for model selection and estimation in generalized linear mixed models. J Stat Plan Inference. 2003;111:57–76.
  8. Chen MH, Dey DK, Ibrahim JG. Bayesian criterion based model assessment for categorical data. Biometrika. 2004;91:45–63.
  9. Chen MH, Huang L, Ibrahim JG, Kim S. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 2008;3:585–614. doi: 10.1214/08-BA323.
  10. Chipman HA, George EI, McCulloch RE. Bayesian CART model search (with discussion). J Am Stat Assoc. 1998;93:935–960.
  11. Chipman HA, George EI, McCulloch RE. The practical implementation of Bayesian model selection (with discussion). In: Lahiri P, editor. Model selection. Beachwood: Institute of Mathematical Statistics; 2001. pp. 63–134.
  12. Chipman HA, George EI, McCulloch RE. Bayesian treed generalized linear models (with discussion). In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian statistics, vol 7. Oxford: Oxford University Press; 2003. pp. 85–103.
  13. Clyde M. Bayesian model averaging and model search strategies (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics, vol 6. Oxford: Oxford University Press; 1999. pp. 157–185.
  14. Clyde M, George EI. Model uncertainty. Stat Sci. 2004;19:81–94.
  15. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. J Am Stat Assoc. 1996;91:883–904.
  16. Dellaportas P, Forster JJ. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika. 1999;86:615–633.
  17. Dey DK, Chen MH, Chang H. Bayesian approach for the nonlinear random effects models. Biometrics. 1997;53:1239–1252.
  18. Geisser S, Eddy W. A predictive approach to model selection. J Am Stat Assoc. 1979;74:153–160.
  19. Gelfand AE, Dey DK. Bayesian model choice: asymptotics and exact calculations. J R Stat Soc B. 1994;56:501–514.
  20. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics, vol 4. Oxford: Oxford University Press; 1992. pp. 147–167.
  21. Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika. 1998;85:1–13.
  22. Gelman A, Meng XL, Stern HS. Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Stat Sinica. 1996;6:733–807.
  23. George EI. The variable selection problem. J Am Stat Assoc. 2000;95:1304–1308.
  24. George EI, Foster DP. Calibration and empirical Bayes variable selection. Biometrika. 2000;87:731–747.
  25. George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88:881–889.
  26. George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sinica. 1997;7:339–374.
  27. George EI, McCulloch RE, Tsay R. Two approaches to Bayesian model selection with applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian analysis in statistics and econometrics: essays in honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348.
  28. Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. J R Stat Soc C (Appl Stat). 1992;41:337–348.
  29. Hanson TE. Inference for mixtures of finite Polya tree models. J Am Stat Assoc. 2006;101:1548–1565.
  30. Huang L, Chen MH, Ibrahim JG. Bayesian analysis for generalized linear models with nonignorably missing covariates. Biometrics. 2005;61:767–780. doi: 10.1111/j.1541-0420.2005.00338.x.
  31. Ibrahim JG, Chen MH. Power prior distributions for regression models. Stat Sci. 2000;15:46–60.
  32. Ibrahim JG, Laud PW. A predictive approach to the analysis of designed experiments. J Am Stat Assoc. 1994;89:309–319.
  33. Ibrahim JG, Chen MH, MacEachern SN. Bayesian variable selection for proportional hazards models. Can J Stat. 1999a;27:701–717.
  34. Ibrahim JG, Lipsitz SR, Chen MH. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. J R Stat Soc B. 1999b;61:173–190.
  35. Ibrahim JG, Chen MH, Ryan LM. Bayesian variable selection for time series count data. Stat Sinica. 2000;10:971–987.
  36. Ibrahim JG, Chen MH, Sinha D. Bayesian survival analysis. New York: Springer-Verlag; 2001a.
  37. Ibrahim JG, Chen MH, Sinha D. Criterion based methods for Bayesian model assessment. Stat Sinica. 2001b;11:419–443.
  38. Ibrahim JG, Chen MH, Lipsitz SR, Herring AH. Missing data methods in regression models. J Am Stat Assoc. 2005;100:332–346.
  39. Kim S, Chen MH, Dey DK, Gamerman D. Bayesian dynamic models for survival data with a cure fraction. Lifetime Data Anal. 2007;13:17–35. doi: 10.1007/s10985-006-9028-7.
  40. Laud PW, Ibrahim JG. Predictive model selection. J R Stat Soc B. 1995;57:247–262.
  41. Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika. 1996;83:916–922.
  42. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd edn. New York: Wiley; 2002.
  43. Ntzoufras I, Dellaportas P, Forster JJ. Bayesian variable and link determination for generalised linear models. J Stat Plan Inference. 2003;111:165–180.
  44. Raftery AE. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika. 1996;83:251–266.
  45. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997;92:179–191.
  46. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
  47. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
  48. Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. J Econom. 1996;75:317–343.
  49. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion). J R Stat Soc B. 2002;64:583–639.
