Bayesian Criterion Based Variable Selection

Arnab Kumar Maity; Sanjib Basu; Santu Ghosh

doi:10.1111/rssc.12488

. Author manuscript; available in PMC: 2024 Jun 11.

Published in final edited form as: J R Stat Soc Ser C Appl Stat. 2021 Aug 7;70(4):835–857. doi: 10.1111/rssc.12488

Bayesian Criterion Based Variable Selection

Arnab Kumar Maity ¹, Sanjib Basu ², Santu Ghosh ³

PMCID: PMC11166016 NIHMSID: NIHMS1997287 PMID: 38863987

Abstract

Bayesian approaches for criterion based selection include the marginal likelihood based highest posterior model (HPM) and the deviance information criterion (DIC). The DIC is popular in practice as it can often be estimated from sampling based methods with relative ease and DIC is readily available in various Bayesian software. We find that sensitivity of DIC based selection can be high, in the range of 90 – 100%. However, correct selection by DIC can be in the range of 0 – 2%. These performances persist consistently with increase in sample size. We establish that both marginal likelihood and DIC asymptotically disfavor under-fitted models, explaining the high sensitivities of both criteria. However, mis-selection probability of DIC remains bounded below by a positive constant in linear models with $g$ -priors whereas mis-selection probability by marginal likelihood converges to 0 under certain conditions. A consequence of our results is that not only the DIC cannot asymptotically differentiate between the data-generating and an over-fitted model, but, in fact, it cannot asymptotically differentiate between two over-fitted models as well. We illustrate these results in multiple simulation studies and in a biomarker selection problem on cancer cachexia of non-small cell lung cancer patients. We further study performances of HPM and DIC in generalized linear model as practitioners often choose to use DIC that is readily available in software in such non-conjugate settings.

Keywords: Deviance Information Criterion, g-prior, Highest Posterior Model, mis-selction

1. Introduction

Variable selection and the broader problem of model selection remain among the most theoretically and computationally challenging problems, and at the same time, some of the most frequent questions encountered in practice. Our focus in this article is on criterion based variable selection, or what is known as “wrapper” method in the feature selection literature (Saeys et al., 2007), which assesses subsets of predictor variables based on their usefulness according to a criterion. Variable selection is then done by optimizing this criterion either by complete enumeration or by deterministic or stochastic search over the model space. These methods thus differ from “embedded” methods such as lasso, which performs variable selection in the process of fitting the model.

In 2011, a panel of international experts participated in a formal consensus process to develop a framework for definition and classification of cancer cachexia (Fearon et al., 2011). This panel defined cancer cachexia as a “multifactorial syndrome defined by an ongoing loss of skeletal muscle mass (with or without loss of fat mass) that cannot be fully reversed by conventional nutritional support and leads to progressive functional impairment”. A series of recent studies have examined cancer cachexia and its prognostic value in cancer outcome (Bonomi et al., 2017; Martin et al., 2014; Gielda et al., 2011). Patel et al. (2016) explored the relation between cancer efficacy outcome and cancer cachexia in advanced non-small cell lung cancer (NSCLC) patients and concluded that “weight gain during treatment may be an early indicator of clinical benefit” and “monitoring weight change may provide important information regarding survival outcomes in NSCLC”.

Neutrophil and lymphocyte are the two most abundant types of white blood cells. The neutrophil-to-lymphocyte ratio (NLR) is increasingly being used as a marker of subclinical inflammation. Multiple studies have explored the association of increased baseline NLR with poor clinical outcomes for several types of cancers, In Derman, Macklis, Azeem, Sayidine, Basu, Batus, Esmail, Borgia, Bonomi, & Fidler (2017), we established associations among neutrophil-to-lymphocyte ratio, cancer cachexia and survival from NSCLC. We were involved in a follow-up cancer cachexia study which included advanced/metastatic, stage IIIB or IV, NSCLC patients treated first-line with platinum doublet chemotherapy. More than one-third of the patients gained weight on treatment from baseline to 6 weeks. Additionally, the study considered targeted biomarkers for NSCLC and measured levels of 33 such markers by Luminex^® Multiplex Assays at baseline. One objective of the study was to recommend a selected panel of biomarkers that can prognosticate decreased levels of neutrophil-to-lymphocyte ratio (NLR).

In the framework of criterion based selection, a leading Bayesian selection criterion is the marginal likelihood. Bayesian model comparison has traditionally been performed using the Bayes factor (Kass & Raftery, 1995), calculated as the ratio of marginal likelihoods of two comparator models. The Bayes factor is associated with Schwarz’s Bayesian information criterion BIC, as difference in BIC provides an approximation to the log (Bayes factor) under appropriate prior assumptions (Kass & Raftery, 1995). The estimation of the marginal likelihood from Markov chain sampling approaches is studied in Kass & Raftery (1995), Chib (1995), Chib & Jeliazkov (2001, 2005); Chan & Jeliazkov (2009) and many others. These proposed approaches require additional computational work that can be complex, intensive and are often not readily available in software packages.

A widely popular Bayesian selection criterion is the Deviance Information Criterion or DIC (Spiegelhalter et al., 2002, 2014). The DIC can be easily monitored in various implementations of the BUGS software. For models with latent structures such as hierarchical or random effects, Ariyo et al. (2019, 2020); Merkle et al. (2019); Quintero & Lesaffre (2018) recommend use of marginalized DIC over (conditional) DIC. In spite of its wide acceptance, criticisms of DIC include that it may not be invariant to reparametrization, it uses a plug-in predictive approach rather than a proper Bayesian predictive distribution and it does not focus on identifying the “true” model.

The penalty in DIC is analogous to the 2× (number of parameters) penalty in AIC, which has been shown to be asymptotically equivalent to leave-one-out cross-validation. Another Bayesian selection criterion based on one-deleted cross validation is the conditional predictive ordinate (CPO), originally introduced in Geisser (1980), whose implementation in sampling based approaches is discussed in Gelfand et al. (1992). The CPOs are combined into the Log Pseudo Marginal Likelihood (LPML) criterion, originally proposed in Geisser & Eddy (1979). One well known drawback of cross-validation methods, and hence LPML, is that their computation burden increases with increase in sample size. The other worrisome fact is the asymptotic inconsistency of one-deleted cross-validation in differentiating over-fitted models which has been studied in non-Bayesian literature (see Shao 1993 and references therein). Other Bayesian predictive criteria are reviewed in Gelman et al. (2014); Vehtari et al. (2017) and cross-validation equivalence of marginal likelihood is discussed in Fong & Holmes (2020).

Chen, Huang, Ibrahim, & Kim (2008) provide review, comparison and computational expressions of Bayes factor, DIC, LPML and other criteria in the setting of generalized linear models. Earlier works on Bayesian criterion based methods include Ibrahim, Chen, & Sinha (2001) and Chen, Dey, & Ibrahim (2004). Review of marginal likelihood, its properties and its various applications can be found in Kass & Raftery (1995) and in a series of works of Chib (1995), Chib & Jeliazkov (2001, 2005). Criteria based on posterior predictive approaches are considered in Laud & Ibrahim (1995), Meyer & Laud (2002) Ibrahim, Chen, & Sinha (2001), and Daniels, Chatterjee, & Wang (2012)

In the cancer cachexis study of advanced stage metastatic NSCLC patients and the question of selecting a panel of biomarkers that can prognosticate decreased levels of neutrophil-to-lymphocyte ratio (NLR) the individual markers were found to have varying levels of marginal correlation with NLR, the highest absolute correlation being 0.27. The biomarkers also had varying degrees of collinearity with the largest pairwise absolute correlation among two markers being 0.62. The selection of a prognosticating biomarker panel in this study is one of the motivating applications that initiated the research work in this article. We find that Bayesian variable selections by horseshoe prior (Carvalho et al., 2010), and nonlocal prior (Shin et al., 2018) select two different models (with 2 and 3 biomarkers respectively) which are completely distinct (have no common markers between them). In contrast, the non-Bayesian sure independent screening (Fan & Lv, 2008) approach selects a model with 20 biomarkers. The differences among these selections as well as selections from marginal likelihood and DIC based approaches in this scientific problem are discussed in detail in section 4.2.

The rest of the article is organized as follows. We begin with an illustrative example in section 2. In section 3, we construct the framework of under and over-fitted models for studying the properties of the Bayesian criteria. We obtain mis-selection probability expressions for DIC and marginal likelihood, and establish that mis-selection probability of marginal likelihood converges to 0 under certain conditions but show that the same does not hold for the DIC. In section 4, we compare the performances of DIC and marginal likelihood based selections with horseshoe prior (Carvalho et al., 2010), nonlocal prior (Shin et al., 2018) and the non-Bayesian sure independent screening (Fan & Lv, 2008) in the presence and absence of multicollinearity among predictors. We find that DIC based selection provides high sensitivity, in the range of 90 – 100%, however the probability of correct selection by DIC is in between 0 – 2% and the false discovery rate is between 25 – 49%. This poor performance of DIC persists consistently in simulation studies ranging from $n = 50$ to $n = 10,000$ and for various signal-to-noise ratios. Correct selection by marginal likelihood, in contrast, improves from 40% for $n = 50$ to 95% for $n = 10,000$ . Theorem 1 establishes that both marginal likelihood and DIC asymptotically disfavor under-fitted models, thus explaining the high sensitivity of both of these criteria found in the simulation studies. On the other hand Theorem 3 establishes that mis-selection probability of DIC remains away from zero for every sample size whereas mis-selection probability by marginal likelihood converges to 0 under certain conditions.

The motivating scientific problem of selecting a prognosticating biomarker panel for cancer cachexia based on baseline level biomarkers measured by Luminex^® Multiplex Assays for metastatic advanced stage non-small cell lung cancer patients is considered in detail in section 4.2. In section 6, we explore beyond the setting of linear models where we examine the performance of these criteria for variable selection in generalized linear models in the presence of multicollienarity. We find poor performance of the DIC in generalized linear model to be disconcerning as practitioners often choose to use DIC that is readily available in software in such non-conjugate settings. Finally we conclude with a brief discussion in Section 7.

2. An illustrative example

Shao (1993) considered variable selection in a dataset from Gunst & Mason (1980) to compare selection based on one-deleted and many-deleted cross-validations. We consider the same setting as Shao (1993) where the model space consists of all possible models that always include the first covariate $x_{1} \equiv 1$ . We evaluate DIC and marginal likelihood by complete enumeration of all models in the model space and report the models with the smallest DIC, highest marginal likelihood or highest posterior model (HPM). We also report the highest LPML model for comparison. As noted before, the LPML criterion combines the conditional predictive ordinates, each of which is obtained as predictive density evaluated at current observation given the remaining observations (one deleted cross-validation), see Geisser & Eddy (1979); Gelfand et al. (1992); Chen et al. (2008). $g$ -prior is assumed on the regression coefficients in the analysis models. We consider three different values of the constant multiplier of the $g$ -prior (see (3.1)), $a_{0} = 1 / n$ , and a lower and a higher value.

We simulate data from data-generating model $ℳ (D) : y ~ N (X β (D), τ^{- 1} I)$ and consider four different scenarios: (i) $ℳ (D) = \{x_{1}, x_{4}\}$ , (ii) $ℳ (D) = \{x_{1}, x_{4}, x_{5}\}$ , (iii) $ℳ (D) = \{x_{1}, x_{2}, x_{4}, x_{5}\}$ , and (iv) $ℳ (D) = \{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\}$ , the full model. We report relative frequencies of the optimally selected models by DIC, marginal likelihood and LPML over repeated simulations in Table 1.

Table 1:

Relative frequency of optimally selected models by DIC, marginal likelihood (HPM) and LPML. $ℳ (D)$ is the data-generating model and Category II are models which includes the variables in $ℳ (D)$ (but can be larger). $a_{0}$ refers to the constant multiplier of the g-prior.

Model	Category	$a_{0} = 0.001$			$a_{0} = \frac{1}{n}$			$a_{0} = 0.1$
Model	Category	DIC	HPM	LPML	DIC	HPM	LPML	DIC	HPM	LPML
$n = 40$
1,4	$M (D)$	0.575	0.973	0.584	0.621	0.876	0.602	0.572	0.727	0.350
1,2,4	II	0.104	0.006	0.114	0.109	0.038	0.123	0.117	0.075	0.113
1,3,4	II	0.131	0.010	0.132	0.092	0.029	0.103	0.105	0.073	0.070
1,4,5	II	0.108	0.011	0.097	0.106	0.045	0.096	0.129	0.085	0.287
1,2,3,4	II	0.026		0.024	0.024	0.005	0.030	0.025	0.013	0.042
1,2,4,5	II	0.026		0.022	0.018	0.005	0.014	0.025	0.014	0.053
1,3,4,5	II	0.018		0.018	0.023	0.002	0.028	0.018	0.009	0.068
1,2,3,4,5	II	0.009		0.009	0.007		0.004	0.009	0.004	0.017
1,4,5	$M (D)$	0.707	0.986	0.714	0.722	0.910	0.699	0.678	0.813	0.424
1,2,4,5	II	0.124	0.006	0.121	0.121	0.048	0.129	0.118	0.075	0.114
1,3,4,5	II	0.134	0.007	0.136	0.120	0.039	0.130	0.145	0.092	0.270
1,2,3,4,5	II	0.035	0.001	0.029	0.037	0.003	0.042	0.059	0.020	0.192
1,2,4,5	$M (D)$	0.822	0.993	0.834	0.832	0.954	0.818	0.826	0.897	0.149
1,2,3,4,5	II	0.178	0.007	0.166	0.168	0.046	0.182	0.174	0.103	0.851
1,2,3,4,5	$ℳ (𝒟)$	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
$n = 400$
1,4	$ℳ (𝒟)$	(0.605)	0.973	(0.605)	(0.625)	0.0.955	0.0.620	(0.574)	0.716	(0.228)
1,2,4	II	0.110	0.011	0.118	0.107	0.014	0.106	0.124	0.094	0.082
1,3,4	II	0.104	0.008	0.102	0.091	0.013	0.096	0.110	0.084	0.057
1,4,5	II	0.108	0.007	0.105	0.114	0.016	0.112	0.106	0.078	0.396
1,2,3,4	II	0.024	0.001	0.023	0.029	0.001	0.029	0.035	0.011	0.037
1,2,4,5	II	0.016		0.014	0.015	0.001	0.018	0.021	0.009	0.081
1,3,4,5	II	0.023		0.021	0.012		0.012	0.020	0.007	0.090
1,2,3,4,5	II	0.010		0.012	0.007		0.007	0.010	0.001	0.029
1,4,5	$ℳ (𝒟)$	0.710	0.985	0.712	0.712	0.970	0.722	0.689	0.802	0.427
1,2,4,5	II	0.129	0.008	0.124	0.143	0.014	0.135	0.128	0.076	0.108
1,3,4,5	II	0.129	0.007	0.129	0.106	0.016	0.104	0.145	0.106	0.285
1,2,3,4,5	II	0.032		0.035	0.039		0.039	0.038	0.016	0.180
1,2,4,5	$ℳ (𝒟)$	0.819	0.986	0.820	0.834	0.988	0.839	0.805	0.881	0.012
1,2,3,4,5	II	0.181	0.014	0.180	0.166	0.012	0.161	0.195	0.119	0.988
1,2,3,4,5	$ℳ (𝒟)$	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000

Open in a new tab

We first notice that across all the scenarios considered in Table 1, remarkably, the optimally selected models, either by DIC, marginal likelihood or LPML, are never sub-models of the respective data-generating models $ℳ (D)$ . As we will see in the following section 3, this finding is consistent with Theorem 1.

Secondly, we notice that the HPM approach performs superiorly in selecting the data-generating model uniformly across Table 1. This is especially evident when the data-generating model $ℳ (D) = \{x_{1}, x_{4}\}$ is relatively sparse. The DIC and LPML criteria here select the data-generating model only 62% and 60% of times when $a_{0} = 1 / n$ whereas, in comparison, the marginal likelihood criterion selects 88% of times. The performances of all three criteria, but especially of DIC and LPML improve when the data-generating model is less sparse.

To study the performances of these three criteria with increased sample size, we consider the setting of n = 400 by replicating the original $X$ matrix 10 times. We notice that surprisingly there is almost no improvements in the performance of the DIC and LPML when the sample size is increased from $n = 40$ to $n = 400$ . We further notice that the performance of HPM remains mostly unchanged from $n = 40$ to $n = 400$ as well when $a_{0}$ remains constant in $n$ (0.001 or 0.1). However, for the choice of $a_{0} = 1 / n$ , there is a significant increase in correct selection by HPM for $n = 400$ in Table 1. There is almost no improvement in the performances of DIC and LPML even in the case of $a_{0} = 1 / n$ . As we will see in the following section 3, these findings are consistent with Theorem 3.

3. Bayesian Variable Selection

3.1. Selection Criteria

We are interested in exploring the association between outcome $Y$ and $p$ predictors or features $x_{1}, \dots, x_{p}$ , and suppose that we have observations $\{(y_{i}, x_{i 1}, \dots, x_{i p}), i = 1, \dots, n\} = (y, x_{1}, \dots, x_{p})$ . Let $ℳ \subseteq {all subsets of {1,2, \dots, p}}$ denote the model space under consideration in the variable selection. A member of $ℳ$ is $ℳ (α)$ where $α$ is a subset of ${1,2, \dots, p}$ and $X_{n} (α)$ is the matrix composed of columns of $\{x_{1}, \dots, x_{p}\}$ with indices in $α$ . In model $ℳ (α)$ , the association of the stochastic outcome $Y_{n}$ with $X_{n} (α)$ is expressed in terms of a quantity $ξ_{Y_{n} ∣ X_{n} (α)}$ associated with the probability distribution $F_{Y_{n} ∣ X_{n} (α)}$ via the regression model $ξ_{Y_{n} ∣ X_{n} (α)} = X_{n} (α) β (α)$ . In normal linear model, $ξ_{Y_{n} ∣ X_{n} (α)} = E (Y_{n} ∣ X_{n} (α))$ and $F_{Y_{n} ∣ X_{n (α)}}$ is normal, resulting in model $ℳ (α) : Y_{n} ~ N (X_{n} (α) β (α), τ^{- 1} I_{n})$ where $I_{n}$ is the identity matrix. We assume all relevant $X_{n} (α)$ matrices are of full column rank.

The marginal likelihood has traditionally been used as a Bayesian model selection criterion. For model $ℳ (α)$ , the marginal likelihood is given by

p (y_{n} ∣ ℳ (α)) = \int f (y∣ β (α), τ, ℳ (α), τ) π (β (α), τ∣ ℳ (α)) d β (α) d τ,

where $f (y_{n} ∣ β (α), τ, ℳ (α))$ and $π (β (α), τ ∣ ℳ (α))$ are the likelihood and the prior under model $ℳ (α)$ . The Bayes factor for two comparator models is obtained as ratio of their marginal likelihoods. It has a large literature (Kass & Raftery, 1995; Chib, 1995; Chib & Jeliazkov, 2001, 2005; Chan & Jeliazkov, 2009) and is associated with Schwarz’s Bayesian information criterion BIC. The posterior probability of model $ℳ (α)$ in the model space $ℳ$ is given by

P (ℳ (α) ∣ y_{n}) = \frac{p (y_{n} ∣ ℳ (α)) P (ℳ (α))}{\sum_{α^{'}} p (y_{n} ∣ ℳ (α^{'}) P (ℳ (α^{'}))} \propto p (y_{n} ∣ ℳ (α)) P (ℳ (α)),

where $P (ℳ (α))$ is the prior probability of model $ℳ (α)$ in the model space. The highest posterior model (HPM) is argmax $\{P (ℳ (α) ∣ y_{n}) : α \in ℳ\}$ and the HPM approach for model selection has an extensive literature from both theoretical and computational aspects.

The Deviance Information Criterion (Spiegelhalter et al., 2002, 2014), (Chen, Huang, Ibrahim, & Kim, 2008) combines goodness of fit of a model with a penalty for model complexity and is defined as deviance + 2× (effective number of parameters), evaluated at a posterior point estimate of the parameter, In particular, $DIC = D (\tilde{θ}) + 2 p_{D}$ , where $D (θ) = - 2 log f (y_{n} ∣ θ), f (y_{n} ∣ θ)$ is the likelihood function of the model and $\tilde{θ}$ is an estimate of the model parameter $θ$ . In the above expression $p_{D}$ is termed as the effective number of parameters and is defined as $p_{D} = \bar{D (θ)} - D (\tilde{θ})$ , where $\bar{D (θ)} = E_{θ ∣ y_{n}} \{- 2 log f (y_{n} ∣ θ)\} + 2 log h (y_{n})$ is considered to be a measure of model fit, and $h (y_{n})$ is some fully specified standardizing term that is a function of the data alone. A model with a smaller value of DIC is preferred by this criterion.

The first term of DIC, $D (\hat{θ})$ is evaluated at a posterior point estimate $\hat{θ}$ . In models with latent structures, such as mixed effects, hierarchical, latent variables or missing data models, the unobservables also include the latent variables. In these settings, what should be considered as likelihood and $θ$ can result in different definitions of DIC (Celeux et al., 2006). A series of recent works (Quintero & Lesaffre, 2018; Merkle et al., 2019; Ariyo et al., 2019, 2020) have shown superior performances of marginalized criteria (such as marginalized DIC) over conditional criteria in such settings. We consider linear models without any latent structures in this article.

In Bayesian variable selection, the g-prior (Zellner, 1986) has a long history. In the setting of the normal linear model $ℳ (α) : Y_{n} ~ N (X_{n} (α) β (α), τ^{- 1} I_{n})$ , we consider the following $g$ -prior on the regression parameter $β (α)$ and $Gamma (a, b)$ prior on the precision parameter $τ$

β (α) |τ ~ N (0, \frac{1}{τ a_{0}} {(X_{n} (α)^{'} X_{n} (α))}^{- 1}) and τ ~ Gamma (a, b) .

(3.1)

The $g$ -prior is popular and widely studied (Chen, Huang, Ibrahim, & Kim, 2008) in the context of Bayesian variable selection in linear models. The choice of $a_{0}$ is important for variable selection and different choices have been suggested (George & Foster, 2000). When $Y_{n} |β (α), τ ~ N_{n} (X_{n} (α) β (α), \frac{1}{τ} I), β (α)| τ ~ N (0, \frac{1}{τ a_{0}} {(X_{n} (α)^{'} X_{n} (α))}^{- 1})$ and $τ ~ Gamma (a, b)$ as in (3.1), the marginal distribution of $Y_{n}$ is a multivariate t-distribution and the log-marginal likelihood of the model $ℳ (α)$ is given by

\begin{array}{r} logML (ℳ (α)) = - \frac{n}{2} log (2 π) + \frac{d (α)}{2} log (\frac{a_{0}}{a_{0} + 1}) + log (\frac{Γ (a + n / 2)}{Γ a}) - a log (b) - (a + \frac{n}{2}) log (b + \frac{a_{0}}{2 (1 + a_{0})} Y_{n}^{'} Y_{n} + \frac{1}{2 (1 + a_{0})} SSE (α)) . \end{array}

(3.2)

Here $SSE (α) = Y_{n}^{'} (I - H (α)) Y_{n}, H (α) = X_{n} (α) {(X_{n} (α)^{'} X_{n} (α))}^{- 1} X_{n} (α)$ is the hat matrix, and $d (α)$ is the model dimension of $ℳ (α)$ . Under prior specifications in (3.1), the DIC of model $ℳ (α)$ is obtained

DIC (ℳ (α)) = D (\bar{β} (α), \overline{τ} (α)) + 2 p_{D} (α),

where $\bar{β} (α) = E [β ∣ y_{n}]$ and $\bar{𝒯} (α) = E {τ (α) ∣ y_{n}} = \frac{a + n / 2}{b + \frac{1}{2} S S E (α) + \frac{1}{2} \frac{a_{0}}{1 + a_{0}} y_{n}^{'} H (α) y_{n}}$ are posterior estimates of $β (α)$ and $τ$ . We have (see Spiegelhalter et al. 2002) $p_{D} (α) = \bar{D (β (α), τ (α))} - D (\bar{β} (α), \overline{τ} (α)) = E_{τ ∣ y_{n}} [E_{β ∣ y_{n}, τ} [- 2 log f (y_{n} ∣ θ)]] - D (\bar{β} (α), \overline{τ} (α)) = \frac{d (α)}{1 + a_{0}} - n {ψ (a + n / 2) - log (a + n / 2)}$ where $ψ (.)$ is the digamma function. It follows that

\begin{array}{r} DIC (ℳ (α)) = - n ln (\frac{\overline{τ} (α)}{2 π}) + \frac{a_{0}^{2} \overline{τ} (α)}{{(1 + a_{0})}^{2}} Y_{n}^{'} Y_{n} + \frac{(1 + 2 a_{0}) \overline{τ} (α)}{{(1 + a_{0})}^{2}} SSE (α) + 2 \frac{d (α)}{1 + a_{0}} - 2 n {ψ (a + n / 2) - log (a + n / 2)} . \end{array}

(3.3)

Both DIC and marginal likelihood have extensive literature. For marginal likelihood and Bayes factor in general setting, see Kass & Raftery (1995); Chib (1995); Chib & Jeliazkov (2001, 2005); Chan & Jeliazkov (2009) and references therein. In normal linear model with $g$ -prior, Fernandez et al. (2001a) show that posterior probability of data generating model converges to 1. In similar setting, George & Foster (2000) show (see also Fernandez et al. 2001b) that the HPM approach can be equivalent to either AIC or BIC or other approaches depending on the choice of the constant of the $g$ -prior. Liang et al. (2008) study Bayesian variable selection under mixtures of $g$ -priors. Besides the setting of $g$ -priors, Casella & Moreno (2006); Casella et al. (2009) establish consistency of Bayes factor under intrinsic priors whereas Moreno et al. (2010) study consistency property with the rate of growth of the model dimension. Chib et al. (2018) study marginal likelihood in the exponentially tilted empirical likelihood framework whereas recent work by Fong & Holmes (2020) establish that marginal likelihood is formally equivalent to a type of exhaustive cross-validation. Johnson & Rossell (2010, 2012) propose non-local prior to regularize imbalance in point null hypothesis testing and study Bayes factor, see also Shin et al. (2018).

DIC is introduced in Spiegelhalter et al. (2002) and its expression depends on the likelihood and the posterior. It thus can be obtained when the likelihood and posterior are tractable, such as in the setting of improper priors. The terms in DIC can often be estimated from Markov chain samples and DIC is readily available in popular Bayesian software. DIC is evaluated at a plug-in posterior estimate $\tilde{θ}$ and has been criticized as not fully Bayesian (Gelman et al., 2014) and being not a consistent model selection procedure (Moreno & Vâzquez-Polo, 2014), There is substantial recent interest on DIC based on observed-data likelihood as opposed to conditional likelihood (Celeux et al., 2006; Quintero & Lesaffre, 2018; Merkle et al., 2019; Ariyo et al., 2019, 2020) in the setting of models with latent structure. Chan & Grant (2016) develop fast computation for observed-data DIC whereas Li et al. (2020) establish relation to observed-data AIC as well as propose DIC for misspecified models.

We study properties of marginal likelihood and DIC under category I and Category II models, which are introduced below.

3.2. Category I and II models

In the following, we assume there is a data-generating model within the model space, denoted by $ℳ (α_{D})$ , or in short, simply as $ℳ (D)$ . We consider a partition $\{ℳ_{1}, ℳ_{2}\}$ of $ℳ ∖ {ℳ (𝒟)}$ , where $ℳ_{1} = {ℳ (α) {: α}_{D} ⊈ α\}$ and $ℳ_{2} = \{ℳ (α) : α_{D} \subset α\} . ℳ_{1}$ contains all models that do not contain at least one component of the data-generating model whereas, $ℳ_{2}$ contains all super-models of the data-generating model. We refer to $ℳ_{1}$ and $ℳ_{2}$ as the class of category I and category II models respectively, and similar definitions can be found in Shao (1993).

Under the data generating model $ℳ (D) \in ℳ, Y_{n} ~ N (μ_{n}, τ^{- 1} I_{n})$ where $μ_{n} = X_{n} (D) β (D)$ . Many results in this article are derived under the following assumptions: (A1) ${lim}_{n \to \infty} n^{- 1} (X_{n}^{'} X_{n}) = V$ , where $V$ is finite and positive definite matrix, (A2) ${lim}_{n \to \infty} {max}_{1 \leq i \leq n} h_{i i} (α) \to 0, h_{i i} (α)$ is the $i$ th diagonal element of $H_{n} (α)$ , and (A3) ${lim}_{n \to \infty} n^{- 1} μ_{n}^{'} μ_{n} = μ^{'} μ$ . Assumptions (A1), (A2), and (A3) are standard for deriving asymptotic results in normal linear models. We additionally assume (A4): for any model $ℳ (α^{'}) \in ℳ_{1}, {lim}_{n \to \infty} n^{- 1} μ_{n}^{'} (I - H (α^{'}) μ_{n}) = b (α^{'})$ , similar to Fernandez et al. (2001a). (A1) ensures the covariates are bounded and (A3) ensures the mean vector of the data generating model is finite whereas (A4) is needed for identifiability.

Our first set of results establishes that even the worst model in $ℳ_{2} \cup {ℳ (D)}$ is asymptotically preferred by marginal likelihood and DIC than the best category I model in $ℳ_{1}$ . All proofs are available in the supplementary material.

Theorem 1. If $ℳ_{1}$ and $ℳ_{2} \cup {ℳ (D)}$ are not empty and for the choice of $a_{0} = 1 / n$

${lim}_{n \to \infty} P_{Y_{n}} \{{min}_{ℳ (α) \in ℳ_{2} \cup {ℳ (D)}, ℳ (α^{'}) \in ℳ_{1}} Bayes factor (ℳ (α), ℳ (α^{'})) \geq 1\} = 1$ .
${lim}_{n \to \infty} P_{Y_{n}} \{{min}_{ℳ (α) \in ℳ_{1}} D I C (ℳ (α)) \geq {max}_{ℳ (α) \in ℳ_{2} \cup {ℳ (D)}} D I C (ℳ (α))\} = 1$ .

Theorem 1 establishes that selection by these criteria concentrates on the class of ${category II} \cup ℳ (D)}$ models. In our numerical studies in sections 2 and 4, we find that both marginal likelihood and DIC based selections yield high sensitivities (see Table 2) corroborating this result. Scientific problems increasingly involve a large number of features, and the typical scientific postulate is that the underlying data generating model $ℳ (D)$ is sparse, involving only a few of the features. In such cases, the class of category II models is large and it is important to perform appropriate variable selection. Our next set of results consider the mis-selection probability within this class.

Table 2:

$ℳ (D)$ is the proportion of times the data generating model is selected. FDR is the false discovery rate (=FP/(TP+FP)) and SEN is the sensitivity (=TP/(TP+FN)), both averaged over replications. Here TP (True Positive) and FN (False Negative) are number of $β$ coefficients selected respectively as important and not important by the analysis approach out of the non-zero coefficients in the data generating model. Similarly FP (False Positive) is the number of coefficients selected as important out of the zero coefficients in the data generating model. Results are based on 100 replications.

	DIC			HPM			HorseShoe			BayesS5			SIS
	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN
	$n = 50, p = 25, β (D) = (1.5, - 1.5,1.5, - 1.5,1.5)$
$ρ = 0$	0.00	0.46	1.00	0.67	0.07	1.00	0.18	0.02	0.68	0.72	0.05	1.00	0.21	0.34	1.00
$ρ = 0.5$	0.01	0.49	0.92	0.40	0.11	0.88	0.00	0.01	0.27	0.42	0.11	0.87	0.02	0.53	0.84
AR	0.01	0.46	0.99	0.59	0.09	0.95	0.00	0.01	0.16	0.58	0.09	0.93	0.16	0.43	0.99
	$n = 100, p = 25, β (D) = (1.5, - 1.5,1.5, - 1.5,1.5)$
$ρ = 0$	0.01	0.47	1.00	0.79	0.05	1.00	0.83	0.02	0.99	0.79	0.04	1.00	0.57	0.20	1.00
$ρ = 0.5$	0.00	0.49	0.97	0.72	0.06	0.98	0.01	0.00	0.65	0.60	0.10	0.97	0.14	0.46	0.90
AR	0.00	0.48	1.00	0.83	0.03	1.00	0.35	0.01	0.70	0.68	0.06	1.00	0.50	0.21	1.00
	$n = 10000, p = 25, β (D) = (1.5, - 1.5,1.5, - 1.5,1.5)$
$ρ = 0$	0.00	0.44	1.00	0.96	0.01	1.00	0.75	0.04	1.00	0.98	0.00	1.00	0.92	0.02	1.00
$ρ = 0.5$	0.00	0.43	1.00	0.95	0.01	1.00	0.76	0.04	1.00	0.90	0.02	1.00	0.81	0.04	1.00
AR	0.00	0.42	1.00	0.95	0.01	1.00	0.77	0.04	1.00	0.97	0.00	1.00	0.87	0.03	1.00
	$n = 100, p = 25, β (D) = (0.7, - 0.7,0.7, - 1,1, - 1,1, - 1.5,1.5, - 1.5)$
$ρ = 0$	0.02	0.26	1.00	0.68	0.03	0.99	0.05	0.01	0.72	0.73	0.02	0.98	0.20	0.20	1.00
$ρ = 0.5$	0.02	0.26	0.94	0.06	0.03	0.86	0.00	0.01	0.51	0.02	0.01	0.80	0.09	0.21	0.95
AR	0.01	0.25	0.99	0.46	0.02	0.90	0.00	0.01	0.29	0.43	0.03	0.88	0.09	0.24	0.99
	$n = 200, p = 25, β (D) = (0.7, - 0.7,0.7, - 1,1, - 1,1, - 1.5,1.5, - 1.5)$
$ρ = 0$	0.02	0.25	1.00	0.83	0.02	1.00	0.55	0.00	0.95	0.82	0.02	1.00	0.49	0.10	1.00
$ρ = 0.5$	0.02	0.28	0.98	0.38	0.03	0.95	0.00	0.01	0.74	0.43	0.02	0.94	0.28	0.15	0.98
AR	0.02	0.26	1.00	0.75	0.02	1.00	0.08	0.00	0.74	0.70	0.03	1.00	0.27	0.15	1.00

Open in a new tab

Theorem 2. Consider any model $ℳ (α) \in ℳ_{2}$ .

The probability of mis-selection by marginal likelihood is given by
$P_{Y_{n}} \{Bayes factor (ℳ (α), ℳ (α_{D})) \geq 1\} \leq 1 - F_{χ_{η}^{2}} [(1 + a_{0}) (2 n + a) \{{(1 + a_{0}^{- 1})}^{\frac{η}{(n + 2 a)}} - 1\}] + o (1) .$ (3.4)
where $η = d (α) - d (α_{D})$ and $F_{χ_{η}^{2}} (x)$ is the cdf of $χ^{2}$ distribution with $η$ degrees of freedom as before.
The probability of mis-selection by DIC is given by
$P_{Y_{n}} \{D I C (ℳ (α)) \leq D I C (ℳ (D))\} \leq 1 - F_{χ_{η}^{2}} (\frac{2 (1 + a_{0})}{1 + 2 a_{0}} η) + o (1) .$ (3.5)

Remark 1: Note that the expressions in (3.4) and (3.5) are free of $μ (D)$ (or the limiting value $μ^{'} μ$ as in A3) and $τ (D)$ of the data generating model and thus can be readily evaluated without knowing the data-generating model.

Remark 2: It is noteworthy that the upper bound in (3.5) is independent of $n$ and thus provides a constant bound for all $n$ (when $a_{0}$ does not depend on $n$ ).

Theorem 2 is illustrated in Figure 1 in which the marginal likelihood and DIC of the data generating model $ℳ (D)$ and different $ℳ (α)$ models are evaluated for different sample sizes. We used $a_{0} = 1 / n$ for the $g$ -prior on $β (α)$ and Gamma(1, 1) prior on $τ$ . The figure shows the empirical estimates of the left hand side expressions in Theorem 2 based on 2,500 repeated data simulations (red) and the upper bounds of mis-selection probabilities (blue) from the right hand sides of Theorem 2. We note that the mis-selection probabilities of Bayes factor are relatively low across all three panels and decreases to zero with increased sample size. The corresponding probability for DIC is relatively high and stays bounded from below. Both of these results are established theoretically in the following.

Usual recommendation in Bayesian variable selection is to consider $a_{0} (n)$ decreasing in sample size $n$ . The following Theorem establishes that the mis-selection probability of DIC asymptotically continues to be positive for any such sequence.

Theorem 3. (a) For $ℳ (α) \in ℳ_{2}, a_{0} = a_{0} (n) \to 0$ , and $a_{0} (n)^{- 1} = O (n)$ as $n \to \infty$ ,

lim_{n \to \infty} P_{Y_{n}} {Bayes factor (ℳ (α), ℳ (D)) \geq 1} = 0 .

(b) Consider any sequence $\{a_{0} (n)\}$ converging to 0 as $n \to \infty$ . For $ℳ (α) \in ℳ_{2}$

lim_{n \to \infty} P_{Y_{n}} {D I C (ℳ (α)) \leq D I C (ℳ (D))} > 0 .

Note that the above results in Theorem 3 are comparative results between models. Fernandez et al. (2001a) provide a result on consistency of the data generating model $ℳ (D)$ under a non-informative $π (τ) \propto 1 / τ$ and $a_{0} (n) \to 0$ .

The implication of the next result is that given data $y_{n}$ and a fixed $a_{0} < \frac{1}{exp (\frac{4 a}{n} + 2) - 1}$ , if we mis-select model $ℳ (α)$ instead of the data generating model $ℳ (D)$ using the Bayes factor criterion, then we will also make the same mis-selection using the DIC criterion. The previous results in this article are stated in terms of probability under the data-generating sampling model. In contrast, the following result is, in fact, conditional on data $y_{n}$ .

Theorem 4. If $a_{0} < \frac{1}{exp (\frac{4 a}{n} + 2) - 1}$ , then for the data-generating model $ℳ (D)$ and a model $ℳ (α) \in ℳ_{2}$ ,

\{y_{n} : Bayes factor (ℳ (α), ℳ (D) ∣ y_{n}) \geq 1\} \subseteq \{y_{n} : DIC (ℳ (α) ∣ y_{n}) \leq DIC (ℳ (D) ∣ y_{n})\} .

We briefly sketch parallel results on the performances of DIC and marginal likelihood when $τ$ is known in section 5.

4. Numerical Studies

We present a numerical study and the cancer cachexia scientific study to further illustrate our results and the performances of DIC and marginal Likelihood.

4.1. HPM, DIC, Horseshoe, Nonlocal, SIS

We compare the empirical properties of marginal likelihood based HPM (Chib, 1995; Chib & Jeliazkov, 2001, 2005) and minimum DIC (Chen, Huang, Ibrahim, & Kim, 2008) approaches with several Bayesian and non-Bayesian variable selection methods. The horseshoe prior (Carvalho et al. 2010, Bhadra et al. 2016) has a well-established literature for sparse signals. We consider the implementation of the horseshoe prior in the R package Horseshoe (van der Pas et al., 2016) and the variable selection function in this package in its default setting. We further consider the nonlocal prior based selection (Johnson & Rossell, 2012) whose theoretical and numerical performances are recently considered in Shin et al. (2018) and its stochastic search implementation in the R package BayesS5 (Shin & Tian, 2017) in its default setting. We further compare these Bayesian approaches with the non-Bayesian (iterative) sure independent screening method whose sure screening and consistency are established in Fan & Lv (2008); Fan et al. (2011) and which is implemented in the R package SIS (Saldana & Feng, 2018).

We use SNR to denote the signal-to noise ratio in the data generating model (Meier et al., 2009; Fan et al., 2011; Dicker, 2014),

SNR = \frac{Var (X (D) β (D))}{Var (ε)} = τ (D) β (D)^{'} Σ_{X (D)} β (D),

where $Σ_{X (D)}$ is the covariance matrix of the randomly generated $X (D)$ in the data generating model.

We consider the following simulation models

$(p = 25, p_{Datagen} = 5$ , uncorrelated x’s, SNR = 5): We simulate data according to the Gaussian linear model $Y ~ N (2 \cdot 1 + X (D) β (D), τ^{- 1} I)$ where 1 is a column of 1’s. We take the data generating model $ℳ (D)$ to be ${1,2, 3,4, 5}, β (D) = (1.5, - 1.5,1.5, - 1.5,1.5)$ and $τ = (1.5)^{- 2}$ . Each row of $X$ is independently generated from $N_{p} (0, Σ_{X})$ where $Σ_{X} = I$ is taken to be isotropic. We consider three sample sizes $n = 50,100$ and a large $n = 10000$ to explore selection consistency.
( $p = 25, p_{Datagen} = 5$ , correlated x’s, SNR ≈ 2.58): Same as above but the rows of $X$ are generated from $N_{p} (0, Σ_{X})$ so that $v a r (x_{i}) = 1, c o r (x_{4}, x_{j}) = ρ^{1 / 2}, j \neq 4$ , and $c o r (x_{i}, x_{j}) = ρ$ for all other $i \neq j$ (Fan & Lv, 2008). We take $ρ = 0.5$ .
$(p = 25, p_{Datagen} = 5$ , autoregressive correlated X’s, SNR ≈ 2.13): Same as above but the rows of $X$ are generated from $N_{p} (0, Σ_{X})$ so that $v a r (x_{i}) = 1$ , and $c o r (x_{i}, x_{j}) = ρ^{| i - j |}$ for $i \neq j$ . We take $ρ = 0.5$
$(p = 25, p_{Datagen} = 10$ , uncorrelated x’s, SNR ≈ 5.43): Same as 1 but we take $β (D) = (0.7, - 0.7,0.7, - 1,1, - 1,1, - 1.5,1.5, - 1.5)$ to consider a mixture of weak and moderate coefficients whereas keeping the SNR value similar to 1.
$(p = 25, p_{Datagen} = 10$ , correlated x’s, SNR ≈ 3.8) Same as 2 but with $β (D)$ same as 4.
$(p = 25, p_{Datagen} = 10$ , autoregressive correlated x’s, SNR ≈ 2.13): Same as 3 but with $β (D)$ same as 4.

For each scenario, we generate 100 replicated data sets and apply the different variable selection methods to each. We use $a_{0} = max {(n, p^{2})}^{- 1}$ in the g-prior (George & Foster, 2000). We further compare the complete model space enumeration based HPM selection with the MCMC method of HPM selection implemented in the R package BAS (https://cran.r-project.org/package=BAS) and the two approaches selected identical models in all the replicated cases we consider. The computation of the complete model space enumeration based HPM selection for $p = 25$ and 100 replicated datasets is performed in a high performance computing cluster (HPC) with 60 nodes each equipped with 2 Intel X5650 2.66 GHz6-core processors and 72 GB RAM and took about 20 hours.

We notice in Table 2 that the HPM, DIC, BayesS5, SIS methods (except for Horseshoe) report high sensitivity values (often = 1) for almost all the cases; this suggests that these methods mostly select models from category II, that is, super-models of the data-generating model. For HPM and DIC, the high sensitivity values corroborates with Theorem 1.

The performance of the DIC is poor uniformly across Table 2 in selecting the data-generating model $ℳ (D)$ . More importantly, the performance remains constantly poor even with increase in sample size $n$ , and even for $n = 10,000$ , corroborating with Theorem 3. The reported average FDR values in Table 2 suggest that DIC often chooses models with quite a few additional x-variables.

The HPM based selection using the $g$ -prior perform superiorly across Table 2 and often performed similar to the nonlocal prior based HPM selection in BayesS5. We also note the performance improvement of HPM in selecting the data-generating model $ℳ (D)$ with increasing sample size $n$ , consistent with Theorem 3.

All methods perform weakly for the case of non-autoregressive $ρ = 0.5$ , especially for small sample size and more in the case when $β (D)$ has a mixture of weak and moderate coefficients. The performances of HPM and BayesS5 based HPM, however, improve with increase in sample size. The Horseshoe approach generally perform weakly in Table 2 in presence of multicollineraity, but this could be partially due to our choice of default settings in the Horseshoe package. The SIS approach, surprisingly, does not perform strongly.

We note that Chen, Huang, Ibrahim, & Kim (2008) provide a simulation study comparing selection accuracy of DIC, HPM and other approaches in generalized linear models. Our results for $p = 25$ are different from the good performances they observed for the case of $p = 3$

4.2. Cancer Cachexia

We have described cancer cachexia in section 1 where we mention that prognostic value of cancer cachexia in cancer outcome is of increased interest. Fearon et al. (2011) described the formal consensus development of a framework for definition and classification of cancer cachexia by an international expert panel. Cancer cachexia is a “multifactorial syndrome defined by an ongoing loss of skeletal muscle mass (with or without loss of fat mass) that cannot be fully reversed by conventional nutritional support and leads to progressive functional impairment”. There is a substantial literature on cancer cachexia and its prognosic value. In a study on advanced non-small cell lung cancer (NSCLC) patients Patel et al. (2016) concluded that “weight gain during treatment may be an early indicator of clinical benefit”.

Neutrophils and lymphocytes are the two most abundant leukocytes, representing about 62% and 30% respectively of white blood cells. The neutrophil-to-lymphocyte ratio (NLR) is increasingly being used as a marker of subclinical inflammation. Multiple studies have explored the association of increased baseline NLR with poor clinical outcomes for several types of cancers, In Derman, Macklis, Azeem, Sayidine, Basu, Batus, Esmail, Borgia, Bonomi, & Fidler (2017), we established associations among longitudinal neutrophil-to-lymphocyte ratio, cancer cachexia and survival from NSCLC. We were involved in a follow-up study which involved $n = 107$ advanced/metastatic, stage IIIB or IV, NSCLC patients treated first-line with platinum doublet chemotherapy at Rush University Medical Center in Chicago. The median age at baseline was 66 years and 52% of the patients were females. NSCLC histology was primarily adenocarcinoma (65%) and squamous cell carcinoma (18%). The median weight at baseline and at 6 weeks after treatment were 158.8 lbs and 156.9 lbs respectively 37% of patients gained weight on treatment from baseline to 6 weeks. In addition to their neutrophil and lymphocuyte counts, the levels of 33 biomarkers were measured by Luminex^® Multiplex Assays for the patients at baseline. These biomarkers are well known in the literature for their association with NSCLC. In our analysis, we consider the neutrophil to lymphocyte ratio (NLR) as prognostic for efficacy outcome and consider the association of the baseline level biomarkers with NLR.

The biomarkers in decreasing order of marginal correlation with neutrophil-to-lymohocyte ratio are Adipisin (a protease that stimulates glucose transport) Visfatin (also known as NAMPT, Nicotinamide phosphoribosyltransferase), MIP-1- $β$ (Macrophage inflammatory protein 1- $β$ , also known as CCL4), GM-CSF (Granulocyte macrophage colony-stimulating factor), Adiponectin, Ghrelin etc. as listed in Table 3a. We note that these biomarkers are correlated among themselves as well, resulting in the potential of significant multicollinearity in the variable selection process.

Table 3:

(a): Correlation matrix of Neutrophil-to-Lymphocyte ratio (NLR) and 6 biomarkers which have highest marginal correlation with NLR. (b) Selected model for Weight Change and $p = 33$ biomarkers. Reported log-Bayes factor is in reference to the HPM model

(a)
Biomarkers	NLR	Adipisin	Visfatin	MIP-1- $β$	GM-CSF	Adiponectin
Adipisin	−0.27
Visfatin	0.23	−0.11
MIP-1- $β$	−0.16	−0.02	0.04
GM-CSF	−0.15	−0.09	0.07	0.24
Adiponectin	0.14	0.04	0.06	−0.05	0.00
Ghrelin	0.13	0.00	0.26	−0.07	0.10	0.26

(b)
Method	Selected model	log Bayes factor
HPM	Adipisin, Visfatin	reference
DIC	Adipisin, Visfatin, MIP-1- $β$ , GM-CSF, IL-21, IL-23, IL-8	−3.49
Nonlocal	IL-1- $β$ , IL-21	−6.04
Horseshoe	Adipisin, IL-23, Adiponectin	−2.30
Median Prob Model	Adipisin, Visfatin	0
SIS	20 biomarkers	−28.90

Open in a new tab

Table 3b reports the results of different variable selection methods applied to these data. We applied the HPM and min-DIC based optimal model choice for $a_{0} = n^{- 1} = 0.009$ . We also report nonlocal prior based variable selection (Shin et al., 2018) and model selected based on horseshoe prior(Carvalho et al., 2010; van der Pas et al., 2016) (both in their default settings). We further report the median probability model (MPM) whose optimality with respect to predictive loss function in the setting of orthogonal design matrix is established in Barbieri et al. (2004). Finally we report the model selected by the non-Bayesian sure independent screening (SIS (Fan & Lv, 2008)) method.

We note substantial differences among the models selected by the different variable selection methods. For example, the model selected by the nonlocal prior approach is distinctly different and does not include the variables listed in Table 3a which have high marginal correlations with the outcome. The model selected by the SIS method includes many more variables compared to the other methods. The model selected by the min-DIC approach is a super-model of the models selected by HPM or Median probability approaches and includes many additional variables. This finding is consistent with our results in section 3.2 and our simulation studies where we found that the DIC has high sensitivity but also has high false discovery rate in selecting additional variables. The HPM approach on the other hand, was found to have high sensitivity as well as low false discovery rate in those studies.

5. The case of known variance

For completeness, we briefly sketch here a set of parallel results when $τ$ is known. The expressions of the marginal likelihood and the DIC under $g$ -prior on $β (α)$ are given by

logML (ℳ (α)) = \frac{n}{2} log (\frac{τ}{2 π}) - \frac{τ a_{0}}{2 (1 + a_{0})} Y_{n}^{'} Y_{n} - \frac{τ SSE (α)}{2 (1 + a_{0})} + \frac{1}{2} log (\frac{a_{0}}{1 + a_{0}}) d (α), and

(5.1)

DIC (α) = - n log (\frac{τ}{2 π}) + \frac{a_{0}^{2} τ}{{(1 + a_{0})}^{2}} Y_{n}^{'} Y_{n} + \frac{(1 + 2 a_{0}) τ}{{(1 + a_{0})}^{2}} SSE (α) + \frac{2}{1 + a_{0}} d (α),

(5.2)

for the choice of the posterior mean $\bar{β} (α)$ as the posterior estimate $\tilde{β} (α)$ (Chen, Huang, Ibrahim, & Kim, 2008). The following result parallels Theorem 1.

Theorem 5. If $ℳ_{1}$ and $ℳ_{2} \cup {ℳ (D)}$ are not empty,

$lim_{n \to \infty} P_{Y_{n}} \{min_{ℳ (α) \in ℳ_{2} \cup {ℳ (D)}, ℳ (α^{'}) \in ℳ_{1}} Bayes factor (ℳ (α), ℳ (α^{'})) \geq 1\} = 1$ .
$lim_{n \to \infty} P_{Y_{n}} \{min_{ℳ (α^{'}) \in ℳ_{1}} D I C (ℳ (α)) \geq max_{ℳ (α) \in ℳ_{2} \cup {ℳ (D)}} D I C (ℳ (α))\} = 1$ .

The next set of results consider mis-selection probabilities and parallel Theorem 3

Theorem 6. For $ℳ (α) \in ℳ_{2}$ , and $a_{0} = a_{0} (n) \to 0$ as $n \to \infty$ ,

lim_{n \to \infty} P_{Y_{n}} {Bayes factor (ℳ (α), ℳ (D)) \geq 1} = 0 .

In contrast, we have the following result for DIC.

Theorem 7. For $ℳ (α) \in ℳ_{2}$ , any $a_{0} > 0$ and any $n$ ,

P_{Y_{n}} {D I C (ℳ (α)) \leq D I C (ℳ (D))} \geq 1 - F_{χ_{η}^{2}} (2 η) .

These results are illustrated in Figure 2 in which the marginal likelihood, DIC and LPML of the data generating model $ℳ (D)$ and different $ℳ (α)$ models were evaluated for 2,500 repeated data simulations for different sample sizes, and for the choice of $a_{0} = 1 / n$ . The figure shows the empirical estimates of mis-selection probabilities based on repeated data simulations (red) and evaluations based on analytical expressions (blue). We note that the mis-selection probabilities of Bayes factor are relatively low across all three panels and decrease to zero with increasing sample size as established in Theorem 6. In comparison, the mis-selection probabilities for both DIC and LPML are relatively high and stay relatively flat. The fact that mis-selection probability of DIC is bounded from below is established in Theorem 7.

Remark: In fact, for any $ℳ (α)$ in $ℳ (D) \cup ℳ_{2}$ , it can be shown that ${lim}_{n \to \infty} \frac{1}{n} DIC (α) \overset{p}{\to} 1 + \frac{τ a_{0}^{2}}{{(1 + a_{0})}^{2}} μ^{'} μ - log (\frac{τ}{2 π})$ . Since the right hand side is identical for every $ℳ (α)$ , DIC cannot differentiate among the models in $ℳ_{2}$ in this sense. This result parallels a result in Shao (1993) who showed that predictive squared error criterion under one-deleted cross-validation converges to an identical constant for every category II model.

We conclude this section with a result which shows that mis-selection by marginal likelihood for a given data $y_{n}$ , in fact, deterministically implies mis-selection by DIC. The reverse however is not established, suggesting that the set of $y_{n}$ resulting in mis-selection by DIC is potentially much bigger.

Theorem 8. For any $a_{0} > 0$ ,

\{y_{n} : Bayes factor (ℳ (α), ℳ (D) ∣ y_{n}) \geq 1\} \subseteq \{y_{n} : DIC (ℳ (α) ∣ y_{n}) \leq DIC (ℳ (D) ∣ y_{n})\} .

Theorem 8 is illustrated in Figure 3 in which the data-generating model $ℳ (D) = \{x_{3}, x_{4}\}$ and $ℳ (α) = \{x_{1}, x_{3}, x_{4}\}$ of the Gunst and Mason data (see Section 2). We evaluated the marginal likelihood and DIC of these two models for 2,500 repeated data simulations. The first quadrant of Figure 3 is empty which is the important finding of Theorem 8. The fourth quadrant shows the cases log Bayes factor $(ℳ (α), ℳ (D)) \geq 0$ , and $DIC (ℳ (α)) - DIC (ℳ (D)) \leq 0$ and the third quadrant highlights the cases where $DIC (ℳ (α)) - DIC (ℳ (D)) \leq 0$ but log Bayes factor $(ℳ (α), ℳ (D)) ≰ 0$ .

6. Generalized linear model

In previous sections, we study properties of Bayesian selection criteria in linear models. It is of interest to examine how these selection criteria compare in nonlinear, models. In this subsection, we provide empirical investigation of the performance of Bayesian selection criteria in generalized linear models (GLMs) with binary outcome.

GLMs assume distributions within the exponential family (McCullagh and Nelder 1989), with density

p (Y_{n}) = \prod_{i = 1}^{n} exp \{\frac{1}{a (ϕ)} (Y_{i} θ_{i} - b (θ_{i})) + c (Y_{i}, ϕ)\}

(6.1)

where $a (\cdot), b (\cdot)$ and $c (\cdot)$ are specific functions and $ϕ$ is a scaling parameter. In model $ℳ (α)$ , the association of the stochastic outcome $Y_{n}$ with $X_{n} (α)$ is expressed in terms of $E [Y_{n} ∣ X_{n} (α)]$ via the regression model

g (E [Y_{n} ∣ X_{n} (α)]) = X_{n} (α) β (α) .

(6.2)

where $g (\cdot)$ is the link function of the GLM.

Chen, Huang, Ibrahim, & Kim (2008) provide review and comparison of Bayes factor, DIC, LPML and other criteria in the setting of generalized linear models with conjugate power prior. Other works on Bayesian criterion based methods in generalized linear models include Meyer & Laud (2002) and Chen, Dey, & Ibrahim (2004). There is a substantial literature on g-priors for GLMS and several variants have been suggested, see Li & Clyde (2018) for a recent review. Unlike the case of normal linear model, the information matrix in model $ℳ (α)$ for GLM depends on $β (α)$ . The alternative g-priors vary depending on whether the expected or observed information is used in the prior and whether it is considered at $β (α) = 0$ or at the maximum likelihood estimate $\hat{β (α)}$ for model $ℳ (α)$ . We consider a g-prior proposed in Li & Clyde (2018) based on its geometric interpretability and computational efficiency and is given by

β (α) ~ N (0, \frac{1}{a_{0}} ℐ_{n} (\hat{β} (α)))

Here $ℐ_{n} (\hat{β} (α))$ is the observed information matrix for $β (α)$ excluding the intercept.

6.1. Study 1:

We consider the Pima Indian dataset on onset of Diabetes Mellitus (Smith et al., 1988). The data are collected from the Pima Indian population near Phoenix, Arizona who has been under study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases due to high incidence rate of diabetes. The outcome variable $y_{i}, i \leq 200$ is an indicator of Diabetes Mellitus onset, diagnosed according to World Health Organization Criteria, among females of ≥ 21 years of age. We consider five covariates: number of times pregnant $x_{2}$ , plasma glucose concentration at 2 Hours in an oral glucose tolerance test $x_{3}$ , body mass index $x_{4}$ , diabetes pedigree function $x_{5}$ and age in years $x_{6}$ . This dataset has been extensively studied. We also consider a randomly selected subsample of $n = 100$ to examine performance under a smaller sample size.

In this study, we simulate binary data vector $y$ from logistic regression models based on observed $x$ values and the results reported in Table 5 are based on 100 repeated data simulations. The model space $ℳ$ consists of all possible models which always includes the first covariate $x_{1} \equiv 1$ . For each data simulation, we report the models with the optimal DIC (smallest) and marginal likelihood (highest) by complete enumeration of all models in the space $ℳ$ . For comparison, we also report the highest LPML model. The relative frequency of the optimally selected models by DIC, marginal likelihood and LPML over the repeated simulations are reported in Table 5. We estimate the DIC and LPML for each model by Markov chain sampling and evaluate approximate marginal likelihood based on an integrated Laplace approximation considered in Li & Clyde (2018) which yields a closed form expression.

Table 5:

Relative frequency of optimally selected models by DIC, marginal likelihood (ML) and LPML for Pima Indian data

	$ℳ (D) = {1,3, 5}$			$ℳ (D) = {1,3, 5,6}$			$ℳ (D) = {1,3, 4,5, 6}$			$ℳ (D) = {1,2, 3,4, 5,6}$
	DIC	HPM	LPML	DIC	HPM	LPML	DIC	HPM	LPML	DIC	HPM	LPML
Subsample of $n = 100$
category I				0.05	0.02	0.01	0.08	(0.24)	0.17	0.03	0.03	0.18
$ℳ (D)$	0.43	0.86	0.49	0.69	0.90	0.81	0.55	0.76	0.48	0.97	0.97	0.82
category II	0.57	0.14	0.51	0.26	0.08	0.18	0.37		0.35
$n = 200$
category I								0.01			0.02	0.01
$ℳ (D)$	0.46	0.91	0.48	0.71	0.97	0.73	0.65	0.94	0.60	1.00	0.98	0.99
category II	0.54	0.09	0.52	0.29	0.03	0.23	0.35	0.05	0.40

Open in a new tab

As in section 2, we consider four different scenarios: (i) the data-generating model $ℳ (D) = \{x_{1}, x_{3}, x_{5}\}$ , (ii) $ℳ (D) = \{x_{1}, x_{3}, x_{5}, x_{6}\}$ , (iii) $ℳ (D) = \{x_{1}, x_{3}, x_{4}, x_{5}, x_{6}\}$ , and (iv) $ℳ (D) = \{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}, x_{6}\}$ , the full model. We first note that, in contrast to the case of linear models in section 2, the optimally selected models belong to category I $ℳ_{1}$ for a non-negligible proportion of times in case of smaller $n = 100$ . This is especially evident when $ℳ (D) = \{x_{1}, x_{3}, x_{4}, x_{5}, x_{6}\}$ for all three, DIC, HPM and LPML, criteria. For the case of larger $n = 200$ , category I models are substantially less frequently selected.

The superior performance of the marginal likelihood in selecting the data-generating model in the setting of this generalized linear model is evident across Table 5. Paralleling the case of linear model, DIC and LPML performs poorly when the data-generating model $ℳ (D) = \{x_{1}, x_{3}, x_{4}\}$ is relatively sparse whereas their performances improve as $ℳ (D)$ becomes less sparse. Also paralleling the liner model case, there is little improvement on the performance of the DIC with increase in sample size $n$ , whereas the performance of HPM improves in most of the cases.

6.2. Study 2:

The design for generating covariate $X$ in this study is similar to section 4.1. Each row of $X$ is independently generated from $N_{p} (0, Σ_{X})$ where (i) $Σ_{X} = I$ is taken to be isotropic for the uncorrelated case, (ii) $c o r (x_{i}, x_{j}) = ρ, i \neq j$ for the equi-correlated case and (iii) $c o r (x_{i}, x_{j}) = ρ^{| i - j |}, i \neq j$ for the autoregressive case. The binary outcome $y$ is then generated from a logistic regression model using a subset $X (D)$ of the $X$ matrix. We consider $p = 12$ and two cases of $n = 200$ and $n = 800$ . We further consider the cases of $(a) p_{datagen} = 4$ when the data-generating model is relatively sparse and $(b) p_{datagen} = 10$ .

The results are given in Table 6 and we first note that both DIC and HPM report average sensitivity of 1 in each of the cases, thus implying neither criteria select a category I model as optimal in any of the 100 replications and in any of the cases. In general, the performance of the DIC is quite poor in selecting the data-generating model and the performance does not improve even with a four times increase in sample size. The HPM, on the other hand, performs superiorly in selecting the data generating model in this setting of generalized linear model and the performance improves with increase in sample size. As we noted before, Chen, Huang, Ibrahim, & Kim (2008) provided a simulation study comparing selection accuracy of DIC, HPM and other approaches in Poisson regression model for the case of $p = 3$ . Our results for larger $p$ ’s in Tables 5 and 6 are different from the good performances they observed for the case of $p = 3$

Table 6:

Relative frequency of selecting data generating model $ℳ (D)$ , false discovery rate (FDR), and sensitivity (SEN) in logistic regression. Results based on averaged over 100 replications.

	DIC			HPM			DIC			HPM
	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN	$ℳ (D)$	FDR	SEN
	$n = 200, p_{datagen} = 4$						$n = 200, p_{datagen} = 10$
$ρ = 0$	0.00	0.36	1.00	0.76	0.05	1.00	0.26	0.07	1.00	0.94	0.01	1.00
$ρ = 0.5$	0.01	0.32	1.00	0.86	0.03	1.00	0.15	0.08	1.00	0.98	0.00	1.00
AR	0.00	0.34	1.00	0.87	0.03	1.00	0.06	0.10	1.00	0.84	0.02	1.00
	$n = 800, p_{datagen} = 4$						$n = 800, p_{datagen} = 10$
$ρ = 0$	0.01	0.34	1.00	0.97	0.01	1.00	0.16	0.08	1.00	0.95	0.00	1.00
$ρ = 0.5$	0.02	0.35	1.00	0.96	0.01	1.00	0.09	0.08	1.00	0.98	0.00	1.00
AR	0.02	0.32	1.00	0.95	0.01	1.00	0.17	0.08	1.00	0.94	0.01	1.00

Open in a new tab

7. Conclusion

In our theoretical investigations and especially in the numerical studies, we find that the marginal likelihood based HPM approach performs superiorly overall in selecting the data-generating model. Some undesirable properties of the DIC are well-known but we are surprised by the degree of poor performance of the DIC in identifying the data generating model in the numerical examples. This is especially disconcerting in the generalized linear model in section 6 as practitioners often choose to use DIC that is readily available in software as opposed to the complication of marginal likelihood computation in this non-conjugate setting.

In models with latent structures, such as mixed effects, hierarchical or latent variable models, a series of works considered marginalized DIC based on observed-data likelihood as opposed to conditional likelihood (Celeux et al., 2006; Chan & Grant, 2016; Quintero & Lesaffre, 2018; Merkle et al., 2019; Ariyo et al., 2019, 2020; Li et al., 2020). Comparative performance of Bayesian variable selection criteria in these settings is an interesting topic of further research.

We additionally study performances of WAIC (Watanabe-Akaike information criterion, Watanabe & Opper (2010); Gelman et al. (2014)) as well as performances of DIC and HPM under a non-conjugate double-exponential prior in the setting of section 2 following recommendations of anonymous reviewers. We find that HPM performs superior to WAIC in selecting the data generating model. The difference in performances between HPM and DIC approaches persists almost in the same scale under the double-exponential prior.

In high-dimensional variable selection with many $x$ -variables, criterion based variable selection approaches face the challenge of a large model space whose enumeration can be infeasible. Recent works on intelligent and stochastic search strategies on the model space (Shin et al., 2018; Maity, 2016) have made significant progress on the use of criterion based selection in such settings.

Supplementary Material

NIHMS1997287-supplement-1.pdf^{(231.1KB, pdf)}

Acknowledgement

The computational work in this manuscript used resources of the Center for Research Computing and Data at Northern Illinois University. We are indebted to an Associate Editor and four anonymous reviewers for their thoughtful comments which substantially improved this article.

Sanjib Basu’s research was partially supported by award R01-ES028790 from the National Institute of Environmental Health Sciences.

Footnotes

Supplement: Supplementary Material to Bayesian Criterion Based Variable Selection provides the proofs of the Theorems described in this article.

Contributor Information

Arnab Kumar Maity, Pfizer Inc., San Diego, CA.

Sanjib Basu, University of Illinois at Chicago, Chicago, IL.

Santu Ghosh, Augusta University, Augusta, GA.

References

Ariyo O, Lesaffre E, Verbeke G, & Quintero A (2019). Model selection for bayesian linear mixed models with longitudinal data: Sensitivity to the choice of priors. Communications in Statistics-Simulation and Computation, (pp. 1–25). [Google Scholar]
Ariyo O, Quintero A, Muñoz J, Verbeke G, & Lesaffre E (2020). Bayesian model selection in linear mixed models for longitudinal data. Journal of Applied Statistics, 47(5), 890–913. [DOI] [PMC free article] [PubMed] [Google Scholar]
Barbieri MM, Berger JO, et al. (2004). Optimal predictive model selection. The Annals of Statistics, 32(3), 870–897. [Google Scholar]
Bhadra A, Datta J, Polson NG, & Willard B (2016). Default Bayesian analysis with global-local shrinkage priors. Biometrika, 103(4), 955–969. [Google Scholar]
Bonomi P, Batus M, Fidler MJ, & Borgia JA (2017). Practical and theoretical implications of weight gain in advanced non-small cell lung cancer patients. Annals of Translational Medicine, 5(6). [DOI] [PMC free article] [PubMed] [Google Scholar]
Carvalho CM, Polson NG, & Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480. [Google Scholar]
Casella G, Girón FJ, Martínez ML, Moreno E, et al. (2009). Consistency of bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228. [Google Scholar]
Casella G, & Moreno E (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 101(473), 157–167. [Google Scholar]
Celeux G, Forbes F, Robert CP, Titterington DM, et al. (2006). Deviance information criteria for missing data models. Bayesian analysis, 1 (4), 651–673. [Google Scholar]
Chan JC, & Grant AL (2016). Fast computation of the deviance information criterion for latent variable models. Computational Statistics & Data Analysis, 100, 847–859. [Google Scholar]
Chan JC, & Jeliazkov I (2009). Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1–2), 101–120. [Google Scholar]
Chen M-H, Dey D, & Ibrahim J (2004). Bayesian criterion based model assessment for categorical data. Biometrika, 9 (45). [Google Scholar]
Chen M-H, Huang L, Ibrahim JG, & Kim S (2008). Bayesian Variable Selection and Computation for Generalized Linear Models with Conjugate Priors. Bayesian Analysis, 3 (3), 585–614. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chib S (1995). Marginal Likelihood from the Gibbs Output. Journal of the American Statistical Association, 90(432), 1313–1321. [Google Scholar]
Chib S, & Jeliazkov I (2001). Marginal Likelihood from the Metropolis–Hastings Output. Journal of the American Statistical Association, 96(453), 270–281. [Google Scholar]
Chib S, & Jeliazkov I (2005). Accept-reject Metropolis-Hastings sampling and marginal likelihood estimation. Statistica Neerlandica, 59(1), 30–44. [Google Scholar]
Chib S, Shin M, & Simoni A (2018). Bayesian estimation and comparison of moment condition models. Journal of the American Statistical Association, 113(524), 1656–1668. [Google Scholar]
Daniels MJ, Chatterjee AS, & Wang C (2012). Bayesian model selection for incomplete data using the posterior predictive distribution. Biometrics, 68(4), 1055–1063. [DOI] [PMC free article] [PubMed] [Google Scholar]
Derman B, Macklis J, Azeem M, Sayidine S, Basu S, Batus M, Esmail F, Borgia J, Bonomi P, & Fidler M (2017). Relationships between longitudinal neutrophil to lymphocyte ratios, body weight changes, and overall survival in patients with non-small cell lung cancer. BMC Cancer, 17(1), 141. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dicker LH (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2), 269–284. [Google Scholar]
Fan J, Feng Y, & Song R (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fearon K, Strasser F, Anker SD, Bosaeus I, Bruera E, Fainsinger RL, Jatoi A, Loprinzi C, MacDonald N, Mantovani G, et al. (2011). Definition and classification of cancer cachexia: an international consensus. The Lancet Oncology, 12(5), 489–495. [DOI] [PubMed] [Google Scholar]
Fernandez C, Ley E, & Steel MF (2001a). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100(2), 381–427. [Google Scholar]
Fernandez C, Ley E, & Steel MF (2001b). Model uncertainty in cross-country growth regressions. Journal of applied Econometrics, 16(5), 563–576. [Google Scholar]
Fong E, & Holmes C (2020). On the marginal likelihood and cross-validation. Biometrika, 107(2), 489–496. [Google Scholar]
Geisser S (1980). Discussion on Sampling and Bayes’ inference in scientific modeling and robustness (by GEP Box). Journal of the Royal Statistical Society A, 143, 416–417. [Google Scholar]
Geisser S, & Eddy WF (1979). A Predictive Approach to Model Selection. Journal of the American Statistical Association, 74, 153–160. [Google Scholar]
Gelfand AE, Dey DK, & Chang H (1992). Model determination using predictive distributions with implementation via sampling-based methods. Tech. rep., DTIC Document. [Google Scholar]
Gelman A, Hwang J, & Vehtari A (2014). Understanding predictive information criteria for bayesian models. Statistics and computing, 24(6), 997–1016. [Google Scholar]
George EI, & Foster DP (2000). Calibration and Empirical Bayes Variable Selection. Biometrika, 87, 731–747. [Google Scholar]
Gielda BT, Mehta P, Khan A, Marsh JC, Zusag TW, Warren WH, Fidler MJ, Abrams RA, Bonomi P, Liptay M, et al. (2011). Weight gain in advanced non-small-cell lung cancer patients during treatment with split-course concurrent chemoradiotherapy is associated with superior survival. International Journal of Radiation Oncology Biology Physics, 81(4), 985–991. [DOI] [PubMed] [Google Scholar]
Gunst RF, & Mason RL (1980). Regression analysis and its application: a data-oriented approach, vol. 34. CRC Press. [Google Scholar]
Ibrahim J, Chen M, & Sinha D (2001). Criterion-based methods for bayesian model assessment. Statistical Sinica, 11(419). [Google Scholar]
Johnson VE, & Rossell D (2010). On the use of non-local prior densities in bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(2), 143–170. [Google Scholar]
Johnson VE, & Rossell D (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498), 649–660. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kass RE, & Raftery AE (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773–795. [Google Scholar]
Laud P, & Ibrahim J (1995). Predictive model selection. Journal Of Royal Statistical Society, Series B, 57(247). [Google Scholar]
Li Y, & Clyde MA (2018). Mixtures of g-priors in generalized linear models. Journal of the American Statistical Association, 113(524), 1828–1845. [Google Scholar]
Li Y, Yu J, & Zeng T (2020). Deviance information criterion for latent variable models and misspecified models. Journal of Econometrics, 216(2), 450–493. [Google Scholar]
Liang F, Paulo R, Molina G, Clyde MA, & Berger JO (2008). Mixtures of g Priors for Bayesian Variable Selection. Journal of the American Statistical Association, 103(481), 410–423. [Google Scholar]
Maity AK (2016). Bayesian variable selection in linear and non-linear models. Ph.D. thesis, Northern Illinois University. [Google Scholar]
Martin L, Senesse P, Gioulbasanis I, Antoun S, Bozzetti F, Deans C, Strasser F, Thoresen L, Jagoe RT, Chasen M, et al. (2014). Diagnostic criteria for the classification of cancer-associated weight loss. Journal of Clinical Oncology, 33(1), 90–99. [DOI] [PubMed] [Google Scholar]
Meier L, Van de Geer S, Bühlmann P, et al. (2009). High-dimensional additive modeling. The Annals of Statistics, 37(6B), 3779–3821. [Google Scholar]
Merkle EC, Furr D, & Rabe-Hesketh S (2019). Bayesian comparison of latent variable models: Conditional versus marginal likelihoods. Psychometrika, 84(3), 802–829. [DOI] [PubMed] [Google Scholar]
Meyer MC, & Laud PW (2002). Predictive variable selection in generalized linear models. Journal of the American Statistical Association, 97(459), 859–871,. [Google Scholar]
Moreno E, Girón FJ, Casella G, et al. (2010). Consistency of objective bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952. [Google Scholar]
Moreno E, & Vâzquez-Polo F-J (2014). Comments on the presentation: The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 76(3), 490–492. [Google Scholar]
Patel J, Pereira J, Chen J, Liu J, Guba S, John W, Orlando M, Scagliotti G, & Bonomi P (2016). Relationship between efficacy outcomes and weight gain during treatment of advanced, non-squamous, non-small-cell lung cancer patients. Annals of Oncology, 27(8), 1612–1619. [DOI] [PubMed] [Google Scholar]
Quintero A, & Lesaffre E (2018). Comparing hierarchical models via the marginalized deviance information criterion. Statistics in medicine, 37(16), 2440–2454. [DOI] [PubMed] [Google Scholar]
Saeys Y, Inza I, & Larrañaga P (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517. [DOI] [PubMed] [Google Scholar]
Saldana DF, & Feng Y (2018). SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models. Journal of Statistical Software, (2), 1–25. [Google Scholar]
Shao J (1993). Linear Model Selection by Cross Validation. Journal of the American Statistical Association, 88(422), 486–494. [Google Scholar]
Shin M, Bhattacharya A, & Johnson VE (2018). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Statistica Sinica, 28(2), 1053–1078. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shin M, & Tian R (2017). BayesS5: Bayesian Variable Selection Using Simplified Shotgun Stochastic Search with Screening (S5). R package version 1.30. URL https://CRAN.R-project.org/package=BayesS5 [Google Scholar]
Smith JW, Everhart J, Dickson W, Knowler W, & Johannes R (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, (p. 261). American Medical Informatics Association. [Google Scholar]
Spiegelhalter DJ, Best NG, Carlin BP, & van der Linde A (2002). Bayesian Measures of Model Complexity and Fit. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 64(3), 1–34. [Google Scholar]
Spiegelhalter DJ, Best NG, Carlin BP, & van der Linde A (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 76(3), 485–493. [Google Scholar]
van der Pas S, Scott J, Chakraborty A, & Bhattacharya A (2016). horseshoe: Implementation of the Horseshoe Prior. R package version 0.1.0. URL https://CRAN.R-project.org/package=horseshoe [Google Scholar]
Vehtari A, Gelman A, & Gabry J (2017). Practical bayesian model evaluation using leave-one-out cross-validation and waic. Statistics and computing, 27(5), 1413–1432. [Google Scholar]
Watanabe S, & Opper M (2010). Asymptotic equivalence of bayes cross validation and widely applicable information criterion in singular learning theory. Journal of machine learning research, 11(12). [Google Scholar]
Zellner A (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti eds Goel PK and Zellner A, (pp. 233–243). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1997287-supplement-1.pdf^{(231.1KB, pdf)}

[R1] Ariyo O, Lesaffre E, Verbeke G, & Quintero A (2019). Model selection for bayesian linear mixed models with longitudinal data: Sensitivity to the choice of priors. Communications in Statistics-Simulation and Computation, (pp. 1–25). [Google Scholar]

[R2] Ariyo O, Quintero A, Muñoz J, Verbeke G, & Lesaffre E (2020). Bayesian model selection in linear mixed models for longitudinal data. Journal of Applied Statistics, 47(5), 890–913. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Barbieri MM, Berger JO, et al. (2004). Optimal predictive model selection. The Annals of Statistics, 32(3), 870–897. [Google Scholar]

[R4] Bhadra A, Datta J, Polson NG, & Willard B (2016). Default Bayesian analysis with global-local shrinkage priors. Biometrika, 103(4), 955–969. [Google Scholar]

[R5] Bonomi P, Batus M, Fidler MJ, & Borgia JA (2017). Practical and theoretical implications of weight gain in advanced non-small cell lung cancer patients. Annals of Translational Medicine, 5(6). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Carvalho CM, Polson NG, & Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480. [Google Scholar]

[R7] Casella G, Girón FJ, Martínez ML, Moreno E, et al. (2009). Consistency of bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228. [Google Scholar]

[R8] Casella G, & Moreno E (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 101(473), 157–167. [Google Scholar]

[R9] Celeux G, Forbes F, Robert CP, Titterington DM, et al. (2006). Deviance information criteria for missing data models. Bayesian analysis, 1 (4), 651–673. [Google Scholar]

[R10] Chan JC, & Grant AL (2016). Fast computation of the deviance information criterion for latent variable models. Computational Statistics & Data Analysis, 100, 847–859. [Google Scholar]

[R11] Chan JC, & Jeliazkov I (2009). Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1–2), 101–120. [Google Scholar]

[R12] Chen M-H, Dey D, & Ibrahim J (2004). Bayesian criterion based model assessment for categorical data. Biometrika, 9 (45). [Google Scholar]

[R13] Chen M-H, Huang L, Ibrahim JG, & Kim S (2008). Bayesian Variable Selection and Computation for Generalized Linear Models with Conjugate Priors. Bayesian Analysis, 3 (3), 585–614. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Chib S (1995). Marginal Likelihood from the Gibbs Output. Journal of the American Statistical Association, 90(432), 1313–1321. [Google Scholar]

[R15] Chib S, & Jeliazkov I (2001). Marginal Likelihood from the Metropolis–Hastings Output. Journal of the American Statistical Association, 96(453), 270–281. [Google Scholar]

[R16] Chib S, & Jeliazkov I (2005). Accept-reject Metropolis-Hastings sampling and marginal likelihood estimation. Statistica Neerlandica, 59(1), 30–44. [Google Scholar]

[R17] Chib S, Shin M, & Simoni A (2018). Bayesian estimation and comparison of moment condition models. Journal of the American Statistical Association, 113(524), 1656–1668. [Google Scholar]

[R18] Daniels MJ, Chatterjee AS, & Wang C (2012). Bayesian model selection for incomplete data using the posterior predictive distribution. Biometrics, 68(4), 1055–1063. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Derman B, Macklis J, Azeem M, Sayidine S, Basu S, Batus M, Esmail F, Borgia J, Bonomi P, & Fidler M (2017). Relationships between longitudinal neutrophil to lymphocyte ratios, body weight changes, and overall survival in patients with non-small cell lung cancer. BMC Cancer, 17(1), 141. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Dicker LH (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2), 269–284. [Google Scholar]

[R21] Fan J, Feng Y, & Song R (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Fan J, & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Fearon K, Strasser F, Anker SD, Bosaeus I, Bruera E, Fainsinger RL, Jatoi A, Loprinzi C, MacDonald N, Mantovani G, et al. (2011). Definition and classification of cancer cachexia: an international consensus. The Lancet Oncology, 12(5), 489–495. [DOI] [PubMed] [Google Scholar]

[R24] Fernandez C, Ley E, & Steel MF (2001a). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100(2), 381–427. [Google Scholar]

[R25] Fernandez C, Ley E, & Steel MF (2001b). Model uncertainty in cross-country growth regressions. Journal of applied Econometrics, 16(5), 563–576. [Google Scholar]

[R26] Fong E, & Holmes C (2020). On the marginal likelihood and cross-validation. Biometrika, 107(2), 489–496. [Google Scholar]

[R27] Geisser S (1980). Discussion on Sampling and Bayes’ inference in scientific modeling and robustness (by GEP Box). Journal of the Royal Statistical Society A, 143, 416–417. [Google Scholar]

[R28] Geisser S, & Eddy WF (1979). A Predictive Approach to Model Selection. Journal of the American Statistical Association, 74, 153–160. [Google Scholar]

[R29] Gelfand AE, Dey DK, & Chang H (1992). Model determination using predictive distributions with implementation via sampling-based methods. Tech. rep., DTIC Document. [Google Scholar]

[R30] Gelman A, Hwang J, & Vehtari A (2014). Understanding predictive information criteria for bayesian models. Statistics and computing, 24(6), 997–1016. [Google Scholar]

[R31] George EI, & Foster DP (2000). Calibration and Empirical Bayes Variable Selection. Biometrika, 87, 731–747. [Google Scholar]

[R32] Gielda BT, Mehta P, Khan A, Marsh JC, Zusag TW, Warren WH, Fidler MJ, Abrams RA, Bonomi P, Liptay M, et al. (2011). Weight gain in advanced non-small-cell lung cancer patients during treatment with split-course concurrent chemoradiotherapy is associated with superior survival. International Journal of Radiation Oncology Biology Physics, 81(4), 985–991. [DOI] [PubMed] [Google Scholar]

[R33] Gunst RF, & Mason RL (1980). Regression analysis and its application: a data-oriented approach, vol. 34. CRC Press. [Google Scholar]

[R34] Ibrahim J, Chen M, & Sinha D (2001). Criterion-based methods for bayesian model assessment. Statistical Sinica, 11(419). [Google Scholar]

[R35] Johnson VE, & Rossell D (2010). On the use of non-local prior densities in bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(2), 143–170. [Google Scholar]

[R36] Johnson VE, & Rossell D (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498), 649–660. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Kass RE, & Raftery AE (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773–795. [Google Scholar]

[R38] Laud P, & Ibrahim J (1995). Predictive model selection. Journal Of Royal Statistical Society, Series B, 57(247). [Google Scholar]

[R39] Li Y, & Clyde MA (2018). Mixtures of g-priors in generalized linear models. Journal of the American Statistical Association, 113(524), 1828–1845. [Google Scholar]

[R40] Li Y, Yu J, & Zeng T (2020). Deviance information criterion for latent variable models and misspecified models. Journal of Econometrics, 216(2), 450–493. [Google Scholar]

[R41] Liang F, Paulo R, Molina G, Clyde MA, & Berger JO (2008). Mixtures of g Priors for Bayesian Variable Selection. Journal of the American Statistical Association, 103(481), 410–423. [Google Scholar]

[R42] Maity AK (2016). Bayesian variable selection in linear and non-linear models. Ph.D. thesis, Northern Illinois University. [Google Scholar]

[R43] Martin L, Senesse P, Gioulbasanis I, Antoun S, Bozzetti F, Deans C, Strasser F, Thoresen L, Jagoe RT, Chasen M, et al. (2014). Diagnostic criteria for the classification of cancer-associated weight loss. Journal of Clinical Oncology, 33(1), 90–99. [DOI] [PubMed] [Google Scholar]

[R44] Meier L, Van de Geer S, Bühlmann P, et al. (2009). High-dimensional additive modeling. The Annals of Statistics, 37(6B), 3779–3821. [Google Scholar]

[R45] Merkle EC, Furr D, & Rabe-Hesketh S (2019). Bayesian comparison of latent variable models: Conditional versus marginal likelihoods. Psychometrika, 84(3), 802–829. [DOI] [PubMed] [Google Scholar]

[R46] Meyer MC, & Laud PW (2002). Predictive variable selection in generalized linear models. Journal of the American Statistical Association, 97(459), 859–871,. [Google Scholar]

[R47] Moreno E, Girón FJ, Casella G, et al. (2010). Consistency of objective bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952. [Google Scholar]

[R48] Moreno E, & Vâzquez-Polo F-J (2014). Comments on the presentation: The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 76(3), 490–492. [Google Scholar]

[R49] Patel J, Pereira J, Chen J, Liu J, Guba S, John W, Orlando M, Scagliotti G, & Bonomi P (2016). Relationship between efficacy outcomes and weight gain during treatment of advanced, non-squamous, non-small-cell lung cancer patients. Annals of Oncology, 27(8), 1612–1619. [DOI] [PubMed] [Google Scholar]

[R50] Quintero A, & Lesaffre E (2018). Comparing hierarchical models via the marginalized deviance information criterion. Statistics in medicine, 37(16), 2440–2454. [DOI] [PubMed] [Google Scholar]

[R51] Saeys Y, Inza I, & Larrañaga P (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517. [DOI] [PubMed] [Google Scholar]

[R52] Saldana DF, & Feng Y (2018). SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models. Journal of Statistical Software, (2), 1–25. [Google Scholar]

[R53] Shao J (1993). Linear Model Selection by Cross Validation. Journal of the American Statistical Association, 88(422), 486–494. [Google Scholar]

[R54] Shin M, Bhattacharya A, & Johnson VE (2018). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Statistica Sinica, 28(2), 1053–1078. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R55] Shin M, & Tian R (2017). BayesS5: Bayesian Variable Selection Using Simplified Shotgun Stochastic Search with Screening (S5). R package version 1.30. URL https://CRAN.R-project.org/package=BayesS5 [Google Scholar]

[R56] Smith JW, Everhart J, Dickson W, Knowler W, & Johannes R (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, (p. 261). American Medical Informatics Association. [Google Scholar]

[R57] Spiegelhalter DJ, Best NG, Carlin BP, & van der Linde A (2002). Bayesian Measures of Model Complexity and Fit. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 64(3), 1–34. [Google Scholar]

[R58] Spiegelhalter DJ, Best NG, Carlin BP, & van der Linde A (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodlogy), 76(3), 485–493. [Google Scholar]

[R59] van der Pas S, Scott J, Chakraborty A, & Bhattacharya A (2016). horseshoe: Implementation of the Horseshoe Prior. R package version 0.1.0. URL https://CRAN.R-project.org/package=horseshoe [Google Scholar]

[R60] Vehtari A, Gelman A, & Gabry J (2017). Practical bayesian model evaluation using leave-one-out cross-validation and waic. Statistics and computing, 27(5), 1413–1432. [Google Scholar]

[R61] Watanabe S, & Opper M (2010). Asymptotic equivalence of bayes cross validation and widely applicable information criterion in singular learning theory. Journal of machine learning research, 11(12). [Google Scholar]

[R62] Zellner A (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti eds Goel PK and Zellner A, (pp. 233–243). [Google Scholar]

PERMALINK

Bayesian Criterion Based Variable Selection

Arnab Kumar Maity

Sanjib Basu

Santu Ghosh

Abstract

1. Introduction

2. An illustrative example

Table 1:

3. Bayesian Variable Selection

3.1. Selection Criteria

3.2. Category I and II models

Table 2:

Figure 1:

4. Numerical Studies

4.1. HPM, DIC, Horseshoe, Nonlocal, SIS

4.2. Cancer Cachexia

Table 3:

5. The case of known variance

Figure 2:

Figure 3:

6. Generalized linear model

6.1. Study 1:

Table 5:

6.2. Study 2:

Table 6:

7. Conclusion

Supplementary Material

Acknowledgement

Footnotes

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Bayesian Criterion Based Variable Selection

Arnab Kumar Maity

Sanjib Basu

Santu Ghosh

Abstract

1. Introduction

2. An illustrative example

Table 1:

3. Bayesian Variable Selection

3.1. Selection Criteria

3.2. Category I and II models

Table 2:

Figure 1:

4. Numerical Studies

4.1. HPM, DIC, Horseshoe, Nonlocal, SIS

4.2. Cancer Cachexia

Table 3:

5. The case of known variance

Figure 2:

Figure 3:

6. Generalized linear model

6.1. Study 1:

Table 5:

6.2. Study 2:

Table 6:

7. Conclusion

Supplementary Material

Acknowledgement

Footnotes

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases