Abstract
This paper investigates Bayesian variable selection when there is a hierarchical dependence structure on the inclusion of predictors in the model. In particular, we study the type of dependence found in polynomial response surfaces of orders two and higher, whose model spaces are required to satisfy weak or strong heredity conditions. These conditions restrict the inclusion of higher-order terms depending upon the inclusion of lower-order parent terms. We develop classes of priors on the model space, investigate their theoretical and finite sample properties, and provide a Metropolis-Hastings algorithm for searching the space of models. The tools proposed allow fast and thorough exploration of model spaces that account for hierarchical polynomial structure in the predictors and provide control of the inclusion of false positives in high posterior probability models.
Keywords: Markov Chain Monte Carlo, intrinsic prior, model priors, multiple testing, multiplicity penalization, well-formulated models, strong heredity, weak heredity
1 Background
In modern regression problems, choosing a subset of good predictors from a large pool is standard practice, especially given the pervasiveness of big data problems. Such analyses require multiple testing corrections in order to attain reasonable Type I error rates as well as control of false positives. This multiple testing problem is further complicated when it is critical to incorporate interactions or powers of the predictors. Not only can there be a large number of predictors, but the structure that underlies the space of predictors must be taken into account. For regressions with a large number of predictors or high degree, the complete model space is too large to enumerate, and automatic stochastic search algorithms are necessary to find appropriate parsimonious models.
More than two decades ago, Peixoto (1990) exposed the relevance of respecting polynomial hierarchy among covariates in the variable selection context. To make model selection invariant to coding transformations in polynomial interaction models (e.g., to centering of the main effects), the selection must be constrained to the subset of models that fully respect the polynomial hierarchy (Griepentrog et al., 1982; Khuri, 2002; McCullagh and Nelder, 1989; Nelder, 2000; Peixoto, 1987, 1990); such models satisfy a strong heredity condition (SHC). Succinctly, a model satisfies the SHC if, for any predictor in the model, every lower-order predictor associated with it is also in the model; an example is the model containing the intercept, x1, x2, and x1x2. Such models are referred to as strong heredity models (SHMs).
For the inclusion of a given term, a relaxation of the SHC requires that only a subset of lower-order terms is also included. This broader class of models does not attain the invariance properties of SHMs but can represent scientifically valid assumptions (Nelder, 1998). A particular class of this type of model satisfies a weak heredity condition (WHC) and models in this class are referred to as weak heredity models (WHMs). SHMs are examples of WHMs, but the WHM class is larger. A model including the intercept, x1, and x1x2 is a WHM, but not a SHM. Specific definitions of the heredity conditions for defining SHMs and WHMs in the context of this paper are given in Section 2.
Although research on this topic started more than three decades ago (Nelder, 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by the polynomial hierarchy. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multistep procedures (Brusco et al., 2009; Peixoto, 1987), regularized regression methods (Bien et al., 2013; Choi et al., 2010; Yuan et al., 2009), and Bayesian approaches (Chipman, 1996, 1998; Chipman et al., 1997). This paper addresses the analysis of models satisfying a heredity condition (HC) from a Bayesian perspective.
Contributing to the Bayesian literature on model selection, we focus on both the effect of the assumption of a HC as well as the influence of the prior for the model space. We prove a new theoretical result that provides a further argument in favor of the SHC over the WHC. We show that under the SHC assumption for a model space the posterior asymptotically concentrates on a single model, whereas this need not be the case for the WHC. We then modify and extend the model space priors developed in Chipman (1996) to provide classes of prior distributions incorporating assumptions about independence or exchangeability of term inclusion.
The Bayesian variable selection problem consists of comparing models M in a model space ℳ using their posterior probabilities, given by p(M|y, ℳ) ∝ m(y|M)π(M|ℳ). Model posterior probabilities depend on the model space prior as well as on the priors for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger and Pericchi, 1996; Berger et al., 2001; George, 2000; Jeffreys, 1961; Kass and Wasserman, 1996; Liang et al., 2008; Zellner and Siow, 1980). In contrast, the effect of the prior on the model space has until recently been neglected. Scott and Berger (2010) highlighted the impact of priors on the model space in the context of multiple testing. The Ockham's-razor effect implicit in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models but does not account for the cardinality or structure of the model space. The required multiplicity penalty is "hidden away" in the model prior probabilities π(M|ℳ) (Scott and Berger, 2010). Wilson et al. (2010) motivated the need for stronger penalization than pure combinatorics in the context of big data. Adequately formulated priors on the model space can both account for structure in the predictors and provide additional control over the detection of false positive terms. In contrast, the popular uniform prior over the model space carries the undesirable and "informative" implication of favoring models that include half of the number of terms in the largest model in ℳ (Wilson et al., 2010).
When conducting variable selection within these model spaces, two relevant issues must be taken into account. First, the notion of model complexity takes on a new dimension. Complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the predictor associations defined by the polynomial hierarchy. Second, because the model space is defined by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
To the best of our knowledge, this is the only Bayesian framework for exploring polynomial model spaces using priors that incorporate penalties both for the number of predictors and for the complexity introduced by the polynomial structure of the model space. Although we focus on polynomial response surface regression models with independent homoscedastic normal errors, the proposed methods can be extended to generalized linear models, linear mixed models, and generalized linear mixed models. The only difficulty in such extensions is the computation of Bayes factors, which is most straightforward for the homoscedastic normal regression problem. We therefore focus on the prior structures that can be defined on the model space itself and on the effect of these priors on subsequent inference.
The manuscript is organized as follows. In Section 2, we review the hierarchical structure of polynomial regression and associated model spaces in terms of directed acyclic graphs (DAGs), facilitating the definition of the priors and random walks on the model spaces. We propose, investigate, and provide recommendations for types of prior distributions on the model spaces in Section 3. Section 4 provides a Metropolis-Hastings algorithm that takes advantage of the heredity structure. In Section 5, the performance of the proposed methods is explored through simulation studies. We illustrate the algorithm on real data and compare the results with other procedures from current literature in Section 6. Finally, Section 7 contains recommendations and discussion. Supplementary materials contain details about the methods used and additional simulation results.
2 Polynomial models and model spaces
Suppose that the observations y = (y1, …, yn) are modeled using the polynomial regression

y = ∑_{α ∈ ℕ₀^p} βα x^α + ε, (1)

where x^α = x1^α1 ⋯ xp^αp, ℕ₀ is the set of natural numbers including 0, ε ~ N(0, σ²I), and only a finite number of the regression coefficients are nonzero. As an illustration, consider a model that includes polynomial terms incorporating covariates x1 and x2 only. The terms x2² and x1²x2 can be represented by α = (0, 2) and α = (2, 1), respectively.
The set ℕ₀^p constitutes a partially ordered set. It is ordered through the binary relation "⪯" between pairs (α, α′), which is defined by α′ ⪯ α whenever α′j ≤ αj for all j = 1, …, p. Additionally, α′ ≺ α if α′ ⪯ α and α′j < αj for some j. The order of a term is given by the sum of its elements, order(α) = ∑αj, the length of α is given by the number of nonzero αj, and the type of α is defined by the increasing sequence of nonzero elements of α. If order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by 𝒫(α) = {α′ : α′ → α}, which is the set of nodes that immediately precede α. Similarly, the children set of α is defined as 𝒞(α) = {α′ : α → α′}. The poset can be represented by a directed acyclic graph (DAG), denoted by Γ, with directed edges to a node from its parents.
A polynomial model M satisfies the SHC if α ∈ M implies that 𝒫(α) ⊂ M, and M satisfies the WHC if α ∈ M implies α′ ∈ M for some α′ ∈ 𝒫(α). Any SHM M is represented by a subgraph Γ(M) of Γ with the property that if node α ∈ Γ(M), then all nodes pointing to α are also in Γ(M). A WHM graph can include a node α if there is a path from the intercept node to α. Figure 2.1 illustrates instances of these models.
Figure 2.1.
Examples of graphs for different heredity conditions.
The motivation for considering only SHMs is compelling. The subspace of y modeled by a polynomial surface, given by the hat matrix HM, is invariant to linear transformations of the covariates used in the surface if and only if M corresponds to a SHM (Peixoto, 1990). For example, if p = 2 and y = β(0,0) + β(1,0)x1 + β(0,1)x2 + β(1,1)x1x2 + ε, then the hat matrix is invariant to any covariate transformation of the form (x1 x2)A + 1b′ for any diagonal real nonsingular 2 × 2 matrix A and any real column vector b of length two. In contrast, if y = β(0,0) + β(1,0)x1 + β(1,1)x1x2 + ε, the hat matrix formed after applying the transformation x1 ↦ x1 + 1c for real c ≠ 0 is not the same as the hat matrix formed with the original x1, making the selection depend on how the variables are coded. Nelder (1998) demonstrated that the conditions under which weak heredity yields a design matrix invariant to coding are seldom met in practice. A WHM that does not obey the SHC constrains the regression model by enforcing "special points" (e.g., an intercept fixed at zero or a zero-gradient point specified a priori). Special points arise in situations where omitting lower-order terms is justified by the nature of the data and are valid in restricted scientific contexts, but they are the exception rather than the rule (Nelder, 1998). Theorems 1 and 2 in Section 2.1 provide further theoretical justification for restricting attention to SHMs.
The spaces of models, ℳ, considered in this paper are characterized by two SHMs: MB, the base model, and MF, the full model, which we assume is fixed and finite. The base model MB consists of terms that are not subject to selection and is nested in the full model MF. For example, MB can incorporate covariates describing the structure of an experimental design. The model space ℳ is populated by all models M that satisfy the desired HC, contain MB, and are nested in MF. To define the models explicitly, let ϒ(M) = M \ MB and define the sets of extreme nodes and children nodes of M by ℰ(M) = {α ∈ ϒ(M) : M \ {α} satisfies the HC} and 𝒞(M) = {α ∈ ϒ(MF) \ ϒ(M) : M ∪ {α} satisfies the HC}, respectively. Under the SHC, M ∈ ℳ is uniquely determined by considering either ℰ(M) or 𝒞(M). Under the WHC, M ∈ ℳ is determined through knowing both ℰ(M) and 𝒞(M).
2.1 Theoretical posterior considerations
We now provide results regarding the posterior of the model space when the true model MT is nested in MF and a HC is assumed for ℳ; the results are all asymptotic as dim(y) = n → ∞. We require that the design matrix of the full model, XFull (with higher-order terms included), satisfies lim_{n→∞} n⁻¹ X′Full XFull = Ω, where Ω is positive definite. If a factor is considered as a covariate, the inclusion of one of its levels forces the inclusion of all of its levels; columns are dropped when collinearity due to factor variables or their interactions is present.
The Bayes factors for homoscedastic normal regression are obtained using intrinsic priors. The learning rate of model MT versus model M is exponential in n when MT ⊈ M and proportional to n^{(|M|−|MT|)/2} when MT ⊆ M. Specifically, up to O(1) terms,

log BF(MT : M) = (n/2) δ(MT, M) + ((|M| − |MT|)/2) log n,

where δ(MT, M) ≥ 0 is a directed distance of MT from M that equals zero if and only if MT ⊆ M (Girón et al., 2010). Note that these learning rates are achieved across large classes of parametric Bayesian models with fixed prior distributions, making the results of these theorems quite generalizable (Kass and Raftery, 1995). In the theorems, we assume that the prior probabilities of all models satisfying the assumed HC are nonzero; the asymptotic results are independent of the particular positive prior assumed for ℳ.
Theorem 1
Suppose that MT ⊆ MF is the true model, which may or may not be in ℳ, and assume that ℳ satisfies the SHC. Let MT̃ = {α′ ∈ MF : α′ ⪯ α for some α ∈ MT} ∪ MB. Then MT ⊆ MT̃ ∈ ℳ and p(MT̃|y, X, ℳ) → 1.
The result in Theorem 1 depends only on the fact that the SHC is assumed for ℳ and not on whether MT satisfies a HC (only that MT ⊆ MF). Note that if MT ∈ ℳ then MT̃ = MT. When only the WHC is assumed for ℳ, a similar result holds.
Theorem 2
Suppose that MT ⊆ MF is the true model, which may or may not be in ℳ, and assume that ℳ satisfies the WHC. Let t = min{|MT̃| : MT ⊆ MT̃ ∈ ℳ} and ℳT = {MT̃ : MT ⊆ MT̃ ∈ ℳ and |MT̃| = t}. Then p(ℳT|y, X, ℳ) → 1. Further, if MT satisfies the WHC, then |ℳT| = 1.
In Theorem 2, it is interesting to note what happens when MF ⊇ MT ∉ ℳ. If ℳ satisfies the WHC, there can be more than one model in ℳT. The Bayes factor between models M, M′ ∈ ℳT is asymptotically of the form exp{(χ²_{M′} − χ²_{M})/2}, where the χ² random variables have t − |MT| degrees of freedom and are not necessarily independent. Thus, though the posterior probability of ℳT does converge to 1, no unique model is selected unless |ℳT| = 1. An example of this comes from a true model given by y = β(1,1)x1x2 + ε. In this case, the two smallest models satisfying the WHC that nest MT are given by MT̃,1 = {(0, 0), (1, 0), (1, 1)} and MT̃,2 = {(0, 0), (0, 1), (1, 1)}. Because y is not correlated with x1 or x2 conditioned on the inclusion of x1x2, the asymptotic posterior probabilities of MT̃,1 and MT̃,2 are random.
These theorems amount to Cromwell’s Rule and a statement about Bayesian learning rates. The posterior concentrates on a set of models ℳT which are of the smallest size, contain the true model MT, and have nonzero prior probability. The difference between Theorem 1 and Theorem 2 comes from the fact that the SHC places zero prior probability on models that satisfy the WHC but not the SHC. This provides another theoretical argument in favor of the SHC over the WHC. Under the SHC, the posterior concentrates on a single best model within the model space, whereas realizations of the χ2 random variables can produce arbitrary distinctions between the models in ℳT under the WHC.
3 Priors on the model space
In this section, we develop different prior structures on model spaces defined by a HC, discuss their advantages and disadvantages, and describe reasonable choices for their hyperparameters. Our hyperparameter choices are motivated by the multiplicity prior from Scott and Berger (2010) and the penalization prior of Wilson et al. (2010). We investigate how the choices of prior structure and hyperparameter affect the prior and posterior probabilities for predictor inclusion.
3.1 Model prior definition
For α ∈ ϒ(MF), let γα(M) be the indicator function describing whether α is included in M. Let γν(M) = {γα(M) : order(α) = ν} and γ<ν(M) = {γα(M) : order(α) < ν}. With these definitions, the prior probability of any model M ∈ ℳ can be factored as π(M|ℳ) = ∏_{ν=J₀}^{J} π(γν(M)|γ<ν(M), ℳ), where J₀ and J are, respectively, the minimum and maximum orders of nodes in ϒ(MF), and γ<J₀(M) is taken to be empty. Chipman (1996) simplifies prior distributions on ℳ based on two assumptions. First, if α and α′ are of order j, then γα and γα′ are assumed to be conditionally independent given γ<j. Second, the author invokes immediate inheritance: if order(α) = j, then π(γα(M)|γ<j(M), ℳ) = π(γα(M)|γ𝒫(α)(M), ℳ), where γ𝒫(α)(M) is the inclusion indicator vector for 𝒫(α).
Let πα(M) = π(γα(M) = 1|γ𝒫(α)(M), ℳ) be the conditional inclusion probability of node α in model M. Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is π(M|πℳ, ℳ) = ∏ πα(M)^{γα(M)} (1 − πα(M))^{1−γα(M)}, with πℳ = {πα(M) : α ∈ ϒ(MF), M ∈ ℳ}. Under the SHC, πα(M) = γα(M) = 0 if ∏α′∈𝒫(α) γα′(M) = 0, and under the WHC, πα(M) = γα(M) = 0 if γ𝒫(α)(M) = 0 (i.e., if all parents of α are excluded). Thus, the product can be restricted to the set of nodes α ∈ ϒ(M) ∪ 𝒞(M). Structure can be built into the prior on ℳ by making assumptions about the inclusion probabilities πα(M), such as equality assumptions or assumptions of a hyper-prior for these parameters. We now elaborate upon five model prior definitions, which incorporate various assumptions about the term inclusion indicators and the probabilities of term inclusion. Graphical representations of the priors are provided in Appendix A of the Supplementary materials.
Hierarchical Uniform Prior (HUP)
The HUP assumes that the non-zero probabilities πα(M) are all equal. Specifically, for a model M ∈ ℳ it is assumed that πα(M) = π for all α ∈ ϒ(M) ∪ 𝒞(M). A full Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces πHUP (M|ℳ, a, b) = B(|ϒ(M)| + a, |𝒞(M)| + b)/B(a, b) where B is the beta function. The HUP assigns equal probabilities to all models for which the sets of nodes ϒ(M) and 𝒞(M) have the same cardinality. This prior yields combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space, as models with the same number of terms and children get the same probability irrespective of depth and connectedness of the model’s graph. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because πα(M) = π for all γα(M) that are not forced to be zero, this penalization can only depend on characteristics of the entire graph for MF. One such penalization is to take a = 1 and b = |ϒ(MF)| for every model in ℳ.
Hierarchical Independence Prior (HIP)
The HIP assumes that there are no equality constraints among the non-zero πα(M). Each non-zero πα(M) is given its own prior, which we assume to be Beta(aα, bα). The prior probability of M under the HIP is π_HIP(M|ℳ, a, b) = ∏_{α∈ϒ(M)} aα/(aα + bα) ∏_{α∈𝒞(M)} bα/(aα + bα), where a product over the empty set is taken to be one. Because the πα(M) are independent, any choice of aα and bα is equivalent to choosing a probability of success πα(M) for a given α. Under the SHC, the HIP with parameters aα = bα = 1 is equivalent to the prior proposed in Chipman (1996) with fixed conditional inclusion probability of 1/2 for each term. This choice of hyperparameters accounts for the hierarchical structure of the model space, but essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering this choice when all terms in MF are of order one, in which case it reduces to the equal probability prior (EPP).
An additional penalization for model complexity can be incorporated into the HIP. In the construction of the prior, each vector of inclusion indicators for terms of order j, γj(M), is conditioned on the vector of inclusion indicators for lower-order terms, γ<j(M). Therefore, aα and bα for α of order j can be specified as functions of γ<j(M). One such additional penalization utilizes the number of nodes of order j that could be added without violating the assumed HC. We refer to this particular type of penalty by ch, representing its dependence on (potential) children sets. Choosing aα = 1 and bα(M) = chj(γ<j(M)) is equivalent to choosing a probability of success πα(M) = [1 + chj(γ<j(M))]^{−1}.
Hierarchical Order Prior (HOP)
A compromise between complete equality and complete independence of the πα(M) is to assume equality between the non-zero πα(M) of a given order and independence across the different orders. Define ϒj(M) = {α ∈ ϒ(M) : order(α) = j} and 𝒞j(M) = {α ∈ 𝒞(M) : order(α) = j}. The HOP assumes that πα(M) = π(j)(M) for all α ∈ ϒj(M) ∪ 𝒞j(M). Assuming that π(j)(M) ~ Beta(aj, bj) provides a prior probability of π_HOP(M|ℳ, a, b) = ∏_j B(|ϒj(M)| + aj, |𝒞j(M)| + bj)/B(aj, bj). The specific choice of aj = bj = 1 for all j produces a hierarchical version of the Scott and Berger (2010) multiplicity correction. An additional complexity penalization can be incorporated into the HOP in a similar fashion to the HIP. Given γ<j(M), the number of order-j nodes that could be added is chj(γ<j(M)) = |ϒj(M) ∪ 𝒞j(M)|. Using aj = 1 and bj(M) = chj(γ<j(M)) produces a hierarchical version of the penalization introduced in Wilson et al. (2010).
The fully specified HOP amounts to replacing the conditional independence and immediate inheritance conditions with an order-based exchangeability condition. In particular, the nodes that give rise to chj(M) are exchangeable for each order j, and the γj(M) are independent across different orders (assuming that they satisfy the choice of HC for ℳ).
Hierarchical Length Prior (HLP) and Hierarchical Type Prior (HTP)
Incorporating additional exchangeability assumptions might be of interest to a researcher who, for example, would like to account for the fact that nodes with more connections in a model's graph should be penalized differently from nodes that have fewer connections. To incorporate this type of additional structure we propose the HLP and the HTP. These two priors are equivalent when the maximum order of the nodes in ϒ(MF) is at most three and can differ when it is four or higher.
In particular, the HLP is constructed by making different group exchangeability conditions than the HOP. The γj(M) are assumed independent, but the nodes in ϒj(M)∪𝒞j(M) are taken to be group exchangeable and not merely exchangeable. The sets of nodes in ϒj(M) ∪ 𝒞j(M) are given by (ϒj(M) ∪ 𝒞j(M))ℓ = {α ∈ ϒj(M) ∪ 𝒞j(M) : length(α) = ℓ}. The sets of nodes (ϒj(M) ∪ 𝒞j(M))ℓ are independent across the different length groups but the nodes within a given (ϒj(M) ∪ 𝒞j(M))ℓ are exchangeable. In other words, nodes of a given order with the same number of parents in MF are assumed to be exchangeable, conditioned on their inclusion satisfying the HC of ℳ.
Similarly, the HTP is constructed by assuming node groups defined by (ϒj(M) ∪ 𝒞j(M))t = {α ∈ ϒj(M) ∪ 𝒞j(M) : type(α) = t}. Thus, nodes of a given order and of the same type are assumed to be exchangeable. Consider a full order-four polynomial surface in two variables. The nodes x1⁴ and x2⁴ are grouped together, the nodes x1³x2 and x1x2³ are grouped together, and x1²x2² is grouped by itself. Both the HLP and HTP can incorporate complexity penalizations in the fashion of the HIP or HOP, with ch representing the number of nodes available within a given exchangeable group.
Further penalizations in WHMs
In his discussion of WHMs, Chipman (1996) specifies conditional probabilities on the inclusion of a given node by taking into account the number of parents of that node that are included in the model. As an example, consider the inclusion of x1x2 ≐ (1, 1) in a model M. The probability π(1,1)(M) can take four values based upon the inclusion indicators for its parent terms x1 and x2. Call these probabilities π(1,1)(0, 0), π(1,1)(1, 0), π(1,1)(0, 1), and π(1,1)(1, 1). Under the WHC, the author sets the probabilities to be 0, 0.25, 0.25, 0.5, reflecting the belief that the more parents of x1x2 included in M, the higher π(1,1)(M) ought to be. Such a belief can be incorporated into the HIP by letting aα and bα depend on the number of included and missing parents of α. For example, letting aα = 1 and bα = 1 + 2[|𝒫(α)| − ∑α′∈𝒫(α) γα′(M)] produces the four probabilities that the author set for π(1,1)(M) assuming the WHC.
Similarly, a penalization based on the number of included and missing parents can be incorporated into the HLP and HTP. These priors already group nodes based upon their order and the size of their parent sets. The nodes can be further grouped by the number of parents that are missing from M and aα and bα can be modified depending on γ𝒫(α)(M).
3.2 Prior sensitivity to distributional choices
Each form of the priors introduced in Section 3.1 defines a whole family of model priors, characterized by the probability distribution for the vector of inclusion probabilities πℳ. Here we compare the two choices of hyperparameters described in Section 3.1 for a specific model space. The first assumes that the parameters of the beta distributions are all (1, 1) and is referred to as a = 1, b = 1. The second alternative is referred to as a = 1, b = ch, where b = ch has different interpretations depending on the prior and represents a (potential) children based penalty. Table 3.1 shows the prior probabilities for an order-two surface in two predictors when the base model MB is taken to be the intercept-only model and the SHC is assumed for ℳ. Though this example is certainly not exhaustive, it does help build some intuition for how the different priors and hyperparameter choices behave.
Table 3.1.
Prior probabilities for the space of SHMs associated with the quadratic surface in two main effects.
| # | Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch) | HLP/HTP (1,1) | HLP/HTP (1,ch) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1/4 | 4/9 | 1/3 | 1/2 | 1/3 | 5/7 | 1/3 | 1/2 |
| 2 | 1, x1 | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 | 1/12 | 1/12 |
| 3 | 1, x2 | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 | 1/12 | 1/12 |
| 4 | 1, x1, x1² | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 | 1/12 | 1/12 |
| 5 | 1, x2, x2² | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 | 1/12 | 1/12 |
| 6 | 1, x1, x2 | 1/32 | 3/64 | 1/12 | 1/12 | 1/60 | 1/72 | 1/18 | 1/24 |
| 7 | 1, x1, x2, x1² | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/36 | 1/72 |
| 8 | 1, x1, x2, x1x2 | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/18 | 1/24 |
| 9 | 1, x1, x2, x2² | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/36 | 1/72 |
| 10 | 1, x1, x2, x1², x1x2 | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/36 | 1/72 |
| 11 | 1, x1, x2, x1², x2² | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/18 | 1/72 |
| 12 | 1, x1, x2, x1x2, x2² | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/36 | 1/72 |
| 13 | 1, x1, x2, x1², x1x2, x2² | 1/32 | 1/576 | 1/12 | 1/120 | 1/6 | 1/252 | 1/18 | 1/72 |
First, compare the priors for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. Models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any of the quadratic terms. The HUP induces a penalization for model complexity that does not adequately penalize models for including additional terms; MF is given more probability than any other model containing at least one term. This lack of penalization for the full model is a consequence of it being the only model in ℳ of that particular size. That is, this model space distribution favors the base and full models because each is the only model in its respective size class. Similar behavior is observed with the HOP. As models become larger, they are penalized for the increased number of terms. However, once the models become large enough, the number of models of those particular sizes is reduced, producing less combinatorial penalization. In this example, the HLP and HTP coincide and produce more penalization for the inclusion of a square term than for the interaction term.
In contrast, if (a, b) = (1, ch), all the priors produce strong penalization as models become more complex, both in terms of the number and order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, there are clear differences between the priors. The HIP penalizes the full model the most, with the HLP/HTP penalizing it the least. This observation is not unique to this particular choice of MF. Because the HLP/HTP divides the nodes of a given order into smaller groups than the HUP or HOP, the former will always produce less penalization under the (1, ch) hyperparameter choice than the latter.
3.3 Posterior sensitivity to the choice of prior
To explore posterior sensitivity to the choice of prior distribution, we performed a simulation experiment with a full order-two surface on five predictors. We assume the SHC for the model space with MB only including the intercept, which gives 38,619 models. The true model contains the three main effects, two square terms, and two interactions. We simulated twenty datasets and consider results for model selection using EPP, HIP(a, b), HUP(a, b), and HOP(a, b) (where a = b = 1 and a = 1, b = ch).
Figure 3.1 shows the average rank of the true model and its average posterior probability over the twenty datasets for varying sample sizes. A general trend emerges that the EPP provides the least evidence in favor of the true model, while the choice of b = ch provides the most evidence in favor of the true model when the sample size is moderate or large. However, when the sample size is small, the EPP and b = 1 choices provide better rank for the true model. This is due to the additional penalization in the b = ch priors, which needs to be overcome by the evidence contained in the Bayes factor.
Figure 3.1.
Comparison of the rank of MT and p(MT|y) for different prior choices.
These simulations were also carried out for the spike-and-slab (SnS) prior (Ishwaran and Rao, 2005; Mitchell and Beauchamp, 1988) as well as the lasso for hierarchical interactions (hierNet) (Bien et al., 2013). As seen in Table 3.2, the highest posterior probability model (HPM) for all methods nests the true model when n > 50. The heredity-based priors exhibit essentially no false positives when n > 100, but false positives are observed with both hierNet and SnS as well as the EPP. For instance, when n = 100, SnS, hierNet, and EPP have average false positive rates of 0.0143, 0.3071, and 0.0167, respectively. In contrast, the hierarchical priors with b = ch have no false positives and b = 1 has false positive rate 0.0083. The additional penalizations built into the hierarchical priors strongly control the false positive rate in model selection. Of course, there is a trade-off and these priors can have difficulty finding true positives in situations where the signal-to-noise ratio is weak.
Table 3.2.
Method comparison for the mean percentage of true and false positive terms identified through selection
| Method | TP, n=30 | TP, n=50 | TP, n=100 | TP, n=200 | TP, n=500 | FP, n=30 | FP, n=50 | FP, n=100 | FP, n=200 | FP, n=500 |
|---|---|---|---|---|---|---|---|---|---|---|
| SnS | 0.89 | 1.00 | 1.00 | 1.00 | 1.00 | 0.04 | 0.04 | 0.01 | 0.01 | 0.00 |
| hierNet | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 | 0.29 | 0.45 | 0.31 | 0.23 | 0.12 |
| EPP | 0.91 | 1.00 | 1.00 | 1.00 | 1.00 | 0.03 | 0.07 | 0.02 | 0.03 | 0.00 |
| HOP.Ch | 0.73 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HIP.Ch | 0.68 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HUP.Ch | 0.65 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HOP.11 | 0.82 | 1.00 | 1.00 | 1.00 | 1.00 | 0.03 | 0.02 | 0.01 | 0.00 | 0.00 |
| HIP.11 | 0.88 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 |
| HUP.11 | 0.82 | 1.00 | 1.00 | 1.00 | 1.00 | 0.02 | 0.00 | 0.01 | 0.00 | 0.00 |
4 Random Walks on the Model Space
When the model space ℳ is too large to enumerate, a Metropolis-Hastings (MH) algorithm can be utilized to generate a dependent sample of models from the model space posterior. The structure of ℳ both presents difficulties and provides clues on how to build algorithms to explore it. Several kernels can be implemented in such an MH algorithm, three of which are outlined in this section and are based on local, global, and intermediate steps. Combining the different strategies (Tierney, 1994), e.g., by mixing Markov chain kernels, allows the MH algorithm to explore the model space thoroughly and quickly.
For homoscedastic normal regression using intrinsic priors, the marginal probability of the observed y can be computed efficiently for any proposed model M. Because of this, we bypass the intricacies of reversible jump and pseudo-prior methods (Carlin and Chib, 1995; Green, 1995). We note that when good approximations to the marginals are available, for example through the Laplace approximation (DiCiccio et al., 1997; Tierney and Kadane, 1986), then the random walk algorithm can be implemented using them.
In this section, we succinctly describe the stochastic search algorithm proposed to explore ℳ. We detail three kernels designed with distinct scopes. Although the default setting yields an automatic and sensible specification for the search algorithm, details related to tuning are discussed. Finally, we provide a brief discussion on the estimators used for p(M|y).
4.1 Proposal kernels
Global Jump
In order to keep the chain from getting stuck in a local mode, the global jump kernel generates a model independently at random from the prior distribution. The MH correction is the Bayes factor of the proposed model against the current model.
Local Jump
Given a current model M, the set ℰ(M) ∪ 𝒞(M) contains the nodes whose inclusion can be changed while maintaining the assumed heredity condition of ℳ. The local jump kernel implements a stochastic forwards-backwards proposal kernel conditioned on M. For each α ∈ ℰ(M) the model Mα is defined to be the model M \ {α} and for each α ∈ 𝒞(M) it is defined to be M ∪ {α}. The proposal kernel is supported only on the set of models ℳM = {M} ∪ {Mα : α ∈ ℰ(M) ∪ 𝒞(M)}. Let |ℳM| = N and p(Mα|y, X, ℳM) be the renormalized posterior probability of Mα restricted to ℳM. Mα is proposed with probability p(Mα|y, X, ℳM)/2 + 1/(2N). The proposal kernel is biased towards models with higher posterior probability, and each model has probability at least 1/(2N) of being proposed. Thus, even models with small posterior probability are proposed during iterations of the random walk and the probability of proposing any one model is bounded above by 0.5(1 + 1/N).
When at model M, this local jump kernel uses m(y|Mα) for all Mα ∈ ℳM to compute the proposal probabilities. One can store m(y|Mα) for use in future proposal kernels, as well as in a renormalization estimator of the full posterior distribution, even if Mα is never proposed or visited during the random walk. Thus, the balanced random walk keeps the sampler near models with high posterior probability while providing information about models close to them.
Intermediate Jump
The intermediate jump proposes a model by making proposals at each order, either increasing the order from J₀ to J or decreasing it from J to J₀. Suppose that the chain is currently at M. The algorithm creates an intermediate model at each order k. Let ℳ′k denote the set of models that can be formed from the current intermediate model M′ by adding or removing nodes of order k, or by remaining at M′, while maintaining the assumed HC. When proposing an intermediate model from ℳ′k, the proposal density utilized is the same as that for the local jump restricted to the proposal set.

If increasing the order, the algorithm starts at M and, for k = J₀, …, J, proposes models from ℳ′k sequentially with k increasing; the final proposed model M′ is the model produced at order J. The decreasing-order kernel is defined analogously and is the reverse of the increasing-order kernel. The MH correction accounts for the posterior probabilities of the proposed model M′ and the current model M as well as the proposal probabilities, which are computed from the proposal probabilities at each order.

For these intermediate steps, the use of the local jump proposal restricted to nodes of a given order provides a means of proposing more models than the standard local jump (and thus storing more marginals of y conditioned on specific models). The final proposed model M′ can differ from M by as few as zero nodes or as many as J − J₀ + 1 nodes (one for each order). This allows quicker exploration of the model space (in terms of the number of iterations) while requiring more computation at each iteration.
Algorithm Tuning
The sampling algorithm outlined in this paper can be tuned in two ways. First, the weights in the convex combination of the restricted posterior and restricted uniform proposal in the local and intermediate jumps can be modified. When the weight on the restricted posterior is near one, the kernel is biased towards proposing models with high posterior probability. Conversely, when the weight is near zero, the kernel proposes models uniformly at random from the restricted space. The tradeoff is between proposing models with a high probability of being accepted versus increasing the probability of proposing low probability models and potentially moving to another part of the model space that might also contain high probability models. As a default, weights of 1/2 are used.
Second, the relative frequency of utilizing the global, local, and intermediate jumps as proposal kernels can be modified. The global jump has relatively low probability of proposing good models, but can provide large moves across the model space. In contrast, the local jump has high probability of proposing models that will be accepted, but will move slowly around the model space. The intermediate jump allows for larger moves than the local jump, but at larger computational costs. The local jump should be used the most often with the global jump used the least to ensure that the algorithm does not get stuck in a local neighborhood of the model space. As a default, the kernels are used with frequency 1/2, 2/5, and 1/10 for the local, intermediate, and global jumps, respectively.
4.2 Posterior Estimators
The MH sampler is uniformly ergodic because the model space is finite, and one can employ either renormalization or visit frequency to compute posterior quantities of interest, such as model posterior probabilities (Garcia-Donato and Martinez-Beneito, 2013). When the model space is large, the number of iterations required to obtain good estimators of posterior probabilities with visit frequencies can be too large to be practically feasible.
An alternative renormalization framework can be implemented that utilizes not only models visited during the random walk, but also any models used in proposal kernels during the random walk. Kernels that utilize aspects of the model space posterior provide the necessary marginals, m(y|M, ℳ), to compute estimators that incorporate a larger portion of the model space than visit frequency estimators. Thus, the implemented renormalization estimators provide smaller total variation distance to the posterior than obtained using frequency estimators. Of course, this produces an efficiency trade-off. The transition kernels that propose a subset of models ℳM ⊂ ℳ when at a model M can require the computation (or lookup) of m(y|M′) for all M′ ∈ ℳM, which has more cost than simple random proposals.
To compare the ability of the algorithm to estimate the model posterior probabilities through the proposed renormalization approach and frequency-based estimation, we explore the posterior probabilities obtained from the random walk performed for the example in Section 3.3. In particular, we follow the approach suggested in Chib (1995), where the entire model space is enumerated and the true model posterior probabilities calculated for each model. The true posterior probabilities were then compared to the estimated ones.
Table 4.1 shows the total variation distance (TVD) between the resulting estimators and the actual model posterior probabilities. In the least favorable scenario (n = 30 and 5,000 iterations), the renormalization and frequency estimators produced mean TVDs of 0.0079 and 0.2625, respectively. In the most favorable scenario (n = 200 and 50,000 iterations), the resulting TVDs were 0.0007 and 0.2149. Despite the uniform ergodicity of the sampler, the frequency estimates converge slowly, whereas renormalization provides a good approximation to the true posterior probabilities.
Table 4.1.
Mean TVD between true model posterior probabilities and estimators.
| n | Renorm., 5,000 iter. | Renorm., 10,000 iter. | Renorm., 50,000 iter. | Freq., 5,000 iter. | Freq., 10,000 iter. | Freq., 50,000 iter. |
|---|---|---|---|---|---|---|
| 30 | 0.0079 | 0.0073 | 0.0062 | 0.2625 | 0.2564 | 0.2540 |
| 100 | 0.0013 | 0.0012 | 0.0012 | 0.2470 | 0.2406 | 0.2325 |
| 200 | 0.0010 | 0.0008 | 0.0007 | 0.2165 | 0.2157 | 0.2149 |
5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed to determine the effects of the sample size, the signal-to-noise ratio, the signal allocation, the heredity assumption for the true model, and the heredity assumption for the model space. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The cardinality of this model space is |ℳ| > 1.2 × 10²², which makes enumeration of all models infeasible. The entries of the matrix of main effects are generated as independent standard normal variates. The response vectors are drawn from the n-variate normal distribution as y ~ Nn(XMT βMT, In), where MT is the true model. The highest-order term in the true model has order three.
We consider three different values for n (130, 260, 1040) and for the signal-to-noise ratio (SNR = 0.25, 1, 4). Three different signal allocations are investigated. The first places signal equally across orders one, two, and three. The second decreases the signal strength by half with each increase in order. The third increases the signal strength by a factor of two with each increase in order. The largest true model considered is displayed in Figure 5.1 and satisfies the SHC. The second choice for MT removes the nodes x1x2 and x2x5 from the graph to form a WHM that is not a SHM. Similarly, the third choice of MT removes x2x5 and one further node from the graph, producing a model that does not satisfy the WHC. Regardless of the HC assumed for ℳ, this choice of MT leads to a unique MT̃ ∈ ℳ of smallest cardinality nesting MT.
Figure 5.1.
MT : DAG of the largest true model used in simulations.
We drive a random walk through the model space using the HIP(1, 1) prior distribution and compute renormalization estimators of posterior probabilities for the HIP, HUP, HOP, HLP, and HTP priors with choice of (a, b) ∈ {(1, 1), (1, ch)} as well as the EPP. The total number of combinations considered for SNR, sample size, regression coefficient values, and nodes in MT amounts to 162 different scenarios. Each scenario was run with 100 independently generated datasets for 10, 000 iterations of the random walk. The results presented in this section pertain to median results across the datasets regarding the true and false positive rates of the highest posterior probability model, posterior probability of the true model, and rank of the true model. Because small n and/or low SNR lead to underfitting, we also consider the number of simulations in which the true model was not visited. Figures providing the necessary simulation summaries are provided in Appendix D of Supplementary Materials.
SNR and sample size effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true nonzero coefficients. Across the simulations with n = 130 and SNR= 0.25, only the EPP did a good job of recovering true positives. For example, assuming that MT is a SHM and that signal is allocated equally across orders, the EPP has median true positive rates of 2/3 and 1/3 when assuming the SHC and WHC for ℳ, respectively. The false positive rates were 0.43% and 0.36% under the EPP, respectively. In contrast, the hierarchical priors produced no false positives, but only the HIP(1, 1) of Chipman (1996) combined with the SHC assumption for ℳ obtained a nonzero true positive rate, which was 1/9.
For low SNR and large sample size (n = 1040), all of the priors choose MT as the highest posterior model (HPM) under the SHC. The major differences come from the allocation of posterior probability to MT. Among the hierarchical priors, the HUP(1, ch) gives MT the highest posterior probability, 87%. In contrast, the EPP assigns only 8% posterior probability to MT. Under the WHC, the highest true positive rate falls to 78%, achieved by both the EPP and the HTP(1, ch). The contrast comes from their ranks for MT, which are 1,426 and 582, respectively. The best rank was 262, achieved by the HOP(1, ch). Even in the best-case scenario (n = 1040 and SNR = 4), the true positive rate under the WHC did not increase, whereas the SHC provides perfect selection under the same conditions. These results together provide a strong argument for the SHC over the WHC. Under the SHC, the HIP(1, 1) performed best when the SNR was low, followed by the HLP and HTP with the same hyperparameter choice for moderate sample sizes. The true model was not found 30% of the time under the SHC and was never found under the WHC.
Coefficient magnitude
For this comparison, we focus on n = 260 and SNR = 1, but with MT satisfying the WHC and not the SHC. First, we consider the different priors under the SHC. When the signal is concentrated more on the higher-order terms or distributed evenly across the orders, the hierarchical priors perform admirably, with the exception of the HUP(1, ch). In general, the ranking and posterior probability of the true model are improved slightly by setting b = ch, but the strong control that this choice provides over false positives decreased the median true positive rate for that prior. When the signal is concentrated towards lower-order terms, the HTP(1, 1) performed the best, but only achieved a 2/3 true positive rate. Similar to the discussion for SNR and sample size, the WHC gives generally weaker inference than the SHC. The trends tend to follow those for the SHC assumption. One notable difference is that the b = ch priors perform better than or comparably to their b = 1 counterparts, with the exception of the HOP.
A general pattern is seen with the hierarchical priors under the SHC or WHC: evidence in favor of the null is lowest when the signal is concentrated on lower-order terms. Intuitively, this empirical result makes sense. Because we are penalizing the addition of higher-order terms, their signal needs to be larger in order to increase the evidence in favor of inclusion from Bayes factors. Concentration of the signal on higher-order terms provides the most inferential power under the SHC due to its strict parental inclusion requirements.
Special points on the scale
Across the range of examples, assuming the SHC provides better inference than assuming the WHC in terms of true positives, false positives, and posterior probability and rank of the true model. Even when the true model does not satisfy the SHC (or any HC at all), the HIP, HTP, and HLP under the SHC assumption for ℳ provide the most balanced inference in favor of the pseudo-true model MT̃. Subsequent inference about the influence of terms in MT̃ can be carried out through estimation and consideration of credible intervals.
6 Method comparison: Ozone data analysis
We analyze the ozone data from Breiman and Friedman (1985) to assess the performance of our procedures. We follow the analysis of Liang et al. (2008), though we only use intrinsic priors to obtain our Bayes factors. After removing observations with missing values, the data contains 330 observations. A subset of 165 observations was sampled uniformly at random and used for variable selection, and the remaining data were used for validation. We predict daily measurements of maximum ozone concentration near Los Angeles with eight meteorological variables. The meteorological variables and their two-way interactions and squares are used in MF, which has 44 terms. The base model, MB, is the intercept-only model. There are 7.1 × 1010 and 6.5 × 1011 models under the SHC and WHC, respectively.
The priors are assessed through their predictive accuracy on the validation dataset with the prediction root mean squared error (PRMSE). Results are provided both for the highest posterior probability model (HPM) and under model averaging of the 500 highest posterior probability models. Additionally, we compare the performance of the priors developed in this paper to the lasso for hierarchical interactions (Bien et al., 2013), implemented using the R package hierNet. The penalty parameter was obtained by minimizing the estimated cross-validation error in ten-fold cross-validation. Because the selected penalty parameter depends on the random cross-validation partition, the cross-validation procedure was replicated 100 times. The results of this analysis are contained in Table 6.1. The HPMs referenced in Table 6.1 and the marginal inclusion probabilities for all 44 terms are given in Appendix E of the Supplementary materials.
Table 6.1.
Predicted RMSE under the SHC and WHC for the highest probability model (HPM) and under model averaging (Ave).
| Prior | pars | HC | PRMSE (HPM) | PRMSE (Ave) | Post. prob. MHPM | Post. prob. (top 500) | Size of MHPM | MHPM |
|---|---|---|---|---|---|---|---|---|
| HIP | (1,ch) | SH | 4.2 | 4.128 | 0.278 | 0.999 | 6 | M(1) |
| HIP | (1,ch) | WH | 4.593 | 4.483 | 0.712 | 0.999 | 3 | M(2) |
| HIP | (1,1) | SH | 4.2 | 4.129 | 0.278 | 0.998 | 6 | M(1) |
| HIP | (1,1) | WH | 4.593 | 4.212 | 0.605 | 0.997 | 3 | M(2) |
| HOP | (1,ch) | SH | 4.2 | 4.141 | 0.249 | 0.991 | 6 | M(1) |
| HOP | (1,ch) | WH | 4.593 | 4.125 | 0.412 | 0.985 | 3 | M(2) |
| HOP | (1,1) | SH | 4.2 | 4.141 | 0.251 | 0.993 | 6 | M(1) |
| HOP | (1,1) | WH | 4.593 | 4.125 | 0.412 | 0.986 | 3 | M(2) |
| HUP | (1,ch) | SH | 4.2 | 4.12 | 0.256 | 0.993 | 6 | M(1) |
| HUP | (1,ch) | WH | 4.593 | 4.354 | 0.703 | 0.995 | 3 | M(2) |
| HUP | (1,1) | SH | 4.2 | 4.121 | 0.256 | 0.994 | 6 | M(1) |
| HUP | (1,1) | WH | 4.593 | 4.195 | 0.635 | 0.99 | 3 | M(2) |
| HLP | (1,ch) | SH | 4.2 | 4.142 | 0.237 | 0.991 | 6 | M(1) |
| HLP | (1,ch) | WH | 4.593 | 4.17 | 0.344 | 0.99 | 3 | M(2) |
| HLP | (1,1) | SH | 4.2 | 4.143 | 0.237 | 0.991 | 6 | M(1) |
| HLP | (1,1) | WH | 6.459 | 4.223 | 0.267 | 0.98 | 4 | M(3) |
| EPP | – | SH | 4.2 | 4.151 | 0.024 | 0.423 | 6 | M(1) |
| EPP | – | WH | 4.21 | 4.161 | 0.018 | 0.359 | 5 | M(4) |
| hierNet | – | SH | 4.349 | – | – | – | 24 | M(5) |
| hierNet | – | WH | 4.371 | – | – | – | 22 | M(6) |
The main results can be summarized as follows. First, the lowest PRMSE for a HPM comes from the SHC with the hierarchical priors or the EPP. Second, the hierarchical priors provide stronger posterior concentration than the EPP. Third, HPMs under the SHC yield a lower PRMSE than those obtained under the WHC. Fourth, models found with hierNet are considerably larger than those resulting from the Bayesian procedures; however, hierNet produces slightly larger PRMSEs, suggesting that these larger models overfit the data. Fifth, model averaging PRMSEs slightly improve upon those from the HPMs. Sixth, smaller selected models are obtained under the WHC, as one expects given its relaxed inclusion condition; however, these selected models have larger PRMSE, providing another argument in favor of the SHC over the WHC.
Interestingly, the HPM PRMSE under the WHC for the HLP(1, 1) is the largest observed. However, the model averaging PRMSE indicates that the full Bayesian analysis using this prior still behaves well. The second highest probability model is M(2), which has a comparable posterior probability (0.20) to that of the HPM (0.267).
7 Discussion
In this paper, we investigated prior structures for polynomial regression assuming the strong and weak heredity conditions and developed random walk samplers to draw from the model space. We extended the priors developed in Chipman (1996) by completing the Bayesian specification as well as developing priors that utilize exchangeability conditions. These hierarchical priors have similar posterior behavior, with the HOP, HLP, and HTP providing hierarchical versions of the multiplicity correction and complexity penalizing priors, which help immensely with model selection when the polynomial surface has a high degree. When the number of main effects is small or the signal-to-noise ratio is low, the HIP provides better model selection properties. However, it is ill-suited for regressions with a large number of predictors.
In addition to these results from the simulations, Theorems 1 and 2 provide a further theoretical argument in favor of strong — over weak — heredity. Through the analysis of the ozone data, we can also see that the WHC leads to higher prediction RMSE than the SHC. This is related to the theoretical result, which shows that the SHC concentrates posterior probability on a unique model, whereas the WHC might not.
The computational algorithm described in the paper takes advantage of the hierarchical nature of the model space and provides renormalization estimators that incorporate a large portion of the model space. Through a simulation experiment, we have shown that these estimators have smaller total variation distance to the true posterior than visit frequency estimators. We provide the software necessary to carry out a Metropolis-Hastings random walk on these model spaces through the R package varSelectIP.
The methods in this paper can be expanded to generalized linear mixed models and other modeling contexts. For example, for binary data the methods proposed by Albert and Chib (1993) can be implemented using intrinsic priors on the parameter space (Leon-Novelo et al., 2012) and with the proposed hierarchical priors on the model space.
The theoretical results of Section 2.1 can be easily extended to encompass these classes of models under suitable regularity conditions. Efficient computation or accurate approximation of the marginal probability of the data under competing models is required to employ the MH algorithm we have developed. However, because the focus of the paper is on the model space prior, the methods discussed here can be incorporated into more complicated settings where more elaborate MCMC methods are necessary.
Supplementary Material
References
- Albert J, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–679.
- Berger J, Pericchi L. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association. 1996;91(433):109–122.
- Berger J, Pericchi L, Ghosh J. Objective Bayesian methods for model selection: introduction and comparison. In: Model Selection. IMS Lecture Notes Monograph Series, vol. 38. Institute of Mathematical Statistics; 2001. pp. 135–207.
- Bien J, Taylor J, Tibshirani R. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41(3):1111–1141.
- Breiman L, Friedman J. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association. 1985;80:580–598.
- Brusco MJ, Steinley D, Cradit JD. An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics. 2009;51(3):306–315.
- Carlin B, Chib S. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B. 1995;57(3):473–484.
- Chib S. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association. 1995;90(432):1313–1321.
- Chipman H. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24(1):17–36.
- Chipman H. Fast model search for designed experiments with complex aliasing. In: Quality Improvement Through Statistical Methods. 1998. pp. 207–220.
- Chipman H, Hamada M, Wu C. A Bayesian variable-selection approach for analyzing designed experiments with complex aliasing. Technometrics. 1997;39(4):372–381.
- Choi NH, Li W, Zhu J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105(489):354–364.
- DiCiccio TJ, Kass RE, Raftery A, Wasserman L. Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association. 1997;92:903–915.
- Garcia-Donato G, Martinez-Beneito M. On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association. 2013;108(501):340–352.
- George E. The variable selection problem. Journal of the American Statistical Association. 2000;95(452):1304–1308.
- Girón FJ, Moreno E, Casella G, Martínez ML. Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas. 2010;104(1):57–67.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- Griepentrog GL, Ryan JM, Smith LD. Linear transformations of polynomial regression models. American Statistician. 1982;36(3):171–174.
- Ishwaran H, Rao JS. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics. 2005;33:730–773.
- Jeffreys H. Theory of Probability. 3rd ed. London: Oxford University Press; 1961.
- Kass R, Raftery A. Bayes factors. Journal of the American Statistical Association. 1995;90:773–795.
- Kass RE, Wasserman L. The selection of prior distributions by formal rules. Journal of the American Statistical Association. 1996;91(435):1343–1370.
- Khuri A. Nonsingular linear transformations of the control variables in response surface models. Technical Report; 2002.
- Leon-Novelo L, Moreno E, Casella G. Objective Bayes model selection in probit models. Statistics in Medicine. 2012;31(4):353–365.
- Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association. 2008;103(481):410–423.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
- Mitchell T, Beauchamp J. Bayesian variable selection in linear regression. Journal of the American Statistical Association. 1988;83:1023–1032.
- Nelder JA. Reformulation of linear models. Journal of the Royal Statistical Society, Series A. 1977;140:48–77.
- Nelder JA. The selection of terms in response-surface models - how strong is the weak-heredity principle? American Statistician. 1998;52(4):315–318.
- Nelder JA. Functional marginality and response-surface fitting. Journal of Applied Statistics. 2000;27(1):109–112.
- Peixoto JL. Hierarchical variable selection in polynomial regression models. American Statistician. 1987;41(4):311–313.
- Peixoto JL. A property of well-formulated polynomial regression models. American Statistician. 1990;44(1):26–30.
- Scott JG, Berger JO. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics. 2010;38(5):2587–2619.
- Tierney L. Markov chains for exploring posterior distributions. The Annals of Statistics. 1994;22(4):1701–1728.
- Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association. 1986;81:82–86.
- Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM. Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics. 2010;4(3):1342–1364.
- Yuan M, Joseph VR, Zou H. Structured variable selection and estimation. The Annals of Applied Statistics. 2009;3(4):1738–1757.
- Zellner A, Siow A. Posterior odds ratios for selected regression hypotheses. In: Bernardo JM, DeGroot MH, Lindley DV, Smith AFM, editors. Bayesian Statistics: Proceedings of the First International Meeting. Valencia: University of Valencia Press; 1980. pp. 585–603.