Abstract
This paper investigates Bayesian variable selection when there is a hierarchical dependence structure on the inclusion of predictors in the model. In particular, we study the type of dependence found in polynomial response surfaces of orders two and higher, whose model spaces are required to satisfy weak or strong heredity conditions. These conditions restrict the inclusion of higher-order terms depending upon the inclusion of lower-order parent terms. We develop classes of priors on the model space, investigate their theoretical and finite sample properties, and provide a Metropolis-Hastings algorithm for searching the space of models. The tools proposed allow fast and thorough exploration of model spaces that account for hierarchical polynomial structure in the predictors and provide control of the inclusion of false positives in high posterior probability models.
Keywords: Markov Chain Monte Carlo, intrinsic prior, model priors, multiple testing, multiplicity penalization, well-formulated models, strong heredity, weak heredity
1 Background
In modern regression problems, choosing a subset of good predictors from a large pool is standard practice, especially given the pervasiveness of big data problems. Such analyses require multiple testing corrections in order to attain reasonable Type I error rates as well as control of false positives. This multiple testing problem is further complicated when it is critical to incorporate interactions or powers of the predictors. Not only can there be a large number of predictors, but the structure that underlies the space of predictors must be taken into account. For regressions with a large number of predictors or high degree, the complete model space is too large to enumerate, and automatic stochastic search algorithms are necessary to find appropriate parsimonious models.
More than two decades ago, Peixoto (1990) exposed the relevance of respecting polynomial hierarchy among covariates in the variable selection context. To make model selection invariant to coding transformations in polynomial interaction models (e.g., to centering of the main effects), the selection must be constrained to the subset of models that fully respect the polynomial hierarchy (Griepentrog et al., 1982; Khuri, 2002; McCullagh and Nelder, 1989; Nelder, 2000; Peixoto, 1987, 1990); such models satisfy a strong heredity condition (SHC). Succinctly, a model satisfies the SHC if, for any predictor in the model, every lower-order predictor associated with it is also in the model; an example is the model containing the intercept, x1, x2, and x1x2. Such models are referred to as strong heredity models (SHMs).
For the inclusion of a given term, a relaxation of the SHC requires that only a subset of lower-order terms is also included. This broader class of models does not attain the invariance properties of SHMs but can represent scientifically valid assumptions (Nelder, 1998). A particular class of this type of model satisfies a weak heredity condition (WHC) and models in this class are referred to as weak heredity models (WHMs). SHMs are examples of WHMs, but the WHM class is larger. A model including the intercept, x1, and x1x2 is a WHM, but not a SHM. Specific definitions of the heredity conditions for defining SHMs and WHMs in the context of this paper are given in Section 2.
Although research on this topic started more than three decades ago (Nelder, 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by the polynomial hierarchy. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multistep procedures (Brusco et al., 2009; Peixoto, 1987), regularized regression methods (Bien et al., 2013; Choi et al., 2010; Yuan et al., 2009), and Bayesian approaches (Chipman, 1996, 1998; Chipman et al., 1997). This paper addresses the analysis of models satisfying a heredity condition (HC) from a Bayesian perspective.
Contributing to the Bayesian literature on model selection, we focus on both the effect of the assumption of a HC as well as the influence of the prior for the model space. We prove a new theoretical result that provides a further argument in favor of the SHC over the WHC. We show that under the SHC assumption for a model space the posterior asymptotically concentrates on a single model, whereas this need not be the case for the WHC. We then modify and extend the model space priors developed in Chipman (1996) to provide classes of prior distributions incorporating assumptions about independence or exchangeability of term inclusion.
The Bayesian variable selection problem consists of comparing models M in a model space ℳ using their posterior probabilities, given by p(M|y, ℳ) ∝ m(y|M)π(M|ℳ). Model posterior probabilities depend on the model space prior as well as on the priors for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger and Pericchi, 1996; Berger et al., 2001; George, 2000; Jeffreys, 1961; Kass and Wasserman, 1996; Liang et al., 2008; Zellner and Siow, 1980). In contrast, the effect of the prior on the model space has until recently been neglected. Scott and Berger (2010) highlighted the impact of priors on the model space in the context of multiple testing. The Ockham's-razor effect implicit in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models but does not account for the cardinality or structure of the model space. The required multiplicity penalty is "hidden away" in the model prior probabilities π(M|ℳ) (Scott and Berger, 2010). Wilson et al. (2010) motivated the need for stronger penalization than pure combinatorics in the context of big data. Adequately formulated priors on the model space can both account for structure in the predictors and provide additional control over the detection of false positive terms. In contrast, the popular uniform prior over the model space carries the undesirable and "informative" implication of favoring models that include half of the number of terms in the largest model in ℳ (Wilson et al., 2010).
When conducting variable selection within these model spaces, two relevant issues must be taken into account. First, the notion of model complexity takes on a new dimension. Complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the predictor associations defined by the polynomial hierarchy. Second, because the model space is defined by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
To the best of our knowledge, this is the only Bayesian framework for exploring polynomial model spaces using priors that incorporate penalties both for the number of predictors and for the complexity introduced by the polynomial structure of the model space. Although we focus on polynomial response surface regression models with independent homoscedastic normal errors, the proposed methods can be extended to generalized linear models, linear mixed models, and generalized linear mixed models. The only difficulty in such extensions is the computation of Bayes factors, which is most straightforward for the homoscedastic normal regression problem. We therefore focus on the prior structures that can be defined on the model space itself and on the effect of these priors on subsequent inference.
The manuscript is organized as follows. In Section 2, we review the hierarchical structure of polynomial regression and associated model spaces in terms of directed acyclic graphs (DAGs), facilitating the definition of the priors and random walks on the model spaces. We propose, investigate, and provide recommendations for types of prior distributions on the model spaces in Section 3. Section 4 provides a Metropolis-Hastings algorithm that takes advantage of the heredity structure. In Section 5, the performance of the proposed methods is explored through simulation studies. We illustrate the algorithm on real data and compare the results with other procedures from current literature in Section 6. Finally, Section 7 contains recommendations and discussion. Supplementary materials contain details about the methods used and additional simulation results.
2 Polynomial models and model spaces
Suppose that the observations y = (y1, …, yn) are modeled using the polynomial regression

y = ∑_{α ∈ ℕ₀^p} βα x^α + ε, (1)

where x^α = x1^α1 ⋯ xp^αp, ℕ₀ is the set of natural numbers including 0, ε ~ N(0, σ²I), and only a finite number of the regression coefficients are nonzero. As an illustration, consider a model that includes polynomial terms incorporating covariates x1 and x2 only. The terms x2² and x1²x2 can be represented by α = (0, 2) and α = (2, 1), respectively.
The set ℕ₀^p constitutes a partially ordered set. It is ordered through the binary relation "⪯" between pairs (α, α′), which is defined by α′ ⪯ α whenever α′j ≤ αj for all j = 1, …, p. Additionally, α′ ≺ α if α′ ⪯ α and α′j < αj for some j. The order of a term is given by the sum of its elements, order(α) = ∑αj, the length of α is given by the number of nonzero αj, and the type of α is defined by the increasing sequence of nonzero elements of α. If order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by 𝒫(α) = {α′ : α′ → α}, which is the set of nodes that immediately precede α. Similarly, the children set of α is defined as 𝒞(α) = {α′ : α → α′}. The poset can be represented by a directed acyclic graph (DAG), denoted by Γ, with directed edges to a node from its parents.
A polynomial model M satisfies the SHC if α ∈ M implies that 𝒫(α) ⊂ M, and M satisfies the WHC if α ∈ M implies α′ ∈ M for some α′ ∈ 𝒫(α). Any SHM M is represented by a subgraph Γ(M) of Γ with the property that if node α ∈ Γ(M), then all nodes pointing to α are also in Γ(M). A WHM graph can include a node α if there is a path from the intercept node to α. Figure 2.1 illustrates instances of these models.
Figure 2.1.
Examples of graphs for different heredity conditions.
The motivation for considering only SHMs is compelling. The subspace of y modeled by a polynomial surface, given by the hat matrix HM, is invariant to linear transformations of the covariates used in the surface if and only if M corresponds to a SHM (Peixoto, 1990). For example, if p = 2 and y = β(0,0) + β(1,0)x1 + β(0,1)x2 + β(1,1)x1x2 + ε, then the hat matrix is invariant to any covariate transformation of the form (x1 x2)A + 1b′ for any diagonal real nonsingular 2 × 2 matrix A and any real column vector b of length two. In contrast, if y = β(0,0) + β(1,0)x1 + β(1,1)x1x2 + ε, the hat matrix formed after applying the transformation x1 ↦ x1 + 1c for real c ≠ 0 is not the same as the hat matrix formed with the original x1, making the selection depend on how the variables are coded. Nelder (1998) demonstrated that the conditions under which weak heredity yields a design matrix invariant to coding are seldom met in practice. A WHM that does not obey the SHC constrains the regression model by enforcing "special points" (e.g., an intercept fixed at zero or a zero-gradient point specified a priori). Special points arise in situations where omitting lower-order terms is justified by the nature of the data and are valid in restricted scientific contexts, but they are the exception rather than the rule (Nelder, 1998). Theorems 1 and 2 in Section 2.1 provide further theoretical justification for restricting attention to SHMs.
The spaces of models, ℳ, considered in this paper are characterized by two SHMs: MB, the base model, and MF, the full model, which we assume is fixed and finite. The base model MB consists of terms that are not subject to selection and is nested in the full model MF. For example, MB can incorporate covariates describing the structure of an experimental design. The model space ℳ is populated by all models M that satisfy the desired HC, contain MB, and are nested in MF. To define the models explicitly, let ϒ(M) = M \ MB and define the sets of extreme nodes and children nodes of M by ℰ(M) = {α ∈ ϒ(M) : M \ {α} satisfies the HC} and 𝒞(M) = {α ∈ ϒ(MF) \ ϒ(M) : M ∪ {α} satisfies the HC}, respectively. Under the SHC, M ∈ ℳ is uniquely determined by considering either ℰ(M) or 𝒞(M). Under the WHC, M ∈ ℳ is determined through knowing both ℰ(M) and 𝒞(M).
2.1 Theoretical posterior considerations
We now provide results regarding the posterior of the model space when the true model MT is nested in MF and a HC is assumed for ℳ; the results are all asymptotic as dim(y) = n → ∞. We require that the design matrix of the full model, XFull (with higher-order terms included), satisfies lim_{n→∞} n⁻¹ X′Full XFull = Ω, where Ω is positive definite. If a factor is considered as a covariate, the inclusion of one of its levels forces the inclusion of all of its levels; columns are dropped when collinearity due to factor variables or their interactions is present.
The Bayes factors for homoscedastic normal regression are obtained using intrinsic priors. The learning rate of model MT versus model M is exponential in n when MT ⊈ M and proportional to n^{(|M|−|MT|)/2} when MT ⊆ M. Specifically, up to O(1) terms,

log BF(MT : M) = (n/2) δ(MT, M) + ((|M| − |MT|)/2) log n,

where δ(MT, M) ≥ 0 is a directed distance of MT from M that equals zero if and only if MT ⊆ M (Girón et al., 2010). Note that these learning rates are achieved across large classes of parametric Bayesian models with fixed prior distributions, making the results of these theorems quite generalizable (Kass and Raftery, 1995). In the theorems, we assume that the prior probabilities of all models satisfying the assumed HC are nonzero; the asymptotic results are independent of the particular positive prior assumed for ℳ.
Theorem 1
Suppose that MT ⊆ MF is the true model, which may or may not be in ℳ, and assume that ℳ satisfies the SHC. Let MT̃ = {α′ ∈ MF : α′ ⪯ α for some α ∈ MT} ∪ MB. Then MT ⊆ MT̃ ∈ ℳ and p(MT̃|y, X, ℳ) → 1.
The result in Theorem 1 depends only on the fact that the SHC is assumed for ℳ and not on whether MT satisfies a HC (only that MT ⊆ MF). Note that if MT ∈ ℳ then MT̃ = MT. When only the WHC is assumed for ℳ, a similar result holds.
Theorem 2
Suppose that MT ⊆ MF is the true model, which may or may not be in ℳ, and assume that ℳ satisfies the WHC. Let t = min{|MT̃| : MT ⊆ MT̃ ∈ ℳ} and ℳT = {MT̃ : MT ⊆ MT̃ ∈ ℳ and |MT̃| = t}. Then p(ℳT|y, X, ℳ) → 1. Further, if MT satisfies the WHC, then |ℳT| = 1.
In Theorem 2, it is interesting to note what happens when MF ⊇ MT ∉ ℳ. If ℳ satisfies the WHC, there can be more than one model in ℳT. The Bayes factor between models M, M′ ∈ ℳT is asymptotically of the form exp{(χ²_{M′} − χ²_{M})/2}, where the χ² random variables have t − |MT| degrees of freedom and are not necessarily independent. Thus, though the posterior probability of ℳT does converge to 1, no unique model is selected unless |ℳT| = 1. An example of this comes from a true model given by y = β(1,1)x1x2 + ε. In this case, the two smallest models satisfying the WHC that nest MT are given by MT̃,1 = {(0, 0), (1, 0), (1, 1)} and MT̃,2 = {(0, 0), (0, 1), (1, 1)}. Because y is not correlated with x1 or x2 conditioned on the inclusion of x1x2, the asymptotic posterior probabilities of MT̃,1 and MT̃,2 are random.
These theorems amount to Cromwell’s Rule and a statement about Bayesian learning rates. The posterior concentrates on a set of models ℳT which are of the smallest size, contain the true model MT, and have nonzero prior probability. The difference between Theorem 1 and Theorem 2 comes from the fact that the SHC places zero prior probability on models that satisfy the WHC but not the SHC. This provides another theoretical argument in favor of the SHC over the WHC. Under the SHC, the posterior concentrates on a single best model within the model space, whereas realizations of the χ2 random variables can produce arbitrary distinctions between the models in ℳT under the WHC.
3 Priors on the model space
In this section, we develop different prior structures on model spaces defined by a HC, discuss their advantages and disadvantages, and describe reasonable choices for their hyperparameters. Our hyperparameter choices are motivated by the multiplicity prior from Scott and Berger (2010) and the penalization prior of Wilson et al. (2010). We investigate how the choices of prior structure and hyperparameter affect the prior and posterior probabilities for predictor inclusion.
3.1 Model prior definition
For α ∈ ϒ(MF), let γα(M) be the indicator function describing whether α is included in M. Let γν(M) = {γα(M) : order(α) = ν} and γ<ν(M) = {γα(M) : order(α) < ν}. With these definitions, the prior probability of any model M ∈ ℳ can be factored as π(M|ℳ) = ∏_{ν=J₀}^{J} π(γν(M)|γ<ν(M), ℳ), where J₀ and J are, respectively, the minimum and maximum orders of nodes in ϒ(MF), and γ<J₀(M) is taken to be empty. Chipman (1996) simplifies prior distributions on ℳ based on two assumptions. First, if α and α′ are of order j, then γα and γα′ are assumed to be conditionally independent given γ<j. Second, the author invokes immediate inheritance: if order(α) = j, then π(γα(M)|γ<j(M), ℳ) = π(γα(M)|γ𝒫(α)(M), ℳ), where γ𝒫(α)(M) is the inclusion indicator vector for 𝒫(α).
Let πα(M) = π(γα(M) = 1|γ𝒫(α)(M), ℳ) be the conditional inclusion probability of node α in model M. Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is π(M|πℳ, ℳ) = ∏ πα(M)^{γα(M)} (1 − πα(M))^{1−γα(M)}, with πℳ = {πα(M) : α ∈ ϒ(MF), M ∈ ℳ}. Under the SHC, πα(M) = γα(M) = 0 if ∏α′∈𝒫(α) γα′(M) = 0, and under the WHC, πα(M) = γα(M) = 0 if γ𝒫(α)(M) = 0 (i.e., if all parents of α are excluded). Thus, the product can be restricted to the set of nodes α ∈ ϒ(M) ∪ 𝒞(M). Structure can be built into the prior on ℳ by making assumptions about the inclusion probabilities πα(M), such as equality assumptions or assumptions of a hyper-prior for these parameters. We now elaborate upon five model prior definitions, which incorporate various assumptions about the term inclusion indicators and the probabilities of term inclusion. Graphical representations of the priors are provided in Appendix A of the Supplementary materials.
Hierarchical Uniform Prior (HUP)
The HUP assumes that the non-zero probabilities πα(M) are all equal. Specifically, for a model M ∈ ℳ it is assumed that πα(M) = π for all α ∈ ϒ(M) ∪ 𝒞(M). A full Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces πHUP (M|ℳ, a, b) = B(|ϒ(M)| + a, |𝒞(M)| + b)/B(a, b) where B is the beta function. The HUP assigns equal probabilities to all models for which the sets of nodes ϒ(M) and 𝒞(M) have the same cardinality. This prior yields combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space, as models with the same number of terms and children get the same probability irrespective of depth and connectedness of the model’s graph. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because πα(M) = π for all γα(M) that are not forced to be zero, this penalization can only depend on characteristics of the entire graph for MF. One such penalization is to take a = 1 and b = |ϒ(MF)| for every model in ℳ.
Hierarchical Independence Prior (HIP)
The HIP assumes that there are no equality constraints among the non-zero πα(M). Each non-zero πα(M) is given its own prior, which we assume to be Beta(aα, bα). The prior probability of M under the HIP is π_HIP(M|ℳ, a, b) = ∏_{α∈ϒ(M)} aα/(aα + bα) ∏_{α∈𝒞(M)} bα/(aα + bα), where a product over the empty set is taken to be one. Because the πα(M) are independent, any choice of aα and bα is equivalent to choosing a probability of success πα(M) for a given α. Under the SHC, the HIP with parameters aα = bα = 1 is equivalent to the prior proposed in Chipman (1996) with fixed conditional inclusion probability of 1/2 for each term. This choice of hyperparameters accounts for the hierarchical structure of the model space, but essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering this choice when all terms in MF are of order one, in which case it reduces to the equal probability prior (EPP).
An additional penalization for model complexity can be incorporated into the HIP. In the construction of the prior, each vector of inclusion indicators for terms of order j, γj(M), is conditioned on the vector of inclusion indicators for lower-order terms, γ<j(M). Therefore, aα and bα for α of order j can be specified as functions of γ<j(M). One such additional penalization utilizes the number of nodes of order j that could be added without violating the assumed HC. We refer to this particular type of penalty by ch, representing its dependence on (potential) children sets. Choosing aα = 1 and bα(M) = chj(γ<j(M)) is equivalent to choosing a probability of success πα(M) = [1 + chj(γ<j(M))]^{−1}.
Hierarchical Order Prior (HOP)
A compromise between complete equality and complete independence of the πα(M) is to assume equality between the non-zero πα(M) of a given order and independence across the different orders. Define ϒj(M) = {α ∈ ϒ(M) : order(α) = j} and 𝒞j(M) = {α ∈ 𝒞(M) : order(α) = j}. The HOP assumes that πα(M) = π(j)(M) for all α ∈ ϒj(M) ∪ 𝒞j(M). Assuming that π(j)(M) ~ Beta(aj, bj) provides a prior probability of π_HOP(M|ℳ, a, b) = ∏_j B(|ϒj(M)| + aj, |𝒞j(M)| + bj)/B(aj, bj). The specific choice of aj = bj = 1 for all j produces a hierarchical version of the Scott and Berger (2010) multiplicity correction. An additional complexity penalization can be incorporated into the HOP in a similar fashion to the HIP. Given γ<j(M), the number of order-j nodes that could be added is chj(γ<j(M)) = |ϒj(M) ∪ 𝒞j(M)|. Using aj = 1 and bj(M) = chj(γ<j(M)) produces a hierarchical version of the penalization introduced in Wilson et al. (2010).
The fully specified HOP amounts to replacing the conditional independence and immediate inheritance conditions with an order-based exchangeability condition. In particular, the nodes that give rise to chj(M) are exchangeable for each order j, and the γj(M) are independent across different orders (assuming that they satisfy the choice of HC for ℳ).
Hierarchical Length Prior (HLP) and Hierarchical Type Prior (HTP)
Incorporating additional exchangeability assumptions might be of interest to a researcher who, for example, would like to account for the fact that nodes with more connections in a model's graph should be penalized differently from nodes that have fewer connections. To incorporate this type of additional structure we propose the HLP and the HTP. These two priors are equivalent when the maximum order of the nodes in ϒ(MF) is at most three and can differ when it is four or higher.
In particular, the HLP is constructed by making different group exchangeability conditions than the HOP. The γj(M) are assumed independent, but the nodes in ϒj(M)∪𝒞j(M) are taken to be group exchangeable and not merely exchangeable. The sets of nodes in ϒj(M) ∪ 𝒞j(M) are given by (ϒj(M) ∪ 𝒞j(M))ℓ = {α ∈ ϒj(M) ∪ 𝒞j(M) : length(α) = ℓ}. The sets of nodes (ϒj(M) ∪ 𝒞j(M))ℓ are independent across the different length groups but the nodes within a given (ϒj(M) ∪ 𝒞j(M))ℓ are exchangeable. In other words, nodes of a given order with the same number of parents in MF are assumed to be exchangeable, conditioned on their inclusion satisfying the HC of ℳ.
Similarly, the HTP is constructed by assuming node groups defined by (ϒj(M) ∪ 𝒞j(M))t = {α ∈ ϒj(M) ∪ 𝒞j(M) : type(α) = t}. Thus, nodes of a given order and of the same type are assumed to be exchangeable. Consider a full order-four polynomial surface in two variables. The nodes x1⁴ and x2⁴ are grouped together, the nodes x1³x2 and x1x2³ are grouped together, and x1²x2² is grouped by itself. Both the HLP and HTP can incorporate complexity penalizations in the fashion of the HIP or HOP, with ch representing the number of nodes available within a given exchangeable group.
Further penalizations in WHMs
In his discussion of WHMs, Chipman (1996) specifies conditional probabilities on the inclusion of a given node by taking into account the number of parents of that node that are included in the model. As an example, consider the inclusion of x1x2 ≐ (1, 1) in a model M. The probability π(1,1)(M) can take four values based upon the inclusion indicators for its parent terms x1 and x2. Call these probabilities π(1,1)(0, 0), π(1,1)(1, 0), π(1,1)(0, 1), and π(1,1)(1, 1). Under the WHC, the author sets the probabilities to be 0, 0.25, 0.25, 0.5, reflecting the belief that the more parents of x1x2 included in M, the higher π(1,1)(M) ought to be. Such a belief can be incorporated into the HIP by letting aα and bα depend on the number of included and missing parents of α. For example, letting aα = 1 and bα = 1 + 2[|𝒫(α)| − ∑α′∈𝒫(α) γα′(M)] produces the four probabilities that the author set for π(1,1)(M) assuming the WHC.
Similarly, a penalization based on the number of included and missing parents can be incorporated into the HLP and HTP. These priors already group nodes based upon their order and the size of their parent sets. The nodes can be further grouped by the number of parents that are missing from M and aα and bα can be modified depending on γ𝒫(α)(M).
3.2 Prior sensitivity to distributional choices
Each form of the priors introduced in Section 3.1 defines a whole family of model priors, characterized by the probability distribution for the vector of inclusion probabilities πℳ. Here we compare the two choices of hyperparameters described in Section 3.1 for a specific model space. The first assumes that the parameters of the beta distributions are all (1, 1) and is referred to as a = 1, b = 1. The second alternative is referred to as a = 1, b = ch, where b = ch has different interpretations depending on the prior and represents a (potential) children based penalty. Table 3.1 shows the prior probabilities for an order-two surface in two predictors when the base model MB is taken to be the intercept-only model and the SHC is assumed for ℳ. Though this example is certainly not exhaustive, it does help build some intuition for how the different priors and hyperparameter choices behave.
Table 3.1.
Prior probabilities for the space of SHMs associated with the quadratic surface in two main effects.
| # | Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch) | HLP/HTP (1,1) | HLP/HTP (1,ch) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1/4 | 4/9 | 1/3 | 1/2 | 1/3 | 5/7 | 1/3 | 1/2 |
| 2 | 1, x1 | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 | 1/12 | 1/12 |
| 3 | 1, x2 | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 | 1/12 | 1/12 |
| 4 | 1, x1, x1² | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 | 1/12 | 1/12 |
| 5 | 1, x2, x2² | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 | 1/12 | 1/12 |
| 6 | 1, x1, x2 | 1/32 | 3/64 | 1/12 | 1/12 | 1/60 | 1/72 | 1/18 | 1/24 |
| 7 | 1, x1, x2, x1² | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/36 | 1/72 |
| 8 | 1, x1, x2, x1x2 | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/18 | 1/24 |
| 9 | 1, x1, x2, x2² | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 | 1/36 | 1/72 |
| 10 | 1, x1, x2, x1², x1x2 | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/36 | 1/72 |
| 11 | 1, x1, x2, x1², x2² | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/18 | 1/72 |
| 12 | 1, x1, x2, x1x2, x2² | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 | 1/36 | 1/72 |
| 13 | 1, x1, x2, x1², x1x2, x2² | 1/32 | 1/576 | 1/12 | 1/120 | 1/6 | 1/252 | 1/18 | 1/72 |
First, compare the priors for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. Models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any of the quadratic terms. The HUP induces a penalization for model complexity that does not adequately penalize models for including additional terms; MF is given more probability than any other model containing at least one term. This lack of penalization for the full model is a consequence of it being the only model in ℳ of that particular size. That is, this model space distribution favors the base and full models because each is the only model in its respective size class. Similar behavior is observed with the HOP. As models become larger, they are penalized for the increased number of terms. However, once the models become large enough, the number of models of those particular sizes is reduced, producing less combinatorial penalization. In this example, the HLP and HTP coincide and produce more penalization for the inclusion of a square term than for the interaction term.
In contrast, if (a, b) = (1, ch), all the priors produce strong penalization as models become more complex, both in terms of the number and order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, there are clear differences between the priors. The HIP penalizes the full model the most, with the HLP/HTP penalizing it the least. This observation is not unique to this particular choice of MF. Because the HLP/HTP divides the nodes of a given order into smaller groups than the HUP or HOP, the former will always produce less penalization under the (1, ch) hyperparameter choice than the latter.
3.3 Posterior sensitivity to the choice of prior
To explore posterior sensitivity to the choice of prior distribution, we performed a simulation experiment with a full order-two surface on five predictors. We assume the SHC for the model space with MB only including the intercept, which gives 38,619 models. The true model contains the three main effects, two square terms, and two interactions. We simulated twenty datasets and consider results for model selection using EPP, HIP(a, b), HUP(a, b), and HOP(a, b) (where a = b = 1 and a = 1, b = ch).
Figure 3.1 shows the average rank of the true model and its average posterior probability over the twenty datasets for varying sample sizes. A general trend emerges that the EPP provides the least evidence in favor of the true model, while the choice of b = ch provides the most evidence in favor of the true model when the sample size is moderate or large. However, when the sample size is small, the EPP and b = 1 choices provide better rank for the true model. This is due to the additional penalization in the b = ch priors, which needs to be overcome by the evidence contained in the Bayes factor.
Figure 3.1.
Comparison of the rank of MT and p(MT|y) for different prior choices.
These simulations were also carried out for the spike-and-slab (SnS) prior (Ishwaran and Rao, 2005; Mitchell and Beauchamp, 1988) as well as the lasso for hierarchical interactions (hierNet) (Bien et al., 2013). As seen in Table 3.2, the highest posterior probability model (HPM) for all methods nests the true model when n > 50. The heredity-based priors exhibit essentially no false positives when n > 100, but false positives are observed with both hierNet and SnS as well as the EPP. For instance, when n = 100, SnS, hierNet, and EPP have average false positive rates of 0.0143, 0.3071, and 0.0167, respectively. In contrast, the hierarchical priors with b = ch have no false positives and b = 1 has false positive rate 0.0083. The additional penalizations built into the hierarchical priors strongly control the false positive rate in model selection. Of course, there is a trade-off and these priors can have difficulty finding true positives in situations where the signal-to-noise ratio is weak.
Table 3.2.
Method comparison for the mean percentage of true and false positive terms identified through selection
| Method | TP, n=30 | TP, n=50 | TP, n=100 | TP, n=200 | TP, n=500 | FP, n=30 | FP, n=50 | FP, n=100 | FP, n=200 | FP, n=500 |
|---|---|---|---|---|---|---|---|---|---|---|
| SnS | 0.89 | 1.00 | 1.00 | 1.00 | 1.00 | 0.04 | 0.04 | 0.01 | 0.01 | 0.00 |
| hierNet | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 | 0.29 | 0.45 | 0.31 | 0.23 | 0.12 |
| EPP | 0.91 | 1.00 | 1.00 | 1.00 | 1.00 | 0.03 | 0.07 | 0.02 | 0.03 | 0.00 |
| HOP.Ch | 0.73 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HIP.Ch | 0.68 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HUP.Ch | 0.65 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| HOP.11 | 0.82 | 1.00 | 1.00 | 1.00 | 1.00 | 0.03 | 0.02 | 0.01 | 0.00 | 0.00 |
| HIP.11 | 0.88 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 |
| HUP.11 | 0.82 | 1.00 | 1.00 | 1.00 | 1.00 | 0.02 | 0.00 | 0.01 | 0.00 | 0.00 |
4 Random Walks on the Model Space
When the model space ℳ is too large to enumerate, a Metropolis-Hastings (MH) algorithm can be utilized to generate a dependent sample of models from the model space posterior. The structure of ℳ both presents difficulties and provides clues on how to build algorithms to explore it. Several kernels can be implemented in such an MH algorithm, three of which are outlined in this section and are based on local, global, and intermediate steps. Combining the different strategies (Tierney, 1994), e.g., by mixing Markov chain kernels, allows the MH algorithm to explore the model space thoroughly and quickly.
For homoscedastic normal regression using intrinsic priors, the marginal probability of the observed y can be computed efficiently for any proposed model M. Because of this, we bypass the intricacies of reversible jump and pseudo-prior methods (Carlin and Chib, 1995; Green, 1995). We note that when good approximations to the marginals are available, for example through the Laplace approximation (DiCiccio et al., 1997; Tierney and Kadane, 1986), then the random walk algorithm can be implemented using them.
In this section, we succinctly describe the stochastic search algorithm proposed to explore ℳ. We detail three kernels designed with distinct scopes. Although the default setting yields an automatic and sensible specification for the search algorithm, details related to tuning are discussed. Finally, we provide a brief discussion on the estimators used for p(M|y).
4.1 Proposal kernels
Global Jump
In order to keep the chain from getting stuck in a local mode, the global jump kernel generates a model independently at random from the prior distribution. The MH correction is the Bayes factor of the proposed model against the current model.
Local Jump
Given a current model M, the set ℰ(M) ∪ 𝒞(M) contains the nodes whose inclusion can be changed while maintaining the assumed heredity condition of ℳ. The local jump kernel implements a stochastic forwards-backwards proposal kernel conditioned on M. For each α ∈ ℰ(M) the model Mα is defined to be the model M \ {α} and for each α ∈ 𝒞(M) it is defined to be M ∪ {α}. The proposal kernel is supported only on the set of models ℳM = {M} ∪ {Mα : α ∈ ℰ(M) ∪ 𝒞(M)}. Let |ℳM| = N and p(Mα|y, X, ℳM) be the renormalized posterior probability of Mα restricted to ℳM. Mα is proposed with probability p(Mα|y, X, ℳM)/2 + 1/(2N). The proposal kernel is biased towards models with higher posterior probability, and each model has probability at least 1/(2N) of being proposed. Thus, even models with small posterior probability are proposed during iterations of the random walk and the probability of proposing any one model is bounded above by 0.5(1 + 1/N).
When at model M, this local jump kernel uses m(y|Mα) for all Mα ∈ ℳM to compute the proposal probabilities. One can store m(y|Mα) for use in future proposal kernels, as well as in a renormalization estimator of the full posterior distribution, even if Mα is never proposed or visited during the random walk. Thus, the balanced random walk keeps the sampler near models with high posterior probability while providing information about models close to them.
Intermediate Jump
The intermediate jump proposes a model by making proposals at each order, either increasing the order from J₀ to J or decreasing it from J to J₀. Suppose that the chain is currently at M. The algorithm creates an intermediate model at each order k. Let ℳ′k denote the set of models that can be formed from the current intermediate model M′ by adding or removing nodes of order k, or by remaining at M′, while maintaining the assumed HC. When proposing an intermediate model from ℳ′k, the proposal density utilized is the same as that for the local jump restricted to the proposal set.

If increasing the order, the algorithm starts at M and, for k = J₀, …, J, proposes models from ℳ′k sequentially with k increasing; the final proposed model M′ is the model produced at order J. The decreasing-order kernel is defined analogously and is the reverse of the increasing-order kernel. The MH correction accounts for the posterior probabilities of the proposed model M′ and the current model M as well as the proposal probabilities, which are computed from the proposal probabilities at each order.

For these intermediate steps, the use of the local jump proposal restricted to nodes of a given order provides a means of proposing more models than the standard local jump (and thus storing more marginals of y conditioned on specific models). The final proposed model M′ can differ from M by as few as zero nodes or as many as J − J₀ + 1 nodes (one for each order). This allows quicker exploration of the model space (in terms of the number of iterations) while requiring more computation at each iteration.
Algorithm Tuning
The sampling algorithm outlined in this paper can be tuned in two ways. First, the weights in the convex combination of the restricted posterior and restricted uniform proposal in the local and intermediate jumps can be modified. When the weight on the restricted posterior is near one, the kernel is biased towards proposing models with high posterior probability. Conversely, when the weight is near zero, the kernel proposes models uniformly at random from the restricted space. The tradeoff is between proposing models with a high probability of being accepted versus increasing the probability of proposing low probability models and potentially moving to another part of the model space that might also contain high probability models. As a default, weights of 1/2 are used.
Second, the relative frequency of utilizing the global, local, and intermediate jumps as proposal kernels can be modified. The global jump has relatively low probability of proposing good models, but can provide large moves across the model space. In contrast, the local jump has high probability of proposing models that will be accepted, but will move slowly around the model space. The intermediate jump allows for larger moves than the local jump, but at larger computational costs. The local jump should be used the most often with the global jump used the least to ensure that the algorithm does not get stuck in a local neighborhood of the model space. As a default, the kernels are used with frequency 1/2, 2/5, and 1/10 for the local, intermediate, and global jumps, respectively.
4.2 Posterior Estimators
The MH sampler is uniformly ergodic because the model space is finite, and one can employ either renormalization or visit frequency to compute posterior quantities of interest, such as model posterior probabilities (Garcia-Donato and Martinez-Beneito, 2013). When the model space is large, the number of iterations required to obtain good estimators of posterior probabilities with visit frequencies can be too large to be practically feasible.
An alternative renormalization framework can be implemented that utilizes not only models visited during the random walk, but also any models used in proposal kernels during the random walk. Kernels that utilize aspects of the model space posterior provide the necessary marginals, m(y|M, ℳ), to compute estimators that incorporate a larger portion of the model space than visit frequency estimators. Thus, the implemented renormalization estimators provide smaller total variation distance to the posterior than obtained using frequency estimators. Of course, this produces an efficiency trade-off. The transition kernels that propose a subset of models ℳM ⊂ ℳ when at a model M can require the computation (or lookup) of m(y|M′) for all M′ ∈ ℳM, which has more cost than simple random proposals.
To compare the ability of the algorithm to estimate the model posterior probabilities through the proposed renormalization approach and frequency-based estimation, we explore the posterior probabilities obtained from the random walk performed for the example in Section 3.3. In particular, we follow the approach suggested in Chib (1995), where the entire model space is enumerated and the true model posterior probabilities calculated for each model. The true posterior probabilities were then compared to the estimated ones.
Table 4.1 shows the total variation distance (TVD) between the resulting estimators and the actual model posterior probabilities. In the least favorable scenario (n = 30 and 5,000 iterations), the renormalization and frequency estimators produced mean TVDs of 0.0079 and 0.2625, respectively. In the most favorable scenario (n = 200 and 50,000 iterations), the resulting TVDs were 0.0007 and 0.2149. Despite the uniform ergodicity of the sampler, the frequency estimates converge slowly, whereas renormalization provides a good approximation to the true posterior probabilities.
Table 4.1.
Mean TVD between true model posterior probabilities and estimators.
| n | Renorm., 5,000 iter. | Renorm., 10,000 iter. | Renorm., 50,000 iter. | Freq., 5,000 iter. | Freq., 10,000 iter. | Freq., 50,000 iter. |
|---|---|---|---|---|---|---|
| 30 | 0.0079 | 0.0073 | 0.0062 | 0.2625 | 0.2564 | 0.2540 |
| 100 | 0.0013 | 0.0012 | 0.0012 | 0.2470 | 0.2406 | 0.2325 |
| 200 | 0.0010 | 0.0008 | 0.0007 | 0.2165 | 0.2157 | 0.2149 |
5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed to determine the effects of the sample size, the signal-to-noise ratio, the signal allocation, the heredity assumption for the true model, and the heredity assumption for the model space. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The cardinality of this model space is |ℳ| > 1.2 × 10²², which makes enumeration of all models infeasible. The entries of the matrix of main effects are generated as independent standard normal variates. The response vectors are drawn from the n-variate normal distribution as y ~ Nn(XMT βMT, In), where MT is the true model. The highest-order term in the true model has order three.
We consider three different values for n (130, 260, 1040) and for the signal-to-noise ratio (SNR = 0.25, 1, 4). Three different signal allocations are investigated. The first places signal equally across orders one, two, and three. The second decreases the signal strength by half with each increase in order. The third increases the signal strength by a factor of two with each increase in order. The largest true model considered is displayed in Figure 5.1 and satisfies the SHC. The second choice for MT removes the nodes x1x2 and x2x5 from the graph to form a WHM that is not a SHM. Similarly, the third choice of MT removes x2x5 and one further node from the graph, producing a model that does not satisfy the WHC. Regardless of the HC assumed for ℳ, this choice of MT leads to a unique MT̃ ∈ ℳ of smallest cardinality nesting MT.
Figure 5.1.
MT : DAG of the largest true model used in simulations.
We drive a random walk through the model space using the HIP(1, 1) prior distribution and compute renormalization estimators of posterior probabilities for the HIP, HUP, HOP, HLP, and HTP priors with choice of (a, b) ∈ {(1, 1), (1, ch)} as well as the EPP. The total number of combinations considered for SNR, sample size, regression coefficient values, and nodes in MT amounts to 162 different scenarios. Each scenario was run with 100 independently generated datasets for 10, 000 iterations of the random walk. The results presented in this section pertain to median results across the datasets regarding the true and false positive rates of the highest posterior probability model, posterior probability of the true model, and rank of the true model. Because small n and/or low SNR lead to underfitting, we also consider the number of simulations in which the true model was not visited. Figures providing the necessary simulation summaries are provided in Appendix D of Supplementary Materials.
SNR and sample size effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true nonzero coefficients. Across the simulations with n = 130 and SNR= 0.25, only the EPP did a good job of recovering true positives. For example, assuming that MT is a SHM and that signal is allocated equally across orders, the EPP has median true positive rates of 2/3 and 1/3 when assuming the SHC and WHC for ℳ, respectively. The false positive rates were 0.43% and 0.36% under the EPP, respectively. In contrast, the hierarchical priors produced no false positives, but only the HIP(1, 1) of Chipman (1996) combined with the SHC assumption for ℳ obtained a nonzero true positive rate, which was 1/9.
For low SNR and large sample size (n = 1040), all of the priors choose MT as the highest posterior model (HPM) under the SHC. The major differences come from the allocation of posterior probability to MT. Among the hierarchical priors, the HUP(1, ch) gives MT the highest posterior probability, 87%. In contrast, the EPP assigns only 8% posterior probability to MT. Under the WHC, the highest true positive rate falls to 78%, achieved by both the EPP and the HTP(1, ch). The contrast comes from their ranks for MT, which are 1,426 and 582, respectively. The best rank was 262, achieved by the HOP(1, ch). Even in the best-case scenario (n = 1040 and SNR = 4), the true positive rate under the WHC did not increase, whereas the SHC provides perfect selection under the same conditions. These results together provide a strong argument for the SHC over the WHC. Under the SHC, the HIP(1, 1) performed best when the SNR was low, followed by the HLP and HTP with the same hyperparameter choice for moderate sample sizes. The true model was not found 30% of the time under the SHC and was never found under the WHC.
Coefficient magnitude
For this comparison, we focus on n = 260 and SNR = 1, but with MT satisfying the WHC and not the SHC. First, we consider the different priors under the SHC. When the signal is concentrated more on the higher-order terms or distributed evenly across the orders, the hierarchical priors perform admirably, with the exception of the HUP(1, ch). In general, the ranking and posterior probability of the true model are improved slightly by setting b = ch, but the strong control that this choice provides over false positives decreased the median true positive rate for that prior. When the signal is concentrated towards lower-order terms, the HTP(1, 1) performed the best, but only achieved a 2/3 true positive rate. Similar to the discussion for SNR and sample size, the WHC gives generally weaker inference than the SHC. The trends tend to follow those for the SHC assumption. One notable difference is that the b = ch priors perform better than or comparably to their b = 1 counterparts, with the exception of the HOP.
A general pattern is seen with the hierarchical priors under the SHC or WHC: evidence in favor of the null is lowest when the signal is concentrated on lower-order terms. Intuitively, this empirical result makes sense. Because we are penalizing the addition of higher-order terms, their signal needs to be larger in order to increase the evidence in favor of inclusion from Bayes factors. Concentration of the signal on higher-order terms provides the most inferential power under the SHC due to its strict parental inclusion requirements.
Special points on the scale
Across the range of examples, assuming the SHC provides better inference than assuming the WHC in terms of true positives, false positives, and posterior probability and rank of the true model. Even when the true model does not satisfy the SHC (or any HC at all), the HIP, HTP, and HLP under the SHC assumption for ℳ provide the most balanced inference in favor of the pseudo-true model MT̃. Subsequent inference about the influence of terms in MT̃ can be carried out through estimation and consideration of credible intervals.
6 Method comparison: Ozone data analysis
We analyze the ozone data from Breiman and Friedman (1985) to assess the performance of our procedures. We follow the analysis of Liang et al. (2008), though we only use intrinsic priors to obtain our Bayes factors. After removing observations with missing values, the data contains 330 observations. A subset of 165 observations was sampled uniformly at random and used for variable selection, and the remaining data were used for validation. We predict daily measurements of maximum ozone concentration near Los Angeles with eight meteorological variables. The meteorological variables and their two-way interactions and squares are used in MF, which has 44 terms. The base model, MB, is the intercept-only model. There are 7.1 × 1010 and 6.5 × 1011 models under the SHC and WHC, respectively.
The priors are assessed through their predictive accuracy on the validation dataset with the prediction root mean squared error (PRMSE). Results are provided both for the highest posterior probability model (HPM) and under model averaging of the 500 highest posterior probability models. Additionally, we compare the performance of the priors developed in this paper to the lasso for hierarchical interactions (Bien et al., 2013), implemented using the R package hierNet. The penalty parameter was obtained by minimizing the estimated cross-validation error in ten-fold cross-validation. Because the selected penalty parameter depends on the random cross-validation partition, the cross-validation procedure was replicated 100 times. The results of this analysis are contained in Table 6.1. The HPMs referenced in Table 6.1 and the marginal inclusion probabilities for all 44 terms are given in Appendix E of the Supplementary materials.
Table 6.1.
Predicted RMSE under the SHC and WHC for the highest probability model (HPM) and under model averaging (Ave).
| Prior | pars | HC | PRMSE (HPM) | PRMSE (Ave) | Post. prob. MHPM | Post. prob. (top 500) | Size of MHPM | MHPM |
|---|---|---|---|---|---|---|---|---|
| HIP | (1,ch) | SH | 4.2 | 4.128 | 0.278 | 0.999 | 6 | M(1) |
| HIP | (1,ch) | WH | 4.593 | 4.483 | 0.712 | 0.999 | 3 | M(2) |
| HIP | (1,1) | SH | 4.2 | 4.129 | 0.278 | 0.998 | 6 | M(1) |
| HIP | (1,1) | WH | 4.593 | 4.212 | 0.605 | 0.997 | 3 | M(2) |
| HOP | (1,ch) | SH | 4.2 | 4.141 | 0.249 | 0.991 | 6 | M(1) |
| HOP | (1,ch) | WH | 4.593 | 4.125 | 0.412 | 0.985 | 3 | M(2) |
| HOP | (1,1) | SH | 4.2 | 4.141 | 0.251 | 0.993 | 6 | M(1) |
| HOP | (1,1) | WH | 4.593 | 4.125 | 0.412 | 0.986 | 3 | M(2) |
| HUP | (1,ch) | SH | 4.2 | 4.12 | 0.256 | 0.993 | 6 | M(1) |
| HUP | (1,ch) | WH | 4.593 | 4.354 | 0.703 | 0.995 | 3 | M(2) |
| HUP | (1,1) | SH | 4.2 | 4.121 | 0.256 | 0.994 | 6 | M(1) |
| HUP | (1,1) | WH | 4.593 | 4.195 | 0.635 | 0.99 | 3 | M(2) |
| HLP | (1,ch) | SH | 4.2 | 4.142 | 0.237 | 0.991 | 6 | M(1) |
| HLP | (1,ch) | WH | 4.593 | 4.17 | 0.344 | 0.99 | 3 | M(2) |
| HLP | (1,1) | SH | 4.2 | 4.143 | 0.237 | 0.991 | 6 | M(1) |
| HLP | (1,1) | WH | 6.459 | 4.223 | 0.267 | 0.98 | 4 | M(3) |
| EPP | – | SH | 4.2 | 4.151 | 0.024 | 0.423 | 6 | M(1) |
| EPP | – | WH | 4.21 | 4.161 | 0.018 | 0.359 | 5 | M(4) |
| hierNet | – | SH | 4.349 | – | – | – | 24 | M(5) |
| hierNet | – | WH | 4.371 | – | – | – | 22 | M(6) |
The main results can be summarized as follows. First, the lowest PRMSE for a HPM comes from the SHC with the hierarchical priors or the EPP. Second, the hierarchical priors provide stronger posterior concentration than the EPP. Third, HPMs under the SHC yield a lower PRMSE than those obtained under the WHC. Fourth, models found with hierNet are considerably larger than those resulting from the Bayesian procedures; however, hierNet produces slightly larger PRMSEs, suggesting that these larger models overfit the data. Fifth, model averaging PRMSEs slightly improve upon those from the HPMs. Sixth, smaller selected models are obtained under the WHC, as one expects given its relaxed inclusion condition; however, these selected models have larger PRMSE, providing another argument in favor of the SHC over the WHC.
Interestingly, the HPM PRMSE under the WHC for the HLP(1, 1) is the largest observed. However, the model averaging PRMSE indicates that the full Bayesian analysis using this prior still behaves well. The second highest probability model is M(2), which has a comparable posterior probability (0.20) to that of the HPM (0.267).
7 Discussion
In this paper, we investigated prior structures for polynomial regression assuming the strong and weak heredity conditions and developed random walk samplers to draw from the model space. We extended the priors developed in Chipman (1996) by completing the Bayesian specification as well as developing priors that utilize exchangeability conditions. These hierarchical priors have similar posterior behavior, with the HOP, HLP, and HTP providing hierarchical versions of the multiplicity correction and complexity penalizing priors, which help immensely with model selection when the polynomial surface has a high degree. When the number of main effects is small or the signal-to-noise ratio is low, the HIP provides better model selection properties. However, it is ill-suited for regressions with a large number of predictors.
In addition to these results from the simulations, Theorems 1 and 2 provide a further theoretical argument in favor of strong — over weak — heredity. Through the analysis of the ozone data, we can also see that the WHC leads to higher prediction RMSE than the SHC. This is related to the theoretical result, which shows that the SHC concentrates posterior probability on a unique model, whereas the WHC might not.
The computational algorithm described in the paper takes advantage of the hierarchical nature of the model space and provides renormalization estimators that incorporate a large portion of the model space. Through a simulation experiment, we have shown that these estimators have smaller total variation distance to the true posterior than visit frequency estimators. We provide the software necessary to carry out a Metropolis-Hastings random walk on these model spaces through the R package varSelectIP.
The methods in this paper can be expanded to generalized linear mixed models and other modeling contexts. For example, for binary data the methods proposed by Albert and Chib (1993) can be implemented using intrinsic priors on the parameter space (Leon-Novelo et al., 2012) and with the proposed hierarchical priors on the model space.
The theoretical results of Section 2.1 can be easily extended to encompass these classes of models under suitable regularity conditions. Efficient computation or accurate approximation of the marginal probability of the data under competing models is required to employ the MH algorithm we have developed. However, because the focus of the paper is on the model space prior, the methods discussed here can be incorporated into more complicated settings where more elaborate MCMC methods are necessary.
Supplementary Material
References
- Albert J, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–679.
- Berger J, Pericchi L. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association. 1996;91(433):109–122.
- Berger J, Pericchi L, Ghosh J. Objective Bayesian methods for model selection: introduction and comparison. In: Model Selection. IMS Lecture Notes Monograph Series, vol. 38. Institute of Mathematical Statistics; 2001. pp. 135–207.
- Bien J, Taylor J, Tibshirani R. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41(3):1111–1141.
- Breiman L, Friedman J. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association. 1985;80:580–598.
- Brusco MJ, Steinley D, Cradit JD. An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics. 2009;51(3):306–315.
- Carlin B, Chib S. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B. 1995;57(3):473–484.
- Chib S. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association. 1995;90(432):1313–1321.
- Chipman H. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24(1):17–36.
- Chipman H. Fast model search for designed experiments with complex aliasing. In: Quality Improvement Through Statistical Methods. 1998. pp. 207–220.
- Chipman H, Hamada M, Wu C. A Bayesian variable-selection approach for analyzing designed experiments with complex aliasing. Technometrics. 1997;39(4):372–381.
- Choi NH, Li W, Zhu J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105(489):354–364.
- DiCiccio TJ, Kass RE, Raftery A, Wasserman L. Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association. 1997;92:903–915.
- Garcia-Donato G, Martinez-Beneito M. On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association. 2013;108(501):340–352.
- George E. The variable selection problem. Journal of the American Statistical Association. 2000;95(452):1304–1308.
- Girón FJ, Moreno E, Casella G, Martínez ML. Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas. 2010;104(1):57–67.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- Griepentrog GL, Ryan JM, Smith LD. Linear transformations of polynomial regression models. American Statistician. 1982;36(3):171–174.
- Ishwaran H, Rao JS. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics. 2005;33:730–773.
- Jeffreys H. Theory of Probability. 3rd ed. London: Oxford University Press; 1961.
- Kass R, Raftery A. Bayes factors. Journal of the American Statistical Association. 1995;90:773–795.
- Kass RE, Wasserman L. The selection of prior distributions by formal rules. Journal of the American Statistical Association. 1996;91(435):1343–1370.
- Khuri A. Nonsingular linear transformations of the control variables in response surface models. Technical Report; 2002.
- Leon-Novelo L, Moreno E, Casella G. Objective Bayes model selection in probit models. Statistics in Medicine. 2012;31(4):353–365.
- Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association. 2008;103(481):410–423.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
- Mitchell T, Beauchamp J. Bayesian variable selection in linear regression. Journal of the American Statistical Association. 1988;83:1023–1032.
- Nelder JA. Reformulation of linear models. Journal of the Royal Statistical Society, Series A. 1977;140:48–77.
- Nelder JA. The selection of terms in response-surface models - how strong is the weak-heredity principle? American Statistician. 1998;52(4):315–318.
- Nelder JA. Functional marginality and response-surface fitting. Journal of Applied Statistics. 2000;27(1):109–112.
- Peixoto JL. Hierarchical variable selection in polynomial regression models. American Statistician. 1987;41(4):311–313.
- Peixoto JL. A property of well-formulated polynomial regression models. American Statistician. 1990;44(1):26–30.
- Scott JG, Berger JO. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics. 2010;38(5):2587–2619.
- Tierney L. Markov chains for exploring posterior distributions. The Annals of Statistics. 1994;22(4):1701–1728.
- Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association. 1986;81:82–86.
- Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM. Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics. 2010;4(3):1342–1364.
- Yuan M, Joseph VR, Zou H. Structured variable selection and estimation. The Annals of Applied Statistics. 2009;3(4):1738–1757.
- Zellner A, Siow A. Posterior odds ratios for selected regression hypotheses. In: Bernardo JM, DeGroot MH, Lindley DV, Smith AFM, editors. Bayesian Statistics: Proceedings of the First International Meeting. Valencia: University of Valencia Press; 1980. pp. 585–603.