Abstract
Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and that the stacked mixture can be further improved with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking. The model weights vary as a function of the data, are partially pooled, and are inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theoretical bounds and demonstrate the method on several applied problems.
Keywords: Bayesian hierarchical modeling, conditional prediction, covariate shift, model averaging, stacking, prior construction
1. Introduction
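Statistical inference is conditional on the model, and a general challenge is how to make full use of multiple candidate models. Consider data $\mathcal{D} = (y_i \in \mathcal{Y}, x_i \in \mathcal{X})_{i=1}^n$, and $K$ models $M_1, \dots, M_K$, each having its own parameter vector $\theta_k \in \Theta_k$, likelihood, and prior. We fit each model and obtain posterior predictive distributions,
$$p(\tilde y \mid \tilde x, M_k) = \int_{\Theta_k} p(\tilde y \mid \tilde x, \theta_k, M_k)\, p(\theta_k \mid \{y_i, x_i\}_{i=1}^n, M_k)\, d\theta_k. \tag{1}$$
The model fit is judged by its expected predictive utility of future (out-of-sample) data $(\tilde x, \tilde y) \in \mathcal{X} \times \mathcal{Y}$, which generally have an unknown true joint density $p_t(\tilde y, \tilde x)$. Model selection seeks the best model with the highest utility when averaged over $p_t(\tilde y, \tilde x)$. Model averaging assigns the models weights $w_1, \dots, w_K$ subject to a simplex constraint $w \in \mathcal{S}_K = \{w : \sum_{k=1}^K w_k = 1;\ w_k \in [0, 1],\ \forall k\}$, and the future prediction is a linear mixture from individual models:
$$p(\tilde y \mid \tilde x, w, \text{model averaging}) = \sum_{k=1}^K w_k\, p(\tilde y \mid \tilde x, M_k), \quad w \in \mathcal{S}_K. \tag{2}$$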
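Stacking (Wolpert, 1992), among other ensemble learners, has been successful for various prediction tasks. Yao et al. (2018) apply the stacking idea to combine predictions from separate Bayesian inferences. The first step is to fit each individual model and evaluate the pointwise leave-one-out predictive density of each data point $i$ under each model $k$:
$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid \theta_k, x_i, M_k)\, p(\theta_k \mid M_k, \{(x_{i'}, y_{i'}) : i' \neq i\})\, d\theta_k,$$
which in a Bayesian context we can approximate using posterior simulations and Pareto-smoothed importance sampling (Vehtari et al., 2017). Reusing data eliminates the need to model the unknown joint density $p_t(\tilde y, \tilde x)$. The next step is to determine the vector of weights $w = (w_1, \dots, w_K)$ in the $K$-simplex $\mathcal{S}_K$ that optimizes the average log score of the stacked prediction,
$$\hat w^{\mathrm{stacking}} := \arg\max_{w \in \mathcal{S}_K} \sum_{i=1}^n \log\Big(\sum_{k=1}^K w_k\, p_{k,-i}\Big). \tag{3}$$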
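The optimization in (3) can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes the leave-one-out densities are already available in a hypothetical table `loo_dens` (rows are data points, columns are models), and it uses an EM-style fixed-point update for mixture weights, a swapped-in technique chosen because it needs only the standard library and never decreases the concave objective.

```python
import math

def stacking_weights(loo_dens, n_iter=2000):
    """Complete-pooling stacking weights via an EM-style fixed-point update.

    loo_dens[i][k] is the leave-one-out predictive density of point i
    under model k (assumed to be precomputed and strictly positive).
    """
    n, K = len(loo_dens), len(loo_dens[0])
    w = [1.0 / K] * K  # start from uniform weights on the simplex
    for _ in range(n_iter):
        new_w = [0.0] * K
        for row in loo_dens:
            mix = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                new_w[k] += w[k] * row[k] / mix
        w = [v / n for v in new_w]  # the update keeps w on the simplex
    return w

def log_score(w, loo_dens):
    """Objective of (3): sum of log mixture leave-one-out densities."""
    return sum(math.log(sum(wk * pk for wk, pk in zip(w, row))) for row in loo_dens)

# Made-up densities for five data points and two models.
loo_dens = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.6], [0.5, 0.5]]
w_hat = stacking_weights(loo_dens)
print(w_hat, log_score(w_hat, loo_dens))
```

In practice a constrained optimizer or the software cited in the text would replace this loop; the sketch only makes the objective concrete.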
However, the linear mixture (2) imposes an identical set of weights for all inputs $x$. We will later label this solution (3) as complete-pooling stacking. The present paper proposes hierarchical stacking, an approach that goes further in three ways:
Framing the estimation of the stacking weights as a Bayesian inference problem rather than a pure optimization problem. This in itself does not make much difference in the complete-pooling estimate (3) but is helpful in the later development.
Expanding to a hierarchical model in which the stacking weights can vary over the population. If the model predictors take on different values in the data, we can use Bayesian inference to estimate a matrix of weights that partially pools the data both in row and column.
Further expanding to allow weights to vary as a function of continuous predictors. This idea generalizes the feature-weighted linear stacking (Sill et al., 2009) with a more flexible form and Bayesian hierarchical shrinkage.
There are two reasons we would like to consider input-dependent model weights. First, the scoring rule measures the expected predictive performance averaged over $\tilde x$ and $\tilde y$, as the objective function in (3) divided by $n$ is a consistent estimate of $\mathrm{E}_{\tilde x, \tilde y} \log p(\tilde y \mid \tilde x, w)$. But an overall good model fit does not ensure a good conditional prediction at a given location $\tilde x$, or under covariate shift when the distribution of input in the observations differs from the population of interest. More importantly, different models can be good at explaining different regions in the input-response space, which is why model averaging can be a better solution than model selection. Even if we are only interested in the average performance, we can further improve model averaging by learning where a model is good so as to locally inflate its weight.
In Section 2, we develop a detailed implementation of hierarchical stacking. We explain why it is legitimate to convert an optimization problem into a formal Bayesian model. With hierarchical shrinkage, we partially pool the stacking weights across data. By varying priors, hierarchical stacking includes classic stacking and selection as special cases. We generalize this approach to continuous input variables, other structured priors, and time-series and longitudinal data. In Section 3, we turn the heuristics from the previous paragraph into a rigorous learning bound, indicating the benefit of moving from model selection to model averaging, and from complete-pooling model averaging to a local averaging that allows the model weights to vary in the population. We outline related work in Section 4. In Section 5, we evaluate the proposed method in several simulated and real-data examples, including a U.S. presidential election forecast.
This paper makes two main contributions:
Hierarchical stacking provides a Bayesian recipe for model averaging with input-dependent weights and hierarchical regularization. It improves both the overall model fit and the conditional local fit in small and new areas.
Our theoretical results characterize how the model list should be locally separated to be useful in model averaging and local model averaging.
2. Hierarchical stacking
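The present paper generalizes the linear model averaging (2) to pointwise model averaging. The goal is to construct an input-dependent model weight function $w(x) = (w_1(x), \dots, w_K(x))$, and combine the predictive densities pointwise by
$$p(\tilde y \mid \tilde x, w(\cdot), \text{pointwise averaging}) = \sum_{k=1}^K w_k(\tilde x)\, p(\tilde y \mid \tilde x, M_k), \quad \text{such that each } w(x) \text{ lies on the } K\text{-simplex}. \tag{4}$$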
If the input is discrete and has finite categories, one naïve estimation of the pointwise optimal weight is to run complete-pooling stacking (3) separately on each category, which we will label no-pooling stacking. The no-pooling procedure generally has a larger variance and overfits the data.
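From a Bayesian perspective, it is natural to compromise between unpooled and completely pooled procedures by a hierarchical model. Given some hierarchical prior $p^{\mathrm{prior}}(w)$, we define the posterior distribution of the stacking weights through the usual likelihood-prior protocol:
$$\log p(w(\cdot) \mid \mathcal{D}) = \sum_{i=1}^n \log\Big(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Big) + \log p^{\mathrm{prior}}(w) + \text{constant}, \tag{5}$$
where $p_{k,-i}$ is the leave-one-out predictive density of the $i$-th data point under model $k$. The final estimate of the pointwise stacking weight used in (4) is then the posterior mean from this joint density $p(w(\cdot) \mid \mathcal{D})$. We call this approach hierarchical stacking.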
2.1. Complete-pooling and no-pooling stacking
For notational consistency, we rewrite the input variables into two groups ($x$, $z$), where $x$ are variables on which the model weight $w(x)$ depends during model averaging (4), and $z$ are all remaining input variables.
To start, we consider $x$ to be discrete with $J$ categories, $x = 1, \dots, J$. We will extend to continuous and hybrid $x$ later. The input-varying stacking weight function is parameterized by a $J \times K$ matrix $\{w_{jk}\}$: each row of the matrix is an element of the length-$K$ simplex. The $k$-th model in cell $j$ has the weight $w_k(x_i) = w_{jk}$, $\forall x_i = j$. We fit each individual model $M_k$ to all observed data $\mathcal{D} = (x_i, z_i, y_i)_{i=1}^n$ and obtain pointwise leave-one-out cross-validated predictive densities:
$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid \theta_k, x_i, z_i, M_k)\, p(\theta_k \mid \{(x_l, y_l, z_l) : l \neq i\}, M_k)\, d\theta_k. \tag{6}$$
As in complete-pooling stacking, here we avoid refitting each model $n$ times, and instead use Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017, 2019) to approximate $\{p_{k,-i}\}_{i=1}^n$ from one-time-fit posterior draws of $p(\theta_k \mid M_k, \mathcal{D})$. The cost of such approximate leave-one-out cross validation is often negligible compared with individual model fitting.
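To optimize the expected predictive performance of the pointwise model averaging, we can maximize the leave-one-out predictive density
$$\max_{w} \sum_{i=1}^n \log\Big(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Big), \quad \text{where } w_k(x_i) = w_{jk} \text{ when } x_i = j, \text{ and each row } (w_{j1}, \dots, w_{jK}) \text{ lies on the } K\text{-simplex}. \tag{7}$$
On one extreme, the complete-pooling stacking (3) solves optimization (7) subject to a constant constraint $w_{jk} = w_{j'k}$, $\forall j, j', k$. On the other extreme, no-pooling stacking maximizes this objective function (7) without any constraint other than the row-simplex condition, which amounts to separately solving complete-pooling stacking (3) on each input cell $\mathcal{D}_j = \{(x_i, z_i, y_i) : x_i = j\}$.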
If there are a large number of repeated measurements in each cell, , then becomes a reasonable estimate of the conditional log predictive density , with convergence rate , and therefore, no-pooling stacking becomes asymptotically optimal among all cell-wise combination weights. For finite sample size, because the cell size is smaller than total sample size, we would expect a larger variance in no-pooling stacking than in complete-pooling stacking. Moreover, the cell sizes are often not balanced, which entails a large noise of no-pooling stacking weight in small cells.
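The contrast between the two extremes can be illustrated with a short sketch. It is an assumption-laden toy, not the paper's code: `loo_dens` and `cells` are made-up inputs, and the weights are again found with an EM-style fixed-point update rather than the authors' optimizer. The cell `"b"` holds a single observation, so its no-pooling weight collapses onto whichever model happened to fit that one point best.

```python
def stack(loo_dens, n_iter=500):
    """Complete-pooling stacking on a table of leave-one-out densities."""
    n, K = len(loo_dens), len(loo_dens[0])
    w = [1.0 / K] * K
    for _ in range(n_iter):
        new_w = [0.0] * K
        for row in loo_dens:
            mix = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                new_w[k] += w[k] * row[k] / mix
        w = [v / n for v in new_w]
    return w

def no_pooling_stacking(loo_dens, cells):
    """Run stacking separately inside each cell of the discrete input."""
    grouped = {}
    for row, j in zip(loo_dens, cells):
        grouped.setdefault(j, []).append(row)
    return {j: stack(rows) for j, rows in grouped.items()}

# Hypothetical data: four points in cell "a", a single point in cell "b".
loo_dens = [[0.6, 0.4], [0.3, 0.5], [0.5, 0.2], [0.2, 0.6], [0.9, 0.1]]
cells = ["a", "a", "a", "a", "b"]

w_pooled = stack(loo_dens)                    # one weight vector for all cells
w_cell = no_pooling_stacking(loo_dens, cells)  # one weight vector per cell
print(w_pooled, w_cell)
```

The degenerate weight in the small cell is the high-variance behavior that motivates partial pooling.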
2.2. Bayesian inference for stacking weights
Vanilla (optimization-based) stacking (3) is justified by Bayesian decision theory: the expected log predictive density of the combined model is estimated by leave-one-out . The point optimum asymptotically maximizes the expected utility (Le and Clarke, 2017), hence is an -optimal decision in terms of Vehtari and Ojanen (2012).
To fold stacking into a Bayesian inference problem, we want to treat the objective function in (7) as a log likelihood with parameter . After integrating out individual-model-specific parameters such that is given, the outcomes at input location in the combined model have densities , which implies a joint log likelihood: . But this procedure has used data twice—in other practices, data are often used twice to pick the prior, whereas here data are used twice to pick the likelihood.
We use a two-stage estimation procedure to avoid reusing data. Assuming a hypothetically provided holdout dataset of the same size and identical distribution as observations , we can use to fit the individual model first and compute . In the second stage we plug in the observed , , and obtain the pointwise full likelihood .
Now in lack of holdout data , the leave--th-observation-out predictive density is a consistent estimate of the pointwise out-of-sample predictive density . By plugging it into the two-stage log likelihood and integrating out the unobserved holdout data , we get a profile likelihood
Summing over arrives at . This log likelihood coincides with the no-pooling optimization objective function (7).
Integrating out the hypothetical data is related to the idea of marginal data augmentation (Meng and van Dyk, 1999). Polson and Scott (2011) took a similar approach to convert the optimization-based support vector machine into a Bayesian inference.
2.3. Hierarchical stacking: discrete inputs
The log posterior density of hierarchical stacking model (5) contains the log likelihood defined above , and a prior distribution on the weight matrix , which we specify in the following.
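We first take a softmax transformation that bijectively converts the simplex matrix space $\mathcal{S}_K^J$ to the unconstrained space $\mathbb{R}^{J(K-1)}$:
$$w_{jk} = \frac{\exp(\alpha_{jk})}{\sum_{k'=1}^{K} \exp(\alpha_{jk'})}, \quad 1 \le k \le K,\ 1 \le j \le J; \qquad \alpha_{jK} = 0,\ 1 \le j \le J. \tag{8}$$
Here $\alpha_{jk}$ is interpreted as the log odds ratio of model $k$ with reference to model $K$ in cell $j$.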
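A small sketch of the transformation in (8) may help. The function names are hypothetical; the only assumption is the convention stated in the text that the last model serves as the reference category with its unconstrained weight fixed at zero.

```python
import math

def weights_from_alpha(alpha_row):
    """Map K-1 unconstrained values to a length-K simplex vector.

    The K-th (reference) model has its unconstrained weight fixed at 0.
    """
    full = list(alpha_row) + [0.0]
    top = max(full)  # subtract the max for numerical stability
    expd = [math.exp(a - top) for a in full]
    total = sum(expd)
    return [e / total for e in expd]

def alpha_from_weights(w):
    """Inverse map: log odds of each model against the reference model."""
    return [math.log(wk / w[-1]) for wk in w[:-1]]

w = weights_from_alpha([1.2, -0.7])
print(w, alpha_from_weights(w))
```

Fixing one coordinate at zero is what makes the map a bijection; an unrestricted softmax would be invariant to adding a constant to every coordinate.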
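We propose a normal hierarchical prior on the unconstrained model weights $(\alpha_{jk})_{k=1}^{K-1}$ conditional on hyperparameters $\mu \in \mathbb{R}^{K-1}$ and $\sigma \in \mathbb{R}_+^{K-1}$,
$$\alpha_{jk} \mid \mu_k, \sigma_k \sim \mathrm{normal}(\mu_k, \sigma_k), \quad k = 1, \dots, K-1,\ j = 1, \dots, J, \tag{9}$$
where the second argument of the normal distribution is its standard deviation. The prior partially pools unconstrained weights toward the shared mean $(\mu_1, \dots, \mu_{K-1})$. The shrinkage effect depends on both the cell sample size $n_j$ (how strong the likelihood is in cell $j$), and the model-specific $\sigma_k$ (how much across-cell discrepancy is allowed in model $k$). If $\mu$ and $\sigma$ are given constants, and if the posterior distribution is summarized by its mode, then hierarchical stacking contains two special cases:
no-pooling stacking by a flat prior $\sigma_k \to \infty$, $k = 1, \dots, K-1$.
complete-pooling stacking by a concentration prior $\sigma_k \to 0$, $k = 1, \dots, K-1$.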
It is possible to derive other structured priors. For example, a sparse prior (e.g., Heiner et al., 2019) on simplex will enforce a cell-wise selection.
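Instead of choosing fixed values, we view $\mu$ and $\sigma$ as hyperparameters and aim for a full Bayesian solution: to describe the uncertainty of all parameters by their joint posterior distribution $p(\alpha, \mu, \sigma \mid \mathcal{D})$, letting the data tell how much regularization is desired.
To accomplish this Bayesian inference, we assign a hyperprior to ($\mu$, $\sigma$):
$$\mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma), \quad k = 1, \dots, K-1, \tag{10}$$
where $\mathrm{normal}^+(0, \tau_\sigma)$ stands for the half-normal distribution supported on $[0, \infty)$ with scale parameter $\tau_\sigma$.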
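Putting the pieces (5), (9), (10) together, up to a normalization constant that has been omitted, we attain a joint posterior density of all free parameters $\alpha \in \mathbb{R}^{J \times (K-1)}$, $\mu \in \mathbb{R}^{K-1}$, $\sigma \in \mathbb{R}_+^{K-1}$:
$$\log p(\alpha, \mu, \sigma \mid \mathcal{D}) = \sum_{i=1}^n \log\Big(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Big) + \sum_{k=1}^{K-1} \sum_{j=1}^{J} \log p^{\mathrm{prior}}(\alpha_{jk} \mid \mu_k, \sigma_k) + \sum_{k=1}^{K-1} \log p^{\mathrm{hyperprior}}(\mu_k, \sigma_k), \tag{11}$$
where the weights $w_k(x_i)$ are the softmax transform (8) of $\alpha$. Unlike complete and no-pooling stacking, which are typically solved by optimization, the maximum a posteriori (MAP) estimate of (11) is not meaningful: the mode is attained at the complete-pooling subspace $\alpha_{jk} = \mu_k$, $\sigma_k = 0$, $\forall j, k$, on which the joint density is positive infinity. Instead, we sample ($\alpha$, $\mu$, $\sigma$) from this joint density (11) using Markov chain Monte Carlo (MCMC) methods and compute the Monte Carlo mean of posterior draws $\bar w_{jk}$, which we will call hierarchical stacking weights.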
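The unbounded mode can be checked numerically. The sketch below evaluates an unnormalized log joint density of the form described in the text (stacking log likelihood, normal prior on the unconstrained weights, normal and half-normal hyperpriors) on made-up data. The hyperparameter names `mu0`, `tau_mu`, `tau_sigma` and their default values are assumptions for illustration, and the function is not the authors' Stan program.

```python
import math

def log_normal_pdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def softmax_with_reference(alpha_row):
    full = list(alpha_row) + [0.0]  # reference model fixed at 0
    top = max(full)
    expd = [math.exp(a - top) for a in full]
    total = sum(expd)
    return [e / total for e in expd]

def log_joint(alpha, mu, sigma, loo_dens, cells, mu0=0.0, tau_mu=1.0, tau_sigma=1.0):
    """Unnormalized log posterior: likelihood + prior + hyperprior."""
    lp = 0.0
    for row, j in zip(loo_dens, cells):  # stacking log likelihood
        w = softmax_with_reference(alpha[j])
        lp += math.log(sum(wk * pk for wk, pk in zip(w, row)))
    for k in range(len(mu)):
        lp += log_normal_pdf(mu[k], mu0, tau_mu)                        # hyperprior on mu_k
        lp += math.log(2.0) + log_normal_pdf(sigma[k], 0.0, tau_sigma)  # half-normal on sigma_k
        for j in range(len(alpha)):
            lp += log_normal_pdf(alpha[j][k], mu[k], sigma[k])          # prior on alpha_jk
    return lp

# Hypothetical data: two cells, two models, alpha placed exactly at mu.
loo_dens = [[0.6, 0.4], [0.3, 0.5], [0.5, 0.2], [0.2, 0.6]]
cells = [0, 0, 1, 1]
alpha, mu = [[0.0], [0.0]], [0.0]
values = [log_joint(alpha, mu, [s], loo_dens, cells) for s in (1.0, 1e-3, 1e-6)]
print(values)  # grows without bound as sigma shrinks toward 0
```

Each factor-of-1000 reduction in the scale adds roughly $J \log 1000$ to the log density on this subspace, which is why posterior sampling rather than mode finding is used.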
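The final posterior predictive density of outcome $\tilde y$ at any input location ($\tilde x$, $\tilde z$) is
$$p(\tilde y \mid \tilde x, \tilde z, \mathcal{D}) = \sum_{k=1}^K \bar w_k(\tilde x) \int_{\Theta_k} p(\tilde y \mid \tilde x, \tilde z, \theta_k, M_k)\, p(\theta_k \mid M_k, \mathcal{D})\, d\theta_k, \tag{12}$$
where $\bar w_k(\tilde x)$ denotes the posterior mean of the stacking weight of model $k$ at $\tilde x$.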
Using a point estimate is not a waste of the joint simulation draws. Because equation (12) is a linear expression on , and because of the linearity of expectation, using is as good as using all simulation draws. Nonetheless, for the purpose of post-processing, approximate cross validation, and extra model check and comparison, we will use all posterior simulation draws; see discussion in Section 6.3.
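The linearity argument is easy to verify on toy numbers. In this sketch the weight draws and the model densities are invented for illustration; the point is only that averaging the mixture over draws equals the mixture evaluated at the averaged weights.

```python
def mixture_density(w, model_dens):
    """Linear mixture of per-model predictive densities at one test point."""
    return sum(wk * pk for wk, pk in zip(w, model_dens))

# Hypothetical posterior draws of a two-model weight vector at one input.
draws = [[0.2, 0.8], [0.5, 0.5], [0.35, 0.65]]
# Hypothetical predictive densities of the two models at one test point.
model_dens = [0.12, 0.40]

avg_over_draws = sum(mixture_density(w, model_dens) for w in draws) / len(draws)
w_bar = [sum(col) / len(draws) for col in zip(*draws)]
via_mean_weight = mixture_density(w_bar, model_dens)
print(avg_over_draws, via_mean_weight)
```

The identity holds for the predictive density itself; summaries that are nonlinear in the weights, such as the log score, still require the individual draws.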
2.4. Hierarchical stacking: continuous and hybrid inputs
The next step is to include more structure in the weights, which could correspond to regression for continuous predictors, nonexchangeable models for nested or crossed grouping factors, nonparametric prior, or combinations of these.
Additive model
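Hierarchical stacking is not limited to a discrete cell-divider $x$. When the input $x$ is continuous or hybrid, one extension is to model the unconstrained weights additively:
$$w_{1:K}(x) = \mathrm{softmax}\big(w^*_{1:K}(x)\big), \quad w^*_k(x) = \mu_k + \sum_{m=1}^M \alpha_{mk} f_m(x),\ k \le K-1, \quad w^*_K(x) = 0, \tag{13}$$
where $\{f_m : \mathcal{X} \to \mathbb{R}\}_{m=1}^M$ are $M$ distinct features. Here we have already extracted the prior mean $\mu_k$, representing the “average” weight of model $k$ in the unconstrained space. The discrete model (9) is now equivalent to letting $f_m(x) = 1(x = m)$ for $m = 1, \dots, J$. We may still use the basic prior (9) and hyperprior (10):
$$\alpha_{mk} \mid \sigma_k \sim \mathrm{normal}(0, \sigma_k), \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma). \tag{14}$$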
We provide Stan (Stan Development Team, 2020) code for this additive model, and discuss recommendations on the hyperprior and feature design in Appendix C and D (Yao et al., 2021).
Because the main motivation of our paper is to convert the one-fit-all model-averaging algorithm into open-ended Bayesian modeling, the basic shrinkage prior above should be viewed as a starting point for model building and improvement. Without trying to exhaust all possible variants, we list a few useful prior structures:
- Grouped hierarchical prior. The basic model (14) is limited to the same regularization for all . When the features are grouped (e.g., are dummy variables from two discrete inputs; states are grouped in regions), we achieve group-specific shrinkage by replacing (14) by
where is the group index of feature .
- Feature-model decomposition. Alternatively, we can learn feature-dependent regularization by
- Prior correlation. For discrete cells, we would like to incorporate prior knowledge of the group correlation. For example, in the election forecast (Section 5.3), we have a rough sense of some states being demographically close, and would expect similar model weights therein. To this end, we calculate a prior correlation matrix from various sources of state-level historical data, and replace the independent prior (9) by a multivariate normal (MVN) distribution,
| (15) |
The prior correlation is especially useful to stabilize stacking weights in small cells.
- Crude approximation of input density. When applying the basic model (13) to continuous inputs , instead of a direct linear regression , we recommend a coordinate-wise ReLU-type transformation:
| (16) |
where is the sample median of . The pointwise model predictive performance typically relies on the training density : the more training data seen nearby, the better the predictions. The feature (16) is designed to be a crude approximation of log marginal input densities.
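As a rough illustration of the ReLU-type feature idea, the sketch below splits one continuous input at its sample median into a positive part and a negative part. The exact form of (16) is not reproduced here, so the construction, the function name `relu_features`, and the sign convention of the negative part are all assumptions.

```python
import statistics

def relu_features(x_column):
    """Split one continuous input at its sample median (assumed form).

    Returns two features per observation: the amount above the median
    and the amount below it, both nonnegative.
    """
    med = statistics.median(x_column)
    above = [max(x - med, 0.0) for x in x_column]
    below = [max(med - x, 0.0) for x in x_column]
    return above, below

above, below = relu_features([1.0, 2.0, 3.0, 10.0])
print(above, below)
```

With separate coefficients on the two parts, the unconstrained weight can decay at different rates on either side of the bulk of the training inputs, which is what lets the feature mimic a log density.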
Gaussian process prior
An alternative way to generalize both the discrete prior in Section 2.3 and the prior correlation (15) is to use Gaussian process priors. To this end we need covariance kernels , and place priors on the unconstrained weight , viewed as a function: . The discrete prior is a special case of a Gaussian process via a zero-one kernel . Due to the previously discussed measurement error and the preference for stronger regularization for continuous , we recommend simple exponentiated quadratic kernels with an informative hyperprior that avoids too small or too big a length-scale , and too big . We present an example in Section 5.2.
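The two kernels mentioned above can be written down in a few lines. This is a generic sketch: the parameter names `amplitude` and `length_scale` are assumptions, the exponentiated quadratic kernel is written in its usual textbook form, and the zero-one kernel is taken to be the indicator that two inputs fall in the same cell.

```python
import math

def exp_quad_kernel(x1, x2, amplitude, length_scale):
    """Exponentiated quadratic covariance between two scalar inputs."""
    return amplitude ** 2 * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def zero_one_kernel(x1, x2):
    """Covariance that is 1 within the same cell and 0 across cells."""
    return 1.0 if x1 == x2 else 0.0

def covariance_matrix(xs, kernel):
    return [[kernel(a, b) for b in xs] for a in xs]

xs = [0.0, 0.5, 3.0]
cov = covariance_matrix(xs, lambda a, b: exp_quad_kernel(a, b, amplitude=1.5, length_scale=1.0))
identity = covariance_matrix([1, 2, 3], zero_one_kernel)
print(cov, identity)
```

A very small length-scale pushes the covariance toward the zero-one case (no sharing across inputs), while a very large one approaches complete pooling, which is why an informative hyperprior on it matters.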
2.5. Time series and longitudinal data
Hierarchical stacking can easily extend to time series and longitudinal data. Consider a time series dataset where outcomes come sequentially in time . The joint likelihood is not exchangeable, but still factorizable via . Therefore, assuming some stationary condition, we can approximate the expected log predictive densities of the next-unit unseen outcome by historical average of one-unit-ahead log predictive densities, defined by
In hierarchical stacking, we only need to replace the regular leave-one-out predictive density (6) by this redefined , and run hierarchical stacking (11) as usual. Using importance sampling based approximation (Bürkner et al., 2020), we also make efficient computation without the need to fit each model times.
If we worry about time series being non-stationary, we can reweight the likelihood in (11) by a non-decreasing sequence , so as to emphasize more recent dates. For example, where a fixed parameter determines how much influence early data has. By appending , the stacking weight can vary across the time variable, too.
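The reweighting idea can be sketched as follows. The particular sequence used here is only one illustrative non-decreasing choice and should be read as an assumption, as are the names `recency_weights`, `gamma`, and `one_step_dens`; the weighted objective simply multiplies each pointwise log score by its weight.

```python
import math

def recency_weights(n, gamma):
    """A non-decreasing weight sequence (assumed form, for illustration).

    Smaller gamma gives early observations less relative influence.
    """
    return [1.0 + gamma - (1.0 - (i + 1) / n) ** 2 for i in range(n)]

def weighted_log_score(w, one_step_dens, pi):
    """Recency-weighted stacking objective for time-ordered data."""
    total = 0.0
    for row, pi_i in zip(one_step_dens, pi):
        total += pi_i * math.log(sum(wk * pk for wk, pk in zip(w, row)))
    return total

# Hypothetical one-step-ahead predictive densities, ordered in time.
one_step_dens = [[0.7, 0.2], [0.6, 0.3], [0.3, 0.6], [0.2, 0.7]]
pi = recency_weights(len(one_step_dens), gamma=0.5)
print(pi, weighted_log_score([0.5, 0.5], one_step_dens, pi))
```

In this toy series the second model fits the recent points better, so the reweighted objective favors it more strongly than the unweighted one would.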
In Section 5.3, we present an election example with longitudinal polling data (40 weeks × 50 states). For the poll (already ordered by date), we encode state index into input , all other poll-specific variables , data , and poll outcome . We compute the one-week-ahead predictive density where the dataset contains polls from all states up to one week before date .
3. Why model averaging works and why hierarchical stacking can work better
The consistency of leave-one-out cross validation ensures that complete-pooling stacking (3) is asymptotically no worse than model selection in predictions (Clarke, 2003; Le and Clarke, 2017), hence justified by Bayesian decision theory. The theorems we establish in Section 3.2 go a step further, providing lower bounds on the utility gain of stacking and pointwise stacking. In short, the benefit of model averaging is more pronounced when the model predictive performances are locally separable, but in the same situation, we can improve the linear mixture model by learning locally which model is better, so that stacking is a step toward model improvement rather than an end in itself. We illustrate with a theoretical example in Appendix A and provide proofs in Appendix B (Yao et al., 2021).
3.1. All models are wrong, but some are somewhere useful
With an $\mathcal{M}$-closed view (Bernardo and Smith, 1994), one of the candidate models is the true data generating process, whereas in the more realistic $\mathcal{M}$-open scenario, none of the candidate models is completely correct, hence models are evaluated to the extent that they interpret the data.
The expectation of a strictly proper scoring rule, such as the expected log predictive density (elpd), is maximized at the correct data generating process. However, the extent to which a model is “true” is contingent on the input information we have collected. Consider an input-outcome pair (, ) generated by
If the input is not observed or is omitted in the analysis, then is the only correct model and is optimal among all probabilistic predictions of not conditioning on . But this marginally true model is strictly worse than a misspecified conditional prediction, , since the expected log predictive densities are and respectively after averaging over and . The former model is true purely because it ignores some predictors.
This wronger-model-does-better example does not contradict the log score being strictly proper, as we are changing the decision space from measures on to conditional measures on . But this example does underline two properties of model evaluation and averaging. First, we have little interest in a binary model check. The hypothesis testing based model-being-true-or-false depends on what variables to condition on and is not necessarily related to model fit or prediction accuracy. In a non-quantum scheme, a really “everywhere true” model that has exhausted all potentially unobserved inputs contains no aleatory uncertainty. Second, the model fits typically vary across the input space. In the Bernoulli example, despite its larger overall error, is more desired near , and is optimal at .
For theoretical interest, we define the conditional (on ) expected (on ) log predictive density in the model, . If are known, we can divide the input space into disjoint sets based on which model has the locally best fit (When there is a tie, the point is assigned the smallest index, and stands for “input”):
| (17) |
In this Bernoulli example, .
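The partition in (17) amounts to an argmax over models at each input, with ties going to the smallest model index. The sketch below assumes a hypothetical table `cond_elpd` of conditional expected log predictive densities (in practice these are unknown population quantities) and uses 0-based model indices.

```python
def best_model_regions(cond_elpd):
    """Group input values by the model with the best conditional elpd.

    cond_elpd maps each input value to a list of per-model values.
    Ties are resolved in favor of the smallest model index, because
    max() returns the first maximal element it encounters.
    """
    regions = {}
    for x, row in cond_elpd.items():
        k_best = max(range(len(row)), key=lambda k: row[k])
        regions.setdefault(k_best, []).append(x)
    return regions

# Made-up conditional elpd values for two models at three inputs.
cond_elpd = {0.1: [-0.69, -0.40], 0.5: [-0.69, -0.69], 0.9: [-0.69, -1.20]}
regions = best_model_regions(cond_elpd)
print(regions)
```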
3.2. The gain from stacking, and what can be gained more
In this subsection, we focus on the oracle expressiveness power of model selection and averaging, and their input-dependent version. refers to the complete-pooling stacking weight in the population:
| (18) |
Apart from the heuristic that model averaging is likely to be more useful when candidate models are more “dissimilar” or “distinct” (Breiman, 1996; Clarke, 2003), we are not aware of rigorous theories that characterize this “diversity” regarding the effectiveness of stacking. It seems tempting to use some divergence measure between posterior predictions from each model as a metric of how close these models are, but this is irrelevant to the true data generating process.
We define a more relevant metric on how individual predictive distributions can be pointwise separated. The description of a forecast being good is probabilistic on both and : an overall bad forecast may be lucky at a one-time realization of outcome and covariate . We consider the input-output product space and divide it into disjoint subsets ( stands for “joint”):
In this framework, we call a family of predictive densities to be locally separable with a constant pair and , if
| (19) |
Stacking is sometimes criticized for being a black box. The next two theorems link stacking weight to a probabilistic explanation. Unlike Bayesian model averaging (Hoeting et al., 1999) that computes the probability of a model being “true”, stacking is more related to : the probability of a model being the locally “best” fit, with respect to the true joint measure .
Theorem 1. When the separation condition (19) holds, the complete pooling stacking weight is approximately the probability of the model being the locally best fit:
| (20) |
in the sense that the objective function is nearly optimal:
| (21) |
Further, a model is only ignored by stacking if its winning probability is low.
Theorem 2. When the separation condition (19) holds, and if the k-th model has zero weight in stacking, , then the probability of its winning region is bounded by:
| (22) |
The right-hand side can be further upper-bounded by .
The separation condition (19) trivially holds for and an arbitrary , or for and an arbitrary , though in those cases the bounds (21) and (22) are too loose. To be clear, we only use the closed form approximation (20) for theoretical assessment.
The next theorem bounds the utility gain from shifting model selection to stacking:
Theorem 3. Under the separation condition (19), let , and a deterministic function, then the utility gain of stacking is lower-bounded by
Evaluating requires access to and . Though both terms are unknown, the roles of and are not symmetric: we could tailor the model in preparation for a future prediction at a given , but cannot be tailored for a realization of . To be more tractable, we consider the case when the variation on predominates the uncertainty of model comparison, such that , where is defined in (17). More precisely, we define a strong local separation condition with a distance-probability pair (, ):
| (23) |
We define . Under condition (23), and will be close. If we know the input space division , we can select model for and only for , which we call pointwise selection. The predictive density is
| (24) |
As per Theorem 3, for a given pair of and , the smaller is, the larger the improvement () stacking can achieve over model selection: this is the situation in which no model always predominates. Thus, the effectiveness of stacking can indicate heterogeneity of model fitting. Next, we show that the heterogeneity of model fitting provides an additional utility gain if we shift from stacking to pointwise selection:
Theorem 4. Under the strong separation condition (23), and if the divisions are known exactly, then the extra utility gain of pointwise selection has a lower bound,
For a given input location , the pointwise no-pooling optimum in the population is the same as the complete-pooling solution restricted to the slice . Hence, applying Theorem 3 to each slice will bound the advantage of pointwise averaging (4) against pointwise selection (24).
The potential utility gain from Theorems 3 and 4 is the motivation behind the input-varying model averaging. Despite this asymptotic expressiveness, the finite sample estimate remains challenging. (a) We do not know or . We may use leave-one-out cross validation to estimate the overall model fit , but in the pointwise version, we want to assess conditional model performance. Further, the more data come in, the more input locations we need to assess. (b) The asymptotic expressiveness comes with increasing complexity. The free parameters in single model selection, complete-pooling stacking, pointwise selection, and no-pooling stacking are, respectively, a single model index, a length- simplex, a vector of pointwise model selection index , and a matrix of pointwise weight . To handle this complexity-expressiveness tradeoff, it is natural to apply the hierarchical shrinkage prior.
3.3. Immunity to covariate shift
So far we have adopted an IID view: the training and out-of-sample data are from the same distribution. Yet another appealing property of hierarchical stacking is its immunity to covariate shift (Shimodaira, 2000), a ubiquitous problem in non-representative sample survey, data-dependent collection, causal inference, and many other areas.
If the distribution of inputs in the training sample, , differs from these predictors’ distribution in the population of interest, ( is absolutely continuous with respect to ), and if and remain invariant, then we do not need to adjust the weight estimate from (11), because it already targets the pointwise fit.
By contrast, complete-pooling stacking targets the average risk. Under covariate shift, the sample mean of leave-one-out score in the model, , is no longer a consistent estimate of population elpd. To adjust, we can run importance sampling (Sugiyama and Müller, 2005; Sugiyama et al., 2007; Yao et al., 2018) and reweight the term in the objective (3) proportional to the inverse probability ratio . Even in the ideal situation when both and are known, the importance weighted sum has in general larger or even infinite variance (Vehtari et al., 2019), thereby decreasing the effective sample size and convergence rate in complete-pooling stacking (toward its optimum (18)). When is unknown, the covariate reweighting is more complex while hierarchical stacking circumvents the need of explicit modeling of .
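The loss of effective sample size under importance reweighting can be made concrete. The sketch below uses the common approximation $(\sum_i r_i)^2 / \sum_i r_i^2$ for the effective sample size of importance ratios $r_i$, a standard diagnostic that is swapped in here for illustration rather than taken from the paper; the ratios and densities are invented.

```python
import math

def effective_sample_size(ratios):
    """Common approximation to the effective sample size of importance ratios."""
    total = sum(ratios)
    return total * total / sum(r * r for r in ratios)

def reweighted_objective(w, loo_dens, ratios):
    """Complete-pooling stacking objective with importance-weighted terms."""
    return sum(
        r * math.log(sum(wk * pk for wk, pk in zip(w, row)))
        for row, r in zip(loo_dens, ratios)
    )

balanced = [1.0, 1.0, 1.0, 1.0]   # no covariate shift
skewed = [100.0, 1.0, 1.0, 1.0]   # one point dominates under shift
loo_dens = [[0.9, 0.1], [0.2, 0.6], [0.3, 0.5], [0.1, 0.7]]

print(effective_sample_size(balanced), effective_sample_size(skewed))
print(reweighted_objective([0.5, 0.5], loo_dens, skewed))
```

With the skewed ratios, the reweighted objective is driven almost entirely by a single observation, which is the instability the paragraph above describes.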
When we are interested in only one fixed input location $\tilde x = x_0$, hierarchical stacking is ready for conditional predictions, whereas no-pooling stacking and reweighted complete-pooling stacking effectively discard all training data with $x_i \neq x_0$ in their objectives, which is a particular drawback when $x_0$ is rarely observed in the sample.
4. Related literature
Stacking (Wolpert, 1992; Breiman, 1996; LeBlanc and Tibshirani, 1996), or what we call complete-pooling stacking in this paper, has long been a popular method to combine learning algorithms, and has been advocated for averaging Bayesian models (Clarke, 2003; Clyde and Iversen, 2013; Le and Clarke, 2017; Yao et al., 2018). Stacking is applied in various areas such as recommendation systems, epidemiology (Bhatt et al., 2017), network modeling (Ghasemian et al., 2020), and post-processing in Monte Carlo computation (Tracey and Wolpert, 2016; Yao et al., 2020). Stacking can be equipped with any scoring rule, while the present paper focuses on the logarithmic score by default. Our theoretical investigation in Section 3.2 is inspired by the discussion of how to choose candidate models by Clarke (2003) and Le and Clarke (2017). In loss stacking, they recommended “independent” models in terms of posterior point predictions being independent. When combining Bayesian predictive distributions, the correlation of the posterior predictive means is not enough to summarize the relation between predictive distributions (Pirš and Štrumbelj, 2019), hence we consider the local separation condition instead.
Allowing a heterogeneous stacking model weight that changes with input is not a new idea. Feature-weighted linear stacking (Sill et al., 2009) constructs data-varying model weights of the -th model by , and optimizes the loss of the point predictions of the weighted model. This is similar to the likelihood term of our additive model specification in Section 2.4, except we model the unconstrained weights. The direct least-squares optimization solution from feature-weighted linear stacking is what we label no-pooling stacking.
It is also not a new idea to add regularization and optimize the penalized loss function. For loss stacking, Breiman (1996) advocated non-negative constraints. In the context of combining Bayesian predictive densities, a simplex constraint is necessary. Reid and Grudic (2009) investigated to add or penalty, or , into complete-pooling stacking objective (3). Yao et al. (2020) assigned a , prior distribution to the complete-pooling stacking weight vector to ensure strict concavity of the objective function. Sill et al. (2009) mentioned the use of penalization in feature-weighted linear stacking, which is equivalent to setting a fixed prior for all free parameters , , , whose solution path connects between uniform weighing and no-pooling stacking by tuning . All of these schemes are shown to reduce over-fitting with an appropriate amount of regularization, while the tuning is computation intensive. In particular, each stacking run is built upon one layer of cross validation to compute the expected pointwise score in each model , and this extra tuning would require to fit each model times for each tuning parameter value evaluation if both done in exact leave-one-out way. Fushiki (2020) approximated this double cross validation for loss complete-pooling stacking with penalty on , beyond which there was no general efficient approximation.
Hierarchical stacking treats and as parameters and samples them from the joint density. Such hierarchy could be approximated by using penalized point estimate with a different tuning parameter in each model, and tune all parameters ( for the basic model, or for the product model). But then this intensive tuning is the same as finding the Type-II MAP of hierarchical stacking in an inefficient grid search (in contrast to gradient-based MCMC).
Another popular family of regularization in stacking enforces sparse weights (e.g., Zhang and Zhou, 2011; Şen and Erdogan, 2013; Yang and Dunson, 2014), which include sparse and grouped sparse priors on the unconstrained weights, and sparse Dirichlet priors on simplex weights. The goal is that only a limited number of models are expressed. From our discussion in Section 3, all models are somewhere useful, hence we do not aim for model sparsity: the concavity of log scoring rules implicitly resists sparsity, and the posterior mean of hierarchical stacking weights is, in general, never sparse. Nevertheless, when sparsity is of concern for memory saving or interpretability, we can run hierarchical stacking first and then apply projection predictive variable selection (Piironen and Vehtari, 2017) to the posterior draws from the stacking model (11) and pick a sparse (or cell-wise sparse) solution.
Contrary to fitting individual models in parallel before model averaging, an alternative approach is to fit all models jointly in a bigger mixture model. Kamary et al. (2019) proposed Bayesian hypothesis testing by fitting an encompassing model . The mixture model requires simultaneously fitting model parameters and model weights . Yao et al. (2018) illustrated that (complete-pooling) stacking is often more stable than the full mixture, especially with small sample sizes and similar models. Nevertheless, our formulation of hierarchical stacking agrees with Kamary et al. (2019) in sampling from the posterior distribution of in a Bayesian model. A jointly-inferred model is related to the “mixture of experts” (Jacobs et al., 1991; Waterhouse et al., 1996) and “hierarchical mixture of experts” (Jordan and Jacobs, 1994; Svensën and Bishop, 2003), where and are parameterized by neural networks and trained jointly in the bigger mixture model. Hierarchical stacking differs from mixture modeling in two aspects. First, its separate inference of individual models and weights reduces the computational burden, making full Bayes affordable. Second, the built-in leave-one-out likelihood helps reduce overfitting. Both the mixture modeling and stacking approaches have limitations. Both can suffer from overfitting: a mixture-of-experts has more free parameters, and hierarchical stacking may have an under-regularized prior distribution. Both can suffer from underfitting: if the experts to mix are hard to separate, or if hierarchical stacking has sloppy individual models. Both can suffer from computation costs: a mixture-of-experts requires joint parameter estimation, and our method requires full Bayesian inference. Stacking and hierarchical stacking are more suitable when each individual model has already been developed to fit the data on its own.
Rather than competing with a mixture-of-experts on combining weak learners, hierarchical stacking is better suited to combining a mixture-of-experts with other sophisticated models. Lastly, our full-Bayesian formulation makes hierarchical stacking directly applicable to complex priors and complex data structures, such as time series or panel data, while these extensions are not straightforward in the mixture of experts.
5. Examples
We present three examples. The well-switching example demonstrates an automated hierarchical stacking implementation with both continuous and categorical inputs. The Gaussian process example highlights the benefit of hierarchical stacking when individual models are already highly expressive. The election forecast illustrates a real-world classification task with a complex data structure. We evaluate the proposed method on several metrics, including the mean log predictive density on holdout data, conditional log predictive densities, and the calibration error.
5.1. Well-switching in Bangladesh
We work with a dataset used by Vehtari et al. (2017) to demonstrate cross validation. A survey of size n = 3020 was conducted on residents from a small area in Bangladesh that was affected by arsenic in drinking water. Households with elevated arsenic levels in their wells were asked whether or not they were interested in switching to a neighbor’s well, denoted by . Well-switching behavior can be predicted by a set of household-level variables , including the detected arsenic concentration value in the well, the distance to the closest known safe well, the education level of the head of household, and whether any household members are in community organizations. The first two inputs are continuous and the remaining two are categorical variables.
We fit a series of logistic regressions, starting with an additive model including all covariates in model 1. In model 2, we replace one input, the well arsenic level, by its logarithm. In models 3 and 4, we add cubic spline basis functions with ten knots of well arsenic level and distance, respectively, as input variables. In model 5 we replace the categorical education variable with a continuous measure of years of schooling.
Using the additive model specification (13) and default prior (14), we model the unconstrained weight by a linear regression of all categorical inputs and all rectified continuous inputs (16). In this example the categorical input has eight distinct levels based on the product of education (four levels) and community participation (binary).
For comparison, we consider three alternative approaches: (a) complete-pooling stacking, (b) no-pooling stacking, that is, the maximum likelihood estimate of (13), and (c) model selection that picks the model with the highest leave-one-out log predictive density. We split the data into a training set (size 2000) and an independent holdout test set. The leftmost panel in Figure 2 displays the pointwise difference of leave-one-out log scores for models 1 and 2 against log arsenic values in training data. Intuitively, model 1 fits poorly for data with high arsenic. In line with this evidence, hierarchical stacking assigns model 1 an overall low weight, and an especially low weight at the right end of the arsenic levels. The second panel shows the pointwise posterior mean of the unconstrained weight difference between models 1 and 2, , against the arsenic values in training data. No-pooling stacking reveals a similar direction, that model 1’s weight should be lower with a higher arsenic value, but for lack of hierarchical prior regularization, the fitted is orders of magnitude larger (the third panel). As a result, the realized pointwise weights are nearly all either zero or one. The rightmost two columns in Figure 2 display the fitted pointwise weights of model 4 against log arsenic values and education level in test data. Because only a small proportion (7%) of respondents had high school education and above, the no-pooling stacking weight for this category is largely determined by small sample variation. Hierarchical stacking partially pools this “high school” effect toward the shared posterior mean of all educational levels, and the realized hierarchical stacking weights do not clearly depend on education levels.
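The saturation of no-pooling weights described above can be illustrated with a small numerical sketch. It assumes that simplex weights are obtained from unconstrained weights by a softmax transformation; the arrays `alpha_regularized` and `alpha_unregularized` are hypothetical values chosen for illustration, not fitted quantities from this example.

```python
import numpy as np

def softmax(alpha):
    """Map unconstrained weights (last axis indexes models) to the simplex."""
    shifted = alpha - alpha.max(axis=-1, keepdims=True)  # for numerical stability
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)

# Hypothetical unconstrained weights for two models at three input points.
# The second model is the reference with unconstrained weight fixed at zero.
alpha_regularized = np.array([[0.5, 0.0], [0.0, 0.0], [-0.5, 0.0]])

# Same direction of the effect, but orders of magnitude larger in scale,
# mimicking a fit without hierarchical prior regularization.
alpha_unregularized = 40.0 * alpha_regularized

w_reg = softmax(alpha_regularized)      # moderate weights, away from 0 and 1
w_unreg = softmax(alpha_unregularized)  # weights pushed to nearly 0 or 1
```

With the same sign pattern but a much larger scale, the softmax output is pushed to the corners of the simplex, which is the behavior reported for no-pooling stacking in the third panel.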
Figure 2:

(1) Pointwise difference of leave-one-out log scores between models 1 and 2, plotted against log arsenic. Model 1 poorly fits points with high arsenic. (2) Posterior mean of pointwise unconstrained weight difference between models 1 and 2, in hierarchical stacking. (3) Pointwise log weight difference between models 1 and 2 in no-pooling stacking. (4) Posterior mean of , the weight assigned to model 4, in hierarchical stacking, displayed against log arsenic and education levels. There are few samples with high school education and above, whose effect on model weights is pooled toward the shared mean. The blue line is the complete-pooling stacking. (5) The unconstrained weight of model 4, , in no-pooling stacking. The “high school” effect stands out and the resulting model weights are nearly all zeroes and ones.
We evaluate model fit on the following three metrics. To reduce randomness, we evaluate all these metrics averaging over 50 random training-test splits.
The log predictive densities averaged over test data. In the first panel of Figure 3, we set hierarchical stacking as a baseline and all other methods attain lower predictive densities.
The calibration error. We set 20 equally spaced bins between 0 and 1. For each bin and each learning algorithm, we collect the test data points whose model-predicted positive probability falls in that bin, and compute the absolute discrepancy between the realized proportion of positives in test data and the model-predicted probabilities. The middle panel in Figure 3 displays the resulting calibration error averaged over the 20 bins. The proposed hierarchical stacking has the lowest error. No-pooling stacking has the highest calibration error despite its higher overall log predictive densities than model selection, suggesting prediction overconfidence.
We compute the average log predictive densities of the four methods among the most shocking test data points (those with the lowest predictive densities under a given method) for varying from 10 to 200, out of a total test set of size 1020. As exhibited in the last panel in Figure 3, the proposed hierarchical stacking consistently outperforms all other approaches for all , showing robust performance in the worst-case scenario.
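The second and third metrics above can be sketched as follows. This is a minimal illustration, not the code used for Figure 3: the function names are hypothetical, the per-bin discrepancy is taken between the realized proportion of positives and the mean predicted probability in the bin, and empty bins are skipped when averaging, which is an assumption the text does not specify.

```python
import numpy as np

def calibration_error(p_pred, y, n_bins=20):
    """Binned calibration error for binary outcomes y in {0, 1}."""
    p_pred = np.asarray(p_pred, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; p_pred == 1 lands in the last bin.
    bin_ids = np.digitize(p_pred, edges[1:-1])
    errors = []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():  # assumption: empty bins are skipped
            errors.append(abs(y[in_bin].mean() - p_pred[in_bin].mean()))
    return float(np.mean(errors))

def worst_k_mean_lpd(log_dens, k):
    """Mean log predictive density over the k worst-predicted test points."""
    log_dens = np.sort(np.asarray(log_dens, dtype=float))  # ascending order
    return float(log_dens[:k].mean())
```

For instance, `worst_k_mean_lpd(log_dens, 10)` averages the ten lowest pointwise log predictive densities of one method on the test set.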
Figure 3:

We evaluate hierarchical, complete-pooling and no-pooling stacking, and model selection on three metrics: (a) average log predictive densities on test data, where we set the hierarchical stacking as benchmark 0, (b) calibration error: discrepancy between the predicted positive probability and realized proportion of positives in test data, averaged over 20 equally spaced bins, and (c) average log predictive densities among the worst test data points. We repeat 50 random training-test splits with training size 2000 and test size 1020.
Figure 4 presents the same comparisons of four methods while the training sample size varies from 100 to 1200 (averaged over 50 random training-test splits). In agreement with the heuristic in Figure 1, the most complex method—no-pooling stacking—performs especially poorly with a small sample size. By contrast, the simplest method, model selection, reaches its peak elpd quickly with a moderate sample size but cannot keep improving as training data size grows. The proposed hierarchical stacking performs the best in this setting under all metrics.
Figure 4:

Same comparisons as Figure 3, with training sample size varying from 100 to 1200.
Figure 1:

Evolution of methods. First row from left to right: the methods have a higher degree of freedom to ensure a higher asymptotic predictive accuracy, the gain of which is bounded by the labeled theorems. Meanwhile, complex methods come with a slower convergence rate. The hierarchical stacking is a generalization of all remaining methods by assigning various structured priors, and adapts to the complexity-expressiveness tradeoff by hierarchical modeling.
5.2. Gaussian process regression weighted by a Gaussian process
The local model averaging (12) tangles a -dependent weight and -dependent individual prediction . If the individual model , is already big enough to have exhausted “all” variability in input , is there still room for improvement by modeling local model weights ? The next example suggests a positive answer. It also showcases non-parametric priors in hierarchical stacking.
Consider a regression problem with observations at one-dimensional input locations . To the data we fit a Gaussian process regression on the latent function with zero mean and squared exponential covariance, and independent noise :
| (25) |
We adopt training data from Neal (1998). They were generated such that the posterior distribution of hyperparameters contains at least two isolated modes (the first panel in Figure 5). We consider three mode-based approximate inferences of : (a) Type-II MAP, where we pick the local modes of hyperparameters that maximize the marginal density , and further draw local variables , , (b) Laplace approximation of around the mode, and (c) importance resampling, where we draw uniform samples near the mode and keep each sample with probability proportional to . In the presence of two local modes , , we obtain either two MAPs or two nearly non-overlapping sets of draws. Yao et al. (2020) suggest using complete-pooling stacking to combine the two predictions, which shows advantages over other ad-hoc weighting strategies such as mode heights or importance weighting.
Figure 5:

From left to right, Column 1: posterior density at . At least two modes exist. Column 2: predictive distribution of from two modes. Column 3: the pointwise comparison of log predictive density of the Laplace approximations at two modes, and the hierarchical stacking weight of mode 1. Column 4: the test data mean predictive densities of the weighted model, where individual components in the final model consist of either the MAP, Laplace approximation, or importance sampling around the two modes, and the weighting methods include hierarchical stacking, complete-pooling stacking, mode heights and importance weighting.
Visually, mode 1 has a smaller length scale, is more wiggly, and is more attracted by the training data. Because of a better overall fit, it receives a higher complete-pooling stacking weight. However, the wiggling tail makes its extrapolation less robust. We now run hierarchical stacking with -dependent weight for mode by placing another Gaussian process prior on unconstrained weight with squared exponential covariance,
Despite using the same GP prior, this is not related to the training regression model (25). To evaluate how good the weighted ensemble is, we generate independent holdout test data (, ). Both training and test inputs, and , are distributed from normal(0, 1). As presented in the rightmost panel in Figure 5, for all three approximate inferences, hierarchical stacking always has a higher mean test log predictive density than complete pooling stacking and other weighting schemes.
In this dataset, exact MCMC is able to explore both posterior modes in model (25) after sufficiently long sampling. Gaussian process regression equipped with exact Bayesian inference can be regarded as the “always true” model here. Hierarchical stacking achieves a similar average test data fit by combining two Laplace approximations.
Furthermore, hierarchical stacking has better predictive performance under covariate shift. To examine local model fit, we generate another independent holdout test data, with results shown in Figure 6. This time the test inputs are from uniform(−3, 3). We divide the test data into 10 equally spaced bins and compute the mean test data log predictive density inside each bin. Compared with exact inference, hierarchical stacking has comparable performance in the bulk region of , while it yields higher predictive densities in the tail, suggesting a more reliable extrapolation.
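The bin-wise comparison described above can be sketched as follows. The function name `binned_mean_lpd` is hypothetical, the default range matches the uniform(−3, 3) test inputs, and leaving empty bins as NaN is an assumption.

```python
import numpy as np

def binned_mean_lpd(x_test, lpd, low=-3.0, high=3.0, n_bins=10):
    """Mean pointwise log predictive density inside equally spaced bins of x."""
    x_test = np.asarray(x_test, dtype=float)
    lpd = np.asarray(lpd, dtype=float)
    edges = np.linspace(low, high, n_bins + 1)
    # Bin ids 0..n_bins-1; points on or beyond the outer edges go to the end bins.
    bin_ids = np.clip(np.digitize(x_test, edges) - 1, 0, n_bins - 1)
    means = np.full(n_bins, np.nan)  # assumption: empty bins are reported as NaN
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            means[b] = lpd[in_bin].mean()
    return edges, means
```

Applying this function to the pointwise test log predictive densities of two methods and subtracting the resulting `means` arrays gives a bin-wise difference of the kind plotted in Figure 6.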
Figure 6:

Compare the test log predictive densities of hierarchical stacking with (left) a long-chain exact Bayes and (right) stacking of two Laplace approximations. A positive value means hierarchical stacking has a better fit.
5.3. U.S. presidential election forecast
We explore the use of hierarchical stacking on a practical example of forecasting polls for the 2016 United States presidential election. Since the polling data are naturally divided into states, they provide a suitable platform for hierarchical stacking in which model weights vary by state.
To create a pool of candidate models, we first concisely describe the model of Heidemanns et al. (2020), an updated dynamic election forecasting model from Linzer (2013), and then follow up with different variations of it. Let be the index of an individual poll, the number of respondents that support the Democratic candidate, and the number of respondents who support either the Democratic or the Republican candidate in the poll. Let and denote the state and time of poll respectively. The observations follow a Binomial likelihood , where is modeled by
| (26) |
where superscripts denote parameter names, and subscripts their indexes. The term is the underlying support for the Democratic candidate, and , , and represent different bias terms. is further decomposed into
| (27) |
where is the house effect, polling population effect, polling mode effect, and an adjustment term for non-response bias. Furthermore, an autoregressive (AR(1)) prior is given to the , where is the estimated state-covariance matrix and is the estimate from the fundamentals.
Although we believe this model reasonably fits data, there is always room for improvement. Our pool of candidates consists of eight models. : The fundamentals-based model of Abramowitz (2008). : The model of Heidemanns et al. (2020). : without the fundamentals prior, . : with an AR(2) structure, , . : simplify without polling population effect, polling mode effect, and the adjustment trend for non-response bias, . : where we added an extra regression term into model (26) using the S&P 500 index at the time of poll . : without the entire shared bias term, . : without hierarchical structure on states.
We equip hierarchical stacking with either the basic independent prior (9) or the state-correlated prior (15). The prior correlation is estimated using a pool of state-level macro variables (election results in the past, racial makeup, educational attainment, etc.), and has already been used in some of the individual models to partially pool state-level polling. We plug this pre-estimated prior correlation in the correlated stacking prior (15) and refer to it as “hierarchical stacking with correlation” in later comparisons.
Since the data are longitudinal, we evaluate different pooling approaches using a one-week-ahead forecast with an expanding window for each conducted poll. We extract the fitted one-week-ahead predictions from each individual model, and train hierarchical stacking, complete-pooling, and no-pooling stacking, and evaluate the combined models by computing their mean log predictive densities on the unseen data next week. To account for the non-stationarity discussed in Section 2.5, we only use the last four weeks prior to prediction day for training model averaging. In the end we obtain a trajectory of this back-testing performance of hierarchical stacking, complete-pooling stacking, no-pooling stacking, and single model selection.
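The back-testing scheme above can be sketched as a generator of training and test windows for the model-averaging step. This is an illustration under stated assumptions: days are indexed by integers, windows are half-open intervals, and the function name and the `step_days` argument are hypothetical.

```python
def backtest_windows(first_day, last_day, train_days=28, horizon_days=7, step_days=1):
    """Pairs of (train, test) half-open day intervals [start, end).

    The stacking weights for a prediction day are trained on the preceding
    `train_days` days (four weeks) and evaluated on the following
    `horizon_days` days (one week ahead).
    """
    windows = []
    for day in range(first_day + train_days, last_day - horizon_days + 1, step_days):
        train = (day - train_days, day)
        test = (day, day + horizon_days)
        windows.append((train, test))
    return windows

# Example: a 42-day period yields eight prediction days.
example_windows = backtest_windows(0, 42)
```

Restricting the training window to the most recent four weeks trades a smaller sample for robustness to non-stationarity, which is one reason no-pooling stacking can become unstable in this example.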
The left-hand side of Figure 7 shows the seven-day running average of the one-week-ahead back-test log predictive density from models combined with various approaches. The right-hand side of Figure 7 shows the overall cumulative one-week-ahead back-test log predictive density. We set the uncorrelated hierarchical stacking to be a constant zero for reference. Hierarchical stacking performs the best, followed by stacking, no-pooling stacking, and model selection respectively. The advantage of hierarchical stacking is highest at the beginning and slowly decreases the closer we get to election day. As we move closer to the election, more polls become available, so the candidate models become better and also more similar since some models only differ in priors. As a result, all combination methods eventually become more similar. No-pooling stacking has high variance and hence performs the worst out of all combination methods. Hierarchical stacking with correlated prior performs similarly to the independent approach, with a minor advantage at the beginning of the year, where the prior correlation stabilizes the state weights, and later we see this advantage more discernible in individual states.
Figure 7:

Left: pointwise differences in 7-day running mean log predictive densities on one-week-ahead test data, where we set the hierarchical stacking as benchmark 0. Right: pointwise differences in cumulative average predictive log density by date. The advantage of hierarchical stacking is most noticeable toward the beginning, where there are fewer polls available.
To examine small area estimates, we divided states into three categories based on how many state polls were conducted. Figure 8 shows the overall mean pointwise differences in test log predictive densities divided by these categories, along with a fourth panel over all states. No-pooling stacking performs the worst in all panels. An explanation for that could be that we are using a four-week moving window to tackle non-stationarity, which might not contain enough data for the no-pooling method. The variance of the no-pooling is amended by the hierarchical approach, which performs on par with stacking with scarcer data and outperforms it otherwise. Figure 14 in Appendix E (Yao et al., 2021) shows the state-level cumulative log predictive density by time. With a large number of state polls available, for example, close to election day in Florida and North Carolina, no-pooling stacking performs well. In states with fewer polls, no-pooling stacking is unstable. Hierarchical stacking alleviates this instability while retaining enough flexibility for a good performance when large data come in.
Figure 8:

Mean test log predictive densities with 50% and 95% confidence intervals, among subsets of states with few, moderate, and many numbers of state polls, and among all states. Correlated hierarchical stacking is set as reference 0. It is better than independent hierarchical stacking when data are scarce. Complete-pooling stacking is close to hierarchical stacking in small states but worse in bigger states.
Figure 9 illustrates how cell size affects the pooling effect. The first panel shows the hierarchical stacking state-wise weights for the first candidate model as a function of date. For either early-date forecasts or states with few polls, hierarchical stacking weights are more pooled toward the shared nationwide mean. The middle and right panels compare the difference between state-wise hierarchical stacking weights and the nationwide mean, or with no-pooling weights, against the total number of respondents for each state and prediction date. The cells with more observed data are less pooled and closer to their no-pooling optimums, and vice versa.
Figure 9:

Hierarchical stacking weights for . Left: weights for of the 10 states with fewest polls and with most polls over time. Dotted line shows the complete-pooling stacking weight and the solid black line is the nationwide mean weight. States with fewer polls are shrunken more toward the mean. Middle: absolute differences between state-wise hierarchical stacking weights and the nationwide mean, against number of respondents. The blue line is the linear trend reference. States with smaller sample sizes are more pooled to the mean. Right: absolute differences between hierarchical stacking and no-pooling stacking weights.
6. Discussion
6.1. Robustness in small areas
The input-varying model averaging improves both the overall averaged prediction and conditional prediction , whereas these two tasks are subject to a trade-off in complete-pooling stacking. Besides, the partial pooling prior (9) borrows information from other cells, which stabilizes model weights in small cells where data are not enough for no-pooling stacking. For a crude mean-field approximation, the likelihood in the discrete model (11) is approximately , where is the unconstrained no-pooling stacking weight, and is the diagonal element of the Hessian. Because appears in terms of the summation, for a given . Combined with the prior , the conditional posterior mean of the -th model weight in the -th cell is the usual precision-weighted average of the no-pooling optimum and the shared mean: . Hence for a given model , . Larger pooling usually occurs in smaller cells. This pooling factor is in line with Figure 9 and general ideas in hierarchical modeling (Gelman and Pardoe, 2006). Our full-Bayesian solution also integrates out and , which further partially pools across models.
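The precision-weighted average in this argument can be written out as a worked equation. The notation below is an assumption introduced for illustration and may differ from the paper's symbols: $\hat\alpha_{jk}$ is the no-pooling optimum of model $k$ in cell $j$, $s_{jk}^2$ is the approximate likelihood variance (the inverse Hessian diagonal), and $\mu_k$ and $\sigma_k^2$ are the prior mean and variance. The last line further assumes that the curvature grows roughly in proportion to the cell size $n_j$, with a model-specific constant $c_k$.

```latex
\begin{align*}
\text{likelihood (approx.):}\quad & \alpha_{jk} \mapsto \mathrm{normal}\!\left(\hat\alpha_{jk} \mid \alpha_{jk},\, s_{jk}^{2}\right),
\qquad \text{prior:}\quad \alpha_{jk} \sim \mathrm{normal}\!\left(\mu_k,\, \sigma_k^{2}\right), \\[4pt]
\mathrm{E}\!\left[\alpha_{jk} \mid \mu_k, \sigma_k, \text{data}\right]
 &= \frac{s_{jk}^{-2}\,\hat\alpha_{jk} + \sigma_k^{-2}\,\mu_k}{s_{jk}^{-2} + \sigma_k^{-2}}
 = (1-\lambda_{jk})\,\hat\alpha_{jk} + \lambda_{jk}\,\mu_k, \\[4pt]
\lambda_{jk} &= \frac{\sigma_k^{-2}}{s_{jk}^{-2} + \sigma_k^{-2}}
 \;\approx\; \frac{1}{1 + n_j\,\sigma_k^{2}/c_k}
 \qquad \text{if } s_{jk}^{2} \approx c_k / n_j .
\end{align*}
```

Under these assumptions the pooling factor $\lambda_{jk}$ decreases as $n_j$ grows, so small cells are shrunk more strongly toward the shared mean.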
The possibility of partial pooling across cells encourages open-ended data gathering. In the election polling example, even if a pollster is only interested in the forecast of one state, they could gather polling data from everywhere else, fit multiple models, evaluate models on each state, and use hierarchical stacking to construct model averaging, which is especially applicable when the state of interest does not have enough polls to conduct a meaningful model evaluation individually. In this context swing states naturally have more state polls, so that the small-area estimation may not be crucial, but in general, we conjecture that the hierarchical techniques can be useful for model evaluation and averaging in a more general domain adaptation setting. Without going into extra details, hierarchical models are as useful for making inferences from a subset of data (small-area estimation) as to generalize data to a new area (extrapolation). When the latter task is the focus, hierarchical stacking only needs to redefine the leave-one-data-out predictive density (6) by leave-one-cell-out .
6.2. Using hierarchical stacking to understand local model fit
We use hierarchical stacking not only as a tool for optimizing predictions but also as a way to understand problems with fitted models. The fact that hierarchical stacking is being used is already an implicit recognition that we have different models that perform better or worse in different subsets of data, and it can be valuable to explore the conditions under which different models fit poorly, to reveal potential problems in the data or data processing, and to point to directions for individual-model improvement.
Vehtari et al. (2017) and Gelman et al. (2020) suggested examining the pointwise cross-validated log score as a function of , to see if there is a pattern or explanation for why some observations are harder to fit than others. For example, the first panel of Figure 2 seems to indicate that Model 1 is incapable of fitting the rightmost 10–15 non-switchers. However, contains a non-vanishing variance since is a single realization from . Despite the merit of raw cross validation scores in exploratory data analysis, it is hard to tell from them whether Model 1 is incapable of fitting high arsenic or is merely unlucky for these few points. The hierarchical stacking weight provides a smoothed summary of how each model fits locally in and comes with built-in Bayesian uncertainty estimation. For example, in Figure 5, has a slightly inflated right tail, but this small bump is smoothed by stacking, and the local weight therein is close to (0.5, 0.5).
6.3. Retrieving a formal likelihood from an optimization objective
The implication of hierarchical stacking (11) being a formal Bayesian model is that we can evaluate its posterior distribution as with a regular Bayesian model. For example, we can run (approximate) leave-one-out cross validation of the stacking posterior . In practice, we only need to fit the stacking model (11) once, collect a size- MCMC sample of stacking parameters from the full posterior , denoted by , compute the PSIS-stabilized importance ratio of each draw , and then compute the mean leave-one-out cross validated log predictive density to evaluate the overall out-of-sample fit of the final stacked model:
| (28) |
As discussed in Section 4, the same task of out-of-sample prediction evaluation in an optimization-based stacking requires double cross validation (refit the model times if using leave-one-out), but now becomes almost computationally free by post-processing posterior draws of stacking.
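The post-processing step can be sketched as follows, under simplifying assumptions. The input `log_lik[s, i]` is taken to be the log of the pointwise stacking likelihood term for data point i evaluated at posterior draw s of the stacking parameters. The sketch uses raw self-normalized importance ratios; the Pareto smoothing step of PSIS is deliberately omitted, so in practice an established PSIS implementation should replace this helper. The function names are hypothetical.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

def is_loo_mean_lpd(log_lik):
    """Importance-sampling leave-one-out estimate from one posterior sample.

    log_lik: array of shape (n_draws, n_points).
    Raw log importance ratios are -log_lik[s, i]; the self-normalized estimate
    sum_s r_is p_is / sum_s r_is reduces to n_draws / sum_s exp(-log_lik[s, i]).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    pointwise = np.log(n_draws) - _logsumexp(-log_lik, axis=0)
    return float(pointwise.mean()), pointwise
```

Because only posterior draws from a single fit are reused, the cost is a few array operations rather than repeated refitting, which is the computational point made above.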
The Bayesian justification above applies to log-score stacking. In general, we cannot convert an arbitrary objective function into a log density: its exponential is not necessarily integrable, and, even if it is, the resulting density does not necessarily correspond to a relevant model. Take linear regression, for example: the ordinary least squares estimate arg is identical to the maximum likelihood estimate of from a probabilistic model , , with flat priors. But the directly adapted “log posterior density” from the negative loss, , differs from the Bayesian inference of the latter probabilistic model unless . The hierarchical stacking framework may still apply to other scoring rules, while we leave their Bayesian calibration for future research.
6.4. Statistical workflow for black box algorithms
Unlike our previous work (Yao et al., 2018) that merely applied stacking to Bayesian models, the present paper converts optimization-based stacking itself into a formal Bayesian model, analogous to reformulating a least-squares estimate into a normal-error regression. Breiman (2001) distinguished between the two cultures: the generative modeling assumes that data come from a given stochastic model, whereas the algorithmic modeling treats the data mechanism unknown and advocates black box learning for the goal of predictive accuracy. As a method that Breiman himself introduced (along with Wolpert, 1992), stacking is arguably closer to the algorithmic end of the spectrum, while our hierarchical Bayesian formulation pulls it toward the generative modeling end.
Such a full-Bayesian formulation is appealing for two reasons. First, the generative modeling language facilitates flexible data inclusion during model averaging. For example, the election forecast model contains various outcomes on state polls and national polls from several pollsters, pollster-, state- and national-level fundamental predictors, and prior state-level correlations. It is not clear how methods like bagging or boosting can include all of them. Data do not have to conveniently arrive in independent (, ) pairs and compliantly await an algorithm to train upon. Second, instead of a static algorithm, hierarchical stacking is now part of a statistical workflow (Gelman et al., 2020). It then enjoys all the flexibility of Bayesian model building, fitting, and checking: we can incorporate other Bayesian shrinkage priors as add-on components without reinventing them; we can run a posterior predictive check or approximate leave-one-out cross validation (28) to assess the out-of-sample performance of the final stacking model; we may even further select, stack, or hierarchically stack a sequence of hierarchical stacking models with various priors and parametric forms. Looking ahead, the success of this work encourages more use of generative Bayesian modeling to improve other black box prediction algorithms.
Supplementary Material
Appendices to “Bayesian hierarchical stacking: Some models are (somewhere) useful” (DOI: 10.1214/21-BA1287SUPP; .pdf).
Acknowledgments
The authors thank the National Science Foundation, Institute of Education Sciences, Office of Naval Research, National Institutes of Health, Sloan Foundation, Schmidt Futures, and the Academy of Finland Flagship programme: Finnish Center for Artificial Intelligence (FCAI) for partial financial support. Gregor Pirš is supported by the Slovenian Research Agency young researcher grant.
Footnotes
We use the bold letter , or to reflect that the weight is a vector, or a vector of functions.
References
- Abramowitz AI (2008). “Forecasting the 2008 presidential election with the time-for-change model.” Political Science and Politics, 41: 691–695.
- Bernardo JM and Smith AF (1994). Bayesian Theory. Chichester: Wiley. MR1274699. doi: 10.1002/9780470316870.
- Bhatt S, Cameron E, Flaxman SR, Weiss DJ, Smith DL, and Gething PW (2017). “Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization.” Journal of The Royal Society Interface, 14.
- Breiman L. (1996). “Stacked regressions.” Machine Learning, 24: 49–64.
- Breiman L. (2001). “Statistical modeling: the two cultures.” Statistical Science, 16: 199–231. MR1874152. doi: 10.1214/ss/1009213726.
- Bürkner P-C, Gabry J, and Vehtari A (2020). “Approximate leave-future-out cross-validation for Bayesian time series models.” Journal of Statistical Computation and Simulation, 90: 2499–2523. MR4145352. doi: 10.1080/00949655.2020.1783262.
- Clarke B. (2003). “Comparing Bayes model averaging and stacking when model approximation error cannot be ignored.” Journal of Machine Learning Research, 4: 683–712. MR2072265. doi: 10.1162/153244304773936090.
- Clyde M and Iversen ES (2013). “Bayesian model averaging in the M-open framework.” In Bayesian Theory and Applications, 483–498. Oxford University Press. MR3221178. doi: 10.1093/acprof:oso/9780199695607.003.0024.
- Fushiki T. (2020). “On the selection of the regularization parameter in stacking.” Neural Processing Letters, 1–12.
- Gelman A and Pardoe I (2006). “Bayesian measures of explained variance and pooling in multilevel (hierarchical) models.” Technometrics, 48: 241–251. MR2277678. doi: 10.1198/004017005000000517.
- Gelman A, Vehtari A, Simpson D, Margossian CC, Carpenter B, Yao Y, Kennedy L, Gabry J, Bürkner P-C, and Modrák M (2020). “Bayesian workflow.” arXiv:2011.01808.
- Ghasemian A, Hosseinmardi H, Galstyan A, Airoldi EM, and Clauset A (2020). “Stacking models for nearly optimal link prediction in complex networks.” Proceedings of the National Academy of Sciences, 117: 23393–23400.
- Heidemanns M, Gelman A, and Morris GE (2020). “An updated dynamic Bayesian forecasting model for the US presidential election.” Harvard Data Science Review, 2.
- Heiner M, Kottas A, and Munch S (2019). “Structured priors for sparse probability vectors with application to model selection in Markov chains.” Statistics and Computing, 29: 1077–1093. MR3994618. doi: 10.1007/s11222-019-09856-2.
- Hoeting JA, Madigan D, Raftery AE, and Volinsky CT (1999). “Bayesian model averaging: a tutorial.” Statistical Science, 382–401. MR1765176. doi: 10.1214/ss/1009212519.
- Jacobs RA, Jordan MI, Nowlan SJ, and Hinton GE (1991). “Adaptive mixtures of local experts.” Neural Computation, 3: 79–87.
- Jordan MI and Jacobs RA (1994). “Hierarchical mixtures of experts and the EM algorithm.” Neural Computation, 6: 181–214.
- Kamary K, Mengersen K, Robert CP, and Rousseau J (2019). “Testing hypotheses via a mixture estimation model.” arXiv:1412.2044.
- Le T and Clarke B (2017). “A Bayes interpretation of stacking for M-complete and M-open settings.” Bayesian Analysis, 12: 807–829. MR3655877. doi: 10.1214/16-BA1023.
- LeBlanc M and Tibshirani R (1996). “Combining estimates in regression and classification.” Journal of the American Statistical Association, 91: 1641–1650. MR1439105. doi: 10.2307/2291591.
- Linzer DA (2013). “Dynamic Bayesian forecasting of presidential elections in the states.” Journal of the American Statistical Association, 108: 124–134. MR3174607. doi: 10.1080/01621459.2012.737735.
- Meng X-L and van Dyk DA (1999). “Seeking efficient data augmentation schemes via conditional and marginal augmentation.” Biometrika, 86: 301–320. MR1705351. doi: 10.1093/biomet/86.2.301.
- Neal RM (1998). “Regression and classification using Gaussian process priors.” In Bernardo J, Berger JO, Dawid AP, and Smith AFM (eds.), Bayesian Statistics, volume 6, 475–501. Oxford University Press. MR1723510.
- Piironen J and Vehtari A (2017). “Comparison of Bayesian predictive methods for model selection.” Statistics and Computing, 27(3): 711–735. MR3613594. doi: 10.1007/s11222-016-9649-y.
- Pirš G and Štrumbelj E (2019). “Bayesian combination of probabilistic classifiers using multivariate normal mixtures.” Journal of Machine Learning Research, 20: 1–18. MR3948091.
- Polson NG and Scott SL (2011). “Data augmentation for support vector machines.” Bayesian Analysis, 6: 1–23. MR2781803. doi: 10.1214/11-BA601.
- Reid S and Grudic G (2009). “Regularized linear models in stacked generalization.” In International Workshop on Multiple Classifier Systems, 112–121.
- Şen MU and Erdogan H (2013). “Linear classifier combination and selection using group sparse regularization and hinge loss.” Pattern Recognition Letters, 34: 265–274.
- Shimodaira H. (2000). “Improving predictive inference under covariate shift by weighting the log-likelihood function.” Journal of Statistical Planning and Inference, 90: 227–244. MR1795598. doi: 10.1016/S0378-3758(00)00115-4.
- Sill J, Takács G, Mackey L, and Lin D (2009). “Feature-weighted linear stacking.” arXiv:0911.0460.
- Stan Development Team (2020). Stan Modeling Language Users Guide and Reference Manual. Version 2.25.0, http://mc-stan.org.
- Sugiyama M, Krauledat M, and Müller K-R (2007). “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research, 8: 985–1005.
- Sugiyama M and Müller K-R (2005). “Input-dependent estimation of generalization error under covariate shift.” Statistics and Decisions, 23: 249–280. MR2255627. doi: 10.1524/stnd.2005.23.4.249.
- Svensën M and Bishop CM (2003). “Bayesian hierarchical mixtures of experts.” In Uncertainty in Artificial Intelligence.
- Tracey BD and Wolpert DH (2016). “Reducing the error of Monte Carlo algorithms by learning control variates.” In Conference on Neural Information Processing Systems.
- Vehtari A, Gelman A, and Gabry J (2017). “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Statistics and Computing, 27: 1413–1432. MR3647105. doi: 10.1007/s11222-016-9696-4.
- Vehtari A and Ojanen J (2012). “A survey of Bayesian predictive methods for model assessment, selection and comparison.” Statistics Surveys, 6: 142–228. MR3011074. doi: 10.1214/12-SS102.
- Vehtari A, Simpson D, Gelman A, Yao Y, and Gabry J (2019). “Pareto smoothed importance sampling.” arXiv:1507.02646.
- Waterhouse S, MacKay D, and Robinson T (1996). “Bayesian methods for mixtures of experts.” In Advances in Neural Information Processing Systems.
- Wolpert DH (1992). “Stacked generalization.” Neural Networks, 5: 241–259.
- Yang Y and Dunson DB (2014). “Minimax optimal Bayesian aggregation.” arXiv:1403.1345.
- Yao Y, Vehtari A, and Gelman A (2020). “Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors.” arXiv:2006.12335.
- Yao Y, Vehtari A, Simpson D, and Gelman A (2018). “Using stacking to average Bayesian predictive distributions (with discussion).” Bayesian Analysis, 13: 917–1007. MR3853125. doi: 10.1214/17-BA1091.
- Yao Y, Pirš G, Vehtari A, and Gelman A (2021). “Supplementary material for: Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful.” Bayesian Analysis. doi: 10.1214/21-BA1287SUPP.
- Zhang L and Zhou W-D (2011). “Sparse ensembles using weighted combination methods based on linear programming.” Pattern Recognition, 44: 97–106. MR3184068.
