Author manuscript; available in PMC: 2025 Sep 18.
Published in final edited form as: Bayesian Anal. 2021 Sep 27;17(4):1043–1071. doi: 10.1214/21-ba1287

Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful

Yuling Yao, Gregor Pirš, Aki Vehtari, Andrew Gelman
PMCID: PMC12442486  NIHMSID: NIHMS2108428  PMID: 40970226

Abstract

Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and we can further improve the stacked mixture with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking. The model weights are varying as a function of data, partially-pooled, and inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theory bounds, and demonstrate on several applied problems.

Keywords: Bayesian hierarchical modeling, conditional prediction, covariate shift, model averaging, stacking, prior construction

1. Introduction

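Statistical inference is conditional on the model, and a general challenge is how to make full use of multiple candidate models. Consider data $\mathcal{D} = (y_i \in \mathcal{Y}, x_i \in \mathcal{X})_{i=1}^n$, and $K$ models $M_1, \dots, M_K$, each having its own parameter vector $\theta_k \in \Theta_k$, likelihood, and prior. We fit each model and obtain posterior predictive distributions,

$$p(\tilde y \mid \tilde x, M_k) = \int_{\Theta_k} p(\tilde y \mid \tilde x, \theta_k, M_k)\, p(\theta_k \mid \{y_i, x_i\}_{i=1}^n, M_k)\, d\theta_k. \tag{1}$$

The model fit is judged by its expected predictive utility on future (out-of-sample) data $(\tilde y, \tilde x) \in \mathcal{Y} \times \mathcal{X}$, which generally have an unknown true joint density $p_t(\tilde y, \tilde x)$. Model selection seeks the single model with the highest utility when averaged over $p_t(\tilde y, \tilde x)$. Model averaging assigns the models weights $w_1, \dots, w_K$ subject to a simplex constraint $w \in \mathcal{S}_K = \{w : \sum_{k=1}^K w_k = 1;\ w_k \in [0,1],\ \forall k\}$, and the future prediction is a linear mixture of the individual models:

$$p(\tilde y \mid \tilde x, w, \text{model averaging}) = \sum_{k=1}^K w_k\, p(\tilde y \mid \tilde x, M_k), \quad w \in \mathcal{S}_K. \tag{2}$$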

Stacking (Wolpert, 1992), among other ensemble learners, has been successful for various prediction tasks. Yao et al. (2018) apply the stacking idea to combine predictions from separate Bayesian inferences. The first step is to fit each individual model and evaluate the pointwise leave-one-out predictive density of each data point $i$ under each model $k$:

$$p_{k,-i} = \int_{\Theta_k} p(y_i \mid \theta_k, x_i, M_k)\, p(\theta_k \mid M_k, \{(x_{i'}, y_{i'}) : i' \neq i\})\, d\theta_k,$$

which in a Bayesian context we can approximate using posterior simulations and Pareto-smoothed importance sampling (Vehtari et al., 2017). Reusing data eliminates the need to model the unknown joint density $p_t(\tilde y, \tilde x)$. The next step is to determine the vector of weights $w = (w_1, \dots, w_K)$ that optimizes the average log score of the stacked prediction,

$$\hat w^{\text{stacking}} = \arg\max_{w} \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k\, p_{k,-i}\right), \quad \text{such that } w \in \mathcal{S}_K. \tag{3}$$
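
The optimization in (3) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the leave-one-out densities are already available as an n-by-K table (the toy values in `p_loo` are made up), and it swaps in the EM fixed-point update for mixture weights as the optimizer, since the objective is concave in the weights. The names `stacking_weights` and `log_score` are hypothetical.

```python
import math

def stacking_weights(p_loo, n_iter=2000, tol=1e-12):
    """Maximize sum_i log(sum_k w_k * p_loo[i][k]) over the simplex.

    p_loo[i][k]: leave-one-out predictive density of point i under model k.
    Optimizer: EM fixed-point update for mixture weights (an assumption of
    this sketch; any simplex-constrained optimizer could be used instead).
    """
    n, K = len(p_loo), len(p_loo[0])
    w = [1.0 / K] * K  # start from uniform weights
    for _ in range(n_iter):
        new_w = [0.0] * K
        for row in p_loo:
            mix = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                # responsibility of model k for this point, averaged over points
                new_w[k] += w[k] * row[k] / mix / n
        change = max(abs(a - b) for a, b in zip(w, new_w))
        w = new_w
        if change < tol:
            break
    return w

def log_score(w, p_loo):
    """Objective of (3): summed log density of the stacked mixture."""
    return sum(math.log(sum(wk * pk for wk, pk in zip(w, row))) for row in p_loo)

# Toy table: model 0 fits the first two points better, model 1 the last two.
p_loo = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.6]]
w_hat = stacking_weights(p_loo)
```

Because both models win somewhere, the fitted weights stay strictly inside the simplex and the stacked log score exceeds that of either single model.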

However, the linear mixture (2) imposes an identical set of weights for all inputs $x$. We will later label the solution (3) as complete-pooling stacking. The present paper proposes hierarchical stacking, an approach that goes further in three ways:

  1. Framing the estimation of the stacking weights as a Bayesian inference problem rather than a pure optimization problem. This in itself does not make much difference in the complete-pooling estimate (3) but is helpful in the later development.

  2. Expanding to a hierarchical model in which the stacking weights can vary over the population. If the model predictors x take on J different values in the data, we can use Bayesian inference to estimate a J×K matrix of weights that partially pools the data both in row and column.

  3. Further expanding to allow weights to vary as a function of continuous predictors. This idea generalizes the feature-weighted linear stacking (Sill et al., 2009) with a more flexible form and Bayesian hierarchical shrinkage.

There are two reasons to consider input-dependent model weights. First, the scoring rule measures the expected predictive performance averaged over $\tilde x$ and $\tilde y$, as the objective function in (3) divided by $n$ is a consistent estimate of $\mathrm{E}_{\tilde x, \tilde y} \log\left(\sum_{k=1}^K w_k\, p(\tilde y \mid \tilde x, \mathcal{D}, M_k)\right)$. But an overall good model fit does not ensure a good conditional prediction at a given location $\tilde x = \tilde x_0$, or under covariate shift, when the distribution of the input $x$ in the observations differs from that in the population of interest. More importantly, different models can be good at explaining different regions of the input-response space, which is why model averaging can be a better solution than model selection. Even if we are only interested in average performance, we can further improve model averaging by learning where each model is good so as to locally inflate its weight.

In Section 2, we develop detailed implementation of hierarchical stacking. We explain why it is legitimate to convert an optimization problem into a formal Bayesian model. With hierarchical shrinkage, we partially pool the stacking weights across data. By varying priors, hierarchical stacking includes classic stacking and selection as special cases. We generalize this approach to continuous input variables, other structured priors, and time-series and longitudinal data. In Section 3, we turn heuristics from the previous paragraph into a rigorous learning bound, indicating the benefit from model selection to model averaging, and from complete-pooling model averaging to a local averaging that allows the model weights to vary in the population. We outline related work in Section 4. In Section 5, we evaluate the proposed method in several simulated and real-data examples, including a U.S. presidential election forecast.

This paper makes two main contributions:

  • Hierarchical stacking provides a Bayesian recipe for model averaging with input-dependent weights and hierarchical regularization. It is beneficial for both improving the overall model fit, and the conditional local fit in small and new areas.

  • Our theoretical results characterize how the model list should be locally separated to be useful in model averaging and local model averaging.

2. Hierarchical stacking

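The present paper generalizes the linear model averaging (2) to pointwise model averaging. The goal is to construct an input-dependent model weight function $w(x) = (w_1(x), \dots, w_K(x)) : \mathcal{X} \to \mathcal{S}_K$, and combine the predictive densities pointwise by

$$p(\tilde y \mid \tilde x, w(\cdot), \text{pointwise averaging}) = \sum_{k=1}^K w_k(\tilde x)\, p(\tilde y \mid \tilde x, M_k), \quad \text{such that } w(\cdot) \in \mathcal{S}_K^{\mathcal{X}}. \tag{4}$$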

If the input is discrete and has finite categories, one naïve estimation of the pointwise optimal weight is to run complete-pooling stacking (3) separately on each category, which we will label no-pooling stacking. The no-pooling procedure generally has a larger variance and overfits the data.

From a Bayesian perspective, it is natural to compromise between unpooled and completely pooled procedures by a hierarchical model. Given some hierarchical prior $p^{\text{prior}}(\cdot)$, we define the posterior distribution of the stacking weights $w \in \mathcal{S}_K^{\mathcal{X}}$ through the usual likelihood-prior protocol:

$$\log p(w(\cdot) \mid \mathcal{D}) = \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right) + \log p^{\text{prior}}(w) + \text{constant}, \quad w(\cdot) \in \mathcal{S}_K^{\mathcal{X}}. \tag{5}$$

The final estimate of the pointwise stacking weight used in (4) is then the posterior mean from this joint density, $\mathrm{E}(w(\cdot) \mid \mathcal{D})$. We call this approach hierarchical stacking.

2.1. Complete-pooling and no-pooling stacking

For notational consistency, we rewrite the input variables into two groups (x, z), where x are variables on which the model weight w(x) depends during model averaging (4), and z are all remaining input variables.

To start, we consider $x$ to be discrete with $J < \infty$ categories, $x = 1, \dots, J$. We will extend to continuous and hybrid $x$ later. The input-varying stacking weight function is parameterized by a $J \times K$ matrix $\{w_{jk}\} \in \mathcal{S}_K^J$: each row of the matrix is an element of the length-$K$ simplex. The $k$-th model in cell $j$ has the weight $w_k(x_i) = w_{jk}$, $\forall x_i = j$. We fit each individual model $M_k$ to all observed data $\mathcal{D} = (x_i, z_i, y_i)_{i=1}^n$ and obtain pointwise leave-one-out cross-validated predictive densities:

$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid \theta_k, x_i, z_i, M_k)\, p(\theta_k \mid \{(x_l, y_l, z_l) : l \neq i\}, M_k)\, d\theta_k. \tag{6}$$

As in complete-pooling stacking, we avoid refitting each model $n$ times and instead use Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017, 2019) to approximate $\{p_{k,-i}\}_{i=1}^n$ from the posterior draws $p(\theta_k \mid M_k, \mathcal{D})$ of a single fit. The cost of such approximate leave-one-out cross validation is often negligible compared with individual model fitting.
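
To make the single-fit idea concrete, here is a minimal sketch of the plain importance-sampling estimate of one leave-one-out density from posterior draws. It is deliberately simpler than PSIS: it omits the Pareto smoothing of the importance ratios, so it is an illustration of the underlying identity rather than the method cited above. The function names and the toy draws are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def is_loo_log_density(log_lik_draws):
    """Plain importance-sampling estimate of log p(y_i | data without i).

    log_lik_draws[s] = log p(y_i | theta_s) for draws theta_s from the
    full-data posterior. Uses the identity
        p(y_i | y_-i) = 1 / E_post[ 1 / p(y_i | theta) ],
    i.e. the harmonic mean of the pointwise likelihood over the draws.
    No Pareto smoothing is applied (assumption of this sketch).
    """
    S = len(log_lik_draws)
    return -(logsumexp([-ll for ll in log_lik_draws]) - math.log(S))

# Two toy posterior draws with pointwise likelihoods 0.2 and 0.5.
draws = [math.log(0.2), math.log(0.5)]
loo_density = math.exp(is_loo_log_density(draws))
```

The raw ratios can have heavy tails, which is the instability that the Pareto smoothing step in PSIS is designed to reduce.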

To optimize the expected predictive performance of the pointwise combined model average, we can maximize the leave-one-out predictive density

$$\max_{w(\cdot)} \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right). \tag{7}$$

On one extreme, complete-pooling stacking (3) solves the optimization (7) subject to a constant constraint $w_k(x) = w_k(x')$, $\forall k, x, x'$. On the other extreme, no-pooling stacking maximizes the objective function (7) without any constraint other than the row-simplex condition, which amounts to separately solving complete-pooling stacking (3) on each input cell $\mathcal{D}_j = \{(x_i, z_i, y_i) : x_i = j\}$.

If there are a large number of repeated measurements in each cell, $n_j := |\{i : x_i = j\}| \to \infty$, then $\frac{1}{n_j} \sum_{i : x_i = j} \log \sum_{k=1}^K w_k(j)\, p_{k,-i}$ becomes a reasonable estimate of the conditional log predictive density $\int_{\mathcal{Y}} p_t(\tilde y \mid \tilde x = j) \log \sum_{k=1}^K w_k(j)\, p(\tilde y \mid j, M_k)\, d\tilde y$, with convergence rate $\sqrt{n_j}$, and therefore no-pooling stacking becomes asymptotically optimal among all cell-wise combination weights. For finite sample sizes, because each cell is smaller than the total sample, we would expect a larger variance in no-pooling stacking than in complete-pooling stacking. Moreover, the cell sizes are often unbalanced, which makes the no-pooling stacking weights especially noisy in small cells.

2.2. Bayesian inference for stacking weights

Vanilla (optimization-based) stacking (3) is justified by Bayesian decision theory: the expected log predictive density of the combined model, $\mathrm{E}_{\tilde y} \log\left(\sum_{k=1}^K w_k\, p(\tilde y \mid M_k)\right)$, is estimated by its leave-one-out counterpart $\frac{1}{n} \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k\, p_{k,-i}\right)$. The point optimum asymptotically maximizes the expected utility (Le and Clarke, 2017), hence is an M-optimal decision in terms of Vehtari and Ojanen (2012).

To fold stacking into a Bayesian inference problem, we want to treat the objective function in (7) as a log likelihood with parameter $w$. After integrating out the individual-model-specific parameters $\theta_k$ such that $p(y \mid x, M_k)$ is given, the outcomes $y_i$ at input location $x_i$ in the combined model have densities $p(y_i \mid x_i, w_k(x_i)) = \sum_{k=1}^K w_k(x_i) \times p(y_i \mid x_i, M_k)$, which implies a joint log likelihood: $\sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p(y_i \mid x_i, M_k)\right)$. But this procedure has used the data twice. In other practices, data are often used twice to pick the prior, whereas here data are used twice to pick the likelihood.

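We use a two-stage estimation procedure to avoid reusing data. Assuming a hypothetically provided holdout dataset $\mathcal{D}'$ of the same size and identical distribution as the observations $\mathcal{D} = \{y_i, x_i\}_{i=1}^n$, we can use $\mathcal{D}'$ to fit the individual models first and compute $\tilde p(y_i \mid x_i, M_k, \mathcal{D}') = \int p(y_i \mid x_i, M_k, \theta_k)\, p(\theta_k \mid M_k, \mathcal{D}')\, d\theta_k$. In the second stage we plug in the observed $y_i$, $x_i$, and obtain the pointwise full likelihood $p(y_i \mid w, \mathcal{D}', x_i) = \sum_{k=1}^K w_k(x_i)\, \tilde p(y_i \mid x_i, M_k, \mathcal{D}')$.

Now, lacking the holdout data $\mathcal{D}'$, the leave-$i$-th-observation-out predictive density $p_{k,-i}$ is a consistent estimate of the pointwise out-of-sample predictive density $\mathrm{E}_{\mathcal{D}'}\left(\tilde p(y_i \mid x_i, M_k, \mathcal{D}')\right)$. By plugging it into the two-stage log likelihood and integrating out the unobserved holdout data $\mathcal{D}'$, we get a profile likelihood

$$p(y_i \mid w, x_i) := \mathrm{E}_{\mathcal{D}'}\left(p(y_i \mid w, x_i, \mathcal{D}')\right) = \sum_{k=1}^K w_k(x_i)\, \mathrm{E}_{\mathcal{D}'}\left(p(y_i \mid x_i, M_k, \mathcal{D}')\right) \approx \sum_{k=1}^K w_k(x_i)\, p_{k,-i}.$$

Summing over $y_i$ gives $\log p(\mathcal{D} \mid w) \approx \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right)$. This log likelihood coincides with the no-pooling optimization objective function (7).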

Integrating out the hypothetical data 𝒟 is related to the idea of marginal data augmentation (Meng and van Dyk, 1999). Polson and Scott (2011) took a similar approach to convert the optimization-based support vector machine into a Bayesian inference.

2.3. Hierarchical stacking: discrete inputs

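The log posterior density of the hierarchical stacking model (5) contains the log likelihood defined above, $\sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right)$, and a prior distribution on the weight matrix $w = \{w_{jk}\} \in \mathcal{S}_K^J$, which we specify in the following.

We first take a softmax transformation that bijectively converts the simplex matrix space $\mathcal{S}_K^J$ to the unconstrained space $\mathbb{R}^{J(K-1)}$:

$$w_{jk} = \frac{\exp(\alpha_{jk})}{\sum_{k'=1}^K \exp(\alpha_{jk'})}, \quad 1 \le k \le K-1,\ 1 \le j \le J; \qquad \alpha_{jK} = 0,\ 1 \le j \le J. \tag{8}$$

$\alpha_{jk} \in \mathbb{R}$ is interpreted as the log odds ratio of model $k$ with reference to $M_K$ in cell $j$.

We propose a normal hierarchical prior on the unconstrained model weights $(\alpha_{jk})_{k=1}^{K-1}$ conditional on hyperparameters $\mu \in \mathbb{R}^{K-1}$ and $\sigma \in \mathbb{R}_+^{K-1}$,

$$\text{prior:} \quad \alpha_{jk} \mid \mu_k, \sigma_k \sim \mathrm{normal}(\mu_k, \sigma_k), \quad k = 1, \dots, K-1,\ j = 1, \dots, J. \tag{9}$$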
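
The transformation (8) and its inverse are easy to write down for a single cell. The sketch below is illustrative only; the function names are hypothetical, and the last model is taken as the reference category with its unconstrained value fixed at zero, as in (8).

```python
import math

def weights_from_alpha(alpha_row):
    """Map K-1 unconstrained values for one cell to a length-K simplex vector.

    The reference model K has its unconstrained value fixed at 0, as in (8).
    """
    full = list(alpha_row) + [0.0]
    m = max(full)  # subtract the max for numerical stability
    expd = [math.exp(a - m) for a in full]
    total = sum(expd)
    return [e / total for e in expd]

def alpha_from_weights(w_row):
    """Inverse map: log odds of each model relative to the last one."""
    return [math.log(wk / w_row[-1]) for wk in w_row[:-1]]

w_cell = weights_from_alpha([1.0, -0.5])      # three models in one cell
alpha_back = alpha_from_weights(w_cell)       # recovers [1.0, -0.5]
```

The round trip recovers the inputs, which is the sense in which the map is a bijection between the simplex interior and the unconstrained space.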

The prior partially pools unconstrained weights toward the shared mean $(\mu_1, \dots, \mu_{K-1})$. The shrinkage effect depends on both the cell sample size $n_j$ (how strong the likelihood is in cell $j$), and the model-specific $\sigma_k$ (how much across-cell discrepancy is allowed in model $k$). If $\mu$ and $\sigma$ are given constants, and if the posterior distribution is summarized by its mode, then hierarchical stacking contains two special cases:

  • no-pooling stacking by a flat prior $\sigma_k \to \infty$, $k = 1, \dots, K-1$.

  • complete-pooling stacking by a concentration prior $\sigma_k \to 0$, $k = 1, \dots, K-1$.

It is possible to derive other structured priors. For example, a sparse prior (e.g., Heiner et al., 2019) on the simplex $(w_{j1}, \dots, w_{jK})$ will enforce a cell-wise selection.

Instead of choosing fixed values, we view $\mu$ and $\sigma$ as hyperparameters and aim for a full Bayesian solution: to describe the uncertainty of all parameters by their joint posterior distribution $p(\alpha, \mu, \sigma \mid \mathcal{D})$, letting the data tell how much regularization is desired.

To accomplish this Bayesian inference, we assign a hyperprior to $(\mu, \sigma)$:

$$\text{hyperprior:} \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma), \quad k = 1, \dots, K-1, \tag{10}$$

where $\mathrm{normal}^+(0, \tau_\sigma)$ stands for the half-normal distribution supported on $[0, \infty)$ with scale parameter $\tau_\sigma$.

Putting the pieces (5), (9), (10) together, up to a normalization constant that has been omitted, we attain a joint posterior density of all free parameters $\alpha \in \mathbb{R}^{J \times (K-1)}$, $\mu \in \mathbb{R}^{K-1}$, $\sigma \in \mathbb{R}_+^{K-1}$:

$$\log p(\alpha, \mu, \sigma \mid \mathcal{D}) = \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right) + \sum_{k=1}^{K-1} \sum_{j=1}^J \log p^{\text{prior}}(\alpha_{jk} \mid \mu_k, \sigma_k) + \sum_{k=1}^{K-1} \log p^{\text{hyperprior}}(\mu_k, \sigma_k). \tag{11}$$

Unlike complete-pooling and no-pooling stacking, which are typically solved by optimization, the maximum a posteriori (MAP) estimate of (11) is not meaningful: the mode is attained on the complete-pooling subspace $\alpha_{jk} = \mu_k$, $\sigma_k = 0$, $\forall j, k$, on which the joint density is positive infinity. Instead, we sample $(\alpha, \mu, \sigma)$ from the joint density (11) using Markov chain Monte Carlo (MCMC) methods and compute the Monte Carlo mean of the posterior draws, $\bar w_{jk}$, which we will call the hierarchical stacking weights.
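
The unnormalized log density (11) can be evaluated directly, which also makes the degeneracy of the mode visible. The sketch below is a plain-Python illustration under stated assumptions, not the Stan program the paper refers to: the toy leave-one-out table, the cell labels, and the default hyperprior values `mu0`, `tau_mu`, `tau_sigma` are all made up, and every function name is hypothetical.

```python
import math

def softmax_ref(alpha_row):
    """Softmax with the last model as reference (unconstrained value 0)."""
    full = list(alpha_row) + [0.0]
    m = max(full)
    e = [math.exp(a - m) for a in full]
    s = sum(e)
    return [v / s for v in e]

def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def half_normal_logpdf(x, scale):
    # density of |Z| for Z ~ normal(0, scale), valid for x >= 0
    return math.log(2.0) + normal_logpdf(x, 0.0, scale)

def log_joint(alpha, mu, sigma, cell, p_loo, mu0=0.0, tau_mu=1.0, tau_sigma=1.0):
    """Unnormalized log posterior in the form of (11).

    alpha: J x (K-1) nested list; mu, sigma: length K-1;
    cell[i] in 0..J-1 is the cell of point i; p_loo: n x K LOO densities.
    """
    w = [softmax_ref(row) for row in alpha]
    lp = 0.0
    for j, row in zip(cell, p_loo):              # log likelihood term
        lp += math.log(sum(wk * pk for wk, pk in zip(w[j], row)))
    for row in alpha:                            # hierarchical prior (9)
        for k, a in enumerate(row):
            lp += normal_logpdf(a, mu[k], sigma[k])
    for k in range(len(mu)):                     # hyperprior (10)
        lp += normal_logpdf(mu[k], mu0, tau_mu)
        lp += half_normal_logpdf(sigma[k], tau_sigma)
    return lp

# Toy data: two cells, two models. On the complete-pooling subspace
# (alpha_jk = mu_k) the density grows without bound as sigma shrinks.
p_loo = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.6]]
cell = [0, 0, 1, 1]
alpha, mu = [[0.3], [0.3]], [0.3]
values = [log_joint(alpha, mu, [s], cell, p_loo) for s in (1.0, 1e-2, 1e-4)]
```

The three values increase as the scale shrinks, which is why the text samples from this density rather than maximizing it.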

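The final posterior predictive density of the outcome $\tilde y$ at any input location $(\tilde x, \tilde z)$ is

$$\text{final predictions:} \quad p(\tilde y \mid \tilde x, \tilde z, \mathcal{D}) = \sum_{k=1}^K \bar w_k(\tilde x) \int_{\Theta_k} p(\tilde y \mid \tilde x, \tilde z, \theta_k, M_k)\, p(\theta_k \mid M_k, \mathcal{D})\, d\theta_k. \tag{12}$$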

Using a point estimate w¯jk is not a waste of the joint simulation draws. Because equation (12) is a linear expression on wk, and because of the linearity of expectation, using w¯ is as good as using all simulation draws. Nonetheless, for the purpose of post-processing, approximate cross validation, and extra model check and comparison, we will use all posterior simulation draws; see discussion in Section 6.3.
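
The linearity argument can be written out in one line. In this sketch, $w^{(s)}$ denotes the $s$-th of $S$ posterior draws of the weights and $p_k(\tilde y)$ is shorthand (introduced here, not in the paper) for the posterior predictive density of model $k$ appearing inside (12):

```latex
\frac{1}{S}\sum_{s=1}^{S}\sum_{k=1}^{K} w_k^{(s)}(\tilde x)\, p_k(\tilde y)
  = \sum_{k=1}^{K}\Bigl(\frac{1}{S}\sum_{s=1}^{S} w_k^{(s)}(\tilde x)\Bigr)\, p_k(\tilde y)
  = \sum_{k=1}^{K} \bar w_k(\tilde x)\, p_k(\tilde y).
```

Averaging the mixture over weight draws therefore gives exactly the mixture evaluated at the averaged weights.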

2.4. Hierarchical stacking: continuous and hybrid inputs

The next step is to include more structure in the weights, which could correspond to regression for continuous predictors, nonexchangeable models for nested or crossed grouping factors, nonparametric prior, or combinations of these.

Additive model

Hierarchical stacking is not limited to a discrete cell-divider $x$. When the input $x$ is continuous or hybrid, one extension is to model the unconstrained weights additively:

$$w_{1:K}(x) = \mathrm{softmax}\left(w^*_{1:K}(x)\right), \quad w^*_k(x) = \mu_k + \sum_{m=1}^M \alpha_{mk} f_m(x),\ k \le K-1, \quad w^*_K(x) = 0, \tag{13}$$

where $\{f_m : \mathcal{X} \to \mathbb{R}\}$ are $M$ distinct features. Here we have already extracted the prior mean $\mu_k$, representing the “average” weight of model $k$ in the unconstrained space. The discrete model (9) is now equivalent to letting $f_m(x) = 1(x = m)$ for $m = 1, \dots, J$. We may still use the basic prior (9) and hyperprior (10):

$$\alpha_{mk} \mid \sigma_k \sim \mathrm{normal}(0, \sigma_k), \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma). \tag{14}$$

We provide Stan (Stan Development Team, 2020) code for this additive model, and discuss recommendations on the hyperprior and feature design in Appendix C and D (Yao et al., 2021).
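
For readers who want to see the additive weight function (13) without opening the appendix, the following plain-Python sketch evaluates it for given parameter values. It is not the paper's Stan code; the function name, the indicator features, and the numeric values of `mu` and `alpha` are illustrative assumptions.

```python
import math

def additive_weights(x, mu, alpha, features):
    """Evaluate the weight function of (13) at input x.

    mu: length K-1 intercepts; alpha[m][k]: coefficient of feature m for
    model k (k < K-1); features: list of M callables f_m(x).
    The last model is the reference with unconstrained weight 0.
    """
    f = [fm(x) for fm in features]
    unconstrained = [
        mu[k] + sum(alpha[m][k] * f[m] for m in range(len(f)))
        for k in range(len(mu))
    ]
    unconstrained.append(0.0)
    top = max(unconstrained)  # stabilize the softmax
    e = [math.exp(u - top) for u in unconstrained]
    s = sum(e)
    return [v / s for v in e]

# Discrete special case: indicator features for two cells, two models.
features = [lambda x: 1.0 if x == 0 else 0.0, lambda x: 1.0 if x == 1 else 0.0]
mu = [0.0]
alpha = [[1.0], [-1.0]]   # model 0 is favored in cell 0, disfavored in cell 1
w_cell0 = additive_weights(0, mu, alpha, features)
w_cell1 = additive_weights(1, mu, alpha, features)
```

With indicator features the additive form reduces to one set of weights per cell, matching the discrete model described in the text.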

Because the main motivation of our paper is to convert the one-fit-all model-averaging algorithm into open-ended Bayesian modeling, the basic shrinkage prior above should be viewed as a starting point for model building and improvement. Without trying to exhaust all possible variants, we list a few useful prior structures:

  • Grouped hierarchical prior. The basic model (14) is limited to a single regularization scale $\sigma_k$ for all $\alpha_{mk}$. When the features $f_m(x)$ are grouped (e.g., $f_m$ are dummy variables from two discrete inputs; states are grouped in regions), we achieve group-specific shrinkage by replacing (14) with
    $$\alpha_{mk} \mid \sigma_{gk} \sim \mathrm{normal}(0, \sigma_{g[m]k}), \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_{gk} \sim \mathrm{normal}^+(0, \tau_\sigma),$$
    where $g[m] \in \{1, \dots, G\}$ is the group index of feature $m$.
  • Feature-model decomposition. Alternatively, we can learn feature-dependent regularization by
    $$\alpha_{mk} \mid \mu_k, \sigma_k, \lambda_m \sim \mathrm{normal}(0, \sigma_k \lambda_m), \quad \lambda_m \sim \mathrm{InvGamma}(a, b), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma).$$
  • Prior correlation. For discrete cells, we would like to incorporate prior knowledge of the group correlation. For example, in the election forecast (Section 5.3), we have a rough sense of some states being demographically close, and would expect similar model weights therein. To this end, we calculate a prior correlation matrix $\Omega_{J \times J}$ from various sources of state-level historical data, and replace the independent prior (9) by a multivariate normal (MVN) distribution,
    $$(\alpha_{1k}, \dots, \alpha_{Jk}) \mid \sigma, \Omega, \mu \sim \mathrm{MVN}\left((\mu_k, \dots, \mu_k),\ \mathrm{diag}(\sigma_k^2) \times \Omega\right). \tag{15}$$

    The prior correlation is especially useful to stabilize stacking weights in small cells.

  • Crude approximation of input density. When applying the basic model (13) to continuous inputs $x = (x_1, \dots, x_D) \in \mathbb{R}^D$, instead of a direct linear regression $f_d(x) = x_d$, we recommend a coordinate-wise ReLU-type transformation:
    $$\{f : f_{2d-1}(x) = (x_d - \mathrm{med}(x_d))_+,\ f_{2d}(x) = (\mathrm{med}(x_d) - x_d)_+,\ d \le D\}, \tag{16}$$
    where $\mathrm{med}(x_d)$ is the sample median of $x_d$. The pointwise model predictive performance typically relies on the training density $p_X^{\text{train}}(\tilde x)$: the more training data seen nearby, the better the predictions. The feature (16) is designed to be a crude approximation of log marginal input densities.
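
The median-centered hinge features in (16) can be built as follows. This is a small illustrative sketch; `make_relu_features` and the toy training inputs are hypothetical names and values, and the medians are taken coordinate-wise over the training inputs.

```python
import statistics

def make_relu_features(X_train):
    """Return a function mapping an input x (length D) to the 2D features of (16).

    For each coordinate d the two features are the positive parts of
    (x_d - median_d) and (median_d - x_d), with medians from X_train.
    """
    D = len(X_train[0])
    medians = [statistics.median([row[d] for row in X_train]) for d in range(D)]

    def features(x):
        out = []
        for d in range(D):
            out.append(max(x[d] - medians[d], 0.0))  # distance above the median
            out.append(max(medians[d] - x[d], 0.0))  # distance below the median
        return out

    return features

X_train = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]]  # toy inputs, D = 2
features = make_relu_features(X_train)
phi = features([4.0, 10.0])
```

Each coordinate contributes one active feature at most, measuring how far the input sits from the bulk of the training data on that side.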

Gaussian process prior

An alternative way to generalize both the discrete prior in Section 2.3 and the prior correlation (15) is to use Gaussian process priors. To this end we need $K-1$ covariance kernels $\mathcal{K}_1, \dots, \mathcal{K}_{K-1}$, and place priors on the unconstrained weight $\alpha_k(x)$, viewed as an $\mathcal{X} \to \mathbb{R}$ function: $\alpha_k(x) \sim \mathcal{GP}(\mu_k, \mathcal{K}_k(x))$. The discrete prior is a special case of a Gaussian process via a zero-one kernel $\mathcal{K}_k(x_i, x_j) = \sigma_k 1(x_i = x_j)$. Due to the previously discussed measurement error and the preference for stronger regularization with continuous $x$, we recommend simple exponentiated quadratic kernels $\mathcal{K}_k(x_i, x_j) = a_k \exp\left(-\left((x_i - x_j)/\rho_k\right)^2\right)$ with an informative hyperprior that avoids a length-scale $\rho_k$ that is too small or too large, and an $a_k$ that is too large. We present an example in Section 5.2.
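
As a concrete illustration of the recommended kernel, the sketch below builds the covariance matrix for one-dimensional inputs. It assumes the exponentiated quadratic form a * exp(-((x_i - x_j) / rho)^2); the function name and the example values of `a` and `rho` are hypothetical.

```python
import math

def exp_quad_kernel(xs, a, rho):
    """Covariance matrix K[i][j] = a * exp(-((x_i - x_j) / rho) ** 2).

    xs: list of scalar inputs; a: marginal scale; rho: length-scale.
    The kernel form is an assumption of this sketch.
    """
    return [[a * math.exp(-((xi - xj) / rho) ** 2) for xj in xs] for xi in xs]

xs = [0.0, 1.0, 3.0]
K = exp_quad_kernel(xs, a=2.0, rho=1.5)
```

A larger length-scale makes the weights of nearby inputs more strongly correlated, which is the regularization the informative hyperprior is meant to control.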

2.5. Time series and longitudinal data

Hierarchical stacking can easily extend to time series and longitudinal data. Consider a time series dataset where outcomes $y_i$ come sequentially in time $0 \le t_i \le T$. The joint likelihood is not exchangeable, but is still factorizable via $p(y_{1:n} \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta, y_{1:(i-1)})$. Therefore, assuming some stationarity condition, we can approximate the expected log predictive density of the next unseen outcome by the historical average of one-unit-ahead log predictive densities, defined by

$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid x_i, y_{1:(i-1)}, x_{1:(i-1)}, \theta_k, M_k)\, p(\theta_k \mid y_{1:(n-1)}, x_{1:(n-1)})\, d\theta_k.$$

In hierarchical stacking, we only need to replace the regular leave-one-out predictive density (6) by this redefined pk,i, and run hierarchical stacking (11) as usual. Using importance sampling based approximation (Bürkner et al., 2020), we also make efficient computation without the need to fit each model n times.

If we worry about the time series being non-stationary, we can reweight the likelihood in (11) by a non-decreasing sequence $\pi_i$: $n \sum_{i=1}^n \left(\pi_i \log\left(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\right)\right) / \sum_{i=1}^n \pi_i$, so as to emphasize more recent dates. For example, $\pi_i = 1 + \gamma - (1 - t_i/T)^2$, where a fixed parameter $\gamma > 0$ determines how much influence early data has. By appending $x := (x, t)$, the stacking weight can vary across the time variable, too.
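
The time reweighting can be sketched as below. The code assumes the example weight has the form 1 + gamma - (1 - t/T)^2, which is non-decreasing in t as the text requires; that reading, the function names, and the toy numbers are assumptions of this sketch.

```python
def time_weights(times, T, gamma):
    """Non-decreasing weights pi_i = 1 + gamma - (1 - t_i / T) ** 2.

    The weight equals gamma at t = 0 and 1 + gamma at t = T, so a small
    gamma strongly down-weights early observations (assumed form).
    """
    return [1.0 + gamma - (1.0 - t / T) ** 2 for t in times]

def reweighted_log_lik(pointwise_log_lik, pi):
    """n * sum_i(pi_i * l_i) / sum_i(pi_i): the reweighted likelihood term."""
    n = len(pointwise_log_lik)
    return n * sum(p * l for p, l in zip(pi, pointwise_log_lik)) / sum(pi)

pi = time_weights([0.0, 5.0, 10.0], T=10.0, gamma=0.5)
total = reweighted_log_lik([-1.0, -2.0, -3.0], pi)
```

With equal weights the expression reduces to the ordinary sum of pointwise log likelihoods, so the reweighting keeps the likelihood on the same overall scale.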

In Section 5.3, we present an election example with longitudinal polling data (40 weeks × 50 states). For the $i$-th poll (already ordered by date), we encode the state index into the input $x_i = 1, \dots, 50$, all other poll-specific variables into $z_i$, the date into $t_i$, and the poll outcome into $y_i$. We compute the one-week-ahead predictive density $p_{k,-i} := \int p(y_i \mid x_i, z_i, \mathcal{D}_{-i}, M_k)\, p(\theta_k \mid \mathcal{D}_{-i}, M_k)\, d\theta_k$, where the dataset $\mathcal{D}_{-i} = \{(y_l, x_l, z_l) : t_l \le t_i - 7\}$ contains polls from all states up to one week before date $t_i$.

3. Why model averaging works and why hierarchical stacking can work better

The consistency of leave-one-out cross validation ensures that complete-pooling stacking (3) is asymptotically no worse than model selection in predictions (Clarke, 2003; Le and Clarke, 2017), hence justified by Bayesian decision theory. The theorems we establish in Section 3.2 go a step further, providing lower bounds on the utility gain of stacking and pointwise stacking. In short, the advantage of model averaging is more pronounced when the model predictive performances are locally separable, but in the same situation, we can improve the linear mixture model by learning locally which model is better, so that stacking is a step toward model improvement rather than an end in itself. We illustrate with a theoretical example in Appendix A and provide proofs in Appendix B (Yao et al., 2021).

3.1. All models are wrong, but some are somewhere useful

With an $\mathcal{M}$-closed view (Bernardo and Smith, 1994), one of the candidate models is the true data generating process, whereas in the more realistic $\mathcal{M}$-open scenario, none of the candidate models is completely correct, hence models are evaluated to the extent that they interpret the data.

The expectation of a strictly proper scoring rule, such as the expected log predictive density (elpd), is maximized at the correct data generating process. However, the extent to which a model is “true” is contingent on the input information we have collected. Consider an input-outcome pair $(x, y)$ generated by

$$x \in [0,1], \quad y \in \{0,1\}, \quad x \sim \mathrm{uniform}(0,1), \quad \Pr(y = 1 \mid x) = x.$$

If the input $x$ is not observed or is omitted in the analysis, then $M_1: y \sim \mathrm{Bernoulli}(0.5)$ is the only correct model and is optimal among all probabilistic predictions of $y$ that do not condition on $x$. But this marginally true model is strictly worse than a misspecified conditional prediction, $M_2: \Pr(y = 1 \mid x) = \sqrt{x}$, since the expected log predictive densities are $\log(0.5) = -0.69$ and $-7/12 = -0.58$ respectively, after averaging over $x$ and $y$. The former model is true purely because it ignores some predictors.
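
The two expected log predictive densities quoted above can be checked numerically. The sketch assumes the data-generating process Pr(y = 1 | x) = x with x uniform on (0, 1), and reads the misspecified model M2 as predicting Pr(y = 1 | x) = sqrt(x), which is the reading under which the value -7/12 is recovered. The midpoint-rule integration and the function names are choices of this sketch.

```python
import math

def elpd_m1():
    """M1 predicts Bernoulli(0.5) regardless of x, so every outcome scores log(0.5)."""
    return math.log(0.5)

def elpd_m2(n_grid=200000):
    """Expected log score of M2 (predicting sqrt(x)) under the truth Pr(y=1|x) = x.

    Integrates x*log(q) + (1-x)*log(1-q), q = sqrt(x), over x in (0, 1)
    with the midpoint rule, which avoids the endpoints where logs diverge.
    """
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid
        q = math.sqrt(x)
        total += x * math.log(q) + (1.0 - x) * math.log(1.0 - q)
    return total / n_grid

score_m1 = elpd_m1()
score_m2 = elpd_m2()
```

Under these assumptions the misspecified conditional model scores higher on average than the marginally correct one.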

This wronger-model-does-better example does not contradict the log score being strictly proper, as we are changing the decision space from measures on $y$ to conditional measures on $y \mid x$. But this example does underline two properties of model evaluation and averaging. First, we have little interest in a binary model check. Whether a model is declared true or false by hypothesis testing depends on what variables are conditioned on and is not necessarily related to model fit or prediction accuracy. In a non-quantum scheme, a really “everywhere true” model that has exhausted all potentially unobserved inputs contains no aleatory uncertainty. Second, the model fits typically vary across the input space. In the Bernoulli example, despite its larger overall error, $M_1$ is preferable near $x \approx 0.5$, and is optimal at $x = 0.5$.

For theoretical interest, we define the conditional (on $\tilde x$) expected (on $\tilde y \mid \tilde x$) log predictive density in the $k$-th model, $\mathrm{celpd}_k(\tilde x) := \int_{\mathcal{Y}} p_t(\tilde y \mid \tilde x) \log p(\tilde y \mid \tilde x, M_k)\, d\tilde y$. If $\{\mathrm{celpd}_k\}_{k=1}^K$ are known, we can divide the input space $\mathcal{X}$ into $K$ disjoint sets based on which model has the locally best fit (when there is a tie, the point is assigned the smallest index; $\mathcal{I}$ stands for “input”):

$$\mathcal{I}_k := \{\tilde x \in \mathcal{X} : \mathrm{celpd}_k(\tilde x) > \mathrm{celpd}_{k'}(\tilde x),\ \forall k' \neq k\}, \quad k = 1, \dots, K. \tag{17}$$

In this Bernoulli example, $\mathcal{I}_1 = [0.25, 0.67]$.
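
The region where the constant model wins can be located by root finding. As before, this sketch assumes the truth Pr(y = 1 | x) = x and reads M2 as predicting sqrt(x); the bisection helper, the bracketing intervals, and the function names are choices made here for illustration.

```python
import math

def celpd_m1(x):
    """Conditional expected log score of Bernoulli(0.5): constant in x."""
    return math.log(0.5)

def celpd_m2(x):
    """Conditional expected log score of the sqrt(x) prediction at input x."""
    q = math.sqrt(x)
    return x * math.log(q) + (1.0 - x) * math.log(1.0 - q)

def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f changes sign on the bracket."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gap(x):
    """Positive where M2 is locally better, negative where M1 is."""
    return celpd_m2(x) - celpd_m1(x)

lower = bisect(gap, 0.05, 0.5)   # left edge of the region where M1 wins
upper = bisect(gap, 0.5, 0.95)   # right edge of that region
```

Under these assumptions the two crossing points bracket the interval quoted in the text, with the constant model locally best in between.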

3.2. The gain from stacking, and what can be gained more

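In this subsection, we focus on the oracle expressive power of model selection and averaging, and of their input-dependent versions. $w^{\text{stacking,cp}}$ refers to the complete-pooling stacking weight in the population:

$$w^{\text{stacking,cp}} := \arg\max_{w \in \mathcal{S}_K} \mathrm{elpd}(w), \quad \mathrm{elpd}(w) = \int_{\mathcal{X} \times \mathcal{Y}} \log\left(\sum_{k=1}^K w_k\, p(\tilde y \mid M_k, \tilde x)\right) p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x. \tag{18}$$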

Apart from the heuristic that model averaging is likely to be more useful when candidate models are more “dissimilar” or “distinct” (Breiman, 1996; Clarke, 2003), we are not aware of rigorous theories that characterize this “diversity” regarding the effectiveness of stacking. It seems tempting to use some divergence measure between posterior predictions from each model as a metric of how close these models are, but this is irrelevant to the true data generating process.

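We define a more relevant metric on how individual predictive distributions can be pointwise separated. The description of a forecast being good is probabilistic in both $\tilde x$ and $\tilde y$: an overall bad forecast may be lucky at a one-time realization of the outcome $\tilde y$ and covariate $\tilde x$. We consider the input-output product space $\mathcal{X} \times \mathcal{Y}$ and divide it into $K$ disjoint subsets ($\mathcal{J}$ stands for “joint”):

$$\mathcal{J}_k := \{(\tilde x, \tilde y) \in \mathcal{X} \times \mathcal{Y} : p(\tilde y \mid M_k, \tilde x) > p(\tilde y \mid M_{k'}, \tilde x),\ \forall k' \neq k\}, \quad k = 1, \dots, K.$$

In this framework, we call a family of predictive densities $\{p(\tilde y \mid M_k, \tilde x)\}_{k=1}^K$ locally separable with a constant pair $L > 0$ and $0 \le \epsilon < 1$, if

$$\sum_{k=1}^K \int_{(\tilde x, \tilde y) \in \mathcal{J}_k} 1\left(\log p(\tilde y \mid M_k, \tilde x) < \log p(\tilde y \mid M_{k'}, \tilde x) + L,\ \exists k' \neq k\right) p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x \le \epsilon. \tag{19}$$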

Stacking is sometimes criticized for being a black box. The next two theorems link the stacking weight to a probabilistic explanation. Unlike Bayesian model averaging (Hoeting et al., 1999), which computes the probability of a model being “true”, stacking is more related to $\Pr(\mathcal{J}_k)$: the probability of a model being the locally “best” fit, with respect to the true joint measure $p_t(\tilde y, \tilde x)$.

Theorem 1. When the separation condition (19) holds, the complete-pooling stacking weight is approximately the probability of the model being the locally best fit:

$$w_k^{\text{stacking,cp}} \approx w_k^{\text{approx}} := \Pr(\mathcal{J}_k) = \int_{\mathcal{J}_k} p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x, \tag{20}$$

in the sense that the objective function is nearly optimal:

$$\left|\mathrm{elpd}(w^{\text{approx}}) - \mathrm{elpd}(w^{\text{stacking,cp}})\right| \le \mathcal{O}(\epsilon + \exp(-L)). \tag{21}$$

Further, a model is only ignored by stacking if its winning probability is low.

Theorem 2. When the separation condition (19) holds, and if the $k$-th model has zero weight in stacking, $w_k^{\text{stacking,cp}} = 0$, then the probability of its winning region is bounded by:

$$\Pr(\mathcal{J}_k) \le \left(1 + (\exp(L) - 1)(1 - \epsilon) + \epsilon\right)^{-1}. \tag{22}$$

The right-hand side can be further upper-bounded by $\exp(-L) + \epsilon$.
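
A quick numerical check shows how the bound in (22) behaves and that it sits below the simpler expression exp(-L) + epsilon. The sketch assumes the bound has the form 1 / (1 + (exp(L) - 1)(1 - epsilon) + epsilon); the function names and the grid of test values are illustrative.

```python
import math

def zero_weight_bound(L, eps):
    """Assumed form of the bound (22) on the winning probability of a
    model that receives zero stacking weight."""
    return 1.0 / (1.0 + (math.exp(L) - 1.0) * (1.0 - eps) + eps)

def looser_bound(L, eps):
    """The simpler upper bound exp(-L) + eps."""
    return math.exp(-L) + eps

grid = [(L, eps) for L in (0.5, 1.0, 2.0, 5.0) for eps in (0.0, 0.01, 0.1, 0.5)]
gaps = [looser_bound(L, eps) - zero_weight_bound(L, eps) for L, eps in grid]
```

With no violation allowed (epsilon = 0) the bound equals exp(-L) exactly, so a model is dropped only if its winning region is exponentially small in the separation L.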

The separation condition (19) trivially holds for $\epsilon = 1$ and an arbitrary $L$, or for $L = 0$ and an arbitrary $\epsilon$, though in those cases the bounds (21) and (22) are too loose. To be clear, we only use the closed-form approximation (20) for theoretical assessment.

The next theorem bounds the utility gain from shifting model selection to stacking:

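Theorem 3. Under the separation condition (19), let $\rho = \sup_k \Pr(\mathcal{J}_k)$ and define the deterministic function $g(L, K, \rho, \epsilon) = L(1-\rho)(1-\epsilon) - \log K$. Then the utility gain of stacking is lower-bounded by
$$\mathrm{elpd}^{\text{stacking,cp}} - \sup_k \mathrm{elpd}_k \ge \max\left(g(L, K, \rho, \epsilon) + \mathcal{O}(\exp(-L) + \epsilon),\ 0\right).$$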

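Evaluating $\mathcal{J}_k$ requires access to $\tilde y \mid \tilde x$ and $\tilde x$. Though both terms are unknown, the roles of $\tilde x$ and $\tilde y$ are not symmetric: we can tailor the model in preparation for a future prediction at a given $\tilde x$, but cannot tailor it to a realization of $\tilde y$. To be more tractable, we consider the case in which the variation in $\tilde x$ dominates the uncertainty of model comparison, such that $\mathcal{J}_k \approx \mathcal{I}_k \times \mathcal{Y}$, where $\mathcal{I}_k$ is defined in (17). More precisely, we define a strong local separation condition with a distance-probability pair $(L, \epsilon)$:

$$\sum_{k=1}^K \int_{\tilde x \in \mathcal{I}_k} \int_{\mathcal{Y}} 1\left(\log p(\tilde y \mid M_k, \tilde x) < \log p(\tilde y \mid M_{k'}, \tilde x) + L,\ \exists k' \neq k\right) p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x \le \epsilon. \tag{23}$$

We define $\rho_{\mathcal{X}} = \sup_k \Pr(\mathcal{I}_k)$. Under condition (23), $\rho_{\mathcal{X}}$ and $\rho$ will be close. If we know the input space division $\{\mathcal{I}_k\}$, we can select model $M_k$ for and only for $x \in \mathcal{I}_k$, which we call pointwise selection. The predictive density is

$$p(\tilde y \mid \tilde x, \mathcal{I}, \text{pointwise selection}) = \sum_{k=1}^K 1(\tilde x \in \mathcal{I}_k)\, p(\tilde y \mid \tilde x, M_k). \tag{24}$$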

As per Theorem 3, for a given pair of $L$ and $\epsilon$, the smaller $\rho$ is, the larger the improvement ($\approx L(1-\epsilon)(1-\rho)$) that stacking can achieve over model selection; this is the situation in which no single model always predominates. Thus, the effectiveness of stacking can indicate heterogeneity of model fitting. Next, we show that the heterogeneity of model fitting provides an additional utility gain if we shift from stacking to pointwise selection:

Theorem 4. Under the strong separation condition (23), and if the divisions $\{\mathcal{I}_k\}$ are known exactly, then the extra utility gain of pointwise selection has a lower bound,
$$\mathrm{elpd}^{\text{pointwise selection}} - \mathrm{elpd}^{\text{stacking,cp}} \ge -\log \rho_{\mathcal{X}} + \mathcal{O}(\exp(-L) + \epsilon).$$

For a given input location $x_0 \in \mathcal{X}$, the pointwise no-pooling optimum $w(x_0) \in \mathcal{S}_K$ in the population is the same as the complete-pooling solution restricted to the slice $\{x_0\} \times \mathcal{Y}$. Hence, applying Theorem 3 to each slice will bound the advantage of pointwise averaging (4) over pointwise selection (24).

The potential utility gain from Theorems 3 and 4 is the motivation behind input-varying model averaging. Despite this asymptotic expressiveness, the finite-sample estimate remains challenging. (a) We do not know $\mathcal{I}_k$ or $\mathcal{J}_k$. We may use leave-one-out cross validation to estimate the overall model fit $\mathrm{elpd}_k$, but in the pointwise version, we want to assess conditional model performance. Further, the more data come in, the more input locations need to be assessed. (b) The asymptotic expressiveness comes with increasing complexity. The free parameters in single model selection, complete-pooling stacking, pointwise selection, and no-pooling stacking are, respectively, a single model index, a length-$K$ simplex, a vector of pointwise model selection indices in $\{1, 2, \dots, K\}^{\mathcal{X}}$, and a matrix of pointwise weights in $(\mathcal{S}_K)^{\mathcal{X}}$. To handle this complexity-expressiveness tradeoff, it is natural to apply a hierarchical shrinkage prior.

3.3. Immunity to covariate shift

So far we have adopted an IID view: the training and out-of-sample data are from the same distribution. Yet another appealing property of hierarchical stacking is its immunity to covariate shift (Shimodaira, 2000), a ubiquitous problem in non-representative sample survey, data-dependent collection, causal inference, and many other areas.

If the distribution of inputs $x$ in the training sample, $p_X^{\text{train}}(\cdot)$, differs from these predictors’ distribution in the population of interest, $p_X^{\text{pop}}(\cdot)$ (where $p_X^{\text{pop}}$ is absolutely continuous with respect to $p_X^{\text{train}}$), and if $p(z \mid x)$ and $p(y \mid x, z)$ remain invariant, then we do not need to adjust the weight estimate from (11), because it already targets the pointwise fit.

By contrast, complete-pooling stacking targets the average risk. Under covariate shift, the sample mean of the leave-one-out score in the $k$-th model, $\frac{1}{n} \sum_{i=1}^n \log p_{k,-i}$, is no longer a consistent estimate of the population elpd. To adjust, we can run importance sampling (Sugiyama and Müller, 2005; Sugiyama et al., 2007; Yao et al., 2018) and reweight the $i$-th term in the objective (3) in proportion to the inverse probability ratio $p_X^{\text{pop}}(x_i) / p_X^{\text{train}}(x_i)$. Even in the ideal situation when both $p_X^{\text{pop}}$ and $p_X^{\text{train}}$ are known, the importance-weighted sum has in general a larger or even infinite variance (Vehtari et al., 2019), thereby decreasing the effective sample size and convergence rate in complete-pooling stacking (toward its optimum (18)). When $p_X^{\text{train}}$ is unknown, the covariate reweighting is more complex, whereas hierarchical stacking circumvents the need for explicit modeling of $p_X^{\text{train}}$.
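
The cost of importance reweighting can be seen in a small sketch. It assumes both input densities are known at the observed inputs, and it uses the Kish formula (sum r)^2 / sum(r^2) as the effective sample size, which is a common diagnostic rather than something specified in the text. The function names and the toy density values are hypothetical.

```python
import math

def density_ratios(p_pop, p_train):
    """Importance ratios r_i = p_pop(x_i) / p_train(x_i) at the observed inputs."""
    return [a / b for a, b in zip(p_pop, p_train)]

def effective_sample_size(ratios):
    """Kish effective sample size (assumed diagnostic): (sum r)^2 / sum r^2."""
    return sum(ratios) ** 2 / sum(r * r for r in ratios)

def reweighted_objective(w, p_loo, ratios):
    """Complete-pooling stacking objective with each term weighted by r_i."""
    return sum(
        r * math.log(sum(wk * pk for wk, pk in zip(w, row)))
        for r, row in zip(ratios, p_loo)
    )

# Toy example: the population puts most of its mass where training data are rare.
p_train = [0.4, 0.4, 0.1, 0.1]
p_pop = [0.1, 0.1, 0.4, 0.4]
ratios = density_ratios(p_pop, p_train)
ess = effective_sample_size(ratios)
```

The effective sample size falls well below the nominal four observations, illustrating how the reweighted estimate ends up relying on only a few points.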

When we are interested only in one fixed input location, $p_X^{\text{pop}}(x) = \delta(x = x_0)$, hierarchical stacking is ready for conditional predictions, whereas no-pooling stacking and reweighted complete-pooling stacking effectively discard all training data with $x_i \neq x_0$ in their objectives, which is especially a drawback when $x_0$ is rarely observed in the sample.

4. Related literature

Stacking (Wolpert, 1992; Breiman, 1996; LeBlanc and Tibshirani, 1996), or what we call complete-pooling stacking in this paper, has long been a popular method to combine learning algorithms, and has been advocated for averaging Bayesian models (Clarke, 2003; Clyde and Iversen, 2013; Le and Clarke, 2017; Yao et al., 2018). Stacking is applied in various areas such as recommendation systems, epidemiology (Bhatt et al., 2017), network modeling (Ghasemian et al., 2020), and post-processing in Monte Carlo computation (Tracey and Wolpert, 2016; Yao et al., 2020). Stacking can be equipped with any scoring rule, while the present paper focuses on the logarithmic score by default. Our theoretical investigation in Section 3.2 is inspired by the discussion of how to choose candidate models by Clarke (2003) and Le and Clarke (2017). In L2 loss stacking, they recommended “independent” models in terms of the posterior point predictions $(\mathrm{E}(\tilde y \mid \tilde x, M_1), \dots, \mathrm{E}(\tilde y \mid \tilde x, M_K))$ being independent. When combining Bayesian predictive distributions, the correlations of the posterior predictive means are not enough to summarize the relation between predictive distributions (Pirš and Štrumbelj, 2019), hence we consider the local separation condition instead.

Allowing a heterogeneous stacking model weight that changes with input $x$ is not a new idea. Feature-weighted linear stacking (Sill et al., 2009) constructs data-varying model weights of the $k$-th model by $w_k(x)=\sum_{m=1}^{M}\alpha_{km}f_m(x)$, where $\alpha_{km}$ optimizes the $L^2$ loss of the point predictions of the weighted model. This is similar to the likelihood term of our additive model specification in Section 2.4, except that we model the unconstrained weights. The direct least-squares optimization solution from feature-weighted linear stacking is what we label no-pooling stacking.

It is also not a new idea to add regularization and optimize the penalized loss function. For $L^2$ loss stacking, Breiman (1996) advocated non-negative constraints. In the context of combining Bayesian predictive densities, a simplex constraint is necessary. Reid and Grudic (2009) investigated adding an $L^1$ or $L^2$ penalty, $\lambda\|w\|_1$ or $\lambda\|w\|_2$, to the complete-pooling stacking objective (3). Yao et al. (2020) assigned a $\mathrm{Dirichlet}(\lambda)$, $\lambda>1$, prior distribution to the complete-pooling stacking weight vector $w$ to ensure strict concavity of the objective function. Sill et al. (2009) mentioned the use of $L^2$ penalization in feature-weighted linear stacking, which is equivalent to setting a fixed prior for all free parameters, $\alpha_{km}\sim\mathrm{normal}(0,\tau)$, $\forall k, m$, whose solution path connects uniform weighting and no-pooling stacking by tuning $\tau$. All of these schemes are shown to reduce over-fitting with an appropriate amount of regularization, but the tuning is computationally intensive. In particular, each stacking run is built upon one layer of cross validation to compute the expected pointwise score in each model, $p_{k,-i}$, and this extra tuning would require fitting each model $n(n-1)$ times for each tuning parameter value if both layers are done by exact leave-one-out. Fushiki (2020) approximated this double cross validation for $L^2$ loss complete-pooling stacking with an $L^2$ penalty on $w$, beyond which there was no general efficient approximation.

Hierarchical stacking treats $\{\mu_k\}$ and $\{\sigma_k\}$ as parameters and samples them from the joint density. Such a hierarchy could be approximated by using an $L^2$ penalized point estimate with a different tuning parameter in each model, and tuning all parameters ($\{\sigma_k\}_{k=1}^{K-1}$ for the basic model, or $\{\sigma_{mk}\}_{m=1,k=1}^{M,K-1}$ for the product model). But then this intensive tuning is the same as finding the Type-II MAP of hierarchical stacking by an inefficient grid search (in contrast to gradient-based MCMC).

Another popular family of regularization in stacking enforces sparse weights (e.g., Zhang and Zhou, 2011; Şen and Erdogan, 2013; Yang and Dunson, 2014), which include sparse and grouped sparse priors on the unconstrained weights, and sparse Dirichlet priors on simplex weights. The goal is that only a limited number of models are expressed. From our discussion in Section 3, all models are somewhere useful, hence we do not aim for model sparsity. The concavity of log scoring rules implicitly resists sparsity, and the posterior mean of hierarchical stacking weights $w_{jk}$ is, in general, never sparse. Nevertheless, when sparsity is of concern for memory saving or interpretability, we can run hierarchical stacking first and then apply projection predictive variable selection (Piironen and Vehtari, 2017) to the posterior draws from the stacking model (11) and pick a sparse (or cell-wise sparse) solution.

Contrary to fitting individual models in parallel before model averaging, an alternative approach is to fit all models jointly in a bigger mixture model. Kamary et al. (2019) proposed Bayesian hypothesis testing by fitting an encompassing model $p(y\mid w,\theta)=\sum_{k=1}^{K}w_k\,p(y\mid\theta_k,M_k)$. The mixture model requires simultaneously fitting model parameters and model weights, $p(w_{1,\dots,K},\theta_{1,\dots,K}\mid y)$. Yao et al. (2018) illustrated that (complete-pooling) stacking is often more stable than the full mixture, especially with small sample sizes and similar models. Nevertheless, our formulation of hierarchical stacking agrees with Kamary et al. (2019) in sampling from the posterior distribution $p(w\mid y)$ in a Bayesian model. A jointly-inferred model $p(y\mid x,w(x),\theta)=\sum_{k=1}^{K}w_k(x)\,p(y\mid x,\theta_k,M_k)$ is related to the “mixture of experts” (Jacobs et al., 1991; Waterhouse et al., 1996) and “hierarchical mixture of experts” (Jordan and Jacobs, 1994; Svensën and Bishop, 2003), where $w_k(\cdot)$ and $p(\cdot\mid x,M_k)$ are parameterized by neural networks and trained jointly in the bigger mixture model. Hierarchical stacking differs from mixture modeling in two aspects. First, its separate inference of individual models $p(\theta_1\mid y,M_1),\dots,p(\theta_K\mid y,M_K)$ and weights reduces the computational burden, making full Bayes affordable. Second, the built-in leave-one-out likelihood helps reduce overfitting. Both the mixture modeling and stacking approaches have limitations. Both can suffer from overfitting: a mixture-of-experts has more free parameters, and hierarchical stacking may have an under-regularized prior distribution. Both can suffer from underfitting: if the experts to mix are hard to separate, or if hierarchical stacking has sloppy individual models. Both can suffer from computation costs: a mixture-of-experts requires joint parameter estimation, and our method requires full Bayesian inference. Stacking and hierarchical stacking are more suitable when each individual model has already been developed to fit the data on its own. Rather than competing with a mixture-of-experts on combining weak learners, hierarchical stacking is better suited to combining a mixture-of-experts with other sophisticated models. Lastly, our full-Bayesian formulation makes hierarchical stacking directly applicable to complex priors and complex data structures, such as time series or panel data, while these extensions are not straightforward in the mixture of experts.

5. Examples

We present three examples. The well-switching example demonstrates an automated hierarchical stacking implementation with both continuous and categorical inputs. The Gaussian process example highlights the benefit of hierarchical stacking when individual models are already highly expressive. The election forecast illustrates a real-world classification task with a complex data structure. We evaluate the proposed method on several metrics, including the mean log predictive density on holdout data, conditional log predictive densities, and the calibration error.

5.1. Well-switching in Bangladesh

We work with a dataset used by Vehtari et al. (2017) to demonstrate cross validation. A survey with a size of n=3020 was conducted on residents from a small area in Bangladesh that was affected by arsenic in drinking water. Households with elevated arsenic levels in their wells were asked whether or not they were interested in switching to a neighbor’s well, denoted by y. Well-switching behavior can be predicted by a set of household-level variables x, including the detected arsenic concentration value in the well, the distance to the closest known safe well, the education level of the head of household, and whether any household members are in community organizations. The first two inputs are continuous and the remaining two are categorical variables.

We fit a series of logistic regressions, starting with an additive model including all covariates $x$ in model 1. In model 2, we replace one input, the well arsenic level, by its logarithm. In models 3 and 4, we add cubic spline basis functions with ten knots of well arsenic level and distance, respectively, as input variables. In model 5 we replace the categorical education variable with a continuous measure of years of schooling.

Using the additive model specification (13) and default prior (14), we model the unconstrained weight $\alpha_k(x)$ by a linear regression on all categorical inputs and all rectified continuous inputs (16). In this example the categorical input has eight distinct levels based on the product of education (four levels) and community participation (binary).

For comparison, we consider three alternative approaches: (a) complete-pooling stacking, (b) no-pooling stacking, which is the maximum likelihood estimate of (13), and (c) model selection that picks the model with the highest leave-one-out log predictive density. We split the data into a training set ($n_{\mathrm{train}}=2000$) and an independent holdout test set. The leftmost panel in Figure 2 displays the pointwise difference of leave-one-out log scores for models 1 and 2 against log arsenic values in the training data. Intuitively, model 1 fits poorly for data with high arsenic. In line with this evidence, hierarchical stacking assigns model 1 an overall low weight, and an especially low weight at the right end of the arsenic levels. The second panel shows the pointwise posterior mean of the unconstrained weight difference between models 1 and 2, $\alpha_2(x)-\alpha_1(x)$, against the arsenic values in the training data. No-pooling stacking reveals a similar direction, in that model 1’s weight should be lower at higher arsenic values, but for lack of hierarchical prior regularization, the fitted $\alpha_2(x)-\alpha_1(x)$ is orders of magnitude larger (the third panel). As a result, the realized pointwise weights $w$ are nearly all either zero or one. The rightmost two columns in Figure 2 display the fitted pointwise weights of model 4 against log arsenic values and education level in the test data. Because only a small proportion (7%) of respondents had high school education and above, the no-pooling stacking weight for this category is largely determined by small sample variation. Hierarchical stacking partially pools this “high school” effect toward the shared posterior mean of all educational levels, and the realized hierarchical stacking weights do not clearly depend on education levels.

Figure 2:

(1) Pointwise difference of leave-one-out log scores between models 1 and 2, plotted against log arsenic. Model 1 poorly fits points with high arsenic. (2) Posterior mean of the pointwise unconstrained weight difference between models 1 and 2, $\alpha_2(x)-\alpha_1(x)$, in hierarchical stacking. (3) Pointwise log weight difference between models 1 and 2 in no-pooling stacking. (4) Posterior mean of $w_4(x)$, the weight assigned to model 4, in hierarchical stacking, displayed against log arsenic and education levels. There are few samples with high school education and above, whose effect on model weights is pooled toward the shared mean. The blue line is the complete-pooling stacking. (5) The unconstrained weight of model 4, $\alpha_4(x)$, in no-pooling stacking. The “high school” effect stands out and the resulting model weights $w$ are nearly all zeroes and ones.

We evaluate model fit on the following three metrics. To reduce randomness, we average each metric over 50 random training-test splits.

  1. The log predictive densities averaged over test data. In the first panel of Figure 3, we set hierarchical stacking as a baseline and all other methods attain lower predictive densities.

  2. The $L^1$ calibration error. We set 20 equally spaced bins between 0 and 1. For each bin and each learning algorithm, we collect the test data points whose model-predicted positive probability falls in that bin, and compute the absolute discrepancy between the realized proportion of positives in the test data and the model-predicted probabilities. The middle panel in Figure 3 displays the resulting calibration error averaged over the 20 bins. The proposed hierarchical stacking has the lowest error. No-pooling stacking has the highest calibration error despite having higher overall log predictive densities than model selection, suggesting prediction overconfidence.

  3. We compute the average log predictive densities of the four methods among the $n_{\mathrm{worst}}$ most shocking test data points (the ones with the lowest predictive densities under a given method), for $n_{\mathrm{worst}}$ varying from 10 to 200 out of a total test size of 1020. As exhibited in the last panel in Figure 3, the proposed hierarchical stacking consistently outperforms all other approaches for all $n_{\mathrm{worst}}$: a robust performance in the worst-case scenario.
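
The second and third metrics can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used for Figure 3: the function names and toy inputs are hypothetical, the predicted probability within a bin is summarized by its mean, and empty bins are skipped when averaging (the text does not specify either detail).

```python
def l1_calibration_error(probs, labels, n_bins=20):
    """Mean absolute gap between predicted and realized positive rates per bin."""
    cells = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equally spaced bins on [0, 1]; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        cells[idx].append((p, y))
    gaps = []
    for cell in cells:
        if not cell:
            continue  # assumption: empty bins do not contribute to the average
        mean_pred = sum(p for p, _ in cell) / len(cell)
        realized = sum(y for _, y in cell) / len(cell)
        gaps.append(abs(realized - mean_pred))
    return sum(gaps) / len(gaps)


def mean_worst_lpd(log_pred_densities, n_worst):
    """Average log predictive density over the n_worst lowest-scoring points."""
    worst = sorted(log_pred_densities)[:n_worst]
    return sum(worst) / n_worst


# Hypothetical predicted positive probabilities and binary outcomes.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
calib = l1_calibration_error(probs, labels)

# An overconfident predictor: 0.99 predicted, but only half are positive.
overconfident = l1_calibration_error([0.99, 0.99], [1, 0])

# Hypothetical pointwise test log predictive densities.
lpd = [-1.0, -5.0, -2.0, -0.5]
worst_two = mean_worst_lpd(lpd, 2)  # averages the two lowest values
```

The overconfident toy predictor shows how near-binary predictions can score a large calibration error, the pattern the text reports for no-pooling stacking.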

Figure 3:

We evaluate hierarchical, complete-pooling and no-pooling stacking, and model selection on three metrics: (a) average log predictive densities on test data, where we set the hierarchical stacking as benchmark 0, (b) calibration error: discrepancy between the predicted positive probability and realized proportion of positives in test data, averaged over 20 equally spaced bins, and (c) average log predictive densities among the $10\le n_0\le 200$ worst test data points. We repeat 50 random training-test splits with training size 2000 and test size 1020.

Figure 4 presents the same comparisons of the four methods while the training sample size $n_{\mathrm{train}}$ varies from 100 to 1200 (averaged over 50 random training-test splits). In agreement with the heuristic in Figure 1, the most complex method, no-pooling stacking, performs especially poorly with a small sample size. By contrast, the simplest method, model selection, reaches its peak elpd quickly with a moderate sample size but cannot keep improving as the training data size grows. The proposed hierarchical stacking performs the best in this setting under all metrics.

Figure 4:

Same comparisons as Figure 3, with training sample size varying from 100 to 1200.

Figure 1:

Evolution of methods. First row from left to right: the methods have a higher degree of freedom to ensure a higher asymptotic predictive accuracy, the gain of which is bounded by the labeled theorems. Meanwhile, complex methods come with a slower convergence rate. The hierarchical stacking is a generalization of all remaining methods by assigning various structured priors, and adapts to the complexity-expressiveness tradeoff by hierarchical modeling.

5.2. Gaussian process regression weighted by a Gaussian process

The local model averaging (12) tangles an $x$-dependent weight $w(x)$ and an $x$-dependent individual prediction $p(y\mid x,M_k)$. If the individual model $y\mid x, M_k$ is already big enough to have exhausted “all” variability in the input $x$, is there still room for improvement by modeling local model weights $w(x)$? The next example suggests a positive answer. It also showcases non-parametric priors in hierarchical stacking.

Consider a regression problem with observations $\{y_i\}_{i=1}^{n}$ at one-dimensional input locations $\{x_i\}_{i=1}^{n}$. To the data we fit a Gaussian process regression on the latent function $f$ with zero mean and squared exponential covariance, and independent noise $\epsilon$:

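$$y_i=f(x_i)+\epsilon_i,\quad \epsilon_i\sim\mathrm{normal}(0,\sigma),\quad f(x)\sim\mathcal{GP}\left(0,\;a^2\exp\left(-\frac{(x-x')^2}{\rho^2}\right)\right). \tag{25}$$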
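
As a small illustration of the role of the length scale in this model, the following Python sketch builds the covariance matrix of the observations under a squared exponential kernel of the form $a^2\exp(-(x-x')^2/\rho^2)$. The function names, inputs, and hyperparameter values are hypothetical, and $\sigma$ is assumed to be the noise standard deviation, so that $\sigma^2$ is added to the diagonal.

```python
import math


def sq_exp_cov(x1, x2, a, rho):
    """Squared exponential covariance a^2 * exp(-(x1 - x2)^2 / rho^2)."""
    return a * a * math.exp(-((x1 - x2) ** 2) / (rho * rho))


def cov_matrix(xs, a, rho, sigma):
    """Covariance of y: kernel matrix plus noise variance on the diagonal.

    Assumes sigma is a standard deviation, so sigma^2 enters the diagonal.
    """
    n = len(xs)
    return [
        [sq_exp_cov(xs[i], xs[j], a, rho) + (sigma ** 2 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


xs = [-1.0, 0.0, 2.0]  # hypothetical input locations

# A short length scale gives weakly correlated neighbors (a wiggly fit).
wiggly = cov_matrix(xs, a=1.0, rho=0.5, sigma=0.25)
# A long length scale gives strongly correlated neighbors (a smooth fit).
smooth = cov_matrix(xs, a=1.0, rho=3.0, sigma=0.25)
```

Two hyperparameter settings like these correspond to qualitatively different fits, which is why isolated posterior modes in $(a,\rho,\sigma)$ can yield distinct predictive distributions.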

We adopt training data from Neal (1998). They were generated such that the posterior distribution of hyperparameters $\theta=(a,\rho,\sigma)$ contains at least two isolated modes (the first panel in Figure 5). We consider three mode-based approximate inferences of $\theta\mid y$: (a) Type-II MAP, where we pick the local modes of hyperparameters that maximize the marginal density, $\hat\theta=\arg\max p(\theta\mid y)$, and further draw local variables $f\mid\hat\theta, y$, (b) Laplace approximation of $\theta\mid y$ around the mode, and (c) importance resampling, where we draw uniform samples near the mode and keep each sample with probability proportional to $p(\theta\mid y)$. In the presence of two local modes $\hat\theta_1$, $\hat\theta_2$, we obtain either two MAPs or two nearly non-overlapping sets of draws. Yao et al. (2020) suggest using complete-pooling stacking to combine the two predictions, which shows advantages over other ad-hoc weighting strategies such as mode heights or importance weighting.

Figure 5:

From left to right, Column 1: posterior density at $\sigma=0.25$. At least two modes exist. Column 2: predictive distribution of $y$ from two modes. Column 3: the pointwise comparison of log predictive density of the Laplace approximations at two modes, and the hierarchical stacking weight of mode 1. Column 4: the test data mean predictive densities of the weighted model, where individual components in the final model consist of either the MAP, Laplace approximation, or importance sampling around the two modes, and the weighting methods include hierarchical stacking, complete-pooling stacking, mode heights and importance weighting.

Visually, mode 1 has a smaller length scale, is more wiggly, and is more attracted to the training data. Because of a better overall fit, it receives a higher complete-pooling stacking weight. However, the wiggling tail makes its extrapolation less robust. We now run hierarchical stacking with an $x$-dependent weight $w_k(x)$ for mode $k=1,2$ by placing another Gaussian process prior on the unconstrained weight $\mathrm{logit}(w_1(x))$ with squared exponential covariance,

$$w_1(x)=\mathrm{invlogit}(\alpha(x)),\quad \alpha(x)\sim\mathcal{GP}(0,K(x)).$$
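
To make the two-component weighting concrete, the sketch below evaluates the pointwise log predictive density of the weighted model, $\log\{w_1(x)p_1+(1-w_1(x))p_2\}$, from an unconstrained weight $\alpha(x)$. It is a minimal illustration only: the function names are hypothetical, and the Gaussian process prior on $\alpha(x)$ is not implemented here, so $\alpha(x)$ is simply passed in as a number. The log-domain arithmetic is a numerical-stability choice, not something stated in the text.

```python
import math


def log_inv_logit(a):
    """Numerically stable log of 1 / (1 + exp(-a))."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))


def inv_logit(a):
    """Map an unconstrained weight alpha(x) to w_1(x) in (0, 1)."""
    return math.exp(log_inv_logit(a))


def log_mixture_density(alpha_x, log_p1, log_p2):
    """log(w1 * p1 + (1 - w1) * p2) with w1 = inv_logit(alpha_x)."""
    terms = [
        log_inv_logit(alpha_x) + log_p1,   # log(w1) + log(p1)
        log_inv_logit(-alpha_x) + log_p2,  # log(1 - w1) + log(p2)
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Hypothetical predictive densities of the two modes at one test point.
log_p1, log_p2 = math.log(0.2), math.log(0.4)

equal_weight = log_mixture_density(0.0, log_p1, log_p2)   # w1 = 0.5
mostly_mode1 = log_mixture_density(50.0, log_p1, log_p2)  # w1 close to 1
```

Letting $\alpha(x)$ vary smoothly with $x$ allows the ensemble to favor one mode in the bulk of the data and the other in the tails.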

Despite using the same GP prior, this is not related to the training regression model (25). To evaluate how good the weighted ensemble is, we generate independent holdout test data $(\tilde x_i,\tilde y_i)$. Both training and test inputs, $x$ and $\tilde x$, are drawn from normal(0, 1). As presented in the rightmost panel in Figure 5, for all three approximate inferences, hierarchical stacking always has a higher mean test log predictive density than complete-pooling stacking and other weighting schemes.

In this dataset, exact MCMC is able to explore both posterior modes in model (25) after a long enough sampling. Gaussian process regression equipped with exact Bayesian inference can be regarded as the “always true” model here. Hierarchical stacking achieves a similar average test data fit by combining two Laplace approximations.

Furthermore, hierarchical stacking has better predictive performance under covariate shift. To examine local model fit, we generate another independent holdout test dataset, with results shown in Figure 6. This time the test inputs $\tilde x$ are from uniform(−3, 3). We divide the test data into 10 equally spaced bins and compute the mean test data log predictive density inside each bin. Compared with exact inference, hierarchical stacking has comparable performance in the bulk region of $x$, while it yields higher predictive densities in the tail, suggesting a more reliable extrapolation.

Figure 6:

Compare the test log predictive densities of hierarchical stacking with (left) a long-chain exact Bayes and (right) stacking of two Laplace approximations. A positive value means hierarchical stacking has a better fit.

5.3. U.S. presidential election forecast

We explore the use of hierarchical stacking on a practical example of forecasting polls for the 2016 United States presidential election. Since the polling data are naturally divided into states, they provide a suitable platform for hierarchical stacking in which model weights vary across states.

To create a pool of candidate models, we first concisely describe the model of Heidemanns et al. (2020), an updated dynamic election forecasting model from Linzer (2013), and then follow up with different variations of it. Let $i$ be the index of an individual poll, $y_i$ the number of respondents that support the Democratic candidate, and $n_i$ the number of respondents who support either the Democratic or the Republican candidate in the poll. Let $s[i]$ and $t[i]$ denote the state and time of poll $i$ respectively. The observations follow a Binomial likelihood $y_i\sim\mathrm{Binomial}(\theta_i,n_i)$, where $\theta$ is modeled by

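$$\theta_i=\begin{cases}\mathrm{logit}^{-1}\left(\mu^{b}_{s[i],t[i]}+\alpha_i+\zeta_i^{\mathrm{state}}+\xi_{s[i]}\right), & i \text{ is a state poll},\\ \mathrm{logit}^{-1}\left(\sum_{s=1}^{S}u_s\mu^{b}_{s,t[i]}+\alpha_i+\zeta_i^{\mathrm{national}}+\sum_{s=1}^{S}u_s\xi_s\right), & i \text{ is a national poll},\end{cases} \tag{26}$$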

where superscripts denote parameter names, and subscripts their indexes. The term $\mu^b$ is the underlying support for the Democratic candidate, and $\alpha_i$, $\zeta$, and $\xi$ represent different bias terms. $\alpha_i$ is further decomposed into

$$\alpha_i=\mu^{c}_{p[i]}+\mu^{r}_{r[i]}+\mu^{m}_{m[i]}+z\epsilon_{t[i]}, \tag{27}$$

where $\mu^c$ is the house effect, $\mu^r$ the polling population effect, $\mu^m$ the polling mode effect, and $\epsilon$ an adjustment term for non-response bias. Furthermore, an autoregressive (AR(1)) prior is given to $\mu^b$: $\mu^b_t\mid\mu^b_{t-1}\sim\mathrm{MVN}(\mu^b_{t-1},\Sigma^b)$, where $\Sigma^b$ is the estimated state-covariance matrix and $\mu^b_T$ is the estimate from the fundamentals.

Although we believe this model reasonably fits data, there is always room for improvement. Our pool of candidates consists of eight models. M1: The fundamentals-based model of Abramowitz (2008). M2: The model of Heidemanns et al. (2020). M3: M2 without the fundamentals prior, μTb=0. M4: M2 with an AR(2) structure, μtbμt1b, μt2bMVN(0.5μt1bμt2b,Σb). M5: simplify M2 without polling population effect, polling mode effect, and the adjustment trend for non-response bias, αi=μp[i]c. M6: M2 where we added an extra regression term βstockstockt[i] into model (26) using the S&P 500 index at the time of poll i. M7: M2 without the entire shared bias term, αi=0. M8: M2 without hierarchical structure on states.

We equip hierarchical stacking with either the basic independent prior (9) or the state-correlated prior (15). The prior correlation Ω is estimated using a pool of state-level macro variables (election results in the past, racial makeup, educational attainment, etc.), and has already been used in some of the individual models to partially pool state-level polling. We plug this pre-estimated prior correlation in the correlated stacking prior (15) and refer to it as “hierarchical stacking with correlation” in later comparisons.

Since the data are longitudinal, we evaluate different pooling approaches using a one-week-ahead forecast with an expanding window for each conducted poll. We extract the fitted one-week-ahead predictions from each individual model, and train hierarchical stacking, complete-pooling, and no-pooling stacking, and evaluate the combined models by computing their mean log predictive densities on the unseen data next week. To account for the non-stationarity discussed in Section 2.5, we only use the last four weeks prior to prediction day for training model averaging. In the end we obtain a trajectory of this back-testing performance of hierarchical stacking, complete-pooling stacking, no-pooling stacking, and single model selection.

The left-hand side of Figure 7 shows the seven-day running average of the one-week-ahead back-test log predictive density from models combined with various approaches. The right-hand side of Figure 7 shows the overall cumulative one-week-ahead back-test log predictive density. We set the uncorrelated hierarchical stacking to be a constant zero for reference. Hierarchical stacking performs the best, followed by stacking, no-pooling stacking, and model selection respectively. The advantage of hierarchical stacking is highest at the beginning and slowly decreases the closer we get to election day. As we move closer to the election, more polls become available, so the candidate models become better and also more similar since some models only differ in priors. As a result, all combination methods eventually become more similar. No-pooling stacking has high variance and hence performs the worst out of all combination methods. Hierarchical stacking with correlated prior performs similarly to the independent approach, with a minor advantage at the beginning of the year, where the prior correlation stabilizes the state weights, and later we see this advantage more discernible in individual states.

Figure 7:

Left: pointwise differences in 7-day running mean log predictive densities on one-week-ahead test data, where we set the hierarchical stacking as benchmark 0. Right: pointwise differences in cumulative average predictive log density by date. The advantage of hierarchical stacking is most noticeable toward the beginning, where there are fewer polls available.

To examine small area estimates, we divided states into three categories based on how many state polls were conducted. Figure 8 shows the overall mean pointwise differences in test log predictive densities divided by these categories, along with a fourth panel over all states. No-pooling stacking performs the worst in all panels. An explanation for that could be that we are using a four-week moving window to tackle non-stationarity, which might not contain enough data for the no-pooling method. The variance of the no-pooling is amended by the hierarchical approach, which performs on par with stacking with scarcer data and outperforms it otherwise. Figure 14 in Appendix E (Yao et al., 2021) shows the state-level cumulative log predictive density by time. With a large number of state polls available, for example, close to election day in Florida and North Carolina, no-pooling stacking performs well. In states with fewer polls, no-pooling stacking is unstable. Hierarchical stacking alleviates this instability while retaining enough flexibility for a good performance when large data come in.

Figure 8:

Mean test log predictive densities with 50% and 95% confidence intervals, among subsets of states with few, moderate, and many numbers of state polls, and among all states. Correlated hierarchical stacking is set as reference 0. It is better than independent hierarchical stacking when data are scarce. Complete-pooling stacking is close to hierarchical stacking in small states but worse in bigger states.

Figure 9 illustrates how cell size affects the pooling effect. The first panel shows the hierarchical stacking state-wise weights for the first candidate model, $w_{1j}$, as a function of date. For either early-date forecasts or states with few polls, hierarchical stacking weights are more pooled toward the shared nationwide mean. The middle and right panels compare the difference between state-wise hierarchical stacking weights and the nationwide mean, or with no-pooling weights, against the total number of respondents for each state and prediction date. The cells with more observed data are less pooled and closer to their no-pooling optimums, and vice versa.

Figure 9:

Hierarchical stacking weights for M1. Left: weights for M1 of the 10 states with fewest polls and with most polls over time. Dotted line shows the complete-pooling stacking weight and the solid black line is the nationwide mean weight. States with fewer polls are shrunken more toward the mean. Middle: absolute differences between state-wise hierarchical stacking weights and the nationwide mean, against number of respondents. The blue line is the linear trend reference. States with smaller sample sizes are more pooled to the mean. Right: absolute differences between hierarchical stacking and no-pooling stacking weights.

6. Discussion

6.1. Robustness in small areas

The input-varying model averaging improves both the overall averaged prediction $\mathrm{E}_{\tilde y,\tilde x}(\log p(\tilde y\mid\tilde x))$ and the conditional prediction $\mathrm{E}_{\tilde y\mid\tilde x=x_0}(\log p(\tilde y\mid\tilde x))$, whereas these two tasks are subject to a trade-off in complete-pooling stacking. Besides, the partial pooling prior (9) borrows information from other cells, which stabilizes model weights in small cells where data are not enough for no-pooling stacking. For a crude mean-field approximation, the likelihood in the discrete model (11) is approximately $\prod_{j,k}\mathrm{normal}(\alpha^{\mathrm{mode}}_{jk},\lambda_{jk})$, where $\alpha^{\mathrm{mode}}=\arg\max_{\alpha}\sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}w_k(x_i)p_{k,-i}\right)$ is the unconstrained no-pooling stacking weight, and $\lambda_{jk}^{-2}=-\frac{\partial^2}{\partial\alpha_{jk}^2}\Big|_{\mathrm{mode}}\sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}w_k(x_i)p_{k,-i}\right)$ is the diagonal element of the Hessian. Because $\alpha_{jk}$ appears in $n_j$ terms of the summation, $\lambda_{jk}=\mathcal{O}(n_j^{-1/2})$ for a given $k$. Combined with the prior $\alpha_{jk}\sim\mathrm{normal}(\mu_k,\sigma_k)$, the conditional posterior mean of the $k$-th model weight in the $j$-th cell is the usual precision-weighted average of the no-pooling optimum and the shared mean: $\alpha^{\mathrm{post}}_{jk}:=\mathrm{E}(\alpha_{jk}\mid\lambda_{jk},\sigma_k,\mu_k,\mathcal{D})\approx\left(\lambda_{jk}^{-2}\alpha^{\mathrm{mode}}_{jk}+\sigma_k^{-2}\mu_k\right)\left(\lambda_{jk}^{-2}+\sigma_k^{-2}\right)^{-1}$. Hence for a given model $k$, $\alpha^{\mathrm{mode}}_{jk}-\alpha^{\mathrm{post}}_{jk}=\mathcal{O}(n_j^{-1})$. Larger pooling usually occurs in smaller cells. This pooling factor is in line with Figure 9 and general ideas in hierarchical modeling (Gelman and Pardoe, 2006). Our full-Bayesian solution also integrates out $\mu_k$ and $\sigma_k$, which further partially pools across models.
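
The precision-weighted average above can be checked numerically with a short Python sketch. The function name and all numbers are hypothetical, and the scaling $\lambda_{jk}=n_j^{-1/2}$ is imposed with a unit constant purely for illustration.

```python
def pooled_posterior_mean(alpha_mode, lam, mu, sigma):
    """Precision-weighted average of the no-pooling optimum and the shared mean."""
    prec_data = lam ** -2     # precision of the approximate likelihood
    prec_prior = sigma ** -2  # precision of the normal(mu, sigma) prior
    return (prec_data * alpha_mode + prec_prior * mu) / (prec_data + prec_prior)


# Hypothetical no-pooling optimum, shared mean, and prior scale.
alpha_mode, mu, sigma = 2.0, 0.0, 1.0

gaps = []
for n_j in (4, 100, 10000):
    lam = n_j ** -0.5  # assumed scaling lambda_jk = n_j^(-1/2)
    post = pooled_posterior_mean(alpha_mode, lam, mu, sigma)
    gaps.append(alpha_mode - post)  # amount of shrinkage toward the shared mean

# n_j times the gap stays roughly constant, i.e. the gap shrinks like 1 / n_j.
scaled = [n_j * g for n_j, g in zip((4, 100, 10000), gaps)]
```

With these numbers the gap equals $2/(n_j+1)$, so small cells are pulled strongly toward the shared mean while large cells stay close to their no-pooling optimum.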

The possibility of partial pooling across cells encourages open-ended data gathering. In the election polling example, even if a pollster is only interested in the forecast of one state, they could gather polling data from everywhere else, fit multiple models, evaluate models on each state, and use hierarchical stacking to construct model averaging, which is especially applicable when the state of interest does not have enough polls to conduct a meaningful model evaluation individually. In this context swing states naturally have more state polls, so that the small-area estimation may not be crucial, but in general, we conjecture that the hierarchical techniques can be useful for model evaluation and averaging in a more general domain adaptation setting. Without going into extra details, hierarchical models are as useful for making inferences from a subset of data (small-area estimation) as for generalizing data to a new area (extrapolation). When the latter task is the focus, hierarchical stacking only needs to redefine the leave-one-data-out predictive density (6) by the leave-one-cell-out version, $p_{k,-i}:=\int_{\Theta_k}p(y_i\mid\theta_k,x_i,z_i,M_k)\,p\left(\theta_k\mid M_k,\{(x_{i'},z_{i'},y_{i'}):x_{i'}\neq x_i\}\right)d\theta_k$.

6.2. Using hierarchical stacking to understand local model fit

We use hierarchical stacking not only as a tool for optimizing predictions but also as a way to understand problems with fitted models. The fact that hierarchical stacking is being used is already an implicit recognition that we have different models that perform better or worse in different subsets of data, and it can be valuable to explore the conditions under which different models are fitting poorly, reveal potential problems in the data or data processing, and point to directions for individual-model improvement.

Vehtari et al. (2017) and Gelman et al. (2020) suggested examining the pointwise cross-validated log score $\log p_{k,-i}$ as a function of $x_i$, to see if there is a pattern or explanation for why some observations are harder to fit than others. For example, the first panel of Figure 2 seems to indicate that Model 1 is incapable of fitting the rightmost 10–15 non-switchers. However, $\log p_{k,-i}$ contains a non-vanishing variance since $y_i$ is a single realization from $p_t(y\mid x_i)$. Despite its merit in exploratory data analysis, it is hard to tell from the raw cross validation scores whether Model 1 is incapable of fitting high arsenic or is merely unlucky for these few points. The hierarchical stacking weight $w(x)$ provides a smoothed summary of how each model fits locally in $x$ and comes with built-in Bayesian uncertainty estimation. For example, in Figure 5, $\log p_{1,-i}-\log p_{2,-i}$ has a slightly inflated right tail, but this small bump is smoothed by stacking, and the local weight therein is close to (0.5, 0.5).

6.3. Retrieving a formal likelihood from an optimization objective

The implication of hierarchical stacking (11) being a formal Bayesian model is that we can evaluate its posterior distribution as with a regular Bayesian model. For example, we can run (approximate) leave-one-out cross validation of the stacking posterior, $p(w\mid\mathcal{D}_{-i})\propto\frac{p(w\mid\mathcal{D})}{p(y_i\mid x_i,w)}=\frac{p(w\mid\mathcal{D})}{\sum_{k=1}^{K}w_k(x_i)p_{k,-i}}$. In practice, we only need to fit the stacking model (11) once, collect a size-$S$ MCMC sample of stacking parameters from the full posterior $p(w\mid\mathcal{D})$, denoted by $\{(w_{k1}(x_i),\dots,w_{kS}(x_i))\}_{i,k}$, compute the PSIS-stabilized importance ratio of each draw, $r_{is}\approx\left(\sum_{k=1}^{K}w_{ks}(x_i)p_{k,-i}\right)^{-1}$, and then compute the mean leave-one-out cross validated log predictive density to evaluate the overall out-of-sample fit of the final stacked model:

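$$\mathrm{elpd}^{\mathrm{stacking}}_{\mathrm{loo}}\approx\sum_{i=1}^{n}\log\frac{\sum_{s=1}^{S}\left(r_{is}\sum_{k=1}^{K}w_{ks}(x_i)p_{k,-i}\right)}{\sum_{s=1}^{S}r_{is}}. \tag{28}$$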
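
The post-processing step can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: it uses raw importance ratios instead of PSIS-stabilized ones, and the function name, array layout, and toy numbers are all hypothetical. With raw ratios the estimate for each point reduces to the harmonic mean of the mixture densities over posterior draws.

```python
import math


def stacking_loo_elpd(weight_draws, loo_dens):
    """Importance-sampling estimate of the stacked model's leave-one-out elpd.

    weight_draws[s][i][k]: posterior draw s of the weight of model k at x_i.
    loo_dens[i][k]: leave-one-out predictive density of model k at point i.
    """
    total = 0.0
    for i, dens_i in enumerate(loo_dens):
        num = 0.0
        den = 0.0
        for draw in weight_draws:
            mix = sum(w * p for w, p in zip(draw[i], dens_i))
            r = 1.0 / mix  # raw importance ratio; PSIS smoothing is omitted here
            num += r * mix
            den += r
        total += math.log(num / den)
    return total


# Hypothetical leave-one-out densities for n = 2 points and K = 2 models.
loo_dens = [[0.2, 0.4], [0.5, 0.1]]

# Two identical draws with equal weights: the estimate is the plain mixture score.
same_draws = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
elpd_same = stacking_loo_elpd(same_draws, loo_dens)

# Draws that disagree: the harmonic mean penalizes draws that fit a point poorly.
mixed_draws = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
elpd_mixed = stacking_loo_elpd(mixed_draws, loo_dens)
```

No model or stacking refit is needed: only the stored posterior draws of the weights and the pointwise leave-one-out densities enter the computation.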

As discussed in Section 4, the same task of out-of-sample prediction evaluation in an optimization-based stacking requires double cross validation (refitting the model $n(n-1)$ times if using leave-one-out), but now becomes almost computationally free by post-processing posterior draws of stacking.

The Bayesian justification above applies to log-score stacking. In general, we cannot convert an arbitrary objective function into a log density: its exponential is not necessarily integrable, and, even if it is, the resulting density does not necessarily correspond to a relevant model. Take linear regression, for example. The ordinary least squares estimate $\arg\min_{\beta}\sum_{i=1}^{n}(y_i-x_i^{T}\beta)^2$ is identical to the maximum likelihood estimate of $\beta$ from a probabilistic model $y_i\mid x_i,\beta,\sigma\sim\mathrm{normal}(x_i^{T}\beta,\sigma)$ with flat priors. But the directly adapted “log posterior density” from the negative $L^2$ loss, $\log p(\beta\mid y)=-\sum_{i=1}^{n}(y_i-x_i^{T}\beta)^2+C$, differs from the Bayesian inference of the latter probabilistic model unless $\sigma\equiv 1$. The hierarchical stacking framework may still apply to other scoring rules, while we leave their Bayesian calibration for future research.

6.4. Statistical workflow for black box algorithms

Unlike our previous work (Yao et al., 2018), which merely applied stacking to Bayesian models, the present paper converts optimization-based stacking itself into a formal Bayesian model, analogous to reformulating a least-squares estimate into a normal-error regression. Breiman (2001) distinguished between two cultures: generative modeling assumes that data come from a given stochastic model, whereas algorithmic modeling treats the data mechanism as unknown and advocates black box learning for the goal of predictive accuracy. As a method that Breiman himself introduced (along with Wolpert, 1992), stacking is arguably closer to the algorithmic end of the spectrum, while our hierarchical Bayesian formulation pulls it toward the generative modeling end.

Such a full-Bayesian formulation is appealing for two reasons. First, the generative modeling language facilitates flexible data inclusion during model averaging. For example, the election forecast model contains various outcomes on state polls and national polls from several pollsters, pollster-, state-, and national-level fundamental predictors, and prior state-level correlations. It is not clear how methods like bagging or boosting can include all of them. Data do not have to conveniently arrive in independent $(x_i, y_i)$ pairs and compliantly await an algorithm to train upon. Second, instead of a static algorithm, hierarchical stacking is now part of a statistical workflow (Gelman et al., 2020). It then enjoys all the flexibility of Bayesian model building, fitting, and checking: we can incorporate other Bayesian shrinkage priors as add-on components without reinventing them; we can run a posterior predictive check or approximate leave-one-out cross validation (28) to assess the out-of-sample performance of the final stacking model; we may even further select, stack, or hierarchically stack a sequence of hierarchical stacking models with various priors and parametric forms. Looking ahead, the success of this work encourages more use of generative Bayesian modeling to improve other black box prediction algorithms.

Supplementary Material


Appendices to “Bayesian hierarchical stacking: Some models are (somewhere) useful” (DOI: 10.1214/21-BA1287SUPP; .pdf).

Acknowledgments

The authors thank the National Science Foundation, Institute of Education Sciences, Office of Naval Research, National Institutes of Health, Sloan Foundation, Schmidt Futures, and the Academy of Finland Flagship programme: Finnish Center for Artificial Intelligence (FCAI) for partial financial support. Gregor Pirš is supported by the Slovenian Research Agency young researcher grant.

Footnotes

1

We use the bold letter $\mathbf{w}$, or $\mathbf{w}(\cdot)$, to reflect that the weight is a vector, or a vector of functions.

References

  1. Abramowitz AI (2008). “Forecasting the 2008 presidential election with the time-for-change model.” Political Science and Politics, 41: 691–695.
  2. Bernardo JM and Smith AF (1994). Bayesian Theory. Chichester: Wiley. MR1274699. doi: 10.1002/9780470316870.
  3. Bhatt S, Cameron E, Flaxman SR, Weiss DJ, Smith DL, and Gething PW (2017). “Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization.” Journal of The Royal Society Interface, 14.
  4. Breiman L (1996). “Stacked regressions.” Machine Learning, 24: 49–64.
  5. Breiman L (2001). “Statistical modeling: the two cultures.” Statistical Science, 16: 199–231. MR1874152. doi: 10.1214/ss/1009213726.
  6. Bürkner P-C, Gabry J, and Vehtari A (2020). “Approximate leave-future-out cross-validation for Bayesian time series models.” Journal of Statistical Computation and Simulation, 90: 2499–2523. MR4145352. doi: 10.1080/00949655.2020.1783262.
  7. Clarke B (2003). “Comparing Bayes model averaging and stacking when model approximation error cannot be ignored.” Journal of Machine Learning Research, 4: 683–712. MR2072265. doi: 10.1162/153244304773936090.
  8. Clyde M and Iversen ES (2013). “Bayesian model averaging in the M-open framework.” In Bayesian Theory and Applications, 483–498. Oxford University Press. MR3221178. doi: 10.1093/acprof:oso/9780199695607.003.0024.
  9. Fushiki T (2020). “On the selection of the regularization parameter in stacking.” Neural Processing Letters, 1–12.
  10. Gelman A and Pardoe I (2006). “Bayesian measures of explained variance and pooling in multilevel (hierarchical) models.” Technometrics, 48: 241–251. MR2277678. doi: 10.1198/004017005000000517.
  11. Gelman A, Vehtari A, Simpson D, Margossian CC, Carpenter B, Yao Y, Kennedy L, Gabry J, Bürkner P-C, and Modrák M (2020). “Bayesian workflow.” arXiv:2011.01808.
  12. Ghasemian A, Hosseinmardi H, Galstyan A, Airoldi EM, and Clauset A (2020). “Stacking models for nearly optimal link prediction in complex networks.” Proceedings of the National Academy of Sciences, 117: 23393–23400.
  13. Heidemanns M, Gelman A, and Morris GE (2020). “An updated dynamic Bayesian forecasting model for the US presidential election.” Harvard Data Science Review, 2.
  14. Heiner M, Kottas A, and Munch S (2019). “Structured priors for sparse probability vectors with application to model selection in Markov chains.” Statistics and Computing, 29: 1077–1093. MR3994618. doi: 10.1007/s11222-019-09856-2.
  15. Hoeting JA, Madigan D, Raftery AE, and Volinsky CT (1999). “Bayesian model averaging: a tutorial.” Statistical Science, 382–401. MR1765176. doi: 10.1214/ss/1009212519.
  16. Jacobs RA, Jordan MI, Nowlan SJ, and Hinton GE (1991). “Adaptive mixtures of local experts.” Neural Computation, 3: 79–87.
  17. Jordan MI and Jacobs RA (1994). “Hierarchical mixtures of experts and the EM algorithm.” Neural Computation, 6: 181–214.
  18. Kamary K, Mengersen K, Robert CP, and Rousseau J (2019). “Testing hypotheses via a mixture estimation model.” arXiv:1412.2044.
  19. Le T and Clarke B (2017). “A Bayes interpretation of stacking for M-complete and M-open settings.” Bayesian Analysis, 12: 807–829. MR3655877. doi: 10.1214/16-BA1023.
  20. LeBlanc M and Tibshirani R (1996). “Combining estimates in regression and classification.” Journal of the American Statistical Association, 91: 1641–1650. MR1439105. doi: 10.2307/2291591.
  21. Linzer DA (2013). “Dynamic Bayesian forecasting of presidential elections in the states.” Journal of the American Statistical Association, 108: 124–134. MR3174607. doi: 10.1080/01621459.2012.737735.
  22. Meng X-L and van Dyk DA (1999). “Seeking efficient data augmentation schemes via conditional and marginal augmentation.” Biometrika, 86: 301–320. MR1705351. doi: 10.1093/biomet/86.2.301.
  23. Neal RM (1998). “Regression and classification using Gaussian process priors.” In Bernardo J, Berger JO, Dawid AP, and Smith AFM (eds.), Bayesian Statistics, volume 6, 475–501. Oxford University Press. MR1723510.
  24. Piironen J and Vehtari A (2017). “Comparison of Bayesian predictive methods for model selection.” Statistics and Computing, 27(3): 711–735. MR3613594. doi: 10.1007/s11222-016-9649-y.
  25. Pirš G and Štrumbelj E (2019). “Bayesian combination of probabilistic classifiers using multivariate normal mixtures.” Journal of Machine Learning Research, 20: 1–18. MR3948091.
  26. Polson NG and Scott SL (2011). “Data augmentation for support vector machines.” Bayesian Analysis, 6: 1–23. MR2781803. doi: 10.1214/11-BA601.
  27. Reid S and Grudic G (2009). “Regularized linear models in stacked generalization.” In International Workshop on Multiple Classifier Systems, 112–121.
  28. Şen MU and Erdogan H (2013). “Linear classifier combination and selection using group sparse regularization and hinge loss.” Pattern Recognition Letters, 34: 265–274.
  29. Shimodaira H (2000). “Improving predictive inference under covariate shift by weighting the log-likelihood function.” Journal of Statistical Planning and Inference, 90: 227–244. MR1795598. doi: 10.1016/S0378-3758(00)00115-4.
  30. Sill J, Takács G, Mackey L, and Lin D (2009). “Feature-weighted linear stacking.” arXiv:0911.0460.
  31. Stan Development Team (2020). Stan Modeling Language Users Guide and Reference Manual. Version 2.25.0, http://mc-stan.org.
  32. Sugiyama M, Krauledat M, and Müller K-R (2007). “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research, 8: 985–1005.
  33. Sugiyama M and Müller K-R (2005). “Input-dependent estimation of generalization error under covariate shift.” Statistics and Decisions, 23: 249–280. MR2255627. doi: 10.1524/stnd.2005.23.4.249.
  34. Svensën M and Bishop CM (2003). “Bayesian hierarchical mixtures of experts.” In Uncertainty in Artificial Intelligence.
  35. Tracey BD and Wolpert DH (2016). “Reducing the error of Monte Carlo algorithms by learning control variates.” In Conference on Neural Information Processing Systems.
  36. Vehtari A, Gelman A, and Gabry J (2017). “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Statistics and Computing, 27: 1413–1432. MR3647105. doi: 10.1007/s11222-016-9696-4.
  37. Vehtari A and Ojanen J (2012). “A survey of Bayesian predictive methods for model assessment, selection and comparison.” Statistics Surveys, 6: 142–228. MR3011074. doi: 10.1214/12-SS102.
  38. Vehtari A, Simpson D, Gelman A, Yao Y, and Gabry J (2019). “Pareto smoothed importance sampling.” arXiv:1507.02646.
  39. Waterhouse S, MacKay D, and Robinson T (1996). “Bayesian methods for mixtures of experts.” In Advances in Neural Information Processing Systems.
  40. Wolpert DH (1992). “Stacked generalization.” Neural Networks, 5: 241–259.
  41. Yang Y and Dunson DB (2014). “Minimax optimal Bayesian aggregation.” arXiv:1403.1345.
  42. Yao Y, Vehtari A, and Gelman A (2020). “Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors.” arXiv:2006.12335.
  43. Yao Y, Vehtari A, Simpson D, and Gelman A (2018). “Using stacking to average Bayesian predictive distributions (with discussion).” Bayesian Analysis, 13: 917–1007. MR3853125. doi: 10.1214/17-BA1091.
  44. Yao Y, Pirš G, Vehtari A, and Gelman A (2021). “Supplementary material for: Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful.” Bayesian Analysis. doi: 10.1214/21-BA1287SUPP.
  45. Zhang L and Zhou W-D (2011). “Sparse ensembles using weighted combination methods based on linear programming.” Pattern Recognition, 44: 97–106. MR3184068.
