Improving multilevel regression and poststratification with structured priors

Yuxiang Gao; Lauren Kennedy; Daniel Simpson; Andrew Gelman

doi:10.1214/20-ba1223

Author manuscript; available in PMC: 2022 Jun 16.

Published in final edited form as: Bayesian Anal. 2020 Jul 15;16(3):719–744. doi: 10.1214/20-ba1223

PMCID: PMC9203002 NIHMSID: NIHMS1811398 PMID: 35719315

Abstract

A central theme in the field of survey statistics is estimating population-level quantities through data coming from potentially non-representative samples of the population. Multilevel regression and poststratification (MRP), a model-based approach, is gaining traction against the traditional weighted approach for survey estimates. MRP estimates are susceptible to bias if there is an underlying structure that the methodology does not capture. This work aims to provide a new framework for specifying structured prior distributions that lead to bias reduction in MRP estimates. We use simulation studies to explore the benefit of these prior distributions and demonstrate their efficacy on non-representative US survey data. We show that structured prior distributions offer absolute bias reduction and variance reduction for posterior MRP estimates in a large variety of data regimes.

Keywords: Multilevel regression and poststratification, non-representative data, bias reduction, small-area estimation, structured prior distributions, Stan, INLA

1. Introduction

Multilevel regression and poststratification (MRP) is an increasingly popular tool for adjusting a non-representative sample to a larger population. In particular, MRP appears to be effective in areas where conventional design-based survey approaches have traditionally struggled, notably small-area estimation (Rao, 2014; Pfeffermann et al., 2013; Zhang et al., 2014) and with convenience sampling (Wang et al., 2015).

One difference between MRP and traditional poststratified design-based weights is that MRP uses partial pooling. Simple poststratification has difficulties with empty cells, in which case the usual practice is to poststratify only on marginals (thus ignoring interactions), or pool cells together. In contrast, the partial pooling of multilevel modeling automatically regularizes group estimates.

Although other options for regularization with poststratification have been explored (Gelman, 2018; Bisbee, 2019), applications of MRP typically assume independent group-level errors. For example, a model for a political poll might include varying intercepts for states, modeled using a regression on region indicators and state-level predictors such as previous voting patterns in the state, plus independent errors at the state level. In some applications, though, there is potential benefit from including underlying structure not captured by regression predictors. We demonstrate that this structure can be captured through more complex prior specifications. For example, instead of independent errors for an ordered categorical predictor, we specify an autoregressive structure. Ordered predictors are just one example where we can introduce structured prior distributions.

We begin this manuscript by providing an overview of the existing literature on methods that adjust for nonrepresentative data, as well as a review on the common application areas of MRP.

1.1. Existing literature on post-sampling adjustments for non-representativeness and MRP

Post-sampling adjustments aim to correct for differences between a potentially biased sample and a target population. Poststratification is a commonly used weighting procedure for nonresponse in model-based survey estimates (Little, 1993). It can improve accuracy of estimates but is no silver bullet, since the quality of poststratified estimates depends on the quality of the known information about the population sizes of the strata, along with the assumption that the sample is representative of the population within each poststratification cell. An approximation to poststratification is raking, which is an iterative algorithm using marginal totals (Deming and Stephan, 1940; Lohr, 2009; Skinner et al., 2017). When adjusting for many factors, raking can yield unstable estimates caused by high variability of the adjusted weights (Izrael et al., 2009). For a modern overview of current methods of inference and post-sampling adjustments for nonprobability samples, see Elliott et al. (2017). As the demands for small-area estimation increase, so too should the utility of MRP. We use structured priors in our proposed improvement for MRP, with the aim of more sensible shrinkage of posterior estimates that should ultimately reduce estimation bias.

MRP has been used in a broad range of applied problems ranging from epidemiology (Zhang et al., 2014; Downes et al., 2018) to social science (Lax and Phillips, 2009; Wang et al., 2015; Trangucci et al., 2018). MRP’s beginnings saw applications in political science (Gelman and Little, 1997; Park et al., 2004) for the estimation of state-level opinions from national polls. The breadth of its applications has since matured substantially, even to the extent of being used by data journalists (Morris, 2019). One of MRP’s appeals to applied researchers is the ability to produce reliable estimates for small areas in the population and simultaneously adjust for non-representativeness.

On the methodology front, Ghitza and Gelman (2013) and Gelman et al. (2016) extended MRP to include varying intercepts and slopes for interactions, along with inference for time series of polls.

1.2. Outline for this paper

This work explores alternative regularization techniques with structured prior distributions that lead to absolute bias reduction in MRP estimates. Our methodology of structured priors should not be confused with that of Si et al. (2017), who define structured priors as a way to perform variable selection for higher-order interaction terms. Our improvements on estimation precision come from replacing independent distributions of varying coefficients with Gaussian Markov random fields (Rue and Held, 2005).

This paper is structured as follows: Section 1 provides an overview of the existing literature for MRP and nonrepresentative data. Section 2 gives a concise overview of MRP and what’s required for the methodology. Section 3 describes our structured priors framework in detail, along with motivation for their use in MRP. Section 4 presents simulation studies of structured priors across various regimes of non-representative survey data, explains the simulation setup, interprets the results, and compares bias and variance between structured priors and the classical independent random effects in MRP. Section 5 contains the application of structured priors in MRP to a real survey data set that’s non-representative. Section 6 is the conclusion. All computation carried out in this manuscript was done in R (R Core Team, 2019).

2. Overview of MRP

Multilevel regression and poststratification (Gelman and Little, 1997) proceeds by fitting a hierarchical regression model to survey data, and then using the population size of each poststratification cell to construct weighted survey estimates. More formally, suppose that the population contains K categorical variables and that the k^th has J_k categories. Hence the population can be represented by $J = \prod_{k = 1}^{K} J_{k}$ cells. Often the population also contains continuous variables, in which case these variables are discretized to form categorical variables. For example, age in a demographic study can be discretized into a finite number of categories. For every cell j, there is a known population size N_j. Increasing the number of groups for a continuous variable will increase the number of cells J and correspondingly decrease the individual cell population sizes N_j.
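As a rough illustration of how the number of cells grows, the sketch below enumerates the cells for a small set of categorical variables. The category counts are illustrative assumptions taken from the simulation setup in Section 4.1 (51 states, 12 age categories, 4 income categories); the variable names are hypothetical.

```python
from itertools import product
from math import prod

# Illustrative category counts J_k (an assumption borrowed from Section 4.1).
levels = {"state": 51, "age_cat": 12, "income": 4}

# J is the product of the per-variable category counts.
num_cells = prod(levels.values())

# Each cell is one combination of category indices, i.e. one row of the
# poststratification table.
cells = list(product(*(range(n) for n in levels.values())))
```

Splitting age into more categories multiplies the number of cells, which is why the individual cell population sizes shrink as the discretization gets finer.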

Choosing the optimal group size for continuous variables is a difficult model selection problem, involving tradeoffs between accuracy and computational load, and this is something that we do not address in this manuscript. This area of MRP research is an active field and the best practice has yet to emerge from the research.

Suppose that the response variable of individual i is y_i ∈ {0, 1}. MRP for binary survey responses is summarized by the two steps below:

Multilevel regression step.

Let n be the number of individual observations in a dataset. Fit the hierarchical logistic regression model below to get estimated population averages θ_j for every cell j ∈ {1, … , J}. The hierarchical logistic regression portion of MRP has a set of varying intercepts $\{α_{j^{*}}^{k}\}_{j^{*} = 1}^{J_{k}}$ for each categorical covariate k, which have the effect of partially pooling each θ_j towards a globally-fitted regression model, X_jβ, with sparse cells benefiting the most from this regularization. X_j is the row in the design matrix that corresponds to θ_j, where j ∈ {1, … , J}. We follow a notation consistent with Gelman and Hill (2006).

$$
\begin{aligned}
\Pr(y_{i} = 1) &= \operatorname{logit}^{-1}\Big(X_{i} β + \sum_{k = 1}^{K} α_{j[i]}^{k}\Big), && \text{for } i = 1, \dots, n \\
α_{j}^{k} \mid σ^{k} &\overset{\text{ind.}}{\sim} N(0, (σ^{k})^{2}), && \text{for } k = 1, \dots, K,\; j = 1, \dots, J_{k} \\
σ^{k} &\sim N_{+}(0, 1), && \text{for } k = 1, \dots, K \\
β &\sim N(0, 1),
\end{aligned}
$$

where we are giving default weakly informative priors to the non-varying regression coefficients β.
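To make the likelihood concrete, here is a minimal sketch of how the response probability for one respondent is assembled from the non-varying part X_iβ and the K varying intercepts selected by that respondent's categories. The function names are hypothetical, and this only evaluates the formula for given parameter values; it does not fit the model.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic) function, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def response_probability(fixed_part, intercepts):
    """Pr(y_i = 1) for one respondent.

    fixed_part: the value of X_i * beta (a float).
    intercepts: the K varying intercepts alpha^k_{j[i]} picked out by the
                respondent's categories (an iterable of floats).
    """
    return inv_logit(fixed_part + sum(intercepts))
```

When all intercepts are shrunk to zero, the probability reduces to the prediction of the global regression model, which is the sense in which sparse cells are pooled towards X_jβ.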

Poststratification step.

Using the known population sizes N_j of each cell j, poststratify to get posterior preference probabilities at the subpopulation level. The poststratification portion of MRP adjusts for nonresponse in the population by taking into account the size of every cell j relative to the total population size $N = \sum_{j = 1}^{J} N_{j}$. Another way to interpret poststratification is as a weighted average of cell-wise posterior preferences, where the weighting scheme is determined by the size of each cell in the population. Smaller cells get downweighted and larger cells get upweighted. The final result is a more accurate estimate in the presence of non-representative data.

Let S be some subset of the population defined based on the poststratification matrix. Then the poststratified estimand for S is:

θ_{S} ≔ \frac{\sum_{j \in S} N_{j} θ_{j}}{\sum_{j \in S} N_{j}}

For example, S could correspond to the oldest age category in the lowest income bracket. Then θ_S would correspond to the proportion of people in this sub-population that would respond yes to the survey question of interest. It’s important to note that one can model at a finer scale than the poststratification scale.
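The estimand θ_S is a population-weighted average, which can be sketched in a few lines. The function name and the toy numbers below are illustrative assumptions, not values from the paper.

```python
def poststratify(theta, pop_sizes, subset=None):
    """Population-weighted average of cell estimates over a subset S.

    theta:     list of cell-level estimates theta_j.
    pop_sizes: list of known cell population sizes N_j.
    subset:    iterable of cell indices defining S (None means all cells).
    """
    idx = list(range(len(theta))) if subset is None else list(subset)
    total = sum(pop_sizes[j] for j in idx)
    if total == 0:
        raise ValueError("subset has zero population")
    return sum(pop_sizes[j] * theta[j] for j in idx) / total


# Toy example with three cells (made-up numbers).
theta = [0.2, 0.5, 0.8]
pop_sizes = [100, 300, 600]
overall = poststratify(theta, pop_sizes)        # all cells
subgroup = poststratify(theta, pop_sizes, [1, 2])  # S = {cell 1, cell 2}
```

In a fully Bayesian fit, one would typically apply such a function to each posterior draw of the cell estimates, which yields a posterior distribution for θ_S rather than a single number.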

3. Proposed Approach and Motivation

We consider structured prior distributions for MRP taking the form of Gaussian Markov random fields (GMRF), modeling certain structure of the underlying categorical covariate in the hierarchical regression. We proceed as follows for a covariate in the population of interest:

Case 1: If we do not want to model any structure in a categorical covariate, we model its varying intercepts as independently normally distributed. This would be what’s described in the previous section, Overview of MRP.

Case 2: If there is underlying structure we would like to model in a covariate, and spatial smoothing using this structure seems sensible for the outcome of interest, then we use an appropriate GMRF as a prior distribution for this batch of varying intercepts.

We will specify informative hyperpriors when possible and model via a full Bayesian approach. For a detailed overview of principled hyperprior specification in GMRF models, we refer the reader to Simpson et al. (2017). We also do not restrict structured priors to either directed or undirected conditional distributions (Rue and Held, 2005). Some examples of directed conditional distributions include the autoregressive and random walk processes with discrete time indices, which are frequently used in time series analysis. The CAR and ICAR processes (Besag, 1975) are common undirected conditional distributions and are often used in specifying priors in spatial models.

More complex prior structure allows for nonuniform information-borrowing in the presence of non-representative surveys from a population. For example, it makes sense to partially pool inferences for the oldest age group toward data from the second-oldest group. An autoregressive prior placed on the ordinal variable age achieves this effect, without making the strong global assumptions involved in simply including age as a linear or quadratic predictor in the regression. The proposal of using structured priors aims to reduce bias for MRP estimates in extremely non-representative data regimes.

Structured priors improve upon the multilevel aspect of MRP while maintaining the regression structure. Because MRP is a model-based survey estimation approach, the multilevel regression component can be replaced with other forms of regression modelling, for example with sparse hierarchical regression (Goplerud et al., 2018) or Bayesian additive regression trees (Bisbee, 2019). It is important, though, that the regression step be regularized in some way to preserve the ability of the method to account for a potentially large number of adjustment factors and their interactions (Gelman, 2018). When compared to machine learning-style regularization methods as seen in Bisbee (2019) and Goplerud et al. (2018), structured priors offer a lot more interpretability in the modelling step of MRP. For example, it’s not as clear how Bayesian additive regression trees (BART) allow for information-borrowing for structured covariates despite BART and related regularized tree-based methods (Chen and Guestrin, 2016) being modern methods in out-of-sample prediction. In contrast to this, an AR(1) prior with autoregression coefficient ρ ∈ [0, 1) on the ordinal variable age has the clear interpretation that posterior estimates are regularized towards their previous first-order neighbouring age, where the amount of regularization is determined by the coefficient ρ.

GMRFs have a deep connection with Gaussian Processes (GPs). As the discretization of the underlying space gets finer, under certain technical conditions, a GMRF will converge to a specific GP (Lindgren et al., 2011). More details on the style of convergence can be found in Lindgren et al. (2011). These theoretical foundations show that structured priors in MRP are a discrete approximation to GP priors for structured covariates. GPs are universal function approximators, hence structured priors in MRP can be thought of as a flexible modeling strategy that simultaneously takes into account various data-generating processes and offers interpretable regularization.

4. Simulation Studies

4.1. Directed structured priors example: Models for partial pooling of group-level errors

For the first simulation example, we work with a simple model of three poststratification categories (51 states, age in years ranging from 21 to 80, and income in 4 categories) and no other predictors. Age is further categorized into 12 groups. In practice, the number of categories that age is discretized into for MRP models is a lot smaller than the total number of possible integer ages (Lax and Phillips, 2009; Trangucci et al., 2018). We chose 12 categories because this is large enough to allow the prior distribution structure to introduce intelligent information-borrowing in the age covariate, yet a lot smaller than the maximal number of age categories, 60. We define $α_{j[i]}^{Age Cat.}$, $α_{j[i]}^{Income}$ and $α_{j[i]}^{State}$ to be the varying intercepts for age category, income category and state respectively for the i^th survey respondent. Phone surveys will often have these three covariates on respondents, with the nonrepresentativeness of the survey often driven by varying nonresponse rates across ages in the population.

For all three prior specifications of MRP, we use the link function,

$$
\Pr(y_{i} = 1) = \operatorname{logit}^{-1}\big(β^{0} + α_{j[i]}^{\text{State}} + α_{j[i]}^{\text{Age Cat.}} + α_{j[i]}^{\text{Income}}\big), \quad \text{for } i = 1, \dots, n. \tag{4.1}
$$

For all three prior specifications we assume independent mean-zero normal distributions for the $α_{j[i]}^{Region}$'s, $α_{j[i]}^{Age Cat.}$'s and $α_{j[i]}^{Income}$'s along with a weakly informative half-normal distribution for the corresponding scale parameter:

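$$
\begin{aligned}
α_{j}^{\text{State}} \mid β^{\text{State-VS}}, β^{\text{Relig.}}, (α_{j^{*}}^{\text{Region}})_{j^{*} = 1}^{5}, σ^{\text{State}} &\overset{\text{ind.}}{\sim} N\big(α_{m[j]}^{\text{Region}} + β^{\text{Relig.}} X_{\text{Relig.}, j} + β^{\text{State-VS}} X_{\text{State-VS}, j},\ (σ^{\text{State}})^{2}\big), && \text{for } j = 1, \dots, 51 \\
α_{m}^{\text{Region}} \mid σ^{\text{Region}} &\overset{\text{ind.}}{\sim} N(0, (σ^{\text{Region}})^{2}), && \text{for } m = 1, \dots, 5 \\
α_{j}^{\text{Income}} \mid σ^{\text{Income}} &\overset{\text{ind.}}{\sim} N(0, (σ^{\text{Income}})^{2}), && \text{for } j = 1, \dots, 4 \\
σ^{\text{Income}}, σ^{\text{State}}, σ^{\text{Region}} &\sim N_{+}(0, 1) \\
β^{\text{State-VS}}, β^{\text{Relig.}}, β^{0} &\sim N(0, 1)
\end{aligned} \tag{4.2}
$$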

where X_State-VS,j ∈ [0, 1] is the covariate that corresponds to the 2004 Democratic vote share for state j and X_Relig.,j ∈ [0, 1] is the percentage of conservative religion in state j, which is defined as the sum of the percentage of Mormons and percentage of Evangelicals in state j. The terms in $α_{m[j]}^{Region} + β^{Relig.} X_{Relig., j} + β^{State-VS} X_{State-VS, j}$ form the state-level predictor, which uses auxiliary data to account for structured differences among the states.

The baseline specification is the classical prior distribution used in MRP with independent normal distributions for the varying intercepts for age categories:

$$
\begin{aligned}
α_{j}^{\text{Age Cat.}} \mid σ^{\text{Age Cat.}} &\overset{\text{ind.}}{\sim} N(0, (σ^{\text{Age Cat.}})^{2}), && \text{for } j = 1, \dots, 12 \\
σ^{\text{Age Cat.}} &\sim N_{+}(0, 1)
\end{aligned} \tag{4.3}
$$

The autoregressive specification models the ordinal structure of age category as a first-order autoregression (Rue and Held, 2005). The prior distribution imposed on ρ restricts it to the range (−1, 1), enforcing stationarity of the autoregressive process.

$$
\begin{aligned}
α_{1}^{\text{Age Cat.}} \mid ρ, σ^{\text{Age Cat.}} &\sim N\Big(0, \frac{1}{1 - ρ^{2}} (σ^{\text{Age Cat.}})^{2}\Big) \\
α_{j}^{\text{Age Cat.}} \mid α_{j - 1}^{\text{Age Cat.}}, \dots, α_{1}^{\text{Age Cat.}}, ρ, σ^{\text{Age Cat.}} &\sim N\big(ρ α_{j - 1}^{\text{Age Cat.}}, (σ^{\text{Age Cat.}})^{2}\big), && \text{for } j = 2, \dots, 12 \\
σ^{\text{Age Cat.}} &\sim N_{+}(0, 1) \\
(ρ + 1) / 2 &\sim \text{Beta}(0.5, 0.5)
\end{aligned} \tag{4.4}
$$
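The role of the factor 1/(1 − ρ²) in the variance of the first age category can be checked numerically: it makes every age category share the same marginal prior variance, which is the stationarity referred to above. The sketch below, with hypothetical function names, computes those marginal variances by recursion and draws one vector of intercepts from the AR(1) prior for fixed values of ρ and σ (hyperpriors are not sampled here).

```python
import random


def ar1_marginal_variances(rho, sigma, n):
    """Marginal prior variances of alpha_1..alpha_n under the AR(1) prior."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie in (-1, 1) for stationarity")
    var = sigma ** 2 / (1.0 - rho ** 2)      # variance of alpha_1
    variances = [var]
    for _ in range(n - 1):
        var = rho ** 2 * var + sigma ** 2     # Var(rho * alpha_{j-1} + noise)
        variances.append(var)
    return variances


def sample_ar1(rho, sigma, n, rng):
    """One draw of (alpha_1, ..., alpha_n) from the AR(1) prior."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie in (-1, 1) for stationarity")
    alpha = [rng.gauss(0.0, sigma / (1.0 - rho ** 2) ** 0.5)]
    for _ in range(n - 1):
        alpha.append(rng.gauss(rho * alpha[-1], sigma))
    return alpha
```

For instance, `ar1_marginal_variances(0.8, 1.0, 12)` returns the same value for all 12 categories, and `sample_ar1(0.8, 1.0, 12, random.Random(1))` gives one smooth-looking prior draw across the age categories.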

Finally, we consider the random walk specification, which is a special case of first-order autoregression with ρ fixed as 1, although with a different parameterization to avoid the division by 1 − ρ² above. In addition, we introduce the sum-to-zero constraint $\sum_{j = 1}^{J} α_{j}^{Age Cat .} = 0$ to ensure that the joint distribution for the first-order random walk process is identifiable.

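$$
\begin{aligned}
α_{j}^{\text{Age Cat.}} \mid α_{j - 1}^{\text{Age Cat.}}, \dots, α_{1}^{\text{Age Cat.}}, σ^{\text{Age Cat.}} &\sim N\big(α_{j - 1}^{\text{Age Cat.}}, (σ^{\text{Age Cat.}})^{2}\big), && \text{for } j = 2, \dots, 12 \\
σ^{\text{Age Cat.}} &\sim N_{+}(0, 1) \\
\sum_{j = 1}^{J} α_{j}^{\text{Age Cat.}} &= 0
\end{aligned} \tag{4.5}
$$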
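One way to visualize the random walk prior is to simulate from it. The sketch below is an illustration under stated assumptions rather than the parameterization used for model fitting: it accumulates independent normal increments starting from an arbitrary level of zero and then subtracts the mean so that the intercepts sum to zero. The function name is hypothetical and σ is treated as fixed.

```python
import random


def sample_rw1_sum_to_zero(sigma, n, rng):
    """One draw from a first-order random walk, centred to sum to zero."""
    walk = [0.0]                                  # arbitrary starting level
    for _ in range(n - 1):
        walk.append(rng.gauss(walk[-1], sigma))   # alpha_j | alpha_{j-1}
    mean = sum(walk) / n
    return [a - mean for a in walk]               # enforce the constraint
```

Centering leaves the differences between neighbouring categories untouched, so the constraint only removes the overall level, which is otherwise confounded with the intercept β⁰.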

The three prior specifications differ in the amount of information shared between neighbors in the age category random effect. In the baseline specification, no information is shared between $α_{j}^{Age Cat.}$ and $α_{j - 1}^{Age Cat.}$ for j ∈ {2, … , 12}. In the autoregressive specification, partial information is shared and in the random walk specification the full amount of information is shared. The sharing of information between $α_{j}^{Age Cat.}$ and $α_{j - 1}^{Age Cat.}$ is analogous to shrinkage of one posterior towards another. In this case, the posterior of $α_{j}^{Age Cat.}$ shrinks toward the posterior of $α_{j - 1}^{Age Cat.}$ under both the random walk and autoregressive specifications. The amount of shrinkage is governed by the autoregressive coefficient ρ. The reason behind specifying first-order autoregressive processes as the structured prior for age is that individuals with similar ages should have similar opinions for the survey question of interest. First-order autoregressive processes incorporate the prior assumption that the opinion of an individual of a certain age is similar to the opinion of individuals with exactly the same demographics except for a slightly younger age. In the simulation studies below, we empirically show that the property of shrinking towards the previous neighboring variable in the autoregressive and random walk specifications results in decreased posterior bias of MRP estimates for every cell in the population.

Simulated data

The sample

We consider three scenarios of true E(y) as a function of age: U-shaped, cap-shaped, or monotonically increasing. We investigate the effects of non-representative data amongst elderly individuals (ages 61–80) in the simulation samples, and show that the random walk specification provides the lowest absolute bias in subpopulation-level estimates when compared to the other two specifications. The likelihood of sampling from a subpopulation group given that individuals respond depends on the size of the subpopulation group along with the response probability of an individual in that group.

The probability vector of sampling is defined as:

$$
\frac{(\text{Probability of response}) ⊙ (N_{1}, \dots, N_{J})}{\sum_{j} (\text{Probability of response})_{j} \cdot N_{j}}
$$

where ⊙ is the Hadamard product. This probability vector is defined with reference to the poststratification matrix for this simulation study. A special case arises when the probability of response is equal for all cells in the population, resulting in a probability vector of sampling that’s fully representative of the population. The probability vector of sampling is used to generate a sample of binary responses along with covariates. By perturbing this probability vector, one can obtain highly non-representative samples for certain subpopulation groups. In the case of a completely random sample for subpopulation groups of interest, all subpopulation groups of interest have the same probability of sampling. As an example, all 12 age categories would have equal probability of being sampled in the scenario of completely random sampling for age categories.
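The probability vector of sampling can be computed directly from the definition above. This is a minimal sketch with a hypothetical function name and made-up inputs; it shows that equal response probabilities reproduce the population proportions, while unequal ones tilt the sample.

```python
def sampling_probabilities(response_prob, pop_sizes):
    """Normalized elementwise product of response probabilities and N_j."""
    if len(response_prob) != len(pop_sizes):
        raise ValueError("inputs must have one entry per cell")
    weights = [p * n for p, n in zip(response_prob, pop_sizes)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one cell must be able to respond")
    return [w / total for w in weights]


# Equal response probabilities: sampling is proportional to cell size.
representative = sampling_probabilities([0.1, 0.1, 0.1], [100, 300, 600])

# Unequal response probabilities: the first cell is heavily over-sampled.
tilted = sampling_probabilities([0.5, 0.1], [100, 100])
```

Summing the resulting probabilities over a set of cells gives the expected share of the sample coming from that set, which is the quantity perturbed across scenarios in the simulation study.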

Assumed sample and population

In the following simulation study we will assume that the population is sufficiently large so that sampling with replacement is equivalent to retrieving a random sample from the population.

To empirically validate the improvements that structured priors have on posterior MRP estimates, we construct various data regimes for age categories 9–12. More specifically, let S be the index set corresponding to age categories 9–12. Summing the probability of sampling over S will return the expected proportion of the sample who are older adults. We perturb this probability through 9 scenarios, ranging from 0.05 (underrepresenting older adults) to 0.82 (over-representing older adults). This section contains plots for the U-shaped true preference curve, with the appendix containing plots for the increasing-shaped true preference curve and the cap-shaped true preference curve.

These three true preference curves capture the rough structure of the unseen truths in real survey data. Let x represent age of an individual, and let f(x) represent the preference curve for age. The three different preference curves with respect to age are defined as:

Cap-shaped preference:

$$
f(x) = \frac{Γ(4)}{Γ(2) Γ(2)} x (1 - x), \quad x \in [21, 80]
$$

U-shaped preference:

$$
f(x) = 1 - \frac{Γ(4)}{Γ(2) Γ(2)} x (1 - x), \quad x \in [21, 80]
$$

Increasing-shaped preference:

$$
f(x) = 0.7 - 3 \exp\Big(- \frac{x}{0.2}\Big), \quad x \in [21, 80]
$$

True preferences for every poststratification cell j ∈ {1, … , J} in the population are then generated with the following formula:

$$
θ_{j} = \operatorname{logit}^{-1}\big(β^{0} + f(X_{\text{Age}[j]}) + X_{\text{Income}[j]} + β^{\text{State}} X_{\text{State}[j]} + β^{\text{Relig.}} X_{\text{Relig.}[j]}\big)
$$

where X_Age[j], X_Income[j], X_State[j], X_Relig.[j] correspond to the age, income effect, state effect and religion effect respectively of poststratification cell j. X_Income[j], X_State[j], X_Relig.[j] along with β⁰, β^State and β^Relig. are defined in the appendix.

Directed structured priors results

We fit all models using the probabilistic programming language Stan (Carpenter et al., 2017; Stan Development Team, 2016) to perform full Bayesian inference, using the default settings of 2000 iterations on 4 chains run in parallel, with half the iterations in each chain used for warmup.

Impact of prior choice on bias of posterior preferences

The first way we evaluate the impact of prior specification is by examining how bias changes as we manipulate the expected proportion of the sample that are older adults. In Figure 1, we plot the results for sample sizes of 100 and 500.

When the expected proportion of the sample that are older adults is equal to 0.33, this corresponds to a completely random sample for age categories (probability of sampling every age category is the same) and a fully representative sample for age categories (probability of sampling every age category is proportional to the population sizes for every age category). In certain scenarios, a completely random sample may be more desirable than a fully representative sample of the population for modeling purposes. Certainly, oversampling a sparse subpopulation group in the population will return lower variance model estimates for that specific subpopulation group.

We can see from Figure 1 that the two structured prior specifications outperform the baseline prior specification by a few percentage points for almost all 12 age categories and achieve the same performance for the remaining age categories.

When elderly individuals are undersampled relative to the rest of the population, the random walk prior specification outperforms the baseline prior specification in lower absolute bias by a few percentage points across all the age categories.

When elderly individuals are oversampled relative to the rest of the population, the random walk prior specification outperforms the baseline prior specification in lower absolute bias by close to 10 percentage points for mid-aged individuals when sample size is 100. As expected, the three prior specifications produce essentially the same posterior estimates in the bottom row of Figure 1, because the sample size is large in each of these age categories: increasing n increases the weight of the likelihood in the posterior. Regardless, absolute bias is reduced or stays the same for all age categories and all data regimes for the two structured priors specifications, as seen in Figure 1.

As a secondary benefit of structured priors, the simulation studies showed that, averaged over all 200 runs, the difference between the 90^th and 10^th posterior quantiles was smaller for almost all age categories when n = 100. This is shown in Figure 2. This difference can be interpreted as a measure of posterior standard deviation. When n = 500, the reduction in the posterior quantile difference is even more apparent. A reduction in posterior standard deviation may not be desirable when the tradeoff is higher absolute bias, but for structured priors we see a reduction in both for every age category, implying a decrease in L₂ risk for the posterior estimates of every age category.
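The quantile difference used as a spread measure is straightforward to compute from posterior draws. The sketch below uses a hypothetical function name and Python's standard library; the choice of the "inclusive" interpolation method is an assumption, and other quantile conventions would give slightly different values for small numbers of draws. For a normally distributed posterior, this difference equals roughly 2.56 standard deviations, which is why it can stand in for the posterior standard deviation.

```python
import statistics


def quantile_spread(draws):
    """Difference between the 90th and 10th percentiles of posterior draws."""
    deciles = statistics.quantiles(draws, n=10, method="inclusive")
    return deciles[-1] - deciles[0]
```

A narrower spread for one prior specification than another, at the same or lower absolute bias, indicates a more precise posterior for that age category.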

Figure 2: Differences in the 90^th and 10^th posterior quantiles for every age category when true preference is U-shaped for 200 simulations. The top row corresponds to a sample size of 100 and the bottom row corresponds to a sample size of 500. The shaded grey region corresponds to the age categories of older individuals for which we over/undersample. The left column has a probability of sampling age categories 9-12 equal to 0.05. The middle column has a probability of sampling age categories 9-12 equal to 0.33, which is completely random sampling *and* representative sampling for all age categories. The right column has a probability of sampling age categories 9-12 equal to 0.82. Local regression is used for the smoothed estimates amongst the three prior specifications. For the same plots involving different probabilities of sampling, refer to Table 4 in the appendix.

Figure 3 provides another visualization of bias reduction. It shows the bias of posterior preferences for each cell in the population, the finest granularity, as the expected proportion of the sample that are older adults is perturbed. Absolute bias is significantly decreased when switching from the baseline specification to the random walk specification. The autoregressive specification also reduces absolute bias, but not as much as the random walk specification. This is because the prior on ρ constrains, before inference, the amount of information borrowed from the neighboring age category posteriors to a value in (−1, 1), whereas the random walk specification fixes ρ at 1.

The population preference estimates for the three prior specifications remain nearly the same across all probability of sampling indices when the true preference curve is U-shaped or cap-shaped. When the true preference curve is increasing-shaped, the population preference remains nearly the same for all probability of sampling indices except 0.05 and 0.82. In those cases, the first-order random walk prior produces less biased population estimates by a few percentage points. The advantage of structured priors appears to be more pronounced at more granular sub-population levels. For additional bias plots on all three true preference curves, the reader can refer to the appendix.

In summary, based on the simulation studies with the U-shaped true preference and the two other true preference curves, we see that structured priors decrease absolute bias for posterior MRP estimates relative to the classical specification of priors in MRP, regardless of how representative the survey data are of the population of interest. This implies that posterior MRP estimates coming from structured priors are more robust to differential nonresponse and biased sampling than those coming from the classical priors used in MRP. The main goal of this paper is to argue that structured priors offer an improvement to MRP, even in extremely non-representative data regimes. We indeed see this in the simulation studies: large decreases in absolute bias occur when the probability of sampling age categories 9–12 is 0.05 or 0.82. A secondary benefit of structured priors is variance reduction in the posterior estimates of the structured covariates.

The benefits of structured priors in this case are apparent even when sample size is 1000, but to a lesser degree than when sample size is 100 or 500. The absolute bias reduction and posterior standard deviation reduction due to structured priors when n = 1000 are a few percentage points for the majority of under/over-represented age categories, based on all three true preferences. In contrast to this, the absolute bias reduction and posterior standard deviation reduction due to structured priors when n = 100 are around 10 percent for the majority of under/over-represented age categories. For results on this n = 1000 simulation, refer to Tables 3 and 4 with the configuration n = 1000 for all three true preference shapes.

The three prior specifications start to converge to the same posterior bias and variance as sample size increases, because the likelihood dominates the prior distributions imposed. Regardless, with a sample size of 1000 and 12 age categories, it appears that structured priors are still able to reduce posterior variance and absolute bias in the nonrepresentative data regimes, where elderly individuals are oversampled or undersampled.

The efficacy of structured priors in this section was also evaluated based on the proportion of the time they outperformed baseline priors in bias and variance. The calculated metric, the share of simulations where the posterior median of structured priors outperformed the posterior median of baseline priors, is shown in figures referenced by the Improvement Proportion column of Table 3. A similar metric, the share of simulations where posterior variance of structured priors outperformed posterior variance of baseline priors, is referenced by the Improvement Proportion column of Table 4. In almost all cases, there was a strong improvement over the baseline priors more than half the time. This improvement was uniform across simulation scenarios, except for some simulations with sample size n = 100 and severe undersampling of the highest age groups.
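One plausible reading of the bias version of this metric is sketched below: for a single age category, count the simulation runs in which the structured-prior posterior median lands closer to the true preference than the baseline posterior median. The function name, the use of a strict inequality for ties, and the toy numbers are assumptions made for illustration.

```python
def improvement_proportion(structured_est, baseline_est, truth):
    """Share of runs where the structured-prior estimate is closer to truth.

    structured_est, baseline_est: per-run posterior medians (same length).
    truth: the true preference for the age category being compared.
    """
    if len(structured_est) != len(baseline_est) or not structured_est:
        raise ValueError("need one estimate per run for both specifications")
    wins = sum(
        abs(s - truth) < abs(b - truth)   # ties do not count as improvement
        for s, b in zip(structured_est, baseline_est)
    )
    return wins / len(structured_est)


# Made-up medians from three runs, with true preference 0.5.
share = improvement_proportion([0.5, 0.6, 0.9], [0.7, 0.55, 0.95], 0.5)
```

A value above one half means the structured prior was closer to the truth in most runs, which matches the threshold used in the discussion above.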

When there is complex and structured variation of the true preference across the age categories, it is important for the model not to suppress this variation. Standard MRP estimates assume exchangeability between age categories and therefore shrink cell estimates towards a global regression model in the absence of data. Structured priors for the age category covariate, while not perfectly recovering the true preference probabilities in the presence of limited data, capture much more of the variation in preference across age categories. This results in a better quantification of the variation across small areas.

Structured priors start to show beneficial effects on posterior MRP estimates when the number of categories for the structured covariate of interest is sufficiently large. What counts as "sufficiently large" is problem-dependent, as every structured prior will differ depending on the covariates of the data set. Furthermore, there are multiple structured priors one can choose from for a covariate, a choice we do not address here. We previously ran the same set of experiments in this results section for 3 and 6 age categories and did not observe a significant difference in posterior estimates across the three prior specifications. In our simulation studies, the beneficial effects of structured priors become obvious at 12 or more age categories.

4.2. Undirected structured priors example: Spatial MRP

For the second simulation example, we apply MRP to spatial data. More specifically, we use binomial regression to obtain subpopulation-level estimates for all 52 Public Use Microdata Areas (PUMA) in the state of Massachusetts. We simulate a nonrepresentative survey by purposefully oversampling certain spatial areas.

In this section, we will fit the models using the R-INLA package (Rue et al., 2009, 2017; Seppä et al., 2019), a fast approximate Bayesian inference package catered towards spatial modelling.

We will use penalized complexity (PC) prior specifications (Simpson et al., 2017) for hyperpriors in the spatial MRP model. The response variable for an observation is the count t_i, and the total number of possible occurrences is T_i. Let n be the number of observations in the binomial response data set, and let i = 1, … , n index the counts. We define the following model:

$$
\begin{aligned}
t_{i} &\sim \text{Binomial}(μ_{i}, T_{i}) \\
μ_{i} &= \text{logit}^{-1}\left(β^{0} + α_{j[i]}^{PUMA} + α_{j[i]}^{Education} + α_{j[i]}^{Ethnicity}\right) \\
α_{j}^{Education} \mid σ^{Education} &\sim N\left(0, (σ^{Education})^{2}\right), \quad \text{for } j = 1, \dots, 6 \\
α_{j}^{Ethnicity} \mid σ^{Ethnicity} &\sim N\left(0, (σ^{Ethnicity})^{2}\right), \quad \text{for } j = 1, \dots, 6 \\
σ^{Education}, σ^{Ethnicity} &\sim \text{PC-Prior}(1, 0.1) \\
β^{0} &\sim N(0, 1)
\end{aligned}
\tag{4.6}
$$
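To make the likelihood part of Equation 4.6 concrete, the following sketch evaluates the inverse-logit linear predictor and draws binomial counts. It is an illustration only: the effect values are arbitrary placeholders rather than draws from the priors in the paper, the indices are zero-based, and all function and variable names are assumptions introduced here.

```python
import numpy as np

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def success_probability(beta0, a_puma, a_edu, a_eth, puma, edu, eth):
    """mu_i = logit^{-1}(beta0 + PUMA effect + education effect + ethnicity effect)."""
    return inv_logit(beta0 + a_puma[puma] + a_edu[edu] + a_eth[eth])

rng = np.random.default_rng(1)
n = 5  # number of binomial observations in this toy example

# Placeholder group effects (NOT the BYM2 or PC-prior draws from the paper).
a_puma = rng.normal(0.0, 0.5, size=52)
a_edu = rng.normal(0.0, 0.5, size=6)
a_eth = rng.normal(0.0, 0.5, size=6)

# Zero-based group membership j[i] for each observation i.
puma = rng.integers(0, 52, size=n)
edu = rng.integers(0, 6, size=n)
eth = rng.integers(0, 6, size=n)

T = rng.integers(1, 20, size=n)  # total possible occurrences T_i
mu = success_probability(0.0, a_puma, a_edu, a_eth, puma, edu, eth)
t = rng.binomial(T, mu)          # counts t_i with T_i trials and probability mu_i
```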

The poststratified estimand based on Equation 4.6 for a subpopulation group S ⊆ {1, … , J} is then $\sum_{j \in S} \frac{N_{j}}{N} μ_{j}$ .
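The poststratified estimand is a population-weighted average of cell-level probabilities, which the sketch below computes. It assumes that N denotes the total population count of the cells in S, so that the weights N_j / N sum to one over the subgroup. That reading, the zero-based cell indices, and the names used are assumptions made for illustration.

```python
import numpy as np

def poststratify(mu, cell_counts, subgroup):
    """Population-weighted average of cell probabilities mu_j over the cells in `subgroup`.

    mu          : cell-level probabilities mu_j
    cell_counts : population counts N_j for every poststratification cell
    subgroup    : iterable of zero-based cell indices defining S
    """
    mu = np.asarray(mu, dtype=float)
    cell_counts = np.asarray(cell_counts, dtype=float)
    idx = np.asarray(sorted(subgroup), dtype=int)
    total = cell_counts[idx].sum()  # N, assumed here to be the population of S
    if total <= 0:
        raise ValueError("subgroup must contain a positive population count")
    return float(np.sum(cell_counts[idx] * mu[idx]) / total)

# Hypothetical cell probabilities and census counts (illustrative only).
mu = [0.2, 0.5, 0.8, 0.4]
counts = [100, 300, 100, 500]
estimate = poststratify(mu, counts, {0, 1, 2})
```

In a full Bayesian workflow this function would be applied to each posterior draw of the cell probabilities, giving a posterior distribution for the subgroup estimate.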

The BYM2 specification models the prior spatial structure of PUMA effect, $α_{j}^{PUMA}$ , as a BYM2 model (Morris et al., 2019; Riebler et al., 2016) specified in Equation 4.7. Let A_j be the set of first-degree neighbours for PUMA j. Let d_j be the cardinality of A_j.

The BYM2 spatial prior for PUMA is defined as:

$$
\begin{aligned}
α_{j}^{PUMA} &= \frac{1}{\sqrt{τ}} \left(\sqrt{1 - ρ}\, θ_{j}^{*} + \sqrt{\frac{ρ}{s}}\, ϕ_{j}^{*}\right) \\
ϕ_{j}^{*} &\sim N\left(\frac{\sum_{k \sim A_{j}} ϕ_{k}^{*}}{d_{j}}, \frac{1}{d_{j}}\right) \\
θ^{*} &\sim N(0, I_{52}),
\end{aligned}
\tag{4.7}
$$

where we account for the multiple connected components using the method described in Freni-Sterrantino et al. (2018).

From Equation 4.7, we see that ϕ* follows an ICAR prior and θ* follows an isotropic multivariate normal prior. The scaling factor s is computed so that the variance of $\sqrt{\frac{ρ}{τ s}} ϕ_{j}^{*}$ is approximately equal to 1 for all PUMA j. Additionally, we impose the constraint $\sum_{j = 1}^{52} ϕ_{j}^{*} = 0$, since the joint distribution of ϕ* is otherwise non-identifiable. We specify τ ~ PC-Prior(1, 0.1) (Simpson et al., 2017), under which the precision τ has density function $\frac{- \log (0.1)}{2} τ^{- 3 ∕ 2} e^{\frac{\log (0.1)}{τ^{1 ∕ 2}}}$, and we use the default BYM2 hyperprior specification in INLA for the mixing parameter ρ.
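The role of the scaling factor s can be illustrated with a small numerical sketch. It rests on several assumptions that are not taken from this paper: the graph is a single connected component (the paper's data have multiple components), s is taken to be the geometric mean of the marginal variances of the ICAR component under the sum-to-zero constraint, which is the convention we understand Riebler et al. (2016) to use, and all function names and the 4-area toy graph are hypothetical.

```python
import numpy as np

def icar_precision(adjacency):
    """ICAR precision matrix Q = D - W for a symmetric 0/1 adjacency matrix W."""
    W = np.asarray(adjacency, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def bym2_scaling_factor(adjacency):
    """Geometric mean of the marginal ICAR variances (assumed convention for s)."""
    Q = icar_precision(adjacency)
    # For a connected graph, the pseudo-inverse of Q is the covariance of the
    # ICAR field under the sum-to-zero constraint.
    cov = np.linalg.pinv(Q)
    return float(np.exp(np.mean(np.log(np.diag(cov)))))

def bym2_effect(theta, phi, tau, rho, s):
    """Combine the unstructured (theta) and spatial (phi) parts as in Equation 4.7."""
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return (np.sqrt(1.0 - rho) * theta + np.sqrt(rho / s) * phi) / np.sqrt(tau)

# Toy example: four areas on a line, each neighbouring the next.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
s = bym2_scaling_factor(adjacency)
# After dividing by s, the marginal variances have geometric mean one.
scaled_var = np.diag(np.linalg.pinv(icar_precision(adjacency))) / s
```

With this scaling, 1/τ can be read as the overall marginal variance of the combined effect and ρ as the share attributed to the spatial component, which is what makes a single prior on τ meaningful across different neighbourhood graphs.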

The IID specification models the prior structure of PUMA effect, α^PUMA, as an isotropic multivariate normal distribution with precision τ ~ PC-Prior(1, 0.1).
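The PC-Prior(1, 0.1) placed on the precision τ can be checked numerically. The sketch below reads PC-Prior(1, 0.1) as the statement P(σ > 1) = 0.1 for the standard deviation σ = τ^{-1/2}, which is equivalent to an exponential prior on σ with rate −log(0.1) and matches the density stated for τ. The function names and the Monte Carlo check are illustrative assumptions, not part of the INLA interface.

```python
import numpy as np

U, ALPHA = 1.0, 0.1        # PC-Prior(U, ALPHA): P(sigma > U) = ALPHA
LAM = -np.log(ALPHA) / U   # rate of the implied exponential prior on sigma

def pc_prior_precision_density(tau):
    """Density of the precision tau: (LAM / 2) * tau^(-3/2) * exp(-LAM / sqrt(tau))."""
    tau = np.asarray(tau, dtype=float)
    return 0.5 * LAM * tau ** -1.5 * np.exp(-LAM / np.sqrt(tau))

def tail_probability_sd(u=U):
    """P(sigma > u) under the exponential prior on sigma."""
    return float(np.exp(-LAM * u))

# Monte Carlo check: draw sigma from its exponential prior and count exceedances of U.
rng = np.random.default_rng(0)
sigma = rng.exponential(scale=1.0 / LAM, size=200_000)
empirical_tail = float(np.mean(sigma > U))
```

The prior therefore places 90 percent of its mass on standard deviations below 1 on the logit scale, shrinking towards the simpler model with no group-level variation.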

The IID specification does not allow for any borrowing of information. However, the BYM2 prior specification allows the posterior estimate of true preference for every PUMA to borrow information from its first-degree neighbours, where the amount of information borrowed is controlled by the posterior of the mixing hyperparameter ρ. This borrowing of information from first-degree neighbours aims to capture the smooth spatial structure of true preference across the 52 PUMA in Massachusetts, if there indeed is one.