Author manuscript; available in PMC: 2021 Sep 3.
Published in final edited form as: J Off Stat. 2021 Mar 12;37(1):71–95. doi: 10.2478/jos-2021-0004

Weighted Dirichlet Process Mixture Models to Accommodate Complex Sample Designs for Linear and Quantile Regression

Michael R Elliott 1, Xi Xia 2
PMCID: PMC8415180  NIHMSID: NIHMS1720820  PMID: 34483435

Abstract

Standard randomization-based inference conditions on the data in the population and makes inference with respect to the repeated sampling properties of the sampling indicators. In some settings these estimators can be quite unstable; Bayesian model-based approaches focus on the posterior predictive distribution of population quantities, potentially providing a better balance between bias correction and efficiency. Previous work in this area has focused on estimation of means and linear and generalized linear regression parameters; these methods do not allow for a general estimation of distributional functions such as quantile or quantile regression parameters. Here we adapt an extended Dirichlet Process Mixture model that allows the DP prior to be a mixture of DP random basis measures that are a function of covariates. These models allow many mixture components when necessary to accommodate the sample design, but can shrink to few components for more efficient estimation when the data allow. We provide an application to the estimation of relationships between serum dioxin levels and age in the US population, either at the mean level (via linear regression) or across the dioxin distribution (via quantile regression) using the National Health and Nutrition Examination Survey.

Keywords: Sampling weights, Bayesian finite population inference, posterior predictive distribution, dioxin, NHANES

1. Introduction

Many population surveys use complex probability sample designs with unequal selection probabilities, clustering, and stratification. Standard “design-based” approaches to analyzing such data use randomization inference, treating population values as fixed, sampling indicators as random, and focusing on developing estimators that are at least approximately unbiased with respect to the repeated sampling distribution. While the design-based approach does not make distributional assumptions, it can be very inefficient under certain scenarios, such as when the sample size is small, the weights are highly variable, and/or the relationship between the quantity of interest and the probability of selection is weak. For example, if one is interested in a population-level linear regression parameter, if the model is correctly specified and the errors are homoscedastic, incorporating sampling weights in estimation is unnecessary for bias correction and will likely only inflate variance. However, misspecified models or designs with non-ignorable inclusion mechanisms can lead to settings where weights are required for bias correction.

An alternative approach uses Bayesian finite population inference, a model-based method that assumes a model for the observed data. The unobserved elements of the population are treated as missing data, and posterior predictive distributions of the population are generated by repeatedly imputing the unobserved elements of the population using draws from the posterior distribution of the parameters governing the data model (Ericson 1969; Rubin 1983; Fienberg 2011; Little 2012). Work by Zheng and Little (2003); Chen et al. (2010); and Chen et al. (2012) model an outcome of interest as a flexible function of the probability of selection, and develop consistent and efficient estimators of means and quantiles – descriptive statistics of a scalar variable. However, accommodating design elements, particularly sampling weights, is more complex when regression parameters themselves are of interest. One method to accommodate weights in a linear or generalized linear regression model setting is to create dummy variables stratified by equal or approximately equal case weights and include indicators for these weight strata and interactions between the covariates of interest and the weight strata in the regression model (Elliott 2007). Inference is based on the posterior predictive distribution of the population level regression model of interest, which no longer needs to explicitly include the weight interactions. This suggests a more general approach of developing models that retain the level of structure needed to incorporate the design features when necessary, but to default back to simpler models that are more efficient if the data suggest that they are not needed. More generally, the overarching goal of this work is to use modeling to balance bias-variance tradeoffs inherent in design-adjusted estimates of population parameters.

Here we consider an approach that does not directly incorporate design features into the model, but rather develops a very robust model that is relatively immune to substantial model misspecification, so that if a complex model is required to capture the key points of inference for the population, it is available, but if a simpler model is adequate, it will be used. This manifests in inference typically as a sort of bias-variance tradeoff, so that the complex model will yield something approximating a fully-weighted estimator, and the simpler model will yield something approximating an unweighted estimator. Specifically, we investigate use of Dirichlet Process Mixture (DPM) models (Blackwell and MacQueen 1973; MacEachern 1994) which loosen the assumption of a pre-determined number of mixture components, and provide a convenient mechanism to add or remove components from the model. Si et al. (2015) used a DPM in the estimation of population means, finding substantial reduction in mean square error and credible interval length over standard design-based estimators. Here we consider regression settings, which, as we note below, require an extension of the DPM to accommodate weights in estimation. In particular, we consider a weighted Dirichlet Process Mixture (WDPM) model proposed by Dunson et al. (2007), which adds extra flexibility by assigning weights to locations in the population domain, relaxing implicit linearity in the regression setting. (We briefly note that “weighted” in WDPM should not be confused with the sampling design weight – the former are estimated as a flexible extension of a DPM model, whereas we treat the latter as fixed elements of the population to generate population predictive distributions under the WDPM model.)

We focus on the estimation of linear and quantile regression parameters. While quantile regression is commonly used with population survey data, methodological exploration in the complex sample setting has been somewhat limited, and little if any work has explored the effect of highly variable sampling weights in the quantile regression setting. Use of WDPM in the quantile regression context is particularly appealing given its ability to accommodate a wide variety of continuous distributions.

In this study we extend the WDPM model to estimate quantities of interest from complex survey design data, in order to build data-driven inference that captures a wide variety of normal and non-normal distributions in a fashion that takes account of unequal probabilities of selection, but also offers increased efficiency when data permit. Because the WDPM models are highly flexible and can generate predictive distributions that are accurate in tails of the distribution, they are a natural choice for model-based methods to obtain population quantile regression estimates as well. We consider an application to the analysis of the association between blood dioxin level and age, using data from the National Health and Nutrition Examination Survey (NHANES) (CDC 2015).

This article is organized as follows. In Section 2 we review the theory of Bayesian finite population inference, quantile regression, Dirichlet Process Mixture models, and the weighted Dirichlet Process Mixture model. Section 3 extends the WDPM model to incorporate survey weights in the draws from the posterior predictive distribution. Section 4 provides a simulation study, and compares bias, coverage and RMSE of the proposed method with standard methods, under both linear model and quantile regression settings. Section 5 applies the method to estimate associations between blood level dioxin and age using data from NHANES. Section 6 provides a summary discussion.

2. Background Methodology

2.1. Bayesian Finite Population Inference

To review Bayesian finite population inference, we denote the sample design variables by Z, and the population data Y is modeled as Y ~ f(Y | θ, Z). Design variables can include case weights, cluster indicators, or stratum indicators, although in this article we will focus on weights. The distribution f could be either highly parametric, with a low-dimensional θ, or semi-parametric or “non-parametric” with a high-dimensional θ. Let N be the number of elements in the population, Yobs consist of the n observed data elements, and Ynob consist of the N − n unobserved cases in the population. (Note that, in this general description, Y is multivariate and includes both regression outcomes, elsewhere denoted traditionally by a scalar Y, and regression predictors, denoted by multivariate X.) Considering Ynob as missing data, its posterior predictive distribution is given by:

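$$
p(Y_{nob} \mid Y_{obs}, I, Z) = \frac{\iint p(Y_{nob} \mid Y_{obs}, Z, \theta, \phi)\, p(I \mid Y, Z, \theta, \phi)\, p(Y_{obs} \mid Z, \theta)\, p(\theta, \phi)\, d\theta\, d\phi}{\iiint p(Y_{nob} \mid Y_{obs}, Z, \theta, \phi)\, p(I \mid Y, Z, \theta, \phi)\, p(Y_{obs} \mid Z, \theta)\, p(\theta, \phi)\, d\theta\, d\phi\, dY_{nob}} \tag{1}
$$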

where ϕ models the inclusion indicator I. If ϕ and θ have independent priors, and the sampling design is ignorable, that is, I does not depend on Ynob given Yobs and Z, the formula of predictive posterior distribution reduces to

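$$
p(Y_{nob} \mid Y_{obs}, Z) = \frac{\int p(Y_{nob} \mid Y_{obs}, Z, \theta)\, p(Y_{obs} \mid Z, \theta)\, p(\theta)\, d\theta}{\iint p(Y_{nob} \mid Y_{obs}, Z, \theta)\, p(Y_{obs} \mid Z, \theta)\, p(\theta)\, d\theta\, dY_{nob}}
$$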

allowing inference about Q(Y) to be made without explicitly modeling the sampling inclusion indicator I (Ericson 1969; Holt and Smith 1979; Little 1993; Rubin 1987; Skinner et al. 1989). This approach can be extended to develop inference on a function Q(Y) of the population data, by repeatedly obtaining draws from p(Ynob | Yobs, Z) and computing Q(Y) = Q(Yobs, Ynob) to thus obtain a draw from p(Q(Y) | Yobs, Z).

To move this review from the abstract to the concrete, consider data obtained from a (possibly disproportionately) stratified sampling design. Here Z would identify sampling strata, say z = 1, …, H, with known population sizes $N_z$, and samples of size $n_z$ with associated arithmetic sample means $\bar{y}_z$ for a single scalar variable. Consider a simple model of the form

$$
y_{zi} \mid \mu_z \sim N(\mu_z, \sigma^2), \quad i = 1, \ldots, n_z
$$
$$
p(\mu_z) \propto 1
$$

where for illustration purposes $\sigma^2$ is assumed known. Assume that the target of inference Q(Y) is the population mean $\bar{Y} = N^{-1}\sum_z \left[(N_z - n_z)\bar{Y}_{nob,z} + n_z\bar{y}_z\right]$, where $\bar{Y}_{nob,z} = (N_z - n_z)^{-1}\sum_{i=1}^{N_z - n_z} Y_{nob,z,i}$ is the mean of the non-sampled elements and $N = \sum_z N_z$. Since $p(Y_{nob,z} \mid y, Z) = \int p(Y_{nob,z} \mid \mu, y, Z)\, p(\mu \mid y, Z)\, d\mu = \int p(Y_{nob,z} \mid \mu_z, y_z)\, p(\mu_z \mid y_z)\, d\mu_z$ (where the second equality follows from prior independence across the strata), we could obtain draws from the posterior distribution of $\bar{Y}$ in three steps:

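  1. Draw $\mu_z^{rep} \mid y, Z$ from its posterior $N(\bar{y}_z, \sigma^2/n_z)$

  2. Draw $N_z - n_z$ values of $Y_{nob,z,i}^{rep}$ from $N(\mu_z^{rep}, \sigma^2)$ for each of the z = 1, …, H strata

  3. A draw from the posterior of $\bar{Y}$ is then obtained by computing

$$
\bar{Y}^{rep(1)} = N^{-1}\sum_z \left[(N_z - n_z)\bar{Y}_{nob,z}^{rep} + n_z\bar{y}_z\right].
$$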

Note that, when the population is large, generating an entire population may be highly time consuming. Hence we show that the exact draw $\bar{Y}^{rep(1)}$ is well approximated, where sample sizes and population sizes are large, by $\bar{Y}^{rep(2)}$, obtained as follows. For each of the $n_z$ observations in each of the z = 1, …, H strata:

  1. Draw $\mu_z^{rep} \mid y, Z$ from its posterior $N(\bar{y}_z, \sigma^2/n_z)$

  2. Draw $Y_{nob,z,i}^{rep}$ from $N(\mu_z^{rep}, \sigma^2)$

  3. Repeat 1) and 2) and compute

$$
\bar{Y}^{rep(2)} = N^{-1}\sum_z \left[\sum_{i=1}^{n_z}(w_z - 1)\,Y_{nob,z,i}^{rep} + n_z\bar{y}_z\right],
$$

where $w_z = N_z/n_z$.

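Conceptually, $\bar{Y}^{rep(2)}$ can be seen as obtaining posterior predictive draws of the population under the model and then “expanding” them to create a synthetic population by duplicating them by the number of unobserved elements they represent. To see the equivalence mathematically, note that the posterior distributions of $\bar{Y}^{rep(1)}$ and $\bar{Y}^{rep(2)}$ will both be normally distributed, with means equal to $N^{-1}\sum_z N_z\bar{y}_z$:

$$
\begin{aligned}
E(\bar{Y}^{rep(1)} \mid y, Z) &= N^{-1}\sum_z \left[(N_z - n_z)\,E(\bar{Y}_{nob,z}^{rep} \mid y, Z) + n_z\bar{y}_z\right] \\
&= N^{-1}\sum_z \left[(N_z - n_z)\,E\!\left(E(\bar{Y}_{nob,z}^{rep} \mid \mu_z, y, Z)\right) + n_z\bar{y}_z\right] \\
&= N^{-1}\sum_z \left[(N_z - n_z)\,E(\mu_z \mid y, Z) + n_z\bar{y}_z\right] \\
&= N^{-1}\sum_z \left[(N_z - n_z)\bar{y}_z + n_z\bar{y}_z\right] = N^{-1}\sum_z N_z\bar{y}_z
\end{aligned}
$$

and similarly

$$
\begin{aligned}
E(\bar{Y}^{rep(2)} \mid y, Z) &= N^{-1}\sum_z \left[\sum_{i=1}^{n_z}(w_z - 1)\,E(Y_{nob,z,i}^{rep} \mid y, Z) + n_z\bar{y}_z\right] \\
&= N^{-1}\sum_z \left[\frac{N_z - n_z}{n_z}\sum_{i=1}^{n_z}\bar{y}_z + n_z\bar{y}_z\right] \\
&= N^{-1}\sum_z \left[(N_z - n_z)\bar{y}_z + n_z\bar{y}_z\right] = N^{-1}\sum_z N_z\bar{y}_z.
\end{aligned}
$$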

The posterior sampling variances are given by

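$$
\begin{aligned}
V(\bar{Y}^{rep(1)} \mid y, Z) &= N^{-2}\,V\!\left(\sum_z\sum_{i=1}^{N_z - n_z} Y_{nob,z,i}^{rep} \,\Big|\, y, Z\right) \\
&= N^{-2}\left[E\!\left(V\!\left(\sum_z\sum_{i=1}^{N_z - n_z} Y_{nob,z,i}^{rep} \,\Big|\, \mu_z, y, Z\right)\right) + V\!\left(E\!\left(\sum_z\sum_{i=1}^{N_z - n_z} Y_{nob,z,i}^{rep} \,\Big|\, \mu_z, y, Z\right)\right)\right] \\
&= N^{-2}\left[E\!\left(\sum_z (N_z - n_z)\sigma^2\right) + \sum_z (N_z - n_z)^2\, V(\mu_z \mid y, Z)\right] \\
&= N^{-2}\left[\sum_z (N_z - n_z)\sigma^2 + \sum_z \frac{(N_z - n_z)^2}{n_z}\sigma^2\right] \\
&= \sigma^2 N^{-2}\sum_z (N_z - n_z)\left(\frac{N_z - n_z}{n_z} + 1\right) \approx \sigma^2 N^{-2}\sum_z \frac{(N_z - n_z)^2}{n_z}
\end{aligned}
$$

and

$$
\begin{aligned}
V(\bar{Y}^{rep(2)} \mid y, Z) &= N^{-2}\sum_z\sum_{i=1}^{n_z}(w_z - 1)^2\, V(Y_{nob,z,i}^{rep} \mid y, Z) \\
&= N^{-2}\sum_z (w_z - 1)^2 n_z\left(E\!\left(V(Y_{nob,z,i}^{rep} \mid \mu_z, y, Z)\right) + V\!\left(E(Y_{nob,z,i}^{rep} \mid \mu_z, y, Z)\right)\right) \\
&= N^{-2}\sum_z \left[(w_z - 1)^2 \times n_z\sigma^2(1 + 1/n_z)\right] \\
&= \sigma^2 N^{-2}\sum_z \left(\frac{N_z - n_z}{n_z}\right)^2 (n_z + 1) \approx \sigma^2 N^{-2}\sum_z \frac{(N_z - n_z)^2}{n_z}
\end{aligned}
$$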

where the first equality for $V(\bar{Y}^{rep(2)} \mid y, Z)$ follows from the fact that the draws of the common mean $\mu_z^{rep}$ are made independently for each observation. Note that a bit of algebra shows that $V(\bar{Y}^{rep(1)} \mid y, Z) - V(\bar{Y}^{rep(2)} \mid y, Z)$ can be written as $\sigma^2 N^{-2}\sum_z \left[(N_z - n_z) - \left(\frac{N_z - n_z}{n_z}\right)^2\right]$, showing that the variance estimate using the weighted approximation is usually conservative, with the positive bias going to 0 as N and $n_z$ increase, since $\sum_z (N_z - n_z)/N^2 \to 0$ and $\sum_z \left(\frac{N_z - n_z}{N}\right)^2 \frac{1}{n_z^2} \to 0$. We use the approximation $\bar{Y}^{rep(2)}$ throughout the remainder of the manuscript to improve computational efficiency. (Note also that $V(\bar{Y}^{rep(1)} \mid y, Z)$ can be written as $\sigma^2\sum_z P_z^2(1 - f_z)/n_z$ where $P_z = N_z/N$ and $f_z = n_z/N_z$, corresponding to the design-based estimator of a population mean with a finite population correction adjustment.)
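The two sampling schemes and their closed-form variances can be compared numerically. The following is a minimal sketch, not the authors' code: the two-stratum design, the value of σ, and all function names are hypothetical, and σ is treated as known as in the illustration above.

```python
import random
import statistics

def draw_exact(strata, sigma, rng):
    """One draw of Ybar^{rep(1)}: impute every non-sampled unit in every stratum."""
    N = sum(s["N"] for s in strata)
    total = 0.0
    for s in strata:
        mu = rng.gauss(s["ybar"], sigma / s["n"] ** 0.5)      # step 1: mu_z^rep
        total += sum(rng.gauss(mu, sigma) for _ in range(s["N"] - s["n"]))  # step 2
        total += s["n"] * s["ybar"]                           # observed units
    return total / N

def draw_weighted(strata, sigma, rng):
    """One draw of Ybar^{rep(2)}: one prediction per sampled unit, scaled by w_z - 1."""
    N = sum(s["N"] for s in strata)
    total = 0.0
    for s in strata:
        w = s["N"] / s["n"]
        for _ in range(s["n"]):
            mu = rng.gauss(s["ybar"], sigma / s["n"] ** 0.5)  # fresh mu per observation
            total += (w - 1.0) * rng.gauss(mu, sigma)
        total += s["n"] * s["ybar"]
    return total / N

def closed_form_variances(strata, sigma):
    """Exact (pre-approximation) posterior variances of the two draws."""
    N = sum(s["N"] for s in strata)
    v1 = sigma**2 / N**2 * sum((s["N"] - s["n"]) * s["N"] / s["n"] for s in strata)
    v2 = sigma**2 / N**2 * sum(((s["N"] - s["n"]) / s["n"]) ** 2 * (s["n"] + 1)
                               for s in strata)
    return v1, v2

# Hypothetical design: two strata with unequal sampling fractions.
strata = [{"N": 200, "n": 10, "ybar": 1.0}, {"N": 300, "n": 30, "ybar": 3.0}]
sigma = 2.0
rng = random.Random(1)
exact = [draw_exact(strata, sigma, rng) for _ in range(2000)]
approx = [draw_weighted(strata, sigma, rng) for _ in range(2000)]
v1, v2 = closed_form_variances(strata, sigma)
print(statistics.mean(exact), statistics.variance(exact), v1)
print(statistics.mean(approx), statistics.variance(approx), v2)
```

In this small design the two closed-form variances differ only in the fourth decimal place, which illustrates why the cheaper weighted draw is an attractive substitute when generating the full population is costly.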

2.2. Quantile Regression

Quantile regression is a general class of linear models that estimates quantiles of the response variable conditional on covariates. Consider a real valued random variable Y with cumulative distribution function $F_Y(y) = P(Y \le y)$. Then for any τ ∈ [0, 1], the τ-th quantile of Y is defined by:

$$
Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \ge \tau\}
$$

The quantile function provides a complete characterization of the distribution of Y with various values of τ. To solve for the τ-th quantile numerically, we define the piecewise linear loss function

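$$
\rho_\tau(y) = y\,(\tau - I(y < 0))
$$

where I equals one if y < 0 is satisfied, and zero otherwise. The τ-th quantile of Y, namely u, is calculated by minimizing the expected loss $\rho_\tau(Y - u)$:

$$
\min_u E(\rho_\tau(Y - u)) = \min_u \left\{(\tau - 1)\int_{-\infty}^{u}(y - u)\, dF_Y(y) + \tau\int_{u}^{\infty}(y - u)\, dF_Y(y)\right\}.
$$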

Thus

$$
\hat{u} = \arg\min_{u \in \mathbb{R}} E(\rho_\tau(Y - u))
$$

Assuming a random sample of Y, yi, i = 1, …, n, the sample analogue of τth-quantile is attained by solving the following minimization problem:

$$
\hat{u} = \arg\min_{u \in \mathbb{R}} \sum_{i=1}^{n}\rho_\tau(y_i - u)
$$
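As a small check on the loss-function definition, the sample minimization can be carried out directly. This is a minimal sketch with hypothetical function names; it relies on the fact that a convex piecewise-linear objective attains its minimum at one of the data points, and it returns the smallest minimizer when the minimum is not unique.

```python
def check_loss(u, tau):
    """Piecewise linear loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def sample_quantile_by_loss(y, tau):
    """Smallest data point minimizing the summed check loss."""
    return min(sorted(y), key=lambda u: sum(check_loss(yi - u, tau) for yi in y))

y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for tau in (0.25, 0.5, 0.75):
    print(tau, sample_quantile_by_loss(y, tau))
```

For these ten values the minimizers agree with the definition inf{y : F(y) ≥ τ} applied to the empirical distribution function.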

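Now we extend to a regression setting under a linear model assumption. Let $x_i$, i = 1, …, n be a p × 1 vector of regressors. The τ-th conditional quantile function is then given by $Q_{Y_i \mid X_i}(\tau) = X_i^T\beta_\tau$, and one can obtain $\beta_\tau$ by solving:

$$
\hat{\beta}_\tau = \arg\min_{\beta \in \mathbb{R}^p} E(\rho_\tau(Y_i - X_i^T\beta))
$$

The sample analogue:

$$
\hat{\beta}_\tau = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n}\rho_\tau(y_i - x_i^T\beta) \tag{2}
$$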

is usually solved by the simplex method (Murty, 1983).
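For intuition, Equation (2) can be solved by brute force when there is a single covariate. The sketch below is not the simplex method; it substitutes exhaustive enumeration, using the linear-programming property that some optimal line passes through at least two data points. Function names and the toy data are hypothetical, and the approach is only practical for small n.

```python
from itertools import combinations

def check_loss(u, tau):
    """Piecewise linear loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantile_line(x, y, tau):
    """Intercept and slope minimizing the summed check loss over all
    lines through two data points (exhaustive search)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid regression fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - intercept - slope * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Four points on y = 1 + 2x plus one gross outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]
print(quantile_line(x, y, 0.5))
```

The median fit ignores the outlier and recovers the line through the other four points, which is the robustness property that motivates quantile regression.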

Yu and Moyeed (2001) suggest a likelihood form based on the asymmetric Laplace distribution:

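$$
f_\tau(u) = \tau(1 - \tau)\exp\{-\rho_\tau(u)\}
$$

where $\rho_\tau(u)$ has the same form as the loss function stated above. Thus the likelihood function could be written as:

$$
L(y \mid \beta) = \tau^n(1 - \tau)^n\exp\left\{-\sum_i \rho_\tau(y_i - x_i^T\beta)\right\}
$$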

Maximizing L(y | β) with respect to β is therefore equivalent to minimizing the objective function given in Equation (2).

2.3. Dirichlet Process Mixture Model

Assuming IID observations, the finite Gaussian mixture regression model is given by:

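$$
Y_i \mid x_i, C_i = c, \beta_c, \sigma_c^2 \sim N(x_i^T\beta_c, \sigma_c^2), \quad c = 1, \ldots, K
$$
$$
C_i = c \mid \alpha, z_i, x_i \sim \mathrm{MULTI}(1; p_{i1}, \ldots, p_{iK})
$$
$$
\log\left(\frac{p_{ij}}{p_{i1}}\right) = f(\alpha_j, z_i, x_i), \quad j = 2, \ldots, K
$$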

where $C_i$ is the class membership, identifying the latent mixture component to which an observation belongs; U ~ MULTI(1; $p_1$, …, $p_K$) denotes the multinomial distribution for a single trial with P(U = k) = $p_k$; and $f(\alpha_j, z_i, x_i)$ is the function of the multinomial logit parameterized by $\alpha_j$, and can take on a simple parametric form (e.g., linear in $z_i$ and $x_i$) or semi-parametric (e.g., penalized splines). Given a sufficiently large K, a finite Gaussian mixture model can maintain robustness in the presence of regression model misspecification, as well as skewness and overdispersion in the residual error term. Yet when the data permit, fitting a simpler model with a small value of K could lead to increased efficiency.

The finite Gaussian mixture model can be written in a more general form as

$$
f(y_i \mid x_i) = \int N(y_i \mid \phi_i)\, dG_{x_i}(\phi_i)
$$

where ϕi defines subject-level means and variances, and Gxi is multinomial. An alternative approach defines Gx as an element in an uncountable collection of probability measures Gx ~ DP(αG0), where DP denotes a Dirichlet Process (Ferguson 1973) centered at base distribution G0 with precision α. This leads to standard DP mixture models (MacEachern 1994) and avoids explicitly specifying the number of components K in advance.

Expressing the Dirichlet process in “stick-breaking” form leads to:

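$$
G = \sum_{h=1}^{\infty} p_h\,\delta_{\theta_h}, \qquad \frac{p_h}{1 - \sum_{l=1}^{h-1} p_l} \sim \mathrm{BETA}(1, \alpha)
$$

where U ~ BETA(a, b) denotes the beta distribution, with PDF $f(U = u; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}u^{a-1}(1 - u)^{b-1}$ for 0 ≤ u ≤ 1 and a, b > 0; $\delta_\theta$ is degenerate at θ; and $\{\theta_h\}$ are atoms generated from $G_0$. Use of a Polya urn scheme (Blackwell and MacQueen 1973) integrates out the infinite dimensional G and provides an easier form for simulation:

$$
\phi_i \mid \phi^{(i)}, \alpha \sim \frac{\alpha}{\alpha + n - 1}G_0 + \frac{1}{\alpha + n - 1}\sum_{j \ne i}\delta_{\phi_j}.
$$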

That is, a new draw for observation i could share the component of any given existing observation with probability 1/(α + n − 1), or initiate a new draw from base measure G0 with probability α/(α + n − 1). Thus through each cycle of the Gibbs sampler, each observation is either assigned to an existing component or to a newly generated component drawn from the base distribution, with the probability of assignment to a new component governed by the α parameter, where large values of the “precision parameter” α encourage creation of new components and small values suppress new components. Consequently large values of α encourage larger values of K and vice versa; α can be fixed in advance as a tuning parameter, or estimated from the data after assignment of a hyperprior (typically a gamma distribution). Here we consider a normal distribution for G0, so that Y is a continuous outcome.
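The effect of α on the number of components can be seen by simulating the urn scheme sequentially. This is a minimal sketch of the prior clustering behaviour only, not of the Gibbs sampler; the function names, n, and the two α values are hypothetical choices for illustration.

```python
import random

def polya_urn_cluster_count(n, alpha, rng):
    """Seat n observations one at a time; return the number of distinct components."""
    sizes = []                               # current component sizes
    for i in range(n):                       # i observations already assigned
        if rng.random() < alpha / (alpha + i):
            sizes.append(1)                  # new atom drawn from G0
        else:
            r = rng.random() * i             # join a component w.p. proportional to size
            acc = 0
            for h, m in enumerate(sizes):
                acc += m
                if r < acc:
                    sizes[h] += 1
                    break
    return len(sizes)

def expected_clusters(n, alpha):
    """Prior expected number of components: sum of new-component probabilities."""
    return sum(alpha / (alpha + i) for i in range(n))

rng = random.Random(0)
for alpha in (0.1, 10.0):
    sims = [polya_urn_cluster_count(200, alpha, rng) for _ in range(300)]
    print(alpha, sum(sims) / len(sims), expected_clusters(200, alpha))
```

With n = 200, a small α keeps nearly all observations in one or two components, while a large α produces dozens, matching the role of α as a control on K described above.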

The drawback of the standard DP mixture emerges in the regression setting, where a draw from the posterior predictive distribution of $y_i \mid x_i$ is generated from $\frac{\alpha}{\alpha + n}N(y_i \mid x_i, \beta_0, \sigma_0^2) + \sum_{h=1}^{K}\frac{n_h}{\alpha + n}N(y_i \mid x_i, \beta_h, \sigma_h^2)$, where $n_h$ is the number of observations assigned to component h = 1, …, K, and $\beta_0$ and $\sigma_0^2$ are further independent draws from $G_0$. The conditional posterior predictive mean of y for any value of K takes a linear form in x:

$$
E(y_i^{rep} \mid x_i, \beta_0, \ldots, \beta_K, \sigma_0^2, \ldots, \sigma_K^2) = x_i^T\bar{\beta}
$$

where $\bar{\beta} = \frac{\alpha}{\alpha + n}\beta_0 + \sum_{h=1}^{K}\frac{n_h}{\alpha + n}\beta_h$. This restricts the model’s ability to capture non-linear patterns in data, including, for example, interactions with selection probabilities.

2.3.1. Weighted Dirichlet Process Mixture (WDPM) Model

The weighted Dirichlet Process Mixture model is a more flexible extension proposed by Dunson et al. (2007), that allows the DP prior itself to be a mixture of DP random basis measures.

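Assuming $G^*_{x_j} \sim DP(\alpha G_0)$, where j = 1, …, n indexes random basis measures at each distinct covariate value, the actual DP prior is built as a mixture of the $G^*_{x_j}$:

$$
G_x = \sum_{j=1}^{n} b_j(x)\,G^*_{x_j}
$$
$$
b_j(x) = \frac{\gamma_j\exp(-\psi\lVert x - x_j\rVert)}{\sum_{l=1}^{n}\gamma_l\exp(-\psi\lVert x - x_l\rVert)}
$$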

The form of $b_j(x)$ grants high weight in $G_x$ to subjects with $x_j$ closer to x, encouraging clustering of subjects that are near to each other in the covariate space. The parameter γ is designed to add extra “weight” at specific locations where data are sufficiently dense to detect the need for potential additional mixture components. The smoothing parameter ψ is included to control the degree to which $G_x$ loads across multiple draws from DP(αG0). Note that the standard DP mixture model is a special case of the weighted DP mixture model, where $b_j(x) = 1/n$ for all j.
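The behaviour of the kernel weights can be illustrated for a scalar covariate. This is a minimal sketch assuming the weight form γ_j exp(−ψ|x − x_j|) normalized over j; the function name and the numerical values are hypothetical.

```python
import math

def wdpm_weights(x, locations, gamma, psi):
    """Normalized weights b_j(x) for a scalar covariate x."""
    kernels = [g * math.exp(-psi * abs(x - xj)) for g, xj in zip(gamma, locations)]
    total = sum(kernels)
    return [k / total for k in kernels]

locations = [0.0, 2.0, 5.0]
print(wdpm_weights(2.0, locations, [1.0, 1.0, 1.0], psi=1.0))   # nearest location dominates
print(wdpm_weights(2.0, locations, [1.0, 1.0, 1.0], psi=0.0))   # uniform weights 1/n
print(wdpm_weights(2.0, locations, [10.0, 1.0, 1.0], psi=1.0))  # large gamma_1 pulls weight to x_1
```

Under this form, equal γ_j with ψ at zero gives the uniform weights 1/n of the standard DP mixture, while larger ψ concentrates weight on nearby locations.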

To obtain values of $\gamma_j$ and ψ in a data-driven manner, ψ is assigned a truncated lognormal hyperprior, $\psi \sim \log N(\mu_\psi, \sigma_\psi^2)$, ψ ∈ (0, 5). The choice for the weight parameter γ is more subtle, since it must avoid both a single dominating weight and uniformly distributed weights, which would be equivalent to a standard DP mixture model. Here we consider $\gamma_j \sim \mathrm{Gamma}(\alpha_\gamma, \beta_\gamma)$, which favors a few dominant locations.

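To complete the Bayesian specification of the linear regression model for $N(y_i \mid \phi_i)$ with $\phi_i = (\beta_i, \sigma_i^2)$ and $\beta_i = (\beta_{i1}, \ldots, \beta_{ip})$,

$$
\beta_i \mid \beta, \sigma_i^2, \Sigma_\beta \sim N(\beta, \sigma_i^2\Sigma_\beta)
$$
$$
\tau_i = \sigma_i^{-2} \sim \mathrm{Gamma}(a_\tau, b_\tau)
$$
$$
\beta \sim N(\beta_0, V_{\beta_0})
$$
$$
\Sigma_\beta^{-1} \sim \mathrm{Wishart}((\nu_0\Sigma_0)^{-1}, \nu_0)
$$
$$
b_\tau \sim \mathrm{Gamma}(a_0, b_0)
$$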

3. Generating Posterior Predictive Draws from the Finite Population using WDPM

Here we describe how we obtain posterior predictive draws of the quantity of interest, for example, the population regression parameters, using a data augmentation method.

First, using the analytical form of all conditional probabilities outlined in Dunson et al. (2007), we obtain draws from the posterior distribution of the parameters of the WDPM model using a Gibbs sampler (see online Supplementary Material). We then obtain a draw of $y_i^{rep}$ from the posterior predictive distribution of $y_i$ at $x_i$, conditional on a draw of all other parameters, from a normal distribution with mean $\sum_{h=0}^{K} w_{ih}(x_i)\,x_i^T\beta_h$ and variance $\sum_{h=0}^{K} w_{ih}^2(x_i)\,\sigma_h^2 + w_{i0}^2(x_i)\,x_i^T\Sigma_\beta x_i$, where $w_{i0}(x_i) = \sum_{j=1}^{n}\frac{\alpha\,b_j(x_i)}{\alpha + \sum_{l \ne i} I(C_{S_l} = j)}$ and $w_{ih}(x_i) = \frac{b_{C_h}(x_i)\sum_{m \ne i} I(S_m = h)}{\alpha + \sum_{l \ne i} I(C_{S_l} = C_h)}$ for S = ($S_1$, …, $S_n$) mapping n subjects into K distinct clusters and C = ($C_1$, …, $C_K$) denoting the K clusters themselves; thus for n = 4 and K = 2, S = (1,1,2,2) indicates that the first two observations are assigned to cluster $C_1$ and the second two observations to cluster $C_2$.

The next step changes the focus from the WDPM model to the finite population model of interest. First, we consider the linear regression model $y_i \sim N(x_i^T\beta, \sigma^2)$, so that the target population quantity of interest is the value B such that U(B) = 0 for

$$
U(\beta) = \sum_{i=1}^{N}\frac{\partial}{\partial\beta}\log f(y_i; \beta) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(y_i - x_i^T\beta)\,x_i,
$$

or $B = \left(\sum_{i=1}^{N} x_i x_i^T\right)^{-1}\left(\sum_{i=1}^{N} x_i y_i\right)$. An approximate draw from the posterior predictive distribution of B is obtained as

$$
B^{rep} = (X^T W^* X)^{-1}X^T\left[(W^* - I_n)\,y^{rep} + y\right] \tag{3}
$$

where $W^*$ is an n × n diagonal matrix of the sampling weights $w_i^*$ and $y^{rep}$ is the vector of draws $(y_1^{rep}, \ldots, y_n^{rep})$. If the sampling fraction is trivial, an approximate draw can be obtained simply using the predicted values: $B^{rep} = (X^T W^* X)^{-1}X^T W^* y^{rep}$.
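For a single covariate with intercept, the draw in Equation (3) reduces to 2 × 2 normal equations. The sketch below assumes Equation (3) is read as applying weight $w_i^* - 1$ to each predicted value and weight 1 to each observed value, by analogy with the stratified-mean approximation of Section 2.1; the function name and toy data are hypothetical.

```python
def population_slope_draw(x, y, y_rep, w):
    """(X'WX)^{-1} X'[(W - I) y_rep + y] with X = (1, x); returns (intercept, slope)."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Combined outcome: w_i - 1 copies of the prediction plus the observed value.
    t = [(wi - 1.0) * yr + yi for wi, yr, yi in zip(w, y_rep, y)]
    s_t = sum(t)
    s_xt = sum(xi * ti for xi, ti in zip(x, t))
    det = s_w * s_wxx - s_wx ** 2
    intercept = (s_wxx * s_t - s_wx * s_xt) / det
    slope = (s_w * s_xt - s_wx * s_t) / det
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]            # observed values on the line 1 + 2x
w = [1.0, 2.0, 5.0, 10.0]           # hypothetical sampling weights
print(population_slope_draw(x, y, y, w))                       # predictions equal to y
print(population_slope_draw(x, y, [9.0, 9.0, 9.0, 9.0], [1.0] * 4))  # census: predictions unused
```

When every weight equals one (a census), the predicted values drop out and the draw is the ordinary least-squares fit to the observed data, as the finite population correction requires.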

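For the quantile regression model, our population target is $B_\tau$ such that $U(B_\tau) = 0$ for

$$
U(\beta_\tau) = \sum_{i=1}^{N}\frac{\partial}{\partial\beta}\log f_\tau(y_i; \beta), \quad \text{that is,} \quad B_\tau = \arg\min_{\beta \in \mathbb{R}^p}\sum_{i=1}^{N}\rho_\tau(y_i - x_i^T\beta),
$$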

and an approximate draw from the posterior predictive distribution of Bτ is obtained as

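$$
\arg\min_{\beta \in \mathbb{R}^p}\left(\sum_{i=1}^{n}(w_i - 1)\,\rho_\tau(y_i^{rep} - x_i^T\beta) + \sum_{i=1}^{n}\rho_\tau(y_i - x_i^T\beta)\right)
$$

or, if ignoring finite population corrections, from $\arg\min_{\beta \in \mathbb{R}^p}\sum_{i=1}^{n} w_i\,\rho_\tau(y_i^{rep} - x_i^T\beta)$.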
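To see how the weights enter the simplified objective, consider the intercept-only case, where the minimizer is a weighted quantile of the predicted values. This is a minimal sketch under that simplification; the function names and toy values are hypothetical, and the regression case would minimize the same weighted loss over a coefficient vector instead of a scalar.

```python
def check_loss(u, tau):
    """Piecewise linear loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def weighted_quantile(y_rep, w, tau):
    """Smallest value minimizing sum_i w_i * rho_tau(y_rep_i - u)."""
    def loss(u):
        return sum(wi * check_loss(yi - u, tau) for wi, yi in zip(w, y_rep))
    return min(sorted(set(y_rep)), key=loss)

y_rep = [1.0, 2.0, 3.0, 4.0]
print(weighted_quantile(y_rep, [1.0, 1.0, 1.0, 1.0], 0.5))  # equal weights
print(weighted_quantile(y_rep, [1.0, 1.0, 1.0, 7.0], 0.5))  # last unit represents 7 units
```

A unit with a large weight represents many population elements, so the weighted median moves toward its predicted value.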

Efficiency is gained when K is small, approximating unweighted linear prediction. To see why this is the case, consider the linear regression setting. If K = 1, then the prediction is exactly linear, with $y^{rep} = X\beta^{rep} + e$, $e \sim N(0, \sigma^2 I_n)$. Then $B^{rep} = (X^T W^* X)^{-1}X^T W^*(X\beta^{rep} + e) = \beta^{rep} + (X^T W^* X)^{-1}X^T W^* e$, which is a draw of β independent of the weights plus a weighted average error term with mean 0. Large values of K can accommodate nonlinearities that lead to bias if the survey weights are ignored, at some cost to variance, especially if the weights are highly variable. Hence a data-driven bias-variance tradeoff is induced.

4. Simulation Study

In this section we evaluate the application of the Weighted Dirichlet Process Mixture model in complex survey design in two scenarios: ordinary linear regression and quantile regression. For each setting, the target of interest is the population slope. The competing methods are the unweighted estimator, the fully-weighted estimator, and a “weight trimming” estimator (Potter 1990) with extreme weights trimmed at three standard deviations of mean weight. For the linear regression model, we also consider a semiparametric spline model as a more direct competitor to the WDPM model. Bias, relative root mean square error (RMSE), and coverage of 95% confidence or credible intervals are calculated to assess the performance of the estimators. R code for generating the population and samples, and for fitting the WDPM model, is available at https://github.com/mrelliot/WPDM.

4.1. Weighted Dirichlet Mixture Model for Ordinary Linear Regression

For the linear regression setting, a population of N = 20,000 is generated. The predictor X is uniformly distributed on the interval from 0 to 10. The response variable Y is created from a linear spline function of X, with knots at integer values. Three sets of coefficients are considered to represent different associations of Y and X, including convex and concave curves, and a linear slope:

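$$
Y_i \mid X_i, \beta, \sigma^2 \sim N\left(\sum_{h=0}^{9}\beta_h(x_i - h)_+, \sigma^2\right)
$$
$$
X_i \sim \mathrm{UNI}(0, 10), \quad i = 1, \ldots, N = 20{,}000
$$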
$$
\beta_a = c(0, 0, 0, 0, .5, .5, 1, 1, 2, 2)
$$
$$
\beta_b = c(0, 11, -4, -2, -2, -1, -1, -0.5, -0.5, 0)
$$
$$
\beta_c = c(0, 2, 0, 0, 0, 0, 0, 0, 0, 0)
$$

We then sampled n = 200 observations in a stratified sample with selection probabilities $P(I_i = 1) = \pi_i \propto (1 + h_i) \times h_i$, where the stratum $h_i$ is given by the ceiling function applied to X: $h_i = \lceil x_i \rceil$. This sets the maximum weight to be about 55 times larger than the minimum weight.
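The population and the unequal selection probabilities can be generated along the following lines. This is a minimal sketch, not the authors' R code: it uses the convex coefficient set βa, draws the sample by independent Bernoulli (Poisson) selection as an assumption, since the exact selection mechanism is not spelled out here, and all names and the seed are hypothetical.

```python
import math
import random

BETA_A = (0, 0, 0, 0, .5, .5, 1, 1, 2, 2)   # convex setting

def spline_mean(x, beta):
    """Linear spline mean: sum_h beta_h * (x - h)_+ with knots at h = 0, ..., 9."""
    return sum(b * max(x - h, 0.0) for h, b in enumerate(beta))

def stratum(x):
    """Stratum index h = ceiling(x); x = 0 is placed in stratum 1."""
    return max(1, math.ceil(x))

def inclusion_probs(xs, n):
    """Selection probabilities proportional to (1 + h) * h, scaled to sum to n."""
    size = [(1 + stratum(x)) * stratum(x) for x in xs]
    total = sum(size)
    return [n * s / total for s in size]

rng = random.Random(2021)
N, n, sigma2 = 20000, 200, 10 ** 1.5
xs = [rng.uniform(0, 10) for _ in range(N)]
ys = [rng.gauss(spline_mean(x, BETA_A), math.sqrt(sigma2)) for x in xs]
pi = inclusion_probs(xs, n)
weights = [1.0 / p for p in pi]
sample = [i for i in range(N) if rng.random() < pi[i]]   # assumed Poisson sampling
print(len(sample), max(weights) / min(weights))
```

The weight ratio follows directly from the extreme strata: (1 + 10) × 10 = 110 against (1 + 1) × 1 = 2, a factor of 55.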

The target quantity of interest is $B = \left(\sum_{i=1}^{N}\tilde{X}_i\tilde{X}_i^T\right)^{-1}\sum_{i=1}^{N}\tilde{X}_i Y_i$ for $\tilde{X}_i = (1, X_i)^T$, the least-squares linear approximation of the population slope. Under βa and βb, the linear model is misspecified, and weights may be necessary to correct for the corresponding bias. Under βc, the model is correctly specified, and it would be most efficient to ignore the sample weight.

Population variance $\sigma^2$ varies as $10^{1.5}$, $10^{3.5}$ and $10^{5.5}$, creating varying levels of variance influence, ranging from revealing a moderate curving pattern, to completely overwhelming the local differences in the population slope (see Figure 1). Also note that the slope in setting βa changes more dramatically where data are most densely sampled, while the reverse happens in βb, suggesting that a complex model is needed to correctly capture the two different scenarios.

Fig. 1.

Scatter plot of population, with an example sample in crosses. From left to right, $\sigma^2 = 10^{1.5}, 10^{3.5}, 10^{5.5}$; from top to bottom, setting βa, βb, βc.

The hyperprior parameters are pre-specified as follows. For the prior on DP weight functions, we let αγ = 0.01, βγ = 2, and α = 0.01. For hyper-priors on basis distribution parameters, we have β0 = 0, Vβ0 = 1000 × Ip, ν0 = 1, Σ0 = Ip, aτ = 0.1, a0 = 0.1, b0 = 0.1, μψ = log(30) and σψ = 0.5. We also restrict the value of ψ to the range of zero to five, and fix the Dirichlet process precision parameter α = 1.

A Gibbs sampler as previously described is applied, that is, the new distribution of S, DP weights, number of atom distributions and parameters within each atom are drawn sequentially from the full conditional distributions. A draw of the population slope is then obtained as described in Section 3. All free parameters in the Gibbs sampler are initialized at zero, except for variance estimators $\sigma_h^2$ for each component, which are initialized at one. The first 5,000 iterations are dropped as burn-in, and the following 10,000 iterations are kept to form the distribution of estimated parameters. Diagnostic plots are generated to assure the algorithm’s convergence.

The process is repeated for 200 independent samples to provide the empirical distribution for the repeated sampling properties. We compare the properties of our weighted Dirichlet Process Mixture model (WDPM) with major competitors, including the unweighted estimator (UNWT), fully weighted estimator (FWT) and a standard ad-hoc weight trimming method with threshold at three times the standard deviation of the weights (WT3). For this linear model, we also included a spline competitor (SPL) that replaces the Dirichlet Process Mixture model with a B-spline basis matrix (Wang 1998); details of the SPL model fitting are provided in the supplementary material. Bias and nominal 95% coverage are recorded directly, while RMSE is rescaled relative to the fully weighted estimator. Results are provided in Table 1.
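The three performance measures can be computed from the per-sample estimates and intervals as follows. This is a minimal sketch with hypothetical function names and toy inputs, assuming RMSE is rescaled by dividing by the RMSE of the fully weighted estimator.

```python
import math

def performance(estimates, intervals, truth):
    """Empirical bias, RMSE, and interval coverage across repeated samples."""
    errors = [e - truth for e in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    cover = sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)
    return bias, rmse, cover

def relative_rmse(rmse, rmse_fully_weighted):
    """RMSE rescaled so that the fully weighted estimator equals 1."""
    return rmse / rmse_fully_weighted

# Toy example: two simulated samples, true slope 2.0.
bias, rmse, cover = performance([1.0, 3.0], [(0.0, 2.5), (2.5, 4.0)], truth=2.0)
print(bias, rmse, cover, relative_rmse(rmse, 2.0))
```

A relative RMSE below 1 indicates an estimator that is more accurate overall than the fully weighted estimator in that setting.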

Table 1.

Comparison of various estimators of slope B1 under βa, βb, and βc linear spline setting. Bias, relative RMSE and 95% coverage under populations with residual variance $10^{1.5}$, $10^{3.5}$ and $10^{5.5}$ from the following estimators: unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), spline (SPL), and weighted Dirichlet process mixture model (WDPM).

| Setting | Estimator | Bias (σ² = 10^1.5) | RMSE (σ² = 10^1.5) | Cover (σ² = 10^1.5) | Bias (σ² = 10^3.5) | RMSE (σ² = 10^3.5) | Cover (σ² = 10^3.5) | Bias (σ² = 10^5.5) | RMSE (σ² = 10^5.5) | Cover (σ² = 10^5.5) |
|---|---|---|---|---|---|---|---|---|---|---|
| βa | UNWT | 1.184 | 2.499 | 0 | 1.120 | 0.692 | 0.87 | 0.475 | 0.599 | 0.93 |
| βa | FWT | −0.355 | 1 | 0.72 | −0.297 | 1 | 0.92 | 0.289 | 1 | 0.91 |
| βa | WT3 | −0.024 | 0.531 | 0.98 | −0.039 | 0.748 | 0.94 | −0.185 | 0.751 | 0.94 |
| βa | SPL | 0.066 | 0.830 | 0.95 | 0.602 | 0.703 | 1.00 | 2.323 | 0.704 | 0.99 |
| βa | WDPM | −0.297 | 0.702 | 0.98 | −0.258 | 0.490 | 0.99 | −0.723 | 0.431 | 0.92 |
| βb | UNWT | −0.973 | 1.464 | 0 | −1.047 | 0.665 | 0.87 | −1.692 | 0.600 | 0.93 |
| βb | FWT | 0.559 | 1 | 0.67 | 0.629 | 1 | 0.90 | 1.214 | 1 | 0.91 |
| βb | WT3 | 0.085 | 0.434 | 0.99 | 0.070 | 0.736 | 0.92 | −0.076 | 0.750 | 0.94 |
| βb | SPL | −0.338 | 1.102 | 0.70 | −1.239 | 0.762 | 1.00 | 0.049 | 0.594 | 0.99 |
| βb | WDPM | 0.203 | 0.532 | 1 | 0.205 | 0.326 | 0.98 | −0.601 | 0.418 | 0.90 |
| βc | UNWT | −0.007 | 0.599 | 0.93 | −0.072 | 0.599 | 0.93 | −0.717 | 0.599 | 0.93 |
| βc | FWT | 0.007 | 1 | 0.91 | 0.065 | 1 | 0.91 | 0.650 | 1 | 0.91 |
| βc | WT3 | −0.002 | 0.751 | 0.94 | −0.016 | 0.751 | 0.94 | −0.162 | 0.751 | 0.94 |
| βc | SPL | 0.044 | 0.639 | 1.00 | 0.008 | 0.735 | 0.99 | −0.715 | 0.677 | 1.00 |
| βc | WDPM | −0.035 | 0.503 | 1 | −0.128 | 0.456 | 1 | −1.733 | 0.393 | 0.90 |

For the first two scenarios, the model is misspecified as linear, and the unweighted method tends to be biased, leading to an overall larger RMSE and lower coverage rate compared to the fully weighted model. However, as residual variance increases, the gain in efficiency gradually overcomes the loss in accuracy, and at the large variance level, the unweighted estimator has better RMSE compared to the fully weighted estimator, suggesting that the model misspecification could be ignored. The weight trimming estimator has an overall better performance compared to the fully weighted method, maintaining the necessary bias-correction while improving the efficiency and nominal 95% coverage. The spline model is more stable than the weighted estimator, with a small bias at the low variance setting leading to a smaller RMSE than the unweighted estimator, and trading off a slight increase in bias for better RMSE than the trimmed weight estimator at higher variance. The spline model also has below nominal coverage for the βb scenario in the low variance setting, due to a modest degree of bias that remains even after the non-parametric fit of the mean. However, the weighted Dirichlet Process Mixture estimator demonstrates a dominating performance across all settings, obtaining more than 50% reduction in RMSE compared to the fully weighted estimator; when the residual variance is large, it leads to more efficient estimates than even the unweighted estimates.

Under βc, where the model is correctly specified, all methods yield approximately unbiased results. (The estimated biases for $\sigma^2 = 10^{5.5}$ are due to simulation error, amplified by the instability in the estimation.) Here the WDPM method yields the maximum reduction in RMSE, reducing RMSE by 50–60% over the fully weighted estimator, while the unweighted estimator consistently reduces RMSE by about 40%, the spline estimator by about 30%, and the weight trimming estimator by about 25%, each relative to the fully weighted estimator. The UNWT, FWT, and WT3 coverage rates are somewhat low due to the instability caused by small sample size, while the spline coverage rate is conservative. Meanwhile, the WDPM estimator provides conservative coverage across all residual variance settings.

4.2. Weighted Dirichlet Mixture Model for Quantile Regression

To assess the performance of the WDPM method in quantile regression, we consider heavy tailed, skewed, and bimodal distributions. As in the linear regression setting, a population of 20,000 is generated, and samples of size 200 are drawn from the population. The covariate X is uniformly distributed on the interval (0, 10). Our inferential target is the linear population slope of X on the first quartile (25th percentile), median and third quartile (75th percentile) of Y: $\arg\min_{\beta_{0\tau}, \beta_{1\tau}}\sum_{i=1}^{N}\rho_\tau(y_i - \beta_{0\tau} - \beta_{1\tau}x_i)$, for τ = .25, .5, .75. We again estimate bias, RMSE, and coverage from 200 independent simulations. We drop the spline model from consideration, since it makes the assumption of residual normality that is clearly not correct in these settings.

For the heavy tail setting, we consider a t distribution with five degrees of freedom and selection probability πi related to covariate X:

$X_i \sim \mathrm{Uniform}(0, 10)$
$Y_i \mid X_i \sim T(\mu = x_i, \mathrm{df} = 5)$
$P(I_i = 1) = \pi_i \propto (1 + x_i) \times x_i$
$i = 1, \ldots, N = 20{,}000.$

For a skewed distributed setting, we consider a Gamma distribution and selection probability related to covariate X:

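$X_i \sim \mathrm{Uniform}(0, 10)$
$Y_i \mid X_i \sim \mathrm{Gamma}(k = x_i^{1.5}/5 + 1, \theta = 2)$
$P(I_i = 1) = \pi_i \propto (1 + x_i) \times x_i$
$i = 1, \ldots, N = 20{,}000.$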

For the bimodal distribution, we consider the following mixture with weight αi related to xi:

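$X_i \sim \mathrm{Uniform}(0, 10)$
$Y_i \mid X_i, \alpha_i \sim \alpha_i N(x_i, 16) + (1 - \alpha_i) N(5, 16)$
$\alpha_i \sim \mathrm{Bernoulli}(x_i/10)$
$P(I_i = 1) = \pi_i \propto (1 + x_i) \times x_i$
$i = 1, \ldots, N = 20{,}000.$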
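The three simulation populations can be sketched as follows. Several details are assumptions rather than statements from the text: the second argument of N(·, 16) is read as a variance (standard deviation 4), the gamma shape is read as x^1.5/5 + 1 with scale 2, and the sample is drawn by Poisson sampling with inclusion probabilities scaled to an expected size of 200, since the exact sampling scheme is not specified here. All function names are hypothetical.

```python
import math
import random

def draw_t5(x, rng):
    """Student t with 5 df centred at x, built as Z / sqrt(chi2_5 / 5)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(2.5, 2.0)  # chi-square with 5 df
    return x + z / math.sqrt(chi2 / 5.0)

def draw_gamma(x, rng):
    """Gamma with shape k = x^1.5 / 5 + 1 and scale theta = 2."""
    return rng.gammavariate(x ** 1.5 / 5.0 + 1.0, 2.0)

def draw_bimodal(x, rng):
    """Mixture: N(x, 16) with probability x / 10, otherwise N(5, 16)."""
    alpha = 1 if rng.random() < x / 10.0 else 0
    mean = x if alpha == 1 else 5.0
    return rng.gauss(mean, 4.0)  # assumes 16 is the variance

def make_population(draw_y, N=20000, seed=1):
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(N)]
    ys = [draw_y(x, rng) for x in xs]
    return xs, ys

def poisson_sample(xs, ys, n=200, seed=2):
    """Poisson sampling with pi_i = n * s_i / sum(s), s_i = (1 + x_i) * x_i.

    Returns (x, y, weight) triples, where weight = 1 / pi_i.
    The realized sample size is random with expectation n.
    """
    rng = random.Random(seed)
    sizes = [(1.0 + x) * x for x in xs]
    total = sum(sizes)
    sample = []
    for x, y, s in zip(xs, ys, sizes):
        pi = n * s / total
        if rng.random() < pi:
            sample.append((x, y, 1.0 / pi))
    return sample

pop_x, pop_y = make_population(draw_t5)
sample = poisson_sample(pop_x, pop_y)
print(len(sample))
```

Because the selection probability grows roughly with the square of x, units with small x are rarely sampled and carry very large weights, which is the source of the instability of the fully weighted estimator in these settings.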

Population and samples are shown in Figure 2. Under the first scenario, the linear model is correctly specified, with an over-dispersed residual. Thus we expect all estimates to be unbiased, with the unweighted estimator gaining efficiency, and the WDPM model correcting the coverage rate. For the other two scenarios, the unweighted estimator is biased due to non-linearity in xi combined with sampling probabilities that are a function of xi, and we expect the WDPM estimate to perform similarly to the fully weighted estimator.

Fig. 2.

Plot of quantile regression simulation settings. The population is shown in grey in the background and an example sample is marked with crosses. From left to right: t distribution, gamma distribution, and bimodal distribution. From top to bottom: scatter plot, and histograms of Y at X = 0, 5, and 10.

The hyperprior parameters are pre-specified as in the linear regression setting: aγ = 0.01, bγ = 2, and ξ = 0.01. For hyper-priors on basis distribution parameters, we have β0 = 0, Vβ0 = 1000 × Ip, ν0 = 1, Σ0 = Ip , aτ = 0.1, a0 = 0.1, b0 = 0.1, μψ = log(30) and σψ = 0.5. Each simulation runs 15,000 iterations, with the first 5,000 dropped as burn-in. Diagnostic plots are generated to check the algorithm’s convergence. Results are provided in Table 2.
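The performance measures reported in the tables can be computed from the simulation replicates along the following lines. This is a minimal sketch with hypothetical function names; the relative RMSE divides each estimator's RMSE by that of a reference estimator, which matches the convention in Table 2 where the fully weighted estimator has RMSE 1.

```python
import math

def summarize(estimates, intervals, truth):
    """Bias, RMSE, and interval coverage over simulation replicates.

    estimates: point estimates, one per replicate
    intervals: (lower, upper) 95% interval limits, one pair per replicate
    truth: the population value being estimated
    """
    errors = [e - truth for e in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    cover = sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)
    return {"bias": bias, "rmse": rmse, "cover": cover}

def relative_rmse(summary, reference):
    """RMSE of one estimator relative to a reference estimator."""
    return summary["rmse"] / reference["rmse"]
```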

Table 2.

Comparison across various estimators of slope β1 under non central t distribution, gamma distribution, and bimodal distribution. Bias, relative RMSE and 95% coverage of estimates for the 1st quartile, median and 3rd quartile of the outcome from the following estimators: unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted Dirichlet process mixture model (WDPM).

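| Distribution | Estimator | Bias (Q1) | RMSE (Q1) | Cover (Q1) | Bias (median) | RMSE (median) | Cover (median) | Bias (Q3) | RMSE (Q3) | Cover (Q3) |
|---|---|---|---|---|---|---|---|---|---|---|
| Non central T | UNWT | −0.002 | 0.504 | 0.98 | 0.005 | 0.502 | 0.96 | −0.009 | 0.469 | 0.98 |
| Non central T | FWT | −0.022 | 1 | 0.91 | −0.007 | 1 | 0.93 | 0.001 | 1 | 0.91 |
| Non central T | WT3 | −0.012 | 0.718 | 0.99 | −0.003 | 0.692 | 1 | −0.007 | 0.656 | 1 |
| Non central T | WDPM | 0.016 | 0.510 | 0.98 | 0.004 | 0.495 | 0.98 | −0.008 | 0.455 | 0.98 |
| Gamma | UNWT | 0.235 | 1.971 | 0.53 | 0.202 | 1.315 | 0.71 | 0.212 | 1.052 | 0.81 |
| Gamma | FWT | 0.010 | 1 | 0.94 | 0.006 | 1 | 0.97 | 0.050 | 1 | 0.93 |
| Gamma | WT3 | 0.092 | 1.100 | 0.78 | 0.082 | 0.887 | 0.92 | 0.092 | 0.855 | 0.96 |
| Gamma | WDPM | 0.154 | 1.299 | 0.82 | 0.078 | 0.665 | 0.93 | 0.061 | 0.555 | 0.93 |
| Bimodal | UNWT | 0.638 | 1.545 | 0.41 | 0.248 | 0.875 | 0.87 | 0.009 | 0.538 | 0.96 |
| Bimodal | FWT | −0.038 | 1 | 0.94 | 0.030 | 1 | 0.96 | 0.051 | 1 | 0.92 |
| Bimodal | WT3 | 0.203 | 0.943 | 0.73 | 0.128 | 0.764 | 0.97 | 0.065 | 0.724 | 0.99 |
| Bimodal | WDPM | −0.098 | 0.652 | 0.97 | −0.131 | 0.626 | 0.97 | −0.148 | 0.540 | 0.94 |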

For the population created from a heavy-tailed t distribution, the unweighted method has the best performance across all quantile estimates, prevailing in both efficiency and coverage over the fully weighted estimator, since there is no bias correction from weighting in this scenario. The fully weighted estimator has somewhat reduced coverage due to the instability caused by the small sample size, while the weight trimming estimator shows major improvements in RMSE relative to the fully-weighted estimator, with a conservative coverage rate. Meanwhile, the WDPM method maintains stable results across median, first and third quartiles, providing inference approximately equivalent to the unweighted estimator, with about a 45% reduction in RMSE compared to the fully weighted method or 20% compared to the weight trimming method, and a conservative coverage rate.

In the second scenario, where skewed population distributions and model misspecification both occur, weighting becomes necessary, and the fully weighted method has better performance with respect to bias at lower quartiles, as anticipated. The unweighted method is biased, and has poor coverage and larger RMSE for all except the 3rd quartile, where the impact of bias is offset by the reduction in variance. The weight trimming method improves on the fully weighted method for the median and third quartile, but suffers a drop in coverage rate in the first quartile. The WDPM model has smaller RMSE than any other model for the 50th and 75th percentile, with approximately correct coverage. For the 25th percentile, the WDPM model suffers from some reduction in coverage as well as an increase in bias and RMSE, because the extremely small sample size at small values of X prevents the model from fully capturing the behavior of the lower percentiles of the outcome there (see Figure 2).

In the bimodal setting, bias reduction is required for estimation in the first quartile and median, but not in the third quartile, since it closely approximates a pure linear model. The fully weighted estimator successfully reduces biases and performs better with respect to RMSE than the unweighted estimator for the first quartile and median regression slopes, but loses efficiency when estimating the third quartile regression slope. The weight trimming estimator acts as an upgraded version of a fully weighted estimator, showing better results in all but the coverage of the first quartile regression slope. The WDPM model provides a large improvement compared to the fully weighted model in the first quartile and median, reducing RMSE by 30% to 40%. It also maintains this improvement even compared to the weight trimming estimator. While the unweighted estimator has better RMSE in the third quartile, the WDPM estimator closely follows its performance. Both the fully weighted estimator and WDPM estimator have satisfactory coverage rates.

4.3. Weighted Dirichlet Mixture Model for Quantile Regression With a Binary Covariate

In this subsection, we conduct a simulation study extending the weighted Dirichlet Mixture model to quantile regression with a binary covariate. Specifically, we focus on the bimodal population setting and assess performance differences between unweighted quantile regression, weighted quantile regression, and the WDPM model, to help interpret the results of the application to the dioxin data in the next section.

The bimodal distributed population is created as follows:

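$X_i \sim \mathrm{Bernoulli}(0.5)$
$\alpha_i \sim \mathrm{Bernoulli}(0.5)$
$Y_i \mid X_i, \alpha_i \sim N(0.5 X_i + 5\alpha_i, 1)$
$P(I_i = 1) \propto 15 X_i + N(0, 1) + 7$
$i = 1, \ldots, N = 20{,}000.$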

Samples of size n = 200 are drawn from the population with the probability of selection defined above (yielding a ratio of approximately nine between the maximum and minimum selection probabilities, similar to the design in NHANES). Simulations on each sample consist of 10,000 iterations, with the first 5,000 dropped as burn-in. Bias, RMSE, and coverage are again assessed with 200 independent simulations, and results are displayed in Table 3. The results suggest that all models provide consistent results with good coverage in the first quartile and third quartile, while WDPM reduces the overall RMSE by 30%. However, when estimating the population slope for the median, the true RMSEs from the unweighted, weighted, and weight trimming models are greatly increased, indicating unstable estimation. These findings result from the median regression estimator attempting to “balance” between the two modes in the population.
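The instability of the median slope in this bimodal setting can be seen in a simplified sketch. It deliberately ignores the unequal selection probabilities and draws equal-probability samples, so it illustrates only the mechanism, not the simulation reported in Table 3. With a binary covariate the linear quantile regression is saturated, so the unweighted slope equals the difference between the sample quantiles in the two groups; the function names are hypothetical.

```python
import random
import statistics

def draw_y(x, rng):
    """Y | X, alpha ~ N(0.5 * X + 5 * alpha, 1), with alpha ~ Bernoulli(0.5)."""
    alpha = 1 if rng.random() < 0.5 else 0
    return rng.gauss(0.5 * x + 5.0 * alpha, 1.0)

def sample_quantile(values, tau):
    """Simple order-statistic estimate of the tau-quantile."""
    s = sorted(values)
    k = min(len(s) - 1, int(tau * len(s)))
    return s[k]

def slope_sd(tau, n_per_group=100, reps=300, seed=0):
    """Standard deviation, across replicates, of the quantile 'slope':
    the tau-quantile of Y among X = 1 minus that among X = 0."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        y0 = [draw_y(0, rng) for _ in range(n_per_group)]
        y1 = [draw_y(1, rng) for _ in range(n_per_group)]
        slopes.append(sample_quantile(y1, tau) - sample_quantile(y0, tau))
    return statistics.stdev(slopes)

for tau in (0.25, 0.5, 0.75):
    print(tau, slope_sd(tau))
```

The median of each group falls in the low-density region between the two modes, so small changes in how many observations land in each mode move the sample median a long way, whereas the quartiles sit near the mode centres and are much more stable.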

Table 3.

Comparison across various estimators of slope β1 under bimodal distribution with binary covariates. Bias, RMSE, and 95% coverage of estimates for the 1st quartile, median and 3rd quartile of the outcome from the following estimators: unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted Dirichlet process mixture model (WDPM).

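| Estimator | Bias (Q1) | RMSE (Q1) | Cover (Q1) | Bias (median) | RMSE (median) | Cover (median) | Bias (Q3) | RMSE (Q3) | Cover (Q3) |
|---|---|---|---|---|---|---|---|---|---|
| UNWT | 0.023 | 0.986 | 0.96 | 0.015 | 0.948 | 0.97 | 0.006 | 0.969 | 0.96 |
| FWT | 0.026 | 1 | 0.96 | 0.059 | 1 | 0.97 | 0.025 | 1 | 0.96 |
| WT3 | 0.026 | 0.991 | 0.99 | 0.047 | 0.969 | 1 | 0.028 | 0.998 | 0.98 |
| WDPM | −0.053 | 0.567 | 0.97 | 0.012 | 0.124 | 1 | 0.041 | 0.646 | 0.97 |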

To explore these results further, Figure 3 plots biases for all three approaches for fifty simulations. This suggests that, in this setting, both the unweighted and the weighted methods often provide similar estimates that are far from the true value. The WDPM estimator is more robust in these situations, providing stable estimates.

Fig. 3.

Bias of the population slope estimator for the median from all three methods. Bias of point estimates from each simulation are plotted sequentially. Outer circles mark the unweighted quantile regression, outer crosses mark the fully weighted quantile regression, and inner circles mark WDPM estimates.

5. Application on Dioxin Data from NHANES

The National Health and Nutrition Examination Survey (NHANES) (CDC 2015) is a multi-stage, unequal probability-of-selection survey, consisting of 25 strata and two primary sampling units (PSU) per stratum. It provides an annual sample of approximately 7,000 persons interviewed about a large variety of factors relating to prevalence, awareness, treatment and control of disease, trends in risk behaviors and environmental exposures, and relationships between diet, nutrition, and health. Approximately three-quarters of subjects agree to participate in a medical exam that obtains levels of a large variety of biomarkers in the blood. The biomarkers measured include certain varieties of dioxins. Dioxin is a generic term for a class of chemicals often created as by-products of industrial processes, and even low levels of exposure are suspected to cause a wide variety of health problems, including cancer. Thus when University of Michigan Dioxin Exposure Study researchers wanted to understand how dioxin exposure varies by age in the general US population, they turned to the NHANES (Chen et al. 2013). The NHANES design oversamples low income persons, adolescents and persons 60 and older, and African American and Mexican American minorities. Weights in the NHANES adjust for this oversampling, and include additional adjustments based on the estimated probability of participating in the medical exam, as well as calibration adjustments so that weighted distributions of demographic factors such as gender match those known in the population from the U.S. Census Bureau.

We apply the weighted Dirichlet Process prior to an analysis relating age and gender to the blood level of dioxin using data from the 2013–2014 NHANES data set. We consider 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), a compound resulting from incomplete combustion in incineration, paper and plastics manufacturing, and smoking. Somewhat more than half of TCDD readings are below the limit of detection, and are imputed five times using the multiple imputation procedure described in Chen et al. (2013). A jackknife method is used to compute variances to fully account for clustering and stratification design features, and Rubin’s formulas (Rubin 1987) are used to combine inferences from each of the multiply-imputed data sets. R code for fitting this data is available at https://github.com/mrelliot/WPDM.
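For readers unfamiliar with the combining rules, the standard form of Rubin's formulas for a scalar estimate is sketched below. The function name and the numbers in the usage example are hypothetical and are not taken from the NHANES analysis; the sketch returns only the pooled estimate and total variance, and omits the degrees-of-freedom calculation used for interval estimation.

```python
def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and their variances.

    Pooled estimate: mean of the m estimates.
    Total variance:  W + (1 + 1/m) * B, where W is the mean
    within-imputation variance and B is the between-imputation variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    w_bar = sum(variances) / m
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = w_bar + (1.0 + 1.0 / m) * b
    return q_bar, total

# Hypothetical slope estimates and jackknife variances from five imputations.
est, var = rubin_combine(
    [0.0341, 0.0347, 0.0339, 0.0350, 0.0343],
    [1.5e-5, 1.4e-5, 1.6e-5, 1.5e-5, 1.5e-5],
)
print(est, var)
```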

5.1. Linear Regression Model

We fit three linear regression models – age alone, gender alone, and age and gender together – to assess the impact of age and gender on log transformed blood TCDD. Hyper-priors are set to the same values as in the simulation study, and unweighted, fully weighted and weighted Dirichlet Process Mixture estimates are compared with respect to bias and RMSE, where the fully weighted version is treated as unbiased in the corresponding calculation. Note that the weighted estimator is correlated with the other estimators, so the unbiased estimate of the squared bias of a regression coefficient $\hat\beta$ is given by $\max\{(\hat\beta - \hat\beta_w)^2 - \hat V_{01}, 0\}$, where $\hat V_{01} = \widehat{\mathrm{Var}}(\hat\beta) + \widehat{\mathrm{Var}}(\hat\beta_w) - 2\,\widehat{\mathrm{Cov}}(\hat\beta, \hat\beta_w)$. To fully account for the design features, all variance/covariance estimates are calculated via jackknife as $\widehat{\mathrm{Var}}(\hat\beta) = \sum_h \frac{k_h - 1}{k_h} \sum_{i=1}^{k_h} (\hat\beta_{(hi)} - \hat\beta)^2$, where $k_h$ is the number of PSUs in the hth stratum, $\hat\beta_{(hi)}$ denotes the β estimator obtained by excluding the ith PSU in the hth stratum, and the case weights utilized in the fully-weighted and WDPM analysis are given by:

$$w_j^* = \begin{cases} w_j & \text{if } j \notin h,\ j \notin i \\ \dfrac{k_h - 1}{k_h} & \text{if } j \in h,\ j \notin i \\ 0 & \text{if } j \in h,\ j \in i \end{cases}$$

$\widehat{\mathrm{Var}}(\hat\beta)$ and $\widehat{\mathrm{Cov}}(\hat\beta_w, \hat\beta)$ are obtained accordingly, and estimates from the five imputed replicate data sets are combined with Rubin’s formula. The WDPM estimates are based on 10,000 iterations after discarding 2,000 draws as burn-in. The estimated bias, RMSE, and 95% confidence intervals are summarized in Tables 4 through 6. Dioxin levels are positively associated with age and being male.
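The jackknife variance and squared-bias formulas above can be sketched as follows. Two points are assumptions rather than statements from the text: the covariance is computed with the same stratum factor as the variance, and the RMSE is taken to be the square root of the estimated squared bias plus the estimated variance. Function names are hypothetical, and the delete-one-PSU replicate estimates are taken as given inputs.

```python
import math

def jackknife_cov(full_a, reps_a, full_b, reps_b):
    """Stratified delete-one-PSU jackknife covariance of two estimators.

    reps_a[h][i] is estimator a recomputed with PSU i of stratum h removed;
    reps_b is laid out the same way. Passing the same estimator twice
    returns its jackknife variance.
    """
    total = 0.0
    for ra, rb in zip(reps_a, reps_b):
        k_h = len(ra)  # number of PSUs in this stratum
        factor = (k_h - 1) / k_h
        total += factor * sum((a - full_a) * (b - full_b)
                              for a, b in zip(ra, rb))
    return total

def squared_bias(beta, beta_w, var_beta, var_w, cov):
    """max((beta - beta_w)^2 - V01, 0), V01 = Var + Var_w - 2 Cov."""
    v01 = var_beta + var_w - 2.0 * cov
    return max((beta - beta_w) ** 2 - v01, 0.0)

def rmse(beta, beta_w, var_beta, var_w, cov):
    """Assumed definition: sqrt(estimated squared bias + variance)."""
    return math.sqrt(squared_bias(beta, beta_w, var_beta, var_w, cov) + var_beta)
```

Truncating the squared bias at zero explains how an estimator can show a nonzero bias relative to the fully weighted estimate and yet a reported RMSE smaller than that difference.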

Table 4.

Regression of log TCDD on age. Bias, RMSE and 95% CI for linear slope estimated for age in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted dirichlet process mixture model (WDPM)

| Model | Est (×10⁻⁴) | Bias (×10⁻⁵) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|
| UNWT | 331 | −126 | 326 | (283,379) |
| FWT | 344 | 0 | 389 | (270,420) |
| WT3 | 344 | −0 | 550 | (267,420) |
| WDPM | 335 | −84 | 66 | (322,348) |

Table 6.

Regression of log TCDD on age and gender. Bias, RMSE and 95% CI for linear slope estimated for age and gender in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted dirichlet process mixture model (WDPM).

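Age

| Model | Est (×10⁻⁴) | Bias (×10⁻⁵) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|
| UNWT | 336 | −91 | 330 | (286,385) |
| FWT | 345 | 0 | 390 | (268,421) |
| WT3 | 345 | −0 | 550 | (268,421) |
| WDPM | 330 | −152 | 99 | (310,349) |

Gender

| Model | Est (×10⁻³) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|
| UNWT | 256 | 2 | 90 | (108,403) |
| FWT | 254 | 0 | 62 | (133,375) |
| WT3 | 255 | 1 | 87 | (133,379) |
| WDPM | 106 | −148 | 137 | (7,147) |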

In general, the survey weights have less impact on estimating the effect of age, but play a crucial role in estimating the effect of gender; thus unweighted estimates usually have smaller RMSE for the estimated coefficient of age, and fully weighted and weight trimming estimates have smaller RMSE for the estimated coefficient of gender. Consequently the WDPM estimator has much better performance than the weighted and weight trimming estimators, and even than the unweighted estimator, when estimating the effect of age on dioxin blood levels. The WDPM estimate of the effect of gender appears to be biased toward the null, leading to a larger RMSE than the other models except in the joint age and gender model.

5.2. Quantile Regression Model

In this section we evaluate the performance of the WDPM estimator in the quantile regression setting based on the same dioxin data set from the NHANES study. We again focus on the impact of age, gender, and age and gender together on the first quartile, median and third quartile of log blood TCDD. When estimating bias and RMSE, results from weighted quantile regression are treated as unbiased, and the jackknife and Rubin’s formula are applied to account for the complex survey design and multiple imputation. The result for the WDPM estimator is based on 10,000 iterations after discarding 2,000 draws as burn-in. The estimated bias, RMSE, and 95% confidence intervals are summarized in Tables 7 through 9. The impact of age (older ages have higher TCDD levels) and gender (males have higher TCDD levels) is greater at median levels than at the first and third quartiles.

Table 7.

Quantile regression of log TCDD on age. Bias, RMSE and 95% CI for linear slope estimated for age in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted dirichlet process mixture model (WDPM).

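| Quantile | Model | Est (×10⁻⁴) | Bias (×10⁻⁵) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|---|
| First quartile | UNWT | 278 | 71 | 351 | (209,347) |
| First quartile | FWT | 271 | 0 | 676 | (138,404) |
| First quartile | WT3 | 271 | 1 | 956 | (138,404) |
| First quartile | WDPM | 277 | 63 | 87 | (260,294) |
| Median | UNWT | 398 | −276 | 549 | (328,472) |
| Median | FWT | 426 | 0 | 470 | (333,518) |
| Median | WT3 | 426 | −1 | 665 | (333,518) |
| Median | WDPM | 416 | −102 | 50 | (406,426) |
| Third quartile | UNWT | 340 | −128 | 343 | (287,393) |
| Third quartile | FWT | 353 | 0 | 397 | (275,431) |
| Third quartile | WT3 | 353 | −1 | 559 | (275,430) |
| Third quartile | WDPM | 355 | 19 | 54 | (344,365) |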

Table 9.

Quantile regression of log TCDD on age and gender. Bias, RMSE and 95% CI for linear slope estimated for age and gender in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted dirichlet process mixture model (WDPM).

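First quartile, age

| Model | Est (×10⁻⁴) | Bias (×10⁻⁵) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|
| UNWT | 314 | 113 | 640 | (188,440) |
| FWT | 303 | 0 | 1242 | (58,548) |
| WT3 | 303 | 0 | 1763 | (57,548) |
| WDPM | 271 | −316 | 129 | (246,297) |

First quartile, gender

| Model | Est (×10⁻³) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|
| UNWT | 357 | 66 | 229 | (−10,725) |
| FWT | 291 | 0 | 112 | (71,511) |
| WT3 | 292 | 1 | 156 | (74,510) |
| WDPM | 109 | −182 | 146 | (71,147) |

Median, age

| Model | Est (×10⁻²) | Bias (×10⁻⁵) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|
| UNWT | 3.96 | −166 | 314 | (345,448) |
| FWT | 4.13 | 0 | 368 | (340,486) |
| WT3 | 4.13 | −1 | 518 | (341,485) |
| WDPM | 4.08 | −47 | 86 | (391,425) |

Median, gender

| Model | Est (×10⁻³) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|
| UNWT | 311 | −28 | 81 | (152,470) |
| FWT | 339 | 0 | 222 | (−100,777) |
| WT3 | 341 | 2 | 316 | (−99,782) |
| WDPM | 114 | −225 | 52 | (62,166) |

Third quartile, age

| Model | Est (×10⁻²) | Bias (×10⁻³) | RMSE (×10⁻⁵) | 95% CI (×10⁻⁴) |
|---|---|---|---|---|
| UNWT | 3.44 | −1.29 | 548 | (264,424) |
| FWT | 3.57 | 0 | 459 | (266,447) |
| WT3 | 3.57 | −0.01 | 647 | (266,447) |
| WDPM | 3.46 | −1.02 | 90 | (329,364) |

Third quartile, gender

| Model | Est (×10⁻¹) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|
| UNWT | 1.65 | −28 | 65 | (54,276) |
| FWT | 1.93 | 0 | 65 | (66,321) |
| WT3 | 1.93 | −0 | 91 | (66,319) |
| WDPM | 0.80 | −113 | 94 | (50,111) |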

The patterns for quantile regression applied to the NHANES study are not consistent across different quartiles. In general, when estimating the population slope of age on the first quartile of outcomes, the unweighted method is clearly more efficient than the fully weighted one, reducing the RMSE by almost 50%. The performance of the unweighted and fully weighted estimators for the median and third quartile of outcomes is much closer, usually with less than 15% difference in RMSE, and with neither method besting the other across all models and settings. When dealing with gender, the fully weighted estimate is usually favored in the main-effect only model with respect to RMSE. Since very few weights actually fall outside the range of plus or minus three standard deviations, the weight trimming method makes little modification: it closely resembles the fully weighted method in its results, and obtains larger RMSEs due to the way they are calculated.
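A trimming rule of the kind referred to here can be sketched as below. This is an assumption about the general form of such a rule, not the paper's exact WT3 procedure: weights are clamped to the mean plus or minus a multiple of their standard deviation and then rescaled so that the total weight is preserved. The function name and the rescaling step are illustrative choices.

```python
import statistics

def trim_weights(weights, n_sd=3.0):
    """Clamp weights to mean +/- n_sd * SD, then rescale to the original total."""
    mean = statistics.mean(weights)
    sd = statistics.pstdev(weights)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    clamped = [min(max(w, lower), upper) for w in weights]
    scale = sum(weights) / sum(clamped)  # preserve the weighted total
    return [w * scale for w in clamped]

# When no weight lies outside the bounds, the weights are returned unchanged.
print(trim_weights([1.0, 2.0, 3.0, 4.0]))
```

When few weights exceed the bounds, as in this application, the trimmed weights are nearly identical to the original ones, which is why the trimmed and fully weighted point estimates almost coincide.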

The WDPM method always provides estimates with smaller variances. For age, the differences between the WDPM estimates and the fully weighted estimates are small; this reduction in variability leads to a large reduction in RMSE across all quartiles. For gender, the WDPM results are quite different from the other methods’ results, which are more similar to each other. This is consistent with our simulation finding in Subsection 4.3 that both the unweighted and fully weighted estimators of linear trends can be highly unstable in this setting, while the WDPM approach yields more stable estimates. Hence, we do not trust the bias estimates for gender, although we cannot know the truth in this setting.

6. Discussion

Fully weighted estimators are generally used when bias correction is a priority. However, these estimators can lead to substantial losses in efficiency when sampling weights are unrelated to the quantity of interest. Other design-based methods like ad-hoc weight trimming usually target a balance between accuracy and efficiency. However, the weighted Dirichlet Process Mixture model combined with data augmentation provides estimates that have both improved mean square error properties and nominal interval coverage relative to fully weighted methods in both linear and quantile regression settings. For the simulations considered, the reduction in RMSE from WDPM estimates could be as large as 70% while retaining at least nominal coverage. Similarly, reductions in RMSE were obtained in the application, particularly with respect to gender-adjusted effects of age on dioxin. One exception to this generally positive performance may be in quantile regression settings with a heavily skewed population and a small number of observations in certain areas of the prediction space, where the WDPM model may fail to correctly model the data and thus provide “over-smoothed” estimates with poor coverage.

Another feature of our analysis is the use of weights to create an approximation to the posterior predictive distribution of the population based on the posterior predictive distribution of the sample. While we are not arguing that this approach is a general solution to the difficulties of accounting for complex sample design in Bayesian computation (e.g., it does not account for stratification or clustering), it does appear to work well in our approximately asymptotic simulation settings. Further, reviewing the simple situation described in Subsection 2.1, we can see that in general if sampling fractions are on the order of 1% or less, the resulting approximation should be accurate to approximately the same degree. Zangeneh and Little (2015) discussed an alternative approach to generating synthetic populations under a Dirichlet prior for the sampling weights, which might have better performance when sampling fractions are large and/or populations are small.

Our approach could be extended in a number of ways. We have focused on continuous outcomes, but extensions to binary or multinomial models are straightforward by modeling the latent variables under a weighted DP process framework. Alternatively, highly skewed distributions could be modeled using skewed shape measures for G0 such as the gamma distribution. Adaptations that make use of additional design information that might be available for non-sampled cases and/or accommodate non-trivial sampling fractions are possible (e.g., Lu and Gelman 2003; Si et al. 2015).

Finally, reductions in overall RMSE from our complex Bayesian model are based on intensive computation. Based on current computation facilities, results for samples with hundreds of observations could be obtained in a reasonable amount of time, but when sample size escalates beyond several thousand cases, computing time can be intolerable. However, we anticipate that with the continuing development of hardware as well as parallel processing, this limitation will be minimized. Also, software such as STAN (http://mc-stan.org/) (Hoffman and Gelman 2014) which uses more efficient MCMC algorithms, may assist in speeding up computation.

Supplementary Material

Supplemental material

Table 5.

Regression of log TCDD on gender. Bias, RMSE and 95% CI for linear slope estimated for gender in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3), and weighted dirichlet process mixture model (WDPM).

| Model | Est (×10⁻³) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|
| UNWT | 154 | −82 | 125 | (3,305) |
| FWT | 236 | 0 | 64 | (111,362) |
| WT3 | 237 | 1 | 90 | (111,362) |
| WDPM | 67 | 125 | 156 | (58,77) |

Table 8.

Quantile regression of log TCDD on gender. Bias, RMSE and 95% CI for linear slope estimated for gender in unweighted (UNWT), fully weighted (FWT), weight trimming (WT3) and weighted dirichlet process mixture model (WDPM).

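| Quantile | Model | Est (×10⁻³) | Bias (×10⁻³) | RMSE (×10⁻³) | 95% CI (×10⁻³) |
|---|---|---|---|---|---|
| First quartile | UNWT | 55 | −60 | 238 | (−313,423) |
| First quartile | FWT | 115 | 0 | 150 | (−180,410) |
| First quartile | WT3 | 115 | 1 | 212 | (−180,411) |
| First quartile | WDPM | 15 | −100 | 6 | (3,26) |
| Median | UNWT | 295 | −47 | 237 | (−109,699) |
| Median | FWT | 342 | 0 | 203 | (−58,741) |
| Median | WT3 | 342 | 1 | 287 | (−58,743) |
| Median | WDPM | 88 | −254 | 154 | (73,104) |
| Third quartile | UNWT | 179 | −87 | 108 | (22,337) |
| Third quartile | FWT | 266 | 0 | 80 | (108,423) |
| Third quartile | WT3 | 266 | 0 | 109 | (113,418) |
| Third quartile | WDPM | 110 | −156 | 134 | (99,121) |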

Acknowledgments:

The authors would like to thank the editor, associate editor, and two anonymous reviewers whose comments greatly improved the manuscript. This work was supported in part by Grant Number R01CA129101 from the National Cancer Institute.

7. References

  1. Blackwell D, and MacQueen JB. 1973. “Ferguson Distributions via Polya Urn Schemes.” The Annals of Statistics, 1:353–355. DOI: 10.1214/aos/1176342372.
  2. CDC. 2015. National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. Available at: http://www.cdc.gov/nchs/nhanes.htm (accessed January 2015).
  3. Chen Q, Elliott MR, and Little RJA. 2010. “Bayesian Penalized Spline Model-Based Inference for Finite Population Proportions in Unequal Probability Sampling.” Survey Methodology, 36:22–34.
  4. Chen Q, Elliott MR, and Little RJA. 2012. “Bayesian Inference for Finite Population Quantiles from Unequal Probability Samples.” Survey Methodology, 38:203–215.
  5. Chen Q, Jiang X, Hedgeman E, Knutson K, Gillespie B, Hong B, Lepkowski JM, Franzblau A, Jolliet O, Adriaens P, Demond AH, and Garabrant DH. 2013. “Estimation of age- and sex-specific background human serum concentrations of PCDDs, PCDFs, and PCBs in the UMDES and NHANES populations.” Chemosphere, 91:817–823. DOI: 10.1016/j.chemosphere.2013.01.078.
  6. Dunson DR, Pillai N, and Park J-H. 2007. “Bayesian density regression.” Journal of the Royal Statistical Society, B69:163–183. DOI: 10.1111/j.1467-9868.2007.00582.x.
  7. Elliott MR. 2007. “Bayesian weight trimming for generalized linear regression models.” Survey Methodology, 33:23–24.
  8. Ericson WA. 1969. “Subjective Bayesian modeling in sampling finite populations.” Journal of the Royal Statistical Society, 31:195–234. DOI: 10.1111/j.2517-6161.1969.tb00782.x.
  9. Ferguson T. 1973. “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics, 1:209–230. DOI: 10.1214/aos/1176342360.
  10. Fienberg SE. 2011. “Bayesian Models and Methods in Public Policy and Government Settings.” Statistical Science, 26:212–226. DOI: 10.1214/10-STS331.
  11. Hoffman A, and Gelman A. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 15:1593–1623.
  12. Holt D, and Smith TMF. 1979. “Post Stratification.” Journal of the Royal Statistical Society, 142:33–46. DOI: 10.2307/2344652.
  13. Little RJA. 1993. “Post Stratification: A Modeler’s Perspective.” Journal of the American Statistical Association, 88:1001–1012. DOI: 10.1080/01621459.1993.10476368.
  14. Little RJA. 2012. “Calibrated Bayes, An Alternative Inferential Paradigm for Official Statistics.” Journal of Official Statistics, 28:309–334. Available at: https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/calibrated-bayes-an-alternative-inferential-paradigm-for-official-statistics.pdf (accessed January 2021).
  15. Lu H, and Gelman A. 2003. “A Method for Estimating Design-Based Sampling Variances For Surveys with Weighting, Poststratification, and Raking.” Journal of Official Statistics, 19:133–151. Available at: https://www.scb.se/contentassets/-ca21efb41fee47d293bbee5bf7be7fb3/a-method-for-estimating-design-based-sampling-variances-for-surveys-with-weighting-poststratification-and-raking.pdf (accessed January 2021).
  16. MacEachern SN. 1994. “Estimating normal means with a conjugate style Dirichlet process prior.” Communications in Statistics – Simulation and Computation, 23:727–741. DOI: 10.1080/03610919408813196.
  17. Murty KG. 1983. Linear programming. New York: Wiley.
  18. Potter F. 1990. “A study of procedures to identify and trim extreme sample weights.” Proceedings of the Section on Survey Research Methods, American Statistical Association: 225–230.
  19. Rubin DB. 1983. Comment on “An Evaluation of Model-Dependent and Probability Sampling Inferences in Sampling Surveys” by M.H. Hansen, W.G. Madow, and B.J. Tepping. Journal of the American Statistical Association, 78:803–805.
  20. Rubin DB. 1987. Multiple Imputation for Non-Response in Surveys. Wiley: New York.
  21. Si Y, Pillai N, and Gelman A. 2015. “Bayesian Nonparametric Weighted Sampling Inference.” Bayesian Analysis, 10:605–625. DOI: 10.1214/14-BA924.
  22. Skinner CJ, Holt D, and Smith TMF. 1989. Analysis of Complex Surveys. Wiley: New York.
  23. Wang Y. 1998. “Smoothing Spline Models with Correlated Random Errors.” Journal of the American Statistical Association, 93:341–348. DOI: 10.1080/01621459.1998.10474115.
  24. Yu K, and Moyeed RA. 2001. “Bayesian quantile regression.” Statistics and Probability Letters, 54:437–447. DOI: 10.1016/S0167-7152(01)00124-9.
  25. Zangeneh SZ, and Little RJA. 2015. “Bayesian Inference for the Finite Population Total from a Heteroscedastic Probability Proportional to Size Sample.” Journal of Survey Statistics and Methodology, 3:162–192. DOI: 10.1093/jssam/smv002.
  26. Zheng H, and Little RJA. 2003. “Penalized Spline Model-Based Estimation of the Finite Populations Total from Probability-Proportional-to-Size Samples.” Journal of Official Statistics, 19:99–117. Available at: https://www.scb.se/contentassets/-ca21efb41fee47d293bbee5bf7be7fb3/penalized-spline-model-based-estimation-of-the-finite-populations-total-from-probability-proportional-to-size-samples.pdf (accessed January 2021).
