2018 Oct 29;7(2):157–174. doi: 10.1093/jssam/smy018

Quantile Regression Analysis of Survey Data Under Informative Sampling

Sixia Chen, Yan Daniel Zhao
PMCID: PMC6505486  PMID: 31098386

Abstract

For complex survey data, the parameters in a quantile regression can be estimated by minimizing an objective function with units weighted by the original design weights. However, when the complex survey sampling design is informative (i.e., when the design weights are correlated with the study variable even after conditioning on other covariates), the efficiency of this design-weighted estimator may be improved. In this article, we propose several weight-smoothing estimators for quantile regression analysis of complex survey data collected with an informative sampling design. Our new estimators incorporate nonparametric methods for modeling the weight functions and pseudo-population bootstrap methods for variance estimation. A simulation study compares our proposed methods with the original design-based method in terms of bias, standard error, mean squared error, and confidence coverage. Our proposed estimators have smaller bias and mean squared error than does the design-based estimator. We further illustrate and compare estimators for the 1988 US National Maternal and Infant Health Survey.

Keywords: Complex survey, Informative sampling, Nonparametric, Quantile regression, Weight-smoothing

1. INTRODUCTION

Researchers often use data collected from complex survey designs to draw scientific conclusions. For instance, Nelson, Powell-Griner, Town, and Kovar (2003) compared national estimates of smoking, height, and diabetes by using the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS). Harrington, Barreira, Staiano, and Katzmarzyk (2014) used the National Health and Nutrition Examination Survey (NHANES) 2009/2010 to estimate the amount of time that the US population spent sitting by age, sex, ethnicity, education, and body mass index. It is well known that statistical analysis ignoring design features, including stratification, clustering, and unequal weighting, may lead to biased results (see Pfeffermann and Sverchkov 1999, 2003, and 2009, among others). Generalized linear and mixed models with complex survey designs were developed in studies by Chambers and Skinner (2003) and Heeringa, West, and Berglund (2010).

The sampling design is informative when sample inclusion is related to the outcome variable conditional on covariates (Fuller 2009). For such designs, survey weights are often used in regression analysis of survey data to ensure consistent estimation of parameters. For example, the sampling design of NHANES (2013–2014) was informative since the first stage strata were built by using county-level health characteristics that are correlated with the study variables of interest, given the covariates. The traditional design-based approach using the original design weights leads to unbiased estimates, but the efficiency can be improved. One approach is a likelihood-based method that maximizes the conditional sample likelihood by using the joint model of study variable and sampling indicator (see Chambers 2003; Pfeffermann and Sverchkov 2009; Pfeffermann 2011; Scott and Wild 2011, among others). A second approach replaces the original design weights by predictions from a model for the conditional distribution of the design weights given the data, as in Magee (1998), Pfeffermann and Sverchkov (1999), Beaumont (2008), Fuller (2009), and Kim and Skinner (2013). In particular, Kim and Skinner (2013) proposed optimal weight modifications compared with other methods under generalized linear models.

These statistical methods model the conditional mean values of the study variables by regression. Such models may be suboptimal when the distribution of study variables is skewed or has outliers. In such cases, quantile regression (QR) (Koenker and Bassett 1978; Koenker 2005) is an effective tool for conditional modeling, providing robustness against outliers and a more comprehensive analysis of the relationship between variables than is offered by the conditional mean model.

There is a rich literature on the use of QR for data collected by simple random sampling; for example, see He and Shao (1996), Knight (1998), Mu and He (2007), and references therein. Deaton (1997) and Cameron and Trivedi (2005) applied QR to survey data ignoring the complex sampling scheme, and their estimates may be biased if the original sampling design is informative. Only a few research articles discuss QR estimates that account for a complex survey sampling scheme, including Li, Graubard, and Korn (2010) and Geraci (2016). These articles do not discuss QR for data collected using informative sampling, which is the topic of the present article. The quantile regression coefficients are defined at the super-population level, and consistency depends on the quantile regression–model assumption. Specifically, we extended several weight-smoothing estimators to our data, including unsmoothed and smoothed optimal estimators in Kim and Skinner (2013) and estimators proposed by Beaumont (2008) and Pfeffermann and Sverchkov (1999). Some of our proposed estimators (Design-Weighted [DW], Pfeffermann-Sverchkov [PS], Unsmoothed Optimal [UOPT]) are design-consistent, even when the model (1) does not hold. Other estimators (Smoothed Design-Weighted [SDW], Smoothed Pfeffermann-Sverchkov [SPS], Smoothed Optimal [SOPT]) are consistent if the corresponding weight models are correct.

The remainder of the article is organized as follows: After preliminaries in section 2, our weight-smoothing estimators are proposed and developed in section 3. In section 4, we describe algorithms for computing our proposed estimators. Variance estimation is presented in section 5. A simulation study is described in section 6, and a real-data-based simulation study using the 1988 US National Maternal and Infant Health Survey is presented in section 7. In section 8, we conclude with a discussion.

2. QUANTILE REGRESSION AND THE DESIGN-BASED ESTIMATOR

Suppose the finite population $\mathcal{F}_N=\{(x_i,y_i,z_i),\ i=1,2,\ldots,N\}$ is generated from a super-population model $F$, where $x_i$ is a $p\times 1$ vector of covariates, $y_i$ is the study variable, and $z_i$ is the design variable, which may not be observed. We assume that $y_i$ given $x_i$ follows the QR model

$$y_i=x_i^\top\beta_\tau+\epsilon_i,\quad i=1,2,\ldots,N, \qquad (1)$$

where $\beta_\tau$ is a $p\times 1$ unknown coefficient vector, and $\epsilon_i$ is an error term such that $\Pr(\epsilon_i\le 0\mid x_i)=\tau$ for the $\tau$-th quantile ($0<\tau<1$). A complex survey sample $S$ is drawn with sampling indicator $I_i$ such that $I_i=1$ if unit $i$ is selected and 0 otherwise, $i=1,\ldots,N$. The first- and second-order inclusion probabilities are denoted as $\pi_i=E(I_i)$ for selecting unit $i$ and $\pi_{ij}=E(I_iI_j)$ for unit $i$ and unit $j$. Consequently, the corresponding design weight for unit $i$ is $d_i=\pi_i^{-1}$, which is known for units in $S$.

Following Kim and Skinner (2013), the sampling design is assumed to be informative in the sense that the design weights are functions of the covariates and the design variable; that is, $\pi_i=\pi_i(X,Z)$ with $X=(x_1,x_2,\ldots,x_N)$ and $Z=(z_1,z_2,\ldots,z_N)$. The sampling is informative if $y$ and $z$ are related, conditional on $x$. Under informative sampling, we have $\Pr(I_i=1\mid x_i,y_i)\neq\Pr(I_i=1\mid x_i)$ (i.e., the selection of unit $i$ depends not only on the covariates but also on the study variable). The design-weighted (DW) estimator of the QR coefficients is

$$\hat\beta_{\tau,d}=\arg\min_{\beta}\sum_{i\in S}d_i\,\rho_\tau(y_i-x_i^\top\beta),$$

where $\rho_\tau(u)=u\{\tau-I(u<0)\}$. By an argument similar to that of Koenker (2005) and after some algebra, it can be shown that $\hat\beta_{\tau,d}$ is the solution of the following estimating equation:

$$\hat U_d(\beta)=\sum_{i=1}^{N}I_id_i\{\tau-I(y_i-x_i^\top\beta<0)\}x_i=0. \qquad (2)$$

The estimator $\hat\beta_{\tau,d}$ is consistent for $\beta_\tau$, by an argument similar to that of Wang and Opsomer (2011). Its efficiency may be further improved by modeling the design weights as described in the next section.
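As an illustration of the design-weighted objective, the following minimal Python sketch minimizes the weighted check loss for the special case of one covariate plus an intercept. It is not the authors' implementation: the function names are hypothetical, and the brute-force search (trying every line through two sampled points) is only practical for small samples. It relies on the fact that the objective is piecewise linear and convex, so a minimizer exists among lines that interpolate two observations with distinct covariate values.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def dw_objective(beta, x, y, d, tau):
    """Design-weighted objective: sum_i d_i * rho_tau(y_i - b0 - b1 * x_i)."""
    b0, b1 = beta
    return sum(di * check_loss(yi - b0 - b1 * xi, tau)
               for xi, yi, di in zip(x, y, d))


def dw_quantile_line(x, y, d, tau):
    """Exact minimiser for intercept + one covariate (small n only).

    Tries every line through two sampled points with distinct x values
    and keeps the one with the smallest weighted check loss.
    """
    best, best_val = None, float("inf")
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # the two points do not determine a regression line
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        val = dw_objective((b0, b1), x, y, d, tau)
        if val < best_val:
            best, best_val = (b0, b1), val
    return best


# Toy data: four points on y = 1 + 2x and one outlier; equal weights.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]
d = [1.0] * 5
b0_hat, b1_hat = dw_quantile_line(x, y, d, tau=0.5)
```

With unequal weights `d`, the same search gives the design-weighted fit; for realistic sample sizes a linear-programming or interior-point quantile regression routine would be used instead.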

3. PROPOSED METHODS

We propose five new weight-smoothing estimators of QR coefficient $\beta_\tau$ in generalized linear models. Specifically, we consider estimators that satisfy the following estimating equation:

$$\hat U_w(\beta)=\sum_{i=1}^{N}I_iw_i\{\tau-I(y_i-x_i^\top\beta<0)\}x_i=0, \qquad (3)$$

where the weights $d_i$ in (2) are replaced by new weights $w_i$ chosen to improve efficiency. All of the weight-smoothing methods were initially developed for regression analysis of mean values of the study variable. We adapt these methods to our QR problem.

3.1 Smoothed Design-Weight (SDW) Estimator

Beaumont (2008) used a smoothing weight $E(d_i\mid y_i,I_i=1)$ to estimate the population mean of $y$. Kim and Skinner (2013) extended the idea by using $w_i=\tilde d_{i,x,y}=E(d_i\mid x_i,y_i,I_i=1)$ in the context of linear regression to obtain regression coefficient estimates. For our QR analysis, we use the same weights in (3) and denote the corresponding estimator as $\hat\beta_{\tau,SDW}$. As in Kim and Skinner (2013), we can show that $\hat\beta_{\tau,SDW}$ is consistent and $V(\hat\beta_{\tau,SDW})<V(\hat\beta_{\tau,d})$ if the conditional expectation $E(d_i\mid x_i,y_i,I_i=1)$ is correctly modeled.

In general, $\tilde d_{i,x,y}$ in (3) is unknown. To estimate $\tilde d_{i,x,y}$, one can use a parametric model, such as a linear or nonlinear regression method, or a nonparametric model, such as splines. However, the parametric model approach is vulnerable to model misspecification, and the nonparametric model approach is subject to the well-known curse of dimensionality if the dimension of covariate $x$ is large. Instead, we fit the following generalized additive model (GAM) to estimate $\tilde d_{i,x,y}$ (Hastie and Tibshirani 1990):

$$\log(d_i-1)=g_0+\sum_{t=1}^{p}g_t(x_{it})+g_{p+1}(y_i)+e_i,\quad i\in S, \qquad (4)$$

where $x_{it}$ is the $t$-th variable in $x_i$, $g_0$ is an unknown parameter, $g_t$, $t=1,\ldots,p+1$, are unknown functions that satisfy certain regularity conditions, and $e_i$ is assumed to have a normal distribution with mean 0 and variance $\sigma^2$. Model (4) is quite general and can be easily extended to more general cases with unequal variance and non-Gaussian exponential family distributions. For simplicity, we only consider (4) with lower-order spline functions and Gaussian errors with constant variance. After obtaining estimators $\hat g_0$, $\hat g_t$, $t=1,\ldots,p+1$, and $\hat\sigma^2$, we estimate $\tilde d_{i,x,y}$ by using $\hat{\tilde d}_{i,x,y}=1+\exp\{\hat g_0+\sum_{t=1}^{p}\hat g_t(x_{it})+\hat g_{p+1}(y_i)+\hat\sigma^2/2\}$.
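The back-transformation from the fitted model for log(d − 1) to a smoothed weight can be sketched in a few lines of Python. This is only an illustration under a loud simplification: a linear least-squares fit in one covariate and the outcome stands in for the spline-based GAM, and the function names `solve_linear` and `smoothed_weights` are hypothetical. The final step, 1 + exp(fitted + σ̂²/2), is the lognormal mean correction described in the text.

```python
import math


def solve_linear(a, b):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def smoothed_weights(x, y, d):
    """Smoothed weights 1 + exp(fitted + sigma2 / 2).

    Fits log(d_i - 1) = g0 + g1 * x_i + g2 * y_i + e_i by least squares.
    Assumption of this sketch: the linear terms replace the smooth
    functions g_t of the GAM, and every d_i exceeds 1.
    """
    rows = [(1.0, xi, yi) for xi, yi in zip(x, y)]
    t = [math.log(di - 1.0) for di in d]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    xty = [sum(r[a] * ti for r, ti in zip(rows, t)) for a in range(3)]
    g = solve_linear(xtx, xty)
    fitted = [sum(gj * rj for gj, rj in zip(g, r)) for r in rows]
    resid = [ti - fi for ti, fi in zip(t, fitted)]
    sigma2 = sum(e * e for e in resid) / (len(t) - 3)  # residual variance
    return [1.0 + math.exp(fi + sigma2 / 2.0) for fi in fitted]


# Toy check: weights that follow the working model exactly are reproduced.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 0.5, 2.0, 1.5, 3.5, 2.0]
d = [1.0 + math.exp(0.5 + 0.2 * xi - 0.3 * yi) for xi, yi in zip(x, y)]
d_smooth = smoothed_weights(x, y, d)
```

Dropping the `y` column from the fit gives the analogous estimate of E(d | x, I = 1), which is the quantity needed for the PS-type weights.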

3.2 Unsmoothed Pfeffermann-Sverchkov (PS) Estimator and Smoothed Pfeffermann-Sverchkov (SPS) Estimator

Pfeffermann and Sverchkov (1999) proposed weights $w_i=d_i\hat{\tilde d}_{i,x}^{-1}$, where $\hat{\tilde d}_{i,x}=\hat E(d_i\mid x_i,I_i=1)$, to produce efficient and consistent estimates of linear regression coefficients. We propose to obtain $\hat{\tilde d}_{i,x}$ by using a similar technique to that used to obtain $\hat{\tilde d}_{i,x,y}$ in section 3.1. The extension to quantile regression is trivial, and we denote the corresponding estimator as $\hat\beta_{\tau,PS}$. Note that $\hat\beta_{\tau,PS}$ is consistent even if the model $E(d_i\mid x_i,I_i=1)$ is misspecified (see the justification for the consistency of the UOPT estimator).

To further improve efficiency, Pfeffermann and Sverchkov (1999) proposed weights $w_i=\hat{\tilde d}_{i,x,y}\hat{\tilde d}_{i,x}^{-1}$, which yield a consistent and more efficient estimator if the weight model $E(d_i\mid x_i,y_i,I_i=1)$ is correctly specified. We denote this SPS estimator as $\hat\beta_{\tau,SPS}$.

The SPS estimator minimizes the following prediction distance function:

$$Q(\beta)=\int\rho_\tau(y-x^\top\beta)f(y\mid x)\,dy.$$

Because

$$f(y\mid x,I=1)=f(y\mid x)\frac{\Pr(I=1\mid x,y)}{\Pr(I=1\mid x)}$$

and

$$E(d\mid x,y,I=1)=\frac{1}{E(\pi\mid x,y)},\qquad E(d\mid x,I=1)=\frac{1}{E(\pi\mid x)},$$

a consistent estimator can be obtained by solving (3) with $w_i=E(d_i\mid x_i,y_i,I_i=1)\{E(d_i\mid x_i,I_i=1)\}^{-1}$.
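The step from these identities to the SPS weights can be written out as follows. This is a sketch of the reasoning rather than a full proof; it assumes, in addition to the displayed identities, that $\Pr(I=1\mid x,y)=E(\pi\mid x,y)$ and $\Pr(I=1\mid x)=E(\pi\mid x)$.

```latex
\begin{aligned}
Q(\beta)
&= \int \rho_\tau(y - x^\top\beta)\, f(y \mid x)\, dy \\
&= \int \rho_\tau(y - x^\top\beta)\, f(y \mid x, I=1)\,
   \frac{\Pr(I=1 \mid x)}{\Pr(I=1 \mid x, y)}\, dy \\
&= E\left\{ \rho_\tau(y - x^\top\beta)\,
   \frac{E(d \mid x, y, I=1)}{E(d \mid x, I=1)} \,\middle|\, x,\ I=1 \right\}.
\end{aligned}
```

The last line is an expectation over the sample distribution of $y$ given $x$, so its empirical version is a sum over sampled units with each unit weighted by the ratio of the two conditional expectations of the design weight.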

3.3 Unsmoothed and Smoothed Optimal (UOPT and SOPT) Estimators

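In this section, we propose two novel optimal weight modification estimators. Under the correct weight models, one will be more efficient than $\hat\beta_{\tau,PS}$, and the other will be more efficient than $\hat\beta_{\tau,SDW}$ and $\hat\beta_{\tau,SPS}$. We assume $\pi_i=\pi(x_i,z_i)$, and the sampling design is Poisson; Kim and Skinner (2013) also made this assumption to derive the optimal weight in linear regression models.

Consider a class of estimators that solves (3) with $w_i=d_iq(x_i)$. The UOPT estimator is obtained by choosing $q_i=q(x_i)$ to minimize the variance of the following class of estimators:

$$\hat\beta_{\tau,q}=\arg\min_{\beta}\sum_{i\in S}d_iq_i\,\rho_\tau(y_i-x_i^\top\beta),$$

or equivalently as the solution of the following estimating equations:

$$\hat U_{dq}(\beta)=\sum_{i=1}^{N}I_id_iq_i\{\tau-I(y_i-x_i^\top\beta<0)\}x_i=0. \qquad (5)$$

According to Koenker (2005), we have $E\{\tau-I(y_i-x_i^\top\beta_\tau<0)\mid x_i\}=0$, so

$$\begin{aligned}
E\{\hat U_{dq}(\beta_\tau)\}
&=E\Big[\sum_{i=1}^{N}I_id_iq_i\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\Big]\\
&=E\Big[\sum_{i=1}^{N}q_i\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\Big]\\
&=E\Big[\sum_{i=1}^{N}q_ix_iE\{\tau-I(y_i-x_i^\top\beta_\tau<0)\mid x_i\}\Big]=0. \qquad (6)
\end{aligned}$$

By the argument in Van der Vaart (1998, Chapter 5) and according to (6), it can be shown that $\hat\beta_{\tau,q}$ is consistent for $\beta_\tau$ for arbitrary $q_i=q(x_i)$, under mild regularity conditions. After some algebra, it can be shown that $\hat\beta_{\tau,q}$ has the following asymptotic expansion:

$$\hat\beta_{\tau,q}=\beta_{\tau,q}+\Big\{\sum_{i=1}^{N}q_ix_ix_i^\top f_{y|x}(x_i^\top\beta_\tau)\Big\}^{-1}\hat U_{dq}(\beta_\tau)+o_p(n^{-1/2}),$$

and the corresponding asymptotic conditional variance can be written as

$$\Big\{\sum_{i=1}^{N}q_ix_ix_i^\top f_{y|x}(x_i^\top\beta_\tau)\Big\}^{-1}\sum_{i=1}^{N}E(d_ie_i^2\mid x_i)\,q_i^2x_ix_i^\top\Big\{\sum_{i=1}^{N}q_ix_ix_i^\top f_{y|x}(x_i^\top\beta_\tau)\Big\}^{-1}, \qquad (7)$$

where $f_{y|x}(x_i^\top\beta_\tau)$ is the conditional density of $y$ given $x$ evaluated at $x_i^\top\beta_\tau$ and $e_i=\tau-I(y_i-x_i^\top\beta_\tau<0)$. Thus, $q_{i,1}^{*}=v_{i,1}^{-1}f_{y|x}(x_i^\top\beta_\tau)$ with $v_{i,1}=E(d_ie_i^2\mid x_i)$ minimizes the variance defined in (7). Specifically, we have

$$v_{i,1}=E(d_ie_i^2\mid x_i)=(\tau-1)^2E(d_i\mid x_i;\,y_i<x_i^\top\beta_\tau)\Pr(y_i<x_i^\top\beta_\tau\mid x_i)+\tau^2E(d_i\mid x_i;\,y_i\ge x_i^\top\beta_\tau)\Pr(y_i\ge x_i^\top\beta_\tau\mid x_i). \qquad (8)$$

The estimator $\hat q_{i,1}^{*}$ is discussed in section 4. Denote the estimator by using $w_i=d_i\hat q_{i,1}^{*}$ as $\hat\beta_{\tau,UOPT}$. It is easy to see that estimators $\hat\beta_{\tau,B}$ and $\hat\beta_{\tau,PS}$ belong to this class of estimators, so the UOPT estimator is more efficient.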
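Why the ratio of the conditional density to $v_{i,1}$ is the variance-minimizing choice can be seen most easily in the scalar-covariate case. The following is an illustrative sketch under that simplifying assumption ($p=1$), writing $f_i$ for $f_{y|x}(x_i\beta_\tau)$ and $v_i$ for $v_{i,1}$; it is not the general matrix argument.

```latex
% Scalar case of (7): the sandwich variance reduces to a ratio.
V(q) = \frac{\sum_{i} v_i q_i^2 x_i^2}{\bigl(\sum_{i} q_i x_i^2 f_i\bigr)^2}.

% Cauchy-Schwarz with a_i = \sqrt{v_i}\, q_i x_i and b_i = x_i f_i / \sqrt{v_i}:
\Bigl(\sum_{i} q_i x_i^2 f_i\Bigr)^2
  \le \Bigl(\sum_{i} v_i q_i^2 x_i^2\Bigr)
      \Bigl(\sum_{i} \frac{x_i^2 f_i^2}{v_i}\Bigr)
\quad\Longrightarrow\quad
V(q) \ge \Bigl(\sum_{i} \frac{x_i^2 f_i^2}{v_i}\Bigr)^{-1},

% with equality when a_i is proportional to b_i, that is, q_i \propto f_i / v_i.
```

Any positive rescaling of $q_i$ leaves the estimator unchanged, so the proportionality constant is immaterial.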

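For a more efficient estimator than $\hat\beta_{\tau,SDW}$ and $\hat\beta_{\tau,SPS}$, the SOPT estimator is obtained by minimizing variance for a class of estimators defined by $w_i=\tilde d_{i,x,y}q(x_i)$. By arguments similar to those for UOPT, the corresponding estimators are consistent, since

$$\begin{aligned}
E\{\hat U_w(\beta_\tau)\}
&=E\Big[\sum_{i=1}^{N}I_i\tilde d_{i,x,y}q(x_i)\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\Big]\\
&=E\Big[E\Big[\sum_{i=1}^{N}I_id_iq(x_i)\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\,\Big|\,x,y\Big]\Big]\\
&=E\Big[\sum_{i=1}^{N}I_id_iq(x_i)\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\Big]\\
&=E\Big[\sum_{i=1}^{N}q(x_i)\{\tau-I(y_i-x_i^\top\beta_\tau<0)\}x_i\Big]\\
&=E\Big[\sum_{i=1}^{N}q_ix_iE\{\tau-I(y_i-x_i^\top\beta_\tau<0)\mid x_i\}\Big]=0.
\end{aligned}$$

Under the correct weight models, SOPT is even more efficient than UOPT, as seen in our simulation studies. By similar techniques to those used for the UOPT estimator, it can be shown that the optimal choice of $q_i$ is $q_{i,2}^{*}=\tilde v_{i,2}^{-1}f_{y|x}(x_i^\top\beta_\tau)$ with $\tilde v_{i,2}=E(\tilde d_ie_i^2\mid x_i)$. Specifically, we have

$$\tilde v_{i,2}=E(\tilde d_ie_i^2\mid x_i)=(\tau-1)^2E(\tilde d_i\mid x_i;\,y_i<x_i^\top\beta_\tau)\Pr(y_i<x_i^\top\beta_\tau\mid x_i)+\tau^2E(\tilde d_i\mid x_i;\,y_i\ge x_i^\top\beta_\tau)\Pr(y_i\ge x_i^\top\beta_\tau\mid x_i).$$

We discuss how to obtain the estimator $\hat q_{i,2}^{*}$ of $q_{i,2}^{*}$ in section 4. We denote the estimator by using $w_i=\hat{\tilde d}_{i,x,y}\hat q_{i,2}^{*}$ as $\hat\beta_{\tau,SOPT}$.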

4. ALGORITHMS FOR COMPUTING THE UOPT AND SOPT ESTIMATORS

In this section, we discuss algorithms for computing the UOPT and SOPT estimators by the GAM approach. For the UOPT estimator, $q_{i,1}^{*}$ can be estimated by the following steps:

  1. Set $\hat\beta_\tau^{(0)}=\hat\beta_{\tau,d}$, the solution of the estimating equation (2).

  2. Estimate $\hat f_{y|x}$ by using the GAM approach and assuming a normal distribution of $y_i$, where $\hat f_{y|x}$ is the estimated conditional density of $y$ given $x$. The conditional expectation $E(y\mid x)$ is assumed to include main effects of $x_i$ and their second- and third-order interactions.

  3. Estimate $E(d_i\mid x_i;\,y_i<x_i^\top\beta_\tau)$ by predictions
    $\hat E(d_i\mid x_i;\,y_i<x_i^\top\beta_\tau)=1+\exp\{\hat g_{01}+\sum_{t=1}^{p}\hat g_{t1}(x_{it})+\hat\sigma_1^2/2\}$,
    by assuming the following generalized additive model (GAM):
    $\log(d_i-1)=g_{01}+\sum_{t=1}^{p}g_{t1}(x_{it})+e_{1i},\quad i\in S_1$,
    where $S_1$ denotes the units in $S$ such that $y_i<x_i^\top\hat\beta_\tau^{(t)}$, using techniques similar to the estimation of $\tilde d_{i,x,y}$ in section 3.1. Estimate $\hat E(d_i\mid x_i;\,y_i\ge x_i^\top\beta_\tau)$ by similar techniques. Estimate $\widehat{\Pr}(y_i<x_i^\top\beta_\tau\mid x_i)$ and $\widehat{\Pr}(y_i\ge x_i^\top\beta_\tau\mid x_i)$ by substituting the estimated density $\hat f_{y|x}$ in Step 2. Then, according to (8),
    $\hat v_{i,1}^{(t)}=(\tau-1)^2\hat E(d_i\mid x_i;\,y_i<x_i^\top\beta_\tau)\widehat{\Pr}(y_i<x_i^\top\beta_\tau\mid x_i)+\tau^2\hat E(d_i\mid x_i;\,y_i\ge x_i^\top\beta_\tau)\widehat{\Pr}(y_i\ge x_i^\top\beta_\tau\mid x_i)$.
  4. Estimate $\hat q_{i,1}^{*(t)}=\{\hat v_{i,1}^{(t)}\}^{-1}\hat f_{y|x}(x_i^\top\hat\beta_\tau^{(t)})$ and the corresponding optimal estimator $\hat\beta_\tau^{(t+1)}$ by solving (5) with $\hat q_{i,1}^{*(t)}$.

  5. Repeat steps 3 and 4 with the updated estimator $\hat\beta_\tau^{(t+1)}$ until convergence.
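For a single unit, the arithmetic in steps 3 and 4 can be sketched as below. This is a hedged illustration, not the authors' code: the function name `optimal_q` and its arguments are hypothetical, the normal working model for y given x follows step 2, and the two conditional weight expectations are taken as already estimated.

```python
from statistics import NormalDist


def optimal_q(xb, mu, sigma, tau, ed_below, ed_above):
    """Plug-in q = f / v for one unit.

    xb       : current fitted quantile x_i' beta_tau^(t)
    mu, sigma: mean and s.d. of the normal working model for y_i given x_i
    ed_below : estimate of E(d_i | x_i, y_i <  x_i' beta_tau)
    ed_above : estimate of E(d_i | x_i, y_i >= x_i' beta_tau)
    """
    work = NormalDist(mu, sigma)
    p_below = work.cdf(xb)                      # Pr(y_i < x_i' beta | x_i)
    v = ((tau - 1.0) ** 2 * ed_below * p_below
         + tau ** 2 * ed_above * (1.0 - p_below))
    return work.pdf(xb) / v                     # density at the fitted quantile over v


# Example: median, standard normal working model, constant weight of 10.
q_example = optimal_q(xb=0.0, mu=0.0, sigma=1.0, tau=0.5,
                      ed_below=10.0, ed_above=10.0)
```

In the full algorithm this quantity would be recomputed for every sampled unit at each iteration and multiplied by the design weight before re-solving the weighted quantile regression.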

For the SOPT estimator, $q_{i,2}^{*}$ is obtained as follows:

  1. Same as step 1 for the UOPT estimator.

  2. Estimate $\hat{\tilde d}_{i,x,y}$ of $\tilde d_{i,x,y}$ by the GAM approach described in section 3.1.

  3. Same as step 2 for the UOPT estimator.

  4. Same as step 3 for the UOPT estimator, with $d_i$ replaced by $\hat{\tilde d}_{i,x,y}$ in the model.

  5. Estimate $\hat q_{i,2}^{*(t)}=\{\hat v_{i,2}^{(t)}\}^{-1}\hat f_{y|x}(x_i^\top\hat\beta_\tau^{(t)})$ and the corresponding optimal estimator $\hat\beta_\tau^{(t+1)}$ by solving equation (3) with $w_i=\hat{\tilde d}_{i,x,y}\hat q_{i,2}^{*(t)}$.

  6. Repeat steps 4 and 5 until convergence.

5. VARIANCE ESTIMATION

The Taylor linearization approach to variance estimation involves tedious technical derivations, especially when the estimation procedure includes semiparametric methods. We now describe bootstrap estimates of variance for our proposed estimators, with associated confidence regions.

We apply pseudo-population bootstrap methods (Gross 1980; Booth, Butler, and Hall 1994; Conti, Marelia, and Mecatti 2017), which are simple and practical and have been shown to work effectively under high-entropy designs (Conti et al. 2017), such as Rao-Sampford (Rao 1965; Sampford 1967) and randomized proportional-to-size systematic sampling. Our proposed bootstrap method can be described as follows:

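  1. For $k=1,\ldots,N$, choose a unit $i$ from the original sample $S$ independently with probability $\pi_i^{-1}/\sum_{j\in S}\pi_j^{-1}$. If at trial $k$ the unit $i\in S$ is selected, define $(x_k^{*},y_k^{*},z_k^{*})=(x_i,y_i,z_i)$.

  2. The pseudo-bootstrap population is then $\mathcal{F}_N^{*}=\{(x_k^{*},y_k^{*},z_k^{*}),\ k=1,\ldots,N\}$. Draw a bootstrap sample $S^{*}$ from $\mathcal{F}_N^{*}$ by using the same design as the original design with first-order inclusion probabilities $nz_k^{*}/\sum_{i=1}^{N}z_i^{*}$. If $z_i$ is unknown, then one can use $\pi_k^{*}$, which is the corresponding original inclusion probability for the $k$-th element in the pseudo-bootstrap population.

  3. Obtain the bootstrap sample estimator $\hat\beta_\tau^{*}$ from $S^{*}$ by using our proposed method.

Generate $B$ bootstrap samples by the previously described procedure, with corresponding estimators $\hat\beta_\tau^{*(b)}$ for $b=1,\ldots,B$. Then, the bootstrap variance estimator is

$$\hat V^{*}=\frac{1}{B}\sum_{b=1}^{B}\big(\hat\beta_\tau^{*(b)}-\bar{\hat\beta}_\tau^{*}\big)\big(\hat\beta_\tau^{*(b)}-\bar{\hat\beta}_\tau^{*}\big)^\top,$$

where $\bar{\hat\beta}_\tau^{*}=B^{-1}\sum_{b=1}^{B}\hat\beta_\tau^{*(b)}$. The $(1-\alpha)100\%$ confidence region of $\beta_\tau$ is then

$$(\hat\beta_\tau-\beta_\tau)^\top\hat V^{*-1}(\hat\beta_\tau-\beta_\tau)<\chi^2_{p,1-\alpha},$$

where $\chi^2_{p,1-\alpha}$ is the $(1-\alpha)100$-th percentile of a $\chi^2$ distribution with $p$ degrees of freedom, $p$ being the dimension of $\beta_\tau$. Alternatively, one can use bootstrap percentiles of the statistics $(\hat\beta_\tau^{*(b)}-\beta_\tau^{*(b)})^\top\hat V^{*-1}(\hat\beta_\tau^{*(b)}-\beta_\tau^{*(b)})$ to obtain the confidence region. For inference on an individual parameter $\beta_{\tau,a}$ in $\beta_\tau$, where $a=1,\ldots,p$, one can use the following normal-based confidence interval:

$$\Big(\hat\beta_{\tau,a}-z_{1-\alpha/2}\sqrt{\hat V_{aa}^{*}},\ \hat\beta_{\tau,a}+z_{1-\alpha/2}\sqrt{\hat V_{aa}^{*}}\Big), \qquad (9)$$

where $\hat\beta_{\tau,a}$ is the $a$-th component of $\hat\beta_\tau$ and $\hat V_{aa}^{*}$ is the corresponding estimated variance of $\hat\beta_{\tau,a}$.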
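The three bootstrap steps can be sketched in Python for a scalar estimator under a Poisson design. This is a minimal sketch under stated assumptions: the function names are hypothetical, each resample reuses the units' original inclusion probabilities (the option mentioned for unknown z), and a weighted (Hájek) mean stands in for the quantile regression estimator so that the example stays short.

```python
import random


def bootstrap_variance(sample, N, estimator, B=200, seed=0):
    """Pseudo-population bootstrap variance of a scalar estimator.

    sample    : list of dicts, each holding at least the key "pi"
    N         : finite population size
    estimator : callable mapping a list of sampled units to a float
    """
    rng = random.Random(seed)
    weights = [1.0 / u["pi"] for u in sample]   # selection prob. proportional to 1/pi_i
    estimates = []
    for _ in range(B):
        # Step 1: build a pseudo-population of size N from the sample.
        pseudo_pop = rng.choices(sample, weights=weights, k=N)
        # Step 2: Poisson resample with each unit's original pi.
        boot = [u for u in pseudo_pop if rng.random() < u["pi"]]
        if boot:                                # skip the rare empty resample
            # Step 3: re-estimate on the bootstrap sample.
            estimates.append(estimator(boot))
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


def hajek_mean(units):
    """Stand-in estimator for the demo: design-weighted mean of y."""
    return (sum(u["y"] / u["pi"] for u in units)
            / sum(1.0 / u["pi"] for u in units))


# Toy sample of 20 units, each drawn with probability 0.2 from N = 100.
toy_sample = [{"y": float(i), "pi": 0.2} for i in range(20)]
v_boot = bootstrap_variance(toy_sample, N=100, estimator=hajek_mean, B=50)
```

For a vector-valued estimator, the same loop would collect coefficient vectors and form the outer-product average in place of the scalar variance.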

6. SIMULATION STUDY

We now compare the performance of all six estimators in a simulation study. We generated $M=1{,}000$ finite populations with population size $N=10{,}000$ from the following population model: $y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+(1+\psi_1x_{1i}+\psi_1x_{2i})\epsilon_i$, where $(\beta_0,\beta_1,\beta_2)=(1,1,0.5)$, covariates $(x_{1i},x_{2i})$ were independently and identically distributed (iid) with a normal distribution with means $E(x_{1i})=E(x_{2i})=0$ and variances $V(x_{1i})=V(x_{2i})=1$, and $\epsilon_i$ were iid with a standard normal distribution. The parameter $\psi_1$ was set to zero (homoscedastic variance) or 0.2 (heteroscedastic variance).

For each generated finite population of size $N$, a Poisson sample was then selected with inclusion probabilities $\pi_i=nk_i/(\sum_{j=1}^{N}k_j)$, where $n=400$ was the expected sample size and $k_i$ was the size variable such that $k_i=\{1+\exp(2.5-0.5z_i)\}^{-1}$ and $z_i\sim N(1+y_i,0.5^2)$. Note that this sampling was informative because the inclusion probabilities depended on the outcome variable $y$. Specifically, the correlation between $\epsilon$ and $\pi$ was about 0.6. Sample sizes varied but were all close to 400.
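One replicate of this data-generating process can be sketched as follows. The function names and seeds are illustrative choices, not taken from the authors' code; the model, the size variable, and the Poisson selection follow the description above.

```python
import math
import random


def simulate_population(N=10_000, psi1=0.2, beta=(1.0, 1.0, 0.5), seed=1):
    """Generate one finite population from the simulation model."""
    rng = random.Random(seed)
    pop = []
    for _ in range(N):
        x1, x2, eps = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        y = beta[0] + beta[1] * x1 + beta[2] * x2 + (1 + psi1 * x1 + psi1 * x2) * eps
        z = rng.gauss(1 + y, 0.5)                  # design variable, s.d. 0.5
        k = 1.0 / (1.0 + math.exp(2.5 - 0.5 * z))  # size variable
        pop.append({"x1": x1, "x2": x2, "y": y, "z": z, "k": k})
    return pop


def poisson_sample(pop, n=400, seed=2):
    """Poisson sampling with pi_i = n * k_i / sum_j k_j."""
    rng = random.Random(seed)
    total_k = sum(u["k"] for u in pop)
    sample = []
    for u in pop:
        pi = n * u["k"] / total_k
        if rng.random() < pi:                      # independent Bernoulli(pi_i) draw
            sample.append({**u, "pi": pi, "d": 1.0 / pi})
    return sample


population = simulate_population()
sample = poisson_sample(population)
```

Because each unit is selected independently, the realized sample size is random with expectation n, which is why the sample sizes are close to, but not exactly, 400.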

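For the population model, it can be shown that the $\tau$-th conditional quantile of $y_i$ is $Q_\tau(y_i\mid x_i)=\beta_{0\tau}+x_{1i}\beta_{1\tau}+x_{2i}\beta_{2\tau}$, where $\beta_{0\tau}=\beta_0+Q_\tau(\epsilon_i)$, $\beta_{1\tau}=\beta_1+\psi_1Q_\tau(\epsilon_i)$, and $\beta_{2\tau}=\beta_2+\psi_1Q_\tau(\epsilon_i)$, with $Q_\tau(\epsilon_i)$ as the $\tau$-th quantile of $\epsilon_i$, which could be readily calculated. Our parameters of interest were the QR coefficients $\beta_\tau=(\beta_{1\tau},\beta_{2\tau})$. In the simulation study, $\tau=0.4$ and $0.6$ were considered.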
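The true quantile regression coefficients follow directly from the standard normal quantile. A short sketch (the function name is hypothetical; the default parameter values are those of the simulation):

```python
from statistics import NormalDist


def true_qr_coefficients(tau, beta=(1.0, 1.0, 0.5), psi1=0.2):
    """Return (beta_0tau, beta_1tau, beta_2tau) for standard normal errors."""
    q = NormalDist().inv_cdf(tau)   # tau-th quantile of epsilon
    b0, b1, b2 = beta
    return (b0 + q, b1 + psi1 * q, b2 + psi1 * q)


coef_04 = true_qr_coefficients(0.4)
coef_06 = true_qr_coefficients(0.6)
```

With ψ1 = 0.2 and τ = 0.4, the slope targets are approximately 0.949 and 0.449; with ψ1 = 0 the slopes equal β1 and β2 at every quantile and only the intercept shifts.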

We compared the six estimators described previously in terms of Monte Carlo (MC) relative bias (RBias), MC relative standard error (RSE), MC relative root mean squared error (RRMSE), and MC coverage properties, including MC coverage probability (CP), standard error relative bias (SERBias), and relative average confidence interval length (RCILen). The formulas for those quantities are as follows:

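$$\mathrm{RBias}=\frac{\bar{\hat\beta}-\beta}{|\beta|},\qquad
\mathrm{RSE}=\frac{\big\{(M-1)^{-1}\sum_{m=1}^{M}(\hat\beta^{(m)}-\bar{\hat\beta})^2\big\}^{1/2}}{|\beta|},$$

$$\mathrm{RRMSE}=\frac{\big\{(\bar{\hat\beta}-\beta)^2+(M-1)^{-1}\sum_{m=1}^{M}(\hat\beta^{(m)}-\bar{\hat\beta})^2\big\}^{1/2}}{|\beta|},\qquad
\mathrm{CP}=\frac{1}{M}\sum_{m=1}^{M}I\big(LB^{(m)}<\beta<UB^{(m)}\big),$$

$$\mathrm{SERBias}=\frac{M^{-1}\sum_{m=1}^{M}\{\hat V^{(m)}\}^{1/2}-\big\{(M-1)^{-1}\sum_{m=1}^{M}(\hat\beta^{(m)}-\bar{\hat\beta})^2\big\}^{1/2}}{\big\{(M-1)^{-1}\sum_{m=1}^{M}(\hat\beta^{(m)}-\bar{\hat\beta})^2\big\}^{1/2}},\qquad
\mathrm{RCILen}=\frac{M^{-1}\sum_{m=1}^{M}\big(UB^{(m)}-LB^{(m)}\big)}{|\beta|},$$

where $\beta$ represents the true value for parameters $\beta_{1\tau}$ or $\beta_{2\tau}$, $\hat\beta^{(m)}$ represents the estimator based on the $m$-th MC sample for $\beta$, $\bar{\hat\beta}=M^{-1}\sum_{m=1}^{M}\hat\beta^{(m)}$, $LB^{(m)}$ and $UB^{(m)}$ represent the lower and upper 95 percent confidence interval bounds for $\beta$ based on the formula (9) in section 5, and $\hat V^{(m)}$ represents our proposed bootstrap variance estimator based on the $m$-th MC sample. We selected 200 bootstrap samples for variance estimation for each MC sample.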
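These summary measures translate into a few lines of Python. The sketch below uses a hypothetical function name and takes the per-replicate estimates, interval bounds, and bootstrap standard errors (the square roots of the variance estimates) as plain lists.

```python
import math


def mc_summary(estimates, truth, lower, upper, se_hats):
    """Monte Carlo performance measures for one scalar parameter."""
    M = len(estimates)
    mean = sum(estimates) / M
    # Monte Carlo standard deviation with divisor M - 1.
    sd = math.sqrt(sum((b - mean) ** 2 for b in estimates) / (M - 1))
    scale = abs(truth)
    return {
        "RBias": (mean - truth) / scale,
        "RSE": sd / scale,
        "RRMSE": math.sqrt((mean - truth) ** 2 + sd ** 2) / scale,
        "CP": sum(lo < truth < hi for lo, hi in zip(lower, upper)) / M,
        "SERBias": (sum(se_hats) / M - sd) / sd,
        "RCILen": sum(hi - lo for lo, hi in zip(lower, upper)) / (M * scale),
    }


# Tiny worked example with three replicates and true value 1.
summary = mc_summary(
    estimates=[1.0, 1.2, 0.8],
    truth=1.0,
    lower=[0.6, 0.8, 1.05],
    upper=[1.4, 1.6, 1.2],
    se_hats=[0.2, 0.2, 0.2],
)
```

In the example the third interval misses the true value, so the coverage is 2/3, and the relative standard error is 0.2 because the three estimates have standard deviation 0.2.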

The point estimation results are presented in table 1 for $\psi_1=0$ and table 2 for $\psi_1=0.2$. Under the homoscedastic scenario in table 1, all estimators had small RBias, which was consistent with the underlying theory. The DW and SDW estimators had the largest RRMSE, since the DW estimator did not use any smoothing technique to reduce variance, and the single smoothing model in the SDW estimator was not efficient. The UOPT, PS, SPS, and SOPT estimators had comparable RSE and RRMSE. To test sensitivity to model specification, we assumed an equal variance structure under the heteroscedastic scenario; the results were comparable with assuming the correct heteroscedastic variance structure. As shown in table 2, all estimators had small bias for most of the cases. For all cases, the UOPT and SOPT estimators had smaller RSE and RRMSE than did the DW and SDW estimators by a wide margin, and RSE and RRMSE at least as small as those of the PS and SPS estimators. For simplicity, we only present the results for coverage properties for the scenario in which $\psi_1=0.2$ and $\tau=0.6$ (table 3). Other scenarios had similar results. The UOPT and SOPT estimators had better or comparable coverage than other estimators for most of the cases, and their CP was close to the nominal level of 95 percent. The SERBias values for all estimators, based on our proposed bootstrap methods, were at most 8.8 percent, validating our proposed variance estimation approach. The DW and SDW estimators had larger RCILen than did other estimators. The UOPT and SOPT estimators had RCILen smaller than that of the PS and SPS estimators. We also considered the scenario where the correlation between $\epsilon$ and $\pi$ is about 0.3, and the results were similar (results not presented here).

Table 1.

Monte Carlo (MC) Relative Bias (RBias) ($\times 10^3$), Relative Standard Error (RSE) ($\times 10^3$), and Relative Root Mean Squared Error (RRMSE) ($\times 10^3$) for Six Different Methods with $\psi_1=0$.

Tau Par Method RBias RSE RRMSE
0.4 β1τ DW −1 94 94
0.4 β1τ UOPT 5 79 80
0.4 β1τ SDW 2 90 90
0.4 β1τ PS 2 79 79
0.4 β1τ SPS 5 77 77
0.4 β1τ SOPT 8 78 78
0.4 β2τ DW 7 169 169
0.4 β2τ UOPT 4 149 149
0.4 β2τ SDW 10 162 162
0.4 β2τ PS 2 148 148
0.4 β2τ SPS 6 143 143
0.4 β2τ SOPT 8 145 145
0.6 β1τ DW −1 81 81
0.6 β1τ UOPT 6 68 68
0.6 β1τ SDW 2 78 78
0.6 β1τ PS 4 68 68
0.6 β1τ SPS 8 67 67
0.6 β1τ SOPT 9 68 68
0.6 β2τ DW −1 150 150
0.6 β2τ UOPT 5 133 133
0.6 β2τ SDW 4 145 145
0.6 β2τ PS 3 131 131
0.6 β2τ SPS 8 127 127
0.6 β2τ SOPT 9 129 129

Table 2.

Monte Carlo (MC) Relative Bias (RBias) ($\times 10^3$), Relative Standard Error (RSE) ($\times 10^3$), and Relative Root Mean Squared Error (RRMSE) ($\times 10^3$) for Six Different Methods with $\psi_1=0.2$.

Tau Par Method RBias RSE RRMSE
0.4 β1τ DW 1 91 91
0.4 β1τ UOPT 9 54 55
0.4 β1τ SDW 5 89 89
0.4 β1τ PS 6 60 60
0.4 β1τ SPS 10 59 60
0.4 β1τ SOPT 11 55 56
0.4 β2τ DW 3 155 155
0.4 β2τ UOPT 9 104 104
0.4 β2τ SDW 9 152 152
0.4 β2τ PS 6 114 114
0.4 β2τ SPS 12 110 111
0.4 β2τ SOPT 12 102 102
0.6 β1τ DW −1 86 86
0.6 β1τ UOPT 7 53 54
0.6 β1τ SDW 4 83 83
0.6 β1τ PS 5 56 56
0.6 β1τ SPS 8 55 56
0.6 β1τ SOPT 10 54 55
0.6 β2τ DW −3 155 155
0.6 β2τ UOPT 9 112 113
0.6 β2τ SDW 3 152 152
0.6 β2τ PS 5 116 116
0.6 β2τ SPS 10 113 114
0.6 β2τ SOPT 13 110 111

Table 3.

Coverage Probability (CP) ($\times 10^3$), Standard Error Relative Bias (SERBias) ($\times 10^3$), and Relative Average Confidence Interval Length (RCILen) ($\times 10^3$) for Six Different Methods with $\psi_1=0.2$ and $\tau=0.6$.

Par Method CP SERBias RCILen
β1τ DW 948 −6 337
β1τ UOPT 953 88 228
β1τ SDW 939 14 330
β1τ PS 954 64 234
β1τ SPS 952 65 230
β1τ SOPT 949 66 225
β2τ DW 960 61 646
β2τ UOPT 949 65 469
β2τ SDW 959 60 631
β2τ PS 945 60 483
β2τ SPS 950 67 474
β2τ SOPT 952 74 462

7. REAL-DATA-BASED SIMULATION STUDY

We further compare our estimators on a real data set previously analyzed by Korn and Graubard (1995) and Pfeffermann and Sverchkov (1999). The data were collected as part of the 1988 US National Maternal and Infant Health Survey, which used a stratified random sample of vital records corresponding to live births, late fetal deaths, and infant deaths in the United States. The strata were constructed using the mother’s race and child’s birth weight, and the sampling fractions varied according to strata.

Pfeffermann and Sverchkov (1999) treated birth weight (measured in grams) as the study variable Y and gestational age (measured in weeks) as the predictor X. After deleting 506 observations with missing values, the finite population size was reduced to 9,447. One can fit the following linear regression model using the finite population and obtain the estimated model

$$Y_i=\beta_0+\beta_1X_i+\epsilon_i,\quad i=1,\ldots,9447, \qquad (10)$$

with $\beta_0=-2695.27$ and $\beta_1=149.04$. All regression coefficients were highly significant ($p<0.0001$). The $R^2$ value was about 0.6. The original design was informative because the strata were determined using the study variable birth weight. The correlation between $d_0$ and $\hat\epsilon$ was 0.32, where $d_0$ was the original design weight in the survey and $\hat\epsilon$ denotes the estimated residuals obtained from (10). In other words, even after adjusting for the predictor variable gestational age, a correlation remained between the design weights and the study variable.

Rather than the mean model described in (10), we estimated the quantile regression of $Y$ on $X$, with parameters of interest the $\tau$-th quantile regression coefficients $\beta_{0\tau}$ and $\beta_{1\tau}$. Before conducting the simulation, we first fitted the mean regression model, as well as quantile regression models with $\tau=0.2$, 0.4, 0.6, and 0.8. The results are presented in figure 1. From figure 1, it is clear that the quantile regression–fitted lines are not parallel, unlike the conventional homoscedastic mean regression model. This result suggests skewness in the distribution of birth weight and shows that quantile regression provides a more comprehensive analysis.

Figure 1. Mean Regression and Quantile Regression Models.

For the simulation, we chose $\tau=0.8$ for illustration. We conducted 1,000 Monte Carlo simulations to compare the six quantile regression coefficient estimators. In each simulation, one sample was generated from the finite population with an expected sample size of 400 by using the Poisson sampling design with inclusion probability $\pi_i=400\,d_{0,i}^{-1}/\sum_{j=1}^{N}d_{0,j}^{-1}$, $i=1,\ldots,N$, where $d_{0,j}$ was the design weight for the $j$-th subject in the finite population. The bootstrap size was set to 200 for variance estimation for all six estimators, and 95 percent confidence intervals were constructed.

Before comparing the performance of our proposed estimators for quantile regression coefficients, we first compared the performance of the design-based estimator of median regression coefficients and mean regression coefficients using only the linear term for the purpose of illustration. The purpose of this comparison was to show that there is an advantage in using quantile regression instead of mean regression for data with certain features. The results in table 4 show that the estimators of median regression coefficients have smaller relative bias, relative standard error, and relative root mean squared error than do the estimators of mean regression. This occurs because the distribution of residual terms displays some skewness and the Kolmogorov-Smirnov test rejects the normality assumption (p <0.05). Furthermore, there was some heteroscedastic trend in variance.

Table 4.

Monte Carlo (MC) Relative Bias (RBias), Relative Standard Error (RSE), and Relative Root Mean Squared Error (RRMSE) for Comparing Mean Regression with Median Regression.

Parameters Method RBias RSE RRMSE
β0 Mean −0.013 0.194 0.195
β0τ Median 0.000 0.116 0.116
β1 Mean 0.005 0.097 0.097
β1τ Median 0.001 0.071 0.071

Table 5 summarizes the simulation results, comparing the performance of all six estimators. For point estimation, the DW and SDW estimators had larger RBias than did the other estimators. The DW estimator had the largest RSE and RRMSE for all cases—as expected—because the efficiency of the DW estimator is improved through weight-smoothing. The SOPT estimator had the smallest RSE and RRMSE. The SPS estimator was the second-best estimator in terms of RSE and RRMSE. The UOPT estimator had similar RRMSE to that of the PS estimator, as in Kim and Skinner (2013). All confidence coverages were close to the nominal rate of 95 percent.

Table 5.

Monte Carlo (MC) Relative Bias (RBias), Relative Standard Error (RSE), and Relative Root Mean Squared Error (RRMSE) for Six Different Methods.

Parameters Method RBias RSE RRMSE
β0τ DW 0.063 0.285 0.292
β0τ UOPT −0.021 0.223 0.224
β0τ SDW 0.081 0.228 0.242
β0τ PS −0.043 0.213 0.218
β0τ SPS −0.001 0.199 0.199
β0τ SOPT 0.001 0.182 0.182
β1τ DW −0.027 0.128 0.131
β1τ UOPT 0.005 0.100 0.100
β1τ SDW −0.035 0.102 0.108
β1τ PS 0.018 0.097 0.098
β1τ SPS −0.002 0.089 0.089
β1τ SOPT −0.002 0.082 0.082

8. DISCUSSION

In this paper, we proposed several weight-smoothing estimators for estimating quantile regression coefficients in complex surveys under an informative sampling design. Our proposed estimators were compared in terms of point estimation and variance estimation by using both simulated data and a real-data-based simulation study. All proposed estimators had smaller standard errors than the original design-based estimator. Unsmoothed and smoothed optimal estimators showed a better balance of variance, bias, and coverage rate compared with other estimators. Smoothed estimators, based on nonparametric weight-smoothing models, outperformed unsmoothed estimators. All related R code and an example data file are posted at the following website: https://github.com/yandzhao/Quantile-Regression-of-Survey-Data, last accessed August 28, 2018. For future research, we will consider estimating quantile regression coefficients with a clustered informative sampling design.

Acknowledgments

The authors sincerely thank Professor Danny Pfeffermann and Dr. Michael Sverchkov for sharing the 1988 US National Maternal and Infant Health Survey data with us. This work was supported in part by funding provided by the National Institutes of Health, National Institute of General Medical Sciences (Grant 1 U54GM104938), an IDeA-CTR to the University of Oklahoma Health Sciences Center.

Articles from Journal of Survey Statistics and Methodology are provided here courtesy of Oxford University Press
