Quantile Regression Analysis of Survey Data Under Informative Sampling

Sixia Chen; Yan Daniel Zhao

doi:10.1093/jssam/smy018

2018 Oct 29;7(2):157–174. doi: 10.1093/jssam/smy018

PMCID: PMC6505486 PMID: 31098386

Abstract

For complex survey data, the parameters in a quantile regression can be estimated by minimizing an objective function with units weighted by the original design weights. However, when the complex survey sampling design is informative (i.e., when the design weights are correlated with the study variable even after conditioning on other covariates), the efficiency of this design-weighted estimator may be improved. In this article, we propose several weight-smoothing estimators for quantile regression analysis of complex survey data collected with an informative sampling design. Our new estimators incorporate nonparametric methods for modeling the weight functions and pseudo-population bootstrap methods for variance estimation. A simulation study compares our proposed methods with the original design-based method in terms of bias, standard error, mean squared error, and confidence coverage. Our proposed estimators have smaller bias and mean squared error than does the design-based estimator. We further illustrate and compare estimators for the 1988 US National Maternal and Infant Health Survey.

Keywords: Complex survey, Informative sampling, Nonparametric, Quantile regression, Weight-smoothing

1. INTRODUCTION

Researchers often use data collected from complex survey designs to draw scientific conclusions. For instance, Nelson, Powell-Griner, Town, and Kovar (2003) compared national estimates of smoking, height, and diabetes by using the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS). Harrington, Barreira, Staiano, and Katzmarzyk (2014) used the 2009/2010 National Health and Nutrition Examination Survey (NHANES) to estimate the amount of time that the US population spent sitting by age, sex, ethnicity, education, and body mass index. It is well known that statistical analysis ignoring design features such as stratification, clustering, and unequal weighting may lead to biased results (see Pfeffermann and Sverchkov 1999, 2003, and 2009, among others). Generalized linear and mixed models with complex survey designs were developed in studies by Chambers and Skinner (2003) and Heeringa, West, and Berglund (2010).

The sampling design is informative when sample inclusion is related to the outcome variable conditional on covariates (Fuller 2009). For such designs, survey weights are often used in regression analysis of survey data to ensure consistent estimation of parameters. For example, the sampling design of NHANES (2013–2014) was informative since the first stage strata were built by using county-level health characteristics that are correlated with the study variables of interest, given the covariates. The traditional design-based approach using the original design weights leads to unbiased estimates, but the efficiency can be improved. One approach is a likelihood-based method that maximizes the conditional sample likelihood by using the joint model of study variable and sampling indicator (see Chambers 2003; Pfeffermann and Sverchkov 2009; Pfeffermann 2011; Scott and Wild 2011, among others). A second approach replaces the original design weights by predictions from a model for the conditional distribution of the design weights given the data, as in Magee (1998), Pfeffermann and Sverchkov (1999), Beaumont (2008), Fuller (2009), and Kim and Skinner (2013). In particular, Kim and Skinner (2013) proposed optimal weight modifications compared with other methods under generalized linear models.

These statistical methods model the conditional mean values of the study variables by regression. Such models may be suboptimal when the distribution of study variables is skewed or has outliers. In such cases, quantile regression (QR) (Koenker and Bassett 1978; Koenker 2005) is an effective tool for conditional modeling, providing robustness against outliers and a more comprehensive analysis of the relationship between variables than is offered by the conditional mean model.

There is a rich literature on the use of QR for data collected by simple random sampling; for examples, see He and Shao (1996), Knight (1998), Mu and He (2007), and references therein. Deaton (1997) and Cameron and Trivedi (2005) applied QR to survey data ignoring the complex sampling scheme, and their estimates may be biased if the original sampling design is informative. Only a few research articles discuss QR estimates that account for a complex survey sampling scheme, including Li, Graubard, and Korn (2010) and Geraci (2016). These articles do not discuss QR for data collected using informative sampling, which is the topic of the present article. The quantile regression coefficients are defined at the super-population level, and consistency depends on the quantile regression–model assumption. Specifically, we extended several weight-smoothing estimators to our data, including unsmoothed and smoothed optimal estimators in Kim and Skinner (2013) and estimators proposed by Beaumont (2008) and Pfeffermann and Sverchkov (1999). Some of the estimators we consider (Design-Weighted [DW], Pfeffermann-Sverchkov [PS], Unsmoothed Optimal [UOPT]) are design-consistent, even when the model (1) does not hold. Other estimators (Smoothed Design-Weighted [SDW], Smoothed Pfeffermann-Sverchkov [SPS], Smoothed Optimal [SOPT]) are consistent if the corresponding weight models are correct.

The remainder of the article is organized as follows: After preliminaries in section 2, our weight-smoothing estimators are proposed and developed in section 3. In section 4, we describe algorithms for computing our proposed estimators. Variance estimation is presented in section 5. A simulation study is described in section 6, and a real-data-based simulation study using the 1988 US National Maternal and Infant Health Survey is presented in section 7. In section 8, we conclude with a discussion.

2. QUANTILE REGRESSION AND THE DESIGN-BASED ESTIMATOR

Suppose the finite population $F_{N} = \{(x_{i}, y_{i}, z_{i}), i = 1, 2, \dots, N\}$ is generated from a super-population model $F$, where $x_{i}$ is a $p \times 1$ vector of covariates, $y_{i}$ is the study variable, and $z_{i}$ is the design variable, which may not be observed. We assume that $y_{i}$ given $x_{i}$ follows the QR model

y_{i} = x_{i}^{'} β_{τ} + ϵ_{i}, i = 1, 2, \dots, N,

(1)

where $β_{τ}$ is a $p \times 1$ unknown coefficient vector, and $ϵ_{i}$ is an error term such that $\Pr (ϵ_{i} \leq 0 | x_{i}) = τ$ for the $τ$-th quantile $(0 < τ < 1)$. A complex survey sample $S$ is drawn with sampling indicator $I_{i}$ such that $I_{i} = 1$ if unit $i$ is selected and 0 otherwise, $i = 1, \dots, N$. The first- and second-order inclusion probabilities are denoted as $π_{i} = E (I_{i})$ for selecting unit $i$ and $π_{i j} = E (I_{i} I_{j})$ for selecting both unit $i$ and unit $j$. Consequently, the corresponding design weight for unit $i$ is $d_{i} = π_{i}^{- 1}$, which is known for units in $S$.

Following Kim and Skinner (2013), the sampling design is assumed to be informative in the sense that the design weights are functions of the covariates and the design variable; that is, $π_{i} = π_{i} (X, Z)$ with $X = (x_{1}, x_{2}, \dots, x_{N})$ and $Z = (z_{1}, z_{2}, \dots, z_{N})$. The sampling is informative if $y$ and $z$ are related, conditional on $x$. Under informative sampling, we have $\Pr (I_{i} = 1 | x_{i}, y_{i}) \neq \Pr (I_{i} = 1 | x_{i})$ (i.e., the selection of unit $i$ depends not only on the covariates but also on the study variable). The design-weighted (DW) estimator of the QR coefficients is

{\hat{β}}_{τ, d} = \arg \min_{β} \sum_{i \in S} d_{i} ρ_{τ} (y_{i} - x_{i}^{'} β),

where $ρ_{τ} (u) = u \{τ - I (u < 0)\}$. By an argument similar to that of Koenker (2005) and after some algebra, it can be shown that ${\hat{β}}_{τ, d}$ is the solution of the following estimating equation:

{\hat{U}}_{d} (β) = \sum_{i = 1}^{N} I_{i} d_{i} \{τ - I (y_{i} - x_{i}^{'} β < 0)\} x_{i} = 0.

(2)

The estimator ${\hat{β}}_{τ, d}$ is consistent for estimating $β_{τ},$ by an argument similar to that of Wang and Opsomer (2011). Its efficiency may be further improved by modeling the design weights as described in the next section.
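The design-weighted objective above can be illustrated for the special case of an intercept plus a single covariate. The following is a minimal sketch, not the authors' implementation. It relies on the standard linear-programming property that, when the design matrix has full column rank, some minimizer of the weighted check loss interpolates at least $p$ sample points, so for $p = 2$ it suffices to search over lines through pairs of points. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def weighted_objective(a, b, x, y, d, tau):
    """Design-weighted objective: sum_i d_i * rho_tau(y_i - a - b * x_i)."""
    return sum(di * check_loss(yi - a - b * xi, tau)
               for xi, yi, di in zip(x, y, d))


def dw_quantile_line(x, y, d, tau):
    """Exhaustive search over lines through pairs of points with distinct x.

    Brute force (cubic cost in the sample size), so it is only meant
    for small illustrative samples.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # such a pair does not define a regression line
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = weighted_objective(a, b, x, y, d, tau)
        if best is None or loss < best[2]:
            best = (a, b, loss)
    return best


# Toy sample: four points on y = 1 + 2x and one gross outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]
d = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical design weights
a_hat, b_hat, loss = dw_quantile_line(x, y, d, tau=0.5)
```

In this toy sample the median fit stays on the line through the four regular points despite the outlier. Passing any other positive weights in place of `d` gives the corresponding weighted fit.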

3. PROPOSED METHODS

We propose five new weight-smoothing estimators of QR coefficient $β_{τ}$ in generalized linear models. Specifically, we consider estimators that satisfy the following estimating equation:

{\hat{U}}_{w} (β) = \sum_{i = 1}^{N} I_{i} w_{i} \{τ - I (y_{i} - x_{i}^{'} β < 0)\} x_{i} = 0,

(3)

where the weights d_i in (2) are replaced by new weights w_i chosen to improve efficiency. All of the weight-smoothing methods were initially developed for regression analysis of mean values of the study variable. We adapt these methods to our QR problem.

3.1 Smoothed Design-Weight (SDW) Estimator

Beaumont (2008) used a smoothing weight $E (d_{i} | y_{i}, I_{i} = 1)$ to estimate the population mean of $y$. Kim and Skinner (2013) extended the idea by using $w_{i} = {\tilde{d}}_{i, x, y} = E (d_{i} | x_{i}, y_{i}, I_{i} = 1)$ in the context of linear regression to obtain regression coefficient estimates. For our QR analysis, we use the same weights in (3) and denote the corresponding estimator as ${\hat{β}}_{τ, S D W}$. As in Kim and Skinner (2013), we can show that ${\hat{β}}_{τ, S D W}$ is consistent and $V ({\hat{β}}_{τ, S D W}) < V ({\hat{β}}_{τ, d})$ if the conditional expectation $E (d_{i} | x_{i}, y_{i}, I_{i} = 1)$ is correctly modeled.

In general, ${\tilde{d}}_{i, x, y}$ in (3) is unknown. To estimate ${\tilde{d}}_{i, x, y}$ , one can use a parametric model, such as a linear or nonlinear regression method, or a nonparametric model, such as splines. However, the parametric model approach is vulnerable to model misspecification, and the nonparametric model approach is subject to the well-known curse of dimensionality if the dimension of covariate x is large. Instead, we fit the following generalized additive model (GAM) to estimate ${\tilde{d}}_{i, x, y}$ (Hastie and Tibshirani 1990).

\log (d_{i} - 1) = g_{0} + \sum_{t = 1}^{p} g_{t} (x_{i t}) + g_{p + 1} (y_{i}) + e_{i}, i \in S,

(4)

where $x_{i t}$ is the $t$-th variable in $x_{i},$ $g_{0}$ is an unknown parameter, $g_{t}, t = 1, \dots, p + 1,$ are unknown functions that satisfy certain regularity conditions, and $e_{i}$ is assumed to have a normal distribution with mean 0 and variance $σ^{2}$. Model (4) is quite general and can be easily extended to cases with unequal variance and non-Gaussian exponential family distributions. For simplicity, we only consider (4) with lower-order spline functions and Gaussian errors with constant variance. After obtaining estimators ${\hat{g}}_{0},$ ${\hat{g}}_{t}, t = 1, \dots, p + 1$, and ${\hat{σ}}^{2},$ we estimate ${\tilde{d}}_{i, x, y}$ by ${\hat{\tilde{d}}}_{i, x, y} = 1 + \exp ({\hat{g}}_{0} + \sum_{t = 1}^{p} {\hat{g}}_{t} (x_{i t}) + {\hat{g}}_{p + 1} (y_{i}) + {\hat{σ}}^{2} / 2)$.
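The back-transformation at the end of this subsection can be sketched as follows. For brevity, the sketch swaps the spline-based GAM in (4) for an ordinary least-squares line in $y_{i}$ alone; that simplification, the function names, and the toy weights are assumptions made purely for illustration.

```python
import math


def fit_line(u, t):
    """Ordinary least squares of t on u; returns (intercept, slope, sigma2)."""
    n = len(u)
    u_bar = sum(u) / n
    t_bar = sum(t) / n
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (ti - t_bar) for ui, ti in zip(u, t))
    slope = sxy / sxx
    intercept = t_bar - slope * u_bar
    rss = sum((ti - intercept - slope * ui) ** 2 for ui, ti in zip(u, t))
    sigma2 = rss / (n - 2)  # residual variance estimate
    return intercept, slope, sigma2


def smoothed_weights(y, d):
    """Estimate E(d_i | y_i, I_i = 1) under a log-normal working model.

    Working model (a linear stand-in for the GAM in (4)):
        log(d_i - 1) = g0 + g1 * y_i + e_i,  e_i ~ N(0, sigma2).
    Back-transform: 1 + exp(g0_hat + g1_hat * y_i + sigma2_hat / 2).
    """
    t = [math.log(di - 1.0) for di in d]  # requires d_i > 1
    g0, g1, sigma2 = fit_line(y, t)
    return [1.0 + math.exp(g0 + g1 * yi + sigma2 / 2.0) for yi in y]


# Hypothetical sample in which the weights are an exact function of y.
y = [0.0, 1.0, 2.0, 3.0]
d = [1.0 + math.exp(0.5 - 0.3 * yi) for yi in y]
d_tilde = smoothed_weights(y, d)
```

The term $σ^{2} / 2$ is the usual log-normal mean correction, since $E \{\exp (e_{i})\} = \exp (σ^{2} / 2)$ for Gaussian $e_{i}$. With noise-free weights, as in this toy sample, the smoothed weights reproduce the original ones.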

3.2 Unsmoothed Pfeffermann-Sverchkov (PS) Estimator and Smoothed Pfeffermann-Sverchkov (SPS) Estimator

Pfeffermann and Sverchkov (1999) proposed weights $w_{i} = d_{i} {\hat{\tilde{d}}}_{i, x}^{- 1}$ where ${\hat{\tilde{d}}}_{i, x} = \hat{E} (d_{i} | x_{i}, I_{i} = 1)$ to produce efficient and consistent estimates of linear regression coefficients. We propose to obtain ${\hat{\tilde{d}}}_{i, x}$ by using a similar technique to that used to obtain ${\hat{\tilde{d}}}_{i, x, y}$ in section 3.1. The extension to quantile regression is trivial, and we denote the corresponding estimator as ${\hat{β}}_{τ, P S} .$ Note that ${\hat{β}}_{τ, P S}$ is consistent even if the model $E (d_{i} | x_{i}, I_{i} = 1)$ is misspecified (see the justification for the consistency of UOPT estimator).

To further improve efficiency, Pfeffermann and Sverchkov (1999) proposed weights $w_{i} = {\hat{\tilde{d}}}_{i, x, y} {\hat{\tilde{d}}}_{i, x}^{- 1}$ , which yield a consistent and a more efficient estimator if the weight model $E (d_{i} | x_{i}, y_{i}, I_{i} = 1)$ is correctly specified. We denote this SPS estimator as ${\hat{β}}_{τ, S P S} .$

The SPS estimator minimizes the following prediction distance function

Q (β) = \int ρ_{τ} (y - x' β) f (y | x) d y .

Because

f (y | x, I = 1) = f (y | x) \frac{\Pr (I = 1 | x, y)}{\Pr (I = 1 | x)}

and

E (d | x, y, I = 1) = \frac{1}{E (π | x, y)}, E (d | x, I = 1) = \frac{1}{E (π | x)},

a consistent estimator can be obtained by solving (3) with $w_{i} = E (d_{i} | x_{i}, y_{i}, I_{i} = 1) \{E (d_{i} | x_{i}, I_{i} = 1)\}^{- 1}$.
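The two identities used above follow from a short conditioning argument. It is sketched here under the assumption, implicit in the setup of section 2, that $\Pr (I = 1 | π, x, y) = π$:

```latex
\begin{aligned}
\Pr(I = 1 \mid x, y)
  &= E\{ E(I \mid \pi, x, y) \mid x, y \} = E(\pi \mid x, y), \\
E(d \mid x, y, I = 1)
  &= \frac{E(d I \mid x, y)}{\Pr(I = 1 \mid x, y)}
   = \frac{E(\pi^{-1} \pi \mid x, y)}{E(\pi \mid x, y)}
   = \frac{1}{E(\pi \mid x, y)}, \\
\frac{E(d \mid x, y, I = 1)}{E(d \mid x, I = 1)}
  &= \frac{E(\pi \mid x)}{E(\pi \mid x, y)}
   = \frac{\Pr(I = 1 \mid x)}{\Pr(I = 1 \mid x, y)}
   = \frac{f(y \mid x)}{f(y \mid x, I = 1)} .
\end{aligned}
```

Weighting each sampled unit by this ratio therefore converts expectations under the sample density $f (y | x, I = 1)$ into expectations under the population density $f (y | x)$, which is why the weighted estimating equation targets the minimizer of $Q (β)$.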

3.3 Unsmoothed and Smoothed Optimal (UOPT and SOPT) Estimators

In this section, we propose two novel optimal weight modification estimators. Under the correct weight models, one will be more efficient than ${\hat{β}}_{τ, P S}$ , and the other will be more efficient than ${\hat{β}}_{τ, S D W}$ and ${\hat{β}}_{τ, S P S} .$ We assume $π_{i} = π (x_{i}, z_{i})$ , and the sampling design is Poisson; Kim and Skinner (2013) also made this assumption to derive the optimal weight in linear regression models.

Consider a class of estimators that solves (3) with $w_{i} = d_{i} q (x_{i}) .$ The UOPT estimator is obtained by choosing $q_{i} = q (x_{i})$ to minimize the variance of the following class of estimators:

{\hat{β}}_{τ, q} = \arg \min_{β} \sum_{i \in S} d_{i} q_{i} ρ_{τ} (y_{i} - x_{i}^{'} β),

or equivalently as the solution of the following estimating equations:

{\hat{U}}_{d q} (β) = \sum_{i = 1}^{N} I_{i} d_{i} q_{i} \{τ - I (y_{i} - x_{i}^{'} β < 0)\} x_{i} = 0.

(5)

According to Koenker (2005), we have $E \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0) | x_{i}\} = 0,$ so

\begin{matrix} E \{{\hat{U}}_{d q} (β_{τ})\} = E [\sum_{i = 1}^{N} I_{i} d_{i} q_{i} \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i}] \\ = E [\sum_{i = 1}^{N} q_{i} \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i}] \\ = E [\sum_{i = 1}^{N} q_{i} x_{i} E \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0) | x_{i}\}] \\ = 0. \end{matrix}

(6)

By the argument in Van der Vaart (1998, Chapter 5) and according to (6), it can be shown that ${\hat{β}}_{τ, q}$ is consistent for $β_{τ}$ for arbitrary $q_{i} = q (x_{i})$ , under mild regularity conditions. After some algebra, it can be shown that ${\hat{β}}_{τ, q}$ has the following asymptotic expansion:

{\hat{β}}_{τ, q} = β_{τ} + \{\sum_{i = 1}^{N} q_{i} x_{i} x_{i}^{'} f_{y | x} (x_{i}^{'} β_{τ})\}^{- 1} {\hat{U}}_{d q} (β_{τ}) + o_{p} (n^{- 1 / 2}),

and the corresponding asymptotic conditional variance can be written as

\{\sum_{i = 1}^{N} q_{i} x_{i} x_{i}^{'} f_{y | x} (x_{i}^{'} β_{τ})\}^{- 1} \sum_{i = 1}^{N} E (d_{i} e_{i}^{2} | x_{i}) q_{i}^{2} x_{i} x_{i}^{'} \{\sum_{i = 1}^{N} q_{i} x_{i} x_{i}^{'} f_{y | x} (x_{i}^{'} β_{τ})\}^{- 1},

(7)

where $f_{y | x} (x_{i}^{'} β_{τ})$ is the conditional density of y given x evaluated at $x_{i}^{'} β_{τ}$ and $e_{i} = τ - I (y_{i} - x_{i}^{'} β_{τ} < 0) .$ Thus, $q_{i, 1}^{*} = v_{i, 1}^{- 1} f_{y | x} (x_{i}^{'} β_{τ})$ with $v_{i, 1} = E (d_{i} e_{i}^{2} | x_{i})$ minimizes the variance defined in (7). Specifically, we have

\begin{matrix} v_{i, 1} = E (d_{i} e_{i}^{2} | x_{i}) \\ = {(τ - 1)}^{2} E (d_{i} | x_{i}; y_{i} < x_{i}^{'} β_{τ}) \Pr (y_{i} < x_{i}^{'} β_{τ} | x_{i}) \\ + τ^{2} E (d_{i} | x_{i}; y_{i} \geq x_{i}^{'} β_{τ}) \Pr (y_{i} \geq x_{i}^{'} β_{τ} | x_{i}) . \end{matrix}

(8)

The estimator ${\hat{q}}_{i, 1}^{*}$ is discussed in section 4. Denote the estimator obtained by using $w_{i} = d_{i} {\hat{q}}_{i, 1}^{*}$ as ${\hat{β}}_{τ, U O P T}$. It is easy to see that the estimators ${\hat{β}}_{τ, d}$ and ${\hat{β}}_{τ, P S}$ belong to this class of estimators, so the UOPT estimator is more efficient.

For a more efficient estimator than ${\hat{β}}_{τ, S D W}$ and ${\hat{β}}_{τ, S P S},$ the SOPT estimator is obtained by minimizing variance for a class of estimators defined by $w_{i} = {\tilde{d}}_{i, x, y} q (x_{i}) .$ By arguments similar to those for UOPT, the corresponding estimators are consistent, since

\begin{matrix} E \{{\hat{U}}_{w} (β_{τ})\} = E [\sum_{i = 1}^{N} I_{i} {\tilde{d}}_{i, x, y} q (x_{i}) \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i}] \\ = E [E [\sum_{i = 1}^{N} I_{i} d_{i} q (x_{i}) \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i} | x, y]] \\ = E [\sum_{i = 1}^{N} I_{i} d_{i} q (x_{i}) \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i}] \\ = E [\sum_{i = 1}^{N} q (x_{i}) \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0)\} x_{i}] \\ = E [\sum_{i = 1}^{N} q_{i} x_{i} E \{τ - I (y_{i} - x_{i}^{'} β_{τ} < 0) | x_{i}\}] = 0. \end{matrix}

Under the correct weight models, SOPT is even more efficient than UOPT, as seen in our simulation studies. By similar techniques to those used for the UOPT estimator, it can be shown that the optimal choice of q_i is $q_{i, 2}^{*} = {\tilde{v}}_{i, 2}^{- 1} f_{y | x} (x_{i}^{'} β_{τ})$ with ${\tilde{v}}_{i, 2} = E ({\tilde{d}}_{i} e_{i}^{2} | x_{i}) .$ Specifically, we have

\begin{matrix} {\tilde{v}}_{i, 2} = E ({\tilde{d}}_{i} e_{i}^{2} | x_{i}) \\ = {(τ - 1)}^{2} E ({\tilde{d}}_{i} | x_{i}; y_{i} < x_{i}^{'} β_{τ}) \Pr (y_{i} < x_{i}^{'} β_{τ} | x_{i}) \\ + τ^{2} E ({\tilde{d}}_{i} | x_{i}; y_{i} \geq x_{i}^{'} β_{τ}) \Pr (y_{i} \geq x_{i}^{'} β_{τ} | x_{i}) . \end{matrix}

We discuss how to obtain the estimator ${\hat{q}}_{i, 2}^{*}$ of $q_{i, 2}^{*}$ in section 4. We denote the estimator by using $w_{i} = {\hat{\tilde{d}}}_{i, x, y} {\hat{q}}_{i, 2}^{*}$ as ${\hat{β}}_{τ, S O P T} .$

4. ALGORITHMS FOR COMPUTING THE UOPT AND SOPT ESTIMATORS

In this section, we discuss algorithms for computing the UOPT and SOPT estimators by the GAM approach. For the UOPT estimator, the optimal factor $q_{i, 1}^{*}$ can be estimated by the following steps:

Step 1. Set ${\hat{β}}_{τ}^{(0)} = {\hat{β}}_{τ, d}$, the solution of the estimating equation (2).
Step 2. Estimate ${\hat{f}}_{y | x}$ by using the GAM approach and assuming a normal distribution of $y_{i},$ where ${\hat{f}}_{y | x}$ is the estimated conditional density of $y$ given $x$. The conditional expectation $E (y | x)$ is assumed to include main effects of $x_{i}$ and their second- and third-order interactions.
Step 3. Estimate $E (d_{i} | x_{i}; y_{i} < x_{i}^{'} β_{τ})$ by the predictions
$\hat{E} (d_{i} | x_{i}; y_{i} < x_{i}^{'} β_{τ}) = 1 + \exp \{{\hat{g}}_{01} + \sum_{t = 1}^{p} {\hat{g}}_{t 1} (x_{i t}) + {\hat{σ}}_{1}^{2} / 2\},$
obtained under the following generalized additive model (GAM):
$\log (d_{i} - 1) = g_{01} + \sum_{t = 1}^{p} g_{t 1} (x_{i t}) + e_{1 i}, i \in S_{1},$
where $S_{1}$ denotes the units in $S$ such that $y_{i} < x_{i}^{'} {\hat{β}}_{τ}^{(t)}$, using techniques similar to those used to estimate ${\tilde{d}}_{i, x, y}$ in section 3.1. Estimate $\hat{E} (d_{i} | x_{i}; y_{i} \geq x_{i}^{'} β_{τ})$ by similar techniques. Estimate $\hat{\Pr} (y_{i} < x_{i}^{'} β_{τ} | x_{i})$ and $\hat{\Pr} (y_{i} \geq x_{i}^{'} β_{τ} | x_{i})$ by substituting the estimated density ${\hat{f}}_{y | x}$ from step 2. Then, according to (8),
$\begin{matrix} {\hat{v}}_{i, 1}^{(t)} = {(τ - 1)}^{2} \hat{E} (d_{i} | x_{i}; y_{i} < x_{i}^{'} β_{τ}) \hat{\Pr} (y_{i} < x_{i}^{'} β_{τ} | x_{i}) \\ + τ^{2} \hat{E} (d_{i} | x_{i}; y_{i} \geq x_{i}^{'} β_{τ}) \hat{\Pr} (y_{i} \geq x_{i}^{'} β_{τ} | x_{i}) . \end{matrix}$
Step 4. Compute ${\hat{q}}_{i, 1}^{* (t)} = {\hat{v}}_{i, 1}^{(t) - 1} {\hat{f}}_{y | x} (x_{i}^{'} {\hat{β}}_{τ}^{(t)})$ and obtain the corresponding optimal estimator ${\hat{β}}_{τ}^{(t + 1)}$ by solving (5) with ${\hat{q}}_{i, 1}^{* (t)}$.
Step 5. Repeat steps 3 and 4 with the updated estimator ${\hat{β}}_{τ}^{(t + 1)}$ until convergence.
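The plug-in computation behind the optimal factor in the steps above can be sketched for a single unit as follows. This is a hedged illustration: it assumes the normal working model for $y$ given $x$ mentioned in the steps, takes the two conditional weight expectations as already estimated, and uses hypothetical argument names and input values.

```python
from statistics import NormalDist


def optimal_q(tau, xb, mu, sigma, e_d_below, e_d_above):
    """Plug-in version of q*_{i,1} = f(x'beta | x) / v_{i,1} for one unit.

    tau        : quantile level
    xb         : current fitted quantile x_i' beta_hat
    mu, sigma  : mean and standard deviation of the normal working model
    e_d_below  : estimate of E(d_i | x_i; y_i <  x_i' beta)
    e_d_above  : estimate of E(d_i | x_i; y_i >= x_i' beta)
    """
    working = NormalDist(mu, sigma)
    p_below = working.cdf(xb)   # Pr(y_i <  x_i' beta | x_i)
    p_above = 1.0 - p_below     # Pr(y_i >= x_i' beta | x_i)
    # Equation (8): variance factor v_{i,1}.
    v = ((tau - 1.0) ** 2 * e_d_below * p_below
         + tau ** 2 * e_d_above * p_above)
    density = working.pdf(xb)   # f_{y|x}(x_i' beta)
    return density / v


# Hypothetical inputs for a single unit.
q_star = optimal_q(tau=0.5, xb=1.0, mu=1.0, sigma=2.0,
                   e_d_below=4.0, e_d_above=4.0)
```

In an actual run, this factor would be recomputed for every sampled unit at each iteration, and the weighted quantile regression would then be refitted with weights $d_{i} {\hat{q}}_{i, 1}^{* (t)}$.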

For the SOPT estimator, the optimal factor $q_{i, 2}^{*}$ is estimated as follows:

Step 1. Same as step 1 for the UOPT estimator.
Step 2. Obtain the estimate ${\hat{\tilde{d}}}_{i, x, y}$ of ${\tilde{d}}_{i, x, y}$ by the GAM approach described in section 3.1.
Step 3. Same as step 2 for the UOPT estimator.
Step 4. Same as step 3 for the UOPT estimator, with $d_{i}$ replaced by ${\hat{\tilde{d}}}_{i, x, y}$ in the model.
Step 5. Compute ${\hat{q}}_{i, 2}^{* (t)} = {\hat{v}}_{i, 2}^{(t) - 1} {\hat{f}}_{y | x} (x_{i}^{'} {\hat{β}}_{τ}^{(t)})$ and obtain the corresponding optimal estimator ${\hat{β}}_{τ}^{(t + 1)}$ by solving equation (3) with $w_{i} = {\hat{\tilde{d}}}_{i, x, y} {\hat{q}}_{i, 2}^{* (t)}$.
Step 6. Repeat steps 4 and 5 until convergence.

5. VARIANCE ESTIMATION

The Taylor linearization approach to variance estimation involves tedious technical derivations, especially when the estimation procedure includes semiparametric methods. We now describe bootstrap estimates of variance for our proposed estimators, with associated confidence regions.

We apply pseudo-population bootstrap methods (Gross 1980; Booth, Butler, and Hall 1994; Conti, Marella, and Mecatti 2017), which are simple and practical and have been shown to work effectively under high-entropy designs (Conti et al. 2017), such as Rao-Sampford (Rao 1965; Sampford 1967) and randomized proportional-to-size systematic sampling. Our proposed bootstrap method can be described as follows:

Step 1. For $k = 1, \dots, N,$ choose a unit $i$ from the original sample $S$ independently with probability $π_{i}^{- 1} / \sum_{j \in S} π_{j}^{- 1}$. If at trial $k$ the unit $i \in S$ is selected, define $(x_{k}^{*}, y_{k}^{*}, z_{k}^{*}) = (x_{i}, y_{i}, z_{i})$.
Step 2. The pseudo-bootstrap population is then $F_{N}^{*} = \{(x_{k}^{*}, y_{k}^{*}, z_{k}^{*}), k = 1, \dots, N\}$. Draw a bootstrap sample $S^{*}$ from $F_{N}^{*}$ by using the same design as the original design with first-order inclusion probabilities $n z_{k}^{*} / \sum_{i = 1}^{N} z_{i}^{*}$. If $z_{i}$ is unknown, then one can use $π_{k}^{*}$, which is the corresponding original inclusion probability for the $k$-th element in the pseudo-bootstrap population.
Step 3. Obtain the bootstrap sample estimator ${\hat{β}}_{τ}^{*}$ from $S^{*}$ by using our proposed method.

Generate B bootstrap samples by the previously described procedure, with corresponding estimators ${\hat{β}}_{τ}^{* (b)}$ for $b = 1, \dots, B .$ Then, the bootstrap variance estimator is:

{\hat{V}}^{*} = \frac{1}{B} \sum_{b = 1}^{B} ({\hat{β}}_{τ}^{* (b)} - {\bar{\hat{β}}}_{τ}^{*}) {({\hat{β}}_{τ}^{* (b)} - {\bar{\hat{β}}}_{τ}^{*})}^{'},

where ${\bar{\hat{β}}}_{τ}^{*} = B^{- 1} \sum_{b = 1}^{B} {\hat{β}}_{τ}^{* (b)}$. The $(1 - α) 100\%$ confidence region of $β_{τ}$ is then

{({\hat{β}}_{τ} - β_{τ})}^{'} {\hat{V}}^{* - 1} ({\hat{β}}_{τ} - β_{τ}) < χ_{p, 1 - α}^{2},

where $χ_{p, 1 - α}^{2}$ is the $(1 - α) 100$-th percentile of a $χ^{2}$ distribution with $p$ degrees of freedom, $p$ being the dimension of $β_{τ}$. Alternatively, one can use bootstrap percentiles of the statistics ${({\hat{β}}_{τ}^{* (b)} - β_{τ}^{* (b)})}^{'} {\hat{V}}^{* - 1} ({\hat{β}}_{τ}^{* (b)} - β_{τ}^{* (b)})$ to obtain the confidence region. For inference on an individual component $β_{τ, a}$ of $β_{τ}$, where $a = 1, \dots, p,$ one can use the following normal-based confidence interval:

({\hat{β}}_{τ, a} - z_{1 - α / 2} \sqrt{{\hat{V}}_{a a}^{*}}, {\hat{β}}_{τ, a} + z_{1 - α / 2} \sqrt{{\hat{V}}_{a a}^{*}}),

(9)

where ${\hat{β}}_{τ, a}$ is the a-th component of ${\hat{β}}_{τ}$ and ${\hat{V}}_{a a}^{*}$ is the corresponding estimated variance of ${\hat{β}}_{τ, a} .$
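The bootstrap procedure of this section can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes a Poisson sampling design, reuses the original inclusion probabilities in the resampling step (the option described for unknown $z_{i}$), and replaces the QR estimator with an intercept-only design-weighted quantile. All function names and the toy data are hypothetical.

```python
import random


def weighted_quantile(y, w, tau):
    """Smallest y whose cumulative weight share reaches tau (intercept-only QR)."""
    pairs = sorted(zip(y, w))
    total = sum(w)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= tau * total:
            return value
    return pairs[-1][0]


def pseudo_population_bootstrap(y, pi, N, tau, B, seed=0):
    """Bootstrap variance of a design-weighted quantile under Poisson sampling.

    y, pi : sampled values and their first-order inclusion probabilities
    N     : finite-population size
    """
    rng = random.Random(seed)
    d = [1.0 / p for p in pi]
    units = list(range(len(y)))
    estimates = []
    while len(estimates) < B:
        # Step 1: build a pseudo-population of size N, drawing sample units
        # with probability proportional to their design weights.
        pseudo = rng.choices(units, weights=d, k=N)
        # Step 2: redraw a Poisson sample, reusing each unit's original pi.
        star = [i for i in pseudo if rng.random() < pi[i]]
        if not star:
            continue  # skip an empty bootstrap sample
        # Step 3: recompute the estimator on the bootstrap sample.
        estimates.append(weighted_quantile([y[i] for i in star],
                                           [d[i] for i in star], tau))
    mean = sum(estimates) / B
    variance = sum((b - mean) ** 2 for b in estimates) / B  # divisor B, as in the text
    return variance, estimates


# Hypothetical sample of size 50 from a population of N = 500.
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
pi = [0.05 + 0.1 * rng.random() for _ in range(50)]
v_star, reps = pseudo_population_bootstrap(y, pi, N=500, tau=0.5, B=200)
point = weighted_quantile(y, [1.0 / p for p in pi], 0.5)
half = 1.959964 * v_star ** 0.5  # z_{0.975}, for a 95 percent interval as in (9)
interval = (point - half, point + half)
```

For a vector-valued estimator, the same loop would store each replicate vector and form the outer-product average to obtain ${\hat{V}}^{*}$.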

6. SIMULATION STUDY

We now compare the performance of all six estimators in a simulation study. We generated M = 1,000 finite populations with population size N = 10,000 from the following population model: $y_{i} = β_{0} + x_{1 i} β_{1} + x_{2 i} β_{2} + (1 + ψ_{1} x_{1 i} + ψ_{1} x_{2 i}) ϵ_{i},$ where $(β_{0}, β_{1}, β_{2}) = (1, - 1, - 0.5)$ , covariates $(x_{1 i}, x_{2 i})$ were independently and identically distributed (iid) with a normal distribution with means $E (x_{1 i}) = E (x_{2 i}) = 0$ and variances $V (x_{1 i}) = V (x_{2 i}) = 1$ , and ϵ_i were iid with a standard normal distribution. The parameter ψ₁ was set to zero (homoscedastic variance) or 0.2 (heteroscedastic variance).

For each generated finite population of size $N$, a Poisson sample was then selected with inclusion probabilities $π_{i} = n k_{i} / (\sum_{j = 1}^{N} k_{j}),$ where $n = 400$ was the expected sample size and $k_{i}$ was the size variable such that $k_{i} = \{1 + \exp (2.5 - 0.5 z_{i})\}^{- 1}$ and $z_{i} \sim N (1 + y_{i}, {0.5}^{2})$. Note that this sampling was informative because the inclusion probabilities depended on the outcome variable $y$. Specifically, the correlation between $ϵ$ and $π$ was about 0.6. Sample sizes varied but were all close to 400.
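The data-generating and sampling steps just described can be sketched as follows. This is a minimal illustration of one replicate; the function names and the random seed are hypothetical choices.

```python
import math
import random


def generate_population(N, psi1, rng, beta=(1.0, -1.0, -0.5)):
    """Draw one finite population of (x1, x2, y) from the simulation model."""
    pop = []
    for _ in range(N):
        x1, x2, eps = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        y = (beta[0] + beta[1] * x1 + beta[2] * x2
             + (1 + psi1 * x1 + psi1 * x2) * eps)
        pop.append((x1, x2, y))
    return pop


def inclusion_probabilities(pop, n, rng):
    """pi_i = n * k_i / sum_j k_j with k_i = 1 / (1 + exp(2.5 - 0.5 * z_i))."""
    k = []
    for _, _, y in pop:
        z = rng.gauss(1.0 + y, 0.5)  # z_i ~ N(1 + y_i, 0.5^2)
        k.append(1.0 / (1.0 + math.exp(2.5 - 0.5 * z)))
    total = sum(k)
    return [n * ki / total for ki in k]


def poisson_sample(pi, rng):
    """Independent Bernoulli(pi_i) selections; returns the sampled indices."""
    return [i for i, p in enumerate(pi) if rng.random() < p]


rng = random.Random(2018)  # arbitrary seed
population = generate_population(N=10_000, psi1=0.2, rng=rng)
pi = inclusion_probabilities(population, n=400, rng=rng)
sample = poisson_sample(pi, rng)
```

Because selections are independent, the realized sample size is random; only its expectation, the sum of the inclusion probabilities, is fixed at $n$.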

For the population model, it can be shown that the $τ$-th conditional quantile of $y_{i}$ is $Q_{τ} (y_{i} | x_{i}) = β_{0 τ} + x_{1 i} β_{1 τ} + x_{2 i} β_{2 τ},$ where $β_{0 τ} = β_{0} + Q_{τ} (ϵ_{i}),$ $β_{1 τ} = β_{1} + ψ_{1} Q_{τ} (ϵ_{i}),$ $β_{2 τ} = β_{2} + ψ_{1} Q_{τ} (ϵ_{i})$ with $Q_{τ} (ϵ_{i})$ as the $τ$-th quantile of $ϵ_{i}$, which could be readily calculated. Our parameters of interest were the QR coefficients $β_{τ} = (β_{1 τ}, β_{2 τ})$. In the simulation study, $τ = 0.4$ and 0.6 were considered.
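As a worked illustration of these formulas, the true QR coefficients for each simulation setting can be computed directly from the standard normal quantile function. The helper name below is hypothetical.

```python
from statistics import NormalDist


def true_qr_coefficients(tau, psi1, beta=(1.0, -1.0, -0.5)):
    """Population QR coefficients (beta_0tau, beta_1tau, beta_2tau)."""
    q_eps = NormalDist().inv_cdf(tau)  # tau-th quantile of the N(0, 1) errors
    return (beta[0] + q_eps,
            beta[1] + psi1 * q_eps,
            beta[2] + psi1 * q_eps)


for tau in (0.4, 0.6):
    for psi1 in (0.0, 0.2):
        coefs = true_qr_coefficients(tau, psi1)
        print(tau, psi1, [round(b, 4) for b in coefs])
```

Under homoscedasticity ($ψ_{1} = 0$) only the intercept changes with $τ$, whereas under heteroscedasticity the slopes shift as well.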

We compared the six estimators described previously in terms of Monte Carlo (MC) relative bias (RBias), MC relative standard error (RSE), MC relative root mean squared error (RRMSE), and MC coverage properties, including MC coverage probability (CP), standard error relative bias (SERBias), and relative average confidence interval length (RCILen). The formulas for those quantities are as follows:

\begin{matrix} RBias = \frac{\hat{β} - β}{| β |}, \\ RSE = \frac{\{{(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{β}}^{(m)} - \hat{β})}^{2}\}^{1 / 2}}{| β |}, \\ RRMSE = \frac{\{{(\hat{β} - β)}^{2} + {(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{β}}^{(m)} - \hat{β})}^{2}\}^{1 / 2}}{| β |}, \\ CP = \frac{1}{M} \sum_{m = 1}^{M} I (L B^{(m)} < β < U B^{(m)}), \\ SERBias = \frac{M^{- 1} \sum_{m = 1}^{M} \{{\hat{V}}^{(m)}\}^{1 / 2} - \{{(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{β}}^{(m)} - \hat{β})}^{2}\}^{1 / 2}}{\{{(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{β}}^{(m)} - \hat{β})}^{2}\}^{1 / 2}}, \\ RCILen = \frac{M^{- 1} \sum_{m = 1}^{M} (U B^{(m)} - L B^{(m)})}{| β |}, \end{matrix}

where $β$ represents the true value of the parameter $β_{1 τ}$ or $β_{2 τ},$ ${\hat{β}}^{(m)}$ represents the estimator of $β$ based on the $m$-th MC sample, $\hat{β} = M^{- 1} \sum_{m = 1}^{M} {\hat{β}}^{(m)}$, $L B^{(m)}$ and $U B^{(m)}$ represent the lower and upper 95 percent confidence interval bounds for $β$ based on formula (9) in section 5, and ${\hat{V}}^{(m)}$ represents our proposed bootstrap variance estimator based on the $m$-th MC sample. We selected 200 bootstrap samples for variance estimation for each MC sample.
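The point-estimation and coverage summaries defined above can be computed as in the following sketch. SERBias is omitted for brevity, and the function name and the toy inputs are hypothetical.

```python
import math


def mc_summary(estimates, bounds, beta):
    """RBias, RSE, RRMSE, CP and RCILen for one scalar parameter.

    estimates : point estimates, one per Monte Carlo sample
    bounds    : (lower, upper) confidence limits, one pair per sample
    beta      : true parameter value (non-zero)
    """
    M = len(estimates)
    mean = sum(estimates) / M
    mc_var = sum((b - mean) ** 2 for b in estimates) / (M - 1)
    scale = abs(beta)
    return {
        "RBias": (mean - beta) / scale,
        "RSE": math.sqrt(mc_var) / scale,
        "RRMSE": math.sqrt((mean - beta) ** 2 + mc_var) / scale,
        "CP": sum(1 for lo, hi in bounds if lo < beta < hi) / M,
        "RCILen": sum(hi - lo for lo, hi in bounds) / M / scale,
    }


# Hypothetical results from M = 4 Monte Carlo samples with true beta = -1.
estimates = [-0.9, -1.1, -1.0, -1.0]
bounds = [(-1.1, -0.7), (-1.3, -0.9), (-1.2, -0.8), (-0.95, -0.6)]
summary = mc_summary(estimates, bounds, beta=-1.0)
```

Dividing by $| β |$ puts the summaries for the two slope parameters on a common relative scale.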

The point estimation results are presented in table 1 for $ψ_{1} = 0$ and table 2 for $ψ_{1} = 0.2$. Under the homoscedastic scenario in table 1, all estimators had small RBias, which was consistent with the underlying theory. The DW and SDW estimators had the largest RRMSE, since the DW estimator did not use any smoothing technique to reduce variance, and the single smoothing model in the SDW estimator was not efficient. The UOPT, PS, SPS, and SOPT estimators had comparable RSE and RRMSE. To test the sensitivity to model specification, we assumed an equal variance structure under the heteroscedastic scenario; the results were comparable with those obtained by assuming the correct heteroscedastic variance structure. As shown in table 2, all estimators had small bias for most of the cases. In all cases, the UOPT and SOPT estimators had smaller RSE and RRMSE than did the other estimators, with much larger reductions relative to the DW and SDW estimators than relative to the PS and SPS estimators. For simplicity, we only present the results for coverage properties for the scenario in which $ψ_{1} = 0.2$ and $τ = 0.6$ (table 3). Other scenarios had similar results. The UOPT and SOPT estimators had better or comparable coverage than other estimators for most of the cases, and their CP was close to the nominal level of 95 percent. The SERBias values for all estimators, based on our proposed bootstrap methods, were at most 8.8 percent in absolute value, validating our proposed variance estimation approach. The DW and SDW estimators had larger RCILen than did other estimators. The UOPT and SOPT estimators had RCILen smaller than that of the PS and SPS estimators. We also considered the scenario where the correlation between $ϵ$ and $π$ is about 0.3, and the results were similar (results not presented here).

Table 1.

Monte Carlo (MC) Relative Bias (RBias) ( $\times 10^{3}$ ), Relative Standard Error (RSE) ( $\times 10^{3}$ ), and Relative Root Mean Squared Error (RRMSE) ( $\times 10^{3}$ ) for Six Different Methods with $ψ_{1} = 0$ .

Tau	Par	Method	RBias	RSE	RRMSE
0.4	$β_{1 τ}$	DW	−1	94	94
		UOPT	5	79	80
		SDW	2	90	90
		PS	2	79	79
		SPS	5	77	77
		SOPT	8	78	78
	$β_{2 τ}$	DW	7	169	169
		UOPT	4	149	149
		SDW	10	162	162
		PS	2	148	148
		SPS	6	143	143
		SOPT	8	145	145
0.6	$β_{1 τ}$	DW	−1	81	81
		UOPT	6	68	68
		SDW	2	78	78
		PS	4	68	68
		SPS	8	67	67
		SOPT	9	68	68
	$β_{2 τ}$	DW	−1	150	150
		UOPT	5	133	133
		SDW	4	145	145
		PS	3	131	131
		SPS	8	127	127
		SOPT	9	129	129

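Table 2.

Monte Carlo (MC) Relative Bias (RBias) ( $\times 10^{3}$ ), Relative Standard Error (RSE) ( $\times 10^{3}$ ), and Relative Root Mean Squared Error (RRMSE) ( $\times 10^{3}$ ) for Six Different Methods with $ψ_{1} = 0.2$ .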

Tau	Par	Method	RBias	RSE	RRMSE
0.4	$β_{1 τ}$	DW	1	91	91
		UOPT	9	54	55
		SDW	5	89	89
		PS	6	60	60
		SPS	10	59	60
		SOPT	11	55	56
	$β_{2 τ}$	DW	3	155	155
		UOPT	9	104	104
		SDW	9	152	152
		PS	6	114	114
		SPS	12	110	111
		SOPT	12	102	102
0.6	$β_{1 τ}$	DW	−1	86	86
		UOPT	7	53	54
		SDW	4	83	83
		PS	5	56	56
		SPS	8	55	56
		SOPT	10	54	55
	$β_{2 τ}$	DW	−3	155	155
		UOPT	9	112	113
		SDW	3	152	152
		PS	5	116	116
		SPS	10	113	114
		SOPT	13	110	111

Table 3.

Coverage Probability (CP) ( $\times 10^{3}$ ), Standard Error Relative Bias (SERBias) ( $\times 10^{3}$ ), and Relative Average Confidence Interval Length (RCILen) ( $\times 10^{3}$ ) for Six Different Methods with $ψ_{1} = 0.2$ and $τ = 0.6$ .

Par	Method	CP	SERBias	RCILen
$β_{1 τ}$	DW	948	−6	337
	UOPT	953	88	228
	SDW	939	14	330
	PS	954	64	234
	SPS	952	65	230
	SOPT	949	66	225
$β_{2 τ}$	DW	960	61	646
	UOPT	949	65	469
	SDW	959	60	631
	PS	945	60	483
	SPS	950	67	474
	SOPT	952	74	462

7. REAL-DATA-BASED SIMULATION STUDY

We further compare our estimators on a real data set previously analyzed by Korn and Graubard (1995) and Pfeffermann and Sverchkov (1999). The data were collected as part of the 1988 US National Maternal and Infant Health Survey, which used a stratified random sample of vital records corresponding to live births, late fetal deaths, and infant deaths in the United States. The strata were constructed using the mother’s race and child’s birth weight, and the sampling fractions varied according to strata.

Pfeffermann and Sverchkov (1999) treated birth weight (measured in grams) as the study variable Y and gestational age (measured in weeks) as the predictor X. After deleting 506 observations with missing values, the finite population size was reduced to 9,447. One can fit the following linear regression model using the finite population and obtain the estimated model

Y_{i} = β_{0} + β_{1} X_{i} + ϵ_{i}, i = 1, \dots, 9447,

(10)

with $β_{0} = - 2695.27$ and $β_{1} = 149.04$. All regression coefficients were highly significant (p < 0.0001). The $R^{2}$ value was about 0.6. The original design was informative because the strata were determined using the study variable birth weight. The correlation between $d_{0}$ and $\hat{ϵ}$ was 0.32, where $d_{0}$ is the original design weight in the survey and $\hat{ϵ}$ denotes the residuals obtained from (10). In other words, even after adjusting for the predictor variable gestational age, a correlation remained between the design weights and the study variable.

Rather than the mean model described in (10), we estimated the quantile regression of $Y$ on $X$, where the parameters of interest are the $τ$-th quantile regression coefficients $β_{0 τ}$ and $β_{1 τ}$. Before conducting the simulation, we first fitted the mean regression model, as well as quantile regression models with $τ = 0.2,$ $0.4,$ 0.6, and 0.8. The results are presented in figure 1. From figure 1, it is clear that the fitted quantile regression lines are not parallel, in contrast to what a conventional homoscedastic mean regression model would imply. This result suggests skewness in the distribution of birth weight and shows that quantile regression provides a more comprehensive analysis.

Figure 1. Mean Regression and Quantile Regression Models.
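To make the comparison of fitted quantile lines concrete, the following sketch fits a weighted quantile regression line at two quantile levels by minimizing the weighted check loss. It is a hedged illustration, not the authors' implementation: the data are synthetic and heteroscedastic, the weights `w` are hypothetical, and the exhaustive search over lines through two sample points is a simple substitute for the linear-programming solvers normally used in practice.

```python
import itertools
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def weighted_loss(beta, x, y, w, tau):
    """Weighted objective: sum_i w_i * rho_tau(y_i - b0 - b1 * x_i)."""
    return float(np.sum(w * check_loss(y - beta[0] - beta[1] * x, tau)))

def weighted_qr_line(x, y, w, tau):
    """Exact minimizer for one predictor, found by checking every line through
    two sample points (an optimum always lies on such a line). The search is
    O(n^2) in the number of lines, so it suits only small illustrative samples."""
    best_beta, best_loss = None, np.inf
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        beta = np.array([y[i] - b1 * x[i], b1])
        loss = weighted_loss(beta, x, y, w, tau)
        if loss < best_loss:
            best_beta, best_loss = beta, loss
    return best_beta

# Synthetic heteroscedastic data: the noise scale grows with x.
rng = np.random.default_rng(7)
n = 60
x = rng.uniform(30.0, 42.0, size=n)
y = -2700.0 + 150.0 * x + 20.0 * (x - 29.0) * rng.normal(size=n)
w = rng.uniform(1.0, 3.0, size=n)   # hypothetical design weights

for tau in (0.2, 0.8):
    b0, b1 = weighted_qr_line(x, y, w, tau)
    print(f"tau = {tau}: intercept = {b0:.1f}, slope = {b1:.2f}")
```

When the noise scale depends on x, the population quantile lines have different slopes, which is the kind of non-parallel pattern described for figure 1.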

For the simulation, we chose $τ = 0.8$ for illustration. We conducted 1,000 Monte Carlo simulations to compare the six quantile regression coefficient estimators. In each simulation, one sample was generated from the finite population with an expected sample size of 400 by using the Poisson sampling design with inclusion probability $π_{i} = 400 d_{0, i}^{- 1} / \sum_{j = 1}^{N} d_{0, j}^{- 1}$, $i = 1, \dots, N$, where $d_{0, j}$ is the design weight for the jth subject in the finite population. The bootstrap size was set to 200 for variance estimation for all six estimators, and 95 percent confidence intervals were constructed.
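The Poisson sampling step described above can be sketched as follows. The inclusion probabilities follow the formula in the text; the population weights `d0` are hypothetical draws, since the survey weights are not reproduced here, and the function name `poisson_sample` is ours.

```python
import numpy as np

def poisson_sample(d0, n_expected, rng):
    """Draw one Poisson sample with pi_i = n_expected * d0_i^{-1} / sum_j d0_j^{-1}."""
    inv = 1.0 / np.asarray(d0, dtype=float)
    pi = n_expected * inv / inv.sum()
    if pi.max() > 1.0:
        raise ValueError("an inclusion probability exceeds 1; reduce n_expected")
    # Each unit is selected independently with its own probability.
    selected = rng.random(pi.size) < pi
    return np.flatnonzero(selected), pi

rng = np.random.default_rng(400)
N = 9447
d0 = rng.uniform(1.0, 20.0, size=N)   # hypothetical design weights

sample_idx, pi = poisson_sample(d0, n_expected=400, rng=rng)
sample_weights = 1.0 / pi[sample_idx]  # weights for the drawn sample

print("expected sample size:", round(pi.sum(), 6))
print("realized sample size:", sample_idx.size)
```

Under Poisson sampling the realized sample size is random; only its expectation, the sum of the inclusion probabilities, equals 400.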

Before comparing our proposed estimators of the quantile regression coefficients, we first compared, for illustration, the design-based estimators of the median regression coefficients and of the mean regression coefficients, using only the linear term. This comparison shows that quantile regression can have an advantage over mean regression for data with certain features. The results in table 4 show that the estimators of the median regression coefficients have smaller relative bias, relative standard error, and relative root mean squared error than do the estimators of the mean regression coefficients. This occurs because the distribution of the residuals is somewhat skewed, and the Kolmogorov-Smirnov test rejects the normality assumption (p < 0.05). Furthermore, the residual variance showed a heteroscedastic trend.

Table 4.

Monte Carlo (MC) Relative Bias (RBias), Relative Standard Error (RSE), and Relative Root Mean Squared Error (RRMSE) for Comparing Mean Regression with Median Regression.

Parameters	Method	RBias	RSE	RRMSE
$β_{0}$	Mean	−0.013	0.194	0.195
$β_{0 τ}$	Median	0.000	0.116	0.116
$β_{1}$	Mean	0.005	0.097	0.097
$β_{1 τ}$	Median	0.001	0.071	0.071


Table 5 summarizes the simulation results, comparing the performance of all six estimators. For point estimation, the DW and SDW estimators had larger absolute RBias than did the other estimators. As expected, the DW estimator had the largest RSE and RRMSE in all cases, because weight smoothing improves on the efficiency of the DW estimator. The SOPT estimator had the smallest RSE and RRMSE, and the SPS estimator was second best on both measures. The UOPT estimator had an RRMSE similar to that of the PS estimator, as in Kim and Skinner (2013). All confidence coverages were close to the nominal rate of 95 percent.

Table 5.

Monte Carlo (MC) Relative Bias (RBias), Relative Standard Error (RSE), and Relative Root Mean Squared Error (RRMSE) for Six Different Methods.

Parameters	Method	RBias	RSE	RRMSE
$β_{0 τ}$	DW	0.063	0.285	0.292
	UOPT	−0.021	0.223	0.224
	SDW	0.081	0.228	0.242
	PS	−0.043	0.213	0.218
	SPS	−0.001	0.199	0.199
	SOPT	0.001	0.182	0.182
$β_{1 τ}$	DW	−0.027	0.128	0.131
	UOPT	0.005	0.100	0.100
	SDW	−0.035	0.102	0.108
	PS	0.018	0.097	0.098
	SPS	−0.002	0.089	0.089
	SOPT	−0.002	0.082	0.082
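The three summary measures in tables 4 and 5 are not redefined in this section. The sketch below assumes the usual Monte Carlo definitions (bias, standard deviation, and root mean squared error of the replicate estimates, each divided by the true parameter value) and checks that the rows of table 5 are consistent with the identity RRMSE² = RBias² + RSE² that these definitions imply. The function name `mc_summary` is ours.

```python
import numpy as np

def mc_summary(estimates, truth):
    """Relative bias, relative SE, and relative RMSE of Monte Carlo estimates.

    Assumed definitions (not restated in this section of the paper): each
    measure is scaled by the true value, and the SE uses the divisor n.
    """
    est = np.asarray(estimates, dtype=float)
    rbias = (est.mean() - truth) / truth
    rse = est.std() / abs(truth)
    rrmse = np.sqrt(np.mean((est - truth) ** 2)) / abs(truth)
    return rbias, rse, rrmse

# Rows of table 5 as (RBias, RSE, RRMSE): DW, UOPT, SDW, PS, SPS, SOPT,
# first for the intercept and then for the slope.
table5 = np.array([
    [0.063, 0.285, 0.292], [-0.021, 0.223, 0.224], [0.081, 0.228, 0.242],
    [-0.043, 0.213, 0.218], [-0.001, 0.199, 0.199], [0.001, 0.182, 0.182],
    [-0.027, 0.128, 0.131], [0.005, 0.100, 0.100], [-0.035, 0.102, 0.108],
    [0.018, 0.097, 0.098], [-0.002, 0.089, 0.089], [-0.002, 0.082, 0.082],
])

# RRMSE implied by the reported RBias and RSE, compared with the reported RRMSE.
implied = np.hypot(table5[:, 0], table5[:, 1])
max_gap = np.max(np.abs(implied - table5[:, 2]))
print(f"largest gap between implied and reported RRMSE: {max_gap:.4f}")
```

The remaining gaps are of the size expected from rounding the table entries to three decimals.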


8. DISCUSSION

In this paper, we proposed several weight-smoothing estimators of quantile regression coefficients for complex surveys with an informative sampling design. We compared the proposed estimators in terms of point estimation and variance estimation by using both simulated data and a real-data-based simulation study. All proposed estimators had smaller standard errors than the original design-based estimator. The unsmoothed and smoothed optimal estimators showed a better balance of variance, bias, and coverage rate than the other estimators. Smoothed estimators, based on nonparametric weight-smoothing models, outperformed unsmoothed estimators. All related R code and an example data file are posted at https://github.com/yandzhao/Quantile-Regression-of-Survey-Data (last accessed August 28, 2018). In future research, we will consider estimating quantile regression coefficients under a clustered informative sampling design.

Acknowledgments

The authors sincerely thank Professor Danny Pfeffermann and Dr. Michael Sverchkov for sharing the 1988 US National Maternal and Infant Health Survey data with us. This work was partially supported by funding from the National Institutes of Health, National Institute of General Medical Sciences (Grant 1 U54GM104938), an IDeA-CTR to the University of Oklahoma Health Sciences Center.

References

Beaumont J. F. (2008), “A New Approach to Weighting and Inference in Sample Surveys,” Biometrika, 95, 539–553.
Booth J. G., Butler R. W., Hall P. (1994), “Bootstrap Methods for Finite Populations,” Journal of the American Statistical Association, 89, 1282–1289.
Cameron A. C., Trivedi P. K. (2005), Microeconometrics: Methods and Applications, Cambridge: Cambridge University Press.
Chambers R. L. (2003), “Introduction to Part A,” in Analysis of Survey Data, eds. Chambers R. L., Skinner C. J., Chichester: Wiley.
Chambers R. L., Skinner C. J. (2003), Analysis of Survey Data, Chichester: Wiley.
Conti P. L., Marelia D., Mecatti F. (2017), “Recovering Sampling Distributions of Statistics of Finite Populations via Resampling: A Predictive Approach,” submitted.
Deaton A. (1997), The Analysis of Household Surveys: A Microeconometric Approach to Development Policy, Baltimore and London: Johns Hopkins University Press.
Fuller W. (2009), Sampling Statistics, Hoboken: Wiley.
Geraci M. (2016), “Estimation of Regression Quantiles in Complex Surveys with Data Missing at Random: An Application to Birthweight Determinants,” Statistical Methods in Medical Research, 25, 1393–1421.
Gross S. (1980), “Median Estimation in Sample Surveys,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 181–184.
Harrington D. M., Barreira T. V., Staiano A. E., Katzmarzyk P. T. (2014), “The Descriptive Epidemiology of Sitting among US Adults, NHANES 2009/2010,” Journal of Science Medicine in Sport, 17, 371–375.
Hastie T., Tibshirani R. (1990), Generalized Additive Models, New York: Chapman and Hall.
He X., Shao Q. (1996), “A General Bahadur Representation of M-Estimators and Its Application to Linear Regression with Nonstochastic Designs,” The Annals of Statistics, 24, 2608–2630.
Heeringa S. G., West B. T., Berglund P. A. (2010), Applied Survey Data Analysis, Boca Raton, FL: Taylor and Francis Group.
Kim J. K., Skinner C. J. (2013), “Weighting in Survey Analysis under Informative Sampling,” Biometrika, 100, 385–398.
Knight K. (1998), “Limiting Distribution for L1 Regression Estimators under General Conditions,” The Annals of Statistics, 26, 755–770.
Koenker R. (2005), Quantile Regression, Cambridge.
Koenker R., Bassett G. (1978), “Regression Quantiles,” Econometrica, 46, 33–50.
Korn E. L., Graubard B. I. (1995), “Examples of Differing Weighted and Unweighted Estimates from a Sample Survey,” The American Statistician, 49, 291–295.
Li Y., Graubard B. I., Korn E. L. (2010), “Application of Nonparametric Quantile Regression to Body Mass Percentile Curves from Survey Data,” Statistics in Medicine, 29, 558–572.
Magee L. (1998), “Improving Survey-Weighted Least Squares Regression,” Journal of Royal Statistical Society, Series B, 60, 115–126.
Mu Y., He X. (2007), “Power Transformation toward a Linear Regression Quantile,” Journal of the American Statistical Association, 102, 269–279.
Nelson D. E., Powell-Griner E., Town M., Kovar M. G. (2003), “A Comparison of National Estimates from the National Health Interview Survey and the Behavioral Risk Factor Surveillance System,” American Journal of Public Health, 93, 1335–1341.
Pfeffermann D. (2011), “Modelling of Complex Survey Data: Why Model? Why Is It a Problem? How Can We Approach It?” Survey Methodology, 37, 115–136.
Pfeffermann D., Sverchkov M. Y. (1999), “Parametric and Semi-Parametric Estimation of Regression Models Fitted to Survey Data,” Sankhya B, 61, 166–186.
Pfeffermann D., Sverchkov M. Y. (2003), “Fitting Generalized Linear Models under Informative Sampling,” in Analysis of Survey Data, eds. Chambers R. L., Skinner C. J., Chichester: Wiley.
Pfeffermann D., Sverchkov M. Y. (2009), “Inference under Informative Sampling,” in Handbook of Statistics 29B; Sample Surveys: Inference and Analysis, eds. Pfeffermann D., Rao C. R., Amsterdam: North Holland.
Rao J. N. K. (1965), “On Two Simple Schemes of Unequal Probability Sampling without Replacement,” Journal of the Indian Statistical Association, 3, 173–180.
Sampford M. R. (1967), “On Sampling without Replacement with Unequal Probabilities of Selection,” Biometrika, 54, 499–513.
Scott A., Wild C. (2011), “Fitting Regression Models with Response-Biased Samples,” Canadian Journal of Statistics, 39, 519–536.
Van der Vaart A. W. (1998), Asymptotic Statistics, New York: Cambridge University Press.
Wang J. Q., Opsomer J. D. (2011), “On Asymptotic Normality and Variance Estimation for Nondifferentiable Survey Estimators,” Biometrika, 98, 91–106.
