Semiparametric Inference for Data with a Continuous Outcome from a Two-Phase Probability Dependent Sampling Scheme

Haibo Zhou; Wangli Xu; Donglin Zeng; Jianwen Cai

doi:10.1111/rssb.12029

Author manuscript; available in PMC: 2015 Jan 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2013 Jul 3;76(1):197–215. doi: 10.1111/rssb.12029

PMCID: PMC3984585 NIHMSID: NIHMS553601 PMID: 24737947

Abstract

Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.

Keywords: Empirical likelihood, Missing data, Semiparametric, Probability sample

1 Introduction

Observational studies in epidemiology that relate disease outcome to individual exposures and other characteristics play a key role in understanding the determinants of diseases in humans. As all studies are conducted with a limited budget, the maximum study sizes are often restricted by the cost of the exposure assessments. Some large cohort studies, e.g., the Women's Health Initiative and the National Children's Study, could cost hundreds of millions of dollars to conduct. Cost-effective study designs for biomedical studies have always been an important research area. Among them, the biased sampling design, represented by the case-control design, has played a significant role in the development of biostatistics methodological research during the last half of the 20th century. It is often the preferred choice of study design for epidemiologic studies because of its efficiency and cost-effectiveness feature compared to cohort studies (e.g., Cornfield, 1951; Anderson, 1972; Prentice and Pyke, 1979).

The fundamental idea of the case-control design is to over-sample observations (e.g., cases) that are believed to be more informative regarding the exposure-response relationship. This basic idea has motivated research in recent years on general Outcome Dependent Sampling (ODS) for a continuous outcome (e.g., Zhou et al., 2002; Weaver and Zhou, 2005; Song, Zhou and Kosorok, 2009). The general ODS design allows investigators to selectively sample observations based on the observed values of a continuous outcome to achieve improved efficiency for a fixed sample size. The ODS design (Zhou et al., 2002) assumes that the values of the response, denoted by Y, are known for all subjects, but the exposure variable, denoted by X, may be expensive or difficult to assess. This is reasonable in many studies where responses like Intelligence Quotient (IQ) or disease status are easily obtainable, but exposure assessment needs an expensive assay or follow-up. Assume that the domain of Y is partitioned into three mutually exclusive intervals: (−∞, y_L]⋃(y_L, y_U]⋃(y_U, ∞). The ODS sample proposed by Zhou et al. (2002) has X values ascertained on the following three samples: an overall simple random sample, a supplemental sample conditional on Y < y_L, and a supplemental sample conditional on Y > y_U. There has been other recent progress on the ODS design (e.g., Kang and Cai, 2009; Lu and Tsiatis, 2006; Zhou et al. 2011; Qin and Zhou, 2011; Chatterjee, Chen and Breslow, 2003; Manatunga et al, 2008; Schildcrout and Rathouz, 2010; Wang and Zhou, 2006; Zhou, Song, et al, 2011; Zhou, Wu, et al, 2011). Part of the reason that the ODS design is more efficient than simple random sampling is that, by sampling the response Y at its two distributional tails, the observed exposure values X are also more likely to fall in the distributional tails of X. Linear model theory shows that the variance of $\hat{β}$ , the estimate of the regression coefficient corresponding to X, is inversely proportional to the sum of squares of the observed X values. Hence, when the goal is to evaluate the relationship between an exposure X and a response Y, a sample of subjects whose X values are at the two distributional tails is more informative than a sample of subjects whose X values are concentrated around the mean.
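The variance argument above can be checked numerically. The following is a minimal sketch, assuming a simple linear regression with an intercept, for which the slope variance is σ² divided by the centered sum of squares of X; the two small designs and the function name `slope_variance` are hypothetical and serve only as an illustration.

```python
import numpy as np

def slope_variance(x, sigma2=1.0):
    """Variance of the least-squares slope in Y = b0 + b1*X + error, Var(error) = sigma2."""
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)  # centered sum of squares of X
    return sigma2 / sxx

# Two hypothetical designs with the same number of subjects (n = 6)
x_center = [-0.5, -0.3, -0.1, 0.1, 0.3, 0.5]  # X clustered around its mean
x_tails = [-2.0, -1.8, -1.6, 1.6, 1.8, 2.0]   # X taken from the two tails

var_center = slope_variance(x_center)
var_tails = slope_variance(x_tails)
print(var_center, var_tails)
```

With the same number of subjects, the tail-heavy design gives a much smaller slope variance, which is the motivation for oversampling the tails of X.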

Assume that the domain of the exposure X is partitioned into three mutually exclusive intervals: (−∞, x_L]⋃(x_L, x_U]⋃(x_U, ∞). If an investigator knew which interval each individual's X value falls into, the investigator could draw a supplemental sample from those whose X values are in the upper or lower tail intervals, respectively. Such a strategy, however, is not feasible in practice, as investigators do not have knowledge of X in advance. In this paper, we propose a new two-phase design in which the second phase supplemental sample is selected with a probability-dependent-sampling (PDS) scheme that allows us to oversample X from its two distributional tails. The proposed two-phase PDS is outlined as follows. Let Y denote the response variable, X the primary exposure variable, and Z the collection of all other covariates. In the first phase of the proposed design, a simple random sample (SRS) is drawn and the values of (X, Y, Z) are observed. We fit a model for E(X|Y, Z) using the phase one SRS sample. Based on this model, the chances that a new subject's X, conditional on Y = y and Z = z, falls in (−∞, x_L] and (x_U, ∞) are predicted by ${\hat{ϕ}}_{1} (y, z) = \hat{\Pr} (X < x_{L} ∣ Y, Z)$ and ${\hat{ϕ}}_{3} (y, z) = \hat{\Pr} (X > x_{U} ∣ Y, Z)$ , respectively. We then draw the supplemental samples in the second phase by obtaining a simple random sample from those who are likely to have high or low X values. For example, random samples can be drawn from those with ${\hat{ϕ}}_{1} (y, z) \geq 80\%$ and from those with ${\hat{ϕ}}_{3} (y, z) \geq 80\%$ , respectively. As a result, individuals who are more likely to be in the distributional tails of X are over-represented in the final observed data.

The roots of the proposed two-phase PDS design can also be traced back to Neyman (1938), who introduced the two-phase stratified design to enhance study efficiency. At the first phase of a typical two-phase design, a relatively large random sample is drawn and only Y and Z are measured in the first phase cohort. The ascertainment of X is made at the second phase of the design, where a subsample is drawn randomly, without replacement, from the first phase cohort. Greater efficiency can be obtained through the two-phase sampling design (e.g. Breslow and Cain, 1988; Breslow et al., 2003; Song et al, 2009; and Wang and Zhou, 2010).

The key differences among the traditional two-phase design, the recent work on the two-phase ODS design, and the proposed two-phase PDS design are that: (i) the second phase of the traditional two-phase design is either independent of Y and Z or dependent only on a binary Y, e.g., a case-control second phase; (ii) the two-phase ODS allows for a continuous Y, but not Z, in the second phase drawing; (iii) the two-phase PDS not only allows for a continuous Y, but also allows Z of any dimension to inform the second phase drawing. By estimating the probability that the unknown X falls in a given range, this approach avoids the impracticability of high-dimensional stratification on the vector Z.

For data obtained via complex sampling designs like the PDS design described above, estimators that ignore the design will be biased; valid inference must properly account for the biased sampling scheme. In practice, some ad hoc simplification of the data is often made prior to analysis. A commonly used approach in epidemiologic studies is to dichotomize a continuous outcome Y and then use available methods for binary outcomes for inference (e.g., White, 1982; Amemiya, 1985; Prentice, 1986; Breslow and Cain, 1988; Weinberg and Wacholder, 1993; Langholz and Borgan, 1995; Breslow and Holubkov, 1997; Schildcrout and Heagerty, 2008). In this paper, we propose a semiparametric empirical likelihood method for estimating the regression parameters. The proposed methods are semiparametric in the sense that the marginal distribution of the exposure variable X is left unspecified.

The remainder of this paper is organized as follows. In Section 2, we introduce the data structure for the two-phase PDS design. We outline the estimation algorithm for the proposed semiparametric empirical likelihood estimator and establish its asymptotic properties. In Section 3, we present simulation study results comparing the proposed method with some competing designs and estimators. We then illustrate the proposed method with a data set from the Collaborative Perinatal Project (CPP). Final remarks are given in Section 4.

2 Design and Inference for a Two-phase PDS Study

2.1 Design and Data Structure

Let Y denote a continuous outcome variable, (X, Z) denote the vector of covariates with X being the expensive scalar exposure variable and Z being the easily obtainable covariates. Assume that the regression model of Y given (X, Z) is

Y = β_{0} + β_{1} X + β_{2} Z + ε,

where (β₀, β₁, β₂) denote the unknown regression parameters and ε ~ N(0, σ²) is the random error. Let β = (β₀, β₁, β₂, σ²), and let x_L and x_U (x_L < x_U) be known constants that partition the domain of X into three mutually exclusive intervals: A₁ ⋃ A₂ ⋃ A₃ = (−∞, x_L] ⋃ (x_L, x_U] ⋃ (x_U, ∞).

The proposed two-phase PDS scheme is as follows. In the first phase, we observe (Y, X, Z) in a simple random sample (SRS) of size n₀ from the underlying study population. A model of E(X|Y, Z) is then fitted based on this SRS sample, and ϕ₁(Y, Z) = Pr(X ∊ A₁|Y, Z) and ϕ₃(Y, Z) = Pr(X ∊ A₃|Y, Z) are estimated. In the second phase of the PDS design, we draw a supplemental random sample from those in the study population whose predicted probability ${\hat{ϕ}}_{1} = \hat{\Pr} (X \in A_{1} ∣ Y, Z)$ satisfies ${\hat{ϕ}}_{1} \geq 80\%$ . Likewise, a supplemental sample is drawn from those whose X values are more likely to be in the upper tail, i.e., from those with ${\hat{ϕ}}_{3} = \hat{\Pr} (X \in A_{3} ∣ Y, Z) \geq 80\%$ . Note that the 80% value here is chosen for simplicity of illustration. We will use constants c₁ and c₃, where 0 < c₁, c₃ < 1, in the formulation of the likelihood. The data structure for the proposed two-phase PDS is

\begin{matrix} \text{The SRS sample:} & \{Y_{0 i}, X_{0 i}, Z_{0 i}\}, \; i = 1, \dots, n_{0}; \\ \text{The supplemental samples:} & \{(Y_{1 i}, X_{1 i}, Z_{1 i}) : \Pr (X_{1 i} \in A_{1} ∣ Y_{1 i}, Z_{1 i}) \geq c_{1}\}, \; i = 1, \dots, n_{1}; \\ & \{(Y_{3 i}, X_{3 i}, Z_{3 i}) : \Pr (X_{3 i} \in A_{3} ∣ Y_{3 i}, Z_{3 i}) \geq c_{3}\}, \; i = 1, \dots, n_{3} . \end{matrix}

(2.1)

The supplemental samples can be generated with different, perhaps unknown, selection probabilities. For example, one can select a fixed proportion of the sets {(Y_ki, Z_ki): ϕ_k(Y_ki, Z_ki) ≥ c_k} (k = 1, 3) from an underlying cohort of subjects whose (Y, Z) are known, or one can select a predetermined number of subjects from the underlying population, in which case the proportion of the selected set relative to the underlying population is unknown. The total sample size in the two-phase PDS design is n = n₀ + n₁ + n₃.
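The second-phase draw described above can be sketched as follows. This is an illustration only: the function name `draw_supplemental`, the toy probabilities, and the choice to exclude subjects already in the phase-one sample are assumptions, not details given in the text.

```python
import numpy as np

def draw_supplemental(phi_hat, cutoff, size, already_sampled, rng):
    """Simple random sample, without replacement, from cohort members with phi_hat >= cutoff."""
    phi_hat = np.asarray(phi_hat, dtype=float)
    eligible = np.flatnonzero(phi_hat >= cutoff)            # indices meeting the probability cutoff
    eligible = np.setdiff1d(eligible, already_sampled)      # assumption: skip phase-one subjects
    if eligible.size < size:
        raise ValueError("fewer eligible subjects than the requested sample size")
    return rng.choice(eligible, size=size, replace=False)

# Toy cohort of six subjects with predicted tail probabilities
phi1_hat = [0.90, 0.10, 0.85, 0.95, 0.50, 0.99]
rng = np.random.default_rng(1)
picked = draw_supplemental(phi1_hat, cutoff=0.8, size=2, already_sampled=[0], rng=rng)
print(sorted(picked.tolist()))
```

Here a predetermined number of subjects is drawn, which corresponds to the second option in the text, where the selection proportion relative to the population is not needed.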

If X is a continuous variable and can be viewed as normally distributed after a proper transformation, then a linear model can be used for ϕ_k(Y, Z) = Pr(X ∈ A_k|Y, Z), k = 1, 3. More specifically, we estimate ϕ₁(Y, Z) by $\hat{\Pr} (X \in A_{1} ∣ Y, Z) = Φ ((x_{L} - ({\hat{γ}}_{0} + {\hat{γ}}_{1} Y + {\hat{γ}}_{2} Z)) ∕ {\hat{σ}}_{1})$ and ϕ₃(Y, Z) by $\hat{\Pr} (X \in A_{3} ∣ Y, Z) = 1 - Φ ((x_{U} - ({\hat{γ}}_{0} + {\hat{γ}}_{1} Y + {\hat{γ}}_{2} Z)) ∕ {\hat{σ}}_{1})$ , where Φ(·) is the c.d.f. of the standard normal distribution and ${\hat{γ}}_{i}, i = 0, 1, 2$ and ${\hat{σ}}_{1}$ are estimates using the first phase data based on the following regression model:

X = γ_{0} + γ_{1} Y + γ_{2} Z + e, e ~ N (0, σ_{1}^{2}) .

(2.2)
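The linear-model estimates of ϕ₁ and ϕ₃ based on (2.2) can be sketched as below. The sketch assumes a scalar Z, a least-squares fit, and a residual standard deviation with n₀ − 3 degrees of freedom; the function names and the simulated phase-one data are hypothetical.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard normal c.d.f. computed from the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def fit_x_given_yz(x, y, z):
    """Least-squares fit of X = g0 + g1*Y + g2*Z + e on the phase-one SRS."""
    design = np.column_stack([np.ones(len(y)), y, z])
    gamma, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ gamma
    # assumption: residual variance uses n0 - 3 degrees of freedom
    sigma1 = math.sqrt(resid @ resid / (len(x) - design.shape[1]))
    return gamma, sigma1

def tail_probabilities(y, z, gamma, sigma1, x_low, x_up):
    """Return (phi1_hat, phi3_hat) for a subject with outcome y and covariate z."""
    mean_x = gamma[0] + gamma[1] * y + gamma[2] * z
    phi1 = normal_cdf((x_low - mean_x) / sigma1)        # Pr(X <= x_L | y, z)
    phi3 = 1.0 - normal_cdf((x_up - mean_x) / sigma1)   # Pr(X > x_U | y, z)
    return phi1, phi3

# Simulated phase-one sample (illustrative values only)
rng = np.random.default_rng(0)
n0 = 200
x0 = rng.normal(size=n0)
z0 = rng.binomial(1, 0.45, size=n0).astype(float)
y0 = 1.0 + 0.5 * x0 - 0.5 * z0 + rng.normal(size=n0)

gamma_hat, sigma1_hat = fit_x_given_yz(x0, y0, z0)
phi1_hat, phi3_hat = tail_probabilities(4.0, 0.0, gamma_hat, sigma1_hat, x_low=-1.0, x_up=1.0)
print(phi1_hat, phi3_hat)
```

A subject with a large observed Y receives a larger predicted probability of lying in the upper tail of X than in the lower tail, so such a subject would be a candidate for the A₃ supplemental sample.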

Another natural estimator for ϕ_k results from the use of a logistic regression model. Denote δ_k = I(X ∈ A_k), k = 1, 3. We estimate ϕ_k(Y, Z) by ${\hat{ϕ}}_{k} (Y, Z) = {(1 + \exp (- ({\hat{α}}_{0 k} + {\hat{α}}_{1 k} Y + {\hat{α}}_{2 k} Z)))}^{- 1}$ , where ( ${\hat{α}}_{0 k}, {\hat{α}}_{1 k}, {\hat{α}}_{2 k}$ ) are obtained from fitting

\Pr (δ_{k} = 1 ∣ Y, Z) = {(1 + \exp (- (α_{0 k} + α_{1 k} Y + α_{2 k} Z)))}^{- 1}

(2.3)

to the first phase SRS data. Alternatively, one can also derive a nonparametric estimator for ϕ_k, k = 1, 3, by using the kernel method. Note that

\Pr (X \in A_{k} ∣ Y, Z) = \frac{\int I_{x \in A_{k}} f (Y ∣ X, Z) d G (X ∣ Z)}{\int f (Y ∣ X, Z) d G (X ∣ Z)},

(2.4)

where G(X|Z) is the conditional c.d.f. for X|Z that can be estimated by

\hat{G} (X ∣ Z) = \frac{\sum_{j = 1}^{n_{0}} I (X_{0 j} \leq X) ϕ_{h} (Z_{0 j} - Z)}{\sum_{j = 1}^{n_{0}} ϕ_{h} (Z_{0 j} - Z)},

where ϕ_h(·) = ϕ(·/h) is a kernel function with a bandwidth h. One can then estimate ϕ_k(Y, Z) by

{\hat{ϕ}}_{E k} (Y, Z) = \frac{\sum_{j = 1}^{n_{0}} I_{\{X_{0 j} \in A_{k}\}} f (Y ∣ X_{0 j}, Z) ϕ_{h} (Z_{0 j} - Z)}{\sum_{j = 1}^{n_{0}} f (Y ∣ X_{0 j}, Z) ϕ_{h} (Z_{0 j} - Z)} .

(2.5)
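A minimal sketch of the kernel estimator in (2.5) follows. It assumes a Gaussian kernel for ϕ_h, a scalar Z, and a normal linear model for f(Y|X, Z) whose parameters (`beta`, `sigma`) are supplied by the caller; none of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def normal_pdf(t):
    """Standard normal density."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def phi_kernel_estimate(y, z, x0, z0, in_tail, beta, sigma, h):
    """Kernel estimate of Pr(X in A_k | Y=y, Z=z), following the form of (2.5).

    x0, z0  : phase-one values of X and Z
    in_tail : boolean array, I(X_0j in A_k)
    """
    b0, b1, b2 = beta
    # f(y | X_0j, z) under the assumed normal linear model, evaluated at the target z
    dens = normal_pdf((y - (b0 + b1 * x0 + b2 * z)) / sigma) / sigma
    weights = normal_pdf((z0 - z) / h)  # Gaussian kernel, phi_h(u) = phi(u / h)
    return np.sum(in_tail * dens * weights) / np.sum(dens * weights)

# Toy phase-one sample with A_3 = (1, infinity)
x0 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
z0 = np.zeros(5)
upper_tail = x0 > 1.0

est_high = phi_kernel_estimate(3.0, 0.0, x0, z0, upper_tail, beta=(0.0, 1.0, 0.0), sigma=1.0, h=1.0)
est_low = phi_kernel_estimate(-3.0, 0.0, x0, z0, upper_tail, beta=(0.0, 1.0, 0.0), sigma=1.0, h=1.0)
print(est_high, est_low)
```

The estimate is a weighted proportion of phase-one subjects in the tail, with weights that favor subjects whose X makes the observed Y plausible and whose Z is close to the target value.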

2.2 A Semiparametric Empirical Likelihood Inference

Let G(X, Z) and g(X, Z) denote the joint c.d.f. and p.d.f. of (X, Z), respectively. If ϕ_k(Y, Z), k = 1, 3 were known, the likelihood function for the data in (2.1) would be

L (β, G) = {\prod_{i = 1}^{n_{0}} f_{β} (Y_{0 i} ∣ X_{0 i}, Z_{0 i}) g (X_{0 i}, Z_{0 i})} {\prod_{k = 1, 3} \prod_{j = 1}^{n_{k}} f_{β} (Y_{k j}, X_{k j}, Z_{k j} ∣ ϕ_{k} (Y_{k j}, Z_{k j}) \geq c_{k})} .

(2.6)

Due to the biased sampling of the proposed design, maximizing the likelihood function over β involves addressing G(X, Z). Hence we include G in the above likelihood function. For k = 1, 3, define

π_{k} = \Pr (ϕ_{k} (Y, Z) \geq c_{k}) = \int \int \int f_{β} (Y ∣ X, Z) g (X, Z) I_{\{(Y, Z) : ϕ_{k} (Y, Z) \geq c_{k}\}} d Y d X d Z .

Using the Bayes formula, L(β, G) can be expressed as

L (β, G) = {\prod_{i = 1}^{n_{0}} f_{β} (Y_{0 i} ∣ X_{0 i}, Z_{0 i}) g (X_{0 i}, Z_{0 i})} {\prod_{k = 1, 3} \prod_{j = 1}^{n_{k}} f_{β} (Y_{k j} ∣ X_{k j}, Z_{k j}) g (X_{k j}, Z_{k j})} {\prod_{k = 1, 3} π_{k}^{- n_{k}}} .

(2.7)

We propose a semiparametric likelihood method to maximize the likelihood function without specifying the underlying distribution of G(X, Z). We first profile the likelihood function L(β, G) by fixing β and obtaining the empirical likelihood function of G(X, Z) over all distributions whose support contains the observed (X, Z) values. For a fixed β, this is a biased sampling likelihood (Vardi 1982, 1985; Qin 1993). We then maximize the resulting profile likelihood function with respect to β. For simplicity of notation, let (X₁, …, X_n) = (X₀₁, …, X_0n₀, X₁₁, …, X_1n₁, X₃₁, …, X_3n₃), (Z₁, …, Z_n) = (Z₀₁, …, Z_0n₀, Z₁₁, …, Z_1n₁, Z₃₁, …, Z_3n₃) and (Y₁, …, Y_n) = (Y₀₁, …, Y_0n₀, Y₁₁, …, Y_1n₁, Y₃₁, …, Y_3n₃). Then the log-likelihood function can be written as

\begin{matrix} l_{f} (β, \{p_{i}\}, \{π_{k}\}) & = \sum_{i = 1}^{n} \log f_{β} (Y_{i} ∣ X_{i}, Z_{i}) + \left\{\sum_{i = 1}^{n} \log p_{i} - \sum_{k = 1, 3} n_{k} \log (π_{k})\right\} \\ & =: l_{1} (β) + l_{2} (\{p_{i}\}, \{π_{k}\}), \end{matrix}

(2.8)

where $p_{i} = g (X_{i}, Z_{i})$ , $l_{1} (β) = \sum_{i = 1}^{n} \log f_{β} (Y_{i} ∣ X_{i}, Z_{i})$ is a function involving only β, and $l_{2} (\{p_{i}\}, \{π_{k}\}) = \sum_{i = 1}^{n} \log p_{i} - \sum_{k = 1, 3} n_{k} \log (π_{k})$ .

The first step in deriving the proposed estimator for β is to profile (2.8) over {p_i}, by fixing (β, π₁, π₃), and obtain the empirical likelihood function of {p_i} over all distributions whose support contains the observed values of X and Z. To this end, we need only consider discrete distributions with jumps at each of the observed points (Owen, 1988, 1990). That is, for fixed (β, π₁, π₃), we search for $\{{\hat{p}}_{i}\}$ that maximize l₂({p_i}, {π_k}) in (2.8) under the following four constraints:

\begin{matrix} p_{i} \geq 0; \quad \sum_{i = 1}^{n} p_{i} = 1; \\ \sum_{i = 1}^{n} p_{i} \left(\int f_{β} (Y ∣ X_{i}, Z_{i}) I_{\{(Y, Z_{i}) : ϕ_{1} (Y, Z_{i}) \geq c_{1}\}} d Y - π_{1}\right) = 0; \\ \sum_{i = 1}^{n} p_{i} \left(\int f_{β} (Y ∣ X_{i}, Z_{i}) I_{\{(Y, Z_{i}) : ϕ_{3} (Y, Z_{i}) \geq c_{3}\}} d Y - π_{3}\right) = 0 . \end{matrix}

(2.9)

These constraints reflect the properties of g(X, Z) being a discrete distribution function with support points at the observed (X, Z) values, i.e., {p_i} are nonnegative probabilities that sum up to unity.

For a fixed β, using a similar idea to Qin and Lawless (1994), a unique maximum for {p_i} in l₂({p_i}, {π_k}) with constraints (2.9) exists if 0 is inside the convex hull of points $\int f_{β} (Y ∣ X_{i}, Z_{i}) I_{{(Y, Z_{i}) : ϕ_{k} (Y, Z_{i}) \geq c_{k}}} d Y - π_{k}$ for i = 1, …, n and k = 1, 3. The Lagrange multiplier argument can be invoked to derive the maximum over {p_i}. Specifically, write

H (β, {p_{i}}, {π_{k}}) = l_{2} ({p_{i}}, {π_{k}}) + ρ (1 - \sum_{i = 1}^{n} p_{i}) + n \sum_{k = 1, 3} λ_{k} \sum_{i = 1}^{n} p_{i} {\int f_{β} (Y ∣ X_{i}, Z_{i}) I_{{(Y, Z_{i}) : ϕ_{k} (Y, Z_{i}) \geq c_{k}}} d Y - π_{k}},

where ρ and λ_k are Lagrange multipliers. Taking derivatives of H(β, {p_i}, {π_k}) with respect to {p_i} and solving the score equations together with the constraints in (2.9), we can obtain that ρ = n and

{\hat{p}}_{i} = n^{- 1} \left\{1 + \sum_{k = 1, 3} λ_{k} \left(\int f_{β} (Y ∣ X_{i}, Z_{i}) I_{\{(Y, Z_{i}) : ϕ_{k} (Y, Z_{i}) \geq c_{k}\}} d Y - π_{k}\right)\right\}^{- 1} .

Replacing p_i with ${\hat{p}}_{i}$ in (2.8), we have a profile log-likelihood function $l_{f} (β, \{{\hat{p}}_{i}\}, \{π_{k}\})$ that is a function of (β, π₁, π₃, λ₁, λ₃) only. Typically, the true values of the Lagrange multipliers are zero in an unbiased sampling problem. However, due to the biased nature of the PDS design, λ₁ and λ₃ are not centered around zero. To unify the notation, we center them by reparameterizing v_k = λ_k − n_k/(nπ_k), k = 1, 3, and define ξ = (β, π₁, π₃, v₁, v₃). The resulting profile log-likelihood function l(ξ) can be expressed as

l (ξ) = l_{1} (β) - \sum_{i = 1}^{n} \log (1 + v^{τ} h (X_{i}, Z_{i})) - \sum_{i = 1}^{n} \log (Δ (X_{i}, Z_{i})) - \sum_{k = 1, 3} n_{k} \log π_{k},

where h(X_i, Z_i) = (h₁(X_i, Z_i), h₃(X_i, Z_i))^τ with $h_{k} (X_{i}, Z_{i}) = \{F_{k} (X_{i}, Z_{i}) - π_{k}\} ∕ Δ (X_{i}, Z_{i})$ , $F_{k} (X_{i}, Z_{i}) = \int f_{β} (Y ∣ X_{i}, Z_{i}) I_{\{(Y, Z_{i}) : ϕ_{k} (Y, Z_{i}) \geq c_{k}\}} d Y$ , and $Δ (X_{i}, Z_{i}) = q_{0} + \sum_{k = 1, 3} q_{k} π_{k}^{- 1} F_{k} (X_{i}, Z_{i})$ , with q_k = n_k/n for k = 0, 1, 3.
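The algebra behind the reparameterization can be sketched as follows. This is a reconstruction from the definitions of λ_k, v_k, q_k, F_k, h_k and Δ given above, using q₀ + q₁ + q₃ = 1 and writing F_k for F_k(X_i, Z_i); it is offered as a plausibility check rather than as the authors' own derivation.

```latex
\begin{aligned}
1 + \sum_{k = 1, 3} λ_{k} (F_{k} - π_{k})
  &= 1 + \sum_{k = 1, 3} \left(v_{k} + \frac{q_{k}}{π_{k}}\right) (F_{k} - π_{k}) \\
  &= (1 - q_{1} - q_{3}) + \sum_{k = 1, 3} q_{k} π_{k}^{-1} F_{k} + \sum_{k = 1, 3} v_{k} (F_{k} - π_{k}) \\
  &= Δ + \sum_{k = 1, 3} v_{k} (F_{k} - π_{k})
   = Δ \left\{1 + v^{τ} h\right\}, \\
\sum_{i = 1}^{n} \log \hat{p}_{i}
  &= - n \log n - \sum_{i = 1}^{n} \log \left(1 + v^{τ} h (X_{i}, Z_{i})\right)
     - \sum_{i = 1}^{n} \log Δ (X_{i}, Z_{i}).
\end{aligned}
```

Dropping the constant −n log n and adding l₁(β) − Σ_k n_k log π_k gives the stated form of l(ξ).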

Finally, replacing ϕ_k(Y, Z) by ${\hat{ϕ}}_{k} (Y, Z)$ in l(ξ), we have the following estimated profile log-likelihood function:

\tilde{l} (ξ) = l_{1} (β) - \sum_{i = 1}^{n} \log (1 + v^{τ} \hat{h} (X_{i}, Z_{i})) - \sum_{i = 1}^{n} \log (\hat{Δ} (X_{i}, Z_{i})) - \sum_{k = 1, 3} n_{k} \log π_{k},

(2.10)

where $\hat{h} (X, Z)$ and $\hat{Δ} (X, Z)$ are obtained by replacing ϕ_k(Y, Z) by ${\hat{ϕ}}_{k} (Y, Z)$ in h(X, Z) and Δ(X, Z), respectively. We call the maximizer $\hat{ξ}$ of $\tilde{l} (ξ)$ the maximum semiparametric empirical likelihood estimator (MSELE). The MSELE for β, denoted $\hat{β}$ , is the corresponding portion of $\hat{ξ}$ . The Newton-Raphson iterative procedure can be used to obtain $\hat{ξ}$ . The following theorem summarizes the asymptotic properties of the proposed estimators.

THEOREM 1 (asymptotic properties): Under the regularity conditions outlined in the Appendix, $\hat{ξ}$ converges in probability to the true value ξ = (β, π₁, π₃, 0, 0), and $n^{1 ∕ 2} (\hat{ξ} - ξ)$ converges in distribution to N(0, Σ), where Σ = V⁻¹(ξ)U(ξ){V⁻¹(ξ)}^T is given in the Appendix.

Details of the proof are given in the Appendix. It will be shown that the asymptotic variance-covariance of $\sqrt{n} (\hat{ξ} - ξ)$ takes a sandwich form V⁻¹(ξ)U(ξ){V⁻¹(ξ)}^T. In addition, a consistent estimator of the variance-covariance matrix is given by ${\hat{V}}^{- 1} (\hat{ξ}) \hat{U} (\hat{ξ}) {{\hat{V}}^{- 1} (\hat{ξ})}^{T}$ , where $\hat{U}$ and $\hat{V}$ are obtained by replacing the large-sample quantities in U and V with their corresponding small-sample quantities.
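Once estimates of U and V are available, the sandwich form can be assembled as in the following sketch. The matrices used here are toy values, the function names are hypothetical, and the division by n assumes, as in Theorem 1, that Σ is the asymptotic covariance of √n(ξ̂ − ξ).

```python
import numpy as np

def sandwich_covariance(V, U):
    """Return V^{-1} U (V^{-1})^T for square matrices V and U."""
    V_inv = np.linalg.inv(np.asarray(V, dtype=float))
    return V_inv @ np.asarray(U, dtype=float) @ V_inv.T

def standard_errors(V_hat, U_hat, n):
    """Standard errors of xi_hat when Sigma is the covariance of sqrt(n) * (xi_hat - xi)."""
    sigma_hat = sandwich_covariance(V_hat, U_hat)
    return np.sqrt(np.diag(sigma_hat) / n)

# Toy 2 x 2 example (not values from the paper)
V_hat = np.array([[2.0, 0.0], [0.0, 4.0]])
U_hat = np.array([[4.0, 0.0], [0.0, 16.0]])

sigma_hat = sandwich_covariance(V_hat, U_hat)
se = standard_errors(V_hat, U_hat, n=100)
print(sigma_hat)
print(se)
```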

Remark 1 The proposed estimation algorithm enables us to change an infinite dimension problem, with regard to nonparametric G, into a finite dimension problem at the expense of introducing 4 parameters π₁, π₃, λ₁, λ₃.

Remark 2 When ${\hat{ϕ}}_{k} (Y, Z_{i})$ is from the logistic regression model, the condition ${\hat{ϕ}}_{k} (y, Z_{i}) \geq c_{k}$ is equivalent to

\begin{cases} y \geq - \frac{{\hat{α}}_{0 k} + {\hat{α}}_{2 k} Z_{i} + \log \frac{1 - c_{k}}{c_{k}}}{{\hat{α}}_{1 k}}, & \text{if } {\hat{α}}_{1 k} > 0; \\ y \leq - \frac{{\hat{α}}_{0 k} + {\hat{α}}_{2 k} Z_{i} + \log \frac{1 - c_{k}}{c_{k}}}{{\hat{α}}_{1 k}}, & \text{if } {\hat{α}}_{1 k} < 0; \end{cases}

and F_k(X_i, Z_i) can be simply expressed as $F (- ({\hat{α}}_{0 k} + {\hat{α}}_{2 k} Z_{i} + \log \frac{1 - c_{k}}{c_{k}}) ∕ {\hat{α}}_{1 k} ∣ X_{i}, Z_{i}) I_{\{{\hat{α}}_{1 k} < 0\}} + \bar{F} (- ({\hat{α}}_{0 k} + {\hat{α}}_{2 k} Z_{i} + \log \frac{1 - c_{k}}{c_{k}}) ∕ {\hat{α}}_{1 k} ∣ X_{i}, Z_{i}) I_{\{{\hat{α}}_{1 k} > 0\}}$ , where F (u|X_i, Z_i) = Pr(Y ≤ u|X_i, Z_i) and $\bar{F} = 1 - F$ .
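Remark 2 can be turned into a short computation. The sketch below assumes a scalar Z, a nonzero fitted coefficient on Y, and the normal linear model of Section 2.1 for Y given (X, Z); the function names and numeric values are illustrative only.

```python
import math

def normal_cdf(t):
    """Standard normal c.d.f. computed from the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def logistic_cutoff(alpha0, alpha1, alpha2, z, c):
    """Value of y at which the fitted logistic probability equals c (alpha1 must be nonzero)."""
    return -(alpha0 + alpha2 * z + math.log((1.0 - c) / c)) / alpha1

def selection_mass(x, z, alpha, c, beta, sigma):
    """F_k(x, z) = Pr(phi_hat_k(Y, z) >= c | X=x, Z=z) with Y ~ N(b0 + b1*x + b2*z, sigma^2)."""
    alpha0, alpha1, alpha2 = alpha
    b0, b1, b2 = beta
    u = logistic_cutoff(alpha0, alpha1, alpha2, z, c)
    lower_tail = normal_cdf((u - (b0 + b1 * x + b2 * z)) / sigma)  # Pr(Y <= u | x, z)
    # alpha1 > 0: the selection region is y >= u; alpha1 < 0: it is y <= u
    return 1.0 - lower_tail if alpha1 > 0 else lower_tail

cutoff = logistic_cutoff(0.0, 1.0, 0.0, z=0.0, c=0.8)
mass = selection_mass(cutoff, 0.0, alpha=(0.0, 1.0, 0.0), c=0.8, beta=(0.0, 1.0, 0.0), sigma=1.0)
print(cutoff, mass)
```

Because the selection region is a half-line in y, F_k reduces to a single normal tail probability, which avoids numerical integration when evaluating the profile likelihood.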

3 Numerical Analysis

3.1 Simulation Studies

We evaluate the small sample behavior of the proposed estimator using Monte Carlo studies. We assume that the domains of both Y and X are partitioned into three mutually exclusive intervals: γ = B₁ ⋃ B₂ ⋃ B₃ and χ = A₁ ⋃ A₂ ⋃ A₃, where B₁ = (−∞, μ_Y − a * σ_Y], B₂ = (μ_Y − a * σ_Y, μ_Y + a * σ_Y], B₃ = (μ_Y + a * σ_Y, ∞), A₁ = (−∞, μ_X − a * σ_X], A₂ = (μ_X − a * σ_X, μ_X + a * σ_X] and A₃ = (μ_X + a * σ_X, ∞). We set n₁ = n₃, a = 1 or 1.5, and c₁ = c₃ = 85% or 95%. The proposed estimator, denoted by ${\hat{β}}_{{PDS}_{1}}$ for c = 95% and ${\hat{β}}_{{PDS}_{2}}$ for c = 85%, is compared with five other estimators. (i) The first estimator, denoted by ${\hat{β}}_{X}$ , is based on a hypothetical situation in which all X values are available in the study, so that the supplemental samples are drawn from individuals whose X values are in the two tails of X, defined by μ_X ± a * σ_X. We emphasize that this estimator is not available in practice since X is unknown; we include it for comparison purposes only. We use the least squares method for estimation in this case. (ii) The second estimator, denoted by ${\hat{β}}_{ODS}$ , is the ODS estimator (Zhou et al, 2002), for which the supplemental samples are drawn from individuals whose Y values are in the two tails of the distribution of Y, defined by μ_Y ± a * σ_Y. (iii) The third estimator, denoted by ${\hat{β}}_{IPW}$ , is the inverse probability weighted (IPW) estimator (Horvitz and Thompson, 1952). The data structure for this estimator is the same as that for ${\hat{β}}_{ODS}$ , and we use the weights given by Weaver and Zhou (2005). (iv) The fourth estimator, denoted by ${\hat{β}}_{SRS}$ , is the ordinary linear regression estimator from a simple random sample with the same sample size as the total sample size in the PDS design. (v) The fifth estimator, denoted by ${\hat{β}}_{N}$ , ignores the sampling structure and treats the data as if they were an independent sample. All methods are compared under the same sample size scenarios. The IPW method also assumes a known sampling fraction. We first generate a large underlying study cohort (of size 4000) and then subsample from it to compare the different designs and methods.

We generate data from the following regression model:

Y = β_{0} + β_{1} X + β_{2} Z + ∊,

(3.1)

where Z = I_{(log(|X|)+e)>1} describes a dependent but weak relationship between X and Z with ∊, e and X generated independently from N(0, 1). Tables 1 and 2 summarize the simulation results. Results are based on 1000 independent simulation runs.
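The data-generating model (3.1) can be sketched in a few lines. The function name `simulate_cohort` and the seed are arbitrary choices; the cohort size of 4000 and the parameter values follow the description in this section.

```python
import numpy as np

def simulate_cohort(n, beta0, beta1, beta2, sigma=1.0, seed=None):
    """Generate (Y, X, Z) from Y = beta0 + beta1*X + beta2*Z + eps with Z = I(log|X| + e > 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                     # X ~ N(0, 1)
    e = rng.normal(size=n)                     # e ~ N(0, 1), independent of X
    eps = rng.normal(scale=sigma, size=n)      # eps ~ N(0, sigma^2)
    z = (np.log(np.abs(x)) + e > 1.0).astype(float)
    y = beta0 + beta1 * x + beta2 * z + eps
    return y, x, z

y, x, z = simulate_cohort(4000, beta0=1.0, beta1=0.5, beta2=-0.5, seed=2013)
print(y.shape, z.mean())
```

The designs being compared would then be applied as different subsampling rules to this one cohort.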

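Table 1.

Simulation results for the PDS design.^†

| (n₀, n₁, n₃) | β₁ (true) | a | Method | Mean (β₁) | SE (β₁) | $\hat{SE}$ (β₁) | CI (β₁) | Mean (β₂) | SE (β₂) | $\hat{SE}$ (β₂) | CI (β₂) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{X}$ | −0.002 | 0.038 | | 0.948 | −0.502 | 0.127 | | 0.951 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{{PDS}_{1}}$ | 0.000 | 0.043 | 0.041 | 0.930 | −0.496 | 0.150 | 0.127 | 0.946 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{{PDS}_{2}}$ | 0.000 | 0.037 | | 0.950 | −0.500 | 0.095 | 0.096 | 0.962 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{ODS}$ | 0.002 | 0.038 | 0.039 | 0.942 | −0.505 | 0.120 | 0.123 | 0.958 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{IPW}$ | 0.000 | 0.046 | | 0.952 | −0.510 | 0.146 | 0.140 | 0.931 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{SRS}$ | 0.001 | 0.050 | | 0.957 | −0.502 | 0.150 | 0.154 | 0.955 |
| (200, 100, 100) | 0.0 | 1.0 | ${\hat{β}}_{N}$ | 0.001 | 0.047 | 0.048 | 0.956 | −0.493 | 0.310 | 0.119 | 0.575 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{X}$ | 0.499 | 0.038 | | 0.957 | −0.498 | 0.126 | | 0.946 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{{PDS}_{1}}$ | 0.501 | 0.039 | 0.041 | 0.954 | −0.494 | 0.097 | 0.106 | 0.968 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{{PDS}_{2}}$ | 0.500 | 0.043 | 0.042 | 0.936 | −0.495 | 0.104 | 0.106 | 0.950 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{ODS}$ | 0.503 | 0.044 | 0.045 | 0.953 | −0.505 | 0.128 | 0.131 | 0.954 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{IPW}$ | 0.502 | 0.047 | 0.046 | 0.941 | −0.508 | 0.155 | 0.151 | 0.938 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{SRS}$ | 0.500 | 0.049 | 0.050 | 0.955 | −0.496 | 0.156 | 0.154 | 0.944 |
| (200, 100, 100) | 0.5 | 1.0 | ${\hat{β}}_{N}$ | 0.633 | 0.060 | 0.052 | 0.302 | −0.495 | 0.122 | 0.134 | 0.972 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{X}$ | 0.502 | 0.032 | | 0.951 | −0.500 | 0.120 | 0.117 | 0.940 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{{PDS}_{1}}$ | 0.499 | 0.037 | | 0.938 | −0.503 | 0.100 | 0.097 | 0.946 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{{PDS}_{2}}$ | 0.499 | 0.038 | | 0.942 | −0.500 | 0.103 | 0.099 | 0.940 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{ODS}$ | 0.503 | 0.042 | 0.043 | 0.956 | −0.495 | 0.118 | 0.120 | 0.963 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{IPW}$ | 0.504 | 0.055 | 0.052 | 0.934 | −0.508 | 0.167 | 0.166 | 0.937 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{SRS}$ | 0.500 | 0.049 | 0.050 | 0.955 | −0.496 | 0.156 | 0.154 | 0.944 |
| (200, 100, 100) | 0.5 | 1.5 | ${\hat{β}}_{N}$ | 0.694 | 0.108 | 0.049 | 0.178 | −0.507 | 0.297 | 0.128 | 0.603 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{X}$ | 0.498 | 0.038 | | 0.951 | −0.497 | 0.130 | | 0.949 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{{PDS}_{1}}$ | 0.501 | 0.042 | | 0.952 | −0.497 | 0.114 | 0.113 | 0.932 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{{PDS}_{2}}$ | 0.499 | 0.043 | 0.044 | 0.966 | −0.498 | 0.120 | 0.121 | 0.950 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{ODS}$ | 0.503 | 0.044 | 0.045 | 0.955 | −0.495 | 0.126 | 0.129 | 0.960 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{IPW}$ | 0.502 | 0.046 | 0.045 | 0.942 | −0.498 | 0.147 | 0.144 | 0.940 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{SRS}$ | 0.500 | 0.049 | 0.050 | 0.955 | −0.496 | 0.156 | 0.154 | 0.944 |
| (300, 50, 50) | 0.5 | 1.5 | ${\hat{β}}_{N}$ | 0.623 | 0.072 | 0.050 | 0.367 | −0.487 | 0.149 | 0.130 | 0.923 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{X}$ | 0.499 | 0.051 | 0.053 | 0.956 | −0.501 | 0.178 | 0.179 | 0.951 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{{PDS}_{1}}$ | 0.494 | 0.058 | 0.056 | 0.934 | −0.498 | 0.150 | 0.140 | 0.942 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{{PDS}_{2}}$ | 0.502 | 0.059 | 0.058 | 0.948 | −0.505 | 0.146 | 0.150 | 0.968 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{ODS}$ | 0.502 | 0.062 | | 0.955 | −0.505 | 0.188 | 0.187 | 0.937 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{IPW}$ | 0.508 | 0.066 | 0.064 | 0.929 | −0.507 | 0.218 | 0.210 | 0.922 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{SRS}$ | 0.495 | 0.072 | 0.071 | 0.944 | −0.499 | 0.219 | | 0.955 |
| (100, 50, 50) | 0.5 | 1.0 | ${\hat{β}}_{N}$ | 0.636 | 0.084 | 0.074 | 0.544 | −0.492 | 0.180 | 0.192 | 0.967 |

^† Results are based on the model $Y = β_{0} + β_{1} X + β_{2} I_{\{\log (|X|) + e > 1\}} + ∊$ , where e ~ N(0, 1), ∊ ~ N(0, 1), and X ~ N(0, 1); the true parameter values are β₀ = 1.0 and β₂ = −0.5. ${\hat{β}}_{X}$ , ${\hat{β}}_{{PDS}_{1}}$ , ${\hat{β}}_{{PDS}_{2}}$ , ${\hat{β}}_{ODS}$ , ${\hat{β}}_{SRS}$ , ${\hat{β}}_{IPW}$ , and ${\hat{β}}_{N}$ are defined as in Section 3.1.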

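Table 2.

Simulation results for the power and relative efficiency.^†

(n₀, n₁, n₃) = (100, 50, 50), σ² = 4

| β₁ | Method | RE (β₁) | Size/Power (β₁) | RE (β₂ = −0.5) | Power (β₂ = −0.5) |
|---|---|---|---|---|---|
| 0.0 | ${\hat{β}}_{X}$ | 1.000 | 0.055 | 1.000 | 0.284 |
| 0.0 | ${\hat{β}}_{{PDS}_{1}}$ | 0.983 | 0.056 | 1.011 | 0.312 |
| 0.0 | ${\hat{β}}_{{PDS}_{2}}$ | 1.002 | 0.058 | 0.780 | 0.436 |
| 0.0 | ${\hat{β}}_{ODS}$ | 1.011 | 0.059 | 0.934 | 0.331 |
| 0.0 | ${\hat{β}}_{IPW}$ | 1.188 | 0.071 | 1.085 | 0.278 |
| 0.0 | ${\hat{β}}_{SRS}$ | 1.332 | 0.058 | 1.227 | 0.229 |
| 0.1 | ${\hat{β}}_{X}$ | 1.000 | 0.167 | 1.000 | 0.286 |
| 0.1 | ${\hat{β}}_{{PDS}_{1}}$ | 1.005 | 0.172 | 0.980 | 0.322 |
| 0.1 | ${\hat{β}}_{{PDS}_{2}}$ | 1.010 | 0.176 | 0.795 | 0.446 |
| 0.1 | ${\hat{β}}_{ODS}$ | 1.016 | 0.159 | 0.940 | 0.350 |
| 0.1 | ${\hat{β}}_{IPW}$ | 1.189 | 0.144 | 1.099 | 0.264 |
| 0.1 | ${\hat{β}}_{SRS}$ | 1.336 | 0.110 | 1.234 | 0.236 |
| 0.5 | ${\hat{β}}_{X}$ | 1.000 | 0.997 | 1.000 | 0.291 |
| 0.5 | ${\hat{β}}_{{PDS}_{1}}$ | 1.031 | 0.999 | 0.942 | 0.315 |
| 0.5 | ${\hat{β}}_{{PDS}_{2}}$ | 1.042 | 0.997 | 0.812 | 0.443 |
| 0.5 | ${\hat{β}}_{ODS}$ | 1.068 | 0.996 | 0.962 | 0.324 |
| 0.5 | ${\hat{β}}_{IPW}$ | 1.191 | 0.981 | 1.117 | 0.274 |
| 0.5 | ${\hat{β}}_{SRS}$ | 1.338 | 0.930 | 1.237 | 0.207 |

(n₀, n₁, n₃) = (150, 25, 25), σ² = 4

| β₁ | Method | RE (β₁) | Size/Power (β₁) | RE (β₂ = −0.5) | Power (β₂ = −0.5) |
|---|---|---|---|---|---|
| 0.0 | ${\hat{β}}_{X}$ | 1.000 | 0.051 | 1.000 | 0.245 |
| 0.0 | ${\hat{β}}_{{PDS}_{1}}$ | 1.001 | 0.048 | 1.016 | 0.296 |
| 0.0 | ${\hat{β}}_{{PDS}_{2}}$ | 1.013 | 0.044 | 0.806 | 0.354 |
| 0.0 | ${\hat{β}}_{ODS}$ | 1.019 | 0.044 | 1.036 | 0.308 |
| 0.0 | ${\hat{β}}_{IPW}$ | 1.106 | 0.055 | 1.042 | 0.267 |
| 0.0 | ${\hat{β}}_{SRS}$ | 1.238 | 0.058 | 1.187 | 0.229 |
| 0.1 | ${\hat{β}}_{X}$ | 1.000 | 0.134 | 1.000 | 0.240 |
| 0.1 | ${\hat{β}}_{{PDS}_{1}}$ | 1.000 | 0.128 | 0.981 | 0.252 |
| 0.1 | ${\hat{β}}_{{PDS}_{2}}$ | 1.005 | 0.148 | 0.823 | 0.380 |
| 0.1 | ${\hat{β}}_{ODS}$ | 1.042 | 0.131 | 1.015 | 0.287 |
| 0.1 | ${\hat{β}}_{IPW}$ | 1.093 | 0.129 | 1.053 | 0.293 |
| 0.1 | ${\hat{β}}_{SRS}$ | 1.225 | 0.110 | 1.178 | 0.236 |
| 0.5 | ${\hat{β}}_{X}$ | 1.000 | 0.990 | 1.000 | 0.259 |
| 0.5 | ${\hat{β}}_{{PDS}_{1}}$ | 1.000 | 0.976 | 0.837 | 0.316 |
| 0.5 | ${\hat{β}}_{{PDS}_{2}}$ | 1.001 | 0.980 | 0.829 | 0.334 |
| 0.5 | ${\hat{β}}_{ODS}$ | 1.027 | 0.985 | 0.991 | 0.282 |
| 0.5 | ${\hat{β}}_{IPW}$ | 1.068 | 0.978 | 1.036 | 0.254 |
| 0.5 | ${\hat{β}}_{SRS}$ | 1.202 | 0.930 | 1.114 | 0.207 |

^† Results are based on the model $Y = β_{0} + β_{1} X + β_{2} I_{\{\log (|X|) + e > 1\}} + ∊$ , where e ~ N(0, 1), ∊ ~ N(0, σ²), and X ~ N(0, 1).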

We note the following observations from Table 1: (i) Except for ${\hat{β}}_{N}$ , all estimators of (β₁, β₂) are unbiased. Clearly, ${\hat{β}}_{N}$ shows that ignoring the sampling scheme leads to a biased estimate of β₁ when β₁ ≠ 0; (ii) The average of the proposed variance estimator is very close to the empirical variance based on the 1000 simulations; (iii) The coverage rates of the nominal 95% confidence intervals are close to 95%, indicating that the large sample normal approximation works well in these situations. As β₁ is of primary interest, we concentrate on the efficiency comparison of the various estimators of β₁ and note the following observations: (iv) When β₁ ≠ 0, the proposed estimator ${\hat{β}}_{{PDS}_{1}}$ is the most efficient among all practically available estimators; (v) When β₁ ≠ 0, as a changes from 1 to 1.5, i.e., when we move the partition of X further towards the tails, ${\hat{β}}_{X}$ , ${\hat{β}}_{{PDS}_{1}}$ , ${\hat{β}}_{{PDS}_{2}}$ and ${\hat{β}}_{ODS}$ all become more efficient, ${\hat{β}}_{IPW}$ becomes less efficient, and ${\hat{β}}_{SRS}$ is not affected; (vi) For a fixed overall sample size n = n₀ + n₁ + n₃, as we allocate more samples to the tails, e.g., when (n₀, n₁, n₃) changes from (300, 50, 50) to (200, 100, 100), the efficiency of ${\hat{β}}_{{PDS}_{1}}$ , ${\hat{β}}_{{PDS}_{2}}$ and ${\hat{β}}_{ODS}$ improves while the efficiency of ${\hat{β}}_{IPW}$ decreases; (vii) As the overall sample size increases from 200 to 400, the efficiency of all estimators improves; (viii) In general, ${\hat{β}}_{{PDS}_{1}}$ , which corresponds to c = 0.95, is more efficient than ${\hat{β}}_{{PDS}_{2}}$ , which corresponds to c = 0.85.

Table 2 lists the power for testing β₁ = 0 and the relative efficiency (RE) for a = 1.0, σ² = 4, and (n₀, n₁, n₃) = (100, 50, 50) and (150, 25, 25). RE is defined as the ratio of the standard error of the estimator of interest to that of ${\hat{β}}_{X}$ . At β₁ = 0, we see that all estimators, except ${\hat{β}}_{IPW}$ , have type I error rates close to the nominal level. ${\hat{β}}_{IPW}$ has a slightly inflated type I error rate (0.07). As β₁ increases, the proposed estimator ${\hat{β}}_{PDS}$ has almost the same power as ${\hat{β}}_{X}$ and is more powerful than the other competing estimators. ${\hat{β}}_{SRS}$ , the estimator from a simple random sample, has the least power of all. The observations regarding the relative efficiency are similar to those regarding the power.
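The two summary measures can be computed from simulation output as in the following sketch. RE follows the definition given above; the use of a two-sided Wald test at the 5% level for size and power is an assumption, since the text does not state which test was used, and the numeric inputs are toy values rather than simulation results from the paper.

```python
import numpy as np

def relative_efficiency(se_estimator, se_reference):
    """Ratio of the standard error of an estimator to that of the reference estimator."""
    return se_estimator / se_reference

def rejection_rate(estimates, std_errors, null_value=0.0, z_crit=1.959964):
    """Share of simulation runs in which a two-sided Wald test rejects the null value."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    wald = np.abs(est - null_value) / se
    return float(np.mean(wald > z_crit))

# Toy inputs: four simulated estimates with their estimated standard errors
re_value = relative_efficiency(0.06, 0.05)
rate = rejection_rate([0.0, 0.3, -0.25, 0.05], [0.1, 0.1, 0.1, 0.1])
print(re_value, rate)
```

When the true coefficient equals the null value the rejection rate estimates the test size; otherwise it estimates the power. An RE above 1 means a larger standard error than the reference estimator.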

We conducted additional simulation studies to check the robustness of the estimation. We considered four different combinations for the covariates X and Z in model (3.1): (i) X follows a standard normal distribution, while Z is a binary variable with parameter p = 0.45; (ii) both X and Z follow standard normal distributions; (iii) X follows an exponential distribution with parameter 1, while Z is a binary variable with parameter p = 0.45; (iv) X follows a log-normal distribution with parameters (μ, σ) = (0.0, 0.6), while Z follows a standard normal distribution. For β₀ = 1.0, β₁ = 0.5 and β₂ = −0.5, the simulation results are summarized in Table 3 (Part A). The results show that the proposed methods are consistent under the above-mentioned scenarios.

Table 3.

Robust property of the PDS estimator.^†

Part A: (n₀, n₁, n₃) = (200, 100, 100)

| Covariates | β₁ | | Method | Mean (β₁) | SE (β₁) | $\hat{SE}$ (β₁) | CI (β₁) | Mean (β₂) | SE (β₂) | $\hat{SE}$ (β₂) | CI (β₂) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X ~ N(0, 1), Z ~ binary(0.45) | 0.5 | 1.0 | ${\hat{β}}_{{PDS}_{1}}$ | 0.498 | 0.041 | 0.042 | 0.950 | −0.500 | 0.076 | 0.075 | 0.944 |
| | | | ${\hat{β}}_{{PDS}_{2}}$ | 0.500 | 0.043 | 0.044 | 0.957 | −0.495 | 0.082 | 0.083 | 0.950 |
| X ~ N(0, 1), Z ~ N(0, 1) | | | ${\hat{β}}_{{PDS}_{1}}$ | 0.501 | 0.043 | 0.042 | 0.949 | −0.501 | 0.037 | | 0.948 |
| | | | ${\hat{β}}_{{PDS}_{2}}$ | 0.502 | 0.045 | 0.044 | 0.941 | −0.499 | 0.040 | 0.041 | 0.953 |
| X ~ exp(1), Z ~ binary(0.45) | | | ${\hat{β}}_{{PDS}_{1}}$ | 0.498 | 0.039 | 0.037 | 0.940 | −0.503 | 0.107 | 0.097 | 0.938 |
| | | | ${\hat{β}}_{{PDS}_{2}}$ | 0.497 | 0.044 | 0.042 | 0.942 | −0.503 | 0.108 | 0.098 | 0.944 |
| X ~ log-normal(0, 0.6), Z ~ N(0, 1) | | | ${\hat{β}}_{{PDS}_{1}}$ | 0.500 | 0.047 | | 0.954 | −0.503 | 0.039 | 0.038 | 0.945 |
| | | | ${\hat{β}}_{{PDS}_{2}}$ | 0.501 | 0.056 | 0.053 | 0.942 | −0.499 | 0.043 | 0.042 | 0.950 |

Part B: (n₀, n₁, n₃) = (100, 150, 150)

| β₁ | | Method | Mean (β₁) | SE (β₁) | $\hat{SE}$ (β₁) | CI (β₁) | Mean (β₂) | SE (β₂) | $\hat{SE}$ (β₂) | CI (β₂) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 1.0 | ${\hat{β}}_{X}$ | 0.500 | 0.035 | 0.034 | 0.941 | −0.501 | 0.114 | 0.117 | 0.960 |
| | | ${\hat{β}}_{{PDS}_{1}}$ | 0.500 | 0.041 | 0.037 | 0.940 | −0.494 | 0.111 | 0.097 | 0.946 |
| | | ${\hat{β}}_{{PDS}_{2}}$ | 0.501 | 0.042 | 0.039 | 0.941 | −0.501 | 0.100 | 0.097 | 0.942 |
| | | ${\hat{β}}_{ODS}$ | 0.501 | 0.046 | 0.044 | 0.944 | −0.497 | 0.127 | 0.125 | 0.946 |
| | | ${\hat{β}}_{IPW}$ | 0.511 | 0.058 | 0.057 | 0.940 | −0.505 | 0.191 | 0.185 | 0.937 |
| | | ${\hat{β}}_{SRS}$ | 0.500 | 0.049 | 0.050 | 0.955 | −0.496 | 0.156 | 0.154 | 0.944 |

^† Estimators are defined the same as in Table 1.

Part B of Table 3 illustrates a situation in which an overwhelming share of the sample is allocated to the tails, in this case (n₀, n₁, n₃) = (100, 150, 150). The results show that the unbiasedness of ${\hat{β}}_{PDS}$ still holds, with the efficiency further improved as more of the sample is allocated to the tails. However, at some point, the loss of precision in ${\hat{ϕ}}_{k}$ as the SRS sample gets smaller will affect the efficiency.

3.2 Analysis of the Collaborative Perinatal Project Data

We illustrate our method using data from the Collaborative Perinatal Project (CPP) (Niswander and Gordon, 1972). This study evaluates the effect of the maternal pregnancy serum level of polychlorinated biphenyls (PCB) on the child's IQ test performance at age 7. Pregnant women were enrolled through university-affiliated medical clinics, and data were collected from each mother at each prenatal visit. The study children were also followed for various neurodevelopmental outcomes for up to 8 years. One of the hypotheses is that PCB levels are related to performance on the Wechsler Intelligence Scale for Children at 7 years of age (Longnecker et al., 1997). To investigate in utero exposure to PCB in relation to neurodevelopmental abnormality, the PCB levels were measured by analyzing the third-trimester blood serum specimens that had been preserved from mothers in the CPP study. PCB levels are available for a simple random sample of 849 subjects from the underlying population. In addition to the PCB level as the exposure variable of interest, other variables available for all subjects under study include the socioeconomic status of the child's family (SES), the gender (SEX, 1=female) and race (RACE, 1=black) of the child, and the mother's education (EDU) and age (AGE).

To illustrate our methods, we select a simple random sample of size n₀ = 100 from the cohort of 849 subjects. We then select two supplemental samples of sizes n₁ = 50 and n₃ = 50 randomly from the sets $\{(Y, Z) : \widehat{\Pr} (X \in A_{1} ∣ Y, Z) \geq 85\%\}$ and $\{(Y, Z) : \widehat{\Pr} (X \in A_{3} ∣ Y, Z) \geq 85\%\}$ , respectively. Note that $\widehat{\Pr} (X \in A_{1} ∣ Y, Z)$ and $\widehat{\Pr} (X \in A_{3} ∣ Y, Z)$ are estimated from the logistic model, and the domain of PCB is partitioned into 3 intervals with a = 1 as the cutpoint, i.e., A₁ = (−∞, μ_PCB − σ_PCB] = (−∞, 1.210] and A₃ = (μ_PCB + σ_PCB, ∞) = (5.037, +∞). The ODS design also partitions the domain of Y into three intervals. Its supplemental samples of sizes n₁ = 50 and n₃ = 50 are drawn from the strata B₁ = (−∞, μ_IQ − σ_IQ] = (−∞, 81.441] and B₃ = (μ_IQ + σ_IQ, ∞) = (109.469, +∞), respectively. The variables EDU and AGE are standardized, and we continue to denote the standardized versions by EDU and AGE. We checked the functional form of the covariates in the SRS sample: the p-value from the partial F-test of a cubic model for AGE and EDU versus a quadratic model was 0.7, and that of a quadratic model versus a linear model was 0.008. Hence, we used the following quadratic model for all estimators compared.

IQ = β_{0} + β_{1} PCB + β_{2} EDU + β_{3} SES + β_{4} AGE + β_{5} RACE + β_{6} SEX + β_{7} {EDU}^{2} + β_{8} {AGE}^{2} + ε.

(3.2)

The results of the CPP data analysis are summarized in Table 4. ${\hat{β}}_{full}$ denotes the estimator from the full data analysis, which is included for comparison.
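The selection of the two supplemental samples described above can be sketched as follows. This is an illustration under stated assumptions: the helper `select_supplement` is hypothetical, supplemental subjects are assumed to be drawn from cohort members not already in the SRS, and the probabilities are random placeholders standing in for fitted logistic-model values (they are not CPP data).

```python
import random


def select_supplement(prob_by_id, n_k, c, already_sampled, rng):
    """Randomly pick n_k subjects whose estimated probability is at least c."""
    eligible = sorted(
        subject
        for subject, prob in prob_by_id.items()
        if prob >= c and subject not in already_sampled
    )
    if len(eligible) < n_k:
        raise ValueError("fewer eligible subjects than requested")
    return rng.sample(eligible, n_k)


rng = random.Random(849)
cohort_ids = list(range(849))

# SRS of size n0 = 100 from the cohort of 849 subjects.
srs_ids = set(rng.sample(cohort_ids, 100))

# Placeholder values for the estimated Pr(X in A1 | Y, Z) and Pr(X in A3 | Y, Z).
prob_a1, prob_a3 = {}, {}
for subject in cohort_ids:
    u = rng.random()
    prob_a1[subject] = u
    prob_a3[subject] = 1.0 - u

# Supplemental samples of sizes n1 = n3 = 50 with threshold c = 0.85.
supp1 = select_supplement(prob_a1, 50, 0.85, srs_ids, rng)
supp3 = select_supplement(prob_a3, 50, 0.85, srs_ids | set(supp1), rng)
print(len(srs_ids), len(supp1), len(supp3))
```

In an actual analysis the placeholder dictionaries would be replaced by probabilities predicted for each subject's (Y, Z) from the model fitted to the SRS.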

Table 4.

Analysis results for the CPP data set.^††

| Covariate | Int | PCB | EDU | SES | AGE | RACE | SEX | EDU² | AGE² |
|---|---|---|---|---|---|---|---|---|---|
| ${\hat{β}}_{full}$ | 93.160 | 0.219 | 3.407 | 0.998 | −0.566 | −7.712 | −0.768 | 0.579 | 0.684 |
| $\hat{SE} ({\hat{β}}_{full})$ | 1.688 | 0.227 | 0.535 | 0.269 | 0.525 | 0.925 | 0.839 | 0.225 | 0.370 |
| Lower C.I. | 89.851 | −0.225 | 2.357 | 0.471 | −1.595 | −9.526 | −2.414 | 0.136 | −0.042 |
| Upper C.I. | 96.469 | 0.665 | 4.457 | 1.526 | 0.462 | −5.897 | 0.877 | 1.022 | 1.410 |
| ${\hat{β}}_{PDS}$ | 92.818 | 0.153 | 3.154 | 1.072 | −0.279 | −7.453 | −1.053 | 0.665 | 0.500 |
| $\hat{SE} ({\hat{β}}_{PDS})$ | 3.588 | 0.465 | 1.102 | 0.554 | 1.044 | 1.907 | 1.770 | 0.460 | 0.735 |
| Lower C.I. | 85.784 | −0.758 | 0.992 | −0.015 | −2.326 | −11.191 | −4.523 | −0.238 | −0.941 |
| Upper C.I. | 99.851 | 1.064 | 5.315 | 2.159 | 1.767 | −3.715 | 2.416 | 1.568 | 1.941 |
| ${\hat{β}}_{ODS}$ | 92.228 | 0.641 | 5.073 | 1.273 | −0.722 | −12.394 | −0.070 | 0.457 | 0.527 |
| $\hat{SE} ({\hat{β}}_{ODS})$ | 4.341 | 0.516 | 1.267 | 0.686 | 1.330 | 2.417 | 2.084 | 0.459 | 0.886 |
| Lower C.I. | 83.718 | −0.371 | 2.588 | −0.072 | −3.329 | −17.132 | −4.157 | −0.443 | −1.210 |
| Upper C.I. | 100.738 | 1.654 | 7.557 | 2.619 | 1.885 | −7.656 | 4.016 | 1.357 | 2.265 |
| ${\hat{β}}_{IPW}$ | 93.101 | 0.271 | 3.589 | 0.995 | −0.602 | −7.944 | −0.674 | 0.556 | 0.678 |
| $\hat{SE} ({\hat{β}}_{IPW})$ | 11.757 | 1.587 | 3.867 | 1.851 | 3.612 | 6.309 | 5.882 | 1.644 | 2.493 |
| Lower C.I. | 70.057 | −2.839 | −3.991 | −2.632 | −7.682 | −20.309 | −12.204 | −2.667 | −4.208 |
| Upper C.I. | 116.146 | 3.382 | 11.169 | 4.623 | 6.478 | 4.421 | 10.856 | 3.780 | 5.565 |
| ${\hat{β}}_{SRS}$ | 91.296 | 0.579 | 3.535 | 0.906 | −0.439 | −6.143 | 0.183 | 0.981 | 0.832 |
| $\hat{SE} ({\hat{β}}_{SRS})$ | 3.527 | 0.554 | 1.017 | 0.547 | 1.158 | 1.847 | 1.685 | 0.467 | 0.705 |
| Lower C.I. | 84.383 | −0.508 | 1.542 | −0.166 | −2.710 | −9.765 | −3.119 | 0.066 | −0.550 |
| Upper C.I. | 98.209 | 1.666 | 5.529 | 1.979 | 1.831 | −2.521 | 3.486 | 1.896 | 2.215 |

^†† a = 1 and the allocation pattern is (n₀, n₁, n₃) = (100, 50, 50).

The outcome is the Wechsler Intelligence Scale for Children at 7 years of age (IQ). PCB is the level measured from the third-trimester blood serum specimens that had been preserved from mothers in the CPP study; EDU is the standardized mother's education level; SES is the socioeconomic status of the child's family; AGE is the standardized mother's age; RACE and SEX are the race and gender of the child. The fitted model is IQ = β₀ + β₁PCB + β₂EDU + β₃SES + β₄AGE + β₅RACE + β₆SEX + β₇EDU² + β₈AGE² + ε, where ε is a zero-mean normal variable with unknown variance. ${\hat{β}}_{full}$ , ${\hat{β}}_{PDS}$ , ${\hat{β}}_{ODS}$ , ${\hat{β}}_{IPW}$ and ${\hat{β}}_{SRS}$ are defined in 3.2.

Results in Table 4 reveal that none of the estimators demonstrates a significant PCB effect on the IQ scores of children at 7 years of age. Nevertheless, the benefit of the two-phase PDS design can be seen from the fact that the estimator ${\hat{β}}_{PDS}$ for PCB has a smaller standard error than the estimators ${\hat{β}}_{ODS}$ , ${\hat{β}}_{IPW}$ and ${\hat{β}}_{SRS}$ . As a result, the 95% confidence interval from ${\hat{β}}_{PDS}$ is narrower than those from ${\hat{β}}_{ODS}$ , ${\hat{β}}_{IPW}$ and ${\hat{β}}_{SRS}$ . It is not surprising that the standard error of ${\hat{β}}_{full}$ for PCB, based on all 849 subjects, is the smallest and consequently yields the narrowest confidence interval, (−0.225, 0.665), for the effect of PCB.
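The comparison of interval widths for the PCB coefficient can be checked with a short calculation. The sketch assumes Wald-type intervals, estimate ± 1.96 × standard error, which is an assumption on our part; applied to the rounded PCB estimates and standard errors reported in Table 4, it agrees with the tabulated limits up to rounding.

```python
# PCB estimates and standard errors as reported in Table 4.
pcb_results = {
    "full": (0.219, 0.227),
    "PDS": (0.153, 0.465),
    "ODS": (0.641, 0.516),
    "IPW": (0.271, 1.587),
    "SRS": (0.579, 0.554),
}


def wald_interval(estimate, std_error, z=1.96):
    """Two-sided Wald interval (assumed form of the reported intervals)."""
    return estimate - z * std_error, estimate + z * std_error


intervals = {name: wald_interval(b, se) for name, (b, se) in pcb_results.items()}
widths = {name: hi - lo for name, (lo, hi) in intervals.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name:>4}: ({lo:.3f}, {hi:.3f})  width = {widths[name]:.3f}")
```

Every interval contains zero, and the widths order as full, PDS, ODS, SRS, IPW from narrowest to widest.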

4 Concluding Remarks

We proposed an innovative and cost-effective sampling design, the two-phase PDS design, that enables investigators to collect more informative samples at a fixed budget. The proposed design is multi-phase and uses a biased sampling scheme in which the main exposure variable is observed with a probability that depends on the outcome variable and other covariates. This research was developed in response to the need to design a more powerful study that makes effective use of the financial resources available in an ongoing study, the Gulf Long-term Follow-up (GuLF) Study conducted at the US National Institute of Environmental Health Sciences (NIEHS) (Sandler et al. 2011). The GuLF Study is a health study specifically for workers and volunteers who helped clean up the 2010 Deepwater Horizon oil spill. About 56,000 subjects will be recruited. It is the largest study ever conducted on the possible short-term and long-term health effects of oil spills. The budget will allow benzene levels to be assessed in only about 900 individuals. Collaborating with NIEHS scientists, we are in the process of designing a sampling strategy that uses the proposed two-phase PDS scheme to target more informative subjects for sampling.

The main advantage of the proposed design is that it allows a continuous Y and a vector of available covariates Z to be used in selecting a more informative second-phase data set. The proposed design avoids the impractical high-dimensional stratification that arises when multiple covariates are included in Z. The proposed semiparametric empirical likelihood method is an efficient and robust way to analyze data from the proposed design. The primary competitors of the PDS design in practice are the ODS design with a continuous response variable, the simple random sampling design, and the inverse probability weighted (IPW) method for two-phase designs, though the IPW method also requires the sampling probabilities to be known. Our simulation results suggest that, for the same sample size, the proposed PDS design, coupled with the proposed estimator, is more efficient and more powerful than these competing estimators. Our robustness simulation results also suggest that the logistic model estimate of Pr(X ∈ A_k|Y, Z) is quite robust with respect to misspecification of the true underlying models.

There are a few recommendations for using the two-phase PDS design in practice. One needs to consider how to choose a and c, and how to distribute the supplemental samples. We suggest that a three-category design, A₁ = (−∞, μ_X − a * σ_X], A₂ = (μ_X − a * σ_X, μ_X + a * σ_X] and A₃ = (μ_X + a * σ_X, ∞), with cut points reasonably far from the mean of the exposure, is sufficient. The simulation results and subject-matter considerations might support large values of a, e.g., greater than 1, so that the cut point corresponds to a clinically abnormal value. However, one has to be cautious about selecting observations too far out in the distribution, as the reward from choosing a relatively large value of a depends on the assumption that f_β (Y|X) holds across the entire range. This assumption may be violated if a is too large, and stability could be an issue if observations are sampled from the very extreme tails. We recommend that a be between 1 and 1.5. We also recommend that c be between 75% and 95%, with an even split of the supplemental samples between the two outside tails (i.e., n₁ = n₃).
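The three-category partition can be written down directly. In the sketch below, μ_X and σ_X are estimated by the sample mean and sample standard deviation of observed exposures, which is an assumption; the function names and the toy values are hypothetical.

```python
import statistics


def tail_cutpoints(x_values, a):
    """Return (mu - a*sigma, mu + a*sigma) using sample moments of x_values."""
    mu = statistics.fmean(x_values)
    sigma = statistics.stdev(x_values)
    return mu - a * sigma, mu + a * sigma


def stratum(x, lower, upper):
    """Return k in {1, 2, 3} such that x lies in A_k.

    A1 = (-inf, lower], A2 = (lower, upper], A3 = (upper, inf).
    """
    if x <= lower:
        return 1
    if x <= upper:
        return 2
    return 3


toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = tail_cutpoints(toy_x, a=1.0)
labels = [stratum(x, lower, upper) for x in toy_x]
print(lower, upper, labels)
```

Increasing `a` widens the middle stratum A₂ and pushes A₁ and A₃ further into the tails, which is the trade-off discussed above.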

Some interesting future work remains. As we pointed out earlier, the proposed design can be implemented in two ways: (i) with the underlying cohort population and the sampling proportions unknown, stopping recruitment once the pre-specified numbers of subjects in the tail supplemental samples are reached; or (ii) with all (Y, Z) in the underlying cohort known. In the latter case, the augmented IPW estimator can be explored to obtain an estimator that is more efficient than the IPW estimator. It would be interesting to explore whether combining the PDS design with the ODS design would result in more efficient designs. Such a combination could decompose the efficiency gains into those due to increased variation in Y (from ODS) and those due to increased variation in X beyond the increased variation in Y. On the theory front, it would be interesting to explore the existence of a semiparametric efficient estimator for the proposed PDS designs. Finally, it would be interesting to explore the possible bias-variance tradeoff among different approaches to estimating ϕ₁ and ϕ₃.

Acknowledgements

The authors thank the editor, the associate editor and the referees for their helpful comments. The authors also wish to thank Dr. Jane Monaco for the careful proofreading of the manuscript. This work was partly supported by United States National Institutes of Health grants R01-CA79949, R01-ES021900, RR025747, and P01-CA142538.

Appendix: proof of Theorem 1

Recall that the approximated profile log-likelihood function is

\begin{aligned} \tilde{l} (β, π, ν) = & \sum_{i = 1}^{n} \log f_{β} (Y_{i} ∣ X_{i}, Z_{i}) \\ & - \sum_{i = 1}^{n} \log \{ q_{0} + \sum_{k \in S} q_{k} π_{k}^{- 1} {\hat{F}}_{k} (X_{i}, Z_{i}) + ν^{T} ({\hat{F}}_{1} (X_{i}, Z_{i}) - π_{1}, {\hat{F}}_{3} (X_{i}, Z_{i}) - π_{3}) \} \\ & - \sum_{k \in S} n_{k} \log π_{k}, \end{aligned}

where ${\hat{F}}_{k} (X_{i}, Z_{i}) = \int f_{β} (Y ∣ X_{i}, Z_{i}) I ({\hat{ϕ}}_{k} (Y, Z_{i}) \geq c_{k}) d Y$ and $S = \{1, 3\}$ . We define l*(β, π, ν) in the same way as $\tilde{l} (β, π, ν)$ except that ${\hat{ϕ}}_{k}$ is replaced by ϕ_k. Let ξ = (β, π, ν); we then abbreviate $\tilde{l} (β, π, ν)$ and l*(β, π, ν) as $\tilde{l} (ξ)$ and l*(ξ), respectively.
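To make the integral defining ${\hat{F}}_{k}$ concrete, the sketch below approximates it by a midpoint rule. It assumes, for illustration only, a normal linear model for f_β (Y|X, Z) and a logistic form for ${\hat{ϕ}}_{k}$ with made-up coefficients `gamma`; neither the function names nor these values come from the paper.

```python
import math


def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def f_hat(x, z, beta, sd, phi, c, half_width=8.0, steps=4000):
    """Midpoint-rule value of the integral of f_beta(y | x, z) * I(phi(y, z) >= c) dy."""
    b0, b1, b2 = beta
    mean = b0 + b1 * x + b2 * z          # assumed linear mean
    lo = mean - half_width * sd          # integrate over mean +/- 8 sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * h
        if phi(y, z) >= c:
            total += normal_pdf(y, mean, sd)
    return total * h


def phi_logistic(y, z, gamma=(-1.0, 1.5, 0.3)):
    """Hypothetical logistic estimate of Pr(X in A_k | Y = y, Z = z)."""
    g0, g1, g2 = gamma
    return 1.0 / (1.0 + math.exp(-(g0 + g1 * y + g2 * z)))


beta = (1.0, 0.5, -0.5)
approx = f_hat(0.5, 0.0, beta, 1.0, phi_logistic, 0.85)

# Closed form for this toy case: phi >= c exactly when y >= y_star.
y_star = (math.log(0.85 / 0.15) + 1.0) / 1.5
exact = 0.5 * math.erfc((y_star - 1.25) / math.sqrt(2.0))
print(approx, exact)
```

Because the toy ϕ is monotone in y, the integral reduces to a normal tail probability, which gives a simple check on the quadrature.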

We impose the following assumptions.

(C.1) The log-density log f_β (Y |X, Z) is twice continuously differentiable with respect to β.

(C.2) The proportion n_j/n is a fixed constant q_j ∈ (0, 1).

(C.3) The class of functions

F \equiv \{ f_{β} (Y ∣ X, Z), \frac{\partial^{s}}{\partial β^{s}} \log f_{β} (Y ∣ X, Z), \int \frac{\partial^{s}}{\partial β^{s}} \log f_{β} (Y ∣ X, Z) f_{β} (Y ∣ X, Z) I (ϕ_{k} (Y, Z) \geq c_{k}) d Y : s = 0, 1, 2, \text{ and the functions are indexed by the parameters } ξ \text{ and the parameters of } ϕ_{k} \}

is P-Donsker and has an envelope function with finite second moment.

(C.4) The Hessian matrix of E[n⁻¹l* (ξ)] is continuous in a neighborhood of the true value ξ = (β, π, 0, 0) and is non-singular at ξ.

(C.5) The estimator $\int \frac{\partial^{s}}{\partial β^{s}} \log f_{β} (Y ∣ X, Z) I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k}) d Y$ , s = 0, 1, 2, belongs to $F$ and $\Pr ((Y, Z) : I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k}) \to I (ϕ_{k} (Y, Z) \geq c_{k})) = 1$ .

(C.6) It holds

E [\partial \log f_{β} (Y ∣ X, Z) ∕ \partial β I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k})] - E [\partial \log f_{β} (Y ∣ X, Z) ∕ \partial β I (ϕ_{k} (Y, Z) \geq c_{k})] = n^{- 1} \sum_{i = 1}^{n} Q_{1 k} (Y_{i}, X_{i}, Z_{i}) + o_{p} (1)

and

E [I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k})] - E [I (ϕ_{k} (Y, Z) \geq c_{k})] = n^{- 1} \sum_{i = 1}^{n} Q_{2 k} (Y_{i}, X_{i}, Z_{i}) + o_{p} (1)

for k = 1, 3, where Q_1k(Y, X, Z) and Q_2k(Y, X, Z) are mean 0 random vectors with finite second moments.

Conditions (C.1)–(C.4) are standard regularity conditions on f_β (Y |X, Z) and ϕ_k(Y, Z), which hold for the usual regression models and choices of ϕ_k. Conditions (C.5) and (C.6) concern the properties of the estimator ${\hat{ϕ}}_{k}$ . These conditions can be easily verified if ${\hat{ϕ}}_{k}$ has a parametric structure such as (2.2) or (2.3). For the kernel estimator (2.4), verifying these two conditions requires some additional work, but they can be shown to hold if the bandwidth is chosen small enough.

(i) Proof of Consistency. At the true value ξ = (β, π, 0, 0), we calculate the first derivative of $n^{- 1} \tilde{l} (ξ)$ to obtain

\frac{\partial}{\partial β} n^{- 1} \tilde{l} (ξ) = n^{- 1} \sum_{i = 1}^{n} \frac{\partial}{\partial β} \log f_{β} (Y_{i} ∣ X_{i}, Z_{i})

- n^{- 1} \sum_{i = 1}^{n} \sum_{k \in S} q_{k} \frac{\int \partial \log f_{β} (Y ∣ X_{i}, Z_{i}) ∕ \partial β f_{β} (Y ∣ X_{i}, Z_{i}) I ({\hat{ϕ}}_{k} (Y, Z_{i}) \geq c_{k}) d Y}{π_{k}},

(A.1)

and for k = 1, 3,

\frac{\partial}{\partial π_{k}} n^{- 1} \tilde{l} (ξ) = n^{- 1} \sum_{i = 1}^{n} \frac{q_{k} \int f_{β} (Y ∣ X_{i}, Z_{i}) I ({\hat{ϕ}}_{k} (Y, Z_{i}) \geq c_{k}) d Y}{π_{k}^{2}} - \frac{q_{k}}{π_{k}},

(A.2)

and

\frac{\partial}{\partial ν_{k}} n^{- 1} \tilde{l} (ξ) = - n^{- 1} \sum_{i = 1}^{n} (\int f_{β} (Y ∣ X_{i}, Z_{i}) I ({\hat{ϕ}}_{k} (Y, Z_{i}) \geq c_{k}) d Y - π_{k}) .

(A.3)

By the Donsker property in (C.3) and (C.5), we apply the Glivenko-Cantelli theorem and obtain

∣ \frac{\partial}{\partial ξ} n^{- 1} \tilde{l} (ξ) - \frac{\partial}{\partial ξ} E [n^{- 1} \tilde{l} (ξ)] ∣ \to_{a . s .} 0 .

Since $E [n^{- 1} \tilde{l} (ξ)] \to E [n^{- 1} l^{*} (ξ)]$ , we have

\frac{\partial}{\partial ξ} n^{- 1} \tilde{l} (ξ) \to_{a . s .} \frac{\partial}{\partial ξ} E [n^{- 1} l^{*} (ξ)] .

Here the derivative of n⁻¹l*(ξ) takes the same expressions as (A.1)–(A.3) except that ${\hat{ϕ}}_{k}$ is replaced by ϕ_k. On the other hand, using the ODS design fact that

E [n^{- 1} \sum_{i = 1}^{n} g_{1} (Y_{i}, X_{i}, Z_{i})] = q_{0} E [g_{1} (Y, X, Z)] + \sum_{k \in S} q_{k} E [g_{1} (Y, X, Z) ∣ ϕ_{k} (Y, Z) \geq c_{k}],

E [n^{- 1} \sum_{i = 1}^{n} g_{2} (X_{i}, Z_{i})] = E [g_{2} (X, Z)],

and the fact that π_k = E[I(ϕ_k(Y, Z) ≥ c_k)], we can easily calculate that the derivative of E[n⁻¹l* (ξ)] with respect to ξ equals 0. Thus, $n^{- 1} \partial \tilde{l} (ξ) ∕ \partial ξ \to_{a . s .} 0$ ; that is, 0 belongs to the image of $n^{- 1} \partial \tilde{l} (ξ) ∕ \partial ξ$ in any given neighborhood of the true ξ when n is large enough. Similarly, we can show $n^{- 1} \partial^{2} \tilde{l} (ξ) ∕ \partial ξ^{2} \to_{a . s .} E [n^{- 1} \partial^{2} l^{*} (ξ) ∕ \partial ξ^{2}]$ for ξ in a neighborhood of the true value. Thus, from condition (C.4), $n^{- 1} \partial^{2} \tilde{l} (ξ) ∕ \partial ξ^{2}$ is invertible in this neighborhood when n is large enough. From the inverse mapping theorem, $n^{- 1} \partial \tilde{l} (ξ) ∕ \partial ξ$ is invertible in any small neighborhood of the true ξ. Consequently, we conclude that there exists a solution $\hat{ξ}$ to $\partial \tilde{l} (ξ) ∕ \partial ξ = 0$ and that $\hat{ξ}$ converges almost surely to the true ξ.

(ii) Proof of Asymptotic Normality. From the equation

n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ}) = 0,

we obtain

n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ}) - E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ})] = - E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ})] + E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (ξ)] - E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (ξ)] .

We apply the Taylor expansion to the first term on the right-hand side and obtain

n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ}) - E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (\hat{ξ})] = - E [n^{- 1} \frac{\partial^{2}}{\partial ξ^{2}} \tilde{l} (\tilde{ξ})] (\hat{ξ} - ξ) - E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (ξ)],

(A.4)

where $\tilde{ξ}$ is between $\hat{ξ}$ and ξ.

In equation (A.4), the left-hand side can be expressed as an empirical process indexed by functions

\{ \frac{\partial}{\partial ξ} \log f_{β} (Y ∣ X, Z) - \frac{\partial}{\partial ξ} \log (q_{0} + \sum_{k \in S} q_{k} π_{k}^{- 1} {\hat{F}}_{k} (X, Z) + ν^{T} ({\hat{F}}_{1} (X, Z) - π_{1}, {\hat{F}}_{3} (X, Z) - π_{3})) - \sum_{k \in S} q_{k} \frac{\partial}{\partial ξ} \log π_{k} : \text{ξ is in a neighborhood of the true value} \} .

By conditions (C.3) and (C.5), it is asymptotically equivalent to $n^{- 1 ∕ 2} \sum_{i = 1}^{n} U (Y_{i}, X_{i}, Z_{i})$ , where

U (Y_{i}, X_{i}, Z_{i}) = (\begin{matrix} \frac{\partial}{\partial β} \log f_{β} (Y_{i} ∣ X_{i}, Z_{i}) - \sum_{k \in S} q_{k} π_{k}^{- 1} \int \frac{\partial}{\partial β} \log f_{β} (Y ∣ X_{i}, Z_{i}) f_{β} (Y ∣ X_{i}, Z_{i}) I (ϕ_{k} (Y, Z_{i}) \geq c_{k}) d Y \\ q_{k} π_{k}^{- 2} \int f_{β} (Y ∣ X_{i}, Z_{i}) I (ϕ_{k} (Y, Z_{i}) \geq c_{k}) d Y - q_{k} π_{k}^{- 1}, k = 1, 3 \\ \int f_{β} (Y ∣ X_{i}, Z_{i}) I (ϕ_{k} (Y, Z_{i}) \geq c_{k}) d Y - π_{k}, k = 1, 3 \end{matrix}) .

According to (C.5), the matrix in the first term of the right-hand side of (A.4) satisfies

E [n^{- 1} \frac{\partial^{2}}{\partial ξ^{2}} \tilde{l} (\tilde{ξ})] \to E [n^{- 1} \frac{\partial^{2}}{\partial ξ^{2}} l^{*} (ξ)] = V (ξ) .

For the second term on the right-hand side of (A.4), we note

E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (ξ)] = E [n^{- 1} \frac{\partial}{\partial ξ} \tilde{l} (ξ)] - E [n^{- 1} \frac{\partial}{\partial ξ} l^{*} (ξ)],

which is further simplified as

(\begin{matrix} - \sum_{k \in S} q_{k} π_{k}^{- 1} \{ E [\frac{\partial}{\partial β} \log f_{β} (Y ∣ X, Z) I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k})] - E [\frac{\partial}{\partial β} \log f_{β} (Y ∣ X, Z) I (ϕ_{k} (Y, Z) \geq c_{k})] \} \\ q_{k} π_{k}^{- 2} \{ E [I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k})] - E [I (ϕ_{k} (Y, Z) \geq c_{k})] \}, k = 1, 3 \\ - \{ E [I ({\hat{ϕ}}_{k} (Y, Z) \geq c_{k})] - E [I (ϕ_{k} (Y, Z) \geq c_{k})] \}, k = 1, 3 \end{matrix}) .

Combining all these results and using condition (C.6), we obtain

- (V (ξ) + o (1)) (\hat{ξ} - ξ) = n^{- 1 ∕ 2} \sum_{i = 1}^{n} [U (Y_{i}, X_{i}, Z_{i}) + (\begin{matrix} - \sum_{k \in S} q_{k} π_{k}^{- 1} Q_{1 k} (Y_{i}, X_{i}, Z_{i}) \\ q_{k} π_{k}^{- 2} Q_{2 k} (Y_{i}, X_{i}, Z_{i}), k = 1, 3 \\ - Q_{2 k} (Y_{i}, X_{i}, Z_{i}), k = 1, 3 \end{matrix})] .

(A.5)

The asymptotic normality of $\hat{ξ}$ thus follows.

(iii) Consistent estimator of variance. From the above derivation, the asymptotic covariance of $\hat{ξ}$ takes the form V (ξ)⁻¹U(ξ){V (ξ)⁻¹}^T, where U(ξ) is the variance of each summand on the right-hand side of (A.5). Thus, a consistent estimator of the asymptotic variance of $\sqrt{n} (\hat{ξ} - ξ)$ is given by $\hat{V} {(\hat{ξ})}^{- 1} \hat{U} (\hat{ξ}) {\hat{V} {(\hat{ξ})}^{- 1}}^{T}$ , where $\hat{V} (\hat{ξ}) = n^{- 1} \frac{\partial^{2}}{\partial ξ^{2}} \tilde{l} (\hat{ξ})$ and $\hat{U} (\hat{ξ})$ is the sample variance of the sample version of

U (Y_{i}, X_{i}, Z_{i}) + (\begin{matrix} - \sum_{k \in S} q_{k} π_{k}^{- 1} Q_{1 k} (Y_{i}, X_{i}, Z_{i}) \\ q_{k} π_{k}^{- 2} Q_{2 k} (Y_{i}, X_{i}, Z_{i}), k = 1, 3 \\ - Q_{2 k} (Y_{i}, X_{i}, Z_{i}), k = 1, 3 \end{matrix}) .
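The sandwich form of the variance estimator can be illustrated numerically. The sketch below is restricted to a two-dimensional parameter so that the matrix inverse can be written out by hand; the helper names and the toy inputs are hypothetical, and in practice $\hat{V}$ would be the scaled second-derivative matrix and the rows would be the estimated summands of (A.5).

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def sample_covariance(rows):
    """Sample covariance matrix of the rows (divisor n - 1)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]


def sandwich_variance(v_hat, u_hat):
    """Return V^{-1} U (V^{-1})^T for 2 x 2 inputs."""
    v_inv = inverse_2x2(v_hat)
    return mat_mul(mat_mul(v_inv, u_hat), transpose(v_inv))


# Toy inputs (not derived from any real likelihood).
v_hat = [[2.0, 0.0], [0.0, 4.0]]
influence_rows = [(1.0, 2.0), (3.0, 6.0), (5.0, 10.0)]
u_hat = sample_covariance(influence_rows)
cov_sqrt_n = sandwich_variance(v_hat, u_hat)
print(u_hat, cov_sqrt_n)
```

Since the result estimates the variance of $\sqrt{n} (\hat{ξ} - ξ)$, standard errors for the components of $\hat{ξ}$ are obtained by dividing the diagonal entries by n and taking square roots.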

References

Amemiya T. Advanced Econometrics. Harvard University Press; Cambridge, Massachusetts: 1985.
Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.
Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Series B. 1997;59:447–461.
Breslow N, McNeney B, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. The Annals of Statistics. 2003;31:1110–1139.
Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association. 2003;98:158–168.
Cornfield J. A method of estimating comparative rates from clinical data. Applications to cancer of lung, breast, and cervix. Journal of the National Cancer Institute. 1951;11:1269–1275.
Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
Kang S, Cai J. Marginal hazards regression for retrospective studies within cohort with possibly correlated failure time data. Biometrics. 2009;65:405–414. doi: 10.1111/j.1541-0420.2008.01077.x.
Langholz B, Borgan O. Counter-matching: A stratified nested case-control sampling method. Biometrika. 1995;82:69–79.
Lu W, Tsiatis AA. Semiparametric transformation models for the case-cohort study. Biometrika. 2006;93:207–214.
Longnecker M, Klebanoff M, Zhou H, Wilcox A, Berendes H, Hoffman H. Proposal to study in utero exposure to DDE and PCBs in relation to male birth defects and neurodevelopmental outcomes in the Collaborative Perinatal Project. Study Proposal, National Institute of Environmental Health Sciences; Washington, D.C.: 1997.
Manatunga A, Chen H, Terrell M, Lyles R, Marcus M. A longitudinal model for repeated highly skewed outcome data. Journal of Applied Statistics. 2008;9:39–51.
Neyman J. Contribution to the theory of sampling from human populations. Journal of the American Statistical Association. 1938;33:101–116.
Niswander KR, Gordon M. The women and their pregnancies. U.S. Department of Health, Education, and Welfare Publication (NIH) 73–379. U.S. Government Printing Office; Washington, D.C.: 1972.
Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75:237–249.
Owen AB. Empirical likelihood for confidence regions. The Annals of Statistics. 1990;18:90–120.
Prentice RL. A case-cohort design for epidemiologic studies and disease prevention trials. Biometrika. 1986;73:1–11.
Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–412.
Qin G, Zhou H. Partial linear inference for a 2-stage outcome-dependent sampling design with a continuous outcome. Biostatistics. 2011;12:506–520. doi: 10.1093/biostatistics/kxq070.
Qin J. Empirical likelihood in biased sample problems. The Annals of Statistics. 1993;21:1182–1196.
Qin J, Lawless JF. Empirical likelihood and general estimating equations. The Annals of Statistics. 1994;22:300–325.
Sandler D, et al. GuLF Worker Study: Gulf Long-Term Follow-Up Study for Oil Spill Clean-Up Workers and Volunteers. National Institute of Environmental Health Sciences (NIEHS) GuLF Worker Study Draft to IOM-v2. 2010.
Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. doi: 10.1093/biostatistics/kxn006.
Schildcrout JS, Rathouz PJ. Longitudinal Studies of Binary Response Data Following Case-Control and Stratified Case-Control Sampling: Design and Analysis. Biometrics. 2010;66:365–373. doi: 10.1111/j.1541-0420.2009.01306.x.
Song R, Zhou H, Kosorok MR. On semiparametric efficient inference for two-stage outcome dependent sampling with a continuous outcome. Biometrika. 2009;96:221–228. doi: 10.1093/biomet/asn073.
Vardi Y. Nonparametric estimation in presence of length bias. The Annals of Statistics. 1982;10:616–620.
Vardi Y. Empirical distribution in selection bias models. The Annals of Statistics. 1985;13:178–203.
Wang X, Zhou H. A semiparametric empirical likelihood method for biased sampling schemes with auxiliary covariates. Biometrics. 2006;62:1149–1160. doi: 10.1111/j.1541-0420.2006.00612.x.
Wang X, Zhou H. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics. 2010;66:502–511. doi: 10.1111/j.1541-0420.2009.01280.x.
Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.
Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative intercept risk models. Biometrika. 1993;80:461–465.
White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266.
Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. doi: 10.1111/j.0006-341x.2002.00413.x.
Zhou H, Wu Y, Liu Y, Cai J. Semiparametric inference for 2-stage outcome-auxiliary-dependent sampling design with continuous outcome. Biostatistics. 2011;12:521–534. doi: 10.1093/biostatistics/kxq080.
Zhou H, You J, Qin G, Longnecker MP. A partially linear regression model for data from an outcome-dependent sampling design. Journal of the Royal Statistical Society: Series C. 2011. doi: 10.1111/j.1467-9876.2010.00756.x.
Zhou H, Song R, Wu Y, Qin J. Statistical inference for a two-stage outcome dependent sampling design with a continuous outcome. Biometrics. 2011;67:194–202. doi: 10.1111/j.1541-0420.2010.01446.x.

[R1] Amemiya T. Advanced Econometrics. Harvard University Press; Cambridge, Massachusetts: 1985. [Google Scholar]

[R2] Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35. [Google Scholar]

[R3] Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20. [Google Scholar]

[R4] Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Series B. 1997;59:447–461. [Google Scholar]

[R5] Breslow N, McNeney B, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. The Annals of Statistics. 2003;31:1110–1139. [Google Scholar]

[R6] Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association. 2003;98:158–168. [Google Scholar]

[R7] Cornfield J. A method of estimating comparative rates from clinical data. Applications to cancer of lung, breast, and cervix. Journal of the National Cancer Institute. 1951;ll:1269–1275. [PubMed] [Google Scholar]

[R8] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685. [Google Scholar]

[R9] Kang S, Cai J. Marginal hazards regression for retrospective studies within cohort with possibly correlated failure time data. Biometrics. 2009;65:405–414. doi: 10.1111/j.1541-0420.2008.01077.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Langholz B, Borgan O. Counter-matching: A stratified nested case-control sampling method. Biometrika. 1995;82:69–79. [Google Scholar]

[R11] Lu W, Tsiatis AA. Semiparametric transformation models for the case-cohort sturdy. Biometrika. 2006;93:207–214. [Google Scholar]

[R12] Longnecker M, Klebanoff M, Zhou H, Wilcox A, Berendes H, Hoffman H. Proposal to study in utero exposure to DDE and PCBs in relation to m,ale hirth defects and neurodevelopmental outcomes in the Collaborative Perinatal Project. Study Proposal, National Institute of Environmental Health Sciences; Washington, D.C.: 1997. [Google Scholar]

[R13] Manatunga A, Chen H, Terrell M, Lyles R, Marcus M. A longitudinal model or repeated highly skewed outcome data. Journal of Applied Statistics. 2008;9:39–51. [Google Scholar]

[R14] Neyman J. Contribution to the theory of sampling from human populations. Journal of the American Statistical Association. 1938;33:101–116.

[R15] Niswander KR, Gordon M. U.S. Department of Health, Education, and Welfare Publication (NIH) 73–379. U.S. Government Printing Office; Washington, D.C.: 1972. The women and their pregnancies.

[R16] Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75:237–249.

[R17] Owen AB. Empirical likelihood for confidence regions. The Annals of Statistics. 1990;18:90–120.

[R18] Prentice RL. A case-cohort design for epidemiologic studies and disease prevention trials. Biometrika. 1986;73:1–11.

[R19] Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–412.

[R20] Qin G, Zhou H. Partial linear inference for a 2-stage outcome-dependent sampling design with a continuous outcome. Biostatistics. 2011;12:506–520. doi: 10.1093/biostatistics/kxq070.

[R21] Qin J. Empirical likelihood in biased sample problems. The Annals of Statistics. 1993;21:1182–1196.

[R22] Qin J, Lawless JF. Empirical likelihood and general estimating equations. The Annals of Statistics. 1994;22:300–325.

[R23] Sandler D, et al. National Institute of Environmental Health Sciences (NIEHS) GuLF Worker Study Draft to IOM-v2. 2010. GuLF Worker Study: Gulf Long-Term Follow-Up Study for Oil Spill Clean-Up Workers and Volunteers.

[R24] Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. doi: 10.1093/biostatistics/kxn006.

[R25] Schildcrout JS, Rathouz PJ. Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis. Biometrics. 2010;66:365–373. doi: 10.1111/j.1541-0420.2009.01306.x.

[R26] Song R, Zhou H, Kosorok MR. On semiparametric efficient inference for two-stage outcome dependent sampling with a continuous outcome. Biometrika. 2009;96:221–228. doi: 10.1093/biomet/asn073.

[R27] Vardi Y. Nonparametric estimation in presence of length bias. The Annals of Statistics. 1982;10:616–620.

[R28] Vardi Y. Empirical distribution in selection bias models. The Annals of Statistics. 1985;13:178–203.

[R29] Wang X, Zhou H. A semiparametric empirical likelihood method for biased sampling schemes with auxiliary covariates. Biometrics. 2006;62:1149–1160. doi: 10.1111/j.1541-0420.2006.00612.x.

[R30] Wang X, Zhou H. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics. 2010;66:502–511. doi: 10.1111/j.1541-0420.2009.01280.x.

[R31] Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.

[R32] Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative intercept risk models. Biometrika. 1993;80:461–465.

[R33] White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266.

[R34] Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. doi: 10.1111/j.0006-341x.2002.00413.x.

[R35] Zhou H, Wu Y, Liu Y, Cai J. Semiparametric inference for 2-stage outcome-auxiliary-dependent sampling design with continuous outcome. Biostatistics. 2011;12:521–534. doi: 10.1093/biostatistics/kxq080.

[R36] Zhou H, You J, Qin G, Longnecker MP. A partially linear regression model for data from an outcome-dependent sampling design. Journal of the Royal Statistical Society: Series C. 2011. doi: 10.1111/j.1467-9876.2010.00756.x.

[R37] Zhou H, Song R, Wu Y, Qin J. Statistical inference for a two-stage outcome dependent sampling design with a continuous outcome. Biometrics. 2011;67:194–202. doi: 10.1111/j.1541-0420.2010.01446.x.
