Abstract
The performance of a biomarker predicting clinical outcome is often evaluated in a large prospective study. Due to the high costs associated with bioassays, investigators need to select a subset from all available patients for biomarker assessment. We consider an outcome- and auxiliary-dependent subsampling (OADS) scheme, in which the probability of selecting a patient into the subset depends on the patient’s clinical outcome and an auxiliary variable. We propose a semiparametric empirical likelihood method to estimate the association between biomarker and clinical outcome. Asymptotic properties of the estimator are given. A simulation study shows that the proposed method outperforms alternative methods.
Keywords: Auxiliary variable, Biomarker, Outcome- and auxiliary-dependent subsampling, Population-based studies, Semiparametric empirical likelihood
1. INTRODUCTION
In medical research, there is a growing need to assess the utility of a biomarker (e.g., genetic, molecular, or imaging) in predicting disease prognosis and treatment efficacy. Such a task involves estimating the association between clinical outcome and biomarker while adjusting for confounding variables in regression models. In many cases, due to a low prevalence rate of subjects with positive outcome (e.g., response) and a low prevalence rate of positive biomarker (e.g., genetic mutations), rigorous evaluation of biomarker performance requires a large prospective cohort study. If the biomarker assays are expensive, the cost of assessing all subjects in the entire study cohort is prohibitive. In such a situation, selecting a subset of subjects from the study cohort for biomarker assays is often necessary.
We illustrate the idea using a lung cancer biomarker study. The epidermal growth factor receptor (EGFR) inhibitors, such as Erlotinib and Gefitinib, modestly extended survival for patients with advanced non-small-cell lung cancer. Of these patients, however, researchers found that those with EGFR mutations responded to the EGFR inhibitor drugs significantly better than those without mutations. Since this finding is based on small retrospective studies, a national consortium was recently established (Eberhard et al., 2008) to prospectively evaluate EGFR mutations as a predictive biomarker for receiving EGFR inhibitors. Hundreds of patients treated with EGFR inhibitors will be assembled into a study cohort and all of them are required to submit tissue samples. The study cohort is expected to predominantly consist of non-responders to the treatment (~70%) and EGFR wild types (~85%). Due to the high cost of genotyping EGFR genes, it is not cost-effective to genotype all banked samples for such a large cohort. How to efficiently select a subset of patients for EGFR mutation assays becomes an important issue. Paez et al. (2004) found that women, East Asians, nonsmokers, and patients with adenocarcinoma have a much higher probability of being EGFR mutants. Taking advantage of this finding, CALGB investigators (Jänne et al., 2008) suggested a subsampling scheme that includes a simple random subsample from the study cohort as well as two supplementary subsamples. Of the two supplementary subsamples, one includes all responders and the other includes nonresponders with a >0.70 likelihood score of EGFR mutations. The likelihood score of EGFR mutations is the predicted probability of a patient having EGFR mutations from a logistic regression model using baseline patient predictors, such as smoking history, sex, race, and histology. The likelihood score correlates with the true status of EGFR mutations, and it contains valuable auxiliary information about EGFR mutations for those subjects who have no EGFR mutations measured.
In the preceding example, the selection of a patient into the subset depends on the outcome (tumor response: yes vs. no) and an auxiliary variable (the likelihood of EGFR mutations: likely vs. unlikely). We refer to such a subsampling scheme as outcome- and auxiliary-dependent subsampling (OADS). OADS can be considered an extension of outcome-dependent subsampling (ODS). In ODS, the subsampling depends on the subjects’ outcomes in order to enrich the selected sample with those who have a rare outcome. Study designs using ODS have been discussed by Zhou and his colleagues (Weaver and Zhou, 2005; Zhou and Weaver, 2001; Zhou et al., 2002, 2007). In OADS, the subsampling depends on both the subjects’ outcome and an auxiliary variable. The motivation is to achieve higher efficiency by concentrating more information in the OADS subsample as compared to the simple random subsample (SRS) and the ODS subsample. Wang and Zhou (2006) considered an OADS design with two sampling components: a random sample (SRS) and an outcome- and auxiliary-dependent sample (OADS), in which all subjects have all variables observed, including the expensive biomarker. In contrast, the OADS design that we consider here consists of three sampling components: a simple random subsample (SRS), an outcome- and auxiliary-dependent subsample (OADS), and those subjects who are not selected. In this article, we assume that complete data are observed for each subject in the SRS and OADS subsamples. We also assume that the outcome and the auxiliary variable are observed for the rest of the subjects in the study cohort.
The origin of the ODS sampling can be found in the case-control study (e.g., Breslow and Day, 1993) and its extensions such as the nested case-control study (Breslow and Cain, 1988), case-cohort study (Prentice, 1986), and two-stage study (e.g., Breslow and Chatterjee, 1999; Weinberg and Wacholder, 1993; White, 1982). These designs may be considered as examples of ODS sampling. The sampling scheme we consider in this article is closely related to a two-stage study, in which the outcomes and some stratification variables of all subjects are observed at the first stage, but the expensive biomarker and other variables are observed in a subsample of all subjects at the second stage. In a general framework of a two-stage sampling, Weaver and Zhou (2005) developed an estimated likelihood method to allow both continuous outcome and ODS subsampling. For the two-stage study with binary outcome, Flanders and Greenland (1991) and Zhao and Lipsitz (1992) proposed a Horvitz–Thompson type method (Horvitz and Thompson, 1952) that weights the complete data observed inversely with the selection probability; Breslow and Cain (1988) developed a conditional likelihood estimator; Wild (1991), Cosslett (1981), and Breslow and Holubkov (1997) studied nonparametric maximum likelihood estimation. The two-stage sampling scheme can be viewed as a missing data problem, where some subjects have the biomarker of interest missing by design in our example. In this case, the missingness is missing at random (MAR) as defined by Little and Rubin (1987). Robins et al. (1994) proposed a class of semiparametric estimators based on the inverse selection probability weighted estimating equations for the missing covariates problem. These statistical methods were proposed without considering the role of auxiliary variable, and many of these methods are not fully likelihood-based. Therefore, they may not be ideal methods for data arising from the motivating lung cancer biomarker study.
To make unbiased inference on data arising from the ODS design or the OADS design, one generally needs statistical methods that account for the outcome-dependent nature of the sampling scheme. In this article, we study statistical inference on the OADS design that consists of three sampling components: a simple random subsample (SRS), an outcome- and auxiliary-dependent subsample (OADS), and those subjects who are not selected. We formulate the association between biomarker and clinical outcome in a generalized linear model. We use an empirical likelihood method (Owen, 1988, 1990; Qin and Lawless, 1994) to enforce the constraints existing among different quantities of the observed likelihood and to estimate the conditional distribution of the covariates G(x | w). The proposed method is efficient because it involves a fully likelihood-based estimator. The proposed method is semiparametric in the sense that it treats G(x | w) as nuisance and profiles the quantity out from the likelihood function using a nonparametric procedure.
We organize the rest of the article as follows. In section 2, we specify the data structure and the likelihood function of the OADS design. In section 3, we propose a semiparametric empirical likelihood estimation method and present the asymptotic properties of the proposed estimator. In section 4, we compare via simulation the proposed method to several alternative methods and investigate the effect of the correlation between the biomarker and its auxiliary variable on estimation efficiency. A data example is provided in section 5 to illustrate the proposed method. The Appendix gives a proof outline for the asymptotic properties of the proposed estimator.
2. DATA STRUCTURE AND LIKELIHOOD
Let Y be a categorical outcome with possible values 1, …, J. Let X1 be the biomarker of interest (continuous or binary). X1 is observed only if a subject is selected into either the SRS subsample or the OADS subsample. Let X2, …, Xp be additional covariates. Denote by X the covariate vector, i.e., X = {1, X1, X2, …, Xp}, where 1 represents the intercept. X may contain both continuous and discrete variables. Let P(Y | X) be the conditional density of Y given X, which can be parameterized as a generalized linear model Pβ(Y | X) = r(X′β), with r⁻¹ a known link function and β the regression parameters for X.
Let W be a categorical auxiliary variable for X1 with possible values 1, …, K. Assume {Y, W} partitions the entire study cohort into J × K strata such that the number of subjects for the {Y = j, W = k} stratum is Njk. Following the motivating study, we first draw an SRS subsample from the study cohort, and then from the rest of study cohort we draw an OADS subsample from each of the strata {Y = j, W = k} with j = 1, …, J and k = 1, …, K. Denote the SRS subsample by V0, the OADS subsample by V1 and the rest of subjects by V̄. Define V = V0 + V1. Let V0jk, V1jk, Vjk and V̄jk denote the {Y = j, W = k} substrata for V0, V1, V, and V̄ respectively, and n0jk, n1jk, njk, and n̄jk for their sizes. Notice that njk = n0jk + n1jk, where n1jk is fixed by design, and n̄jk = Njk − njk. The data structure is as follows:
The combined likelihood consists of the contribution from the observed data of subjects in V0, V1, and V̄. Since the N − n0 observations left in the cohort after the SRS subsample is drawn still constitute a random sample from the joint distribution of {Y, W}, {Njk − n0jk} must be distributed according to a multinomial distribution. Therefore, the distribution of {n̄jk} must be the same as that of {Njk − n0jk}, shifted with respect to the value {n1jk}. By putting the three parts together and by applying Bayes law to P(X | Y, W), we have the combined likelihood
(1) |
where πjk needs to satisfy the constraint
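\[ \pi_{jk} = \int P_{\beta}(Y = j \mid x)\, dG(x \mid w = k), \qquad j = 1, \dots, J;\ k = 1, \dots, K, \tag{2} \]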
where G(x | w) is the cumulative distribution of X given W. We assume P(Y | X, W) = Pβ(Y | X), which is true when W is a surrogate for X or when W is absorbed by X.
3. SEMIPARAMETRIC EMPIRICAL LIKELIHOOD METHOD
Because the constraint (2) involves G(x | w = k), inference about β requires estimation of G(x | w). One straightforward approach is to fit a parametric model to G(x | w), but β̂ may be inconsistent if G(x | w = k) is misspecified, especially when x is high-dimensional. In this section, we describe a semiparametric empirical likelihood based method that allows the maximization of the likelihood function (1) with respect to β without specifying G(x | w).
3.1. The Proposed Method
For simplicity of presentation, we assume that Y ∈ {1, 2} is a binary variable, with 1 for a positive and 2 for a negative outcome, and that W ∈ {1, 2} is a binary auxiliary variable. Pβ(y | x) is parameterized as a generalized linear model with a known link function, such as logit, probit, or log-log. Let Vk = ∪j Vjk, the selected subjects with W = k, with size nk = Σj njk. Note that n = Σk nk. The log of the combined likelihood (1) becomes
(3) |
where pik = g(xi | wi = k) and θ = (β′, π11, π12)′ with π2k = 1 − π1k.
To estimate θ, we first profile the likelihood function l(θ, G(x | w)) in Eq. (3) by fixing θ and obtaining the empirical likelihood function of G(x | w) over all distributions whose support contains the observed x values. We then maximize the resulting profile likelihood function with respect to θ. Specifically, to maximize over G(x | w), we only need to consider discrete distributions with jumps at each of the observed points (Owen, 1988, 1990; Qin and Lawless, 1994). As a result, we search for the pik that maximize l(θ, G(x | w)) under the following constraints:
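\[ p_{ik} \ge 0, \qquad \sum_{i \in V_k} p_{ik} = 1, \qquad \sum_{i \in V_k} p_{ik}\,\{P(y = 1 \mid x_i; \beta) - \pi_{1k}\} = 0, \qquad k = 1, 2. \tag{4} \]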
For a fixed θ, a unique maximum pik which satisfies the above constraints exists if 0 is inside the convex hull formed by the points {P(y = 1 | xi; β) − π1k} for i ∈ Vk.
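For intuition, the inner maximization over pik has the same structure as the standard empirical likelihood problem of Owen (1988) with a single estimating function. The Python sketch below solves that textbook problem for one stratum k: given gi = P(y = 1 | xi; β) − π1k, it finds the multiplier by bisection and returns the weights. The function name `el_weights`, the bisection solver, and the tolerances are illustrative assumptions, and the scaling of the multiplier may differ from the λk used in this article.

```python
def el_weights(g, tol=1e-12, max_iter=200):
    """Textbook empirical-likelihood weights (illustrative helper).

    Maximizes sum(log p_i) subject to sum(p_i) = 1 and sum(p_i * g_i) = 0,
    which gives p_i = 1 / (n * (1 + lam * g_i)).
    """
    n = len(g)
    g_min, g_max = min(g), max(g)
    # Convex hull condition: 0 must lie strictly between min(g) and max(g).
    if not (g_min < 0.0 < g_max):
        raise ValueError("0 is not inside the convex hull of the g_i")
    # lam must keep every 1 + lam * g_i positive.
    lo, hi = -1.0 / g_max, -1.0 / g_min
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps

    def score(lam):
        return sum(gi / (1.0 + lam * gi) for gi in g)

    # score() is strictly decreasing in lam, so bisection finds its root.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    p = [1.0 / (n * (1.0 + lam * gi)) for gi in g]
    return lam, p
```

When 0 falls outside the range of the gi, the function raises an error, which mirrors the convex hull condition stated above. A call such as `el_weights([-0.10, -0.05, 0.05, 0.25])` returns a positive multiplier and weights that sum to one.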
Let lN (θ) be the maximum value of the log likelihood with respect to pik, where pik and θ now satisfy the constraints (4). An explicit expression for lN (θ) can be derived by a Lagrange multiplier argument:
where λk and tk are Lagrange multipliers. Taking derivatives with respect to pik, and setting and , we have
with restriction
(5) |
where λk and θ satisfy the constraint (5).
Typically, the true value of the Lagrange multiplier is zero. However, due to the nature of the biased sampling schemes, the λk are not centered at zero. To be consistent with the literature of empirical likelihood theory, we center them by reparameterizing λk as follows
such that νk has a true value 0. Now we write pik as
and the constraint (5) as
(6) |
where
Substituting pik into the likelihood (3), we obtain the log empirical profile likelihood
(7) |
The constraint (6) implicitly defines a continuous and differentiable function νk(θ), and it is the same as . The estimate β̂ can be obtained by solving the score equations , and . We use the Newton–Raphson iterative algorithm to solve the equations, and transform π to the logit scale to avoid the boundary problem.
3.2. Asymptotic Properties
Assume nk/N → ρk, nhk/nk → ρhk, nhjk/nhk → ρhjk, and all ρ’s are between 0 and 1. Note ρ0jk ≡ πjk. Let ξ′ = (β′, π′, ν′) be the parameter vector with dimension (p + 2K), where dim(β) = p. Let QN(ξ) be the random profile score function with respect to ξ, and let Q(ξ) be its limiting form, so that ξ* satisfies Q(ξ*) = 0.
Theorem
For N large enough, the score equations QN(ξ) = 0 have a solution ξ̂ in a small neighborhood of ξ*, the solution of Q(ξ) = 0, and ξ̂ → ξ* in probability as N → ∞.
Under general regularity conditions, the solution ξ̂ of QN(ξ) = 0 satisfies the constraint equation (6) and the score equations of lN(θ) in (7), and
\[ \sqrt{N}\,(\hat{\xi} - \xi^{*}) \xrightarrow{d} N(0, \Sigma), \]
where Σ = limN → ∞ N var(ξ̂) = S⁻¹(V + A)S⁻¹, and S, V, and A are defined in the Appendix.
An outline of the proof for the theorem is given in the Appendix. A consistent estimator of the covariance matrix Σ is Ŝ⁻¹(V̂ + Â)Ŝ⁻¹, where Ŝ, V̂, and Â are obtained by replacing the large sample quantities by their corresponding finite-sample counterparts.
4. SIMULATION STUDIES
4.1. Comparison with Alternative Methods
To evaluate the proposed method β̂SMP, we conduct a simulation study to investigate its small sample behavior and compare it with alternative methods. We review the alternative methods as follows.
β̂CS: The component stratification method treats the subsampling components as separate strata. It fits a logistic regression model to the pooled SRS and OADS data by setting a separate intercept term for each component. This may be viewed as an application of the Prentice and Pyke (1979) result for case-control data to our setting. This method is not fully likelihood-based and only works for data with a binary outcome and a regression model with logit link.
β̂SM: Wang and Zhou (2006) studied a semiparametric method for an OADS design with unknown {Njk}. Working with the combined likelihood of the SRS component and the OADS component only, the authors applied a similar empirical likelihood approach to estimate β.
β̂WL: The weighted-likelihood method applies the Horvitz–Thompson approach to data from a two-stage study (e.g., Flanders and Greenland, 1991; Zhao and Lipsitz, 1992). The idea is to estimate β by using the completely observed subjects and weighting their contributions inversely according to their probability of selection into the second stage.
β̂CL: The conditional likelihood method focuses on the conditional probability that a subject is selected into the second stage P(xi|y = j, w = k, δi = 1) where δi is the selection indicator (Breslow and Cain, 1988). The β is estimated after factoring the likelihood into P(y = j|xi; β) and the selection probability.
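As an illustration of the weighted-likelihood idea, the sketch below fits a logistic model with an intercept and one covariate by solving inverse-probability-weighted score equations with Newton–Raphson. It is a minimal sketch under stated assumptions, not the implementation used in the simulations: the function name `weighted_logistic` is hypothetical, the outcome is coded 0/1 rather than {1, 2}, and the weights are assumed to be supplied by the user, for example as the inverse of each subject's stratum selection fraction.

```python
import math

def weighted_logistic(x, y, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * (y_i - p_i) * (1, x_i) = 0 for a logistic model.

    x: covariate values, y: outcomes coded 0/1, w: positive weights
    (assumed here to be inverse selection probabilities).
    Returns the intercept b0 and slope b1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Weighted score (u0, u1) and information matrix [[a, b], [b, c]].
        u0 = u1 = a = b = c = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)
            v = wi * p * (1.0 - p)
            u0 += r
            u1 += r * xi
            a += v
            b += v * xi
            c += v * xi * xi
        det = a * c - b * b
        # Newton-Raphson step using the closed-form 2x2 inverse.
        d0 = (c * u0 - b * u1) / det
        d1 = (a * u1 - b * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1
```

With a binary covariate the model is saturated, so the fitted probabilities equal the weighted outcome proportions in each covariate group, which gives a simple check of the routine. Variance estimation for weighted estimators requires a sandwich-type formula and is not covered by this sketch.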
In the case of a binary outcome, we generate y = {1, 2} according to the following logistic model:
\[ \operatorname{logit} P(y = 1 \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \]
where β0 = −2.5, β1 = 0.0 or 0.5, and β2 = 0.5. The value of β0 is chosen to simulate the situation of a rare disease. Under this model, βk (k = 0, 1, 2) represents the increase in log odds ratio of observing y = 1 due to the corresponding covariate while holding other covariates fixed. Two types of distribution of X are considered: (i) where and C90% is its 90th percentile, x2 has the same distribution as x1, and they are independent. Define w = I(x1 + ε > C90%), where ε ~ N(0, 0.25), such that corr(x1, w) = 0.64. (ii) x1 and x2 are independent standard normal variables. Define w = I[x1 + ε > 0], where ε ~ N(0, 0.25), such that corr(x1, w) = 0.82.
A simple random subsample (SRS) of size n0 = 300 is chosen first without stratifying on w from the study cohort of 30,000. Then 50 subjects are sampled from each of the four strata defined by y and w, resulting in n1 = 200 subjects in the OADS subsample.
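The sampling steps just described can be sketched in Python as follows, using setting (ii) with normal covariates. This is a hedged illustration rather than the simulation code behind the tables: the function names `simulate_cohort` and `draw_oads` are hypothetical, w is coded 1/0 for the indicator, and N(0, 0.25) is read as variance 0.25 (standard deviation 0.5), which is an assumption about the notation.

```python
import math
import random

def simulate_cohort(n, beta=(-2.5, 0.5, 0.5), sd_eps=0.5, seed=1):
    """Generate tuples (y, w, x1, x2) for a cohort of size n.

    x1, x2 ~ N(0, 1); w = 1 if x1 + eps > 0, else 0;
    y = 1 with logistic probability, y = 2 otherwise.
    sd_eps = 0.5 assumes that 0.25 denotes the variance of eps.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        w = 1 if x1 + rng.gauss(0, sd_eps) > 0 else 0
        eta = beta[0] + beta[1] * x1 + beta[2] * x2
        p = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < p else 2
        cohort.append((y, w, x1, x2))
    return cohort

def draw_oads(cohort, n0=300, n1_per_stratum=50, seed=2):
    """Return index sets (V0, V1): an SRS, then a fixed-size draw per {y, w} stratum."""
    rng = random.Random(seed)
    ids = list(range(len(cohort)))
    srs = set(rng.sample(ids, n0))            # V0: SRS, not stratified on w
    remaining = {}
    for i in ids:
        if i not in srs:
            key = (cohort[i][0], cohort[i][1])
            remaining.setdefault(key, []).append(i)
    oads = set()
    for members in remaining.values():        # V1: drawn from the rest of the cohort
        k = min(n1_per_stratum, len(members))
        oads.update(rng.sample(members, k))
    return srs, oads
```

Subjects in neither index set form the unselected component, for which only y and w would be retained in an analysis.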
The simulation is based on 2,000 independent runs. Results are displayed in Table 1 for binary x1 and Table 2 for normal x1, with the mean of the estimates (Mean), the standard deviation of the estimates (SE), the mean of the estimated standard errors (ŜE), and the coverage of the 95% nominal confidence intervals (Coverage).
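The four summary quantities reported in the tables can be computed from the per-run estimates and estimated standard errors as in the following sketch. The function name `summarize_runs` and the dictionary keys are hypothetical, and the interval is assumed to be the usual Wald interval, estimate ± 1.96 × estimated standard error.

```python
import statistics

def summarize_runs(estimates, std_errors, truth, z=1.96):
    """Summarize simulation replicates for one parameter.

    estimates:  point estimates, one per simulation run
    std_errors: estimated standard errors, one per run
    truth:      true parameter value used to generate the data
    """
    covered = sum(
        1 for b, s in zip(estimates, std_errors) if abs(b - truth) <= z * s
    )
    return {
        "Mean": statistics.mean(estimates),        # mean of the estimates
        "SE": statistics.stdev(estimates),         # SD of the estimates
        "MeanEstSE": statistics.mean(std_errors),  # mean estimated SE
        "Coverage": covered / len(estimates),      # share of intervals covering truth
    }
```

Agreement between the "SE" and "MeanEstSE" entries indicates that the variance estimator tracks the actual sampling variability, which is the comparison made in the tables.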
Table 1.
Methods | Parameter | Mean (β1 = 0.0) | SE (β1 = 0.0) | ŜE (β1 = 0.0) | Coverage (β1 = 0.0) | Mean (β1 = 0.5) | SE (β1 = 0.5) | ŜE (β1 = 0.5) | Coverage (β1 = 0.5)
---|---|---|---|---|---|---|---|---|---
CS | β̂0 | −2.53 | 0.24 | 0.23 | 0.95 | −2.52 | 0.24 | 0.23 | 0.95
CS | β̂1 | 0.01 | 0.39 | 0.37 | 0.95 | 0.51 | 0.39 | 0.37 | 0.94
CS | β̂2 | 0.50 | 0.38 | 0.37 | 0.95 | 0.49 | 0.38 | 0.36 | 0.95
SM | β̂0 | −2.52 | 0.22 | 0.22 | 0.95 | −2.51 | 0.22 | 0.22 | 0.95
SM | β̂1 | 0.01 | 0.34 | 0.33 | 0.94 | 0.51 | 0.33 | 0.32 | 0.95
SM | β̂2 | 0.49 | 0.33 | 0.32 | 0.96 | 0.48 | 0.33 | 0.32 | 0.95
WL | β̂0 | −2.51 | 0.06 | 0.06 | 0.94 | −2.50 | 0.06 | 0.06 | 0.95
WL | β̂1 | 0.01 | 0.27 | 0.26 | 0.93 | 0.50 | 0.24 | 0.23 | 0.94
WL | β̂2 | 0.48 | 0.38 | 0.36 | 0.95 | 0.47 | 0.36 | 0.35 | 0.95
CL | β̂0 | −2.50 | 0.07 | 0.07 | 0.94 | −2.50 | 0.07 | 0.07 | 0.95
CL | β̂1 | 0.01 | 0.19 | 0.18 | 0.95 | 0.50 | 0.18 | 0.17 | 0.94
CL | β̂2 | 0.49 | 0.33 | 0.32 | 0.95 | 0.48 | 0.33 | 0.32 | 0.95
SMP | β̂0 | −2.50 | 0.05 | 0.05 | 0.94 | −2.50 | 0.05 | 0.05 | 0.94
SMP | β̂1 | 0.00 | 0.11 | 0.11 | 0.96 | 0.50 | 0.10 | 0.10 | 0.96
SMP | β̂2 | 0.49 | 0.33 | 0.32 | 0.96 | 0.48 | 0.33 | 0.32 | 0.95
ALL | β̂0 | −2.50 | 0.03 | 0.02 | 0.94 | −2.50 | 0.03 | 0.02 | 0.95
ALL | β̂1 | −0.00 | 0.07 | 0.07 | 0.95 | 0.50 | 0.06 | 0.06 | 0.95
ALL | β̂2 | 0.50 | 0.06 | 0.06 | 0.94 | 0.50 | 0.06 | 0.06 | 0.95
Note: Assume logit P(y = 1 | x1, x2) = β0 + β1x1 + β2x2, where β0 = −2.5, β1 = 0.0 or 0.5, and β2 = 0.5. x1 is binary, as defined in setting (i) of Section 4.1; x2 has the same distribution as x1, and they are independent. w = I(x1 + ε > C90%), where ε ~ N(0, 0.25). ŜE denotes the mean of the estimated standard errors.
Table 2.
Methods | Parameter | Mean (β1 = 0.0) | SE (β1 = 0.0) | ŜE (β1 = 0.0) | Coverage (β1 = 0.0) | Mean (β1 = 0.5) | SE (β1 = 0.5) | ŜE (β1 = 0.5) | Coverage (β1 = 0.5)
---|---|---|---|---|---|---|---|---|---
CS | β̂0 | −2.55 | 0.36 | 0.35 | 0.94 | −2.56 | 0.42 | 0.39 | 0.96
CS | β̂1 | 0.01 | 0.20 | 0.20 | 0.95 | 0.51 | 0.20 | 0.20 | 0.95
CS | β̂2 | 0.51 | 0.13 | 0.13 | 0.94 | 0.51 | 0.13 | 0.13 | 0.95
SM | β̂0 | −2.53 | 0.22 | 0.22 | 0.95 | −2.53 | 0.22 | 0.22 | 0.95
SM | β̂1 | 0.00 | 0.14 | 0.15 | 0.95 | 0.51 | 0.16 | 0.15 | 0.94
SM | β̂2 | 0.51 | 0.11 | 0.11 | 0.95 | 0.50 | 0.11 | 0.11 | 0.95
WL | β̂0 | −2.51 | 0.05 | 0.05 | 0.94 | −2.51 | 0.06 | 0.06 | 0.94
WL | β̂1 | 0.00 | 0.07 | 0.07 | 0.95 | 0.51 | 0.08 | 0.08 | 0.95
WL | β̂2 | 0.51 | 0.11 | 0.11 | 0.95 | 0.50 | 0.12 | 0.12 | 0.95
CL | β̂0 | −2.51 | 0.05 | 0.05 | 0.94 | −2.50 | 0.05 | 0.05 | 0.95
CL | β̂1 | 0.00 | 0.07 | 0.07 | 0.95 | 0.50 | 0.07 | 0.07 | 0.94
CL | β̂2 | 0.51 | 0.11 | 0.11 | 0.95 | 0.50 | 0.11 | 0.11 | 0.95
SMP | β̂0 | −2.51 | 0.05 | 0.05 | 0.94 | −2.50 | 0.05 | 0.05 | 0.95
SMP | β̂1 | 0.00 | 0.04 | 0.04 | 0.96 | 0.50 | 0.04 | 0.04 | 0.96
SMP | β̂2 | 0.51 | 0.11 | 0.11 | 0.95 | 0.50 | 0.11 | 0.11 | 0.96
ALL | β̂0 | −2.50 | 0.02 | 0.02 | 0.94 | −2.50 | 0.02 | 0.02 | 0.94
ALL | β̂1 | 0.00 | 0.02 | 0.02 | 0.95 | 0.50 | 0.02 | 0.02 | 0.95
ALL | β̂2 | 0.50 | 0.02 | 0.02 | 0.95 | 0.50 | 0.02 | 0.02 | 0.95
Note: Assume logit P(y = 1 | x1, x2) = β0 + β1x1 + β2x2, where β0 = −2.5, β1 = 0.0 or 0.5, and β2 = 0.5. x1 and x2 are independent standard normal variables; w = I[x1 + ε > 0], where ε ~ N(0, 0.25). ŜE denotes the mean of the estimated standard errors.
We make the following observations from the two tables. First, all estimators except β̂CS yield consistent estimates for all regression parameters β including the intercept term. Second, as shown by a smaller standard error, β̂SM is more efficient in estimating β1 than β̂CS is, but the estimators β̂WL, β̂CL, and β̂SMP, which incorporate the information from the Njk, are more efficient than the estimators β̂CS and β̂SM, which do not utilize such information. Third, the proposed method β̂SMP produces good estimation of var(β̂SMP), as shown in Tables 1 and 2, by the coverage of the 95% nominal confidence intervals based on the estimated variances. As a reference, the two tables also provide the results assuming all subjects in the cohort have complete data (β̂ALL). It can be seen that the standard errors of β̂ associated with β̂SMP are not much bigger than those of β̂ALL.
Based on the preceding observations, we conclude that the proposed estimator β̂SMP performs better than alternative estimators in the analysis of the OADS design data with known Njk. It yields more efficient estimates than the weighted likelihood method and the conditional likelihood method. The proposed variance estimator yields good coverage by the 95% nominal confidence intervals.
4.2. Impact of Auxiliary Information
An important issue in the OADS subsampling that we discussed is the impact of the auxiliary variable on study efficiency. In particular, it is useful to examine how the correlation between the biomarker and its auxiliary variable affects the estimation precision of β̂1. We only consider the case of normally distributed X1. Through simulation, we investigate this issue by systematically changing the variance of the normally distributed random error ε such that corr(X1, W) varies from 0.0 to 0.9 with seven intermediate values, while the other simulation settings remain the same. Figure 1 displays the relationship between the simulated standard deviation of β̂1 and corr(X1, W) for four estimators, β̂SM, β̂WL, β̂CL, and β̂SMP, where each point is based on 2,000 independent simulation runs.
The main observations from Figure 1 are as follows. As the correlation between X1 and W increases, the standard errors of β̂1 steadily decrease, regardless of the distribution of X1 and the definition of W. Substantial efficiency gain, say >30% of the maximum possible gain, occurs when corr(X1, W) is moderate, say >0.25. Importantly, if there is no correlation between X1 and W, i.e., W contains no auxiliary information on X1, the standard error of β̂1 from β̂SM and those from β̂CL and β̂SMP are about the same. In other words, a noninformative auxiliary variable will not improve the estimation precision of the biomarker effect, though it does improve the precision in estimating the intercept (not shown in this figure). Also, β̂SMP appears at least as efficient as β̂SM, β̂WL, and β̂CL in estimating β1 when the auxiliary variable W is not informative on X1.
5. DATA EXAMPLE
The motivating lung cancer biomarker study is a concept to be approved by the consortium and the data are not available for illustration. In this section, we illustrate the proposed method using a dataset from the Collaborative Perinatal Project (CPP) (Niswander and Gordon, 1972). Women who were pregnant were enrolled through university-affiliated medical clinics. In all, 55,908 pregnancies were registered, representing the experience of about 44,000 women. The children born during the study were followed for various outcomes for up to 8 years. One of the hypotheses is that the level of polychlorinated biphenyl (PCB), a pollutant, is related to performance on the Wechsler Intelligence Scale for children at 7 years of age (Longnecker et al., 2002). Because of the cost associated with the blood serum assay, the PCB level is measured for two supplemental subgroups of subjects in addition to a random sample of 1,000 subjects. Two supplemental subgroups have 200 subjects each sampled from the subgroup defined by the IQ scores that are one standard deviation above and below the mean of the cohort IQ scores.
The CPP is an ongoing study and the available dataset has 849 SRS subjects and 189 OADS subjects. The OADS subsample is selected by the outcome only in the original CPP study design. For the purpose of illustration, we resample the 849 subjects in the SRS subsample of the available dataset and form a hypothetical cohort of size 7,500. We define a binary outcome with Y = 1 if IQ is below normal and Y = 2 otherwise. We create a surrogate W for PCB by letting W = 1 if PCB + ε ≤ 6.3 and W = 2 otherwise, where 6.3 is the 75th percentile of PCB such that corr(PCB, W) = 0.44. Other covariates include race (white vs. black), socioeconomic status of the child’s family (SES), the child’s sex (female vs. male), and the mother’s education level (MEDU). Continuous variables, MEDU, SES, and PCB, are centered at their means (Table 3).
Table 3.
Y | W | V0 | V1 | V̄
---|---|---|---|---
1 | 1 | 18 | 20 | 419
1 | 2 | 3 | 20 | 79
2 | 1 | 228 | 20 | 5483
2 | 2 | 51 | 20 | 1139
Total | | 300 | 80 | 7120
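As a worked check of the counts in Table 3, the sketch below recovers the stratum sizes Njk = n0jk + n1jk + n̄jk, the selected counts njk, and the empirical proportions Njk/Nk. The inverse selection fraction Njk/njk is included as one natural Horvitz–Thompson-type weight; treating it as the weight used by β̂WL is an assumption, and the names `TABLE3` and `stratum_summary` are hypothetical.

```python
# Table 3 counts: (y, w) -> (n0jk in V0, n1jk in V1, unselected count)
TABLE3 = {
    (1, 1): (18, 20, 419),
    (1, 2): (3, 20, 79),
    (2, 1): (228, 20, 5483),
    (2, 2): (51, 20, 1139),
}

def stratum_summary(counts):
    """Return N_jk, n_jk, N_jk / n_jk, and N_jk / N_k for each {y, w} stratum."""
    big_n = {key: n0 + n1 + nbar for key, (n0, n1, nbar) in counts.items()}
    n_by_w = {}
    for (y, w), value in big_n.items():
        n_by_w[w] = n_by_w.get(w, 0) + value       # N_k = sum over j of N_jk
    summary = {}
    for key, (n0, n1, nbar) in counts.items():
        n_sel = n0 + n1                            # n_jk = n0jk + n1jk
        summary[key] = {
            "N": big_n[key],
            "n": n_sel,
            "weight": big_n[key] / n_sel,          # inverse selection fraction
            "pi_hat": big_n[key] / n_by_w[key[1]], # empirical P(Y = j | W = k)
        }
    return summary
```

Applied to `TABLE3`, the stratum sizes add up to the hypothetical cohort size of 7,500, and the {Y = 1, W = 2} stratum has 102 subjects, of whom 23 are selected.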
The results of the analysis in Table 4 confirm the previous simulation findings. The estimators β̂WL, β̂CL, and β̂SMP yield more precise estimates than β̂CS and β̂SM, as evidenced by their narrower 95% CIs on the odds ratio for PCB. Further, β̂SMP is the most efficient among β̂WL, β̂CL, and β̂SMP, which use the cohort stratum size information. The efficiency gain of β̂WL, β̂CL, and β̂SMP over β̂CS and β̂SM is observed only for the PCB effect, while the standard errors of the other covariates remain unchanged, since the chosen auxiliary variable is not correlated with the other covariates in the model. The analysis suggests that the PCB level in the trimester blood serum specimens is not significantly related to the abnormal IQ status of children at 7 years of age. Children who are white, have higher socioeconomic status, and whose mothers have more years of education have better odds of having normal or high IQ scores.
Table 4.
Methods | Covariate | β̂ | SE(β̂) | OR | 95% CI
---|---|---|---|---|---
CS | int | −2.726 | 0.390 | 0.066 | (0.030, 0.141)
CS | PCB | 0.035 | 0.074 | 1.036 | (0.896, 1.198)
CS | MEDU | −0.425 | 0.091 | 0.654 | (0.547, 0.781)
CS | SES | −0.104 | 0.121 | 0.902 | (0.712, 1.142)
CS | RACE | −1.680 | 0.453 | 0.186 | (0.077, 0.453)
CS | SEX | 0.297 | 0.379 | 1.346 | (0.640, 2.831)
SM | int | −2.469 | 0.344 | 0.085 | (0.043, 0.166)
SM | PCB | 0.045 | 0.051 | 1.046 | (0.946, 1.156)
SM | MEDU | −0.346 | 0.078 | 0.707 | (0.607, 0.824)
SM | SES | −0.121 | 0.113 | 0.886 | (0.710, 1.106)
SM | RACE | −1.637 | 0.394 | 0.195 | (0.090, 0.421)
SM | SEX | −0.045 | 0.334 | 0.956 | (0.496, 1.840)
WL | int | −2.418 | 0.265 | 0.089 | (0.053, 0.150)
WL | PCB | 0.020 | 0.066 | 1.021 | (0.896, 1.163)
WL | MEDU | −0.318 | 0.076 | 0.727 | (0.626, 0.845)
WL | SES | −0.161 | 0.114 | 0.851 | (0.681, 1.064)
WL | RACE | −1.570 | 0.411 | 0.208 | (0.093, 0.465)
WL | SEX | 0.012 | 0.345 | 1.012 | (0.515, 1.988)
CL | int | −2.419 | 0.261 | 0.089 | (0.053, 0.148)
CL | PCB | 0.044 | 0.040 | 1.045 | (0.966, 1.131)
CL | MEDU | −0.347 | 0.078 | 0.707 | (0.606, 0.825)
CL | SES | −0.121 | 0.112 | 0.886 | (0.711, 1.104)
CL | RACE | −1.637 | 0.393 | 0.195 | (0.090, 0.420)
CL | SEX | −0.046 | 0.333 | 0.956 | (0.498, 1.835)
SMP | int | −2.390 | 0.254 | 0.092 | (0.056, 0.151)
SMP | PCB | 0.029 | 0.038 | 1.029 | (0.955, 1.109)
SMP | MEDU | −0.349 | 0.079 | 0.705 | (0.604, 0.824)
SMP | SES | −0.116 | 0.112 | 0.890 | (0.714, 1.110)
SMP | RACE | −1.644 | 0.395 | 0.193 | (0.089, 0.419)
SMP | SEX | −0.051 | 0.336 | 0.951 | (0.492, 1.836)
Note: The outcome is below-normal IQ scores for children at 7 years of age. PCB is the effect of interest. SES is the socioeconomic status of the family; SEX and RACE are the gender and race of the child. MEDU is the mother’s education level. Continuous covariates including PCB, MEDU, and SES are centered at their means.
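The OR and 95% CI columns of Table 4 are consistent with exponentiating a Wald interval built on the log-odds scale. The sketch below shows that calculation; the function name `odds_ratio_ci` is hypothetical, and the use of the normal quantile 1.96 is an assumption about how the intervals were formed.

```python
import math

def odds_ratio_ci(beta_hat, se, z=1.96):
    """Odds ratio and Wald interval: exp(beta_hat) and exp(beta_hat -/+ z * se)."""
    lower = math.exp(beta_hat - z * se)
    upper = math.exp(beta_hat + z * se)
    return math.exp(beta_hat), lower, upper

# PCB row for the SMP estimator in Table 4: beta_hat = 0.029, SE = 0.038
or_pcb, lo_pcb, hi_pcb = odds_ratio_ci(0.029, 0.038)
```

For this row the result agrees with the tabulated odds ratio and interval up to rounding of the reported β̂ and SE(β̂). Because the interval contains 1, the PCB effect is not statistically significant at the 5% level.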
6. DISCUSSION
We have proposed a semiparametric empirical likelihood-based method for efficient inference on data from an outcome- and auxiliary-dependent subsampling design in which both a simple random subsample (SRS) and an outcome- and auxiliary-dependent subsample (OADS) are simultaneously observed. Adding an OADS subsample to the SRS subsample will improve the efficiency for a rare disease. One can view the design as a generalization of the case-cohort study and the two-stage study. The advantage of the design is that when the cohort stratum size information, as defined by the outcome and the auxiliary variable, is known, we may further increase study efficiency. The proposed method is applicable to a generalized linear model with any link function for categorical outcome data. This method is robust to misspecification of the conditional distribution of the biomarker covariates, given the auxiliary variable. The proposed estimator is asymptotically normal, and simulation supports its small sample behavior. In the simulations, it is more efficient than the alternative methods considered. The proposed variance estimator yields confidence intervals with coverage close to the nominal 95% level. Simulation also suggests that the OADS design is most efficient when the correlation between the biomarker and the chosen auxiliary variable is moderate or high. We illustrate the proposed method by fitting a logistic regression to a dataset from the CPP study.
Nonparametric estimation of the covariates distribution jointly with maximum likelihood estimation of β has been studied by several authors in two-stage studies (e.g., Breslow and Holubkov, 1997; Scott and Wild, 1997). In principle, these nonparametric maximum-likelihood methods can be extended to our design setting. Since our method is fully likelihood-based, the efficiency of these possible extensions are not expected better than our method. It should be pointed out that a binary outcome model with logit link is used to illustrate the subsampling scheme as well as the proposed estimator, but the approach is equally applicable to continuous outcome and linear regression model. In practice, multiple existing variables may contain auxiliary information for the biomarker of interest. More auxiliary variables mean more constraints to be used to derive the empirical likelihood-based estimator, and this may create a convergence problem in the case of small sample size. We recommend not adding auxiliary variables that lack correlation with the biomarker of interest. If multiple variables exist as candidates for auxiliary variables, a dimension reduction procedure can be used to summarize the auxiliary information. For example, in the EGFR lung caner study, we proposed to predict the EGFR mutation likelihood using a logistic regression model with these variables as predictive covariates and the predicted EGFR mutation likelihood is used as the sole auxiliary variable for stratification. It is worth noticing that auxiliary variables are used to establish the connection between the selected patients and the unselected patients. An omission of auxiliary variable may cause loss of efficiency but will not complicate the consistency of the proposed estimator. When the auxiliary variable is continuous, one has to categorize the auxiliary variable into a categorical variable before applying the proposed method. 
Wang and Zhou (2009) studied an estimated likelihood method under a similar sampling scheme that uses the auxiliary information in a continuous variable with the help of a kernel smoother. An empirical likelihood approach that handles a continuous auxiliary covariate is theoretically possible and would be an interesting topic for future research.
Acknowledgments
The first author was supported by the National Cancer Institute, grant CA-131596, and a Duke Clinical and Translational Science Award, UL1-RR024128. The third author was supported by National Institutes of Health grant CA-79949.
APPENDIX
The log profile score function consists of two parts:
Let ξ = (β′, π′, ν′)′ = (β′, π11, π12, ν1, ν2)′. The score equations of the profile likelihood in Eq. (7) have the form
In particular,
where
Let h = 0, 1, with 0 corresponding to the SRS subsample V0 and 1 corresponding to the OADS subsample V1. We assume that, as the sample size grows, n/N → ρV, nk/N → ρk, nhk/nk → ρhk, and nhjk/nhk → ρhjk (with ρ0jk ≡ πjk), where all ρ’s lie in (0, 1). We also assume that the usual regularity conditions for maximum likelihood estimation are satisfied. The following two results are useful in our proof.
Result 1
As Njk/Nk → πjk and n0jk/n0k → πjk, it holds
where
and
And, means asymptotically equivalent .
Result 2
For any continuous function g(k(x, θ)), at ξ = ξ*,
First, we can show and at ξ = ξ*.
In particular,
where n̄jk/nk → γjk as nk/N → ρk, Nk/N → πk.
We can easily show that uniformly for ξ ∈ Ξ, and at the true parameter value, it is invertible. Furthermore, we have shown that . Clearly, is a continuous function of ξ that maps Ξ into Rp+2K. Then we can apply the general version of Foutz’s consistency theorem (Foutz, 1977) to conclude that ξ̂ exists in the set Ξ with probability approaching one as N → ∞, and since the size of Ξ is arbitrarily small, that .
To show asymptotic normality of ξ̂ where ξ = (β′, π′, ν′)′, we consider a first-order Taylor series expansion of the profile score function around the true parameter value ξ*:
where |ξ* − ξ̃| ≤ |ξ* − ξ̂|. To prove the asymptotic normality of ξ̂, we need to study the asymptotic behavior of each term on the right-hand side.
First consider the first derivatives with respect to ξ. By the law of large numbers, we have
When evaluated at ξ = ξ*, we have shown above . Recall . For i ∈ Vjk, the observations (yi, xi) are independent random draws either from the joint distribution of (Y, X, W) for i ∈ V0 or from the conditional distribution of X given (Y, W) for i ∈ V1, and they do not depend on the random stratum size (Njk − n0jk). That is, is independent of .
By applying the central limit theorem, we have
where
and
where
Similarly,
Since S(ξ*) is invertible, we can rearrange Eq. (6) as
By Slutsky’s theorem,
where Σ = N var(ξ̂) = S⁻¹(V + A)S⁻¹, which has a sandwich form.
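For orientation, the standard argument leading to a sandwich variance of this form can be sketched as follows. The notation is assumed rather than taken from the displayed equations: U_N(ξ) denotes the profile score, S(ξ*) is taken to be the symmetric probability limit of −N⁻¹ ∂U_N/∂ξ′, and V + A is read as the asymptotic variance of the normalized score, with V and A arising from the two independent components discussed above. The signs and scaling follow one common convention and may differ from the original derivation.

```latex
% Assumed notation: U_N(\xi) is the profile score; S(\xi^*) is the
% probability limit of -N^{-1}\,\partial U_N/\partial\xi'.
\begin{align*}
0 = U_N(\hat\xi)
  &= U_N(\xi^*) + \frac{\partial U_N(\tilde\xi)}{\partial \xi'}\,(\hat\xi - \xi^*), \\
\sqrt{N}\,(\hat\xi - \xi^*)
  &= \left[-\frac{1}{N}\,\frac{\partial U_N(\tilde\xi)}{\partial \xi'}\right]^{-1}
     \frac{1}{\sqrt{N}}\,U_N(\xi^*), \\
\frac{1}{\sqrt{N}}\,U_N(\xi^*) \xrightarrow{d} N(0,\; V + A),
  &\qquad
  -\frac{1}{N}\,\frac{\partial U_N(\tilde\xi)}{\partial \xi'} \xrightarrow{p} S(\xi^*), \\
\sqrt{N}\,(\hat\xi - \xi^*)
  &\xrightarrow{d} N\!\left(0,\; S^{-1}(V + A)\,S^{-1}\right).
\end{align*}
```

The last line combines the two limits in the third line through Slutsky's theorem, which is how the outer factors S⁻¹ and the inner factor V + A enter Σ.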
References
- Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
- Breslow NE, Day NE. Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. IARC; 1993.
- Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J Roy Statist Society B. 1997;59:447–461.
- Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Appl Statist. 1999;48:457–468.
- Cosslett SR. Maximum likelihood estimator for choice-based samples. Econometrica. 1981;49:1289–1316.
- Eberhard DA, Giaccone G, Johnson BE on behalf of the Molecular Assays in Non–Small-Cell Lung Cancer Working Group. Biomarkers of response to epidermal growth factor receptor inhibitors in nonsmall-cell lung cancer: Standardization for use in the clinical trial setting. J Clin Oncol. 2008;26(6):983–994. doi: 10.1200/JCO.2007.12.9858.
- Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statist Med. 1991;10:739–747. doi: 10.1002/sim.4780100509.
- Foutz RV. On the unique consistent solution to the likelihood equations. J Am Statist Assoc. 1977;72:147–148.
- Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Statist Assoc. 1952;47:663–685.
- Jänne PA, Wang XF, Kratzke R. Evaluation of EGFR and K-ras mutations in patients with non-small-cell lung cancer. Unpublished CALGB Study Concept. 2008.
- Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.
- Longnecker MP, Klebanoff MA, Zhou H, Brock JW. Maternal serum level of the DDT metabolite DDE is associated with preterm and small-for-gestational-age birth. Am J Epidemiol. 2002;155:311–322.
- Niswander KR, Gordon M. The Women and their Pregnancies. USDHEW Publication No. (NIH) 73-379. Washington, DC: U.S. Government Printing Office; 1972.
- Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75:237–249.
- Owen AB. Empirical likelihood for confidence regions. Ann Statist. 1990;18:90–120.
- Paez JG, Jänne PA, Lee JC, et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304:1497–1500. doi: 10.1126/science.1099314.
- Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
- Qin J, Lawless JF. Empirical likelihood and general estimating equations. Ann Statist. 1994;22:300–325.
- Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71.
- Wang XF, Zhou HB. A semiparametric empirical likelihood method for biased sampling schemes in epidemiologic studies with auxiliary covariates. Biometrics. 2006;62(4):1149–1160. doi: 10.1111/j.1541-0420.2006.00612.x.
- Wang XF, Zhou HB. Design and inference for cancer biomarker study with an outcome/auxiliary-dependent subsampling. Biometrics. 2009; in press. doi: 10.1111/j.1541-0420.2009.01280.x.
- Weaver MA, Zhou HB. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J Am Statist Assoc. 2005;100:459–469.
- Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika. 1993;80:461–465.
- Wild CJ. Fitting prospective regression models to case-control data. Biometrika. 1991;78:705–717.
- White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266.
- Zhao LP, Lipsitz S. Designs and analysis of two-stage studies. Statist Med. 1992;11:769–782. doi: 10.1002/sim.4780110608.
- Zhou H, Weaver MA. Outcome dependent selection models. Encyclopedia of Environmetrics. 2001;3:1499–1502.
- Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling design with a continuous outcome. Biometrics. 2002;58:413–421. doi: 10.1111/j.0006-341x.2002.00413.x.
- Zhou H, Chen J, Rissanen T, Korrick S, Hu H, Salonen JT, Longnecker MP. An efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology. 2007;18(4):461–468. doi: 10.1097/EDE.0b013e31806462d3.