Abstract
Two-phase studies are crucial when outcome and covariate data are available in a first-phase sample (e.g., a cohort study), but costs associated with retrospective ascertainment of a novel exposure limit the size of the second-phase sample, in whom the exposure is collected. For longitudinal outcomes, one class of two-phase studies stratifies subjects based on an outcome vector summary (e.g., an average or a slope over time) and oversamples subjects in the extreme value strata while undersampling subjects in the medium-value stratum. Based on the choice of the summary, two-phase studies for longitudinal data can increase efficiency of time-varying and/or time-fixed exposure parameter estimates. In this manuscript, we extend efficient, two-phase study designs to multivariate longitudinal continuous outcomes, and we detail two analysis approaches. The first approach is a multiple imputation analysis that combines complete data from subjects selected for phase two with the incomplete data from those not selected. The second approach is a conditional maximum likelihood analysis that is intended for applications where only data from subjects selected for phase two are available. Importantly, we show that both approaches can be applied to secondary analyses of previously conducted two-phase studies. We examine finite sample operating characteristics of the two approaches and use the Lung Health Study (Connett et al. (1993), Controlled Clinical Trials, 14, 3S–19S) to examine genetic associations with lung function decline over time.
Keywords: ascertainment corrected maximum likelihood, Lung Health Study, missing data, multiple imputation, outcome-dependent sampling, secondary outcome analysis
1 |. INTRODUCTION
Modern observational studies regularly collect phenotype (e.g., disease status and vital signs) and easily obtained covariate data for most or all study subjects, while biomarker exposure variables (e.g., DNA genotype or gut microbiome) used to address novel hypotheses are obtained with additional collection processes and therefore costs. In cases where ascertainment costs limit the sample size, an attractive approach is to implement a two-phase outcome-dependent sampling (ODS) design (White, 1982) that utilizes phase-one outcome and covariate data to select the most informative subjects for phase-two exposure collection. It is well known that two-phase studies can reduce costs and increase statistical power compared to simple random sampling (SRS) studies (Breslow and Chatterjee, 1999; Zhou et al., 2002; Tao et al., 2020).
There is extensive literature on two-phase ODS studies with a single outcome. For cross-sectional outcomes, classical examples include the case-control design for binary data and the extreme-tail design for continuous data (Zhou et al., 2002; Weaver and Zhou, 2005; Lin et al., 2013; Zhou et al., 2017). For longitudinal outcomes, efficient designs select subjects based on low-dimensional summaries of the outcome vector (Schildcrout et al., 2013; Sun et al., 2017). While Prentice and Pyke (1979) showed that for the case-control design with binary response data valid inference can be obtained using standard logistic regression, for nearly all other outcome distributions, failure to account for the design will generally result in bias and efficiency loss (Holt et al., 1980; Lin et al., 2013). Analysis procedures that properly account for two-phase ODS designs can be classified broadly into two classes: those that base inference on subjects sampled for phase two only (i.e., complete-case analysis) (Zhou et al., 2002) and those that combine complete data on subjects sampled for phase two with partial data on subjects not sampled for phase two (Lawless et al., 1999; Chatterjee et al., 2003; Weaver and Zhou, 2005; Song et al., 2009; Schildcrout et al., 2015; Sun et al., 2017; Tao et al., 2017).
Two-phase ODS studies with multivariate outcomes are becoming more common because they allow researchers to target informative subjects for multiple outcomes and therefore permit efficiency gains for multiple parameters. For example, the National Heart, Lung and Blood Institute Exome Sequencing Project (Lin et al., 2013) sought to boost statistical power to detect genetic associations by oversampling subjects with extreme values of low-density lipoprotein cholesterol, blood pressure, or body mass index (BMI) for whole exome sequencing. Similarly, the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study selected subjects with extreme values in 14 traits on top of a simple random sample for sequencing at 77 genomic loci that had been identified by genome-wide association studies to be associated with one or more traits (Lin et al., 2014). For these scenarios, Tao et al. (2015) developed a multivariate linear regression approach to conduct proper inference.
In this paper, we propose efficient, two-phase ODS designs and associated analysis procedures for multivariate, longitudinal continuous outcomes, which, to our knowledge, have not been studied. We describe designs that select informative subjects for second-phase exposure ascertainment based on low-dimensional summaries of subjects’ outcome vectors and covariate data. Corresponding to the designs, we present two analysis procedures: (1) a multiple imputation (MI) procedure that can be used when complete data from subjects selected for phase two and partial data from subjects not selected for phase two are available, and (2) a complete-case procedure that extends ascertainment corrected maximum likelihood (ACML; Schildcrout et al., 2013) to multivariate outcomes and applies when only data from subjects selected for phase two are available. Importantly, we show that the MI and ACML procedures can both be applied to secondary analyses of previously conducted two-phase designs, a problem that has been discussed in scalar (Lee et al., 1997; Lin and Zeng, 2009; Pan et al., 2018), but not longitudinal, response settings. While the designs and analysis procedures described herein apply to multivariate, longitudinal response data, for ease of exposition we will focus on the bivariate, longitudinal response data case.
We motivate the proposed two-phase ODS designs and inference procedures with the Lung Health Study (LHS), a multicenter randomized controlled trial that aimed to evaluate the effectiveness of a smoking cessation intervention program and inhaled bronchodilators on lung function (Connett et al., 1993). For the present paper, we are interested in examining a genetic association with lung function decline, where we characterize lung function with two correlated longitudinal measures: forced expiratory volume (FEV) in the first second of an exhalation following bronchodilator use and forced vital capacity (FVC). We conceive of a scenario where lung function and covariate data are available, but we must retrospectively collect genetic data from stored blood samples, which is expensive and thus limits the sample size.
The paper is organized as follows. Section 2 discusses the bivariate linear mixed effects model and the LHS data, and Section 3 introduces efficient sampling designs for bivariate longitudinal continuous data. Section 4 then develops the MI and ACML procedures that can be used to perform valid inference under two-phase ODS, and Section 5 evaluates the finite sample operating characteristics of the designs and inference procedures. Section 6 illustrates the designs and estimation procedures using the LHS, and Section 7 concludes the paper with a discussion.
2 |. NOTATION AND DATA
2.1 |. Bivariate linear mixed effects model
Let N be the number of subjects in the original phase-one cohort, Y1i and Y2i be the n1i- and n2i-vectors of responses for the first and the second outcome, respectively, and X1i and X2i be the n1i × p1 and n2i × p2 design matrices for the fixed effects associated with the first and second outcome, respectively. A bivariate linear mixed effects model is given by
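$$
\begin{aligned}
Y_{1i} &= X_{1i}\beta_1 + U_{1i}b_{1i} + \epsilon_{1i},\\
Y_{2i} &= X_{2i}\beta_2 + U_{2i}b_{2i} + \epsilon_{2i},
\end{aligned}
\tag{1}
$$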
where β1 and β2 are the p1- and p2-vectors of fixed-effect coefficients, b1i and b2i are the random effects, U1i and U2i are the design matrices for the random effects, and ϵ1i and ϵ2i are the error terms in these two models. A common design matrix for the random effects is given by U1i = [1, t1i] and U2i = [1, t2i], where t1i and t2i are observation times. With these design matrices, we assume

$$
\begin{pmatrix} b_{1i}\\ b_{2i}\end{pmatrix} \sim N(0, D),
\qquad
\begin{pmatrix} \epsilon_{1i}\\ \epsilon_{2i}\end{pmatrix} \sim N(0, \Sigma_i),
\qquad
\Sigma_i = \begin{pmatrix} \sigma_1^2 I_{n_{1i}} & 0\\ 0 & \sigma_2^2 I_{n_{2i}} \end{pmatrix},
\tag{2}
$$

with D being a 4 × 4 covariance matrix, $\sigma_1^2$ and $\sigma_2^2$ being the error variances, and $I_{n_{1i}}$ and $I_{n_{2i}}$ being n1i × n1i and n2i × n2i identity matrices, respectively. Furthermore, we assume that the error terms (ϵ1i, ϵ2i) are independent of the random effects (b1i, b2i). For ease of exposition, the model in Equation (1) can be rewritten using a single equation
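$$
Y_i = X_i\beta + U_i b_i + \epsilon_i,
\tag{3}
$$

where $Y_i = (Y_{1i}^{\top}, Y_{2i}^{\top})^{\top}$, $\beta = (\beta_1^{\top}, \beta_2^{\top})^{\top}$, $b_i = (b_{1i}^{\top}, b_{2i}^{\top})^{\top}$, $\epsilon_i = (\epsilon_{1i}^{\top}, \epsilon_{2i}^{\top})^{\top}$, and $X_i$ and $U_i$ are the block-diagonal matrices with blocks $(X_{1i}, X_{2i})$ and $(U_{1i}, U_{2i})$, respectively. In subsequent sections, we assume that n1i = n2i = ni, U1i = U2i, and X1i = X2i = (Xei, Zi), where Xei is an expensive exposure to be measured on a fraction of the subjects in phase two and Zi represents the inexpensive covariates that are available for everyone in phase one. The methodology presented in this paper, however, applies to more general scenarios with different numbers of observations and/or expensive and inexpensive covariates per outcome.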
The multivariate density corresponding to subject i is
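$$
f(Y_i \mid X_i; \theta) = (2\pi)^{-(n_{1i}+n_{2i})/2}\,|V_i|^{-1/2}
\exp\left\{-\tfrac{1}{2}(Y_i-\mu_i)^{\top}V_i^{-1}(Y_i-\mu_i)\right\},
\tag{4}
$$

where μi = E(Yi|Xi) = Xiβ, $V_i = \mathrm{var}(Y_i \mid X_i) = U_i D U_i^{\top} + \Sigma_i$, and θ is the vector of parameters that includes the coefficients of the fixed effects β as well as a vector α consisting of the unique elements of D and Σi. In standard analysis, if Ns representative subjects were drawn from the original cohort, inference on θ can be made by maximizing the log-likelihood $\ell(\theta) = \sum_{i=1}^{N_s} \log f(Y_i \mid X_i; \theta)$.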
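As a minimal sketch of how the marginal covariance and the Gaussian log-density in Equation (4) could be evaluated for one subject, the following assumes a random intercept and slope for each outcome and a shared set of observation times. The function names and all numeric values are hypothetical and are used only for illustration; they are not the authors' implementation.

```python
import numpy as np

def marginal_cov(times, D, sigma1_sq, sigma2_sq):
    """V_i = U_i D U_i^T + Sigma_i for a bivariate random intercept/slope model."""
    n = len(times)
    U1 = np.column_stack([np.ones(n), times])   # [1, t] for a single outcome
    U = np.kron(np.eye(2), U1)                  # block-diagonal U_i, shape (2n, 4)
    Sigma = np.diag(np.r_[np.full(n, sigma1_sq), np.full(n, sigma2_sq)])
    return U @ D @ U.T + Sigma

def gaussian_loglik(y, mu, V):
    """log f(Y_i | X_i; theta) for a multivariate normal with mean mu and covariance V."""
    resid = y - mu
    _, logdet = np.linalg.slogdet(V)
    quad = resid @ np.linalg.solve(V, resid)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

# Hypothetical inputs: five annual visits and an illustrative (diagonal) D.
times = np.arange(1, 6, dtype=float)
D = np.diag([4.0, 0.25, 3.0, 0.16])             # order: (b10, b11, b20, b21)
V = marginal_cov(times, D, sigma1_sq=1.0, sigma2_sq=0.5)
mu = np.zeros(2 * len(times))
ll_at_mean = gaussian_loglik(mu, mu, V)
```

Summing `gaussian_loglik` over subjects gives the log-likelihood that would be maximized under simple random sampling.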
2.2 |. The Lung Health Study
The LHS was a multicenter randomized clinical trial recruiting adults aged 35–60 with moderate lung function impairment. A total of 5887 subjects were recruited and their lung function was measured annually for 5 years. The primary goal of the LHS was to evaluate the effectiveness of a smoking cessation intervention program and inhaled bronchodilators use in slowing down annual rate of lung function decline (Connett et al., 1993).
In an ancillary study of 4251 participants with DNA genotypes measured, Hansel et al. (2013) performed genome-wide association studies and identified two novel loci associated with accelerated lung function decline. Even though Hansel et al. (2013) had complete DNA data, this is often not the case; thus, in what follows we assume that at the ancillary study design stage, genetic data have not yet been collected. Given the interest in rate of lung function decline and the cost associated with the collection of genotype data, we implement a two-phase ODS study to examine the association between one of the single nucleotide polymorphisms (SNPs) identified by Hansel et al. (2013) and rate of lung function decline over time. Specifically, our expensive exposure is the binary presence/absence of at least one copy of the T-allele at SNP rs177852, while the outcomes include FEV and FVC as measures of lung function. Thus, the model of interest is given by
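$$
\begin{aligned}
Y_{1ij} &= \beta_{10} + \beta_{1x}X_{ei} + \beta_{1t}t_{ij} + \beta_{1xt}X_{ei}t_{ij} + \beta_{1z}^{\top}Z_i + b_{10i} + b_{11i}t_{ij} + \epsilon_{1ij},\\
Y_{2ij} &= \beta_{20} + \beta_{2x}X_{ei} + \beta_{2t}t_{ij} + \beta_{2xt}X_{ei}t_{ij} + \beta_{2z}^{\top}Z_i + b_{20i} + b_{21i}t_{ij} + \epsilon_{2ij},
\end{aligned}
\tag{5}
$$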
where Y1ij and Y2ij denote FEV and FVC measured for subject i at study year j, respectively, tij ∈ {1, …, 5} denotes the study year j for subject i, Xei is the indicator of the presence of the T-allele at SNP rs177852, and Zi includes pack-years of smoking, number of cigarettes smoked per day during the year before enrollment in the study, BMI at enrollment, sex, age, site (nine indicators), change in BMI from baseline, and the interactions of study year tij with age and sex. Our primary interest lies in the inference of β1xt and β2xt, which correspond to the differences in rate of lung function decline between those with and without the T-allele at SNP rs177852.
3 |. EFFICIENT SAMPLING DESIGNS FOR LONGITUDINAL OUTCOMES
Efficient sampling designs for a single longitudinal continuous outcome have been described previously (Schildcrout et al., 2013, 2020; Sun et al., 2017). Briefly, given the outcome Y1i and a set of inexpensive covariates available for the full cohort, Z1i, these designs classify subjects into multiple strata based on a user-defined function Q1i = g(Y1i, Z1i), and then select subjects for exposure ascertainment based on stratum membership. For instance, one may define Q1i to be the mean of the observed outcome Y1i, partition the support of Q1i into K strata {(−∞, k1], …, (kK−1, ∞)} ≡ {R11, …, R1K}, and then select subjects for expensive exposure ascertainment from each stratum. Given Si, the indicator of whether subject i had been selected for exposure ascertainment, and q1i, the observed value of Q1i, the phase-two selection probability for the kth stratum is π(R1k) = Pr(Si = 1|Y1i, Z1i) = Pr(Si = 1|q1i ∈ R1k).
Although Q1i can take any form, we focus on two specific linear functions of Y1i in this paper. The first function, proposed by Schildcrout et al. (2013), defines Q1i to be the estimated intercept and slope from the subject-specific regression of the outcome on time, that is, $Q_{1i} = W_{1i}Y_{1i}$ with $W_{1i} = (U_{1i}^{\top}U_{1i})^{-1}U_{1i}^{\top}$. Sampling strata can be defined using the subject-specific intercept, slope, or both. We refer to the corresponding designs as ODS designs since sampling is based on strata defined by the outcome value. In contrast, Sun et al. (2017) defined Q1i to be the best linear unbiased predictor (BLUP) of the random effects estimated from a linear mixed effects model that relates the outcome to phase-one covariates. That is, $Q_{1i} = \hat{D}_1 U_{1i}^{\top}\hat{V}_{1i}^{-1}(Y_{1i} - Z_{1i}\hat{\beta}_1)$, where $\hat{V}_{1i} = U_{1i}\hat{D}_1U_{1i}^{\top} + \hat{\Sigma}_{1i}$, $\hat{D}_1$ and $\hat{\Sigma}_{1i}$ are the covariance matrices for the random effects and error terms, respectively, estimated from a linear mixed effects model relating Y1i to the inexpensive covariates, and $\hat{\beta}_1$ are the estimated coefficients of the fixed effects. Similar to the ODS designs, sampling strata can be defined using the random intercept, slope, or both. We refer to the corresponding designs as BLUP dependent sampling (BDS) designs since sampling is based directly on strata defined by the phase-one BLUPs.
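The ODS summary and the stratification step can be sketched as follows. This is an illustrative sketch only: the helper names `ols_summary` and `assign_strata` are hypothetical, and the default 15th/85th percentile cutpoints are taken from the simulation settings in Section 5.1 rather than being a general recommendation.

```python
import numpy as np

def ols_summary(times, y):
    """Subject-specific (intercept, slope): Q = (U^T U)^{-1} U^T y with U = [1, t]."""
    U = np.column_stack([np.ones(len(times)), times])
    W = np.linalg.solve(U.T @ U, U.T)
    return W @ y

def assign_strata(q, lower_pct=15, upper_pct=85):
    """Label each subject 0 (low), 1 (middle), or 2 (high) using percentile cutpoints.

    Strata are (-inf, k1], (k1, k2], (k2, inf), matching the partition in the text.
    """
    k1, k2 = np.percentile(q, [lower_pct, upper_pct])
    return np.digitize(q, [k1, k2], right=True)

# Toy check: a noiseless linear trajectory recovers its own intercept and slope.
times = np.array([0.0, 1.0, 2.0, 3.0])
intercept, slope = ols_summary(times, 2.0 + 0.5 * times)

# Toy stratification of 100 hypothetical slope values.
strata = assign_strata(np.arange(100.0))
counts = np.bincount(strata, minlength=3)
```

A BDS design would use the same stratification step, with the BLUPs from a phase-one mixed model in place of the least-squares summaries.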
ODS and BDS designs can improve estimation efficiency when sampling strata R1k are identified based on a summary statistic Q1i that is related to the inferential target. Specifically, oversampling subjects with high/low subject-specific intercepts yields higher between-subject variation in phase-two data, which translates into higher efficiency of the parameter estimates associated with time-fixed covariates. Similarly, oversampling subjects with high/low subject-specific slopes yields higher within-subject variation in phase-two data, which translates into higher efficiency of the parameter estimates associated with time-varying covariates. When the associations of the inexpensive covariates and the outcome are small, the ODS and BDS designs perform similarly. When the associations of inexpensive covariates and the outcome are large, the BDS design can be more efficient than the ODS design (Sun et al., 2017).
We extend the ODS and BDS designs to multiple longitudinal outcomes though for ease of exposition we focus on the bivariate longitudinal outcome setting. Specifically, we first define Q1i and Q2i for the first and second outcomes, respectively, and use them to create sampling strata {R11, …, R1K} and {R21, …, R2K}. Then, with probability ξ1i and (1 − ξ1i) we assign subject i to be sampled based on Q1i or Q2i. Finally, depending on the sampling variable assignment, we randomly sample subjects from strata {R11, …, R1K} with probability π(R1k) = Pr(Si = 1|q1i ∈ R1k) and from strata {R21, …, R2K} with probability π(R2k) = Pr(Si = 1|q2i ∈ R2k), with q1i and q2i being the observed values of Q1i and Q2i, respectively. This setting allows us to define different functions Q1i and Q2i, different sampling strata {R11, …, R1K} and {R21, …, R2K}, and different sampling probability assignments ξ1i and (1 − ξ1i) for the two outcomes. In this paper, we create sampling strata based on the percentiles of Q1i and Q2i, and we choose K = 3 for both outcomes so that {R11, R12, R13} and {R21, R22, R23} represent subjects with low, middle, and high value of Q1i and Q2i. By oversampling from the extreme strata R11, R13, R21, and R23, we increase the response and exposure variability in the phase-two sample for both outcomes, which could lead to efficiency gains over SRS.
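The three-step bivariate design just described (define strata, assign each subject a sampling outcome, then sample within strata) can be sketched as below. This is a hedged illustration: `two_phase_sample` is a hypothetical helper, the sampling variables are random placeholders, and fixed per-stratum counts are used in place of sampling probabilities, as in the simulations of Section 5.1.

```python
import numpy as np

def two_phase_sample(strata1, strata2, n_per_stratum, xi1=0.5, rng=None):
    """Assign each subject to outcome 1 (prob. xi1) or outcome 2, then draw
    n_per_stratum[k] subjects from stratum k of the assigned outcome."""
    rng = np.random.default_rng(rng)
    N = len(strata1)
    tau1 = rng.random(N) < xi1                  # True: sampled based on Q1
    selected = np.zeros(N, dtype=bool)
    for use_first, strata in ((True, strata1), (False, strata2)):
        for k, n_k in enumerate(n_per_stratum):
            pool = np.flatnonzero((tau1 == use_first) & (strata == k))
            take = min(n_k, len(pool))          # cannot take more than available
            if take > 0:
                selected[rng.choice(pool, size=take, replace=False)] = True
    return tau1, selected

def strata_from(q):
    """Three strata from the 15th and 85th percentiles of q."""
    return np.digitize(q, np.percentile(q, [15, 85]), right=True)

# Hypothetical sampling variables for N = 2400 subjects.
rng = np.random.default_rng(2024)
N = 2400
q1, q2 = rng.normal(size=N), rng.normal(size=N)
tau1, selected = two_phase_sample(strata_from(q1), strata_from(q2),
                                  n_per_stratum=(150, 100, 150), xi1=0.5, rng=rng)
```

Setting `xi1=1.0` reproduces the special case in which all sampling is based on the first outcome.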
The values of π(R1k) and π(R2k) should be chosen by taking into account the research question and a potential bias–variance trade-off associated with the extremeness of the design, where an extreme sampling design is one that samples proportionately more subjects from the extremes of the sampling variable distribution. Together with the cut points that define sampling strata, π(R1k) and π(R2k) determine how extreme a design is. With fixed cutpoints or sampling strata, larger π(R11), π(R13), π(R21), and π(R23) values yield more extreme designs which, if implemented correctly, imply greater observed response variability and greater statistical efficiency (Schildcrout et al., 2013, 2015). However, with likelihood-based estimation procedures, relatively extreme designs are also likely to be more susceptible to estimation bias when the regression model is misspecified (Schildcrout et al., 2020). To strike a balance between robustness and efficiency, less extreme designs or robust estimation procedures can be considered. We do not discuss robust procedures here because this paper is primarily focused on likelihood-based estimation procedures.
Similarly, the choice of ξ1i should also be made based on the research question. For example, if both Y1i and Y2i are of equal importance, then we may choose ξ1i = ξ2i = 0.5; however, if the association between Xei and Y1i is of greater clinical relevance than that between Xei and Y2i, then we may choose ξ1i > 0.5 and ξ2i < 0.5.
Settings in which sampling was conducted based on Y1i but interest shifts to the association between Xei and Y2i can be seen as special cases of the multivariate ODS and BDS designs where ξ1i = 1. Thus, the methods described below apply to secondary outcome analysis of data from a previous two-phase ODS study.
4 |. ESTIMATION AND INFERENCE
Since subjects sampled for phase two are not selected via SRS, valid inference on the parameters of interest must account for the biased sampling design (Holt et al., 1980; Lawless et al., 1999; Lin et al., 2013). In this section, we introduce an MI approach that combines complete data from subjects selected in phase two with partial data from subjects that were not selected in phase two. We also extend the ACML estimator of Schildcrout et al. (2013) to address circumstances where we only have complete data from subjects selected in phase two.
4.1 |. Multiple imputation
The ODS and BDS designs depend on a low-dimensional summary of the outcome Yi and inexpensive covariates Zi. Thus, Si is independent of the expensive covariates Xei given Yi and Zi, that is,
$$
\Pr(S_i = 1 \mid Y_i, X_{ei}, Z_i) = \Pr(S_i = 1 \mid Y_i, Z_i).
\tag{6}
$$
Since in the LHS analysis the expensive covariate of interest was binary, in what follows we focus on imputation for a missing binary exposure Xei. We show in the Appendix that using Bayes’ theorem and Equation (6), the conditional exposure log-odds model for subjects not sampled in phase two (i.e., those with missing Xei) can be written as
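$$
\log\frac{\Pr(X_{ei}=1 \mid Y_i, Z_i, S_i=0)}{\Pr(X_{ei}=0 \mid Y_i, Z_i, S_i=0)}
= \underbrace{-\tfrac{1}{2}(Y_i-\mu_{1,i})^{\top}V_i^{-1}(Y_i-\mu_{1,i}) + \tfrac{1}{2}(Y_i-\mu_{0,i})^{\top}V_i^{-1}(Y_i-\mu_{0,i})}_{(a)}
+ \underbrace{\log\frac{\Pr(X_{ei}=1 \mid Z_i)}{\Pr(X_{ei}=0 \mid Z_i)}}_{(b)},
\tag{7}
$$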
where μx,i = E(Yi|Xei = x, Zi) and Vi = var(Yi|Xei, Zi). We model expression (b) with logistic regression indexed by parameter γ, and so the conditional exposure log-odds model (7), that is used to impute Xei in subjects with Si = 0, becomes an offsetted logistic regression model with expression (a) being the offset. We conduct the MI approach as follows:
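(1) Fit the bivariate linear mixed effects model Yi = Xiβ + Uibi + ϵi among subjects with Si = 1 to obtain the initial estimates $\hat{\beta}^{(0)}$, $\widehat{\mathrm{cov}}(\hat{\beta}^{(0)})$, and $\hat{\alpha}^{(0)}$.

(2) At the kth iteration:
- (a) Sample $\beta^{(k)} \sim N\{\hat{\beta}^{(k-1)}, \widehat{\mathrm{cov}}(\hat{\beta}^{(k-1)})\}$. Calculate the offset $o_i^{(k)}$ for all subjects using expression (a) in Equation (7).
- (b) For subjects with Si = 1, fit the offsetted logistic regression model in Equation (7):

$$
\mathrm{logit}\,\Pr(X_{ei}=1 \mid Y_i, Z_i) = o_i^{(k)} + \gamma^{\top}(1, Z_i^{\top})^{\top}
\tag{8}
$$

  to obtain estimates $\hat{\gamma}^{(k)}$ and $\widehat{\mathrm{cov}}(\hat{\gamma}^{(k)})$.
- (c) Sample $\gamma^{(k)} \sim N\{\hat{\gamma}^{(k)}, \widehat{\mathrm{cov}}(\hat{\gamma}^{(k)})\}$.
- (d) Use γ(k) and $o_i^{(k)}$ to calculate $p_i^{(k)} = \Pr(X_{ei}=1 \mid Y_i, Z_i)$ for all subjects with Si = 0.
- (e) For all subjects with Si = 0, impute $X_{ei}^{(k)} \sim \mathrm{Bernoulli}(p_i^{(k)})$.
- (f) Fit the bivariate linear mixed effects model Yi = Xiβ + Uibi + ϵi using the imputed data set to obtain estimates $\hat{\beta}^{(k)}$, $\widehat{\mathrm{cov}}(\hat{\beta}^{(k)})$, and $\hat{\alpha}^{(k)}$.

(3) Repeat steps (a)–(f) in (2) m times and take the mth data set as the complete data. Fit the bivariate linear mixed effects model Yi = Xiβ + Uibi + ϵi on the complete data set and store the results.

(4) Repeat steps (2)–(3) M times and combine the results using Rubin’s rule (Rubin, 1976).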
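The offset and imputation steps of the algorithm above can be sketched as follows. This is not the authors' implementation: the helper names are hypothetical, the logistic regression fit itself is omitted, and `lin_pred` stands in for the fitted linear predictor of expression (b) in Equation (7).

```python
import numpy as np

def log_density_ratio(y, mu1, mu0, V):
    """Expression (a): log f(y | Xe=1, Z) - log f(y | Xe=0, Z) with common covariance V."""
    r1, r0 = y - mu1, y - mu0
    return -0.5 * (r1 @ np.linalg.solve(V, r1) - r0 @ np.linalg.solve(V, r0))

def impute_exposure(offset, lin_pred, rng):
    """Draw Xe ~ Bernoulli(p), where logit(p) = offset + lin_pred."""
    p = 1.0 / (1.0 + np.exp(-(np.asarray(offset) + np.asarray(lin_pred))))
    return (rng.random(p.shape) < p).astype(int), p

# Toy check with hypothetical values: identity covariance and y equal to mu1.
V = np.eye(2)
mu1, mu0 = np.array([1.0, 1.0]), np.zeros(2)
ratio_at_mu1 = log_density_ratio(mu1, mu1, mu0, V)          # 0.5 * ||mu1 - mu0||^2
ratio_at_mid = log_density_ratio(0.5 * (mu1 + mu0), mu1, mu0, V)

rng = np.random.default_rng(7)
draws, probs = impute_exposure(np.zeros(5), np.zeros(5), rng)
```

When the offset and the linear predictor are both zero, every unsampled subject is imputed with probability 0.5, which is a convenient sanity check.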
We observed that m between 5 and 20 iterations appeared adequate in the scenarios we studied and, similar to Harrell (2016), we recommend M to be at least 100λ, where λ is the fraction of subjects with missing data. Since in our designs Xei is missing for the majority of subjects, a higher number of imputations M is needed to (1) use the normal approximation to the t-distribution for the computation of the confidence interval and (2) aid reproducibility of the results (White et al., 2011).
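For a single scalar coefficient, the pooling step and the rule of thumb for M can be sketched as below. The function name `rubin_pool` and the toy inputs are hypothetical; the pooling formula is the standard form of Rubin's rules (average within-imputation variance plus an inflated between-imputation variance).

```python
import math
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()                  # average within-imputation variance
    between = estimates.var(ddof=1)            # between-imputation variance
    total = within + (1.0 + 1.0 / M) * between
    return pooled, math.sqrt(total)

# Toy inputs (hypothetical): three imputations of one coefficient.
pooled, pooled_se = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])

# Rule of thumb from the text: M of at least 100 * (fraction missing).
# With 1600 of 2400 subjects unsampled, this gives a lower bound on M.
min_imputations = math.ceil(100 * 1600 / 2400)
```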
It is worth noting that in our MI algorithm we appropriately acknowledge the uncertainty in estimating β and γ but for pragmatic reasons ignore the uncertainty in estimating α. This is because our MI approach is based on the R package nlme, which is widely used in the analysis of longitudinal data, but does not provide uncertainty estimates of the variance components. While imputation methods should account for the uncertainty in all parameters, previous studies showed that the uncertainty in the variance component estimates is low (Schildcrout et al., 2013), and the inclusion of subjects with complete and partial data in the analysis recovers nearly all the information associated with the variance parameters (Zelnick et al., 2018). Moreover, the score spaces of β and α are orthogonal under SRS and in scenarios such as genetic studies, where the variables of interest are weakly associated with the outcome (Tao et al., 2020). Thus, ignoring the variability in α is not expected to have a substantial impact on the validity of the estimation of the fixed-effect coefficients β. Simulation studies in Section 5 showed that the MI procedure led to unbiased estimates of the model’s coefficients and standard errors.
4.2 |. Ascertainment corrected maximum likelihood
The ACML is a complete-case analysis method that requires specification of a model for Yi|Xi, and corrects for biased sampling by explicitly conditioning on a subject i being selected in phase two. For a single longitudinal outcome, the ascertainment corrected likelihood (ACL) takes the form
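$$
L(\theta) = \prod_{i=1}^{N_2} f(Y_{1i} \mid X_{1i}, S_i = 1; \theta)
= \prod_{i=1}^{N_2} \frac{\pi(q_{1i})\, f(Y_{1i} \mid X_{1i}; \theta)}{\Pr(S_i = 1 \mid X_{1i}; \theta)},
\tag{9}
$$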
where N2 is the number of subjects selected in phase two, and the denominator
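$$
\Pr(S_i = 1 \mid X_{1i}; \theta) = \sum_{k=1}^{K} \pi(R_{1k}) \int_{R_{1k}} f(q_{1i} \mid X_{1i}; \theta)\, dq_{1i}
\tag{10}
$$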
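is the scaling factor that corrects for the biased sample. Under ODS sampling, $Q_{1i} = W_{1i}Y_{1i}$ with $W_{1i} = (U_{1i}^{\top}U_{1i})^{-1}U_{1i}^{\top}$, and the scaling factor in Equation (9) can be calculated in closed form because Y1i|X1i being normally distributed implies Q1i|X1i is normally distributed with mean $W_{1i}E(Y_{1i} \mid X_{1i})$ and variance $W_{1i}V_{1i}W_{1i}^{\top}$, where $V_{1i} = \mathrm{var}(Y_{1i} \mid X_{1i})$ (Schildcrout et al., 2013). Under BDS sampling, $Q_{1i} = \hat{D}_1 U_{1i}^{\top}\hat{V}_{1i}^{-1}(Y_{1i} - Z_{1i}\hat{\beta}_1)$, where hats denote estimates from the phase-one mixed model that includes only the inexpensive covariates, and by the law of large numbers and the delta method, it can be shown that Q1i|X1i is approximately normally distributed with mean $\hat{D}_1 U_{1i}^{\top}\hat{V}_{1i}^{-1}\{E(Y_{1i} \mid X_{1i}) - Z_{1i}\hat{\beta}_1\}$ and variance $\hat{D}_1 U_{1i}^{\top}\hat{V}_{1i}^{-1}V_{1i}\hat{V}_{1i}^{-1}U_{1i}\hat{D}_1$ (Sun et al., 2017).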
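When sampling depends on a single scalar summary (for example, the slope only), the scaling factor reduces to a weighted sum of normal probabilities over the strata. The sketch below illustrates that calculation; `selection_prob` is a hypothetical helper, and the mean and standard deviation of the summary are assumed to have been computed from the model as described above.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selection_prob(mean_q, sd_q, cutpoints, pis):
    """Pr(S=1 | X) = sum_k pi(R_k) * Pr(Q in R_k | X) for a scalar normal summary Q."""
    edges = [-math.inf, *cutpoints, math.inf]
    total = 0.0
    for k, pi_k in enumerate(pis):
        upper = norm_cdf((edges[k + 1] - mean_q) / sd_q)
        lower = norm_cdf((edges[k] - mean_q) / sd_q)
        total += pi_k * (upper - lower)
    return total

# Sampling everyone gives probability 1; sampling only the tails beyond
# +/- 1 SD gives 2 * Phi(-1) for a standard normal summary.
p_all = selection_prob(0.0, 1.0, cutpoints=(-1.0, 1.0), pis=(1.0, 1.0, 1.0))
p_tails = selection_prob(0.0, 1.0, cutpoints=(-1.0, 1.0), pis=(1.0, 0.0, 1.0))
```

If strata are defined jointly on the intercept and slope, the same idea applies with bivariate normal rectangle probabilities in place of the univariate differences.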
We extend the ACML method to bivariate longitudinal outcomes for designs where subjects are assigned to be sampled based on one of the outcomes with probability ξwi, (w = 1, 2). Let τwi be the indicator of whether subject i is sampled based on the wth outcome and fw(Yi|Xi, Si = 1; θ) be the corresponding likelihood contribution of subject i. The ACL for bivariate longitudinal outcomes takes the form
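$$
L(\theta) = \prod_{i=1}^{N_2}\prod_{w=1}^{2} \left\{\xi_{wi}\, f_w(Y_i \mid X_i, S_i = 1; \theta)\right\}^{\tau_{wi}}
= \prod_{i=1}^{N_2}\prod_{w=1}^{2} \left\{\frac{\xi_{wi}\,\pi(q_{wi})\, f(Y_i \mid X_i; \theta)}{\Pr(S_i = 1 \mid X_i, \tau_{wi} = 1; \theta)}\right\}^{\tau_{wi}}.
\tag{11}
$$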
Because the probabilities ξwi and π(qwi) are known a priori and do not depend on θ, the ascertainment corrected log-likelihood is proportional to
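$$
\sum_{i=1}^{N_2}\left[\log f(Y_i \mid X_i; \theta) - \sum_{w=1}^{2} \tau_{wi} \log\left\{\sum_{k=1}^{K} \pi(R_{wk}) \Pr(q_{wi} \in R_{wk} \mid X_i; \theta)\right\}\right].
\tag{12}
$$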
The second term in expression (12) represents the ascertainment correction that accounts for the biased sample. Similar to the case of a single longitudinal outcome, both terms can be calculated analytically under the Gaussian assumption. We estimate θ by maximizing expression (12) using the Newton–Raphson algorithm.
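A single subject's contribution to an ascertainment corrected log-likelihood can be sketched as below for the case of slope-based sampling. This is an illustrative sketch, not the authors' code: `acml_contribution` is a hypothetical helper, all numeric inputs are made up, and `w_row` is assumed to be the slope row of W for the outcome the subject was sampled on, padded with zeros for the other outcome.

```python
import math
import numpy as np

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def acml_contribution(y, mu, V, w_row, cutpoints, pis):
    """log f(Y_i | X_i) - log Pr(S_i = 1 | X_i), with sampling based on q = w_row @ y."""
    resid = y - mu
    _, logdet = np.linalg.slogdet(V)
    loglik = -0.5 * (len(y) * math.log(2 * math.pi) + logdet
                     + resid @ np.linalg.solve(V, resid))
    q_mean = float(w_row @ mu)                      # mean of the sampling summary
    q_sd = math.sqrt(float(w_row @ V @ w_row))      # SD of the sampling summary
    edges = [-math.inf, *cutpoints, math.inf]
    p_sel = sum(pi_k * (norm_cdf((edges[k + 1] - q_mean) / q_sd)
                        - norm_cdf((edges[k] - q_mean) / q_sd))
                for k, pi_k in enumerate(pis))
    return loglik - math.log(p_sel)

# Hypothetical subject with four visits per outcome, sampled on outcome 1.
t = np.arange(4.0)
U1 = np.column_stack([np.ones(4), t])
W1 = np.linalg.solve(U1.T @ U1, U1.T)
w_row = np.r_[W1[1], np.zeros(4)]                   # slope of outcome 1 only
V = np.kron(np.eye(2), U1 @ np.diag([1.0, 0.1]) @ U1.T) + np.eye(8)
mu = np.zeros(8)
y = np.r_[0.3 * t, -0.1 * t]

no_correction = acml_contribution(y, mu, V, w_row, (-0.5, 0.5), (1.0, 1.0, 1.0))
with_correction = acml_contribution(y, mu, V, w_row, (-0.5, 0.5), (1.0, 0.2, 1.0))
```

Summing such contributions over phase-two subjects and passing the negative sum to a numerical optimizer would give one way to obtain estimates; the text itself uses Newton–Raphson.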
We note that because secondary outcome analysis of two-phase studies with ODS or BDS designs based on the primary outcome can be seen as special cases of the multivariate ODS and BDS designs, respectively, both the MI and ACML methods can be readily applied to analyze secondary outcomes from a previously conducted ODS or BDS design.
5 |. FINITE SAMPLING OPERATING CHARACTERISTICS
We conducted simulation studies to examine the validity and efficiency of the proposed designs and inference procedures for two-phase studies with multivariate longitudinal data. We generated data from the bivariate linear mixed effects model
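$$
\begin{aligned}
Y_{1ij} &= \beta_{10} + \beta_{1x}X_{ei} + \beta_{1z}Z_i + \beta_{1t}t_{ij} + \beta_{1xt}X_{ei}t_{ij} + \beta_{1zt}Z_it_{ij} + b_{10i} + b_{11i}t_{ij} + \epsilon_{1ij},\\
Y_{2ij} &= \beta_{20} + \beta_{2x}X_{ei} + \beta_{2z}Z_i + \beta_{2t}t_{ij} + \beta_{2xt}X_{ei}t_{ij} + \beta_{2zt}Z_it_{ij} + b_{20i} + b_{21i}t_{ij} + \epsilon_{2ij},
\end{aligned}
\tag{13}
$$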
where Xei is an expensive binary covariate with Pr(Xei = 1) = 0.3, tij denotes the follow-up time variable, Zi is a continuous confounder generated from a normal random variable with mean −0.15 + δXei and variance 1, and δ is a parameter controlling the correlation between Xei and Zi. For each subject i, we generated equally spaced tij values from 0 to 5, randomly selected dropout time from tij = 3, 4, or 5, and removed all subsequent time points. This resulted in subjects having 4–6 observations. Unless otherwise specified, across all simulations, we set N = 2400, (β10, β1x, β1z, β1t, β1xt) = (80, 0.5, −2.5, −1.5, −0.25), (β20, β2x, β2z, β2t, β2xt) = (65, −0.6, −2, −1, −0.15). The random effects (b10i, b11i, b20i, b21i) were generated from a multivariate normal distribution with mean 0 and covariance
| (14) |
and the error terms ϵ1ij and ϵ2ij were generated from normal distributions with mean 0 and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. We present simulation results in two subsections, each with a different aim. In Section 5.1, we assumed that both Y1i and Y2i were of interest and we investigated the validity and efficiency of the proposed designs and the ACML and MI inference procedures. Following the LHS analysis, we focused on estimating the coefficient associated with the time by expensive covariate interaction, and we oversampled subjects with extreme slopes under the ODS and BDS designs. In Section 5.2, we considered the secondary data analysis setting where we were interested in analyzing Y2i using data from an existing two-phase ODS study based on Y1i.
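The data-generating process described above can be sketched as follows. The fixed-effect values are those stated in the text and in Setting (a) of Table 1; the random-effects covariance and the error standard deviations are not recoverable from this section, so the values used here are explicitly hypothetical placeholders, as are the function and variable names.

```python
import numpy as np

# Fixed effects from the text, with (beta_1zt, beta_2zt) from Setting (a).
BETA = {
    1: dict(b0=80.0, x=0.5, z=-2.5, t=-1.5, xt=-0.25, zt=-0.10),
    2: dict(b0=65.0, x=-0.6, z=-2.0, t=-1.0, xt=-0.15, zt=-0.15),
}
# HYPOTHETICAL placeholders: not the values used in the paper.
D_HYPOTHETICAL = np.diag([9.0, 0.25, 4.0, 0.04])   # order: (b10, b11, b20, b21)
ERROR_SD_HYPOTHETICAL = {1: 1.0, 2: 0.5}

def simulate_cohort(N=2400, delta=-0.05, seed=0):
    """Simulate a phase-one cohort following the structure of Equation (13)."""
    rng = np.random.default_rng(seed)
    cohort = []
    for _ in range(N):
        xe = float(rng.random() < 0.3)                   # Pr(Xe = 1) = 0.3
        z = rng.normal(-0.15 + delta * xe, 1.0)          # confounder, variance 1
        last_time = rng.choice([3, 4, 5])                # dropout time
        t = np.arange(0, last_time + 1, dtype=float)     # 4 to 6 observations
        b = rng.multivariate_normal(np.zeros(4), D_HYPOTHETICAL)
        subject = {"xe": xe, "z": z, "t": t}
        for w in (1, 2):
            c = BETA[w]
            fixed = (c["b0"] + c["x"] * xe + c["z"] * z
                     + (c["t"] + c["xt"] * xe + c["zt"] * z) * t)
            random_part = b[2 * (w - 1)] + b[2 * (w - 1) + 1] * t
            noise = rng.normal(0.0, ERROR_SD_HYPOTHETICAL[w], size=t.size)
            subject[f"y{w}"] = fixed + random_part + noise
        cohort.append(subject)
    return cohort

cohort = simulate_cohort(N=200)
```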
5.1 |. Validity of inference procedures and efficiency of two-phase designs
We assumed that data on the outcome and inexpensive covariate were available on 2400 subjects but resource constraints only permit Xei ascertainment on approximately 800 subjects. We considered three designs: (1) SRS, (2) ODS slope sampling, and (3) BDS slope sampling. For the ODS slope sampling design, we generated three strata (Rw1, Rw2, Rw3) for each outcome Ywi (w = 1, 2) by fitting a simple linear regression of Ywi on ti and defining the three strata based on the 15th and 85th percentiles of the subject-specific slopes. We then randomly assigned half of the subjects to be selected based on Y1i and the other half based on Y2i. For each outcome, we selected (Nw1, Nw2, Nw3) subjects from the three strata. We carried out the BDS slope sampling design in a similar fashion as the ODS slope sampling design, except that we used the BLUP slopes to define strata. When implementing the MI approach we set m = 10 iterations and M = 70 imputations. Results with m = 5 and m = 20 were similar as shown in Web Tables 1 and 2.
We report four simulation settings, denoted hereafter by Settings (a)–(d), with (a) as described above and with (b), (c), and (d) altering the correlation between Xei and Zi, the degree of extreme sampling, and the confounder by time interaction association, respectively; see Table 1 for detailed specifications. In addition, we have examined a different scenario with lower values for the variance and covariance components in D (Web Table 3). Table 2 and Web Tables 4–6 show the results for Settings (a)–(d), respectively. Overall, the methods performed well with approximately unbiased parameter estimates, standard error estimates that accurately reflected the true sampling variation, and coverage probabilities that were close to 0.95. Although not shown in the paper, the estimates of D, σ1, and σ2 were unbiased. For instance, the average estimated variances of the random effects were , , , and , and the corresponding empirical standard deviations were 0.66, 0.03, 0.28, and 0.003, respectively.
TABLE 1.
Simulation settings
| Setting | (β1zt, β2zt) | (Nw1, Nw2, Nw3) | δ | Cor (Xei, Zi) |
|---|---|---|---|---|
| (a) | (−0.1, −0.15) | (150, 100, 150) | −0.05 | −0.02 |
| (b) | (−0.1, −0.15) | (150, 100, 150) | −2 | −0.66 |
| (c) | (−0.1, −0.15) | (180, 40, 180) | −0.05 | −0.02 |
| (d) | (−0.7, −0.45) | (150, 100, 150) | −0.05 | −0.02 |
TABLE 2.
Simulation results for Setting (a) when phase-two sampling depends on both outcomes
| | Y1 | | | | | | Y2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | β10 | β1x | β1z | β1t | β1xt | β1zt | β20 | β2x | β2z | β2t | β2xt | β2zt |
| Random sampling with ML | ||||||||||||
| % Bias | 0.000 | −0.284 | −0.073 | 0.086 | −1.593 | −1.448 | 0.003 | −1.519 | −0.048 | 0.004 | 0.188 | −0.406 |
| SE | 0.196 | 0.372 | 0.160 | 0.046 | 0.087 | 0.039 | 0.133 | 0.240 | 0.109 | 0.025 | 0.045 | 0.021 |
| SEE | 0.198 | 0.359 | 0.164 | 0.048 | 0.087 | 0.040 | 0.132 | 0.239 | 0.110 | 0.026 | 0.046 | 0.021 |
| CP | 0.943 | 0.941 | 0.957 | 0.961 | 0.947 | 0.948 | 0.952 | 0.955 | 0.957 | 0.956 | 0.962 | 0.957 |
| ODS slope sampling with ACML | ||||||||||||
| % Bias | −0.004 | 1.394 | −0.104 | −0.108 | 1.236 | −1.732 | −0.006 | 0.990 | −0.054 | −0.029 | 0.770 | −0.364 |
| SE | 0.195 | 0.362 | 0.148 | 0.041 | 0.079 | 0.035 | 0.133 | 0.241 | 0.097 | 0.023 | 0.042 | 0.019 |
| SEE | 0.198 | 0.359 | 0.149 | 0.042 | 0.076 | 0.035 | 0.131 | 0.238 | 0.099 | 0.023 | 0.042 | 0.019 |
| CP | 0.962 | 0.956 | 0.946 | 0.946 | 0.947 | 0.957 | 0.957 | 0.958 | 0.944 | 0.951 | 0.951 | 0.952 |
| BDS slope sampling with ACML | ||||||||||||
| % Bias | 0.011 | −5.601 | 0.063 | 0.390 | −6.841 | −1.873 | −0.013 | −2.265 | −0.079 | 0.155 | −4.041 | −0.700 |
| SE | 0.196 | 0.356 | 0.163 | 0.038 | 0.069 | 0.031 | 0.125 | 0.233 | 0.108 | 0.020 | 0.037 | 0.010 |
| SEE | 0.199 | 0.359 | 0.165 | 0.038 | 0.069 | 0.031 | 0.129 | 0.234 | 0.107 | 0.021 | 0.038 | 0.017 |
| CP | 0.953 | 0.939 | 0.951 | 0.950 | 0.936 | 0.946 | 0.954 | 0.948 | 0.948 | 0.960 | 0.941 | 0.955 |
| Random sampling with MI | ||||||||||||
| % Bias | 0.001 | −0.797 | −0.088 | 0.093 | −2.374 | −1.736 | −0.005 | −2.090 | −0.057 | −0.020 | −0.331 | −0.526 |
| SE | 0.142 | 0.365 | 0.093 | 0.034 | 0.085 | 0.024 | 0.097 | 0.238 | 0.067 | 0.018 | 0.045 | 0.012 |
| SEE | 0.143 | 0.353 | 0.095 | 0.034 | 0.084 | 0.023 | 0.096 | 0.024 | 0.064 | 0.018 | 0.045 | 0.012 |
| CP | 0.952 | 0.940 | 0.952 | 0.956 | 0.944 | 0.941 | 0.945 | 0.944 | 0.936 | 0.957 | 0.945 | 0.959 |
| ODS slope sampling with MI | ||||||||||||
| % Bias | 0.001 | 0.654 | −0.099 | −0.064 | 1.020 | −1.417 | 0.002 | 0.686 | −0.024 | −0.044 | 0.487 | −0.418 |
| SE | 0.139 | 0.345 | 0.094 | 0.031 | 0.071 | 0.024 | 0.098 | 0.231 | 0.067 | 0.016 | 0.037 | 0.012 |
| SEE | 0.143 | 0.354 | 0.096 | 0.031 | 0.069 | 0.023 | 0.095 | 0.234 | 0.064 | 0.017 | 0.037 | 0.012 |
| CP | 0.951 | 0.950 | 0.954 | 0.950 | 0.942 | 0.942 | 0.948 | 0.954 | 0.943 | 0.947 | 0.950 | 0.955 |
| BDS slope sampling with MI | ||||||||||||
| % Bias | −0.002 | 1.051 | −0.099 | 0.007 | −0.182 | −1.637 | −0.001 | −0.332 | −0.036 | −0.078 | 1.403 | −0.483 |
| SE | 0.140 | 0.364 | 0.094 | 0.030 | 0.069 | 0.024 | 0.095 | 0.232 | 0.067 | 0.016 | 0.036 | 0.012 |
| SEE | 0.143 | 0.353 | 0.095 | 0.031 | 0.069 | 0.023 | 0.094 | 0.231 | 0.0639 | 0.017 | 0.037 | 0.012 |
| CP | 0.955 | 0.936 | 0.953 | 0.957 | 0.947 | 0.938 | 0.944 | 0.950 | 0.943 | 0.950 | 0.950 | 0.953 |
Abbreviations: ACML, ascertainment corrected maximum likelihood; CP, coverage probability of the 95% confidence interval; MI, multiple imputation; ML, maximum likelihood; ODS, outcome-dependent sampling; SE, empirical standard error; SEE, empirical mean of the standard error estimator.
Figure 1 and Web Table 7 show the relative efficiency of the different designs and inference procedures compared to SRS with maximum likelihood analysis. The ODS and BDS designs were more efficient than SRS in estimating (β1t, β1xt, β1zt, β2t, β2xt, β2zt) because they oversample relatively informative subjects in phase two. In Setting (a), the ODS design with ACML improved estimation efficiency of (β1xt, β2xt) by 46% and 48%, respectively, while MI improved estimation efficiency of β1xt by an additional 12% and of β2xt by an additional 6%. Similarly, the BDS design with ACML improved estimation efficiency of (β1xt, β2xt) by 59% and 49%, respectively, while MI improved estimation efficiency of β1xt by an additional 14% and of β2xt by an additional 11%. Some notable observations from the other scenarios displayed in Figure 1 include:
FIGURE 1.

Relative efficiency of two-phase designs and inference procedures to standard maximum likelihood estimation under random sampling (RS) in estimating β1xt and β2xt. ODS, outcome dependent sampling. BDS, BLUP dependent sampling. ACML, ascertainment corrected maximum likelihood. MI, multiple imputation
- Toward estimating (β1xt, β2xt), efficiency gains associated with MI over ACML were further enhanced when the correlation between Xei and Zi was increased (Setting (b)).
- Efficiency gains of the ODS and BDS designs over SRS were larger when sampling was more extreme (Setting (c)).
- The ODS and BDS designs exhibited similar precision toward estimating (β1xt, β2xt) when time by confounder associations were weak (Settings (b)–(c)); however, the ODS design became somewhat less efficient when time by confounder associations were strong (Setting (d)).
- In all settings, MI was much more efficient than ACML toward estimating (β1z, β1zt, β2z, β2zt) because the latter only uses data on subjects selected in phase two, while the former uses (Y, Z) on subjects not selected in phase two.
5.2 |. Secondary outcome analysis
In a second set of simulations, we assumed that data from a previous two-phase ODS study on Y1i were available and we were interested in analyzing a secondary outcome Y2i, which was not used for sampling but was potentially correlated with Y1i. Specifically, we assumed that in a previous two-phase ODS study on Y1i the expensive covariate Xei was ascertained in 400 of the 2400 subjects using an ODS slope sampling design, where (150, 100, 150) subjects per stratum were sampled. We considered three different analyses: (1) we ignored the original ODS design and carried out a naive analysis of Y2i using a standard linear mixed effects model; (2) we used complete data available on subjects sampled for phase two and implemented the ACML method to account for the original ODS slope sampling scheme; (3) we used available data from all subjects and implemented the MI approach with m = 10 iterations and M = 85 imputations.
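The MI analysis in item (3) yields M completed-data fits whose estimates must be combined into one estimate and one standard error. The sketch below assumes the standard Rubin's rules are used for that step; the combining rule actually used is specified with the MI algorithm elsewhere in the paper, not in this section. The function name `pool_rubin` and the toy inputs are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool M completed-data results for one coefficient with Rubin's rules.

    estimates : point estimates, one per imputed data set
    variances : squared standard errors, one per imputed data set
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (M - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Toy example (hypothetical numbers): three imputations of one coefficient.
toy_estimate, toy_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(toy_estimate, toy_se)
```

The between-imputation term is what propagates uncertainty about the unobserved exposure; with a large M such as 85, the (1 + 1/M) inflation factor is close to 1.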
Table 3 summarizes the results for Setting (d) in Section 5.1. The naive analysis that ignored the original ODS sampling scheme based on Y1i yielded biased estimates because Y2i was correlated with Y1i. By accounting for the original ODS sampling scheme, the ACML and MI methods yielded unbiased estimates of the coefficients and standard errors, and reasonable coverage probabilities. As expected, the MI approach was more efficient than the ACML method in estimating (β2z, β2zt). MI was as efficient as the ACML method in estimating β2xt, because the inexpensive confounder Zi was not very informative about the expensive exposure Xei when δ = −0.05.
TABLE 3.
Simulation results for Setting (d) when one analyzes the secondary outcome Y2 reusing data from a two-phase ODS study based on Y1
| | Y1 | | | | | | Y2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | β10 | β1x | β1z | β1t | β1xt | β1zt | β20 | β2x | β2z | β2t | β2xt | β2zt |
| Naive analysis | ||||||||||||
| % Bias | | | | | | | −0.016 | 17.744 | 11.813 | 0.625 | 19.081 | 17.482 |
| SE | | | | | | | 0.189 | 0.333 | 0.138 | 0.037 | 0.068 | 0.026 |
| SEE | | | | | | | 0.190 | 0.342 | 0.139 | 0.038 | 0.068 | 0.028 |
| CP | | | | | | | 0.943 | 0.950 | 0.597 | 0.951 | 0.939 | 0.180 |
| ODS slope sampling with ACML | ||||||||||||
| % Bias | 0.001 | −4.042 | −0.050 | 0.133 | −0.085 | 0.214 | 0.006 | 3.816 | 0.187 | 0.131 | −0.016 | −0.231 |
| SE | 0.268 | 0.515 | 0.212 | 0.055 | 0.098 | 0.050 | 0.186 | 0.327 | 0.146 | 0.035 | 0.063 | 0.027 |
| SEE | 0.280 | 0.506 | 0.215 | 0.055 | 0.099 | 0.050 | 0.185 | 0.334 | 0.143 | 0.035 | 0.064 | 0.028 |
| CP | 0.955 | 0.948 | 0.956 | 0.947 | 0.957 | 0.944 | 0.943 | 0.958 | 0.945 | 0.945 | 0.956 | 0.962 |
| ODS slope sampling with MI | ||||||||||||
| % Bias | 0.010 | −5.852 | −0.086 | 0.002 | −0.487 | −0.078 | 0.009 | 2.688 | −0.008 | 0.015 | −1.437 | −0.133 |
| SE | 0.178 | 0.505 | 0.094 | 0.036 | 0.096 | 0.024 | 0.117 | 0.324 | 0.067 | 0.023 | 0.063 | 0.012 |
| SEE | 0.176 | 0.488 | 0.097 | 0.037 | 0.096 | 0.024 | 0.117 | 0.322 | 0.065 | 0.022 | 0.061 | 0.013 |
| CP | 0.950 | 0.941 | 0.953 | 0.958 | 0.953 | 0.938 | 0.947 | 0.950 | 0.939 | 0.944 | 0.933 | 0.958 |
Abbreviations: ACML, ascertainment corrected maximum likelihood; CP, coverage probability of the 95% confidence interval; MI, multiple imputation; ODS, outcome-dependent sampling; SE, empirical standard error; SEE, empirical mean of the standard error estimator.
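The efficiency comparison between MI and ACML for the secondary outcome can be read off the empirical standard errors in Table 3. The sketch below assumes relative efficiency is measured as the ratio of empirical variances (squared SEs); the dictionary keys are shorthand labels introduced here, and because the tabulated SEs are rounded to three decimals the ratios are only approximate.

```python
# Empirical SEs for selected Y2 coefficients, copied from Table 3.
se_acml = {"beta_2z": 0.146, "beta_2xt": 0.063, "beta_2zt": 0.027}
se_mi = {"beta_2z": 0.067, "beta_2xt": 0.063, "beta_2zt": 0.012}


def relative_efficiency(se_reference, se_new):
    """Variance ratio: values above 1 mean the new method is more efficient."""
    return (se_reference / se_new) ** 2


re_mi_vs_acml = {
    name: relative_efficiency(se_acml[name], se_mi[name]) for name in se_acml
}
for name, value in re_mi_vs_acml.items():
    print(f"{name}: {value:.2f}")
```

Under this definition the ratio is about 4.7 for β2z and about 5.1 for β2zt, and it equals 1 for β2xt because both methods report an SE of 0.063, which matches the qualitative statements in the text.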
6 |. THE LHS
We considered 2563 participants in the LHS who were continuous smokers with at least two observations. Complete information on baseline characteristics of this cohort can be found in Schildcrout et al. (2020). Briefly, the majority of subjects (62%) were male. Median age was 48 years (interdecile range: 39–57 years). Fifty-six percent of subjects had at least one copy of the T-allele at SNP rs177852 and 66% of subjects completed all five follow-up visits.
We are interested in estimating the differences in the rate of lung function decline between those with and without the T-allele at SNP rs177852. Even though genetic data were available for all subjects, for illustrative purposes we considered a scenario where information on the presence of the T-allele at SNP rs177852 was unavailable and could be measured for only about a third of the subjects due to financial constraints. Specifically, from the initial 2563 subjects in the cohort, we subsampled approximately 800 subjects and examined the following designs and inference methods: (1) SRS with standard maximum likelihood estimator, (2) ODS slope sampling with ACML estimator, (3) BDS slope sampling with ACML estimator, (4) SRS with the MI approach, (5) ODS slope sampling with the MI approach, and (6) BDS slope sampling with the MI approach. When implementing the ODS slope sampling design, we computed all the estimated slopes from the subject-specific simple linear regressions of FEV and FVC on study year, and we randomly assigned half of the cohort to be sampled based on FEV and the remaining half to be sampled based on FVC. For each outcome, we sampled on average 300 subjects whose subject-specific slope was either below the 15th percentile or above the 85th percentile, and 100 subjects whose subject-specific slope fell in the central 70% region. For the BDS slope sampling design, we adopted a similar procedure, except that we used the BLUP of random slopes from a linear mixed effects model analysis of phase-one data. When implementing the MI approach, we set m = 5 iterations and M = 70 imputations. We replicated each study 200 times and report the average estimated coefficients and standard errors.
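The ODS slope sampling step can be sketched for a single outcome as follows. This is an illustration under simplifying assumptions rather than the authors' implementation: it draws fixed stratum sizes of (150, 100, 150), whereas the LHS analysis describes average sizes and splits the cohort between FEV and FVC, and it uses a simulated toy cohort. The names `ols_slope`, `ods_slope_sample`, and `simulate_cohort`, and all toy parameter values, are hypothetical.

```python
import random
import statistics


def ols_slope(times, values):
    """Slope from a subject-specific simple linear regression of values on times."""
    t_bar = statistics.fmean(times)
    y_bar = statistics.fmean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx


def ods_slope_sample(cohort, sizes=(150, 100, 150), lower_pct=15, upper_pct=85, seed=0):
    """Stratify subjects by estimated slope and sample within each stratum.

    cohort : dict mapping subject id -> (times, values)
    Returns the sampled ids per stratum, the two cutoffs, and all slopes.
    """
    rng = random.Random(seed)
    slopes = {sid: ols_slope(t, y) for sid, (t, y) in cohort.items()}
    # 99 percentile cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(list(slopes.values()), n=100, method="inclusive")
    low_cut, high_cut = cuts[lower_pct - 1], cuts[upper_pct - 1]
    strata = {"low": [], "middle": [], "high": []}
    for sid, slope in slopes.items():
        if slope < low_cut:
            strata["low"].append(sid)
        elif slope > high_cut:
            strata["high"].append(sid)
        else:
            strata["middle"].append(sid)
    sample = {
        name: rng.sample(ids, min(size, len(ids)))
        for (name, ids), size in zip(strata.items(), sizes)
    }
    return sample, (low_cut, high_cut), slopes


def simulate_cohort(n=2400, seed=1):
    """Toy phase-one data: random intercepts and slopes, six visits per subject."""
    rng = random.Random(seed)
    times = [0, 1, 2, 3, 4, 5]
    cohort = {}
    for sid in range(n):
        b0, b1 = rng.gauss(0.0, 1.0), rng.gauss(-0.5, 0.2)
        cohort[sid] = (times, [b0 + b1 * t + rng.gauss(0.0, 0.3) for t in times])
    return cohort


sample, cutoffs, slopes = ods_slope_sample(simulate_cohort())
print({name: len(ids) for name, ids in sample.items()})
```

Replacing `ols_slope` with predicted random slopes from a linear mixed effects model fit to phase-one data would give the analogous BDS stratification.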
The results from the six two-phase analyses and those from the full-cohort analysis are presented in Figure 2 and Web Table 8. For the full-cohort analysis, FEV of participants with at least one copy of the T-allele declined at a rate of 0.08 dL (95% confidence interval: −0.12 dL, −0.04 dL) per year faster than in participants without the T-allele; similarly, FVC of participants with at least one copy of the T-allele declined at a rate of 0.07 dL (95% confidence interval: −0.13 dL, −0.02 dL) per year faster than in participants without the T-allele. The estimates under most two-phase analyses were consistent with the full-cohort estimates. The results under the ODS and BDS slope sampling designs appeared to be similar because the confounder by time interaction associations were not strong in this data set. As expected, confidence intervals for the study year by SNP interaction association estimates were narrower under the ODS and BDS designs compared to SRS. Standard error estimates obtained by MI were generally smaller than those obtained by the ACML method. For example, the standard errors of the study year by SNP interaction estimates for the two outcomes under the MI approach were 26% and 23% smaller, respectively, than those under the ACML method with the ODS design. For the other regression parameters associated with time-varying covariates, the combination of design and MI recovered nearly all the information from the full cohort. For example, 95% confidence intervals for the study year by sex interaction association estimates of the full cohort and those of the ODS and BDS designs with the MI approach had the same width.
FIGURE 2.

Point estimates and 95% confidence intervals (CI) for study year by SNP and study year by sex interaction effects. FEV, forced expiratory volume. FVC, forced vital capacity. FC, full cohort. RS, random sampling. ODS, outcome dependent sampling. BDS, BLUP dependent sampling. ML, maximum likelihood. ACML, ascertainment corrected maximum likelihood. MI, multiple imputation
7 |. DISCUSSION
We have described a class of two-phase ODS designs for multivariate longitudinal continuous outcomes that uses available information on outcomes and covariates to identify the most informative subjects for whom the expensive exposure should be measured. The extensions of design and analysis procedure to multiple outcomes allow us to study multiple associations simultaneously and to reuse data from existing two-phase ODS studies. Similar to designs for a single longitudinal continuous outcome (Schildcrout et al., 2013; Sun et al., 2017), selecting subjects based on appropriate low-dimensional summaries of the outcomes can increase estimation efficiency. Specifically, we showed that subjects with extreme slopes are informative for the coefficients of the time-varying covariates. Similarly, we would expect that subjects with extreme intercepts would be informative for the coefficients of time-fixed covariates.
We have proposed two valid inference procedures under two-phase ODS designs for multivariate longitudinal continuous outcomes, that is, the ACML and MI approaches. When there are no inexpensive covariates Zi that are correlated with the expensive covariate Xei, we expect the ACML and MI approaches to be equally efficient. This was observed in Zelnick et al. (2018) for a single longitudinal outcome with balanced and complete data and is consistent with Derkach et al. (2015) and Bjørnland et al. (2018) for two-phase studies with cross-sectional outcomes and the fact that MI approaches with proper imputation models are as efficient as full-data maximum likelihood approaches. When there are Zi that are correlated with Xei, we expect the MI approach to be more efficient than the ACML approach because the former exploits the additional information about Xei contained in Zi for subjects not selected in phase two. We saw in our simulations that this was true for finite samples. In particular, when we increased the correlation between Xei and Zi (Setting (b) in Section 5.1), the efficiency gains of MI over ACML were more pronounced. Calculating the asymptotic relative efficiency in this situation, however, is nontrivial and, to our knowledge, no such result exists, even for two-phase studies with cross-sectional data. The analytical calculation of the asymptotic relative efficiency might be addressed in future work.
The ACML and MI approaches described in this paper should most likely be applied under different scenarios: the ACML approach can be used when the analyst does not have access to phase-one data while the MI approach is preferred if the analyst can obtain all data from both phases, including the incomplete data from those subjects not selected in phase two. The ACML estimator is easy to compute and can accommodate any type of expensive exposure. However, it requires a specific implementation for each different design because the calculation of the ascertainment correction term in expression (12) is design-dependent. On the other hand, the MI algorithm is generic for any two-phase ODS design, although the corresponding computational cost is higher. We have focused on a binary expensive exposure for the MI procedure. In principle, it is straightforward to extend the method to accommodate categorical, count, or continuous expensive covariates (Web Appendix B). Alternatively, we can model f(Xei|Zi, Si = 0) nonparametrically using B-spline sieves or kernel smoothing techniques.
ACKNOWLEDGMENTS
This research was supported by the National Institutes of Health Grant R01HL094786. This work was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN.
Funding information
National Institutes of Health; National Heart, Lung, and Blood Institute, Grant/Award Number: R01HL094786
APPENDIX: Derivation of the conditional exposure odds model (7)
Because phase-two sampling is based on Yi and Zi, we have f(Xei|Yi, Zi, Si = 0) = f(Xei|Yi, Zi, Si = 1) = f(Xei|Yi, Zi). Consequently, we can construct the conditional exposure model for subjects with Si = 0 by using available data on subjects with Si = 1, that is,
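$$\frac{\Pr(X_{ei}=1\mid Y_i,Z_i,S_i=0)}{\Pr(X_{ei}=0\mid Y_i,Z_i,S_i=0)}=\frac{\Pr(X_{ei}=1\mid Y_i,Z_i,S_i=1)}{\Pr(X_{ei}=0\mid Y_i,Z_i,S_i=1)}=\frac{\Pr(X_{ei}=1\mid Z_i,Y_i)}{\Pr(X_{ei}=0\mid Z_i,Y_i)}. \tag{A.1}$$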
Applying Bayes’ rule to both the numerator and denominator of Pr(Xei = 1|Zi, Yi)∕Pr(Xei = 0|Zi, Yi), we obtain the following formulation for the conditional exposure model:
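$$\frac{\Pr(X_{ei}=1\mid Z_i,Y_i)}{\Pr(X_{ei}=0\mid Z_i,Y_i)}=\frac{f(Y_i\mid X_{ei}=1,Z_i)\,\Pr(X_{ei}=1\mid Z_i)}{f(Y_i\mid X_{ei}=0,Z_i)\,\Pr(X_{ei}=0\mid Z_i)}. \tag{A.2}$$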
Under the Gaussian assumption, Yi|Xei = 0, Zi is normally distributed with mean μ0,i = E(Yi|Xei = 0, Zi) and covariance matrix V0,i = var(Yi|Xei = 0, Zi). Similarly, Yi|Xei = 1, Zi is normally distributed with mean μ1,i = E(Yi|Xei = 1, Zi) and covariance matrix V1,i = var(Yi|Xei = 1, Zi). Consequently, we can rewrite expression (A.2) as:
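$$\log\frac{\Pr(X_{ei}=1\mid Z_i,Y_i)}{\Pr(X_{ei}=0\mid Z_i,Y_i)}=\log\frac{\Pr(X_{ei}=1\mid Z_i)}{\Pr(X_{ei}=0\mid Z_i)}+\frac{1}{2}\log\frac{|V_{0,i}|}{|V_{1,i}|}-\frac{1}{2}(Y_i-\mu_{1,i})^{\top}V_{1,i}^{-1}(Y_i-\mu_{1,i})+\frac{1}{2}(Y_i-\mu_{0,i})^{\top}V_{0,i}^{-1}(Y_i-\mu_{0,i}). \tag{A.3}$$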
If we further assume that V0,i = V1,i = Vi, then the conditional exposure model reduces to
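$$\log\frac{\Pr(X_{ei}=1\mid Z_i,Y_i)}{\Pr(X_{ei}=0\mid Z_i,Y_i)}=\log\frac{\Pr(X_{ei}=1\mid Z_i)}{\Pr(X_{ei}=0\mid Z_i)}+(\mu_{1,i}-\mu_{0,i})^{\top}V_i^{-1}\left\{Y_i-\frac{\mu_{1,i}+\mu_{0,i}}{2}\right\}. \tag{A.4}$$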
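The equal-covariance reduction can be checked numerically. Under the Gaussian assumption with V0,i = V1,i = Vi, applying Bayes' rule to the exposure odds gives a log odds that is linear in Yi: the marginal log odds of exposure given Zi plus (μ1,i − μ0,i)ᵀVi⁻¹{Yi − (μ1,i + μ0,i)/2}. This closed form is an algebraic simplification written out here from the stated assumptions, and its notation may differ from expression (A.4). The sketch below compares it with the direct density-ratio calculation for a bivariate outcome; all function names and numeric values are hypothetical.

```python
import math


def inverse_2x2(v):
    """Inverse and determinant of a 2 x 2 covariance matrix."""
    (a, b), (c, d) = v
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def bilinear(x, m, y):
    """Compute x' M y for length-2 vectors and a 2 x 2 matrix."""
    return sum(x[i] * m[i][j] * y[j] for i in range(2) for j in range(2))


def log_normal_density(y, mu, v):
    """Log density of a bivariate normal distribution."""
    v_inv, det = inverse_2x2(v)
    resid = [y[0] - mu[0], y[1] - mu[1]]
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * bilinear(resid, v_inv, resid)


def log_odds_bayes(y, mu0, mu1, v, p1):
    """Log exposure odds from Bayes' rule; p1 stands for Pr(X = 1 | Z)."""
    prior = math.log(p1 / (1 - p1))
    return prior + log_normal_density(y, mu1, v) - log_normal_density(y, mu0, v)


def log_odds_linear(y, mu0, mu1, v, p1):
    """Closed form under equal covariance: linear in y."""
    v_inv, _ = inverse_2x2(v)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    centered = [y[0] - 0.5 * (mu0[0] + mu1[0]), y[1] - 0.5 * (mu0[1] + mu1[1])]
    return math.log(p1 / (1 - p1)) + bilinear(diff, v_inv, centered)


# Hypothetical values for one subject.
y_i = [1.2, -0.4]
mu_0, mu_1 = [0.5, 0.0], [1.0, -0.3]
v_i = [[1.0, 0.3], [0.3, 2.0]]
direct = log_odds_bayes(y_i, mu_0, mu_1, v_i, p1=0.4)
closed_form = log_odds_linear(y_i, mu_0, mu_1, v_i, p1=0.4)
print(direct, closed_form)
```

The two values agree up to floating-point error, which illustrates why the exposure model for unsampled subjects can be fit as a logistic regression with Yi entering linearly when the covariance does not depend on the exposure.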
Footnotes
SUPPORTING INFORMATION
Supporting information referenced in Sections 5, 6, and 7 is available at the Biometrics website on Wiley Online Library. All simulation and analysis code is available on GitHub at https://github.com/ChiaraDG/MultivariateODS_LMM.
DATA AVAILABILITY STATEMENT
The Lung Health Study data used for the analyses in Section 6 were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap through dbGaP accession [phs000335.v3.p2].
REFERENCES
- Bjørnland T, Bye A, Ryeng E, Wisløff U and Langaas M (2018) Powerful extreme phenotype sampling designs and score tests for genetic association studies. Statistics in Medicine, 37, 4234–4251.
- Breslow N and Chatterjee N (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Journal of the Royal Statistical Society, Series C, 48, 457–468.
- Chatterjee N, Chen Y and Breslow N (2003) A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association, 98, 158–168.
- Connett J, Kusek J, Bailey W, O’Hara P and Wu M (1993) Design of the Lung Health Study: A randomized clinical trial of early intervention for chronic obstructive pulmonary disease. Controlled Clinical Trials, 14, 3S–19S.
- Derkach A, Lawless JF and Sun L (2015) Score tests for association under response-dependent sampling designs for expensive covariates. Biometrika, 102, 988–994.
- Hansel N, Ruczinski I, Rafaels N, Sin D, Daley D, Malinina A, et al. (2013) Genome-wide study identifies two loci associated with lung function decline in mild to moderate COPD. Human Genetics, 132, 79–90.
- Harrell F (2016) Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd edition. Switzerland: Springer International Publishing.
- Holt D, Smith T and Winter P (1980) Regression analysis of data from complex survey. Journal of the Royal Statistical Society, Series A, 143, 474–487.
- Lawless J, Kalbfleisch J and Wild C (1999) Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Series B, 61, 413–438.
- Lee A, McMurchy L and Scott A (1997) Re-using data from case-control studies. Statistics in Medicine, 16, 1377–1389.
- Lin D and Zeng D (2009) Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology, 3, 256–265.
- Lin D, Zeng D and Tang Z (2013) Quantitative trait analysis in sequencing studies under trait-dependent sampling. Proceedings of the National Academy of Sciences of the United States of America, 110, 12247–12252.
- Lin H, Wang M, Brody J, Bis J, Dupuis J, Lumley T, McKnight B, et al. (2014) Strategies to design and analyze targeted sequencing data: cohorts for heart and aging research in genomic epidemiology (CHARGE) consortium targeted sequencing study. Circulation: Cardiovascular Genetics, 7, 335–343.
- Pan Y, Cai J, Longnecker M and Zhou H (2018) Secondary outcome analysis for data from an outcome-dependent sampling design. Statistics in Medicine, 37, 2321–2337.
- Prentice R and Pyke R (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
- Rubin D (1976) Inference and missing data. Biometrika, 63, 581–590.
- Schildcrout J, Garbett S and Heagerty P (2013) Outcome vector dependent sampling with longitudinal continuous response data: stratified sampling based on summary statistics. Biometrics, 69, 405–416.
- Schildcrout J, Haneuse S, Tao R, Zelnick L, Schisterman E, Garbett S, et al. (2020) Two-phase, generalized case-control designs for quantitative longitudinal outcomes. American Journal of Epidemiology, 182, 81–90.
- Schildcrout J, Rathouz P, Zelnick L, Garbett S and Heagerty P (2015) Biased sampling design to improve research efficiency: factors influencing pulmonary function over time in children with asthma. Annals of Applied Statistics, 9, 731–753.
- Song R, Zhou H and Kosorok M (2009) A note on semiparametric efficient inference for two-stage outcome-dependent sampling with a continuous outcome. Biometrika, 96, 221–228.
- Sun Z, Mukherjee B, Estes J, Vokonas P and Park S (2017) Exposure enriched outcome dependent designs for longitudinal studies of gene-environment interaction. Statistics in Medicine, 36, 2947–2960.
- Tao R, Zeng D, Franceschini N, North K, Boerwinkle E and Lin D (2015) Analysis of sequence data under multivariate trait-dependent sampling. Journal of the American Statistical Association, 110, 560–572.
- Tao R, Zeng D and Lin D (2017) Efficient semiparametric inference under two-phase sampling, with applications to genetic association studies. Journal of the American Statistical Association, 112, 1468–1476.
- Tao R, Zeng D and Lin D (2020) Optimal designs of two-phase studies. Journal of the American Statistical Association, 115, 1946–1959.
- Weaver M and Zhou H (2005) An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association, 100, 459–469.
- White E (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115, 119–128.
- White I, Royston P and Wood A (2011) Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine, 30, 377–399.
- Zelnick L, Schildcrout J and Heagerty P (2018) Likelihood-based analysis of outcome-dependent sampling designs with longitudinal data. Statistics in Medicine, 37, 2120–2133.
- Zhou H, Chen J, Rissanen T, Korrick S, Hu H, Salonen J, et al. (2017) An efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology, 18, 461–468.
- Zhou H, Weaver M, Qin J, Longnecker M and Wang MC (2002) A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics, 58, 413–421.
