Author manuscript; available in PMC: 2020 Oct 15.
Published in final edited form as: Stat Med. 2019 Jul 29;38(23):4611–4624. doi: 10.1002/sim.8321

On the analysis of two-phase designs in cluster-correlated data settings

C Rivera-Rodriguez 1,*, D Spiegelman 2,3,4, S Haneuse 4
PMCID: PMC6736737  NIHMSID: NIHMS1039160  PMID: 31359448

Summary

In public health research, information that is readily available may be insufficient to address the primary question(s) of interest. One cost-efficient way forward, especially in resource-limited settings, is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s). At phase II, detailed covariate data is ascertained on a sub-sample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. As such, when participants are naturally clustered (e.g. patients within clinics) these methods may yield invalid inference. To address this, we develop a novel analysis approach based on inverse-probability weighting (IPW) that permits researchers to specify some working covariance structure, appropriately accounts for the sampling design and ensures valid inference via a robust sandwich estimator for which a closed-form expression is provided. To enhance statistical efficiency, we propose a calibrated IPW estimator that makes use of information available at phase I but not used in the design. In addition to describing the technique, practical guidance is provided for the cluster-correlated data settings that we consider. A comprehensive simulation study is conducted to evaluate small-sample operating characteristics, including the impact of using naïve methods that ignore correlation due to clustering, as well as to investigate design considerations. Finally, the methods are illustrated using data from a one-time survey of the national anti-retroviral treatment program in Malawi.

Keywords: Calibration, Generalized estimating equations, Inverse-probability weighting, Two-phase study

1 |. INTRODUCTION

In public health and medical research it is often the case that at least some information is readily available for all participants in the population under investigation. It is also often the case, however, that this information is insufficient in regard to addressing the primary question(s) of interest. For example, it may be that only a surrogate for the exposure of interest is available or that a key adjustment variable is not recorded1. As another example, specifically one that motivated this paper, consider the evaluation of national antiretroviral treatment programs in resource-limited settings, such as one currently implemented in Malawi2,3. Because of logistical and financial constraints, to-date these programs have typically been designed, implemented and evaluated using a framework developed by the World Health Organization that relies primarily on aggregated, facility-level data4. Thus, the data that is typically available constitutes an ecological study the analysis of which may suffer from ecological bias, depending upon the question of interest5.

In these settings, researchers may be able to collect additional detailed information on a sub-sample of participants as a means to augment the existing data. One cost-efficient approach for doing so, especially in resource-limited settings, is to conduct a two-phase study6,7. Briefly, at phase I an initial large sample is drawn from the population of interest and stratified on the basis of covariates that are readily available on all participants. If the outcome is binary and no other covariates are stratified upon, then the two-phase study design corresponds to the well-known case-control design8. More generally, one may solely stratify on the basis of covariates other than the outcome or jointly on the outcome and certain covariates or not at all. At phase II, a random sample of participants is drawn (implicitly without replacement) from each phase I stratum, with covariate information not available at phase I ascertained.

For the most part, the statistical literature on the analysis of data from a two-phase design has focused on settings where the outcome is binary. In this setting, existing methods include the weighted likelihood or inverse-probability weighted (IPW) estimator9,10, the pseudo-likelihood estimator11,12 and the maximum likelihood estimator13,14. One feature shared by all of these is that individual participants are assumed to be independent of each other. In reality, however, it is often the case that study participants are correlated, as may be the case if patients are treated at different clinics. In complete data settings, it is well-known that ignoring correlation among outcomes that is induced by clustering will, in general, lead to invalid inference15. While a number of outcome-dependent sampling schemes, and corresponding analysis methods, have been developed for longitudinal studies16,17,18, family-based genetic studies19,20, and designs where entire clusters are selected on the basis of observed outcome rates21, to the best of our knowledge no methods have been developed specifically for the analysis of data from two-phase designs in contexts where participants are cluster-correlated. To address this gap, in this paper we develop an IPW framework for valid estimation and inference with respect to marginally-specified generalized linear models that simultaneously accounts for the two-phase design and for cluster-correlation. While the setting we consider could be framed as a special case of the general framework considered by Robins et al22, wherein the missingness mechanism is not assumed to be under the control of the researcher (as it would be in a two-phase design), in this paper we explicitly acknowledge that sampling within each phase I stratum is of fixed size and performed without replacement, by accounting for the joint probability distribution of the sampling indicators when estimating the asymptotic variance.
Furthermore, the recent literature in the independent data setting has sought to use survey sampling techniques, such as calibration, to improve statistical efficiency of standard IPW estimators23,24. Central to the appeal of these techniques is that they facilitate the use of information that is readily available on all participants in the initial cohort but was not used in the design25,26,27. A second key development of this paper, therefore, is a general framework for the use of calibration as a means to increasing efficiency in the cluster-correlated data setting, together with practical guidance on how to operationalize the framework in practice.

As we elaborate upon, the methods we propose are motivated and illustrated using data on 82,887 patients, registered at one of 189 clinics in the Malawian national anti-retroviral treatment program, available through a one-time survey conducted between 2005–2007.

2 |. MODEL SPECIFICATION AND THE TWO-PHASE DESIGN

2.1 |. Model specification

Suppose interest lies in the relationship between some response Y and a set of covariates, which may be categorical, continuous or a mixture of different types. Furthermore, suppose that, in the context within which this relationship is to be investigated, participants are naturally cluster-correlated. Notationally, let K denote the number of clusters, $N_k$ the number of participants in the kth cluster and $N=\sum_{k=1}^{K} N_k$ the total number of participants. At the outset, after consideration of the scientific question at hand, we assume that the response Y and some of the relevant covariates are readily available on all N participants; we let X denote covariates that are readily available and Z those covariates that are not. Given complete data on (Y,X,Z) for all N participants, we assume that an analysis would proceed by fitting the marginal mean model:

$$\mu_{ik}=E[Y_{ik}\mid X_{ik},Z_{ik}]=g^{-1}\left([X_{ik},Z_{ik}]^{T}\beta\right), \tag{1}$$

for the ith participant in the kth cluster, where g(·) is a link function and β is a p-vector of unknown regression parameters.

2.2 |. The two-phase design

In the absence of complete data on all N participants, suppose that a two-phase study was conducted. Specifically, suppose that at phase I the N participants are stratified by S, a categorical variable defined on the basis of Y and/or X and/or additional variables not in either X or Z but known for all participants. If either the response and/or some of the components of X are continuous they may be categorized in order to form the phase I strata. Note, this structure extends the typical set-up for a two-phase design in which the phase I strata are defined as a cross-classification of Y and some categorical variable based exclusively on components of X. Note also that, in the set-up we consider, S may or may not involve an indicator of cluster membership. If it does then the stratification will be nested within clusters; if it does not then the stratification will be crossed with the clustering.

Table 1 provides two examples of a phase I stratification for the N=82,887 patients in the Malawi data. In both examples, the outcome is a binary indicator representing ‘status at six months post-registration’: stopped treatment, lost to follow-up and death within 180 days were considered negative, with Y =1; transferred-out and alive and on-treatment were considered non-negative, with Y=0. In Design #1, S is the cross-classification of Y and a binary indicator of whether or not the clinic at which the patient registered was private. In design #2, S further cross-classifies the N=82,887 patients according to the year of registration (2005, 2006, 2007).

TABLE 1.

Two possible phase I stratification schemes that use readily-available group-level information collected by systems currently used by the Malawian national antiretroviral treatment program.

Design #1: clinic type

| Status at six months | Public | Private |
|---|---|---|
| Non-negative status (Y=0) | 65,466 | 2,119 |
| Negative status (Y=1) | 15,024 | 278 |

Design #2: clinic type by year of registration

| Status at six months | Public 2005 | Public 2006 | Public 2007 | Private 2005 | Private 2006 | Private 2007 |
|---|---|---|---|---|---|---|
| Non-negative status (Y=0) | 11,991 | 22,887 | 29,773 | 247 | 1,006 | 842 |
| Negative status (Y=1) | 3,492 | 6,014 | 6,333 | 22 | 167 | 113 |

For a given specification of S, let J denote the number of phase I strata and $N_j$ the number of participants within stratum [S = j], for j=1,…,J. From Table 1, J = 4 and 12 for Design #1 and Design #2, respectively. At phase II, a sample of $n_j \le N_j$ participants is drawn, at random, from the jth stratum. The value of the (otherwise unknown) Z is then ascertained so that (Y,X,Z) is known for $n=\sum_{j=1}^{J} n_j$ participants.

In principle, while the phase I stratification may or may not have considered the fact that participants are clustered they retain their cluster membership nonetheless. To acknowledge this let Njk denote the number of participants in the jth phase I stratum that belong to the kth cluster. Note, for any cluster that is not represented by at least one patient in a given phase I stratum Njk will equal zero. Similarly, let njk denote the number of participants selected at phase II from the jth phase I stratum that belong to the kth cluster.
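The sampling step and the cross-tabulated counts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the study: the function names `phase_two_sample` and `stratum_by_cluster_counts`, the toy population, and the stratum labels are all hypothetical.

```python
import random
from collections import Counter


def phase_two_sample(strata, n_per_stratum, seed=0):
    """Select n_j participants at random, without replacement, from stratum j.

    strata        -- list of phase I stratum labels, one per participant
    n_per_stratum -- dict mapping stratum label j to the phase II size n_j
    Returns a list of 0/1 selection indicators aligned with `strata`.
    """
    rng = random.Random(seed)
    members = {}
    for idx, s in enumerate(strata):
        members.setdefault(s, []).append(idx)
    selected = [0] * len(strata)
    for j, idxs in members.items():
        n_j = n_per_stratum[j]
        if n_j > len(idxs):
            raise ValueError(f"n_j={n_j} exceeds N_j={len(idxs)} in stratum {j}")
        for idx in rng.sample(idxs, n_j):
            selected[idx] = 1
    return selected


def stratum_by_cluster_counts(strata, clusters, selected=None):
    """Tabulate N_jk (all participants) or n_jk (phase II only, if `selected` is given)."""
    counts = Counter()
    for i, (s, k) in enumerate(zip(strata, clusters)):
        if selected is None or selected[i] == 1:
            counts[(s, k)] += 1
    return counts


# Toy population (hypothetical): 3 clusters crossed with 2 strata.
clusters = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
strata = ["a", "a", "b", "b", "a", "a", "a", "b", "b", "b", "b", "b"]
R = phase_two_sample(strata, {"a": 2, "b": 3}, seed=1)
N_jk = stratum_by_cluster_counts(strata, clusters)
n_jk = stratum_by_cluster_counts(strata, clusters, R)
```

In this toy population cluster 3 has no members in stratum "a", so the corresponding count is zero, mirroring the remark above about clusters that are not represented in a given stratum.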

3 |. ESTIMATION AND INFERENCE BASED ON COMPLETE DATA

Given complete data on (Y,X,Z) for all N participants estimation and inference with respect to β can proceed via generalized estimating equations (GEE). In particular, following the original formulation proposed by Liang and Zeger28, an estimate of β can be obtained as the solution to:

$$V(\beta)\equiv K^{-1}U^{T}1=K^{-1}\sum_{k=1}^{K}D_k^{T}V_k^{-1}(Y_k-\mu_k)=0, \tag{2}$$

where $\mu_k=(\mu_{1k},\ldots,\mu_{N_k k})$ is given by model (1), $D_k=\partial\mu_k/\partial\beta$ is an $N_k \times p$ matrix of partial derivatives and $V_k$ is an $N_k \times N_k$ working covariance matrix for $Y_k$ that may depend on β (through, say, a mean-variance relationship) and α, a parameter that indexes some pre-specified working dependence structure. Note that $U=\mathrm{diag}\{Y-\mu\}V^{-1}D$ is an N × p matrix of contributions to the score for β from all N study units across the K clusters, and 1 denotes an N-vector of ones.

Operationally, solving (2) requires an estimate of α. Towards this, Liang and Zeger28 proposed that a plug-in moment-based estimator be used. Subsequent work then proposed that α be estimated via a second set of estimating equations that would be solved iteratively29 or simultaneously30,31 with those given by expression (2).

3.1 |. Asymptotic properties

Let $\hat\beta$ denote the solution to equation (2). Xie and Yang32 consider the asymptotic properties of $\hat\beta$ under a range of large sample scenarios. Focusing on the setting where K → ∞ and max{$N_k$; k = 1, …, K} is bounded above ∀ K, the setting that Liang and Zeger28 consider, Xie and Yang32 show that given a consistent estimator of α and assuming mild regularity conditions, $\hat\beta$ is consistent for $\beta_0$, the true population parameter, and that

$$M(\beta_0)^{-1/2}H(\beta_0)(\hat\beta-\beta_0)\xrightarrow{d}\mathrm{Normal}(0,I_{p\times p}),$$

where $M(\beta)=\mathrm{Var}[V(\beta)]$ and $H(\beta)=E[\partial V(\beta)/\partial\beta]$. From a practical perspective, inference then proceeds on the basis of the plug-in estimator of $\mathrm{Var}[\hat\beta]$, specifically $\widehat{\mathrm{Var}}[\hat\beta]=\hat H^{-1}(\hat\beta)\hat M(\hat\beta)\hat H^{-1}(\hat\beta)$, where

$$\hat H(\beta)=K^{-1}\sum_{k=1}^{K}D_k^{T}V_k^{-1}D_k$$

and

$$\hat M(\beta)=K^{-2}\sum_{k=1}^{K}D_k^{T}V_k^{-1}(Y_k-\mu_k)(Y_k-\mu_k)^{T}V_k^{-1}D_k.$$
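As a worked special case, offered only as an illustration, consider a logit link with working independence and the binomial variance function. Here $A_k$ denotes the $N_k \times p$ matrix whose rows are the covariate vectors $[X_{ik},Z_{ik}]^T$; this symbol is introduced for the example and is not part of the notation above. The weights in the estimating equations then cancel, and the sandwich components reduce to familiar logistic-regression quantities:

```latex
\begin{aligned}
D_k &= \frac{\partial \mu_k}{\partial \beta}
     = \mathrm{diag}\{\mu_{ik}(1-\mu_{ik})\}\,A_k,
\qquad
V_k = \mathrm{diag}\{\mu_{ik}(1-\mu_{ik})\}, \\
D_k^{T}V_k^{-1}(Y_k-\mu_k) &= A_k^{T}(Y_k-\mu_k), \\
\hat H(\beta) &= K^{-1}\sum_{k=1}^{K} A_k^{T}\,\mathrm{diag}\{\mu_{ik}(1-\mu_{ik})\}\,A_k, \\
\hat M(\beta) &= K^{-2}\sum_{k=1}^{K} A_k^{T}(Y_k-\mu_k)(Y_k-\mu_k)^{T}A_k .
\end{aligned}
```

The outer product in $\hat M$ is formed cluster by cluster, so residuals from the same cluster are allowed to be correlated even though the point estimate coincides with that of ordinary logistic regression.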

4 |. ESTIMATION AND INFERENCE BASED ON DATA FROM A TWO-PHASE DESIGN

In the complete data setting, the choice of working dependence structure is often geared towards efficiency considerations. However, if (X, Z) includes at least one covariate that varies across units within a cluster (as will almost always be the case), the estimating equations given by expression (2) are not guaranteed to be unbiased unless working independence is adopted33,34. Thus, to avoid this potential source of bias, in the remainder of this paper we focus on estimation and inference assuming that working independence is adopted.

4.1 |. Estimation

Given data from a two-phase design, we propose to estimate β by solving the following IPW generalized estimating equations:

$$V_w(\beta)=K^{-1}\sum_{k=1}^{K}D_k^{T}V_k^{-1}W_k\,\mathrm{diag}\{R_k\}(Y_k-\mu_k)=0, \tag{3}$$

where $R_k=(R_{1k},\ldots,R_{N_k k})$ with $R_{ik}$ indicating whether the ith participant in the kth cluster was selected at phase II and $W_k$ is an $N_k \times N_k$ diagonal matrix with the [i, i]th entry given as the inverse of the inclusion probability corresponding to the phase I stratum to which the participant belonged. To formalize the latter, let $s_{ik}\in\{1,\ldots,J\}$ denote the observed phase I stratum membership for the ith participant in the kth cluster. That is, there are J strata, so $s_{ik}$ can take only J values. At phase II, the inclusion probability for the ith participant in the kth cluster is $E[R_{ik}\mid Y,S]=n_{s_{ik}}/N_{s_{ik}}$. The ith diagonal entry of $W_k$ is therefore $w_{ik}=N_{s_{ik}}/n_{s_{ik}}$. From this, we see that the weight for the ith participant in the kth cluster depends solely on their phase I stratum membership (i.e. through the value of $s_{ik}$); moreover, beyond the covariates used to form the phase I stratification, the weight does not depend on the underlying cluster membership.

Note, as will become evident in Section 4.3, it is useful to be able to succinctly represent the system of equations given by expression (3) as

$$V_w(\beta)=K^{-1}U^{T}WR=K^{-1}D^{T}V^{-1}\mathrm{diag}\{Y-\mu\}WR=0,$$

where $Y=(Y_1,\ldots,Y_K)$, $\mu=(\mu_1,\ldots,\mu_K)$, V is an N × N block-diagonal matrix with elements $V_k$, D is an N × p matrix consisting of the K cluster-specific matrices $D_k$ stacked on top of each other, R is an N-vector consisting of the K cluster-specific $R_k$ concatenated with each other and W is an N × N diagonal matrix with the diagonal entries of $(W_1,\ldots,W_K)$ on the diagonal.
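To make the weighting concrete, the following sketch computes the design weights from the stratum labels and evaluates the weighted estimating function for a logistic mean model under working independence, where (as an assumption of this example) the contribution of each selected participant is their weight times the covariate vector times the residual. The function names and toy data are hypothetical, and the constant factor $K^{-1}$ is omitted because it does not affect the root.

```python
import math
from collections import Counter


def design_weights(strata, selected):
    """Return w_ik = N_s / n_s for every participant, with s their phase I stratum."""
    N_j = Counter(strata)
    n_j = Counter(s for s, r in zip(strata, selected) if r == 1)
    for j in N_j:
        if n_j[j] == 0:
            raise ValueError(f"stratum {j} has no phase II participants")
    return [N_j[s] / n_j[s] for s in strata]


def ipw_score(beta, rows, y, w, selected):
    """Sum of w_ik * R_ik * x_ik * (y_ik - mu_ik) for a logistic mean model."""
    score = [0.0] * len(beta)
    for x, yi, wi, ri in zip(rows, y, w, selected):
        if ri == 0:
            continue  # unselected participants contribute nothing
        eta = sum(b * xj for b, xj in zip(beta, x))
        mu = 1.0 / (1.0 + math.exp(-eta))
        for j, xj in enumerate(x):
            score[j] += wi * xj * (yi - mu)
    return score


# Toy data (hypothetical): two strata of sizes 4 and 6, two sampled from each.
strata = ["a"] * 4 + ["b"] * 6
selected = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
w = design_weights(strata, selected)

# The weights of the sampled participants add up to the phase I count N.
weighted_count = sum(wi * ri for wi, ri in zip(w, selected))

# Intercept-only model: the root is the logit of the weighted prevalence.
p_hat = sum(wi * ri * yi for wi, ri, yi in zip(w, selected, y)) / weighted_count
rows = [[1.0]] * len(y)
score_at_root = ipw_score([math.log(p_hat / (1.0 - p_hat))], rows, y, w, selected)
```

For a model with several covariates the same score function would be passed to a root-finder such as Newton-Raphson; the intercept-only case is used here only because its solution is available in closed form.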

4.2 |. Asymptotic properties

Let $\hat\beta_w$ denote the estimate of β obtained as the solution to (3). In the Web Appendix we show that $\hat\beta_w$ is consistent for $\beta_0$ and that

$$V_T(\beta_0)^{-1/2}M(\beta_0)^{-1/2}H(\beta_0)(\hat\beta_w-\beta_0)\xrightarrow{d}\mathrm{Normal}(0,I_{p\times p}),$$

where M(β) and H(β) are as in Section 3, and $V_T(\beta)=V_1(\beta)+E[V_2(\beta)]$ with $V_1(\beta)=\mathrm{Var}[M(\beta)^{-1/2}V(\beta)]$ and $V_2(\beta)=\mathrm{Var}[M(\beta)^{-1/2}V_w(\beta)\mid\mathcal{K}]$ where $\mathcal{K}=\{Y,X,Z,S\}$. The arguments follow Xie and Yang32 for complete data GEE, with additional regularity conditions to accommodate the two-phase sampling scheme, including that: the cluster sizes, $N_k$, be bounded above ∀ k; the sampling fractions, $n_j/N_j$, be bounded below ∀ N; and that, as N → ∞: $n_j\to\infty$, $n_j/N_j\to f_{j,\infty}>0$, $N_j/N\to W_{j,\infty}<1$, and $K_j/N_j\to k_{j,\infty}>0$, where $K_j$ is the number of clusters in stratum j.

4.3 |. Inference in practice

Let Ks denote the number of clusters represented in the phase II sub-sample. In practice, the asymptotic variance of β^w can be estimated by

$$\widehat{\mathrm{Var}}[\hat\beta_w]=\frac{K_s}{K_s-p}\,\hat H_w^{-1}(\hat\beta_w)\,\hat M_w(\hat\beta_w)\,\hat H_w^{-1}(\hat\beta_w) \tag{4}$$

where

$$\hat H_w(\beta)=K^{-1}\sum_{k=1}^{K}D_k^{T}V_k^{-1}W_k\,\mathrm{diag}\{R_k\}D_k \tag{5}$$

and $\hat M_w(\beta)=V_I(\beta)+V_{II}(\beta)$, where

$$V_I(\beta)=K^{-1}U^{T}\mathrm{diag}\{R\}\,W_{2K}\,\mathrm{diag}\{R\}\,U, \tag{6}$$

with $W_{2K}$ an N × N block diagonal matrix in which the kth block consists of an $N_k \times N_k$ matrix with $1/\pi_{ii'}^{kk}$ in the [i, i′] entry, where $\pi_{ii'}^{kk}$ denotes the joint probability that study units i and i′ in cluster k are selected by the two-phase design (i.e. the pairwise selection probability); if study units i ≠ i′ in cluster k belong to the same phase I stratum (i.e. $s_{ik}=s_{i'k}$) then $\pi_{ii'}^{kk}=(n_{s_{ik}}\{n_{s_{ik}}-1\})/(N_{s_{ik}}\{N_{s_{ik}}-1\})$; if they belong to different strata, then $\pi_{ii'}^{kk}=(n_{s_{ik}}n_{s_{i'k}})/(N_{s_{ik}}N_{s_{i'k}})$; and $\pi_{ii}^{kk}=\pi_{ik}=n_{s_{ik}}/N_{s_{ik}}$. Finally,

$$V_{II}(\beta)=K^{-1}U^{T}W\,\mathrm{diag}\{R\}\,\tilde\Delta\,\mathrm{diag}\{R\}\,WU, \tag{7}$$

where $\tilde\Delta$ is the N × N matrix with entries $\tilde\Delta_{ii'}^{kk'}=(\pi_{ii'}^{kk'}-\pi_{ik}\pi_{i'k'})/\pi_{ii'}^{kk'}$, with $\pi_{ii'}^{kk'}$ denoting the joint probability that participants i and i′ from clusters k and k′, respectively, are selected by the two-phase design. Finally, to help understand expression (4), note that $V_I(\beta)$ is an estimate of the variance of the complete data estimating equations for β while $V_{II}(\beta)$ represents the additional uncertainty due to only having complete data on study units selected at phase II.
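The pairwise selection probabilities that enter expressions (6) and (7) depend only on the strata of the two study units, so they are simple to tabulate. The sketch below uses exact rational arithmetic; the function names and the stratum sizes are hypothetical, chosen only to illustrate the three cases described above.

```python
from fractions import Fraction


def joint_inclusion(s_i, s_ip, same_unit, n, N):
    """Joint phase II selection probability for two study units.

    s_i, s_ip -- phase I strata of the two units
    same_unit -- True when the two units are the same participant (i = i')
    n, N      -- dicts mapping stratum label j to n_j and N_j
    """
    if same_unit:
        return Fraction(n[s_i], N[s_i])
    if s_i == s_ip:
        # Two distinct units drawn without replacement from one stratum.
        return Fraction(n[s_i] * (n[s_i] - 1), N[s_i] * (N[s_i] - 1))
    # Units in different strata are sampled independently.
    return Fraction(n[s_i] * n[s_ip], N[s_i] * N[s_ip])


def delta_tilde(s_i, s_ip, same_unit, n, N):
    """Entry (pi_joint - pi_i * pi_i') / pi_joint of the matrix in expression (7).

    Raises ZeroDivisionError when pi_joint is zero, e.g. n_j = 1 and two
    distinct units of that stratum; such pairs can never both be sampled.
    """
    pi_joint = joint_inclusion(s_i, s_ip, same_unit, n, N)
    pi_i = Fraction(n[s_i], N[s_i])
    pi_ip = Fraction(n[s_ip], N[s_ip])
    return (pi_joint - pi_i * pi_ip) / pi_joint


# Hypothetical design: 2 of 4 sampled in stratum "a", 1 of 2 in stratum "b".
n = {"a": 2, "b": 1}
N = {"a": 4, "b": 2}
same_stratum = joint_inclusion("a", "a", False, n, N)
cross_stratum = joint_inclusion("a", "b", False, n, N)
```

Pairs from different strata have a zero entry in $\tilde\Delta$, so only same-stratum pairs (within or across clusters) contribute to $V_{II}$; the negative entries for distinct same-stratum pairs reflect sampling without replacement.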

5 |. IMPROVED ESTIMATION AND INFERENCE VIA CALIBRATED IPW

The framework proposed in Section 4 provides a means to perform consistent estimation and valid inference with respect to β in model (1). Here we consider calibration as a means to improving efficiency of the IPW estimator through use of information that is readily available at phase I but was not used as part of the design.

5.1 |. Calibration to a single variable

To simplify the exposition, we initially consider calibration to a single variable. Specifically, let $\tilde X_{ik}$ denote some random variable that was not used in the definition of S but that is known for all N participants (i.e., at phase I). In Design #1 from Table 1, for example, year of registration is known for all program registrants but was not used in the phase I stratification. Let

$$T(\tilde X)=\sum_{k=1}^{K}\sum_{i=1}^{N_k}\tilde X_{ik}$$

denote the (known) total of $\tilde X_{ik}$ across all N study participants. For any given two-phase design, an unbiased estimate of $T(\tilde X)$ is

$$\hat T(\tilde X)=\sum_{k=1}^{K}\sum_{i=1}^{N_k}w_{ik}R_{ik}\tilde X_{ik},$$

where the design-specific weights, $w_{ik}$, are as defined in Section 4.1. Since $T(\tilde X)$ is known, however, one can construct a new set of weights, say $\tilde w_{ik}$, such that the weighted total of the $\tilde X_{ik}$ among those sampled at phase II is forced to equal $T(\tilde X)$. Operationally, these so-called calibrated weights are found by minimizing some distance function $d(\tilde w_{ik},w_{ik})$ subject to the calibration constraint

$$T(\tilde X)=\sum_{k=1}^{K}\sum_{i=1}^{N_k}\tilde w_{ik}R_{ik}\tilde X_{ik} \tag{8}$$

and that $\sum_{k=1}^{K}\sum_{i=1}^{N_k}R_{ik}\tilde w_{ik}=N$. In practice, common choices for the distance function are the χ² distance, $d(\tilde w,w)=(\tilde w-w)^2/2w$, and the deviance distance, $d(\tilde w,w)=\tilde w\ln(\tilde w/w)-\tilde w+w$. Finally, the calibrated weights can be substituted into expression (3) to give a new set of estimating equations for β, specifically

$$\tilde V_w(\beta)=K^{-1}\sum_{k=1}^{K}D_k^{T}V_k^{-1}\tilde W_k\,\mathrm{diag}\{R_k\}(Y_k-\mu_k)=0, \tag{9}$$

where $\tilde W_k$ is an $N_k \times N_k$ diagonal matrix with $\tilde w_{ik}$ in the [i, i]th entry.
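For the χ² distance the calibrated weights are available in closed form, which the following sketch illustrates for a single auxiliary variable together with the constraint on the total count. It rests on the standard Lagrangian argument (the calibrated weight equals the design weight times one plus a linear function of the auxiliary values), and the function name and toy numbers are hypothetical.

```python
def calibrate_chi2(w, x, total_x, N):
    """Calibrate design weights to the known count N and known total of x.

    w, x    -- design weights and auxiliary values, phase II participants only
    total_x -- known phase I total of x over all N participants
    Returns calibrated weights w_i * (1 + lam0 + lam1 * x_i).
    Note: linear calibration does not guarantee positive weights in general.
    """
    # Weighted cross-products sum_i w_i z_i z_i^T with z_i = (1, x_i).
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Gaps between the known totals and their design-weighted estimates.
    g0 = N - a00
    g1 = total_x - a01
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("x is constant in the phase II sample; cannot calibrate")
    # Solve the 2 x 2 linear system for the Lagrange multipliers.
    lam0 = (a11 * g0 - a01 * g1) / det
    lam1 = (a00 * g1 - a01 * g0) / det
    return [wi * (1.0 + lam0 + lam1 * xi) for wi, xi in zip(w, x)]


# Hypothetical phase II sample of four participants from a phase I cohort of 10,
# in which 6 participants are known to have x = 1.
w = [2.0, 2.0, 3.0, 3.0]
x = [1.0, 0.0, 1.0, 1.0]
w_cal = calibrate_chi2(w, x, total_x=6.0, N=10.0)
```

Before calibration the design-weighted total of x is 8 rather than the known value of 6; after calibration both the count and the total of x are reproduced exactly.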

5.2 |. Calibration to the influence functions

In principle, one can calibrate an IPW estimator to any number of random variables observed at phase I with the only modification being that a separate constraint, analogous to expression (8), be satisfied for each variable included. Towards increasing statistical efficiency, wise choices include variables that are related to the estimand of interest. Based on this, recent work has shown that calibrating to the so-called influence functions for the mean model, that is the columns of the N × p matrix:

$$V(\beta)\,E\left[\frac{\partial V(\beta)}{\partial\beta}\right]^{-1}\Bigg|_{\beta=\beta_0}, \tag{10}$$

is optimal in terms of efficiency35,36. To be consistent with the notation developed so far, we denote this matrix as $\tilde X_{IF}$.

Practically, computing the elements of $\tilde X_{IF}$ requires knowledge of (Y,X,Z) for all N participants at phase I and of the true parameter $\beta_0$. Since (Y,X,Z) is only known for those participants selected at phase II, however, the contributions to $\tilde X_{IF}$ for those participants not selected at phase II must be estimated. Towards this we follow Breslow et al25 and propose the following strategy:

  1. Impute Z for all N participants at phase I on the basis of any and all information available at phase I, including on covariates that are observed at phase I but not included in X.

  2. Conduct a complete data working independence GEE analysis, as in Section 3, using a dataset of size N for which the values of X are as observed and the values of Z are the imputed values from step (i).

  3. Compute the N × p matrix of influence function contributions from the fit in step (ii), which we denote as $\hat{\tilde X}_{IF}$.

  4. Noting that, by definition, the column totals of $\hat{\tilde X}_{IF}$ form a p-vector of zeros, calibrate the design weights (i.e. $w_{ik}$) to the columns of $\hat{\tilde X}_{IF}$ by minimizing the distance function $d(\tilde w_{ik},w_{ik})$ subject to the p constraints:
    $$\sum_{k=1}^{K}\sum_{i=1}^{N_k}\tilde w_{ik}R_{ik}\hat{\tilde X}_{IF:i,l}^{k}=0,\quad l=1,\ldots,p,$$
    where $\hat{\tilde X}_{IF:i,l}^{k}$ is the contribution to the lth column of $\hat{\tilde X}_{IF}$ from the ith participant in the kth cluster, together with the constraint that $\sum_{k=1}^{K}\sum_{i=1}^{N_k}R_{ik}\tilde w_{ik}=N$.

The resulting calibrated weights, $\tilde w_{ik}$, can then be plugged into (9) to obtain the calibrated IPW estimator $\tilde\beta_w$. Finally, we note that, while calibrating on variables beyond the influence functions will neither improve nor diminish (asymptotic) efficiency36, our experience has been that doing so can help with convergence issues that sometimes arise when computing the calibrated weights. This was particularly the case in our simulations and analyses when a covariate known at phase I was rare.
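To see why the column totals in step (iv) are zero, it may help to write the estimated influence-function matrix in terms of the score contributions U from Section 3. The display below is a sketch under the assumption that the imputed-data fit in step (ii) solves equation (2) exactly; $\hat U$ and $\hat H$ denote U and $\hat H$ evaluated at that fit, and 1 is an N-vector of ones. Sign and scaling conventions for influence functions vary and do not affect the argument.

```latex
\begin{aligned}
\hat{\tilde X}_{IF} &= \hat U\,\hat H^{-1}, \\
1^{T}\hat{\tilde X}_{IF} &= \left(\hat U^{T}1\right)^{T}\hat H^{-1} = 0^{T}\hat H^{-1} = 0^{T},
\qquad\text{since } K^{-1}\hat U^{T}1 = 0 \text{ by (2)}.
\end{aligned}
```

Because $\hat H$ is invertible, requiring the calibrated totals of the columns of $\hat{\tilde X}_{IF}$ to be zero imposes the same set of constraints as requiring the calibrated totals of the columns of $\hat U$ to be zero.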

5.3 |. Additional details regarding imputation

Here we make a number of comments regarding the construction of an imputed dataset from which the estimated influence functions are obtained. First, as discussed by Breslow et al26, the estimated influence functions should only depend on the phase I information; if any information solely available at phase II is used, the analysis runs the risk of violating the missing at random assumption implicit to the design. This is because the calibrated weights would then be a function, in part, of covariates not observed on all individuals (see Web Appendix D for additional detail).

Second, a special situation arises for cluster-level variables. It may be the case, in particular, that a covariate collected at phase II is a cluster-specific covariate. In the simulations of Section 6, $Z_{2k}$ is an example of such a variable. Since this variable is common to all participants in the cluster, in lieu of imputing via some model, it may be reasonable to simply carry forward the value to those participants not selected at phase II. In settings where all clusters are represented in the phase II sample, this will be equivalent to having known the variable at phase I. If some clusters are not represented, however, then their values will need to be imputed and the fact that the imputation depends implicitly on the design may lead to violations of the missing at random assumption, with the extent of violation depending on how many clusters are represented in the phase II sub-sample.

Third, our experience suggests that one should impute on the basis of models for the (partially missing) Z using information on the n phase II participants, fit via IPW based on the design weights defined in Section 4.1. The advantage of doing so is that if the imputation models are misspecified, IPW estimation guarantees the estimates to be consistent for what would have been obtained had the misspecified model been fitted to the complete data37. We also note that the weights for any given imputation model should depend on whether the variable being imputed is specific to the participant or to the cluster. For the former, the weights should be the standard design weights (i.e. wik defined in Section 4.1). For the latter, they should be taken as the inverse of the probability that the kth cluster is represented by at least one participant at phase II:

$$\pi_k=1-\Pr(R_{1k}=\cdots=R_{N_k k}=0\mid X,Y,S)=1-\prod_{j=0}^{J}\prod_{i=0}^{n_{jk}-1}\left\{1-\left(\frac{n_j-i}{N_j-i}\right)\right\}.$$

Finally, it is well-known that, in standard imputation settings, including the outcome directly in the imputation model is wise, in part because it ensures that the relationships between the outcome and variables that are imputed are not destroyed38. In conducting the simulation studies of Section 6 and the analysis of the data from Malawi in Section 7, however, we experienced consistent convergence/identifiability difficulties when performing step (ii) of the algorithm in Section 5.2 if the imputation model in step (i) included the outcome. While this may not be a universal phenomenon, these difficulties led to the exclusion of the outcome in the imputation models used in the analyses presented in Sections 6 and 7. As pointed out by a reviewer, a consequence of this is that it will not be possible to perform optimal calibration since the estimates of β that are plugged into expression (10) to compute X˜^IF will be biased. Moreover, if the outcome is not included in the imputation model for Z, and there is no relationship between Z and Y in step (ii) (that is, the components of β referring to the Y ~ Z relationship are estimated to be zero or close to zero), then step (iv) will effectively be calibrating to the influence function for a misspecified model, specifically that for E[Y|X] instead of the analysis model E[Y|X,Z]. At this point, the efficiency gains obtained through calibration will depend on the nature of the misspecification. For example, suppose the analysis forges ahead with fitting a model E[Y |X,Z] in step (iii). That the ‘true’ association between Z and Y in the imputed dataset is null and yet the corresponding coefficients are nevertheless estimated will introduce noise that is, in turn, detrimental; efficiency gains will be obtained for X over an uncalibrated analysis but greater efficiency gains may be obtained by calibrating to the influence function for E[Y|X] directly or on X directly. 
Alternatively, if auxiliary information that is not part of the analysis model is available and can be used to impute Z such that the components of β corresponding to Z are estimated in step (ii) to be non-zero (albeit still with bias since Y is not being used directly), then calibrating to E[Y|X, Z] may be preferable. As we elaborate upon below, this is what we do in our simulations, specifically through the use of auxiliary variables X3 and X4 (see Section 6.3).

5.4 |. Asymptotic properties

For a given choice of calibration variables, let $\tilde\beta_w$ denote the solution to (9). To establish asymptotic properties of $\tilde\beta_w$ we build on results developed for $\hat\beta_w$ in Section 4 together with results from Pierce39 and of Deville and Särndal24. Briefly, the approach hinges on representing the calibrated weights as $\tilde w_{ik}\equiv\tilde w_{ik}(\hat\lambda)=w_{ik}F(\tilde X_{ik}^{T}\hat\lambda)$, where $\hat\lambda$ is an estimate of the so-called calibration parameter, λ, obtained by solving:

$$\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N_k}R_{ik}w_{ik}F(\lambda^{T}\tilde X_{ik})\tilde X_{ik}-\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\tilde X_{ik}=0 \tag{11}$$

with F(·) determined by the choice of the distance function. For example, F(y) = 1 + y corresponds to the χ² distance function while F(y) = exp(y) corresponds to the deviance distance; see Table 1 of Deville and Särndal24.
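When F(y) = exp(y) there is no closed form for the calibration parameter, but the calibration equations can be solved numerically. The sketch below applies Newton's method to the case of one auxiliary variable plus a constant term (so that the count N is also reproduced); the common factor 1/K is dropped because it does not change the root. The function name, starting value, tolerance, and toy numbers are all assumptions of this example.

```python
import math


def calibrate_raking(w, x, total_x, N, tol=1e-10, max_iter=50):
    """Solve the calibration equations with F(y) = exp(y) by Newton's method.

    w, x    -- design weights and auxiliary values, phase II participants only
    total_x -- known phase I total of x; N -- known phase I count
    Returns calibrated weights w_i * exp(lam0 + lam1 * x_i), always positive.
    """
    lam0, lam1 = 0.0, 0.0
    for _ in range(max_iter):
        wt = [wi * math.exp(lam0 + lam1 * xi) for wi, xi in zip(w, x)]
        # Residuals of the two calibration constraints.
        g0 = sum(wt) - N
        g1 = sum(wi * xi for wi, xi in zip(wt, x)) - total_x
        if max(abs(g0), abs(g1)) < tol:
            return wt
        # Jacobian of the residuals with respect to (lam0, lam1).
        j00 = sum(wt)
        j01 = sum(wi * xi for wi, xi in zip(wt, x))
        j11 = sum(wi * xi * xi for wi, xi in zip(wt, x))
        det = j00 * j11 - j01 * j01
        if det == 0:
            raise ValueError("singular Jacobian; x may be constant in the sample")
        # Newton step: subtract the inverse Jacobian times the residuals.
        lam0 -= (j11 * g0 - j01 * g1) / det
        lam1 -= (j00 * g1 - j01 * g0) / det
    raise RuntimeError("calibration did not converge")


# Hypothetical phase II sample of four participants from a phase I cohort of 10,
# in which 6 participants are known to have x = 1.
w = [2.0, 2.0, 3.0, 3.0]
x = [1.0, 0.0, 1.0, 1.0]
w_rake = calibrate_raking(w, x, total_x=6.0, N=10.0)
```

Unlike the linear form F(y) = 1 + y, the exponential form keeps every calibrated weight positive, at the cost of an iterative solve that can fail to converge when the constraints are hard to satisfy.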

Consistency and asymptotic normality of β˜w then follow from asymptotic results for the joint vector

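$$\bar V_T^{-1/2}\left[M(\beta_0)^{-1/2}H(\beta_0)(\tilde\beta_w-\beta_0),\ \sqrt{N}\,\dot\Phi(\lambda_0)(\hat\lambda-\lambda_0)\right],$$

where M(β) and H(β) are as in Section 4.2, $\dot\Phi(\lambda)=E[\dot F(\lambda^{T}\tilde X)^{T}\tilde X]$ and $\bar V_T$ is defined in the Web Appendix. We propose the following robust variance estimator of the calibrated coefficients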

Var˜[β˜w]=H˜1M˜1/2V˜T1/2V˜cV˜T1/2M˜1/2H˜1=H˜1[M˜1/2V˜TM˜1/2+˜2V˜Φ˜2T2˜2C˜T21]H˜1,

where

$$\tilde H=K^{-1}\sum_{k=1}^{K}D_k(\tilde\beta_w)^{T}V_k(\tilde\beta_w)^{-1}\tilde W_k\,\mathrm{diag}\{R_k\}D_k(\tilde\beta_w)$$
Var˜[Vw,Km]=VI(β˜w)+VII(β^w),

and ˜2, V˜Φ, C˜T21, VI and VII are as defined in section C6 of the Web Appendix. Detailed technical arguments, together with the form of the asymptotic variance are also given in the Web Appendix.

6 |. SIMULATION STUDY

To evaluate small-sample operating characteristics of the estimators proposed in Sections 4 and 5 we conducted a comprehensive simulation study. Here we provide detail on the structure of the simulation as well as select key results; additional results are provided in the Web Appendix.

6.1 |. Data generation

Throughout the simulation we considered the relationships between a binary response variable, Y, and four covariates, specified via the following logistic regression:

$$\mathrm{logit}\,E[Y_{ik}]=\eta_{i,M}^{k}=\beta_0+\beta_{X1}X_{1ik}+\beta_{X2}X_{2k}+\beta_{Z1}Z_{1ik}+\beta_{Z2}Z_{2k}. \tag{12}$$

In this model $X_{1ik}$ was a participant-specific binary covariate generated as a Bernoulli random variable with cluster-specific probabilities, $p_{X1k}=P(X_{1ik}=1)$, taken as a random draw from a Uniform(0.2, 0.5) distribution. This covariate could correspond, for example, to a patient's co-morbidity status, with variation in the $p_{X1k}$ indicating variation in prevalence across clinics. The second covariate, $X_{2k}$, was a cluster-specific binary covariate with $P(X_{2k}=1)=p_{X2k}=0.15$ corresponding to some dichotomous feature of the clinic (e.g. whether or not it is a private clinic). The third and fourth variables, the participant-specific $Z_{1ik}$ and cluster-specific $Z_{2k}$, were generated as normal variables. In particular, $Z_1$ and $Z_2$ were generated as $\mathrm{Normal}(1+0.5X_{1ik}+X_{3ik},1)$ and $\mathrm{Normal}(1+0.5X_{2k}+X_{4k},1)$ random variables, respectively, where $X_{3ik}\sim\mathrm{Normal}(0,1)$ and $X_{4k}\sim\mathrm{Normal}(0,1)$ are auxiliary variables that, according to model (12), are conditionally independent of Y.

Given $(X_{1ik},X_{2k},Z_{1ik},Z_{2k})$, the binary response variable was generated via model (12) with $\beta_{X1}=\beta_{X2}=\beta_{Z1}=\beta_{Z2}=0.7$, and the intercept $\beta_0$ specified so that the overall proportion of cases in any given scenario was 0.10 (specifically $\beta_0=-4.85$). To induce correlation among the participants in a given cluster we used an approach developed for marginally-specified logistic-Normal models40 and implemented in the GenBinaryY() function in the MMLB package for R. Briefly, given the linear predictor for the marginal mean (i.e. $\eta_{i,M}^{k}$), the induced conditional linear predictor, denoted here by $\eta_{i,C}^{k}$, based on a mixed effects model with a single Normally-distributed random intercept, with mean zero and variance $\sigma_V^2$, can be obtained as the solution to a convolution equation (see expression (4) of Heagerty40). Given $\eta_{i,C}^{k}$ and a random intercept for the kth cluster, say $b_k$, the induced conditional mean is $\exp(\eta_{i,C}^{k}+b_k)/(1+\exp(\eta_{i,C}^{k}+b_k))$, which can then be used as a basis for generating the responses.
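The convolution step can be illustrated numerically. The sketch below is not the MMLB implementation; it is a minimal stand-in that assumes the convolution equation takes the form "marginal mean equals the average of the conditional mean over the random-intercept distribution", approximates that average with a trapezoidal rule, and finds the conditional linear predictor by bisection. All function names, grid sizes, and tolerances are choices made for this example, and the bracketing interval is only intended for moderate values of the marginal linear predictor.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def marginal_prob(delta, sigma, n_grid=2001, z_max=8.0):
    """E[expit(delta + sigma * Z)] for Z ~ Normal(0, 1), by the trapezoidal rule."""
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for m in range(n_grid):
        z = -z_max + m * h
        weight = 0.5 if m in (0, n_grid - 1) else 1.0
        total += weight * expit(delta + sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def conditional_predictor(eta_marginal, sigma, tol=1e-10):
    """Find delta such that marginal_prob(delta, sigma) equals expit(eta_marginal)."""
    target = expit(eta_marginal)
    half_width = 10.0 * (1.0 + sigma)
    lo, hi = eta_marginal - half_width, eta_marginal + half_width
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # marginal_prob is increasing in delta, so bisection applies.
        if marginal_prob(mid, sigma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_cluster(eta_marginals, sigma, rng):
    """Draw binary responses for one cluster sharing a single random intercept."""
    b_k = rng.gauss(0.0, sigma)
    return [
        1 if rng.random() < expit(conditional_predictor(eta, sigma) + b_k) else 0
        for eta in eta_marginals
    ]


delta = conditional_predictor(-2.0, 1.0)
y_cluster = simulate_cluster([-2.0, -1.5, -2.5], 1.0, random.Random(2019))
```

With no between-cluster variation the conditional and marginal linear predictors coincide; with positive variation and a marginal probability below one half, the conditional linear predictor is more negative than the marginal one, since averaging over the random intercept pulls the mean towards one half.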

Based on the above framework we considered seven data scenarios. The first, referred to henceforth as the ‘baseline’ scenario, set (K, Nk, σV, n)=(100, 200, 0.5, 1,000). The next six scenarios considered a modification of one of these components: #1 decreased σV to 0.0; #2 increased σV to 1.0; #3 increased K to 200; #4 increased n to 2,000; #5 decreased n to 500; and, #6 decreased K to 50.

6.2 |. Sampling schemes for the phase II data

For each of the data scenarios we generated M=5,000 ‘complete’ datasets. Consistent with the notation of Section 2, we assumed that (Y, X1ik, X2ik) was available for all participants at phase I but that (Z1ik,Z2ik) was not. To obtain complete data on a sub-sample of n participants we considered five two-phase designs that differed in the information used to form the phase I strata: (i) simple random sampling, so that no information was used at phase I; (ii) case-control sampling, so that only Y was used; (iii) two-phase sampling on the basis of Y × X1; (iv) two-phase sampling on the basis of Y × X2; and, (v) two-phase sampling on the basis of Y × X5, where the variable X5 was a surrogate for Z2, specifically an indicator of the 80% quantile of its distribution. Note, under the five designs the number of phase I strata are J =1, 2, 4, 4 and 4, respectively. For any given design, we assume the phase II samples are obtained via random sampling within each phase I stratum with balanced sampling across the strata.

6.3 |. Analyses

For each of the 25,000 simulated dataset/design combinations we performed two analyses based on the IPW and calibrated IPW methods in Sections 4 and 5, respectively. For the calibrated IPW analyses, as described in Section 5.2, we first imputed values of (Z1ik,Z2k) for those participants with ‘missing’ data (i.e. those not selected at phase II) via two linear regression models of the form E[Z1ik]=b01+b11X1ik+b21X3ik and E[Z2k]=b02+b12X2k+b22X4k, respectively, each with Normally distributed errors. For Z1ik, the parameters of the imputation model were estimated via IPW based on the inclusion probabilities from the design. For Z2k, the corresponding parameters were estimated via IPW based on the corresponding cluster-level inclusion probabilities (see Section 5.3). Last, for the baseline scenario only, we conducted a calibrated IPW analysis based on carrying forward the values of Z2k for those clusters represented in the phase II sample and imputing the remaining ones (see Section 5.3); with a slight abuse of terminology we labelled these analyses as Scenario #7.
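The weight-adjustment step of a calibrated analysis can be sketched as follows. This is a generic linear (GREG-type) calibration with two auxiliary columns, offered only as an assumption-laden illustration: in the paper the auxiliaries are influence functions from a model fitted to the imputed phase I data, and the distance function used there may differ. The function name and the numbers in the usage example are hypothetical.

```python
def linear_calibration(design_weights, aux, phase1_totals):
    """Return calibrated weights w_i = d_i * (1 + a_i' lam).

    design_weights: design weights d_i for the phase II units.
    aux: (a1, a2) auxiliary values per phase II unit (e.g. a constant and one variable).
    phase1_totals: known phase I totals (T1, T2) of the two auxiliaries.
    lam solves sum_i d_i a_i a_i' lam = T - sum_i d_i a_i, so that the
    calibrated weights reproduce the phase I totals exactly.
    """
    pairs = list(zip(design_weights, aux))
    m11 = sum(d * a[0] * a[0] for d, a in pairs)
    m12 = sum(d * a[0] * a[1] for d, a in pairs)
    m22 = sum(d * a[1] * a[1] for d, a in pairs)
    r1 = phase1_totals[0] - sum(d * a[0] for d, a in pairs)
    r2 = phase1_totals[1] - sum(d * a[1] for d, a in pairs)
    det = m11 * m22 - m12 * m12
    if det == 0:
        raise ValueError("auxiliary variables are collinear in the phase II sample")
    lam1 = (m22 * r1 - m12 * r2) / det
    lam2 = (m11 * r2 - m12 * r1) / det
    return [d * (1.0 + lam1 * a[0] + lam2 * a[1]) for d, a in pairs]

# Hypothetical example: five sampled units, each representing ten participants.
d = [10.0] * 5
a = [(1.0, 0.2), (1.0, 0.5), (1.0, 0.1), (1.0, 0.9), (1.0, 0.4)]
w = linear_calibration(d, a, (50.0, 24.0))
```

After calibration the weighted totals of both auxiliaries in the phase II sample match the supplied phase I totals, which is the property that drives the efficiency gain when the auxiliaries are strongly related to the estimating function.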

For all analyses, naïve standard errors that solely consider the design (i.e. ignore the potential for cluster-correlation) were computed as were those based on the proposed estimators of the asymptotic variance (see Web Appendices C and D). Finally, to investigate the efficiency properties of the five designs, we also performed complete data analyses (i.e. based on all Nk=200 participants within each cluster).

6.4 |. Results: parameter and standard error estimation

Table 2 presents results for the baseline scenario. From the first two columns, we see that both estimators exhibit little-to-no small-sample bias for all parameters. These observations hold for all scenarios (see Web Appendix Tables WA-1-B through WA-1–7).

TABLE 2.

Small-sample operating characteristics, based on M=5,000 simulated datasets, for the standard IPW and calibrated IPW estimators of the parameters in model (12) under the baseline simulation scenario. Shown are: the mean of the point estimates; coverage probabilities for Wald-based 95% confidence intervals using: (i) naïve standard error estimates that ignore cluster-correlation, and (ii) the proposed robust standard error estimators that account for both the design and clustering; and, relative uncertainty defined as the ratio of the empirical standard error for the estimator to that of the complete data estimator.

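| Coefficient | Phase I design | Mean point estimate: IPW | Mean point estimate: Cal-IPW | Coverage, naïve SE: IPW | Coverage, robust SE: IPW | Coverage, robust SE: Cal-IPW | Relative uncertainty: IPW | Relative uncertainty: Cal-IPW |
|---|---|---|---|---|---|---|---|---|
| β0 | None | −4.87 | −4.87 | 0.93 | 0.94 | 0.94 | 3.15 | 3.17 |
| β0 | Y | −4.85 | −4.85 | 0.92 | 0.94 | 0.94 | 2.10 | 2.14 |
| β0 | Y × X1 | −4.84 | −4.84 | 0.92 | 0.94 | 0.94 | 1.95 | 2.01 |
| β0 | Y × X2 | −4.85 | −4.85 | 0.92 | 0.94 | 0.94 | 2.25 | 2.31 |
| β0 | Y × X5 | −4.84 | −4.84 | 0.92 | 0.94 | 0.95 | 1.84 | 1.88 |
| βZ1 | None | 0.70 | 0.71 | 0.95 | 0.95 | 0.95 | 4.48 | 3.99 |
| βZ1 | Y | 0.70 | 0.70 | 0.94 | 0.95 | 0.94 | 3.41 | 3.07 |
| βZ1 | Y × X1 | 0.70 | 0.70 | 0.94 | 0.95 | 0.95 | 3.34 | 2.95 |
| βZ1 | Y × X2 | 0.71 | 0.71 | 0.94 | 0.94 | 0.94 | 3.95 | 3.51 |
| βZ1 | Y × X5 | 0.71 | 0.70 | 0.94 | 0.95 | 0.95 | 3.38 | 2.98 |
| βX1 | None | 0.70 | 0.71 | 0.94 | 0.95 | 0.95 | 4.17 | 2.19 |
| βX1 | Y | 0.70 | 0.70 | 0.94 | 0.95 | 0.94 | 3.07 | 1.98 |
| βX1 | Y × X1 | 0.69 | 0.69 | 0.94 | 0.95 | 0.95 | 2.18 | 1.90 |
| βX1 | Y × X2 | 0.70 | 0.70 | 0.94 | 0.95 | 0.95 | 3.56 | 2.17 |
| βX1 | Y × X5 | 0.70 | 0.70 | 0.95 | 0.95 | 0.96 | 2.92 | 1.90 |
| βZ2 | None | 0.70 | 0.70 | 0.91 | 0.94 | 0.93 | 2.16 | 1.86 |
| βZ2 | Y | 0.71 | 0.70 | 0.89 | 0.93 | 0.93 | 1.80 | 1.56 |
| βZ2 | Y × X1 | 0.71 | 0.70 | 0.89 | 0.93 | 0.93 | 1.77 | 1.52 |
| βZ2 | Y × X2 | 0.71 | 0.71 | 0.89 | 0.93 | 0.94 | 1.94 | 1.66 |
| βZ2 | Y × X5 | 0.70 | 0.70 | 0.84 | 0.94 | 0.95 | 1.34 | 1.33 |
| βX2 | None | 0.69 | 0.71 | 0.91 | 0.94 | 0.95 | 2.12 | 1.26 |
| βX2 | Y | 0.71 | 0.71 | 0.89 | 0.94 | 0.94 | 1.69 | 1.25 |
| βX2 | Y × X1 | 0.71 | 0.70 | 0.89 | 0.94 | 0.93 | 1.67 | 1.24 |
| βX2 | Y × X2 | 0.68 | 0.69 | 0.78 | 0.94 | 0.94 | 1.21 | 1.14 |
| βX2 | Y × X5 | 0.71 | 0.70 | 0.89 | 0.94 | 0.95 | 1.71 | 1.27 |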

From the third column, we see that Wald-based 95% confidence intervals based on the naïve standard error estimator generally have coverage that is lower than the nominal 0.95 for βZ2 and βX2, the parameters that correspond to the two cluster-specific covariates. In comparison, 95% confidence intervals based on the proposed robust estimators generally achieve the nominal coverage rate (see columns 4 and 5). Note that the slightly lower-than-nominal coverage for βZ2 is primarily a function of K; Web Appendix Tables WA-2-B through WA-2–7 report that the coverage for this parameter achieves the nominal rate when K is increased to 200. Finally, these results are mirrored when one evaluates the standard error estimates by comparing their mean with the empirical standard error (i.e. the standard deviation of the M=5,000 point estimates); see Tables WA-1-B through WA-1–7 in the Web Appendix.

6.5 |. Results: statistical efficiency

The final two columns of Table 2 report on statistical efficiency and its interplay with the choice of phase I stratification. Values shown are relative uncertainty, defined as the ratio of the empirical standard error for a given estimator to that of the complete data GEE estimator. Thus, the ratio reflects the relative magnitude of the widths of 95% confidence intervals for the estimator compared to that for the complete data GEE estimator.
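The two simulation summaries reported in Table 2, coverage of Wald-based 95% confidence intervals and relative uncertainty, can be computed from replicate-level output along the following lines. This is a minimal sketch; the function names and the toy inputs are illustrative assumptions rather than the code used for the study.

```python
import statistics

def wald_coverage(estimates, std_errors, truth, z=1.959964):
    """Proportion of replicates whose Wald interval est +/- z * se contains the truth."""
    hits = sum(1 for est, se in zip(estimates, std_errors)
               if est - z * se <= truth <= est + z * se)
    return hits / len(estimates)

def relative_uncertainty(estimates, complete_data_estimates):
    """Ratio of the empirical standard error of an estimator to that of the
    complete data estimator, each taken as the standard deviation across replicates."""
    return statistics.stdev(estimates) / statistics.stdev(complete_data_estimates)

# Toy replicate-level output (hypothetical values).
coverage = wald_coverage([0.6, 0.7, 0.8, 1.2], [0.1, 0.1, 0.1, 0.1], truth=0.7)
ratio = relative_uncertainty([0.0, 2.0, 4.0], [0.0, 1.0, 2.0])
```

Supplying naïve or robust standard errors to `wald_coverage` gives the corresponding coverage columns, and a ratio above 1.0 from `relative_uncertainty` indicates a loss of precision relative to the complete data analysis.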

Not surprisingly, compared to simple random sampling, each of the four designs that stratify on the outcome exhibits generally improved, or no worse, relative uncertainty (i.e. has ratios closer to 1.0). For the three two-phase designs that stratify on covariates X1, X2 and X5 there is further improvement in relative uncertainty for the corresponding parameter. For example, the relative uncertainty for βX1 is 2.18 for the standard IPW estimator under the two-phase design that stratified jointly on Y × X1; this is substantially smaller than 3.07 for the same estimator under the case-control design. Similarly, the relative uncertainty for the same estimator of βX2 under the case-control design is 1.69 while it is 1.21 under the two-phase design that stratified jointly on Y × X2.

Comparing the standard and calibrated IPW estimators we see that, with the exception of the intercept, the latter has relative uncertainty closer to 1.0 (sometimes substantially so) and is thus more efficient than the former. For the intercept, it seems that there is a very small loss of efficiency (i.e. an increase of between 0.6% and 3.0% in the standard error). In theory there should be no asymptotic efficiency gains or losses for the intercept due to calibration. Thus, we believe that these differences are likely due to a combination of the set-up of the simulation not reflecting asymptotia and Monte Carlo error. Finally, we note that results regarding statistical efficiency for Scenario #7 did not vary from those based on the baseline scenario (see Web Appendix Table WA-3–7). For this scenario the phase II samples typically represented 90–95 of the K=100 clusters, so that carrying forward most of the Z2k values was a reasonable strategy.

7 |. THE MALAWI STUDY

Here we provide a concrete illustration of how the methods proposed in this paper might be used in practice. In particular, we consider a hypothetical investigation of the association between private/public clinic type and risk of a negative status at six months post-registration among N=82,887 patients registered at one of K=189 clinics in Malawi between 2005 and 2007; see Section 2. More specifically, we suppose that interest lies in whether differences between private and public clinics changed between 2005 and 2007 (i.e. whether or not there is an interaction between clinic type and year of registration).

7.1 |. A hypothetical two-phase study

In considering the association between private/public clinic status and the outcome, we assume that adjustment for patient age and gender will be necessary. Since, as described in the Introduction, individual-level data on these characteristics will typically not be available to analysts at the Malawian Ministry of Health, we consider a hypothetical two-phase design. Towards this, note that the outcome, private/public clinic status and year of registration would be readily-available on all N=82,887 patients; see Haneuse et al3 for additional details. Thus, in principle, a hypothetical two-phase study could adopt either of the phase I stratifications given in Table 1. To illustrate efficiency gains provided through calibration, we suppose that Design #1 is adopted.

At phase II we assume that sufficient resources exist for up to n=2,000 manual chart reviews, through which information on age and gender could be abstracted. Considering the counts provided in Table 1, we ‘sampled’ all 278 patients who registered at a private clinic and who had a negative status at six months post-registration together with a random sample of 500 patients from each of the other phase I strata. Then, for each of the n=1,778 patients sampled at phase II, we ‘recorded’ their age and gender.

Using the data from this two-phase design, we fit a logistic regression model to the binary outcome status with the following covariates: age, gender, clinic type, year of registration and an interaction between clinic type and year of registration. For both the main effects and the interaction term, year of registration (2005, 2006, or 2007) was coded via two dummy variables with 2005 as the referent. Estimates were obtained via the two IPW approaches proposed in Sections 4.1 and 5. For the latter, we imputed age and gender using normal linear and logistic regression models, respectively, with clinic type and year of registration as explanatory variables, and subsequently calibrated using the influence functions. To illustrate the generality of the methods, beyond the logit link function, we performed a parallel set of analyses with g(·) taken to be the log link function.

7.2 |. Results

Table 3 presents results, specifically point estimates for the odds ratio and relative risk associations, depending on the link function g(·), as well as 95% confidence intervals based on naïve and robust standard error estimates for both analyses. Substantively, we find that the point estimates for the IPW and calibrated IPW analyses are generally very similar, with the results for the main effect of clinic type suggesting that patients registered at a private clinic in 2005 had substantially lower odds (0.17–0.18) or risk (0.29–0.31) of a negative status at six months compared to patients registered at a public clinic in 2005. The point estimates for the two interaction terms suggest that this difference persisted in 2006 and 2007 but was not as large. For example, based on the standard IPW analysis, the odds of a negative status at six months for patients registered in a private clinic in 2007 are estimated to be 54% lower than those for patients registered at a public clinic in 2007 (0.18 × 2.58 ≈ 0.46).
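The year-specific private versus public odds ratios follow from multiplying the main-effect estimate by the relevant interaction estimate. Using the rounded standard IPW values reported in Table 3 (so the products are themselves approximate), the calculation can be written as:

```latex
\begin{align*}
\widehat{\mathrm{OR}}_{2005} &= 0.18, \\
\widehat{\mathrm{OR}}_{2006} &= 0.18 \times 4.17 \approx 0.75, \\
\widehat{\mathrm{OR}}_{2007} &= 0.18 \times 2.58 \approx 0.46,
\end{align*}
```

where each quantity compares private with public clinics among patients registered in the indicated year. The 2007 value corresponds to the 54% reduction in odds quoted above (1 − 0.46 = 0.54).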

TABLE 3.

Results from a hypothetical two-phase study investigating the relationship between private/public clinic status and a negative status at six months post-registration among patients enrolled in the Malawian national antiretroviral treatment program between 2005 and 2007. Shown are estimated odds ratios (OR) and relative risks (RR), based on the logit and log links, and 95% confidence intervals (CI) based on: (i) naïve standard error estimates that ignore cluster-correlation, and (ii) the proposed robust standard error estimators that account for both the design and clustering.

Standard IPW

| Covariate | OR (logit link) | Naïve 95% CI | Robust 95% CI | RR (log link) | Naïve 95% CI | Robust 95% CI |
|---|---|---|---|---|---|---|
| Age, years | 0.97 | (0.81, 1.14) | (0.80, 1.15) | 0.87 | (0.78, 0.96) | (0.78, 0.97) |
| Gender: female | 0.75 | (0.52, 0.98) | (0.49, 1.01) | 0.75 | (0.62, 0.91) | (0.61, 0.92) |
| Clinic type: private | 0.18 | (0.05, 0.32) | (0.06, 0.31) | 0.29 | (0.17, 0.50) | (0.18, 0.48) |
| Year of registration: 2006 | 0.60 | (0.34, 0.87) | (0.32, 0.89) | 0.93 | (0.72, 1.21) | (0.73, 1.19) |
| Year of registration: 2007 | 0.49 | (0.28, 0.70) | (0.28, 0.69) | 0.83 | (0.65, 1.07) | (0.66, 1.05) |
| Private × 2006 interaction | 4.17 | (0.72, 7.63) | (1.26, 7.08) | 2.57 | (1.44, 4.60) | (1.44, 4.61) |
| Private × 2007 interaction | 2.58 | (0.41, 4.76) | (0.61, 4.55) | 1.75 | (0.96, 3.20) | (0.99, 3.11) |

Calibrated IPW

| Covariate | OR (logit link) | Naïve 95% CI | Robust 95% CI | RR (log link) | Naïve 95% CI | Robust 95% CI |
|---|---|---|---|---|---|---|
| Age, years | 0.97 | (0.80, 1.14) | (0.78, 1.16) | 0.87 | (0.78, 0.97) | (0.78, 0.97) |
| Gender: female | 0.75 | (0.50, 1.00) | (0.48, 1.02) | 0.75 | (0.62, 0.91) | (0.61, 0.92) |
| Clinic type: private | 0.17 | (0.07, 0.27) | (0.08, 0.27) | 0.31 | (0.19, 0.51) | (0.21, 0.48) |
| Year of registration: 2006 | 0.67 | (0.43, 0.91) | (0.40, 0.93) | 0.97 | (0.81, 1.16) | (0.82, 1.14) |
| Year of registration: 2007 | 0.52 | (0.33, 0.71) | (0.33, 0.71) | 0.79 | (0.66, 0.95) | (0.68, 0.92) |
| Private × 2006 interaction | 4.04 | (1.60, 6.48) | (2.58, 5.50) | 2.40 | (1.45, 3.95) | (1.47, 3.89) |
| Private × 2007 interaction | 2.61 | (1.00, 4.21) | (1.21, 4.00) | 1.65 | (0.97, 2.79) | (1.02, 2.66) |

From an inferential perspective, we find that the conclusions one draws regarding the two interaction terms differ depending on whether one uses the naïve or robust standard errors. For example, the naïve 95% confidence intervals under the standard IPW analyses with the logit link both include 1.0, whereas the robust 95% confidence interval for the 2006 interaction excludes 1.0. Furthermore, calibrating to the influence functions yields substantial efficiency gains, with the robust 95% confidence intervals for both interaction terms being shorter than their counterparts based on standard IPW and both excluding 1.0. As a formal evaluation of the interaction between clinic type and year of registration, we note that the p-values for a Wald-type test for the two terms in the model (i.e. with 2 degrees of freedom) based on the robust variance estimator were < 0.001 for all four model specifications/fits.
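A two-degree-of-freedom Wald-type test of this kind can be sketched as follows. The function name is illustrative, and the covariance matrix in the usage example is hypothetical because the robust covariance between the two interaction estimates is not reported here; only the log odds ratios are taken from Table 3.

```python
import math

def wald_test_2df(beta, cov):
    """Wald statistic beta' V^{-1} beta and its chi-square (2 df) p-value.

    beta: the two interaction coefficients on the link (log) scale.
    cov: their 2 x 2 robust covariance matrix, given as nested pairs.
    """
    (v11, v12), (v21, v22) = cov
    det = v11 * v22 - v12 * v21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    b1, b2 = beta
    stat = (v22 * b1 * b1 - (v12 + v21) * b1 * b2 + v11 * b2 * b2) / det
    # The survival function of a chi-square variable with 2 df is exp(-x / 2).
    p_value = math.exp(-0.5 * stat)
    return stat, p_value

# Log odds ratios from Table 3 (standard IPW); the covariance is hypothetical.
beta = (math.log(4.17), math.log(2.58))
stat, p_value = wald_test_2df(beta, ((0.20, 0.08), (0.08, 0.25)))
```

The closed-form p-value is specific to 2 degrees of freedom; a test involving more terms would need a general chi-square survival function.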

8 |. DISCUSSION

The recent statistical literature has shown that there is substantial interest in making optimal use of readily available auxiliary information that might otherwise not be used41,42. Within this vein, in this paper we propose IPW and calibrated IPW estimation and inference for regression parameters from a marginal mean model for data arising from a two-phase design in clustered data settings. As demonstrated by extensive simulations, estimates of regression parameters exhibit little-to-no finite-sample bias in a broad range of scenarios and inference based on the proposed standard error estimates is generally valid. Furthermore, the simulation studies illustrate the potential for efficiency gains associated with calibration. We also note that, from a broader perspective, while this paper focuses on accounting for the non-random sampling inherent to the two-phase design, analysts using the proposed methods will still need to contend with all of the usual considerations in GEE-based analyses. In particular, it is well-known that in some settings caution is needed when specifying the working correlation structure if bias is to be avoided43,44.

We conclude with a number of directions for future work. First, while the potential for efficiency gains through calibration, particularly to the influence functions of the mean model, is well-established for independent data settings35,36, we are not aware of a formal justification for correlated data settings. Nevertheless, the simulations we present provide compelling evidence that such efficiency gains are possible and can be substantial. Second, since the extent of the efficiency gains obtained through calibration depends on the quality of the imputation for those variables only available at phase II, novel strategies may be needed to make full use of the phase I information. One area where this will be particularly useful is in applying the ideas of this paper to ecological studies where the phase II data can be used to overcome ecological bias3,45. Since these studies typically only involve aggregated or group-level information at phase I, however, novel strategies are needed to develop reasonable imputation models for individual-level covariates. Third, although we present large sample results in the setting where K → ∞ and max{Nk; k = 1,… ,K} is bounded above ∀ K, Xie and Yang32 consider, in the complete data setting, large sample results when K is bounded and Nk → ∞ as well as when both K and Nk → ∞. Consideration of these large sample settings for the two-phase context would provide a complete treatment of the asymptotics. Fourth, while we have focused on IPW, additional efficiency gains may be obtained by adapting existing pseudo-likelihood, maximum likelihood or quasi-least squares46 methods for two-phase designs to the cluster-correlated data setting. Fifth, while the developments in this paper focus on instances where working independence is adopted, efficiency gains may be obtained (for select types of covariates, at least) through consideration of alternative working correlation structures. Towards this, recent work by Chen and Westgate47 may be useful as a means to enhance efficiency while ensuring unbiasedness of the estimating equations used. Sixth, an important practical area for future work is that of design considerations, specifically how one allocates finite resources at phase II. Towards this, adapting recent work by McIsaac and Cook48,49,50 on optimal design in two-phase studies where the sampling fractions are specified at the level of the cluster to settings where sampling fractions are at the level of the individual (i.e. the case considered in this paper) may be a reasonable strategy. Finally, although this paper focuses on marginal mean models, substantive interest often lies in conditional models51. To the best of our knowledge, however, methods for these models in the two-phase context have not been developed.

Supplementary Material

Supp info1
supp info2

9 |. ACKNOWLEDGEMENTS

Drs. Rivera and Spiegelman were supported by NIH grant DP1 ES025459. Dr. Haneuse was supported by NIH grant R01 HL094786 and Harvard University Center for AIDS Research Feasibility Research grant P03 A106054.

Footnotes

SUPPORTING INFORMATION

Additional Supporting Information may be found online in the supporting information tab for this article.

References

1. Breslow N, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. JRSS: Series C. 1999;48:457–468.
2. Lowrance D, Filler S, Makombe S, et al. Assessment of a national monitoring and evaluation system for rapid expansion of antiretroviral treatment in Malawi. Tropical Medicine & International Health. 2007;12(3):377–381.
3. Haneuse Sebastien, Hedt-Gauthier Bethany, Chimbwandira Frank, Makombe Simon, Tenthani Lyson, Jahn Andreas. Strategies for monitoring and evaluation of resource-limited national antiretroviral therapy programs: the two-phase design. BMC Med. Res. Meth. 2015;15(1):31.
4. Gilks Charles F, Crowley Siobhan, Ekpini Rene, et al. The WHO public-health approach to antiretroviral treatment against HIV in resource-limited settings. The Lancet. 2006;368(9534):505–510.
5. Wakefield J. Ecologic studies revisited. Annual Review of Public Health. 2008;29:75–90.
6. White J. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115(1):119–128.
7. Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society: Series A. 2004;167(3):385–445.
8. Breslow Norman E, Day Nicholas E, others. Statistical methods in cancer research. Vol. 1. The analysis of case-control studies. Distributed for IARC by WHO, Geneva, Switzerland; 1980.
9. Flanders W, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine. 1991;10(5):739–747.
10. Saegusa T, Wellner T. Weighted likelihood estimation under two-phase sampling. The Annals of Statistics. 2013;41(1):269–295.
11. Breslow N, Cain K. Logistic Regression for Two-Stage Case-Control Data. Biometrika. 1988;75(1):11–20.
12. Schill W, Jockel K, Drescher K, Timm J. Logistic analysis in case-control studies under validation sampling. Biometrika. 1993;80(2):339–352.
13. Scott Alastair J, Wild Chris J. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84(1):57–71.
14. Breslow N, Holubkov R. Maximum Likelihood Estimation of Logistic Regression Parameters Under Two-Phase, Outcome-Dependent Sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1997;59(2):447–461.
15. Diggle Peter J, Heagerty Patrick J, Liang K-Y, Zeger Scott L. Analysis of Longitudinal Data. Oxford University Press; second ed.; 2002.
16. Schildcrout Jonathan S, Heagerty Patrick J. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9(4):735–749.
17. Schildcrout Jonathan S, Mumford Sunni L, Chen Zhen, Heagerty Patrick J, Rathouz Paul J. Outcome-dependent sampling for longitudinal binary response data based on a time-varying auxiliary variable. Statistics in Medicine. 2012;31(22):2441–2456.
18. Schildcrout Jonathan S, Garbett Shawn P, Heagerty Patrick J. Outcome vector dependent sampling with longitudinal continuous response data: stratified sampling based on summary statistics. Biometrics. 2013;69(2):405–416.
19. Neuhaus John M, Jewell Nicholas P. The effect of retrospective sampling on binary regression models for clustered data. Biometrics. 1990;46(4):977–990.
20. Neuhaus J, Scott AJ, Wild CJ. The analysis of retrospective family studies. Biometrika. 2002;89(1):23–37.
21. Cai J, Qaqish B, Zhou H. Marginal Analysis for Cluster-Based Case-Control Studies. The Indian Journal of Statistics, Series B. 2001;63(3):326–337.
22. Robins James M, Rotnitzky Andrea, Zhao Lue Ping. Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association. 1995;90(429):106–121.
23. Sarndal CE, Swensson B, Wretman J. Model Assisted Survey Sampling. Springer Series in Statistics; 1992.
24. Deville JC, Sarndal CE. Calibration estimators in survey sampling. Journal of the American Statistical Association. 1992;87:376–382.
25. Breslow N, Lumley T, Ballantyne C, Chambless. Using the Whole Cohort in the Analysis of Case-Cohort Data. American Journal of Epidemiology. 2009;169(11):1398–1405.
26. Breslow N, Amorim G, Pettinger M, Rossouw J. Using the Whole Cohort in the Analysis of Case-Control Data. Statistics in Biosciences. 2013;5(2):232–249.
27. Rivera C, Lumley T. Using the whole cohort in the analysis of countermatched samples. Biometrics. 2016;72(2):382–391.
28. Liang K, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
29. Prentice R. Correlated binary regression with covariates specific to each binary observation. Biometrics. 1988;44(4):1033–1048.
30. Zhao Lue Ping, Prentice Ross L. Correlated binary regression using a quadratic exponential model. Biometrika. 1990;77(3):642–648.
31. Prentice Ross L, Zhao Lue Ping. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics. 1991;:825–839.
32. Minge Xie, Yaning Yang. Asymptotics for generalized estimating equations with large cluster sizes. The Annals of Statistics. 2003;31(1):310–347.
33. Pepe Margaret Sullivan, Anderson Garnet L. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23(4):939–951.
34. Stijn Vansteelandt. On Confounding, Prediction and Efficiency in the Analysis of Longitudinal and Cross-sectional Clustered Data. Scandinavian Journal of Statistics. 2007;34(3):478–498.
35. Breslow N, Lumley T, Ballantyne C, Chambless L, Kulich M. Improved Horvitz-Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology. Statistics in Biosciences. 2009;1(1):32–49.
36. Lumley T, Shaw P, Dai J. Connections between Survey Calibration Estimators and Semiparametric Models for Incomplete Data. Int. Stat. Rev. 2011;79(2):200–220.
37. Seaman Shaun R, White Ian R. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research. 2013;22(3):278–295.
38. Keogh Ruth H, White Ian R. Using full cohort data in nested case-control and case-cohort studies by multiple imputation. Statistics in Medicine. 2013;32(23):4021–4043.
39. Pierce D. The Asymptotic Effect of Substituting Estimators for Parameters in Certain Types of Statistics. The Annals of Statistics. 1982;10(2):475–478.
40. Heagerty Patrick J. Marginally specified logistic-normal models for longitudinal binary data. Biometrics. 1999;55(3):688–698.
41. Qin J, Zhang H, Li P, Albanes D, Yu K. Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika. 2015;102(1):169–180.
42. Chatterjee N, Chen Y-H, Maas P, Carroll RJ. Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level Information From External Big Data Sources. Journal of the American Statistical Association. 2016;111(513):107–117.
43. Margaret Pepe, Anderson Garnet L. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23(4):939–951.
44. Stijn Vansteelandt. On confounding, prediction and efficiency in the analysis of longitudinal and cross-sectional clustered data. Scandinavian Journal of Statistics. 2007;34(3):478–498.
45. Wakefield J, Haneuse S. Overcoming ecologic bias using the two-phase study design. American Journal of Epidemiology. 2008;167(8):908–916.
46. Justine Shults, Hilbe Joseph M. Quasi-least squares regression. Chapman and Hall/CRC; 2014.
47. Chen I-Chen, Westgate Philip M. Improved methods for the marginal analysis of longitudinal data in the presence of time-dependent covariates. Statistics in Medicine. 2017;36(16):2533–2546.
48. McIsaac Michael A, Cook Richard J. Response-dependent sampling with clustered and longitudinal data. In: Springer; 2013. (pp. 157–181).
49. McIsaac Michael A, Cook Richard J. Statistical models and methods for incomplete data in randomized clinical trials. In: Springer; 2014. (pp. 1–27).
50. McIsaac Michael A, Cook Richard J. Adaptive sampling in two-phase designs: a biomarker study for progression in arthritis. Statistics in Medicine. 2015;34(21):2899–2912.
51. Schildcrout Jonathan S, Haneuse Sebastien, Peterson Josh F, et al. Analyses of longitudinal, hospital clinical laboratory data with application to blood glucose concentrations. Statistics in Medicine. 2011;30(27):3208–3220.
