Author manuscript; available in PMC: 2011 Jun 1.
Published in final edited form as: Biometrics. 2009 Aug 10;66(2):365–373. doi: 10.1111/j.1541-0420.2009.01306.x

Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis

Jonathan S Schildcrout 1, Paul J Rathouz 2
PMCID: PMC3051172  NIHMSID: NIHMS138788  PMID: 19673861

SUMMARY

We discuss design and analysis of longitudinal studies after case-control sampling, wherein interest is in the relationship between a longitudinal binary response that is related to the sampling (case-control) variable, and a set of covariates. We propose a semiparametric modelling framework based on a marginal longitudinal binary response model and an ancillary model for subjects’ case-control status. In this approach, the analyst must posit the population prevalence of being a case, which is then used to compute an offset term in the ancillary model. Parameter estimates from this model are used to compute offsets for the longitudinal response model. Examining the impact of population prevalence and ancillary model misspecification, we show that time-invariant covariate parameter estimates, other than the intercept, are reasonably robust, but intercept and time-varying covariate parameter estimates can be sensitive to such misspecification. We study design and analysis issues impacting study efficiency, namely: choice of sampling variable and the strength of its relationship to the response, sample stratification, choice of working covariance weighting, and degree of flexibility of the ancillary model. The research is motivated by a longitudinal study following case-control sampling of the time course of ADHD symptoms.

Keywords: Bias, binary data, efficiency, Generalized Estimating Equations, longitudinal data, logistic regression, outcome dependent sampling

1. Introduction

The Attention Deficit Hyperactivity Disorder (ADHD) Study (Lahey, 1998; Hartung et al., 2002) is a longitudinal study of 255 children that seeks to identify risk and prognostic factors in early childhood for ADHD symptoms, diagnoses, and functional outcomes across childhood, adolescence and early adulthood. In this paper we model ADHD prevalence as a function of time and baseline predictors in the first eight waves of data (including baseline). One hundred thirty-eight children who were referred to one of two participating clinics due to parent or teacher suspicion of ADHD symptom exhibition were enrolled in the study, as was a demographically and socioeconomically similar group of 117 non-referred children. All participants were followed over seven annual visits after baseline. Assessment of ADHD symptoms was made at each visit using the Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association, 1994) criteria, and these assessments were used to generate at each wave a diagnosis of ADHD in the previous six months. While participant referral was a strong predictor of ADHD symptom level, particularly at the first (baseline) visit, the relationship was not deterministic, and some referred subjects did not meet criteria for ADHD at baseline. Conversely, some non-referred participants exhibited symptoms and at times met diagnostic criteria for ADHD.

Because referred and non-referred participants are at high and low risk, respectively, for expressing symptoms during followup, the ADHD study design allows researchers to observe substantial response variation and thereby to potentially estimate many target regression effects efficiently. Because the sampling scheme is biased, however, standard longitudinal data analysis methods do not apply. In this manuscript, we discuss analytical strategies and design considerations when such “case-control” (e.g., referred and non-referred) sampling is followed by longitudinal followup on a binary response related to case-control status at baseline. Similar biased sampling in longitudinal studies has been used elsewhere (e.g., Lahey et al., 1999), and the methods described herein would apply in those settings as well.

To formalize the problem, assume interest lies in the longitudinal marginal relationship E(Yi | Xi) where i indexes subjects in a population, Yi is a binary vector of responses on the ith subject, and Xi is a design matrix containing predictor and adjustment variables of interest. Subjects are sampled from the population into the study with probability that depends on a univariate case-control or sampling variable Zi which is related to Yi, or possibly on (Zi, X1i), where X1i is contained in Xi. The analytic goal is to make inferences on the marginal mean E(Yi | Xi). Though case-control sampling is in the general sense a stratified design, for the purpose of this paper, we refer to sampling based on (Zi, X1i) as stratified sampling (dropping the case-control designation for ease of exposition only), and we refer to sampling based only on Zi as case-control sampling.

The case-control design we consider is a specific instance of what is more generally termed outcome dependent sampling (ODS). The majority of such ODS designs, including those pertaining to longitudinal and correlated data, require explicit acknowledgment of non-equal probability of participant ascertainment in the analysis. Neuhaus and Jewell (1990) and Qaqish et al. (1997) discuss the implications of ignoring the ODS design with cluster-based sampling for correlated data, and Neuhaus and Jewell propose subject-specific conditional logistic regression models when sampling is based upon binary response vector sums. Similarly, Schildcrout and Heagerty (2008) describe sampling based on the presence/absence of binary response series variation and propose conditional maximum likelihood analyses for marginal models. Case-control family studies, an alternative design, sample on a single component (the proband) of a cluster rather than on a summary of the entire cluster-level response vector. Whittemore (1995), Zhao et al (1998), and Neuhaus, Scott, and Wild (2002) approach case-control family studies via marginal models; Neuhaus, Scott and Wild (2006) have more recently developed methods using subject-specific models. Our design is linked to the case-control family study; however, we sample on an ancillary variate that is related but not equal to the index response. Whereas Neuhaus et al. (2006) discuss a ‘stochastic’ sampling design like ours, they propose likelihood-based estimation. In contrast, we develop a semiparametric estimation strategy for the marginal model E(Yi | Xi) using generalized estimating equations (GEE; Liang and Zeger, 1986). Advantages to likelihood-based estimation over GEE are well known (e.g., model selection and missing data); however, we believe it is important to develop an estimation strategy for this design using methods that, unlike parametric approaches, are insensitive to dependence model misspecification. 
Our approach can be implemented using standard GEE software, and we provide a Stata macro (StataCorp, 2007) for doing so on the second author’s (PJR) website (http://health.bsd.uchicago.edu/rathouz/Software).

This manuscript is organized as follows. In section 2, we describe modeling assumptions that must be made for valid inferences and a general strategy that can be used for estimation with this study design. We detail a semiparametric estimation approach to parameter estimation and inference under logistic regression models for Zi and Yi in section 3. Section 4 reports on simulation studies conducted for the purpose of examining the finite sample operating characteristics of the proposed estimator, focusing on the impact on bias of model misspecification and on statistical efficiency of design and estimation strategies. We return to the ADHD study in section 5 and describe an analysis of those data. Finally, we provide concluding remarks and a discussion in section 6.

2. Sampling and modeling assumptions

Consider a target population wherein each subject i in the population admits (Yi, ti, Xi). Here, Yi = (Yi1, …, Yini)′ is a longitudinal series of binary outcomes such as annual ADHD diagnosis, Xi = (xi1, …, xini)′ is an ni × p matrix of covariates predicting Yi, and ti = (ti1, …, tini)′ is a vector of observation times which may also be contained in Xi. For example, in the ADHD study, each row j of Xi may contain a vector of baseline (e.g., gender and ethnicity) and time-varying (e.g., wave, age, other psychiatric diagnoses, or interactions between baseline predictors and time) predictors for ADHD diagnosis Yij at time tij. For purposes of exposition, we assume that the number ni and values ti of observation times are fixed by design. In practice, ni and ti can vary either functionally or stochastically depending on baseline predictors contained in Xi, so long as they are independent of Yi given such baseline predictors.

We assume that interest lies in the marginal probability that Yij = 1 given Xi in the target population,

$$\mu^{P}_{ij} = \Pr(Y_{ij} = 1 \mid X_i) = g^{-1}(\beta_0 + x_{ij}'\beta_1) \qquad (1)$$

(subscript P for target population), where g(·) is a link function mapping (0, 1) to the real line, and, generally, β1 is the parameter of interest. Note that (1) implicitly contains the “reproducibility” or “no interference” assumption that

Pr(Yij=1|Xi)=Pr(Yij=1|xij),

i.e., that predictors available in Xi provide no additional predictive value for Yij over and above the information available in xij. In the ADHD study, for example, this assumption would be easily satisfied if xij contains only baseline and non-stochastic predictors such as time or age. In situations wherein xij contains stochastic predictors, such as other mental health diagnoses at time tij, the assumption needs to be examined more carefully.

In randomly drawing a sample from the target population, let Si be an indicator variable for the ith subject in the population being selected into the sample, and assume that Si ∐ Si′, i ≠ i′. Under simple random sampling, or sampling that is related to Xi but not to Yi, models and inferences for β = (β0, β1′)′ in µPij can be carried out ignoring the sampling process. A typical approach would be to fit mean model (1) along with a working correlation model, corr(Yij, Yik|Xi), j ∈ {1, …, ni}, k ∈ {1, …, ni}, j ≠ k, using GEE.

Now, consider a more challenging design wherein sampling Si is related in some way to Yi and possibly Xi. In this setting, the sample is no longer representative of the target population. Rather, it represents a pseudo-population that is a reweighted version of the target population, where weights vary as a function of Yi, or if sampling also depends upon Xi, as a function of (Yi, Xi). To proceed, define the sampling probability ρij(y, Xi) ≡ Pr(Si = 1|Yij = y, Xi). That is, ρij(y, Xi) is defined to equal the probability of being sampled conditional on the entire design matrix Xi, but only on the jth response Yij; even though Si may depend on the entire vector Yi, for reasons that will become evident, we focus here only on this marginal probability. We have included subscripts ij on ρij(y, Xi) to indicate that this probability could vary with one or more of the observation number j, time tij, and design matrix Xi. Then, conditional on being sampled (i.e., Si = 1), standard Bayes’ Theorem calculations applied to target population model (1) yield the following pseudo-population marginal odds model,

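$$\frac{\Pr(Y_{ij}=1 \mid X_i, S_i=1)}{\Pr(Y_{ij}=0 \mid X_i, S_i=1)} = \frac{\mu^{S}_{ij}}{1-\mu^{S}_{ij}} = \frac{\mu^{P}_{ij}}{1-\mu^{P}_{ij}} \cdot \frac{\rho_{ij}(1, X_i)}{\rho_{ij}(0, X_i)} \qquad (2)$$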

(subscript S indicating pseudo-population sample).
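
As a quick numerical check of (2), the sketch below computes the pseudo-population odds directly from Bayes' Theorem and compares them with the population odds multiplied by the sampling ratio. The function name and the three input values are hypothetical illustrations, not quantities from the ADHD study.

```python
def pseudo_population_odds(mu_p, rho1, rho0):
    """Odds of Y=1 among sampled subjects, computed via Bayes' Theorem.

    mu_p : Pr(Y=1 | X) in the target population
    rho1 : Pr(S=1 | Y=1, X)
    rho0 : Pr(S=1 | Y=0, X)
    """
    p_y1_and_s = mu_p * rho1          # Pr(Y=1, S=1 | X)
    p_y0_and_s = (1 - mu_p) * rho0    # Pr(Y=0, S=1 | X)
    mu_s = p_y1_and_s / (p_y1_and_s + p_y0_and_s)
    return mu_s / (1 - mu_s)


# Hypothetical values chosen only for illustration
mu_p, rho1, rho0 = 0.12, 0.40, 0.05
lhs = pseudo_population_odds(mu_p, rho1, rho0)
rhs = mu_p / (1 - mu_p) * rho1 / rho0   # population odds times sampling ratio
print(lhs, rhs)
```

The two quantities agree up to floating-point error, so only the ratio ρij(1, Xi)/ρij(0, Xi), and not the individual sampling probabilities, is needed to move between the two odds.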

When ρij(1, Xi)/ρij(0, Xi) is known, we may use (2) to make inferences about target population parameters through parameter estimation for the pseudo-population model. Estimation with GEE would require specification of the mean model for µSij given by (2) and a working correlation model for corr(Yij, Yik|Xi, Si = 1) in the pseudo-population.

Here, we consider the circumstance where the sampling fraction ρij(1, Xi)/ρij(0, Xi) is unknown. We assume that sampling depends upon (Yi, Xi) only indirectly through a binary case-control or sampling variable Zi, or, when a stratified sampling scheme is implemented, only indirectly through (Zi, X1i), where Xi = (X1i, X2i), and X1i contains a subset of the information, generally available at baseline, in Xi. In the ADHD study, Zi indicates referral status, and X1i contains the subject’s gender. In other designs it may also include measures such as baseline age, neighborhood or community variables available at the time of enrollment, etc. Stratified sampling may be utilized in order to improve estimation efficiency on the coefficients for X1i as well as covariates in X2i that are related to X1i. Formally, we assume that, without stratification, Si ∐ (Yi, Xi)|Zi, while for the stratified design,

Si ∐ (Yi, Xi) | (Zi, X1i). (3)

In the ADHD study, (3) indicates that, given referral status and gender, selection is independent of baseline or subsequent ADHD diagnoses and of other predictor variables. In subsequent exposition, we focus on the stratified sampling design. Our development is easily adapted to the simpler, unstratified design if that is of interest.

Let π(z, X1i) = Pr(Si = 1 | Zi = z, X1i), z = 0, 1. Then, if (3) holds, knowledge of π(1, X1i)/π(0, X1i) permits estimation of the ratio ρij(1, Xi)/ρij(0, Xi). This in turn permits inferences about parameters β in model (1) via relationship (2). To see this, define λPij(y, Xi) = Pr(Zi = 1|Yij = y, Xi), y = 0, 1. Note that this quantity may at first appear counterintuitive, since Zi occurs prior to Yij in time. Nevertheless, this “reverse” conditional probability certainly exists and can be modeled. This model is ancillary to model of interest (1) and is specified in order to render identifiable parameters in model (1). Utilization of this intermediary model to identify parameters in the target model follows directly from Lee, McMurchy, and Scott (1997) and Neuhaus et al. (2006). Owing to the reverse time sequence and to the fact that the conditioning statistic Yij varies with j, we will tend to choose flexible specifications for λPij(y, Xi). Note also that, as with ρij(y, Xi), λPij(y, Xi) is conditional on the entire design matrix Xi, but only on the jth response Yij. Similarly to (2), Bayes’ Theorem calculations yield an odds model for Zi in the pseudo-population, viz,

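$$\frac{\Pr(Z_i=1 \mid Y_{ij}=y, X_i, S_i=1)}{\Pr(Z_i=0 \mid Y_{ij}=y, X_i, S_i=1)} = \frac{\lambda^{S}_{ij}(y, X_i)}{1-\lambda^{S}_{ij}(y, X_i)} = \frac{\lambda^{P}_{ij}(y, X_i)}{1-\lambda^{P}_{ij}(y, X_i)} \cdot \frac{\pi(1, X_{1i})}{\pi(0, X_{1i})}, \qquad (4)$$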

y = 0, 1, where λSij(y, Xi) = Pr(Zi = 1|Yij = y, Xi, Si = 1). Additionally, due to (3),

$$\rho_{ij}(y, X_i) = \pi(0, X_{1i})\{1-\lambda^{P}_{ij}(y, X_i)\} + \pi(1, X_{1i})\,\lambda^{P}_{ij}(y, X_i), \quad y = 0, 1,$$

from which we can write the ratio

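$$\frac{\rho_{ij}(1, X_i)}{\rho_{ij}(0, X_i)} = \frac{1-\lambda^{P}_{ij}(1, X_i) + \{\pi(1, X_{1i})/\pi(0, X_{1i})\}\,\lambda^{P}_{ij}(1, X_i)}{1-\lambda^{P}_{ij}(0, X_i) + \{\pi(1, X_{1i})/\pi(0, X_{1i})\}\,\lambda^{P}_{ij}(0, X_i)}. \qquad (5)$$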

In Section 3, relationship (4) and sampling ratio π(1, X1i)/π(0, X1i) will be used to specify and fit a model for λPij(y, Xi) using data from the pseudo-population. This will lead to estimates of ρij(1, Xi)/ρij(0, Xi) using (5) which can be used in (2) to make β-inferences.
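
Relationship (5) is simple to evaluate once λPij(1, Xi), λPij(0, Xi) and the ratio π(1, X1i)/π(0, X1i) are available. The following is a minimal sketch, assuming NumPy; the function name and argument names are hypothetical and are not part of the Stata macro mentioned in the Introduction.

```python
import numpy as np


def log_sampling_odds_ratio(lam1, lam0, pi_ratio):
    """Return log{rho(1)/rho(0)} computed from relationship (5).

    lam1, lam0 : Pr(Z=1 | Y_ij = 1, X_i) and Pr(Z=1 | Y_ij = 0, X_i)
                 in the target population
    pi_ratio   : pi(1, X_1i) / pi(0, X_1i)
    Inputs may be scalars or arrays of matching shape.
    """
    lam1, lam0, pi_ratio = map(np.asarray, (lam1, lam0, pi_ratio))
    numerator = 1 - lam1 + pi_ratio * lam1
    denominator = 1 - lam0 + pi_ratio * lam0
    return np.log(numerator / denominator)


# Two limiting cases that follow directly from (5):
# equal sampling probabilities give no distortion ...
print(log_sampling_odds_ratio(0.9, 0.05, 1.0))
# ... and Z identical to Y_ij reproduces the case-control ratio itself.
print(log_sampling_odds_ratio(1.0, 0.0, 7.0), np.log(7.0))
```

The two limiting cases are useful sanity checks: when π(1, X1i) = π(0, X1i) the ratio is one, and when Zi coincides with Yij the ratio reduces to π(1, X1i)/π(0, X1i), as in an ordinary case-control study.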

3. Implementation with logistic regression and GEE

Here, we present a specific approach to the program outlined in section 2, beginning with model specification and estimation for λPij(y, Xi). Suppose λPij(y, Xi) is modeled as a logistic regression of Zi of the form

$$\lambda^{P}_{ij}(y, X_i) = \mathrm{logit}^{-1}(w_{1,ij}'\gamma_1 + y \times w_{2,ij}'\gamma_2), \qquad (6)$$

where w1,ij and w2,ij are functions of (Xi, ti). We assume that w1,ij and w2,ij are sufficiently rich so that

Pr(Zi=1|Yij=y,w1,ij,w2,ij)=Pr(Zi=1|Yij=y,Xi). (7)

We separately denote xij, w1,ij and w2,ij because, even though they may be overlapping in their information content, they may take on different functional forms and because, in order to compute (5), any interactions with y in (6) need to be made explicit. In most applications, we would expect to maximize model flexibility in ancillary model (6), but to be more parsimonious in our specification of model of interest (1). For example, while ti may be included as a linear term in E(Yi | Xi), the relationship between Zi and both ti and the interaction between ti and y may be non-linear. Later, in the ADHD study example, we allow ti to be a series of time-specific indicator variables in w1,ij, a piecewise linear spline function in w2,ij, and a simple linear term in xij.

Via (4), (6) induces a model for λSij, i.e.,

$$\lambda^{S}_{ij}(y, X_i) = \mathrm{logit}^{-1}\left(w_{1,ij}'\gamma_1 + y \times w_{2,ij}'\gamma_2 + \log\{\pi(1, X_{1i})/\pi(0, X_{1i})\}\right). \qquad (8)$$

Model (8) can be fitted to the data from a case-control sample using standard logistic regression software, including log{π(1, X1i)/π(0, X1i)} as an offset term in the linear predictor. Specifically, setting λSij = λSij(Yij, Xi) and γ = (γ1′, γ2′)′, γ is estimated by solving the logistic regression score equation ∑iTi(γ) = 0, yielding γ̂, where

$$T_i(\gamma) = \sum_{j=1}^{n_i} \begin{pmatrix} w_{1,ij} \\ Y_{ij} \times w_{2,ij} \end{pmatrix} (Z_i - \lambda^{S}_{ij}). \qquad (9)$$

Ti(γ) nominally treats the jth term in (9) as independent of the other ni − 1 terms in the sum. This independence does not, of course, hold, as all terms share the same response variable Zi. Nevertheless, Ti(γ) is unbiased and so in general will yield consistent estimators for γ. Therefore, γ can be estimated using any logistic regression GEE software program which permits offset terms and allows for the independence correlation structure.
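
To make the estimating equation concrete, the sketch below solves ∑iTi(γ) = 0 by Newton-Raphson for a logistic model with a fixed offset, treating all subject-visit rows as independent as described above. It is a minimal illustration assuming NumPy; the function name, the simulated data and the value log(5) used for the offset are hypothetical, and in practice any logistic regression or GEE routine that accepts an offset would serve.

```python
import numpy as np


def fit_logistic_with_offset(W, z, offset, n_iter=25, tol=1e-10):
    """Solve the score equation sum_i T_i(gamma) = 0 with a fixed offset.

    W      : (n_obs, q) matrix whose rows are (w1_ij, Y_ij * w2_ij)
    z      : (n_obs,) sampling variable Z_i, repeated over a subject's rows
    offset : (n_obs,) log{pi(1, X_1i) / pi(0, X_1i)}
    """
    gamma = np.zeros(W.shape[1])
    for _ in range(n_iter):
        lam = 1.0 / (1.0 + np.exp(-(W @ gamma + offset)))  # lambda^S_ij
        score = W.T @ (z - lam)                            # sum_i T_i(gamma)
        info = (W * (lam * (1 - lam))[:, None]).T @ W      # minus the score derivative
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma


# Simulated illustration (all values hypothetical)
rng = np.random.default_rng(0)
n = 2000
W = np.column_stack([np.ones(n), rng.binomial(1, 0.3, n)])
offset = np.full(n, np.log(5.0))
true_gamma = np.array([-2.0, 1.5])
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-(W @ true_gamma + offset))))
gamma_hat = fit_logistic_with_offset(W, z, offset)
print(gamma_hat)
```

Because the offset is held fixed rather than estimated, γ̂ describes the target-population model (6) even though it is fitted to the sampled data. The naive standard errors from such a fit are not appropriate here, since rows from the same subject share Zi.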

Turning to model specification for µPij, let g(·) be the logit function. Then (1) implies µPij = logit⁻¹(β0 + xij′β1), and, from (2),

$$\mu^{S}_{ij} = \mathrm{logit}^{-1}(\beta_0 + x_{ij}'\beta_1 + B_{ij}), \qquad (10)$$

wherein the bias-correction term Bij = Bij(Xi) = log{ρij(1, Xi)/ρij(0, Xi)} appears as an offset. By (5) and (6), Bij is a function of γ, and so is estimable by plugging in γ̂ for γ to obtain B̂ij. With B̂ij, the sampled data can then be analyzed using marginal model (10). This mean model can be complemented with a working correlation model

$$c^{S}_{ijk}(\alpha) = \mathrm{corr}(Y_{ij}, Y_{ik} \mid X_i, S_i = 1; \alpha) \qquad (11)$$

in the sampled pseudo-population, governed by parameter α, though α̂ cannot be applied to inferences regarding the target population. The model specified via (10) and (11) with B̂ij replacing Bij can then be estimated directly using any standard GEE software program. If the working correlation model is a reasonable approximation to the true correlation structure in the sampled pseudo-population, it should result in an increase in statistical efficiency for β estimation under (10) relative to, say, estimation under the independence working correlation model (Liang et al., 1992; Fitzmaurice, 1995; Mancl and Leroux, 1996; Schildcrout and Heagerty, 2005). Standard errors for β̂ will not, however, be correct under this approach, since they must account for the uncertainty in estimation of γ.

Standard errors can be calculated via a corrected version of the sandwich estimator (Liang and Zeger, 1986). Note that β is estimated by solving the GEE logistic regression estimating equation ∑i Ui(β, γ̂) = 0 for β, yielding β̂, where

$$U_i(\beta, \gamma) = D_i V_i^{-1}(Y_i - \mu^{S}_{i}).$$

Here, as in the usual GEE setup, µSi = (µSi1, …, µSini)′, Di = (1ni, Xi)′Ai, Ai = diag{µSij(1 − µSij)}, j = 1, …, ni, Vi = Ai^{1/2} Ci Ai^{1/2}, and Ci is the ni × ni matrix with element (j, k) given by (11). Correlation parameter α is estimated iteratively with β, but owing to the orthogonality of α and β in Ui, estimation of α has no asymptotic impact on the validity of the standard errors of β̂ (Liang and Zeger, 1986). Robust standard errors for β̂ are developed by viewing (γ̂′, β̂′)′ as the solution to the “stacked” estimating equation

$$\sum_i \begin{pmatrix} T_i(\gamma) \\ U_i(\beta, \gamma) \end{pmatrix} = 0. \qquad (12)$$

The asymptotic variance of (γ̂′, β̂′)′ is then given as

$$\mathrm{AVar}(\hat\gamma, \hat\beta) = \hat{I}^{-1}\hat{Q}\,(\hat{I}^{-1})', \qquad (13)$$

where the ^’s indicate that (γ′, β′)′ has been replaced by (γ̂′, β̂′)′,

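$$Q = \sum_i \begin{pmatrix} T_i(\gamma) \\ U_i(\beta, \gamma) \end{pmatrix}^{\otimes 2}, \quad \text{and} \quad I = \begin{pmatrix} I_{TT} & 0 \\ I_{UT} & I_{UU} \end{pmatrix}. \qquad (14)$$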

In (14),

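$$I_{TT} = \sum_i E\left(-\frac{\partial T_i}{\partial \gamma'}\right) = \sum_i \sum_{j=1}^{n_i} \begin{pmatrix} w_{1,ij} \\ Y_{ij} \times w_{2,ij} \end{pmatrix}^{\otimes 2} \lambda^{S}_{ij}(1-\lambda^{S}_{ij}),$$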

the upper right quadrant of I is 0 because E(− ∂Ti/∂β′) = 0,

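$$I_{UT} = \sum_i E\left(-\frac{\partial U_i}{\partial \gamma'}\right) = \sum_i D_i V_i^{-1} A_i \left(\frac{\partial B_i}{\partial \gamma'}\right),$$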

and

$$I_{UU} = \sum_i E\left(-\frac{\partial U_i}{\partial \beta'}\right) = \sum_i D_i V_i^{-1} D_i'.$$

In IUT, Bi = (Bi1,…, Bini)′ and

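$$\left(\frac{\partial B_i}{\partial \gamma'}\right)' = -\begin{pmatrix} W_{1i}' \\ W_{2i}' \end{pmatrix} F_i(1) + \begin{pmatrix} W_{1i}' \\ 0 \end{pmatrix} F_i(0), \qquad (15)$$

where

$$F_i(y) = \mathrm{diag}\left\{ \lambda^{P}_{ij}(y, X_i)\{1-\lambda^{P}_{ij}(y, X_i)\}\, \frac{1 - \{\pi(1, X_{1i})/\pi(0, X_{1i})\}}{1-\lambda^{P}_{ij}(y, X_i) + \{\pi(1, X_{1i})/\pi(0, X_{1i})\}\,\lambda^{P}_{ij}(y, X_i)} \right\}_{j=1}^{n_i},$$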

W1i = (w1i1, …, w1ini)′ and W2i = (w2i1, …, w2ini)′ (see Web Appendix A available online at http://www.biometrics.tibs.org).
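
Once the per-subject estimating functions and the blocks of I have been evaluated at (γ̂, β̂), assembling the variance estimator is a few lines of linear algebra. The sketch below assumes NumPy; the function name and the randomly generated inputs are hypothetical placeholders for quantities that would come from the two fitted models. Because I is block lower triangular rather than symmetric, the sketch transposes the right-hand inverse.

```python
import numpy as np


def stacked_sandwich(T, U, I_TT, I_UT, I_UU):
    """Sandwich variance of (gamma_hat, beta_hat) from the stacked equations.

    T : (n_subjects, q) array with rows T_i(gamma_hat)
    U : (n_subjects, p) array with rows U_i(beta_hat, gamma_hat)
    I_TT, I_UT, I_UU : blocks of I, with shapes (q, q), (p, q), (p, p)
    """
    q, p = T.shape[1], U.shape[1]
    stacked = np.hstack([T, U])                  # row i is (T_i', U_i')
    Q = stacked.T @ stacked                      # sum of outer products
    I = np.block([[I_TT, np.zeros((q, p))],      # upper right block is zero
                  [I_UT, I_UU]])
    I_inv = np.linalg.inv(I)
    return I_inv @ Q @ I_inv.T


# Placeholder inputs, for illustration only
rng = np.random.default_rng(1)
T, U = rng.normal(size=(50, 2)), rng.normal(size=(50, 3))
I_TT, I_UU = 4.0 * np.eye(2), 5.0 * np.eye(3)
avar = stacked_sandwich(T, U, I_TT, np.zeros((3, 2)), I_UU)
var_beta = avar[2:, 2:]                          # block for beta_hat
print(np.sqrt(np.diag(var_beta)))
```

Setting IUT to zero, as in this toy call, recovers the usual GEE sandwich for β̂ that ignores estimation of γ; a non-zero IUT is what propagates uncertainty in γ̂ into the standard errors of β̂.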

4. Finite sample operating characteristics of estimators

In the previous section we outlined a strategy for estimating population model parameters and their uncertainty in a longitudinal study following (stratified) case-control sampling. We now explore, via Monte Carlo simulation, the impact that misspecification of π(1, X1i)/π(0, X1i) and λSij(y, Xi) can have on inferential validity, and the effect that design and estimation strategy can have on estimation efficiency.

4.1 Population model

The population model we consider is a marginalized transition and latent variable model (Schildcrout and Heagerty, 2007) which is given by:

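$$\mathrm{logit}(\mu^{P}_{ij}) = \beta_0 + \beta_t t_{ij} + \beta_{x1} x_{1i} + \beta_{x2} x_{2i} \qquad (16)$$

$$\mathrm{logit}(\mu^{P,c}_{ij}) = \Delta_{ij} + \gamma Y_{i,j-1} + b_i \qquad (17)$$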

Equation (16) is the marginal mean model for Yij which captures the impact of target covariates on the average response, and equation (17) is the conditional mean model for (Yij|Yi,j−1, bi) that captures within-subject response dependence. The conditional mean model introduces two sources of dependence among repeated measurements within an individual. Subject-to-subject heterogeneity in predisposition for a positive response (Yij = 1) is introduced by the random intercept bi, and serial dependence is introduced by the transition term, Yi,j−1, with coefficient γ. The marginal and conditional mean models, along with the distributional assumption, bi ~ N(0, σb²), complete the multivariate distribution of [Yi | Xi] and allow us to generate data for the population. The value, Δij, linking µPij and μPijc has been described in a number of earlier manuscripts (e.g., Azzalini, 1994; Heagerty, 1999 and 2002; Schildcrout and Heagerty, 2007). For this simulation, xij = (tij, x1i, x2i)′, β0 = −2.75, β1 = (βt, βx1, βx2) = (0.25, 0.75, 0.75), σb = 2.5 and γ = 1. Covariates x1i and x2i are binary and time-invariant with Pr(x1i = 1) = 0.2 and Pr(x2i = 1 | x1i) = 0.25 + 0.1x1i, and tij is a time covariate with tij = j, j ∈ {1, …, ni}. We assume a missing completely at random dropout mechanism, where the last follow-up time ni is uniformly distributed between three and eight. The large σb value is intended to reflect the substantial between-subject heterogeneity in ADHD diagnoses; many children never exhibit symptoms or meet diagnostic criteria, while others do so often. This model induces a marginal prevalence at ti1 of Pr(Yi1 = 1) ≈ 0.119.

The sampling covariate Zi is binary, and we generated its value for each subject using the model, λPij = logit⁻¹(γ0 + γ1Yi1), with γ = (γ0, γ1) fixed at two sets of values: (−5, 10) and (−3, 3). We will refer to the former as strong Zi ~ Yi1 dependence (Strz~y) and the latter as weak Zi ~ Yi1 dependence (Wkz~y). Note that with Strz~y, Zi is effectively equal to the first response value Yi1. Pr(Zi = 1) equals 0.124 and 0.101, respectively, under Strz~y and Wkz~y.
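
The reported values of Pr(Zi = 1) can be checked by averaging the generating model for Zi over the distribution of Yi1. The short sketch below does so using the approximate prevalence Pr(Yi1 = 1) ≈ 0.119 from Section 4.1; the function names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def prevalence_of_z(gamma0, gamma1, p_y1):
    """Pr(Z=1) = sum over y of Pr(Z=1 | Y_i1 = y) * Pr(Y_i1 = y)."""
    return expit(gamma0) * (1 - p_y1) + expit(gamma0 + gamma1) * p_y1


p_y1 = 0.119                                  # approximate Pr(Y_i1 = 1)
strong = prevalence_of_z(-5, 10, p_y1)        # strong Z ~ Y_i1 dependence
weak = prevalence_of_z(-3, 3, p_y1)           # weak Z ~ Y_i1 dependence
print(round(strong, 3), round(weak, 3))
```

Rounded to three decimals, the two results match the values 0.124 and 0.101 given in the text.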

4.2 Sampling from the population

For the purpose of sampling from the population, we consider both unstratified and stratified approaches, denoted respectively by S(z) and S(z, x1). In S(z), the sampling probability depends only on Zi, i.e., π(z, X1i) = π(z). In S(z, x1), it depends both on Zi and subject-level covariate x1i, i.e., π(z, X1i) = π(z, x1i). In a population of size N with Nz, z ∈ {0, 1}, and Nz,x1, (z, x1) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}, members in each sampling stratum, we sample each subject according to their stratum membership with probability π(z) = 200/Nz for S(z) and π(z, x1) = 100/Nz,x1 for S(z, x1). The number of sampled subjects nz,x1 in each stratum (z, x1) follows a binomial distribution, so that the expected numbers of controls (Zi = 0) and cases (Zi = 1) are each equal to 200. For a detailed description of how data were generated, see the Web Appendix B available online at http://www.biometrics.tibs.org.
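
The unstratified design S(z) can be sketched as independent Bernoulli draws with stratum-specific probabilities. The example below assumes NumPy; the function name, the population size of 20,000 and the random seed are hypothetical choices for illustration, and the actual data-generating details are those in Web Appendix B.

```python
import numpy as np


def case_control_sample(z, target_per_stratum=200, rng=None):
    """Bernoulli sampling with pi(z) = target_per_stratum / N_z.

    z : (N,) binary sampling variable for the whole population
    Returns the selection indicators S_i and the ratio pi(1) / pi(0).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z)
    n1 = int(z.sum())                  # population cases, N_1
    n0 = z.size - n1                   # population controls, N_0
    pi1, pi0 = target_per_stratum / n1, target_per_stratum / n0
    prob = np.where(z == 1, pi1, pi0)
    selected = rng.random(z.size) < prob
    return selected, pi1 / pi0


rng = np.random.default_rng(2024)
z_pop = rng.binomial(1, 0.124, size=20000)      # hypothetical population
selected, pi_ratio = case_control_sample(z_pop, rng=rng)
print(selected[z_pop == 1].sum(), selected[z_pop == 0].sum(), pi_ratio)
```

The realized numbers of cases and controls vary around 200 from sample to sample, and the sampling ratio equals N0/N1, which is the quantity the analyst must specify (or approximate) when computing the offset in (8).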

4.3 Analysis models and model misspecification

In model fitting, for each replicate, we focus on the impact of two forms of model misspecification: 1) misspecified sampling ratio π(1)/π(0) (for S(z)) or π(1, x1)/π(0, x1) (for S(z, x1)), and 2) misspecified (w1,ij, w2,ij) in (8), which is equivalent to violation of assumption (7); analysis models are summarized in table 1. In the former case, misspecification occurs during estimation, when we assume that the sampling ratio is one-half of, one-fifth of, or twice its true value. In the latter case, we consider four functional forms for (w1,ij, w2,ij). In order of increasing flexibility, the linear predictors in (8) are given as follows: (lp-1), w1,ij = w2,ij = (1, x1i, x2i)′; (lp-2), w1,ij = w2,ij = (1, x1i, x2i, tij)′; (lp-3), w1,ij = (1, x1i, x2i, x1i × x2i, tij, x1i × tij, x2i × tij)′ and w2,ij = (1, x1i, x2i, tij)′; (lp-4), (lp-3) with the main effect tij replaced in w1,ij by time-specific indicator variables, and tij in interaction terms in w1,ij and in w2,ij by a piecewise linear function with a knot at t = 3. In all cases, GEE with exchangeable working covariance weighting (GEEE) was used for β estimation with (12), and Wald-based 95% confidence intervals were computed based on standard errors estimated using (13). Finally, we consider a modeling approach that ignores the study design, i.e., a standard GEE model for E(Yi|Xi) with exchangeable correlation matrix. We used the exchangeable working correlation model because, noting that σb = 2.5 and γ = 1, the random intercept is the dominating source of response dependence in these data, and although the true dependence structure also has a serial component, very few standard GEE software packages permit estimation that acknowledges both sources.

Table 1.

Estimation Strategies: In strategy 1, the sampling ratio, π(1)/π(0) or π(1, x1)/π(0, x1), is correctly specified, and the model (8) is very flexible. In strategies 2, 3, and 4, we misspecify the sampling ratio by assuming it is one half, one fifth and twice its true value, respectively. In strategies 5, 6, and 7, we assume three less flexible models for (8), given in the text. Approach 8 ignores the design altogether with analyses conducted as if the sample was representative of the target population.

| Estimation strategy | Specification of π(1, x1)/π(0, x1) | Specification of model (8) |
|---|---|---|
| 1 | π(1, x1)/π(0, x1) | (lp-4) |
| 2 | 0.5 · π(1, x1)/π(0, x1) | (lp-4) |
| 3 | 0.2 · π(1, x1)/π(0, x1) | (lp-4) |
| 4 | 2.0 · π(1, x1)/π(0, x1) | (lp-4) |
| 5 | π(1, x1)/π(0, x1) | (lp-1) |
| 6 | π(1, x1)/π(0, x1) | (lp-2) |
| 7 | π(1, x1)/π(0, x1) | (lp-3) |
| 8 | Ignored | Ignored |

4.4 Results: Inferential Validity

We now discuss inferential validity in the presence of possible misspecification of sampling ratio π(1)/π(0) or π(1, x1)/π(0, x1) and of model (8). Results are displayed in table 2 for the four combinations of S(z) and S(z, x1), and Strz~y and Wkz~y. As expected, approach 1, which has the most flexible model for Zi and uses the correctly specified sampling ratio, does very well in terms of both bias and coverage probability. As with case-control studies, sampling ratio misspecification (approaches 2, 3, 4 and 8) led to biased estimates of the intercept, which is often not a large concern. It also led to biased estimates of the time-varying covariate parameter βt; this is not unexpected because the slope over time summarizes the change in time-specific intercepts, which are biased due to sampling ratio misspecification. Biases in estimates of time-invariant covariate coefficients were generally modest and less than ten percent, except with S(z, x1), where estimation approach 8 was severely biased for the stratification variable coefficient, βx1. Severe inflexibility in the model for Zi (approach 5) also led to large biases in the intercept and time parameters, but with more flexible approaches such as 6 or 7, these biases were substantially reduced. In most cases, when bias was low or zero for a given parameter estimator, 95% confidence intervals were accurate.

Table 2.

Percent bias in parameter estimates and coverage percentages in eight estimation strategies described in table 1 and across 1500 replicates. The γ values (−5, 10) and (−3, 3) correspond to model parameters in Pr(Zi | Yi1), and represent strong (Strz~y) and weak (Wkz~y) dependence of Zi on Yi1, respectively. GEEE was used for estimation, and sampling was based on Zi in the unstratified designs (S(z)) and on (Zi, x1i) in the stratified designs (S(z, x1)). Percent bias in parameter estimates is calculated with 100 · (β̂k − βk)/βk for k ∈ {0, t, x1, x2}. Coverage percentages (cp) were calculated as the percent of nominal 95% Wald confidence intervals using robust standard errors spanning the true parameter value.

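| Estimation approach | β0 | βt | βx1 | βx2 | cp(β̂0) | cp(β̂t) | cp(β̂x1) | cp(β̂x2) |
|---|---|---|---|---|---|---|---|---|
| Strz~y with S(z) | | | | | | | | |
| 1 | −1 | −3 | 1 | 3 | 95 | 94 | 95 | 94 |
| 2 | −23 | −29 | 2 | 3 | 1 | 19 | 95 | 95 |
| 3 | −52 | −57 | 2 | 4 | 0 | 0 | 95 | 96 |
| 4 | 20 | 26 | 3 | 5 | 11 | 53 | 94 | 93 |
| 5 | −24 | −65 | −5 | −3 | 0 | 0 | 94 | 95 |
| 6 | −5 | −13 | −1 | 1 | 86 | 79 | 94 | 95 |
| 7 | −4 | −12 | 1 | 2 | 87 | 81 | 94 | 94 |
| 8 | −63 | −65 | 2 | 4 | 0 | 0 | 96 | 95 |
| Wkz~y with S(z) | | | | | | | | |
| 1 | 0 | −1 | 0 | 0 | 94 | 94 | 95 | 94 |
| 2 | −11 | −15 | 2 | 1 | 56 | 74 | 95 | 95 |
| 3 | −29 | −35 | 5 | 4 | 0 | 4 | 95 | 93 |
| 4 | 8 | 11 | −2 | −1 | 84 | 90 | 95 | 94 |
| 5 | −17 | −46 | −3 | −3 | 24 | 0 | 95 | 94 |
| 6 | −2 | −7 | −1 | −2 | 94 | 92 | 95 | 94 |
| 7 | −1 | −6 | 0 | 0 | 94 | 92 | 95 | 94 |
| 8 | −40 | −46 | 7 | 5 | 0 | 0 | 94 | 94 |
| Strz~y with S(z, x1) | | | | | | | | |
| 1 | −1 | −2 | −4 | 2 | 93 | 94 | 92 | 95 |
| 2 | −23 | −28 | −3 | 3 | 3 | 19 | 93 | 96 |
| 3 | −52 | −55 | −3 | 4 | 0 | 0 | 93 | 95 |
| 4 | 20 | 26 | −4 | 3 | 19 | 51 | 92 | 94 |
| 5 | −22 | −59 | −9 | −4 | 4 | 0 | 93 | 96 |
| 6 | −4 | −11 | −6 | 0 | 87 | 80 | 93 | 95 |
| 7 | −4 | −11 | −5 | 1 | 88 | 82 | 93 | 95 |
| 8 | −65 | −59 | −62 | 3 | 0 | 0 | 22 | 95 |
| Wkz~y with S(z, x1) | | | | | | | | |
| 1 | 0 | −1 | −1 | 1 | 94 | 94 | 95 | 95 |
| 2 | −11 | −15 | 0 | 3 | 63 | 73 | 95 | 96 |
| 3 | −29 | −34 | 2 | 5 | 1 | 3 | 96 | 95 |
| 4 | 9 | 11 | −1 | 0 | 85 | 90 | 95 | 95 |
| 5 | −15 | −43 | −3 | −2 | 43 | 0 | 95 | 96 |
| 6 | −2 | −6 | −2 | −1 | 94 | 91 | 95 | 96 |
| 7 | −1 | −5 | −1 | 1 | 93 | 92 | 95 | 96 |
| 8 | −40 | −43 | −16 | 6 | 0 | 0 | 90 | 95 |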
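
The two summaries reported in table 2 can be computed from replicate-level output as sketched below, assuming NumPy. The function names are hypothetical, and averaging the estimates across replicates before forming the percent bias is an assumption about how the caption's formula was applied.

```python
import numpy as np


def percent_bias(estimates, truth):
    """100 * (mean estimate - truth) / truth across simulation replicates."""
    return 100.0 * (np.mean(estimates) - truth) / truth


def coverage_percent(estimates, std_errors, truth, z_crit=1.959964):
    """Percent of Wald intervals (estimate +/- z_crit * SE) containing truth."""
    estimates, std_errors = np.asarray(estimates), np.asarray(std_errors)
    lower = estimates - z_crit * std_errors
    upper = estimates + z_crit * std_errors
    return 100.0 * np.mean((lower <= truth) & (truth <= upper))


# Toy replicate output for beta_t = 0.25 (values are made up)
beta_t_hats = [0.20, 0.30, 0.26, 0.24]
beta_t_ses = [0.04, 0.04, 0.04, 0.04]
print(percent_bias(beta_t_hats, 0.25))
print(coverage_percent(beta_t_hats, beta_t_ses, 0.25))
```

Note that for a negative true value such as β0 = −2.75, a negative percent bias under this formula means the estimate is larger (less negative) than the truth.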

4.5 Results: Estimation Efficiency

Regarding the impact of the design and estimation strategy on parameter estimation efficiency, we consider four specific contrasts (table 3): 1) Strz~y versus Wkz~y (with S(z) and GEEE), 2) S(z, x1) versus S(z) (with Strz~y and GEEE), 3) GEEE versus independence weighted GEE (GEEI; with Strz~y and S(z)), and 4) estimation strategy 1 versus others (using Strz~y, S(z) and GEEE). We only consider efficiency for parameter value combinations that yielded approximately valid inferences in the last section (coverage percentages ≥ 92).

Table 3.

Study design and estimation efficiency across 1500 replicates. Relative efficiency of A vs B is defined by the empirical variance of B divided by the empirical variance of A across 1500 replications. We only consider parameter by estimation strategies that were observed to be approximately valid in table 2. The upper portion of the table shows efficiency gains due to study design. We compare the efficiency of Strz~y versus Wkz~y with S(z) and GEEE, and S(z, x1) versus S(z) with Strz~y and GEEE. In the bottom portion of the table, we show efficiency gains due to estimation strategy. First, we compare GEEE versus GEEI with Strz~y and S(z), and second we show the impact of using estimation strategy 1 versus other estimation approaches with Strz~y, S(z), and GEEE.

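Study design

| Estimation strategy | Strz~y vs Wkz~y: β0 | βt | βx1 | βx2 | S(z, x1) vs S(z): β0 | βt | βx1 | βx2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.84 | 1.26 | 1.42 | 1.42 | 0.75 | 1.02 | 1.32 | 1.2 |
| 2 | - | - | 1.31 | 1.31 | - | - | 1.28 | 1.16 |
| 3 | - | - | 1.09 | 1.11 | - | - | 1.23 | 1.09 |
| 4 | - | - | 1.37 | 1.37 | - | - | 1.38 | 1.21 |
| 5 | - | - | 1.38 | 1.39 | - | - | 1.25 | 1.23 |
| 6 | - | - | 1.41 | 1.42 | - | - | 1.25 | 1.2 |
| 7 | - | - | 1.39 | 1.37 | - | - | 1.29 | 1.23 |
| 8 | - | - | 1.05 | 1.07 | - | - | - | 1.07 |

Estimation procedure

| Estimation strategy | GEEE vs GEEI: β0 | βt | βx1 | βx2 | Strategy 1 vs others: β0 | βt | βx1 | βx2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.04 | 1.08 | 1.05 | 1.03 | 1 | 1 | 1 | 1 |
| 2 | - | - | 1.09 | 1.07 | - | - | 0.84 | 0.83 |
| 3 | - | - | 1.11 | 1.08 | - | - | 0.76 | 0.74 |
| 4 | - | - | 0.99 | 0.98 | - | - | 1.31 | 1.3 |
| 5 | - | - | 1.06 | 1.03 | - | - | 0.97 | 0.99 |
| 6 | - | - | 1.04 | 1.01 | - | - | 0.97 | 0.98 |
| 7 | - | - | 1.04 | 1.02 | - | - | 1.04 | 1.05 |
| 8 | - | - | 1.09 | 1.06 | - | - | 0.76 | 0.73 |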
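
The relative efficiencies in table 3 follow the definition in the caption: the empirical variance of B divided by that of A. A minimal sketch, assuming NumPy, is given below; the function name is hypothetical, and the use of the sample variance with divisor n − 1 is an assumption.

```python
import numpy as np


def relative_efficiency(estimates_a, estimates_b):
    """Relative efficiency of A vs B: empirical variance of B over that of A."""
    return np.var(estimates_b, ddof=1) / np.var(estimates_a, ddof=1)


# Toy example: B's estimates are twice as spread out as A's
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(relative_efficiency(a, b))
```

Values above one therefore favor A; for example, an entry of 1.42 means that the empirical variance under B was 42% larger than under A.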

Efficiency gains for Strz~y over Wkz~y were pronounced. Wkz~y variances were up to 42% larger for β̂x1 and β̂x2 and 26% larger for β̂t. With S(z, x1), efficiency improvements over S(z) were observed for β̂x1 and β̂x2 by values as high as 38%. No efficiency improvements were observed for β̂t because stratification variable x1i was unrelated to tij.

GEEE improved estimation efficiency only modestly over GEEI, which may be expected for time-invariant covariate estimates; however, we anticipated larger efficiency gains for the time-varying covariate coefficient, βt. We speculate that this is due to the dependence of β estimates on γ estimates; i.e., IUT from (14) was non-zero. Finally, with the exception of estimation approach 4, approach 1 was no more efficient than the other approaches for the time-invariant covariate coefficients, and it was sometimes less efficient. Because it was as efficient as approaches 5, 6, and 7, we gather that estimating more parameters in approach 1 does not have a major impact on uncertainty in β̂. That it was more efficient than approach 4 and less efficient than approaches 2, 3, and 8 can possibly be explained by the impact of ‘weighting’ of cases relative to controls. By increasing the assumed values of π(1, x1)/π(0, x1), we are effectively giving greater weight to cases, and differential weighting of subjects is known to impact estimation efficiency.

4.6 Summary

To summarize, substantial effort should be made to ascertain reasonable approximations of π(1, x1)/π(0, x1), unless interest is only in time-fixed covariate coefficients. Topic-specific experts should be involved in this process, and sensitivity analyses should be conducted over a range of reasonable values in order to examine the extent to which inference would change under mild to moderate misspecification. Similarly, if time-varying covariate coefficients are of interest, it is important to build a sufficiently flexible model for (8). The crucial elements of this model include a flexible functional form of tij, Yij, and their interaction. Finally, estimation efficiency can be improved by choosing study designs with stratification and strong relationships between Zi and Yi.

5. Application to natural history studies of childhood mental health disorders

Participants in the ADHD study were sampled on the basis of whether (Zi = 1) or not (Zi = 0) they were referred to one of the two participating clinics. The study was matched on gender, Gi, and can be thought of as stratified since the probability of being sampled depended upon the pair (Zi, Gi). Patient referral was strongly related to ADHD symptom diagnosis at baseline, as Pr(Yi1 = 1 | Zi = 1) ≈ 0.92 and Pr(Yi1 = 1 | Zi = 0) ≈ 0.02, but this relationship was not deterministic. The demographic characteristics of referred and non-referred participants were similar. Both groups were approximately 82% male; 64% were white, 31% were African American, and 6% were classified as “other” ethnicity. Age distributions were also similar, with a median value of 5 years.

The primary goal of this analysis is to estimate the time trend of ADHD prevalence for boys (Gi = 0) and girls (Gi = 1) separately, and to examine whether this trend differs between them. The impacts of race/ethnicity and age at baseline are of secondary interest. Similar to section 4, we examine the impact of assumptions about π(1, g)/π(0, g) and those in auxiliary model (8). We considered six reasonable analysis approaches plus the naive analysis that ignored the design altogether. We assume that approximately five percent of girls in the population would qualify for referral, and that among boys this rate is likely to be higher. For boys, we considered values of five, ten, and fifteen percent. In our sample, 25 out of 46 girls were cases, and with five percent prevalence, Pr(Zi = 1 | Gi = 1) = 0.05, we have π(1, 1)/π(0, 1) = (25 · 0.95)/(21 · 0.05) = 22.6. Among boys, there were 113 cases and 96 controls. With Pr(Zi = 1 | Gi = 0) equal to 0.05, 0.10, and 0.15, π(1, 0)/π(0, 0) equals 22.4, 10.6, and 6.7, respectively. Next, we considered two linear predictors in auxiliary model (8). In the simpler model, the linear predictor included Yij, tij, age at baseline, gender, African American ethnicity, “other” ethnicity, and all pairwise interactions with Yij. The more flexible version of model (8) was identical to the simpler one except that the main effect of tij and its interaction with Yij were replaced with time-specific indicator variables.
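The sampling ratio arithmetic above can be sketched in a few lines of Python. The function name `sampling_ratio` is hypothetical; the calculation simply generalizes the expression (25 · 0.95)/(21 · 0.05) to any stratum with a given number of sampled cases and controls and an assumed population prevalence of referral.

```python
def sampling_ratio(n_cases, n_controls, prevalence):
    """pi(1, g) / pi(0, g) for one stratum, given an assumed Pr(Z = 1 | G = g)."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return (n_cases * (1.0 - prevalence)) / (n_controls * prevalence)


# Girls: 25 cases and 21 controls, assumed referral prevalence of 5%.
girls = sampling_ratio(25, 21, 0.05)

# Boys: 113 cases and 96 controls, assumed prevalences of 5%, 10%, and 15%.
boys = {p: sampling_ratio(113, 96, p) for p in (0.05, 0.10, 0.15)}

print(f"girls, 5%: {girls:.1f}")
for p, ratio in boys.items():
    print(f"boys, {p:.0%}: {ratio:.1f}")
```

Rounded to one decimal place, this reproduces the values 22.6 for girls and 22.4, 10.6, and 6.7 for boys. The same function could be looped over a wider grid of prevalences when conducting a sensitivity analysis.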

Results from these analyses are displayed in Table 4. The naive analysis yielded vastly different conclusions than the analyses that acknowledged the biased study design. While tij was significantly and positively associated with ADHD prevalence in boys (Gi = 0) in all analyses that acknowledged the study design, it was significantly and negatively associated with ADHD in the naive analysis. The prevalence time trend for girls in the naive analysis was negative, whereas it was flat when the study design was taken into account. Similarly, gender appeared to be independent of ADHD prevalence at baseline (tij = 0) in the naive analysis, while four of the six other approaches showed substantial evidence of females being at lower risk for ADHD than males. Among the other six analyses, the assumed values of π(1, g)/π(0, g) had a far larger impact on conclusions than did the choice of linear predictor in model (8). Lower assumed prevalence of referral status (Zi = 1) among boys, or equivalently higher π(1, 0)/π(0, 0) values, resulted in smaller effect sizes (in magnitude) for gender at baseline. The estimated baseline log odds ratios for girls versus boys ranged from approximately −1.05 to −0.4. The magnitude of the ADHD prevalence time trend for boys was also highly impacted by π(1, 0)/π(0, 0), with higher values being associated with larger time trend estimates. It is interesting to note that the time trend among females agreed very closely across the six approaches; in all cases, the coefficients for tij and tij · Gi were in opposite directions and comparable in magnitude.

Table 4.

ADHD Study results: A gender-stratified design was used to examine the time course of ADHD symptom exhibition in males and females separately and the difference in the trajectories. Linear tij columns correspond to estimates where the functional form of tij in the intermediate model (8) was assumed to be linear. With flexible tij, time-specific indicator variables were substituted for linear tij. We display parameter estimates on the log odds scale, with 95% confidence intervals in parentheses.

| | π(1, 0)/π(0, 0) = 22.4 | | π(1, 0)/π(0, 0) = 10.6 | | π(1, 0)/π(0, 0) = 6.7 | | Naive |
|---|---|---|---|---|---|---|---|
| | Flexible tij | Linear tij | Flexible tij | Linear tij | Flexible tij | Linear tij | |
| Time (years) | 0.13 (0.06, 0.19) | 0.10 (0.05, 0.16) | 0.09 (0.04, 0.14) | 0.08 (0.03, 0.12) | 0.06 (0.02, 0.11) | 0.06 (0.01, 0.10) | −0.04 (−0.07, −0.01) |
| Age (years) − 5 | −0.25 (−0.59, 0.09) | −0.18 (−0.51, 0.14) | −0.20 (−0.50, 0.11) | −0.16 (−0.46, 0.13) | −0.17 (−0.45, 0.12) | −0.14 (−0.42, 0.13) | −0.09 (−0.34, 0.15) |
| Female | −0.41 (−1.12, 0.30) | −0.50 (−1.18, 0.17) | −0.77 (−1.45, −0.08) | −0.80 (−1.46, −0.14) | −1.05 (−1.72, −0.37) | −1.05 (−1.71, −0.39) | 0.00 (−0.66, 0.66) |
| Female · Time | −0.12 (−0.23, −0.01) | −0.10 (−0.20, −0.01) | −0.08 (−0.19, 0.02) | −0.08 (−0.17, 0.02) | −0.06 (−0.16, 0.05) | −0.06 (−0.16, 0.04) | −0.09 (−0.19, 0.01) |
| Afr Am Ethnicity | 1.29 (0.73, 1.85) | 0.96 (0.43, 1.48) | 1.06 (0.57, 1.56) | 0.88 (0.40, 1.36) | 0.94 (0.48, 1.41) | 0.82 (0.37, 1.27) | 0.54 (0.13, 0.95) |
| Other Ethnicity | 0.11 (−0.97, 1.18) | 0.00 (−1.05, 1.06) | 0.17 (−0.82, 1.16) | 0.11 (−0.87, 1.08) | 0.22 (−0.74, 1.17) | 0.17 (−0.77, 1.12) | 0.38 (−0.51, 1.27) |
| Intercept | −2.21 (−2.66, −1.77) | −2.05 (−2.48, −1.63) | −1.84 (−2.21, −1.46) | −1.75 (−2.12, −1.38) | −1.55 (−1.89, −1.20) | −1.49 (−1.83, −1.16) | −0.05 (−0.36, 0.25) |
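To make the gender-specific quantities discussed before Table 4 concrete, the following sketch combines the point estimates from the six design-adjusted columns. The dictionary layout and variable names are hypothetical; the numbers are copied from Table 4. The time trend for girls is taken to be the sum of the Time and Female · Time coefficients, and the baseline odds ratio for girls versus boys is the exponentiated Female coefficient. Confidence intervals for the summed trend are not computed because they would require the covariance between the two estimates, which is not reported.

```python
import math

# Point estimates copied from Table 4 (log odds scale).
# Keys: (assumed pi(1,0)/pi(0,0), functional form of t_ij in model (8)).
table4 = {
    (22.4, "flexible"): {"time": 0.13, "female": -0.41, "female_time": -0.12},
    (22.4, "linear"): {"time": 0.10, "female": -0.50, "female_time": -0.10},
    (10.6, "flexible"): {"time": 0.09, "female": -0.77, "female_time": -0.08},
    (10.6, "linear"): {"time": 0.08, "female": -0.80, "female_time": -0.08},
    (6.7, "flexible"): {"time": 0.06, "female": -1.05, "female_time": -0.06},
    (6.7, "linear"): {"time": 0.06, "female": -1.05, "female_time": -0.06},
}

summary = {}
for key, est in table4.items():
    summary[key] = {
        # Yearly change in log odds of ADHD for girls (G_i = 1).
        "girls_slope": est["time"] + est["female_time"],
        # Odds ratio for girls versus boys at baseline (t_ij = 0).
        "baseline_or_girls_vs_boys": math.exp(est["female"]),
    }

for (ratio, form), vals in summary.items():
    print(
        f"ratio={ratio}, {form}: girls slope={vals['girls_slope']:.2f}, "
        f"baseline OR={vals['baseline_or_girls_vs_boys']:.2f}"
    )
```

Across the six approaches the summed slope for girls is between 0.00 and 0.01 per year, which is the sense in which the time trend among females is flat.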

6. Discussion

In this manuscript we discussed design and analysis considerations for stratified and unstratified case-control sampling followed by longitudinal follow-up on a stochastically related binary response vector. We developed a GEE-based estimation strategy as well as robust standard error calculations that incorporate information and uncertainty associated with the study design into the analysis via an ancillary model for case-control status. We found for time-invariant covariate coefficients that the biased design does not have a major impact on inferential validity, as most estimation approaches performed reasonably well. This result may be expected given the well-known performance of logistic regression under case-control sampling. The one exception involved the naive analyses with a stratified design. In this case, inferences related to the stratification variable should not be trusted, as estimates are likely to exhibit large biases. Misspecification of the ancillary case-control model can lead to invalid inferences on time-varying covariate coefficients; however, as long as this model is reasonably well specified and includes flexible functions of time in the linear predictor, analyses should perform reasonably well. Specification of the sampling ratio was also shown to have a major impact on the validity of analyses related to time-varying covariates, and, since this value is often unknown, appropriate specification of it is a major challenge. Content-specific experts should be involved in determining its value, and sensitivity analyses over a reasonable range should be conducted to examine the impact on inferences. While we found that stratification and strong dependence between the sampling covariate and response vector can improve estimation efficiency, covariance weighting improved efficiency only slightly.

Acknowledgements

Rathouz’s contribution to this research was funded in part by National Institute of Mental Health grants R01 MH62437 and R01 MH53554.

Footnotes

Supplementary Materials

Web Appendices A and B referenced in Sections 3 and 4.2, respectively, are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

Contributor Information

Jonathan S. Schildcrout, Departments of Biostatistics and Anesthesiology, Vanderbilt University School of Medicine, 1161 21st Avenue South, S-2323 Medical Center North, Nashville, TN 37232, USA, jonathan.schildcrout@vanderbilt.edu

Paul J. Rathouz, Department of Health Studies, University of Chicago, 5841 South Maryland Ave, MC 2007, Chicago, Illinois 60637, USA, prathouz@uchicago.edu

References

  1. American Psychiatric Association. Diagnostic and statistical manual of mental disorders (DSM-IV). 4th edition. Washington, DC: American Psychiatric Association; 1994.
  2. Azzalini A. Logistic regression for autocorrelated data with application to repeated measures (Corr: 97V84 p989) Biometrika. 1994;81:767–775.
  3. Fitzmaurice GM. A caveat concerning independence estimating equations with multivariate binary data. Biometrics. 1995;51:309–317.
  4. Hartung C, Willcutt E, Lahey B, Pelham W, Loney J, Stein M, Keenan K. Sex differences in young children who meet criteria for attention deficit hyperactivity disorder. J Clin Child Adolesc Psychol. 2002;31:453–464. doi: 10.1207/S15374424JCCP3104_5.
  5. Heagerty PJ. Marginally specified logistic-normal models for longitudinal binary data. Biometrics. 1999;55:688–698. doi: 10.1111/j.0006-341x.1999.00688.x.
  6. Heagerty PJ. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics. 2002;58:342–351. doi: 10.1111/j.0006-341x.2002.00342.x.
  7. Heagerty PJ, Kurland BF. Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika. 2001;88:973–985.
  8. Lahey B, Pelham W, Stein M, Loney J, Trapani C, Nugent K, Kipp H, Schmidt E, Lee S, Cale M, Gold E, Hartung C, Willcutt E, Baumann B. Validity of DSM-IV attention-deficit/hyperactivity disorder for younger children. J Am Acad Child Adolesc Psychiatry. 1998;37:695–702. doi: 10.1097/00004583-199807000-00008.
  9. Lahey BB, Gordon RA, Loeber R, Stouthamer-Loeber M, Farrington DP. Boys who join gangs: A prospective study of predictors of first gang entry. Journal of Abnormal Child Psychology. 1999;27:261–276. doi: 10.1023/b:jacp.0000039775.83318.57.
  10. Lee A, McMurchy L, Scott A. Re-using data from case-control studies. Stat Med. 1997;16:1377–1389. doi: 10.1002/(sici)1097-0258(19970630)16:12<1377::aid-sim557>3.0.co;2-k.
  11. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
  12. Liang K-Y, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data (Disc: P24-40) Journal of the Royal Statistical Society, Series B: Methodological. 1992;54:3–24.
  13. Mancl LA, Leroux BG. Efficiency of regression estimates for clustered data. Biometrics. 1996;52:500–511.
  14. Neuhaus J, Scott AJ, Wild CJ. The analysis of retrospective family studies. Biometrika. 2002;89:23–37.
  15. Neuhaus JM, Jewell NP. The effect of retrospective sampling on binary regression models for clustered data. Biometrics. 1990;46:977–990.
  16. Neuhaus JM, Kalbfleisch JD, Hauck WW. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review. 1991;59:25–35.
  17. Neuhaus JM, Scott AJ, Wild CJ. Family-specific approaches to the analysis of case-control family data. Biometrics. 2006;62:488–494. doi: 10.1111/j.1541-0420.2005.00450.x.
  18. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics: Simulation and Computation. 1994;23:939–951.
  19. Qaqish BF, Zhou H, Cai J. On case-control sampling of clustered data. Biometrika. 1997;84:983–986.
  20. Schildcrout J, Heagerty P. Regression analysis of longitudinal binary data with time-dependent environmental covariates: bias and efficiency. Biostatistics. 2005;6:633–652. doi: 10.1093/biostatistics/kxi033.
  21. Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. doi: 10.1093/biostatistics/kxn006.
  22. Ten Have TR, Kunselman AR, Tran L. A comparison of mixed effects logistic regression models for binary response data with two nested levels of clustering. Stat Med. 1999;18:947–960. doi: 10.1002/(sici)1097-0258(19990430)18:8<947::aid-sim95>3.0.co;2-b.
  23. Whittemore AS. Logistic regression of family data from case-control studies (Corr: 97V84 p989-990) Biometrika. 1995;82:57–67.
  24. Zeger SL, Liang K-Y, Albert PS. Models for longitudinal data: A generalized estimating equation approach (Corr: V45 p347) Biometrics. 1988;44:1049–1060.
  25. Zhao LP, Hsu L, Holte S, Chen Y, Quiaoit F, Prentice RL. Combined association and aggregation analysis of data from case-control family studies. Biometrika. 1998;85:299–315.
