Author manuscript; available in PMC: 2023 Mar 25.
Published in final edited form as: Biometrics. 2021 Oct 12;79(1):332–343. doi: 10.1111/biom.13571

Generalized case-control sampling under generalized linear models

Jacob M Maronge 1, Ran Tao 2,3, Jonathan S Schildcrout 2, Paul J Rathouz 4
PMCID: PMC9358725  NIHMSID: NIHMS1826318  PMID: 34586638

Abstract

A generalized case-control (GCC) study, like the standard case-control study, leverages outcome-dependent sampling (ODS) to extend to nonbinary responses. We develop a novel, unifying approach for analyzing GCC study data using the recently developed semiparametric extension of the generalized linear model (GLM), which is substantially more robust to model misspecification than existing approaches based on parametric GLMs. For valid estimation and inference, we use a conditional likelihood to account for the biased sampling design. We describe analysis procedures for estimation and inference for the semiparametric GLM under a conditional likelihood, and we discuss problems with estimation and inference under a conditional likelihood when the response distribution is misspecified. We demonstrate the flexibility of our approach over existing ones through extensive simulation studies, and we apply the methodology to an analysis of the Asset and Health Dynamics Among the Oldest Old study, which motivates our research. The proposed approach yields a simple yet versatile solution for handling ODS in a wide variety of possible response distributions and sampling schemes encountered in practice.

Keywords: conditional likelihood, efficiency, generalized case-control studies, generalized linear models, outcome-dependent sampling

1 |. INTRODUCTION

With the growing availability of electronic health records and publicly available research databases, researchers who seek to answer novel medical and public health questions may have access to outcome (e.g., disease status or a quantitative phenotype) and covariate data, but may be missing the key exposure variable (e.g., a biomarker derived from an assay or a manual electronic health record review). Often, such exposure ascertainment is expensive and cannot be completed on all members of a cohort. It is well known that in many such “expensive-exposure” scenarios, random sampling can be inefficient for estimating exposure-outcome associations. It can therefore be beneficial to oversample portions of the population via outcome-dependent sampling (ODS). The most common ODS design is the case-control study, which became popular in the 1920s and led to successes in finding associations between lip cancer and pipe smoking (Broders, 1920), breast cancer and reproductive history (Lane-Claypon, 1926), and oral cancer and pipe smoking (Lombard and Doering, 1928). Methods of analysis for case-control studies are well established (Cornfield, 1951; Anderson, 1972; Prentice and Pyke, 1979; Breslow and Day, 1980; Breslow, 1996), and generally rely on the logistic regression model. Although extensions of case-control sampling designs to nonbinary distributions have been proposed (Lawless et al., 1999; Scott and Wild, 2011), doing so in a unified way has been challenging under the standard generalized linear model (GLM; McCullagh and Nelder, 1989) framework because it requires tailoring the implementation to each distinct outcome distribution; each unique response requires a new implementation that can be tedious and computationally burdensome.

To fix ideas, we frame the ODS problem in terms of a two-phase study (Breslow and Day, 1980; Breslow and Cain, 1988; Breslow, 1996; Breslow and Holubkov, 1997; Breslow and Chatterjee, 1999). A typical two-phase study is conducted as follows: in Phase 1, a representative sample of the response (and possibly inexpensive covariates) is collected for a, potentially large, number of observations. After Phase 1, a sampling plan is developed based on observed response information. In Phase 2, using the proposed sampling plan, a subset of Phase 1 observations is selected to collect expensive exposure information. As an example, consider a hypothetical electronic health record setting where the investigator is interested in the relationship between disease status and a (relatively expensive) genetic marker. Available health record information includes phenotype and other standard information, but does not readily contain the genetic marker data. In this example, the data available in the electronic health record are the Phase 1 sample. Based on the observed phenotype, which may have a more complicated structure than binary, we design a sampling scheme to identify the Phase 2 sample in which the genetic marker is ascertained. Following this sampling scheme, we would then collect genetic marker information for selected subjects.

Whereas in some ODS settings, both Phase 1 and Phase 2 data are available for analysis, we restrict our attention here to the commonly observed setting where Phase 2 data are available but Phase 1 data are not. That is, we only have access to the “complete-case” data set (Lawless et al., 1999), and not to Phase 1 responses. Many examples of this are given via the database for genotypes and phenotypes (dbGaP), where much of the available data are from studies only retaining observations selected into Phase 2. There are many reasons why only Phase 2 data would be retained. For example, one might need to obtain consent from patients to include them in publicly available data repositories and that may be more challenging for those not selected into Phase 2. Additionally, there are costs associated with data cleaning, and researchers may not be enthusiastic to clean and manage Phase 1 inexpensive covariates and data from the unsampled subjects in whom Phase 2 exposure data were not collected. An example of a study with only Phase 2 data is the CHARGE study (Lin et al., 2014), where Phase 1 and Phase 2 data were collected, but only Phase 2 data are freely available. Many available methods for ODS assume the inclusion of incomplete Phase 1 data, and are not readily adaptable to the case where only Phase 2 data are available (e.g., Chatterjee et al. (2003); Weaver and Zhou (2005); Tao et al. (2017)).

Unique to the analysis of case-control study data, logistic regression yields valid and efficient inferences on exposure-outcome associations (Cornfield, 1951; Anderson, 1972; Prentice and Pyke, 1979; Breslow and Day, 1980; Breslow, 1996) even if we ignore the case-control sampling plan. In general, however, analyzing the complete case data naive to the ODS sampling scheme is not valid. Lawless et al. (1999) describe the conditional likelihood calculations that explicitly acknowledge the design within the likelihood. Importantly, even though the approach allows for nonbinary responses, it still requires a known parametric distributional form for the response, which may lead to incorrect estimation and inference for parameters of interest if the parametric form is misspecified.

In this report, we describe a unifying framework—in the sense that we have a single analysis strategy for many different forms of responses—for ODS from two-phase studies by leveraging an approach that will appeal to those familiar with GLMs and the classic case-control design. We consider a semiparametric extension to the GLM (SPGLM; Rathouz and Gao, 2009) that provides a full-likelihood alternative to the quasi-likelihood (QL; Wedderburn, 1974; McCullagh, 1983) modeling framework. The SPGLM nonparametrically models the response distribution given covariates, but similar to QL and other estimating function-based approaches, parametrically models the mean of the response given covariates. To highlight the connection to case-control studies, we refer to studies with beyond binary responses as generalized case-control (GCC) studies (Schildcrout et al., 2019). We are focusing on the setting where only Phase 2 data, as well as the sampling plan used to collect these data, are available. We apply the SPGLM framework to analyze data collected from GCC studies. The major strength to our approach is that it allows the experimenter to analyze data from GCC studies in a unified way covering a wide variety of response distributions, while also adding robustness by not having to assume a particular form for that distribution. This robustness is important under GCC sampling because, as we will show, fully parametric approaches can be sensitive to misspecification of the response distribution. As an added feature, the response distribution under ODS takes essentially the same functional form as it does under simple random sampling, thus simplifying the extension from simple random sampling to ODS.

The paper is organized as follows. In Section 2, we review the SPGLM and GCC designs. In Section 3, we develop the estimation and inference procedures under GCC studies using the SPGLM framework. In Section 4, we conduct simulation studies to examine the validity and robustness of the SPGLM approach relative to an analogous parametric approach under misspecification. Finally, in Section 5, we demonstrate the use of our methods using the AHEAD study, wherein independent variables required extensive interviews by an expert and were therefore time-consuming and expensive to collect. Furthermore, the response for the AHEAD study does not follow one of the standard exponential family forms, meaning that standard parametric assumptions are not easy to implement. We conclude with a discussion and summary.

2 |. MODELING AND SAMPLING FRAMEWORK

2.1 |. Model

Rathouz and Gao (2009) proposed a semiparametric extension of traditional GLMs (McCullagh and Nelder, 1989). As with traditional GLMs, their model focuses on inference with respect to a parametric mean model. However, the SPGLM operates under nonparametric estimation of the response distribution. As such, it provides a full, albeit semiparametric, likelihood alternative to QL. Although the SPGLM has been shown to perform comparably to QL in terms of estimation and inference, the specification of a full likelihood has a number of advantages. These advantages include an objective function for model comparison, the ability to do prediction, and potential robustness under the traditional missing-at-random (MAR) assumption. Conveniently, through application of Bayes’ theorem, the model also adapts easily to ODS designs, the major theme of this paper, and one which we will study in detail.

In contrast to standard GLMs, the SPGLM requires specification of only the mean model (i.e., it does not require specification of the response distribution), yet it maintains the same interpretation of mean model parameters. The user specifies the linear predictor, and then selects the link function that corresponds to the desired model interpretation. Once the mean model is specified, the remaining characteristics of the conditional response distribution are estimated nonparametrically. Using minor distributional assumptions, the model potentially accounts for a large variety of analyses with nontraditional responses without concern for misspecifying the response distribution, yielding a unified approach across a range of distributional forms.

To fix ideas, let Xi represent a p × 1 vector of covariates for the ith subject, and let $Y = (Y_i)_{i=1}^n$ represent an n × 1 vector of observations, where each observation Yi = yi is an independent realization sampled from

$$f(y_i \mid X_i; \beta, f_0) = \exp\{y_i\theta_i - b(\theta_i) + \log f_0(y_i)\}, \tag{1}$$

for yi in support 𝒴. The vector β contains the coefficients for the linear predictor, and f0(·) is a nonparametric reference distribution on 𝒴, to be estimated from the data. However, for fixed f0, (1) is a natural exponential family (Morris, 1982) with canonical parameter θi and cumulant generating function

$$b(\theta; f_0) \equiv b(\theta) = \log \int_{u \in \mathcal{Y}} f_0(u)\, e^{\theta u}\, du.$$

For given f0, we therefore recover the standard results from GLMs, E(Yi|Xi) = b′(θi) and Var(Yi|Xi) = b″(θi). The focus of inference is on the conditional mean of Yi given Xi,

$$g^{-1}(\eta_i) \equiv \mu_i \equiv E(Y_i \mid X_i; \beta) = b'(\theta_i) = \frac{\int_{u \in \mathcal{Y}} u\, f_0(u)\, e^{\theta_i u}\, du}{\int_{u \in \mathcal{Y}} f_0(u)\, e^{\theta_i u}\, du}, \tag{2}$$

where $\eta_i = X_i^T\beta$ is the linear predictor and g(·) is a known link function, strictly increasing on the interval (m, M), m = inf 𝒴 and M = sup 𝒴. An important note is that the free parameters here are β and f0; β determines μi, which, together with f0, determines θi through solution to (2). The resulting score and information for β are given by

$$S(\beta) = \sum_{i=1}^{n} (y_i - \mu_i)\, \frac{1}{b''(\theta_i)\, g'(\mu_i)}\, X_i \tag{3}$$

and

$$\mathcal{I}_{\beta\beta} = \sum_{i=1}^{n} X_i\, \frac{1}{b''(\theta_i)\, \{g'(\mu_i)\}^2}\, X_i^T. \tag{4}$$

For fixed f0, S(β) and $\mathcal{I}_{\beta\beta}$ arise from standard GLM theory.

When f0 is not known, the fitting procedure for this model works via an extension of Fisher scoring by iteratively estimating β and f0 while holding the other parameter fixed; see Wurm and Rathouz (2018). The response distribution for Yi given covariates Xi is then estimated by plugging the estimated mean parameter, $\hat{\beta}$, and reference distribution, $\hat{f}_0$, into (1).
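The dependence of θi on μi and f0 through (2) can be made concrete on a finite support. The following is a minimal sketch, not the authors' fitting procedure: the function names, the bracketing interval for the root search, and the requirement that all entries of f0 be strictly positive are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def tilted_pmf(theta, support, f0):
    """Density (1) on a finite support: proportional to f0(y) * exp(theta * y)."""
    logw = np.log(f0) + theta * support
    logw -= logw.max()              # stabilise before exponentiating
    w = np.exp(logw)
    return w / w.sum()

def mean_from_theta(theta, support, f0):
    """b'(theta): the mean of the tilted distribution, as in (2)."""
    return float(tilted_pmf(theta, support, f0) @ support)

def theta_from_mu(mu, support, f0, lo=-50.0, hi=50.0):
    """Solve b'(theta) = mu; needs min(support) < mu < max(support)."""
    return brentq(lambda t: mean_from_theta(t, support, f0) - mu, lo, hi)
```

Given a fitted mean and reference distribution, `tilted_pmf(theta_from_mu(mu, support, f0), support, f0)` then gives the plug-in response distribution described above.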

2.2 |. Generalized case-control studies

As previously mentioned, case-control sampling is used in studies with retrospective sampling of a binary response. If one level of the response, for example, disease status, is rare, case-control sampling is much more efficient in terms of sample size than an equivalent equal probability sampling (EPS) design, which samples at random with respect to the response given predictors. In principle, there is nothing restricting the case-control concept to binary outcomes. For example, if the response of interest is a count, we may design a study to oversample extreme levels of the count to increase observed covariability between the exposure of interest and response. We refer to such designs as GCC studies, and propose analysis for such studies when the underlying model is a member of the GLM family.

In order that we may precisely define a GCC study, we consider a setting where the researcher provides us with a collection of observations $(Y_i, X_i)_{i=1}^n$, where Xi includes both exposure data and adjustment variables, from Phase 2 of a two-phase study. The challenge is that the exposure data were collected via ODS, encoded by a function ξ(·) on the support of Yi, and defined as

$$P(S_i = 1 \mid Y_i, X_i) = P(S_i = 1 \mid Y_i) = \xi(Y_i) \quad \text{since } S_i \perp X_i \mid Y_i. \tag{5}$$

Here, Si is the sampling indicator for the ith subject. The setup is very similar to the well-known case-control study; however for most scenarios beyond binary response with logistic regression, the investigator must also supply ξ(y) to the analyst.

There are many different ways to define the sampling scheme via the function ξ(·). For example, each level of Yi could receive a unique sampling probability. Alternatively, the investigator may choose to make ξ(·) a smooth function of Yi, or the experimenter may choose to group particular levels of the response, and sample according to

$$\xi(y) = \begin{cases} \xi_1, & y < a \\ \xi_2, & a \le y \le b \\ \xi_3, & b < y, \end{cases}$$

where b > a are cut-points. This flexibility allows experimenters to pick simple designs while still retaining the complexity of the response variable for modeling purposes.
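A three-stratum scheme of this kind, together with the Bernoulli selection implied by (5), can be sketched as follows. The function names and argument layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def xi_piecewise(y, a, b, xi1, xi2, xi3):
    """Three-stratum sampling probability: xi1 below a, xi2 on [a, b], xi3 above b."""
    y = np.asarray(y)
    return np.where(y < a, xi1, np.where(y <= b, xi2, xi3))

def draw_phase2(y, xi, rng):
    """Return a boolean selection vector S with P(S_i = 1 | Y_i) = xi(Y_i)."""
    y = np.asarray(y)
    return rng.random(len(y)) < xi(y)
```

For example, `draw_phase2(y, lambda v: xi_piecewise(v, 1, 3, 0.05, 0.01, 0.5), np.random.default_rng(1))` oversamples responses above 3 while still retaining some observations from every stratum.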

3 |. ESTIMATION AND INFERENCE

3.1 |. Estimation

In this section, we develop estimation and inference for coefficients of the mean model in the SPGLM, accounting for GCC sampling. Beginning with Model (1), sampling based on the observed values of outcome Y induces a new response distribution in the sampled data. To study this induced distribution, we leverage the sampling scheme introduced in (5). Using Bayes’ rule, we obtain the following density for outcome Y conditional on being sampled under ODS,

$$f(y \mid S = 1, X) = \frac{f(y \mid X)\,\xi(y)}{\int_{\mathcal{Y}} f(u \mid X)\,\xi(u)\,du} = \frac{\exp\{y\theta + \log f_0^*(y)\}}{\int_{\mathcal{Y}} \exp\{u\theta + \log f_0^*(u)\}\,du} = \exp\{y\theta - b^*(\theta) + \log f_0^*(y)\}, \tag{6}$$

where we define $f_0^*(y) = f_0(y)\,\xi(y)$ and $b^*(\theta) = \log\left(\int_{\mathcal{Y}} \exp\{u\theta + \log f_0^*(u)\}\,du\right)$. The induced mean under ODS is now $\mu^* = b^{*\prime}(\theta)$, which is defined as

$$b^{*\prime}(\theta_i) = \frac{\int_{u \in \mathcal{Y}} u\, f_0^*(u)\, e^{\theta_i u}\, du}{\int_{u \in \mathcal{Y}} f_0^*(u)\, e^{\theta_i u}\, du},$$

and the induced variance is $b^{*\prime\prime}(\theta)$. It is important to note that $\mu_i$ and $\mu_i^*$ (and the respective variances) are with respect to different distributions: $\mu_i$ is with respect to the response distribution under EPS, and $\mu_i^*$ is with respect to the induced distribution from performing GCC sampling.
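On a finite support the integrals in (6) become sums, so the induced distribution and its first two moments are easy to compute. The sketch below is illustrative only: `induced_moments` is a hypothetical name, and the inputs are assumed to be arrays on a common support.

```python
import numpy as np

def induced_moments(theta, support, f0, xi):
    """Induced pmf (6), mean b*'(theta) and variance b*''(theta) on a finite support."""
    f0_star = f0 * xi                              # f0*(y) = f0(y) * xi(y)
    w = f0_star * np.exp(theta * support)
    p_star = w / w.sum()                           # normalising constant is exp(b*(theta))
    mu_star = float(p_star @ support)
    var_star = float(p_star @ (support - mu_star) ** 2)
    return p_star, mu_star, var_star
```

Setting `xi` to a vector of ones returns the EPS quantities, whereas a design that favors the extremes of the support increases the induced variance relative to EPS.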

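Focusing for the moment on estimation of β with fixed f0(y), and using (6), the conditional log-likelihood function under ODS is

$$l(\beta) = \log\left[\prod_{i=1}^{n} \exp\{y_i\theta_i - b^*(\theta_i) + \log f_0^*(y_i)\}\right] = \sum_{i=1}^{n} \{y_i\theta_i - b^*(\theta_i) + \log f_0^*(y_i)\}.$$

To derive the score for β, we recall that $\eta_i = X_i^T\beta$ and g(μi) = ηi and apply the chain rule to obtain

$$S(\beta) = \frac{\partial l}{\partial \beta} = \sum_{i=1}^{n} \frac{\partial l_i}{\partial \theta_i}\, \frac{\partial \theta_i}{\partial \mu_i}\, \frac{\partial \mu_i}{\partial \eta_i}\, \frac{\partial \eta_i}{\partial \beta},$$

where $\partial l_i / \partial \theta_i = \{y_i - b^{*\prime}(\theta_i)\} = (y_i - \mu_i^*)$, $\partial \theta_i / \partial \mu_i = \mathrm{Var}(Y_i \mid X_i)^{-1} = 1/b''(\theta_i)$, $\partial \mu_i / \partial \eta_i = \{g'(\mu_i)\}^{-1}$, and $\partial \eta_i / \partial \beta = X_i$. Note that the derivatives involving the mean are taken with respect to the mean under EPS ($\mu_i$), not the induced mean under GCC sampling ($\mu_i^*$). Finally, we obtain the score for β under ODS,

$$S(\beta) = \sum_{i=1}^{n} (y_i - \mu_i^*)\, \frac{1}{b''(\theta_i)\, g'(\mu_i)}\, X_i. \tag{7}$$

Note that the score in (7), obtained under ODS, closely resembles the score for β under EPS shown in (3), the only difference being that $\mu_i$ is replaced with $\mu_i^*$. Next, we consider the information for β under ODS. Using standard information calculations,

$$\mathcal{I}_{\beta\beta} = \sum_{i=1}^{n} \frac{b^{*\prime\prime}(\theta_i)}{b''(\theta_i)}\, X_i\, \frac{1}{b''(\theta_i)\, \{g'(\mu_i)\}^2}\, X_i^T. \tag{8}$$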

Notice that in (8), the information involves a ratio of Var(Yi|Xi) under GCC sampling and the corresponding value under EPS, namely, b∗″(θi)/b″(θi). This gives us reason to believe that if we use GCC sampling to sample in such a way that the empirical variance of Y (given covariates) is increased relative to under EPS, we will realize increased efficiency for estimating β. This is only approximately true, as we will see, owing to the need to also estimate the reference distribution f0. However, this ratio is the driving factor to determine efficiency gains from GCC sampling compared to equivalent studies sampling at random with respect to the response.
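To see how the variance ratio enters, the score (7) and information (8) can be assembled from per-subject quantities. This is a sketch under the assumption that the vectors of induced means, EPS variances, induced variances, and link derivatives have already been computed for the current β and f0; the function and argument names are hypothetical.

```python
import numpy as np

def ods_score_info(X, y, mu_star, var_eps, var_star, g_prime):
    """Score (7) and information (8) for beta, with f0 held fixed.

    X        : (n, p) design matrix
    mu_star  : induced means b*'(theta_i)
    var_eps  : EPS variances b''(theta_i)
    var_star : induced variances b*''(theta_i)
    g_prime  : link derivatives g'(mu_i)
    """
    w = 1.0 / (var_eps * g_prime)                      # 1 / {b''(theta_i) g'(mu_i)}
    score = X.T @ ((y - mu_star) * w)
    ratio = var_star / var_eps                         # efficiency-driving ratio in (8)
    info = (X * (ratio / (var_eps * g_prime ** 2))[:, None]).T @ X
    return score, info
```

When `var_star` equals `var_eps` and `mu_star` equals the EPS mean, these expressions reduce to (3) and (4).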

Under EPS, the SPGLM has an interesting feature where the parameters β in the mean model are orthogonal to the parameters for estimating the reference distribution f0 (Huang and Rathouz, 2017). Examining (7), we can see that $(y_i - \mu_i^*)$ is a function of f0(y) and as such the orthogonality between β and f0(y) under EPS no longer exists under ODS. As is true with all nuisance parameter problems, this implies that estimation of f0(y) under ODS will now affect the asymptotic variance of $\hat{\beta}$. However, this also creates another issue in the context of GCC sampling; note that the mean model is being estimated with respect to β, and consequently, $\mu_i$, but the score function for β using the conditional likelihood is a function of $\mu_i^*$. This means when solving (7) equal to zero, we must use the sampling plan and the estimated response distribution to translate $\mu_i$ to $\mu_i^*$. As the only difference in (7) for parametric GLMs is that f0 is assumed to have a particular parametric form, this means that misspecifying f0 leads to potential bias (as well as incorrect standard errors) for β under fully-parametric GLMs. As shown in Section 4, the SPGLM is flexible enough to eliminate this bias in many cases. In addition to flexibility, the SPGLM is easier to implement than likelihoods that assume an unbounded support for the response because the integral in the denominator of (6) becomes a sum over the observed support. In the next section, we show how to obtain correct inferences using the SPGLM with a conditional likelihood.

3.2 |. Inference

We will focus on inference with respect to the coefficients in the mean model. Although we have already calculated an information matrix for β, we do not yet have a method to conduct correct inferences for β because these calculations do not reflect simultaneous estimation of the parameter f0 under ODS. To address this issue, we note that the effective information for β in the presence of estimating f0 is

$$\mathcal{I}^{\mathrm{eff}}_{\beta\beta} = \mathcal{I}_{\beta\beta} - \mathcal{I}_{\beta f_0}\, \mathcal{I}_{f_0 f_0}^{-1}\, \mathcal{I}_{f_0 \beta} \tag{9}$$

(see, for example, Shao, 2003). Above, $\mathcal{I}_{\beta\beta}$ represents the β corner of the joint information matrix of β and f0. A similar interpretation holds for $\mathcal{I}_{f_0 f_0}$, and $\mathcal{I}_{\beta f_0}$ represents the cross-information between the two parameters. As previously stated, in the case of ODS, the cross-information between β and f0 is nonzero, and therefore, the orthogonality between f0 and β available under EPS is not obtained under ODS.

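We restrict the inference to the case where Y has finite support, denoted $\mathcal{Y} = (s_1, \ldots, s_K)^T$. Similarly, let $f_0^* = (f_1^*, \ldots, f_K^*)^T$, where the mth entry of $f_0^*$ is the induced reference distribution evaluated at $s_m$. Now, we write the empirical (conditional) likelihood (see Owen, 1991) for f0,

$$l(f_0^*) = \sum_{i=1}^{n} \left\{ y_i\theta_i - \log\left(\sum_{k=1}^{K} f_k^*\, e^{\theta_i s_k}\right) + \sum_{k=1}^{K} I(y_i = s_k) \log(f_k^*) \right\}.$$

We denote the mth component of the score for f0(y) by S(fm) and use the chain rule,

$$S(f_m) = \frac{\partial l}{\partial f_m^*}\, \frac{\partial f_m^*}{\partial f_m} + \frac{\partial l}{\partial \theta}\, \frac{\partial \theta}{\partial f_m},$$

yielding

$$S(f_m) = \sum_{i=1}^{n} \left\{ \frac{I(y_i = s_m)}{f_m} - \frac{e^{\theta_i s_m}}{e^{b^*(\theta_i)}}\, \xi(s_m) - (y_i - \mu_i^*)\, \frac{e^{\theta_i s_m}(s_m - \mu_i)}{e^{b(\theta_i)}\, b''(\theta_i)} \right\}. \tag{10}$$

Then, the maximum empirical likelihood estimate is the solution to S(fm) = 0, ∀m, and the resulting information is

$$\begin{aligned}
\mathcal{I}_{f_k f_m} = \sum_{i=1}^{n} \Bigg[ & I(k = m)\, \frac{e^{\theta_i s_k}\, \xi(s_k)}{e^{b^*(\theta_i)}\, f_k} - \frac{e^{\theta_i (s_k + s_m)}\, \xi(s_k)\, \xi(s_m)}{e^{2 b^*(\theta_i)}} \\
& - \frac{(s_k - \mu_i^*)(s_m - \mu_i)\, e^{\theta_i (s_k + s_m)}\, \xi(s_k)}{e^{b^*(\theta_i)}\, e^{b(\theta_i)}\, b''(\theta_i)} - \frac{(s_k - \mu_i)(s_m - \mu_i^*)\, e^{\theta_i (s_k + s_m)}\, \xi(s_m)}{e^{b^*(\theta_i)}\, e^{b(\theta_i)}\, b''(\theta_i)} \\
& + \frac{(s_k - \mu_i)(s_m - \mu_i)\, e^{\theta_i (s_k + s_m)}\, b^{*\prime\prime}(\theta_i)}{e^{2 b(\theta_i)}\, \{b''(\theta_i)\}^2} \Bigg].
\end{aligned}$$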

See Online Appendix A for score and information calculations.

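Finally, the resulting cross-information is given by

$$\mathcal{I}_{\beta f_m} = \sum_{i=1}^{n} X_i\, \frac{e^{\theta_i s_m}}{g'(\mu_i)\, b''(\theta_i)} \times \left\{ \frac{(s_m - \mu_i^*)\, \xi(s_m)}{e^{b^*(\theta_i)}} - \frac{(s_m - \mu_i)\, b^{*\prime\prime}(\theta_i)}{e^{b(\theta_i)}\, b''(\theta_i)} \right\}. \tag{11}$$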

Note that a term without the star notation is the corresponding term as if the data were sampled at random with respect to the response. The first thing we notice about (11) is that if we reduce this equation to the EPS setting, the term in brackets equals zero, which matches the expected result.
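As a brief check of this reduction, set ξ(s) = 1 for every s, so that each starred quantity coincides with its unstarred counterpart. Writing the bracketed term out from the definitions in Section 3.1, the two pieces cancel:

```latex
\[
\frac{(s_m-\mu_i^*)\,\xi(s_m)}{e^{b^*(\theta_i)}}
-\frac{(s_m-\mu_i)\,b^{*\prime\prime}(\theta_i)}{e^{b(\theta_i)}\,b''(\theta_i)}
\;\xrightarrow{\;\xi\equiv 1\;}\;
\frac{s_m-\mu_i}{e^{b(\theta_i)}}
-\frac{(s_m-\mu_i)\,b''(\theta_i)}{e^{b(\theta_i)}\,b''(\theta_i)}
=0 .
\]
```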

Due to a technical issue regarding identifiability, the model in (1) has three constraints imposed on the reference distribution, which requires us to calculate a “constrained” information for f0. In essence, this requires us to perform a projection of the information into a subspace that accounts for these constraints. However, this does not have a meaningful impact on the interpretation of our results, and is therefore left to Online Appendix B.

Because we restrict our setting to the case where the response has finite support, we may use the inverse effective information (9) for β and standard maximum likelihood results to construct asymptotically correct standard errors for β under ODS. Together with the results for estimation in Section 3.1, we can conduct asymptotically correct analysis for ODS from an arbitrary response distribution with a unified approach.
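Numerically, (9) is a Schur complement, and standard errors follow from the inverse of the result. The sketch below assumes the three information blocks have already been computed and that the f0 block is invertible, that is, the identifiability constraints have already been handled; the function name is hypothetical.

```python
import numpy as np

def effective_se(I_bb, I_bf, I_ff):
    """Effective information (9) for beta and the implied standard errors.

    I_bb : (p, p) information for beta
    I_bf : (p, K) cross-information between beta and f0
    I_ff : (K, K) information for f0, assumed invertible here
    """
    I_bb, I_bf, I_ff = (np.atleast_2d(np.asarray(a, dtype=float))
                        for a in (I_bb, I_bf, I_ff))
    I_eff = I_bb - I_bf @ np.linalg.solve(I_ff, I_bf.T)
    se = np.sqrt(np.diag(np.linalg.inv(I_eff)))
    return I_eff, se
```

With zero cross-information the standard errors equal those implied by the β block alone; any nonzero cross-information can only increase them.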

4 |. SIMULATIONS

4.1 |. Zero-inflated truncated poisson

A main strength of our approach is that we simultaneously estimate the response distribution given covariates in the population, while also accounting for GCC sampling. This gives us a valid and robust method for handling data from GCC studies under many response distributions. Our goal with the present simulation study is to assess how the SPGLM and traditional GLM approaches under GCC work in cases where standard model distributional assumptions (e.g., Poisson, Binomial) are violated. We also want to assess if there is a penalty in efficiency for estimating the response distribution using the SPGLM in cases where a standard GLM is correctly specified. To study this, we begin with a mean model of the form

$$\log \mu_i = \beta_0 + \beta_1 x_i, \tag{12}$$

where each yi is an independent sample from model (1). As a reference distribution, we consider two cases. The first is a modified Poisson with rate parameter equal to 1, truncated at y = 5, and with three times the original mass on y = 0. We refer to this as a zero-inflated, truncated Poisson. This distribution is intended to represent data one might encounter in practice, in the sense that it is an overdispersed count of a finite number of items, but is not well captured with standard GLMs. Furthermore, this data-generating model is similar to the “instrumental activities of daily living” from the AHEAD study. For the second case, we simply generate data from a Poisson distribution with mean μi. For each setting, we let $X_i \sim \mathrm{Unif}(-\sqrt{12}/4, \sqrt{12}/4)$; these bounds set sd(Xi) = 1/2. Finally, we set β0 = 0.2, let β1 = 0.2 or 0.7, and allow sample size to vary as shown in Table 1. We consider a two-phase design that samples 100,000 observations at random during Phase 1, and then uses GCC sampling to sample, on average, equal numbers from two response strata: y < 4 (controls) and y ≥ 4 (cases), where within each stratum, we sample, on average, equal numbers from each level of the response. For example, if we sample 1000 observations in total, approximately 500 observed values will satisfy y < 4 and approximately 500 will satisfy y ≥ 4. Within the “controls” stratum, approximately 125 will come from each level of y, whereas within the “cases” stratum, approximately 250 will come from each level of y. The prevalence of cases in the population of the zero-inflated, truncated Poisson is approximately 6% in the small effect setting (when β1 = 0.2) and 8% in the large effect setting (when β1 = 0.7), and slightly larger in the standard Poisson setting.
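The reference distribution and the stratified design can be sketched as follows. The construction of f0 follows the description above; the helper `gcc_sampling_probs`, which converts Phase 1 counts per response level into sampling probabilities, is an assumption about one way to realize the stated expected counts and is not taken from the paper.

```python
import numpy as np
from math import exp, factorial

# Reference distribution: Poisson(1) mass on {0, ..., 5} with triple mass at zero.
support = np.arange(6)
f0 = np.array([exp(-1.0) / factorial(int(k)) for k in support])
f0[0] *= 3.0
f0 /= f0.sum()

def gcc_sampling_probs(phase1_counts, n, cut=4):
    """Per-level sampling probabilities xi(y) for a two-stratum GCC design.

    Hypothetical helper: targets n/2 expected draws in each stratum
    (y < cut and y >= cut), split equally across the levels inside it.
    Assumes every Phase 1 count is positive.
    """
    counts = np.asarray(phase1_counts, dtype=float)
    is_case = np.arange(len(counts)) >= cut
    target = np.where(is_case, (n / 2) / is_case.sum(), (n / 2) / (~is_case).sum())
    return np.minimum(target / counts, 1.0)
```

With n = 1000 and six response levels, the expected Phase 2 counts are 125 for each of y = 0, 1, 2, 3 and 250 for each of y = 4, 5, provided Phase 1 contains at least that many observations at each level.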

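TABLE 1.

Results for estimation and inference from simulation study described in Section 4.1 (only β1). We show results from 2000 replicates with varying sample size and effect size of predictor, as well as two different data-generating models/analysis strategies

| Sample size | Data-generating model | Analysis strategy | True value | AE | ESE | AESE | CP |
|---|---|---|---|---|---|---|---|
| n = 100 | Poisson GLM | Poisson GLM | 0.2 | 0.206 | 0.120 | 0.117 | 0.950 |
| n = 100 | Poisson GLM | SPGLM | 0.2 | 0.208 | 0.125 | 0.121 | 0.953 |
| n = 100 | Zero-inflated, truncated Poisson | Poisson GLM | 0.2 | 0.166 | 0.132 | 0.122 | 0.922 |
| n = 100 | Zero-inflated, truncated Poisson | SPGLM | 0.2 | 0.204 | 0.164 | 0.163 | 0.956 |
| n = 500 | Poisson GLM | Poisson GLM | 0.2 | 0.199 | 0.0518 | 0.0516 | 0.952 |
| n = 500 | Poisson GLM | SPGLM | 0.2 | 0.200 | 0.0531 | 0.0528 | 0.955 |
| n = 500 | Zero-inflated, truncated Poisson | Poisson GLM | 0.2 | 0.161 | 0.0589 | 0.0537 | 0.854 |
| n = 500 | Zero-inflated, truncated Poisson | SPGLM | 0.2 | 0.198 | 0.0730 | 0.0720 | 0.946 |
| n = 1000 | Poisson GLM | Poisson GLM | 0.2 | 0.200 | 0.0368 | 0.0364 | 0.947 |
| n = 1000 | Poisson GLM | SPGLM | 0.2 | 0.200 | 0.0378 | 0.0372 | 0.944 |
| n = 1000 | Zero-inflated, truncated Poisson | Poisson GLM | 0.2 | 0.161 | 0.0420 | 0.0379 | 0.794 |
| n = 1000 | Zero-inflated, truncated Poisson | SPGLM | 0.2 | 0.199 | 0.0520 | 0.0508 | 0.941 |
| n = 100 | Poisson GLM | Poisson GLM | 0.7 | 0.712 | 0.139 | 0.141 | 0.959 |
| n = 100 | Poisson GLM | SPGLM | 0.7 | 0.713 | 0.156 | 0.156 | 0.951 |
| n = 100 | Zero-inflated, truncated Poisson | Poisson GLM | 0.7 | 0.592 | 0.148 | 0.135 | 0.847 |
| n = 100 | Zero-inflated, truncated Poisson | SPGLM | 0.7 | 0.711 | 0.185 | 0.183 | 0.954 |
| n = 500 | Poisson GLM | Poisson GLM | 0.7 | 0.701 | 0.0629 | 0.0618 | 0.952 |
| n = 500 | Poisson GLM | SPGLM | 0.7 | 0.701 | 0.0702 | 0.0688 | 0.942 |
| n = 500 | Zero-inflated, truncated Poisson | Poisson GLM | 0.7 | 0.581 | 0.0669 | 0.0597 | 0.491 |
| n = 500 | Zero-inflated, truncated Poisson | SPGLM | 0.7 | 0.700 | 0.0821 | 0.0808 | 0.947 |
| n = 1000 | Poisson GLM | Poisson GLM | 0.7 | 0.700 | 0.0448 | 0.0436 | 0.954 |
| n = 1000 | Poisson GLM | SPGLM | 0.7 | 0.700 | 0.0500 | 0.0485 | 0.940 |
| n = 1000 | Zero-inflated, truncated Poisson | Poisson GLM | 0.7 | 0.581 | 0.0465 | 0.0421 | 0.223 |
| n = 1000 | Zero-inflated, truncated Poisson | SPGLM | 0.7 | 0.701 | 0.0576 | 0.0571 | 0.949 |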

Abbreviations: AE, average estimate; ESE, empirical standard error; AESE, average estimated standard error; CP, coverage probability, 95% confidence intervals.

To investigate the robustness property of the semiparametric GLM, we consider two different analysis strategies. First, we analyze the resulting Phase 2 data from the previously described data-generating methods using the SPGLM accounting for GCC sampling. We also analyze the same data with a Poisson GLM (assuming ξ(y) = 0 for y > 5), also accounting for GCC sampling. This is done to (1) investigate the impact of misspecifying the response distribution on the conditional likelihood approach, and (2) characterize the loss of efficiency for estimating the response distribution compared to correctly specifying it. To perform the analysis using the SPGLM, we use the methodology developed in Section 3. To perform the corresponding Poisson analysis, as discussed in Rathouz and Gao (2009), any GLM can be expressed using (1), for example, for a Poisson GLM, $f_0(y) = 1/(e \cdot y!)$. Therefore, instead of estimating f0, we perform the Poisson analysis by supplying the correct f0.
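To see why this reference distribution recovers the Poisson model, write the Poisson mass function with mean λ in the form of (1). The identification of θ and b(θ) below follows directly from the definitions in Section 2.1:

```latex
\[
\frac{\lambda^{y} e^{-\lambda}}{y!}
= \exp\Big\{ y\log\lambda - (\lambda - 1) + \log\frac{e^{-1}}{y!} \Big\},
\qquad
\theta = \log\lambda,\quad
b(\theta) = \log\sum_{u=0}^{\infty}\frac{e^{-1}}{u!}\,e^{\theta u} = e^{\theta} - 1,\quad
f_0(y) = \frac{1}{e\cdot y!},
\]
```

so that b′(θ) = e^θ = λ, and f0 is the Poisson mass function with mean 1.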

In Table 1, we report the empirical mean and empirical standard deviations of the coefficient estimates for 2000 replications, the average estimated standard error, and coverage probability associated with 95% confidence intervals for the slope coefficient. Results regarding the intercept are shown in Table C.1, which is available in Online Appendix C. Results are shown for both data-generating models, as well as both analysis strategies. For the SPGLM, we observe minimal bias in parameter and uncertainty estimates, even for n = 100, and the coverage probabilities approximately obtain their 95% nominal values in each setting. On the other hand, for the Poisson GLM analysis, when there is model misspecification, there is both bias in parameters of interest, as well as incorrect coverage. When the Poisson model is correctly specified, the Poisson GLM model behaves as expected, achieving minimal bias and correct coverage probabilities. Additionally, the SPGLM achieves minimal bias, and only has around a 5% loss of efficiency in the cases with a small effect size, which is when performing ODS has the greatest opportunity to increase efficiency. The largest loss of efficiency is approximately 20% (in the large effect size), but still achieves correct coverage probabilities.
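The efficiency comparison can be reproduced from the empirical standard errors in Table 1 for the correctly specified Poisson setting. The sketch below takes relative efficiency to be the squared ratio of empirical standard errors, which is an assumption about how the quoted percentages were obtained; the function name is hypothetical.

```python
def relative_efficiency(ese_reference, ese_candidate):
    """Variance ratio of the reference estimator to the candidate estimator."""
    return (ese_reference / ese_candidate) ** 2

# Poisson data, Poisson GLM (reference) versus SPGLM (candidate), ESE from Table 1.
small_effect = relative_efficiency(0.0518, 0.0531)   # n = 500,  beta1 = 0.2
large_effect = relative_efficiency(0.0448, 0.0500)   # n = 1000, beta1 = 0.7
```

Under this definition the small-effect value is about 0.95 and the large-effect value about 0.80, in line with the losses of roughly 5% and 20% described above.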

4.2 |. Overdispersed binomial

Next, we aim to investigate the effects of model misspecification in another setting, using an overdispersed binomial distribution. This set of simulations is important because it allows us to vary the amount of model misspecification by varying the amount of overdispersion as a hyperparameter to the data-generating model. This allows us to characterize how sensitive GLM-based approaches are to distributional misspecification. To be specific, data were generated as follows:

  1. Generate $X_i \sim \mathrm{Unif}(-\sqrt{12}/4, \sqrt{12}/4)$.

  2. Calculate $\mu_i = 1/(1 + e^{-\eta_i})$, where ηi = β0 + β1Xi (β0 = −0.6 or 0.2 and β1 = 0.2).

  3. Generate Pi ~ Beta(a, b), where a and b are selected such that E(Pi) = μi and Var(Pi) is selected to give the desired amount of overdispersion (relative to a binomial distribution with m = 9), for example, Var(Pi) = μi(1 − μi)/{α × (m − 1)} yields {(1/α) × 100}% overdispersion.

  4. Generate Yi ~ Bin(m = 9, p = pi), where m and p denote the number of trials and probability of success, respectively.

This results in binomial data with overdispersion, where the amount of overdispersion is governed by the parameter α. In Figure 1, we show the resulting beta-binomial distribution for 25%, 50%, and 100% overdispersion compared to binomial responses without overdispersion from Bin(m = 9, p = μi). The setting with 25% overdispersion is considered a setting with a “very minor” amount of overdispersion. In our simulation, we vary the intercept term as either equal to −0.6 or 0.2 to give a “skewed” or “symmetric” response distribution.
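The four generation steps can be sketched as below. The mapping from α to the Beta parameters uses the identity Var(Pi) = μi(1 − μi)/(a + b + 1), which is a standard property of the Beta distribution rather than something stated in the text, and the function name and defaults are hypothetical.

```python
import numpy as np

def simulate_overdispersed_binomial(n, beta0, beta1, alpha, m=9, rng=None):
    """Draw (x, y) from the beta-binomial scheme in steps 1 to 4.

    Requires alpha * (m - 1) > 1 so that the Beta parameters are positive.
    """
    rng = np.random.default_rng() if rng is None else rng
    half_width = np.sqrt(12) / 4
    x = rng.uniform(-half_width, half_width, size=n)        # step 1
    mu = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))         # step 2
    rho = 1.0 / (alpha * (m - 1))                           # Var(P_i) = mu (1 - mu) rho
    total = 1.0 / rho - 1.0                                 # a + b
    p = rng.beta(mu * total, (1.0 - mu) * total)            # step 3
    y = rng.binomial(m, p)                                  # step 4
    return x, y
```

With α = 1 the variance of Yi is twice the binomial variance mμi(1 − μi), corresponding to the 100% overdispersion setting; α = 4 gives the 25% setting.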

FIGURE 1.

Example of resulting response distribution for varying amount of overdispersion (25%, 50%, and 100%) compared to generating binomial responses with probability equal to μi. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

For these simulations, we generate Phase 1 data at random with respect to the response. In this setting, the Phase 1 data have 1 million observations; this is so that there are enough samples in each level of the response to conduct our GCC sampling plan. We perform GCC sampling such that in each unique level of the response, we expect 50 samples, resulting in an expected sample size of 500. For each setting, we perform two analyses: one assuming a binomial response distribution (as explained in Section 4.1, except now $f_0(y) = \binom{m}{y}(1/2)^m$), and another with our methodology developed for the SPGLM. The analysis strategy is such that we analyze the data according to the mean model

$$\log\left(\frac{\mu_i}{m - \mu_i}\right) = \beta_0 + \beta_1 x_i,$$

with the corresponding distributional assumption, accounting for GCC sampling. The simulation study is replicated 2000 times. We show the results for the slope parameter in Table 2. Results regarding the intercept are shown in Table C.2, which is available in Online Appendix C. In summary, the SPGLM achieves approximately valid estimation and inference in all settings. However, the binomial analysis yields inaccurate results, both in terms of estimation and inference, under GCC sampling.

TABLE 2.

Results for simulation study investigating robustness (only β1), as described in Section 4.2

Symmetry   Data-generating model  Analysis strategy  True value  AE     ESE     AESE    CP
Symmetric  Binomial               Binomial GLM       0.2         0.201  0.0326  0.0327  0.954
Symmetric  Binomial               SPGLM              0.2         0.201  0.0338  0.0341  0.953
Symmetric  25% Overdispersion     Binomial GLM       0.2         0.230  0.0438  0.0383  0.846
Symmetric  25% Overdispersion     SPGLM              0.2         0.204  0.0398  0.0411  0.962
Symmetric  50% Overdispersion     Binomial GLM       0.2         0.246  0.0579  0.0428  0.733
Symmetric  50% Overdispersion     SPGLM              0.2         0.205  0.0489  0.0485  0.946
Symmetric  100% Overdispersion    Binomial GLM       0.2         0.253  0.0766  0.0495  0.690
Symmetric  100% Overdispersion    SPGLM              0.2         0.207  0.0628  0.0634  0.953
Skewed     Binomial               Binomial GLM       0.2         0.201  0.0326  0.0328  0.954
Skewed     Binomial               SPGLM              0.2         0.201  0.0341  0.0345  0.953
Skewed     25% Overdispersion     Binomial GLM       0.2         0.226  0.0466  0.0387  0.852
Skewed     25% Overdispersion     SPGLM              0.2         0.196  0.0418  0.0413  0.940
Skewed     50% Overdispersion     Binomial GLM       0.2         0.241  0.0595  0.0434  0.757
Skewed     50% Overdispersion     SPGLM              0.2         0.197  0.0495  0.0488  0.946
Skewed     100% Overdispersion    Binomial GLM       0.2         0.243  0.0806  0.0499  0.722
Skewed     100% Overdispersion    SPGLM              0.2         0.195  0.0649  0.0636  0.945

Abbreviations: AE, average estimate; ESE, empirical standard error; AESE, average estimated standard error; CP, coverage probability of 95% confidence intervals.
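For readers who want to reproduce summaries of this kind, the sketch below shows one way the four reported quantities could be computed from replicate-level output. It assumes Wald-type intervals (estimate ± 1.96 standard errors), which is our assumption and not a detail stated in this section; the function name and the synthetic inputs are purely illustrative and are not the simulation results.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.959964):
    """AE, ESE, AESE, and CP over simulation replicates, assuming Wald intervals."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    covered = np.abs(estimates - truth) <= z * std_errors
    return {
        "AE": float(estimates.mean()),          # average estimate
        "ESE": float(estimates.std(ddof=1)),    # empirical standard error
        "AESE": float(std_errors.mean()),       # average estimated standard error
        "CP": float(covered.mean()),            # coverage of nominal 95% intervals
    }

# Synthetic, well-calibrated replicates (illustration only, not study output)
rng = np.random.default_rng(7)
fake_est = 0.2 + 0.03 * rng.normal(size=2000)
fake_se = np.full(2000, 0.03)
summary = summarize(fake_est, fake_se, truth=0.2)

# Monte Carlo standard error of a coverage estimate near 0.95 with 2000 replicates
mc_se_cp = float(np.sqrt(0.95 * 0.05 / 2000))
print(summary, mc_se_cp)
```

With 2000 replicates, the Monte Carlo standard error of a coverage probability near 0.95 is about 0.005, which gives a sense of how much the CP column can be expected to fluctuate by chance alone.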

It is important to note that even though the SPGLM achieves correct estimation and inference, the beta-binomial data generated in these simulations are not a member of the assumed model for the SPGLM, as covered in Section 2.1. Even so, the flexibility of the SPGLM captures the distribution sufficiently to yield reliable statistical results. On the other hand, very mild deviations from the binomial distribution yield inaccurate analyses using standard GLM approaches.

5 |. THE AHEAD STUDY

To test our method, we implement GCC sampling using data from the Asset and HEAlth Dynamics among the oldest old (AHEAD) study (Soldo et al., 1997). The AHEAD study, part of the HRS (Health and Retirement Study), is sponsored by the National Institute on Aging (grant number NIA U01AG009740) and is conducted by the University of Michigan. The AHEAD study was a national longitudinal study of individuals aged 70 and older. The objectives of AHEAD were to monitor transitions in physical, functional, and cognitive health, and to study the relationship of late-life changes in health to patterns of dissaving and income. The AHEAD study is of particular interest because an exposure of interest, household net worth, was ascertained by in-depth interview of the subject by a specialist, which was expensive and time-consuming.

For our purposes, we use the complete-case baseline data from 1993, N = 6,441. The variables of interest are: number of instrumental activities of daily living tasks for which the subject reports some difficulty (ranges from zero to five), age, sex, immediate word recall (number of words out of 10 that subjects can list immediately after hearing them read), and categorical values of net worth. We perform two separate analyses, one regressing the number of instrumental activities of daily living on age, sex, immediate word recall, and categorical values of net worth and another analysis regressing immediate word recall on age, sex, and categorical values of net worth. For the analysis with number of instrumental activities of daily living as the outcome, we use a log link function, and for the analysis with immediate word recall as the outcome, we use a logistic link with 11 categories (i.e., g(μi) = log{μi/(10 − μi)}). These analyses are chosen to mirror those from Rathouz and Gao (2009) under EPS. The distribution of the response values is given in Table 3, along with the proposed number of samples and corresponding sampling probabilities for each level. Note that for these candidate designs, we are oversampling levels with relatively low counts compared to the full data.

TABLE 3.

Distribution of response data and proposed GCC sampling design for AHEAD study

Response value          0      1      2      3      4      5      6      7      8      9      10
Number of instrumental activities of daily living
Count                   4,806  1,043  343    157    64     28
Sampling probability    10%    30%    50%    70%    90%    100%
Expected sample size    481    313    172    110    58     28
Immediate word recall
Count                   154    195    526    1,001  1,450  1,355  954    445    196    105    60
Sampling probability    100%   90%    70%    50%    30%    10%    30%    50%    70%    90%    100%
Expected sample size    154    176    368    501    435    136    286    223    137    95     60
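The expected sample sizes in Table 3 are simply the level counts multiplied by the level-specific sampling probabilities. The short Python check below recomputes them from the counts and probabilities in the table; the function and variable names are ours, and rounding half up is an assumption that happens to reproduce the tabulated values.

```python
def expected_sizes(counts, percents):
    """Expected sampled count per response level (count x probability),
    rounded half up using integer arithmetic."""
    return [(c * p + 50) // 100 for c, p in zip(counts, percents)]

iadl_counts = [4806, 1043, 343, 157, 64, 28]
iadl_pct = [10, 30, 50, 70, 90, 100]
recall_counts = [154, 195, 526, 1001, 1450, 1355, 954, 445, 196, 105, 60]
recall_pct = [100, 90, 70, 50, 30, 10, 30, 50, 70, 90, 100]

iadl_expected = expected_sizes(iadl_counts, iadl_pct)
recall_expected = expected_sizes(recall_counts, recall_pct)
print(iadl_expected, sum(iadl_expected))      # per-level and total, first design
print(recall_expected, sum(recall_expected))  # per-level and total, second design
```

Summing the rows gives expected Phase 2 sizes of 1,162 and 2,571 for the two designs, that is, roughly 18% and 40% of the 6,441 subjects in the complete data.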

We first analyze the complete data using the semiparametric GLM under EPS as a gold standard. Then, we select a subsample from the complete data according to the sampling plans in Table 3. We then analyze the sampled data using either the SPGLM or a standard GLM-based analysis, both accounting for the appropriate sampling scheme. In the log-linear model (instrumental activities of daily living as outcome), a Poisson distribution is assumed for the response for the GLM-based analysis. Similarly, a binomial distribution is assumed for the corresponding analysis with immediate word recall as the outcome.
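To illustrate what "accounting for the sampling scheme" involves in the parametric case, the following Python sketch evaluates a conditional log-likelihood for a binomial GLM with logistic link when the sampling probability depends only on the response. It uses the standard form in which each density value is reweighted by its sampling probability and renormalized over the support; the paper's own conditional likelihood is given in Section 3 and is not reproduced here, so the function names, arguments, and toy data below are illustrative assumptions.

```python
import numpy as np
from math import comb

def binomial_logpmf_table(eta, m):
    """Log pmf of Bin(m, expit(eta)) at y = 0..m, one row per observation."""
    y = np.arange(m + 1)
    log_choose = np.log([comb(m, k) for k in y])
    log1pexp = np.logaddexp(0.0, eta)[:, None]      # log(1 + exp(eta))
    return log_choose + y * eta[:, None] - m * log1pexp

def conditional_loglik(beta, y, x, m, samp_prob):
    """Sum over sampled subjects of
    log[ pi(y_i) f(y_i | x_i) / sum_k pi(k) f(k | x_i) ],
    where pi(k) is the sampling probability for response level k."""
    eta = beta[0] + beta[1] * x
    logw = binomial_logpmf_table(eta, m) + np.log(samp_prob)[None, :]
    log_norm = np.logaddexp.reduce(logw, axis=1)    # log of the normalizer
    return float(np.sum(logw[np.arange(len(y)), y] - log_norm))

# Toy data (illustration only): three sampled subjects, m = 9
x = np.array([-1.0, 0.0, 1.5])
y = np.array([2, 5, 9])
beta = np.array([0.2, 0.2])
pi_gcc = np.array([1.0, 0.9, 0.5, 0.2, 0.1, 0.1, 0.2, 0.5, 0.9, 1.0])
print(conditional_loglik(beta, y, x, 9, pi_gcc))
```

When all levels share the same sampling probability, the reweighting cancels and the expression reduces to the ordinary binomial log-likelihood, which is a convenient sanity check on any implementation.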

The results are shown in Tables 4 and 5. For nearly every coefficient, estimates for the SPGLM analysis are closer to the complete data analysis than the equivalent GLM-based analysis; in general, the SPGLM-based methods yield valid estimates for coefficients under GCC sampling, whereas there is a small amount of bias for coefficients of interest under the GLM-based analysis. Additionally, and importantly, there are substantial reductions in standard error due to GCC sampling. For example, in the first analysis, the standard error of the coefficient for the first level of net worth from the EPS analysis is reduced by a factor of 1.37 (0.0780/0.134 divided by the square root of the sampling fraction) by performing GCC sampling. Therefore, the GCC design is almost twice (1.37² = 1.88) as efficient as EPS; similar results hold for other regression parameters. Inspecting the same coefficient in the analysis considering immediate word recall as the response, we see that the standard error is reduced by a factor of 1.20. This efficiency gain is smaller due to a different sampling plan and model; however, there is still an efficiency gain of 1.45 (1.20²). Differences in estimates from the complete data to the corrected approach could be due to issues with mean model specification or violations in the model in (1). It appears that the only parameter that is markedly different is the coefficient of sex in the analysis involving instrumental activities of daily living, and it is possible that the form of the mean model is adequate for age, immediate word recall, and net worth, but the form does not hold as well for comparison of females to males, leading to less consistent results for the sex coefficient.
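The efficiency comparison can be checked with a few lines of Python. The sketch scales the full-data standard error up to the Phase 2 sample size, as if the same number of subjects had been drawn under EPS, and divides by the GCC standard error. It uses the sums of the expected sample sizes in Table 3 (1,162 and 2,571) in place of the realized Phase 2 sizes, which are not reported in this section; that substitution, and the function name, are assumptions made here.

```python
from math import sqrt

def se_reduction(se_full, se_gcc, n_gcc, n_full):
    """Ratio of the SE expected under EPS with n_gcc subjects (full-data SE
    divided by the square root of the sampling fraction) to the GCC SE."""
    se_eps = se_full / sqrt(n_gcc / n_full)
    return se_eps / se_gcc

n_full = 6441
# First level of net worth (netwc:1–24k), Tables 4 and 5
iadl = se_reduction(0.0780, 0.134, 1162, n_full)
recall = se_reduction(0.0413, 0.0547, 2571, n_full)
print(round(iadl, 2), round(iadl ** 2, 2))       # SE reduction, relative efficiency
print(round(recall, 2), round(recall ** 2, 2))
```

For the first analysis this reproduces the reported factors of 1.37 and 1.88. For the word recall analysis it gives an SE reduction close to 1.20 and a relative efficiency slightly below the reported 1.45, a small gap that is plausibly explained by the realized sample size differing from its expectation.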

TABLE 4.

Results from study implementing GCC sampling on AHEAD data with number of instrumental activities of daily living as response. Below we show estimated coefficients and estimated standard error from the full data. We also show estimated coefficients with estimated standard error under SPGLM and GLM-based analyses implementing the GCC sampling plan from Table 3

Coefficient         Full data (n = 6,441)  SPGLM GCC          GLM GCC
                    Estimate (SE)          Estimate (SE)      Estimate (SE)
(Intercept)         −3.606 (0.337)         −3.588 (0.546)     −5.176 (0.307)
age                 0.0496 (0.00389)       0.0492 (0.00634)   0.0620 (0.00359)
sex:female          0.158 (0.0518)         0.0434 (0.0856)    0.195 (0.0470)
immed. word recall  −0.207 (0.0141)        −0.182 (0.0229)    −0.252 (0.0130)
netwc:1–24k         −0.256 (0.0780)        −0.263 (0.134)     −0.355 (0.0718)
netwc:25–74k        −0.450 (0.0800)        −0.450 (0.135)     −0.602 (0.0732)
netwc:75–199k       −0.692 (0.0814)        −0.745 (0.135)     −0.871 (0.0741)
netwc:200k+         −0.763 (0.0899)        −0.859 (0.152)     −0.936 (0.0816)

TABLE 5.

Results from study implementing GCC sampling on AHEAD data with immediate word recall as response. Below we show estimated coefficients and estimated standard error from the full data. We also show estimated coefficients with estimated standard error under SPGLM- and GLM-based analyses implementing the GCC sampling plan from Table 3

Coefficient         Full data (n = 6,441)  SPGLM GCC          GLM GCC
                    Estimate (SE)          Estimate (SE)      Estimate (SE)
(Intercept)         2.220 (0.134)          2.047 (0.172)      1.403 (0.0972)
age                 −0.0393 (0.00167)      −0.0372 (0.00217)  −0.0249 (0.00121)
sex:female          0.215 (0.0188)         0.206 (0.0236)     0.133 (0.0133)
netwc:1–24k         0.279 (0.0413)         0.314 (0.0547)     0.188 (0.0308)
netwc:25–74k        0.388 (0.0401)         0.433 (0.0535)     0.259 (0.0298)
netwc:75–199k       0.549 (0.0387)         0.550 (0.0523)     0.357 (0.0288)
netwc:200k+         0.687 (0.0397)         0.673 (0.0533)     0.440 (0.0293)

6 |. DISCUSSION

The results for the AHEAD study imply that there are potentially huge benefits to planning and performing GCC studies, but it is also of importance to correctly specify the response distribution. We have proposed an asymptotically correct method for analyzing data arising from GCC studies under a wide variety of settings. We developed a consistent estimator in GCC studies, and showed how to obtain correct standard errors and inferences for each of these design cases. Furthermore, we have shown that standard methods are highly sensitive to assumptions regarding response distributions, but our methodology is flexible enough to hold for a wide variety of possible response distributions, whereas standard methods fail. This flexibility allows for a single analysis method covering a large variety of possible response structures.

Now that we have established a unified approach for the analysis of GCC data, future work can address how to select designs within a given setting for this class of studies. For example, a future aim is to give researchers guidelines and tools for selecting the best sampling probabilities across possible response values. Additionally, we may aim to develop methods that use our method, but leverage all the data in Phase 1 via a full-likelihood approach (as opposed to our conditional likelihood approach proposed here) to increase efficiency in parameter estimates of interest when covariate information besides the exposure of interest is available (see, for example, Robins et al., 1995; Chatterjee et al., 2003; Weaver and Zhou, 2005; Tao et al., 2017). Leveraging the remainder of data in Phase 1 requires modeling the distribution of the covariate of interest, which is usually done nonparametrically. The additional Phase 1 data require different methods that are not directly comparable to the conditional likelihood methods described here. It is of interest to extend the SPGLM to this setting because current methods assume that the distribution of response given covariates is fully parametric. Another line of future research can be how to use our work to develop tools to check modeling assumptions from GCC studies. Finally, an additional area of future research may be to extend the SPGLM to account for other complex designs such as multistage and partial questionnaire designs (Wacholder et al., 1994; Whittemore and Halpern, 1997).

Our framework is a natural extension of the case-control study, one of the most commonly used designs in both medicine and public health. Using this novel methodology, practitioners—in particular epidemiologists and clinical investigators—will have more flexibility in experimental design, while also having a tool that yields a similar interpretation to those currently used in practice.

Supplementary Material

ACKNOWLEDGMENTS

Maronge, Schildcrout, and Rathouz’s effort toward this work was supported by NIH grant R01HL094786. Maronge was also supported by the University of Wisconsin - Madison Morse Fellowship. We would like to thank the editor, associate editor, and three referees for their helpful suggestions to improve this work.

Funding information

NIH, Grant/Award Number: R01HL094786; University of Wisconsin - Madison Morse Fellowship

Footnotes

SUPPORTING INFORMATION

Web Appendices and Tables referenced in Sections 3 and 4 are available with this paper at the Biometrics website on Wiley Online Library, as is a markdown file containing code to run the simulations described in Section 4.2. Data from the AHEAD study are available for download via the Health & Retirement Study through the University of Michigan. Software for the proposed methods is available on GitHub at (https://github.com/jmmaronge/gldrm). This package extends the gldrm package available on CRAN by implementing the conditional likelihood described in Section 3.

DATA AVAILABILITY STATEMENT

The data that support the findings in this paper are available in the Health and Retirement Study Repository at (http://hrsonline.isr.umich.edu/index.php?p=avail). These data were derived from the following resources available in the public domain: AHEAD 1993 Core, (http://hrsonline.isr.umich.edu/index.php?p=shoavail&iyear=BC).

REFERENCES

  1. Anderson JA (1972) Separate sample logistic discrimination. Biometrika, 59, 19–35.
  2. Breslow NE (1996) Statistics in epidemiology: the case-control study. Journal of the American Statistical Association, 91, 14–28.
  3. Breslow N & Cain KC (1988) Logistic regression for two-stage case-control data. Biometrika, 75, 11–20.
  4. Breslow NE & Chatterjee N (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms’ tumour prognosis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 48, 457–468.
  5. Breslow N & Day N (1980) Statistical methods in cancer research. Lyon: IARC Scientific Publications, International Agency for Research on Cancer.
  6. Breslow NE & Holubkov R (1997) Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59, 447–461.
  7. Broders AC (1920) Squamous-cell epithelioma of the lip: a study of five hundred and thirty-seven cases. JAMA, 74, 656–664.
  8. Chatterjee N, Chen Y-H & Breslow NE (2003) A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association, 98, 158–168.
  9. Cornfield J (1951) A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix. JNCI: Journal of the National Cancer Institute, 11, 1269–1275.
  10. dbGaP (2006). Database of Genotypes and Phenotypes/National Center for Biotechnology Information, National Library of Medicine (NCBI/NLM) https://www.ncbi.nlm.nih.gov/gap. Accessed April 17, 2021.
  11. HRS. Health and Retirement Study, (AHEAD 1993 Core) public use dataset. (1993) Produced and distributed by the University of Michigan with funding from the National Institute on Aging (grant number NIA U01AG009740). Ann Arbor, MI.
  12. Huang A & Rathouz PJ (2017) Orthogonality of the mean and error distribution in generalized linear models. Communications in Statistics: Theory and Methods, 46, 3290–3296.
  13. Lane-Claypon JE (1926) A Further Report on Cancer of the Breast with Special Reference to Its Associated Antecedent Conditions. London: H.M.S.O.
  14. Lawless JF, Kalbfleisch JD & Wild CJ (1999) Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61, 413–438.
  15. Lin H, Wang M, Brody JA, Bis JC, Dupuis J, Lumley T et al. (2014) Strategies to design and analyze targeted sequencing data: cohorts for heart and aging research in genomic epidemiology (charge) consortium targeted sequencing study. Circulation: Cardiovascular Genetics, 7, 335–343.
  16. Lombard HL & Doering CR (1928) Cancer studies in Massachusetts: habits, characteristics and environment of individuals with and without cancer. New England Journal of Medicine, 198, 481–487.
  17. McCullagh P (1983) Quasi-likelihood functions. Annals of Statistics, 11, 59–67.
  18. McCullagh P & Nelder J (1989) Generalized linear models, 2nd edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series. London: Chapman & Hall.
  19. Morris CN (1982) Natural exponential families with quadratic variance functions. The Annals of Statistics, 10, 65–80.
  20. Owen A (1991) Empirical likelihood for linear models. Annals of Statistics, 19, 1725–1747.
  21. Prentice RL & Pyke R (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
  22. Rathouz PJ & Gao L (2009) Generalized linear models with unspecified reference distribution. Biostatistics, 10, 205–218.
  23. Robins JM, Rotnitzky A & Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, 106–121.
  24. Schildcrout JS, Haneuse S, Tao R, Zelnick LR, Schisterman EF, Garbett SP et al. (2019) Two-phase, generalized case-control designs for the study of quantitative longitudinal outcomes. American Journal of Epidemiology, 189, 81–90.
  25. Scott AJ & Wild CJ (2011) Fitting regression models with response-biased samples. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 39, 519–536.
  26. Shao J (2003) Mathematical Statistics. Springer Texts in Statistics. New York: Springer.
  27. Soldo BJ, Hurd MD, Rodgers WL & Wallace RB (1997) Asset and health dynamics among the oldest old: an overview of the ahead study. The Journals of Gerontology: Series B, 52B, 1–20.
  28. Tao R, Zeng D & Lin D-Y (2017) Efficient semiparametric inference under two-phase sampling, with applications to genetic association studies. Journal of the American Statistical Association, 112, 1468–1476.
  29. Wacholder S, Carroll RJ, Pee D & Gail MH (1994) The partial questionnaire design for case-control studies. Statistics in Medicine, 13, 623–634.
  30. Weaver MA & Zhou H (2005) An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association, 100, 459–469.
  31. Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439–447.
  32. Whittemore AS & Halpern J (1997) Multi-stage sampling in genetic epidemiology. Statistics in Medicine, 16, 153–167.
  33. Wurm MJ & Rathouz PJ (2018) Semiparametric generalized linear models with the gldrm package. The R Journal, 10, 288–307.
