Abstract
When research questions require the use of precious samples, expensive assays or equipment, or labor-intensive data collection or analysis, nested case-control or case-cohort sampling of observational cohort study participants can often reduce costs. The two study designs have similar statistical precision for addressing a singular research question, but case-cohort studies have broader efficiency and superior flexibility. Despite this, case-cohort designs are comparatively underutilized in the epidemiologic literature. Recent advances in statistical methods and software have made analyses of case-cohort data easier to implement, and advances from causal inference, such as inverse probability of sampling weights, have allowed the case-cohort design to be used with a variety of target parameters and populations. To provide an accessible link to this technical literature, we give a conceptual overview of case-cohort study analysis with inverse probability of sampling weights. We show how this general analytic approach can be leveraged to more efficiently study subgroups of interest or disease subtypes, or to examine associations independent of case status. A brief discussion of how this framework could be extended to incorporate other related methodologic applications further demonstrates the broad cost-effectiveness and adaptability of case-cohort methods for a variety of modern epidemiologic applications in resource-limited settings.
Keywords: Case-cohort, nested case-control, observational study designs, inverse probability of sampling weights
Introduction
When designing epidemiologic studies, we are often confronted with tradeoffs between statistical precision, measurement accuracy, and cost.1 Some common examples where such trade-offs occur include studies involving expensive measurement of genetic, physiological or environmental exposure markers, including the rapidly growing “omics” fields of metabolomics, epigenomics, and proteomics. Other types of cost constraints include human capital, which may be required for complex sample preparation, records abstraction, or analysis time, and depletion of limited resources, as could occur if a certain assessment required all of an available sample.
When new data collection is needed, prospective cohort studies are often considered the gold standard of observational research. Such cohort studies are often assembled with multiple research topics in mind (e.g., the Nurses’ Health Studies2, the Southern Community Cohort Study,3 the Sister Study4), but some specific research questions may require the nesting of compact designs within the larger study due to cost constraints. One such commonly used approach is the nested case-control design. In such a design, individuals from the cohort with the outcome of interest are identified; this is typically followed by a selection of a referent group (the “controls”) from among those in the cohort who do not have the specified outcome. Alternatively, case-cohort designs use a random sample of the entire cohort as the comparison group for the cases.5
Here, we summarize the benefits of case-cohort designs in terms of flexibility and cost-effectiveness, and renew a call for their increased usage. Although most of the points we raise here are not novel,6–11 improvements in analytic tools and epidemiologic methods have rendered case-cohort designs more attractive in the time since previous tutorials were published. While recent technical literature provides a solid framework for generalizing the analysis of case-cohort designs, our primary aim is to provide an accessible translation of this literature for applied researchers. Specifically, we focus on how case-cohort designs can be understood within a general sampling framework where approaches like inverse probability weighting can be used to re-weight the study sample to represent a cohort study. For the applied researcher, these recent conceptual and software advances imply that case-cohort designs can be utilized and adapted to efficiently and flexibly address a broader array of study questions than may be currently appreciated.
We also describe how this general sampling framework can be used to extend standard case-cohort designs for assessing subgroup differences (i.e., effect measure modification or disease subtype differences) and to increase study-wide efficiency, while still maintaining compatibility with other epidemiologic study designs for pooled or meta-analyses. We also highlight examples of how case-control studies can be used to estimate population causal effects, which suggest strong advantages for the case-cohort design for making population-based inferences using modern causal inferential approaches.
Cohort studies and the need for more cost-efficient designs
Experiments and clinical trials can be used to assess how a limited number of exposures are associated with one or more outcomes. Case-control studies, in contrast, are usually limited to a single outcome but can consider multiple exposures. Cohort studies tend to offer more flexibility than the other designs, allowing researchers to assess the relationship between many different exposure-outcome combinations through the collection of data on a large number of variables (exposures, covariates, outcomes) at enrollment and during follow-up. Cohorts are usually sampled so that they represent a specific, identifiable population (e.g. women who identify as Black in the US or US adults over age 65 with no history of cancer), and often include the collection of environmental samples (e.g. house dust, air quality measures) or biospecimens (e.g. blood, urine, saliva, toenails, placenta) for genetic and/or molecular epidemiology studies.
When assessing relationships between exposures and health outcomes that occur at specific, known event times, investigators often report hazard ratios (HRs) and confidence intervals for the association of interest. HRs are often estimated in a Cox proportional hazards model via maximum partial likelihood.12 Such models are semi-parametric, meaning that under the assumption that the estimated parameters are constant over time, a baseline hazard distribution need not be specified to obtain valid HRs. Cox models can be implemented in all standard statistical software packages and can easily accommodate additional covariates, including time-varying ones, and different time scale specifications (e.g., age time, calendar time, or time on study). Accelerated failure time models can be a useful parametric alternative for assessing time-to-event outcomes.13,14 If investigators are less interested in when, specifically, the outcome occurs, it is also sometimes conceivable to treat study outcomes as binary measures at a fixed time point (e.g. diagnosis within five years of enrollment), and use parametric regression to estimate a contrast of cumulative risks, including risk differences, risk ratios, or odds ratios (ORs).
Each of these estimands could potentially be obtained through efficient, retrospective sampling and measurement designs that minimize random error. However, in epidemiologic analyses, systematic bias is often a greater concern than random error. Broadly, this concern has led to a preference for “prospective” cohort designs, where exposure and other relevant confounders can be assessed prior to when the outcome(s) occur. More specific advantages of prospective cohort studies include avoidance of reverse causality (when the symptoms of the disease or the disease itself contributes to changes in exposure or its measurement) and recall bias (differential reporting of exposure according to case status).8,15 Prospective cohort studies may still be prone to selection bias, which occurs when the included participants are different from the target population, or when certain individuals have missing or censored covariate or outcome data. However, because all included participants are recruited prior to determining exposure, outcome, and covariate status, there is less concern that these biases are differential between comparison groups.
Given these advantages, there is a clear need for study designs that can make use of prospectively collected cohort data, even when financial or time constraints mean that it is not possible to get all of the desired information from everyone in the sample.
Nested case-control studies
Nested case-control studies were developed to address this need, yet retain a satisfactory level of statistical precision. Here we define “nested case-control studies” to mean studies that include individuals specifically selected because they (1) already experienced the outcome of interest (“cases”), or (2) are known to be unaffected by the outcome of interest at the time of sampling (“controls”). The term “nested” is used to indicate that the sample is embedded within a larger prospective cohort and that the temporal relationship between exposure and outcome is thereby preserved.
Controls are often identified via risk-set sampling,15 such that at the age/time each case develops, one or more non-cases currently contributing person-time are selected as matched control(s). Alternatively, investigators can choose a “cumulative” design,15 where controls are selected as individuals who do not develop the outcome over a fixed time period of interest (e.g. those who have not developed the outcome before age 50, or within the five years since enrollment). In both designs, individuals who develop the outcome at future time points are eligible to serve as controls.
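To make the risk-set sampling step concrete, the following is a minimal sketch in Python, assuming a time-on-study scale with no late entry. The function name `risk_set_sample` and the record keys (`id`, `time`, `event`) are hypothetical choices for illustration and do not come from any published protocol.

```python
import random

def risk_set_sample(cohort, m=1, seed=0):
    """Pick up to m controls per case from those still at risk at the case's event time.

    cohort: list of dicts with hypothetical keys
            'id', 'time' (event or censoring time), 'event' (1 = case, 0 = censored).
    Returns a list of (case_id, [control_ids]) matched sets.
    """
    rng = random.Random(seed)
    cases = sorted((p for p in cohort if p["event"] == 1), key=lambda p: p["time"])
    matched_sets = []
    for case in cases:
        # At risk = still under follow-up at the case's event time, excluding the case.
        # Individuals who become cases later are eligible, since they are event-free now.
        at_risk = [p["id"] for p in cohort
                   if p["time"] >= case["time"] and p["id"] != case["id"]]
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        matched_sets.append((case["id"], controls))
    return matched_sets

# Toy cohort: everyone enters at time 0 (an assumption of this sketch).
cohort = [
    {"id": 1, "time": 2.0, "event": 1},
    {"id": 2, "time": 5.0, "event": 0},
    {"id": 3, "time": 4.0, "event": 1},
    {"id": 4, "time": 1.0, "event": 0},
    {"id": 5, "time": 6.0, "event": 0},
]
sets = risk_set_sample(cohort, m=2)
```

In this toy example, participant 3 (a later case) may be drawn as a control for participant 1, while participant 4, who was censored before the first event, is never eligible.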
Data from nested case-control studies are typically analyzed using conditional or unconditional logistic regression with adjustment for covariates, as needed. If risk-set sampling is used, an incidence rate ratio can be estimated using conditional logistic regression.15 In unmatched scenarios, incidence odds ratio or cumulative incidence odds ratio can be estimated using unconditional logistic regression, with the latter approximating cumulative incidence ratios when the outcome is rare. It has also long been known that such studies can also be used to estimate cumulative incidence (and thus risk differences),16 but this approach relies on external data on population risk that may not be available. Though initial sample selection can be complex depending on the matching protocol, logistic regression analyses are simple to implement and can be completed using standard statistical software.17–19
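The rare-outcome approximation mentioned above can be illustrated with simple arithmetic. The sketch below uses hypothetical cumulative risks chosen only for illustration; it compares the risk ratio with the odds ratio for a rare and a common outcome.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Ratio of cumulative risks."""
    return risk_exposed / risk_unexposed

def odds(risk):
    """Convert a cumulative risk to odds."""
    return risk / (1.0 - risk)

def odds_ratio(risk_exposed, risk_unexposed):
    """Ratio of the odds implied by two cumulative risks."""
    return odds(risk_exposed) / odds(risk_unexposed)

# Hypothetical risks: a rare outcome (2% vs 1%) and a common one (40% vs 20%).
for r1, r0 in [(0.02, 0.01), (0.40, 0.20)]:
    print(f"risks {r1:.2f} vs {r0:.2f}: "
          f"RR = {risk_ratio(r1, r0):.2f}, OR = {odds_ratio(r1, r0):.2f}")
```

Both scenarios have a risk ratio of 2.0, but the odds ratio is about 2.02 for the rare outcome and about 2.67 for the common one, which is why the approximation is usually reserved for rare outcomes.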
Case-cohort studies
First developed and described by Prentice,5,20 a case-cohort study is another nested alternative to full cohort analyses that cuts costs but preserves the temporal relationship between exposure and outcome. Here we define case-cohort studies to be those that include (1) a sample of individuals from the cohort (up to 100%)21,22 who have experienced the outcome of interest (“cases”) and (2) a (possibly overlapping) sample of individuals randomly selected from among the members of the full cohort observed at baseline (the “sub-cohort”) (Figure 1A).
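As an illustration of this definition, the sketch below draws a simple case-cohort sample in Python, assuming all cases are selected and the sub-cohort is a fixed-size simple random sample of the baseline cohort. The function name `draw_case_cohort` and the record keys are hypothetical.

```python
import random

def draw_case_cohort(cohort, subcohort_fraction, seed=0):
    """Return the case-cohort sample: a random sub-cohort plus all cases.

    cohort: list of dicts with hypothetical keys 'id' and 'event' (1 = case).
    subcohort_fraction: sampling probability for the sub-cohort (e.g. 0.25).
    """
    rng = random.Random(seed)
    n_sub = round(subcohort_fraction * len(cohort))
    # The sub-cohort is drawn from everyone observed at baseline,
    # without regard to future case status, so it may include cases.
    subcohort_ids = set(rng.sample([p["id"] for p in cohort], n_sub))
    sample = []
    for p in cohort:
        in_sub = p["id"] in subcohort_ids
        if in_sub or p["event"] == 1:  # assumes 100% of cases are selected
            sample.append({**p, "in_subcohort": in_sub})
    return sample

# Toy cohort of 20 people in which every fifth participant becomes a case.
cohort = [{"id": i, "event": int(i % 5 == 0)} for i in range(1, 21)]
sample = draw_case_cohort(cohort, subcohort_fraction=0.25)
```

Non-cases who are not drawn into the sub-cohort are simply absent from the returned sample, which is where the cost savings arise.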
Figure 1.
Visualization of case-cohort designs assuming a time-on-study time scale. (A) The case-cohort study includes (1) a sample of individuals from the cohort who have experienced the outcome of interest (“cases”) and (2) a sample of individuals randomly selected from among the members of the full cohort observed at baseline (the “sub-cohort”). Selection into the sub-cohort is conducted without respect to future case status, meaning that some cases may be included by chance. (B) When considering an inverse probability of sampling weights approach to analyzing case-cohort data, cases not in the sub-cohort only contribute person-time just before their diagnosis (shown in solid black line), with the weight determined by the probability of selection as a case (weight=1 if 100% of cases are selected for inclusion). All individuals selected into the sub-cohort contribute person-time at risk, with the non-cases contributing for their entire follow-up period and cases contributing from enrollment until just before their diagnosis. The sub-cohort person-time is weighted based on the probability of being sampled into the sub-cohort (e.g. if sub-cohort is a 10% sample of cohort, weight=10).
Sum of case weights = total number of cases in cohort
Sum of weighted sub-cohort person-time = total person-time observed for cohort
Assuming no loss to follow-up in the cohort and no competing risks for the event(s) of interest, risk ratios can be directly estimated from case-cohort data.15 However, such assumptions are often unreasonable, and Prentice et al. first described how Cox proportional hazards models could be applied to case-cohort data to estimate HRs that are asymptotically identical to those obtained from the full cohort, albeit with increased estimated variance and thus larger estimated confidence intervals (CIs).5,20,23 The modification requires a “pseudo-likelihood” approach, which is essentially a weighted version of the partial-likelihood used for Cox proportional hazards regression.23 Updates to the original approach have shown that the weights can be considered time-varying, with values determined by each participant’s case and sub-cohort status at each observed event time.11,23,24
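For readers who prefer notation, one common way to write a weighted pseudo-likelihood of this kind is sketched below. The symbols are our own shorthand rather than notation taken from the cited papers: $\delta_i$ is the event indicator, $t_i$ the event time, $Z_i$ the covariate vector, $\tilde{R}(t_i)$ the set of sampled individuals at risk at $t_i$ (sub-cohort members under follow-up plus the case itself), and $w_j(t)$ the possibly time-varying weight.

```latex
\tilde{L}(\beta) \;=\; \prod_{i:\,\delta_i = 1}
\left[
  \frac{\exp\!\left(\beta^{\top} Z_i\right)}
       {\sum_{j \in \tilde{R}(t_i)} w_j(t_i)\,\exp\!\left(\beta^{\top} Z_j\right)}
\right]^{w_i(t_i)}
```

Setting every weight to 1 and replacing $\tilde{R}(t_i)$ with the full-cohort risk set recovers the usual Cox partial likelihood, which is one way to see why the weighted estimator targets the same HR.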
In a basic case-cohort design, the person-time among the sub-cohort is a random sample of all person-time in the full cohort.5,20 Cases outside the sub-cohort have a weight of 0 until the exact time they become a case. At that point, they are compared to all individuals at risk of the outcome at that same timepoint, with other cases contributing little5 (or no20) person-time at risk.
Alternatively, it is possible to re-weight the person-time contributions of the case-cohort sample, so that the weighted “pseudo-population” represents the source cohort in terms of exposures, covariates, outcomes, and follow-up times. With T0 representing time at start of follow-up, TY representing time at event/censoring, and ε a very small number (smaller than the time units being measured), we describe a simple version of this weighting scheme in the first row of the Table. Corresponding data examples and SAS and R code are provided online (https://github.com/ TBD) and as an appendix.
Table.
Weighting schemes for case–cohort designs.
Type of case–cohort design | Cases not in sub-cohort | Non-cases in sub-cohort | Cases in sub-cohort
---|---|---|---
Simple: all cases selected; selection probability of sub-cohort = x% | TY-ε to TY: w = 1.0 | w = 1/x% | T0 to TY-ε: w = 1/x%; TY-ε to TY: w = 1.0
Covariate-stratified: all cases selected; sub-cohort selection probabilities of xA% (Group A) and xB% (Group B) | | |
Group A | TY-ε to TY: w = 1.0 | w = 1/xA% | T0 to TY-ε: w = 1/xA%; TY-ε to TY: w = 1.0
Group B | TY-ε to TY: w = 1.0 | w = 1/xB% | T0 to TY-ε: w = 1/xB%; TY-ε to TY: w = 1.0
Outcome-stratified: 100% of type I and y% of type II cases selected; sub-cohort selection probability x% for all | Type I: TY-ε to TY: w = 1.0. Type II: TY-ε to TY: w = 1/y% | w = 1/x% | Type I: T0 to TY-ε: w = 1/x%; TY-ε to TY: w = 1.0. Type II: T0 to TY-ε: w = 1/x%; TY-ε to TY: w = 1/y%
Covariate- and outcome-stratified: 100% of type I and y% of type II cases selected; sub-cohort selection probabilities of xA% (Group A) and xB% (Group B) (a) | | |
Group A | Type I: TY-ε to TY: w = 1.0. Type II: TY-ε to TY: w = 1/y% | w = 1/xA% | Type I: T0 to TY-ε: w = 1/xA%; TY-ε to TY: w = 1.0. Type II: T0 to TY-ε: w = 1/xA%; TY-ε to TY: w = 1/y%
Group B | Type I: TY-ε to TY: w = 1.0. Type II: TY-ε to TY: w = 1/y% | w = 1/xB% | Type I: T0 to TY-ε: w = 1/xB%; TY-ε to TY: w = 1.0. Type II: T0 to TY-ε: w = 1/xB%; TY-ε to TY: w = 1/y%
Case-independent designs: selection probability of cases = v%; selection probability of non-cases = z% | w = 1/v% | w = 1/z% | w = 1/v%
w = weight; T0 = start of follow-up; TY = event time; ε = a very small number (less than a unit increase on the time scale).
(a) Assumes case selection is independent of group selection. If this is not the case, the weights can be calculated separately for each group/subtype combination (e.g. selecting yA% of type II cases in group A but yB% of type II cases in group B).
This “simple” approach to case-cohort analysis applies if all cases are selected (sampling probability=100%) and the members of the sub-cohort are all selected with equal probability given by x%. Accordingly: 1) cases not in the sub-cohort would get a weight of 1.0 from just before their event time (TY-ε) to their diagnosis (TY); 2) non-cases in the sub-cohort would be weighted as 1/x% for all of their follow-up time (T0 to TY); and 3) cases in the sub-cohort would get a weight of 1/x% from the start of follow-up until just before their event time (T0 to TY-ε), and then a weight of 1.0 at the exact time of their event (TY-ε to TY).
The key to this approach is understanding the weighting of the cases in the sub-cohort, who have their person-time split into two separate observations in the modified data set (shown visually in Figure 1B). If everyone in the sub-cohort received the inverse probability of selection weights for their entire follow-up period, the sub-cohort would itself be weighted back to equal the full cohort, including all of the cases in the sub-cohort. Therefore, the number of cases in the sub-cohort would be weighted to approximately equal the number of true cases in the full cohort. As such, any additional data from cases outside the sub-cohort would overcount the total number of cases. By assigning all of the cases a weight of exactly 1.0 at the time of their event, the weighting scheme described here ensures that each case is only counted once and that all cases contribute equally. Their contribution prior to their event time is determined by whether they also contributed to the sub-cohort. Of note, if the cases were selected with a probability of less than 100%, either by design or as a result of missing data, the weight for cases at the time of their selection could be adjusted accordingly (e.g., a 20% sample would be weighted as 1/0.20=5.0).
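The record-splitting logic for the simple design can be sketched in a few lines of Python. This is an illustrative sketch, not the SAS or R code referenced in the text; the function name `weighted_rows`, the record keys, and the default value of `eps` are assumptions made here, and the sketch assumes all cases are selected.

```python
def weighted_rows(person, x, eps=1e-6):
    """Build start/stop rows with sampling weights for one sampled participant.

    person: dict with hypothetical keys 't0', 'ty', 'event' (1 = case),
            and 'in_subcohort' (True/False).
    x: sub-cohort sampling probability (e.g. 0.10 for a 10% sample).
    eps: a very small number, smaller than the time unit being measured.
    """
    t0, ty = person["t0"], person["ty"]
    rows = []
    if person["event"] == 1:
        if person["in_subcohort"]:
            # Sub-cohort case: pre-event person-time carries the sub-cohort weight.
            rows.append({"start": t0, "stop": ty - eps, "event": 0, "weight": 1.0 / x})
        # Every selected case contributes its event with weight 1.0 (100% case sampling).
        rows.append({"start": ty - eps, "stop": ty, "event": 1, "weight": 1.0})
    elif person["in_subcohort"]:
        # Sub-cohort non-case: entire follow-up carries the sub-cohort weight.
        rows.append({"start": t0, "stop": ty, "event": 0, "weight": 1.0 / x})
    return rows

case_outside = {"t0": 0.0, "ty": 4.0, "event": 1, "in_subcohort": False}
case_inside = {"t0": 0.0, "ty": 6.0, "event": 1, "in_subcohort": True}
noncase_inside = {"t0": 0.0, "ty": 9.0, "event": 0, "in_subcohort": True}

rows_outside = weighted_rows(case_outside, x=0.10)
rows_inside = weighted_rows(case_inside, x=0.10)
rows_noncase = weighted_rows(noncase_inside, x=0.10)
```

The resulting rows are in the counting-process (start, stop] layout that weighted Cox routines typically accept, with a robust variance estimator requested separately in whichever software is used.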
Case-cohort data were initially difficult to analyze with standard software. Barlow et al.23 and Langholz and Jiao25 provided SAS code, but the robust variance calculations needed to account for the atypical data structure and/or weighting scheme required custom coding, and Cox model routines in available software packages could not always accommodate weights. However, changes to the PHREG procedure to allow sandwich variance estimates and weighting (SAS versions 8.2 and 9.0, respectively) have made implementation much more straightforward and adaptable.11,23,25 Similar capabilities have also become available in numerous other statistical software packages, including R and Stata.
A comparison: Nested case-control versus case-cohort
Both nested case-control and case-cohort study designs can yield precision approaching what is seen for cohort designs,5,8,20,26 while reducing costs incurred by measurement. Because of the previously stated software issues, case-cohort data have historically been more difficult to analyze. However, as this is no longer the case, case-cohort designs have several clear advantages compared to nested case-control designs.
A primary advantage of case-cohort studies is that the selected sub-cohort can easily be used as a comparison group for many different outcomes, or, in some circumstances, as a study sample on its own. For example, researchers could design a case-cohort study of breast cancer and measure the concentrations of a physiological marker in baseline blood samples collected from both (future) breast cancer cases and a random sub-cohort.27 The same sub-cohort could then be used again as a comparison group for a study of the association between the same exposure and ovarian cancer, or perhaps another chronic disease such as diabetes or stroke. Additionally, the same researchers could use data from just the sub-cohort to examine the relationship between the measured marker and a common outcome, such as obesity or hypertension, or to look at whether the concentrations of the marker varied by race/ethnicity or occupation.
In contrast, establishing a common control group for multiple outcomes with a nested case-control design is more challenging. For example, it might be reasonable to have a control group that is free of two rare diseases (e.g. ovarian cancer and pancreatic cancer), but each additional exclusion criterion diminishes the representativeness of exposure distributions within the control group, which biases the OR estimated from the nested case-control sample relative to the OR estimate from the full cohort. The use of matching for efficiency would limit flexibility and generalizability even further. Methods have been developed to re-weight or combine control groups to create a more appropriate referent group,28–30 but such approaches often have specific data requirements and may be difficult to implement with standard software.
Both the nested case-control and the case-cohort study designs make it easy to layer exposure measurements on top of one another, given sufficient banked samples and/or consistent storage and analytical techniques. For example, if investigators are interested in studying the role of genetic factors as they relate to an exposure, an outcome, or their interaction, it is fairly straightforward to genotype the same sample that was selected for a prior nested case-control or case-cohort study.27,31,32 In doing this, investigators using case-cohort sampling would open up a wide range of genetic-related research opportunities within the sub-cohort,33 a sample fully representative of the full cohort, while the control group for the nested case-control sample would have less utility and interpretability.
In a comparison of several different approaches to analyzing nested case-control and case-cohort data, Kim found that when inverse probability of selection weights were incorporated, the designs had similar statistical power.9 Langholz and Thomas26 previously showed that a single nested case-control study may be more efficient than a single case-cohort study of the same sample size, particularly in the presence of late entry and right censoring. The rarity of the outcome of interest also matters, with nested case-control studies having increasingly greater statistical precision than case-cohort studies of the same total sample size as the outcome becomes more common.15
However, this does not account for the potential efficiency gains for case-cohort studies when studying alternative outcomes, where case-cohort designs may require identifying only new cases while nested case-control designs would generally necessitate identifying both new cases and controls. Further, though not covered in detail here, additional increases in statistical power could be gained by including auxiliary data from participants not selected into the sub-cohort, even if those participants were missing data on crucial covariates,34–36 or by jointly considering multiple outcomes.37–39 Some previous work suggests nested case-control studies may be better suited for studying biomarkers sensitive to batch, storage, and freeze/thaw cycle effects,40 but nested case-control studies with alternative designs could be subject to similar biases.29
The increasing advantages of case-cohort designs
Despite these potential benefits in terms of optimizing study resources, case-cohort studies have not permeated epidemiologic study design as fully as nested case-control studies. As an update to the numbers reported by Barlow et al.,23 a June 2021 MEDLINE keyword search indicated that “nested case-control” studies were much more common (9,364 entries, including 6,090 since January 2010), than “case-cohort” studies (2,192 entries, including 1,611 since January 2010).
This preference for nested case-control studies is likely due, at least in part, to perceived difficulties in understanding and implementing case-cohort studies, particularly if they go beyond the simplest of applications. Now that the software/programming concerns have been addressed through new software routines, we offer some additional thoughts on how to adapt case-cohort designs to more complex scenarios, including what we are calling “covariate-stratified case-cohort designs” and “outcome-stratified case-cohort designs”.
Stratified case-cohort sampling to enhance efficiency for subgroup analyses
Case-cohort designs can also be easily adapted to focus resources on the study of certain covariate-defined subgroups or disease types. We present examples of these stratified case-cohort designs in the Table, with the basic framework following what was previously presented for the simple scenario of a 100% case sample with an x% random sample of the cohort into the sub-cohort. The elements are flexible and adaptable, as long as the selection probabilities are known and their inverses are included as weights in the regression models.7,10,11
A covariate-stratified case-cohort design could be used to maximize precision for groups of particular interest (e.g. Black or Latinx individuals in environmental studies where there are documented exposure disparities,41 or infant boys when studying endocrine disruptors and there are expected differences in effect size42). Such a design would involve over-sampling cohort members who are in the selected subgroups, rather than just taking a random sample of the full cohort. As shown in the Table, if there are two sampling groups “A” and “B”, the sampling probabilities (xA% and xB%, respectively) are factored in when determining the weights for the sub-cohort members. This includes both cases (weight = 1/xA% or weight = 1/xB% from T0 to TY-ε) and non-cases (weight = 1/xA% or weight = 1/xB% from T0 to TY). If 100% of the cases are included, then the case weights do not change (weight=1.0 from TY-ε to TY).
If the goal is to disproportionately sample one disease type over another, an outcome-stratified case-cohort design may be more appropriate. For example, researchers might want to use an outcome-stratified case-cohort design to include all incident cases of a rare subtype (e.g., estrogen receptor-negative breast cancer), but only a sample of a more common subtype (estrogen receptor-positive breast cancer). Here, the sampling weights for the sub-cohort would be the same as the simplest version (weight = 1/x% from T0 to TY for non-cases or T0 to TY-ε for cases), but the weights for cases would change depending on the outcome type. If 100% of those with the type I outcome were selected, but only y% of those with the type II outcome, type I cases (both in and out of the sub-cohort) would be given weights of 1.0 from TY-ε to TY and type II cases would be given weights of 1/y% from TY-ε to TY.
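The stratified weights described above reduce to two lookups, one for the sub-cohort sampling probability and one for the case sampling probability. The sketch below is a hypothetical helper (`stratified_weights` is our name, not an existing routine) and assumes, as in the Table footnote, that case selection is independent of group, and that any case passed in was selected into the case sample.

```python
def stratified_weights(group, outcome_type, in_subcohort, subcohort_prob, case_prob):
    """Return (pre_event_weight, event_weight) for one sampled participant.

    group: covariate stratum label (e.g. 'A' or 'B').
    outcome_type: case type label (e.g. 'I' or 'II'), or None for a non-case.
    in_subcohort: True if the participant was drawn into the sub-cohort.
    subcohort_prob: dict mapping group -> sub-cohort sampling probability.
    case_prob: dict mapping outcome type -> case sampling probability.

    pre_event_weight applies from T0 to TY-eps for cases (T0 to TY for non-cases)
    and is None when the participant contributes no sub-cohort person-time.
    event_weight applies from TY-eps to TY and is None for non-cases.
    """
    pre_event = 1.0 / subcohort_prob[group] if in_subcohort else None
    event = 1.0 / case_prob[outcome_type] if outcome_type is not None else None
    return pre_event, event

# Hypothetical design: 20% of group A and 5% of group B sampled into the
# sub-cohort; all type I cases and 50% of type II cases selected.
subcohort_prob = {"A": 0.20, "B": 0.05}
case_prob = {"I": 1.0, "II": 0.5}

noncase_a = stratified_weights("A", None, True, subcohort_prob, case_prob)
type2_b_in = stratified_weights("B", "II", True, subcohort_prob, case_prob)
type1_a_out = stratified_weights("A", "I", False, subcohort_prob, case_prob)
```

If case sampling fractions differ by group, `case_prob` could instead be keyed by the (group, outcome type) pair, in line with the Table footnote.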
Provided that sampling probabilities are known, case-cohort designs can also incorporate both covariate-stratified and outcome-stratified elements together (described in the 4th row of the Table). It could also be expanded to include more than two strata of covariates or outcomes, and it is not limited to any one group being sampled with 100% probability. As a simple data checking step, investigators can construct weighted frequency tables using only the case-cohort sample and show that the person-time and case counts in the weighted sample have approximately the same distribution of exposure, covariates, and outcomes as the original eligible cohort (within sampling error).
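A simulated version of this data check is sketched below, assuming the simple design (all cases selected, a fixed-size random sub-cohort). All quantities are simulated and hypothetical, including the 5% case risk, the uniform follow-up times, and the function name `check_weighted_totals`.

```python
import random

def check_weighted_totals(n=10000, x=0.10, case_risk=0.05, seed=1):
    """Compare weighted case-cohort totals with the totals in the full cohort."""
    rng = random.Random(seed)
    # Simulated full cohort: follow-up time in years and a case indicator.
    cohort = [
        {"id": i, "ty": rng.uniform(0.5, 10.0), "event": int(rng.random() < case_risk)}
        for i in range(n)
    ]
    sub_ids = set(rng.sample(range(n), round(x * n)))
    # Case-cohort sample: sub-cohort members plus all cases.
    sample = [dict(p, in_sub=p["id"] in sub_ids)
              for p in cohort if p["event"] == 1 or p["id"] in sub_ids]

    # Each selected case is counted once, with weight 1.0 at the event time.
    weighted_cases = sum(1.0 for p in sample if p["event"] == 1)
    # Sub-cohort person-time is weighted by 1/x (the eps interval is ignored here).
    weighted_pt = sum(p["ty"] / x for p in sample if p["in_sub"])

    return {
        "cohort_cases": sum(p["event"] for p in cohort),
        "weighted_cases": weighted_cases,
        "cohort_person_time": sum(p["ty"] for p in cohort),
        "weighted_person_time": weighted_pt,
    }

totals = check_weighted_totals()
```

The weighted case count matches the cohort exactly because all cases are selected, whereas the weighted person-time agrees only within sampling error. The same comparison can be repeated within levels of exposure or covariates.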
Re-imagining the data for assessing case-independent research questions
As previously mentioned, an additional strength of case-cohort studies over nested case-control studies is the ability to use the sub-cohort as an independent sample that is fully representative of the original full cohort. For example, investigators might want to look at the cross-sectional relationship between a physiological biomarker and epigenetic modifications measured in the selected individuals. Both prospective and cross-sectional studies could be accommodated, with weighting only needed if covariate-stratified approaches had been utilized.
If looking to increase statistical precision, it is also possible to include the cases when assessing research questions not directly related to the outcome (for examples, see Lawrence et al.43 and Kresovich et al.44). To ensure that the measured association is “case-independent”, weights would be used to account for the inverse probability of selection into the study. However, unlike the previous examples, we cannot use event times to define the weights, and must instead assign them based on case or non-case status at a single time point. In other words, selected cases would be representative of all cases that existed at the time the case-cohort sample was selected (weight = 1/v%, if v% of cases were selected; Table). Non-cases would be representative of all the individuals who had not yet become cases at the time the case-cohort sample was selected (weight = 1/z%, if z% of non-cases were selected). Though lower powered than an analysis that included all individuals weighted equally but with adjustment for case status (e.g. O’Brien et al.32, White et al.45), an inverse-probability-of-sampling weighted analysis would more accurately estimate effects in the full cohort, if that is the desired target population.
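The effect of these case-independent weights can be seen in a small numeric sketch. The biomarker values and sampling fractions below are invented for illustration, and the helper names are hypothetical.

```python
def case_independent_weight(is_case, v, z):
    """Inverse probability of selection: 1/v for cases, 1/z for non-cases."""
    return 1.0 / v if is_case else 1.0 / z

def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(val * w for val, w in zip(values, weights)) / sum(weights)

# Hypothetical sample: all cases selected (v = 1.0), 10% of non-cases (z = 0.10).
v, z = 1.0, 0.10
sample = (
    [{"is_case": True, "biomarker": b} for b in (8.0, 9.0, 10.0, 9.0)]
    + [{"is_case": False, "biomarker": b} for b in (5.0, 6.0, 5.0, 4.0, 6.0, 4.0)]
)

values = [p["biomarker"] for p in sample]
weights = [case_independent_weight(p["is_case"], v, z) for p in sample]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
```

Because cases are heavily over-represented in the sample, the unweighted mean (6.6) is pulled toward the case values, while the weighted mean (5.25) reflects the much larger number of non-cases in the source cohort.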
Further adaptations in causal inference with case-cohort designs
Methods for correcting for selection bias and time-varying confounding in case-cohort samples have already been developed in ways that are compatible with the framework presented here.46–48 Describing the case-cohort sample as a weighted version of a full cohort study places case-cohort designs in the context of many modern approaches to epidemiologic data analysis that include weighting, or in which special considerations for weights have already been made. These include estimation of absolute risk measures, imputation, quantitative bias analyses,49 assessment of complex mixtures,41,50 inverse probability of treatment weighting,51,52 and generalizability and transport.53,54
Notably, the preceding examples are mainly questions pertinent to causal inference in prospective study designs. When the population odds of the study outcome are known, it may be possible to estimate population cumulative risks from case-control data.55 However, because of the outcome-dependent sampling scheme, understanding how covariates develop over time is sometimes challenging in case-control studies. Because case-cohort studies include a sample of the cohort, this design potentially opens up questions and methods that specifically leverage time-varying covariates, such as questions about mediation or methods such as the parametric g-formula56,57 or doubly-robust estimation.58 Details of such applications need further research and development and are beyond the scope of this review.
Compatibility of case-cohort and nested case-control
Despite what we see as clear arguments in favor of case-cohort over nested case-control designs, the reality is that many previous investigations of this type have already been sampled and analyzed as nested case-control studies. This can lead to issues of compatibility between the two, whether comparing effect estimates informally or formally (i.e. meta-analysis), or attempting to pool data for combined analysis.
Nested case-control data can be used to estimate HRs if risk-set sampling is used. As mentioned earlier, assuming no loss to follow-up and no competing risks, risk ratios or ORs can be directly estimated from case-cohort data.15 In other words, under certain assumptions, a case-cohort study can be thought of as a nested case-control study done over a fixed period of interest (e.g. 5-year risk of developing a disease). Importantly, without explicit matching on time, exposure assessments for both cases and non-cases should be based on their status at some comparable timepoint to ensure exchangeability59 and avoid being subject to differential misclassification by case status, as could occur with exposures prone to recall bias, reverse causation, or confounding by age or time.
Summary
Case-cohort designs are an established but underutilized tool for answering epidemiologic questions that require prospective ascertainment of cases and expensive measurements. Recent software developments have made case-cohort studies simpler to analyze, and the design offers clear benefits in flexibility and general efficiency compared to more traditional nested case-control approaches. With this inverse probability of sampling weights approach, we provide tools for improved implementation of case-cohort studies, including adaptations to allow for better representation of certain subgroups or disease types, and extensions to address research questions beyond their initial purpose. The proposed frameworks also offer great potential for further methodological application and development in the context of modern causal inference-based approaches to epidemiologic data.
Funding:
This work was supported by the Intramural Research Program of the National Institutes of Health (National Institute of Environmental Health Sciences).
Abbreviations:
- HR: hazard ratio
- OR: odds ratio
References
- 1. White E, Hunt JR, Casso D. Exposure measurement in cohort studies: The challenges of prospective data collection. Epidemiologic Reviews. 1998;20(1):43–56. doi: 10.1093/oxfordjournals.epirev.a017971
- 2. Bao Y, Bertoia ML, Lenart EB, et al. Origin, methods, and evolution of the three nurses’ health studies. American Journal of Public Health. 2016;106(9):1573–1581. doi: 10.2105/AJPH.2016.303338
- 3. Signorello LB, Hargreaves MK, Blot WJ. The Southern Community Cohort Study: Investigating health disparities. Journal of Health Care for the Poor and Underserved. 2010;21(1 SUPPL. 1):26–37. doi: 10.1353/hpu.0.0233
- 4. Sandler DP, Hodgson ME, Deming-Halverson SL, et al. The Sister Study: Baseline methods and participant characteristics. Environ Health Perspect. 2017;125(12):127003.
- 5. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73(1):1–11. doi: 10.1093/biomet/73.1.1
- 6. Wacholder S. Practical considerations in choosing between the case-cohort and nested case-control designs. Epidemiology. 1991;2(2):155–158.
- 7. Therneau TM, Li H. Computing the Cox model for case cohort designs. Lifetime Data Analysis. 1999;5(2):99–112.
- 8. Gail MH, Altman DG, Cadarette SM, et al. Design choices for observational studies of the effect of exposure on disease incidence. BMJ Open. 2019;9(12):1–9. doi: 10.1136/bmjopen-2019-031031
- 9. Kim RS. A new comparison of nested case–control and case–cohort designs and methods. European Journal of Epidemiology. 2014;30(3):197–207. doi: 10.1007/s10654-014-9974-4
- 10. Binder DA. Fitting Cox’s proportional hazards models from survey data. Biometrika. 1992;79(1):139–147. doi: 10.1093/biomet/79.1.139
- 11. Kulathinal S, Karvanen J, Saarela O, Kuulasmaa K. Case-cohort design in practice - experiences from the MORGAM Project. Epidemiol Perspect Innov. 2007;4:15. doi: 10.1186/1742-5573-4-15
- 12. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B (Methodological). 1972;34(2):187–202. doi: 10.1111/j.2517-6161.1972.tb00899.x
- 13. Wei LJ. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine. 1992;11(14–15):1871–1879. doi: 10.1002/sim.4780111409
- 14. Orbe J, Ferreira E, Núñez-Antón V. Comparing proportional hazards and accelerated failure time models for survival analysis. Statistics in Medicine. 2002;21(22):3493–3510. doi: 10.1002/sim.1251
- 15. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott Williams and Wilkins; 2008.
- 16. Cornfield J. A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst. 1951;11(6):1269–1275.
- 17. SAS Institute. Conditional Logistic Regression for Matched Pairs Data. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_logistic_examples11.htm.
- 18. LOGIT REGRESSION | SAS DATA ANALYSIS EXAMPLES. Accessed October 27, 2021. https://stats.idre.ucla.edu/sas/dae/logit-regression/
- 19. LOGIT REGRESSION | R DATA ANALYSIS EXAMPLES. Accessed October 27, 2021. https://stats.idre.ucla.edu/r/dae/logit-regression/
- 20. Self SG, Prentice RL. Asymptotic Distribution Theory and Efficiency Results for Case-Cohort Studies. Annals of Statistics. 1988;16(1):64–81.
- 21. Kang S, Cai J. Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika. 2009;96(4):887–901. doi: 10.1093/biomet/asp059
- 22. Cai J, Zeng D. Power calculation for case-cohort studies with nonrare events. Biometrics. 2007;63(4):1288–1295. doi: 10.1111/j.1541-0420.2007.00838.x
- 23. Barlow WE, Ichikawa L, Rosner D, Izumi S. Analysis of case-cohort designs. J Clin Epidemiol. 1999;52(12):1165–1172.
- 24. Barlow WE. Robust Variance Estimation for the Case-Cohort Design. Biometrics. 1994;50(4):1064–1072.
- 25. Langholz B, Jiao J. Computational methods for case-cohort studies. Computational Statistics and Data Analysis. 2007;51(8):3737–3748. doi: 10.1016/j.csda.2006.12.028
- 26. Langholz B, Thomas DC. Nested case-control and case-cohort methods of sampling from a cohort: A critical comparison. American Journal of Epidemiology. 1990;131(1):169–176. doi: 10.1093/oxfordjournals.aje.a115471
- 27. O’Brien KM, Sandler DP, Taylor JA, Weinberg CR. Serum Vitamin D and Risk of Breast Cancer within Five Years. Environmental Health Perspectives. 2017;125(7):077004.
- 28. Saarela O, Kulathinal S, Arjas E, Laara E. Nested case-control data utilized for multiple outcomes: A likelihood approach and alternatives. Stat Med. 2008;27:5991–6008.
- 29. Salim A, Hultman C, Sparén P, Reilly M. Combining data from 2 nested case-control studies of overlapping cohorts to improve efficiency. Biostatistics. 2009;10(1):70–79. doi: 10.1093/biostatistics/kxn016
- 30. Salim A, Xiangmei M, Jialiang L, Reilly M. A maximum likelihood method for secondary analysis of nested case-control data. Stat Med. 2012;33:1842–1852.
- 31. O’Brien KM, Sandler DP, Kinyamu HK, Taylor JA, Weinberg CR. Single nucleotide polymorphisms in vitamin D-related genes may modify vitamin D-breast cancer associations. Cancer Epidemiology, Biomarkers & Prevention. 2017;26(12):1761–1771. doi: 10.1158/1055-9965
- 32. O’Brien KM, Sandler DP, Xu Z, Kinyamu HK, Taylor JA, Weinberg CR. Vitamin D, DNA methylation, and breast cancer. Breast Cancer Res. 2018;20(70):1–11. doi: 10.1186/s13058-018-0994-y
- 33. O’Brien KM, Sandler DP, Shi M, Harmon QE, Taylor JA, Weinberg CR. Genome-wide association study of serum 25-hydroxyvitamin D in US women. Frontiers in Genetics. 2018;9(March):1–11. doi: 10.3389/fgene.2018.00067
- 34. Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Using the whole cohort in the analysis of case-cohort data. American Journal of Epidemiology. 2009;169(11):1398–1405. doi: 10.1093/aje/kwp055
- 35. Lumley T, Shaw PA, Dai JY. Connections between survey calibration estimators and semiparametric models for incomplete data. Int Stat Rev. 2011;79(2):200–220. doi: 10.1111/j.1751-5823.2011.00138.x
- 36. Noma H, Tanaka S. Analysis of case-cohort designs with binary outcomes: Improving efficiency using whole-cohort auxiliary information. Statistical Methods in Medical Research. 2017;26(2):691–706.
- 37. Kim S, Zeng D, Cai J. Analysis of multiple survival events in generalized case-cohort designs. Biometrics. 2018;74:1250–1260.
- 38. Kim S, Xu Y, Zhang M-J, Ahn K-W. Stratified proportional subdistribution hazards model with covariate-adjusted censoring weight for case-cohort studies. Scandinavian Journal of Statistics. Published online 2020.
- 39. Kim S, Cai J, Lu W. More efficient estimators for case-cohort studies. Biometrika. 2013;100(3):695–708. doi: 10.1093/biomet/ast018
- 40. Rundle AG, Vineis P, Ahsan H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiology Biomarkers and Prevention. 2005;14(8):1899–1907. doi: 10.1158/1055-9965.EPI-04-0860
- 41. Niehoff NM, O’Brien KM, Keil AP, et al. Metals and breast cancer risk: a prospective study using toenail biomarkers. American Journal of Epidemiology. Published online 2021.
- 42. Buckley JP, Doherty BT, Keil AP, Engel SM. Statistical Approaches for Estimating Sex-Specific Effects in Endocrine Disruptors. :1–7. doi: 10.1289/EHP334
- 43. Lawrence KG, Kresovich JK, O’Brien KM, et al. Association of neighborhood deprivation with epigenetic aging using four clock methodologies. JAMA Open. Published online 2020.
- 44. Kresovich JK, Martinez Lopez AM, Garval EL, et al. Alcohol consumption and methylation-based measures of biological age. Journals of Gerontology. Published online 2021.
- 45. White AJ, Kresovich JK, Xu Z, Sandler DP, Taylor JA. Shift work, DNA methylation and epigenetic age. International Journal of Epidemiology. Published online 2019:1–9. doi: 10.1093/ije/dyz027
- 46. Lee H, Hudgens MG, Cai J, Cole SR. Marginal structural Cox models with case-cohort sampling. Stat Sin. 2016;26(2):509–526.
- 47. Cole SR, Hudgens MG, Tien PC, et al. Marginal structural models for case-cohort study designs to estimate the association of antiretroviral therapy initiation with incident AIDS or death. American Journal of Epidemiology. 2012;175(5):381–390. doi: 10.1093/aje/kwr346
- 48. Buchanan AL, Hudgens MG, Cole SR, Lau B, Adimora AA. Worth the weight: Using inverse probability weighted Cox models in AIDS research. AIDS Research and Human Retroviruses. 2014;30(12):1170–1177. doi: 10.1089/aid.2014.0037
- 49. Lash TL, Fox MP, Maclehose RF, Maldonado G, Mccandless LC, Greenland S. Good practices for quantitative bias analysis. International Journal of Epidemiology. 2014;43(6):1969–1985. doi: 10.1093/ije/dyu149
- 50. Keil AP, Buckley JP, O’Brien KM, Ferguson KK, Zhao S, White AJ. A quantile-based g-computation approach to addressing the effects of exposure mixtures. Environmental Epidemiology. 2019;3(April):44. doi: 10.1097/01.ee9.0000606120.58494.9d
- 51. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology. 2008;168(6):656–664. doi: 10.1093/aje/kwn164
- 52. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560. doi: 10.1097/00001648-200009000-00011
- 53. Bareinboim E, Pearl J. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences of the United States of America. 2016;113(27):7345–7352. doi: 10.1073/pnas.1510507113
- 54. Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR. Generalizing Study Results: A Potential Outcomes Perspective. Epidemiology. 2017;28(4):553–561. doi: 10.1097/EDE.0000000000000664
- 55. Greenland S. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. Am J Epidemiol. 2004;160(4):301–305. doi: 10.31826/9781463222444-001
- 56. Taubman SL, Robins JM, Mittleman MA, Hernán MA. Intervening on risk factors for coronary heart disease: An application of the parametric g-formula. International Journal of Epidemiology. 2009;38(6):1599–1611. doi: 10.1093/ije/dyp192
- 57. Keil AP, Edwards JK, Richardson DB, Naimi AI, Cole SR. The Parametric g-Formula for Time-to-event Data: Intuition and a Worked Example. Epidemiology. Published online August 2014. doi: 10.1097/EDE.0000000000000160
- 58. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–973. doi: 10.1111/j.1541-0420.2005.00377.x
- 59. Flanders WD, Klein M. Properties of 2 Counterfactual Effect Definitions of a Point Exposure. Epidemiology. 2007;18(4):453–460. doi: 10.1097/01.ede.0000261472.07150.4f