American Journal of Epidemiology. 2024 Aug 30;194(3):585–586. doi: 10.1093/aje/kwae287

Mathur and Shpitser respond to “The evolution of selection bias in the recent epidemiologic literature—a selective overview”

Maya B Mathur 1, Ilya Shpitser 2
PMCID: PMC11879549  PMID: 39214648

We thank Lu et al1 for their response to our article on graphical rules for selection bias.2 They provide an insightful and interesting overview of methodological developments regarding selection bias in epidemiology. We reflect on certain topics they address and on areas of active research, drawing on research in epidemiology and adjacent disciplines.

Lu et al1 emphasize the need to consider the practical utility of estimands that may be used in the presence of selection bias, such as the net treatment difference (described in our article2). We certainly agree. Other estimands have been proposed, and identification conditions proven, for situations that are analogous to selection bias. For example, in the context of survival analysis, the survivor average causal effect (SACE) is the effect of the treatment on the outcome for the subset of individuals who would have survived regardless of treatment status.3 In the context of treatment-affected selection, the SACE would represent the effect of treating the subset of individuals who would be in the selected group regardless of treatment. With treatment-affected selection, it can also be informative to consider the effects of the treatment while simultaneously intervening to set the selection indicator to 1 (eg, to prevent missing data from occurring).2,4,5 We provided some results for this case.2 Each of these estimands requires different identification assumptions, whose plausibility will depend on scientific context.2,3
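The two estimands just described can be written compactly in potential-outcome notation. The following is only a sketch: the symbols are assumptions introduced here for illustration, with A the binary treatment, Y(a) the outcome that would occur under treatment a, S(a) the survival or selection indicator that would occur under treatment a, and Y(a, s) the outcome under a joint intervention on treatment and selection.

```latex
% Survivor average causal effect: the effect among individuals who would
% survive (or be selected) regardless of treatment status.
\mathrm{SACE} = E\left[\, Y(1) - Y(0) \mid S(1) = S(0) = 1 \,\right]

% Effect of treatment while also intervening to set the selection
% indicator to 1 (for example, preventing missing data from occurring).
E\left[\, Y(a = 1, s = 1) \,\right] - E\left[\, Y(a = 0, s = 1) \,\right]
```

The first quantity conditions on a subgroup defined by two potential outcomes of selection that are never jointly observed, whereas the second refers to the whole population under a hypothetical intervention. This difference is one reason the identification assumptions for the two estimands differ.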

To date in epidemiology, discussion of selection bias has largely focused on directed acyclic graphs (DAGs) containing a single selection indicator. That is, each participant is either selected into analysis, in which case all of their variables are observed, or else is not selected, in which case none of their variables is observed. However, in the context of selection due to missing data, some participants may have incomplete data. Graphical models have been developed to accommodate separate indicators for each missing variable.4-6 Such approaches have yielded interesting and simple identification results. For example, if one wants to adjust for systematic selection using popular approaches such as multiple imputation or inverse-probability weighting, this is possible if and only if the probability of being a complete case given only the observed part of the data is itself identifiable.4 In addition, the form of the identifying functional for the complete-case probability will in turn influence the form the inverse weights or imputing distributions must take.
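The link between the complete-case probability and inverse-probability weighting can be sketched with one identity. The notation is an assumption made here for illustration: L denotes the analysis variables as they would be if fully observed, R = 1 indicates a complete case, and g is any function of L whose mean is of interest.

```latex
% Joint distribution of the analysis variables, valid wherever
% p(R = 1 \mid l) > 0:
p(l) = \frac{p(l, R = 1)}{p(R = 1 \mid l)}

% Consequently, the mean of any function g of L can be written as an
% inverse-probability-weighted mean over complete cases:
E\left[\, g(L) \,\right] = E\left[ \frac{\mathbf{1}(R = 1)\, g(L)}{p(R = 1 \mid L)} \right]
```

The numerator p(l, R = 1) involves only complete cases and is therefore available from the observed data, so the remaining question is whether the complete-case probability in the denominator can itself be expressed in terms of observed data. Whatever expression identifies it is also the expression whose inverse supplies the weights.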

If the joint distribution of the analysis variables is indeed identified, then any estimand pertaining to those variables—including causal estimands such as average treatment effects, but also noncausal estimands such as simple means and proportions—is also identified. Graphical models for missing data remain an active area of research, with recent work considering methods for missing data in the presence of dependent data, such as interference settings.7
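As a small illustration of recovering a noncausal estimand (a simple mean) under systematic selection, the following simulation compares a complete-case mean with an inverse-probability-weighted mean. It is a minimal sketch under strong simplifying assumptions that are not taken from the original: there is a single fully observed binary covariate X, the outcome Y is missing with a probability that depends only on X, and all function names and parameter values are hypothetical.

```python
import random


def simulate(n, seed=0):
    """Simulate (x, y, r) rows; y is None when it is missing (r == 0).

    Assumed data-generating process (illustrative only):
      X ~ Bernoulli(0.5), Y = 2*X + N(0, 1), so the true E[Y] is 1.0.
      P(R = 1 | X = 1) = 0.9 and P(R = 1 | X = 0) = 0.3.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 2.0 * x + rng.gauss(0.0, 1.0)
        p_obs = 0.9 if x == 1 else 0.3
        r = 1 if rng.random() < p_obs else 0
        rows.append((x, y if r == 1 else None, r))
    return rows


def complete_case_mean(rows):
    """Unweighted mean of Y among complete cases (biased in this setup)."""
    ys = [y for _, y, r in rows if r == 1]
    return sum(ys) / len(ys)


def ipw_mean(rows):
    """Inverse-probability-weighted mean of Y among complete cases.

    The complete-case probability P(R = 1 | X = x) is estimated by the
    observed fraction of complete cases within each stratum of X.
    """
    n_x = {0: 0, 1: 0}
    n_obs = {0: 0, 1: 0}
    for x, _, r in rows:
        n_x[x] += 1
        n_obs[x] += r
    p_hat = {x: n_obs[x] / n_x[x] for x in n_x}

    # Normalized (Hajek-style) weighting: divide by the sum of the weights.
    num = sum(y / p_hat[x] for x, y, r in rows if r == 1)
    den = sum(1.0 / p_hat[x] for x, _, r in rows if r == 1)
    return num / den


if __name__ == "__main__":
    data = simulate(20000, seed=0)
    print("complete-case mean:", round(complete_case_mean(data), 3))
    print("IPW mean:          ", round(ipw_mean(data), 3))
```

In this setup the complete-case mean over-represents participants with X = 1 and drifts toward 1.5, while the weighted mean targets the true value of 1.0. The sketch works only because the complete-case probability depends on a fully observed variable; the graphical results discussed above address the harder cases in which it depends on variables that may themselves be missing.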

Systematic selection is a recurring challenge in data analysis and subsumes a number of seemingly disparate areas of research. In particular, transportability/generalizability problems and data fusion problems both feature multiple domains where units may be selected into a particular domain systematically. In transportability/generalizability problems, the aim is to apply findings in one context to another, with appropriate adjustments. In data fusion problems, multiple data sources are combined to address a single question. In addition, classical causal inference in nonrandomized studies and missing data problems feature a systematic selection of the treatment or of a selection indicator. Finally, outcome-dependent sampling and principled analysis of case-control studies must involve appropriate adjustment for “inconvenient” selection.

Causal inference is often viewed as a missing data problem.8 Conversely, causal inference ideas have been used to derive new identification theory in missing data settings.4-6 More broadly, the conceptual connections among the many problems that feature systematic selection have yet to be fully explored. We look forward to further developments along these lines.

Contributor Information

Maya B Mathur, Quantitative Sciences Unit, Department of Medicine, Stanford University, Palo Alto, CA 94304, United States.

Ilya Shpitser, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21287, United States.

Acknowledgments

M.B.M. and I.S. co-wrote the manuscript.

Funding

M.B.M. was supported by National Institutes of Health grants R01 LM013866, UL1TR003142, P30CA124435, and P30DK116074. I.S. was supported by ONR N00014-21-1-2820, NSF 2040804, NSF CAREER 1942239 and NIH R01 AI127271-01A1. The funders had no role in the design, conduct, or reporting of this work.

Conflict of interest

The authors declare no conflicts of interest.

References

  • 1. Lu H, Howe CJ, Zivich PN, et al. The evolution of selection bias in the recent epidemiologic literature—a selective overview. Am J Epidemiol. 2025;194(3):580-584. doi: 10.1093/aje/kwae282
  • 2. Mathur MB, Shpitser I. Simple graphical criteria for selection bias in general population and selected-sample treatment effects. Am J Epidemiol. 2025;194(1):267-277. doi: 10.1093/aje/kwae145
  • 3. Tchetgen Tchetgen EJ. Identification and estimation of survivor average causal effects. Stat Med. 2014;33(21):3601-3628. doi: 10.1002/sim.6181
  • 4. Nabi R, Bhattacharya R, Shpitser I. Full law identification in graphical models of missing data: completeness results. Proc Mach Learn Res. 2020;119:7153-7163.
  • 5. Nabi R, Bhattacharya R, Shpitser I, et al. Causal and counterfactual views of missing data models. 2022. Preprint retrieved from https://arxiv.org/abs/2210.05558
  • 6. Mohan K, Pearl J. Graphical models for processing missing data. J Am Stat Assoc. 2021;116(534):1023-1037. doi: 10.1080/01621459.2021.1874961
  • 7. Srinivasan R, et al. Graphical models of entangled missingness. arXiv preprint arXiv:2304.01953. 2023.
  • 8. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592. doi: 10.2307/2335739

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press
